<a id="readme-top"></a>

<!-- PROJECT LOGO --> <br /> <div align="center"> <h1 align="center">ChemFM: A Foundation Model for Chemical Design and Property Prediction</h1>


<p align="center"> <a href="https://arxiv.org/abs/2410.21422"> <img src="https://info.arxiv.org/brand/images/brand-supergraphic.jpg" alt="arxiv" width="25" height="25" style="vertical-align: middle; margin-right: 0px;"> </a> <a href="https://arxiv.org/abs/2410.21422"> ArXiv </a> | <a href="https://huggingface.co/ChemFM"> <img src="https://huggingface.co/front/assets/huggingface_logo.svg" alt="Hugging Face" width="20" height="20" style="vertical-align: middle; margin-right: 0px;"> </a> <a href="https://huggingface.co/ChemFM"> Hugging Face </a> | <a href="https://discord.gg/4WbHDnPY"> <img src="https://camo.githubusercontent.com/ae76bfbcd3ea4af324682842213b28d9a7ebdd8791d8531d1b7e3b8b4d2a0302/68747470733a2f2f6564656e742e6769746875622e696f2f537570657254696e7949636f6e732f696d616765732f7376672f646973636f72642e737667" alt="Discord" width="25" height="25" style="vertical-align: middle; margin-right: 0px;"> </a> <a href="https://discord.gg/4WbHDnPY"> Discord </a> </p> <p align="center"> <a href="https://github.com/TheLuoFengLab/ChemFM/issues/new?labels=bug&template=bug-report---.md">Report Bug</a> · <a href="https://github.com/TheLuoFengLab/ChemFM/issues/new?labels=enhancement&template=feature-request---.md">Request Feature</a> </p> </div> <!-- TABLE OF CONTENTS --> <details> <summary>Table of Contents</summary> <ol> <li> <a href="#about-the-project">About The Project</a> </li> <li> <a href="#getting-started">Getting Started</a> </li> <li><a href="#usage">Usage</a></li> <li><a href="#roadmap">Roadmap</a></li> <li><a href="#contributing">Contributing</a></li> <li><a href="#contact">Contact</a></li> <li><a href="#citation">Citation</a></li> <li><a href="#acknowledgments">Acknowledgments</a></li> <li><a href="#license">License</a></li> </ol> </details> <!-- ABOUT THE PROJECT -->

About The Project

ChemFM is a large-scale foundation model, specifically designed for chemistry. It has been pre-trained on 178 million molecules from UniChem using self-supervised causal language modeling, enabling the extraction of versatile and generalizable molecular representations.
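To illustrate what self-supervised causal language modeling means for molecules, the sketch below turns a SMILES string into (context, next token) training examples. It is a toy illustration only: the one-character-per-token split and the `<s>` / `</s>` special tokens are assumptions made here for readability, not ChemFM's actual tokenizer or vocabulary.

```python
# Toy illustration of causal language modeling on a SMILES string.
# Assumptions (not from ChemFM): one character = one token, and
# "<s>" / "</s>" as begin/end-of-sequence markers.
BOS, EOS = "<s>", "</s>"


def causal_lm_examples(smiles):
    """Return (context_tokens, next_token) pairs for next-token prediction."""
    tokens = [BOS, *smiles, EOS]
    # At position i the model sees tokens[:i] and must predict tokens[i].
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


examples = causal_lm_examples("CCO")  # ethanol
for context, target in examples:
    print(context, "->", target)
```

No labels are needed beyond the molecule itself, which is why this objective scales to very large unlabeled collections. A real tokenizer would also keep multi-character atoms such as `Cl` or `Br` together, which this character-level split does not.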

The model comes in two variations with approximately 1 billion and 3 billion trainable parameters:

<p align="center"> <img src="images/pretrain.jpg" alt="Pretraining Overview" width="800"> </p>
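As a rough guide to what these two sizes imply for hardware, the following back-of-the-envelope sketch estimates the memory needed to hold the weights alone. The parameter counts are the approximate figures quoted above, and the estimate deliberately ignores activations, gradients, and optimizer state, so actual usage during fine-tuning will be higher.

```python
# Back-of-the-envelope memory estimate for model weights only.
# Assumptions: parameter counts are the approximate 1B / 3B figures;
# 1 GB is taken as 1e9 bytes; activations, gradients, and optimizer
# state are NOT included.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gb(num_params, dtype):
    """Return the memory in GB needed to store num_params weights in dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for label, num_params in {"~1B variant": 1e9, "~3B variant": 3e9}.items():
    for dtype in BYTES_PER_PARAM:
        gb = weight_memory_gb(num_params, dtype)
        print(f"{label} in {dtype}: about {gb:.0f} GB")
```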

The model can be fine-tuned for a wide range of downstream chemical tasks, such as:

<p align="center"> <img src="images/finetune.jpg" alt="Fine-tuning Overview" width="800"> </p> <p align="right">(<a href="#readme-top">back to top</a>)</p> <!-- GETTING STARTED -->

Getting Started

ChemFM has been tested with Python 3.10 and PyTorch 2.3.0. You can easily set up the required environment using Conda by following these steps:

<p align="right">(<a href="#readme-top">back to top</a>)</p> <!-- USAGE EXAMPLES -->

Usage

Quick Start

To get started, load a ChemFM model directly from Hugging Face with the following Python script:

from transformers import AutoModel, AutoTokenizer

# Load the ChemFM-3B model and tokenizer
model_name = "ChemFM/ChemFM-3B"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Pre-training the Model

Pre-training requires significant time and high-performance GPUs due to the scale of both the model and the dataset. For instance, ChemFM-3B took over 20 days on 16 H100 GPUs. For detailed instructions on how to pre-train ChemFM, please refer to the pretraining subfolder.
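To put the ChemFM-3B figure in perspective, the short calculation below converts it into GPU-hours. Because the run took "over 20 days", the result is a lower bound; the helper name `gpu_hours` is introduced here for illustration and assumes all 16 GPUs were in use around the clock.

```python
# Convert the quoted pre-training run into GPU-hours (a lower bound,
# since the run took "over" 20 days). Assumes continuous use of all GPUs.
def gpu_hours(days, num_gpus, hours_per_day=24):
    """Return total GPU-hours for a run of the given length and size."""
    return days * num_gpus * hours_per_day


chemfm_3b_lower_bound = gpu_hours(days=20, num_gpus=16)
print(f"ChemFM-3B pre-training: more than {chemfm_3b_lower_bound:,} H100 GPU-hours")
```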

Fine-tuning the Model

Fine-tuning can typically be performed on a single moderate GPU machine. For detailed instructions on how to fine-tune ChemFM for specific tasks, please refer to the relevant subfolders:

<p align="right">(<a href="#readme-top">back to top</a>)</p> <!-- ROADMAP -->

Roadmap

This GitHub project is still under active development. Below is the current roadmap:

If you'd like to request additional features, please submit a feature request in the GitHub Issues section, or feel free to contact us.

<p align="right">(<a href="#readme-top">back to top</a>)</p> <!-- CONTRIBUTING -->

Contributing

Any contributions you make are greatly appreciated and can include, but are not limited to:

If you have suggestions for improvement, feel free to fork the repository and submit a pull request. You can also open an issue with the "enhancement" tag.

<p align="right">(<a href="#readme-top">back to top</a>)</p> <!-- CONTACT -->

Contact

Main Developer: Feiyang Cai - feiyang@clemson.edu
Project Supervisor: Feng Luo - luofeng@clemson.edu

Join our community on Discord to stay updated or ask questions.

<p align="right">(<a href="#readme-top">back to top</a>)</p>

Citation

If you find our work valuable, please consider giving the project a star and citing it in your research:

@article{ChemFM,
      title={A Foundation Model for Chemical Design and Property Prediction},
      author={Feiyang Cai and Tianyu Zhu and Tzuen-Rong Tzeng and Yongping Duan and Ling Liu and Srikanth Pilla and Gang Li and Feng Luo},
      year={2024},
      journal={arXiv preprint arXiv:2410.21422},
}

Thank you for your support!

<p align="right">(<a href="#readme-top">back to top</a>)</p> <!-- ACKNOWLEDGMENTS -->

Acknowledgments

The pre-training of ChemFM is based on TinyLlama, and the fine-tuning process is supported by Hugging Face.

We would also like to thank Clemson University's Palmetto Cluster team for their invaluable support with cloud computing resources and maintenance.

<p align="right">(<a href="#readme-top">back to top</a>)</p> <!-- LICENSE -->

License

This project is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. For more details, please see the LICENSE file.

<p align="right">(<a href="#readme-top">back to top</a>)</p>