
<div align="center"> <h1 align="center">ChemFM: A Foundation Model for Chemical Design and Property Prediction</h1>

<a href="https://arxiv.org/abs/2410.21422"> <img src="https://info.arxiv.org/brand/images/brand-supergraphic.jpg" alt="arxiv" width="25" height="25" style="vertical-align: middle; margin-right: 0px;"> </a> <a href="https://arxiv.org/abs/2410.21422"> ArXiv </a> | <a href="https://huggingface.co/ChemFM"> <img src="https://huggingface.co/front/assets/huggingface_logo.svg" alt="Hugging Face" width="20" height="20" style="vertical-align: middle; margin-right: 0px;"> </a> <a href="https://huggingface.co/ChemFM"> Hugging Face </a> | <a href="https://discord.gg/4WbHDnPY"> <img src="https://camo.githubusercontent.com/ae76bfbcd3ea4af324682842213b28d9a7ebdd8791d8531d1b7e3b8b4d2a0302/68747470733a2f2f6564656e742e6769746875622e696f2f537570657254696e7949636f6e732f696d616765732f7376672f646973636f72642e737667" alt="Discord" width="25" height="25" style="vertical-align: middle; margin-right: 0px;"> </a> <a href="https://discord.gg/4WbHDnPY"> Discord </a> <a href="https://github.com/TheLuoFengLab/ChemFM/issues/new?labels=bug&template=bug-report---.md">Report Bug</a> · <a href="https://github.com/TheLuoFengLab/ChemFM/issues/new?labels=enhancement&template=feature-request---.md">Request Feature</a> </div>  <details> <summary>Table of Contents</summary> <ol> <li> <a href="#about-the-project">About The Project</a> </li> <li> <a href="#getting-started">Getting Started</a> </li> <li><a href="#usage">Usage</a></li> <li><a href="#roadmap">Roadmap</a></li> <li><a href="#contributing">Contributing</a></li> <li><a href="#contact">Contact</a></li> <li><a href="#citation">Citation</a></li> <li><a href="#acknowledgments">Acknowledgments</a></li> <li><a href="#license">License</a></li> </ol> </details>

About The Project

ChemFM is a large-scale foundation model, specifically designed for chemistry. It has been pre-trained on 178 million molecules from UniChem using self-supervised causal language modeling, enabling the extraction of versatile and generalizable molecular representations.
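
To make the pre-training objective concrete, the following is a minimal sketch of how causal language modeling turns a single SMILES string into next-token training pairs. The character-level `tokenize` function and the `<bos>`/`<eos>` markers are illustrative assumptions only; ChemFM ships its own tokenizer and vocabulary, which may split molecules differently.

```python
# Illustrative only: a character-level tokenizer stands in for ChemFM's real one.
BOS, EOS = "<bos>", "<eos>"  # hypothetical start/end markers


def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into single characters (a simplifying assumption)."""
    return [BOS, *smiles, EOS]


def next_token_pairs(tokens: list[str]) -> list[tuple[list[str], str]]:
    """Pair every prefix of the sequence with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


pairs = next_token_pairs(tokenize("CCO"))  # "CCO" is ethanol
for context, target in pairs:
    print(context, "->", target)
```

Each pair asks the model to predict `target` from `context` alone, so no labels beyond the molecule itself are needed.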

The model comes in two variants, with approximately 1 billion and 3 billion trainable parameters:

ChemFM-1B [<a href="https://huggingface.co/ChemFM/ChemFM-1B"><img src="https://huggingface.co/front/assets/huggingface_logo.svg" alt="Hugging Face" width="20" height="20" style="vertical-align: middle; margin-right: 0px;"> Model Page</a>]
ChemFM-3B [<a href="https://huggingface.co/ChemFM/ChemFM-3B"><img src="https://huggingface.co/front/assets/huggingface_logo.svg" alt="Hugging Face" width="20" height="20" style="vertical-align: middle; margin-right: 0px;"> Model Page</a>]

The model can be fine-tuned for a wide range of downstream chemical tasks, such as:

Molecular property prediction [<a href="https://huggingface.co/spaces/ChemFM/molecular_property_prediction"><img src="https://huggingface.co/front/assets/huggingface_logo.svg" alt="Hugging Face" width="20" height="20" style="vertical-align: middle; margin-right: 0px;"> Demo</a>]
Conditional molecular generation [<a href="https://huggingface.co/spaces/ChemFM/molecular_conditional_generation"><img src="https://huggingface.co/front/assets/huggingface_logo.svg" alt="Hugging Face" width="20" height="20" style="vertical-align: middle; margin-right: 0px;"> Demo</a>] (under construction)
Reaction synthesis and retro-synthesis predictions [<a href="https://huggingface.co/spaces/ChemFM/reaction_prediction"><img src="https://huggingface.co/front/assets/huggingface_logo.svg" alt="Hugging Face" width="20" height="20" style="vertical-align: middle; margin-right: 0px;"> Demo</a>] (under construction)
And more ...

Getting Started

ChemFM has been tested with Python 3.10 and PyTorch 2.3.0. To set up the required environment with Conda, follow these steps:

Clone the repository

git clone https://github.com/TheLuoFengLab/ChemFM.git
cd ChemFM

Create and activate Conda environment

conda env create -f environment.yml 
conda activate ChemFM
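
After activating the environment, a short script like the one below can confirm that the interpreter and PyTorch match the tested versions. This is a sketch rather than part of the repository: the helper names `installed_version` and `matches` are hypothetical, and it assumes PyTorch is installed under the distribution name `torch`.

```python
import sys
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches(found: Optional[str], expected: str) -> bool:
    """True when the found version string starts with the expected version."""
    return found is not None and found.startswith(expected)


python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
torch_version = installed_version("torch")

print(f"Python {python_version} (tested: 3.10)")
print(f"PyTorch {torch_version or 'not installed'} (tested: 2.3.0)")
if not matches(torch_version, "2.3.0"):
    print("Note: PyTorch differs from the tested version.")
```

A prefix comparison is used so that build-tagged versions such as `2.3.0+cu121` still count as a match.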

(<a href="#readme-top">back to top</a>)

Usage

Quick Start

To get started, load a ChemFM model directly from Hugging Face with the following Python script:

from transformers import AutoModel, AutoTokenizer

# Load the ChemFM-3B model and tokenizer
model_name = "ChemFM/ChemFM-3B"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
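
Once a model and tokenizer are loaded, one possible way to obtain a fixed-size molecular representation is to average the final hidden states over the tokens of a SMILES string. The sketch below is an assumption, not the authors' documented method: mean pooling is just one common choice, the helper names `masked_mean` and `embed_smiles` are hypothetical, and the code assumes the tokenizer returns an `attention_mask`.

```python
def masked_mean(token_vectors, mask):
    """Average token vectors where mask is 1, skipping padding positions."""
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("mask selects no tokens")
    return [sum(column) / len(kept) for column in zip(*kept)]


def embed_smiles(smiles, model_name="ChemFM/ChemFM-1B"):
    """Return one pooled vector for a SMILES string (downloads weights on first use)."""
    # Imported lazily so masked_mean can be used without these packages installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(smiles, return_tensors="pt")
    with torch.no_grad():
        # Assumption: the tokenizer output includes an attention mask.
        outputs = model(input_ids=inputs["input_ids"],
                        attention_mask=inputs["attention_mask"])
    hidden = outputs.last_hidden_state[0].tolist()  # shape: (tokens, hidden_size)
    mask = inputs["attention_mask"][0].tolist()
    return masked_mean(hidden, mask)
```

Calling `embed_smiles("CCO")` would return a list of floats whose length equals the model's hidden size. For batches of molecules, pooling directly on tensors would be more efficient than the list-based helper shown here.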

Pre-training the Model

Pre-training requires significant time and high-performance GPUs due to the scale of both the model and the dataset. For instance, ChemFM-3B took over 20 days on 16 H100 GPUs. For detailed instructions on how to pre-train ChemFM, please refer to the pretraining subfolder.
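
As a rough sense of scale, the figures above translate into GPU-hours as follows. This is simple arithmetic on the numbers quoted in the text, treating "over 20 days" as a lower bound; it is not a measured total.

```python
days = 20   # lower bound: the text says "over 20 days"
gpus = 16   # number of H100 GPUs
gpu_hours = days * 24 * gpus

print(f"At least {gpu_hours:,} H100 GPU-hours")  # At least 7,680 H100 GPU-hours
```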

Fine-tuning the Model

Fine-tuning can typically be performed on a single machine with a moderate GPU. For detailed instructions on how to fine-tune ChemFM for specific tasks, please refer to the relevant subfolders:

Molecular property prediction: finetuning/property_prediction
Conditional molecular generation: finetuning/conditional_generation
Reaction synthesis and retro-synthesis predictions: finetuning/reaction_prediction

(<a href="#readme-top">back to top</a>)

Roadmap

This GitHub project is still under active development. Below is the current roadmap:

If you'd like to request additional features, please submit a feature request in the GitHub Issues section, or feel free to contact us.

(<a href="#readme-top">back to top</a>)

Contributing

Any contributions you make are greatly appreciated. They can include, but are not limited to:

New dataset evaluations for existing tasks.
Extensions to new task domains in chemistry.

If you have suggestions for improvement, feel free to fork the repository and submit a pull request. You can also open an issue with the "enhancement" label.

(<a href="#readme-top">back to top</a>)

Contact

Main Developer: Feiyang Cai - feiyang@clemson.edu
Project Supervisor: Feng Luo - luofeng@clemson.edu

Join our community on Discord to stay updated or ask questions.

(<a href="#readme-top">back to top</a>)

Citation

If you find our work valuable, please consider giving the project a star and citing it in your research:

@article{ChemFM,
      title={A Foundation Model for Chemical Design and Property Prediction},
      author={Feiyang Cai and Tianyu Zhu and Tzuen-Rong Tzeng and Yongping Duan and Ling Liu and Srikanth Pilla and Gang Li and Feng Luo},
      year={2024},
      journal={arXiv preprint arXiv:2410.21422},
}

Thank you for your support!

(<a href="#readme-top">back to top</a>)

Acknowledgments

The pre-training of ChemFM is based on TinyLlama, and the fine-tuning process is supported by Hugging Face.

We would also like to thank Clemson University's Palmetto Cluster team for their invaluable support with cloud computing resources and maintenance.

(<a href="#readme-top">back to top</a>)

License

This project is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. For more details, please see the LICENSE file.

(<a href="#readme-top">back to top</a>)