<a id="readme-top"></a>
<!-- PROJECT SHIELDS --> <!-- *** I'm using markdown "reference style" links for readability. *** Reference links are enclosed in brackets [ ] instead of parentheses ( ). *** See the bottom of this document for the declaration of the reference variables *** for contributors-url, forks-url, etc. This is an optional, concise syntax you may use. *** https://www.markdownguide.org/basic-syntax/#reference-style-links --> <!-- PROJECT LOGO --> <br /> <div align="center"> <h1 align="center">ChemFM: A Foundation Model for Chemical Design and Property Prediction</h1> <p align="center"> <a href="https://arxiv.org/abs/2410.21422"> <img src="https://info.arxiv.org/brand/images/brand-supergraphic.jpg" alt="arxiv" width="25" height="25" style="vertical-align: middle; margin-right: 0px;"> </a> <a href="https://arxiv.org/abs/2410.21422"> ArXiv </a> | <a href="https://huggingface.co/ChemFM"> <img src="https://huggingface.co/front/assets/huggingface_logo.svg" alt="Hugging Face" width="20" height="20" style="vertical-align: middle; margin-right: 0px;"> </a> <a href="https://huggingface.co/ChemFM"> Hugging Face </a> | <a href="https://discord.gg/4WbHDnPY"> <img src="https://camo.githubusercontent.com/ae76bfbcd3ea4af324682842213b28d9a7ebdd8791d8531d1b7e3b8b4d2a0302/68747470733a2f2f6564656e742e6769746875622e696f2f537570657254696e7949636f6e732f696d616765732f7376672f646973636f72642e737667" alt="Discord" width="25" height="25" style="vertical-align: middle; margin-right: 0px;"> </a> <a href="https://discord.gg/4WbHDnPY"> Discord </a> </p> <p align="center"> <a href="https://github.com/TheLuoFengLab/ChemFM/issues/new?labels=bug&template=bug-report---.md">Report Bug</a> · <a href="https://github.com/TheLuoFengLab/ChemFM/issues/new?labels=enhancement&template=feature-request---.md">Request Feature</a> </p> </div> <!-- TABLE OF CONTENTS --> <details> <summary>Table of Contents</summary> <ol> <li> <a href="#about-the-project">About The Project</a> </li> <li> <a href="#getting-started">Getting 
Started</a> </li> <li><a href="#usage">Usage</a></li> <li><a href="#roadmap">Roadmap</a></li> <li><a href="#contributing">Contributing</a></li> <li><a href="#contact">Contact</a></li> <li><a href="#citation">Citation</a></li> <li><a href="#acknowledgments">Acknowledgments</a></li> <li><a href="#license">License</a></li> </ol> </details> <!-- ABOUT THE PROJECT -->
About The Project
ChemFM is a large-scale foundation model designed specifically for chemistry. It was pre-trained on 178 million molecules from UniChem using self-supervised causal language modeling, which lets it learn versatile, generalizable molecular representations.
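To make "causal language modeling" concrete: the model reads a molecule as a sequence of tokens and learns to predict each token from the ones before it. The sketch below assumes molecules are written as SMILES strings and uses a toy character-level tokenizer. `toy_tokenize` and `next_token_pairs` are hypothetical helpers for illustration only, not ChemFM's actual tokenizer or training code.

```python
def toy_tokenize(smiles):
    """Split a SMILES string into single characters.

    Illustrative only: a real tokenizer would keep multi-character
    atoms such as 'Cl' or 'Br' together as one token.
    """
    return list(smiles)


def next_token_pairs(tokens, bos="<s>", eos="</s>"):
    """Build (context, target) pairs for next-token prediction.

    Each target is predicted only from the tokens to its left,
    which is what makes the objective causal.
    """
    sequence = [bos] + list(tokens) + [eos]
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


# Ethanol written as SMILES (assumed input format)
pairs = next_token_pairs(toy_tokenize("CCO"))
for context, target in pairs:
    print(context, "->", target)
```

No labels are needed for this objective, since every target comes from the molecule string itself. That is why it counts as self-supervised.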
The model comes in two variations with approximately 1 billion and 3 billion trainable parameters:
- ChemFM-1B [<a href="https://huggingface.co/ChemFM/ChemFM-1B"><img src="https://huggingface.co/front/assets/huggingface_logo.svg" alt="Hugging Face" width="20" height="20" style="vertical-align: middle; margin-right: 0px;"> Model Page</a>]
- ChemFM-3B [<a href="https://huggingface.co/ChemFM/ChemFM-3B"><img src="https://huggingface.co/front/assets/huggingface_logo.svg" alt="Hugging Face" width="20" height="20" style="vertical-align: middle; margin-right: 0px;"> Model Page</a>]
The model can be fine-tuned for a wide range of downstream chemical tasks, such as:
- Molecular property prediction [<a href="https://huggingface.co/spaces/ChemFM/molecular_property_prediction"><img src="https://huggingface.co/front/assets/huggingface_logo.svg" alt="Hugging Face" width="20" height="20" style="vertical-align: middle; margin-right: 0px;"> Demo</a>]
- Conditional molecular generation [<a href="https://huggingface.co/spaces/ChemFM/molecular_conditional_generation"><img src="https://huggingface.co/front/assets/huggingface_logo.svg" alt="Hugging Face" width="20" height="20" style="vertical-align: middle; margin-right: 0px;"> Demo</a>] (under construction)
- Reaction synthesis and retro-synthesis predictions [<a href="https://huggingface.co/spaces/ChemFM/reaction_prediction"><img src="https://huggingface.co/front/assets/huggingface_logo.svg" alt="Hugging Face" width="20" height="20" style="vertical-align: middle; margin-right: 0px;"> Demo</a>] (under construction)
- And more ...
Getting Started
ChemFM has been tested with Python 3.10 and PyTorch 2.3.0. You can easily set up the required environment using Conda by following these steps:
- Clone the repository
git clone https://github.com/TheLuoFengLab/ChemFM.git
cd ChemFM
- Create and activate Conda environment
conda env create -f environment.yml
conda activate ChemFM
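After activating the environment, a quick sanity check can confirm that the interpreter and PyTorch match the tested versions (Python 3.10, PyTorch 2.3.0). The script below is a minimal sketch that uses only the standard library. `parse_version` and `installed_version` are hypothetical helper names, not part of the ChemFM repository.

```python
import sys
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components of a version string,
    e.g. '2.3.0+cu121' -> (2, 3, 0)."""
    core = text.split("+")[0]  # drop a local build tag such as '+cu121'
    parts = []
    for piece in core.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


python_version = sys.version_info[:2]
torch_version = installed_version("torch")

print("Python:", ".".join(map(str, python_version)), "(tested with 3.10)")
if torch_version is None:
    print("PyTorch: not installed in this environment")
else:
    print("PyTorch:", torch_version, "(tested with 2.3.0)")
    if parse_version(torch_version)[:2] != (2, 3):
        print("Note: this differs from the tested PyTorch release.")
```

A version mismatch is not necessarily a problem, since only the versions named above are stated as tested.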
Usage
Quick Start
To get started with ChemFM, you can load the ChemFM models directly from Hugging Face using the following Python script:
from transformers import AutoModel, AutoTokenizer
# Load the ChemFM-3B model and tokenizer
model_name = "ChemFM/ChemFM-3B"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
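Once the model and tokenizer are loaded, one possible next step is to turn a molecule into a fixed-size embedding. The sketch below assumes the input is a SMILES string and that the loaded model exposes `last_hidden_state`, as Hugging Face base models generally do. Mean pooling over tokens is an illustrative choice here, not necessarily how ChemFM's own fine-tuning code derives representations. `mean_pool` and `embed_smiles` are hypothetical helper names.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(column) / len(kept) for column in zip(*kept)]


def embed_smiles(smiles, model_name="ChemFM/ChemFM-1B"):
    """Return one pooled embedding (a list of floats) for a SMILES string.

    Imports are local so that mean_pool can be used without PyTorch
    or transformers installed. The first call downloads the model.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(smiles, return_tensors="pt")
    with torch.no_grad():
        outputs = model(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
        )

    # last_hidden_state has shape (batch, tokens, hidden); take the single item
    hidden = outputs.last_hidden_state[0].tolist()
    mask = inputs["attention_mask"][0].tolist()
    return mean_pool(hidden, mask)


# The pooling step on its own: two real tokens and one padding token
print(mean_pool([[1.0, 3.0], [3.0, 5.0], [9.0, 9.0]], [1, 1, 0]))  # [2.0, 4.0]
```

Calling `embed_smiles("CCO")` would return one vector for ethanol. The 1B checkpoint is used as the default only because it is the smaller download.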
Pre-training the Model
Pre-training requires significant time and high-performance GPUs due to the scale of both the model and the dataset. For instance, pre-training ChemFM-3B took over 20 days on 16 H100 GPUs. For detailed instructions on how to pre-train ChemFM, please refer to the pretraining subfolder.
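To put that figure in a unit that is easier to compare against a compute budget, the quoted run can be converted to GPU-hours. The snippet is a small arithmetic sketch. It assumes all 16 GPUs were busy around the clock, so the result is a lower bound on a duration stated as "over 20 days". `gpu_hours` is a hypothetical helper.

```python
def gpu_hours(days, num_gpus, hours_per_day=24):
    """Total GPU-hours for a run that keeps num_gpus busy for the given days."""
    return days * hours_per_day * num_gpus


# Figures quoted above for ChemFM-3B: more than 20 days on 16 H100 GPUs
lower_bound = gpu_hours(days=20, num_gpus=16)
print(f"ChemFM-3B pre-training: more than {lower_bound:,} H100 GPU-hours")
```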
Fine-tuning the Model
Fine-tuning can typically be performed on a single machine with a moderately powerful GPU. For detailed instructions on how to fine-tune ChemFM for specific tasks, please refer to the relevant subfolders:
- Molecular property prediction: finetuning/property_prediction
- Conditional molecular generation: finetuning/conditional_generation
- Reaction synthesis and retro-synthesis predictions: finetuning/reaction_prediction
Roadmap
This GitHub project is still under active development. Below is the current roadmap:
If you'd like to request additional features, please submit a feature request in the GitHub Issues section, or feel free to contact us.
<p align="right">(<a href="#readme-top">back to top</a>)</p> <!-- CONTRIBUTING -->
Contributing
Any contributions you make are greatly appreciated. They can include, but are not limited to:
- New dataset evaluations for existing tasks.
- Extensions to new task domains in chemistry.
If you have suggestions for improvement, feel free to fork the repository and submit a pull request. You can also open an issue with the "enhancement" tag.
<p align="right">(<a href="#readme-top">back to top</a>)</p> <!-- CONTACT -->
Contact
Main Developer: Feiyang Cai - feiyang@clemson.edu
Project Supervisor: Feng Luo - luofeng@clemson.edu
Join our community on Discord to stay updated or ask questions.
<p align="right">(<a href="#readme-top">back to top</a>)</p>
Citation
If you find our work valuable, please consider giving the project a star and citing it in your research:
@article{ChemFM,
  title   = {A Foundation Model for Chemical Design and Property Prediction},
  author  = {Feiyang Cai and Tianyu Zhu and Tzuen-Rong Tzeng and Yongping Duan and Ling Liu and Srikanth Pilla and Gang Li and Feng Luo},
  year    = {2024},
  journal = {arXiv preprint arXiv:2410.21422},
}
Thank you for your support!
<p align="right">(<a href="#readme-top">back to top</a>)</p> <!-- ACKNOWLEDGMENTS -->
Acknowledgments
The pre-training of ChemFM is based on TinyLlama, and the fine-tuning process is supported by Hugging Face.
We would also like to thank Clemson University's Palmetto Cluster team for their invaluable support with cloud computing resources and maintenance.
<p align="right">(<a href="#readme-top">back to top</a>)</p> <!-- LICENSE -->
License
This project is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. For more details, please see the LICENSE file.
<p align="right">(<a href="#readme-top">back to top</a>)</p>