Awesome
CACTUS šµ | Chemistry Agent Connecting Tool Usage to Science
Introduction
CACTUS is an innovative tool-augmented language model designed to assist researchers and chemists in various chemistry-related tasks. By integrating state-of-the-art language models with a suite of powerful cheminformatics tools, CACTUS provides an intelligent and efficient solution for exploring chemical space, predicting molecular properties, and accelerating drug discovery. Just as the cactus thrives in the harsh desert environment, adapting to limited resources and extreme conditions, CACTUS has been implemented by Pacific Northwest National Laboratory (PNNL) Scientists to navigate the complex landscape of chemical data and extract valuable insights.
<img width="1000" alt="Cactus_header" src="assets/workflow_diagram_V2_white_bkg.png">Preprint Available here
Demo (API-only) on HuggingFace Spaces here
Running Cactus š
Getting started with Cactus is as simple as:
from cactus.agent import Cactus
Model = Cactus(model_name="google/gemma7b", model_type="vllm")
Model.run("What is the molecular weight of the smiles: OCC1OC(O)C(C(C1O)O)O")
Installation š»
To install cactus
:
pip install git+https://github.com/pnnl/cactus.git
The default PyTorch
version is compiled for cuda
12.1 (or cpu for non-cuda systems). If you want to install for an older version of cuda
, you should install from source and edit the pyproject.toml
file at the [[tool.rye.sources]]
section before installing. But be aware vllm
may not work properly for older versions of PyTorch
.
Note: cactus
currently only supports Python versions 3.10
-3.12
. Ensure you are using one of these versions before installation.
Alternatively for development, you can install in an editable configuration using:
git clone https://github.com/pnnl/cactus.git
cd cactus
python -m pip install -e .
or install using rye
by running:
git clone https://github.com/pnnl/cactus.git
cd cactus
rye sync
Benchmarking š
We provide scripts for generating lists of benchmarking questions to evaluate the performance of the CACTUS agent.
These scripts are located in the benchmark
directory.
To build the dataset used in the paper, we can run:
python benchmark_creation.py
This will generate a readable dataset named QuestionsChem.csv
for use with the Cactus
agent.
Models Tested
For this application we are benchmarking the following models:
Model | model_name |
---|---|
llama2-7b | meta-llama/Llama-2-7b-hf |
llama3-8b | meta-llama/Meta-Llama-3-8B |
mistral-7b | mistralai/Mistral-7B-v0.1 |
gemma-7b | google/gemma-7b-it |
falcon-7b | tiiuae/falcon-7b |
MPT-7b | mosaicml/mpt-7b |
Phi-2 | microsoft/phi-2 |
Phi-3 | microsoft/Phi-3-mini-4k-instruct |
OLMo-1b | allenai/OLMo-1B |
These models were selected based on their strong performance in natural language tasks and their potential for adaptation to domain-specific applications.
Tools Available
For the initial release, we have simple cheminformatics tools available:
Tool Name | Tool Usage |
---|---|
calculate_molwt | Calculate Molecular weight |
calculate_logp | Calculate the Partition Coefficient |
calculate_tpsa | Calculate the Topological Polar Surface Area |
calculate_qed | Calculate the Qualitative Estimate of Drug-likeness |
calculate_sa | Calculate the Synthetic Accessibility |
calculate_bbb_permeant | Calculate Blood Brain Barrier Permeance |
calculate_gi_absorption | Calculate the Gastrointestinal Absorption |
calculate_druglikeness | Calculate druglikeness based on Lipinski's Rule of 5 |
brenk_filter | Calculate if molecule passes the Brenk Filter |
pains_filter | Calculate if molecule passes the PAINS Filter |
ā ļø Notice: These tools currently expect a SMILES as input, tools for conversion between identifiers are available but not yet working as intended. Fix to come soon.
Future Directions
We are continuously working on improving CACTUS and expanding its capabilities for molecular discovery. Some of our planned features include:
š§¬ Integration with physics-based models for 3D structure prediction and analysis
š§ Support for advanced machine learning techniques (e.g., graph neural networks)
šÆ Enhanced tools for target identification and virtual screening
š Improved interpretability and explainability of the model's reasoning process
We welcome contributions from the community and are excited to collaborate with researchers and developers to further advance the field of AI-driven drug discovery.
Citation
If you use CACTUS in your research, please cite our preprint:
@article{mcnaughton2024cactus,
title={CACTUS: Chemistry Agent Connecting Tool-Usage to Science},
author={Andrew D. McNaughton and Gautham Ramalaxmi and Agustin Kruel and Carter R. Knutson and Rohith A. Varikoti and Neeraj Kumar},
year={2024},
eprint={2405.00972},
archivePrefix={arXiv},
primaryClass={cs.CL}
}