CACTUS 🌵 | Chemistry Agent Connecting Tool Usage to Science

Introduction

CACTUS is a tool-augmented language model designed to assist researchers and chemists with chemistry-related tasks. By pairing state-of-the-art language models with a suite of cheminformatics tools, CACTUS helps users explore chemical space, predict molecular properties, and accelerate drug discovery. Just as the cactus thrives in the harsh desert environment, adapting to limited resources and extreme conditions, CACTUS was built by scientists at Pacific Northwest National Laboratory (PNNL) to navigate the complex landscape of chemical data and extract valuable insights.

<img width="1000" alt="Cactus_header" src="assets/workflow_diagram_V2_white_bkg.png">

Preprint available on arXiv: 2405.00972

Demo (API-only) available on Hugging Face Spaces

Running Cactus 🏃

Getting started with Cactus is as simple as:

from cactus.agent import Cactus

model = Cactus(model_name="google/gemma-7b", model_type="vllm")
model.run("What is the molecular weight of the SMILES: OCC1OC(O)C(C(C1O)O)O")
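
As a sanity check on what the agent should return for this query, the expected value can be worked out by hand. The SMILES above encodes a glucose ring; counting the atoms in the string and filling in implicit hydrogens gives the formula C6H12O6. The sketch below does not use CACTUS: the atomic weights are rounded standard values, the formula is our own reading of the SMILES, and `molecular_weight` is a helper name introduced only for illustration.

```python
# Rounded standard atomic weights in g/mol (assumed values, not taken from CACTUS).
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "O": 15.999}

# Atom counts read off the SMILES OCC1OC(O)C(C(C1O)O)O, implicit hydrogens included.
GLUCOSE = {"C": 6, "H": 12, "O": 6}


def molecular_weight(formula):
    """Sum atomic weights over an {element: count} mapping."""
    return sum(ATOMIC_WEIGHT[element] * count for element, count in formula.items())


print(f"{molecular_weight(GLUCOSE):.3f} g/mol")  # prints "180.156 g/mol"
```

An answer from the agent close to 180.16 g/mol suggests the molecular-weight tool was selected and called correctly.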

Installation 💻

To install cactus:

pip install git+https://github.com/pnnl/cactus.git

The default PyTorch build targets CUDA 12.1 (or CPU on systems without CUDA). To target an older CUDA version, install from source and edit the [[tool.rye.sources]] section of pyproject.toml before installing. Be aware that vllm may not work properly with older versions of PyTorch.

Note: cactus currently only supports Python versions 3.10-3.12. Ensure you are using one of these versions before installation.
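
A quick way to confirm that the interpreter falls in the supported range before installing is sketched below. The helper name `is_supported` is ours, and the bounds simply restate the 3.10-3.12 range from the note above.

```python
import sys

# Supported range restated from the note above (inclusive on both ends).
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 12)


def is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) pair lies within the supported range."""
    major_minor = tuple(version_info[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if is_supported() else "not supported"
    print(f"Python {current} is {status} by cactus")
```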

Alternatively, for development, you can install in editable mode using:

git clone https://github.com/pnnl/cactus.git
cd cactus
python -m pip install -e .

or install using rye by running:

git clone https://github.com/pnnl/cactus.git
cd cactus
rye sync

Benchmarking 📊

We provide scripts for generating lists of benchmarking questions to evaluate the performance of the CACTUS agent.

These scripts are located in the benchmark directory.

To build the dataset used in the paper, we can run:

python benchmark_creation.py

This will generate a readable dataset named QuestionsChem.csv for use with the Cactus agent.
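
Once the file exists, it can be inspected with the standard library before any questions are handed to the agent. This is a sketch only: the column layout of QuestionsChem.csv is not described here, so the `question` column in the stand-in data is hypothetical, and `read_benchmark_rows` is our own helper name.

```python
import csv
import io


def read_benchmark_rows(handle):
    """Return (column_names, rows) from an open CSV file handle."""
    reader = csv.DictReader(handle)
    rows = list(reader)  # reading the rows also populates reader.fieldnames
    return list(reader.fieldnames or []), rows


# Stand-in data with a hypothetical "question" column; the real file may differ.
sample = io.StringIO(
    "question\n"
    "What is the molecular weight of the smiles: OCC1OC(O)C(C(C1O)O)O\n"
)
columns, rows = read_benchmark_rows(sample)
print(columns)    # column names found in the header row
print(len(rows))  # number of benchmark rows

# For the generated file, open it the same way:
# with open("QuestionsChem.csv", newline="") as handle:
#     columns, rows = read_benchmark_rows(handle)
```

Printing the header first makes it easy to see which column actually holds the question text before writing a loop around it.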

Models Tested

For this application we are benchmarking the following models:

| Model | model_name |
| --- | --- |
| llama2-7b | meta-llama/Llama-2-7b-hf |
| llama3-8b | meta-llama/Meta-Llama-3-8B |
| mistral-7b | mistralai/Mistral-7B-v0.1 |
| gemma-7b | google/gemma-7b-it |
| falcon-7b | tiiuae/falcon-7b |
| MPT-7b | mosaicml/mpt-7b |
| Phi-2 | microsoft/phi-2 |
| Phi-3 | microsoft/Phi-3-mini-4k-instruct |
| OLMo-1b | allenai/OLMo-1B |

These models were selected based on their strong performance in natural language tasks and their potential for adaptation to domain-specific applications.
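
One way to ask the same question of several of these models is sketched below. The mapping restates the table above; `run_on_all_models` and its `make_agent` parameter are hypothetical names of ours, and the sketch assumes only that an agent object exposes a `run(question)` method as in the quick-start example. Whether every listed model works with `model_type="vllm"` is not something this sketch verifies.

```python
# Short names and model_name values restated from the table above.
MODEL_NAMES = {
    "llama2-7b": "meta-llama/Llama-2-7b-hf",
    "llama3-8b": "meta-llama/Meta-Llama-3-8B",
    "mistral-7b": "mistralai/Mistral-7B-v0.1",
    "gemma-7b": "google/gemma-7b-it",
    "falcon-7b": "tiiuae/falcon-7b",
    "MPT-7b": "mosaicml/mpt-7b",
    "Phi-2": "microsoft/phi-2",
    "Phi-3": "microsoft/Phi-3-mini-4k-instruct",
    "OLMo-1b": "allenai/OLMo-1B",
}


def run_on_all_models(question, make_agent, aliases=None):
    """Ask `question` of each selected model and return {alias: answer}.

    `make_agent` is any callable that takes a model_name string and returns
    an object with a run(question) method.
    """
    selected = list(MODEL_NAMES) if aliases is None else aliases
    answers = {}
    for alias in selected:
        agent = make_agent(MODEL_NAMES[alias])
        answers[alias] = agent.run(question)
    return answers


# With cactus installed, an agent factory could follow the quick-start call:
# from cactus.agent import Cactus
# make_agent = lambda name: Cactus(model_name=name, model_type="vllm")
```

Passing the factory in as an argument keeps the loop testable without loading any weights. Loading the models one after another is slow and memory-hungry, so restricting `aliases` to a few entries is usually more practical.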

Tools Available

For the initial release, we have simple cheminformatics tools available:

| Tool Name | Tool Usage |
| --- | --- |
| calculate_molwt | Calculate the Molecular Weight |
| calculate_logp | Calculate the Partition Coefficient |
| calculate_tpsa | Calculate the Topological Polar Surface Area |
| calculate_qed | Calculate the Quantitative Estimate of Drug-likeness |
| calculate_sa | Calculate the Synthetic Accessibility |
| calculate_bbb_permeant | Calculate Blood Brain Barrier Permeance |
| calculate_gi_absorption | Calculate the Gastrointestinal Absorption |
| calculate_druglikeness | Calculate druglikeness based on Lipinski's Rule of 5 |
| brenk_filter | Calculate if molecule passes the Brenk Filter |
| pains_filter | Calculate if molecule passes the PAINS Filter |
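
The calculate_druglikeness entry above refers to Lipinski's Rule of 5, which is simple enough to sketch directly. This is not the CACTUS implementation: it takes precomputed descriptor values rather than a SMILES string, all names are ours, and allowing at most one violation is a common convention that we assume here. The thresholds are the standard ones: molecular weight of at most 500 Da, logP of at most 5, at most 5 hydrogen-bond donors, and at most 10 hydrogen-bond acceptors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed properties for one molecule (hypothetical container)."""
    mol_weight: float       # g/mol
    logp: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d):
    """List which Rule of 5 thresholds the molecule exceeds."""
    checks = {
        "molecular weight > 500": d.mol_weight > 500,
        "logP > 5": d.logp > 5,
        "H-bond donors > 5": d.h_bond_donors > 5,
        "H-bond acceptors > 10": d.h_bond_acceptors > 10,
    }
    return [name for name, violated in checks.items() if violated]


def passes_rule_of_five(d, max_violations=1):
    """Assumed convention: a molecule passes with at most one violation."""
    return len(lipinski_violations(d)) <= max_violations


# Illustrative values only, not output from CACTUS or from any real molecule.
small = Descriptors(mol_weight=320.4, logp=2.1, h_bond_donors=2, h_bond_acceptors=5)
large = Descriptors(mol_weight=650.0, logp=6.2, h_bond_donors=2, h_bond_acceptors=8)

print(passes_rule_of_five(small))  # True
print(lipinski_violations(large))  # ['molecular weight > 500', 'logP > 5']
```

Returning the list of violated thresholds, rather than a bare boolean, makes it easier to see why a molecule was flagged.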

⚠️ Notice: These tools currently expect a SMILES string as input. Tools for converting between identifiers are available but not yet working as intended; a fix is coming soon.

Future Directions

We are continuously working on improving CACTUS and expanding its capabilities for molecular discovery. Some of our planned features include:

🧬 Integration with physics-based models for 3D structure prediction and analysis
🔧 Support for advanced machine learning techniques (e.g., graph neural networks)
🎯 Enhanced tools for target identification and virtual screening
📜 Improved interpretability and explainability of the model's reasoning process

We welcome contributions from the community and are excited to collaborate with researchers and developers to further advance the field of AI-driven drug discovery.

Citation

If you use CACTUS in your research, please cite our preprint:

@article{mcnaughton2024cactus,
    title={CACTUS: Chemistry Agent Connecting Tool-Usage to Science},
    author={Andrew D. McNaughton and Gautham Ramalaxmi and Agustin Kruel and Carter R. Knutson and Rohith A. Varikoti and Neeraj Kumar},
    year={2024},
    eprint={2405.00972},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}