
LLamol

<p align="center"> <img src="assets/llamol.png" width="300" height="300" alt="LLamol"> </p>

This is the official repository for the paper "LLamol: A Dynamic Multi-Conditional Generative Transformer for De Novo Molecular Design". It contains the weights for LLamol (out/llama2-M-Full-RSS-Canonical-Canonical.pt) and the OrganiX13 dataset.

Image made with Hotpot.ai

Installation

Install the environment with Micromamba for a fast setup: https://mamba.readthedocs.io/en/latest/micromamba-installation.html

$ "${SHELL}" <(curl -L micro.mamba.pm/install.sh)
$ micromamba env create -f torch2-env.yaml
$ micromamba activate torch2-llamol
$ python sample.py
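
To quickly verify the environment (a minimal sketch; it only assumes that PyTorch from torch2-env.yaml is installed), you can check the PyTorch version and whether a CUDA GPU is visible:

```python
# Quick environment check: print the PyTorch version and GPU visibility.
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```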

Download and preprocess the OrganiX13 dataset:

If you want to train with the full 12.5 million molecule dataset, follow the steps below. They are not necessary if you only want to use the model for inference:

  1. Download and preprocess the OPV dataset by running /data/opv/prepare_opv.py

  2. Download and preprocess the ZINC dataset by running /data/zinc/zinc_complete/run_download.py followed by /data/zinc/convert_to_parquet.py (we recommend at least 16GB RAM for this)

  3. Download and preprocess the QM9, ZINC250k and CEP datasets by running /data/qm9_zinc250k_cep/convert_to_parquet.py

  4. Run data/combine_all.py to combine the datasets into data/OrganiX13.parquet (this can take a while, especially for the ZINC data; in total it took ~2 hours on a laptop with 16 GB RAM and an Intel i7 10th Gen). A quick way to inspect the resulting file is sketched after these steps.

  5. Run preprocess_dataset.py, which should create the file .cache/processed_dataset_None.pkl

You can then use this file for training by specifying its path in the processed_dataset_ckpt field of the training .yaml files.
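
To sanity-check the combined data/OrganiX13.parquet from step 4 before preprocessing, a quick look with pandas works (a minimal sketch; it assumes pandas is installed and that the SMILES end up in the smiles column, as described under "Using your own dataset" below):

```python
# Minimal sketch for inspecting the combined dataset.
import pandas as pd

df = pd.read_parquet("data/OrganiX13.parquet")
print(df.shape)                 # expect roughly 12.5 million rows
print(df.columns.tolist())      # check which condition columns are present
print(df["smiles"].head())      # SMILES are expected in the "smiles" column
```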

Interactive Demo

After installation you can experiment with the model using the demonstrator.ipynb notebook. Run all cells and scroll down to the last one; after a short time a UI should appear that lets you interact with the model.

Training

First, activate the environment:

$ conda activate torch2-llamol # When installed with conda instead of micromamba
OR
$ micromamba activate torch2-llamol

To train locally you can run:

# To set the config that you want to train with
$ python train.py train=llama2-M-Full-RSS-Canonical

Parameters can also be overridden, for example:

$ python train.py train=llama2-M-Full-RSS-Canonical train.model.dim=1024

For more information, see the Hydra documentation.

To start a job on a SLURM cluster use the following script:

$ sbatch trainLLamaMol.sh 

Training Multi-GPU on 1 node (the number of GPUs is set via nproc_per_node)

torchrun --standalone --max_restarts=3 --nnodes=1 --nproc_per_node=2 --rdzv-backend=c10d --rdzv-endpoint="localhost:12345" train.py train=llama2-M-Full-RSS-Canonical > "train_runs/run_MultiGPU.out"

Training Multi-GPU on 1 node with multiple GPUs on a cluster

Currently there is only one script for DDP training. To change the number of GPUs, you have to edit the bash script itself. TODO: make this more dynamic by allowing the number of GPUs to be set from the command line.

sbatch trainLLamaMolDDPSingleNode.sh

Sampling

Sampling can be adjusted with the optional parameters shown below.

$ python sample.py --help

$ python sample.py --num_samples 2000 --ckpt_path "out/llama2-M-Full-RSS-Canonical.pt"  --max_new_tokens 256 --cmp_dataset_path="data/OrganiX13.parquet" --seed 4312 --context_cols logp sascore mol_weight --temperature 0.8
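
To get a feel for how well generated molecules match the requested conditions, the underlying properties can be recomputed with RDKit (a hedged sketch; it assumes RDKit is available in the environment and that logp and mol_weight correspond to RDKit's MolLogP and MolWt descriptors; the SMILES string is a placeholder):

```python
# Hedged sketch: recompute two of the context properties for a generated SMILES.
from rdkit import Chem
from rdkit.Chem import Descriptors

smiles = "CCOC(=O)c1ccccc1"  # placeholder for a generated molecule
mol = Chem.MolFromSmiles(smiles)
if mol is not None:
    print("logp:      ", Descriptors.MolLogP(mol))
    print("mol_weight:", Descriptors.MolWt(mol))
# sascore is typically computed with RDKit's contrib sascorer module.
```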

Using your own dataset

Use the preprocess_dataset.py file to tokenize the dataset. The dataset should be in parquet or CSV format, and the SMILES used for training should be in the smiles column. All conditions should be given to the pretokenize function. After preprocessing, a file named processed_dataset_{limit}.pkl is stored in the .cache directory. You can rename this file so it is not overwritten the next time you run the preprocessing.
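
As a sketch, a custom dataset could be assembled like this before running preprocess_dataset.py (assumes pandas; the file name and the condition columns other than smiles are illustrative placeholders, not required names):

```python
# Minimal sketch: build a dataset for preprocess_dataset.py.
# The SMILES must be in the "smiles" column; the other columns are
# illustrative condition columns with made-up values.
import pandas as pd

df = pd.DataFrame({
    "smiles": ["CCO", "c1ccccc1", "CC(=O)O"],
    "logp": [-0.3, 2.1, -0.2],            # example condition values
    "mol_weight": [46.07, 78.11, 60.05],  # example condition values
})
df.to_parquet("data/my_dataset.parquet", index=False)
```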

The resulting .cache/processed_dataset_{limit}.pkl can then be set in the processed_dataset_ckpt field of config/train/llama2-M-Full-RSS-Canonical.yaml to train on the new dataset.

Training methods

The training method we used and describe in the paper is called RSS, for "Random SMILES Sampling". As part of the "Stochastic Context Learning" procedure, a random subsequence of the current SMILES is taken during training and fed into the model as the token sequence condition. The model used in the paper is therefore out/llama2-M-Full-RSS-Canonical.pt.

We also tried other approaches for including the token sequence. One was using Murcko scaffolds, as in the MolGPT paper, but this did not yield good results for our purposes. The other was BRICS decomposition, which also did not work well.

The different methods are implemented in the fragment_creator.py file. Each model was trained with its respective configuration from the config/train folder.
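
For illustration, the core of the RSS idea can be sketched in a few lines (hypothetical names; the real implementations live in fragment_creator.py):

```python
# Illustrative sketch of Random SMILES Sampling (RSS): take a random
# contiguous subsequence of a training SMILES and use it as the token
# sequence condition.
import random

def random_smiles_fragment(smiles: str, min_len: int = 3) -> str:
    """Return a random contiguous substring of the SMILES string."""
    if len(smiles) <= min_len:
        return smiles
    start = random.randint(0, len(smiles) - min_len)
    end = random.randint(start + min_len, len(smiles))
    return smiles[start:end]

print(random_smiles_fragment("CC(=O)Oc1ccccc1C(=O)O"))
```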

Thanks

Funding disclaimer

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement no. 875489.

This website reflects only the author’s view. The funding agency is not responsible for any use made of the information it contains.

License

<p xmlns:cc="http://creativecommons.org/ns#" xmlns:dct="http://purl.org/dc/terms/"><span property="dct:title">LLamol is licensed under <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">CC BY-NC-SA 4.0<img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/nc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/sa.svg?ref=chooser-v1"></a></p>