CodonBERT

Repository containing data, code and walkthrough for methods in the paper CodonBERT: large language models for mRNA design and optimization.

Environment Setup

Dependencies are managed with Poetry.

pip install poetry
poetry install

If you plan to use a GPU, make sure CUDA drivers are installed.
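After installation, a quick check like the one below can confirm whether PyTorch sees a GPU. This is a minimal sketch: it assumes PyTorch is among the installed dependencies (the model is distributed as a PyTorch artifact), and `cuda_status` is a hypothetical helper name, not part of this repository.

```python
import importlib.util


def cuda_status() -> str:
    """Report whether PyTorch is installed and whether it can see a CUDA device."""
    # find_spec returns None when the package is not importable.
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch

    return "cuda available" if torch.cuda.is_available() else "cpu only"


if __name__ == "__main__":
    print(cuda_status())
```

Running it inside the Poetry environment (for example with `poetry run python`) checks the same interpreter the scripts will use.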

CodonBERT

The CodonBERT PyTorch model can be downloaded here. The artifact is under a license. The code and repository are under a software license.

Pretraining and finetuning scripts are under benchmarks/CodonBERT.

To extract embeddings from the model, use extract_embed.py. The dataset sample.fasta is included for reference.
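To get a feel for the input side, the sketch below parses FASTA records and splits each sequence into non-overlapping codons, the unit the paper describes the model as operating on. It does not reproduce what extract_embed.py does internally; `read_fasta` and `to_codons` are hypothetical helper names, and the example sequence is made up for illustration.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {record_id: sequence}."""
    records: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            header = line[1:].split()
            current = header[0] if header else ""
            records[current] = []
        elif current is not None:
            # Sequence lines may wrap; collect them and join at the end.
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def to_codons(seq: str) -> List[str]:
    """Split a coding sequence into non-overlapping 3-nucleotide tokens."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


if __name__ == "__main__":
    example = [">demo illustrative record", "AUGGCC", "UAA"]
    for name, seq in read_fasta(example).items():
        print(name, to_codons(seq))
```

Raising on lengths that are not a multiple of three is a design choice for this sketch; a real pipeline might instead trim or skip such records.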

Dataset

Pre-training datasets are under benchmarks/CodonBERT/data/pre-train. train_seqs_id_1.csv.zip and train_seqs_id_2.csv.zip list the NCBI IDs for all 10 million pre-training sequences. train_samples.csv provides a sample of the training data. eval.csv stores the held-out dataset for evaluation.
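The zipped ID lists can be read without unpacking them to disk. The following is a minimal sketch using only the standard library; it assumes each archive contains one or more plain CSV members encoded as UTF-8, and it makes no assumption about column names. `iter_zipped_csv_rows` is a hypothetical helper name.

```python
import csv
import io
import zipfile
from typing import Iterator, List


def iter_zipped_csv_rows(zip_path: str) -> Iterator[List[str]]:
    """Yield rows from every .csv member of a zip archive, streaming from the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if not member.endswith(".csv"):
                continue
            with archive.open(member) as raw:
                # ZipFile.open yields bytes; wrap it so csv.reader sees text.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                yield from csv.reader(text)
```

For example, `next(iter_zipped_csv_rows("benchmarks/CodonBERT/data/pre-train/train_seqs_id_1.csv.zip"))` would return the first row of the first ID list, which is a convenient way to inspect its layout before loading all of it.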

To run finetuning, the --task flag must be used. All downstream datasets are under benchmarks/CodonBERT/data/fine-tune. As part of the release, we are sharing an internal dataset. The published datasets used for benchmarking in the paper are also included.

TextCNN

Code for training the TextCNN model is in the textcnn directory. Edit main.py to point to the desired embeddings and run python main.py to train the model.

Example Notebooks

The notebooks folder contains walkthrough Jupyter notebooks for benchmarking the TF-IDF model as well as the TextCNN model with a pre-trained word2vec embedding representation. These use datamodel_mRFP as a test dataset.
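For readers unfamiliar with the TF-IDF baseline, the sketch below shows the textbook weighting applied to codon tokens, in pure Python. It is not the notebooks' implementation: it uses the unsmoothed form tf × log(N / df), which differs from the smoothed variants used by common libraries, and `tfidf` is a hypothetical helper name.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Compute unsmoothed TF-IDF weights for each tokenized sequence."""
    n_docs = len(docs)
    # Document frequency: the number of sequences containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append(
            {
                tok: (count / total) * math.log(n_docs / df[tok])
                for tok, count in counts.items()
            }
        )
    return weights


if __name__ == "__main__":
    # Two made-up codon sequences, for illustration only.
    seqs = [["AUG", "GCC", "GCC"], ["AUG", "UAA"]]
    for vector in tfidf(seqs):
        print(vector)
```

A codon present in every sequence (here `AUG`) receives weight zero, which is why TF-IDF features emphasize codon usage that distinguishes one sequence from the rest.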

Citations

If you find the model useful in your research, please cite our paper:

@article {Li2023.09.09.556981,
	author = {Sizhen Li and Saeed Moayedpour and Ruijiang Li and Michael Bailey and Saleh Riahi and Milad Miladi and Jacob Miner and Dinghai Zheng and Jun Wang and Akshay Balsubramani and Khang Tran and Minnie Zacharia and Monica Wu and Xiaobo Gu and Ryan Clinton and Carla Asquith and Joseph Skalesk and Lianne Boeglin and Sudha Chivukula and Anusha Dias and Fernando Ulloa Montoya and Vikram Agarwal and Ziv Bar-Joseph and Sven Jager},
	title = {CodonBERT: Large Language Models for mRNA design and optimization},
	elocation-id = {2023.09.09.556981},
	year = {2023},
	doi = {10.1101/2023.09.09.556981},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2023/09/12/2023.09.09.556981},
	eprint = {https://www.biorxiv.org/content/early/2023/09/12/2023.09.09.556981.full.pdf},
	journal = {bioRxiv}
}