GexMolGen 🐰

Have you ever thought about designing personalized drugs based on your own genes? Sounds fascinating, doesn't it? This is what GexMolGen aims for!

Introduction

GexMolGen is a model for generating hit-like molecules based on gene expression signatures. The workflow of GexMolGen is shown below:

[Figure: Overview]

We divide the task of generating hit-like molecules from gene expression profiles into four steps:

1. Encoding of gene expression and small molecular data
2. Matching of genetic modality and small molecular modality
3. Transformation from genetic modality to small molecular modality
4. Generation of small molecules

To simplify the process, we use pre-trained models as the encoders in step 1, namely scGPT and hierVAE. Step 2 aligns the genetic and molecular modalities, while step 3 transforms genetic embeddings into molecular ones. These stages are inspired by DALL·E: simple yet effective!

GexMolGen is an attempt to explore the chemical and biological relationships in the drug discovery process using large language models and multimodal techniques. It generates molecules effectively, accepts flexible input, and offers strong controllability. For further details, please refer to our paper, GexMolGen: Cross-modal Generation of Hit-like Molecules via Large Language Model Encoding of Gene Expression Signatures.

Installation

Before running pip install -r requirements.txt, we strongly advise that you install RDKit, FlashAttention, and PyTorch individually on your device. Here is the configuration from our device for reference:

CUDA == 11.7
Python == 3.8
rdkit == 2023.3.2
flash-attn == 1.0.1
torch == 1.13.0+cu117
gradio == 3.40.1

Please be mindful of version compatibility during your actual setup.
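
To compare your environment against the reference configuration above, a small check like the following may help. This is a minimal sketch, not part of the GexMolGen code base: the helper names are hypothetical, and the distribution names (for example rdkit or flash-attn) are assumptions that may differ depending on how each package was installed.

```python
from importlib import metadata

# Versions listed in this README for the authors' machine (reference only).
# The distribution names used as keys are assumptions.
REFERENCE_VERSIONS = {
    "rdkit": "2023.3.2",
    "flash-attn": "1.0.1",
    "torch": "1.13.0+cu117",
    "gradio": "3.40.1",
}


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def compare_environment(reference=REFERENCE_VERSIONS):
    """Map each package name to (reference version, installed version or None)."""
    return {
        name: (wanted, installed_version(name))
        for name, wanted in reference.items()
    }


if __name__ == "__main__":
    for name, (wanted, found) in compare_environment().items():
        status = "matches" if found == wanted else "differs"
        print(f"{name}: reference {wanted}, installed {found or 'missing'} ({status})")
```

A differing version is not necessarily a problem; the output is only a starting point for checking compatibility.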

Next, clone the scGPT repository into this project's directory. You do not need to install it.

Model Parameters

If you want to use our model, you can download it from the provided link. This link already includes the pre-trained 'whole-human' version of the scGPT weights, so there's no need for an additional download.

Demo

To make the model easier to use, we have created an interactive interface. After configuring the environment and adjusting the file paths to match your installation, simply run python server.py from the command line to launch the interface.

We currently have three integrated functions: Standard, Screen, and Retrieving.

Standard: This function generates a specified number of drugs based on gene transcription profile data.
Screen: This function allows you to input reference molecules and a similarity calculation method. It outputs the generated results in descending order of similarity to the reference molecules.
Retrieving: This function retrieves potential small molecules from a molecular database of your choice, given gene expression profiles. 🆕
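
The ranking idea behind Screen can be illustrated with a short sketch. This is not the code in server.py: it assumes Tanimoto similarity as the chosen method, represents each fingerprint as a hypothetical set of "on" bit indices, and scores each generated molecule by its best match against any reference. In practice the fingerprints would come from a cheminformatics toolkit such as RDKit.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_by_similarity(generated, references):
    """Sort generated molecules by best similarity to any reference, descending."""
    scored = []
    for name, fingerprint in generated.items():
        # Assumption: a molecule's score is its highest similarity to any reference.
        best = max(
            (tanimoto(fingerprint, ref) for ref in references.values()),
            default=0.0,
        )
        scored.append((name, best))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy fingerprints: hypothetical bit indices, not real molecules.
generated = {"gen_a": {1, 2, 3, 4}, "gen_b": {1, 2, 9}, "gen_c": {7, 8}}
references = {"ref_1": {1, 2, 3, 5}}

ranking = rank_by_similarity(generated, references)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Taking the maximum over references is one design choice; averaging over all references is an equally plausible alternative when several reference inhibitors are supplied.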

We provide experimental data for AKT2 (server_test_ctl.csv and server_test_pert.csv) and reference inhibitors (AKT2_ref.csv). You can use the Screen function to reproduce Result 2.3 from our paper.

[Figure: demo]

To-do-list

Upload a video walkthrough of the demo
Upload the complete dataset
Upload training code

Acknowledgements

Finally, we would like to express our deepest gratitude to the authors of scGPT and hierVAE. They have not only created excellent work but also made it open source for the benefit of researchers worldwide.

Whatever questions you have, feel free to contact us via email or raise an issue on GitHub. We firmly believe that different perspectives help us develop better tools. 😉