<div align="center"> <h2 align="center"> <img src="figure/logo.png" width="8%" height="18%"> ChatCell: Facilitating Single-Cell Analysis with Natural Language </h2> <p align="center"> <a href="https://chat.openai.com/g/g-vUwj222gQ-chatcell">💻 Demo</a> • <a href="https://huggingface.co/datasets/zjunlp/ChatCell-Instructions">🤗 Dataset</a> • <a href="#2">⌚️ QuickStart</a> • <a href="#3">🛠️ Usage</a> • <a href="#4">🚀 Evaluation</a> • <a href="#5">🧬 Single-cell Analysis Tasks</a> • <a href="#6">📝 Cite</a> </p> <div align=center><img src="figure/intro.gif" width="60%" height="100%" /></div> </div>

<b>ChatCell</b> aims to facilitate single-cell analysis with natural language. It builds on the Cell2Sentence technique to obtain cell language tokens and applies cell vocabulary adaptation for T5-based pre-training. Try the demo at the GPTStore App!

✨ Acknowledgements

Special thanks to the authors of "Cell2Sentence: Teaching Large Language Models the Language of Biology" and "Representing cells as sentences enables natural-language processing for single-cell transcriptomics" for their inspiring work.

The workflow_data/src folder and transform.py in this project build on their research. We are grateful for their valuable contributions to the field.

🆕 News

📌 Table of Contents


<h2 id="2">⌚️ Quickstart</h2>

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("zjunlp/chatcell-small")
model = AutoModelForSeq2SeqLM.from_pretrained("zjunlp/chatcell-small")
input_text = "Distinguish between resistant and sensitive cancer cells in response to Cisplatin, using the data from the 100 most expressed genes in descending order MYL12B FTL MYL12A HIST1H4C RPL23 GSTP1 RPS3 ENO1 RPLP1 TXN ANXA2 PPP1CB B2M RPLP0 HSPA8 H2AFZ TPI1 ANXA1 RPL7 GAPDH CHP1 LDHA RPL3 S100A11 PRDX1 CALM2 CAPZA1 SLC25A5 RPS27 YWHAZ GNB2L1 PTBP3 RPS6 MOB1A S100A2 ACTG1 BROX SAT1 RPL35A CA2 PSMB4 RPL8 TBL1XR1 RPS18 HNRNPH1 RPL27 RPS14 RPS11 ANP32E RPL19 C6ORF62 RPL9 EEF1A1 RPL5 COLGALT1 NPM1 CCT6A RQCD1 CACUL1 RPL4 HSP90AA1 MALAT1 ALDOA PSMA4 SEC61G RPL38 PSMB5 FABP5 HSP90AB1 RPL35 CHCHD2 EIF3E COX4I1 RPL21 PAFAH1B2 PTMA TMED4 PSMB3 H3F3B AGO1 DYNLL1 ATP5A1 LDHB COX7B ACTB RPS27A PSME2 ELMSAN1 NDUFA1 HMGB2 PSMB6 TMSB10 SET RPL12 RPL37A RPS13 EIF1 ATP5G1 RPS3A TOB1."

# Encode the input text and generate a response with specified generation parameters
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_length=512, num_return_sequences=1, no_repeat_ngram_size=2, top_k=50, top_p=0.95, do_sample=True)

# Decode and print the generated output text
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)
```
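
The Quickstart prompt ends with a cell sentence: gene names listed in descending order of expression, in the spirit of Cell2Sentence. A minimal sketch of how such a sentence could be built from per-gene counts is shown here. The function name `cell_to_sentence`, the alphabetical tie-breaking rule, and the toy counts are assumptions made for illustration and are not taken from the project's preprocessing code.

```python
def cell_to_sentence(expression, top_k=100):
    """Return the names of the top_k expressed genes, highest expression first."""
    # Keep only genes that are actually expressed in this cell
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    # Sort by expression, descending; ties are broken alphabetically (an assumption)
    expressed.sort(key=lambda item: (-item[1], item[0]))
    return " ".join(gene for gene, _ in expressed[:top_k])


# Toy counts for a single cell (illustrative values only)
toy_counts = {"GAPDH": 12, "FTL": 40, "ACTB": 12, "MALAT1": 0, "B2M": 25}
sentence = cell_to_sentence(toy_counts, top_k=3)
print(sentence)
```

The resulting string can be appended to a task instruction to form an `input_text` like the one in the Quickstart.
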
<h2 id="3">🛠️ Usage</h2>
<h3 id="1">📚 Step1: Prepare the data</h3>

❗️Note: You can download the original data from the raw_data directory. Alternatively, you can download the pre-processed data we provide on Hugging Face and skip Step 1.

Change to the workflow_data directory with the command: `cd workflow_data`.

1. For tasks such as random cell sentence generation, pseudo-cell generation, and cell type annotation, we utilize cells from the SHARE-seq mouse skin dataset.

2. For the drug sensitivity prediction task, we use the GSE149383 and GSE117872 datasets.

3. After preparing instructions for each specific task, merge the datasets using the merge.py script.
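
As a rough illustration of the merge step, the sketch below concatenates per-task instruction lists and shuffles them with a fixed seed. It is not the project's merge.py: the record keys `source` and `target`, the added `task` field, the function name, and the toy records are all assumptions made for this example.

```python
import random


def merge_instruction_sets(task_sets, seed=42):
    """Concatenate per-task instruction records and shuffle them reproducibly."""
    merged = []
    for task_name, records in task_sets.items():
        for record in records:
            # Tag each record with its task so the mixture can be inspected later
            merged.append({**record, "task": task_name})
    # A seeded generator keeps the shuffled order stable across runs
    random.Random(seed).shuffle(merged)
    return merged


# Toy records with placeholder text (hypothetical schema)
task_sets = {
    "cell_type_annotation": [{"source": "toy prompt A", "target": "toy answer A"}],
    "drug_sensitivity_prediction": [
        {"source": "toy prompt B", "target": "toy answer B"},
        {"source": "toy prompt C", "target": "toy answer C"},
    ],
}
merged = merge_instruction_sets(task_sets)
print(len(merged), sorted({record["task"] for record in merged}))
```
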

<h3 id="7">📜 Step2: Vocabulary Adaptation</h3>

To extend the tokenizer vocabulary with new terms from cell biology, use the vocabulary_adaptation.py script.
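
To make the idea concrete, here is a small sketch of the first half of vocabulary adaptation: finding gene names that the base vocabulary does not yet contain. It is an illustration under stated assumptions, not the contents of vocabulary_adaptation.py. The function name `find_new_tokens`, the toy vocabulary, and the toy sentences are hypothetical, and a real SentencePiece-based T5 vocabulary stores subword pieces rather than whole words.

```python
def find_new_tokens(cell_sentences, base_vocab):
    """Collect gene names missing from the base vocabulary, in first-seen order."""
    seen = set(base_vocab)
    new_tokens = []
    for sentence in cell_sentences:
        for gene in sentence.split():
            if gene not in seen:
                seen.add(gene)
                new_tokens.append(gene)
    return new_tokens


base_vocab = {"the", "cell", "B2M"}  # toy stand-in for a tokenizer vocabulary
cell_sentences = ["FTL B2M ACTB", "ACTB GAPDH FTL"]
new_tokens = find_new_tokens(cell_sentences, base_vocab)
print(new_tokens)

# With Hugging Face transformers, such a list is typically registered with
#   tokenizer.add_tokens(new_tokens)
#   model.resize_token_embeddings(len(tokenizer))
# so that each gene name maps to a single token with its own embedding.
```
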

<h3 id="2">🛠️ Step3: Train and generate</h3>

1. Training

2. Generation

<h3 id="3">⌨️ Step4: Translating sentences into gene expressions</h3>

For the pseudo-cell generation task, we also translate the generated sentences back into gene expression values, in two stages: data extraction and transformation.
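
The two stages can be sketched as follows, assuming a log-linear relationship between a gene's rank in the sentence and its expression value, similar in spirit to Cell2Sentence. This is not the project's transform.py: the function name, the `slope` and `intercept` defaults, and the toy gene list are placeholders, and real coefficients would have to be fitted on the training data.

```python
import math


def sentence_to_expression(sentence, gene_order, slope=-1.0, intercept=3.0):
    """Map a cell sentence to an expression vector aligned with gene_order."""
    known = set(gene_order)
    ranks = {}
    for gene in sentence.split():
        # Extraction: skip unknown tokens and repeated genes, rank the rest by position
        if gene in known and gene not in ranks:
            ranks[gene] = len(ranks) + 1
    # Transformation: value = intercept + slope * log10(rank), floored at zero;
    # genes absent from the sentence are assigned zero expression
    return [
        max(0.0, intercept + slope * math.log10(ranks[gene])) if gene in ranks else 0.0
        for gene in gene_order
    ]


gene_order = ["ACTB", "B2M", "FTL", "GAPDH"]  # toy gene panel
vector = sentence_to_expression("FTL B2M FTL XYZ ACTB", gene_order)
print(vector)
```

In this toy run the repeated `FTL` and the unknown token `XYZ` are dropped during extraction, and `GAPDH`, which never appears in the sentence, receives zero.
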

<h2 id="4">🚀 Evaluation</h2>

To evaluate the performance of various tasks, follow these steps:

<h2 id="5">🧬 Single-cell Analysis Tasks</h2>

ChatCell can handle the following single-cell tasks:

<p align="center"> <img src="figure/example1.jpg" width="80%" height="60%"> </p>
<p align="center"> <img src="figure/example2.jpg" width="80%" height="60%"> </p>
<p align="center"> <img src="figure/example3.jpg" width="80%" height="60%"> </p>
<p align="center"> <img src="figure/example4.jpg" width="80%" height="60%"> </p>
<h2 id="6">📝 Cite</h2>

```bibtex
@article{fang2024chatcell,
  title={ChatCell: Facilitating Single-Cell Analysis with Natural Language},
  author={Fang, Yin and Liu, Kangwei and Zhang, Ningyu and Deng, Xinle and Yang, Penghui and Chen, Zhuo and Tang, Xiangru and Gerstein, Mark and Fan, Xiaohui and Chen, Huajun},
  year={2024},
}
```
