
TourSynbio<sup>TM</sup>

<div align="center">

English | 简体中文

</div>

Contents <!-- omit in toc -->

Introduction

TourSynbio<sup>TM</sup> is an advanced protein language model that integrates knowledge from protein science. Built on InternLM2-Chat-7B, it is fine-tuned with the XTuner toolkit on the SFT (Supervised Fine-Tuning) dataset from ProteinLMBench. TourSynbio<sup>TM</sup> understands not only human language but also protein sequences, the language of life, seamlessly bridging the gap between specialized protein data and general language and making complex data and information easier to understand and apply. Its strong reasoning capabilities allow it to extract valuable insights from complex data, accelerating scientific discovery.

News

[2024.06.23] TourSynbio<sup>TM</sup> (SFT only) is now open source.

Usage

Quick Start

Download Model

<details> <summary>From OpenXLab</summary>

Refer to the OpenXLab Download Model documentation.

    # install the OpenXLab SDK first
    pip install openxlab

    # then, in Python, download the weights (replace [model_link] with the actual repo / model name)
    from openxlab.model import download
    download(model_repo=[model_link],
             model_name=[model_link], output='./')
</details>
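
Once the weights are downloaded, they can be loaded with Transformers. The snippet below is a minimal sketch, assuming the released checkpoint keeps the InternLM2-Chat remote-code interface; the local path `./TourSynbio-7B` is a placeholder for wherever you saved the model:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_path = "./TourSynbio-7B"  # placeholder: path to the downloaded weights
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.float16, trust_remote_code=True
    ).cuda().eval()

    # InternLM2-Chat checkpoints expose a chat() helper through trust_remote_code.
    response, history = model.chat(tokenizer, "What can a protein language model be used for?", history=[])
    print(response)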

Local Deployment

  1. Get the project code from GitHub

    git clone (ourlink)
    python (start_file_name)
    
  2. Create and activate a virtual environment

    conda env create -f environment.yml
    conda activate (envName)
    pip install -r requirements.txt
    
  3. Run the demo (an illustrative sketch of such a demo script follows these steps)

    streamlit run web_demo.py --server.address=0.0.0.0 --server.port=8501
    
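The sketch below shows roughly what such a Streamlit chat front end looks like. It is illustrative only: the repository ships its own web_demo.py, and the model path and chat interface used here (an InternLM2-style chat() exposed via trust_remote_code) are assumptions, not the actual implementation.

    import streamlit as st
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_PATH = "./TourSynbio-7B"  # placeholder: path to the downloaded weights

    @st.cache_resource
    def load_model():
        # Load the tokenizer and model once per Streamlit session.
        tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(
            MODEL_PATH, torch_dtype=torch.float16, trust_remote_code=True
        ).cuda().eval()
        return tokenizer, model

    tokenizer, model = load_model()
    st.title("TourSynbio demo")

    if "history" not in st.session_state:
        st.session_state.history = []

    # Replay earlier turns so the conversation survives Streamlit reruns.
    for user_msg, bot_msg in st.session_state.history:
        st.chat_message("user").write(user_msg)
        st.chat_message("assistant").write(bot_msg)

    if prompt := st.chat_input("Ask about a protein or a protein sequence..."):
        st.chat_message("user").write(prompt)
        # InternLM2-style chat() returns the reply and the updated history.
        response, st.session_state.history = model.chat(
            tokenizer, prompt, history=st.session_state.history
        )
        st.chat_message("assistant").write(response)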

XTuner Fine-tuning Guide

XTuner supports fine-tuning large language models. For guides on dataset preprocessing and on fine-tuning, please refer to the corresponding documentation.

The main changes to the config are the pretrained model path, the data path, and the fine-tuning method (LoRA). Other hyperparameters can be adjusted as needed; here we keep the defaults.
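
For orientation, the edited fields of an XTuner LoRA config typically look like the excerpt below; the model path, data path, and LoRA hyperparameters shown are illustrative placeholders, not the exact values used for TourSynbio<sup>TM</sup>:

    # Excerpt of the fields typically changed in an XTuner LoRA config (values are placeholders).
    from peft import LoraConfig
    from transformers import AutoModelForCausalLM
    from xtuner.model import SupervisedFinetune

    pretrained_model_name_or_path = 'internlm/internlm2-chat-7b'  # pretrained model path
    data_path = './data/protein_sft.json'                         # data path

    model = dict(
        type=SupervisedFinetune,
        llm=dict(
            type=AutoModelForCausalLM.from_pretrained,
            pretrained_model_name_or_path=pretrained_model_name_or_path,
            trust_remote_code=True),
        lora=dict(                                                # fine-tuning method: LoRA
            type=LoraConfig,
            r=64,
            lora_alpha=16,
            lora_dropout=0.1,
            bias='none',
            task_type='CAUSAL_LM'))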

Note:

Both the SFT and SSL stages modify the config file in the same way; the difference is that the `input` field of the SSL data is left empty during data construction. For details on constructing the pre-training data, see the documentation.
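
As an illustration, assuming the data follows XTuner's single-turn conversation format, an SFT record and an SSL record could look like the snippet below (the contents are made-up examples); the only structural difference is the empty `input` field in the SSL case:

    # Illustrative records in XTuner's single-turn conversation format (example contents only).
    sft_record = {
        "conversation": [{
            "system": "You are an expert in protein science.",
            "input": "What reaction does alcohol dehydrogenase catalyse?",
            "output": "It catalyses the oxidation of alcohols to aldehydes or ketones."
        }]
    }

    ssl_record = {
        "conversation": [{
            "input": "",  # empty input during SSL data construction
            "output": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ..."  # raw pre-training text, e.g. a protein sequence
        }]
    }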

To start fine-tuning with the modified config, run:

    xtuner train internlm2_7b_protein_lora

For example, you can fine-tune InternLM2-Chat-7B on the protein dataset using the LoRA algorithm:

    # Single GPU
    xtuner train internlm2_7b_protein_lora --deepspeed deepspeed_zero2
    # Multiple GPUs
    (DIST) NPROC_PER_NODE=${GPU_NUM} xtuner train internlm2_7b_protein_lora --deepspeed deepspeed_zero2
    (SLURM) srun ${SRUN_ARGS} xtuner train internlm2_7b_protein_lora --launcher slurm --deepspeed deepspeed_zero2

After training, convert the saved PTH checkpoint to HuggingFace format:

    xtuner convert pth_to_hf ${CONFIG_NAME_OR_PATH} ${PTH} ${SAVE_PATH}

Open Source License

This project is licensed under the Apache License 2.0.