Awesome

Diffusion-based hierarchical language modeling.

Dependencies

Please follow the instructions in genslm to set up the environment. This is particularly important if you plan to use DeepSpeed for distributed training.
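The authoritative setup steps are in the genslm repository; as a rough sketch only (the environment name and Python version below are illustrative assumptions, and on ALCF Polaris you would typically load the system modules first), the sequence might look like:

conda create -n genslm python=3.9
conda activate genslm
pip install deepspeed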

Next, install this repository in editable mode:

pip install -e .
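If you plan to use DeepSpeed, it is worth verifying the installation before launching a multi-node job. DeepSpeed ships a ds_report utility that prints the detected CUDA/PyTorch setup and which ops are compatible:

ds_report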

Training with DeepSpeed ZeRO Stage 2

For foundation models with at most 2.5B parameters, the model can be trained with ZeRO Stage 2, which shards optimizer states and gradients across GPUs:

export NODES=10
export GPUS_PER_NODE=4
export MASTER_ADDR=x3006c0s13b1n0.hsn.cm.polaris.alcf.anl.gov # replace with the first node of your allocation
export LR=1e-4
export EPOCHS=20
export TRAIN_BATCH_SIZE=2
export ACCUMULATION=1
export EVAL_BATCH_SIZE=1
export SAVE_TOTAL_LIMIT=5
export SAVE_FOLDER=2.5B_${NODES}nodes_deepspeed_diffusion_sep_checkpoints_${LR}
export TRAIN_FILE=data/sample_train.txt
export TEST_FILE=data/sample_val.txt
export CL_MODEL=/lus/eagle/projects/CVD-Mol-AI/yuntian/genomenewnaive/encoder_93810/run_l0.001_b32/checkpoints # pretrained encoder checkpoints (site-specific path; replace with your own)
export MODEL=EleutherAI/gpt-neox-20b # placeholder; this value is ignored by the script
deepspeed --num_gpus=${GPUS_PER_NODE} --num_nodes=${NODES} --master_addr=${MASTER_ADDR} --hostfile=hostfile --master_port=54321 examples/pytorch/language-modeling/run_clm_genslm_2.5B.py \
       --per_device_train_batch_size=${TRAIN_BATCH_SIZE} \
       --deepspeed=deepspeed_configs/zero2.json \
       --per_device_eval_batch_size=${EVAL_BATCH_SIZE} \
       --gradient_accumulation_steps=${ACCUMULATION} \
       --output_dir=${SAVE_FOLDER} \
       --model_type=${MODEL} \
       --model_name_or_path=${MODEL} \
       --do_train \
       --do_eval \
       --train_file=${TRAIN_FILE} \
       --validation_file=${TEST_FILE} --overwrite_output_dir --save_total_limit=${SAVE_TOTAL_LIMIT} \
       --learning_rate=${LR} --num_train_epochs=${EPOCHS} --load_best_model_at_end=True \
       --evaluation_strategy=epoch --save_strategy=epoch \
       --cl_model_name_or_path=${CL_MODEL} \
       --latent_dim=32 \
       --block_size=1024 --fp16 --prediction_loss_only
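The launcher reads its ZeRO settings from deepspeed_configs/zero2.json. That file is not reproduced here; as a sketch, a minimal Stage 2 config for the HuggingFace Trainer integration (where "auto" values are filled in from the Trainer arguments) typically looks like the following, though the repo's actual file may differ:

{
  "fp16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto"
}

The launcher also expects a hostfile listing the allocated nodes, one "hostname slots=N" line per node. On a PBS system such as Polaris it can be generated from the node list (assuming $PBS_NODEFILE contains one line per node):

sed 's/$/ slots=4/' $PBS_NODEFILE > hostfile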

Training with DeepSpeed ZeRO Stage 3

For larger foundation models (the 25B-parameter configuration below), use ZeRO Stage 3, which additionally shards the model parameters themselves across GPUs:

export NODES=10
export GPUS_PER_NODE=4
export MASTER_ADDR=x3006c0s13b1n0.hsn.cm.polaris.alcf.anl.gov # replace with the first node of your allocation
export LR=1e-4
export EPOCHS=20
export TRAIN_BATCH_SIZE=2
export ACCUMULATION=1
export EVAL_BATCH_SIZE=1
export SAVE_TOTAL_LIMIT=5
export SAVE_FOLDER=25B_${NODES}nodes_deepspeed_diffusion_sep_checkpoints_${LR}
export TRAIN_FILE=data/sample_train.txt
export TEST_FILE=data/sample_val.txt
export CL_MODEL=/lus/eagle/projects/CVD-Mol-AI/yuntian/genomenewnaive/encoder_93810/run_l0.001_b32/checkpoints # pretrained encoder checkpoints (site-specific path; replace with your own)
export MODEL=EleutherAI/gpt-neox-20b # placeholder; this value is ignored by the script
deepspeed --num_gpus=${GPUS_PER_NODE} --num_nodes=${NODES} --master_addr=${MASTER_ADDR} --hostfile=hostfile --master_port=54321 examples/pytorch/language-modeling/run_clm_genslm_25B.py \
       --per_device_train_batch_size=${TRAIN_BATCH_SIZE} \
       --deepspeed=deepspeed_configs/zero3.json \
       --per_device_eval_batch_size=${EVAL_BATCH_SIZE} \
       --gradient_accumulation_steps=${ACCUMULATION} \
       --output_dir=${SAVE_FOLDER} \
       --model_type=${MODEL} \
       --model_name_or_path=${MODEL} \
       --do_train \
       --do_eval \
       --train_file=${TRAIN_FILE} \
       --validation_file=${TEST_FILE} --overwrite_output_dir --save_total_limit=${SAVE_TOTAL_LIMIT} \
       --learning_rate=${LR} --num_train_epochs=${EPOCHS} --load_best_model_at_end=True \
       --evaluation_strategy=epoch --save_strategy=epoch \
       --cl_model_name_or_path=${CL_MODEL} \
       --latent_dim=32 \
       --block_size=1024 --fp16 --prediction_loss_only
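Relative to Stage 2, a Stage 3 config additionally partitions the model parameters. A minimal deepspeed_configs/zero3.json sketch for the HuggingFace Trainer integration (illustrative; the repo's actual file may enable further options such as CPU offload) could be:

{
  "fp16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto"
}

stage3_gather_16bit_weights_on_model_save consolidates the sharded fp16 weights at save time, so checkpoints can later be loaded without DeepSpeed.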

Generation

To generate sequences with the trained 2.5B model, run:

CUDA_VISIBLE_DEVICES=0 python examples/pytorch/language-modeling/generate_genslm_2.5B.py
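The script takes no command-line arguments here, so the checkpoint it loads is presumably configured inside generate_genslm_2.5B.py; if so, point the paths there at your trained ${SAVE_FOLDER} before running. CUDA_VISIBLE_DEVICES=0 pins generation to the first GPU; change the index to select a different device.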

Citations

If you use our models in your research, please cite this paper:

@article{zvyagin2022genslms,
  title={GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics},
  author={Zvyagin, Max T and Brace, Alexander and Hippe, Kyle and Deng, Yuntian and Zhang, Bin and Bohorquez, Cindy Orozco and Clyde, Austin and Kale, Bharat and Perez-Rivera, Danilo and Ma, Heng and others},
  journal={bioRxiv},
  year={2022},
  publisher={Cold Spring Harbor Laboratory}
}