# Tiny Llama

This is a side project that follows all the acceleration tricks in TinyLlama, with minimal modification to the Hugging Face Transformers code. This means you can pretrain a TinyLlama with the Hugging Face Trainer on an RTX3090 / RTX4090 / A6000 / A100 without gradient checkpointing, at a training speed comparable to (or even higher than) the original TinyLlama code.

I use the latest FlashAttention code, and Hugging Face Accelerate to speed up training.

## News

## Benchmark

> [!NOTE]
> The results below are based on an older version of the code. The latest version may have slightly different results.

| Model | GPU | Batch Size Per GPU | GPU Memory | Speed (tokens/s) |
|---|---|---|---|---|
| tinyllama | 1*RTX3090 | 4 | 22.3GiB | 8.2k |
| tinyllama | 8*RTX3090 | 3 | 20.8GiB | 32.7k |
| tinyllama | 1*RTX4090 | 4 | 23.1GiB | 17k |
| tinyllama | 1*A6000 | 8 | 43.8GiB | 12.2k |
| tinyllama | 4*A6000 | 8 | 44GiB | 40.4k |
| tinyllama | 8*A6000 | 8 | 44.1GiB | 73.3k |
| tinyllama | 1*A100-80G | 16 | 76.2GiB | 26.6k |
| tinyllama | 8*A100-80G | 16 | 76.9GiB | 204.5k |
| tinyllama | 8*A100-80G | 16*8 | 77.0GiB | 212.9k |
| llama-7b | 8*A100-80G | 1 | 78.8GiB | 22.4k |
<details> <summary>Deepspeed Results</summary>

| Model | GPU | Batch Size Per GPU | GPU Memory | Speed (tokens/s) |
|---|---|---|---|---|
| tinyllama | 8*RTX3090 | 4 | 16.3G | 36k |
| tinyllama | 4*A6000 | 8 | 30G | 35k |
| tinyllama | 4*A6000 | 12 | 39G | 40k |
| tinyllama | 8*A40 | 8 | 30G | 86k |
| tinyllama | 8*A40 | 12 | 39G | 92k |
| llama-7b | 8*A40 | 1 | 39.5G | 4.7k |
| llama-7b | 8*A100-80G | 4 | 60G | 18k |

</details>

where 16*8 means a batch size of 16 per GPU with 8 gradient accumulation steps. That configuration achieves a throughput of 26.6k tokens per second per A100-80G GPU.

That means you could train a Chinchilla-optimal TinyLlama (1.1B parameters, 22B tokens) in about one week with 4 A6000s or 8 RTX3090s, or in 28.7 hours with 8 A100-80Gs.
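Those estimates can be sanity-checked with a quick back-of-the-envelope calculation using the throughput figures from the table above:

```python
# Back-of-the-envelope check of the estimates above, using throughput from the benchmark table.
target_tokens = 22e9  # Chinchilla-optimal token budget for a 1.1B-parameter model

throughput = {           # tokens/s, taken from the benchmark table
    "4*A6000":    40.4e3,
    "8*RTX3090":  32.7e3,
    "8*A100-80G": 212.9e3,
}

print("per-GPU throughput on 8*A100-80G:", 212.9e3 / 8)  # ~26.6k tokens/s

for setup, tokens_per_second in throughput.items():
    hours = target_tokens / tokens_per_second / 3600
    print(f"{setup}: {hours:.1f} h ({hours / 24:.1f} days)")

# 4*A6000:     151.3 h (6.3 days)  -> about one week
# 8*RTX3090:   186.9 h (7.8 days)  -> about one week
# 8*A100-80G:   28.7 h (1.2 days)
```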

## Installation

Use a suitable Python version, then install PyTorch, xFormers, and the other dependencies:

```bash
# for example, use python 3.10
conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia
conda install xformers -c xformers
pip install -r requirements.txt
```

Change the CUDA version if it is not compatible with your setup.
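To verify that the installed PyTorch build matches your CUDA setup, a quick check like this can help (the versions printed will of course differ per environment):

```python
import torch

print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version this PyTorch build was compiled against
print(torch.cuda.is_available())  # should print True if the GPU and driver are set up correctly
```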

<details> <summary>Development Environment</summary>

The current implementation is developed with Python 3.12.4.

```bash
# install pytorch (2.5.0 nightly)
conda install pytorch pytorch-cuda=12.1 -c pytorch-nightly -c nvidia

# install xFormers
# (Optional) Makes the build much faster
pip install ninja
# Set TORCH_CUDA_ARCH_LIST if running and building on different GPU types
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
# (this can take dozens of minutes)

# install other dependencies like transformers, accelerate, etc.
pip install -r requirements.txt
```

> [!NOTE]
> I am using the PyTorch nightly build because I want to use FlexAttention. Feel free to use the stable version of PyTorch; in that case, you can install xFormers with `conda install xformers -c xformers`.

</details>

## Usage

### Option 1: Use our scripts

We already provide scripts for training and inference. Simply run the following command:

```bash
bash run_clm.sh
```

See `run_clm.sh` for more details.

### Option 2: Modify your own code

In your Python code, import the `tinyllama` module. It will automatically replace the Transformers modules with the optimized ones.

FlashAttention-2 is supported natively by Hugging Face Transformers; just load the model with it enabled.

```diff
+ import tinyllama
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T")
- model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T")
+ model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T", attn_implementation="flash_attention_2")

...
```

That's not all! You also need to switch on the optimization strategies via environment variables:

```diff
+ export TINY_FUSED_RMSNORM=1
+ export TINY_FUSED_CROSSENTROPY=1
+ export TINY_FUSED_ROTARY=1
+ export TINY_FUSED_SWIGLU=1

accelerate launch run_clm.py \
    --model_name_or_path TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T \
    --dataset_name wikitext \
    --dataset_config_name wikitext-103-raw-v1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --auto_find_batch_size \
    --block_size 2048 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.02 \
    --learning_rate 5e-4 \
    --weight_decay 6.6e-6 \
    --bf16 \
    --torch_dtype bfloat16 \
    --do_train \
    --do_eval \
    --do_predict \
    --num_train_epochs 20 \
    --save_total_limit 3 \
    --save_strategy epoch \
    --evaluation_strategy epoch \
    --load_best_model_at_end True \
    --metric_for_best_model eval_loss \
    --report_to none \
    --output_dir outputs/tiny-llama
```

Now the code should be blazingly fast. The acceleration also applies to other Llama models.
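If you prefer to set these switches from Python rather than the shell, something along these lines should work. This is only a sketch, not part of the project's documented interface; it assumes the `tinyllama` module reads the `TINY_FUSED_*` variables when it is imported, so they must be set before the import:

```python
import os

# Assumption: tinyllama checks these TINY_FUSED_* flags at import time,
# so they have to be set before the module is imported.
os.environ["TINY_FUSED_RMSNORM"] = "1"
os.environ["TINY_FUSED_CROSSENTROPY"] = "1"
os.environ["TINY_FUSED_ROTARY"] = "1"
os.environ["TINY_FUSED_SWIGLU"] = "1"

import tinyllama  # noqa: E402  (imported after the flags on purpose)
```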

## Related Projects