torchtune

Introduction | Installation | Get Started | Documentation | Community | License | Citing torchtune


Introduction

torchtune is a PyTorch library for easily authoring, finetuning, and experimenting with LLMs.

torchtune provides:

- PyTorch implementations of popular LLMs from the Llama, Gemma, Mistral, Phi, and Qwen model families
- Hackable training recipes for full finetuning, LoRA, QLoRA, DPO, PPO, QAT, knowledge distillation, and more
- Out-of-the-box memory efficiency, performance improvements, and scaling with the latest PyTorch APIs
- YAML configs for easily configuring training, evaluation, quantization, or inference recipes
- Built-in support for many popular dataset formats and prompt templates

 

Models

torchtune currently supports the following models.

| Model | Sizes |
|-------|-------|
| Llama3.3 | 70B [models, configs] |
| Llama3.2-Vision | 11B, 90B [models, configs] |
| Llama3.2 | 1B, 3B [models, configs] |
| Llama3.1 | 8B, 70B, 405B [models, configs] |
| Llama3 | 8B, 70B [models, configs] |
| Llama2 | 7B, 13B, 70B [models, configs] |
| Code-Llama2 | 7B, 13B, 70B [models, configs] |
| Mistral | 7B [models, configs] |
| Gemma | 2B, 7B [models, configs] |
| Gemma2 | 2B, 9B, 27B [models, configs] |
| Microsoft Phi3 | Mini [models, configs] |
| Qwen2 | 0.5B, 1.5B, 7B [models, configs] |
| Qwen2.5 | 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B [models, configs] |

We're always adding new models, but feel free to file an issue if there's a new one you would like to see in torchtune.

 

Finetuning recipes

torchtune provides the following finetuning recipes for training on one or more devices.

| Finetuning Method | Devices | Recipe | Example Config(s) |
|-------------------|---------|--------|-------------------|
| Full Finetuning | 1-8 | full_finetune_single_device <br> full_finetune_distributed | Llama3.1 8B single-device <br> Llama 3.1 70B distributed |
| LoRA Finetuning | 1-8 | lora_finetune_single_device <br> lora_finetune_distributed | Qwen2 0.5B single-device <br> Gemma 7B distributed |
| QLoRA Finetuning | 1-8 | lora_finetune_single_device <br> lora_finetune_distributed | Phi3 Mini single-device <br> Llama 3.1 405B distributed |
| DoRA/QDoRA Finetuning | 1-8 | lora_finetune_single_device <br> lora_finetune_distributed | Llama3 8B QDoRA single-device <br> Llama3 8B DoRA distributed |
| Quantization-Aware Training | 2-8 | qat_distributed | Llama3 8B QAT |
| Quantization-Aware Training and LoRA Finetuning | 2-8 | qat_lora_finetune_distributed | Llama3 8B QAT |
| Direct Preference Optimization | 1-8 | lora_dpo_single_device <br> lora_dpo_distributed | Llama2 7B single-device <br> Llama2 7B distributed |
| Proximal Policy Optimization | 1 | ppo_full_finetune_single_device | Mistral 7B |
| Knowledge Distillation | 1 | knowledge_distillation_single_device | Qwen2 1.5B -> 0.5B |

The above configs are just examples to get you started. If a model from the models table doesn't appear in these example configs, we likely still support it. If you're unsure whether something is supported, please open an issue on the repo.
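To browse every recipe and its associated configs straight from the terminal, you can use the CLI's ls subcommand (one of the commands shown in the tune --help output later in this README):

```bash
tune ls
```

This prints each recipe alongside the configs it ships with, which is a quick way to check whether a model/recipe combination is supported.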

 

Memory and training speed

Below is an example of the memory requirements and training speed for different Llama 3.1 models.

> [!NOTE]
> For ease of comparison, all the below numbers are provided for batch size 2 (without gradient accumulation), a dataset packed to sequence length 2048, and torch compile enabled.

If you are interested in running on different hardware or with different models, check out our documentation on memory optimizations here to find the right setup for you.
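For reference, the settings in the note above correspond to config overrides along the following lines. This is a minimal sketch against the Llama 3.1 8B LoRA config; the flag names match those used in the optimization example later in this README, but check the config you are running for the exact keys:

```bash
# Benchmark-style settings: batch size 2, no gradient accumulation,
# packed dataset at sequence length 2048, torch compile enabled
tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device \
  batch_size=2 \
  gradient_accumulation_steps=1 \
  dataset.packed=True \
  tokenizer.max_seq_len=2048 \
  compile=True
```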

| Model | Finetuning Method | Runnable On | Peak Memory per GPU | Tokens/sec * |
|-------|-------------------|-------------|---------------------|--------------|
| Llama 3.1 8B | Full finetune | 1x 4090 | 18.9 GiB | 1650 |
| Llama 3.1 8B | Full finetune | 1x A6000 | 37.4 GiB | 2579 |
| Llama 3.1 8B | LoRA | 1x 4090 | 16.2 GiB | 3083 |
| Llama 3.1 8B | LoRA | 1x A6000 | 30.3 GiB | 4699 |
| Llama 3.1 8B | QLoRA | 1x 4090 | 7.4 GiB | 2413 |
| Llama 3.1 70B | Full finetune | 8x A100 | 13.9 GiB ** | 1568 |
| Llama 3.1 70B | LoRA | 8x A100 | 27.6 GiB | 3497 |
| Llama 3.1 405B | QLoRA | 8x A100 | 44.8 GiB | 653 |

*= Measured over one full training epoch

**= Uses CPU offload with fused optimizer
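As a rough sketch of what the 70B footnote implies, CPU offload with a fused optimizer would be enabled through overrides along these lines. This assumes the distributed full-finetune recipe exposes an fsdp_cpu_offload flag and that the optimizer accepts a fused argument; consult the shipped 70B config for the exact names:

```bash
# Hypothetical overrides: fsdp_cpu_offload and optimizer.fused are
# assumptions here -- verify against the actual llama3_1/70B_full config
tune run --nproc_per_node 8 full_finetune_distributed --config llama3_1/70B_full \
  fsdp_cpu_offload=True \
  optimizer.fused=True
```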

 

Optimization flags

torchtune exposes a number of levers for memory efficiency and performance. The table below demonstrates the effect of applying some of these techniques sequentially to the Llama 3.2 3B model. Each technique is added on top of the previous one, except for LoRA and QLoRA, which use neither optimizer_in_bwd nor the 8-bit AdamW optimizer.

| Technique | Peak Memory Active (GiB) | % Change Memory vs Previous | Tokens Per Second | % Change Tokens/sec vs Previous |
|-----------|--------------------------|-----------------------------|-------------------|---------------------------------|
| Baseline | 25.5 | - | 2091 | - |
| + Packed Dataset | 60.0 | +135.16% | 7075 | +238.40% |
| + Compile | 51.0 | -14.93% | 8998 | +27.18% |
| + Chunked Cross Entropy | 42.9 | -15.83% | 9174 | +1.96% |
| + Activation Checkpointing | 24.9 | -41.93% | 7210 | -21.41% |
| + Fuse optimizer step into backward | 23.1 | -7.29% | 7309 | +1.38% |
| + Activation Offloading | 21.8 | -5.48% | 7301 | -0.11% |
| + 8-bit AdamW | 17.6 | -19.63% | 6960 | -4.67% |
| LoRA | 8.5 | -51.61% | 8210 | +17.96% |
| QLoRA | 4.6 | -45.71% | 8035 | -2.13% |

The final row in the table (QLoRA with packing, compile, chunked cross entropy, activation checkpointing, and activation offloading) uses 81.9% less memory than the baseline, with a 284.3% increase in tokens per second. It can be run via the command:

```bash
tune run lora_finetune_single_device --config llama3_2/3B_qlora_single_device \
  dataset.packed=True \
  compile=True \
  loss=torchtune.modules.loss.CEWithChunkedOutputLoss \
  enable_activation_checkpointing=True \
  optimizer_in_bwd=False \
  enable_activation_offloading=True \
  optimizer=torch.optim.AdamW \
  tokenizer.max_seq_len=4096 \
  gradient_accumulation_steps=1 \
  epochs=1 \
  batch_size=2
```

 

Installation

torchtune is tested with the latest stable PyTorch release as well as the preview nightly version. torchtune leverages torchvision for finetuning multimodal LLMs and torchao for the latest in quantization techniques; you should install these as well.

 

Install stable release

```bash
# Install stable PyTorch, torchvision, torchao stable releases
pip install torch torchvision torchao
pip install torchtune
```
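As a quick sanity check (not an official verification step), you can confirm the packages import cleanly:

```bash
python -c "import torch, torchvision, torchao, torchtune; print(torch.__version__)"
```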

 

Install nightly release

```bash
# Install PyTorch, torchvision, torchao nightlies
pip install --pre --upgrade torch torchvision torchao --index-url https://download.pytorch.org/whl/nightly/cu121 # full options are cpu/cu118/cu121/cu124
pip install --pre --upgrade torchtune --extra-index-url https://download.pytorch.org/whl/nightly/cpu
```

You can also check out our install documentation for more information, including installing torchtune from source.
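For reference, a source install typically looks something like the following; this is a sketch assuming a standard pip editable install, so defer to the install documentation for the supported steps:

```bash
git clone https://github.com/pytorch/torchtune.git
cd torchtune
pip install -e .
```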

 

To confirm that the package is installed correctly, you can run the following command:

```bash
tune --help
```

You should see output like the following:

```
usage: tune [-h] {ls,cp,download,run,validate} ...

Welcome to the torchtune CLI!

options:
  -h, --help            show this help message and exit

...
```

 

Get Started

To get started with torchtune, see our First Finetune Tutorial. Our End-to-End Workflow Tutorial will show you how to evaluate, quantize and run inference with a Llama model. The rest of this section will provide a quick overview of these steps with Llama3.1.

Downloading a model

Follow the instructions on the official meta-llama repository to ensure you have access to the official Llama model weights. Once you have confirmed access, you can run the following command to download the weights to your local machine. This will also download the tokenizer model and a responsible use guide.

To download Llama3.1, you can run:

```bash
tune download meta-llama/Meta-Llama-3.1-8B-Instruct \
  --output-dir /tmp/Meta-Llama-3.1-8B-Instruct \
  --ignore-patterns "original/consolidated.00.pth" \
  --hf-token <HF_TOKEN>
```

> [!TIP]
> Set your environment variable HF_TOKEN or pass in --hf-token to the command in order to validate your access. You can find your token at https://huggingface.co/settings/tokens
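For example, exporting the token once lets you drop the --hf-token flag from subsequent downloads:

```bash
# <HF_TOKEN> is your access token from https://huggingface.co/settings/tokens
export HF_TOKEN=<HF_TOKEN>
tune download meta-llama/Meta-Llama-3.1-8B-Instruct \
  --output-dir /tmp/Meta-Llama-3.1-8B-Instruct \
  --ignore-patterns "original/consolidated.00.pth"
```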

 

Running finetuning recipes

You can finetune Llama3.1 8B with LoRA on a single GPU using the following command:

```bash
tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device
```

For distributed training, the tune CLI integrates with torchrun. To run a full finetune of Llama3.1 8B on two GPUs:

```bash
tune run --nproc_per_node 2 full_finetune_distributed --config llama3_1/8B_full
```

> [!TIP]
> Make sure to place any torchrun arguments (like --nproc_per_node) before the recipe specification. Any CLI args after this point will override the config and will not impact distributed training.
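Putting the two together, a distributed run with a config override might look like this (the batch_size value is just an illustration):

```bash
# torchrun args (--nproc_per_node) go before the recipe;
# config overrides (batch_size) go after it
tune run --nproc_per_node 2 full_finetune_distributed \
  --config llama3_1/8B_full \
  batch_size=4
```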

 

Modify Configs

There are two ways in which you can modify configs:

Config Overrides

You can directly overwrite config fields from the command line:

```bash
tune run lora_finetune_single_device \
  --config llama2/7B_lora_single_device \
  batch_size=8 \
  enable_activation_checkpointing=True \
  max_steps_per_epoch=128
```

Update a Local Copy

You can also copy the config to your local directory and modify the contents directly:

```bash
tune cp llama3_1/8B_full ./my_custom_config.yaml
Copied to ./my_custom_config.yaml
```

Then, you can run your custom recipe by directing the tune run command to your local files:

```bash
tune run full_finetune_distributed --config ./my_custom_config.yaml
```
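Before launching, you can also sanity-check an edited config with the CLI's validate subcommand (one of the commands listed in tune --help). This is a quick sketch, so check tune validate --help for the exact argument form:

```bash
tune validate ./my_custom_config.yaml
```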

 

Check out tune --help for all possible CLI commands and options. For more information on using and updating configs, take a look at our config deep-dive.

 

Custom Datasets

torchtune supports finetuning on a variety of different datasets, including instruct-style, chat-style, preference datasets, and more. If you want to learn more about how to apply these components to finetune on your own custom dataset, please check out the provided links along with our API docs.
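As a quick illustration, pointing a recipe at a local instruct-style JSON file can be done entirely through config overrides. This is a sketch using torchtune's instruct_dataset builder; my_dataset.json is a hypothetical file, and the exact override syntax for swapping components is covered in the config deep-dive:

```bash
# Swap the config's dataset component for a local JSON file
tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device \
  dataset._component_=torchtune.datasets.instruct_dataset \
  dataset.source=json \
  dataset.data_files=my_dataset.json
```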

 

Community

torchtune focuses on integrating with popular tools and libraries from the ecosystem, for example Hugging Face for model weights and datasets, EleutherAI's Eval Harness for evaluation, and Weights & Biases for metric logging. These are just a few examples, with more under development.

 

Community Contributions

We really value our community and the contributions made by our wonderful users. We'll use this section to call out some of these contributions. If you'd like to help out as well, please see the CONTRIBUTING guide.

 

Acknowledgements

The Llama2 code in this repository is inspired by the original Llama2 code.

We want to give a huge shout-out to EleutherAI, Hugging Face and Weights & Biases for being wonderful collaborators and for working with us on some of these integrations within torchtune.

We also want to acknowledge the many awesome libraries and tools from the ecosystem that torchtune builds on.

 

License

torchtune is released under the BSD 3-Clause license. However, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models.

Citing torchtune

If you find the torchtune library useful, please cite it in your work as follows.

```bibtex
@software{torchtune,
  title = {torchtune: PyTorch's finetuning library},
  author = {torchtune maintainers and contributors},
  url = {https://github.com/pytorch/torchtune},
  license = {BSD-3-Clause},
  month = apr,
  year = {2024}
}
```