Home

Awesome

<div align="center">

<a href="https://unsloth.ai"><picture> <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20logo%20white%20text.png"> <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20logo%20black%20text.png"> <img alt="unsloth logo" src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20logo%20black%20text.png" height="110" style="max-width: 100%;"> </picture></a>

<a href="https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing"><img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/start free finetune button.png" height="48"></a> <a href="https://discord.gg/unsloth"><img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/Discord button.png" height="48"></a> <a href="https://ko-fi.com/unsloth"><img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/buy me a coffee button.png" height="48"></a>

Finetune Llama 3.2, Mistral, Phi-3.5 & Gemma 2-5x faster with 80% less memory!

</div>

✨ Finetune for Free

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, Ollama, vLLM or uploaded to Hugging Face.

Unsloth supportsFree NotebooksPerformanceMemory use
Llama 3.2 (3B)▶️ Start for free2x faster60% less
Llama 3.1 (8B)▶️ Start for free2x faster60% less
Phi-3.5 (mini)▶️ Start for free2x faster50% less
Gemma 2 (9B)▶️ Start for free2x faster63% less
Mistral Small (22B)▶️ Start for free2x faster60% less
Ollama▶️ Start for free1.9x faster43% less
Mistral v0.3 (7B)▶️ Start for free2.2x faster73% less
ORPO▶️ Start for free1.9x faster43% less
DPO Zephyr▶️ Start for free1.9x faster43% less

🦥 Unsloth.ai News

<details> <summary>Click for more news</summary> </details>

🔗 Links and Resources

TypeLinks
📚 Documentation & WikiRead Our Docs
<img height="14" src="https://upload.wikimedia.org/wikipedia/commons/6/6f/Logo_of_Twitter.svg" />  Twitter (aka X)Follow us on X
💾 Installationunsloth/README.md
🥇 BenchmarkingPerformance Tables
🌐 Released ModelsUnsloth Releases
✍️ BlogRead our Blogs

⭐ Key Features

🥇 Performance Benchmarking

1 A100 40GB🤗Hugging FaceFlash Attention🦥Unsloth Open Source🦥Unsloth Pro
Alpaca1x1.04x1.98x15.64x
LAION Chip21x0.92x1.61x20.73x
OASST1x1.19x2.17x14.83x
Slim Orca1x1.18x2.22x14.82x
Free Colab T4Dataset🤗Hugging FacePytorch 2.1.1🦥Unsloth🦥 VRAM reduction
Llama-2 7bOASST1x1.19x1.95x-43.3%
Mistral 7bAlpaca1x1.07x1.56x-13.7%
Tiny Llama 1.1bAlpaca1x2.06x3.87x-73.8%
DPO with ZephyrUltra Chat1x1.09x1.55x-18.6%

💾 Installation Instructions

For stable releases, use pip install unsloth. We recommend pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" for most installations though.

Conda Installation

⚠️Only use Conda if you have it. If not, use Pip. Select either pytorch-cuda=11.8,12.1 for CUDA 11.8 or CUDA 12.1. We support python=3.10,3.11,3.12.

conda create --name unsloth_env \
    python=3.11 \
    pytorch-cuda=12.1 \
    pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers \
    -y
conda activate unsloth_env

pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
<details> <summary>If you're looking to install Conda in a Linux environment, <a href="https://docs.anaconda.com/miniconda/">read here</a>, or run the below 🔽</summary>
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
</details>

Pip Installation

⚠️Do **NOT** use this if you have Conda. Pip is a bit more complex since there are dependency issues. The pip command is different for torch 2.2,2.3,2.4,2.5 and CUDA versions.

For other torch versions, we support torch211, torch212, torch220, torch230, torch240 and for CUDA versions, we support cu118 and cu121 and cu124. For Ampere devices (A100, H100, RTX3090) and above, use cu118-ampere or cu121-ampere or cu124-ampere.

For example, if you have torch 2.4 and CUDA 12.1, use:

pip install --upgrade pip
pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"

Another example, if you have torch 2.5 and CUDA 12.4, use:

pip install --upgrade pip
pip install "unsloth[cu124-torch250] @ git+https://github.com/unslothai/unsloth.git"

And other examples:

pip install "unsloth[cu121-ampere-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-torch240] @ git+https://github.com/unslothai/unsloth.git"

pip install "unsloth[cu121-torch230] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"

pip install "unsloth[cu121-torch250] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu124-ampere-torch250] @ git+https://github.com/unslothai/unsloth.git"

Or, run the below in a terminal to get the optimal pip installation command:

wget -qO- https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py | python -

Or, run the below manually in a Python REPL:

try: import torch
except: raise ImportError('Install torch via `pip install torch`')
from packaging.version import Version as V
v = V(torch.__version__)
cuda = str(torch.version.cuda)
is_ampere = torch.cuda.get_device_capability()[0] >= 8
if cuda != "12.1" and cuda != "11.8" and cuda != "12.4": raise RuntimeError(f"CUDA = {cuda} not supported!")
if   v <= V('2.1.0'): raise RuntimeError(f"Torch = {v} too old!")
elif v <= V('2.1.1'): x = 'cu{}{}-torch211'
elif v <= V('2.1.2'): x = 'cu{}{}-torch212'
elif v  < V('2.3.0'): x = 'cu{}{}-torch220'
elif v  < V('2.4.0'): x = 'cu{}{}-torch230'
elif v  < V('2.5.0'): x = 'cu{}{}-torch240'
elif v  < V('2.6.0'): x = 'cu{}{}-torch250'
else: raise RuntimeError(f"Torch = {v} too new!")
x = x.format(cuda.replace(".", ""), "-ampere" if is_ampere else "")
print(f'pip install --upgrade pip && pip install "unsloth[{x}] @ git+https://github.com/unslothai/unsloth.git"')

Windows Installation

To run Unsloth directly on Windows:

trainer = SFTTrainer(
    dataset_num_proc=1,
    ...
)

For advanced installation instructions or if you see weird errors during installations:

  1. Install torch and triton. Go to https://pytorch.org to install it. For example pip install torch torchvision torchaudio triton
  2. Confirm if CUDA is installated correctly. Try nvcc. If that fails, you need to install cudatoolkit or CUDA drivers.
  3. Install xformers manually. You can try installing vllm and seeing if vllm succeeds. Check if xformers succeeded with python -m xformers.info Go to https://github.com/facebookresearch/xformers. Another option is to install flash-attn for Ampere GPUs.
  4. Finally, install bitsandbytes and check it with python -m bitsandbytes

📜 Documentation

from unsloth import FastLanguageModel 
from unsloth import is_bfloat16_supported
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
max_seq_length = 2048 # Supports RoPE Scaling interally, so choose any!
# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 15 trillion tokens model 2x faster!
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",             # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        seed = 3407,
    ),
)
trainer.train()

# Go to https://github.com/unslothai/unsloth/wiki for advanced tips like
# (1) Saving to GGUF / merging to 16bit for vLLM
# (2) Continued training from a saved LoRA adapter
# (3) Adding an evaluation loop / OOMs
# (4) Customized chat templates

<a name="DPO"></a>

DPO Support

DPO (Direct Preference Optimization), PPO, Reward Modelling all seem to work as per 3rd party independent testing from Llama-Factory. We have a preliminary Google Colab notebook for reproducing Zephyr on Tesla T4 here: notebook.

We're in 🤗Hugging Face's official docs! We're on the SFT docs and the DPO docs!

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0" # Optional set GPU device ID

from unsloth import FastLanguageModel, PatchDPOTrainer
from unsloth import is_bfloat16_supported
PatchDPOTrainer()
import torch
from transformers import TrainingArguments
from trl import DPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/zephyr-sft-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 64,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
)

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 8,
        warmup_ratio = 0.1,
        num_train_epochs = 3,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        seed = 42,
        output_dir = "outputs",
    ),
    beta = 0.1,
    train_dataset = YOUR_DATASET_HERE,
    # eval_dataset = YOUR_DATASET_HERE,
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)
dpo_trainer.train()

🥇 Detailed Benchmarking Tables

1 A100 40GB🤗Hugging FaceFlash Attention 2🦥Unsloth OpenUnsloth EqualUnsloth ProUnsloth Max
Alpaca1x1.04x1.98x2.48x5.32x15.64x
codeCodeCodeCodeCode
seconds1040100152541919667
memory MB182351536596318525
% saved15.7447.1853.25

Llama-Factory 3rd party benchmarking

MethodBitsTGSGRAMSpeed
HF16239218GB100%
HF+FA216295417GB123%
Unsloth+FA216400716GB168%
HF424159GB101%
Unsloth+FA2437267GB160%

Performance comparisons between popular models

<details> <summary>Click for specific model benchmarking tables (Mistral 7b, CodeLlama 34b etc.)</summary>

Mistral 7b

1 A100 40GBHugging FaceFlash Attention 2Unsloth OpenUnsloth EqualUnsloth ProUnsloth Max
Mistral 7B Slim Orca1x1.15x2.15x2.53x4.61x13.69x
codeCodeCodeCodeCode
seconds18131571842718393132
memory MB32853193851246510271
% saved40.9962.0668.74

CodeLlama 34b

1 A100 40GBHugging FaceFlash Attention 2Unsloth OpenUnsloth EqualUnsloth ProUnsloth Max
Code Llama 34BOOM ❌0.99x1.87x2.61x4.27x12.82x
code▶️ CodeCodeCodeCode
seconds195319821043748458152
memory MB40000332172741322161
% saved16.9631.4744.60

1 Tesla T4

1 T4 16GBHugging FaceFlash AttentionUnsloth OpenUnsloth Pro EqualUnsloth ProUnsloth Max
Alpaca1x1.09x1.69x1.79x2.93x8.3x
code▶️ CodeCodeCodeCode
seconds15991468942894545193
memory MB7199705964595443
% saved1.9410.2824.39

2 Tesla T4s via DDP

2 T4 DDPHugging FaceFlash AttentionUnsloth OpenUnsloth EqualUnsloth ProUnsloth Max
Alpaca1x0.99x4.95x4.44x7.28x20.61x
code▶️ CodeCodeCode
seconds98829946199622271357480
memory MB9176912869046782
% saved0.5224.7626.09
</details>

Performance comparisons on 1 Tesla T4 GPU:

<details> <summary>Click for Time taken for 1 epoch</summary>

One Tesla T4 on Google Colab bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10

SystemGPUAlpaca (52K)LAION OIG (210K)Open Assistant (10K)SlimOrca (518K)
Huggingface1 T423h 15m56h 28m8h 38m391h 41m
Unsloth Open1 T413h 7m (1.8x)31h 47m (1.8x)4h 27m (1.9x)240h 4m (1.6x)
Unsloth Pro1 T43h 6m (7.5x)5h 17m (10.7x)1h 7m (7.7x)59h 53m (6.5x)
Unsloth Max1 T42h 39m (8.8x)4h 31m (12.5x)0h 58m (8.9x)51h 30m (7.6x)

Peak Memory Usage

SystemGPUAlpaca (52K)LAION OIG (210K)Open Assistant (10K)SlimOrca (518K)
Huggingface1 T47.3GB5.9GB14.0GB13.3GB
Unsloth Open1 T46.8GB5.7GB7.8GB7.7GB
Unsloth Pro1 T46.4GB6.4GB6.4GB6.4GB
Unsloth Max1 T411.4GB12.4GB11.9GB14.4GB
</details> <details> <summary>Click for Performance Comparisons on 2 Tesla T4 GPUs via DDP:</summary> **Time taken for 1 epoch**

Two Tesla T4s on Kaggle bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10

SystemGPUAlpaca (52K)LAION OIG (210K)Open Assistant (10K)SlimOrca (518K) *
Huggingface2 T484h 47m163h 48m30h 51m1301h 24m *
Unsloth Pro2 T43h 20m (25.4x)5h 43m (28.7x)1h 12m (25.7x)71h 40m (18.1x) *
Unsloth Max2 T43h 4m (27.6x)5h 14m (31.3x)1h 6m (28.1x)54h 20m (23.9x) *

Peak Memory Usage on a Multi GPU System (2 GPUs)

SystemGPUAlpaca (52K)LAION OIG (210K)Open Assistant (10K)SlimOrca (518K) *
Huggingface2 T48.4GB | 6GB7.2GB | 5.3GB14.3GB | 6.6GB10.9GB | 5.9GB *
Unsloth Pro2 T47.7GB | 4.9GB7.5GB | 4.9GB8.5GB | 4.9GB6.2GB | 4.7GB *
Unsloth Max2 T410.5GB | 5GB10.6GB | 5GB10.6GB | 5GB10.5GB | 5GB *
</details>

<br>

Thank You to