LQ-LoRA: Low-rank plus Quantized Matrix Decomposition for Efficient Language Model Finetuning [Paper]

Changelog

Artifacts

Installation

  1. Clone the repo
git clone https://github.com/HanGuo97/lq-lora.git
cd lq-lora
  2. Create Docker image (optional)
# Using BuildKit
DOCKER_BUILDKIT=1 docker build \
    -t lqlora \
    -f Dockerfile \
    .

docker run -ti --rm \
    --gpus all \
    -p 28888:8888 \
    --shm-size=2g \
    lqlora \
    bash -c "cd main/ && jupyter-lab --ip=0.0.0.0 --allow-root"
  3. Install dependencies
bash scripts/setup.sh

Note: some parts of the codebase require PyTorch >= 2.1.
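
A minimal way to verify this after running the setup script (a sketch using standard tooling, not a script from the repository):

# Environment check: confirm the installed PyTorch satisfies the note above.
import torch
from packaging import version

assert version.parse(torch.__version__) >= version.parse("2.1"), (
    f"Found PyTorch {torch.__version__}; parts of this codebase require >= 2.1.")
print(f"PyTorch {torch.__version__} looks fine.")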

Usage

Downloading Data for Quantization

After downloading the files, update FILE_NAMES_DICT in models/allocation_utils so that it points to where you saved them.
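
The layout below is only an illustration of the kind of entry to update; judging by its name, FILE_NAMES_DICT maps dataset identifiers to the downloaded file locations, but the actual keys and paths in the repository may differ (everything in this sketch is a placeholder):

# Placeholder sketch, not the repository's actual contents:
# point each entry at wherever you saved the corresponding downloaded file.
FILE_NAMES_DICT = {
    "c4": "/path/to/downloaded/c4-data.pth",  # hypothetical key and path
}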

Applying Quantization

from transformers import AutoTokenizer, AutoModelForCausalLM
from models import lora_utils

data = "c4"         # applying data-aware quantization
budget = "2.75"     # target bits
model_size = "70b"  # 7b or 70b

# Loads the base model (to CPU) and the matching tokenizer
model = AutoModelForCausalLM.from_pretrained(
    f"meta-llama/Llama-2-{model_size}-hf")
tokenizer = AutoTokenizer.from_pretrained(
    f"meta-llama/Llama-2-{model_size}-hf")

# Adds LoRA components, etc
model = lora_utils.prepare_model_for_lora(
    model=model,
    num_ranks=64,
    lora_alpha=16,
    lora_dropout=0.0,
    use_gradient_checkpointing=True)

# Applies LQ-LoRA to the model.
lora_utils.transform_lora_layers(
    lpq=True,
    model=model,
    model_name=f"llama-2-{model_size}/lpq-64/{data},budget={budget}",
    device="cuda")

Saving Quantized Models

Note that Hugging Face's PEFT library saves only the adapter parameters. Because LQ-LoRA also modifies the base model's parameters, the entire model state dict needs to be saved.

import os
import torch

# `output_dir` is the directory where the checkpoint should be written.
state_dict = model.state_dict()
file_name = os.path.join(
    output_dir,
    "full_model.pth")
torch.save(state_dict, file_name)

Loading Quantized Models

# Loads the base model (to CPU), as in the quantization example above
model = AutoModelForCausalLM.from_pretrained(
    f"meta-llama/Llama-2-{model_size}-hf")

# No need to apply `transform_lora_layers`, because the
# transformed layers will be loaded from the checkpoint.
model = lora_utils.prepare_model_for_lora(
    model=model,
    num_ranks=64,
    lora_alpha=16,
    lora_dropout=0.0,
    use_gradient_checkpointing=True,
    checkpoint_dir=checkpoint_dir)  # path to the directory containing the saved checkpoint
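
As an optional end-to-end check (a sketch, not part of the repository's instructions), you can run a short generation with the restored model. The device handling below is an assumption; depending on how the quantized layers are implemented, moving the model with `.to("cuda")` may need adjusting.

import torch
from transformers import AutoTokenizer

# Sketch: load the matching tokenizer and generate a few tokens.
tokenizer = AutoTokenizer.from_pretrained(
    f"meta-llama/Llama-2-{model_size}-hf")

model = model.to("cuda")  # assumption: the quantized layers support .to("cuda")
model.eval()

inputs = tokenizer("Low-rank plus quantized decomposition", return_tensors="pt")
inputs = inputs.to("cuda")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))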

Todos

Acknowledgement

This codebase reuses components from several libraries, including QLoRA and OmniQuant.