# LLaMA 2
> [!WARNING]
> Outdated LLaMA 2 code example. If you are looking for Llama 3, please go to meta-llama/llama3. This will remain here as a learning resource on how to port LLaMA 2 to different hardware and frameworks.
This is a variant of the LLaMA 2 model and has the following changes:
- Compression: 8-bit model quantization using bitsandbytes
- Non-Model Parallel (MP): run the 13B model on a single GPU. All MP code removed.
- Extended model:
  - Fix the sampler — a better sampler that improves generation quality: `temperature`, `top_p`, `repetition_penalty`, `tail_free` (a minimal sketch follows this list).
  - (Future) Provide more controls for generation; expose repetition penalty so the CLI can pass in the options.
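To make the sampler item above concrete, here is a minimal sketch of how those knobs typically combine when picking the next token. This is illustrative only, not the actual chattyllama code; the function name, defaults, and exact filtering order are my own.

```python
# Minimal sampler sketch (illustrative, not the chattyllama implementation):
# temperature + repetition penalty + tail-free + nucleus (top-p) filtering.
import torch

def sample_next_token(
    logits: torch.Tensor,       # (vocab_size,) logits for the next position
    prev_tokens: torch.Tensor,  # (seq_len,) token ids generated so far
    temperature: float = 0.7,
    top_p: float = 0.85,
    repetition_penalty: float = 1.1,
    tail_free_z: float = 0.95,
) -> int:
    scores = logits.clone()

    # Repetition penalty: make already-emitted tokens less likely.
    prev_scores = scores[prev_tokens]
    scores[prev_tokens] = torch.where(
        prev_scores > 0, prev_scores / repetition_penalty, prev_scores * repetition_penalty
    )

    # Temperature scaling, then sort once for the two filters below.
    probs = torch.softmax(scores / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)

    # Tail-free sampling: cut the flat tail of the sorted distribution, judged by
    # how little mass the normalized second derivative carries there.
    second_deriv = (sorted_probs[:-2] - 2 * sorted_probs[1:-1] + sorted_probs[2:]).abs()
    second_deriv = second_deriv / second_deriv.sum()
    keep = torch.cat([
        torch.tensor([True, True], device=logits.device),
        torch.cumsum(second_deriv, dim=-1) < tail_free_z,
    ])
    sorted_probs = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))

    # Nucleus (top-p): keep the smallest prefix of tokens whose mass exceeds top_p.
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0

    # Renormalize and sample.
    sorted_probs = sorted_probs / sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice].item()
```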
And more soon. I'm experimenting with compression and acceleration techniques to make the models:

- smaller and faster
- able to run on low-resource hardware
I'm also building a LLaMA-based ChatGPT.
## Hardware
## ChattyLLaMA

ChattyLLaMA is an experimental LLaMA-based ChatGPT.

### Documentation

All the new code is available in the `chattyllama` directory.

#### Combined
All changes and fixes baked into one:
- Non-Model Parallel (MP): all MP constructs removed (MP shards weights across a GPU cluster setup)
- 8-bit quantized model using bitsandbytes (a rough sketch follows this list)
- Sampler fixes, better sampler
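For the 8-bit item above, the core idea is to swap the model's `nn.Linear` layers for bitsandbytes int8 layers before the checkpoint is loaded, so the 13B weights fit on one GPU. A rough sketch, assuming bitsandbytes' `Linear8bitLt`/`Int8Params` API (not the exact chattyllama code):

```python
# Rough 8-bit conversion sketch (illustrative; the chattyllama code may differ).
import torch.nn as nn
import bitsandbytes as bnb

def convert_linears_to_int8(module: nn.Module, threshold: float = 6.0) -> nn.Module:
    """Recursively replace nn.Linear layers with bitsandbytes int8 layers."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            int8_layer = bnb.nn.Linear8bitLt(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                has_fp16_weights=False,  # keep the weights in int8 for inference
                threshold=threshold,     # outlier threshold for mixed-precision matmul
            )
            int8_layer.weight = bnb.nn.Int8Params(
                child.weight.data, requires_grad=False, has_fp16_weights=False
            )
            if child.bias is not None:
                int8_layer.bias = child.bias
            setattr(module, name, int8_layer)
        else:
            convert_linears_to_int8(child, threshold)
    # Actual quantization happens when the module is moved to the GPU (.cuda()).
    return module
```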
Source files location:
- `chattyllama/combined/model.py`: a fork of the LLaMA model.
- `chattyllama/combined/inference.py`: run model inference (a modified copy of `example.py`).
#### Non-MP/single GPU
Source files location:
- `chattyllama/model.py`: a fork of the LLaMA model.
- `chattyllama/inference.py`: run model inference.
### Code Examples
Code walkthrough: notebooks.
This shows how you can get it running on 1x A100 40GB GPU. The code is outdated, though; it uses the original model version from Meta AI.

For bleeding-edge changes, follow the quick start below.
### Quick start
- Download model weights into `./model`.
- Install all the needed dependencies.
```sh
$ git clone https://github.com/cedrickchee/llama.git
$ cd llama && pip install -r requirements.txt
```
Note:
- Don't use Conda. Use pip.
- If you have trouble with bitsandbytes, build and install it from source.
```sh
$ pip install -e .
#torchrun --nproc_per_node 1 example.py --ckpt_dir ../7B --tokenizer_path ../tokenizer.model
$ cd chattyllama/combined
```
- Modify `inference.py` with the path to your weights directory:
```python
# ...
if __name__ == "__main__":
    main(
        ckpt_dir="/model/vi/13B",  # <-- change the path
        tokenizer_path="/model/vi/tokenizer.model",  # <-- change the path
        temperature=0.7,
        top_p=0.85,
        max_seq_len=1024,
        max_batch_size=1
    )
```
- Modify `inference.py` with your prompt:
```python
def main(...):
    # ...
    prompts = [
        "I believe the meaning of life is"
    ]
    # ...
```
- Run inference:

```sh
$ python inference.py
```
## LLaMA 2 compatible port

Looking to use the LLaMA model with the HuggingFace library? Take a look at my "transformers-llama" repo.
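If you go the HuggingFace route, loading and generating looks roughly like the sketch below. The class names assume a `transformers` release with LLaMA support (the fork may use slightly different names), and the checkpoint path is a placeholder for wherever your converted weights live.

```python
# Sketch of running a converted LLaMA checkpoint with HuggingFace Transformers
# (the model path is a placeholder; device_map="auto" requires accelerate).
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_path = "./llama-7b-hf"  # hypothetical path to HF-converted weights
tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
)

inputs = tokenizer("I believe the meaning of life is", return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,
    top_p=0.85,
    max_new_tokens=64,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```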
## Other ports
- Text generation web UI - A Gradio Web UI for running Large Language Models like LLaMA, GPT-Neo, OPT, and friends. My guide: "Installing 8/4-bit LLaMA with text-generation-webui on Linux"
- LLaMa CPU fork - We need more work like this that lowers the compute requirements. Really underappreciated.
- Running LLaMA 7B on a 64GB M2 MacBook Pro with llama.cpp by Simon Willison - llama.cpp is from the same Whisper.cpp hacker, ggerganov. Never disappointed by ggerganov's work.

  It's genuinely possible to run an LLM that's hinting towards the performance of GPT-3 on your own hardware now. I thought that was still a few years away.

  Looking at this rate of model compression/acceleration progress, soon we can run LLM inference locally on mobile devices. QNNPACK, a hardware-optimized library that also supports mobile processors, can help. JIT compilers like OpenXLA/PyTorch Glow can optimize the computation graph so the model runs well on low-resource hardware.

  We underestimated pre-trained language models (~2019) and overestimated a lot of things.

  - A quick tutorial by me: 4 Steps in Running LLaMA-7B on a M1 MacBook with `llama.cpp`
  - My llama.cpp patches for Linux support. (WIP)
- Dalai - The simplest way to run LLaMA on your personal computer. It automatically installs and runs LLaMA on your computer. Powered by llama.cpp and Shawn's llama-dl CDN.
- Stanford Alpaca: An Open-Source Instruction-Following LLaMA Model
- Alpaca-LoRA - Fine-tuning and training code for LLaMA to replicate the Alpaca instruct-tuned model on consumer hardware, while waiting for Stanford to release their code.
- Alpaca.cpp - Locally run an instruction-tuned chat-style LLM. This combines the LLaMA foundation model (llama.cpp) with an open reproduction (Alpaca-LoRA) of Stanford Alpaca, a fine-tuning of the base model to obey instructions (akin to the RLHF used to train ChatGPT).
- Alpaca Native model weights - The model was fine-tuned using the original repository: https://github.com/tatsu-lab/stanford_alpaca (no LoRA has been used). Examples:
- Minimal LLaMA - Jason's HuggingFace Transformers port using OPT code internally. This version should be more stable. But the code is not well-tested yet. Bonus: you can quickly see how well the model can be fine-tuned either using HuggingFace PEFT with 8-bit or Pipeline Parallelism.
- pyllama - Run LLM in a single GPU, as simple as `pip install pyllama`. It's a quick & dirty hacked version of 🦙 LLaMA. Bonus: comes with a way to start a Gradio Web UI for trying out prompting in the browser. Good tip: "To load KV cache in CPU, run `export KV_CAHCHE_IN_GPU=0` in the shell."
- minichatgpt - Train ChatGPT in minutes with ColossalAI (blog post). (minichatgpt training process is pending my verification. I can confirm the code there was based on ColossalAI's mini demo. It doesn't support LLaMA yet.)
  - Supports LoRA
  - Supports RL paradigms, like reward model, PPO
  - Datasets used for training:
    - Train with prompt data from fka/awesome-minichatgpt-prompts. Training scripts and instructions here.
    - Train the reward model using the Dahoas/rm-static dataset.
## Supporting tools
- Resharding and HuggingFace conversion - Useful scripts for transforming the weights, if you still want to spread the weights and run the larger model (in fp16 instead of int8) across multiple GPUs for some reason.
## Plan
TODO:
Priority: high
- Improve sampler - refer to shawwn/llama fork.
- Fine-tune the models on a diverse set of instructions datasets from LAION's OpenAssistant. Check out my ChatGPT notes for larger training data. (blocked by dataset v1)
- Try the fine-tuning protocol from Flan.
  - The LLaMA paper touches on fine-tuning briefly, referencing Flan.
- Fine-tune model with HF's PEFT and Accelerate. PEFT doesn't support causal LM like LLaMA yet (blocked by PR). A rough sketch of the intended setup follows this list.
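For that PEFT item, once causal-LM support lands the setup would look roughly like the sketch below, assuming PEFT's `LoraConfig`/`get_peft_model` API and a HuggingFace-converted LLaMA checkpoint; the hyperparameters and target modules are illustrative.

```python
# Rough LoRA fine-tuning setup via HuggingFace PEFT (illustrative; assumes
# causal-LM support in PEFT and an HF-converted LLaMA checkpoint on disk).
from transformers import LlamaForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = LlamaForCausalLM.from_pretrained(
    "./llama-7b-hf",        # placeholder path to converted weights
    load_in_8bit=True,      # int8 base weights via bitsandbytes
    device_map="auto",
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # low-rank adapter dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable

# ...then train with the usual HuggingFace Trainer / Accelerate loop on the
# chosen instruction dataset.
```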
Priority: low
- Start and try other fine-tuning ideas:
- ChatGPT-like = LLaMA + CarperAI's trlX (RLHF) library + Anthropic's public preference dataset. I don't know how feasible it is to run larger-scale (compute-wise) experiments that use RL models that are good at instruction following.
Reminder-to-self:
- People under-appreciate fine-tuning alone compared to RLHF. RL algorithms (unsupervised) are quite finicky compared to supervised deep learning. RL is hard-ish.
## Original README
This repository is intended as a minimal, hackable and readable example to load LLaMA 2 (arXiv) models and run inference. In order to download the checkpoints and tokenizer, fill this Google form.
### Setup
In a conda env with pytorch / cuda available, run:

```sh
pip install -r requirements.txt
```

Then in this repository:

```sh
pip install -e .
```
### Download
Once your request is approved, you will receive links to download the tokenizer and model files.
Edit the `download.sh` script with the signed URL provided in the email to download the model weights and tokenizer.
### Inference
The provided `example.py` can be run on a single or multi-gpu node with `torchrun` and will output completions for two pre-defined prompts. Using `TARGET_FOLDER` as defined in `download.sh`:

```sh
torchrun --nproc_per_node MP example.py --ckpt_dir $TARGET_FOLDER/model_size --tokenizer_path $TARGET_FOLDER/tokenizer.model
```
Different models require different MP values:
| Model | MP |
|-------|----|
| 7B    | 1  |
| 13B   | 2  |
| 33B   | 4  |
| 65B   | 8  |
### FAQ
- 1. The download.sh script doesn't work on default bash in MacOS X
- 2. Generations are bad!
- 3. CUDA Out of memory errors
- 4. Other languages
### Reference
LLaMA: Open and Efficient Foundation Language Models -- https://arxiv.org/abs/2302.13971
```bibtex
@article{touvron2023llama,
  title={LLaMA: Open and Efficient Foundation Language Models},
  author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{\'e}e and Rozi{\`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume},
  journal={arXiv preprint arXiv:2302.13971},
  year={2023}
}
```
### Model Card
See MODEL_CARD.md
### License
See the LICENSE file.