# MagicPIG: LSH Sampling for Efficient LLM Generation

[[paper](https://arxiv.org/abs/2410.16179)]
This repo explores the possibility of a GPU-CPU LLM serving system powered by locality-sensitive hashing (LSH).
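For background, SimHash-style LSH hashes a vector with K random hyperplanes per table, across L independent tables, so that similar query and key vectors collide in at least one table with high probability. Below is a minimal toy sketch of that (K, L) scheme, for intuition only; it is not the kernels this repo ships.

```python
import numpy as np

# Toy SimHash sketch of the (K, L) LSH scheme: L tables, each hashing a
# vector with K random hyperplanes. Illustrative only.
def build_tables(dim, K, L, seed=0):
    rng = np.random.default_rng(seed)
    return rng.standard_normal((L, K, dim))  # L tables of K hyperplanes

def hash_codes(x, planes):
    # Sign pattern of x against each table's K hyperplanes -> one code per table.
    bits = (planes @ x > 0).astype(np.uint64)                            # (L, K)
    return (bits << np.arange(bits.shape[1], dtype=np.uint64)).sum(axis=1)  # (L,)

def candidates(q, keys, planes):
    # Keys that collide with the query in at least one of the L tables.
    q_codes = hash_codes(q, planes)                                      # (L,)
    k_codes = np.stack([hash_codes(k, planes) for k in keys])            # (n, L)
    return np.where((k_codes == q_codes).any(axis=1))[0]
```

Intuitively, larger K makes each table more selective (fewer collisions), while larger L raises the chance that truly similar keys collide somewhere.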
## Installation

```bash
bash install.sh
```
Only Intel CPUs are supported for the CPU kernels at the moment. We also provide a Hugging Face-style implementation for accuracy evaluation, which does not require an Intel CPU.
## Experiments

```bash
cd RULER/RULER/scripts
export K=10     # LSH hyper-parameter for MagicPIG; page size for Quest
export L=150    # LSH hyper-parameter for MagicPIG; number of selected pages for Quest
export sink=4   # number of sink tokens
export local=64 # number of local tokens
export model=0  # 0: MagicPIG; 1: Quest; 2: TopK; 3: Oracle Sampling
export expid=0
bash run.sh llama3-8b-chat-128k synthetic $K $L $sink $local $model $expid
```
This script is implemented on top of Hugging Face to replicate the accuracy results on the RULER benchmark. The reference files (model and KV cache implementations) can be found in `refs/`.
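The sink and local parameters always keep the first `sink` and most recent `local` tokens in the attention support, with the remaining positions filled by LSH sampling. A rough sketch of how such an index set could be assembled (function and argument names are illustrative, not the repo's actual API):

```python
import numpy as np

# Illustrative sketch: combine sink, local, and LSH-sampled token indices
# into the set of KV positions attended to at one decoding step.
def attended_positions(seq_len, sink, local, sampled):
    keep = set(range(min(sink, seq_len)))                 # first `sink` tokens
    keep |= set(range(max(0, seq_len - local), seq_len))  # last `local` tokens
    keep |= {int(i) for i in sampled}                     # LSH-sampled middle tokens
    return np.fromiter(sorted(keep), dtype=np.int64)

print(attended_positions(1000, sink=4, local=64, sampled=[100, 400, 700]))
```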
Three models are supported now: `llama3-8b-chat-128k` (Llama-3.1-8B-Instruct), `llama3-70b-chat-128k` (Llama-3.1-70B-Instruct), and `mistral-7b-chat-512k` (MegaBeam-Mistral-7B-512k).
In `models/`, we implement the MagicPIG CPU/GPU code for sanity checks and benchmarking. `models/magicpig_llama.py` and `models/cache.py` are expected to be equivalent to `refs/hf_model_ref.py` and `refs/hf_cache_ref.py`.
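Given that expected equivalence, a quick sanity check could compare logits from the two implementations on the same prompt. A hedged sketch; the actual loading code and tolerance depend on the repo's constructors and sampling settings:

```python
import torch

# Hypothetical sanity check: compare the reference and MagicPIG model
# outputs on identical input_ids. Adapt loading to the real constructors.
@torch.no_grad()
def compare_logits(model_ref, model_magicpig, input_ids, atol=1e-3):
    ref = model_ref(input_ids).logits
    out = model_magicpig(input_ids).logits
    max_diff = (ref - out).abs().max().item()
    print(f"max |logit diff| = {max_diff:.2e}")
    assert max_diff < atol, "implementations diverge beyond tolerance"
```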
To benchmark the speed of MagicPIG:

```bash
cd models
OMP_NUM_THREADS=96 python benchmark.py --P 98000 --M 98304 --B 1 --model meta-llama/Meta-Llama-3.1-8B-Instruct
```
To achieve the best performance, you currently need to manually set the number of OMP threads in `lsh/lsh.cc` and `attention/gather_gemv.cc` (as well as via `OMP_NUM_THREADS` above) to match the number of physical cores on your CPU.
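A small convenience sketch for finding the physical (non-hyper-threaded) core count, which is usually the right value for CPU-bound kernels like these; `psutil` is a third-party package, not a dependency of this repo:

```python
import psutil  # pip install psutil; not a repo dependency

# Count physical cores, ignoring hyper-threads.
print(f"export OMP_NUM_THREADS={psutil.cpu_count(logical=False)}")
```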
For generation:

```bash
cd models
python generation.py --path ../data/data32k.json
```
where `--path` specifies the input contexts.
`models/magicpig_config.json` sets hyper-parameters such as (K, L) for the LSH algorithm and which layers to keep on the GPU.
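For example, to sweep (K, L) you could rewrite the config programmatically. The field names below are guesses for illustration; inspect the shipped `models/magicpig_config.json` for the actual schema:

```python
import json

# Hypothetical field names -- check the real config file before relying on this.
with open("models/magicpig_config.json") as f:
    cfg = json.load(f)

cfg["K"] = 10   # hash bits per LSH table (hypothetical key)
cfg["L"] = 150  # number of LSH tables (hypothetical key)

with open("models/magicpig_config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```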
## Milestones
- Integrate FlashInfer attention (ongoing).
- CPU thread scheduling (ongoing).
- GPU-CPU pipelines.
- Multi-GPU.
## Citation

```bibtex
@article{chen2024magicpig,
  title={MagicPIG: LSH Sampling for Efficient LLM Generation},
  author={Chen, Zhuoming and Sadhukhan, Ranajoy and Ye, Zihao and Zhou, Yang and Zhang, Jianyu and Nolte, Niklas and Tian, Yuandong and Douze, Matthijs and Bottou, Leon and Jia, Zhihao and Chen, Beidi},
  journal={arXiv preprint arXiv:2410.16179},
  year={2024}
}
```