MagicPIG: LSH Sampling for Efficient LLM Generation

Installation

bash install.sh

Only Intel CPUs are currently supported. We also provide a Hugging Face-style implementation for accuracy evaluation, which does not require an Intel CPU.

Experiments

cd RULER/RULER/scripts
export K=10 # LSH hyper-parameter for MagicPIG; page size for Quest
export L=150 # LSH hyper-parameter for MagicPIG; number of selected pages for Quest
export sink=4 # number of sink tokens
export local=64 # number of local tokens
export model=0 # 0: MagicPIG; 1: Quest; 2: TopK; 3: Oracle Sampling
export expid=0
bash run.sh llama3-8b-chat-128k synthetic $K $L $sink $local $model $expid

This script runs the Hugging Face implementation to replicate the accuracy results on the RULER benchmark. The reference files (model and KV cache implementations) can be found in refs/.
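To give intuition for the K and L parameters above, here is a minimal SimHash-style LSH sketch (random hyperplanes, K bits per table, L tables): keys are hashed into buckets, and a query retrieves the keys whose codes collide with its own in enough tables. This is only an illustration of the general technique under assumed details (the collision threshold, function names, and dimensions are made up here), not the code in models/ or refs/.

# Illustrative SimHash-style LSH: K random hyperplanes per table, L tables.
# This is NOT the repository's implementation; it only sketches what K and L control.
import numpy as np

def build_tables(keys, K, L, rng):
    """Hash every key vector into L tables of K-bit SimHash codes."""
    d = keys.shape[1]
    planes = rng.standard_normal((L, K, d))           # random hyperplanes
    bits = (keys @ planes.reshape(L * K, d).T) > 0    # (n, L*K) sign bits
    bits = bits.reshape(-1, L, K)
    weights = 1 << np.arange(K)                       # pack K bits into a bucket id
    codes = (bits * weights).sum(-1)                  # (n, L)
    tables = [dict() for _ in range(L)]
    for i, row in enumerate(codes):
        for t, c in enumerate(row):
            tables[t].setdefault(int(c), []).append(i)
    return planes, tables

def query(q, planes, tables, min_collisions=2):
    """Return key indices colliding with the query in >= min_collisions tables (threshold is illustrative)."""
    L, K, d = planes.shape
    bits = (planes.reshape(L * K, d) @ q) > 0
    codes = (bits.reshape(L, K) * (1 << np.arange(K))).sum(-1)
    hits = {}
    for t, c in enumerate(codes):
        for i in tables[t].get(int(c), []):
            hits[i] = hits.get(i, 0) + 1
    return [i for i, n in hits.items() if n >= min_collisions]

rng = np.random.default_rng(0)
keys = rng.standard_normal((1024, 128))               # toy "KV cache" keys
planes, tables = build_tables(keys, K=10, L=150, rng=rng)
print(len(query(keys[0], planes, tables)))            # number of retrieved keys

Larger K makes each bucket more selective; larger L increases the chance that genuinely similar keys collide in at least one table.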

Three models are currently supported: llama3-8b-chat-128k (Llama-3.1-8B-Instruct), llama3-70b-chat-128k (Llama-3.1-70B-Instruct), and mistral-7b-chat-512k (MegaBeam-Mistral-7B-512k).

In models/, we implement the MagicPIG CPU/GPU code for sanity checks and benchmarking. models/magicpig_llama.py and models/cache.py are expected to be equivalent to refs/hf_model_ref.py and refs/hf_cache_ref.py.

To benchmark the speed of MagicPIG:

cd models
OMP_NUM_THREADS=96 python benchmark.py --P 98000 --M 98304 --B 1 --model meta-llama/Meta-Llama-3.1-8B-Instruct

To achieve the best performance, you currently need to manually set the number of OpenMP threads in lsh/lsh.cc and attention/gather_gemv.cc (as well as the OMP_NUM_THREADS value above) to match the number of physical cores on your CPU.
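If you are unsure how many physical cores your machine has, a quick check is sketched below; psutil is not a MagicPIG dependency, it is just a convenient way to query the hardware.

# Report physical vs. logical core counts so OMP_NUM_THREADS and the thread
# settings in lsh/lsh.cc / attention/gather_gemv.cc can be matched to the hardware.
import psutil

physical = psutil.cpu_count(logical=False)  # physical cores
logical = psutil.cpu_count(logical=True)    # hardware threads (hyper-threads)
print(f"physical cores: {physical}, logical cores: {logical}")
# Then launch with, e.g., OMP_NUM_THREADS=<physical> python benchmark.py ...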

To run generation:

cd models
python generation.py --path ../data/data32k.json

where --path specifies the file containing the input contexts.

models/magicpig_config.json sets hyper-parameters such as (K, L) for the LSH algorithm and which layers to keep on the GPU.
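As a quick sanity check before running, you can print the current configuration. The key names are defined by the repository, so this sketch only loads and displays the file without assuming a schema.

# Print the current MagicPIG configuration (run from the repository root).
import json

with open("models/magicpig_config.json") as f:
    config = json.load(f)
print(json.dumps(config, indent=2))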

Citation

@article{chen2024magicpig,
  title={MagicPIG: LSH Sampling for Efficient LLM Generation},
  author={Chen, Zhuoming and Sadhukhan, Ranajoy and Ye, Zihao and Zhou, Yang and Zhang, Jianyu and Nolte, Niklas and Tian, Yuandong and Douze, Matthijs and Bottou, Leon and Jia, Zhihao and others},
  journal={arXiv preprint arXiv:2410.16179},
  year={2024}
}