Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward

This repository contains the open-source code and benchmark results for the paper "Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward".<br> The benchmark assesses the performance of various compression and inference methods.

Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward<br> Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Deepak Gupta, Merouane Debbah<br>Nyun AI, Transmute AI Lab, KU 6G Research Center

<!-- ## Updates ### <Month> <Date>, <Year> : <Title> -->

Getting Started

All experiments are performed in isolated Python 3.10 environments with method-specific requirements such as package and library versions. The exact repository and branch details can be found in .gitmodules.

Repository Organization

The repository follows a structured format with the branch naming convention "A100<method>", where <method> denotes the specific method being evaluated. The organization within each branch is outlined as follows:

Please note that all setup, generation, and benchmarking scripts (.sh) are kept as up to date as possible with the latest runs and are tailored to Python 3.10 with CUDA 12.1 (or the version required by the method). Adjustments to the scripts may be required, or a different script may need to be used.

Branch Overview

The repository provides dedicated branches for the evaluated methods:

Note: Each branch carries its own set of updated scripts, which may or may not be synchronized with the other branches. Some quantization methods do not have dedicated branches; their scripts can be found directly in the respective method branches or on the main branch. Models and scales obtained directly from the HF Hub were also used as needed.

Results Overview

Throughout the tables below, WM denotes weight memory, RM denotes running memory (peak memory during inference), and Tokens/s denotes generation throughput.

Pruning

| Method | Sparsity | RM (GB) | WM (GB) | Tokens/s | Perplexity |
|---|---|---|---|---|---|
| Baseline | - | 26.16 | 12.55 | 30.90 | 12.62 |
| Wanda-SP | 20% | - | - | - | 22.12 |
| Wanda-SP | 50% | - | - | - | 366.43 |
| LLM-Pruner | 20% | 10.38 | 10.09 | 32.57 | 19.77 |
| LLM-Pruner | 50% | 6.54 | 6.23 | 40.95 | 112.44 |
| LLM-Pruner\* | 20% | 10.38 | 10.09 | 32.57 | 17.37 |
| LLM-Pruner\* | 50% | 6.54 | 6.23 | 40.95 | 38.12 |
| FLaP | 20% | 9.72 | 9.44 | 33.90 | 14.62 |
| FLaP | 50% | 6.26 | 6.07 | 42.88 | 31.80 |

\* with fine-tuning
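
The WM, RM, and Tokens/s columns can be approximated with a plain PyTorch/Transformers loop like the sketch below. This is an illustrative measurement, not the exact benchmark harness used for the numbers above; the model id, prompt, and generation length are placeholders.

```python
# Hypothetical sketch: approximating weight memory (WM), running memory (RM),
# and decode throughput for a Hugging Face checkpoint.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
model.eval()

# WM: memory occupied by the model weights alone.
wm_gb = model.get_memory_footprint() / 1024**3

# Tokens/s: time a fixed number of greedily generated tokens.
inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")
torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
new_tokens = output.shape[1] - inputs["input_ids"].shape[1]

# RM: peak GPU memory observed while generating (weights + KV cache + activations).
rm_gb = torch.cuda.max_memory_allocated() / 1024**3

print(f"WM {wm_gb:.2f} GB | RM {rm_gb:.2f} GB | {new_tokens / elapsed:.2f} tokens/s")
```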

<!-- **Analysis:** - The baseline model has a sparsity level of 0%. - Wanda-SP introduces sparsity at 20% and 50%, significantly impacting perplexity. - LLM-Pruner achieves sparsity of 20% and 50% with variations in running and weight memory. - FLaP demonstrates sparsity at 20% and 50%, influencing both memory and tokens/s metrics. -->
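
Perplexity is the quality metric used throughout these tables (and the only one reported for Wanda-SP above). A minimal strided evaluation might look like the sketch below; the corpus (WikiText-2), context length, and stride are assumptions here, not necessarily the exact protocol behind the reported numbers.

```python
# Hypothetical sketch: perplexity via non-overlapping windows over a held-out corpus.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
model.eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids.to("cuda")

max_len, stride = 2048, 2048  # assumed Llama-2 context window, non-overlapping windows
nlls, counted = [], 0
for begin in range(0, ids.size(1) - 1, stride):
    end = min(begin + max_len, ids.size(1))
    input_ids = ids[:, begin:end]
    with torch.no_grad():
        # labels == input_ids: the model shifts them internally for next-token loss
        loss = model(input_ids, labels=input_ids).loss
    n = input_ids.size(1) - 1  # number of predicted tokens in this window
    nlls.append(loss.float() * n)
    counted += n

print(f"Perplexity: {torch.exp(torch.stack(nlls).sum() / counted).item():.2f}")
```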

Quantization

| Method | Inference Engine | WM (GB) | RM (GB) | Tokens/s | Perplexity |
|---|---|---|---|---|---|
| Baseline FP16 | PyTorch | 12.55 | 26.16 | 30.90 | 5.85 |
| GPTQ 2bit | PyTorch | 2.11 | 2.98 | 20.91 | NaN |
| GPTQ 3bit | PyTorch | 2.87 | 3.86 | 21.24 | 7.36 |
| GPTQ 4bit | PyTorch | 3.63 | 4.65 | 21.63 | 6.08 |
| GPTQ 8bit | PyTorch | 6.67 | 7.62 | 21.36 | 5.86 |
| AWQ 4bit GEMM | PyTorch | 3.68 | 4.64 | 28.51 | 6.02 |
| AWQ 4bit GEMV | PyTorch | 3.68 | 4.64 | 31.81 | 6.02 |
| QLoRA (NF4) | PyTorch | 3.56 | 4.84 | 19.70 | 6.02 |
| LLM.int8() | PyTorch | 6.58 | 7.71 | 5.24 | 5.89 |
| K-Quants 4bit | Llama.cpp | 3.80 | 7.38 | 104.45 | 5.96 |
| OmniQuant 3bit | MLC-LLM | 3.20 | 5.10 | 83.4 | 6.65 |
| OmniQuant 4bit | MLC-LLM | 3.80 | 5.70 | 134.2 | 5.97 |
<!-- **Analysis:** - Baseline FP16 serves as the reference for comparison. - GPTQ 2bit exhibits lower memory consumption but leads to perplexity issues. - Various quantization methods show diverse impacts on memory, tokens/s, and perplexity. - LLM.int8() significantly reduces memory but at the expense of tokens/s. - K-Quants 4bit in Llama.cpp achieves a balance between memory and tokens/s. - OmniQuant methods in MLC-LLM show competitive performance in terms of memory and tokens/s. -->
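
For the rows above where the inference engine is plain PyTorch, the quantized variants are loaded directly through the Hugging Face/PyTorch stack together with the corresponding quantization packages (bitsandbytes, AutoGPTQ, AutoAWQ). The snippet below is an illustrative sketch of such loading; checkpoint paths are placeholders, and the branch-specific scripts contain the exact settings used in the benchmark.

```python
# Hypothetical sketch: loading the PyTorch-engine quantized variants.
# Each variant is shown independently; in practice load one at a time.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

base_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint

# LLM.int8(): 8-bit on-the-fly quantization via bitsandbytes.
int8_model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# QLoRA-style NF4: 4-bit NormalFloat quantization via bitsandbytes.
nf4_model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)

# GPTQ: load a pre-quantized checkpoint with AutoGPTQ.
from auto_gptq import AutoGPTQForCausalLM

gptq_model = AutoGPTQForCausalLM.from_quantized(
    "path/to/llama-2-7b-gptq-4bit", device="cuda:0"  # placeholder path
)

# AWQ: load a pre-quantized checkpoint with AutoAWQ (GEMM/GEMV kernels).
from awq import AutoAWQForCausalLM

awq_model = AutoAWQForCausalLM.from_quantized(
    "path/to/llama-2-7b-awq-4bit", fuse_layers=True  # placeholder path
)
```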

Engine Results

| Method | Hardware Support | Quantization Type | WM (GB) | RM (GB) | Tokens/s | Perplexity |
|---|---|---|---|---|---|---|
| Llama.cpp | NVIDIA GPU | GGUF K-Quant 2bit | 2.36 | 3.69 | 102.15 | 6.96 |
| | AMD GPU | GGUF 4bit | 3.56 | 4.88 | 128.97 | 5.96 |
| | Apple Silicon | GGUF AWQ 4bit | 3.56 | 4.88 | 129.25 | 5.91 |
| | CPU | GGUF K-Quant 4bit | 3.59 | 4.90 | 109.72 | 5.87 |
| | | GGUF 8bit | 6.67 | 7.78 | 93.39 | 5.79 |
| | | GGUF FP16 | 12.55 | 13.22 | 66.81 | 5.79 |
| ExLlama | NVIDIA GPU | GPTQ 4bit | 3.63 | 5.35 | 77.10 | 6.08 |
| | AMD GPU | | | | | |
| ExLlamav2 | NVIDIA GPU | EXL2 2bit | 2.01 | 5.21 | 153.75 | 20.21 |
| | AMD GPU | EXL2 4bit | 3.36 | 6.61 | 131.68 | 6.12 |
| | | GPTQ 4bit | 3.63 | 6.93 | 151.30 | 6.03 |
| | | EXL2 8bit | 6.37 | 9.47 | 115.81 | 5.76 |
| | | FP16 | 12.55 | 15.09 | 67.70 | 5.73 |
| vLLM | NVIDIA GPU | AWQ GEMM 4bit | 3.62 | 34.55 | 114.43 | 6.02 |
| | AMD GPU | GPTQ 4bit | 3.63 | 36.51 | 172.88 | 6.08 |
| | | FP16 | 12.55 | 35.92 | 79.74 | 5.85 |
| TensorRT-LLM | NVIDIA GPU | AWQ GEMM 4bit | 3.42 | 5.69 | 194.86 | 6.02 |
| | | GPTQ 4bit | 3.60 | 5.88 | 202.16 | 6.08 |
| | | INT8 | 6.53 | 8.55 | 143.57 | 5.89 |
| | | FP16 | 12.55 | 14.61 | 83.43 | 5.85 |
| TGI | NVIDIA GPU | AWQ GEMM 4bit | 3.62 | 36.67 | 106.84 | 6.02 |
| | AMD GPU | GPTQ 4bit | 3.69 | 37.85 | 163.22 | 6.08 |
| | Intel GPU | FP4 | 12.55 | 37.21 | 36.91 | 6.15 |
| | AWS Inferentia2 | NF4 | 12.55 | 37.21 | 36.32 | 6.02 |
| | | BF16 | 12.55 | 38.03 | 73.59 | 5.89 |
| | | FP16 | 12.55 | 38.03 | 74.19 | 5.85 |
| MLC-LLM | NVIDIA GPU | OmniQuant 3bit | 3.2 | 5.1 | 83.4 | 6.65 |
| | AMD GPU, CPU, WebGPU | OmniQuant 4bit | 3.8 | 5.7 | 134.2 | 5.97 |
| | Apple Silicon, Intel GPU, WASM, Adreno, Mali | FP16 | 12.55 | 15.38 | 87.37 | 5.85 |

A blank Method cell continues the engine from the row above; the Hardware Support column lists all hardware backends supported by that engine.
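
To make the engine comparisons concrete, the sketch below shows a throughput measurement with one of the engines (vLLM) on an AWQ 4-bit checkpoint. The checkpoint id, prompt, and token budget are placeholders rather than the benchmark's exact configuration; the other engines expose analogous Python APIs or HTTP endpoints.

```python
# Hypothetical sketch: decode throughput with the vLLM engine on an AWQ 4-bit model.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")  # placeholder AWQ checkpoint
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(["The quick brown fox"], params)
elapsed = time.perf_counter() - start

# Count generated tokens across all requests and report tokens/s.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.2f} tokens/s")
```
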
<!-- **Analysis:** - Each method exhibits varying performance metrics across different hardware and quantization types. - Llama.cpp with GGUF K-Quant 2bit shows competitive tokens/s and perplexity. - ExLlamav2 with EXL2 2bit achieves high tokens/s but at the expense of increased memory consumption. - TensorRT-LLM with INT8 quantization stands out for high tokens/s and relatively lower memory usage. - TGI on AMD GPU demonstrates lower tokens/s compared to other hardware setups. - MLC-LLM showcases promising results across diverse hardware, especially with OmniQuant 4bit quantization. -->

<!-- ## Citation
If you find our project helpful, please feel free to leave a star and cite our paper:
```BibTeX
@misc{chavan2024faster,
  title={Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward},
  author={Arnav Chavan and Raghav Magazine and Shubham Kushwaha and Deepak Gupta and Merouane Debbah},
  year={2024}
}
```
-->

Acknowledgements

We extend our gratitude to the following repositories and sources for providing essential methods, engines, and datasets utilized in our benchmarking project:

  1. Llama-2-7b-hf - Hugging Face model repository for Llama-2-7b.
  2. llama.cpp - Source for the llama.cpp engine used in our benchmarks.
  3. exllama - Repository for the ExLlama engine.
  4. exllamav2 - Source for the ExLlamaV2 engine.
  5. alpaca-cleaned - Cleaned Alpaca dataset on Hugging Face, utilized in our benchmarks.
  6. squeezellm - Repository for the SqueezeLLM quantization method.
  7. squeezellmgradients - Repository for SqueezeLLM-gradients.
  8. omniquant - Source for the OmniQuant quantization method.
  9. mlcllm - Repository for the MLC-LLM engine.
  10. llmpruner - Source for the LLM-Pruner pruning method.
  11. tensorrtllm - Source for the TensorRT-LLM engine (branch: release/0.5.0).
  12. autogptq - Repository for AutoGPTQ, a quantization package based on the GPTQ algorithm.
  13. autoawq - Repository for AutoAWQ, implementing the AWQ algorithm for 4-bit quantization.
  14. vllm - Source for the vLLM inference and serving engine.

These resources have been instrumental in conducting the benchmarks and evaluations. We appreciate the creators and maintainers of these repositories for their valuable contributions.