# Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward
This repository contains the open-source code and benchmark results for the paper *Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward*. The benchmark assesses the performance of various compression and inference methods.

**Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward**<br>
Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Deepak Gupta, Merouane Debbah<br>
Nyun AI, Transmute AI Lab, KU 6G Research Center
## Getting Started
All experiments are performed in isolated Python 3.10 environments with method-specific requirements such as package and library versions. The exact repository and branch details can be inferred from `.gitmodules`.
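The per-method repository and branch pinned in `.gitmodules` can also be listed programmatically. Below is a minimal, illustrative Python sketch (not one of the repository's scripts) that assumes the standard `.gitmodules` key layout (`path`, `url`, `branch`):

```python
import re
from pathlib import Path

# Illustrative sketch: list every submodule declared in .gitmodules together
# with its remote URL and branch, to see which repository/branch each method
# was evaluated against. Entries depend on the checked-out branch.
text = Path(".gitmodules").read_text()
for name, body in re.findall(r'\[submodule "([^"]+)"\]([^\[]*)', text):
    url = re.search(r"url\s*=\s*(\S+)", body)
    branch = re.search(r"branch\s*=\s*(\S+)", body)
    print(f"{name}: {url.group(1) if url else '?'}"
          f" (branch: {branch.group(1) if branch else 'default'})")
```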
## Repository Organization
The repository follows a structured format with a branch naming convention of `A100<method>`, where `<method>` denotes the specific evaluation method. The organization within each branch is outlined as follows:
- `engine/`: This directory contains the implementation of engine methods along with setup, generation, and bench scripts.
- `prune/`: This directory contains the implementation of pruning methods along with associated setup, generation, and bench scripts.
- `quant/`: This directory houses the implementation of quantization methods, complete with setup, generation, and bench scripts.
- `exports/`: A shared export folder structured as `exports/MODEL_NAME/METHOD_TYPE/METHOD_NAME/`.
- `experiments/`: This section holds formatted benchmarking results in notebooks, featuring metrics such as RM (run memory), WM (weight memory), and various GPU utilization graphs (a sketch of how generation speed and run memory can be measured follows this list).
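The exact benchmarking procedure is defined by the per-branch scripts and notebooks; purely as an illustration, a minimal Python sketch of how Tokens/s and RM (peak run memory) could be measured for a baseline FP16 Hugging Face model is shown below. The model id, prompt, and generation settings are placeholders, not necessarily those used in the reported runs.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: rough measurement of Tokens/s and RM (peak run memory)
# for a baseline FP16 model; the per-branch bench scripts define the actual protocol.
model_id = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda:0"
)

inputs = tok("The quick brown fox", return_tensors="pt").to("cuda:0")
torch.cuda.reset_peak_memory_stats()

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"Tokens/s: {new_tokens / elapsed:.2f}")
print(f"RM (GB): {torch.cuda.max_memory_allocated() / 1e9:.2f}")  # peak run memory
```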
Please note that all setup, generation, and benchmarking scripts (`.sh`) are kept as up-to-date as possible with the latest runs and are tailored to Python 3.10 with CUDA 12.1 (or the version a given method requires). Depending on your environment, the scripts may need adjustments or an alternative script may have to be used.
## Branch Overview
The repository contains a dedicated branch for each evaluated method:

- `A100Exllama`: Investigates the ExLlama engine with GPTQ quantization.
- `A100Exllamav2`: Explores the latest ExLlamaV2, featuring EXL2 & GPTQ quantizations.
- `A100Llamacpp`: Examines the C++ implementation of the Llama architecture (llama.cpp) for enhanced speed.
- `A100MLCLLM`: Explores MLC-LLM, offering extensive hardware and platform support.
- `A100TGI`: Investigates the Text Generation Inference toolkit, used for LLM inference in production.
- `A100VLLM`: Explores vLLM.
- `A100TensorRTLLM`: Investigates NVIDIA's TensorRT-LLM inference engine.
- `A100GPTQ`: Explores the GPTQ quantization method through AutoGPTQ (see the sketch after the note below).
- `A100HF`: Investigates multiple quantization methods, alongside baseline generation speeds for each method.
- `A100Omniquant`: Explores the OmniQuant quantization method.
Note: Each branch has its own set of updated scripts, which may or may not be synchronized with other branches. Some quantization methods do not have dedicated branches; the corresponding scripts can be referenced directly in the related branches or from the main branch. Models and scales obtained directly from the HF Hub were also used as needed.
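As an illustration of the `A100GPTQ` workflow, the following is a minimal sketch of loading a GPTQ-quantized checkpoint with AutoGPTQ and generating text. The export path is hypothetical (following the `exports/MODEL_NAME/METHOD_TYPE/METHOD_NAME/` convention described above); the scripts in the branch remain the reference.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Illustrative only: the export path below is hypothetical and follows the
# exports/MODEL_NAME/METHOD_TYPE/METHOD_NAME/ convention described above.
ckpt = "exports/Llama-2-7b-hf/quant/gptq-4bit"

tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoGPTQForCausalLM.from_quantized(ckpt, device="cuda:0")

inputs = tok("Large language models are", return_tensors="pt").to("cuda:0")
print(tok.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```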
## Results Overview
### Pruning
| Method | Sparsity | RM (GB) | WM (GB) | Tokens/s | Perplexity |
|---|---|---|---|---|---|
| Baseline | - | 26.16 | 12.55 | 30.90 | 12.62 |
| Wanda-SP | 20% | - | - | - | 22.12 |
| Wanda-SP | 50% | - | - | - | 366.43 |
| LLM-Pruner | 20% | 10.38 | 10.09 | 32.57 | 19.77 |
| LLM-Pruner | 50% | 6.54 | 6.23 | 40.95 | 112.44 |
| LLM-Pruner\* | 20% | 10.38 | 10.09 | 32.57 | 17.37 |
| LLM-Pruner\* | 50% | 6.54 | 6.23 | 40.95 | 38.12 |
| FLaP | 20% | 9.72 | 9.44 | 33.90 | 14.62 |
| FLaP | 50% | 6.26 | 6.07 | 42.88 | 31.80 |

\* with fine-tuning
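The perplexity columns in these tables follow the usual definition of exponentiated average token-level negative log-likelihood. A minimal sketch is shown below, assuming a Hugging Face causal LM and tokenizer and non-overlapping context windows; the dataset and context length behind the reported numbers may differ.

```python
import torch

# Illustrative only: perplexity as exp(mean token-level NLL) over
# non-overlapping context windows.
@torch.no_grad()
def perplexity(model, tok, text, ctx_len=2048, device="cuda:0"):
    ids = tok(text, return_tensors="pt").input_ids.to(device)
    nlls, n_tokens = [], 0
    for i in range(0, ids.shape[1], ctx_len):
        chunk = ids[:, i : i + ctx_len]
        if chunk.shape[1] < 2:
            continue  # need at least one predicted token
        # Labels equal the inputs; the model shifts them internally.
        loss = model(chunk, labels=chunk).loss
        nlls.append(loss * (chunk.shape[1] - 1))
        n_tokens += chunk.shape[1] - 1
    return torch.exp(torch.stack(nlls).sum() / n_tokens).item()
```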
### Quantization
| Method | Inference Engine | WM (GB) | RM (GB) | Tokens/s | Perplexity |
|---|---|---|---|---|---|
| Baseline FP16 | PyTorch | 12.55 | 26.16 | 30.90 | 5.85 |
| GPTQ 2bit | PyTorch | 2.11 | 2.98 | 20.91 | NaN |
| GPTQ 3bit | PyTorch | 2.87 | 3.86 | 21.24 | 7.36 |
| GPTQ 4bit | PyTorch | 3.63 | 4.65 | 21.63 | 6.08 |
| GPTQ 8bit | PyTorch | 6.67 | 7.62 | 21.36 | 5.86 |
| AWQ 4bit GEMM | PyTorch | 3.68 | 4.64 | 28.51 | 6.02 |
| AWQ 4bit GEMV | PyTorch | 3.68 | 4.64 | 31.81 | 6.02 |
| QLoRA (NF4) | PyTorch | 3.56 | 4.84 | 19.70 | 6.02 |
| LLM.int8() | PyTorch | 6.58 | 7.71 | 5.24 | 5.89 |
| K-Quants 4bit | Llama.cpp | 3.80 | 7.38 | 104.45 | 5.96 |
| OmniQuant 3bit | MLC-LLM | 3.20 | 5.10 | 83.4 | 6.65 |
| OmniQuant 4bit | MLC-LLM | 3.80 | 5.70 | 134.2 | 5.97 |
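The QLoRA (NF4) and LLM.int8() rows correspond to bitsandbytes quantization applied through Hugging Face Transformers. Below is a minimal sketch of loading a model both ways, assuming `bitsandbytes` and `accelerate` are installed; the exact arguments used in the benchmark scripts may differ.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative only: NF4 (as in the QLoRA row) and LLM.int8() loading through
# Transformers. The scripts in the A100HF branch are authoritative.
model_id = "meta-llama/Llama-2-7b-hf"

nf4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model_nf4 = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=nf4, device_map="auto"
)

int8 = BitsAndBytesConfig(load_in_8bit=True)
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=int8, device_map="auto"
)
```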
### Engine Results
| Method | Hardware Support | Quantization Type | WM (GB) | RM (GB) | Tokens/s | Perplexity |
|---|---|---|---|---|---|---|
| Llama.cpp | NVIDIA GPU | GGUF K-Quant 2bit | 2.36 | 3.69 | 102.15 | 6.96 |
| | AMD GPU | GGUF 4bit | 3.56 | 4.88 | 128.97 | 5.96 |
| | Apple Silicon | GGUF AWQ 4bit | 3.56 | 4.88 | 129.25 | 5.91 |
| | CPU | GGUF K-Quant 4bit | 3.59 | 4.90 | 109.72 | 5.87 |
| | | GGUF 8bit | 6.67 | 7.78 | 93.39 | 5.79 |
| | | GGUF FP16 | 12.55 | 13.22 | 66.81 | 5.79 |
| ExLlama | NVIDIA GPU | GPTQ 4bit | 3.63 | 5.35 | 77.10 | 6.08 |
| | AMD GPU | | | | | |
| ExLlamav2 | NVIDIA GPU | EXL2 2bit | 2.01 | 5.21 | 153.75 | 20.21 |
| | AMD GPU | EXL2 4bit | 3.36 | 6.61 | 131.68 | 6.12 |
| | | GPTQ 4bit | 3.63 | 6.93 | 151.30 | 6.03 |
| | | EXL2 8bit | 6.37 | 9.47 | 115.81 | 5.76 |
| | | FP16 | 12.55 | 15.09 | 67.70 | 5.73 |
| vLLM | NVIDIA GPU | AWQ GEMM 4bit | 3.62 | 34.55 | 114.43 | 6.02 |
| | AMD GPU | GPTQ 4bit | 3.63 | 36.51 | 172.88 | 6.08 |
| | | FP16 | 12.55 | 35.92 | 79.74 | 5.85 |
| TensorRT-LLM | NVIDIA GPU | AWQ GEMM 4bit | 3.42 | 5.69 | 194.86 | 6.02 |
| | | GPTQ 4bit | 3.60 | 5.88 | 202.16 | 6.08 |
| | | INT8 | 6.53 | 8.55 | 143.57 | 5.89 |
| | | FP16 | 12.55 | 14.61 | 83.43 | 5.85 |
| TGI | NVIDIA GPU | AWQ GEMM 4bit | 3.62 | 36.67 | 106.84 | 6.02 |
| | AMD GPU | GPTQ 4bit | 3.69 | 37.85 | 163.22 | 6.08 |
| | Intel GPU | FP4 | 12.55 | 37.21 | 36.91 | 6.15 |
| | AWS Inferentia2 | NF4 | 12.55 | 37.21 | 36.32 | 6.02 |
| | | BF16 | 12.55 | 38.03 | 73.59 | 5.89 |
| | | FP16 | 12.55 | 38.03 | 74.19 | 5.85 |
| MLC-LLM | NVIDIA GPU | OmniQuant 3bit | 3.2 | 5.1 | 83.4 | 6.65 |
| | AMD GPU, CPU, WebGPU | OmniQuant 4bit | 3.8 | 5.7 | 134.2 | 5.97 |
| | Apple Silicon, Intel GPU, WASM, Adreno, Mali | FP16 | 12.55 | 15.38 | 87.37 | 5.85 |
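As an example of running one of the engines above, here is a minimal vLLM sketch corresponding to the AWQ GEMM 4bit row; the checkpoint path is hypothetical and would normally point into the shared `exports/` folder.

```python
from vllm import LLM, SamplingParams

# Illustrative only: the AWQ checkpoint path is hypothetical and would
# normally point into the shared exports/ folder.
llm = LLM(model="exports/Llama-2-7b-hf/quant/awq-gemm-4bit", quantization="awq")
params = SamplingParams(temperature=0.0, max_tokens=128)
for out in llm.generate(["Large language models are"], params):
    print(out.outputs[0].text)
```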
## Acknowledgements
We extend our gratitude to the following repositories and sources for providing essential methods, engines, and datasets utilized in our benchmarking project:
- Llama-2-7b-hf - Hugging Face model repository for Llama-2-7b.
- llama.cpp - Source for llama.cpp, a key engine method used in our benchmarks.
- exllama - Repository for the ExLlama engine method.
- exllamav2 - Source for the ExLlamaV2 engine method.
- alpaca-cleaned - Alpaca dataset on Hugging Face, utilized in our benchmarks.
- squeezellm - Repository for SqueezeLLM quantization method.
- squeezellmgradients - Repository for SqueezeLLM-gradients.
- omniquant - Source for OmniQuant quantization method.
- mlcllm - Repository for the MLC-LLM engine method.
- llmpruner - Source for LLM-Pruner pruning method.
- tensorrtllm - Source for TensorRT-LLM engine method (branch: release/0.5.0).
- autogptq - Repository for AutoGPTQ, offering a quantization package based on the GPTQ algorithm.
- autoawq - Repository for AutoAWQ, implementing the AWQ algorithm for 4-bit quantization.
- vllm - Source for the vLLM package, offering the inference and serving engine.
These resources have been instrumental in conducting the benchmarks and evaluations. We appreciate the creators and maintainers of these repositories for their valuable contributions.