# llm-inference-benchmark

LLM inference benchmark.
## Inference frameworks
Framework | Producibility | Docker Image | API Server | OpenAI API Server | WebUI | Multi Models** | Multi-node | Backends | Embedding Model |
---|---|---|---|---|---|---|---|---|---|
text-generation-webui | Low | Yes | Yes | Yes | Yes | No | No | Transformers/llama.cpp/ExLlama/ExLlamaV2/AutoGPTQ/AutoAWQ/GPTQ-for-LLaMa/CTransformers | No |
OpenLLM | High | Yes | Yes | Yes | No | With BentoML | With BentoML | Transformers(int8,int4,gptq), vLLM(awq/squeezellm), TensorRT | No |
vLLM* | High | Yes | Yes | Yes | No | No | Yes(With Ray) | vLLM | No |
Xinference | High | Yes | Yes | Yes | Yes | Yes | Yes | Transformers/vLLM/TensorRT/GGML | Yes |
TGI*** | Medium | Yes | Yes | No | No | No | No | Transformers/AutoGPTQ/AWQ/EETQ/vLLM/ExLlama/ExLlamaV2 | No |
ScaleLLM | Medium | Yes | Yes | Yes | Yes | No | No | Transformers/AutoGPTQ/AWQ/vLLM/ExLlama/ExLlamaV2 | No |
FastChat | High | Yes | Yes | Yes | Yes | Yes | Yes | Transformers/AutoGPTQ/AWQ/vLLM/ExLlama/ExLlamaV2 | Yes |
- *vLLM and TGI can also serve as backends.
- **Multi Models: Capable of loading multiple models simultaneously.
- ***TGI does not support chat mode; manual parsing of the prompt is required.
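Most of the frameworks above that expose an OpenAI-compatible API server can be queried with the stock `openai` Python client. A minimal sketch, assuming a server is already running at `http://localhost:8000/v1` and serving `01-ai/Yi-6B-Chat` (the endpoint, port, and model name are placeholders, not part of this benchmark):

```python
# Minimal client for an OpenAI-compatible server (vLLM, FastChat, Xinference, OpenLLM, ...).
# The base_url and model name are assumptions -- adjust them to your own deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # local servers ignore the key

stream = client.chat.completions.create(
    model="01-ai/Yi-6B-Chat",  # must match the model name the server was launched with
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    max_tokens=200,
    stream=True,               # streaming exposes first-token latency
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```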
## Inference backends
Backend | Device | Compatibility** | PEFT Adapters* | Quantisation | Batching | Distributed Inference | Streaming |
---|---|---|---|---|---|---|---|
Transformers | GPU | High | Yes | bitsandbytes(int8/int4), AutoGPTQ(gptq), AutoAWQ(awq) | Yes | accelerate | Yes |
vLLM | GPU | High | No | awq/squeezellm | Yes | Yes | Yes |
ExLlamaV2 | GPU/CPU | Low | No | GPTQ | Yes | Yes | Yes |
TensorRT | GPU | Medium | No | some models | Yes | Yes | Yes |
Candle | GPU/CPU | Low | No | No | Yes | Yes | Yes |
CTranslate2 | GPU | Low | No | Yes | Yes | Yes | Yes |
TGI | GPU | Medium | Yes | awq/eetq/gptq/bitsandbytes | Yes | Yes | Yes |
llama-cpp*** | GPU/CPU | High | No | GGUF/GPTQ | Yes | No | Yes |
lmdeploy | GPU | Medium | No | AWQ | Yes | Yes | Yes |
Deepspeed-FastGen | GPU | Low | No | No | Yes | Yes | Yes |
- *PEFT Adapters: supports loading separate PEFT adapters (mostly LoRA); see the sketch after these notes.
- **Compatibility: High: Compatible with most models; Medium: Compatible with some models; Low: Compatible with few models.
- ***llama.cpp's Python binding: llama-cpp-python.
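As a concrete example of the Quantisation and PEFT Adapters columns, here is a minimal sketch of the Transformers backend loading the benchmark model with bitsandbytes int8 quantisation and attaching a separate LoRA adapter via `peft`; the adapter repository name is a hypothetical placeholder:

```python
# Sketch: Transformers backend with bitsandbytes int8 quantisation plus a separate LoRA adapter.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_id = "01-ai/Yi-6B-Chat"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # bitsandbytes int8
    device_map="auto",
)
# Attach a LoRA adapter on top of the quantised base model.
model = PeftModel.from_pretrained(model, "your-org/yi-6b-lora")  # hypothetical adapter repo

inputs = tokenizer("Hello, who are you?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```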
## Benchmark
Hardware:
- GPU: 1x NVIDIA RTX4090 24GB
- CPU: Intel Core i9-13900K
- Memory: 96GB
Software:
- VM: WSL2 on Windows 11
- Guest OS: Ubuntu 22.04
- NVIDIA Driver Version: 536.67
- CUDA Version: 12.2
- PyTorch: 2.1.1
Model:
- BFloat16: 01-ai/Yi-6B-Chat
- GPTQ 8bit: 01-ai/Yi-6B-Chat-8bits
- AWQ 4bit: 01-ai/Yi-6B-Chat-4bits
- GGUF 8bit/4bit: TheBloke/Yi-6B-GGUF
Data:
- Prompt Length: 512 (padded with random characters to avoid prompt caching; see the sketch below).
- Max Tokens: 200.
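A minimal sketch of how such prompts can be made cache-unfriendly: prepend a short run of random characters so every request has a unique prefix and cannot be served from a prompt/prefix cache (the exact prompt text used in the benchmark is not reproduced here):

```python
# Sketch: make every benchmark prompt unique so prefix/prompt caches cannot reuse earlier requests.
import random
import string

def make_prompt(base_prompt: str, noise_len: int = 32) -> str:
    # A random alphanumeric prefix guarantees a unique prompt per request.
    noise = "".join(random.choices(string.ascii_letters + string.digits, k=noise_len))
    return f"[{noise}] {base_prompt}"

print(make_prompt("Summarise the history of neural networks in 200 words."))
```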
### Backend Benchmark
#### No Quantisation
Backend | TPS@bs=4 | QPS@bs=4 | TPS@bs=1 | QPS@bs=1 | FTL@bs=1 |
---|---|---|---|---|---|
text-generation-webui Transformer | 40.39 | 0.15 | 41.47 | 0.21 | 344.61 |
text-generation-webui Transformer with flash-attention-2 | 58.30 | 0.21 | 43.52 | 0.21 | 341.39 |
text-generation-webui ExllamaV2 | 69.09 | 0.26 | 50.71 | 0.27 | 564.80 |
OpenLLM PyTorch | 60.79 | 0.22 | 44.73 | 0.21 | 514.55 |
TGI | 192.58 | 0.90 | 59.68 | 0.28 | 82.72 |
vLLM | 222.63 | 1.08 | 62.69 | 0.30 | 95.43 |
TensorRT | - | - | - | - | - |
CTranslate2* | - | - | - | - | - |
lmdeploy | 236.03 | 1.15 | 67.86 | 0.33 | 76.81 |
- bs: Batch Size. `bs=4` indicates the batch size is 4.
- TPS: Tokens Per Second.
- QPS: Queries Per Second.
- FTL: First Token Latency, measured in milliseconds. Applicable only in stream mode (see the measurement sketch below).
- *Encountered an error using CTranslate2 to convert Yi-6B-Chat. See details in the issue.
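A minimal sketch of how these metrics can be computed for a single streamed request against an OpenAI-compatible endpoint (the endpoint and model name are placeholders, and counting streamed chunks as tokens is an approximation; the actual harness may differ):

```python
# Sketch: measure FTL (ms), TPS and QPS for one streamed request.
# base_url/model are placeholders; one streamed chunk is counted as one token.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def run_once(prompt: str, max_tokens: int = 200) -> dict:
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    stream = client.chat.completions.create(
        model="01-ai/Yi-6B-Chat",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            n_tokens += 1
    total = time.perf_counter() - start
    return {
        "FTL_ms": (first_token_at - start) * 1000.0,
        "TPS": n_tokens / total,
        "QPS": 1.0 / total,  # for bs>1, divide completed requests by total wall time
    }

print(run_once("Hello, tell me a short story."))
```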
#### 8Bit Quantisation
Backend | TPS@bs=4 | QPS@bs=4 | TPS@bs=1 | QPS@bs=1 | FTL@bs=1 |
---|---|---|---|---|---|
TGI eetq 8bit | 293.08 | 1.41 | 88.08 | 0.42 | 63.69 |
TGI GPTQ 8bit | - | - | - | - | - |
OpenLLM PyTorch AutoGPTQ 8bit | 49.8 | 0.17 | 29.54 | 0.14 | 930.16 |
- bitsandbytes int8 is very slow (about 6.8 tokens/s), so we do not benchmark it.
- eetq 8bit does not require a specially quantised model; it quantises the original weights at load time.
- TGI GPTQ 8bit failed to load: `Server error: module 'triton.compiler' has no attribute 'OutOfResources'`.
- TGI GPTQ uses either the exllama or the triton kernel backend.
#### 4Bit Quantisation
Backend | TPS@bs=4 | QPS@bs=4 | TPS@bs=1 | QPS@bs=1 | FTL@bs=1 |
---|---|---|---|---|---|
TGI AWQ 4bit | 336.47 | 1.61 | 102.00 | 0.48 | 94.84 |
vLLM AWQ 4bit | 29.03 | 0.14 | 37.48 | 0.19 | 3711.0 |
text-generation-webui llama-cpp GGUF 4bit | 67.63 | 0.37 | 56.65 | 0.34 | 331.57 |
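For the GGUF 4bit row, a minimal llama-cpp-python sketch that loads a 4-bit file from TheBloke/Yi-6B-GGUF with all layers offloaded to the GPU (the local filename is an assumption; use whichever Q4 variant you downloaded):

```python
# Sketch: run a 4-bit GGUF model with llama-cpp-python, fully offloaded to the GPU.
from llama_cpp import Llama

llm = Llama(
    model_path="./yi-6b.Q4_K_M.gguf",  # hypothetical local path to a 4-bit GGUF file
    n_gpu_layers=-1,                   # offload all layers to the GPU
    n_ctx=2048,
)
out = llm("Q: What is the capital of France?\nA:", max_tokens=200)
print(out["choices"][0]["text"])
```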