# llm-inference-benchmark

LLM inference benchmark.
## Inference frameworks
Framework | Producibility | Docker Image | API Server | OpenAI API Server | WebUI | Multi Models** | Multi-node | Backends | Embedding Model |
---|---|---|---|---|---|---|---|---|---|
text-generation-webui | Low | Yes | Yes | Yes | Yes | No | No | Transformers/llama.cpp/ExLlama/ExLlamaV2/AutoGPTQ/AutoAWQ/GPTQ-for-LLaMa/CTransformers | No |
OpenLLM | High | Yes | Yes | Yes | No | With BentoML | With BentoML | Transformers(int8,int4,gptq), vLLM(awq/squeezellm), TensorRT | No |
vLLM* | High | Yes | Yes | Yes | No | No | Yes(With Ray) | vLLM | No |
Xinference | High | Yes | Yes | Yes | Yes | Yes | Yes | Transformers/vLLM/TensorRT/GGML | Yes |
TGI*** | Medium | Yes | Yes | No | No | No | No | Transformers/AutoGPTQ/AWQ/EETQ/vLLM/ExLlama/ExLlamaV2 | No |
ScaleLLM | Medium | Yes | Yes | Yes | Yes | No | No | Transformers/AutoGPTQ/AWQ/vLLM/ExLlama/ExLlamaV2 | No |
FastChat | High | Yes | Yes | Yes | Yes | Yes | Yes | Transformers/AutoGPTQ/AWQ/vLLM/ExLlama/ExLlamaV2 | Yes |
- *vLLM and TGI can also serve as backends.
- **Multi Models: Capable of loading multiple models simultaneously.
- ***TGI does not support chat mode; manual parsing of the prompt is required.
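Most of the frameworks above that expose an OpenAI-compatible API server can be queried with the stock `openai` Python client. A minimal sketch, assuming a server is already running at `http://localhost:8000/v1` and serving `01-ai/Yi-6B-Chat` (the endpoint, port, and model name are placeholders, not part of this benchmark):

```python
# Minimal client for an OpenAI-compatible server (vLLM, FastChat, Xinference, OpenLLM, ...).
# The base_url and model name are assumptions -- adjust them to your own deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # local servers ignore the key

stream = client.chat.completions.create(
    model="01-ai/Yi-6B-Chat",  # must match the model name the server was launched with
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    max_tokens=200,
    stream=True,               # streaming exposes first-token latency
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```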
## Inference backends
Backend | Device | Compatibility** | PEFT Adapters* | Quantisation | Batching | Distributed Inference | Streaming |
---|---|---|---|---|---|---|---|
Transformers | GPU | High | Yes | bitsandbytes(int8/int4), AutoGPTQ(gptq), AutoAWQ(awq) | Yes | accelerate | Yes |
vLLM | GPU | High | No | awq/squeezellm | Yes | Yes | Yes |
ExLlamaV2 | GPU/CPU | Low | No | GPTQ | Yes | Yes | Yes |
TensorRT | GPU | Medium | No | some models | Yes | Yes | Yes |
Candle | GPU/CPU | Low | No | No | Yes | Yes | Yes |
CTranslate2 | GPU | Low | No | Yes | Yes | Yes | Yes |
TGI | GPU | Medium | Yes | awq/eetq/gptq/bitsandbytes | Yes | Yes | Yes |
llama-cpp*** | GPU/CPU | High | No | GGUF/GPTQ | Yes | No | Yes |
lmdeploy | GPU | Medium | No | AWQ | Yes | Yes | Yes |
Deepspeed-FastGen | GPU | Low | No | No | Yes | Yes | Yes |
- *PEFT Adapters: supports loading separate PEFT adapters (mostly LoRA); see the sketch after these notes.
- **Compatibility: High: Compatible with most models; Medium: Compatible with some models; Low: Compatible with few models.
- ***llama.cpp's Python binding: llama-cpp-python.
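As a concrete example of the Quantisation and PEFT Adapters columns, here is a minimal sketch of the Transformers backend loading the benchmark model with bitsandbytes int8 quantisation and attaching a separate LoRA adapter via `peft`; the adapter repository name is a hypothetical placeholder:

```python
# Sketch: Transformers backend with bitsandbytes int8 quantisation plus a separate LoRA adapter.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_id = "01-ai/Yi-6B-Chat"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # bitsandbytes int8
    device_map="auto",
)
# Attach a LoRA adapter on top of the quantised base model.
model = PeftModel.from_pretrained(model, "your-org/yi-6b-lora")  # hypothetical adapter repo

inputs = tokenizer("Hello, who are you?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```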
## Benchmark
Hardware:
- GPU: 1x NVIDIA RTX4090 24GB
- CPU: Intel Core i9-13900K
- Memory: 96GB
Software:
- VM: WSL2 on Windows 11
- Guest OS: Ubuntu 22.04
- NVIDIA Driver Version: 536.67
- CUDA Version: 12.2
- PyTorch: 2.1.1
Model:
- BFloat16: 01-ai/Yi-6B-Chat
- GPTQ 8bit: 01-ai/Yi-6B-Chat-8bits
- AWQ 4bit: 01-ai/Yi-6B-Chat-4bits
- GGUF 8bit/4bit: TheBloke/Yi-6B-GGUF
Data:
- Prompt Length: 512 (padded with random characters to avoid prompt caching; see the sketch below).
- Max Tokens: 200.
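A minimal sketch of how such prompts can be made cache-unfriendly: prepend a short run of random characters so every request has a unique prefix and cannot be served from a prompt/prefix cache (the exact prompt text used in the benchmark is not reproduced here):

```python
# Sketch: make every benchmark prompt unique so prefix/prompt caches cannot reuse earlier requests.
import random
import string

def make_prompt(base_prompt: str, noise_len: int = 32) -> str:
    # A random alphanumeric prefix guarantees a unique prompt per request.
    noise = "".join(random.choices(string.ascii_letters + string.digits, k=noise_len))
    return f"[{noise}] {base_prompt}"

print(make_prompt("Summarise the history of neural networks in 200 words."))
```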
### Backend Benchmark
#### No Quantisation
Backend | TPS@bs=4 | QPS@bs=4 | TPS@bs=1 | QPS@bs=1 | FTL@bs=1 |
---|---|---|---|---|---|
text-generation-webui Transformer | 40.39 | 0.15 | 41.47 | 0.21 | 344.61 |
text-generation-webui Transformer with flash-attention-2 | 58.30 | 0.21 | 43.52 | 0.21 | 341.39 |
text-generation-webui ExllamaV2 | 69.09 | 0.26 | 50.71 | 0.27 | 564.80 |
OpenLLM PyTorch | 60.79 | 0.22 | 44.73 | 0.21 | 514.55 |
TGI | 192.58 | 0.90 | 59.68 | 0.28 | 82.72 |
vLLM | 222.63 | 1.08 | 62.69 | 0.30 | 95.43 |
TensorRT | - | - | - | - | - |
CTranslate2* | - | - | - | - | - |
lmdeploy | 236.03 | 1.15 | 67.86 | 0.33 | 76.81 |
- bs: Batch Size. `bs=4` indicates the batch size is 4.
- TPS: Tokens Per Second.
- QPS: Queries Per Second.
- FTL: First Token Latency, measured in milliseconds. Applicable only in stream mode (see the measurement sketch below).
- *Encountered an error using CTranslate2 to convert Yi-6B-Chat. See details in the issue.
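A minimal sketch of how these metrics can be computed for a single streamed request against an OpenAI-compatible endpoint (the endpoint and model name are placeholders, and counting streamed chunks as tokens is an approximation; the actual harness may differ):

```python
# Sketch: measure FTL (ms), TPS and QPS for one streamed request.
# base_url/model are placeholders; one streamed chunk is counted as one token.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def run_once(prompt: str, max_tokens: int = 200) -> dict:
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    stream = client.chat.completions.create(
        model="01-ai/Yi-6B-Chat",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            n_tokens += 1
    total = time.perf_counter() - start
    return {
        "FTL_ms": (first_token_at - start) * 1000.0,
        "TPS": n_tokens / total,
        "QPS": 1.0 / total,  # for bs>1, divide completed requests by total wall time
    }

print(run_once("Hello, tell me a short story."))
```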
#### 8Bit Quantisation
Backend | TPS@bs=4 | QPS@bs=4 | TPS@bs=1 | QPS@bs=1 | FTL@bs=1 |
---|---|---|---|---|---|
TGI eetq 8bit | 293.08 | 1.41 | 88.08 | 0.42 | 63.69 |
TGI GPTQ 8bit | - | - | - | - | - |
OpenLLM PyTorch AutoGPTQ 8bit | 49.8 | 0.17 | 29.54 | 0.14 | 930.16 |
- bitsandbytes int8 is very slow (about 6.8 tokens/s), so we do not benchmark it.
- eetq 8bit does not require a specially quantised model; it quantises the original weights at load time.
- TGI GPTQ 8bit failed to load: `Server error: module 'triton.compiler' has no attribute 'OutOfResources'`.
- TGI GPTQ uses either the exllama or the triton kernel backend.
#### 4Bit Quantisation
Backend | TPS@bs=4 | QPS@bs=4 | TPS@bs=1 | QPS@bs=1 | FTL@bs=1 |
---|---|---|---|---|---|
TGI AWQ 4bit | 336.47 | 1.61 | 102.00 | 0.48 | 94.84 |
vLLM AWQ 4bit | 29.03 | 0.14 | 37.48 | 0.19 | 3711.0 |
text-generation-webui llama-cpp GGUF 4bit | 67.63 | 0.37 | 56.65 | 0.34 | 331.57 |
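For the GGUF 4bit row, a minimal llama-cpp-python sketch that loads a 4-bit file from TheBloke/Yi-6B-GGUF with all layers offloaded to the GPU (the local filename is an assumption; use whichever Q4 variant you downloaded):

```python
# Sketch: run a 4-bit GGUF model with llama-cpp-python, fully offloaded to the GPU.
from llama_cpp import Llama

llm = Llama(
    model_path="./yi-6b.Q4_K_M.gguf",  # hypothetical local path to a 4-bit GGUF file
    n_gpu_layers=-1,                   # offload all layers to the GPU
    n_ctx=2048,
)
out = llm("Q: What is the capital of France?\nA:", max_tokens=200)
print(out["choices"][0]["text"])
```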