# LLM Inference benchmark

## Inference frameworks

| Framework | Producibility**** | Docker Image | API Server | OpenAI API Server | WebUI | Multi Models** | Multi-node | Backends | Embedding Model |
|---|---|---|---|---|---|---|---|---|---|
| text-generation-webui | Low | Yes | Yes | Yes | Yes | No | No | Transformers/llama.cpp/ExLlama/ExLlamaV2/AutoGPTQ/AutoAWQ/GPTQ-for-LLaMa/CTransformers | No |
| OpenLLM | High | Yes | Yes | Yes | No | With BentoML | With BentoML | Transformers(int8/int4/gptq), vLLM(awq/squeezellm), TensorRT | No |
| vLLM* | High | Yes | Yes | Yes | No | No | Yes (with Ray) | vLLM | No |
| Xinference | High | Yes | Yes | Yes | Yes | Yes | Yes | Transformers/vLLM/TensorRT/GGML | Yes |
| TGI*** | Medium | Yes | Yes | No | No | No | No | Transformers/AutoGPTQ/AWQ/EETP/vLLM/ExLlama/ExLlamaV2 | No |
| ScaleLLM | Medium | Yes | Yes | Yes | Yes | No | No | Transformers/AutoGPTQ/AWQ/vLLM/ExLlama/ExLlamaV2 | No |
| FastChat | High | Yes | Yes | Yes | Yes | Yes | Yes | Transformers/AutoGPTQ/AWQ/vLLM/ExLlama/ExLlamaV2 | Yes |
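Most frameworks above that tick the "OpenAI API Server" column accept the same request shape, so one client works across them. A minimal stdlib-only sketch; the base URL, port, and model name are placeholders, not values from this benchmark:

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    # Assemble an OpenAI-style POST to /v1/chat/completions.
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# Actually sending it requires one of the servers above running locally
# (e.g. vLLM's or FastChat's OpenAI-compatible endpoint):
# req = build_chat_request("http://localhost:8000", "my-model", "Hello")
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Swapping frameworks then only changes the port and model name, which is what makes the column useful for benchmarking.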

## Inference backends

| Backend | Device | Compatibility** | PEFT Adapters* | Quantisation | Batching | Distributed Inference | Streaming |
|---|---|---|---|---|---|---|---|
| Transformers | GPU | High | Yes | bitsandbytes(int8/int4), AutoGPTQ(gptq), AutoAWQ(awq) | Yes | accelerate | Yes |
| vLLM | GPU | High | No | awq/squeezellm | Yes | Yes | Yes |
| ExLlamaV2 | GPU/CPU | Low | No | GPTQ | Yes | Yes | Yes |
| TensorRT | GPU | Medium | No | some models | Yes | Yes | Yes |
| Candle | GPU/CPU | Low | No | No | Yes | Yes | Yes |
| CTranslate2 | GPU | Low | No | Yes | Yes | Yes | Yes |
| TGI | GPU | Medium | Yes | awq/eetq/gptq/bitsandbytes | Yes | Yes | Yes |
| llama-cpp*** | GPU/CPU | High | No | GGUF/GPTQ | Yes | No | Yes |
| lmdeploy | GPU | Medium | No | AWQ | Yes | Yes | Yes |
| Deepspeed-FastGen | GPU | Low | No | No | Yes | Yes | Yes |
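The Streaming column matters for interactive use: OpenAI-compatible servers emit tokens as server-sent events, one JSON chunk per `data:` line. A sketch of consuming such a stream, assuming the usual OpenAI `delta` chunk convention (a server that deviates from it would need different field names):

```python
import json

def parse_sse_chunks(lines):
    """Yield text deltas from OpenAI-style server-sent-event lines."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank lines
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]

# Simulated stream as it would arrive over HTTP:
stream = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(parse_sse_chunks(stream)))  # prints: Hello
```

The time until the first chunk arrives is exactly the FTL metric reported in the benchmark tables below.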

## Benchmark

Hardware:

Software:

Model:

Data:

### Backend Benchmark

Metrics: TPS (output tokens per second), QPS (queries per second), and FTL (first-token latency, ms); `@4` and `@1` denote the number of concurrent requests.

#### No Quantisation

| Backend | TPS@4 | QPS@4 | TPS@1 | QPS@1 | FTL@1 |
|---|---|---|---|---|---|
| text-generation-webui Transformers | 40.39 | 0.15 | 41.47 | 0.21 | 344.61 |
| text-generation-webui Transformers with flash-attention-2 | 58.30 | 0.21 | 43.52 | 0.21 | 341.39 |
| text-generation-webui ExLlamaV2 | 69.09 | 0.26 | 50.71 | 0.27 | 564.80 |
| OpenLLM PyTorch | 60.79 | 0.22 | 44.73 | 0.21 | 514.55 |
| TGI | 192.58 | 0.90 | 59.68 | 0.28 | 82.72 |
| vLLM | 222.63 | 1.08 | 62.69 | 0.30 | 95.43 |
| TensorRT | - | - | - | - | - |
| CTranslate2* | - | - | - | - | - |
| lmdeploy | 236.03 | 1.15 | 67.86 | 0.33 | 76.81 |
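The figures in these tables can be derived from raw per-request timings. A minimal sketch of the metric definitions as used here; the field and function names are illustrative, not taken from the benchmark harness:

```python
from dataclasses import dataclass

@dataclass
class RequestTiming:
    start: float        # request sent (seconds)
    first_token: float  # first token received (seconds)
    end: float          # last token received (seconds)
    tokens: int         # output tokens generated

def summarize(timings):
    """Compute TPS, QPS, and mean first-token latency (ms) over one run."""
    wall = max(t.end for t in timings) - min(t.start for t in timings)
    tps = sum(t.tokens for t in timings) / wall
    qps = len(timings) / wall
    ftl_ms = 1000 * sum(t.first_token - t.start for t in timings) / len(timings)
    return tps, qps, ftl_ms

# Example: two concurrent requests finishing within a 10-second window.
runs = [RequestTiming(0.0, 0.5, 8.0, 300), RequestTiming(0.0, 0.6, 10.0, 400)]
tps, qps, ftl = summarize(runs)
# tps = 700 / 10 = 70.0, qps = 0.2, ftl ≈ 550 ms
```

Note that TPS aggregates all concurrent streams, which is why TPS@4 exceeds TPS@1 for backends with effective batching.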

#### 8Bit Quantisation

| Backend | TPS@4 | QPS@4 | TPS@1 | QPS@1 | FTL@1 |
|---|---|---|---|---|---|
| TGI eetq 8bit | 293.08 | 1.41 | 88.08 | 0.42 | 63.69 |
| TGI GPTQ 8bit | - | - | - | - | - |
| OpenLLM PyTorch AutoGPTQ 8bit | 49.8 | 0.17 | 29.54 | 0.14 | 930.16 |

#### 4Bit Quantisation

| Backend | TPS@4 | QPS@4 | TPS@1 | QPS@1 | FTL@1 |
|---|---|---|---|---|---|
| TGI AWQ 4bit | 336.47 | 1.61 | 102.00 | 0.48 | 94.84 |
| vLLM AWQ 4bit | 29.03 | 0.14 | 37.48 | 0.19 | 3711.0 |
| text-generation-webui llama-cpp GGUF 4bit | 67.63 | 0.37 | 56.65 | 0.34 | 331.57 |