LLMPerf Leaderboard :trophy:

Using LLMPerf, we have benchmarked a selection of LLM inference providers. Our analysis evaluates their performance, reliability, and efficiency under the following key metrics:

- Output tokens throughput (tokens/s)
- Time to first token (TTFT, in seconds)

The LLMPerf Leaderboard displays results in a clear, transparent manner. Our aim is to provide users and developers with vital insights into the capabilities and limitations of each provider, informing decisions for future integrations and deployments. In line with our commitment to transparency and utility, we also provide reproducible steps in Run Configurations as shown below:

Run Configurations

Each benchmark run is performed with the following command template from the LLMPerf repository:

    python token_benchmark_ray.py \
    --model <MODEL_NAME> \
    --mean-input-tokens 550 \
    --stddev-input-tokens 0 \
    --mean-output-tokens 150 \
    --stddev-output-tokens 0 \
    --max-num-completed-requests 150 \
    --num-concurrent-requests 5 \
    --llm-api <litellm/openai>
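
As a rough illustration, the same template can also be driven from Python via `subprocess`, using exactly the flags documented above. This is a hedged sketch: the model name is a placeholder, and the choice of `openai` vs. `litellm` depends on the provider being tested.

    # Sketch: invoking the LLMPerf CLI from Python with the template's flags.
    # The model identifier below is a placeholder, not a recommendation.
    import subprocess

    subprocess.run(
        [
            "python", "token_benchmark_ray.py",
            "--model", "meta-llama/Llama-2-70b-chat-hf",  # placeholder
            "--mean-input-tokens", "550",
            "--stddev-input-tokens", "0",
            "--mean-output-tokens", "150",
            "--stddev-output-tokens", "0",
            "--max-num-completed-requests", "150",
            "--num-concurrent-requests", "5",
            "--llm-api", "openai",  # or "litellm", per the template
        ],
        check=True,
    )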

For each provider, we perform runs with:

- 150 total completed requests, with 5 concurrent requests at a time
- 550 mean input tokens per request (standard deviation 0)
- 150 mean output tokens per request (standard deviation 0)

We ran the LLMPerf clients on an AWS EC2 instance (instance type: i4i.large) in the us-west-2 (Oregon) region. The results are current as of December 19, 2023, 3 AM PST. You can find the detailed results in the raw_data folder.
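
For reference, the summary statistics reported in the tables below (median, mean, min, max, and the P25/P75/P95/P99 percentiles) can be computed from per-request measurements along the following lines. This is an illustrative sketch with made-up sample values, not LLMPerf's exact reporting code.

    # Sketch: computing the leaderboard's summary columns with numpy.
    # The sample throughput values below are fabricated for illustration.
    import numpy as np

    per_request = np.array([66.0, 58.2, 71.4, 63.9, 69.1])  # fake data

    summary = {
        "Median": np.median(per_request),
        "Mean": np.mean(per_request),
        "Min": np.min(per_request),
        "Max": np.max(per_request),
        **{f"P{p}": np.percentile(per_request, p) for p in (25, 75, 95, 99)},
    }
    print(summary)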

Caveats and Disclaimers

Note that there may be possible sources of bias or discrepancies relative to the behavior you observe.

Output Tokens Throughput (tokens/s)

Output token throughput is measured as the average number of output tokens returned per second. We collect results by sending 150 requests to each LLM inference provider and compute the mean output token throughput across those 150 requests. A higher output token throughput indicates a more performant inference provider.
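
Concretely, each request's throughput is its output token count divided by its generation time, averaged over all completed requests. A minimal sketch of that calculation, with hypothetical measurements rather than LLMPerf's internals:

    # Sketch: mean output token throughput over completed requests.
    # Each pair is (output_token_count, generation_seconds); values are fake.
    requests = [(150, 2.3), (150, 2.1), (149, 2.6)]

    per_request_tps = [tokens / seconds for tokens, seconds in requests]
    mean_tps = sum(per_request_tps) / len(per_request_tps)
    print(f"mean output tokens/s: {mean_tps:.1f}")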

70B Models

<img src=".assets/output_tokens_per_s.jpg">
| Framework | Model | Median | Mean | Min | Max | P25 | P75 | P95 | P99 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| anyscale | meta-llama/Llama-2-70b-chat-hf | 66 | 63 | 22 | 86 | 56 | 72 | 77 | 82 |
| bedrock | meta.llama2-70b-chat-v1 | 21 | 21 | 13 | 22 | 20 | 22 | 22 | 22 |
| fireworks | accounts/fireworks/models/llama-v2-70b-chat | 40 | 40 | 33 | 46 | 38 | 42 | 45 | 46 |
| groq | llama2-70b-4096 | 185 | 184 | 148 | 208 | 174 | 195 | 207 | 208 |
| lepton | llama2-70b | 33 | 33 | 31 | 39 | 32 | 34 | 34 | 38 |
| perplexity | llama-2-70b-chat | 30 | 30 | 8 | 44 | 29 | 31 | 36 | 44 |
| replicate | meta/llama-2-70b-chat | 10 | 9 | 2 | 11 | 10 | 10 | 11 | 11 |
| together | together_ai/togethercomputer/llama-2-70b-chat | 65 | 64 | 25 | 79 | 61 | 68 | 74 | 76 |

13B Models

<img src=".assets/output_tokens_per_s_13b.jpg">
| Framework | Model | Median | Mean | Min | Max | P25 | P75 | P95 | P99 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| anyscale | meta-llama/Llama-2-13b-chat-hf | 120 | 120 | 81 | 156 | 110 | 128 | 141 | 148 |
| bedrock | meta.llama2-13b-chat-v1 | 36 | 35 | 19 | 39 | 33 | 38 | 38 | 39 |
| fireworks | accounts/fireworks/models/llama-v2-13b-chat | 42 | 42 | 39 | 45 | 41 | 43 | 44 | 44 |
| lepton | llama2-13b | 43 | 43 | 37 | 48 | 42 | 44 | 46 | 48 |
| replicate | meta/llama-2-13b-chat | 16 | 18 | 6 | 35 | 12 | 20 | 35 | 35 |
| together | together_ai/togethercomputer/llama-2-13b-chat | 102 | 101 | 1 | 123 | 98 | 108 | 119 | 122 |

7B Models

<img src=".assets/output_tokens_per_s_7b.jpg">
| Framework | Model | Median | Mean | Min | Max | P25 | P75 | P95 | P99 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| anyscale | meta-llama/Llama-2-7b-chat-hf | 51 | 51 | 45 | 57 | 49 | 54 | 56 | 57 |
| fireworks | accounts/fireworks/models/llama-v2-7b-chat | 76 | 76 | 53 | 82 | 75 | 78 | 79 | 82 |
| lepton | llama2-7b | 36 | 36 | 33 | 40 | 35 | 38 | 40 | 40 |
| replicate | meta/llama-2-7b-chat | 26 | 32 | 2 | 78 | 20 | 35 | 73 | 77 |
| together | together_ai/togethercomputer/llama-2-7b-chat | 75 | 75 | 50 | 95 | 70 | 81 | 87 | 90 |

Time to First Token (seconds)

For streaming applications, the time to first token (TTFT) measures how long the user waits before the LLM returns its first token.
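
As an illustration of what TTFT captures, here is a minimal sketch that times the first streamed chunk from an OpenAI-compatible endpoint using the `openai` Python client. This is not how LLMPerf measures TTFT internally; the model name is a placeholder, and an API key is assumed to be set in the environment.

    # Sketch: timing the first streamed token from an OpenAI-compatible API.
    import time
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY (and base_url, if needed)
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="meta-llama/Llama-2-70b-chat-hf",  # placeholder model
        messages=[{"role": "user", "content": "Say hello."}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content if chunk.choices else None
        if delta:  # first non-empty content marks the time to first token
            print(f"TTFT: {time.perf_counter() - start:.2f}s")
            break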

70B Models

<img src=".assets/ttft.jpg">
| Framework | Model | Median | Mean | Min | Max | P25 | P75 | P95 | P99 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| anyscale | meta-llama/Llama-2-70b-chat-hf | 0.21 | 0.25 | 0.18 | 0.59 | 0.19 | 0.30 | 0.36 | 0.47 |
| bedrock | meta.llama2-70b-chat-v1 | 0.39 | 0.41 | 0.29 | 0.72 | 0.37 | 0.41 | 0.54 | 0.69 |
| fireworks | accounts/fireworks/models/llama-v2-70b-chat | 0.51 | 0.51 | 0.32 | 0.96 | 0.39 | 0.56 | 0.79 | 0.95 |
| groq | llama2-70b-4096 | 0.22 | 0.23 | 0.17 | 0.36 | 0.19 | 0.24 | 0.30 | 0.35 |
| lepton | llama2-70b | 0.93 | 0.90 | 0.72 | 1.12 | 0.82 | 0.96 | 1.01 | 1.10 |
| perplexity | llama-2-70b-chat | 0.37 | 0.42 | 0.29 | 0.70 | 0.34 | 0.52 | 0.63 | 0.66 |
| replicate | meta/llama-2-70b-chat | 1.19 | 5.08 | 0.97 | 71.57 | 1.03 | 1.72 | 4.23 | 63.63 |
| together | together_ai/togethercomputer/llama-2-70b-chat | 0.63 | 0.62 | 0.46 | 0.89 | 0.55 | 0.67 | 0.77 | 0.87 |

13B Models

<img src=".assets/ttft_13b.jpg">
| Framework | Model | Median | Mean | Min | Max | P25 | P75 | P95 | P99 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| anyscale | meta-llama/Llama-2-13b-chat-hf | 0.20 | 0.22 | 0.18 | 0.56 | 0.19 | 0.22 | 0.34 | 0.50 |
| bedrock | meta.llama2-13b-chat-v1 | 0.27 | 0.33 | 0.16 | 0.77 | 0.25 | 0.30 | 0.74 | 0.76 |
| fireworks | accounts/fireworks/models/llama-v2-13b-chat | 0.49 | 0.47 | 0.28 | 0.66 | 0.39 | 0.54 | 0.59 | 0.65 |
| lepton | llama2-13b | 1.08 | 1.07 | 0.82 | 1.40 | 0.95 | 1.15 | 1.24 | 1.37 |
| replicate | meta/llama-2-13b-chat | 5.65 | 6.27 | 0.98 | 17.01 | 3.62 | 8.31 | 14.76 | 16.71 |
| together | together_ai/togethercomputer/llama-2-13b-chat | 0.54 | 0.89 | 0.39 | 0.91 | 0.46 | 0.60 | 0.70 | 0.81 |

* Perplexity didn't offer 13B models when the data was gathered. More details on the models offered can be found here.

7B Models

<img src=".assets/ttft_7b.jpg">
| Framework | Model | Median | Mean | Min | Max | P25 | P75 | P95 | P99 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| anyscale | meta-llama/Llama-2-7b-chat-hf | 0.20 | 0.23 | 0.18 | 0.50 | 0.19 | 0.23 | 0.34 | 0.46 |
| fireworks | accounts/fireworks/models/llama-v2-7b-chat | 0.33 | 0.33 | 0.21 | 1.09 | 0.32 | 0.34 | 0.37 | 0.88 |
| lepton | llama2-7b | 1.13 | 1.11 | 0.88 | 1.33 | 1.04 | 1.18 | 1.29 | 1.32 |
| replicate | meta/llama-2-7b-chat | 3.68 | 3.61 | 0.99 | 7.20 | 2.31 | 5.01 | 6.37 | 6.99 |
| together | together_ai/togethercomputer/llama-2-7b-chat | 0.52 | 0.58 | 0.42 | 0.95 | 0.46 | 0.71 | 0.84 | 0.94 |

* Perplexity didn't offer Llama-2-7B models when the data was gathered. More details on the models offered can be found here.

* Bedrock didn't offer Llama-2-7B models when the data was gathered. More details on the models offered can be found here.

Feedback