🏆 LLM-Leaderboard

A joint community effort to create one central leaderboard for LLMs. Contributions and corrections welcome! <br> We refer to a model as "open" if it can be deployed locally and used for commercial purposes.

Interactive Dashboard

https://llm-leaderboard.streamlit.app/ <br> https://huggingface.co/spaces/ludwigstumpp/llm-leaderboard

Leaderboard

Benchmarks covered: Chatbot Arena Elo, HellaSwag (zero-, one-, and few-shot), HumanEval-Python (pass@1), LAMBADA (zero- and one-shot), MMLU (zero- and few-shot), TriviaQA (zero- and one-shot), and WinoGrande (zero-, one-, and few-shot). The zero-/one-/few-shot suffixes indicate how many worked examples are included in the evaluation prompt (a toy illustration follows the table); the per-benchmark breakdown for each model is available in the interactive dashboard linked above.

| Model Name | Publisher | Open? | Reported scores |
| --- | --- | --- | --- |
| alpaca-7b | Stanford | no | 0.739, 0.661 |
| alpaca-13b | Stanford | no | 1008 |
| bloom-176b | BigScience | yes | 0.744, 0.155, 0.299 |
| cerebras-gpt-7b | Cerebras | yes | 0.636, 0.636, 0.259, 0.141 |
| cerebras-gpt-13b | Cerebras | yes | 0.635, 0.635, 0.258, 0.146 |
| chatglm-6b | ChatGLM | yes | 985 |
| chinchilla-70b | DeepMind | no | 0.808, 0.774, 0.675, 0.749 |
| codex-12b / code-cushman-001 | OpenAI | no | 0.317 |
| codegen-16B-mono | Salesforce | yes | 0.293 |
| codegen-16B-multi | Salesforce | yes | 0.183 |
| codegx-13b | Tsinghua University | no | 0.229 |
| dolly-v2-12b | Databricks | yes | 944, 0.710, 0.622 |
| eleuther-pythia-7b | EleutherAI | yes | 0.667, 0.667, 0.265, 0.198, 0.661 |
| eleuther-pythia-12b | EleutherAI | yes | 0.704, 0.704, 0.253, 0.233, 0.638 |
| falcon-7b | TII | yes | 0.781, 0.350 |
| falcon-40b | TII | yes | 0.853, 0.527 |
| fastchat-t5-3b | Lmsys.org | yes | 951 |
| gal-120b | Meta AI | no | 0.526 |
| gpt-3-7b / curie | OpenAI | no | 0.682, 0.243 |
| gpt-3-175b / davinci | OpenAI | no | 0.793, 0.789, 0.439, 0.702 |
| gpt-3.5-175b / text-davinci-003 | OpenAI | no | 0.822, 0.834, 0.481, 0.762, 0.569, 0.758, 0.816 |
| gpt-3.5-175b / code-davinci-002 | OpenAI | no | 0.463 |
| gpt-4 | OpenAI | no | 0.953, 0.670, 0.864, 0.875 |
| gpt4all-13b-snoozy | Nomic AI | yes | 0.750, 0.713 |
| gpt-neox-20b | EleutherAI | yes | 0.718, 0.719, 0.719, 0.269, 0.276, 0.347 |
| gpt-j-6b | EleutherAI | yes | 0.663, 0.683, 0.683, 0.261, 0.249, 0.234 |
| koala-13b | Berkeley BAIR | no | 1082, 0.726, 0.688 |
| llama-7b | Meta AI | no | 0.738, 0.105, 0.738, 0.302, 0.443, 0.701 |
| llama-13b | Meta AI | no | 932, 0.792, 0.158, 0.730 |
| llama-33b | Meta AI | no | 0.828, 0.217, 0.760 |
| llama-65b | Meta AI | no | 0.842, 0.237, 0.634, 0.770 |
| llama-2-70b | Meta AI | yes | 0.873, 0.698 |
| mpt-7b | MosaicML | yes | 0.761, 0.702, 0.296, 0.343 |
| oasst-pythia-12b | Open Assistant | yes | 1065, 0.681, 0.650 |
| opt-7b | Meta AI | no | 0.677, 0.677, 0.251, 0.227 |
| opt-13b | Meta AI | no | 0.692, 0.692, 0.257, 0.282 |
| opt-66b | Meta AI | no | 0.745, 0.276 |
| opt-175b | Meta AI | no | 0.791, 0.318 |
| palm-62b | Google Research | no | 0.770 |
| palm-540b | Google Research | no | 0.838, 0.834, 0.836, 0.262, 0.779, 0.818, 0.693, 0.814, 0.811, 0.837, 0.851 |
| palm-coder-540b | Google Research | no | 0.359 |
| palm-2-s | Google Research | no | 0.820, 0.807, 0.752, 0.779 |
| palm-2-s* | Google Research | no | 0.376 |
| palm-2-m | Google Research | no | 0.840, 0.837, 0.817, 0.792 |
| palm-2-l | Google Research | no | 0.868, 0.869, 0.861, 0.830 |
| palm-2-l-instruct | Google Research | no | 0.909 |
| replit-code-v1-3b | Replit | yes | 0.219 |
| stablelm-base-alpha-7b | Stability AI | yes | 0.412, 0.533, 0.251, 0.049, 0.501 |
| stablelm-tuned-alpha-7b | Stability AI | no | 858, 0.536, 0.548 |
| starcoder-base-16b | BigCode | yes | 0.304 |
| starcoder-16b | BigCode | yes | 0.336 |
| vicuna-13b | Lmsys.org | no | 1169 |
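
The zero-, one-, and few-shot settings above differ only in how many worked examples are prepended to the question the model has to answer. The snippet below is a minimal, hypothetical illustration of that difference; the actual prompt templates and example pools behind the reported scores vary from source to source.

```python
# Hypothetical illustration of zero-/one-/few-shot prompting. The exact
# templates used for the scores in the table differ by source.

def build_prompt(question: str, solved_examples: list[tuple[str, str]]) -> str:
    """Prepend len(solved_examples) worked examples: 0 = zero-shot, 1 = one-shot, ..."""
    parts = [f"Q: {q}\nA: {a}" for q, a in solved_examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

demos = [("What is the capital of France?", "Paris")]
print(build_prompt("What is the capital of Italy?", solved_examples=[]))     # zero-shot
print(build_prompt("What is the capital of Italy?", solved_examples=demos))  # one-shot
```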

Benchmarks

| Benchmark Name | Author | Link | Description |
| --- | --- | --- | --- |
| Chatbot Arena Elo | LMSYS | https://lmsys.org/blog/2023-05-03-arena/ | "In this blog post, we introduce Chatbot Arena, an LLM benchmark platform featuring anonymous randomized battles in a crowdsourced manner. Chatbot Arena adopts the Elo rating system, which is a widely-used rating system in chess and other competitive games." (Source: https://lmsys.org/blog/2023-05-03-arena/) A toy Elo-update sketch follows the table. |
| HellaSwag | Zellers et al. | https://arxiv.org/abs/1905.07830v1 | "HellaSwag is a challenge dataset for evaluating commonsense NLI that is specially hard for state-of-the-art models, though its questions are trivial for humans (>95% accuracy)." (Source: https://paperswithcode.com/dataset/hellaswag) |
| HumanEval | Chen et al. | https://arxiv.org/abs/2107.03374v2 | "It used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions." (Source: https://paperswithcode.com/dataset/humaneval) A sketch of the pass@k estimator follows the table. |
| LAMBADA | Paperno et al. | https://arxiv.org/abs/1606.06031 | "The LAMBADA evaluates the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse." (Source: https://huggingface.co/datasets/lambada) |
| MMLU | Hendrycks et al. | https://github.com/hendrycks/test | "The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a model’s blind spots." (Source: https://paperswithcode.com/dataset/mmlu) |
| TriviaQA | Joshi et al. | https://arxiv.org/abs/1705.03551v2 | "We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions." (Source: https://arxiv.org/abs/1705.03551v2) |
| WinoGrande | Sakaguchi et al. | https://arxiv.org/abs/1907.10641v2 | "A large-scale dataset of 44k [expert-crafted pronoun resolution] problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset." (Source: https://arxiv.org/abs/1907.10641v2) |
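
For the Chatbot Arena Elo row above: after each crowdsourced head-to-head battle, both models' ratings move toward the observed outcome and away from the expected one. Below is a textbook Elo update as a rough sketch; the K-factor and starting ratings are illustrative, not the exact constants used by Chatbot Arena.

```python
# Textbook Elo update for one pairwise "battle" (illustrative constants only).

def expected_score(rating_a: float, rating_b: float) -> float:
    """Win probability of model A against model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie."""
    e_a = expected_score(rating_a, rating_b)
    return rating_a + k * (score_a - e_a), rating_b - k * (score_a - e_a)

# Example: a 1100-rated model beats a 1000-rated model and gains roughly 11.5 points.
print(elo_update(1100.0, 1000.0, score_a=1.0))
```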
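
For the HumanEval row above: the pass@1 numbers in the leaderboard are taken from their respective sources, which may compute them in different ways (for example from a single greedy sample per problem). The HumanEval paper (Chen et al., 2021) also defines an unbiased pass@k estimator from n generated samples of which c pass the unit tests; a small sketch of that estimator:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021):
    n = samples generated per problem, c = samples passing all unit tests, k <= n."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: with 200 samples per problem and 37 of them passing, pass@1 is estimated as 37/200.
print(pass_at_k(n=200, c=37, k=1))  # 0.185
```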
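
Benchmarks such as HellaSwag, MMLU, and WinoGrande are multiple-choice, and base language models are commonly scored on them by picking the option whose text receives the highest (often length-normalized) log-likelihood. The sketch below shows that generic scheme; it is not necessarily how every source in the table computed its numbers, and `log_likelihood` is a placeholder for a real model call.

```python
from typing import Callable

def pick_choice(context: str, choices: list[str],
                log_likelihood: Callable[[str, str], float]) -> int:
    """Return the index of the choice with the highest per-character log-likelihood
    when appended to the context. log_likelihood(context, continuation) stands in
    for a real language-model scoring call."""
    scores = [log_likelihood(context, choice) / max(len(choice), 1) for choice in choices]
    return max(range(len(choices)), key=lambda i: scores[i])

# Toy usage with a stand-in scorer (a real one would query a language model).
fake_scores = {" it was too big.": -8.0, " it was extremely small.": -20.0}
dummy = lambda ctx, cont: fake_scores[cont]
print(pick_choice("The trophy didn't fit in the suitcase because",
                  list(fake_scores), dummy))  # prints 0
```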

How to Contribute

We are always happy about contributions! You can contribute, for example, by adding benchmark results for models that are missing them, adding new models, or correcting existing entries.

Future Ideas

More Open LLMs

If you are interested in an overview of open LLMs for commercial use and fine-tuning, check out the open-llms repository.

Sources

The results in this leaderboard are collected from the models' individual papers and from results published by the model authors. Each reported value links to its source.

Special thanks to the following pages:

Disclaimer

The information above may be wrong. If you want to use a published model for commercial purposes, please consult a lawyer.