
🏆 LLM-Leaderboard

A joint community effort to create one central leaderboard for LLMs. Contributions and corrections are welcome!

We call a model "open" if it can be deployed locally and used for commercial purposes.

Interactive Dashboard

https://llm-leaderboard.streamlit.app/
https://huggingface.co/spaces/ludwigstumpp/llm-leaderboard

Leaderboard

Model Name	Publisher	Open?	Chatbot Arena Elo	HellaSwag (few-shot)	HellaSwag (zero-shot)	HellaSwag (one-shot)	HumanEval-Python (pass@1)	LAMBADA (zero-shot)	LAMBADA (one-shot)	MMLU (zero-shot)	MMLU (few-shot)	TriviaQA (zero-shot)	TriviaQA (one-shot)	WinoGrande (zero-shot)	WinoGrande (one-shot)	WinoGrande (few-shot)
alpaca-7b	Stanford	no			0.739									0.661
alpaca-13b	Stanford	no	1008
bloom-176b	BigScience	yes		0.744			0.155			0.299
cerebras-gpt-7b	Cerebras	yes			0.636			0.636		0.259		0.141
cerebras-gpt-13b	Cerebras	yes			0.635			0.635		0.258		0.146
chatglm-6b	ChatGLM	yes	985
chinchilla-70b	DeepMind	no			0.808			0.774			0.675			0.749
codex-12b / code-cushman-001	OpenAI	no					0.317
codegen-16B-mono	Salesforce	yes					0.293
codegen-16B-multi	Salesforce	yes					0.183
codegeex-13b	Tsinghua University	no					0.229
dolly-v2-12b	Databricks	yes	944		0.710									0.622
eleuther-pythia-7b	EleutherAI	yes			0.667			0.667		0.265		0.198		0.661
eleuther-pythia-12b	EleutherAI	yes			0.704			0.704		0.253		0.233		0.638
falcon-7b	TII	yes		0.781							0.350
falcon-40b	TII	yes		0.853							0.527
fastchat-t5-3b	Lmsys.org	yes	951
gal-120b	Meta AI	no								0.526
gpt-3-7b / curie	OpenAI	no		0.682							0.243
gpt-3-175b / davinci	OpenAI	no		0.793	0.789						0.439			0.702
gpt-3.5-175b / text-davinci-003	OpenAI	no		0.822	0.834		0.481	0.762			0.569			0.758		0.816
gpt-3.5-175b / code-davinci-002	OpenAI	no					0.463
gpt-4	OpenAI	no		0.953			0.670				0.864					0.875
gpt4all-13b-snoozy	Nomic AI	yes			0.750									0.713
gpt-neox-20b	EleutherAI	yes		0.718	0.719			0.719		0.269	0.276	0.347
gpt-j-6b	EleutherAI	yes		0.663	0.683			0.683		0.261	0.249	0.234
koala-13b	Berkeley BAIR	no	1082		0.726									0.688
llama-7b	Meta AI	no			0.738		0.105	0.738		0.302		0.443		0.701
llama-13b	Meta AI	no	932		0.792		0.158							0.730
llama-33b	Meta AI	no			0.828		0.217							0.760
llama-65b	Meta AI	no			0.842		0.237				0.634			0.770
llama-2-70b	Meta AI	yes		0.873							0.698
mpt-7b	MosaicML	yes			0.761			0.702		0.296		0.343
oasst-pythia-12b	Open Assistant	yes	1065		0.681									0.650
opt-7b	Meta AI	no			0.677			0.677		0.251		0.227
opt-13b	Meta AI	no			0.692			0.692		0.257		0.282
opt-66b	Meta AI	no		0.745							0.276
opt-175b	Meta AI	no		0.791							0.318
palm-62b	Google Research	no												0.770
palm-540b	Google Research	no		0.838	0.834	0.836	0.262	0.779	0.818		0.693		0.814	0.811	0.837	0.851
palm-coder-540b	Google Research	no					0.359
palm-2-s	Google Research	no				0.820			0.807				0.752		0.779
palm-2-s*	Google Research	no					0.376
palm-2-m	Google Research	no				0.840			0.837				0.817		0.792
palm-2-l	Google Research	no				0.868			0.869				0.861		0.830
palm-2-l-instruct	Google Research	no														0.909
replit-code-v1-3b	Replit	yes					0.219
stablelm-base-alpha-7b	Stability AI	yes			0.412			0.533		0.251		0.049		0.501
stablelm-tuned-alpha-7b	Stability AI	no	858		0.536									0.548
starcoder-base-16b	BigCode	yes					0.304
starcoder-16b	BigCode	yes					0.336
vicuna-13b	Lmsys.org	no	1169
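
Because the leaderboard is a sparse, tab-separated table, ranking models on a single benchmark means skipping the empty cells. The following is a minimal sketch of one way to do that, not the dashboard's actual code. `SAMPLE` is a reduced excerpt of the table above (only four columns are kept), and `rank_models` is a hypothetical helper name.

```python
import csv
import io

# Reduced excerpt of the leaderboard above: same values, fewer columns.
SAMPLE = (
    "Model Name\tPublisher\tOpen?\tHumanEval-Python (pass@1)\n"
    "codegen-16B-mono\tSalesforce\tyes\t0.293\n"
    "gpt-4\tOpenAI\tno\t0.670\n"
    "starcoder-16b\tBigCode\tyes\t0.336\n"
    "vicuna-13b\tLmsys.org\tno\t\n"
)


def rank_models(tsv_text, benchmark, open_only=False):
    """Return (model, score) pairs for one benchmark column, best first.

    Rows with an empty or missing cell for the benchmark are skipped.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    scored = []
    for row in reader:
        # Short rows yield None for missing trailing columns.
        value = (row.get(benchmark) or "").strip()
        if not value:
            continue
        if open_only and row["Open?"] != "yes":
            continue
        scored.append((row["Model Name"], float(value)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_models(SAMPLE, "HumanEval-Python (pass@1)"):
        print(f"{name}\t{score:.3f}")
```

Passing `open_only=True` restricts the ranking to rows marked `yes` in the "Open?" column.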

Benchmarks

Benchmark Name	Author	Link	Description
Chatbot Arena Elo	LMSYS	https://lmsys.org/blog/2023-05-03-arena/	"In this blog post, we introduce Chatbot Arena, an LLM benchmark platform featuring anonymous randomized battles in a crowdsourced manner. Chatbot Arena adopts the Elo rating system, which is a widely-used rating system in chess and other competitive games." (Source: https://lmsys.org/blog/2023-05-03-arena/)
HellaSwag	Zellers et al.	https://arxiv.org/abs/1905.07830v1	"HellaSwag is a challenge dataset for evaluating commonsense NLI that is specially hard for state-of-the-art models, though its questions are trivial for humans (>95% accuracy)." (Source: https://paperswithcode.com/dataset/hellaswag)
HumanEval	Chen et al.	https://arxiv.org/abs/2107.03374v2	"It used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions." (Source: https://paperswithcode.com/dataset/humaneval)
LAMBADA	Paperno et al.	https://arxiv.org/abs/1606.06031	"The LAMBADA evaluates the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse." (Source: https://huggingface.co/datasets/lambada)
MMLU	Hendrycks et al.	https://github.com/hendrycks/test	"The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a model’s blind spots." (Source: https://paperswithcode.com/dataset/mmlu)
TriviaQA	Joshi et al.	https://arxiv.org/abs/1705.03551v2	"We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions." (Source: https://arxiv.org/abs/1705.03551v2)
WinoGrande	Sakaguchi et al.	https://arxiv.org/abs/1907.10641v2	"A large-scale dataset of 44k [expert-crafted pronoun resolution] problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset." (Source: https://arxiv.org/abs/1907.10641v2)
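
Two of the metrics in the leaderboard are defined by short formulas: the Elo rating system behind the Chatbot Arena column and the pass@k metric behind the HumanEval column. The sketch below shows the textbook Elo update and the unbiased pass@k estimator described by Chen et al. It is illustrative only. The K-factor of 32 is an assumption, not a value taken from Chatbot Arena, and this page does not state how each listed result was computed.

```python
import math


def elo_expected(rating_a, rating_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return both ratings after one battle.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    k is the K-factor; 32 is an assumed, illustrative value.
    """
    expected_a = elo_expected(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated per problem, c: samples that pass the tests,
    k: number of attempts allowed. Requires Python 3.8+ for math.comb.
    """
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


if __name__ == "__main__":
    print(elo_update(1000.0, 1000.0, 1.0))
    print(pass_at_k(n=10, c=3, k=1))
```

For k = 1 the estimator reduces to c / n, the fraction of generated samples that pass the unit tests.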

How to Contribute

Contributions are always welcome! You can contribute in the following ways:

table work (don't forget the links):
- filling in missing entries
- adding a new model as a new row in the leaderboard. Please keep alphabetical order.
- adding a new benchmark as a new column in the leaderboard and adding it to the benchmarks table. Please keep alphabetical order.
code work:
- improving the existing code
- requesting and implementing new features
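
To check the ordering request before submitting a new row, a small helper could flag adjacent model names that are out of order. This is a sketch and not part of the repository's code. `natural_key` and `out_of_order` are hypothetical names, and the number-aware ordering (so that `alpaca-7b` sorts before `alpaca-13b`) is an assumption about what the table intends.

```python
import re


def natural_key(name):
    """Sort key that compares digit runs as integers, ignoring case.

    re.split with a capturing group always alternates text and digit
    chunks, so keys stay comparable position by position.
    """
    parts = re.split(r"(\d+)", name.lower())
    return [int(part) if part.isdigit() else part for part in parts]


def out_of_order(names):
    """Return adjacent (previous, current) pairs that break the ordering."""
    return [
        (prev, cur)
        for prev, cur in zip(names, names[1:])
        if natural_key(prev) > natural_key(cur)
    ]


if __name__ == "__main__":
    rows = ["alpaca-7b", "alpaca-13b", "bloom-176b", "llama-13b", "llama-7b"]
    for prev, cur in out_of_order(rows):
        print(f"'{cur}' should come before '{prev}'")
```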

Future Ideas

(TBD) add model year
(TBD) add model details:
- #params
- #tokens seen during training
- context window length
- architecture type (transformer-decoder, transformer-encoder, transformer-encoder-decoder, ...)

More Open LLMs

If you are interested in an overview of open LLMs for commercial use and fine-tuning, check out the open-llms repository.

Sources

The results of this leaderboard are collected from the individual papers and published results of the model authors. For each reported value, the source is added as a link.

Special thanks to the following pages:

Disclaimer

The information above may be wrong. If you want to use a published model commercially, please consult a lawyer.