Awesome

<div align="center"> <img src="https://github.com/user-attachments/assets/7d987c01-a398-413c-9860-c1704352a7fe" alt="lbops" width="300"/> <h1>Awesome Foundation Model Leaderboard</h1> <a href="https://awesome.re"> <img src="https://awesome.re/badge.svg" height="20"/> </a> <a href="https://github.com/SAILResearch/awesome-foundation-model-leaderboards/fork"> <img src="https://img.shields.io/badge/PRs-Welcome-red" height="20"/> </a> <a href="https://arxiv.org/pdf/2407.04065.pdf"> <img src="https://img.shields.io/badge/Arxiv-2407.04065-red" height="20"/> </a> </div>

Awesome Foundation Model Leaderboard is a curated list of awesome foundation model leaderboards (for an explanation of what a leaderboard is, please refer to this tutorial), along with various development tools and evaluation organizations according to our survey:

<p align="center"><strong>On the Workflows and Smells of Leaderboard Operations (LBOps):<br>An Exploratory Study of Foundation Model Leaderboards</strong></p> <p align="center"><a href="https://github.com/zhimin-z">Zhimin (Jimmy) Zhao</a>, <a href="https://abdulali.github.io">Abdul Ali Bangash</a>, <a href="https://www.filipecogo.pro">Filipe Roseiro Côgo</a>, <a href="https://mcis.cs.queensu.ca/bram.html">Bram Adams</a>, <a href="https://research.cs.queensu.ca/home/ahmed">Ahmed E. Hassan</a></p> <p align="center"><a href="https://sail.cs.queensu.ca">Software Analysis and Intelligence Lab (SAIL)</a></p>

If you find this repository useful, please consider giving us a star :star: and citation:

@article{zhao2024workflows,
  title={On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards},
  author={Zhao, Zhimin and Bangash, Abdul Ali and C{\^o}go, Filipe Roseiro and Adams, Bram and Hassan, Ahmed E},
  journal={arXiv preprint arXiv:2407.04065},
  year={2024}
}

Additionally, we provide a search toolkit that helps you quickly navigate through the leaderboards.

If you want to contribute to this list (please do), welcome to propose a pull request.

If you have any suggestions, critiques, or questions regarding this list, welcome to raise an issue.

Also, a leaderboard should be included if only:

It is actively maintained.
It is related to foundation models.

Tools
Challenges
Rankings
- Model Ranking
  - Comprehensive
  - Text
  - Image
  - Code
  - Video
  - Math
  - Agent
  - Audio
  - 3D
  - Multimodal
- Database Ranking
- Dataset Ranking
- Metric Ranking
- Paper Ranking
- Leaderboard Ranking

Tools

Name	Description
Demo Leaderboard	Demo leaderboard helps users easily deploy their leaderboards with a standardized template.
Demo Leaderboard Backend	Demo leaderboard backend helps users manage the leaderboard and handle submission requests, check this for details.
Kaggle Competition Creation	Kaggle Competition Creation enables you to design and launch custom competitions, leveraging your datasets to engage the data science community.
Leaderboard Explorer	Leaderboard Explorer helps users navigate the diverse range of leaderboards available on Hugging Face Spaces.
Open LLM Leaderboard Renamer	open-llm-leaderboard-renamer helps users rename their models in Open LLM Leaderboard easily.
Open LLM Leaderboard Results PR Opener	Open LLM Leaderboard Results PR Opener helps users showcase Open LLM Leaderboard results in their model cards.
Open LLM Leaderboard Scraper	Open LLM Leaderboard Scraper helps users scrape and export data from Open LLM Leaderboard.
Progress Tracker	This app visualizes the progress of proprietary and open-source LLMs over time as scored by the LMSYS Chatbot Arena.

Challenges

Name	Description
AIcrowd	AIcrowd hosts machine learning challenges and competitions across domains such as computer vision, NLP, and reinforcement learning, aimed at both researchers and practitioners.
AI Hub	AI Hub offers a variety of competitions to encourage AI solutions to real-world problems, with a focus on innovation and collaboration.
AI Studio	AI Studio offers AI competitions mainly for computer vision, NLP, and other data-driven tasks, allowing users to develop and showcase their AI skills.
Allen Institute for AI	The Allen Institute for AI provides leaderboards and benchmarks on tasks in natural language understanding, commonsense reasoning, and other areas in AI research.
Codabench	Codabench is an open-source platform for benchmarking AI models, enabling customizable, user-driven challenges across various AI domains.
DataFountain	DataFountain is a Chinese AI competition platform featuring challenges in finance, healthcare, and smart cities, encouraging solutions for industry-related problems.
DrivenData	DrivenData hosts machine learning challenges with a social impact, aiming to solve issues in areas, such as public health, disaster relief, and sustainable development.
Dynabench	Dynabench offers dynamic benchmarks where models are evaluated continuously, often involving human interaction, to ensure robustness in evolving AI tasks.
Eval AI	EvalAI is a platform for hosting and participating in AI challenges, widely used by researchers for benchmarking models in tasks, such as image classification, NLP, and reinforcement learning.
Grand Challenge	Grand Challenge provides a platform for medical imaging challenges, supporting advancements in medical AI, particularly in areas, such as radiology and pathology.
Hilti	Hilti hosts challenges aimed at advancing AI and machine learning in the construction industry, with a focus on practical, industry-relevant applications.
InsightFace	InsightFace focuses on AI challenges related to face recognition, verification, and analysis, supporting advancements in identity verification and security.
Kaggle	Kaggle is one of the largest platforms for data science and machine learning competitions, covering a broad range of topics from image classification to NLP and predictive modeling.
nuScenes	nuScenes enables researchers to study challenging urban driving situations using the full sensor suite of a real self-driving car, facilitating research in autonomous driving.
Robust Reading Competition	Robust Reading refers to the research area on interpreting written communication in unconstrained settings, with competitions focused on text recognition in real-world environments.
Tianchi	Tianchi, hosted by Alibaba, offers a range of AI competitions, particularly popular in Asia, with a focus on commerce, healthcare, and logistics.

Rankings

Model Ranking

Comprehensive

Name	Description
Artificial Analysis	Artificial Analysis is a platform to help users make informed decisions on AI model selection and hosting providers.
CompassRank	CompassRank is a platform to offer a comprehensive, objective, and neutral evaluation reference of foundation mdoels for the industry and research.
FlagEval	FlagEval is a comprehensive platform for evaluating foundation models.
Generative AI Leaderboards	Generative AI Leaderboard ranks the top-performing generative AI models based on various metrics.
Holistic Evaluation of Language Models	Holistic Evaluation of Language Models (HELM) is a reproducible and transparent framework for evaluating foundation models.
Papers With Code	Papers With Code provides open-source leaderboards and benchmarks, linking AI research papers with code to foster transparency and reproducibility in machine learning.
SuperCLUE	SuperCLUE is a series of benchmarks for evaluating Chinese foundation models.
Vellum LLM Leaderboard	Vellum LLM Leaderboard shows a comparison of capabilities, price and context window for leading commercial and open-source LLMs.

Text

Name	Description
ACLUE	ACLUE is an evaluation benchmark for ancient Chinese language comprehension.
African Languages LLM Eval Leaderboard	African Languages LLM Eval Leaderboard tracks progress and ranks performance of LLMs on African languages.
AgentBoard	AgentBoard is a benchmark for multi-turn LLM agents, complemented by an analytical evaluation board for detailed model assessment beyond final success rates.
AGIEval	AGIEval is a human-centric benchmark to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving.
Aiera Leaderboard	Aiera Leaderboard evaluates LLM performance on financial intelligence tasks, including speaker assignments, speaker change identification, abstractive summarizations, calculation-based Q&A, and financial sentiment tagging.
AIR-Bench	AIR-Bench is a benchmark to evaluate heterogeneous information retrieval capabilities of language models.
AI Energy Score Leaderboard	AI Energy Score Leaderboard tracks and compares different models in energy efficiency.
ai-benchmarks	ai-benchmarks contains a handful of evaluation results for the response latency of popular AI services.
AlignBench	AlignBench is a multi-dimensional benchmark for evaluating LLMs' alignment in Chinese.
AlpacaEval	AlpacaEval is an automatic evaluator designed for instruction-following LLMs.
ANGO	ANGO is a generation-oriented Chinese language model evaluation benchmark.
Arabic Tokenizers Leaderboard	Arabic Tokenizers Leaderboard compares the efficiency of LLMs in parsing Arabic in its different dialects and forms.
Arena-Hard-Auto	Arena-Hard-Auto is a benchmark for instruction-tuned LLMs.
AutoRace	AutoRace focuses on the direct evaluation of LLM reasoning chains with metric AutoRace (Automated Reasoning Chain Evaluation).
Auto Arena	Auto Arena is a benchmark in which various language model agents engage in peer-battles to evaluate their performance.
Auto-J	Auto-J hosts evaluation results on the pairwise response comparison and critique generation tasks.
BABILong	BABILong is a benchmark for evaluating the performance of language models in processing arbitrarily long documents with distributed facts.
BBL	BBL (BIG-bench Lite) is a small subset of 24 diverse JSON tasks from BIG-bench. It is designed to provide a canonical measure of model performance, while being far cheaper to evaluate than the full set of more than 200 programmatic and JSON tasks in BIG-bench.
BeHonest	BeHonest is a benchmark to evaluate honesty - awareness of knowledge boundaries (self-knowledge), avoidance of deceit (non-deceptiveness), and consistency in responses (consistency) - in LLMs.
BenBench	BenBench is a benchmark to evaluate the extent to which LLMs conduct verbatim training on the training set of a benchmark over the test set to enhance capabilities.
BenCzechMark	BenCzechMark (BCM) is a multitask and multimetric Czech language benchmark for LLMs with a unique scoring system that utilizes the theory of statistical significance.
BiGGen-Bench	BiGGen-Bench is a comprehensive benchmark to evaluate LLMs across a wide variety of tasks.
BotChat	BotChat is a benchmark to evaluate the multi-round chatting capabilities of LLMs through a proxy task.
CaselawQA	CaselawQA is a benchmark comprising legal classification tasks derived from the Supreme Court and Songer Court of Appeals legal databases.
CFLUE	CFLUE is a benchmark to evaluate LLMs' understanding and processing capabilities in the Chinese financial domain.
Ch3Ef	Ch3Ef is a benchmark to evaluate alignment with human expectations using 1002 human-annotated samples across 12 domains and 46 tasks based on the hhh principle.
Chain-of-Thought Hub	Chain-of-Thought Hub is a benchmark to evaluate the reasoning capabilities of LLMs.
Chatbot Arena	Chatbot Arena hosts a chatbot arena where various LLMs compete based on user satisfaction.
ChemBench	ChemBench is a benchmark to evaluate the chemical knowledge and reasoning abilities of LLMs.
Chinese SimpleQA	Chinese SimpleQA is a Chinese benchmark to evaluate the factuality ability of language models to answer short questions.
CLEM Leaderboard	CLEM is a framework designed for the systematic evaluation of chat-optimized LLMs as conversational agents.
CLEVA	CLEVA is a benchmark to evaluate LLMs on 31 tasks using 370K Chinese queries from 84 diverse datasets and 9 metrics.
Chinese Large Model Leaderboard	Chinese Large Model Leaderboard is a platform to evaluate the performance of Chinese LLMs.
CMB	CMB is a multi-level medical benchmark in Chinese.
CMMLU	CMMLU is a benchmark to evaluate the performance of LLMs in various subjects within the Chinese cultural context.
CMMMU	CMMMU is a benchmark to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context.
CommonGen	CommonGen is a benchmark to evaluate generative commonsense reasoning by testing machines on their ability to compose coherent sentences using a given set of common concepts.
CompMix	CompMix is a benchmark for heterogeneous question answering.
Compression Rate Leaderboard	Compression Rate Leaderboard aims to evaluate tokenizer performance on different languages.
Compression Leaderboard	Compression Leaderboard is a platform to evaluate the compression performance of LLMs.
CopyBench	CopyBench is a benchmark to evaluate the copying behavior and utility of language models as well as the effectiveness of methods to mitigate copyright risks.
CoTaEval	CoTaEval is a benchmark to evaluate the feasibility and side effects of copyright takedown methods for LLMs.
ConvRe	ConvRe is a benchmark to evaluate LLMs' ability to comprehend converse relations.
CriticEval	CriticEval is a benchmark to evaluate LLMs' ability to make critique responses.
CS-Bench	CS-Bench is a bilingual benchmark designed to evaluate LLMs' performance across 26 computer science subfields, focusing on knowledge and reasoning.
CUTE	CUTE is a benchmark to test the orthographic knowledge of LLMs.
CyberMetric	CyberMetric is a benchmark to evaluate the cybersecurity knowledge of LLMs.
CzechBench	CzechBench is a benchmark to evaluate Czech language models.
C-Eval	C-Eval is a Chinese evaluation suite for LLMs.
Decentralized Arena Leaderboard	Decentralized Arena hosts a decentralized and democratic platform for LLM evaluation, automating and scaling assessments across diverse, user-defined dimensions, including mathematics, logic, and science.
DecodingTrust	DecodingTrust is a platform to evaluate the trustworthiness of LLMs.
Domain LLM Leaderboard	Domain LLM Leaderboard is a platform to evaluate the popularity of domain-specific LLMs.
Enterprise Scenarios leaderboard	Enterprise Scenarios Leaderboard tracks and evaluates the performance of LLMs on real-world enterprise use cases.
EQ-Bench	EQ-Bench is a benchmark to evaluate aspects of emotional intelligence in LLMs.
European LLM Leaderboard	European LLM Leaderboard tracks and compares performance of LLMs in European languages.
EvalGPT.ai	EvalGPT.ai hosts a chatbot arena to compare and rank the performance of LLMs.
Eval Arena	Eval Arena measures noise levels, model quality, and benchmark quality by comparing model pairs across several LLM evaluation benchmarks with example-level analysis and pairwise comparisons.
Factuality Leaderboard	Factuality Leaderboard compares the factual capabilities of LLMs.
FanOutQA	FanOutQA is a high quality, multi-hop, multi-document benchmark for LLMs using English Wikipedia as its knowledge base.
FastEval	FastEval is a toolkit for quickly evaluating instruction-following and chat language models on various benchmarks with fast inference and detailed performance insights.
FELM	FELM is a meta benchmark to evaluate factuality evaluation benchmark for LLMs.
FinEval	FinEval is a benchmark to evaluate financial domain knowledge in LLMs.
Fine-tuning Leaderboard	Fine-tuning Leaderboard is a platform to rank and showcase models that have been fine-tuned using open-source datasets or frameworks.
Flames	Flames is a highly adversarial Chinese benchmark for evaluating LLMs' value alignment across fairness, safety, morality, legality, and data protection.
FollowBench	FollowBench is a multi-level fine-grained constraints following benchmark to evaluate the instruction-following capability of LLMs.
Forbidden Question Dataset	Forbidden Question Dataset is a benchmark containing 160 questions from 160 violated categories, with corresponding targets for evaluating jailbreak methods.
FuseReviews	FuseReviews aims to advance grounded text generation tasks, including long-form question-answering and summarization.
GAIA	GAIA aims to test fundamental abilities that an AI assistant should possess.
GAVIE	GAVIE is a GPT-4-assisted benchmark for evaluating hallucination in LMMs by scoring accuracy and relevancy without relying on human-annotated groundtruth.
GPT-Fathom	GPT-Fathom is an LLM evaluation suite, benchmarking 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings.
GrailQA	Strongly Generalizable Question Answering (GrailQA) is a large-scale, high-quality benchmark for question answering on knowledge bases (KBQA) on Freebase with 64,331 questions annotated with both answers and corresponding logical forms in different syntax (i.e., SPARQL, S-expression, etc.).
GTBench	GTBench is a benchmark to evaluate and rank LLMs' reasoning abilities in competitive environments through game-theoretic tasks, e.g., board and card games.
Guerra LLM AI Leaderboard	Guerra LLM AI Leaderboard compares and ranks the performance of LLMs across quality, price, performance, context window, and others.
Hallucinations Leaderboard	Hallucinations Leaderboard aims to track, rank and evaluate hallucinations in LLMs.
HalluQA	HalluQA is a benchmark to evaluate the phenomenon of hallucinations in Chinese LLMs.
Hebrew LLM Leaderboard	Hebrew LLM Leaderboard tracks and ranks language models according to their success on various tasks on Hebrew.
HellaSwag	HellaSwag is a benchmark to evaluate common-sense reasoning in LLMs.
Hughes Hallucination Evaluation Model leaderboard	Hughes Hallucination Evaluation Model leaderboard is a platform to evaluate how often a language model introduces hallucinations when summarizing a document.
Icelandic LLM leaderboard	Icelandic LLM leaderboard tracks and compare models on Icelandic-language tasks.
IFEval	IFEval is a benchmark to evaluate LLMs' instruction following capabilities with verifiable instructions.
IL-TUR	IL-TUR is a benchmark for evaluating language models on monolingual and multilingual tasks focused on understanding and reasoning over Indian legal documents.
Indic LLM Leaderboard	Indic LLM Leaderboard is platform to track and compare the performance of Indic LLMs.
Indico LLM Leaderboard	Indico LLM Leaderboard evaluates and compares the accuracy of various language models across providers, datasets, and capabilities like text classification, key information extraction, and generative summarization.
InstructEval	InstructEval is a suite to evaluate instruction selection methods in the context of LLMs.
Italian LLM-Leaderboard	Italian LLM-Leaderboard tracks and compares LLMs in Italian-language tasks.
JailbreakBench	JailbreakBench is a benchmark for evaluating LLM vulnerabilities through adversarial prompts.
Japanese Chatbot Arena	Japanese Chatbot Arena hosts the chatbot arena, where various LLMs compete based on their performance in Japanese.
Japanese Language Model Financial Evaluation Harness	Japanese Language Model Financial Evaluation Harness is a harness for Japanese language model evaluation in the financial domain.
Japanese LLM Roleplay Benchmark	Japanese LLM Roleplay Benchmark is a benchmark to evaluate the performance of Japanese LLMs in character roleplay.
JMED-LLM	JMED-LLM (Japanese Medical Evaluation Dataset for Large Language Models) is a benchmark for evaluating LLMs in the medical field of Japanese.
JMMMU	JMMMU (Japanese MMMU) is a multimodal benchmark to evaluate LMM performance in Japanese.
JustEval	JustEval is a powerful tool designed for fine-grained evaluation of LLMs.
KoLA	KoLA is a benchmark to evaluate the world knowledge of LLMs.
LaMP	LaMP (Language Models Personalization) is a benchmark to evaluate personalization capabilities of language models.
Language Model Council	Language Model Council (LMC) is a benchmark to evaluate tasks that are highly subjective and often lack majoritarian human agreement.
LawBench	LawBench is a benchmark to evaluate the legal capabilities of LLMs.
La Leaderboard	La Leaderboard evaluates and tracks LLM memorization, reasoning and linguistic capabilities in Spain, LATAM and Caribbean.
LogicKor	LogicKor is a benchmark to evaluate the multidisciplinary thinking capabilities of Korean LLMs.
LongICL Leaderboard	LongICL Leaderboard is a platform to evaluate long in-context learning evaluations for LLMs.
LooGLE	LooGLE is a benchmark to evaluate long context understanding capabilties of LLMs.
LAiW	LAiW is a benchmark to evaluate Chinese legal language understanding and reasoning.
LLM Benchmarker Suite	LLM Benchmarker Suite is a benchmark to evaluate the comprehensive capabilities of LLMs.
Large Language Model Assessment in English Contexts	Large Language Model Assessment in English Contexts is a platform to evaluate LLMs in the English context.
Large Language Model Assessment in the Chinese Context	Large Language Model Assessment in the Chinese Context is a platform to evaluate LLMs in the Chinese context.
LIBRA	LIBRA is a benchmark for evaluating LLMs' capabilities in understanding and processing long Russian text.
LibrAI-Eval GenAI Leaderboard	LibrAI-Eval GenAI Leaderboard focuses on the balance between the LLM’s capability and safety in English.
LiveBench	LiveBench is a benchmark for LLMs to minimize test set contamination and enable objective, automated evaluation across diverse, regularly updated tasks.
LLMEval	LLMEval is a benchmark to evaluate the quality of open-domain conversations with LLMs.
Llmeval-Gaokao2024-Math	Llmeval-Gaokao2024-Math is a benchmark for evaluating LLMs on 2024 Gaokao-level math problems in Chinese.
LLMHallucination Leaderboard	Hallucinations Leaderboard evaluates LLMs based on an array of hallucination-related benchmarks.
LLMPerf	LLMPerf is a tool to evaluate the performance of LLMs using both load and correctness tests.
LLMs Disease Risk Prediction Leaderboard	LLMs Disease Risk Prediction Leaderboard is a platform to evaluate LLMs on disease risk prediction.
LLM Leaderboard	LLM Leaderboard tracks and evaluates LLM providers, enabling selection of the optimal API and model for user needs.
LLM Leaderboard for CRM	CRM LLM Leaderboard is a platform to evaluate the efficacy of LLMs for business applications.
LLM Observatory	LLM Observatory is a benchmark that assesses and ranks LLMs based on their performance in avoiding social biases across categories like LGBTIQ+ orientation, age, gender, politics, race, religion, and xenophobia.
LLM Price Leaderboard	LLM Price Leaderboard tracks and compares LLM costs based on one million tokens.
LLM Rankings	LLM Rankings offers a real-time comparison of language models based on normalized token usage for prompts and completions, updated frequently.
LLM Roleplay Leaderboard	LLM Roleplay Leaderboard evaluates human and AI performance in a social werewolf game for NPC development.
LLM Safety Leaderboard	LLM Safety Leaderboard aims to provide a unified evaluation for language model safety.
LLM Use Case Leaderboard	LLM Use Case Leaderboard tracks and evaluates LLMs in business usecases.
LLM-AggreFact	LLM-AggreFact is a fact-checking benchmark that aggregates most up-to-date publicly available datasets on grounded factuality evaluation.
LLM-Leaderboard	LLM-Leaderboard is a joint community effort to create one central leaderboard for LLMs.
LLM-Perf Leaderboard	LLM-Perf Leaderboard aims to benchmark the performance of LLMs with different hardware, backends, and optimizations.
LMExamQA	LMExamQA is a benchmarking framework where a language model acts as an examiner to generate questions and evaluate responses in a reference-free, automated manner for comprehensive, equitable assessment.
LongBench	LongBench is a benchmark for assessing the long context understanding capabilities of LLMs.
Loong	Loong is a long-context benchmark for evaluating LLMs' multi-document QA abilities across financial, legal, and academic scenarios.
Low-bit Quantized Open LLM Leaderboard	Low-bit Quantized Open LLM Leaderboard tracks and compares quantization LLMs with different quantization algorithms.
LV-Eval	LV-Eval is a long-context benchmark with five length levels and advanced techniques for accurate evaluation of LLMs on single-hop and multi-hop QA tasks across bilingual datasets.
LucyEval	LucyEval offers a thorough assessment of LLMs' performance in various Chinese contexts.
L-Eval	L-Eval is a Long Context Language Model (LCLM) evaluation benchmark to evaluate the performance of handling extensive context.
M3KE	M3KE is a massive multi-level multi-subject knowledge evaluation benchmark to measure the knowledge acquired by Chinese LLMs.
MetaCritique	MetaCritique is a judge that can evaluate human-written or LLMs-generated critique by generating critique.
MINT	MINT is a benchmark to evaluate LLMs' ability to solve tasks with multi-turn interactions by using tools and leveraging natural language feedback.
Mirage	Mirage is a benchmark for medical information retrieval-augmented generation, featuring 7,663 questions from five medical QA datasets and tested with 41 configurations using the MedRag toolkit.
MedBench	MedBench is a benchmark to evaluate the mastery of knowledge and reasoning abilities in medical LLMs.
MedS-Bench	MedS-Bench is a medical benchmark that evaluates LLMs across 11 task categories using 39 diverse datasets.
Meta Open LLM leaderboard	The Meta Open LLM leaderboard serves as a central hub for consolidating data from various open LLM leaderboards into a single, user-friendly visualization page.
MIMIC Clinical Decision Making Leaderboard	MIMIC Clinical Decision Making Leaderboard tracks and evaluates LLms in realistic clinical decision-making for abdominal pathologies.
MixEval	MixEval is a benchmark to evaluate LLMs via by strategically mixing off-the-shelf benchmarks.
ML.ENERGY Leaderboard	ML.ENERGY Leaderboard evaluates the energy consumption of LLMs.
MMedBench	MMedBench is a medical benchmark to evaluate LLMs in multilingual comprehension.
MMLU	MMLU is a benchmark to evaluate the performance of LLMs across a wide array of natural language understanding tasks.
MMLU-by-task Leaderboard	MMLU-by-task Leaderboard provides a platform for evaluating and comparing various ML models across different language understanding tasks.
MMLU-Pro	MMLU-Pro is a more challenging version of MMLU to evaluate the reasoning capabilities of LLMs.
ModelScope LLM Leaderboard	ModelScope LLM Leaderboard is a platform to evaluate LLMs objectively and comprehensively.
Model Evaluation Leaderboard	Model Evaluation Leaderboard tracks and evaluates text generation models based on their performance across various benchmarks using Mosaic Eval Gauntlet framework.
MSNP Leaderboard	MSNP Leaderboard tracks and evaluates quantized GGUF models' performance on various GPU and CPU combinations using single-node setups via Ollama.
MSTEB	MSTEB is a benchmark for measuring the performance of text embedding models in Spanish.
MTEB	MTEB is a massive benchmark for measuring the performance of text embedding models on diverse embedding tasks across 112 languages.
MTEB Arena	MTEB Arena host a model arena for dynamic, real-world assessment of embedding models through user-based query and retrieval comparisons.
MT-Bench-101	MT-Bench-101 is a fine-grained benchmark for evaluating LLMs in multi-turn dialogues.
MY Malay LLM Leaderboard	MY Malay LLM Leaderboard aims to track, rank, and evaluate open LLMs on Malay tasks.
NoCha	NoCha is a benchmark to evaluate how well long-context language models can verify claims written about fictional books.
NPHardEval	NPHardEval is a benchmark to evaluate the reasoning abilities of LLMs through the lens of computational complexity classes.
Occiglot Euro LLM Leaderboard	Occiglot Euro LLM Leaderboard compares LLMs in four main languages from the Okapi benchmark and Belebele (French, Italian, German, Spanish and Dutch).
OlympiadBench	OlympiadBench is a bilingual multimodal scientific benchmark featuring 8,476 Olympiad-level mathematics and physics problems with expert-level step-by-step reasoning annotations.
OlympicArena	OlympicArena is a benchmark to evaluate the advanced capabilities of LLMs across a broad spectrum of Olympic-level challenges.
oobabooga	Oobabooga is a benchmark to perform repeatable performance tests of LLMs with the oobabooga web UI.
OpenEval	OpenEval is a platform assessto evaluate Chinese LLMs.
OpenLLM Turkish leaderboard	OpenLLM Turkish leaderboard tracks progress and ranks the performance of LLMs in Turkish.
Openness Leaderboard	Openness Leaderboard tracks and evaluates models' transparency in terms of open access to weights, data, and licenses, exposing models that fall short of openness standards.
Openness Leaderboard	Openness Leaderboard is a tool that tracks the openness of instruction-tuned LLMs, evaluating their transparency, data, and model availability.
OpenResearcher	OpenResearcher contains the benchmarking results on various RAG-related systems as a leaderboard.
Open Arabic LLM Leaderboard	Open Arabic LLM Leaderboard tracks progress and ranks the performance of LLMs in Arabic.
Open Chinese LLM Leaderboard	Open Chinese LLM Leaderboard aims to track, rank, and evaluate open Chinese LLMs.
Open CoT Leaderboard	Open CoT Leaderboard tracks LLMs' abilities to generate effective chain-of-thought reasoning traces.
Open Dutch LLM Evaluation Leaderboard	Open Dutch LLM Evaluation Leaderboard tracks progress and ranks the performance of LLMs in Dutch.
Open Financial LLM Leaderboard	Open Financial LLM Leaderboard aims to evaluate and compare the performance of financial LLMs.
Open ITA LLM Leaderboard	Open ITA LLM Leaderboard tracks progress and ranks the performance of LLMs in Italian.
Open Ko-LLM Leaderboard	Open Ko-LLM Leaderboard tracks progress and ranks the performance of LLMs in Korean.
Open LLM Leaderboard	Open LLM Leaderboard tracks progress and ranks the performance of LLMs in English.
Open Medical-LLM Leaderboard	Open Medical-LLM Leaderboard aims to track, rank, and evaluate open LLMs in the medical domain.
Open MLLM Leaderboard	Open MLLM Leaderboard aims to track, rank and evaluate LLMs and chatbots.
Open MOE LLM Leaderboard	OPEN MOE LLM Leaderboard assesses the performance and efficiency of various Mixture of Experts (MoE) LLMs.
Open Multilingual LLM Evaluation Leaderboard	Open Multilingual LLM Evaluation Leaderboard tracks progress and ranks the performance of LLMs in multiple languages.
Open PL LLM Leaderboard	Open PL LLM Leaderboard is a platform for assessing the performance of various LLMs in Polish.
Open Portuguese LLM Leaderboard	Open PT LLM Leaderboard aims to evaluate and compare LLMs in the Portuguese-language tasks.
Open Taiwan LLM leaderboard	Open Taiwan LLM leaderboard showcases the performance of LLMs on various Taiwanese Mandarin language understanding tasks.
Open-LLM-Leaderboard	Open-LLM-Leaderboard evaluates LLMs in language understanding and reasoning by transitioning from multiple-choice questions (MCQs) to open-style questions.
OPUS-MT Dashboard	OPUS-MT Dashboard is a platform to track and compare machine translation models across multiple language pairs and metrics.
OR-Bench	OR-Bench is a benchmark to evaluate the over-refusal of enhanced safety in LLMs.
ParsBench	ParsBench provides toolkits for benchmarking LLMs based on the Persian language.
Persian LLM Leaderboard	Persian LLM Leaderboard provides a reliable evaluation of LLMs in Persian Language.
Pinocchio ITA leaderboard	Pinocchio ITA leaderboard tracks and evaluates LLMs in Italian Language.
PL-MTEB	PL-MTEB (Polish Massive Text Embedding Benchmark) is a benchmark for evaluating text embeddings in Polish across 28 NLP tasks.
Polish Medical Leaderboard	Polish Medical Leaderboard evaluates language models on Polish board certification examinations.
Powered-by-Intel LLM Leaderboard	Powered-by-Intel LLM Leaderboard evaluates, scores, and ranks LLMs that have been pre-trained or fine-tuned on Intel Hardware.
PubMedQA	PubMedQA is a benchmark to evaluate biomedical research question answering.
PromptBench	PromptBench is a benchmark to evaluate the robustness of LLMs on adversarial prompts.
QAConv	QAConv is a benchmark for question answering using complex, domain-specific, and asynchronous conversations as the knowledge source.
QuALITY	QuALITY is a benchmark for evaluating multiple-choice question-answering with a long context.
RABBITS	RABBITS is a benchmark to evaluate the robustness of LLMs by evaluating their handling of synonyms, specifically brand and generic drug names.
Rakuda	Rakuda is a benchmark to evaluate LLMs based on how well they answer a set of open-ended questions about Japanese topics.
RedTeam Arena	RedTeam Arena is a red-teaming platform for LLMs.
Red Teaming Resistance Benchmark	Red Teaming Resistance Benchmark is a benchmark to evaluate the robustness of LLMs against red teaming prompts.
ReST-MCTS*	ReST-MCTS* is a reinforced self-training method that uses tree search and process reward inference to collect high-quality reasoning traces for training policy and reward models without manual step annotations.
Reviewer Arena	Reviewer Arena hosts the reviewer arena, where various LLMs compete based on their performance in critiquing academic papers.
RoleEval	RoleEval is a bilingual benchmark to evaluate the memorization, utilization, and reasoning capabilities of role knowledge of LLMs.
RPBench Leaderboard	RPBench-Auto is an automated pipeline for evaluating LLMs using 80 personae for character-based and 80 scenes for scene-based role-playing.
Russian Chatbot Arena	Chatbot Arena hosts a chatbot arena where various LLMs compete in Russian based on user satisfaction.
Russian SuperGLUE	Russian SuperGLUE is a benchmark for Russian language models, focusing on logic, commonsense, and reasoning tasks.
R-Judge	R-Judge is a benchmark to evaluate the proficiency of LLMs in judging and identifying safety risks given agent interaction records.
Safety Prompts	Safety Prompts is a benchmark to evaluate the safety of Chinese LLMs.
SafetyBench	SafetyBench is a benchmark to evaluate the safety of LLMs.
SALAD-Bench	SALAD-Bench is a benchmark for evaluating the safety and security of LLMs.
ScandEval	ScandEval is a benchmark to evaluate LLMs on tasks in Scandinavian languages as well as German, Dutch, and English.
Science Leaderboard	Science Leaderboard is a platform to evaluate LLMs' capabilities to solve science problems.
SciGLM	SciGLM is a suite of scientific language models that use a self-reflective instruction annotation framework to enhance scientific reasoning by generating and revising step-by-step solutions to unlabelled questions.
SciKnowEval	SciKnowEval is a benchmark to evaluate LLMs based on their proficiency in studying extensively, enquiring earnestly, thinking profoundly, discerning clearly, and practicing assiduously.
SCROLLS	SCROLLS is a benchmark to evaluate the reasoning capabilities of LLMs over long texts.
SeaExam	SeaExam is a benchmark to evaluate LLMs for Southeast Asian (SEA) languages.
SEAL LLM Leaderboards	SEAL LLM Leaderboards is an expert-driven private evaluation platform for LLMs.
SeaEval	SeaEval is a benchmark to evaluate the performance of multilingual LLMs in understanding and reasoning with natural language, as well as comprehending cultural practices, nuances, and values.
SEA HELM	SEA HELM is a benchmark to evaluate LLMs' performance across English and Southeast Asian tasks, focusing on chat, instruction-following, and linguistic capabilities.
SecEval	SecEval is a benchmark to evaluate cybersecurity knowledge of foundation models.
Self-Improving Leaderboard	Self-Improving Leaderboard (SIL) is a dynamic platform that continuously updates test datasets and rankings to provide real-time performance insights for open-source LLMs and chatbots.
Spec-Bench	Spec-Bench is a benchmark to evaluate speculative decoding methods across diverse scenarios.
StructEval	StructEval is a benchmark to evaluate LLMs by conducting structured assessments across multiple cognitive levels and critical concepts.
Subquadratic LLM Leaderboard	Subquadratic LLM Leaderboard evaluates LLMs with subquadratic/attention-free architectures (i.e. RWKV & Mamba).
SuperBench	SuperBench is a comprehensive system of tasks and dimensions to evaluate the overall capabilities of LLMs.
SuperGLUE	SuperGLUE is a benchmark to evaluate the performance of LLMs on a set of challenging language understanding tasks.
SuperLim	SuperLim is a benchmark to evaluate the language understanding capabilities of LLMs in Swedish.
Swahili LLM-Leaderboard	Swahili LLM-Leaderboard is a joint community effort to create one central leaderboard for LLMs.
S-Eval	S-Eval is a comprehensive, multi-dimensional safety benchmark with 220,000 prompts designed to evaluate LLM safety across various risk dimensions.
TableQAEval	TableQAEval is a benchmark to evaluate LLM performance in modeling long tables and comprehension capabilities, such as numerical and multi-hop reasoning.
TAT-DQA	TAT-DQA is a benchmark to evaluate LLMs on the discrete reasoning over documents that combine both structured and unstructured information.
TAT-QA	TAT-QA is a benchmark to evaluate LLMs on the discrete reasoning over documents that combines both tabular and textual content.
Thai LLM Leaderboard	Thai LLM Leaderboard aims to track and evaluate LLMs in the Thai-language tasks.
The Pile	The Pile is a benchmark to evaluate the world knowledge and reasoning ability of LLMs.
TOFU	TOFU is a benchmark to evaluate the unlearning performance of LLMs in realistic scenarios.
Toloka LLM Leaderboard	Toloka LLM Leaderboard is a benchmark to evaluate LLMs based on authentic user prompts and expert human evaluation.
Toolbench	ToolBench is a platform for training, serving, and evaluating LLMs specifically for tool learning.
Toxicity Leaderboard	Toxicity Leaderboard evaluates the toxicity of LLMs.
Trustbit LLM Leaderboards	Trustbit LLM Leaderboards is a platform that provides benchmarks for building and shipping products with LLMs.
TrustLLM	TrustLLM is a benchmark to evaluate the trustworthiness of LLMs.
TuringAdvice	TuringAdvice is a benchmark for evaluating language models' ability to generate helpful advice for real-life, open-ended situations.
TutorEval	TutorEval is a question-answering benchmark which evaluates how well an LLM tutor can help a user understand a chapter from a science textbook.
T-Eval	T-Eval is a benchmark for evaluating the tool utilization capability of LLMs.
UGI Leaderboard	UGI Leaderboard measures and compares the uncensored and controversial information known by LLMs.
UltraEval	UltraEval is an open-source framework for transparent and reproducible benchmarking of LLMs across various performance dimensions.
Vals AI	Vals AI is a platform evaluating generative AI accuracy and efficacy on real-world legal tasks.
VCR	Visual Commonsense Reasoning (VCR) is a benchmark for cognition-level visual understanding, requiring models to answer visual questions and provide rationales for their answers.
ViDoRe	ViDoRe is a benchmark to evaluate retrieval models on their capacity to match queries to relevant documents at the page level.
VLLMs Leaderboard	VLLMs Leaderboard aims to track, rank and evaluate open LLMs and chatbots.
VMLU	VMLU is a benchmark to evaluate overall capabilities of foundation models in Vietnamese.
WildBench	WildBench is a benchmark for evaluating language models on challenging tasks that closely resemble real-world applications.
Xiezhi	Xiezhi is a benchmark for holistic domain knowledge evaluation of LLMs.
Yanolja Arena	Yanolja Arena host a model arena to evaluate the capabilities of LLMs in summarizing and translating text.
Yet Another LLM Leaderboard	Yet Another LLM Leaderboard is a platform for tracking, ranking, and evaluating open LLMs and chatbots.
ZebraLogic	ZebraLogic is a benchmark evaluating LLMs' logical reasoning using Logic Grid Puzzles, a type of Constraint Satisfaction Problem (CSP).
ZeroSumEval	ZeroSumEval is a competitive evaluation framework for LLMs using multiplayer simulations with clear win conditions.

Image

Name	Description
Abstract Image	Abstract Image is a benchmark to evaluate multimodal LLMs (MLLM) in understanding and visually reasoning about abstract images, such as maps, charts, and layouts.
AesBench	AesBench is a benchmark to evaluate MLLMs on image aesthetics perception.
BLINK	BLINK is a benchmark to evaluate the core visual perception abilities of MLLMs.
BlinkCode	BlinkCode is a benchmark to evaluate MLLMs across 15 vision-language models (VLMs) and 9 tasks, measuring accuracy and image reconstruction performance.
CARES	CARES is a benchmark to evaluate the trustworthiness of Med-LVLMs across trustfulness, fairness, safety, privacy, and robustness using 41K question-answer pairs from 16 medical image modalities and 27 anatomical regions.
ChartMimic	ChartMimic is a benchmark to evaluate the visually-grounded code generation capabilities of large multimodal models using charts and textual instructions.
CharXiv	CharXiv is a benchmark to evaluate chart understanding capabilities of MLLMs.
ConTextual	ConTextual is a benchmark to evaluate MLLMs across context-sensitive text-rich visual reasoning tasks.
CORE-MM	CORE-MM is a benchmark to evaluate the open-ended visual question-answering (VQA) capabilities of MLLMs.
DreamBench++	DreamBench++ is a human-aligned benchmark automated by multimodal models for personalized image generation.
EgoPlan-Bench	EgoPlan-Bench is a benchmark to evaluate planning abilities of MLLMs in real-world, egocentric scenarios.
GlitchBench	GlitchBench is a benchmark to evaluate the reasoning capabilities of MLLMs in the context of detecting video game glitches.
HallusionBench	HallusionBench is a benchmark to evaluate the image-context reasoning capabilities of MLLMs.
InfiMM-Eval	InfiMM-Eval is a benchmark to evaluate the open-ended VQA capabilities of MLLMs.
LRVSF Leaderboard	LRVSF Leaderboard is a platform to evaluate LLMs regarding image similarity search in fashion.
LVLM Leaderboard	LVLM Leaderboard is a platform to evaluate the visual reasoning capabilities of MLLMs.
M3CoT	M3CoT is a benchmark for multi-domain multi-step multi-modal chain-of-thought of MLLMs.
Mementos	Mementos is a benchmark to evaluate the reasoning capabilities of MLLMs over image sequences.
MJ-Bench	MJ-Bench is a benchmark to evaluate multimodal judges in providing feedback for image generation models across four key perspectives: alignment, safety, image quality, and bias.
MLLM-as-a-Judge	MLLM-as-a-Judge is a benchmark with human annotations to evaluate MLLMs' judging capabilities in scoring, pair comparison, and batch ranking tasks across multimodal domains.
MLLM-Bench	MLLM-Bench is a benchmark to evaluate the visual reasoning capabilities of MLVMs.
MMBench Leaderboard	MMBench Leaderboard is a platform to evaluate the visual reasoning capabilities of MLLMs.
MME	MME is a benchmark to evaluate the visual reasoning capabilities of MLLMs.
MME-RealWorld	MME-RealWorld is a large-scale, high-resolution benchmark featuring 29,429 human-annotated QA pairs across 43 tasks.
MMIU	MMIU (Ultimodal Multi-image Understanding) is a benchmark to evaluate MLLMs across 7 multi-image relationships, 52 tasks, 77K images, and 11K curated multiple-choice questions.
MMMU	MMMU is a benchmark to evaluate the performance of multimodal models on tasks that demand college-level subject knowledge and expert-level reasoning across various disciplines.
MMR	MMR is a benchmark to evaluate the robustness of MLLMs in visual understanding by assessing their ability to handle leading questions, rather than just accuracy in answering.
MMSearch	MMSearch is a benchmark to evaluate the multimodal search performance of LMMs.
MMStar	MMStar is a benchmark to evaluate the multi-modal capacities of MLLMs.
MMT-Bench	MMT-Bench is a benchmark to evaluate MLLMs across a wide array of multimodal tasks that require expert knowledge as well as deliberate visual recognition, localization, reasoning, and planning.
MM-NIAH	MM-NIAH (Needle In A Multimodal Haystack) is a benchmark to evaluate MLLMs' ability to comprehend long multimodal documents through retrieval, counting, and reasoning tasks involving both text and image data.
MTVQA	MTVQA is a multilingual visual text comprehension benchmark to evaluate MLLMs.
Multimodal Hallucination Leaderboard	Multimodal Hallucination Leaderboard compares MLLMs based on hallucination levels in various tasks.
MULTI-Benchmark	MULTI-Benchmark is a benchmark to evaluate MLLMs on understanding complex tables and images, and reasoning with long context.
MultiTrust	MultiTrust is a benchmark to evaluate the trustworthiness of MLLMs across five primary aspects: truthfulness, safety, robustness, fairness, and privacy.
NPHardEval4V	NPHardEval4V is a benchmark to evaluate the reasoning abilities of MLLMs through the lens of computational complexity classes.
Provider Leaderboard	LLM API Providers Leaderboard is a platform to compare API provider performance for over LLM endpoints across performance key metrics.
OCRBench	OCRBench is a benchmark to evaluate the OCR capabilities of multimodal models.
PCA-Bench	PCA-Bench is a benchmark to evaluate the embodied decision-making capabilities of multimodal models.
Q-Bench	Q-Bench is a benchmark to evaluate the visual reasoning capabilities of MLLMs.
RewardBench	RewardBench is a benchmark to evaluate the capabilities and safety of reward models.
ScienceQA	ScienceQA is a benchmark used to evaluate the multi-hop reasoning ability and interpretability of AI systems in the context of answering science questions.
SciGraphQA	SciGraphQA is a benchmark to evaluate the MLLMs in scientific graph question-answering.
SEED-Bench	SEED-Bench is a benchmark to evaluate the text and image generation of multimodal models.
URIAL	URIAL is a benchmark to evaluate the capacity of language models for alignment without introducing the factors of fine-tuning (learning rate, data, etc.), which are hard to control for fair comparisons.
UPD Leaderboard	UPD Leaderboard is a platform to evaluate the trustworthiness of MLLMs in unsolvable problem detection.
Vibe-Eval	Vibe-Eval is a benchmark to evaluate MLLMs for challenging cases.
VideoHallucer	VideoHallucer is a benchmark to detect hallucinations in MLLMs.
VisIT-Bench	VisIT-Bench is a benchmark to evaluate the instruction-following capabilities of MLLMs for real-world use.
Waymo Open Dataset Challenges	Waymo Open Dataset Challenges hold diverse self-driving datasets to evaluate ML models.
WHOOPS!	WHOOPS! is a benchmark to evaluate the visual commonsense reasoning abilities of MLLMs.
WildVision-Bench	WildVision-Bench is a benchmark to evaluate VLMs in the wild with human preferences.
WildVision Arena	WildVision Arena hosts the chatbot arena where various MLLMs compete based on their performance in visual understanding.

Code

Name	Description
Aider LLM Leaderboards	Aider LLM Leaderboards evaluate LLM's ability to follow system prompts to edit code.
AppWorld	AppWorld is a high-fidelity execution environment of 9 day-to-day apps, operable via 457 APIs, populated with digital activities of ~100 people living in a simulated world.
Berkeley Function-Calling Leaderboard	Berkeley Function-Calling Leaderboard evaluates the ability of LLMs to call functions (also known as tools) accurately.
BigCodeBench	BigCodeBench is a benchmark for code generation with practical and challenging programming tasks.
Big Code Models Leaderboard	Big Code Models Leaderboard is a platform to track and evaluate the performance of LLMs on code-related tasks.
BIRD	BIRD is a benchmark to evaluate the performance of text-to-SQL parsing systems.
BookSQL	BookSQL is a benchmark to evaluate Text-to-SQL systems in the finance and accounting domain across various industries with a dataset of 1 million transactions from 27 businesses.
CanAiCode Leaderboard	CanAiCode Leaderboard is a platform to evaluate the code generation capabilities of LLMs.
ClassEval	ClassEval is a benchmark to evaluate LLMs on class-level code generation.
CodeApex	CodeApex is a benchmark to evaluate LLMs' programming comprehension through multiple-choice questions and code generation with C++ algorithm problems.
CodeScope	CodeScope is a benchmark to evaluate LLM coding capabilities across 43 languages and 8 tasks, considering difficulty, efficiency, and length.
CodeTransOcean	CodeTransOcean is a benchmark to evaluate code translation across a wide variety of programming languages, including popular, niche, and LLM-translated code.
Code Lingua	Code Lingua is a benchmark to compare the ability of code models to understand what the code implements in source languages and translate the same semantics in target languages.
Coding LLMs Leaderboard	Coding LLMs Leaderboard is a platform to evaluate and rank LLMs across various programming tasks.
Commit-0	Commit-0 is a from-scratch AI coding challenge to rebuild 54 core Python libraries, ensuring they pass unit tests with significant test coverage, lint/type checking, and cloud-based distributed development.
CRUXEval	CRUXEval is a benchmark to evaluate code reasoning, understanding, and execution capabilities of LLMs.
CSpider	CSpider is a benchmark to evaluate systems' ability to generate SQL queries from Chinese natural language across diverse, complex, and cross-domain databases.
CyberSecEval	CyberSecEval is a benchmark to evaluate the cybersecurity of LLMs as coding assistants.
DevOps AI Assistant Open Leaderboard	DevOps AI Assistant Open Leaderboard tracks, ranks, and evaluates DevOps AI Assistants across knowledge domains.
DevOps-Eval	DevOps-Eval is a benchmark to evaluate code models in the DevOps/AIOps field.
DomainEval	DomainEval is an auto-constructed benchmark for multi-domain code generation.
Dr.Spider	Dr.Spider is a benchmark to evaluate the robustness of text-to-SQL models using different perturbation test sets.
EffiBench	EffiBench is a benchmark to evaluate the efficiency of LLMs in code generation.
EvalPlus	EvalPlus is a benchmark to evaluate the code generation performance of LLMs.
EvoCodeBench	EvoCodeBench is an evolutionary code generation benchmark aligned with real-world code repositories.
EvoEval	EvoEval is a benchmark to evaluate the coding abilities of LLMs, created by evolving existing benchmarks into different targeted domains.
InfiBench	InfiBench is a benchmark to evaluate code models on answering freeform real-world code-related questions.
InterCode	InterCode is a benchmark to standardize and evaluate interactive coding with execution feedback.
Julia LLM Leaderboard	Julia LLM Leaderboard is a platform to compare code models' abilities in generating syntactically correct Julia code, featuring structured tests and automated evaluations for easy and collaborative benchmarking.
LiveCodeBench	LiveCodeBench is a benchmark to evaluate code models across code-related scenarios over time.
Long Code Arena	Long Code Arena is a suite of benchmarks for code-related tasks with large contexts, up to a whole code repository.
McEval	McEval is a massively multilingual code evaluation benchmark covering 40 languages (16K samples in 44 total), encompassing multilingual code generation, multilingual code explanation, and multilingual code completion tasks.
Memorization or Generation of Big Code Models Leaderboard	Memorization or Generation of Big Code Models Leaderboard tracks and compares code generation models' performance.
Multi-SWE-bench	Multi-SWE-bench is a multi-lingual GitHub issue resolving benchmark for code agents.
NaturalCodeBench	NaturalCodeBench is a benchmark to mirror the complexity and variety of scenarios in real coding tasks.
Nexus Function Calling Leaderboard	Nexus Function Calling Leaderboard is a platform to evaluate code models on performing function calling and API usage.
NL2SQL360	NL2SQL360 is a comprehensive evaluation framework for comparing and optimizing NL2SQL methods across various application scenarios.
PECC	PECC is a benchmark that evaluates code generation by requiring models to comprehend and extract problem requirements from narrative-based descriptions to produce syntactically accurate solutions.
ProLLM Benchmarks	ProLLM Benchmarks is a practical and reliable LLM benchmark designed for real-world business use cases across multiple industries and programming languages.
PyBench	PyBench is a benchmark evaluating LLM on real-world coding tasks including chart analysis, text analysis, image/ audio editing, complex math and software/website development.
RACE	RACE is a benchmark to evaluate the ability of LLMs to generate code that is correct and meets the requirements of real-world development scenarios.
RepoQA	RepoQA is a benchmark to evaluate the long-context code understanding ability of LLMs.
SciCode	SciCode is a benchmark designed to evaluate language models in generating code to solve realistic scientific research problems.
SolidityBench	SolidityBench is a benchmark to evaluate and rank the ability of LLMs in generating and auditing smart contracts.
Spider	Spider is a benchmark to evaluate the performance of natural language interfaces for cross-domain databases.
StableToolBench	StableToolBench is a benchmark to evaluate tool learning that aims to provide a well-balanced combination of stability and reality.
SWE-bench	SWE-bench is a benchmark for evaluating LLMs on real-world software issues collected from GitHub.
WebApp1K	WebApp1K is a benchmark to evaluate LLMs on their abilities to develop real-world web applications.
WebDev Arena	WebDev Arena hosts a chatbot arena where various LLMs compete based on website development.
WILDS	WILDS is a benchmark of in-the-wild distribution shifts spanning diverse data modalities and applications, from tumor identification to wildlife monitoring to poverty mapping.

Video

Name	Description
ChronoMagic-Bench	ChronoMagic-Bench is a benchmark to evaluate video models' ability to generate time-lapse videos with high metamorphic amplitude and temporal coherence across physics, biology, and chemistry domains using free-form text control.
DREAM-1K	DREAM-1K is a benchmark to evaluate video description performance on 1,000 diverse video clips featuring rich events, actions, and motions from movies, animations, stock videos, YouTube, and TikTok-style short videos.
LongVideoBench	LongVideoBench is a benchmark to evaluate the capabilities of video models in answering referred reasoning questions, which are dependent on long frame inputs and cannot be well-addressed by a single frame or a few sparse frames.
LVBench	LVBench is a benchmark to evaluate multimodal models on long video understanding tasks requiring extended memory and comprehension capabilities.
MLVU	MLVU is a benchmark to evaluate video models in multi-task long video understanding.
MMToM-QA	MMToM-QA is a multimodal benchmark to evaluate machine Theory of Mind (ToM), the ability to understand people's minds.
MVBench	MVBench is a benchmark to evaluate the temporal understanding capabilities of video models in dynamic video tasks.
OpenVLM Video Leaderboard	OpenVLM Video Leaderboard is a platform showcasing the evaluation results of 30 different VLMs on video understanding benchmarks using the VLMEvalKit framework.
TempCompass	TempCompass is a benchmark to evaluate Video LLMs' temporal perception using 410 videos and 7,540 task instructions across 11 temporal aspects and 4 task types.
VBench	VBench is a benchmark to evaluate video generation capabilities of video models.
VideoNIAH	VideoNIAH is a benchmark to evaluate the fine-grained understanding and spatio-temporal modeling capabilities of video models.
VideoPhy	VideoPhy is a benchmark to evaluate generated videos for adherence to physical commonsense in real-world material interactions.
VideoScore	VideoScore is a benchmark to evaluate text-to-video generative models on five key dimensions.
VideoVista	VideoVista is a benchmark with 25,000 questions from 3,400 videos across 14 categories, covering 19 understanding and 8 reasoning tasks.
Video-Bench	Video-Bench is a benchmark to evaluate the video-exclusive understanding, prior knowledge incorporation, and video-based decision-making abilities of video models.
Video-MME	Video-MME is a benchmark to evaluate the video analysis capabilities of video models.

Math

Name	Description
Abel	Abel is a platform to evaluate the mathematical capabilities of LLMs.
MathBench	MathBench is a multi-level difficulty mathematics evaluation benchmark for LLMs.
MathEval	MathEval is a benchmark to evaluate the mathematical capabilities of LLMs.
MathUserEval	MathUserEval is a benchmark featuring university exam questions and math-related queries derived from simulated conversations with experienced annotators.
MathVerse	MathVerse is a benchmark to evaluate vision-language models in interpreting and reasoning with visual information in mathematical problems.
MathVista	MathVista is a benchmark to evaluate mathematical reasoning in visual contexts.
MATH-V	MATH-Vision (MATH-V) is a benchmark of 3,040 visually contextualized math problems from competitions, covering 16 disciplines and 5 difficulty levels to evaluate LMMs' mathematical reasoning.
Open Multilingual Reasoning Leaderboard	Open Multilingual Reasoning Leaderboard tracks and ranks the reasoning performance of LLMs on multilingual mathematical reasoning benchmarks.
PutnamBench	PutnamBench is a benchmark to evaluate the formal mathematical reasoning capabilities of LLMs on the Putnam Competition.
SciBench	SciBench is a benchmark to evaluate the reasoning capabilities of LLMs for solving complex scientific problems.
TabMWP	TabMWP is a benchmark to evaluate LLMs in mathematical reasoning tasks that involve both textual and tabular data.
We-Math	We-Math is a benchmark to evaluate the human-like mathematical reasoning capabilities of LLMs with problem-solving principles beyond the end-to-end performance.

Agent

Name	Description
AgentBench	AgentBench is the benchmark to evaluate language model-as-Agent across a diverse spectrum of different environments.
AgentStudio	AgentStudio is an integrated solution featuring in-depth benchmark suites, realistic environments, and comprehensive toolkits.
CharacterEval	CharacterEval is a benchmark to evaluate Role-Playing Conversational Agents (RPCAs) using multi-turn dialogues and character profiles, with metrics spanning four dimensions.
GTA	GTA is a benchmark to evaluate the tool-use capability of LLM-based agents in real-world scenarios.
Leetcode-Hard Gym	Leetcode-Hard Gym is an RL environment interface to LeetCode's submission server for evaluating codegen agents.
LLM Colosseum Leaderboard	LLM Colosseum Leaderboard is a platform to evaluate LLMs by fighting in Street Fighter 3.
MAgIC	MAgIC is a benchmark to measure the abilities of cognition, adaptability, rationality and collaboration of LLMs within multi-agent sytems.
Olas Predict Benchmark	Olas Predict Benchmark is a benchmark to evaluate agents on historical and future event forecasting.
TravelPlanner	TravelPlanner is a benchmark to evaluate LLM agents in tool use and complex planning within multiple constraints.
VAB	VisualAgentBench (VAB) is a benchmark to evaluate and develop LMMs as visual foundation agents, which comprises 5 distinct environments across 3 types of representative visual agent tasks.
VisualWebArena	VisualWebArena is a benchmark to evaluate the performance of multimodal web agents on realistic visually grounded tasks.
WebAgent Leaderboard	WebAgent Leaderboard tracks and evaluates LLMs, VLMs, and agents on web navigation tasks.
WebArena	WebArena is a standalone, self-hostable web environment to evaluate autonomous agents.
γ-Bench	γ-Bench is a framework for evaluating LLMs' gaming abilities in multi-agent environments using eight classical game theory scenarios and a dynamic scoring scheme.
τ-Bench	τ-bench is a benchmark that emulates dynamic conversations between a language model-simulated user and a language agent equipped with domain-specific API tools and policy guidelines.

Audio

Name	Description
AIR-Bench	AIR-Bench is a benchmark to evaluate the ability of audio models to understand various types of audio signals (including human speech, natural sounds and music), and furthermore, to interact with humans in textual format.
AudioBench	AudioBench is a benchmark for general instruction-following audio models.
Open ASR Leaderboard	Open ASR Leaderboard provides a platform for tracking, ranking, and evaluating Automatic Speech Recognition (ASR) models.
Polish ASR Leaderboard	Polish ASR leaderboard aims to provide comprehensive overview of performance of ASR/STT systems for Polish.
SALMon	SALMon is an evaluation suite that benchmarks speech language models on consistency, background noise, emotion, speaker identity, and room impulse response.
TTS Arena	TTS-Arena hosts the Text To Speech (TTS) arena, where various TTS models compete based on their performance in generating speech.
Whisper Leaderboard	Whisper Leaderboard is a platform tracking and comparing audio models' speech recognition performance on various datasets.

3D

Name	Description
3D Arena	3D Arena hosts 3D generation arena, where various 3D generative models compete based on their performance in generating 3D models.
3D-POPE	3D-POPE is a benchmark to evaluate object hallucination in 3D generative models.
3DGen Arena	3DGen Arena hosts the 3D generation arena, where various 3D generative models compete based on their performance in generating 3D models.
BOP	BOP is a benchmark for 6D pose estimation of a rigid object from a single RGB-D input image.
GPTEval3D	GPTEval3D is a benchmark to evaluate MLLMs' capabiltiies of 3D content understanding via multi-view images as input.

Multimodal

| GenAI Arena | GenAI Arena hosts the visual generation arena, where various vision models compete based on their performance in image generation, image edition, and video generation. | | Labelbox Leaderboards | Labelbox Leaderboards evaluate performance of generative AI models using their data factory: platform, scientific process and expert humans. | MEGA-Bench | MEGA-Bench is a benchmark for multimodal evaluation with diverse tasks across 8 application types, 7 input formats, 6 output formats, and 10 multimodal skills, spanning single-image, multi-image, and video tasks. |

Database Ranking

Name	Description
VectorDBBench	VectorDBBench is a benchmark to evaluate performance, cost-effectiveness, and scalability of various vector databases and cloud-based vector database services.

Dataset Ranking

Name	Description
DataComp	DataComp is a benchmark to evaluate the performance of various datasets with a fixed model architecture.

Metric Ranking

Name	Description
AlignScore	AlignScore evaluates the performance of different metrics in assessing factual consistency.

Paper Ranking

Name	Description
Papers Leaderboard	Papers Leaderboard is a platform to evaluate the popularity of machine learning papers.

Leaderboard Ranking

Name	Description
Open Leaderboards Leaderboard	Open Leaderboards Leaderboard is a meta-leaderboard that leverages human preferences to compare machine learning leaderboards.