ACLUE | ACLUE is an evaluation benchmark for ancient Chinese language comprehension. |
African Languages LLM Eval Leaderboard | African Languages LLM Eval Leaderboard tracks progress and ranks performance of LLMs on African languages. |
AgentBoard | AgentBoard is a benchmark for multi-turn LLM agents, complemented by an analytical evaluation board for detailed model assessment beyond final success rates. |
AGIEval | AGIEval is a human-centric benchmark to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving. |
Aiera Leaderboard | Aiera Leaderboard evaluates LLM performance on financial intelligence tasks, including speaker assignments, speaker change identification, abstractive summarizations, calculation-based Q&A, and financial sentiment tagging. |
AIR-Bench | AIR-Bench is a benchmark to evaluate heterogeneous information retrieval capabilities of language models. |
AI Energy Score Leaderboard | AI Energy Score Leaderboard tracks and compares different models in energy efficiency. |
ai-benchmarks | ai-benchmarks contains a handful of evaluation results for the response latency of popular AI services. |
AlignBench | AlignBench is a multi-dimensional benchmark for evaluating LLMs' alignment in Chinese. |
AlpacaEval | AlpacaEval is an automatic evaluator designed for instruction-following LLMs. |
ANGO | ANGO is a generation-oriented Chinese language model evaluation benchmark. |
Arabic Tokenizers Leaderboard | Arabic Tokenizers Leaderboard compares the efficiency of LLM tokenizers in parsing Arabic in its different dialects and forms. |
Arena-Hard-Auto | Arena-Hard-Auto is an automatic evaluation benchmark for instruction-tuned LLMs. |
AutoRace | AutoRace focuses on the direct evaluation of LLM reasoning chains using the AutoRace metric (Automated Reasoning Chain Evaluation). |
Auto Arena | Auto Arena is a benchmark in which various language model agents engage in peer-battles to evaluate their performance. |
Auto-J | Auto-J hosts evaluation results on the pairwise response comparison and critique generation tasks. |
BABILong | BABILong is a benchmark for evaluating the performance of language models in processing arbitrarily long documents with distributed facts. |
BBL | BBL (BIG-bench Lite) is a small subset of 24 diverse JSON tasks from BIG-bench. It is designed to provide a canonical measure of model performance, while being far cheaper to evaluate than the full set of more than 200 programmatic and JSON tasks in BIG-bench. |
BeHonest | BeHonest is a benchmark to evaluate honesty - awareness of knowledge boundaries (self-knowledge), avoidance of deceit (non-deceptiveness), and consistency in responses (consistency) - in LLMs. |
BenBench | BenBench is a benchmark to evaluate the extent to which LLMs conduct verbatim training on the training set of a benchmark over the test set to enhance capabilities. |
BenCzechMark | BenCzechMark (BCM) is a multitask and multimetric Czech language benchmark for LLMs with a unique scoring system that utilizes the theory of statistical significance. |
BiGGen-Bench | BiGGen-Bench is a comprehensive benchmark to evaluate LLMs across a wide variety of tasks. |
BotChat | BotChat is a benchmark to evaluate the multi-round chatting capabilities of LLMs through a proxy task. |
CaselawQA | CaselawQA is a benchmark comprising legal classification tasks derived from the Supreme Court and Songer Court of Appeals legal databases. |
CFLUE | CFLUE is a benchmark to evaluate LLMs' understanding and processing capabilities in the Chinese financial domain. |
Ch3Ef | Ch3Ef is a benchmark to evaluate alignment with human expectations using 1002 human-annotated samples across 12 domains and 46 tasks based on the HHH (helpful, honest, harmless) principle. |
Chain-of-Thought Hub | Chain-of-Thought Hub is a benchmark to evaluate the reasoning capabilities of LLMs. |
Chatbot Arena | Chatbot Arena hosts a chatbot arena where various LLMs compete based on user satisfaction. |
ChemBench | ChemBench is a benchmark to evaluate the chemical knowledge and reasoning abilities of LLMs. |
Chinese SimpleQA | Chinese SimpleQA is a Chinese benchmark to evaluate the factuality ability of language models to answer short questions. |
CLEM Leaderboard | CLEM is a framework designed for the systematic evaluation of chat-optimized LLMs as conversational agents. |
CLEVA | CLEVA is a benchmark to evaluate LLMs on 31 tasks using 370K Chinese queries from 84 diverse datasets and 9 metrics. |
Chinese Large Model Leaderboard | Chinese Large Model Leaderboard is a platform to evaluate the performance of Chinese LLMs. |
CMB | CMB is a multi-level medical benchmark in Chinese. |
CMMLU | CMMLU is a benchmark to evaluate the performance of LLMs in various subjects within the Chinese cultural context. |
CMMMU | CMMMU is a benchmark to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context. |
CommonGen | CommonGen is a benchmark to evaluate generative commonsense reasoning by testing machines on their ability to compose coherent sentences using a given set of common concepts. |
CompMix | CompMix is a benchmark for heterogeneous question answering. |
Compression Rate Leaderboard | Compression Rate Leaderboard aims to evaluate tokenizer performance on different languages. |
Compression Leaderboard | Compression Leaderboard is a platform to evaluate the compression performance of LLMs. |
CopyBench | CopyBench is a benchmark to evaluate the copying behavior and utility of language models as well as the effectiveness of methods to mitigate copyright risks. |
CoTaEval | CoTaEval is a benchmark to evaluate the feasibility and side effects of copyright takedown methods for LLMs. |
ConvRe | ConvRe is a benchmark to evaluate LLMs' ability to comprehend converse relations. |
CriticEval | CriticEval is a benchmark to evaluate LLMs' ability to critique responses. |
CS-Bench | CS-Bench is a bilingual benchmark designed to evaluate LLMs' performance across 26 computer science subfields, focusing on knowledge and reasoning. |
CUTE | CUTE is a benchmark to test the orthographic knowledge of LLMs. |
CyberMetric | CyberMetric is a benchmark to evaluate the cybersecurity knowledge of LLMs. |
CzechBench | CzechBench is a benchmark to evaluate Czech language models. |
C-Eval | C-Eval is a Chinese evaluation suite for LLMs. |
Decentralized Arena Leaderboard | Decentralized Arena hosts a decentralized and democratic platform for LLM evaluation, automating and scaling assessments across diverse, user-defined dimensions, including mathematics, logic, and science. |
DecodingTrust | DecodingTrust is a platform to evaluate the trustworthiness of LLMs. |
Domain LLM Leaderboard | Domain LLM Leaderboard is a platform to evaluate the popularity of domain-specific LLMs. |
Enterprise Scenarios Leaderboard | Enterprise Scenarios Leaderboard tracks and evaluates the performance of LLMs on real-world enterprise use cases. |
EQ-Bench | EQ-Bench is a benchmark to evaluate aspects of emotional intelligence in LLMs. |
European LLM Leaderboard | European LLM Leaderboard tracks and compares performance of LLMs in European languages. |
EvalGPT.ai | EvalGPT.ai hosts a chatbot arena to compare and rank the performance of LLMs. |
Eval Arena | Eval Arena measures noise levels, model quality, and benchmark quality by comparing model pairs across several LLM evaluation benchmarks with example-level analysis and pairwise comparisons. |
Factuality Leaderboard | Factuality Leaderboard compares the factual capabilities of LLMs. |
FanOutQA | FanOutQA is a high-quality, multi-hop, multi-document benchmark for LLMs using English Wikipedia as its knowledge base. |
FastEval | FastEval is a toolkit for quickly evaluating instruction-following and chat language models on various benchmarks with fast inference and detailed performance insights. |
FELM | FELM is a meta-benchmark to evaluate factuality evaluation methods for LLMs. |
FinEval | FinEval is a benchmark to evaluate financial domain knowledge in LLMs. |
Fine-tuning Leaderboard | Fine-tuning Leaderboard is a platform to rank and showcase models that have been fine-tuned using open-source datasets or frameworks. |
Flames | Flames is a highly adversarial Chinese benchmark for evaluating LLMs' value alignment across fairness, safety, morality, legality, and data protection. |
FollowBench | FollowBench is a multi-level fine-grained constraints following benchmark to evaluate the instruction-following capability of LLMs. |
Forbidden Question Dataset | Forbidden Question Dataset is a benchmark containing 160 questions from 160 violation categories, with corresponding targets for evaluating jailbreak methods. |
FuseReviews | FuseReviews aims to advance grounded text generation tasks, including long-form question-answering and summarization. |
GAIA | GAIA aims to test fundamental abilities that an AI assistant should possess. |
GAVIE | GAVIE is a GPT-4-assisted benchmark for evaluating hallucination in LMMs by scoring accuracy and relevancy without relying on human-annotated ground truth. |
GPT-Fathom | GPT-Fathom is an LLM evaluation suite, benchmarking 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings. |
GrailQA | Strongly Generalizable Question Answering (GrailQA) is a large-scale, high-quality benchmark for question answering on knowledge bases (KBQA) on Freebase with 64,331 questions annotated with both answers and corresponding logical forms in different syntax (i.e., SPARQL, S-expression, etc.). |
GTBench | GTBench is a benchmark to evaluate and rank LLMs' reasoning abilities in competitive environments through game-theoretic tasks, e.g., board and card games. |
Guerra LLM AI Leaderboard | Guerra LLM AI Leaderboard compares and ranks LLMs across quality, price, performance, context window, and other factors. |
Hallucinations Leaderboard | Hallucinations Leaderboard aims to track, rank and evaluate hallucinations in LLMs. |
HalluQA | HalluQA is a benchmark to evaluate the phenomenon of hallucinations in Chinese LLMs. |
Hebrew LLM Leaderboard | Hebrew LLM Leaderboard tracks and ranks language models according to their success on various tasks in Hebrew. |
HellaSwag | HellaSwag is a benchmark to evaluate common-sense reasoning in LLMs. |
Hughes Hallucination Evaluation Model leaderboard | Hughes Hallucination Evaluation Model leaderboard is a platform to evaluate how often a language model introduces hallucinations when summarizing a document. |
Icelandic LLM leaderboard | Icelandic LLM leaderboard tracks and compares models on Icelandic-language tasks. |
IFEval | IFEval is a benchmark to evaluate LLMs' instruction following capabilities with verifiable instructions. |
IL-TUR | IL-TUR is a benchmark for evaluating language models on monolingual and multilingual tasks focused on understanding and reasoning over Indian legal documents. |
Indic LLM Leaderboard | Indic LLM Leaderboard is a platform to track and compare the performance of Indic LLMs. |
Indico LLM Leaderboard | Indico LLM Leaderboard evaluates and compares the accuracy of various language models across providers, datasets, and capabilities like text classification, key information extraction, and generative summarization. |
InstructEval | InstructEval is a suite to evaluate instruction selection methods in the context of LLMs. |
Italian LLM-Leaderboard | Italian LLM-Leaderboard tracks and compares LLMs in Italian-language tasks. |
JailbreakBench | JailbreakBench is a benchmark for evaluating LLM vulnerabilities through adversarial prompts. |
Japanese Chatbot Arena | Japanese Chatbot Arena hosts the chatbot arena, where various LLMs compete based on their performance in Japanese. |
Japanese Language Model Financial Evaluation Harness | Japanese Language Model Financial Evaluation Harness is a harness for Japanese language model evaluation in the financial domain. |
Japanese LLM Roleplay Benchmark | Japanese LLM Roleplay Benchmark is a benchmark to evaluate the performance of Japanese LLMs in character roleplay. |
JMED-LLM | JMED-LLM (Japanese Medical Evaluation Dataset for Large Language Models) is a benchmark for evaluating LLMs in the Japanese medical domain. |
JMMMU | JMMMU (Japanese MMMU) is a multimodal benchmark to evaluate LMM performance in Japanese. |
JustEval | JustEval is a powerful tool designed for fine-grained evaluation of LLMs. |
KoLA | KoLA is a benchmark to evaluate the world knowledge of LLMs. |
LaMP | LaMP (Language Models Personalization) is a benchmark to evaluate personalization capabilities of language models. |
Language Model Council | Language Model Council (LMC) is a benchmark to evaluate tasks that are highly subjective and often lack majoritarian human agreement. |
LawBench | LawBench is a benchmark to evaluate the legal capabilities of LLMs. |
La Leaderboard | La Leaderboard evaluates and tracks LLM memorization, reasoning, and linguistic capabilities across Spain, LATAM, and the Caribbean. |
LogicKor | LogicKor is a benchmark to evaluate the multidisciplinary thinking capabilities of Korean LLMs. |
LongICL Leaderboard | LongICL Leaderboard is a platform to evaluate the long in-context learning capabilities of LLMs. |
LooGLE | LooGLE is a benchmark to evaluate the long-context understanding capabilities of LLMs. |
LAiW | LAiW is a benchmark to evaluate Chinese legal language understanding and reasoning. |
LLM Benchmarker Suite | LLM Benchmarker Suite is a benchmark to evaluate the comprehensive capabilities of LLMs. |
Large Language Model Assessment in English Contexts | Large Language Model Assessment in English Contexts is a platform to evaluate LLMs in the English context. |
Large Language Model Assessment in the Chinese Context | Large Language Model Assessment in the Chinese Context is a platform to evaluate LLMs in the Chinese context. |
LIBRA | LIBRA is a benchmark for evaluating LLMs' capabilities in understanding and processing long Russian text. |
LibrAI-Eval GenAI Leaderboard | LibrAI-Eval GenAI Leaderboard focuses on the balance between an LLM's capability and safety in English. |
LiveBench | LiveBench is a benchmark for LLMs to minimize test set contamination and enable objective, automated evaluation across diverse, regularly updated tasks. |
LLMEval | LLMEval is a benchmark to evaluate the quality of open-domain conversations with LLMs. |
Llmeval-Gaokao2024-Math | Llmeval-Gaokao2024-Math is a benchmark for evaluating LLMs on 2024 Gaokao-level math problems in Chinese. |
LLMHallucination Leaderboard | LLMHallucination Leaderboard evaluates LLMs based on an array of hallucination-related benchmarks. |
LLMPerf | LLMPerf is a tool to evaluate the performance of LLMs using both load and correctness tests. |
LLMs Disease Risk Prediction Leaderboard | LLMs Disease Risk Prediction Leaderboard is a platform to evaluate LLMs on disease risk prediction. |
LLM Leaderboard | LLM Leaderboard tracks and evaluates LLM providers, enabling selection of the optimal API and model for user needs. |
LLM Leaderboard for CRM | LLM Leaderboard for CRM is a platform to evaluate the efficacy of LLMs for business applications. |
LLM Observatory | LLM Observatory is a benchmark that assesses and ranks LLMs based on their performance in avoiding social biases across categories like LGBTIQ+ orientation, age, gender, politics, race, religion, and xenophobia. |
LLM Price Leaderboard | LLM Price Leaderboard tracks and compares LLM costs based on the price per one million tokens. |
LLM Rankings | LLM Rankings offers a real-time comparison of language models based on normalized token usage for prompts and completions, updated frequently. |
LLM Roleplay Leaderboard | LLM Roleplay Leaderboard evaluates human and AI performance in a social werewolf game for NPC development. |
LLM Safety Leaderboard | LLM Safety Leaderboard aims to provide a unified evaluation for language model safety. |
LLM Use Case Leaderboard | LLM Use Case Leaderboard tracks and evaluates LLMs in business usecases. |
LLM-AggreFact | LLM-AggreFact is a fact-checking benchmark that aggregates the most up-to-date publicly available datasets on grounded factuality evaluation. |
LLM-Leaderboard | LLM-Leaderboard is a joint community effort to create one central leaderboard for LLMs. |
LLM-Perf Leaderboard | LLM-Perf Leaderboard aims to benchmark the performance of LLMs with different hardware, backends, and optimizations. |
LMExamQA | LMExamQA is a benchmarking framework where a language model acts as an examiner to generate questions and evaluate responses in a reference-free, automated manner for comprehensive, equitable assessment. |
LongBench | LongBench is a benchmark for assessing the long context understanding capabilities of LLMs. |
Loong | Loong is a long-context benchmark for evaluating LLMs' multi-document QA abilities across financial, legal, and academic scenarios. |
Low-bit Quantized Open LLM Leaderboard | Low-bit Quantized Open LLM Leaderboard tracks and compares quantized LLMs produced with different quantization algorithms. |
LV-Eval | LV-Eval is a long-context benchmark with five length levels and advanced techniques for accurate evaluation of LLMs on single-hop and multi-hop QA tasks across bilingual datasets. |
LucyEval | LucyEval offers a thorough assessment of LLMs' performance in various Chinese contexts. |
L-Eval | L-Eval is a long-context language model (LCLM) evaluation benchmark that assesses performance in handling extensive context. |
M3KE | M3KE is a massive multi-level multi-subject knowledge evaluation benchmark to measure the knowledge acquired by Chinese LLMs. |
MetaCritique | MetaCritique is a judge that evaluates human-written or LLM-generated critiques by generating critiques of those critiques. |
MINT | MINT is a benchmark to evaluate LLMs' ability to solve tasks with multi-turn interactions by using tools and leveraging natural language feedback. |
Mirage | Mirage is a benchmark for medical information retrieval-augmented generation, featuring 7,663 questions from five medical QA datasets and tested with 41 configurations using the MedRAG toolkit. |
MedBench | MedBench is a benchmark to evaluate the mastery of knowledge and reasoning abilities in medical LLMs. |
MedS-Bench | MedS-Bench is a medical benchmark that evaluates LLMs across 11 task categories using 39 diverse datasets. |
Meta Open LLM leaderboard | The Meta Open LLM leaderboard serves as a central hub for consolidating data from various open LLM leaderboards into a single, user-friendly visualization page. |
MIMIC Clinical Decision Making Leaderboard | MIMIC Clinical Decision Making Leaderboard tracks and evaluates LLMs in realistic clinical decision-making for abdominal pathologies. |
MixEval | MixEval is a benchmark to evaluate LLMs by strategically mixing off-the-shelf benchmarks. |
ML.ENERGY Leaderboard | ML.ENERGY Leaderboard evaluates the energy consumption of LLMs. |
MMedBench | MMedBench is a medical benchmark to evaluate LLMs in multilingual comprehension. |
MMLU | MMLU is a benchmark to evaluate the performance of LLMs across a wide array of natural language understanding tasks. |
MMLU-by-task Leaderboard | MMLU-by-task Leaderboard provides a platform for evaluating and comparing various ML models across different language understanding tasks. |
MMLU-Pro | MMLU-Pro is a more challenging version of MMLU to evaluate the reasoning capabilities of LLMs. |
ModelScope LLM Leaderboard | ModelScope LLM Leaderboard is a platform to evaluate LLMs objectively and comprehensively. |
Model Evaluation Leaderboard | Model Evaluation Leaderboard tracks and evaluates text generation models based on their performance across various benchmarks using Mosaic Eval Gauntlet framework. |
MSNP Leaderboard | MSNP Leaderboard tracks and evaluates quantized GGUF models' performance on various GPU and CPU combinations using single-node setups via Ollama. |
MSTEB | MSTEB is a benchmark for measuring the performance of text embedding models in Spanish. |
MTEB | MTEB is a massive benchmark for measuring the performance of text embedding models on diverse embedding tasks across 112 languages. |
MTEB Arena | MTEB Arena hosts a model arena for dynamic, real-world assessment of embedding models through user-based query and retrieval comparisons. |
MT-Bench-101 | MT-Bench-101 is a fine-grained benchmark for evaluating LLMs in multi-turn dialogues. |
MY Malay LLM Leaderboard | MY Malay LLM Leaderboard aims to track, rank, and evaluate open LLMs on Malay tasks. |
NoCha | NoCha is a benchmark to evaluate how well long-context language models can verify claims written about fictional books. |
NPHardEval | NPHardEval is a benchmark to evaluate the reasoning abilities of LLMs through the lens of computational complexity classes. |
Occiglot Euro LLM Leaderboard | Occiglot Euro LLM Leaderboard compares LLMs in the main European languages covered by the Okapi benchmark and Belebele (French, Italian, German, Spanish, and Dutch). |
OlympiadBench | OlympiadBench is a bilingual multimodal scientific benchmark featuring 8,476 Olympiad-level mathematics and physics problems with expert-level step-by-step reasoning annotations. |
OlympicArena | OlympicArena is a benchmark to evaluate the advanced capabilities of LLMs across a broad spectrum of Olympic-level challenges. |
oobabooga | Oobabooga is a benchmark to perform repeatable performance tests of LLMs with the oobabooga web UI. |
OpenEval | OpenEval is a platform to evaluate Chinese LLMs. |
OpenLLM Turkish leaderboard | OpenLLM Turkish leaderboard tracks progress and ranks the performance of LLMs in Turkish. |
Openness Leaderboard | Openness Leaderboard tracks and evaluates models' transparency in terms of open access to weights, data, and licenses, exposing models that fall short of openness standards. |
Openness Leaderboard | Openness Leaderboard is a tool that tracks the openness of instruction-tuned LLMs, evaluating their transparency, data, and model availability. |
OpenResearcher | OpenResearcher hosts benchmarking results for various RAG-related systems as a leaderboard. |
Open Arabic LLM Leaderboard | Open Arabic LLM Leaderboard tracks progress and ranks the performance of LLMs in Arabic. |
Open Chinese LLM Leaderboard | Open Chinese LLM Leaderboard aims to track, rank, and evaluate open Chinese LLMs. |
Open CoT Leaderboard | Open CoT Leaderboard tracks LLMs' abilities to generate effective chain-of-thought reasoning traces. |
Open Dutch LLM Evaluation Leaderboard | Open Dutch LLM Evaluation Leaderboard tracks progress and ranks the performance of LLMs in Dutch. |
Open Financial LLM Leaderboard | Open Financial LLM Leaderboard aims to evaluate and compare the performance of financial LLMs. |
Open ITA LLM Leaderboard | Open ITA LLM Leaderboard tracks progress and ranks the performance of LLMs in Italian. |
Open Ko-LLM Leaderboard | Open Ko-LLM Leaderboard tracks progress and ranks the performance of LLMs in Korean. |
Open LLM Leaderboard | Open LLM Leaderboard tracks progress and ranks the performance of LLMs in English. |
Open Medical-LLM Leaderboard | Open Medical-LLM Leaderboard aims to track, rank, and evaluate open LLMs in the medical domain. |
Open MLLM Leaderboard | Open MLLM Leaderboard aims to track, rank and evaluate LLMs and chatbots. |
Open MOE LLM Leaderboard | Open MOE LLM Leaderboard assesses the performance and efficiency of various Mixture of Experts (MoE) LLMs. |
Open Multilingual LLM Evaluation Leaderboard | Open Multilingual LLM Evaluation Leaderboard tracks progress and ranks the performance of LLMs in multiple languages. |
Open PL LLM Leaderboard | Open PL LLM Leaderboard is a platform for assessing the performance of various LLMs in Polish. |
Open Portuguese LLM Leaderboard | Open PT LLM Leaderboard aims to evaluate and compare LLMs on Portuguese-language tasks. |
Open Taiwan LLM leaderboard | Open Taiwan LLM leaderboard showcases the performance of LLMs on various Taiwanese Mandarin language understanding tasks. |
Open-LLM-Leaderboard | Open-LLM-Leaderboard evaluates LLMs in language understanding and reasoning by transitioning from multiple-choice questions (MCQs) to open-style questions. |
OPUS-MT Dashboard | OPUS-MT Dashboard is a platform to track and compare machine translation models across multiple language pairs and metrics. |
OR-Bench | OR-Bench is a benchmark to evaluate over-refusal induced by enhanced safety alignment in LLMs. |
ParsBench | ParsBench provides toolkits for benchmarking LLMs based on the Persian language. |
Persian LLM Leaderboard | Persian LLM Leaderboard provides a reliable evaluation of LLMs in Persian Language. |
Pinocchio ITA leaderboard | Pinocchio ITA leaderboard tracks and evaluates LLMs in Italian Language. |
PL-MTEB | PL-MTEB (Polish Massive Text Embedding Benchmark) is a benchmark for evaluating text embeddings in Polish across 28 NLP tasks. |
Polish Medical Leaderboard | Polish Medical Leaderboard evaluates language models on Polish board certification examinations. |
Powered-by-Intel LLM Leaderboard | Powered-by-Intel LLM Leaderboard evaluates, scores, and ranks LLMs that have been pre-trained or fine-tuned on Intel Hardware. |
PubMedQA | PubMedQA is a benchmark to evaluate biomedical research question answering. |
PromptBench | PromptBench is a benchmark to evaluate the robustness of LLMs on adversarial prompts. |
QAConv | QAConv is a benchmark for question answering using complex, domain-specific, and asynchronous conversations as the knowledge source. |
QuALITY | QuALITY is a benchmark for evaluating multiple-choice question-answering with a long context. |
RABBITS | RABBITS is a benchmark to evaluate the robustness of LLMs by evaluating their handling of synonyms, specifically brand and generic drug names. |
Rakuda | Rakuda is a benchmark to evaluate LLMs based on how well they answer a set of open-ended questions about Japanese topics. |
RedTeam Arena | RedTeam Arena is a red-teaming platform for LLMs. |
Red Teaming Resistance Benchmark | Red Teaming Resistance Benchmark is a benchmark to evaluate the robustness of LLMs against red teaming prompts. |
ReST-MCTS* | ReST-MCTS* is a reinforced self-training method that uses tree search and process reward inference to collect high-quality reasoning traces for training policy and reward models without manual step annotations. |
Reviewer Arena | Reviewer Arena hosts the reviewer arena, where various LLMs compete based on their performance in critiquing academic papers. |
RoleEval | RoleEval is a bilingual benchmark to evaluate the memorization, utilization, and reasoning capabilities of role knowledge of LLMs. |
RPBench Leaderboard | RPBench-Auto is an automated pipeline for evaluating LLMs using 80 personae for character-based and 80 scenes for scene-based role-playing. |
Russian Chatbot Arena | Russian Chatbot Arena hosts a chatbot arena where various LLMs compete in Russian based on user satisfaction. |
Russian SuperGLUE | Russian SuperGLUE is a benchmark for Russian language models, focusing on logic, commonsense, and reasoning tasks. |
R-Judge | R-Judge is a benchmark to evaluate the proficiency of LLMs in judging and identifying safety risks given agent interaction records. |
Safety Prompts | Safety Prompts is a benchmark to evaluate the safety of Chinese LLMs. |
SafetyBench | SafetyBench is a benchmark to evaluate the safety of LLMs. |
SALAD-Bench | SALAD-Bench is a benchmark for evaluating the safety and security of LLMs. |
ScandEval | ScandEval is a benchmark to evaluate LLMs on tasks in Scandinavian languages as well as German, Dutch, and English. |
Science Leaderboard | Science Leaderboard is a platform to evaluate LLMs' capabilities to solve science problems. |
SciGLM | SciGLM is a suite of scientific language models that use a self-reflective instruction annotation framework to enhance scientific reasoning by generating and revising step-by-step solutions to unlabelled questions. |
SciKnowEval | SciKnowEval is a benchmark to evaluate LLMs based on their proficiency in studying extensively, enquiring earnestly, thinking profoundly, discerning clearly, and practicing assiduously. |
SCROLLS | SCROLLS is a benchmark to evaluate the reasoning capabilities of LLMs over long texts. |
SeaExam | SeaExam is a benchmark to evaluate LLMs for Southeast Asian (SEA) languages. |
SEAL LLM Leaderboards | SEAL LLM Leaderboards is an expert-driven private evaluation platform for LLMs. |
SeaEval | SeaEval is a benchmark to evaluate the performance of multilingual LLMs in understanding and reasoning with natural language, as well as comprehending cultural practices, nuances, and values. |
SEA HELM | SEA HELM is a benchmark to evaluate LLMs' performance across English and Southeast Asian tasks, focusing on chat, instruction-following, and linguistic capabilities. |
SecEval | SecEval is a benchmark to evaluate cybersecurity knowledge of foundation models. |
Self-Improving Leaderboard | Self-Improving Leaderboard (SIL) is a dynamic platform that continuously updates test datasets and rankings to provide real-time performance insights for open-source LLMs and chatbots. |
Spec-Bench | Spec-Bench is a benchmark to evaluate speculative decoding methods across diverse scenarios. |
StructEval | StructEval is a benchmark to evaluate LLMs by conducting structured assessments across multiple cognitive levels and critical concepts. |
Subquadratic LLM Leaderboard | Subquadratic LLM Leaderboard evaluates LLMs with subquadratic/attention-free architectures (e.g., RWKV and Mamba). |
SuperBench | SuperBench is a comprehensive system of tasks and dimensions to evaluate the overall capabilities of LLMs. |
SuperGLUE | SuperGLUE is a benchmark to evaluate the performance of LLMs on a set of challenging language understanding tasks. |
SuperLim | SuperLim is a benchmark to evaluate the language understanding capabilities of LLMs in Swedish. |
Swahili LLM-Leaderboard | Swahili LLM-Leaderboard is a joint community effort to create one central leaderboard for LLMs. |
S-Eval | S-Eval is a comprehensive, multi-dimensional safety benchmark with 220,000 prompts designed to evaluate LLM safety across various risk dimensions. |
TableQAEval | TableQAEval is a benchmark to evaluate LLM performance in modeling long tables and comprehension capabilities, such as numerical and multi-hop reasoning. |
TAT-DQA | TAT-DQA is a benchmark to evaluate LLMs on discrete reasoning over documents that combine both structured and unstructured information. |
TAT-QA | TAT-QA is a benchmark to evaluate LLMs on discrete reasoning over documents that combine both tabular and textual content. |
Thai LLM Leaderboard | Thai LLM Leaderboard aims to track and evaluate LLMs on Thai-language tasks. |
The Pile | The Pile is a benchmark to evaluate the world knowledge and reasoning ability of LLMs. |
TOFU | TOFU is a benchmark to evaluate the unlearning performance of LLMs in realistic scenarios. |
Toloka LLM Leaderboard | Toloka LLM Leaderboard is a benchmark to evaluate LLMs based on authentic user prompts and expert human evaluation. |
ToolBench | ToolBench is a platform for training, serving, and evaluating LLMs specifically for tool learning. |
Toxicity Leaderboard | Toxicity Leaderboard evaluates the toxicity of LLMs. |
Trustbit LLM Leaderboards | Trustbit LLM Leaderboards is a platform that provides benchmarks for building and shipping products with LLMs. |
TrustLLM | TrustLLM is a benchmark to evaluate the trustworthiness of LLMs. |
TuringAdvice | TuringAdvice is a benchmark for evaluating language models' ability to generate helpful advice for real-life, open-ended situations. |
TutorEval | TutorEval is a question-answering benchmark which evaluates how well an LLM tutor can help a user understand a chapter from a science textbook. |
T-Eval | T-Eval is a benchmark for evaluating the tool utilization capability of LLMs. |
UGI Leaderboard | UGI Leaderboard measures and compares the uncensored and controversial information known by LLMs. |
UltraEval | UltraEval is an open-source framework for transparent and reproducible benchmarking of LLMs across various performance dimensions. |
Vals AI | Vals AI is a platform evaluating generative AI accuracy and efficacy on real-world legal tasks. |
VCR | Visual Commonsense Reasoning (VCR) is a benchmark for cognition-level visual understanding, requiring models to answer visual questions and provide rationales for their answers. |
ViDoRe | ViDoRe is a benchmark to evaluate retrieval models on their capacity to match queries to relevant documents at the page level. |
VLLMs Leaderboard | VLLMs Leaderboard aims to track, rank and evaluate open LLMs and chatbots. |
VMLU | VMLU is a benchmark to evaluate overall capabilities of foundation models in Vietnamese. |
WildBench | WildBench is a benchmark for evaluating language models on challenging tasks that closely resemble real-world applications. |
Xiezhi | Xiezhi is a benchmark for holistic domain knowledge evaluation of LLMs. |
Yanolja Arena | Yanolja Arena hosts a model arena to evaluate the capabilities of LLMs in summarizing and translating text. |
Yet Another LLM Leaderboard | Yet Another LLM Leaderboard is a platform for tracking, ranking, and evaluating open LLMs and chatbots. |
ZebraLogic | ZebraLogic is a benchmark evaluating LLMs' logical reasoning using Logic Grid Puzzles, a type of Constraint Satisfaction Problem (CSP). |
ZeroSumEval | ZeroSumEval is a competitive evaluation framework for LLMs using multiplayer simulations with clear win conditions. |