Hallucination Leaderboard

Public LLM leaderboard computed using Vectara's Hughes Hallucination Evaluation Model. This evaluates how often an LLM introduces hallucinations when summarizing a document. We plan to update this regularly as our model and the LLMs get updated over time.

Also, feel free to check out our hallucination leaderboard on Hugging Face.

The rankings in this leaderboard are computed using the HHEM-2.1 hallucination evaluation model. If you are interested in the previous leaderboard, which was based on HHEM-1.0, it is available here.

<table style="border-collapse: collapse;"> <tr> <td style="text-align: center; vertical-align: middle; border: none;"> <img src="img/candle.png" width="50" height="50"> </td> <td style="text-align: left; vertical-align: middle; border: none;"> In loving memory of <a href="https://www.ivinsfuneralhome.com/obituaries/Simon-Mark-Hughes?obId=30000023">Simon Mark Hughes</a>... </td> </tr> </table>

Last updated on November 6th, 2024

Plot: hallucination rates of various LLMs

| Model | Hallucination Rate | Factual Consistency Rate | Answer Rate | Average Summary Length (Words) |
| --- | --- | --- | --- | --- |
| Zhipu AI GLM-4-9B-Chat | 1.3 % | 98.7 % | 100.0 % | 58.1 |
| OpenAI-o1-mini | 1.4 % | 98.6 % | 100.0 % | 78.3 |
| GPT-4o | 1.5 % | 98.5 % | 100.0 % | 77.8 |
| GPT-4o-mini | 1.7 % | 98.3 % | 100.0 % | 76.3 |
| GPT-4-Turbo | 1.7 % | 98.3 % | 100.0 % | 86.2 |
| GPT-4 | 1.8 % | 98.2 % | 100.0 % | 81.1 |
| GPT-3.5-Turbo | 1.9 % | 98.1 % | 99.6 % | 84.1 |
| DeepSeek-V2.5 | 2.4 % | 97.6 % | 100.0 % | 83.2 |
| Microsoft Orca-2-13b | 2.5 % | 97.5 % | 100.0 % | 66.2 |
| Microsoft Phi-3.5-MoE-instruct | 2.5 % | 97.5 % | 96.3 % | 69.7 |
| Intel Neural-Chat-7B-v3-3 | 2.6 % | 97.4 % | 100.0 % | 60.7 |
| Qwen2.5-7B-Instruct | 2.8 % | 97.2 % | 100.0 % | 71.0 |
| AI21 Jamba-1.5-Mini | 2.9 % | 97.1 % | 95.6 % | 74.5 |
| Snowflake-Arctic-Instruct | 3.0 % | 97.0 % | 100.0 % | 68.7 |
| Qwen2.5-32B-Instruct | 3.0 % | 97.0 % | 100.0 % | 67.9 |
| Microsoft Phi-3-mini-128k-instruct | 3.1 % | 96.9 % | 100.0 % | 60.1 |
| OpenAI-o1-preview | 3.3 % | 96.7 % | 100.0 % | 119.3 |
| Google Gemini-1.5-Flash-002 | 3.4 % | 96.6 % | 99.9 % | 59.4 |
| 01-AI Yi-1.5-34B-Chat | 3.7 % | 96.3 % | 100.0 % | 83.7 |
| Llama-3.1-405B-Instruct | 3.9 % | 96.1 % | 99.6 % | 85.7 |
| Microsoft Phi-3-mini-4k-instruct | 4.0 % | 96.0 % | 100.0 % | 86.8 |
| Microsoft Phi-3.5-mini-instruct | 4.1 % | 95.9 % | 100.0 % | 75.0 |
| Mistral-Large2 | 4.1 % | 95.9 % | 100.0 % | 77.4 |
| Llama-3-70B-Chat-hf | 4.1 % | 95.9 % | 99.2 % | 68.5 |
| Qwen2-VL-7B-Instruct | 4.2 % | 95.8 % | 100.0 % | 73.9 |
| Qwen2.5-14B-Instruct | 4.2 % | 95.8 % | 100.0 % | 74.8 |
| Qwen2.5-72B-Instruct | 4.3 % | 95.7 % | 100.0 % | 80.0 |
| Llama-3.2-90B-Vision-Instruct | 4.3 % | 95.7 % | 100.0 % | 79.8 |
| XAI Grok | 4.6 % | 95.4 % | 100.0 % | 91.0 |
| Anthropic Claude-3-5-sonnet | 4.6 % | 95.4 % | 100.0 % | 95.9 |
| Qwen2-72B-Instruct | 4.7 % | 95.3 % | 100.0 % | 100.1 |
| Mixtral-8x22B-Instruct-v0.1 | 4.7 % | 95.3 % | 99.9 % | 92.0 |
| Anthropic Claude-3-5-haiku | 4.9 % | 95.1 % | 100.0 % | 92.9 |
| 01-AI Yi-1.5-9B-Chat | 4.9 % | 95.1 % | 100.0 % | 85.7 |
| Cohere Command-R | 4.9 % | 95.1 % | 100.0 % | 68.7 |
| Llama-3.1-70B-Instruct | 5.0 % | 95.0 % | 100.0 % | 79.6 |
| Llama-3.1-8B-Instruct | 5.4 % | 94.6 % | 100.0 % | 71.0 |
| Cohere Command-R-Plus | 5.4 % | 94.6 % | 100.0 % | 68.4 |
| Llama-3.2-11B-Vision-Instruct | 5.5 % | 94.5 % | 100.0 % | 67.3 |
| Llama-2-70B-Chat-hf | 5.9 % | 94.1 % | 99.9 % | 84.9 |
| IBM Granite-3.0-8B-Instruct | 6.5 % | 93.5 % | 100.0 % | 74.2 |
| Google Gemini-1.5-Pro-002 | 6.6 % | 93.7 % | 99.9 % | 62.0 |
| Google Gemini-1.5-Flash | 6.6 % | 93.4 % | 99.9 % | 63.3 |
| Microsoft phi-2 | 6.7 % | 93.3 % | 91.5 % | 80.8 |
| Google Gemma-2-2B-it | 7.0 % | 93.0 % | 100.0 % | 62.2 |
| Qwen2.5-3B-Instruct | 7.0 % | 93.0 % | 100.0 % | 70.4 |
| Llama-3-8B-Chat-hf | 7.4 % | 92.6 % | 99.8 % | 79.7 |
| Google Gemini-Pro | 7.7 % | 92.3 % | 98.4 % | 89.5 |
| 01-AI Yi-1.5-6B-Chat | 7.9 % | 92.1 % | 100.0 % | 98.9 |
| Llama-3.2-3B-Instruct | 7.9 % | 92.1 % | 100.0 % | 72.2 |
| databricks dbrx-instruct | 8.3 % | 91.7 % | 100.0 % | 85.9 |
| Qwen2-VL-2B-Instruct | 8.3 % | 91.7 % | 100.0 % | 81.8 |
| Cohere Aya Expanse 32B | 8.5 % | 91.5 % | 99.9 % | 81.9 |
| IBM Granite-3.0-2B-Instruct | 8.8 % | 91.2 % | 100.0 % | 81.6 |
| Mistral-7B-Instruct-v0.3 | 9.5 % | 90.5 % | 100.0 % | 98.4 |
| Google Gemini-1.5-Pro | 9.1 % | 90.9 % | 99.8 % | 61.6 |
| Anthropic Claude-3-opus | 10.1 % | 89.9 % | 95.5 % | 92.1 |
| Google Gemma-2-9B-it | 10.1 % | 89.9 % | 100.0 % | 70.2 |
| Llama-2-13B-Chat-hf | 10.5 % | 89.5 % | 99.8 % | 82.1 |
| Mistral-Nemo-Instruct | 11.2 % | 88.8 % | 100.0 % | 69.9 |
| Llama-2-7B-Chat-hf | 11.3 % | 88.7 % | 99.6 % | 119.9 |
| Microsoft WizardLM-2-8x22B | 11.7 % | 88.3 % | 99.9 % | 140.8 |
| Cohere Aya Expanse 8B | 12.2 % | 87.8 % | 99.9 % | 83.9 |
| Amazon Titan-Express | 13.5 % | 86.5 % | 99.5 % | 98.4 |
| Google PaLM-2 | 14.1 % | 85.9 % | 99.8 % | 86.6 |
| Google Gemma-7B-it | 14.8 % | 85.2 % | 100.0 % | 113.0 |
| Qwen2.5-1.5B-Instruct | 15.8 % | 84.2 % | 100.0 % | 70.7 |
| Anthropic Claude-3-sonnet | 16.3 % | 83.7 % | 100.0 % | 108.5 |
| Google Gemma-1.1-7B-it | 17.0 % | 83.0 % | 100.0 % | 64.3 |
| Anthropic Claude-2 | 17.4 % | 82.6 % | 99.3 % | 87.5 |
| Google Flan-T5-large | 18.3 % | 81.7 % | 99.3 % | 20.9 |
| Mixtral-8x7B-Instruct-v0.1 | 20.1 % | 79.9 % | 99.9 % | 90.7 |
| Llama-3.2-1B-Instruct | 20.7 % | 79.3 % | 100.0 % | 71.5 |
| Apple OpenELM-3B-Instruct | 24.8 % | 75.2 % | 99.3 % | 47.2 |
| Qwen2.5-0.5B-Instruct | 25.2 % | 74.8 % | 100.0 % | 72.6 |
| Google Gemma-1.1-2B-it | 27.8 % | 72.2 % | 100.0 % | 66.8 |
| TII falcon-7B-instruct | 29.9 % | 70.1 % | 90.0 % | 75.5 |

Model

This leaderboard uses HHEM-2.1, Vectara's commercial hallucination evaluation model, to compute the LLM rankings. You can find an open-source variant of that model, HHEM-2.1-Open, on Hugging Face and Kaggle.
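As a rough illustration of how a summary can be scored against its source, here is a minimal sketch using the HHEM-2.1-Open checkpoint published on Hugging Face (model ID `vectara/hallucination_evaluation_model`). The `predict()` call follows the published model card, but consult the card for current usage; this is not necessarily the exact code behind the leaderboard.

```python
# Minimal sketch: scoring (source, summary) pairs with HHEM-2.1-Open.
# Assumes the Hugging Face checkpoint "vectara/hallucination_evaluation_model"
# and its custom predict() method as described on the model card.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

pairs = [
    # (source passage, candidate summary)
    ("The sky was overcast all day in Seattle.", "It was sunny in Seattle."),
    ("The sky was overcast all day in Seattle.", "Seattle skies stayed overcast."),
]

scores = model.predict(pairs)  # one consistency score in [0, 1] per pair
for (source, summary), score in zip(pairs, scores):
    # Higher scores mean the summary is more factually consistent with the source.
    print(f"{float(score):.3f}  {summary}")
```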

Data

See this dataset for the generated summaries we used for evaluating the models.

Prior Research

Much prior work has been done on factual consistency in summarization. For a very comprehensive list, please see https://github.com/EdinburghNLP/awesome-hallucination-detection. The methods described in the following section use protocols established in that literature.

Methodology

For a detailed explanation of the work that went into this model, please refer to our blog post on the release: Cut the Bull…. Detecting Hallucinations in Large Language Models.

To build this leaderboard, we trained a model to detect hallucinations in LLM outputs, using various open-source datasets from research on the factual consistency of summarization models. Using this model, which is competitive with the best state-of-the-art models, we fed 1,000 short documents to each of the LLMs above via their public APIs and asked them to summarize each document using only the facts presented in it. Of these 1,000 documents, only 831 were summarized by every model; the rest were rejected by at least one model due to content restrictions. Using these 831 documents, we then computed the overall factual consistency rate (no hallucinations) and the hallucination rate (100% minus the factual consistency rate) for each model. The rate at which each model refused to respond to the prompt is shown in the 'Answer Rate' column. None of the content sent to the models was illicit or 'not safe for work', but the presence of trigger words was enough to set off some of the content filters. The documents were taken primarily from the CNN / Daily Mail corpus. We used a temperature of 0 when calling the LLMs.
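To make the arithmetic concrete, here is a minimal sketch of how the three reported rates relate to each other. It assumes per-document consistency scores from an HHEM-style judge (see the HHEM-2.1-Open sketch above) and a 0.5 "consistent" threshold; the threshold and the function shape are illustrative assumptions, not the leaderboard's actual harness.

```python
# Minimal sketch of the metric computation (an assumed reconstruction for
# illustration). "common_scores" are consistency scores on the documents every
# model summarized; "answered" is how many of the documents sent this model
# agreed to summarize. The 0.5 threshold is an assumption.
def leaderboard_metrics(common_scores: list[float], answered: int,
                        total_documents: int, threshold: float = 0.5) -> dict:
    consistent = sum(score >= threshold for score in common_scores)
    factual_consistency_rate = 100.0 * consistent / len(common_scores)
    return {
        "answer_rate": 100.0 * answered / total_documents,
        "factual_consistency_rate": factual_consistency_rate,
        # By construction, hallucination rate = 100 minus factual consistency rate.
        "hallucination_rate": 100.0 - factual_consistency_rate,
    }

# Example: a model that answered 996 of 1,000 prompts, scored on the common set.
print(leaderboard_metrics(common_scores=[0.9] * 800 + [0.2] * 31,
                          answered=996, total_documents=1000))
```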

We evaluate the factual consistency rate of summaries rather than overall factual accuracy because it lets us compare each model's response against the information it was given: is the summary 'factually consistent' with the source document? Detecting hallucinations in answers to arbitrary questions is impossible, as it is not known precisely what data every LLM was trained on. In addition, a model that could determine whether any response was hallucinated without a reference source would require solving the hallucination problem itself, and presumably training a model at least as large as the LLMs being evaluated. We therefore chose to measure the hallucination rate on the summarization task, which is a good proxy for how truthful the models are overall. Moreover, LLMs are increasingly used in RAG (Retrieval Augmented Generation) pipelines to answer user queries, such as in Bing Chat and Google's chat integration. In a RAG system, the model acts as a summarizer of the search results, so this leaderboard is also a good indicator of model accuracy when used in RAG systems.

Prompt Used

You are a chat bot answering questions using data. You must stick to the answers provided solely by the text in the passage provided. You are asked the question 'Provide a concise summary of the following passage, covering the core pieces of information described.' <PASSAGE>

When calling the API, the <PASSAGE> token was then replaced with the source document (see the 'source' column in this dataset).
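For illustration, a minimal sketch of that substitution and a temperature-0 call is shown below, using OpenAI's Python client as one example provider. Each model family on the leaderboard was called through its own API, so treat the client details and model name here as assumptions rather than the leaderboard's actual code.

```python
# Minimal sketch of building the prompt and requesting one summary at
# temperature 0. OpenAI's client is shown only as an example provider.
from openai import OpenAI

PROMPT_TEMPLATE = (
    "You are a chat bot answering questions using data. You must stick to the "
    "answers provided solely by the text in the passage provided. You are asked "
    "the question 'Provide a concise summary of the following passage, covering "
    "the core pieces of information described.' <PASSAGE>"
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize(source_document: str, model: str = "gpt-4o") -> str:
    """Substitute the source document for <PASSAGE> and call the model."""
    prompt = PROMPT_TEMPLATE.replace("<PASSAGE>", source_document)
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # temperature 0, as stated in the Methodology section
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```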

API Integration Details

Below is a detailed overview of the models integrated and their specific endpoints:

OpenAI Models

Llama Models

Cohere Models

Anthropic Models

Mistral AI Models

Google PaLM Models via Vertex AI

For an in-depth understanding of each model's version and lifecycle, especially those offered by Google, please refer to Model Versions and Lifecycles on Vertex AI.

Titan Models on Amazon Bedrock

Microsoft Models

Google Models on Hugging Face

TII Models on Hugging Face

Intel Model on Hugging Face

Databricks Model

Snowflake Model

Apple Model

01-AI Models

Zhipu AI Model

Qwen Models

AI21 Model

DeepSeek Model

IBM Models

XAI Model

Frequently Asked Questions

Coming Soon
