
Needle In A Haystack - Pressure Testing LLMs

This repository is a fork of Greg Kamradt's code: https://twitter.com/GregKamradt

Original Repository: https://github.com/gkamradt/LLMTest_NeedleInAHaystack

This is Greg's original brainchild. Greg put an amazing amount of thought and work into this. Huge shout out to him for the original implementation and visualizing the results in such a compelling way. We can't thank Greg enough for his original contribution to the AI Engineering community!

Our goal is to add another level of rigor to the original tests and test more models.

WHY DOES THIS TEST MATTER? Most teams use retrieval to connect private facts or private information to a model's context window; it's how they run LLMs on their private data. An LLM's ability to make use of this data and recall facts correctly is critical for RAG.

<img src="https://storage.googleapis.com/arize-assets/phoenix/assets/images/carbon_good.png" width=500 />
```shell
git clone https://github.com/Arize-ai/LLMTest_NeedleInAHaystack.git
```

We've switched out the Evals to use Phoenix: https://github.com/Arize-ai/phoenix

Supported model providers: OpenAI and Anthropic, with others coming soon.

A simple 'needle in a haystack' analysis to test in-context retrieval ability of long context LLMs.

Example of a random fact (the needle): `The special magic {city} number is: {rnd_number}`

The correct answer is `{rnd_number}`.
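As a sketch of the idea, the needle above can be generated and checked with a couple of small helpers. The function names and the containment-based scoring below are assumptions for illustration, not the repository's actual code.

```python
import random


def make_needle(city: str) -> tuple[str, str]:
    """Build a needle sentence and return it with the expected answer."""
    rnd_number = str(random.randint(1_000_000, 9_999_999))
    needle = f"The special magic {city} number is: {rnd_number}"
    return needle, rnd_number


def is_correct(model_answer: str, rnd_number: str) -> bool:
    """Simple containment check: did the model recall the number?"""
    return rnd_number in model_answer


needle, answer = make_needle("Tokyo")
print(needle)
print(is_correct(f"The number is {answer}.", answer))  # True
```

A containment check like this is deliberately forgiving: the model gets credit as long as the number appears anywhere in its answer, regardless of surrounding wording.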

The Test

  1. Place a sentence with a random number (the 'needle') in the middle of a long context window (the 'haystack')
  2. Ask the model to retrieve this statement
  3. Iterate over various document depths (where the needle is placed) and context lengths to measure performance
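The three steps above amount to sweeping a grid of (context length, needle depth) pairs. Here is a minimal sketch of that grid; the specific lengths, depths, and helper name are illustrative assumptions, not the repository's actual parameters.

```python
# Illustrative test grid: context lengths (in tokens) and needle depths
# (as a percentage of the way into the document).
context_lengths = [1_000, 32_000, 64_000, 96_000, 128_000]
depth_percents = [0, 25, 50, 75, 100]


def insert_needle(haystack: str, needle: str, depth_percent: int) -> str:
    """Place the needle roughly depth_percent of the way into the haystack."""
    pos = len(haystack) * depth_percent // 100
    return haystack[:pos] + needle + haystack[pos:]


# Every (context_length, depth) pair is one test run.
test_matrix = [(c, d) for c in context_lengths for d in depth_percents]
print(len(test_matrix))  # 25 combinations
```

Each cell of the resulting matrix is scored independently, which is what makes the familiar depth-vs-length heatmaps possible.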

This is the code that backed the original analysis.

If run with save_results = True, the script populates a result/ directory with evaluation information. Because requests may run concurrently, each new test is saved as its own file.
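One way to make concurrent-safe per-test files is to give each record a unique filename. This is a minimal sketch, assuming JSON output; the directory layout, filename scheme, and record fields are assumptions, not the script's actual schema.

```python
import json
import os
import uuid


def save_result(results_dir: str, record: dict) -> str:
    """Write one test's evaluation record to its own uniquely named file,
    so concurrently running tests never clobber each other."""
    os.makedirs(results_dir, exist_ok=True)
    path = os.path.join(results_dir, f"{uuid.uuid4().hex}.json")
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return path


path = save_result("result", {"context_length": 32_000, "depth_percent": 50, "score": 1})
print(path)
```

Using a random UUID per file trades tidy filenames for safety: no locking or coordination is needed between parallel test runs.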

The key parameters:

Other Parameters:
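For a feel of what these parameters typically look like, here is a purely hypothetical configuration dict. Every key name and value below is an illustrative assumption, not the fork's actual parameter list.

```python
# Hypothetical configuration for a needle-in-a-haystack run.
# All names here are illustrative, not the script's real flags.
config = {
    "needle": "The special magic {city} number is: {rnd_number}",
    "retrieval_question": "What is the special magic {city} number?",
    "context_lengths": [1_000, 32_000, 64_000, 96_000, 128_000],
    "document_depth_percents": [0, 25, 50, 75, 100],
    "model_provider": "OpenAI",  # or "Anthropic"
    "save_results": True,        # write one file per test to result/
}
print(config["model_provider"])
```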

OpenAI's GPT-4-Turbo (Run 12/13/2023)

<img src="https://storage.googleapis.com/arize-assets/phoenix/assets/images/gpt-4_2.png" alt="GPT-4-128 Context Testing" width="800"/>

Anthropic's Claude 2.1 (Run 12/12/2023)

<img src="https://storage.googleapis.com/arize-assets/phoenix/assets/images/claude_2_1_3.png" alt="Claude 2.1 Context Testing" width="800"/>