Home

Awesome

<img src="assets/logo.png" alt="Loong" width="120" align="left"><div align="center"><h1>  Loong: Benchmarking Long-Context LLMs with Extended Multi-Doc QA</h1></div>

<p align="center" style="font-size:200%"> <img alt="GitHub" src="https://img.shields.io/github/license/MozerWang/Loong.svg?color=blue&style=flat-square"> <img alt="GitHub top language" src="https://img.shields.io/github/languages/top/MozerWang/Loong"> <img alt="GitHub last commit" src="https://img.shields.io/github/last-commit/MozerWang/Loong"> </p> <p align="center"> <font size=6>📃</font> <a target="_self" href="https://arxiv.org/abs/2406.17419"> <img style="height:14pt" src="https://img.shields.io/badge/-Paper-red?style=flat&logo=arxiv"></a> <font size=6>•</font> <font size=6>💻</font> <a target="_self" href="https://github.com/MozerWang/Loong"> <img style="height:14pt" src="https://img.shields.io/badge/-Code-pink?style=flat&logo=github"></a> <font size=6>•</font> <font size=6>🤗</font> <a target="_self" href="https://modelscope.cn/datasets/iic/Loong"> <img style="height:14pt" src="https://img.shields.io/badge/-🤗%20Dataset-red?style=flat"></a> </p>

👀Overview

This repository contains code for our paper Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA. We propose a novel long-context benchmark, 🐉 Loong, aligning with realistic scenarios through extended multi-document question answering (QA). Loong typically consists of 11 documents per test instance on average, spanning three real-world scenarios in English and Chinese: (1) Financial Reports, (2) Legal Cases, and (3) Academic Papers. Meanwhile, Loong introduces new evaluation tasks from the perspectives of Spotlight Locating, Comparison, Clustering, and Chain of Reasoning, to facilitate a more realistic and comprehensive evaluation of long-context understanding. Furthermore, Loong features inputs of varying lengths (e.g., 10K-50K, 50K-100K, 100K-200K, beyond 200K) and evaluation tasks of diverse difficulty, enabling fine-grained assessment of LLMs across different context lengths and task complexities.

Please find more details of this work in our paper.

Overview of Loong

Showcase of four evaluation tasks in Loong (<di>...</di> marks the content of the i-th document). (a) Spotlight Locating: Locate the evidence. (b) Comparison: Locate and compare the evidence. (c) Clustering: Locate and cluster the evidence into groups. (d) Chain of Reasoning: Locate and reasoning along a logical chain.

📰News

[2024-09-20] 📰Our paper has been accepted to the EMNLP Main Conference.

[2024-07-30] 🤖The performance of phi-3, llama-3.1-8B, gpt-4o-mini on Loong are updated.

[2024-07-03] 🔥The code and benchmark are releasing. If you encounter any issues, please feel free to contact us.

[2024-06-25] 👨‍💻The code is currently being refined, and we plan to release the evaluation code and benchmark within the next one or two weeks. If you encounter any issues, please feel free to contact me at wangminzheng2023@ia.ac.cn.

🏆Leaderboard

<table> <thead> <tr> <th>Models</th> <th>Claimed Length</th> <th colspan="2" style="text-align: center;">Spotlight Locating</th> <th colspan="2" style="text-align: center;">Comparison</th> <th colspan="2" style="text-align: center;">Clustering</th> <th colspan="2" style="text-align: center;">Chain of Reason</th> <th colspan="2" style="text-align: center;">Overall</th> </tr> </thead> <tbody> <tr> <td><a href="https://ai.google.dev/gemini-api/docs/models/gemini#:~:text=Gemini-,Gemini%201.5%20Pro%20(Preview%20only),-Text%20and%20images">Gemini-1.5-pro</a></td> <td style="text-align: center;">1000K</td> <td style="text-align: center;">75.02</td><td style="text-align: center;">0.56</td> <td style="text-align: center;">49.94</td><td style="text-align: center;">0.27</td> <td style="text-align: center;">44.10</td><td style="text-align: center;">0.09</td> <td style="text-align: center;">64.97</td><td style="text-align: center;">0.37</td> <td style="text-align: center;">55.37</td><td style="text-align: center;">0.27</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://platform.openai.com/docs/models/gpt-4o">GPT-4o</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">73.95</td><td style="text-align: center;">0.62</td> <td style="text-align: center;">50.50</td><td style="text-align: center;">0.28</td> <td style="text-align: center;">44.29</td><td style="text-align: center;">0.09</td> <td style="text-align: center;">57.95</td><td style="text-align: center;">0.28</td> <td style="text-align: center;">53.47</td><td style="text-align: center;">0.26</td> </tr> <tr> <td><a href="https://docs.anthropic.com/en/docs/intro-to-claude#claude-3-5-family">Claude3.5-Sonnet</a></td> <td style="text-align: center;">200K</td> <td style="text-align: center;">58.45</td><td style="text-align: center;">0.49</td> <td style="text-align: center;">54.21</td><td style="text-align: center;">0.35</td> <td style="text-align: center;">45.77</td><td style="text-align: center;">0.07</td> <td style="text-align: center;">43.92</td><td style="text-align: center;">0.25</td> <td style="text-align: center;">48.85</td><td style="text-align: center;">0.23</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://docs.anthropic.com/en/docs/intro-to-claude#claude-3-family">Claude3-Haiku</a></td> <td style="text-align: center;">200K</td> <td style="text-align: center;">68.68</td><td style="text-align: center;">0.59</td> <td style="text-align: center;">42.10</td><td style="text-align: center;">0.21</td> <td style="text-align: center;">35.04</td><td style="text-align: center;">0.02</td> <td style="text-align: center;">47.59</td><td style="text-align: center;">0.17</td> <td style="text-align: center;">44.88</td><td style="text-align: center;">0.19</td> </tr> <tr> <td><a href="https://huggingface.co/Qwen/Qwen2-72B-Instruct">Qwen2-72B-Instruct</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">54.17</td><td style="text-align: center;">0.36</td> <td style="text-align: center;">42.38</td><td style="text-align: center;">0.20</td> <td style="text-align: center;">36.71</td><td style="text-align: center;">0.04</td> <td style="text-align: center;">47.76</td><td style="text-align: center;">0.18</td> <td style="text-align: center;">43.29</td><td style="text-align: center;">0.15</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://platform.openai.com/docs/models/gpt-4o-mini">GPT-4o-mini</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">53.12</td><td style="text-align: center;">0.41</td> <td style="text-align: center;">44.27</td><td style="text-align: center;">0.20</td> <td style="text-align: center;">32.58</td><td style="text-align: center;">0.04</td> <td style="text-align: center;">52.34</td><td style="text-align: center;">0.23</td> <td style="text-align: center;">42.95</td><td style="text-align: center;">0.18</td> </tr> <tr> <td><a href="https://huggingface.co/THUDM/glm-4-9b-chat-1m">GLM4-9B-Chat</a></td> <td style="text-align: center;">1000K</td> <td style="text-align: center;">57.35</td><td style="text-align: center;">0.47</td> <td style="text-align: center;">40.38</td><td style="text-align: center;">0.20</td> <td style="text-align: center;">28.52</td><td style="text-align: center;">0.02</td> <td style="text-align: center;">39.94</td><td style="text-align: center;">0.16</td> <td style="text-align: center;">38.31</td><td style="text-align: center;">0.16</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://kimi.moonshot.cn/">Kimi-Chat</a></td> <td style="text-align: center;">200K</td> <td style="text-align: center;">60.98</td><td style="text-align: center;">0.50</td> <td style="text-align: center;">34.74</td><td style="text-align: center;">0.13</td> <td style="text-align: center;">28.76</td><td style="text-align: center;">0.04</td> <td style="text-align: center;">38.52</td><td style="text-align: center;">0.15</td> <td style="text-align: center;">37.49</td><td style="text-align: center;">0.16</td> </tr> <tr> <td><a href="https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct">Llama-3.1-8B-Instruct</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">59.96</td><td style="text-align: center;">0.46</td> <td style="text-align: center;">35.73</td><td style="text-align: center;">0.18</td> <td style="text-align: center;">27.83</td><td style="text-align: center;">0.01</td> <td style="text-align: center;">35.59</td><td style="text-align: center;">0.14</td> <td style="text-align: center;">36.31</td><td style="text-align: center;">0.14</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://huggingface.co/microsoft/Phi-3-small-128k-instruct">Phi-3-small</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">29.23</td><td style="text-align: center;">0.10</td> <td style="text-align: center;">20.12</td><td style="text-align: center;">0.06</td> <td style="text-align: center;">17.53</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">14.36</td><td style="text-align: center;">0.01</td> <td style="text-align: center;">19.03</td><td style="text-align: center;">0.03</td> </tr> <tr> <td><a href="https://huggingface.co/microsoft/Phi-3-mini-128k-instruct">Phi-3-mini</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">25.65</td><td style="text-align: center;">0.15</td> <td style="text-align: center;">13.34</td><td style="text-align: center;">0.04</td> <td style="text-align: center;">12.00</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">12.61</td><td style="text-align: center;">0.01</td> <td style="text-align: center;">14.54</td><td style="text-align: center;">0.04</td> </tr> </tbody> </table>

Overall results on four evaluation tasks. For each task, the indicator on the left represents the Avg Scores(0~100), while the right one represents the Perfect Rate(0~1).

<table> <thead> <tr> <th>Model</th> <th style="text-align: center;">Claimed Length</th> <th colspan="2" style="text-align: center;">Spotlight Locating</th> <th colspan="2" style="text-align: center;">Comparison</th> <th colspan="2" style="text-align: center;">Clustering</th> <th colspan="2" style="text-align: center;">Chain of Reasoning</th> <th colspan="2" style="text-align: center;">Overall</th> </tr> </thead> <tbody> <tr> <td colspan="12" style="text-align: center;"><b>Set1 (10K-50K)</b></td> </tr> <tr> <td><a href="https://platform.openai.com/docs/models/gpt-4o">GPT-4o</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">85.67</td><td style="text-align: center;">0.81</td> <td style="text-align: center;">64.27</td><td style="text-align: center;">0.33</td> <td style="text-align: center;">57.01</td><td style="text-align: center;">0.24</td> <td style="text-align: center;">81.58</td><td style="text-align: center;">0.55</td> <td style="text-align: center;">70.40</td><td style="text-align: center;">0.44</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://docs.anthropic.com/en/docs/intro-to-claude#claude-3-5-family">Claude3.5-Sonnet</a></td> <td style="text-align: center;">200K</td> <td style="text-align: center;">60.85</td><td style="text-align: center;">0.55</td> <td style="text-align: center;">69.07</td><td style="text-align: center;">0.47</td> <td style="text-align: center;">58.63</td><td style="text-align: center;">0.13</td> <td style="text-align: center;">68.57</td><td style="text-align: center;">0.50</td> <td style="text-align: center;">63.69</td><td style="text-align: center;">0.37</td> </tr> <tr> <td><a href="https://ai.google.dev/gemini-api/docs/models/gemini#:~:text=Gemini-,Gemini%201.5%20Pro%20(Preview%20only),-Text%20and%20images">Gemini-1.5-pro</a></td> <td style="text-align: center;">1000K</td> <td style="text-align: center;">75.00</td><td style="text-align: center;">0.60</td> <td style="text-align: center;">54.88</td><td style="text-align: center;">0.28</td> <td style="text-align: center;">56.15</td><td style="text-align: center;">0.23</td> <td style="text-align: center;">70.64</td><td style="text-align: center;">0.37</td> <td style="text-align: center;">63.36</td><td style="text-align: center;">0.34</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://platform.openai.com/docs/models/gpt-4o-mini">GPT-4o-mini</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">62.49</td><td style="text-align: center;">0.56</td> <td style="text-align: center;">65.48</td><td style="text-align: center;">0.40</td> <td style="text-align: center;">45.81</td><td style="text-align: center;">0.12</td> <td style="text-align: center;">79.85</td><td style="text-align: center;">0.55</td> <td style="text-align: center;">62.42</td><td style="text-align: center;">0.36</td> </tr> <tr> <td><a href="https://huggingface.co/Qwen/Qwen2-72B-Instruct">Qwen2-72B-Instruct</a></td> <td style="text-align: center;">200K</td> <td style="text-align: center;">68.49</td><td style="text-align: center;">0.55</td> <td style="text-align: center;">60.60</td><td style="text-align: center;">0.37</td> <td style="text-align: center;">47.08</td><td style="text-align: center;">0.08</td> <td style="text-align: center;">70.39</td><td style="text-align: center;">0.36</td> <td style="text-align: center;">60.11</td><td style="text-align: center;">0.29</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://docs.anthropic.com/en/docs/intro-to-claude#claude-3-family">Claude3-Haiku</a></td> <td style="text-align: center;">200K</td> <td style="text-align: center;">60.94</td><td style="text-align: center;">0.55</td> <td style="text-align: center;">59.97</td><td style="text-align: center;">0.40</td> <td style="text-align: center;">45.53</td><td style="text-align: center;">0.04</td> <td style="text-align: center;">66.85</td><td style="text-align: center;">0.34</td> <td style="text-align: center;">57.14</td><td style="text-align: center;">0.28</td> </tr> <tr> <td><a href="https://kimi.moonshot.cn/">Kimi-Chat</a></td> <td style="text-align: center;">200K</td> <td style="text-align: center;">81.11</td><td style="text-align: center;">0.74</td> <td style="text-align: center;">46.70</td><td style="text-align: center;">0.20</td> <td style="text-align: center;">47.84</td><td style="text-align: center;">0.07</td> <td style="text-align: center;">53.77</td><td style="text-align: center;">0.17</td> <td style="text-align: center;">55.02</td><td style="text-align: center;">0.24</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://huggingface.co/THUDM/glm-4-9b-chat-1m">GLM4-9B-Chat</a></td> <td style="text-align: center;">1000K</td> <td style="text-align: center;">63.11</td><td style="text-align: center;">0.53</td> <td style="text-align: center;">54.10</td><td style="text-align: center;">0.27</td> <td style="text-align: center;">39.50</td><td style="text-align: center;">0.08</td> <td style="text-align: center;">56.32</td><td style="text-align: center;">0.28</td> <td style="text-align: center;">51.43</td><td style="text-align: center;">0.25</td> </tr> <tr> <td><a href="https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct">Llama-3.1-8B-Instruct</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">67.91</td><td style="text-align: center;">0.57</td> <td style="text-align: center;">41.62</td><td style="text-align: center;">0.20</td> <td style="text-align: center;">36.55</td><td style="text-align: center;">0.04</td> <td style="text-align: center;">54.74</td><td style="text-align: center;">0.34</td> <td style="text-align: center;">48.10</td><td style="text-align: center;">0.24</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://huggingface.co/microsoft/Phi-3-mini-128k-instruct">Phi-3-mini</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">46.13</td><td style="text-align: center;">0.30</td> <td style="text-align: center;">22.18</td><td style="text-align: center;">0.05</td> <td style="text-align: center;">19.30</td><td style="text-align: center;">0.02</td> <td style="text-align: center;">20.44</td><td style="text-align: center;">0.03</td> <td style="text-align: center;">24.58</td><td style="text-align: center;">0.07</td> </tr> <tr> <td><a href="https://huggingface.co/microsoft/Phi-3-small-128k-instruct">Phi-3-small</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">35.00</td><td style="text-align: center;">0.15</td> <td style="text-align: center;">26.83</td><td style="text-align: center;">0.12</td> <td style="text-align: center;">17.01</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">15.87</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">21.44</td><td style="text-align: center;">0.05</td> </tr> <tr> <td colspan="12" style="text-align: center;"><b>Set2 (50K-100K)</b></td> </tr> <tr> <td><a href="https://platform.openai.com/docs/models/gpt-4o">GPT-4o</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">86.76</td><td style="text-align: center;">0.72</td> <td style="text-align: center;">59.81</td><td style="text-align: center;">0.40</td> <td style="text-align: center;">47.83</td><td style="text-align: center;">0.11</td> <td style="text-align: center;">62.09</td><td style="text-align: center;">0.34</td> <td style="text-align: center;">58.38</td><td style="text-align: center;">0.29</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://ai.google.dev/gemini-api/docs/models/gemini#:~:text=Gemini-,Gemini%201.5%20Pro%20(Preview%20only),-Text%20and%20images">Gemini-1.5-pro</a></td> <td style="text-align: center;">1000K</td> <td style="text-align: center;">76.50</td><td style="text-align: center;">0.57</td> <td style="text-align: center;">54.51</td><td style="text-align: center;">0.34</td> <td style="text-align: center;">44.58</td><td style="text-align: center;">0.09</td> <td style="text-align: center;">64.87</td><td style="text-align: center;">0.34</td> <td style="text-align: center;">55.56</td><td style="text-align: center;">0.26</td> </tr> <tr> <td><a href="https://docs.anthropic.com/en/docs/intro-to-claude#claude-3-5-family">Claude3.5-Sonnet</a></td> <td style="text-align: center;">200K</td> <td style="text-align: center;">63.83</td><td style="text-align: center;">0.53</td> <td style="text-align: center;">58.90</td><td style="text-align: center;">0.39</td> <td style="text-align: center;">50.96</td><td style="text-align: center;">0.10</td> <td style="text-align: center;">46.09</td><td style="text-align: center;">0.26</td> <td style="text-align: center;">52.73</td><td style="text-align: center;">0.24</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://platform.openai.com/docs/models/gpt-4o-mini">GPT-4o-mini</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">63.54</td><td style="text-align: center;">0.46</td> <td style="text-align: center;">51.48</td><td style="text-align: center;">0.26</td> <td style="text-align: center;">36.56</td><td style="text-align: center;">0.04</td> <td style="text-align: center;">56.51</td><td style="text-align: center;">0.25</td> <td style="text-align: center;">47.74</td><td style="text-align: center;">0.19</td> </tr> <tr> <td><a href="https://huggingface.co/Qwen/Qwen2-72B-Instruct">Qwen2-72B-Instruct</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">64.53</td><td style="text-align: center;">0.43</td> <td style="text-align: center;">42.60</td><td style="text-align: center;">0.21</td> <td style="text-align: center;">38.52</td><td style="text-align: center;">0.05</td> <td style="text-align: center;">51.18</td><td style="text-align: center;">0.20</td> <td style="text-align: center;">45.71</td><td style="text-align: center;">0.17</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://docs.anthropic.com/en/docs/intro-to-claude#claude-3-family">Claude3-Haiku</a></td> <td style="text-align: center;">200K</td> <td style="text-align: center;">73.71</td><td style="text-align: center;">0.66</td> <td style="text-align: center;">41.90</td><td style="text-align: center;">0.22</td> <td style="text-align: center;">36.18</td><td style="text-align: center;">0.02</td> <td style="text-align: center;">50.20</td><td style="text-align: center;">0.15</td> <td style="text-align: center;">45.45</td><td style="text-align: center;">0.17</td> </tr> <tr> <td><a href="https://kimi.moonshot.cn/">Kimi-Chat</a></td> <td style="text-align: center;">200K</td> <td style="text-align: center;">72.82</td><td style="text-align: center;">0.52</td> <td style="text-align: center;">46.77</td><td style="text-align: center;">0.21</td> <td style="text-align: center;">33.46</td><td style="text-align: center;">0.06</td> <td style="text-align: center;">40.51</td><td style="text-align: center;">0.15</td> <td style="text-align: center;">42.40</td><td style="text-align: center;">0.16</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct">Llama-3.1-8B-Instruct</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">72.79</td><td style="text-align: center;">0.59</td> <td style="text-align: center;">44.51</td><td style="text-align: center;">0.27</td> <td style="text-align: center;">32.98</td><td style="text-align: center;">0.01</td> <td style="text-align: center;">40.53</td><td style="text-align: center;">0.15</td> <td style="text-align: center;">41.98</td><td style="text-align: center;">0.16</td> </tr> <tr> <td><a href="https://huggingface.co/THUDM/glm-4-9b-chat-1m">GLM4-9B-Chat</a></td> <td style="text-align: center;">1000K</td> <td style="text-align: center;">65.04</td><td style="text-align: center;">0.54</td> <td style="text-align: center;">41.80</td><td style="text-align: center;">0.23</td> <td style="text-align: center;">30.72</td><td style="text-align: center;">0.02</td> <td style="text-align: center;">42.34</td><td style="text-align: center;">0.17</td> <td style="text-align: center;">40.19</td><td style="text-align: center;">0.17</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://huggingface.co/microsoft/Phi-3-small-128k-instruct">Phi-3-small</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">34.17</td><td style="text-align: center;">0.16</td> <td style="text-align: center;">22.08</td><td style="text-align: center;">0.08</td> <td style="text-align: center;">20.51</td><td style="text-align: center;">0.01</td> <td style="text-align: center;">16.20</td><td style="text-align: center;">0.01</td> <td style="text-align: center;">21.40</td><td style="text-align: center;">0.04</td> </tr> <tr> <td><a href="https://huggingface.co/microsoft/Phi-3-mini-128k-instruct">Phi-3-mini</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">44.71</td><td style="text-align: center;">0.29</td> <td style="text-align: center;">22.81</td><td style="text-align: center;">0.09</td> <td style="text-align: center;">16.37</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">15.39</td><td style="text-align: center;">0.01</td> <td style="text-align: center;">20.84</td><td style="text-align: center;">0.05</td> </tr> <tr> <td colspan="12" style="text-align: center;"><b>Set3 (100K-200K)</b></td> </tr> <tr> <td><a href="https://ai.google.dev/gemini-api/docs/models/gemini#:~:text=Gemini-,Gemini%201.5%20Pro%20(Preview%20only),-Text%20and%20images">Gemini-1.5-pro</a></td> <td style="text-align: center;">1000K</td> <td style="text-align: center;">81.25</td><td style="text-align: center;">0.56</td> <td style="text-align: center;">44.66</td><td style="text-align: center;">0.20</td> <td style="text-align: center;">39.90</td><td style="text-align: center;">0.05</td> <td style="text-align: center;">58.38</td><td style="text-align: center;">0.36</td> <td style="text-align: center;">52.05</td><td style="text-align: center;">0.24</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://platform.openai.com/docs/models/gpt-4o">GPT-4o</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">74.84</td><td style="text-align: center;">0.65</td> <td style="text-align: center;">42.40</td><td style="text-align: center;">0.21</td> <td style="text-align: center;">38.70</td><td style="text-align: center;">0.04</td> <td style="text-align: center;">45.06</td><td style="text-align: center;">0.09</td> <td style="text-align: center;">46.95</td><td style="text-align: center;">0.19</td> </tr> <tr> <td><a href="https://docs.anthropic.com/en/docs/intro-to-claude#claude-3-5-family">Claude3.5-Sonnet</a></td> <td style="text-align: center;">200K</td> <td style="text-align: center;">65.36</td><td style="text-align: center;">0.56</td> <td style="text-align: center;">50.32</td><td style="text-align: center;">0.34</td> <td style="text-align: center;">37.79</td><td style="text-align: center;">0.03</td> <td style="text-align: center;">25.95</td><td style="text-align: center;">0.11</td> <td style="text-align: center;">42.06</td><td style="text-align: center;">0.19</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://docs.anthropic.com/en/docs/intro-to-claude#claude-3-family">Claude3-Haiku</a></td> <td style="text-align: center;">200K</td> <td style="text-align: center;">77.81</td><td style="text-align: center;">0.67</td> <td style="text-align: center;">37.07</td><td style="text-align: center;">0.17</td> <td style="text-align: center;">30.94</td><td style="text-align: center;">0.01</td> <td style="text-align: center;">36.87</td><td style="text-align: center;">0.12</td> <td style="text-align: center;">41.41</td><td style="text-align: center;">0.18</td> </tr> <tr> <td><a href="https://platform.openai.com/docs/models/gpt-4o-mini">GPT-4o-mini</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">58.27</td><td style="text-align: center;">0.49</td> <td style="text-align: center;">33.46</td><td style="text-align: center;">0.09</td> <td style="text-align: center;">27.33</td><td style="text-align: center;">0.01</td> <td style="text-align: center;">35.67</td><td style="text-align: center;">0.04</td> <td style="text-align: center;">35.63</td><td style="text-align: center;">0.11</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://huggingface.co/Qwen/Qwen2-72B-Instruct">Qwen2-72B-Instruct</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">46.99</td><td style="text-align: center;">0.27</td> <td style="text-align: center;">37.06</td><td style="text-align: center;">0.13</td> <td style="text-align: center;">31.50</td><td style="text-align: center;">0.02</td> <td style="text-align: center;">35.01</td><td style="text-align: center;">0.07</td> <td style="text-align: center;">35.94</td><td style="text-align: center;">0.09</td> </tr> <tr> <td><a href="https://huggingface.co/THUDM/glm-4-9b-chat-1m">GLM4-9B-Chat</a></td> <td style="text-align: center;">1000K</td> <td style="text-align: center;">69.19</td><td style="text-align: center;">0.56</td> <td style="text-align: center;">37.99</td><td style="text-align: center;">0.18</td> <td style="text-align: center;">26.63</td><td style="text-align: center;">0.01</td> <td style="text-align: center;">32.30</td><td style="text-align: center;">0.09</td> <td style="text-align: center;">37.36</td><td style="text-align: center;">0.16</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://kimi.moonshot.cn/">Kimi-Chat</a></td> <td style="text-align: center;">200K</td> <td style="text-align: center;">62.13</td><td style="text-align: center;">0.54</td> <td style="text-align: center;">24.20</td><td style="text-align: center;">0.05</td> <td style="text-align: center;">21.98</td><td style="text-align: center;">0.01</td> <td style="text-align: center;">31.02</td><td style="text-align: center;">0.14</td> <td style="text-align: center;">31.37</td><td style="text-align: center;">0.14</td> </tr> <tr> <td><a href="https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct">Llama-3.1-8B-Instruct</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">60.05</td><td style="text-align: center;">0.46</td> <td style="text-align: center;">25.86</td><td style="text-align: center;">0.11</td> <td style="text-align: center;">21.96</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">19.14</td><td style="text-align: center;">0.02</td> <td style="text-align: center;">28.41</td><td style="text-align: center;">0.10</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://huggingface.co/microsoft/Phi-3-small-128k-instruct">Phi-3-small</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">25.12</td><td style="text-align: center;">0.06</td> <td style="text-align: center;">15.26</td><td style="text-align: center;">0.01</td> <td style="text-align: center;">16.80</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">12.75</td><td style="text-align: center;">0.01</td> <td style="text-align: center;">16.94</td><td style="text-align: center;">0.01</td> </tr> <tr> <td><a href="https://huggingface.co/microsoft/Phi-3-mini-128k-instruct">Phi-3-mini</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">7.40</td><td style="text-align: center;">0.03</td> <td style="text-align: center;">1.97</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">6.07</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">7.38</td><td style="text-align: center;">0.01</td> <td style="text-align: center;">5.79</td><td style="text-align: center;">0.01</td> </tr> <tr> <td colspan="12" style="text-align: center;"><b>Set4 (200K-250K)</b></td> </tr> <tr> <td><a href="https://ai.google.dev/gemini-api/docs/models/gemini#:~:text=Gemini-,Gemini%201.5%20Pro%20(Preview%20only),-Text%20and%20images">Gemini-1.5-pro</a></td> <td style="text-align: center;">1000K</td> <td style="text-align: center;">62.23</td><td style="text-align: center;">0.49</td> <td style="text-align: center;">43.08</td><td style="text-align: center;">0.20</td> <td style="text-align: center;">36.48</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">68.51</td><td style="text-align: center;">0.49</td> <td style="text-align: center;">50.70</td><td style="text-align: center;">0.25</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://docs.anthropic.com/en/docs/intro-to-claude#claude-3-family">Claude3-Haiku</a></td> <td style="text-align: center;">200K</td> <td style="text-align: center;">53.26</td><td style="text-align: center;">0.40</td> <td style="text-align: center;">27.00</td><td style="text-align: center;">0.03</td> <td style="text-align: center;">25.36</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">28.11</td><td style="text-align: center;">0.05</td> <td style="text-align: center;">32.15</td><td style="text-align: center;">0.10</td> </tr> <tr> <td><a href="https://platform.openai.com/docs/models/gpt-4o">GPT-4o</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">36.79</td><td style="text-align: center;">0.19</td> <td style="text-align: center;">23.97</td><td style="text-align: center;">0.08</td> <td style="text-align: center;">30.40</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">32.89</td><td style="text-align: center;">0.07</td> <td style="text-align: center;">31.11</td><td style="text-align: center;">0.07</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://docs.anthropic.com/en/docs/intro-to-claude#claude-3-5-family">Claude3.5-Sonnet</a></td> <td style="text-align: center;">200K</td> <td style="text-align: center;">36.91</td><td style="text-align: center;">0.24</td> <td style="text-align: center;">28.82</td><td style="text-align: center;">0.05</td> <td style="text-align: center;">28.68</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">28.77</td><td style="text-align: center;">0.08</td> <td style="text-align: center;">30.51</td><td style="text-align: center;">0.08</td> </tr> <tr> <td><a href="https://huggingface.co/Qwen/Qwen2-72B-Instruct">Qwen2-72B-Instruct</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">33.18</td><td style="text-align: center;">0.16</td> <td style="text-align: center;">26.59</td><td style="text-align: center;">0.08</td> <td style="text-align: center;">29.84</td><td style="text-align: center;">0.01</td> <td style="text-align: center;">25.81</td><td style="text-align: center;">0.04</td> <td style="text-align: center;">28.92</td><td style="text-align: center;">0.06</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct">Llama-3.1-8B-Instruct</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">31.72</td><td style="text-align: center;">0.13</td> <td style="text-align: center;">27.27</td><td style="text-align: center;">0.10</td> <td style="text-align: center;">15.17</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">22.89</td><td style="text-align: center;">0.02</td> <td style="text-align: center;">22.51</td><td style="text-align: center;">0.05</td> </tr> <tr> <td><a href="https://platform.openai.com/docs/models/gpt-4o-mini">GPT-4o-mini</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">20.66</td><td style="text-align: center;">0.09</td> <td style="text-align: center;">19.18</td><td style="text-align: center;">0.03</td> <td style="text-align: center;">16.03</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">27.81</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">20.41</td><td style="text-align: center;">0.02</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://huggingface.co/THUDM/glm-4-9b-chat-1m">GLM4-9B-Chat</a></td> <td style="text-align: center;">1000K</td> <td style="text-align: center;">15.67</td><td style="text-align: center;">0.12</td> <td style="text-align: center;">21.33</td><td style="text-align: center;">0.05</td> <td style="text-align: center;">12.35</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">21.04</td><td style="text-align: center;">0.05</td> <td style="text-align: center;">16.84</td><td style="text-align: center;">0.05</td> </tr> <tr> <td><a href="https://kimi.moonshot.cn/">Kimi-Chat</a></td> <td style="text-align: center;">200K</td> <td style="text-align: center;">20.17</td><td style="text-align: center;">0.12</td> <td style="text-align: center;">9.17</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">5.65</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">22.61</td><td style="text-align: center;">0.11</td> <td style="text-align: center;">13.50</td><td style="text-align: center;">0.05</td> </tr> <tr style="background-color:#f0f0f0;"> <td><a href="https://huggingface.co/microsoft/Phi-3-small-128k-instruct">Phi-3-small</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">22.36</td><td style="text-align: center;">0.02</td> <td style="text-align: center;">16.43</td><td style="text-align: center;">0.05</td> <td style="text-align: center;">11.50</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">10.35</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">14.27</td><td style="text-align: center;">0.01</td> </tr> <tr> <td><a href="https://huggingface.co/microsoft/Phi-3-mini-128k-instruct">Phi-3-mini</a></td> <td style="text-align: center;">128K</td> <td style="text-align: center;">5.21</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">2.20</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">3.45</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">2.58</td><td style="text-align: center;">0.00</td> <td style="text-align: center;">3.38</td><td style="text-align: center;">0.00</td> </tr> </tbody> </table>

The performance of LLMs on four evaluation tasks with different length sets. For each task, the indicator on the left represents the Avg Scores(0~100), while the right one represents the Perfect Rate(0~1).

🔧Evaluate long-context LLMs

Step1 Download Loong benchmark and docs

git clone https://github.com/MozerWang/Loong.git
cd Loong
wget -P data/ http://alibaba-research.oss-cn-beijing.aliyuncs.com/loong/doc.zip
unzip data/doc.zip -d data/

Step2 Create a conda environment and Install other dependencies.

conda create --name loong python=3.9 -y
conda activate loong
pip install -r requirements.txt

Step3 Preparing the Model

  1. (Must) Set up your OPENAI key in config/models/gpt4.yaml
api_key: "Your OPENAI key"
  1. If you are using API-based LLM
# Firstly, Set up your key in config/models/*.yaml
api_key: "Your API key"
  1. If you are using Open-sourced LLM
# We recommend using vLLM. And we use HTTP server that implements OpenAI’s Completions and Chat API.
# We have provided using example for Qwen2 and GLM4. See details in Loong/src/vllm_eample.sh
cd src
sh vllm_example.sh

Step4 Evaluate

cd src
sh run.sh

Things To Know

# Input
generate(prompts, config, output_path, process_num, tag)

# Output
result = prompt.copy() # for prompt in prompts
result[tag] = response_content # Your LLM's response
with open(output_path, 'a', encoding='utf-8') as fw:
    fw.write(json.dumps(result, ensure_ascii=False) + '\n')

Citation

@inproceedings{wang2024loong,
  title={Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA},
  author={Minzheng Wang, Longze Chen, Cheng Fu, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, Yunshui Li, Min Yang, Fei Huang, Yongbin Li},
  year={2024}
  booktitle ={Proceedings of EMNLP},
  url={https://aclanthology.org/2024.emnlp-main.322},
  pages={5627--5646},
}