


This repository contains code to run inference on the ZeroSCROLLS benchmark.


Load the data

from datasets import load_dataset

gov_report = load_dataset("tau/zero_scrolls", "gov_report", split="test")
Options are: ["gov_report", "summ_screen_fd", "qmsum", "squality", "qasper", "narrative_qa", "quality", "musique", "space_digest", "book_sum_sort"]
Each task also has a small "validation" split (~20 examples per task), meant for eyeballing purposes.
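To look at the validation examples for several tasks at once, something like the following sketch may help. It reuses the `load_dataset` call shown above. The helper name `load_validation_splits` and its `loader` parameter are assumptions made for illustration and are not part of this repository.

```python
# Task names copied from the options listed above.
ZERO_SCROLLS_TASKS = [
    "gov_report", "summ_screen_fd", "qmsum", "squality", "qasper",
    "narrative_qa", "quality", "musique", "space_digest", "book_sum_sort",
]


def load_validation_splits(tasks=ZERO_SCROLLS_TASKS, loader=None):
    """Return a dict mapping each task name to its "validation" split.

    `loader` is a hypothetical hook (not part of this repository) that lets
    the helper be exercised offline; by default it is `datasets.load_dataset`.
    """
    if loader is None:
        # Imported lazily so that the module can be imported without `datasets`.
        from datasets import load_dataset
        loader = load_dataset
    return {
        task: loader("tau/zero_scrolls", task, split="validation")
        for task in tasks
    }


# Example usage (downloads data, so it is left commented out):
# splits = load_validation_splits(["gov_report", "qasper"])
# print(splits["gov_report"][0])
```

Depending on the installed version of `datasets`, extra arguments may be needed to load the benchmark; the sketch does not try to cover that.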

Inference with Huggingface models

python experiments/hf/run_hf_model.py --model-name=google/flan-t5-small

Supported models:

To add new models:

Inference with APIs

To run with models used in the paper*:

# if you want to use openai models
export OPENAI_API_KEY=<insert token here> 
export OPENAI_ORG=<insert org here>

# if you want to use anthropic models
export ANTHROPIC_API_KEY=<insert token here>

# if you want to limit the number of examples to run per task
export MAX_EXAMPLES=10

python experiments/api/run_api_model.py --model_name=gpt-3.5-turbo --limit_to_n_examples=$MAX_EXAMPLES

*These models and APIs are updated over time; see the paper for the versions used in the baselines.
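Before launching a long API run, it can be useful to confirm that the environment variables above are actually set. The following is a minimal sketch of such a pre-flight check; the function `missing_env_vars` is hypothetical and is not provided by this repository.

```python
import os


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`.

    `env` defaults to the current process environment; passing a plain dict
    makes the check easy to try out without touching real credentials.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example usage for the OpenAI models (variable names taken from the exports above):
# missing = missing_env_vars(["OPENAI_API_KEY", "OPENAI_ORG"])
# if missing:
#     raise SystemExit(f"Please export: {', '.join(missing)}")
```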

Models supported:

To add a new API, you need to:

When using a prompt that includes opening XML tags (e.g. "... Assistant: <answer>"), post-process the generations before submitting so that only the text preceding the closing XML tag generated by the model is retained.
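One possible way to do this post-processing is sketched below. It assumes the closing tag is `</answer>`, matching the example prompt above, and that surrounding whitespace can be stripped; the function name `keep_answer_prefix` is illustrative and not part of this repository.

```python
def keep_answer_prefix(generation: str, closing_tag: str = "</answer>") -> str:
    """Keep only the text that precedes the first closing tag.

    If the model never emitted the closing tag, the whole generation is kept.
    Stripping whitespace is an assumption, not a benchmark requirement.
    """
    prefix, _, _ = generation.partition(closing_tag)
    return prefix.strip()


# Example usage on a mapping from example ID to raw generation:
# cleaned = {ex_id: keep_answer_prefix(text) for ex_id, text in raw.items()}
```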

Prepare submission

To create a CSV file in the correct format for a leaderboard submission, we recommend using our conversion script, prepare_submission.py.

Its inputs:

For each task, the predictions should be in a JSON file that is a mapping from an ID to a textual prediction:

{
    "example_id1": "prediction1",
    "example_id2": "prediction2",
    ...
}
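A predictions file in this shape could be written roughly as follows. This is a sketch only: the helper `write_predictions` and the file name in the usage comment are assumptions, and the only requirement taken from the text above is that the file maps each example ID to a textual prediction.

```python
import json
from pathlib import Path


def write_predictions(predictions, path):
    """Write an {example_id: prediction} mapping to `path` as JSON.

    IDs and predictions are coerced to strings so the file is a plain
    string-to-string mapping. Returns the mapping that was written.
    """
    clean = {str(ex_id): str(pred) for ex_id, pred in predictions.items()}
    Path(path).write_text(
        json.dumps(clean, indent=4, ensure_ascii=False), encoding="utf-8"
    )
    return clean


# Example usage (hypothetical file name):
# write_predictions({"example_id1": "prediction1"}, "gov_report_preds.json")
```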

Please set:


python submission/prepare_submission.py \
--gov_report_file GOV_REPORT_PREDS_FILE \
--summ_screen_fd_file SUMM_SCREEN_FD_PREDS_FILE \
--qmsum_file QMSUM_PREDS_FILE \
--squality_file SQUALITY_PREDS_FILE \
--qasper_file QASPER_PREDS_FILE \
--narrative_qa_file NARRATIVE_QA_PREDS_FILE \
--quality_file QUALITY_PREDS_FILE \
--musique_file MUSIQUE_PREDS_FILE \
--space_digest_file SPACE_DIGEST_PREDS_FILE \
--book_sum_sort_file BOOK_SUM_SORT_PREDS_FILE \
--output_dir OUTPUT_DIR

Verify your submission file


python submission/verify_submission.py \
--all_predictions SUBMISSION_FILE \
--output_dir OUTPUT_DIR

A valid submission file will result in the following line printed:

The verification was successful.

Please fix any errors before making your submission.


The live leaderboard is here.


    title = "{Z}ero{SCROLLS}: A Zero-Shot Benchmark for Long Text Understanding",
    author = "Shaham, Uri  and
      Ivgi, Maor  and
      Efrat, Avia  and
      Berant, Jonathan  and
      Levy, Omer",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.536",
    doi = "10.18653/v1/2023.findings-emnlp.536",
    pages = "7977--7989"

If you find the ZeroSCROLLS data useful, please also cite the original dataset papers: [bibtex]