PRESTO: A Multilingual Dataset for Parsing Realistic Task-Oriented Dialogs

Introduction

PRESTO is a dataset of over 550K contextual multilingual conversations between humans and virtual assistants. It contains a diverse array of challenges that occur in real-world NLU tasks, such as disfluencies, code-switching, and revisions. It is the only large-scale, human-generated conversational parsing dataset that provides structured context, such as a user's contacts and lists, for each example.

The dataset can be downloaded here. This README documents the dataset structure and other important information about the dataset.

Dataset Structure

PRESTO is published with the following files:

├── presto_dataset.jsonl
├── presto_dev.jsonl
├── presto_test.jsonl
├── presto_train.jsonl
└── test_partitions
    ├── de-DE
    │   ├── de-DE_code-mixing
    │   │   └── test.jsonl
    │   ├── de-DE_disfluency
    │   │   └── test.jsonl
    │   ├── de-DE_non_contextual
    │   │   └── test.jsonl
    │   ├── de-DE_no_phenomena
    │   │   └── test.jsonl
    │   ├── de-DE_revisions
    │   │   └── test.jsonl
    │   └── test.jsonl
    ... (more languages)

presto_dataset.jsonl contains all examples in the dataset. The provided train, dev, and test sets are those used in the paper. Each test partition contains a subset of the examples from the full test set.
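As a minimal sketch of how these files might be read: the helpers below build the path to one test partition from a locale and a phenomenon name, following the directory layout shown above, and yield one parsed example per line. The names `partition_path` and `read_jsonl` and the `root` directory are illustrative assumptions, not part of the dataset release, and the usual one-JSON-object-per-line JSONL layout is assumed.

```python
import json
from pathlib import Path
from typing import Iterator, Optional


def partition_path(root: Path, locale: str, phenomenon: Optional[str] = None) -> Path:
    """Return the test file for a locale, optionally narrowed to one phenomenon.

    Follows the layout test_partitions/<locale>/<locale>_<phenomenon>/test.jsonl.
    `root` is wherever the dataset was unpacked (an assumption of this sketch).
    """
    base = root / "test_partitions" / locale
    if phenomenon is None:
        # Full test set for this locale.
        return base / "test.jsonl"
    return base / f"{locale}_{phenomenon}" / "test.jsonl"


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed example per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For instance, `partition_path(Path("presto"), "de-DE", "disfluency")` resolves to `presto/test_partitions/de-DE/de-DE_disfluency/test.jsonl`. Because `read_jsonl` is a generator, the examples can be iterated lazily without holding the whole file in memory.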

Each JSONL file stores one example per line, and each example is a JSON object with the following fields:
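The field names can also be checked directly against the data. The sketch below counts how often each top-level key appears across the first few examples of a file, without assuming any particular field name. `field_counts` and its `limit` parameter are hypothetical helpers introduced here for illustration, and each line is assumed to hold a single JSON object.

```python
import json
from collections import Counter
from itertools import islice
from pathlib import Path


def field_counts(path: Path, limit: int = 1000) -> Counter:
    """Count top-level keys over the first `limit` non-empty lines of a JSONL file."""
    counts: Counter = Counter()
    with path.open(encoding="utf-8") as handle:
        # Skip blank lines so they do not count toward the limit.
        lines = (line for line in handle if line.strip())
        for line in islice(lines, limit):
            # Assumes every line decodes to a JSON object (a dict).
            counts.update(json.loads(line).keys())
    return counts
```

Calling `field_counts(Path("presto_dev.jsonl"))` returns a `Counter` mapping each key to the number of sampled examples that contain it, which also reveals fields present on only some examples.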

Citation

If you use this dataset, please cite the following paper:

@misc{https://doi.org/10.48550/arxiv.2303.08954,
  doi = {10.48550/ARXIV.2303.08954},
  url = {https://arxiv.org/abs/2303.08954},
  author = {Goel, Rahul and Ammar, Waleed and Gupta, Aditya and Vashishtha, Siddharth and Sano, Motoki and Surani, Faiz and Chang, Max and Choe, HyunJeong and Greene, David and He, Kyle and Nitisaroj, Rattima and Trukhina, Anna and Paul, Shachi and Shah, Pararth and Shah, Rushin and Yu, Zhou},
  title = {PRESTO: A Multilingual Dataset for Parsing Realistic Task-Oriented Dialogs},
  publisher = {arXiv},
  year = {2023},
}

License

PRESTO is licensed under CC BY 4.0.