The Schema-Guided Dialogue Dataset

Contact - schema-guided-dst@google.com

Overview

The Schema-Guided Dialogue (SGD) dataset consists of over 20k annotated multi-domain, task-oriented conversations between a human and a virtual assistant. These conversations involve interactions with services and APIs spanning 20 domains, such as banks, events, media, calendar, travel, and weather. For most of these domains, the dataset contains multiple different APIs, many of which have overlapping functionalities but different interfaces, which reflects common real-world scenarios. The wide range of available annotations can be used for intent prediction, slot filling, dialogue state tracking, policy imitation learning, language generation, and user simulation learning, among other tasks for developing large-scale virtual assistants. Additionally, the dataset contains unseen domains and services in the evaluation set to quantify the performance in zero-shot or few-shot settings.

Schema-Guided Dialogue - eXtended (SGD-X) is a benchmark for measuring the robustness of dialogue systems to linguistic variations in schemas. SGD-X extends the SGD dataset with 5 crowdsourced variants for every schema, where the variants are semantically similar yet stylistically diverse. Models trained on SGD are evaluated on SGD-X to measure how well they generalize in a real-world setting, where a wide variety of linguistic styles exists.

The datasets are provided "AS IS" without any warranty, express or implied. Google disclaims all liability for any damages, direct or indirect, resulting from the use of this dataset.

Updates

10/19/2021 - SGD-X schemas for measuring robustness to linguistic variations in schemas released, along with a script to convert dialogue annotations according to the new schemas.

07/05/2020 - Test set annotations released. User actions and service calls made during the dialogue are also released for all dialogues.

10/14/2019 - DSTC8 challenge concluded. Details about the submissions to the challenge may be found in the DSTC8 overview paper.

10/07/2019 - Test dataset released without the dialogue state annotations.

07/23/2019 - Train and dev sets are publicly released as part of the DSTC8 challenge.

Important Links

Data

The SGD dataset consists of schemas outlining the interface of different APIs and annotated dialogues. The dialogues were generated with the help of a dialogue simulator and paid crowd-workers. The data collection approach is summarized in the SGD paper cited in the License section.

The SGD-X dataset consists of 5 linguistic variants of every schema in the original SGD dataset. Linguistic variants were written by hundreds of paid crowd-workers. In the SGD-X directory, v1 represents the variant closest to the original schemas and v5 the farthest in terms of linguistic distance. To evaluate model performance on SGD-X schemas, dialogues must be converted using the script generate_sgdx_dialogues.py.

Schema Representation

A service or API is essentially a set of functions (called intents), each taking a set of parameters (called slots). A schema is a normalized representation of the interface exposed by a service/API. The schema also includes natural language descriptions of the functions and their parameters to outline the semantics of each element. The SGD schemas were manually generated by the dataset creators, and the SGD-X schema variants were created by having crowd-workers paraphrase the original schemas. Each schema is represented as a JSON object containing the following fields:

*service_names follow the form "<domain name>_<number>" (e.g. Banks_2). The number is used to disambiguate services from the same domain. SGD-X variant schemas have two-digit numbers, where the first digit is copied from the original schema, and the second digit is the SGD-X variant number. For example, the v1 variant of Banks_2 is Banks_21.
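The naming convention above can be sketched in Python. This is a minimal illustration rather than code shipped with the dataset: the function names `sgdx_service_name` and `split_sgdx_service_name` are hypothetical, and the only rule assumed is the one stated in the note (the variant digit is appended to the original service number).

```python
from typing import Tuple


def sgdx_service_name(original_name: str, variant: int) -> str:
    """Return the SGD-X name for an original service and a variant number.

    Hypothetical helper: it only applies the rule that the variant digit
    is appended to the original service number.
    """
    if not 1 <= variant <= 5:
        raise ValueError("SGD-X provides variants 1 through 5")
    domain, _, number = original_name.rpartition("_")
    if not domain or not number.isdigit():
        raise ValueError(f"unexpected service name: {original_name!r}")
    return f"{domain}_{number}{variant}"


def split_sgdx_service_name(variant_name: str) -> Tuple[str, int]:
    """Recover (original service name, variant number) from an SGD-X name."""
    domain, _, number = variant_name.rpartition("_")
    if not domain or not number.isdigit() or len(number) < 2:
        raise ValueError(f"not an SGD-X service name: {variant_name!r}")
    # The last digit is the variant; the leading digit(s) are the original number.
    return f"{domain}_{number[:-1]}", int(number[-1])


name = sgdx_service_name("Banks_2", 1)
original, variant = split_sgdx_service_name(name)
print(name, original, variant)
```

For the example in the note, `sgdx_service_name("Banks_2", 1)` gives `Banks_21`, and splitting that name gives back `Banks_2` and variant `1`.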

Dialogue Representation

Dialogues are represented as a list of turns, where each turn contains either a user or a system utterance. The annotations for a turn are grouped into frames, where each frame corresponds to a single service. Each turn in the single-domain dataset contains exactly one frame. In multi-domain datasets, some turns may have multiple frames.
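As a rough sketch of the turn and frame structure described above, the Python snippet below lists the services annotated in each turn and finds the turns that carry more than one frame. The key names `turns`, `speaker`, `utterance`, `frames`, and `service` are assumptions made for this illustration, and the toy dialogue is invented, not taken from the dataset; check the key names against the actual dialogue files before relying on them.

```python
from typing import Any, Dict, List


def services_per_turn(dialogue: Dict[str, Any]) -> List[List[str]]:
    """Return, for each turn, the services named by that turn's frames."""
    return [
        [frame["service"] for frame in turn["frames"]]
        for turn in dialogue["turns"]
    ]


def multi_frame_turns(dialogue: Dict[str, Any]) -> List[int]:
    """Return the indices of turns that have more than one frame."""
    return [
        index
        for index, turn in enumerate(dialogue["turns"])
        if len(turn["frames"]) > 1
    ]


# Invented toy dialogue; the key names are assumed, not confirmed here.
example_dialogue = {
    "turns": [
        {
            "speaker": "USER",
            "utterance": "Check my balance, and what is the weather today?",
            "frames": [{"service": "Banks_2"}, {"service": "Weather_1"}],
        },
        {
            "speaker": "SYSTEM",
            "utterance": "Which account would you like to check?",
            "frames": [{"service": "Banks_2"}],
        },
    ]
}

print(services_per_turn(example_dialogue))
print(multi_frame_turns(example_dialogue))
```

In a single-domain dialogue every inner list would hold exactly one service, so `multi_frame_turns` would return an empty list.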

Each dialogue is represented as a JSON object with the following fields:

Each turn consists of the following fields:

Each frame consists of the following fields:

List of possible system acts:

List of possible user acts:

License

The SGD and SGD-X datasets are released under the CC BY-SA 4.0 license. For the full license, see LICENSE.txt. Please cite the following papers if you use the datasets in your work:

SGD

@inproceedings{rastogi2020towards,
  title={Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset},
  author={Rastogi, Abhinav and Zang, Xiaoxue and Sunkara, Srinivas and Gupta, Raghav and Khaitan, Pranav},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={34},
  number={05},
  pages={8689--8696},
  year={2020}
}

SGD-X

@inproceedings{lee2022sgd,
  title={SGD-X: A Benchmark for Robust Generalization in Schema-Guided Dialogue Systems},
  author={Lee, Harrison and Gupta, Raghav and Rastogi, Abhinav and Cao, Yuan and Zhang, Bin and Wu, Yonghui},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={36},
  number={10},
  pages={10938--10946},
  year={2022}
}

Dataset Metadata

The following table is necessary for this dataset to be indexed by search engines such as <a href="https://g.co/datasetsearch">Google Dataset Search</a>.

<div itemscope itemtype="http://schema.org/Dataset"> <table> <tr> <th>property</th> <th>value</th> </tr> <tr> <td>name</td> <td><code itemprop="name">Schema-Guided Dialogue Dataset</code></td> </tr> <tr> <td>alternateName</td> <td><code itemprop="alternateName">SGD dataset</code></td> </tr> <tr> <td>url</td> <td><code itemprop="url">https://github.com/google-research-datasets/dstc8-schema-guided-dialogue</code></td> </tr> <tr> <td>sameAs</td> <td><code itemprop="sameAs">https://github.com/google-research-datasets/dstc8-schema-guided-dialogue</code></td> </tr> <tr> <td>description</td> <td><code itemprop="description">The dataset consists of conversations between a virtual assistant and a user ranging over a variety of domains including Travel, Events, Payment, Media, Restaurants, Weather etc. Annotations for natural language understanding, dialogue state tracking, policy learning, natural language generation and user simulation learning are also included.</code></td> </tr> <tr> <td>provider</td> <td> <div itemscope itemtype="http://schema.org/Organization" itemprop="provider"> <table> <tr> <th>property</th> <th>value</th> </tr> <tr> <td>name</td> <td><code itemprop="name">Google</code></td> </tr> <tr> <td>sameAs</td> <td><code itemprop="sameAs">https://en.wikipedia.org/wiki/Google</code></td> </tr> </table> </div> </td> </tr> <tr> <td>citation</td> <td><code itemprop="citation">https://identifiers.org/arxiv:1909.05855</code></td> </tr> </table> </div>