Topical-Chat

We introduce Topical-Chat, a knowledge-grounded human-human conversation dataset where the underlying knowledge spans 8 broad topics and conversation partners don’t have explicitly defined roles.

Topical-Chat broadly consists of two types of files: conversation files (in conversations/) and reading-set files (in reading_sets/).

We provide a simple script, build.py, that builds the reading sets for the dataset by making API calls to the relevant data sources.

For detailed information about the dataset, benchmarking experiments and evaluation results, please refer to our paper.

Prerequisites

After cloning this repo, please run the following commands, preferably after creating a virtual environment:

pip install -r src/requirements.txt
mkdir reading_sets/post-build/

Please create your own Reddit API credentials and manually add them to src/reddit/prawler.py.

The scripts in this repo have been tested with Python 3.7, which we recommend using.

Build

After manually adding your own Reddit credentials to src/reddit/prawler.py and creating the reading_sets/post-build/ directory, run python build.py.

build.py reads each file in reading_sets/pre-build/ and writes a replica JSON with the exact same name to reading_sets/post-build/, with the actual reading sets filled in.
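The pre-build to post-build flow can be sketched as follows. This is a minimal illustration, not the actual build.py: the fill step (fetching Wikipedia, Washington Post and Reddit data) is abstracted into a caller-supplied function, and file names are assumptions.

```python
import json
from pathlib import Path

def build(pre_build_dir, post_build_dir, fill_fn):
    """Replicate every JSON file from pre-build into post-build,
    passing its contents through fill_fn to add the reading sets."""
    post = Path(post_build_dir)
    post.mkdir(parents=True, exist_ok=True)
    for src in Path(pre_build_dir).glob("*.json"):
        with open(src) as f:
            data = json.load(f)
        with open(post / src.name, "w") as f:
            json.dump(fill_fn(data), f)
```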

Dataset

Statistics:

| Stat | Train | Valid Freq. | Valid Rare | Test Freq. | Test Rare | All |
|---|---|---|---|---|---|---|
| # of conversations | 8628 | 539 | 539 | 539 | 539 | 10784 |
| # of utterances | 188378 | 11681 | 11692 | 11760 | 11770 | 235281 |
| average # of turns per conversation | 21.8 | 21.6 | 21.7 | 21.8 | 21.8 | 21.8 |
| average length of utterance | 19.5 | 19.8 | 19.8 | 19.5 | 19.5 | 19.6 |

Split:

The data is split into 5 distinct groups: train, valid frequent, valid rare, test frequent and test rare. The frequent set contains entities frequently seen in the training set. The rare set contains entities that were infrequently seen in the training set.
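As an illustration of the frequent/rare distinction (a sketch of the idea, not the official split procedure), entities can be partitioned by how often they occur in the training set, using an assumed count threshold:

```python
from collections import Counter

def split_entities(train_entities, threshold):
    """Partition entities into (frequent, rare) sets based on
    how many times each appears in the training set."""
    counts = Counter(train_entities)
    frequent = {e for e, c in counts.items() if c >= threshold}
    rare = set(counts) - frequent
    return frequent, rare
```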

Configuration Type:

For each conversation to be collected, we applied a random knowledge configuration from a pre-defined list of configurations, to construct a pair of reading sets to be rendered to the partnered Turkers. Configurations were defined to impose varying degrees of knowledge symmetry or asymmetry between partner Turkers, leading to the collection of a wide variety of conversations.

Reading sets for Turkers 1 and 2 in Config A

Reading sets for Turkers 1 and 2 in Config B

Reading sets for Turkers 1 and 2 in Config C&D

Conversations:

Each JSON file in conversations/ has the following format:

{
    <conversation_id>: {
        "article_url": <article url>,
        "config": <config>,  # one of A, B, C, D
        "content": [  # ordered list of conversation turns
            {
                "agent": "agent_1",  # or "agent_2"
                "message": <message text>,
                "sentiment": <text>,
                "knowledge_source": ["AS1", "Personal Knowledge", ...],
                "turn_rating": "Poor"
            },
            ...
        ],
        "conversation_rating": {
            "agent_1": "Good",
            "agent_2": "Excellent"
        }
    },
    ...
}
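A small example of working with this format. The conversation below is hand-constructed for illustration (the ID, messages and ratings are made up); the helper collects the knowledge sources each agent cited across its turns:

```python
import json

# A hand-constructed conversation in the format above (values illustrative).
sample = json.loads("""
{
  "t_0001": {
    "article_url": "http://example.com/article",
    "config": "A",
    "content": [
      {"agent": "agent_1", "message": "Did you know...", "sentiment": "Curious",
       "knowledge_source": ["FS1"], "turn_rating": "Good"},
      {"agent": "agent_2", "message": "I did not!", "sentiment": "Surprised",
       "knowledge_source": ["Personal Knowledge"], "turn_rating": "Excellent"}
    ],
    "conversation_rating": {"agent_1": "Good", "agent_2": "Excellent"}
  }
}
""")

def sources_by_agent(conversations):
    """Map each agent to the set of knowledge sources it cited."""
    used = {}
    for conv in conversations.values():
        for turn in conv["content"]:
            used.setdefault(turn["agent"], set()).update(turn["knowledge_source"])
    return used
```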

Reading Sets:

Each JSON file in reading_sets/post-build/ has the following format:

{
    <conversation_id>: {
        "config": <config>,
        "agent_1": {
            "FS1": {
                "entity": <entity name>,
                "shortened_wiki_lead_section": <section text>,
                "fun_facts": [<fact1_text>, <fact2_text>, ...]
            },
            "FS2": ...,
            ...
        },
        "agent_2": {
            "FS1": {
                "entity": <entity name>,
                "shortened_wiki_lead_section": <section text>,
                "fun_facts": [<fact1_text>, <fact2_text>, ...]
            },
            "FS2": ...,
            ...
        },
        "article": {
            "url": <url>,
            "headline": <headline text>,
            "AS1": <section 1 text>,
            "AS2": <section 2 text>,
            "AS3": <section 3 text>,
            "AS4": <section 4 text>
        }
    },
    ...
}
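The knowledge_source tags in a conversation turn (e.g. "FS1", "AS2") point into the corresponding reading set. The example below (with a hand-constructed reading set; all text values are illustrative) sketches how a tag could be resolved to its grounding text:

```python
# A hand-constructed reading set in the format above (values illustrative).
reading_set = {
    "config": "A",
    "agent_1": {
        "FS1": {"entity": "Dogs",
                "shortened_wiki_lead_section": "The dog is a domesticated...",
                "fun_facts": ["Dogs can smell..."]}
    },
    "agent_2": {
        "FS1": {"entity": "Dogs",
                "shortened_wiki_lead_section": "The dog is a domesticated...",
                "fun_facts": ["A dog's nose print..."]}
    },
    "article": {"url": "http://example.com", "headline": "A headline",
                "AS1": "Section 1 text", "AS2": "Section 2 text",
                "AS3": "Section 3 text", "AS4": "Section 4 text"},
}

def resolve_source(reading_set, agent, tag):
    """Return the grounding data behind a knowledge_source tag for one agent."""
    if tag.startswith("AS"):   # article section, shared across agents
        return reading_set["article"][tag]
    if tag.startswith("FS"):   # the agent's own fact section
        return reading_set[agent][tag]
    return None                # e.g. "Personal Knowledge" has no reading-set text
```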

Wikipedia Data:

src/wiki/wiki.json has the following format:

{
  "shortened_wiki_lead_section": {
    <shortened wiki lead section text>: <unique_identifier>,
    <shortened wiki lead section text>: <unique_identifier>
  },
  "summarized_wiki_lead_section": {
    <summarized wiki lead section text>: <unique_identifier>,
    <summarized wiki lead section text>: <unique_identifier>
  }
}

build.py puts data from wiki.json into the relevant reading sets.
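Note that wiki.json maps section text to a unique identifier, not the other way around. To look text up by identifier (as a build step presumably needs when filling reading sets), the mapping can be inverted; the sample data here is hand-constructed:

```python
# Hand-constructed sample in the wiki.json format (text -> identifier).
wiki = {
    "shortened_wiki_lead_section": {
        "The dog is a domesticated carnivore...": "id_001",
        "Association football, more commonly...": "id_002",
    },
    "summarized_wiki_lead_section": {
        "Dogs are domesticated animals.": "id_003",
    },
}

def invert(wiki):
    """Invert each section's text -> id mapping into id -> text."""
    return {section: {uid: text for text, uid in mapping.items()}
            for section, mapping in wiki.items()}
```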

Citation

If you use Topical-Chat in your work, please cite it as follows:

@inproceedings{gopalakrishnan2019topical,
  author={Karthik Gopalakrishnan and Behnam Hedayatnia and Qinlang Chen and Anna Gottardi and Sanjeev Kwatra and Anu Venkatesh and Raefer Gabriel and Dilek Hakkani-Tür},
  title={{Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={1891--1895},
  doi={10.21437/Interspeech.2019-3079},
  url={http://dx.doi.org/10.21437/Interspeech.2019-3079}
}
Gopalakrishnan, Karthik, et al. "Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations." Proc. Interspeech 2019, pp. 1891–1895.

Acknowledgements

We thank Anju Khatri, Anjali Chadha and Mohammad Shami for their help with the public release of the dataset. We thank Jeff Nunn and Yi Pan for their early contributions to the dataset collection.