<img src=figure/icon.png width=8% />Pchatbot: A Large-Scale Dataset for Personalized Chatbot
Introduction
We introduce Pchatbot, a large-scale conversation dataset dedicated to the development of personalized dialogue models. In this dataset, we assign anonymized user IDs and timestamps to conversations. Users' dialogue histories can be retrieved and used to build rich user profiles. With the dialogue histories available, we can move from personality-based models to personalized models.
Pchatbot has two subsets, PchatbotW and PchatbotL, built from open-domain Weibo and judicial forums respectively. Because each subset is very large, we divided each one into 10 equal parts by number of users and named the parts PchatbotW-i and PchatbotL-i.
The dataset paper was accepted to SIGIR 2021 (Resource Track). See the paper for more details.
Citation
If you use the dataset in your work, please cite:
@inproceedings{qian2021pchatbot,
author = {Hongjin Qian and Xiaohe Li and Hanxun Zhong and Yu Guo and Yueyuan Ma and Yutao Zhu and Zhanliang Liu and Zhicheng Dou and Ji-Rong Wen},
title = {Pchatbot: A Large-Scale Dataset for Personalized Chatbot},
booktitle = {Proceedings of the {SIGIR} 2021},
publisher = {{ACM}},
year = {2021},
url = {https://doi.org/10.1145/3404835.3463239},
doi = {10.1145/3404835.3463239}}
The following papers use the Pchatbot dataset:
- One Chatbot Per Person: Creating Personalized Chatbots based on Implicit User Profiles (SIGIR 2021 Long Paper)
@inproceedings{DBLP:conf/sigir/madousigir21,
author = {Zhengyi Ma and Zhicheng Dou and Yutao Zhu and Hanxun Zhong and Ji-Rong Wen},
title = {One Chatbot Per Person: Creating Personalized Chatbots based on Implicit User Profiles},
booktitle = {Proceedings of the {SIGIR} 2021},
publisher = {{ACM}},
year = {2021},
url = {https://doi.org/10.1145/3404835.3462828},
doi = {10.1145/3404835.3462828}}
- Learning Implicit User Profile for Personalized Retrieval-based Chatbot (CIKM 2021 Long Paper)
@inproceedings{qian2021impchat,
author = {Hongjin Qian and Zhicheng Dou and Yutao Zhu and Yueyuan Ma and Ji-Rong Wen},
title = {Learning Implicit User Profile for Personalized Retrieval-based Chatbot},
booktitle = {Proceedings of the {CIKM} 2021},
publisher = {{ACM}},
year = {2021},
url = {https://doi.org/10.1145/3459637.3482269},
doi = {10.1145/3459637.3482269}}
- Less is More: Learning to Refine Dialogue History for Personalized Dialogue Generation (NAACL 2022 Long Paper)
@inproceedings{zhong-etal-2022-less,
title = "Less is More: Learning to Refine Dialogue History for Personalized Dialogue Generation",
author = "Zhong, Hanxun and
Dou, Zhicheng and
Zhu, Yutao and
Qian, Hongjin and
Wen, Ji-Rong",
booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jul,
year = "2022",
address = "Seattle, United States",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.naacl-main.426",
doi = "10.18653/v1/2022.naacl-main.426",
pages = "5808--5820"
}
Dataset Statistics
The detailed statistics of Pchatbot are as follows:
Statistic | PchatbotW | PchatbotL | PchatbotW-1 | PchatbotL-1 |
---|---|---|---|---|
#Posts | 5,319,596 | 20,145,956 | 3,597,407 | 4,662,911 |
#Responses | 139,448,339 | 59,427,457 | 13,992,870 | 5,523,160 |
#Users in posts | 772,002 | 5,203,345 | 417,294 | 1,107,989 |
#Users in responses | 23,408,367 | 203,636 | 2,340,837 | 20,364 |
Avg.#responses per post | 26.214 | 2.950 | 3.890 | 1.184 |
Max.#responses per post | 525 | 120 | 136 | 26 |
#Words | 8,512,945,238 | 3,013,617,497 | 855,005,996 | 284,099,064 |
Avg.#words per pair | 61.047 | 51.014 | 61.103 | 51.438 |
We construct two standard datasets from Pchatbot, named PchatbotW-R and PchatbotW-G, for retrieval-based and generation-based tasks respectively. The two datasets can be used directly in the corresponding dialogue tasks. We plan to release these standard datasets. Their statistics are shown in the following table:
Statistic | PchatbotW-R | PchatbotW-G |
---|---|---|
Number of users | 420,000 | 300,000 |
Avg. history length | 32.3 | 11.4 |
Avg. length of post | 24.9 | 22.9 |
Avg. length of response | 10.1 | 9.6 |
Number of response candidates | 10 | - |
Number of training samples | 3,000,000 | 2,707,880 |
Number of validation samples | 600,000 | 600,000 |
Number of testing samples | 600,000 | 600,000 |
To obtain statistics, run:
python src/statistics.py
We plan to release standard datasets for PchatbotL later.
Data Content and Format
Obtain the data
For now, we provide download links via Baidu Cloud, which may be slow outside mainland China. We will add a Google Drive link as soon as possible.
Pchatbot-L:
md5: 48bd7ab93f625ebdf34c7254ff27ac2a
Pchatbot-W:
md5: cd443951973f47f5614df298e6e416da
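After downloading, it is worth checking the archive against the checksums listed above. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify` are hypothetical, and the archive path is whatever name your download has locally.

```python
import hashlib

# MD5 checksums as listed in this README.
EXPECTED_MD5 = {
    "Pchatbot-L": "48bd7ab93f625ebdf34c7254ff27ac2a",
    "Pchatbot-W": "cd443951973f47f5614df298e6e416da",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that very large archives do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, subset):
    """Return True if the file at `path` matches the checksum of `subset`
    ("Pchatbot-L" or "Pchatbot-W")."""
    return md5_of_file(path) == EXPECTED_MD5[subset]
```

For example, `verify("/path/to/your/download", "Pchatbot-W")` returns `True` only when the file is intact. A mismatch usually means the download was truncated or corrupted.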
If you cannot access Baidu Cloud Disk, contact us and we will try to provide other options.
Please fill in the application form and send it to the contact email. We will then send you the download links and the password for Baidu Cloud Disk. Note that the application form should be signed by the person in charge of your research group. We will update the download password regularly.
Pchatbot Files
The dataset is distributed as .tar.bz2 archives, which you can decompress as follows:
tar -jxvf xx.tar.bz2
Each record in the dataset is one line with the following format:
Post \t Post_user_id \t Post_timestamp \t Response \t Response_user_id \t Response_timestamp \n
Post and Response are word-segmented sentences whose tokens are separated by spaces. Several examples of the data are given in data/sample.txt.
We also give some examples of users' personalized information, as shown in the figure below. Due to space constraints, we selected only 5 historical records for the user in each example.
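The record format above can be read with a few lines of Python. This is a minimal sketch, not the repository's own loader: the `Record` type and the function names are hypothetical, timestamps are kept as raw strings because their encoding is not specified here, and grouping by the responder's ID is just one way to collect a user's history.

```python
from collections import defaultdict
from typing import List, NamedTuple


class Record(NamedTuple):
    post: List[str]            # word-segmented post tokens
    post_user_id: str
    post_timestamp: str        # kept as a raw string (encoding not assumed)
    response: List[str]        # word-segmented response tokens
    response_user_id: str
    response_timestamp: str


def parse_line(line):
    """Parse one tab-separated line into a Record."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated fields, got {len(fields)}")
    post, post_uid, post_ts, resp, resp_uid, resp_ts = fields
    return Record(post.split(), post_uid, post_ts, resp.split(), resp_uid, resp_ts)


def group_by_responder(records):
    """Collect records per response user ID, preserving input order."""
    histories = defaultdict(list)
    for record in records:
        histories[record.response_user_id].append(record)
    return histories
```

To try it on the sample file, something like `records = [parse_line(l) for l in open("data/sample.txt", encoding="utf-8")]` should work, assuming the file is UTF-8 encoded and follows the format above.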
PchatbotW.release_ver
Data Preprocessing
Instructions for data cleaning, preprocessing, aggregation, and dataset construction are in the ./src/ folder.
Baseline models
We provide results of baseline models on the PchatbotW-R and PchatbotW-G datasets. For evaluation details, please refer to our paper. We will continue to update the results of other baseline models:
PchatbotW-R
Model | R10@1 | R10@2 | R10@5 | MRR | nDCG | Paper | Code |
---|---|---|---|---|---|---|---|
Conv-KNRM | 0.323 | 0.520 | 0.893 | 0.538 | 0.818 | Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search | https://github.com/yunhenk/Conv-KNRM |
DAM | 0.438 | 0.644 | 0.966 | 0.635 | 0.881 | Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network | https://github.com/baidu/Dialogue |
IOI | 0.442 | 0.651 | 0.969 | 0.639 | 0.890 | One Time of Interaction May Not Be Enough: Go Deep with an Interaction-over-Interaction Network for Response Selection in Dialogues | https://github.com/chongyangtao/IOI |
RSM-DCK | 0.428 | 0.627 | 0.947 | 0.623 | 0.858 | Learning to Detect Relevant Contexts and Knowledge for Response Selection in Retrieval-based Dialogue Systems | Provided by the author |
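For readers unfamiliar with the retrieval metrics in this table, the sketch below shows the commonly used definitions of recall at k over n candidates (R10@k when n is 10) and mean reciprocal rank. It is an illustration only: the function names are hypothetical, and the exact evaluation procedure behind the reported numbers is described in the paper and may differ in details such as tie handling.

```python
def rank_labels(scores, labels):
    """Return the 0/1 labels reordered by descending model score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def recall_at_k(ranked_labels, k):
    """Fraction of the positive candidates that appear in the top k."""
    positives = sum(ranked_labels)
    if positives == 0:
        return 0.0
    return sum(ranked_labels[:k]) / positives


def reciprocal_rank(ranked_labels):
    """1 / rank of the first positive candidate, or 0.0 if there is none."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0


def evaluate(samples, ks=(1, 2, 5)):
    """Average the metrics over samples, each a (scores, labels) pair."""
    ranked = [rank_labels(scores, labels) for scores, labels in samples]
    results = {f"R@{k}": sum(recall_at_k(r, k) for r in ranked) / len(ranked)
               for k in ks}
    results["MRR"] = sum(reciprocal_rank(r) for r in ranked) / len(ranked)
    return results
```

Each sample here is one post with its candidate responses: `scores` holds the model's score for every candidate and `labels` marks which candidates are true responses.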
PchatbotW-G
Model | BLEU-1 | ROUGE-L | Dist-1 | Dist-2 | P-F1 | Paper | Code |
---|---|---|---|---|---|---|---|
Seq2Seq | 4.889 | 7.594 | 0.229 | 3.404 | 0.771 | Sequence to Sequence Learning with Neural Networks | https://github.com/IBM/pytorch-seq2seq |
SPEAKER | 3.958 | 5.580 | 0.951 | 29.780 | 1.534 | A Persona-Based Neural Conversation Model | \ |
PERSONAWAE | 1.945 | 9.064 | 0.523 | 8.549 | 6.408 | Modeling Personalization in Continuous Space for Response Generation via Augmented Wasserstein Autoencoders | \ |
DialoGPT | 5.038 | 7.358 | 13.995 | 52.674 | 3.562 | DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation | https://github.com/microsoft/DialoGPT |
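As a rough illustration of the diversity columns, Dist-n is commonly computed as the number of distinct n-grams divided by the total number of n-grams in the generated responses. The sketch below implements that corpus-level variant on word-segmented responses. It is an assumption that the table follows this definition, the function name is hypothetical, and the sketch returns a ratio between 0 and 1 without attempting to match the scale of the reported values.

```python
def distinct_n(responses, n):
    """Corpus-level Distinct-n: unique n-grams / total n-grams.

    `responses` is a list of token lists (word-segmented responses).
    """
    total = 0
    unique = set()
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0
```

Calling `distinct_n(generated, 1)` and `distinct_n(generated, 2)` on the same list of generated responses gives the unigram and bigram variants; higher values indicate less repetitive output.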
License
This repository is licensed under the Apache-2.0 License.
The Pchatbot dataset is licensed under CC BY-NC 2.0.