<img src="figure/icon.png" width="8%" /> Pchatbot: A Large-Scale Dataset for Personalized Chatbot

Introduction

We introduce Pchatbot, a large-scale conversation dataset dedicated to the development of personalized dialogue models. In this dataset, we assign anonymized user IDs and timestamps to conversations, so users' dialogue histories can be retrieved and used to build rich user profiles. With dialogue histories available, we can move from personality-based models to personalized models.

Pchatbot has two subsets, named PchatbotW and PchatbotL, built from open-domain Weibo and judicial forums, respectively. Because each subset is very large, we further divide each one into 10 equal parts by number of users, named PchatbotW-i and PchatbotL-i.

The dataset paper was accepted to SIGIR 2021 (Resource Track). See the paper for more details.

Citation

If you use the dataset in your work, please cite:

@inproceedings{qian2021pchatbot,
     author = {Hongjin Qian and Xiaohe Li and Hanxun Zhong and Yu Guo and Yueyuan Ma and Yutao Zhu and Zhanliang Liu and Zhicheng Dou and Ji-Rong Wen}, 
     title = {Pchatbot: A Large-Scale Dataset for Personalized Chatbot}, 
     booktitle = {Proceedings of the {SIGIR} 2021}, 
     publisher = {{ACM}}, 
     year = {2021}, 
     url = {https://doi.org/10.1145/3404835.3463239}, 
     doi = {10.1145/3404835.3463239}}

The following papers use the Pchatbot dataset:

  1. One Chatbot Per Person: Creating Personalized Chatbots based on Implicit User Profiles (SIGIR 2021 Long Paper)
@inproceedings{DBLP:conf/sigir/madousigir21,
     author = {Zhengyi Ma and Zhicheng Dou and Yutao Zhu and Hanxun Zhong and Ji-Rong Wen}, 
     title = {One Chatbot Per Person: Creating Personalized Chatbots based on Implicit User Profiles}, 
     booktitle = {Proceedings of the {SIGIR} 2021}, 
     publisher = {{ACM}}, 
     year = {2021}, 
     url = {https://doi.org/10.1145/3404835.3462828}, 
     doi = {10.1145/3404835.3462828}}
  2. Learning Implicit User Profile for Personalized Retrieval-based Chatbot (CIKM 2021 Long Paper)
@inproceedings{qian2021impchat,
     author = {Hongjin Qian and Zhicheng Dou and Yutao Zhu and Yueyuan Ma and Ji-Rong Wen}, 
     title = {Learning Implicit User Profile for Personalized Retrieval-based Chatbot}, 
     booktitle = {Proceedings of the {CIKM} 2021}, 
     publisher = {{ACM}}, 
     year = {2021},
     url = {https://doi.org/10.1145/3459637.3482269},
     doi = {10.1145/3459637.3482269}}
  3. Less is More: Learning to Refine Dialogue History for Personalized Dialogue Generation (NAACL 2022 Long Paper)
@inproceedings{zhong-etal-2022-less,
    title = "Less is More: Learning to Refine Dialogue History for Personalized Dialogue Generation",
    author = "Zhong, Hanxun  and
      Dou, Zhicheng  and
      Zhu, Yutao  and
      Qian, Hongjin  and
      Wen, Ji-Rong",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.426",
    doi = "10.18653/v1/2022.naacl-main.426",
    pages = "5808--5820"

Dataset Statistics

The detailed statistics of Pchatbot are as follows:

|                            | PchatbotW     | PchatbotL     | PchatbotW-1 | PchatbotL-1 |
| -------------------------- | ------------- | ------------- | ----------- | ----------- |
| #Posts                     | 5,319,596     | 20,145,956    | 3,597,407   | 4,662,911   |
| #Responses                 | 139,448,339   | 59,427,457    | 13,992,870  | 5,523,160   |
| #Users in posts            | 772,002       | 5,203,345     | 417,294     | 1,107,989   |
| #Users in responses        | 23,408,367    | 203,636       | 2,340,837   | 20,364      |
| Avg. #responses per post   | 26.214        | 2.950         | 3.890       | 1.184       |
| Max. #responses per post   | 525           | 120           | 136         | 26          |
| #Words                     | 8,512,945,238 | 3,013,617,497 | 855,005,996 | 284,099,064 |
| Avg. #words per pair       | 61.047        | 51.014        | 61.103      | 51.438      |

We construct two standard datasets from Pchatbot for retrieval-based and generation-based tasks, named PchatbotW-R and PchatbotW-G, respectively. These datasets can be used directly in the corresponding dialogue tasks and will be released as well. Their statistics are shown in the following table:

|                               | PchatbotW-R | PchatbotW-G |
| ----------------------------- | ----------- | ----------- |
| Number of users               | 420,000     | 300,000     |
| Avg. history length           | 32.3        | 11.4        |
| Avg. length of post           | 24.9        | 22.9        |
| Avg. length of response       | 10.1        | 9.6         |
| Number of response candidates | 10          | -           |
| Number of training samples    | 3,000,000   | 2,707,880   |
| Number of validation samples  | 600,000     | 600,000     |
| Number of testing samples     | 600,000     | 600,000     |

To obtain statistics, run:

python src/statistics.py

Standard datasets for PchatbotL will be released later.

Data Content and Format

Obtain the data

For now, we provide download links via Baidu Cloud, which may be slow outside mainland China. We will add a Google Drive link as soon as possible.

Pchatbot-L:

md5: 48bd7ab93f625ebdf34c7254ff27ac2a

Pchatbot-W:

md5: cd443951973f47f5614df298e6e416da

If you cannot access Baidu Cloud Disk, contact us and we will try to provide other options.

Please fill in the application form and send it to the contact email; we will then send you the download links and the password for Baidu Cloud Disk. Note that the application form should be signed by the person in charge of your research group. We update the download password regularly.

Application Form

Pchatbot Files

The dataset is uploaded in .tar.bz2 format; you can decompress it as follows:

tar -jxvf xx.tar.bz2

The format of each line in the dataset is:

Post \t Post_user_id \t Post_timestamp \t Response \t Response_user_id \t Response_timestamp \n

Post and response are word-segmented sentences, with tokens separated by spaces. We provide several examples of the data in data/sample.txt.
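As a quick sanity check, here is a minimal reading sketch, assuming the six tab-separated fields listed above (the helper name and field names are just for illustration):

```python
# Minimal sketch for reading Pchatbot pairs; assumes the 6-field, tab-separated format above.
from collections import namedtuple

Pair = namedtuple("Pair", "post post_uid post_ts response resp_uid resp_ts")

def read_pairs(path):
    """Yield one Pair per well-formed line of a Pchatbot data file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 6:              # skip malformed lines
                yield Pair(*fields)

for pair in read_pairs("data/sample.txt"):
    # post/response are already word-segmented, so splitting on spaces gives tokens
    print(pair.resp_uid, pair.resp_ts, pair.response.split()[:5])
```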

We also give some examples of users' personalized information (from PchatbotW.release_ver), as shown in the tables below. Due to space constraints, we only selected 5 historical records for the user in each example.

<table>
  <tr><td>Post</td><td colspan="2">酒酿 小 圆子 窝蛋 , 蒸 南瓜 玉米 和 阳光 玫瑰 山寨 一 把 芳婆 的 酒酿 圆子 , 挺 好吃 的 , 加 了 点干 桂花 增香</td></tr>
  <tr><td>Response</td><td colspan="2">干 桂花 是 点睛</td></tr>
  <tr><td rowspan="6">History</td><td>History Post</td><td>History Response</td></tr>
  <tr><td>今日 晚餐 黄焖鸡 , 红烧 带鱼 和 丝瓜蛋 汤 淘鲜达 送来 的 带鱼 不 好 , 说 是 中段 , 实际 是 前段 和 尾巴 , 没 多少 肉 都 懒得 拍 。 黄焖鸡 太 下饭 啦 , 和 家属 都 添 了 小 半 碗 米饭 。 下午 做 的 巧克力 冰淇淋 , 味道 棒棒 哒</td><td>烦烦 和 光光 就是 永远 都 吃 不 胖 的 神仙 体质</td></tr>
  <tr><td>因 為荔 枝樹 不是 每年 都 能 結果 , 不是 每年 都 能 吃到 , 但 卻是 每年 夏天 我 最 期待 的 水果 , 期待 的 童年味 , 在 河邊 玩耍 , 在 樹下 等 荔枝 的 夏日 。</td><td>一定 要 有 机会 了 去 南方 看看 荔枝树 的 样子</td></tr>
  <tr><td>用 喜欢 的 餐具 穿 舒适 的 衣裙 吃 简单 可口 的 食物 这些 小快乐 足以 点亮 平淡 的 生活 餐具 白裙子 by</td><td>穿 搭博 主好 美 呀</td></tr>
  <tr><td>柠檬 冰淇淋 搞定 ! 还有 强行 出镜 的 柠檬 扇子 广告 , 这么 尬 为啥 还要 发 呢 ? 因为 那个 抠门 的 家伙 给 我 的 寄 了 一 箱子 芒果 , 所谓 拿人 手短 吃 人 嘴软 , 希望 对方 也 有 这样 的 觉悟</td><td>柠檬 盘子 也 很 好看</td></tr>
  <tr><td>天天 和 徐大 美丽 混一 起 。</td><td>这个 蘑菇 看 起来 特别 好吃</td></tr>
</table>

<table>
  <tr><td>Post</td><td colspan="2">woj : 考辛斯 寻求 一 份 年薪 在 1200-1800万 的 合同 。 但是 现在 甚至 没有 球队 愿意 给 他 一 份 中产 合同</td></tr>
  <tr><td>Response</td><td colspan="2">200万 湖人 要 了</td></tr>
  <tr><td rowspan="6">History</td><td>History Post</td><td>History Response</td></tr>
  <tr><td>别 问 我 支持 火箭 还是 勇士 了 我 支持 小卡 凌晨 4点 在 洛杉矶 跑步 、 被 一个 老外 拖进 篮球场 、 教 了 几 个 小时 后 仰跳 投</td><td>我 怀疑 你 在 开车 , 但是 我 没有 证据</td></tr>
  <tr><td>消息 : 鹈鹕 本来 对 湖人 之前 给 的 筹码 很 心动 , 但是 现在 莺歌 的 病情 改变 了 一切</td><td>我 谢谢 您 嘞 , 去去去 快去 换季 后 赛塔图姆 吧</td></tr>
  <tr><td>小卡 会 成为 三连冠 终结者 王朝 毁灭者 吗 ??</td><td>哈哈 职业 阻止 三 连 冠</td></tr>
  <tr><td>水花 兄弟 这 两 位 , 场下 真的 暖 , 场上 关键 时刻 真的 硬 作为 两队 的 中立 球迷 , 这 场 比赛 给 我 看 的 热血 沸腾 了 , 火箭 最后 也 一直 坚挺 着 , 真的 精彩 , 真的</td><td>勇士 火箭 都 不 喜欢 , 甚至 有点 讨厌 , 但是 今天 这 场 比赛 , 确实 勇士 更 值得 赢</td></tr>
  <tr><td>大家 觉得 猛龙 和 雄鹿 谁 最 有 可能 进入 到 总决赛 ?</td><td>范乔丹 : 看 老子 心情 吧</td></tr>
</table>
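Since each response carries an anonymized user ID and a timestamp, a user's dialogue history (like the examples above) can be assembled by grouping pairs by Response_user_id and ordering them by Response_timestamp. A minimal sketch, reusing the illustrative read_pairs helper from the previous snippet:

```python
from collections import defaultdict

def build_histories(path):
    """Group (post, response) pairs by the responding user and sort each group by time."""
    histories = defaultdict(list)
    for pair in read_pairs(path):
        histories[pair.resp_uid].append(pair)
    for uid in histories:
        # assumes timestamps use a fixed-width format, so string order matches time order
        histories[uid].sort(key=lambda p: p.resp_ts)
    return histories

histories = build_histories("data/sample.txt")
uid, pairs = next(iter(histories.items()))
for pair in pairs[-5:]:                      # e.g. the user's 5 most recent pairs
    print(pair.post, "=>", pair.response)
```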

Data Preprocessing

Instructions for data cleaning, preprocessing, aggregation, and dataset construction are in the ./src/ folder.

Baseline models

We provide results of baseline models on the PchatbotW-R and PchatbotW-G datasets. For evaluation details, please refer to our paper. We will continue to update the results of other baseline models:

PchatbotW-R

| Model     | R10@1 | R10@2 | R10@5 | MRR   | nDCG  | Paper | Code |
| --------- | ----- | ----- | ----- | ----- | ----- | ----- | ---- |
| Conv-KNRM | 0.323 | 0.520 | 0.893 | 0.538 | 0.818 | Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search | https://github.com/yunhenk/Conv-KNRM |
| DAM       | 0.438 | 0.644 | 0.966 | 0.635 | 0.881 | Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network | https://github.com/baidu/Dialogue |
| IOI       | 0.442 | 0.651 | 0.969 | 0.639 | 0.890 | One Time of Interaction May Not Be Enough: Go Deep with an Interaction-over-Interaction Network for Response Selection in Dialogues | https://github.com/chongyangtao/IOI |
| RSM-DCK   | 0.428 | 0.627 | 0.947 | 0.623 | 0.858 | Learning to Detect Relevant Contexts and Knowledge for Response Selection in Retrieval-based Dialogue Systems | Provided by the author |
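For reference, R10@k and MRR in this table are standard ranking metrics over the 10 response candidates per test context. The sketch below shows one common way to compute them from the rank of the ground-truth response; it is not necessarily the exact evaluation script used for the paper:

```python
def metrics_from_ranks(ranks):
    """ranks: 1-based rank of the ground-truth response among the 10 candidates, per context."""
    n = len(ranks)
    recall_at = lambda k: sum(1 for r in ranks if r <= k) / n
    return {
        "R10@1": recall_at(1),
        "R10@2": recall_at(2),
        "R10@5": recall_at(5),
        "MRR": sum(1.0 / r for r in ranks) / n,
    }

print(metrics_from_ranks([1, 3, 2, 6]))   # toy example
```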

PchatbotW-G

| Model      | BLEU-1 | ROUGE-L | Dist-1 | Dist-2 | P-F1  | Paper | Code |
| ---------- | ------ | ------- | ------ | ------ | ----- | ----- | ---- |
| Seq2Seq    | 4.889  | 7.594   | 0.229  | 3.404  | 0.771 | Sequence to Sequence Learning with Neural Networks | https://github.com/IBM/pytorch-seq2seq |
| SPEAKER    | 3.958  | 5.580   | 0.951  | 29.780 | 1.534 | A Persona-Based Neural Conversation Model | - |
| PERSONAWAE | 1.945  | 9.064   | 0.523  | 8.549  | 6.408 | Modeling Personalization in Continuous Space for Response Generation via Augmented Wasserstein Autoencoders | - |
| DialoGPT   | 5.038  | 7.358   | 13.995 | 52.674 | 3.562 | DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation | https://github.com/microsoft/DialoGPT |
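Similarly, Dist-1/Dist-2 are commonly defined as the ratio of distinct unigrams/bigrams to the total number of generated unigrams/bigrams. The following is a minimal sketch of that common definition (the evaluation script used for the paper may differ, e.g. in scaling):

```python
def distinct_n(responses, n):
    """Dist-n: number of distinct n-grams divided by total n-grams over all generated responses."""
    total, seen = 0, set()
    for tokens in responses:
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        seen.update(ngrams)
    return len(seen) / total if total else 0.0

# toy example with two already-tokenized responses
hyps = [["干", "桂花", "是", "点睛"], ["柠檬", "盘子", "也", "很", "好看"]]
print(distinct_n(hyps, 1), distinct_n(hyps, 2))
```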

License

This repository is licensed under the Apache-2.0 License.

The Pchatbot dataset is licensed under CC BY-NC 2.0.

FAQ