ChatGPT-RetrievalQA: Can ChatGPT's responses act as training data for Q&A retrieval models?
This is the repository for the papers "Generating Synthetic Documents for Cross-Encoder Re-Rankers: A Comparative Study of ChatGPT and Human Experts" and "A Test Collection of Synthetic Documents for Training Rankers: ChatGPT vs. Human Experts". It provides a dataset for training and evaluating Question Answering (QA) retrieval models on ChatGPT responses, with the option of training and evaluating on real human responses.
If you use this dataset, please cite it with the following BibTeX references:
@InProceedings{askari2023chatgptcikm2023,
author = {Askari, Arian and Aliannejadi, Mohammad and Kanoulas, Evangelos and Verberne, Suzan},
title = {A Test Collection of Synthetic Documents for Training Rankers: ChatGPT vs. Human Experts},
year = 2023,
booktitle = {The 32nd ACM International Conference on Information and Knowledge Management (CIKM 2023)},
}
@InProceedings{askari2023genirsigir2023,
author = {Askari, Arian and Aliannejadi, Mohammad and Kanoulas, Evangelos and Verberne, Suzan},
title = {Generating Synthetic Documents for Cross-Encoder Re-Rankers: A Comparative Study of ChatGPT and Human Experts},
year = 2023,
booktitle = {Generative Information Retrieval workshop at ACM SIGIR 2023},
}
This work has been done under the supervision of Prof. Mohammad Aliannejadi, Evangelos Kanoulas, and Suzan Verberne during my research visit at the Information Retrieval Lab of the University of Amsterdam (IRLab@UvA).
Summary of what we did
Given a set of questions and the corresponding ChatGPT and human responses, we build two separate collections: one of ChatGPT responses and one of human responses. By doing so, we provide several analysis opportunities from an information retrieval perspective regarding the usefulness of ChatGPT responses for training retrieval models. We provide the dataset for both end-to-end retrieval and a re-ranking setup. To give flexibility for other analyses, we organize all the files separately for ChatGPT and human responses.
Why rely on retrieval when ChatGPT can generate answers?
While ChatGPT is a powerful language model that can produce impressive answers, it is not immune to mistakes or hallucinations. Furthermore, the source of the information generated by ChatGPT is not transparent: usually no source is given for the generated information, even when it is correct. This is an even bigger concern in domains such as law, medicine, science, and other professional fields where trustworthiness and accountability are critical. Retrieval models, as opposed to generative models, retrieve actual information from existing sources, and search engines provide the source of each retrieved item. This is why information retrieval -- even when ChatGPT is available -- remains an important application, especially in situations where reliability is vital.
Answer ranking dataset
This dataset is based on the public HC3 dataset, although our experimental setup and evaluation are different. We split the data into train, validation, and test sets in order to train and evaluate answer retrieval models on ChatGPT or human answers. The actual response by the human or by ChatGPT is stored as the relevant answer; for training, a set of random responses can be used as non-relevant answers. In our main experiments, we train on ChatGPT responses and evaluate on human responses. We release the ChatGPT-RetrievalQA dataset in a format similar to the MS MARCO dataset, a popular dataset for training retrieval models, so existing MS MARCO scripts can be reused directly on our data; a minimal loading sketch follows the file table below.
Description | Filename | File size | Num Records | Format |
---|---|---|---|---|
Collection-H (H: Human Responses) | collection_h.tsv | 38.6 MB | 58,546 | tsv: pid, passage |
Collection-C (C: ChatGPT Responses) | collection_c.tsv | 26.1 MB | 26,882 | tsv: pid, passage |
Queries | queries.tsv | 4 MB | 24,322 | tsv: qid, query |
Qrels-H Train (Train set Qrels for Human Responses) | qrels_h_train.tsv | 724 KB | 40,406 | TREC qrels format |
Qrels-H Validation (Validation set Qrels for Human Responses) | qrels_h_valid.tsv | 29 KB | 1,460 | TREC qrels format |
Qrels-H Test (Test set Qrels for Human Responses) | qrels_h_test.tsv | 326 KB | 16,680 | TREC qrels format |
Qrels-C Train (Train set Qrels for ChatGPT Responses) | qrels_c_train.tsv | 339 KB | 18,452 | TREC qrels format |
Qrels-C Validation (Validation set Qrels for ChatGPT Responses) | qrels_c_valid.tsv | 13 KB | 672 | TREC qrels format |
Qrels-C Test (Test set Qrels for ChatGPT Responses) | qrels_c_test.tsv | 152 KB | 7,756 | TREC qrels format |
Queries, Answers, and Relevance Labels | collectionandqueries.zip | 23.9 MB | 866,504 | |
Train-H Triples | train_h_triples.tsv | 58.68 GB | 40,641,772 | tsv: query, positive passage, negative passage |
Validation-H Triples | valid_h_triples.tsv | 2.02 GB | 1,468,526 | tsv: query, positive passage, negative passage |
Train-H Triples QID PID Format | train_h_qidpidtriples.tsv | 921.7 MB | 40,641,772 | tsv: qid, positive pid, negative pid |
Validation-H Triples QID PID Format | valid_h_qidpidtriples.tsv | 35.6 MB | 1,468,526 | tsv: qid, positive pid, negative pid |
Train-C Triples | train_c_triples.tsv | 37.4 GB | 18,473,122 | tsv: query, positive passage, negative passage |
Validation-C Triples | valid_c_triples.tsv | 1.32 GB | 672,659 | tsv: query, positive passage, negative passage |
Train-C Triples QID PID Format | train_c_qidpidtriples.tsv | 429.6 MB | 18,473,122 | tsv: qid, positive pid, negative pid |
Validation-C Triples QID PID Format | valid_c_qidpidtriples.tsv | 16.4 MB | 672,659 | tsv: qid, positive pid, negative pid |
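To get started with these files, the sketch below loads the two-column TSV collections and queries plus a TREC-format qrels file into plain Python dictionaries. The helper names are ours, not part of the release, and the snippet assumes the files sit in the working directory.

```python
def load_tsv_dict(path):
    """Load a two-column TSV file (id <tab> text) into a dict."""
    data = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            key, value = line.rstrip("\n").split("\t", 1)
            data[key] = value
    return data

def load_qrels(path):
    """Load TREC-format qrels (qid, iteration, pid, relevance) into {qid: {pid: rel}}."""
    qrels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, _, pid, rel = line.split()
            qrels.setdefault(qid, {})[pid] = int(rel)
    return qrels

queries = load_tsv_dict("queries.tsv")            # qid -> question
collection_c = load_tsv_dict("collection_c.tsv")  # pid -> ChatGPT response
collection_h = load_tsv_dict("collection_h.tsv")  # pid -> human response
qrels_c_train = load_qrels("qrels_c_train.tsv")   # relevance labels for training on ChatGPT responses
```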
We release the training and validation data in triples format to facilitate training. The triples files for training on ChatGPT responses are "train_c_triples.tsv" and "valid_c_triples.tsv". We also release triples based on human responses ("train_h_triples.tsv" and "valid_h_triples.tsv"), so that anyone can compare training on ChatGPT responses with training on human responses. For each query and positive answer, 1,000 negative answers have been sampled randomly.
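As an illustration of how these files fit together, here is a minimal sketch (our own helper names, not the release scripts) that expands a qidpidtriples file into text triples using the dictionaries loaded above, plus a toy random negative sampler in the spirit of the 1,000-negatives-per-positive setup:

```python
import random

def iter_text_triples(path, queries, collection):
    """Yield (query, positive passage, negative passage) text triples
    from a qidpidtriples file (tsv: qid, positive pid, negative pid)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, pos_pid, neg_pid = line.rstrip("\n").split("\t")
            yield queries[qid], collection[pos_pid], collection[neg_pid]

def sample_negative_pids(positive_pid, collection, k=1000, seed=42):
    """Randomly draw k non-relevant candidate pids for one positive pid;
    this mirrors the general idea of the released triples, not the exact script."""
    rng = random.Random(seed)
    candidates = [pid for pid in collection if pid != positive_pid]
    return rng.sample(candidates, k)

# Example: iterate over the ChatGPT training triples.
for query, pos, neg in iter_text_triples("train_c_qidpidtriples.tsv", queries, collection_c):
    break  # feed (query, pos, neg) to your trainer instead
```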
Answer re-ranking dataset
Description | Filename | File size | Num Records |
---|---|---|---|
Top-H 1000 Train | top_1000_h_train.run | 646.6 MB | 16,774,122 |
Top-H 1000 Validation | top_1000_h_valid.run | 23.7 MB | 605,956 |
Top-H 1000 Test | top_1000_h_test.run | 270.6 MB | 6,920,845 |
Top-C 1000 Train | top_1000_c_train.run | 646.6 MB | 16,768,032 |
Top-C 1000 Validation | top_1000_c_valid.run | 23.7 MB | 605,793 |
Top-C 1000 Test | top_1000_c_test.run | 271.1 MB | 6,917,616 |
The run files of the answer re-ranking dataset are in TREC run format.
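Each line of a TREC run file has six whitespace-separated fields: qid, the literal Q0, pid, rank, score, and a run tag. A small parser (our own helper, assuming the run fits in memory):

```python
from collections import defaultdict

def load_trec_run(path):
    """Parse a TREC run file into {qid: [(pid, rank, score), ...]}, sorted by rank."""
    run = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, _q0, pid, rank, score, _tag = line.split()
            run[qid].append((pid, int(rank), float(score)))
    for qid in run:
        run[qid].sort(key=lambda entry: entry[1])
    return dict(run)

run_h_test = load_trec_run("top_1000_h_test.run")
```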
Note: We use BM25 in Elasticsearch as the first-stage ranker to retrieve the top-1000 documents for a given question (i.e., query). However, for some queries fewer than 1000 documents are retrieved, which means the collection contains fewer than 1000 documents that share at least one word with the query.
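If you want to build a comparable first-stage run yourself, a rough sketch with the Elasticsearch 8.x Python client is shown below. The index name ("collection_c") and field name ("passage") are assumptions for illustration, not the repository's actual indexing setup.

```python
from elasticsearch import Elasticsearch

# Assumes a local Elasticsearch instance in which the answer collection has been
# indexed under a text field named "passage"; BM25 is Elasticsearch's default similarity.
es = Elasticsearch("http://localhost:9200")

def bm25_top_1000(qid, query_text, index="collection_c", tag="bm25"):
    """Retrieve up to 1000 passages for one query and return TREC run lines."""
    resp = es.search(index=index, query={"match": {"passage": query_text}}, size=1000)
    return [
        f"{qid} Q0 {hit['_id']} {rank} {hit['_score']} {tag}"
        for rank, hit in enumerate(resp["hits"]["hits"], start=1)
    ]
```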
Analyzing the effectiveness of BM25 on human/ChatGPT responses
Coming soon.
<!--
| Questions Split | Response Writer | MAP@1000 | NDCG@10 | Recall@10 | Recall@100 | Recall@1000 |
|-----------------|-----------------|:--------:|:-------:|:---------:|:----------:|:-----------:|
| Test | Human | .143 | .184 | .212 | .359 | .520 |
| | ChatGPT | .370 | .396 | .526 | .823 | .898 |
| Train | Human | .158 | .202 | .236 | .392 | .560 |
| | ChatGPT | .413 | .443 | .577 | .834 | .903 |
| Validation | Human | .154 | .200 | .228 | .370 | .523 |
| | ChatGPT | .386 | .410 | .539 | .847 | .904 |

#### Code for the above evaluation: [ChatGPT-RetrievalQA-Evaluation](https://colab.research.google.com/drive/1ywQaXVcOFGod6WR3Rq4owIxblXmTK7dF?usp=sharing) [![](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ywQaXVcOFGod6WR3Rq4owIxblXmTK7dF?usp=sharing)
-->
BERT re-ranking effectiveness on the Qrels-H Test
We train BERT on the responses produced by ChatGPT (using the queries.tsv, collection_c.tsv, train_c_triples.tsv, valid_c_triples.tsv, qrels_c_train.tsv, and qrels_c_valid.tsv files). Next, we evaluate the effectiveness of BERT as an answer re-ranker on human responses (using queries.tsv, collection_h.tsv, top_1000_c_test.run, and qrels_h_test.tsv). By doing so, we answer the following question: "What is the effectiveness of an answer retrieval model that is trained on ChatGPT responses when we evaluate it on human responses?"
Coming soon.
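In the meantime, the sketch below outlines the re-ranking step itself. It uses an off-the-shelf MS MARCO cross-encoder from sentence-transformers as a stand-in for our BERT model trained on ChatGPT responses, and reuses the loaders defined earlier; swap in your own checkpoint once trained.

```python
from sentence_transformers import CrossEncoder

# Stand-in checkpoint; replace with a cross-encoder trained on train_c_triples.tsv.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512)

def rerank(run, queries, collection, top_k=1000):
    """Re-score first-stage candidates per query and return
    {qid: [(pid, score), ...]} sorted by the cross-encoder score."""
    reranked = {}
    for qid, candidates in run.items():
        pids = [pid for pid, _rank, _score in candidates[:top_k]]
        pairs = [(queries[qid], collection[pid]) for pid in pids]
        scores = model.predict(pairs)
        reranked[qid] = sorted(zip(pids, scores), key=lambda x: x[1], reverse=True)
    return reranked

# Re-rank the first-stage test run against the human response collection, as listed above.
reranked = rerank(load_trec_run("top_1000_c_test.run"), queries, collection_h)
```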
Collection of responses produced by other Large Language Models (LLMs)
Coming soon.
Code for creating the dataset
ChatGPT-RetrievalQA-Dataset-Creator
Dataset source and copyright
Special thanks to the HC3 team for releasing the Human ChatGPT Comparison Corpus (HC3). Our data is created based on their dataset and follows their license.