
ChatGPT-RetrievalQA: Can ChatGPT's responses act as training data for Q&A retrieval models?

This is the repository for the papers "Generating Synthetic Documents for Cross-Encoder Re-Rankers: A Comparative Study of ChatGPT and Human Experts" and "A Test Collection of Synthetic Documents for Training Rankers: ChatGPT vs. Human Experts". It provides a dataset for training and evaluating Question Answering (QA) retrieval models on ChatGPT responses, with the option of training/evaluating on real human responses.

If you use this dataset, please cite it with the following BibTeX entries:


@InProceedings{askari2023chatgptcikm2023,
  author = {Askari, Arian and Aliannejadi, Mohammad and Kanoulas, Evangelos and Verberne, Suzan},
  title = {A Test Collection of Synthetic Documents for Training Rankers: ChatGPT vs. Human Experts},
  year = 2023,
  booktitle = {The 32nd ACM International Conference on Information and Knowledge Management (CIKM 2023)},
}

@InProceedings{askari2023genirsigir2023,
  author = {Askari, Arian and Aliannejadi, Mohammad and Kanoulas, Evangelos and Verberne, Suzan},
  title = {Generating Synthetic Documents for Cross-Encoder Re-Rankers: A Comparative Study of ChatGPT and Human Experts},
  year = 2023,
  booktitle = {Generative Information Retrieval workshop at ACM SIGIR 2023},
}

This work was done under the supervision of Prof. Mohammad Aliannejadi, Evangelos Kanoulas, and Suzan Verberne during my research visit to the Information Retrieval Lab at the University of Amsterdam (IRLab@UvA).

Summary of what we did

Given a set of questions and the corresponding ChatGPT and human responses, we build two separate collections: one of ChatGPT responses and one of human responses. This provides several analysis opportunities from an information retrieval perspective regarding the usefulness of ChatGPT responses for training retrieval models. We provide the dataset for both an end-to-end retrieval and a re-ranking setup. To allow flexibility for other analyses, we organize all files separately for ChatGPT and human responses.

Why rely on retrieval when ChatGPT can generate answers?

While ChatGPT is a powerful language model that can produce impressive answers, it is not immune to mistakes or hallucinations. Furthermore, the source of the information generated by ChatGPT is not transparent, and usually no source is given for the generated information even when it is correct. This is a particular concern in domains such as law, medicine, science, and other professional fields where trustworthiness and accountability are critical. Retrieval models, as opposed to generative models, retrieve actual information from existing sources, and search engines provide the source of each retrieved item. This is why information retrieval, even when ChatGPT is available, remains an important application, especially in situations where reliability is vital.

Answer ranking dataset

This dataset is based on the public HC3 dataset, although our experimental setup and evaluation are different. We split the data into train, validation, and test sets in order to train/evaluate answer retrieval models on ChatGPT or human answers. We store the actual response given by the human/ChatGPT as the relevant answer. For training, a set of random responses can be used as non-relevant answers. In our main experiments, we train on ChatGPT responses and evaluate on human responses. We release the ChatGPT-RetrievalQA dataset in a format similar to the MS MARCO dataset, a popular dataset for training retrieval models, so existing MS MARCO scripts can be reused on our data.

| Description | Filename | File size | Num Records | Format |
|-------------|----------|-----------|-------------|--------|
| Collection-H (H: Human Responses) | collection_h.tsv | 38.6 MB | 58,546 | tsv: pid, passage |
| Collection-C (C: ChatGPT Responses) | collection_c.tsv | 26.1 MB | 26,882 | tsv: pid, passage |
| Queries | queries.tsv | 4 MB | 24,322 | tsv: qid, query |
| Qrels-H Train (Train set Qrels for Human Responses) | qrels_h_train.tsv | 724 KB | 40,406 | TREC qrels format |
| Qrels-H Validation (Validation set Qrels for Human Responses) | qrels_h_valid.tsv | 29 KB | 1,460 | TREC qrels format |
| Qrels-H Test (Test set Qrels for Human Responses) | qrels_h_test.tsv | 326 KB | 16,680 | TREC qrels format |
| Qrels-C Train (Train set Qrels for ChatGPT Responses) | qrels_c_train.tsv | 339 KB | 18,452 | TREC qrels format |
| Qrels-C Validation (Validation set Qrels for ChatGPT Responses) | qrels_c_valid.tsv | 13 KB | 672 | TREC qrels format |
| Qrels-C Test (Test set Qrels for ChatGPT Responses) | qrels_c_test.tsv | 152 KB | 7,756 | TREC qrels format |
| Queries, Answers, and Relevance Labels | collectionandqueries.zip | 23.9 MB | 866,504 | |
| Train-H Triples | train_h_triples.tsv | 58.68 GB | 40,641,772 | tsv: query, positive passage, negative passage |
| Validation-H Triples | valid_h_triples.tsv | 2.02 GB | 1,468,526 | tsv: query, positive passage, negative passage |
| Train-H Triples QID PID Format | train_h_qidpidtriples.tsv | 921.7 MB | 40,641,772 | tsv: qid, positive pid, negative pid |
| Validation-H Triples QID PID Format | valid_h_qidpidtriples.tsv | 35.6 MB | 1,468,526 | tsv: qid, positive pid, negative pid |
| Train-C Triples | train_c_triples.tsv | 37.4 GB | 18,473,122 | tsv: query, positive passage, negative passage |
| Validation-C Triples | valid_c_triples.tsv | 1.32 GB | 672,659 | tsv: query, positive passage, negative passage |
| Train-C Triples QID PID Format | train_c_qidpidtriples.tsv | 429.6 MB | 18,473,122 | tsv: qid, positive pid, negative pid |
| Validation-C Triples QID PID Format | valid_c_qidpidtriples.tsv | 16.4 MB | 672,659 | tsv: qid, positive pid, negative pid |
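For reference, the sketch below shows one way to load the two-column TSV files and the qrels into Python. The file paths and the exact qrels column order (qid, unused field, pid, relevance, as in MS MARCO-style qrels) are assumptions based on the table above, not a verified specification of the release.

```python
def load_id_text_tsv(path):
    """Load a two-column TSV file (id <tab> text) into a dict."""
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc_id, text = line.rstrip("\n").split("\t", 1)
            mapping[doc_id] = text
    return mapping

queries = load_id_text_tsv("queries.tsv")             # qid -> question text
collection_h = load_id_text_tsv("collection_h.tsv")   # pid -> human answer
collection_c = load_id_text_tsv("collection_c.tsv")   # pid -> ChatGPT answer

# Assumed TREC-style qrels layout: qid <unused> pid relevance (whitespace-separated)
qrels_c_train = {}
with open("qrels_c_train.tsv", encoding="utf-8") as f:
    for line in f:
        qid, _, pid, rel = line.split()
        if int(rel) > 0:
            qrels_c_train.setdefault(qid, set()).add(pid)
```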

We release the training and validation data in triples format to facilitate training. The triples files for training on ChatGPT responses are "train_c_triples.tsv" and "valid_c_triples.tsv". We also release triples based on human responses ("train_h_triples.tsv" and "valid_h_triples.tsv"), so that training on ChatGPT responses can be compared with training on human responses. For each query and its positive answer, 1000 negative answers were sampled randomly.
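Because the triples files are tens of gigabytes, it is more practical to stream them than to load them into memory. A minimal sketch, assuming the column order given in the table above (query, positive passage, negative passage):

```python
def iter_triples(path):
    """Yield (query, positive_passage, negative_passage) tuples from a triples TSV."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            query, positive, negative = line.rstrip("\n").split("\t")
            yield query, positive, negative

# Example: inspect the first three training triples for the ChatGPT collection.
for i, (query, pos, neg) in enumerate(iter_triples("train_c_triples.tsv")):
    print(query[:60], "|", pos[:60], "|", neg[:60])
    if i == 2:
        break
```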

Answer re-ranking dataset

| Description | Filename | File size | Num Records |
|-------------|----------|-----------|-------------|
| Top-H 1000 Train | top_1000_h_train.run | 646.6 MB | 16,774,122 |
| Top-H 1000 Validation | top_1000_h_valid.run | 23.7 MB | 605,956 |
| Top-H 1000 Test | top_1000_h_test.run | 270.6 MB | 6,920,845 |
| Top-C 1000 Train | top_1000_c_train.run | 646.6 MB | 16,768,032 |
| Top-C 1000 Validation | top_1000_c_valid.run | 23.7 MB | 605,793 |
| Top-C 1000 Test | top_1000_c_test.run | 271.1 MB | 6,917,616 |

The run files of the answer re-ranking dataset are in TREC run format.

Note: We use BM25 in Elasticsearch as the first-stage ranker to retrieve the top-1000 documents for each question (i.e., query). However, for some queries fewer than 1000 documents are retrieved, which means the collection contains fewer than 1000 documents that match at least one word of the query.
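A minimal sketch of reading one of these run files, assuming the standard six-column TREC run layout (qid Q0 pid rank score run_tag):

```python
from collections import defaultdict

def load_trec_run(path):
    """Load a TREC run file into {qid: [(pid, score), ...]} in rank order."""
    run = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, _q0, pid, _rank, score, _tag = line.split()
            run[qid].append((pid, float(score)))
    return run

top1000_h_test = load_trec_run("top_1000_h_test.run")  # BM25 candidates to re-rank
```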

Analyzing the effectiveness of BM25 on human/ChatGPT responses

Coming soon.

<!--
| Questions Split | Response Writer | MAP@1000 | NDCG@10 | Recall@10 | Recall@100 | Recall@1000 |
|-----------------|-----------------|:--------:|:-------:|:---------:|:----------:|:-----------:|
| Test | Human | .143 | .184 | .212 | .359 | .520 |
| | ChatGPT | .370 | .396 | .526 | .823 | .898 |
| Train | Human | .158 | .202 | .236 | .392 | .560 |
| | ChatGPT | .413 | .443 | .577 | .834 | .903 |
| Validation | Human | .154 | .200 | .228 | .370 | .523 |
| | ChatGPT | .386 | .410 | .539 | .847 | .904 |

#### Code for the above evaluation: [ChatGPT-RetrievalQA-Evaluation](https://colab.research.google.com/drive/1ywQaXVcOFGod6WR3Rq4owIxblXmTK7dF?usp=sharing) [![](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ywQaXVcOFGod6WR3Rq4owIxblXmTK7dF?usp=sharing)
-->

BERT re-ranking effectiveness on the Qrels-H Test

We train BERT on the responses produced by ChatGPT (using the queries.tsv, collection_c.tsv, train_c_triples.tsv, valid_c_triples.tsv, qrels_c_train.tsv, and qrels_c_valid.tsv files). Next, we evaluate the effectiveness of BERT as an answer re-ranker on human responses (using queries.tsv, collection_h.tsv, top_1000_c_test.run, and qrels_h_test.tsv). By doing so, we answer the following question: "What is the effectiveness of an answer retrieval model that is trained on ChatGPT responses, when we evaluate it on human responses?"
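As an illustration of the re-ranking step, the sketch below re-scores BM25 candidates for a single query with a cross-encoder. The use of the sentence-transformers CrossEncoder class and the off-the-shelf MS MARCO checkpoint are assumptions for illustration only, not the exact training setup used in the papers.

```python
from sentence_transformers import CrossEncoder

# Assumption: an off-the-shelf MS MARCO cross-encoder stands in for a model
# fine-tuned on ChatGPT responses; replace it with your own trained checkpoint.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512)

def rerank(query, candidate_pids, collection, top_k=10):
    """Score (query, passage) pairs and return the top_k passages by model score."""
    pairs = [(query, collection[pid]) for pid in candidate_pids]
    scores = model.predict(pairs)
    ranked = sorted(zip(candidate_pids, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

# Usage with the loading sketches above: re-rank the BM25 candidates of one test
# query against the human collection and compare the result with qrels_h_test.
# qid = next(iter(top1000_h_test))
# print(rerank(queries[qid], [pid for pid, _ in top1000_h_test[qid]], collection_h))
```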

Coming soon.

Collection of responses produced by other Large Language Models (LLMs)

Coming soon.

Code for creating the dataset

ChatGPT-RetrievalQA-Dataset-Creator

Dataset source and copyright

Special thanks to the HC3 team for releasing the Human ChatGPT Comparison Corpus (HC3). Our data is created based on their dataset and follows their license.