
ChatGPT-RetrievalQA: Can ChatGPT's responses act as training data for Q&A retrieval models?

This is the repository for the papers "Generating Synthetic Documents for Cross-Encoder Re-Rankers: A Comparative Study of ChatGPT and Human Experts" and "A Test Collection of Synthetic Documents for Training Rankers: ChatGPT vs. Human Experts". It provides a dataset for training and evaluating Question Answering (QA) retrieval models on ChatGPT responses, with the option of training/evaluating on real human responses.

If you use this dataset, please cite it with the following BibTeX entries:


@InProceedings{askari2023chatgptcikm2023,
  author = {Askari, Arian and Aliannejadi, Mohammad and Kanoulas, Evangelos and Verberne, Suzan},
  title = {A Test Collection of Synthetic Documents for Training Rankers: ChatGPT vs. Human Experts},
  year = 2023,
  booktitle = {The 32nd ACM International Conference on Information and Knowledge Management (CIKM 2023)},
}

@InProceedings{askari2023genirsigir2023,
  author = {Askari, Arian and Aliannejadi, Mohammad and Kanoulas, Evangelos and Verberne, Suzan},
  title = {Generating Synthetic Documents for Cross-Encoder Re-Rankers: A Comparative Study of ChatGPT and Human Experts},
  year = 2023,
  booktitle = {Generative Information Retrieval workshop at ACM SIGIR 2023},
}

This work was done under the supervision of Prof. Mohammad Aliannejadi, Evangelos Kanoulas, and Suzan Verberne during my research visit to the Information Retrieval Lab at the University of Amsterdam (IRLab@UvA).

Summary of what we did

Given a set of questions and the corresponding ChatGPT and human responses, we build two separate collections: one of ChatGPT responses and one of human responses. This provides several analysis opportunities from an information retrieval perspective regarding the usefulness of ChatGPT responses for training retrieval models. We provide the dataset for both an end-to-end retrieval and a re-ranking setup. To allow flexibility for other analyses, we organize all files separately for ChatGPT and human responses.

Why rely on retrieval when ChatGPT can generate answers?

While ChatGPT is a powerful language model that can produce impressive answers, it is not immune to mistakes or hallucinations. Furthermore, the source of the information generated by ChatGPT is not transparent, and usually no source is given for the generated information even when it is correct. This is a particular concern in domains such as law, medicine, science, and other professional fields where trustworthiness and accountability are critical. Retrieval models, as opposed to generative models, retrieve actual information from existing sources, and search engines provide the source of each retrieved item. This is why information retrieval, even when ChatGPT is available, remains an important application, especially in situations where reliability is vital.

Answer ranking dataset

This dataset is based on the public HC3 dataset, although our experimental setup and evaluation are different. We split the data into train, validation, and test sets in order to train/evaluate answer retrieval models on ChatGPT or human answers. We store the actual response given by the human/ChatGPT as the relevant answer. For training, a set of random responses can be used as non-relevant answers. In our main experiments, we train on ChatGPT responses and evaluate on human responses. We release the ChatGPT-RetrievalQA dataset in a format similar to the MS MARCO dataset, a popular dataset for training retrieval models, so existing MS MARCO scripts can be reused on our data.

| Description | Filename | File size | Num Records | Format |
|-------------|----------|-----------|-------------|--------|
| Collection-H (H: Human Responses) | collection_h.tsv | 38.6 MB | 58,546 | tsv: pid, passage |
| Collection-C (C: ChatGPT Responses) | collection_c.tsv | 26.1 MB | 26,882 | tsv: pid, passage |
| Queries | queries.tsv | 4 MB | 24,322 | tsv: qid, query |
| Qrels-H Train (Train set Qrels for Human Responses) | qrels_h_train.tsv | 724 KB | 40,406 | TREC qrels format |
| Qrels-H Validation (Validation set Qrels for Human Responses) | qrels_h_valid.tsv | 29 KB | 1,460 | TREC qrels format |
| Qrels-H Test (Test set Qrels for Human Responses) | qrels_h_test.tsv | 326 KB | 16,680 | TREC qrels format |
| Qrels-C Train (Train set Qrels for ChatGPT Responses) | qrels_c_train.tsv | 339 KB | 18,452 | TREC qrels format |
| Qrels-C Validation (Validation set Qrels for ChatGPT Responses) | qrels_c_valid.tsv | 13 KB | 672 | TREC qrels format |
| Qrels-C Test (Test set Qrels for ChatGPT Responses) | qrels_c_test.tsv | 152 KB | 7,756 | TREC qrels format |
| Queries, Answers, and Relevance Labels | collectionandqueries.zip | 23.9 MB | 866,504 | |
| Train-H Triples | train_h_triples.tsv | 58.68 GB | 40,641,772 | tsv: query, positive passage, negative passage |
| Validation-H Triples | valid_h_triples.tsv | 2.02 GB | 1,468,526 | tsv: query, positive passage, negative passage |
| Train-H Triples QID PID Format | train_h_qidpidtriples.tsv | 921.7 MB | 40,641,772 | tsv: qid, positive pid, negative pid |
| Validation-H Triples QID PID Format | valid_h_qidpidtriples.tsv | 35.6 MB | 1,468,526 | tsv: qid, positive pid, negative pid |
| Train-C Triples | train_c_triples.tsv | 37.4 GB | 18,473,122 | tsv: query, positive passage, negative passage |
| Validation-C Triples | valid_c_triples.tsv | 1.32 GB | 672,659 | tsv: query, positive passage, negative passage |
| Train-C Triples QID PID Format | train_c_qidpidtriples.tsv | 429.6 MB | 18,473,122 | tsv: qid, positive pid, negative pid |
| Validation-C Triples QID PID Format | valid_c_qidpidtriples.tsv | 16.4 MB | 672,659 | tsv: qid, positive pid, negative pid |
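For reference, the sketch below shows one way to load the two-column TSV files and the qrels into Python. The file paths and the exact qrels column order (qid, unused field, pid, relevance, as in MS MARCO-style qrels) are assumptions based on the table above, not a verified specification of the release.

```python
def load_id_text_tsv(path):
    """Load a two-column TSV file (id <tab> text) into a dict."""
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc_id, text = line.rstrip("\n").split("\t", 1)
            mapping[doc_id] = text
    return mapping

queries = load_id_text_tsv("queries.tsv")             # qid -> question text
collection_h = load_id_text_tsv("collection_h.tsv")   # pid -> human answer
collection_c = load_id_text_tsv("collection_c.tsv")   # pid -> ChatGPT answer

# Assumed TREC-style qrels layout: qid <unused> pid relevance (whitespace-separated)
qrels_c_train = {}
with open("qrels_c_train.tsv", encoding="utf-8") as f:
    for line in f:
        qid, _, pid, rel = line.split()
        if int(rel) > 0:
            qrels_c_train.setdefault(qid, set()).add(pid)
```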

We release the training and validation data in triples format to facilitate training. The triples files for training on ChatGPT responses are "train_c_triples.tsv" and "valid_c_triples.tsv". We also release triples based on human responses ("train_h_triples.tsv" and "valid_h_triples.tsv"), so that training on ChatGPT responses can be compared with training on human responses. For each query and its positive answer, 1000 negative answers were sampled randomly.
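Because the triples files are tens of gigabytes, it is more practical to stream them than to load them into memory. A minimal sketch, assuming the column order given in the table above (query, positive passage, negative passage):

```python
def iter_triples(path):
    """Yield (query, positive_passage, negative_passage) tuples from a triples TSV."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            query, positive, negative = line.rstrip("\n").split("\t")
            yield query, positive, negative

# Example: inspect the first three training triples for the ChatGPT collection.
for i, (query, pos, neg) in enumerate(iter_triples("train_c_triples.tsv")):
    print(query[:60], "|", pos[:60], "|", neg[:60])
    if i == 2:
        break
```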

Answer re-ranking dataset

| Description | Filename | File size | Num Records |
|-------------|----------|-----------|-------------|
| Top-H 1000 Train | top_1000_h_train.run | 646.6 MB | 16,774,122 |
| Top-H 1000 Validation | top_1000_h_valid.run | 23.7 MB | 605,956 |
| Top-H 1000 Test | top_1000_h_test.run | 270.6 MB | 6,920,845 |
| Top-C 1000 Train | top_1000_c_train.run | 646.6 MB | 16,768,032 |
| Top-C 1000 Validation | top_1000_c_valid.run | 23.7 MB | 605,793 |
| Top-C 1000 Test | top_1000_c_test.run | 271.1 MB | 6,917,616 |

The run files of the answer re-ranking dataset are in TREC run format.

Note: We use BM25 in Elasticsearch as the first-stage ranker to retrieve the top-1000 documents for each question (i.e., query). However, for some queries fewer than 1000 documents are retrieved, which means the collection contains fewer than 1000 documents that match at least one word of the query.
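A minimal sketch of reading one of these run files, assuming the standard six-column TREC run layout (qid Q0 pid rank score run_tag):

```python
from collections import defaultdict

def load_trec_run(path):
    """Load a TREC run file into {qid: [(pid, score), ...]} in rank order."""
    run = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, _q0, pid, _rank, score, _tag = line.split()
            run[qid].append((pid, float(score)))
    return run

top1000_h_test = load_trec_run("top_1000_h_test.run")  # BM25 candidates to re-rank
```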

Analyzing the effectiveness of BM25 on human/ChatGPT responses

Coming soon.

<!--
| Questions Split | Response Writer | MAP@1000 | NDCG@10 | Recall@10 | Recall@100 | Recall@1000 |
|-----------------|-----------------|:--------:|:-------:|:---------:|:----------:|:-----------:|
| Test | Human | .143 | .184 | .212 | .359 | .520 |
| | ChatGPT | .370 | .396 | .526 | .823 | .898 |
| Train | Human | .158 | .202 | .236 | .392 | .560 |
| | ChatGPT | .413 | .443 | .577 | .834 | .903 |
| Validation | Human | .154 | .200 | .228 | .370 | .523 |
| | ChatGPT | .386 | .410 | .539 | .847 | .904 |

#### Code for the above evaluation: [ChatGPT-RetrievalQA-Evaluation](https://colab.research.google.com/drive/1ywQaXVcOFGod6WR3Rq4owIxblXmTK7dF?usp=sharing) [![](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ywQaXVcOFGod6WR3Rq4owIxblXmTK7dF?usp=sharing)
-->

BERT re-ranking effectiveness on the Qrels-H Test

We train BERT on the responses produced by ChatGPT (using the queries.tsv, collection_c.tsv, train_c_triples.tsv, valid_c_triples.tsv, qrels_c_train.tsv, and qrels_c_valid.tsv files). Next, we evaluate the effectiveness of BERT as an answer re-ranker on human responses (using queries.tsv, collection_h.tsv, top_1000_c_test.run, and qrels_h_test.tsv). By doing so, we answer the following question: "What is the effectiveness of an answer retrieval model that is trained on ChatGPT responses, when we evaluate it on human responses?"
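As an illustration of the re-ranking step, the sketch below re-scores BM25 candidates for a single query with a cross-encoder. The use of the sentence-transformers CrossEncoder class and the off-the-shelf MS MARCO checkpoint are assumptions for illustration only, not the exact training setup used in the papers.

```python
from sentence_transformers import CrossEncoder

# Assumption: an off-the-shelf MS MARCO cross-encoder stands in for a model
# fine-tuned on ChatGPT responses; replace it with your own trained checkpoint.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512)

def rerank(query, candidate_pids, collection, top_k=10):
    """Score (query, passage) pairs and return the top_k passages by model score."""
    pairs = [(query, collection[pid]) for pid in candidate_pids]
    scores = model.predict(pairs)
    ranked = sorted(zip(candidate_pids, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

# Usage with the loading sketches above: re-rank the BM25 candidates of one test
# query against the human collection and compare the result with qrels_h_test.
# qid = next(iter(top1000_h_test))
# print(rerank(queries[qid], [pid for pid, _ in top1000_h_test[qid]], collection_h))
```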

Coming soon.

Collection of responses produced by other Large Language Models (LLMs)

Coming soon.

Code for creating the dataset

ChatGPT-RetrievalQA-Dataset-Creator

Dataset source and copyright

Special thanks to the HC3 team for releasing the Human ChatGPT Comparison Corpus (HC3). Our data is created based on their dataset and follows their license.