
WebBrain: Learning to Generate Factually Correct Articles for Queries by Grounding on Large Web Corpus

Introduction

In this paper, we introduce a new NLP task: generating short factual articles with references for queries by mining supporting evidence from the Web. In this task, called WebBrain, the ultimate goal is to generate a fluent, informative, and factually correct short article (e.g., a Wikipedia article) for a factual query unseen in Wikipedia. To enable experiments on WebBrain, we construct a large-scale dataset, WebBrain-Raw, by extracting English Wikipedia articles and their crawlable Wikipedia references. WebBrain-Raw is ten times larger than the previous biggest peer dataset, and we expect it to greatly benefit the research community. From WebBrain-Raw, we construct two task-specific datasets, WebBrain-R and WebBrain-G, which are used to train an in-domain retriever and generator, respectively. In addition, we empirically analyze the performance of current state-of-the-art NLP techniques on WebBrain and introduce a new framework, ReGen, which enhances generation factualness through improved evidence retrieval and task-specific pre-training for generation. Experimental results show that ReGen outperforms all baselines in both automatic and human evaluations.

Citation

If you use the dataset in any publication or presentation, please cite:

@misc{qian2023webbrain,
      title={WebBrain: Learning to Generate Factually Correct Articles for Queries by Grounding on Large Web Corpus}, 
      author={Hongjing Qian and Yutao Zhu and Zhicheng Dou and Haoqi Gu and Xinyu Zhang and Zheng Liu and Ruofei Lai and Zhao Cao and Jian-Yun Nie and Ji-Rong Wen},
      year={2023},
      eprint={2304.04358},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Data Files

Application form

To download the datasets, please complete and sign the application form and submit it to us. Upon receipt of the application form, we will provide you with a download password.

You are required to sign the form manually. It is available in two versions: Application Form (pdf) and Application Form (markdown).

The markdown version is easy to edit. Once you have filled in the form, please save it as a PDF and send it back to us.

Contact mail: ian[at]ruc.edu.cn

Download

We offer two methods for downloading the WebBrain datasets. The first option is to download the datasets directly from our self-maintained servers using the provided URL. Please note that these servers are not exclusive to the datasets, which means that network conditions and stability may vary.

The second option is to access the datasets via Baidu Cloud Disk, which is readily available in China. If you encounter any difficulties downloading the datasets using either method, please don't hesitate to contact us for assistance.

We understand that the size of these datasets (on the order of terabytes) can make downloading challenging, so we are actively exploring additional options for hosting the data. We are committed to finding free and accessible alternatives and welcome any suggestions you may have.

You may download the sample data here: Google Drive, Baidu Cloud Disk.

We provide the following datasets:

| Dataset | Description | Download Link | Baidu Cloud Disk |
| --- | --- | --- | --- |
| WebBrain-Raw | Contains the raw text of WebBrain. It comprises 153 zipped data chunks in which each line is a Wikipedia page with its reference articles. | On the way | Link |
| WebBrain-deduplicated | In WebBrain-Raw, multiple Wikipedia pages might use an identical web page as a reference, leading to redundancy. In this dataset, we deduplicate all reference articles and generate a standalone reference database. We only keep the reference's URL in the Wikipedia page data. | On the way | On the way |
| WebBrain-G(eneration) | A processed dataset for training and evaluating generation models. | On the way | On the way |
| WebBrain-R(etrieval) | A processed dataset for training and evaluating retrieval models. | On the way | Link |
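The deduplication described for WebBrain-deduplicated can be pictured with a minimal sketch. It is not the authors' pipeline: the function name `deduplicate_references`, the choice of the URL as the deduplication key, and the sample pages are assumptions for illustration, and the field names follow the WebBrain-Raw record format.

```python
def deduplicate_references(pages):
    """Return (pages with URL-only references, url -> reference database).

    Assumption: two references are duplicates when they share the same URL.
    """
    reference_db = {}
    slim_pages = []
    for page in pages:
        slim_refs = []
        for ref in page["references"]:
            # Store the first copy of each reference in the standalone database.
            reference_db.setdefault(
                ref["url"], {"title": ref["title"], "text": ref["text"]}
            )
            # Keep only the citation marker and the URL in the page data.
            slim_refs.append({"cite_id": ref["cite_id"], "url": ref["url"]})
        slim_pages.append({**page, "references": slim_refs})
    return slim_pages, reference_db


# Hypothetical placeholder pages: both cite the same web page "ref_url_a".
pages = [
    {"url": "wiki_url_1", "title": "wiki_title_1", "text": "sentence_a[1].",
     "references": [{"cite_id": "[1]", "title": "ref_a", "url": "ref_url_a", "text": "content_a"}]},
    {"url": "wiki_url_2", "title": "wiki_title_2", "text": "sentence_b[1]. sentence_c[2].",
     "references": [{"cite_id": "[1]", "title": "ref_a", "url": "ref_url_a", "text": "content_a"},
                    {"cite_id": "[2]", "title": "ref_b", "url": "ref_url_b", "text": "content_b"}]},
]
slim_pages, reference_db = deduplicate_references(pages)
```

With this layout, a reference's text is looked up through `reference_db[url]` instead of being stored once per citing page.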

Data format:

WebBrain-Raw contains 154 chunk files in JSON Lines format. Each line of WebBrain-Raw has the following format:

{
   "url":"wiki_url",
   "title": "wiki_title",
   "text":"sentence_a <a href=\"wiki_hyperlink\">wiki_entry</a> sentence_b[1]. <h2> section_title </h2> sentence_c.[2]",
   "references":[
      {
         "cite_id":"[1]",
         "title":"ref_title",
         "url":"ref_url",
         "text": "ref_content"
      },
      ...
   ]
}

For the Wiki pages, we keep the HTML tags needed to identify Wiki sections and Wiki entries. A Wiki entry is an internal link to another Wiki page.
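As a rough illustration of how such a record could be consumed, the sketch below parses one JSON Lines record and pulls out the section titles, Wiki entries, and citation markers. The regular expressions, the helper name `parse_raw_line`, and the sample record are assumptions based only on the schematic format above. Real pages may use other heading levels or tag attributes.

```python
import json
import re

# Patterns inferred from the schematic record; assumptions, not a specification.
CITE_RE = re.compile(r"\[\d+\]")                       # citation markers such as [1]
SECTION_RE = re.compile(r"<h2>\s*(.*?)\s*</h2>")       # section titles
ENTRY_RE = re.compile(r'<a href="([^"]*)">(.*?)</a>')  # internal Wiki links


def parse_raw_line(line):
    """Parse one JSON Lines record and index its references by cite_id."""
    page = json.loads(line)
    text = page["text"]
    return {
        "title": page["title"],
        "sections": SECTION_RE.findall(text),
        "entries": ENTRY_RE.findall(text),
        "cited_ids": CITE_RE.findall(text),
        "refs_by_id": {ref["cite_id"]: ref for ref in page["references"]},
    }


# Hypothetical one-line record that mirrors the placeholders in the format above.
sample_line = json.dumps({
    "url": "wiki_url",
    "title": "wiki_title",
    "text": 'sentence_a <a href="wiki_hyperlink">wiki_entry</a> sentence_b[1]. '
            "<h2> section_title </h2> sentence_c.[2]",
    "references": [
        {"cite_id": "[1]", "title": "ref_title", "url": "ref_url", "text": "ref_content"},
    ],
})
parsed = parse_raw_line(sample_line)
```

For a decompressed chunk file, the same function can be applied to each line while iterating over the open file, so that a whole chunk never has to be held in memory.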

WebBrain-R contains four files: train.tsv / dev.tsv / test.tsv and corpus.jsonl. The first three files are in the same format:

qid\tquery\tpositive_passage_id\tnegative_passage1_id\t...\n

The data in corpus.jsonl are in the following format:

{"id": "passage_id", "content": "passage_content"}
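The two file formats above can be joined as in the following minimal sketch, which resolves the passage ids in a TSV line against the corpus. The helper names `parse_triple_line` and `load_corpus` and the sample data are hypothetical, and the code assumes that every field after the third one is a negative passage id.

```python
import json


def parse_triple_line(line):
    """Split one TSV line into qid, query, positive id, and negative ids."""
    fields = line.rstrip("\n").split("\t")
    qid, query, positive_id = fields[:3]
    return {
        "qid": qid,
        "query": query,
        "positive": positive_id,
        "negatives": fields[3:],  # assumed: all remaining fields are negatives
    }


def load_corpus(lines):
    """Build a passage_id -> content mapping from corpus.jsonl lines."""
    corpus = {}
    for line in lines:
        if line.strip():  # skip blank lines
            record = json.loads(line)
            corpus[record["id"]] = record["content"]
    return corpus


# Hypothetical placeholder data, not taken from the dataset.
example = parse_triple_line("q1\tsome query\tp1\tp2\tp3\n")
corpus = load_corpus([
    '{"id": "p1", "content": "passage one"}',
    '{"id": "p2", "content": "passage two"}',
    '{"id": "p3", "content": "passage three"}',
])
positive_text = corpus[example["positive"]]
negative_texts = [corpus[pid] for pid in example["negatives"]]
```

Given the size of the corpus, an in-memory dictionary may not be practical for the full data. An on-disk index is a possible alternative.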

WebBrain-G contains train / dev / test files, which are in the following format:

[title] wiki_title [ref] [ref_id] ref_title ref_content [SPLIT] ... [SPLIT] target_text 

Here, we prepend the Wiki title to each reference and join all references and the target text (label) with the special token [SPLIT].
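A line in this format could be split back into its references and target roughly as follows. This is a sketch under two assumptions: the literal string `[SPLIT]` never occurs inside a reference or the target, and each reference segment starts with `[title] ... [ref]`. The function name `parse_generation_line` and the sample line are hypothetical.

```python
import re

# Matches "[title] <wiki title> [ref] <rest of the reference>".
REF_RE = re.compile(r"\[title\]\s*(.*?)\s*\[ref\]\s*(.*)", re.DOTALL)


def parse_generation_line(line):
    """Split one line into its reference segments and the target text."""
    parts = [part.strip() for part in line.rstrip("\n").split("[SPLIT]")]
    *ref_parts, target = parts  # the last segment is the target text
    references = []
    for part in ref_parts:
        match = REF_RE.match(part)
        if match:
            references.append({"title": match.group(1), "reference": match.group(2)})
        else:
            # Keep segments that do not follow the expected layout unchanged.
            references.append({"title": None, "reference": part})
    return {"references": references, "target": target}


# Hypothetical line built from the placeholders in the format above.
sample = (
    "[title] wiki_title [ref] [1] ref_title_a ref_content_a [SPLIT] "
    "[title] wiki_title [ref] [2] ref_title_b ref_content_b [SPLIT] "
    "target_text"
)
parsed = parse_generation_line(sample)
```

The reference id, title, and content are left together in the `reference` field, because the format does not mark where the reference title ends and the content begins.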

The dataset statistics are as follows:

Statistics

Statistics of data for WebBrain-Raw.

| Dataset | # Wiki Pages | # Refs | Status | Storage Size |
| --- | --- | --- | --- | --- |
| WikiSum (Liu et al., 2018) | 2.3M | 87M | Need crawling | 300GB |
| WikiCatSum (Perez-Beltrachini et al., 2019) | 0.17M | 23.5M | Ready | 4.8GB |
| Hiersumm (Liu & Lapata, 2019) | 1.66M | - | Ready | 6.9GB |
| WebBrain-Raw | 14.86M | 259.5M | Ready | 2.9TB |

Statistics of data for experiments.

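| | WebBrain-R | WebBrain-G |
| --- | --- | --- |
| # Queries | 2.74M | 12.32M |
| # Ref. passages | 3.20M | 12.61M |
| # Tokens / Query | 3.2 | 2.9 |
| # Tokens / Passage | 237.5 | 250.0 |
| # Tokens / Target | - | 108.6 |
| # Training | 4.46M | 12.30M |
| # Validation | 0.2M | 0.5M |
| # Test | 88,935 | 24,546 |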

In the paper, we evaluate our proposed model, ReGen, on the WebBrain dataset. We release the source code of ReGen in this repo: Link.

Terms of Use

FAQ