BrWac2Wiki

Official repo for the dataset BrWac2Wiki.

The challenge: Generate Brazilian Wikipedia articles from multiple website texts!

This is a dataset for multi-document summarization in Portuguese, which means that it contains examples of multiple documents (input) paired with human-written summaries (output). In particular, each entry has multiple related texts from Brazilian websites about a subject, and the summary is the Portuguese Wikipedia lead section on the same subject (lead: the first section, i.e., the summary, of any Wikipedia article). Input texts were extracted from the BrWac corpus, and the outputs from the Portuguese Wikipedia dumps page.

BrWac2Wiki contains 114,652 examples of (documents, wikipedia) pairs, so it is suitable for training and validating AI models for multi-document summarization in Portuguese. More information is available in the paper "PLSUM: Generating PT-BR Wikipedia by Summarizing Websites", by André Seidel Oliveira¹ and Anna Helena Reali Costa¹, which is going to be presented at ENIAC 2021. Our work is inspired by WikiSum, a similar dataset for the English language.

The full dataset can be downloaded here.

¹ Researchers at the Department of Computer Engineering and Digital Systems (PCS) of the University of São Paulo (USP)

Description of data

There are three files in the dataset: docids.json, input.csv, and output.csv.

docids.json:

Lists the BrWac documents related to each Wikipedia article. Each line is a JSON entry relating a unique Wikipedia article identifier, wiki_id, to several unique BrWac document identifiers, docids. Each BrWac document cites all the words from the Wikipedia article title, wiki_title, at least once. Example:

{
  "wiki_id": "415", 
  "wiki_title": "Hino da Independência do Brasil", 
  "docids": ["net-6bb71a", "nete-1e5c7d", "neth-1682c"]
}

wiki_id: the Portuguese Wikipedia entity id of the article (here, "Hino da Independência do Brasil");
wiki_title: the title of the Wikipedia article;
docids: a list of unique document ids from BrWac. Each document is the text content of a website.
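
As a minimal sketch of how such a file might be read, the snippet below parses one JSON entry per line, as described above, and keeps the entries in file order. The function name `parse_docids_lines` is hypothetical, and the sample line is the example entry written on a single line.

```python
import json


def parse_docids_lines(lines):
    """Parse an iterable of JSON lines into a list of dicts, keeping file order."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        records.append(json.loads(line))
    return records


# Assumption: each entry sits on a single line of docids.json.
sample = [
    '{"wiki_id": "415", "wiki_title": "Hino da Independência do Brasil", '
    '"docids": ["net-6bb71a", "nete-1e5c7d", "neth-1682c"]}',
]
records = parse_docids_lines(sample)
print(records[0]["wiki_title"], len(records[0]["docids"]))
```

With the real file, the same function could be fed an open file handle, for example `parse_docids_lines(open("docids.json", encoding="utf-8"))`, assuming the file is UTF-8 encoded.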

input.csv:

Each line has the title of a wiki article and the sentences (document extracts with a maximum of 100 words) from the BrWac documents associated with the article, separated by the symbol </s>. Lines are in the same order as in docids.json. Example:

1  astronomia </s> veja nesta página do site - busca relacionada a astronomico com a seguinte descrição - astronomico </s> astronômico dicionário informal significado de astronômico o que é astronômico substivo masculino referente a corpos celestes como estrelas planetas satélites. </s> (...)
2  (...)
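
A line in this format could be split into its title and sentences roughly as follows. This is only a sketch: `split_input_line` is a hypothetical helper, and it assumes the leading numbers in the example above are display line numbers that are not part of the file.

```python
def split_input_line(line, sep="</s>"):
    """Split one input.csv line into (title, list of sentences)."""
    parts = [part.strip() for part in line.split(sep)]
    parts = [part for part in parts if part]  # drop empty fragments
    if not parts:
        return "", []
    return parts[0], parts[1:]


# Shortened version of the example line, for illustration only.
sample_line = (
    "astronomia </s> veja nesta página do site </s> "
    "astronômico dicionário informal significado de astronômico </s>"
)
title, sentences = split_input_line(sample_line)
print(title, len(sentences))
```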

output.csv:

Each line contains the lead section for a Wikipedia article, also in the same order as docids.json. Example:

1  O Hino da Independência é uma canção patriótica oficial comemorando a declaração da independência do Brasil, composta em 1822 por Dom Pedro I. A letra foi escrita pelo poeta Evaristo da Veiga.
2  (...)
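
Because the files share the same line order, examples can be paired by position. The sketch below assumes one example per physical line (no embedded newlines) in both input.csv and output.csv; `pair_by_line` is a hypothetical helper, and the sample lines are shortened.

```python
def pair_by_line(input_lines, output_lines):
    """Pair the i-th input line with the i-th output line."""
    inputs = [line.rstrip("\n") for line in input_lines]
    outputs = [line.rstrip("\n") for line in output_lines]
    if len(inputs) != len(outputs):
        raise ValueError(
            f"line count mismatch: {len(inputs)} inputs vs {len(outputs)} outputs"
        )
    return list(zip(inputs, outputs))


# Tiny in-memory stand-ins for the two files.
input_lines = ["hino da independência </s> texto de um site\n"]
output_lines = ["O Hino da Independência é uma canção patriótica oficial.\n"]
pairs = pair_by_line(input_lines, output_lines)
print(pairs[0][1])
```

Raising on a length mismatch is a deliberate choice here: silently truncating with `zip` would hide a misaligned download.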

Details

The association between BrWac documents and Wikipedia articles was found with the help of a MongoDB database. We populated the database with BrWac documents and then performed a text search for Wikipedia titles.

For time reasons, the search followed these rules:

Search for every word in the article title (AND search);
Limit the results to a maximum of 15 documents per wiki article;
Search for up to 2 seconds for at least 1 document; if none is found, remove the wiki article from the dataset.
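
The rules above can be illustrated with a small in-memory sketch. This is not the authors' MongoDB implementation: it replaces the database text search with a plain Python scan, the function names and the whitespace tokenization are assumptions, and the 2-second rule is read as "give up if no match has been found after 2 seconds".

```python
import time

MAX_DOCS = 15        # maximum documents per wiki article
TIME_BUDGET_S = 2.0  # seconds allowed to find the first match


def find_documents(title, documents, max_docs=MAX_DOCS, time_budget_s=TIME_BUDGET_S):
    """Return ids of documents that contain every word of the title (AND search)."""
    title_words = set(title.lower().split())
    if not title_words:
        return []
    deadline = time.monotonic() + time_budget_s
    matches = []
    for doc_id, text in documents.items():
        if len(matches) >= max_docs:
            break
        if not matches and time.monotonic() > deadline:
            break  # nothing found within the time budget
        if title_words <= set(text.lower().split()):
            matches.append(doc_id)
    return matches


def build_associations(titles, documents):
    """Map each title to its matching doc ids, dropping titles with no match."""
    associations = {}
    for title in titles:
        doc_ids = find_documents(title, documents)
        if doc_ids:
            associations[title] = doc_ids
    return associations


# Hypothetical toy corpus: doc id -> text.
documents = {
    "d1": "o hino da independência do brasil foi composto em 1822",
    "d2": "hino nacional brasileiro",
}
associations = build_associations(["Hino da Independência", "Astronomia"], documents)
print(associations)
```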

Acknowledgements

This research was supported by Itaú Unibanco S.A., through the scholarship program Programa de Bolsas Itaú (PBI), and partially financed by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Finance Code 001, and CNPq (grant 310085/2020-9), Brazil. Any opinions, findings, and conclusions expressed in this manuscript are those of the authors and do not necessarily reflect the views, official policy, or position of Itaú Unibanco, CAPES, or CNPq.