The auto-<i>h</i>MDS Corpus
The auto-<i>h</i>MDS corpus is a (a) large, (b) heterogeneous, (c) multilingual, (d) multi-document summarization corpus. The corpus is an automatically generated extension of the manually created <i>h</i>MDS corpus (https://github.com/AIPHES/hMDS).
Size
language | topics | uncompressed size | source documents |
---|---|---|---|
de | 2,210 | 1.8 GB | 10,454 |
en | 5,106 | 12.5 GB | 54,290 |
total | 7,316 | 14.3 GB | 64,744 |
Reference
If you refer to the auto-<i>h</i>MDS corpus in your publications, please cite the corresponding LREC 2018 paper:
@InProceedings{Zopf2018autohMDS,
author = {Zopf, Markus},
title = {auto-hMDS: Automatic Construction of a Large Heterogeneous Multilingual Multi-Document Summarization Corpus},
booktitle = {Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018)},
month = {May},
year = {2018},
address = {Miyazaki, Japan},
publisher = {Association for Computational Linguistics},
pages = {3228--3233},
website = {https://github.com/AIPHES/auto-hMDS}
}
Folder/File Hierarchy
Below the top-level folder, the language-specific parts of the corpus can be found. Language-specific folder names have the pattern "auto-hMDS-xx" where xx indicates the language. Currently, the corpus contains English "auto-hMDS-en" and German "auto-hMDS-de" topics.
In every language-specific folder, the topics can be found. The topic folders follow the pattern "x_y_z".
- x indicates the topic id, which equals the Wikipedia pageid. For topic "25_799993248_Autism", the topic id is 25 (http://en.wikipedia.org/?curid=25).
- y indicates the revision id of the Wikipedia article. For topic "25_799993248_Autism", we included revision 799993248 in our corpus (https://en.wikipedia.org/w/index.php?title=Autism&oldid=799993248).
- z indicates the topic name. The topic name of topic "25_799993248_Autism" is therefore "Autism".
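The naming pattern can be parsed mechanically. The following is a minimal sketch, not part of the corpus tooling: the class `TopicFolderName` and its methods are hypothetical names, and the URL built for languages other than English assumes that the other Wikipedia editions use the same `?curid=` scheme as the English example above.

```java
/** Hypothetical helper that splits a topic folder name of the form "x_y_z". */
final class TopicFolderName {
    final long pageId;      // x: Wikipedia pageid
    final long revisionId;  // y: revision id of the article
    final String topicName; // z: topic name

    private TopicFolderName(long pageId, long revisionId, String topicName) {
        this.pageId = pageId;
        this.revisionId = revisionId;
        this.topicName = topicName;
    }

    static TopicFolderName parse(String folderName) {
        // Split into at most three parts so that any underscore
        // inside the topic name stays part of the name.
        String[] parts = folderName.split("_", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Not of the form x_y_z: " + folderName);
        }
        return new TopicFolderName(
                Long.parseLong(parts[0]), Long.parseLong(parts[1]), parts[2]);
    }

    /** Assumption: every language edition resolves pages via "?curid=". */
    String pageUrl(String language) {
        return "http://" + language + ".wikipedia.org/?curid=" + pageId;
    }

    public static void main(String[] args) {
        TopicFolderName topic = parse("25_799993248_Autism");
        System.out.println(topic.topicName + " -> " + topic.pageUrl("en"));
    }
}
```

Splitting with a limit of three keeps topic names such as "Aries (constellation)" intact, including names that might themselves contain underscores.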
A list of all topics for every language can be found in the "topics-xx.txt" files where xx indicates the language.
We illustrate the corpus structure below.
- auto-hMDS corpus
  - auto-hMDS-de
    - 96_165707958_Argon
    - 97_168947548_Arsen
    - 101_167549035_Americium
    - 102_168948138_Atom
    - 140_168992757_Aristoteles
    - ...
  - auto-hMDS-en
    - 25_799993248_Autism
    - 621_798125813_Amphibian
    - 663_800770271_Apollo 8
    - 751_800460460_Aikido
    - 798_797054480_Aries (constellation)
    - ...
Every topic has two sub-folders, "input" and "reference".
The reference folder contains the original summary in the file "reference.txt" and a sentence-segmented version in the file "reference-segmented.txt". Every line in a "reference-segmented.txt" file contains one sentence. For every line (i.e., sentence) in "reference-segmented.txt", source web pages have been retrieved. The URLs of the web pages can be found in the "sentence_x_urls.txt" files, where x indicates the sentence index. Sentence indices start at 0, so the first sentence in "reference-segmented.txt" has index 0. The first line in "sentence_x_urls.txt" contains the sentence text, and each following line contains one URL.
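Reading one of the "sentence_x_urls.txt" files could look like the sketch below. The class `SentenceUrls` is a hypothetical name, and UTF-8 encoding is an assumption. Skipping blank lines is also only a precaution, since nothing here guarantees that the files contain none.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for the content of one "sentence_x_urls.txt" file. */
final class SentenceUrls {
    final String sentence;   // first line: the sentence text
    final List<String> urls; // following lines: one URL per line

    SentenceUrls(String sentence, List<String> urls) {
        this.sentence = sentence;
        this.urls = urls;
    }

    static SentenceUrls fromLines(List<String> lines) {
        if (lines.isEmpty()) {
            throw new IllegalArgumentException("File has no sentence line");
        }
        List<String> urls = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) { // precaution: ignore blank lines
                urls.add(trimmed);
            }
        }
        return new SentenceUrls(lines.get(0), urls);
    }

    /** Assumption: the files are UTF-8 encoded. */
    static SentenceUrls read(Path file) throws IOException {
        return fromLines(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```

A call such as `SentenceUrls.read(Paths.get("reference", "sentence_0_urls.txt"))` would then return the first reference sentence together with its URLs.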
Summaries, sentence-segmented summaries, and the URL lists for individual sentences can be found in the file auto-hMDS reference.zip.
The input folder contains the input documents which have to be summarized (see next section for more information). The input files have the pattern "sentence_x_y.*" where x indicates the sentence id and y indicates the link id. Note that neither the sentence ids nor the link ids are necessarily consecutive. A link was skipped if its web page was not retrievable. A sentence was skipped if none of its links was retrievable.
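Because the ids can have gaps, it is safer to read them from the file names than to count upwards. The following is a minimal sketch with a hypothetical class `InputFileName`. It returns an empty result for files that do not follow the pattern, such as "collected_sentence_ids.txt".

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for input file names of the form "sentence_x_y.*". */
final class InputFileName {
    // x and y are captured as digits; everything after the first dot is the extension.
    private static final Pattern PATTERN =
            Pattern.compile("sentence_(\\d+)_(\\d+)\\.(.+)");

    final int sentenceId;   // x
    final int linkId;       // y
    final String extension; // e.g. "html" or "html.ke.txt"

    private InputFileName(int sentenceId, int linkId, String extension) {
        this.sentenceId = sentenceId;
        this.linkId = linkId;
        this.extension = extension;
    }

    static Optional<InputFileName> parse(String fileName) {
        Matcher m = PATTERN.matcher(fileName);
        if (!m.matches()) {
            return Optional.empty(); // not an input document
        }
        return Optional.of(new InputFileName(
                Integer.parseInt(m.group(1)),
                Integer.parseInt(m.group(2)),
                m.group(3)));
    }
}
```

For example, `InputFileName.parse("sentence_2_1.html.ke.txt")` yields sentence id 2, link id 1, and the extension "html.ke.txt".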
We provide 2 versions of every input document. *.html files contain the HTML code of the retrieved web page. *.html.ke.txt files contain all visible content of the web pages. The visible content has been extracted with the Boilerpipe Keep Everything boilerplate removal tool and does not contain HTML tags anymore. The structure of the topic folders is illustrated below.
- 96_165707958_Argon
  - input
    - collected_sentence_ids.txt
    - sentence_1_1.html
    - sentence_1_1.html.ke.txt
    - sentence_2_1.html
    - sentence_2_1.html.ke.txt
    - ...
  - reference
    - reference.txt
    - reference-segmented.txt
    - sentence_0_urls.txt
    - sentence_1_urls.txt
    - sentence_2_urls.txt
    - ...
Obtaining the Full Corpus
We are not allowed to share the full corpus via GitHub for copyright reasons. Hence, the file "auto-hMDS reference.zip" only contains the references (i.e., summaries) for the topics and not the downloaded web pages. To mitigate this issue, we include in every input folder a file named "collected_sentence_ids.txt", which lists the links of the web pages that have been included in the corpus. The web pages can be downloaded with the provided program in the file "InputDownloadScript.java". Downloading the web pages requires chromedriver.exe, the ChromeDriver executable used with Selenium (https://www.seleniumhq.org/). The chromedriver.exe can be downloaded here: http://chromedriver.chromium.org/downloads and has to be stored in the /res folder of the Java project. Unfortunately, we found that some web pages are no longer available, which means that the quality of a corpus rebuilt with the Java program might suffer.