The <i>h</i>MDS Corpus

The <i>h</i>MDS corpus is a <i>heterogeneous</i> multi-document summarization corpus built with a novel corpus construction approach. It consists of 91 topics from 3 different domains. The guidelines the annotators used to create the corpus are in the Guidelines.md file.

Reference

If you refer to <i>h</i>MDS in your publications, please cite the corresponding COLING 2016 paper:

@InProceedings{Zopf2016hMDS,
  author    = {Zopf, Markus and Peyrard, Maxime and Eckle-Kohler, Judith},
  title     = {The Next Step for Multi-Document Summarization: A Heterogeneous Multi-Genre Corpus Built with a Novel Construction Approach},
  booktitle = {Proceedings of the 26th International Conference on Computational Linguistics (COLING 2016)},
  month     = {December},
  year      = {2016},
  address   = {Osaka, Japan},
  publisher = {Association for Computational Linguistics},
  pages     = {1535--1545},
  url       = {https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_AIPHES/publications/2016/2016_COLING_hMDS_cameraReady.pdf},
  website   = {https://github.com/AIPHES/hMDS}
}

Obtaining the Corpus

The public parts of the corpus can be found in the hMDS file. Due to copyright restrictions, we are not able to make the full corpus directly available. As a result, the "input" subfolder described in the readme.txt of the hMDS archive files is missing. To mitigate this issue, we added link lists containing references to the web pages included in the corpus (see Guidelines.md, step 6 for details), which allow the corpus to be crawled automatically.
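
Crawling from a link list could look roughly like the sketch below. It is not the authors' tooling. It assumes that a link list is a plain-text file with one URL per line, which should be checked against the readme.txt in the archive. The function names, the `#` comment convention, and the `<index>.html` output naming are all hypothetical choices made for illustration.

```python
import urllib.request
from pathlib import Path


def parse_link_list(text):
    """Return the URLs in a link list, skipping blank and '#' lines.

    Assumption: one URL per line; the real link-list format may differ.
    """
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def fetch_url(url, timeout=30):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl(urls, out_dir, fetch=fetch_url):
    """Save each URL as out_dir/<index>.html and return the URLs that failed.

    The output naming is hypothetical; adapt it to the layout that
    readme.txt describes for the "input" subfolder.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for index, url in enumerate(urls):
        try:
            data = fetch(url)
        except (OSError, ValueError):
            # Network errors (URLError is an OSError) and malformed URLs:
            # record the URL and keep going.
            failed.append(url)
            continue
        (out_dir / f"{index}.html").write_bytes(data)
    return failed
```

To use it, read one link-list file with `Path(...).read_text()`, pass the result through `parse_link_list`, and call `crawl` with the target folder. Pages that have moved or disappeared since the corpus was built end up in the returned list, so they can be inspected or retried separately.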