"Rossiya Segodnya" news dataset

This repository contains a news dataset presented in the paper:

Daniil Gavrilov, Pavel Kalaidin, and Valentin Malykh. Self-Attentive Model for Headline Generation. 41st European Conference on Information Retrieval, 2019. arXiv:1901.07786 [cs.CL]

To download the dataset, use a direct link or clone the repository using Git LFS.

Description

The full dataset contains 1,003,869 Russian-language news documents from January 2010 to December 2014.

Dataset format: each row contains a JSON document with two fields: text is the document body, and title is the news headline.
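As a minimal sketch of reading this row-per-document format, the Python snippet below parses one JSON object per line and keeps the text and title fields. The helper name iter_documents, the file name sample.json, and the sample rows are hypothetical and only stand in for the real data. The sketch also assumes the file has already been decompressed to plain UTF-8 text.

```python
import json
import os
import tempfile
from typing import Dict, Iterator


def iter_documents(path: str) -> Iterator[Dict[str, str]]:
    """Yield one {"text": ..., "title": ...} dict per non-empty line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between rows
            doc = json.loads(line)
            yield {"text": doc["text"], "title": doc["title"]}


# Tiny stand-in file with made-up rows, used only to show the row layout.
sample_rows = [
    {"text": "Body of the first article.", "title": "First headline"},
    {"text": "Body of the second article.", "title": "Second headline"},
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json")  # hypothetical name
with open(sample_path, "w", encoding="utf-8") as f:
    for row in sample_rows:
        # ensure_ascii=False keeps Cyrillic text readable in the file
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

docs = list(iter_documents(sample_path))
print(len(docs), docs[0]["title"])
```

Reading lazily, one line at a time, avoids loading about a million documents into memory at once. For the real file, replace sample_path with the path to the downloaded data.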

License

This data is licensed by the Rossiya Segodnya news agency (ria.ru) under the CC-BY-ND-NC license. The license text can be accessed here. The Russian version of the same license can be accessed here.

Misc

If you use the data in your research, please consider citing the paper mentioned above:

@inproceedings{gavrilov2018self,
	title={Self-Attentive Model for Headline Generation},
	author={Gavrilov, Daniil and Kalaidin, Pavel and Malykh, Valentin},
	booktitle={Proceedings of the 41st European Conference on Information Retrieval},
	year={2019}
}