<div align="center">

xMIND

CC BY-NC-SA 4.0

</div>

Description

xMIND is a large-scale multilingual news dataset for multi- and cross-lingual news recommendation. xMIND is derived from the English MIND (https://msnews.github.io/) dataset using open-source neural machine translation (i.e., NLLB 3.3B). xMIND contains 130K news translated into 14 linguistically and geographically diverse languages, with digital footprints of varying sizes. The goal of xMIND is to serve as a benchmark dataset for news recommendation, and to foster broader research into multilingual and cross-lingual news recommendation, for speakers of both high and low-resource languages.

The table below summarizes each language included in xMIND by code, language name, script, macro-area, family, genus, and resource level (Res.):

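| Code | Language | Script | Macro-area | Family | Genus | Res. |
|------|----------|--------|------------|--------|-------|------|
| SWH | Swahili | Latin | Africa | Niger-Congo | Bantu | high |
| SOM | Somali | Latin | Africa | Afro-Asiatic | Lowland East Cushitic | low |
| CMN | Mandarin Chinese | Han | Eurasia | Sino-Tibetan | Sinitic | high |
| JPN | Japanese | Japanese | Eurasia | Japonic | Japanesic | high |
| TUR | Turkish | Latin | Eurasia | Altaic | Turkic | high |
| TAM | Tamil | Tamil | Eurasia | Dravidian | Dravidian | low |
| VIE | Vietnamese | Latin | Eurasia | Austro-Asiatic | Vietic | high |
| THA | Thai | Thai | Eurasia | Tai-Kadai | Kam-Tai | high |
| RON | Romanian | Latin | Eurasia | Indo-European | Romance | high |
| FIN | Finnish | Latin | Eurasia | Uralic | Finnic | high |
| KAT | Georgian | Georgian | Eurasia | Kartvelic | Georgian-Zan | low |
| HAT | Haitian Creole | Latin | North-America | Indo-European | Creoles and Pidgins | low |
| IND | Indonesian | Latin | Papunesia | Austronesian | Malayo-Sumbawan | high |
| GRN | Guarani | Latin | South-America | Tupian | Maweti-Guarani | low |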

Download

The xMIND dataset is free to download for research purposes.

We release xMIND in two versions, corresponding to the original splits of MIND: xMINDsmall (training and validation sets) and xMINDlarge (training, validation, and test sets).

The zip-compressed TSV file containing the translated news, for each language and each split, can be downloaded from xMIND.

Automatic download

The download script automatically downloads the dataset for the chosen language, dataset size, and dataset split. By default, the script downloads the zipped dataset, extracts the TSV news file, and deletes the zip file.
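The default behavior described above (download the zip, extract the TSV, delete the zip) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's download script: the function name `fetch_news_tsv`, its parameters, and the intermediate file name `download.zip` are hypothetical, and the archive URL has to be supplied by the caller.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path


def fetch_news_tsv(zip_url: str, out_dir: str, keep_zip: bool = False) -> Path:
    """Download a zip archive, extract its first TSV member, optionally delete the zip.

    Hypothetical helper: it mirrors the described default behavior but is not
    the actual xMIND download script.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    zip_path = out / "download.zip"  # assumed temporary file name

    # Stream the archive to disk rather than holding it in memory.
    with urllib.request.urlopen(zip_url) as response, open(zip_path, "wb") as fh:
        shutil.copyfileobj(response, fh)

    with zipfile.ZipFile(zip_path) as archive:
        tsv_members = [m for m in archive.namelist() if m.endswith(".tsv")]
        if not tsv_members:
            raise ValueError(f"no TSV file found in {zip_url}")
        extracted = Path(archive.extract(tsv_members[0], path=out))

    # Deleting the archive is the default, matching the behavior described above.
    if not keep_zip:
        zip_path.unlink()
    return extracted
```

Passing `keep_zip=True` keeps the archive next to the extracted file, which may be convenient when the same split is needed again later.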

The following commands can be used to choose which dataset version to download:

Data Format

Each news.tsv file contains the translated news. It has 3 tab-separated columns: the news ID (nid), the title, and the abstract.

An example for Romanian (RON) is shown below:

| nid | title | abstract |
|-----|-------|----------|
| N49265 | Aceste reţete cu sos de afine sunt perfecte pentru cina de Ziua Recunoştinţei. | Nu vei mai vrea niciodată versiunea cumpărată din magazin. |
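A file in this format can be read with Python's standard `csv` module. The sketch below is a minimal example, not an official loader: `NewsItem` and `read_news_tsv` are hypothetical names, and whether the file starts with a header row is an assumption the caller controls through `has_header`.

```python
import csv
from typing import NamedTuple


class NewsItem(NamedTuple):
    nid: str
    title: str
    abstract: str


def read_news_tsv(path: str, has_header: bool = False) -> dict[str, NewsItem]:
    """Read a 3-column, tab-separated news file into a dict keyed by news ID."""
    items: dict[str, NewsItem] = {}
    with open(path, encoding="utf-8", newline="") as fh:
        # QUOTE_NONE keeps quotation marks inside titles and abstracts as literal text.
        reader = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        if has_header:
            next(reader, None)  # skip the header row if the file has one (assumption)
        for line_no, row in enumerate(reader, start=1):
            if not row:
                continue  # tolerate blank lines
            if len(row) != 3:
                raise ValueError(f"row {line_no}: expected 3 columns, got {len(row)}")
            item = NewsItem(*row)
            items[item.nid] = item
    return items
```

Keying the result by `nid` makes later lookups by news ID a constant-time dictionary access.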

Integration with MIND

The news in xMIND can be easily combined with the corresponding source news in English from the MIND dataset based on the unique news IDs. This should help researchers use xMIND in conjunction with the additional news annotations (e.g., categories, subcategories, named entities) and user behavior information provided in MIND.

To facilitate a seamless integration of xMIND with the MIND data, we provide scripts for loading the dataset and constructing bilingual user consumption patterns in the NewsRecLib library.
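Independently of NewsRecLib, the ID-based combination can be illustrated with plain dictionaries. The sketch below is not the library's loader: `combine_with_mind`, the per-language key names such as `title_ron`, and the shape of the input records are assumptions made for illustration only.

```python
def combine_with_mind(
    mind_news: dict[str, dict],
    xmind_news: dict[str, dict],
    lang: str,
) -> dict[str, dict]:
    """Attach translated title/abstract to the English MIND record with the same news ID.

    Both inputs map a news ID to a record; each xMIND record is assumed to
    carry "title" and "abstract" keys. IDs missing on either side are skipped.
    """
    combined: dict[str, dict] = {}
    for nid, source in mind_news.items():
        translated = xmind_news.get(nid)
        if translated is None:
            continue  # no translation available for this ID
        combined[nid] = {
            **source,  # keeps MIND annotations such as category and subcategory
            f"title_{lang}": translated["title"],
            f"abstract_{lang}": translated["abstract"],
        }
    return combined
```

Because the English record is copied unchanged, annotations that exist only in MIND (categories, subcategories, named entities) stay available next to the translated text.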

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

If you intend to use, adapt, or share xMIND, particularly together with additional news and click behavior information from the original MIND dataset, please read and reference the Microsoft Research License Terms of MIND.

Citation

If you use xMIND, please cite the following publication:

@misc{iana2024mind,
      title={MIND Your Language: A Multilingual Dataset for Cross-lingual News Recommendation}, 
      author={Andreea Iana and Goran Glavaš and Heiko Paulheim},
      year={2024},
      eprint={2403.17876},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

Also consider citing the following:

@inproceedings{wu2020mind,
  title={{MIND}: A Large-scale Dataset for News Recommendation},
  author={Wu, Fangzhao and Qiao, Ying and Chen, Jiun-Hung and Wu, Chuhan and Qi, Tao and Lian, Jianxun and Liu, Danyang and Xie, Xing and Gao, Jianfeng and Wu, Winnie and others},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  pages={3597--3606},
  year={2020}
}