<div align="center">xMIND
</div>
Description
xMIND is a large-scale multilingual news dataset for multi- and cross-lingual news recommendation. xMIND is derived from the English MIND (https://msnews.github.io/) dataset using open-source neural machine translation (i.e., NLLB 3.3B). xMIND contains 130K news articles translated into 14 linguistically and geographically diverse languages, with digital footprints of varying sizes. The goal of xMIND is to serve as a benchmark dataset for news recommendation, and to foster broader research into multilingual and cross-lingual news recommendation for speakers of both high- and low-resource languages.
The table below summarizes information about each language included in xMIND, according to the following criteria:
- Code: the three-letter ISO 639-3 code of the language;
- Language: the language name from WALS;
- Script: the English name of the script;
- Macro-area, Family, and Genus: the macro-area, language family, and genus from WALS and Glottolog;
- Res.: the classification of the language as low-resource or high-resource.
Code | Language | Script | Macro-area | Family | Genus | Res. |
---|---|---|---|---|---|---|
SWH | Swahili | Latin | Africa | Niger-Congo | Bantu | high |
SOM | Somali | Latin | Africa | Afro-Asiatic | Lowland East Cushitic | low |
CMN | Mandarin Chinese | Han | Eurasia | Sino-Tibetan | Sinitic | high |
JPN | Japanese | Japanese | Eurasia | Japonic | Japanesic | high |
TUR | Turkish | Latin | Eurasia | Altaic | Turkic | high |
TAM | Tamil | Tamil | Eurasia | Dravidian | Dravidian | low |
VIE | Vietnamese | Latin | Eurasia | Austro-Asiatic | Vietic | high |
THA | Thai | Thai | Eurasia | Tai-Kadai | Kam-Tai | high |
RON | Romanian | Latin | Eurasia | Indo-European | Romance | high |
FIN | Finnish | Latin | Eurasia | Uralic | Finnic | high |
KAT | Georgian | Georgian | Eurasia | Kartvelic | Georgian-Zan | low |
HAT | Haitian Creole | Latin | North-America | Indo-European | Creoles and Pidgins | low |
IND | Indonesian | Latin | Papunesia | Austronesian | Malayo-Sumbawan | high |
GRN | Guarani | Latin | South-America | Tupian | Maweti-Guarani | low |
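When scripting against a subset of languages, it can help to mirror the table in a small lookup structure. The sketch below is an illustration only: the `LANGUAGES` mapping and the `codes_where` helper are hypothetical names that are not part of the xMIND repository, and the values are copied by hand from the table above.

```python
# Hypothetical lookup table, copied from the language table above.
# Each entry maps code -> (language, script, macro-area, resource level).
LANGUAGES = {
    "SWH": ("Swahili", "Latin", "Africa", "high"),
    "SOM": ("Somali", "Latin", "Africa", "low"),
    "CMN": ("Mandarin Chinese", "Han", "Eurasia", "high"),
    "JPN": ("Japanese", "Japanese", "Eurasia", "high"),
    "TUR": ("Turkish", "Latin", "Eurasia", "high"),
    "TAM": ("Tamil", "Tamil", "Eurasia", "low"),
    "VIE": ("Vietnamese", "Latin", "Eurasia", "high"),
    "THA": ("Thai", "Thai", "Eurasia", "high"),
    "RON": ("Romanian", "Latin", "Eurasia", "high"),
    "FIN": ("Finnish", "Latin", "Eurasia", "high"),
    "KAT": ("Georgian", "Georgian", "Eurasia", "low"),
    "HAT": ("Haitian Creole", "Latin", "North-America", "low"),
    "IND": ("Indonesian", "Latin", "Papunesia", "high"),
    "GRN": ("Guarani", "Latin", "South-America", "low"),
}


def codes_where(script=None, resource=None):
    """Return the codes whose script and/or resource level match the filters."""
    return [
        code
        for code, (_, lang_script, _, lang_resource) in LANGUAGES.items()
        if (script is None or lang_script == script)
        and (resource is None or lang_resource == resource)
    ]


# Select the languages marked as low-resource in the table.
low_resource_codes = codes_where(resource="low")
print(low_resource_codes)
```

A list produced this way can then be used wherever a set of language codes is needed.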
Download
The xMIND dataset is free to download for research purposes.
We release xMIND in two versions, corresponding to the original splits of MIND: xMINDsmall (training and validation sets) and xMINDlarge (training, validation, and test sets).
A zip-compressed TSV file containing the translated news is available for each language and each split, and can be downloaded from xMIND.
Automatic download
The download script automatically downloads the dataset for the chosen language, dataset size, and dataset split. By default, the script downloads the zipped dataset, extracts the TSV news file, and deletes the zip file.
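The extract-then-delete behavior described above can be pictured with the standard `zipfile` module. This is a minimal sketch and not the code of the download script itself: the function name `extract_and_clean` and its `extract` and `delete_zip` parameters are hypothetical, and the archive is assumed to already exist on disk.

```python
import zipfile
from pathlib import Path


def extract_and_clean(zip_path, dst_dir, extract=True, delete_zip=True):
    """Extract a local zip archive into dst_dir, then optionally delete it.

    Hypothetical helper that mirrors the default behavior described above;
    it does not download anything.
    """
    zip_path = Path(zip_path)
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)

    extracted = []
    if extract:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(dst_dir)
            extracted = [dst_dir / name for name in archive.namelist()]

    if delete_zip:
        # Remove the archive once its contents have been handled.
        zip_path.unlink()

    return extracted
```

Calling `extract_and_clean("some_archive.zip", "some_folder")` on an existing archive would return the paths of the extracted files and leave no zip file behind.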
The following commands can be used to choose which dataset version to download:
- Download xMIND for all languages, all dataset sizes, and all dataset splits (default setting):
  python download.py
- Download only one or more languages:
  python download.py --languages {language_1} {language_2}
  Use the ISO 639-3 code of the language from the table above to choose a specific language.
- Download only one or more dataset sizes:
  python download.py --sizes {dataset_size_1} {dataset_size_2}
  Supported dataset sizes: large or small.
- Download only one or more dataset splits:
  python download.py --splits {dataset_split_1} {dataset_split_2} {dataset_split_3}
  Supported dataset splits: train, dev, or test.
- Download without extracting the zipped file:
  python download.py --extract_archive
- Download without deleting the zipped file:
  python download.py --clean_archive
- The downloaded dataset is by default stored in a newly created directory called xmIND. Change the destination directory as follows:
  python download.py --dst_dir 'my_folder'
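The options above can also be combined programmatically. The following is a minimal sketch that assembles the command line from Python. It uses only the `--languages`, `--sizes`, `--splits`, and `--dst_dir` flags listed above; the helper names `build_download_command` and `run_download` are hypothetical. The archive-related flags are left out because the values they accept are not described here, and the language codes are written as they appear in the table, which may differ from the casing the script expects.

```python
import subprocess
import sys


def build_download_command(languages=None, sizes=None, splits=None, dst_dir=None):
    """Assemble the argument list for download.py from the documented options."""
    command = [sys.executable, "download.py"]
    if languages:
        command += ["--languages", *languages]
    if sizes:
        command += ["--sizes", *sizes]
    if splits:
        command += ["--splits", *splits]
    if dst_dir:
        command += ["--dst_dir", str(dst_dir)]
    return command


def run_download(**options):
    """Run download.py; assumes the script is in the current working directory."""
    subprocess.run(build_download_command(**options), check=True)


# Show the command without executing it.
command = build_download_command(
    languages=["RON", "FIN"],
    sizes=["small"],
    splits=["train", "dev"],
    dst_dir="my_folder",
)
print(" ".join(command[1:]))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the destination directory contains spaces.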
Data Format
Each news.tsv file contains the translated news; it has three tab-separated columns:
- nid: The news ID of the article, identical to the ID of the same article in the MIND dataset.
- title: The title of the news translated into the target language.
- abstract: The abstract of the news (when provided in the original MIND dataset) translated into the target language.
An example for Romanian (RON) is shown below:
nid | title | abstract |
---|---|---|
N49265 | Aceste reţete cu sos de afine sunt perfecte pentru cina de Ziua Recunoştinţei. | Nu vei mai vrea niciodată versiunea cumpărată din magazin. |
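A file in this format can be read with the standard `csv` module. The sketch below makes two assumptions that are not stated above: the file is UTF-8 encoded, and it has no header row (if it does, the header would simply show up as an entry keyed `nid`). The function name `read_news` is hypothetical, and the sample row is the Romanian example from the table.

```python
import csv
import io


def read_news(lines):
    """Parse tab-separated rows of (nid, title, abstract) into a dict keyed by nid."""
    # QUOTE_NONE keeps quotation marks inside titles as literal characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    news = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Pad short rows so a missing abstract becomes an empty string.
        nid, title, abstract = (row + ["", ""])[:3]
        news[nid] = {"title": title, "abstract": abstract}
    return news


# In-memory stand-in for a news.tsv file, using the Romanian example row.
sample = io.StringIO(
    "N49265\t"
    "Aceste reţete cu sos de afine sunt perfecte pentru cina de Ziua Recunoştinţei.\t"
    "Nu vei mai vrea niciodată versiunea cumpărată din magazin.\n"
)
news = read_news(sample)
print(news["N49265"]["title"])
```

For a file on disk, the same function can be called on a handle opened with `open(path, encoding="utf-8", newline="")`.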
Integration with MIND
The news in xMIND can be easily combined with the corresponding source news in English from the MIND dataset based on the unique news IDs. This should help researchers use xMIND in conjunction with the additional news annotations (e.g., categories, subcategories, named entities) and user behavior information provided in MIND.
To facilitate a seamless integration of xMIND with the MIND data, we provide scripts for loading the dataset and constructing bilingual user consumption patterns in the NewsRecLib library.
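Independently of the NewsRecLib scripts, the ID-based join can be illustrated with the standard library alone. The sketch below is not the NewsRecLib implementation. It assumes the MIND news.tsv layout of eight tab-separated columns without a header (news ID, category, subcategory, title, abstract, URL, title entities, abstract entities); the function names and the `title_translated` / `abstract_translated` keys are hypothetical, and the two rows are made-up toy data rather than real dataset entries.

```python
import csv
import io

# Column layouts; the MIND layout is an assumption stated in the text above.
MIND_COLUMNS = (
    "nid", "category", "subcategory", "title",
    "abstract", "url", "title_entities", "abstract_entities",
)
XMIND_COLUMNS = ("nid", "title", "abstract")


def read_tsv(lines, columns):
    """Read tab-separated rows into a dict keyed by the first column (news ID)."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: dict(zip(columns, row)) for row in reader if row}


def join_on_nid(mind_news, xmind_news):
    """Attach the translated title and abstract to each matching MIND record."""
    merged = {}
    for nid, translated in xmind_news.items():
        source = mind_news.get(nid)
        if source is None:
            continue  # no English counterpart in this MIND file
        merged[nid] = {
            **source,
            "title_translated": translated.get("title", ""),
            "abstract_translated": translated.get("abstract", ""),
        }
    return merged


# Toy rows for illustration only; they are not taken from MIND or xMIND.
mind = read_tsv(
    io.StringIO("N1\tnews\tnewsworld\tAn English title\tAn English abstract\thttps://example.com\t[]\t[]\n"),
    MIND_COLUMNS,
)
xmind = read_tsv(
    io.StringIO("N1\tUn titlu\tUn rezumat\nN2\tAlt titlu\t\n"),
    XMIND_COLUMNS,
)
merged = join_on_nid(mind, xmind)
print(merged["N1"]["category"], "|", merged["N1"]["title_translated"])
```

Keeping the English fields untouched and adding the translations under separate keys makes it straightforward to build bilingual inputs, where some news items are shown in English and others in the target language.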
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
If you intend to use, adapt, or share xMIND, particularly together with additional news and click behavior information from the original MIND dataset, please read and reference the Microsoft Research License Terms of MIND.
Citation
If you use xMIND, please cite the following publication:
@misc{iana2024mind,
title={MIND Your Language: A Multilingual Dataset for Cross-lingual News Recommendation},
author={Andreea Iana and Goran Glavaš and Heiko Paulheim},
year={2024},
eprint={2403.17876},
archivePrefix={arXiv},
primaryClass={cs.IR}
}
Also consider citing the following:
@inproceedings{wu2020mind,
title={Mind: A large-scale dataset for news recommendation},
author={Wu, Fangzhao and Qiao, Ying and Chen, Jiun-Hung and Wu, Chuhan and Qi, Tao and Lian, Jianxun and Liu, Danyang and Xie, Xing and Gao, Jianfeng and Wu, Winnie and others},
booktitle={Proceedings of the 58th annual meeting of the association for computational linguistics},
pages={3597--3606},
year={2020}
}