XL-Sum

This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages" published in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.

Updates

Datasets

Disclaimer: You must agree to the license and terms of use before using the dataset.

We are releasing two versions of the dataset: an older version, which is the one reported in the paper, and a newer version with one additional language (Traditional Chinese), more data, better formatting, better extraction, larger evaluation splits, and deduplication. We recommend the newer version, and the data counts and benchmarks in this repository refer to it. The new version contains a total of 1.35 million article-summary pairs, making XL-Sum the largest text summarization dataset publicly available.

All dataset files are in .jsonl format, i.e., one JSON object per line. An example from the English dataset is shown below, pretty-printed for readability. The fields are self-explanatory.

{
  "id": "technology-17657859",
  "url": "https://www.bbc.com/news/technology-17657859",
  "title": "Yahoo files e-book advert system patent applications",
  "summary": "Yahoo has signalled it is investigating e-book adverts as a way to stimulate its earnings.",
  "text": "Yahoo's patents suggest users could weigh the type of ads against the sizes of discount before purchase. It says in two US patent applications that ads for digital book readers have been \"less than optimal\" to date. The filings suggest that users could be offered titles at a variety of prices depending on the ads' prominence They add that the products shown could be determined by the type of book being read, or even the contents of a specific chapter, phrase or word. The paperwork was published by the US Patent and Trademark Office late last week and relates to work carried out at the firm's headquarters in Sunnyvale, California. \"Greater levels of advertising, which may be more valuable to an advertiser and potentially more distracting to an e-book reader, may warrant higher discounts,\" it states. Free books It suggests users could be offered ads as hyperlinks based within the book's text, in-laid text or even \"dynamic content\" such as video. Another idea suggests boxes at the bottom of a page could trail later chapters or quotes saying \"brought to you by Company A\". It adds that the more willing the customer is to see the ads, the greater the potential discount. \"Higher frequencies... may even be great enough to allow the e-book to be obtained for free,\" it states. The authors write that the type of ad could influence the value of the discount, with \"lower class advertising... such as teeth whitener advertisements\" offering a cheaper price than \"high\" or \"middle class\" adverts, for things like pizza. The inventors also suggest that ads could be linked to the mood or emotional state the reader is in as a they progress through a title. For example, they say if characters fall in love or show affection during a chapter, then ads for flowers or entertainment could be triggered. The patents also suggest this could applied to children's books - giving the Tom Hanks animated film Polar Express as an example. It says a scene showing a waiter giving the protagonists hot drinks \"may be an excellent opportunity to show an advertisement for hot cocoa, or a branded chocolate bar\". Another example states: \"If the setting includes young characters, a Coke advertisement could be provided, inviting the reader to enjoy a glass of Coke with his book, and providing a graphic of a cool glass.\" It adds that such targeting could be further enhanced by taking account of previous titles the owner has bought. 'Advertising-free zone' At present, several Amazon and Kobo e-book readers offer full-screen adverts when the device is switched off and show smaller ads on their menu screens, but the main text of the titles remains free of marketing. Yahoo does not currently provide ads to these devices, and a move into the area could boost its shrinking revenues. However, Philip Jones, deputy editor of the Bookseller magazine, said that the internet firm might struggle to get some of its ideas adopted. \"This has been mooted before and was fairly well decried,\" he said. \"Perhaps in a limited context it could work if the merchandise was strongly related to the title and was kept away from the text. \"But readers - particularly parents - like the fact that reading is an advertising-free zone. Authors would also want something to say about ads interrupting their narrative flow.\""
}
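
Since each line of a .jsonl file is a standalone JSON object, the files can be streamed record by record without loading everything into memory. The following is a minimal sketch using only the Python standard library; the helper name `read_jsonl` is an assumption made for illustration and is not part of this repository's code.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one record (a dict) per non-empty line of a .jsonl file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


def summary_text_pairs(path: str) -> Iterator[tuple]:
    """Yield (summary, text) pairs using the field names shown in the example."""
    for record in read_jsonl(path):
        yield record["summary"], record["text"]
```

For example, `next(read_jsonl("english_train.jsonl"))` would return the first record as a dictionary with the keys `id`, `url`, `title`, `summary` and `text`. The file name here is hypothetical; substitute the path of the file you downloaded.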

Download the complete dataset. See the legacy section for the older version(s).

We used an 80%-10%-10% train-dev-test split for all languages, with a few exceptions. English was split 93%-3.5%-3.5% so that its evaluation set size resembles that of CNN/DM and XSum. Scottish Gaelic, Kyrgyz and Sinhala had relatively few samples, so their evaluation sets were increased to 500 samples each for more reliable evaluation. The same articles were used for evaluation in the two variants of Chinese and of Serbian to prevent data leakage in multilingual training. Individual dataset download links with train-dev-test example counts are given below:

| Language | ISO 639-1 Code | BBC subdomain(s) | Train | Dev | Test | Total | Link |
|---|---|---|---|---|---|---|---|
| Amharic | am | https://www.bbc.com/amharic | 5761 | 719 | 719 | 7199 | Download |
| Arabic | ar | https://www.bbc.com/arabic | 37519 | 4689 | 4689 | 46897 | Download |
| Azerbaijani | az | https://www.bbc.com/azeri | 6478 | 809 | 809 | 8096 | Download |
| Bengali | bn | https://www.bbc.com/bengali | 8102 | 1012 | 1012 | 10126 | Download |
| Burmese | my | https://www.bbc.com/burmese | 4569 | 570 | 570 | 5709 | Download |
| Chinese (Simplified) | zh-CN | https://www.bbc.com/ukchina/simp, https://www.bbc.com/zhongwen/simp | 37362 | 4670 | 4670 | 46702 | Download |
| Chinese (Traditional) | zh-TW | https://www.bbc.com/ukchina/trad, https://www.bbc.com/zhongwen/trad | 37373 | 4670 | 4670 | 46713 | Download |
| English | en | https://www.bbc.com/english, https://www.bbc.com/sinhala * | 306522 | 11535 | 11535 | 329592 | Download |
| French | fr | https://www.bbc.com/afrique | 8697 | 1086 | 1086 | 10869 | Download |
| Gujarati | gu | https://www.bbc.com/gujarati | 9119 | 1139 | 1139 | 11397 | Download |
| Hausa | ha | https://www.bbc.com/hausa | 6418 | 802 | 802 | 8022 | Download |
| Hindi | hi | https://www.bbc.com/hindi | 70778 | 8847 | 8847 | 88472 | Download |
| Igbo | ig | https://www.bbc.com/igbo | 4183 | 522 | 522 | 5227 | Download |
| Indonesian | id | https://www.bbc.com/indonesia | 38242 | 4780 | 4780 | 47802 | Download |
| Japanese | ja | https://www.bbc.com/japanese | 7113 | 889 | 889 | 8891 | Download |
| Kirundi | rn | https://www.bbc.com/gahuza | 5746 | 718 | 718 | 7182 | Download |
| Korean | ko | https://www.bbc.com/korean | 4407 | 550 | 550 | 5507 | Download |
| Kyrgyz | ky | https://www.bbc.com/kyrgyz | 2266 | 500 | 500 | 3266 | Download |
| Marathi | mr | https://www.bbc.com/marathi | 10903 | 1362 | 1362 | 13627 | Download |
| Nepali | np | https://www.bbc.com/nepali | 5808 | 725 | 725 | 7258 | Download |
| Oromo | om | https://www.bbc.com/afaanoromoo | 6063 | 757 | 757 | 7577 | Download |
| Pashto | ps | https://www.bbc.com/pashto | 14353 | 1794 | 1794 | 17941 | Download |
| Persian | fa | https://www.bbc.com/persian | 47251 | 5906 | 5906 | 59063 | Download |
| Pidgin** | n/a | https://www.bbc.com/pidgin | 9208 | 1151 | 1151 | 11510 | Download |
| Portuguese | pt | https://www.bbc.com/portuguese | 57402 | 7175 | 7175 | 71752 | Download |
| Punjabi | pa | https://www.bbc.com/punjabi | 8215 | 1026 | 1026 | 10267 | Download |
| Russian | ru | https://www.bbc.com/russian, https://www.bbc.com/ukrainian * | 62243 | 7780 | 7780 | 77803 | Download |
| Scottish Gaelic | gd | https://www.bbc.com/naidheachdan | 1313 | 500 | 500 | 2313 | Download |
| Serbian (Cyrillic) | sr | https://www.bbc.com/serbian/cyr | 7275 | 909 | 909 | 9093 | Download |
| Serbian (Latin) | sr | https://www.bbc.com/serbian/lat | 7276 | 909 | 909 | 9094 | Download |
| Sinhala | si | https://www.bbc.com/sinhala | 3249 | 500 | 500 | 4249 | Download |
| Somali | so | https://www.bbc.com/somali | 5962 | 745 | 745 | 7452 | Download |
| Spanish | es | https://www.bbc.com/mundo | 38110 | 4763 | 4763 | 47636 | Download |
| Swahili | sw | https://www.bbc.com/swahili | 7898 | 987 | 987 | 9872 | Download |
| Tamil | ta | https://www.bbc.com/tamil | 16222 | 2027 | 2027 | 20276 | Download |
| Telugu | te | https://www.bbc.com/telugu | 10421 | 1302 | 1302 | 13025 | Download |
| Thai | th | https://www.bbc.com/thai | 6616 | 826 | 826 | 8268 | Download |
| Tigrinya | ti | https://www.bbc.com/tigrinya | 5451 | 681 | 681 | 6813 | Download |
| Turkish | tr | https://www.bbc.com/turkce | 27176 | 3397 | 3397 | 33970 | Download |
| Ukrainian | uk | https://www.bbc.com/ukrainian | 43201 | 5399 | 5399 | 53999 | Download |
| Urdu | ur | https://www.bbc.com/urdu | 67665 | 8458 | 8458 | 84581 | Download |
| Uzbek | uz | https://www.bbc.com/uzbek | 4728 | 590 | 590 | 5908 | Download |
| Vietnamese | vi | https://www.bbc.com/vietnamese | 32111 | 4013 | 4013 | 40137 | Download |
| Welsh | cy | https://www.bbc.com/cymrufyw | 9732 | 1216 | 1216 | 12164 | Download |
| Yoruba | yo | https://www.bbc.com/yoruba | 6350 | 793 | 793 | 7936 | Download |

* Many articles in BBC Sinhala and BBC Ukrainian were written in English and Russian, respectively. They were identified using fastText and moved to the corresponding language's dataset.

** West African Pidgin English
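
The split proportions described above can be checked directly against the example counts in the table. The snippet below is a small worked calculation rather than repository code: the counts are copied from the table rows for three languages, and the function name `split_fractions` is an assumption made for illustration.

```python
# Train/dev/test example counts copied from the dataset table.
COUNTS = {
    "Amharic": (5761, 719, 719),        # regular 80%-10%-10% split
    "English": (306522, 11535, 11535),  # 93%-3.5%-3.5% split
    "Kyrgyz": (2266, 500, 500),         # evaluation sets fixed at 500 samples
}


def split_fractions(train: int, dev: int, test: int) -> tuple:
    """Return the train/dev/test shares of the total, rounded to 3 decimals."""
    total = train + dev + test
    return tuple(round(n / total, 3) for n in (train, dev, test))


for language, counts in COUNTS.items():
    print(language, sum(counts), split_fractions(*counts))
```

Amharic comes out at about 0.80 / 0.10 / 0.10 and English at 0.93 / 0.035 / 0.035, matching the stated splits. For Kyrgyz, fixing the evaluation sets at 500 samples each leaves roughly 69% of the data for training.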

Models

We are releasing a multilingual model checkpoint trained for 50k steps on the new data. To use this model for evaluation or inference, refer to Training & Evaluation.

Benchmarks

Multilingual model scores on test sets are given below. We are also releasing the model-generated outputs for future analysis.

| Language | ROUGE-1 / ROUGE-2 / ROUGE-L |
|---|---|
| Amharic | 20.0485 / 7.4111 / 18.0753 |
| Arabic | 34.9107 / 14.7937 / 29.1623 |
| Azerbaijani | 21.4227 / 9.5214 / 19.3331 |
| Bengali | 29.5653 / 12.1095 / 25.1315 |
| Burmese | 15.9626 / 5.1477 / 14.1819 |
| Chinese (Simplified) | 39.4071 / 17.7913 / 33.406 |
| Chinese (Traditional) | 37.1866 / 17.1432 / 31.6184 |
| English | 37.601 / 15.1536 / 29.8817 |
| French | 35.3398 / 16.1739 / 28.2041 |
| Gujarati | 21.9619 / 7.7417 / 19.86 |
| Hausa | 39.4375 / 17.6786 / 31.6667 |
| Hindi | 38.5882 / 16.8802 / 32.0132 |
| Igbo | 31.6148 / 10.1605 / 24.5309 |
| Indonesian | 37.0049 / 17.0181 / 30.7561 |
| Japanese | 48.1544 / 23.8482 / 37.3636 |
| Kirundi | 31.9907 / 14.3685 / 25.8305 |
| Korean | 23.6745 / 11.4478 / 22.3619 |
| Kyrgyz | 18.3751 / 7.9608 / 16.5033 |
| Marathi | 22.0141 / 9.5439 / 19.9208 |
| Nepali | 26.6547 / 10.2479 / 24.2847 |
| Oromo | 18.7025 / 6.1694 / 16.1862 |
| Pashto | 38.4743 / 15.5475 / 31.9065 |
| Persian | 36.9425 / 16.1934 / 30.0701 |
| Pidgin | 37.9574 / 15.1234 / 29.872 |
| Portuguese | 37.1676 / 15.9022 / 28.5586 |
| Punjabi | 30.6973 / 12.2058 / 25.515 |
| Russian | 32.2164 / 13.6386 / 26.1689 |
| Scottish Gaelic | 29.0231 / 10.9893 / 22.8814 |
| Serbian (Cyrillic) | 23.7841 / 7.9816 / 20.1379 |
| Serbian (Latin) | 21.6443 / 6.6573 / 18.2336 |
| Sinhala | 27.2901 / 13.3815 / 23.4699 |
| Somali | 31.5563 / 11.5818 / 24.2232 |
| Spanish | 31.5071 / 11.8767 / 24.0746 |
| Swahili | 37.6673 / 17.8534 / 30.9146 |
| Tamil | 24.3326 / 11.0553 / 22.0741 |
| Telugu | 19.8571 / 7.0337 / 17.6101 |
| Thai | 37.3951 / 17.275 / 28.8796 |
| Tigrinya | 25.321 / 8.0157 / 21.1729 |
| Turkish | 32.9304 / 15.5709 / 29.2622 |
| Ukrainian | 23.9908 / 10.1431 / 20.9199 |
| Urdu | 39.5579 / 18.3733 / 32.8442 |
| Uzbek | 16.8281 / 6.3406 / 15.4055 |
| Vietnamese | 32.8826 / 16.2247 / 26.0844 |
| Welsh | 32.6599 / 11.596 / 26.1164 |
| Yoruba | 31.6595 / 11.6599 / 25.0898 |
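
For readers unfamiliar with the metric, the sketch below illustrates the idea behind ROUGE-N as an F-measure over overlapping n-grams. It is a deliberately simplified illustration and not the scorer used to produce the numbers above: it assumes plain whitespace tokenization, applies no stemming or language-specific segmentation, and does not cover ROUGE-L. The function names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens: list, n: int) -> Counter:
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference: str, candidate: str, n: int = 1) -> float:
    """Simplified ROUGE-N F1 between a reference and a candidate summary.

    Assumption: tokens are separated by whitespace, which does not hold
    for languages written without spaces (e.g. Chinese, Japanese, Thai).
    """
    ref = ngram_counts(reference.split(), n)
    cand = ngram_counts(candidate.split(), n)
    # Clipped overlap: each n-gram counts at most as often as it
    # appears in both summaries.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_f1(reference, candidate, n=1))  # unigram overlap (ROUGE-1 style)
print(rouge_n_f1(reference, candidate, n=2))  # bigram overlap (ROUGE-2 style)
```

In this toy pair, five of six unigrams and three of five bigrams match, so the two scores are about 0.83 and 0.60.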

Multilingual ROUGE

Training & Evaluation

License

The contents of this repository are restricted to non-commercial research purposes only, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents belongs to the original copyright holders.

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>

Citation

If you use any of the datasets, models or code modules, please cite the following paper:

@inproceedings{hasan-etal-2021-xl,
    title = "{XL}-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages",
    author = "Hasan, Tahmid  and
      Bhattacharjee, Abhik  and
      Islam, Md. Saiful  and
      Mubasshir, Kazi  and
      Li, Yuan-Fang  and
      Kang, Yong-Bin  and
      Rahman, M. Sohel  and
      Shahriyar, Rifat",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.413",
    pages = "4693--4703",
}