Benchmark datasets for keyphrase extraction

This repository contains a large, curated set of benchmark datasets for evaluating automatic keyphrase extraction algorithms. These datasets are all pre-processed using the Stanford CoreNLP suite and are available in XML format.

Dataset format

All datasets are stored according to the following, common structure:

dataset/
       /test/       <- test documents
       /train/      <- training documents (if available)
       /dev/        <- validation documents (if available)
       /src/        <- everything used to build the dataset
       /references/ <- reference keyphrases in json format
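
As a quick sanity check on a downloaded dataset, the layout above can be inspected with a few lines of Python. This is only a minimal sketch: the function name `describe_dataset` is hypothetical, and it assumes nothing beyond the sub-directory names listed above.

```python
from pathlib import Path

# Sub-directories named in the common layout described above.
EXPECTED_DIRS = ("test", "train", "dev", "src", "references")


def describe_dataset(root):
    """Return {sub-directory: number of entries} for each expected
    sub-directory that actually exists under `root`."""
    root = Path(root)
    summary = {}
    for name in EXPECTED_DIRS:
        folder = root / name
        if folder.is_dir():
            # Count direct children only (files or folders).
            summary[name] = sum(1 for _ in folder.iterdir())
    return summary
```

Calling `describe_dataset` on a dataset directory returns only the splits that are present, which makes it easy to see whether `train/` or `dev/` is available for that dataset.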

Larger datasets (such as KP20k and KPTimes) must be downloaded and preprocessed using the contents of each dataset's src/ directory.

Reference (gold annotation) format

Reference keyphrases, used for evaluating automatic keyphrase extraction algorithms, are stored as JSON files named according to the following pattern: [split].[annotator].[stem]?.json

where

split corresponds to the dataset split: test, train, dev or valid
annotator is the type of annotation: author, reader, editor, combined, contr (controlled vocabulary), uncontr (free annotation)
stem (optional) indicates that stemming (using the NLTK Porter stemmer) has been applied to the reference keyphrases.
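
The naming rule can be checked mechanically. The sketch below is an illustration, not part of the repository: the function name `parse_reference_name` is hypothetical, and it assumes that the optional stem component appears as the literal token `stem` in the file name.

```python
import re

# Pattern for [split].[annotator].[stem]?.json.
# Assumption: the optional component is the literal word "stem".
REFERENCE_NAME = re.compile(
    r"^(?P<split>test|train|dev|valid)\."
    r"(?P<annotator>[a-z]+)"
    r"(?P<stem>\.stem)?\.json$"
)


def parse_reference_name(filename):
    """Split a reference file name into its split, annotator and stem flag."""
    match = REFERENCE_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a reference file name: {filename!r}")
    return {
        "split": match.group("split"),
        "annotator": match.group("annotator"),
        "stemmed": match.group("stem") is not None,
    }
```

For example, `parse_reference_name("test.author.stem.json")` reports the `test` split, `author` annotations and a stemmed reference, while a name that does not follow the rule raises `ValueError`.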

Below is an example of the reference file format:

{
    "doc-1": [
        [
            "target detect"
        ],
        [
            "number of sensor",
            "sensor number"
        ]
    ],
    ...
}
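
A reference file in this shape can be loaded with the standard `json` module and used to score a system's output. The sketch below is not the repository's evaluation procedure. It assumes that each inner list groups variants of a single gold keyphrase (as the "number of sensor" / "sensor number" pair suggests) and that candidates have been normalized the same way as the reference file (for example, stemmed when a stemmed reference is used). The function names are hypothetical.

```python
import json


def load_references(path):
    """Load a reference file: {doc_id: [[variant, ...], ...]}."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def score_document(candidates, reference_groups):
    """Return (precision, recall, F1) for one document.

    Assumption: a gold keyphrase counts as matched when any one of
    its variants appears among the candidates.
    """
    candidate_set = set(candidates)
    matched = sum(
        1 for variants in reference_groups if candidate_set & set(variants)
    )
    precision = matched / len(candidate_set) if candidate_set else 0.0
    recall = matched / len(reference_groups) if reference_groups else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Usage with the example reference entry shown above.
references = {
    "doc-1": [["target detect"], ["number of sensor", "sensor number"]]
}
candidates = ["sensor number", "wireless network"]
precision, recall, f1 = score_document(candidates, references["doc-1"])
```

Here one of the two candidates matches one of the two gold keyphrases, so precision, recall and F1 are all 0.5. Note that proposing two variants of the same gold keyphrase counts as a single match in this sketch, which lowers precision; other conventions are possible.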

Available datasets

dataset	lang	nature	train	dev	test	Annotation	#kp (test)	#words (test)
CSTR [1]	en	Full papers	130	-	500	A	5.4	11501.4
NUS [3]	en	Full papers	-	-	211	A+R	11.0	8398.3
PubMed [5]	en	Full papers	-	-	1320	A	5.4	5322.9
ACM [6]	en	Full papers	-	-	2304	A	5.3	9197.6
Citeulike-180 [13]	en	Full papers	-	-	182	R	5.4	8589.7
SemEval-2010 [10]	en	Full papers	144	-	100	A+R	14.7	7961.2
KP20k [15]	en	Abstracts	527,090	20,000	20,000	A	5.3	176
Inspec [2]	en	Abstracts	1000	500	500	I (uncontr)	9.8	134.6
TALN-Archives [14]	en/fr	Abstracts	-	-	521/1207	A	4.0/4.1	123.1/141.0
KDD [9]	en	Abstracts	-	-	755	A	4.1	190.7
WWW [9]	en	Abstracts	-	-	1330	A	4.8	163.5
TermITH-Eval [11]	fr	Abstracts	-	-	400	I	11.8	164.7
KPTimes [16]	en	News	259,923	10,000	20,000	E	5.0	921
DUC-2001 [4]	en	News	-	-	308	R	8.1	847.2
500N-KPCrowd [7]	en	News	450	-	50	R	46.2	465.3
110-PT-BN-KP [12]	pt	News	100	-	10	R	27.6	439.4
Wikinews-Keyphrase [8]	fr	News	-	-	100	R	9.7	313.6

Gold keyphrase annotations are produced by authors (A), readers (R), editors (E) or professional indexers (I).

References

[1] KEA: Practical automatic keyphrase extraction. Witten, I. H., Paynter, G. W., Frank, E., Gutwin, C., & Nevill-Manning, C. G. In Proceedings of the fourth ACM conference on Digital libraries. pp. 254-255. 1999.
[2] Improved automatic keyword extraction given more linguistic knowledge. Anette Hulth. In Proceedings of EMNLP 2003. pp. 216-223.
[3] Keyphrase Extraction in Scientific Publications. Thuy Dung Nguyen and Min-Yen Kan. In Proceedings of International Conference on Asian Digital Libraries 2007. pp. 317-326.
[4] Single Document Keyphrase Extraction Using Neighborhood Knowledge. Xiaojun Wan and Jianguo Xiao. In Proceedings of AAAI 2008. pp. 855-860.
[5] Keyphrase extraction from single documents in the open domain exploiting linguistic and statistical methods. Alexander Thorsten Schutz. Master's thesis, National University of Ireland (2008).
[6] Large dataset for keyphrases extraction. Krapivin, M., Autayeu, A., & Marchese, M. (2009). University of Trento.
[7] Supervised Topical Key Phrase Extraction of News Stories using Crowdsourcing, Light Filtering and Co-reference Normalization. Marujo, L., Gershman, A., Carbonell, J., Frederking, R., & Neto, J. P. In Proceedings of LREC 2012.
[8] TopicRank: Graph-Based Topic Ranking for Keyphrase Extraction. Adrien Bougouin, Florian Boudin, Béatrice Daille. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), 2013.
[9] Citation-Enhanced Keyphrase Extraction from Research Papers: A Supervised Approach. Cornelia Caragea, Florin Bulgarov, Andreea Godea and Sujatha Das Gollapalli. In Proceedings of EMNLP 2014. pp. 1435-1446.
[10] How Document Pre-processing affects Keyphrase Extraction Performance. Florian Boudin, Hugo Mougard and Damien Cram. COLING 2016 Workshop on Noisy User-generated Text (WNUT).
[11] TermITH-Eval: a French Standard-Based Resource for Keyphrase Extraction Evaluation. Adrien Bougouin, Sabine Barreaux, Laurent Romary, Florian Boudin and Béatrice Daille. Language Resources and Evaluation Conference (LREC), 2016.
[12] Keyphrase Cloud Generation of Broadcast News. Luis Marujo, Márcio Viveiros, João Paulo da Silva Neto. In Proceedings of Interspeech 2011.
[13] Human-competitive tagging using automatic keyphrase extraction. O. Medelyan, E. Frank, I. H. Witten. In Proceedings of EMNLP 2009.
[14] TALN Archives: a digital archive of French research articles in Natural Language Processing. Florian Boudin. In Proceedings of TALN 2013.
[15] Deep Keyphrase Generation. R. Meng, S. Zhao, S. Han, D. He, P. Brusilovsky and Y. Chi. In Proceedings of ACL 2017.
[16] KPTimes: A Large-Scale Dataset for Keyphrase Generation on News Documents. Y. Gallina, F. Boudin and B. Daille. In Proceedings of INLG 2019.