KG20C: A scholarly knowledge graph benchmark dataset

To facilitate research in scholarly data analysis, we constructed the KG20C knowledge graph using data from 20 top computer science conferences. It can serve as a standard benchmark dataset for several tasks, including knowledge graph embedding, link prediction, recommendation systems, and question answering about high quality papers.

The dataset was introduced and used in the TPDL'19 paper Exploring Scholarly Data by Semantic Query on Knowledge Graph Embedding Space and in the PhD thesis Multi-Relational Embedding for Knowledge Graph Representation and Analysis.

<p align="center"> <img alt="KG20C graph" src="./KG20C_graph.png" width=500px> <br> <i><b>Figure 1:</b> Overview of the KG20C knowledge graph.</i> </p>

Construction protocol

Scholarly data extraction

From the Microsoft Academic Graph dataset, we extracted high-quality computer science papers published in top conferences between 1990 and 2010. The list of top conferences is based on the A* conferences in the 2020 version of the CORE ranking. The data was cleaned by removing conferences with fewer than 300 publications and papers with fewer than 20 citations. The final list includes 20 top conferences (in alphabetical order): AAAI, AAMAS, ACL, CHI, COLT, DCC, EC, FOCS, ICCV, ICDE, ICDM, ICML, ICSE, IJCAI, NIPS, SIGGRAPH, SIGIR, SIGMOD, UAI, and WWW.

Knowledge graph construction

From the scholarly data, we defined the entities and relations and constructed the triples. The knowledge graph can be seen as a labeled multi-digraph between scholarly entities, with edge labels expressing the relationships between the nodes. We use 5 intrinsic entity types: Paper, Author, Affiliation, Venue, and Domain. We also use 5 intrinsic relation types between the entities: author_in_affiliation, author_write_paper, paper_in_domain, paper_cite_paper, and paper_in_venue.

Benchmark data splitting

The knowledge graph was split uniformly at random into training, validation, and test sets. We made sure that all entities and relations in the validation and test sets also appear in the training set, so that their embeddings can be learned. We also made sure that there is no data leakage and that there are no redundant triples in these splits. KG20C thus constitutes a challenging benchmark for link prediction, similar to WN18RR and FB15K-237.
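The two properties described above can be checked mechanically. The sketch below is only an illustration, not the authors' splitting script: the function names `vocabulary` and `check_split` are hypothetical, the toy IDs are made up, and "redundant" is interpreted narrowly as an exact triple appearing in both splits, which may be weaker than the check actually applied to KG20C.

```python
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def vocabulary(triples: Iterable[Triple]) -> Tuple[Set[str], Set[str]]:
    """Collect the entity IDs and relation names used by a set of triples."""
    entities: Set[str] = set()
    relations: Set[str] = set()
    for head, relation, tail in triples:
        entities.update((head, tail))
        relations.add(relation)
    return entities, relations


def check_split(train: List[Triple], heldout: List[Triple]) -> Dict[str, set]:
    """Report what a held-out split contains that violates the stated properties."""
    train_entities, train_relations = vocabulary(train)
    held_entities, held_relations = vocabulary(heldout)
    return {
        # Entities or relations with no training triple cannot get a learned embedding.
        "unseen_entities": held_entities - train_entities,
        "unseen_relations": held_relations - train_relations,
        # Assumption: only exact duplicates across splits are counted here.
        "duplicated_triples": set(train) & set(heldout),
    }


# Toy data with made-up IDs, deliberately containing one unseen entity
# and one duplicated triple.
train = [
    ("a1", "author_write_paper", "p1"),
    ("a1", "author_in_affiliation", "f1"),
    ("p1", "paper_in_venue", "v1"),
]
test = [
    ("a1", "author_write_paper", "p1"),
    ("a2", "author_write_paper", "p1"),
]

report = check_split(train, test)
print({name: sorted(items) for name, items in report.items()})
```

On a split that satisfies the properties stated above, all three sets in the report would be empty.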

Content of dataset

File format

All files are in tab-separated values (TSV) format, compatible with other popular benchmark datasets including WN18RR and FB15K-237. For example, train.txt includes the triple "28674CFA author_in_affiliation 075CFC38", which denotes that the author with ID 28674CFA works at the affiliation with ID 075CFC38.
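A minimal sketch of reading such a file in Python follows. It assumes one triple per line in head, relation, tail order, as in the example above; the function name `read_triples` is hypothetical and not part of the repository.

```python
import csv
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def read_triples(lines: Iterable[str]) -> List[Triple]:
    """Parse tab-separated (head, relation, tail) lines, skipping blank lines."""
    triples: List[Triple] = []
    # QUOTE_NONE treats quote characters as ordinary text, since IDs are plain strings.
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row:
            continue
        if len(row) != 3:
            raise ValueError(f"expected 3 tab-separated fields, got {len(row)}: {row!r}")
        head, relation, tail = row
        triples.append((head, relation, tail))
    return triples


# The example triple quoted above, written with explicit tab separators.
sample = ["28674CFA\tauthor_in_affiliation\t075CFC38\n"]
print(read_triples(sample))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `read_triples(open("train.txt", encoding="utf-8"))`.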

The repo includes these files:

Statistics

Data statistics of the KG20C knowledge graph:

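| Author | Paper | Conference | Domain | Affiliation |
|---|---|---|---|---|
| 8,680 | 5,047 | 20 | 1,923 | 692 |

| Entities | Relations | Training triples | Validation triples | Test triples |
|---|---|---|---|---|
| 16,362 | 5 | 48,213 | 3,670 | 3,724 |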

License

The dataset is free to use for research purposes. For other uses, please follow the Microsoft Academic Graph license.

Baseline results

We include the results for link prediction and semantic queries on the KG20C dataset. Link prediction is a relational query task: given a relation and either the head or the tail entity, predict the corresponding tail or head entity. Semantic queries are human-friendly queries on the scholarly data. MRR is the mean reciprocal rank, and Hit@k is the proportion of queries for which the correct answer is ranked in the top k.
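As a small illustration of the two metrics, the sketch below computes them from a list of ranks, where each rank is the 1-based position of the correct entity in a model's sorted predictions for one query. The function names and the example ranks are made up for illustration; how the ranks themselves are produced (for instance, whether other known correct answers are filtered out first) is not specified here and is left to the evaluation protocol in the cited papers.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all queries (rank 1 is the best possible)."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


def hit_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose correct answer is ranked within the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


# Hypothetical ranks of the correct answer for four queries.
ranks = [1, 2, 4, 20]

print(f"MRR    = {mean_reciprocal_rank(ranks):.3f}")  # (1 + 1/2 + 1/4 + 1/20) / 4 = 0.450
print(f"Hit@1  = {hit_at_k(ranks, 1):.3f}")           # 1 of 4 queries -> 0.250
print(f"Hit@3  = {hit_at_k(ranks, 3):.3f}")           # 2 of 4 queries -> 0.500
print(f"Hit@10 = {hit_at_k(ranks, 10):.3f}")          # 3 of 4 queries -> 0.750
```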

For more information, please refer to the citations.

Link prediction results

We report results for 4 methods: Random, a random-guess baseline that shows the task difficulty; Word2vec, a popular embedding method; and SimplE/CP<sub>h</sub> and MEI, two recent knowledge graph embedding methods.

All models use small-size settings, equivalent to a total embedding size of 100 (50x2 for Word2vec and SimplE/CP<sub>h</sub>, 10x10 for MEI).

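| Models | MRR | Hit@1 | Hit@3 | Hit@10 |
|---|---|---|---|---|
| Random | 0.001 | < 5e-4 | < 5e-4 | < 5e-4 |
| Word2vec (small) | 0.068 | 0.011 | 0.070 | 0.177 |
| SimplE/CP<sub>h</sub> (small) | 0.215 | 0.148 | 0.234 | 0.348 |
| MEI (small) | 0.230 | 0.157 | 0.258 | 0.368 |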

Semantic queries results

The following results demonstrate semantic queries on knowledge graph embedding space, using the above MEI (small) model.

| Queries | MRR | Hit@1 | Hit@3 | Hit@10 |
|---|---|---|---|---|
| Who may work at this organization? | 0.299 | 0.221 | 0.342 | 0.440 |
| Where may this author work at? | 0.626 | 0.562 | 0.669 | 0.731 |
| Who may write this paper? | 0.247 | 0.164 | 0.283 | 0.405 |
| What papers may this author write? | 0.273 | 0.182 | 0.324 | 0.430 |
| Which papers may cite this paper? | 0.116 | 0.033 | 0.120 | 0.290 |
| Which papers may this paper cite? | 0.193 | 0.097 | 0.225 | 0.404 |
| Which papers may belong to this domain? | 0.052 | 0.025 | 0.049 | 0.100 |
| Which may be the domains of this paper? | 0.189 | 0.114 | 0.206 | 0.333 |
| Which papers may publish in this conference? | 0.148 | 0.084 | 0.168 | 0.257 |
| Which conferences may this paper publish in? | 0.693 | 0.542 | 0.810 | 0.976 |
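Each of these questions reads as a link prediction query on one of the five relations, with either the head or the tail left open. The sketch below spells out one plausible mapping. It is an interpretation inferred from the relation names, not taken from the authors' code: the names `QuerySpec`, `SEMANTIC_QUERIES`, and `as_link_prediction` are hypothetical, and the direction of paper_cite_paper (citing paper as head, cited paper as tail) is an assumption.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class QuerySpec(NamedTuple):
    relation: str
    predict: str  # which side of the triple is unknown: "head" or "tail"


# Assumed mapping from each question to (relation, side to predict).
SEMANTIC_QUERIES: Dict[str, QuerySpec] = {
    "Who may work at this organization?": QuerySpec("author_in_affiliation", "head"),
    "Where may this author work at?": QuerySpec("author_in_affiliation", "tail"),
    "Who may write this paper?": QuerySpec("author_write_paper", "head"),
    "What papers may this author write?": QuerySpec("author_write_paper", "tail"),
    "Which papers may cite this paper?": QuerySpec("paper_cite_paper", "head"),
    "Which papers may this paper cite?": QuerySpec("paper_cite_paper", "tail"),
    "Which papers may belong to this domain?": QuerySpec("paper_in_domain", "head"),
    "Which may be the domains of this paper?": QuerySpec("paper_in_domain", "tail"),
    "Which papers may publish in this conference?": QuerySpec("paper_in_venue", "head"),
    "Which conferences may this paper publish in?": QuerySpec("paper_in_venue", "tail"),
}

PartialTriple = Tuple[Optional[str], str, Optional[str]]


def as_link_prediction(question: str, entity_id: str) -> PartialTriple:
    """Turn a question about a known entity into a triple with one open slot (None)."""
    spec = SEMANTIC_QUERIES[question]
    if spec.predict == "head":
        return (None, spec.relation, entity_id)
    return (entity_id, spec.relation, None)


# Author ID taken from the file format example; the open slot is the affiliation.
print(as_link_prediction("Where may this author work at?", "28674CFA"))
```

A trained embedding model would then score every candidate entity for the open slot and rank them, which is where the MRR and Hit@k values in the table come from.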

How to cite

If you find this dataset or our work useful, please cite us.

For the dataset and semantic query method, please cite:

For the extended semantic query method and baseline results, please cite:

For the MEI and MEIM knowledge graph embedding models, please cite:

For the Microsoft Academic Graph dataset, please cite:

See also