Awesome

SimKGC: Simple Contrastive Knowledge Graph Completion with Pre-trained Language Models

Official code repository for ACL 2022 paper "SimKGC: Simple Contrastive Knowledge Graph Completion with Pre-trained Language Models".

The paper is available at https://aclanthology.org/2022.acl-long.295.pdf.

In this paper, we identify that one key issue for text-based knowledge graph completion is efficient contrastive learning. By combining large number of negatives and hardness-aware InfoNCE loss, SimKGC can substantially outperform existing methods on popular benchmark datasets.

Requirements

python>=3.7
torch>=1.6 (for mixed precision training)
transformers>=4.15

All experiments are run with 4 V100(32GB) GPUs.

How to Run

It involves 3 steps: dataset preprocessing, model training, and model evaluation.

We also provide the predictions from our models in predictions directory.

For WN18RR and FB15k237 datasets, we use files from KG-BERT.

WN18RR dataset

Step 1, preprocess the dataset

bash scripts/preprocess.sh WN18RR

Step 2, training the model and (optionally) specify the output directory (< 3 hours)

OUTPUT_DIR=./checkpoint/wn18rr/ bash scripts/train_wn.sh

Step 3, evaluate a trained model

bash scripts/eval.sh ./checkpoint/wn18rr/model_last.mdl WN18RR

Feel free to change the output directory to any path you think appropriate.

FB15k-237 dataset

Step 1, preprocess the dataset

bash scripts/preprocess.sh FB15k237

Step 2, training the model and (optionally) specify the output directory (< 3 hours)

OUTPUT_DIR=./checkpoint/fb15k237/ bash scripts/train_fb.sh

Step 3, evaluate a trained model

bash scripts/eval.sh ./checkpoint/fb15k237/model_last.mdl FB15k237

Wikidata5M transductive dataset

Step 0, download the dataset. We provide a script to download the Wikidata5M dataset from its official website. This will download data for both transductive and inductive settings.

bash ./scripts/download_wikidata5m.sh

Step 1, preprocess the dataset

bash scripts/preprocess.sh wiki5m_trans

Step 2, training the model and (optionally) specify the output directory (about 12 hours)

OUTPUT_DIR=./checkpoint/wiki5m_trans/ bash scripts/train_wiki.sh wiki5m_trans

Step 3, evaluate a trained model (it takes about 1 hour due to the large number of entities)

bash scripts/eval_wiki5m_trans.sh ./checkpoint/wiki5m_trans/model_last.mdl

Wikidata5M inductive dataset

Make sure you have run scripts/download_wikidata5m.sh to download Wikidata5M dataset.

Step 1, preprocess the dataset

bash scripts/preprocess.sh wiki5m_ind

Step 2, training the model and (optionally) specify the output directory (about 11 hours)

OUTPUT_DIR=./checkpoint/wiki5m_ind/ bash scripts/train_wiki.sh wiki5m_ind

Step 3, evaluate a trained model

bash scripts/eval.sh ./checkpoint/wiki5m_ind/model_last.mdl wiki5m_ind

Troubleshooting

I encountered "CUDA out of memory" when running the code.

We run experiments with 4 V100(32GB) GPUs, please reduce the batch size if you don't have enough resources. Be aware that smaller batch size will hurt the performance for contrastive training.

Does this codebase support distributed data parallel(DDP) training?

No. Some input masks require access to batch data on all GPUs, so currently it only supports data parallel training for ease of implementation.

Citation

If you find our paper or code repository helpful, please consider citing as follows:

@inproceedings{wang-etal-2022-simkgc,
    title = "{S}im{KGC}: Simple Contrastive Knowledge Graph Completion with Pre-trained Language Models",
    author = "Wang, Liang  and
      Zhao, Wei  and
      Wei, Zhuoyu  and
      Liu, Jingming",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.295",
    pages = "4281--4294",
}