Awesome

PyTorch BERT Document Classification

Implementation and pre-trained models of the paper Enriching BERT with Knowledge Graph Embedding for Document Classification (PDF). A submission to the GermEval 2019 shared task on hierarchical text classification. If you encounter any problems, feel free to contact us or submit a GitHub issue.

Content

CLI script to run all experiments
WikiData author embeddings (view on Tensorboard Projector)
Data preparation
Requirements
Trained model weights as release files

Model architecture

BERT + Knowledge Graph Embeddings

Installation

Requirements:

Python 3.6
CUDA GPU
Jupyter Notebook

Install dependencies:

pip install -r requirements.txt

Prepare data

GermEval data

Download from shared-task website: here
Run all steps in Jupyter Notebook: germeval-data.ipynb

Author Embeddings

python wikidata_for_authors.py run ~/datasets/wikidata/index_enwiki-20190420.db \
    ~/datasets/wikidata/index_dewiki-20190420.db \
    ~/datasets/wikidata/torchbiggraph/wikidata_translation_v1.tsv.gz \
    ~/notebooks/bert-text-classification/authors.pickle \
    ~/notebooks/bert-text-classification/author2embedding.pickle

# OPTIONAL: Projector format
python wikidata_for_authors.py convert_for_projector \
    ~/notebooks/bert-text-classification/author2embedding.pickle
    extras/author2embedding.projector.tsv \
    extras/author2embedding.projector_meta.tsv

Reproduce paper results

Download pre-trained models: GitHub releases

Available experiment settings

Detailed settings for each experiment can found in cli.py.

task-a__bert-german_full
task-a__bert-german_manual_no-embedding
task-a__bert-german_no-manual_embedding
task-a__bert-german_text-only
task-a__author-only
task-a__bert-multilingual_text-only

task-b__bert-german_full
task-b__bert-german_manual_no-embedding
task-b__bert-german_no-manual_embedding
task-b__bert-german_text-only
task-b__author-only
task-b__bert-multilingual_text-only

Enviroment variables

TRAIN_DF_PATH: Path to Pandas Dataframe (pickle)
GPU_ID: Run experiments on this GPU (used for CUDA_VISIBLE_DEVICES)
OUTPUT_DIR: Directory to store experiment output
EXTRAS_DIR: Directory where author embeddings and gender data is located
BERT_MODELS_DIR: Directory where pre-trained BERT models are located

Validation set

python cli.py run_on_val <name> $GPU_ID $EXTRAS_DIR $TRAIN_DF_PATH $VAL_DF_PATH $OUTPUT_DIR --epochs 5

Test set

python cli.py run_on_test <name> $GPU_ID $EXTRAS_DIR $FULL_DF_PATH $TEST_DF_PATH $OUTPUT_DIR --epochs 5

Evaluation

The scores from the result table can be reproduced with the evaluation.ipynb notebook.

How to cite

If you are using our code, please cite our paper:

@inproceedings{Ostendorff2019,
    address = {Erlangen, Germany},
    author = {Ostendorff, Malte and Bourgonje, Peter and Berger, Maria and Moreno-Schneider, Julian and Rehm, Georg},
    booktitle = {Proceedings of the GermEval 2019 Workshop},
    title = {{Enriching BERT with Knowledge Graph Embedding for Document Classification}},
    year = {2019}
}

References

License

MIT