SememeWSD

Code and data for the COLING 2020 paper "Try to Substitute: An Unsupervised Chinese Word Sense Disambiguation Method Based on HowNet". [Paper]

Citation

Please cite our paper if you find it helpful.

@inproceedings{hou-etal-2020-try,
    title = "Try to Substitute: An Unsupervised {C}hinese Word Sense Disambiguation Method Based on {H}ow{N}et",
    author = "Hou, Bairu  and Qi, Fanchao  and Zang, Yuan  and Zhang, Xurui  and Liu, Zhiyuan  and Sun, Maosong",
    booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
    year = "2020",
}

This repository is mainly contributed by Bairu Hou, Fanchao Qi and Yuan Zang. To run our WSD model or use the WSD dataset, please refer to the following instructions.

Key Environment

Our model relies on OpenHowNet. Install it (e.g., pip install OpenHowNet), then download the HowNet data:

import OpenHowNet
OpenHowNet.download()

Build the Necessary Files

mkdir aux_files
python data_util.py

Load the Dataset

The previously available HowNet-based Chinese WSD dataset was built on an outdated version of HowNet that is no longer obtainable. For evaluation and further academic use, we build a new and larger HowNet-based Chinese WSD dataset based on the Chinese Word Sense Annotated Corpus used in SemEval-2007 Task 5.

You can load the dataset with either Python's eval or the json module.

Load with Python

dataset = []
with open("data/dataset.txt", 'r', encoding='utf-8') as f:
    for line in f:
        sample = eval(line.strip())
        dataset.append(sample)
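Because eval executes arbitrary code, a safer equivalent for a file of Python dict literals is ast.literal_eval, which parses literals without executing them. A minimal sketch (the load_dataset helper name is ours, not part of the repository):

```python
import ast

def load_dataset(path):
    """Parse one Python-literal dict per line, without executing code."""
    dataset = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                dataset.append(ast.literal_eval(line))
    return dataset
```

This drop-in replacement produces the same list of dicts as the eval loop above.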

Each line in the file is parsed into a Python dict with the following keys:

context: A word list of the sentence, including a token <target> that masks the targeted polysemous word.
part-of-speech: A list of the part-of-speech tags for each token in the sentence.
target_word: The original polysemous word in the sentence masked by <target>.
target_position: The position of the targeted polysemous word in the word list.
target_word_pos: The part-of-speech of the targeted polysemous word.
sense: The correct sense of the targeted polysemous word in the context, represented by a set of sememes from HowNet.
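As an illustration of how these fields fit together, the original sentence can be recovered by substituting target_word back at target_position. The sample record below is hypothetical, not taken from the real dataset:

```python
def restore_sentence(sample):
    """Replace the <target> mask with the original polysemous word."""
    tokens = list(sample["context"])
    tokens[sample["target_position"]] = sample["target_word"]
    return "".join(tokens)  # Chinese tokens join without spaces

# Hypothetical example record, for illustration only
sample = {
    "context": ["他", "喜欢", "<target>", "音乐"],
    "target_position": 2,
    "target_word": "流行",
}
```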

Load with Json

import json

dataset = []
with open("data/dataset.json",'r',encoding='utf-8') as f:
    for line in f:
        sample = json.loads(line.strip())
        dataset.append(sample)

You may need to manually convert some data types if you load with json; for example, target_word_pos should be an integer.
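The conversion can be done in a small normalization pass over each loaded sample. This is a sketch under our own assumptions: the helper name is ours, and which fields actually need casting should be checked against the real file:

```python
def normalize_sample(sample):
    """Cast fields that json may leave as strings back to integers.

    Field names follow the key list above; adjust the tuple to match
    the actual dataset file.
    """
    for key in ("target_position", "target_word_pos"):
        value = sample.get(key)
        if isinstance(value, str) and value.isdigit():
            sample[key] = int(value)
    return sample
```

Apply it right after json.loads, e.g. dataset.append(normalize_sample(sample)).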

You can load the dataset for the following evaluation or use it in your own research.

Run the WSD Model on the Dataset

We recommend using CUDA to accelerate inference. Make sure you have generated the necessary files and placed the dataset file in the data/ directory.

CUDA_VISIBLE_DEVICES=0 python run_model.py

The command will test the model on the whole dataset and generate a log file for further evaluation.

Evaluation

After you get the log file, you can evaluate it with the following command on various metrics.

python parse_log.py --model bert