Integrating Semantic Knowledge to Tackle Zero-shot Text Classification. NAACL-HLT 2019. Oral (Accepted)

Jingqing Zhang, Piyawat Lertvittayakumjorn, Yike Guo

Jingqing and Piyawat contributed equally to this project.

Paper link: arXiv:1903.12626

Contents

  1. Abstract
  2. Code
  3. Acknowledgement
  4. Citation
<h2 id="Abstract">Abstract</h2>

Insufficient or even unavailable training data of emerging classes is a big challenge of many classification tasks, including text classification. Recognising text documents of classes that have never been seen in the learning stage, so-called zero-shot text classification, is therefore difficult and only limited previous works tackled this problem. In this paper, we propose a two-phase framework together with data augmentation and feature augmentation to solve this problem. Four kinds of semantic knowledge (word embeddings, class descriptions, class hierarchy, and a general knowledge graph) are incorporated into the proposed framework to deal with instances of unseen classes effectively. Experimental results show that each and the combination of the two phases clearly outperform baseline and recent approaches in classifying real-world texts under the zero-shot scenario.

<h2 id="Code">Code</h2>

Checklist

In order to run the code, please check the following issues.

Please feel free to raise an issue if you have any difficulty running the code or obtaining the intermediate files.

How to perform data augmentation

An example:

python3 topic_translation.py \
        --data dbpedia \
        --nott 100

The arguments of the command represent

The location of the result file is specified by config.{zhang15_dbpedia, news20}_train_augmented_aggregated_path.

Three output files will be generated automatically (the file paths are defined in config.py).

How to perform feature augmentation / create v_{w,c}

An example:

python3 kg_vector_generation.py --data dbpedia 

The argument of the command represents

The locations of the result files are specified by config.{zhang15_dbpedia, news20}_kg_vector_dir.
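
The brace notation in the config names above appears to be shorthand for one attribute per dataset, for example config.zhang15_dbpedia_kg_vector_dir or config.news20_kg_vector_dir. A minimal sketch of resolving the right attribute from the --data value follows. The mapping from dbpedia / 20news to the attribute prefixes is inferred from the names quoted in this README, the helper names are hypothetical, and the config object is a stand-in with placeholder paths rather than the project's real config.py.

```python
from types import SimpleNamespace

# Assumed mapping from the --data value to the config attribute prefix,
# inferred from the attribute names quoted in this README.
DATASET_PREFIX = {"dbpedia": "zhang15_dbpedia", "20news": "news20"}


def config_attr_name(dataset: str, suffix: str) -> str:
    """Build an attribute name such as 'news20_kg_vector_dir'."""
    if dataset not in DATASET_PREFIX:
        raise ValueError(f"unknown dataset: {dataset!r}")
    return f"{DATASET_PREFIX[dataset]}_{suffix}"


def lookup(config, dataset: str, suffix: str) -> str:
    """Read the dataset-specific value from a config-like object."""
    return getattr(config, config_attr_name(dataset, suffix))


# Stand-in for config.py; the values are placeholders, not real locations.
demo_config = SimpleNamespace(
    zhang15_dbpedia_kg_vector_dir="<dbpedia kg vector dir>",
    news20_kg_vector_dir="<20news kg vector dir>",
)

print(config_attr_name("20news", "train_augmented_aggregated_path"))
print(lookup(demo_config, "dbpedia", "kg_vector_dir"))
```

With the real module, `import config` would take the place of `demo_config`, assuming the attribute names follow the pattern shown.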

How to train / test Phase 1

python3 train_reject.py \
        --data dbpedia \
        --unseen 0.5 \
        --model vw \
        --nepoch 3 \
        --rgidx 1 \
        --train 1

python3 train_reject_augmented.py \
        --data dbpedia \
        --unseen 0.5 \
        --model vw \
        --nepoch 3 \
        --rgidx 1 \
        --naug 100 \
        --train 1

The arguments of the command represent

The location of the result file (pickle) is specified by config.rejector_file. The pickle file contains a list of 10 sublists, one per iteration. Each sublist holds the prediction for every test case (1 = predicted as seen, 0 = predicted as unseen).
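
As a rough illustration of how the Phase 1 result file could be inspected, the sketch below loads a pickle with the structure described above and reports, per iteration, the share of test cases predicted as seen. The function names are hypothetical and the demo writes a small synthetic file to a temporary directory; a real run would pass the path stored in config.rejector_file instead. As with any pickle, only load files you trust.

```python
import os
import pickle
import tempfile


def load_rejector_predictions(path):
    """Load the list of per-iteration prediction lists from a pickle file."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def seen_fraction_per_iteration(predictions):
    """For each iteration, return the share of test cases predicted as seen (1)."""
    fractions = []
    for run in predictions:
        # An empty iteration has no meaningful share, so report NaN for it.
        fractions.append(sum(run) / len(run) if len(run) > 0 else float("nan"))
    return fractions


# Synthetic stand-in for the real file: two iterations, four test cases each.
demo = [[1, 0, 1, 1], [0, 0, 1, 1]]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "rejector_demo.pickle")
    with open(demo_path, "wb") as handle:
        pickle.dump(demo, handle)
    loaded = load_rejector_predictions(demo_path)

print(seen_fraction_per_iteration(loaded))
```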

How to train / test the traditional classifier in Phase 2

An example:

python3 train_seen.py \
        --data dbpedia \
        --unseen 0.5 \
        --model vw \
        --sepoch 1 \
        --train 1

The arguments of the command represent

How to train / test the zero-shot classifier in Phase 2

An example:

python3 train_unseen.py \
        --data 20news \
        --unseen 0.5 \
        --model vwvcvkg \
        --ns 2 --ni 2 --sepoch 10 \
        --rgidx 1 --train 1

The arguments of the command represent
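
Once both phases have produced predictions, they need to be combined into one final label per test case. The abstract describes a two-phase framework, and the Phase 1 output marks each test case as seen (1) or unseen (0). One plausible reading is to take the traditional classifier's prediction when Phase 1 says seen and the zero-shot classifier's prediction otherwise. The sketch below illustrates that routing on plain Python lists. The function name and the class labels are hypothetical, and the repository's scripts may combine the outputs differently.

```python
def combine_phases(seen_flags, seen_preds, unseen_preds):
    """Pick one label per test case based on the Phase 1 decision.

    seen_flags   : 1 if Phase 1 predicted "seen", 0 if "unseen"
    seen_preds   : labels from the traditional (seen-class) classifier
    unseen_preds : labels from the zero-shot (unseen-class) classifier
    """
    if not (len(seen_flags) == len(seen_preds) == len(unseen_preds)):
        raise ValueError("all three inputs must have one entry per test case")
    return [
        seen_label if flag == 1 else unseen_label
        for flag, seen_label, unseen_label in zip(seen_flags, seen_preds, unseen_preds)
    ]


# Hypothetical labels for three test cases.
flags = [1, 0, 1]
from_seen_classifier = ["Company", "Artist", "Company"]
from_zero_shot_classifier = ["Film", "Album", "Plant"]

print(combine_phases(flags, from_seen_classifier, from_zero_shot_classifier))
```

Keeping the routing as a separate step means either Phase 2 classifier can be retrained without rerunning Phase 1, provided the test cases stay in the same order.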

<h2 id="Acknowledgement">Acknowledgement</h2>

We would like to thank Douglas McIlwraith, Nontawat Charoenphakdee, and three anonymous reviewers for helpful suggestions. Jingqing and Piyawat would also like to acknowledge the support of the LexisNexis&reg; Risk Solutions HPCC Systems&reg; academic program and the Anandamahidol Foundation, respectively.

We would also like to thank @Nan Guoshun for reporting bugs.

<h2 id="Citation">Citation</h2>
@inproceedings{zhangkumjornZeroShot,
    title = "Integrating Semantic Knowledge to Tackle Zero-shot Text Classification",
    author = "Zhang, Jingqing and
      Lertvittayakumjorn, Piyawat and
      Guo, Yike",
    booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Long Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, USA",
    publisher = "Association for Computational Linguistics",
}