Linguistic Adversity
Introduction
This repository is an implementation of the following work:
Li, Yitong, Trevor Cohn and Timothy Baldwin (2017) Robust Training under Linguistic Adversity. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017), Valencia, Spain.
Repository Structure
In this repository, files are organised into three folders:
- data/: the original datasets
- noise_generator/: the linguistic noise generators
- text_cnn/: the convolutional neural network model for sentence-level classification tasks
Details are as follows:
data/
This folder contains the original datasets:
- movie review dataset (Pang and Lee, 2008)
- customer review dataset (Hu and Liu, 2004)
- subjectivity dataset (Pang and Lee, 2005)
- Stanford Sentiment Treebank (Socher et al., 2013)
More sentiment analysis datasets can be found at HarvardNLP.
Please cite the original papers when you use the data.
noise_generator/
This folder contains four different linguistic noise generators, one per sub-folder.
Please refer to the README file in each sub-folder for instructions on running the corresponding noise generator.
./WN/
This folder contains the code for the semantic noise generator based on WordNet (a minimal sketch of the idea follows the dependency list below).
To run the WordNet noise generator, you need the following dependencies:
- NLTK, with the following packages downloaded:
  - averaged_perceptron_tagger
  - punkt
  - stopwords
  - universal_tagset
  - wordnet
- Numpy
- kenLM, with a pre-built n-gram language model
- Stanford-ner
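The sketch below only illustrates the core idea of WordNet-based synonym substitution; it is not the code in this folder, which additionally uses kenLM scoring and Stanford NER filtering, and its function name and probability parameter are invented for the example.

```python
# Minimal, illustrative sketch of WordNet-based synonym substitution.
# The actual generator in this folder also uses kenLM scoring and
# Stanford NER filtering; names here are invented for the example.
import random

import nltk
from nltk.corpus import stopwords, wordnet


def wordnet_noise(sentence, replace_prob=0.5):
    """Replace content words with a random WordNet synonym of the same POS."""
    stop = set(stopwords.words("english"))
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens, tagset="universal")
    # Map universal POS tags to WordNet POS tags.
    pos_map = {"NOUN": wordnet.NOUN, "VERB": wordnet.VERB,
               "ADJ": wordnet.ADJ, "ADV": wordnet.ADV}
    noisy = []
    for word, tag in tagged:
        wn_pos = pos_map.get(tag)
        if wn_pos is None or word.lower() in stop or random.random() > replace_prob:
            noisy.append(word)
            continue
        # Candidate lemmas from all synsets with the same part of speech.
        candidates = {lemma.name().replace("_", " ")
                      for synset in wordnet.synsets(word, pos=wn_pos)
                      for lemma in synset.lemmas()
                      if lemma.name().lower() != word.lower()}
        noisy.append(random.choice(sorted(candidates)) if candidates else word)
    return " ".join(noisy)


print(wordnet_noise("the movie was surprisingly good"))
```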
./CFit/
Based on the idea of counter-fitting word vectors (a minimal sketch of the substitution idea follows the dependency list below).
Dependencies:
- NLTK
- Numpy
- kenLM
- Stanford-ner
- pre-trained counter-fitting dictionary
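As a rough illustration of substitution with counter-fitted vectors, the sketch below loads word vectors from a plain-text file and swaps each word for its nearest neighbour by cosine similarity. The file name, format assumption, and helper functions are illustrative; the pipeline in this folder additionally applies kenLM and NER filtering.

```python
# Illustrative sketch of word substitution with counter-fitted word vectors.
# Assumes a plain-text vector file ("word v1 v2 ..." per line); the pipeline
# in this folder additionally filters candidates with kenLM and Stanford NER.
import numpy as np


def load_vectors(path):
    """Load word vectors from a whitespace-separated text file."""
    words, vecs = [], []
    with open(path) as f:
        for line in f:
            parts = line.rstrip().split()
            words.append(parts[0])
            vecs.append(np.array(parts[1:], dtype=np.float32))
    mat = np.vstack(vecs)
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)  # unit length for cosine
    return words, mat


def nearest_neighbour(word, words, mat):
    """Return the closest vocabulary word by cosine similarity, or the word itself."""
    if word not in words:
        return word
    idx = words.index(word)
    sims = mat.dot(mat[idx])
    sims[idx] = -1.0  # exclude the word itself
    return words[int(np.argmax(sims))]


# Hypothetical usage; the vector file path is an assumption.
words, mat = load_vectors("counter-fitted-vectors.txt")
print(" ".join(nearest_neighbour(w, words, mat) for w in "the movie was good".split()))
```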
./ERG/
Based on the English Resource Grammar (ERG) system and ACE. Dependencies:
- ERG
- ACE
./Comp/
Based on a sentence compression method.
text_cnn/
A convolutional neural network model for text classification tasks. The model is based on Yoon Kim's Convolutional Neural Networks for Sentence Classification and Denny Britz's implementation (https://github.com/dennybritz/cnn-text-classification-tf). Note that the code has been implemented and tested with TensorFlow r1.0 and Python 2.7, and may not run under other versions.
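For orientation, the snippet below sketches the Kim-style architecture (embedding lookup, parallel convolutions with several filter widths, max-over-time pooling, dropout, and a final dense layer) in TensorFlow r1.0. It is an illustration only, not the repository's text_cnn code, and its function name and hyper-parameter defaults are assumptions.

```python
# Illustrative sketch of a Kim-style CNN text classifier in TensorFlow r1.0.
# This is not the repository's text_cnn code; names and defaults are assumptions.
import tensorflow as tf


def build_text_cnn(vocab_size, seq_len, num_classes, embed_dim=128,
                   filter_sizes=(3, 4, 5), num_filters=100, keep_prob=0.5):
    """Embedding -> parallel convolutions -> max-over-time pooling -> dropout -> logits."""
    input_x = tf.placeholder(tf.int32, [None, seq_len], name="input_x")
    embedding = tf.get_variable("embedding", [vocab_size, embed_dim])
    # Shape: [batch, seq_len, embed_dim, 1] so 2-D convolutions apply.
    embedded = tf.expand_dims(tf.nn.embedding_lookup(embedding, input_x), -1)

    pooled = []
    for size in filter_sizes:
        conv = tf.layers.conv2d(embedded, num_filters, [size, embed_dim],
                                activation=tf.nn.relu)
        # Max-over-time pooling collapses the (time, width) axes.
        pooled.append(tf.reduce_max(conv, axis=[1, 2]))
    features = tf.nn.dropout(tf.concat(pooled, 1), keep_prob)
    logits = tf.layers.dense(features, num_classes, name="logits")
    return input_x, logits
```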
Requirements
To run the CNN code, you need the following dependencies:
- Python 2.7
- Numpy
- Tensorflow r1.0
Running the code
python train.py [parameters]
Parameters:
--dataset
The training dataset (default: "mr")
--noise_type
Type of noise (default: "raw")
--is_noise_train
Whether to train on the noisy data (default: False)
--is_noise_test
Whether to test on the noisy data (default: False)
--l2_reg_lambda
Model L2 regularization lambda (default: 0)
--dropout_keep_prob
Dropout keep probability (default: 0.5)
For example, to train the model with the following settings:
nice python train.py --dataset="mr" --is_noise_train=False --is_noise_test=False --noise_type="cf=0.5" --dropout_keep_prob=1.0
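The parameters above are ordinary TensorFlow command-line flags; the sketch below shows how such flags could be declared and read with tf.flags in TensorFlow r1.0. It mirrors the flag names listed above but is illustrative, not the repository's actual train.py.

```python
# Illustrative only: how the flags above could be declared with tf.flags in
# TensorFlow r1.0 (this is not the repository's actual train.py).
import tensorflow as tf

tf.flags.DEFINE_string("dataset", "mr", "Training dataset")
tf.flags.DEFINE_string("noise_type", "raw", "Type of noise")
tf.flags.DEFINE_boolean("is_noise_train", False, "Train on the noisy data")
tf.flags.DEFINE_boolean("is_noise_test", False, "Test on the noisy data")
tf.flags.DEFINE_float("l2_reg_lambda", 0.0, "L2 regularization lambda")
tf.flags.DEFINE_float("dropout_keep_prob", 0.5, "Dropout keep probability")

FLAGS = tf.flags.FLAGS


def main(_):
    # Values come from the command line, e.g. --dataset="mr" --noise_type="cf=0.5".
    print("dataset=%s noise_type=%s keep_prob=%.2f"
          % (FLAGS.dataset, FLAGS.noise_type, FLAGS.dropout_keep_prob))


if __name__ == "__main__":
    tf.app.run()
```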
Also, refer to run_script_sample_on_subj.sh to get a better sense of training with different types of noise.
Contact us
Please email us with any questions. All comments are welcome.