
PyTorch implementations of NLP papers relevant to classification

The papers were implemented using Korean corpora.

Preliminary & Usage

Preliminary

pyenv virtualenv 3.7.7 nlp
pyenv activate nlp
pip install -r requirements.txt

Usage

python build_dataset.py
python build_vocab.py
python train.py # default training parameters
python evaluate.py # default evaluation parameters
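
The four steps above presumably depend on each other's outputs (dataset splits, then the vocabulary, then a trained checkpoint), so they should run in this order. A minimal sketch of a driver, assuming it is launched from one model's directory; `pipeline_commands` and `run_pipeline` are hypothetical helpers, not part of the repository:

```python
import subprocess
import sys

# Script names are taken from the usage section above, in the order shown there.
STEPS = ["build_dataset.py", "build_vocab.py", "train.py", "evaluate.py"]


def pipeline_commands(python=sys.executable):
    """Return one command (as an argument list) per pipeline step."""
    return [[python, script] for script in STEPS]


def run_pipeline(cwd="."):
    """Run every step with its default parameters, stopping at the first failure."""
    for command in pipeline_commands():
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, cwd=cwd, check=True)


# Usage (hypothetical): run_pipeline("path/to/one/model/directory")
```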

Single sentence classification (sentiment classification task)

Uses the Naver sentiment movie corpus v1.0 (a.k.a. nsmc).
Configuration
- conf/model/{type}.json (e.g. type = ["sencnn", "charcnn",...])
- conf/dataset/nsmc.json
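
Each configuration file can be read into a plain dictionary. A minimal sketch, assuming the files are ordinary JSON objects; the `load_config` helper is hypothetical, and the key names inside each file are not documented here, so inspect the actual files before relying on any of them:

```python
import json
from pathlib import Path


def load_config(conf_dir, kind, name):
    """Load <conf_dir>/<kind>/<name>.json, e.g. kind="model", name="sencnn"."""
    path = Path(conf_dir) / kind / f"{name}.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)


# Usage (run from a model directory that contains conf/):
# model_conf = load_config("conf", "model", "sencnn")
# dataset_conf = load_config("conf", "dataset", "nsmc")
```
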
Structure

# example: Convolutional_Neural_Networks_for_Sentence_Classification
├── build_dataset.py
├── build_vocab.py
├── conf
│   ├── dataset
│   │   └── nsmc.json
│   └── model
│       └── sencnn.json
├── evaluate.py
├── experiments
│   └── sencnn
│       └── epochs_5_batch_size_256_learning_rate_0.001
├── model
│   ├── data.py
│   ├── __init__.py
│   ├── metric.py
│   ├── net.py
│   ├── ops.py
│   ├── split.py
│   └── utils.py
├── nsmc
│   ├── ratings_test.txt
│   ├── ratings_train.txt
│   ├── test.txt
│   ├── train.txt
│   ├── validation.txt
│   └── vocab.pkl
├── train.py
└── utils.py
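
The run directory under experiments/ encodes the training hyperparameters in its name. A minimal sketch of how such a path could be built; `experiment_dir` is a hypothetical helper, and the repository may construct the name differently:

```python
from pathlib import Path


def experiment_dir(model_type, epochs, batch_size, learning_rate, root="experiments"):
    """Build <root>/<model_type>/epochs_<E>_batch_size_<B>_learning_rate_<LR>."""
    run_name = f"epochs_{epochs}_batch_size_{batch_size}_learning_rate_{learning_rate}"
    return Path(root) / model_type / run_name


print(experiment_dir("sencnn", 5, 256, 0.001).as_posix())
# prints experiments/sencnn/epochs_5_batch_size_256_learning_rate_0.001
```

Note that very small learning rates are rendered in scientific notation by Python (1e-05 rather than 0.00001), so the directory name follows whatever string form the value takes.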

Model \ Accuracy	Train (120,000)	Validation (30,000)	Test (50,000)	Date
SenCNN	91.95%	86.54%	85.84%	20/05/30
CharCNN	86.29%	81.69%	81.38%	20/05/30
ConvRec	86.23%	82.93%	82.43%	20/05/30
VDCNN	86.59%	84.29%	84.10%	20/05/30
SAN	90.71%	86.70%	86.37%	20/05/30
ETRIBERT	91.12%	89.24%	88.98%	20/05/30
SKTBERT	92.20%	89.08%	88.96%	20/05/30
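
Every figure in the table is plain classification accuracy: the share of examples whose predicted label equals the gold label. A minimal pure-Python sketch of that metric; the `accuracy` function and the sample labels are illustrative and are not taken from the repository's model/metric.py:

```python
def accuracy(predictions, labels):
    """Fraction of predicted class ids that match the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Four of these five hypothetical predictions are correct.
print(f"{accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]):.2%}")  # prints 80.00%
```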

Convolutional Neural Networks for Sentence Classification (as SenCNN)
- https://arxiv.org/abs/1408.5882
Character-level Convolutional Networks for Text Classification (as CharCNN)
- https://arxiv.org/abs/1509.01626
Efficient Character-level Document Classification by Combining Convolution and Recurrent Layers (as ConvRec)
- https://arxiv.org/abs/1602.00367
Very Deep Convolutional Networks for Text Classification (as VDCNN)
- https://arxiv.org/abs/1606.01781
A Structured Self-attentive Sentence Embedding (as SAN)
- https://arxiv.org/abs/1703.03130
BERT_single_sentence_classification (as ETRIBERT, SKTBERT)
- https://arxiv.org/abs/1810.04805
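
To make the first architecture in the list concrete, here is a minimal PyTorch sketch of the SenCNN idea: parallel convolutions of several widths over word embeddings, max-over-time pooling, dropout, and a linear classifier. The filter widths (3, 4, 5), 100 filters per width, and dropout of 0.5 follow the paper's defaults; the class name, the embedding size, and the use of index 0 for padding are assumptions, and the repository's model/net.py may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SenCNNSketch(nn.Module):
    """Illustrative sentence CNN; not the repository's implementation."""

    def __init__(self, vocab_size, num_classes, embedding_dim=128,
                 kernel_sizes=(3, 4, 5), num_filters=100, dropout=0.5):
        super().__init__()
        # Assumption: token id 0 is reserved for padding.
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embedding_dim, num_filters, kernel_size=k) for k in kernel_sizes]
        )
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):
        # (batch, seq_len) -> (batch, embedding_dim, seq_len), as Conv1d expects.
        embedded = self.embedding(token_ids).transpose(1, 2)
        # Max-over-time pooling keeps the strongest response of each filter.
        pooled = [torch.max(F.relu(conv(embedded)), dim=2)[0] for conv in self.convs]
        features = torch.cat(pooled, dim=1)
        return self.classifier(self.dropout(features))


# Usage: sequences must be at least as long as the widest kernel (5 here).
model = SenCNNSketch(vocab_size=50, num_classes=2)
logits = model(torch.randint(0, 50, (4, 20)))  # shape: (4, 2)
```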

Pairwise text classification (paraphrase detection task)

The dataset is created from https://github.com/songys/Question_pair
Configuration
- conf/model/{type}.json (e.g. type = ["siam", "san",...])
- conf/dataset/qpair.json
Structure

# example: Siamese_recurrent_architectures_for_learning_sentence_similarity
├── build_dataset.py
├── build_vocab.py
├── conf
│   ├── dataset
│   │   └── qpair.json
│   └── model
│       └── siam.json
├── evaluate.py
├── experiments
│   └── siam
│       └── epochs_5_batch_size_64_learning_rate_0.001
├── model
│   ├── data.py
│   ├── __init__.py
│   ├── metric.py
│   ├── net.py
│   ├── ops.py
│   ├── split.py
│   └── utils.py
├── qpair
│   ├── kor_pair_test.csv
│   ├── kor_pair_train.csv
│   ├── test.txt
│   ├── train.txt
│   ├── validation.txt
│   └── vocab.pkl
├── train.py
└── utils.py

Model \ Accuracy	Train (6,136)	Validation (682)	Test (758)	Date
Siam	93.00%	83.13%	83.64%	20/05/30
SAN	89.47%	82.11%	81.53%	20/05/30
Stochastic	89.26%	82.69%	80.07%	20/05/30
ETRIBERT	95.07%	94.42%	94.06%	20/05/30
SKTBERT	95.43%	92.52%	93.93%	20/05/30
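
The split sizes in the table add up to 7,576 pairs and are consistent with two successive 10% holdouts, rounded up: first the test set, then the validation set from what remains. This is an inference from the numbers, not a description of the repository's build_dataset.py. A minimal sketch of that arithmetic, with a hypothetical `holdout_split` helper and integer stand-ins for the question pairs:

```python
import math
import random


def holdout_split(items, holdout_ratio, seed=0):
    """Shuffle a copy of items and split off ceil(ratio * n) of them."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_holdout = math.ceil(len(shuffled) * holdout_ratio)
    return shuffled[n_holdout:], shuffled[:n_holdout]


pairs = list(range(7576))  # stand-in for the 7,576 question pairs
rest, test = holdout_split(pairs, 0.1)
train, validation = holdout_split(rest, 0.1)
print(len(train), len(validation), len(test))  # prints 6136 682 758
```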

A Structured Self-attentive Sentence Embedding (as SAN)
- https://arxiv.org/abs/1703.03130
Siamese Recurrent Architectures for Learning Sentence Similarity (as Siam)
- https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/viewPaper/12195
Stochastic Answer Networks for Natural Language Inference (as Stochastic)
- https://arxiv.org/abs/1804.07888
BERT_pairwise_text_classification (as ETRIBERT, SKTBERT)
- https://arxiv.org/abs/1810.04805
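
For the Siam entry, the paper scores a sentence pair by encoding both sentences with a shared LSTM and comparing the two final hidden states with exp(-L1 distance). A minimal pure-Python sketch of that similarity function only; `manhattan_similarity` is an illustrative name, the vectors stand in for LSTM outputs, and the repository's classification head may differ:

```python
import math


def manhattan_similarity(h_a, h_b):
    """exp(-||h_a - h_b||_1): 1.0 for identical vectors, near 0 for distant ones."""
    if len(h_a) != len(h_b):
        raise ValueError("vectors must have the same dimension")
    l1_distance = sum(abs(a - b) for a, b in zip(h_a, h_b))
    return math.exp(-l1_distance)


print(manhattan_similarity([0.2, -0.5], [0.2, -0.5]))  # prints 1.0
```

Because the score lies in (0, 1], it can be thresholded or rescaled to produce a paraphrase / non-paraphrase decision.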