Awesome
Learning to Write
The official repo for Learning to Write, published in ACL, 2018.
You can view samples at our demo site.
If you use this in your own work, please cite us.
Requirements
# (0) Setup fresh python3 environment using your method of choice.
# (1) Install pytorch 0.4 using instructions from pytorch.org
# (2) Install torchtext at specific commit
cd ../
git clone https://github.com/pytorch/text.git
cd text/
git reset --hard 36310207f5ca45c87e3192ace320353816ead618
cd ../l2w/
pip3 install ../text/
Generating Pre-Trained Models
You can download pre-trained models and sample data here and here. Unzip them, and put them in the root of the repo.
You can then generate by running:
# TorontoBooks
python generate.py --data data/tbooks_sample.txt --lm models/tbooks/lm.pt --dic models/tbooks/vocab.pickle --print --cuda --scorers models/tbooks/best_scorer_weights.tsv
# TripAdvisor
python generate.py --data data/trip_sample.txt --lm models/trip/lm.pt --dic models/trip/vocab.pickle --print --cuda --scorers models/trip/best_scorer_weights.tsv
Training Your Own
Split Data
Split data into lm-train, disc-train, valid, and test
python scripts/split_data.py /path/to/data.txt /path/to/dataset/directory/
Build a Shared Dictionary
The base language model and all discriminators use the same vocabulary, so we have to build it ahead of time.
python utils/make_dic.py /path/to/training_set.txt path/to/save/vocab.pickle --max_vocab 100000
Train Language Model
We have to train the base generator first, because two of the discriminators rely on generations from the LM for their training data.
python adaptive_softmax/train.py --cuda --data /path/to/data --dic /path/to/dictionary --cutoffs 4000 40000 --nlayers 2
Train Discriminators
Data
First, let's build all the required data files for training all the discriminators.
For the main data, you need to run this script and generate from the language model.
# (1) Run the main processing script. Options for different kinds of datasets viewable using --help
python scripts/make_cc_version.py /path/to/data/
# (2) Run the script that generates data from the LM
bash scripts/gen_lm_data.sh /path/to/data/ /path/to/lm.pt /path/to/vocab.pickle
For the entailment data, first concatenate all the '.txt' version of all the SNLI and MultiNLI data (including train, dev, and test), but watchout to not include the column headers. Then
# (1) Format the data
python scripts/create_nli_dataset.py /path/to/concatenated/data.txt /path/to/nli_output.tsv
# (2) Split the data
python scripts/split_data.py /path/to/nli_output.tsv /path/to/nli_data/ --no_disc_train --valid_frac 0.1 --test_frac 0.1
Repetition
# (1) Make rep data
python scripts/create_classifier_dataset.py /path/to/disc_data/ /path/to/save/rep_data/ --comp lm
# (2) Train model
python trainers/train_classifier.py /path/to/rep_data/ --save_to /path/to/save/model.pt --dic /path/to/vocab.pickle --fix_embeddings --adam --ranking_loss --train_prefixes --decider_type reprnn
Entailment
The entailment data was already generated in the "Data" section, so now we can just train the model.
python trainers/train_entailment_classifier.py /path/to/nli_data/ --save_to /path/to/save/model.pt --dic /path/to/vocab.pickle --adagrad --batch_size 16 --lr 1 --num_epochs 100
Relevance
# (1) Make rel data
python scripts/create_classifier_dataset.py /path/to/disc_data/ /path/to/save/rel_data/ --comp random
# (2) Train model
python trainers/train_classifier.py /path/to/rel_data/ --save_to /path/to/save/model.pt --dic /path/to/vocab.pickle \
--decider_type cnncontext --adam --ranking_loss --train_prefixes
Lexical Style
The lexical style module uses the exact same data as the repetition module, but doesn't view data as sequences of cosine similarities. Thus, we can train it on the data we made for the repetition classifier:
# (1) Train model
python trainers/train_classifier.py /path/to/rep_data/ --save_to /path/to/save/model.pt --dic /path/to/vocab.pickle \
--decider_type poolending --adam --ranking_loss --train_prefixes
Train Discriminator Weightings
First you have to make a weights file, in the following (tab separated) format:
1 SCORER_PATH SCORER_CLASS /path/to/model.pt
SCORER_PATH and SCORER_CLASS are word_rep.context_scorer & ContextScorer respectively for all modules, except the entailment module. For the entailment module SCORER_PATH and SCORER_CLASS are entailment.entail_scorer_new & EntailmentScorer.
For an example, see our pre-trained models.
Once you have a scorer_weights.tsv simply run:
python scripts/create_classifier_dataset.py /path/to/disc_data/ /path/to/save/weight_data/ --comp none
python generate.py --cuda --data /path/to/weight_data/valid.tsv --lm /path/to/lm.pt --dic /path/to/vocab.pickle --scorers /path/to/scorer_weights.tsv --print --learn
Generate
python generate.py --cuda --data /path/to/weight_data/test.tsv --lm /path/to/lm.pt --dic /path/to/vocab.pickle --scorers /path/to/scorer_weights.tsv --print