Predicting Protein Binding Affinity With Word Embeddings and Recurrent Neural Networks
bioRxiv link to the paper: http://biorxiv.org/content/early/2017/04/18/128223.article-metrics
To recreate the reported results, download this repo, navigate to the main directory, and run bash project_results_embedding.sh followed by bash project_results_rnn.sh. The data is already contained in the /data folder, and the results will appear in the /results directory. Feel free to delete its current contents if you'd like to re-create them yourself.
The bash commands run a variety of models/model parameters and store each run in the results folder. For more info on the experiments run, please refer to the paper submission. Then, run python analyze_results to create the visualizations and CSV summaries.
NOTE: running the above commands will take a LONG time (~36 hours). I'll soon post a script that reproduces just the best-performing models.
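If you prefer to drive these steps from Python, a minimal wrapper could look like the sketch below; the two shell scripts are the ones shipped in this repo, while the assumption that the summary module is named analyze_results.py is mine.

```python
import subprocess

# Run both experiment suites; each writes its runs into /results.
for script in ["project_results_embedding.sh", "project_results_rnn.sh"]:
    subprocess.run(["bash", script], check=True)

# Build the visualizations and CSV summaries from the stored runs
# (assumes the summary module is analyze_results.py).
subprocess.run(["python", "analyze_results.py"], check=True)
```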
Creating models and predictions
The main module responsible for the computations is mhcPreds_tflearn_cmd_line.py. It can be run as a standalone command-line Python program and accepts a variety of options:
mhcPreds_tflearn_cmd_line.py [-h] [-cmd CMD] [-b BATCH_SIZE]
[-bn BATCH_NORM] [-ls LAYER_SIZE]
[-nl NUM_LAYERS] [-d EMBEDDING_SIZE]
[-a ALLELE] [-m MODEL] [-c DATA_ENCODING]
[-r LEARNING_RATE] [-e EPOCHS] [-n NAME]
[-l LEN] [-s SAVE] [--data-dir DATA_DIR]
[--cell-size CELL_SIZE]
[--tensorboard-verbose TENSORBOARD_VERBOSE]
[--from-file FROM_FILE] [--run-id RUN_ID]
For example: mhcPreds_tflearn_cmd_line.py -cmd 'train_test_eval' -e 15 -bn 1 -nl 3 -c 'kmer_embedding' -a 'A0101' -m 'embedding_rnn' -r 0.001
This will run the train, test, and evaluation protocol with 15 epochs, one round of batch normalization, and a learning rate of 0.001. It will run on a subset of the training data set comprised of peptides binding to the HLA-A0101 allele and will transform each k-mer in the data set into a 9-mer. The other default parameters can be seen by running the script with -h.
Results will be stored in the /mhcPreds/results/run_id folder, where run_id is either specified by the user or a randomly selected number between 0 and 10000.
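To make the layout concrete, here is a small sketch (not the project's actual code) of how a results directory could be resolved under the scheme described above:

```python
import os
import random

def results_dir(run_id=None, base="mhcPreds/results"):
    # Fall back to a randomly selected id between 0 and 10000 when none is given.
    if run_id is None:
        run_id = str(random.randint(0, 10000))
    path = os.path.join(base, run_id)
    os.makedirs(path, exist_ok=True)
    return path

print(results_dir())          # e.g. mhcPreds/results/4821
print(results_dir("my_run"))  # mhcPreds/results/my_run
```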
optional arguments:
-h, --help show this help message and exit
-cmd CMD command
-b BATCH_SIZE, --batch-size BATCH_SIZE
-bn BATCH_NORM, --batch-norm BATCH_NORM
Perform batch normalization: only after the LSTM (1), or both before and after the LSTM (2)
-ls LAYER_SIZE, --layer-size LAYER_SIZE
Size of inner layers of the RNN
-nl NUM_LAYERS, --num-layers NUM_LAYERS
Number of LSTM layers
-d EMBEDDING_SIZE, --embedding-size EMBEDDING_SIZE
Embedding layer output dimension
-a ALLELE, --allele ALLELE
Allele to use for prediction. None predicts for all alleles.
-m MODEL, --model MODEL
RNN model: basic LSTM, bidirectional LSTM, or simple RNN
-c DATA_ENCODING, --data-encoding DATA_ENCODING
Data encoding to use: 'one_hot' or 'kmer_embedding'
-r LEARNING_RATE, --learning-rate LEARNING_RATE
learning rate (default 0.001)
-e EPOCHS, --epochs EPOCHS
number of training epochs
-n NAME, --name NAME name of model, used when generating default weights filenames
-l LEN, --len LEN size of k-mer to predict on
-s SAVE, --save SAVE Save model to --data-dir
--data-dir DATA_DIR directory to use for saving models
--cell-size CELL_SIZE
size of RNN cell to use (default 32)
--tensorboard-verbose TENSORBOARD_VERBOSE
tensorboard verbosity level (default 0)
--run-id RUN_ID Name of run to be displayed in tensorboard and results folder
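For reference, the listing above corresponds roughly to an argparse definition like the sketch below. The flag names and the documented defaults (learning rate 0.001, cell size 32, tensorboard verbosity 0) come from the help text; everything else, including the choices drawn from NOTES-1 below, is illustrative rather than a copy of the project's source.

```python
import argparse

parser = argparse.ArgumentParser(prog="mhcPreds_tflearn_cmd_line.py")
parser.add_argument("-cmd", help="command, e.g. 'train_test_eval'")
parser.add_argument("-a", "--allele", help="allele to predict for; omit to predict for all alleles")
parser.add_argument("-m", "--model", choices=["deep_rnn", "embedding_rnn", "bi_rnn"])
parser.add_argument("-c", "--data-encoding", choices=["one_hot", "kmer_embedding"])
parser.add_argument("-r", "--learning-rate", type=float, default=0.001)
parser.add_argument("-e", "--epochs", type=int)
parser.add_argument("-l", "--len", type=int, help="size of k-mer to predict on")
parser.add_argument("--cell-size", type=int, default=32)
parser.add_argument("--tensorboard-verbose", type=int, default=0)
parser.add_argument("--run-id", help="name of the run / results sub-folder")

# Parse a subset of the example command shown earlier.
args = parser.parse_args(["-cmd", "train_test_eval", "-e", "15",
                          "-c", "kmer_embedding", "-a", "A0101",
                          "-m", "embedding_rnn", "-r", "0.001"])
print(args.model, args.epochs, args.learning_rate)
```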
NOTES-1:
Here's a list of possible options for some of the parameters.
POSSIBLE_ALLELES = ['A3101', 'B1509', 'B2703', 'B1517', 'B1801', 'B1501', 'B4002', 'B3901', 'B5701', 'A6801',
'B5301', 'A2301', 'A2902', 'B0802', 'A3001', 'A0301', 'A0202', 'A0101', 'B4001', 'B5101',
'A1101', 'B4402', 'B0803', 'B5801', 'A2601', 'A0203', 'A3002', 'B4601', 'A3301', 'A6802',
'B3801', 'A3201', 'B3501', 'A2603', 'B0702', 'A6901', 'B0801', 'B4501', 'A0206', 'A0201',
'B1503', 'A2602', 'A8001', 'A2402', 'B2705', 'B4403', 'A2501', 'B5401']
TRAIN_DEFAULTS = ['A0201', 'A0301', 'A0203', 'A1101', 'A0206', 'A3101']
AVAILABLE_MODELS = ['deep_rnn', 'embedding_rnn', 'bi_rnn']
DATA_ENCODINGS = ['one_hot', 'kmer_embedding']
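To illustrate the two DATA_ENCODINGS, here is a minimal sketch of how a peptide could be turned into model input under each scheme. It assumes the standard 20-letter amino-acid alphabet and zero-padding to a fixed 9-mer; the project's exact index and padding conventions may differ.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 reserved for padding

def kmer_embedding_encode(peptide, length=9):
    """One integer index per residue, padded/truncated to a fixed k-mer length;
    indices like these feed the embedding layer of 'embedding_rnn'."""
    idx = [AA_INDEX[aa] for aa in peptide[:length]]
    return np.array(idx + [0] * (length - len(idx)))

def one_hot_encode(peptide, length=9):
    """One 20-dimensional indicator vector per residue position."""
    mat = np.zeros((length, len(AMINO_ACIDS)))
    for pos, aa in enumerate(peptide[:length]):
        mat[pos, AMINO_ACIDS.index(aa)] = 1.0
    return mat

print(kmer_embedding_encode("SIINFEKL"))   # 8-mer padded to 9 indices
print(one_hot_encode("SIINFEKL").shape)    # (9, 20)
```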
NOTES-2:
- embedding_rnn requires no parameters referring to an RNN, since an embedding layer + hidden layers has been found sufficient to obtain good accuracy; adding recurrent layers for the most part hurts performance.
- Similarly, 'embedding_rnn' requires 'kmer_embedding' as its data encoding and cannot be used with one_hot.
- one_hot encoding allows the user to specify a variety of different architectures (see the tflearn sketch below), including:
  - bi-directional RNN with a user-defined layer size
  - deep LSTM with a user-defined number of LSTM layers
  - simple RNN with a user-defined layer size
- One-hot encoding usually leads to slower training due to the increased feature dimension.
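As a rough illustration of these architecture options, here is a hedged tflearn sketch of the two families: the recurrence-free 'embedding_rnn' and a 'bi_rnn' on one-hot input. Layer sizes, activations, and the loss are assumptions for the example, not the project's actual hyper-parameters, and the input format matches the encoding sketch above.

```python
import tflearn

def embedding_model(kmer_len=9, vocab=21, embed_dim=32):
    # 'embedding_rnn': embedding layer + fully connected hidden layers, no recurrence.
    net = tflearn.input_data(shape=[None, kmer_len])
    net = tflearn.embedding(net, input_dim=vocab, output_dim=embed_dim)
    net = tflearn.fully_connected(net, 64, activation='relu')
    net = tflearn.fully_connected(net, 1, activation='sigmoid')  # affinity target, assumed rescaled to [0, 1]
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.001, loss='mean_square')
    return tflearn.DNN(net, tensorboard_verbose=0)

def bi_rnn_model(kmer_len=9, n_features=20, cell_size=32):
    # 'bi_rnn' on one-hot input: one 20-dim indicator vector per residue.
    net = tflearn.input_data(shape=[None, kmer_len, n_features])
    net = tflearn.bidirectional_rnn(net, tflearn.BasicLSTMCell(cell_size),
                                    tflearn.BasicLSTMCell(cell_size))
    net = tflearn.fully_connected(net, 1, activation='sigmoid')
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.001, loss='mean_square')
    return tflearn.DNN(net, tensorboard_verbose=0)
```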