EPIANN

Inspired by machine translation models, we developed EPIANN, an attention-based neural network model.

(Figure: schematic overview of EPIANN.)

Data Augmentation

There are 6 cell lines (celline = GM12878, HUVEC, HeLa-S3, IMR90, K562 or NHEK), and each comes with its own folder. Within each folder there is a single file, celline.csv, which is a renamed copy of

<p align="center"> https://<i></i>github.com/shwhalen/targetfinder/tree/master/paper/targetfinder/<b>celline</b>/output-ep/pairs.csv </p>

Before we actually train the neural network model, we need to generate input data from the genomic coordinates (hg19) of enhancers and promoters, along with the EPI indicators recorded in celline.csv. Data_Augmentation.R implements an automated data augmentation pipeline whose parameters are listed in the following table (a conceptual sketch of the augmentation follows the table).

| Parameter | Explanation |
| --- | --- |
| celline | change it to one of the 6 cell lines; default = "IMR90" |
| folder | the name of the folder that holds all output files; default = "aug_50" |
| shift_distance | the step size used to slide the extended region around the enhancer and promoter; default = 50 |
| enhancer_target_length | the length of the extended enhancer; default = 3000 |
| promoter_target_length | the length of the extended promoter; default = 2000 |
| positive_scalar | the augmentation ratio; default = 20 |
| test_percent | the fraction of all data held out as the test set; default = 0.1 |
| random_seed | the random seed used to sample the test data; default = 1 |
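The augmentation itself is implemented in Data_Augmentation.R. As a conceptual illustration only (not the actual R code; the function name, coordinates and shift sampling below are made up), the idea of sliding extended windows around a positive pair can be sketched in Python as:

```python
import random

def augment_pair(enh_start, enh_end, prom_start, prom_end,
                 enhancer_target_length=3000, promoter_target_length=2000,
                 shift_distance=50, positive_scalar=20, seed=1):
    """Conceptual sketch only: extend each element to a fixed target length,
    then slide the extended windows in steps of shift_distance to create
    positive_scalar augmented copies of one positive pair."""
    random.seed(seed)

    def extend(start, end, target_len):
        center = (start + end) // 2
        return center - target_len // 2, center + target_len // 2

    e0, e1 = extend(enh_start, enh_end, enhancer_target_length)
    p0, p1 = extend(prom_start, prom_end, promoter_target_length)

    pairs = []
    for _ in range(positive_scalar):
        # random shifts in multiples of shift_distance (the +/-5 range is made up)
        de = shift_distance * random.randint(-5, 5)
        dp = shift_distance * random.randint(-5, 5)
        pairs.append(((e0 + de, e1 + de), (p0 + dp, p1 + dp)))
    return pairs

# Example with made-up hg19 coordinates: 20 shifted copies of one positive pair
print(augment_pair(1000200, 1000800, 1050100, 1050500)[:2])
```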

You can find the output files generated with the default parameters under the directory IMR90/aug_50/. The following files are currently not available in the GitHub repository because of the file size limit; making them available is work in progress.

IMR90/aug_50/IMR90_enhancer.fasta
IMR90/aug_50/IMR90_promoter.fasta
IMR90/aug_50/imbalanced/IMR90_enhancer.fasta
IMR90/aug_50/imbalanced/IMR90_promoter.fasta
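Biopython (listed under the dependencies below) can read these FASTA files. The snippet below is a minimal sketch of loading the augmented enhancer sequences and one-hot encoding them into fixed-length arrays; the base ordering and the one_hot helper are assumptions for illustration, not part of the repository.

```python
import numpy as np
from Bio import SeqIO

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed base ordering

def one_hot(seq, length):
    """One-hot encode a DNA string into a (length, 4) array; N and other symbols map to all zeros."""
    arr = np.zeros((length, 4), dtype=np.float32)
    for i, base in enumerate(seq[:length].upper()):
        j = BASE_INDEX.get(base)
        if j is not None:
            arr[i, j] = 1.0
    return arr

# Example: load the augmented enhancer sequences listed above
enhancers = [one_hot(str(rec.seq), 3000)
             for rec in SeqIO.parse("IMR90/aug_50/IMR90_enhancer.fasta", "fasta")]
enhancers = np.stack(enhancers)  # shape: (number_of_pairs, 3000, 4)
```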

Train Neural Network Model

Under the directory IMR90/, you can find an example Python script, IMR90_EPIANN.py, with the default settings. The parameters regarding its inputs are explained in the following table (a sketch of these settings follows the table).

| Parameter | Explanation |
| --- | --- |
| celline | change it to one of the 6 cell lines; default = 'IMR90' |
| file_pre | change it to the folder containing the augmented data; default = 'aug_50/IMR90' |
| out_dir | change it to the folder that will contain the output; default = 'output/IMR90_EPIANN' |
| script_id | change it to the name of the current Python script, to distinguish the outputs of multiple runs; default = 'IMR90_EPIANN' |
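These settings are ordinary variables near the top of the script. A sketch of the defaults (variable names follow the table; the exact layout inside IMR90_EPIANN.py may differ):

```python
# Input-related settings (defaults from the table above)
celline   = 'IMR90'                # one of the 6 cell lines
file_pre  = 'aug_50/IMR90'         # folder/prefix of the augmented data
out_dir   = 'output/IMR90_EPIANN'  # folder that will contain the outputs
script_id = 'IMR90_EPIANN'         # distinguishes the outputs of multiple runs
```

To train on another cell line, copy and rename the script and change celline, file_pre, out_dir and script_id accordingly.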

The computational graph for the neural network is programmed using TensorFlow. On our setup, we use a single NVIDIA GTX 1080 or NVIDIA TITAN X with 5 CPU threads, and a single batch takes about 6 seconds to train. All neural network parameters can be altered in the script; they are explained in the table further below.
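If you want to reproduce this setup, one way to pin TensorFlow to a single GPU and cap it at 5 CPU threads is the session configuration below; this is a generic TensorFlow 1.x snippet, not code taken from IMR90_EPIANN.py.

```python
import os
import tensorflow as tf

# Expose only one GPU to TensorFlow (e.g. the first GTX 1080 / TITAN X)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Cap TensorFlow's CPU thread pools at 5 threads each
config = tf.ConfigProto(intra_op_parallelism_threads=5,
                        inter_op_parallelism_threads=5)
config.gpu_options.allow_growth = True  # allocate GPU memory on demand

# ... build the computational graph here ...

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # ... run the training loop here ...
```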

| Neural network parameter | Explanation |
| --- | --- |
| enhancer_length | the length of input enhancers; default = 3000 |
| promoter_length | the length of input promoters; default = 2000 |
| BATCH_SIZE | half of the actual batch size; default = 32 |
| num_filters | the number of convolution filters; default = 256 |
| e_conv_width | the convolution filter width; default = 15 |
| dropout_rate_cnn | the dropout rate for the convolution layer; default = 0.2 |
| dropout_rate | the dropout rate for all layers except the convolution layer; default = 0.2 |
| pool_width | the max-pooling size; default = 30 |
| atten_hyper | the dimension of the attention-related parameters; default = 32 |
| dense_neuron_coor | the dimensions of the fully connected layers for coordinate prediction; default = [128, 64] |
| inter_dim | the dimension of the interaction-quantification parameters; default = 1 |
| topk | the top-k pooling size; default = 32 |
| dense_neuron | the dimension of the fully connected layers; default = 32 |
| lamb | the hyperparameter that balances the cross-entropy error and the regression error; default = 10 |
| num_of_epoch | the number of epochs; default = 90 |
| output_step | how often performance on the test set is reported; default = every 500 batches |
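To make a few of these parameters concrete, here is a minimal TensorFlow 1.x sketch showing where num_filters, e_conv_width, pool_width, dropout_rate_cnn and lamb enter the computation. It is an illustration only, with a simplified pooling and prediction head, and is not the actual EPIANN graph (which adds the attention and interaction-quantification layers and top-k pooling).

```python
import tensorflow as tf

# Defaults from the table above
enhancer_length = 3000
num_filters, e_conv_width, pool_width = 256, 15, 30
dropout_rate_cnn, lamb = 0.2, 10.0

# One-hot encoded enhancer batch: (batch, length, 4 channels)
enh_in = tf.placeholder(tf.float32, [None, enhancer_length, 4])
labels = tf.placeholder(tf.float32, [None])   # EPI indicator (0/1)
is_training = tf.placeholder(tf.bool, [])

# Convolution + max-pooling branch (the promoter branch would mirror this)
conv = tf.layers.conv1d(enh_in, filters=num_filters,
                        kernel_size=e_conv_width, activation=tf.nn.relu)
conv = tf.layers.dropout(conv, rate=dropout_rate_cnn, training=is_training)
pooled = tf.layers.max_pooling1d(conv, pool_size=pool_width, strides=pool_width)

# Simplified head: global max pooling stands in for EPIANN's top-k pooling (topk=32)
feat = tf.reduce_max(pooled, axis=1)
logits = tf.squeeze(tf.layers.dense(feat, 1), axis=1)

# Classification loss plus a coordinate-regression term weighted by lamb
# (the exact form of the regression term in EPIANN may differ)
cross_entropy = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))
coord_regression = tf.constant(0.0)  # stand-in for the coordinate regression error
total_loss = cross_entropy + lamb * coord_regression
```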

Required Pre-installed Packages

R (3.4.2) Library dependencies

GenomicRanges 1.28.2
BSgenome.Hsapiens.UCSC.hg19.masked 1.3.99

Python (2.7.6) Module dependencies

sklearn 0.18.1 / 0.19.1
os
pickle
time
tensorflow 1.3.0
numpy 1.13.3
Biopython 1.67
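A quick sanity check of the Python environment (an illustrative snippet, not part of the repository):

```python
# Print the installed versions of the main Python dependencies
import sklearn, tensorflow, numpy, Bio

print("scikit-learn " + sklearn.__version__)     # expected 0.18.x or 0.19.x
print("tensorflow   " + tensorflow.__version__)  # expected 1.3.0
print("numpy        " + numpy.__version__)       # expected 1.13.3
print("biopython    " + Bio.__version__)         # expected 1.67
```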