Awesome
⌛️ Data availability
⌛️ Download pre-trained 3UTRBERT model
📘 Environment Setup
1.1 Create and activate a new virtual environment
conda create -n 3UTRBERT python=3.6.13
conda activate 3UTRBERT
1.2 Install the package and other requirements
conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=10.2 -c pytorch
git clone https://github.com/yangyn533/3UTRBERT
cd 3UTRBERT
python3 -m pip install --editable .
python3 -m pip install -r requirements.txt
If above commands do not run correctly. Following commands could be used to install missing packages manually after running the above commands. If you use the commands to install packages manually, above commands should be run first.
pip install seaborn
pip install transformers
pip install pyfaidx
pip install python-decouple
pip install sacremoses
pip install boto3
pip install sentencepiece
pip install Bio
pip install pyahocorasick
⌛️ Process data
The input file is in .fasta
format. For each sequence, the label of the sequence should be in the sequence ID. (example file can be found in example_data folder).
By running the following code, the input fasta file will be separated into train, dev and test sets. Each sequence will be tokenized into 3mer tokens. Example data locates in the example/data folder. train.tsv
is for training, dev.tsv
for validation and test.tsv
for test the performance.
python preprocess.py \
--data_dir <PATH_TO_YOUR_DATA> \
--output_dir <PATH_TO_YOUR_OUTPUT_DIRECTORY> \
--kmer 3
⌛️ Train
train.py
is used for fine-tune the model. The input data are train.tsv
and dev.tsv
. Make sure train.tsv
and dev.tsv
are in the same directory and the input path to this directory as the --data_dir
argument (not include the file name itself). --model_name_or_path
needs to be the path to your pre-trained model. --output_dir
is the location to store the fine-tuned model.
python train.py \
--data_dir <PATH_TO_YOUR_DATA> \
--output_dir <PATH_TO_YOUR_OUTPUT_DIRECTORY> \
--model_type 3utrprom \
--tokenizer_name rna3 \
--model_name_or_path <PATH_TO_YOUR_MODEL> \
--do_train \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 5e-5 \
--logging_steps 100 \
--save_steps 1000 \
--num_train_epochs 3 \
--evaluate_during_training \
--max_seq_length 100 \
--warmup_percent 0.1 \
--hidden_dropout_prob 0.1 \
--overwrite_output \
--weight_decay 0.01 \
--seed 6
Please change the tokenizer name { rna3, rna4, rna5, rna6 } when changing the kmer choice.
⌛️ Predict
predict.py
is used for producing prediction results from the fine-tuned model. The input data is the test.tsv
. Make sure train.tsv
, dev.tsv
and test.tsv
are in the same directory and input path to this directory as the --data_dir
argument (not include the file name itself). --model_name_or_path
needs to be the path to your fine-tuned model. The output files of predict.py
are mainly pred_results.npy
and pred_results_scores.npy
. pred_results.npy
stores the probability for each sequence. pred_results_scores.npy
stores the metrics to evaluate the model.
python predict.py \
--data_dir <PATH_TO_YOUR_DATA> \
--output_dir <PATH_TO_YOUR_OUTPUT_DIRECTORY> \
--do_predict \
--tokenizer_name rna3 \
--model_type 3utrprom \
--model_name_or_path <PATH_TO_YOUR_MODEL> \
--max_seq_length 100 \
--per_gpu_eval_batch_size 32
Please change the tokenizer name { rna3, rna4, rna5, rna6 } when changing the kmer choice.
📊 Single resolution importance analysis
The following code extracted the attention scores and visualizes them.
python single_resolution_importance.py \
--kmer 3 \
--model_path <PATH_TO_YOUR_MODEL> \
--start_layer 11 \
--end_layer 11 \
--metric mean \
--sequence <SEQUENCE_USED> \
--save_path <PATH_TO_YOUR_OUTPUT_DIRECTORY>
Please make sure that the input sequence does not exceed the max-length limit.
📊 Mutation analysis
Before run the shell script. Make sure the parameters in the shell script are indicated.
source mutation_heatmap.sh
The following commonds comes from mutation_heatmap.sh
.
KMER
indicates the kmer used.
ORIGINAL_SEQ_PATH
should be the path the the folder where your sequence file locates (not include the file name itself).
MUTATE_SEQ_PATH
should be the folder your want to store the mutated sequence file (not include the file name itself).
WT_SEQ
should be the same sequence in the original sequence file.
Please store the original sequence in a .tsv
file called test.tsv
.
The example test.tsv
file is in example mutation_analysis/original_seq
, only one sequence is allowed to be in the file. The label for the sequence can be 0 or 1.
export KMER=3
export MODEL_PATH=<PATH_TO_YOUR_MODEL>
export ORIGINAL_SEQ_PATH=<PATH_TO_YOUR_ORIGINAL_SEQUENCE_FILE>
export MUTATE_SEQ_PATH=<PATH_TO_YOUR_MUTATED_SEQUENCE_FILE>
export PREDICTION_PATH=<PATH_TO_STORE_PREDICTION>
export WT_SEQ=<THE_SEQUENCE_USED_FOR_MUTATION>
export OUTPUT_PATH=<PATH_TO_YOUR_OUTPUT_DIRECTORY>
# mutate sequence
python mutate.py --seq_file $ORIGINAL_SEQ_PATH/test.tsv --save_file_dir $MUTATE_SEQ_PATH --k $KMER
# predict on sequence
mkdir $PREDICTION_PATH/original_pred
mkdir $PREDICTION_PATH/mutate_pred
python predict.py \
--data_dir $ORIGINAL_SEQ_PATH \
--output_dir $PREDICTION_PATH/original_pred \
--do_predict \
--tokenizer_name rna3 \
--model_type 3utrprom \
--model_name_or_path $MODEL_PATH \
--max_seg_length 100 \
--per_gpu_eval_batch_size 32
python predict.py \
--data_dir $MUTATE_SEQ_PATH \
--output_dir $PREDICTION_PATH/mutate_pred \
--do_predict \
--tokenizer_name rna3 \
--model_type 3utrprom \
--model_name_or_path $MODEL_PATH \
--max_seg_length 100 \
--per_gpu_eval_batch_size 32
# calculate scores
python calculate_diff_scores.py \
--orig_seq_file $ORIGINAL_SEQ_PATH/test.tsv \
--orig_pred_file $PREDICTION_PATH/original_pred/pred_results.npy \
--mut_seq_file $MUTATE_SEQ_PATH/test.tsv \
--mut_pred_file $PREDICTION_PATH/mutate_pred/pred_results.npy \
--save_file_dir $OUTPUT_PATH
# draw heatmap
python heatmap.py \
--score_file $OUTPUT_PATH \
--save_file_dir $OUTPUT_PATH \
--wt_seq $WT_SEQ
📊 Feature extraction
<PATH_TO_DATA>
is the path the folder that the data in (not include the data file name). The input fasta file should be named as seq_to_extract.fasta
. THe example data can be found in example folder.
python extract_LS_embedding.py \
--data_path <PATH_TO_DATA> \
--output_path <PATH_TO_YOUR_OUTPUT_DIRECTORY> \
--model_path <PATH_TO_YOUR_MODEL>
🧬 Motif analysis
The motif analysis requires the output of attentions. The required attention can be obtained from single_resolution_importance.py
. Store the attention into the directory used as input --predict_dir
.
python find_motifs.py \
--data_dir <PATH_TO_YOUR_DATA> \
--predict_dir <PATH_TO_YOUR_PREDICTION_OUTPUT_DIRECTORY> \
--window_size <ADJUST_THIS> \
--min_len <ADJUST_THIS> \
--pval_cutoff <ADJUST_THIS> \
--min_n_motif <ADJUST_THIS> \
--align_all_ties \
--save_file_dir <PATH_TO_YOUR_OUTPUT_DIRECTORY> \
--verbose