PRoBERTa

Ananthan Nambiar, Maeve Heflin, Simon Liu, Sergei Maslov, Mark Hopkins, Anna Ritz

Notes

Data and model files:

- BPE model
- pretraining data
- family data
- conservative PPI data
- aggressive PPI data
- pretrained weights
- protein family finetuned weights
- PPI conservative finetuned (20%) weights
- PPI conservative finetuned (100%) weights
- PPI aggressive finetuned (20%) weights
- PPI aggressive finetuned (100%) weights

Requirements and Installation

Install the sentencepiece tokenizer:

pip3 install sentencepiece

Build fairseq from the linked repository's source:

git clone https://github.com/imonlius/fairseq.git
cd fairseq
pip3 install --editable . --no-binary cffi
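
As a quick sanity check (not part of the original instructions), both requirements should now import cleanly:

# sanity_check.py -- verify the installation
import fairseq
import sentencepiece

print("fairseq", fairseq.__version__)
print("sentencepiece", sentencepiece.__version__)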

tokenizer.py

Train a tokenizer and tokenize data for protein family and interaction fine-tuning

Example Usage:

python3 tokenizer.py
| Name | Description |
| --- | --- |
| path | Path to the protein family data. This should be a .tab file with "Sequence" and "Protein families" as two of the columns. |
| int_path | Path to the protein interaction data. This should be a JSON file with 'from', 'to', and 'link' for each interaction. |
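
For reference, a minimal sketch of what BPE training with sentencepiece can look like; the file names and vocabulary size below are illustrative assumptions, not the script's actual settings:

# Minimal sketch: train a BPE model on raw protein sequences and
# tokenize one example. "seqs.txt" and vocab_size=10000 are assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="seqs.txt",            # one raw sequence per line
    model_prefix="protein_bpe",  # writes protein_bpe.model / protein_bpe.vocab
    vocab_size=10000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="protein_bpe.model")
print(sp.encode("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", out_type=str))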

pRoBERTa_pretrain.sh

Pre-train RoBERTa model

Example Usage:

bash pRoBERTa_pretrain.sh pretrain 4 pretrained_model \
        pretraining/split_binarized/ \
        768 5 125000 3125 0.0025 32 64 3
| Name | Description | Example |
| --- | --- | --- |
| PREFIX | Prefix for the model output files | pretrain |
| NUM_GPUS | Number of GPUs to use during pretraining | 4 |
| OUTPUT_DIR | Output directory | pretrained_model |
| DATA_DIR | Binarized input data directory | pretraining/split_binarized/ |
| ENCODER_EMBED_DIM | Dimension of the embeddings generated by the encoder | 768 |
| ENCODER_LAYERS | Number of encoder layers in the model | 5 |
| TOTAL_UPDATES | Total (maximum) number of updates during training | 125000 |
| WARMUP_UPDATES | Total number of LR warm-up updates during training | 3125 |
| PEAK_LEARNING_RATE | Peak learning rate for training | 0.0025 |
| MAX_SENTENCES | Maximum number of sequences in each batch | 32 |
| UPDATE_FREQ | Updates the model every UPDATE_FREQ batches | 64 |
| PATIENCE | Stop training early if validation performance does not improve for PATIENCE consecutive validation runs | 3 |
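
With the example values, each model update sees MAX_SENTENCES × UPDATE_FREQ × NUM_GPUS = 32 × 64 × 4 = 8192 sequences, since gradients are accumulated over UPDATE_FREQ batches on each GPU.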

pRoBERTa_finetune_ppi.sh

Fine-tune RoBERTa model for Protein Interaction Prediction Task

Example Usage:

bash pRoBERTa_finetune_ppi.sh ppi 4 ppi_prediction \
        ppi_prediction/split_binarized/robustness_minisplits/0.80/ \
        768 5 12500 312 0.0025 32 64 2 3 \
        pretraining/checkpoint_best.pt \
        no
| Name | Description | Example |
| --- | --- | --- |
| PREFIX | Prefix for the model output files | ppi |
| NUM_GPUS | Number of GPUs to use for finetuning | 4 |
| OUTPUT_DIR | Model output directory | ppi_prediction |
| DATA_DIR | Binarized input data directory | ppi_prediction/split_binarized/robustness_minisplits/0.80/ |
| ENCODER_EMBED_DIM | Dimension of the embeddings generated by the encoder | 768 |
| ENCODER_LAYERS | Number of encoder layers in the model | 5 |
| TOTAL_UPDATES | Total (maximum) number of updates during training | 12500 |
| WARMUP_UPDATES | Total number of LR warm-up updates during training | 312 |
| PEAK_LEARNING_RATE | Peak learning rate for training | 0.0025 |
| MAX_SENTENCES | Maximum number of sequences in each batch | 32 |
| UPDATE_FREQ | Updates the model every UPDATE_FREQ batches | 64 |
| NUM_CLASSES | Number of output classes | 2 |
| PATIENCE | Stop training early if validation performance does not improve for PATIENCE consecutive validation runs | 3 |
| PRETRAIN_CHECKPOINT | Path to pretrained model checkpoint | pretraining/checkpoint_best.pt |
| RESUME_TRAINING | Whether to resume training from a previous finetuned model checkpoint | no |
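
With the example values, the warm-up covers 312 of the 12500 total updates, i.e. roughly the same 2.5% warm-up fraction used for pretraining (3125 of 125000).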

pRoBERTa_finetune_pfamclass.sh

Fine-tune RoBERTa model for Family Classification Task

Example Usage:

bash pRoBERTa_finetune_pfamclass.sh family 4 family_classification \
        family_classification/split_binarized/robustness_minisplits/1.00 \
        768 5 12500 312 0.0025 32 64 4083 3 \
        pretraining/checkpoint_best.pt \
        no
| Name | Description | Example |
| --- | --- | --- |
| PREFIX | Prefix for the model output files | family |
| NUM_GPUS | Number of GPUs to use for finetuning | 4 |
| OUTPUT_DIR | Model output directory | family_classification |
| DATA_DIR | Binarized input data directory | family_classification/split_binarized/robustness_minisplits/1.00 |
| ENCODER_EMBED_DIM | Dimension of the embeddings generated by the encoder | 768 |
| ENCODER_LAYERS | Number of encoder layers in the model | 5 |
| TOTAL_UPDATES | Total (maximum) number of updates during training | 12500 |
| WARMUP_UPDATES | Total number of LR warm-up updates during training | 312 |
| PEAK_LEARNING_RATE | Peak learning rate for training | 0.0025 |
| MAX_SENTENCES | Maximum number of sequences in each batch | 32 |
| UPDATE_FREQ | Updates the model every UPDATE_FREQ batches | 64 |
| NUM_CLASSES | Number of output classes | 4083 |
| PATIENCE | Stop training early if validation performance does not improve for PATIENCE consecutive validation runs | 3 |
| PRETRAIN_CHECKPOINT | Path to pretrained model checkpoint | pretraining/checkpoint_best.pt |
| RESUME_TRAINING | Whether to resume training from a previous finetuned model checkpoint | no |
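
The NUM_CLASSES value (4083) should equal the number of distinct family labels in the fine-tuning data, i.e. the number of entries in the families.txt label dictionary generated in the binarization steps below.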

Clustering/protein_family_clustering_loop.py

Cluster proteins using k-means and calculate the normalized mutual information (NMI) with protein families. Before running this, make sure to download roberta.base and the relevant checkpoints.

Example Usage:

python3 protein_family_clustering_loop.py
| Name | Description |
| --- | --- |
| tokenized_data_filepath | Input data filepath. This file has to contain tokenized protein sequences in a 'Tokenized Sequence' column, and the family each protein belongs to in a 'Protein families' column. Any other columns in this file will be ignored. |
| roberta_weights | Weights to load; choose pretrained or fine-tuned weights depending on which model you are evaluating. |
| EMBEDDING_SIZE | Should match the PRoBERTa model size. |
| USE_NULL_MODEL | Whether to use random cluster prediction instead of k-means clustering. |
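
For orientation, a minimal sketch of loading a checkpoint with fairseq and pooling per-token features into one embedding per sequence, which is the kind of representation the clustering loop works with. The paths, the input tokens, and the mean-pooling choice are illustrative assumptions, not the script's exact behavior:

# Minimal sketch (assumed paths; not the script's exact logic):
# load a PRoBERTa checkpoint and pool token features into a single
# per-sequence embedding for clustering.
import torch
from fairseq.models.roberta import RobertaModel

roberta = RobertaModel.from_pretrained(
    "pretrained_model/checkpoints",          # assumed checkpoint folder
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="pretraining/split_binarized",  # folder with dict.txt
)
roberta.eval()

# Encode an already-tokenized sequence via the model's own dictionary.
tokens = roberta.task.source_dictionary.encode_line(
    "TOK1 TOK2 TOK3",                        # hypothetical BPE tokens
    append_eos=True, add_if_not_exist=False,
).long().unsqueeze(0)

with torch.no_grad():
    features = roberta.extract_features(tokens)  # (1, seq_len, embed_dim)
embedding = features.mean(dim=1)                 # simple mean pooling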

pRoBERTa_evaluate_family_batch.py

Predict families using fine-tuned RoBERTa model

Example Usage:

python3 pRoBERTa_evaluate_family_batch.py family_classification/split_tokenized/full/Finetune_fam_data.split.test.10 \
	family_classification/split_binarized/robustness_minisplits/1.00/ \
	predictions.tsv \
	family_classification/checkpoints/ \
	protein_family_classification 256
| Name | Description | Example |
| --- | --- | --- |
| DATA | Path to input examples to predict. This should be formatted as a CSV with the columns, in order: tokenized sequence, true family label | family_classification/split_tokenized/full/Finetune_fam_data.split.test.10 |
| BINARIZED_DATA | Path to binarized family data | family_classification/split_binarized/robustness_minisplits/1.00/ |
| OUTPUT | Path to output file with model predictions | predictions.tsv |
| MODEL_FOLDER | Model checkpoints folder. Will use the checkpoint_best.pt file in the folder. | family_classification/checkpoints/ |
| CLASSIFICATION_HEAD_NAME | Name of the trained classification head | protein_family_classification |
| BATCH_SIZE | Batch size for prediction | 256 |
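
For a single example, prediction with the fine-tuned head can be sketched roughly as follows, using fairseq's hub interface with the paths from the example above; the batch script is the authoritative implementation:

# Minimal sketch (illustrative, not the batch script's exact logic):
# classify one tokenized sequence with the fine-tuned head.
from fairseq.models.roberta import RobertaModel

roberta = RobertaModel.from_pretrained(
    "family_classification/checkpoints",
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="family_classification/split_binarized/robustness_minisplits/1.00",
)
roberta.eval()

tokens = roberta.task.source_dictionary.encode_line(
    "TOK1 TOK2 TOK3",                     # hypothetical tokenized sequence
    append_eos=True, add_if_not_exist=False,
).long().unsqueeze(0)

# predict() returns log-probabilities over the label set.
logprobs = roberta.predict("protein_family_classification", tokens)
pred = logprobs.argmax(dim=1).item()      # index into the label dictionary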

pRoBERTa_evaluate_ppi_batch.py

Predict PPI using fine-tuned RoBERTa model

Example Usage:

python3 pRoBERTa_evaluate_ppi_batch.py ppi_prediction/split_tokenized/full/Finetune_interact_tokenized.split.test.10 \
	ppi_prediction/split_binarized/robustness_minisplits/1.00/ \
	predictions.tsv \
	ppi_prediction/checkpoints/ \
	protein_interaction_prediction 256
| Name | Description | Example |
| --- | --- | --- |
| DATA | Path to input examples to predict. This should be formatted as a CSV with the columns, in order: tokenized from sequence, tokenized to sequence, true label | ppi_prediction/split_tokenized/full/Finetune_interact_tokenized.split.test.10 |
| BINARIZED_DATA | Path to binarized PPI data | ppi_prediction/split_binarized/robustness_minisplits/1.00/ |
| OUTPUT | Path to output file with model predictions | predictions.tsv |
| MODEL_FOLDER | Model checkpoints folder. Will use the checkpoint_best.pt file in the folder. | ppi_prediction/checkpoints/ |
| CLASSIFICATION_HEAD_NAME | Name of the trained classification head | protein_interaction_prediction |
| BATCH_SIZE | Batch size for prediction | 256 |
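
Prediction here follows the same pattern as the family sketch above, except that each example carries two tokenized sequences ('from' and 'to') that are passed to the model as a pair before the protein_interaction_prediction head is applied.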

shuffle_and_split_pretrain.sh

Shuffle and split pretraining data file into training, validation, and test data files.

Example Usage:

bash shuffle_and_split_pretrain.sh pretraining/tokenized_seqs_v1.txt \
	pretraining/split_tokenized/ \
	tokenized_seqs_v1
| Name | Description | Example |
| --- | --- | --- |
| INPUT | Input file. Each line should be an example. | pretraining/tokenized_seqs_v1.txt |
| OUTPUT | Output directory | pretraining/split_tokenized/ |
| PREFIX | Prefix for output files | tokenized_seqs_v1 |
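
The .train.80/.valid.10/.test.10 suffixes on the output files indicate an 80/10/10 split. A minimal Python sketch of the same idea, using the paths from the example (the shell script is the authoritative implementation):

# Minimal sketch of an 80/10/10 shuffle-and-split, mirroring the
# output naming seen in the example above.
import random

with open("pretraining/tokenized_seqs_v1.txt") as f:
    lines = f.readlines()
random.shuffle(lines)

n = len(lines)
splits = {
    "train.80": lines[: int(0.8 * n)],
    "valid.10": lines[int(0.8 * n) : int(0.9 * n)],
    "test.10": lines[int(0.9 * n) :],
}
for suffix, chunk in splits.items():
    out_path = f"pretraining/split_tokenized/tokenized_seqs_v1.split.{suffix}"
    with open(out_path, "w") as out:
        out.writelines(chunk)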

shuffle_and_split.sh

Shuffle and split finetuning data file into training, validation, and test data files.

Example Usage:

bash shuffle_and_split.sh family_classification/Finetune_fam_data.csv \
	family_classification/split_tokenized/full/ \
	Finetune_fam_data
| Name | Description | Example |
| --- | --- | --- |
| INPUT | Input file. Each line should be an example. | family_classification/Finetune_fam_data.csv |
| OUTPUT | Output directory | family_classification/split_tokenized/full/ |
| PREFIX | Prefix for output files | Finetune_fam_data |

percentage_splits.sh

Generate output files with a certain percentage of the input data file

Example Usage:

bash percentage_splits.sh family_classification/split_tokenized/full/Finetune_fam_data.split.train.80 \
	family_classification/split_tokenized/full/robustness_split \
	Finetune_fam_data
| Name | Description | Example |
| --- | --- | --- |
| INPUT | Input file | family_classification/split_tokenized/full/Finetune_fam_data.split.train.80 |
| OUTPUT | Output directory | family_classification/split_tokenized/full/robustness_split |
| PREFIX | Prefix for output files | Finetune_fam_data |
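
A minimal sketch of the idea; the output naming is an assumption based on the robustness_minisplits/0.80 and 1.00 directories used elsewhere in this README:

# Minimal sketch: keep the first 20% of an already-shuffled split.
# The output file name is an assumed convention, not the script's.
fraction = 0.20
src = "family_classification/split_tokenized/full/Finetune_fam_data.split.train.80"
dst = f"family_classification/split_tokenized/full/robustness_split/Finetune_fam_data.{fraction:.2f}"

with open(src) as f:
    lines = f.readlines()

with open(dst, "w") as out:
    out.writelines(lines[: int(fraction * len(lines))])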

Preprocess/binarize pretraining data:

fairseq-preprocess \
	--only-source \
	--trainpref tokenized_seqs_v1.split.train.80 \
	--validpref tokenized_seqs_v1.split.valid.10 \
	--testpref tokenized_seqs_v1.split.test.10 \
	--destdir pretraining/split_binarized \
	--workers 60
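
The dict.txt written to pretraining/split_binarized here is reused below as the --srcdict for the fine-tuning sequence inputs, so fine-tuning token IDs stay consistent with the pretrained model.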

Preprocess/binarize family classification finetuning data:

# Split data into sequence and family files
for f in family_classification/split_tokenized/full/Finetune*; do
	cut -f1 -d',' "$f" > family_classification/split_tokenized/sequence/$(basename "$f").sequence
	cut -f2 -d',' "$f" > family_classification/split_tokenized/family/$(basename "$f").family
done

# Replace all spaces in family names with underscores
for f in family_classification/split_tokenized/family/*.family; do
	sed -i 's/ /_/g' "$f"
done

# Generate family label dictionary file
awk '{print $0,0}' family_classification/split_tokenized/family/*.family | sort | uniq > \
	family_classification/split_tokenized/family/families.txt
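
Each line of families.txt is a family label followed by a placeholder count of 0, matching the '<symbol> <count>' format fairseq expects for dictionary files.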

# Binarize sequences
fairseq-preprocess \
	--only-source \
	--trainpref family_classification/split_tokenized/sequence/Finetune_fam_data.split.train.80.sequence \
	--validpref family_classification/split_tokenized/sequence/Finetune_fam_data.split.valid.10.sequence \
	--testpref family_classification/split_tokenized/sequence/Finetune_fam_data.split.test.10.sequence \
	--destdir family_classification/split_binarized/input0 \
	--workers 60 \
	--srcdict pretraining/split_binarized/dict.txt

# Binarize labels
fairseq-preprocess \
	--only-source \
	--trainpref family_classification/split_tokenized/family/Finetune_fam_data.split.train.80.family \
	--validpref family_classification/split_tokenized/family/Finetune_fam_data.split.valid.10.family \
	--testpref family_classification/split_tokenized/family/Finetune_fam_data.split.test.10.family \
	--destdir family_classification/split_binarized/label \
	--workers 60 \
	--srcdict family_classification/split_tokenized/family/families.txt

Preprocess/binarize PPI data:

# Split data into from sequence, to sequence, and label files
for f in ppi_prediction/split_tokenized/full/Finetune*; do
	cut -f1 -d',' "$f" > ppi_prediction/split_tokenized/from/$(basename "$f").from
	cut -f2 -d',' "$f" > ppi_prediction/split_tokenized/to/$(basename "$f").to
	cut -f3 -d',' "$f" > ppi_prediction/split_tokenized/label/$(basename "$f").label
done

# Binarize sequences
fairseq-preprocess \
	--only-source \
	--trainpref ppi_prediction/split_tokenized/from/Finetune_interact_tokenized.split.train.80.from \
	--validpref ppi_prediction/split_tokenized/from/Finetune_interact_tokenized.split.valid.10.from \
	--testpref ppi_prediction/split_tokenized/from/Finetune_interact_tokenized.split.test.10.from \
	--destdir ppi_prediction/split_binarized/input0 \
	--workers 60 \
	--srcdict pretraining/split_binarized/dict.txt

fairseq-preprocess \
	--only-source \
	--trainpref ppi_prediction/split_tokenized/to/Finetune_interact_tokenized.split.train.80.to \
	--validpref ppi_prediction/split_tokenized/to/Finetune_interact_tokenized.split.valid.10.to \
	--testpref ppi_prediction/split_tokenized/to/Finetune_interact_tokenized.split.test.10.to \
	--destdir ppi_prediction/split_binarized/input1 \
	--workers 60 \
	--srcdict pretraining/split_binarized/dict.txt

# Binarize labels
fairseq-preprocess \
	--only-source \
	--trainpref ppi_prediction/split_tokenized/label/Finetune_interact_tokenized.split.train.80.label \
	--validpref ppi_prediction/split_tokenized/label/Finetune_interact_tokenized.split.valid.10.label \
	--testpref ppi_prediction/split_tokenized/label/Finetune_interact_tokenized.split.test.10.label \
	--destdir ppi_prediction/split_binarized/label \
	--workers 60
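
Note that no --srcdict is supplied for the labels here, so fairseq-preprocess builds the label dictionary from the training labels automatically.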