Awesome

SpliceBERT-analysis

Additional analysis on SpliceBERT. The original repository is available at SpliceBERT.

Benchmark

On SpliceAI's GTEx dataset

We fine-tuned SpliceBERT on SpliceAI's GTEx dataset with R-Drop regularization for 5 times using different random seeds (model weights: Google Drive). The average AP scores of SpliceBERT (900nt) is comparable (donor) or slightly superior (acceptor) to SpliceAI-10K, while the ensemble model (averaging the predictions of 5 models) underperforms that of SpliceAI-10K, which is likely because that SpliceBERT models were fine-tuned based on the same pre-trained model and thus lack sufficient diversity.

The source codes are available in benchmark_spliceai-gtex.

model	receptive field size	AP (donor)	AP (acceptor)
SpliceBERT	900	0.8547 $\pm$ 0.0012	0.8458 $\pm$ 0.0009
SpliceAI-10k	10001	0.8547 $\pm$ 0.0027	0.8434 $\pm$ 0.0023
SpliceAI-2k	2001	0.8369 $\pm$ 0.0015	0.8270 $\pm$ 0.0017
SpliceAI-400	401	0.7961 $\pm$ 0.0020	0.7873 $\pm$ 0.0026
SpliceAI-80	81	0.5216 $\pm$ 0.0022	0.4449 $\pm$ 0.0020

model (ensemble)	receptive field size	AP (donor)	AP (acceptor)
SpliceAI-10k (ensemble)	10001	0.8735	0.8644
SpliceBERT (ensemble)	900	0.8608	0.8524

On DeepSTARR's dataset

Though SpliceBERT was pre-trained on primary RNA sequences, it can also be applied to DNA sequences. We finetuned SpliceBERT on DeepSTARR's dataset (https://zenodo.org/records/5502060) to identify sequences with potential enhancer activity. SpliceBERT outperformed DeepSTARR (convolution model) and Nucleotide Transformer (DNA language model). The results are available at benchmark_deepstarr.

model	Developmental	Housekeeping
SpliceBERT	0.70	0.78
DeepSTARR	0.68	0.74
Nucleotide Transformer (multi-species)	0.64	0.75

<img src="./benchmark_deepstarr/splicebert_on_deepstarr.png"> SpliceBERT_on_DeepSTARR (show 20% points) </img>

Contact

For any questions, contact chenkenbio_[at]_gmail.com

Citation

@article{chen2024self_bbae163,
  title={Self-supervised learning on millions of primary RNA sequences from 72 vertebrates improves sequence-based RNA splicing prediction},
  author={Chen, Ken and Zhou, Yue and Ding, Maolin and Wang, Yu and Ren, Zhixiang and Yang, Yuedong},
  journal={Briefings in Bioinformatics},
  volume={25},
  number={3},
  pages={bbae163},
  year={2024},
  publisher={Oxford University Press}
}