Awesome
SeaMoon: Prediction of Molecular Motions Based on Language Models
SeaMoon is a deep learning framework that predicts protein motions from their amino acid sequences. It leverages embeddings of protein language models, such as the sequence-only-based ESM-2 (Lin et al. 2022), the multimodal ESM3 (Hayes et al. 2024), or the sequence-structure bilingual ProstT5 (Heinzinger et al. 2023). Given a query protein sequence, SeaMoon outputs sets of 3D displacements vectors for each C-alpha atom within an invariant subspace, which can be interpreted as linear motions.
Quick Start
Setup Environment
-
Create a new conda environment and activate it:
conda create --name seamoon python=3.11.9 conda activate seamoon
-
Install dependencies:
pip install -r requirements.txt
-
If you wish to use --torque-mode (see below) during inference or evaluation, you will need a working version of the Wolfram Engine. Make sure to specify the path to your
WolframKernel
at line 30 ofeval.py
. We used Wolfram Engine v14.0.
Test Run
A small test dataset of 100 input samples is included in data_set
to validate all main functions. If you wish to generate ground truth data and pre-compute embeddings (ProstT5 by default) for all of them, you can use:
python -m seamoon precompute-w-gt
If you wish to skip pre-computing, pre-computed data for 10 input samples are provided in data_set/training_data. You can launch SeaMoon inference (infer) and prediction evaluation (evaluate) directly on them.
-
Infer -- predict motion tensors (3 by default) from the input embeddings:
python -m seamoon infer
-
Evaluate -- optimally align all predictions against all ground-truth principal components and compute the normalised errors:
python -m seamoon evaluate
The full dataset from the paper can be downloaded here.
Usage
Pre-compute Embeddings
Pre-compute embeddings using either FASTA or PDB files, optionally specifying the protein language model:
-
From FASTA:
python -m seamoon precompute-from-fasta --input-files [path-to-fasta-or-list] --output-dir [output-directory] --emb-model [ProstT5|ESM]
-
From PDB:
python -m seamoon precompute-from-pdb --input-files [path-to-pdb-list] --output-dir [output-directory] --emb-model [ProstT5|ESM]
This mode allows you to specify a protein 3D structure that may be then used to orient the predicted motions (--torque-mode, see below).
- From DANCE binaries and alignments (with ground truth to train the model):
python -m seamoon precompute-w-gt --prefixes [file-with-prefixes] --bin-dir [binary-dir] --aln-dir [alignment-dir] --output-dir [output-directory] --emb-model [ProstT5|ESM]
This mode allows you to generate ground-truth data from conformational collections, in addition to the pLM embeddings.
Training
python -m seamoon train --config-path [path-to-config-file]
Inference
python -m seamoon infer --model-path [path-to-model] --config-file [path-to-config] --list-path [path-to-list] --precomputed-path [path-to-precomputed-data] --output-path [output-directory] --batch-size [batch-size] --torque-mode [true|false] --device [cuda|cpu]
By default, the predicted motion tensors will have arbitrary orientations. Set the --torque-mode option to True if you want to align them with respect to a 3D structure. This orientation procedure will produce four solutions that minimize the torque of the structure under the predicted motion.
Evaluation
python -m seamoon evaluate --model-path [path-to-model] --config-file [path-to-config] --list-path [path-to-list] --precomputed-path [path-to-precomputed-data] --output-path [output-directory] --batch-size [batch-size] --torque-mode [true|false] --device [cuda|cpu]
By default, the predicted motion tensors will be optimally aligned with the known ground-truth principal components prior to computing the errors. Set the --torque-mode option to True if you want to compute the errors directly from the predictions oriented through torque minimisation.