Awesome
3DMolMS
3D Molecular Network for Mass Spectra Prediction (3DMolMS) is a deep neural network model to predict the MS/MS spectra of compounds from their 3D conformations. This model's molecular representation, learned through MS/MS prediction tasks, can be further applied to enhance performance in other molecular-related tasks, such as predicting retention times (RT) and collision cross sections (CCS).
Read our paper in Bioinformatics | Try our online service at GNPS | Install from PyPI
🆕 3DMolMS v1.1.10 is now available both on PyPI and GNPS2!
<p align="center"> <img src='https://github.com/JosieHong/3DMolMS/blob/main/img/gnps2_msms_pred.png' width='600'> </p>The changes log can be found at [CHANGE_LOG.md].
Installation
3DMolMS is available on PyPI (molnetpack
). You can install the latest version using pip
:
pip install molnetpack
# PyTorch must be installed separately.
# Please check the official website of PyTorch for the proper version:
# https://pytorch.org/get-started/locally/
# e.g.
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
3DMolMS can also be installed through source codes:
git clone https://github.com/JosieHong/3DMolMS.git
cd 3DMolMS
pip install .
Usage
To get started quickly, you can instantiate a MolNet and load a CSV or MGF file for MS/MS prediction as:
import torch
from molnetpack import MolNet, plot_msms
# Set the device to CPU for CPU-only usage:
device = torch.device("cpu")
# For GPU usage, set the device as follows (replace '0' with your desired GPU index):
# gpu_index = 0
# device = torch.device(f"cuda:{gpu_index}")
# Instantiate a MolNet object
molnet_engine = MolNet(device, seed=42) # The random seed can be any integer.
# Load input data (here we use a CSV file as an example)
molnet_engine.load_data(path_to_test_data='./test/input_msms.csv')
"""Load data from the specified path.
Args:
path_to_test_data (str): Path to the test data file. Supported formats are 'csv', 'mgf', and 'pkl'.
Returns:
None
"""
# Predict MS/MS
pred_spectra_df = molnet_engine.pred_msms(instrument='qtof')
"""Predict MS/MS spectra.
Args:
path_to_results (Optional[str]): Path to save the prediction results. Supports '.mgf' or '.csv' formats. If None, the results won't be saved.
path_to_checkpoint (Optional[str]): Path to the model checkpoint. If None, the model will be downloaded from a default URL.
instrument (str): Type of instrument used ('qtof' or 'orbitrap').
Returns:
pd.DataFrame: DataFrame containing the predicted MS/MS results.
"""
We also implement a function to plot the predicted results.
# Plot the predicted MS/MS with 3D molecular conformation
plot_msms(pred_spectra_df, dir_to_img='./img/')
The sample input files, a CSV and an MGF, are located at ./test/demo_input.csv
and ./test/demo_input.mgf
, respectively. It's important to note that during the data loading phase, any input formats that are not supported will be automatically excluded. Below is a table outlining the types of input data that are supported:
Item | Supported input |
---|---|
Atom number | <=300 |
Atom types | 'C', 'O', 'N', 'H', 'P', 'S', 'F', 'Cl', 'B', 'Br', 'I', 'Na' |
Precursor types | '[M+H]+', '[M-H]-', '[M+H-H2O]+', '[M+Na]+', '[M+2H]2+' |
Collision energy | any number |
Below is an example of a predicted MS/MS spectrum plot.
<p align="center"> <img src='https://github.com/JosieHong/3DMolMS/blob/main/img/demo_0.png' width='600'> </p>A more detailed documentation for various tasks using molnetpack or source code can be found in the docs/ directory, which includes the following:
- ./docs/
- PROP_USAGE.md: Guide on using
molnetpack
for RT prediction, CCS prediction, and molecular embedding. - MSMS_PRED.md: Instructions for using 3DMolMS to predict MS/MS spectra from your own CSV files via the source code. The training details can be found in the next section.
- GEN_REFER_LIB.md: Instructions for using 3DMolMS to generate MS/MS reference libraries from small molecule databases, such as HMDB and RefMet, via the source code.
- PROP_PRED.md: Instructions for training and testing 3DMolMS on RT and CCS prediction via the source code.
- PRETRAIN.md: Instructions for pretraining 3DMolMS on the QM9 dataset via the source code.
- PROP_USAGE.md: Guide on using
Train your own model
Step 0: Clone the Repository and Set Up the Environment
Clone the 3DMolMS repository and install the required packages using the following commands:
git clone https://github.com/JosieHong/3DMolMS.git
cd 3DMolMS
# Please install the packages if you have not installed them yet.
pip install .
Step 1: Obtain the Pretrained Model
Download the pretrained model (molnet_pre_etkdgv3.pt.zip
) from Releases. You can also train the model from scratch. For details on pretraining the model on the QM9 dataset, refer to PRETRAIN.md.
Step 2: Prepare the Datasets
Download and organize the datasets into the ./data/
directory. The current version uses four datasets:
- Agilent DPCL, provided by Agilent Technologies.
- NIST20, available under license for academic use.
- MoNA, publicly available.
- Waters QTOF, our own experimental dataset.
The data directory structure should look like this:
|- data
|- origin
|- Agilent_Combined.sdf
|- Agilent_Metlin.sdf
|- hr_msms_nist.SDF
|- MoNA-export-All_LC-MS-MS_QTOF.sdf
|- MoNA-export-All_LC-MS-MS_Orbitrap.sdf
|- waters_qtof.mgf
Step 3: Preprocess the Datasets
Run the following commands to preprocess the datasets. Specify the dataset with --dataset
and select the instrument type as qtof
. Use --maxmin_pick
to apply the MaxMin algorithm for selecting training molecules; otherwise, selection will be random. The dataset configurations are in ./src/molnetpack/config/preprocess_etkdgv3.yml
.
python ./src/preprocess.py --dataset agilent nist mona waters gnps \
--instrument_type qtof orbitrap \
--data_config_path ./src/molnetpack/config/preprocess_etkdgv3.yml \
--mgf_dir ./data/mgf_debug/
Step 4: Train the Model
Use the following commands to train the model. Configuration settings for the model and training process are located in ./src/molnetpack/config/molnet.yml
.
# Train the model from pretrain:
# Q-TOF (Orbitrap is ignored here.):
python ./src/train.py --train_data ./data/qtof_etkdgv3_train.pkl \
--test_data ./data/qtof_etkdgv3_test.pkl \
--model_config_path ./src/molnetpack/config/molnet.yml \
--data_config_path ./src/molnetpack/config/preprocess_etkdgv3.yml \
--checkpoint_path ./check_point/molnet_qtof_etkdgv3.pt \
--transfer --resume_path ./check_point/molnet_pre_etkdgv3.pt \
--ex_model_path ./check_point/molnet_qtof_etkdgv3_jit.pt
# Train the model from scratch
# Q-TOF:
python ./src/train.py --train_data ./data/qtof_etkdgv3_train.pkl \
--test_data ./data/qtof_etkdgv3_test.pkl \
--model_config_path ./src/molnetpack/config/molnet.yml \
--data_config_path ./src/molnetpack/config/preprocess_etkdgv3.yml \
--checkpoint_path ./check_point/molnet_qtof_etkdgv3.pt \
--ex_model_path ./check_point/molnet_qtof_etkdgv3_jit.pt
# Orbitrap:
python ./src/train.py --train_data ./data/orbitrap_etkdgv3_train.pkl \
--test_data ./data/orbitrap_etkdgv3_test.pkl \
--model_config_path ./src/molnetpack/config/molnet.yml \
--data_config_path ./src/molnetpack/config/preprocess_etkdgv3.yml \
--checkpoint_path ./check_point/molnet_orbitrap_etkdgv3.pt \
--ex_model_path ./check_point/molnet_orbitrap_etkdgv3_jit.pt
Step 5: Evaluation
Let's evaluate the model trained above!
# Predict the spectra:
# Q-TOF:
python ./src/pred.py \
--test_data ./data/qtof_etkdgv3_test.pkl \
--model_config_path ./src/molnetpack/config/molnet.yml \
--data_config_path ./src/molnetpack/config/preprocess_etkdgv3.yml \
--resume_path ./check_point/molnet_qtof_etkdgv3.pt \
--result_path ./result/pred_qtof_etkdgv3_test.mgf
# Orbitrap:
python ./src/pred.py \
--test_data ./data/orbitrap_etkdgv3_test.pkl \
--model_config_path ./src/molnetpack/config/molnet.yml \
--data_config_path ./src/molnetpack/config/preprocess_etkdgv3.yml \
--resume_path ./check_point/molnet_orbitrap_etkdgv3.pt \
--result_path ./result/pred_orbitrap_etkdgv3_test.mgf
# Evaluate the cosine similarity between experimental spectra and predicted spectra:
# Q-TOF:
python ./src/eval.py ./data/qtof_etkdgv3_test.pkl ./result/pred_qtof_etkdgv3_test.mgf \
./eval_qtof_etkdgv3_test.csv ./eval_qtof_etkdgv3_test.png
# Orbitrap:
python ./src/eval.py ./data/orbitrap_etkdgv3_test.pkl ./result/pred_orbitrap_etkdgv3_test.mgf \
./eval_orbitrap_etkdgv3_test.csv ./eval_orbitrap_etkdgv3_test.png
Additional application
3DMolMS is also capable of predicting molecular properties and generating reference libraries for molecular identification. For more details, refer to PROP_PRED.md and GEN_REFER_LIB.md respectively.
Citation
@article{hong20233dmolms,
title={3DMolMS: prediction of tandem mass spectra from 3D molecular conformations},
author={Hong, Yuhui and Li, Sujun and Welch, Christopher J and Tichy, Shane and Ye, Yuzhen and Tang, Haixu},
journal={Bioinformatics},
volume={39},
number={6},
pages={btad354},
year={2023},
publisher={Oxford University Press}
}
@article{hong2024enhanced,
title={Enhanced structure-based prediction of chiral stationary phases for chromatographic enantioseparation from 3D molecular conformations},
author={Hong, Yuhui and Welch, Christopher J and Piras, Patrick and Tang, Haixu},
journal={Analytical Chemistry},
volume={96},
number={6},
pages={2351--2359},
year={2024},
publisher={ACS Publications}
}
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.