MARCEL is a PyTorch-based benchmark library that evaluates the potential of machine learning on conformer ensembles across a diverse set of molecules, datasets, and models.

Why Learn over Conformer Ensembles?

In reality, molecules are not rigid, static objects: thermodynamically-permissible rotations of chemical bonds, small vibrational motions, and dynamic intermolecular interactions cause them to interconvert continuously between different conformations. As a consequence, many experimentally observable chemical properties depend on the full distribution of thermodynamically-accessible conformers. Moreover, it is often challenging to determine a priori which conformers contribute most to a molecular property without running prohibitively expensive simulations. It is therefore important to investigate the collective power of the many conformer structures lying at local minima of the potential energy surface, collectively known as the conformer ensemble, for improving molecular representation learning models.

<p align="center"> <img src="https://media.drugdesign.org/course/molecular-geometry/conformers.gif" width="35%" alt="Copyright © 2022 drugdesign.org" class="center"/> </p>

Datasets

MARCEL includes four datasets that cover a diverse range of chemical space and focus on four chemically relevant tasks for both molecules and reactions, with an emphasis on Boltzmann-averaged properties of conformer ensembles computed at the Density-Functional Theory (DFT) level.

Drugs-75K

Drugs-75K is a subset of the GEOM-Drugs dataset, which includes 75,099 molecules with at least 5 rotatable bonds. For each molecule, Auto3D is used to generate and optimize the conformer ensembles and AIMNet-NSE is used to calculate three important DFT-based reactivity descriptors: ionization potential, electron affinity, and electronegativity.

Links: Download, Instructions

Kraken

Kraken is a dataset of 1,552 monodentate organophosphorus (III) ligands along with their DFT-computed conformer ensembles. We consider four 3D catalytic ligand descriptors exhibiting significant variance among conformers: Sterimol B5, Sterimol L, buried Sterimol B5, and buried Sterimol L. These descriptors quantify the steric size of a substituent in Å, and are commonly employed for Quantitative Structure-Activity Relationship (QSAR) modeling. The buried Sterimol variants describe the steric effects within the first coordination sphere of a metal.

Links: Download, Instructions

EE

EE is a dataset of 872 catalyst-substrate pairs involving 253 Rhodium (Rh)-bound atropisomeric catalysts derived from chiral bisphosphines, with 10 enamides as substrates. The dataset includes conformations of the catalyst-substrate transition state complexes in two separate pro-S and pro-R configurations. The task is to predict the Enantiomeric Excess (EE) of the chemical reaction involving the substrate, defined as the absolute ratio between the concentrations of the two enantiomers in the product distribution. This dataset is generated with Q2MM, which automatically produces Transition State Force Fields (TSFFs) in order to simulate the conformer ensembles of each prochiral transition state complex. EE can then be computed from the conformer ensembles by Boltzmann-averaging the activation energies of the competing transition states. Unlike the properties in Drugs-75K and Kraken, EE depends on the conformer ensembles of both the pro-R and pro-S complexes.

Links: Dataset access not publicly available, Instructions

BDE

BDE is a dataset containing 5,915 organometallic catalysts ML₁L₂ consisting of a metal center (M = Pd, Pt, Au, Ag, Cu, Ni) coordinated to two flexible organic ligands (L₁ and L₂), each selected from a 91-membered ligand library. The data includes conformations of each unbound catalyst, as well as conformations of the catalyst when bound to ethylene and bromide after oxidative addition with vinyl bromide. Each catalyst has an electronic binding energy, computed as the difference in the minimum energies of the bound-catalyst complex and unbound catalyst, following the DFT-optimization of their respective conformer ensembles. Although the binding energies are computed via DFT, the conformers provided for modeling are generated with Open Babel. This realistically represents the setting in which precise conformer ensembles are unknown at inference.

Links: Download, Instructions

Benchmarks

Prerequisites

The benchmarks are built on PyTorch and use PyTorch Geometric (PyG) data loaders; both packages must be installed before running the benchmarks, along with any additional dependencies listed in the repository.
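As a minimal sanity check of the environment (a sketch assuming only the two core dependencies named above; model-specific packages may also be required):

```python
# Minimal environment check: MARCEL's data loaders are PyG-based and its models are
# built on PyTorch, so both imports must succeed before any benchmark can run.
import torch
import torch_geometric

print("PyTorch:", torch.__version__)
print("PyTorch Geometric:", torch_geometric.__version__)
print("CUDA available:", torch.cuda.is_available())
```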

Dataset Loaders

MARCEL provides PyG data loaders for each dataset. Download each dataset and place the zipped file under its corresponding directory, i.e., datasets/<NAME>/raw.

| Dataset | Dataloader class |
| --- | --- |
| Drugs-75K | `data.drugs.Drugs` |
| Kraken | `data.kraken.Kraken` |
| EE | `data.ee.EE_2D` for 2D models, `data.ee.EE` for the others |
| BDE | `data.bde.BDE` |
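
As a rough usage sketch, once the raw file is in place a loader can be instantiated like a standard PyG dataset. Only the class path comes from the table above; the `root` argument and the indexing behaviour shown here are assumptions, so check the class definitions for the exact signatures:

```python
# Hedged sketch: instantiate the Kraken loader from the table above.
# The `root` argument mirrors the datasets/<NAME>/raw convention described earlier;
# any additional constructor arguments are not shown here.
from data.kraken import Kraken

dataset = Kraken(root='datasets/Kraken')
print(len(dataset))   # number of entries
print(dataset[0])     # inspect one entry to see how conformers are organized
```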

Batch Samplers

For Drugs-75K and Kraken, use the EnsembleSampler from loaders.samplers to sample mini-batches of molecules. The sampling strategy can be set to random (samples one conformer per ensemble at random), first (always loads the first conformer in each ensemble), or all (loads all conformers).
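
A hedged sketch of how this might be wired up; the constructor arguments and the use of the sampler as a batch_sampler for a PyG DataLoader are illustrative assumptions, not the library's confirmed API:

```python
# Illustrative only: EnsembleSampler and the strategy names ('random', 'first', 'all')
# come from the text above, but the constructor arguments and DataLoader wiring are assumed.
from torch_geometric.loader import DataLoader
from loaders.samplers import EnsembleSampler
from data.drugs import Drugs

dataset = Drugs(root='datasets/Drugs-75K')
sampler = EnsembleSampler(dataset, strategy='random', batch_size=32)  # or 'first' / 'all'
loader = DataLoader(dataset, batch_sampler=sampler)

for batch in loader:
    # With strategy='random', each molecule contributes one randomly sampled conformer.
    ...
```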

Since EE and BDE involve interactions between two molecules, a second sampler, EnsembleMultiBatchSampler, is provided in loaders.samplers. In this case, each conformer of the system is loaded as a tuple [data_0, data_1], whose elements correspond to the two molecules in the system.
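
A correspondingly hedged sketch for the two-molecule case; again, the constructor arguments are assumptions made for illustration:

```python
# Illustrative only: the class name and the [data_0, data_1] pairing come from the
# text above; the constructor arguments and exact iteration interface are assumptions.
from loaders.samplers import EnsembleMultiBatchSampler
from data.bde import BDE

dataset = BDE(root='datasets/BDE')
sampler = EnsembleMultiBatchSampler(dataset, strategy='random', batch_size=32)

# Each sampled conformer of a system arrives as a pair [data_0, data_1],
# one entry per molecule in the two-molecule system.
```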

Instructions on Reproducing Results

The default hyperparameters are set in config.py; other model-dependent parameters are stored separately in the params folder. To reproduce a given model, change the config parameter in ConfigLoader to the corresponding model parameter file, then specify dataset and target (and override other parameters, e.g. learning_rate, when necessary) via command-line arguments. An example invocation is given after the table below.

| Model | Training script and key parameters |
| --- | --- |
| 1D fingerprint model | `train_fp_rf.py` |
| 1D SMILES-based sequential model | `train_1d.py --model1d:model SEQ_ENCODER` |
| 2D model | `train_2d.py --model2d:model GRAPH_ENCODER` |
| Single-conformer 3D model | `train_3d.py --model3d:augmentation False --model3d:model GRAPH_ENCODER` |
| 3D model with conformer sampling | `train_3d.py --model3d:augmentation True --model3d:model GRAPH_ENCODER` |
| Conformer ensemble model | `train_ensemble.py --model4d:set_encoder SET_ENCODER --model4d:graph_encoder GRAPH_ENCODER` |
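
For example, training a 2D model on Drugs-75K would use an invocation of the form `python train_2d.py --dataset Drugs-75K --target <TARGET> --model2d:model <GRAPH_ENCODER>`; note that the exact flag names for the dataset and target, as well as the available encoder identifiers, are not fixed here and should be taken from config.py and the files in the params folder.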

License

The MARCEL benchmarks are licensed under the Apache 2.0 License.