Awesome Small Molecule Machine Learning

A curated list of awesome papers, data sets, frameworks, packages, blogs, and other resources related to machine learning for small-molecule drug discovery. Please contribute!

Contents

Papers
Data sets
Frameworks, Libraries, and Software Tools
Blogs
Twitter
Related lists

Papers

<a id="papers-surveys"></a>

Survey papers and books

Walters and Barzilay, 2021. Critical assessment of AI in drug discovery.
White, 2021. Deep Learning for Molecules and Materials.
Coley, 2020. Defining and Exploring Chemical Spaces.
Chuang et al, 2020. Learning Molecular Representations for Medicinal Chemistry.
Walters and Barzilay, 2020. Applications of Deep Learning in Molecule Generation and Molecular Property Prediction.
Cai et al, 2020. Transfer Learning for Drug Discovery.

<a id="papers-representation"></a>

Representation, transfer learning, and few-shot learning

Krenn et al, 2022. SELFIES and the future of molecular string representations.
Wang et al, 2022. Molecular Contrastive Learning of Representations via Graph Neural Networks. [Code]
Ahmad et al, 2021. ChemBERTa-2: Towards Chemical Foundation Models. [Code]
Satorras et al, 2021. E(n) Equivariant Graph Neural Networks. [Code]
Stanley et al, 2021. FS-Mol: A Few-Shot Learning Dataset of Molecules. [Code]
Townshend et al, 2021. ATOM3D: Tasks On Molecules in Three Dimensions.
Xue et al, 2021. X-MOL: large-scale pre-training for molecular understanding and diverse molecular analysis. [Code]
Ying et al, 2021. Do Transformers Really Perform Bad for Graph Representation? (Graphormer paper). [Code]
Chuang and Keiser, 2020. Attention-Based Learning on Molecular Ensembles.
Li and Fourches, 2020. Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT. [Code]
Maziarka et al, 2020. Molecule Attention Transformer. [Code]
Nguyen et al, 2020. Meta-Learning GNN Initializations for Low-Resource Molecular Property Prediction. [Code]
Rong et al, 2020. Self-Supervised Graph Transformer on Large-Scale Molecular Data (GROVER paper). [Code]
Hu et al, 2019. Strategies for Pre-training Graph Neural Networks. [Code]
Yang et al, 2019. Analyzing Learned Molecular Representations for Property Prediction (Chemprop). [Code]
Feinberg et al, 2018. PotentialNet for Molecular Property Prediction.
Altae-Tran et al, 2017. Low Data Drug Discovery with One-Shot Learning.

<a id="papers-generative-algorithms"></a>

Generative algorithms

Bengio et al, 2021. Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation. [Code]
Berenger and Tsuda, 2021. Molecular generation by Fast Assembly of (Deep)SMILES fragments. [Code]
Gao et al, 2021. Amortized Tree Generation for Bottom-up Synthesis Planning and Synthesizable Molecular Design. [Code]
Takeuchi et al, 2021. R-group replacement database for medicinal chemistry.
Imrie et al, 2020. Deep Generative Models for 3D Linker Design. [Code]
Jin et al, 2020. Hierarchical Generation of Molecular Graphs using Structural Motifs. [Code]
Polishchuk, 2020. CReM: chemically reasonable mutations framework for structure generation. [Code]
Brown et al, 2019. GuacaMol: Benchmarking Models for de Novo Molecular Design. [Code]
Popova et al, 2019. MolecularRNN: Generating realistic molecular graphs with optimized properties.
You et al, 2019. Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation. [Code]
Zhou et al, 2019. Optimization of Molecules via Deep Reinforcement Learning. [Code (official version)] [PyTorch implementation]
Jin et al, 2018. Junction Tree Variational Autoencoder for Molecular Graph Generation. [Code]
Merk et al, 2018. De Novo Design of Bioactive Small Molecules by Artificial Intelligence.

<a id="papers-hit-finding"></a>

Hit finding and potency prediction

Stärk et al, 2022. EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction. [Code]
Bender et al, 2021. A practical guide to large-scale docking.
García-Ortegón et al, 2021. DOCKSTRING: easy molecular docking yields better benchmarks for ligand design. [Code] [Data]
Graff et al, 2021. Accelerating high-throughput virtual screening through molecular pool-based active learning. [Code]
Gentile et al, 2020. Deep Docking: A Deep Learning Platform for Augmentation of Structure Based Drug Discovery. [Code]
Cáceres et al, 2020. Adding Stochastic Negative Examples into Machine Learning Improves Molecular Bioactivity Prediction.
Lyu et al, 2019. Ultra-large library docking for discovering new chemotypes.

<a id="papers-adme-tox"></a>

ADME and toxicity prediction

Fradkin et al, 2022. A Graph Neural Network Approach to Molecule Carcinogenicity Prediction.
Karim et al, 2021. CardioTox net: a robust predictor for hERG channel blockade based on deep learning meta-feature ensembles. [Code]
Siramshetty et al, 2021. Validating ADME QSAR Models Using Marketed Drugs.
Göller et al, 2020. Bayer’s in silico ADMET platform: a journey of machine learning over the past two decades.
Ryu et al, 2020. DeepHIT: a deep learning framework for prediction of hERG-induced cardiotoxicity. [Code]
Cai et al, 2019. Deep Learning-Based Prediction of Drug-Induced Cardiotoxicity. [Code]
Ogura et al, 2019. Support Vector Machine model for hERG inhibitory activities based on the integrated hERG database using descriptor selection by NSGA-II. [Data]
Lombardo et al, 2018. In Silico Absorption, Distribution, Metabolism, Excretion, and Pharmacokinetics (ADME-PK): Utility and Best Practices.

<a id="papers-synthetic-accessibility"></a>

Synthetic accessibility and retrosynthetic planning

Fortunato et al, 2020. Data augmentation and pretraining for template-based retrosynthetic prediction in computer-aided synthesis planning.
Koch et al, 2020. Reinforcement Learning for Bioretrosynthesis.
Somnath et al, 2020. Learning Graph Models for Retrosynthesis Prediction.
Dai et al, 2019. Retrosynthesis Prediction with Conditional Graph Logic Network. [Code]
Coley et al, 2018. SCScore: Synthetic Complexity Learned from a Reaction Corpus. [Code] [DeepChem implementation]

<a id="dels"></a>

DNA-encoded libraries (DELs)

Lim et al, 2022. Machine Learning on DNA-Encoded Library Count Data Using an Uncertainty-Aware Probabilistic Loss Function. [Code]
McCloskey et al, 2020. Machine Learning on DNA-Encoded Libraries: A New Paradigm for Hit Finding.

<a id="papers-viz"></a>

Visualization and interpretability

Humer et al, 2021. ChemInformatics Model Explorer (CIME): Exploratory analysis of chemical model explanations. [Code]
Matveieva and Polishchuk, 2021. Benchmarks for interpretation of QSAR models. [Code]
Yoshimori et al, 2019. Integrating the Structure–Activity Relationship Matrix Method with Molecular Grid Maps and Activity Landscape Models for Medicinal Chemistry Applications.
Naveja and Medina-Franco, 2019. Finding Constellations in Chemical Space Through Core Analysis.

<a id="papers-msms"></a>

MS/MS prediction

Young et al, 2023. MassFormer: Tandem Mass Spectrum Prediction for Small Molecules using Graph Transformers. [Code]
Goldman et al, 2023. Prefix-Tree Decoding for Predicting Mass Spectra from Molecules. [Code]
Hong et al, 2023. 3DMolMS: prediction of tandem mass spectra from 3D molecular conformations. [Code]
Wang et al, 2021. CFM-ID 4.0: More Accurate ESI-MS/MS Spectral Prediction and Compound Identification.
Wei et al, 2019. Rapid Prediction of Electron–Ionization Mass Spectrometry Using Neural Networks. [Code]

Data sets

<a id="frameworks"></a>

Frameworks, Libraries, and Software Tools

Blogs

Twitter

Related lists