Awesome Small Molecule Machine Learning
A curated list of awesome papers, data sets, frameworks, packages, blogs, and other resources related to machine learning for small-molecule drug discovery. Please contribute!
Papers
<a id="papers-surveys"></a>
Survey papers and books
- Walters and Barzilay, 2021. Critical assessment of AI in drug discovery.
- White, 2021. Deep Learning for Molecules and Materials.
- Coley, 2020. Defining and Exploring Chemical Spaces.
- Chuang et al, 2020. Learning Molecular Representations for Medicinal Chemistry.
- Walters and Barzilay, 2020. Applications of Deep Learning in Molecule Generation and Molecular Property Prediction.
- Cai et al, 2020. Transfer Learning for Drug Discovery.
<a id="papers-representation"></a>
Representation, transfer learning, and few-shot learning
- Krenn et al, 2022. SELFIES and the future of molecular string representations.
- Wang et al, 2022. Molecular Contrastive Learning of Representations via Graph Neural Networks. [Code]
- Ahmad et al, 2021. ChemBERTa-2: Towards Chemical Foundation Models. [Code]
- Satorras et al, 2021. E(n) Equivariant Graph Neural Networks. [Code]
- Stanley et al, 2021. FS-Mol: A Few-Shot Learning Dataset of Molecules. [Code]
- Townshend et al, 2021. ATOM3D: Tasks On Molecules in Three Dimensions.
- Xue et al, 2021. X-MOL: large-scale pre-training for molecular understanding and diverse molecular analysis. [Code]
- Ying et al, 2021. Do Transformers Really Perform Bad for Graph Representation? (Graphormer paper). [Code]
- Chuang and Keiser, 2020. Attention-Based Learning on Molecular Ensembles.
- Li and Fourches, 2020. Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT. [Code]
- Maziarka et al, 2020. Molecule Attention Transformer. [Code]
- Nguyen et al, 2020. Meta-Learning GNN Initializations for Low-Resource Molecular Property Prediction. [Code]
- Rong et al, 2020. Self-Supervised Graph Transformer on Large-Scale Molecular Data (GROVER paper). [Code]
- Hu et al, 2019. Strategies for Pre-training Graph Neural Networks. [Code]
- Yang et al, 2019. Analyzing Learned Molecular Representations for Property Prediction (Chemprop). [Code]
- Feinberg et al, 2018. PotentialNet for Molecular Property Prediction.
- Altae-Tran et al, 2017. Low Data Drug Discovery with One-Shot Learning.
<a id="papers-generative-algorithms"></a>
Generative algorithms
- Bengio et al, 2021. Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation. [Code]
- Berenger and Tsuda, 2021. Molecular generation by Fast Assembly of (Deep)SMILES fragments. [Code]
- Gao et al, 2021. Amortized Tree Generation for Bottom-up Synthesis Planning and Synthesizable Molecular Design. [Code]
- Takeuchi et al, 2021. R-group replacement database for medicinal chemistry.
- Imrie et al, 2020. Deep Generative Models for 3D Linker Design. [Code]
- Jin et al, 2020. Hierarchical Generation of Molecular Graphs using Structural Motifs. [Code]
- Polishchuk, 2020. CReM: chemically reasonable mutations framework for structure generation. [Code]
- Brown et al, 2019. GuacaMol: Benchmarking Models for de Novo Molecular Design. [Code]
- Popova et al, 2019. MolecularRNN: Generating realistic molecular graphs with optimized properties.
- You et al, 2019. Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation. [Code]
- Zhou et al, 2019. Optimization of Molecules via Deep Reinforcement Learning. [Code (official version)] [PyTorch implementation]
- Jin et al, 2018. Junction Tree Variational Autoencoder for Molecular Graph Generation. [Code]
- Merk et al, 2018. De Novo Design of Bioactive Small Molecules by Artificial Intelligence.
<a id="papers-hit-finding"></a>
Hit finding and potency prediction
- Stärk et al, 2022. EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction. [Code]
- Bender et al, 2021. A practical guide to large-scale docking.
- García-Ortegón et al, 2021. DOCKSTRING: easy molecular docking yields better benchmarks for ligand design. [Code] [Data]
- Graff et al, 2021. Accelerating high-throughput virtual screening through molecular pool-based active learning. [Code]
- Gentile et al, 2020. Deep Docking: A Deep Learning Platform for Augmentation of Structure Based Drug Discovery. [Code]
- Cáceres et al, 2020. Adding Stochastic Negative Examples into Machine Learning Improves Molecular Bioactivity Prediction.
- Lyu et al, 2019. Ultra-large library docking for discovering new chemotypes.
<a id="papers-adme-tox"></a>
ADME and toxicity prediction
- Fradkin et al, 2022. A Graph Neural Network Approach to Molecule Carcinogenicity Prediction.
- Karim et al, 2021. CardioTox net: a robust predictor for hERG channel blockade based on deep learning meta-feature ensembles. [Code]
- Siramshetty et al, 2021. Validating ADME QSAR Models Using Marketed Drugs.
- Göller et al, 2020. Bayer’s in silico ADMET platform: a journey of machine learning over the past two decades.
- Ryu et al, 2020. DeepHIT: a deep learning framework for prediction of hERG-induced cardiotoxicity. [Code]
- Cai et al, 2019. Deep Learning-Based Prediction of Drug-Induced Cardiotoxicity. [Code]
- Ogura et al, 2019. Support Vector Machine model for hERG inhibitory activities based on the integrated hERG database using descriptor selection by NSGA-II. [Data]
- Lombardo et al, 2018. In Silico Absorption, Distribution, Metabolism, Excretion, and Pharmacokinetics (ADME-PK): Utility and Best Practices.
<a id="papers-synthetic-accessibility"></a>
Synthetic accessibility and retrosynthetic planning
- Fortunato et al, 2020. Data augmentation and pretraining for template-based retrosynthetic prediction in computer-aided synthesis planning.
- Koch et al, 2020. Reinforcement Learning for Bioretrosynthesis.
- Somnath et al, 2020. Learning Graph Models for Retrosynthesis Prediction.
- Dai et al, 2019. Retrosynthesis Prediction with Conditional Graph Logic Network. [Code]
- Coley et al, 2018. SCScore: Synthetic Complexity Learned from a Reaction Corpus. [Code] [DeepChem implementation]
<a id="dels"></a>
DNA-encoded libraries (DELs)
- Lim et al, 2022. Machine Learning on DNA-Encoded Library Count Data Using an Uncertainty-Aware Probabilistic Loss Function. [Code]
- McCloskey et al, 2020. Machine Learning on DNA-Encoded Libraries: A New Paradigm for Hit Finding.
<a id="papers-viz"></a>
Visualization and interpretability
- Humer et al, 2021. ChemInformatics Model Explorer (CIME): Exploratory analysis of chemical model explanations. [Code]
- Matveieva and Polishchuk, 2021. Benchmarks for interpretation of QSAR models. [Code]
- Yoshimori et al, 2019. Integrating the Structure–Activity Relationship Matrix Method with Molecular Grid Maps and Activity Landscape Models for Medicinal Chemistry Applications.
- Naveja and Medina-Franco, 2019. Finding Constellations in Chemical Space Through Core Analysis.
<a id="papers-msms"></a>
MS/MS prediction
- Young et al, 2023. MassFormer: Tandem Mass Spectrum Prediction for Small Molecules using Graph Transformers. [Code]
- Goldman et al, 2023. Prefix-Tree Decoding for Predicting Mass Spectra from Molecules. [Code]
- Hong et al, 2023. 3DMolMS: prediction of tandem mass spectra from 3D molecular conformations. [Code]
- Wang et al, 2021. CFM-ID 4.0: More Accurate ESI-MS/MS Spectral Prediction and Compound Identification.
- Wei et al, 2019. Rapid Prediction of Electron–Ionization Mass Spectrometry Using Neural Networks. [Code]
Data sets
- ADME@NCATS
- AMED Cardiotoxicity Database
- BindingDB
- ChEMBL
- DrugBank
- DrugMatrix
- Enamine REAL database
- hERG Central
- MoleculeNet
- MoNA (MassBank of North America): database of mass spectra and other readouts
- NPASS database of natural products
- PubChem
- The Open Reaction Database
- Therapeutic Data Commons
- ZINC
<a id="frameworks"></a>
Frameworks, Libraries, and Software Tools
- AutoDock Vina
- BioPandas
- Chemprop
- DeepChem [Tutorials]
- Open Babel
- pdb-tools
- PyTorch Geometric
- rd_filters
- Small-World Search
- TorchDrug
Blogs
- Regina Barzilay
- Bob the Grumpy Med Chemist
- John Chodera
- Connor W. Coley
- Greg Landrum
- pen(Taka)
- Bharath Ramsundar
- Marwin Segler
- Patrick Walters