knowtox_manuscript_SI

This repository is part of the supporting information to

KnowTox: Pipeline and Case Study for Confident Prediction of Potential Toxic Effects of Compounds in Early Phases of Development

A. Morger¹, M. Mathea², J. H. Achenbach², A. Wolf², R. Buesen², K-J. Schleifer², R. Landsiedel², A. Volkamer¹

¹ In Silico Toxicology and Structural Bioinformatics, Charité Universitätsmedizin, Berlin, Germany, volkamerlab.org
² BASF SE, Ludwigshafen, Germany

Computational tools for toxicity prediction are a promising means of reducing, refining and replacing animal testing. In this work we developed KnowTox, a novel pipeline that combines three in silico toxicology approaches to allow confident prediction of potentially toxic effects of query compounds: machine learning models, alerts for toxic substructures, and computational support for read-across. When applying machine learning models, the applicability and reliability of predictions for new chemicals are of utmost importance. We addressed this using conformal prediction. Several adaptations of the framework (KNN normalisation and balancing of the proper training set) were investigated and proposed to improve model performance. The model set-ups were validated using androgen receptor antagonism datasets.

Objective
Data and methods
Usage
License
Acknowledgement
Citation

Objective

(Back to Table of contents)

The notebook demonstrates how to build a conformal predictor, apply it to make predictions for external data, and evaluate both the internal (cross-validation) and external predictions. Following the model validation process described in the paper, three models for androgen receptor antagonism are built: the original, the normalised, and the normalised+balanced model. Comparing the three shows how normalisation can improve validity on external data, while balancing the proper training and calibration sets improves efficiency at significance level 0.2.

For an exhaustive explanation of conformal prediction and the validation process we refer to the paper.
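
As a rough orientation before turning to the paper, the core idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's implementation: it assumes a class-conditional (Mondrian) inductive conformal predictor, a nonconformity score of one minus the predicted class probability, validity measured as the fraction of prediction sets containing the true label, and efficiency measured as the fraction of single-label sets. All function names and the toy scores are hypothetical.

```python
from typing import Dict, List, Sequence, Set


def p_value(cal_scores: Sequence[float], test_score: float) -> float:
    """Share of calibration scores at least as nonconforming as the test score."""
    n_greater_equal = sum(1 for s in cal_scores if s >= test_score)
    return (n_greater_equal + 1) / (len(cal_scores) + 1)


def prediction_set(p_values: Dict[int, float], significance: float) -> Set[int]:
    """Keep every class whose p-value exceeds the significance level."""
    return {label for label, p in p_values.items() if p > significance}


def validity(sets: List[Set[int]], true_labels: List[int]) -> float:
    """Fraction of prediction sets that contain the true label."""
    return sum(y in s for s, y in zip(sets, true_labels)) / len(true_labels)


def efficiency(sets: List[Set[int]]) -> float:
    """Fraction of prediction sets with exactly one label (assumed definition)."""
    return sum(len(s) == 1 for s in sets) / len(sets)


# Toy calibration nonconformity scores per class (hypothetical values)
cal_scores = {0: [0.1, 0.2, 0.3, 0.4], 1: [0.2, 0.5, 0.6, 0.9]}

# Each test compound gets one nonconformity score per candidate label
test_scores = [{0: 0.15, 1: 0.85}, {0: 0.7, 1: 0.3}, {0: 0.35, 1: 0.55}]
true_labels = [0, 1, 1]

sets = []
for scores in test_scores:
    p_vals = {label: p_value(cal_scores[label], s) for label, s in scores.items()}
    sets.append(prediction_set(p_vals, significance=0.2))

print(sets, validity(sets, true_labels), efficiency(sets))
```

Lowering the significance level makes the prediction sets larger (more valid, less efficient), which is why validity and efficiency are always reported together at a fixed level such as 0.2.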

Data and methods

(Back to Table of contents)

The datasets used in this notebook were downloaded from public databases:

ToxCast database: https://figshare.com/articles/ToxCast_and_Tox21_Data_Spreadsheet/6062503 (EPA’s National Center for Computational Toxicology. ToxCast and Tox21 Data Spreadsheet. Download date 23.06.2017)
External data: https://www.tandfonline.com/doi/full/10.1080/1062936X.2016.1172665 (Norinder et al. SAR and QSAR in Environmental Research 27.4 (2016): 303-316.)

The molecules were standardised as described in the paper (Data and Methods/Dataset Preprocessing/Standardisation):

- Remove duplicates
- Use the standardiser library (discard non-organic compounds, apply structure standardisation rules, neutralise, remove salts)
- Remove small fragments and remaining mixtures
- Remove duplicates again

Descriptors were generated as specified in the paper (Data and Methods/Dataset Preprocessing/Descriptor calculation):

- MorganMACCS: Calculate Morgan fingerprint (radius 3, 1024 bits) and MACCS keys using RDKit and concatenate
- mmpcReduced:
  - Calculate Morgan fingerprint (radius 3, 1024 bits), MACCS keys, and physicochemical descriptors using RDKit
  - Normalise physicochemical descriptors
  - Perform feature reduction based on ToxCast data (feature variance threshold 0.01 for binary descriptors (Morgan, MACCS) and 0.001 for continuous (physicochemical) descriptors)
  - Concatenate the reduced Morgan and MACCS descriptors with the normalised and reduced physicochemical descriptors
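
The normalise, reduce and concatenate steps of mmpcReduced can be illustrated with a small standard-library sketch. It is not the notebook's code: the helper names and toy matrices are hypothetical, and min-max scaling is assumed for the normalisation because the README does not name the method. Only the two variance thresholds (0.01 and 0.001) are taken from the description above.

```python
from statistics import pvariance
from typing import List

Matrix = List[List[float]]


def column(matrix: Matrix, j: int) -> List[float]:
    return [row[j] for row in matrix]


def min_max_scale(matrix: Matrix) -> Matrix:
    """Scale every column to [0, 1]; constant columns become 0.0 (assumed method)."""
    n_cols = len(matrix[0])
    lows = [min(column(matrix, j)) for j in range(n_cols)]
    spans = [max(column(matrix, j)) - lows[j] for j in range(n_cols)]
    return [[(row[j] - lows[j]) / spans[j] if spans[j] else 0.0
             for j in range(n_cols)] for row in matrix]


def high_variance_columns(matrix: Matrix, threshold: float) -> List[int]:
    """Indices of columns whose population variance exceeds the threshold."""
    return [j for j in range(len(matrix[0]))
            if pvariance(column(matrix, j)) > threshold]


def select(matrix: Matrix, keep: List[int]) -> Matrix:
    return [[row[j] for j in keep] for row in matrix]


# Toy stand-ins for the real descriptors of 4 compounds (hypothetical values)
bits = [[1, 0, 1], [1, 1, 0], [1, 0, 1], [1, 1, 1]]  # fingerprint bits
physchem = [[100.0, 2.0], [300.0, 2.0], [200.0, 2.0], [400.0, 2.0]]

scaled = min_max_scale(physchem)
keep_bits = high_variance_columns(bits, threshold=0.01)
keep_cont = high_variance_columns(scaled, threshold=0.001)

# Concatenate the reduced binary and continuous blocks per compound
features = [b + c for b, c in zip(select(bits, keep_bits), select(scaled, keep_cont))]
print(keep_bits, keep_cont, features[0])
```

Since the feature reduction is based on the ToxCast data, the column indices selected there would be reused unchanged for the external dataset; reusing the scaling parameters in the same way is an assumption of this sketch. With real descriptor matrices, scikit-learn's `VarianceThreshold` performs the same kind of filtering.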

All methods and parameters used in this notebook are based on the paper. Note that the notebook was run with the newest versions of the Python libraries, so results may differ slightly from those in the paper. Moreover, because random forest training and stratified splitting are stochastic, exact numbers cannot be reproduced. However, the magnitude and trend of the improvement steps are consistent.

Usage

(Back to Table of contents)

The notebook can be used to train aggregated conformal predictors on the ToxCast androgen receptor antagonism endpoint (assay endpoint id 762), to make predictions for an external androgen receptor antagonism dataset, and to evaluate the predictions. Three different set-ups (original, normalised, normalised+balanced) for the conformal predictors are offered.
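
To make the term "aggregated" concrete: an aggregated conformal predictor trains several conformal predictors on different splits of the training data and combines their p-values per class. The sketch below assumes the median as the aggregation function, which is one common choice and not confirmed by this README; the function name and the p-values are hypothetical.

```python
from statistics import median
from typing import Dict, List


def aggregate_p_values(per_model: List[Dict[int, float]]) -> Dict[int, float]:
    """Combine the per-class p-values of several predictors by their median."""
    labels = per_model[0].keys()
    return {label: median(m[label] for m in per_model) for label in labels}


# p-values for one query compound from three predictors (hypothetical values)
per_model = [
    {0: 0.05, 1: 0.60},
    {0: 0.25, 1: 0.45},
    {0: 0.10, 1: 0.70},
]

aggregated = aggregate_p_values(per_model)
significance = 0.2
prediction = {label for label, p in aggregated.items() if p > significance}
print(aggregated, prediction)
```

In this toy case the second predictor alone would return both classes at significance level 0.2, whereas the aggregated p-values yield a single-label prediction, illustrating how aggregation dampens the influence of any one split.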

The notebook can be adapted to train models for other ToxCast endpoints, or to take your own dataframes of descriptors as input.

Installation

Get your local copy of the KnowTox_manuscript_SI repository by:
- Downloading it as a Zip archive and unzipping it, or
- Cloning it to your computer using git
```
git clone https://github.com/volkamerlab/KnowTox_manuscript_SI.git
```

Install the Anaconda (large download) or Miniconda (lighter) distribution for clean package version management.

Use the package manager conda to create an environment (called knowtox_SI) for the notebooks:

`conda create --name knowtox_SI python=3.6`

Activate the conda environment:

`conda activate knowtox_SI`

Install the packages:

`pip install scikit-learn`

`pip install https://github.com/morgeral/nonconformist/archive/master.zip`

`conda install jupyter`

Start the Jupyter notebook:

`jupyter notebook`

Note: Due to the computational resources needed to train the models, you might need to run the notebook with an increased max_buffer_size, e.g.:

`jupyter notebook --NotebookApp.max_buffer_size=2000000000`
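
Before opening the notebook, it can help to confirm that the packages are visible from the active environment. The following check is a small convenience sketch and not part of the repository; the import names `sklearn`, `nonconformist` and `notebook` are assumed to correspond to the packages installed above, and `check_packages` is a hypothetical helper.

```python
from importlib.util import find_spec
from typing import Dict, Iterable


def check_packages(names: Iterable[str]) -> Dict[str, bool]:
    """Report for each top-level import name whether it can be located."""
    return {name: find_spec(name) is not None for name in names}


# Import names assumed for the packages installed above
status = check_packages(["sklearn", "nonconformist", "notebook"])
for name, available in status.items():
    print(f"{name}: {'found' if available else 'MISSING'}")
```

Run it with the knowtox_SI environment activated; any package reported as missing can be installed with the corresponding command above.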

License

(Back to Table of contents)

This work is licensed under the BSD 3-Clause "New" or "Revised" License.

Acknowledgement

(Back to Table of contents)

AM and AV would like to thank Jaime Rodríguez-Guerra for supporting the setup of this repository and for reviewing it.

Citation

(Back to Table of contents)

If you make use of the KnowTox_manuscript_SI notebook, please cite:

@article{KnowTox,
    author = {
        Morger, Andrea and
        Mathea, Miriam and
        Achenbach, Janosch Harald and
        Wolf, Antje and
        Buesen, Roland and
        Schleifer, Klaus-Juergen and
        Landsiedel, Robert and
        Volkamer, Andrea},
    title = {KnowTox: Pipeline and Case Study for Confident Prediction of Potential Toxic Effects of Compounds in Early Phases of Development},
    year = {2020},
    journal = {Journal of Cheminformatics}
}