In case you would like to cite this:
1. MolMapNet Dataset
- The following datasets are reported in the paper <code><i>"Out-of-the-Box Deep Learning Prediction of Pharmaceutical Properties by Broadly Learned Knowledge-Based Molecular Representations"</i></code>; please refer to that paper for details of these datasets.
2. Benchmark Datasets in MolNet and Chemprop
These benchmark datasets and their split indices have been generated in this repo; the following table summarizes them.
<table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>task_name</th> <th>task_type</th> <th>n_samples</th> <th>n_task</th> <th>split_method</th> <th>n_cross_split</th> <th>task_metrics</th> </tr> <tr> <th>task_id</th> <th></th> <th></th> <th></th> <th></th> <th></th> <th></th> <th></th> </tr> </thead> <tbody> <tr> <th>01</th> <td>ESOL</td> <td>regression</td> <td>1128</td> <td>1</td> <td>random</td> <td>3</td> <td>RMSE</td> </tr> <tr> <th>02</th> <td>FreeSolv</td> <td>regression</td> <td>642</td> <td>1</td> <td>random</td> <td>3</td> <td>RMSE</td> </tr> <tr> <th>03</th> <td>Lipop</td> <td>regression</td> <td>4200</td> <td>1</td> <td>random</td> <td>3</td> <td>RMSE</td> </tr> <tr> <th>04</th> <td>PDBbind-full</td> <td>regression</td> <td>9880</td> <td>1</td> <td>time</td> <td>1</td> <td>RMSE</td> </tr> <tr> <th>05</th> <td>PDBbind-core</td> <td>regression</td> <td>168</td> <td>1</td> <td>time</td> <td>1</td> <td>RMSE</td> </tr> <tr> <th>06</th> <td>PDBbind-refined</td> <td>regression</td> <td>3040</td> <td>1</td> <td>time</td> <td>1</td> <td>RMSE</td> </tr> <tr> <th>07</th> <td>PCBA</td> <td>classification</td> <td>437929</td> <td>128</td> <td>random</td> <td>3</td> <td>PRC_AUC</td> </tr> <tr> <th>08</th> <td>MUV</td> <td>classification</td> <td>93087</td> <td>17</td> <td>random</td> <td>3</td> <td>PRC_AUC</td> </tr> <tr> <th>09</th> <td>HIV</td> <td>classification</td> <td>41127</td> <td>1</td> <td>scaffold</td> <td>3</td> <td>ROC_AUC</td> </tr> <tr> <th>10</th> <td>BACE</td> <td>classification</td> <td>1513</td> <td>1</td> <td>scaffold</td> <td>3</td> <td>ROC_AUC</td> </tr> <tr> <th>11</th> <td>BBBP</td> <td>classification</td> <td>2039</td> <td>1</td> <td>scaffold</td> <td>3</td> <td>ROC_AUC</td> </tr> <tr> <th>12</th> <td>Tox21</td> <td>classification</td> <td>7831</td> <td>12</td> <td>random</td> <td>3</td> <td>ROC_AUC</td> </tr> <tr> <th>13</th> <td>ToxCast</td> <td>classification</td> <td>8576</td> <td>617</td> 
<td>random</td> <td>3</td> <td>ROC_AUC</td> </tr> <tr> <th>14</th> <td>SIDER</td> <td>classification</td> <td>1427</td> <td>27</td> <td>random</td> <td>3</td> <td>ROC_AUC</td> </tr> <tr> <th>15</th> <td>ClinTox</td> <td>classification</td> <td>1478</td> <td>2</td> <td>random</td> <td>3</td> <td>ROC_AUC</td> </tr> <tr> <th>16</th> <td>ChEMBL</td> <td>classification</td> <td>456331</td> <td>1310</td> <td>scaffold</td> <td>3</td> <td>ROC_AUC</td> </tr> </tbody></table>
Installation
Direct installation:
pip install git+https://github.com/shenwanxiang/ChemBench.git
Developer installation:
git clone https://github.com/shenwanxiang/ChemBench.git
cd ChemBench
pip install -e .
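After either installation route, it can be useful to confirm that the package is visible to the interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of ChemBench.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


# Report whether the package name used in this README is importable here.
print("chembench installed:", is_installed("chembench"))
```

If this prints `False` after installation, the `pip` used for installing most likely belongs to a different Python environment than the one running the check.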
Usage-1: Load the Dataset and MoleculeNet's Split Indices
from chembench import load_data
df, induces = load_data('ESOL')
# get the indices for each of the 3 repeated random splits
train_idx, valid_idx, test_idx = induces[0]
train_idx, valid_idx, test_idx = induces[1]
train_idx, valid_idx, test_idx = induces[2]
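Before training on a split, a quick sanity check can catch leakage between the parts. The sketch below assumes that each entry of `induces` is a triple of integer row positions into `df`; the README does not state this explicitly, so treat it as an assumption. The helper `check_split` is hypothetical and not part of ChemBench, and the demo uses synthetic indices so that it runs without the package.

```python
def check_split(train_idx, valid_idx, test_idx, n_samples):
    """Check that the three index sets are disjoint and within range.

    Returns the fraction of the dataset that falls in each part.
    """
    parts = {
        "train": {int(i) for i in train_idx},
        "valid": {int(i) for i in valid_idx},
        "test": {int(i) for i in test_idx},
    }
    names = list(parts)
    # Any index shared by two parts would leak information between them.
    for pos, first in enumerate(names):
        for second in names[pos + 1:]:
            overlap = parts[first] & parts[second]
            if overlap:
                raise ValueError(f"{first} and {second} share {len(overlap)} indices")
    # Every index must address an existing row (assumed positional indexing).
    for name, idx in parts.items():
        if idx and (min(idx) < 0 or max(idx) >= n_samples):
            raise ValueError(f"{name} contains an out-of-range index")
    return {name: len(idx) / n_samples for name, idx in parts.items()}


# Synthetic stand-in for one entry of `induces`: 10 rows split 8/1/1.
example_split = (range(0, 8), [8], [9])
print(check_split(*example_split, n_samples=10))
```

With real data, and if the assumption above holds, the call would look like `check_split(*induces[0], n_samples=len(df))`.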
Usage-2: Load Dataset As Data Object
from chembench import dataset
data = dataset.load_ESOL()
data.x
data.y
data.description
## regression
dataset.load_Lipop()
dataset.load_ESOL()
dataset.load_FreeSolv()
dataset.load_Malaria()
dataset.load_LMC()
dataset.load_PDBF()
dataset.load_PDBC()
dataset.load_PDBR()
## classification
dataset.load_BBBP()
dataset.load_BACE()
dataset.load_HIV()
dataset.load_MUV()
dataset.load_Tox21()
dataset.load_SIDER()
dataset.load_CYP450()
dataset.load_ToxCast()
dataset.load_ClinTox()
dataset.load_ChEMBL()
dataset.load_PCBA()
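Because the loaders above follow a common `load_<Name>` naming pattern, several datasets can be loaded in a loop by looking the functions up by name. This is a sketch only: `load_many` is a hypothetical helper, and a small stand-in object replaces `chembench.dataset` so that the example runs without the package installed.

```python
from types import SimpleNamespace


def load_many(module, names):
    """Call module.load_<name>() for each name and return {name: result}."""
    loaded = {}
    for name in names:
        loader = getattr(module, f"load_{name}", None)
        if loader is None:
            raise AttributeError(f"no loader named load_{name}")
        loaded[name] = loader()
    return loaded


# Stand-in for `chembench.dataset`; the real loaders return data objects.
fake_dataset = SimpleNamespace(
    load_ESOL=lambda: "ESOL data",
    load_BBBP=lambda: "BBBP data",
)
print(load_many(fake_dataset, ["ESOL", "BBBP"]))
```

With the package installed, passing the `dataset` module imported above in place of `fake_dataset` should work the same way, provided the names match the loaders listed in this section.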
Usage-3: Load Cluster Splits
The cluster split results are here. For example, to load the cluster splits and random splits for the ESOL dataset:
from chembench import get_cluster_induces
induces1 = get_cluster_induces("ESOL", induces = "random_5fcv_5rpts")
induces2 = get_cluster_induces("ESOL", induces = "scaffold_5fcv_1rpts")
print(len(induces1))
print(len(induces2))
For example, the chemical space of the ESOL dataset under a 5-fold cluster split:
The Kolmogorov-Smirnov statistic on the distributions for the pairwise groups (clusters):
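For readers unfamiliar with the statistic, the two-sample Kolmogorov-Smirnov statistic is the largest absolute gap between the empirical cumulative distribution functions of two samples. The sketch below is a plain-Python illustration, not the code used to produce the results in this repo; the README does not say which quantity is compared between clusters, so the group values here are synthetic.

```python
from bisect import bisect_right
from itertools import combinations


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |F_a(x) - F_b(x)| over all observed x."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if len(a) == 0 or len(b) == 0:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # Both empirical CDFs are step functions that change only at data points,
    # so checking every observed value is enough to find the maximum gap.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Synthetic groups standing in for the target values of three clusters.
groups = {"c0": [1.0, 2.0, 3.0], "c1": [1.5, 2.5, 3.5], "c2": [7.0, 8.0, 9.0]}
for first, second in combinations(groups, 2):
    print(first, second, ks_statistic(groups[first], groups[second]))
```

A value near 0 means two groups have similar distributions, and a value of 1 means they do not overlap at all. `scipy.stats.ks_2samp` computes the same statistic along with a p-value.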
Making a Release
After installing the package in development mode and installing tox with pip install tox, the commands for making a new release are contained within the finish environment in tox.ini. Run the following from the shell:
$ tox -e finish
This script does the following:
- Uses BumpVersion to switch the version number in setup.cfg and src/chembench/version.py to not have the -dev suffix
- Packages the code in both a tar archive and a wheel
- Uploads to PyPI using twine. Be sure to have a .pypirc file configured to avoid the need for manual input at this step
- Pushes to GitHub. You'll need to make a release going with the commit where the version was bumped.
- Bumps the version to the next patch. If you made big changes and want to bump the version by minor, you can use tox -e bumpversion minor after.