QM1B dataset

arXiv QM1B figshare+

QM1B is a low-resolution DFT dataset generated using PySCF IPU. It comprises one billion training examples of molecules containing 9-11 heavy atoms. It was created by taking 1.09M SMILES strings from the GDB-11 database and computing molecular properties (e.g. the HOMO-LUMO gap) for up to 1000 conformers per molecule.

This repository contains utilities for accessing the QM1B dataset, but not the raw data, which is stored elsewhere.

License

Code in this repository is covered by the MIT license.

The QM1B dataset was generated with pyscf-ipu using the GDB-11 database as input; the GDB-11 database itself has not otherwise been altered. The QM1B dataset is made available under the Creative Commons 4.0 license.

Download

First check that you have at least 240 GB of storage available.
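The storage check can be scripted before starting the download. The following is a minimal sketch using only the Python standard library; `has_free_space` is a hypothetical helper that is not part of this repository, and reading 240 GB as 240 × 10^9 bytes is an assumption.

```python
import shutil


def has_free_space(path, required_gb=240):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free.

    `path` must already exist. One GB is taken to be 10**9 bytes (an assumption).
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 10**9
```

For example, `has_free_space("/path/for")` checks the parent directory of the intended download location.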

Prepare your Python environment:

pip install -r requirements.txt

Run the automated download script:

python download.py /path/for/qm1b-dataset

Dataset schema

See the QM1B datasheet for detailed documentation following the datasheets for datasets framework.

The QM1B dataset is stored in the open-source columnar Apache Parquet format, with the following schema:

Dataset exploration

Dataset exploration can easily be done using the Pandas library. For instance, to load the validation set:

import pandas as pd

# The validation set contains 20M entries.
print(pd.read_parquet("qm1b_val.parquet").head())

Cite

Please use the following citation for the QM1B dataset:

@inproceedings{mathiasen2023qm1b,
  title={Generating QM1B with PySCF$_{\text{IPU}}$},
  author={Mathiasen, Alexander and Helal, Hatem and Klaeser, Kerstin and Balanca, Paul and Dean, Josef and Luschi, Carlo and Beaini, Dominique and Fitzgibbon, Andrew William and Masters, Dominic},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2023}
}