SPICE: A Dataset for Training Machine Learning Potentials

This repository contains scripts and data files used in the creation of the SPICE dataset. It does not contain the dataset itself. That is available from Zenodo:

DOI

SPICE (Small-Molecule/Protein Interaction Chemical Energies) is a collection of quantum mechanical data for training potential functions. The emphasis is particularly on simulating drug-like small molecules interacting with proteins. It is designed to achieve the following goals.

SPICE is made up of a collection of subsets. Each one is designed to provide a particular type of information. The subsets in the current version (2.0) include the following.

This table summarizes the content of each subset: the number of molecules/clusters it contains, the total number of conformations, the range of sizes spanned by the molecules/clusters, and the list of elements that appear in the subset.

| Subset | Molecules/Clusters | Conformations | Atoms | Elements |
| --- | --- | --- | --- | --- |
| Dipeptides | 677 | 33,850 | 26–60 | H, C, N, O, S |
| Solvated Amino Acids | 26 | 1300 | 79–96 | H, C, N, O, S |
| DES370K Dimers | 3490 | 345,676 | 2–34 | H, Li, C, N, O, F, Na, Mg, P, S, Cl, K, Ca, Br, I |
| DES370K Monomers | 374 | 18,700 | 3–22 | H, C, N, O, F, P, S, Cl, Br, I |
| PubChem | 28,039 | 1,398,566 | 3–50 | H, B, C, N, O, F, Si, P, S, Cl, Br, I |
| Solvated PubChem | 1397 | 13,934 | 63–110 | H, C, N, O, F, P, S, Cl, Br, I |
| Amino Acid Ligand Pairs | 79,967 | 194,174 | 24–72 | H, C, N, O, F, P, S, Cl, Br, I |
| Ion Pairs | 28 | 1426 | 2 | Li, F, Na, Cl, K, Br, I |
| Water Clusters | 1 | 1000 | 90 | H, O |
| Total | 113,999 | 2,008,628 | 2–110 | H, Li, B, C, N, O, F, Na, Mg, Si, P, S, Cl, K, Ca, Br, I |
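As a quick illustration of how the Total row relates to the individual subsets, the sketch below recomputes three of its entries from the per-subset rows: the number of molecules/clusters, the overall range of sizes, and the combined list of elements. The values are transcribed by hand from the table, and the names `SUBSETS` and `summarize` are made up for this example; they are not part of the repository's scripts.

```python
# Per-subset rows transcribed from the table above:
# (subset, molecules/clusters, min atoms, max atoms, elements)
SUBSETS = [
    ("Dipeptides", 677, 26, 60, "H C N O S"),
    ("Solvated Amino Acids", 26, 79, 96, "H C N O S"),
    ("DES370K Dimers", 3490, 2, 34, "H Li C N O F Na Mg P S Cl K Ca Br I"),
    ("DES370K Monomers", 374, 3, 22, "H C N O F P S Cl Br I"),
    ("PubChem", 28039, 3, 50, "H B C N O F Si P S Cl Br I"),
    ("Solvated PubChem", 1397, 63, 110, "H C N O F P S Cl Br I"),
    ("Amino Acid Ligand Pairs", 79967, 24, 72, "H C N O F P S Cl Br I"),
    ("Ion Pairs", 28, 2, 2, "Li F Na Cl K Br I"),
    ("Water Clusters", 1, 90, 90, "H O"),
]


def summarize(subsets):
    """Return (total molecules/clusters, (min atoms, max atoms), set of elements)."""
    total = sum(row[1] for row in subsets)
    smallest = min(row[2] for row in subsets)
    largest = max(row[3] for row in subsets)
    elements = set()
    for row in subsets:
        elements.update(row[4].split())
    return total, (smallest, largest), elements


total, atom_range, elements = summarize(SUBSETS)
print(total, atom_range, len(elements))  # 113999 (2, 110) 17
```

The 17 elements in the union are the ones listed in the Total row.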

Citing The Dataset

Please cite this manuscript for papers that use the SPICE dataset:

Peter Eastman, Pavan Kumar Behara, David L. Dotson, Raimondas Galvelis, John E. Herr, Josh T. Horton, Yuezhi Mao, John D. Chodera, Benjamin P. Pritchard, Yuanqing Wang, Gianni De Fabritiis, and Thomas E. Markland. "SPICE, A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials." Scientific Data 10, 11 (2023). https://doi.org/10.1038/s41597-022-01882-6

To cite a particular version of the dataset, cite the Zenodo DOI found on the Releases page and shown above for the most recent version.

Generating New Data

All calculations in the SPICE dataset were performed with Psi4. If you want to generate new data that can be combined with SPICE, it is important to use the same level of theory and the same program with identical settings. Even when two programs use the same level of theory, they usually differ enough in how they carry out the calculations that the energies they produce cannot be directly compared. A sample input file for Psi4 is provided. It shows the exact settings to use to produce new data that can be correctly combined with SPICE.
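One way to guard against accidentally mixing incompatible energies is to record how each batch of data was produced and refuse to merge batches whose records differ. The following is a minimal sketch of that idea, not something provided by this repository: the `Provenance` class, the `merge_energies` function, and the placeholder strings are all hypothetical. The actual method, basis set, and options should be taken from the sample Psi4 input file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    """How a batch of energies was computed (hypothetical record, for illustration)."""
    program: str
    method: str
    basis: str
    options: tuple = ()  # sorted (name, value) pairs for any other settings


def merge_energies(reference, reference_prov, new, new_prov):
    """Concatenate two lists of energies only if they share identical provenance."""
    if new_prov != reference_prov:
        raise ValueError(
            f"Cannot combine data: {new_prov} does not match {reference_prov}"
        )
    return list(reference) + list(new)


# Placeholder values; copy the real ones from the sample Psi4 input file.
spice_like = Provenance("psi4", "METHOD_FROM_SAMPLE_INPUT", "BASIS_FROM_SAMPLE_INPUT")
same_setup = Provenance("psi4", "METHOD_FROM_SAMPLE_INPUT", "BASIS_FROM_SAMPLE_INPUT")
other_program = Provenance("other-program", "METHOD_FROM_SAMPLE_INPUT", "BASIS_FROM_SAMPLE_INPUT")

combined = merge_energies([-1.0, -2.0], spice_like, [-3.0], same_setup)

try:
    merge_energies([-1.0, -2.0], spice_like, [-3.0], other_program)
    rejected = False
except ValueError:
    rejected = True
```

Here the check is deliberately strict: the same level of theory run in a different program is rejected, which matches the caution above that such energies cannot be directly compared.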