OpenFF QCArchive Dataset Submission
Dataset Lifecycle
All datasets submitted to QCArchive via this repository conform to the Dataset Lifecycle.
See STANDARDS.md for submission standards. Datasets must be submitted as pull requests.
User Quickstart
- Ensure `git-lfs` is installed on your local machine: https://git-lfs.github.com/
- To submit a new dataset, begin by cloning this repository:

      export GIT_LFS_SKIP_SMUDGE=1
      git clone git@github.com:openforcefield/qca-dataset-submission.git

  This will clone the repo, but avoid downloading existing LFS objects. If you wish to download all LFS objects, leave off the `export GIT_LFS_SKIP_SMUDGE=1`.
- Once cloned, create and switch to a new branch from `master`, then create a new directory in `qca-dataset-submission/submissions/`:

      git checkout -b <dataset-branch>
      mkdir qca-dataset-submission/submissions/YYYY-MM-DD-OpenFF-<DESCRIPTIVE-DATASET-NAME>-v1.0

  You will add all submission artifacts to this directory.
- Create and activate a new conda env with basic submission-preparation requirements:

      conda env create -f qca-dataset-submission/devtools/prod-envs/qcarchive-user-submit.yaml
      conda activate qcarchive-user-submit

- Choose a starting notebook and README based on the type of dataset you wish to submit. Copy the notebook and README for the dataset you want into the directory you created:

      cp examples/<dataset-type>/* qca-dataset-submission/submissions/YYYY-MM-DD-OpenFF-<DESCRIPTIVE-DATASET-NAME>-v1.0

- Start up a Jupyter notebook with your new notebook:

      jupyter notebook qca-dataset-submission/submissions/YYYY-MM-DD-OpenFF-<DESCRIPTIVE-DATASET-NAME>-v1.0/generate-dataset.ipynb

  Edit the contents with appropriate metadata, read in your molecules using the cells appropriate for your input data, and make any other modifications your dataset requires.
- Copy generated metadata components into the README. Write a reasonably detailed high-level description of the submission at the top.
- Commit the following files in the submission directory you made:
  - your input files; please compress them if possible with e.g. `bzip2`
  - `generate-dataset.ipynb`
  - `dataset.pdf`
  - `dataset.smi`
  - `dataset.json.bz2`
- Push your branch to Github:

      git push origin <dataset-branch>

- Make a new PR for the branch. Validation will run automatically on your `dataset.json.*` file, indicating any potential issues prior to submission. Ask for help if you see validation failures you do not understand. Ping a reviewer in the comments.
- Once reviewed and approved, your submission will be merged and submitted to QCArchive! Computations specified by the submission will be performed on OpenFF-managed compute resources.
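As a pre-commit sanity check for the quickstart above, the directory naming pattern and the list of expected files can be sketched in a few lines of Python. This is only an illustration: the helper names `submission_dir_name` and `missing_artifacts` are hypothetical and not part of this repository, and the file list simply mirrors the commit step above (input files are not checked because their names vary).

```python
from datetime import date
from pathlib import Path

# Files the quickstart asks you to commit (input files are dataset-specific).
EXPECTED_FILES = (
    "generate-dataset.ipynb",
    "dataset.pdf",
    "dataset.smi",
    "dataset.json.bz2",
)


def submission_dir_name(day, descriptive_name, version="v1.0"):
    """Build a name following YYYY-MM-DD-OpenFF-<DESCRIPTIVE-DATASET-NAME>-v1.0."""
    # Assumption: whitespace in the descriptive name is replaced by hyphens.
    slug = "-".join(descriptive_name.split())
    return f"{day:%Y-%m-%d}-OpenFF-{slug}-{version}"


def missing_artifacts(submission_dir):
    """Return the expected files that are not present in the submission directory."""
    root = Path(submission_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    name = submission_dir_name(date(2024, 9, 10), "Iodine Fragment Opt")
    print(name)
    print(missing_artifacts(Path("qca-dataset-submission/submissions") / name))
```

Running the script from the directory that contains your clone prints the directory name and any files still missing from it.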
Creating a compute expansion
If you have already computed a dataset but want to re-compute it with a new `QCSpec` (e.g. a new level of theory), you can do so using a compute expansion. This is faster than creating a new dataset, and explicitly links datasets with the same molecules and purpose.
A compute expansion involves adding a file called `compute.json` to your original submission, which contains the dataset metadata (identical to the original dataset) and the new compute spec.
This can be done manually or programmatically.
The programmatic approach is described below, with an example of the notebook and of the file.
- Create a new branch as described above, and navigate to the submission directory of the dataset you want to expand.
- Create a new Jupyter notebook called `generate-compute.ipynb` (example here).
- In the notebook, either download the original dataset and remove the molecules and original `QCSpec`, or re-create the dataset with the same name as the original and skip the molecule addition step.
  - See below for details about how changes to the dataset are propagated; note that the dataset name must be the same, and changes to any metadata except `compute-tag` and the `QCSpec` will be ignored when submitting the compute expansion.
  - Please note that the default `compute_tag` is `openff`; if you need to use a different one, please add it explicitly to the dataset at this step, as the `compute.json` file overrides the compute tag added manually to the PR. If you do need to change the compute tag after submission, you can change it by updating the label on the PR and the change will take effect when the error cycling action runs next.
- Add the new `QCSpec` to the dataset, and save the dataset to `compute.json` (example here).
- Add the additional compute spec to the submission's `README.md` file.
- Add the `generate-compute.ipynb` and `compute.json` files to the submission's `QCSubmit Manifest` entry in the `README.md` file.
- Proof the submission and open a PR. Dataset validation will run automatically.
- Once the dataset is validated, request a review, and once approved, your compute expansion will be submitted!
When the PR is merged, the following happens:
- CI checks for `compute*.json*`, so files can be called anything so long as they follow that pattern.
- This gets loaded into a QCSubmit `dataset` structure in CI (see `lifecycle.py`, `SubmittableBase`) and submitted to MolSSI with `openff.qcsubmit.datasets.datasets._BaseDataset.submit()`.
- `submit()` checks if the dataset already exists using only the dataset type and name. Changes in descriptions, other metadata, etc. don't affect anything. New/different molecules will also be ignored if the dataset name already exists.
- `submit()` adds the specifications.
- `submit()` submits with the `compute_tag` and `priority` within the new `compute.json`.
- Other info in the dataset, such as `dataset_tags`, is not incorporated into additional compute submissions, so changing it will not affect the dataset.
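To see which file names the `compute*.json*` pattern would pick up, here is a small sketch using Python's standard `fnmatch` module. It assumes the pattern is applied as a shell-style glob to bare file names; the actual matching logic lives in `lifecycle.py` and may differ in detail. The function name `find_compute_files` is hypothetical.

```python
from fnmatch import fnmatchcase

# Pattern quoted in the text above; applied here as a shell-style glob.
COMPUTE_PATTERN = "compute*.json*"


def find_compute_files(filenames):
    """Return the file names that match the compute expansion pattern."""
    return [name for name in filenames if fnmatchcase(name, COMPUTE_PATTERN)]


if __name__ == "__main__":
    candidates = [
        "compute.json",
        "compute-b3lyp.json.bz2",
        "dataset.json.bz2",
        "generate-compute.ipynb",
        "README.md",
    ]
    print(find_compute_files(candidates))
```

Under this assumption, `compute.json` and a compressed or suffixed variant such as `compute-b3lyp.json.bz2` both match, while `dataset.json.bz2` and the notebook do not.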
The Lifecycle of a Dataset Submission
All Open Force Field datasets submitted to QCArchive undergo a well-defined lifecycle.
Each labeled rectangle in the lifecycle represents a state. A submission PR changes state according to the arrows. Changes in state may be performed by automation or manually by a human when certain criteria are met.
The lifecycle process is described below, with [bracketed] items indicating the agent of action, one of:
- [GHA]: Github Actions
- [Board]: Github Project Board
- [Human]: A maintainer of the `qca-dataset-submission` repository.
- A PR is created against `qca-dataset-submission` by a submitter.
  - the template is filled out with informational sections according to the PR template
  - [GHA] `validation` operates on all `dataset*.json` files found in the PR; performs validation checks
    - comment made based on validation checks
    - reruns on each push
- Add card for the PR to Dataset Tracking board.
  - [Human] add 'tracking' tag to PR
  - [GHA] `lifecycle-backlog` will add card to "Backlog" state for PR if not yet there.
- When the submission is ready to be submitted to public QCArchive (validations pass, submitters and reviewers satisfied), the PR is merged.
  - [Board] PR card will move to state "Queued for Submission" immediately.
  - [GHA] `lifecycle-backlog` will move PR card to state "Queued for Submission" if merged and in state "Backlog"
  - [GHA] `lifecycle-submission` will attempt to submit the dataset
    - if successful, will move card to state "Error Cycling"; add comment to PR
    - if failed, will keep card queued; add comment to PR; attempt again next execution
  - [Human] Submit worker jobs on a server to begin compute. If using Nautilus, carefully monitor utilization and scale down resources as jobs finish.
- COMPLETE, INCOMPLETE, ERROR numbers reported for `Optimizations`, `TorsionDrives`
  - [GHA] `lifecycle-error-cycle` will collect the above statistics for state "Error Cycling" PRs
    - will restart all errored `Optimizations` and `TorsionDrives`
    - will move PR to state "Archived/Complete" if no ERROR, INCOMPLETE, all COMPLETE
- PR will remain in state "Error Cycling" until moved to "Requires Scientific Review" or until all tasks COMPLETE
  - [Human] if errors appear persistent, move to state "Requires Scientific Review"
  - discussion should be had on PR for next version
  - [Human] once decided, state moved to "End of Life"
  - [Human] ensure all worker jobs have been shut down.
- [GHA] `lifecycle-end-of-life` will add tag 'end-of-life' to dataset in QCArchive for PR in "End of Life"
- [GHA] `lifecycle-archived-complete` will add tag 'archived-complete' to dataset in QCArchive for PR in "Archived/Complete"
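The states and transitions described in this section can be summarized as a small state machine. The sketch below is illustrative only: the state names are taken from the text, but the transition table includes only the moves mentioned in prose (the lifecycle diagram may show additional arrows), and the names `State`, `TRANSITIONS`, and `advance` are hypothetical rather than taken from the repository's automation.

```python
from enum import Enum


class State(Enum):
    BACKLOG = "Backlog"
    QUEUED = "Queued for Submission"
    ERROR_CYCLING = "Error Cycling"
    SCIENTIFIC_REVIEW = "Requires Scientific Review"
    END_OF_LIFE = "End of Life"
    ARCHIVED_COMPLETE = "Archived/Complete"


# Assumption: only transitions stated in the text are listed here.
TRANSITIONS = {
    State.BACKLOG: {State.QUEUED},                       # PR merged
    State.QUEUED: {State.ERROR_CYCLING},                 # submission succeeded
    State.ERROR_CYCLING: {State.SCIENTIFIC_REVIEW,       # persistent errors
                          State.ARCHIVED_COMPLETE},      # all tasks COMPLETE
    State.SCIENTIFIC_REVIEW: {State.END_OF_LIFE},        # decided by a human
    State.END_OF_LIFE: set(),
    State.ARCHIVED_COMPLETE: set(),
}


def advance(current, target):
    """Return the target state if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value!r} to {target.value!r}")
    return target


if __name__ == "__main__":
    state = State.BACKLOG
    for nxt in (State.QUEUED, State.ERROR_CYCLING, State.ARCHIVED_COMPLETE):
        state = advance(state, nxt)
        print(state.value)
```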
Management Touchpoints
In addition to the states given above, there are additional touchpoints available for managing dataset submissions:
- The `tracking` label is the "on/off" switch for automation via Github Actions. To disable all automation on a submission PR, remove this label. To enable automation, add the label.
- Submission priority can be changed by adding one of the following labels:
  - `priority-high`: highest priority
  - `priority-normal`: normal priority
  - `priority-low`: lowest priority
- Submission routing to QCFractal managers on different compute resources can be accomplished with compute tags. Add a label like `compute-<tagname>` to set the compute tag for all QCArchive tasks associated with a submission. Be sure to coordinate with QCFractal manager admins to ensure your chosen compute tag is being served on the expected resources. This mechanism can also be used to "dead-letter" computations that are no longer desired by setting a compute tag that no manager will service.
- The order of a submission PR in a Dataset Tracking column matters. Submissions higher in a column will be operated on first by all Github Action automation. For example, if you want to error cycle a submission before any others so it has a higher chance of being pulled by idle manager workers, place it at the top of the Error Cycling column.
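The label conventions above lend themselves to a simple parser. The following sketch shows one way PR labels could be turned into submission settings; it is not the repository's implementation. The function name `parse_labels` and the returned dictionary layout are hypothetical, and the fallback compute tag `openff` is the default mentioned in the compute expansion section.

```python
PRIORITY_LABELS = ("priority-high", "priority-normal", "priority-low")
COMPUTE_PREFIX = "compute-"


def parse_labels(labels):
    """Translate PR labels into tracking, priority, and compute tag settings."""
    settings = {
        "tracking": False,         # automation is off unless 'tracking' is present
        "priority": None,          # None means no priority label was set
        "compute_tag": "openff",   # default tag, per the compute expansion notes
    }
    for label in labels:
        if label == "tracking":
            settings["tracking"] = True
        elif label in PRIORITY_LABELS:
            settings["priority"] = label[len("priority-"):]
        elif label.startswith(COMPUTE_PREFIX):
            settings["compute_tag"] = label[len(COMPUTE_PREFIX):]
    return settings


if __name__ == "__main__":
    print(parse_labels(["tracking", "priority-high", "compute-pacific"]))
```

Here `compute-pacific` is a made-up label used purely to show how the tag name is extracted.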
Dude where's my Dataset?
Finding the source of a dataset in QCArchive can be difficult; here we offer a mapping between a dataset in QCArchive and the folder which contains its inputs including a quick overview of some metadata and the status of the dataset.
Note that new datasets submitted using QCSubmit know where they were created and have a `long_description_url` in the metadata which points directly to their home folder in this repository.
Status
The status only refers to the `default` specification, which is required for all of our datasets. Currently this is `B3LYP-D3BJ/DZVP`.
Key:
- 100% of all default spec jobs have completed.
- Some of the jobs in the dataset contain errors which may prevent the jobs from finishing; this could be something like a linear torsiondrive.
- The dataset is currently running and may have some incomplete jobs.
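The three status categories in the key can be derived from the COMPLETE, INCOMPLETE, and ERROR counts of the default specification. The sketch below is one possible reading: the labels `"complete"`, `"error"`, and `"running"` and the function name `classify_status` are hypothetical, and giving errors precedence over incomplete jobs is an assumption rather than something the key states.

```python
def classify_status(complete, incomplete, error):
    """Map job counts for the default spec onto one of the three status categories."""
    if min(complete, incomplete, error) < 0:
        raise ValueError("job counts must be non-negative")
    if complete + incomplete + error == 0:
        raise ValueError("dataset has no jobs for the default spec")
    if error > 0:
        # Assumption: any errored job marks the dataset as having errors.
        return "error"
    if incomplete > 0:
        return "running"
    return "complete"


if __name__ == "__main__":
    print(classify_status(complete=120, incomplete=0, error=0))
    print(classify_status(complete=100, incomplete=15, error=5))
    print(classify_status(complete=100, incomplete=20, error=0))
```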
Basic Datasets
These are currently used to compute properties of a minimum energy conformation (Hessians, wavefunctions, etc.), usually derived from completed optimization datasets.
QCArchive Dataset | Folder | Description | Elements | Status |
---|---|---|---|---|
OpenFF Optimization Set 1 | 2019-07-09-OpenFF-Optimization-Set | Hessian calculations. | Cl, S, C, F, O, H, N | |
OpenFF NCI250K Boron 1 | 2019-07-05 OpenFF NCI250K Boron 1 | Hessian calculations. | Cl, Br, S, C, F, B, O, H, N | |
OpenFF Discrepancy Benchmark 1 | 2019-07-05 eMolecules force field discrepancies 1 | Hessian calculation. | Cl, Br, S, C, F, P, I, O, H, N | |
OpenFF Gen 2 Opt Set 1 Roche | 2020-03-20-OpenFF-Gen-2-Optimization-Set-1-Roche | Hessian calculation. | Cl, S, C, F, O, H, N | |
OpenFF Gen 2 Opt Set 2 Coverage | 2020-03-20-OpenFF-Gen-2-Optimization-Set-2-Coverage | Hessian calculations. | Cl, Br, S, C, F, P, I, O, H, N | |
OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy | 2020-03-20-OpenFF-Gen-2-Optimization-Set-3-Pfizer-Discrepancy | Hessian calculations. | Cl, F, C, S, O, H, N | |
OpenFF Gen 2 Opt Set 4 eMolecules Discrepancy | 2020-03-20-OpenFF-Gen-2-Optimization-Set-4-eMolecules-Discrepancy | Hessian calculations. | Cl, Br, S, C, F, P, I, O, H, N | |
OpenFF Gen 2 Opt Set 5 Bayer | 2020-03-20-OpenFF-Gen-2-Optimization-Set-5-Bayer | Hessian calculations. | Si, Cl, Br, F, C, S, O, H, N | |
OpenFF VEHICLe Set 1 | 2019-07-02 VEHICLe optimization dataset | Hessian calculations. | S, C, O, H, N | |
SMIRNOFF Coverage Set 1 | 2019-06-25-smirnoff99Frost-coverage | Hessian calculations. | Cl, Br, S, C, F, P, I, O, H, N | |
OpenFF ESP Fragment Conformers v1.0 | 2022-01-16-OpenFF-ESP-Fragment-Conformers-v1.0 | ESP Calculations | N, Cl, C, H, P, Br, O, F, S | |
OpenFF Theory Benchmarking Single Point Energies v1.0 | 2021-09-06-theory-bm-single-points | Single Point Energy dataset for the final optimized geometries from MP2/heavy-aug-cc-pVTZ torsiondrives. | Cl, F, C, S, O, H, N, P | |
TorsionNet500 Single Points Dataset v1.0 | 2021-11-09-TorsionNet500-single-points | Single point energies of final geometries of TorsionNet500 dataset. | H, O, F, S, N, Cl, C | |
SPICE DES Monomers Single Points Dataset v1.1 | 2021-11-15-QMDataset-DES-monomers-single-points | Single point energy calculation of DES monomers. | I, C, Br, P, Cl, H, S, O, F, N | |
SPICE Solvated Amino Acids Single Points Dataset v1.1 | 2021-11-08-QMDataset-Solvated-Amino-Acids-single-points | Single point energy calculation of solvated amino acids. | N, S, O, C, H | |
SPICE DES370K Single Points Dataset v1.0 | 2021-11-08-QMDataset-DES370K-single-points | SPICE single point dataset for ML applications. | 'N', 'O', 'Mg', 'H', 'F', 'K', 'Br', 'Na', 'P', 'Cl', 'I', 'Ca', 'S', 'Li', 'C' | |
SPICE DES370K Single Points Dataset Supplement v1.0 | 2022-02-18-QMDataset-DES370K-single-points-supplement | SPICE single point dataset for ML applications. | F, H, Cl, S, I, Br, N, Li, O, C, Na | |
SPICE Dipeptides Single Points Dataset v1.2 | 2021-11-08-QMDataset-Dipeptide-single-points | SPICE single point dataset for ML applications. | C, N, O, H, S | |
SPICE PubChem Set 1 Single Points Dataset v1.2 | 2021-11-08-QMDataset-pubchem-set1-single-points | SPICE single point dataset for ML applications. | 'O', 'Cl', 'N', 'C', 'P', 'Br', 'S', 'F', 'I', 'H' | |
SPICE PubChem Set 2 Single Points Dataset v1.2 | 2021-11-09-QMDataset-pubchem-set2-single-points | SPICE single point dataset for ML applications. | 'H', 'P', 'C', 'Cl', 'Br', 'N', 'F', 'S', 'O', 'I' | |
SPICE PubChem Set 3 Single Points Dataset v1.2 | 2021-11-09-QMDataset-pubchem-set3-single-points | SPICE single point dataset for ML applications. | 'N', 'C', 'S', 'Cl', 'Br', 'F', 'P', 'I', 'H', 'O' | |
SPICE PubChem Set 4 Single Points Dataset v1.2 | 2021-11-09-QMDataset-pubchem-set4-single-points | SPICE single point dataset for ML applications. | 'N', 'S', 'Br', 'O', 'C', 'F', 'H', 'I', 'Cl', 'P' | |
SPICE PubChem Set 5 Single Points Dataset v1.2 | 2021-11-09-QMDataset-pubchem-set5-single-points | SPICE single point dataset for ML applications. | 'F', 'H', 'S', 'Br', 'Cl', 'N', 'P', 'C', 'I', 'O' | |
SPICE PubChem Set 6 Single Points Dataset v1.2 | 2021-11-09-QMDataset-pubchem-set6-single-points | SPICE single point dataset for ML applications. | 'Cl', 'O', 'N', 'H', 'C', 'P', 'S', 'F', 'Br', 'I' | |
OpenFF ESP Industry Benchmark Set v1.1 | 2022-02-02-OpenFF-ESP-Industry-Benchmark-Set-v1.1-single-point | HF/6-31G* conformers of public industry benchmark molecules. | N, F, Cl, C, H, O, Br, P, S | |
SPICE Ion Pairs Single Points Dataset v1.1 | 2022-06-08-QMDataset-ion-pairs | SPICE single point dataset for ML applications. | 'F', 'Cl', 'Li', 'Na', 'Br', 'K', 'I' | |
RNA Single Point Dataset v1.0 | 2022-07-07-RNA-basepair-triplebase-single-points | RNA single point dataset consisting of RNA basepairs and triple bases. | 'P', 'N', 'O', 'C', 'H' | |
RNA Trinucleotide Single Point Dataset v1.0 | 2022-10-21-RNA-trinucleotide-single-points | Single point energy calculations of RNA basepairs and triple bases | 'O', 'N', 'C', 'H', 'P' | |
RNA Nucleoside Single Point Dataset v1.0 | 2023-03-09-RNA-nucleoside-single-points | Single point energy calculations of RNA nucleosides without O5' hydroxyl atom | 'O', 'N', 'C', 'H' | |
OpenFF multi-Br ESP Fragment Conformers v1.1 | 2023-11-30-OpenFF-multi-Br-ESP-Fragment-Conformers-v1.1-single-point | Single point ESP calculations | Br, C, F, H, N, O, P, S | |
MLPepper RECAP Optimized Fragments v1.0 | 2024-07-26-MLPepper-RECAP-Optimized-Fragments-v1.0 | Single point property calculations for charge models | P, B, Cl, Br, C, H, I, F, O, N, Si, S | |
OpenFF NAGL2 ESP Timing Benchmark v1.0 | 2024-09-06-OpenFF-NAGL2-ESP-Timing-Benchmark-v1.0 | Single point ESP calculations for timing/memory benchmarking | 'P', 'S', 'N', 'C', 'Cl', 'F', 'Br', 'O', 'H', 'I' | |
OpenFF NAGL2 ESP Timing Benchmark v1.1 | 2024-09-18-OpenFF-NAGL2-ESP-Timing-Benchmark-v1.1 | Single point ESP calculations for timing/memory benchmarking | 'P', 'S', 'N', 'C', 'Cl', 'F', 'Br', 'O', 'H', 'I' | |
OpenFF Sulfur Hessian Training Coverage Supplement v1.0 | 2024-09-18-OpenFF-Sulfur-Hessian-Training-Coverage-Supplement-v1.0 | Additional Hessian training data for Sage sulfur and phosphorus parameters (from 'OpenFF Sulfur Optimization Training Coverage Supplement v1.0') | O, S, C, Cl, P, N, F, Br, H | |
OpenFF Aniline Para Hessian v1.0 | 2024-10-07-OpenFF-Aniline-Para-Hessian-v1.0 | Hessian single points for the final molecules in the OpenFF Aniline Para Opt v1.0 dataset | 'O', 'Cl', 'S', 'Br', 'H', 'F', 'N', 'C' | |
OpenFF Gen2 Hessian Dataset Protomers v1.0 | 2024-10-07-OpenFF-Gen2-Hessian-Dataset-Protomers-v1.0 | Hessian single points for the final molecules in the OpenFF Gen2 Optimization Dataset Protomers v1.0 dataset | 'H', 'C', 'Cl', 'P', 'F', 'Br', 'O', 'N', 'S' | |
MLPepper-RECAP-Optimized-Fragments-Add-Iodines-v1.0 | 2024-10-11-MLPepper-RECAP-Optimized-Fragments-Add-Iodines-v1.0 | Set of diverse iodine-containing molecules with a number of calculated electrostatic properties. | Br, Cl, S, B, O, Si, C, N, I, P, H, F | |
Optimization Datasets
These are currently used to find a minimum energy conformation of a molecule.
QCArchive Dataset | Folder | Description | Elements | Status |
---|---|---|---|---|
OpenFF Optimization Set 1 | 2019-05-16-Roche-Optimization_Set | Geometry optimizations of a set of Roche molecules for forcefield fitting. | Cl, S, C, F, O, H, N | |
SMIRNOFF Coverage Set 1 | 2019-06-25-smirnoff99Frost-coverage | An optimization dataset that exercises all parameters in smirnoff99Frosst. | Cl, Br, S, C, F, P, I, O, H, N | |
OpenFF VEHICLe Set 1 | 2019-07-02 VEHICLe optimization dataset | VEHICLe (virtual exploratory heterocyclic library) dataset of 24,867 aromatic heterocyclic rings with expanded stereochemistry. | S, C, O, H, N | |
OpenFF Discrepancy Benchmark 1 | 2019-07-05 eMolecules force field discrepancies 1 | A set of molecules whose optimized structures differ across force fields. | Cl, Br, S, C, F, P, I, O, H, N | |
OpenFF NCI250K Boron 1 | 2019-07-05 OpenFF NCI250K Boron 1 | This database is a subset of boron-containing compounds from the NCI250K (Release 1 - Oct 1999) compound dataset. | Cl, Br, S, C, F, B, O, H, N | |
OpenFF Ehrman Informative Optimization v0.2 | 2019-09-06-OpenFF-Informative-Set | This provides an optimization dataset based on an initial batch of Jordan Ehrman's analysis of eMolecules, pulling out molecules with minimized geometries which are substantially different in different force fields. | Cl, Br, S, C, F, P, I, O, H, N | |
Pfizer discrepancy optimization dataset 1 | 2019-09-07-Pfizer-discrepancy-optimization-dataset-1 | This database is a subset of 100 challenging small molecule fragments where HF/minix followed by B3LYP/6-31G*//B3LYP/6-31G** differed substantially from OPLS3e. | Cl, F, C, S, O, H, N | |
FDA optimization dataset 1 | 2019-09-08-fda-optimization-dataset-1 | The ZINC15 FDA dataset was retrieved in mol2 format on Sun Sep 8 20:44:34 EDT 2019 via: http://zinc.docking.org/substances/subsets/fda.mol2?count=all | Cl, Br, F, C, S, P, I, O, H, N | |
Kinase Inhibitors: WBO Distributions | 2019-11-27-kinase-inhibitor-optimization | Geometry optimization of kinase inhibitor conformers to explore WBO conformation dependency. | Cl, Br, S, C, F, P, I, O, H, N | |
OpenFF Gen 2 Opt Set 1 Roche | 2020-03-20-OpenFF-Gen-2-Optimization-Set-1-Roche | 2nd generation optimization dataset for bond and valence parameter fitting. | Cl, S, C, F, O, H, N | |
OpenFF Gen 2 Opt Set 2 Coverage | 2020-03-20-OpenFF-Gen-2-Optimization-Set-2-Coverage | 2nd generation optimization dataset for bond and valence parameter fitting. | Cl, Br, S, C, F, P, I, O, H, N | |
OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy | 2020-03-20-OpenFF-Gen-2-Optimization-Set-3-Pfizer-Discrepancy | 2nd generation optimization dataset for bond and valence parameter fitting. | Cl, F, C, S, O, H, N | |
OpenFF Gen 2 Opt Set 4 eMolecules Discrepancy | 2020-03-20-OpenFF-Gen-2-Optimization-Set-4-eMolecules-Discrepancy | 2nd generation optimization dataset for bond and valence parameter fitting | Cl, Br, S, C, F, P, I, O, H, N | |
OpenFF Gen 2 Opt Set 5 Bayer | 2020-03-20-OpenFF-Gen-2-Optimization-Set-5-Bayer | 2nd generation optimization dataset for bond and valence parameter fitting. | Si, Cl, Br, F, C, S, O, H, N | |
OpenFF Protein Fragments v1.0 | 2020-07-06-OpenFF-Protein-Fragments-Initial | This is the initial test of running constrained optimizations on various protein fragments prepared by David Cerutti. Here we just have ALA as the central residue. | H, C, O, N | |
OpenFF Protein Fragments v2.0 | 2020-08-12-OpenFF-Protein-Fragments-version2 | This is the full protein fragment dataset (version2) consisting of constrained optimizations on various protein fragments prepared by David Cerutti. We have 12 central residues which are capped with a combination of different terminal residues. | S, C, O, H, N | |
OpenFF Sandbox CHO PhAlkEthOH v1.0 | 2020-09-18-OpenFF-Sandbox-CHO-PhAlkEthOH | The molecules are from the AlkEthOH and PhEthOH datasets originally used to build the smirnoff99Frosst parameters. The AlkEthOH was taken from here | H, C, O | |
OpenFF Industry Benchmark Season 1 v1.0 | 2021-03-30-OpenFF-Industry-Benchmark-Season-1-v1.0 | The combination of all publicly chosen compound sets by industry partners from the OpenFF season 1 industry benchmark | N, F, Cl, C, H, O, Br, P, S | |
OpenFF Industry Benchmark Season 1 v1.1 | 2021-06-04-OpenFF-Industry-Benchmark-Season-1-v1.1 | The combination of all publicly chosen compound sets by industry partners from the OpenFF season 1 industry benchmark | N, F, Cl, C, H, O, Br, P, S | |
OpenFF Theory Benchmarking Constrained Optimization Set MP2 heavy-aug-cc-pVTZ v1.1 | 2020-11-25-theory-bm-set-mp2-heavy-aug-cc-pvtz | This is a Constrained Optimization dataset for benchmarking MP2/heavy-aug-cc-pVTZ. | ||
OpenFF Industry Benchmark Season 1 - MM v1.1 | 2021-07-28-OpenFF-Industry-Benchmark-Season-1-MM-v1.1 | The combination of all publicly chosen compound sets by industry partners from the OpenFF season 1 industry benchmark; MM computations starting from QM-optimized geometries. | N, F, Cl, C, H, O, Br, P, S | |
OpenFF RESP Polarizability Optimizations v1.0 | 2021-10-01-OpenFF-resppol-mp2-single-point | A data set used for training ESP-fitting based typed atomic polarizabilities with a direct approximation. | N, C, H, O | |
OpenFF RESP Polarizability Optimizations v1.1 | 2021-10-01-OpenFF-resppol-mp2-single-point | A data set used for training ESP-fitting based typed atomic polarizabilities with a direct approximation. | N, C, H, O | |
SPICE Dipeptides Optimization Dataset v1.0 | 2021-11-11-Dipeptide-optimization-set | Optimization set created from the smiles of SPICE Dipeptide dataset. | N, C, H, O, S | |
OpenFF Gen 2 Optimization Dataset Protomers v1.0 | 2021-12-21-OpenFF-Gen2-Optimization-Set-Protomers | Optimization set created from the smiles of missing protomers in Gen 2 optimization sets. | O, F, S, Br, Cl, C, P, H, I, N | |
OpenFF ESP Industry Benchmark Set v1.0 | 2022-02-02-OpenFF-ESP-Industry-Benchmark-Set-v1.0-optimization-set | HF/6-31G* conformers of public industry benchmark molecules. | N, F, Cl, C, H, O, Br, P, S | |
OpenFF Protein Capped 1-mers 3-mers Optimization Dataset v1.0 | 2022-05-30-OpenFF-Protein-Capped-1-mers-3-mers-Optimization | Optimization dataset for protein capped 1-mers Ace-X-Nme and capped 3-mers Ace-Y-X-Y-Nme with Y = {Ala, Val} and X = 26 canonical amino acids with common protomers/tautomers (Ash, Cyx, Glh, Hid, Hip, and Lyn) | H, C, N, O, S | |
OpenFF Iodine Chemistry Optimization Dataset v1.0 | 2022-07-27-OpenFF-iodine-optimization-set | Optimization set created from Gen1 and Gen2 molecules containing iodine | 'C', 'F', 'O', 'H', 'Br', 'Cl', 'N', 'I', 'S' | |
OpenFF multi-Br ESP Fragment Conformers v1.0 | 2023-11-02-OpenFF-multi-Br-ESP-Fragment-Conformers-v1.0 | Optimization set created from 2022-01-16-OpenFF-ESP-Fragment-Conformers-v1.0 by selecting molecules with multiple Cl atoms and replacing them with Br | Br, C, F, H, N, O, P, S | |
XtalPi Shared Fragments OptimizationDataset v1.0 | 2024-01-30-xtalpi-shared-fragments-optimization-v1.0 | Representative optimization molecules used to fit XFF | C, H, Cl, Br, S, O, F, N, P | |
XtalPi 20-percent Fragments OptimizationDataset v1.0 | 2024-04-02-xtalpi-20-percent-fragments-optimization-v1.0 | Larger (20%) representative subset of molecules used to fit XFF | Cl, P, Br, I, H, C, B, Si, O, N, F, S | |
OpenFF Torsion Benchmark Supplement Optimization Dataset v1.0 | 2024-04-18-OpenFF-Torsion-Benchmark-Supplement-Optimization-Dataset-v1.0 | Additional optimizations for benchmarking Sage 2.2.0 proper torsions and new parameters from the torsion multiplicity work | H, C, N, O, F, P, S, Cl, Br | |
OpenFF Torsion Multiplicity Optimization Training Coverage Supplement v1.0 | 2024-06-20-OpenFF-Torsion-Multiplicity-Optimization-Training-Coverage-Supplement-v1.0 | Additional optimization training data for Sage 2.2.0 proper torsions and new parameters from the torsion multiplicity work | C, Cl, S, O, H, P, N, Br | |
OpenFF Torsion Multiplicity Optimization Benchmarking Coverage Supplement v1.0 | 2024-06-24-OpenFF-Torsion-Multiplicity-Optimization-Benchmarking-Coverage-Supplement-v1.0 | Additional optimization benchmarking data for Sage 2.2.0 proper torsions and new parameters from the torsion multiplicity work | Cl, H, I, S, O, N, Br, C, P | |
OpenFF Iodine Fragment Opt v1.0 | 2024-09-10-OpenFF-Iodine-Fragment-Opt-v1.0 | B3LYP-D3BJ/DZVP optimized conformers for a variety of I-containing fragment molecules | C, O, I, S, F, Br, Cl, N, H | |
OpenFF Sulfur Optimization Training Coverage Supplement v1.0 | 2024-09-11-OpenFF-Sulfur-Optimization-Training-Coverage-Supplement-v1.0 | Additional optimization training data for Sage sulfur and phosphorus parameters | C, S, F, O, H, Cl, Br, P, N | |
OpenFF Sulfur Optimization Benchmarking Coverage Supplement v1.0 | 2024-09-18-OpenFF-Sulfur-Optimization-Benchmarking-Coverage-Supplement-v1.0 | Additional optimization benchmarking data for Sage sulfur and phosphorus parameters | S, P, Cl, C, N, O, H, Br, F | |
OpenFF Lipid Optimization Training Supplement v1.0 | 2024-10-08-OpenFF-Lipid-Optimization-Training-Supplement-v1.0 | Additional optimization training data for Sage from representative LIPID MAPS fragments | I, Br, O, H, P, C, N, Cl, F, S | |
TorsionDrive Datasets
These are currently used to perform a complete rotation of one or more selected bonds, where optimizations are performed over a discrete set of angles.
QCArchive Dataset | Folder | Description | Elements | Status |
---|---|---|---|---|
Fragment Stability Benchmark | 2019-03-06-Fragmenter_Stability-Benchmark | Examination of different fragmentation schemes. | Cl, F, C, P, I, O, H, N | |
OpenFF Group1 Torsions | 2019-05-01-OpenFF-Group1-Torsions | A collection of torsion drives for forcefield fitting. | Cl, F, C, S, O, H, N | |
SMIRNOFF Coverage Torsion Set 1 | 2019-07-01-smirnoff99Frost-coverage-torsion | Set of small molecules that use all smirnoff99Frosst parameters. | Cl, Br, S, C, F, P, I, O, H, N | |
OpenFF Substituted Phenyl Set 1 | 2019-07-25-phenyl-set | A set of substituted phenyl torsiondrives. | Cl, Br, F, C, I, O, H, N | |
Pfizer discrepancy torsion dataset 1 | 2019-09-07-Pfizer-discrepancy-torsion-dataset-1 | This database is a subset of 100 challenging small molecule fragments where HF/minix followed by B3LYP/6-31G*//B3LYP/6-31G** differed substantially from OPLS3e. | Cl, F, C, S, O, H, N | |
TorsionDrive Paper | 2019-11-07-TorsionDrive-Paper | Torsion Drives to explore wavefront propagation for the TorsionDrive paper. | C, H, O | |
OpenFF Primary Benchmark 1 Torsion Set | 2019-12-05-OpenFF-Benchmark-Primary-1-torsion | Validation of optimized force field torsion parameters. | Cl, Br, F, C, S, O, H, N | |
OpenFF Primary Benchmark 2 Torsion Set | 2020-01-17-OpenFF-Benchmark-Full-1-torsion | Validation of optimized force field torsion parameters. | Cl, Br, S, C, F, P, I, O, H, N | |
OpenFF Group1 Torsions 2 | 2020-01-31-OpenFF-Group1-Torsions-2 | Generation of additional data for fitting of newly added torsion terms. | H, C, O, N | |
OpenFF Group1 Torsions 3 | 2020-02-10-OpenFF-Group1-Torsions-3 | Generation of additional data for fitting of t128 and t129 | H, C, O, N | |
OpenFF Gen 2 Torsion Set 1 Roche | 2020-03-12-OpenFF-Gen-2-Torsion-Set-1-Roche | Design 2nd generation torsion dataset for valence parameter fitting. | F, C, S, O, H, N | |
OpenFF Gen 2 Torsion Set 2 Coverage | 2020-03-12-OpenFF-Gen-2-Torsion-Set-2-Coverage | Design 2nd generation torsion dataset for valence parameter fitting. | Cl, Br, F, C, S, P, I, O, H, N | |
OpenFF Gen 2 Torsion Set 3 Pfizer Discrepancy | 2020-03-12-OpenFF-Gen-2-Torsion-Set-3-Pfizer-Discrepancy | Design 2nd generation torsion dataset for valence parameter fitting | S, C, F, O, H, N | |
OpenFF Gen 2 Torsion Set 4 eMolecules Discrepancy | 2020-03-12-OpenFF-Gen-2-Torsion-Set-4-eMolecules-Discrepancy | Design 2nd generation torsion dataset for valence parameter fitting. | Cl, Br, F, C, S, P, I, O, H, N | |
OpenFF Gen 2 Torsion Set 5 Bayer | 2020-03-12-OpenFF-Gen-2-Torsion-Set-5-Bayer | Design 2nd generation torsion dataset for valence parameter fitting. | Cl, Br, F, C, S, O, H, N | |
OpenFF Gen 2 Torsion Set 6 supplemental | 2020-03-12-OpenFF-Gen-2-Torsion-Set-6-supplemental | Design 2nd generation torsion dataset for valence parameter fitting. | S, C, O, H, N | |
OpenFF Gen 2 Torsion Set 1 Roche 2 | 2020-03-23-OpenFF-Gen-2-Torsion-Set-1-Roche-2 | Design 2nd generation torsion dataset for valence parameter fitting. | Cl, F, C, S, O, H, N | |
OpenFF Gen 2 Torsion Set 2 Coverage 2 | 2020-03-23-OpenFF-Gen-2-Torsion-Set-2-Coverage-2 | Design 2nd generation torsion dataset for valence parameter fitting. | Cl, Br, F, C, S, P, I, O, H, N | |
OpenFF Gen 2 Torsion Set 3 Pfizer Discrepancy 2 | 2020-03-23-OpenFF-Gen-2-Torsion-Set-3-Pfizer-Discrepancy-2 | Design 2nd generation torsion dataset for valence parameter fitting. | S, C, F, O, H, N | |
OpenFF Gen 2 Torsion Set 4 eMolecules Discrepancy 2 | 2020-03-23-OpenFF-Gen-2-Torsion-Set-4-eMolecules-Discrepancy-2 | Design 2nd generation torsion dataset for valence parameter fitting. | Cl, Br, F, C, S, P, I, O, H, N | |
OpenFF Gen 2 Torsion Set 5 Bayer 2 | 2020-03-26-OpenFF-Gen-2-Torsion-Set-5-Bayer-2 | Design 2nd generation torsion dataset for valence parameter fitting. | Cl, Br, F, C, S, O, H, N | |
OpenFF Gen 2 Torsion Set 6 supplemental 2 | 2020-03-26-OpenFF-Gen-2-Torsion-Set-6-supplemental-2 | Design 2nd generation torsion dataset for valence parameter fitting. | Br, S, C, F, O, H, N | |
OpenFF Fragmenter Validation 1.0 | 2020-04-28-Fragmenter-test | Examination of different fragmentation schemes. | Cl, S, C, P, I, O, H, N | |
OpenFF DANCE 1 eMolecules t142 v1.0 | 2020-06-01-DANCE-1-eMolecules-t142-selected | Molecules selected from the eMolecules database by DANCE to improve t142 parameterization in smirnoff99Frosst. | Cl, Br, F, C, S, O, H, N | |
OpenFF Rowley Biaryl v1.0 | 2020-06-17-OpenFF-Biaryl-set | This is a TorsionDrive dataset consisting of biaryl torsions provided by Christopher Rowley. Originally used to benchmark parsley, but could also be useful for fitting. | S, C, O, H, N | |
OpenFF-benchmark-ligand-fragments-v1.0 | 2020-07-27-OpenFF-Benchmark-Ligands | This is a torsiondrive dataset created from the OpenFF FEP benchmark dataset. The ligands are fragmented before having key torsions driven. | Cl, Br, S, C, F, I, O, H, N | |
OpenFF Theory Benchmarking Set B3LYP-D3BJ DZVP v1.0 | 2020-07-27-theory-bm-set-b3lyp-d3bj-dzvp | This is a TorsionDrive dataset consisting of 36 1-D torsions selected for benchmarking different QM levels. | Cl, F, C, S, P, O, H, N | |
OpenFF Theory Benchmarking Set B3LYP-D3BJ def2-TZVP v1.0 | 2020-07-30-theory-bm-set-b3lyp-d3bj-def2-tzvp | This is a TorsionDrive dataset consisting of 36 1-D torsions selected for benchmarking different QM levels. | Cl, F, C, S, P, O, H, N | |
OpenFF Theory Benchmarking Set B3LYP-D3BJ def2-TZVPD v1.0 | 2020-07-30-theory-bm-set-b3lyp-d3bj-def2-tzvpd | This is a TorsionDrive dataset consisting of 36 1-D torsions selected for benchmarking different QM levels. | Cl, F, C, S, P, O, H, N | |
OpenFF Theory Benchmarking Set B3LYP-D3BJ def2-TZVPP v1.0 | 2020-07-30-theory-bm-set-b3lyp-d3bj-def2-tzvpp | This is a TorsionDrive dataset consisting of 36 1-D torsions selected for benchmarking different QM levels. | Cl, F, C, S, P, O, H, N | |
OpenFF Protein Fragments TorsionDrives v1.0 | 2020-09-16-OpenFF-Protein-Fragments-TorsionDrives | This is a protein fragment dataset consisting of torsion drives on various protein fragments prepared by David Cerutti. We have 12 central residues capped with a combination of different terminal residues. We drive the following angles for each fragment: omega, phi, psi, chi1 (if applicable), and chi2 (if applicable). | S, C, O, H, N | |
OpenFF WBO Conjugated Series v1.0 | 2021-01-25-OpenFF-Conjugated-Series | This is a torsion drive dataset that consists of various chemistries that probe a range of conjugated bonds. The goal of this dataset is to develop WBO interpolated torsions for the OpenFF force field. | S, C, O, H, N | |
OpenFF Amide Torsion Set v1.0 | 2021-03-23-OpenFF-Amide-Torsion-Set-v1.0 | Amides, thioamides and amidines diversely functionalized. | S, C, O, H, N | |
OpenFF Aniline Para Opt v1.0 | 2021-04-02-OpenFF-Aniline-Para-Opt-v1.0 | Optimizations of diverse, para-substituted aniline derivatives. | Br, C, O, N, S, H, Cl, F | |
OpenFF Gen3 Torsion Set v1.0 | 2021-04-09-OpenFF-Gen3-Torsion-Set-v1.0 | This dataset is a simple-molecule-only torsiondrive dataset, aiming to avoid contamination of torsion parameters by large internal non-bonded interactions during valence parameter optimization. Molecules with one effective rotating bond were generated by combining two simple substituents, which were identified by fragmenting small drug-like molecules. Torsions from the generated molecule set were selected using a clustering method, so that each torsion parameter is trained on a chemically diverse set of molecules. | F, N, H, Cl, P, S, O, Br, C | |
OpenFF Aniline 2D Impropers v1.0 | 2021-03-29-OpenFF-Aniline-2D-Impropers-v1.0 | This dataset contains a set of aniline derivatives which have para-substituted groups of varying electron donating and withdrawing properties. This dataset was curated in an effort to improve and understand improper torsions in force fields. We will scan the improper and proper angle simultaneously to better understand the coupling and energetics of these torsions. | O, C, S, H, N | |
OpenFF BCC Refit Study COH v2.0 | 2021-06-22-OpenFF-BCC-Refit-Study-COH-v2.0 | A data set curated for the initial stage of the on-going OpenFF study which aims to co-optimize the AM1BCC bond charge correction (BCC) parameters against an experimental training set of density and enthalpy of mixing data points and a QM training set of electric field data. The initial data set is limited to only molecules composed of C, O, H. This limited scope significantly reduces the number of BCC parameters which must be retrained, thus allowing for easier convergence of the initial optimizations. The included molecules were combinatorially generated to cover a range of alcohol, ether, and carbonyl containing molecules. | O, C, S, H, N | |
OpenFF-benchmark-ligand-fragments-v2.0 | 2021-08-10-OpenFF-JACS-Fragments-v2.0 | This is a torsiondrive dataset created from the OpenFF FEP benchmark dataset. The ligands are fragmented using openff-fragmenter with both ambertools and openeye before having key torsions driven. | S, N, Br, C, H, O, Cl, F, I | |
OpenFF-Protein-Dipeptide-2D-TorsionDrive-v2.1 | 2021-11-18-OpenFF-Protein-Dipeptide-2D-TorsionDrive | Two-dimensional TorsionDrives on phi and psi for dipeptides of the 20 canonical amino acids and 6 alternate protomers/tautomers. | H, C, N, O, S | |
OpenFF-Protein-Capped-1-mer-Sidechains-v1.3 | 2022-02-10-OpenFF-Protein-Capped-1-mer-Sidechains | Two-dimensional TorsionDrives on chi1 and chi2 for capped 1-mers of amino acids with a rotatable bond in the sidechain. | H, C, N, O, S | |
OpenFF-Protein-Capped-3-mer-Backbones-v1.0 | 2022-05-30-OpenFF-Protein-Capped-3-mer-Backbones | Two-dimensional TorsionDrives on phi and psi for capped 3-mers Ace-Y-X-Y-Nme with Y = {Ala, Val}. | H, C, N, O, S | |
OpenFF-multiplicity-correction-torsion-drive-data-v1.1 | 2022-04-29-OpenFF-multiplicity-correction-torsion-drive-data-v1.1 | A torsiondrive dataset created to correct multiplicity issues in the force field. | S, P, O, C, H, N | |
OpenFF-Protein-Capped-3-mer-Omega-v1.0 | 2023-02-06-OpenFF-Protein-Capped-3-mer-Omega | TorsionDrives on omega for capped 3-mers Ace-Ala-X-Ala-Nme. | H, C, N, O, S | |
XtalPi Shared Fragments TorsiondriveDataset v1.0 | 2024-01-30-xtalpi-shared-fragments-torsiondrive-v1.0 | Representative torsion scan molecules used to fit XFF | C, H, Cl, Br, S, O, F, N, P | |
OpenFF Torsion Coverage Supplement v1.0 | 2024-02-29-OpenFF-Torsion-Coverage-Supplement-v1.0 | Additional TorsionDrives to improve coverage for Sage 2.1.0 proper torsions and new parameters from the torsion multiplicity work | C, Cl, F, H, N, O, S | |
OpenFF-RNA-Dinucleoside-Monophosphate-TorsionDrives-v1.0 | 2024-03-26-OpenFF-RNA-Dinucleoside-Monophosphate-TorsionDrives | TorsionDrives of non-ring backbone, glycosidic, and hydroxyl dihedrals in RNA XpY 2-mers. | H, C, N, O, P | |
XtalPi 20-percent Fragments TorsiondriveDataset v1.0 | 2024-04-02-xtalpi-20-percent-fragments-torsiondrive-v1.0 | Torsion scans of larger representative subset (20%) of molecules used to fit XFF | O, Br, I, Si, B, C, P, S, Cl, H, N, F | |
OpenFF Torsion Drive Supplement v1.0 | 2024-04-17-OpenFF-Torsion-Drive-Supplement-v1.0 | Additional TorsionDrives to expand training data for Sage 2.2.0 proper torsions and new parameters from the torsion multiplicity work | H, C, N, O, P, S | |
OpenFF Torsion Multiplicity Torsion Drive Coverage Supplement v1.0 | 2024-06-14-OpenFF-Torsion-Multiplicity-Torsion-Drive-Coverage-Supplement-v1.0 | Additional torsion drive training data for Sage 2.2.0 proper torsions and new parameters from the torsion multiplicity work | N, Br, H, P, Cl, O, C, S | |
OpenFF Phosphate Torsion Drives v1.0 | 2024-07-17-OpenFF-Phosphate-Torsion-Drives-v1.0 | Lipid-like phosphate torsions | C, S, N, H, O, P | |
OpenFF Alkane Torsion Drives v1.0 | 2024-08-09-OpenFF-Alkane-Torsion-Drives-v1.0 | Alka/ene torsion drives | C, H | |
GridOptimization Datasets
These are currently used to perform a scan of one or more internal coordinates (bond, angle, torsion), where an optimization is performed at each of a discrete set of values.
QCArchive Dataset | Folder | Description | Elements | Status |
---|---|---|---|---|
OpenFF Trivalent Nitrogen Set 1 | 2019-06-28-Nitrogen-grid-optimization | Set of diverse trivalent nitrogen molecules for 1-D grid optimization. | Si, Cl, Br, F, C, S, P, B, I, O, H, N | |
OpenFF Trivalent Nitrogen Set 2 | 2019-12-09-Nitrogen-grid-optimization-2d | Set of diverse trivalent nitrogen molecules for 2-D grid optimization | Si, Cl, Br, F, C, S, P, B, I, O, H, N | |
OpenFF Trivalent Nitrogen Set 3 | 2020-01-15-Nitogen-grid-optimization-02-1dscans | Secondary set of diverse trivalent nitrogen molecules for 1-D grid optimization. | Cl, Br, S, C, F, O, H, N | |
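
To illustrate what "a discrete set of values" means for the 1-D and 2-D grid optimizations listed above, here is a minimal sketch that only enumerates grid points. It does not use the QCArchive or QCSubmit API. The function name `grid_points`, the `(start, stop, n_steps)` tuple layout, and the angle ranges are all hypothetical and chosen for illustration. In an actual GridOptimization, each point would correspond to one optimization with the scanned coordinates held at those values.

```python
from itertools import product


def grid_points(scans):
    """Enumerate every combination of target values for the scanned coordinates.

    `scans` is a list of (start, stop, n_steps) tuples, one per scanned
    coordinate. This layout is an assumption made for this sketch only.
    """
    axes = []
    for start, stop, n_steps in scans:
        step = (stop - start) / (n_steps - 1)
        axes.append([start + i * step for i in range(n_steps)])
    # The Cartesian product gives one tuple of target values per grid point.
    return list(product(*axes))


# 1-D scan: a single coordinate sampled at 5 evenly spaced values.
one_d = grid_points([(-40.0, 40.0, 5)])

# 2-D scan: two coordinates scanned together, 5 x 4 grid points.
two_d = grid_points([(-40.0, 40.0, 5), (0.0, 90.0, 4)])

print(len(one_d), len(two_d))
```

The number of optimizations grows as the product of the per-coordinate step counts, which is why 2-D scans are considerably more expensive than 1-D scans of the same resolution.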