OpenFF QCArchive Dataset Submission

Dataset Lifecycle

All datasets submitted to QCArchive via this repository conform to the Dataset Lifecycle.

See STANDARDS.md for submission standards. Datasets must be submitted as pull requests.

User Quickstart

Ensure git-lfs is installed on your local machine: https://git-lfs.github.com/
To submit a new dataset, begin by cloning this repository:
```
export GIT_LFS_SKIP_SMUDGE=1
git clone git@github.com:openforcefield/qca-dataset-submission.git
```
This clones the repository without downloading existing LFS objects. To download all LFS objects, omit the `export GIT_LFS_SKIP_SMUDGE=1` line.
Once cloned, create and switch to a new branch from master, then create a new directory in qca-dataset-submission/submissions/:
```
git checkout -b <dataset-branch>
mkdir qca-dataset-submission/submissions/YYYY-MM-DD-OpenFF-<DESCRIPTIVE-DATASET-NAME>-v1.0
```
You will add all submission artifacts to this directory.
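The naming scheme above can be checked before the directory is created. The following is a minimal sketch, not part of the repository tooling: the helper name `submission_dir_name` and the regular expression are illustrative assumptions based only on the `YYYY-MM-DD-OpenFF-<DESCRIPTIVE-DATASET-NAME>-v1.0` pattern shown here.

```python
import re
from datetime import date

# Assumed pattern for YYYY-MM-DD-OpenFF-<DESCRIPTIVE-DATASET-NAME>-v<major>.<minor>.
# This regex is an illustration, not a rule enforced by the repository.
_NAME_RE = re.compile(r"^\d{4}-\d{2}-\d{2}-OpenFF-[A-Za-z0-9-]+-v\d+\.\d+$")


def submission_dir_name(day: date, descriptive_name: str, version: str = "1.0") -> str:
    """Build a submission directory name following the scheme above (hypothetical helper)."""
    # Join whitespace-separated words with hyphens so the name is path-friendly.
    slug = "-".join(descriptive_name.split())
    name = f"{day.isoformat()}-OpenFF-{slug}-v{version}"
    if not _NAME_RE.match(name):
        raise ValueError(f"name does not follow the expected scheme: {name}")
    return name


print(submission_dir_name(date(2024, 10, 8), "Lipid Optimization Training Supplement"))
```

The resulting string can be passed to `mkdir` under `qca-dataset-submission/submissions/`.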

Create and activate a new conda env with basic submission-preparation requirements with:

conda env create -f qca-dataset-submission/devtools/prod-envs/qcarchive-user-submit.yaml
conda activate qcarchive-user-submit

Choose a starting notebook and README based on the type of dataset you wish to submit:
- OptimizationDataset
Copy the notebook and README for the dataset you want into the directory you created.
```
cp examples/<dataset-type>/* qca-dataset-submission/submissions/YYYY-MM-DD-OpenFF-<DESCRIPTIVE-DATASET-NAME>-v1.0
```
Start up a Jupyter notebook with your new notebook:
```
jupyter notebook qca-dataset-submission/submissions/YYYY-MM-DD-OpenFF-<DESCRIPTIVE-DATASET-NAME>-v1.0/generate-dataset.ipynb
```
Edit the contents with the appropriate metadata, read in your molecules using the cells that match your input data, and make any other modifications your dataset requires.
Copy the generated metadata components into the README, and write a reasonably detailed high-level description of the submission at the top.
Commit the following files in the submission directory you made:
- your input files; please compress them if possible with e.g. bzip2
- generate-dataset.ipynb
- dataset.pdf
- dataset.smi
- dataset.json.bz2
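Before committing, it can help to confirm that the generated files listed above are present. This is a small sketch under stated assumptions: `missing_artifacts` is a hypothetical helper, and it only checks the four fixed file names from the list (your own input files vary, so they are not checked).

```python
import tempfile
from pathlib import Path

# File names taken from the commit checklist above; input files are not included
# because their names differ per submission.
REQUIRED = ("generate-dataset.ipynb", "dataset.pdf", "dataset.smi", "dataset.json.bz2")


def missing_artifacts(submission_dir: str) -> list[str]:
    """Return the checklist files that are absent from the submission directory."""
    root = Path(submission_dir)
    return [name for name in REQUIRED if not (root / name).is_file()]


# Demonstration on a throwaway directory that contains only dataset.smi.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "dataset.smi").touch()
    print(missing_artifacts(tmp))
```

An empty list means all four generated files are in place.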
Push your branch to GitHub:
```
git push origin <dataset-branch>
```
Make a new PR for the branch. Validation will run automatically on your dataset.json.* file, indicating any potential issues prior to submission. Ask for help if you see validation failures you do not understand. Ping a reviewer in the comments.
Once reviewed and approved, your submission will be merged and submitted to QCArchive! Computations specified by the submission will be performed on OpenFF-managed compute resources.

Creating a compute expansion

If you have already computed a dataset but want to re-compute it with a new QCSpec (e.g. a new level of theory), you can do so using a compute expansion. This is faster than creating a new dataset, and explicitly links datasets with the same molecules and purpose. A compute expansion involves adding a file called compute.json to your original submission, which contains the dataset metadata (identical to the original dataset) and the new compute spec. This can be done manually or programmatically. The programmatic approach is described below, with an example of the notebook and of the file.

Create a new branch as described above, and navigate to the submission directory of the dataset you want to expand.
Create a new Jupyter notebook called generate-compute.ipynb (example here).
In the notebook, either download the original dataset and remove the molecules and original QCSpec, or re-create the dataset with the same name as the original and skip the molecule addition step.

See below for details about how changes to the dataset are propagated; note that the dataset name must be the same, and changes to any metadata except compute-tag and the QCSpec will be ignored when submitting the compute expansion.
Please note that the default compute_tag is openff; if you need to use a different one, please add it explicitly to the dataset at this step, as the compute.json file overrides the compute tag added manually to the PR. If you do need to change the compute tag after submission, you can change it by updating the label on the PR and the change will take effect when the error cycling action runs next.

Add the new QCSpec to the dataset, and save the dataset to compute.json (example here).
Add the additional compute spec to the submission's README.md file.
Add the generate-compute.ipynb and compute.json files to the submission's QCSubmit Manifest entry in the README.md file.
Proof the submission and open a PR. Dataset validation will run automatically.
Once the dataset is validated, request a review, and once approved, your compute expansion will be submitted!
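Because the dataset name must match the original and the expansion must carry a new QCSpec, a quick consistency check before opening the PR may catch mistakes early. The sketch below works on plain dictionaries, as obtained by loading the JSON files. The key names `dataset_name`, `type`, and `qc_specifications` are assumptions about the serialized layout, and the dataset and spec names in the demonstration are hypothetical; verify them against your own files.

```python
def check_compute_expansion(original: dict, expansion: dict) -> list[str]:
    """Compare an original dataset dict with a compute expansion dict.

    Returns a list of human-readable problems; an empty list means the
    expansion looks consistent. Key names are assumed, not confirmed.
    """
    problems = []
    # The expansion is matched to the existing dataset by type and name.
    for key in ("dataset_name", "type"):
        if original.get(key) != expansion.get(key):
            problems.append(
                f"{key} differs: {original.get(key)!r} vs {expansion.get(key)!r}"
            )
    # At least one spec name must be new, otherwise nothing would be computed.
    old_specs = set(original.get("qc_specifications", {}))
    new_specs = set(expansion.get("qc_specifications", {})) - old_specs
    if not new_specs:
        problems.append("no new QCSpec found in the expansion")
    return problems


# Hypothetical example data for illustration only.
original = {
    "dataset_name": "OpenFF Example v1.0",
    "type": "OptimizationDataset",
    "qc_specifications": {"default": {}},
}
expansion = {
    "dataset_name": "OpenFF Example v1.0",
    "type": "OptimizationDataset",
    "qc_specifications": {"mp2-spec": {}},
}
print(check_compute_expansion(original, expansion))
```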

When the PR is merged, the following happens:

CI checks for compute*.json*, so files can be called anything so long as they follow that pattern.
This gets loaded into a QCSubmit dataset structure in CI (see lifecycle.py, SubmittableBase) and submitted to MolSSI with openff.qcsubmit.datasets.datasets._BaseDataset.submit()
submit() checks if the dataset already exists using only the dataset type and name. Changes in descriptions, other metadata, etc. don't affect anything. New/different molecules will also be ignored if the dataset name already exists.
submit() adds the specifications
submit() submits with the compute_tag and priority within the new compute.json.
Other information in the dataset, such as dataset_tags, is not incorporated into additional compute submissions, so changing it will not affect the dataset.
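To see which file names the `compute*.json*` pattern would pick up, the matching can be reproduced with the standard library. This is only an illustration of shell-style matching; whether the CI uses `fnmatch`, `glob`, or something else is an assumption, and the file names in the demonstration are hypothetical.

```python
from fnmatch import fnmatchcase


def is_compute_file(filename: str) -> bool:
    """Return True if the file name matches the compute*.json* pattern."""
    # fnmatchcase is used so the result does not depend on the operating system.
    return fnmatchcase(filename, "compute*.json*")


# Hypothetical file names for illustration.
for name in ("compute.json", "compute-mp2.json.bz2", "dataset.json.bz2", "compute.yaml"):
    print(name, is_compute_file(name))
```

Under this pattern a compressed file or one with a suffix in its name still qualifies, while the original dataset file does not.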

The Lifecycle of a Dataset Submission

All Open Force Field datasets submitted to QCArchive undergo a well-defined lifecycle.

[Figure: Dataset Lifecycle diagram]

Each labeled rectangle in the lifecycle represents a state. A submission PR changes state according to the arrows. Changes in state may be performed by automation or manually by a human when certain criteria are met.

The lifecycle process is described below, with [bracketed] items indicating the agent of action, one of:

[GHA]: GitHub Actions
[Board]: GitHub Project Board
[Human]: A maintainer of the qca-dataset-submission repository.

A PR is created against qca-dataset-submission by a submitter.
- the template is filled out with informational sections according to the PR template
- [GHA] validation operates on all dataset*.json files found in the PR; performs validation checks
  - comment made based on validation checks
  - reruns on each push
Add card for the PR to Dataset Tracking board.
- [Human] add 'tracking' tag to PR
- [GHA] lifecycle-backlog will add card to "Backlog" state for PR if not yet there.
When the submission is ready to be submitted to public QCArchive (validations pass, submitters and reviewers satisfied), PR is merged.
- [Board] PR card will move to state "Queued for Submission" immediately.
- [GHA] lifecycle-backlog will move PR card to state "Queued for Submission" if merged and in state "Backlog"
- [GHA] lifecycle-submission will attempt to submit the dataset
  - if successful, will move card to state "Error Cycling"; add comment to PR
  - if failed, will keep card queued; add comment to PR; attempt again next execution
- [Human] Submit worker jobs on a server to begin compute. If using Nautilus, carefully monitor utilization and scale down resources as jobs finish.
COMPLETE, INCOMPLETE, ERROR numbers reported for Optimizations, TorsionDrives
- [GHA] lifecycle-error-cycle will collect the above statistics for state "Error Cycling" PRs
  - will restart all errored Optimizations and TorsionDrives
  - will move PR to state "Archived/Complete" if no ERROR, INCOMPLETE, all COMPLETE
PR will remain in state "Error Cycling" until moved to "Requires Scientific Review" or until all tasks COMPLETE
- [Human] if errors appear persistent, move to state "Requires Scientific Review"
- discussion should be had on PR for next version
- [Human] once decided, state moved to "End of Life"
- [Human] ensure all worker jobs have been shut down.
[GHA] lifecycle-end-of-life will add tag 'end-of-life' to dataset in QCArchive for PR in "End of Life"
[GHA] lifecycle-archived-complete will add tag 'archived-complete' to dataset in QCArchive for PR in "Archived/Complete"
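The states and transitions described above can be summarized as a small state machine. This is a sketch derived from the prose only, not the code in lifecycle.py; the names `State`, `ALLOWED`, `can_move`, and `after_error_cycle` are illustrative assumptions.

```python
from enum import Enum


class State(Enum):
    BACKLOG = "Backlog"
    QUEUED = "Queued for Submission"
    ERROR_CYCLING = "Error Cycling"
    SCIENTIFIC_REVIEW = "Requires Scientific Review"
    END_OF_LIFE = "End of Life"
    ARCHIVED = "Archived/Complete"


# Transitions as described in the text: merge, successful submission,
# completion, and the manual review path.
ALLOWED = {
    State.BACKLOG: {State.QUEUED},
    State.QUEUED: {State.ERROR_CYCLING},
    State.ERROR_CYCLING: {State.ARCHIVED, State.SCIENTIFIC_REVIEW},
    State.SCIENTIFIC_REVIEW: {State.END_OF_LIFE},
    State.END_OF_LIFE: set(),
    State.ARCHIVED: set(),
}


def can_move(src: State, dst: State) -> bool:
    """Return True if the lifecycle description permits moving from src to dst."""
    return dst in ALLOWED[src]


def after_error_cycle(complete: int, incomplete: int, error: int) -> State:
    """State of an "Error Cycling" PR after one automated error-cycling pass."""
    # Archive only when every task is COMPLETE; otherwise keep cycling.
    if complete > 0 and incomplete == 0 and error == 0:
        return State.ARCHIVED
    return State.ERROR_CYCLING


print(after_error_cycle(complete=120, incomplete=0, error=0).value)
print(after_error_cycle(complete=100, incomplete=15, error=5).value)
```

The move to "Requires Scientific Review" is left out of `after_error_cycle` because the text assigns that decision to a human.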

Management Touchpoints

In addition to the states given above, there are additional touchpoints available for managing dataset submissions:

The tracking label is the "on/off" switch for automation via GitHub Actions. To disable all automation on a submission PR, remove this label. To enable automation, add the label.
Submission priority can be changed by adding one of the following labels:
- priority-high: highest priority
- priority-normal: normal priority
- priority-low: lowest priority
Submission routing to QCFractal managers on different compute resources can be accomplished with compute tags. Add a label like compute-<tagname> to set the compute tag for all QCArchive tasks associated with a submission. Be sure to coordinate with QCFractal manager admins to ensure your chosen compute tag is being served on the expected resources. This mechanism can also be used to "dead-letter" computations that are no longer desired by setting a compute tag that no manager will service.
The order of a submission PR in a Dataset Tracking column matters. Submissions higher in a column will be operated on first by all GitHub Actions automation. For example, if you want to error cycle a submission before any others so it has a higher chance of being pulled by idle manager workers, place it at the top of the Error Cycling column.
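The label conventions above can be read as a simple mapping from PR labels to submission settings. The sketch below is illustrative only: `parse_labels` is a hypothetical helper, the default priority of "normal" and the "last label wins" behavior are assumptions, and the default compute tag "openff" follows the default mentioned in the compute expansion section.

```python
PRIORITY_LABELS = ("priority-high", "priority-normal", "priority-low")


def parse_labels(labels: list[str]) -> dict:
    """Derive automation settings from a PR's labels (hypothetical helper)."""
    settings = {
        "tracking": False,        # automation is off unless 'tracking' is present
        "priority": "normal",     # assumed default, not confirmed by the source
        "compute_tag": "openff",  # default compute tag named in this README
    }
    for label in labels:
        if label == "tracking":
            settings["tracking"] = True
        elif label in PRIORITY_LABELS:
            # Keep the part after 'priority-'; a later label overrides an earlier one.
            settings["priority"] = label.split("-", 1)[1]
        elif label.startswith("compute-"):
            settings["compute_tag"] = label[len("compute-"):]
    return settings


# 'compute-mytag' is a made-up label for illustration.
print(parse_labels(["tracking", "priority-high", "compute-mytag"]))
```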

Dude where's my Dataset?

Finding the source of a dataset in QCArchive can be difficult; here we offer a mapping between a dataset in QCArchive and the folder which contains its inputs including a quick overview of some metadata and the status of the dataset. Note that new datasets submitted using QCSubmit know where they were created and have a long_description_url in the metadata which points directly to their home folder in this repository.

Status

The status only refers to the default specification which is required for all of our datasets. Currently this is B3LYP-D3BJ/DZVP.

Key:

All default-spec jobs are complete.

Some jobs in the dataset contain errors that may prevent them from finishing, for example a linear torsiondrive.

The dataset is currently running and may have some incomplete jobs.

Basic Datasets

These are currently used to compute properties of a minimum energy conformation (Hessians, wavefunctions, etc.), usually derived from completed optimization datasets.

QCArchive Dataset	Folder	Description	Elements
`OpenFF Optimization Set 1`	2019-07-09-OpenFF-Optimization-Set	Hessian calculations.	Cl, S, C, F, O, H, N
`OpenFF NCI250K Boron 1`	2019-07-05 OpenFF NCI250K Boron 1	Hessian calculations.	Cl, Br, S, C, F, B, O, H, N
`OpenFF Discrepancy Benchmark 1`	2019-07-05 eMolecules force field discrepancies 1	Hessian calculation.	Cl, Br, S, C, F, P, I, O, H, N
`OpenFF Gen 2 Opt Set 1 Roche`	2020-03-20-OpenFF-Gen-2-Optimization-Set-1-Roche	Hessian calculation.	Cl, S, C, F, O, H, N
`OpenFF Gen 2 Opt Set 2 Coverage`	2020-03-20-OpenFF-Gen-2-Optimization-Set-2-Coverage	Hessian calculations.	Cl, Br, S, C, F, P, I, O, H, N
`OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy`	2020-03-20-OpenFF-Gen-2-Optimization-Set-3-Pfizer-Discrepancy	Hessian calculations.	Cl, F, C, S, O, H, N
`OpenFF Gen 2 Opt Set 4 eMolecules Discrepancy`	2020-03-20-OpenFF-Gen-2-Optimization-Set-4-eMolecules-Discrepancy	Hessian calculations.	Cl, Br, S, C, F, P, I, O, H, N
`OpenFF Gen 2 Opt Set 5 Bayer`	2020-03-20-OpenFF-Gen-2-Optimization-Set-5-Bayer	Hessian calculations.	Si, Cl, Br, F, C, S, O, H, N
`OpenFF VEHICLe Set 1`	2019-07-02 VEHICLe optimization dataset	Hessian calculations.	S, C, O, H, N
`SMIRNOFF Coverage Set 1`	2019-06-25-smirnoff99Frost-coverage	Hessian calculations.	Cl, Br, S, C, F, P, I, O, H, N
`OpenFF ESP Fragment Conformers v1.0`	2022-01-16-OpenFF-ESP-Fragment-Conformers-v1.0	ESP Calculations	N, Cl, C, H, P, Br, O, F, S
`OpenFF Theory Benchmarking Single Point Energies v1.0`	2021-09-06-theory-bm-single-points	Single Point Energy dataset for the final optimized geometries from MP2/heavy-aug-cc-pVTZ torsiondrives.	Cl, F, C, S, O, H, N, P
`TorsionNet500 Single Points Dataset v1.0`	2021-11-09-TorsionNet500-single-points	Single point energies of final geometries of TorsionNet500 dataset.	H, O, F, S, N, Cl, C
`SPICE DES Monomers Single Points Dataset v1.1`	2021-11-15-QMDataset-DES-monomers-single-points	Single point energy calculation of DES monomers.	I, C, Br, P, Cl, H, S, O, F, N
`SPICE Solvated Amino Acids Single Points Dataset v1.1`	2021-11-08-QMDataset-Solvated-Amino-Acids-single-points	Single point energy calculation of solvated amino acids.	N, S, O, C, H
`SPICE DES370K Single Points Dataset v1.0`	2021-11-08-QMDataset-DES370K-single-points	SPICE single point dataset for ML applications.	N, O, Mg, H, F, K, Br, Na, P, Cl, I, Ca, S, Li, C
`SPICE DES370K Single Points Dataset Supplement v1.0`	2022-02-18-QMDataset-DES370K-single-points-supplement	SPICE single point dataset for ML applications.	F, H, Cl, S, I, Br, N, Li, O, C, Na
`SPICE Dipeptides Single Points Dataset v1.2`	2021-11-08-QMDataset-Dipeptide-single-points	SPICE single point dataset for ML applications.	C, N, O, H, S
`SPICE PubChem Set 1 Single Points Dataset v1.2`	2021-11-08-QMDataset-pubchem-set1-single-points	SPICE single point dataset for ML applications.	O, Cl, N, C, P, Br, S, F, I, H
`SPICE PubChem Set 2 Single Points Dataset v1.2`	2021-11-09-QMDataset-pubchem-set2-single-points	SPICE single point dataset for ML applications.	H, P, C, Cl, Br, N, F, S, O, I
`SPICE PubChem Set 3 Single Points Dataset v1.2`	2021-11-09-QMDataset-pubchem-set3-single-points	SPICE single point dataset for ML applications.	N, C, S, Cl, Br, F, P, I, H, O
`SPICE PubChem Set 4 Single Points Dataset v1.2`	2021-11-09-QMDataset-pubchem-set4-single-points	SPICE single point dataset for ML applications.	N, S, Br, O, C, F, H, I, Cl, P
`SPICE PubChem Set 5 Single Points Dataset v1.2`	2021-11-09-QMDataset-pubchem-set5-single-points	SPICE single point dataset for ML applications.	F, H, S, Br, Cl, N, P, C, I, O
`SPICE PubChem Set 6 Single Points Dataset v1.2`	2021-11-09-QMDataset-pubchem-set6-single-points	SPICE single point dataset for ML applications.	Cl, O, N, H, C, P, S, F, Br, I
`OpenFF ESP Industry Benchmark Set v1.1`	2022-02-02-OpenFF-ESP-Industry-Benchmark-Set-v1.1-single-point	HF/6-31G* conformers of public industry benchmark molecules.	N, F, Cl, C, H, O, Br, P, S
`SPICE Ion Pairs Single Points Dataset v1.1`	2022-06-08-QMDataset-ion-pairs	SPICE single point dataset for ML applications.	F, Cl, Li, Na, Br, K, I
`RNA Single Point Dataset v1.0`	2022-07-07-RNA-basepair-triplebase-single-points	RNA single point dataset consisting of RNA basepairs and triple bases.	P, N, O, C, H
`RNA Trinucleotide Single Point Dataset v1.0`	2022-10-21-RNA-trinucleotide-single-points	Single point energy calculations of RNA basepairs and triple bases	O, N, C, H, P
`RNA Nucleoside Single Point Dataset v1.0`	2023-03-09-RNA-nucleoside-single-points	Single point energy calculations of RNA nucleosides without O5' hydroxyl atom	O, N, C, H
`OpenFF multi-Br ESP Fragment Conformers v1.1`	2023-11-30-OpenFF-multi-Br-ESP-Fragment-Conformers-v1.1-single-point	Single point ESP calculations	Br, C, F, H, N, O, P, S
`MLPepper RECAP Optimized Fragments v1.0`	2024-07-26-MLPepper-RECAP-Optimized-Fragments-v1.0	Single point property calculations for charge models	P, B, Cl, Br, C, H, I, F, O, N, Si, S
`OpenFF NAGL2 ESP Timing Benchmark v1.0`	2024-09-06-OpenFF-NAGL2-ESP-Timing-Benchmark-v1.0	Single point ESP calculations for timing/memory benchmarking	P, S, N, C, Cl, F, Br, O, H, I
`OpenFF NAGL2 ESP Timing Benchmark v1.1`	2024-09-18-OpenFF-NAGL2-ESP-Timing-Benchmark-v1.1	Single point ESP calculations for timing/memory benchmarking	P, S, N, C, Cl, F, Br, O, H, I
`OpenFF Sulfur Hessian Training Coverage Supplement v1.0`	2024-09-18-OpenFF-Sulfur-Hessian-Training-Coverage-Supplement-v1.0	Additional Hessian training data for Sage sulfur and phosphorus parameters (from 'OpenFF Sulfur Optimization Training Coverage Supplement v1.0')	O, S, C, Cl, P, N, F, Br, H
`OpenFF Aniline Para Hessian v1.0`	2024-10-07-OpenFF-Aniline-Para-Hessian-v1.0	Hessian single points for the final molecules in the `OpenFF Aniline Para Opt v1.0` dataset	O, Cl, S, Br, H, F, N, C
`OpenFF Gen2 Hessian Dataset Protomers v1.0`	2024-10-07-OpenFF-Gen2-Hessian-Dataset-Protomers-v1.0	Hessian single points for the final molecules in the `OpenFF Gen2 Optimization Dataset Protomers v1.0` dataset	H, C, Cl, P, F, Br, O, N, S
`MLPepper-RECAP-Optimized-Fragments-Add-Iodines-v1.0`	2024-10-11-MLPepper-RECAP-Optimized-Fragments-Add-Iodines-v1.0	Set of diverse iodine containing molecules with a number of calculated electrostatic properties.	Br, Cl, S, B, O, Si, C, N, I, P, H, F

Optimization Datasets

These are currently used to find a minimum energy conformation of a molecule.

QCArchive Dataset	Folder	Description	Elements
`OpenFF Optimization Set 1`	2019-05-16-Roche-Optimization_Set	Geometry optimizations of a set of Roche molecules for forcefield fitting.	Cl, S, C, F, O, H, N
`SMIRNOFF Coverage Set 1`	2019-06-25-smirnoff99Frost-coverage	An optimization dataset that exercises all parameters in Smirnoff99Frost.	Cl, Br, S, C, F, P, I, O, H, N
`OpenFF VEHICLe Set 1`	2019-07-02 VEHICLe optimization dataset	VEHICLe (virtual exploratory heterocyclic library) dataset of 24,867 aromatic heterocyclic rings with expanded stereochemistry.	S, C, O, H, N
`OpenFF Discrepancy Benchmark 1`	2019-07-05 eMolecules force field discrepancies 1	A set of molecules whose optimized structures differ across forcefields.	Cl, Br, S, C, F, P, I, O, H, N
`OpenFF NCI250K Boron 1`	2019-07-05 OpenFF NCI250K Boron 1	This database is a subset of boron-containing compounds from the NCI250K (Release 1 - Oct 1999) compound dataset.	Cl, Br, S, C, F, B, O, H, N
`OpenFF Ehrman Informative Optimization v0.2`	2019-09-06-OpenFF-Informative-Set	This provides an optimization dataset based on an initial batch of Jordan Ehrman's analysis of eMolecules, pulling out molecules with minimized geometries which are substantially different in different force fields.	Cl, Br, S, C, F, P, I, O, H, N
`Pfizer discrepancy optimization dataset 1`	2019-09-07-Pfizer-discrepancy-optimization-dataset-1	This database is a subset of 100 challenging small molecule fragments where HF/minix followed by B3LYP/6-31G//B3LYP/6-31G* differed substantially from OPLS3e.	Cl, F, C, S, O, H, N
`FDA optimization dataset 1`	2019-09-08-fda-optimization-dataset-1	The ZINC15 FDA dataset was retrieved in `mol2` format on Sun Sep 8 20:44:34 EDT 2019 via: http://zinc.docking.org/substances/subsets/fda.mol2?count=all	Cl, Br, F, C, S, P, I, O, H, N
`Kinase Inhibitors: WBO Distributions`	2019-11-27-kinase-inhibitor-optimization	Geometry optimization of kinase inhibitor conformers to explore WBO conformation dependency.	Cl, Br, S, C, F, P, I, O, H, N
`OpenFF Gen 2 Opt Set 1 Roche`	2020-03-20-OpenFF-Gen-2-Optimization-Set-1-Roche	2nd generation optimization dataset for bond and valence parameter fitting.	Cl, S, C, F, O, H, N
`OpenFF Gen 2 Opt Set 2 Coverage`	2020-03-20-OpenFF-Gen-2-Optimization-Set-2-Coverage	2nd generation optimization dataset for bond and valence parameter fitting.	Cl, Br, S, C, F, P, I, O, H, N
`OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy`	2020-03-20-OpenFF-Gen-2-Optimization-Set-3-Pfizer-Discrepancy	2nd generation optimization dataset for bond and valence parameter fitting.	Cl, F, C, S, O, H, N
`OpenFF Gen 2 Opt Set 4 eMolecules Discrepancy`	2020-03-20-OpenFF-Gen-2-Optimization-Set-4-eMolecules-Discrepancy	2nd generation optimization dataset for bond and valence parameter fitting	Cl, Br, S, C, F, P, I, O, H, N
`OpenFF Gen 2 Opt Set 5 Bayer`	2020-03-20-OpenFF-Gen-2-Optimization-Set-5-Bayer	2nd generation optimization dataset for bond and valence parameter fitting.	Si, Cl, Br, F, C, S, O, H, N
`OpenFF Protein Fragments v1.0`	2020-07-06-OpenFF-Protein-Fragments-Initial	This is the initial test of running constrained optimizations on various protein fragments prepared by David Cerutti. Here we just have ALA as the central residue.	H, C, O, N
`OpenFF Protein Fragments v2.0`	2020-08-12-OpenFF-Protein-Fragments-version2	This is the full protein fragment dataset (version2) consisting of constrained optimizations on various protein fragments prepared by David Cerutti. We have 12 central residues which are capped with a combination of different terminal residues.	S, C, O, H, N
`OpenFF Sandbox CHO PhAlkEthOH v1.0`	2020-09-18-OpenFF-Sandbox-CHO-PhAlkEthOH	The molecules are from the AlkEthOH and PhEthOH datasets originally used to build the smirnoff99Frosst parameters. The AlkEthOH was taken from here	H, C, O
`OpenFF Industry Benchmark Season 1 v1.0`	2021-03-30-OpenFF-Industry-Benchmark-Season-1-v1.0	The combination of all publicly chosen compound sets by industry partners from the OpenFF season 1 industry benchmark	N, F, Cl, C, H, O, Br, P, S
`OpenFF Industry Benchmark Season 1 v1.1`	2021-06-04-OpenFF-Industry-Benchmark-Season-1-v1.1	The combination of all publicly chosen compound sets by industry partners from the OpenFF season 1 industry benchmark	N, F, Cl, C, H, O, Br, P, S
`OpenFF Theory Benchmarking Constrained Optimization Set MP2 heavy-aug-cc-pVTZ v1.1`	2020-11-25-theory-bm-set-mp2-heavy-aug-cc-pvtz	This is a Constrained Optimization dataset for benchmarking MP2/heavy-aug-cc-pVTZ.
`OpenFF Industry Benchmark Season 1 - MM v1.1`	2021-07-28-OpenFF-Industry-Benchmark-Season-1-MM-v1.1	The combination of all publicly chosen compound sets by industry partners from the OpenFF season 1 industry benchmark; MM computations starting from QM-optimized geometries.	N, F, Cl, C, H, O, Br, P, S
`OpenFF RESP Polarizability Optimizations v1.0`	2021-10-01-OpenFF-resppol-mp2-single-point	A data set used for training ESP-fitting based typed atomic polarizabilities with a direct approximation.	N, C, H, O
`OpenFF RESP Polarizability Optimizations v1.1`	2021-10-01-OpenFF-resppol-mp2-single-point	A data set used for training ESP-fitting based typed atomic polarizabilities with a direct approximation.	N, C, H, O
`SPICE Dipeptides Optimization Dataset v1.0`	2021-11-11-Dipeptide-optimization-set	Optimization set created from the smiles of SPICE Dipeptide dataset.	N, C, H, O, S
`OpenFF Gen 2 Optimization Dataset Protomers v1.0`	2021-12-21-OpenFF-Gen2-Optimization-Set-Protomers	Optimization set created from the smiles of missing protomers in Gen 2 optimization sets.	O, F, S, Br, Cl, C, P, H, I, N
`OpenFF ESP Industry Benchmark Set v1.0`	2022-02-02-OpenFF-ESP-Industry-Benchmark-Set-v1.0-optimization-set	HF/6-31G* conformers of public industry benchmark molecules.	N, F, Cl, C, H, O, Br, P, S
`OpenFF Protein Capped 1-mers 3-mers Optimization Dataset v1.0`	2022-05-30-OpenFF-Protein-Capped-1-mers-3-mers-Optimization	Optimization dataset for protein capped 1-mers Ace-X-Nme and capped 3-mers Ace-Y-X-Y-Nme with Y = {Ala, Val} and X = 26 canonical amino acids with common protomers/tautomers (Ash, Cyx, Glh, Hid, Hip, and Lyn)	H, C, N, O, S
`OpenFF Iodine Chemistry Optimization Dataset v1.0`	2022-07-27-OpenFF-iodine-optimization-set	Optimization set created from Gen1 and Gen2 molecules containing iodine	C, F, O, H, Br, Cl, N, I, S
`OpenFF multi-Br ESP Fragment Conformers v1.0`	2023-11-02-OpenFF-multi-Br-ESP-Fragment-Conformers-v1.0	Optimization set created from 2022-01-16-OpenFF-ESP-Fragment-Conformers-v1.0 by selecting molecules with multiple Cl atoms and replacing them with Br	Br, C, F, H, N, O, P, S
`XtalPi Shared Fragments OptimizationDataset v1.0`	2024-01-30-xtalpi-shared-fragments-optimization-v1.0	Representative optimization molecules used to fit XFF	C, H, Cl, Br, S, O, F, N, P
`XtalPi 20-percent Fragments OptimizationDataset v1.0`	2024-04-02-xtalpi-20-percent-fragments-optimization-v1.0	Larger (20%) representative subset of molecules used to fit XFF	Cl, P, Br, I, H, C, B, Si, O, N, F, S
`OpenFF Torsion Benchmark Supplement Optimization Dataset v1.0`	2024-04-18-OpenFF-Torsion-Benchmark-Supplement-Optimization-Dataset-v1.0	Additional optimizations for benchmarking Sage 2.2.0 proper torsions and new parameters from the torsion multiplicity work	H, C, N, O, F, P, S, Cl, Br
`OpenFF Torsion Multiplicity Optimization Training Coverage Supplement v1.0`	2024-06-20-OpenFF-Torsion-Multiplicity-Optimization-Training-Coverage-Supplement-v1.0	Additional optimization training data for Sage 2.2.0 proper torsions and new parameters from the torsion multiplicity work	C, Cl, S, O, H, P, N, Br
`OpenFF Torsion Multiplicity Optimization Benchmarking Coverage Supplement v1.0`	2024-06-24-OpenFF-Torsion-Multiplicity-Optimization-Benchmarking-Coverage-Supplement-v1.0	Additional optimization benchmarking data for Sage 2.2.0 proper torsions and new parameters from the torsion multiplicity work	Cl, H, I, S, O, N, Br, C, P
`OpenFF Iodine Fragment Opt v1.0`	2024-09-10-OpenFF-Iodine-Fragment-Opt-v1.0	B3LYP-D3BJ/DZVP optimized conformers for a variety of I-containing fragment molecules	C, O, I, S, F, Br, Cl, N, H
`OpenFF Sulfur Optimization Training Coverage Supplement v1.0`	2024-09-11-OpenFF-Sulfur-Optimization-Training-Coverage-Supplement-v1.0	Additional optimization training data for Sage sulfur and phosphorus parameters	C, S, F, O, H, Cl, Br, P, N
`OpenFF Sulfur Optimization Benchmarking Coverage Supplement v1.0`	2024-09-18-OpenFF-Sulfur-Optimization-Benchmarking-Coverage-Supplement-v1.0	Additional optimization benchmarking data for Sage sulfur and phosphorus parameters	S, P, Cl, C, N, O, H, Br, F
`OpenFF Lipid Optimization Training Supplement v1.0`	2024-10-08-OpenFF-Lipid-Optimization-Training-Supplement-v1.0	Additional optimization training data for Sage from representative LIPID MAPS fragments	I, Br, O, H, P, C, N, Cl, F, S

TorsionDrive Datasets

These are currently used to perform a complete rotation of one or more selected bonds, where optimizations are performed over a discrete set of angles.

QCArchive Dataset	Folder	Description	Elements
`Fragment Stability Benchmark`	2019-03-06-Fragmenter_Stability-Benchmark	Examination of different fragmentation schemes.	Cl, F, C, P, I, O, H, N
`OpenFF Group1 Torsions`	2019-05-01-OpenFF-Group1-Torsions	A collection of torsion drives for forcefield fitting.	Cl, F, C, S, O, H, N
`SMIRNOFF Coverage Torsion Set 1`	2019-07-01-smirnoff99Frost-coverage-torsion	Set of small molecules that use all smirnoff99Frost parameters.	Cl, Br, S, C, F, P, I, O, H, N
`OpenFF Substituted Phenyl Set 1`	2019-07-25-phenyl-set	A set of substituted phenyl torsiondrives.	Cl, Br, F, C, I, O, H, N
`Pfizer discrepancy torsion dataset 1`	2019-09-07-Pfizer-discrepancy-torsion-dataset-1	This database is a subset of 100 challenging small molecule fragments where HF/minix followed by B3LYP/6-31G//B3LYP/6-31G* differed substantially from OPLS3e.	Cl, F, C, S, O, H, N
`TorsionDrive Paper`	2019-11-07-TorsionDrive-Paper	Torsion Drives to explore wavefront propagation for the TorsionDrive paper.	C, H, O
`OpenFF Primary Benchmark 1 Torsion Set`	2019-12-05-OpenFF-Benchmark-Primary-1-torsion	Validation of optimized force field torsion parameters.	Cl, Br, F, C, S, O, H, N
`OpenFF Primary Benchmark 2 Torsion Set`	2020-01-17-OpenFF-Benchmark-Full-1-torsion	Validation of optimized force field torsion parameters.	Cl, Br, S, C, F, P, I, O, H, N
`OpenFF Group1 Torsions 2`	2020-01-31-OpenFF-Group1-Torsions-2	Generation of additional data for fitting of newly added torsion terms.	H, C, O, N
`OpenFF Group1 Torsions 3`	2020-02-10-OpenFF-Group1-Torsions-3	Generation of additional data for fitting of `t128` and `t129`	H, C, O, N
`OpenFF Gen 2 Torsion Set 1 Roche`	2020-03-12-OpenFF-Gen-2-Torsion-Set-1-Roche	Design 2nd generation torsion dataset for valence parameter fitting.	F, C, S, O, H, N
`OpenFF Gen 2 Torsion Set 2 Coverage`	2020-03-12-OpenFF-Gen-2-Torsion-Set-2-Coverage	Design 2nd generation torsion dataset for valence parameter fitting.	Cl, Br, F, C, S, P, I, O, H, N
`OpenFF Gen 2 Torsion Set 3 Pfizer Discrepancy`	2020-03-12-OpenFF-Gen-2-Torsion-Set-3-Pfizer-Discrepancy	Design 2nd generation torsion dataset for valence parameter fitting	S, C, F, O, H, N
`OpenFF Gen 2 Torsion Set 4 eMolecules Discrepancy`	2020-03-12-OpenFF-Gen-2-Torsion-Set-4-eMolecules-Discrepancy	Design 2nd generation torsion dataset for valence parameter fitting.	Cl, Br, F, C, S, P, I, O, H, N
`OpenFF Gen 2 Torsion Set 5 Bayer`	2020-03-12-OpenFF-Gen-2-Torsion-Set-5-Bayer	Design 2nd generation torsion dataset for valence parameter fitting.	Cl, Br, F, C, S, O, H, N
`OpenFF Gen 2 Torsion Set 6 supplemental`	2020-03-12-OpenFF-Gen-2-Torsion-Set-6-supplemental	Design 2nd generation torsion dataset for valence parameter fitting.	S, C, O, H, N
`OpenFF Gen 2 Torsion Set 1 Roche 2`	2020-03-23-OpenFF-Gen-2-Torsion-Set-1-Roche-2	Design 2nd generation torsion dataset for valence parameter fitting.	Cl, F, C, S, O, H, N
`OpenFF Gen 2 Torsion Set 2 Coverage 2`	2020-03-23-OpenFF-Gen-2-Torsion-Set-2-Coverage-2	Design 2nd generation torsion dataset for valence parameter fitting.	Cl, Br, F, C, S, P, I, O, H, N
`OpenFF Gen 2 Torsion Set 3 Pfizer Discrepancy 2`	2020-03-23-OpenFF-Gen-2-Torsion-Set-3-Pfizer-Discrepancy-2	Design 2nd generation torsion dataset for valence parameter fitting.	S, C, F, O, H, N
`OpenFF Gen 2 Torsion Set 4 eMolecules Discrepancy 2`	2020-03-23-OpenFF-Gen-2-Torsion-Set-4-eMolecules-Discrepancy-2	Design 2nd generation torsion dataset for valence parameter fitting.	Cl, Br, F, C, S, P, I, O, H, N
`OpenFF Gen 2 Torsion Set 5 Bayer 2`	2020-03-26-OpenFF-Gen-2-Torsion-Set-5-Bayer-2	Design 2nd generation torsion dataset for valence parameter fitting.	Cl, Br, F, C, S, O, H, N
`OpenFF Gen 2 Torsion Set 6 supplemental 2`	2020-03-26-OpenFF-Gen-2-Torsion-Set-6-supplemental-2	Design 2nd generation torsion dataset for valence parameter fitting.	Br, S, C, F, O, H, N
`OpenFF Fragmenter Validation 1.0`	2020-04-28-Fragmenter-test	Examination of different fragmentation schemes.	Cl, S, C, P, I, O, H, N
`OpenFF DANCE 1 eMolecules t142 v1.0`	2020-06-01-DANCE-1-eMolecules-t142-selected	Molecules selected from the eMolecules database by DANCE to improve t142 parameterization in smirnoff99Frosst.	Cl, Br, F, C, S, O, H, N
`OpenFF Rowley Biaryl v1.0`	2020-06-17-OpenFF-Biaryl-set	This is a TorsionDrive dataset consisting of biaryl torsions provided by Christopher Rowley. Originally used to benchmark parsley, but could also be useful for fitting.	S, C, O, H, N
`OpenFF-benchmark-ligand-fragments-v1.0`	2020-07-27-OpenFF-Benchmark-Ligands	This is a torsiondrive dataset created from the OpenFF FEP benchmark dataset. The ligands are fragmented before having key torsions driven.	Cl, Br, S, C, F, I, O, H, N
`OpenFF Theory Benchmarking Set B3LYP-D3BJ DZVP v1.0`	2020-07-27-theory-bm-set-b3lyp-d3bj-dzvp	This is a TorsionDrive dataset consisting of 36 1-D torsions selected for benchmarking different QM levels.	Cl, F, C, S, P, O, H, N
`OpenFF Theory Benchmarking Set B3LYP-D3BJ def2-TZVP v1.0`	2020-07-30-theory-bm-set-b3lyp-d3bj-def2-tzvp	This is a TorsionDrive dataset consisting of 36 1-D torsions selected for benchmarking different QM levels.	Cl, F, C, S, P, O, H, N
`OpenFF Theory Benchmarking Set B3LYP-D3BJ def2-TZVPD v1.0`	2020-07-30-theory-bm-set-b3lyp-d3bj-def2-tzvpd	This is a TorsionDrive dataset consisting of 36 1-D torsions selected for benchmarking different QM levels.	Cl, F, C, S, P, O, H, N
`OpenFF Theory Benchmarking Set B3LYP-D3BJ def2-TZVPP v1.0`	2020-07-30-theory-bm-set-b3lyp-d3bj-def2-tzvpp	This is a TorsionDrive dataset consisting of 36 1-D torsions selected for benchmarking different QM levels.	Cl, F, C, S, P, O, H, N
`OpenFF Protein Fragments TorsionDrives v1.0`	2020-09-16-OpenFF-Protein-Fragments-TorsionDrives	This is a protein fragment dataset consisting of torsion drives on various protein fragments prepared by David Cerutti. We have 12 central residues capped with a combination of different terminal residues. We drive the following angles for each fragment: omega, phi, psi, chi1 (if applicable), and chi2 (if applicable).	S, C, O, H, N
`OpenFF WBO Conjugated Series v1.0`	2021-01-25-OpenFF-Conjugated-Series	This is a torsion drive dataset that consists of various chemistries that probe a range of conjugated bonds. The goal of this dataset is to develop WBO interpolated torsions for the OpenFF force field.	S, C, O, H, N
`OpenFF Amide Torsion Set v1.0`	2021-03-23-OpenFF-Amide-Torsion-Set-v1.0	Amides, thioamides and amidines diversely functionalized.	S, C, O, H, N
`OpenFF Aniline Para Opt v1.0`	2021-04-02-OpenFF-Aniline-Para-Opt-v1.0	Optimizations of diverse, para-substituted aniline derivatives.	Br, C, O, N, S, H, Cl, F
`OpenFF Gen3 Torsion Set v1.0`	2021-04-09-OpenFF-Gen3-Torsion-Set-v1.0	This is a simple-molecule-only TorsionDrive dataset that aims to avoid contamination of torsion parameters by large internal non-bonded interactions during valence parameter optimization. Molecules with one effective rotating bond were generated by combining two simple substituents, which were identified by fragmenting small drug-like molecules. Torsions from the generated molecule set were selected using a clustering method, so that a chemically diverse set of molecules trains each torsion parameter.	F, N, H, Cl, P, S, O, Br, C
`OpenFF Aniline 2D Impropers v1.0`	2021-03-29-OpenFF-Aniline-2D-Impropers-v1.0	This dataset contains a set of aniline derivatives which have para-substituted groups of varying electron donating and withdrawing properties. This dataset was curated in an effort to improve and understand improper torsions in force fields. We will scan the improper and proper angle simultaneously to better understand the coupling and energetics of these torsions.	O, C, S, H, N
`OpenFF BCC Refit Study COH v2.0`	2021-06-22-OpenFF-BCC-Refit-Study-COH-v2.0	A data set curated for the initial stage of the on-going OpenFF study which aims to co-optimize the AM1BCC bond charge correction (BCC) parameters against an experimental training set of density and enthalpy of mixing data points and a QM training set of electric field data. The initial data set is limited to only molecules composed of C, O, H. This limited scope significantly reduces the number of BCC parameters which must be retrained, thus allowing for easier convergence of the initial optimizations. The included molecules were combinatorially generated to cover a range of alcohol, ether, and carbonyl containing molecules.	O, C, S, H, N
`OpenFF-benchmark-ligand-fragments-v2.0`	2021-08-10-OpenFF-JACS-Fragments-v2.0	This is a torsiondrive dataset created from the OpenFF FEP benchmark dataset. The ligands are fragmented using openff-fragmenter with both ambertools and openeye before having key torsions driven.	S, N, Br, C, H, O, Cl, F, I
`OpenFF-Protein-Dipeptide-2D-TorsionDrive-v2.1`	2021-11-18-OpenFF-Protein-Dipeptide-2D-TorsionDrive	Two-dimensional TorsionDrives on phi and psi for dipeptides of the 20 canonical amino acids and 6 alternate protomers/tautomers.	H, C, N, O, S
`OpenFF-Protein-Capped-1-mer-Sidechains-v1.3`	2022-02-10-OpenFF-Protein-Capped-1-mer-Sidechains	Two-dimensional TorsionDrives on chi1 and chi2 for capped 1-mers of amino acids with a rotatable bond in the sidechain.	H, C, N, O, S
`OpenFF-Protein-Capped-3-mer-Backbones-v1.0`	2022-05-30-OpenFF-Protein-Capped-3-mer-Backbones	Two-dimensional TorsionDrives on phi and psi for capped 3-mers Ace-Y-X-Y-Nme with Y = {Ala, Val}.	H, C, N, O, S
`OpenFF-multiplicity-correction-torsion-drive-data-v1.1`	2022-04-29-OpenFF-multiplicity-correction-torsion-drive-data-v1.1	A torsiondrive dataset created to correct multiplicity issues in the force field.	S, P, O, C, H, N
`OpenFF-Protein-Capped-3-mer-Omega-v1.0`	2023-02-06-OpenFF-Protein-Capped-3-mer-Omega	TorsionDrives on omega for capped 3-mers Ace-Ala-X-Ala-Nme.	H, C, N, O, S
`XtalPi Shared Fragments TorsiondriveDataset v1.0`	2024-01-30-xtalpi-shared-fragments-torsiondrive-v1.0	Representative torsion scan molecules used to fit XFF	C, H, Cl, Br, S, O, F, N, P
`OpenFF Torsion Coverage Supplement v1.0`	2024-02-29-OpenFF-Torsion-Coverage-Supplement-v1.0	Additional TorsionDrives to improve coverage for Sage 2.1.0 proper torsions and new parameters from the torsion multiplicity work	C, Cl, F, H, N, O, S
`OpenFF-RNA-Dinucleoside-Monophosphate-TorsionDrives-v1.0`	2024-03-26-OpenFF-RNA-Dinucleoside-Monophosphate-TorsionDrives	TorsionDrives of non-ring backbone, glycosidic, and hydroxyl dihedrals in RNA XpY 2-mers.	H, C, N, O, P
`XtalPi 20-percent Fragments TorsiondriveDataset v1.0`	2024-04-02-xtalpi-20-percent-fragments-torsiondrive-v1.0	Torsion scans of larger representative subset (20%) of molecules used to fit XFF	O, Br, I, Si, B, C, P, S, Cl, H, N, F
`OpenFF Torsion Drive Supplement v1.0`	2024-04-17-OpenFF-Torsion-Drive-Supplement-v1.0	Additional TorsionDrives to expand training data for Sage 2.2.0 proper torsions and new parameters from the torsion multiplicity work	H, C, N, O, P, S
`OpenFF Torsion Multiplicity Torsion Drive Coverage Supplement v1.0`	2024-06-14-OpenFF-Torsion-Multiplicity-Torsion-Drive-Coverage-Supplement-v1.0	Additional torsion drive training data for Sage 2.2.0 proper torsions and new parameters from the torsion multiplicity work	N, Br, H, P, Cl, O, C, S
`OpenFF Phosphate Torsion Drives v1.0`	2024-07-17-OpenFF-Phosphate-Torsion-Drives-v1.0	Lipid-like phosphate torsions	C, S, N, H, O, P
`OpenFF Alkane Torsion Drives v1.0`	2024-08-09-OpenFF-Alkane-Torsion-Drives-v1.0	Alkane/alkene torsion drives	C, H

GridOptimization Datasets

These are currently used to perform a scan of one or more internal coordinates (bond, angle, torsion), where a constrained optimization is performed at each of a discrete set of values.
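
As a rough illustration of what "a discrete set of values" means for a 1-D or 2-D scan, the sketch below enumerates the grid points at which optimizations would be run. It is plain Python and does not use the QCArchive or QCSubmit API; the function name `grid_points`, the dictionary keys (`type`, `indices`, `start`, `stop`, `steps`), and the example atom indices and ranges are all hypothetical and chosen only for this example.

```python
from itertools import product


def grid_points(scans):
    """Enumerate every combination of scanned values.

    `scans` is a list with one dict per scanned internal coordinate.
    The dict layout is hypothetical (not a QCArchive schema):
    {"type", "indices", "start", "stop", "steps"}.
    """
    axes = []
    for scan in scans:
        start, stop, steps = scan["start"], scan["stop"], scan["steps"]
        if steps < 2:
            # A single step pins the coordinate at its start value.
            axes.append([start])
            continue
        step = (stop - start) / (steps - 1)
        axes.append([start + i * step for i in range(steps)])
    # The Cartesian product yields one tuple per grid point.
    return list(product(*axes))


# 1-D scan: one torsion, five evenly spaced values (illustrative numbers).
one_d = [
    {"type": "torsion", "indices": (0, 1, 2, 3),
     "start": -40.0, "stop": 40.0, "steps": 5},
]

# 2-D scan: the same torsion combined with an angle.
two_d = one_d + [
    {"type": "angle", "indices": (1, 2, 3),
     "start": 100.0, "stop": 120.0, "steps": 3},
]

points_1d = grid_points(one_d)
points_2d = grid_points(two_d)
print(len(points_1d), len(points_2d))
```

Each tuple corresponds to one optimization with the scanned coordinates held at those values, so the number of optimizations per molecule grows as the product of the step counts along each scanned coordinate.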

QCArchive Dataset	Folder	Description	Elements
`OpenFF Trivalent Nitrogen Set 1`	2019-06-28-Nitrogen-grid-optimization	Set of diverse trivalent nitrogen molecules for 1-D grid optimization.	Si, Cl, Br, F, C, S, P, B, I, O, H, N
`OpenFF Trivalent Nitrogen Set 2`	2019-12-09-Nitrogen-grid-optimization-2d	Set of diverse trivalent nitrogen molecules for 2-D grid optimization.	Si, Cl, Br, F, C, S, P, B, I, O, H, N
`OpenFF Trivalent Nitrogen Set 3`	2020-01-15-Nitogen-grid-optimization-02-1dscans	Set of diverse trivalent nitrogen molecules for 1-D grid optimization; this is a secondary dataset.	Cl, Br, S, C, F, O, H, N