OpenFF QCArchive Dataset Submission

Dataset Lifecycle

All datasets submitted to QCArchive via this repository conform to the Dataset Lifecycle.

See STANDARDS.md for submission standards. Datasets must be submitted as pull requests.

User Quickstart

  1. Ensure git-lfs is installed on your local machine: https://git-lfs.github.com/

  2. To submit a new dataset, begin by cloning this repository:

    export GIT_LFS_SKIP_SMUDGE=1
    git clone git@github.com:openforcefield/qca-dataset-submission.git
    

    This will clone the repo, but avoid downloading existing LFS objects. If you wish to download all LFS objects, leave off the export GIT_LFS_SKIP_SMUDGE=1.

  3. Once cloned, create and switch to a new branch from master, then create a new directory in qca-dataset-submission/submissions/:

    git -C qca-dataset-submission checkout -b <dataset-branch>
    mkdir qca-dataset-submission/submissions/YYYY-MM-DD-OpenFF-<DESCRIPTIVE-DATASET-NAME>-v1.0
    

    You will add all submission artifacts to this directory.

  4. Create and activate a new conda environment with the basic submission-preparation requirements:

    conda env create -f qca-dataset-submission/devtools/prod-envs/qcarchive-user-submit.yaml
    conda activate qcarchive-user-submit
    
  5. Choose a starting notebook and README based on the type of dataset you wish to submit:

    Copy the notebook and README for the dataset you want into the directory you created.

    cp examples/<dataset-type>/* qca-dataset-submission/submissions/YYYY-MM-DD-OpenFF-<DESCRIPTIVE-DATASET-NAME>-v1.0
    
  6. Start up a Jupyter notebook with your new notebook:

    jupyter notebook qca-dataset-submission/submissions/YYYY-MM-DD-OpenFF-<DESCRIPTIVE-DATASET-NAME>-v1.0/generate-dataset.ipynb
    

    Edit the contents with appropriate metadata, read in your molecules using the cells appropriate for your input data, and make any other modifications your dataset requires.

  7. Copy generated metadata components into README. Write a reasonably-detailed high-level description of the submission at the top.

  8. Commit the following files in the submission directory you made:

    • your input files; please compress them if possible with e.g. bzip2
    • generate-dataset.ipynb
    • dataset.pdf
    • dataset.smi
    • dataset.json.bz2
  9. Push your branch to GitHub:

    git push origin <dataset-branch>
    
  10. Make a new PR for the branch. Validation will run automatically on your dataset.json.* file, indicating any potential issues prior to submission. Ask for help if you see validation failures you do not understand. Ping a reviewer in the comments.

  11. Once reviewed and approved, your submission will be merged and submitted to QCArchive! Computations specified by the submission will be performed on OpenFF-managed compute resources.
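The directory naming convention from step 3 and the compression request from step 8 can be scripted. The following is a minimal sketch using only the Python standard library; the helper names `submission_dir_name` and `compress_input` are hypothetical and are not part of this repository's tooling, and the whitespace-to-hyphen normalization is an assumption.

```python
import bz2
import datetime
import pathlib
import re


def submission_dir_name(description: str, date: datetime.date, version: str = "v1.0") -> str:
    """Build a name of the form YYYY-MM-DD-OpenFF-<DESCRIPTIVE-DATASET-NAME>-v1.0."""
    # Assumption: runs of whitespace in the description become single hyphens.
    slug = re.sub(r"\s+", "-", description.strip())
    return f"{date.isoformat()}-OpenFF-{slug}-{version}"


def compress_input(path: pathlib.Path) -> pathlib.Path:
    """Write a bzip2-compressed copy of `path` next to it and return the new path."""
    target = path.with_name(path.name + ".bz2")
    target.write_bytes(bz2.compress(path.read_bytes()))
    return target
```

For example, `submission_dir_name("Example Set", datetime.date(2024, 1, 2))` gives `2024-01-02-OpenFF-Example-Set-v1.0`. The original input file is left in place, so remove it yourself if you only want to commit the compressed copy.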

Creating a compute expansion

If you have already computed a dataset but want to re-compute it with a new QCSpec (e.g. a new level of theory), you can do so using a compute expansion. This is faster than creating a new dataset, and explicitly links datasets with the same molecules and purpose. A compute expansion involves adding a file called compute.json to your original submission, which contains the dataset metadata (identical to the original dataset) and the new compute spec. This can be done manually or programmatically. The programmatic approach is described below, with an example of the notebook and of the file.

  1. Create a new branch as described above, and navigate to the submission directory of the dataset you want to expand.
  2. Create a new Jupyter notebook called generate-compute.ipynb (example here).
  3. In the notebook, either download the original dataset and remove the molecules and original QCSpec, or re-create the dataset with the same name as the original and skip the molecule addition step.
  4. Add the new QCSpec to the dataset, and save the dataset to compute.json (example here).
  5. Add the additional compute spec to the submission's README.md file.
  6. Add the generate-compute.ipynb and compute.json files to the submission's QCSubmit Manifest entry in the README.md file.
  7. Proof the submission and open a PR. Dataset validation will run automatically.
  8. Once the dataset is validated, request a review, and once approved, your compute expansion will be submitted!
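Before opening the PR, it can help to confirm that the README actually mentions both new files. The sketch below is a hypothetical pre-flight check, not part of the repository's validation: it only searches the README text for the file names and does not parse the QCSubmit Manifest section itself.

```python
def missing_manifest_entries(readme_text, required=("generate-compute.ipynb", "compute.json")):
    """Return the required file names that do not appear anywhere in the README text."""
    # Plain substring search; the manifest layout is deliberately not assumed.
    return [name for name in required if name not in readme_text]
```

An empty list means both file names were found; any returned names still need to be added to the manifest entry.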

When the PR is merged, the following happens:

The Lifecycle of a Dataset Submission

All Open Force Field datasets submitted to QCArchive undergo a well-defined lifecycle.

Dataset Lifecycle

Each labeled rectangle in the lifecycle represents a state. A submission PR changes state according to the arrows. Changes in state may be performed by automation or manually by a human when certain criteria are met.

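The lifecycle process is described below, with [bracketed] items indicating the agent of action: [GHA] for GitHub Actions automation, [Board] for the Dataset Tracking board, or [Human] for a person acting manually.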

  1. A PR is created against qca-dataset-submission by a submitter.

    • the PR description is filled out with the informational sections from the PR template
    • [GHA] validation operates on all dataset*.json files found in the PR; performs validation checks
      • comment made based on validation checks
      • reruns on each push
  2. Add card for the PR to Dataset Tracking board.

  3. When the submission is ready to be submitted to public QCArchive (validations pass, submitters and reviewers satisfied), PR is merged.

    • [Board] PR card will move to state "Queued for Submission" immediately.
    • [GHA] lifecycle-backlog will move PR card to state "Queued for Submission" if merged and in state "Backlog"
    • [GHA] lifecycle-submission will attempt to submit the dataset
      • if successful, will move card to state "Error Cycling"; add comment to PR
      • if failed, will keep card queued; add comment to PR; attempt again next execution
    • [Human] Submit worker jobs on a server to begin compute. If using Nautilus, carefully monitor utilization and scale down resources as jobs finish.
  4. COMPLETE, INCOMPLETE, ERROR numbers reported for Optimizations, TorsionDrives

  5. PR will remain in state "Error Cycling" until moved to "Requires Scientific Review" or until all tasks COMPLETE

    • [Human] if errors appear persistent, move to state "Requires Scientific Review"
    • discussion should be had on PR for next version
    • [Human] once decided, state moved to "End of Life"
    • [Human] ensure all worker jobs have been shut down.
  6. [GHA] lifecycle-end-of-life will add tag 'end-of-life' to dataset in QCArchive for PR in "End of Life"

  7. [GHA] lifecycle-archived-complete will add tag 'archived-complete' to dataset in QCArchive for PR in "Archived/Complete"
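The states and moves named in the steps above can be sketched as a small state machine. This is an illustration only, not the repository's automation: the `State` enum, the `TRANSITIONS` table and the `move` helper are hypothetical, the table contains only moves mentioned in the text, and the edge from "Error Cycling" to "Archived/Complete" is an assumption inferred from "until all tasks COMPLETE". The actual lifecycle diagram may allow other moves.

```python
from enum import Enum


class State(Enum):
    BACKLOG = "Backlog"
    QUEUED = "Queued for Submission"
    ERROR_CYCLING = "Error Cycling"
    REVIEW = "Requires Scientific Review"
    END_OF_LIFE = "End of Life"
    ARCHIVED = "Archived/Complete"


# Allowed moves, read off the lifecycle steps.
# Assumption: Error Cycling -> Archived/Complete happens once all tasks are COMPLETE.
TRANSITIONS = {
    State.BACKLOG: {State.QUEUED},
    State.QUEUED: {State.ERROR_CYCLING},
    State.ERROR_CYCLING: {State.REVIEW, State.ARCHIVED},
    State.REVIEW: {State.END_OF_LIFE},
    State.END_OF_LIFE: set(),
    State.ARCHIVED: set(),
}


def move(current: State, target: State) -> State:
    """Return the target state, or raise ValueError if the move is not in the table."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value!r} to {target.value!r}")
    return target
```

Keeping the allowed moves in one table makes it easy to compare the sketch against the diagram and to add any edges that are missing here.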

Management Touchpoints

In addition to the states given above, there are additional touchpoints available for managing dataset submissions:

  1. The tracking label is the "on/off" switch for automation via GitHub Actions. To disable all automation on a submission PR, remove this label. To enable automation, add the label.

  2. Submission priority can be changed by adding one of the following labels:

    • priority-high: highest priority
    • priority-normal: normal priority
    • priority-low: lowest priority
  3. Submission routing to QCFractal managers on different compute resources can be accomplished with compute tags. Add a label like compute-<tagname> to set the compute tag for all QCArchive tasks associated with a submission. Be sure to coordinate with QCFractal manager admins to ensure your chosen compute tag is being served on the expected resources. This mechanism can also be used to "dead-letter" computations that are no longer desired by setting a compute tag that no manager will service.

  4. The order of a submission PR in a Dataset Tracking column matters. Submissions higher in a column will be operated on first by all GitHub Actions automation. For example, if you want to error cycle a submission before any others so it has a higher chance of being pulled by idle manager workers, place it at the top of the Error Cycling column.
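The label conventions above (tracking, priority-*, compute-<tagname>) can be read mechanically. The sketch below shows one way to interpret a PR's label names; `parse_labels` and its return shape are hypothetical, and letting the last matching label win when several are present is an assumption rather than documented behavior.

```python
def parse_labels(labels):
    """Translate PR label names into a dict of automation settings."""
    settings = {"tracking": False, "priority": None, "compute_tag": None}
    for label in labels:
        if label == "tracking":
            # The tracking label switches automation on.
            settings["tracking"] = True
        elif label in ("priority-high", "priority-normal", "priority-low"):
            settings["priority"] = label[len("priority-"):]
        elif label.startswith("compute-"):
            # Everything after the prefix is the compute tag.
            settings["compute_tag"] = label[len("compute-"):]
    return settings
```

A priority of `None` means no priority label was found; what the automation does in that case is not specified here.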

Dude where's my Dataset?

Finding the source of a dataset in QCArchive can be difficult; here we offer a mapping between a dataset in QCArchive and the folder which contains its inputs including a quick overview of some metadata and the status of the dataset. Note that new datasets submitted using QCSubmit know where they were created and have a long_description_url in the metadata which points directly to their home folder in this repository.
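For datasets that carry a long_description_url, the folder name can be recovered from the URL itself. This is a minimal sketch that assumes the URL path ends with the submission folder name; `folder_from_url` is a hypothetical helper and the URL shape is an assumption, not something confirmed here.

```python
import posixpath
from urllib.parse import urlparse


def folder_from_url(long_description_url: str) -> str:
    """Return the last path segment of the URL, assumed to be the submission folder."""
    path = urlparse(long_description_url).path.rstrip("/")
    return posixpath.basename(path)
```

Query strings and trailing slashes are ignored, so only the path portion of the URL decides the result.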

Status

The status only refers to the default specification which is required for all of our datasets. Currently this is B3LYP-D3BJ/DZVP.

Key:

Complete: 100% of the default spec jobs have completed.

Error: some of the jobs in the dataset contain errors which may prevent them from finishing; this could be something like a linear torsiondrive.

Running: the dataset is currently running and may have some incomplete jobs.
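The key can be expressed as a rule over job counts for the default specification. The sketch below is one possible reading, not the logic used to produce the tables: `dataset_status` is a hypothetical helper, and reporting "Error" ahead of "Running" when both errored and incomplete jobs exist is an assumption.

```python
def dataset_status(complete: int, incomplete: int, error: int) -> str:
    """Classify a dataset from its default-spec job counts."""
    total = complete + incomplete + error
    if total > 0 and complete == total:
        return "Complete"
    if error > 0:
        # Assumption: any errored job marks the dataset as Error.
        return "Error"
    return "Running"
```

A dataset with no jobs at all falls through to "Running" in this sketch.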

Basic Datasets

These are currently used to compute properties of a minimum energy conformation (Hessians, wavefunctions, etc.), usually derived from completed optimization datasets.

| QCArchive Dataset | Folder | Description | Elements | Status |
|---|---|---|---|---|
| OpenFF Optimization Set 1 | 2019-07-09-OpenFF-Optimization-Set | Hessian calculations. | Cl, S, C, F, O, H, N | Complete |
| OpenFF NCI250K Boron 1 | 2019-07-05 OpenFF NCI250K Boron 1 | Hessian calculations. | Cl, Br, S, C, F, B, O, H, N | Complete |
| OpenFF Discrepancy Benchmark 1 | 2019-07-05 eMolecules force field discrepancies 1 | Hessian calculation. | Cl, Br, S, C, F, P, I, O, H, N | Error |
| OpenFF Gen 2 Opt Set 1 Roche | 2020-03-20-OpenFF-Gen-2-Optimization-Set-1-Roche | Hessian calculation. | Cl, S, C, F, O, H, N | Complete |
| OpenFF Gen 2 Opt Set 2 Coverage | 2020-03-20-OpenFF-Gen-2-Optimization-Set-2-Coverage | The hessian calculations. | Cl, Br, S, C, F, P, I, O, H, N | Error |
| OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy | 2020-03-20-OpenFF-Gen-2-Optimization-Set-3-Pfizer-Discrepancy | Hessian calculations. | Cl, F, C, S, O, H, N | Complete |
| OpenFF Gen 2 Opt Set 4 eMolecules Discrepancy | 2020-03-20-OpenFF-Gen-2-Optimization-Set-4-eMolecules-Discrepancy | Hessian calculations. | Cl, Br, S, C, F, P, I, O, H, N | Complete |
| OpenFF Gen 2 Opt Set 5 Bayer | 2020-03-20-OpenFF-Gen-2-Optimization-Set-5-Bayer | Hessian calculations. | Si, Cl, Br, F, C, S, O, H, N | Error |
| OpenFF VEHICLe Set 1 | 2019-07-02 VEHICLe optimization dataset | Hessian calculations. | S, C, O, H, N | Error |
| SMIRNOFF Coverage Set 1 | 2019-06-25-smirnoff99Frost-coverage | Hessian calculations. | Cl, Br, S, C, F, P, I, O, H, N | Error |
| OpenFF ESP Fragment Conformers v1.0 | 2022-01-16-OpenFF-ESP-Fragment-Conformers-v1.0 | ESP Calculations | N, Cl, C, H, P, Br, O, F, S | Running |
| OpenFF Theory Benchmarking Single Point Energies v1.0 | 2021-09-06-theory-bm-single-points | Single Point Energy dataset for the final optimized geometries from MP2/heavy-aug-cc-pVTZ torsiondrives. | Cl, F, C, S, O, H, N, P | Running |
| TorsionNet500 Single Points Dataset v1.0 | 2021-11-09-TorsionNet500-single-points | Single point energies of final geometries of TorsionNet500 dataset. | H, O, F, S, N, Cl, C | Running |
| SPICE DES Monomers Single Points Dataset v1.1 | 2021-11-15-QMDataset-DES-monomers-single-points | Single point energy calculation of DES monomers. | I, C, Br, P, Cl, H, S, O, F, N | Complete |
| SPICE Solvated Amino Acids Single Points Dataset v1.1 | 2021-11-08-QMDataset-Solvated-Amino-Acids-single-points | Single point energy calculation of solvated amino acids. | N, S, O, C, H | Complete |
| SPICE DES370K Single Points Dataset v1.0 | 2021-11-08-QMDataset-DES370K-single-points | SPICE single point dataset for ML applications. | N, O, Mg, H, F, K, Br, Na, P, Cl, I, Ca, S, Li, C | Complete |
| SPICE DES370K Single Points Dataset Supplement v1.0 | 2022-02-18-QMDataset-DES370K-single-points-supplement | SPICE single point dataset for ML applications. | F, H, Cl, S, I, Br, N, Li, O, C, Na | Running |
| SPICE Dipeptides Single Points Dataset v1.2 | 2021-11-08-QMDataset-Dipeptide-single-points | SPICE single point dataset for ML applications. | C, N, O, H, S | Complete |
| SPICE PubChem Set 1 Single Points Dataset v1.2 | 2021-11-08-QMDataset-pubchem-set1-single-points | SPICE single point dataset for ML applications. | O, Cl, N, C, P, Br, S, F, I, H | Running |
| SPICE PubChem Set 2 Single Points Dataset v1.2 | 2021-11-09-QMDataset-pubchem-set2-single-points | SPICE single point dataset for ML applications. | H, P, C, Cl, Br, N, F, S, O, I | Running |
| SPICE PubChem Set 3 Single Points Dataset v1.2 | 2021-11-09-QMDataset-pubchem-set3-single-points | SPICE single point dataset for ML applications. | N, C, S, Cl, Br, F, P, I, H, O | Running |
| SPICE PubChem Set 4 Single Points Dataset v1.2 | 2021-11-09-QMDataset-pubchem-set4-single-points | SPICE single point dataset for ML applications. | N, S, Br, O, C, F, H, I, Cl, P | Running |
| SPICE PubChem Set 5 Single Points Dataset v1.2 | 2021-11-09-QMDataset-pubchem-set5-single-points | SPICE single point dataset for ML applications. | F, H, S, Br, Cl, N, P, C, I, O | Running |
| SPICE PubChem Set 6 Single Points Dataset v1.2 | 2021-11-09-QMDataset-pubchem-set6-single-points | SPICE single point dataset for ML applications. | Cl, O, N, H, C, P, S, F, Br, I | Running |
| OpenFF ESP Industry Benchmark Set v1.1 | 2022-02-02-OpenFF-ESP-Industry-Benchmark-Set-v1.1-single-point | HF/6-31G* conformers of public industry benchmark molecules. | N, F, Cl, C, H, O, Br, P, S | Running |
| SPICE Ion Pairs Single Points Dataset v1.1 | 2022-06-08-QMDataset-ion-pairs | SPICE single point dataset for ML applications. | F, Cl, Li, Na, Br, K, I | Running |
| RNA Single Point Dataset v1.0 | 2022-07-07-RNA-basepair-triplebase-single-points | RNA single point dataset consisting of RNA basepairs and triple bases. | P, N, O, C, H | Running |
| RNA Trinucleotide Single Point Dataset v1.0 | 2022-10-21-RNA-trinucleotide-single-points | Single point energy calculations of RNA basepairs and triple bases | O, N, C, H, P | Running |
| RNA Nucleoside Single Point Dataset v1.0 | 2023-03-09-RNA-nucleoside-single-points | Single point energy calculations of RNA nucleosides without O5' hydroxyl atom | O, N, C, H | Running |
| OpenFF multi-Br ESP Fragment Conformers v1.1 | 2023-11-30-OpenFF-multi-Br-ESP-Fragment-Conformers-v1.1-single-point | Single point ESP calculations | Br, C, F, H, N, O, P, S | |
| MLPepper RECAP Optimized Fragments v1.0 | 2024-07-26-MLPepper-RECAP-Optimized-Fragments-v1.0 | Single point property calculations for charge models | P, B, Cl, Br, C, H, I, F, O, N, Si, S | |
| OpenFF NAGL2 ESP Timing Benchmark v1.0 | 2024-09-06-OpenFF-NAGL2-ESP-Timing-Benchmark-v1.0 | Single point ESP calculations for timing/memory benchmarking | P, S, N, C, Cl, F, Br, O, H, I | |
| OpenFF NAGL2 ESP Timing Benchmark v1.1 | 2024-09-18-OpenFF-NAGL2-ESP-Timing-Benchmark-v1.1 | Single point ESP calculations for timing/memory benchmarking | P, S, N, C, Cl, F, Br, O, H, I | |
| OpenFF Sulfur Hessian Training Coverage Supplement v1.0 | 2024-09-18-OpenFF-Sulfur-Hessian-Training-Coverage-Supplement-v1.0 | Additional Hessian training data for Sage sulfur and phosphorus parameters (from 'OpenFF Sulfur Optimization Training Coverage Supplement v1.0') | O, S, C, Cl, P, N, F, Br, H | |
| OpenFF Aniline Para Hessian v1.0 | 2024-10-07-OpenFF-Aniline-Para-Hessian-v1.0 | Hessian single points for the final molecules in the OpenFF Aniline Para Opt v1.0 dataset | O, Cl, S, Br, H, F, N, C | |
| OpenFF Gen2 Hessian Dataset Protomers v1.0 | 2024-10-07-OpenFF-Gen2-Hessian-Dataset-Protomers-v1.0 | Hessian single points for the final molecules in the OpenFF Gen2 Optimization Dataset Protomers v1.0 dataset | H, C, Cl, P, F, Br, O, N, S | |
| MLPepper-RECAP-Optimized-Fragments-Add-Iodines-v1.0 | 2024-10-11-MLPepper-RECAP-Optimized-Fragments-Add-Iodines-v1.0 | Set of diverse iodine containing molecules with a number of calculated electrostatic properties. | Br, Cl, S, B, O, Si, C, N, I, P, H, F | |

Optimization Datasets

These are currently used to find a minimum energy conformation of a molecule.

| QCArchive Dataset | Folder | Description | Elements | Status |
|---|---|---|---|---|
| OpenFF Optimization Set 1 | 2019-05-16-Roche-Optimization_Set | Geometry optimizations of a set of Roche molecules for forcefield fitting. | Cl, S, C, F, O, H, N | Complete |
| SMIRNOFF Coverage Set 1 | 2019-06-25-smirnoff99Frost-coverage | An optimization dataset that exercises all parameters in Smirnoff99Frost. | Cl, Br, S, C, F, P, I, O, H, N | Error |
| OpenFF VEHICLe Set 1 | 2019-07-02 VEHICLe optimization dataset | VEHICLe (virtual exploratory heterocyclic library) dataset of 24,867 aromatic heterocyclic rings with expanded stereochemistry. | S, C, O, H, N | Error |
| OpenFF Discrepancy Benchmark 1 | 2019-07-05 eMolecules force field discrepancies 1 | A set of molecules whose optimized structures differ across forcefields. | Cl, Br, S, C, F, P, I, O, H, N | Error |
| OpenFF NCI250K Boron 1 | 2019-07-05 OpenFF NCI250K Boron 1 | This database is a subset of boron-containing compounds from the NCI250K (Release 1 - Oct 1999) compound dataset. | Cl, Br, S, C, F, B, O, H, N | Complete |
| OpenFF Ehrman Informative Optimization v0.2 | 2019-09-06-OpenFF-Informative-Set | This provides an optimization dataset based on an initial batch of Jordan Ehrman's analysis of eMolecules, pulling out molecules with minimized geometries which are substantially different in different force fields. | Cl, Br, S, C, F, P, I, O, H, N | Error |
| Pfizer discrepancy optimization dataset 1 | 2019-09-07-Pfizer-discrepancy-optimization-dataset-1 | This database is a subset of 100 challenging small molecule fragments where HF/minix followed by B3LYP/6-31G*//B3LYP/6-31G** differed substantially from OPLS3e. | Cl, F, C, S, O, H, N | Complete |
| FDA optimization dataset 1 | 2019-09-08-fda-optimization-dataset-1 | The ZINC15 FDA dataset was retrieved in mol2 format on Sun Sep 8 20:44:34 EDT 2019 via: http://zinc.docking.org/substances/subsets/fda.mol2?count=all | Cl, Br, F, C, S, P, I, O, H, N | Error |
| Kinase Inhibitors: WBO Distributions | 2019-11-27-kinase-inhibitor-optimization | Geometry optimization of kinase inhibitor conformers to explore WBO conformation dependency. | Cl, Br, S, C, F, P, I, O, H, N | Complete |
| OpenFF Gen 2 Opt Set 1 Roche | 2020-03-20-OpenFF-Gen-2-Optimization-Set-1-Roche | 2nd generation optimization dataset for bond and valence parameter fitting. | Cl, S, C, F, O, H, N | Complete |
| OpenFF Gen 2 Opt Set 2 Coverage | 2020-03-20-OpenFF-Gen-2-Optimization-Set-2-Coverage | 2nd generation optimization dataset for bond and valence parameter fitting. | Cl, Br, S, C, F, P, I, O, H, N | Error |
| OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy | 2020-03-20-OpenFF-Gen-2-Optimization-Set-3-Pfizer-Discrepancy | 2nd generation optimization dataset for bond and valence parameter fitting. | Cl, F, C, S, O, H, N | Complete |
| OpenFF Gen 2 Opt Set 4 eMolecules Discrepancy | 2020-03-20-OpenFF-Gen-2-Optimization-Set-4-eMolecules-Discrepancy | 2nd generation optimization dataset for bond and valence parameter fitting. | Cl, Br, S, C, F, P, I, O, H, N | Complete |
| OpenFF Gen 2 Opt Set 5 Bayer | 2020-03-20-OpenFF-Gen-2-Optimization-Set-5-Bayer | 2nd generation optimization dataset for bond and valence parameter fitting. | Si, Cl, Br, F, C, S, O, H, N | Error |
| OpenFF Protein Fragments v1.0 | 2020-07-06-OpenFF-Protein-Fragments-Initial | This is the initial test of running constrained optimizations on various protein fragments prepared by David Cerutti. Here we just have ALA as the central residue. | H, C, O, N | Complete |
| OpenFF Protein Fragments v2.0 | 2020-08-12-OpenFF-Protein-Fragments-version2 | This is the full protein fragment dataset (version2) consisting of constrained optimizations on various protein fragments prepared by David Cerutti. We have 12 central residues which are capped with a combination of different terminal residues. | S, C, O, H, N | Error |
| OpenFF Sandbox CHO PhAlkEthOH v1.0 | 2020-09-18-OpenFF-Sandbox-CHO-PhAlkEthOH | The molecules are from the AlkEthOH and PhEthOH datasets originally used to build the smirnoff99Frosst parameters. The AlkEthOH was taken from here | H, C, O | Running |
| OpenFF Industry Benchmark Season 1 v1.0 | 2021-03-30-OpenFF-Industry-Benchmark-Season-1-v1.0 | The combination of all publicly chosen compound sets by industry partners from the OpenFF season 1 industry benchmark | N, F, Cl, C, H, O, Br, P, S | Error |
| OpenFF Industry Benchmark Season 1 v1.1 | 2021-06-04-OpenFF-Industry-Benchmark-Season-1-v1.1 | The combination of all publicly chosen compound sets by industry partners from the OpenFF season 1 industry benchmark | N, F, Cl, C, H, O, Br, P, S | Running |
| OpenFF Theory Benchmarking Constrained Optimization Set MP2 heavy-aug-cc-pVTZ v1.1 | 2020-11-25-theory-bm-set-mp2-heavy-aug-cc-pvtz | This is a Constrained Optimization dataset for benchmarking MP2/heavy-aug-cc-pVTZ. | | Running |
| OpenFF Industry Benchmark Season 1 - MM v1.1 | 2021-07-28-OpenFF-Industry-Benchmark-Season-1-MM-v1.1 | The combination of all publicly chosen compound sets by industry partners from the OpenFF season 1 industry benchmark; MM computations starting from QM-optimized geometries. | N, F, Cl, C, H, O, Br, P, S | Running |
| OpenFF RESP Polarizability Optimizations v1.0 | 2021-10-01-OpenFF-resppol-mp2-single-point | A data set used for training ESP-fitting based typed atomic polarizabilities with a direct approximation. | N, C, H, O | Running |
| OpenFF RESP Polarizability Optimizations v1.1 | 2021-10-01-OpenFF-resppol-mp2-single-point | A data set used for training ESP-fitting based typed atomic polarizabilities with a direct approximation. | N, C, H, O | Running |
| SPICE Dipeptides Optimization Dataset v1.0 | 2021-11-11-Dipeptide-optimization-set | Optimization set created from the smiles of SPICE Dipeptide dataset. | N, C, H, O, S | Running |
| OpenFF Gen 2 Optimization Dataset Protomers v1.0 | 2021-12-21-OpenFF-Gen2-Optimization-Set-Protomers | Optimization set created from the smiles of missing protomers in Gen 2 optimization sets. | O, F, S, Br, Cl, C, P, H, I, N | Running |
| OpenFF ESP Industry Benchmark Set v1.0 | 2022-02-02-OpenFF-ESP-Industry-Benchmark-Set-v1.0-optimization-set | HF/6-31G* conformers of public industry benchmark molecules. | N, F, Cl, C, H, O, Br, P, S | Running |
| OpenFF Protein Capped 1-mers 3-mers Optimization Dataset v1.0 | 2022-05-30-OpenFF-Protein-Capped-1-mers-3-mers-Optimization | Optimization dataset for protein capped 1-mers Ace-X-Nme and capped 3-mers Ace-Y-X-Y-Nme with Y = {Ala, Val} and X = 26 canonical amino acids with common protomers/tautomers (Ash, Cyx, Glh, Hid, Hip, and Lyn) | H, C, N, O, S | |
| OpenFF Iodine Chemistry Optimization Dataset v1.0 | 2022-07-27-OpenFF-iodine-optimization-set | Optimization set created from Gen1 and Gen2 molecules containing iodine | C, F, O, H, Br, Cl, N, I, S | |
| OpenFF multi-Br ESP Fragment Conformers v1.0 | 2023-11-02-OpenFF-multi-Br-ESP-Fragment-Conformers-v1.0 | Optimization set created from 2022-01-16-OpenFF-ESP-Fragment-Conformers-v1.0 by selecting molecules with multiple Cl atoms and replacing them with Br | Br, C, F, H, N, O, P, S | |
| XtalPi Shared Fragments OptimizationDataset v1.0 | 2024-01-30-xtalpi-shared-fragments-optimization-v1.0 | Representative optimization molecules used to fit XFF | C, H, Cl, Br, S, O, F, N, P | |
| XtalPi 20-percent Fragments OptimizationDataset v1.0 | 2024-04-02-xtalpi-20-percent-fragments-optimization-v1.0 | Larger (20%) representative subset of molecules used to fit XFF | Cl, P, Br, I, H, C, B, Si, O, N, F, S | |
| OpenFF Torsion Benchmark Supplement Optimization Dataset v1.0 | 2024-04-18-OpenFF-Torsion-Benchmark-Supplement-Optimization-Dataset-v1.0 | Additional optimizations for benchmarking Sage 2.2.0 proper torsions and new parameters from the torsion multiplicity work | H, C, N, O, F, P, S, Cl, Br | |
| OpenFF Torsion Multiplicity Optimization Training Coverage Supplement v1.0 | 2024-06-20-OpenFF-Torsion-Multiplicity-Optimization-Training-Coverage-Supplement-v1.0 | Additional optimization training data for Sage 2.2.0 proper torsions and new parameters from the torsion multiplicity work | C, Cl, S, O, H, P, N, Br | |
| OpenFF Torsion Multiplicity Optimization Benchmarking Coverage Supplement v1.0 | 2024-06-24-OpenFF-Torsion-Multiplicity-Optimization-Benchmarking-Coverage-Supplement-v1.0 | Additional optimization benchmarking data for Sage 2.2.0 proper torsions and new parameters from the torsion multiplicity work | Cl, H, I, S, O, N, Br, C, P | |
| OpenFF Iodine Fragment Opt v1.0 | 2024-09-10-OpenFF-Iodine-Fragment-Opt-v1.0 | B3LYP-D3BJ/DZVP optimized conformers for a variety of I-containing fragment molecules | C, O, I, S, F, Br, Cl, N, H | |
| OpenFF Sulfur Optimization Training Coverage Supplement v1.0 | 2024-09-11-OpenFF-Sulfur-Optimization-Training-Coverage-Supplement-v1.0 | Additional optimization training data for Sage sulfur and phosphorus parameters | C, S, F, O, H, Cl, Br, P, N | |
| OpenFF Sulfur Optimization Benchmarking Coverage Supplement v1.0 | 2024-09-18-OpenFF-Sulfur-Optimization-Benchmarking-Coverage-Supplement-v1.0 | Additional optimization benchmarking data for Sage sulfur and phosphorus parameters | S, P, Cl, C, N, O, H, Br, F | |
| OpenFF Lipid Optimization Training Supplement v1.0 | 2024-10-08-OpenFF-Lipid-Optimization-Training-Supplement-v1.0 | Additional optimization training data for Sage from representative LIPID MAPS fragments | I, Br, O, H, P, C, N, Cl, F, S | |

TorsionDrive Datasets

These are currently used to perform a complete rotation of one or more selected bonds, where optimizations are performed over a discrete set of angles.

| QCArchive Dataset | Folder | Description | Elements | Status |
|---|---|---|---|---|
| Fragment Stability Benchmark | 2019-03-06-Fragmenter_Stability-Benchmark | Examination of different fragmentation schemes. | Cl, F, C, P, I, O, H, N | Error |
| OpenFF Group1 Torsions | 2019-05-01-OpenFF-Group1-Torsions | A collection of torsion drives for forcefield fitting. | Cl, F, C, S, O, H, N | Error |
| SMIRNOFF Coverage Torsion Set 1 | 2019-07-01-smirnoff99Frost-coverage-torsion | Set of small molecules that use all smirnoff99Frost parameters. | Cl, Br, S, C, F, P, I, O, H, N | Error |
| OpenFF Substituted Phenyl Set 1 | 2019-07-25-phenyl-set | A set of substituted phenyl torsiondrives. | Cl, Br, F, C, I, O, H, N | Error |
| Pfizer discrepancy torsion dataset 1 | 2019-09-07-Pfizer-discrepancy-torsion-dataset-1 | This database is a subset of 100 challenging small molecule fragments where HF/minix followed by B3LYP/6-31G*//B3LYP/6-31G** differed substantially from OPLS3e. | Cl, F, C, S, O, H, N | Error |
| TorsionDrive Paper | 2019-11-07-TorsionDrive-Paper | Torsion Drives to explore wavefront propagation for the TorsionDrive paper. | C, H, O | Error |
| OpenFF Primary Benchmark 1 Torsion Set | 2019-12-05-OpenFF-Benchmark-Primary-1-torsion | Validation of optimized force field torsion parameters. | Cl, Br, F, C, S, O, H, N | Error |
| OpenFF Primary Benchmark 2 Torsion Set | 2020-01-17-OpenFF-Benchmark-Full-1-torsion | Validation of optimized force field torsion parameters. | Cl, Br, S, C, F, P, I, O, H, N | Error |
| OpenFF Group1 Torsions 2 | 2020-01-31-OpenFF-Group1-Torsions-2 | Generation of additional data for fitting of newly added torsion terms. | H, C, O, N | Complete |
| OpenFF Group1 Torsions 3 | 2020-02-10-OpenFF-Group1-Torsions-3 | Generation of additional data for fitting of t128 and t129 | H, C, O, N | Error |
| OpenFF Gen 2 Torsion Set 1 Roche | 2020-03-12-OpenFF-Gen-2-Torsion-Set-1-Roche | Design 2nd generation torsion dataset for valence parameter fitting. | F, C, S, O, H, N | Error |
| OpenFF Gen 2 Torsion Set 2 Coverage | 2020-03-12-OpenFF-Gen-2-Torsion-Set-2-Coverage | Design 2nd generation torsion dataset for valence parameter fitting. | Cl, Br, F, C, S, P, I, O, H, N | Error |
| OpenFF Gen 2 Torsion Set 3 Pfizer Discrepancy | 2020-03-12-OpenFF-Gen-2-Torsion-Set-3-Pfizer-Discrepancy | Design 2nd generation torsion dataset for valence parameter fitting. | S, C, F, O, H, N | Running |
| OpenFF Gen 2 Torsion Set 4 eMolecules Discrepancy | 2020-03-12-OpenFF-Gen-2-Torsion-Set-4-eMolecules-Discrepancy | Design 2nd generation torsion dataset for valence parameter fitting. | Cl, Br, F, C, S, P, I, O, H, N | Error |
| OpenFF Gen 2 Torsion Set 5 Bayer | 2020-03-12-OpenFF-Gen-2-Torsion-Set-5-Bayer | Design 2nd generation torsion dataset for valence parameter fitting. | Cl, Br, F, C, S, O, H, N | Error |
| OpenFF Gen 2 Torsion Set 6 supplemental | 2020-03-12-OpenFF-Gen-2-Torsion-Set-6-supplemental | Design 2nd generation torsion dataset for valence parameter fitting. | S, C, O, H, N | Error |
| OpenFF Gen 2 Torsion Set 1 Roche 2 | 2020-03-23-OpenFF-Gen-2-Torsion-Set-1-Roche-2 | Design 2nd generation torsion dataset for valence parameter fitting. | Cl, F, C, S, O, H, N | Error |
| OpenFF Gen 2 Torsion Set 2 Coverage 2 | 2020-03-23-OpenFF-Gen-2-Torsion-Set-2-Coverage-2 | Design 2nd generation torsion dataset for valence parameter fitting. | Cl, Br, F, C, S, P, I, O, H, N | Error |
| OpenFF Gen 2 Torsion Set 3 Pfizer Discrepancy 2 | 2020-03-23-OpenFF-Gen-2-Torsion-Set-3-Pfizer-Discrepancy-2 | Design 2nd generation torsion dataset for valence parameter fitting. | S, C, F, O, H, N | Complete |
| OpenFF Gen 2 Torsion Set 4 eMolecules Discrepancy 2 | 2020-03-23-OpenFF-Gen-2-Torsion-Set-4-eMolecules-Discrepancy-2 | Design 2nd generation torsion dataset for valence parameter fitting. | Cl, Br, F, C, S, P, I, O, H, N | Error |
| OpenFF Gen 2 Torsion Set 5 Bayer 2 | 2020-03-26-OpenFF-Gen-2-Torsion-Set-5-Bayer-2 | Design 2nd generation torsion dataset for valence parameter fitting. | Cl, Br, F, C, S, O, H, N | Error |
| OpenFF Gen 2 Torsion Set 6 supplemental 2 | 2020-03-26-OpenFF-Gen-2-Torsion-Set-6-supplemental-2 | Design 2nd generation torsion dataset for valence parameter fitting. | Br, S, C, F, O, H, N | Error |
| OpenFF Fragmenter Validation 1.0 | 2020-04-28-Fragmenter-test | Examination of different fragmentation schemes. | Cl, S, C, P, I, O, H, N | Error |
| OpenFF DANCE 1 eMolecules t142 v1.0 | 2020-06-01-DANCE-1-eMolecules-t142-selected | Molecules selected from the eMolecules database by DANCE to improve t142 parameterization in smirnoff99Frosst. | Cl, Br, F, C, S, O, H, N | Error |
| OpenFF Rowley Biaryl v1.0 | 2020-06-17-OpenFF-Biaryl-set | This is a TorsionDrive dataset consisting of biaryl torsions provided by Christopher Rowley. Originally used to benchmark parsley, but could also be useful for fitting. | S, C, O, H, N | Running |
| OpenFF-benchmark-ligand-fragments-v1.0 | 2020-07-27-OpenFF-Benchmark-Ligands | This is a torsiondrive dataset created from the OpenFF FEP benchmark dataset. The ligands are fragmented before having key torsions driven. | Cl, Br, S, C, F, I, O, H, N | Running |
| OpenFF Theory Benchmarking Set B3LYP-D3BJ DZVP v1.0 | 2020-07-27-theory-bm-set-b3lyp-d3bj-dzvp | This is a TorsionDrive dataset consisting of 36 1-D torsions selected for benchmarking different QM levels. | Cl, F, C, S, P, O, H, N | Complete |
| OpenFF Theory Benchmarking Set B3LYP-D3BJ def2-TZVP v1.0 | 2020-07-30-theory-bm-set-b3lyp-d3bj-def2-tzvp | This is a TorsionDrive dataset consisting of 36 1-D torsions selected for benchmarking different QM levels. | Cl, F, C, S, P, O, H, N | Complete |
| OpenFF Theory Benchmarking Set B3LYP-D3BJ def2-TZVPD v1.0 | 2020-07-30-theory-bm-set-b3lyp-d3bj-def2-tzvpd | This is a TorsionDrive dataset consisting of 36 1-D torsions selected for benchmarking different QM levels. | Cl, F, C, S, P, O, H, N | Error |
| OpenFF Theory Benchmarking Set B3LYP-D3BJ def2-TZVPP v1.0 | 2020-07-30-theory-bm-set-b3lyp-d3bj-def2-tzvpp | This is a TorsionDrive dataset consisting of 36 1-D torsions selected for benchmarking different QM levels. | Cl, F, C, S, P, O, H, N | Complete |
| OpenFF Protein Fragments TorsionDrives v1.0 | 2020-09-16-OpenFF-Protein-Fragments-TorsionDrives | This is a protein fragment dataset consisting of torsion drives on various protein fragments prepared by David Cerutti. We have 12 central residues capped with a combination of different terminal residues. We drive the following angles for each fragment: omega, phi, psi, chi1 (if applicable), chi2 (if applicable). | S, C, O, H, N | Error |
| OpenFF WBO Conjugated Series v1.0 | 2021-01-25-OpenFF-Conjugated-Series | This is a torsion drive dataset that consists of various chemistries that probe a range of conjugated bonds. The goal of this dataset is to develop WBO interpolated torsions for the OpenFF force field. | S, C, O, H, N | Error |
| OpenFF Amide Torsion Set v1.0 | 2021-03-23-OpenFF-Amide-Torsion-Set-v1.0 | Amides, thioamides and amidines diversely functionalized. | S, C, O, H, N | Running |
| OpenFF Aniline Para Opt v1.0 | 2021-04-02-OpenFF-Aniline-Para-Opt-v1.0 | Optimizations of diverse, para-substituted aniline derivatives. | Br, C, O, N, S, H, Cl, F | Running |
| OpenFF Gen3 Torsion Set v1.0 | 2021-04-09-OpenFF-Gen3-Torsion-Set-v1.0 | This dataset is a simple-molecule-only torsiondrive dataset, aiming to avoid the issue of torsion parameter contamination by large internal non-bonded interactions during a valence parameter optimization. Molecules with one effective rotating bond were generated by combining two simple substituents, which were identified by fragmenting small drug-like molecules. Torsions from the generated molecule set were selected using a clustering method, in a way that the dataset can allow a chemical diversity of molecules training each torsion parameter. | F, N, H, Cl, P, S, O, Br, C | Running |
| OpenFF Aniline 2D Impropers v1.0 | 2021-03-29-OpenFF-Aniline-2D-Impropers-v1.0 | This dataset contains a set of aniline derivatives which have para-substituted groups of varying electron donating and withdrawing properties. This dataset was curated in an effort to improve and understand improper torsions in force fields. We will scan the improper and proper angle simultaneously to better understand the coupling and energetics of these torsions. | O, C, S, H, N | Running |
| OpenFF BCC Refit Study COH v2.0 | 2021-06-22-OpenFF-BCC-Refit-Study-COH-v2.0 | A data set curated for the initial stage of the on-going OpenFF study which aims to co-optimize the AM1BCC bond charge correction (BCC) parameters against an experimental training set of density and enthalpy of mixing data points and a QM training set of electric field data. The initial data set is limited to only molecules composed of C, O, H. This limited scope significantly reduces the number of BCC parameters which must be retrained, thus allowing for easier convergence of the initial optimizations. The included molecules were combinatorially generated to cover a range of alcohol, ether, and carbonyl containing molecules. | O, C, S, H, N | Running |
| OpenFF-benchmark-ligand-fragments-v2.0 | 2021-08-10-OpenFF-JACS-Fragments-v2.0 | This is a torsiondrive dataset created from the OpenFF FEP benchmark dataset. The ligands are fragmented using openff-fragmenter with both ambertools and openeye before having key torsions driven. | S, N, Br, C, H, O, Cl, F, I | Running |
| OpenFF-Protein-Dipeptide-2D-TorsionDrive-v2.1 | 2021-11-18-OpenFF-Protein-Dipeptide-2D-TorsionDrive | Two-dimensional TorsionDrives on phi and psi for dipeptides of the 20 canonical amino acids and 6 alternate protomers/tautomers. | H, C, N, O, S | |
| OpenFF-Protein-Capped-1-mer-Sidechains-v1.3 | 2022-02-10-OpenFF-Protein-Capped-1-mer-Sidechains | Two-dimensional TorsionDrives on chi1 and chi2 for capped 1-mers of amino acids with a rotatable bond in the sidechain. | H, C, N, O, S | |
| OpenFF-Protein-Capped-3-mer-Backbones-v1.0 | 2022-05-30-OpenFF-Protein-Capped-3-mer-Backbones | Two-dimensional TorsionDrives on phi and psi for capped 3-mers Ace-Y-X-Y-Nme with Y = {Ala, Val}. | H, C, N, O, S | |
OpenFF-multiplicity-correction-torsion-drive-data-v1.12022-04-29-OpenFF-multiplicity-correction-torsion-drive-data-v1.1A torsiondrive dataset created to correct multiplicity issues in the force field.'S', 'P', 'O', 'C', 'H', 'N'Running
OpenFF-Protein-Capped-3-mer-Omega-v1.02023-02-06-OpenFF-Protein-Capped-3-mer-OmegaTorsionDrives on omega for capped 3-mers Ace-Ala-X-Ala-Nme.H, C, N, O, S
XtalPi Shared Fragments TorsiondriveDataset v1.02024-01-30-xtalpi-shared-fragments-torsiondrive-v1.0Representative torsion scan molecules used to fit XFFC, H, Cl, Br, S, O, F, N, P
OpenFF Torsion Coverage Supplement v1.02024-02-29-OpenFF-Torsion-Coverage-Supplement-v1.0Additional TorsionDrives to improve coverage for Sage 2.1.0 proper torsions and new parameters from the torsion multiplicity workC, Cl, F, H, N, O, S
OpenFF-RNA-Dinucleoside-Monophosphate-TorsionDrives-v1.02024-03-26-OpenFF-RNA-Dinucleoside-Monophosphate-TorsionDrivesTorsionDrives of non-ring backbone, glycosidic, and hydroxyl dihedrals in RNA XpY 2-mers.H, C, N, O, P
XtalPi 20-percent Fragments TorsiondriveDataset v1.02024-04-02-xtalpi-20-percent-fragments-torsiondrive-v1.0Torsion scans of larger representative subset (20%) of molecules used to fit XFFO, Br, I, Si, B, C, P, S, Cl, H, N, F
OpenFF Torsion Drive Supplement v1.02024-04-17-OpenFF-Torsion-Drive-Supplement-v1.0Additional TorsionDrives to expand training data for Sage 2.2.0 proper torsions and new parameters from the torsion multiplicity workH, C, N, O, P, S
OpenFF Torsion Multiplicity Torsion Drive Coverage Supplement v1.02024-06-14-OpenFF-Torsion-Multiplicity-Torsion-Drive-Coverage-Supplement-v1.0Additional torsion drive training data for Sage 2.2.0 proper torsions and new parameters from the torsion multiplicity workN, Br, H, P, Cl, O, C, S
OpenFF Phosphate Torsion Drives v1.02024-07-17-OpenFF-Phosphate-Torsion-Drives-v1.0Lipid-like phosphate torsionsC, S, N, H, O, P
OpenFF Alkane Torsion Drives v1.02024-08-09-OpenFF-Alkane-Torsion-Drives-v1.0Alka/ene torsion drivesC, H

GridOptimization Datasets

These are currently used to perform a scan of one or more internal coordinates (bond, angle, torsion), where optimizations are performed over a discrete set of values.
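As a rough illustration of what "a discrete set of values" means for a scan over more than one coordinate, the sketch below enumerates the grid points at which a constrained optimization would be run. It is plain Python and does not use the QCArchive or QCFractal API; the function name `grid_points`, the coordinate labels, and the example values are all hypothetical.

```python
from itertools import product


def grid_points(scans):
    """Enumerate every combination of target values for the scanned coordinates.

    `scans` is a list of (label, values) pairs, one per internal coordinate.
    Each returned dict is one grid point, i.e. one set of constraint values
    at which an optimization would be performed.
    """
    labels = [label for label, _ in scans]
    value_lists = [values for _, values in scans]
    return [dict(zip(labels, combo)) for combo in product(*value_lists)]


# Hypothetical 2-D scan: one angle and one torsion, values in degrees.
scans = [
    ("angle", [100, 110, 120]),
    ("torsion", [-90, 0, 90, 180]),
]
points = grid_points(scans)
print(len(points))  # 12
print(points[0])    # {'angle': 100, 'torsion': -90}
```

A 1-D scan is the same call with a single (label, values) pair; the number of optimizations grows as the product of the per-coordinate value counts.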

| QCArchive Dataset | Folder | Description | Elements | Status |
|---|---|---|---|---|
| OpenFF Trivalent Nitrogen Set 1 | 2019-06-28-Nitrogen-grid-optimization | Set of diverse trivalent nitrogen molecules for 1-D grid optimization. | Si, Cl, Br, F, C, S, P, B, I, O, H, N | Error |
| OpenFF Trivalent Nitrogen Set 2 | 2019-12-09-Nitrogen-grid-optimization-2d | Set of diverse trivalent nitrogen molecules for 2-D grid optimization. | Si, Cl, Br, F, C, S, P, B, I, O, H, N | Error |
| OpenFF Trivalent Nitrogen Set 3 | 2020-01-15-Nitogen-grid-optimization-02-1dscans | Set of diverse trivalent nitrogen molecules for 1-D grid optimization; this is a secondary dataset. | Cl, Br, S, C, F, O, H, N | Error |