Energies and forces for molecular dynamics trajectories of eight organic molecules. Level of theory DFT: PBE+vdW-TS.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
THE REVISED MD17 dataset:
=========================

Citation:
=========
Anders S. Christensen and O. Anatole von Lilienfeld (2020) "On the role of gradients for machine learning of molecular energies and forces" https://arxiv.org/abs/2007.09593

The molecules are taken from the original MD17 dataset by Chmiela et al.; 100,000 structures are taken, and the energies and forces are recalculated at the PBE/def2-SVP level of theory using very tight SCF convergence and a very dense DFT integration grid. As such, the dataset is practically free from numerical noise.

One warning: As the structures are taken from a molecular dynamics simulation (i.e. time series data), they are not guaranteed to be independent samples. This is easily evident from the autocorrelation function for the original MD17 dataset.

In short: DO NOT train a model on more than 1000 samples from this dataset. Data already published with 50K samples on the original MD17 dataset should be considered meaningless due to this fact and due to the noise in the original data.

The data:
=========
The ten molecules are saved in Numpy .npz format.

The keys correspond to:
'nuclear_charges' : The nuclear charges for the molecule
'coords' : The coordinates for each conformation (in units of ångstrom)
'energies' : The total energy of each conformation (in units of kcal/mol)
'forces' : The cartesian forces of each conformation (in units of kcal/mol/ångstrom)
'old_indices' : The index of each conformation in the original MD17 dataset
'old_energies' : The energy of each conformation taken from the original MD17 dataset (in units of kcal/mol)
'old_forces' : The forces of each conformation taken from the original MD17 dataset (in units of kcal/mol/ångstrom)

*Note that for Azobenzene, only 99988 samples are available: the original dataset contained only 99999 structures, and 11 DFT calculations failed.

Data splits:
============
Five training and test splits are saved in CSV format containing the corresponding indices.
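As an illustration of the key layout described above, the following sketch writes a small synthetic .npz archive with the same keys and then loads at most 1000 conformations from it. The file name, array sizes, random values, and the helper names `write_synthetic_npz` and `load_training_subset` are assumptions made for this example; the random index selection is only a stand-in for the CSV splits shipped with the dataset, whose file layout is not described here.

```python
import os
import tempfile

import numpy as np

# Keys whose first axis runs over conformations (per the key list above).
KEYS_PER_CONFORMATION = ("coords", "energies", "forces",
                         "old_indices", "old_energies", "old_forces")


def write_synthetic_npz(path, n_conf=2000, n_atoms=21, seed=0):
    """Write a stand-in archive with the documented keys and random values."""
    rng = np.random.default_rng(seed)
    np.savez(
        path,
        nuclear_charges=rng.integers(1, 9, size=n_atoms),
        coords=rng.normal(size=(n_conf, n_atoms, 3)),    # ångstrom
        energies=rng.normal(size=n_conf),                 # kcal/mol
        forces=rng.normal(size=(n_conf, n_atoms, 3)),     # kcal/mol/ångstrom
        old_indices=np.arange(n_conf),
        old_energies=rng.normal(size=n_conf),
        old_forces=rng.normal(size=(n_conf, n_atoms, 3)),
    )


def load_training_subset(path, n_train=1000, seed=0):
    """Load the archive and keep at most n_train randomly chosen conformations."""
    with np.load(path) as data:
        n_conf = data["energies"].shape[0]
        rng = np.random.default_rng(seed)
        idx = np.sort(rng.choice(n_conf, size=min(n_train, n_conf), replace=False))
        subset = {key: data[key][idx] for key in KEYS_PER_CONFORMATION}
        subset["nuclear_charges"] = data["nuclear_charges"]
    return subset


# Demonstration on synthetic data only; no real dataset file is read here.
with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "synthetic_molecule.npz")
    write_synthetic_npz(archive)
    subset = load_training_subset(archive, n_train=1000)

print(subset["coords"].shape, subset["forces"].shape)
```

Capping `n_train` at 1000 follows the warning above about correlated samples; for published comparisons the provided splits would presumably be the better choice.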
Dataset Card for aspirin
Dataset Summary
The aspirin dataset is a molecular dynamics (MD) dataset. The total energy and force labels for each dataset were computed using the PBE+vdW-TS electronic structure method. All geometries are in Angstrom, energies and forces are given in kcal/mol and kcal/mol/A respectively.
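Since many force-field codes work in eV and eV/Å rather than the kcal/mol and kcal/mol/Å stated above, a unit conversion is often the first preprocessing step. The sketch below is not part of the dataset card; the function names are hypothetical, and the conversion factor is derived from the thermochemical calorie (4.184 kJ/kcal) and 1 eV = 96.48533212 kJ/mol.

```python
import numpy as np

# 1 kcal/mol expressed in eV: (4.184 kJ/mol) / (96.48533212 kJ/mol per eV).
KCAL_PER_MOL_TO_EV = 4.184 / 96.48533212


def energies_to_ev(energies_kcal_per_mol):
    """Convert energies from kcal/mol to eV."""
    return np.asarray(energies_kcal_per_mol, dtype=float) * KCAL_PER_MOL_TO_EV


def forces_to_ev_per_angstrom(forces_kcal_per_mol_angstrom):
    """Convert forces from kcal/mol/Å to eV/Å (the length unit is unchanged)."""
    return np.asarray(forces_kcal_per_mol_angstrom, dtype=float) * KCAL_PER_MOL_TO_EV


# Illustrative values only, not taken from the dataset.
example_energies = energies_to_ev([23.0605, -46.121])
example_forces = forces_to_ev_per_angstrom(np.ones((2, 3)))
print(example_energies, example_forces.shape)
```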
Supported Tasks and Leaderboards
aspirin should be used for organic molecular property prediction, a regression task on 1 property. The score used… See the full description on the dataset page: https://huggingface.co/datasets/graphs-datasets/MD17-aspirin.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The original MD17 dataset (http://quantum-machine.org/datasets/#md-datasets) [Chmiela et al. Sci. Adv. 3(5), e1603015, 2017] contains numerical noise. Thus, any numbers presented from benchmarks on this data are likely flawed. Here, we present a new dataset with negligible numerical noise for benchmarking of force and energy predictions for molecular dynamics simulations. As the structures are taken from a molecular dynamics simulation (i.e. time series data), they are not guaranteed to be independent samples. This is easily evident from the autocorrelation function for the original MD17 dataset. In short: DO NOT train a model on more than 1000 samples from the revised dataset, and do not train models on more than 50 samples from the original MD17 dataset. Data already published with 50K samples on the original MD17 dataset should be considered meaningless due to this fact and due to the noise in the original data.
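To make the autocorrelation argument concrete, the sketch below computes a normalized autocorrelation function for two synthetic time series: a strongly correlated one, standing in for consecutive MD frames, and white noise, standing in for independent samples. The AR(1) model, its coefficient of 0.95, and the function name `autocorrelation` are assumptions chosen for illustration; no MD17 data is used.

```python
import numpy as np


def autocorrelation(series, max_lag):
    """Normalized autocorrelation of a 1-D series for lags 0..max_lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    norm = np.dot(x, x)  # lag-0 term, so the result equals 1 at lag 0
    return np.array([np.dot(x[: x.size - lag], x[lag:]) / norm
                     for lag in range(max_lag + 1)])


rng = np.random.default_rng(0)
n_steps = 20000
noise = rng.normal(size=n_steps)

# AR(1) process: each value depends on the previous one, like successive MD frames.
phi = 0.95
correlated = np.empty(n_steps)
correlated[0] = noise[0]
for t in range(1, n_steps):
    correlated[t] = phi * correlated[t - 1] + noise[t]

acf_correlated = autocorrelation(correlated, max_lag=50)
acf_white = autocorrelation(noise, max_lag=50)
print(acf_correlated[:3], acf_white[:3])
```

A slowly decaying autocorrelation, as in the first series, means neighbouring frames carry largely redundant information, which is the reason given above for limiting the number of training samples.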
Dataset Card for malonaldehyde
Dataset Summary
The malonaldehyde dataset is a molecular dynamics (MD) dataset. The total energy and force labels for each dataset were computed using the PBE+vdW-TS electronic structure method. All geometries are in Angstrom, energies and forces are given in kcal/mol and kcal/mol/A respectively.
Supported Tasks and Leaderboards
malonaldehyde should be used for organic molecular property prediction, a regression task on 1… See the full description on the dataset page: https://huggingface.co/datasets/graphs-datasets/MD17-malonaldehyde.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pol Febrer (pol.febrer@icn2.cat, ORCID 0000-0003-0904-2234) Peter Bjorn Jorgensen (peterbjorgensen@gmail.com, ORCID 0000-0003-4404-7276) Arghya Bhowmik (arbh@dtu.dk, ORCID 0000-0003-3198-5116)
The dataset is published as part of the paper: "GRAPH2MAT: UNIVERSAL GRAPH TO MATRIX CONVERSION FOR ELECTRON DENSITY PREDICTION" (https://doi.org/10.26434/chemrxiv-2024-j4g21) https://github.com/BIG-MAP/graph2mat
This dataset contains the Hamiltonian, Overlap, Density and Energy Density matrices from SIESTA calculations of a subset of the MD17 aspirin dataset. The subset is taken from the third split in (https://doi.org/10.6084/m9.figshare.12672038.v3).
SIESTA 5.0.0 was used to compute the dataset.
The dataset has two directories:
And then, three directories containing the calculations with different basis sets:
- matrix_dataset_defsplit: Uses the default split-valence DZP basis in SIESTA.
- matrix_dataset_optimsplit: Uses a split-valence DZP basis optimized for aspirin.
- matrix_dataset_defnodes: Uses the default nodes DZP basis in SIESTA.
Each of the basis directories has two subdirectories:
- basis: Contains the files specifying the basis used for each atom.
- runs: The results of running the SIESTA simulations. Contents are discussed next.
The "runs" directory contains one directory for each run, named with the index of the run. Each directory contains:
- RUN.fdf, geom.fdf: The input files used for the SIESTA calculation.
- RUN.out: The log of the SIESTA run, which apar
- siesta.TSDE: Contains the Density and Energy Density matrices.
- siesta.TSHS: Contains the Hamiltonian and Overlap matrices.
Each matrix can be read using the sisl python package (https://github.com/zerothi/sisl) like:
import sisl
matrix = sisl.get_sile("RUN.fdf").read_X()
where X is hamiltonian, overlap, density_matrix or energy_density_matrix.
To reproduce the results presented in the paper, follow the documentation of the graph2mat package (https://github.com/BIG-MAP/graph2mat).
https://doi.org/10.11583/DTU.c.7310005 © 2024 Technical University of Denmark
This dataset is published under the CC BY 4.0 license. This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator.
Dataset Card for toluene
Dataset Summary
The toluene dataset is a molecular dynamics (MD) dataset. The total energy and force labels for each dataset were computed using the PBE+vdW-TS electronic structure method. All geometries are in Angstrom, energies and forces are given in kcal/mol and kcal/mol/A respectively.
Supported Tasks and Leaderboards
toluene should be used for organic molecular property prediction, a regression task on 1 property. The score used… See the full description on the dataset page: https://huggingface.co/datasets/graphs-datasets/MD17-toluene.
Dataset Card for ethanol
Dataset Summary
The ethanol dataset is a molecular dynamics (MD) dataset. The total energy and force labels for each dataset were computed using the PBE+vdW-TS electronic structure method. All geometries are in Angstrom, energies and forces are given in kcal/mol and kcal/mol/A respectively.
Supported Tasks and Leaderboards
ethanol should be used for organic molecular property prediction, a regression task on 1 property. The score used… See the full description on the dataset page: https://huggingface.co/datasets/graphs-datasets/MD17-ethanol.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
We demonstrate that fast and accurate linear force fields can be built for molecules using the atomic cluster expansion (ACE) framework. The ACE models parametrize the potential energy surface in terms of body-ordered symmetric polynomials making the functional form reminiscent of traditional molecular mechanics force fields. We show that the four- or five-body ACE force fields improve on the accuracy of the empirical force fields by up to a factor of 10, reaching the accuracy typical of recently proposed machine-learning-based approaches. We not only show state of the art accuracy and speed on the widely used MD17 and ISO17 benchmark data sets, but we also go beyond RMSE by comparing a number of ML and empirical force fields to ACE on more important tasks such as normal-mode prediction, high-temperature molecular dynamics, dihedral torsional profile prediction, and even bond breaking. We also demonstrate the smoothness, transferability, and extrapolation capabilities of ACE on a new challenging benchmark data set comprised of a potential energy surface of a flexible druglike molecule.
Dataset Card for benzene
Dataset Summary
The benzene dataset is a molecular dynamics (MD) dataset. The total energy and force labels for each dataset were computed using the PBE+vdW-TS electronic structure method. All geometries are in Angstrom, energies and forces are given in kcal/mol and kcal/mol/A respectively.
Supported Tasks and Leaderboards
benzene should be used for organic molecular property prediction, a regression task on 1 property. The score used is… See the full description on the dataset page: https://huggingface.co/datasets/graphs-datasets/MD17-benzene.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Data sets used to generate learning curves. The two data sets contain the prediction errors (root-mean-square errors) obtained with different machine learning potentials (MLPs) for both energies and gradients of all molecules available in the MD17 database. The following MLP models were tested: KRR-CM, KREG, GAP-SOAP, sGDML, ANI, DPMD and PhysNet. A test set with 20000 geometries was randomly selected for each molecular system to evaluate the models' performance. See http://mlatom.com/MLPbenchmark1/ for a web version of the database, where you can further analyze it.
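For readers unfamiliar with the error measure, the sketch below shows one common way to compute root-mean-square errors for energies and for gradients. The averaging convention for gradients (over all conformations, atoms, and Cartesian components) is an assumption; the description above does not state which convention the benchmark used. The function names and numbers are illustrative only.

```python
import numpy as np


def energy_rmse(predicted, reference):
    """RMSE over conformations for energies of shape (n_conf,)."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(reference, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def gradient_rmse(predicted, reference):
    """RMSE over every component of gradients of shape (n_conf, n_atoms, 3).

    Assumption: all components are pooled before averaging.
    """
    diff = np.asarray(predicted, dtype=float) - np.asarray(reference, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Toy values, not benchmark results.
reference_energies = np.array([1.0, 2.0, 3.0])
predicted_energies = np.array([1.0, 2.0, 5.0])
reference_gradients = np.zeros((2, 4, 3))
predicted_gradients = np.full((2, 4, 3), 0.5)

print(energy_rmse(predicted_energies, reference_energies))
print(gradient_rmse(predicted_gradients, reference_gradients))
```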
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
The xxMD (Extended Excited-state Molecular Dynamics) dataset is a comprehensive collection of non-adiabatic trajectories encompassing several photo-sensitive molecules. This dataset challenges existing Neural Force Field (NFF) models with broader nuclear configuration spaces that span reactant, transition state, product, and conical intersection regions, making it more chemically representative than its contemporaries.
Key Features:
- Based on non-adiabatic dynamics, involving a larger nuclear configuration space compared to previous datasets.
- Contains trajectories from four photo-sensitive molecules, each starting from an electronic excited state.
- Energies and forces computed using both multireference wave function theory and density functional theory.
- Samples reactant, transition state, product, and conical intersection regions of potential energy surfaces.
Content:
- xxMD-CASSCF: This subset contains potential energies and forces for the first three electronic states of four molecules: azobenzene, dithiophene, malonaldehyde, and stilbene.
- xxMD-DFT: Ground-state energies and forces re-computed using the M06 exchange-correlation functional for the trajectories in the xxMD-CASSCF subset.
GitHub Repo: https://github.com/zpengmei/xxMD
Dataset Card for naphthalene
Dataset Summary
The naphthalene dataset is a molecular dynamics (MD) dataset. The total energy and force labels for each dataset were computed using the PBE+vdW-TS electronic structure method. All geometries are in Angstrom, energies and forces are given in kcal/mol and kcal/mol/A respectively.
Supported Tasks and Leaderboards
naphthalene should be used for organic molecular property prediction, a regression task on 1 property. The… See the full description on the dataset page: https://huggingface.co/datasets/graphs-datasets/MD17-naphthalene.
Dataset Card for salicylic_acid
Dataset Summary
The salicylic_acid dataset is a molecular dynamics (MD) dataset. The total energy and force labels for each dataset were computed using the PBE+vdW-TS electronic structure method. All geometries are in Angstrom, energies and forces are given in kcal/mol and kcal/mol/A respectively.
Supported Tasks and Leaderboards
salicylic_acid should be used for organic molecular property prediction, a regression task on 1… See the full description on the dataset page: https://huggingface.co/datasets/graphs-datasets/MD17-salicylic_acid.
Dataset Card for uracil
Dataset Summary
The uracil dataset is a molecular dynamics (MD) dataset. The total energy and force labels for each dataset were computed using the PBE+vdW-TS electronic structure method. All geometries are in Angstrom, energies and forces are given in kcal/mol and kcal/mol/A respectively.
Supported Tasks and Leaderboards
uracil should be used for organic molecular property prediction, a regression task on 1 property. The score used is… See the full description on the dataset page: https://huggingface.co/datasets/graphs-datasets/MD17-uracil.
Looking for information on a construction project near you? Project Portal offers a comprehensive view of all current, funded, and planned projects occurring across the State of Maryland. You can quickly and easily access specific project information, including a general overview, interactive map, news, schedule, pictures and video, supporting documents, and upcoming public meetings. It’s easy to search by location for a specific project, or by county for a list of all projects in your jurisdiction. (MDOT SHA Project Portal Individual Project Page Web Map)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all benchmark results, reported as root-mean-square error (RMSE) values for both energies and forces, obtained with different MLPs for the 10 molecules of the MD17 database.
https://choosealicense.com/licenses/cc0-1.0/
Dataset
QM-22
Description
Includes CHON molecules of 4-15 atoms, developed in counterpoint to the MD17 dataset, run at higher total energies (above 500 K) and with a broader configuration space. Additional details stored in dataset columns prepended with "dataset_".
Dataset authors
Joel M. Bowman, Chen Qu, Riccardo Conte, Apurba Nandi, Paul L. Houston, Qi Yu
Publication
https://doi.org/10.1063/5.0089200
Original data link… See the full description on the dataset page: https://huggingface.co/datasets/colabfit/QM-22.
https://choosealicense.com/licenses/cc0-1.0/
Dataset
MD22 double walled nanotube
Description
Dataset containing MD trajectories of the double-walled nanotube supramolecule from the MD22 benchmark set. MD22 represents a collection of datasets in a benchmark that can be considered an updated version of the MD17 benchmark datasets, including more challenges with respect to system size, flexibility and degree of non-locality. The datasets in MD22 include MD trajectories of the protein Ac-Ala3-NHMe; the lipid DHA… See the full description on the dataset page: https://huggingface.co/datasets/colabfit/MD22_double_walled_nanotube.