QM9 provides quantum chemical properties (at DFT level) for a relevant, consistent, and comprehensive chemical space of small organic molecules. This database may serve the benchmarking of existing methods, development of new methods, such as hybrid quantum mechanics/machine learning, and systematic identification of structure-property relationships.
downloaded from: http://quantum-machine.org/datasets/
Abstract
Computational de novo design of new drugs and materials requires rigorous and unbiased exploration of chemical compound space. However, large uncharted territories persist due to its size scaling combinatorially with molecular size. We report computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of CHONF. These molecules correspond to the subset of all 133,885 species with up to nine heavy atoms (CONF) out of the GDB-17 chemical universe of 166 billion organic molecules. We report geometries minimal in energy, corresponding harmonic frequencies, dipole moments, polarizabilities, along with energies, enthalpies, and free energies of atomization. All properties were calculated at the B3LYP/6-31G(2df,p) level of quantum chemistry. Furthermore, for the predominant stoichiometry, C7H10O2, there are 6,095 constitutional isomers among the 134k molecules. We report energies, enthalpies, and free energies of atomization at the more accurate G4MP2 level of theory for all of them. As such, this data set provides quantum chemical properties for a relevant, consistent, and comprehensive chemical space of small organic molecules. This database may serve the benchmarking of existing methods, development of new methods, such as hybrid quantum mechanics/machine learning, and systematic identification of structure-property relationships.
Download: Available via figshare.
How to cite: When using this dataset, please make sure to cite the following two papers:
L. Ruddigkeit, R. van Deursen, L. C. Blum, J.-L. Reymond, Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17, J. Chem. Inf. Model. 52, 2864–2875, 2012.
R. Ramakrishnan, P. O. Dral, M. Rupp, O. A. von Lilienfeld, Quantum chemistry structures and properties of 134 kilo molecules, Scientific Data 1, 140022, 2014.
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Two types of approaches to modeling molecular systems have demonstrated high practical efficiency. Density functional theory (DFT), the most widely used quantum chemical method, is a physical approach predicting energies and electron densities of molecules. Recently, numerous papers on machine learning (ML) of molecular properties have also been published. ML models greatly outperform DFT in terms of computational costs, and may even reach comparable accuracy, but they are missing physicality - a direct link to Quantum Physics - which limits their applicability. Here, we propose an approach that combines the strong sides of DFT and ML, namely, physicality and low computational cost. We derive general equations for exact electron densities and energies that can naturally guide applications of ML in Quantum Chemistry. Based on these equations, we build a deep neural network that can compute electron densities and energies of a wide range of organic molecules not only much faster, but also closer to exact physical values than current versions of DFT. In particular, we reached a mean absolute error in energies of molecules with up to eight non-hydrogen atoms as low as 0.9 kcal/mol relative to CCSD(T) values, noticeably lower than those of DFT (approaching ~2 kcal/mol) and ML (~1.5 kcal/mol) methods. A simultaneous improvement in the accuracy of predictions of electron densities and energies suggests that the proposed approach describes the physics of molecules better than DFT functionals developed by "human learning" earlier. Thus, physics-based ML offers exciting opportunities for modeling, with high-theory-level quantum chemical accuracy, of much larger molecular systems than currently possible.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Pol Febrer (pol.febrer@icn2.cat, ORCID 0000-0003-0904-2234) Peter Bjorn Jorgensen (peterbjorgensen@gmail.com, ORCID 0000-0003-4404-7276) Arghya Bhowmik (arbh@dtu.dk, ORCID 0000-0003-3198-5116)
The dataset is published as part of the paper: "GRAPH2MAT: UNIVERSAL GRAPH TO MATRIX CONVERSION FOR ELECTRON DENSITY PREDICTION" (https://doi.org/10.26434/chemrxiv-2024-j4g21)
This dataset contains the Hamiltonian, Overlap, Density and Energy Density matrices from SIESTA calculations of the QM9 dataset (https://doi.org/10.6084/m9.figshare.c.978904.v5)
SIESTA 5.0.0 was used to compute the dataset.
The dataset has four directories:
The "runs" directory contains one directory for each run, named with the index of the run. Each directory contains:
- RUN.fdf, geom.fdf: The input files used for the SIESTA calculation.
- RUN.out: The log of the SIESTA run, which apar
- siesta.TSDE: Contains the Density and Energy Density matrices.
- siesta.TSHS: Contains the Hamiltonian and Overlap matrices.
Each matrix can be read using the sisl python package (https://github.com/zerothi/sisl) like:
import sisl
matrix = sisl.get_sile("RUN.fdf").read_X()
where X is hamiltonian, overlap, density_matrix or energy_density_matrix.
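The read_X pattern above can be wrapped in a small helper that checks the matrix name before dispatching to the matching method. This is a minimal sketch: the names `read_matrix` and `load_run` are hypothetical, and `load_run` assumes sisl is installed and that the path points at a RUN.fdf file from one run directory.

```python
VALID_KINDS = ("hamiltonian", "overlap", "density_matrix", "energy_density_matrix")


def read_matrix(sile, kind):
    """Call sile.read_<kind>() for one of the four matrix kinds named above."""
    if kind not in VALID_KINDS:
        raise ValueError(f"unknown matrix kind {kind!r}; expected one of {VALID_KINDS}")
    return getattr(sile, f"read_{kind}")()


def load_run(fdf_path, kinds=VALID_KINDS):
    """Read the requested matrices for one run (hypothetical helper, needs sisl)."""
    import sisl  # imported lazily so read_matrix is usable without sisl installed

    sile = sisl.get_sile(fdf_path)
    return {kind: read_matrix(sile, kind) for kind in kinds}
```

A call such as `load_run("RUN.fdf", kinds=("hamiltonian", "overlap"))` would then return a dictionary keyed by matrix name.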
To reproduce the results presented in the paper, follow the documentation of the graph2mat package (https://github.com/BIG-MAP/graph2mat).
https://doi.org/10.11583/DTU.c.7310005 © 2024 Technical University of Denmark
This dataset is published under the CC BY 4.0 license. This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
This is the official QH9 dataset from the paper 'QH9: A Quantum Hamiltonian Prediction Benchmark for QM9 Molecules'. QH9 is a new quantum Hamiltonian dataset providing precise Hamiltonian matrices for 130,831 stable molecular geometries, based on the QM9 dataset. This record is the QH9Stable dataset, which is used in QH-Stable-iid and QH-Stable-ood.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
QM9 molecules calculated with VASP using Atomic Simulation Environment with the following parameters: Vasp(xc='PBE', istart=0, algo='Normal', icharg=2, nelm=180, ispin=1, nelmdl=6, isym=0, lcorr=True, potim=0.1, nelmin=5, kpts=[1,1,1], ismear=0, ediff=0.1E-05, sigma=0.1, nsw=0, ldiag=True, lreal='Auto', lwave=False, lcharg=True, encut=400)
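For reuse, the parameters listed above can be kept in a plain dictionary and passed to the ASE VASP calculator. This is only a sketch: the values are copied from the description, while `VASP_PARAMS` and `make_calculator` are hypothetical names, and running a calculation additionally assumes ASE plus a configured VASP installation.

```python
# Calculator settings copied from the dataset description above.
VASP_PARAMS = dict(
    xc="PBE", istart=0, algo="Normal", icharg=2, nelm=180, ispin=1,
    nelmdl=6, isym=0, lcorr=True, potim=0.1, nelmin=5, kpts=[1, 1, 1],
    ismear=0, ediff=0.1e-05, sigma=0.1, nsw=0, ldiag=True, lreal="Auto",
    lwave=False, lcharg=True, encut=400,
)


def make_calculator(**overrides):
    """Build an ASE Vasp calculator from VASP_PARAMS (hypothetical helper, needs ASE)."""
    from ase.calculators.vasp import Vasp  # lazy import: ASE is an optional dependency

    params = {**VASP_PARAMS, **overrides}
    return Vasp(**params)
```

Here nsw=0 means no ionic steps are taken, and lcharg=True asks VASP to write the CHGCAR file.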
The resulting CHGCAR files have been compressed with lz4 compression and packed in non-compressed tar archives with up to 1000 structures in each.
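Reading such an archive amounts to walking the uncompressed tar and decompressing each member separately. The sketch below assumes the members are stored in the lz4 frame format and that the third-party `lz4` package is available; `iter_archive` and `lz4_decompress` are hypothetical names, not part of the dataset.

```python
import io
import tarfile


def iter_archive(fileobj, decompress):
    """Yield (member_name, decompressed_bytes) for each regular file in a tar archive."""
    # Mode "r:" opens the tar without any tar-level compression.
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        for member in tar:
            if not member.isfile():
                continue
            raw = tar.extractfile(member).read()
            yield member.name, decompress(raw)


def lz4_decompress(data):
    """Decompress one lz4 frame (assumes the lz4 frame format was used)."""
    import lz4.frame  # third-party package "lz4", imported lazily

    return lz4.frame.decompress(data)
```

Opening an archive in binary mode and looping over `iter_archive(f, lz4_decompress)` would then give the CHGCAR contents one structure at a time, without unpacking everything to disk.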
The datasplits json files contain the indices (0-index) of the train, validation and test sets used in the paper "Graph neural networks for fast electron density estimation of molecules, liquids, and solids"
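Because the split indices are 0-based, they can be applied directly to a Python sequence of structures. The following sketch assumes each datasplits file is a JSON object mapping a split name to a list of integer indices; the key names used in the usage note are assumptions, and all function names are hypothetical.

```python
import json


def load_splits(json_text):
    """Parse a datasplits JSON document into {split_name: sorted 0-based indices}."""
    raw = json.loads(json_text)
    return {name: sorted(int(i) for i in indices) for name, indices in raw.items()}


def check_disjoint(splits):
    """Return True when no index appears in more than one split."""
    seen = set()
    for indices in splits.values():
        current = set(indices)
        if seen & current:
            return False
        seen |= current
    return True


def select(items, indices):
    """Pick entries from a 0-indexed sequence using the 0-based split indices."""
    return [items[i] for i in indices]
```

For example, `select(structures, splits["train"])` would return the training structures if the file uses a "train" key.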
The QM9 molecule structures were obtained from https://doi.org/10.6084/m9.figshare.c.978904.v5
MIT License https://opensource.org/licenses/MIT
All-atom Diffusion Transformers - QM9 dataset
QM9 dataset from the paper "All-atom Diffusion Transformers: Unified generative modelling of molecules and materials", by Chaitanya K. Joshi, Xiang Fu, Yi-Lun Liao, Vahe Gharakhanyan, Benjamin Kurt Miller, Anuroop Sriram*, and Zachary W. Ulissi* from FAIR Chemistry at Meta (* Joint last author). Original data source: https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.QM9.html (Adapted from MoleculeNet)… See the full description on the dataset page: https://huggingface.co/datasets/chaitjo/QM9_ADiT.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
ANI-1E: An equilibrium database from the ANI-1 database v.2.0
Authors: Luis Itza Vazquez-Salazar and Markus Meuwly E-mail contact: litzavazquezs@gmail.com and m.meuwly@unibas.ch
Initial geometries were generated with OpenBabel from the SMILES strings provided by Smith et al. Subsequently, geometries were optimised using PM7 as implemented in MOPAC2016, before a final geometry optimisation and frequency calculation at the ωB97x/6-31G(d) level of theory were performed using Gaussian09. The final results were checked to ensure that they did not contain imaginary frequencies and therefore correspond to a minimum on the potential energy surface, which can be different from the global minimum for the molecule. The total number of molecules is 57455; 7 molecules failed to converge during optimisation. The files are in .xyz format, following the style of the QM9 database, and contain the geometry minimal in energy, rotational constants, dipole moments, polarizabilities, along with energies of HOMO and LUMO, electronic spatial extent, zero-point energy, enthalpies, and free energies of atomisation. The header of the .xyz file follows the format given in Table 3 of the QM9 paper, with the difference that the TAG is 'ANI-1E'. Additionally, a file with the original SMILES of the ANI-1 dataset and the SMILES of ANI-1E is added. The seven molecules (56176, 56177, 56213, 56214, 56215, 56216, 56217) that did not converge are not included in the new database. The .xyz files of their final structures are available in the folder 'failed'. The output files for all optimisations are available upon reasonable request to the authors.
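A QM9-style .xyz file as described above can be parsed with a few lines of Python. The sketch below assumes the QM9 layout: the atom count on line 1, the tag, an index and 15 scalar properties on line 2, then one line per atom with element, x, y, z and a partial charge. The function names are hypothetical, and whether every ANI-1E file matches this layout exactly is an assumption.

```python
# Property order on the second line of a QM9-style file, after the tag and index.
PROPERTY_NAMES = ("A", "B", "C", "mu", "alpha", "homo", "lumo", "gap",
                  "r2", "zpve", "U0", "U", "H", "G", "Cv")


def to_float(token):
    """Convert a number, accepting the '*^' exponent notation found in some QM9 files."""
    return float(token.replace("*^", "e"))


def parse_qm9_xyz(text):
    """Parse the header and atom block of one QM9-style .xyz record."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    header = lines[1].split()
    tag, index = header[0], int(header[1])
    properties = dict(zip(PROPERTY_NAMES, map(to_float, header[2:])))
    atoms = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z, charge = line.split()[:5]
        atoms.append((symbol, to_float(x), to_float(y), to_float(z), to_float(charge)))
    return {"tag": tag, "index": index, "properties": properties, "atoms": atoms}
```

Lines after the atom block (frequencies, SMILES and InChI in QM9) are ignored by this sketch.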
We acknowledge Alfred Andersson and Prof. David van der Spoel for drawing our attention to the problems in the first version of our database.
Data-driven schemes that associate molecular and crystal structures with their microscopic properties share the need for a concise, effective description of the arrangement of their atomic constituents. Many types of models rely on descriptions of atom-centered environments that are associated with an atomic property or with an atomic contribution to an extensive macroscopic quantity. Frameworks in this class can be understood in terms of atom-centered density correlations (ACDC), which are used as a basis for a body-ordered, symmetry-adapted expansion of the targets. Several other schemes, which gather information on the relationship between neighboring atoms using "message-passing" ideas, cannot be directly mapped to correlations centered around a single atom. We generalize the ACDC framework to include multi-centered information, generating representations that provide a complete linear basis to regress symmetric functions of atomic coordinates and a coherent foundation to systematize our understanding of both atom-centered and message-passing, invariant and equivariant machine-learning schemes.
This record contains the data and code required to reproduce the results from the corresponding paper, computing message-passing inspired machine learning features built on top of density correlation. The data used in this article is a subset of other existing datasets, which can be found online:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Datasets from the HEPOM paper. Includes train/test sets from the combined, neutral (qm9+alchemy), protonated, and hydroxylated datasets.
MIT License https://opensource.org/licenses/MIT
This dataset contains the optimized structures and the corresponding molecular Hessian matrices of selected molecules in the QM8, QM9 and PubChem databases. The data is used to train and validate a machine learning model described in the paper titled "Real-time interpretation of neutron vibrational spectra with symmetry-equivariant Hessian matrix prediction". In the tarball file, the three txt files contain the indexes of the molecules in the corresponding datasets. The three folders (for QM8, QM9 and PubChem, respectively) contain the structure (xyz) files and the molecular Hessian (npy) files, which can be loaded with the numpy module in Python.
The shape of each Hessian matrix, H, is (3n, 3n), where n is the number of atoms in the molecule. The multiplier 3 means that each atom has 3 degrees of freedom: x, y and z. The indexes along each row and column are ordered as 1x, 1y, 1z, 2x, 2y, 2z, 3x, 3y, 3z, ..., up to nx, ny, nz. The order from 1 to n corresponds to the order of the atoms in the structure file. The training data can be generated from the data provided here with the script provided in the paper (https://github.com/maplewen4/INS_molecule).
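The index ordering described above maps a 1-based atom number and a Cartesian axis to a row or column of H. A minimal sketch of that mapping follows; the helper names are hypothetical, and the functions accept either nested lists or an array loaded with numpy.

```python
AXES = {"x": 0, "y": 1, "z": 2}


def hessian_index(atom, axis):
    """0-based row/column index for a 1-based atom number and an axis 'x', 'y' or 'z'."""
    return 3 * (atom - 1) + AXES[axis]


def atom_block(hessian, atom_i, atom_j):
    """Return the 3x3 block coupling atom_i and atom_j (both 1-based) as nested lists."""
    rows = range(3 * (atom_i - 1), 3 * atom_i)
    cols = range(3 * (atom_j - 1), 3 * atom_j)
    return [[hessian[r][c] for c in cols] for r in rows]
```

With a numpy array, reshaping H to (n, 3, n, 3) and taking `[i - 1, :, j - 1, :]` gives the same block, since the ordering 1x, 1y, 1z, 2x, ... is row-major in (atom, axis).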
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
AIMEl-DB: Atomic Properties for 44K small organic molecules
This dataset comprises atomic properties of 44K (44 470) molecules selected from the QM9 database. The file names are based on the same indexing system used for QM9.
This dataset includes four types of files:
.com Files: Input files for Gaussian 16. Single-point energy calculations were carried out using the keywords: # B3LYP/6-31G(2df,p) scf=(maxcycle=9999) nosymm output=wfx
.log Files: Output files from the Gaussian 16 calculations with the aforementioned parameters.
.wfx Files: Wave function files from the Gaussian 16 calculations. These files were used as inputs for the QTAIM calculations.
.sumviz Files: Output files from the AIMAll software. The keywords used for the calculations were: aimqb -nogui -scp=false -nproc=8 -naat=4 input.wfx. Each .sumviz file contains more than 30 properties based on the Quantum Theory of Atoms in Molecules (QTAIM).
.csv Files: These files contain the results of an in-house treatment of the .sumviz data. They contain two calculated atomic properties:
1. Total magnitude of the dipole moment, |mu|
2. Total magnitude of the quadrupole moment, |Q|
and two extracted atomic properties:
3. Electronic Population, N
4. Atomic Energy, E
The aimel_merged_44k.csv presents the concatenation of the 44 470 csv Files.
Additionally, the aimel_merged_38k.csv presents the concatenation of the 38 876 csv Files. This file corresponds to version 1.0 of the dataset.
If you find this dataset useful, please cite the original paper:
Meza-González, B., Ramírez-Palma, D.I., Carpio-Martínez, P. et al. Quantum Topological Atomic Properties of 44K molecules. Sci Data 11, 945 (2024). https://doi.org/10.1038/s41597-024-03723-0
Apache License 2.0 https://www.apache.org/licenses/LICENSE-2.0.html
Introduction
This is an official dataset used to develop Vib2Mol. We have established a vibrational spectrum-to-structure benchmark (ViBench, VB), which consists of eight parts: VB-qm9, VB-zinc15, VB-mols, VB-geometry, VB-PAHs, VB-RXN, VB-peptide, and VB-peptide-mod. Details are listed in our paper.
Density functional theory (DFT) was employed to optimize the conformations of these molecules and to calculate the corresponding infrared and Raman spectra. All quantum chemical calculations were carried out using the Gaussian 16 program. The geometries were optimized using the B3LYP-D3BJ functional with a 6-311+G** basis set. Frequency calculations were performed at the same level of theory at the optimized geometries.
Furthermore, to test the model's generalization to experimental spectra, we collected experimentally measured infrared spectra from the public NIST dataset.
To facilitate calculations, the spectral dimensions were unified to 1024, and all molecular structures were represented using SMILES.
Funding
This work was supported by the National Natural Science Foundation of China (Grant Nos. 22227802, 22021001, 22474117, 22272139), the Fundamental Research Funds for the Central Universities (20720220009), and the Shanghai Innovation Institute.