QM9 provides quantum chemical properties (at DFT level) for a relevant, consistent, and comprehensive chemical space of small organic molecules. This database may serve the benchmarking of existing methods, development of new methods, such as hybrid quantum mechanics/machine learning, and systematic identification of structure-property relationships.
QM9 consists of computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of C, H, O, N, and F. As usual, we remove the uncharacterized molecules and provide the remaining 130,831.
To use this dataset:
import tensorflow_datasets as tfds

# Load the training split of QM9 as a tf.data.Dataset.
ds = tfds.load('qm9', split='train')
# Print the first four examples.
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Downloaded from: http://quantum-machine.org/datasets/
Abstract
Computational de novo design of new drugs and materials requires rigorous and unbiased exploration of chemical compound space. However, large uncharted territories persist due to its size scaling combinatorially with molecular size. We report computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of CHONF. These molecules correspond to the subset of all 133,885 species with up to nine heavy atoms (CONF) out of the GDB-17 chemical universe of 166 billion organic molecules. We report geometries minimal in energy, corresponding harmonic frequencies, dipole moments, polarizabilities, along with energies, enthalpies, and free energies of atomization. All properties were calculated at the B3LYP/6-31G(2df,p) level of quantum chemistry. Furthermore, for the predominant stoichiometry, C7H10O2, there are 6,095 constitutional isomers among the 134k molecules. We report energies, enthalpies, and free energies of atomization at the more accurate G4MP2 level of theory for all of them. As such, this data set provides quantum chemical properties for a relevant, consistent, and comprehensive chemical space of small organic molecules. This database may serve the benchmarking of existing methods, development of new methods, such as hybrid quantum mechanics/machine learning, and systematic identification of structure-property relationships.
Download: Available via figshare.
How to cite: When using this dataset, please make sure to cite the following two papers:
L. Ruddigkeit, R. van Deursen, L. C. Blum, J.-L. Reymond, Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17, J. Chem. Inf. Model. 52, 2864–2875, 2012.
R. Ramakrishnan, P. O. Dral, M. Rupp, O. A. von Lilienfeld, Quantum chemistry structures and properties of 134 kilo molecules, Scientific Data 1, 140022, 2014.
HR-machine/QM9-Dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Revised QM9 dataset with properties calculated using aPBE0 in the cc-pVTZ basis set.
The atomic coordinates, atomic numbers, chemical symbols, total energies, atomization energies, MO energies, HOMOs, LUMOs, and dipole moment norms are in the arrays "coords", "charges", "elements", "energies", "atomization", "moenergies", "homo", "lumo", and "dipole", respectively.
Density matrices will be uploaded soon.
Usage example:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset of HOMO/LUMO energies of the QM9 dataset computed at GW level of theory.
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Two types of approaches to modeling molecular systems have demonstrated high practical efficiency. Density functional theory (DFT), the most widely used quantum chemical method, is a physical approach predicting energies and electron densities of molecules. Recently, numerous papers on machine learning (ML) of molecular properties have also been published. ML models greatly outperform DFT in terms of computational costs, and may even reach comparable accuracy, but they are missing physicality - a direct link to Quantum Physics - which limits their applicability. Here, we propose an approach that combines the strong sides of DFT and ML, namely, physicality and low computational cost. We derive general equations for exact electron densities and energies that can naturally guide applications of ML in Quantum Chemistry. Based on these equations, we build a deep neural network that can compute electron densities and energies of a wide range of organic molecules not only much faster, but also closer to exact physical values than current versions of DFT. In particular, we reached a mean absolute error in energies of molecules with up to eight non-hydrogen atoms as low as 0.9 kcal/mol relative to CCSD(T) values, noticeably lower than those of DFT (approaching ~2 kcal/mol) and ML (~1.5 kcal/mol) methods. A simultaneous improvement in the accuracy of predictions of electron densities and energies suggests that the proposed approach describes the physics of molecules better than DFT functionals developed by "human learning" earlier. Thus, physics-based ML offers exciting opportunities for modeling, with high-theory-level quantum chemical accuracy, of much larger molecular systems than currently possible.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The classification of molecules according to ClassyFire [1] for the QM9 dataset [2].
The QM9 dataset is a set of about 134k organic molecules with no more than nine heavy atoms (C, N, O, and F), each optimized to a stable structure with DFT.
ClassyFire is a tool and taxonomic library for the labeling of molecules.
1. Djoumbou Feunang, Y. et al. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J. Cheminform. 8, 1–20 (2016).
2. Ramakrishnan, R., Dral, P. O., Rupp, M. & Von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data 1, 1–7 (2014).
The data directory ('QM9_jsons_classified.tar.gz') contains a `json` file for each structure in the QM9 dataset. The name of the file is the same identifier as from QM9. Data fields include:
- `cf_alternative_parents` : classifications describing the compound that do not fall in the given ancestry
- `cf_ancestors` : classes along the taxonomic branch for the structure
- `cf_class` : ClassyFire given class
- `cf_superclass` : ClassyFire given super class
- `cf_subclass` : ClassyFire given subclass
- `cf_direct_parent` : Class one level above this structure on the taxonomic branch
- `cf_description` : Exposition on the given class
- `cf_identifier` : identifier for the structure in the ClassyFire database
- `cf_intermediate_nodes` : classes connecting branches on taxonomic tree
- `cf_kingdom` : ClassyFire given kingdom
- `cf_molecular_framework` : describes aromaticity and number of cycles
- `cf_predicted_chebi_terms` : terms describing the molecule in the ChEBI framework
- `cf_predicted_lipidmaps_terms` : terms describing the molecule in LIPID MAPS framework
- `cf_smiles` : smiles string given by ClassyFire
- `cf_substituents` : substituent groups in the structure
Many fields contain subfields, seen in the example below for molecule with QM9 id 000123:
{"cf_alternative_parents":[{"name":"Dialkylamines","description":"Organic compounds containing a dialkylamine group, characterized by two alkyl groups bonded to the amino nitrogen.","chemont_id":"CHEMONTID:0002228","url":"http:\/\/classyfire.wishartlab.com\/tax_nodes\/C0002228"},{"name":"Organopnictogen compounds","description":"Compounds containing a bond between carbon a pnictogen atom. Pnictogens are p-block element atoms that are in the group 15 of the periodic table.","chemont_id":"CHEMONTID:0004557","url":"http:\/\/classyfire.wishartlab.com\/tax_nodes\/C0004557"},{"name":"Hydrocarbon derivatives","description":"Derivatives of hydrocarbons obtained by substituting one or more carbon atoms by an heteroatom. They contain at least one carbon atom and heteroatom.","chemont_id":"CHEMONTID:0004150","url":"http:\/\/classyfire.wishartlab.com\/tax_nodes\/C0004150"}],"cf_ancestors":["Alpha-aminonitriles","Amines","Chemical entities","Dialkylamines","Hydrocarbon derivatives","Nitriles","Organic compounds","Organic cyanides","Organic nitrogen compounds","Organonitrogen compounds","Organopnictogen compounds","Secondary amines"],"cf_class":"Organonitrogen compounds","cf_classification_version":"2.1","cf_description":"This compound belongs to the class of organic compounds known as alpha-aminonitriles. These are organonitrogen compounds that contain an amino group located on the carbon at the position alpha to a carbonitrile group. They have the general formula RC(NH2)C#N, where the amine group can be substituted.","cf_direct_parent":{"name":"Alpha-aminonitriles","description":"Organonitrogen compounds that contain an amino group located on the carbon at the position alpha to a carbonitrile group. They have the general formula RC(NH2)C#N, where the amine group can be substituted.","chemont_id":"CHEMONTID:0004453","url":"http:\/\/classyfire.wishartlab.com\/tax_nodes\/C0004453"},"cf_external_descriptors":[],"cf_identifier":"Q5198051-1","cf_inchikey":"InChIKey=PVVRRUUMHFWFQV-UHFFFAOYSA-N","cf_intermediate_nodes":[{"name":"Nitriles","description":"Compounds having the structure RC#N; thus C-substituted derivatives of hydrocyanic acid, HC#N.","chemont_id":"CHEMONTID:0000362","url":"http:\/\/classyfire.wishartlab.com\/tax_nodes\/C0000362"}],"cf_kingdom":"Organic compounds","cf_molecular_framework":"Aliphatic acyclic compounds","cf_predicted_chebi_terms":["chemical entity (CHEBI:24431)","organic molecular entity (CHEBI:50860)","organonitrogen compound (CHEBI:35352)","secondary amino compound (CHEBI:50995)","nitrile (CHEBI:18379)","amine (CHEBI:32952)","secondary amine (CHEBI:32863)","cyanides (CHEBI:23424)","organic molecule (CHEBI:72695)","pnictogen molecular entity (CHEBI:33302)","nitrogen molecular entity (CHEBI:51143)"],"cf_predicted_lipidmaps_terms":[],"cf_smiles":"CNCC#N","cf_subclass":"Organic cyanides","cf_substituents":["Alpha-aminonitrile","Secondary amine","Secondary aliphatic amine","Organopnictogen compound","Hydrocarbon derivative","Amine","Aliphatic acyclic compound"],"cf_superclass":"Organic nitrogen compounds"}
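A record like the one above can be read with Python's standard json module. The following is a minimal sketch rather than part of the dataset's tooling: it embeds a shortened copy of the record for id 000123 as a string (a real workflow would open the matching file from the archive instead) and shows that some fields are plain strings while others are objects or lists of objects with subfields such as name and chemont_id.

```python
import json

# Shortened copy of the record for QM9 id 000123; descriptions and URLs omitted.
record_text = """
{
  "cf_kingdom": "Organic compounds",
  "cf_superclass": "Organic nitrogen compounds",
  "cf_class": "Organonitrogen compounds",
  "cf_subclass": "Organic cyanides",
  "cf_direct_parent": {"name": "Alpha-aminonitriles",
                       "chemont_id": "CHEMONTID:0004453"},
  "cf_alternative_parents": [
    {"name": "Dialkylamines", "chemont_id": "CHEMONTID:0002228"},
    {"name": "Organopnictogen compounds", "chemont_id": "CHEMONTID:0004557"},
    {"name": "Hydrocarbon derivatives", "chemont_id": "CHEMONTID:0004150"}
  ],
  "cf_smiles": "CNCC#N"
}
"""
record = json.loads(record_text)

# Plain string fields: the taxonomic lineage from kingdom down to subclass.
lineage = [record[key] for key in
           ("cf_kingdom", "cf_superclass", "cf_class", "cf_subclass")]

# Object field: the direct parent carries its own subfields.
direct_parent = record["cf_direct_parent"]["name"]

# List-of-objects field: collect only the names.
alternative_parents = [p["name"] for p in record["cf_alternative_parents"]]

print(" > ".join(lineage))
print(direct_parent, alternative_parents)
```

Accessing subfields by key in this way keeps the code independent of the order in which ClassyFire writes the fields.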
A visualization, "qm9_pie_labeled.png", shows the breakdown of superclasses within QM9 down to the subclass level.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We constructed the QM9Spectra (QM9S) dataset using 130K organic molecules based on the popular QM9 dataset. We first re-optimized the molecular geometries using the Gaussian16 package (B.01 version) at the B3LYP/def-TZVP level of theory. Then the molecular properties, including scalars (energy, NPA charges, etc.), vectors (electric dipole, etc.), 2nd-order tensors (Hessian matrix, quadrupole moment, polarizability, etc.), and 3rd-order tensors (octupole moment, first hyperpolarizability, etc.), were calculated at the same level. Frequency analysis and time-dependent density functional theory (TD-DFT) calculations were carried out at the same level to obtain the infrared, Raman, and UV-Vis spectra. Two versions of the dataset, .pt (torch_geometric version) and .csv, are provided for training and use. In addition, we also provide broadened spectra. When using this dataset, please cite the original article's DOI, https://doi.org/10.1038/s43588-023-00550-y, instead of the DOI provided by figshare.
nimashoghi/qm9 dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This file lists the indices of QM9 molecules that are absent from both our molecular and reaction datasets. These molecules were excluded because their coordinates could not be converted to SMILES strings.
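A list of this kind is typically used to filter QM9 indices before matching them against the other files of the collection. The sketch below assumes, hypothetically, that the indices are stored as whitespace-separated integers; the actual file layout is not described here, and the index values in the example are invented for illustration.

```python
def parse_indices(text):
    """Parse whitespace-separated integers into a set (assumed file layout)."""
    return {int(token) for token in text.split()}

def keep_usable(all_indices, excluded):
    """Return the indices that are not in the excluded set, preserving order."""
    return [index for index in all_indices if index not in excluded]

# Invented content standing in for the excluded-index file.
excluded = parse_indices("3 7\n12\n")

# Invented range standing in for the full list of QM9 indices.
usable = keep_usable(range(1, 11), excluded)
print(usable)
```

Holding the excluded indices in a set keeps each membership check constant-time, which matters when the filter runs over all of QM9.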
This item is part of the collection MultiXC-QM9 with DOI: 10.11583/DTU.c.6185986
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This provides a curated hdf5 file for a subset of the QM9 dataset to be used for testing purposes, designed to be compatible with modelforge, an infrastructure to implement and train NNPs. This test dataset contains 1000 configurations, 1 for each unique system.
When applicable, the units of properties are provided in the datafile, encoded as strings compatible with the openff-units package. For more information about the structure of the data file, please see the following:
The QM9 dataset includes 133,885 organic molecules with up to nine total heavy atoms (C, O, N, or F; excluding H), originally published by Ramakrishnan et al. Properties in the QM9 dataset were calculated at the B3LYP/6-31G(2df,p) level of quantum chemistry.
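The heavy-atom criterion can be checked directly from a molecular formula. The helper below is an illustrative sketch with a hypothetical function name (it is not part of modelforge or QM9); it counts every non-hydrogen atom in a flat formula without parentheses or charges, using C7H10O2, the most common stoichiometry in QM9, as the example.

```python
import re

def heavy_atom_count(formula):
    """Count non-hydrogen atoms in a flat formula such as 'C7H10O2'."""
    total = 0
    # Each match is an element symbol followed by an optional count.
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol != "H":
            total += int(count) if count else 1
    return total

print(heavy_atom_count("C7H10O2"))  # 7 C + 2 O = 9 heavy atoms
print(heavy_atom_count("CH4"))      # 1 heavy atom
```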
Original publication:
Source dataset, released with CC0 1.0 Universal license:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database for training graph neural network (GNN) models in "Integrating Explainability into Graph Neural Network Models for the Prediction of X-ray Absorption Spectra", by Amir Kotobi, Kanishka Singh, Daniel Höche, Sadia Bari, Robert H. Meißner, and Annika Bande.
Included:
qm9_Cedge_xas_56k.npz: the TDDFT XAS spectra of 56k structures from the QM9 dataset, which were used to label the graph dataset. The file contains two key/value entries: spec_stk, a 2D array containing the energies and oscillator strengths of the XAS spectra, and id, which holds the indices of the QM9 structures. This data was used to create the QM9-XAS graph dataset.
qm9xas_orca_output.zip: the raw ORCA output of TDDFT calculations for the 56k QM9-XAS dataset consists of excitation energies, densities, molecular orbitals, and other relevant information. This unprocessed output serves as a source to derive ground truth data for explaining the predictions made by GNNs.
qm9xas_spec_train_val.pt: processed graph train/validation dataset of 50k QM9 structures. It is used as input to GNN models for training and validation.
qm9xas_spec_test.pt: processed graph test dataset of 6k QM9 structures. It is used to test the performance of trained GNN models.
Notes on the datasets:
The QM9-XAS dataset was created using ORCA electronic structure package [Neese, F., WIREs Computational Molecular Science 2012, 2, 73–78] to calculate carbon K-edge XAS spectra with the time-dependent density functional theory (TDDFT) method [Petersilka, M.; Gossmann, U. J.; Gross, E. K. U., Phys. Rev. Lett. 1996, 76, 1212–1215]
The molecular structures of QM9-XAS datasets were sourced from the QM9 database [R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. Von Lilienfeld, Sci. Data 1, 1 (2014)].
Funding:
This research was funded by HIDA Trainee Network program, HAICU, Helmholtz AI-4-XAS, DASHH and HEIBRiDS graduate schools. For theoretical calculations and model training, computational resources at DESY and JFZ were used.
qm9 quantum. Visit https://dataone.org/datasets/sha256%3A09d247add3aca18d56b63d1834642e99243abfc143ff716a713c4a88e8bf59c5 for complete metadata about this dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
QM9x is a dataset that contains DFT calculations of energy and forces for all configurations in QM9, recalculated with the ωB97x functional and the 6-31G(d) basis set. Recalculating the energy and forces slightly shifts the potential energy surface, which results in nonzero forces acting on most configurations in the dataset.
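Why a change of functional and basis set produces forces on geometries that were already optimized can be seen in a one-dimensional toy model. The sketch below is not taken from QM9x: it assumes two harmonic potentials with the same force constant whose minima differ by a small offset, uses made-up parameter values, and evaluates the force F = -dE/dx by central finite differences.

```python
def energy_a(x, k=1.0, x0=1.0):
    """Toy potential standing in for the original level of theory."""
    return 0.5 * k * (x - x0) ** 2

def energy_b(x, k=1.0, x0=1.0, shift=0.02):
    """Same toy potential with its minimum displaced by `shift`."""
    return 0.5 * k * (x - (x0 + shift)) ** 2

def force(energy, x, h=1e-5):
    """Force as the negative central finite-difference derivative."""
    return -(energy(x + h) - energy(x - h)) / (2 * h)

x_min_a = 1.0  # geometry that minimizes energy_a

# Zero (up to rounding) on the surface the geometry was optimized on.
print(force(energy_a, x_min_a))
# Close to k * shift = 0.02 on the shifted surface.
print(force(energy_b, x_min_a))
```

The geometry itself is unchanged; only the surface it is evaluated on has moved, so the residual force points toward the new minimum.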
The choice of basis set and functional makes the QM9x compatible with the Transition1x and the ANI1x dataset.
See https://arxiv.org/abs/2207.12858 for a comparison between ANI1x, QM9x, and Transition1x.
Dataloaders and example scripts are available at https://gitlab.com/matschreiner/QM9x
Please cite as Transition1x - Force and Energy Calculations of Millions of Near-Transition State Molecular Configurations,
and the original QM9 dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pol Febrer (pol.febrer@icn2.cat, ORCID 0000-0003-0904-2234) Peter Bjorn Jorgensen (peterbjorgensen@gmail.com, ORCID 0000-0003-4404-7276) Arghya Bhowmik (arbh@dtu.dk, ORCID 0000-0003-3198-5116)
The dataset is published as part of the paper: "GRAPH2MAT: UNIVERSAL GRAPH TO MATRIX CONVERSION FOR ELECTRON DENSITY PREDICTION" (https://doi.org/10.26434/chemrxiv-2024-j4g21)
This dataset contains the Hamiltonian, Overlap, Density and Energy Density matrices from SIESTA calculations of the QM9 dataset (https://doi.org/10.6084/m9.figshare.c.978904.v5)
SIESTA 5.0.0 was used to compute the dataset.
The dataset has four directories:
The "runs" directory contains one directory for each run, named with the index of the run. Each directory contains:
- RUN.fdf, geom.fdf: The input files used for the SIESTA calculation.
- RUN.out: The log of the SIESTA run, which apar
- siesta.TSDE: Contains the Density and Energy Density matrices.
- siesta.TSHS: Contains the Hamiltonian and Overlap matrices.
Each matrix can be read using the sisl python package (https://github.com/zerothi/sisl) like:
import sisl
matrix = sisl.get_sile("RUN.fdf").read_X()
where X is hamiltonian, overlap, density_matrix or energy_density_matrix.
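Because X in read_X is a placeholder, the call can be wrapped so that the matrix kind is selected by name. The sketch below is a hypothetical convenience wrapper, not part of sisl or graph2mat: it builds the method name from the four kinds listed above and looks it up on the object returned by sisl.get_sile. sisl is imported inside the function so that the name check can be used on its own.

```python
MATRIX_KINDS = ("hamiltonian", "overlap", "density_matrix", "energy_density_matrix")

def reader_name(kind):
    """Map a matrix kind to the corresponding read_* method name."""
    if kind not in MATRIX_KINDS:
        raise ValueError(f"unknown matrix kind {kind!r}; expected one of {MATRIX_KINDS}")
    return "read_" + kind

def read_matrix(fdf_path, kind):
    """Read one matrix from a SIESTA run, e.g. read_matrix('RUN.fdf', 'overlap')."""
    import sisl  # imported lazily; needed only when a file is actually read
    sile = sisl.get_sile(fdf_path)
    return getattr(sile, reader_name(kind))()

print(reader_name("density_matrix"))
```

Validating the kind before touching the file turns a typo into an immediate ValueError instead of an AttributeError raised from inside sisl.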
To reproduce the results presented in the paper, follow the documentation of the graph2mat package (https://github.com/BIG-MAP/graph2mat).
https://doi.org/10.11583/DTU.c.7310005 © 2024 Technical University of Denmark
This dataset is published under the CC BY 4.0 license. This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Compressed Python pickle file for the QM9 dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cite this dataset
Ramakrishnan, R., Dral, P. O., Rupp, M., and Lilienfeld, O. A. JARVIS-QM9-DGL. ColabFit, 2023. https://doi.org/10.60732/403cd4f2
View on the ColabFit Exchange
https://materials.colabfit.org/id/DS_tat5i46x3hkr_0
Dataset Name
JARVIS-QM9-DGL
Description
The JARVIS-QM9-DGL dataset is part of the joint automated repository for various integrated simulations (JARVIS) database. This dataset contains configurations from the QM9 dataset… See the full description on the dataset page: https://huggingface.co/datasets/colabfit/JARVIS-QM9-DGL.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
All-atom Diffusion Transformers - QM9 dataset
QM9 dataset from the paper "All-atom Diffusion Transformers: Unified generative modelling of molecules and materials", by Chaitanya K. Joshi, Xiang Fu, Yi-Lun Liao, Vahe Gharakhanyan, Benjamin Kurt Miller, Anuroop Sriram*, and Zachary W. Ulissi* from FAIR Chemistry at Meta (* Joint last author). Original data source: https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.QM9.html (Adapted from MoleculeNet)… See the full description on the dataset page: https://huggingface.co/datasets/chaitjo/QM9_ADiT.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
JSON file of a Python dictionary. Key: SMILES, value: dict {'HAC' # number of heavy atoms, 'swscore_ChEMBL' # % of ECFP4 of the molecule that belong to ChEMBL, 'swscore_ZINC' # % of ECFP4 of the molecule that belong to ZINC or ChEMBL}. Only neutral singlet molecules without any atomic charges (formal or real), composed of H, C, N, O, and F, are included.
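A dictionary with this layout can be loaded and filtered with the standard json module. The sketch below uses three invented entries (the scores are not real data) and assumes the scores are stored as percentages between 0 and 100; the helper name select is hypothetical.

```python
import json

# Invented entries in the layout described above; the scores are not real data.
text = """
{
  "CCO":    {"HAC": 3, "swscore_ChEMBL": 100.0, "swscore_ZINC": 100.0},
  "CNCC#N": {"HAC": 5, "swscore_ChEMBL": 80.0,  "swscore_ZINC": 90.0},
  "C1CC1F": {"HAC": 4, "swscore_ChEMBL": 40.0,  "swscore_ZINC": 60.0}
}
"""
data = json.loads(text)

def select(entries, key, minimum):
    """Return the sorted SMILES whose value for `key` is at least `minimum`."""
    return sorted(smiles for smiles, props in entries.items()
                  if props[key] >= minimum)

well_covered = select(data, "swscore_ChEMBL", 75.0)
print(well_covered)
```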