QM9 consists of computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of C, H, O, N, and F. As usual, we remove the uncharacterized molecules and provide the remaining 130,831.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('qm9', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Overview
Hessian QM9 is the first database of equilibrium configurations and numerical Hessian matrices, consisting of 41,645 molecules from the QM9 dataset at the $\omega$B97x/6-31G* level. Molecular Hessians were calculated in vacuum, as well as in water, tetrahydrofuran, and toluene using an implicit solvation model.
A pre-print article associated with this dataset is available here.
Data records
The dataset is stored in Hugging Face's dataset format. For each of the four implicit solvent environments (vacuum, THF, toluene, and water), the data is divided into separate datasets containing vibrational analysis of 41,645 optimized geometries. Labels are associated with the QM9 molecule labelling system given by Ramakrishnan et al.
Please note that only molecules containing H, C, N, and O were considered. Fluorine-containing molecules were excluded because the QM9 dataset holds too few of them to build a good description of the chemical environment of fluorine atoms. Including these molecules might have reduced the overall precision of any models trained on our data.
Load the dataset:
Use the following Python script to load the dataset dictionary:
from datasets import load_from_disk
dataset = load_from_disk(root_directory)  # root_directory: path to the downloaded dataset folder
print(dataset)
Expected output:
DatasetDict({
    vacuum: Dataset({
        features: ['energy', 'positions', 'atomic_numbers', 'forces', 'frequencies', 'normal_modes', 'hessian', 'label'],
        num_rows: 41645
    }),
    thf: Dataset({
        features: ['energy', 'positions', 'atomic_numbers', 'forces', 'frequencies', 'normal_modes', 'hessian', 'label'],
        num_rows: 41645
    }),
    toluene: Dataset({
        features: ['energy', 'positions', 'atomic_numbers', 'forces', 'frequencies', 'normal_modes', 'hessian', 'label'],
        num_rows: 41645
    }),
    water: Dataset({
        features: ['energy', 'positions', 'atomic_numbers', 'forces', 'frequencies', 'normal_modes', 'hessian', 'label'],
        num_rows: 41645
    })
})
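The 'hessian', 'atomic_numbers', and 'frequencies' features are related through the standard harmonic analysis: mass-weight the Hessian, diagonalize it, and take the square roots of the eigenvalues. The sketch below illustrates that relation on a toy two-atom system. It is not the authors' code. The units (eV/Å^2 for the Hessian, cm^-1 for the result), the rounded atomic masses, and the helper name harmonic_wavenumbers are assumptions; the dataset's actual units and array layout should be checked before comparing against the stored frequencies. Translations and rotations are not projected out here.

```python
import numpy as np

# Unit conversion: sqrt(eV / (Angstrom^2 * amu)) is an angular frequency in rad/s;
# dividing by 2*pi*c (c in cm/s) gives a wavenumber in cm^-1.
_EV = 1.602176634e-19        # J
_AMU = 1.66053906660e-27     # kg
_C_CM_PER_S = 2.99792458e10  # cm/s
TO_WAVENUMBER = np.sqrt(_EV / (1e-20 * _AMU)) / (2.0 * np.pi * _C_CM_PER_S)

# Rounded standard atomic masses (amu) for the elements present in the dataset.
ATOMIC_MASS = {1: 1.008, 6: 12.011, 7: 14.007, 8: 15.999}


def harmonic_wavenumbers(hessian, atomic_numbers):
    """Harmonic wavenumbers (cm^-1) from a Cartesian Hessian assumed to be in eV/Angstrom^2.

    Negative eigenvalues (imaginary modes) are returned as negative numbers.
    """
    masses = np.array([ATOMIC_MASS[int(z)] for z in atomic_numbers])
    n = len(masses)
    hessian = np.asarray(hessian, dtype=float).reshape(3 * n, 3 * n)
    inv_sqrt_m = np.repeat(1.0 / np.sqrt(masses), 3)
    mass_weighted = hessian * np.outer(inv_sqrt_m, inv_sqrt_m)
    # Symmetrize to suppress finite-difference noise before diagonalizing.
    eigvals = np.linalg.eigvalsh(0.5 * (mass_weighted + mass_weighted.T))
    return np.sign(eigvals) * np.sqrt(np.abs(eigvals)) * TO_WAVENUMBER


# Toy check: two hydrogen-mass atoms joined by a spring of stiffness k along x.
k = 10.0  # eV/Angstrom^2
toy_hessian = np.zeros((6, 6))
toy_hessian[0, 0] = toy_hessian[3, 3] = k
toy_hessian[0, 3] = toy_hessian[3, 0] = -k
freqs = harmonic_wavenumbers(toy_hessian, [1, 1])
print(freqs[-1])  # the single stretching mode; the other five are (near) zero
```

For a real record, something like harmonic_wavenumbers(record['hessian'], record['atomic_numbers']) would be the analogous call, provided the stored Hessian can be reshaped to 3N x 3N and uses the assumed units.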
DFT Methods
All DFT calculations were carried out using the NWChem software package. The density functional used was $\omega$B97x with a 6-31G* basis set to create data compatible with the ANI-1/ANI-1x/ANI-2x datasets. The self-consistent field (SCF) cycle was converged when changes in total energy and density were less than 1e-6 eV. All molecules in the set are neutral with a multiplicity of 1. The Mura-Knowles radial quadrature and Lebedev angular quadrature were used in the integration. Structures were optimized in vacuum and three solvents (tetrahydrofuran, toluene, and water) using an implicit solvation model.
The Hessian matrices, vibrational frequencies, and normal modes were computed for a subset of 41,645 molecular geometries using the finite differences method.
Example model weights
An example model trained on Hessian data is included in this dataset. Full details of the model will be provided in an upcoming publication. The model is an E(3)-equivariant graph neural network using the e3x package with specific architecture details. To load the model weights, use:
import jax.numpy as jnp
params = jnp.load('params_train_f128_i5_b16.npz', allow_pickle=True)['params'].item()
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Revised QM9 dataset with properties calculated using aPBE0 in the cc-pVTZ basis set.
The atomic coordinates, atomic numbers, chemical symbols, total energies, atomization energies, MO energies, HOMO energies, LUMO energies, and dipole moment norms are stored in the arrays "coords", "charges", "elements", "energies", "atomization", "moenergies", "homo", "lumo", and "dipole", respectively.
Density matrices will be uploaded soon.
Usage example :
HR-machine/QM9-Dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "QM9"
QM9 dataset from Ruddigkeit et al., 2012; Ramakrishnan et al., 2014. Original data downloaded from: http://quantum-machine.org/datasets. Additional annotations (QED, logP, SA score, NP score, bond and ring counts) were added using the RDKit library.
Quick start usage:
from datasets import load_dataset
ds = load_dataset("yairschiff/qm9")
test_size = 0.1 seed = 1… See the full description on the dataset page: https://huggingface.co/datasets/yairschiff/qm9.
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Two types of approaches to modeling molecular systems have demonstrated high practical efficiency. Density functional theory (DFT), the most widely used quantum chemical method, is a physical approach predicting energies and electron densities of molecules. Recently, numerous papers on machine learning (ML) of molecular properties have also been published. ML models greatly outperform DFT in terms of computational costs, and may even reach comparable accuracy, but they are missing physicality - a direct link to Quantum Physics - which limits their applicability. Here, we propose an approach that combines the strong sides of DFT and ML, namely, physicality and low computational cost. We derive general equations for exact electron densities and energies that can naturally guide applications of ML in Quantum Chemistry. Based on these equations, we build a deep neural network that can compute electron densities and energies of a wide range of organic molecules not only much faster, but also closer to exact physical values than current versions of DFT. In particular, we reached a mean absolute error in energies of molecules with up to eight non-hydrogen atoms as low as 0.9 kcal/mol relative to CCSD(T) values, noticeably lower than those of DFT (approaching ~2 kcal/mol) and ML (~1.5 kcal/mol) methods. A simultaneous improvement in the accuracy of predictions of electron densities and energies suggests that the proposed approach describes the physics of molecules better than DFT functionals developed by "human learning" earlier. Thus, physics-based ML offers exciting opportunities for modeling, with high-theory-level quantum chemical accuracy, of much larger molecular systems than currently possible.
Sinitskiy, A. V., & Pande, V. S. Deep Neural Network Computes Electron Densities and Energies of a Large Set of Organic Molecules Faster than Density Functional Theory (DFT). arXiv:1809.02723 (2018). Available at https://arxiv.org/abs/1809.02723
Sinitskiy, A. V., & Pande, V. S. Physical machine learning outperforms "human learning" in Quantum Chemistry. arXiv:1908.00971 (2019). Available at https://arxiv.org/abs/1908.00971
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset of HOMO/LUMO energies for the molecules of the QM9 dataset, computed at the GW level of theory.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We constructed the QM9Spectra (QM9S) dataset from 130K organic molecules based on the popular QM9 dataset. We first re-optimized the molecular geometries using the Gaussian16 package (B.01 version) at the B3LYP/def-TZVP level of theory. The molecular properties, including scalars (energy, NPA charges, etc.), vectors (electric dipole, etc.), 2nd order tensors (Hessian matrix, quadrupole moment, polarizability, etc.), and 3rd order tensors (octupole moment, first hyperpolarizability, etc.), were then calculated at the same level. Frequency analysis and time-dependent density functional theory (TD-DFT) calculations were carried out at the same level to obtain the infrared, Raman, and UV-Vis spectra.
Two versions of the dataset, .pt (torch_geometric version) and .csv, are provided for training and use. In addition, we also provide broadened spectra.
When using this dataset, please cite the original article (DOI: https://doi.org/10.1038/s43588-023-00550-y) instead of the DOI provided by figshare.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The JARVIS-QM9-DGL dataset is part of the joint automated repository for various integrated simulations (JARVIS) database. This dataset contains configurations from the QM9 dataset, originally created as part of the datasets at quantum-machine.org, as implemented with the Deep Graph Library (DGL) Python package. Units for r2 (electronic spatial extent) are a0^2; for alpha (isotropic polarizability), a0^3; for mu (dipole moment), D; for Cv (heat capacity), cal/mol K. Units for all other properties are eV. JARVIS is a set of tools and collected datasets built to meet current materials design challenges.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
liuganghuggingface/QM9 dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database for training graph neural network (GNN) models in "Integrating Explainability into Graph Neural Network Models for the Prediction of X-ray Absorption Spectra", by Amir Kotobi, Kanishka Singh, Daniel Höche, Sadia Bari, Robert H. Meißner, and Annika Bande.
Included:
qm9_Cedge_xas_56k.npz: the TDDFT XAS spectra of 56k structures from the QM9 dataset, which were used to label the graph dataset. The file contains two key/value entries: spec_stk, a 2D array containing the energies and oscillator strengths of the XAS spectra, and id, which holds the indices of the QM9 structures. This data was used to create the QM9-XAS graph dataset.
qm9xas_orca_output.zip: the raw ORCA output of the TDDFT calculations for the 56k QM9-XAS dataset, consisting of excitation energies, densities, molecular orbitals, and other relevant information. This unprocessed output serves as a source from which to derive ground truth data for explaining the predictions made by GNNs.
qm9xas_spec_train_val.pt: processed graph train/validation dataset of 50k QM9 structures. It is used as input to GNN models for training and validation.
qm9xas_spec_test.pt: processed graph test dataset of 6k QM9 structures. It is used to test the performance of trained GNN models.
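As a rough illustration of how the labels file described above might be read, the sketch below opens an .npz archive with NumPy and returns its spec_stk and id arrays. Only the two key names come from the description. The helper name load_xas_labels and the shapes of the synthetic stand-in file are assumptions, used so that the snippet runs without the real qm9_Cedge_xas_56k.npz. If the real arrays were saved as Python objects, np.load may additionally need allow_pickle=True.

```python
import os
import tempfile

import numpy as np


def load_xas_labels(path):
    """Return (spec_stk, ids) from an .npz archive holding 'spec_stk' and 'id' entries."""
    with np.load(path) as data:
        spec_stk = data["spec_stk"]  # 2D array: energies and oscillator strengths
        ids = data["id"]             # indices of the corresponding QM9 structures
    return spec_stk, ids


# Synthetic stand-in for the real file (hypothetical shapes and values).
demo_path = os.path.join(tempfile.mkdtemp(), "demo_xas.npz")
np.savez(demo_path, spec_stk=np.zeros((3, 100)), id=np.array([1, 2, 3]))

spec_stk, ids = load_xas_labels(demo_path)
print(spec_stk.shape, ids.shape)
```

With the real archive, the call would simply be load_xas_labels("qm9_Cedge_xas_56k.npz").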
Notes on the datasets:
The QM9-XAS dataset was created using the ORCA electronic structure package [Neese, F., WIREs Computational Molecular Science 2012, 2, 73–78] to calculate carbon K-edge XAS spectra with the time-dependent density functional theory (TDDFT) method [Petersilka, M.; Gossmann, U. J.; Gross, E. K. U., Phys. Rev. Lett. 1996, 76, 1212–1215].
The molecular structures of QM9-XAS datasets were sourced from the QM9 database [R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. Von Lilienfeld, Sci. Data 1, 1 (2014)].
Funding:
This research was funded by the HIDA Trainee Network program, HAICU, Helmholtz AI-4-XAS, and the DASHH and HEIBRiDS graduate schools. For theoretical calculations and model training, computational resources at DESY and JFZ were used.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pol Febrer (pol.febrer@icn2.cat, ORCID 0000-0003-0904-2234) Peter Bjorn Jorgensen (peterbjorgensen@gmail.com, ORCID 0000-0003-4404-7276) Arghya Bhowmik (arbh@dtu.dk, ORCID 0000-0003-3198-5116)
The dataset is published as part of the paper: "GRAPH2MAT: UNIVERSAL GRAPH TO MATRIX CONVERSION FOR ELECTRON DENSITY PREDICTION" (https://doi.org/10.26434/chemrxiv-2024-j4g21)
This dataset contains the Hamiltonian, Overlap, Density and Energy Density matrices from SIESTA calculations of the QM9 dataset (https://doi.org/10.6084/m9.figshare.c.978904.v5)
SIESTA 5.0.0 was used to compute the dataset.
The dataset has four directories:
The "runs" directory contains one directory for each run, named with the index of the run. Each directory contains:
- RUN.fdf, geom.fdf: The input files used for the SIESTA calculation.
- RUN.out: The log of the SIESTA run, which apar
- siesta.TSDE: Contains the Density and Energy Density matrices.
- siesta.TSHS: Contains the Hamiltonian and Overlap matrices.
Each matrix can be read using the sisl python package (https://github.com/zerothi/sisl) like:
import sisl
matrix = sisl.get_sile("RUN.fdf").read_X()
where X is hamiltonian, overlap, density_matrix or energy_density_matrix.
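The read_X pattern above can be wrapped in a small helper that loops over the four matrix kinds. The following is only a sketch: the four read_* method names are taken from the sentence above, while the helper names read_matrix and read_run, and the idea of passing a run directory, are assumptions made for illustration. sisl is imported lazily so that the dispatch logic can be inspected without sisl installed.

```python
from pathlib import Path

# The four values of X named in the dataset description.
MATRIX_KINDS = ("hamiltonian", "overlap", "density_matrix", "energy_density_matrix")


def read_matrix(sile, kind):
    """Call sile.read_<kind>() for one of the four matrix kinds."""
    if kind not in MATRIX_KINDS:
        raise ValueError(f"unknown matrix kind: {kind!r}")
    return getattr(sile, f"read_{kind}")()


def read_run(run_dir):
    """Read all four matrices of one run (needs sisl and the run's SIESTA files)."""
    import sisl  # lazy import: only required when actually reading files

    sile = sisl.get_sile(str(Path(run_dir) / "RUN.fdf"))
    return {kind: read_matrix(sile, kind) for kind in MATRIX_KINDS}
```

A call such as read_run("runs/0") would then return a dictionary keyed by matrix kind; the path is hypothetical and depends on where the dataset was unpacked.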
To reproduce the results presented in the paper, follow the documentation of the graph2mat package (https://github.com/BIG-MAP/graph2mat).
https://doi.org/10.11583/DTU.c.7310005 © 2024 Technical University of Denmark
This dataset is published under the CC BY 4.0 license. This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator.
The QM9 and ZINC datasets are used for molecule generation. The QM9 dataset contains ∼134,000 organic molecules with at most 9 heavy (non-hydrogen) atoms, and the ZINC dataset includes 250,000 drug-like molecules with at most 38 heavy atoms.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The QM9-extended database was further extended with 1781 compounds containing chlorine atoms.
Here are shielding and J-coupling features created by the quantum chemistry package deMon, using its freely downloadable binary with default settings, over the QM9 set of molecules used in Predicting Molecular Properties. These features would be considered forbidden for this competition because they are based on quantum calculations, but they appear to help with predictions using boosted tree and neural net models. They took around 2.5 days to compute in parallel on two different Linux boxes with 14 CPU cores each (files have '_even' and '_odd' suffixes). Python code to import them:
import pandas as pd

root = "../"  # directory containing the deMon CSV files

# J-coupling features (pair level), split across the '_odd' and '_even' files
demon_odd = pd.read_csv(root+'deMon_jcoupling_odd.csv')
print(demon_odd.columns, demon_odd.shape)
demon_even = pd.read_csv(root+'deMon_jcoupling_even.csv')
print(demon_even.columns, demon_even.shape)
demonj = pd.concat((demon_even,demon_odd))
print(demonj.columns, demonj.shape)

# Shielding features (atom level), split the same way
demon_odd = pd.read_csv(root+'deMon_shielding_odd.csv')
print(demon_odd.columns, demon_odd.shape)
demon_even = pd.read_csv(root+'deMon_shielding_even.csv')
print(demon_even.columns, demon_even.shape)
demons = pd.concat((demon_even,demon_odd))
print(demons.columns, demons.shape)
The shielding values are at the atom level and the J coupling at the pair level. Use molecule_name and atom indices when merging since the molecules are not in the same order as the original data. Also, deMon did not produce results for a few of the molecules so the features will be missing for them.
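The merge described above might look roughly like the sketch below, which uses small synthetic frames instead of the real CSV files. Only molecule_name is named in the description. The other column names (atom_index, atom_index_0, atom_index_1, shielding, jcoupling) are assumptions and should be replaced with whatever the printed columns of the real files show. Left merges keep every row of the original table, so molecules for which deMon produced no output end up with NaN features.

```python
import pandas as pd

# Synthetic stand-ins; every column name except molecule_name is an assumption.
train = pd.DataFrame({
    "molecule_name": ["mol_a", "mol_a", "mol_b"],
    "atom_index_0": [1, 2, 1],
    "atom_index_1": [0, 0, 0],
})
jcoupling = pd.DataFrame({  # pair-level features, deliberately in a different order
    "molecule_name": ["mol_a", "mol_a"],
    "atom_index_0": [2, 1],
    "atom_index_1": [0, 0],
    "jcoupling": [7.2, 84.0],
})
shielding = pd.DataFrame({  # atom-level features; mol_b is missing on purpose
    "molecule_name": ["mol_a", "mol_a", "mol_a"],
    "atom_index": [0, 1, 2],
    "shielding": [30.0, 31.5, 29.8],
})

# Pair-level merge: match on the molecule and on both atom indices.
merged = train.merge(
    jcoupling, on=["molecule_name", "atom_index_0", "atom_index_1"], how="left"
)

# Atom-level merge: attach the shielding of each atom in the pair.
for i in (0, 1):
    atom_feats = shielding.rename(
        columns={"atom_index": f"atom_index_{i}", "shielding": f"shielding_{i}"}
    )
    merged = merged.merge(
        atom_feats, on=["molecule_name", f"atom_index_{i}"], how="left"
    )

print(merged)
```

Matching on keys rather than on row position is what makes the result independent of the order in which the molecules appear in either file.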
QM9 is a quantum chemistry benchmark dataset containing 134k stable small organic molecules.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Molecular dipole moments of the QM7b dataset, a random sample of 21,000 molecules from the QM9 dataset, and the MuML showcase set (including the four challenge series) described in the linked publication.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset containing DFT calculations of energy and forces for all configurations in the QM9 dataset, recalculated with the ωB97X functional and 6-31G(d) basis set. Recalculating the energy and forces causes a slight shift of the potential energy surface, which results in forces acting on most configurations in the dataset. The data was generated by running Nudged Elastic Band (NEB) calculations with DFT on 10k reactions while saving intermediate calculations. QM9x is used as a benchmarking and comparison dataset for the dataset Transition1x.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Compressed Python pickle file for the QM9 dataset.
The JARVIS_QM9_STD_JCTC dataset is part of the joint automated repository for various integrated simulations (JARVIS) database. This dataset contains configurations from the QM9 dataset, originally created as part of the datasets at quantum-machine.org. Units for r2 (electronic spatial extent) are a0^2; for alpha (isotropic polarizability), a0^3; for mu (dipole moment), D; for Cv (heat capacity), cal/mol K. Units for all other properties are eV. JARVIS is a set of tools and collected datasets built to meet current materials design challenges.
For the first iteration of DFT calculations, Gaussian 09's default electronic and geometry thresholds have been used for all molecules. For those molecules which failed to reach SCF convergence, ultrafine grids have been invoked within a second iteration for evaluating the XC energy contributions. Within a third iteration on the remaining unconverged molecules, we identified those which had relaxed to saddle points, and further tightened the SCF criteria using the keyword scf(maxcycle=200, verytight). All those molecules which still featured imaginary frequencies entered the fourth iteration using the keywords opt(calcfc, maxstep=5, maxcycles=1000). calcfc constructs a Hessian in the first step of the geometry relaxation for eigenvector following. Within the fifth and final iteration, all molecules which still failed to reach convergence have subsequently been converged using opt(calcall, maxstep=1, maxcycles=1000).