QM9 consists of computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of C, H, O, N, and F. As usual, we remove the uncharacterized molecules and provide the remaining 130,831.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('qm9', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Hessian QM9 is the first database of equilibrium configurations and numerical Hessian matrices, consisting of 41,645 molecules from the QM9 dataset at the wB97x/6-31G* level. Molecular Hessians were calculated in vacuum, as well as in water, tetrahydrofuran, and toluene using an implicit solvation model.
HR-machine/QM9-Dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "QM9"
QM9 dataset from Ruddigkeit et al., 2012; Ramakrishnan et al., 2014. Original data downloaded from: http://quantum-machine.org/datasets. Additional annotations (QED, logP, SA score, NP score, bond and ring counts) were added using the RDKit library.
Quick start usage:
from datasets import load_dataset
ds = load_dataset("yairschiff/qm9")
test_size = 0.1
seed = 1…
See the full description on the dataset page: https://huggingface.co/datasets/yairschiff/qm9.
Downloaded from: http://quantum-machine.org/datasets/
Abstract
Computational de novo design of new drugs and materials requires rigorous and unbiased exploration of chemical compound space. However, large uncharted territories persist due to its size scaling combinatorially with molecular size. We report computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of CHONF. These molecules correspond to the subset of all 133,885 species with up to nine heavy atoms (CONF) out of the GDB-17 chemical universe of 166 billion organic molecules. We report geometries minimal in energy, corresponding harmonic frequencies, dipole moments, polarizabilities, along with energies, enthalpies, and free energies of atomization. All properties were calculated at the B3LYP/6-31G(2df,p) level of quantum chemistry. Furthermore, for the predominant stoichiometry, C7H10O2, there are 6,095 constitutional isomers among the 134k molecules. We report energies, enthalpies, and free energies of atomization at the more accurate G4MP2 level of theory for all of them. As such, this data set provides quantum chemical properties for a relevant, consistent, and comprehensive chemical space of small organic molecules. This database may serve the benchmarking of existing methods, development of new methods, such as hybrid quantum mechanics/machine learning, and systematic identification of structure-property relationships.
Download: available via figshare.
How to cite: when using this dataset, please make sure to cite the following two papers:
L. Ruddigkeit, R. van Deursen, L. C. Blum, J.-L. Reymond, Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17, J. Chem. Inf. Model. 52, 2864–2875, 2012.
R. Ramakrishnan, P. O. Dral, M. Rupp, O. A. von Lilienfeld, Quantum chemistry structures and properties of 134 kilo molecules, Scientific Data 1, 140022, 2014.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Revised QM9 dataset with properties calculated using aPBE0 in the cc-pVTZ basis set.
The atomic coordinates, atomic numbers, chemical symbols, total energies, atomization energies, MO energies, HOMO energies, LUMO energies, and dipole moment norms are stored in the arrays "coords", "charges", "elements", "energies", "atomization", "moenergies", "homo", "lumo", and "dipole", respectively.
Density matrices will be uploaded soon.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We constructed the QM9Spectra (QM9S) dataset using 130K organic molecules based on the popular QM9 dataset. We first re-optimized the molecular geometries using the Gaussian16 package (B.01 version) at the B3LYP/def-TZVP level of theory. The molecular properties, including scalars (energy, NPA charges, etc.), vectors (electric dipole, etc.), 2nd-order tensors (Hessian matrix, quadrupole moment, polarizability, etc.), and 3rd-order tensors (octupole moment, first hyperpolarizability, etc.), were then calculated at the same level. Frequency analysis and time-dependent density functional theory (TD-DFT) calculations were carried out at the same level to obtain the infrared, Raman, and UV-Vis spectra.
Two versions of the dataset, .pt (torch_geometric version) and .csv, are provided for training and use. In addition, we provide broadened spectra.
When using this dataset, please cite the original article's DOI, https://doi.org/10.1038/s43588-023-00550-y, instead of the DOI provided by figshare.
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
liuganghuggingface/QM9 dataset hosted on Hugging Face and contributed by the HF Datasets community
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Revised QM9 dataset with properties calculated using the aPBE0 functional and the cc-pVTZ basis set.
The atomic coordinates, atomic numbers, chemical symbols, total energies, atomization energies, MO energies, HOMO energies, LUMO energies, and dipole moments are stored in the arrays "coordinates", "charges", "elements", "energies", "atomization", "moenergies", "homos", "lumos", and "dipole", respectively.
Usage example:
import numpy as np
data = np.load('revQM9.npz', allow_pickle=True)
coords, q, elems, energies = data['coordinates'], data['charges'], data['elements'], data['energies']
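As a rough sketch of how such an archive might be inspected, the snippet below builds a tiny synthetic .npz file and summarizes it. The file name `toy_revqm9.npz`, the helper `summarize`, and the assumption that per-molecule arrays are stored as ragged object arrays (suggested by `allow_pickle=True`, but not confirmed by the dataset description) are all illustrative; only the key names "coordinates" and "energies" come from the description above.

```python
import os
import tempfile

import numpy as np


def summarize(path):
    """Return basic statistics for an archive laid out like the one described."""
    # allow_pickle=True is required when the archive holds object arrays.
    with np.load(path, allow_pickle=True) as data:
        coords = data["coordinates"]
        energies = data["energies"]
        atoms_per_molecule = [len(c) for c in coords]
    return {
        "n_molecules": len(coords),
        "max_atoms": max(atoms_per_molecule),
        "mean_energy": float(np.mean(energies)),
    }


# Synthetic stand-in for the real file (assumed layout, made-up values).
toy_coords = np.empty(2, dtype=object)
toy_coords[0] = np.zeros((3, 3))  # a 3-atom molecule
toy_coords[1] = np.zeros((5, 3))  # a 5-atom molecule
toy_path = os.path.join(tempfile.mkdtemp(), "toy_revqm9.npz")
np.savez(toy_path, coordinates=toy_coords, energies=np.array([-40.5, -76.4]))

summary = summarize(toy_path)
print(summary)
```

Only load archives from trusted sources with `allow_pickle=True`, since unpickling can execute arbitrary code.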
A dataset of small molecules for benchmarking molecule generation methods. The dataset consists of fingerprints of the molecules, and the goal is to predict the original molecule from the fingerprint.
License: Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Two types of approaches to modeling molecular systems have demonstrated high practical efficiency. Density functional theory (DFT), the most widely used quantum chemical method, is a physical approach predicting energies and electron densities of molecules. Recently, numerous papers on machine learning (ML) of molecular properties have also been published. ML models greatly outperform DFT in terms of computational costs, and may even reach comparable accuracy, but they are missing physicality - a direct link to Quantum Physics - which limits their applicability. Here, we propose an approach that combines the strong sides of DFT and ML, namely, physicality and low computational cost. We derive general equations for exact electron densities and energies that can naturally guide applications of ML in Quantum Chemistry. Based on these equations, we build a deep neural network that can compute electron densities and energies of a wide range of organic molecules not only much faster, but also closer to exact physical values than current versions of DFT. In particular, we reached a mean absolute error in energies of molecules with up to eight non-hydrogen atoms as low as 0.9 kcal/mol relative to CCSD(T) values, noticeably lower than those of DFT (approaching ~2 kcal/mol) and ML (~1.5 kcal/mol) methods. A simultaneous improvement in the accuracy of predictions of electron densities and energies suggests that the proposed approach describes the physics of molecules better than DFT functionals developed by "human learning" earlier. Thus, physics-based ML offers exciting opportunities for modeling, with high-theory-level quantum chemical accuracy, of much larger molecular systems than currently possible.
Sinitskiy, A. V., & Pande, V. S. Deep Neural Network Computes Electron Densities and Energies of a Large Set of Organic Molecules Faster than Density Functional Theory (DFT). arXiv:1809.02723 (2018). Available at https://arxiv.org/abs/1809.02723
Sinitskiy, A. V., & Pande, V. S. Physical machine learning outperforms "human learning" in Quantum Chemistry. arXiv:1908.00971 (2019). Available at https://arxiv.org/abs/1908.00971
Dataset Card for "QM9"
More Information needed
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset of HOMO/LUMO energies of the QM9 dataset computed at GW level of theory.
The dataset is used for testing the proposed TopNets architecture on molecular property prediction tasks.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
QM9-extended database was further extended with 1781 compounds containing chlorine atoms and 2020 compounds containing bromine atoms.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database for training graph neural network (GNN) models in "Integrating Explainability into Graph Neural Network Models for the Prediction of X-ray Absorption Spectra", by Amir Kotobi, Kanishka Singh, Daniel Höche, Sadia Bari, Robert H. Meißner, and Annika Bande.
Included:
qm9_Cedge_xas_56k.npz: the TDDFT XAS spectra of 56k structures from the QM9 dataset, which were used to label the graph dataset. The file contains two key/value entries: spec_stk, a 2D array containing the energies and oscillator strengths of the XAS spectra, and id, which holds the indices of the QM9 structures. This data was used to create the QM9-XAS graph dataset.
qm9xas_orca_output.zip: the raw ORCA output of TDDFT calculations for the 56k QM9-XAS dataset consists of excitation energies, densities, molecular orbitals, and other relevant information. This unprocessed output serves as a source to derive ground truth data for explaining the predictions made by GNNs.
qm9xas_spec_train_val.pt: processed graph train/validation dataset of 50k QM9 structures. It is used as input to GNN models for training and validation.
qm9xas_spec_test.pt: processed graph test dataset of 6k QM9 structures. It is used to test the performance of trained GNN models.
Notes on the datasets:
The QM9-XAS dataset was created using the ORCA electronic structure package [Neese, F., WIREs Computational Molecular Science 2012, 2, 73–78] to calculate carbon K-edge XAS spectra with the time-dependent density functional theory (TDDFT) method [Petersilka, M.; Gossmann, U. J.; Gross, E. K. U., Phys. Rev. Lett. 1996, 76, 1212–1215].
The molecular structures of QM9-XAS datasets were sourced from the QM9 database [R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. Von Lilienfeld, Sci. Data 1, 1 (2014)].
Funding:
This research was funded by HIDA Trainee Network program, HAICU, Helmholtz AI-4-XAS, DASHH and HEIBRiDS graduate schools. For theoretical calculations and model training, computational resources at DESY and JFZ were used.
qm9 quantum. Visit https://dataone.org/datasets/sha256%3A09d247add3aca18d56b63d1834642e99243abfc143ff716a713c4a88e8bf59c5 for complete metadata about this dataset.
License: https://creativecommons.org/publicdomain/zero/1.0/
The computational de novo design of new drugs and materials requires a thorough and unbiased exploration of chemical compound space. However, this space remains largely unexplored due to its combinatorial scaling with molecular size. To address this challenge, a dataset of 134,000 stable small organic molecules composed of carbon (C), hydrogen (H), oxygen (O), nitrogen (N), and fluorine (F) has been meticulously computed. These molecules represent a subset of all 133,885 species with up to nine heavy atoms (C, O, N, F) from the GDB-17 chemical universe, which encompasses 166 billion organic molecules.
For each molecule, computed geometric, energetic, electronic, and thermodynamic properties are provided, including:
This dataset offers a relevant, consistent, and comprehensive exploration of chemical space for small organic molecules, providing a valuable resource for benchmarking existing methods, developing new methodologies (such as hybrid quantum mechanics/machine learning approaches), and systematically identifying structure-property relationships [1].
[1] Ramakrishnan, Raghunathan, et al. "Quantum chemistry structures and properties of 134 kilo molecules." Scientific data 1.1 (2014): 1-7.
In this notebook, we aim to leverage this dataset (QM9) to predict the molecular properties of these small organic molecules using the Coulomb matrix representation. Specifically, we will focus on using the eigenvalues of the Coulomb matrix, which serve as a crucial descriptor for capturing the electronic structure of molecules for predicting molecular properties.
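For orientation, here is a minimal sketch of the descriptor mentioned above. It uses the commonly cited Coulomb matrix definition (diagonal 0.5·Z^2.4, off-diagonal Z_i·Z_j / |R_i − R_j|); the function names, the zero-padding to a fixed `size`, and the sort by absolute value are assumptions for illustration and are not taken from the notebook itself. Coordinates are assumed to already be in the length unit you want to use (the definition is usually stated in atomic units).

```python
import numpy as np


def coulomb_matrix(charges, coords):
    """Build the Coulomb matrix from nuclear charges (n,) and positions (n, 3)."""
    z = np.asarray(charges, dtype=float)
    r = np.asarray(coords, dtype=float)
    # Pairwise interatomic distances; the diagonal is zero and is overwritten below.
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    with np.errstate(divide="ignore"):
        cm = np.outer(z, z) / dist
    np.fill_diagonal(cm, 0.5 * z ** 2.4)
    return cm


def coulomb_eigenvalues(charges, coords, size):
    """Eigenvalues sorted by decreasing magnitude, zero-padded to length `size`."""
    eig = np.linalg.eigvalsh(coulomb_matrix(charges, coords))
    if len(eig) > size:
        raise ValueError("size must be at least the number of atoms")
    eig = eig[np.argsort(-np.abs(eig))]
    padded = np.zeros(size)
    padded[: len(eig)] = eig
    return padded


# Toy example: two hydrogen-like centers one length unit apart.
features = coulomb_eigenvalues([1, 1], [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]], size=4)
print(features)
```

Padding to a fixed length gives every molecule a feature vector of the same size, which is what most regression models expect; the eigenvalue spectrum is also invariant to atom ordering.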
By the end of this notebook, you will have:
Let's begin by loading and exploring the dataset.
Enjoy! ⚛
| No. | Property | Unit | Description |
|---|---|---|---|
| 1 | tag | — | ‘gdb9’ string to facilitate extraction |
| 2 | i | — | Consecutive, 1-based integer identifier |
| 3 | A | GHz | Rotational constant |
| 4 | B | GHz | Rotational constant |
| 5 | C | GHz | Rotational constant |
| 6 | μ | D | Dipole moment |
| 7 | α | a₀³ | Isotropic polarizability |
| 8 | εHOMO | Ha | Energy of HOMO |
| 9 | εLUMO | Ha | Energy of LUMO |
| 10 | εgap | Ha | Gap (εLUMO − εHOMO) |
| 11 | ⟨R²⟩ | a₀² | Electronic spatial extent |
| 12 | zpve | Ha | Zero point vibrational energy |
| 13 | U0 | Ha | Internal energy at 0 K |
| 14 | U | Ha | Internal energy at 298.15 K |
| 15 | H | Ha | Enthalpy at 298.15 K |
| 16 | G | Ha | Free energy at 298.15 K |
| 17 | Cv | cal/(mol·K) | Heat capacity at 298.15 K |
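Since the energies in the table are given in Hartree, a small conversion helper can be handy when comparing against values reported in eV or kcal/mol. The sketch below uses the standard conversion factors; the function name `homo_lumo_gap_ev` and the example orbital energies are made up for illustration.

```python
# Standard conversion factors from Hartree (CODATA 2018 values).
HARTREE_TO_EV = 27.211386245988
HARTREE_TO_KCAL_PER_MOL = 627.5094740631


def homo_lumo_gap_ev(e_homo_ha, e_lumo_ha):
    """Gap (LUMO minus HOMO), converted from Hartree to eV."""
    return (e_lumo_ha - e_homo_ha) * HARTREE_TO_EV


# Illustrative orbital energies in Hartree (not taken from the dataset).
gap_ev = homo_lumo_gap_ev(-0.25, 0.05)
print(round(gap_ev, 4))
```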
For each molecule, atomic coordinates and calculated properties are stored in a file named dataset_index.xyz. The XYZ format [1] is a widespread plain-text format for encoding Cartesian coordinates of molecules, with no formal specification. It contains a header line specifying the number of atoms n_a, a comment line, and n_a lines containing element type and atomic coordinates, one atom per line. The comment line is used to store all scalar properties, and Mulliken charges are added as a fifth column. Harmonic vibrational frequencies, SMILES and InChI [2] are appended as respective additional lines.
[1] https://open-babel.readthedocs.io/en/latest/FileFormats/XYZ_cartesian_coordinates_format.html
[2] https://iupac.org/who-we-are/divisions/division-details/inchi/
| Line | Content | |------|----------------------------------------------------------...
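The file layout described above can be parsed with a few lines of plain Python. The following is a sketch under stated assumptions: the property names in `PROPERTY_NAMES` are shorthand chosen here to mirror the table, the sample record is an illustrative methane-like entry whose values are for demonstration only, and the handling of `*^` as an exponent marker reflects a notation reportedly found in some QM9 files.

```python
PROPERTY_NAMES = [
    "A", "B", "C", "mu", "alpha", "homo", "lumo", "gap",
    "r2", "zpve", "U0", "U", "H", "G", "Cv",
]

# Illustrative record in the described layout (demonstration values only).
SAMPLE = """5
gdb 1 157.7118 157.70997 157.70699 0. 13.21 -0.3877 0.1171 0.5048 35.3641 0.044749 -40.47893 -40.476062 -40.475117 -40.498597 6.469
C -0.0126981359 1.0858041578 0.0080009958 -0.535689
H 0.002150416 -0.0060313176 0.0019761204 0.133921
H 1.0117308433 1.4637511618 0.0002765748 0.133922
H -0.540815069 1.4475266138 -0.8766437152 0.133923
H -0.5238136345 1.4379326443 0.9063972942 0.133923
1341.307 1341.3284 1341.365 1562.6731 1562.7453 3038.3205 3151.6034 3151.6788 3151.7078
C C
InChI=1S/CH4/h1H4 InChI=1S/CH4/h1H4
"""


def to_float(token):
    """Parse a float, accepting '1.5*^-6' style exponents as well."""
    return float(token.replace("*^", "e"))


def parse_qm9_xyz(text):
    """Split one record into atoms, scalar properties, and trailing lines."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    header = lines[1].split()
    tag, index = header[0], int(header[1])
    properties = dict(zip(PROPERTY_NAMES, (to_float(t) for t in header[2:])))
    atoms = []
    for line in lines[2 : 2 + n_atoms]:
        element, x, y, z, charge = line.split()
        atoms.append((element, to_float(x), to_float(y), to_float(z), to_float(charge)))
    return {
        "tag": tag,
        "index": index,
        "properties": properties,
        "atoms": atoms,  # (element, x, y, z, Mulliken charge)
        "frequencies": [to_float(t) for t in lines[2 + n_atoms].split()],
        "smiles": lines[3 + n_atoms].split(),
        "inchi": lines[4 + n_atoms].split(),
    }


record = parse_qm9_xyz(SAMPLE)
print(record["index"], len(record["atoms"]), record["properties"]["gap"])
```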
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pol Febrer (pol.febrer@icn2.cat, ORCID 0000-0003-0904-2234) Peter Bjorn Jorgensen (peterbjorgensen@gmail.com, ORCID 0000-0003-4404-7276) Arghya Bhowmik (arbh@dtu.dk, ORCID 0000-0003-3198-5116)
The dataset is published as part of the paper: "GRAPH2MAT: UNIVERSAL GRAPH TO MATRIX CONVERSION FOR ELECTRON DENSITY PREDICTION" (https://doi.org/10.26434/chemrxiv-2024-j4g21)
This dataset contains the Hamiltonian, Overlap, Density and Energy Density matrices from SIESTA calculations of the QM9 dataset (https://doi.org/10.6084/m9.figshare.c.978904.v5)
SIESTA 5.0.0 was used to compute the dataset.
The dataset has four directories:
The "runs" directory contains one directory for each run, named with the index of the run. Each directory contains:
- RUN.fdf, geom.fdf: The input files used for the SIESTA calculation.
- RUN.out: The log of the SIESTA run, which apar
- siesta.TSDE: Contains the Density and Energy Density matrices.
- siesta.TSHS: Contains the Hamiltonian and Overlap matrices.
Each matrix can be read using the sisl python package (https://github.com/zerothi/sisl) like:
import sisl
matrix = sisl.get_sile("RUN.fdf").read_X()
where X is hamiltonian, overlap, density_matrix or energy_density_matrix.
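To make the `read_X` placeholder concrete, here is a small sketch that dispatches on the matrix name. The helper names `read_matrix` and `read_run_matrix` and the constant `MATRIX_KINDS` are hypothetical; the only sisl calls assumed are `sisl.get_sile` and the four `read_...` methods named above. The import of sisl is deferred so that the dispatch logic can be exercised without the package or the data files.

```python
MATRIX_KINDS = ("hamiltonian", "overlap", "density_matrix", "energy_density_matrix")


def read_matrix(sile, kind):
    """Call sile.read_<kind>() after checking that kind is one of the four names."""
    if kind not in MATRIX_KINDS:
        raise ValueError(f"kind must be one of {MATRIX_KINDS}, got {kind!r}")
    return getattr(sile, f"read_{kind}")()


def read_run_matrix(fdf_path, kind):
    """Open a run's RUN.fdf with sisl and read the requested matrix."""
    import sisl  # imported lazily; requires the sisl package to be installed

    return read_matrix(sisl.get_sile(fdf_path), kind)


# Example call (needs sisl and a downloaded run directory, path is illustrative):
# hamiltonian = read_run_matrix("runs/0/RUN.fdf", "hamiltonian")
```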
To reproduce the results presented in the paper, follow the documentation of the graph2mat package (https://github.com/BIG-MAP/graph2mat).
https://doi.org/10.11583/DTU.c.7310005 © 2024 Technical University of Denmark
This dataset is published under the CC BY 4.0 license. This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This database comprises a curated collection of 21,085 conjugated molecules, filtered from the original QM9 dataset. For each molecule, calculations were performed using the LC-ωPBE/6-31G* method. To account for the system dependence of the range-separated parameter ω in the LC-ωPBE approach, the IP method was employed to fine-tune ω, ensuring an optimal value for each molecule. Based on these refined parameters, further calculations were performed to determine molecular properties, extracting key data such as the Hamiltonian matrix, overlap matrix, eigenvalues, and eigenvectors.
The dataset is stored in a DB file format, with each entry containing the following information:
i: Index of the molecule
SMILES: SMILES representation of the molecule
omega: Value of the range-separated parameter ω
j2: Equation value constructed when tuning ω using the IP method
coordinates: Three-dimensional coordinates of atoms in the molecule
Z: Atomic numbers of the atoms in the molecule
n_atoms: Number of atoms in the molecule
hamiltonian: Hamiltonian matrix of the molecule
eigenvalues: Eigenvalues of the molecule
overlap: Overlap matrix of the molecular orbitals
eigenvectors: Eigenvectors of the molecule
Usage example :
import sqlite3
conn = sqlite3.connect('qm9_conj_OT-w.db')
cursor = conn.cursor()
cursor.execute("SELECT * FROM qm9_data WHERE i = 0")
molecule_data = cursor.fetchone()
smiles = molecule_data[1]
omega = molecule_data[2]
conn.close()
conn.close()
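A slightly more defensive variant of the lookup above could use a parameterized query and name-based column access. The sketch below runs against an in-memory stand-in table so it is self-contained; the helper `fetch_molecule`, the reduced three-column schema, and the inserted values are assumptions for illustration, while the table name `qm9_data` and the column names `i`, `SMILES`, and `omega` follow the description above.

```python
import sqlite3


def fetch_molecule(conn, index):
    """Return the row with the given molecule index as a dict, or None if absent."""
    conn.row_factory = sqlite3.Row  # allows access by column name
    row = conn.execute("SELECT * FROM qm9_data WHERE i = ?", (index,)).fetchone()
    return dict(row) if row is not None else None


# In-memory stand-in for the real database file (made-up values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE qm9_data (i INTEGER PRIMARY KEY, SMILES TEXT, omega REAL)")
conn.execute("INSERT INTO qm9_data VALUES (?, ?, ?)", (0, "C=CC=C", 0.3))

molecule = fetch_molecule(conn, 0)
missing = fetch_molecule(conn, 99)
conn.close()

print(molecule["SMILES"], molecule["omega"])
```

Accessing columns by name avoids depending on the column order of `SELECT *`, and the `?` placeholder keeps the index out of the SQL string.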