QM9 consists of computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of C, H, O, N, and F. As usual, we remove the uncharacterized molecules and provide the remaining 130,831.
To use this dataset:
import tensorflow_datasets as tfds

# Load the training split and print the first four examples.
ds = tfds.load('qm9', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
HR-machine/QM9-Dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Downloaded from: http://quantum-machine.org/datasets/
Abstract
Computational de novo design of new drugs and materials requires rigorous and unbiased exploration of chemical compound space. However, large uncharted territories persist due to its size scaling combinatorially with molecular size. We report computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of CHONF. These molecules correspond to the subset of all 133,885 species with up to nine heavy atoms (CONF) out of the GDB-17 chemical universe of 166 billion organic molecules. We report geometries minimal in energy, corresponding harmonic frequencies, dipole moments, polarizabilities, along with energies, enthalpies, and free energies of atomization. All properties were calculated at the B3LYP/6-31G(2df,p) level of quantum chemistry. Furthermore, for the predominant stoichiometry, C7H10O2, there are 6,095 constitutional isomers among the 134k molecules. We report energies, enthalpies, and free energies of atomization at the more accurate G4MP2 level of theory for all of them. As such, this data set provides quantum chemical properties for a relevant, consistent, and comprehensive chemical space of small organic molecules. This database may serve the benchmarking of existing methods, development of new methods, such as hybrid quantum mechanics/machine learning, and systematic identification of structure-property relationships.
Download: available via figshare.
How to cite: when using this dataset, please make sure to cite the following two papers:
L. Ruddigkeit, R. van Deursen, L. C. Blum, J.-L. Reymond, Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17, J. Chem. Inf. Model. 52, 2864–2875, 2012.
R. Ramakrishnan, P. O. Dral, M. Rupp, O. A. von Lilienfeld, Quantum chemistry structures and properties of 134 kilo molecules, Scientific Data 1, 140022, 2014.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We constructed the QM9Spectra (QM9S) dataset using 130K organic molecules based on the popular QM9 dataset. We first re-optimized the molecular geometries using the Gaussian16 package (B.01 version) at the B3LYP/def-TZVP level of theory. Then the molecular properties, including scalars (energy, NPA charges, etc.), vectors (electric dipole, etc.), 2nd-order tensors (Hessian matrix, quadrupole moment, polarizability, etc.), and 3rd-order tensors (octupole moment, first hyperpolarizability, etc.), were calculated at the same level. Frequency analysis and time-dependent density functional theory (TD-DFT) calculations were carried out at the same level to obtain the infrared, Raman, and UV-Vis spectra. Two versions of the dataset, .pt (torch_geometric version) and .csv, are provided for training and use. In addition, we also provide broadened spectra. When using this dataset, please cite the original article's DOI, https://doi.org/10.1038/s43588-023-00550-y, instead of the DOI provided by figshare.
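For readers unfamiliar with broadened spectra, here is a minimal sketch of how a stick spectrum (peak positions and intensities) can be turned into a continuous curve. It assumes a Gaussian lineshape with a hypothetical full width at half maximum of 20 cm^-1 and a made-up two-peak input; the function name `broaden`, the grid, and the width are illustrative assumptions and are not necessarily what the QM9S authors used.

```python
import numpy as np

def broaden(positions, intensities, grid, fwhm=20.0):
    """Sum one Gaussian per stick, evaluated on `grid` (assumed lineshape)."""
    # Convert full width at half maximum to the Gaussian standard deviation.
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    pos = np.asarray(positions, dtype=float)[:, None]
    inten = np.asarray(intensities, dtype=float)[:, None]
    g = np.asarray(grid, dtype=float)[None, :]
    # Each row is one peak's profile; unit-height Gaussians scaled by intensity.
    profiles = inten * np.exp(-0.5 * ((g - pos) / sigma) ** 2)
    return profiles.sum(axis=0)

# Hypothetical stick spectrum: two peaks, positions in cm^-1.
grid = np.linspace(500.0, 4000.0, 3501)  # 1 cm^-1 spacing
spectrum = broaden([1700.0, 3000.0], [1.0, 0.4], grid)
```

A Lorentzian profile is an equally common choice; only the line inside `broaden` that computes `profiles` would change.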
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Here you can find GEOM, a dataset with over 37 million molecular conformations annotated by energy and statistical weight for over 450,000 molecules. Over 317,000 species contain experimental data related to biophysics, physiology, and physical chemistry, and the remaining 133,000 species are from the QM9 dataset.
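As a rough illustration of what a statistical weight per conformation means, the sketch below computes plain Boltzmann weights from relative conformer energies. It assumes energies in kcal/mol and a temperature of 298.15 K; the function name `boltzmann_weights` and the example energies are hypothetical, and the weights shipped with GEOM may be computed differently (for example, with degeneracy factors).

```python
import numpy as np

K_B_KCAL = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def boltzmann_weights(energies_kcal, temperature=298.15):
    """Return normalized Boltzmann weights for a set of conformer energies."""
    e = np.asarray(energies_kcal, dtype=float)
    # Shift by the minimum so the largest exponent is zero (numerical stability).
    rel = e - e.min()
    w = np.exp(-rel / (K_B_KCAL * temperature))
    return w / w.sum()

# Hypothetical conformer energies relative to the lowest one, in kcal/mol.
weights = boltzmann_weights([0.0, 0.5, 2.0])
```

Lower-energy conformers receive larger weights, and the weights sum to one.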
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The QM9-extended database was further extended with 1,781 compounds containing chlorine atoms and 2,020 compounds containing bromine atoms.
Here are shielding and J-coupling features created by the quantum chemistry package deMon using its free download binary with default settings over the QM9 set of molecules used in Predicting Molecular Properties. These features would be considered forbidden for this competition because they are based on quantum calculations, but they appear to help with predictions using boosted tree and neural net models. They took around 2.5 days to compute in parallel on two different linux boxes with 14 CPU cores each (files have '_even' and '_odd' suffixes). Python code to import them:
import pandas as pd

root = "../"  # directory containing the deMon CSV files

# J-coupling features (pair level): read both halves, then stack them.
demon_odd = pd.read_csv(root+'deMon_jcoupling_odd.csv')
print(demon_odd.columns, demon_odd.shape)
demon_even = pd.read_csv(root+'deMon_jcoupling_even.csv')
print(demon_even.columns, demon_even.shape)
demonj = pd.concat((demon_even,demon_odd))
print(demonj.columns, demonj.shape)

# Shielding features (atom level): same pattern, reusing the temporaries.
demon_odd = pd.read_csv(root+'deMon_shielding_odd.csv')
print(demon_odd.columns, demon_odd.shape)
demon_even = pd.read_csv(root+'deMon_shielding_even.csv')
print(demon_even.columns, demon_even.shape)
demons = pd.concat((demon_even,demon_odd))
print(demons.columns, demons.shape)
The shielding values are at the atom level and the J coupling at the pair level. Use molecule_name and atom indices when merging since the molecules are not in the same order as the original data. Also, deMon did not produce results for a few of the molecules so the features will be missing for them.
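The merge described above could look roughly like the sketch below. It uses small toy frames instead of the real files, and every column name other than molecule_name (atom_index, atom_index_0, atom_index_1, demon_j, demon_shield) is an assumption made for illustration; check the columns printed by the import code before adapting it. A left merge keeps every pair from the original table, so molecules deMon could not process end up with NaN features.

```python
import pandas as pd

# Toy stand-in for the competition's pair-level table (column names assumed).
pairs = pd.DataFrame({
    "molecule_name": ["mol_a", "mol_a", "mol_b"],
    "atom_index_0": [1, 2, 1],
    "atom_index_1": [0, 0, 0],
})

# Toy stand-ins for the deMon features, deliberately in a different row order.
# "demon_j" and "demon_shield" are hypothetical feature column names.
demonj = pd.DataFrame({
    "molecule_name": ["mol_a", "mol_a"],
    "atom_index_0": [2, 1],
    "atom_index_1": [0, 0],
    "demon_j": [10.5, 84.2],
})
demons = pd.DataFrame({
    "molecule_name": ["mol_a", "mol_a", "mol_a"],
    "atom_index": [0, 1, 2],
    "demon_shield": [150.0, 30.1, 29.8],
})

# Pair-level merge: key on the molecule name plus both atom indices.
merged = pairs.merge(
    demonj, on=["molecule_name", "atom_index_0", "atom_index_1"], how="left"
)

# Atom-level merge: attach the shielding once for each atom of the pair.
for side in (0, 1):
    shield = demons.rename(columns={
        "atom_index": f"atom_index_{side}",
        "demon_shield": f"demon_shield_{side}",
    })
    merged = merged.merge(
        shield, on=["molecule_name", f"atom_index_{side}"], how="left"
    )

print(merged)
```

Here mol_b has no deMon rows, so its feature columns stay missing after the merge rather than dropping the pair.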