100+ datasets found

SMILES DataSet for Analysis & Prediction Dataset
kaggle.com
zip
Updated Jun 11, 2023
Cite
Yan Maksi (2023). SMILES DataSet for Analysis & Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/yanmaksi/big-molecules-smiles-dataset
Explore at:
Available download formats
zip (296339 bytes)
Dataset updated
Jun 11, 2023
Authors
Yan Maksi
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
Read this article to unlock the wonderful world of Deep Reinforcement Learning for Drug Design

ReLeaSE is a public dataset, consisting of molecular structures and their corresponding binding affinity to proteins. The dataset was created for the purpose of evaluating and comparing machine learning models for the prediction of protein-ligand binding affinity.

The dataset contains a total of 10,000 molecules and their binding affinity to several target proteins, including thrombin, kinase, and protease. The molecular structures are represented using Simplified Molecular Input Line Entry System (SMILES) notation, which is a standardized method for representing molecular structures as a string of characters. The binding affinity is represented as a negative logarithm of the dissociation constant (pKd), which is a measure of the strength of the interaction between the molecule and the target protein.
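
The pKd definition in the paragraph above can be made concrete with a short worked example. This is a minimal sketch, assuming Kd is expressed in mol/L; the function names and the example concentrations are illustrative and are not taken from the dataset.

```python
import math


def pkd_from_kd(kd_molar: float) -> float:
    """Convert a dissociation constant Kd (in mol/L) to pKd = -log10(Kd)."""
    if kd_molar <= 0:
        raise ValueError("Kd must be a positive concentration in mol/L")
    return -math.log10(kd_molar)


def kd_from_pkd(pkd: float) -> float:
    """Invert the transform: Kd = 10 ** (-pKd), in mol/L."""
    return 10.0 ** (-pkd)


# Illustrative values only (not taken from the dataset):
# a tighter binder has a smaller Kd and therefore a larger pKd.
nanomolar = pkd_from_kd(1e-9)   # Kd of 1 nM
micromolar = pkd_from_kd(1e-6)  # Kd of 1 uM
print(nanomolar > micromolar)
```

Because the scale is logarithmic, each unit of pKd corresponds to a tenfold change in Kd, which is worth keeping in mind when reading regression errors on this target.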

The ReLeaSE dataset provides a standardized benchmark for evaluating machine learning models for protein-ligand binding affinity prediction. The dataset is publicly available and can be used for research purposes, making it an important resource for the drug discovery community.
HF Molecule data for quantum computing
pennylane.ai
Cite
Utkarsh Azad; Stepan Fomichev, HF Molecule data for quantum computing [Dataset]. https://pennylane.ai/datasets/hf-molecule
Explore at:
Authors
Utkarsh Azad; Stepan Fomichev
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Measurement technique
Simulation
Dataset funded by
Xanadu Quantum Technologies
Description
This dataset contains Hamiltonian information, molecular data, VQE data, and tapering data for the HF Molecule using the STO-3G basis set at various bondlengths.
molecule-data
huggingface.co
Updated Jan 15, 2026
Cite
SII-minxiyu (2026). molecule-data [Dataset]. https://huggingface.co/datasets/SII-minxiyu/molecule-data
Explore at:
Dataset updated
Jan 15, 2026
Authors
SII-minxiyu
Description
SII-minxiyu/molecule-data dataset hosted on Hugging Face and contributed by the HF Datasets community
C2 Molecule data for quantum computing
pennylane.ai
+ more versions
Cite
Utkarsh Azad; Stepan Fomichev, C2 Molecule data for quantum computing [Dataset]. https://pennylane.ai/datasets/c2-molecule
Explore at:
Authors
Utkarsh Azad; Stepan Fomichev
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Measurement technique
Simulation
Dataset funded by
Xanadu Quantum Technologies (https://xanadu.ai/)
Description
This dataset contains Hamiltonian information, molecular data, VQE data, and tapering data for the C2 Molecule using the STO-3G basis set at various bondlengths.
Supplementary Data. Effector Molecule Analysis.xlsx
figshare.com
xlsx
Updated Jan 23, 2023
Cite
angus watson (2023). Supplementary Data. Effector Molecule Analysis.xlsx [Dataset]. http://doi.org/10.6084/m9.figshare.21940223.v1
Explore at:
Available download formats
xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.21940223.v1
Dataset updated
Jan 23, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
angus watson
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Supplementary Data: Effector Molecule Analysis
The Molecular Entities in Linked Data Dataset
data.mendeley.com
Updated Apr 4, 2020
Cite
Dominik Tomaszuk (2020). The Molecular Entities in Linked Data Dataset [Dataset]. http://doi.org/10.17632/fp4phyrbkz.1
Explore at:
Unique identifier
https://doi.org/10.17632/fp4phyrbkz.1
Dataset updated
Apr 4, 2020
Authors
Dominik Tomaszuk
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Molecular Entities in Linked Data (MEiLD) dataset comprises data on distinct atoms, molecules, ions, ion pairs, radicals, radical ions, and other species that are identifiable as separately distinguishable chemical entities. The dataset is provided in JSON-LD format and was generated by SDFEater, a tool that parses atoms, bonds, and other molecule data. MEiLD contains 349,960 'small' chemical entities.
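
Since the description mentions JSON-LD, here is a rough sketch of how one such record could be read with the Python standard library. The record below is hypothetical: its `@context`, `@type`, and property names are assumptions for illustration, and the actual MEiLD vocabulary may differ.

```python
import json

# Hypothetical JSON-LD node; the real MEiLD keys and vocabulary may differ.
SAMPLE = """
{
  "@context": {"name": "http://schema.org/name",
               "inChIKey": "http://schema.org/inChIKey"},
  "@id": "http://example.org/entity/1",
  "@type": "MolecularEntity",
  "name": "water",
  "inChIKey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N"
}
"""


def summarize(node: dict) -> dict:
    """Split a JSON-LD node into keyword entries (@id, @type) and plain properties."""
    keywords = {k: v for k, v in node.items()
                if k.startswith("@") and k != "@context"}
    properties = {k: v for k, v in node.items() if not k.startswith("@")}
    return {"keywords": keywords, "properties": properties}


record = json.loads(SAMPLE)
summary = summarize(record)
print(summary["keywords"]["@type"], summary["properties"]["name"])
```

JSON-LD is ordinary JSON, so no special parser is needed for simple inspection; a dedicated JSON-LD library only becomes useful when the `@context` has to be expanded into full IRIs.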
Data for Lingo3DMol
figshare.com
zip
Updated Nov 27, 2023
Cite
Wei Feng; Lvwei Wang; Zaiyun Lin; Yanhao Zhu; Han Wang; Jianqiang Dong; Rong Bai; Huting Wang; Jielong Zhou; Wei Peng; Bo Huang; Wenbiao Zhou (2023). Data for Lingo3DMol [Dataset]. http://doi.org/10.6084/m9.figshare.24550351.v3
Explore at:
Available download formats
zip
Unique identifier
https://doi.org/10.6084/m9.figshare.24550351.v3
Dataset updated
Nov 27, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Wei Feng; Lvwei Wang; Zaiyun Lin; Yanhao Zhu; Han Wang; Jianqiang Dong; Rong Bai; Huting Wang; Jielong Zhou; Wei Peng; Bo Huang; Wenbiao Zhou
License
https://www.gnu.org/licenses/gpl-3.0.html
Description
"DUDE_pocket.tar.gz" contains the pocket structures of the 101 DUD-E targets used for model evaluation.

"lingo3dmol_confs.tar.gz" includes molecules generated by Lingo3DMol for the 101 DUD-E targets. It includes 3D conformations directly generated by the model without any force field-based refinement.

"pocket2mol_confs.tar.gz" consists of molecules generated by Pocket2Mol for the 101 DUD-E targets. It includes 3D conformations directly generated by the model without any force field-based refinement.

"targetdiff_confs.tar.gz" contains molecules generated by TargetDiff for the 101 DUD-E targets. It includes 3D conformations directly generated by the model without any force field-based refinement.

"random_confs.tar.gz" contains random molecules for the 101 DUD-E targets. It includes 3D conformations obtained through docking.

The molecule files in the "*_confs.tar.gz" archives are named using the format {pdb_id}-{mol_id}.

Additional information about the generated molecules, including mol_id, SMILES, and the metric scores involved in the evaluation, can be found in the "*_moleculars_metric_score_with_dude.csv" files.

"Training_data.tar.gz" contains all the complex structures used for model fine-tuning.

"Training_data_homology_with_DUDE.csv" contains information about the PDB IDs in the training set and their maximum sequence identity with the DUD-E targets used in the evaluation.

"Pretraining_molecules_SMILES" contains a subset of the data used for pretraining: 1.4 million publicly available molecules that were used during the pretraining phase.
Data from: The METLIN small molecule dataset for machine learning-based...
figshare.com
txt
Updated May 30, 2023
Cite
Xavier Domingo-Almenara (2023). The METLIN small molecule dataset for machine learning-based retention time prediction [Dataset]. http://doi.org/10.6084/m9.figshare.8038913.v1
Explore at:
Available download formats
txt
Unique identifier
https://doi.org/10.6084/m9.figshare.8038913.v1
Dataset updated
May 30, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Xavier Domingo-Almenara
License
https://www.apache.org/licenses/LICENSE-2.0.html
Description
The METLIN Small Molecule Retention Time (SMRT) dataset

The METLIN SMRT is a reverse-phase retention time dataset covering a total of 80,038 small molecules. The SMRT dataset includes, for each molecule, the retention time (in seconds), the PubChem number, the molfiles containing the structure (SDF format), and molecular descriptors and extended connectivity fingerprints (ECFP) calculated with Dragon 7 (Kode Chemoinformatics, Pisa, Italy). The SMRT is a freely available resource. Use and redistribution of the data, in whole or in part, requires explicit acknowledgment of the source material and the original publication:

Domingo-Almenara, X. et al. The METLIN small molecule dataset for machine learning-based retention time prediction. Nature Communications (2019) DOI: 10.1038/s41467-019-13680-7
Data from: molecule graph
kaggle.com
zip
Updated Jun 27, 2025
Cite
Danny Ahn (2025). molecule graph [Dataset]. https://www.kaggle.com/datasets/dannyahn/molecule-graph
Explore at:
Available download formats
zip (348452457 bytes)
Dataset updated
Jun 27, 2025
Authors
Danny Ahn
Description
Dataset

This dataset was created by Danny Ahn

Contents
Additional file 1 of Tools and pipelines for BioNano data: molecule assembly...
datasetcatalog.nlm.nih.gov
Updated Dec 15, 2016
Cite
Lam, Ernest; Herndon, Nic; Shelton, Jennifer; Anantharaman, Thomas; Coleman, Michelle; Sheth, Palak; Brown, Susan; Lu, Nanyan (2016). Additional file 1 of Tools and pipelines for BioNano data: molecule assembly pipeline and FASTA super scaffolding tool [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001894698
Explore at:
Dataset updated
Dec 15, 2016
Authors
Lam, Ernest; Herndon, Nic; Shelton, Jennifer; Anantharaman, Thomas; Coleman, Michelle; Sheth, Palak; Brown, Susan; Lu, Nanyan
Description
Single molecule map stretch per scan in recent flowcells. Bases per pixel (bpp) is plotted for scans 1..n for each flowcell of mouse lemur molecules (purple). The first scan of each flowcell is indicated with a grey dashed line. The pre-adjusted molecule map stretch was determined by aligning molecule maps to the in silico maps. Data made available by P.A. Larsen, J. Rogers, A.D. Yoder and the Duke Lemur Center. (ZIP 55 kb)
Data from: Example Dataset of Molecular Dynamics Simulations for MeTrEx
zenodo.org
bin, txt
Updated Jan 21, 2025
Cite
Sabrina Jaeger-Honz; Christiane Rohse; Beat Ehrmann; Karsten Klein; Ying Zhang; Wendong Ma; Yu Hamada; Jian Li; Falk Schreiber (2025). Example Dataset of Molecular Dynamics Simulations for MeTrEx [Dataset]. http://doi.org/10.5281/zenodo.14512968
Explore at:
Available download formats
bin, txt
Unique identifier
https://doi.org/10.5281/zenodo.14512968
Dataset updated
Jan 21, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Sabrina Jaeger-Honz; Christiane Rohse; Beat Ehrmann; Karsten Klein; Ying Zhang; Wendong Ma; Yu Hamada; Jian Li; Falk Schreiber
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Dec 18, 2024
Description
This data set contains an example of Molecular Dynamics (MD) Simulation files. A Pseudomonas aeruginosa membrane was simulated with Polymyxin B1 and used as an example data set for MeTrEx (Membrane Trajectory Explorer). MeTrEx is a Python-based software to explore and analyse MD Simulation ligand and membrane molecule interaction data.

The manuscript is currently under review, and the data set will be published open-source after acceptance.

Included Files

1. example_topology_with_6PMB.pdb

Description: Structure file in PDB format, including the topology of a membrane system with six polymyxin B (PMB) molecules.

Purpose: Provides the structural data required to initialise the simulation.

2. example_simulation_6PMB_5000Frames.xtc

Description: Trajectory file in XTC format containing 5,000 frames of a PMB-membrane interaction simulation.

Purpose: Allows detailed trajectory exploration and visual analysis in MeTrEx.

3. example_simulation_6PMB_500Frames.xtc

Description: A shorter trajectory file (500 frames) for rapid testing and exploration.

Purpose: Serves as a lightweight example for quick demonstrations.

4. example_contact_LH0_PMB.xvg

Description: XVG file providing contact analysis data between PMB molecules and the upper membrane leaflet lipids (LH0).

5. example_membrane_solvent.xvg

Description: XVG file containing analysis data of membrane solvent properties.
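
For readers who want to inspect the XVG files outside MeTrEx, the sketch below shows one way to read the numeric columns. It assumes the usual GROMACS/Grace convention in which lines starting with "#" are comments, lines starting with "@" are plot directives, and "&" separates data sets; the sample text is made up and does not reproduce the contents of the files listed above.

```python
def parse_xvg(text: str) -> list[list[float]]:
    """Return the numeric rows of an XVG file, skipping comments and directives."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # '#' = comment, '@' = plot directive, '&' = data set separator
        if not line or line[0] in "#@&":
            continue
        rows.append([float(token) for token in line.split()])
    return rows


# Hypothetical file contents for illustration only.
SAMPLE_XVG = """# produced by an analysis tool
@    title "Number of contacts"
@    xaxis  label "Time (ps)"
0.000  3
10.000 5
"""

rows = parse_xvg(SAMPLE_XVG)
print(rows)
```

To use it on a real file, pass the file's text, for example `parse_xvg(open("example_contact_LH0_PMB.xvg").read())`. Axis labels and legends live in the "@" lines, which this sketch deliberately discards.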

How to Use

1. Import the provided files into MeTrEx for the visualisation and analysis workflow.

2. Explore trajectory visualisation, molecular speed and distance mapping, and external analysis integration (XVG File analysis) using the software's corresponding tools.

3. Refer to the MeTrEx documentation for step-by-step guidance on loading and interacting with these datasets.

If you have questions or need support, please contact the MeTrEx developers or refer to the official documentation.
O2 Molecule data for quantum computing
pennylane.ai
Cite
Utkarsh Azad; Stepan Fomichev, O2 Molecule data for quantum computing [Dataset]. https://pennylane.ai/datasets/o2-molecule
Explore at:
Authors
Utkarsh Azad; Stepan Fomichev
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Measurement technique
Simulation
Dataset funded by
Xanadu Quantum Technologies
Description
This dataset contains Hamiltonian information, molecular data, VQE data, and tapering data for the O2 Molecule using the STO-3G basis set at various bondlengths.
Physical and Chemical Properties of Substances
kaggle.com
zip
Updated Apr 29, 2025
Cite
Ivan Yakovlev G (2025). Physical and Chemical Properties of Substances [Dataset]. https://www.kaggle.com/datasets/ivanyakovlevg/physical-and-chemical-properties-of-substances/code
Explore at:
Available download formats
zip (965964 bytes)
Dataset updated
Apr 29, 2025
Authors
Ivan Yakovlev G
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Version 3 description below

This dataset is an extended version of the "Wikipedia Molecules Properties Dataset" with added SMILES representations, additional physicochemical properties calculated using the thermo library, and chemical classification according to the ClassyFire system for approximately 4,200 compounds.

Key Features

Original structural formulas taken from the Wikipedia Molecules Properties Dataset (15,000+ molecules)

SMILES representations obtained for ~4,200 compounds extracted from structural formulas using the thermo library

Calculated physicochemical properties (melting point, boiling point, etc.) obtained using the thermo library

Chemical classification according to the ClassyFire system (Kingdom, Superclass, Class, Subclass, etc.)

Feature Description

name: Name of the chemical compound

formula: Chemical formula of the compound

CAS: Unique CAS (Chemical Abstracts Service) identification number

smiles: Molecular structure representation in SMILES format (Simplified Molecular Input Line Entry System)

InChI: International Chemical Identifier

InChIKey: Hashed version of InChI for quick searching

molecular_weight: Molecular weight of the compound (g/mol)

melting_point_K: Melting point in Kelvin

boiling_point_K: Boiling point in Kelvin

heat_of_fusion: Heat of fusion (enthalpy of fusion), J/mol

heat_of_vaporization: Heat of vaporization (enthalpy of vaporization), J/mol

critical_temperature: Critical temperature, K

critical_pressure: Critical pressure, Pa

flash_point: Flash point, K

logP: Octanol-water partition coefficient (measure of lipophilicity)

improved_name: Improved/standardized name of the compound

kingdom: Kingdom in chemical taxonomy

superclass: Superclass of the compound

class: Class of the compound in chemical taxonomy

direct_parent: Direct parent class of the compound

substituents: Substituents and functional groups in the compound
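
As a rough illustration of working with the columns listed above, the sketch below reads a few rows and converts the Kelvin temperatures to degrees Celsius. The inline CSV is a made-up miniature that borrows only the column names from the feature list; the file name, delimiter, and the way missing values are stored in the real download are assumptions.

```python
import csv
import io

KELVIN_OFFSET = 273.15

# Made-up rows using column names from the feature list; values are illustrative.
SAMPLE_CSV = """name,formula,melting_point_K,boiling_point_K
water,H2O,273.15,373.15
ethanol,C2H6O,159.05,351.44
unknown,,,
"""


def kelvin_to_celsius(value):
    """Convert a Kelvin string to Celsius; empty or missing cells become None."""
    if value is None or value.strip() == "":
        return None
    return float(value) - KELVIN_OFFSET


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
melting_c = {row["name"]: kelvin_to_celsius(row["melting_point_K"]) for row in rows}
print(melting_c)
```

Returning None for empty cells keeps missing measurements distinguishable from a genuine 0 K entry, which matters because many compounds in a dataset like this are likely to lack some properties.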

Tools and Resources Used

Wikipedia Molecules Properties Dataset: the dataset used as the foundation

Thermo: A library for calculating thermodynamic and transport properties of chemicals

ClassyFire: A chemical taxonomy system and classifier for small molecules

License

CC0: Public Domain - This dataset is in the public domain. You can copy, modify, distribute and perform the data, even for commercial purposes, all without asking permission.

Citation

When using this dataset, please cite:

Original dataset: Wikipedia Molecules Properties Dataset

This extended dataset: "Yakovlev, I. (2025). Physical and Chemical Properties of Substances [Data set]. Zenodo. https://doi.org/10.5281/zenodo.18242974" or "Ivan Yakovlev. (2025). Physical and Chemical Properties of Substances [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/11611464"

Contact Information

Ivan Yakovlev
Email: yakovlev.ivan.g@gmail.com
LinkedIn: www.linkedin.com/in/ivanyakovlevg

Version 3.0.0 (2025-04-28)

Major Enhancements

Added functional group classification: Each compound is now classified according to 26 different functional groups including alcohols, alkanes, aromatics, ethers, and more

SMILES representation: Added canonical SMILES strings for all compounds with valid InChI identifiers

Organic/inorganic classification: Each compound is now labeled as organic or inorganic based on chemical structure

Data Processing Improvements

InChI standardization: Fixed inconsistent InChI prefixes, ensuring all follow the standard "InChI=1S/" format

Structure validation: All molecular structures have been validated using RDKit's chemical informatics toolkit

Missing data handling: Improved handling of compounds with invalid or missing structural identifiers
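
The InChI prefix fix mentioned above could look something like the following. This is only a guess at the kind of normalization involved: the rules below (restoring a missing "InChI=" and repairing letter case) are assumptions, not the dataset author's actual procedure, and the function name is hypothetical.

```python
STANDARD_PREFIX = "InChI=1S/"


def normalize_inchi_prefix(raw: str):
    """Return the identifier with a standard 'InChI=1S/' prefix, or None if unrecognized."""
    text = raw.strip()
    if text.startswith(STANDARD_PREFIX):
        return text
    # Assumed repair 1: the 'InChI=' part is missing entirely.
    if text.startswith("1S/"):
        return "InChI=" + text
    # Assumed repair 2: the prefix is present but in the wrong letter case.
    if text.lower().startswith(STANDARD_PREFIX.lower()):
        return STANDARD_PREFIX + text[len(STANDARD_PREFIX):]
    return None


print(normalize_inchi_prefix("1S/H2O/h1H2"))
```

A prefix check of this kind only validates the form of the string; confirming that the structure itself is chemically valid still requires a cheminformatics toolkit such as the RDKit validation the changelog refers to.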

Technical Details

Used RDKit for chemical structure manipulation and SMILES conversion

Applied Thermo library for functional group classification

Conversion success rate: 98.7% of compounds with valid InChI were successfully converted to SMILES

Functional group identification completed for 97.9% of valid structures

Dataset Statistics

Top 5 functional groups identified:

Organic compounds: 76.4%

Hydrocarbons: 34.2%

Aromatics: 23.7%

Alcohols: 19.8%

Ethers: 14.3%

Known Limitations

Approximately 1.3% of InChI strings could not be converted to SMILES due to structural inconsistencies

Some complex metal-organic compounds may have inconsistent functional group classification

Polymers and mixtures may have limited functional group detection accuracy

Potential Applications

Enhanced structure-activity relat...
Additional file 2 of Tools and pipelines for BioNano data: molecule assembly...
datasetcatalog.nlm.nih.gov
Updated Dec 15, 2016
+ more versions
Cite
Anantharaman, Thomas; Shelton, Jennifer; Brown, Susan; Coleman, Michelle; Herndon, Nic; Lu, Nanyan; Lam, Ernest; Sheth, Palak (2016). Additional file 2 of Tools and pipelines for BioNano data: molecule assembly pipeline and FASTA super scaffolding tool [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001894689
Explore at:
Dataset updated
Dec 15, 2016
Authors
Anantharaman, Thomas; Shelton, Jennifer; Brown, Susan; Coleman, Michelle; Herndon, Nic; Lu, Nanyan; Lam, Ernest; Sheth, Palak
Description
Cumulative length and number of single molecule maps per BNX file for T. castaneum data generated over time. Detailed metrics for molecule maps per BNX file (cumulative length and number of maps). Columns include cumulative length of molecule maps > 150 kb, number of molecule maps > 150 kb and date that BNX file was generated. (CSV 4.11 kb)
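
The two per-file metrics described above (number and cumulative length of molecule maps longer than 150 kb) can be sketched as follows. The input is assumed to be a plain list of map lengths in base pairs; reading lengths out of an actual BNX file is not shown, and the example lengths are invented.

```python
def summarize_maps(lengths_bp, min_length_bp=150_000):
    """Return (count, cumulative length) of molecule maps strictly longer than the cutoff."""
    kept = [length for length in lengths_bp if length > min_length_bp]
    return len(kept), sum(kept)


# Invented map lengths in base pairs, for illustration only.
example_lengths = [100_000, 150_000, 200_000, 350_000]
count, cumulative = summarize_maps(example_lengths)
print(count, cumulative)
```

The comparison is strict to match the "> 150 kb" wording, so a map of exactly 150,000 bp is excluded.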
Replication Data for Machine Learning Modeling of pKa
dataverse.harvard.edu
Updated Mar 30, 2024
Cite
Dongdong Zhang (2024). Replication Data for Machine Learning Modeling of pKa [Dataset]. http://doi.org/10.7910/DVN/6A67L9
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.7910/DVN/6A67L9
Dataset updated
Mar 30, 2024
Dataset provided by
Harvard Dataverse
Authors
Dongdong Zhang
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Cleaned public pKa datasets containing experimental and calculated data for ML modeling, providing protonated and deprotonated molecular topology graphs as well as the reaction centers.
Data from: Learning Properties of Ordered and Disordered Materials from...
figshare.com
application/gzip
Updated Oct 1, 2020
Cite
Chi Chen (2020). Learning Properties of Ordered and Disordered Materials from Multi-fidelity Data [Dataset]. http://doi.org/10.6084/m9.figshare.13040330.v1
Explore at:
Available download formats
application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.13040330.v1
Dataset updated
Oct 1, 2020
Dataset provided by
Figshare (http://figshare.com/)
Authors
Chi Chen
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains two datasets for our recent work "Learning Properties of Ordered and Disordered Materials from Multi-fidelity Data". The first data set is multi-fidelity band gap data for crystals, and the second is a molecular energy data set for molecules.

1. Multi-fidelity band gap data for crystals

The full band gap data used in the paper is located at band_gap_no_structs.gz. Users can use the following code to extract it.

    import gzip
    import json

    # Read the gzip-compressed JSON file into a dictionary
    with gzip.open("band_gap_no_structs.gz", "rb") as f:
        data = json.loads(f.read())

data is a dictionary with the following format

    {"pbe": {mp_id: PBE band gap, mp_id: PBE band gap, ...},
     "hse": {mp_id: HSE band gap, mp_id: HSE band gap, ...},
     "gllb-sc": {mp_id: GLLB-SC band gap, mp_id: GLLB-SC band gap, ...},
     "scan": {mp_id: SCAN band gap, mp_id: SCAN band gap, ...},
     "ordered_exp": {icsd_id: Exp band gap, icsd_id: Exp band gap, ...},
     "disordered_exp": {icsd_id: Exp band gap, icsd_id: Exp band gap, ...}}

where mp_id is the Materials Project materials ID for the material, and icsd_id is the ICSD materials ID. For example, the PBE band gap of NaCl (mp-22862, band gap 5.003 eV) can be accessed by data['pbe']['mp-22862']. Note that the Materials Project database evolves over time, so a given ID may have been removed in the latest release and the band gap value for the same material may have changed.

To get the structure that corresponds to a specific material ID in Materials Project, users can use the pymatgen REST API.

1.1. Register at Materials Project https://www.materialsproject.org and get an API key.

1.2. In Python, do the following to get the corresponding computational structure.

    # Older pymatgen releases exposed this as: from pymatgen import MPRester
    from pymatgen.ext.matproj import MPRester

    mpr = MPRester("YOUR_API_KEY")  # replace with your own API key
    structure = mpr.get_structure_by_material_id("mp-22862")  # any mp_id

A dump of all the material IDs and structures for the 2019.04.01 MP version is provided here: https://ndownloader.figshare.com/files/15108200. Users can download the file and extract the material_id and structure from this file for all materials. The structure in this case is a cif file. Users can again use pymatgen to read the cif string and get the structure.

    from pymatgen.core import Structure

    # cif_string holds the CIF text of one material from the dump
    structure = Structure.from_str(cif_string, fmt="cif")

For the ICSD structures, users are required to have commercial ICSD access. Hence the structures will not be provided here.

2. Multi-fidelity molecular energy data

The molecule_data.zip contains two datasets in json format.

2.1 G4MP2.json contains two-fidelity G4MP2 (6095) and B3LYP (130831) calculation results on QM9 molecules

    {"G4MP2": {"U0": {ID: G4MP2 energy (eV), ...}, "molecules": {ID: Pymatgen molecule dict, ...}},
     "B3LYP": {"U0": {ID: B3LYP energy (eV), ...}, "molecules": {ID: Pymatgen molecule dict, ...}}}

2.2 qm7b.json contains the molecule energy calculation results for 7211 molecules using HF, MP2 and CCSD(T) methods with 6-31g, sto-3g and cc-pvdz bases.

    {"molecules": {ID: Pymatgen molecule dict, ...},
     "targets": {ID: {"HF": {"sto3g": Atomization energy (kcal/mol), "631g": Atomization energy (kcal/mol), "cc-pvdz": Atomization energy (kcal/mol)},
                      "MP2": {"sto3g": Atomization energy (kcal/mol), "631g": Atomization energy (kcal/mol), "cc-pvdz": Atomization energy (kcal/mol)},
                      "CCSD(T)": {"sto3g": Atomization energy (kcal/mol), "631g": Atomization energy (kcal/mol), "cc-pvdz": Atomization energy (kcal/mol)}, ...}}}
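
To show how the nested qm7b.json "targets" layout might be used for multi-fidelity work, here is a small sketch that computes the gap between a low-fidelity and a high-fidelity atomization energy for one molecule. The dictionary below is a made-up miniature that mirrors the described key structure; the molecule ID, the energies, and the helper name are hypothetical.

```python
# Made-up miniature of the "targets" layout; ID and energies are illustrative only.
targets = {
    "mol-0": {
        "HF": {"sto3g": -400.0, "631g": -410.0, "cc-pvdz": -415.0},
        "MP2": {"sto3g": -405.0, "631g": -414.0, "cc-pvdz": -419.0},
        "CCSD(T)": {"sto3g": -406.0, "631g": -415.5, "cc-pvdz": -420.5},
    }
}


def fidelity_gap(targets, mol_id, low=("HF", "sto3g"), high=("CCSD(T)", "cc-pvdz")):
    """Return high-fidelity minus low-fidelity atomization energy (kcal/mol)."""
    entry = targets[mol_id]
    low_method, low_basis = low
    high_method, high_basis = high
    return entry[high_method][high_basis] - entry[low_method][low_basis]


gap = fidelity_gap(targets, "mol-0")
print(gap)
```

With the real file, `targets` would come from `json.load(...)["targets"]`; the same indexing pattern (ID, then method, then basis) applies.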
Data from: Single-molecule mapping of replisome progression
datacatalog.mskcc.org
Updated Oct 11, 2023
Cite
Claussin, Clémence; Vazquez, Jacob; Whitehouse, Iestyn (2023). Single-molecule mapping of replisome progression [Dataset]. https://datacatalog.mskcc.org/dataset/10918
Explore at:
Dataset updated
Oct 11, 2023
Dataset provided by
MSK Library
Authors
Claussin, Clémence; Vazquez, Jacob; Whitehouse, Iestyn
Description
Summary from GEO:

"Replicon-seq is a method to study the progression of sister replisomes during DNA replication. This method relies on excision of the full length of replicons by the fusion of MNase to MCM4 and sequencing via Nanopore technology."

Overall design from GEO:

"MCM4 was fused to Micrococcal nuclease (MNase) to generate DNA double strand breaks at the site of replisomes. DNA ends are repaired and MinION compatible DNA adaptors are ligated. Full length molecules are sequenced. Because cells have been released from a G1 arrest in presence of BrdU, we can select for replicon reads (reads that contain BrdU) informatically using DNAscent."
pickled molecule data
kaggle.com
zip
Updated Jun 29, 2019
Cite
Cédric Bény (2019). pickled molecule data [Dataset]. https://www.kaggle.com/cedben/pickled-molecule-data
Explore at:
Available download formats
zip (231938147 bytes)
Dataset updated
Jun 29, 2019
Authors
Cédric Bény
Description
Old version. Use https://www.kaggle.com/cedben/pickled-molecules instead
Coarse-Grained and Multi-Dimensional Data-Driven Molecular Generation: A...
zenodo.org
zip
Updated Sep 14, 2024
+ more versions
Cite
yurong zou (2024). Coarse-Grained and Multi-Dimensional Data-Driven Molecular Generation: A General Framework for Selective Inhibitor Design and Optimization in Structure-Based Drug Discovery [Dataset]. http://doi.org/10.5281/zenodo.13761486
Explore at:
Available download formats
zip
Unique identifier
https://doi.org/10.5281/zenodo.13761486
Dataset updated
Sep 14, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
yurong zou
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
2024
Description
Many approaches not only fail to consider the intricate binding pocket interactions, leading to molecules with suboptimal properties and stability, but also struggle with designing selective inhibitors. To address this challenge, we have developed an innovative structure-based three-dimensional molecular generation framework named Coarse-grained and Multi-dimensional Data-driven molecular generation (CMD-GEN). This framework bridges three-dimensional ligand-protein complex data with two-dimensional drug-like molecule data by utilizing coarse-grained pharmacophore points sampled from diffusion models, thereby enriching the training data for generative models. Through a hierarchical architecture, it decomposes the generation of three-dimensional molecules within the pocket into sampling of coarse-grained pharmacophore points, generation of chemical structures, and alignment of conformations, avoiding the instability issues inherent in deep generative model-based generation of molecular conformations.

This project provides the source dataset used to train and evaluate the overall model.
Small Molecule-Protein Interaction Data
kaggle.com
zip
Updated Apr 19, 2024
Cite
Indranil Bhattacharyya (2024). Small Molecule-Protein Interaction Data [Dataset]. https://www.kaggle.com/datasets/photon98/leash-bio-engineered-data-training
Explore at:
Available download formats
zip (0 bytes)
Dataset updated
Apr 19, 2024
Authors
Indranil Bhattacharyya
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
About the Dataset and How I augmented the data:

The dataset used in this augmentation process (a subset of the original training data) is sourced from the Leash Bio - Predict New Medicines with BELKA competition. It comprises examples of small molecules with binary labels indicating whether each molecule binds to one of three protein targets. The data were collected using DNA-encoded chemical library (DEL) technology.

Chemical representations are expressed in SMILES (Simplified Molecular-Input Line-Entry System), while the labels denote binary binding classifications, corresponding to three distinct protein targets.

I've expanded the original dataset by augmenting it with additional features derived from the existing data. Specifically, I've calculated and included three new features:

mol_wt (Molecular Weight): Calculated based on the SMILES data using RDKit, providing insight into the mass of each molecule.

logP (Partition Coefficient): Also derived from the SMILES data using RDKit, representing the logarithm of the partition coefficient, a measure of a molecule's hydrophobicity and its ability to partition between a hydrophobic solvent and water.

rotamers (Number of Rotamers): Determined from the SMILES data using RDKit, indicating the number of distinct conformations or rotational isomers a molecule can adopt.

These additional features aim to enrich the feature matrix, potentially enhancing the predictive power of models trained on the augmented dataset.

Data Description:

id - A unique example_id we use to identify the molecule-binding target pair.

buildingblock1_smiles - The structure, in SMILES, of the first building block.

buildingblock2_smiles - The structure, in SMILES, of the second building block.

buildingblock3_smiles - The structure, in SMILES, of the third building block.

molecule_smiles - The structure of the fully assembled molecule, in SMILES. This includes the three building blocks and the triazine core. Note we use a [Dy] as the stand-in for the DNA linker.

protein_name - The protein target name.

binds - The target column. A binary class label of whether the molecule binds to the protein. Not available for the test set.

mol_wt - The molecule's molecular weight derived from SMILES data using RDKit.

logP - The logP of the molecule derived from SMILES data using RDKit.

rotamers - The number of rotamers of the molecule derived from SMILES data using RDKit.

Targets: binds

Proteins are encoded in the genome, and the names of the genes encoding those proteins are typically bestowed by their discoverers and regulated by the HUGO Gene Nomenclature Committee. The protein products of these genes can sometimes have different names, often due to the history of their discovery.

Four scholarly articles cite the SMILES DataSet for Analysis & Prediction dataset (Yan Maksi, 2023).

