100+ datasets found
  1. SMILES DataSet for Analysis & Prediction Dataset

    • kaggle.com
    zip
    Updated Jun 11, 2023
    Cite
    Yan Maksi (2023). SMILES DataSet for Analysis & Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/yanmaksi/big-molecules-smiles-dataset
    Explore at:
    Available download formats: zip (296339 bytes)
    Dataset updated
    Jun 11, 2023
    Authors
    Yan Maksi
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description


    Read this article to unlock the wonderful world of Deep Reinforcement Learning for Drug Design

    ReLeaSE is a public dataset, consisting of molecular structures and their corresponding binding affinity to proteins. The dataset was created for the purpose of evaluating and comparing machine learning models for the prediction of protein-ligand binding affinity.

    The dataset contains a total of 10,000 molecules and their binding affinity to several target proteins, including thrombin, kinase, and protease. The molecular structures are represented using Simplified Molecular Input Line Entry System (SMILES) notation, which is a standardized method for representing molecular structures as a string of characters. The binding affinity is represented as a negative logarithm of the dissociation constant (pKd), which is a measure of the strength of the interaction between the molecule and the target protein.
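
    As a small illustration of the pKd definition above (pKd = -log10(Kd), with Kd in mol/L), the sketch below converts between the two quantities. The function names are illustrative and are not part of the dataset or of any particular library.

```python
import math


def kd_to_pkd(kd_molar: float) -> float:
    """Return pKd = -log10(Kd), with Kd given in mol/L."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return -math.log10(kd_molar)


def pkd_to_kd(pkd: float) -> float:
    """Invert the transform: Kd = 10 ** (-pKd), in mol/L."""
    return 10.0 ** (-pkd)
```

    A larger pKd therefore means a smaller dissociation constant, that is, tighter binding.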

    The ReLeaSE dataset provides a standardized benchmark for evaluating machine learning models for protein-ligand binding affinity prediction. The dataset is publicly available and can be used for research purposes, making it an important resource for the drug discovery community.

  2. HF Molecule data for quantum computing

    • pennylane.ai
    Cite
    Utkarsh Azad; Stepan Fomichev, HF Molecule data for quantum computing [Dataset]. https://pennylane.ai/datasets/hf-molecule
    Explore at:
    Authors
    Utkarsh Azad; Stepan Fomichev
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Measurement technique
    Simulation
    Dataset funded by
    Xanadu Quantum Technologies
    Description

    This dataset contains Hamiltonian information, molecular data, VQE data, and tapering data for the HF molecule using the STO-3G basis set at various bond lengths.

  3. molecule-data

    • huggingface.co
    Updated Jan 15, 2026
    Cite
    SII-minxiyu (2026). molecule-data [Dataset]. https://huggingface.co/datasets/SII-minxiyu/molecule-data
    Explore at:
    Dataset updated
    Jan 15, 2026
    Authors
    SII-minxiyu
    Description

    SII-minxiyu/molecule-data dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. C2 Molecule data for quantum computing

    • pennylane.ai
    + more versions
    Cite
    Utkarsh Azad; Stepan Fomichev, C2 Molecule data for quantum computing [Dataset]. https://pennylane.ai/datasets/c2-molecule
    Explore at:
    Authors
    Utkarsh Azad; Stepan Fomichev
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Measurement technique
    Simulation
    Dataset funded by
    Xanadu Quantum Technologies (https://xanadu.ai/)
    Description

    This dataset contains Hamiltonian information, molecular data, VQE data, and tapering data for the C2 molecule using the STO-3G basis set at various bond lengths.

  5. Supplementary Data. Effector Molecule Analysis.xlsx

    • figshare.com
    xlsx
    Updated Jan 23, 2023
    Cite
    angus watson (2023). Supplementary Data. Effector Molecule Analysis.xlsx [Dataset]. http://doi.org/10.6084/m9.figshare.21940223.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jan 23, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    angus watson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary Data: Effector Molecule Analysis

  6. The Molecular Entities in Linked Data Dataset

    • data.mendeley.com
    Updated Apr 4, 2020
    Cite
    Dominik Tomaszuk (2020). The Molecular Entities in Linked Data Dataset [Dataset]. http://doi.org/10.17632/fp4phyrbkz.1
    Explore at:
    Dataset updated
    Apr 4, 2020
    Authors
    Dominik Tomaszuk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Molecular Entities in Linked Data (MEiLD) dataset comprises data on distinct atoms, molecules, ions, ion pairs, radicals, radical ions, and other entities that are identifiable as separately distinguishable chemical entities. The dataset is provided in JSON-LD format and was generated by SDFEater, a tool that parses atoms, bonds, and other molecule data. MEiLD contains 349,960 'small' chemical entities.

  7. Data for Lingo3DMol

    • figshare.com
    zip
    Updated Nov 27, 2023
    Cite
    Wei Feng; Lvwei Wang; Zaiyun Lin; Yanhao Zhu; Han Wang; Jianqiang Dong; Rong Bai; Huting Wang; Jielong Zhou; Wei Peng; Bo Huang; Wenbiao Zhou (2023). Data for Lingo3DMol [Dataset]. http://doi.org/10.6084/m9.figshare.24550351.v3
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 27, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Wei Feng; Lvwei Wang; Zaiyun Lin; Yanhao Zhu; Han Wang; Jianqiang Dong; Rong Bai; Huting Wang; Jielong Zhou; Wei Peng; Bo Huang; Wenbiao Zhou
    License

    https://www.gnu.org/licenses/gpl-3.0.html

    Description

    • "DUDE_pocket.tar.gz" contains the pocket structures of the 101 DUD-E targets used for model evaluation.
    • "lingo3dmol_confs.tar.gz" includes molecules generated by Lingo3DMol for the 101 DUD-E targets. It includes 3D conformations directly generated by the model without any force field-based refinement.
    • "pocket2mol_confs.tar.gz" consists of molecules generated by Pocket2Mol for the 101 DUD-E targets. It includes 3D conformations directly generated by the model without any force field-based refinement.
    • "targetdiff_confs.tar.gz" contains molecules generated by TargetDiff for the 101 DUD-E targets. It includes 3D conformations directly generated by the model without any force field-based refinement.
    • "random_confs.tar.gz" contains random molecules for the 101 DUD-E targets. It includes 3D conformations obtained through docking.
    • "Training_data.tar.gz" contains all the complex structures used for model fine-tuning.
    • "Training_data_homology_with_DUDE.csv" contains information about the PDB IDs in the training set and their maximum sequence identity with the DUD-E targets used in the evaluation.
    • "Pretraining_molecules_SMILES" is a dataset that contains a specific subset of the data used for pretraining. It consists of 1.4 million publicly available molecules that were used during the pretraining phase.

    The molecule files in the "*_confs.tar.gz" archives are named using the format {pdb_id}-{mol_id}.

    Additional information about the generated molecules, including mol_id, SMILES, and the metric scores involved in the evaluation, can be found in the "*_moleculars_metric_score_with_dude.csv" files.

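    The {pdb_id}-{mol_id} naming scheme described for the molecule files can be unpacked with a few lines of Python. This is only a sketch: it assumes that pdb_id contains no hyphen and that at most one file extension follows the name, neither of which the description states. The function name and the ".sdf" extension in the usage line are hypothetical.

```python
from pathlib import PurePath


def split_conf_name(filename: str) -> tuple[str, str]:
    """Split '{pdb_id}-{mol_id}[.ext]' into (pdb_id, mol_id).

    Assumption: pdb_id itself contains no hyphen, so the first hyphen
    separates the two fields; a trailing file extension is dropped.
    """
    stem = PurePath(filename).stem
    pdb_id, sep, mol_id = stem.partition("-")
    if not sep or not pdb_id or not mol_id:
        raise ValueError(f"not in pdb_id-mol_id form: {filename!r}")
    return pdb_id, mol_id


# Hypothetical file name, used only to show the call.
print(split_conf_name("1abc-42.sdf"))
```
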
  8. Data from: The METLIN small molecule dataset for machine learning-based...

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Xavier Domingo-Almenara (2023). The METLIN small molecule dataset for machine learning-based retention time prediction [Dataset]. http://doi.org/10.6084/m9.figshare.8038913.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Xavier Domingo-Almenara
    License

    https://www.apache.org/licenses/LICENSE-2.0.html

    Description

    The METLIN Small Molecule Retention Time (SMRT) dataset

    The METLIN SMRT is a reverse-phase retention time dataset covering a total of 80,038 small molecules. The SMRT dataset includes, for each molecule, the retention time (in seconds), the PubChem number, the molfiles containing the structure (SDF format), and molecular descriptors and extended connectivity fingerprints (ECFP) calculated with Dragon 7 (Kode Chemoinformatics, Pisa, Italy). The SMRT is a freely available resource. Use and redistribution of the data, in whole or in part, requires explicit acknowledgment of the source material and the original publication:

    Domingo-Almenara, X. et al. The METLIN small molecule dataset for machine learning-based retention time prediction. Nature Communications (2019) DOI: 10.1038/s41467-019-13680-7

  9. Data from: molecule graph

    • kaggle.com
    zip
    Updated Jun 27, 2025
    Cite
    Danny Ahn (2025). molecule graph [Dataset]. https://www.kaggle.com/datasets/dannyahn/molecule-graph
    Explore at:
    Available download formats: zip (348452457 bytes)
    Dataset updated
    Jun 27, 2025
    Authors
    Danny Ahn
    Description

    Dataset

    This dataset was created by Danny Ahn

    Contents

  10. Additional file 1 of Tools and pipelines for BioNano data: molecule assembly...

    • datasetcatalog.nlm.nih.gov
    Updated Dec 15, 2016
    Cite
    Lam, Ernest; Herndon, Nic; Shelton, Jennifer; Anantharaman, Thomas; Coleman, Michelle; Sheth, Palak; Brown, Susan; Lu, Nanyan (2016). Additional file 1 of Tools and pipelines for BioNano data: molecule assembly pipeline and FASTA super scaffolding tool [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001894698
    Explore at:
    Dataset updated
    Dec 15, 2016
    Authors
    Lam, Ernest; Herndon, Nic; Shelton, Jennifer; Anantharaman, Thomas; Coleman, Michelle; Sheth, Palak; Brown, Susan; Lu, Nanyan
    Description

    Single molecule map stretch per scan in recent flowcells. Bases per pixel (bpp) is plotted for scans 1..n for each flowcell of mouse lemur molecules (purple). The first scan of each flowcell is indicated with a grey dashed line. The pre-adjusted molecule map stretch was determined by aligning molecule maps to the in silico maps. Data made available by P.A. Larsen, J. Rogers, A.D. Yoder and the Duke Lemur Center. (ZIP 55 kb)

  11. Data from: Example Dataset of Molecular Dynamics Simulations for MeTrEx

    • zenodo.org
    bin, txt
    Updated Jan 21, 2025
    Cite
    Sabrina Jaeger-Honz; Christiane Rohse; Beat Ehrmann; Karsten Klein; Ying Zhang; Wendong Ma; Yu Hamada; Jian Li; Falk Schreiber (2025). Example Dataset of Molecular Dynamics Simulations for MeTrEx [Dataset]. http://doi.org/10.5281/zenodo.14512968
    Explore at:
    Available download formats: bin, txt
    Dataset updated
    Jan 21, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sabrina Jaeger-Honz; Christiane Rohse; Beat Ehrmann; Karsten Klein; Ying Zhang; Wendong Ma; Yu Hamada; Jian Li; Falk Schreiber
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 18, 2024
    Description

    This data set contains an example of Molecular Dynamics (MD) Simulation files. A Pseudomonas aeruginosa membrane was simulated with Polymyxin B1 and used as an example data set for MeTrEx (Membrane Trajectory Explorer). MeTrEx is a Python-based software to explore and analyse MD Simulation ligand and membrane molecule interaction data.

    The manuscript is currently under review, and the data set will be published open-source after acceptance.

    Included Files
    1. example_topology_with_6PMB.pdb
    Description: Structure file in PDB format, including the topology of a membrane system with six polymyxin B (PMB) molecules.
    Purpose: Provides the structural data required to initialise the simulation.
    2. example_simulation_6PMB_5000Frames.xtc
    Description: Trajectory file in XTC format containing 5,000 frames of a PMB-membrane interaction simulation.
    Purpose: Allows detailed trajectory exploration and visual analysis in MeTrEx.
    3. example_simulation_6PMB_500Frames.xtc
    Description: A shorter trajectory file (500 frames) for rapid testing and exploration.
    Purpose: Serves as a lightweight example for quick demonstrations.
    4. example_contact_LH0_PMB.xvg
    Description: XVG file providing contact analysis data between PMB molecules and the upper membrane leaflet lipids (LH0).
    5. example_membrane_solvent.xvg
    Description: XVG file containing analysis data of membrane solvent properties.
    How to Use
    1. Import the provided files into MeTrEx for the visualisation and analysis workflow.
    2. Explore trajectory visualisation, molecular speed and distance mapping, and external analysis integration (XVG File analysis) using the software's corresponding tools.
    3. Refer to the MeTrEx documentation for step-by-step guidance on loading and interacting with these datasets.
    If you have questions or need support, please contact the MeTrEx developers or refer to the official documentation.
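
    For readers who want to inspect the two XVG files outside MeTrEx, the sketch below reads the numeric columns of an XVG file with the Python standard library. In the XVG format, lines beginning with "#" are comments and lines beginning with "@" are Grace plotting directives; everything else is whitespace-separated numbers. The function name and the sample content are made up for illustration, and files with several data sets separated by "&" lines are not handled.

```python
def parse_xvg(text: str) -> list[list[float]]:
    """Return the numeric rows of an XVG file as lists of floats.

    Lines starting with '#' (comments) or '@' (Grace plot directives)
    are skipped, as are blank lines.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#@":
            continue
        rows.append([float(tok) for tok in line.split()])
    return rows


# Made-up sample content, not taken from the dataset.
sample = """# hypothetical header
@    title "Number of contacts"
0.0  12
10.0 15
"""
print(parse_xvg(sample))
```

    To apply it to one of the files listed above, read the file as text first, for example with open("example_contact_LH0_PMB.xvg").read().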

  12. O2 Molecule data for quantum computing

    • pennylane.ai
    Cite
    Utkarsh Azad; Stepan Fomichev, O2 Molecule data for quantum computing [Dataset]. https://pennylane.ai/datasets/o2-molecule
    Explore at:
    Authors
    Utkarsh Azad; Stepan Fomichev
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Measurement technique
    Simulation
    Dataset funded by
    Xanadu Quantum Technologies
    Description

    This dataset contains Hamiltonian information, molecular data, VQE data, and tapering data for the O2 molecule using the STO-3G basis set at various bond lengths.

  13. Physical and Chemical Properties of Substances

    • kaggle.com
    zip
    Updated Apr 29, 2025
    Cite
    Ivan Yakovlev G (2025). Physical and Chemical Properties of Substances [Dataset]. https://www.kaggle.com/datasets/ivanyakovlevg/physical-and-chemical-properties-of-substances/code
    Explore at:
    Available download formats: zip (965964 bytes)
    Dataset updated
    Apr 29, 2025
    Authors
    Ivan Yakovlev G
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Version 3 description below

    This dataset is an extended version of the "Wikipedia Molecules Properties Dataset" with added SMILES representations, additional physicochemical properties calculated using the thermo library, and chemical classification according to the ClassyFire system for approximately 4,200 compounds.

    Key Features

    • Original structural formulas taken from the Wikipedia Molecules Properties Dataset (15,000+ molecules)
    • SMILES representations obtained for ~4,200 compounds extracted from structural formulas using the thermo library
    • Calculated physicochemical properties (melting point, boiling point, etc.) obtained using the thermo library
    • Chemical classification according to the ClassyFire system (Kingdom, Superclass, Class, Subclass, etc.)

    Feature Description

    • name: Name of the chemical compound
    • formula: Chemical formula of the compound
    • CAS: Unique CAS (Chemical Abstracts Service) identification number
    • smiles: Molecular structure representation in SMILES format (Simplified Molecular Input Line Entry System)
    • InChI: International Chemical Identifier
    • InChIKey: Hashed version of InChI for quick searching
    • molecular_weight: Molecular weight of the compound (g/mol)
    • melting_point_K: Melting point in Kelvin
    • boiling_point_K: Boiling point in Kelvin
    • heat_of_fusion: Heat of fusion (enthalpy of fusion), J/mol
    • heat_of_vaporization: Heat of vaporization (enthalpy of vaporization), J/mol
    • critical_temperature: Critical temperature, K
    • critical_pressure: Critical pressure, Pa
    • flash_point: Flash point, K
    • logP: Octanol-water partition coefficient (measure of lipophilicity)
    • improved_name: Improved/standardized name of the compound
    • kingdom: Kingdom in chemical taxonomy
    • superclass: Superclass of the compound
    • class: Class of the compound in chemical taxonomy
    • direct_parent: Direct parent class of the compound
    • substituents: Substituents and functional groups in the compound

    Tools and Resources Used

    License

    CC0: Public Domain - This dataset is in the public domain. You can copy, modify, distribute and perform the data, even for commercial purposes, all without asking permission.

    Citation

    When using this dataset, please cite:

    • Original dataset: Wikipedia Molecules Properties Dataset
    • This extended dataset: "Yakovlev, I. (2025). Physical and Chemical Properties of Substances [Data set]. Zenodo. https://doi.org/10.5281/zenodo.18242974" or "Ivan Yakovlev. (2025). Physical and Chemical Properties of Substances [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/11611464"

    Contact Information

    Ivan Yakovlev
    Email: yakovlev.ivan.g@gmail.com
    LinkedIn: www.linkedin.com/in/ivanyakovlevg

    Version 3.0.0 (2025-04-28)

    Major Enhancements

    • Added functional group classification: Each compound is now classified according to 26 different functional groups including alcohols, alkanes, aromatics, ethers, and more
    • SMILES representation: Added canonical SMILES strings for all compounds with valid InChI identifiers
    • Organic/inorganic classification: Each compound is now labeled as organic or inorganic based on chemical structure

    Data Processing Improvements

    • InChI standardization: Fixed inconsistent InChI prefixes, ensuring all follow the standard "InChI=1S/" format
    • Structure validation: All molecular structures have been validated using RDKit's chemical informatics toolkit
    • Missing data handling: Improved handling of compounds with invalid or missing structural identifiers
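
    The InChI standardization item can be checked on the user's side with a simple prefix test. This is a minimal sketch: the description does not say what the inconsistent prefixes looked like, so the code only reports whether a string already starts with the standard "InChI=1S/" prefix and does not attempt any repair. The function name is illustrative.

```python
STANDARD_PREFIX = "InChI=1S/"


def has_standard_inchi_prefix(inchi: str) -> bool:
    """Return True if the string starts with the standard InChI prefix."""
    return inchi.strip().startswith(STANDARD_PREFIX)


# Standard InChI for water, followed by the same string with its prefix cut off.
print(has_standard_inchi_prefix("InChI=1S/H2O/h1H2"))
print(has_standard_inchi_prefix("1S/H2O/h1H2"))
```
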

    Technical Details

    • Used RDKit for chemical structure manipulation and SMILES conversion
    • Applied Thermo library for functional group classification
    • Conversion success rate: 98.7% of compounds with valid InChI were successfully converted to SMILES
    • Functional group identification completed for 97.9% of valid structures

    Dataset Statistics

    • Top 5 functional groups identified:
      • Organic compounds: 76.4%
      • Hydrocarbons: 34.2%
      • Aromatics: 23.7%
      • Alcohols: 19.8%
      • Ethers: 14.3%

    Known Limitations

    • Approximately 1.3% of InChI strings could not be converted to SMILES due to structural inconsistencies
    • Some complex metal-organic compounds may have inconsistent functional group classification
    • Polymers and mixtures may have limited functional group detection accuracy

    Potential Applications

    • Enhanced structure-activity relat...

  14. Additional file 2 of Tools and pipelines for BioNano data: molecule assembly...

    • datasetcatalog.nlm.nih.gov
    Updated Dec 15, 2016
    + more versions
    Cite
    Anantharaman, Thomas; Shelton, Jennifer; Brown, Susan; Coleman, Michelle; Herndon, Nic; Lu, Nanyan; Lam, Ernest; Sheth, Palak (2016). Additional file 2 of Tools and pipelines for BioNano data: molecule assembly pipeline and FASTA super scaffolding tool [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001894689
    Explore at:
    Dataset updated
    Dec 15, 2016
    Authors
    Anantharaman, Thomas; Shelton, Jennifer; Brown, Susan; Coleman, Michelle; Herndon, Nic; Lu, Nanyan; Lam, Ernest; Sheth, Palak
    Description

    Cumulative length and number of single molecule maps per BNX file for T. castaneum data generated over time. Detailed metrics for molecule maps per BNX file (cumulative length and number of maps). Columns include cumulative length of molecule maps > 150 kb, number of molecule maps > 150 kb and date that BNX file was generated. (CSV 4.11 kb)
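
    The two per-file metrics described above (cumulative length and number of molecule maps longer than 150 kb) can be computed from a list of map lengths as follows. This is an illustrative sketch, not the authors' pipeline; the function name and the input values are made up, and lengths are assumed to be given in kb.

```python
def summarize_maps(lengths_kb, min_kb=150.0):
    """Return (cumulative length, count) of maps strictly longer than min_kb."""
    kept = [length for length in lengths_kb if length > min_kb]
    return sum(kept), len(kept)


# Made-up map lengths in kb; only 151.0 and 200.0 exceed the threshold.
total_kb, n_maps = summarize_maps([100.0, 151.0, 200.0, 150.0])
```
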

  15. Replication Data for Machine Learning Modeling of pKa

    • dataverse.harvard.edu
    Updated Mar 30, 2024
    Cite
    Dongdong Zhang (2024). Replication Data for Machine Learning Modeling of pKa [Dataset]. http://doi.org/10.7910/DVN/6A67L9
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 30, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Dongdong Zhang
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Cleaned versions of several public pKa datasets containing experimental and calculated data for ML modeling, together with protonated and deprotonated molecular topology graphs and the reaction centers.

  16. Data from: Learning Properties of Ordered and Disordered Materials from...

    • figshare.com
    application/gzip
    Updated Oct 1, 2020
    Cite
    Chi Chen (2020). Learning Properties of Ordered and Disordered Materials from Multi-fidelity Data [Dataset]. http://doi.org/10.6084/m9.figshare.13040330.v1
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Oct 1, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Chi Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains two datasets for our recent work "Learning Properties of Ordered and Disordered Materials from Multi-fidelity Data". The first is a multi-fidelity band gap data set for crystals, and the second is a molecular energy data set for molecules.

    1. Multi-fidelity band gap data for crystals

    The full band gap data used in the paper is located at band_gap_no_structs.gz. Users can use the following code to extract it.

        import gzip
        import json

        with gzip.open("band_gap_no_structs.gz", "rb") as f:
            data = json.loads(f.read())

    data is a dictionary with the following format

        {"pbe": {mp_id: PBE band gap, mp_id: PBE band gap, ...},
         "hse": {mp_id: HSE band gap, mp_id: HSE band gap, ...},
         "gllb-sc": {mp_id: GLLB-SC band gap, mp_id: GLLB-SC band gap, ...},
         "scan": {mp_id: SCAN band gap, mp_id: SCAN band gap, ...},
         "ordered_exp": {icsd_id: Exp band gap, icsd_id: Exp band gap, ...},
         "disordered_exp": {icsd_id: Exp band gap, icsd_id: Exp band gap, ...}}

    where mp_id is the Materials Project materials ID for the material, and icsd_id is the ICSD materials ID. For example, the PBE band gap of NaCl (mp-22862, band gap 5.003 eV) can be accessed by data['pbe']['mp-22862']. Note that the Materials Project database evolves over time, so a given ID may be removed in the latest release, and the band gap value for the same material may change.

    To get the structure that corresponds to a specific material ID in Materials Project, users can use the pymatgen REST API.

    1.1. Register at Materials Project https://www.materialsproject.org and get an API key.

    1.2. In Python, do the following to get the corresponding computational structure.

        from pymatgen.ext.matproj import MPRester

        mpr = MPRester("YOUR_API_KEY")  # replace with your API key
        structure = mpr.get_structure_by_material_id("mp-22862")  # any mp_id

    A dump of all the material IDs and structures for the 2019.04.01 MP version is provided here: https://ndownloader.figshare.com/files/15108200. Users can download the file and extract the material_id and structure from it for all materials. The structure in this case is a CIF string. Users can again use pymatgen to read the CIF string and get the structure.

        from pymatgen.core import Structure

        # cif_string: the CIF text for one material, read from the dump
        structure = Structure.from_str(cif_string, fmt="cif")

    For the ICSD structures, users are required to have commercial ICSD access. Hence the structures are not provided here.

    2. Multi-fidelity molecular energy data

    The molecule_data.zip contains two datasets in JSON format.

    2.1 G4MP2.json contains calculation results at two fidelities, G4MP2 (6095) and B3LYP (130831), on QM9 molecules

        {"G4MP2": {"U0": {ID: G4MP2 energy (eV), ...},
                   "molecules": {ID: Pymatgen molecule dict, ...}},
         "B3LYP": {"U0": {ID: B3LYP energy (eV), ...},
                   "molecules": {ID: Pymatgen molecule dict, ...}}}

    2.2 qm7b.json contains the molecule energy calculation results for 7211 molecules using HF, MP2 and CCSD(T) methods with 6-31g, sto-3g and cc-pvdz bases.

        {"molecules": {ID: Pymatgen molecule dict, ...},
         "targets": {ID: {"HF": {"sto3g": Atomization energy (kcal/mol),
                                 "631g": Atomization energy (kcal/mol),
                                 "cc-pvdz": Atomization energy (kcal/mol)},
                          "MP2": {"sto3g": Atomization energy (kcal/mol),
                                  "631g": Atomization energy (kcal/mol),
                                  "cc-pvdz": Atomization energy (kcal/mol)},
                          "CCSD(T)": {"sto3g": Atomization energy (kcal/mol),
                                      "631g": Atomization energy (kcal/mol),
                                      "cc-pvdz": Atomization energy (kcal/mol)}},
                     ...}}
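
    Since the band gap dictionary is keyed first by fidelity and then by material ID, one natural operation is finding the materials that have values at two fidelities. The sketch below shows this on a toy dictionary with the same two-level layout; the helper name and the numbers are made up and do not come from the data set.

```python
def shared_ids(data: dict, fidelity_a: str, fidelity_b: str) -> list[str]:
    """Return the sorted material IDs present under both fidelity keys."""
    return sorted(data[fidelity_a].keys() & data[fidelity_b].keys())


# Toy stand-in for the real dictionary; the IDs and values are invented.
toy = {
    "pbe": {"mp-1": 1.0, "mp-2": 0.5},
    "hse": {"mp-2": 1.1, "mp-3": 2.0},
}
common = shared_ids(toy, "pbe", "hse")
# Difference between the two fidelities for each shared material.
gap_shift = {mp_id: toy["hse"][mp_id] - toy["pbe"][mp_id] for mp_id in common}
```
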

  17. Data from: Single-molecule mapping of replisome progression

    • datacatalog.mskcc.org
    Updated Oct 11, 2023
    Cite
    Claussin, Clémence; Vazquez, Jacob; Whitehouse, Iestyn (2023). Single-molecule mapping of replisome progression [Dataset]. https://datacatalog.mskcc.org/dataset/10918
    Explore at:
    Dataset updated
    Oct 11, 2023
    Dataset provided by
    MSK Library
    Authors
    Claussin, Clémence; Vazquez, Jacob; Whitehouse, Iestyn
    Description

    Summary from GEO:

    "Replicon-seq is a method to study the progression of sister replisomes during DNA replication. This method relies on excision of the full length of replicons by the fusion of MNase to MCM4 and sequencing via Nanopore technology."


    Overall design from GEO:

    "MCM4 was fused to Micrococcal nuclease (MNase) to generate DNA double-strand breaks at the site of replisomes. DNA ends are repaired and MinION-compatible DNA adaptors are ligated. Full-length molecules are sequenced. Because cells have been released from a G1 arrest in the presence of BrdU, we can select for replicon reads (reads that contain BrdU) informatically using DNAscent."

  18. pickled molecule data

    • kaggle.com
    zip
    Updated Jun 29, 2019
    Cite
    Cédric Bény (2019). pickled molecule data [Dataset]. https://www.kaggle.com/cedben/pickled-molecule-data
    Explore at:
    Available download formats: zip (231938147 bytes)
    Dataset updated
    Jun 29, 2019
    Authors
    Cédric Bény
    Description
  19. Coarse-Grained and Multi-Dimensional Data-Driven Molecular Generation: A...

    • zenodo.org
    zip
    Updated Sep 14, 2024
    + more versions
    Cite
    yurong zou (2024). Coarse-Grained and Multi-Dimensional Data-Driven Molecular Generation: A General Framework for Selective Inhibitor Design and Optimization in Structure-Based Drug Discovery [Dataset]. http://doi.org/10.5281/zenodo.13761486
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 14, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    yurong zou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2024
    Description

    Many approaches not only fail to consider the intricate binding pocket interactions, leading to molecules with suboptimal properties and stability, but also struggle with designing selective inhibitors. To address this challenge, we have developed an innovative structure-based three-dimensional molecular generation framework named Coarse-grained and Multi-dimensional Data-driven molecular generation (CMD-GEN). This framework bridges three-dimensional ligand-protein complex data with two-dimensional drug-like molecule data by utilizing coarse-grained pharmacophore points sampled from diffusion models, thereby enriching the training data for generative models. Through a hierarchical architecture, it decomposes the generation of three-dimensional molecules within the pocket into sampling of coarse-grained pharmacophore points, generation of chemical structures, and alignment of conformations, avoiding the instability issues inherent in deep generative model-based generation of molecular conformations.

    This project provides the source dataset used to train and evaluate the overall model.

  20. Small Molecule-Protein Interaction Data

    • kaggle.com
    zip
    Updated Apr 19, 2024
    Cite
    Indranil Bhattacharyya (2024). Small Molecule-Protein Interaction Data [Dataset]. https://www.kaggle.com/datasets/photon98/leash-bio-engineered-data-training
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 19, 2024
    Authors
    Indranil Bhattacharyya
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    About the Dataset and How I augmented the data:

    The dataset used in this augmentation process (a subset of the original training data) is sourced from the Leash Bio - Predict New Medicines with BELKA competition. It comprises examples of small molecules labeled through binary classification, indicating whether each molecule binds to one of three protein targets. The data were collected using DNA-encoded chemical library (DEL) technology.

    Chemical representations are expressed in SMILES (Simplified Molecular-Input Line-Entry System), while the labels denote binary binding classifications, corresponding to three distinct protein targets.

    I've expanded the original dataset by augmenting it with additional features derived from the existing data. Specifically, I've calculated and included three new features:

    • mol_wt (Molecular Weight): Calculated based on the SMILES data using RDKit, providing insight into the mass of each molecule.
    • logP (Partition Coefficient): Also derived from the SMILES data using RDKit, representing the logarithm of the partition coefficient, a measure of a molecule's hydrophobicity and its ability to partition between a hydrophobic solvent and water.
    • rotamers (Number of Rotamers): Determined from the SMILES data using RDKit, indicating the number of distinct conformations or rotational isomers a molecule can adopt.

    These additional features aim to enrich the feature matrix, potentially enhancing the predictive power of models trained on the augmented dataset.
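The description does not say which RDKit functions were used, so the following is only a minimal sketch of how such features could be derived from a SMILES string. The function name `describe` is hypothetical, and using the rotatable-bond count as a stand-in for the `rotamers` column is an assumption, not the author's confirmed method.

```python
def describe(smiles):
    """Return mol_wt, logP and a rotatable-bond count for one SMILES string."""
    # Imported lazily so this module still loads where RDKit is not installed.
    from rdkit import Chem
    from rdkit.Chem import Crippen, Descriptors, rdMolDescriptors

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # RDKit returns None for a SMILES string it cannot parse.
        return None
    return {
        "mol_wt": Descriptors.MolWt(mol),   # average molecular weight
        "logP": Crippen.MolLogP(mol),       # Wildman-Crippen logP estimate
        # Assumption: rotatable bonds are used as a proxy for "rotamers".
        "rotamers": rdMolDescriptors.CalcNumRotatableBonds(mol),
    }
```

Calling `describe("CCO")` on a machine with RDKit installed returns a dictionary with the three values for ethanol; unparseable input yields `None` so that bad rows can be filtered out before training.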

    Data Description:

    • id - A unique example_id we use to identify the molecule-binding target pair.
    • buildingblock1_smiles - The structure, in SMILES, of the first building block.
    • buildingblock2_smiles - The structure, in SMILES, of the second building block.
    • buildingblock3_smiles - The structure, in SMILES, of the third building block.
    • molecule_smiles - The structure of the fully assembled molecule, in SMILES. This includes the three building blocks and the triazine core. Note we use a [Dy] as the stand-in for the DNA linker.
    • protein_name - The protein target name.
    • binds - The target column. A binary class label of whether the molecule binds to the protein. Not available for the test set.
    • mol_wt - The molecule's molecular weight derived from SMILES data using RDKit.
    • logP - The logP of the molecule derived from SMILES data using RDKit.
    • rotamers - The number of rotamers of the molecule derived from SMILES data using RDKit.

    Targets: binds
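
As a rough illustration of how the columns described above fit together, the sketch below reads a CSV with a subset of those columns and computes the fraction of binders per protein target. The rows in `SAMPLE_CSV` and the names `TARGET_A` and `TARGET_B` are made up for this example and are not taken from the dataset; the helper `binder_rate_by_protein` is likewise hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; they are not taken from the dataset.
SAMPLE_CSV = """id,molecule_smiles,protein_name,binds
0,CCO,TARGET_A,0
1,CCN,TARGET_A,1
2,CCO,TARGET_B,0
3,CCN,TARGET_B,0
"""


def binder_rate_by_protein(handle):
    """Return {protein_name: fraction of rows with binds == 1}."""
    totals = defaultdict(int)
    binders = defaultdict(int)
    for row in csv.DictReader(handle):
        totals[row["protein_name"]] += 1
        binders[row["protein_name"]] += int(row["binds"])
    return {name: binders[name] / totals[name] for name in totals}


rates = binder_rate_by_protein(io.StringIO(SAMPLE_CSV))
```

Because each row is one molecule-target pair, grouping by `protein_name` before looking at `binds` gives a quick check of class balance for each target.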

    Proteins are encoded in the genome, and names of the genes encoding those proteins are typically bestowed by their discoverers and regulated by the HUGO Gene Nomenclature Committee. The protein products of these genes can sometimes have different names, often due to the history of their discovery.


SMILES DataSet for Analysis & Prediction Dataset

4 scholarly articles cite this dataset (View in Google Scholar)