100+ datasets found
  1. Small Molecule-Protein Interaction Data

    • kaggle.com
    zip
    Updated Apr 19, 2024
    Cite
    Indranil Bhattacharyya (2024). Small Molecule-Protein Interaction Data [Dataset]. https://www.kaggle.com/datasets/photon98/leash-bio-engineered-data-training
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 19, 2024
    Authors
    Indranil Bhattacharyya
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    About the Dataset and How I augmented the data:

    The dataset used in this augmentation process (a subset of the original training data) is sourced from the Leash Bio - Predict New Medicines with BELKA competition. It comprises examples of small molecules labeled by binary classification according to whether each molecule binds to one of three protein targets. The data were collected using DNA-encoded chemical library (DEL) technology.

    Chemical representations are expressed in SMILES (Simplified Molecular-Input Line-Entry System), while the labels denote binary binding classifications, corresponding to three distinct protein targets.

    I've expanded the original dataset by augmenting it with additional features derived from the existing data. Specifically, I've calculated and included three new features:

    • mol_wt (Molecular Weight): Calculated based on the SMILES data using RDKit, providing insight into the mass of each molecule.
    • logP (Partition Coefficient): Also derived from the SMILES data using RDKit, representing the logarithm of the partition coefficient, a measure of a molecule's hydrophobicity and its ability to partition between a hydrophobic solvent and water.
    • rotamers (Number of Rotamers): Determined from the SMILES data using RDKit, indicating the number of distinct conformations or rotational isomers a molecule can adopt.

    These additional features aim to enrich the feature matrix, potentially enhancing the predictive power of models trained on the augmented dataset.
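As an illustration of how the three features above might be derived from a SMILES string, here is a minimal sketch using RDKit. It is not the dataset author's code. The choice of `Descriptors.MolWt`, `Crippen.MolLogP`, and `Descriptors.NumRotatableBonds` (a rotatable-bond count, used here as a stand-in for the `rotamers` column) is an assumption, and the helper name `compute_features` is hypothetical. RDKit is a third-party package, so the import is guarded.

```python
from typing import Optional

try:
    from rdkit import Chem
    from rdkit.Chem import Crippen, Descriptors
    HAVE_RDKIT = True
except ImportError:  # RDKit is a third-party dependency and may be missing
    HAVE_RDKIT = False


def compute_features(smiles: str) -> Optional[dict]:
    """Return mol_wt, logP and a rotatable-bond count for one SMILES string.

    Hypothetical helper: returns None when RDKit is unavailable or when
    the SMILES string cannot be parsed.
    """
    if not HAVE_RDKIT:
        return None
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None for SMILES it cannot parse
        return None
    return {
        "mol_wt": Descriptors.MolWt(mol),    # average molecular weight
        "logP": Crippen.MolLogP(mol),        # Wildman-Crippen logP estimate
        # Assumption: rotatable-bond count as a proxy for the "rotamers" column
        "rotamers": Descriptors.NumRotatableBonds(mol),
    }


features = compute_features("CCCC")  # butane, used only as a toy input
print(features)
```

Applied to the `molecule_smiles` column, a helper like this would yield one feature row per molecule. Whether the published columns were computed with exactly these descriptor functions is not stated in the description.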

    Data Description:

    • id - A unique example_id we use to identify the molecule-binding target pair.
    • buildingblock1_smiles - The structure, in SMILES, of the first building block.
    • buildingblock2_smiles - The structure, in SMILES, of the second building block.
    • buildingblock3_smiles - The structure, in SMILES, of the third building block.
    • molecule_smiles - The structure of the fully assembled molecule, in SMILES. This includes the three building blocks and the triazine core. Note we use a [Dy] as the stand-in for the DNA linker.
    • protein_name - The protein target name.
    • binds - The target column. A binary class label of whether the molecule binds to the protein. Not available for the test set.
    • mol_wt - The molecule's molecular weight derived from SMILES data using RDKit.
    • logP - The logP of the molecule derived from SMILES data using RDKit.
    • rotamers - The number of rotamers of the molecule derived from SMILES data using RDKit.
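To make the column layout concrete, the sketch below reads a few rows with the `protein_name` and `binds` columns described above and computes the fraction of binders per protein target. The sample rows, the placeholder protein names `PROT_A` and `PROT_B`, and the helper name `binder_fraction` are all made up for illustration; the real file name and contents are not given in the description.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only: SMILES strings, protein names and labels are made up.
SAMPLE_CSV = """id,molecule_smiles,protein_name,binds
0,CCO[Dy],PROT_A,0
1,CCO[Dy],PROT_B,1
2,CCN[Dy],PROT_A,1
3,CCN[Dy],PROT_B,1
"""


def binder_fraction(csv_file) -> dict:
    """Return the fraction of rows with binds == 1 for each protein_name."""
    counts = defaultdict(lambda: [0, 0])  # protein -> [binders, total rows]
    for row in csv.DictReader(csv_file):
        entry = counts[row["protein_name"]]
        entry[0] += int(row["binds"])
        entry[1] += 1
    return {protein: b / n for protein, (b, n) in counts.items()}


fractions = binder_fraction(io.StringIO(SAMPLE_CSV))
print(fractions)
```

Because each `id` identifies a molecule-target pair, the same molecule appears once per protein, which is why the summary is grouped by `protein_name` rather than by molecule.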

    Targets: binds

    Proteins are encoded in the genome, and the names of the genes encoding those proteins are typically bestowed by their discoverers and regulated by the HUGO Gene Nomenclature Committee. The protein products of these genes can sometimes have different names, often due to the history of their discovery.

  2. H2O Molecule data for quantum computing

    • pennylane.ai
    Updated Nov 22, 2023
    + more versions
    Cite
    Utkarsh Azad; Stepan Fomichev (2023). H2O Molecule data for quantum computing [Dataset]. https://pennylane.ai/datasets/h2o-molecule
    Dataset updated
    Nov 22, 2023
    Authors
    Utkarsh Azad; Stepan Fomichev
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Measurement technique
    Simulation
    Dataset funded by
    Xanadu Quantum Technologies: https://xanadu.ai/
    Description

    This dataset contains Hamiltonian information, molecular data, VQE data, and tapering data for the H2O molecule using the STO-3G basis set at various bond lengths.

  3. Data from: Reading PDB: Perception of Molecules from 3D Atomic Coordinates

    • acs.figshare.com
    zip
    Updated May 31, 2023
    Cite
    Sascha Urbaczek; Adrian Kolodzik; Inken Groth; Stefan Heuser; Matthias Rarey (2023). Reading PDB: Perception of Molecules from 3D Atomic Coordinates [Dataset]. http://doi.org/10.1021/ci300358c.s001
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    ACS Publications
    Authors
    Sascha Urbaczek; Adrian Kolodzik; Inken Groth; Stefan Heuser; Matthias Rarey
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The analysis of small molecule crystal structures is a common way to gather valuable information for drug development. The necessary structural data is usually provided in specific file formats containing only element identities and three-dimensional atomic coordinates as reliable chemical information. Consequently, the automated perception of molecular structures from atomic coordinates has become a standard task in cheminformatics. The molecules generated by such methods must be both chemically valid and reasonable to provide a reliable basis for subsequent calculations. This can be a difficult task since the provided coordinates may deviate from ideal molecular geometries due to experimental uncertainties or low resolution. Additionally, the quality of the input data often differs significantly thus making it difficult to distinguish between actual structural features and mere geometric distortions. We present a method for the generation of molecular structures from atomic coordinates based on the recently published NAOMI model. By making use of this consistent chemical description, our method is able to generate reliable results even with input data of low quality. Molecules from 363 Protein Data Bank (PDB) entries could be perceived with a success rate of 98%, a result which could not be achieved with previously described methods. The robustness of our approach has been assessed by processing all small molecules from the PDB and comparing them to reference structures. The complete data set can be processed in less than 3 min, thus showing that our approach is suitable for large scale applications.

  4. MUSCLE (MUltiplexed Single-molecule Characterization at the Library scalE)...

    • figshare.scilifelab.se
    zip
    Updated Jan 15, 2025
    Cite
    Mikhail Panfilov; Guanzhong Mao; Jianfeng Guo; Javier Aguirre Rivera; Anton Sabantcev; Sebastian Deindl (2025). MUSCLE (MUltiplexed Single-molecule Characterization at the Library scalE) protocol data and codes [Dataset]. http://doi.org/10.17044/scilifelab.28008872.v1
    Available download formats: zip
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    Uppsala University
    Authors
    Mikhail Panfilov; Guanzhong Mao; Jianfeng Guo; Javier Aguirre Rivera; Anton Sabantcev; Sebastian Deindl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A test dataset for MUSCLE (MUltiplexed Single-molecule Characterization at the Library scalE) data analysis. See "\Python codes for MUSCLE data analysis\README.txt" for instructions on running the data analysis codes. Use the files in the "Test MUSCLE dataset" folder as input for the codes. "Test MUSCLE dataset\Output_tile1" contains the code output for the test dataset. The example dataset corresponds to one MiSeq tile in an experiment analyzing dCas9-induced R-loop formation for a library of 256 different target sequences. The latest version of the Python codes for matching single-molecule FRET traces with sequenced clusters is available at https://github.com/deindllab/MUSCLE/.

  5. H5 Molecule data for quantum computing

    • pennylane.ai
    Updated Nov 4, 2023
    Cite
    Utkarsh Azad; Stepan Fomichev (2023). H5 Molecule data for quantum computing [Dataset]. https://pennylane.ai/datasets/h5-molecule
    Dataset updated
    Nov 4, 2023
    Authors
    Utkarsh Azad; Stepan Fomichev
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Measurement technique
    Simulation
    Dataset funded by
    Xanadu Quantum Technologies: https://xanadu.ai/
    Description

    This dataset contains Hamiltonian information, molecular data, VQE data, and tapering data for the H5 molecule using the STO-3G basis set at various bond lengths.

  6. Data from: Library of Two Million Unique Small Molecules with Precalculated...

    • zenodo.org
    • repository.uantwerpen.be
    bin
    Updated Aug 8, 2024
    Cite
    Issar Arab; Kris Laukens; Wout Bittremieux (2024). Library of Two Million Unique Small Molecules with Precalculated Fingerprints, Descriptors, and Cardiotoxicity Inhibition Data [Dataset]. http://doi.org/10.5281/zenodo.11066707
    Available download formats: bin
    Dataset updated
    Aug 8, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Issar Arab; Kris Laukens; Wout Bittremieux
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository comprises a dataset of ~2 million unique compounds saved in an hdf5 small molecule library store, which includes the following fields for each molecule:

    • InChI key
    • Standardized SMILES string
    • Compound source
    • ChEMBL identifier if the compound exists in this open access database
    • 1024-bit Morgan fingerprint
    • 2048-bit Morgan fingerprint
    • 881-bit PubChem fingerprint
    • Length-854 vector of preprocessed and standardized Mordred descriptors
    • Cardiotoxicity inhibition predictions for each of the three cardiac ion channels (hERG, Nav1.5, and Cav1.2) using CtoxPred2, along with the model confidence scores

    The repository also includes a Jupyter notebook that serves as an initial guide for querying the small molecule library store. Export both files to the same folder, make sure approximately 40 GB of disk space is available, unzip the library store, and then launch the notebook to begin querying.

    Upon usage, please cite this publication:

    • Issar Arab, Kris Laukens, Wout Bittremieux, Semisupervised Learning to Boost hERG, Nav1.5, and Cav1.2 Cardiac Ion Channel Toxicity Prediction by Mining a Large Unlabeled Small Molecule Data Set, Journal of Chemical Information and Modeling, (2024). doi:10.1021/acs.jcim.4c01102
  7. Replication Data for Machine Learning Modeling of pKa

    • dataverse.harvard.edu
    • search.dataone.org
    • +1 more
    Updated Mar 30, 2024
    + more versions
    Cite
    Dongdong Zhang (2024). Replication Data for Machine Learning Modeling of pKa [Dataset]. http://doi.org/10.7910/DVN/6A67L9
    Metadata format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Mar 30, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Dongdong Zhang
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Cleaned public pKa datasets containing experimental and calculated data for ML modeling, together with protonated and deprotonated molecular topology graphs and the reaction centers.

  8. Data from: 3DMolNet: a generative network for molecular structures

    • archive.materialscloud.org
    • materialscloud-archive-failover.cineca.it
    Updated Nov 28, 2021
    + more versions
    Cite
    Materials Cloud (2021). 3DMolNet: a generative network for molecular structures [Dataset]. http://doi.org/10.24435/materialscloud:g6-ft
    Dataset updated
    Nov 28, 2021
    Dataset provided by
    Materials Cloud
    Description

    With the recent advances in machine learning for quantum chemistry, it is now possible to predict the chemical properties of compounds and to generate novel molecules. Existing generative models mostly use a string- or graph-based representation, but the precise three-dimensional coordinates of the atoms are usually not encoded. First attempts in this direction have been proposed, where autoregressive or GAN-based models generate atom coordinates. Those either lack a latent space in the autoregressive setting, such that a smooth exploration of the compound space is not possible, or cannot generalize to varying chemical compositions. We propose a new approach to efficiently generate molecular structures that are not restricted to a fixed size or composition. Our model is based on the variational autoencoder which learns a translation-, rotation-, and permutation-invariant low-dimensional representation of molecules. Our experiments yield a mean reconstruction error below 0.05 Angstrom, outperforming the current state-of-the-art methods by a factor of four, and which is even lower than the spatial quantization error of most chemical descriptors. The compositional and structural validity of newly generated molecules has been confirmed by quantum chemical methods in a set of experiments.

  9. Global import data of Small,molecule

    • volza.com
    csv
    Updated Oct 3, 2025
    Cite
    Volza.LLC (2025). Global import data of Small,molecule [Dataset]. https://www.volza.com/p/small-or-molecule/import/import-in-india/coo-united-states/
    Available download formats: csv
    Dataset updated
    Oct 3, 2025
    Dataset provided by
    Volza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Count of importers, Sum of import value, 2014-01-01/2021-09-30, Count of import shipments
    Description

    37532 Global import shipment records of Small,molecule with prices, volume & current Buyer's suppliers relationships based on actual Global export trade database.

  10. Single-molecule source data files

    • zenodo.org
    • explore.openaire.eu
    Updated Oct 4, 2020
    Cite
    Jeff Gelles (2020). Single-molecule source data files [Dataset]. http://doi.org/10.5281/zenodo.2530159
    Dataset updated
    Oct 4, 2020
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Jeff Gelles
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data archive contains single-molecule source data for "Delayed inhibition mechanism for secondary channel factor regulation of ribosomal RNA transcription" by Sarah K. Stumper, Harini Ravi, Larry J. Friedman, Rachel Anne Mooney, Ivan R. Corrêa, Jr., Anne Gershenson, Robert Landick, and Jeff Gelles.

    The data archive (doi: 10.5281/zenodo.2530159) provides files for each figure and figure supplement. The files are ‘intervals’ files readable by the imscroll program (https://github.com/gelles-brandeis/CoSMoS_Analysis).

  11. Structure - Molecular Modeling Database (MMDB)

    • catalog.data.gov
    • datadiscovery.nlm.nih.gov
    • +2 more
    Updated Feb 3, 2025
    + more versions
    Cite
    National Library of Medicine (2025). Structure - Molecular Modeling Database (MMDB) [Dataset]. https://catalog.data.gov/dataset/molecular-modeling-database-mmdb
    Dataset updated
    Feb 3, 2025
    Dataset provided by
    National Library of Medicine
    Description

    Three dimensional structures provide a wealth of information on the biological function and the evolutionary history of macromolecules. They can be used to examine sequence-structure-function relationships, interactions, active sites, and more.

  12. Data from: Chemical Shifts in Molecular Solids by Machine Learning Datasets

    • archive.materialscloud.org
    • materialscloud-archive-failover.cineca.it
    Updated Oct 22, 2019
    + more versions
    Cite
    Materials Cloud (2019). Chemical Shifts in Molecular Solids by Machine Learning Datasets [Dataset]. http://doi.org/10.24435/materialscloud:2019.0023/v2
    Dataset updated
    Oct 22, 2019
    Dataset provided by
    Materials Cloud
    Description

    We present a database of DFT calculations of energies and NMR chemical shifts for 4150 crystalline organic solids. The structures contain only H/C/N/O/S atoms and were subject to all-atom geometry optimisation. Calculations were carried out using Quantum Espresso and GIPAW.

  13. Data from: BEGAN: Boltzmann-Reweighted Data Augmentation for Enhanced...

    • figshare.com
    • acs.figshare.com
    xlsx
    Updated Nov 21, 2024
    Cite
    Jialei Dai; Yutong Zhang; Chen Shi; Yang Liu; Peng Xiu; Yong Wang (2024). BEGAN: Boltzmann-Reweighted Data Augmentation for Enhanced GAN-Based Molecule Design in Insect Pheromone Receptors [Dataset]. http://doi.org/10.1021/acs.jpcb.4c06729.s003
    Available download formats: xlsx
    Dataset updated
    Nov 21, 2024
    Dataset provided by
    ACS Publications
    Authors
    Jialei Dai; Yutong Zhang; Chen Shi; Yang Liu; Peng Xiu; Yong Wang
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Identifying small molecules that bind strongly to target proteins in rational molecular design is crucial. Machine learning techniques, such as generative adversarial networks (GAN), are now essential tools for generating such molecules. In this study, we present an enhanced method for molecule generation using objective-reinforced GANs. Specifically, we introduce BEGAN (Boltzmann-enhanced GAN), a novel approach that adjusts molecule occurrence frequencies during training based on the Boltzmann distribution exp(−ΔU/τ), where ΔU represents the estimated binding free energy derived from docking algorithms and τ is a temperature-related scaling hyperparameter. This Boltzmann reweighting process shifts the generation process toward molecules with higher binding affinities, allowing the GAN to explore molecular spaces with superior binding properties. The reweighting process can also be refined through multiple iterations without altering the overall distribution shape. To validate our approach, we apply it to the design of sex pheromone analogs targeting Spodoptera frugiperda pheromone receptor SfruOR16, illustrating that the Boltzmann reweighting significantly increases the likelihood of generating promising sex pheromone analogs with improved binding affinities to SfruOR16, further supported by atomistic molecular dynamics simulations. Furthermore, we conduct a comprehensive investigation into parameter dependencies and propose a reasonable range for the hyperparameter τ. Our method offers a promising approach for optimizing molecular generation for enhanced protein binding, potentially increasing the efficiency of molecular discovery pipelines.
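The Boltzmann factor exp(−ΔU/τ) mentioned above can be illustrated with a short sketch. This is not the authors' implementation: normalizing the factors into sampling weights, shifting by the minimum energy for numerical stability, the function name `boltzmann_weights`, and the example energies are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def boltzmann_weights(delta_u: Sequence[float], tau: float) -> List[float]:
    """Return normalized weights proportional to exp(-delta_u / tau).

    Lower (more favorable) estimated binding free energies receive
    larger weights; tau controls how sharply the weights concentrate.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    # Shifting by the minimum avoids overflow and cancels on normalization.
    u_min = min(delta_u)
    raw = [math.exp(-(u - u_min) / tau) for u in delta_u]
    total = sum(raw)
    return [w / total for w in raw]


# Hypothetical docking-derived energies for three molecules (arbitrary units).
energies = [-9.0, -7.0, -5.0]
weights = boltzmann_weights(energies, tau=2.0)
print(weights)
```

With these toy numbers the most favorable molecule gets roughly two thirds of the total weight. A larger τ flattens the weights toward uniform, while a smaller τ concentrates them on the best-scoring molecules, which is consistent with the description's point that τ needs a reasonable range.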

  14. Data from: Electrical Conductance of Molecular Junctions by a Robust...

    • acs.figshare.com
    • figshare.com
    zip
    Updated May 31, 2023
    Cite
    M. Teresa González; Songmei Wu; Roman Huber; Sense J. van der Molen; Christian Schönenberger; Michel Calame (2023). Electrical Conductance of Molecular Junctions by a Robust Statistical Analysis [Dataset]. http://doi.org/10.1021/nl061581e.s001
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    ACS Publications
    Authors
    M. Teresa González; Songmei Wu; Roman Huber; Sense J. van der Molen; Christian Schönenberger; Michel Calame
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    We propose an objective and robust method to extract the electrical conductance of single molecules connected to metal electrodes from a set of measured conductance data. Our method is rooted in the physics of tunneling and is tested on octanedithiol using mechanically controllable break junctions. The single-molecule conductance values can be deduced without the need for data selection.

  15. Data from: Quantum mechanical double slit for molecular scattering

    • datadryad.org
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated Aug 26, 2021
    Cite
    Haowen Zhou; William E. Perreault; Nandini Mukherjee; Richard N. Zare (2021). Quantum mechanical double slit for molecular scattering [Dataset]. http://doi.org/10.5061/dryad.jh9w0vtcb
    Available download formats: zip
    Dataset updated
    Aug 26, 2021
    Dataset provided by
    Dryad
    Authors
    Haowen Zhou; William E. Perreault; Nandini Mukherjee; Richard N. Zare
    Time period covered
    2021
    Description

    Interference observed in a double-slit experiment most conclusively demonstrates the wave properties of particles. We construct a quantum mechanical double-slit interferometer by rovibrationally exciting D2 (v=2, j=2) molecules in a biaxial state using Stark-induced adiabatic Raman passage. In D2(v=2, j=2)→D2(v=2, j'=0) rotational relaxation via a cold collision with ground state He, the entangled bond axis orientations in the biaxial state act as two slits generating two indistinguishable quantum mechanical pathways connecting initial and final states of the colliding system. The interference disappears when we decouple the two orientations of the bond axis by separately constructing the uniaxial states of D2, unequivocally establishing the double-slit action of the biaxial state. This double slit opens new possibilities in the coherent control of molecular collisions.

  16. Data from: Six-Dimensional Single-Molecule Imaging with Isotropic Resolution...

    • osf.io
    Updated Feb 2, 2023
    Cite
    Matthew D. Lew; Oumeng Zhang (2023). Six-Dimensional Single-Molecule Imaging with Isotropic Resolution using a Multi-View Reflector Microscope [Dataset]. http://doi.org/10.17605/OSF.IO/S8M4X
    Dataset updated
    Feb 2, 2023
    Dataset provided by
    Center For Open Science
    Authors
    Matthew D. Lew; Oumeng Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data supporting the manuscript: Zhang, O., Guo, Z., He, Y. et al. Six-dimensional single-molecule imaging with isotropic resolution using a multi-view reflector microscope. Nat. Photon. 17, 179–186 (2023). https://doi.org/10.1038/s41566-022-01116-6

  17. H8 Molecule data for quantum computing

    • pennylane.ai
    Updated Nov 4, 2023
    Cite
    Utkarsh Azad; Stepan Fomichev (2023). H8 Molecule data for quantum computing [Dataset]. https://pennylane.ai/datasets/h8-molecule
    Dataset updated
    Nov 4, 2023
    Authors
    Utkarsh Azad; Stepan Fomichev
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Measurement technique
    Simulation
    Dataset funded by
    Xanadu Quantum Technologies: https://xanadu.ai/
    Description

    This dataset contains Hamiltonian information, molecular data, VQE data, and tapering data for the H8 molecule using the STO-3G basis set at various bond lengths.

  18. Replication Data for: Predicting the binding of small molecules to proteins...

    • dataverse.unimi.it
    bin, chemical/x-pdb
    Updated Apr 29, 2024
    Cite
    Guido Tiana (2024). Replication Data for: Predicting the binding of small molecules to proteins through invariant representation of the molecular structure [Dataset]. http://doi.org/10.13130/RD_UNIMI/5879ZG
    Available download formats: bin(55616783), bin(4128), bin(233672), bin(86982), bin(152), bin(7580670), bin(248), chemical/x-pdb(2430), bin(8072), bin(1707959), bin(438990), bin(891656), bin(1895550), bin(7580), bin(17751), bin(36787470), bin(578267464), bin(38857), bin(1759470), bin(25322), bin(3827), bin(16799), bin(16873), bin(3136), bin(200), bin(2988), bin(1121739), bin(60248), bin(147133230), bin(248856), bin(2096), bin(216842), bin(336997712), bin(60128), bin(12692), bin(58520), bin(12160), bin(5917), bin(3080), bin(84550), bin(3500517), bin(100778), bin(750854), bin(866570), bin(3260), bin(336070)
    Dataset updated
    Apr 29, 2024
    Dataset provided by
    UNIMI Dataverse
    Authors
    Guido Tiana
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Network parameters and datasets to reproduce the results of the article "Predicting the binding of small molecules to proteins through invariant representation of the molecular structure" by R. Beccaria, A. Lazzeri and G. Tiana. The scripts and the instructions can be obtained from https://github.com/guidotiana/Milbinding

  19. Example Cytidine data set from I19-1 at Diamond Light Source

    • zenodo.org
    • data.niaid.nih.gov
    bz2
    Updated Jan 24, 2020
    Cite
    Graeme Winter; Markus Gerstel; David Allan; Mark Warren; Harriott Nowell; Sarah Barnett (2020). Example Cytidine data set from I19-1 at Diamond Light Source [Dataset]. http://doi.org/10.5281/zenodo.33555
    Available download formats: bz2
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Graeme Winter; Markus Gerstel; David Allan; Mark Warren; Harriott Nowell; Sarah Barnett
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data set recorded in 6 scans, three omega and three phi, on the updated Diamond Light Source beamline I19-1, as part of routine commissioning. These data have been processed several times with different packages and are being made available to the community as an example set to allow other data processing software authors to verify the content of the data and headers.

  20. DrugBank - Open Data Drug and Drug Target Database

    • researchdata.edu.au
    Updated May 2, 2013
    + more versions
    Cite
    QFAB (2013). DrugBank - Open Data Drug and Drug Target Database [Dataset]. https://researchdata.edu.au/drugbank-open-drug-target-database/14044
    Dataset updated
    May 2, 2013
    Dataset provided by
    QFAB
    Description

    The DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information. The database contains 6712 drug entries including 1448 FDA-approved small molecule drugs, 131 FDA-approved biotech (protein/peptide) drugs, 85 nutraceuticals and 5080 experimental drugs. Additionally, 4227 non-redundant protein (i.e. drug target/enzyme/transporter/carrier) sequences are linked to these drug entries. Each DrugCard entry contains more than 150 data fields, with half of the information devoted to drug/chemical data and the other half devoted to drug target or protein data. DrugBank is supported by David Wishart, Departments of Computing Science and Biological Sciences, University of Alberta. DrugBank is also supported by The Metabolomics Innovation Centre, a Genome Canada-funded core facility serving the scientific community and industry with world-class expertise and cutting-edge technologies in metabolomics.


Small Molecule-Protein Interaction Data

Enriched DataSet for Enhanced Small Molecule Binding Analysis

Explore at:
10 scholarly articles cite this dataset (View in Google Scholar)
