67 datasets found
  1. approved_drug_target Dataset

    • paperswithcode.com
    Updated Nov 19, 2024
    Cite
    Mahsa Sheikholeslami; Navid Mazrouei; Yousof Gheisari; Afshin Fasihi; Matin Irajpour; Ali Motahharynia (2024). approved_drug_target Dataset [Dataset]. https://paperswithcode.com/dataset/approved-drug-target
    Dataset updated
    Nov 19, 2024
    Authors
    Mahsa Sheikholeslami; Navid Mazrouei; Yousof Gheisari; Afshin Fasihi; Matin Irajpour; Ali Motahharynia
    Description

    This dataset provides a curated collection of approved drug Simplified Molecular Input Line Entry System (SMILES) strings and their associated protein sequences. Each small molecule has been approved by at least one regulatory body, ensuring the safety and relevance of the data for computational applications. The dataset includes 1,660 approved small molecules and their 2,093 related protein targets.

  2. SMILES strings for compounds

    • search.datacite.org
    • figshare.com
    Updated Jan 19, 2016
    Cite
    Rakesh Baboo (2016). SMILES strings for compounds [Dataset]. http://doi.org/10.6084/m9.figshare.1412636.v1
    Dataset updated
    Jan 19, 2016
    Dataset provided by
    DataCite (https://www.datacite.org/)
    Figshare (http://figshare.com/)
    Authors
    Rakesh Baboo
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SMILES strings provided for the set of compounds

  3. PubChemSMILESCanonTitlePC

    • huggingface.co
    Updated Jan 3, 2025
    Cite
    BASF (2025). PubChemSMILESCanonTitlePC [Dataset]. https://huggingface.co/datasets/BASF-AI/PubChemSMILESCanonTitlePC
    Dataset updated
    Jan 3, 2025
    Dataset authored and provided by
    BASF (http://basf.com/)
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    PubChem Canonical SMILES and Titles Pair Classification

    This dataset contains pairs of canonical SMILES strings and their corresponding entity titles, with labels indicating whether they refer to the same chemical entity. A label of 1 means the SMILES string and the title correspond to the same entity, while a label of 0 indicates they do not. The dataset is sourced from PubChem (ChEBI source), and it provides valuable information for tasks involving chemical entity matching.

  4. Data from: How does Transformer model evolve to learn diverse chemical...

    • springernature.figshare.com
    zip
    Updated Feb 17, 2024
    Cite
    Tadahaya Mizuno (2024). How does Transformer model evolve to learn diverse chemical structures? [Dataset]. http://doi.org/10.6084/m9.figshare.22736528.v1
    Dataset updated
    Feb 17, 2024
    Dataset provided by
    figshare
    Authors
    Tadahaya Mizuno
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    source codes

  5. SMILES-18 dataset

    • zenodo.org
    • data.niaid.nih.gov
    Updated May 31, 2023
    Cite
    Ignacio Pérez-Correa (2023). SMILES-18 dataset [Dataset]. http://doi.org/10.5281/zenodo.7978077
    Dataset updated
    May 31, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ignacio Pérez-Correa
    Description

    Dataset of organic molecules encoded as SMILES strings, with 18,322,500 records collected from the PubChem database.

    List of characters included in the dataset:

    Description      SMILES characters
    Atoms            "C", "O", "N", "P", "S", "F", "Cl", "Br", "I", "Si", "B"
    Branches         "(", ")"
    Rings            "1", "2", "3", "4", "5", "6", "7", "8", "9"
    Bonds            "=", "#"
    Ions             "+", "-"
    Stereochemistry  "/", "\"
    Miscellaneous    "[", "]"
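
    The character table above can be read as a token vocabulary. The following is a minimal sketch, not the dataset author's code, of how one might check whether a SMILES string uses only the listed tokens. The function name `tokenize` and the greedy two-letter-first matching are assumptions made for illustration; the check covers vocabulary only and says nothing about chemical validity.

```python
import re

# Vocabulary copied from the character table above. Two-letter atoms come
# first so that "Cl" is not read as "C" followed by an unknown "l".
TOKENS = ["Cl", "Br", "Si", "C", "O", "N", "P", "S", "F", "I", "B",
          "(", ")", "1", "2", "3", "4", "5", "6", "7", "8", "9",
          "=", "#", "+", "-", "/", "\\", "[", "]"]
TOKEN_RE = re.compile("|".join(re.escape(t) for t in TOKENS))


def tokenize(smiles):
    """Split a SMILES string into listed tokens.

    Returns None as soon as a character outside the listed vocabulary
    is found (hypothetical helper, for illustration only).
    """
    tokens = []
    pos = 0
    while pos < len(smiles):
        match = TOKEN_RE.match(smiles, pos)
        if match is None:
            return None
        tokens.append(match.group())
        pos = match.end()
    return tokens
```

    With this sketch, `tokenize("CC(=O)Cl")` splits into seven tokens ending in `"Cl"`, while a string written with lowercase aromatic atoms such as `"c1ccccc1"` returns `None`, because lowercase symbols do not appear in the table as listed here.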

  6. EFGs: A Complete and Accurate Implementation of Ertl’s Functional Group...

    • acs.figshare.com
    • figshare.com
    xlsx
    Updated Jan 29, 2025
    Cite
    Gonzalo Colmenarejo (2025). EFGs: A Complete and Accurate Implementation of Ertl’s Functional Group Detection Algorithm in RDKit [Dataset]. http://doi.org/10.1021/acs.jcim.4c02268.s001
    Dataset updated
    Jan 29, 2025
    Dataset provided by
    ACS Publications
    Authors
    Gonzalo Colmenarejo
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Functional groups are widely used in organic chemistry because they provide a rationale to analyze physicochemical and reactivity properties. In medicinal chemistry, they are the basis for analyzing ligand–biomacromolecule interactions. Ertl’s algorithm is an approach to extract functional groups in arbitrary organic molecules that does not depend on predefined libraries of functional groups. However, the widely used RDKit cheminformatics toolkit lacks a complete and accurate implementation of Ertl’s algorithm. In this paper, a new RDKit/Python implementation of the algorithm is described that is both accurate and complete. For an RDKit molecule, it provides (i) a PNG binary string with an image of the molecule with color-highlighted functional groups; (ii) a list of sets of atom indices (idx), each set corresponding to a functional group; (iii) a list of pseudo-SMILES canonicalized strings for the full functional groups; and (iv) a list of RDKit labeled mol objects, one for each full functional group. The code is freely available at https://github.com/bbu-imdea/efgs and is part of the RDKit Contrib directory (https://github.com/rdkit/rdkit/tree/master/Contrib/efgs).

  7. PubChemSMILESIsoTitleBM

    • huggingface.co
    Updated Jan 3, 2025
    Cite
    PubChemSMILESIsoTitleBM [Dataset]. https://huggingface.co/datasets/BASF-AI/PubChemSMILESIsoTitleBM
    Dataset updated
    Jan 3, 2025
    Dataset authored and provided by
    BASF (http://basf.com/)
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    PubChem Isomeric SMILES and Titles Bitext Mining

    This dataset contains two separate lists: one of isomeric SMILES strings and the other of corresponding entity titles, both sourced from PubChem (ChEBI source). The task is to identify matching pairs between the SMILES strings and the titles, where each SMILES string from the first list should be aligned with its corresponding entity title from the second list. The dataset is intended for bitext mining tasks, where the goal is to… See the full description on the dataset page: https://huggingface.co/datasets/BASF-AI/PubChemSMILESIsoTitleBM.

  8. Supplementary Files - SMILES strings for DOSEDO enumerated library

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 15, 2023
    Cite
    Faust, Ann Marie (2023). Supplementary Files - SMILES strings for DOSEDO enumerated library [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8136903
    Dataset updated
    Jul 15, 2023
    Dataset authored and provided by
    Faust, Ann Marie
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary Files for Nature Communications publication titled "Diversity-oriented synthesis encoded by deoxyoligonucleotides". Two files contain the Iodo-library and Bromo-library SMILES strings.

  9. PubChemSMILESIsoTitlePC

    • huggingface.co
    Updated Jan 3, 2025
    Cite
    BASF (2025). PubChemSMILESIsoTitlePC [Dataset]. https://huggingface.co/datasets/BASF-AI/PubChemSMILESIsoTitlePC
    Dataset updated
    Jan 3, 2025
    Dataset authored and provided by
    BASF (http://basf.com/)
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    PubChem Isomeric SMILES and Titles Pair Classification

    This dataset contains pairs of isomeric SMILES strings and their corresponding entity titles, with labels indicating whether they refer to the same chemical entity. A label of 1 means the SMILES string and the title correspond to the same entity, while a label of 0 indicates they do not. The dataset is sourced from PubChem (ChEBI source), and it provides valuable information for tasks involving chemical entity matching.

  10. Table of SMILES for compounds listed

    • figshare.com
    docx
    Updated Jan 19, 2016
    Cite
    Embir Jaspal (2016). Table of SMILES for compounds listed [Dataset]. http://doi.org/10.6084/m9.figshare.1416131.v1
    Dataset updated
    Jan 19, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Embir Jaspal
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Table including the SMILES strings for the compounds listed in the paper

  11. Data from: Library of Two Million Unique Small Molecules with Precalculated...

    • zenodo.org
    • repository.uantwerpen.be
    bin
    Updated Aug 8, 2024
    Cite
    Issar Arab; Kris Laukens; Wout Bittremieux (2024). Library of Two Million Unique Small Molecules with Precalculated Fingerprints, Descriptors, and Cardiotoxicity Inhibition Data [Dataset]. http://doi.org/10.5281/zenodo.11066707
    Dataset updated
    Aug 8, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Issar Arab; Kris Laukens; Wout Bittremieux
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository comprises a dataset of ~2 million unique compounds saved in an hdf5 small molecule library store, which includes the following fields for each molecule:

    • InChI key
    • Standardized SMILES string
    • Compound source
    • ChEMBL identifier if the compound exists in this open access database
    • 1024-bit Morgan fingerprint
    • 2048-bit Morgan fingerprint
    • 881-bit PubChem fingerprint
    • Vector of 854 preprocessed and standardized Mordred descriptors
    • Cardiotoxicity inhibition predictions for each of the three cardiac ion channels (hERG, Nav1.5, and Cav1.2) from CtoxPred2, along with the model confidence scores

    The repository also includes a Jupyter notebook that serves as an initial guide for querying the small molecule library store. Save both files to the same folder, make sure approximately 40 GB of disk space is available, unzip the library store, and then launch the notebook to begin querying.

    If you use this dataset, please cite this publication:

    • Issar Arab, Kris Laukens, Wout Bittremieux. Semisupervised Learning to Boost hERG, Nav1.5, and Cav1.2 Cardiac Ion Channel Toxicity Prediction by Mining a Large Unlabeled Small Molecule Data Set. Journal of Chemical Information and Modeling (2024). https://doi.org/10.1021/acs.jcim.4c01102
  12. Data from: Benchmark Data Set for in Silico Prediction of Ames Mutagenicity

    • acs.figshare.com
    zip
    Updated May 30, 2023
    Cite
    Katja Hansen; Sebastian Mika; Timon Schroeter; Andreas Sutter; Antonius ter Laak; Thomas Steger-Hartmann; Nikolaus Heinrich; Klaus-Robert Müller (2023). Benchmark Data Set for in Silico Prediction of Ames Mutagenicity [Dataset]. http://doi.org/10.1021/ci900161g.s001
    Dataset updated
    May 30, 2023
    Dataset provided by
    ACS Publications
    Authors
    Katja Hansen; Sebastian Mika; Timon Schroeter; Andreas Sutter; Antonius ter Laak; Thomas Steger-Hartmann; Nikolaus Heinrich; Klaus-Robert Müller
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Up to now, publicly available data sets to build and evaluate Ames mutagenicity prediction tools have been very limited in terms of size and chemical space covered. In this report we describe a new unique public Ames mutagenicity data set comprising about 6500 nonconfidential compounds (available as SMILES strings and SDF) together with their biological activity. Three commercial tools (DEREK, MultiCASE, and an off-the-shelf Bayesian machine learner in Pipeline Pilot) are compared with four noncommercial machine learning implementations (Support Vector Machines, Random Forests, k-Nearest Neighbors, and Gaussian Processes) on the new benchmark data set.

  13. ProtChemSI

    • neuinfo.org
    • dknet.org
    • +1 more
    Updated Mar 2, 2025
    Cite
    (2025). ProtChemSI [Dataset]. http://identifiers.org/RRID:SCR_006115
    Dataset updated
    Mar 2, 2025
    Description

    The database of protein-chemical structural interactions includes all existing 3D structures of complexes of proteins with low molecular weight ligands. When one considers the proteins and chemicals as vertices of a graph, all these interactions form a network. Biological networks are powerful tools for predicting undocumented relationships between molecules. The underlying principle is that existing interactions between molecules can be used to predict new interactions. For pairs of proteins sharing a common ligand, we use protein and chemical superimpositions combined with fast structural compatibility screens to predict whether additional compounds bound by one protein would bind the other. The current version includes data from the Protein Data Bank as of August 2011. The database is updated monthly.

  14. Data from: ChemTastesDB: A Curated Database of Molecular Tastants

    • zenodo.org
    • data.niaid.nih.gov
    bin, pdf
    Updated Jul 16, 2024
    Cite
    Cristian Rojas; Davide Ballabio; Karen Pacheco Sarmiento; Elisa Pacheco Jaramillo; Mateo Mendoza; Fernando García (2024). ChemTastesDB: A Curated Database of Molecular Tastants [Dataset]. http://doi.org/10.5281/zenodo.6528835
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Cristian Rojas; Davide Ballabio; Karen Pacheco Sarmiento; Elisa Pacheco Jaramillo; Mateo Mendoza; Fernando García
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ChemTastesDB is a database that includes curated information on 2944 molecular tastants. ChemTastesDB constitutes a useful tool for the scientific community to expand the information on molecular tastants, which could assist in the analysis of the relationships between molecular structure and taste, as well as in silico (QSAR) studies for taste prediction by means of diverse machine learning approaches.

    Molecules are labelled with one of the five basic tastes (sweet, bitter, umami, sour and salty) or with one of the other classes related to non-basic tastes (tasteless, non-sweet, multitaste and miscellaneous). ChemTastesDB provides the following information for each molecule: name, PubChem CID, CAS registry number, canonical SMILES string, taste class and the reference to the scientific sources from which the data were retrieved. Moreover, the molecular structure of each chemical is provided in the HyperChem (.hin) format.

    This is version 1.2 of the ChemTastesDB.

    What's new in version 1.2:

    Chemical information (for instance, name, PubChem CID or CAS number) for some tastants has been included.

    The database is freeware and may be used if proper reference is given to the authors. Preferably refer to the following paper:
    Rojas, C., Ballabio, D., Pacheco Sarmiento, K., Pacheco Jaramillo, E., Mendoza, M., & García, F. (2022). ChemTastesDB: A curated database of molecular tastants. Food Chemistry: Molecular Sciences, 4, 100090. https://doi.org/10.1016/j.fochms.2022.100090.

  15. Data from: A consensus compound/bioactivity dataset for data-driven drug...

    • zenodo.org
    • explore.openaire.eu
    • +1 more
    zip
    Updated May 13, 2022
    Cite
    Laura Isigkeit; Apirat Chaikuad; Daniel Merk (2022). A consensus compound/bioactivity dataset for data-driven drug design and chemogenomics [Dataset]. http://doi.org/10.5281/zenodo.6320761
    Dataset updated
    May 13, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Laura Isigkeit; Apirat Chaikuad; Daniel Merk
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Information

    The diverse publicly available compound/bioactivity databases constitute a key resource for data-driven applications in chemogenomics and drug design. Analysis of their coverage of compound entries and biological targets revealed considerable differences, however, suggesting the benefit of a consensus dataset. Therefore, we have combined and curated information from five esteemed databases (ChEMBL, PubChem, BindingDB, IUPHAR/BPS and Probes&Drugs) to assemble a consensus compound/bioactivity dataset comprising 1144803 compounds with 10915362 bioactivities on 5613 targets (including defined macromolecular targets as well as cell-lines and phenotypic readouts). It also provides simplified information on assay types underlying the bioactivity data and on bioactivity confidence by comparing data from different sources. We have unified the source databases, brought them into a common format and combined them, making the dataset easy to use generically in multiple applications such as chemogenomics and data-driven drug design.

    The consensus dataset provides increased target coverage and contains a higher number of molecules compared to the source databases which is also evident from a larger number of scaffolds. These features render the consensus dataset a valuable tool for machine learning and other data-driven applications in (de novo) drug design and bioactivity prediction. The increased chemical and bioactivity coverage of the consensus dataset may improve robustness of such models compared to the single source databases. In addition, semi-automated structure and bioactivity annotation checks with flags for divergent data from different sources may help data selection and further accurate curation.

    Structure and content of the dataset

    Dataset structure

    ChEMBL ID | PubChem ID | IUPHAR ID | Target | Activity type | Assay type | Unit | Mean C (0) | ... | Mean PC (0) | ... | Mean B (0) | ... | Mean I (0) | ... | Mean PD (0) | ... | Activity check annotation | Ligand names | Canonical SMILES C | ... | Structure check | Source

    The dataset was created using the Konstanz Information Miner (KNIME) (https://www.knime.com/) and was exported as a CSV-file and a compressed CSV-file.

    Except for the canonical SMILES columns, all columns have the datatype ‘string’. The canonical SMILES columns use the SMILES format. We recommend the File Reader node for loading the dataset in KNIME: it allows the data types of the columns to be set exactly, and it is the only node that can read the compressed format.

    Column content:

    • ChEMBL ID, PubChem ID, IUPHAR ID: compound identifiers from the respective databases
    • Target: biological target of the molecule expressed as the HGNC gene symbol
    • Activity type: for example, pIC50
    • Assay type: Simplification/Classification of the assay into cell-free, cellular, functional and unspecified
    • Unit: unit of bioactivity measurement
    • Mean columns of the databases: mean of bioactivity values or activity comments denoted with the frequency of their occurrence in the database, e.g. Mean C = 7.5 *(15) -> the value for this compound-target pair occurs 15 times in ChEMBL database
    • Activity check annotation: a bioactivity check was performed by comparing values from the different sources and adding an activity check annotation to provide automated activity validation for additional confidence
      • no comment: bioactivity values are within one log unit;
      • check activity data: bioactivity values are not within one log unit;
      • only one data point: only one value was available, no comparison and no range calculated;
      • no activity value: no precise numeric activity value was available;
      • no log-value could be calculated: no negative decadic logarithm could be calculated, e.g., because the reported unit was not a compound concentration
    • Ligand names: all unique names contained in the five source databases are listed
    • Canonical SMILES columns: Molecular structure of the compound from each database
    • Structure check: denotes whether compound structures match or differ across the source databases
      • match: molecule structures are the same between different sources;
      • no match: the structures differ;
      • 1 source: no structure comparison is possible, because the molecule comes from only one source database.
    • Source: the databases from which the data come
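
    The mean columns and the activity check annotation described above can be illustrated with a small sketch. It is not the authors' KNIME workflow. It assumes that a mean cell looks like the example `7.5 *(15)`, that "within one log unit" means the spread between the largest and smallest source mean is at most 1.0, and that "only one data point" means a single numeric value across all sources. The names `parse_mean` and `activity_check` are hypothetical, and the "no log-value could be calculated" case is left out.

```python
import re

# Assumed cell layout, based on the example "Mean C = 7.5 *(15)":
# a numeric value, then "*(" + occurrence count + ")".
MEAN_RE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*\*\((\d+)\)\s*$")


def parse_mean(cell):
    """Parse a cell such as '7.5 *(15)' into (7.5, 15).

    Returns None for empty cells or non-numeric activity comments.
    """
    match = MEAN_RE.match(cell or "")
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


def activity_check(cells):
    """Derive an annotation from the mean cells of the source databases."""
    parsed = [parse_mean(cell) for cell in cells]
    values = [p[0] for p in parsed if p is not None]
    if not values:
        return "no activity value"
    if len(values) == 1:
        return "only one data point"
    # Assumption: "within one log unit" is the max-min spread of the means.
    if max(values) - min(values) <= 1.0:
        return "no comment"
    return "check activity data"
```

    For example, `activity_check(["7.5 *(15)", "7.9 *(2)"])` yields `"no comment"`, while `activity_check(["7.5 *(15)", "5.0 *(1)"])` yields `"check activity data"`.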

  16. Vaska's space

    • search.dataone.org
    • borealisdata.ca
    Updated Dec 28, 2023
    Cite
    Friederich, Pascal (2023). Vaska's space [Dataset]. https://search.dataone.org/view/sha256%3A3b051c2e04eacf25d5cc4aabd90efde6ad2a0cdd0b1ac1d7c47c206c9b9be1a5
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Friederich, Pascal
    Description

    This repository contains a list of 1947 iridium complexes, including their geometry (xyz files) and energy barriers for the hydrogen splitting reaction.

    Data

    Coordinates in .xyz format
    - data/coordinates_complex: coordinates of all complexes without additional hydrogen
    - data/coordinates_TS: coordinates of all complexes with an additional hydrogen molecule in transition state
    - data/coordinates_molSimplify: coordinates of all complexes generated with molSimplify developed by the Kulik group

    Properties in .csv format (data/vaskas_features_properties_smiles_filenames.csv)
    - "smiles": SMILES strings of all molecules
    - "filename": corresponding xyz filename
    - "barrier": DFT computed energy barrier [kcal/mol] for the transition state of the hydrogen splitting reaction
    - "distance": DFT computed H-H distance in the transition state geometry
    - "chi-X", "Z-X", "T-X", "I-X" and "S-X": (auto)correlation features described in our paper
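
    As a rough illustration of how the properties file might be read, the sketch below parses a CSV with the column names listed above using Python's standard `csv` module. It assumes a comma-delimited file with a header row. The two sample rows are made-up placeholders, not values from the dataset, and `load_barriers` is a hypothetical helper name.

```python
import csv
import io

# Placeholder rows for illustration only; they are NOT taken from the dataset.
SAMPLE = """smiles,filename,barrier,distance
SMILES_A,complex_a.xyz,12.5,0.95
SMILES_B,complex_b.xyz,8.0,1.10
"""


def load_barriers(handle):
    """Map each xyz filename to its energy barrier in kcal/mol."""
    reader = csv.DictReader(handle)
    return {row["filename"]: float(row["barrier"]) for row in reader}


barriers = load_barriers(io.StringIO(SAMPLE))
# Filename of the complex with the lowest barrier in the sample.
lowest = min(barriers, key=barriers.get)
```

    To use the real file, one would pass an open file handle for data/vaskas_features_properties_smiles_filenames.csv instead of the in-memory sample; extra columns such as the autocorrelation features are simply ignored by this helper.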

  17. Chemical structures, Cell Painting and transcriptional profiles for compound...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 5, 2023
    Cite
    Anne E. Carpenter (2023). Chemical structures, Cell Painting and transcriptional profiles for compound bioactivity prediction. [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7729582
    Dataset updated
    Apr 5, 2023
    Dataset provided by
    Tim Becker
    Peter Horvath
    Kevin Yang
    Nikita Moshkov
    Juan C. Caicedo
    Anne E. Carpenter
    Bridget K. Wagner
    Vlado Dancik
    Shantanu Singh
    Paul A. Clemons
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the related data, both input and produced, for the paper "Predicting compound activity from phenotypic profiles and chemical structures".

    This data can be merged with the paper's GitHub repository for reproduction.

    Folders and files are described below:

    ├── assay_data
    │   ├── assay_matrix_discrete_270_assays.csv  Assay matrix with hits for assays (270) and compounds (16170). Note that this is the final file that we used to produce splits.
    │   ├── assay_metadata.csv  Assay metadata
    │   ├── broad_ids.txt  List of broad ids used in this study. That is an unfiltered list of compounds required by some analysis scripts.
    │   ├── smiles.txt  Same as broad_ids.txt, but SMILES strings.

    ├── feature_data (for 16978 compounds, can be masked with ./misc/compounds16978to16170.npy)
    │   ├── cp.npz  Classical chemical features
    │   ├── ge.npz  Gene expression features
    │   ├── ge_scale.npz  Gene expression scaled features
    │   ├── mo.npz  Morphology features (not batch corrected)
    │   ├── mobc.npz  Morphology features (batch corrected)

    ├── misc
    │   ├── compound_analysis.npz  Compounds in the dataset identified as PAINS
    │   ├── compounds16978to16170.npy  Used to filter features from the bigger set of compounds to the final one
    │   ├── fingerprints.npz  Calculated fingerprints of compounds, those were then used to calculate similarity
    │   ├── similarity_fingerprints.npz  Similarity matrix for compounds (16978)
    │   ├── population_normalized.csv.gz  Well-level morphological profiles that were used for batch-correction
    │   ├── Table for PUMA Excel file with additional data and plots

    ├── predictions
    │   ├── scaffold_median(mean)_AUC.csv  Aggregated median(mean) AUC scores over scaffold-based cross-validation splits. In the paper, median results were reported.
    │   ├── scaffold_median(mean)_EF.csv  Aggregated median(mean) enrichment factor (EF) over scaffold-based cross-validation splits. In the paper, median results were reported.
    │   ├── toprank_chemical_cv{}_hitsnorm.csv  Those files are needed to create enrichment plots and contain hit rate and top rank hit rate.
    │   ├── Each folder here stands for an experiment type; the number in the folder name is the number of the split. Inside each folder there are the following elements:
    │   │   ├── predictions  Folder with predictions for each assay-compound pair for each modality
    │   │   ├── 2022_01_evaluation_all_data.csv  File with AUC scores for each assay for the test set in the split
    │   │   ├── 2022_01_evaluation_all_data_EF.csv  File with enrichment factor (EF) values for each assay for the test set in the split. Those files exist only for chemical folders.
    │   │   ├── assay_matrix_discrete_train(test)_old_scaff.csv  Training and test subsets of data for the split. The first column contains broad_id.
    │   │   ├── assay_matrix_discrete_train(test)_old_scaff.csv  Same, but SMILES strings in the first column. Those files are used as input to ChemProp!

     Experiments in this folder are the following: 
     - chemical Scaffold-based 5-fold cross-validation splits, the main results in the paper are reported with this series of experiments.
     - chemical_bal Same splits as in chemical, but training was run with ChemProp built-in data balancing. 
     - chemical_st Same splits as in chemical, but separate models were trained for each assay.
     - CV Random 5-fold cross-validation splits.
     - GE 5-fold cross-validation splits based on same-size clustering of gene expression features.
     - MOBC 5-fold cross-validation splits based on same-size clustering of batch-corrected morphology features.
     - random 10 random splits, ~80% of compounds in the training set and the rest in the test set. 
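
    The "random" experiment above (10 random splits with roughly 80% of compounds in the training set) could be reproduced in spirit with a sketch like the following. This is not the authors' splitting code and will not recreate the indices stored in the dataset; the function name `random_splits`, the seed, and the rounding rule are assumptions for illustration.

```python
import random


def random_splits(n_compounds, n_splits=10, train_fraction=0.8, seed=0):
    """Return a list of (train_indices, test_indices) pairs.

    Each split shuffles all compound indices independently and puts the
    first ~train_fraction of them into the training set.
    """
    rng = random.Random(seed)
    n_train = round(n_compounds * train_fraction)
    splits = []
    for _ in range(n_splits):
        order = list(range(n_compounds))
        rng.shuffle(order)
        splits.append((sorted(order[:n_train]), sorted(order[n_train:])))
    return splits


# 16170 is the number of compounds in the final assay matrix described above.
splits = random_splits(n_compounds=16170)
```

    Each of the 10 pairs then holds disjoint training and test index lists that together cover all compounds.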
    

    ├── splitting  This folder contains numpy files which help to match compounds and features to create training and test sets for a split, which can be reused in the analysis notebook for data preparation.
    │   ├── scaffold_based_split.npz  Splitting for scaffold-based splits.
    │   ├── random_split_{}.npz  Random split indices of test set compounds (10 files).
    │   ├── cross_validation_indicies.npz  Indices for random cross-validation splits
    │   ├── GE_clusters_size_constrained.npz  Indices of clusters of same-size clustering for gene-expression features.
    │   ├── MOBC_clusters_size_constrained.npz  Indices of clusters of same-size clustering for batch-corrected morphology features.

  18. Additional file 1 of Zombie cheminformatics: extraction and conversion of...

    • springernature.figshare.com
    • figshare.com
    txt
    Updated Aug 15, 2024
    Cite
    Michael Blakey; Samantha Pearman-Kanza; Jeremy G. Frey (2024). Additional file 1 of Zombie cheminformatics: extraction and conversion of Wiswesser Line Notation (WLN) from chemical documents [Dataset]. http://doi.org/10.6084/m9.figshare.26706900.v1
    Dataset updated
    Aug 15, 2024
    Dataset provided by
    figshare
    Authors
    Michael Blakey; Samantha Pearman-Kanza; Jeremy G. Frey
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 1. smith.tsv (WLN:SMILES strings from Elbert Smith's encoding manual).

  19. SMILES and physicochemical parameters - pinene, decane, toluene oxidation...

    • data.mendeley.com
    Updated Mar 8, 2021
    Cite
    Gabriel Isaacman-VanWertz (2021). SMILES and physicochemical parameters - pinene, decane, toluene oxidation products [Dataset]. http://doi.org/10.17632/3rgvkf7c9n.1
    Dataset updated
    Mar 8, 2021
    Authors
    Gabriel Isaacman-VanWertz
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes 182,127 SMILES strings generated by 5 generations of oxidation using the GECKO-A model for alpha-pinene, decane, and toluene under typical continental atmospheric conditions. For each compound, physicochemical parameters (vapor pressure, Henry's law constant, and gas phase reaction rate constant with the hydroxyl radical) are estimated using several structure-activity relationships. Compounds are flagged according to the oxidation systems in which they exceed a threshold of 0.1% of total modeled mass of their given molecular formula. Descriptions of this dataset and the parameter estimation are provided in Isaacman-VanWertz and Aumont, "Impact of organic molecular structure on the estimation of atmospherically relevant physicochemical parameters", Atmospheric Chemistry and Physics. The subset of 38,594 compounds used in the core analyses of that work is also flagged.

  20. CoconutSmiles2NameBitextMining

    • huggingface.co
    Updated Jan 3, 2025
    Cite
    CoconutSmiles2NameBitextMining [Dataset]. https://huggingface.co/datasets/BASF-AI/CoconutSmiles2NameBitextMining
    Dataset updated
    Jan 3, 2025
    Dataset authored and provided by
    BASF (http://basf.com/)
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    CoconutDB SMILES to Formula Bitext Mining

    This dataset consists of two lists: one containing both isomeric and canonical SMILES strings, and the other containing the corresponding molecular formulas of chemical entities, sourced from CoconutDB. The primary task is to identify matching pairs between the SMILES strings and their molecular formulas. Each SMILES string from the first list should be accurately aligned with its corresponding molecular formula from the second list.

