100+ datasets found
  1. ChEMBL

    • neuinfo.org
    • rrid.site
    • +2more
    Updated Jan 29, 2022
    Cite
    (2022). ChEMBL [Dataset]. http://identifiers.org/RRID:SCR_014042
    Explore at:
    Dataset updated
    Jan 29, 2022
    Description

    Collection of bioactive drug-like small molecules that contains 2D structures, calculated properties and abstracted bioactivities. Used for drug discovery and chemical biology research. Clinical progress of new compounds is continuously integrated into the database.

  2. ChEMBL EBI Small Molecules Database

    • kaggle.com
    zip
    Updated Feb 12, 2019
    Cite
    Google BigQuery (2019). ChEMBL EBI Small Molecules Database [Dataset]. https://www.kaggle.com/bigquery/ebi-chembl
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Feb 12, 2019
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    ChEMBL is maintained by the European Bioinformatics Institute (EBI), of the European Molecular Biology Laboratory (EMBL), based at the Wellcome Trust Genome Campus, Hinxton, UK.

    Content

    ChEMBL is a manually curated database of bioactive molecules with drug-like properties used in drug discovery, including information about existing patented drugs.

    Schema: http://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_23/chembl_23_schema.png

    Documentation: http://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_23/schema_documentation.html

    Fork this notebook to get started on accessing data in the BigQuery dataset using the BQhelper package to write SQL queries.

    Acknowledgements

    “ChEMBL” by the European Bioinformatics Institute (EMBL-EBI), used under CC BY-SA 3.0. Modifications have been made to add normalized publication numbers.

    Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:ebi_chembl

    Banner photo by rawpixel on Unsplash

  3. ChEMBL database - Dataset - LDM

    • service.tib.eu
    Updated Dec 2, 2024
    Cite
    (2024). ChEMBL database - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/chembl-database
    Explore at:
    Dataset updated
    Dec 2, 2024
    Description

    The dataset used in this paper is the ChEMBL database, which contains drugs/molecules and their binding information for proteins Lyn, Lck, and Src.

  4. ChEMBL database of bioactive drug-like small molecules - Cell lines section

    • bioregistry.io
    Updated Dec 11, 2022
    Cite
    (2022). ChEMBL database of bioactive drug-like small molecules - Cell lines section [Dataset]. https://bioregistry.io/chembl.cell
    Explore at:
    Dataset updated
    Dec 11, 2022
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Chemistry resources

  5. MultiTarget Bioactivity ChEMBL

    • kaggle.com
    Updated Aug 30, 2025
    Cite
    J. (2025). MultiTarget Bioactivity ChEMBL [Dataset]. https://www.kaggle.com/datasets/xjoannax88/multitarget-bioactivity-chembl
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Aug 30, 2025
    Dataset provided by
    Kaggle
    Authors
    J.
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Bioactivity Dataset for EGFR, DRD2, BACE1, and HDAC1

    This dataset contains curated molecular data for compounds tested against four pharmacologically important protein targets. All data are derived from the ChEMBL database.

    Targets Included

    Target Name   ChEMBL Target ID   Target Class
    EGFR          CHEMBL203          Kinase (Receptor TK)
    DRD2          CHEMBL217          GPCR (Dopamine D2 receptor)
    BACE1         CHEMBL1987         Enzyme (Aspartyl protease)
    HDAC1         CHEMBL325          Enzyme (Histone deacetylase)

    Each entry in the dataset includes:

    • ChEMBL compound ID
    • Canonical SMILES
    • Molecular properties (molecular weight, HBA, HBD, logP, and TPSA)
    • Target label

    Format

    The dataset is provided in CSV or Parquet format with the following columns:

    • ChEMBL ID: ChEMBL compound identifier
    • SMILES: Canonical SMILES string
    • Molecular weight: Molecular mass (Da)
    • LogP: Octanol-water partition coefficient
    • HBA: Number of hydrogen bond acceptors
    • HBD: Number of hydrogen bond donors
    • TPSA: Topological polar surface area
    • Protein: Protein target name
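
As a rough illustration of how the columns listed above might be consumed, the sketch below reads CSV text with those headers using only Python's standard library and counts Lipinski rule-of-five violations per row. The exact header spellings, the two sample rows (placeholder values, not records from the dataset), and the `ro5_violations` helper are assumptions made for this example.

```python
import csv
import io

# Hypothetical sample in the column layout described above; the rows are
# placeholders for illustration, not records from the dataset.
SAMPLE_CSV = """ChEMBL ID,SMILES,Molecular weight,LogP,HBA,HBD,TPSA,Protein
CHEMBL_EXAMPLE_1,CCO,46.07,-0.03,1,1,20.23,EGFR
CHEMBL_EXAMPLE_2,CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC,563.1,15.9,0,0,0.0,DRD2
"""

def ro5_violations(row):
    """Count Lipinski rule-of-five violations for one CSV row."""
    checks = [
        float(row["Molecular weight"]) > 500,  # molecular mass above 500 Da
        float(row["LogP"]) > 5,                # logP above 5
        int(row["HBA"]) > 10,                  # more than 10 H-bond acceptors
        int(row["HBD"]) > 5,                   # more than 5 H-bond donors
    ]
    return sum(checks)

def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

rows = load_rows(SAMPLE_CSV)
violations = {row["ChEMBL ID"]: ro5_violations(row) for row in rows}
print(violations)
```

For the real file, the same `ro5_violations` function could be applied to rows read with `csv.DictReader(open(path, newline=""))`, provided the headers match.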

    License

    This dataset is a derivative of ChEMBL and is distributed under the same license:

    Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
    https://creativecommons.org/licenses/by-sa/3.0/

    Source

    Citation

    If you use this dataset, please cite the following:

    ChEMBL Database:
    Gaulton A, Hersey A, Nowotka M, Bento AP, Chambers J, Mendez D, Mutowo P, Atkinson F, Bellis LJ, Cibrián-Uhalte E, Davies M, Dedman N, Karlsson A, Magariños MP, Overington JP, Papadatos G, Smit I, Leach AR. (2017).
    The ChEMBL database in 2017. Nucleic Acids Res., 45(D1): D945–D954.
    https://doi.org/10.1093/nar/gkw1074

    ChEMBL Web Services:
    Davies M, Nowotka M, Papadatos G, Dedman N, Gaulton A, Atkinson F, Bellis LJ, Overington JP. (2015).
    ChEMBL web services: streamlining access to drug discovery data and utilities. Nucleic Acids Res., 43(W1): W612–W620.
    https://doi.org/10.1093/nar/gkv352

  6. The ChEMBL database in 2017 - Dataset - LDM

    • service.tib.eu
    Updated Dec 3, 2024
    Cite
    (2024). The ChEMBL database in 2017 - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/the-chembl-database-in-2017
    Explore at:
    Dataset updated
    Dec 3, 2024
    Description

    The ChEMBL database is a large collection of bioactive compounds and their biological activities.

  7. Drug Targets and Drug Lists Data Package

    • johnsnowlabs.com
    csv
    Updated Jan 20, 2021
    Cite
    John Snow Labs (2021). Drug Targets and Drug Lists Data Package [Dataset]. https://www.johnsnowlabs.com/marketplace/drug-targets-and-drug-lists-data-package/
    Explore at:
    Available download formats: csv
    Dataset updated
    Jan 20, 2021
    Dataset authored and provided by
    John Snow Labs
    Description

    This data package contains information on approved, researched and proven drug targets and drug lists.

  8. Data from: A consensus compound/bioactivity dataset for data-driven drug...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated May 13, 2022
    Cite
    Laura Isigkeit; Apirat Chaikuad; Daniel Merk (2022). A consensus compound/bioactivity dataset for data-driven drug design and chemogenomics [Dataset]. http://doi.org/10.5281/zenodo.6320761
    Explore at:
    Available download formats: zip
    Dataset updated
    May 13, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Laura Isigkeit; Apirat Chaikuad; Daniel Merk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Information

    The diverse publicly available compound/bioactivity databases constitute a key resource for data-driven applications in chemogenomics and drug design. Analysis of their coverage of compound entries and biological targets revealed considerable differences, however, suggesting the benefit of a consensus dataset. Therefore, we have combined and curated information from five esteemed databases (ChEMBL, PubChem, BindingDB, IUPHAR/BPS and Probes&Drugs) to assemble a consensus compound/bioactivity dataset comprising 1144803 compounds with 10915362 bioactivities on 5613 targets (including defined macromolecular targets as well as cell-lines and phenotypic readouts). It also provides simplified information on assay types underlying the bioactivity data and on bioactivity confidence by comparing data from different sources. We have unified the source databases, brought them into a common format and combined them, which makes the dataset easy to use generically in multiple applications such as chemogenomics and data-driven drug design.

    The consensus dataset provides increased target coverage and contains a higher number of molecules compared to the source databases, which is also evident from a larger number of scaffolds. These features render the consensus dataset a valuable tool for machine learning and other data-driven applications in (de novo) drug design and bioactivity prediction. The increased chemical and bioactivity coverage of the consensus dataset may improve robustness of such models compared to the single source databases. In addition, semi-automated structure and bioactivity annotation checks with flags for divergent data from different sources may help data selection and further accurate curation.

    Structure and content of the dataset

    Dataset structure

    Columns: ChEMBL ID, PubChem ID, IUPHAR ID, Target, Activity type, Assay type, Unit, Mean C (0), ..., Mean PC (0), ..., Mean B (0), ..., Mean I (0), ..., Mean PD (0), ..., Activity check annotation, Ligand names, Canonical SMILES C, ..., Structure check, Source

    The dataset was created using the Konstanz Information Miner (KNIME) (https://www.knime.com/) and was exported as a CSV-file and a compressed CSV-file.

    Except for the canonical SMILES columns, all columns are filled with the datatype ‘string’. The datatype for the canonical SMILES columns is the smiles-format. We recommend the File Reader node for using the dataset in KNIME. With the help of this node the data types of the columns can be adjusted exactly. In addition, only this node can read the compressed format.

    Column content:

    • ChEMBL ID, PubChem ID, IUPHAR ID: chemical identifier of the databases
    • Target: biological target of the molecule expressed as the HGNC gene symbol
    • Activity type: for example, pIC50
    • Assay type: Simplification/Classification of the assay into cell-free, cellular, functional and unspecified
    • Unit: unit of bioactivity measurement
    • Mean columns of the databases: mean of bioactivity values or activity comments denoted with the frequency of their occurrence in the database, e.g. Mean C = 7.5 *(15) -> the value for this compound-target pair occurs 15 times in ChEMBL database
    • Activity check annotation: a bioactivity check was performed by comparing values from the different sources and adding an activity check annotation to provide automated activity validation for additional confidence
      • no comment: bioactivity values are within one log unit;
      • check activity data: bioactivity values are not within one log unit;
      • only one data point: only one value was available, no comparison and no range calculated;
      • no activity value: no precise numeric activity value was available;
      • no log-value could be calculated: no negative decadic logarithm could be calculated, e.g., because the reported unit was not a compound concentration
    • Ligand names: all unique names contained in the five source databases are listed
    • Canonical SMILES columns: Molecular structure of the compound from each database
    • Structure check: denotes whether compound structures match or differ across the source databases
      • match: molecule structures are the same between different sources;
      • no match: the structures differ;
      • 1 source: no structure comparison is possible, because the molecule comes from only one source database.
    • Source: the databases from which the data come
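
The mean-column format and the activity check described in the list above can be sketched as follows. This is only an illustration under stated assumptions: the cell pattern is inferred from the single example "7.5 *(15)", "within one log unit" is read as max minus min being at most 1, and the function names `parse_mean_cell` and `activity_check` are hypothetical rather than part of the published KNIME workflow.

```python
import re

# Assumed cell format, following the "7.5 *(15)" example above:
# a mean value, then the number of occurrences in parentheses.
MEAN_CELL = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*\*?\s*\((\d+)\)\s*$")

def parse_mean_cell(cell):
    """Return (mean, count) for a cell such as '7.5 *(15)', else None."""
    match = MEAN_CELL.match(cell)
    if match is None:
        return None  # empty cell or an activity comment rather than a number
    return float(match.group(1)), int(match.group(2))

def activity_check(values):
    """Annotate log-scale values from different sources for one compound-target pair."""
    if not values:
        return "no activity value"
    if len(values) == 1:
        return "only one data point"
    # Assumption: "within one log unit" means max - min <= 1.
    if max(values) - min(values) <= 1.0:
        return "no comment"
    return "check activity data"

print(parse_mean_cell("7.5 *(15)"))
print(activity_check([7.5, 7.9]))
```

The "no log-value could be calculated" annotation is left out here because it depends on unit handling that the description does not spell out.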

  9. Tyrosine Kinases ligands with bioactivity data

    • kaggle.com
    zip
    Updated Jun 6, 2024
    Cite
    Ricardo Romero Ochoa (2024). Tyrosine Kinases ligands with bioactivity data [Dataset]. https://www.kaggle.com/datasets/ricardoromeroochoa/tyrosine-kinases-ligands-with-bioactivity-data
    Explore at:
    Available download formats: zip (2087515 bytes)
    Dataset updated
    Jun 6, 2024
    Authors
    Ricardo Romero Ochoa
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains bioactivity, toxicity, and drug-likeness information for 28314 small compounds targeting tyrosine kinase proteins, obtained from the ChEMBL database and subsequently curated. It comprises the following elements:

    1. ABL (Abelson tyrosine-protein kinase 1) 1970
    2. EGFR (Epidermal growth factor receptor) 9508
    3. PDGFR (Platelet-derived growth factor receptor) 1750
    4. FGFR (Fibroblast growth factor receptor) 3156
    5. MET (Hepatocyte growth factor receptor) 3950
    6. VEGFR (Vascular endothelial growth factor receptor) 1211
    7. KIT (Stem cell factor receptor) 1692
    8. RET (Rearranged during transfection) 896
    9. JAK (Janus kinase) 4203
    10. ALK (Anaplastic lymphoma kinase) 2087
    11. SRC 3846
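
Reading the trailing number on each line as a per-family compound count (an assumption, since the description does not label it), the counts do not add up to the stated 28314. A quick check of the arithmetic:

```python
# Per-family numbers as listed above, read as compound counts (assumption).
counts = {
    "ABL": 1970, "EGFR": 9508, "PDGFR": 1750, "FGFR": 3156,
    "MET": 3950, "VEGFR": 1211, "KIT": 1692, "RET": 896,
    "JAK": 4203, "ALK": 2087, "SRC": 3846,
}
stated_total = 28314  # number of compounds given in the description

family_total = sum(counts.values())
surplus = family_total - stated_total
print(family_total, surplus)  # 34269 5955
```

One plausible explanation, not confirmed by the description, is that some compounds are listed under more than one kinase family.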

    The data can be used to train models that predict or classify bioactivity and druglike properties of small compounds targeting a tyrosine kinase protein.

  10. Data from: hERG Me Out

    • acs.figshare.com
    • figshare.com
    txt
    Updated May 31, 2023
    Cite
    Paul Czodrowski (2023). hERG Me Out [Dataset]. http://doi.org/10.1021/ci400308z.s002
    Explore at:
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    ACS Publications
    Authors
    Paul Czodrowski
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    A detailed analysis of the hERG content inside the ChEMBL database is performed. The correlation between the outcome from binding assays and functional assays is probed. On the basis of descriptor distributions, design paradigms with respect to structural and physicochemical properties of hERG active and hERG inactive compounds are challenged. Finally, classification models with different data sets are trained. All source code is provided, which is based on the Python open source packages RDKit and scikit-learn to enable the community to rerun the experiments. The code is stored on github (https://github.com/pzc/herg_chembl_jcim).

  11. ChEMBL

    • bioregistry.io
    Updated Apr 24, 2021
    + more versions
    Cite
    (2021). ChEMBL [Dataset]. http://identifiers.org/re3data:r3d100010539
    Explore at:
    Dataset updated
    Apr 24, 2021
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    ChEMBL is a database of bioactive compounds, their quantitative properties and bioactivities (binding constants, pharmacology and ADMET, etc.). The data is abstracted and curated from the primary scientific literature.

  12. Data from: Drug Safety Data Curation and Modeling in ChEMBL: Boxed Warnings...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    • +1more
    Updated Jan 28, 2021
    Cite
    Bosc, Nicolas; Bento, A. Patrícia; Hersey, Anne; Hunter, Fiona M. I.; Gaulton, Anna; Leach, Andrew R. (2021). Drug Safety Data Curation and Modeling in ChEMBL: Boxed Warnings and Withdrawn Drugs [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000801021
    Explore at:
    Dataset updated
    Jan 28, 2021
    Authors
    Bosc, Nicolas; Bento, A. Patrícia; Hersey, Anne; Hunter, Fiona M. I.; Gaulton, Anna; Leach, Andrew R.
    Description

    The safety of marketed drugs is an ongoing concern, with some of the more frequently prescribed medicines resulting in serious or life-threatening adverse effects in some patients. Safety-related information for approved drugs has been curated to include the assignment of toxicity class(es) based on their withdrawn status and/or black box warning information described on medicinal product labels. The ChEMBL resource contains a wide range of bioactivity data types, from early “Discovery” stage preclinical data for individual compounds through to postclinical data on marketed drugs; the inclusion of the curated drug safety data set within this framework can support a wide range of safety-related drug discovery questions. The curated drug safety data set will be made freely available through ChEMBL and updated in future database releases.

  13. Analog series-based scaffolds from ChEMBL with associated activity...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Dimova, Dilyana; Stumpfe, Dagmar; Hu, Ye; Bajorath, Jürgen (2020). Analog series-based scaffolds from ChEMBL with associated activity information [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_155302
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, 6 Rheinische Friedrich-Wilhelms-Universitat, Dahlmannstrasse 2, D-53113 Bonn, Germany
    Authors
    Dimova, Dilyana; Stumpfe, Dagmar; Hu, Ye; Bajorath, Jürgen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reported is the activity information for the 12,294 analog series-based (ASB) scaffolds extracted from the ChEMBL database. For each ASB scaffold, structural and activity information for all analogs comprising the analog series is provided.

  14. ChEMBL data against CHEMBL367, CHEMBL368 and CHEMBL612348

    • data.niaid.nih.gov
    Updated May 20, 2023
    Cite
    Arnaud Gaudry (2023). ChEMBL data against CHEMBL367, CHEMBL368 and CHEMBL612348 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7953283
    Explore at:
    Dataset updated
    May 20, 2023
    Dataset provided by
    University of Geneva
    Authors
    Arnaud Gaudry
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data from ChEMBL compounds reported with an activity against one of the following targets: CHEMBL367 : Leishmania donovani, CHEMBL368 : Trypanosoma cruzi, and CHEMBL612348 : Trypanosoma brucei rhodesiense.

  15. ChemBL-for-Pretraing-MPNN/GNN

    • kaggle.com
    zip
    Updated Oct 31, 2025
    Cite
    Divyansh Vishwkarma (2025). ChemBL-for-Pretraing-MPNN/GNN [Dataset]. https://www.kaggle.com/datasets/divyanshvishwkarma/chembl-for-pretraing-mpnn
    Explore at:
    Available download formats: zip (23305429 bytes)
    Dataset updated
    Oct 31, 2025
    Authors
    Divyansh Vishwkarma
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    ChemBL

    ChEMBL is a comprehensive, open-access, and manually curated chemical database of bioactive molecules with drug-like properties. It is maintained by the European Bioinformatics Institute (EBI), which is part of the European Molecular Biology Laboratory (EMBL). This particular dataset is cleaned and preprocessed for pre-training Message Passing Neural Networks (MPNN), which can then be used for more robust predictive or generative models.

    SMILES to Graph

    To convert a SMILES string to a graph, I have put together a Python module, SMILESToGraph, on GitHub.

    It is a comprehensive molecular feature extraction toolkit that converts SMILES (Simplified Molecular Input Line Entry System) strings into graph representations suitable for machine learning applications. It provides configurable feature levels, built-in normalization, and standalone descriptor extraction capabilities.

    Installation:

    To use it, run the following snippet in a notebook cell:

    import sys
    !pip install rdkit --q
    !git clone https://github.com/Divyansh900/SMILES-to-Graph.git
    sys.path.append('/kaggle/working/SMILES-to-Graph')

    Initialization:

    from SMILESToGraph import SMILESToGraph
    
    converter = SMILESToGraph(
      feature_level="Comprehensive",
      include_3d = False,
      include_partial_charges = True,
      include_descriptors = True,
      max_atomic_num = 100
    )
    

    Usage

    To convert a SMILES string to a graph, use the to_graph method:

    converter.to_graph(smiles_string) # returns graph features (node, edge, graph level based on feature_level)

    It also has several helper methods, such as:

    converter.get_feature_shapes() # returns the dimensions of each graph feature

    I hope this helps you out. For more information and features, please refer to the GitHub page.

    Applications

    ChEMBL is a vital resource for:

    • Drug Discovery: Identifying potential drug candidates, understanding structure-activity relationships (SAR), and designing compound screening libraries.
    • Target Validation: Linking small molecules to their corresponding protein targets.
    • Safety/Toxicity Analysis: Investigating potential off-target effects of compounds.
    • Cheminformatics and Computational Biology: Developing predictive models and conducting large-scale data mining.

  16. ChEMBL RDF

    • data.wu.ac.at
    api/sparql, meta/void +1
    Updated Jul 30, 2016
    + more versions
    Cite
    EMBL European Bioinformatics Institute (2016). ChEMBL RDF [Dataset]. https://data.wu.ac.at/odso/datahub_io/YzA0YTM3MTItN2NlZC00ODk1LTk3YzUtZDcxYjgwZDcyZTUw
    Explore at:
    Available download formats: ttl, api/sparql, meta/void
    Dataset updated
    Jul 30, 2016
    Dataset provided by
    European Bioinformatics Institute (http://www.ebi.ac.uk/)
    License

    http://www.opendefinition.org/licenses/cc-by-sa

    Description

    ChEMBL is a database of bioactive drug-like small molecules. It contains 2-D structures, calculated properties (e.g. logP, Molecular Weight, Lipinski Parameters, etc.) and abstracted bioactivities (e.g. binding constants, pharmacology and ADMET data). The data is abstracted and curated from the primary scientific literature, and covers a significant fraction of the SAR and discovery of modern drugs.

    It is available in RDF form through EMBL-EBI's RDF Platform.

  17. Table_1_How to Achieve Better Results Using PASS-Based Virtual Screening:...

    • frontiersin.figshare.com
    • figshare.com
    xlsx
    Updated Jun 3, 2023
    + more versions
    Cite
    Pavel V. Pogodin; Alexey A. Lagunin; Anastasia V. Rudik; Dmitry A. Filimonov; Dmitry S. Druzhilovskiy; Mark C. Nicklaus; Vladimir V. Poroikov (2023). Table_1_How to Achieve Better Results Using PASS-Based Virtual Screening: Case Study for Kinase Inhibitors.XLSX [Dataset]. http://doi.org/10.3389/fchem.2018.00133.s003
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Frontiers
    Authors
    Pavel V. Pogodin; Alexey A. Lagunin; Anastasia V. Rudik; Dmitry A. Filimonov; Dmitry S. Druzhilovskiy; Mark C. Nicklaus; Vladimir V. Poroikov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Discovery of new pharmaceutical substances is currently boosted by the possibility of utilization of the Synthetically Accessible Virtual Inventory (SAVI) library, which includes about 283 million molecules, each annotated with a proposed synthetic one-step route from commercially available starting materials. The SAVI database is well-suited for ligand-based methods of virtual screening to select molecules for experimental testing. In this study, we compare the performance of three approaches for the analysis of structure-activity relationships that differ in their criteria for selecting “active” and “inactive” compounds included in the training sets. PASS (Prediction of Activity Spectra for Substances), which is based on a modified Naïve Bayes algorithm, was applied since it had been shown to be robust and to provide good predictions of many biological activities based on just the structural formula of a compound even if the information in the training set is incomplete. We used different subsets of kinase inhibitors for this case study because many data are currently available on this important class of drug-like molecules. Based on the subsets of kinase inhibitors extracted from the ChEMBL 20 database we performed the PASS training, and then applied the model to ChEMBL 23 compounds not yet present in ChEMBL 20 to identify novel kinase inhibitors. As one may expect, the best prediction accuracy was obtained if only the experimentally confirmed active and inactive compounds for distinct kinases in the training procedure were used. However, for some kinases, reasonable results were obtained even if we used merged training sets, in which we designated as inactives the compounds not tested against the particular kinase. Thus, depending on the availability of data for a particular biological activity, one may choose the first or the second approach for creating ligand-based computational tools to achieve the best possible results in virtual screening.

  18. Drug Design with Small Molecule SMILES

    • kaggle.com
    zip
    Updated Feb 26, 2022
    Cite
    Sgt. Peppers (2022). Drug Design with Small Molecule SMILES [Dataset]. https://www.kaggle.com/datasets/art3mis/chembl22/discussion
    Explore at:
    Available download formats: zip (26097905 bytes)
    Dataset updated
    Feb 26, 2022
    Authors
    Sgt. Peppers
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Context

    SMILES (simplified molecular-input line-entry system) strings are ASCII sequences that describe the structure of a compound. Due to their simplicity, SMILES representations have been widely used in drug design, mining, and repurposing using machine learning and natural language processing techniques.

    See Cheminformania's seq2seq model for an excellent tutorial to get started.

    Content

    A single text file

    Each row contains two columns for description of an individual molecule. The first column is the SMILES string. The second is a reference to the full ChEMBL entry for that particular molecule.
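
A minimal sketch of reading such a two-column file is shown below. The delimiter is not stated in the description, so splitting on whitespace is an assumption (it is safe for the first column because SMILES strings contain no spaces). The helper names and the sample row are illustrative only.

```python
def parse_line(line):
    """Split one row into (smiles, chembl_ref); return None if it has fewer than two fields."""
    parts = line.split()
    if len(parts) < 2:
        return None
    return parts[0], parts[1]

def read_records(lines):
    """Collect (smiles, chembl_ref) pairs from an iterable of text lines."""
    records = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            records.append(parsed)
    return records

# Illustrative input: aspirin with its ChEMBL ID, plus a blank line to skip.
sample = ["CC(=O)Oc1ccccc1C(=O)O\tCHEMBL25\n", "\n"]
records = read_records(sample)
print(records)
```

With the real file, the same function can be fed an open file handle, for example `with open(path) as handle: records = read_records(handle)`, where `path` is wherever the text file was saved.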

    Acknowledgements

    ChEMBL Database: Gaulton A, Hersey A, Nowotka M, Bento AP, Chambers J, Mendez D, Mutowo P, Atkinson F, Bellis LJ, Cibrián-Uhalte E, Davies M, Dedman N, Karlsson A, Magariños MP, Overington JP, Papadatos G, Smit I, Leach AR. (2017) 'The ChEMBL database in 2017.' Nucleic Acids Res., 45(D1) D945-D954.

    ChEMBL Web Services: Davies M, Nowotka M, Papadatos G, Dedman N, Gaulton A, Atkinson F, Bellis L, Overington JP. (2015) 'ChEMBL web services: streamlining access to drug discovery data and utilities.' Nucleic Acids Res., 43(W1) W612-W620.

    ChEMBL RDF: S. Jupp, J. Malone, J. Bolleman, M. Brandizi, M. Davies, L. Garcia, A. Gaulton, S. Gehant, C. Laibe, N. Redaschi, S.M Wimalaratne, M. Martin, N. Le Novère, H. Parkinson, E. Birney and A.M Jenkinson (2014) The EBI RDF Platform: Linked Open Data for the Life Sciences Bioinformatics 30 1338-1339

  19. KinFragLib: Combinatorial library

    • data.niaid.nih.gov
    • nde-dev.biothings.io
    • +1more
    Updated Mar 21, 2024
    Cite
    Sydow, Dominique; Schmiel, Paula; Mortie, Jérémie; Buchthal, Katharina; Kramer, Paula Linh; Volkamer, Andrea (2024). KinFragLib: Combinatorial library [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3954931
    Explore at:
    Dataset updated
    Mar 21, 2024
    Dataset provided by
    Charité
    Saarland University
    Bayer AG
    Authors
    Sydow, Dominique; Schmiel, Paula; Mortie, Jérémie; Buchthal, Katharina; Kramer, Paula Linh; Volkamer, Andrea
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    KinFragLib: Exploring the Kinase Inhibitor Space Using Subpocket-Focused Fragmentation and Recombination.

    Project description.

    Protein kinases play a crucial role in many cell signaling processes, making them one of the most important families of drug targets. In this context, fragment-based drug design strategies have been successfully applied to develop novel kinase inhibitors, usually following a knowledge-driven approach to optimize a focused set of fragments to a potent kinase inhibitor.

    Alternatively, KinFragLib is a new method that makes it possible to explore and extend the chemical space of kinase inhibitors using data-driven fragmentation and recombination, built on available structural kinome data from the KLIFS database for over 3,200 kinase DFG-in complexes. The computational fragmentation method splits the co-crystallized non-covalent kinase inhibitors into fragments with respect to their 3D proximity to six predefined functionally relevant subpocket centers. The resulting fragment library consists of six subpocket pools with over 9,000 fragments, available at https://github.com/volkamerlab/KinFragLib.

    KinFragLib offers two main applications: (i) In-depth analyses of the chemical space of known kinase inhibitors, subpocket characteristics and connections, as well as (ii) subpocket-informed recombination of fragments to generate potential novel inhibitors. The latter showed that recombining only a subset of 727 representative fragments generated a combinatorial library of 11.3 million molecules, containing, besides some known kinase inhibitors, more than 99% novel chemical matter compared to ChEMBL and 55% molecules compliant with Lipinski's rule of five.

    Combinatorial library dataset.

    The dataset offered here is part of the KinFragLib GitHub repository (https://github.com/volkamerlab/KinFragLib) and contains the metadata and properties of the KinFragLib combinatorial library.

    1. Raw data

    combinatorial_library.json: Full combinatorial library, please refer to notebooks/4_1_combinatorial_library_data_preparation.ipynb at https://github.com/volkamerlab/KinFragLib for detailed information about this data format.

    combinatorial_library_deduplicated.json: Deduplicated combinatorial library (based on InChIs).

    chembl_standardized_inchi.csv: Standardized ChEMBL 33 molecules in the form of InChI strings.
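
    Deduplication by InChI, as mentioned for combinatorial_library_deduplicated.json, can be sketched as follows. This is a minimal illustration and not the code used in the KinFragLib notebooks: the record layout (a list of dictionaries with an "inchi" key) and the function name deduplicate_by_inchi are assumptions made for this example. See the notebook referenced above for the actual data format.

```python
def deduplicate_by_inchi(records, key="inchi"):
    """Keep the first record seen for each InChI string, preserving order."""
    seen = set()
    unique = []
    for record in records:
        inchi = record[key]
        if inchi not in seen:
            seen.add(inchi)
            unique.append(record)
    return unique


# Hypothetical toy records; real library entries carry more fields.
ligands = [
    {"inchi": "InChI=1S/CH4/h1H4", "source": "a"},
    {"inchi": "InChI=1S/H2O/h1H2", "source": "b"},
    {"inchi": "InChI=1S/CH4/h1H4", "source": "c"},  # duplicate of the first record
]
unique_ligands = deduplicate_by_inchi(ligands)
print(len(unique_ligands))
```

    Keeping the first occurrence is an arbitrary choice made for the sketch; any rule that retains one record per InChI would serve the same purpose.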

    2. Processed data

    Data extracted from combinatorial_library_deduplicated.json; the extraction is performed in notebooks/4_1_combinatorial_library_data_preparation.ipynb at https://github.com/volkamerlab/KinFragLib.

    n_atoms.csv: Number of atoms for each recombined ligand.

    ro5.csv: Number of ligands that fulfill Lipinski's rule of five (Ro5) and its individual criteria; number of ligands in total.

    subpockets.csv: Number of ligands per subpocket combination.

    original_exact.json: Ligands with exact matches in original ligands, i.e. KLIFS ligands that were used for the fragmentation.

    original_substructure.json: Ligands with substructure matches in original ligands, i.e. KLIFS ligands that were used for the fragmentation.

    chembl_exact.json: Ligands with exact matches in ChEMBL.

    chembl_most_similar.json: Most similar ligand in ChEMBL for each recombined ligand.

    chembl_highly_similar.json: Most similar ligand in ChEMBL for each recombined ligand with similarity greater than 0.9.
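
    The kind of tally stored in ro5.csv can be illustrated with a small sketch. Lipinski's criteria are a molecular weight of at most 500 Da, a logP of at most 5, at most 5 hydrogen-bond donors and at most 10 hydrogen-bond acceptors. The sketch assumes that these descriptors have already been computed, and it assumes the common convention that a ligand passes if it violates at most one criterion; whether KinFragLib applies exactly this threshold is not stated here. All names and values below are hypothetical.

```python
from collections import Counter

# Upper bounds of Lipinski's rule of five.
RO5_LIMITS = {"mw": 500.0, "logp": 5.0, "hbd": 5, "hba": 10}


def ro5_criteria(descriptors):
    """Map each criterion name to True if the ligand fulfills it."""
    return {name: descriptors[name] <= limit for name, limit in RO5_LIMITS.items()}


def fulfills_ro5(descriptors, max_violations=1):
    """Assumed convention: pass if at most `max_violations` criteria fail."""
    violations = sum(not ok for ok in ro5_criteria(descriptors).values())
    return violations <= max_violations


def tally(ligands):
    """Count ligands in total, per fulfilled criterion, and per overall Ro5 pass."""
    counts = Counter(total=len(ligands))
    for descriptors in ligands:
        for name, ok in ro5_criteria(descriptors).items():
            counts[name] += ok  # True counts as 1, False as 0
        counts["ro5"] += fulfills_ro5(descriptors)
    return dict(counts)


# Hypothetical descriptor values for three ligands.
toy_ligands = [
    {"mw": 350.4, "logp": 2.1, "hbd": 2, "hba": 5},    # no violation
    {"mw": 612.7, "logp": 4.3, "hbd": 3, "hba": 9},    # one violation (mw)
    {"mw": 720.9, "logp": 6.8, "hbd": 6, "hba": 12},   # four violations
]
ro5_counts = tally(toy_ligands)
print(ro5_counts)
```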

    Usage.

    This dataset can be used to run the notebooks available on https://github.com/volkamerlab/KinFragLib.

    Clone the KinFragLib repository.

    Download the tar.bz2 file provided here.

    Extract the archive content to the combinatorial library folder in your local KinFragLib folder and run the notebooks.

    tar -xvf combinatorial_library.tar.bz2 -C /path_to_kinfraglib/data/combinatorial_library/
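
    The same extraction step can also be run from Python, for example inside a notebook. The following is a minimal sketch using only the standard library; the function name extract_archive is hypothetical, and the paths in the usage comment are the placeholders from the command above, not real locations.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, target_dir):
    """Extract a .tar.bz2 archive into target_dir and return the member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        names = archive.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: refuse unsafe members such as absolute paths.
            archive.extractall(target, filter="data")
        else:
            archive.extractall(target)
    return names


# Usage (placeholder paths; adjust to your local KinFragLib clone):
# extract_archive(
#     "combinatorial_library.tar.bz2",
#     "/path_to_kinfraglib/data/combinatorial_library/",
# )
```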

    Citation.

    This dataset is part of the KinFragLib publication:

    Sydow, D., Schmiel, P., Mortier, J., and Volkamer, A. KinFragLib: Exploring the Kinase Inhibitor Space Using Subpocket-Focused Fragmentation and Recombination. J. Chem. Inf. Model. 2020. https://pubs.acs.org/doi/abs/10.1021/acs.jcim.0c00839

  20. ChEMBL-RDF v13.5

    • dataverse.nl
    application/x-gzip
    Updated Nov 22, 2021
    Cite
    Egon Willighagen (2021). ChEMBL-RDF v13.5 [Dataset]. http://doi.org/10.34894/IJWU5L
    Available download formats: application/x-gzip (765062132)
    Dataset updated
    Nov 22, 2021
    Dataset provided by
    DataverseNL
    Authors
    Egon Willighagen
    License

    https://dataverse.nl/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.34894/IJWU5L

    Area covered
    Netherlands
    Description

    ChEMBL is a medicinal chemistry database by the team of Dr. J. Overington at the EBI: http://www.ebi.ac.uk/chembl/ It is described in this paper (doi:10.1093/nar/gkr777): http://nar.oxfordjournals.org/content/early/2011/09/22/nar.gkr777.short This project develops, releases, and hosts an RDF version of ChEMBL, independent of the ChEMBL team, who make their own RDF version. The main SPARQL endpoint is available from Uppsala University at: http://rdf.farmbio.uu.se/chembl/sparql

Cite
(2022). ChEMBL [Dataset]. http://identifiers.org/RRID:SCR_014042

ChEMBL

RRID:SCR_014042, r3d100010539, ChEMBL (RRID:SCR_014042), ChEMBLdb, Chembl, ChEMBL Database

Explore at:
14 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Jan 29, 2022
Description

Collection of bioactive drug-like small molecules that contains 2D structures, calculated properties and abstracted bioactivities. Used for drug discovery and chemical biology research. Clinical progress of new compounds is continuously integrated into the database.
