Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
ChEMBL is maintained by the European Bioinformatics Institute (EBI), of the European Molecular Biology Laboratory (EMBL), based at the Wellcome Trust Genome Campus, Hinxton, UK.
ChEMBL is a manually curated database of bioactive molecules with drug-like properties used in drug discovery, including information about existing patented drugs.
Schema: http://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_23/chembl_23_schema.png
Documentation: http://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_23/schema_documentation.html
Fork this notebook to get started on accessing data in the BigQuery dataset using the BQhelper package to write SQL queries.
“ChEMBL” by the European Bioinformatics Institute (EMBL-EBI), used under CC BY-SA 3.0. Modifications have been made to add normalized publication numbers.
Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:ebi_chembl
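As a rough illustration of the kind of SQL query the notebook entry above refers to, the sketch below assembles a query string in Python. The table name `patents-public-data.ebi_chembl.molecule_dictionary` and the selected columns are assumptions based on the ChEMBL 23 schema linked above; the table names actually exposed in BigQuery may differ (for example, they may carry a release suffix). The optional `run_query` helper assumes that "BQhelper" refers to the `bq_helper` package used in Kaggle notebooks and that BigQuery credentials are available; it is defined but not called here.

```python
def build_max_phase_query(
    table="patents-public-data.ebi_chembl.molecule_dictionary",  # assumed table name
    min_phase=4,
    limit=10,
):
    """Return a SQL string listing molecules at or above a given max_phase."""
    if not isinstance(min_phase, int) or not isinstance(limit, int):
        raise TypeError("min_phase and limit must be integers")
    return (
        "SELECT chembl_id, pref_name, max_phase\n"
        f"FROM `{table}`\n"
        f"WHERE max_phase >= {min_phase}\n"
        "ORDER BY chembl_id\n"
        f"LIMIT {limit}"
    )


def run_query(sql):
    """Run the query through bq_helper (assumed to be the 'BQhelper' package).

    Needs the bq_helper package and BigQuery credentials, so it is only
    imported when this function is actually called.
    """
    from bq_helper import BigQueryHelper

    helper = BigQueryHelper(
        active_project="patents-public-data", dataset_name="ebi_chembl"
    )
    # query_to_pandas_safe refuses to run queries that would scan too much data.
    return helper.query_to_pandas_safe(sql, max_gb_scanned=1)


sql = build_max_phase_query(min_phase=3, limit=5)
print(sql)
```

Building the string separately from running it makes it easy to inspect the query before any data is scanned.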
Collection of bioactive drug-like small molecules that contains 2D structures, calculated properties and abstracted bioactivities. Used for drug discovery and chemical biology research. Clinical progress of new compounds is continuously integrated into the database.
This data package contains information on approved, researched and proven drug targets and drug lists.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The safety of marketed drugs is an ongoing concern, with some of the more frequently prescribed medicines resulting in serious or life-threatening adverse effects in some patients. Safety-related information for approved drugs has been curated to include the assignment of toxicity class(es) based on their withdrawn status and/or black box warning information described on medicinal product labels. The ChEMBL resource contains a wide range of bioactivity data types, from early “Discovery” stage preclinical data for individual compounds through to postclinical data on marketed drugs; the inclusion of the curated drug safety data set within this framework can support a wide range of safety-related drug discovery questions. The curated drug safety data set will be made freely available through ChEMBL and updated in future database releases.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset contains curated molecular data for compounds tested against four pharmacologically important protein targets. All data are derived from the ChEMBL database.
| Target Name | ChEMBL Target ID | Target Class |
|---|---|---|
| EGFR | CHEMBL203 | Kinase (Receptor TK) |
| DRD2 | CHEMBL217 | GPCR (Dopamine D2 receptor) |
| BACE1 | CHEMBL1987 | Enzyme (Aspartyl protease) |
| HDAC1 | CHEMBL325 | Enzyme (Histone deacetylase) |
Each entry in the dataset includes:
- ChEMBL compound ID
- Canonical SMILES
- Molecular properties (molecular weight, HBA, HBD, logP, and TPSA)
- Target label
The dataset is provided in CSV or Parquet format with the following columns:
- ChEMBL ID: ChEMBL compound identifier
- SMILES: Canonical SMILES string
- Molecular weight: Molecular mass (Da)
- LogP: Octanol-water partition coefficient
- HBA: Number of hydrogen bond acceptors
- HBD: Number of hydrogen bond donors
- TPSA: Topological polar surface area
- Protein: Protein target name
This dataset is a derivative of ChEMBL and is distributed under the same license:
Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
https://creativecommons.org/licenses/by-sa/3.0/
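The property columns listed above (molecular weight, LogP, HBA, HBD) are the inputs to Lipinski's rule of five, a commonly used drug-likeness filter. The dataset description does not mention this rule; the sketch below is only one possible use of the columns. It assumes the CSV header names match the column labels given above, and the two sample rows are made up for illustration rather than taken from the dataset.

```python
import csv
import io

# Made-up rows for illustration only; they are NOT taken from the dataset.
# Header names are assumed to match the column labels in the description.
SAMPLE_CSV = """ChEMBL ID,SMILES,Molecular weight,LogP,HBA,HBD,TPSA,Protein
EXAMPLE_1,CCO,46.07,-0.03,1,1,20.23,EGFR
EXAMPLE_2,PLACEHOLDER,650.0,6.2,12,6,150.0,DRD2
"""


def lipinski_violations(mw, logp, hba, hbd):
    """Count rule-of-five violations (MW > 500, logP > 5, HBA > 10, HBD > 5)."""
    return sum([mw > 500, logp > 5, hba > 10, hbd > 5])


def count_violations(csv_text):
    """Map each compound ID in the CSV text to its number of violations."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        result[row["ChEMBL ID"]] = lipinski_violations(
            float(row["Molecular weight"]),
            float(row["LogP"]),
            int(row["HBA"]),
            int(row["HBD"]),
        )
    return result


violations = count_violations(SAMPLE_CSV)
print(violations)
```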
chembl_webresource_client)
If you use this dataset, please cite the following:
ChEMBL Database:
Gaulton A, Hersey A, Nowotka M, Bento AP, Chambers J, Mendez D, Mutowo P, Atkinson F, Bellis LJ, Cibrián-Uhalte E, Davies M, Dedman N, Karlsson A, Magariños MP, Overington JP, Papadatos G, Smit I, Leach AR. (2017).
The ChEMBL database in 2017. Nucleic Acids Res., 45(D1): D945–D954.
https://doi.org/10.1093/nar/gkw1074
ChEMBL Web Services:
Davies M, Nowotka M, Papadatos G, Dedman N, Gaulton A, Atkinson F, Bellis LJ, Overington JP. (2015).
ChEMBL web services: streamlining access to drug discovery data and utilities. Nucleic Acids Res., 43(W1): W612–W620.
https://doi.org/10.1093/nar/gkv352
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Published compounds from ChEMBL version 32 are used to seek evidence for the occurrence of “natural selection” in drug discovery. Three measures of natural product (NP) character were applied, to compare time- and target-matched compounds reaching the clinic (clinical compounds in phase 1–3 development and approved drugs) with background compounds (reference compounds). Pseudo-NPs (PNPs), containing NP fragments combined in ways inaccessible by nature, are increasing over time, reaching 67% of clinical compounds first disclosed since 2010. PNPs are 54% more likely to be found in post-2008 clinical versus reference compounds. The majority of target classes show increased clinical compound NP character versus their reference compounds. Only 176 NP fragments appear in >1000 clinical compounds published since 2008, yet these make up on average 63% of the clinical compound’s core scaffolds. There is untapped potential awaiting exploitation, by applying nature’s building blocks, “natural intelligence”, to drug design.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A ChEMBL-v29 dataset was generated for use with the ligand-based similarity search target prediction engine FastTargetPred (https://github.com/ludovicchaput/FastTargetPred). Using this new dataset, an attempt was made to predict macromolecular targets for a published dataset of 160,000 tetrapeptides.
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
SMILES (simplified molecular-input line-entry system) strings are ASCII sequences that describe the structure of a compound. Due to their simplicity, SMILES representations have been widely used in drug design, mining, and repurposing using machine learning and natural language processing techniques.
See Cheminformania's seq2seq model for an excellent tutorial to get started.
A single text file
Each row contains two columns for description of an individual molecule. The first column is the SMILES string. The second is a reference to the full ChEMBL entry for that particular molecule.
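A minimal sketch of reading such a two-column file is shown below. The description does not state the delimiter, so the code assumes the columns are separated by whitespace (a tab or spaces), which is safe for SMILES strings because they contain no whitespace. The function names and the sample values are hypothetical.

```python
def parse_smiles_line(line):
    """Split one row into (smiles, chembl_reference); return None for short rows."""
    fields = line.split()  # assumes whitespace-delimited columns
    if len(fields) < 2:
        return None
    return fields[0], fields[1]


def read_smiles_file(path):
    """Yield (smiles, chembl_reference) pairs from a two-column text file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            record = parse_smiles_line(line)
            if record is not None:
                yield record


# Hypothetical row: ethanol's SMILES with a placeholder reference.
example = parse_smiles_line("CCO\tEXAMPLE_REF\n")
print(example)
```

If the file starts with a header row, that row would be returned like any other record and would need to be skipped by the caller.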
ChEMBL Database: Gaulton A, Hersey A, Nowotka M, Bento AP, Chambers J, Mendez D, Mutowo P, Atkinson F, Bellis LJ, Cibrián-Uhalte E, Davies M, Dedman N, Karlsson A, Magariños MP, Overington JP, Papadatos G, Smit I, Leach AR. (2017) 'The ChEMBL database in 2017.' Nucleic Acids Res., 45(D1) D945-D954.
ChEMBL Web Services: Davies M, Nowotka M, Papadatos G, Dedman N, Gaulton A, Atkinson F, Bellis L, Overington JP. (2015) 'ChEMBL web services: streamlining access to drug discovery data and utilities.' Nucleic Acids Res., 43(W1) W612-W620.
ChEMBL RDF: S. Jupp, J. Malone, J. Bolleman, M. Brandizi, M. Davies, L. Garcia, A. Gaulton, S. Gehant, C. Laibe, N. Redaschi, S.M Wimalaratne, M. Martin, N. Le Novère, H. Parkinson, E. Birney and A.M Jenkinson (2014) The EBI RDF Platform: Linked Open Data for the Life Sciences Bioinformatics 30 1338-1339
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Computational drug- and compound-centered analyses often need to map attributes across multiple drug databases. In this study, we provide a Neo4j repository that integrates two of the most prominent open source drug databases, DrugBank and ChEMBL, with the goal of establishing an integrated data visualization and analysis tool for drug discovery studies. The drugs present in DrugBank are mapped to their counterparts in ChEMBL. Integrating these resources and harmonizing them through knowledge graph serialization in Neo4j leads to the identification of relationships between drugs and other related features that are otherwise spread across two different resources. A common data format, a prerequisite for populating the Neo4j database, enables users to identify new relationships central to drug discovery research, such as Drug Target Interactions (DTI). The resource is freely available at: https://github.com/ambf0632/CompoundDB4j
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Compound descriptors for compounds retrieved from the ChEMBL database. The descriptors annotate the compounds chemically and by their physicochemical properties, for use in cheminformatics.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This curated dataset offers a valuable resource for deep learning applications in drug discovery and repositioning domains. It contains 5,350 high-resolution images systematically categorized into pharmacological classes and molecular targets. The pharmacological classes encompass antifungals, antivirals, corticosteroids, diuretics, and non-steroidal anti-inflammatory drugs (NSAIDs), while the molecular targets emphasize Alzheimer's disease-related enzymes, including acetylcholinesterase, butyrylcholinesterase, and beta-secretase 1. The dataset was meticulously compiled using data from well-established databases, including DrugBank, ChEMBL, and DUD-E, ensuring diversity and quality in the compounds selected for training. Active compounds (true positives) were sourced from DrugBank and ChEMBL, while decoy compounds (true negatives) were generated using the DUD-E protocol. The decoy compounds are designed to match the physicochemical properties of active compounds while lacking binding affinity, creating a robust benchmark for machine learning evaluation. The balanced structure of the dataset, with equal representation of true positive and decoy compounds, enhances its suitability for binary and multi-class classification tasks. The collection of compounds is diverse and of high quality, thus supporting a wide range of deep learning tasks, including pharmacological class prediction, virtual screening, and molecular target identification. This ultimately advances computational approaches in drug discovery.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The advent of powerful machine learning algorithms, together with the availability of high volumes of pharmacological data, has given new fuel to QSAR, opening unprecedented options for deriving highly predictive models for assisting the rational design of new bioactive compounds, for screening and prioritizing large molecular libraries, and for repurposing drugs toward new clinical uses. Here, we present PoseidonQ (an acronym for Personal Optimization Software for Efficient Implementation and Derivation of Online QSAR), a user-friendly software solution designed to simplify the derivation of QSAR models for drug design and discovery. PoseidonQ incorporates 22 machine learning algorithms, 17 types of molecular fingerprints, and 208 RDKit molecular descriptors and enables the quick derivation of both regression and classification models along with a calculated and easily interpretable applicability domain. Importantly, the platform is automatically linked to the latest version of the ChEMBL database, thus providing streamlined access to large amounts of curated bioactivity data. The user is also given the option of gathering high-quality experimental data based on customizable filtering settings. Notably, PoseidonQ facilitates the deployment of trained QSAR models as web-based applications through seamless integration with Streamlit Cloud and GitHub, empowering users to share, refine, and integrate models effortlessly. The translation of QSAR models into web-based applications makes them freely accessible, portable, and ready for screening large volumes of new data without limits. By unifying data preparation, model generation, and deployment into an intuitive workflow, PoseidonQ makes advanced QSAR modeling for drug design and discovery accessible to a wide audience of researchers irrespective of their skill levels.
PoseidonQ bridges the gap between complex machine learning techniques and practical drug discovery applications, enhancing the efficiency, collaboration, and adoption of QSAR approaches in modern drug discovery programs. PoseidonQ is available for Windows and Linux (Ubuntu 22.04) operating systems and can be downloaded for free at https://github.com/Muzatheking12/PoseidonQ.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
**Overview**
This dataset provides detailed information on various chemical compounds and their measured effectiveness against malaria parasites based on experimental bioassay results. The data has been curated from the ChEMBL and TCMDC repositories and includes molecular structure data alongside biological activity metrics.
The purpose of this dataset is to enable machine learning applications such as classification or regression models that can predict the effectiveness of a compound based on its structure and bioassay values.
| Column Name | Description |
|---|---|
| COMPOUND_ID | Unique identifier for each chemical compound |
| SOURCES | Source tags or identifiers (e.g., ChEMBL ID, TCMDC ID) |
| PCT_IHB_3D7 | Percentage inhibition of the 3D7 strain of Plasmodium falciparum |
| PCT_INHB_DD2 | Percentage inhibition of the DD2 strain of Plasmodium falciparum |
| PCT_INHIB_3D7_PFLDH | Percentage inhibition of the PfLDH enzyme from the 3D7 strain |
| pXC50_3D7 | Negative log of the IC50 value (higher is more effective) for 3D7 strain |
| SMILES | SMILES string representing the molecular structure |
**Drug Effectiveness**
An effectiveness score can be derived by combining inhibition percentages and the pXC50_3D7 metric. Higher scores generally indicate more effective compounds against malaria parasites.
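To make the pXC50_3D7 column easier to interpret, the sketch below converts between a pXC50 value and a concentration. It assumes the usual convention that pXC50 is the negative base-10 logarithm of the concentration expressed in mol/L; the dataset description only says "negative log", so both the base and the unit are assumptions. No formula for the combined effectiveness score is given in the description, so none is attempted here.

```python
import math


def pxc50_to_nanomolar(pxc50):
    """Convert pXC50 (assumed -log10 of molar concentration) to nanomolar."""
    return 10 ** (9 - pxc50)


def nanomolar_to_pxc50(concentration_nm):
    """Convert a concentration in nanomolar back to pXC50."""
    if concentration_nm <= 0:
        raise ValueError("concentration must be positive")
    return 9 - math.log10(concentration_nm)


# Under the stated assumption, a pXC50 of 6 corresponds to 1 micromolar (1000 nM).
print(pxc50_to_nanomolar(6.0))
print(nanomolar_to_pxc50(1000.0))
```

Because the scale is logarithmic, an increase of one pXC50 unit corresponds to a tenfold lower concentration, which is why higher values indicate more potent compounds.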
This dataset can be used for:
- Drug discovery using AI/ML
- Predictive modeling of compound activity
- QSAR (Quantitative Structure–Activity Relationship) analysis
- Molecular structure-based classification
**Use Cases**
- Drug discovery & development
- Bioinformatics research
- Cheminformatics feature engineering
- ML/DL-based compound classification
**License**
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Link: https://creativecommons.org/licenses/by/4.0/
What does this license mean? This license allows others to:
- Share: copy and redistribute the material in any medium or format
- Adapt: remix, transform, and build upon the material for any purpose, even commercially
As long as you give appropriate credit to the original dataset creator.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Histone deacetylases have been recognized as a potential target for epigenetic aberrance reversal in the various strategies for cancer therapy, with HDAC6 implicated in various forms of tumor growth and cancers. Diverse inhibitors of HDAC6 have been developed; however, there is still the challenge of iso-specificity and toxicity. In this study, we trained a random forest model on all HDAC6 inhibitors curated in the ChEMBL database (3,742). Upon rigorous validation, the model had a 78% balanced accuracy (external validation) and was used to screen the SCUBIDOO database; 7,785 hit compounds resulted and were docked into the HDAC6 CD2 active site. The top two compounds, each having a benzimidazole moiety as its zinc-binding group, had binding affinities of -78.56 kcal/mol and -78.21 kcal/mol, respectively. The compounds were subjected to exhaustive docking protocols (QM-polarized docking and induced-fit docking) in order to elucidate a binding hypothesis and an accurate binding affinity. Upon optimization, the compounds showed improved binding affinity (-81.42 kcal/mol), putative specificity for HDAC6, and good ADMET properties. We have therefore developed a reliable model to screen for HDAC6 inhibitors and suggested a series of benzimidazole-based inhibitors showing high binding affinity and putative specificity for HDAC6.
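For readers unfamiliar with the metric reported above, balanced accuracy is the mean of sensitivity and specificity, which makes it less sensitive to class imbalance than plain accuracy. The sketch below shows the standard definition for a binary classifier; the confusion-matrix counts are invented for illustration and are not the study's results.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (recall on actives) and specificity (recall on inactives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Invented counts: 80 of 100 actives and 70 of 100 inactives classified correctly.
score = balanced_accuracy(tp=80, fn=20, tn=70, fp=30)
print(score)
```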
http://www.opendefinition.org/licenses/cc-by-sa
ChEMBL is a database of bioactive drug-like small molecules. It contains 2-D structures, calculated properties (e.g. logP, molecular weight, Lipinski parameters) and abstracted bioactivities (e.g. binding constants, pharmacology and ADMET data). The data are abstracted and curated from the primary scientific literature, and cover a significant fraction of the SAR and discovery of modern drugs.
It is available in RDF form through EMBL-EBI's RDF Platform.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was built during a research project in the field of Computer-Aided Drug Discovery (CADD), funded by a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery grant. The aim of the project was to build descriptor-based machine learning models for hERG cardiotoxicity liability predictions. The dataset includes a total of 8879 unique molecular compounds gathered from the publicly available ChEMBL and PubChem bioactivity databases, as well as from literature mining. The list is split into 2 sets, 8380 for training and 499 for testing. All molecular compounds are represented in SMILES format with their corresponding pIC50 potency values.
To access the full original work, please visit the following link: Manuscript
Note: Upon usage of this data, kindly cite the original manuscript describing the curation process:
Arab, Issar, and Khaled Barakat. "ToxTree: descriptor-based machine learning models for both hERG and Nav1.5 cardiotoxicity liability predictions." arXiv preprint arXiv:2112.13467 (2021).
Refer to our latest manually curated and a much larger dataset here: link
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A systematic analysis is presented of the 220 phosphodiesterase (PDE) catalytic domain crystal structures present in the Protein Data Bank (PDB) with a focus on PDE-ligand interactions. The consistent structural alignment of 57 PDE ligand binding site residues enables the systematic analysis of PDE-ligand Interaction FingerPrints (IFPs), the identification of subtype-specific PDE-ligand interaction features, and the classification of ligands according to their binding modes. We illustrate how systematic mining of this phosphodiesterase structure and ligand interaction annotated (PDEStrIAn) database provides new insights into how conserved and selective PDE interaction hot spots can accommodate the large diversity of chemical scaffolds in PDE ligands. A substructure analysis of the co-crystallized PDE ligands in combination with those in the ChEMBL database provides a toolbox for scaffold hopping and ligand design. These analyses lead to an improved understanding of the structural requirements of PDE binding that will be useful in future drug discovery studies.
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
A public dataset from ChEMBL (v25) for MELLODDY TUNER release v3.
Data extracted from ChEMBL (LICENSE attached), and processed with https://github.com/melloddy/MELLODDY-TUNER release v3.
Data can be used for technical tests, template for your own dataset and machine learning with SparseChem (https://github.com/melloddy/SparseChem).
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
A public dataset from ChEMBL (v25) for MELLODDY TUNER release v2.
Data extracted from ChEMBL and PubChem (input archives), and processed with https://github.com/melloddy/MELLODDY-TUNER release v2.
Data can be used for technical tests, template for your own dataset and machine learning with SparseChem (https://github.com/melloddy/SparseChem).
The compressed data folder (dataset.rar) represents a data framework for researchers in the field of drug discovery to perform in-depth analyses on a very large, open-access, unique and comprehensive hERG, Nav1.5, and Cav1.2 cardiotoxicity integrated database of small molecules and their activities. The database is organized as follows:
- Each sub-folder represents a cardiac ion channel target: hERG, Nav1.5, and Cav1.2.
- Each target sub-folder consists of 3 files in CSV format. One file contains the development set (split into training and validation sets using an 80/20 ratio for hyperparameter tuning). The other 2 files contain external evaluation sets. The first test dataset consists of compounds with a structural similarity of no more than 60% (Tanimoto similarity ≤ 0.6) to the remaining development set, while the second test dataset comprises compounds with a structural similarity of no more than 70% (Tanimoto similarity ≤ 0.7) to the remaining development set.
- Each file contains data with 7 columns: "InChl Key" as a unique identifier of the chemical structure, "SMILES" as the string format of storage and exchange of the chemical structure, "Source" as the upstream data source from which the data was retrieved, "ChEMBL ID" as the ChEMBL identifier if the compound comes from the ChEMBL database, "PubChem CID" as the PubChem compound identifier if the compound comes from the PubChem database, "pIC50" as the negative logarithm of the half-maximal inhibitory concentration (IC50) to describe the potency of the compound, and "USED_AS" specifying whether the compound was used for training or validation.
Upon usage, please cite this publication: Issar Arab, Kristof Egghe, Kris Laukens, Ke Chen, Khaled Barakat, Wout Bittremieux, Benchmarking of Small Molecule Feature Representations for hERG, Nav1.5, and Cav1.2 Cardiotoxicity Prediction, Journal of Chemical Information and Modeling, (2023). doi:10.1021/acs.jcim.3c01301
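The Tanimoto thresholds mentioned above can be illustrated with a small sketch. It assumes each compound is represented as a set of "on" fingerprint bit indices; the description does not say which fingerprint the authors used or how their split was implemented, so the functions below are a hypothetical illustration of the similarity measure and of a threshold filter, not a reproduction of their procedure.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def max_similarity(candidate, reference_set):
    """Highest Tanimoto similarity between a candidate and any reference compound."""
    return max((tanimoto(candidate, ref) for ref in reference_set), default=0.0)


def select_dissimilar(candidates, reference_set, threshold):
    """Keep candidates whose maximum similarity to the reference set is <= threshold."""
    return [c for c in candidates if max_similarity(c, reference_set) <= threshold]


# Toy fingerprints: 2 shared bits out of 4 distinct bits gives 0.5.
print(tanimoto({1, 2, 3}, {2, 3, 4}))
print(select_dissimilar([{1, 2, 3}, {7, 8}], [{2, 3, 4}], threshold=0.4))
```

A lower threshold yields a test set that is structurally further from the development set, which is why the ≤ 0.6 set would be the harder of the two evaluation sets.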