59 datasets found

ChEMBL EBI Small Molecules Database
kaggle.com
zip
Updated Feb 12, 2019
Cite
Google BigQuery (2019). ChEMBL EBI Small Molecules Database [Dataset]. https://www.kaggle.com/bigquery/ebi-chembl
Explore at:
Available download formats: zip (0 bytes)
Dataset updated
Feb 12, 2019
Dataset authored and provided by
Google BigQuery
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Context

ChEMBL is maintained by the European Bioinformatics Institute (EBI), of the European Molecular Biology Laboratory (EMBL), based at the Wellcome Trust Genome Campus, Hinxton, UK.

Content

ChEMBL is a manually curated database of bioactive molecules with drug-like properties used in drug discovery, including information about existing patented drugs.

Schema: http://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_23/chembl_23_schema.png

Documentation: http://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_23/schema_documentation.html

Fork this notebook to get started on accessing data in the BigQuery dataset using the BQhelper package to write SQL queries.
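As a minimal sketch of the kind of SQL such a notebook would send, the helper below only builds a preview query string. The dataset path is taken from the "Data Origin" link in this entry; the function name `build_preview_query` and the table name `molecule_dictionary` are illustrative assumptions, and the table names actually exposed in BigQuery may differ.

```python
def build_preview_query(table: str, limit: int = 10) -> str:
    """Return a standard-SQL preview query against the public ChEMBL dataset."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Dataset path from the "Data Origin" link; the table name comes from the caller.
    return (
        f"SELECT * FROM `patents-public-data.ebi_chembl.{table}` "
        f"LIMIT {limit}"
    )


# Hypothetical table name, used here for illustration only.
query = build_preview_query("molecule_dictionary", limit=5)
print(query)
```

The resulting string can then be handed to whichever BigQuery client the notebook uses.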

Acknowledgements

“ChEMBL” by the European Bioinformatics Institute (EMBL-EBI), used under CC BY-SA 3.0. Modifications have been made to add normalized publication numbers.

Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:ebi_chembl

Banner photo by rawpixel on Unsplash
ChEMBL
rrid.site
scicrunch.org
Updated Dec 23, 2025
Cite
(2025). ChEMBL [Dataset]. http://identifiers.org/RRID:SCR_014042
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_014042
Dataset updated
Dec 23, 2025
Description
Collection of bioactive drug-like small molecules that contains 2D structures, calculated properties and abstracted bioactivities. Used for drug discovery and chemical biology research. Clinical progress of new compounds is continuously integrated into the database.
Drug Targets and Drug Lists Data Package
johnsnowlabs.com
csv
Updated Jan 20, 2021
Cite
John Snow Labs (2021). Drug Targets and Drug Lists Data Package [Dataset]. https://www.johnsnowlabs.com/marketplace/drug-targets-and-drug-lists-data-package/
Explore at:
Available download formats: csv
Dataset updated
Jan 20, 2021
Dataset authored and provided by
John Snow Labs
Description
This data package contains information on approved, researched and proven drug targets and drug lists.
Data from: Drug Safety Data Curation and Modeling in ChEMBL: Boxed Warnings...
acs.figshare.com
datasetcatalog.nlm.nih.gov
xlsx
Updated Jun 4, 2023
Cite
Fiona M.I. Hunter; A. Patrícia Bento; Nicolas Bosc; Anna Gaulton; Anne Hersey; Andrew R. Leach (2023). Drug Safety Data Curation and Modeling in ChEMBL: Boxed Warnings and Withdrawn Drugs [Dataset]. http://doi.org/10.1021/acs.chemrestox.0c00296.s001
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.1021/acs.chemrestox.0c00296.s001
Dataset updated
Jun 4, 2023
Dataset provided by
ACS Publications
Authors
Fiona M.I. Hunter; A. Patrícia Bento; Nicolas Bosc; Anna Gaulton; Anne Hersey; Andrew R. Leach
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The safety of marketed drugs is an ongoing concern, with some of the more frequently prescribed medicines resulting in serious or life-threatening adverse effects in some patients. Safety-related information for approved drugs has been curated to include the assignment of toxicity class(es) based on their withdrawn status and/or black box warning information described on medicinal product labels. The ChEMBL resource contains a wide range of bioactivity data types, from early “Discovery” stage preclinical data for individual compounds through to postclinical data on marketed drugs; the inclusion of the curated drug safety data set within this framework can support a wide range of safety-related drug discovery questions. The curated drug safety data set will be made freely available through ChEMBL and updated in future database releases.
MultiTarget Bioactivity ChEMBL
kaggle.com
zip
Updated Nov 30, 2025
Cite
J. (2025). MultiTarget Bioactivity ChEMBL [Dataset]. https://www.kaggle.com/datasets/xjoannax88/multitarget-bioactivity-chembl/code
Explore at:
Available download formats: zip (41644 bytes)
Dataset updated
Nov 30, 2025
Authors
J.
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Bioactivity Dataset for EGFR, DRD2, BACE1, and HDAC1

This dataset contains curated molecular data for compounds tested against four pharmacologically important protein targets. All data are derived from the ChEMBL database.

Targets Included

Target Name	ChEMBL Target ID	Target Class
EGFR	CHEMBL203	Kinase (Receptor TK)
DRD2	CHEMBL217	GPCR (Dopamine D2 receptor)
BACE1	CHEMBL1987	Enzyme (Aspartyl protease)
HDAC1	CHEMBL325	Enzyme (Histone deacetylase)

Each entry in the dataset includes:
- ChEMBL compound ID
- Canonical SMILES
- Molecular properties (molecular weight, HBA, HBD, logP, and TPSA)
- Target label

Format

The dataset is provided in CSV or Parquet format with the following columns:

ChEMBL ID: ChEMBL compound identifier

SMILES: Canonical SMILES string

Molecular weight: Molecular mass (Da)

LogP: Octanol-water partition coefficient

HBA: Number of hydrogen bond acceptors

HBD: Number of hydrogen bond donors

TPSA: Topological polar surface area

Protein: Protein target name
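To show how these columns might be consumed, here is a minimal sketch that reads rows in this layout and counts Lipinski rule-of-five violations per compound. The exact header spellings are assumed to match the column names listed above, and the two sample rows (IDs and property values) are placeholders invented for illustration, not records from the dataset.

```python
import csv
import io

# Placeholder rows; the IDs and property values are made up for illustration.
SAMPLE_CSV = """ChEMBL ID,SMILES,Molecular weight,LogP,HBA,HBD,TPSA,Protein
EXAMPLE_1,CCO,46.07,-0.03,1,1,20.23,EGFR
EXAMPLE_2,CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC,563.1,15.0,0,0,0.0,DRD2
"""


def rule_of_five_violations(row: dict) -> int:
    """Count Lipinski rule-of-five violations for one CSV row."""
    checks = [
        float(row["Molecular weight"]) > 500,  # molecular mass above 500 Da
        float(row["LogP"]) > 5,                # too lipophilic
        int(row["HBA"]) > 10,                  # too many H-bond acceptors
        int(row["HBD"]) > 5,                   # too many H-bond donors
    ]
    return sum(checks)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
violations = {row["ChEMBL ID"]: rule_of_five_violations(row) for row in rows}
print(violations)
```

For the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened CSV file handle.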

License

This dataset is a derivative of ChEMBL and is distributed under the same license:

Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
https://creativecommons.org/licenses/by-sa/3.0/

Source

ChEMBL Database: https://www.ebi.ac.uk/chembl/

Data accessed via the ChEMBL WebResource Client (chembl_webresource_client)

Citation

If you use this dataset, please cite the following:

ChEMBL Database:
Gaulton A, Hersey A, Nowotka M, Bento AP, Chambers J, Mendez D, Mutowo P, Atkinson F, Bellis LJ, Cibrián-Uhalte E, Davies M, Dedman N, Karlsson A, Magariños MP, Overington JP, Papadatos G, Smit I, Leach AR. (2017).
The ChEMBL database in 2017. Nucleic Acids Res., 45(D1): D945–D954.
https://doi.org/10.1093/nar/gkw1074

ChEMBL Web Services:
Davies M, Nowotka M, Papadatos G, Dedman N, Gaulton A, Atkinson F, Bellis LJ, Overington JP. (2015).
ChEMBL web services: streamlining access to drug discovery data and utilities. Nucleic Acids Res., 43(W1): W612–W620.
https://doi.org/10.1093/nar/gkv352
Data from: Occurrence of “Natural Selection” in Successful Small Molecule...
figshare.com
datasetcatalog.nlm.nih.gov
application/csv
Updated Jul 30, 2024
Cite
A. Lina Heinzke; Axel Pahl; Barbara Zdrazil; Andrew R. Leach; Herbert Waldmann; Robert J. Young; Paul D. Leeson (2024). Occurrence of “Natural Selection” in Successful Small Molecule Drug Discovery [Dataset]. http://doi.org/10.1021/acs.jmedchem.4c00811.s003
Explore at:
Available download formats: application/csv
Unique identifier
https://doi.org/10.1021/acs.jmedchem.4c00811.s003
Dataset updated
Jul 30, 2024
Dataset provided by
ACS Publications
Authors
A. Lina Heinzke; Axel Pahl; Barbara Zdrazil; Andrew R. Leach; Herbert Waldmann; Robert J. Young; Paul D. Leeson
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Published compounds from ChEMBL version 32 are used to seek evidence for the occurrence of “natural selection” in drug discovery. Three measures of natural product (NP) character were applied to compare time- and target-matched compounds reaching the clinic (clinical compounds in phase 1–3 development and approved drugs) with background compounds (reference compounds). Pseudo-NPs (PNPs), containing NP fragments combined in ways inaccessible by nature, are increasing over time, reaching 67% of clinical compounds first disclosed since 2010. PNPs are 54% more likely to be found in post-2008 clinical versus reference compounds. The majority of target classes show increased clinical compound NP character versus their reference compounds. Only 176 NP fragments appear in >1000 clinical compounds published since 2008, yet these make up on average 63% of the clinical compound’s core scaffolds. There is untapped potential awaiting exploitation, by applying nature’s building blocks (“natural intelligence”) to drug design.
A new ChEMBL dataset for FastTargetPred
data.mendeley.com
dataon.kisti.re.kr
Updated Feb 1, 2022
Cite
Bruno Villoutreix (2022). A new ChEMBL dataset for FastTargetPred [Dataset]. http://doi.org/10.17632/9t5zdgs3s2.1
Explore at:
Unique identifier
https://doi.org/10.17632/9t5zdgs3s2.1
Dataset updated
Feb 1, 2022
Authors
Bruno Villoutreix
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A ChEMBL-v29 dataset was generated for use with the ligand-based similarity search target prediction engine FastTargetPred (https://github.com/ludovicchaput/FastTargetPred). Using this new dataset, an attempt was made to predict macromolecular targets for a published dataset of 160,000 tetrapeptides.
Drug Design with Small Molecule SMILES
kaggle.com
zip
Updated Feb 26, 2022
Cite
Sgt. Peppers (2022). Drug Design with Small Molecule SMILES [Dataset]. https://www.kaggle.com/art3mis/chembl22
Explore at:
Available download formats: zip (26097905 bytes)
Dataset updated
Feb 26, 2022
Authors
Sgt. Peppers
License
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
Context

SMILES (simplified molecular-input line-entry system) strings are ASCII sequences that describe the structure of a compound. Due to their simplicity, SMILES representations have been widely used in drug design, mining, and repurposing using machine learning and natural language processing techniques.

See Cheminformania's seq2seq model for an excellent tutorial to get started.

Content

A single text file

Each row contains two columns for description of an individual molecule. The first column is the SMILES string. The second is a reference to the full ChEMBL entry for that particular molecule.
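A minimal sketch of reading such a two-column file follows. The description does not state the column delimiter, so whitespace separation is an assumption here (SMILES strings themselves contain no whitespace); the function name and the sample identifiers are placeholders, not real ChEMBL entries.

```python
import io


def parse_smiles_lines(lines):
    """Yield (smiles, chembl_ref) pairs from whitespace-separated lines."""
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        yield parts[0], parts[1]


# Placeholder content standing in for the text file; the references are invented.
sample = io.StringIO("CCO EXAMPLE_REF_1\n\nc1ccccc1 EXAMPLE_REF_2\n")
pairs = list(parse_smiles_lines(sample))
print(pairs)
```

If the file turns out to be comma- or tab-delimited, only the `line.split()` call needs to change.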

Acknowledgements

ChEMBL Database: Gaulton A, Hersey A, Nowotka M, Bento AP, Chambers J, Mendez D, Mutowo P, Atkinson F, Bellis LJ, Cibrián-Uhalte E, Davies M, Dedman N, Karlsson A, Magariños MP, Overington JP, Papadatos G, Smit I, Leach AR. (2017) 'The ChEMBL database in 2017.' Nucleic Acids Res., 45(D1) D945-D954.

ChEMBL Web Services: Davies M, Nowotka M, Papadatos G, Dedman N, Gaulton A, Atkinson F, Bellis L, Overington JP. (2015) 'ChEMBL web services: streamlining access to drug discovery data and utilities.' Nucleic Acids Res., 43(W1) W612-W620.

ChEMBL RDF: S. Jupp, J. Malone, J. Bolleman, M. Brandizi, M. Davies, L. Garcia, A. Gaulton, S. Gehant, C. Laibe, N. Redaschi, S.M Wimalaratne, M. Martin, N. Le Novère, H. Parkinson, E. Birney and A.M Jenkinson (2014) The EBI RDF Platform: Linked Open Data for the Life Sciences Bioinformatics 30 1338-1339
CompoundDB4j V1
zenodo.org
application/gzip, png
Updated Oct 9, 2024
Cite
Cassandra Königs (2024). CompoundDB4j V1 [Dataset]. http://doi.org/10.5281/zenodo.13906923
Explore at:
Available download formats: png, application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.13906923
Dataset updated
Oct 9, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Cassandra Königs
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
CompoundDB4j: Deriving an extensible and integrated knowledge of heterogeneous chemical databases

Computational analyses centered on drugs or compounds often need to map attributes across multiple drug databases. In this study, we provide a Neo4j repository that integrates two of the most prominent open-source drug databases, DrugBank and ChEMBL, with the goal of establishing an integrated data visualization and analysis tool for drug discovery studies. The drugs present in DrugBank are mapped to their counterparts in ChEMBL. Integrating these resources and harmonizing them through knowledge graph serialization in Neo4j leads to the identification of relationships between drugs and other related features that are otherwise spread across two different resources. A common data format, a prerequisite for populating the Neo4j database, enables users to identify new relationships central to drug discovery research, such as drug-target interactions (DTI). The resource is freely available at: https://github.com/ambf0632/CompoundDB4j
Compound descriptors for text-mined orthosteric and allosteric dataset
data.4tu.nl
txt
Updated Jul 28, 2020
Cite
Lindsey Burggraaff (2020). Compound descriptors for text-mined orthosteric and allosteric dataset [Dataset]. http://doi.org/10.4121/uuid:5738caea-2390-4dc7-9830-8d9644232144
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.4121/uuid:5738caea-2390-4dc7-9830-8d9644232144
Dataset updated
Jul 28, 2020
Dataset provided by
4TU.ResearchData
Authors
Lindsey Burggraaff
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Compound descriptors for compounds retrieved from the ChEMBL database. The descriptors annotate the compounds chemically and by physicochemical properties, for use in cheminformatics.
A Curated Dataset for Drug Class Prediction and Repositioning
data.mendeley.com
Updated Jan 27, 2025
Cite
Tiago Alves de Oliveira (2025). A Curated Dataset for Drug Class Prediction and Repositioning [Dataset]. http://doi.org/10.17632/j83cfgkrb5.2
Explore at:
Unique identifier
https://doi.org/10.17632/j83cfgkrb5.2
Dataset updated
Jan 27, 2025
Authors
Tiago Alves de Oliveira
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This curated dataset offers a valuable resource for deep learning applications in drug discovery and repositioning domains. It contains 5,350 high-resolution images systematically categorized into pharmacological classes and molecular targets. The pharmacological classes encompass antifungals, antivirals, corticosteroids, diuretics, and non-steroidal anti-inflammatory drugs (NSAIDs), while the molecular targets emphasize Alzheimer's disease-related enzymes, including acetylcholinesterase, butyrylcholinesterase, and beta-secretase 1. The dataset was meticulously compiled using data from well-established databases, including DrugBank, ChEMBL, and DUD-E, ensuring diversity and quality in the compounds selected for training. Active compounds (true positives) were sourced from DrugBank and ChEMBL, while decoy compounds (true negatives) were generated using the DUD-E protocol. The decoy compounds are designed to match the physicochemical properties of active compounds while lacking binding affinity, creating a robust benchmark for machine learning evaluation. The balanced structure of the dataset, with equal representation of true positive and decoy compounds, enhances its suitability for binary and multi-class classification tasks. The collection of compounds is diverse and of high quality, thus supporting a wide range of deep learning tasks, including pharmacological class prediction, virtual screening, and molecular target identification. This ultimately advances computational approaches in drug discovery.
Data from: PoseidonQ: A Free Machine Learning Platform for the Development,...
acs.figshare.com
xlsx
Updated Apr 9, 2025
Cite
Muzammil Kabier; Nicola Gambacorta; Fulvio Ciriaco; Fabrizio Mastrolorito; Sunil Kumar; Bijo Mathew; Orazio Nicolotti (2025). PoseidonQ: A Free Machine Learning Platform for the Development, Analysis, and Validation of Efficient and Portable QSAR Models for Drug Discovery [Dataset]. http://doi.org/10.1021/acs.jcim.4c02372.s002
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.1021/acs.jcim.4c02372.s002
Dataset updated
Apr 9, 2025
Dataset provided by
ACS Publications
Authors
Muzammil Kabier; Nicola Gambacorta; Fulvio Ciriaco; Fabrizio Mastrolorito; Sunil Kumar; Bijo Mathew; Orazio Nicolotti
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The advent of powerful machine learning algorithms, together with the availability of high volumes of pharmacological data, has given new fuel to QSAR, opening unprecedented options for deriving highly predictive models for assisting the rational design of new bioactive compounds, for screening and prioritizing large molecular libraries, and for repurposing drugs toward new clinical uses. Here, we present PoseidonQ (an acronym for Personal Optimization Software for Efficient Implementation and Derivation of Online QSAR), a user-friendly software solution designed to simplify the derivation of QSAR models for drug design and discovery. PoseidonQ incorporates 22 machine learning algorithms, 17 types of molecular fingerprints, and 208 RDKit molecular descriptors and enables the quick derivation of both regression and classification models along with a calculated and easily interpretable applicability domain. Importantly, the platform is automatically linked to the latest version of the ChEMBL database, thus providing streamlined access to large amounts of curated bioactivity data. In addition, the user is given the option of gathering high-quality experimental data based on customizable filtering settings. Notably, PoseidonQ facilitates the deployment of trained QSAR models as web-based applications through seamless integration with Streamlit Cloud and GitHub, empowering users to share, refine, and integrate models effortlessly. The translation of QSAR models into web-based applications makes them freely accessible, portable, and ready for screening large volumes of new data without limits. By unifying data preparation, model generation, and deployment into an intuitive workflow, PoseidonQ makes advanced QSAR modeling for drug design and discovery accessible to a wide audience of researchers irrespective of their skill levels.
PoseidonQ bridges the gap between complex machine learning techniques and practical drug discovery applications, enhancing the efficiency, collaboration, and adoption of QSAR approaches in modern drug discovery programs. PoseidonQ is available for Windows and Linux (Ubuntu 22.04) operating systems and can be downloaded for free at https://github.com/Muzatheking12/PoseidonQ.

ChEMBL (Drug Effectiveness Prediction)

kaggle.com

zip

Updated Jun 3, 2025

Cite

Abdullah Amjad (2025). ChEMBL (Drug Effectiveness Prediction) [Dataset]. https://www.kaggle.com/abdullahamjad1234/chembl-drug-effectiveness-prediction

Explore at:

Available download formats: zip (488297 bytes)

Dataset updated

Jun 3, 2025

Authors

Abdullah Amjad

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Overview

This dataset provides detailed information on various chemical compounds and their measured effectiveness against malaria parasites based on experimental bioassay results. The data has been curated from the ChEMBL and TCMDC repositories and includes molecular structure data alongside biological activity metrics.

The purpose of this dataset is to enable machine learning applications such as classification or regression models that can predict the effectiveness of a compound based on its structure and bioassay values.

Column Name	Description
`COMPOUND_ID`	Unique identifier for each chemical compound
`SOURCES`	Source tags or identifiers (e.g., ChEMBL ID, TCMDC ID)
`PCT_IHB_3D7`	Percentage inhibition of the 3D7 strain of Plasmodium falciparum
`PCT_INHB_DD2`	Percentage inhibition of the DD2 strain of Plasmodium falciparum
`PCT_INHIB_3D7_PFLDH`	Percentage inhibition of the PfLDH enzyme from the 3D7 strain
`pXC50_3D7`	Negative log of the IC50 value (higher is more effective) for 3D7 strain
`SMILES`	SMILES string representing the molecular structure

Drug Effectiveness

An effectiveness score can be derived by combining inhibition percentages and the pXC50_3D7 metric. Higher scores generally indicate more effective compounds against malaria parasites.
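Because pXC50 is a negative base-10 logarithm, it helps to see what a given value means as a concentration. The sketch below assumes the usual convention that the underlying XC50 is expressed in mol/L, which the dataset description does not state explicitly; the function name is illustrative.

```python
def pxc50_to_nanomolar(pxc50: float) -> float:
    """Convert pXC50 = -log10(XC50 in mol/L) to XC50 in nanomolar."""
    # XC50 [mol/L] = 10 ** (-pXC50); multiplying by 1e9 gives nmol/L.
    return 10 ** (9 - pxc50)


# A pXC50 of 6 corresponds to 1 micromolar, i.e. 1000 nM;
# each additional unit of pXC50 means a tenfold lower concentration.
for value in (5.0, 6.0, 7.0):
    print(value, pxc50_to_nanomolar(value))
```

This is why a higher pXC50_3D7 indicates a more potent compound: less material is needed to reach the same effect.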

This dataset can be used for:

Drug discovery using AI/ML

Predictive modeling of compound activity

QSAR (Quantitative Structure–Activity Relationship) analysis

Molecular structure-based classification

Use Cases

Drug discovery & development

Bioinformatics research

Cheminformatics feature engineering

ML/DL-based compound classification

License

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Link: https://creativecommons.org/licenses/by/4.0/

What does this license mean?

This license allows others to:

Share: copy and redistribute the material in any medium or format

Adapt: remix, transform, and build upon the material for any purpose, even commercially

These permissions apply as long as you give appropriate credit to the original dataset creator.

Development of Putative Isospecific Inhibitors for HDAC6 using Random...
data.mendeley.com
Updated Sep 8, 2020
Cite
Ireoluwa Joel (2020). Development of Putative Isospecific Inhibitors for HDAC6 using Random Forest, QM-Polarized docking, Induced-fit docking, and Quantum mechanics [Dataset]. http://doi.org/10.17632/775s3xrhrk.2
Explore at:
Unique identifier
https://doi.org/10.17632/775s3xrhrk.2
Dataset updated
Sep 8, 2020
Authors
Ireoluwa Joel
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Histone deacetylases have been recognized as a potential target for epigenetic aberrance reversal in the various strategies for cancer therapy, with HDAC6 implicated in various forms of tumor growth and cancers. Diverse inhibitors of HDAC6 have been developed; however, there is still the challenge of iso-specificity and toxicity. In this study, we trained a Random Forest model on all HDAC6 inhibitors curated in the ChEMBL database (3,742). Upon rigorous validation, the model had a 78% balanced accuracy (external validation) and was used to screen the SCUBIDOO database; 7785 hit compounds resulted and were docked into the HDAC6 CD2 active site. The top two compounds, each having a benzimidazole moiety as its zinc-binding group, had binding affinities of -78.56 kcal/mol and -78.21 kcal/mol, respectively. The compounds were subjected to exhaustive docking protocols (QM-polarized docking and induced-fit docking) in order to elucidate a binding hypothesis and accurate binding affinity. Upon optimization, the compounds showed improved binding affinity (-81.42 kcal/mol), putative specificity for HDAC6, and good ADMET properties. We have therefore developed a reliable model to screen for HDAC6 inhibitors and suggested a series of benzimidazole-based inhibitors showing high binding affinity and putative specificity for HDAC6.
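For readers unfamiliar with the metric quoted above, balanced accuracy is the mean of sensitivity and specificity. The sketch below computes it from confusion-matrix counts; the counts are invented to illustrate how a value of 0.78 can arise and are not the study's actual results.

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity (true positive rate) and specificity (true negative rate)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical counts, chosen for illustration only.
score = balanced_accuracy(tp=80, fn=20, tn=76, fp=24)
print(round(score, 2))
```

Unlike plain accuracy, this measure is not inflated when one class (for example, inactive compounds) dominates the validation set.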
ChEMBL RDF
data.wu.ac.at
api/sparql, meta/void +1
Updated Jul 30, 2016
EMBL European Bioinformatics Institute (2016). ChEMBL RDF [Dataset]. https://data.wu.ac.at/odso/datahub_io/YzA0YTM3MTItN2NlZC00ODk1LTk3YzUtZDcxYjgwZDcyZTUw
Explore at:
Available download formats: ttl, api/sparql, meta/void
Dataset updated
Jul 30, 2016
Dataset provided by
European Bioinformatics Institute (http://www.ebi.ac.uk/)
License
http://www.opendefinition.org/licenses/cc-by-sa
Description
ChEMBL is a database of bioactive drug-like small molecules. It contains 2-D structures, calculated properties (e.g. logP, molecular weight, Lipinski parameters) and abstracted bioactivities (e.g. binding constants, pharmacology and ADMET data). The data are abstracted and curated from the primary scientific literature and cover a significant fraction of the SAR and discovery of modern drugs.

It is available in RDF form through EMBL-EBI's RDF Platform.
Highly curated hERG dataset of 8879 unique molecular compounds with...
zenodo.org
csv
Updated Sep 28, 2024
Cite
Issar Arab; Khaled Barakat (2024). Highly curated hERG dataset of 8879 unique molecular compounds with corresponding potency values [Dataset]. http://doi.org/10.5281/zenodo.5807719
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.5807719
Dataset updated
Sep 28, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Issar Arab; Khaled Barakat
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset was built during a research project in the field of Computer-Aided Drug Discovery (CADD), funded by a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery grant. The aim of the project was to build descriptor-based machine learning models for hERG cardiotoxicity liability predictions. The dataset includes a total of 8879 unique molecular compounds gathered from the publicly available ChEMBL and PubChem bioactivity databases, as well as from literature mining. The list is split into 2 sets: 8380 compounds for training and 499 for testing. All molecular compounds are represented in SMILES format with their corresponding pIC50 potency values.

To access the full original work, please visit the following link: Manuscript

Note: Upon usage of this data, kindly cite the original manuscript describing the curation process:

Arab, Issar, and Khaled Barakat. "ToxTree: descriptor-based machine learning models for both hERG and Nav1.5 cardiotoxicity liability predictions." arXiv preprint arXiv:2112.13467 (2021).

Refer to our latest manually curated and a much larger dataset here: link
Data from: PDEStrIAn: A phosphodiesterase structure and ligand interaction...
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated Jan 24, 2020
Cite
Jansen, Chimed; Kooistra, Albert J.; Kanev, Georgi K.; Leurs, Rob; de Esch, Iwan J.P.; de Graaf, Chris (2020). PDEStrIAn: A phosphodiesterase structure and ligand interaction annotated database as a tool for structure-based drug design [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_45774
Explore at:
Dataset updated
Jan 24, 2020
Dataset provided by
Vrije Universiteit Amsterdam
Authors
Jansen, Chimed; Kooistra, Albert J.; Kanev, Georgi K.; Leurs, Rob; de Esch, Iwan J.P.; de Graaf, Chris
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
A systematic analysis is presented of the 220 phosphodiesterase (PDE) catalytic domain crystal structures present in the Protein Data Bank (PDB) with a focus on PDE-ligand interactions. The consistent structural alignment of 57 PDE ligand binding site residues enables the systematic analysis of PDE-ligand Interaction FingerPrints (IFPs), the identification of subtype-specific PDE-ligand interaction features, and the classification of ligands according to their binding modes. We illustrate how systematic mining of this phosphodiesterase structure and ligand interaction annotated (PDEStrIAn) database provides new insights into how conserved and selective PDE interaction hot spots can accommodate the large diversity of chemical scaffolds in PDE ligands. A substructure analysis of the co-crystalized PDE ligands in combination with those in the ChEMBL database provides a toolbox for scaffold hopping and ligand design. These analyses lead to an improved understanding of the structural requirements of PDE binding that will be useful in future drug discovery studies.
MELLODDY TUNER release v3 public data
data.niaid.nih.gov
resodate.org
Updated Aug 2, 2022
Cite
Friedrich, Lukas (2022). MELLODDY TUNER release v3 public data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6948580
Explore at:
Dataset updated
Aug 2, 2022
Dataset provided by
Merck KGaA
Authors
Friedrich, Lukas
License
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
A public dataset from ChEMBL (v25) for MELLODDY TUNER release v3.

Data extracted from ChEMBL (LICENSE attached), and processed with https://github.com/melloddy/MELLODDY-TUNER release v3.

Data can be used for technical tests, as a template for your own dataset, and for machine learning with SparseChem (https://github.com/melloddy/SparseChem).
MELLODDY TUNER release v2 public data
zenodo.org
data.niaid.nih.gov
zip
Updated Aug 1, 2022
Cite
Lukas Friedrich (2022). MELLODDY TUNER release v2 public data [Dataset]. http://doi.org/10.5281/zenodo.4835670
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.4835670
Dataset updated
Aug 1, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Lukas Friedrich
License
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
A public dataset from ChEMBL (v25) for MELLODDY TUNER release v2.

Data extracted from ChEMBL and PubChem (input archives), and processed with https://github.com/melloddy/MELLODDY-TUNER release v2.

Data can be used for technical tests, as a template for your own dataset, and for machine learning with SparseChem (https://github.com/melloddy/SparseChem).
A large comprehensive curated dataset of small molecules and their...
repository.uantwerpen.be
zenodo.org
Updated 2023
Cite
Arab, Issar; Egghe, Kristof; Laukens, Kris; Chen, Ke; Barakat, Khaled; Bittremieux, Wout (2023). A large comprehensive curated dataset of small molecules and their activities covering three cardiac ion channels: hERG, Cav1.2, and Nav1.5 [Dataset]. http://doi.org/10.5281/ZENODO.8359714
Explore at:
Unique identifier
https://doi.org/10.5281/ZENODO.8359714
Dataset updated
2023
Dataset provided by
Zenodo (http://zenodo.org/)
University of Antwerp
Faculty of Sciences. Mathematics and Computer Science
Authors
Arab, Issar; Egghe, Kristof; Laukens, Kris; Chen, Ke; Barakat, Khaled; Bittremieux, Wout
Description
The compressed data folder (dataset.rar) represents a data framework for researchers in the field of drug discovery to perform in-depth analyses on a very large, open-access, unique and comprehensive hERG, Nav1.5, and Cav1.2 cardiotoxicity integrated database of small molecules and their activities.

The database is organized as follows:

Each sub-folder represents a cardiac ion channel target: hERG, Nav1.5, and Cav1.2.

Each target sub-folder consists of 3 files in CSV format. One file contains the development set (split into training and validation sets using an 80/20 ratio for hyperparameter tuning). The other 2 files contain external evaluation sets. The first test dataset consists of compounds with a structural similarity of no more than 60% (Tanimoto similarity ≤ 0.6) to the remaining development set, while the second test dataset comprises compounds with a structural similarity of no more than 70% (Tanimoto similarity ≤ 0.7) to the remaining development set.

Each file contains data with 7 columns:
"InChl Key": a unique identifier of the chemical structure.
"SMILES": the string format of storage and exchange of the chemical structure.
"Source": the upstream data source from which the data was retrieved.
"ChEMBL ID": the ChEMBL identifier, if the compound comes from the ChEMBL database.
"PubChem CID": the PubChem compound identifier, if the compound comes from the PubChem database.
"pIC50": the negative logarithm of the half-maximal inhibitory concentration (IC50), describing the potency of the compound.
"USED_AS": whether the compound was used for training or validation.

Upon usage, please cite this publication: Issar Arab, Kristof Egghe, Kris Laukens, Ke Chen, Khaled Barakat, Wout Bittremieux, Benchmarking of Small Molecule Feature Representations for hERG, Nav1.5, and Cav1.2 Cardiotoxicity Prediction, Journal of Chemical Information and Modeling, (2023). doi:10.1021/acs.jcim.3c01301
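The two quantities this description relies on, pIC50 and the Tanimoto similarity cutoff, can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes IC50 values are given in nanomolar and that fingerprints are represented as plain sets of on-bit indices; the function names (pic50_from_ic50_nm, tanimoto, select_external) are hypothetical and chosen only for illustration.

```python
import math


def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """Return pIC50 = -log10(IC50 in mol/L), for an IC50 given in nanomolar (assumed unit)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_nm * 1e-9)


def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity |A ∩ B| / |A ∪ B| of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def select_external(candidates, development, threshold):
    """Keep candidates whose highest similarity to any development compound is <= threshold."""
    return [
        c for c in candidates
        if max((tanimoto(c, d) for d in development), default=0.0) <= threshold
    ]


if __name__ == "__main__":
    # Toy fingerprints, not real compounds.
    development = [{1, 2, 3, 4}]
    candidates = [{1, 2, 5, 6}, {7, 8}, {1, 2, 3, 4}]
    print(pic50_from_ic50_nm(1000.0))                    # IC50 of 1 µM
    print(select_external(candidates, development, 0.6))
```

In practice the fingerprints would come from a cheminformatics toolkit applied to the "SMILES" column; the set-based representation here only keeps the example self-contained.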