Collection of bioactive drug-like small molecules that contains 2D structures, calculated properties and abstracted bioactivities. Used for drug discovery and chemical biology research. Clinical progress of new compounds is continuously integrated into the database.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
ChEMBL is maintained by the European Bioinformatics Institute (EBI), of the European Molecular Biology Laboratory (EMBL), based at the Wellcome Trust Genome Campus, Hinxton, UK.
ChEMBL is a manually curated database of bioactive molecules with drug-like properties used in drug discovery, including information about existing patented drugs.
Schema: http://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_23/chembl_23_schema.png
Documentation: http://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_23/schema_documentation.html
Fork this notebook to get started on accessing data in the BigQuery dataset using the BQhelper package to write SQL queries.
“ChEMBL” by the European Bioinformatics Institute (EMBL-EBI), used under CC BY-SA 3.0. Modifications have been made to add normalized publication numbers.
Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:ebi_chembl
ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs. This representation of ChEMBL is stored in Parquet format and most easily utilized through Amazon Athena. Follow the documentation for install instructions (< 2 minute install). New ChEMBL releases occur sporadically; the most up to date information on ChEMBL releases can be found here.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset contains curated molecular data for compounds tested against four pharmacologically important protein targets. All data are derived from the ChEMBL database.
| Target Name | ChEMBL Target ID | Target Class |
|---|---|---|
| EGFR | CHEMBL203 | Kinase (Receptor TK) |
| DRD2 | CHEMBL217 | GPCR (Dopamine D2 receptor) |
| BACE1 | CHEMBL1987 | Enzyme (Aspartyl protease) |
| HDAC1 | CHEMBL325 | Enzyme (Histone deacetylase) |
Each entry in the dataset includes:
- ChEMBL compound ID
- Canonical SMILES
- Molecular properties (molecular weight, HBA, HBD, logP, and TPSA)
- Target label
The dataset is provided in CSV or Parquet format with the following columns:
- ChEMBL ID: ChEMBL compound identifier
- SMILES: Canonical SMILES string
- Molecular weight: Molecular mass (Da)
- LogP: Octanol-water partition coefficient
- HBA: Number of hydrogen bond acceptors
- HBD: Number of hydrogen bond donors
- TPSA: Topological polar surface area
- Protein: Protein target name
This dataset is a derivative of ChEMBL and is distributed under the same license:
Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
https://creativecommons.org/licenses/by-sa/3.0/
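As an illustration of how the columns listed above might be used, the following sketch reads CSV text with those column names and keeps the compounds for one target that satisfy Lipinski's rule of five (molecular weight ≤ 500 Da, logP ≤ 5, HBD ≤ 5, HBA ≤ 10). The rule is not part of the dataset description; the exact header spellings and the function names are assumptions made for illustration.

```python
import csv
import io


def load_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def passes_rule_of_five(row):
    """Return True if the row meets all four Lipinski criteria.

    Column names are assumed to match the list in the dataset description.
    """
    return (
        float(row["Molecular weight"]) <= 500
        and float(row["LogP"]) <= 5
        and float(row["HBD"]) <= 5
        and float(row["HBA"]) <= 10
    )


def drug_like_for_target(rows, protein):
    """Keep rows labelled with the given target that pass the rule of five."""
    return [
        row for row in rows
        if row["Protein"] == protein and passes_rule_of_five(row)
    ]
```

With a real file, the text passed to `load_rows` would come from reading the CSV from disk; if the actual headers differ, only the dictionary keys need to change.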
chembl_webresource_client)
If you use this dataset, please cite the following:
ChEMBL Database:
Gaulton A, Hersey A, Nowotka M, Bento AP, Chambers J, Mendez D, Mutowo P, Atkinson F, Bellis LJ, Cibrián-Uhalte E, Davies M, Dedman N, Karlsson A, Magariños MP, Overington JP, Papadatos G, Smit I, Leach AR. (2017).
The ChEMBL database in 2017. Nucleic Acids Res., 45(D1): D945–D954.
https://doi.org/10.1093/nar/gkw1074
ChEMBL Web Services:
Davies M, Nowotka M, Papadatos G, Dedman N, Gaulton A, Atkinson F, Bellis LJ, Overington JP. (2015).
ChEMBL web services: streamlining access to drug discovery data and utilities. Nucleic Acids Res., 43(W1): W612–W620.
https://doi.org/10.1093/nar/gkv352
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The SARS-CoV-2 3CLpro protein is one of the key therapeutic targets of interest for COVID-19 due to its critical role in viral replication, the availability of various high-quality protein crystal structures, and its use as a basis for computationally screening for compounds with improved inhibitory activity, bioavailability, and ADMETox properties. The ChEMBL and PubChem databases contain experimental data from screening small molecules against SARS-CoV-2 3CLpro, which expands the opportunity to learn the pattern and design a computational model that can predict the potency of any drug compound against coronavirus before in-vitro and in-vivo testing. In this study, we evaluated 27 machine learning classifiers using several descriptors. We also developed a neural network model that can correctly identify bioactive and inactive chemicals with 91% accuracy on ChEMBL data and 93% accuracy on combined ChEMBL and PubChem data. The F1-score for inactive and active compounds was 93% and 94%, respectively. We applied SHAP (SHapley Additive exPlanations) to an XGB classifier to find the important fingerprints among the PaDEL descriptors for this task. The results indicated that the PaDEL descriptors were effective in predicting bioactivity, that the proposed neural network design was efficient, and that the SHAP-based explanation correctly identified the important fingerprints. In addition, we validated the effectiveness of our proposed model using a large dataset encompassing over 100,000 molecules. This research employed various molecular descriptors to discover the optimal one for this task. To evaluate the effectiveness of these possible medications against SARS-CoV-2, more in-vitro and in-vivo research is required.
The ChEMBL database contains a large collection of bioactive molecules and their interactions with proteins.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Discovery of new pharmaceutical substances is currently boosted by the availability of the Synthetically Accessible Virtual Inventory (SAVI) library, which includes about 283 million molecules, each annotated with a proposed synthetic one-step route from commercially available starting materials. The SAVI database is well-suited for ligand-based methods of virtual screening to select molecules for experimental testing. In this study, we compare the performance of three approaches for the analysis of structure-activity relationships that differ in their criteria for selecting the “active” and “inactive” compounds included in the training sets. PASS (Prediction of Activity Spectra for Substances), which is based on a modified Naïve Bayes algorithm, was applied since it had been shown to be robust and to provide good predictions of many biological activities based on just the structural formula of a compound, even if the information in the training set is incomplete. We used different subsets of kinase inhibitors for this case study because many data are currently available on this important class of drug-like molecules. Based on the subsets of kinase inhibitors extracted from the ChEMBL 20 database, we performed the PASS training and then applied the model to ChEMBL 23 compounds not yet present in ChEMBL 20 to identify novel kinase inhibitors. As one may expect, the best prediction accuracy was obtained when only the experimentally confirmed active and inactive compounds for distinct kinases were used in the training procedure. However, for some kinases, reasonable results were obtained even with merged training sets, in which compounds not tested against the particular kinase were designated as inactives. Thus, depending on the availability of data for a particular biological activity, one may choose the first or the second approach for creating ligand-based computational tools to achieve the best possible results in virtual screening.
This data package contains information on approved, researched and proven drug targets and drug lists.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
List of approved drugs obtained from ChEMBL database on 25.07.2025.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
ChEMBL is a database of bioactive compounds, their quantitative properties and bioactivities (binding constants, pharmacology and ADMET, etc). The data is abstracted and curated from the primary scientific literature.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Two subsets of molecules from ChEMBL-34[1].
These molecular datasets may be useful for training molecular generators.
After decompression, you will get:
chembl34_stable_ES_OA_LL.smi: 585,272 molecules.
chembl34_stable_ES_OA_DL.smi: 756,420 molecules.
stable=non-reactive molecules (filtered-out reactive functional groups from [5]).
https://github.com/UnixJunkie/molenc/blob/master/bin/molenc_stable.py
ES=Easy Synthesis (SAscore <= 3.0) [2].
https://github.com/UnixJunkie/molenc/blob/master/bin/molenc_SA.py
OA=Orally Available (according to a classifier trained on the dataset from [6]).
LL=Lead-Like (almost the definition from [3]).
https://github.com/UnixJunkie/molenc/blob/master/bin/molenc_lead.py
DL=Drug-Like (definition from [4]).
https://github.com/UnixJunkie/molenc/blob/master/bin/molenc_drug.py
[1] Zdrazil, B., Felix, E., Hunter, F., Manners, E. J., Blackshaw, J., Corbett, S., ... & Leach, A. R. (2024). The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods. Nucleic acids research, 52(D1), D1180-D1192. https://doi.org/10.1093/nar/gkad1004
[2] Ertl, P., & Schuffenhauer, A. (2009). Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. Journal of cheminformatics, 1, 1-11. https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-1-8
[3] Hann, M. M., & Oprea, T. I. (2004). Pursuing the leadlikeness concept in pharmaceutical research. Current opinion in chemical biology, 8(3), 255-263. https://doi.org/10.1016/j.cbpa.2004.04.003
[4] Tran-Nguyen, V. K., Jacquemard, C., & Rognan, D. (2020). LIT-PCBA: an unbiased data set for machine learning and virtual screening. Journal of chemical information and modeling, 60(9), 4263-4273. https://pubs.acs.org/doi/10.1021/acs.jcim.0c00155
[5] Lisurek, M., Rupp, B., Wichard, J., Neuenschwander, M., von Kries, J. P., Frank, R., ... & Kühne, R. (2010). Design of chemical libraries with potentially bioactive molecules applying a maximum common substructure concept. Molecular diversity, 14, 401-408. https://link.springer.com/article/10.1007/s11030-009-9187-z
[6] Falcon-Cano, G., Molina, C., & Cabrera-Perez, M. A. (2020). ADME prediction with KNIME: development and validation of a publicly available workflow for the prediction of human oral bioavailability. Journal of chemical information and modeling, 60(6), 2660-2667. https://pubs.acs.org/doi/10.1021/acs.jcim.0c00019
https://www.gnu.org/licenses/gpl-3.0-standalone.html
Studying and mitigating the effects of data drifts on ML model performance at the example of chemical toxicity data
Project description
Machine learning models are powerful tools for the prediction of molecular properties or the biological activity of chemical compounds. However, to make these models useful and applicable, the confidence in the predictions should also be specified. For that purpose, models may be integrated in a conformal prediction (CP) framework that adds a calibration step to estimate the confidence of the predictions. CP models offer the advantage of ensuring a predefined error rate, as long as the test and training sets are exchangeable.
In cases where the test data presents a drift from the descriptor space of the training data, or where assay setups change, this assumption may not be fulfilled and the models are not guaranteed to be valid.
In this study, the performance of internally valid CP models was evaluated upon application to either newer time-split data or to external data. More specifically, temporal data drifts were analysed based on time-splits of twelve toxicity-related datasets from the ChEMBL database. Moreover, models trained on publicly available data for liver toxicity and MNT in vivo were applied on proprietary data to evaluate the discrepancies. In general it was observed that the training and (holdout) test sets were not exchangeable in the studied set-ups, and the models were therefore not applicable (i.e. non-valid CP models).
To recover the validity of the models on the holdout test set, a strategy for updating the calibration set with data more similar to the holdout set was investigated. Restored validity is the main requisite for applying the CP models with confidence. However, this comes at the cost of decreased model efficiency, as more predictions are identified as inconclusive.
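The calibration step and the trade-off between validity and efficiency described above can be sketched with a minimal class-conditional (Mondrian) inductive conformal predictor. This is a generic textbook formulation under stated assumptions, not the manuscript's implementation: the nonconformity score (one minus the predicted probability of the candidate class) and all function names are hypothetical.

```python
def p_value(cal_scores, test_score):
    """Conformal p-value of a test compound for one candidate class.

    cal_scores: nonconformity scores of calibration compounds of that class.
    Counts how many calibration scores are at least as large as the test score.
    """
    n_at_least = sum(1 for s in cal_scores if s >= test_score)
    return (n_at_least + 1) / (len(cal_scores) + 1)


def prediction_set(cal_scores_by_class, probs, significance):
    """Return every label whose p-value exceeds the significance level.

    cal_scores_by_class: dict mapping label -> calibration scores.
    probs: dict mapping label -> predicted probability for the test compound.
    An empty set or a set holding both labels is an inconclusive prediction.
    """
    labels = set()
    for label, cal_scores in cal_scores_by_class.items():
        score = 1.0 - probs[label]  # assumed nonconformity measure
        if p_value(cal_scores, score) > significance:
            labels.add(label)
    return labels
```

In this picture, updating the calibration set means computing `cal_scores_by_class` from compounds that resemble the holdout data more closely. Larger calibration scores raise the p-values, so more labels enter each prediction set, which is one way to see why restored validity can come with more inconclusive predictions.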
Dataset
The uploaded file contains the ChEMBL data used in the work for the manuscript “Studying and mitigating the effects of data drifts on ML model performance at the example of chemical toxicity data”.
Twelve preprocessed datasets containing molecule ChEMBL ID, SMILES, binary activity (i.e. 1 if active, 0 if inactive), publication year, and CHEMBIO descriptors are available for the following ChEMBL endpoints, extracted from ChEMBL version 26:
CHEMBL220: Acetylcholinesterase (human), 2673 compounds
CHEMBL4078: Acetylcholinesterase (fish), 3811 compounds
CHEMBL5763: Cholinesterase, 2755 compounds
CHEMBL203: EGFR erbB1, 4059 compounds
CHEMBL206: Estrogen receptor alpha, 1416 compounds
CHEMBL279: VEGFR 2, 5174 compounds
CHEMBL230: Cyclooxygenase-2, 2020 compounds
CHEMBL340: Cytochrome P450 3A4, 3316 compounds
CHEMBL240: HERG, 4976 compounds
CHEMBL2039: Monoamine oxidase B, 2534 compounds
CHEMBL222: Norepinephrine transporter, 1566 compounds
CHEMBL228: Serotonin transporter, 2111 compounds
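A time split of the kind mentioned in the project description can be sketched from the publication-year field of these datasets. The record key `year`, the function name, and the threshold years are assumptions for illustration; the thresholds used in the manuscript are not stated here.

```python
def split_by_year(records, last_train_year, last_cal_year):
    """Partition records into older training data, a calibration set,
    and a newer holdout set, using publication year thresholds.

    records: iterable of dicts that each carry an integer "year" entry.
    """
    train, calibration, holdout = [], [], []
    for record in records:
        if record["year"] <= last_train_year:
            train.append(record)
        elif record["year"] <= last_cal_year:
            calibration.append(record)
        else:
            holdout.append(record)
    return train, calibration, holdout
```

Splitting on year thresholds keeps all compounds from one publication year in the same subset, at the cost of subset sizes that depend on how many compounds were published in each period.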
Usage
This dataset can be used as input to run the notebooks available at
https://github.com/volkamerlab/CPRecalibration_manuscript_SI
Clone the GitHub repository.
Download the dataset provided here.
Copy the dataset (don’t extract) into the data folder of the cloned GitHub repository.
Follow the instructions on GitHub.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reported is the activity information for the 12,294 analog series-based (ASB) scaffolds extracted from the ChEMBL database. For each ASB scaffold, structural and activity information for all analogs comprising the analog series is provided.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
ChEMBL is a database of bioactive drug-like small molecules. It contains 2-D structures, calculated properties (e.g. logP, Molecular Weight, Lipinski Parameters, etc.) and abstracted bioactivities (e.g. binding constants, pharmacology and ADMET data). The data is abstracted and curated from the primary scientific literature, and covers a significant fraction of the SAR and discovery of modern drugs. We attempt to normalise the bioactivities into a uniform set of end-points and units where possible, and also to tag the links between a molecular target and a published assay with a set of varying confidence levels. Additional data on clinical progress of compounds is being integrated into ChEMBL at the current time.
https://dataverse.nl/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.34894/IJWU5L
ChEMBL is a medicinal chemistry database by the team of Dr. J. Overington at the EBI: http://www.ebi.ac.uk/chembl/ It is detailed in this paper (doi:10.1093/nar/gkr777): http://nar.oxfordjournals.org/content/early/2011/09/22/nar.gkr777.short This project develops, releases, and hosts an RDF version of ChEMBL, independent from the ChEMBL team, who make their own RDF version. The main SPARQL end point is available from Uppsala University at: http://rdf.farmbio.uu.se/chembl/sparql
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
A detailed analysis of the hERG content inside the ChEMBL database is performed. The correlation between the outcome from binding assays and functional assays is probed. On the basis of descriptor distributions, design paradigms with respect to structural and physicochemical properties of hERG active and hERG inactive compounds are challenged. Finally, classification models with different data sets are trained. All source code is provided, which is based on the Python open source packages RDKit and scikit-learn to enable the community to rerun the experiments. The code is stored on github (https://github.com/pzc/herg_chembl_jcim).
https://www.gnu.org/licenses/gpl-3.0.html
The standardised ChEMBL data sets from the publication "HyFactor: Hydrogen-count labelled graph-based defactorization Autoencoder"
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the updated version of the dataset from 10.5281/zenodo.6320761
Information
The diverse publicly available compound/bioactivity databases constitute a key resource for data-driven applications in chemogenomics and drug design. Analysis of their coverage of compound entries and biological targets revealed considerable differences, however, suggesting the benefit of a consensus dataset. Therefore, we have combined and curated information from five esteemed databases (ChEMBL, PubChem, BindingDB, IUPHAR/BPS and Probes&Drugs) to assemble a consensus compound/bioactivity dataset comprising 1144648 compounds with 10915362 bioactivities on 5613 targets (including defined macromolecular targets as well as cell-lines and phenotypic readouts). It also provides simplified information on assay types underlying the bioactivity data and on bioactivity confidence by comparing data from different sources. We have unified the source databases, brought them into a common format and combined them, making the data easy to use generically in multiple applications such as chemogenomics and data-driven drug design.
The consensus dataset provides increased target coverage and contains a higher number of molecules compared to the source databases, which is also evident from a larger number of scaffolds. These features render the consensus dataset a valuable tool for machine learning and other data-driven applications in (de novo) drug design and bioactivity prediction. The increased chemical and bioactivity coverage of the consensus dataset may improve robustness of such models compared to the single source databases. In addition, semi-automated structure and bioactivity annotation checks with flags for divergent data from different sources may help data selection and further accurate curation.
This dataset belongs to the publication: https://doi.org/10.3390/molecules27082513
Structure and content of the dataset
| ChEMBL ID | PubChem ID | IUPHAR ID | Target | Activity type | Assay type | Unit | Mean C (0) | ... | Mean PC (0) | ... | Mean B (0) | ... | Mean I (0) | ... | Mean PD (0) | ... | Activity check annotation | Ligand names | Canonical SMILES C | ... | Structure check (Tanimoto) | Source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
The dataset was created using the Konstanz Information Miner (KNIME) (https://www.knime.com/) and was exported as a CSV-file and a compressed CSV-file.
Except for the canonical SMILES columns, all columns are filled with the datatype ‘string’. The datatype for the canonical SMILES columns is the smiles-format. We recommend the File Reader node for using the dataset in KNIME. With the help of this node the data types of the columns can be adjusted exactly. In addition, only this node can read the compressed format.
Column content:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data from ChEMBL compounds reported with an activity against one of the following targets: CHEMBL367 : Leishmania donovani, CHEMBL368 : Trypanosoma cruzi, and CHEMBL612348 : Trypanosoma brucei rhodesiense.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
From ChEMBL version 17, 31 compound data sets have been selected for regression modeling. Compounds had to be active against human targets in a direct inhibition/binding assay with highest ChEMBL confidence score and Ki values below 100 micromolar. Multiple Ki values for the same compound were averaged if they fell into the same order of magnitude, or else they were disregarded. Duplicates, known pan-assay interference, and other reactive molecules were removed. Only sets with at least 500 compounds were considered.
Note: The SD files contain a field "pKi"; note however that this field contains the Ki value in nM units, not the logarithmic value.
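Because the "pKi" field holds Ki in nM rather than a logarithm, a conversion is needed before modeling on the logarithmic scale. The sketch below applies the standard definition pKi = -log10(Ki in mol/L), which equals 9 - log10(Ki in nM), and illustrates one possible reading of the averaging rule described above. Treating "same order of magnitude" as "largest value less than ten times the smallest" and using an arithmetic mean are assumptions, as are the function names.

```python
import math


def ki_nm_to_pki(ki_nm):
    """Convert a Ki value given in nM to pKi (negative log10 of Ki in mol/L)."""
    return 9.0 - math.log10(ki_nm)


def merge_ki_values(ki_values_nm):
    """Average repeated Ki measurements (nM) for one compound.

    Returns the arithmetic mean if all values lie within one order of
    magnitude (assumed to mean max / min < 10), otherwise None so the
    compound can be disregarded.
    """
    lowest, highest = min(ki_values_nm), max(ki_values_nm)
    if highest / lowest < 10.0:
        return sum(ki_values_nm) / len(ki_values_nm)
    return None
```

Under this conversion, the 100 micromolar cutoff mentioned above (100,000 nM) corresponds to a pKi of 4, so compounds in these sets would be expected to have pKi values above 4.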