35 datasets found

Bioactivity datasets from ChEMBL.
plos.figshare.com
xls
Updated Sep 6, 2023
Cite
Faisal Bin Ashraf; Sanjida Akter; Sumona Hoque Mumu; Muhammad Usama Islam; Jasim Uddin (2023). Bioactivity datasets from ChEMBL. [Dataset]. http://doi.org/10.1371/journal.pone.0288053.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0288053.t001
Dataset updated
Sep 6, 2023
Dataset provided by
PLOS ONE
Authors
Faisal Bin Ashraf; Sanjida Akter; Sumona Hoque Mumu; Muhammad Usama Islam; Jasim Uddin
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The SARS-CoV-2 3CLpro protein is one of the key therapeutic targets of interest for COVID-19 due to its critical role in viral replication, the availability of various high-quality protein crystal structures, and its use as a basis for computationally screening for compounds with improved inhibitory activity, bioavailability, and ADMETox properties. The ChEMBL and PubChem databases contain experimental data from screening small molecules against SARS-CoV-2 3CLpro, which expands the opportunity to learn the pattern and design a computational model that can predict the potency of any drug compound against coronavirus before in-vitro and in-vivo testing. In this study, utilizing several descriptors, we evaluated 27 machine learning classifiers. We also developed a neural network model that can correctly identify bioactive and inactive chemicals with 91% accuracy on ChEMBL data and 93% accuracy on combined ChEMBL and PubChem data. The F1-score for inactive and active compounds was 93% and 94%, respectively. SHAP (SHapley Additive exPlanations) was applied to an XGB classifier to find important fingerprints from the PaDEL descriptors for this task. The results indicated that the PaDEL descriptors were effective in predicting bioactivity, the proposed neural network design was efficient, and the explanatory analysis through SHAP correctly identified the important fingerprints. In addition, we validated the effectiveness of our proposed model using a large dataset encompassing over 100,000 molecules. This research employed various molecular descriptors to discover the optimal one for this task. To evaluate the effectiveness of these possible medications against SARS-CoV-2, more in-vitro and in-vivo research is required.
Data from: A consensus compound/bioactivity dataset for data-driven drug...
data.niaid.nih.gov
zenodo.org
Updated May 13, 2022
Cite
Isigkeit, Laura (2022). A consensus compound/bioactivity dataset for data-driven drug design and chemogenomics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6320760
Dataset updated
May 13, 2022
Dataset provided by
Isigkeit, Laura
Chaikuad, Apirat
Merk, Daniel
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the updated version of the dataset from 10.5281/zenodo.6320761

Information

The diverse publicly available compound/bioactivity databases constitute a key resource for data-driven applications in chemogenomics and drug design. Analysis of their coverage of compound entries and biological targets revealed considerable differences, however, suggesting the benefit of a consensus dataset. Therefore, we have combined and curated information from five esteemed databases (ChEMBL, PubChem, BindingDB, IUPHAR/BPS and Probes&Drugs) to assemble a consensus compound/bioactivity dataset comprising 1144648 compounds with 10915362 bioactivities on 5613 targets (including defined macromolecular targets as well as cell-lines and phenotypic readouts). It also provides simplified information on assay types underlying the bioactivity data and on bioactivity confidence by comparing data from different sources. We have unified the source databases, brought them into a common format and combined them, enabling easy generic use in multiple applications such as chemogenomics and data-driven drug design.

The consensus dataset provides increased target coverage and contains a higher number of molecules compared to the source databases which is also evident from a larger number of scaffolds. These features render the consensus dataset a valuable tool for machine learning and other data-driven applications in (de novo) drug design and bioactivity prediction. The increased chemical and bioactivity coverage of the consensus dataset may improve robustness of such models compared to the single source databases. In addition, semi-automated structure and bioactivity annotation checks with flags for divergent data from different sources may help data selection and further accurate curation.

This dataset belongs to the publication: https://doi.org/10.3390/molecules27082513

Structure and content of the dataset

Dataset structure

ChEMBL ID | PubChem ID | IUPHAR ID | Target | Activity type | Assay type | Unit | Mean C (0) | ... | Mean PC (0) | ... | Mean B (0) | ... | Mean I (0) | ... | Mean PD (0) | ... | Activity check annotation | Ligand names | Canonical SMILES C | ... | Structure check (Tanimoto) | Source

The dataset was created using the Konstanz Information Miner (KNIME) (https://www.knime.com/) and was exported as a CSV-file and a compressed CSV-file.

Except for the canonical SMILES columns, all columns are filled with the datatype ‘string’. The datatype for the canonical SMILES columns is the smiles-format. We recommend the File Reader node for using the dataset in KNIME. With the help of this node the data types of the columns can be adjusted exactly. In addition, only this node can read the compressed format.

Column content:

ChEMBL ID, PubChem ID, IUPHAR ID: chemical identifier of the databases

Target: biological target of the molecule expressed as the HGNC gene symbol

Activity type: for example, pIC50

Assay type: Simplification/Classification of the assay into cell-free, cellular, functional and unspecified

Unit: unit of bioactivity measurement

Mean columns of the databases: mean of bioactivity values or activity comments denoted with the frequency of their occurrence in the database, e.g. Mean C = 7.5 *(15) -> the value for this compound-target pair occurs 15 times in ChEMBL database

Activity check annotation: a bioactivity check was performed by comparing values from the different sources and adding an activity check annotation to provide automated activity validation for additional confidence

no comment: bioactivity values are within one log unit;

check activity data: bioactivity values are not within one log unit;

only one data point: only one value was available, no comparison and no range calculated;

no activity value: no precise numeric activity value was available;

no log-value could be calculated: no negative decadic logarithm could be calculated, e.g., because the reported unit was not a compound concentration

Ligand names: all unique names contained in the five source databases are listed

Canonical SMILES columns: Molecular structure of the compound from each database

Structure check (Tanimoto): To denote matching or differing compound structures in different source databases

match: molecule structures are the same between different sources;

no match: the structures differ. We calculated the Jaccard-Tanimoto similarity coefficient from Morgan Fingerprints to reveal true differences between sources and reported the minimum value;

1 structure: no structure comparison is possible, because there was only one structure available;

no structure: no structure comparison is possible, because there was no structure available.

Source: the databases from which the data come
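
The mean-column notation, the one-log-unit rule and the Jaccard-Tanimoto comparison described in the column list above can be sketched in Python. This is only an illustration and not the KNIME workflow used to build the dataset: the function names are invented, the cell layout is inferred from the single example `7.5 *(15)`, treating a difference of exactly one log unit as "within" is an assumption, and fingerprints are represented as plain sets of on-bits rather than real Morgan fingerprints.

```python
import re

# Assumed cell layout, based on the one example in the description:
# "7.5 *(15)" -> mean value 7.5, reported 15 times in that source database.
MEAN_CELL = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*\*\((\d+)\)\s*$")


def parse_mean_cell(cell):
    """Return (mean, count) for a numeric cell, or None for comments and empty cells."""
    match = MEAN_CELL.match(cell or "")
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


def activity_check(log_values):
    """Hypothetical re-implementation of the annotation rule for log-scale values."""
    if not log_values:
        return "no activity value"
    if len(log_values) == 1:
        return "only one data point"
    # Assumption: a spread of exactly 1.0 still counts as "within one log unit".
    if max(log_values) - min(log_values) <= 1.0:
        return "no comment"
    return "check activity data"


def tanimoto(bits_a, bits_b):
    """Jaccard-Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = bits_a | bits_b
    # Assumption: two empty fingerprints are scored 0.0 rather than raising.
    return len(bits_a & bits_b) / len(union) if union else 0.0
```

For example, `parse_mean_cell("7.5 *(15)")` yields the mean and its frequency, and `activity_check` can then be applied to the means collected from the individual source columns of one compound-target pair.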
Table_1_How to Achieve Better Results Using PASS-Based Virtual Screening:...
frontiersin.figshare.com
figshare.com
xlsx
Updated Jun 3, 2023
Cite
Pavel V. Pogodin; Alexey A. Lagunin; Anastasia V. Rudik; Dmitry A. Filimonov; Dmitry S. Druzhilovskiy; Mark C. Nicklaus; Vladimir V. Poroikov (2023). Table_1_How to Achieve Better Results Using PASS-Based Virtual Screening: Case Study for Kinase Inhibitors.XLSX [Dataset]. http://doi.org/10.3389/fchem.2018.00133.s003
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.3389/fchem.2018.00133.s003
Dataset updated
Jun 3, 2023
Dataset provided by
Frontiers
Authors
Pavel V. Pogodin; Alexey A. Lagunin; Anastasia V. Rudik; Dmitry A. Filimonov; Dmitry S. Druzhilovskiy; Mark C. Nicklaus; Vladimir V. Poroikov
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Discovery of new pharmaceutical substances is currently boosted by the possibility of utilizing the Synthetically Accessible Virtual Inventory (SAVI) library, which includes about 283 million molecules, each annotated with a proposed synthetic one-step route from commercially available starting materials. The SAVI database is well-suited for ligand-based methods of virtual screening to select molecules for experimental testing. In this study, we compare the performance of three approaches for the analysis of structure-activity relationships that differ in their criteria for selecting “active” and “inactive” compounds included in the training sets. PASS (Prediction of Activity Spectra for Substances), which is based on a modified Naïve Bayes algorithm, was applied since it had been shown to be robust and to provide good predictions of many biological activities based on just the structural formula of a compound even if the information in the training set is incomplete. We used different subsets of kinase inhibitors for this case study because many data are currently available on this important class of drug-like molecules. Based on the subsets of kinase inhibitors extracted from the ChEMBL 20 database we performed the PASS training, and then applied the model to ChEMBL 23 compounds not yet present in ChEMBL 20 to identify novel kinase inhibitors. As one may expect, the best prediction accuracy was obtained if only the experimentally confirmed active and inactive compounds for distinct kinases were used in the training procedure. However, for some kinases, reasonable results were obtained even if we used merged training sets, in which we designated as inactives the compounds not tested against the particular kinase. Thus, depending on the availability of data for a particular biological activity, one may choose the first or the second approach for creating ligand-based computational tools to achieve the best possible results in virtual screening.
31 ChEMBL data sets for regression modeling
zenodo.org
data.niaid.nih.gov
zip
Updated Jan 21, 2020
Cite
Jenny Balfer; Jürgen Bajorath (2020). 31 ChEMBL data sets for regression modeling [Dataset]. http://doi.org/10.5281/zenodo.13986
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.13986
Dataset updated
Jan 21, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Jenny Balfer; Jürgen Bajorath
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
From ChEMBL version 17, 31 compound data sets have been selected for regression modeling. Compounds had to be active against human targets in a direct inhibition/binding assay with highest ChEMBL confidence score and Ki values below 100 micromolar. Multiple Ki values for the same compound were averaged if they fell into the same order of magnitude, or else they were disregarded. Duplicates, known pan-assay interference, and other reactive molecules were removed. Only sets with at least 500 compounds were considered.

Note: The SD files contain a field "pKi"; note however that this field contains the Ki value in nM units, not the logarithmic value.
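
Because the "pKi" field holds Ki in nM rather than a logarithm, a conversion step is needed before modeling. The sketch below shows that conversion together with one possible reading of the replicate rule described above. The function names are invented, and two details are assumptions: "same order of magnitude" is interpreted as a max/min ratio below 10, and replicates are combined with an arithmetic mean.

```python
import math


def ki_nm_to_pki(ki_nm):
    """Convert a Ki in nanomolar to pKi = -log10(Ki in mol/L)."""
    if ki_nm <= 0:
        raise ValueError("Ki must be positive")
    # 1 nM = 1e-9 mol/L, so -log10(ki_nm * 1e-9) = 9 - log10(ki_nm).
    return 9.0 - math.log10(ki_nm)


def merge_ki_values(ki_values_nm):
    """Average replicate Ki values, or return None if they disagree too much.

    Assumption: "same order of magnitude" means max/min < 10.
    """
    if not ki_values_nm:
        return None
    if max(ki_values_nm) / min(ki_values_nm) >= 10.0:
        return None  # replicates span an order of magnitude: disregard
    return sum(ki_values_nm) / len(ki_values_nm)
```

Under this conversion, the 100 micromolar cutoff mentioned in the description corresponds to a pKi of 4.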
Activity cliffs with dual-atom replacements and single-atom analogs
zenodo.org
data.niaid.nih.gov
Updated Nov 2, 2021
Cite
Huabin Hu; Jürgen Bajorath (2021). Activity cliffs with dual-atom replacements and single-atom analogs [Dataset]. http://doi.org/10.5281/zenodo.5634280
Unique identifier
https://doi.org/10.5281/zenodo.5634280
Dataset updated
Nov 2, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Huabin Hu; Jürgen Bajorath
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
From the ChEMBL database, 852 activity cliffs (ACs) with dual-atom replacements were extracted, formed by compounds with high-confidence activity data. Each AC captured a difference in compound potency of at least 10-fold. For a subset of these ACs, analogs with corresponding single-atom replacements were identified. The dual-atom ACs and available single-atom replacement analogs are provided (SMILES representation and ChEMBL compound ID). For each AC compound and analog, targets from ChEMBL are reported (with UniProt ID). The target shared by all associated compounds represents the primary AC target.
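
The potency criterion for an activity cliff can be written as a short check. This sketch covers only the 10-fold potency difference; it does not test the structural condition (a dual-atom replacement between the two compounds), and the function names and the choice of nM inputs are illustrative assumptions.

```python
def potency_fold_difference(potency_a_nm, potency_b_nm):
    """Fold difference between two potency values given in the same unit."""
    if potency_a_nm <= 0 or potency_b_nm <= 0:
        raise ValueError("potency values must be positive")
    high = max(potency_a_nm, potency_b_nm)
    low = min(potency_a_nm, potency_b_nm)
    return high / low


def meets_cliff_potency_criterion(potency_a_nm, potency_b_nm, min_fold=10.0):
    """True if the pair differs in potency by at least `min_fold`."""
    return potency_fold_difference(potency_a_nm, potency_b_nm) >= min_fold
```

On a logarithmic scale the same criterion is a difference of at least one log unit between the two compounds.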
ChEMBL data against CHEMBL367, CHEMBL368 and CHEMBL612348
zenodo.org
data.niaid.nih.gov
bin
Updated May 20, 2023
Cite
Arnaud Gaudry (2023). ChEMBL data against CHEMBL367, CHEMBL368 and CHEMBL612348 [Dataset]. http://doi.org/10.5281/zenodo.7953284
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.7953284
Dataset updated
May 20, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Arnaud Gaudry
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data from ChEMBL compounds reported with an activity against one of the following targets: CHEMBL367 : Leishmania donovani, CHEMBL368 : Trypanosoma cruzi, and CHEMBL612348 : Trypanosoma brucei rhodesiense.
A large comprehensive curated dataset of small molecules and their...
repository.uantwerpen.be
zenodo.org
Updated 2023
Cite
Arab, Issar; Egghe, Kristof; Laukens, Kris; Chen, Ke; Barakat, Khaled; Bittremieux, Wout (2023). A large comprehensive curated dataset of small molecules and their activities covering three cardiac ion channels: hERG, Cav1.2, and Nav1.5 [Dataset]. http://doi.org/10.5281/ZENODO.8359714
Unique identifier
https://doi.org/10.5281/ZENODO.8359714
Dataset updated
2023
Dataset provided by
Zenodo (http://zenodo.org/)
Faculty of Sciences. Mathematics and Computer Science
University of Antwerp
Authors
Arab, Issar; Egghe, Kristof; Laukens, Kris; Chen, Ke; Barakat, Khaled; Bittremieux, Wout
Description
The compressed data folder (dataset.rar) represents a data framework for researchers in the field of drug discovery to perform in-depth analyses on a very large, open-access, unique and comprehensive hERG, Nav1.5, and Cav1.2 cardiotoxicity integrated database of small molecules and their activities. The database is organized as follows:

Each sub-folder represents a cardiac ion channel target: hERG, Nav1.5, and Cav1.2.

Each target sub-folder consists of 3 files in CSV format. One file contains the development set (split into training and validation sets using an 80/20 ratio for hyperparameter tuning). The other 2 files contain external evaluation sets. The first test dataset consists of compounds with a structural similarity of no more than 60% (Tanimoto similarity ≤ 0.6) to the remaining development set, while the second test dataset comprises compounds with a structural similarity of no more than 70% (Tanimoto similarity ≤ 0.7) to the remaining development set.

Each file contains data with 7 columns: "InChl Key" as a unique identifier of the chemical structure, "SMILES" as the string format of storage and exchange of the chemical structure, "Source" as the upstream data source from which the data was retrieved, "ChEMBL ID" as the ChEMBL identifier if the compound comes from ChEMBL database, "PubChem CID" as the PubChem compound identifier if the compound comes from PubChem database, "pIC50" as the negative logarithm of the half-maximal inhibitory concentration (IC50) to describe the potency of the compound, and "USED_AS" column specifying whether the compound was used for training or validation.

Upon usage, please cite this publication: Issar Arab, Kristof Egghe, Kris Laukens, Ke Chen, Khaled Barakat, Wout Bittremieux, Benchmarking of Small Molecule Feature Representations for hERG, Nav1.5, and Cav1.2 Cardiotoxicity Prediction, Journal of Chemical Information and Modeling, (2023). doi:10.1021/acs.jcim.3c01301
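
As a rough illustration of how a development-set file might be consumed, the sketch below converts an IC50 to pIC50 and groups rows by the "USED_AS" column using only the standard library. The column names "SMILES", "pIC50" and "USED_AS" come from the description; the sample rows, the labels `Train` and `Validation`, and the function names are made up, and the actual label values in the files may differ.

```python
import csv
import io
import math
from collections import defaultdict


def ic50_molar_to_pic50(ic50_molar):
    """pIC50 is the negative base-10 logarithm of IC50 expressed in mol/L."""
    if ic50_molar <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_molar)


def group_by_usage(csv_text):
    """Group CSV rows by the value of the USED_AS column."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["USED_AS"]].append(row)
    return dict(groups)


# Made-up rows for illustration only; the USED_AS labels are assumed.
SAMPLE = (
    "SMILES,pIC50,USED_AS\n"
    "CCO,5.2,Train\n"
    "CCN,6.1,Validation\n"
    "CCC,4.8,Train\n"
)

groups = group_by_usage(SAMPLE)
print({label: len(rows) for label, rows in groups.items()})
```

For a real file, the text passed to `group_by_usage` would be read from one of the CSV files after unpacking dataset.rar.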
HLM_RLM
huggingface.co
Updated Jun 4, 2025
Cite
Maom Lab (2025). HLM_RLM [Dataset]. https://huggingface.co/datasets/maomlab/HLM_RLM
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 4, 2025
Dataset authored and provided by
Maom Lab
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Human & Rat Liver Microsomal Stability

3345 RLM and 6420 HLM compounds were initially collected from the ChEMBL bioactivity database. (HLM ID: 613373, 2367379, and 612558; RLM ID: 613694, 2367428, and 612558) Finally, the RLM stability data set contains 3108 compounds, and the HLM stability data set contains 5902 compounds. For the RLM stability data set, 1542 (49.6%) compounds were classified as stable, and 1566 (50.4%) compounds were classified as unstable, among which the… See the full description on the dataset page: https://huggingface.co/datasets/maomlab/HLM_RLM.
Bioactive compounds with no structural analogs (high-confidence activity...
data.niaid.nih.gov
Updated Jan 24, 2020
Cite
Dimova, Dilyana (2020). Bioactive compounds with no structural analogs (high-confidence activity data) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_33497
Dataset updated
Jan 24, 2020
Dataset provided by
Stumpfe, Dagmar
Dimova, Dilyana
Bajorath, Jürgen
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
A set of 52,815 unique bioactive compounds (human targets, high-confidence activity data) with no structural analogs with high-confidence activity data was extracted from ChEMBL. For each compound the ChEMBL compound ID (CHEMBLID_Compound) and high-confidence target annotation(s) (CHEMBLID_Targets) are provided. The data set was generated as a part of an analysis to be published in 'Medicinal Chemistry Communications'.
Experimental data of sterylglucosides isolated from biodiesel tank...
dataverse.unr.edu.ar
bin, html, tsv, txt +1
Updated May 22, 2025
Cite
RDA UNR (2025). Experimental data of sterylglucosides isolated from biodiesel tank precipitates [Dataset]. http://doi.org/10.57715/UNR/CY7ICJ
Explore at:
Available download formats: zip(443043), zip(3000787), zip(212867), txt(15681), zip(7787947), zip(346289), html(40042), tsv(450), zip(343000), tsv(514), zip(10198774), zip(346961), bin(3613156), tsv(550)
Unique identifier
https://doi.org/10.57715/UNR/CY7ICJ
Dataset updated
May 22, 2025
Dataset provided by
RDA UNR
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset funded by
Universidad Nacional de Rosario
Consejo Nacional de Investigaciones Científicas y Técnicas
Description
Introduction: This dataset compiles experimental data on the recovery of sterylglucosides from biodiesel tank precipitates. It includes reported bioactivity data for the isolated compounds (ChEMBL database), purity measurements (quantitative NMR), and toxicity assays on cells (Huh7) and male mice (C57BL/6). The present dataset is focused on the review of the reported biological activity of each of the three compounds isolated as a mixture (beta-sitosterylglucoside:stigmasterylglucoside:campesterylglucoside in 2:1:1 ratio), followed by purity measurement of the mixture and then the toxicity of the mixture against a cell line and mice.

Methodological information:

Sample origin: The biodiesel tank bottom sample was provided by UnitecBio (https://unitecbio.com.ar/) in 2015.

Isolation and Purification Procedure: Two consecutive washes (hexane and methanol, respectively) were used to eliminate the biodiesel and the remaining minor components. Then, the sterylglycosides were recovered by filtration and dried until they reached a constant weight. The identity of the sterylglycosides was confirmed by proton (¹H) and carbon (¹³C) Nuclear Magnetic Resonance Spectroscopy, and also two-dimensional NMR experiments, i.e. Correlation Spectroscopy (COSY) NMR, Heteronuclear Single Quantum Coherence (HSQC) NMR, and Heteronuclear Multiple Bond Correlation (HMBC) NMR. NMR spectra were acquired on a Bruker Avance II 300 MHz (75.13 MHz) using CDCl₃ as solvent. Spectra were processed using Bruker TopSpin 4.4.1.

Purity Determination: The isolated mixture was mixed with the internal standard, 4-methoxyphenol. It was then derivatized by acetylation, and ¹H NMR spectra were recorded (128 scans). NMR spectra were acquired on a Bruker Avance II 300 MHz (75.13 MHz) using CDCl₃ as solvent. Spectra were processed using Bruker TopSpin 4.4.1.

Reported Bioactivity Evaluation: A search for the bioactivities of the compounds in the mixture was performed in the ChEMBL35 database on June 30, 2025. The corresponding SMILES were used as input for the ChEMBL advanced search engine.

Solubility tests: SGs were suspended in various solvents or solvent mixtures in screw-capped tubes. Samples were heated at 40 °C with orbital shaking (400 rpm, 5 min) using a thermomixer, then visually inspected for undissolved material. Solubility was evaluated by the Tyndall effect using a red laser. If insolubility or partial solubility was observed, additional solvent was added and the process repeated. In certain cases, ultrasonic treatment was applied (20 min, 40 °C) to improve dissolution.

Toxicity assays: In vitro cytotoxicity was assessed using Huh7 hepatocarcinoma cells exposed to increasing concentrations of SGs for 72 hours. For in vivo evaluation, SGs were administered orally to mice (n = 11, 2 per group, 3 for control) at 25–200 mg/kg/day for 21 days. Biochemical markers of liver function (GOT, GPT) and metabolic parameters (glucose, cholesterol, TAGs) were measured after euthanasia. All animal procedures were performed in accordance with the Regulation for the Care and Use of Laboratory Animals and were approved by the Institutional Committee for the Care and Use of Laboratory Animals (CICUAL) of the Universidad Nacional de Rosario (UNR), Argentina.

Dataset content: This dataset's files are organized into four main folders according to the type of analysis performed.

Sterylglucosides_NMR_experiments: This contains the NMR spectra that were used to determine the structure of the isolated mixture. It includes zipped files with the following spectral data: Sterylglucosides_13C_spectrum.zip, Sterylglucosides_1H_spectrum.zip, Sterylglucosides_COSY_spectrum.zip, Sterylglucosides_HMBC_spectrum.zip, Sterylglucosides_HSQC_spectrum.zip. NMR spectra were acquired on a Bruker Avance II 300 MHz (75.13 MHz) using CDCl₃ as solvent. Chemical shifts (δ) were reported in ppm downfield from tetramethylsilane and coupling constants are in hertz (Hz). All NMR spectra were referenced to the residual undeuterated solvent as an internal reference. Spectra were processed using Bruker TopSpin 4.4.1.

Sterylglucosides_purity_determination: This contains the NMR spectra that were used to determine the purity of the isolated mixture. It includes zipped files with the following spectral data: Entry_1_q1HRMN_spectrum.zip, Entry_2_q1HRMN_spectrum.zip, Entry_3_q1HRMN_spectrum.zip. Signals corresponding to the aromatic protons of 4-methoxyphenyl acetate (7.1–6.9 ppm, dd, 4H) and the proton attached to C6 (5.42 ppm, m, 1H) were integrated relative to each other to determine the sample's purity. Spectra were processed using Bruker TopSpin 4.4.1. It also contains the processed data for purity determination, included in the following file: Sterylglucosides_purity_determination-1.tab

Sterylglucosides_reported_activities_in_ChEMBL: This contains the results of searches for bioactivities by structure in the ChEMBL database. Each of the three structures was used as input to ChEMBL's advanced search engine (https://www.ebi.ac.uk/chembl/) “chemical search” -> “SIMILARITY>95%”. Three entries were obtained for beta-sitosterylglucoside, one for stigmasterylglucoside and none for campesterylglucoside. Each entry is presented as a separate sheet indicating the corresponding ChEMBL ID preceded by “B” for beta-sitosterylglucoside and “S” for stigmasterylglucoside in the following file: Carlucci_et_al_Sterylglucosides_activities_ChEMBL.tab

Sterylglucosides_toxicity_assays: It compiles experimental toxicity data obtained through in vitro Huh7 cell assays and in vivo C57BL/6 mouse studies. The data are included in the following file: Carlucci_et_al_Sterylglucosides_toxicity_cells_mice.tab. Solubility and MTT results are summarized for acetone and DMSO vehicles in Sheet 1. Raw UV-vis data and viability percentages for MTT assays for both control and SGs are described in Sheets 2-5. Mice body weight (BW) and liver weight (LW) results are described in Sheet 6. Mice plasma and tissue metabolic assay results (raw data included) are described in Sheets 7-11 (i.e., TG & cholesterol, glycemia, GOT, GPT and WP MTP, respectively). Additionally, the results of the analysis of variance are included in the following files: Carlucci_et_al_Sterylglucosides_toxicity_cells_ANOVA.pzfx, Carlucci_et_al_Sterylglucosides_toxicity_mice_ANOVA.pzfx

Value of the data: The general significance of this dataset revolves around the revision of the reported biological activities of sterylglucosides and the assessment of their purity and toxicity when isolated from biodiesel tank bottom deposits. This dataset offers valuable insights for researchers working in the fields of natural product chemistry, toxicology, and bioactivity profiling by providing experimental evidence that highlights the non-toxic behavior of a sterylglucoside mixture in both in vitro (Huh7 cells) and in vivo (C57BL/6 mice) models. The biological activities of β-sitosterylglucoside, stigmasterylglucoside, and campesterylglucoside were systematically re-evaluated based on existing data. The purity and structural integrity of the isolated sterylglucosides were confirmed. Toxicity assessments indicated no significant adverse effects in human liver cells or mice. Solubility profiles in different solvents were established to support future applications. Structural identification was consistent with previously reported NMR data. This dataset may be particularly valuable for studies involving sterylglucosides derived from industrial waste sources, contributing to the evaluation of their safety profiles and potential applications in biomedical and pharmaceutical research.

Data quality: Experiments were carried out with replicates (n = 3 or 2). For the controls of the in vitro and in vivo experiments, solvents were used as vehicles in the absence of sterylglucosides.
286 new target pairs based on shared compounds from ChEMBL
zenodo.org
data.niaid.nih.gov
txt
Updated Jan 21, 2020
Cite
Filip Miljković; Ryo Kunimoto; Jürgen Bajorath (2020). 286 new target pairs based on shared compounds from ChEMBL [Dataset]. http://doi.org/10.5281/zenodo.556530
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.5281/zenodo.556530
Dataset updated
Jan 21, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Filip Miljković; Ryo Kunimoto; Jürgen Bajorath
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Reported is the list of 286 compound-based target pairs between distantly related or unrelated pharmaceutical target proteins that share common compounds, derived from ChEMBL22 high-confidence data. For each target, the corresponding UniProt ID is provided. In addition, for each target pair, the number of shared compounds and their structures in SMILES notation are given.
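
The idea of a compound-based target pair can be sketched as follows: given target annotations per compound, collect for every pair of targets the compounds annotated with both. This is a generic illustration with made-up identifiers, not the authors' procedure; in particular it does not apply the "distantly related or unrelated" filter on the target proteins.

```python
from collections import defaultdict
from itertools import combinations


def shared_compound_pairs(compound_targets):
    """Map each unordered target pair to the set of compounds annotated with both."""
    pairs = defaultdict(set)
    for compound, targets in compound_targets.items():
        # Sorting gives every pair one canonical orientation.
        for target_a, target_b in combinations(sorted(set(targets)), 2):
            pairs[(target_a, target_b)].add(compound)
    return dict(pairs)


# Hypothetical annotations; real input would use ChEMBL compound IDs and UniProt IDs.
annotations = {
    "cpd1": ["T1", "T2"],
    "cpd2": ["T2", "T1", "T3"],
}

for pair, compounds in sorted(shared_compound_pairs(annotations).items()):
    print(pair, len(compounds), sorted(compounds))
```

The number of shared compounds per pair, as reported in the dataset, is simply the size of each set.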
Compound activity records associated with original publications in ChEMBL 21...
data.niaid.nih.gov
Updated Jan 21, 2020
Cite
Hu, Ye (2020). Compound activity records associated with original publications in ChEMBL 21 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_51688
Dataset updated
Jan 21, 2020
Dataset provided by
Hu, Ye
Bajorath, Jürgen
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Provided are two sets of compound activity records (set 1 and set 2) that were traced back to original publications and assembled from ChEMBL release 21. For each compound-target combination, the corresponding potency measurements and publications are provided. In addition, the list of unique publications is given for both sets 1 and 2.
Highly Promiscuous Compounds From Pubchem Assays
explore.openaire.eu
data.niaid.nih.gov
Updated Nov 2, 2016
Cite
Erik Gilberg; Swarit Jasial; Dagmar Stumpfe; Dilyana Dimova; Jürgen Bajorath (2016). Highly Promiscuous Compounds From Pubchem Assays [Dataset]. http://doi.org/10.5281/zenodo.164405
Unique identifier
https://doi.org/10.5281/zenodo.164405
Dataset updated
Nov 2, 2016
Authors
Erik Gilberg; Swarit Jasial; Dagmar Stumpfe; Dilyana Dimova; Jürgen Bajorath
Description
For the pool of 466 detected highly promiscuous compounds, the PubChem ID and the corresponding ChEMBL ID(s) are provided. In addition, the detection status is set to "pains" or "aggregator" if the compound was detected as PAINS or an aggregator, respectively. Otherwise the status is set to "passed". "ChEMBL analogues" lists the ChEMBL compound IDs of structural analogs of highly promiscuous compounds (if available). For the 466 compounds, the number of targets and the corresponding PubChem target IDs are given in the last two columns. Compounds 1-26 (Compound No. 1-26) correspond to compounds shown in the publication.
Data from: Library of Two Million Unique Small Molecules with Precalculated...
zenodo.org
repository.uantwerpen.be
bin
Updated Aug 8, 2024
Cite
Issar Arab; Kris Laukens; Wout Bittremieux (2024). Library of Two Million Unique Small Molecules with Precalculated Fingerprints, Descriptors, and Cardiotoxicity Inhibition Data [Dataset]. http://doi.org/10.5281/zenodo.11066707
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.11066707
Dataset updated
Aug 8, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Issar Arab; Kris Laukens; Wout Bittremieux
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository comprises a dataset of ~2 million unique compounds saved in an hdf5 small molecule library store, which includes the following fields for each molecule:

InChI key

Standardized SMILES string

Compound source

ChEMBL identifier if the compound exists in this open access database

1024-bit Morgan fingerprint

2048-bit Morgan fingerprint

881-bit PubChem fingerprints

854-element vector of preprocessed and standardized Mordred descriptors

and cardiotoxicity inhibition predictions for each of the three cardiac ion channels (hERG, Nav1.5, and Cav1.2) using CtoxPred2 along with the model confidence scores.

The repository also includes a Jupyter notebook that serves as an initial guide for querying the small molecule library store. Export both files to the same folder, make sure approximately 40 GB of disk space is available, unzip the library store, and then launch the notebook to begin querying.
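One common use of bit fingerprints such as the 1024-bit Morgan fingerprints listed above is pairwise similarity comparison. The following is a minimal sketch of Tanimoto similarity, assuming the fingerprints have already been read out as 0/1 sequences. How the library store actually encodes them is not stated here, and the toy vectors are invented for illustration.

```python
def to_on_bits(bit_vector):
    """Convert a 0/1 sequence (e.g. of length 1024) to its set of on-bit indices."""
    return {i for i, bit in enumerate(bit_vector) if bit}


def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two all-zero fingerprints
    return len(a & b) / union


# Toy 1024-bit vectors, invented for illustration only.
fp_a = [0] * 1024
fp_b = [0] * 1024
for i in (3, 17, 256, 900):
    fp_a[i] = 1
for i in (3, 17, 512):
    fp_b[i] = 1

similarity = tanimoto(to_on_bits(fp_a), to_on_bits(fp_b))
```

Here the two toy fingerprints share two on-bits out of five distinct on-bits overall.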

Upon usage, please cite this publication:

Issar Arab, Kris Laukens, Wout Bittremieux, Semisupervised Learning to Boost hERG, Nav1.5, and Cav1.2 Cardiac Ion Channel Toxicity Prediction by Mining a Large Unlabeled Small Molecule Data Set, Journal of Chemical Information and Modeling, (2024). doi:10.1021/acs.jcim.4c01102
chembl_multiassay_activity
huggingface.co
Updated Jan 20, 2025
Haoran Jie (2025). chembl_multiassay_activity [Dataset]. https://huggingface.co/datasets/jiahborcn/chembl_multiassay_activity
Dataset updated
Jan 20, 2025
Authors
Haoran Jie
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
ChEMBL Drug-Target Activity Dataset

This dataset was extracted from the ChEMBL34 database. It is designed for multitask classification of drug-target activities and links compound structures with activity data for multiple assays, enabling multitask learning experiments in drug discovery. Key features of the dataset include:

Multitask Format

Each assay ID is treated as a separate binary classification task. Binary labels (0 for inactive, 1 for active) and masks (indicating… See the full description on the dataset page: https://huggingface.co/datasets/jiahborcn/chembl_multiassay_activity.
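The multitask format with binary labels and masks can be illustrated with a small evaluation sketch. Because the description is truncated, the mask semantics used here (1 means a label exists for that compound-assay pair, 0 means it should be ignored) are an assumption, and the toy matrices are invented.

```python
def masked_accuracy(predictions, labels, masks):
    """Per-task accuracy that ignores entries whose mask is 0.

    Each argument is a list of rows (one per compound) with one column per assay.
    """
    n_tasks = len(labels[0])
    correct = [0] * n_tasks
    counted = [0] * n_tasks
    for pred_row, label_row, mask_row in zip(predictions, labels, masks):
        for t in range(n_tasks):
            if mask_row[t]:  # assumption: 1 means a label exists for this assay
                counted[t] += 1
                correct[t] += int(pred_row[t] == label_row[t])
    # None marks a task that has no labelled compounds at all.
    return [c / n if n else None for c, n in zip(correct, counted)]


# Toy data: 3 compounds x 3 assays, invented for illustration.
labels = [[1, 0, 0], [0, 0, 1], [1, 1, 0]]
masks = [[1, 1, 0], [1, 0, 0], [1, 1, 0]]
preds = [[1, 1, 0], [0, 0, 0], [0, 1, 1]]

per_task = masked_accuracy(preds, labels, masks)
```

Treating each assay as its own column keeps compounds that were never tested in an assay from counting as inactive in that task.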
784 promiscuity cliffs from ChEMBL
data.niaid.nih.gov
zenodo.org
Updated Jan 24, 2020
Bajorath, Jürgen (2020). 784 promiscuity cliffs from ChEMBL [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_200393
Dataset updated
Jan 24, 2020
Dataset provided by
Gilberg, Erik
Dimova, Dilyana
Bajorath, Jürgen
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Reported are the 784 promiscuity cliffs formed by compounds from medicinal chemistry sources. Compounds forming cliffs are provided as SMILES. For each compound its promiscuity degree (PD) and the list of ChEMBL target IDs are provided.
Classification of Binding Modes for Kinase-Inhibitor Complex Structures, 3D...
zenodo.org
explore.openaire.eu
bin
Updated Jan 24, 2020
Norbert Furtmann; Ye Hu; Jürgen Bajorath (2020). Classification of Binding Modes for Kinase-Inhibitor Complex Structures, 3D Activity Cliffs Formed by Kinase Inhibitors, and Structural Analogues of 3D-Cliff Compounds [Dataset]. http://doi.org/10.5281/zenodo.11022
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.11022
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Norbert Furtmann; Ye Hu; Jürgen Bajorath
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The classification of crystallographic binding modes is provided for 884 kinase-inhibitor complex structures that were assembled from the PDB. In addition, a total of 105 three-dimensional (3D) activity cliffs formed by kinase inhibitors are listed, together with the corresponding potency information. Furthermore, 2D structural analogs of 3D cliff-forming inhibitors were identified in the ChEMBL database on the basis of matched molecular pairs. These analogs and their activity information are also provided.
GDB Databases
data.niaid.nih.gov
zenodo.org
Updated Sep 1, 2022
Lorenz C. Blum (2022). GDB Databases [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5172017
Dataset updated
Sep 1, 2022
Dataset provided by
Ruud van Deursen
Lorenz C. Blum
Tobias Fink
Lars Ruddigkeit
Jean-Louis Reymond
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
About

GDB-11 enumerates small organic molecules up to 11 atoms of C, N, O and F following simple chemical stability and synthetic feasibility rules. GDB-13 enumerates small organic molecules up to 13 atoms of C, N, O, S and Cl following simple chemical stability and synthetic feasibility rules. With 977 468 314 structures, GDB-13 is the largest publicly available small organic molecule database to date.

How to cite

To cite GDB-11, please reference:

Virtual exploration of the chemical universe up to 11 atoms of C, N, O, F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physico-chemical properties, compound classes and drug discovery. Fink, T.; Reymond, J.-L. J. Chem. Inf. Model. 2007, 47, 342-353.

Virtual Exploration of the Small Molecule Chemical Universe below 160 Daltons. Fink, T.; Bruggesser, H.; Reymond, J.-L. Angew. Chem. Int. Ed. 2005, 44, 1504-1508.

To cite GDB-13, please reference:

970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. Blum L. C.; Reymond J.-L. J. Am. Chem. Soc., 2009, 131, 8732-8733.

To cite GDB-17, please reference:

Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Ruddigkeit L.; van Deursen R.; Blum L. C.; Reymond J.-L. J. Chem. Inf. Model., 2012, 52, 2864-2875.

Download

You can download the databases and their subsets using the links provided. All molecules are stored in dearomatized, canonized SMILES format and compressed as tar/gz archives (for Windows users: download 7-zip to open the archives).
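For the files distributed as single gzip-compressed SMILES files (`.smi.gz`), a streaming reader avoids loading everything into memory. The sketch below uses only the Python standard library and assumes one molecule per line with the SMILES string as the first whitespace-separated token, which is an assumption about the file layout. The demo file is written locally and is not a GDB file. The `.tgz` archives are tar archives and would need `tarfile` instead.

```python
import gzip
import os
import tempfile


def iter_smiles(path):
    """Yield the first whitespace-separated token of each non-empty line."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if fields:
                yield fields[0]


def count_smiles(path):
    """Count molecules without holding the whole file in memory."""
    return sum(1 for _ in iter_smiles(path))


# Demo on a tiny file written locally (not one of the GDB downloads).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.smi.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write("CCO\nC1CC1 annotated\n\nCC(=O)O\n")

demo_smiles = list(iter_smiles(demo_path))
demo_count = count_smiles(demo_path)
```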

GDB-17

GDB-17-Set (50 million): GDB17.50000000.smi.gz, 314 MB
Lead-like Set (100-350 MW & 1-3 clogP) (11 million): GDB17.50000000LL.smi.gz, 75 MB
Lead-like Set (100-350 MW & 1-3 clogP) without small rings (3-4 ring atoms) (0.8 million): GDB17.50000000LLnoSR.smi.gz, 55 MB

GDB-13

Entire GDB-13 (including all C/N/O/Cl/S molecules): gdb13.tgz, 2.6 GB

GDB-13 Subsets (the sum of all the subsets below corresponds to the entire GDB-13 above)

Graph subset (saturated hydrocarbons): gdb13.g.tgz, 1.1 MB
Skeleton subset (unsaturated hydrocarbons): gdb13.sk.tgz, 14 MB
Only carbon & nitrogen containing molecules: gdb13.cn.tgz, 443 MB
Only carbon & oxygen containing molecules: gdb13.co.tgz, 299 MB
Only carbon & nitrogen & oxygen containing molecules: gdb13.cno.tgz, 1.8 GB
Chlorine & sulphur containing molecules: gdb13.cls.tgz, 189 MB

GDB-13 Subsets (for details, please refer to Table 2 in J Comput Aided Mol Des 2011 25:637 to 647)

GDB-13 Subset AB (~635 million): AB.smi.gz, 2.4 GB
GDB-13 Subset ABC (~441 million): ABC.smi.gz, 1.7 GB
GDB-13 Subset ABCD (~277 million): ABCD.smi.gz, 1.1 GB
GDB-13 Subset ABCDE (~140 million): ABCDE.smi.gz, 565 MB
GDB-13 Subset ABCDEF (~43 million): ABCDEF.smi.gz, 171 MB
GDB-13 Subset ABCDEFG (~13 million): ABCDEFG.smi.gz, 50 MB
GDB-13 Subset ABCDEFGH (~1.4 million): ABCDEFGH.smi.gz, 6.2 MB

GDB-13 Random Sample, annotated with frequency and log-likelihood (please refer to "Exploring the GDB-13 chemical space using deep generative models")

GDB-13 Random Sample (1 million): gdb13.1M.freq.ll.smi.gz, 14.8 MB

FDB-17

FDB-17: FDB-17-fragmentset.smi.gz, 62.2 MB

GDB4c

GDB4c (SMILES): GDB4c.smi.gz, 6.2 MB
GDB4c3D (SMILES): GDB4c3D.smi.gz, 161 MB
GDB4c3D (SDF): GDB4c3D.sdf.tar.gz, 2 GB

Other

GDBMedChem (SMILES): GDBMedChem.smi, 276 MB
GDBChEMBL (SMILES): GDBChEMBL.smi, 353.6 MB
GDB-13 random selection (1 million): gdb13.rand1M.smi.gz, 7.2 MB
Fragment-like subset (Rule of three): gdb13.frl.tgz, 1.2 GB
Dark matter universe up to 9 heavy atoms: dmu9.tgz, 87 MB

GDB-11

Entire GDB-11 (including all C/N/O/F molecules): gdb11.tgz, 122 MB

Fragrance-like subsets (for details, please refer to Ruddigkeit et al. Journal of Cheminformatics 2014, 6:27)

FragranceDB (SuperScent + Flavornet): FragranceDB.smi, 56 KB
TasteDB (SuperSweet + BitterDB): TasteDB.smi, 44 KB
FragranceDB.FL (Fragrance-like subset of FragranceDB): FragranceDB.FL.smi, 32 KB
ChEMBL.FL (Fragrance-like subset of ChEMBL): ChEMBL.FL.smi, 452 KB
PubChem.FL (Fragrance-like subset of PubChem): PubChem.FL.smi, 20 MB
ZINC.FL (Fragrance-like subset of ZINC): ZINC.FL.smi, 1.3 MB
GDB-13.FL (Fragrance-like subset of GDB-13): GDB-13.FL.smi.gz, 165 MB

Terms and conditions: The GDB databases may be downloaded free of charge. In published research involving GDB, cite the appropriate references mentioned above. GDB must not be used as part of or in patents. GDB and large portions thereof must not be redistributed without the express written permission of Jean-Louis Reymond.
Collection of analog series-based (ASB) scaffolds
zenodo.org
data.niaid.nih.gov
bin
Updated Jan 24, 2020
Dilyana Dimova; Jürgen Bajorath (2020). Collection of analog series-based (ASB) scaffolds [Dataset]. http://doi.org/10.5281/zenodo.1041394
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.1041394
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Dilyana Dimova; Jürgen Bajorath
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The entire collection of 23,791 unique ASB scaffolds generated from compounds from the Probes and Drugs Portal (PDP) and ChEMBL (version 23) is reported. Each ASB scaffold is provided in canonical SMILES representation, the database origin (DB_origin) is specified, and the number of analogs (#analogs) the scaffold represents is reported. In addition, for each ASB scaffold from ChEMBL, unique target annotations of corresponding analogs are provided using UniProt target identifiers. For ASB scaffolds from PDP, collected compound annotations are provided. For scaffolds shared between PDP and ChEMBL, the number of analogs is provided in the form 'X|Y', where 'X' and 'Y' denote the number of analogs in PDP and ChEMBL, respectively. Scaffolds are rank-ordered according to the number of analogs they represent.
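Since the #analogs field mixes plain counts with the 'X|Y' form for shared scaffolds, a reader of the file needs to handle both cases. The helper below is a minimal sketch of such parsing. The function name and the example values are hypothetical, and it assumes both forms contain only base-10 integers.

```python
def parse_analog_count(value):
    """Parse an '#analogs' entry.

    Returns (pdp_count, chembl_count) for the shared-scaffold form 'X|Y',
    or a one-element tuple (count,) for a scaffold from a single database.
    """
    text = str(value).strip()
    if "|" in text:
        pdp_text, chembl_text = text.split("|", 1)
        return int(pdp_text), int(chembl_text)
    return (int(text),)


# Example values invented for illustration.
shared = parse_analog_count("12|30")
single = parse_analog_count(" 45 ")
```

Returning tuples of different length keeps the two cases distinguishable without guessing which database a plain count belongs to. That information comes from the DB_origin column.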
Data from: Systematic Design of Analogs of Active Compounds Covering More...
data.niaid.nih.gov
Updated Jan 24, 2020
Dimova, Dilyana (2020). Systematic Design of Analogs of Active Compounds Covering More than 1000 Targets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_45807
Dataset updated
Jan 24, 2020
Dataset provided by
Dimova, Dilyana
Bajorath, Jürgen
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The analog database consisting of 1,297,204 virtual compounds is provided. Virtual compounds are reported in SMILES representation. In addition, for each virtual compound all available ChEMBL analogs (CHEMBL_COMPOUND_ID) and their activities (CHEMBL_TARGET_IDs) are given.

