35 datasets found
  1. Bioactivity datasets from ChEMBL.

    • plos.figshare.com
    xls
    Updated Sep 6, 2023
    Cite
    Faisal Bin Ashraf; Sanjida Akter; Sumona Hoque Mumu; Muhammad Usama Islam; Jasim Uddin (2023). Bioactivity datasets from ChEMBL. [Dataset]. http://doi.org/10.1371/journal.pone.0288053.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 6, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Faisal Bin Ashraf; Sanjida Akter; Sumona Hoque Mumu; Muhammad Usama Islam; Jasim Uddin
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The SARS-CoV-2 3CLpro protein is one of the key therapeutic targets of interest for COVID-19 due to its critical role in viral replication, the availability of various high-quality protein crystal structures, and its use as a basis for computationally screening for compounds with improved inhibitory activity, bioavailability, and ADMETox properties. The ChEMBL and PubChem databases contain experimental data from screening small molecules against SARS-CoV-2 3CLpro, which expands the opportunity to learn the pattern and design a computational model that can predict the potency of any drug compound against coronavirus before in-vitro and in-vivo testing. In this study, utilizing several descriptors, we evaluated 27 machine learning classifiers. We also developed a neural network model that can correctly identify bioactive and inactive chemicals with 91% accuracy on ChEMBL data and 93% accuracy on combined ChEMBL and PubChem data. The F1-score for inactive and active compounds was 93% and 94%, respectively. SHAP (SHapley Additive exPlanations) was applied to an XGB classifier to find important fingerprints from the PaDEL descriptors for this task. The results indicated that the PaDEL descriptors were effective in predicting bioactivity, the proposed neural network design was efficient, and the explanation through SHAP correctly identified the important fingerprints. In addition, we validated the effectiveness of our proposed model using a large dataset encompassing over 100,000 molecules. This research employed various molecular descriptors to discover the optimal one for this task. To evaluate the effectiveness of these possible medications against SARS-CoV-2, more in-vitro and in-vivo research is required.

  2. Data from: A consensus compound/bioactivity dataset for data-driven drug...

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 13, 2022
    Cite
    Isigkeit, Laura (2022). A consensus compound/bioactivity dataset for data-driven drug design and chemogenomics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6320760
    Explore at:
    Dataset updated
    May 13, 2022
    Dataset provided by
    Isigkeit, Laura
    Chaikuad, Apirat
    Merk, Daniel
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the updated version of the dataset from 10.5281/zenodo.6320761

    Information

    The diverse publicly available compound/bioactivity databases constitute a key resource for data-driven applications in chemogenomics and drug design. Analysis of their coverage of compound entries and biological targets revealed considerable differences, however, suggesting the benefit of a consensus dataset. Therefore, we have combined and curated information from five esteemed databases (ChEMBL, PubChem, BindingDB, IUPHAR/BPS and Probes&Drugs) to assemble a consensus compound/bioactivity dataset comprising 1144648 compounds with 10915362 bioactivities on 5613 targets (including defined macromolecular targets as well as cell-lines and phenotypic readouts). It also provides simplified information on assay types underlying the bioactivity data and on bioactivity confidence by comparing data from different sources. We have unified the source databases, brought them into a common format and combined them, making the data easy to use generically in multiple applications such as chemogenomics and data-driven drug design.

    The consensus dataset provides increased target coverage and contains more molecules than the source databases, which is also evident from a larger number of scaffolds. These features render the consensus dataset a valuable tool for machine learning and other data-driven applications in (de novo) drug design and bioactivity prediction. The increased chemical and bioactivity coverage of the consensus dataset may improve the robustness of such models compared to the single source databases. In addition, semi-automated structure and bioactivity annotation checks with flags for divergent data from different sources may help data selection and further accurate curation.

    This dataset belongs to the publication: https://doi.org/10.3390/molecules27082513

    Structure and content of the dataset

    Dataset structure

    ChEMBL ID | PubChem ID | IUPHAR ID | Target | Activity type | Assay type | Unit | Mean C (0) | ... | Mean PC (0) | ... | Mean B (0) | ... | Mean I (0) | ... | Mean PD (0) | ... | Activity check annotation | Ligand names | Canonical SMILES C | ... | Structure check (Tanimoto) | Source

    The dataset was created using the Konstanz Information Miner (KNIME) (https://www.knime.com/) and was exported as a CSV-file and a compressed CSV-file.

    Except for the canonical SMILES columns, all columns are filled with the datatype ‘string’. The datatype for the canonical SMILES columns is the smiles-format. We recommend the File Reader node for using the dataset in KNIME. With the help of this node the data types of the columns can be adjusted exactly. In addition, only this node can read the compressed format.

    Column content:

    ChEMBL ID, PubChem ID, IUPHAR ID: chemical identifier of the databases

    Target: biological target of the molecule expressed as the HGNC gene symbol

    Activity type: for example, pIC50

    Assay type: Simplification/Classification of the assay into cell-free, cellular, functional and unspecified

    Unit: unit of bioactivity measurement

    Mean columns of the databases: mean of bioactivity values or activity comments denoted with the frequency of their occurrence in the database, e.g. Mean C = 7.5 *(15) -> the value for this compound-target pair occurs 15 times in ChEMBL database

    Activity check annotation: a bioactivity check was performed by comparing values from the different sources and adding an activity check annotation to provide automated activity validation for additional confidence

    no comment: bioactivity values are within one log unit;

    check activity data: bioactivity values are not within one log unit;

    only one data point: only one value was available, no comparison and no range calculated;

    no activity value: no precise numeric activity value was available;

    no log-value could be calculated: no negative decadic logarithm could be calculated, e.g., because the reported unit was not a compound concentration

    Ligand names: all unique names contained in the five source databases are listed

    Canonical SMILES columns: Molecular structure of the compound from each database

    Structure check (Tanimoto): To denote matching or differing compound structures in different source databases

    match: molecule structures are the same between different sources;

    no match: the structures differ. We calculated the Jaccard-Tanimoto similarity coefficient from Morgan Fingerprints to reveal true differences between sources and reported the minimum value;

    1 structure: no structure comparison is possible, because there was only one structure available;

    no structure: no structure comparison is possible, because there was no structure available.

    Source: the databases from which the data come
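
    The column descriptions above can be sketched in code. The following is a minimal illustration, not the authors' KNIME workflow: the cell layout "<value> *(<count>)" is inferred from the single "Mean C = 7.5 *(15)" example, "within one log unit" is assumed to mean a max-minus-min spread of at most 1, and fingerprints are represented as plain sets of on-bits rather than real Morgan fingerprints (which would need a cheminformatics toolkit). All function names are hypothetical, and the "no log-value could be calculated" case is not modeled.

```python
import re
from itertools import combinations
from typing import FrozenSet, Optional, Sequence, Tuple

# Assumed cell layout "<value> *(<count>)", taken from the "7.5 *(15)" example.
_MEAN_CELL = re.compile(r"^\s*(?P<value>.+?)\s*\*\s*\((?P<count>\d+)\)\s*$")


def parse_mean_cell(cell: str) -> Optional[Tuple[str, int]]:
    """Split a Mean-column cell into (value or comment, frequency)."""
    match = _MEAN_CELL.match(cell or "")
    if match is None:
        return None
    return match.group("value"), int(match.group("count"))


def activity_check(log_values: Sequence[Optional[float]]) -> str:
    """Annotate per-source negative-log activities (None = no numeric value)."""
    numeric = [v for v in log_values if v is not None]
    if not numeric:
        return "no activity value"
    if len(numeric) == 1:
        return "only one data point"
    # Assumption: "within one log unit" means max - min <= 1.
    spread = max(numeric) - min(numeric)
    return "no comment" if spread <= 1.0 else "check activity data"


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Jaccard-Tanimoto coefficient of two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def structure_check(
    fingerprints: Sequence[Optional[FrozenSet[int]]],
) -> Tuple[str, Optional[float]]:
    """Return the annotation and the minimum pairwise similarity, if any."""
    present = [fp for fp in fingerprints if fp is not None]
    if not present:
        return "no structure", None
    if len(present) == 1:
        return "1 structure", None
    minimum = min(tanimoto(a, b) for a, b in combinations(present, 2))
    # Assumption: identical bit sets stand in for "same structure".
    return ("match", minimum) if minimum == 1.0 else ("no match", minimum)
```

    Reporting the minimum pairwise similarity mirrors the description of the "no match" case; how the original workflow decides that two structures are the same is not stated in the description, so the equality test here is only a stand-in.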

  3. Table_1_How to Achieve Better Results Using PASS-Based Virtual Screening:...

    • frontiersin.figshare.com
    • figshare.com
    xlsx
    Updated Jun 3, 2023
    Cite
    Pavel V. Pogodin; Alexey A. Lagunin; Anastasia V. Rudik; Dmitry A. Filimonov; Dmitry S. Druzhilovskiy; Mark C. Nicklaus; Vladimir V. Poroikov (2023). Table_1_How to Achieve Better Results Using PASS-Based Virtual Screening: Case Study for Kinase Inhibitors.XLSX [Dataset]. http://doi.org/10.3389/fchem.2018.00133.s003
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Frontiers
    Authors
    Pavel V. Pogodin; Alexey A. Lagunin; Anastasia V. Rudik; Dmitry A. Filimonov; Dmitry S. Druzhilovskiy; Mark C. Nicklaus; Vladimir V. Poroikov
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Discovery of new pharmaceutical substances is currently boosted by the possibility of utilization of the Synthetically Accessible Virtual Inventory (SAVI) library, which includes about 283 million molecules, each annotated with a proposed synthetic one-step route from commercially available starting materials. The SAVI database is well-suited for ligand-based methods of virtual screening to select molecules for experimental testing. In this study, we compare the performance of three approaches for the analysis of structure-activity relationships that differ in their criteria for selecting the “active” and “inactive” compounds included in the training sets. PASS (Prediction of Activity Spectra for Substances), which is based on a modified Naïve Bayes algorithm, was applied since it had been shown to be robust and to provide good predictions of many biological activities based on just the structural formula of a compound even if the information in the training set is incomplete. We used different subsets of kinase inhibitors for this case study because many data are currently available on this important class of drug-like molecules. Based on the subsets of kinase inhibitors extracted from the ChEMBL 20 database we performed the PASS training, and then applied the model to ChEMBL 23 compounds not yet present in ChEMBL 20 to identify novel kinase inhibitors. As one may expect, the best prediction accuracy was obtained if only the experimentally confirmed active and inactive compounds for distinct kinases were used in the training procedure. However, for some kinases, reasonable results were obtained even if we used merged training sets, in which we designated as inactives the compounds not tested against the particular kinase. Thus, depending on the availability of data for a particular biological activity, one may choose the first or the second approach for creating ligand-based computational tools to achieve the best possible results in virtual screening.
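
    The difference between the two training-set designs mentioned at the end of the description can be illustrated with a short sketch. This is not the PASS procedure; the function name, arguments, and the 1/0 label encoding are assumptions made for illustration only.

```python
from typing import Dict, Set


def build_training_labels(
    tested_active: Set[str],
    tested_inactive: Set[str],
    all_compounds: Set[str],
    merged: bool,
) -> Dict[str, int]:
    """Label compounds for one kinase: 1 = active, 0 = inactive.

    merged=False keeps only experimentally confirmed actives and inactives.
    merged=True additionally designates every untested compound as inactive.
    """
    labels = {compound: 1 for compound in tested_active}
    for compound in tested_inactive:
        # A compound reported as active stays active (assumed tie-break).
        labels.setdefault(compound, 0)
    if merged:
        for compound in all_compounds:
            labels.setdefault(compound, 0)
    return labels
```

    With merged=True the training set grows, but some of the added "inactives" are simply untested, which is consistent with the description's remark that the confirmed-only design gave the best prediction accuracy.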

  4. 31 ChEMBL data sets for regression modeling

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 21, 2020
    Cite
    Jenny Balfer; Jürgen Bajorath (2020). 31 ChEMBL data sets for regression modeling [Dataset]. http://doi.org/10.5281/zenodo.13986
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 21, 2020
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Jenny Balfer; Jürgen Bajorath
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    From ChEMBL version 17, 31 compound data sets have been selected for regression modeling. Compounds had to be active against human targets in a direct inhibition/binding assay with highest ChEMBL confidence score and Ki values below 100 micromolar. Multiple Ki values for the same compound were averaged if they fell into the same order of magnitude, or else they were disregarded. Duplicates, known pan-assay interference, and other reactive molecules were removed. Only sets with at least 500 compounds were considered.
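
    The rule for handling multiple Ki values can be sketched as follows. This is a minimal illustration under two assumptions that the description does not confirm: "same order of magnitude" is taken to mean the same floor of log10, and the average is the arithmetic mean. The function name is hypothetical.

```python
import math
from statistics import mean
from typing import Optional, Sequence


def consolidate_ki(ki_values_nm: Sequence[float]) -> Optional[float]:
    """Average repeated Ki values, or return None to disregard the compound.

    Values are averaged only if they all share one order of magnitude,
    interpreted here as the same floor(log10(Ki)).
    """
    if not ki_values_nm:
        return None
    magnitudes = {math.floor(math.log10(value)) for value in ki_values_nm}
    if len(magnitudes) != 1:
        return None  # measurements disagree too much; drop them
    return mean(ki_values_nm)
```

    For example, 12 nM and 48 nM would be averaged to 30 nM, whereas 8 nM and 12 nM straddle a power of ten under this interpretation and would be disregarded.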

    Note: The SD files contain a field "pKi"; however, this field contains the Ki value in nM units, not the logarithmic value.
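
    Because the "pKi" field holds Ki in nM, a logarithmic value has to be derived before use. A minimal sketch, using the standard definition pKi = -log10(Ki in mol/L), which for nM input becomes 9 - log10(Ki); the function name is hypothetical:

```python
import math


def pki_from_ki_nm(ki_nm: float) -> float:
    """Convert a Ki given in nM to pKi = -log10(Ki in mol/L)."""
    if ki_nm <= 0:
        raise ValueError("Ki must be positive")
    return 9.0 - math.log10(ki_nm)
```

    Under this conversion the 100 micromolar selection cutoff (100,000 nM) corresponds to a pKi of 4.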

  5. Activity cliffs with dual-atom replacements and single-atom analogs

    • zenodo.org
    • data.niaid.nih.gov
    Updated Nov 2, 2021
    Cite
    Huabin Hu; Jürgen Bajorath (2021). Activity cliffs with dual-atom replacements and single-atom analogs [Dataset]. http://doi.org/10.5281/zenodo.5634280
    Explore at:
    Dataset updated
    Nov 2, 2021
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Huabin Hu; Jürgen Bajorath
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    From the ChEMBL database, 852 activity cliffs (ACs) with dual-atom replacements, formed by compounds with high-confidence activity data, were extracted. Each AC captured an at least 10-fold difference in compound potency. For a subset of these ACs, analogs with corresponding single-atom replacements were identified. The dual-atom ACs and available single-atom replacement analogs were provided (SMILES representation and ChEMBL compound ID). For each AC compound and analog, targets from ChEMBL are reported (with UniProt ID). The target shared by all associated compounds represents the primary AC target.
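
    The potency criterion can be expressed as a simple check. This sketch covers only the 10-fold difference; the structural criterion (a dual-atom replacement between the two compounds) is not modeled, and the function name and the choice of concentration-type potency values are assumptions.

```python
def meets_potency_criterion(
    potency_a: float, potency_b: float, min_fold: float = 10.0
) -> bool:
    """True if two potency values (same unit, e.g. nM) differ at least min_fold-fold."""
    if potency_a <= 0 or potency_b <= 0:
        raise ValueError("potency values must be positive")
    low, high = sorted((potency_a, potency_b))
    return high / low >= min_fold
```

    On a negative-log scale such as pKi, the same criterion corresponds to a difference of at least one log unit.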

  6. ChEMBL data against CHEMBL367, CHEMBL368 and CHEMBL612348

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated May 20, 2023
    Cite
    Arnaud Gaudry (2023). ChEMBL data against CHEMBL367, CHEMBL368 and CHEMBL612348 [Dataset]. http://doi.org/10.5281/zenodo.7953284
    Explore at:
    Available download formats: bin
    Dataset updated
    May 20, 2023
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Arnaud Gaudry
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data from ChEMBL compounds reported with an activity against one of the following targets: CHEMBL367 : Leishmania donovani, CHEMBL368 : Trypanosoma cruzi, and CHEMBL612348 : Trypanosoma brucei rhodesiense.

  7. A large comprehensive curated dataset of small molecules and their...

    • repository.uantwerpen.be
    • zenodo.org
    Updated 2023
    Cite
    Arab, Issar; Egghe, Kristof; Laukens, Kris; Chen, Ke; Barakat, Khaled; Bittremieux, Wout (2023). A large comprehensive curated dataset of small molecules and their activities covering three cardiac ion channels: hERG, Cav1.2, and Nav1.5 [Dataset]. http://doi.org/10.5281/ZENODO.8359714
    Explore at:
    Dataset updated
    2023
    Dataset provided by
    Zenodo http://zenodo.org/
    Faculty of Sciences. Mathematics and Computer Science
    University of Antwerp
    Authors
    Arab, Issar; Egghe, Kristof; Laukens, Kris; Chen, Ke; Barakat, Khaled; Bittremieux, Wout
    Description

    The compressed data folder (dataset.rar) represents a data framework for researchers in the field of drug discovery to perform in depth analyses on a very large open-access unique and comprehensive hERG, Nav1.5, and Cav1.2 cardiotoxicity integrated database of small molecules and their activities. The database is organized as follows:

    • Each sub-folder represents a cardiac ion channel target: hERG, Nav1.5, and Cav1.2.
    • Each target sub-folder consists of 3 files in CSV format. One file contains the development set (split into training and validation sets using an 80/20 ratio for hyperparameter tuning). The other 2 files contain external evaluation sets. The first test dataset consists of compounds with a structural similarity of no more than 60% (Tanimoto similarity ≤ 0.6) to the remaining development set, while the second test dataset comprises compounds with a structural similarity of no more than 70% (Tanimoto similarity ≤ 0.7) to the remaining development set.
    • Each file contains data with 7 columns: "InChl Key" as a unique identifier of the chemical structure, "SMILES" as the string format of storage and exchange of the chemical structure, "Source" as the upstream data source from which the data was retrieved, "ChEMBL ID" as the ChEMBL identifier if the compound comes from ChEMBL database, "PubChem CID" as the PubChem compound identifier if the compound comes from PubChem database, "pIC50" as the negative logarithm of the half-maximal inhibitory concentration (IC50) to describe the potency of the compound, and "USED_AS" column specifying whether the compound was used for training or validation.

    Upon usage, please cite this publication: Issar Arab, Kristof Egghe, Kris Laukens, Ke Chen, Khaled Barakat, Wout Bittremieux, Benchmarking of Small Molecule Feature Representations for hERG, Nav1.5, and Cav1.2 Cardiotoxicity Prediction, Journal of Chemical Information and Modeling, (2023). doi:10.1021/acs.jcim.3c01301
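
    The similarity-constrained test sets in this description can be sketched as a filter on the maximum Tanimoto similarity to the development set. This is an illustration only: the description does not say which fingerprint was used or how the maximum was taken, so fingerprints are modeled as plain sets of on-bits and all names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Sequence


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def select_dissimilar(
    candidates: Dict[str, FrozenSet[int]],
    development: Sequence[FrozenSet[int]],
    threshold: float,
) -> List[str]:
    """Keep candidates whose similarity to every development compound is <= threshold."""
    selected = []
    for name, fingerprint in candidates.items():
        nearest = max((tanimoto(fingerprint, ref) for ref in development), default=0.0)
        if nearest <= threshold:
            selected.append(name)
    return selected
```

    Calling the filter with threshold=0.6 and threshold=0.7 would give the two external evaluation sets described above, the first being the stricter of the two.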

  8. HLM_RLM

    • huggingface.co
    Updated Jun 4, 2025
    Cite
    Maom Lab (2025). HLM_RLM [Dataset]. https://huggingface.co/datasets/maomlab/HLM_RLM
    Explore at:
    Dataset updated
    Jun 4, 2025
    Dataset authored and provided by
    Maom Lab
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Human & Rat Liver Microsomal Stability

    3345 RLM and 6420 HLM compounds were initially collected from the ChEMBL bioactivity database. (HLM ID: 613373, 2367379, and 612558; RLM ID: 613694, 2367428, and 612558) Finally, the RLM stability data set contains 3108 compounds, and the HLM stability data set contains 5902 compounds. For the RLM stability data set, 1542 (49.6%) compounds were classified as stable, and 1566 (50.4%) compounds were classified as unstable, among which the… See the full description on the dataset page: https://huggingface.co/datasets/maomlab/HLM_RLM.

  9. Bioactive compounds with no structural analogs (high-confidence activity...

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Dimova, Dilyana (2020). Bioactive compounds with no structural analogs (high-confidence activity data) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_33497
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Stumpfe, Dagmar
    Dimova, Dilyana
    Bajorath, Jürgen
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A set of 52,815 unique bioactive compounds (human targets, high-confidence activity data) with no structural analogs with high-confidence activity data was extracted from ChEMBL. For each compound the ChEMBL compound ID (CHEMBLID_Compound) and high-confidence target annotation(s) (CHEMBLID_Targets) are provided. The data set was generated as a part of an analysis to be published in 'Medicinal Chemistry Communications'.

  10. Experimental data of sterylglucosides isolated from biodiesel tank...

    • dataverse.unr.edu.ar
    bin, html, tsv, txt +1
    Updated May 22, 2025
    Cite
    RDA UNR (2025). Experimental data of sterylglucosides isolated from biodiesel tank precipitates [Dataset]. http://doi.org/10.57715/UNR/CY7ICJ
    Explore at:
    Available download formats: zip(443043), zip(3000787), zip(212867), txt(15681), zip(7787947), zip(346289), html(40042), tsv(450), zip(343000), tsv(514), zip(10198774), zip(346961), bin(3613156), tsv(550)
    Dataset updated
    May 22, 2025
    Dataset provided by
    RDA UNR
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    Universidad Nacional de Rosario
    Consejo Nacional de Investigaciones Científicas y Técnicas
    Description

    Introduction: This dataset compiles experimental data on the recovery of sterylglucosides from biodiesel tank precipitates. It includes reported bioactivity data for the isolated compounds (ChEMBL database), purity measurements (quantitative NMR), and toxicity assays on cells (Huh7) and male mice (C57BL/6). The present dataset is focused on the review of the reported biological activity of each of the three compounds isolated as a mixture (beta-sitosterylglucoside:stigmasterylglucoside:campesterylglucoside in 2:1:1 ratio), followed by purity measurement of the mixture and subsequent toxicity testing of the mixture against a cell line and mice.

    Methodological information:

    Sample origin: The biodiesel tank bottom sample was provided by UnitecBio (https://unitecbio.com.ar/) in 2015.

    Isolation and Purification Procedure: Two consecutive washes (hexane and methanol, respectively) were used to eliminate the biodiesel and the remaining minor components. Then, the sterylglycosides were recovered by filtration and dried until they reached a constant weight. The identity of the sterylglycosides was confirmed by proton (¹H) and carbon (¹³C) Nuclear Magnetic Resonance Spectroscopy, and also two-dimensional NMR experiments, i.e. Correlation Spectroscopy (COSY) NMR, Heteronuclear Single Quantum Coherence (HSQC) NMR, and Heteronuclear Multiple Bond Correlation (HMBC) NMR. NMR spectra were acquired on a Bruker Avance II 300 MHz (75.13 MHz) using CDCl₃ as solvent. Spectra were processed using Bruker TopSpin 4.4.1.

    Purity Determination: The isolated mixture was mixed with the internal standard, 4-methoxyphenol. It was then derivatized by acetylation, and ¹H NMR spectra were recorded (128 scans). NMR spectra were acquired on a Bruker Avance II 300 MHz (75.13 MHz) using CDCl₃ as solvent. Spectra were processed using Bruker TopSpin 4.4.1.

    Reported Bioactivity Evaluation: A search for the bioactivities of the compounds in the mixture was performed in the ChEMBL35 database on June 30, 2025. The corresponding SMILES were used as input for the ChEMBL advanced search engine.

    Solubility tests: SGs were suspended in various solvents or solvent mixtures in screw-capped tubes. Samples were heated at 40 °C with orbital shaking (400 rpm, 5 min) using a thermomixer, then visually inspected for undissolved material. Solubility was evaluated by the Tyndall effect using a red laser. If insolubility or partial solubility was observed, additional solvent was added and the process repeated. In certain cases, ultrasonic treatment was applied (20 min, 40 °C) to improve dissolution.

    Toxicity assays: In vitro cytotoxicity was assessed using Huh7 hepatocarcinoma cells exposed to increasing concentrations of SGs for 72 hours. For in vivo evaluation, SGs were administered orally to mice (n = 11, 2 per group, 3 for control) at 25–200 mg/kg/day for 21 days. Biochemical markers of liver function (GOT, GPT) and metabolic parameters (glucose, cholesterol, TAGs) were measured after euthanasia. All animal procedures were performed in accordance with the Regulation for the Care and Use of Laboratory Animals and were approved by the Institutional Committee for the Care and Use of Laboratory Animals (CICUAL) of the Universidad Nacional de Rosario (UNR), Argentina.

    Dataset content: This dataset's files are organized into four main folders according to the type of analysis performed.

    Sterylglucosides_NMR_experiments: This contains the NMR spectra that were used to determine the structure of the isolated mixture. It includes zipped files with the following spectral data:

    • Sterylglucosides_13C_spectrum.zip
    • Sterylglucosides_1H_spectrum.zip
    • Sterylglucosides_COSY_spectrum.zip
    • Sterylglucosides_HMBC_spectrum.zip
    • Sterylglucosides_HSQC_spectrum.zip

    NMR spectra were acquired on a Bruker Avance II 300 MHz (75.13 MHz) using CDCl₃ as solvent. Chemical shifts (δ) were reported in ppm downfield from tetramethylsilane and coupling constants are in hertz (Hz). All NMR spectra were referenced to the residual undeuterated solvent as an internal reference. Spectra were processed using Bruker TopSpin 4.4.1.

    Sterylglucosides_purity_determination: This contains the NMR spectra that were used to determine the purity of the isolated mixture. It includes zipped files with the following spectral data:

    • Entry_1_q1HRMN_spectrum.zip
    • Entry_2_q1HRMN_spectrum.zip
    • Entry_3_q1HRMN_spectrum.zip

    Signals corresponding to the aromatic protons of 4-methoxyphenyl acetate (7.1–6.9 ppm, dd, 4H) and the proton attached to C6 (5.42 ppm, m, 1H) were integrated relative to each other to determine the sample's purity. Spectra were processed using Bruker TopSpin 4.4.1. It also contains the processed data for purity determination, included in the following file: Sterylglucosides_purity_determination-1.tab

    Sterylglucosides_reported_activities_in_ChEMBL: This contains the results of searches for bioactivities by structure in the ChEMBL database. Each of the three structures was used as input to ChEMBL's advanced search engine (https://www.ebi.ac.uk/chembl/) “chemical search” -> “SIMILARITY>95%”. Three entries were obtained for beta-sitosterylglucoside, one for stigmasterylglucoside and none for campesterylglucoside. Each entry is presented as a separate sheet indicating the corresponding ChEMBL ID preceded by “B” for beta-sitosterylglucoside and “S” for stigmasterylglucoside in the following file: Carlucci_et_al_Sterylglucosides_activities_ChEMBL.tab

    Sterylglucosides_toxicity_assays: This compiles experimental toxicity data obtained through in vitro Huh7 cell assays and in vivo C57BL/6 mouse studies. The data are included in the following file: Carlucci_et_al_Sterylglucosides_toxicity_cells_mice.tab. Solubility and MTT results are summarized for Acetone and DMSO vehicles in Sheet 1. Raw UV-vis data and viability percentages for MTT assays for both control and SGs are described in Sheets 2-5. Mice body weight (BW) and liver weight (LW) results are described in Sheet 6. Mice plasma and tissue metabolic assay results (raw data included) are described in Sheets 7-11 (i.e., TG & cholesterol, glycemia, GOT, GPT and WP MTP, respectively). Additionally, the results of the analysis of variance are included in the following files:

    • Carlucci_et_al_Sterylglucosides_toxicity_cells_ANOVA.pzfx
    • Carlucci_et_al_Sterylglucosides_toxicity_mice_ANOVA.pzfx

    Value of the data: The general significance of this dataset revolves around the review of the reported biological activities of sterylglucosides and the assessment of their purity and toxicity when isolated from biodiesel tank bottom deposits. This dataset offers valuable insights for researchers working in the fields of natural product chemistry, toxicology, and bioactivity profiling by providing experimental evidence that highlights the non-toxic behavior of a sterylglucoside mixture in both in vitro (Huh7 cells) and in vivo (C57BL/6 mice) models. The biological activities of β-sitosterylglucoside, stigmasterylglucoside, and campesterylglucoside were systematically re-evaluated based on existing data. The purity and structural integrity of the isolated sterylglucosides were confirmed. Toxicity assessments indicated no significant adverse effects in human liver cells or mice. Solubility profiles in different solvents were established to support future applications. Structural identification was consistent with previously reported NMR data. This dataset may be particularly valuable for studies involving sterylglucosides derived from industrial waste sources, contributing to the evaluation of their safety profiles and potential applications in biomedical and pharmaceutical research.

    Data quality: Experiments were carried out with replicates (n = 3 or 2). For the controls of the in vitro and in vivo experiments, solvents were used as vehicles in the absence of sterylglucosides.

  11. 286 new target pairs based on shared compounds from ChEMBL

    • zenodo.org
    • data.niaid.nih.gov
    txt
    Updated Jan 21, 2020
    Cite
    Filip Miljković; Ryo Kunimoto; Jürgen Bajorath (2020). 286 new target pairs based on shared compounds from ChEMBL [Dataset]. http://doi.org/10.5281/zenodo.556530
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 21, 2020
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Filip Miljković; Ryo Kunimoto; Jürgen Bajorath
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reported is the list of 286 compound-based target pairs between distantly related or unrelated pharmaceutical target proteins with shared common compounds, derived from ChEMBL22 high-confidence data. For each target, the corresponding UniProt ID is provided. In addition, for each target pair, the number of shared compounds and their structures in SMILES notation are provided.
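
    The idea of a compound-based target pair can be sketched as a set intersection over per-target compound sets. This is a minimal illustration, not the authors' protocol: the filtering for distantly related or unrelated targets is not modeled, and the function name, the min_shared parameter, and the identifiers in the usage note are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def shared_compound_pairs(
    target_to_compounds: Dict[str, Set[str]], min_shared: int = 1
) -> Dict[Tuple[str, str], Set[str]]:
    """Map each target pair to the compounds active against both targets."""
    pairs: Dict[Tuple[str, str], Set[str]] = {}
    for first, second in combinations(sorted(target_to_compounds), 2):
        shared = target_to_compounds[first] & target_to_compounds[second]
        if len(shared) >= min_shared:
            pairs[(first, second)] = shared
    return pairs
```

    For instance, with targets "T1" active against {"C1", "C2"} and "T2" against {"C2", "C3"}, the only reported pair would be ("T1", "T2") with the shared compound "C2"; the number of shared compounds is then the size of that set.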

  12. Compound activity records associated with original publications in ChEMBL 21...

    • data.niaid.nih.gov
    Updated Jan 21, 2020
    Cite
    Hu, Ye (2020). Compound activity records associated with original publications in ChEMBL 21 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_51688
    Explore at:
    Dataset updated
    Jan 21, 2020
    Dataset provided by
    Hu, Ye
    Bajorath, Jürgen
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Provided are two sets of compound activity records (set 1 and set 2) that were traced back to original publications and assembled from ChEMBL release 21. For each compound-target combination, the corresponding potency measurements and publications are provided. In addition, the list of unique publications is given for both sets 1 and 2.

  13. Highly Promiscuous Compounds From Pubchem Assays

    • explore.openaire.eu
    • data.niaid.nih.gov
    Updated Nov 2, 2016
    Cite
    Erik Gilberg; Swarit Jasial; Dagmar Stumpfe; Dilyana Dimova; Jürgen Bajorath (2016). Highly Promiscuous Compounds From Pubchem Assays [Dataset]. http://doi.org/10.5281/zenodo.164405
    Dataset updated
    Nov 2, 2016
    Authors
    Erik Gilberg; Swarit Jasial; Dagmar Stumpfe; Dilyana Dimova; Jürgen Bajorath
    Description

    For the pool of 466 detected highly promiscuous compounds, the PubChem ID and the corresponding ChEMBL ID(s) are provided. In addition, the detection status is set to "pains" or "aggregator" if the compound was detected as PAINS or as an aggregator, respectively; otherwise, the status is set to "passed". "ChEMBL analogues" lists the ChEMBL compound IDs of structural analogs of highly promiscuous compounds (if available). For the 466 compounds, the number of targets and the corresponding PubChem target IDs are given in the last two columns. Compounds 1-26 (Compound No. 1-26) correspond to compounds shown in the publication.

  14. Data from: Library of Two Million Unique Small Molecules with Precalculated...

    • zenodo.org
    • repository.uantwerpen.be
    bin
    Updated Aug 8, 2024
    Cite
    Issar Arab; Kris Laukens; Wout Bittremieux (2024). Library of Two Million Unique Small Molecules with Precalculated Fingerprints, Descriptors, and Cardiotoxicity Inhibition Data [Dataset]. http://doi.org/10.5281/zenodo.11066707
    Available download formats: bin
    Dataset updated
    Aug 8, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Issar Arab; Kris Laukens; Wout Bittremieux
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository comprises a dataset of ~2 million unique compounds saved in an hdf5 small molecule library store, which includes the following fields for each molecule:

    • InChI key
    • Standardized SMILES string
    • Compound source
    • ChEMBL identifier if the compound exists in this open access database
    • 1024-bit Morgan fingerprint
    • 2048-bit Morgan fingerprint
    • 881-bit PubChem fingerprints
    • Vector of 854 preprocessed and standardized Mordred descriptors
    • Cardiotoxicity inhibition predictions for each of the three cardiac ion channels (hERG, Nav1.5, and Cav1.2) using CtoxPred2, along with the model confidence scores

    The repository also includes a Jupyter notebook that serves as an initial guide for querying the small molecule library store. Export both files to the same folder, make sure approximately 40 GB of disk space is available, unzip the library store, and then launch the notebook to begin querying.

    Upon usage, please cite this publication:

    • Issar Arab, Kris Laukens, Wout Bittremieux, Semisupervised Learning to Boost hERG, Nav1.5, and Cav1.2 Cardiac Ion Channel Toxicity Prediction by Mining a Large Unlabeled Small Molecule Data Set, Journal of Chemical Information and Modeling, (2024). doi:10.1021/acs.jcim.4c01102
  15. chembl_multiassay_activity

    • huggingface.co
    Updated Jan 20, 2025
    Cite
    Haoran Jie (2025). chembl_multiassay_activity [Dataset]. https://huggingface.co/datasets/jiahborcn/chembl_multiassay_activity
    Dataset updated
    Jan 20, 2025
    Authors
    Haoran Jie
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    ChEMBL Drug-Target Activity Dataset

    This dataset was extracted from the ChEMBL34 database. It is designed for multitask classification of drug-target activities. It links compound structures with activity data for multiple assays, enabling multitask learning experiments in drug discovery. Key features of the dataset include:

    Multitask Format

    Each assay ID is treated as a separate binary classification task. Binary labels (0 for inactive, 1 for active) and masks (indicating… See the full description on the dataset page: https://huggingface.co/datasets/jiahborcn/chembl_multiassay_activity.
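
    As a rough illustration of how per-assay binary labels and masks are typically combined, the sketch below scores predictions for one compound across several assays. The description of the masks is truncated above, so the meaning used here (1 = a label was measured for that assay, 0 = no measurement) is an assumption, and the function name masked_accuracy is hypothetical rather than part of the dataset's tooling.

```python
from typing import Sequence


def masked_accuracy(
    predictions: Sequence[int],
    labels: Sequence[int],
    mask: Sequence[int],
) -> float:
    """Accuracy over the assays whose mask entry is 1.

    Assumption: mask[i] == 1 means assay i has a measured label,
    and mask[i] == 0 means the label is missing and must be ignored.
    """
    scored = [(p, y) for p, y, m in zip(predictions, labels, mask) if m == 1]
    if not scored:
        raise ValueError("no measured assays to score")
    return sum(p == y for p, y in scored) / len(scored)


# One compound, four assays (tasks); the third assay has no measurement.
predictions = [1, 0, 1, 1]
labels = [1, 1, 0, 1]
mask = [1, 1, 0, 1]
accuracy = masked_accuracy(predictions, labels, mask)
```

    Under that assumption, only the three measured assays contribute to the score, so a wrong prediction on the unmeasured assay is not penalized.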

  16. 784 promiscuity cliffs from ChEMBL

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Bajorath, Jürgen (2020). 784 promiscuity cliffs from ChEMBL [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_200393
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Gilberg, Erik
    Dimova, Dilyana
    Bajorath, Jürgen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reported are the 784 promiscuity cliffs formed by compounds from medicinal chemistry sources. Compounds forming cliffs are provided as SMILES. For each compound its promiscuity degree (PD) and the list of ChEMBL target IDs are provided.

  17. Classification of Binding Modes for Kinase-Inhibitor Complex Structures, 3D...

    • zenodo.org
    • explore.openaire.eu
    bin
    Updated Jan 24, 2020
    Cite
    Norbert Furtmann; Ye Hu; Jürgen Bajorath (2020). Classification of Binding Modes for Kinase-Inhibitor Complex Structures, 3D Activity Cliffs Formed by Kinase Inhibitors, and Structural Analogues of 3D-Cliff Compounds [Dataset]. http://doi.org/10.5281/zenodo.11022
    Available download formats: bin
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Norbert Furtmann; Ye Hu; Jürgen Bajorath
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The classification of crystallographic binding modes is provided for 884 kinase-inhibitor complex structures that were assembled from the PDB. In addition, a total of 105 three-dimensional activity cliffs formed by kinase inhibitors are listed, together with their corresponding potency information. Furthermore, the 2D structural analogs of 3D cliff-forming inhibitors were identified in the ChEMBL database on the basis of matched molecular pairs. These analogs and their activity information are also provided.

  18. GDB Databases

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 1, 2022
    Cite
    Lorenz C. Blum (2022). GDB Databases [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5172017
    Dataset updated
    Sep 1, 2022
    Dataset provided by
    Ruud van Deursen
    Lorenz C. Blum
    Tobias Fink
    Lars Ruddigkeit
    Jean-Louis Reymond
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About

    GDB-11 enumerates small organic molecules up to 11 atoms of C, N, O and F following simple chemical stability and synthetic feasibility rules. GDB-13 enumerates small organic molecules up to 13 atoms of C, N, O, S and Cl following simple chemical stability and synthetic feasibility rules. With 977 468 314 structures, GDB-13 is the largest publicly available small organic molecule database to date.

    How to cite

    To cite GDB-11, please reference:

    Virtual exploration of the chemical universe up to 11 atoms of C, N, O, F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physico-chemical properties, compound classes and drug discovery. Fink, T.; Reymond, J.-L. J. Chem. Inf. Model. 2007, 47, 342-353.

    Virtual Exploration of the Small Molecule Chemical Universe below 160 Daltons. Fink, T.; Bruggesser, H.; Reymond, J.-L. Angew. Chem. Int. Ed. 2005, 44, 1504-1508.

    To cite GDB-13, please reference:

    970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. Blum, L. C.; Reymond, J.-L. J. Am. Chem. Soc. 2009, 131, 8732-8733.

    To cite GDB-17, please reference:

    Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Ruddigkeit, L.; van Deursen, R.; Blum, L. C.; Reymond, J.-L. J. Chem. Inf. Model. 2012, 52, 2864-2875.

    Download

    You can download the databases and their subsets using the links provided. All the molecules are stored in dearomatized, canonized SMILES format and compressed as tar/gz archives (for Windows users: download 7-zip to open the archives).

    GDB-17
    • GDB-17-Set (50 million): GDB17.50000000.smi.gz, 314 MB
    • Lead-like Set (100-350 MW & 1-3 clogP) (11 million): GDB17.50000000LL.smi.gz, 75 MB
    • Lead-like Set (100-350 MW & 1-3 clogP) without small rings (3-4 ring atoms) (0.8 million): GDB17.50000000LLnoSR.smi.gz, 55 MB

    GDB-13
    • Entire GDB-13 (including all C/N/O/Cl/S molecules): gdb13.tgz, 2.6 GB

    GDB-13 Subsets (The sum of all the subsets below correspond to the entire GDB-13 above)
    • Graph subset (saturated hydrocarbons): gdb13.g.tgz, 1.1 MB
    • Skeleton subset (unsaturated hydrocarbons): gdb13.sk.tgz, 14 MB
    • Only carbon & nitrogen containing molecules: gdb13.cn.tgz, 443 MB
    • Only carbon & oxygen containing molecules: gdb13.co.tgz, 299 MB
    • Only carbon & nitrogen & oxygen containing molecules: gdb13.cno.tgz, 1.8 GB
    • Chlorine & sulphur containing molecules: gdb13.cls.tgz, 189 MB

    GDB-13 Subsets (For details please refer to the Table 2 in J Comput Aided Mol Des 2011 25:637 to 647)
    • GDB-13 Subset AB (~635 Millions): AB.smi.gz, 2.4 GB
    • GDB-13 Subset ABC (~441 Millions): ABC.smi.gz, 1.7 GB
    • GDB-13 Subset ABCD (~277 Millions): ABCD.smi.gz, 1.1 GB
    • GDB-13 Subset ABCDE (~140 Millions): ABCDE.smi.gz, 565 MB
    • GDB-13 Subset ABCDEF (~43 Millions): ABCDEF.smi.gz, 171 MB
    • GDB-13 Subset ABCDEFG (~13 Millions): ABCDEFG.smi.gz, 50 MB
    • GDB-13 Subset ABCDEFGH (~1.4 Millions): ABCDEFGH.smi.gz, 6.2 MB

    GDB-13 Random Sample. Annotated with frequency and log-likelihood (Please refer to Exploring the GDB-13 chemical space using deep generative models)
    • GDB-13 Random Sample (1 Million): gdb13.1M.freq.ll.smi.gz, 14.8 MB

    FDB-17
    • FDB-17: FDB-17-fragmentset.smi.gz, 62.2 MB

    GDB4c
    • GDB4c (SMILES): GDB4c.smi.gz, 6.2 MB
    • GDB4c3D (SMILES): GDB4c3D.smi.gz, 161 MB
    • GDB4c3D (SDF): GDB4c3D.sdf.tar.gz, 2 GB

    Other
    • GDBMedChem (SMILES): GDBMedChem.smi, 276 MB
    • GDBChEMBL (SMILES): GDBChEMBL.smi, 353.6 MB
    • GDB-13 random selection (1 million): gdb13.rand1M.smi.gz, 7.2 MB
    • Fragment-like subset (Rule of three): gdb13.frl.tgz, 1.2 GB
    • Dark matter universe up to 9 heavy atoms: dmu9.tgz, 87 MB

    GDB-11
    • Entire GDB-11 (including all C/N/O/F molecules): gdb11.tgz, 122 MB

    Fragrance Like Subsets: For details please refer to Ruddigkeit et al. Journal of Cheminformatics 2014, 6:27
    • FragranceDB (SuperScent + Flavornet): FragranceDB.smi, 56 KB
    • TasteDB (SuperSweet + BitterDB): TasteDB.smi, 44 KB
    • FragranceDB.FL (Fragrance-like subset of FragranceDB): FragranceDB.FL.smi, 32 KB
    • ChEMBL.FL (Fragrance-like subset of ChEMBL): ChEMBL.FL.smi, 452 KB
    • PubChem.FL (Fragrance-like subset of PubChem): PubChem.FL.smi, 20 MB
    • ZINC.FL (Fragrance-like subset of ZINC): ZINC.FL.smi, 1.3 MB
    • GDB-13.FL (Fragrance-like subset of GDB-13): GDB-13.FL.smi.gz, 165 MB
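
    Because several of the files listed above are gzip-compressed SMILES files of considerable size, it can help to stream them rather than decompress them to disk first. The following is a minimal sketch using only the Python standard library; the function names iter_smiles and count_smiles are hypothetical, and the assumption that each line starts with the SMILES string (optionally followed by other whitespace-separated fields) is not confirmed by the dataset page. It applies to the .smi.gz files only; the .tgz archives would first need to be unpacked, for example with the tarfile module.

```python
import gzip
from typing import Iterator


def iter_smiles(path: str) -> Iterator[str]:
    """Yield one SMILES string per non-empty line of a gzip-compressed .smi file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if fields:
                # Assumption: the SMILES string is the first
                # whitespace-separated field on each line.
                yield fields[0]


def count_smiles(path: str) -> int:
    """Count the molecules in the file without loading it into memory."""
    return sum(1 for _ in iter_smiles(path))
```

    For example, count_smiles("gdb13.rand1M.smi.gz") would iterate over the random GDB-13 selection line by line, assuming the file has been downloaded to the working directory.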

    Terms and conditions: The GDB databases may be downloaded free of charge. In published research involving GDB, cite the appropriate references mentioned above. GDB must not be used as part of or in patents. GDB and large portions thereof must not be redistributed without the express written permission of Jean-Louis Reymond.

  19. Collection of analog series-based (ASB) scaffolds

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Jan 24, 2020
    Cite
    Dilyana Dimova; Jürgen Bajorath (2020). Collection of analog series-based (ASB) scaffolds [Dataset]. http://doi.org/10.5281/zenodo.1041394
    Available download formats: bin
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dilyana Dimova; Jürgen Bajorath
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The entire collection of 23,791 unique ASB scaffolds generated from compounds from Probes and Drugs Portal (PDP) and ChEMBL (version 23) is reported. Each ASB scaffold is provided in canonical SMILES representation, the database origin (DB_origin) is specified, and the number of analogs (#analogs) the scaffold represents is reported. In addition, for each ASB scaffold from ChEMBL, unique target annotations of corresponding analogs are provided using UniProt target identifiers. For ASB scaffolds from PDP, collected compound annotations are provided. For scaffolds shared between PDP and ChEMBL, the number of analogs is provided in the form 'X|Y', where 'X' and 'Y' denote the number of analogs in PDP and ChEMBL, respectively. Scaffolds are rank-ordered according to the number of analogs they represent.
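
    The 'X|Y' form of the #analogs field can be handled with a small parser. The sketch below is illustrative only: the function name parse_analog_counts is hypothetical, and it assumes that scaffolds from a single database carry a plain integer in this field, which the description does not state explicitly.

```python
from typing import Tuple


def parse_analog_counts(value: str) -> Tuple[int, ...]:
    """Parse an #analogs field.

    'X|Y' (scaffold shared between PDP and ChEMBL) becomes (X, Y).
    A plain integer becomes a one-element tuple; that single-source
    form is an assumption, not something the dataset page confirms.
    """
    return tuple(int(part) for part in value.strip().split("|"))


shared = parse_analog_counts("12|7")   # (PDP analogs, ChEMBL analogs)
single = parse_analog_counts("5")      # analogs in the one source database
```

    Returning a tuple in both cases keeps the caller's logic simple: its length tells whether the scaffold is shared, and the DB_origin column indicates which database a single count refers to.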

  20. Data from: Systematic Design of Analogs of Active Compounds Covering More...

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Dimova, Dilyana (2020). Systematic Design of Analogs of Active Compounds Covering More than 1000 Targets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_45807
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Dimova, Dilyana
    Bajorath, Jürgen
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The analog database consisting of 1,297,204 virtual compounds is provided. Virtual compounds are reported in SMILES representation. In addition, for each virtual compound all available ChEMBL analogs (CHEMBL_COMPOUND_ID) and their activities (CHEMBL_TARGET_IDs) are given.

Cite
Faisal Bin Ashraf; Sanjida Akter; Sumona Hoque Mumu; Muhammad Usama Islam; Jasim Uddin (2023). Bioactivity datasets from ChEMBL. [Dataset]. http://doi.org/10.1371/journal.pone.0288053.t001

Bioactivity datasets from ChEMBL.

5 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: xls
Dataset updated
Sep 6, 2023
Dataset provided by
PLOS ONE
Authors
Faisal Bin Ashraf; Sanjida Akter; Sumona Hoque Mumu; Muhammad Usama Islam; Jasim Uddin
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The SARS-CoV-2 3CLpro protein is one of the key therapeutic targets of interest for COVID-19 due to its critical role in viral replication, the availability of various high-quality protein crystal structures, and its use as a basis for computationally screening for compounds with improved inhibitory activity, bioavailability, and ADMETox properties. The ChEMBL and PubChem databases contain experimental data from screening small molecules against SARS-CoV-2 3CLpro, which expands the opportunity to learn the pattern and design a computational model that can predict the potency of any drug compound against coronavirus before in-vitro and in-vivo testing. In this study, we evaluated 27 machine learning classifiers using several descriptors. We also developed a neural network model that identifies bioactive and inactive chemicals with 91% accuracy on ChEMBL data and 93% accuracy on combined ChEMBL and PubChem data. The F1-score for inactive and active compounds was 93% and 94%, respectively. SHAP (SHapley Additive exPlanations) was applied to an XGB classifier to find important fingerprints from the PaDEL descriptors for this task. The results indicated that the PaDEL descriptors were effective in predicting bioactivity, the proposed neural network design was efficient, and the explanatory analysis through SHAP correctly identified the important fingerprints. In addition, we validated the effectiveness of our proposed model using a large dataset encompassing over 100,000 molecules. This research employed various molecular descriptors to discover the optimal one for this task. To evaluate the effectiveness of these possible medications against SARS-CoV-2, more in-vitro and in-vivo research is required.
