53 datasets found
  1. ChEMBL EBI Small Molecules Database

    • kaggle.com
    Updated Feb 12, 2019
    Cite
    Google BigQuery (2019). ChEMBL EBI Small Molecules Database [Dataset]. https://www.kaggle.com/datasets/bigquery/ebi-chembl
    Dataset updated
    Feb 12, 2019
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    Description

    Context

    ChEMBL is maintained by the European Bioinformatics Institute (EBI), of the European Molecular Biology Laboratory (EMBL), based at the Wellcome Trust Genome Campus, Hinxton, UK.

    Content

    ChEMBL is a manually curated database of bioactive molecules with drug-like properties used in drug discovery, including information about existing patented drugs.

    Schema: http://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_23/chembl_23_schema.png

    Documentation: http://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_23/schema_documentation.html

    Fork this notebook to get started on accessing data in the BigQuery dataset using the BQhelper package to write SQL queries.

    Acknowledgements

    “ChEMBL” by the European Bioinformatics Institute (EMBL-EBI), used under CC BY-SA 3.0. Modifications have been made to add normalized publication numbers.

    Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:ebi_chembl

    Banner photo by rawpixel on Unsplash

  2. ChEMBL Data

    • console.cloud.google.com
    Updated Aug 4, 2020
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:Google%20Patents%20Public%20Datasets&inv=1&invt=AbzlbA (2020). ChEMBL Data [Dataset]. https://console.cloud.google.com/marketplace/product/google_patents_public_datasets/chembl
    Dataset updated
    Aug 4, 2020
    Dataset provided by
    Google (http://google.com/)
    Description

    ChEMBL Data is a manually curated database of small molecules used in drug discovery, including information about existing patented drugs.

  3. Raw data extracted from ChEMBL

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 1, 2022
    Cite
    Drizard, Nicolas (2022). Raw data extracted from ChEMBL [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5045054
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    Friedrich, Lukas
    Drizard, Nicolas
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Raw data files extracted from ChEMBL for the MELLODDY project.

  4. Bioactivity datasets from ChEMBL.

    • plos.figshare.com
    xls
    Updated Sep 6, 2023
    Cite
    Faisal Bin Ashraf; Sanjida Akter; Sumona Hoque Mumu; Muhammad Usama Islam; Jasim Uddin (2023). Bioactivity datasets from ChEMBL. [Dataset]. http://doi.org/10.1371/journal.pone.0288053.t001
    Available download formats: xls
    Dataset updated
    Sep 6, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Faisal Bin Ashraf; Sanjida Akter; Sumona Hoque Mumu; Muhammad Usama Islam; Jasim Uddin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The SARS-CoV-2 3CLpro protein is one of the key therapeutic targets of interest for COVID-19 due to its critical role in viral replication, the availability of various high-quality protein crystal structures, and its use as a basis for computationally screening for compounds with improved inhibitory activity, bioavailability, and ADMETox properties. The ChEMBL and PubChem databases contain experimental data from screening small molecules against SARS-CoV-2 3CLpro, which expands the opportunity to learn the pattern and design a computational model that can predict the potency of any drug compound against coronavirus before in-vitro and in-vivo testing. In this study, utilizing several descriptors, we evaluated 27 machine learning classifiers. We also developed a neural network model that can correctly identify bioactive and inactive chemicals with 91% accuracy on ChEMBL data and 93% accuracy on combined ChEMBL and PubChem data. The F1-score for inactive and active compounds was 93% and 94%, respectively. SHAP (SHapley Additive exPlanations) was applied to an XGB classifier to find important fingerprints from the PaDEL descriptors for this task. The results indicated that the PaDEL descriptors were effective in predicting bioactivity, the proposed neural network design was efficient, and the explanatory analysis through SHAP correctly identified the important fingerprints. In addition, we validated the effectiveness of our proposed model using a large dataset encompassing over 100,000 molecules. This research employed various molecular descriptors to discover the optimal one for this task. To evaluate the effectiveness of these possible medications against SARS-CoV-2, more in-vitro and in-vivo research is required.

  5. Data from: A consensus compound/bioactivity dataset for data-driven drug...

    • data.niaid.nih.gov
    Updated May 13, 2022
    Cite
    Isigkeit, Laura (2022). A consensus compound/bioactivity dataset for data-driven drug design and chemogenomics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6320760
    Dataset updated
    May 13, 2022
    Dataset provided by
    Chaikuad, Apirat
    Isigkeit, Laura
    Merk, Daniel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the updated version of the dataset from 10.5281/zenodo.6320761

    Information

    The diverse publicly available compound/bioactivity databases constitute a key resource for data-driven applications in chemogenomics and drug design. Analysis of their coverage of compound entries and biological targets revealed considerable differences, however, suggesting the benefit of a consensus dataset. Therefore, we have combined and curated information from five esteemed databases (ChEMBL, PubChem, BindingDB, IUPHAR/BPS and Probes&Drugs) to assemble a consensus compound/bioactivity dataset comprising 1144648 compounds with 10915362 bioactivities on 5613 targets (including defined macromolecular targets as well as cell-lines and phenotypic readouts). It also provides simplified information on assay types underlying the bioactivity data and on bioactivity confidence by comparing data from different sources. We have unified the source databases, brought them into a common format and combined them, making the result easy to use in multiple applications such as chemogenomics and data-driven drug design.

    The consensus dataset provides increased target coverage and contains a higher number of molecules compared to the source databases which is also evident from a larger number of scaffolds. These features render the consensus dataset a valuable tool for machine learning and other data-driven applications in (de novo) drug design and bioactivity prediction. The increased chemical and bioactivity coverage of the consensus dataset may improve robustness of such models compared to the single source databases. In addition, semi-automated structure and bioactivity annotation checks with flags for divergent data from different sources may help data selection and further accurate curation.

    This dataset belongs to the publication: https://doi.org/10.3390/molecules27082513

    Structure and content of the dataset

    Dataset structure (column headers, in order)

    ChEMBL ID | PubChem ID | IUPHAR ID | Target | Activity type | Assay type | Unit | Mean C (0) | ... | Mean PC (0) | ... | Mean B (0) | ... | Mean I (0) | ... | Mean PD (0) | ... | Activity check annotation | Ligand names | Canonical SMILES C | ... | Structure check (Tanimoto) | Source

    The dataset was created using the Konstanz Information Miner (KNIME) (https://www.knime.com/) and was exported as a CSV-file and a compressed CSV-file.

    Except for the canonical SMILES columns, all columns are filled with the datatype ‘string’. The datatype for the canonical SMILES columns is the smiles-format. We recommend the File Reader node for using the dataset in KNIME. With the help of this node the data types of the columns can be adjusted exactly. In addition, only this node can read the compressed format.

    Column content:

    ChEMBL ID, PubChem ID, IUPHAR ID: chemical identifier of the databases

    Target: biological target of the molecule expressed as the HGNC gene symbol

    Activity type: for example, pIC50

    Assay type: Simplification/Classification of the assay into cell-free, cellular, functional and unspecified

    Unit: unit of bioactivity measurement

    Mean columns of the databases: mean of bioactivity values or activity comments denoted with the frequency of their occurrence in the database, e.g. Mean C = 7.5 *(15) -> the value for this compound-target pair occurs 15 times in ChEMBL database

    Activity check annotation: a bioactivity check was performed by comparing values from the different sources and adding an activity check annotation to provide automated activity validation for additional confidence

    no comment: bioactivity values are within one log unit;

    check activity data: bioactivity values are not within one log unit;

    only one data point: only one value was available, no comparison and no range calculated;

    no activity value: no precise numeric activity value was available;

    no log-value could be calculated: no negative decadic logarithm could be calculated, e.g., because the reported unit was not a compound concentration

    Ligand names: all unique names contained in the five source databases are listed

    Canonical SMILES columns: Molecular structure of the compound from each database

    Structure check (Tanimoto): To denote matching or differing compound structures in different source databases

    match: molecule structures are the same between different sources;

    no match: the structures differ. We calculated the Jaccard-Tanimoto similarity coefficient from Morgan Fingerprints to reveal true differences between sources and reported the minimum value;

    1 structure: no structure comparison is possible, because there was only one structure available;

    no structure: no structure comparison is possible, because there was no structure available.

    Source: the databases from which the data come
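
The column conventions described above can be illustrated with a small Python sketch. It is not the authors' KNIME workflow. The cell layout for the Mean columns is inferred only from the example "7.5 *(15)", the reading of "within one log unit" as max minus min of at most 1.0 is an assumption, and fingerprints are represented here as plain sets of on-bit indices rather than real Morgan fingerprints. All function names are hypothetical.

```python
import re
from itertools import combinations
from typing import List, Optional, Sequence, Set, Tuple

# Assumed cell layout, taken from the example "7.5 *(15)":
# a value or comment, an optional "*", then the frequency in parentheses.
_MEAN_CELL = re.compile(r"^\s*(?P<value>.+?)\s*\*?\s*\((?P<count>\d+)\)\s*$")


def parse_mean_cell(cell: str) -> Optional[Tuple[str, int]]:
    """Split a Mean column cell into (value text, frequency); None if it does not fit."""
    match = _MEAN_CELL.match(cell)
    if match is None:
        return None
    return match.group("value"), int(match.group("count"))


def activity_check(p_values: Sequence[Optional[float]]) -> str:
    """Annotate one compound-target pair from per-source negative log activities.

    Assumption: "within one log unit" means max - min <= 1.0.
    The "no log-value could be calculated" case is not modelled here.
    """
    known = [v for v in p_values if v is not None]
    if not known:
        return "no activity value"
    if len(known) == 1:
        return "only one data point"
    spread = max(known) - min(known)
    return "no comment" if spread <= 1.0 else "check activity data"


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Jaccard-Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = a | b
    if not union:
        return 0.0  # arbitrary choice for two empty fingerprints
    return len(a & b) / len(union)


def min_pairwise_tanimoto(fingerprints: List[Set[int]]) -> Optional[float]:
    """Minimum similarity over all source pairs; None if fewer than two structures."""
    if len(fingerprints) < 2:
        return None
    return min(tanimoto(a, b) for a, b in combinations(fingerprints, 2))
```

For example, `parse_mean_cell("7.5 *(15)")` yields the value text `"7.5"` with a frequency of 15, and two sources reporting 5.0 and 7.2 would be flagged as "check activity data" under the assumed threshold.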

  6. Table_1_How to Achieve Better Results Using PASS-Based Virtual Screening:...

    • frontiersin.figshare.com
    • figshare.com
    xlsx
    Updated Jun 3, 2023
    Cite
    Pavel V. Pogodin; Alexey A. Lagunin; Anastasia V. Rudik; Dmitry A. Filimonov; Dmitry S. Druzhilovskiy; Mark C. Nicklaus; Vladimir V. Poroikov (2023). Table_1_How to Achieve Better Results Using PASS-Based Virtual Screening: Case Study for Kinase Inhibitors.XLSX [Dataset]. http://doi.org/10.3389/fchem.2018.00133.s003
    Available download formats: xlsx
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Frontiers
    Authors
    Pavel V. Pogodin; Alexey A. Lagunin; Anastasia V. Rudik; Dmitry A. Filimonov; Dmitry S. Druzhilovskiy; Mark C. Nicklaus; Vladimir V. Poroikov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Discovery of new pharmaceutical substances is currently boosted by the possibility of utilization of the Synthetically Accessible Virtual Inventory (SAVI) library, which includes about 283 million molecules, each annotated with a proposed synthetic one-step route from commercially available starting materials. The SAVI database is well-suited for ligand-based methods of virtual screening to select molecules for experimental testing. In this study, we compare the performance of three approaches for the analysis of structure-activity relationships that differ in their criteria for selecting of “active” and “inactive” compounds included in the training sets. PASS (Prediction of Activity Spectra for Substances), which is based on a modified Naïve Bayes algorithm, was applied since it had been shown to be robust and to provide good predictions of many biological activities based on just the structural formula of a compound even if the information in the training set is incomplete. We used different subsets of kinase inhibitors for this case study because many data are currently available on this important class of drug-like molecules. Based on the subsets of kinase inhibitors extracted from the ChEMBL 20 database we performed the PASS training, and then applied the model to ChEMBL 23 compounds not yet present in ChEMBL 20 to identify novel kinase inhibitors. As one may expect, the best prediction accuracy was obtained if only the experimentally confirmed active and inactive compounds for distinct kinases in the training procedure were used. However, for some kinases, reasonable results were obtained even if we used merged training sets, in which we designated as inactives the compounds not tested against the particular kinase. Thus, depending on the availability of data for a particular biological activity, one may choose the first or the second approach for creating ligand-based computational tools to achieve the best possible results in virtual screening.
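
The two ways of assembling a training set that the abstract contrasts can be sketched as follows. This is only an illustration of the labelling idea, not the PASS procedure itself. The function name, the dictionary of measured results, and the compound identifiers are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def build_training_set(
    measured: Dict[str, bool],
    all_compounds: Iterable[str],
    merged: bool,
) -> List[Tuple[str, bool]]:
    """Return (compound, is_active) pairs for one kinase.

    measured: compounds experimentally tested against this kinase,
              mapped to True (active) or False (inactive).
    merged:   if True, compounds never tested against this kinase are
              added as presumed inactives (the "merged" training set).
    """
    labelled = list(measured.items())
    if merged:
        labelled += [(c, False) for c in all_compounds if c not in measured]
    return labelled


# Hypothetical data: compound C was never tested against this kinase.
confirmed_only = build_training_set({"A": True, "B": False}, ["A", "B", "C"], merged=False)
with_presumed = build_training_set({"A": True, "B": False}, ["A", "B", "C"], merged=True)
```

The merged variant yields more training rows, but its extra negatives are unverified, which is consistent with the abstract's observation that confirmed-only sets gave the best accuracy.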

  7. GDB Databases

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin
    Updated Sep 1, 2022
    Cite
    Tobias Fink; Lorenz C. Blum; Lars Ruddigkeit; Ruud van Deursen; Jean-Louis Reymond (2022). GDB Databases [Dataset]. http://doi.org/10.5281/zenodo.5172018
    Available download formats: bin, application/gzip
    Dataset updated
    Sep 1, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tobias Fink; Lorenz C. Blum; Lars Ruddigkeit; Ruud van Deursen; Jean-Louis Reymond
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About

    GDB-11 enumerates small organic molecules up to 11 atoms of C, N, O and F following simple chemical stability and synthetic feasibility rules.
    GDB-13 enumerates small organic molecules up to 13 atoms of C, N, O, S and Cl following simple chemical stability and synthetic feasibility rules. With 977 468 314 structures, GDB-13 is the largest publicly available small organic molecule database to date.

    How to cite

    To cite GDB-11, please reference:

    Virtual exploration of the chemical universe up to 11 atoms of C, N, O, F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physico-chemical properties, compound classes and drug discovery. Fink, T.; Reymond, J.-L. J. Chem. Inf. Model. 2007, 47, 342-353.

    Virtual Exploration of the Small Molecule Chemical Universe below 160 Daltons. Fink, T.; Bruggesser, H.; Reymond, J.-L. Angew. Chem. Int. Ed. 2005, 44, 1504-1508.

    To cite GDB-13, please reference:

    970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. Blum L. C.; Reymond J.-L. J. Am. Chem. Soc., 2009, 131, 8732-8733.

    To cite GDB-17, please reference:

    Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Ruddigkeit Lars, van Deursen Ruud, Blum L. C.; Reymond J.-L. J. Chem. Inf. Model., 2012, 52, 2864-2875.

    Download

    You can download the databases and their subsets using the links provided. All the molecules are stored in dearomatized, canonized SMILES format and compressed as tar/gz archives (Windows users: download 7-Zip to open the archives).


    GDB-17
    GDB-17-Set (50 million) GDB17.50000000.smi.gz 314 MB
    Lead-like Set (100-350 MW & 1-3 clogP)(11 million) GDB17.50000000LL.smi.gz 75 MB
    Lead-like Set (100-350 MW & 1-3 clogP) without small rings (3-4 ring atoms)(0.8 million) GDB17.50000000LLnoSR.smi.gz 55 MB

    GDB-13
    Entire GDB-13 (including all C/N/O/Cl/S molecules) gdb13.tgz 2.6 GB
    GDB-13 Subsets (The sum of all the subsets below correspond to the entire GDB-13 above)
    Graph subset (saturated hydrocarbons) gdb13.g.tgz 1.1 MB
    Skeleton subset (unsaturated hydrocarbons) gdb13.sk.tgz 14 MB
    Only carbon & nitrogen containing molecules gdb13.cn.tgz 443 MB
    Only carbon & oxygen containing molecules gdb13.co.tgz 299 MB
    Only carbon & nitrogen & oxygen containing molecules gdb13.cno.tgz 1.8 GB
    Chlorine & sulphur containing molecules gdb13.cls.tgz 189 MB

    GDB-13 Subsets (For details please refer to the Table 2 in J Comput Aided Mol Des 2011 25:637 to 647)
    GDB-13 Subset AB (~635 Millions) AB.smi.gz 2.4 GB
    GDB-13 Subset ABC (~441 Millions) ABC.smi.gz 1.7 GB
    GDB-13 Subset ABCD (~277 Millions) ABCD.smi.gz 1.1 GB
    GDB-13 Subset ABCDE (~140 Millions) ABCDE.smi.gz 565 MB
    GDB-13 Subset ABCDEF (~43 Millions) ABCDEF.smi.gz 171 MB
    GDB-13 Subset ABCDEFG (~13 Millions) ABCDEFG.smi.gz 50 MB
    GDB-13 Subset ABCDEFGH (~1.4 Millions) ABCDEFGH.smi.gz 6.2 MB
    GDB-13 Random Sample. Annotated with frequency and log-likelihood (Please refer to Exploring the GDB-13 chemical space using deep generative models)
    GDB-13 Random Sample (1 Million) gdb13.1M.freq.ll.smi.gz 14.8 MB

    FDB-17
    FDB-17 FDB-17-fragmentset.smi.gz 62.2 MB


    GDB4c
    GDB4c (SMILES) GDB4c.smi.gz 6.2 MB
    GDB4c3D (SMILES) GDB4c3D.smi.gz 161 MB
    GDB4c3D (SDF) GDB4c3D.sdf.tar.gz 2 GB


    Other
    GDBMedChem (SMILES) GDBMedChem.smi 276 MB
    GDBChEMBL (SMILES) GDBChEMBL.smi 353.6 MB
    GDB-13 random selection (1 million) gdb13.rand1M.smi.gz 7.2 MB
    Fragment-like subset (Rule of three) gdb13.frl.tgz 1.2 GB
    Dark matter universe up to 9 heavy atoms dmu9.tgz 87 MB

    GDB-11
    Entire GDB-11 (including all C/N/O/F molecules) gdb11.tgz 122 MB
    Fragrance Like Subsets: For details please refer to Ruddigkeit et al. Journal of Cheminformatics 2014, 6:27
    FragranceDB (SuperScent + Flavornet) FragranceDB.smi 56 KB
    TasteDB (SuperSweet + BitterDB) TasteDB.smi 44 KB
    FragranceDB.FL (Fragrance-like subset of FragranceDB) FragranceDB.FL.smi 32 KB
    ChEMBL.FL (Fragrance-like subset of ChEMBL) ChEMBL.FL.smi 452 KB
    PubChem.FL Fragrance-like subset of PubChem PubChem.FL.smi 20 MB
    ZINC.FL (Fragrance-like subset of ZINC) ZINC.FL.smi 1.3 MB
    GDB-13.FL (Fragrance-like subset of GDB-13) GDB-13.FL.smi.gz 165 MB
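
Several of the files listed above are gzip-compressed SMILES files (`.smi.gz`), which can be streamed without unpacking them to disk. The sketch below is a minimal illustration using only the Python standard library. It assumes one molecule per line with the SMILES string as the first whitespace-separated field, which is not confirmed by the listing, and it does not handle the `.tgz` tar archives. The function name is hypothetical.

```python
import gzip
import io
from typing import BinaryIO, Iterator


def iter_smiles(raw: BinaryIO) -> Iterator[str]:
    """Yield one SMILES string per non-empty line of a gzip-compressed stream.

    Assumption: the SMILES is the first whitespace-separated field on each line.
    """
    with gzip.open(raw, mode="rt", encoding="ascii") as handle:
        for line in handle:
            fields = line.split()
            if fields:
                yield fields[0]


# Self-contained demonstration on an in-memory archive instead of a download.
demo = io.BytesIO(gzip.compress(b"CCO\nC1CC1 extra-column\n\n"))
demo_smiles = list(iter_smiles(demo))
```

For a downloaded file, the same generator could be fed with `open(path, "rb")`. Streaming matters here because the larger subsets run to gigabytes when compressed.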

    Terms and conditions: The GDB databases may be downloaded free of charge. In published research involving GDB, cite the appropriate references mentioned above. GDB must not be used as part of or in patents. GDB and large portions thereof must not be redistributed without the express written permission of Jean-Louis Reymond.

  8. Data from: PDEStrIAn: A Phosphodiesterase Structure and Ligand Interaction...

    • acs.figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    Chimed Jansen; Albert J. Kooistra; Georgi K. Kanev; Rob Leurs; Iwan J. P. de Esch; Chris de Graaf (2023). PDEStrIAn: A Phosphodiesterase Structure and Ligand Interaction Annotated Database As a Tool for Structure-Based Drug Design [Dataset]. http://doi.org/10.1021/acs.jmedchem.5b01813.s001
    Available download formats: zip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Chimed Jansen; Albert J. Kooistra; Georgi K. Kanev; Rob Leurs; Iwan J. P. de Esch; Chris de Graaf
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    A systematic analysis is presented of the 220 phosphodiesterase (PDE) catalytic domain crystal structures present in the Protein Data Bank (PDB) with a focus on PDE–ligand interactions. The consistent structural alignment of 57 PDE ligand binding site residues enables the systematic analysis of PDE–ligand interaction fingerprints (IFPs), the identification of subtype-specific PDE–ligand interaction features, and the classification of ligands according to their binding modes. We illustrate how systematic mining of this phosphodiesterase structure and ligand interaction annotated (PDEStrIAn) database provides new insights into how conserved and selective PDE interaction hot spots can accommodate the large diversity of chemical scaffolds in PDE ligands. A substructure analysis of the cocrystallized PDE ligands in combination with those in the ChEMBL database provides a toolbox for scaffold hopping and ligand design. These analyses lead to an improved understanding of the structural requirements of PDE binding that will be useful in future drug discovery studies.

  9. Clinical Trial Management System Market Size | Forecast 2031

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Nov 27, 2023
    Cite
    Growth Market Reports (2023). Clinical Trial Management System Market Size | Forecast 2031 [Dataset]. https://growthmarketreports.com/report/clinical-trial-management-system-market-global-industry-analysis
    Available download formats: pdf, csv, pptx
    Dataset updated
    Nov 27, 2023
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Global Clinical Trial Management System Market Outlook 2031:



    The global clinical trial management system market size was valued at USD 1.66 Billion in 2022 and is projected to reach USD 5.53 Billion by 2031, expanding at a CAGR of 14.3% during the forecast period 2023 - 2031. The growth of the market is attributed to the increasing number of chronic diseases, changing lifestyles, the growing number of clinical trials, outsourcing, and implementation by research organizations.
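
The quoted figures can be checked with the standard compound annual growth rate formula. The sketch below assumes nine compounding years between the 2022 valuation and the 2031 projection, which is an interpretation of the stated period rather than something the report spells out.

```python
def project(value: float, rate: float, years: int) -> float:
    """Compound `value` forward at annual growth `rate` for `years` years."""
    return value * (1.0 + rate) ** years


def cagr(start: float, end: float, years: int) -> float:
    """Annual growth rate that turns `start` into `end` over `years` years."""
    return (end / start) ** (1.0 / years) - 1.0


# Assumption: 2022 -> 2031 is treated as 9 compounding years.
projected_2031 = project(1.66, 0.143, 9)   # USD billion
implied_rate = cagr(1.66, 5.53, 9)
```

Under that assumption the projection comes out at roughly USD 5.53 billion and the implied rate at roughly 14.3%, so the three quoted numbers are mutually consistent.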



    In the last few years, heavy investment in the clinical trial segment has boosted the market. Several organizations have developed clinical trial management systems that integrate with existing software, which helps improve work efficiency.

    Advances in technology are reducing the cost associated with clinical trial management systems. Conversely, high cost and data security are challenges for the segment that can hinder the market during the forecast period.

    During the COVID-19 outbreak, the healthcare sectors of countries invested heavily in research and development to fight the virus, drawing on compound resources such as the ZINC database, the FDA-approved drugs in the ChEMBL database, and more. COVID-19 can therefore be said to have had a positive impact on the healthcare industry, spurring market growth.



    Clinical Trial Management System Market Trends, Drivers, Restraints, and Opportunities




    • The rising number of chronic diseases and lifestyle disorders is expected to drive the market in the coming years.

    • Increasing outsourcing of clinical trials is anticipated to boost the market during the projected period.

    • Increasing investment in research and development is estimated to be one of the major factors boosting the

  10. MELLODDY TUNER release v3 public data

    • zenodo.org
    • explore.openaire.eu
    bin, zip
    Updated Aug 2, 2022
    Cite
    Lukas Friedrich (2022). MELLODDY TUNER release v3 public data [Dataset]. http://doi.org/10.5281/zenodo.6948581
    Available download formats: zip, bin
    Dataset updated
    Aug 2, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lukas Friedrich
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    A public dataset from ChEMBL (v25) for MELLODDY TUNER release v3.

    Data extracted from ChEMBL (LICENSE attached), and processed with https://github.com/melloddy/MELLODDY-TUNER release v3.

    The data can be used for technical tests, as a template for your own dataset, and for machine learning with SparseChem (https://github.com/melloddy/SparseChem).

  11. chembl_multiassay_activity

    • huggingface.co
    Updated Jan 20, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Haoran Jie (2025). chembl_multiassay_activity [Dataset]. https://huggingface.co/datasets/jiahborcn/chembl_multiassay_activity
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 20, 2025
    Authors
    Haoran Jie
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    ChEMBL Drug-Target Activity Dataset

    This dataset was extracted from the ChEMBL 34 database. It is designed for multitask classification of drug-target activities. It links compound structures with activity data for multiple assays, enabling multitask learning experiments in drug discovery. Key features of the dataset include:

    Multitask Format

    Each assay ID is treated as a separate binary classification task. Binary labels (0 for inactive, 1 for active) and masks (indicating… See the full description on the dataset page: https://huggingface.co/datasets/jiahborcn/chembl_multiassay_activity.
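
The description is truncated, so what the masks indicate is not stated here. One common reading, assumed in the sketch below, is that a mask marks which assay labels were actually measured for a compound, so that unmeasured entries are excluded from the loss. The function name and the nested-list layout are hypothetical and are not taken from the dataset page.

```python
import math
from typing import Sequence


def masked_bce(
    probs: Sequence[Sequence[float]],
    labels: Sequence[Sequence[int]],
    mask: Sequence[Sequence[int]],
) -> float:
    """Mean binary cross-entropy over entries whose mask is non-zero.

    Rows are compounds, columns are assays (one binary task per assay).
    Assumption: mask == 1 means the label was measured, 0 means missing.
    """
    total, counted = 0.0, 0
    for p_row, y_row, m_row in zip(probs, labels, mask):
        for p, y, m in zip(p_row, y_row, m_row):
            if not m:
                continue  # skip unmeasured compound-assay pairs
            p = min(max(p, 1e-7), 1.0 - 1e-7)  # keep log() finite
            total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            counted += 1
    return total / counted if counted else 0.0


# One compound, two assays; only the first assay has a measured label.
demo_loss = masked_bce([[0.5, 0.9]], [[1, 0]], [[1, 0]])
```

In this demonstration the second assay contributes nothing, so the loss equals the cross-entropy of the single measured entry.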

  12. Data from: A large-scale dataset of in vivo pharmacology assay results.

    • explore.openaire.eu
    Updated Mar 20, 2019
    Cite
    Fiona M I Hunter; Francis L Atkinson; A Patrícia Bento; Nicolas Bosc; Anna Gaulton; Anne Hersey; Andrew R Leach (2019). A large-scale dataset of in vivo pharmacology assay results. [Dataset]. https://explore.openaire.eu/search/dataset?pid=PMC6206617
    Dataset updated
    Mar 20, 2019
    Authors
    Fiona M I Hunter; Francis L Atkinson; A Patrícia Bento; Nicolas Bosc; Anna Gaulton; Anne Hersey; Andrew R Leach
    Description

    ChEMBL is a large-scale, open-access drug discovery resource containing bioactivity information primarily extracted from scientific literature. A substantial dataset of more than 135,000 in vivo assays has been collated as a key resource of animal models for translational medicine within drug discovery. To improve the utility of the in vivo data, an extensive data curation task has been undertaken that allows the assays to be grouped by animal disease model or phenotypic endpoint. The dataset contains previously unavailable information about compounds or drugs tested in animal models and, in conjunction with assay data on protein targets or cell- or tissue- based systems, allows the investigation of the effects of compounds at differing levels of biological complexity. Equally, it enables researchers to identify compounds that have been investigated for a group of disease-, pharmacology- or toxicity-relevant assays.

  13. Data from: Benchmarking the Predictive Power of Ligand Efficiency Indices in...

    • acs.figshare.com
    zip
    Updated May 30, 2023
    Cite
    Isidro Cortes-Ciriano (2023). Benchmarking the Predictive Power of Ligand Efficiency Indices in QSAR [Dataset]. http://doi.org/10.1021/acs.jcim.6b00136.s001
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    ACS Publications
    Authors
    Isidro Cortes-Ciriano
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Compound physicochemical properties favoring in vitro potency are not always correlated to desirable pharmacokinetic profiles. Therefore, using potency (i.e., IC50) as the main criterion to prioritize candidate drugs at early stage drug discovery campaigns has been questioned. Yet, the vast majority of the virtual screening models reported in the medicinal chemistry literature predict the biological activity of compounds by regressing in vitro potency on topological or physicochemical descriptors. Two studies published in this journal showed that higher predictive power on external molecules can be achieved by using ligand efficiency indices as the dependent variable instead of a metric of potency (IC50) or binding affinity (Ki). The present study aims at filling the shortage of a thorough assessment of the predictive power of ligand efficiency indices in QSAR. To this aim, the predictive power of 11 ligand efficiency indices has been benchmarked across four algorithms (Gradient Boosting Machines, Partial Least Squares, Random Forest, and Support Vector Machines), two descriptor types (Morgan fingerprints, and physicochemical descriptors), and 29 data sets collected from the literature and ChEMBL database. Ligand efficiency metrics led to the highest predictive power on external molecules irrespective of the descriptor type or algorithm used, with an R² (test set) difference of ∼0.3 units, a difference of ∼0.4 units when modeling small data sets, and a normalized RMSE decrease of >0.1 units in some cases. Polarity indices, such as SEI and NSEI, led to higher predictive power than metrics based on molecular size, i.e., BEI, NBEI, and LE. LELP, which comprises a polarity factor (cLogP) and a size parameter (LE), consistently led to the most predictive models, suggesting that these two properties convey a complementary predictive signal. Overall, this study suggests that using ligand efficiency indices as the dependent variable might be an efficient strategy to model compound activity.
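
As a rough illustration of what such indices look like, the sketch below computes four of them using definitions that are commonly given in the medicinal chemistry literature. The exact formulas used in the benchmarked study are not stated in this abstract, so these are assumptions, and the 1.37 factor in LE is the usual approximation near room temperature. The function name and the example inputs are hypothetical.

```python
from typing import Dict


def ligand_efficiency_indices(
    p_activity: float,     # pIC50 or pKi (negative log10 of molar potency)
    heavy_atoms: int,      # number of non-hydrogen atoms
    mol_weight: float,     # molecular weight in Da
    polar_surface: float,  # polar surface area in square angstroms
    clogp: float,          # calculated logP
) -> Dict[str, float]:
    """Commonly cited definitions (assumed here, not taken from the study)."""
    le = 1.37 * p_activity / heavy_atoms         # kcal/mol per heavy atom
    return {
        "LE": le,
        "BEI": p_activity / (mol_weight / 1000.0),    # per kDa
        "SEI": p_activity / (polar_surface / 100.0),  # per 100 square angstroms
        "LELP": clogp / le,                           # lipophilicity per unit LE
    }


# Hypothetical compound: pIC50 8, 25 heavy atoms, 350 Da, PSA 80, cLogP 3.
indices = ligand_efficiency_indices(8.0, 25, 350.0, 80.0, 3.0)
```

Using such an index as the regression target, instead of the raw potency, is the modelling choice that the abstract evaluates.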

  14. qmugs_bioinf595

    • huggingface.co
    Cite
    Manasa Yadavalli, qmugs_bioinf595 [Dataset]. https://huggingface.co/datasets/ymanasa2000/qmugs_bioinf595
    Authors
    Manasa Yadavalli
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    QMugs for ML

    QMugs is a comprehensive dataset designed to facilitate the use of machine learning in drug discovery. It contains over 665,000 drug-like molecules sourced from the ChEMBL database, with each molecule annotated by quantum mechanical (QM) properties computed at the DFT level. These properties include energies, charges, and dipole moments, calculated for multiple conformers of each molecule. The dataset provides a valuable resource for training models to predict… See the full description on the dataset page: https://huggingface.co/datasets/ymanasa2000/qmugs_bioinf595.

  15. Data from: PoseidonQ: A Free Machine Learning Platform for the Development,...

    • acs.figshare.com
    • figshare.com
    xlsx
    Updated Apr 9, 2025
    Cite
    Muzammil Kabier; Nicola Gambacorta; Fulvio Ciriaco; Fabrizio Mastrolorito; Sunil Kumar; Bijo Mathew; Orazio Nicolotti (2025). PoseidonQ: A Free Machine Learning Platform for the Development, Analysis, and Validation of Efficient and Portable QSAR Models for Drug Discovery [Dataset]. http://doi.org/10.1021/acs.jcim.4c02372.s006
    Available download formats: xlsx
    Dataset updated
    Apr 9, 2025
    Dataset provided by
    ACS Publications
    Authors
    Muzammil Kabier; Nicola Gambacorta; Fulvio Ciriaco; Fabrizio Mastrolorito; Sunil Kumar; Bijo Mathew; Orazio Nicolotti
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    The advent of powerful machine learning algorithms as well as the availability of high volumes of pharmacological data has given new fuel to QSAR, opening unprecedented options for deriving highly predictive models for assisting the rational design of new bioactive compounds, for screening and prioritizing large molecular libraries, and for repurposing drugs toward new clinical uses. Here, we present PoseidonQ (an acronym for Personal Optimization Software for Efficient Implementation and Derivation of Online QSAR), a user-friendly software solution designed to simplify the derivation of QSAR models for drug design and discovery. PoseidonQ incorporates 22 machine learning algorithms, 17 types of molecular fingerprints, and 208 RDKit molecular descriptors and enables the quick derivation of both regression and classification models along with a calculated and easily interpretable applicability domain. Importantly, the platform is automatically linked to the latest version of the ChEMBL database, thus providing streamlined access to large amounts of curated bioactivity data. The user is also given the option of gathering high-quality experimental data based on customizable filtering settings. Notably, PoseidonQ facilitates the deployment of trained QSAR models as web-based applications through seamless integration with Streamlit Cloud and GitHub, empowering users to share, refine, and integrate models effortlessly. The translation of QSAR models into web-based applications makes them freely accessible, portable, and ready for screening large volumes of new data without limits. By unifying data preparation, model generation, and deployment into an intuitive workflow, PoseidonQ makes advanced QSAR modeling for drug design and discovery accessible to a wide audience of researchers irrespective of their skill levels. PoseidonQ bridges the gap between complex machine learning techniques and practical drug discovery applications, enhancing the efficiency, collaboration, and adoption of QSAR approaches in modern drug discovery programs. PoseidonQ is available for Windows and Linux (Ubuntu 22.04) operating systems and can be downloaded for free at https://github.com/Muzatheking12/PoseidonQ.

  16. Data from: Pdestrian: A Phosphodiesterase Structure And Ligand Interaction...

    • explore.openaire.eu
    • data.niaid.nih.gov
    Updated Feb 8, 2016
    Cite
    Chimed Jansen; Albert J. Kooistra; Georgi K. Kanev; Rob Leurs; Iwan J. P. De Esch; Chris De Graaf (2016). Pdestrian: A Phosphodiesterase Structure And Ligand Interaction Annotated Database As A Tool For Structure-Based Drug Design [Dataset]. http://doi.org/10.5281/zenodo.45774
    Dataset updated
    Feb 8, 2016
    Authors
    Chimed Jansen; Albert J. Kooistra; Georgi K. Kanev; Rob Leurs; Iwan J. P. De Esch; Chris De Graaf
    Description

    A systematic analysis is presented of the 220 phosphodiesterase (PDE) catalytic domain crystal structures present in the Protein Data Bank (PDB) with a focus on PDE-ligand interactions. The consistent structural alignment of 57 PDE ligand binding site residues enables the systematic analysis of PDE-ligand Interaction FingerPrints (IFPs), the identification of subtype-specific PDE-ligand interaction features, and the classification of ligands according to their binding modes. We illustrate how systematic mining of this phosphodiesterase structure and ligand interaction annotated (PDEStrIAn) database provides new insights into how conserved and selective PDE interaction hot spots can accommodate the large diversity of chemical scaffolds in PDE ligands. A substructure analysis of the co-crystalized PDE ligands in combination with those in the ChEMBL database provides a toolbox for scaffold hopping and ligand design. These analyses lead to an improved understanding of the structural requirements of PDE binding that will be useful in future drug discovery studies. The newest version of PDEStrIAn is available at http://pdestrian.vu-compmedchem.nl

  17. Compound descriptors for text-mined orthosteric and allosteric dataset

    • figshare.com
    • data.4tu.nl
    txt
    Updated Jul 28, 2020
    Cite
    Lindsey Burggraaff (2020). Compound descriptors for text-mined orthosteric and allosteric dataset [Dataset]. http://doi.org/10.4121/uuid:5738caea-2390-4dc7-9830-8d9644232144
    Available download formats: txt
    Dataset updated
    Jul 28, 2020
    Dataset provided by
    4TU.ResearchData
    Authors
    Lindsey Burggraaff
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Compound descriptors for compounds retrieved from the ChEMBL database. Descriptors to annotate the compounds chemically and based on physicochemical properties. To be used in cheminformatics.

  18. ChEMBL RDF

    • data.wu.ac.at
    api/sparql, meta/void +1
    Updated Jul 30, 2016
    + more versions
    Cite
    EMBL European Bioinformatics Institute (2016). ChEMBL RDF [Dataset]. https://data.wu.ac.at/odso/datahub_io/YzA0YTM3MTItN2NlZC00ODk1LTk3YzUtZDcxYjgwZDcyZTUw
    Available download formats: ttl, api/sparql, meta/void
    Dataset updated
    Jul 30, 2016
    Dataset provided by
    European Molecular Biology Laboratory (http://www.embl.org/)
    European Bioinformatics Institute (http://www.ebi.ac.uk/)
    License

    http://www.opendefinition.org/licenses/cc-by-sa

    Description

    ChEMBL is a database of bioactive drug-like small molecules. It contains 2-D structures, calculated properties (e.g. logP, molecular weight, Lipinski parameters, etc.) and abstracted bioactivities (e.g. binding constants, pharmacology and ADMET data). The data is abstracted and curated from the primary scientific literature, and covers a significant fraction of the SAR and discovery of modern drugs.

    It is available in RDF form through EMBL-EBI's RDF Platform.

  19. ChEMBL

    • opendatalab.com
    zip
    Updated May 7, 2024
    + more versions
    Cite
    (2024). ChEMBL [Dataset]. https://opendatalab.com/OpenDataLab/ChEMBL
    Available download formats: zip
    Dataset updated
    May 7, 2024
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    ChEMBL is a database of bioactive drug-like small molecules. It contains 2-D structures, calculated properties (e.g. logP, molecular weight, Lipinski parameters, etc.) and abstracted bioactivities (e.g. binding constants, pharmacology and ADMET data). The data is abstracted and curated from the primary scientific literature, and covers a significant fraction of the SAR and discovery of modern drugs. We attempt to normalise the bioactivities into a uniform set of end-points and units where possible, and also to tag the links between a molecular target and a published assay with a set of varying confidence levels. Additional data on the clinical progress of compounds is currently being integrated into ChEMBL.

  20. Data from: A large comprehensive curated dataset of small molecules and...

    • zenodo.org
    • repository.uantwerpen.be
    bin, png
    Updated Jul 11, 2024
    + more versions
    Cite
    Issar Arab; Kristof Egghe; Kris Laukens; Ke Chen; Khaled Barakat; Wout Bittremieux (2024). A large comprehensive curated dataset of small molecules and their activities covering three cardiac ion channels: hERG, Cav1.2, and Nav1.5 [Dataset]. http://doi.org/10.5281/zenodo.8359714
    Available download formats: png, bin
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Issar Arab; Kristof Egghe; Kris Laukens; Ke Chen; Khaled Barakat; Wout Bittremieux
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The compressed data folder (dataset.rar) represents a data framework for researchers in the field of drug discovery to perform in depth analyses on a very large open-access unique and comprehensive hERG, Nav1.5, and Cav1.2 cardiotoxicity integrated database of small molecules and their activities. The database is organized as follows:

    • Each sub-folder represents a cardiac ion channel target: hERG, Nav1.5, and Cav1.2
    • Each target sub-folder consists of 3 files in CSV format: One file containing the development set (split into training and validation sets using an 80/20 ratio for hyperparameter tuning). The other 2 files contain external evaluation sets. The first test dataset consists of compounds with a structural similarity of no more than 60% (Tanimoto similarity ≤ 0.6) to the remaining development set, while the second test dataset comprises compounds with a structural similarity of no more than 70% (Tanimoto similarity ≤ 0.7) to the remaining development set.
    • Each file contains data with 7 columns: "InChl Key" as a unique identifier of the chemical structure, "SMILES" as the string format of storage and exchange of the chemical structure, "Source" as the upstream data source from which the data was retrieved, "ChEMBL ID" as the ChEMBL identifier if the compound comes from ChEMBL database, "PubChem CID" as the PubChem compound identifier if the compound comes from PubChem database, "pIC50" as the negative logarithm of the half-maximal inhibitory concentration (IC50) to describe the potency of the compound, and "USED_AS" column specifying whether the compound was used for training or validation.
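
    The list above mentions two computations: pIC50 as the negative logarithm of IC50, and test sets limited by Tanimoto similarity to the development set. The sketch below illustrates both in plain Python. It is not the authors' pipeline: the function names are hypothetical, fingerprints are represented as toy sets of on-bit indices (in practice they would come from a cheminformatics toolkit), and applying the threshold to the maximum similarity against any development compound is an assumption about how the cutoff is used.

```python
import math


def pic50_from_ic50_nm(ic50_nm):
    """pIC50 = -log10(IC50 in mol/L); with IC50 given in nM this is 9 - log10(IC50)."""
    return 9.0 - math.log10(ic50_nm)


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as collections of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def select_dissimilar(candidates, development, threshold=0.6):
    """Keep candidates whose highest similarity to any development compound is <= threshold."""
    kept = []
    for fp in candidates:
        max_sim = max((tanimoto(fp, dev) for dev in development), default=0.0)
        if max_sim <= threshold:
            kept.append(fp)
    return kept


if __name__ == "__main__":
    print(pic50_from_ic50_nm(1000.0))           # IC50 of 1 uM
    dev_set = [{1, 2, 3, 4}]                    # toy development fingerprints
    pool = [{1, 2, 3}, {1, 2, 9}]               # toy candidate fingerprints
    print(select_dissimilar(pool, dev_set, threshold=0.6))
```

    Changing `threshold` to 0.7 corresponds to the looser of the two similarity limits mentioned in the description.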

    Upon usage, please cite this publication:

    • Issar Arab, Kristof Egghe, Kris Laukens, Ke Chen, Khaled Barakat, Wout Bittremieux, Benchmarking of Small Molecule Feature Representations for hERG, Nav1.5, and Cav1.2 Cardiotoxicity Prediction, Journal of Chemical Information and Modeling, (2023). doi:10.1021/acs.jcim.3c01301

