21 datasets found

ChEMBL
rrid.site
dknet.org
+1more
Updated Jan 29, 2022
Cite
(2022). ChEMBL [Dataset]. http://identifiers.org/RRID:SCR_014042
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_014042
Dataset updated
Jan 29, 2022
Description
Collection of bioactive drug-like small molecules that contains 2D structures, calculated properties and abstracted bioactivities. Used for drug discovery and chemical biology research. Clinical progress of new compounds is continuously integrated into the database.
Bioactivity datasets from ChEMBL.
plos.figshare.com
xls
Updated Sep 6, 2023
+ more versions
Cite
Faisal Bin Ashraf; Sanjida Akter; Sumona Hoque Mumu; Muhammad Usama Islam; Jasim Uddin (2023). Bioactivity datasets from ChEMBL. [Dataset]. http://doi.org/10.1371/journal.pone.0288053.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0288053.t001
Dataset updated
Sep 6, 2023
Dataset provided by
PLOS ONE
Authors
Faisal Bin Ashraf; Sanjida Akter; Sumona Hoque Mumu; Muhammad Usama Islam; Jasim Uddin
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The SARS-CoV-2 3CLpro protein is one of the key therapeutic targets of interest for COVID-19 due to its critical role in viral replication, the availability of various high-quality protein crystal structures, and its use as a basis for computationally screening for compounds with improved inhibitory activity, bioavailability, and ADMETox properties. The ChEMBL and PubChem databases contain experimental data from screening small molecules against SARS-CoV-2 3CLpro, which expands the opportunity to learn the pattern and design a computational model that can predict the potency of any drug compound against coronavirus before in-vitro and in-vivo testing. In this study, utilizing several descriptors, we evaluated 27 machine learning classifiers. We also developed a neural network model that can correctly identify bioactive and inactive chemicals with 91% accuracy on ChEMBL data and 93% accuracy on combined ChEMBL and PubChem data. The F1-score for inactive and active compounds was 93% and 94%, respectively. SHAP (SHapley Additive exPlanations) was applied to an XGB classifier to find important fingerprints from the PaDEL descriptors for this task. The results indicated that the PaDEL descriptors were effective in predicting bioactivity, the proposed neural network design was efficient, and the explanatory factor through SHAP correctly identified the important fingerprints. In addition, we validated the effectiveness of our proposed model using a large dataset encompassing over 100,000 molecules. This research employed various molecular descriptors to discover the optimal one for this task. To evaluate the effectiveness of these possible medications against SARS-CoV-2, more in-vitro and in-vivo research is required.
Table_1_How to Achieve Better Results Using PASS-Based Virtual Screening:...
frontiersin.figshare.com
figshare.com
xlsx
Updated Jun 3, 2023
+ more versions
Cite
Pavel V. Pogodin; Alexey A. Lagunin; Anastasia V. Rudik; Dmitry A. Filimonov; Dmitry S. Druzhilovskiy; Mark C. Nicklaus; Vladimir V. Poroikov (2023). Table_1_How to Achieve Better Results Using PASS-Based Virtual Screening: Case Study for Kinase Inhibitors.XLSX [Dataset]. http://doi.org/10.3389/fchem.2018.00133.s003
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.3389/fchem.2018.00133.s003
Dataset updated
Jun 3, 2023
Dataset provided by
Frontiers
Authors
Pavel V. Pogodin; Alexey A. Lagunin; Anastasia V. Rudik; Dmitry A. Filimonov; Dmitry S. Druzhilovskiy; Mark C. Nicklaus; Vladimir V. Poroikov
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Discovery of new pharmaceutical substances is currently boosted by the possibility of utilizing the Synthetically Accessible Virtual Inventory (SAVI) library, which includes about 283 million molecules, each annotated with a proposed synthetic one-step route from commercially available starting materials. The SAVI database is well-suited for ligand-based methods of virtual screening to select molecules for experimental testing. In this study, we compare the performance of three approaches for the analysis of structure-activity relationships that differ in their criteria for selecting the “active” and “inactive” compounds included in the training sets. PASS (Prediction of Activity Spectra for Substances), which is based on a modified Naïve Bayes algorithm, was applied since it had been shown to be robust and to provide good predictions of many biological activities based on just the structural formula of a compound even if the information in the training set is incomplete. We used different subsets of kinase inhibitors for this case study because many data are currently available on this important class of drug-like molecules. Based on the subsets of kinase inhibitors extracted from the ChEMBL 20 database we performed the PASS training, and then applied the model to ChEMBL 23 compounds not yet present in ChEMBL 20 to identify novel kinase inhibitors. As one may expect, the best prediction accuracy was obtained if only the experimentally confirmed active and inactive compounds for distinct kinases were used in the training procedure. However, for some kinases, reasonable results were obtained even if we used merged training sets, in which we designated as inactives the compounds not tested against the particular kinase. Thus, depending on the availability of data for a particular biological activity, one may choose the first or the second approach for creating ligand-based computational tools to achieve the best possible results in virtual screening.
Compound descriptors for text-mined orthosteric and allosteric dataset
figshare.com
data.4tu.nl
txt
Updated Jul 28, 2020
Cite
Lindsey Burggraaff (2020). Compound descriptors for text-mined orthosteric and allosteric dataset [Dataset]. http://doi.org/10.4121/uuid:5738caea-2390-4dc7-9830-8d9644232144
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.4121/uuid:5738caea-2390-4dc7-9830-8d9644232144
Dataset updated
Jul 28, 2020
Dataset provided by
4TU.ResearchData
Authors
Lindsey Burggraaff
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Compound descriptors for compounds retrieved from the ChEMBL database. The descriptors annotate the compounds chemically and by their physicochemical properties, and are intended for use in cheminformatics.
GDB Databases
zenodo.org
data.niaid.nih.gov
application/gzip, bin
Updated Sep 1, 2022
Cite
Tobias Fink; Lorenz C. Blum; Lars Ruddigkeit; Ruud van Deursen; Jean-Louis Reymond (2022). GDB Databases [Dataset]. http://doi.org/10.5281/zenodo.5172018
Explore at:
Available download formats: bin, application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.5172018
Dataset updated
Sep 1, 2022
Dataset provided by
Zenodo http://zenodo.org/
Authors
Tobias Fink; Lorenz C. Blum; Lars Ruddigkeit; Ruud van Deursen; Jean-Louis Reymond
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
About

GDB-11 enumerates small organic molecules up to 11 atoms of C, N, O and F following simple chemical stability and synthetic feasibility rules.
GDB-13 enumerates small organic molecules up to 13 atoms of C, N, O, S and Cl following simple chemical stability and synthetic feasibility rules. With 977 468 314 structures, GDB-13 is the largest publicly available small organic molecule database to date.

How to cite

To cite GDB-11, please reference:

Virtual exploration of the chemical universe up to 11 atoms of C, N, O, F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physico-chemical properties, compound classes and drug discovery. Fink, T.; Reymond, J.-L. J. Chem. Inf. Model. 2007, 47, 342-353.

Virtual Exploration of the Small Molecule Chemical Universe below 160 Daltons. Fink, T.; Bruggesser, H.; Reymond, J.-L. Angew. Chem. Int. Ed. 2005, 44, 1504-1508.

To cite GDB-13, please reference:

970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. Blum L. C.; Reymond J.-L. J. Am. Chem. Soc., 2009, 131, 8732-8733.

To cite GDB-17, please reference:

Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Ruddigkeit L.; van Deursen R.; Blum L. C.; Reymond J.-L. J. Chem. Inf. Model., 2012, 52, 2864-2875.

Download

You can download the databases and their subsets using the links provided. All molecules are stored in dearomatized, canonicalized SMILES format and compressed as tar/gz archives (Windows users can download 7-Zip to open the archives).

GDB-17
GDB-17-Set (50 million) GDB17.50000000.smi.gz 314 MB
Lead-like Set (100-350 MW & 1-3 clogP)(11 million) GDB17.50000000LL.smi.gz 75 MB
Lead-like Set (100-350 MW & 1-3 clogP) without small rings (3-4 ring atoms)(0.8 million) GDB17.50000000LLnoSR.smi.gz 55 MB

GDB-13
Entire GDB-13 (including all C/N/O/Cl/S molecules) gdb13.tgz 2.6 GB
GDB-13 Subsets (The sum of all the subsets below corresponds to the entire GDB-13 above)
Graph subset (saturated hydrocarbons) gdb13.g.tgz 1.1 MB
Skeleton subset (unsaturated hydrocarbons) gdb13.sk.tgz 14 MB
Only carbon & nitrogen containing molecules gdb13.cn.tgz 443 MB
Only carbon & oxygen containing molecules gdb13.co.tgz 299 MB
Only carbon & nitrogen & oxygen containing molecules gdb13.cno.tgz 1.8 GB
Chlorine & sulphur containing molecules gdb13.cls.tgz 189 MB

GDB-13 Subsets (For details please refer to the Table 2 in J Comput Aided Mol Des 2011 25:637 to 647)
GDB-13 Subset AB (~635 Millions) AB.smi.gz 2.4 GB
GDB-13 Subset ABC (~441 Millions) ABC.smi.gz 1.7 GB
GDB-13 Subset ABCD (~277 Millions) ABCD.smi.gz 1.1 GB
GDB-13 Subset ABCDE (~140 Millions) ABCDE.smi.gz 565 MB
GDB-13 Subset ABCDEF (~43 Millions) ABCDEF.smi.gz 171 MB
GDB-13 Subset ABCDEFG (~13 Millions) ABCDEFG.smi.gz 50 MB
GDB-13 Subset ABCDEFGH (~1.4 Millions) ABCDEFGH.smi.gz 6.2 MB
GDB-13 Random Sample. Annotated with frequency and log-likelihood (Please refer to Exploring the GDB-13 chemical space using deep generative models)
GDB-13 Random Sample (1 Million) gdb13.1M.freq.ll.smi.gz 14.8 MB

FDB-17
FDB-17 FDB-17-fragmentset.smi.gz 62.2 MB

GDB4c
GDB4c (SMILES) GDB4c.smi.gz 6.2 MB
GDB4c3D (SMILES) GDB4c3D.smi.gz 161 MB
GDB4c3D (SDF) GDB4c3D.sdf.tar.gz 2 GB

Other
GDBMedChem (SMILES) GDBMedChem.smi 276 MB
GDBChEMBL (SMILES) GDBChEMBL.smi 353.6 MB
GDB-13 random selection (1 million) gdb13.rand1M.smi.gz 7.2 MB
Fragment-like subset (Rule of three) gdb13.frl.tgz 1.2 GB
Dark matter universe up to 9 heavy atoms dmu9.tgz 87 MB

GDB-11
Entire GDB-11 (including all C/N/O/F molecules) gdb11.tgz 122 MB
Fragrance Like Subsets: For details please refer to Ruddigkeit et al. Journal of Cheminformatics 2014, 6:27
FragranceDB (SuperScent + Flavornet) FragranceDB.smi 56 KB
TasteDB (SuperSweet + BitterDB) TasteDB.smi 44 KB
FragranceDB.FL (Fragrance-like subset of FragranceDB) FragranceDB.FL.smi 32 KB
ChEMBL.FL (Fragrance-like subset of ChEMBL) ChEMBL.FL.smi 452 KB
PubChem.FL (Fragrance-like subset of PubChem) PubChem.FL.smi 20 MB
ZINC.FL (Fragrance-like subset of ZINC) ZINC.FL.smi 1.3 MB
GDB-13.FL (Fragrance-like subset of GDB-13) GDB-13.FL.smi.gz 165 MB

Terms and conditions: The GDB databases may be downloaded free of charge. In published research involving GDB, cite the appropriate references mentioned above. GDB must not be used as part of or in patents. GDB and large portions thereof must not be redistributed without the express written permission of Jean-Louis Reymond.
Data from: A large comprehensive curated dataset of small molecules and...
zenodo.org
repository.uantwerpen.be
bin, png
Updated Jul 11, 2024
+ more versions
Cite
Issar Arab; Kristof Egghe; Kris Laukens; Ke Chen; Khaled Barakat; Wout Bittremieux (2024). A large comprehensive curated dataset of small molecules and their activities covering three cardiac ion channels: hERG, Cav1.2, and Nav1.5 [Dataset]. http://doi.org/10.5281/zenodo.8359714
Explore at:
Available download formats: png, bin
Unique identifier
https://doi.org/10.5281/zenodo.8359714
Dataset updated
Jul 11, 2024
Dataset provided by
Zenodo http://zenodo.org/
Authors
Issar Arab; Kristof Egghe; Kris Laukens; Ke Chen; Khaled Barakat; Wout Bittremieux
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The compressed data folder (dataset.rar) provides a data framework for drug discovery researchers to perform in-depth analyses on a large, open-access, integrated database of small molecules and their activities for hERG, Nav1.5, and Cav1.2 cardiotoxicity. The database is organized as follows:

Each sub-folder represents a cardiac ion channel target: hERG, Nav1.5, and Cav1.2

Each target sub-folder consists of 3 files in CSV format: One file containing the development set (split into training and validation sets using an 80/20 ratio for hyperparameter tuning). The other 2 files contain external evaluation sets. The first test dataset consists of compounds with a structural similarity of no more than 60% (Tanimoto similarity ≤ 0.6) to the remaining development set, while the second test dataset comprises compounds with a structural similarity of no more than 70% (Tanimoto similarity ≤ 0.7) to the remaining development set.
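The similarity cut-offs above can be illustrated with a minimal sketch. It assumes each compound is represented as a set of "on" fingerprint bits and that a compound qualifies for an external test set when its highest Tanimoto similarity to any development compound is at or below the threshold. The function names and the toy fingerprints are hypothetical, and the dataset authors' actual fingerprint type and splitting code are not described here.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0  # convention chosen for this sketch
    return len(a & b) / len(a | b)


def max_similarity(candidate: set, development: list) -> float:
    """Highest similarity of one candidate to any development-set compound."""
    return max((tanimoto(candidate, d) for d in development), default=0.0)


def select_external(candidates: dict, development: list, threshold: float) -> list:
    """Names of candidates whose maximum similarity is <= threshold."""
    return [
        name
        for name, fp in candidates.items()
        if max_similarity(fp, development) <= threshold
    ]


# Toy data (hypothetical bit sets, not real fingerprints)
development = [{1, 2, 3}]
candidates = {
    "near": {1, 2, 3, 4},  # 3/4 = 0.75
    "mid": {1, 2},         # 2/3 ~ 0.667
    "far": {7, 8},         # 0/5 = 0.0
}

test_60 = select_external(candidates, development, 0.6)
test_70 = select_external(candidates, development, 0.7)
```

With these toy inputs, only "far" passes the 0.6 cut-off, while "mid" and "far" both pass the 0.7 cut-off, which shows why the 0.6 set is the stricter of the two.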

Each file contains data with 7 columns: "InChl Key" as a unique identifier of the chemical structure, "SMILES" as the string format of storage and exchange of the chemical structure, "Source" as the upstream data source from which the data was retrieved, "ChEMBL ID" as the ChEMBL identifier if the compound comes from ChEMBL database, "PubChem CID" as the PubChem compound identifier if the compound comes from PubChem database, "pIC50" as the negative logarithm of the half-maximal inhibitory concentration (IC50) to describe the potency of the compound, and "USED_AS" column specifying whether the compound was used for training or validation.
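As a small worked example of the "pIC50" column definition, the sketch below converts between IC50 and pIC50, assuming IC50 is given in nanomolar and pIC50 is computed from the molar concentration. The function names are hypothetical and the units used in the upstream sources are an assumption.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """pIC50 = -log10(IC50 in mol/L); IC50 is supplied in nM (assumed unit)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


def pic50_to_ic50_nm(pic50: float) -> float:
    """Inverse conversion, returning IC50 in nM."""
    return 10.0 ** (9.0 - pic50)


micromolar = ic50_nm_to_pic50(1000.0)  # 1 uM
nanomolar = ic50_nm_to_pic50(1.0)      # 1 nM
```

A higher pIC50 therefore means a more potent compound: each unit corresponds to a tenfold lower IC50.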

Upon usage, please cite this publication:

Issar Arab, Kristof Egghe, Kris Laukens, Ke Chen, Khaled Barakat, Wout Bittremieux, Benchmarking of Small Molecule Feature Representations for hERG, Nav1.5, and Cav1.2 Cardiotoxicity Prediction, Journal of Chemical Information and Modeling, (2023). doi:10.1021/acs.jcim.3c01301
chembl_multiassay_activity
huggingface.co
Updated Jan 20, 2025
Cite
Haoran Jie (2025). chembl_multiassay_activity [Dataset]. https://huggingface.co/datasets/jiahborcn/chembl_multiassay_activity
Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 20, 2025
Authors
Haoran Jie
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
ChEMBL Drug-Target Activity Dataset

This dataset was extracted from the ChEMBL34 database. It is designed for multitask classification of drug-target activities. It links compound structures with activity data for multiple assays, enabling multitask learning experiments in drug discovery. Key features of the dataset include:

Multitask Format

Each assay ID is treated as a separate binary classification task. Binary labels (0 for inactive, 1 for active) and masks (indicating… See the full description on the dataset page: https://huggingface.co/datasets/jiahborcn/chembl_multiassay_activity.
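The description of the masks is truncated above, so the following is only a sketch under an explicit assumption: a mask value of 1 means the compound has a measured label for that assay, and 0 means the label is missing and should be ignored. Under that assumption, a multitask binary cross-entropy averaged over observed labels only could look like this (the function name and toy values are hypothetical):

```python
import math


def masked_bce(probs, labels, masks, eps=1e-7):
    """Mean binary cross-entropy over entries whose mask is 1.

    probs, labels, masks: lists of rows, one row per compound and
    one column per assay (task). Masked-out entries contribute nothing.
    """
    total, count = 0.0, 0
    for p_row, y_row, m_row in zip(probs, labels, masks):
        for p, y, m in zip(p_row, y_row, m_row):
            if not m:
                continue  # assumed meaning: label not measured for this assay
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return total / count if count else 0.0


# One compound, two assays; only the first assay has a label.
loss = masked_bce(probs=[[0.5, 0.9]], labels=[[1, 0]], masks=[[1, 0]])
```

Averaging over observed entries keeps sparsely tested assays from being penalized for missing measurements.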
Generative AI for designing and validating easily synthesizable and...
zenodo.org
data.niaid.nih.gov
bin, zip
Updated Mar 22, 2024
Cite
Kyle Swanson; Gary Liu; Denise Catacutan; Autumn Arnold; James Zou; Jonathan Stokes (2024). Generative AI for designing and validating easily synthesizable and structurally novel antibiotics: Data and Models [Dataset]. http://doi.org/10.5281/zenodo.10257839
Explore at:
Available download formats: zip, bin
Unique identifier
https://doi.org/10.5281/zenodo.10257839
Dataset updated
Mar 22, 2024
Dataset provided by
Zenodo http://zenodo.org/
Authors
Kyle Swanson; Gary Liu; Denise Catacutan; Autumn Arnold; James Zou; Jonathan Stokes
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains data and models used in the following paper.

Swanson, K., Liu, G., Catacutan, D., Zou, J. & Stokes, J. Generative AI for designing and validating easily synthesizable and structurally novel antibiotics. Nature Machine Intelligence, 2024.

The data and models are meant to be used with the SyntheMol code. More details about how to use the data and models with the code are available here.

The Data.zip file has the following structure. Note that the numbers for the Data subdirectories correspond to the supplementary data numbers in the paper (e.g., 1_training_data corresponds to Supplementary Data 1).

Data

1_training_data: The Acinetobacter baumannii inhibition data used to train antibiotic property prediction models.

2_chembl: Known antibiotic and antibacterial molecules from ChEMBL, which are used to compute the novelty of generated antibiotic candidates.

4_real_space: Data files and statistics for the Enamine REAL Space. The molecular building blocks file is version 2021 q3-4 while all other REAL Space details are computed from the full enumerated REAL space version 2022 q1-2 (downloaded on August 30, 2022).

5_generations_clogp: Compounds generated by SyntheMol using Chemprop models trained to predict cLogP.

6_generations_chemprop: Compounds generated by SyntheMol using Chemprop models trained to predict A. baumannii inhibition.

7_generations_chemprop_rdkit: Compounds generated by SyntheMol using Chemprop-RDKit models trained to predict A. baumannii inhibition.

8_generations_random_forest: Compounds generated by SyntheMol using random forest models trained to predict A. baumannii inhibition.

9_synthesized: Information on the 58 SyntheMol-generated compounds that were successfully synthesized by Enamine.

The Models.zip file contains one folder for each model used in the paper. Note that each model is technically an ensemble of ten individual models, so each directory contains ten model files.
ChEMBL
opendatalab.com
zip
Updated May 7, 2024
+ more versions
Cite
(2024). ChEMBL [Dataset]. https://opendatalab.com/OpenDataLab/ChEMBL
Explore at:
Available download formats: zip
Dataset updated
May 7, 2024
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
ChEMBL is a database of bioactive drug-like small molecules. It contains 2-D structures, calculated properties (e.g. logP, Molecular Weight, Lipinski Parameters, etc.) and abstracted bioactivities (e.g. binding constants, pharmacology and ADMET data). The data are abstracted and curated from the primary scientific literature, and cover a significant fraction of the SAR and discovery of modern drugs. We attempt to normalise the bioactivities into a uniform set of end-points and units where possible, and also to tag the links between a molecular target and a published assay with a set of varying confidence levels. Additional data on clinical progress of compounds is being integrated into ChEMBL at the current time.
Text-mined orthosteric and allosteric compound dataset
figshare.com
data.4tu.nl
txt
Updated Jul 28, 2020
Cite
Lindsey Burggraaff (2020). Text-mined orthosteric and allosteric compound dataset [Dataset]. http://doi.org/10.4121/uuid:de9e9805-916f-47e5-8e72-323f675d5d5a
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.4121/uuid:de9e9805-916f-47e5-8e72-323f675d5d5a
Dataset updated
Jul 28, 2020
Dataset provided by
4TU.ResearchData
Authors
Lindsey Burggraaff
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Compound dataset retrieved from the ChEMBL database using text mining. Text-mined annotations of orthosteric and allosteric binding types are included in the dataset.
Data from: AntiBac-Pred: A Web Application for Predicting Antibacterial...
figshare.com
acs.figshare.com
xlsx
Updated Jun 2, 2023
Cite
Pavel V. Pogodin; Alexey A. Lagunin; Anastasia V. Rudik; Dmitry S. Druzhilovskiy; Dmitry A. Filimonov; Vladimir V. Poroikov (2023). AntiBac-Pred: A Web Application for Predicting Antibacterial Activity of Chemical Compounds [Dataset]. http://doi.org/10.1021/acs.jcim.9b00436.s001
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.1021/acs.jcim.9b00436.s001
Dataset updated
Jun 2, 2023
Dataset provided by
ACS Publications
Authors
Pavel V. Pogodin; Alexey A. Lagunin; Anastasia V. Rudik; Dmitry S. Druzhilovskiy; Dmitry A. Filimonov; Vladimir V. Poroikov
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Discovery of new antibacterial agents is a never-ending task of medicinal chemistry. Every new drug brings significant improvement to patients with bacterial infections, but prolonged usage of antibacterials leads to the emergence of resistant strains. Therefore, novel active structures with new modes of action are required. We describe a web application called AntiBac-Pred that aims to help users in the rational selection of chemical compounds for experimental studies of antibacterial activity. This application was developed using antibacterial activity data available in ChEMBL and PASS software. It allows users to classify chemical structures of interest into growth inhibitors or noninhibitors of 353 different bacterial strains, including both resistant and nonresistant ones.
Data from: Benchmarking the Predictive Power of Ligand Efficiency Indices in...
acs.figshare.com
zip
Updated May 30, 2023
Cite
Isidro Cortes-Ciriano (2023). Benchmarking the Predictive Power of Ligand Efficiency Indices in QSAR [Dataset]. http://doi.org/10.1021/acs.jcim.6b00136.s001
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.1021/acs.jcim.6b00136.s001
Dataset updated
May 30, 2023
Dataset provided by
ACS Publications
Authors
Isidro Cortes-Ciriano
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Compound physicochemical properties favoring in vitro potency are not always correlated to desirable pharmacokinetic profiles. Therefore, using potency (i.e., IC50) as the main criterion to prioritize candidate drugs at early stage drug discovery campaigns has been questioned. Yet, the vast majority of the virtual screening models reported in the medicinal chemistry literature predict the biological activity of compounds by regressing in vitro potency on topological or physicochemical descriptors. Two studies published in this journal showed that higher predictive power on external molecules can be achieved by using ligand efficiency indices as the dependent variable instead of a metric of potency (IC50) or binding affinity (Ki). The present study aims at filling the shortage of a thorough assessment of the predictive power of ligand efficiency indices in QSAR. To this aim, the predictive power of 11 ligand efficiency indices has been benchmarked across four algorithms (Gradient Boosting Machines, Partial Least Squares, Random Forest, and Support Vector Machines), two descriptor types (Morgan fingerprints and physicochemical descriptors), and 29 data sets collected from the literature and the ChEMBL database. Ligand efficiency metrics led to the highest predictive power on external molecules irrespective of the descriptor type or algorithm used, with an R2test difference of ∼0.3 units (∼0.4 units when modeling small data sets) and a normalized RMSE decrease of >0.1 units in some cases. Polarity indices, such as SEI and NSEI, led to higher predictive power than metrics based on molecular size, i.e., BEI, NBEI, and LE. LELP, which comprises a polarity factor (cLogP) and a size parameter (LE), consistently led to the most predictive models, suggesting that these two properties convey a complementary predictive signal.
Overall, this study suggests that using ligand efficiency indices as the dependent variable might be an efficient strategy to model compound activity.
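For readers unfamiliar with the indices named in this abstract, the sketch below computes four of them using definitions that are common in the literature. These formulas are an assumption here and may differ in detail from the ones used in the study; the function names and the example compound are hypothetical.

```python
def ligand_efficiency(pic50: float, heavy_atoms: int) -> float:
    """LE ~ 1.37 * pIC50 / heavy-atom count (kcal/mol per heavy atom, ~300 K)."""
    return 1.37 * pic50 / heavy_atoms


def binding_efficiency_index(pic50: float, mol_weight: float) -> float:
    """BEI = pIC50 / (molecular weight in kDa)."""
    return pic50 / (mol_weight / 1000.0)


def surface_efficiency_index(pic50: float, psa: float) -> float:
    """SEI = pIC50 / (polar surface area / 100 A^2)."""
    return pic50 / (psa / 100.0)


def lelp(clogp: float, le: float) -> float:
    """LELP = cLogP / LE, combining a polarity term with a size term."""
    return clogp / le


# Hypothetical compound: pIC50 8, 25 heavy atoms, MW 400, PSA 80 A^2, cLogP 3
le = ligand_efficiency(8.0, 25)
bei = binding_efficiency_index(8.0, 400.0)
sei = surface_efficiency_index(8.0, 80.0)
lelp_value = lelp(3.0, le)
```

In a QSAR setting of the kind described above, one of these indices would replace pIC50 as the dependent variable of the regression.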
Data from: Discovering Highly Potent Molecules from an Initial Set of...
acs.figshare.com
figshare.com
zip
Updated May 30, 2023
Cite
Isidro Cortés-Ciriano; Nicholas C. Firth; Andreas Bender; Oliver Watson (2023). Discovering Highly Potent Molecules from an Initial Set of Inactives Using Iterative Screening [Dataset]. http://doi.org/10.1021/acs.jcim.8b00376.s003
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.1021/acs.jcim.8b00376.s003
Dataset updated
May 30, 2023
Dataset provided by
ACS Publications
Authors
Isidro Cortés-Ciriano; Nicholas C. Firth; Andreas Bender; Oliver Watson
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The versatility of similarity searching and quantitative structure–activity relationships to model the activity of compound sets within given bioactivity ranges (i.e., interpolation) is well established. However, their relative performance in the common scenario in early stage drug discovery where lots of inactive data but no active data points are available (i.e., extrapolation from the low-activity to the high-activity range) has not been thoroughly examined yet. To this aim, we have designed an iterative virtual screening strategy which was evaluated on 25 diverse bioactivity data sets from ChEMBL. We benchmark the efficiency of random forest (RF), multiple linear regression, ridge regression, similarity searching, and random selection of compounds to identify a highly active molecule in the test set among a large number of low-potency compounds. We use the number of iterations required to find this active molecule to evaluate the performance of each experimental setup. We show that linear and ridge regression often outperform RF and similarity searching, reducing the number of iterations to find an active compound by a factor of 2 or more. Even simple regression methods seem better able to extrapolate to high-bioactivity ranges than RF, which only provides output values in the range covered by the training set. In addition, examination of the scaffold diversity in the data sets used shows that in some cases similarity searching and RF require two times as many iterations as random selection depending on the chemical space covered in the initial training data. Lastly, we show using bioactivity data for COX-1 and COX-2 that our framework can be extended to multitarget drug discovery, where compounds are selected by concomitantly considering their activity against multiple targets. 
Overall, this study provides an approach for iterative screening where only inactive data are present in early stages of drug discovery in order to discover highly potent compounds, and identifies the best experimental setup in which to do so.
Data from: PoseidonQ: A Free Machine Learning Platform for the Development,...
acs.figshare.com
figshare.com
xlsx
Updated Apr 9, 2025
Cite
Muzammil Kabier; Nicola Gambacorta; Fulvio Ciriaco; Fabrizio Mastrolorito; Sunil Kumar; Bijo Mathew; Orazio Nicolotti (2025). PoseidonQ: A Free Machine Learning Platform for the Development, Analysis, and Validation of Efficient and Portable QSAR Models for Drug Discovery [Dataset]. http://doi.org/10.1021/acs.jcim.4c02372.s006
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.1021/acs.jcim.4c02372.s006
Dataset updated
Apr 9, 2025
Dataset provided by
ACS Publications
Authors
Muzammil Kabier; Nicola Gambacorta; Fulvio Ciriaco; Fabrizio Mastrolorito; Sunil Kumar; Bijo Mathew; Orazio Nicolotti
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The advent of powerful machine learning algorithms as well as the availability of high volumes of pharmacological data has given new fuel to QSAR, opening unprecedented options for deriving highly predictive models for assisting the rational design of new bioactive compounds, for screening and prioritizing large molecular libraries, and for repurposing drugs toward new clinical uses. Here, we present PoseidonQ (an acronym for Personal Optimization Software for Efficient Implementation and Derivation of Online QSAR), a user-friendly software solution designed to simplify the derivation of QSAR models for drug design and discovery. PoseidonQ incorporates 22 machine learning algorithms, 17 types of molecular fingerprints, and 208 RDKit molecular descriptors and enables the quick derivation of both regression and classification models along with a calculated and easily interpretable applicability domain. Importantly, the platform is automatically linked to the latest version of the ChEMBL database, thus providing streamlined access to large amounts of curated bioactivity data. The user is also given the option of gathering high-quality experimental data based on customizable filtering settings. Notably, PoseidonQ facilitates the deployment of trained QSAR models as web-based applications through seamless integration with Streamlit Cloud and GitHub, empowering users to share, refine, and integrate models effortlessly. The translation of QSAR models into web-based applications makes them freely accessible, portable, and ready for screening large volumes of new data without limits. By unifying data preparation, model generation, and deployment into an intuitive workflow, PoseidonQ makes advanced QSAR modeling for drug design and discovery accessible to a wide audience of researchers irrespective of their skill levels.
PoseidonQ bridges the gap between complex machine learning techniques and practical drug discovery applications, enhancing the efficiency, collaboration, and adoption of QSAR approaches in modern drug discovery programs. PoseidonQ is available for Windows and Linux (Ubuntu 22.04) operating systems and can be downloaded for free at https://github.com/Muzatheking12/PoseidonQ.
Data from: Annotation of Allosteric Compounds to Enhance Bioactivity...
figshare.com
zip
Updated Jun 2, 2023
Cite
Lindsey Burggraaff; Amber van Veen; Chi Chung Lam; Herman W. T. van Vlijmen; Adriaan P. IJzerman; Gerard J. P. van Westen (2023). Annotation of Allosteric Compounds to Enhance Bioactivity Modeling for Class A GPCRs [Dataset]. http://doi.org/10.1021/acs.jcim.0c00695.s003
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.1021/acs.jcim.0c00695.s003
Dataset updated
Jun 2, 2023
Dataset provided by
ACS Publications
Authors
Lindsey Burggraaff; Amber van Veen; Chi Chung Lam; Herman W. T. van Vlijmen; Adriaan P. IJzerman; Gerard J. P. van Westen
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Proteins often have both orthosteric and allosteric binding sites. Endogenous ligands, such as hormones and neurotransmitters, bind to the orthosteric site, while synthetic ligands may bind to orthosteric or allosteric sites, which has become a focal point in drug discovery. Usually, such allosteric modulators bind to a protein noncompetitively with its endogenous ligand or substrate. The growing interest in allosteric modulators has resulted in a substantial increase of these entities and their features, such as binding data, in chemical libraries and databases. Although this data surge fuels research focused on allosteric modulators, binding data is unfortunately not always clearly indicated as being allosteric or orthosteric. Therefore, allosteric binding data is difficult to retrieve from databases that contain a mixture of allosteric and orthosteric compounds. This decreases model performance when statistical methods, such as machine learning models, are applied. In previous work we generated an allosteric data subset of ChEMBL release 14. In the current study an improved text mining approach is used to retrieve the allosteric and orthosteric binding types from the literature in ChEMBL release 22. Moreover, convolutional deep neural networks were constructed to predict the binding types of compounds for class A G protein-coupled receptors (GPCRs). Temporal split validation showed a model predictiveness of Matthews correlation coefficient (MCC) = 0.54, sensitivity allosteric = 0.54, and sensitivity orthosteric = 0.94. Finally, this study shows that accurate binding types improve binding predictions when they are included as a descriptor (MCC = 0.27 improved to MCC = 0.34; validated for class A GPCRs, trained on all GPCRs). Although the focus of this study is mainly on class A GPCRs, binding types for all protein classes in ChEMBL were obtained and explored.
The data set is included as a supplement to this study, allowing the reader to select the compounds and binding types of interest.
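For readers unfamiliar with the metrics quoted in this description, the following is a minimal sketch of how a Matthews correlation coefficient and a per-class sensitivity are computed from a binary confusion matrix. It assumes allosteric is treated as the positive class and orthosteric as the negative class; the function names and the counts are hypothetical and are not taken from the study.

```python
import math


def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of a class that the model recovers: TP / (TP + FN)."""
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal sum is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for illustration only (allosteric = positive class).
tp, fn = 40, 10   # allosteric compounds predicted correctly / incorrectly
tn, fp = 45, 5    # orthosteric compounds predicted correctly / incorrectly

print("sensitivity allosteric :", sensitivity(tp, fn))
print("sensitivity orthosteric:", sensitivity(tn, fp))
print("MCC                    :", mcc(tp, tn, fp, fn))
```

Because MCC uses all four cells of the confusion matrix, it stays informative when one class is much rarer than the other, which is a reason it is often reported alongside per-class sensitivities.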
Data from: Perturbation-Theory Machine Learning (PTML) Multilabel Model of...
acs.figshare.com
figshare.com
xlsx
Updated Jun 1, 2023
Cite
Alejandro Cabrera-Andrade; Andrés López-Cortés; Cristian R. Munteanu; Alejandro Pazos; Yunierkis Pérez-Castillo; Eduardo Tejera; Sonia Arrasate; Humbert González-Díaz (2023). Perturbation-Theory Machine Learning (PTML) Multilabel Model of the ChEMBL Dataset of Preclinical Assays for Antisarcoma Compounds [Dataset]. http://doi.org/10.1021/acsomega.0c03356.s001
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.1021/acsomega.0c03356.s001
Dataset updated
Jun 1, 2023
Dataset provided by
ACS Publications
Authors
Alejandro Cabrera-Andrade; Andrés López-Cortés; Cristian R. Munteanu; Alejandro Pazos; Yunierkis Pérez-Castillo; Eduardo Tejera; Sonia Arrasate; Humbert González-Díaz
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Sarcomas are a group of malignant neoplasms of connective tissue with a different etiology than carcinomas. The efforts to discover new drugs with antisarcoma activity have generated large datasets of multiple preclinical assays with different experimental conditions. For instance, the ChEMBL database contains outcomes of 37,919 different antisarcoma assays with 34,955 different chemical compounds. Furthermore, the experimental conditions reported in this dataset include 157 types of biological activity parameters, 36 drug targets, 43 cell lines, and 17 assay organisms. Considering this information, we propose combining perturbation theory (PT) principles with machine learning (ML) to develop a PTML model to predict antisarcoma compounds. PTML models use one function of reference that measures the probability of a drug being active under certain conditions (protein, cell line, organism, etc.). In this paper, we used linear discriminant analysis and a neural network to train and compare PT and non-PT models. All the explored models have an accuracy of 89.19–95.25% in training and 89.22–95.46% in validation sets. PTML-based strategies have similar accuracy but generate simpler models. Therefore, they may become a versatile tool for predicting antisarcoma compounds.
Data from: IFPTML Multi-Output Model for Anti-Retroviral Compounds Including...
acs.figshare.com
figshare.com
xlsx
Updated Apr 28, 2025
Cite
Emilia Vásquez-Domínguez; Shan He; Carlos Santolaria; Sonia Arrasate; Humbert González-Díaz (2025). IFPTML Multi-Output Model for Anti-Retroviral Compounds Including the Drug Structure and Target Protein Sequence Information [Dataset]. http://doi.org/10.1021/acs.jcim.5c00242.s001
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.1021/acs.jcim.5c00242.s001
Dataset updated
Apr 28, 2025
Dataset provided by
ACS Publications
Authors
Emilia Vásquez-Domínguez; Shan He; Carlos Santolaria; Sonia Arrasate; Humbert González-Díaz
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Retroviruses such as HIV cause significant diseases in humans and other organisms, making the discovery of antiretroviral (ARV) drugs a critical priority. While databases like ChEMBL contain valuable information, their complexity poses challenges. The data set includes approximately >140,000 assays across eight viruses, encompassing 350 biological activity parameters, >50 target proteins, >80 cell lines, 60 assay organisms, and >770 viral strains. Artificial Intelligence/Machine Learning (AI/ML) models offer a promising approach to accelerate ARV discovery. Recently, we developed AI/ML models for ChEMBL ARV data using the Information Fusion Perturbation Theory and Machine Learning (IFPTML) strategy. However, neither existing AI/ML models nor our prior IFPTML implementation simultaneously incorporates viral protein sequences, strains, cell lines, assay organisms, or virus/human mutations. This limitation renders them ineffective for predicting activity against amino acid sequence variations (e.g., mutations, variants, or emerging strains), a critical shortcoming given the well-documented prevalence of drug-resistance mutations in marketed ARVs. In this work, we present an enhanced IFPTML model integrating protein sequence descriptors. We computed and incorporated sequence descriptors for all drug target proteins in ChEMBL, derived from proteomes of retroviruses (HIV, FeLV, MMV, SIV, etc.). The model demonstrated robust performance, with sensitivity (Sn), specificity (Sp), and accuracy (Ac) values ranging between 72.0 and 88.0% in both training and validation phases. We analyze its predictions for protein mutations documented in ChEMBL and other literature sources. To our knowledge, this represents the first unified multicondition, multioutput model for ARV discovery that systematically accounts for protein sequence information.
Chemically Standardized Dataset of 512 Kinases for Statistical Modeling
figshare.com
data.4tu.nl
application/gzip
Updated Jul 28, 2020
Cite
Lindsey Burggraaff (2020). Chemically Standardized Dataset of 512 Kinases for Statistical Modeling [Dataset]. http://doi.org/10.4121/uuid:6af1d9de-281f-4221-b7e1-e7c01b90dfe0
Explore at:
Available download formats: application/gzip
Unique identifier
https://doi.org/10.4121/uuid:6af1d9de-281f-4221-b7e1-e7c01b90dfe0
Dataset updated
Jul 28, 2020
Dataset provided by
4TU.ResearchData
Authors
Lindsey Burggraaff
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Compound dataset consisting of structures and bioactivity data (classes) for 512 kinases. Chemical structures are available as InChIKey and bioactivity data as either active (pChEMBL >= 6.5) or inactive (pChEMBL < 6.5) (the meaning of the pChEMBL value can be found on: https://www.ebi.ac.uk/chembl/). The compound structures are chemically standardised by neutralising charges, removing salts, and keeping the largest fragment. The dataset was used in training and validation of statistical models (QSAR and PCM).
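The active/inactive rule in this description can be sketched as follows. The 6.5 cutoff comes from the description itself; the helper names are hypothetical, and the conversion assumes the usual reading of pChEMBL as the negative base-10 logarithm of a molar potency value (for example IC50 or Ki), so that a measurement in nanomolar maps to 9 - log10(value).

```python
import math

# Cutoff stated in the dataset description: active if pChEMBL >= 6.5.
ACTIVITY_THRESHOLD = 6.5


def pchembl_from_nanomolar(value_nm: float) -> float:
    """Convert a potency in nM to a pChEMBL-style value.

    Assumes pChEMBL = -log10(molar potency); 1 nM = 1e-9 M.
    """
    return 9.0 - math.log10(value_nm)


def label_activity(pchembl: float) -> str:
    """Apply the dataset's binary labelling rule."""
    return "active" if pchembl >= ACTIVITY_THRESHOLD else "inactive"


# Hypothetical measurements in nM, for illustration only.
for nm in (10.0, 1000.0):
    value = pchembl_from_nanomolar(nm)
    print(f"{nm:>7} nM -> pChEMBL {value:.2f} -> {label_activity(value)}")
```

Under this assumption the 6.5 cutoff corresponds to a potency of roughly 316 nM, so more potent compounds (lower concentrations) fall in the active class.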
Data from: Combining Machine Learning and Molecular Dynamics to Predict...
acs.figshare.com
zip
Updated Jun 8, 2023
Cite
Carmen Esposito; Shuzhe Wang; Udo E. W. Lange; Frank Oellien; Sereina Riniker (2023). Combining Machine Learning and Molecular Dynamics to Predict P‑Glycoprotein Substrates [Dataset]. http://doi.org/10.1021/acs.jcim.0c00525.s002
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.1021/acs.jcim.0c00525.s002
Dataset updated
Jun 8, 2023
Dataset provided by
ACS Publications
Authors
Carmen Esposito; Shuzhe Wang; Udo E. W. Lange; Frank Oellien; Sereina Riniker
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The efflux transporter P-glycoprotein (P-gp) is responsible for the extrusion of a wide variety of molecules, including drug molecules, from the cell. Therefore, P-gp-mediated efflux transport limits the bioavailability of drugs. To identify potential P-gp substrates early in the drug discovery process, in silico models have been developed based on structural and physicochemical descriptors. In this study, we investigate the use of molecular dynamics fingerprints (MDFPs) as an orthogonal descriptor for the training of machine learning (ML) models to classify small molecules into substrates and nonsubstrates of P-gp. MDFPs encode the information from short MD simulations of the molecules in different environments (water, membrane, or protein pocket). The performance of the MDFPs, evaluated on both an in-house dataset (3930 compounds) and a public dataset from ChEMBL (1114 compounds), is compared to that of commonly used 2D molecular descriptors, including structure-based and property-based descriptors. We find that all tested classifiers interpolate well, achieving high accuracy on chemically diverse subsets. However, by challenging the models with external validation and prospective analysis, we show that only tree-based ML models trained on MDFPs or property-based descriptors generalize well to regions of the chemical space not covered by the training set.
Table displaying IC50 values in μM of promising compound activity against P....
plos.figshare.com
xls
Updated Nov 25, 2024
Cite
Thato Matlhodi; Lisema Patrick Makatsela; Tendamudzimu Harmfree Dongola; Mthokozisi Blessing Cedric Simelane; Addmore Shonhai; Njabulo Joyfull Gumede; Fortunate Mokoena (2024). Table displaying IC50 values in μM of promising compound activity against P. falciparum cells, as well as cytotoxicity towards human cells and respective selectivity indices. [Dataset]. http://doi.org/10.1371/journal.pone.0308969.t003
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0308969.t003
Dataset updated
Nov 25, 2024
Dataset provided by
PLOS ONE
Authors
Thato Matlhodi; Lisema Patrick Makatsela; Tendamudzimu Harmfree Dongola; Mthokozisi Blessing Cedric Simelane; Addmore Shonhai; Njabulo Joyfull Gumede; Fortunate Mokoena
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Table displaying IC50 values in μM of promising compound activity against P. falciparum cells, as well as cytotoxicity towards human cells and respective selectivity indices.

Cite

(2022). ChEMBL [Dataset]. http://identifiers.org/RRID:SCR_014042

ChEMBL

RRID:SCR_014042, ChEMBL (RRID:SCR_014042), ChEMBLdb, Chembl, ChEMBL Database

Explore at:

16 scholarly articles cite this dataset (View in Google Scholar)

Unique identifier

https://identifiers.org/RRID:SCR_014042

Dataset updated

Jan 29, 2022

Description

Collection of bioactive drug-like small molecules that contains 2D structures, calculated properties and abstracted bioactivities. Used for drug discovery and chemical biology research. Clinical progress of new compounds is continuously integrated into the database.

