Collection of bioactive drug-like small molecules that contains 2D structures, calculated properties and abstracted bioactivities. Used for drug discovery and chemical biology research. Clinical progress of new compounds is continuously integrated into the database.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The SARS-CoV-2 3CLpro protein is one of the key therapeutic targets of interest for COVID-19 due to its critical role in viral replication, the availability of various high-quality protein crystal structures, and its use as a basis for computationally screening for compounds with improved inhibitory activity, bioavailability, and ADMETox properties. The ChEMBL and PubChem databases contain experimental data from screening small molecules against SARS-CoV-2 3CLpro, which expands the opportunity to learn patterns and design a computational model that can predict the potency of any drug compound against coronavirus before in-vitro and in-vivo testing. In this study, utilizing several descriptors, we evaluated 27 machine learning classifiers. We also developed a neural network model that can correctly identify bioactive and inactive chemicals with 91% accuracy on ChEMBL data and 93% accuracy on combined ChEMBL and PubChem data. The F1-score for inactive and active compounds was 93% and 94%, respectively. SHAP (SHapley Additive exPlanations) was applied to an XGB classifier to find important fingerprints from the PaDEL descriptors for this task. The results indicated that the PaDEL descriptors were effective in predicting bioactivity, the proposed neural network design was efficient, and the explanatory factor through SHAP correctly identified the important fingerprints. In addition, we validated the effectiveness of our proposed model using a large dataset encompassing over 100,000 molecules. This research employed various molecular descriptors to discover the optimal one for this task. To evaluate the effectiveness of these possible medications against SARS-CoV-2, more in-vitro and in-vivo research is required.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Discovery of new pharmaceutical substances is currently boosted by the availability of the Synthetically Accessible Virtual Inventory (SAVI) library, which includes about 283 million molecules, each annotated with a proposed one-step synthetic route from commercially available starting materials. The SAVI database is well-suited for ligand-based methods of virtual screening to select molecules for experimental testing. In this study, we compare the performance of three approaches for the analysis of structure-activity relationships that differ in their criteria for selecting the “active” and “inactive” compounds included in the training sets. PASS (Prediction of Activity Spectra for Substances), which is based on a modified Naïve Bayes algorithm, was applied since it had been shown to be robust and to provide good predictions of many biological activities based on just the structural formula of a compound even if the information in the training set is incomplete. We used different subsets of kinase inhibitors for this case study because many data are currently available on this important class of drug-like molecules. Based on the subsets of kinase inhibitors extracted from the ChEMBL 20 database we performed the PASS training, and then applied the model to ChEMBL 23 compounds not yet present in ChEMBL 20 to identify novel kinase inhibitors. As one may expect, the best prediction accuracy was obtained if only the experimentally confirmed active and inactive compounds for distinct kinases were used in the training procedure. However, for some kinases, reasonable results were obtained even if we used merged training sets, in which we designated as inactives the compounds not tested against the particular kinase. Thus, depending on the availability of data for a particular biological activity, one may choose the first or the second approach for creating ligand-based computational tools to achieve the best possible results in virtual screening.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Compound descriptors for compounds retrieved from the ChEMBL database. The descriptors annotate the compounds chemically and by physicochemical properties. Intended for use in cheminformatics.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
About
GDB-11 enumerates small organic molecules up to 11 atoms of C, N, O and F following simple chemical stability and synthetic feasibility rules.
GDB-13 enumerates small organic molecules up to 13 atoms of C, N, O, S and Cl following simple chemical stability and synthetic feasibility rules. With 977 468 314 structures, GDB-13 is the largest publicly available small organic molecule database to date.
How to cite
To cite GDB-11, please reference:
Virtual exploration of the chemical universe up to 11 atoms of C, N, O, F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physico-chemical properties, compound classes and drug discovery. Fink, T.; Reymond, J.-L. J. Chem. Inf. Model. 2007, 47, 342-353.
Virtual Exploration of the Small Molecule Chemical Universe below 160 Daltons. Fink, T.; Bruggesser, H.; Reymond, J.-L. Angew. Chem. Int. Ed. 2005, 44, 1504-1508.
To cite GDB-13, please reference:
970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. Blum, L. C.; Reymond, J.-L. J. Am. Chem. Soc. 2009, 131, 8732-8733.
To cite GDB-17, please reference:
Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Ruddigkeit, L.; van Deursen, R.; Blum, L. C.; Reymond, J.-L. J. Chem. Inf. Model. 2012, 52, 2864-2875.
Download
You can download the databases and their subsets using the links provided. All molecules are stored in dearomatized, canonicalized SMILES format and compressed as tar/gz archives (for Windows users: download 7-zip to open the archives).
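As a rough illustration of working with the gzip-compressed SMILES files listed below, the following sketch streams one file without unpacking it to disk. It assumes one molecule per line with the SMILES string as the first whitespace-separated field; that layout and the helper name `iter_smiles` are assumptions for illustration, not something the download page specifies.

```python
import gzip


def iter_smiles(path):
    """Yield the first whitespace-separated token of each non-empty line.

    Assumption: the file is a gzip-compressed text file with one SMILES
    string per line, optionally followed by extra columns.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines
                yield fields[0]


if __name__ == "__main__":
    # Hypothetical usage: count molecules in a downloaded subset.
    # The file name is taken from the list below; adjust the path as needed.
    n = sum(1 for _ in iter_smiles("GDB17.50000000LL.smi.gz"))
    print(n)
```

Streaming line by line matters here because several of the archives are in the gigabyte range. The `.tgz` archives are tar archives and would need Python's `tarfile` module (or an external tool) rather than `gzip` alone.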
GDB-17
GDB-17-Set (50 million) GDB17.50000000.smi.gz 314 MB
Lead-like Set (100-350 MW & 1-3 clogP)(11 million) GDB17.50000000LL.smi.gz 75 MB
Lead-like Set (100-350 MW & 1-3 clogP) without small rings (3-4 ring atoms)(0.8 million) GDB17.50000000LLnoSR.smi.gz 55 MB
GDB-13
Entire GDB-13 (including all C/N/O/Cl/S molecules) gdb13.tgz 2.6 GB
GDB-13 Subsets (The sum of all the subsets below corresponds to the entire GDB-13 above)
Graph subset (saturated hydrocarbons) gdb13.g.tgz 1.1 MB
Skeleton subset (unsaturated hydrocarbons) gdb13.sk.tgz 14 MB
Only carbon & nitrogen containing molecules gdb13.cn.tgz 443 MB
Only carbon & oxygen containing molecules gdb13.co.tgz 299 MB
Only carbon & nitrogen & oxygen containing molecules gdb13.cno.tgz 1.8 GB
Chlorine & sulphur containing molecules gdb13.cls.tgz 189 MB
GDB-13 Subsets (For details please refer to the Table 2 in J Comput Aided Mol Des 2011 25:637 to 647)
GDB-13 Subset AB (~635 Millions) AB.smi.gz 2.4 GB
GDB-13 Subset ABC (~441 Millions) ABC.smi.gz 1.7 GB
GDB-13 Subset ABCD (~277 Millions) ABCD.smi.gz 1.1 GB
GDB-13 Subset ABCDE (~140 Millions) ABCDE.smi.gz 565 MB
GDB-13 Subset ABCDEF (~43 Millions) ABCDEF.smi.gz 171 MB
GDB-13 Subset ABCDEFG (~13 Millions) ABCDEFG.smi.gz 50 MB
GDB-13 Subset ABCDEFGH (~1.4 Millions) ABCDEFGH.smi.gz 6.2 MB
GDB-13 Random Sample. Annotated with frequency and log-likelihood (Please refer to Exploring the GDB-13 chemical space using deep generative models)
GDB-13 Random Sample (1 Million) gdb13.1M.freq.ll.smi.gz 14.8 MB
FDB-17
FDB-17 FDB-17-fragmentset.smi.gz 62.2 MB
GDB4c
GDB4c (SMILES) GDB4c.smi.gz 6.2 MB
GDB4c3D (SMILES) GDB4c3D.smi.gz 161 MB
GDB4c3D (SDF) GDB4c3D.sdf.tar.gz 2 GB
Other
GDBMedChem (SMILES) GDBMedChem.smi 276 MB
GDBChEMBL (SMILES) GDBChEMBL.smi 353.6 MB
GDB-13 random selection (1 million) gdb13.rand1M.smi.gz 7.2 MB
Fragment-like subset (Rule of three) gdb13.frl.tgz 1.2 GB
Dark matter universe up to 9 heavy atoms dmu9.tgz 87 MB
GDB-11
Entire GDB-11 (including all C/N/O/F molecules) gdb11.tgz 122 MB
Fragrance Like Subsets: For details please refer to Ruddigkeit et al. Journal of Cheminformatics 2014, 6:27
FragranceDB (SuperScent + Flavornet) FragranceDB.smi 56 KB
TasteDB (SuperSweet + BitterDB) TasteDB.smi 44 KB
FragranceDB.FL (Fragrance-like subset of FragranceDB) FragranceDB.FL.smi 32 KB
ChEMBL.FL (Fragrance-like subset of ChEMBL) ChEMBL.FL.smi 452 KB
PubChem.FL (Fragrance-like subset of PubChem) PubChem.FL.smi 20 MB
ZINC.FL (Fragrance-like subset of ZINC) ZINC.FL.smi 1.3 MB
GDB-13.FL (Fragrance-like subset of GDB-13) GDB-13.FL.smi.gz 165 MB
Terms and conditions: The GDB databases may be downloaded free of charge. In published research involving GDB, cite the appropriate references mentioned above. GDB must not be used as part of or in patents. GDB and large portions thereof must not be redistributed without the express written permission of Jean-Louis Reymond.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The compressed data folder (dataset.rar) provides a data framework for researchers in the field of drug discovery to perform in-depth analyses on a very large, open-access, unique and comprehensive integrated database of small molecules and their activities against the cardiotoxicity targets hERG, Nav1.5, and Cav1.2. The database is organized as follows:
Upon usage, please cite this publication:
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
ChEMBL Drug-Target Activity Dataset
This dataset was extracted from the ChEMBL34 database. It is designed for multitask classification of drug-target activities. It links compound structures with activity data for multiple assays, enabling multitask learning experiments in drug discovery. Key features of the dataset include:
Multitask Format
Each assay ID is treated as a separate binary classification task. Binary labels (0 for inactive, 1 for active) and masks (indicating… See the full description on the dataset page: https://huggingface.co/datasets/jiahborcn/chembl_multiassay_activity.
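The description above is truncated, so the exact role of the masks is not stated here. A common convention in multitask setups, assumed in the sketch below, is that a mask of 1 marks an assay for which the compound has a measured label and 0 marks a missing one, so that only measured entries contribute to the loss. The function name `masked_bce` is hypothetical.

```python
import math


def masked_bce(probs, labels, masks, eps=1e-7):
    """Mean binary cross-entropy over entries whose mask is non-zero.

    Assumption: masks[i] == 1 means labels[i] was actually measured;
    masks[i] == 0 means the label is missing and must be ignored.
    """
    total, count = 0.0, 0
    for p, y, m in zip(probs, labels, masks):
        if not m:
            continue  # unmeasured task: contributes nothing
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        count += 1
    return total / count if count else 0.0


# One compound, two assays; only the first assay has a measured label.
print(masked_bce([0.5, 0.9], [1, 0], [1, 0]))
```

Refer to the dataset page for the actual mask semantics before relying on this convention.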
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains data and models used in the following paper.
Swanson, K., Liu, G., Catacutan, D., Zou, J. & Stokes, J. Generative AI for designing and validating easily synthesizable and structurally novel antibiotics. Nature Machine Intelligence, 2024.
The data and models are meant to be used with the SyntheMol code. More details about how to use the data and models with the code are available here.
The Data.zip file has the following structure. Note that the numbers for the Data subdirectories correspond to the supplementary data numbers in the paper (e.g., 1_training_data corresponds to Supplementary Data 1).
Data
1_training_data: The Acinetobacter baumannii inhibition data used to train antibiotic property prediction models.
2_chembl: Known antibiotic and antibacterial molecules from ChEMBL, which are used to compute the novelty of generated antibiotic candidates.
4_real_space: Data files and statistics for the Enamine REAL Space. The molecular building blocks file is version 2021 q3-4 while all other REAL Space details are computed from the full enumerated REAL space version 2022 q1-2 (downloaded on August 30, 2022).
5_generations_clogp: Compounds generated by SyntheMol using Chemprop models trained to predict cLogP.
6_generations_chemprop: Compounds generated by SyntheMol using Chemprop models trained to predict A. baumannii inhibition.
7_generations_chemprop_rdkit: Compounds generated by SyntheMol using Chemprop-RDKit models trained to predict A. baumannii inhibition.
8_generations_random_forest: Compounds generated by SyntheMol using random forest models trained to predict A. baumannii inhibition.
9_synthesized: Information on the 58 SyntheMol-generated compounds that were successfully synthesized by Enamine.
The Models.zip file contains one folder for each model used in the paper. Note that each model is technically an ensemble of ten individual models, so each directory contains ten model files.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
ChEMBL is a database of bioactive drug-like small molecules. It contains 2-D structures, calculated properties (e.g. logP, molecular weight, Lipinski parameters, etc.) and abstracted bioactivities (e.g. binding constants, pharmacology and ADMET data). The data are abstracted and curated from the primary scientific literature, and cover a significant fraction of the SAR and discovery of modern drugs. We attempt to normalise the bioactivities into a uniform set of end-points and units where possible, and also to tag the links between a molecular target and a published assay with a set of varying confidence levels. Additional data on clinical progress of compounds is being integrated into ChEMBL at the current time.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Compound dataset retrieved from the ChEMBL database using text mining. Text-mined annotations of orthosteric and allosteric binding types are included in the dataset.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Discovery of new antibacterial agents is a never-ending task of medicinal chemistry. Every new drug brings significant improvement to patients with bacterial infections, but prolonged usage of antibacterials leads to the emergence of resistant strains. Therefore, novel active structures with new modes of action are required. We describe a web application called AntiBac-Pred designed to help users in the rational selection of chemical compounds for experimental studies of antibacterial activity. This application was developed using antibacterial activity data available in ChEMBL and the PASS software. It allows users to classify chemical structures of interest as growth inhibitors or noninhibitors of 353 different bacterial strains, including both resistant and nonresistant ones.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Compound physicochemical properties favoring in vitro potency are not always correlated to desirable pharmacokinetic profiles. Therefore, using potency (i.e., IC50) as the main criterion to prioritize candidate drugs at early stage drug discovery campaigns has been questioned. Yet, the vast majority of the virtual screening models reported in the medicinal chemistry literature predict the biological activity of compounds by regressing in vitro potency on topological or physicochemical descriptors. Two studies published in this journal showed that higher predictive power on external molecules can be achieved by using ligand efficiency indices as the dependent variable instead of a metric of potency (IC50) or binding affinity (Ki). The present study aims at filling the shortage of a thorough assessment of the predictive power of ligand efficiency indices in QSAR. To this aim, the predictive power of 11 ligand efficiency indices has been benchmarked across four algorithms (Gradient Boosting Machines, Partial Least Squares, Random Forest, and Support Vector Machines), two descriptor types (Morgan fingerprints and physicochemical descriptors), and 29 data sets collected from the literature and the ChEMBL database. Ligand efficiency metrics led to the highest predictive power on external molecules irrespective of the descriptor type or algorithm used, with a test-set R2 difference of ∼0.3 units, rising to ∼0.4 units when modeling small data sets, and a normalized RMSE decrease of >0.1 units in some cases. Polarity indices, such as SEI and NSEI, led to higher predictive power than metrics based on molecular size, i.e., BEI, NBEI, and LE. LELP, which comprises a polarity factor (cLogP) and a size parameter (LE), consistently led to the most predictive models, suggesting that these two properties convey a complementary predictive signal.
Overall, this study suggests that using ligand efficiency indices as the dependent variable might be an efficient strategy to model compound activity.
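The abstract names several ligand efficiency indices without giving their formulas. The sketch below uses definitions that are commonly quoted in the literature (BEI as potency per kDa of molecular weight, SEI as potency per 100 Å² of polar surface area, LE as roughly 1.37 × pActivity per heavy atom, LELP as cLogP divided by LE). These are assumptions for illustration; the study itself may use different variants or scalings, and the function name is hypothetical.

```python
def ligand_efficiency_indices(p_activity, mol_weight, heavy_atoms, psa, clogp):
    """Compute a few commonly used ligand efficiency indices.

    p_activity  : -log10 of the molar potency (e.g. pIC50 or pKi)
    mol_weight  : molecular weight in Da
    heavy_atoms : number of non-hydrogen atoms
    psa         : polar surface area in square angstroms
    clogp       : calculated logP

    Assumption: textbook-style definitions, not taken from the source.
    """
    le = 1.37 * p_activity / heavy_atoms  # approx. kcal/mol per heavy atom
    return {
        "BEI": p_activity / (mol_weight / 1000.0),  # per kDa
        "SEI": p_activity / (psa / 100.0),          # per 100 A^2 of PSA
        "LE": le,
        "LELP": clogp / le,
    }


# Illustrative compound (made-up property values).
print(ligand_efficiency_indices(7.0, 350.0, 25, 70.0, 3.0))
```

In a QSAR setting, one of these indices would replace pIC50 or pKi as the regression target, which is the substitution the study evaluates.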
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The versatility of similarity searching and quantitative structure–activity relationships to model the activity of compound sets within given bioactivity ranges (i.e., interpolation) is well established. However, their relative performance in the common scenario in early stage drug discovery where lots of inactive data but no active data points are available (i.e., extrapolation from the low-activity to the high-activity range) has not been thoroughly examined yet. To this aim, we have designed an iterative virtual screening strategy which was evaluated on 25 diverse bioactivity data sets from ChEMBL. We benchmark the efficiency of random forest (RF), multiple linear regression, ridge regression, similarity searching, and random selection of compounds to identify a highly active molecule in the test set among a large number of low-potency compounds. We use the number of iterations required to find this active molecule to evaluate the performance of each experimental setup. We show that linear and ridge regression often outperform RF and similarity searching, reducing the number of iterations to find an active compound by a factor of 2 or more. Even simple regression methods seem better able to extrapolate to high-bioactivity ranges than RF, which only provides output values in the range covered by the training set. In addition, examination of the scaffold diversity in the data sets used shows that in some cases similarity searching and RF require two times as many iterations as random selection depending on the chemical space covered in the initial training data. Lastly, we show using bioactivity data for COX-1 and COX-2 that our framework can be extended to multitarget drug discovery, where compounds are selected by concomitantly considering their activity against multiple targets. 
Overall, this study provides an approach for iterative screening where only inactive data are present in early stages of drug discovery in order to discover highly potent compounds and the best experimental set up in which to do so.
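A minimal sketch of such an iterative screening loop is given below, using closed-form ridge regression in NumPy. It is not the authors' implementation: the one-compound-per-iteration batch size, the penalised intercept, and all function names are assumptions made to keep the example short.

```python
import numpy as np


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression; a constant column models the intercept.

    Simplification: the intercept is penalised along with the other weights.
    """
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.linalg.solve(Xb.T @ Xb + alpha * np.eye(Xb.shape[1]), Xb.T @ y)


def predict(w, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ w


def iterations_to_hit(X_train, y_train, X_pool, y_pool, threshold, alpha=1.0):
    """Count iterations until a pool compound with activity >= threshold is picked.

    Each iteration: refit on everything tested so far, "test" the pool
    compound with the highest predicted activity, then add it to the
    training set. Returns None if the pool is exhausted without a hit.
    """
    X_train, y_train = np.asarray(X_train, float), np.asarray(y_train, float)
    X_pool, y_pool = np.asarray(X_pool, float), np.asarray(y_pool, float)
    remaining = list(range(len(X_pool)))
    for iteration in range(1, len(X_pool) + 1):
        w = fit_ridge(X_train, y_train, alpha)
        scores = predict(w, X_pool[remaining])
        pick = remaining.pop(int(np.argmax(scores)))  # best-scoring untested compound
        if y_pool[pick] >= threshold:
            return iteration
        X_train = np.vstack([X_train, X_pool[pick]])
        y_train = np.append(y_train, y_pool[pick])
    return None


# Toy data: activity grows linearly with a single descriptor, and the
# training set only covers the low-activity range.
X_lo, y_lo = np.arange(10).reshape(-1, 1), np.arange(10.0)
X_hi, y_hi = np.arange(10, 20).reshape(-1, 1), np.arange(10.0, 20.0)
print(iterations_to_hit(X_lo, y_lo, X_hi, y_hi, threshold=19))
```

On this toy data the linear model extrapolates beyond the training range, which is the behaviour the abstract contrasts with random forest, whose predictions stay within the range of training values.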
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The advent of powerful machine learning algorithms as well as the availability of a high volume of pharmacological data has given new fuel to QSAR, opening unprecedented options for deriving highly predictive models for assisting the rational design of new bioactive compounds, for screening and prioritizing large molecular libraries, and for repurposing drugs toward new clinical uses. Here, we present PoseidonQ (an acronym for Personal Optimization Software for Efficient Implementation and Derivation of Online QSAR), a user-friendly software solution designed to simplify the derivation of QSAR models for drug design and discovery. PoseidonQ incorporates 22 machine learning algorithms, 17 types of molecular fingerprints, and 208 RDKit molecular descriptors and enables the quick derivation of both regression and classification models along with a calculated and easily interpretable applicability domain. Importantly, the platform is automatically linked to the latest version of the ChEMBL database, thus providing streamlined access to large amounts of curated bioactivity data. The user is also given the option of gathering high-quality experimental data based on customizable filtering settings. Notably, PoseidonQ facilitates the deployment of trained QSAR models as web-based applications through seamless integration with Streamlit Cloud and GitHub, empowering users to share, refine, and integrate models effortlessly. The translation of QSAR models into web-based applications makes them freely accessible, portable, and ready for screening large volumes of new data without limits. By unifying data preparation, model generation, and deployment into an intuitive workflow, PoseidonQ makes advanced QSAR modeling for drug design and discovery accessible to a wide audience of researchers irrespective of their skill levels.
PoseidonQ bridges the gap between complex machine learning techniques and practical drug discovery applications, enhancing the efficiency, collaboration, and adoption of QSAR approaches in modern drug discovery programs. PoseidonQ is available for Windows and Linux (Ubuntu 22.04) operating systems and can be downloaded for free at https://github.com/Muzatheking12/PoseidonQ.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Proteins often have both orthosteric and allosteric binding sites. Endogenous ligands, such as hormones and neurotransmitters, bind to the orthosteric site, while synthetic ligands may bind to orthosteric or allosteric sites, which has become a focal point in drug discovery. Usually, such allosteric modulators bind to a protein noncompetitively with its endogenous ligand or substrate. The growing interest in allosteric modulators has resulted in a substantial increase of these entities and their features such as binding data in chemical libraries and databases. Although this data surge fuels research focused on allosteric modulators, binding data is unfortunately not always clearly indicated as being allosteric or orthosteric. Therefore, allosteric binding data is difficult to retrieve from databases that contain a mixture of allosteric and orthosteric compounds. This decreases model performance when statistical methods, such as machine learning models, are applied. In previous work we generated an allosteric data subset of ChEMBL release 14. In the current study an improved text mining approach is used to retrieve the allosteric and orthosteric binding types from the literature in ChEMBL release 22. Moreover, convolutional deep neural networks were constructed to predict the binding types of compounds for class A G protein-coupled receptors (GPCRs). Temporal split validation showed the model predictiveness with Matthews correlation coefficient (MCC) = 0.54, sensitivity allosteric = 0.54, and sensitivity orthosteric = 0.94. Finally, this study shows that the inclusion of accurate binding types increases binding predictions by including them as descriptor (MCC = 0.27 improved to MCC = 0.34; validated for class A GPCRs, trained on all GPCRs). Although the focus of this study is mainly on class A GPCRs, binding types for all protein classes in ChEMBL were obtained and explored. 
The data set is included as a supplement to this study, allowing the reader to select the compounds and binding types of interest.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Sarcomas are a group of malignant neoplasms of connective tissue with a different etiology than carcinomas. The efforts to discover new drugs with antisarcoma activity have generated large datasets of multiple preclinical assays with different experimental conditions. For instance, the ChEMBL database contains outcomes of 37,919 different antisarcoma assays with 34,955 different chemical compounds. Furthermore, the experimental conditions reported in this dataset include 157 types of biological activity parameters, 36 drug targets, 43 cell lines, and 17 assay organisms. Considering this information, we propose combining perturbation theory (PT) principles with machine learning (ML) to develop a PTML model to predict antisarcoma compounds. PTML models use one function of reference that measures the probability of a drug being active under certain conditions (protein, cell line, organism, etc.). In this paper, we used linear discriminant analysis and a neural network to train and compare PT and non-PT models. All the explored models have an accuracy of 89.19–95.25% in training and 89.22–95.46% in validation sets. PTML-based strategies have similar accuracy but generate simpler models. Therefore, they may become a versatile tool for predicting antisarcoma compounds.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Retroviruses such as HIV cause significant diseases in humans and other organisms, making the discovery of antiretroviral (ARV) drugs a critical priority. While databases like ChEMBL contain valuable information, their complexity poses challenges. The data set includes >140,000 assays across eight viruses, encompassing 350 biological activity parameters, >50 target proteins, >80 cell lines, 60 assay organisms, and >770 viral strains. Artificial Intelligence/Machine Learning (AI/ML) models offer a promising approach to accelerate ARV discovery. Recently, we developed AI/ML models for ChEMBL ARV data using the Information Fusion Perturbation Theory and Machine Learning (IFPTML) strategy. However, neither existing AI/ML models nor our prior IFPTML implementation simultaneously incorporates viral protein sequences, strains, cell lines, assay organisms, or virus/human mutations. This limitation renders them ineffective for predicting activity against amino acid sequence variations (e.g., mutations, variants, or emerging strains), a critical shortcoming given the well-documented prevalence of drug-resistance mutations in marketed ARVs. In this work, we present an enhanced IFPTML model integrating protein sequence descriptors. We computed and incorporated sequence descriptors for all drug target proteins in ChEMBL, derived from proteomes of retroviruses (HIV, FeLV, MMV, SIV, etc.). The model demonstrated robust performance, with sensitivity (Sn), specificity (Sp), and accuracy (Ac) values ranging between 72.0 and 88.0% in both training and validation phases. We analyze its predictions for protein mutations documented in ChEMBL and other literature sources. To our knowledge, this represents the first unified multicondition, multioutput model for ARV discovery that systematically accounts for protein sequence information.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Compound dataset consisting of structures and bioactivity data (classes) for 512 kinases. Chemical structures are available as InChIKey and bioactivity data as either active (pChEMBL >= 6.5) or inactive (pChEMBL < 6.5) (the meaning of the pChEMBL value can be found on: https://www.ebi.ac.uk/chembl/). The compound structures are chemically standardised by neutralising charges, removing salts, and keeping the largest fragment. The dataset was used in training and validation of statistical models (QSAR and PCM).
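The activity labelling described above can be sketched in a few lines. pChEMBL is the negative base-10 logarithm of a molar potency value (IC50, Ki, EC50 and similar), so a threshold of 6.5 corresponds to roughly 316 nM. The function names and the nanomolar input convention are assumptions for illustration.

```python
import math


def pchembl_from_nm(value_nm):
    """Convert a potency in nanomolar to a -log10(molar) value.

    Since 1 nM = 1e-9 M, -log10(value_nm * 1e-9) = 9 - log10(value_nm).
    """
    return 9.0 - math.log10(value_nm)


def activity_class(pchembl, threshold=6.5):
    """Label a compound using the cut-off stated in the dataset description."""
    return "active" if pchembl >= threshold else "inactive"


print(activity_class(pchembl_from_nm(100.0)))   # 100 nM -> pChEMBL 7.0
print(activity_class(pchembl_from_nm(1000.0)))  # 1 uM   -> pChEMBL 6.0
```

Note that the boundary value 6.5 itself counts as active, matching the ">=" in the description.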
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The efflux transporter P-glycoprotein (P-gp) is responsible for the extrusion of a wide variety of molecules, including drug molecules, from the cell. Therefore, P-gp-mediated efflux transport limits the bioavailability of drugs. To identify potential P-gp substrates early in the drug discovery process, in silico models have been developed based on structural and physicochemical descriptors. In this study, we investigate the use of molecular dynamics fingerprints (MDFPs) as an orthogonal descriptor for the training of machine learning (ML) models to classify small molecules into substrates and nonsubstrates of P-gp. MDFPs encode the information from short MD simulations of the molecules in different environments (water, membrane, or protein pocket). The performance of the MDFPs, evaluated on both an in-house dataset (3930 compounds) and a public dataset from ChEMBL (1114 compounds), is compared to that of commonly used 2D molecular descriptors, including structure-based and property-based descriptors. We find that all tested classifiers interpolate well, achieving high accuracy on chemically diverse subsets. However, by challenging the models with external validation and prospective analysis, we show that only tree-based ML models trained on MDFPs or property-based descriptors generalize well to regions of the chemical space not covered by the training set.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Table displaying IC50 values in μM of promising compound activity against P. falciparum cells, as well as cytotoxicity towards human cells and respective selectivity indices.