This dataset provides a curated collection of approved drug Simplified Molecular Input Line Entry System (SMILES) strings and their associated protein sequences. Each small molecule has been approved by at least one regulatory body, ensuring the safety and relevance of the data for computational applications. The dataset includes 1,660 approved small molecules and their 2,093 related protein targets.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SMILES strings provided for the set of compounds
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
PubChem Canonical SMILES and Titles Pair Classification
This dataset contains pairs of canonical SMILES strings and their corresponding entity titles, with labels indicating whether they refer to the same chemical entity. A label of 1 means the SMILES string and the title correspond to the same entity, while a label of 0 indicates they do not. The dataset is sourced from PubChem (ChEBI source), and it provides valuable information for tasks involving chemical entity matching.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
source codes
Dataset of organic molecules encoded as SMILES strings, with 18,322,500 records collected from the PubChem database.
List of characters included in the dataset:
Description | SMILES Characters |
---|---|
Atoms | "C", "O", "N", "P", "S", "F", "Cl", "Br", "I", "Si", "B" |
Branches | "(", ")" |
Rings | "1", "2", "3", "4", "5", "6", "7", "8", "9" |
Bonds | "=", "#" |
Ions | "+", "-" |
Stereochemistry | "/", "\" |
Miscellaneous | "[", "]" |
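The character table above can be read as a small token vocabulary. The following is a minimal sketch, not code shipped with the dataset, of a tokenizer that accepts only the listed tokens; the names `SMILES_TOKEN` and `tokenize` are hypothetical. Because the table lists neither lowercase aromatic atoms nor hydrogen, the sketch rejects them by design.

```python
import re

# Token classes taken from the character table above. Two-letter atoms come
# first in the alternation so that "Cl" is not split into "C" followed by "l".
SMILES_TOKEN = re.compile(r"Cl|Br|Si|[CONPSFIB]|[()]|[1-9]|[=#]|[+-]|[/\\]|[\[\]]")


def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into tokens; raise on anything outside the table."""
    tokens = []
    pos = 0
    while pos < len(smiles):
        match = SMILES_TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(
                f"unsupported character {smiles[pos]!r} at position {pos}"
            )
        tokens.append(match.group())
        pos = match.end()
    return tokens


print(tokenize("CC(=O)Cl"))  # ['C', 'C', '(', '=', 'O', ')', 'Cl']
```

A strict tokenizer of this kind is one way to check whether a record stays inside the stated vocabulary before it is fed to a character-level model.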
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Functional groups are widely used in organic chemistry because they provide a rationale for analyzing physicochemical and reactivity properties. In medicinal chemistry, they are the basis for analyzing ligand–biomacromolecule interactions. Ertl’s algorithm extracts functional groups from arbitrary organic molecules without depending on predefined libraries of functional groups. However, the widely used RDKit cheminformatics toolkit lacks a complete and accurate implementation of Ertl’s algorithm. This paper describes a new RDKit/Python implementation of the algorithm that is both accurate and complete. For an RDKit molecule, it provides (i) a PNG binary string with an image of the molecule with color-highlighted functional groups; (ii) a list of sets of atom indices (idx), each set corresponding to a functional group; (iii) a list of pseudo-SMILES canonicalized strings for the full functional groups; and (iv) a list of RDKit labeled mol objects, one for each full functional group. The code is freely available at https://github.com/bbu-imdea/efgs and is part of the RDKit Contrib directory (https://github.com/rdkit/rdkit/tree/master/Contrib/efgs).
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
PubChem Isomeric SMILES and Titles Bitext Mining
This dataset contains two separate lists: one of isomeric SMILES strings and the other of corresponding entity titles, both sourced from PubChem (ChEBI source). The task is to identify matching pairs between the SMILES strings and the titles, where each SMILES string from the first list should be aligned with its corresponding entity title from the second list. The dataset is intended for bitext mining tasks, where the goal is to… See the full description on the dataset page: https://huggingface.co/datasets/BASF-AI/PubChemSMILESIsoTitleBM.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary Files for Nature Communications publication titled "Diversity-oriented synthesis encoded by deoxyoligonucleotides". Two files contain the Iodo-library and Bromo-library SMILES strings.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
PubChem Isomeric SMILES and Titles Pair Classification
This dataset contains pairs of isomeric SMILES strings and their corresponding entity titles, with labels indicating whether they refer to the same chemical entity. A label of 1 means the SMILES string and the title correspond to the same entity, while a label of 0 indicates they do not. The dataset is sourced from PubChem (ChEBI source), and it provides valuable information for tasks involving chemical entity matching.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Table including the SMILES strings for the compounds listed in the paper
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository comprises a dataset of ~2 million unique compounds saved in an hdf5 small molecule library store, which includes the following fields for each molecule:
The repository also includes a Jupyter notebook that serves as an initial guide for querying the small molecule library store. Save both files to the same folder, make sure approximately 40 GB of disk space is available, unzip the library store, and then launch the notebook to begin querying.
Upon usage, please cite this publication:
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Up to now, publicly available data sets to build and evaluate Ames mutagenicity prediction tools have been very limited in terms of size and chemical space covered. In this report we describe a new unique public Ames mutagenicity data set comprising about 6500 nonconfidential compounds (available as SMILES strings and SDF) together with their biological activity. Three commercial tools (DEREK, MultiCASE, and an off-the-shelf Bayesian machine learner in Pipeline Pilot) are compared with four noncommercial machine learning implementations (Support Vector Machines, Random Forests, k-Nearest Neighbors, and Gaussian Processes) on the new benchmark data set.
The database of protein-chemical structural interactions includes all existing 3D structures of complexes of proteins with low molecular weight ligands. When the proteins and chemicals are treated as the vertices of a graph, these interactions form a network. Biological networks are powerful tools for predicting undocumented relationships between molecules. The underlying principle is that existing interactions between molecules can be used to predict new interactions. For pairs of proteins sharing a common ligand, we use protein and chemical superimpositions combined with fast structural compatibility screens to predict whether additional compounds bound by one protein would bind the other. The current version includes data from the Protein Data Bank as of August 2011. The database is updated monthly.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ChemTastesDB is a database that includes curated information on 2944 molecular tastants. ChemTastesDB constitutes a useful tool for the scientific community to expand the information on molecular tastants, which could assist in the analysis of the relationships between molecular structure and taste, as well as in silico (QSAR) studies for taste prediction by means of diverse machine learning approaches.
Molecules are assigned to one of the five basic tastes (sweet, bitter, umami, sour and salty) or to other classes related to non-basic tastes (tasteless, non-sweet, multitaste and miscellaneous). ChemTastesDB provides the following information for each molecule: name, PubChem CID, CAS registry number, canonical SMILES string, taste class and the reference to the scientific sources from which the data were retrieved. Moreover, the molecular structure of each chemical is provided in the HyperChem (.hin) format.
This is version 1.2 of the ChemTastesDB.
What's new in version 1.2:
Chemical information (for instance, name, PubChem CID or CAS number) for some tastants has been included.
The database is freeware and may be used if proper reference is given to the authors. Preferably refer to the following paper:
Rojas, C., Ballabio, D., Pacheco Sarmiento, K., Pacheco Jaramillo, E., Mendoza, M., & García, F. (2022). ChemTastesDB: A curated database of molecular tastants. Food Chemistry: Molecular Sciences, 4, 100090. https://doi.org/10.1016/j.fochms.2022.100090.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Information
The diverse publicly available compound/bioactivity databases constitute a key resource for data-driven applications in chemogenomics and drug design. Analysis of their coverage of compound entries and biological targets revealed considerable differences, however, suggesting the benefit of a consensus dataset. Therefore, we have combined and curated information from five esteemed databases (ChEMBL, PubChem, BindingDB, IUPHAR/BPS and Probes&Drugs) to assemble a consensus compound/bioactivity dataset comprising 1144803 compounds with 10915362 bioactivities on 5613 targets (including defined macromolecular targets as well as cell-lines and phenotypic readouts). It also provides simplified information on the assay types underlying the bioactivity data and on bioactivity confidence, obtained by comparing data from different sources. We have unified the source databases, brought them into a common format and combined them, which eases generic use in multiple applications such as chemogenomics and data-driven drug design.
The consensus dataset provides increased target coverage and contains a higher number of molecules compared to the source databases, which is also evident from a larger number of scaffolds. These features render the consensus dataset a valuable tool for machine learning and other data-driven applications in (de novo) drug design and bioactivity prediction. The increased chemical and bioactivity coverage of the consensus dataset may improve the robustness of such models compared to the single source databases. In addition, semi-automated structure and bioactivity annotation checks, with flags for divergent data from different sources, may help data selection and further accurate curation.
Structure and content of the dataset
ChEMBL ID | PubChem ID | IUPHAR ID | Target | Activity type | Assay type | Unit | Mean C (0) | ... | Mean PC (0) | ... | Mean B (0) | ... | Mean I (0) | ... | Mean PD (0) | ... | Activity check annotation | Ligand names | Canonical SMILES C | ... | Structure check | Source |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
The dataset was created using the Konstanz Information Miner (KNIME) (https://www.knime.com/) and was exported as a CSV-file and a compressed CSV-file.
Except for the canonical SMILES columns, all columns are filled with the datatype ‘string’. The datatype for the canonical SMILES columns is the smiles-format. We recommend the File Reader node for using the dataset in KNIME. With the help of this node the data types of the columns can be adjusted exactly. In addition, only this node can read the compressed format.
Column content:
This repository contains a list of 1947 Iridium complexes, including their geometry (xyz files) and energy barriers for the hydrogen splitting reaction.
Data
Coordinates in .xyz format
- data/coordinates_complex: coordinates of all complexes without additional hydrogen
- data/coordinates_TS: coordinates of all complexes with an additional hydrogen molecule in transition state
- data/coordinates_molSimplify: coordinates of all complexes generated with molSimplify developed by the Kulik group
Properties in .csv format (data/vaskas_features_properties_smiles_filenames.csv)
- "smiles": SMILES strings of all molecules
- "filename": corresponding xyz filename
- "barrier": DFT computed energy barrier [kcal/mol] for the transition state of the hydrogen splitting reaction
- "distance": DFT computed H-H distance in the transition state geometry
- "chi-X", "Z-X", "T-X", "I-X" and "S-X": (auto)correlation features described in our paper
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the data, both input and produced, for the paper "Predicting compound activity from phenotypic profiles and chemical structures".
This data can be merged with the paper's GitHub repository for reproduction.
Folders and files are described below:
├── assay_data
    ├── assay_matrix_discrete_270_assays.csv: Assay matrix with hits for assays (270) and compounds (16170). Note that this is the final file that we used to produce splits.
    ├── assay_metadata.csv: Assay metadata
    ├── broad_ids.txt: List of broad ids used in this study. That is an unfiltered list of compounds required by some analysis scripts.
    ├── smiles.txt: Same as broad_ids.txt, but SMILES strings.
├── feature_data (for 16978 compounds, can be masked with ./misc/compounds16978to16170.npy)
    ├── cp.npz: Classical chemical features
    ├── ge.npz: Gene expression features
    ├── ge_scale.npz: Gene expression scaled features
    ├── mo.npz: Morphology features (not batch corrected)
    ├── mobc.npz: Morphology features (batch corrected)
├── misc
    ├── compound_analysis.npz: Compounds in the dataset identified as PAINS
    ├── compounds16978to16170.npy: Used to filter features from the bigger set of compounds to the final one
    ├── fingerprints.npz: Calculated fingerprints of compounds, those were then used to calculate similarity
    ├── similarity_fingerprints.npz: Similarity matrix for compounds (16978)
    ├── population_normalized.csv.gz: Well-level morphological profiles that were used for batch-correction
    ├── Table for PUMA: Excel file with additional data and plots
├── predictions
    ├── scaffold_median(mean)_AUC.csv: Aggregated median(mean) AUC scores over scaffold-based cross-validation splits. In the paper, median results were reported.
    ├── scaffold_median(mean)_EF.csv: Aggregated median(mean) enrichment factor (EF) over scaffold-based cross-validation splits. In the paper, median results were reported.
    ├── toprank_chemical_cv{}_hitsnorm.csv: Those files are needed to create enrichment plots and contain hit rate and top rank hit rate.
    ├── Each folder here stands for an experiment type, the number in the folder name is a number of the split. Inside each folder there are the following elements:
        ├── predictions: Folder with predictions for each assay-compound pair for each modality
        ├── 2022_01_evaluation_all_data.csv: File with AUC scores for each assay for the test set in the split
        ├── 2022_01_evaluation_all_data_EF.csv: File with enrichment factor (EF) values for each assay for the test set in the split. Those files exist only for chemical folders.
        ├── assay_matrix_discrete_train(test)_old_scaff.csv: Training and test subsets of data for the split. The first column contains broad_id.
        ├── assay_matrix_discrete_train(test)_old_scaff.csv: Same, but SMILES strings in the first column. Those files are used as input to ChemProp!
Experiments in this folder are the following:
- chemical Scaffold-based 5-fold cross-validation splits, the main results in the paper are reported with this series of experiments.
- chemical_bal Same splits as in chemical, but training was run with ChemProp built-in data balancing.
- chemical_st Same splits as in chemical, but separate models were trained for each assay.
- CV Random 5-fold cross-validation splits.
- GE 5-fold cross-validation splits based on same-size clustering of gene expression features.
- MOBC 5-fold cross-validation splits based on same-size clustering of batch-corrected morphology features.
- random 10 random splits, ~80% of compounds in the training set and the rest in the test set.
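As an illustration of the "random" series, the sketch below shows one way to draw ten random splits with roughly 80% of compounds in the training set. It is an assumption-laden example, not the authors' code: the function name `random_split`, the seeds, and the placeholder compound IDs are all hypothetical, and the actual split indices used in the paper are the ones stored in the splitting folder.

```python
import random


def random_split(compound_ids, train_fraction=0.8, seed=0):
    """Shuffle compound IDs and cut them into train and test lists."""
    ids = list(compound_ids)
    rng = random.Random(seed)  # seeded so the split is reproducible
    rng.shuffle(ids)
    cut = round(train_fraction * len(ids))
    return ids[:cut], ids[cut:]


# Placeholder IDs; real runs would use the broad ids from assay_data.
compounds = [f"compound_{i}" for i in range(100)]

# Ten splits with different seeds, mirroring the "random" experiment series.
splits = [random_split(compounds, seed=s) for s in range(10)]
train, test = splits[0]
print(len(train), len(test))  # 80 20
```

Unlike the scaffold-based or cluster-based splits in the same list, a random split can place structurally similar compounds on both sides of the cut.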
├── splitting: This folder contains numpy files which help to match compounds and features to create training and test sets for a split, which can be reused in the analysis notebook for data preparation.
    ├── scaffold_based_split.npz: Splitting for scaffold-based splits.
    ├── random_split_{}.npz: Random split indices of test set compounds (10 files).
    ├── cross_validation_indicies.npz: Indices for random cross-validation splits.
    ├── GE_clusters_size_constrained.npz: Indices of clusters of same-size clustering for gene-expression features.
    ├── MOBC_clusters_size_constrained.npz: Indices of clusters of same-size clustering for batch-corrected morphology features.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 1. smith.tsv (WLN:SMILES strings from Elbert Smith's encoding manual).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes 182,127 SMILES strings generated by 5 generations of oxidation using the GECKO-A model for alpha-pinene, decane, and toluene under typical continental atmospheric conditions. For each compound, physicochemical parameters (vapor pressure, Henry's law constant, and gas phase reaction rate constant with the hydroxyl radical) are estimated using several structure-activity relationships. Compounds are flagged according to the oxidation systems in which they exceed a threshold of 0.1% of the total modeled mass of their given molecular formula. Descriptions of this dataset and the parameter estimation are provided in Isaacman-VanWertz and Aumont, "Impact of organic molecular structure on the estimation of atmospherically relevant physicochemical parameters", Atmospheric Chemistry and Physics. The subset of 38,594 compounds used in the core analyses of that work is also flagged.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
CoconutDB SMILES to Formula Bitext Mining
This dataset consists of two lists: one containing both isomeric and canonical SMILES strings, and the other containing the corresponding molecular formulas of chemical entities, sourced from CoconutDB. The primary task is to identify matching pairs between the SMILES strings and their molecular formulas. Each SMILES string from the first list should be accurately aligned with its corresponding molecular formula from the second list.
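Bitext mining of this kind is commonly approached by embedding both lists and aligning each item with its nearest neighbor. The sketch below assumes that embedding vectors for the SMILES strings and the formulas are already available from some model; the functions `cosine`, `mine_pairs` and `accuracy` and the toy vectors are hypothetical illustrations and are not part of the dataset.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_pairs(source_vecs, target_vecs):
    """For each source vector, return the index of the most similar target."""
    return [
        max(range(len(target_vecs)), key=lambda j: cosine(src, target_vecs[j]))
        for src in source_vecs
    ]


def accuracy(predicted, gold):
    """Fraction of source items aligned with their correct target index."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Toy vectors standing in for model embeddings (assumption, not real data).
smiles_vecs = [[1.0, 0.0], [0.0, 1.0]]
formula_vecs = [[0.1, 0.9], [0.8, 0.2]]

predicted = mine_pairs(smiles_vecs, formula_vecs)
print(predicted, accuracy(predicted, [1, 0]))  # [1, 0] 1.0
```

Nearest-neighbor retrieval does not enforce a one-to-one alignment, so two SMILES strings may be matched to the same formula; a stricter evaluation would need an assignment step.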