67 datasets found

approved_drug_target Dataset
paperswithcode.com
Updated Nov 19, 2024
Cite
Mahsa Sheikholeslami; Navid Mazrouei; Yousof Gheisari; Afshin Fasihi; Matin Irajpour; Ali Motahharynia (2024). approved_drug_target Dataset [Dataset]. https://paperswithcode.com/dataset/approved-drug-target
Dataset updated
Nov 19, 2024
Authors
Mahsa Sheikholeslami; Navid Mazrouei; Yousof Gheisari; Afshin Fasihi; Matin Irajpour; Ali Motahharynia
Description
This dataset provides a curated collection of approved drug Simplified Molecular Input Line Entry System (SMILES) strings and their associated protein sequences. Each small molecule has been approved by at least one regulatory body, ensuring the safety and relevance of the data for computational applications. The dataset includes 1,660 approved small molecules and their 2,093 related protein targets.
SMILES strings for compounds
search.datacite.org
figshare.com
Updated Jan 19, 2016
Cite
Rakesh Baboo (2016). SMILES strings for compounds [Dataset]. http://doi.org/10.6084/m9.figshare.1412636.v1
Unique identifier
https://doi.org/10.6084/m9.figshare.1412636.v1
Dataset updated
Jan 19, 2016
Dataset provided by
DataCite (https://www.datacite.org/)
Figshare (http://figshare.com/)
Authors
Rakesh Baboo
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SMILES strings provided for the set of compounds
PubChemSMILESCanonTitlePC
huggingface.co
Updated Jan 3, 2025
Cite
BASF (2025). PubChemSMILESCanonTitlePC [Dataset]. https://huggingface.co/datasets/BASF-AI/PubChemSMILESCanonTitlePC
Dataset updated
Jan 3, 2025
Dataset authored and provided by
BASF (http://basf.com/)
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
PubChem Canonical SMILES and Titles Pair Classification

This dataset contains pairs of canonical SMILES strings and their corresponding entity titles, with labels indicating whether they refer to the same chemical entity. A label of 1 means the SMILES string and the title correspond to the same entity, while a label of 0 indicates they do not. The dataset is sourced from PubChem (ChEBI source), and it provides valuable information for tasks involving chemical entity matching.
Data from: How does Transformer model evolve to learn diverse chemical...
springernature.figshare.com
zip
Updated Feb 17, 2024
Cite
Tadahaya Mizuno (2024). How does Transformer model evolve to learn diverse chemical structures? [Dataset]. http://doi.org/10.6084/m9.figshare.22736528.v1
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.22736528.v1
Dataset updated
Feb 17, 2024
Dataset provided by
figshare
Authors
Tadahaya Mizuno
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
source codes

SMILES-18 dataset

zenodo.org
data.niaid.nih.gov

Updated May 31, 2023

Cite

Ignacio Pérez-Correa (2023). SMILES-18 dataset [Dataset]. http://doi.org/10.5281/zenodo.7978077

Unique identifier

https://doi.org/10.5281/zenodo.7978077

Dataset updated

May 31, 2023

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Ignacio Pérez-Correa

Description

Dataset of organic molecules encoded as SMILES strings, with 18,322,500 records collected from the PubChem database.

List of characters included in the dataset:

Description	SMILES Characters
Atoms	"C", "O", "N", "P", "S", "F", "Cl", "Br", "I", "Si", "B"
Branches	"(", ")"
Rings	"1", "2", "3", "4", "5", "6", "7", "8", "9"
Bonds	"=", "#"
Ions	"+", "-"
Stereochemistry	"/", "\"
Miscellaneous	"[", "]"
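
A character table like this can be turned into a simple vocabulary check. The following is a minimal sketch, not the dataset author's code: it assumes a literal reading of the table, so a string containing any symbol not listed (for example lowercase aromatic atoms or explicit "H") is rejected. The names `ALLOWED_TOKENS` and `tokenize` are made up for illustration.

```python
import re

# Tokens copied from the table above. Two-letter atoms are listed first so
# that "Cl" is matched as one token rather than "C" followed by an unknown "l".
ALLOWED_TOKENS = [
    "Cl", "Br", "Si",
    "C", "O", "N", "P", "S", "F", "I", "B",
    "(", ")",
    "1", "2", "3", "4", "5", "6", "7", "8", "9",
    "=", "#", "+", "-", "/", "\\", "[", "]",
]

# Alternation tries the alternatives left to right, so list order matters.
TOKEN_RE = re.compile("|".join(re.escape(t) for t in ALLOWED_TOKENS))


def tokenize(smiles):
    """Return the list of tokens, or None if a character is outside the table."""
    tokens = []
    pos = 0
    while pos < len(smiles):
        match = TOKEN_RE.match(smiles, pos)
        if match is None:
            return None  # symbol not covered by the table
        tokens.append(match.group())
        pos = match.end()
    return tokens
```

Under this reading, `tokenize("CC(=O)Cl")` splits the string into seven tokens, while `tokenize("c1ccccc1")` returns `None` because lowercase "c" is not in the table.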

EFGs: A Complete and Accurate Implementation of Ertl’s Functional Group...
acs.figshare.com
figshare.com
xlsx
Updated Jan 29, 2025
Cite
Gonzalo Colmenarejo (2025). EFGs: A Complete and Accurate Implementation of Ertl’s Functional Group Detection Algorithm in RDKit [Dataset]. http://doi.org/10.1021/acs.jcim.4c02268.s001
Available download formats: xlsx
Unique identifier
https://doi.org/10.1021/acs.jcim.4c02268.s001
Dataset updated
Jan 29, 2025
Dataset provided by
ACS Publications
Authors
Gonzalo Colmenarejo
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Functional groups are widely used in organic chemistry, because they provide a rationale to analyze physicochemical and reactivity properties. In medicinal chemistry, they are the basis for analyzing ligand–biomacromolecule interactions. Ertl’s algorithm is an approach to extract functional groups in arbitrary organic molecules that does not depend on predefined libraries of functional groups. However, there is a lack of a complete and accurate implementation of Ertl’s algorithm in the widely used RDKit cheminformatics toolkit. In this paper, a new RDKit/Python implementation of the algorithm is described that is both accurate and complete. For an RDKit molecule, it provides (i) a PNG binary string with an image of the molecule with color-highlighted functional groups; (ii) a list of sets of atom indices (idx), each set corresponding to a functional group; (iii) a list of pseudo-SMILES canonicalized strings for the full functional groups; and (iv) a list of RDKit labeled mol objects, one for each full functional group. The code is freely available at https://github.com/bbu-imdea/efgs and is part of the RDKit Contrib directory (https://github.com/rdkit/rdkit/tree/master/Contrib/efgs).
PubChemSMILESIsoTitleBM
huggingface.co
Updated Jan 3, 2025
Cite
PubChemSMILESIsoTitleBM [Dataset]. https://huggingface.co/datasets/BASF-AI/PubChemSMILESIsoTitleBM
Dataset updated
Jan 3, 2025
Dataset authored and provided by
BASF (http://basf.com/)
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
PubChem Isomeric SMILES and Titles Bitext Mining

This dataset contains two separate lists: one of isomeric SMILES strings and the other of corresponding entity titles, both sourced from PubChem (ChEBI source). The task is to identify matching pairs between the SMILES strings and the titles, where each SMILES string from the first list should be aligned with its corresponding entity title from the second list. The dataset is intended for bitext mining tasks, where the goal is to… See the full description on the dataset page: https://huggingface.co/datasets/BASF-AI/PubChemSMILESIsoTitleBM.
Supplementary Files - SMILES strings for DOSEDO enumerated library
data.niaid.nih.gov
zenodo.org
Updated Jul 15, 2023
Cite
Faust, Ann Marie (2023). Supplementary Files - SMILES strings for DOSEDO enumerated library [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8136903
Dataset updated
Jul 15, 2023
Dataset authored and provided by
Faust, Ann Marie
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Supplementary Files for Nature Communications publication titled "Diversity-oriented synthesis encoded by deoxyoligonucleotides". Two files contain the Iodo-library and Bromo-library SMILES strings.
PubChemSMILESIsoTitlePC
huggingface.co
Updated Jan 3, 2025
Cite
BASF (2025). PubChemSMILESIsoTitlePC [Dataset]. https://huggingface.co/datasets/BASF-AI/PubChemSMILESIsoTitlePC
Dataset updated
Jan 3, 2025
Dataset authored and provided by
BASF (http://basf.com/)
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
PubChem Isomeric SMILES and Titles Pair Classification

This dataset contains pairs of isomeric SMILES strings and their corresponding entity titles, with labels indicating whether they refer to the same chemical entity. A label of 1 means the SMILES string and the title correspond to the same entity, while a label of 0 indicates they do not. The dataset is sourced from PubChem (ChEBI source), and it provides valuable information for tasks involving chemical entity matching.
Table of SMILES for compounds listed
figshare.com
docx
Updated Jan 19, 2016
Cite
Embir Jaspal (2016). Table of SMILES for compounds listed [Dataset]. http://doi.org/10.6084/m9.figshare.1416131.v1
Available download formats: docx
Unique identifier
https://doi.org/10.6084/m9.figshare.1416131.v1
Dataset updated
Jan 19, 2016
Dataset provided by
Figshare (http://figshare.com/)
Authors
Embir Jaspal
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Table including the SMILES strings for the compounds listed in the paper
Data from: Library of Two Million Unique Small Molecules with Precalculated...
zenodo.org
repository.uantwerpen.be
bin
Updated Aug 8, 2024
Cite
Issar Arab; Kris Laukens; Wout Bittremieux (2024). Library of Two Million Unique Small Molecules with Precalculated Fingerprints, Descriptors, and Cardiotoxicity Inhibition Data [Dataset]. http://doi.org/10.5281/zenodo.11066707
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.11066707
Dataset updated
Aug 8, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Issar Arab; Kris Laukens; Wout Bittremieux
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository comprises a dataset of ~2 million unique compounds saved in an hdf5 small molecule library store, which includes the following fields for each molecule:

InChI key

Standardized SMILES string

Compound source

ChEMBL identifier if the compound exists in this open access database

1024-bit Morgan fingerprint

2048-bit Morgan fingerprint

881-bit PubChem fingerprints

854 vector-length of preprocessed and standardized Mordred descriptors

and cardiotoxicity inhibition predictions for each of the three cardiac ion channels (hERG, Nav1.5, and Cav1.2) using CtoxPred2 along with the model confidence scores.

The repository also includes a Jupyter notebook that serves as an initial guide for querying the small molecule library store. Export both files to the same folder, allocate approximately 40 GB of available memory disk space, unzip the library store, and then launch the notebook to begin querying.

Upon usage, please cite this publication:

Issar Arab, Kris Laukens, Wout Bittremieux, Semisupervised Learning to Boost hERG, Nav1.5, and Cav1.2 Cardiac Ion Channel Toxicity Prediction by Mining a Large Unlabeled Small Molecule Data Set, Journal of Chemical Information and Modeling, (2024). https://doi.org/10.1021/acs.jcim.4c01102
Data from: Benchmark Data Set for in Silico Prediction of Ames Mutagenicity
acs.figshare.com
zip
Updated May 30, 2023
Cite
Katja Hansen; Sebastian Mika; Timon Schroeter; Andreas Sutter; Antonius ter Laak; Thomas Steger-Hartmann; Nikolaus Heinrich; Klaus-Robert Müller (2023). Benchmark Data Set for in Silico Prediction of Ames Mutagenicity [Dataset]. http://doi.org/10.1021/ci900161g.s001
Available download formats: zip
Unique identifier
https://doi.org/10.1021/ci900161g.s001
Dataset updated
May 30, 2023
Dataset provided by
ACS Publications
Authors
Katja Hansen; Sebastian Mika; Timon Schroeter; Andreas Sutter; Antonius ter Laak; Thomas Steger-Hartmann; Nikolaus Heinrich; Klaus-Robert Müller
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Up to now, publicly available data sets to build and evaluate Ames mutagenicity prediction tools have been very limited in terms of size and chemical space covered. In this report we describe a new unique public Ames mutagenicity data set comprising about 6500 nonconfidential compounds (available as SMILES strings and SDF) together with their biological activity. Three commercial tools (DEREK, MultiCASE, and an off-the-shelf Bayesian machine learner in Pipeline Pilot) are compared with four noncommercial machine learning implementations (Support Vector Machines, Random Forests, k-Nearest Neighbors, and Gaussian Processes) on the new benchmark data set.
ProtChemSI
neuinfo.org
dknet.org
Updated Mar 2, 2025
Cite
(2025). ProtChemSI [Dataset]. http://identifiers.org/RRID:SCR_006115
Unique identifier
https://identifiers.org/RRID:SCR_006115
Dataset updated
Mar 2, 2025
Description
The database of protein-chemical structural interactions includes all existing 3D structures of complexes of proteins with low molecular weight ligands. When one considers the proteins and chemicals as vertices of a graph, all these interactions form a network. Biological networks are powerful tools for predicting undocumented relationships between molecules. The underlying principle is that existing interactions between molecules can be used to predict new interactions. For pairs of proteins sharing a common ligand, we use protein and chemical superimpositions combined with fast structural compatibility screens to predict whether additional compounds bound by one protein would bind the other. The current version includes data from the Protein Data Bank as of August 2011. The database is updated monthly.
Data from: ChemTastesDB: A Curated Database of Molecular Tastants
zenodo.org
data.niaid.nih.gov
bin, pdf
Updated Jul 16, 2024
Cite
Cristian Rojas; Davide Ballabio; Karen Pacheco Sarmiento; Elisa Pacheco Jaramillo; Mateo Mendoza; Fernando García (2024). ChemTastesDB: A Curated Database of Molecular Tastants [Dataset]. http://doi.org/10.5281/zenodo.6528835
Available download formats: bin, pdf
Unique identifier
https://doi.org/10.5281/zenodo.6528835
Dataset updated
Jul 16, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Cristian Rojas; Davide Ballabio; Karen Pacheco Sarmiento; Elisa Pacheco Jaramillo; Mateo Mendoza; Fernando García
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ChemTastesDB is a database that includes curated information of 2944 molecular tastants. ChemTastesDB constitutes a useful tool for the scientific community to expand the information of molecular tastants, which could assist in the analysis of the relationships between molecular structure and taste, as well as in silico (QSAR) studies for taste prediction by means of diverse machine learning approaches.

Molecules are labelled with one of the five basic tastes (sweet, bitter, umami, sour and salty) or with other classes related to non-basic tastes (tasteless, non-sweet, multitaste and miscellaneous). ChemTastesDB provides the following information for each molecule: name, PubChem CID, CAS registry number, canonical SMILES string, taste class and the reference to the scientific sources from which the data were retrieved. Moreover, the molecular structure of each chemical is provided in the HyperChem (.hin) format.

This is version 1.2 of the ChemTastesDB.

What's new in version 1.2:

Chemical information (for instance, name, PubChem CID or CAS number) for some tastants has been included.

The database is freeware and may be used if proper reference is given to the authors. Preferably refer to the following paper:
Rojas, C., Ballabio, D., Pacheco Sarmiento, K., Pacheco Jaramillo, E., Mendoza, M., & García, F. (2022). ChemTastesDB: A curated database of molecular tastants. Food Chemistry: Molecular Sciences, 4, 100090. https://doi.org/10.1016/j.fochms.2022.100090.

Data from: A consensus compound/bioactivity dataset for data-driven drug...

zenodo.org
explore.openaire.eu

zip

Updated May 13, 2022

Cite

Laura Isigkeit; Apirat Chaikuad; Daniel Merk (2022). A consensus compound/bioactivity dataset for data-driven drug design and chemogenomics [Dataset]. http://doi.org/10.5281/zenodo.6320761

Available download formats: zip

Unique identifier

https://doi.org/10.5281/zenodo.6320761

Dataset updated

May 13, 2022

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Laura Isigkeit; Apirat Chaikuad; Daniel Merk

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Information

The diverse publicly available compound/bioactivity databases constitute a key resource for data-driven applications in chemogenomics and drug design. Analysis of their coverage of compound entries and biological targets revealed considerable differences, however, suggesting benefit of a consensus dataset. Therefore, we have combined and curated information from five esteemed databases (ChEMBL, PubChem, BindingDB, IUPHAR/BPS and Probes&Drugs) to assemble a consensus compound/bioactivity dataset comprising 1144803 compounds with 10915362 bioactivities on 5613 targets (including defined macromolecular targets as well as cell-lines and phenotypic readouts). It also provides simplified information on assay types underlying the bioactivity data and on bioactivity confidence by comparing data from different sources. We have unified the source databases, brought them into a common format and combined them, enabling an ease for generic uses in multiple applications such as chemogenomics and data-driven drug design.

The consensus dataset provides increased target coverage and contains a higher number of molecules compared to the source databases which is also evident from a larger number of scaffolds. These features render the consensus dataset a valuable tool for machine learning and other data-driven applications in (de novo) drug design and bioactivity prediction. The increased chemical and bioactivity coverage of the consensus dataset may improve robustness of such models compared to the single source databases. In addition, semi-automated structure and bioactivity annotation checks with flags for divergent data from different sources may help data selection and further accurate curation.

Structure and content of the dataset

**Dataset structure**
ChEMBL ID	PubChem ID	IUPHAR ID	Target	Activity type	Assay type	Unit	Mean C (0)	...	Mean PC (0)	...	Mean B (0)	...	Mean I (0)	...	Mean PD (0)	...	Activity check annotation	Ligand names	Canonical SMILES C	...	Structure check	Source

The dataset was created using the Konstanz Information Miner (KNIME) (https://www.knime.com/) and was exported as a CSV-file and a compressed CSV-file.

Except for the canonical SMILES columns, all columns are filled with the datatype ‘string’. The datatype for the canonical SMILES columns is the smiles-format. We recommend the File Reader node for using the dataset in KNIME. With the help of this node the data types of the columns can be adjusted exactly. In addition, only this node can read the compressed format.

Column content:

ChEMBL ID, PubChem ID, IUPHAR ID: chemical identifier of the databases
Target: biological target of the molecule expressed as the HGNC gene symbol
Activity type: for example, pIC₅₀
Assay type: Simplification/Classification of the assay into cell-free, cellular, functional and unspecified
Unit: unit of bioactivity measurement
Mean columns of the databases: mean of bioactivity values or activity comments denoted with the frequency of their occurrence in the database, e.g. Mean C = 7.5 *(15) -> the value for this compound-target pair occurs 15 times in ChEMBL database
Activity check annotation: a bioactivity check was performed by comparing values from the different sources and adding an activity check annotation to provide automated activity validation for additional confidence
- no comment: bioactivity values are within one log unit;
- check activity data: bioactivity values are not within one log unit;
- only one data point: only one value was available, no comparison and no range calculated;
- no activity value: no precise numeric activity value was available;
- no log-value could be calculated: no negative decadic logarithm could be calculated, e.g., because the reported unit was not a compound concentration
Ligand names: all unique names contained in the five source databases are listed
Canonical SMILES columns: Molecular structure of the compound from each database
Structure check: To denote matching or differing compound structures in different source databases
- match: molecule structures are the same between different sources;
- no match: the structures differ;
- 1 source: no structure comparison is possible, because the molecule comes from only one source database.
Source: the databases from which the data come
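
The mean-column notation and the activity check annotation described above can be sketched in a few lines. This is a hedged illustration, not the KNIME workflow used by the authors: the cell layout is inferred only from the single example "7.5 *(15)", "within one log unit" is assumed to mean that the spread (max minus min) is at most 1.0, and the "no log-value could be calculated" case is left out because it depends on unit information. The function names are hypothetical.

```python
import re

# Pattern inferred from the one example given above, "7.5 *(15)".
MEAN_CELL_RE = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?)\s*\*\((\d+)\)\s*$")


def parse_mean_cell(cell):
    """Parse a cell such as '7.5 *(15)' into (7.5, 15); None if it does not fit."""
    match = MEAN_CELL_RE.match(cell)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


def activity_check(values):
    """Annotate the numeric activity values reported by the source databases.

    values: list of floats, one per database that gave a precise value.
    """
    if not values:
        return "no activity value"
    if len(values) == 1:
        return "only one data point"
    # Assumption: "within one log unit" means the spread is at most 1.0.
    if max(values) - min(values) <= 1.0:
        return "no comment"
    return "check activity data"
```

For instance, two sources reporting 7.5 and 8.2 would be annotated "no comment", whereas 5.0 and 7.5 would be annotated "check activity data".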

Vaska's space
search.dataone.org
borealisdata.ca
Updated Dec 28, 2023
Cite
Friederich, Pascal (2023). Vaska's space [Dataset]. https://search.dataone.org/view/sha256%3A3b051c2e04eacf25d5cc4aabd90efde6ad2a0cdd0b1ac1d7c47c206c9b9be1a5
Dataset updated
Dec 28, 2023
Dataset provided by
Borealis
Authors
Friederich, Pascal
Description
This repository contains a list of 1947 Iridium complexes, including their geometry (xyz files) and energy barriers for the hydrogen splitting reaction.

Data

Coordinates in .xyz format
- data/coordinates_complex: coordinates of all complexes without additional hydrogen
- data/coordinates_TS: coordinates of all complexes with an additional hydrogen molecule in transition state
- data/coordinates_molSimplify: coordinates of all complexes generated with molSimplify developed by the Kulik group

Properties in .csv format (data/vaskas_features_properties_smiles_filenames.csv)
- "smiles": SMILES strings of all molecules
- "filename": corresponding xyz filename
- "barrier": DFT computed energy barrier [kcal/mol] for the transition state of the hydrogen splitting reaction
- "distance": DFT computed H-H distance in the transition state geometry
- "chi-X", "Z-X", "T-X", "I-X" and "S-X": (auto)correlation features described in our paper
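
A properties file with the columns named above could be read with the standard library as follows. This is only a sketch: the column names "smiles", "filename" and "barrier" come from the description, but the rows in `SAMPLE_CSV` are placeholders invented for illustration and are not taken from the dataset, and `load_barriers` is a hypothetical helper name.

```python
import csv
import io

# Placeholder rows, NOT real data from the repository.
SAMPLE_CSV = (
    "smiles,filename,barrier,distance\n"
    "SMILES_A,complex_a.xyz,10.5,0.90\n"
    "SMILES_B,complex_b.xyz,7.25,0.85\n"
)


def load_barriers(csv_text):
    """Return a list of (smiles, filename, barrier) with the barrier as a float."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["smiles"], row["filename"], float(row["barrier"])) for row in reader]


rows = load_barriers(SAMPLE_CSV)
# Pick the entry with the lowest energy barrier (kcal/mol per the description).
lowest = min(rows, key=lambda row: row[2])
```

To read the real file, the text would instead come from opening data/vaskas_features_properties_smiles_filenames.csv, assuming it is comma-separated with a header row.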
Chemical structures, Cell Painting and transcriptional profiles for compound...
data.niaid.nih.gov
zenodo.org
Updated Apr 5, 2023
Cite
Anne E. Carpenter (2023). Chemical structures, Cell Painting and transcriptional profiles for compound bioactivity prediction. [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7729582
Dataset updated
Apr 5, 2023
Dataset provided by
Tim Becker
Peter Horvath
Kevin Yang
Nikita Moshkov
Juan C. Caicedo
Anne E. Carpenter
Bridget K. Wagner
Vlado Dancik
Shantanu Singh
Paul A. Clemons
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the related data, both input and produced for the paper "Predicting compound activity from phenotypic profiles and chemical structures".

This data can be merged with the paper's GitHub repository for reproduction.

Folders and files are described below:

assay_data
- assay_matrix_discrete_270_assays.csv: Assay matrix with hits for assays (270) and compounds (16170). Note that this is the final file that we used to produce splits.
- assay_metadata.csv: Assay metadata.
- broad_ids.txt: List of broad ids used in this study. This is an unfiltered list of compounds required by some analysis scripts.
- smiles.txt: Same as broad_ids.txt, but with SMILES strings.

feature_data (for 16978 compounds, can be masked with ./misc/compounds16978to16170.npy)
- cp.npz: Classical chemical features.
- ge.npz: Gene expression features.
- ge_scale.npz: Gene expression scaled features.
- mo.npz: Morphology features (not batch corrected).
- mobc.npz: Morphology features (batch corrected).

misc
- compound_analysis.npz: Compounds in the dataset identified as PAINS.
- compounds16978to16170.npy: Used to filter features from the bigger set of compounds to the final one.
- fingerprints.npz: Calculated fingerprints of compounds, which were then used to calculate similarity.
- similarity_fingerprints.npz: Similarity matrix for compounds (16978).
- population_normalized.csv.gz: Well-level morphological profiles that were used for batch correction.
- Table for PUMA: Excel file with additional data and plots.

predictions
- scaffold_median(mean)_AUC.csv: Aggregated median (mean) AUC scores over scaffold-based cross-validation splits. In the paper, median results were reported.
- scaffold_median(mean)_EF.csv: Aggregated median (mean) enrichment factor (EF) over scaffold-based cross-validation splits. In the paper, median results were reported.
- toprank_chemical_cv{}_hitsnorm.csv: These files are needed to create enrichment plots and contain hit rate and top rank hit rate.
- Each folder here stands for an experiment type; the number in the folder name is the number of the split. Inside each folder there are the following elements:
  - predictions: Folder with predictions for each assay-compound pair for each modality.
  - 2022_01_evaluation_all_data.csv: File with AUC scores for each assay for the test set in the split.
  - 2022_01_evaluation_all_data_EF.csv: File with enrichment factor (EF) values for each assay for the test set in the split. These files exist only for chemical folders.
  - assay_matrix_discrete_train(test)_old_scaff.csv: Training and test subsets of data for the split. The first column contains broad_id.
  - assay_matrix_discrete_train(test)_old_scaff.csv: Same, but with SMILES strings in the first column. These files are used as input to ChemProp.

Experiments in this folder are the following:
- chemical: Scaffold-based 5-fold cross-validation splits; the main results in the paper are reported with this series of experiments.
- chemical_bal: Same splits as in chemical, but training was run with ChemProp built-in data balancing.
- chemical_st: Same splits as in chemical, but separate models were trained for each assay.
- CV: Random 5-fold cross-validation splits.
- GE: 5-fold cross-validation splits based on same-size clustering of gene expression features.
- MOBC: 5-fold cross-validation splits based on same-size clustering of batch-corrected morphology features.
- random: 10 random splits, ~80% of compounds in the training set and the rest in the test set.

splitting
This folder contains numpy files which help to match compounds and features to create training and test sets for a split, which can be reused in the analysis notebook for data preparation.
- scaffold_based_split.npz: Splitting for scaffold-based splits.
- random_split_{}.npz: Random split indices of test set compounds (10 files).
- cross_validation_indicies.npz: Indices for random cross-validation splits.
- GE_clusters_size_constrained.npz: Indices of clusters of same-size clustering for gene expression features.
- MOBC_clusters_size_constrained.npz: Indices of clusters of same-size clustering for batch-corrected morphology features.
Additional file 1 of Zombie cheminformatics: extraction and conversion of...
springernature.figshare.com
figshare.com
txt
Updated Aug 15, 2024
Cite
Michael Blakey; Samantha Pearman-Kanza; Jeremy G. Frey (2024). Additional file 1 of Zombie cheminformatics: extraction and conversion of Wiswesser Line Notation (WLN) from chemical documents [Dataset]. http://doi.org/10.6084/m9.figshare.26706900.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.26706900.v1
Dataset updated
Aug 15, 2024
Dataset provided by
figshare
Authors
Michael Blakey; Samantha Pearman-Kanza; Jeremy G. Frey
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Additional file 1. smith.tsv (WLN:SMILES strings from Elbert Smith's encoding manual).
SMILES and physicochemical parameters - pinene, decane, toluene oxidation...
data.mendeley.com
Updated Mar 8, 2021
Cite
Gabriel Isaacman-VanWertz (2021). SMILES and physicochemical parameters - pinene, decane, toluene oxidation products [Dataset]. http://doi.org/10.17632/3rgvkf7c9n.1
Unique identifier
https://doi.org/10.17632/3rgvkf7c9n.1
Dataset updated
Mar 8, 2021
Authors
Gabriel Isaacman-VanWertz
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset includes 182,127 SMILES strings generated by 5 generations of oxidation using the GECKO-A model for alpha-pinene, decane, and toluene under typical continental atmospheric conditions. For each compound, physicochemical parameters (vapor pressure, Henry's law constant, and gas phase reaction rate constant with the hydroxyl radical) are estimated using several structure-activity relationships. Compounds are flagged according to the oxidation systems in which they exceed a threshold of 0.1% of the total modeled mass of their given molecular formula. Descriptions of this dataset and the parameter estimation are provided in Isaacman-VanWertz and Aumont, "Impact of organic molecular structure on the estimation of atmospherically relevant physicochemical parameters", Atmospheric Chemistry and Physics. The subset of 38,594 compounds used in the core analyses of that work is also flagged.
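
The flagging rule can be illustrated with a short sketch. It rests on one reading of the description, namely that within a single oxidation system a compound is flagged when its modeled mass exceeds 0.1% of the summed mass of all compounds sharing its molecular formula. The record layout and the name `flag_compounds` are hypothetical, and this is not the code used to build the dataset.

```python
from collections import defaultdict


def flag_compounds(records, threshold=0.001):
    """Flag compounds above `threshold` of the total mass of their formula.

    records: iterable of (compound_id, molecular_formula, modeled_mass)
             for one oxidation system (hypothetical layout).
    Returns the set of flagged compound ids.
    """
    records = list(records)

    # Total modeled mass per molecular formula.
    totals = defaultdict(float)
    for _, formula, mass in records:
        totals[formula] += mass

    flagged = set()
    for compound_id, formula, mass in records:
        total = totals[formula]
        # Strictly "exceed" the threshold, as worded in the description.
        if total > 0 and mass / total > threshold:
            flagged.add(compound_id)
    return flagged
```

Running the function once per oxidation system (alpha-pinene, decane, toluene) would give the per-system flags the description mentions.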
CoconutSmiles2NameBitextMining
huggingface.co
Updated Jan 3, 2025
Cite
CoconutSmiles2NameBitextMining [Dataset]. https://huggingface.co/datasets/BASF-AI/CoconutSmiles2NameBitextMining
Dataset updated
Jan 3, 2025
Dataset authored and provided by
BASF (http://basf.com/)
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
CoconutDB SMILES to Formula Bitext Mining

This dataset consists of two lists: one containing both isomeric and canonical SMILES strings, and the other containing the corresponding molecular formulas of chemical entities, sourced from CoconutDB. The primary task is to identify matching pairs between the SMILES strings and their molecular formulas. Each SMILES string from the first list should be accurately aligned with its corresponding molecular formula from the second list.