30 datasets found

moleculenet-benchmark
huggingface.co
Updated Aug 29, 2023
Cite
Katie Link (2023). moleculenet-benchmark [Dataset]. https://huggingface.co/datasets/katielink/moleculenet-benchmark
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 29, 2023
Authors
Katie Link
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
MoleculeNet Benchmark (website)

MoleculeNet is a benchmark designed specifically for testing machine learning methods on molecular properties. To facilitate the development of molecular machine learning methods, this work curates a number of dataset collections and creates a suite of software that implements many known featurizations and previously proposed algorithms. All methods and datasets are integrated as parts of the open source DeepChem package (MIT license). MoleculeNet… See the full description on the dataset page: https://huggingface.co/datasets/katielink/moleculenet-benchmark.
MoleculeNet_Lipophilicity
huggingface.co
Updated Jul 7, 2024
Cite
scikit-fingerprints (2024). MoleculeNet_Lipophilicity [Dataset]. https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_Lipophilicity
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 7, 2024
Dataset authored and provided by
scikit-fingerprints
License
Unknown: https://choosealicense.com/licenses/unknown/
Description
MoleculeNet Lipophilicity

Lipophilicity dataset, part of MoleculeNet [1] benchmark. It is intended to be used through scikit-fingerprints library. The task is to predict octanol/water distribution coefficient (logD) at pH 7.4. Targets are already log transformed, and are a unitless ratio.

Characteristic Description

Tasks 1

Task type regression

Total samples 4200

Recommended split scaffold

Recommended metric RMSE

References

[1] Wu, Zhenqin, et al.… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_Lipophilicity.
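
The recommended metric for this regression task is RMSE. As a minimal sketch (plain Python, not the scikit-fingerprints API; the target and prediction values are made up for illustration), it can be computed like this:

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical logD targets and model predictions, for illustration only.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
print(rmse(y_true, y_pred))
```

Because the targets are already log-transformed, the RMSE is in the same log units as logD.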
MoleculeNet_Tox21
huggingface.co
ollama.hf-mirror.com
Updated Feb 3, 2025
Cite
scikit-fingerprints (2025). MoleculeNet_Tox21 [Dataset]. https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_Tox21
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 3, 2025
Dataset authored and provided by
scikit-fingerprints
License
Unknown: https://choosealicense.com/licenses/unknown/
Description
MoleculeNet Tox21

Tox21 dataset [1], part of MoleculeNet [2] benchmark. It is intended to be used through scikit-fingerprints library. The task is to predict 12 toxicity targets, including nuclear receptors and stress response pathways. All tasks are binary. Note that targets have missing values. Algorithms should be evaluated only on present labels. For training data, you may want to impute them, e.g. with zeros.

Characteristic Description

Tasks 12

Task type multitask… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_Tox21.
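
The handling of missing labels described above can be sketched as follows. This is an illustrative plain-Python example, not code from scikit-fingerprints: the helper names are hypothetical, missing labels are assumed to be encoded as None or NaN, and accuracy is used only to keep the sketch short (the dataset card's recommended metric is not shown in this excerpt).

```python
import math


def is_missing(value):
    """True for labels encoded as None or NaN (an assumed encoding)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_missing(labels, fill=0):
    """Replace missing labels with `fill`; intended for training data only."""
    return [[fill if is_missing(v) else v for v in row] for row in labels]


def masked_accuracy_per_task(y_true, y_pred):
    """Per-task accuracy computed only over molecules whose label is present."""
    scores = []
    for task in range(len(y_true[0])):
        pairs = [
            (true_row[task], pred_row[task])
            for true_row, pred_row in zip(y_true, y_pred)
            if not is_missing(true_row[task])
        ]
        # A task with no present labels gets no score at all.
        scores.append(sum(t == p for t, p in pairs) / len(pairs) if pairs else None)
    return scores


# Toy labels for 3 molecules and 2 tasks; values are made up.
y_true = [[1, None], [0, 1], [float("nan"), 0]]
y_pred = [[1, 1], [1, 1], [0, 0]]
print(masked_accuracy_per_task(y_true, y_pred))
print(impute_missing(y_true))
```

Keeping imputation and evaluation in separate functions makes it harder to score a model against imputed labels by accident.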
MoleculeNet_ESOL
huggingface.co
Updated Jul 12, 2024
Cite
scikit-fingerprints (2024). MoleculeNet_ESOL [Dataset]. https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_ESOL
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 12, 2024
Dataset authored and provided by
scikit-fingerprints
License
Unknown: https://choosealicense.com/licenses/unknown/
Description
MoleculeNet ESOL

ESOL (Estimated SOLubility) dataset [1], part of MoleculeNet [2] benchmark. It is intended to be used through scikit-fingerprints library. The task is to predict aqueous solubility. Targets are log-transformed, and the unit is log moles per litre (log mol/L).

Characteristic Description

Tasks 1

Task type regression

Total samples 1128

Recommended split scaffold

Recommended metric RMSE

References

[1] John S. Delaney "ESOL: Estimating… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_ESOL.
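
A scaffold split keeps molecules that share a core structure in the same subset, so the test set contains chemotypes unseen in training. The sketch below shows one common greedy variant and is not necessarily how scikit-fingerprints implements it. It assumes scaffold strings have already been computed elsewhere (for example as Murcko scaffolds with a cheminformatics toolkit); the function name and the toy scaffolds are hypothetical.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Greedy split that never places one scaffold in two subsets.

    `scaffolds[i]` is the precomputed scaffold string of molecule i.
    Returns three lists of molecule indices: train, valid, test.
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Largest groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n_total = len(scaffolds)
    train_cap = frac_train * n_total
    valid_cap = (frac_train + frac_valid) * n_total

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Toy scaffold strings for 10 molecules, for illustration only.
scaffolds = ["c1ccccc1"] * 6 + ["C1CCCCC1"] * 2 + ["c1ccncc1", "c1ccoc1"]
train, valid, test = scaffold_split(scaffolds)
print(len(train), len(valid), len(test))
```

Because whole groups are assigned at once, the realized split sizes only approximate the requested fractions.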
Benchmark Data for Chemprop
zenodo.org
application/gzip
Updated Jul 24, 2023
Cite
Esther Heid; Kevin P. Greenman; Yunsie Chung; Shih-Cheng Li; David E. Graff; Florence H. Vermeire; Haoyang Wu; William H. Green; Charles J. McGill (2023). Benchmark Data for Chemprop [Dataset]. http://doi.org/10.5281/zenodo.8174268
Explore at:
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.8174268
Dataset updated
Jul 24, 2023
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Esther Heid; Kevin P. Greenman; Yunsie Chung; Shih-Cheng Li; David E. Graff; Florence H. Vermeire; Haoyang Wu; William H. Green; Charles J. McGill
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Datasets and splits of the manuscript "Chemprop: Machine Learning Package for Chemical Property Prediction." Train, validation and test splits are located within each folder, as well as additional data necessary for some of the benchmarks. To train Chemprop models, refer to our code repository to obtain ready-to-use scripts to train machine learning models for each of the systems.

Available benchmarking systems:

`hiv` HIV replication inhibition from MoleculeNet and OGB with scaffold splits

`pcba_random` Biological activities from MoleculeNet and OGB with random splits

`pcba_scaffold` Biological activities from MoleculeNet and OGB with scaffold splits

`qm9_multitask` DFT calculated properties from MoleculeNet and OGB, trained as a multi-task model

`qm9_u0` DFT calculated properties from MoleculeNet and OGB, trained as a single-task model on the target U0 only

`qm9_gap` DFT calculated properties from MoleculeNet and OGB, trained as a single-task model on the target gap only

`sampl` Water-octanol partition coefficients, used to predict molecules from the SAMPL6, 7 and 9 challenges

`atom_bond_137k` Quantum-mechanical atom and bond descriptors

`bde` Bond dissociation enthalpies trained as single-task model

`bde_charges` Bond dissociation enthalpies trained as multi-task model together with atomic partial charges

`charges_eps_4` Partial charges at a dielectric constant of 4 (in protein)

`charges_eps_78` Partial charges at a dielectric constant of 78 (in water)

`barriers_e2` Reaction barrier heights of E2 reactions

`barriers_sn2` Reaction barrier heights of SN2 reactions

`barriers_cycloadd` Reaction barrier heights of cycloaddition reactions

`barriers_rdb7` Reaction barrier heights in the RDB7 dataset

`barriers_rgd1` Reaction barrier heights in the RGD1-CNHO dataset

`multi_molecule` UV/Vis peak absorption wavelengths in different solvents

`ir` IR Spectra

`pcqm4mv2` HOMO-LUMO gaps of the PCQM4Mv2 dataset

`uncertainty_ensemble` Uncertainty estimation using an ensemble using the QM9 gap dataset

`uncertainty_evidential` Uncertainty estimation using evidential learning using the QM9 gap dataset

`uncertainty_mve` Uncertainty estimation using mean-variance estimation using the QM9 gap dataset

`timing` Timing benchmark using subsets of QM9 gap
Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K....
service.tib.eu
Updated Dec 2, 2024
Cite
Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, V. Pande (2024). MoleculeNet dataset [Dataset]. https://doi.org/10.57702/wi8voz93. https://service.tib.eu/ldmservice/dataset/moleculenet-dataset
Explore at:
Dataset updated
Dec 2, 2024
Description
The MoleculeNet dataset is a benchmarking platform for molecular machine learning.
MoleculeNet_SIDER
ollama.hf-mirror.com
huggingface.co
Updated May 31, 2025
Cite
scikit-fingerprints (2025). MoleculeNet_SIDER [Dataset]. https://ollama.hf-mirror.com/datasets/scikit-fingerprints/MoleculeNet_SIDER
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 31, 2025
Dataset authored and provided by
scikit-fingerprints
License
Unknown: https://choosealicense.com/licenses/unknown/
Description
MoleculeNet SIDER

Load and return the SIDER (Side Effect Resource) dataset [1], part of MoleculeNet [2] benchmark. It is intended to be used through scikit-fingerprints library. The task is to predict adverse drug reactions (ADRs) as drug side effects to 27 system organ classes in MedDRA classification. All tasks are binary.

Characteristic Description

Tasks 27

Task type multitask classification

Total samples 1427

Recommended split scaffold

Recommended metric… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_SIDER.
MoleculeNet-Hiv-split
ollama.hf-mirror.com
huggingface.co
Updated May 31, 2025
Cite
Raushan Turganbay (2025). MoleculeNet-Hiv-split [Dataset]. https://ollama.hf-mirror.com/datasets/RaushanTurganbay/MoleculeNet-Hiv-split
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 31, 2025
Authors
Raushan Turganbay
Description
RaushanTurganbay/MoleculeNet-Hiv-split dataset hosted on Hugging Face and contributed by the HF Datasets community
MoleculeNet_FreeSolv
huggingface.co
Updated Jul 10, 2024
Cite
scikit-fingerprints (2024). MoleculeNet_FreeSolv [Dataset]. https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_FreeSolv
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 10, 2024
Dataset authored and provided by
scikit-fingerprints
License
Unknown: https://choosealicense.com/licenses/unknown/
Description
MoleculeNet FreeSolv

FreeSolv (Free Solvation Database) dataset [1], part of MoleculeNet [2] benchmark. It is intended to be used through scikit-fingerprints library. The task is to predict hydration free energy of small molecules in water. Targets are in kcal/mol.

Characteristic Description

Tasks 1

Task type regression

Total samples 642

Recommended split scaffold

Recommended metric RMSE

References

[1] Mobley, D.L., Guthrie, J.P. "FreeSolv: a… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_FreeSolv.
Data from: Comparison of Cellular Morphological Descriptors and Molecular...
acs.figshare.com
xlsx
Updated Jun 5, 2023
Cite
Srijit Seal; Hongbin Yang; Luis Vollmers; Andreas Bender (2023). Comparison of Cellular Morphological Descriptors and Molecular Fingerprints for the Prediction of Cytotoxicity- and Proliferation-Related Assays [Dataset]. http://doi.org/10.1021/acs.chemrestox.0c00303.s002
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.1021/acs.chemrestox.0c00303.s002
Dataset updated
Jun 5, 2023
Dataset provided by
ACS Publications
Authors
Srijit Seal; Hongbin Yang; Luis Vollmers; Andreas Bender
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Cell morphology features, such as those from the Cell Painting assay, can be generated at relatively low costs and represent versatile biological descriptors of a system and thereby compound response. In this study, we explored cell morphology descriptors and molecular fingerprints, separately and in combination, for the prediction of cytotoxicity- and proliferation-related in vitro assay endpoints. We selected 135 compounds from the MoleculeNet ToxCast benchmark data set which were annotated with Cell Painting readouts, where the relatively small size of the data set is due to the overlap of required annotations. We trained Random Forest classification models using nested cross-validation and Cell Painting descriptors, Morgan and ErG fingerprints, and their combinations. While using leave-one-cluster-out cross-validation (with clusters based on physicochemical descriptors), models using Cell Painting descriptors achieved higher average performance over all assays (Balanced Accuracy of 0.65, Matthews Correlation Coefficient of 0.28, and AUC-ROC of 0.71) compared to models using ErG fingerprints (BA 0.55, MCC 0.09, and AUC-ROC 0.60) and Morgan fingerprints alone (BA 0.54, MCC 0.06, and AUC-ROC 0.56). While using random shuffle splits, the combination of Cell Painting descriptors with ErG and Morgan fingerprints further improved balanced accuracy on average by 8.9% (in 9 out of 12 assays) and 23.4% (in 8 out of 12 assays) compared to using only ErG and Morgan fingerprints, respectively. Regarding feature importance, Cell Painting descriptors related to nuclei texture, granularity of cells, and cytoplasm as well as cell neighbors and radial distributions were identified to be most contributing, which is plausible given the endpoint considered. 
We conclude that cell morphological descriptors contain complementary information to molecular fingerprints which can be used to improve the performance of predictive cytotoxicity models, in particular in areas of novel structural space.
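
For readers unfamiliar with the metrics quoted in this abstract, balanced accuracy (BA) and the Matthews correlation coefficient (MCC) can both be derived from a binary confusion matrix. The following is a minimal sketch using the standard definitions; the counts are invented and do not come from the study.

```python
import math


def balanced_accuracy(tp, fp, tn, fn):
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def matthews_corrcoef(tp, fp, tn, fn):
    """MCC from confusion counts; returns 0.0 when the denominator is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical confusion counts for one assay, for illustration only.
tp, fp, tn, fn = 8, 2, 6, 4
print(balanced_accuracy(tp, fp, tn, fn))
print(matthews_corrcoef(tp, fp, tn, fn))
```

Both metrics are less sensitive to class imbalance than plain accuracy, which is why they suit assay data with few active compounds.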
MoleculeNet_ClinTox
huggingface.co
Updated Jul 7, 2024
Cite
scikit-fingerprints (2024). MoleculeNet_ClinTox [Dataset]. https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_ClinTox
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 7, 2024
Dataset authored and provided by
scikit-fingerprints
License
Unknown: https://choosealicense.com/licenses/unknown/
Description
MoleculeNet ClinTox

Load and return the ClinTox dataset, part of MoleculeNet [1] benchmark. It is intended to be used through scikit-fingerprints library. The task is to predict drug approval viability, by predicting clinical trial toxicity and final FDA approval status. Both tasks are binary.

Characteristic Description

Tasks 2

Task type multitask classification

Total samples 1477

Recommended split scaffold

Recommended metric AUROC

References

[1]… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_ClinTox.
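
The recommended metric here is AUROC. For a multitask dataset such as this one, a common convention (assumed here, not stated in the card) is to compute AUROC per task and report the mean. A small plain-Python sketch with made-up labels and scores, using the pairwise-ranking definition of AUROC:

```python
def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def mean_auroc(y_true_by_task, scores_by_task):
    """Unweighted mean of per-task AUROC values."""
    values = [auroc(y, s) for y, s in zip(y_true_by_task, scores_by_task)]
    return sum(values) / len(values)


# Two hypothetical binary tasks with four molecules each.
y_true_by_task = [[0, 0, 1, 1], [1, 0, 1, 0]]
scores_by_task = [[0.1, 0.4, 0.35, 0.8], [0.9, 0.2, 0.6, 0.6]]
print(mean_auroc(y_true_by_task, scores_by_task))
```

The pairwise form is quadratic in the number of samples; it is meant to make the definition explicit, not to replace a library implementation on large datasets.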
Data from: From Pixels to Phenotypes: Integrating Image-Based Profiling with...
data.niaid.nih.gov
zenodo.org
Updated Jan 11, 2024
Cite
Andreas Bender (2024). From Pixels to Phenotypes: Integrating Image-Based Profiling with Cell Health Data Improves Interpretability [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8147309
Explore at:
Dataset updated
Jan 11, 2024
Dataset provided by
Jordi Carreras-Puigvert
Srijit Seal
Ola Spjuth
Andreas Bender
Anne E Carpenter
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Code: https://github.com/srijitseal/BioMorph_Space

Cell Painting assays generate morphological profiles that are versatile descriptors of biological systems and have been used to predict in vitro and in vivo drug effects. However, Cell Painting features are based on image statistics, and are, therefore, often not readily biologically interpretable. In this study, we introduce an approach that maps specific Cell Painting features into the BioMorph space using readouts from comprehensive Cell Health assays. We validated that the resulting BioMorph space effectively connected compounds not only with the morphological features associated with their bioactivity but with deeper insights into phenotypic characteristics and cellular processes associated with the given bioactivity. The BioMorph space revealed the mechanism of action for individual compounds, including dual-acting compounds such as emetine, an inhibitor of both protein synthesis and DNA replication. In summary, BioMorph space offers a more biologically relevant way to interpret cell morphological features from the Cell Painting assays and to generate hypotheses for experimental validation.

The following datasets are released:

Cell_Health_median_357_profiles_70_labels.csv: The Cell Health dataset for CRISPR perturbations. Contains median consensus signatures for the 357 consensus profiles (119 CRISPR perturbations × 3 cell lines). Ref: Way et al.

Cell_Painitng_CRISPR_Perturbations_357_profiles_827_features_scaled.csv: The Cell Painting dataset for CRISPR perturbations. Contains 827 morphology features (and metadata annotation) for 357 consensus profiles (119 CRISPR perturbations × 3 cell lines). Ref: Way et al.

Cell_Painting_data_658_compounds_827_Features_scaled.csv: The Cell Painting dataset for compound perturbations. Contains 658 structurally unique compounds with 827 Cell Painting features. Ref: Bray et al.

Endpoints_9_Mitotox_biological_activities_658_compounds.csv: The biological assay activity labels for compound perturbations. Contains 658 structurally unique compounds with 9 biological activity consensus hit calls. Ref: ToxCast/MoleculeNet.

BioMoprh_pvalue_658_compunds_398_BioMorph_terms.csv: The dataset of standardised BioMorph term p-values. Contains 398 BioMorph terms for the 658 compounds in the biological activity dataset.

References:

Way et al. Predicting cell health phenotypes using image-based morphology profiling. Mol Biol Cell. 2021;32(9):995-1005.

Bray et al. A dataset of images and morphological profiles of 30 000 small-molecule treatments using the Cell Painting assay. Gigascience. 2017;6(12):1-5.

MoleculeNet: Wu et al. MoleculeNet: A benchmark for molecular machine learning. Chem Sci. 2018;9(2):513-530.

ToxCast: Exploring ToxCast Data | US EPA https://www.epa.gov/chemical-research/exploring-toxcast-data (accessed Jul 9, 2023).
Dataset of small molecules free energy in water (FreeSolv) curated and...
zenodo.org
csv
Updated Apr 24, 2025
Cite
Zenodo (2025). Dataset of small molecules free energy in water (FreeSolv) curated and enriched using the Enalos tools and Enalos KNIME nodes for machine learning analysis [Dataset]. http://doi.org/10.5281/zenodo.14391750
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.14391750
Dataset updated
Apr 24, 2025
Dataset provided by
Zenodo: http://zenodo.org/
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A curated and enriched dataset for the hydration free energy in water (FreeSolv) of small molecules, intended for in silico model development. The dataset is retrieved from MoleculeNet. The curated FreeSolv dataset comprises 642 compounds enriched with 777 molecular descriptors extracted from their 2D structure using EnalosMold2 KNIME node.

More curated datasets are available via chemPharos: https://db.chempharos.eu/datasets/Datasets.zul
pcba_686978
huggingface.co
ollama.hf-mirror.com
Updated Mar 18, 2023
Cite
Zach Nussbaum (2023). pcba_686978 [Dataset]. https://huggingface.co/datasets/zpn/pcba_686978
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 18, 2023
Authors
Zach Nussbaum
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset Card for pcba_686978

Dataset Summary

pcba_686978 is a dataset included in MoleculeNet. PubChem BioAssay (PCBA) is a database consisting of biological activities of small molecules generated by high-throughput screening. We have chosen one of the larger tasks (ID 686978) as described in https://par.nsf.gov/servlets/purl/10168888.

Dataset Structure Data Fields

Each split contains

smiles: the SMILES representation of a molecule selfies:… See the full description on the dataset page: https://huggingface.co/datasets/zpn/pcba_686978.
Druglike molecule datasets for drug discovery
zenodo.org
bin
Updated Jan 18, 2023
Cite
Jonghyun Lee (2023). Druglike molecule datasets for drug discovery [Dataset]. http://doi.org/10.5281/zenodo.7547717
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.7547717
Dataset updated
Jan 18, 2023
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Jonghyun Lee
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background
Transformer-based AI models have shown outstanding performance in identifying druggable candidate molecules. In most cases, models are trained on massive databases of molecular information to capture the latent meaning of a given molecule. However, the desirable properties of candidate molecules include synthetic feasibility, low toxicity, and high druggability. In this study, we injected prior knowledge of the desirable properties of molecules during the training process.

Methods
Using the PubChem database (100 M), we filtered druglike molecules based on the quantitative estimate of drug-likeness (QED) score and the Pfizer rule. With this dataset of druglike molecules, we trained both the molecular representation model (chemBERTa) and the molecular generation model (MolGPT). The molecular representation model was evaluated by fine-tuning it on the MoleculeNet benchmark datasets, and the molecular generation model was evaluated based on the generated samples (10 K).

Results
Training with druglike molecules enabled the generation of molecules with desirable properties without any conditioning. Although the overall performance of the molecular representation learning model was not remarkable, its performance in predicting clinical toxicity exceeded that of conventional molecular representation models.

Conclusion
By training on a dataset of druglike molecules, our approach enables molecular representation models to predict clinical toxicity more precisely. Furthermore, it enables the molecule generation model to generate molecules with desirable druglike properties without any conditional generation procedures.

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

import pickle

# Load the pickled druglike-molecule dataset (only unpickle files you trust).
with open("druglike_molecules_QED.pkl", "rb") as f:
    data = pickle.load(f)
MoleculeNet_PCBA
huggingface.co
Updated Jul 7, 2024
Cite
scikit-fingerprints (2024). MoleculeNet_PCBA [Dataset]. https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_PCBA
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 7, 2024
Dataset authored and provided by
scikit-fingerprints
License
Unknown: https://choosealicense.com/licenses/unknown/
Description
MoleculeNet PCBA

PCBA (PubChem BioAssay) dataset [1], part of MoleculeNet [2] benchmark. It is intended to be used through scikit-fingerprints library. The task is to predict biological activity against 128 bioassays, generated by high-throughput screening (HTS). All tasks are binary active/non-active. Note that targets have missing values. Algorithms should be evaluated only on present labels. For training data, you may want to impute them, e.g. with zeros.

Characteristic… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_PCBA.
ogbg_molpcba
tensorflow.org
Updated Dec 14, 2022
Cite
(2022). ogbg_molpcba [Dataset]. https://www.tensorflow.org/datasets/catalog/ogbg_molpcba
Explore at:
Dataset updated
Dec 14, 2022
Description
'ogbg-molpcba' is a molecular dataset sampled from PubChem BioAssay. It is a graph prediction dataset from the Open Graph Benchmark (OGB).

This dataset is experimental, and the API is subject to change in future releases.

The below description of the dataset is adapted from the OGB paper:

Input Format

All the molecules are pre-processed using RDKit ([1]).

Each graph represents a molecule, where nodes are atoms, and edges are chemical bonds.

Input node features are 9-dimensional, containing atomic number and chirality, as well as other additional atom features such as formal charge and whether the atom is in a ring.

Input edge features are 3-dimensional, containing bond type, bond stereochemistry, as well as an additional bond feature indicating whether the bond is conjugated.

The exact description of all features is available at https://github.com/snap-stanford/ogb/blob/master/ogb/utils/features.py.
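
To make the input format concrete, here is a small sketch of a container with the stated dimensions: 9 values per node and 3 values per edge. It is not the OGB or TensorFlow Datasets data structure. The class name, the choice to store each bond in both directions, and all feature values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

NODE_FEATURE_DIM = 9  # per-atom features, as stated in the description
EDGE_FEATURE_DIM = 3  # per-bond features, as stated in the description


@dataclass
class MolGraph:
    """Hypothetical container for one molecule graph."""

    node_feat: List[List[int]]         # one 9-dimensional row per atom
    edge_index: List[Tuple[int, int]]  # (source atom, target atom) pairs
    edge_feat: List[List[int]]         # one 3-dimensional row per edge

    def validate(self) -> None:
        """Raise ValueError if shapes or atom indices are inconsistent."""
        if any(len(row) != NODE_FEATURE_DIM for row in self.node_feat):
            raise ValueError("each node needs 9 features")
        if any(len(row) != EDGE_FEATURE_DIM for row in self.edge_feat):
            raise ValueError("each edge needs 3 features")
        if len(self.edge_index) != len(self.edge_feat):
            raise ValueError("edge_index and edge_feat must align")
        n_atoms = len(self.node_feat)
        if any(not (0 <= a < n_atoms and 0 <= b < n_atoms) for a, b in self.edge_index):
            raise ValueError("edge refers to a missing atom")


# Heavy atoms of ethanol (C-C-O). Feature values are placeholders: only the
# first entry (atomic number) is meaningful here, the rest are zeros.
ethanol = MolGraph(
    node_feat=[[6] + [0] * 8, [6] + [0] * 8, [8] + [0] * 8],
    edge_index=[(0, 1), (1, 0), (1, 2), (2, 1)],  # each bond stored twice (assumption)
    edge_feat=[[0, 0, 0]] * 4,
)
ethanol.validate()
print(len(ethanol.node_feat), len(ethanol.edge_index))
```

The real feature encodings are defined in the features.py file linked above; this sketch only checks shapes.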

Prediction

The task is to predict 128 different biological activities (inactive/active). See [2] and [3] for more description about these targets. Not all targets apply to each molecule: missing targets are indicated by NaNs.

References

[1]: Greg Landrum, et al. 'RDKit: Open-source cheminformatics'. URL: https://github.com/rdkit/rdkit

[2]: Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David Konerding and Vijay Pande. 'Massively Multitask Networks for Drug Discovery'. URL: https://arxiv.org/pdf/1502.02072.pdf

[3]: Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2):513-530, 2018.

To use this dataset:

import tensorflow_datasets as tfds

ds = tfds.load('ogbg_molpcba', split='train')
for ex in ds.take(4):
    print(ex)

See the guide for more information on tensorflow_datasets.

Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/ogbg_molpcba-0.1.3.png
MoleculeNet_ToxCast
huggingface.co
Updated Jul 7, 2024
Cite
scikit-fingerprints (2024). MoleculeNet_ToxCast [Dataset]. https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_ToxCast
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 7, 2024
Dataset authored and provided by
scikit-fingerprints
License
Unknown: https://choosealicense.com/licenses/unknown/
Description
MoleculeNet ToxCast

ToxCast dataset [1], part of MoleculeNet [2] benchmark. It is intended to be used through scikit-fingerprints library. The task is to predict 617 toxicity targets from a large library of compounds based on in vitro high-throughput screening. All tasks are binary. Note that targets have missing values. Algorithms should be evaluated only on present labels. For training data, you may want to impute them, e.g. with zeros.

Characteristic Description

Tasks… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_ToxCast.
MoleculeNet_BACE
huggingface.co
Updated Jul 7, 2024
Cite
scikit-fingerprints (2024). MoleculeNet_BACE [Dataset]. https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_BACE
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 7, 2024
Dataset authored and provided by
scikit-fingerprints
License
Unknown: https://choosealicense.com/licenses/unknown/
Description
MoleculeNet BACE

BACE dataset [1], part of MoleculeNet [2] benchmark. It is intended to be used through scikit-fingerprints library. The task is to predict binding results for a set of inhibitors of human β-secretase 1 (BACE-1).

Characteristic Description

Tasks 1

Task type classification

Total samples 1513

Recommended split scaffold

Recommended metric AUROC

References

[1] Govindan Subramanian et al. "Computational Modeling of β-Secretase 1 (BACE-1)… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_BACE.
Conformer datasets for "Equivariant Graph Neural Networks for Toxicity...
zenodo.org
xz
Updated May 21, 2024
Cite
Julian Cremer; Leonardo Medrano Sandonas; Alexandre Tkatchenko; Djork-Arné Clevert; Gianni De Fabritiis (2024). Conformer datasets for "Equivariant Graph Neural Networks for Toxicity Prediction" [Dataset]. http://doi.org/10.5281/zenodo.11237635
Explore at:
Available download formats: xz
Unique identifier
https://doi.org/10.5281/zenodo.11237635
Dataset updated
May 21, 2024
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Julian Cremer; Leonardo Medrano Sandonas; Alexandre Tkatchenko; Djork-Arné Clevert; Gianni De Fabritiis
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Predictive modeling of toxicity is a crucial step in the drug discovery pipeline. It can help filter out molecules with a high probability of failing in the early stages of de novo drug design. Thus, several machine learning (ML) models have been developed to predict the toxicity of molecules by combining classical ML techniques or deep neural networks with well-known molecular representations such as fingerprints or 2D graphs. But the more natural, accurate representation of molecules is expected to be defined in physical 3D space like in ab initio methods. Recent studies successfully used equivariant graph neural networks (EGNNs) for representation learning based on 3D structures to predict quantum-mechanical properties of molecules. Inspired by this, we investigated the performance of EGNNs to construct reliable ML models for toxicity prediction. We used the equivariant transformer (ET) model in TorchMD-NET for this. Eleven toxicity data sets taken from MoleculeNet, TDCommons, and ToxBenchmark have been considered to evaluate the capability of ET for toxicity prediction. Our results show that ET adequately learns 3D representations of molecules that can successfully correlate with toxicity activity, achieving good accuracies on most data sets comparable to state-of-the-art models. We also test a physicochemical property, namely, the total energy of a molecule, to inform the toxicity prediction with a physical prior. However, our work suggests that these two properties can not be related. We also provide an attention weight analysis for helping to understand the toxicity prediction in 3D space and thus increase the explainability of the ML model. In summary, our findings offer promising insights considering 3D geometry information via EGNNs and provide a straightforward way to integrate molecular conformers into ML-based pipelines for predicting and investigating toxicity prediction in physical space. 
We expect that in the future, especially for larger, more diverse data sets, EGNNs will be an essential tool in this domain.

PAPER

https://pubs.acs.org/doi/full/10.1021/acs.chemrestox.3c00032

CODE and MODELS:

The conformer data sets and trained toxicity models will be published upon acceptance of this work. The code has been made available at https://github.com/jule-c/ET-Tox, and the processed data as well as pretrained models for training and testing can be downloaded from https://zenodo.org/record/7942946. We can provide the full list of conformers as XYZ files upon request.
