30 datasets found
  1. moleculenet-benchmark

    • huggingface.co
    Updated Aug 29, 2023
    Cite
    Katie Link (2023). moleculenet-benchmark [Dataset]. https://huggingface.co/datasets/katielink/moleculenet-benchmark
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 29, 2023
    Authors
    Katie Link
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    MoleculeNet Benchmark (website)

    MoleculeNet is a benchmark designed for testing machine learning methods on molecular properties. To facilitate the development of molecular machine learning methods, this work curates a number of dataset collections and creates a suite of software that implements many known featurizations and previously proposed algorithms. All methods and datasets are integrated as parts of the open-source DeepChem package (MIT license). MoleculeNet… See the full description on the dataset page: https://huggingface.co/datasets/katielink/moleculenet-benchmark.

  2. MoleculeNet_Lipophilicity

    • huggingface.co
    Updated Jul 7, 2024
    Cite
    scikit-fingerprints (2024). MoleculeNet_Lipophilicity [Dataset]. https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_Lipophilicity
    Explore at:
    Croissant
    Dataset updated
    Jul 7, 2024
    Dataset authored and provided by
    scikit-fingerprints
    License

    https://choosealicense.com/licenses/unknown/

    Description

    MoleculeNet Lipophilicity

    Lipophilicity dataset, part of the MoleculeNet [1] benchmark. It is intended to be used through the scikit-fingerprints library. The task is to predict the octanol/water distribution coefficient (logD) at pH 7.4. Targets are already log-transformed and are unitless.

    Characteristic       Description
    Tasks                1
    Task type            regression
    Total samples        4200
    Recommended split    scaffold
    Recommended metric   RMSE

      References
    

    [1] Wu, Zhenqin, et al.… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_Lipophilicity.
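The recommended metric for this regression task, RMSE, can be illustrated with a minimal sketch in plain Python. The function name `rmse` and the sample values are invented for the example and are not taken from the dataset or from scikit-fingerprints.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Illustrative logD values (invented for the example, not real measurements).
measured = [3.5, -1.0, 2.0, 0.5]
predicted = [3.0, -0.5, 2.5, 1.0]
print(rmse(measured, predicted))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.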

  3. MoleculeNet_Tox21

    • huggingface.co
    • ollama.hf-mirror.com
    Updated Feb 3, 2025
    Cite
    scikit-fingerprints (2025). MoleculeNet_Tox21 [Dataset]. https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_Tox21
    Explore at:
    Croissant
    Dataset updated
    Feb 3, 2025
    Dataset authored and provided by
    scikit-fingerprints
    License

    https://choosealicense.com/licenses/unknown/

    Description

    MoleculeNet Tox21

    Tox21 dataset [1], part of the MoleculeNet [2] benchmark. It is intended to be used through the scikit-fingerprints library. The task is to predict 12 toxicity targets, including nuclear receptors and stress response pathways. All tasks are binary. Note that targets have missing values. Algorithms should be evaluated only on present labels. For training data, you may want to impute them, e.g. with zeros.

    Characteristic Description

    Tasks 12

    Task type multitask… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_Tox21.
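Evaluating "only on present labels" can be sketched as a per-task AUROC that skips missing entries. This is a minimal plain-Python illustration, assuming missing labels are encoded as NaN; the helper names and the toy matrices are hypothetical, and the pairwise AUROC is written out by hand rather than taken from any library.

```python
import math

def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def masked_mean_auroc(y_true, y_score):
    """Mean per-task AUROC over a molecules x tasks matrix, skipping NaN labels."""
    per_task = []
    for t in range(len(y_true[0])):
        kept = [(row[t], s[t]) for row, s in zip(y_true, y_score) if not math.isnan(row[t])]
        value = auroc([y for y, _ in kept], [s for _, s in kept])
        if value is not None:
            per_task.append(value)
    if not per_task:
        raise ValueError("no task has both classes among its present labels")
    return sum(per_task) / len(per_task)

nan = float("nan")
# Toy data: 4 molecules x 2 tasks, with two missing labels.
y_true = [[1, 0], [0, nan], [1, 1], [nan, 0]]
y_score = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.7], [0.5, 0.9]]
print(masked_mean_auroc(y_true, y_score))
```

Scores at missing positions never enter the metric, so imputing zeros for training does not leak into evaluation.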

  4. MoleculeNet_ESOL

    • huggingface.co
    Updated Jul 12, 2024
    Cite
    scikit-fingerprints (2024). MoleculeNet_ESOL [Dataset]. https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_ESOL
    Explore at:
    Croissant
    Dataset updated
    Jul 12, 2024
    Dataset authored and provided by
    scikit-fingerprints
    License

    https://choosealicense.com/licenses/unknown/

    Description

    MoleculeNet ESOL

    ESOL (Estimated SOLubility) dataset [1], part of the MoleculeNet [2] benchmark. It is intended to be used through the scikit-fingerprints library. The task is to predict aqueous solubility. Targets are log-transformed, and the unit is log moles per litre (log mol/L).

    Characteristic       Description
    Tasks                1
    Task type            regression
    Total samples        1128
    Recommended split    scaffold
    Recommended metric   RMSE

      References
    

    [1] John S. Delaney "ESOL: Estimating… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_ESOL.
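The recommended scaffold split keeps molecules that share a core scaffold in the same subset. The sketch below shows one common greedy variant in plain Python. It assumes the scaffold keys have already been computed elsewhere (for example as Bemis-Murcko scaffolds), and the 80/10/10 fractions, the function name and the toy keys are assumptions, not the scikit-fingerprints implementation.

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split indices so that every scaffold group lands in exactly one subset.

    `scaffolds` holds one precomputed scaffold key per molecule (an assumption).
    """
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)
    # Largest groups first, so rare scaffolds tend to end up in valid/test.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test

# Hypothetical scaffold keys for ten molecules.
keys = ["A", "A", "A", "A", "B", "B", "B", "C", "D", "A"]
train, valid, test = scaffold_split(keys)
print(train, valid, test)
```

Since whole groups are assigned at once, the realized subset sizes only approximate the requested fractions.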

  5. Benchmark Data for Chemprop

    • zenodo.org
    application/gzip
    Updated Jul 24, 2023
    Cite
    Esther Heid; Kevin P. Greenman; Yunsie Chung; Shih-Cheng Li; David E. Graff; Florence H. Vermeire; Haoyang Wu; William H. Green; Charles J. McGill (2023). Benchmark Data for Chemprop [Dataset]. http://doi.org/10.5281/zenodo.8174268
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jul 24, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Esther Heid; Kevin P. Greenman; Yunsie Chung; Shih-Cheng Li; David E. Graff; Florence H. Vermeire; Haoyang Wu; William H. Green; Charles J. McGill
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets and splits of the manuscript "Chemprop: Machine Learning Package for Chemical Property Prediction." Train, validation and test splits are located within each folder, as well as additional data necessary for some of the benchmarks. To train Chemprop models, refer to our code repository to obtain ready-to-use scripts to train machine learning models for each of the systems.

    Available benchmarking systems:

    • `hiv` HIV replication inhibition from MoleculeNet and OGB with scaffold splits
    • `pcba_random` Biological activities from MoleculeNet and OGB with random splits
    • `pcba_scaffold` Biological activities from MoleculeNet and OGB with scaffold splits
    • `qm9_multitask` DFT calculated properties from MoleculeNet and OGB, trained as a multi-task model
    • `qm9_u0` DFT calculated properties from MoleculeNet and OGB, trained as a single-task model on the target U0 only
    • `qm9_gap` DFT calculated properties from MoleculeNet and OGB, trained as a single-task model on the target gap only
    • `sampl` Water-octanol partition coefficients, used to predict molecules from the SAMPL6, 7 and 9 challenges
    • `atom_bond_137k` Quantum-mechanical atom and bond descriptors
    • `bde` Bond dissociation enthalpies trained as single-task model
    • `bde_charges` Bond dissociation enthalpies trained as multi-task model together with atomic partial charges
    • `charges_eps_4` Partial charges at a dielectric constant of 4 (in protein)
    • `charges_eps_78` Partial charges at a dielectric constant of 78 (in water)
    • `barriers_e2` Reaction barrier heights of E2 reactions
    • `barriers_sn2` Reaction barrier heights of SN2 reactions
    • `barriers_cycloadd` Reaction barrier heights of cycloaddition reactions
    • `barriers_rdb7` Reaction barrier heights in the RDB7 dataset
    • `barriers_rgd1` Reaction barrier heights in the RGD1-CNHO dataset
    • `multi_molecule` UV/Vis peak absorption wavelengths in different solvents
    • `ir` IR Spectra
    • `pcqm4mv2` HOMO-LUMO gaps of the PCQM4Mv2 dataset
    • `uncertainty_ensemble` Uncertainty estimation with an ensemble on the QM9 gap dataset
    • `uncertainty_evidential` Uncertainty estimation with evidential learning on the QM9 gap dataset
    • `uncertainty_mve` Uncertainty estimation with mean-variance estimation on the QM9 gap dataset
    • `timing` Timing benchmark using subsets of QM9 gap
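
For the `uncertainty_ensemble` system listed above, a common way to turn an ensemble into an uncertainty estimate is to report the mean of the member predictions and their variance. The sketch below illustrates only that general idea in plain Python. The function name and the numbers are invented, and this is not Chemprop's implementation.

```python
from statistics import mean, pvariance

def ensemble_predict(member_predictions):
    """Combine per-model predictions into (mean, variance) for each molecule.

    `member_predictions[m][i]` is model m's prediction for molecule i.
    """
    per_molecule = list(zip(*member_predictions))  # transpose to molecule-major
    return [(mean(values), pvariance(values)) for values in per_molecule]

# Three hypothetical ensemble members predicting a property for two molecules.
preds = [
    [0.20, 0.50],
    [0.22, 0.10],
    [0.18, 0.90],
]
results = ensemble_predict(preds)
for mu, var in results:
    print(f"mean={mu:.3f} variance={var:.4f}")
```

The second molecule gets a much larger variance because the members disagree on it, which is the signal an ensemble-based uncertainty estimate relies on.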
  6. Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K....

    • service.tib.eu
    Updated Dec 2, 2024
    Cite
    Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, V. Pande (2024). MoleculeNet dataset [Dataset]. https://doi.org/10.57702/wi8voz93 (https://service.tib.eu/ldmservice/dataset/moleculenet-dataset)
    Explore at:
    Dataset updated
    Dec 2, 2024
    Description

    The MoleculeNet dataset is a benchmarking platform for molecular machine learning.

  7. MoleculeNet_SIDER

    • ollama.hf-mirror.com
    • huggingface.co
    Updated May 31, 2025
    Cite
    scikit-fingerprints (2025). MoleculeNet_SIDER [Dataset]. https://ollama.hf-mirror.com/datasets/scikit-fingerprints/MoleculeNet_SIDER
    Explore at:
    Croissant
    Dataset updated
    May 31, 2025
    Dataset authored and provided by
    scikit-fingerprints
    License

    https://choosealicense.com/licenses/unknown/

    Description

    MoleculeNet SIDER

    SIDER (Side Effect Resource) dataset [1], part of the MoleculeNet [2] benchmark. It is intended to be used through the scikit-fingerprints library. The task is to predict adverse drug reactions (ADRs), i.e. drug side effects, grouped into 27 system organ classes of the MedDRA classification. All tasks are binary.

    Characteristic       Description
    Tasks                27
    Task type            multitask classification
    Total samples        1427
    Recommended split    scaffold

    Recommended metric… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_SIDER.

  8. MoleculeNet-Hiv-split

    • ollama.hf-mirror.com
    • huggingface.co
    Updated May 31, 2025
    Cite
    Raushan Turganbay (2025). MoleculeNet-Hiv-split [Dataset]. https://ollama.hf-mirror.com/datasets/RaushanTurganbay/MoleculeNet-Hiv-split
    Explore at:
    Croissant
    Dataset updated
    May 31, 2025
    Authors
    Raushan Turganbay
    Description

    RaushanTurganbay/MoleculeNet-Hiv-split dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. MoleculeNet_FreeSolv

    • huggingface.co
    Updated Jul 10, 2024
    Cite
    scikit-fingerprints (2024). MoleculeNet_FreeSolv [Dataset]. https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_FreeSolv
    Explore at:
    Croissant
    Dataset updated
    Jul 10, 2024
    Dataset authored and provided by
    scikit-fingerprints
    License

    https://choosealicense.com/licenses/unknown/

    Description

    MoleculeNet FreeSolv

    FreeSolv (Free Solvation Database) dataset [1], part of the MoleculeNet [2] benchmark. It is intended to be used through the scikit-fingerprints library. The task is to predict the hydration free energy of small molecules in water. Targets are in kcal/mol.

    Characteristic       Description
    Tasks                1
    Task type            regression
    Total samples        642
    Recommended split    scaffold
    Recommended metric   RMSE

      References
    

    [1] Mobley, D.L., Guthrie, J.P. "FreeSolv: a… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_FreeSolv.

  10. Data from: Comparison of Cellular Morphological Descriptors and Molecular...

    • acs.figshare.com
    xlsx
    Updated Jun 5, 2023
    Cite
    Srijit Seal; Hongbin Yang; Luis Vollmers; Andreas Bender (2023). Comparison of Cellular Morphological Descriptors and Molecular Fingerprints for the Prediction of Cytotoxicity- and Proliferation-Related Assays [Dataset]. http://doi.org/10.1021/acs.chemrestox.0c00303.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    ACS Publications
    Authors
    Srijit Seal; Hongbin Yang; Luis Vollmers; Andreas Bender
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Cell morphology features, such as those from the Cell Painting assay, can be generated at relatively low costs and represent versatile biological descriptors of a system and thereby compound response. In this study, we explored cell morphology descriptors and molecular fingerprints, separately and in combination, for the prediction of cytotoxicity- and proliferation-related in vitro assay endpoints. We selected 135 compounds from the MoleculeNet ToxCast benchmark data set which were annotated with Cell Painting readouts, where the relatively small size of the data set is due to the overlap of required annotations. We trained Random Forest classification models using nested cross-validation and Cell Painting descriptors, Morgan and ErG fingerprints, and their combinations. While using leave-one-cluster-out cross-validation (with clusters based on physicochemical descriptors), models using Cell Painting descriptors achieved higher average performance over all assays (Balanced Accuracy of 0.65, Matthews Correlation Coefficient of 0.28, and AUC-ROC of 0.71) compared to models using ErG fingerprints (BA 0.55, MCC 0.09, and AUC-ROC 0.60) and Morgan fingerprints alone (BA 0.54, MCC 0.06, and AUC-ROC 0.56). While using random shuffle splits, the combination of Cell Painting descriptors with ErG and Morgan fingerprints further improved balanced accuracy on average by 8.9% (in 9 out of 12 assays) and 23.4% (in 8 out of 12 assays) compared to using only ErG and Morgan fingerprints, respectively. Regarding feature importance, Cell Painting descriptors related to nuclei texture, granularity of cells, and cytoplasm as well as cell neighbors and radial distributions were identified to be most contributing, which is plausible given the endpoint considered. 
We conclude that cell morphological descriptors contain complementary information to molecular fingerprints which can be used to improve the performance of predictive cytotoxicity models, in particular in areas of novel structural space.
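
The two threshold-based metrics reported above, balanced accuracy and the Matthews correlation coefficient, both follow from the binary confusion matrix. The sketch below computes them in plain Python. The labels are invented and unrelated to the study's data, and the helper names are hypothetical.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity; assumes both classes are present."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def mcc(y_true, y_pred):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Invented example: 4 actives and 6 inactives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
print(balanced_accuracy(y_true, y_pred), mcc(y_true, y_pred))
```

Both metrics account for class imbalance, which plain accuracy does not.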

  11. MoleculeNet_ClinTox

    • huggingface.co
    Updated Jul 7, 2024
    Cite
    scikit-fingerprints (2024). MoleculeNet_ClinTox [Dataset]. https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_ClinTox
    Explore at:
    Croissant
    Dataset updated
    Jul 7, 2024
    Dataset authored and provided by
    scikit-fingerprints
    License

    https://choosealicense.com/licenses/unknown/

    Description

    MoleculeNet ClinTox

    ClinTox dataset, part of the MoleculeNet [1] benchmark. It is intended to be used through the scikit-fingerprints library. The task is to predict drug approval viability, by predicting clinical trial toxicity and final FDA approval status. Both tasks are binary.

    Characteristic       Description
    Tasks                2
    Task type            multitask classification
    Total samples        1477
    Recommended split    scaffold
    Recommended metric   AUROC

      References
    

    [1]… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_ClinTox.

  12. Data from: From Pixels to Phenotypes: Integrating Image-Based Profiling with...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 11, 2024
    Cite
    Andreas Bender (2024). From Pixels to Phenotypes: Integrating Image-Based Profiling with Cell Health Data Improves Interpretability [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8147309
    Explore at:
    Dataset updated
    Jan 11, 2024
    Dataset provided by
    Jordi Carreras-Puigvert
    Srijit Seal
    Ola Spjuth
    Andreas Bender
    Anne E Carpenter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Code: https://github.com/srijitseal/BioMorph_Space

    Cell Painting assays generate morphological profiles that are versatile descriptors of biological systems and have been used to predict in vitro and in vivo drug effects. However, Cell Painting features are based on image statistics, and are, therefore, often not readily biologically interpretable. In this study, we introduce an approach that maps specific Cell Painting features into the BioMorph space using readouts from comprehensive Cell Health assays. We validated that the resulting BioMorph space effectively connected compounds not only with the morphological features associated with their bioactivity but with deeper insights into phenotypic characteristics and cellular processes associated with the given bioactivity. The BioMorph space revealed the mechanism of action for individual compounds, including dual-acting compounds such as emetine, an inhibitor of both protein synthesis and DNA replication. In summary, BioMorph space offers a more biologically relevant way to interpret cell morphological features from the Cell Painting assays and to generate hypotheses for experimental validation.

    The following datasets are released:

    Cell_Health_median_357_profiles_70_labels.csv: The Cell Health dataset for CRISPR perturbations. Contains median consensus signatures for the 357 consensus profiles (119 CRISPR perturbations × 3 cell lines). Ref: Way et al.

    Cell_Painitng_CRISPR_Perturbations_357_profiles_827_features_scaled.csv: The Cell Painting dataset for CRISPR perturbations. Contains 827 morphology features (and metadata annotation) for 357 consensus profiles (119 CRISPR perturbations × 3 cell lines). Ref: Way et al.

    Cell_Painting_data_658_compounds_827_Features_scaled.csv: The Cell Painting dataset for compound perturbations. Contains 658 structurally unique compounds with 827 Cell Painting features. Ref: Bray et al.

    Endpoints_9_Mitotox_biological_activities_658_compounds.csv: The biological assay activity labels for compound perturbations. Contains 658 structurally unique compounds with 9 biological activity consensus hit calls. Ref: ToxCast/MoleculeNet

    BioMoprh_pvalue_658_compunds_398_BioMorph_terms.csv: The dataset of standardised BioMorph term p-values. Contains 398 BioMorph terms for the 658 compounds in the biological activity dataset.

    References:
    Way et al. Predicting cell health phenotypes using image-based morphology profiling. Mol Biol Cell. 2021;32(9):995-1005.
    Bray et al. A dataset of images and morphological profiles of 30 000 small-molecule treatments using the Cell Painting assay. Gigascience. 2017;6(12):1-5.
    MoleculeNet: Wu et al. MoleculeNet: A benchmark for molecular machine learning. Chem Sci. 2018;9(2):513-530.
    ToxCast: Exploring ToxCast Data | US EPA https://www.epa.gov/chemical-research/exploring-toxcast-data (accessed Jul 9, 2023).

  13. Dataset of small molecules free energy in water (FreeSolv) curated and...

    • zenodo.org
    csv
    Updated Apr 24, 2025
    Cite
    Zenodo (2025). Dataset of small molecules free energy in water (FreeSolv) curated and enriched using the Enalos tools and Enalos KNIME nodes for machine learning analysis [Dataset]. http://doi.org/10.5281/zenodo.14391750
    Explore at:
    Available download formats: csv
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A curated and enriched dataset for the hydration free energy in water (FreeSolv) of small molecules, intended for in silico model development. The dataset is retrieved from MoleculeNet. The curated FreeSolv dataset comprises 642 compounds enriched with 777 molecular descriptors extracted from their 2D structure using the EnalosMold2 KNIME node.

    More curated datasets are available via chemPharos: https://db.chempharos.eu/datasets/Datasets.zul

  14. pcba_686978

    • huggingface.co
    • ollama.hf-mirror.com
    Updated Mar 18, 2023
    Cite
    Zach Nussbaum (2023). pcba_686978 [Dataset]. https://huggingface.co/datasets/zpn/pcba_686978
    Explore at:
    Croissant
    Dataset updated
    Mar 18, 2023
    Authors
    Zach Nussbaum
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for pcba_686978

      Dataset Summary
    

    pcba_686978 is a dataset included in MoleculeNet. PubChem BioAssay (PCBA) is a database consisting of biological activities of small molecules generated by high-throughput screening. We have chosen one of the larger tasks (ID 686978) as described in https://par.nsf.gov/servlets/purl/10168888.

      Dataset Structure

      Data Fields

    Each split contains

    smiles: the SMILES representation of a molecule
    selfies:… See the full description on the dataset page: https://huggingface.co/datasets/zpn/pcba_686978.

  15. Druglike molecule datasets for drug discovery

    • zenodo.org
    bin
    Updated Jan 18, 2023
    Cite
    Jonghyun Lee (2023). Druglike molecule datasets for drug discovery [Dataset]. http://doi.org/10.5281/zenodo.7547717
    Explore at:
    Available download formats: bin
    Dataset updated
    Jan 18, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jonghyun Lee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background
    Transformer-based AI models have shown outstanding performance in identifying druggable candidate molecules. In most cases, models are trained on massive databases of molecular information to capture the latent meaning of a given molecule. However, the desirable properties of candidate molecules include the feasibility of synthesizing them, low toxicity, and high druggability. In this study, we injected prior knowledge of the desirable properties of molecules during the training process.

    Methods
    Using the PubChem database (100 M), we filtered druglike molecules based on the quantitative estimate of drug-likeness (QED) score and the Pfizer rule. With this dataset of druglike molecules, we trained both the molecular representation model (chemBERTa) and the molecular generation model (MolGPT). The molecular representation model was evaluated by fine-tuning it on the MoleculeNet benchmark datasets, and the molecular generation model was evaluated based on the generated samples (10 K).

    Results
    Training with druglike molecules enabled the generation of molecules with desirable properties without any conditioning. Although the molecular representation learning model was not remarkable, its performance in predicting clinical toxicology exceeded that of conventional molecular representation models.

    Conclusion
    By training based on a dataset of druglike molecules, our approach enables molecular representation models to predict clinical toxicity more precisely. Furthermore, it enables the molecule generation model to generate molecules with desirable druglike properties without any conditional generation procedures.

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    import pickle

    # Load the pickled druglike-molecule data shipped with this record.
    with open("druglike_molecules_QED.pkl", "rb") as f:
        data = pickle.load(f)

  16. MoleculeNet_PCBA

    • huggingface.co
    Updated Jul 7, 2024
    Cite
    scikit-fingerprints (2024). MoleculeNet_PCBA [Dataset]. https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_PCBA
    Explore at:
    Croissant
    Dataset updated
    Jul 7, 2024
    Dataset authored and provided by
    scikit-fingerprints
    License

    https://choosealicense.com/licenses/unknown/

    Description

    MoleculeNet PCBA

    PCBA (PubChem BioAssay) dataset [1], part of the MoleculeNet [2] benchmark. It is intended to be used through the scikit-fingerprints library. The task is to predict biological activity against 128 bioassays, generated by high-throughput screening (HTS). All tasks are binary active/non-active. Note that targets have missing values. Algorithms should be evaluated only on present labels. For training data, you may want to impute them, e.g. with zeros.

    Characteristic… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_PCBA.

  17. ogbg_molpcba

    • tensorflow.org
    Updated Dec 14, 2022
    Cite
    (2022). ogbg_molpcba [Dataset]. https://www.tensorflow.org/datasets/catalog/ogbg_molpcba
    Explore at:
    Dataset updated
    Dec 14, 2022
    Description

    'ogbg-molpcba' is a molecular dataset sampled from PubChem BioAssay. It is a graph prediction dataset from the Open Graph Benchmark (OGB).

    This dataset is experimental, and the API is subject to change in future releases.

    The below description of the dataset is adapted from the OGB paper:

    Input Format

    All the molecules are pre-processed using RDKit ([1]).

    • Each graph represents a molecule, where nodes are atoms, and edges are chemical bonds.
    • Input node features are 9-dimensional, containing atomic number and chirality, as well as other additional atom features such as formal charge and whether the atom is in a ring.
    • Input edge features are 3-dimensional, containing bond type, bond stereochemistry, as well as an additional bond feature indicating whether the bond is conjugated.

    The exact description of all features is available at https://github.com/snap-stanford/ogb/blob/master/ogb/utils/features.py.

    Prediction

    The task is to predict 128 different biological activities (inactive/active). See [2] and [3] for more description about these targets. Not all targets apply to each molecule: missing targets are indicated by NaNs.

    References

    [1]: Greg Landrum, et al. 'RDKit: Open-source cheminformatics'. URL: https://github.com/rdkit/rdkit

    [2]: Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David Konerding and Vijay Pande. 'Massively Multitask Networks for Drug Discovery'. URL: https://arxiv.org/pdf/1502.02072.pdf

    [3]: Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2):513-530, 2018.

    To use this dataset:

    import tensorflow_datasets as tfds

    ds = tfds.load('ogbg_molpcba', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/ogbg_molpcba-0.1.3.png
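
Because missing targets are indicated by NaNs, a training loss for this task has to skip them. The sketch below shows a masked binary cross-entropy in plain Python as one way to do that. The function name, the `eps` clamp and the toy values are assumptions for illustration, not part of tensorflow_datasets or OGB.

```python
import math

def masked_bce(targets, probs, eps=1e-7):
    """Mean binary cross-entropy over the labels that are present (non-NaN)."""
    total, count = 0.0, 0
    for row_t, row_p in zip(targets, probs):
        for t, p in zip(row_t, row_p):
            if math.isnan(t):
                continue  # missing target: contributes nothing to the loss
            p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
            total -= t * math.log(p) + (1 - t) * math.log(1 - p)
            count += 1
    return total / count if count else 0.0

nan = float("nan")
# Toy data: 2 molecules x 2 targets, one target missing.
targets = [[1.0, nan], [0.0, 1.0]]
probs = [[0.5, 0.99], [0.5, 0.5]]
print(masked_bce(targets, probs))
```

The prediction at the missing position (0.99 here) has no effect on the result, since only the three present labels are averaged.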

  18. MoleculeNet_ToxCast

    • huggingface.co
    Updated Jul 7, 2024
    Cite
    scikit-fingerprints (2024). MoleculeNet_ToxCast [Dataset]. https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_ToxCast
    Explore at:
    Croissant
    Dataset updated
    Jul 7, 2024
    Dataset authored and provided by
    scikit-fingerprints
    License

    https://choosealicense.com/licenses/unknown/

    Description

    MoleculeNet ToxCast

    ToxCast dataset [1], part of the MoleculeNet [2] benchmark. It is intended to be used through the scikit-fingerprints library. The task is to predict 617 toxicity targets from a large library of compounds based on in vitro high-throughput screening. All tasks are binary. Note that targets have missing values. Algorithms should be evaluated only on present labels. For training data, you may want to impute them, e.g. with zeros.

    Characteristic Description

    Tasks… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_ToxCast.

  19. MoleculeNet_BACE

    • huggingface.co
    Updated Jul 7, 2024
    Cite
    scikit-fingerprints (2024). MoleculeNet_BACE [Dataset]. https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_BACE
    Explore at:
    Croissant
    Dataset updated
    Jul 7, 2024
    Dataset authored and provided by
    scikit-fingerprints
    License

    https://choosealicense.com/licenses/unknown/

    Description

    MoleculeNet BACE

    BACE dataset [1], part of the MoleculeNet [2] benchmark. It is intended to be used through the scikit-fingerprints library. The task is to predict binding results for a set of inhibitors of human β-secretase 1 (BACE-1).

    Characteristic Description

    Tasks 1

    Task type classification

    Total samples 1513

    Recommended split scaffold

    Recommended metric AUROC

    References

    [1] Govindan Subramanian et al. "Computational Modeling of β-Secretase 1 (BACE-1)… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_BACE.

  20. Conformer datasets for "Equivariant Graph Neural Networks for Toxicity...

    • zenodo.org
    xz
    Updated May 21, 2024
    Cite
    Julian Cremer; Leonardo Medrano Sandonas; Alexandre Tkatchenko; Djork-Arné Clevert; Gianni De Fabritiis (2024). Conformer datasets for "Equivariant Graph Neural Networks for Toxicity Prediction" [Dataset]. http://doi.org/10.5281/zenodo.11237635
    Explore at:
    xz (available download format)
    Dataset updated
    May 21, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Julian Cremer; Leonardo Medrano Sandonas; Alexandre Tkatchenko; Djork-Arné Clevert; Gianni De Fabritiis
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Predictive modeling of toxicity is a crucial step in the drug discovery pipeline. It can help filter out molecules with a high probability of failing in the early stages of de novo drug design. Thus, several machine learning (ML) models have been developed to predict the toxicity of molecules by combining classical ML techniques or deep neural networks with well-known molecular representations such as fingerprints or 2D graphs. But the more natural, accurate representation of molecules is expected to be defined in physical 3D space like in ab initio methods. Recent studies successfully used equivariant graph neural networks (EGNNs) for representation learning based on 3D structures to predict quantum-mechanical properties of molecules. Inspired by this, we investigated the performance of EGNNs to construct reliable ML models for toxicity prediction. We used the equivariant transformer (ET) model in TorchMD-NET for this. Eleven toxicity data sets taken from MoleculeNet, TDCommons, and ToxBenchmark have been considered to evaluate the capability of ET for toxicity prediction. Our results show that ET adequately learns 3D representations of molecules that can successfully correlate with toxicity activity, achieving good accuracies on most data sets comparable to state-of-the-art models. We also test a physicochemical property, namely, the total energy of a molecule, to inform the toxicity prediction with a physical prior. However, our work suggests that these two properties can not be related. We also provide an attention weight analysis for helping to understand the toxicity prediction in 3D space and thus increase the explainability of the ML model. In summary, our findings offer promising insights considering 3D geometry information via EGNNs and provide a straightforward way to integrate molecular conformers into ML-based pipelines for predicting and investigating toxicity prediction in physical space. 
    We expect that in the future, especially for larger, more diverse data sets, EGNNs will be an essential tool in this domain.

    PAPER

    https://pubs.acs.org/doi/full/10.1021/acs.chemrestox.3c00032

    CODE and MODELS:

    The conformer data sets and trained toxicity models will be published upon acceptance of this work. The code has been made available at https://github.com/jule-c/ET-Tox, and the processed data as well as pretrained models for training and testing can be downloaded from https://zenodo.org/record/7942946. We can provide the full list of conformers as XYZ files upon request.
