Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
MoleculeNet Benchmark (website)
MoleculeNet is a benchmark designed for testing machine learning methods on molecular properties. To facilitate the development of molecular machine learning methods, this work curates a number of dataset collections and creates a suite of software that implements many known featurizations and previously proposed algorithms. All methods and datasets are integrated as parts of the open source DeepChem package (MIT license). MoleculeNet… See the full description on the dataset page: https://huggingface.co/datasets/katielink/moleculenet-benchmark.
https://choosealicense.com/licenses/unknown/
MoleculeNet Lipophilicity
Lipophilicity dataset, part of the MoleculeNet [1] benchmark. It is intended to be used through the scikit-fingerprints library. The task is to predict the octanol/water distribution coefficient (logD) at pH 7.4. Targets are already log-transformed and are unitless.
Characteristic | Description
Tasks | 1
Task type | regression
Total samples | 4200
Recommended split | scaffold
Recommended metric | RMSE
References
[1] Wu, Zhenqin, et al.… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_Lipophilicity.
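The recommended metric for this regression task is RMSE. As a minimal sketch of how that score is computed, the following uses plain Python; the target and prediction values are hypothetical and are not taken from the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical logD targets and predictions, for illustration only.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
score = rmse(y_true, y_pred)
print(round(score, 4))
```

Because the targets are already log-transformed, the RMSE is expressed directly in logD units.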
https://choosealicense.com/licenses/unknown/
MoleculeNet Tox21
Tox21 dataset [1], part of the MoleculeNet [2] benchmark. It is intended to be used through the scikit-fingerprints library. The task is to predict 12 toxicity targets, including nuclear receptors and stress response pathways. All tasks are binary. Note that targets have missing values. Algorithms should be evaluated only on present labels. For training data, you may want to impute the missing labels, e.g. with zeros.
Characteristic Description
Tasks 12
Task type multitask… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_Tox21.
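The note about missing targets can be sketched as follows: impute zeros only for the training targets, and score each task only on the labels that are actually present. This is an illustrative sketch in plain Python, not the scikit-fingerprints implementation. The helper names and the toy labels are hypothetical, and accuracy is used only to keep the example short.

```python
import math


def is_missing(label):
    """A label counts as missing when it is None or NaN."""
    return label is None or (isinstance(label, float) and math.isnan(label))


def impute_zeros(labels):
    """Replace missing labels with 0 (intended for training targets only)."""
    return [[0 if is_missing(v) else int(v) for v in row] for row in labels]


def masked_accuracy(y_true, y_pred):
    """Per-task accuracy computed only where the true label is present."""
    n_tasks = len(y_true[0])
    scores = []
    for task in range(n_tasks):
        pairs = [
            (row_t[task], row_p[task])
            for row_t, row_p in zip(y_true, y_pred)
            if not is_missing(row_t[task])
        ]
        # A task with no present labels has no defined score.
        scores.append(sum(t == p for t, p in pairs) / len(pairs) if pairs else None)
    return scores


nan = float("nan")
# Hypothetical two-task labels with missing entries (illustration only).
y_true = [[1, nan], [0, 1], [nan, 0], [1, nan]]
y_pred = [[1, 0], [1, 1], [0, 0], [1, 1]]

train_targets = impute_zeros(y_true)
task_scores = masked_accuracy(y_true, y_pred)
print(train_targets, task_scores)
```

Keeping imputation and evaluation in separate functions makes it harder to score a model against imputed labels by accident.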
https://choosealicense.com/licenses/unknown/
MoleculeNet ESOL
ESOL (Estimated SOLubility) dataset [1], part of the MoleculeNet [2] benchmark. It is intended to be used through the scikit-fingerprints library. The task is to predict aqueous solubility. Targets are log-transformed, and the unit is log moles per litre (log mol/L).
Characteristic | Description
Tasks | 1
Task type | regression
Total samples | 1128
Recommended split | scaffold
Recommended metric | RMSE
References
[1] John S. Delaney "ESOL: Estimating… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_ESOL.
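The recommended scaffold split keeps molecules that share a scaffold on the same side of the split. The sketch below shows one common greedy variant and is not necessarily what scikit-fingerprints implements. It assumes the scaffold key of each molecule has already been computed with a cheminformatics toolkit; the function name and the keys used here are hypothetical.

```python
from collections import defaultdict


def scaffold_split(scaffolds, train_frac=0.8):
    """Split indices so that no scaffold appears in both train and test.

    `scaffolds` holds one precomputed scaffold key per molecule
    (an assumption; computing the keys is out of scope here).
    """
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)
    # Largest scaffold groups first, ties broken by first occurrence.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    train_cutoff = train_frac * len(scaffolds)
    train, test = [], []
    for group in ordered:
        # A whole group goes to train only if it still fits under the cutoff.
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        else:
            test.extend(group)
    return sorted(train), sorted(test)


# Hypothetical scaffold keys for ten molecules (illustration only).
keys = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"]
train_idx, test_idx = scaffold_split(keys, train_frac=0.8)
print(train_idx, test_idx)
```

Since whole groups are assigned at once, the realized train fraction can fall below the requested one when the groups are large.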
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Datasets and splits of the manuscript "Chemprop: Machine Learning Package for Chemical Property Prediction." Train, validation and test splits are located within each folder, as well as additional data necessary for some of the benchmarks. To train Chemprop models, refer to our code repository to obtain ready-to-use scripts to train machine learning models for each of the systems.
Available benchmarking systems:
The MoleculeNet dataset is a benchmarking platform for molecular machine learning.
https://choosealicense.com/licenses/unknown/
MoleculeNet SIDER
SIDER (Side Effect Resource) dataset [1], part of the MoleculeNet [2] benchmark. It is intended to be used through the scikit-fingerprints library. The task is to predict adverse drug reactions (ADRs), i.e. drug side effects, across 27 system organ classes of the MedDRA classification. All tasks are binary.
Characteristic | Description
Tasks | 27
Task type | multitask classification
Total samples | 1427
Recommended split | scaffold
Recommended metric… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_SIDER.
RaushanTurganbay/MoleculeNet-Hiv-split dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/unknown/
MoleculeNet FreeSolv
FreeSolv (Free Solvation Database) dataset [1], part of the MoleculeNet [2] benchmark. It is intended to be used through the scikit-fingerprints library. The task is to predict the hydration free energy of small molecules in water. Targets are in kcal/mol.
Characteristic | Description
Tasks | 1
Task type | regression
Total samples | 642
Recommended split | scaffold
Recommended metric | RMSE
References
[1] Mobley, D.L., Guthrie, J.P. "FreeSolv: a… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_FreeSolv.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
Cell morphology features, such as those from the Cell Painting assay, can be generated at relatively low costs and represent versatile biological descriptors of a system and thereby compound response. In this study, we explored cell morphology descriptors and molecular fingerprints, separately and in combination, for the prediction of cytotoxicity- and proliferation-related in vitro assay endpoints. We selected 135 compounds from the MoleculeNet ToxCast benchmark data set which were annotated with Cell Painting readouts, where the relatively small size of the data set is due to the overlap of required annotations. We trained Random Forest classification models using nested cross-validation and Cell Painting descriptors, Morgan and ErG fingerprints, and their combinations. While using leave-one-cluster-out cross-validation (with clusters based on physicochemical descriptors), models using Cell Painting descriptors achieved higher average performance over all assays (Balanced Accuracy of 0.65, Matthews Correlation Coefficient of 0.28, and AUC-ROC of 0.71) compared to models using ErG fingerprints (BA 0.55, MCC 0.09, and AUC-ROC 0.60) and Morgan fingerprints alone (BA 0.54, MCC 0.06, and AUC-ROC 0.56). While using random shuffle splits, the combination of Cell Painting descriptors with ErG and Morgan fingerprints further improved balanced accuracy on average by 8.9% (in 9 out of 12 assays) and 23.4% (in 8 out of 12 assays) compared to using only ErG and Morgan fingerprints, respectively. Regarding feature importance, Cell Painting descriptors related to nuclei texture, granularity of cells, and cytoplasm as well as cell neighbors and radial distributions were identified to be most contributing, which is plausible given the endpoint considered. 
We conclude that cell morphological descriptors contain complementary information to molecular fingerprints which can be used to improve the performance of predictive cytotoxicity models, in particular in areas of novel structural space.
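The study reports Balanced Accuracy and the Matthews Correlation Coefficient. As a reminder of how both follow from a binary confusion matrix, here is a minimal sketch in plain Python. The labels are hypothetical and unrelated to the study's data, and the balanced accuracy helper assumes both classes occur in the true labels.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity and specificity (both classes must be present)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC, defined here as 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical labels: 4 actives and 6 inactives (illustration only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

counts = confusion_counts(y_true, y_pred)
ba = balanced_accuracy(*counts)
mcc = matthews_corrcoef(*counts)
print(counts, round(ba, 4), round(mcc, 4))
```

Both metrics weigh the two classes evenly, which is why they are often preferred over plain accuracy on imbalanced assay data.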
https://choosealicense.com/licenses/unknown/
MoleculeNet ClinTox
ClinTox dataset, part of the MoleculeNet [1] benchmark. It is intended to be used through the scikit-fingerprints library. The task is to predict drug approval viability by predicting clinical trial toxicity and final FDA approval status. Both tasks are binary.
Characteristic | Description
Tasks | 2
Task type | multitask classification
Total samples | 1477
Recommended split | scaffold
Recommended metric | AUROC
References
[1]… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_ClinTox.
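The recommended metric here is AUROC over two binary tasks. A minimal sketch in plain Python follows, computing AUROC per task as the fraction of correctly ordered positive/negative pairs and then averaging across tasks. Averaging per-task scores is an assumption about how the multitask score is aggregated, and the labels and scores are hypothetical.

```python
def auroc(y_true, scores):
    """AUROC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_auroc(y_true_by_task, scores_by_task):
    """Average AUROC over tasks (assumed aggregation), plus per-task values."""
    per_task = [auroc(t, s) for t, s in zip(y_true_by_task, scores_by_task)]
    return sum(per_task) / len(per_task), per_task


# Hypothetical labels and predicted scores for two tasks (illustration only).
labels = [[0, 0, 1, 1], [1, 0, 1, 0]]
preds = [[0.1, 0.4, 0.35, 0.8], [0.9, 0.2, 0.6, 0.6]]

average, per_task = mean_auroc(labels, preds)
print(average, per_task)
```

The pairwise formulation is quadratic in the number of samples, which is fine for a dataset of this size; a rank-based implementation would scale better.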
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Code: https://github.com/srijitseal/BioMorph_Space
Cell Painting assays generate morphological profiles that are versatile descriptors of biological systems and have been used to predict in vitro and in vivo drug effects. However, Cell Painting features are based on image statistics, and are, therefore, often not readily biologically interpretable. In this study, we introduce an approach that maps specific Cell Painting features into the BioMorph space using readouts from comprehensive Cell Health assays. We validated that the resulting BioMorph space effectively connected compounds not only with the morphological features associated with their bioactivity but with deeper insights into phenotypic characteristics and cellular processes associated with the given bioactivity. The BioMorph space revealed the mechanism of action for individual compounds, including dual-acting compounds such as emetine, an inhibitor of both protein synthesis and DNA replication. In summary, BioMorph space offers a more biologically relevant way to interpret cell morphological features from the Cell Painting assays and to generate hypotheses for experimental validation.
The following datasets are released:
Cell_Health_median_357_profiles_70_labels.csv: The Cell Health dataset for CRISPR perturbations. Contains median consensus signatures for the 357 consensus profiles (119 CRISPR perturbations × 3 cell lines). Ref: Way et al.
Cell_Painitng_CRISPR_Perturbations_357_profiles_827_features_scaled.csv: The Cell Painting dataset for CRISPR perturbations. Contains 827 morphology features (and metadata annotation) for 357 consensus profiles (119 CRISPR perturbations × 3 cell lines). Ref: Way et al.
Cell_Painting_data_658_compounds_827_Features_scaled.csv: The Cell Painting dataset for compound perturbations. Contains 658 structurally unique compounds with 827 Cell Painting features. Ref: Bray et al.
Endpoints_9_Mitotox_biological_activities_658_compounds.csv: The biological assay activity labels for compound perturbations. Contains 658 structurally unique compounds with 9 biological activity consensus hit calls. Ref: ToxCast/MoleculeNet.
BioMoprh_pvalue_658_compunds_398_BioMorph_terms.csv: The dataset of standardised BioMorph term p-values. Contains 398 BioMorph terms for the 658 compounds in the biological activity dataset.
References:
Way et al. Predicting cell health phenotypes using image-based morphology profiling. Mol Biol Cell. 2021;32(9):995-1005.
Bray et al. A dataset of images and morphological profiles of 30 000 small-molecule treatments using the Cell Painting assay. Gigascience. 2017;6(12):1-5.
MoleculeNet: Wu et al. MoleculeNet: A benchmark for molecular machine learning. Chem Sci. 2018;9(2):513-530.
ToxCast: Exploring ToxCast Data | US EPA https://www.epa.gov/chemical-research/exploring-toxcast-data (accessed Jul 9, 2023).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
A curated and enriched dataset for the hydration free energy in water (FreeSolv) of small molecules, intended for in silico model development. The dataset is retrieved from MoleculeNet. The curated FreeSolv dataset comprises 642 compounds enriched with 777 molecular descriptors extracted from their 2D structure using EnalosMold2 KNIME node.
More curated datasets are available via chemPharos: https://db.chempharos.eu/datasets/Datasets.zul
MIT License: https://opensource.org/licenses/MIT
Dataset Card for pcba_686978
Dataset Summary
pcba_686978 is a dataset included in MoleculeNet. PubChem BioAssay (PCBA) is a database consisting of biological activities of small molecules generated by high-throughput screening. We have chosen one of the larger tasks (ID 686978) as described in https://par.nsf.gov/servlets/purl/10168888.
Dataset Structure
Data Fields
Each split contains
smiles: the SMILES representation of a molecule
selfies:… See the full description on the dataset page: https://huggingface.co/datasets/zpn/pcba_686978.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Background
Transformer-based AI models have shown outstanding performance in identifying druggable candidate molecules. In most cases, models are trained on massive databases of molecular information to capture the latent meaning of a given molecule. However, the desirable properties of candidate molecules include the feasibility of synthesizing them, low toxicity, and high druggability. In this study, we injected prior knowledge of the desirable properties of molecules during the training process.
Methods
Using the PubChem database (100 M), we filtered druglike molecules based on the quantitative estimate of drug-likeness (QED) score and the Pfizer rule. With this dataset of druglike molecules, we trained both the molecular representation model (chemBERTa) and the molecular generation model (MolGPT). The molecular representation model was evaluated by fine-tuning on the MoleculeNet benchmark datasets, and the molecular generation model was evaluated based on the generated samples (10 K).
Results
Training with druglike molecules enabled the generation of molecules with desirable properties without any conditioning. Although the molecular representation learning model was not remarkable overall, its performance in predicting clinical toxicology exceeded that of conventional molecular representation models.
Conclusion
By training based on a dataset of druglike molecules, our approach enables molecular representation models to predict clinical toxicity more precisely. Furthermore, it enables the molecule generation model to generate molecules with desirable druglike properties without any conditional generation procedures.
import pickle

# Load the pickled set of druglike molecules (only unpickle files you trust).
with open("druglike_molecules_QED.pkl", "rb") as f:
    data = pickle.load(f)
https://choosealicense.com/licenses/unknown/
MoleculeNet PCBA
PCBA (PubChem BioAssay) dataset [1], part of the MoleculeNet [2] benchmark. It is intended to be used through the scikit-fingerprints library. The task is to predict biological activity against 128 bioassays, generated by high-throughput screening (HTS). All tasks are binary active/non-active. Note that targets have missing values. Algorithms should be evaluated only on present labels. For training data, you may want to impute the missing labels, e.g. with zeros.
Characteristic… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_PCBA.
'ogbg-molpcba' is a molecular dataset sampled from PubChem BioAssay. It is a graph prediction dataset from the Open Graph Benchmark (OGB).
This dataset is experimental, and the API is subject to change in future releases.
The below description of the dataset is adapted from the OGB paper:
All the molecules are pre-processed using RDKit ([1]).
The exact description of all features is available at https://github.com/snap-stanford/ogb/blob/master/ogb/utils/features.py.
The task is to predict 128 different biological activities (inactive/active). See [2] and [3] for more description about these targets. Not all targets apply to each molecule: missing targets are indicated by NaNs.
[1]: Greg Landrum, et al. 'RDKit: Open-source cheminformatics'. URL: https://github.com/rdkit/rdkit
[2]: Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David Konerding and Vijay Pande. 'Massively Multitask Networks for Drug Discovery'. URL: https://arxiv.org/pdf/1502.02072.pdf
[3]: Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2):513-530, 2018.
To use this dataset:
import tensorflow_datasets as tfds

# Load the training split and print the first four examples.
ds = tfds.load('ogbg_molpcba', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/ogbg_molpcba-0.1.3.png
https://choosealicense.com/licenses/unknown/
MoleculeNet ToxCast
ToxCast dataset [1], part of the MoleculeNet [2] benchmark. It is intended to be used through the scikit-fingerprints library. The task is to predict 617 toxicity targets from a large library of compounds based on in vitro high-throughput screening. All tasks are binary. Note that targets have missing values. Algorithms should be evaluated only on present labels. For training data, you may want to impute the missing labels, e.g. with zeros.
Characteristic Description
Tasks… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_ToxCast.
https://choosealicense.com/licenses/unknown/
MoleculeNet BACE
BACE dataset [1], part of the MoleculeNet [2] benchmark. It is intended to be used through the scikit-fingerprints library. The task is to predict binding results for a set of inhibitors of human β-secretase 1 (BACE-1).
Characteristic | Description
Tasks | 1
Task type | classification
Total samples | 1513
Recommended split | scaffold
Recommended metric | AUROC
References
[1] Govindan Subramanian et al. "Computational Modeling of β-Secretase 1 (BACE-1)… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_BACE.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Predictive modeling of toxicity is a crucial step in the drug discovery pipeline. It can help filter out molecules with a high probability of failing in the early stages of de novo drug design. Thus, several machine learning (ML) models have been developed to predict the toxicity of molecules by combining classical ML techniques or deep neural networks with well-known molecular representations such as fingerprints or 2D graphs. But the more natural, accurate representation of molecules is expected to be defined in physical 3D space like in ab initio methods. Recent studies successfully used equivariant graph neural networks (EGNNs) for representation learning based on 3D structures to predict quantum-mechanical properties of molecules. Inspired by this, we investigated the performance of EGNNs to construct reliable ML models for toxicity prediction. We used the equivariant transformer (ET) model in TorchMD-NET for this. Eleven toxicity data sets taken from MoleculeNet, TDCommons, and ToxBenchmark have been considered to evaluate the capability of ET for toxicity prediction. Our results show that ET adequately learns 3D representations of molecules that can successfully correlate with toxicity activity, achieving good accuracies on most data sets comparable to state-of-the-art models. We also test a physicochemical property, namely, the total energy of a molecule, to inform the toxicity prediction with a physical prior. However, our work suggests that these two properties can not be related. We also provide an attention weight analysis for helping to understand the toxicity prediction in 3D space and thus increase the explainability of the ML model. In summary, our findings offer promising insights considering 3D geometry information via EGNNs and provide a straightforward way to integrate molecular conformers into ML-based pipelines for predicting and investigating toxicity prediction in physical space. 
We expect that in the future, especially for larger, more diverse data sets, EGNNs will be an essential tool in this domain.
PAPER
https://pubs.acs.org/doi/full/10.1021/acs.chemrestox.3c00032
CODE and MODELS:
The conformer data sets and trained toxicity models will be published upon acceptance of this work. The code has been made available at https://github.com/jule-c/ET-Tox, and the processed data as well as pretrained models for training and testing can be downloaded from https://zenodo.org/record/7942946. We can provide the full list of conformers as XYZ files upon request.