61 datasets found
  1. Scaffold Split

    • figshare.com
    csv
    Updated Apr 30, 2025
    Cite
    Amitesh Badkul (2025). Scaffold Split [Dataset]. http://doi.org/10.6084/m9.figshare.28908170.v1
    Explore at:
    Available download formats: csv
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    Figshare, http://figshare.com/
    Authors
    Amitesh Badkul
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Scaffold split for the ChEMBL 33 dataset across three seeds.

  2. BM Scaffold split 10folds

    • kaggle.com
    zip
    Updated Apr 7, 2021
    Cite
    ITK8191 (2021). BM Scaffold split 10folds [Dataset]. https://www.kaggle.com/itsuki9180/bm-scaffold-split-10folds
    Explore at:
    Available download formats: zip (2731216438 bytes)
    Dataset updated
    Apr 7, 2021
    Authors
    ITK8191
    Description

    Dataset

    This dataset was created by ITK8191

    Contents

  3. MolInst_FS_125K_Scaffold_SMILES-MMChat

    • huggingface.co
    Updated Nov 4, 2024
    + more versions
    Cite
    OpenMol (2024). MolInst_FS_125K_Scaffold_SMILES-MMChat [Dataset]. https://huggingface.co/datasets/OpenMol/MolInst_FS_125K_Scaffold_SMILES-MMChat
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 4, 2024
    Dataset authored and provided by
    OpenMol
    Description

    Forward Reaction Prediction Dataset (derived from MolInstruct)

    Molecule representation format: 1D SMILES, which are further encoded into 2D graph features.

    We use scaffold splitting to reconstruct the train-split. We use SMolInstruct FS train split as the sample pool.

    For details, refer to PRESTO: Progressive Pretraining Enhances Synthetic Chemistry Outcomes: https://arxiv.org/pdf/2406.13193

  4. Chemical structures, Cell Painting and transcriptional profiles for compound...

    • zenodo.org
    • datasetcatalog.nlm.nih.gov
    • +2 more
    zip
    Updated Apr 5, 2023
    Cite
    Nikita Moshkov; Tim Becker; Kevin Yang; Peter Horvath; Vlado Dancik; Bridget K. Wagner; Paul A. Clemons; Shantanu Singh; Anne E. Carpenter; Juan C. Caicedo (2023). Chemical structures, Cell Painting and transcriptional profiles for compound bioactivity prediction. [Dataset]. http://doi.org/10.5281/zenodo.7729583
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 5, 2023
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Nikita Moshkov; Tim Becker; Kevin Yang; Peter Horvath; Vlado Dancik; Bridget K. Wagner; Paul A. Clemons; Shantanu Singh; Anne E. Carpenter; Juan C. Caicedo
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the data, both input and produced, for the paper "Predicting compound activity from phenotypic profiles and chemical structures".

    This data can be merged with the paper's GitHub repository for reproduction.

    Folders and files are described below:

    ├── assay_data
      ├── assay_matrix_discrete_270_assays.csv Assay matrix with hits for assays (270) and compounds (16170). Note that this is the final file that we used to produce splits.
      ├── assay_metadata.csv Assay metadata
      ├── broad_ids.txt List of broad ids used in this study. That is an unfiltered list of compounds required by some analysis scripts.
      ├── smiles.txt Same as broad_ids.txt, but SMILES strings.
    
    ├── feature_data (for 16978 compounds, can be masked with ./misc/compounds16978to16170.npy)
      ├── cp.npz Classical chemical features
      ├── ge.npz Gene expression features
      ├── ge_scale.npz Gene expression scaled features
      ├── mo.npz Morphology features (not batch corrected)
      ├── mobc.npz Morphology features (batch corrected)
    
    ├── misc
      ├── compound_analysis.npz Compounds in the dataset identified as PAINS
      ├── compounds16978to16170.npy Used to filter features from the bigger set of compounds to the final one
      ├── fingerprints.npz Calculated fingerprints of compounds, which were then used to calculate similarity
      ├── similarity_fingerprints.npz Similarity matrix for compounds (16978)
      ├── population_normalized.csv.gz Well-level morphological profiles that were used for batch correction
      ├── Table for PUMA Excel file with additional data and plots
    
    ├── predictions
      ├── scaffold_median(mean)_AUC.csv Aggregated median(mean) AUC scores over scaffold-based cross-validation splits. In the paper, median results were reported.
      ├── scaffold_median(mean)_EF.csv Aggregated median(mean) enrichment factor (EF) over scaffold-based cross-validation splits. In the paper, median results were reported.
      ├── toprank_chemical_cv{}_hitsnorm.csv Those files are needed to create enrichment plots and contain hit rate and top rank hit rate.
      ├── Each folder here stands for an experiment type; the number in the folder name is the number of the split. Inside each folder there are the following elements:
       ├── predictions Folder with predictions for each assay-compound pair for each modality
       ├── 2022_01_evaluation_all_data.csv File with AUC scores for each assay for the test set in the split
       ├── 2022_01_evaluation_all_data_EF.csv File with enrichment factor (EF) values for each assay for the test set in the split. Those files exist only for *chemical* folders.
       ├── assay_matrix_discrete_train(test)_old_scaff.csv Training and test subsets of data for the split. The first column contains broad_id.
       ├── assay_matrix_discrete_train(test)_old_scaff.csv Same, but SMILES strings in the first column. Those files are used as input to ChemProp!

       Experiments in this folder are the following:
       - chemical Scaffold-based 5-fold cross-validation splits; the main results in the paper are reported with this series of experiments.
       - chemical_bal Same splits as in chemical, but training was run with ChemProp built-in data balancing.
       - chemical_st Same splits as in chemical, but separate models were trained for each assay.
       - CV Random 5-fold cross-validation splits.
       - GE 5-fold cross-validation splits based on same-size clustering of gene expression features.
       - MOBC 5-fold cross-validation splits based on same-size clustering of batch-corrected morphology features.
       - random 10 random splits, ~80% of compounds in the training set and the rest in the test set.
    
    ├── splitting This folder contains numpy files which help to match compounds and features to create training and test sets for a split, which can be reused in the analysis notebook for data preparation.
      ├── scaffold_based_split.npz Splitting for scaffold-based splits.
      ├── random_split_{}.npz Random split indices of test set compounds (10 files).
      ├── cross_validation_indicies.npz Indices for random cross-validation splits
      ├── GE_clusters_size_constrained.npz Indices of clusters of same-size clustering for gene-expression features.
      ├── MOBC_clusters_size_constrained.npz Indices of clusters of same-size clustering for batch-corrected morphology features.
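
The enrichment factor (EF) files listed under predictions can be illustrated with a minimal sketch. It assumes one common definition of EF (the hit rate among the top-scoring fraction of compounds divided by the overall hit rate); the paper's exact definition and cutoff may differ, and the function name, argument names, and the 1% default are hypothetical.

```python
import math


def enrichment_factor(scores, labels, top_fraction=0.01):
    """Hit rate in the top-scoring fraction divided by the overall hit rate.

    scores: predicted scores, higher means more likely to be a hit.
    labels: 1 for a hit, 0 otherwise (same order as scores).
    top_fraction: assumed cutoff; not taken from the paper.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and of equal length")
    n_top = max(1, math.ceil(len(scores) * top_fraction))
    # Rank compounds from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    hits_all = sum(labels)
    if hits_all == 0:
        return float("nan")  # EF is undefined when the assay has no hits
    return (hits_top / n_top) / (hits_all / len(labels))
```

An EF above 1 means hits are over-represented among the top-ranked compounds relative to random selection.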

  5. Data from: Predicting Critical Properties and Acentric Factors of Fluids...

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    zip
    Updated Jul 24, 2023
    Cite
    Sayandeep Biswas; Yunsie Chung; Josephine Ramirez; Haoyang Wu; William H. Green (2023). Predicting Critical Properties and Acentric Factors of Fluids Using Multitask Machine Learning [Dataset]. http://doi.org/10.1021/acs.jcim.3c00546.s002
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 24, 2023
    Dataset provided by
    ACS Publications
    Authors
    Sayandeep Biswas; Yunsie Chung; Josephine Ramirez; Haoyang Wu; William H. Green
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Knowledge of critical properties, such as critical temperature, pressure, density, as well as acentric factor, is essential to calculate thermo-physical properties of chemical compounds. Experiments to determine critical properties and acentric factors are expensive and time intensive; therefore, we developed a machine learning (ML) model that can predict these molecular properties given the SMILES representation of a chemical species. We explored directed message passing neural network (D-MPNN) and graph attention network as ML architecture choices. Additionally, we investigated featurization with additional atomic and molecular features, multitask training, and pretraining using estimated data to optimize model performance. Our final model utilizes a D-MPNN layer to learn the molecular representation and is supplemented by Abraham parameters. A multitask training scheme was used to train a single model to predict all the critical properties and acentric factors along with boiling point, melting point, enthalpy of vaporization, and enthalpy of fusion. The model was evaluated on both random and scaffold splits where it shows state-of-the-art accuracies. The extensive data set of critical properties and acentric factors contains 1144 chemical compounds and is made available in the public domain together with the source code that can be used for further exploration.

  6. Benchmark Data for Chemprop

    • zenodo.org
    application/gzip
    Updated Jul 24, 2023
    + more versions
    Cite
    Esther Heid; Kevin P. Greenman; Yunsie Chung; Shih-Cheng Li; David E. Graff; Florence H. Vermeire; Haoyang Wu; William H. Green; Charles J. McGill (2023). Benchmark Data for Chemprop [Dataset]. http://doi.org/10.5281/zenodo.8174268
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jul 24, 2023
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Esther Heid; Kevin P. Greenman; Yunsie Chung; Shih-Cheng Li; David E. Graff; Florence H. Vermeire; Haoyang Wu; William H. Green; Charles J. McGill
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets and splits of the manuscript "Chemprop: Machine Learning Package for Chemical Property Prediction." Train, validation and test splits are located within each folder, as well as additional data necessary for some of the benchmarks. To train Chemprop models, refer to our code repository to obtain ready-to-use scripts to train machine learning models for each of the systems.

    Available benchmarking systems:

    • `hiv` HIV replication inhibition from MoleculeNet and OGB with scaffold splits
    • `pcba_random` Biological activities from MoleculeNet and OGB with random splits
    • `pcba_scaffold` Biological activities from MoleculeNet and OGB with scaffold splits
    • `qm9_multitask` DFT calculated properties from MoleculeNet and OGB, trained as a multi-task model
    • `qm9_u0` DFT calculated properties from MoleculeNet and OGB, trained as a single-task model on the target U0 only
    • `qm9_gap` DFT calculated properties from MoleculeNet and OGB, trained as a single-task model on the target gap only
    • `sampl` Water-octanol partition coefficients, used to predict molecules from the SAMPL6, 7 and 9 challenges
    • `atom_bond_137k` Quantum-mechanical atom and bond descriptors
    • `bde` Bond dissociation enthalpies trained as single-task model
    • `bde_charges` Bond dissociation enthalpies trained as multi-task model together with atomic partial charges
    • `charges_eps_4` Partial charges at a dielectric constant of 4 (in protein)
    • `charges_eps_78` Partial charges at a dielectric constant of 78 (in water)
    • `barriers_e2` Reaction barrier heights of E2 reactions
    • `barriers_sn2` Reaction barrier heights of SN2 reactions
    • `barriers_cycloadd` Reaction barrier heights of cycloaddition reactions
    • `barriers_rdb7` Reaction barrier heights in the RDB7 dataset
    • `barriers_rgd1` Reaction barrier heights in the RGD1-CNHO dataset
    • `multi_molecule` UV/Vis peak absorption wavelengths in different solvents
    • `ir` IR Spectra
    • `pcqm4mv2` HOMO-LUMO gaps of the PCQM4Mv2 dataset
    • `uncertainty_ensemble` Uncertainty estimation using an ensemble using the QM9 gap dataset
    • `uncertainty_evidential` Uncertainty estimation using evidential learning using the QM9 gap dataset
    • `uncertainty_mve` Uncertainty estimation using mean-variance estimation using the QM9 gap dataset
    • `timing` Timing benchmark using subsets of QM9 gap
  7. ro4_vs_d2

    • huggingface.co
    Updated May 6, 2025
    Cite
    Varunika (2025). ro4_vs_d2 [Dataset]. https://huggingface.co/datasets/vmsavla/ro4_vs_d2
    Explore at:
    Dataset updated
    May 6, 2025
    Authors
    Varunika
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains docking scores for over 15 million compounds from the Enamine REAL library screened against the d2 protein target using structure-based virtual screening. The data was originally published by Luttens et al. (2022), and includes sanitized SMILES strings, compound identifiers, docking scores, and Bemis-Murcko scaffolds. The dataset is split using scaffold-based splitting to support robust machine learning benchmarking.

  8. Dataset for "ConfSolv: Prediction of solute conformer free energies across a...

    • data.niaid.nih.gov
    Updated Oct 25, 2023
    Cite
    Lagnajit Pattanaik; Angiras Menon; Volker Settels; Kevin A. Spiekermann; Zipei Tan; Florence Vermeire; Frederik Sandfort; Philipp Eiden; William H. Green (2023). Dataset for "ConfSolv: Prediction of solute conformer free energies across a range of solvents" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8292519
    Explore at:
    Dataset updated
    Oct 25, 2023
    Dataset provided by
    Katholieke Universiteit Leuven
    BASF SE Scientific Modelling
    Massachusetts Institute of Technology
    Authors
    Lagnajit Pattanaik; Angiras Menon; Volker Settels; Kevin A. Spiekermann; Zipei Tan; Florence Vermeire; Frederik Sandfort; Philipp Eiden; William H. Green
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains three archives. The first archive, full_dataset.zip, contains geometries and free energies for nearly 44,000 solute molecules with almost 9 million conformers, in 42 different solvents. The geometries and gas phase free energies are computed using density functional theory (DFT). The solvation free energy for each conformer is computed using COSMO-RS and the solution free energies are computed using the sum of the gas phase free energies and the solvation free energies. The geometries for each solute conformer are provided as ASE_atoms_objects within a pandas DataFrame, found in the compressed file dft coords.pkl.gz within full_dataset.zip. The gas-phase energies, solvation free energies, and solution free energies are also provided as a pandas DataFrame in the compressed file free_energy.pkl.gz within full_dataset.zip. Ten example data splits for both random and scaffold split types are also provided in the ZIP archive for training models. Scaffold split index 0 is used to generate results in the corresponding publication. The second archive, refined_conf_search.zip, contains geometries and free energies for a representative sample of 28 solute molecules from the full dataset that were subject to a refined conformer search and thus had more conformers located. The format of the data is identical to full_dataset.zip. The third archive contains one folder for each solvent for which we have provided free energies in full_dataset.zip. Each folder contains the .cosmo file for every solvent conformer used in the COSMOtherm calculations, a dummy input file for the COSMOtherm calculations, and a CSV file that contains the electronic energy of each solvent conformer that needs to be substituted for "EH_Line" in the dummy input file.
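
The relation stated in this description (solution free energy is the sum of the gas-phase free energy and the solvation free energy) can be sketched per conformer as follows. The dictionary keys, conformer identifiers, and numeric values are hypothetical and do not reflect the actual column names in free_energy.pkl.gz; both energies are assumed to be in the same unit.

```python
def solution_free_energy(g_gas, dg_solv):
    """Solution free energy as gas-phase free energy plus solvation free energy."""
    return g_gas + dg_solv


def rank_conformers(conformers):
    """Return (conformer id, solution free energy) pairs, lowest energy first.

    conformers: list of dicts with hypothetical keys 'id', 'g_gas', 'dg_solv'.
    """
    totals = [
        (c["id"], solution_free_energy(c["g_gas"], c["dg_solv"]))
        for c in conformers
    ]
    return sorted(totals, key=lambda pair: pair[1])


# Illustrative values only; they are not taken from the dataset.
example = [
    {"id": "conf_a", "g_gas": -10.0, "dg_solv": -2.5},
    {"id": "conf_b", "g_gas": -11.0, "dg_solv": -0.5},
]
ranked = rank_conformers(example)
```

Because the solvation term differs between conformers, the ordering in solution can differ from the gas-phase ordering, as in the illustrative values above.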

  9. Synthesis of a Family of Spirocyclic Scaffolds: Building Blocks for the...

    • acs.figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Sarvesh Kumar; Paul D. Thornton; Thomas O. Painter; Prashi Jain; Jared Downard; Justin T. Douglas; Conrad Santini (2023). Synthesis of a Family of Spirocyclic Scaffolds: Building Blocks for the Exploration of Chemical Space [Dataset]. http://doi.org/10.1021/jo400738b.s002
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Sarvesh Kumar; Paul D. Thornton; Thomas O. Painter; Prashi Jain; Jared Downard; Justin T. Douglas; Conrad Santini
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This report describes the preparation of a series of 17 novel racemic spirocyclic scaffolds that are intended for the creation of compound libraries by parallel synthesis for biological screening. Each scaffold features two points of orthogonal diversification. The scaffolds are related to each other in four ways: (1) through stepwise changes in the size of the nitrogen-bearing ring; (2) through the oxidation state of the carbon-centered point of diversification; (3) through the relative stereochemical orientation of the two diversification sites in those members that are stereogenic; and (4) through the provision of both saturated and unsaturated versions of the furan ring in the scaffold series derived from 3-piperidone. The scaffolds provide incremental changes in the relative orientation of the diversity components that would be introduced onto them. The scaffolds feature high sp3 carbon content which is essential for the three-dimensional exploration of chemical space. This characteristic is particularly evident in those members of this family that bear two stereocenters, i.e., the two series derived from 3-piperidone and 3-pyrrolidinone. In the series derived from 3-piperidone we were able to “split the difference” between the two diastereomers by preparation of their corresponding unsaturated version.

  10. zinc10M

    • huggingface.co
    Updated Feb 7, 2024
    Cite
    SHJ (2024). zinc10M [Dataset]. https://huggingface.co/datasets/jarod0411/zinc10M
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 7, 2024
    Authors
    SHJ
    Description

    zinc dataset with React=Standard, Purch=In-Stock and Drug-Like (about 11M)

    preprocess:

    canonicalize: Chem.MolToSmiles(Chem.MolFromSmiles(mol), True)

    compute scaffold: Chem.MolToSmiles(MurckoScaffold.GetScaffoldForMol(Chem.MolFromSmiles(mol)), True)

    convert smiles and scaffold smiles to selfies respectively, using selfies: sf.encoder(smiles)

    filter out all molecules whose scaffold is empty ("") or cannot be converted to selfies

    90/10 train/validation split

  11. Chig-AIMD scaffold train

    • colabfit.org
    • huggingface.co
    Updated Apr 17, 2025
    + more versions
    Cite
    Tong Wang; Xinheng He; Mingyu Li; Bin Shao; Tie-Yan Liu (2025). Chig-AIMD scaffold train [Dataset]. https://colabfit.org/id/DS_7puixss6qd61_0
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset provided by
    ColabFit
    Authors
    Tong Wang; Xinheng He; Mingyu Li; Bin Shao; Tie-Yan Liu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Training configurations from the 'scaffold' split of Chig-AIMD. This dataset covers the conformational space of chignolin with DFT-level precision. We sequentially applied replica exchange molecular dynamics (REMD), conventional MD, and ab initio MD (AIMD) simulations on a 10 amino acid protein, Chignolin, and finally collected 2 million biomolecule structures with quantum level energy and force records.

  12. MoleculeNet_FreeSolv

    • huggingface.co
    Updated Oct 3, 2025
    Cite
    scikit-fingerprints (2025). MoleculeNet_FreeSolv [Dataset]. https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_FreeSolv
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Oct 3, 2025
    Dataset authored and provided by
    scikit-fingerprints
    License

    https://choosealicense.com/licenses/unknown/

    Description

    MoleculeNet FreeSolv

    FreeSolv (Free Solvation Database) dataset [1], part of MoleculeNet [2] benchmark. It is intended to be used through scikit-fingerprints library. The task is to predict hydration free energy of small molecules in water. Targets are in kcal/mol.

    Characteristic       Description
    Tasks                1
    Task type            regression
    Total samples        642
    Recommended split    scaffold
    Recommended metric   RMSE

      References
    

    [1] Mobley, D.L., Guthrie, J.P. "FreeSolv: a… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_FreeSolv.

  13. [Therapeutics Data Commons] Acute Toxicity LD50

    • kaggle.com
    Updated Jan 11, 2025
    Cite
    Seongjin Kim (2025). [Therapeutics Data Commons] Acute Toxicity LD50 [Dataset]. https://www.kaggle.com/datasets/iapetus509/therapeutics-data-commons-acute-toxicity-ld50
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jan 11, 2025
    Dataset provided by
    Kaggle
    Authors
    Seongjin Kim
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    https://tdcommons.ai/single_pred_tasks/tox#acute-toxicity-ld50

    Dataset Description: Acute toxicity LD50 measures the most conservative dose that can lead to lethal adverse effects. The higher the dose, the more lethal of a drug. This dataset is kindly provided by the authors of [1].

    Task Description: Regression. Given a drug SMILES string, predict its acute toxicity.

    Dataset Statistics: 7,385 drugs.

    Dataset Split: Random Split, Scaffold Split

    from tdc.single_pred import Tox
    data = Tox(name = 'LD50_Zhu')
    split = data.get_split()

    References: [1] Zhu, Hao, et al. “Quantitative structure-activity relationship modeling of rat acute toxicity by oral exposure.” Chemical research in toxicology 22.12 (2009): 1913-1921.

    Dataset License: CC BY 4.0.

  14. Protein Scaffold-Activated Protein Trans-Splicing in Mammalian Cells

    • acs.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Daniel F. Selgrade; Jason J. Lohmueller; Florian Lienert; Pamela A. Silver (2023). Protein Scaffold-Activated Protein Trans-Splicing in Mammalian Cells [Dataset]. http://doi.org/10.1021/ja401689b.s002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Daniel F. Selgrade; Jason J. Lohmueller; Florian Lienert; Pamela A. Silver
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Conditional protein splicing is a powerful biotechnological tool that can be used to rapidly and post-translationally control the activity of a given protein. Here we demonstrate a novel conditional splicing system in which a genetically encoded protein scaffold induces the splicing and activation of an enzyme in mammalian cells. In this system the protein scaffold binds to two inactive split intein/enzyme extein protein fragments leading to intein fragment complementation, splicing, and activation of the firefly luciferase enzyme. We first demonstrate the ability of antiparallel coiled-coils (CCs) to mediate splicing between two intein fragments, effectively creating two new split inteins. We then generate and test two versions of the scaffold-induced splicing system using two pairs of CCs. Finally, we optimize the linker lengths of the proteins in the system and demonstrate 13-fold activation of luciferase by the scaffold compared to the activity of negative controls. Our protein scaffold-triggered conditional splicing system is an effective strategy to control enzyme activity using a protein input, enabling enhanced genetic control over protein splicing and the potential creation of splicing-based protein sensors and autoregulatory systems.

  15. Data sets and machine learning models for: Predicting critical properties of...

    • zenodo.org
    bin, zip
    Updated Oct 1, 2023
    Cite
    Sayandeep Biswas; Yunsie Chung; Josephine Ramirez; Haoyang Wu; William Green (2023). Data sets and machine learning models for: Predicting critical properties of fluids using machine learning [Dataset]. http://doi.org/10.5281/zenodo.7804143
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Oct 1, 2023
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Sayandeep Biswas; Yunsie Chung; Josephine Ramirez; Haoyang Wu; William Green
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The experimental data sets, data splits, additional features, QM calculations, model predictions, and final machine learning models for the manuscript "Predicting critical properties of fluids using multi-task machine learning". Citation should refer directly to the manuscript. (citation will be added soon)

    To use the machine learning models, please refer to the sample files and instructions on https://github.com/yunsiechung/chemprop/tree/crit_prop.

    Detailed information can be found in README.md file.

    Details on the properties considered

    The data set includes the following 8 properties:

    • Tc: critical temperature, in K
    • Pc: critical pressure, in bar
    • rhoc: critical density, in mol/L
    • omega: acentric factor, unitless
    • Tb: boiling point, in K
    • Tm: melting point, in K
    • dHvap: enthalpy of vaporization at boiling point, in kJ/mol
    • dHfus: enthalpy of fusion at melting point, in kJ/mol

    Details on the files

    1. Data sets under CritProp_v1.0.0:

    • all_data: includes the data sets used in this work. All data points are listed for each chemical compound as well as its corresponding data source. The details of the data sources can be found in the README.md file. The distribution of the data set is included in each folder.
      • estimated_data_for_pretraining: contains the estimated data from Yaws' handbook that are used to pre-train our machine learning (ML) model.
      • experimental_data: contains the experimental data used to fine-tune our ML model.
    • additional_features: includes the additional features tested for the ML model.
      • abraham: Abraham solute parameters (E, S, A, B, L). Molecular features.
      • acsf: ACSF (atom-centered symmetry functions). Atomic features that are converted from the 3D coordinates of the compound.
      • qm_atom: QM (quantum chemical) atomic feature.
      • qm_mol: QM molecular feature.
      • rdkit: Selected RDKit 2D molecular features.
    • data_splits_and_model_predictions: contains the training set and test set used for random and scaffold splits. It also contains the predicted values from our final ML model for each test set.

    2. Machine learning (ML) model files:

    • CritProp_ML_model_fiiles_with_abraham_feat.zip: contains the Chemprop ML model files that are trained using Abraham features as additional molecular features. This gives the best results.
    • CritProp_ML_model_fiiles_without_additional_feat.zip: contains the Chemprop ML model files that are trained without any additional features. This gives the second best results.

    To use these ML models, please refer to the sample files and instructions on https://github.com/yunsiechung/chemprop/tree/crit_prop

    3. QM (quantum chemical) calculations:

    • QM_calculations.zip: contains the results of the QM calculations that are performed to compute QM features.

  16. clearance

    • huggingface.co
    Updated Dec 25, 2022
    Cite
    Zach Nussbaum (2022). clearance [Dataset]. https://huggingface.co/datasets/zpn/clearance
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 25, 2022
    Authors
    Zach Nussbaum
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for clearance

      Dataset Summary
    

    clearance is a dataset included in Chemberta-2 benchmarking.

      Dataset Structure

      Data Fields

    Each split contains

    • smiles: the SMILES representation of a molecule
    • selfies: the SELFIES representation of a molecule
    • target:

      Data Splits
    

    The dataset is split into an 80/10/10 train/valid/test split using scaffold split.
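
A scaffold split keeps all molecules that share a core scaffold in the same subset, so the test set contains scaffolds the model never saw during training. The dataset card does not say which implementation produced its 80/10/10 split; the following is only a minimal sketch of one common greedy variant. The names `scaffold_split`, `scaffold_of`, `toy_scaffolds`, and `toy_smiles` are hypothetical, and the scaffold labels are toy values rather than real Murcko scaffolds.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple


def scaffold_split(
    smiles: Sequence[str],
    scaffold_of: Callable[[str], str],
    frac_train: float = 0.8,
    frac_valid: float = 0.1,
) -> Tuple[List[int], List[int], List[int]]:
    """Greedy group split: molecules sharing a scaffold stay in one subset."""
    # Group molecule indices by their scaffold key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for idx, smi in enumerate(smiles):
        groups[scaffold_of(smi)].append(idx)

    # Largest scaffold groups first; ties are broken by the first index
    # so that the result is deterministic.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(smiles)
    train_cap = frac_train * n
    valid_cap = (frac_train + frac_valid) * n

    train: List[int] = []
    valid: List[int] = []
    test: List[int] = []
    for group in ordered:
        # Fill train first, then valid; whatever does not fit goes to test.
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Hypothetical toy data: scaffold labels are looked up from a dict here.
# With RDKit, one would instead pass a function built on
# rdkit.Chem.Scaffolds.MurckoScaffold.MurckoScaffoldSmiles.
toy_scaffolds = {f"mol{i}": "A" for i in range(8)}
toy_scaffolds.update({"mol8": "B", "mol9": "C"})
toy_smiles = list(toy_scaffolds)

train_idx, valid_idx, test_idx = scaffold_split(toy_smiles, toy_scaffolds.__getitem__)
print(len(train_idx), len(valid_idx), len(test_idx))
```

On this toy input the eight molecules with scaffold "A" fill the training set and the two singleton scaffolds go to validation and test, giving subset sizes of 8, 1, and 1. With real data the proportions are only approximate, because whole scaffold groups are never divided between subsets.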

      Source Data

      Initial Data Collection and Normalization

    Data was… See the full description on the dataset page: https://huggingface.co/datasets/zpn/clearance.

  17. Collection of analog series-based (ASB) scaffolds shared between ZINC,...

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Jan 24, 2020
    Cite
    Carmen Cerchia; Dilyana Dimova; Antonio Lavecchia; Jürgen Bajorath (2020). Collection of analog series-based (ASB) scaffolds shared between ZINC, ChEMBL, and PubChem [Dataset]. http://doi.org/10.5281/zenodo.1043537
    Explore at:
    Available download formats: bin
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Carmen Cerchia; Dilyana Dimova; Antonio Lavecchia; Jürgen Bajorath
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Analog series-based (ASB) scaffolds shared between ZINC and ChEMBL (version 22), between ZINC and PubChem, and among all three databases are provided as three separate files. For each ASB scaffold, the SMILES representations of the ZINC compounds are provided. In addition, the number of ZINC compounds and the number and list of targets the scaffold was annotated with are reported. A README file is also included.

  18. Chig-AIMD scaffold val

    • colabfit.org
    • huggingface.co
    Updated Feb 10, 2024
    Cite
    Tong Wang; Xinheng He; Mingyu Li; Bin Shao; Tie-Yan Liu (2024). Chig-AIMD scaffold val [Dataset]. https://colabfit.org/id/DS_mzz13lim5qfi_0
    Explore at:
    Dataset updated
    Feb 10, 2024
    Dataset provided by
    ColabFit
    Authors
    Tong Wang; Xinheng He; Mingyu Li; Bin Shao; Tie-Yan Liu
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Validation configurations from the 'scaffold' split of Chig-AIMD. This dataset covers the conformational space of chignolin with DFT-level precision. We sequentially applied replica exchange molecular dynamics (REMD), conventional MD, and ab initio MD (AIMD) simulations on a 10 amino acid protein, Chignolin, and finally collected 2 million biomolecule structures with quantum level energy and force records.

  19. fang-2023-biogen-adme

    • huggingface.co
    Cite
    SCBIR Lab, fang-2023-biogen-adme [Dataset]. https://huggingface.co/datasets/scbirlab/fang-2023-biogen-adme
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset authored and provided by
    SCBIR Lab
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Biogen ADME dataset (public data)

    Data from Fang et al., Prospective Validation of Machine Learning Algorithms for Absorption, Distribution, Metabolism, and Excretion Prediction: An Industrial Perspective, available from the GitHub repository. We used schemist (which in turn uses RDKit) to add molecular weight, Murcko scaffold, Crippen cLogP, and topological surface area, and to generate scaffold splits.

      Dataset Details
    

    From the original README:

    To benefit the… See the full description on the dataset page: https://huggingface.co/datasets/scbirlab/fang-2023-biogen-adme.

  20. Structural insights into regulation of the PEAK3...

    • ebi.ac.uk
    Updated Jul 20, 2023
    Cite
    Antoine Forget (2023). Structural insights into regulation of the PEAK3 pseudokinase scaffold by 14-3-3 [Dataset]. https://www.ebi.ac.uk/pride/archive/projects/PXD035574
    Explore at:
    Dataset updated
    Jul 20, 2023
    Authors
    Antoine Forget
    Variables measured
    Proteomics
    Description

    The three members of the PEAK family of pseudokinases (PEAK1, PEAK2, and PEAK3) are molecular scaffolds that have recently emerged as important nodes in signaling pathways that control cell migration, morphology, and proliferation, and are increasingly found mis-regulated in human cancers. While no structures of PEAK3 have been solved to date, crystal structures of the PEAK1 and PEAK2 pseudokinase domains revealed their dimeric organization. It remains unclear how dimerization plays a role in PEAK scaffolding functions as no structures of PEAK family members in complex with their binding partners have been solved. Here, we report the cryo-EM structure of the PEAK3 pseudokinase, also adopting a dimeric state, and in complex with an endogenous 14-3-3 heterodimer purified from mammalian cells. Our structure reveals an asymmetric binding mode between PEAK3 and 14-3-3 stabilized by one pseudokinase domain and the Split HElical Dimerization (SHED) domain of the PEAK3 dimer. The binding interface is comprised of a canonical primary interaction involving two phosphorylated 14-3-3 consensus binding sites located in the N-terminal domains of the PEAK3 monomers docked in the conserved amphipathic grooves of the 14-3-3 dimer, and a unique secondary interaction between 14-3-3 and PEAK3 that has not been observed in any previous structures of 14-3-3/client complexes. Disruption of these interactions results in the relocation of PEAK3 to the nucleus and changes its cellular interactome. Lastly, we identify Protein Kinase D as the regulator of PEAK3/14-3-3 binding, providing a mechanism by which the diverse functions of the PEAK3 scaffold might be fine-tuned in cells.
