Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Scaffold split for the ChEMBL 33 dataset across three seeds.
This dataset was created by ITK8191
Forward Reaction Prediction Dataset (derived from MolInstruct)
Molecule representation format: 1D SMILES, which are further encoded into 2D graph features.
We use scaffold splitting to reconstruct the train split. We use the SMolInstruct FS train split as the sample pool.
For details, refer to PRESTO: Progressive Pretraining Enhances Synthetic Chemistry Outcomes: https://arxiv.org/pdf/2406.13193
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the data, both input and produced, for the paper "Predicting compound activity from phenotypic profiles and chemical structures".
This data can be merged with the paper's GitHub repository for reproduction.
Folders and files are described below:
├── assay_data
├── assay_matrix_discrete_270_assays.csv Assay matrix with hits for assays (270) and compounds (16170). Note that this is the final file that we used to produce splits.
├── assay_metadata.csv Assay metadata
├── broad_ids.txt List of broad ids used in this study. That is an unfiltered list of compounds required by some analysis scripts.
├── smiles.txt Same as broad_ids.txt, but SMILES strings.
├── feature_data (for 16978 compounds, can be masked with ./misc/compounds16978to16170.npy)
├── cp.npz Classical chemical features
├── ge.npz Gene expression features
├── ge_scale.npz Gene expression scaled features
├── mo.npz Morphology features (not batch corrected)
├── mobc.npz Morphology features (batch corrected)
├── misc
├── compound_analysis.npz Compounds in the dataset identified as PAINS
├── compounds16978to16170.npy Used to filter features from the bigger set of compounds to the final one
├── fingerprints.npz Calculated fingerprints of compounds, those were then used to calculate similarity
├── similarity_fingerprints.npz Similarity matrix for compounds (16978)
├── population_normalized.csv.gz Well-level morphological profiles that were used for batch-correction
├── Table for PUMA Excel file with additional data and plots
├── predictions
├── scaffold_median(mean)_AUC.csv Aggregated median(mean) AUC scores over scaffold-based cross-validation splits. In the paper, median results were reported.
├── scaffold_median(mean)_EF.csv Aggregated median(mean) enrichment factor (EF) over scaffold-based cross-validation splits. In the paper, median results were reported.
├── toprank_chemical_cv{}_hitsnorm.csv Those files are needed to create enrichment plots and contain hit rate and top rank hit rate.
├── Each folder here stands for an experiment type, the number in the folder name is a number of the split. Inside each folder there are the following elements:
├── predictions Folder with predictions for each assay-compound pair for each modality
├── 2022_01_evaluation_all_data.csv File with AUC scores for each assay for the test set in the split
├── 2022_01_evaluation_all_data_EF.csv File with enrichment factor (EF) values for each assay for the test set in the split. Those files exist only for *chemical* folders.
├── assay_matrix_discrete_train(test)_old_scaff.csv Training and test subsets of data for the split. The first column contains broad_id.
├── assay_matrix_discrete_train(test)_old_scaff.csv Same, but SMILES strings in the first column. Those files are used as input to ChemProp!
Experiments in this folder are the following:
- chemical Scaffold-based 5-fold cross-validation splits, the main results in the paper are reported with this series of experiments.
- chemical_bal Same splits as in chemical, but training was run with ChemProp built-in data balancing.
- chemical_st Same splits as in chemical, but separate models were trained for each assay.
- CV Random 5-fold cross-validation splits.
- GE 5-fold cross-validation splits based on same-size clustering of gene expression features.
- MOBC 5-fold cross-validation splits based on same-size clustering of batch-corrected morphology features.
- random 10 random splits, ~80% of compounds in the training set and the rest in the test set.
├── splitting This folder contains numpy files that help to match compounds and features when creating the training and test sets for a split, and that can be reused in the analysis notebook for data preparation.
├── scaffold_based_split.npz Splitting for scaffold-based splits.
├── random_split_{}.npz Random split indices of test set compounds (10 files).
├── cross_validation_indicies.npz Indices for random cross-validation splits.
├── GE_clusters_size_constrained.npz Indices of clusters of same-size clustering for gene-expression features.
├── MOBC_clusters_size_constrained.npz Indices of clusters of same-size clustering for batch-corrected morphology features.
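The prediction files listed above report an enrichment factor (EF), but this description does not state the exact definition or cutoff the authors used. As a minimal sketch, assuming the common definition (hit rate among the top-ranked fraction divided by the overall hit rate), EF could be computed as follows. The function name `enrichment_factor`, the `top_fraction` parameter, and the toy scores are illustrative assumptions, not taken from the dataset.

```python
import math


def enrichment_factor(scores, labels, top_fraction=0.01):
    """Hit rate in the top-ranked fraction divided by the overall hit rate.

    Assumed (common) definition; the paper's exact EF settings are not stated here.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equally long")
    total_hits = sum(labels)
    if total_hits == 0:
        return float("nan")  # EF is undefined when there are no hits at all
    # Number of compounds in the top-ranked slice (at least one).
    n_top = max(1, math.ceil(top_fraction * len(scores)))
    # Rank compounds by predicted score, highest first.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_hits = sum(label for _, label in ranked[:n_top])
    return (top_hits / n_top) / (total_hits / len(scores))


# Toy data: 10 compounds, 2 hits, both ranked at the top.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
ef_top20 = enrichment_factor(scores, labels, top_fraction=0.2)
```

An EF above 1 means the top of the ranking is richer in hits than a random selection of the same size would be.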
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Knowledge of critical properties, such as critical temperature, pressure, density, as well as acentric factor, is essential to calculate thermo-physical properties of chemical compounds. Experiments to determine critical properties and acentric factors are expensive and time intensive; therefore, we developed a machine learning (ML) model that can predict these molecular properties given the SMILES representation of a chemical species. We explored directed message passing neural network (D-MPNN) and graph attention network as ML architecture choices. Additionally, we investigated featurization with additional atomic and molecular features, multitask training, and pretraining using estimated data to optimize model performance. Our final model utilizes a D-MPNN layer to learn the molecular representation and is supplemented by Abraham parameters. A multitask training scheme was used to train a single model to predict all the critical properties and acentric factors along with boiling point, melting point, enthalpy of vaporization, and enthalpy of fusion. The model was evaluated on both random and scaffold splits where it shows state-of-the-art accuracies. The extensive data set of critical properties and acentric factors contains 1144 chemical compounds and is made available in the public domain together with the source code that can be used for further exploration.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets and splits of the manuscript "Chemprop: Machine Learning Package for Chemical Property Prediction." Train, validation and test splits are located within each folder, as well as additional data necessary for some of the benchmarks. To train Chemprop models, refer to our code repository to obtain ready-to-use scripts to train machine learning models for each of the systems.
Available benchmarking systems:
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains docking scores for over 15 million compounds from the Enamine REAL library screened against the d2 protein target using structure-based virtual screening. The data was originally published by Luttens et al. (2022), and includes sanitized SMILES strings, compound identifiers, docking scores, and Bemis-Murcko scaffolds. The dataset is split using scaffold-based splitting to support robust machine learning benchmarking.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains three archives. The first archive, full_dataset.zip, contains geometries and free energies for nearly 44,000 solute molecules with almost 9 million conformers, in 42 different solvents. The geometries and gas phase free energies are computed using density functional theory (DFT). The solvation free energy for each conformer is computed using COSMO-RS and the solution free energies are computed using the sum of the gas phase free energies and the solvation free energies. The geometries for each solute conformer are provided as ASE_atoms_objects within a pandas DataFrame, found in the compressed file dft coords.pkl.gz within full_dataset.zip. The gas-phase energies, solvation free energies, and solution free energies are also provided as a pandas DataFrame in the compressed file free_energy.pkl.gz within full_dataset.zip. Ten example data splits for both random and scaffold split types are also provided in the ZIP archive for training models. Scaffold split index 0 is used to generate results in the corresponding publication. The second archive, refined_conf_search.zip, contains geometries and free energies for a representative sample of 28 solute molecules from the full dataset that were subject to a refined conformer search and thus had more conformers located. The format of the data is identical to full_dataset.zip. The third archive contains one folder for each solvent for which we have provided free energies in full_dataset.zip. Each folder contains the .cosmo file for every solvent conformer used in the COSMOtherm calculations, a dummy input file for the COSMOtherm calculations, and a CSV file that contains the electronic energy of each solvent conformer that needs to be substituted for "EH_Line" in the dummy input file.
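The relation stated above, solution free energy as the sum of the gas phase free energy and the solvation free energy, can be written out per conformer. The sketch below is only illustrative: the field names (`g_gas`, `dg_solv`, `g_soln`) and the numbers are placeholders and do not reflect the actual column names or values in free_energy.pkl.gz.

```python
def solution_free_energy(gas_phase_free_energy, solvation_free_energy):
    """G_solution = G_gas + dG_solvation, both given in the same energy unit."""
    return gas_phase_free_energy + solvation_free_energy


# Hypothetical conformer records; values and field names are placeholders,
# not entries from free_energy.pkl.gz.
conformers = [
    {"conformer": 0, "g_gas": -10.0, "dg_solv": -2.5},
    {"conformer": 1, "g_gas": -9.0, "dg_solv": -4.0},
]
for row in conformers:
    row["g_soln"] = solution_free_energy(row["g_gas"], row["dg_solv"])

# Conformer with the lowest solution free energy in this toy example.
lowest = min(conformers, key=lambda row: row["g_soln"])
```

In this made-up example, conformer 0 is lowest in the gas phase while conformer 1 is lowest in solution, which is why the sum is evaluated for each conformer separately.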
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This report describes the preparation of a series of 17 novel racemic spirocyclic scaffolds that are intended for the creation of compound libraries by parallel synthesis for biological screening. Each scaffold features two points of orthogonal diversification. The scaffolds are related to each other in four ways: (1) through stepwise changes in the size of the nitrogen-bearing ring; (2) through the oxidation state of the carbon-centered point of diversification; (3) through the relative stereochemical orientation of the two diversification sites in those members that are stereogenic; and (4) through the provision of both saturated and unsaturated versions of the furan ring in the scaffold series derived from 3-piperidone. The scaffolds provide incremental changes in the relative orientation of the diversity components that would be introduced onto them. The scaffolds feature high sp3 carbon content which is essential for the three-dimensional exploration of chemical space. This characteristic is particularly evident in those members of this family that bear two stereocenters, i.e., the two series derived from 3-piperidone and 3-pyrrolidinone. In the series derived from 3-piperidone we were able to “split the difference” between the two diastereomers by preparation of their corresponding unsaturated version.
ZINC dataset with React=Standard, Purch=In-Stock and Drug-Like (about 11M)
Preprocessing:
- canonicalize: Chem.MolToSmiles(Chem.MolFromSmiles(mol), True)
- compute scaffold: Chem.MolToSmiles(MurckoScaffold.GetScaffoldForMol(Chem.MolFromSmiles(mol)), True)
- convert SMILES and scaffold SMILES to SELFIES, respectively, using selfies: sf.encoder(smiles)
- filter out all molecules whose scaffold is empty ("") or cannot be converted to SELFIES
- 90/10 train/validation split
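The last two preprocessing steps (filtering and the 90/10 split) can be sketched without any chemistry library, assuming the SMILES, scaffold, and SELFIES strings were already computed upstream. The record layout, the `filter_and_split` name, the use of `None` for a failed SELFIES conversion, and the random shuffling with a fixed seed are all assumptions made for illustration; the description does not say how the 90/10 split was drawn.

```python
import random


def filter_and_split(records, valid_fraction=0.1, seed=0):
    """Drop records with an empty scaffold or a failed SELFIES conversion,
    then shuffle and split the rest into train and validation sets."""
    kept = [
        r for r in records
        if r["scaffold"] != ""
        and r["selfies"] is not None
        and r["scaffold_selfies"] is not None
    ]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(kept)
    n_valid = int(round(valid_fraction * len(kept)))
    return kept[n_valid:], kept[:n_valid]


# Placeholder records: the strings stand in for real SMILES/SELFIES output.
records = [
    {"smiles": f"mol{i}", "scaffold": "scaf", "selfies": "[C]", "scaffold_selfies": "[C]"}
    for i in range(18)
]
# One molecule without a scaffold and one whose SELFIES conversion failed.
records.append({"smiles": "acyclic", "scaffold": "", "selfies": "[C]", "scaffold_selfies": None})
records.append({"smiles": "bad", "scaffold": "scaf", "selfies": None, "scaffold_selfies": "[C]"})

train_set, valid_set = filter_and_split(records)
```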
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Training configurations from the 'scaffold' split of Chig-AIMD. This dataset covers the conformational space of chignolin with DFT-level precision. We sequentially applied replica exchange molecular dynamics (REMD), conventional MD, and ab initio MD (AIMD) simulations on a 10 amino acid protein, Chignolin, and finally collected 2 million biomolecule structures with quantum level energy and force records.
https://choosealicense.com/licenses/unknown/
MoleculeNet FreeSolv
FreeSolv (Free Solvation Database) dataset [1], part of the MoleculeNet [2] benchmark. It is intended to be used through the scikit-fingerprints library. The task is to predict the hydration free energy of small molecules in water. Targets are in kcal/mol.
Characteristic       Description
Tasks                1
Task type            regression
Total samples        642
Recommended split    scaffold
Recommended metric   RMSE
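For reference, the recommended metric, root mean squared error (RMSE), can be computed as in the short sketch below. The function name and the example values are illustrative only and are not FreeSolv entries; for this dataset the result would be in kcal/mol.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between measured and predicted targets."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and equally long")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up hydration free energies (kcal/mol) and predictions.
example = rmse([-1.0, -3.0, -5.0], [-2.0, -3.0, -3.0])
```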
References
[1] Mobley, D.L., Guthrie, J.P. "FreeSolv: a… See the full description on the dataset page: https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_FreeSolv.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://tdcommons.ai/single_pred_tasks/tox#acute-toxicity-ld50
Dataset Description: Acute toxicity LD50 measures the most conservative dose that can lead to lethal adverse effects. The higher the dose, the more lethal the drug. This dataset is kindly provided by the authors of [1].
Task Description: Regression. Given a drug SMILES string, predict its acute toxicity.
Dataset Statistics: 7,385 drugs.
Dataset Split: Random Split, Scaffold Split
from tdc.single_pred import Tox
data = Tox(name = 'LD50_Zhu')
split = data.get_split()
References: [1] Zhu, Hao, et al. “Quantitative structure-activity relationship modeling of rat acute toxicity by oral exposure.” Chemical research in toxicology 22.12 (2009): 1913-1921.
Dataset License: CC BY 4.0.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Conditional protein splicing is a powerful biotechnological tool that can be used to rapidly and post-translationally control the activity of a given protein. Here we demonstrate a novel conditional splicing system in which a genetically encoded protein scaffold induces the splicing and activation of an enzyme in mammalian cells. In this system the protein scaffold binds to two inactive split intein/enzyme extein protein fragments leading to intein fragment complementation, splicing, and activation of the firefly luciferase enzyme. We first demonstrate the ability of antiparallel coiled-coils (CCs) to mediate splicing between two intein fragments, effectively creating two new split inteins. We then generate and test two versions of the scaffold-induced splicing system using two pairs of CCs. Finally, we optimize the linker lengths of the proteins in the system and demonstrate 13-fold activation of luciferase by the scaffold compared to the activity of negative controls. Our protein scaffold-triggered conditional splicing system is an effective strategy to control enzyme activity using a protein input, enabling enhanced genetic control over protein splicing and the potential creation of splicing-based protein sensors and autoregulatory systems.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The experimental data sets, data splits, additional features, QM calculations, model predictions, and final machine learning models for the manuscript "Predicting critical properties of fluids using multi-task machine learning". Citation should refer directly to the manuscript. (citation will be added soon)
To use the machine learning models, please refer to the sample files and instructions on https://github.com/yunsiechung/chemprop/tree/crit_prop.
Detailed information can be found in the README.md file.
Details on the properties considered
The data set includes the following 8 properties:
Details on the files
1. Data sets under CritProp_v1.0.0:
2. Machine learning (ML) model files:
To use these ML models, please refer to the sample files and instructions on https://github.com/yunsiechung/chemprop/tree/crit_prop
3. QM (quantum chemical) calculations:
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for clearance
Dataset Summary
clearance is a dataset included in the ChemBERTa-2 benchmarking.
Dataset Structure
Data Fields
Each split contains
smiles: the SMILES representation of a molecule
selfies: the SELFIES representation of a molecule
target:
Data Splits
The dataset is split into an 80/10/10 train/valid/test split using scaffold split.
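The dataset card does not say which scaffold-splitting procedure was used. As a minimal sketch, assuming a greedy assignment of whole scaffold groups (largest first) and scaffold strings that were computed beforehand, for example Bemis-Murcko scaffold SMILES, an 80/10/10 scaffold split could look like this. The function name `scaffold_split` and the toy scaffold labels are hypothetical.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Assign whole scaffold groups to train/valid/test so that no scaffold
    appears in more than one subset. Returns three lists of indices."""
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)
    # Largest groups first, so big scaffold families land in the training set;
    # ties are broken by first occurrence to keep the result deterministic.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Toy input: one precomputed scaffold label per molecule.
scaffolds = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
train_idx, valid_idx, test_idx = scaffold_split(scaffolds)
```

Because groups are never divided, the realized fractions only approximate 80/10/10 when scaffold groups are large or uneven.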
Source Data
Initial Data Collection and Normalization
Data was… See the full description on the dataset page: https://huggingface.co/datasets/zpn/clearance.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analog series-based (ASB) scaffolds shared between ZINC and ChEMBL (version 22), between ZINC and PubChem, and among all three databases are provided as three separate files. For each ASB scaffold, the SMILES representation of the ZINC compounds is provided. In addition, the number of ZINC compounds and the number and list of targets the scaffold was annotated with are reported. A README file is also given.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Validation configurations from the 'scaffold' split of Chig-AIMD. This dataset covers the conformational space of chignolin with DFT-level precision. We sequentially applied replica exchange molecular dynamics (REMD), conventional MD, and ab initio MD (AIMD) simulations on a 10 amino acid protein, Chignolin, and finally collected 2 million biomolecule structures with quantum level energy and force records.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Biogen ADME dataset (public data)
Data from Fang et al., Prospective Validation of Machine Learning Algorithms for Absorption, Distribution, Metabolism, and Excretion Prediction: An Industrial Perspective, available from the GitHub repository. We used schemist (which in turn uses RDKit) to add molecular weight, Murcko scaffold, Crippen cLogP, and topological surface area, and to generate scaffold splits.
Dataset Details
From the original README:
To benefit the… See the full description on the dataset page: https://huggingface.co/datasets/scbirlab/fang-2023-biogen-adme.
The three members of the PEAK family of pseudokinases (PEAK1, PEAK2, and PEAK3) are molecular scaffolds that have recently emerged as important nodes in signaling pathways that control cell migration, morphology, and proliferation, and are increasingly found mis-regulated in human cancers. While no structures of PEAK3 have been solved to date, crystal structures of the PEAK1 and PEAK2 pseudokinase domains revealed their dimeric organization. It remains unclear how dimerization plays a role in PEAK scaffolding functions as no structures of PEAK family members in complex with their binding partners have been solved. Here, we report the cryo-EM structure of the PEAK3 pseudokinase, also adopting a dimeric state, and in complex with an endogenous 14-3-3 heterodimer purified from mammalian cells. Our structure reveals an asymmetric binding mode between PEAK3 and 14-3-3 stabilized by one pseudokinase domain and the Split HElical Dimerization (SHED) domain of the PEAK3 dimer. The binding interface is comprised of a canonical primary interaction involving two phosphorylated 14-3-3 consensus binding sites located in the N-terminal domains of the PEAK3 monomers docked in the conserved amphipathic grooves of the 14-3-3 dimer, and a unique secondary interaction between 14-3-3 and PEAK3 that has not been observed in any previous structures of 14-3-3/client complexes. Disruption of these interactions results in the relocation of PEAK3 to the nucleus and changes its cellular interactome. Lastly, we identify Protein Kinase D as the regulator of PEAK3/14-3-3 binding, providing a mechanism by which the diverse functions of the PEAK3 scaffold might be fine-tuned in cells.