Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset compiled and curated for use in the ThermoMPNN paper: https://doi.org/10.1073/pnas.2314853121
Dataset for training models for prediction of thermodynamic stability changes (ddG) of protein point mutations given a wildtype protein structure (PDB) file. Data was assembled by matching sequence-based ddG measurements in FireProtDB to structures from the RCSB Protein Data Bank (PDB). For details, see the Methods section of our manuscript.
Citing this work: If you choose to use this dataset for your own research, please cite this repository and the ThermoMPNN paper: https://doi.org/10.1073/pnas.2314853121.
Contents:
pdbs/ directory contains all PDB files
csvs/ directory contains all CSVs with mutation data
csvs/4_fireprotDB_bestpH.csv is the main (full) dataset file with 3,438 mutations across 100 proteins.
csvs/fireprot_splits.pkl contains the dataset splits (train/val/test) used in our study
csvs/splits/ contains csvs for each of the splits (train/val/test/homologue-free) indexed from the full dataset csv.
Important CSV columns:
pdb_id_corrected: corresponds to the PDB in the pdbs/ directory (after curation and disambiguation)
ddG: ddG value for mutation (mutant - WT)
wild_type: wild-type amino acid (1-letter code)
mutation: mutant amino acid (1-letter code)
pdb_position: 0-based index of the mutated residue in the PDB file (may be different from position in the original FireProtDB sequence entry)
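The column conventions listed above can be illustrated with a minimal sketch. It assumes only the documented column names; the sample rows are made up and are not taken from the dataset, and the helper names `mutation_label` and `load_mutations` are hypothetical. The label simply adds 1 to the 0-based pdb_position, which may not match the residue numbering written inside the PDB file.

```python
import csv
import io

# Made-up rows that only mimic the documented columns; not real dataset values.
SAMPLE_CSV = """pdb_id_corrected,wild_type,pdb_position,mutation,ddG
XXXX,A,0,G,1.2
XXXX,L,41,P,-0.7
"""


def mutation_label(row):
    """Return a label such as 'A1G' using a 1-based sequence index."""
    position = int(row["pdb_position"]) + 1  # documented index is 0-based
    return f'{row["wild_type"]}{position}{row["mutation"]}'


def load_mutations(handle):
    """Read mutation rows and attach a float ddG and a readable label."""
    records = []
    for row in csv.DictReader(handle):
        records.append({
            "pdb": row["pdb_id_corrected"],
            "label": mutation_label(row),
            "ddG": float(row["ddG"]),  # documented as mutant minus wild type
        })
    return records


records = load_mutations(io.StringIO(SAMPLE_CSV))
for record in records:
    print(record["pdb"], record["label"], record["ddG"])
```

To run this on the real data, pass an open file handle for csvs/4_fireprotDB_bestpH.csv instead of the in-memory sample.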
A dataset of protein-peptide complexes for training a generative model for full-atom peptide design with Geometric Latent Diffusion.
https://choosealicense.com/licenses/agpl-3.0/
pdb-rna_secondary_structure
[!IMPORTANT] The pdb-rna_secondary_structure dataset is in beta test. This dataset card may not accurately reflect the data content. The data content and this dataset card may be subject to change. Please contact the MultiMolecule team on GitHub issues should you have any feedback.
[!CAUTION] This dataset is converted from the dataset released by the authors of SPOT-RNA. The MultiMolecule is aware of a potential issue in data quality. We are working on… See the full description on the dataset page: https://huggingface.co/datasets/multimolecule/pdb-rna_secondary_structure.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data used to generate a co-occurrence network map of publication data keywords using the VOSviewer server (Version 1.6.5). Approximately 227,000 keywords were extracted from citation titles and abstracts from the Web of Science. A network was computed for a total of 2,460 terms selected by the full-counting method and relevance scoring as implemented within VOSviewer. For analysis, we reviewed co-occurrence network maps for thresholds between 5 and 40. The map shown uses the default cutoff of 30 term co-occurrences.
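As a toy illustration of the terms used above, the sketch below contrasts full counting (every occurrence of a term counts) with binary counting (one count per document), then applies an occurrence threshold before counting co-occurrences. It is not VOSviewer's implementation: the documents, the threshold value, and the function names are made up, and relevance scoring is left out.

```python
from collections import Counter
from itertools import combinations

# Toy "documents"; each is a list of already-extracted terms (made up).
DOCS = [
    ["protein", "structure", "protein", "ligand"],
    ["protein", "docking", "ligand"],
    ["structure", "docking", "docking"],
]


def count_terms(docs, full=True):
    """Full counting adds every occurrence; binary counting adds one per document."""
    counts = Counter()
    for terms in docs:
        counts.update(terms if full else set(terms))
    return counts


def cooccurrence(docs, kept_terms):
    """Count, per pair of kept terms, the documents in which both appear."""
    pairs = Counter()
    for terms in docs:
        present = sorted(set(terms) & kept_terms)
        pairs.update(combinations(present, 2))
    return pairs


full_counts = count_terms(DOCS, full=True)
binary_counts = count_terms(DOCS, full=False)
threshold = 3  # hypothetical minimum number of occurrences
kept = {term for term, n in full_counts.items() if n >= threshold}
links = cooccurrence(DOCS, kept)
print(sorted(kept), dict(links))
```

Raising the threshold shrinks the set of retained terms, which is why maps at different thresholds can look quite different.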
Search for carbohydrate containing PDB entries by criteria like species or the compound / classification terms. You can choose predefined, frequent terms from the pull-down-menus or enter your own queries manually.
Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains supplementary data to the journal article 'Redocking the PDB' by Flachsenberg et al. (https://doi.org/10.1021/acs.jcim.3c01573)[1]. In this paper, we described two datasets: The PDBScan22 dataset with a large set of 322,051 macromolecule–ligand binding sites generally suitable for redocking and the PDBScan22-HQ dataset with 21,355 binding sites passing different structure quality filters. These datasets were further characterized by calculating properties of the ligand (e.g., molecular weight), properties of the binding site (e.g., volume), and structure quality descriptors (e.g., crystal structure resolution). Additionally, we performed redocking experiments with our novel JAMDA structure preparation and docking workflow[1] and with AutoDock Vina[2,3]. Details for all these experiments and the dataset composition can be found in the journal article[1].
Here, we provide all the datasets, i.e., the PDBScan22 and PDBScan22-HQ datasets as well as the docking results and the additionally calculated properties (for the ligand, the binding sites, and structure quality descriptors). Furthermore, we give a detailed description of their content (i.e., the data types and a description of the column values). All datasets consist of CSV files with the actual data and associated metadata JSON files describing their content. The CSV/JSON files are compliant with the CSV on the web standard (https://csvw.org/).
General hints
All docking experiment results consist of two CSV files, one with general information about the docking run (e.g., was it successful?) and one with individual pose results (i.e., score and RMSD to the crystal structure).
All files (except for the docking pose tables) can be indexed uniquely by the column tuple '(pdb, name)' containing the PDB code of the complex (e.g., 1gm8) and the name of the ligand (in the format '_', e.g., 'SOX_B_1559').
All files (except for the docking pose tables) have exactly the same number of rows as the dataset they were calculated on (e.g., PDBScan22 or PDBScan22-HQ). However, some CSV files may have missing values (see also the JSON metadata files) in some or even all columns (except for 'pdb' and 'name').
The docking pose tables also contain the 'pdb' and 'name' columns. However, these alone are not unique but only together with the 'rank' column (i.e., there might be multiple poses for each docking run or none).
Example usage
Using the pandas library (https://pandas.pydata.org/) in Python, we can calculate the number of protein-ligand complexes in the PDBScan22-HQ dataset with a top-ranked pose RMSD to the crystal structure ≤ 2.0 Å in the JAMDA redocking experiment and a molecular weight between 100 Da and 200 Da:
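import pandas as pd

# Load the dataset, the pose table of one JAMDA experiment, and the ligand properties
df = pd.read_csv('PDBScan22-HQ.csv')
df_poses = pd.read_csv('PDBScan22-HQ_JAMDA_NL_NR_poses.csv')
df_properties = pd.read_csv('PDBScan22_ligand_properties.csv')

# Attach ligand properties, keep ligands with 100 Da <= MW <= 200 Da,
# then attach the top-ranked pose (rank == 1) where one exists
merged = df.merge(df_properties, how='left', on=['pdb', 'name'])
merged = merged[(merged['MW'] >= 100) & (merged['MW'] <= 200)].merge(df_poses[df_poses['rank'] == 1], how='left', on=['pdb', 'name'])

# Count top-ranked poses within 2.0 Å and complexes without any top-ranked pose
nof_successful_top_ranked = (merged['rmsd_ai'] <= 2.0).sum()
nof_no_top_ranked = merged['rmsd_ai'].isna().sum()
Datasets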
PDBScan22.csv: This is the PDBScan22 dataset[1]. This dataset was derived from the PDB[4]. It contains macromolecule–ligand binding sites (defined by PDB code and ligand identifier) that can be read by the NAOMI library[5,6] and pass basic consistency filters.
PDBScan22-HQ.csv: This is the PDBScan22-HQ dataset[1]. It contains macromolecule–ligand binding sites from the PDBScan22 dataset that pass certain structure quality filters described in our publication[1].
PDBScan22-HQ-ADV-Success.csv: This is a subset of the PDBScan22-HQ dataset without 336 binding sites where AutoDock Vina[2,3] fails.
PDBScan22-HQ-Macrocycles.csv: This is a subset of the PDBScan22-HQ dataset without 336 binding sites where AutoDock Vina[2,3] fails and only contains molecules with macrocycles with at least ten atoms.
Properties for PDBScan22
PDBScan22_ligand_properties.csv: Conformation-independent properties of all ligand molecules in the PDBScan22 dataset. Properties were calculated using an in-house tool developed with the NAOMI library[5,6].
PDBScan22_StructureProfiler_quality_descriptors.csv: Structure quality descriptors for the binding sites in the PDBScan22 dataset calculated using the StructureProfiler tool[7].
PDBScan22_basic_complex_properties.csv: Simple properties of the binding sites in the PDBScan22 dataset. Properties were calculated using an in-house tool developed with the NAOMI library[5,6].
Properties for PDBScan22-HQ
PDBScan22-HQ_DoGSite3_pocket_descriptors.csv: Binding site descriptors calculated for the binding sites in the PDBScan22-HQ dataset using the DoGSite3 tool[8].
PDBScan22-HQ_molecule_types.csv: Assignment of ligands in the PDBScan22-HQ dataset (without 336 binding sites where AutoDock Vina fails) to different molecular classes (i.e., drug-like, fragment-like, oligosaccharide, oligopeptide, cofactor, macrocyclic). A detailed description of the assignment can be found in our publication[1].
Docking results on PDBScan22
PDBScan22_JAMDA_NL_NR.csv: Docking results of JAMDA[1] on the PDBScan22 dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22_JAMDA_NL_NR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
PDBScan22_JAMDA_NL_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22 dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
Docking results on PDBScan22-HQ
PDBScan22-HQ_JAMDA_NL_NR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NL_NR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
PDBScan22-HQ_JAMDA_NL_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
PDBScan22-HQ_JAMDA_NL_WR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NL_WR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled.
PDBScan22-HQ_JAMDA_NL_WR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled.
PDBScan22-HQ_JAMDA_NW_NR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NW_NR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
PDBScan22-HQ_JAMDA_NW_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
PDBScan22-HQ_JAMDA_NW_WR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NW_WR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled.
PDBScan22-HQ_JAMDA_NW_WR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled.
PDBScan22-HQ_JAMDA_WL_NR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_WL_NR_poses.csv'. For this experiment, the ligand was considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
PDBScan22-HQ_JAMDA_WL_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand
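The indexing rules from the general hints ('(pdb, name)' is unique in most tables, while pose tables need '(pdb, name, rank)') can be checked with a small sketch like the one below. It uses only the Python standard library; the rows are made up apart from the example identifiers quoted in the description, and the helper name `duplicate_keys` is hypothetical.

```python
import csv
import io
from collections import Counter


def duplicate_keys(handle, key_columns):
    """Return the key tuples that occur more than once in a CSV file."""
    counts = Counter(
        tuple(row[column] for column in key_columns)
        for row in csv.DictReader(handle)
    )
    return [key for key, n in counts.items() if n > 1]


# Made-up pose rows: two poses for one docking run, one pose for another.
POSES_CSV = """pdb,name,rank,rmsd_ai
1gm8,SOX_B_1559,1,0.8
1gm8,SOX_B_1559,2,3.1
XXXX,LIG_A_1,1,1.5
"""

# '(pdb, name)' alone repeats in a pose table ...
by_complex = duplicate_keys(io.StringIO(POSES_CSV), ["pdb", "name"])
# ... but adding 'rank' makes every row unique.
by_pose = duplicate_keys(io.StringIO(POSES_CSV), ["pdb", "name", "rank"])
print(by_complex)
print(by_pose)
```

For the real files, open the CSV with `open(path, newline="")` and pass the handle in place of the in-memory sample.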
Protein Data Bank Entry 8TYZ is listed as the structure corresponding to this dataset
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
README file to the project files provided as supporting information to the manuscript “A deep learning approach to the structural analysis of proteins”
Dec. 30, 2018
Authors: Marco Giulini and Raffaello Potestio
==================================
The dataset contains the following files:
datasets.zip: archive containing five .csv files, namely:
- decoys_cm.csv : all the data for 10728 protein decoys, training set
- evaluation_cm.csv : all data for 146 proteins in the evaluation set
- random_CG.csv : 1200 Coulomb matrices. 100 CG models for each protein with 120 amino acids
- 1e5g_centered_sphere.csv : 100 CG models in which the central atoms in 1e5g are not removed
- 1e5g_random_sphere.csv : 10 CG models for 10 different (random) locations for the sphere that includes atoms that have to be retained. 100 CG models in total
decoys_labels.lab containing the labels associated to the 10728 decoys present in the training set
evaluation_labels.lab containing the labels associated to the 146 pdb files in the evaluation set
random_CG_labels.lab containing the labels associated to the 6 proteins with 120 amino acids
network_development_training: a Python script that performs cross-validation and full training of the model
saved_networks.zip FOLDER containing 10 networks: the architecture is included in .json files while weight parameters are inside .hs files
pdb_files.zip FOLDER containing the PDB files that have been employed in the project, namely:
- pdb_files_len100 : pdb files with 100 amino acids
- pdb_files_len101-110 : pdb files with a number of amino acids between 101 and 110
- decoys : decoys of length 100 extracted from the above folder: name syntax == PDBNAME_decoy_STARTRES_ENDRES.pdb
EXAMPLE 6gsp.pdb will give rise to 6gsp_decoy_0_100.pdb , 6gsp_decoy_1_101.pdb , 6gsp_decoy_2_102.pdb , 6gsp_decoy_3_103.pdb , 6gsp_decoy_4_104.pdb
- pdb_files_len100 : 6 pdb files with 120 amino acids
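The decoy naming syntax PDBNAME_decoy_STARTRES_ENDRES.pdb described above can be reproduced with a short sketch. It assumes that decoys are all contiguous windows of 100 residues and that the 6gsp example has 104 residues (inferred from the five file names listed, not stated in the README); the function name `decoy_names` is hypothetical.

```python
def decoy_names(pdb_name, n_residues, window=100):
    """List decoy file names for every contiguous window of `window` residues."""
    if n_residues < window:
        return []  # chain too short to yield a decoy of the requested length
    return [
        f"{pdb_name}_decoy_{start}_{start + window}.pdb"
        for start in range(n_residues - window + 1)
    ]


# Assumed length of 104 residues, chosen to match the README example.
names = decoy_names("6gsp", 104)
for name in names:
    print(name)
```

Under these assumptions a chain of N residues gives N - 99 decoys, so the files in the 101-110 residue folder would each contribute between 2 and 11 decoys.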
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data item of the type ? from the database pdb with accession 103m and name SPERM WHALE MYOGLOBIN H64A N-BUTYL ISOCYANIDE AT PH 9.0
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘PDB Electric Power Load History’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ashfakyeafi/pbd-load-history on 13 February 2022.
--- Dataset description provided by original source is as follows ---
With this data, many works can be done in the Electrical Engineering sector.
--- Original source retains full ownership of the source dataset ---
Protein Data Bank Entry 8TYX is listed as the structure corresponding to this dataset
Homology modeling: NR2F1
Active form: NR2F1_act.pdb
Auto-repressed form: NR2F1_rep.pdb
Molecular Dynamics simulations
Original models: nr2f1_lbd_wt.pdb, nr2f1_lbd_q244x.pdb, nr2f1_lbd_e400x.pdb
Structures after 4 ns equilibration: nr2f1_lbd_wt_start.pdb, nr2f1_lbd_q244x_start.pdb, nr2f1_lbd_e400x_start.pdb
Trajectories in GROMACS compressed format aligned with the equilibrated structure: nr2f1_lbd_wt_clean.xtc, nr2f1_lbd_q244x_clean.xtc, nr2f1_lbd_e400x_clean.xtc
Final structure after 500 ns productive MD simulations: nr2f1_lbd_wt_500ns.pdb, nr2f1_lbd_q244x_500ns.pdb, nr2f1_lbd_e400x_500ns.pdb
Docking simulations
For each docking simulation performed with PIPER we provide the best solution as detailed in the manuscript.
Homodimer: NR2F1_act_dimer.pdb (active), NR2F1_rep_dimer.pdb (auto-repressed)
Heterodimer with NR2F2: NR2F1_act_NR2F2.pdb (active), NR2F1_rep_NR2F2.pdb (auto-repressed)
Heterodimer with RXRa: NR2F1_act_RXRa.pdb (active), NR2F1_rep_RXRa.pdb (auto-repressed)
Heterodimer with CRABP2: NR2F1_act_CRABP2_apo.pdb (apo), NR2F1_act_CRABP2_holo.pdb (holo)
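The molecular dynamics files above follow a regular naming pattern (variant plus suffix), which a small sketch can use to check whether a download is complete. The pattern is inferred from the file list; the helper names and the directory name in the example are hypothetical.

```python
from pathlib import Path

VARIANTS = ["wt", "q244x", "e400x"]

# Suffix -> description, following the naming pattern in the file list.
MD_SUFFIXES = {
    ".pdb": "original model",
    "_start.pdb": "structure after 4 ns equilibration",
    "_clean.xtc": "aligned trajectory (GROMACS compressed format)",
    "_500ns.pdb": "final structure after 500 ns",
}


def md_files(variant):
    """Map each expected MD file name for one variant to its description."""
    return {
        f"nr2f1_lbd_{variant}{suffix}": description
        for suffix, description in MD_SUFFIXES.items()
    }


def missing_files(directory):
    """Return the expected MD file names that are not present in `directory`."""
    root = Path(directory)
    return sorted(
        name
        for variant in VARIANTS
        for name in md_files(variant)
        if not (root / name).exists()
    )


expected = [name for variant in VARIANTS for name in md_files(variant)]
# Hypothetical directory name; a missing directory reports every file as absent.
print(missing_files("nr2f1_download_that_does_not_exist"))
```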