http://opendatacommons.org/licenses/dbcl/1.0/
This is a protein data set retrieved from Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB).
The PDB archive is a repository of atomic coordinates and other information describing proteins and other important biological macromolecules. Structural biologists use methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy to determine the location of each atom relative to each other in the molecule. They then deposit this information, which is then annotated and publicly released into the archive by the wwPDB.
The constantly-growing PDB is a reflection of the research that is happening in laboratories across the world. This can make it both exciting and challenging to use the database in research and education. Structures are available for many of the proteins and nucleic acids involved in the central processes of life, so you can go to the PDB archive to find structures for ribosomes, oncogenes, drug targets, and even whole viruses. However, it can be a challenge to find the information that you need, since the PDB archives so many different structures. You will often find multiple structures for a given molecule, or partial structures, or structures that have been modified or inactivated from their native form.
There are two data files. Both are keyed on the "structureId" of the protein:
pdb_data_no_dups.csv contains protein metadata, including details on protein classification, extraction methods, etc.
data_seq.csv contains >400,000 protein structure sequences.
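Because both files share the "structureId" key, they can be combined row by row. The following is a minimal sketch using only the Python standard library. The in-memory sample rows and every column name other than "structureId" (here `classification`, `chainId`, `sequence`) are made-up placeholders, not values confirmed by the description.

```python
import csv
import io

# Tiny in-memory stand-ins for the two files. Only "structureId" comes from
# the description; the other column names and all values are hypothetical.
META_CSV = """structureId,classification
1ABC,HYDROLASE
2XYZ,TRANSFERASE
"""
SEQ_CSV = """structureId,chainId,sequence
1ABC,A,MKVLAAGT
1ABC,B,MKVLAAGT
9ZZZ,A,GSTEVCC
"""

def join_on_structure_id(meta_file, seq_file):
    """Attach metadata to every sequence row that shares a structureId."""
    meta = {row["structureId"]: row for row in csv.DictReader(meta_file)}
    joined = []
    for row in csv.DictReader(seq_file):
        info = meta.get(row["structureId"])
        if info is not None:  # inner join: skip sequences without metadata
            joined.append({**info, **row})
    return joined

rows = join_on_structure_id(io.StringIO(META_CSV), io.StringIO(SEQ_CSV))
for row in rows:
    print(row["structureId"], row["chainId"], row["classification"])
```

With the real files, the two `io.StringIO` objects would be replaced by file handles opened on pdb_data_no_dups.csv and data_seq.csv.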
Original data set downloaded from http://www.rcsb.org/pdb/
The Protein Data Bank has helped the life science community study diseases and develop new drugs and solutions that aid human survival.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Methembe Thomas Tshuma
Released under CC0: Public Domain
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains text files from the CASP12 (Critical Assessment of Techniques for Protein Structure Prediction) challenge, which is a benchmark for evaluating computational methods in protein structure prediction.
Each file corresponds to a target protein and includes:
- The amino acid sequence of the protein.
- The true structural properties, such as secondary structure or 3D coordinates (depending on the specific format).
- Standardized naming conventions for easy parsing.
All files are in plain .txt format and are ready to be used for machine learning models, such as LSTMs or other sequence-based models.
This dataset was created by Moklesur Rahman
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."
This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.
While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:
- UniProt: A comprehensive database of protein sequences and annotations.
- Kyte-Doolittle Scale: Calculations of hydrophobicity.
- Biopython: A tool for analyzing biological sequences.
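As an illustration of the kind of hydrophobicity calculation the Kyte-Doolittle scale supports, the sketch below averages the published per-residue hydropathy values over a sequence (a quantity often called GRAVY). Whether this dataset's physicochemical columns were computed exactly this way is an assumption; the function name `mean_hydropathy` is hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def mean_hydropathy(sequence):
    """Average Kyte-Doolittle value over all residues of the sequence."""
    if not sequence:
        raise ValueError("empty sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)

# Positive values indicate a more hydrophobic sequence, negative a more
# hydrophilic one.
print(round(mean_hydropathy("ILVA"), 3))
print(round(mean_hydropathy("RKDE"), 3))
```

Sequences containing non-standard residue codes raise a `KeyError` in this sketch; a real pipeline would need to decide how to handle them.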
This dataset is ideal for:
- Training classification models for proteins.
- Exploratory analysis of physicochemical properties of proteins.
- Building machine learning pipelines in bioinformatics.
The dataset is divided into two subsets:
- Training: 16,000 samples (proteinas_train.csv).
- Testing: 4,000 samples (proteinas_test.csv).
This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This repository includes the datasets used in the paper "Deep Supervised and Convolutional Generative Stochastic Network for Protein Secondary Structure Prediction" (ICML 2014).
As described in the paper, two datasets are used. Both are based on protein structures from CullPDB servers. The difference is that the first one is divided into training/validation/test sets, while the second one is filtered to remove redundancy with the CB513 dataset (for the purpose of testing performance on CB513).
cullpdb+profile_5926.npy.gz is the one with the training/validation/test set division. cullpdb+profile_5926_filtered.npy.gz is the one obtained after filtering for redundancy with CB513; it is used for evaluation on CB513.
cb513+profile_split1.npy.gz is the CB513 dataset including protein features. Note that one of the sequences in CB513 is longer than 700 amino acids, so it is split into two overlapping sequences; these are the last two samples (i.e. there are 514 rows instead of 513).
It is currently in numpy format as a (N protein x k features) matrix. You can reshape it to (N protein x 700 amino acids x 57 features) first.
The 57 features are:
- [0,22): amino acid residues, in the order 'A', 'C', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'M', 'L', 'N', 'Q', 'P', 'S', 'R', 'T', 'W', 'V', 'Y', 'X', 'NoSeq'
- [22,31): secondary structure labels, in the order 'L', 'B', 'E', 'G', 'I', 'H', 'S', 'T', 'NoSeq'
- [31,33): N- and C-terminals
- [33,35): relative and absolute solvent accessibility, used only for training (absolute accessibility is thresholded at 15; relative accessibility is normalized by the largest accessibility value in a protein and thresholded at 0.15; original solvent accessibility is computed by DSSP)
- [35,57): sequence profile. Note that the order of amino acid residues here is ACDEFGHIKLMNPQRSTVWXY, which differs from the order used for the amino acid residues in [0,22).

The last feature of both the amino acid residues and the secondary structure labels just marks the end of the protein sequence. [22,31) and [33,35) are hidden during testing.
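The reshape and the feature layout described above can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the authors' loading code: the array here is a synthetic one-protein stand-in, not the real file, and the helper name `decode` is hypothetical. It assumes the two blocks are one-hot encoded and that padded positions carry the trailing 'NoSeq' class.

```python
import numpy as np

# Orders copied from the feature description.
AA_ORDER = ['A', 'C', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'M', 'L',
            'N', 'Q', 'P', 'S', 'R', 'T', 'W', 'V', 'Y', 'X', 'NoSeq']
SS_ORDER = ['L', 'B', 'E', 'G', 'I', 'H', 'S', 'T', 'NoSeq']

def decode(flat, max_len=700, n_features=57):
    """Reshape (N, max_len * n_features) rows and decode the one-hot blocks."""
    data = flat.reshape(-1, max_len, n_features)
    aa_idx = data[:, :, 0:22].argmax(axis=-1)    # residue block [0,22)
    ss_idx = data[:, :, 22:31].argmax(axis=-1)   # label block [22,31)
    results = []
    for aa_row, ss_row in zip(aa_idx, ss_idx):
        # Assumption: padded positions are marked by the 'NoSeq' class.
        keep = aa_row != AA_ORDER.index('NoSeq')
        seq = ''.join(AA_ORDER[i] for i in aa_row[keep])
        ss = ''.join(SS_ORDER[i] for i in ss_row[keep])
        results.append((seq, ss))
    return results

# Synthetic stand-in for one row: sequence "ACE" with labels "LHH".
toy = np.zeros((1, 700, 57))
toy[0, :, 21] = 1   # start with every position marked as residue 'NoSeq'
toy[0, :, 30] = 1   # ... and as secondary-structure 'NoSeq'
for pos, (aa, ss) in enumerate(zip("ACE", "LHH")):
    toy[0, pos, 21] = 0
    toy[0, pos, 30] = 0
    toy[0, pos, AA_ORDER.index(aa)] = 1
    toy[0, pos, 22 + SS_ORDER.index(ss)] = 1

decoded = decode(toy.reshape(1, -1))
print(decoded)
```

The same slicing pattern extends to the other blocks, for example `data[:, :, 35:57]` for the sequence profile.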
The cullpdb+profile_5926_filtered.npy.gz file was produced by removing duplicates from the original cullpdb+profile_6133_filtered.npy.gz file (updated 2018-10-28).
The dataset division for the cullpdb+profile_5926.npy.gz dataset is:
- [0,5430) training
- [5435,5690) test
- [5690,5926) validation

For the filtered dataset cullpdb+profile_5926_filtered.npy.gz, all proteins can be used for training, with testing on the CB513 dataset.
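The row ranges above translate directly into half-open index ranges. The sketch below applies them to a stand-in list of row numbers; the names `SPLITS` and `split_rows` are hypothetical. Note that, taken literally, the quoted ranges leave rows 5430 to 5434 outside every split.

```python
# Row ranges quoted for cullpdb+profile_5926.npy.gz (half-open intervals).
SPLITS = {
    "training": range(0, 5430),
    "test": range(5435, 5690),
    "validation": range(5690, 5926),
}

def split_rows(rows):
    """Return a dict mapping each split name to the corresponding rows."""
    return {name: [rows[i] for i in idx] for name, idx in SPLITS.items()}

# Stand-in for the 5926 proteins: just their row numbers.
parts = split_rows(list(range(5926)))
print({name: len(part) for name, part in parts.items()})
```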
NPPE-2 Protein Secondary Structure Dataset
This dataset is used for the NPPE-2 project on Protein Secondary Structure Prediction.
Source
The original dataset is provided via Kaggle: https://www.kaggle.com/competitions/sep-25-dl-gen-ai-nppe-2
Description
The dataset contains protein amino-acid sequences and their corresponding secondary structure annotations:
- Q8 (sst8): 8-state labels
- Q3 (sst3): 3-state labels derived from Q8
Files include:
- train.csv
- test.csv
- …

See the full description on the dataset page: https://huggingface.co/datasets/24f1000743/dlgenai-nppe2-dataset.
This is a text file containing the primary sequences of proteins and their corresponding secondary structure sequences. The secondary structure has only three categories: C (loop), H (helix), and E (strand).
There are 150 instances, and each instance spans two lines: the first line is the primary sequence (amino acids) and the second line is the secondary structure sequence (C, H, E).
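A file laid out this way can be read by pairing consecutive lines. The sketch below is one possible parser, run on made-up sample lines rather than the real file. It assumes blank lines can be ignored and that each structure string has the same length as its sequence, neither of which is stated in the description; the name `parse_pairs` is hypothetical.

```python
def parse_pairs(lines):
    """Group non-empty lines into (primary, secondary) pairs."""
    cleaned = [line.strip() for line in lines if line.strip()]
    if len(cleaned) % 2 != 0:
        raise ValueError("expected an even number of non-empty lines")
    pairs = []
    for primary, secondary in zip(cleaned[0::2], cleaned[1::2]):
        if len(primary) != len(secondary):
            raise ValueError("sequence and structure lengths differ")
        if set(secondary) - set("CHE"):
            raise ValueError("unexpected secondary-structure symbol")
        pairs.append((primary, secondary))
    return pairs

# Made-up example lines standing in for the real file content.
sample = ["MKVLA\n", "CHHHC\n", "GSTEV\n", "CEEEC\n"]
pairs = parse_pairs(sample)
print(pairs)
```

For the real file, `parse_pairs(open(path))` should yield 150 pairs if the description holds.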
CB6133 dataset with amino acid sequence and secondary structures. The files were extracted from the original dataset.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains 3D protein structure files in PDB format, gathered via the AlphaFoldDB API, for the Critical Assessment of protein Function Annotation (CAFA) 5 challenge protein entries.
The AlphaFoldDB is a comprehensive database that stores protein structures predicted by AlphaFold2 - an AI model developed by DeepMind that predicts the 3D structure of a protein based on its sequence. AlphaFold's predictions have been recognized for their remarkable accuracy, often comparable to those obtained from experimental methods.
The CAFA challenge is a community-wide effort to assess computational methods that predict protein function. The protein entries in this dataset are specifically related to the 5th iteration of the challenge - CAFA 5.
The naming conventions for the files are: `
http://www.gnu.org/licenses/agpl-3.0.html
Protein secondary structure prediction dataset. Used by 2015 NAR paper* from Barton group. There are a total of 1507 protein sequences, each represented by an integer identifier (e.g. 24695). 1348 in the training folder, and the rest in the blind test folder.
For each example, there are the following files:
- .fasta -> amino acid sequence for that domain
- .dssp -> ground truth 3-state secondary structures, obtained from PDB 3D crystal structures using the DSSP algorithm
- .pssm -> PSI-BLAST matrices, obtained from running the PSI-BLAST algorithm on the sequence, which returns both the matrix and a multiple-sequence alignment (MSA)
- .hmm -> profile HMM matrices, obtained by running the HMMer3 algorithm on the MSA generated from PSI-BLAST
The suggested k for cross validation is 7, such that each fold will have 193 (the last will have 190) protein sequences.
This leads on to the purpose of the third file in this dataset - shuffle.pkl. This file contains the suggested 7-fold split for cross-validation, in the form of a nested list. Random splits were generated until the 3-state secondary structure contents were within 1% of each other, to balance the prediction labels across the 7 folds.
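The fold sizes quoted above (193 per fold, 190 in the last) follow from chunking the 1348 training sequences into blocks of ceil(1348 / 7) = 193. A small sketch of that arithmetic is shown below. It does not reproduce the balanced random split stored in shuffle.pkl, and `chunk_into_folds` is a hypothetical helper name.

```python
import math

def chunk_into_folds(items, k):
    """Split items into consecutive folds of size ceil(len(items) / k)."""
    size = math.ceil(len(items) / k)
    return [items[i:i + size] for i in range(0, len(items), size)]

# Stand-in identifiers for the 1348 training sequences.
folds = chunk_into_folds(list(range(1348)), k=7)
fold_sizes = [len(fold) for fold in folds]
print(fold_sizes)
```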
*Alexey Drozdetskiy, Christian Cole, James Procter, Geoffrey J. Barton, JPred4: a protein secondary structure prediction server, Nucleic Acids Research, Volume 43, Issue W1, 1 July 2015, Pages W389–W394, https://doi.org/10.1093/nar/gkv332
Dataset Name: PISCES-16037
Dataset Overview: This dataset consists of 16,037 protein sequences and their corresponding secondary structures, curated from the Protein Data Bank (PDB) using the Protein Sequence Culling Server (PISCES). The dataset was specifically designed for training and validating protein secondary structure prediction models.
Data Source:
- Protein Data Bank (PDB): The primary source of protein structures.
- PISCES: Used to select high-quality protein sequences based on various criteria.
Dataset Features:
Dataset Characteristics:
Dataset Preparation:
Dataset Splitting:
Intended Use: This dataset is suitable for researchers and developers working on protein secondary structure prediction. It can be used to train and evaluate machine learning models, develop new prediction algorithms, and study the relationship between protein sequence and structure.
Citation: Kazm, A., Ali, A., & Hashim, H. (2024). Transformer Encoder with Protein Language Model for Protein Secondary Structure Prediction. Engineering, Technology & Applied Science Research, 14(2), 13124-13132.
Summary: The PISCES-16037 dataset provides a valuable resource for protein secondary structure prediction research. Its high-quality, diverse, and standardized nature make it well-suited for various applications in computational biology and bioinformatics.
Additional Information: The dataset includes secondary structure assignments based on the PSS-9 classification system. The following table provides a brief description of each PSS-9 type:
| Symbol | Description |
|---|---|
| B | β-bridge |
| E | β-strand |
| G | 3-10-helix |
| H | α-helix |
| I | π-helix |
| P | Helix-PPII |
| L | Loop |
| S | Bend |
| T | Turn |
| X | Disordered regions |
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset provides resources for predicting protein structures and protein–protein complexes using AlphaFold2 and AlphaFold2-Multimer.
Includes example notebooks demonstrating single-chain and multimer structure prediction workflows.
Suitable for learning, research, and practical bioinformatics applications.
Helps users understand sequence preparation, model configuration, and result interpretation.
Useful for protein folding studies, structural biology, and computational drug discovery.
Supports experiments in protein interaction analysis and complex modeling.
Provides reproducible notebooks for running AlphaFold2 pipelines in a simplified environment.
http://opendatacommons.org/licenses/dbcl/1.0/
Read this article to unlock the wonderful world of Deep Reinforcement Learning for Drug Design.
ReLeaSE is a public dataset, consisting of molecular structures and their corresponding binding affinity to proteins. The dataset was created for the purpose of evaluating and comparing machine learning models for the prediction of protein-ligand binding affinity.
The dataset contains a total of 10,000 molecules and their binding affinity to several target proteins, including thrombin, kinase, and protease. The molecular structures are represented using Simplified Molecular Input Line Entry System (SMILES) notation, which is a standardized method for representing molecular structures as a string of characters. The binding affinity is represented as a negative logarithm of the dissociation constant (pKd), which is a measure of the strength of the interaction between the molecule and the target protein.
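The pKd value described above is the negative base-10 logarithm of the dissociation constant, so a smaller Kd (tighter binding) gives a larger pKd. A minimal sketch of the conversion follows, assuming Kd is expressed in mol/L; the function name `pkd_from_kd` is hypothetical.

```python
import math

def pkd_from_kd(kd_molar):
    """pKd = -log10(Kd), with Kd expressed in mol/L."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return -math.log10(kd_molar)

# A nanomolar binder (Kd = 1e-9 M) has a pKd of about 9,
# a micromolar binder (Kd = 1e-6 M) a pKd of about 6.
nanomolar = pkd_from_kd(1e-9)
micromolar = pkd_from_kd(1e-6)
print(nanomolar, micromolar)
```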
The ReLeaSE dataset provides a standardized benchmark for evaluating machine learning models for protein-ligand binding affinity prediction. The dataset is publicly available and can be used for research purposes, making it an important resource for the drug discovery community.
http://opendatacommons.org/licenses/dbcl/1.0/
Protein secondary structure can be calculated based on its atoms' 3D coordinates once the protein's 3D structure is solved using X-ray crystallography or NMR. Commonly, DSSP is the tool used for calculating the secondary structure and assigns one of the following secondary structure types (https://swift.cmbi.umcn.nl/gv/dssp/index.html) to every amino acid in a protein:
However, X-ray crystallography and NMR are expensive. Ideally, we would like to predict the secondary structure of a protein directly from its primary sequence, a problem with a long history. A recent review on this topic is "Sixty-five years of the long march in protein secondary structure prediction: the final stretch?".
For the purpose of secondary structure prediction, it is common to simplify the aforementioned eight states (Q8) into three (Q3) by merging (E, B) into E, (H, G, I) into H, and (C, S, T) into C. The current accuracy for three-state (Q3) secondary structure prediction is about 85%, while that for eight-state (Q8) prediction is below 70%. The exact numbers depend on the particular test dataset used.
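The usual eight-to-three reduction maps the strand states (E, B) to E, the helix states (H, G, I) to H, and the remaining states (C, S, T) to C. The sketch below implements that mapping together with a per-residue accuracy, which is how Q3 and Q8 scores are normally computed. The helper names `q8_to_q3` and `accuracy` are hypothetical, and the label strings are made-up examples.

```python
# Eight-state to three-state reduction: strands -> E, helices -> H, rest -> C.
Q8_TO_Q3 = {
    "E": "E", "B": "E",
    "H": "H", "G": "H", "I": "H",
    "C": "C", "S": "C", "T": "C",
}

def q8_to_q3(q8_labels):
    """Translate a string of eight-state labels into three-state labels."""
    return "".join(Q8_TO_Q3[state] for state in q8_labels)

def accuracy(predicted, truth):
    """Fraction of residues whose predicted label matches the true label."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("label strings must be non-empty and equally long")
    matches = sum(p == t for p, t in zip(predicted, truth))
    return matches / len(truth)

truth_q3 = q8_to_q3("CEBHGIST")
print(truth_q3)
print(accuracy("CEEHHHHH", truth_q3))
```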
The main dataset lists peptide sequences and their corresponding secondary structures. It is a transformation of https://cdn.rcsb.org/etl/kabschSander/ss.txt.gz, downloaded on 2018-06-06 from RCSB PDB, into a tabular structure. If you download the file at a later time, the number of sequences in it will probably have increased.
Description of columns:
Key steps in the transformation:
*" character. has_nonstd_aa) is added to indicate whether the protein sequence contains nonstandard amino acids.
For details of curation, please see https://github.com/zyxue/pdb-secondary-structure.
A subset (9079 sequences) based on sequences culled by PISCES with stricter quality control is also provided. This dataset is considered ready for training models.
The culled subset generated on 2018-05-31 with cutoffs of 25%, 2 Å, and 0.25 for sequence identity, resolution, and R-factor, respectively, is used. The URL to the original culled list is http://dunbrack.fccc.edu/Guoli/culledpdb_hh/cullpdb_pc25_res2.0_R0.25_d180531_chains9099.gz, but it may not be permanently available. This dataset contains more columns from cullpdb_pc25_res2.0_R0.25_d180531_chains9099.gz with self-explanatory names.
For more about PISCES, please see https://academic.oup.com/bioinformatics/article/19/12/1589/258419.
The peptide sequence and secondary structure are downloaded from https://cdn.rcsb.org/etl/kabschSander/ss.txt.gz. The culled subset is downloaded from http://dunbrack.fccc.edu/PISCES.php.
Kaggle provides a great platform for sharing ideas and solving data science problems. Sharing a cleaned dataset helps others avoid duplicated work and provides a common dataset for more comparable benchmarks across different methods.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Context & Inspiration
This dataset was created to provide a comprehensive, easily accessible archive of Protein Data Bank (PDB) files, which are vital for researchers and developers working in structural biology, molecular modeling, and bioinformatics. The files were gathered from a public directory-style server, where PDB structures are organized into subfolders based on naming conventions.
The goal was to simplify access to this valuable data by restructuring it in a more machine-learning-friendly format, eliminating the need for repetitive scraping or manual downloads. Inspiration came from real-world bottlenecks in data preprocessing for protein folding prediction models, drug design simulations, and structure-based machine learning pipelines.
Source
All files were sourced from a publicly available FTP-style archive mirroring the PDB repository. The original structure was preserved to ensure traceability and compatibility with existing workflows.
Note: This dataset is intended for research and educational use. Ensure compliance with any licensing or usage terms defined by the PDB archive.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset dedicated to Protein Homology Modeling of Human Hemoglobin. The reference template structure used is PDB ID: 2DN1. Focuses on generating 3D models of a mutated (variant) Hemoglobin sequence. Includes the necessary files for template selection and sequence alignment. A practical resource for studying computational structural biology and protein prediction.
https://zenodo.org/record/5047020#.ZEQ6Y3ZBy38
Papers:
- Residue and sequence embeddings of the human proteome (SwissProt for organism Human, downloaded on 2021.06.09) computed using bio_embeddings (bioembeddings.com) using the ProtT5 embedder at full precision (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3).
- Sequence-level predictions of subcellular localization in 10 classes using LA (https://www.biorxiv.org/content/10.1101/2021.04.25.441334v1)
- Residue-level three-state secondary structure prediction (alpha, sheet or other) using models reported in the ProtTrans paper (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3)

embed_protbert_train_clip_1200_first_70000_prot.csv: embeddings for proteins computed with the ProtBert model from RostLab. Data: the first 70,000 proteins from the CAFA5 training set. Generation notebook: https://www.kaggle.com/code/alexandervc/protbert-embedding-starter?scriptVersionId=126811755
| Topic | Benchmark | Target type | Resolution | # Training sequences | Source |
|---|---|---|---|---|---|
| Protein structure | Secondary structure | Categorical (3) | Local | 8,678 | (Moult et al., 2018; Rao et al., 2019) |
| Protein structure | Disorder | Binary | Local | 8,678 | (Moult et al., 2018) |
| Protein structure | Remote homology | Categorical (1,195) | Global | 12,312 | (Andreeva et al., 2014, 2020; Rao et al., 2019) |
| Protein structure | Fold classes | Categorical (7) | Global | 15,680 | (Andreeva et al., 2014, 2020) |
| Post-translational modifications | Signal peptide | Binary | Global | 16,606 | (Armenteros et al., 2019) |
| Post-translational modifications | Major PTMs | Binary | Local | 43,356 | (Hornbeck et al., 2015) |
| Post-translational modifications | Neuropeptide cleavage | Binary | Local | 2,727 | (Ofer and Linial 2014, 2015; Brandes et al., 2016) |
| Biophysical properties | Fluorescence | Continuous | Global | 21,446 | (Sarkisyan et al., 2016; Rao et al., 2019) |
| Biophysical properties | Stability | Continuous | Global | 53,679 | (Rocklin et al., 2017; Rao et al., 2019) |

Rao R. et al. (2019) Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process. Syst., 32, 9689–9701. https://pubmed.ncbi.nlm.nih.gov/33390682/
Andreeva A, Kulesha E, Gough J, Murzin AG (2020) The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures. Nucleic Acids Res, 48, D376–D382. https://academic.oup.com/nar/article/48/D1/D376/5625529
Armenteros JJA, Tsirigos KD, Sønderby CK, et al. (2019) SignalP 5.0 improves signal peptide predictions using deep neural networks. Nat Biotechnol, 37, 420–423.
Ofer D, Linial M (2015) ProFET: Feature engineering captures high-level protein functions. Bioinformatics, 31, 3429–3436. http://dx.doi.org/10.1093/bioinformatics/btv345
Brandes N, Ofer D, Linial M (2016) ASAP: A machine learning framework for local protein properties. Database 2016. https://doi.org/10.1093/database/baw133
Sarkisyan KS, Bolotin DA, Meer MV, et al. (2016) Local fitness landscape of the green fluorescent protein. Nature, 533, 397–401. http://dx.doi.org/10.1038/nature17995
Rocklin GJ, Chidyausiku TM, Goreshnik I, et al. (2017) Global analysis of protein folding using massively parallel design, synthesis, and testing. Science, 357, 168–175.
ProteinBERT is a free open-source project available under the MIT License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset provides a curated collection of proteins with their sequences, functional annotations, and structural information. It serves as a resource for researchers working in bioinformatics, structural biology, and systems biology, offering insights into the molecular machinery of life.
By linking protein sequence data with functional and structural attributes, it supports diverse applications such as:
- Protein classification and annotation tasks
- Sequence-to-function machine learning models
- Structural modeling and docking studies
- Comparative proteomics and evolutionary studies
The dataset is valuable for both computational researchers and students of life sciences, offering real-world biological data that can be directly integrated into pipelines for protein analysis, modeling, and prediction.