32 datasets found
  1. Structural Protein Sequences

    • kaggle.com
    zip
    Updated Feb 3, 2018
    Cite
    SHAHIR (2018). Structural Protein Sequences [Dataset]. https://www.kaggle.com/datasets/shahir/protein-data-set
    Available download formats: zip (28782775 bytes)
    Dataset updated
    Feb 3, 2018
    Authors
    SHAHIR
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    This is a protein dataset retrieved from the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB).

    The PDB archive is a repository of atomic coordinates and other information describing proteins and other important biological macromolecules. Structural biologists use methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy to determine the location of each atom relative to each other in the molecule. They then deposit this information, which is then annotated and publicly released into the archive by the wwPDB.

    The constantly-growing PDB is a reflection of the research that is happening in laboratories across the world. This can make it both exciting and challenging to use the database in research and education. Structures are available for many of the proteins and nucleic acids involved in the central processes of life, so you can go to the PDB archive to find structures for ribosomes, oncogenes, drug targets, and even whole viruses. However, it can be a challenge to find the information that you need, since the PDB archives so many different structures. You will often find multiple structures for a given molecule, or partial structures, or structures that have been modified or inactivated from their native form.

    Content

    There are two data files. Both are keyed on the protein's "structureId":

    • pdb_data_no_dups.csv contains protein metadata, including details on protein classification, extraction methods, etc.

    • data_seq.csv contains >400,000 protein structure sequences.
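
Because both files share "structureId", they can be joined on that key. The following is a minimal sketch using only the standard library; the sample rows and every column name other than structureId (classification, chainId, sequence) are made-up placeholders, not taken from the actual files.

```python
import csv
import io

# Tiny in-memory stand-ins for the two files. The IDs, values, and all
# column names except structureId are hypothetical.
META_CSV = """structureId,classification
AAA1,EXAMPLE CLASS ONE
BBB2,EXAMPLE CLASS TWO
"""
SEQ_CSV = """structureId,chainId,sequence
AAA1,A,MKVLA
BBB2,A,GGSTA
BBB2,B,GGSTA
"""

def join_on_structure_id(meta_file, seq_file):
    """Attach the metadata row to every sequence row with the same structureId."""
    meta = {row["structureId"]: row for row in csv.DictReader(meta_file)}
    joined = []
    for row in csv.DictReader(seq_file):
        info = meta.get(row["structureId"])
        if info is not None:  # skip sequences without a metadata record
            joined.append({**info, **row})
    return joined

rows = join_on_structure_id(io.StringIO(META_CSV), io.StringIO(SEQ_CSV))
for row in rows:
    print(row["structureId"], row["chainId"], row["classification"])
```

With the real files, the two `io.StringIO` objects would be replaced by opened file handles. One structure can have several chains, so the join is one-to-many.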

    Acknowledgements

    Original dataset downloaded from http://www.rcsb.org/pdb/

    Inspiration

    The Protein Data Bank has helped the life science community study different diseases and come up with new drugs and solutions that help human survival.

  2. Mouse protein structure prediction (cleaned)

    • kaggle.com
    zip
    Updated Oct 19, 2023
    Cite
    Methembe Thomas Tshuma (2023). Mouse protein structure prediction (cleaned) [Dataset]. https://www.kaggle.com/datasets/congo43/mouse-protein-structure-prediction-cleaned
    Available download formats: zip (430783 bytes)
    Dataset updated
    Oct 19, 2023
    Authors
    Methembe Thomas Tshuma
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Methembe Thomas Tshuma

    Released under CC0: Public Domain

  3. CASP12

    • kaggle.com
    zip
    Updated May 19, 2025
    Cite
    Aruja Tiwary (2025). CASP12 [Dataset]. https://www.kaggle.com/datasets/arujatiwary/casp12
    Available download formats: zip (14194884795 bytes)
    Dataset updated
    May 19, 2025
    Authors
    Aruja Tiwary
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains text files from the CASP12 (Critical Assessment of Techniques for Protein Structure Prediction) challenge, which is a benchmark for evaluating computational methods in protein structure prediction.

    Each file corresponds to a target protein and includes:

    • The amino acid sequence of the protein.
    • The true structural properties, such as secondary structure or 3D coordinates (depending on the specific format).
    • Standardized naming conventions for easy parsing.

    All files are in plain .txt format and are ready to be used for machine learning models, such as LSTMs or other sequence-based models.

  4. CB513 dataset for protein structure prediction

    • kaggle.com
    zip
    Updated Sep 1, 2018
    Cite
    Moklesur Rahman (2018). CB513 dataset for protein structure prediction [Dataset]. https://www.kaggle.com/moklesur/cb513-dataset-for-protein-structure-prediction
    Available download formats: zip (9628765 bytes)
    Dataset updated
    Sep 1, 2018
    Authors
    Moklesur Rahman
    Description

    Dataset

    This dataset was created by Moklesur Rahman

  5. Bioinformatics Protein Dataset - Simulated

    • kaggle.com
    zip
    Updated Dec 27, 2024
    + more versions
    Cite
    Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated
    Available download formats: zip (12928905 bytes)
    Dataset updated
    Dec 27, 2024
    Authors
    Rafael Gallo
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Subtitle

    "Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

    Description

    Introduction

    This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

    Columns Included

    • ID_Protein: Unique identifier for each protein.
    • Sequence: String of amino acids.
    • Molecular_Weight: Molecular weight calculated from the sequence.
    • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
    • Hydrophobicity: Average hydrophobicity calculated from the sequence.
    • Total_Charge: Sum of the charges of the amino acids in the sequence.
    • Polar_Proportion: Percentage of polar amino acids in the sequence.
    • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
    • Sequence_Length: Total number of amino acids in the sequence.
    • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

    Inspiration and Sources

    While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:

    • UniProt: A comprehensive database of protein sequences and annotations.
    • Kyte-Doolittle Scale: Calculations of hydrophobicity.
    • Biopython: A tool for analyzing biological sequences.

    Proposed Uses

    This dataset is ideal for:

    • Training classification models for proteins.
    • Exploratory analysis of physicochemical properties of proteins.
    • Building machine learning pipelines in bioinformatics.

    How This Dataset Was Created

    1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
    2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
    3. Class Assignment: Classes were randomly assigned for classification purposes.
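
The three steps above can be sketched as follows. This is an illustration only, not the author's script: it uses the standard library instead of Biopython, computes just one property (mean Kyte-Doolittle hydrophobicity), and the function names and fixed seed are assumptions made for the example.

```python
import random

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = sorted(KYTE_DOOLITTLE)
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]

def random_sequence(rng, min_len=50, max_len=300):
    """Step 1: draw a random chain with a length in [min_len, max_len]."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))

def mean_hydrophobicity(seq):
    """Step 2 (one property only): average Kyte-Doolittle value of the chain."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)

def make_record(rng, protein_id):
    """Step 3: assign a class at random, independent of the sequence."""
    seq = random_sequence(rng)
    return {
        "ID_Protein": protein_id,
        "Sequence": seq,
        "Hydrophobicity": mean_hydrophobicity(seq),
        "Sequence_Length": len(seq),
        "Class": rng.choice(CLASSES),
    }

rng = random.Random(0)  # fixed seed so the sketch is repeatable
record = make_record(rng, "P00001")
print(record["Sequence_Length"], record["Class"])
```

Since the class label is drawn independently of the sequence, a classifier trained on such data should not be expected to beat chance, which is consistent with the limitations listed for this dataset.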

    Limitations

    • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
    • The functional classes are simulated and do not correspond to actual biological characteristics.

    Data Split

    The dataset is divided into two subsets:

    • Training: 16,000 samples (proteinas_train.csv).
    • Testing: 4,000 samples (proteinas_test.csv).

    Acknowledgment

    This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.

  6. Protein_dataset

    • kaggle.com
    zip
    Updated Jul 17, 2024
    Cite
    Tran Minh Thuan (2024). Protein_dataset [Dataset]. https://www.kaggle.com/datasets/tranminhthuan/protein-dataset
    Available download formats: zip (7146573 bytes)
    Dataset updated
    Jul 17, 2024
    Authors
    Tran Minh Thuan
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This repository includes the datasets used in "Deep supervised and convolutional generative stochastic network for protein secondary structure prediction" (ICML 2014).

    As described in the paper, two datasets are used. Both are based on protein structures from the CullPDB servers. The difference is that the first one is divided into training/validation/test sets, while the second one is filtered to remove redundancy with the CB513 dataset (for the purpose of testing performance on CB513).

    cullpdb+profile_5926_filtered.npy.gz is the version filtered for redundancy with CB513. It is used for evaluation on CB513.

    cb513+profile_split1.npy.gz is CB513 including protein features. Note that one of the sequences in CB513 is longer than 700 amino acids, so it is split into two overlapping sequences, which are the last two samples (i.e. there are 514 rows instead of 513).

    It is currently in numpy format as a (N protein x k features) matrix. You can reshape it to (N protein x 700 amino acids x 57 features) first.

    The 57 features are:

    • [0,22): amino acid residues, in the order 'A', 'C', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'M', 'L', 'N', 'Q', 'P', 'S', 'R', 'T', 'W', 'V', 'Y', 'X', 'NoSeq'
    • [22,31): secondary structure labels, in the order 'L', 'B', 'E', 'G', 'I', 'H', 'S', 'T', 'NoSeq'
    • [31,33): N- and C-terminals
    • [33,35): relative and absolute solvent accessibility, used only for training (absolute accessibility is thresholded at 15; relative accessibility is normalized by the largest accessibility value in a protein and thresholded at 0.15; original solvent accessibility is computed by DSSP)
    • [35,57): sequence profile. Note that the order of amino acid residues here is ACDEFGHIKLMNPQRSTVWXY, which differs from the order used for the amino acid residue features.

    The last feature of both the amino acid residues and the secondary structure labels just marks the end of the protein sequence. [22,31) and [33,35) are hidden during testing.
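
To make the layout concrete, here is a minimal sketch that decodes one protein (one row of 700 x 57 values) back into its residue string and 8-state label string. It is written in plain Python so it runs without NumPy; the function names are hypothetical, and it assumes that the residue and label blocks are one-hot and that padding positions have the 'NoSeq' residue feature set, as the description above suggests.

```python
AA_ORDER = ['A', 'C', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'M', 'L',
            'N', 'Q', 'P', 'S', 'R', 'T', 'W', 'V', 'Y', 'X', 'NoSeq']
SS_ORDER = ['L', 'B', 'E', 'G', 'I', 'H', 'S', 'T', 'NoSeq']
MAX_LEN, N_FEATURES = 700, 57

def argmax(values):
    """Index of the largest entry in a short sequence of numbers."""
    return max(range(len(values)), key=lambda i: values[i])

def decode_row(flat_row):
    """Return (residues, q8_labels) for one protein stored as 700 * 57 flat values."""
    if len(flat_row) != MAX_LEN * N_FEATURES:
        raise ValueError("expected 700 * 57 values per protein")
    residues, labels = [], []
    for pos in range(MAX_LEN):
        feats = flat_row[pos * N_FEATURES:(pos + 1) * N_FEATURES]
        aa = AA_ORDER[argmax(feats[0:22])]      # features [0,22)
        if aa == 'NoSeq':                       # end-of-sequence marker
            break
        residues.append(aa)
        labels.append(SS_ORDER[argmax(feats[22:31])])  # features [22,31)
    return "".join(residues), "".join(labels)
```

With NumPy available, the same slices apply after reshaping the loaded matrix to (N, 700, 57), for example `data[:, :, 0:22]` for the residue block.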

    The cullpdb+profile_5926_filtered.npy.gz file was produced by removing duplicates from the original cullpdb+profile_6133_filtered.npy.gz file (updated 2018-10-28).

    The dataset division for the cullpdb+profile_5926.npy.gz dataset is:

    • [0,5430): training
    • [5435,5690): test
    • [5690,5926): validation

    For the filtered dataset cullpdb+profile_5926_filtered.npy.gz, all proteins can be used for training, with testing on the CB513 dataset.

  7. dlgenai-nppe2-dataset

    • huggingface.co
    Updated Dec 17, 2025
    Cite
    Vrindesh Pareek (2025). dlgenai-nppe2-dataset [Dataset]. https://huggingface.co/datasets/24f1000743/dlgenai-nppe2-dataset
    Dataset updated
    Dec 17, 2025
    Authors
    Vrindesh Pareek
    Description

    NPPE-2 Protein Secondary Structure Dataset

    This dataset is used for the NPPE-2 project on Protein Secondary Structure Prediction.

    Source

    The original dataset is provided via Kaggle: https://www.kaggle.com/competitions/sep-25-dl-gen-ai-nppe-2

    Description

    The dataset contains protein amino-acid sequences and their corresponding secondary structure annotations:

    • Q8 (sst8): 8-state labels
    • Q3 (sst3): 3-state labels derived from Q8

    Files include:

    train.csv test.csv… See the full description on the dataset page: https://huggingface.co/datasets/24f1000743/dlgenai-nppe2-dataset.

  8. RS126Data

    • kaggle.com
    zip
    Updated Jul 17, 2020
    Cite
    Tamzid Hasan (2020). RS126Data [Dataset]. https://www.kaggle.com/tamzidhasan/rs126data
    Available download formats: zip (19392 bytes)
    Dataset updated
    Jul 17, 2020
    Authors
    Tamzid Hasan
    Description

    Introduction

    This is a text file containing the primary sequence of each protein and the corresponding secondary structure sequence. The secondary structure has only three categories:

    1. C: Loops
    2. H: Helix
    3. E: Strand

    Dataset

    There are 150 instances, and every instance contains 2 lines:

    • 1st line: primary sequence (amino acids)
    • 2nd line: secondary structure sequence (C, H, E)
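
A two-lines-per-instance file like this can be read as follows. This is a minimal sketch: the function name is hypothetical, and it assumes that blank lines carry no meaning and that each secondary structure line has the same length as its primary sequence, which the description does not state explicitly.

```python
def parse_pairs(text):
    """Split text into (primary, secondary) pairs, two lines per instance."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("expected an even number of non-blank lines")
    pairs = []
    for primary, secondary in zip(lines[0::2], lines[1::2]):
        if len(primary) != len(secondary):
            raise ValueError("sequence and structure lengths differ")
        if set(secondary) - set("CHE"):
            raise ValueError("unexpected secondary structure symbol")
        pairs.append((primary, secondary))
    return pairs

# Made-up sample with two instances.
sample = "MKV\nCHH\nGA\nEC\n"
print(parse_pairs(sample))
```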

  9. CB6133 dataset

    • kaggle.com
    zip
    Updated Sep 20, 2024
    Cite
    Gabriel Bianchin de Oliveira (2024). CB6133 dataset [Dataset]. https://www.kaggle.com/datasets/gabrielbianchin/cb6133-dataset
    Available download formats: zip (1147451 bytes)
    Dataset updated
    Sep 20, 2024
    Authors
    Gabriel Bianchin de Oliveira
    Description

    CB6133 dataset with amino acid sequence and secondary structures. The files were extracted from the original dataset.

  10. CAFA 5 Protein Database Files (PDB)

    • kaggle.com
    zip
    Updated Jul 25, 2023
    Cite
    A Merii (2023). CAFA 5 Protein Database Files (PDB) [Dataset]. https://www.kaggle.com/datasets/amerii/cafa-5-pdbs
    Available download formats: zip (12654687498 bytes)
    Dataset updated
    Jul 25, 2023
    Authors
    A Merii
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains 3D protein structure files in PDB format, gathered via the AlphaFoldDB API, for the Critical Assessment of protein Function Annotation (CAFA) 5 challenge protein entries.

    The AlphaFoldDB is a comprehensive database that stores protein structures predicted by AlphaFold2 - an AI model developed by DeepMind that predicts the 3D structure of a protein based on its sequence. AlphaFold's predictions have been recognized for their remarkable accuracy, often comparable to those obtained from experimental methods.

    The CAFA challenge is a community-wide effort to assess computational methods that predict protein function. The protein entries in this dataset are specifically related to the 5th iteration of the challenge - CAFA 5.

    The dataset provides the following information for each protein:

    The naming conventions for the files are: `

  11. Protein secondary structure prediction Jpred4 data

    • kaggle.com
    zip
    Updated Oct 1, 2021
    Cite
    jiagengchang (2021). Protein secondary structure prediction Jpred4 data [Dataset]. https://www.kaggle.com/jiagengchang/dcpb1500
    Available download formats: zip (20099527 bytes)
    Dataset updated
    Oct 1, 2021
    Authors
    jiagengchang
    License

    http://www.gnu.org/licenses/agpl-3.0.html

    Description

    Context

    Protein secondary structure prediction dataset, used by the 2015 NAR paper* from the Barton group. There are a total of 1507 protein sequences, each represented by an integer identifier (e.g. 24695). 1348 are in the training folder, and the rest are in the blind test folder.

    For each example, there are the following files:

    • .fasta -> amino acid sequence for that domain
    • .dssp -> ground truth 3-state secondary structures, obtained from PDB 3D crystal structures using the DSSP algorithm
    • .pssm -> PSI-BLAST matrices, obtained from running the PSI-BLAST algorithm on the sequence, which returns both the matrix and a multiple-sequence alignment (MSA)
    • .hmm -> profile HMM matrices, obtained by running the HMMer3 algorithm on the MSA generated from PSI-BLAST

    The suggested k for cross validation is 7, such that each fold will have 193 (the last will have 190) protein sequences.

    This leads on to the purpose of the third file in this dataset - shuffle.pkl. This file contains the suggested 7-fold split for cross-validation, in the form of a nested list. Random splits were generated until the 3-state secondary structure contents were within 1% of each other, to balance the prediction labels across the 7 folds.
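
The fold sizes quoted above follow from chunking the 1348 training sequences into 7 groups of at most ceil(1348 / 7) = 193. The sketch below reproduces only that arithmetic with contiguous chunks. It is not the balanced split stored in shuffle.pkl, and the function name is hypothetical.

```python
import math

def chunk_into_folds(ids, k):
    """Cut ids into consecutive chunks of size ceil(len(ids) / k)."""
    size = math.ceil(len(ids) / k)
    return [ids[i:i + size] for i in range(0, len(ids), size)]

folds = chunk_into_folds(list(range(1348)), 7)
sizes = [len(fold) for fold in folds]
print(sizes)
```

Six folds hold 193 sequences and the last holds the remaining 1348 - 6 * 193 = 190.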

    *Alexey Drozdetskiy, Christian Cole, James Procter, Geoffrey J. Barton, JPred4: a protein secondary structure prediction server, Nucleic Acids Research, Volume 43, Issue W1, 1 July 2015, Pages W389–W394, https://doi.org/10.1093/nar/gkv332

  12. 9-class Protein Secondary Structure Dataset

    • kaggle.com
    zip
    Updated Sep 13, 2024
    Cite
    ammar A kazm (2024). 9-class Protein Secondary Structure Dataset [Dataset]. https://www.kaggle.com/datasets/ammarakazm/9-class-protein-secondary-structure-dataset
    Available download formats: zip (4106160 bytes)
    Dataset updated
    Sep 13, 2024
    Authors
    ammar A kazm
    Description

    Dataset Description and Summary

    Dataset Name: PISCES-16037

    Dataset Overview: This dataset consists of 16,037 protein sequences and their corresponding secondary structures, curated from the Protein Data Bank (PDB) using the Protein Sequence Culling Server (PISCES). The dataset was specifically designed for training and validating protein secondary structure prediction models.

    Data Source:

    • Protein Data Bank (PDB): The primary source of protein structures.
    • PISCES: Used to select high-quality protein sequences based on various criteria.

    Dataset Features:

    • Protein Sequences: Amino acid sequences of the proteins.
    • Secondary Structures: Secondary structure assignments for each residue, using the DSSP 4.0 classification (including the polyproline helix).
    • PDB IDs and Chain Identifiers: Unique identifiers for each protein structure and chain.

    Dataset Characteristics:

    • Quality: Proteins were selected based on stringent criteria to ensure high-quality structures.
    • Diversity: The dataset includes a diverse range of protein sequences in terms of length, structure, and function.
    • Standardization: Secondary structures are assigned using the DSSP 4.0 algorithm, a widely accepted standard.

    Dataset Preparation:

    • Filtering: Overlapping sequences with test datasets (CASP12, CASP13, CASP14, and CB433) were removed.
    • Completeness: Proteins with incomplete structural information were excluded.

    Dataset Splitting:

    • Training Set: 15,037 proteins were used for training the model.
    • Validation Set: 1,000 proteins were used for evaluating the model's performance during training.

    Intended Use: This dataset is suitable for researchers and developers working on protein secondary structure prediction. It can be used to train and evaluate machine learning models, develop new prediction algorithms, and study the relationship between protein sequence and structure.

    Citation: Kazm, A., Ali, A., & Hashim, H. (2024). Transformer Encoder with Protein Language Model for Protein Secondary Structure Prediction. Engineering, Technology & Applied Science Research, 14(2), 13124-13132.

    Summary: The PISCES-16037 dataset provides a valuable resource for protein secondary structure prediction research. Its high-quality, diverse, and standardized nature makes it well-suited for various applications in computational biology and bioinformatics.

    Additional Information: The dataset includes secondary structure assignments based on the PSS-9 classification system. The following table provides a brief description of each PSS-9 type:

    Symbol  Description
    B       β-bridge
    E       β-strand
    G       3-10-helix
    H       α-helix
    I       π-helix
    P       Helix-PPII
    L       Loop
    S       Bend
    T       Turn
    X       Disordered regions
  13. Protein Structure Prediction Alphafold2 Multimer

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). Protein Structure Prediction Alphafold2 Multimer [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/protein-structure-prediction-alphafold2-multimer/code
    Available download formats: zip (1133963 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset provides resources for predicting protein structures and protein–protein complexes using AlphaFold2 and AlphaFold2-Multimer.

    Includes example notebooks demonstrating single-chain and multimer structure prediction workflows.

    Suitable for learning, research, and practical bioinformatics applications.

    Helps users understand sequence preparation, model configuration, and result interpretation.

    Useful for protein folding studies, structural biology, and computational drug discovery.

    Supports experiments in protein interaction analysis and complex modeling.

    Provides reproducible notebooks for running AlphaFold2 pipelines in a simplified environment.

  14. SMILES DataSet for Analysis & Prediction Dataset

    • kaggle.com
    zip
    Updated Jun 11, 2023
    Cite
    Yan Maksi (2023). SMILES DataSet for Analysis & Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/yanmaksi/big-molecules-smiles-dataset
    Available download formats: zip (296339 bytes)
    Dataset updated
    Jun 11, 2023
    Authors
    Yan Maksi
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Read this article to unlock the wonderful world of Deep Reinforcement Learning for Drug Design.

    ReLeaSE is a public dataset, consisting of molecular structures and their corresponding binding affinity to proteins. The dataset was created for the purpose of evaluating and comparing machine learning models for the prediction of protein-ligand binding affinity.

    The dataset contains a total of 10,000 molecules and their binding affinity to several target proteins, including thrombin, kinase, and protease. The molecular structures are represented using Simplified Molecular Input Line Entry System (SMILES) notation, which is a standardized method for representing molecular structures as a string of characters. The binding affinity is represented as a negative logarithm of the dissociation constant (pKd), which is a measure of the strength of the interaction between the molecule and the target protein.
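
The pKd value described above is the negative base-10 logarithm of the dissociation constant. A minimal sketch of the conversion, assuming Kd is given in molar units (the function name is hypothetical):

```python
import math

def pkd_from_kd(kd_molar):
    """Convert a dissociation constant in mol/L to pKd = -log10(Kd)."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return -math.log10(kd_molar)

# A 1 nM binder (Kd = 1e-9 M) has a pKd of about 9;
# a weaker 1 uM binder (Kd = 1e-6 M) has a pKd of about 6.
print(pkd_from_kd(1e-9), pkd_from_kd(1e-6))
```

A higher pKd therefore means tighter binding.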

    The ReLeaSE dataset provides a standardized benchmark for evaluating machine learning models for protein-ligand binding affinity prediction. The dataset is publicly available and can be used for research purposes, making it an important resource for the drug discovery community.

  15. Protein Secondary Structure

    • kaggle.com
    zip
    Updated Jun 6, 2018
    Cite
    -_- (2018). Protein Secondary Structure [Dataset]. https://www.kaggle.com/alfrandom/protein-secondary-structure
    Available download formats: zip (40687706 bytes)
    Dataset updated
    Jun 6, 2018
    Authors
    -_-
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Introduction

    Protein secondary structure can be calculated based on its atoms' 3D coordinates once the protein's 3D structure is solved using X-ray crystallography or NMR. Commonly, DSSP is the tool used for calculating the secondary structure and assigns one of the following secondary structure types (https://swift.cmbi.umcn.nl/gv/dssp/index.html) to every amino acid in a protein:

    1. C: Loops and irregular elements (corresponding to the blank characters output by DSSP)
    2. E: β-strand
    3. H: α-helix
    4. B: β-bridge
    5. G: 3-helix
    6. I: π-helix
    7. T: Turn
    8. S: Bend

    However, X-ray crystallography and NMR are expensive. Ideally, we would like to predict the secondary structure of a protein directly from its primary sequence, a problem with a long history. A review on this topic was published recently: Sixty-five years of the long march in protein secondary structure prediction: the final stretch?

    For the purpose of secondary structure prediction, it is common to simplify the aforementioned eight states (Q8) into three (Q3) by merging (E, B) into E, (H, G, I) into H, and (C, S, T) into C. The current accuracy for three-state (Q3) secondary structure prediction is about 85%, while that for eight-state (Q8) prediction is <70%. The exact number depends on the particular test dataset used.
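
The Q8 to Q3 reduction can be sketched as a lookup table, assuming the usual grouping in which strands and bridges (E, B) map to E, helices (H, G, I) map to H, and everything else (C, S, T) maps to C. The names below are hypothetical.

```python
# Eight-state to three-state mapping.
Q8_TO_Q3 = {
    "E": "E", "B": "E",            # strand-like states
    "H": "H", "G": "H", "I": "H",  # helical states
    "C": "C", "S": "C", "T": "C",  # coil, bend, turn
}

def q8_to_q3(sst8):
    """Translate an eight-state string into its three-state equivalent."""
    return "".join(Q8_TO_Q3[state] for state in sst8)

print(q8_to_q3("CEBHGITS"))
```

An unknown symbol raises a KeyError, which is a deliberate choice here so that unexpected labels are not silently passed through.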

    Dataset

    The main dataset lists peptide sequences and their corresponding secondary structures. It is a transformation of https://cdn.rcsb.org/etl/kabschSander/ss.txt.gz, downloaded on 2018-06-06 from RCSB PDB, into a tabular structure. If you download the file at a later time, the number of sequences in it will probably increase.

    Description of columns:

    1. pdb_id: the id used to locate its entry on https://www.rcsb.org/
    2. chain_code: when a protein consists of multiple peptides (chains), the chain code is needed to locate a particular one.
    3. seq: the sequence of the peptide
    4. sst8: the eight-state (Q8) secondary structure
    5. sst3: the three-state (Q3) secondary structure
    6. len: the length of the peptide
    7. has_nonstd_aa: whether the peptide contains nonstandard amino acids (B, O, U, X, or Z).

    Key steps in the transformation:

    • Both Q3 and Q8 secondary structure sequences are listed.
    • All nonstandard amino acids (B, O, U, X, and Z; see here for their meanings) are masked with the "*" character.
    • An additional column (has_nonstd_aa) is added to indicate whether the protein sequence contains nonstandard amino acids.
    • A subset of the sequences with low sequence identity and high resolution, ready for training, is also provided.
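
The masking step and the has_nonstd_aa flag can be illustrated as below. This is a sketch of the idea, not the curation code from the linked repository, and the function name is hypothetical.

```python
NONSTANDARD = set("BOUXZ")  # letters treated as nonstandard amino acids

def mask_nonstandard(seq):
    """Return (masked sequence, has_nonstd_aa) for one peptide sequence."""
    has_nonstd = any(aa in NONSTANDARD for aa in seq)
    masked = "".join("*" if aa in NONSTANDARD else aa for aa in seq)
    return masked, has_nonstd

print(mask_nonstandard("MKXAU"))
print(mask_nonstandard("MKA"))
```

Masking with a single character keeps the sequence length unchanged, so the sequence stays aligned position by position with the sst8 and sst3 strings.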

    For details of curation, please see https://github.com/zyxue/pdb-secondary-structure.

    A subset (9079 sequences) based on sequences culled by PISCES with stricter quality control is also provided. This dataset is considered ready for training models.

    The culled subset generated on 2018-05-31 with cutoffs of 25%, 2Å, and 0.25 for sequence identity, resolution and R-factor respectively, is used. The URL to the original culled list is http://dunbrack.fccc.edu/Guoli/culledpdb_hh/cullpdb_pc25_res2.0_R0.25_d180531_chains9099.gz, but it may not be permanently available. This dataset contains more columns from cullpdb_pc25_res2.0_R0.25_d180531_chains9099.gz with self-explanatory names.
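
The cutoffs quoted above are also encoded in the culled list's file name. The sketch below pulls them out with a regular expression. Reading the fields as identity percentage, resolution, R-factor, date, and chain count is an assumption based on how the name lines up with the text, and the function and key names are hypothetical.

```python
import re

CULL_NAME = re.compile(
    r"cullpdb_pc(?P<pc>[\d.]+)_res(?P<res>[\d.]+)_R(?P<r>[\d.]+)"
    r"_d(?P<date>\d{6})_chains(?P<chains>\d+)"
)

def parse_cull_name(name):
    """Extract the cutoffs that appear to be encoded in a PISCES list name."""
    match = CULL_NAME.search(name)
    if match is None:
        raise ValueError("name does not follow the expected pattern")
    return {
        "max_identity_pct": float(match["pc"]),
        "max_resolution": float(match["res"]),
        "max_r_factor": float(match["r"]),
        "date_yymmdd": match["date"],
        "n_chains": int(match["chains"]),
    }

info = parse_cull_name("cullpdb_pc25_res2.0_R0.25_d180531_chains9099.gz")
print(info)
```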

    For more about PISCES, please see https://academic.oup.com/bioinformatics/article/19/12/1589/258419.

    Acknowledgements

    The peptide sequence and secondary structure are downloaded from https://cdn.rcsb.org/etl/kabschSander/ss.txt.gz. The culled subset is downloaded from http://dunbrack.fccc.edu/PISCES.php.

    Inspiration

    Kaggle provides a great platform for sharing ideas and solving data science problems. Sharing a cleaned dataset helps prevent duplicated work and also provides a common dataset for more comparable benchmarks among different methods.

    Early attempts on this (or related) problem:

    1. Baldi, Pierre, Søren Brunak, Paolo Frasconi, Gianluca Pollastri and Giovanni Soda. “Bidirectional Dynamics for Protein Secondary Structure Prediction.” Sequence Learning (2001). http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.7092&rep=rep1&type=pdf
    2. Chen, J. and Chaudhari, N. S.. "Protein Secondary Structure Prediction with bidirectional LSTM networks." Paper presented at the meeting of the Post-Conference Workshop on Computational Intelligence Approaches for the Analysis of Bio-data (CI-BIO), Montreal, Canada, 2005. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.7092&rep=rep1&type=pdf (Couldn't find a pdf)
    3. Sepp Hochreiter, Martin Heusel, Klaus Obermayer; Fast model-based protein homology ...
  16. Protein Data Bank

    • kaggle.com
    zip
    Updated May 3, 2025
    Cite
    Ahmet Can GÜNAY (2025). Protein Data Bank [Dataset]. https://www.kaggle.com/datasets/ahmetcangunay1/protein-data-bank
    Available download formats: zip (5079269900 bytes)
    Dataset updated
    May 3, 2025
    Authors
    Ahmet Can GÜNAY
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    📖 Context & Inspiration

    This dataset was created to provide a comprehensive, easily accessible archive of Protein Data Bank (PDB) files—vital for researchers and developers working in structural biology, molecular modeling, and bioinformatics. The files were gathered from a public directory-style server, where PDB structures are organized into subfolders based on naming conventions.

    The goal was to simplify access to this valuable data by restructuring it in a more machine-learning-friendly format, eliminating the need for repetitive scraping or manual downloads. Inspiration came from real-world bottlenecks in data preprocessing for protein folding prediction models, drug design simulations, and structure-based machine learning pipelines.

    🌐 Source

    All files were sourced from a publicly available FTP-style archive mirroring the PDB repository. The original structure was preserved to ensure traceability and compatibility with existing workflows.

    ⚠️ Note: This dataset is intended for research and educational use. Ensure compliance with any licensing or usage terms defined by the PDB archive.
    
  17. PROTEIN-STRUCTURE-MODELLING--2DN1-Human-Hemoglobin

    • kaggle.com
    zip
    Updated Nov 28, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dr. Nagendra (2025). PROTEIN-STRUCTURE-MODELLING--2DN1-Human-Hemoglobin [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/protein-structure-modelling-2dn1-human-hemoglobin
    Available download formats: zip (1464028 bytes)
    Dataset updated
    Nov 28, 2025
    Authors
    Dr. Nagendra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset dedicated to Protein Homology Modeling of Human Hemoglobin. The reference template structure used is PDB ID: 2DN1. Focuses on generating 3D models of a mutated (variant) Hemoglobin sequence. Includes the necessary files for template selection and sequence alignment. A practical resource for studying computational structural biology and protein prediction.

  18. Protein Embeddings 1

    • kaggle.com
    zip
    Updated Apr 23, 2023
    Cite
    Alexander Chervov (2023). Protein Embeddings 1 [Dataset]. https://www.kaggle.com/datasets/alexandervc/protein-embeddings-1
    Explore at:
    Available download formats: zip (713648892 bytes)
    Dataset updated
    Apr 23, 2023
    Authors
    Alexander Chervov
    Description

    Part 1 of files from:

    https://zenodo.org/record/5047020#.ZEQ6Y3ZBy38

    Papers:
    - Residue and sequence embeddings of the human proteome (SwissProt for organism Human, downloaded on 2021.06.09) computed using bio_embeddings (bioembeddings.com) using the ProtT5 embedder at full precision (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3).
    - Sequence-level predictions of subcellular localization in 10 classes using LA (https://www.biorxiv.org/content/10.1101/2021.04.25.441334v1)
    - Residue-level three state secondary structure prediction (alpha, sheet or other) using models reported in the ProtTrans paper (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3)

    Part 2

    embed_protbert_train_clip_1200_first_70000_prot.csv: protein embeddings computed with the ProtBert model from RostLab. Data: the first 70,000 proteins of the CAFA5 training set. Generation notebook: https://www.kaggle.com/code/alexandervc/protbert-embedding-starter?scriptVersionId=126811755

  19. Protein Benchmarks

    • kaggle.com
    zip
    Updated Dec 5, 2022
    Cite
    Samriddhi Sinha (2022). Protein Benchmarks [Dataset]. https://www.kaggle.com/datasets/djokester/protein-benchmarks/code
    Explore at:
    Available download formats: zip (17372386 bytes)
    Dataset updated
    Dec 5, 2022
    Authors
    Samriddhi Sinha
    Description

    Topic                            | Benchmark             | Target type         | Resolution | # Training sequences | Source
    ---------------------------------+-----------------------+---------------------+------------+----------------------+-------
    Protein structure                | Secondary structure   | Categorical (3)     | Local      | 8,678                | (Moult et al., 2018; Rao et al., 2019)
                                     | Disorder              | Binary              | Local      | 8,678                | (Moult et al., 2018)
                                     | Remote homology       | Categorical (1,195) | Global     | 12,312               | (Andreeva et al., 2014, 2020; Rao et al., 2019)
                                     | Fold classes          | Categorical (7)     | Global     | 15,680               | (Andreeva et al., 2014, 2020)
    Post-translational modifications | Signal peptide        | Binary              | Global     | 16,606               | (Armenteros et al., 2019)
                                     | Major PTMs            | Binary              | Local      | 43,356               | (Hornbeck et al., 2015)
                                     | Neuropeptide cleavage | Binary              | Local      | 2,727                | (Ofer and Linial 2014, 2015; Brandes et al., 2016)
    Biophysical properties           | Fluorescence          | Continuous          | Global     | 21,446               | (Sarkisyan et al., 2016; Rao et al., 2019)
                                     | Stability             | Continuous          | Global     | 53,679               | (Rocklin et al., 2017; Rao et al., 2019)

    Sources

    Moult J, Fidelis K, Kryshtafovych A, et al. (2018) Critical assessment of methods of protein structure prediction (CASP)—Round XII. Proteins Struct Funct Bioinforma, 86, 7–15.

    Rao R. et al. (2019) Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process. Syst., 32, 9689–9701. https://pubmed.ncbi.nlm.nih.gov/33390682/

    Andreeva A, Kulesha E, Gough J, Murzin AG (2020) The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures. Nucleic Acids Res, 48, D376–D382. https://academic.oup.com/nar/article/48/D1/D376/5625529

    Armenteros JJA, Tsirigos KD, Sønderby CK, et al. (2019) SignalP 5.0 improves signal peptide predictions using deep neural networks. Nat Biotechnol, 37, 420–423.

    Hornbeck PV, Zhang B, Murray B, et al. (2015) PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Res, 43, D512–D520.

    Ofer D, Linial M (2014) NeuroPID: a predictor for identifying neuropeptide precursors from metazoan proteomes. Bioinformatics, 30, 931–940.

    Ofer D, Linial M (2015) ProFET: Feature engineering captures high-level protein functions. Bioinformatics, 31, 3429–3436. http://dx.doi.org/10.1093/bioinformatics/btv345

    Brandes N, Ofer D, Linial M (2016) ASAP: A machine learning framework for local protein properties. Database 2016. https://doi.org/10.1093/database/baw133

    Sarkisyan KS, Bolotin DA, Meer MV, et al. (2016) Local fitness landscape of the green fluorescent protein. Nature, 533, 397–401. http://dx.doi.org/10.1038/nature17995

    Rocklin GJ, Chidyausiku TM, Goreshnik I, et al. (2017) Global analysis of protein folding using massively parallel design, synthesis, and testing. Science, 357, 168–175.

    License

    ProteinBERT is a free open-source project available under the MIT License

    Image by PublicDomainPictures from Pixabay

  20. Neurotransmitter Receptors & Protein Sequences

    • kaggle.com
    zip
    Updated Jul 7, 2025
    Cite
    Yashasvi Goswami (2025). Neurotransmitter Receptors & Protein Sequences [Dataset]. https://www.kaggle.com/datasets/yashasvigoswami/neurotransmitter-receptors
    Explore at:
    Available download formats: zip (35950 bytes)
    Dataset updated
    Jul 7, 2025
    Authors
    Yashasvi Goswami
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset provides a curated collection of proteins with their sequences, functional annotations, and structural information. It serves as a resource for researchers working in bioinformatics, structural biology, and systems biology, offering insights into the molecular machinery of life.

    By linking protein sequence data with functional and structural attributes, it supports diverse applications such as:
    - Protein classification and annotation tasks
    - Sequence-to-function machine learning models
    - Structural modeling and docking studies
    - Comparative proteomics and evolutionary studies

    The dataset is valuable for both computational researchers and students of life sciences, offering real-world biological data that can be directly integrated into pipelines for protein analysis, modeling, and prediction.

Cite
SHAHIR (2018). Structural Protein Sequences [Dataset]. https://www.kaggle.com/datasets/shahir/protein-data-set

Structural Protein Sequences

Sequence and metadata for various protein structures

Explore at:
391 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: zip (28782775 bytes)
Dataset updated
Feb 3, 2018
Authors
SHAHIR
License

http://opendatacommons.org/licenses/dbcl/1.0/

Description

Context

This is a protein data set retrieved from the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB).

The PDB archive is a repository of atomic coordinates and other information describing proteins and other important biological macromolecules. Structural biologists use methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy to determine the location of each atom relative to each other in the molecule. They then deposit this information, which is then annotated and publicly released into the archive by the wwPDB.

The constantly-growing PDB is a reflection of the research that is happening in laboratories across the world. This can make it both exciting and challenging to use the database in research and education. Structures are available for many of the proteins and nucleic acids involved in the central processes of life, so you can go to the PDB archive to find structures for ribosomes, oncogenes, drug targets, and even whole viruses. However, it can be a challenge to find the information that you need, since the PDB archives so many different structures. You will often find multiple structures for a given molecule, or partial structures, or structures that have been modified or inactivated from their native form.

Content

There are two data files. Both are arranged on "structureId" of the protein:

  • pdb_data_no_dups.csv contains protein metadata, including details on protein classification, extraction methods, etc.

  • data_seq.csv contains >400,000 protein structure sequences.
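
Because both files are keyed on "structureId", they can be combined with an ordinary key-based join. The sketch below is one minimal way to do that with Python's standard `csv` module. Only the `structureId` column name comes from the description above; the function name `join_on_structure_id`, the other column names, and the toy rows are assumptions made for illustration and may not match the real files.

```python
import csv
import io
from collections import defaultdict


def join_on_structure_id(meta_file, seq_file):
    """Attach each sequence row to the metadata row sharing its structureId."""
    # Index sequence rows by key; one structure may have several rows.
    sequences = defaultdict(list)
    for row in csv.DictReader(seq_file):
        sequences[row["structureId"]].append(row)

    joined = []
    for meta in csv.DictReader(meta_file):
        for seq in sequences.get(meta["structureId"], []):
            joined.append({**seq, **meta})  # one merged record per sequence row
    return joined


# Toy stand-ins for the two CSV files (made-up IDs, columns, and values).
META_CSV = "structureId,classification\nTOY1,CLASS_A\nTOY2,CLASS_B\n"
SEQ_CSV = (
    "structureId,chainId,sequence\n"
    "TOY1,A,MKV\n"
    "TOY2,A,ACGT\n"
    "TOY2,B,ACGT\n"
)

rows = join_on_structure_id(io.StringIO(META_CSV), io.StringIO(SEQ_CSV))
```

With the real data, the two `io.StringIO` objects would be replaced by open file handles for the downloaded CSV files. Structures that appear in only one of the two files are dropped by this inner-join style loop.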

Acknowledgements

Original data set downloaded from http://www.rcsb.org/pdb/

Inspiration

The Protein Data Bank has helped the life science community study different diseases and come up with new drugs and solutions that support human survival.
