32 datasets found

Structural Protein Sequences
kaggle.com
zip
Updated Feb 3, 2018
Cite
SHAHIR (2018). Structural Protein Sequences [Dataset]. https://www.kaggle.com/datasets/shahir/protein-data-set
Available download formats: zip (28782775 bytes)
Dataset updated
Feb 3, 2018
Authors
SHAHIR
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
Context

This is a protein dataset retrieved from the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB).

The PDB archive is a repository of atomic coordinates and other information describing proteins and other important biological macromolecules. Structural biologists use methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy to determine the location of each atom relative to each other in the molecule. They then deposit this information, which is then annotated and publicly released into the archive by the wwPDB.

The constantly-growing PDB is a reflection of the research that is happening in laboratories across the world. This can make it both exciting and challenging to use the database in research and education. Structures are available for many of the proteins and nucleic acids involved in the central processes of life, so you can go to the PDB archive to find structures for ribosomes, oncogenes, drug targets, and even whole viruses. However, it can be a challenge to find the information that you need, since the PDB archives so many different structures. You will often find multiple structures for a given molecule, or partial structures, or structures that have been modified or inactivated from their native form.

Content

There are two data files. Both are keyed on the "structureId" of the protein:

pdb_data_no_dups.csv contains protein metadata, which includes details on protein classification, extraction methods, etc.

data_seq.csv contains >400,000 protein structure sequences.
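Since both files share the "structureId" key, they can be joined row by row. The following is a minimal sketch using only the Python standard library. The column names other than structureId (classification, chainId, sequence) and the toy rows are assumptions made for illustration and should be checked against the real CSV headers.

```python
import csv
import io

# Toy stand-ins for pdb_data_no_dups.csv and data_seq.csv.
# IDs, values, and every column except structureId are made up for illustration.
META_CSV = """structureId,classification
1ABC,TOY CLASS ONE
2XYZ,TOY CLASS TWO
"""
SEQ_CSV = """structureId,chainId,sequence
2XYZ,A,ACDEFG
1ABC,A,GHIKLM
1ABC,B,GHIKLM
"""


def join_on_structure_id(meta_file, seq_file):
    """Attach the metadata row to every sequence row with the same structureId."""
    meta = {row["structureId"]: row for row in csv.DictReader(meta_file)}
    joined = []
    for row in csv.DictReader(seq_file):
        info = meta.get(row["structureId"])
        if info is None:
            continue  # sequence rows without metadata are skipped in this sketch
        joined.append({**info, **row})
    return joined


rows = join_on_structure_id(io.StringIO(META_CSV), io.StringIO(SEQ_CSV))
for r in rows:
    print(r["structureId"], r["chainId"], r["classification"])
```

With the real files, the two `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`. One structure can have several chains, so the join is one-to-many from metadata to sequences.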

Acknowledgements

Original dataset downloaded from http://www.rcsb.org/pdb/

Inspiration

The Protein Data Bank has helped the life science community study different diseases and come up with new drugs and solutions that support human survival.
Mouse protein structure prediction (cleaned)
kaggle.com
zip
Updated Oct 19, 2023
Cite
Methembe Thomas Tshuma (2023). Mouse protein structure prediction (cleaned) [Dataset]. https://www.kaggle.com/datasets/congo43/mouse-protein-structure-prediction-cleaned
Available download formats: zip (430783 bytes)
Dataset updated
Oct 19, 2023
Authors
Methembe Thomas Tshuma
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset

This dataset was created by Methembe Thomas Tshuma

Released under CC0: Public Domain

CASP12
kaggle.com
zip
Updated May 19, 2025
Cite
Aruja Tiwary (2025). CASP12 [Dataset]. https://www.kaggle.com/datasets/arujatiwary/casp12
Available download formats: zip (14194884795 bytes)
Dataset updated
May 19, 2025
Authors
Aruja Tiwary
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
This dataset contains text files from the CASP12 (Critical Assessment of Techniques for Protein Structure Prediction) challenge, which is a benchmark for evaluating computational methods in protein structure prediction.

Each file corresponds to a target protein and includes:
- The amino acid sequence of the protein.
- The true structural properties, such as secondary structure or 3D coordinates (depending on the specific format).
- Standardized naming conventions for easy parsing.

All files are in plain .txt format and are ready to be used for machine learning models, such as LSTMs or other sequence-based models.
CB513 dataset for protein structure prediction
kaggle.com
zip
Updated Sep 1, 2018
Cite
Moklesur Rahman (2018). CB513 dataset for protein structure prediction [Dataset]. https://www.kaggle.com/moklesur/cb513-dataset-for-protein-structure-prediction
Available download formats: zip (9628765 bytes)
Dataset updated
Sep 1, 2018
Authors
Moklesur Rahman
Description
Dataset

This dataset was created by Moklesur Rahman

Bioinformatics Protein Dataset - Simulated
kaggle.com
zip
Updated Dec 27, 2024
Cite
Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated
Available download formats: zip (12928905 bytes)
Dataset updated
Dec 27, 2024
Authors
Rafael Gallo
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Subtitle

"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

Description

Introduction

This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

Columns Included

ID_Protein: Unique identifier for each protein.

Sequence: String of amino acids.

Molecular_Weight: Molecular weight calculated from the sequence.

Isoelectric_Point: Estimated isoelectric point based on the sequence composition.

Hydrophobicity: Average hydrophobicity calculated from the sequence.

Total_Charge: Sum of the charges of the amino acids in the sequence.

Polar_Proportion: Percentage of polar amino acids in the sequence.

Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.

Sequence_Length: Total number of amino acids in the sequence.

Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

Inspiration and Sources

While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:
- UniProt: A comprehensive database of protein sequences and annotations.
- Kyte-Doolittle Scale: Calculations of hydrophobicity.
- Biopython: A tool for analyzing biological sequences.

Proposed Uses

This dataset is ideal for:
- Training classification models for proteins.
- Exploratory analysis of physicochemical properties of proteins.
- Building machine learning pipelines in bioinformatics.

How This Dataset Was Created

Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.

Property Calculation: Physicochemical properties were calculated using the Biopython library.

Class Assignment: Classes were randomly assigned for classification purposes.
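The three creation steps above can be sketched roughly as follows. This is not the author's script: the dataset description says Biopython was used for the property calculations, whereas this illustration uses only the standard library and computes just the average Kyte-Doolittle hydrophobicity. The function name `random_protein` and the uniform sampling of residues and classes are assumptions.

```python
import random

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = sorted(KYTE_DOOLITTLE)
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]


def random_protein(rng, min_len=50, max_len=300):
    """Generate one synthetic record: random sequence, mean hydropathy, random class."""
    length = rng.randint(min_len, max_len)  # inclusive on both ends
    sequence = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    hydrophobicity = sum(KYTE_DOOLITTLE[aa] for aa in sequence) / length
    return {
        "Sequence": sequence,
        "Sequence_Length": length,
        "Hydrophobicity": hydrophobicity,
        "Class": rng.choice(CLASSES),  # label is independent of the sequence
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
protein = random_protein(rng)
print(protein["Sequence_Length"], round(protein["Hydrophobicity"], 3), protein["Class"])
```

Because the class is drawn independently of the sequence in this sketch, no model should be expected to beat chance on such labels, which is consistent with the limitations listed for this dataset.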

Limitations

The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.

The functional classes are simulated and do not correspond to actual biological characteristics.

Data Split

The dataset is divided into two subsets:
- Training: 16,000 samples (proteinas_train.csv).
- Testing: 4,000 samples (proteinas_test.csv).

Acknowledgment

This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
Protein_dataset
kaggle.com
zip
Updated Jul 17, 2024
Cite
Tran Minh Thuan (2024). Protein_dataset [Dataset]. https://www.kaggle.com/datasets/tranminhthuan/protein-dataset
Available download formats: zip (7146573 bytes)
Dataset updated
Jul 17, 2024
Authors
Tran Minh Thuan
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
This repository includes the datasets used in "Deep supervised and convolutional generative stochastic network for protein secondary structure prediction", presented at ICML 2014.

As described in the paper, two datasets are used. Both are based on protein structures from the CullPDB servers. The difference is that the first one is divided into training/validation/test sets, while the second one is filtered to remove redundancy with the CB513 dataset (for the purpose of testing performance on CB513).

cullpdb+profile_5926.npy.gz is the one with the training/validation/test set division. cullpdb+profile_5926_filtered.npy.gz is the one after filtering for redundancy with CB513; it is used for evaluation on CB513.

cb513+profile_split1.npy.gz is the CB513 dataset including protein features. Note that one of the sequences in CB513 is longer than 700 amino acids, so it is split into two overlapping sequences, which are the last two samples (i.e. there are 514 rows instead of 513).

It is currently in numpy format as a (N protein x k features) matrix. You can reshape it to (N protein x 700 amino acids x 57 features) first.

The 57 features are:

[0,22): amino acid residues, in the order 'A', 'C', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'M', 'L', 'N', 'Q', 'P', 'S', 'R', 'T', 'W', 'V', 'Y', 'X', 'NoSeq'

[22,31): secondary structure labels, in the order 'L', 'B', 'E', 'G', 'I', 'H', 'S', 'T', 'NoSeq'

[31,33): N- and C-terminals

[33,35): relative and absolute solvent accessibility, used only for training (absolute accessibility is thresholded at 15; relative accessibility is normalized by the largest accessibility value in a protein and thresholded at 0.15; original solvent accessibility is computed by DSSP)

[35,57): sequence profile. Note that the order of amino acid residues here is ACDEFGHIKLMNPQRSTVWXY, which differs from the order used for the amino acid residue features in [0,22).

The last feature of both the amino acid residues and the secondary structure labels just marks the end of the protein sequence. [22,31) and [33,35) are hidden during testing.
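To make the layout concrete, here is a small sketch that decodes one flattened row (700 positions x 57 features) back into a sequence and its secondary structure string, using only plain Python lists. The helper names `decode_protein` and `encode_toy` are hypothetical, and the sketch assumes the residue and label blocks are one-hot encoded, with 'NoSeq' marking padding. With NumPy, the same reshaping step would typically be `data.reshape(-1, 700, 57)` on the loaded array.

```python
AA_ORDER = ["A", "C", "E", "D", "G", "F", "I", "H", "K", "M", "L",
            "N", "Q", "P", "S", "R", "T", "W", "V", "Y", "X", "NoSeq"]
SS_ORDER = ["L", "B", "E", "G", "I", "H", "S", "T", "NoSeq"]
MAX_LEN, N_FEAT = 700, 57


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda j: values[j])


def decode_protein(flat_row):
    """Recover (sequence, secondary structure) from one flattened row."""
    residues, labels = [], []
    for pos in range(MAX_LEN):
        feats = flat_row[pos * N_FEAT:(pos + 1) * N_FEAT]
        aa = AA_ORDER[argmax(feats[0:22])]        # features [0,22)
        if aa == "NoSeq":                         # padding: end of protein
            break
        residues.append(aa)
        labels.append(SS_ORDER[argmax(feats[22:31])])  # features [22,31)
    return "".join(residues), "".join(labels)


def encode_toy(seq, ss):
    """Build a toy flattened row (one-hot blocks only) for demonstration."""
    row = [0.0] * (MAX_LEN * N_FEAT)
    for pos in range(MAX_LEN):
        base = pos * N_FEAT
        if pos < len(seq):
            row[base + AA_ORDER.index(seq[pos])] = 1.0
            row[base + 22 + SS_ORDER.index(ss[pos])] = 1.0
        else:
            row[base + 21] = 1.0   # residue 'NoSeq'
            row[base + 30] = 1.0   # label 'NoSeq'
    return row


row = encode_toy("ACE", "HHL")
print(decode_protein(row))
```

The toy row leaves the terminal, solvent accessibility, and profile features at zero; a real row carries values there as well.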

The cullpdb+profile_5926_filtered.npy.gz file is the original cullpdb+profile_6133_filtered.npy.gz file with duplicates removed, updated 2018-10-28.

The dataset division for the cullpdb+profile_5926.npy.gz dataset is

[0,5430) training

[5435,5690) test

[5690,5926) validation

For the filtered dataset cullpdb+profile_5926_filtered.npy.gz, all proteins can be used for training, with testing done on the CB513 dataset.
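The half-open row ranges of the division can be written down directly as Python `range` objects. This is only an illustration of the stated indices; the names `SPLITS` and `split_of` are hypothetical. Note that, as the ranges are given, rows 5430 to 5434 belong to no subset.

```python
# Row ranges for cullpdb+profile_5926.npy.gz, exactly as stated in the description.
SPLITS = {
    "training": range(0, 5430),
    "test": range(5435, 5690),
    "validation": range(5690, 5926),
}


def split_of(row_index):
    """Return the subset name for a row index, or None if it is unassigned."""
    for name, rows in SPLITS.items():
        if row_index in rows:
            return name
    return None


for name, rows in SPLITS.items():
    print(name, len(rows))
print(split_of(5432))  # falls in the gap between training and test
```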
dlgenai-nppe2-dataset
huggingface.co
Updated Dec 17, 2025
Cite
Vrindesh Pareek (2025). dlgenai-nppe2-dataset [Dataset]. https://huggingface.co/datasets/24f1000743/dlgenai-nppe2-dataset
Dataset updated
Dec 17, 2025
Authors
Vrindesh Pareek
Description
NPPE-2 Protein Secondary Structure Dataset

This dataset is used for the NPPE-2 project on Protein Secondary Structure Prediction.

Source

The original dataset is provided via Kaggle: https://www.kaggle.com/competitions/sep-25-dl-gen-ai-nppe-2

Description

The dataset contains protein amino-acid sequences and their corresponding secondary structure annotations:

Q8 (sst8): 8-state labels

Q3 (sst3): 3-state labels derived from Q8

Files include:

train.csv test.csv… See the full description on the dataset page: https://huggingface.co/datasets/24f1000743/dlgenai-nppe2-dataset.
RS126Data
kaggle.com
zip
Updated Jul 17, 2020
Cite
Tamzid Hasan (2020). RS126Data [Dataset]. https://www.kaggle.com/tamzidhasan/rs126data
Available download formats: zip (19392 bytes)
Dataset updated
Jul 17, 2020
Authors
Tamzid Hasan
Description
Introduction

This is a text file containing the primary sequence of each protein and the corresponding secondary structure sequence. The secondary structure has only 3 categories: 1. C: loops 2. H: helix 3. E: strand

Dataset

There are 150 instances, and every instance consists of 2 lines:

1st line = primary sequence (amino acid)

2nd line = secondary sequence (C, H, E)
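A file with this two-lines-per-instance layout can be read by pairing consecutive lines. The sketch below assumes there is no header line, that blank lines can be ignored, and that both lines of an instance have equal length; the function name `parse_pairs` and the toy input are made up for illustration.

```python
def parse_pairs(text):
    """Parse alternating primary/secondary lines into (sequence, structure) pairs."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("expected an even number of non-empty lines")
    records = []
    for primary, secondary in zip(lines[0::2], lines[1::2]):
        if len(primary) != len(secondary):
            raise ValueError("sequence and structure lengths differ")
        if set(secondary) - set("CHE"):
            raise ValueError("structure line contains symbols other than C, H, E")
        records.append((primary, secondary))
    return records


# Toy input, not taken from the dataset.
sample = """MKVLA
CHHHC
GSTEEV
CCEEEC
"""
records = parse_pairs(sample)
print(len(records), records[0])
```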
CB6133 dataset
kaggle.com
zip
Updated Sep 20, 2024
Cite
Gabriel Bianchin de Oliveira (2024). CB6133 dataset [Dataset]. https://www.kaggle.com/datasets/gabrielbianchin/cb6133-dataset
Available download formats: zip (1147451 bytes)
Dataset updated
Sep 20, 2024
Authors
Gabriel Bianchin de Oliveira
Description
CB6133 dataset with amino acid sequence and secondary structures. The files were extracted from the original dataset.
CAFA 5 Protein Database Files (PDB)
kaggle.com
zip
Updated Jul 25, 2023
Cite
A Merii (2023). CAFA 5 Protein Database Files (PDB) [Dataset]. https://www.kaggle.com/datasets/amerii/cafa-5-pdbs
Available download formats: zip (12654687498 bytes)
Dataset updated
Jul 25, 2023
Authors
A Merii
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
Description
This dataset contains 3D protein structure files in PDB format, gathered via the AlphaFoldDB API, for the Critical Assessment of protein Function Annotation (CAFA) 5 challenge protein entries.

The AlphaFoldDB is a comprehensive database that stores protein structures predicted by AlphaFold2 - an AI model developed by DeepMind that predicts the 3D structure of a protein based on its sequence. AlphaFold's predictions have been recognized for their remarkable accuracy, often comparable to those obtained from experimental methods.

The CAFA challenge is a community-wide effort to assess computational methods that predict protein function. The protein entries in this dataset are specifically related to the 5th iteration of the challenge - CAFA 5.

The dataset provides the following information for each protein:

The naming conventions for the files are: `
Protein secondary structure prediction Jpred4 data
kaggle.com
zip
Updated Oct 1, 2021
Cite
jiagengchang (2021). Protein secondary structure prediction Jpred4 data [Dataset]. https://www.kaggle.com/jiagengchang/dcpb1500
Available download formats: zip (20099527 bytes)
Dataset updated
Oct 1, 2021
Authors
jiagengchang
License
http://www.gnu.org/licenses/agpl-3.0.html
Description
Context

Protein secondary structure prediction dataset, used by the 2015 NAR paper* from the Barton group. There are a total of 1507 protein sequences, each represented by an integer identifier (e.g. 24695): 1348 in the training folder and the rest in the blind test folder.

For each example, there are the following files:

.fasta -> amino acid sequence for that domain

.dssp -> ground truth 3-state secondary structures, obtained from PDB 3D crystal structures using the DSSP algorithm

.pssm -> PSI-BLAST matrices, obtained from running the PSI-BLAST algorithm on the sequence, which returns both the matrix and a multiple-sequence alignment (MSA)

.hmm -> profile HMM matrices, obtained by running the HMMer3 algorithm on the MSA generated from PSI-BLAST

The suggested k for cross validation is 7, such that each fold will have 193 (the last will have 190) protein sequences.

This leads on to the purpose of the third file in this dataset - shuffle.pkl. This file contains the suggested 7-fold split for cross-validation, in the form of a nested list. Random splits were generated until the 3-state secondary structure contents were within 1% of each other, to balance the prediction labels across the 7 folds.
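The fold sizes quoted above follow from splitting the 1348 training sequences into 7 folds, and a nested list of folds can drive cross-validation directly. The sketch below is an illustration only: the helper names are hypothetical, and the toy nested list stands in for the contents of shuffle.pkl, whose exact structure (a list of 7 lists of identifiers, loadable with `pickle.load`) is an assumption.

```python
def fold_sizes(n_items, k):
    """Fold sizes when every fold but the last takes ceil(n_items / k) items."""
    per_fold = -(-n_items // k)  # ceiling division
    sizes = [per_fold] * (k - 1)
    sizes.append(n_items - per_fold * (k - 1))
    return sizes


def cross_validation_rounds(folds):
    """Yield (train_ids, held_out_ids) once per fold."""
    for i, held_out in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, held_out


print(fold_sizes(1348, 7))

# Toy nested list standing in for the suggested split.
toy_folds = [[1, 2], [3, 4], [5]]
rounds = list(cross_validation_rounds(toy_folds))
for train, held_out in rounds:
    print("train:", train, "validate:", held_out)
```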

*Alexey Drozdetskiy, Christian Cole, James Procter, Geoffrey J. Barton, JPred4: a protein secondary structure prediction server, Nucleic Acids Research, Volume 43, Issue W1, 1 July 2015, Pages W389–W394, https://doi.org/10.1093/nar/gkv332
9-class Protein Secondary Structure Dataset
kaggle.com
zip
Updated Sep 13, 2024
Cite
ammar A kazm (2024). 9-class Protein Secondary Structure Dataset [Dataset]. https://www.kaggle.com/datasets/ammarakazm/9-class-protein-secondary-structure-dataset
Available download formats: zip (4106160 bytes)
Dataset updated
Sep 13, 2024
Authors
ammar A kazm
Description
Dataset Description and Summary

Dataset Name: PISCES-16037

Dataset Overview: This dataset consists of 16,037 protein sequences and their corresponding secondary structures, curated from the Protein Data Bank (PDB) using the Protein Sequence Culling Server (PISCES). The dataset was specifically designed for training and validating protein secondary structure prediction models.

Data Source:
* Protein Data Bank (PDB): The primary source of protein structures.
* PISCES: Used to select high-quality protein sequences based on various criteria.

Dataset Features:

Protein Sequences: Amino acid sequences of the proteins.

Secondary Structures: Secondary structure assignments for each residue, using the DSSP 4.0 classification (including the polyproline helix).

PDB IDs and Chain Identifiers: Unique identifiers for each protein structure and chain.

Dataset Characteristics:

Quality: Proteins were selected based on stringent criteria to ensure high-quality structures.

Diversity: The dataset includes a diverse range of protein sequences in terms of length, structure, and function.

Standardization: Secondary structures are assigned using the DSSP 4.0 algorithm, a widely accepted standard.

Dataset Preparation:

Filtering: Overlapping sequences with test datasets (CASP12, CASP13, CASP14, and CB433) were removed.

Completeness: Proteins with incomplete structural information were excluded.

Dataset Splitting:

Training Set: 15,037 proteins were used for training the model.

Validation Set: 1,000 proteins were used for evaluating the model's performance during training.

Intended Use: This dataset is suitable for researchers and developers working on protein secondary structure prediction. It can be used to train and evaluate machine learning models, develop new prediction algorithms, and study the relationship between protein sequence and structure.

Citation: Kazm, A., Ali, A., & Hashim, H. (2024). Transformer Encoder with Protein Language Model for Protein Secondary Structure Prediction. Engineering, Technology & Applied Science Research, 14(2), 13124-13132.

Summary: The PISCES-16037 dataset provides a valuable resource for protein secondary structure prediction research. Its high-quality, diverse, and standardized nature make it well-suited for various applications in computational biology and bioinformatics.

Additional Information: The dataset includes secondary structure assignments based on the PSS-9 classification system. The following table provides a brief description of each PSS-9 type:

Symbol Description
B β-bridge
E β-strand
G 3-10-helix
H α-helix
I π-helix
P Helix-PPII
L Loop
S Bend
T Turn
X Disordered regions
Protein Structure Prediction Alphafold2 Multimer
kaggle.com
zip
Updated Nov 29, 2025
Cite
Dr. Nagendra (2025). Protein Structure Prediction Alphafold2 Multimer [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/protein-structure-prediction-alphafold2-multimer/code
Available download formats: zip (1133963 bytes)
Dataset updated
Nov 29, 2025
Authors
Dr. Nagendra
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
This dataset provides resources for predicting protein structures and protein–protein complexes using AlphaFold2 and AlphaFold2-Multimer.

Includes example notebooks demonstrating single-chain and multimer structure prediction workflows.

Suitable for learning, research, and practical bioinformatics applications.

Helps users understand sequence preparation, model configuration, and result interpretation.

Useful for protein folding studies, structural biology, and computational drug discovery.

Supports experiments in protein interaction analysis and complex modeling.

Provides reproducible notebooks for running AlphaFold2 pipelines in a simplified environment.
SMILES DataSet for Analysis & Prediction Dataset
kaggle.com
zip
Updated Jun 11, 2023
Cite
Yan Maksi (2023). SMILES DataSet for Analysis & Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/yanmaksi/big-molecules-smiles-dataset
Available download formats: zip (296339 bytes)
Dataset updated
Jun 11, 2023
Authors
Yan Maksi
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description

Read this article to unlock the wonderful world of Deep Reinforcement Learning for Drug Design.

ReLeaSE is a public dataset, consisting of molecular structures and their corresponding binding affinity to proteins. The dataset was created for the purpose of evaluating and comparing machine learning models for the prediction of protein-ligand binding affinity.

The dataset contains a total of 10,000 molecules and their binding affinity to several target proteins, including thrombin, kinase, and protease. The molecular structures are represented using Simplified Molecular Input Line Entry System (SMILES) notation, which is a standardized method for representing molecular structures as a string of characters. The binding affinity is represented as a negative logarithm of the dissociation constant (pKd), which is a measure of the strength of the interaction between the molecule and the target protein.
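As a quick worked example of the pKd definition above, pKd = -log10(Kd). The sketch assumes Kd is expressed in mol/L; the function name `pkd_from_kd` is hypothetical.

```python
import math


def pkd_from_kd(kd_molar):
    """Convert a dissociation constant in mol/L to pKd = -log10(Kd)."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return -math.log10(kd_molar)


# A 1 nM binder versus a 1 uM binder: the tighter binder has the higher pKd.
print(pkd_from_kd(1e-9), pkd_from_kd(1e-6))
```

A smaller Kd means tighter binding, so larger pKd values correspond to stronger interactions.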

The ReLeaSE dataset provides a standardized benchmark for evaluating machine learning models for protein-ligand binding affinity prediction. The dataset is publicly available and can be used for research purposes, making it an important resource for the drug discovery community.
Protein Secondary Structure
kaggle.com
zip
Updated Jun 6, 2018
Cite
-_- (2018). Protein Secondary Structure [Dataset]. https://www.kaggle.com/alfrandom/protein-secondary-structure
Available download formats: zip (40687706 bytes)
Dataset updated
Jun 6, 2018
Authors
-_-
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
Introduction

Protein secondary structure can be calculated based on its atoms' 3D coordinates once the protein's 3D structure is solved using X-ray crystallography or NMR. Commonly, DSSP is the tool used for calculating the secondary structure and assigns one of the following secondary structure types (https://swift.cmbi.umcn.nl/gv/dssp/index.html) to every amino acid in a protein:

C: Loops and irregular elements (corresponding to the blank characters output by DSSP)

E: β-strand

H: α-helix

B: β-bridge

G: 3-10-helix

I: π-helix

T: Turn

S: Bend

However, X-ray crystallography and NMR are expensive. Ideally, we would like to predict the secondary structure of a protein directly from its primary sequence, a problem with a long history. A recent review on this topic is "Sixty-five years of the long march in protein secondary structure prediction: the final stretch?".

For the purpose of secondary structure prediction, it is common to simplify the aforementioned eight states (Q8) into three (Q3) by merging (E, B) into E, (H, G, I) into H, and (C, S, T) into C. The current accuracy for three-state (Q3) secondary structure prediction is about 85%, while that for eight-state (Q8) prediction is <70%. The exact number depends on the particular test dataset used.
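The usual Q8-to-Q3 reduction, with strands (E, B) mapped to E, helices (H, G, I) mapped to H, and the rest (C, S, T) mapped to C, can be sketched as a lookup table, together with the per-residue accuracy typically reported for Q3 and Q8. The function names are hypothetical, and the example strings are toy inputs rather than dataset rows.

```python
# Eight-state (Q8) to three-state (Q3) reduction.
Q8_TO_Q3 = {
    "E": "E", "B": "E",            # strand-like states
    "H": "H", "G": "H", "I": "H",  # helical states
    "C": "C", "S": "C", "T": "C",  # coil-like states
}


def q8_to_q3(sst8):
    """Map an eight-state secondary structure string to three states."""
    return "".join(Q8_TO_Q3[state] for state in sst8)


def accuracy(predicted, true):
    """Fraction of residues whose predicted state equals the true state."""
    if len(predicted) != len(true) or not true:
        raise ValueError("strings must be non-empty and of equal length")
    return sum(p == t for p, t in zip(predicted, true)) / len(true)


print(q8_to_q3("CCHHHGGTEEB"))
print(accuracy("HHEC", "HHCC"))
```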

Dataset

The main dataset lists peptide sequences and their corresponding secondary structures. It is a transformation of https://cdn.rcsb.org/etl/kabschSander/ss.txt.gz, downloaded on 2018-06-06 from RCSB PDB, into a tabular structure. If you download the file at a later time, the number of sequences in it will probably increase.

Description of columns:

pdb_id: the id used to locate its entry on https://www.rcsb.org/

chain_code: when a protein consists of multiple peptides (chains), the chain code is needed to locate a particular one.

seq: the sequence of the peptide

sst8: the eight-state (Q8) secondary structure

sst3: the three-state (Q3) secondary structure

len: the length of the peptide

has_nonstd_aa: whether the peptide contains nonstandard amino acids (B, O, U, X, or Z).

Key steps in the transformation:

Both Q3 and Q8 secondary structure sequences are listed.

All nonstandard amino acids, which include B, O, U, X, and Z, are masked with the "*" character.

An additional column (has_nonstd_aa) is added to indicate whether the protein sequence contains nonstandard amino acids.

A subset of the sequences with low sequence identity and high resolution, ready for training, is also provided

For details of curation, please see https://github.com/zyxue/pdb-secondary-structure.
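The masking step and the has_nonstd_aa flag described in the key steps can be illustrated as follows. This is a sketch of the idea, not the curation code from the linked repository; the function name `mask_nonstandard` is hypothetical.

```python
NONSTANDARD = set("BOUXZ")  # letters treated as nonstandard amino acids


def mask_nonstandard(seq):
    """Return (masked sequence, has_nonstd_aa flag) for one peptide."""
    has_nonstd_aa = any(aa in NONSTANDARD for aa in seq)
    masked = "".join("*" if aa in NONSTANDARD else aa for aa in seq)
    return masked, has_nonstd_aa


print(mask_nonstandard("ACXDU"))
print(mask_nonstandard("ACDEF"))
```

Masking keeps the sequence length unchanged, so each position still lines up with its sst8 and sst3 label.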

A subset (9079 sequences) based on sequences culled by PISCES with stricter quality control is also provided. This dataset is considered ready for training models.

The culled subset generated on 2018-05-31 with cutoffs of 25%, 2Å, and 0.25 for sequence identity, resolution and R-factor respectively, is used. The URL to the original culled list is http://dunbrack.fccc.edu/Guoli/culledpdb_hh/cullpdb_pc25_res2.0_R0.25_d180531_chains9099.gz, but it may not be permanently available. This dataset contains more columns from cullpdb_pc25_res2.0_R0.25_d180531_chains9099.gz with self-explanatory names.

For more about PISCES, please see https://academic.oup.com/bioinformatics/article/19/12/1589/258419.

Acknowledgements

The peptide sequence and secondary structure are downloaded from https://cdn.rcsb.org/etl/kabschSander/ss.txt.gz. The culled subset is downloaded from http://dunbrack.fccc.edu/PISCES.php.

Inspiration

Kaggle provides a great platform for sharing ideas and solving data science problems. Sharing a cleaned dataset helps others avoid duplicated work and also provides a common dataset for more comparable benchmarks among different methods.

Early attempts on this (or related) problem:

Baldi, Pierre, Søren Brunak, Paolo Frasconi, Gianluca Pollastri and Giovanni Soda. “Bidirectional Dynamics for Protein Secondary Structure Prediction.” Sequence Learning (2001). http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.7092&rep=rep1&type=pdf

Chen, J. and Chaudhari, N. S.. "Protein Secondary Structure Prediction with bidirectional LSTM networks." Paper presented at the meeting of the Post-Conference Workshop on Computational Intelligence Approaches for the Analysis of Bio-data (CI-BIO), Montreal, Canada, 2005. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.7092&rep=rep1&type=pdf (Couldn't find a pdf)

Sepp Hochreiter, Martin Heusel, Klaus Obermayer; Fast model-based protein homology ...
Protein Data Bank
kaggle.com
zip
Updated May 3, 2025
Cite
Ahmet Can GÜNAY (2025). Protein Data Bank [Dataset]. https://www.kaggle.com/datasets/ahmetcangunay1/protein-data-bank
Available download formats: zip (5079269900 bytes)
Dataset updated
May 3, 2025
Authors
Ahmet Can GÜNAY
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
📖 Context & Inspiration

This dataset was created to provide a comprehensive, easily accessible archive of Protein Data Bank (PDB) files—vital for researchers and developers working in structural biology, molecular modeling, and bioinformatics. The files were gathered from a public directory-style server, where PDB structures are organized into subfolders based on naming conventions.

The goal was to simplify access to this valuable data by restructuring it in a more machine-learning-friendly format, eliminating the need for repetitive scraping or manual downloads. Inspiration came from real-world bottlenecks in data preprocessing for protein folding prediction models, drug design simulations, and structure-based machine learning pipelines.

🌐 Source

All files were sourced from a publicly available FTP-style archive mirroring the PDB repository. The original structure was preserved to ensure traceability and compatibility with existing workflows.

⚠️ Note: This dataset is intended for research and educational use. Ensure compliance with any licensing or usage terms defined by the PDB archive.
PROTEIN-STRUCTURE-MODELLING--2DN1-Human-Hemoglobin
kaggle.com
zip
Updated Nov 28, 2025
Cite
Dr. Nagendra (2025). PROTEIN-STRUCTURE-MODELLING--2DN1-Human-Hemoglobin [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/protein-structure-modelling-2dn1-human-hemoglobin
Available download formats: zip (1464028 bytes)
Dataset updated
Nov 28, 2025
Authors
Dr. Nagendra
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Dataset dedicated to Protein Homology Modeling of Human Hemoglobin. The reference template structure used is PDB ID: 2DN1. Focuses on generating 3D models of a mutated (variant) Hemoglobin sequence. Includes the necessary files for template selection and sequence alignment. A practical resource for studying computational structural biology and protein prediction.
Protein Embeddings 1
kaggle.com
zip
Updated Apr 23, 2023
Cite
Alexander Chervov (2023). Protein Embeddings 1 [Dataset]. https://www.kaggle.com/datasets/alexandervc/protein-embeddings-1
Available download formats: zip (713648892 bytes)
Dataset updated
Apr 23, 2023
Authors
Alexander Chervov
Description
Part 1 of files from:

https://zenodo.org/record/5047020#.ZEQ6Y3ZBy38

Papers:
- Residue and sequence embeddings of the human proteome (SwissProt for organism Human, downloaded on 2021.06.09) computed using bio_embeddings (bioembeddings.com) using the ProtT5 embedder at full precision (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3).
- Sequence-level predictions of subcellular localization in 10 classes using LA (https://www.biorxiv.org/content/10.1101/2021.04.25.441334v1)
- Residue-level three state secondary structure prediction (alpha, sheet or other) using models reported in the ProtTrans paper (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3)

Part 2

embed_protbert_train_clip_1200_first_70000_prot.csv: embeddings for proteins computed with the ProtBert model from RostLab. Data: the first 70,000 proteins from the CAFA5 training set. Generation notebook: https://www.kaggle.com/code/alexandervc/protbert-embedding-starter?scriptVersionId=126811755

Protein Benchmarks

kaggle.com

zip

Updated Dec 5, 2022

Cite

Samriddhi Sinha (2022). Protein Benchmarks [Dataset]. https://www.kaggle.com/datasets/djokester/protein-benchmarks/code

Available download formats: zip (17372386 bytes)

Dataset updated

Dec 5, 2022

Authors

Samriddhi Sinha

Description

Topic	Benchmark	Target type	Resolution	# Training sequences	Source
Protein structure	Secondary structure	Categorical (3)	Local	8,678	(Moult et al., 2018; Rao et al., 2019)
	Disorder	Binary	Local	8,678	(Moult et al., 2018)
	Remote homology	Categorical (1,195)	Global	12,312	(Andreeva et al., 2014, 2020; Rao et al., 2019)
	Fold classes	Categorical (7)	Global	15,680	(Andreeva et al., 2014, 2020)
Post-translational modifications	Signal peptide	Binary	Global	16,606	(Armenteros et al., 2019)
	Major PTMs	Binary	Local	43,356	(Hornbeck et al., 2015)
	Neuropeptide cleavage	Binary	Local	2,727	(Ofer and Linial 2014, 2015; Brandes et al., 2016)
Biophysical properties	Fluorescence	Continuous	Global	21,446	(Sarkisyan et al., 2016; Rao et al., 2019)
	Stability	Continuous	Global	53,679	(Rocklin et al., 2017; Rao et al., 2019)

Sources

Moult J, Fidelis K, Kryshtafovych A, et al. (2018) Critical assessment of methods of protein structure prediction (CASP)—Round XII. Proteins Struct Funct Bioinforma, 86, 7–15.

Rao R. et al. (2019) Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process. Syst., 32, 9689–9701. https://pubmed.ncbi.nlm.nih.gov/33390682/

Andreeva A, Kulesha E, Gough J, Murzin AG (2020) The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures. Nucleic Acids Res, 48, D376–D382. https://academic.oup.com/nar/article/48/D1/D376/5625529

Armenteros JJA, Tsirigos KD, Sønderby CK, et al. (2019) SignalP 5.0 improves signal peptide predictions using deep neural networks. Nat Biotechnol, 37, 420–423.

Hornbeck PV, Zhang B, Murray B, et al. (2015) PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Res, 43, D512–D520.

Ofer D, Linial M (2014) NeuroPID: a predictor for identifying neuropeptide precursors from metazoan proteomes. Bioinformatics, 30, 931–940.

Ofer D, Linial M (2015) ProFET: Feature engineering captures high-level protein functions. Bioinformatics, 31, 3429–3436. http://dx.doi.org/10.1093/bioinformatics/btv345

Brandes N, Ofer D, Linial M (2016) ASAP: A machine learning framework for local protein properties. Database, 2016. https://doi.org/10.1093/database/baw133

Sarkisyan KS, Bolotin DA, Meer MV, et al. (2016) Local fitness landscape of the green fluorescent protein. Nature, 533, 397–401. http://dx.doi.org/10.1038/nature17995

Rocklin GJ, Chidyausiku TM, Goreshnik I, et al. (2017) Global analysis of protein folding using massively parallel design, synthesis, and testing. Science, 357, 168–175.

License

ProteinBERT is a free open-source project available under the MIT License

Image by PublicDomainPictures from Pixabay

Neurotransmitter Receptors & Protein Sequences
kaggle.com
zip
Updated Jul 7, 2025
Cite
Yashasvi Goswami (2025). Neurotransmitter Receptors & Protein Sequences [Dataset]. https://www.kaggle.com/datasets/yashasvigoswami/neurotransmitter-receptors
Explore at:
zip (35950 bytes)
Dataset updated
Jul 7, 2025
Authors
Yashasvi Goswami
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
This dataset provides a curated collection of proteins with their sequences, functional annotations, and structural information. It serves as a resource for researchers working in bioinformatics, structural biology, and systems biology, offering insights into the molecular machinery of life.

By linking protein sequence data with functional and structural attributes, it supports diverse applications such as:
- Protein classification and annotation tasks
- Sequence-to-function machine learning models
- Structural modeling and docking studies
- Comparative proteomics and evolutionary studies

The dataset is valuable for both computational researchers and students of life sciences, offering real-world biological data that can be directly integrated into pipelines for protein analysis, modeling, and prediction.

Cite

SHAHIR (2018). Structural Protein Sequences [Dataset]. https://www.kaggle.com/datasets/shahir/protein-data-set

Structural Protein Sequences

Sequence and meta data for various protein structures

Explore at:

391 scholarly articles cite this dataset.

zip (28782775 bytes)

Dataset updated

Feb 3, 2018

Authors

SHAHIR

License

http://opendatacommons.org/licenses/dbcl/1.0/

Description

Context

This is a protein data set retrieved from the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB).

The PDB archive is a repository of atomic coordinates and other information describing proteins and other important biological macromolecules. Structural biologists use methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy to determine the location of each atom relative to each other in the molecule. They then deposit this information, which is then annotated and publicly released into the archive by the wwPDB.

The constantly-growing PDB is a reflection of the research that is happening in laboratories across the world. This can make it both exciting and challenging to use the database in research and education. Structures are available for many of the proteins and nucleic acids involved in the central processes of life, so you can go to the PDB archive to find structures for ribosomes, oncogenes, drug targets, and even whole viruses. However, it can be a challenge to find the information that you need, since the PDB archives so many different structures. You will often find multiple structures for a given molecule, or partial structures, or structures that have been modified or inactivated from their native form.

Content

There are two data files. Both are keyed by the protein's "structureId":

pdb_data_no_dups.csv contains protein metadata, including details on protein classification, extraction methods, etc.
data_seq.csv contains >400,000 protein structure sequences.
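
Because the two files share the "structureId" column, they can be combined with a single join. The following is a minimal sketch using pandas, not the dataset author's own code; the column names `classification` and `sequence`, the function name `join_structures`, and the tiny stand-in rows are assumptions made for illustration. Only "structureId" is taken from the description above.

```python
import pandas as pd


def join_structures(meta: pd.DataFrame, seq: pd.DataFrame) -> pd.DataFrame:
    """Attach metadata to each sequence row via the shared structureId key."""
    # validate="many_to_one" assumes the metadata has one row per structureId,
    # as the file name pdb_data_no_dups.csv suggests; pandas raises otherwise.
    return seq.merge(meta, on="structureId", how="left", validate="many_to_one")


# Tiny stand-in frames with assumed columns. With the real files you would
# load them with pd.read_csv("pdb_data_no_dups.csv") and pd.read_csv("data_seq.csv").
meta = pd.DataFrame(
    {
        "structureId": ["100D", "101M"],
        "classification": ["DNA-RNA HYBRID", "OXYGEN TRANSPORT"],
    }
)
seq = pd.DataFrame(
    {
        "structureId": ["100D", "100D", "101M"],
        "sequence": ["CCGG", "CCGG", "MVLS"],
    }
)

joined = join_structures(meta, seq)
print(joined)
```

A left join from the sequence table keeps every sequence row even when its metadata is missing, which makes gaps visible as empty values instead of silently dropping rows.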

Acknowledgements

Original data set downloaded from http://www.rcsb.org/pdb/

Inspiration

The Protein Data Bank has helped the life science community study different diseases and come up with new drugs and solutions that support human survival.
