5 datasets found

Bioinformatics Protein Dataset - Simulated
kaggle.com
zip
Updated Dec 27, 2024
Cite
Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated
Available download formats: zip (12928905 bytes)
Dataset updated
Dec 27, 2024
Authors
Rafael Gallo
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Subtitle

"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

Description

Introduction

This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

Columns Included

ID_Protein: Unique identifier for each protein.

Sequence: String of amino acids.

Molecular_Weight: Molecular weight calculated from the sequence.

Isoelectric_Point: Estimated isoelectric point based on the sequence composition.

Hydrophobicity: Average hydrophobicity calculated from the sequence.

Total_Charge: Sum of the charges of the amino acids in the sequence.

Polar_Proportion: Percentage of polar amino acids in the sequence.

Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.

Sequence_Length: Total number of amino acids in the sequence.

Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.
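
As a rough illustration of how the composition-based columns above could be derived from a sequence, here is a minimal standard-library sketch. The polar/nonpolar grouping and the charge table (K and R as +1, D and E as -1, histidine treated as neutral) are assumptions made for this example; the description does not say which definitions the author used, and `describe_sequence` is a hypothetical helper name.

```python
# Assumed residue groupings for illustration only; the dataset's own
# definitions of "polar" and "nonpolar" are not stated in the description.
POLAR = set("STNQYCHKRDE")
NONPOLAR = set("AVLIMFWPG")
# Assumed side-chain charges near neutral pH (histidine treated as neutral).
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def describe_sequence(sequence: str) -> dict:
    """Compute length, net charge, and polar/nonpolar percentages."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    length = len(seq)
    polar = sum(1 for aa in seq if aa in POLAR)
    nonpolar = sum(1 for aa in seq if aa in NONPOLAR)
    return {
        "Sequence_Length": length,
        "Total_Charge": sum(CHARGE.get(aa, 0) for aa in seq),
        "Polar_Proportion": 100.0 * polar / length,
        "Nonpolar_Proportion": 100.0 * nonpolar / length,
    }


features = describe_sequence("AKDE")
print(features)
```

With these assumed groupings the two proportions always sum to 100 for sequences made of the 20 standard amino acids, which may or may not hold for the published columns.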

Inspiration and Sources

While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:
- UniProt: A comprehensive database of protein sequences and annotations.
- Kyte-Doolittle Scale: Calculations of hydrophobicity.
- Biopython: A tool for analyzing biological sequences.

Proposed Uses

This dataset is ideal for:
- Training classification models for proteins.
- Exploratory analysis of physicochemical properties of proteins.
- Building machine learning pipelines in bioinformatics.

How This Dataset Was Created

Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.

Property Calculation: Physicochemical properties were calculated using the Biopython library.

Class Assignment: Classes were randomly assigned for classification purposes.
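
The three steps above can be sketched in plain Python. This is not the author's code: the description says Biopython was used for the properties, whereas this sketch computes only the mean Kyte-Doolittle hydropathy by hand so that it runs without third-party packages. Uniform sampling of residues, lengths, and classes is an assumption, and `random_protein` is a hypothetical name.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def random_protein(rng: random.Random, min_len: int = 50, max_len: int = 300) -> dict:
    """Generate one synthetic record: sequence, length, hydrophobicity, class."""
    length = rng.randint(min_len, max_len)  # inclusive bounds
    sequence = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    return {
        "Sequence": sequence,
        "Sequence_Length": length,
        # Average hydropathy over all residues.
        "Hydrophobicity": sum(KYTE_DOOLITTLE[aa] for aa in sequence) / length,
        # The label is drawn independently of the sequence.
        "Class": rng.choice(CLASSES),
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
proteins = [random_protein(rng) for _ in range(5)]
for protein in proteins:
    print(protein["Sequence_Length"], round(protein["Hydrophobicity"], 3), protein["Class"])
```

Because the label is drawn independently of the sequence in this sketch, no feature carries information about the class.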

Limitations

The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.

The functional classes are simulated and do not correspond to actual biological characteristics.

Data Split

The dataset is divided into two subsets:
- Training: 16,000 samples (proteinas_train.csv).
- Testing: 4,000 samples (proteinas_test.csv).
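
Since the class labels were randomly assigned, a classifier evaluated on the test subset should not be expected to do much better than chance. The sketch below simulates labels for a 16,000/4,000 split and scores a majority-class baseline. It assumes the five classes were drawn uniformly, which the description does not state, and it uses simulated labels rather than the actual CSV files.

```python
import random
from collections import Counter

CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]


def majority_baseline_accuracy(train_labels, test_labels):
    """Predict the most frequent training label for every test sample."""
    majority, _count = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority)
    return hits / len(test_labels)


rng = random.Random(42)
# Simulated labels with the same sizes as the described train/test subsets.
train_labels = [rng.choice(CLASSES) for _ in range(16000)]
test_labels = [rng.choice(CLASSES) for _ in range(4000)]

accuracy = majority_baseline_accuracy(train_labels, test_labels)
print(f"majority-class accuracy: {accuracy:.3f}")  # should land near 1/5
```

A model that scores well above this baseline on the real files would suggest the labels are not fully independent of the features, which is worth checking before drawing conclusions.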

Acknowledgment

This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
RCSB_PDB Human Macromolecular Structure Dataset
kaggle.com
Updated Sep 20, 2024
Cite
Samira Alipour (2024). RCSB_PDB Human Macromolecular Structure Dataset [Dataset]. https://www.kaggle.com/datasets/samiraalipour/rcsb-pdb-macromolecular-structure-dataset
Available download formats: Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 20, 2024
Dataset provided by
Kaggle
Authors
Samira Alipour
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
RCSB_PDB Human Macromolecular Structure Dataset with Structural Features

This dataset consists of 11,832 protein structures obtained from the RCSB Protein Data Bank (PDB), covering the period from 2015 to 2023. It includes features relevant to X-ray crystallography-determined protein structures from the organism Homo sapiens (human). In addition, new structural features such as the number of residues, chains, and secondary structure components (helix, sheet, coil) have been added. These features provide additional insights that enhance the dataset’s application for various machine learning (ML) and deep learning (DL) tasks.

The dataset is optimized for tasks such as protein-ligand binding prediction, oligomeric state analysis, enzyme classification, protein secondary structure prediction, structure quality assessment, and more. By incorporating secondary structure data, the dataset also supports tasks related to structural motif detection, protein stability prediction, and protein domain analysis.

For a detailed explanation of how these new features were extracted and integrated, including code snippets, please refer to our PrepareData|RCSB_PDBHumanMacromoleculeDataset & prepareRCSB_PDBDatasetwithStructuralFeatures.

The dataset was curated by filtering proteins based on X-ray diffraction data and a resolution range of 1.0 Å to 3.0 Å, ensuring high-quality structural data for downstream ML applications.

License

The dataset is licensed under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication, allowing unrestricted use, modification, and distribution. It can be used for both academic and commercial purposes.

Data Extraction and Feature Augmentation Process

1. Search Query in RCSB PDB

A custom query was constructed to retrieve X-ray crystallography structures of human proteins from the RCSB Protein Data Bank (PDB) with resolutions between 1.0 and 3.0 Å. The query also restricted results to protein-only entities revised between 2015-01-01 and 2023-09-17 that have an asymmetric symmetry type and an enzyme classification (EC) name.

Query: (experimental_method:"X-RAY DIFFRACTION") AND (resolution:[1.0 TO 3.0]) AND (entity_poly.rcsb_entity_polymer_type:"Protein (only)") AND (pdbx_audit_revision_history.revision_date:[2015-01-01 TO 2023-09-17]) AND (rcsb_entity_source_organism.taxonomy_lineage.name:"Homo sapiens") AND (rcsb_symmetry.symmetry_type:"Asymmetric") AND (rcsb_enzyme_classification.ec_name:*)
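
To make the individual filters easier to adjust, the query string above can be assembled from its clauses. The following is only a string-building sketch: `term` and `span` are hypothetical helpers, and nothing here contacts RCSB or assumes how the search service parses the text.

```python
def term(field: str, value: str) -> str:
    """Exact-match clause, e.g. (field:"value")."""
    return f'({field}:"{value}")'


def span(field: str, low, high) -> str:
    """Inclusive range clause, e.g. (field:[low TO high])."""
    return f"({field}:[{low} TO {high}])"


clauses = [
    term("experimental_method", "X-RAY DIFFRACTION"),
    span("resolution", 1.0, 3.0),
    term("entity_poly.rcsb_entity_polymer_type", "Protein (only)"),
    span("pdbx_audit_revision_history.revision_date", "2015-01-01", "2023-09-17"),
    term("rcsb_entity_source_organism.taxonomy_lineage.name", "Homo sapiens"),
    term("rcsb_symmetry.symmetry_type", "Asymmetric"),
    # Wildcard clause: keep only entries that have any EC name.
    "(rcsb_enzyme_classification.ec_name:*)",
]

query = " AND ".join(clauses)
print(query)
```

Changing the resolution bounds or the date window then means editing a single argument rather than the whole string.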

2. Feature Selection and Structural Feature Augmentation

After fetching the relevant protein structures, the following features were selected for extraction:

Original Structural Features: Resolution, R-factors, oligomeric state, etc.

Sequence Data: Protein sequence length, polymer type.

Ligand Information: Ligand IDs, binding affinities, etc.

Enzyme Classification: EC numbers, molecular weight, etc.

New Structural Features (added using BioPython and DSSP):

Number of Residues: The total number of amino acid residues in the structure.

Number of Chains: The number of distinct chains in the protein.

Secondary Structure Composition: Number of helices, sheets, and coils.

These newly added features enrich the dataset's potential for structural bioinformatics applications. For a full breakdown of the extraction process, refer to the PrepareData|RCSB_PDBHumanMacromoleculeDataset & prepareRCSB_PDBDatasetwithStructuralFeatures.
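
As a hedged sketch of what the three new structural features amount to, the function below takes per-residue DSSP codes that have already been computed and tallies them. It does not call BioPython or DSSP. Collapsing the DSSP codes into three states (H, G, I as helix; E, B as sheet; everything else as coil) is a common convention assumed here, and counting residues rather than contiguous segments is also an assumption; the dataset's notebooks may do either differently.

```python
HELIX_CODES = set("HGI")  # alpha, 3-10, and pi helix
SHEET_CODES = set("EB")   # extended strand and isolated bridge


def structural_features(residues):
    """Summarise (chain_id, dssp_code) pairs, one pair per residue."""
    residues = list(residues)
    counts = {"helix": 0, "sheet": 0, "coil": 0}
    for _chain_id, code in residues:
        if code in HELIX_CODES:
            counts["helix"] += 1
        elif code in SHEET_CODES:
            counts["sheet"] += 1
        else:
            counts["coil"] += 1  # turns, bends, and unassigned residues
    return {
        "num_residues": len(residues),
        "num_chains": len({chain_id for chain_id, _code in residues}),
        **counts,
    }


# Made-up two-chain example, not taken from any real PDB entry.
example = [("A", c) for c in "HHHHEE-T"] + [("B", c) for c in "TS"]
summary = structural_features(example)
print(summary)
```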

3. Data Preparation

The extracted data, along with the new structural features, was consolidated into a tabular format. This dataset was further prepared by merging multiple CSV files into a single dataset using a custom Python notebook.
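
One plausible reading of this merge step is stacking several CSV exports that share the same header and dropping repeated entries. The sketch below does that with the standard library. The column name "PDB ID", the in-memory example files, and the placeholder identifiers are all assumptions for illustration; the actual notebook may merge differently.

```python
import csv
import io


def stack_csv_files(files, key="PDB ID"):
    """Concatenate rows from several CSV files, keeping the first row per key."""
    merged, seen = [], set()
    for handle in files:
        for row in csv.DictReader(handle):
            if row[key] in seen:
                continue  # skip entries already collected from an earlier file
            seen.add(row[key])
            merged.append(row)
    return merged


# Placeholder exports with made-up identifiers and values.
part_a = io.StringIO("PDB ID,Resolution\n1ABC,1.8\n2XYZ,2.4\n")
part_b = io.StringIO("PDB ID,Resolution\n2XYZ,2.4\n3DEF,2.9\n")

rows = stack_csv_files([part_a, part_b])
print([row["PDB ID"] for row in rows])
```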

Detailed Feature Explanation

The dataset includes a broad range of features, now expanded with structural data to enhance ML/DL applications. Here is a detailed explanation of all the features:

Original Features:

PDB ID: A unique identifier for each protein structure in the PDB.

Experimental Method: The method used to determine the structure (e.g., X-ray diffraction, NMR).

Matthews Coefficient: The crystal volume per unit of protein molecular weight (Å³/Da), used to estimate how much of the crystal is occupied by protein versus solvent.

Percent Solvent Content: The percentage of solvent (typically water) present in the crystal.

Crystallization Method: The method used to grow the crystals (e.g., vapor diffusion).

pH: The pH at which the crystallization was performed.

**Crystal Gro...
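
The Matthews coefficient and percent solvent content listed above are closely related, and a small worked example may help. The sketch uses the standard definition V_M = V_cell / (Z × MW) and the usual approximation that the protein fraction of the crystal is about 1.23 / V_M. The input numbers are made up for illustration, and the function names are hypothetical; the dataset's values come from the PDB entries, not from this calculation.

```python
def matthews_coefficient(cell_volume_a3: float, z: int, molecular_weight_da: float) -> float:
    """Crystal volume per dalton of protein, in cubic angstroms per Da.

    z is the number of protein molecules in the unit cell.
    """
    return cell_volume_a3 / (z * molecular_weight_da)


def solvent_content_percent(v_m: float) -> float:
    """Approximate solvent percentage from the Matthews coefficient.

    Uses the common approximation: protein fraction = 1.23 / V_M.
    """
    return 100.0 * (1.0 - 1.23 / v_m)


# Made-up crystal: 246,000 cubic angstrom cell, 4 molecules of 25 kDa each.
v_m = matthews_coefficient(246_000.0, 4, 25_000.0)
solvent = solvent_content_percent(v_m)
print(f"V_M = {v_m:.2f} A^3/Da, solvent = {solvent:.1f}%")
```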
Protein Data Bank
kaggle.com
zip
Updated May 3, 2025
Cite
Ahmet Can GÜNAY (2025). Protein Data Bank [Dataset]. https://www.kaggle.com/datasets/ahmetcangunay1/protein-data-bank
Available download formats: zip (5079269900 bytes)
Dataset updated
May 3, 2025
Authors
Ahmet Can GÜNAY
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
📖 Context & Inspiration

This dataset was created to provide a comprehensive, easily accessible archive of Protein Data Bank (PDB) files—vital for researchers and developers working in structural biology, molecular modeling, and bioinformatics. The files were gathered from a public directory-style server, where PDB structures are organized into subfolders based on naming conventions.

The goal was to simplify access to this valuable data by restructuring it in a more machine-learning-friendly format, eliminating the need for repetitive scraping or manual downloads. Inspiration came from real-world bottlenecks in data preprocessing for protein folding prediction models, drug design simulations, and structure-based machine learning pipelines.

🌐 Source

All files were sourced from a publicly available FTP-style archive mirroring the PDB repository. The original structure was preserved to ensure traceability and compatibility with existing workflows.
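
The description does not spell out the naming convention behind the subfolders. If the dataset mirrors the wwPDB "divided" layout, where each entry is filed under the middle two characters of its four-character ID, a file's relative path could be computed as below. Whether this Kaggle archive actually follows that layout, and whether its files keep the `.ent.gz` suffix, are assumptions to verify against the download; `divided_path` is a hypothetical helper.

```python
def divided_path(pdb_id: str) -> str:
    """Relative path of an entry under the wwPDB 'divided' convention.

    Example layout: <middle two characters>/pdb<id>.ent.gz
    """
    pid = pdb_id.strip().lower()
    if len(pid) != 4 or not pid.isalnum():
        raise ValueError(f"expected a 4-character PDB ID, got {pdb_id!r}")
    return f"{pid[1:3]}/pdb{pid}.ent.gz"


print(divided_path("4HHB"))
```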

⚠️ Note: This dataset is intended for research and educational use. Ensure compliance with any licensing or usage terms defined by the PDB archive.
Structural Protein Sequences
kaggle.com
zip
Updated Feb 3, 2018
Cite
SHAHIR (2018). Structural Protein Sequences [Dataset]. https://www.kaggle.com/shahir/protein-data-set
Available download formats: zip (28782775 bytes)
Dataset updated
Feb 3, 2018
Authors
SHAHIR
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
Context

This is a protein data set retrieved from the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB).

The PDB archive is a repository of atomic coordinates and other information describing proteins and other important biological macromolecules. Structural biologists use methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy to determine the location of each atom relative to each other in the molecule. They then deposit this information, which is then annotated and publicly released into the archive by the wwPDB.

The constantly-growing PDB is a reflection of the research that is happening in laboratories across the world. This can make it both exciting and challenging to use the database in research and education. Structures are available for many of the proteins and nucleic acids involved in the central processes of life, so you can go to the PDB archive to find structures for ribosomes, oncogenes, drug targets, and even whole viruses. However, it can be a challenge to find the information that you need, since the PDB archives so many different structures. You will often find multiple structures for a given molecule, or partial structures, or structures that have been modified or inactivated from their native form.

Content

There are two data files. Both are keyed on the "structureId" of the protein:

pdb_data_no_dups.csv contains protein metadata, which includes details on protein classification, extraction methods, etc.

data_seq.csv contains >400,000 protein structure sequences.
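
Because both files share "structureId", they can be combined with a simple lookup join. The sketch below uses in-memory stand-ins for the two CSV files. Every column other than "structureId" ("classification", "chainId", "sequence") and all row values are placeholders assumed for illustration, so check the real headers before reusing it.

```python
import csv
import io


def join_on_structure_id(meta_file, seq_file):
    """Attach metadata to each sequence row that has a matching structureId."""
    meta = {row["structureId"]: row for row in csv.DictReader(meta_file)}
    joined = []
    for row in csv.DictReader(seq_file):
        info = meta.get(row["structureId"])
        if info is not None:  # inner join: drop sequences without metadata
            joined.append({**info, **row})
    return joined


# Made-up stand-ins for pdb_data_no_dups.csv and data_seq.csv.
meta_csv = io.StringIO(
    "structureId,classification\n0AAA,EXAMPLE CLASS A\n0BBB,EXAMPLE CLASS B\n"
)
seq_csv = io.StringIO(
    "structureId,chainId,sequence\n0AAA,A,MKV\n0AAA,B,MKL\n0CCC,A,GGS\n"
)

records = join_on_structure_id(meta_csv, seq_csv)
for record in records:
    print(record["structureId"], record["chainId"], record["classification"])
```

A structure with several chains yields one joined row per chain, so the result can be longer than the metadata file.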

Acknowledgements

Original data set downloaded from http://www.rcsb.org/pdb/

Inspiration

Protein databases have helped the life science community study different diseases and come up with new drugs and solutions that aid human survival.
Neurotransmitter Receptors & Protein Sequences
kaggle.com
zip
Updated Jul 7, 2025
Cite
Yashasvi Goswami (2025). Neurotransmitter Receptors & Protein Sequences [Dataset]. https://www.kaggle.com/datasets/yashasvigoswami/neurotransmitter-receptors/discussion?sort=undefined
Available download formats: zip (35950 bytes)
Dataset updated
Jul 7, 2025
Authors
Yashasvi Goswami
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Description
This dataset provides a curated collection of proteins with their sequences, functional annotations, and structural information. It serves as a resource for researchers working in bioinformatics, structural biology, and systems biology, offering insights into the molecular machinery of life.

By linking protein sequence data with functional and structural attributes, it supports diverse applications such as:
- Protein classification and annotation tasks
- Sequence-to-function machine learning models
- Structural modeling and docking studies
- Comparative proteomics and evolutionary studies

The dataset is valuable for both computational researchers and students of life sciences, offering real-world biological data that can be directly integrated into pipelines for protein analysis, modeling, and prediction.