5 datasets found
  1. Bioinformatics Protein Dataset - Simulated

    • kaggle.com
    zip
    Updated Dec 27, 2024
    Cite
    Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated
    Available download formats: zip (12928905 bytes)
    Dataset updated
    Dec 27, 2024
    Authors
    Rafael Gallo
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Subtitle

    "Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

    Introduction

    This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

    Columns Included

    • ID_Protein: Unique identifier for each protein.
    • Sequence: String of amino acids.
    • Molecular_Weight: Molecular weight calculated from the sequence.
    • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
    • Hydrophobicity: Average hydrophobicity calculated from the sequence.
    • Total_Charge: Sum of the charges of the amino acids in the sequence.
    • Polar_Proportion: Percentage of polar amino acids in the sequence.
    • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
    • Sequence_Length: Total number of amino acids in the sequence.
    • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

    Inspiration and Sources

    While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:

    • UniProt: A comprehensive database of protein sequences and annotations.
    • Kyte-Doolittle Scale: Calculations of hydrophobicity.
    • Biopython: A tool for analyzing biological sequences.

    Proposed Uses

    This dataset is ideal for:

    • Training classification models for proteins.
    • Exploratory analysis of physicochemical properties of proteins.
    • Building machine learning pipelines in bioinformatics.

    How This Dataset Was Created

    1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
    2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
    3. Class Assignment: Classes were randomly assigned for classification purposes.
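
    The three steps above can be sketched as follows. This is a minimal illustration, not the author's code: the source states that properties were computed with Biopython, whereas this sketch uses only the standard library and a hard-coded Kyte-Doolittle table. Uniform sampling over the 20 standard amino acids and the function names (`random_sequence`, `mean_hydrophobicity`, `make_record`) are assumptions made for illustration.

```python
import random

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = "".join(sorted(KYTE_DOOLITTLE))
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]


def random_sequence(rng, min_len=50, max_len=300):
    """Step 1: draw a random chain of 50 to 300 residues (uniform sampling is an assumption)."""
    length = rng.randint(min_len, max_len)  # randint includes both endpoints
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mean_hydrophobicity(sequence):
    """Step 2 (one property only): average Kyte-Doolittle value over the sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def make_record(rng, protein_id):
    """Step 3: attach a randomly chosen class label, independent of the sequence."""
    sequence = random_sequence(rng)
    return {
        "ID_Protein": protein_id,
        "Sequence": sequence,
        "Hydrophobicity": mean_hydrophobicity(sequence),
        "Sequence_Length": len(sequence),
        "Class": rng.choice(CLASSES),
    }
```

    Because the label in this sketch is drawn independently of the sequence, it carries no information about the computed properties, which matches the limitation the description states for the functional classes.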

    Limitations

    • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
    • The functional classes are simulated and do not correspond to actual biological characteristics.

    Data Split

    The dataset is divided into two subsets:

    • Training: 16,000 samples (proteinas_train.csv).
    • Testing: 4,000 samples (proteinas_test.csv).
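
    A first sanity check on the two files might look like the sketch below. It assumes each CSV has a header row that uses the column names listed above (in particular `Class`); the helper names `class_counts` and `train_fraction` are hypothetical.

```python
import csv
from collections import Counter


def class_counts(csv_file):
    """Count rows per functional class in an open CSV file (header row assumed)."""
    reader = csv.DictReader(csv_file)
    return Counter(row["Class"] for row in reader)


def train_fraction(n_train, n_test):
    """Share of all samples that fall in the training subset."""
    return n_train / (n_train + n_test)


# Example usage, assuming the files sit in the working directory:
#   with open("proteinas_train.csv", newline="") as f:
#       print(class_counts(f))
#   print(train_fraction(16_000, 4_000))
```

    With the sizes given in the description, the split works out to 80% training and 20% testing.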

    Acknowledgment

    This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.

  2. RCSB_PDB Human Macromolecular Structure Dataset

    • kaggle.com
    Updated Sep 20, 2024
    Cite
    Samira Alipour (2024). RCSB_PDB Human Macromolecular Structure Dataset [Dataset]. https://www.kaggle.com/datasets/samiraalipour/rcsb-pdb-macromolecular-structure-dataset
    Available format: Croissant, a format for machine-learning datasets (see mlcommons.org/croissant).
    Dataset updated
    Sep 20, 2024
    Dataset provided by
    Kaggle
    Authors
    Samira Alipour
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    RCSB_PDB Human Macromolecular Structure Dataset with Structural Features

    This dataset consists of 11,832 protein structures obtained from the RCSB Protein Data Bank (PDB), covering the period from 2015 to 2023. It includes features relevant to X-ray crystallography-determined protein structures from the organism Homo sapiens (human). In addition, new structural features such as the number of residues, chains, and secondary structure components (helix, sheet, coil) have been added. These features provide additional insights that enhance the dataset’s application for various machine learning (ML) and deep learning (DL) tasks.

    The dataset is optimized for tasks such as protein-ligand binding prediction, oligomeric state analysis, enzyme classification, protein secondary structure prediction, structure quality assessment, and more. By incorporating secondary structure data, the dataset also supports tasks related to structural motif detection, protein stability prediction, and protein domain analysis.

    For a detailed explanation of how these new features were extracted and integrated, including code snippets, please refer to our PrepareData|RCSB_PDBHumanMacromoleculeDataset and prepareRCSB_PDBDatasetwithStructuralFeatures.

    The dataset was curated by filtering proteins based on X-ray diffraction data and a resolution range of 1.0 Å to 3.0 Å, ensuring high-quality structural data for downstream ML applications.

    License

    The dataset is licensed under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication, allowing unrestricted use, modification, and distribution. It can be used for both academic and commercial purposes.

    Data Extraction and Feature Augmentation Process

    1. Search Query in RCSB PDB

    A custom query was constructed to retrieve X-ray crystallography structures of human proteins from the RCSB Protein Data Bank (PDB) with resolutions between 1.0 and 3.0 Å. The query also filtered for structures with an asymmetric symmetry type and with enzyme classification information.

    Query: (experimental_method:"X-RAY DIFFRACTION") AND (resolution:[1.0 TO 3.0]) AND (entity_poly.rcsb_entity_polymer_type:"Protein (only)") AND (pdbx_audit_revision_history.revision_date:[2015-01-01 TO 2023-09-17]) AND (rcsb_entity_source_organism.taxonomy_lineage.name:"Homo sapiens") AND (rcsb_symmetry.symmetry_type:"Asymmetric") AND (rcsb_enzyme_classification.ec_name:*)
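
    To make the query's conditions concrete, the sketch below re-expresses them as a local Python predicate over an already-downloaded record. It does not call the RCSB search service; the dictionary keys (`experimental_method`, `resolution`, `polymer_type`, `revision_date`, `organism_lineage`, `symmetry_type`, `ec_names`) are hypothetical names chosen for illustration, and the bracketed ranges are treated as inclusive.

```python
from datetime import date


def matches_query(entry):
    """Return True when a record satisfies every clause of the query above.

    `entry` is a plain dict with hypothetical keys; this is a local
    re-statement of the filter, not the RCSB API.
    """
    return (
        entry["experimental_method"] == "X-RAY DIFFRACTION"
        and 1.0 <= entry["resolution"] <= 3.0            # resolution:[1.0 TO 3.0]
        and entry["polymer_type"] == "Protein (only)"
        and date(2015, 1, 1) <= entry["revision_date"] <= date(2023, 9, 17)
        and "Homo sapiens" in entry["organism_lineage"]
        and entry["symmetry_type"] == "Asymmetric"
        and len(entry["ec_names"]) > 0                   # ec_name:* means "any value present"
    )
```

    All clauses are joined with AND, so a structure is kept only if it passes every one of them.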

    2. Feature Selection and Structural Feature Augmentation

    After fetching the relevant protein structures, the following features were selected for extraction:

    • Original Structural Features: Resolution, R-factors, oligomeric state, etc.
    • Sequence Data: Protein sequence length, polymer type.
    • Ligand Information: Ligand IDs, binding affinities, etc.
    • Enzyme Classification: EC numbers, molecular weight, etc.
    • New Structural Features (added using BioPython and DSSP):
      • Number of Residues: The total number of amino acid residues in the structure.
      • Number of Chains: The number of distinct chains in the protein.
      • Secondary Structure Composition: Number of helices, sheets, and coils.
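
    As an illustration of the secondary structure composition feature, the sketch below reduces a per-residue DSSP string to helix, sheet, and coil counts. The three-state grouping (H, G, I as helix; E, B as sheet; everything else as coil) is a common convention and an assumption here, as is counting residues rather than contiguous segments; the dataset's own notebooks may do either differently. The function name `secondary_structure_composition` is hypothetical.

```python
HELIX_CODES = {"H", "G", "I"}   # alpha, 3-10, and pi helix
SHEET_CODES = {"E", "B"}        # extended strand and isolated beta bridge


def secondary_structure_composition(dssp_string):
    """Count residues per three-state class in a per-residue DSSP string."""
    counts = {"helix": 0, "sheet": 0, "coil": 0}
    for code in dssp_string:
        if code in HELIX_CODES:
            counts["helix"] += 1
        elif code in SHEET_CODES:
            counts["sheet"] += 1
        else:
            # Turns (T), bends (S), and unassigned residues fall back to coil.
            counts["coil"] += 1
    return counts
```

    For example, `secondary_structure_composition("HHHGEEB-TS")` groups four residues as helix, three as sheet, and three as coil.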

    These newly added features enrich the dataset's potential for structural bioinformatics applications. For a full breakdown of the extraction process, refer to PrepareData|RCSB_PDBHumanMacromoleculeDataset and prepareRCSB_PDBDatasetwithStructuralFeatures.

    3. Data Preparation

    The extracted data, along with the new structural features, was consolidated into a tabular format. This dataset was further prepared by merging multiple CSV files into a single dataset using a custom Python notebook.

    Detailed Feature Explanation

    The dataset includes a broad range of features, now expanded with structural data to enhance ML/DL applications. Here is a detailed explanation of all the features:

    Original Features:

    1. PDB ID: A unique identifier for each protein structure in the PDB.
    2. Experimental Method: The method used to determine the structure (e.g., X-ray diffraction, NMR).
    3. Matthews Coefficient: The crystal volume per unit of protein molecular weight (Å³/Da), used to estimate how much of the crystal is occupied by protein versus solvent.
    4. Percent Solvent Content: The percentage of solvent (typically water) present in the crystal.
    5. Crystallization Method: The method used to grow the crystals (e.g., vapor diffusion).
    6. pH: The pH at which the crystallization was performed.
    7. **Crystal Gro...
  3. Protein Data Bank

    • kaggle.com
    zip
    Updated May 3, 2025
    Cite
    Ahmet Can GÜNAY (2025). Protein Data Bank [Dataset]. https://www.kaggle.com/datasets/ahmetcangunay1/protein-data-bank
    Available download formats: zip (5079269900 bytes)
    Dataset updated
    May 3, 2025
    Authors
    Ahmet Can GÜNAY
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    📖 Context & Inspiration

    This dataset was created to provide a comprehensive, easily accessible archive of Protein Data Bank (PDB) files—vital for researchers and developers working in structural biology, molecular modeling, and bioinformatics. The files were gathered from a public directory-style server, where PDB structures are organized into subfolders based on naming conventions.

    The goal was to simplify access to this valuable data by restructuring it in a more machine-learning-friendly format, eliminating the need for repetitive scraping or manual downloads. Inspiration came from real-world bottlenecks in data preprocessing for protein folding prediction models, drug design simulations, and structure-based machine learning pipelines.

    🌐 Source

    All files were sourced from a publicly available FTP-style archive mirroring the PDB repository. The original structure was preserved to ensure traceability and compatibility with existing workflows.

    ⚠️ Note: This dataset is intended for research and educational use. Ensure compliance with any licensing or usage terms defined by the PDB archive.
    
  4. Structural Protein Sequences

    • kaggle.com
    zip
    Updated Feb 3, 2018
    Cite
    SHAHIR (2018). Structural Protein Sequences [Dataset]. https://www.kaggle.com/shahir/protein-data-set
    Available download formats: zip (28782775 bytes)
    Dataset updated
    Feb 3, 2018
    Authors
    SHAHIR
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    This is a protein data set retrieved from the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB).

    The PDB archive is a repository of atomic coordinates and other information describing proteins and other important biological macromolecules. Structural biologists use methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy to determine the location of each atom relative to the others in the molecule. They then deposit this information, which is annotated and publicly released into the archive by the wwPDB.

    The constantly-growing PDB is a reflection of the research that is happening in laboratories across the world. This can make it both exciting and challenging to use the database in research and education. Structures are available for many of the proteins and nucleic acids involved in the central processes of life, so you can go to the PDB archive to find structures for ribosomes, oncogenes, drug targets, and even whole viruses. However, it can be a challenge to find the information that you need, since the PDB archives so many different structures. You will often find multiple structures for a given molecule, or partial structures, or structures that have been modified or inactivated from their native form.

    Content

    There are two data files. Both are keyed on the "structureId" of the protein:

    • pdb_data_no_dups.csv contains protein metadata, including details on protein classification, extraction methods, etc.

    • data_seq.csv contains more than 400,000 protein structure sequences.
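
    Since both files share the "structureId" key, they can be combined with a simple join. The sketch below is one way to do that with the standard library; it assumes both files have header rows, that the metadata file has one row per structure, and that a structure may have several sequence rows. Every column name other than `structureId` in the usage comment is hypothetical.

```python
import csv


def join_on_structure_id(meta_file, seq_file, key="structureId"):
    """Attach metadata to each sequence row that has a matching structure ID."""
    # Index the metadata by structure ID (one row per ID is assumed).
    meta_by_id = {row[key]: row for row in csv.DictReader(meta_file)}
    joined = []
    for seq_row in csv.DictReader(seq_file):
        meta_row = meta_by_id.get(seq_row[key])
        if meta_row is None:
            continue  # skip sequences with no metadata entry (inner join)
        joined.append({**meta_row, **seq_row})
    return joined


# Example usage, assuming the files sit in the working directory:
#   with open("pdb_data_no_dups.csv", newline="") as meta, \
#        open("data_seq.csv", newline="") as seq:
#       rows = join_on_structure_id(meta, seq)
```

    This keeps one output row per sequence row, so a structure with several chains appears several times, each time with the same metadata.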

    Acknowledgements

    Original data set downloaded from http://www.rcsb.org/pdb/

    Inspiration

    The Protein Data Bank has helped the life science community study different diseases and develop new drugs and solutions that support human survival.

  5. Neurotransmitter Receptors & Protein Sequences

    • kaggle.com
    zip
    Updated Jul 7, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Yashasvi Goswami (2025). Neurotransmitter Receptors & Protein Sequences [Dataset]. https://www.kaggle.com/datasets/yashasvigoswami/neurotransmitter-receptors/discussion?sort=undefined
    Available download formats: zip (35950 bytes)
    Dataset updated
    Jul 7, 2025
    Authors
    Yashasvi Goswami
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    This dataset provides a curated collection of proteins with their sequences, functional annotations, and structural information. It serves as a resource for researchers working in bioinformatics, structural biology, and systems biology, offering insights into the molecular machinery of life.

    By linking protein sequence data with functional and structural attributes, it supports diverse applications such as:

    • Protein classification and annotation tasks
    • Sequence-to-function machine learning models
    • Structural modeling and docking studies
    • Comparative proteomics and evolutionary studies

    The dataset is valuable for both computational researchers and students of life sciences, offering real-world biological data that can be directly integrated into pipelines for protein analysis, modeling, and prediction.
