2 datasets found
  1. Bioinformatics Protein Dataset - Simulated

    • kaggle.com
    Updated Dec 27, 2024
    Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. http://doi.org/10.34740/kaggle/dsv/10315204
    Dataset updated
    Dec 27, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rafael Gallo
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Subtitle

    "Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

    Description

    Introduction

    This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

    Columns Included

    • ID_Protein: Unique identifier for each protein.
    • Sequence: String of amino acids.
    • Molecular_Weight: Molecular weight calculated from the sequence.
    • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
    • Hydrophobicity: Average hydrophobicity calculated from the sequence.
    • Total_Charge: Sum of the charges of the amino acids in the sequence.
    • Polar_Proportion: Percentage of polar amino acids in the sequence.
    • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
    • Sequence_Length: Total number of amino acids in the sequence.
    • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.
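A minimal sketch of how the column list above could be checked against a CSV file, using only the Python standard library. The column and class names come from the description; the helper name `check_rows`, the sample rows, the ID format, and the assumption that Sequence_Length is stored as an integer are all hypothetical.

```python
import csv
import io

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = [
    "ID_Protein", "Sequence", "Molecular_Weight", "Isoelectric_Point",
    "Hydrophobicity", "Total_Charge", "Polar_Proportion",
    "Nonpolar_Proportion", "Sequence_Length", "Class",
]
EXPECTED_CLASSES = {"Enzyme", "Transport", "Structural", "Receptor", "Other"}


def check_rows(handle):
    """Return a list of problems found in a CSV file-like object."""
    reader = csv.DictReader(handle)
    fieldnames = reader.fieldnames or []
    missing = [c for c in EXPECTED_COLUMNS if c not in fieldnames]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row["Class"] not in EXPECTED_CLASSES:
            problems.append(f"line {line_no}: unexpected class {row['Class']!r}")
        # Assumes Sequence_Length is written as a plain integer.
        if int(row["Sequence_Length"]) != len(row["Sequence"]):
            problems.append(f"line {line_no}: Sequence_Length mismatch")
    return problems


# Hypothetical two-row sample; the values are made up for illustration.
sample = io.StringIO(
    "ID_Protein,Sequence,Molecular_Weight,Isoelectric_Point,Hydrophobicity,"
    "Total_Charge,Polar_Proportion,Nonpolar_Proportion,Sequence_Length,Class\n"
    "P1,ACDE,0,0,0,0,0,0,4,Enzyme\n"
    "P2,ACD,0,0,0,0,0,0,4,Hormone\n"
)
print(check_rows(sample))
```

To run the same check on a downloaded file, pass an open file handle (for example `open("proteinas_train.csv", newline="")`) instead of the in-memory sample.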

    Inspiration and Sources

    While this is a simulated dataset, it was inspired by real protein data sources and tools, such as:

    • UniProt: A comprehensive database of protein sequences and annotations.
    • Kyte-Doolittle Scale: A hydropathy scale used to calculate hydrophobicity.
    • Biopython: A tool for analyzing biological sequences.

    Proposed Uses

    This dataset is ideal for:

    • Training classification models for proteins.
    • Exploratory analysis of physicochemical properties of proteins.
    • Building machine learning pipelines in bioinformatics.

    How This Dataset Was Created

    1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
    2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
    3. Class Assignment: Classes were randomly assigned for classification purposes.
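The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the author's script: it assumes residues are drawn uniformly from the 20 standard amino acids, that the length bounds are inclusive, and that classes are drawn uniformly. The description says properties were computed with Biopython; to stay dependency-free, this sketch computes only the mean Kyte-Doolittle hydropathy by hand (the quantity Biopython's `ProteinAnalysis.gravy()` reports). The function name `generate_protein` is hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]

# Kyte-Doolittle hydropathy values per residue.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def generate_protein(rng, min_len=50, max_len=300):
    """Generate one synthetic record (assumed procedure, see lead-in)."""
    # Step 1: random sequence with a length in [min_len, max_len].
    length = rng.randint(min_len, max_len)
    sequence = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    # Step 2: one property, the average hydropathy over the sequence.
    hydrophobicity = sum(KYTE_DOOLITTLE[aa] for aa in sequence) / length
    # Step 3: class assigned at random, independent of the sequence.
    return {
        "Sequence": sequence,
        "Sequence_Length": length,
        "Hydrophobicity": hydrophobicity,
        "Class": rng.choice(CLASSES),
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
for _ in range(3):
    record = generate_protein(rng)
    print(record["Sequence_Length"], round(record["Hydrophobicity"], 3), record["Class"])
```

Because the class in step 3 is drawn without looking at the sequence, the generated label carries no information about the computed properties.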

    Limitations

    • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
    • The functional classes are simulated and do not correspond to actual biological characteristics.
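One practical consequence of randomly assigned classes can be illustrated with a small simulation. Assuming the five classes are assigned uniformly and independently of the sequence (the description says "randomly" but does not state the distribution), a classifier should not be expected to do much better than chance, about 1 in 5. The sample size of 4,000 mirrors the testing subset mentioned in the description; the labels here are simulated, not read from the dataset.

```python
import random

CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]

rng = random.Random(42)

# Simulated labels, assumed uniform over the five classes.
labels = [rng.choice(CLASSES) for _ in range(4000)]

# Baseline 1: always predict the same class.
constant_guess_accuracy = sum(label == "Enzyme" for label in labels) / len(labels)

# Baseline 2: predict a random class for every sample.
random_guess_accuracy = sum(rng.choice(CLASSES) == label for label in labels) / len(labels)

print(f"constant guess: {constant_guess_accuracy:.3f}")
print(f"random guess:   {random_guess_accuracy:.3f}")
```

Both baselines should land near 0.2, which is a useful reference point when evaluating models trained on this kind of data.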

    Data Split

    The dataset is divided into two subsets:

    • Training: 16,000 samples (proteinas_train.csv).
    • Testing: 4,000 samples (proteinas_test.csv).
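The 16,000 / 4,000 split corresponds to an 80/20 partition of the 20,000 samples. The description does not say how the split was made, so the following is only a sketch of one common approach: a seeded random shuffle. The function name `split_train_test` and the ID format are hypothetical.

```python
import random


def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical identifiers standing in for the 20,000 protein records.
ids = [f"P{i}" for i in range(20000)]
train, test = split_train_test(ids)
print(len(train), len(test))  # 16000 4000
```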

    Acknowledgment

    This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.

  2. Bioinformatics Simulated

    • kaggle.com
    Updated Jan 7, 2025
    willian oliveira gibin (2025). Bioinformatics Simulated [Dataset]. http://doi.org/10.34740/kaggle/dsv/10398445
    Dataset updated
    Jan 7, 2025
    Dataset provided by
    Kaggle
    Authors
    willian oliveira gibin
    License

    CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

    The dataset includes the following columns: ID_Protein, a unique identifier for each protein; Sequence, a string of amino acids; Molecular_Weight, molecular weight calculated from the sequence; Isoelectric_Point, estimated isoelectric point based on the sequence composition; Hydrophobicity, average hydrophobicity calculated from the sequence; Total_Charge, sum of the charges of the amino acids in the sequence; Polar_Proportion, percentage of polar amino acids in the sequence; Nonpolar_Proportion, percentage of nonpolar amino acids in the sequence; Sequence_Length, total number of amino acids in the sequence; and Class, the functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

    While this is a simulated dataset, it was inspired by patterns observed in real protein datasets such as UniProt, a comprehensive database of protein sequences and annotations; the Kyte-Doolittle Scale, calculations of hydrophobicity; and Biopython, a tool for analyzing biological sequences.

    This dataset is ideal for training classification models for proteins, exploratory analysis of physicochemical properties of proteins, and building machine learning pipelines in bioinformatics.

    The dataset was created through sequence generation, where amino acid chains were randomly generated with lengths between 50 and 300 residues, property calculation using the Biopython library, and class assignment with classes randomly assigned for classification purposes. However, the sequences and properties do not represent real proteins but follow patterns observed in natural proteins, and the functional classes are simulated and do not correspond to actual biological characteristics.

    The dataset is divided into two subsets: Training, which includes 16,000 samples (proteinas_train.csv), and Testing, which includes 4,000 samples (proteinas_test.csv). This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.

