77 datasets found
  1. Bioinformatics Protein Dataset - Simulated

    • kaggle.com
    zip
    Updated Dec 27, 2024
    Cite
    Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated
    Available download formats: zip (12928905 bytes)
    Dataset updated
    Dec 27, 2024
    Authors
    Rafael Gallo
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Subtitle

    "Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

    Description

    Introduction

    This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

    Columns Included

    • ID_Protein: Unique identifier for each protein.
    • Sequence: String of amino acids.
    • Molecular_Weight: Molecular weight calculated from the sequence.
    • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
    • Hydrophobicity: Average hydrophobicity calculated from the sequence.
    • Total_Charge: Sum of the charges of the amino acids in the sequence.
    • Polar_Proportion: Percentage of polar amino acids in the sequence.
    • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
    • Sequence_Length: Total number of amino acids in the sequence.
    • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

    Inspiration and Sources

    While this is a simulated dataset, it was inspired by patterns observed in real protein datasets and tools, such as:

    • UniProt: A comprehensive database of protein sequences and annotations.
    • Kyte-Doolittle Scale: Calculations of hydrophobicity.
    • Biopython: A tool for analyzing biological sequences.

    Proposed Uses

    This dataset is ideal for:

    • Training classification models for proteins.
    • Exploratory analysis of physicochemical properties of proteins.
    • Building machine learning pipelines in bioinformatics.

    How This Dataset Was Created

    1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
    2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
    3. Class Assignment: Classes were randomly assigned for classification purposes.
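
    The three creation steps above can be sketched in plain Python. This is a minimal illustration, not the author's script: the source says Biopython was used for the property calculations, whereas this sketch hard-codes the Kyte-Doolittle scale so that it runs with the standard library only. The `make_protein` helper, the ID format, and the polar/nonpolar grouping are assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]

# Kyte-Doolittle hydropathy value for each residue.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Assumption: polar = uncharged polar plus charged residues.
# The dataset does not state which grouping it used.
POLAR = set("STCYNQDEKRH")


def make_protein(protein_id, rng):
    """Build one synthetic record (hypothetical helper, not from the source)."""
    # Step 1: random chain of 50 to 300 residues.
    length = rng.randint(50, 300)
    sequence = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    # Step 2: a few of the listed properties, computed from the sequence.
    polar = sum(1 for aa in sequence if aa in POLAR)
    return {
        "ID_Protein": protein_id,
        "Sequence": sequence,
        "Hydrophobicity": sum(KYTE_DOOLITTLE[aa] for aa in sequence) / length,
        "Polar_Proportion": 100.0 * polar / length,
        "Nonpolar_Proportion": 100.0 * (length - polar) / length,
        "Sequence_Length": length,
        # Step 3: the class is assigned at random, independent of the sequence.
        "Class": rng.choice(CLASSES),
    }


rng = random.Random(0)
proteins = [make_protein(f"P{i:05d}", rng) for i in range(5)]
```

    Because the class is drawn independently of the sequence, a classifier trained on records built this way should not be expected to beat chance, which is consistent with the limitations the dataset lists.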

    Limitations

    • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
    • The functional classes are simulated and do not correspond to actual biological characteristics.

    Data Split

    The dataset is divided into two subsets:

    • Training: 16,000 samples (proteinas_train.csv).
    • Testing: 4,000 samples (proteinas_test.csv).

    Acknowledgment

    This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.

  2. DNA Classification dataset

    • kaggle.com
    Updated Jul 29, 2025
    Cite
    Arif Miah (2025). DNA Classification dataset [Dataset]. https://www.kaggle.com/datasets/miadul/dna-classification-dataset
    Available formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 29, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Arif Miah
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains 3,000 synthetic DNA samples with 13 features designed for genomic data analysis, machine learning, and bioinformatics research. Each row represents a unique DNA sample with both sequence-level and statistical attributes.

    🔹 Dataset Structure

    Rows: 3,000

    Columns: 13

    🔹 Features Description

    1. Sample_ID → Unique identifier for each DNA sample

    2. Sequence → DNA sequence (string of A, T, C, G)

    3. GC_Content → Percentage of Guanine (G) and Cytosine (C) in the sequence

    4. AT_Content → Percentage of Adenine (A) and Thymine (T) in the sequence

    5. Sequence_Length → Total sequence length

    6. Num_A → Number of Adenine bases

    7. Num_T → Number of Thymine bases

    8. Num_C → Number of Cytosine bases

    9. Num_G → Number of Guanine bases

    10. kmer_3_freq → Average 3-mer (triplet) frequency score

    11. Mutation_Flag → Binary flag indicating mutation presence (0 = No, 1 = Yes)

    12. Class_Label → Class of the sample (Human, Bacteria, Virus, Plant)

    13. Disease_Risk → Risk level associated with the sample (Low / Medium / High)
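
    Most of the sequence-level features listed above can be derived directly from the Sequence column. The sketch below is illustrative only: `dna_features` is a hypothetical helper, and because the source does not define how the single kmer_3_freq score is averaged, the sketch returns the raw 3-mer counts instead.

```python
from collections import Counter


def dna_features(sequence):
    """Derive count- and percentage-based features from a DNA string."""
    seq = sequence.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    # Overlapping 3-mers; the dataset's exact kmer_3_freq formula is not given.
    kmers = Counter(seq[i:i + 3] for i in range(n - 2))
    return {
        "Sequence_Length": n,
        "Num_A": counts["A"],
        "Num_T": counts["T"],
        "Num_C": counts["C"],
        "Num_G": counts["G"],
        "GC_Content": 100.0 * (counts["G"] + counts["C"]) / n,
        "AT_Content": 100.0 * (counts["A"] + counts["T"]) / n,
        "kmer_3_counts": dict(kmers),
    }


features = dna_features("ATGCGCTA")  # toy sequence, not from the dataset
```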

    🔹 Potential Use Cases

    DNA classification tasks (e.g., predicting species from DNA sequence features)

    Exploratory Data Analysis (EDA) in bioinformatics

    Machine Learning model development (Logistic Regression, Random Forest, SVM, Neural Networks)

    Deep Learning approaches (LSTM, CNN, Transformers for sequence learning)

    Mutation detection and disease risk analysis

    Teaching and practicing biological data preprocessing techniques

    🔹 Why This Dataset?

    Synthetic but realistic structure, inspired by genomics data

    Balanced and diverse distribution of features and labels

    Suitable for beginners and researchers to practice classification, visualization, and model comparison

    🔹 Example Research Questions

    Can we classify DNA samples into their biological class using sequence-based features?

    How does GC content relate to mutation risk?

    Which ML model performs best for DNA classification tasks?

    Can synthetic DNA features predict disease risk categories?

    📌 Acknowledgment

    This dataset is synthetic and generated for educational & research purposes. It does not represent real patient data.

  3. DNA mutations

    • kaggle.com
    zip
    Updated Jul 25, 2022
    Cite
    Rana Saloom (2022). DNA mutations [Dataset]. https://www.kaggle.com/datasets/ranasaloom/dna-mutations-type
    Available download formats: zip (36355 bytes)
    Dataset updated
    Jul 25, 2022
    Authors
    Rana Saloom
    Description

    Context

    In bioinformatics, discovering mutations and determining their type remains a significant concern. Researchers divide the problem into binary and multi-class classification. Determining whether a DNA sequence has been altered is a binary classification problem. Identifying the principal class of a mutation, or its sub-class, is more challenging. The primary classes of mutation are deletion, insertion, and replacement, and their sub-classes are deletion frameshift, deletion in-frame, insertion frameshift, insertion in-frame, silent, missense, nonsense, and read-through. Mutation detection techniques also need to handle related problems such as DNA sequence alignment.

    Content

    Because labeled databases are scarce, this dataset was created by taking an unlabeled sequence and introducing random mutations of all kinds, so that bioinformatics researchers can use it to analyze DNA sequences and study the impact of mutations on humans. The sequence used for labeling is published on the NCBI GenBank website (NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome).
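
    The idea of labeling a sequence by introducing random mutations can be sketched as follows. This is not the author's generation code: `mutate` is a hypothetical helper, the reference string is a toy sequence, and the sketch only separates frameshift from in-frame indels (an indel is in-frame when its length is a multiple of 3). Telling silent, missense, and nonsense replacements apart would need a codon table and a reading frame, which are left out here.

```python
import random

BASES = "ACGT"


def mutate(sequence, kind, rng, length=1):
    """Apply one random mutation and return (mutated_sequence, label)."""
    pos = rng.randrange(len(sequence) - length + 1)
    if kind == "replacement":
        # Swap a single base for a different one.
        new = rng.choice([b for b in BASES if b != sequence[pos]])
        return sequence[:pos] + new + sequence[pos + 1:], "replacement"
    if kind == "insertion":
        inserted = "".join(rng.choice(BASES) for _ in range(length))
        mutated = sequence[:pos] + inserted + sequence[pos:]
    elif kind == "deletion":
        mutated = sequence[:pos] + sequence[pos + length:]
    else:
        raise ValueError(f"unknown mutation kind: {kind}")
    # Indels that are not a multiple of 3 shift the reading frame.
    frame = "in-frame" if length % 3 == 0 else "frameshift"
    return mutated, f"{kind} {frame}"


rng = random.Random(42)
reference = "ATGGCCATTGTAATGGGCCGC"  # toy sequence, not taken from NC_045512.2
mutated, label = mutate(reference, "deletion", rng, length=3)
```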

  4. Additional file 2 of The Venus score for the assessment of the quality and...

    • springernature.figshare.com
    ods
    Updated Jan 9, 2025
    Cite
    Davide Chicco; Alessandro Fabris; Giuseppe Jurman (2025). Additional file 2 of The Venus score for the assessment of the quality and trustworthiness of biomedical datasets [Dataset]. http://doi.org/10.6084/m9.figshare.28170191.v1
    Available download formats: ods
    Dataset updated
    Jan 9, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Davide Chicco; Alessandro Fabris; Giuseppe Jurman
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Supplementary Material 2

  5. Sample DNA Sequence

    • kaggle.com
    zip
    Updated Jan 14, 2021
    Cite
    Sreshta Putchala (2021). Sample DNA Sequence [Dataset]. https://www.kaggle.com/sreshta140/covid19-genome-sequence
    Available download formats: zip (69652 bytes)
    Dataset updated
    Jan 14, 2021
    Authors
    Sreshta Putchala
    Description

    Dataset

    This dataset was created by Sreshta Putchala

    Contents

  6. EC class prediction dataset

    • kaggle.com
    zip
    Updated Jul 10, 2023
    Cite
    John Mitchell (2023). EC class prediction dataset [Dataset]. https://www.kaggle.com/datasets/jbomitchell/ec-class-prediction-dataset
    Available download formats: zip (8106829 bytes)
    Dataset updated
    Jul 10, 2023
    Authors
    John Mitchell
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0) (https://creativecommons.org/licenses/by-sa/3.0/)
    License information was derived automatically

    Description

    This dataset contains relevant notebook submission files and papers:

    Notebook submission files from:

    PS S3E18 EDA + Ensembles by @zhukovoleksiy v8 0.65031.

    PS_3.18_LGBM_bin by @akioonodera v9 0.64706.

    PS3E18 EDA| Ensemble ML Pipeline |BinaryPredictict by @tetsutani v37 0.65540.

    0.65447 | Ensemble | AutoML | Enzyme Classify by @utisop v10 0.65447.

    pyBoost baseline by @l0glikelihood v4 0.65446.

    Random Forest EC classification by @jbomitchell RF62853_submission.csv 0.62853.

    Overfit Champion by @onurkoc83 v1 0.65810.

    Playground Series S3E18 - EDA & Separate Learning by @mateuszk013 v1 0.64933.

    Ensemble ML Pipeline + Bagging = 0.65557 by @chingiznurzhanov v7 0.65557.

    PS3E18| FeatureEnginering+Stacking by @jaygun84 v5 0.64845.

    S03E18 EDA | VotingClassifier | Optuna v15 0.64776.

    PS3E18 - GaussianNB by @mehrankazeminia v1 0.65898, v2 0.66009 & v3 0.66117.

    Enzyme Weighted Voting by @nivedithavudayagiri v2 0.65028.

    Multi-label With TF-Decision Forests by @gusthema v6 0.63374.

    S3E18 Target_Encoding LB 0.65947 by @meisa0 v1 0.65947.

    Boost Classifier Model by @satyaprakashshukl v7 0.64965.

    PS3E18: Multiple lightgbm models + Optuna by syerramilli v4 0.64982.

    s3e18_solution for overfitting public :0.64785 by @onurkoc83 v1 0.64785.

    PSS3E18 : FLAML : roc_auc_weighted by @gauravduttakiit v2 0.64732.

    PGS318: combiner by @kdmitrie v4 0.65350.

    averaging best solutions mean vs Weighted mean by @omarrajaa v5 0.66106.

    Papers

    N Nath & JBO Mitchell, Is EC class predictable from reaction mechanism? BMC Bioinformatics, 13:60 (2012) doi: 10.1186/1471-2105-13-60

    L De Ferrari & JBO Mitchell, From sequence to enzyme mechanism using multi-label machine learning, BMC Bioinformatics, 15:150 (2014) doi: 10.1186/1471-2105-15-150

    N Nath, JBO Mitchell & G Caetano-Anollés, The Natural History of Biocatalytic Mechanisms, PLoS Computational Biology, 10, e1003642 (2014) doi: 10.1371/journal.pcbi.1003642

    KE Beattie, L De Ferrari & JBO Mitchell, Why do sequence signatures predict enzyme mechanism? Homology versus Chemistry, Evolutionary Bioinformatics, 11: 267-274 (2015) doi: 10.4137/EBO.S31482

    HY Mussa, L De Ferrari & JBO Mitchell, Enzyme Mechanism Prediction: A Template Matching Problem on InterPro Signature Subspaces, BMC Research Notes, 8:744 (2015) doi: 10.1186/s13104-015-1730-7

  7. Bioinformatics-UAS Kelompok 4

    • kaggle.com
    zip
    Updated Nov 19, 2025
    Cite
    Anthony TIF 2022 (2025). Bioinformatics-UAS Kelompok 4 [Dataset]. https://www.kaggle.com/datasets/anthonytif2022/bioinformatics
    Available download formats: zip (2964027 bytes)
    Dataset updated
    Nov 19, 2025
    Authors
    Anthony TIF 2022
    Description

    Dataset

    This dataset was created by Anthony TIF 2022

    Contents

  8. 🔬 Essential Proteins - Bioinformatics 🗃️ Dataset

    • kaggle.com
    zip
    Updated Oct 19, 2023
    Cite
    Samira Shemirani (2023). 🔬 Essential Proteins - Bioinformatics 🗃️ Dataset [Dataset]. https://www.kaggle.com/datasets/samira1992/essential-proteins-bioinformatics-dataset
    Available download formats: zip (209855 bytes)
    Dataset updated
    Oct 19, 2023
    Authors
    Samira Shemirani
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Essential proteins are vital for the life and reproduction of organisms and play a crucial role in maintaining cellular functions. If the destruction of a certain protein would lead to lethality or infertility, it can be classified as essential to an organism, meaning the organism cannot survive without it. Compared to non-essential proteins, essential proteins are more likely to persist in biological evolution. For instance, essential proteins make excellent targets for the development of new potential drugs and vaccines aimed at treating and preventing diseases.

    With the advent of high-throughput technologies, such as the yeast two-hybrid system and mass spectrometry analysis, a variety of protein-protein interaction (PPI) data have become available, facilitating the study of essential proteins at the network level.

  9. Vitamin D deficiency

    • figshare.com
    txt
    Updated Dec 13, 2019
    Cite
    Rafa Carretero (2019). Vitamin D deficiency [Dataset]. http://doi.org/10.6084/m9.figshare.7461938.v1
    Available download formats: txt
    Dataset updated
    Dec 13, 2019
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Rafa Carretero
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Vitamin D deficiency dataset using machine learning approaches

  10. 🧮 Sequence Alignment - Bioinformatics 🗃️ Dataset

    • kaggle.com
    zip
    Updated Mar 31, 2024
    Cite
    Samira Shemirani (2024). 🧮 Sequence Alignment - Bioinformatics 🗃️ Dataset [Dataset]. https://www.kaggle.com/datasets/samira1992/sequence-alignment-bioinformatics-dataset/discussion
    Available download formats: zip (1683 bytes)
    Dataset updated
    Mar 31, 2024
    Authors
    Samira Shemirani
    Description

    Multiple Sequence Alignment (MSA) is a key task in bioinformatics because it is used in various important biological analyses, such as predicting the function and structure of unknown proteins.

    We will use the following proteins for MSA:

    • Mouse (mouse-kiss1- NP_839991.2)

    • Human (human-kiss1- NP_002247.3)

    • Opossum (opposum-kiss1- NP_001137604.1)

    • Frog-A (frog-kiss1- NP_001156331.1)

    • Zebrafish-A (zebrafish-kiss1- NP_001106961.1)

    • Frog-B (frog-kiss2 - NP_001156332.1)

    • Zebrafish-B (zebrafish-kiss2 - NP_001136057.1)
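
    Progressive MSA methods are commonly built on pairwise alignments, so a pairwise global alignment score is a useful first step to understand. The sketch below computes a Needleman-Wunsch score with linear gap penalties. It is a pairwise building block, not an MSA implementation; the scoring values are arbitrary assumptions, and the example strings are toy sequences rather than the kisspeptin proteins listed above.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score (score only, no traceback)."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


identical = global_alignment_score("GATTACA", "GATTACA")
one_sided = global_alignment_score("ACGT", "")
```

    Keeping only two rows of the dynamic-programming table is enough when only the score is needed; recovering the alignment itself would require the full table or a traceback structure.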

  11. UniProt

    • opendatalab.com
    • bioregistry.io
    • +2 more
    zip
    Updated Nov 21, 2022
    Cite
    Fritz Haber Institute of the Max Planck Society (2022). UniProt [Dataset]. https://opendatalab.com/OpenDataLab/UniProt
    Available download formats: zip
    Dataset updated
    Nov 21, 2022
    Dataset provided by
    National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/)
    European Molecular Biology Laboratory (http://www.embl.org/)
    European Bioinformatics Institute (http://www.ebi.ac.uk/)
    Fritz Haber Institute of the Max Planck Society
    Description

    Protein sequence database

  12. E. coli Resistance Dataset

    • kaggle.com
    zip
    Updated Jun 28, 2025
    Cite
    Valeria Maciel (2025). E. coli Resistance Dataset [Dataset]. https://www.kaggle.com/datasets/valeriamaciel/e-coli-resistance-dataset
    Available download formats: zip (3172020 bytes)
    Dataset updated
    Jun 28, 2025
    Authors
    Valeria Maciel
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    🧬 E. coli Antibiotic Resistance Dataset (Raw from BV-BRC)

    This dataset contains 195,000+ raw records of Escherichia coli clinical isolates and their antimicrobial susceptibility test results. The data was extracted from the Bacterial and Viral Bioinformatics Resource Center (BV-BRC), a public repository funded by NIAID.

    Each entry captures how a specific E. coli genome responds to a given antibiotic, along with phenotypic interpretation, lab methods, measurement values (e.g., MIC), and supporting publication links.

    🔍 What’s Included

    • 🧬 Genome ID & strain names
    • 💊 Antibiotic tested
    • 📏 Measurement (MIC / Zone diameter)
    • ✅ Resistance phenotype (Resistant/Susceptible/Intermediate)
    • 🧪 Testing method, platform, vendor, and standard (CLSI/EUCAST)
    • 🔗 PubMed references and evidence source

    📦 Dataset Characteristics

    • Total records: 195,000+
    • Format: Raw, not cleaned (missing values and mixed units may be present)
    • Organism: Escherichia coli
    • Source: BV-BRC
    • License: CC BY-NC-SA 4.0
    • Language: English
  13. exampleVCFfiles

    • kaggle.com
    zip
    Updated Jan 10, 2025
    Cite
    Omer Faruk Isler (2025). exampleVCFfiles [Dataset]. https://www.kaggle.com/omerfarukisler/examplevcffiles
    Available download formats: zip (24351800 bytes)
    Dataset updated
    Jan 10, 2025
    Authors
    Omer Faruk Isler
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    All VCF files that I was able to provide for the BLG348 Intro to Bioinformatics course term project. The Mutect variant caller didn't work properly, so I didn't add its output. The NotFıltered VCFs are previous versions of the VCFs that contain different filter values (not only PASS). You can also check my profile to see the plots that I used for my project report and presentation.
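
    Going from an unfiltered VCF to a PASS-only one can be sketched as below. In the VCF format, data lines are tab-separated and FILTER is the seventh column. The `pass_only` helper and the example records are made up for illustration and are not taken from the files in this dataset.

```python
def pass_only(vcf_lines):
    """Keep header lines and data lines whose FILTER column is PASS."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # meta-information and column header lines
            continue
        fields = line.rstrip("\n").split("\t")
        # FILTER is the 7th tab-separated column (index 6).
        if len(fields) > 6 and fields[6] == "PASS":
            kept.append(line)
    return kept


example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",     # made-up record
    "chr1\t200\t.\tC\tT\t10\tLowQual\t.",  # made-up record
]
filtered = pass_only(example)
```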

  14. CASP12

    • kaggle.com
    zip
    Updated May 19, 2025
    Cite
    Aruja Tiwary (2025). CASP12 [Dataset]. https://www.kaggle.com/datasets/arujatiwary/casp12
    Available download formats: zip (14194884795 bytes)
    Dataset updated
    May 19, 2025
    Authors
    Aruja Tiwary
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset contains text files from the CASP12 (Critical Assessment of Techniques for Protein Structure Prediction) challenge, which is a benchmark for evaluating computational methods in protein structure prediction.

    Each file corresponds to a target protein and includes:

    • The amino acid sequence of the protein.
    • The true structural properties, such as secondary structure or 3D coordinates (depending on the specific format).
    • Standardized naming conventions for easy parsing.

    All files are in plain .txt format and are ready to be used for machine learning models, such as LSTMs or other sequence-based models.

  15. Transcriptomics in yeast

    • kaggle.com
    zip
    Updated Jan 24, 2017
    Cite
    CostalAether (2017). Transcriptomics in yeast [Dataset]. https://www.kaggle.com/costalaether/yeast-transcriptomics
    Available download formats: zip (4901525 bytes)
    Dataset updated
    Jan 24, 2017
    Authors
    CostalAether
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    Disclaimer

    This is a dataset of mine that I thought might be enjoyable to the community. It concerns next-generation sequencing (NGS) and transcriptomics. I used several raw datasets that are public, but the processing to get to this dataset is extensive. This is my first contribution to Kaggle, so be nice, and let me know how I can improve the experience. Combined, NGS machines are the biggest data producer worldwide, so why not add some (more?) to Kaggle.

    A look into Yeast transcriptomics

    Background

    Yeasts (in this case Saccharomyces cerevisiae) are used in the production of beer, wine, and bread, and in many biotech applications such as creating complex pharmaceuticals. They are living eukaryotic organisms (meaning quite complex). All living organisms store information in their DNA, but action within a cell is carried out by specific proteins. The path from DNA to protein (from data to action) is simple: a specific region of the DNA gets transcribed to mRNA, which gets translated to protein. A common assumption is that the translation step is linear: more mRNA means more protein. Cells actively regulate the amount of protein through the amount of mRNA they create. The expression of each gene depends on the condition the cell is in (starving, stressed, etc.). Modern methods in biology show us all the mRNA that is currently inside a cell. Assuming the process is linear, the more of a specific mRNA is available to a cell, the more of the corresponding protein it can make, which makes mRNA an excellent marker for what is actually happening inside a cell. It is important to consider that mRNA is fragile; it is actively replenished only when it is needed. Both mRNA and proteins are expensive for a cell to produce.

    Yeasts are good model organisms for this, since they only have about 6000 genes. They are also single cells, which makes samples more homogeneous, and they have few advanced features (splice junctions etc.).

    (All of this is heavily simplified; let me know if I should go into more detail.)

    The data

    files

    The following files are provided:

    • SC_expression.csv: expression values for each gene over the available conditions.
    • labels_CC.csv: labels for the individual genes, their status and, where known, intracellular localization (see below).

    Maybe this would be nice as a little competition; I'll see how this one goes before I upload the other label files. Please provide some feedback on the presentation, and whatever else you would want me to share.

    background

    I used 92 samples from various openly available raw datasets and ran them through a modern RNA-seq pipeline. They span a range of different conditions (I hid the raw names): stress conditions, temperature and heavy metals, as well as growth media changes and the deletion of specific genes. Originally I had 150 sets; 92 are of good enough quality. Evaluation was done at gene level. Each gene has its own row, and samples are columns (some are in replicates over several columns). Expression levels were normalized by TPM (transcripts per million), a default normalization procedure. Raw counts would have been integers; normalized, they are floats.
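
    TPM normalization for one sample can be sketched as follows: each gene's count is divided by its length, and the resulting rates are rescaled so that they sum to one million. This is the standard definition, not the author's pipeline; the `tpm` helper and the example numbers are illustrative assumptions.

```python
def tpm(counts, lengths_kb):
    """Convert raw per-gene read counts for one sample to TPM values."""
    # Length-normalize first, so long genes do not dominate.
    rates = [count / length for count, length in zip(counts, lengths_kb)]
    total = sum(rates)
    if total == 0:
        raise ValueError("all counts are zero; TPM is undefined")
    # Rescale so the sample sums to one million.
    return [rate / total * 1_000_000 for rate in rates]


# Made-up example: three genes whose counts scale with their length.
values = tpm(counts=[10, 20, 30], lengths_kb=[1.0, 2.0, 3.0])
```

    In this example the three genes end up with equal TPM values, because each has the same number of reads per kilobase. This is also why integer raw counts become floats after normalization.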

    Analysis and labels

    Genes

    The function of individual genes is a matter of dispute. Clearly, living cells are complex, and their inner workings are not visible. Gene function is commonly inferred indirectly by removing a gene and testing the cell's behavior. This is time-consuming and not very precise. As you can see in the dataset, there is still much to be done to fully understand even single-celled yeasts.

    The provided dataset allows for a different approach to the functional classification of genes. The label files contained in the set map each gene to a specific label. The classification is based on the official Gene Ontology (GO) annotations. I simplified the nomenclature. Gene function is usually given in a hierarchical structure [inside cell --> cytoplasm --> associated with complex A ...]. I'm only keeping high-level associations and using readable terms instead of GO terms. I'll extend this if people are interested.

    Labels

    CC labels concern Cellular Component: where the gene is within a cell. The labels go into the details of the associations found. The label 'cellular_component' should be read as 'unknown location'. CC is the easiest label to attach to a gene, because it is the one that can be studied most easily. Still, many genes are missing.

    MF labels concern Molecular Function: what the gene is doing. [upcoming]

    BP labels concern Biological Process: what the gene is involved in. [upcoming]

    The core interest here is whether it is possible to improve the gene classification by modeling the data. A common assu...

  16. bioseq whl

    • kaggle.com
    zip
    Updated Feb 5, 2022
    Cite
    Andrey Shtrauss (2022). bioseq whl [Dataset]. https://www.kaggle.com/shtrausslearning/bioseq
    Available download formats: zip (72074 bytes)
    Dataset updated
    Feb 5, 2022
    Authors
    Andrey Shtrauss
    Description

    Bioseq

    Simple package to work with biological sequences.

    Notebooks that use the package (to date): Biological Sequence Operations, Biological Sequence Alignment

    Prepare the whl file using: python setup.py bdist_wheel --universal

    Install in a notebook via: !pip install /path/

  17. Gene Expression Analysis and Disease Relationship

    • kaggle.com
    zip
    Updated Aug 4, 2025
    Cite
    asel (2025). Gene Expression Analysis and Disease Relationship [Dataset]. https://www.kaggle.com/datasets/ylmzasel/gene-expression-analysis-and-disease-relationship/code
    Available download formats: zip (8740 bytes)
    Dataset updated
    Aug 4, 2025
    Authors
    asel
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset comprises 1000 hypothetical patient or sample entries, each detailing gene expression profiles and relevant clinical characteristics. It includes a mix of numerical and categorical data types, allowing for the application of diverse machine learning and statistical analysis methods.

    Column Descriptions:

    • PatientID (Categorical/Numerical): A unique identification number assigned to each patient.
    • Age (Numerical): The patient's age. Can be used to investigate potential correlations between age and gene expression profiles.
    • Gender (Categorical): The patient's gender (0: Female, 1: Male). Effects of gender on gene expression or disease status can be analyzed.
    • Gene_X_Expression (Numerical): The relative expression level of a specific gene, "Gene X". This represents a hypothetical gene that might play a role in disease progression or treatment response.
    • Gene_Y_Expression (Numerical): The relative expression level of another specific gene, "Gene Y". Can be studied in conjunction with or independently of Gene X.
    • SmokingStatus (Categorical): The patient's smoking status (0: Non-smoker, 1: Ex-smoker, 2: Current smoker). Environmental factors' impact on gene expression and disease can be assessed.
    • DiseaseStatus (Categorical): The patient's status for the target disease (0: Healthy, 1: Disease A, 2: Disease B). This can serve as the primary target variable for your predictive models.
    • TreatmentResponse (Categorical/Numerical): The degree of response to applied treatment (0: No Response, 1: Partial Response, 2: Full Response). The role of gene expression profiles in predicting treatment success can be explored.

    Use Cases and Potential Projects

    This dataset serves as an excellent starting point for students, researchers, and enthusiasts in bioinformatics, computational biology, data science, and machine learning, enabling various projects such as:

    • Disease Diagnosis/Classification: Building models to predict DiseaseStatus using gene expression levels and other clinical factors.
    • Treatment Response Prediction: Forecasting how patients with specific gene expression profiles might respond to treatment (TreatmentResponse).
    • Biomarker Discovery: Identifying gene expression levels (e.g., Gene_X_Expression, Gene_Y_Expression) that show strong correlations with disease or treatment response.
    • Feature Engineering and Selection: Evaluating the importance of various features in the dataset and creating new ones to enhance model performance.
    • Data Visualization: Generating visualizations to explore relationships between gene expression data and demographic/clinical factors.
    • Regression and Correlation Analyses: Quantitatively examining the effects of factors like age and smoking status on gene expression levels.
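
    The integer codes given in the column descriptions (Gender, SmokingStatus, DiseaseStatus, TreatmentResponse) can be mapped back to readable labels before plotting or reporting. A minimal sketch follows; the `decode_row` helper and the example row are illustrative assumptions, and the row is written with string values as a CSV reader would return them.

```python
# Code-to-label mappings, copied from the column descriptions.
CODES = {
    "Gender": {0: "Female", 1: "Male"},
    "SmokingStatus": {0: "Non-smoker", 1: "Ex-smoker", 2: "Current smoker"},
    "DiseaseStatus": {0: "Healthy", 1: "Disease A", 2: "Disease B"},
    "TreatmentResponse": {0: "No Response", 1: "Partial Response", 2: "Full Response"},
}


def decode_row(row):
    """Return a copy of row with coded categorical columns replaced by labels."""
    decoded = {}
    for column, value in row.items():
        if column in CODES:
            # Unknown codes are passed through unchanged.
            decoded[column] = CODES[column].get(int(value), value)
        else:
            decoded[column] = value
    return decoded


# Made-up row for illustration.
example = decode_row({"PatientID": "7", "Gender": "1", "SmokingStatus": "2", "DiseaseStatus": "0"})
```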

    Why Use This Dataset?

    • Privacy Secure: Being entirely synthetic, it carries none of the privacy or ethical concerns associated with real patient data.
    • Diversity: The mix of numerical and categorical variables offers a rich ground for experimenting with different analytical techniques.
    • Predictive Potential: Clear target variables like DiseaseStatus and TreatmentResponse make it suitable for developing classification and regression models.
    • Educational and Learning: Well suited to applying fundamental data science and machine learning concepts for anyone interested in the bioinformatics domain.

  18. Genomic Data for Cancer

    • kaggle.com
    zip
    Updated Mar 16, 2025
    Cite
    Batuhan Şahan (2025). Genomic Data for Cancer [Dataset]. https://www.kaggle.com/datasets/brsahan/genomic-data-for-cancer/code
    Available download formats: zip (9134 bytes)
    Dataset updated
    Mar 16, 2025
    Authors
    Batuhan Şahan
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset contains gene expression levels of two genes and their correlation with cancer presence. It is designed for classification tasks, particularly in machine learning and bioinformatics applications. The data can be used to train models like K-Nearest Neighbors (KNN), SVM, and Neural Networks for cancer prediction.
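
    A K-Nearest Neighbors classifier over two expression values can be sketched with the standard library alone. The points and labels below are made up for illustration and are not rows from this dataset; `knn_predict` is a hypothetical helper that uses Euclidean distance and a majority vote.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    # Rank training points by Euclidean distance to the query.
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Made-up (gene one, gene two) expression pairs; 1 = cancer present, 0 = absent.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels = [0, 0, 0, 1, 1, 1]

low = knn_predict(points, labels, (1.0, 0.9))
high = knn_predict(points, labels, (5.0, 5.0))
```

    An odd k avoids tied votes in the two-class case. On real data the two features would usually be scaled to comparable ranges first, since the distance is otherwise dominated by the feature with the larger spread.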

  19. KEGG genomes, networks, diseases and drugs

    • kaggle.com
    zip
    Updated Apr 21, 2023
    Cite
    Ali Abedi Madiseh (2023). KEGG genomes, networks, diseases and drugs [Dataset]. https://www.kaggle.com/datasets/aliabedimadiseh/kegg-genomes-networks-diseases-and-drugs
    Available download formats: zip (9132230 bytes)
    Dataset updated
    Apr 21, 2023
    Authors
    Ali Abedi Madiseh
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    This dataset was collected from the 'genome.jp' web database via its FTP site: https://www.genome.jp/ftp/kegg/. It includes bioinformatics and medical databases in the pathway, medical, genome, medicus, drug, and other categories.

    This dataset includes 5 .txt files:

    • dgroup: 'Entry_ID', 'name', 'type', and 'member' information about drugs.
    • disease: 'Entry_ID', 'name', 'subgroup', 'supergroup', 'description', 'genes', and 'category' information about drugs and related diseases.
    • drug: molecular information about drugs.
    • network: the network of gene interactions, with their 'class' and 'gene' information.
    • variant: gene variants, with the 'gene variant id', 'gene name', 'gene definition', and 'variation type' categories.

    Important definitions

    1. Signaling Pathways: Describes a series of chemical reactions in which a group of molecules in a cell work together to control a cell function, such as cell division or cell death. A cell receives signals from its environment when a molecule, such as a hormone or growth factor, binds to a specific protein receptor on or in the cell. After the first molecule in the pathway receives a signal, it activates another molecule. This process is repeated through the entire signaling pathway until the last molecule is activated and the cell function is carried out. Abnormal activation of signaling pathways may lead to diseases, such as cancer. Drugs are being developed to target specific molecules involved in these pathways. These drugs may help keep cancer cells from growing. (https://www.cancer.gov/publications/dictionaries/cancer-terms/def/signaling-pathway)

    2. Variants of gene: An alteration in the most common DNA nucleotide sequence. The term variant can be used to describe an alteration that may be benign, pathogenic, or of unknown significance. The term variant is increasingly being used in place of the term mutation. (https://www.cancer.gov/publications/dictionaries/genetics-dictionary/def/variant)

  20. PARSING FASTA AND GENBANK FILES

    • kaggle.com
    zip
    Updated Nov 25, 2025
    Cite
    Dr. Nagendra (2025). PARSING FASTA AND GENBANK FILES [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/parsing-fasta-and-genbank-files
    Available download formats: zip (17972831 bytes)
    Dataset updated
    Nov 25, 2025
    Authors
    Dr. Nagendra
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset is a dedicated resource for learning how to parse core bioinformatics file formats. It contains representative samples of FASTA and GenBank files. The goal is to provide raw data for practicing essential data extraction skills. FASTA files contain sequence data, such as DNA, RNA, or protein, in a simple text format. GenBank files include detailed sequence annotations, features, and metadata. This is an ideal starting point for anyone learning Biopython or general sequence manipulation in genomics.
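
    For readers practicing the parsing skills this dataset targets, here is a minimal sketch of a FASTA reader in plain Python. It assumes the conventional layout (a header line starting with `>` followed by one or more sequence lines). The function name `parse_fasta` and the sample records are made up for illustration, and GenBank parsing is not covered.

```python
import io

def parse_fasta(handle):
    """Yield (header, sequence) pairs from a file-like object in FASTA format."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)

# Made-up example records, not taken from the dataset.
sample = io.StringIO(
    ">seq1 example DNA\nATGCGT\nACGT\n>seq2 example protein\nMKTAYIAK\n"
)
records = list(parse_fasta(sample))
```

    Biopython's `SeqIO.parse(handle, "fasta")` covers the same job and also handles GenBank input via the `"genbank"` format name; the hand-written version is only meant to show what the format looks like underneath.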

Cite
Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated

Bioinformatics Protein Dataset - Simulated

Available download formats: zip (12928905 bytes)
Dataset updated
Dec 27, 2024
Authors
Rafael Gallo
License

MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically

Description

Subtitle

"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

Description

Introduction

This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

Columns Included

  • ID_Protein: Unique identifier for each protein.
  • Sequence: String of amino acids.
  • Molecular_Weight: Molecular weight calculated from the sequence.
  • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
  • Hydrophobicity: Average hydrophobicity calculated from the sequence.
  • Total_Charge: Sum of the charges of the amino acids in the sequence.
  • Polar_Proportion: Percentage of polar amino acids in the sequence.
  • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
  • Sequence_Length: Total number of amino acids in the sequence.
  • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.
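
To make two of the columns above concrete, the following is a minimal sketch of how Sequence_Length and an average Hydrophobicity could be computed from a sequence string. It assumes the Kyte-Doolittle hydropathy scale and a plain arithmetic mean; the dataset's exact calculation is not stated here, so treat the helper names (`average_hydrophobicity`, `describe`) and the example sequence as hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def average_hydrophobicity(sequence):
    """Mean Kyte-Doolittle value over the residues of `sequence`."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[residue] for residue in sequence) / len(sequence)

def describe(sequence):
    """Return a dict with two of the listed columns (hypothetical helper)."""
    return {
        "Sequence_Length": len(sequence),
        "Hydrophobicity": average_hydrophobicity(sequence),
    }

example = describe("AILV")  # made-up four-residue sequence
```

Biopython's `ProteinAnalysis` class (in `Bio.SeqUtils.ProtParam`) offers comparable calculations, including `gravy()`, `molecular_weight()`, and `isoelectric_point()`.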

Inspiration and Sources

While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:

  • UniProt: A comprehensive database of protein sequences and annotations.
  • Kyte-Doolittle Scale: Calculations of hydrophobicity.
  • Biopython: A tool for analyzing biological sequences.

Proposed Uses

This dataset is ideal for:

  • Training classification models for proteins.
  • Exploratory analysis of physicochemical properties of proteins.
  • Building machine learning pipelines in bioinformatics.

How This Dataset Was Created

  1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
  2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
  3. Class Assignment: Classes were randomly assigned for classification purposes.
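
The generation steps above can be sketched as follows. This is an illustration under stated assumptions, not the author's script: it assumes residues and lengths are drawn uniformly, uses a fixed seed for repeatability, and omits the Biopython property calculation. The names `make_protein`, `AMINO_ACIDS`, and `CLASSES` are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]

def make_protein(rng, min_len=50, max_len=300):
    """Return one synthetic record: a random sequence plus a random class."""
    length = rng.randint(min_len, max_len)  # inclusive on both ends
    sequence = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    return {
        "Sequence": sequence,
        "Sequence_Length": length,
        "Class": rng.choice(CLASSES),  # drawn independently of the sequence
    }

rng = random.Random(0)  # fixed seed so the sketch is repeatable
proteins = [make_protein(rng) for _ in range(100)]
```

Because the class in this sketch is drawn independently of the sequence, none of the derived properties carry information about the label.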

Limitations

  • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
  • The functional classes are simulated and do not correspond to actual biological characteristics.

Data Split

The dataset is divided into two subsets:

  • Training: 16,000 samples (proteinas_train.csv).
  • Testing: 4,000 samples (proteinas_test.csv).
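
The 16,000 / 4,000 figures correspond to an 80/20 split of the 20,000 records. As a minimal sketch, assuming a shuffled random split (how the published files were actually partitioned is not stated), such a split could be produced like this. The function name `train_test_split` and the integer stand-in rows are illustrative only.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled rows become the test set, the rest train.
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20000))  # stand-in for 20,000 protein records
train, test = train_test_split(rows)
```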

Acknowledgment

This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
