77 datasets found

Bioinformatics Protein Dataset - Simulated
kaggle.com
zip
Updated Dec 27, 2024
Cite
Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated
Available download formats: zip (12928905 bytes)
Dataset updated
Dec 27, 2024
Authors
Rafael Gallo
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Subtitle

"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

Description

Introduction

This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

Columns Included

ID_Protein: Unique identifier for each protein.

Sequence: String of amino acids.

Molecular_Weight: Molecular weight calculated from the sequence.

Isoelectric_Point: Estimated isoelectric point based on the sequence composition.

Hydrophobicity: Average hydrophobicity calculated from the sequence.

Total_Charge: Sum of the charges of the amino acids in the sequence.

Polar_Proportion: Percentage of polar amino acids in the sequence.

Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.

Sequence_Length: Total number of amino acids in the sequence.

Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

Inspiration and Sources

While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:

- UniProt: A comprehensive database of protein sequences and annotations.
- Kyte-Doolittle Scale: Calculations of hydrophobicity.
- Biopython: A tool for analyzing biological sequences.

Proposed Uses

This dataset is ideal for:

- Training classification models for proteins.
- Exploratory analysis of physicochemical properties of proteins.
- Building machine learning pipelines in bioinformatics.

How This Dataset Was Created

Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.

Property Calculation: Physicochemical properties were calculated using the Biopython library.

Class Assignment: Classes were randomly assigned for classification purposes.
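
The three steps above can be sketched in plain Python. This is a minimal illustration, not the author's generation script: it assumes uniform sampling over the 20 standard amino acids, applies the published Kyte-Doolittle hydropathy values directly instead of calling Biopython, and the names `random_protein`, `mean_hydrophobicity`, and `random_class` are hypothetical.

```python
import random

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = "".join(KYTE_DOOLITTLE)
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]


def random_protein(rng, min_len=50, max_len=300):
    """Draw a random chain with a length in [min_len, max_len] (assumed uniform)."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mean_hydrophobicity(sequence):
    """Average Kyte-Doolittle hydropathy over all residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def random_class(rng):
    """Pick a class label independently of the sequence."""
    return rng.choice(CLASSES)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
seq = random_protein(rng)
record = {
    "Sequence": seq,
    "Sequence_Length": len(seq),
    "Hydrophobicity": mean_hydrophobicity(seq),
    "Class": random_class(rng),
}
```

Because the class is drawn independently of the sequence, a classifier trained on such data should not be expected to do much better than chance (about 20% accuracy for five equally likely classes).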

Limitations

The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.

The functional classes are simulated and do not correspond to actual biological characteristics.

Data Split

The dataset is divided into two subsets:

- Training: 16,000 samples (proteinas_train.csv).
- Testing: 4,000 samples (proteinas_test.csv).

Acknowledgment

This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
DNA Classification dataset
kaggle.com
Updated Jul 29, 2025
Cite
Arif Miah (2025). DNA Classification dataset [Dataset]. https://www.kaggle.com/datasets/miadul/dna-classification-dataset
Dataset updated
Jul 29, 2025
Dataset provided by
Kaggle: http://kaggle.com/
Authors
Arif Miah
License
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset contains 3,000 synthetic DNA samples with 13 features designed for genomic data analysis, machine learning, and bioinformatics research. Each row represents a unique DNA sample with both sequence-level and statistical attributes.

🔹 Dataset Structure

Rows: 3,000

Columns: 13

🔹 Features Description

Sample_ID → Unique identifier for each DNA sample

Sequence → DNA sequence (string of A, T, C, G)

GC_Content → Percentage of Guanine (G) and Cytosine (C) in the sequence

AT_Content → Percentage of Adenine (A) and Thymine (T) in the sequence

Sequence_Length → Total sequence length

Num_A → Number of Adenine bases

Num_T → Number of Thymine bases

Num_C → Number of Cytosine bases

Num_G → Number of Guanine bases

kmer_3_freq → Average 3-mer (triplet) frequency score

Mutation_Flag → Binary flag indicating mutation presence (0 = No, 1 = Yes)

Class_Label → Class of the sample (Human, Bacteria, Virus, Plant)

Disease_Risk → Risk level associated with the sample (Low / Medium / High)
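
The composition features listed above can be derived from the Sequence column alone. The following is a minimal sketch, assuming percentages are taken over the full sequence length; the function name `dna_features` is hypothetical, and `kmer_3_freq` is left out because the listing does not define how that score is computed.

```python
from collections import Counter


def dna_features(sequence):
    """Compute base counts and GC/AT percentages for one DNA string."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    length = len(seq)
    return {
        "Sequence_Length": length,
        "Num_A": counts["A"],
        "Num_T": counts["T"],
        "Num_C": counts["C"],
        "Num_G": counts["G"],
        # Percentages relative to the whole sequence (an assumption).
        "GC_Content": 100.0 * (counts["G"] + counts["C"]) / length,
        "AT_Content": 100.0 * (counts["A"] + counts["T"]) / length,
    }


features = dna_features("ATGCGC")
```

For a sequence that contains only A, T, C, and G, GC_Content and AT_Content computed this way sum to 100, so the two columns carry the same information.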

🔹 Potential Use Cases

DNA classification tasks (e.g., predicting species from DNA sequence features)

Exploratory Data Analysis (EDA) in bioinformatics

Machine Learning model development (Logistic Regression, Random Forest, SVM, Neural Networks)

Deep Learning approaches (LSTM, CNN, Transformers for sequence learning)

Mutation detection and disease risk analysis

Teaching and practicing biological data preprocessing techniques

🔹 Why This Dataset?

Synthetic but realistic structure, inspired by genomics data

Balanced and diverse distribution of features and labels

Suitable for beginners and researchers to practice classification, visualization, and model comparison

🔹 Example Research Questions

Can we classify DNA samples into their biological class using sequence-based features?

How does GC content relate to mutation risk?

Which ML model performs best for DNA classification tasks?

Can synthetic DNA features predict disease risk categories?

📌 Acknowledgment

This dataset is synthetic and generated for educational & research purposes. It does not represent real patient data.
DNA mutations
kaggle.com
zip
Updated Jul 25, 2022
Cite
Rana Saloom (2022). DNA mutations [Dataset]. https://www.kaggle.com/datasets/ranasaloom/dna-mutations-type
Available download formats: zip (36355 bytes)
Dataset updated
Jul 25, 2022
Authors
Rana Saloom
Description
Context

In bioinformatics, detecting mutations and determining their type remains a significant challenge. Researchers frame the problem as either binary or multi-class classification. Deciding whether a DNA sequence has been altered is a binary classification problem. Identifying the principal class of a mutation, or its sub-class, is more challenging. The primary classes of mutation are deletion, insertion, and replacement (substitution); their sub-classes are deletion frameshift, deletion in-frame, insertion frameshift, insertion in-frame, silent, missense, nonsense, and read-through. Mutation detection techniques also need answers to related problems such as DNA sequence alignment.

Content

Because labeled databases are scarce, this dataset was created by taking an unlabeled sequence and introducing random mutations of all kinds, so that bioinformatics researchers can use it to analyze DNA sequences and study the impact of mutations on humans. The sequence used for labeling is published on the NCBI GenBank website (NC_045512.2, Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome).
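
The insertion and deletion sub-classes named above follow from a simple rule: an indel whose length is a multiple of three keeps the reading frame, and any other length shifts it. The sketch below illustrates that rule only; the helper names are hypothetical and this is not the procedure the dataset author used to generate or label mutations.

```python
def classify_indel(kind, length):
    """Label an insertion or deletion as in-frame or frameshift."""
    if kind not in {"deletion", "insertion"}:
        raise ValueError("kind must be 'deletion' or 'insertion'")
    if length <= 0:
        raise ValueError("length must be positive")
    # A multiple of three removes or adds whole codons.
    frame = "in-frame" if length % 3 == 0 else "frameshift"
    return f"{kind} {frame}"


def apply_deletion(seq, start, length):
    """Remove `length` bases beginning at 0-based position `start`."""
    return seq[:start] + seq[start + length:]


def apply_insertion(seq, start, bases):
    """Insert `bases` before 0-based position `start`."""
    return seq[:start] + bases + seq[start:]


original = "ATGCCCTAA"
deleted = apply_deletion(original, 3, 3)   # drops the codon CCC
inserted = apply_insertion(deleted, 3, "G")  # adds a single base
labels = [classify_indel("deletion", 3), classify_indel("insertion", 1)]
```

Distinguishing silent, missense, nonsense, and read-through substitutions additionally requires translating the affected codon with a codon table, which is omitted here.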
Additional file 2 of The Venus score for the assessment of the quality and...
springernature.figshare.com
ods
Updated Jan 9, 2025
Cite
Davide Chicco; Alessandro Fabris; Giuseppe Jurman (2025). Additional file 2 of The Venus score for the assessment of the quality and trustworthiness of biomedical datasets [Dataset]. http://doi.org/10.6084/m9.figshare.28170191.v1
Available download formats: ods
Unique identifier
https://doi.org/10.6084/m9.figshare.28170191.v1
Dataset updated
Jan 9, 2025
Dataset provided by
Figshare: http://figshare.com/
Authors
Davide Chicco; Alessandro Fabris; Giuseppe Jurman
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Supplementary Material 2
Sample DNA Sequence
kaggle.com
zip
Updated Jan 14, 2021
Cite
Sreshta Putchala (2021). Sample DNA Sequence [Dataset]. https://www.kaggle.com/sreshta140/covid19-genome-sequence
Available download formats: zip (69652 bytes)
Dataset updated
Jan 14, 2021
Authors
Sreshta Putchala
Description
Dataset

This dataset was created by Sreshta Putchala

EC class prediction dataset
kaggle.com
zip
Updated Jul 10, 2023
Cite
John Mitchell (2023). EC class prediction dataset [Dataset]. https://www.kaggle.com/datasets/jbomitchell/ec-class-prediction-dataset
Available download formats: zip (8106829 bytes)
Dataset updated
Jul 10, 2023
Authors
John Mitchell
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
This dataset contains relevant notebook submission files and papers:

Notebook submission files from:

PS S3E18 EDA + Ensembles by @zhukovoleksiy v8 0.65031.

PS_3.18_LGBM_bin by @akioonodera v9 0.64706.

PS3E18 EDA| Ensemble ML Pipeline |BinaryPredictict by @tetsutani v37 0.65540.

0.65447 | Ensemble | AutoML | Enzyme Classify by @utisop v10 0.65447.

pyBoost baseline by @l0glikelihood v4 0.65446.

Random Forest EC classification by @jbomitchell RF62853_submission.csv 0.62853.

Overfit Champion by @onurkoc83 v1 0.65810.

Playground Series S3E18 - EDA & Separate Learning by @mateuszk013 v1 0.64933.

Ensemble ML Pipeline + Bagging = 0.65557 by @chingiznurzhanov v7 0.65557.

PS3E18| FeatureEnginering+Stacking by @jaygun84 v5 0.64845.

S03E18 EDA | VotingClassifier | Optuna v15 0.64776.

PS3E18 - GaussianNB by @mehrankazeminia v1 0.65898, v2 0.66009 & v3 0.66117.

Enzyme Weighted Voting by @nivedithavudayagiri v2 0.65028.

Multi-label With TF-Decision Forests by @gusthema v6 0.63374.

S3E18 Target_Encoding LB 0.65947 by @meisa0 v1 0.65947.

Boost Classifier Model by @satyaprakashshukl v7 0.64965.

PS3E18: Multiple lightgbm models + Optuna by syerramilli v4 0.64982.

s3e18_solution for overfitting public :0.64785 by @onurkoc83 v1 0.64785.

PSS3E18 : FLAML : roc_auc_weighted by @gauravduttakiit v2 0.64732.

PGS318: combiner by @kdmitrie v4 0.65350.

averaging best solutions mean vs Weighted mean by @omarrajaa v5 0.66106.

Papers

N Nath & JBO Mitchell, Is EC class predictable from reaction mechanism? BMC Bioinformatics, 13:60 (2012) doi: 10.1186/1471-2105-13-60

L De Ferrari & JBO Mitchell, From sequence to enzyme mechanism using multi-label machine learning, BMC Bioinformatics, 15:150 (2014) doi: 10.1186/1471-2105-15-150

N Nath, JBO Mitchell & G Caetano-Anollés, The Natural History of Biocatalytic Mechanisms, PLoS Computational Biology, 10, e1003642 (2014) doi: 10.1371/journal.pcbi.1003642

KE Beattie, L De Ferrari & JBO Mitchell, Why do sequence signatures predict enzyme mechanism? Homology versus Chemistry, Evolutionary Bioinformatics, 11: 267-274 (2015) doi: 10.4137/EBO.S31482

HY Mussa, L De Ferrari & JBO Mitchell, Enzyme Mechanism Prediction: A Template Matching Problem on InterPro Signature Subspaces, BMC Research Notes, 8:744 (2015) doi: 10.1186/s13104-015-1730-7
Bioinformatics-UAS Kelompok 4
kaggle.com
zip
Updated Nov 19, 2025
Cite
Anthony TIF 2022 (2025). Bioinformatics-UAS Kelompok 4 [Dataset]. https://www.kaggle.com/datasets/anthonytif2022/bioinformatics
Available download formats: zip (2964027 bytes)
Dataset updated
Nov 19, 2025
Authors
Anthony TIF 2022
Description
Dataset

This dataset was created by Anthony TIF 2022

🔬 Essential Proteins - Bioinformatics 🗃️ Dataset
kaggle.com
zip
Updated Oct 19, 2023
Cite
Samira Shemirani (2023). 🔬 Essential Proteins - Bioinformatics 🗃️ Dataset [Dataset]. https://www.kaggle.com/datasets/samira1992/essential-proteins-bioinformatics-dataset
Available download formats: zip (209855 bytes)
Dataset updated
Oct 19, 2023
Authors
Samira Shemirani
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Essential proteins are vital for the life and reproduction of organisms and play a crucial role in maintaining cellular functions. If the destruction of a certain protein would lead to lethality or infertility, it can be classified as essential to an organism, meaning the organism cannot survive without it. Compared to non-essential proteins, essential proteins are more likely to persist in biological evolution. Essential proteins also make excellent targets for the development of new potential drugs and vaccines aimed at treating and preventing diseases.

With the advent of high-throughput technologies, such as the yeast two-hybrid system and mass spectrometry analysis, a variety of protein-protein interaction (PPI) data have become available, facilitating the study of essential proteins at the network level.
Vitamin D deficiency
figshare.com
txt
Updated Dec 13, 2019
Cite
Rafa Carretero (2019). Vitamin D deficiency [Dataset]. http://doi.org/10.6084/m9.figshare.7461938.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.7461938.v1
Dataset updated
Dec 13, 2019
Dataset provided by
Figshare: http://figshare.com/
Authors
Rafa Carretero
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Vitamin D deficiency dataset using machine learning approaches
🧮 Sequence Alignment - Bioinformatics 🗃️ Dataset
kaggle.com
zip
Updated Mar 31, 2024
Cite
Samira Shemirani (2024). 🧮 Sequence Alignment - Bioinformatics 🗃️ Dataset [Dataset]. https://www.kaggle.com/datasets/samira1992/sequence-alignment-bioinformatics-dataset/discussion
Available download formats: zip (1683 bytes)
Dataset updated
Mar 31, 2024
Authors
Samira Shemirani
Description
Multiple Sequence Alignment (MSA) is a key task in bioinformatics because it is used in various important biological analyses, such as predicting the function and structure of unknown proteins.

We will use the following proteins for MSA:

• Mouse (mouse-kiss1- NP_839991.2)

• Human (human-kiss1- NP_002247.3)

• Opossum (opposum-kiss1- NP_001137604.1)

• Frog-A (frog-kiss1- NP_001156331.1)

• Zebrafish-A (zebrafish-kiss1- NP_001106961.1)

• Frog-B (frog-kiss2 - NP_001156332.1)

• Zebrafish-B (zebrafish-kiss2 - NP_001136057.1)
UniProt
opendatalab.com
bioregistry.io
+2more
zip
Updated Nov 21, 2022
Cite
Fritz Haber Institute of the Max Planck Society (2022). UniProt [Dataset]. https://opendatalab.com/OpenDataLab/UniProt
Available download formats: zip
Dataset updated
Nov 21, 2022
Dataset provided by
National Center for Biotechnology Information: http://www.ncbi.nlm.nih.gov/
European Molecular Biology Laboratory: http://www.embl.org/
European Bioinformatics Institute: http://www.ebi.ac.uk/
Fritz Haber Institute of the Max Planck Society
Description
Protein sequence database
E. coli Resistance Dataset
kaggle.com
zip
Updated Jun 28, 2025
Cite
Valeria Maciel (2025). E. coli Resistance Dataset [Dataset]. https://www.kaggle.com/datasets/valeriamaciel/e-coli-resistance-dataset
Available download formats: zip (3172020 bytes)
Dataset updated
Jun 28, 2025
Authors
Valeria Maciel
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
🧬 E. coli Antibiotic Resistance Dataset (Raw from BV-BRC)

This dataset contains 195,000+ raw records of Escherichia coli clinical isolates and their antimicrobial susceptibility test results. The data was extracted from the Bacterial and Viral Bioinformatics Resource Center (BV-BRC), a public repository funded by NIAID.

Each entry captures how a specific E. coli genome responds to a given antibiotic, along with phenotypic interpretation, lab methods, measurement values (e.g., MIC), and supporting publication links.

🔍 What’s Included

🧬 Genome ID & strain names

💊 Antibiotic tested

📏 Measurement (MIC / Zone diameter)

✅ Resistance phenotype (Resistant/Susceptible/Intermediate)

🧪 Testing method, platform, vendor, and standard (CLSI/EUCAST)

🔗 PubMed references and evidence source

📦 Dataset Characteristics

Total records: 195,000+

Format: Raw, not cleaned (missing values and mixed units may be present)

Organism: Escherichia coli

Source: BV-BRC

License: CC BY-NC-SA 4.0

Language: English
exampleVCFfiles
kaggle.com
zip
Updated Jan 10, 2025
Cite
Omer Faruk Isler (2025). exampleVCFfiles [Dataset]. https://www.kaggle.com/omerfarukisler/examplevcffiles
Available download formats: zip (24351800 bytes)
Dataset updated
Jan 10, 2025
Authors
Omer Faruk Isler
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
All the VCF files that I was able to provide for the BLG348 Intro to Bioinformatics course term project. The Mutect variant caller didn't work properly, so I didn't add its output. The "NotFıltered" VCFs are the previous versions of the VCFs, which contain records with different filters (not only PASS ones). You can also check my profile to see the plots that I used for my project report and presentation.
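
The difference between the unfiltered and PASS-only files comes down to the FILTER column, which is the seventh tab-separated field of each VCF data line. The sketch below keeps only PASS records; the function name `pass_records` and the example records are made up for illustration and are not taken from the dataset.

```python
def pass_records(vcf_lines):
    """Yield VCF data lines whose FILTER column (7th field) is PASS."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 7 and fields[6] == "PASS":
            yield line.rstrip("\n")


# Hypothetical records, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tC\tT\t10\tLowQual\t.",
]
kept = list(pass_records(example))
```

The same generator works on an open file handle, since iterating over a text file yields one line at a time.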
CASP12
kaggle.com
zip
Updated May 19, 2025
Cite
Aruja Tiwary (2025). CASP12 [Dataset]. https://www.kaggle.com/datasets/arujatiwary/casp12
Available download formats: zip (14194884795 bytes)
Dataset updated
May 19, 2025
Authors
Aruja Tiwary
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset contains text files from the CASP12 (Critical Assessment of Techniques for Protein Structure Prediction) challenge, which is a benchmark for evaluating computational methods in protein structure prediction.

Each file corresponds to a target protein and includes:

- The amino acid sequence of the protein.
- The true structural properties, such as secondary structure or 3D coordinates (depending on the specific format).
- Standardized naming conventions for easy parsing.

All files are in plain .txt format and are ready to be used for machine learning models, such as LSTMs or other sequence-based models.
Transcriptomics in yeast
kaggle.com
zip
Updated Jan 24, 2017
Cite
CostalAether (2017). Transcriptomics in yeast [Dataset]. https://www.kaggle.com/costalaether/yeast-transcriptomics
Available download formats: zip (4901525 bytes)
Dataset updated
Jan 24, 2017
Authors
CostalAether
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Disclaimer
This is a data set of mine that I thought might be enjoyable to the community. It concerns next-generation sequencing (NGS) and transcriptomics. I used several public raw datasets, but the processing needed to get to this dataset is extensive. This is my first contribution to Kaggle, so be nice, and let me know how I can improve the experience. Combined, NGS machines are the biggest data producers worldwide, so why not add some (more?) to Kaggle.
A look into Yeast transcriptomics

Background

Yeasts (in this case Saccharomyces cerevisiae) are used in the production of beer, wine, and bread, and in a whole lot of biotech applications such as creating complex pharmaceuticals. They are living eukaryotic organisms (meaning quite complex). All living organisms store information in their DNA, but action within a cell is carried out by specific proteins. The path from DNA to protein (from data to action) is simple: a specific region of the DNA gets transcribed to mRNA, which gets translated to protein. A common assumption is that the translation step is linear, so more mRNA means more protein. Cells actively regulate the amount of protein through the amount of mRNA they create. The expression of each gene depends on the condition the cell is in (starving, stressed, etc.). Modern methods in biology show us all the mRNA that is currently inside a cell. Assuming the process is linear, the more of a specific mRNA is available to a cell, the more of the corresponding protein it can make, which makes mRNA an excellent marker for what is actually happening inside a cell. It is important to consider that mRNA is fragile and is actively replenished only when it is needed. Both mRNA and proteins are expensive for a cell to produce.

Yeasts are good model organisms for this, since they only have about 6000 genes. They are also single-celled, which makes samples more homogeneous, and they have few advanced features (splice junctions etc.).

(All of this is heavily simplified; let me know if I should go into more detail.)

The data

files
The following files are provided:

SC_expression.csv: expression values for each gene over the available conditions.

labels_CC.csv: labels for the individual genes, their status and, where known, their intracellular localization (see below).

Maybe this would be nice as a little competition; I'll see how this one is going before I upload the other label files. Please provide some feedback on the presentation, and on whatever else you would want me to share.
background
I used 92 samples from various openly available raw datasets and ran them through a modern RNA-seq pipeline. They span a range of different conditions (I hid the raw names): stress conditions, temperature and heavy metals, as well as growth media changes and the deletion of specific genes. Originally I had 150 sets, of which 92 are of good enough quality. Evaluation was done at the gene level. Each gene has its own row, and samples are columns (some are replicated over several columns). Expression levels were normalized by TPM (transcripts per million), a standard normalization procedure. Raw counts would have been integers; normalized, they are floats.
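
TPM normalization, as mentioned above, can be sketched as follows for a single sample: divide each gene's read count by the gene length in kilobases, then rescale so the values sum to one million. This is a generic illustration of the standard formula, not the author's pipeline; it assumes per-gene lengths are available, which the description does not state, and the function name `tpm` is hypothetical.

```python
def tpm(counts, lengths_bp):
    """Convert raw read counts for one sample to transcripts per million."""
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    # Reads per kilobase: corrects for longer genes collecting more reads.
    rates = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in rates]
    # Rescale so the sample sums to one million.
    return [r / total * 1_000_000 for r in rates]


# Three hypothetical genes whose counts scale exactly with their lengths.
values = tpm([10, 20, 30], [1000, 2000, 3000])
```

In this toy case all three genes get the same TPM, because their counts differ only in proportion to gene length. This also explains why integer counts become floats after normalization.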
Analysis and labels

Genes

The function of individual genes is a matter of dispute. Clearly, living cells are complex, and their inner workings are not directly visible. Gene functionality is commonly inferred indirectly by removing a gene and testing the cell's behavior. This is time-consuming and not very precise. As you can see in the dataset, there is still much to be done to fully understand even single-celled yeasts.

The provided dataset allows for a different approach to the functional classification of genes. The label files contained in the set map each gene to a specific label. The classification is based on the official Gene Ontology (GO) annotations. I simplified the nomenclature. Gene functionality is usually given in a hierarchical structure [inside cell --> cytoplasm --> associated with complex A ...]. I'm only keeping high-level associations and using readable terms instead of GO terms. I'll extend this if people are interested.

Labels

CC labels concern Cellular Component.
Where the gene is within a cell. The labels go into the details of the associations found. The label 'cellular_component' is synonymous with 'unknown location'. CC is the easiest label to attach to a gene, because it is the one that can be studied most easily. Still, there are many genes missing.

MF labels concern Molecular Function: what the gene is doing. [upcoming]

BP labels concern Biological Process: what the gene is involved in. [upcoming]

The core interest here is whether it is possible to improve the genes classification by modeling the data. A common assu...
bioseq whl
kaggle.com
zip
Updated Feb 5, 2022
Cite
Andrey Shtrauss (2022). bioseq whl [Dataset]. https://www.kaggle.com/shtrausslearning/bioseq
Available download formats: zip (72074 bytes)
Dataset updated
Feb 5, 2022
Authors
Andrey Shtrauss
Description
Bioseq: a simple package for working with biological sequences.

Notebooks that use the package (to date): Biological Sequence Operations, Biological Sequence Alignment

Prepare whl file using: python setup.py bdist_wheel --universal

Install in notebook via: !pip install /path/
Gene Expression Analysis and Disease Relationship
kaggle.com
zip
Updated Aug 4, 2025
Cite
asel (2025). Gene Expression Analysis and Disease Relationship [Dataset]. https://www.kaggle.com/datasets/ylmzasel/gene-expression-analysis-and-disease-relationship/code
Available download formats: zip (8740 bytes)
Dataset updated
Aug 4, 2025
Authors
asel
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset comprises 1000 hypothetical patient or sample entries, each detailing gene expression profiles and relevant clinical characteristics. It includes a mix of numerical and categorical data types, allowing for the application of diverse machine learning and statistical analysis methods.

Column Descriptions

PatientID (Categorical/Numerical): A unique identification number assigned to each patient.

Age (Numerical): The patient's age. Can be used to investigate potential correlations between age and gene expression profiles.

Gender (Categorical): The patient's gender (0: Female, 1: Male). Effects of gender on gene expression or disease status can be analyzed.

Gene_X_Expression (Numerical): The relative expression level of a specific gene, "Gene X". This represents a hypothetical gene that might play a role in disease progression or treatment response.

Gene_Y_Expression (Numerical): The relative expression level of another specific gene, "Gene Y". Can be studied in conjunction with or independently of Gene X.

SmokingStatus (Categorical): The patient's smoking status (0: Non-smoker, 1: Ex-smoker, 2: Current smoker). The impact of environmental factors on gene expression and disease can be assessed.

DiseaseStatus (Categorical): The patient's status for the target disease (0: Healthy, 1: Disease A, 2: Disease B). This can serve as the primary target variable for predictive models.

TreatmentResponse (Categorical/Numerical): The degree of response to applied treatment (0: No Response, 1: Partial Response, 2: Full Response). The role of gene expression profiles in predicting treatment success can be explored.

Use Cases and Potential Projects

This dataset serves as an excellent starting point for students, researchers, and enthusiasts in bioinformatics, computational biology, data science, and machine learning, enabling various projects such as:

Disease Diagnosis/Classification: Building models to predict DiseaseStatus using gene expression levels and other clinical factors.

Treatment Response Prediction: Forecasting how patients with specific gene expression profiles might respond to treatment (TreatmentResponse).

Biomarker Discovery: Identifying gene expression levels (e.g., Gene_X_Expression, Gene_Y_Expression) that show strong correlations with disease or treatment response.

Feature Engineering and Selection: Evaluating the importance of various features in the dataset and creating new ones to enhance model performance.

Data Visualization: Generating visualizations to explore relationships between gene expression data and demographic/clinical factors.

Regression and Correlation Analyses: Quantitatively examining the effects of factors like age and smoking status on gene expression levels.

Why Use This Dataset?

Privacy Secure: Being entirely synthetic, it carries no privacy or ethical concerns associated with real patient data.

Diversity: The mix of numerical and categorical variables offers a rich ground for experimenting with different analytical techniques.

Predictive Potential: Clear target variables like DiseaseStatus and TreatmentResponse make it ideal for developing classification and regression models.

Educational and Learning: Perfect for applying fundamental data science and machine learning concepts for anyone interested in the bioinformatics domain.
Genomic Data for Cancer
kaggle.com
zip
Updated Mar 16, 2025
Cite
Batuhan Şahan (2025). Genomic Data for Cancer [Dataset]. https://www.kaggle.com/datasets/brsahan/genomic-data-for-cancer/code
Available download formats: zip (9134 bytes)
Dataset updated
Mar 16, 2025
Authors
Batuhan Şahan
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This dataset contains gene expression levels of two genes and their correlation with cancer presence. It is designed for classification tasks, particularly in machine learning and bioinformatics applications. The data can be used to train models like K-Nearest Neighbors (KNN), SVM, and Neural Networks for cancer prediction.
KEGG genomes, networks, diseases and drugs
kaggle.com
zip
Updated Apr 21, 2023
Cite
Ali Abedi Madiseh (2023). KEGG genomes, networks, diseases and drugs [Dataset]. https://www.kaggle.com/datasets/aliabedimadiseh/kegg-genomes-networks-diseases-and-drugs
Available download formats: zip (9132230 bytes)
Dataset updated
Apr 21, 2023
Authors
Ali Abedi Madiseh
License
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
Description
This dataset was collected from the web-based 'genome.jp' database via its FTP site: https://www.genome.jp/ftp/kegg/. It includes bioinformatics and medical databases in the pathway, medical, genome, medicus, drug, and other categories.

This dataset includes 5 .txt files:

dgroup: 'Entry_ID', 'name', 'type' and 'member' information about drugs.

disease: 'Entry_ID', 'name', 'subgroup', 'supergroup', 'description', 'genes' and 'category' information about drugs and related diseases.

drug: molecular information about drugs.

network: networks of gene interactions, with their 'class' and 'gene' information.

variant: variants of the genes, with the 'gene variant id', 'gene name', 'gene definition' and 'variation type' categories.

Important definitions

1. Signaling Pathways: Describes a series of chemical reactions in which a group of molecules in a cell work together to control a cell function, such as cell division or cell death. A cell receives signals from its environment when a molecule, such as a hormone or growth factor, binds to a specific protein receptor on or in the cell. After the first molecule in the pathway receives a signal, it activates another molecule. This process is repeated through the entire signaling pathway until the last molecule is activated and the cell function is carried out. Abnormal activation of signaling pathways may lead to diseases, such as cancer. Drugs are being developed to target specific molecules involved in these pathways. These drugs may help keep cancer cells from growing. (https://www.cancer.gov/publications/dictionaries/cancer-terms/def/signaling-pathway)

2. Variants of gene: An alteration in the most common DNA nucleotide sequence. The term variant can be used to describe an alteration that may be benign, pathogenic, or of unknown significance. The term variant is increasingly being used in place of the term mutation. (https://www.cancer.gov/publications/dictionaries/genetics-dictionary/def/variant)
PARSING FASTA AND GENBANK FILES
kaggle.com
zip
Updated Nov 25, 2025
Cite
Dr. Nagendra (2025). PARSING FASTA AND GENBANK FILES [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/parsing-fasta-and-genbank-files
Explore at:
zip (17972831 bytes)
Dataset updated
Nov 25, 2025
Authors
Dr. Nagendra
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
This dataset is a dedicated resource for learning how to parse core bioinformatics file formats. It contains representative samples of FASTA and GenBank files. The goal is to provide raw data for practicing essential data extraction skills. FASTA files contain sequence data, such as DNA, RNA, or protein, in a simple text format. GenBank files include detailed sequence annotations, features, and metadata. This is an ideal starting point for anyone learning Biopython or general sequence manipulation in genomics.
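
As a rough illustration of the kind of parsing practice this dataset is meant for, the sketch below reads FASTA records using only the Python standard library. The function name `parse_fasta` and the two example records are hypothetical and are not taken from the dataset. In practice, Biopython's `SeqIO.parse(handle, "fasta")` is the more usual choice and also handles GenBank files via the `"genbank"` format name.

```python
from io import StringIO


def parse_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record input; not taken from the dataset.
example = StringIO(">seq1 demo record\nATGC\nGGTA\n>seq2\nMKV\n")
records = list(parse_fasta(example))
for name, seq in records:
    print(name, len(seq))
```

Yielding records one at a time keeps memory use low for large files, since only the current sequence is held in memory.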

Bioinformatics Protein Dataset - Simulated


Proposed Uses

This dataset is ideal for:
- Training classification models for proteins.
- Exploratory analysis of physicochemical properties of proteins.
- Building machine learning pipelines in bioinformatics.

How This Dataset Was Created

Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
Property Calculation: Physicochemical properties were calculated using the Biopython library.
Class Assignment: Classes were randomly assigned for classification purposes.
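
The three steps above can be sketched roughly as follows. This is a minimal illustration and not the author's actual script: the description says the properties were calculated with Biopython, whereas this sketch computes only the average hydrophobicity by hand from the Kyte-Doolittle scale so that it runs with the standard library alone. Uniform residue frequencies, inclusive length bounds, the fixed seed, and the helper names `random_protein` and `mean_hydrophobicity` are all assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Kyte-Doolittle hydropathy values per residue.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]


def random_protein(rng, min_len=50, max_len=300):
    """Return a random amino acid chain (assumes uniform residue choice)."""
    length = rng.randint(min_len, max_len)  # both bounds inclusive
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mean_hydrophobicity(seq):
    """Average Kyte-Doolittle hydropathy over all residues in seq."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
seq = random_protein(rng)
row = {
    "Sequence": seq,
    "Sequence_Length": len(seq),
    "Hydrophobicity": mean_hydrophobicity(seq),
    "Class": rng.choice(CLASSES),  # random label, independent of the sequence
}
print(row["Sequence_Length"], round(row["Hydrophobicity"], 3), row["Class"])
```

Because the label is drawn independently of the sequence, a classifier trained on data generated this way should not be expected to beat chance, which is consistent with the limitations listed below.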

Limitations

The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
The functional classes are simulated and do not correspond to actual biological characteristics.

Data Split

The dataset is divided into two subsets:
- Training: 16,000 samples (proteinas_train.csv).
- Testing: 4,000 samples (proteinas_test.csv).
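
The subset sizes correspond to an 80/20 split of the 20,000 samples. The description does not say how the rows were assigned, so the sketch below only illustrates one common approach, a seeded random shuffle; the function name `split_indices`, the seed, and the shuffling itself are assumptions.

```python
import random


def split_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and return (train, test) index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(20_000)
print(len(train_idx), len(test_idx))
```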

Acknowledgment

This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
