MIT License https://opensource.org/licenses/MIT
License information was derived automatically
"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."
This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.
While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:
- UniProt: A comprehensive database of protein sequences and annotations.
- Kyte-Doolittle Scale: Calculations of hydrophobicity.
- Biopython: A tool for analyzing biological sequences.
This dataset is ideal for:
- Training classification models for proteins.
- Exploratory analysis of physicochemical properties of proteins.
- Building machine learning pipelines in bioinformatics.
The dataset is divided into two subsets:
- Training: 16,000 samples (proteinas_train.csv).
- Testing: 4,000 samples (proteinas_test.csv).
This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
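As an illustration of the kind of physicochemical property mentioned above, the following is a minimal sketch of a mean hydropathy (GRAVY) calculation on the Kyte-Doolittle scale. The function name `gravy` is hypothetical, and the listing does not say which properties the dataset's columns actually contain or how they were computed, so treat this as an example of the technique rather than the authors' method.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Return the mean Kyte-Doolittle hydropathy of a protein sequence.

    Characters that are not one of the 20 standard amino acids are skipped.
    """
    residues = [aa for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)
```

A positive value suggests a more hydrophobic sequence overall, a negative value a more hydrophilic one.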
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains 3,000 synthetic DNA samples with 13 features designed for genomic data analysis, machine learning, and bioinformatics research. Each row represents a unique DNA sample with both sequence-level and statistical attributes.
🔹 Dataset Structure
Rows: 3,000
Columns: 13
🔹 Features Description
Sample_ID → Unique identifier for each DNA sample
Sequence → DNA sequence (string of A, T, C, G)
GC_Content → Percentage of Guanine (G) and Cytosine (C) in the sequence
AT_Content → Percentage of Adenine (A) and Thymine (T) in the sequence
Sequence_Length → Total sequence length
Num_A → Number of Adenine bases
Num_T → Number of Thymine bases
Num_C → Number of Cytosine bases
Num_G → Number of Guanine bases
kmer_3_freq → Average 3-mer (triplet) frequency score
Mutation_Flag → Binary flag indicating mutation presence (0 = No, 1 = Yes)
Class_Label → Class of the sample (Human, Bacteria, Virus, Plant)
Disease_Risk → Risk level associated with the sample (Low / Medium / High)
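The sequence-level columns above can be recomputed from the Sequence column alone. Here is a minimal sketch, assuming the percentages are taken over the full sequence length; the function name `sequence_features` is hypothetical, and kmer_3_freq is left out because the listing does not define how that score is calculated.

```python
def sequence_features(sequence):
    """Recompute base counts and GC/AT percentages for one DNA sequence."""
    seq = sequence.upper()
    length = len(seq)
    counts = {base: seq.count(base) for base in "ATCG"}
    # Percentages are relative to the full length (an assumption).
    gc = 100.0 * (counts["G"] + counts["C"]) / length if length else 0.0
    at = 100.0 * (counts["A"] + counts["T"]) / length if length else 0.0
    return {
        "Sequence_Length": length,
        "Num_A": counts["A"],
        "Num_T": counts["T"],
        "Num_C": counts["C"],
        "Num_G": counts["G"],
        "GC_Content": gc,
        "AT_Content": at,
    }
```

Comparing these recomputed values with the stored columns is a quick sanity check before modeling.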
🔹 Potential Use Cases
DNA classification tasks (e.g., predicting species from DNA sequence features)
Exploratory Data Analysis (EDA) in bioinformatics
Machine Learning model development (Logistic Regression, Random Forest, SVM, Neural Networks)
Deep Learning approaches (LSTM, CNN, Transformers for sequence learning)
Mutation detection and disease risk analysis
Teaching and practicing biological data preprocessing techniques
🔹 Why This Dataset?
Synthetic but realistic structure, inspired by genomics data
Balanced and diverse distribution of features and labels
Suitable for beginners and researchers to practice classification, visualization, and model comparison
🔹 Example Research Questions
Can we classify DNA samples into their biological class using sequence-based features?
How does GC content relate to mutation risk?
Which ML model performs best for DNA classification tasks?
Can synthetic DNA features predict disease risk categories?
📌 Acknowledgment
This dataset is synthetic and generated for educational & research purposes. It does not represent real patient data.
Context
In bioinformatics, discovering mutations and determining their type remains a significant problem. Researchers divide it into binary and multi-class classification problems. Determining whether a DNA sequence has been altered is a binary classification problem. Identifying the principal class of a mutation, or its sub-class, is more challenging. The primary classes of mutation are deletion, insertion, and replacement, and their sub-classes are deletion frameshift, deletion in-frame, insertion frameshift, insertion in-frame, silent, missense, nonsense, and read-through. Mutation detection techniques additionally need solutions to occasional side problems such as DNA sequence alignment.
Content
Because labeled databases are scarce, this dataset was created by starting from an unlabeled database and generating random mutations of all kinds, so that researchers in bioinformatics can use them to analyze DNA sequences and study the impact of mutations on humans. The dataset used for labeling is published on the NCBI GenBank website (NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome).
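The sub-classes named above can be told apart with two simple rules, sketched here under stated assumptions: an insertion or deletion is a frameshift when its length is not a multiple of three, and a single-codon replacement is labeled by comparing the amino acids encoded before and after, using the standard genetic code. The function names are hypothetical, and this is not the labeling procedure the dataset authors used, which the listing does not describe.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
_BASES = "TCAG"
_AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
          "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}


def classify_indel(kind, length):
    """Label an insertion or deletion as frameshift or in-frame."""
    if kind not in ("insertion", "deletion"):
        raise ValueError("kind must be 'insertion' or 'deletion'")
    frame = "in-frame" if length % 3 == 0 else "frameshift"
    return f"{kind} {frame}"


def classify_substitution(ref_codon, alt_codon):
    """Label a replacement within one codon by its effect on the protein."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "silent"        # same amino acid (or stop) as before
    if alt_aa == "*":
        return "nonsense"      # amino acid replaced by a stop codon
    if ref_aa == "*":
        return "read-through"  # stop codon replaced by an amino acid
    return "missense"          # one amino acid replaced by another
```

For example, `classify_substitution("GAA", "TAA")` turns a glutamate codon into a stop codon, so it is labeled nonsense.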
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary Material 2
This dataset was created by Sreshta Putchala
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset contains relevant notebook submission files and papers:
Notebook submission files from:
PS S3E18 EDA + Ensembles by @zhukovoleksiy v8 0.65031.
PS_3.18_LGBM_bin by @akioonodera v9 0.64706.
PS3E18 EDA| Ensemble ML Pipeline |BinaryPredictict by @tetsutani v37 0.65540.
0.65447 | Ensemble | AutoML | Enzyme Classify by @utisop v10 0.65447.
pyBoost baseline by @l0glikelihood v4 0.65446.
Random Forest EC classification by @jbomitchell RF62853_submission.csv 0.62853.
Overfit Champion by @onurkoc83 v1 0.65810.
Playground Series S3E18 - EDA & Separate Learning by @mateuszk013 v1 0.64933.
Ensemble ML Pipeline + Bagging = 0.65557 by @chingiznurzhanov v7 0.65557.
PS3E18| FeatureEnginering+Stacking by @jaygun84 v5 0.64845.
S03E18 EDA | VotingClassifier | Optuna v15 0.64776.
PS3E18 - GaussianNB by @mehrankazeminia v1 0.65898, v2 0.66009 & v3 0.66117.
Enzyme Weighted Voting by @nivedithavudayagiri v2 0.65028.
Multi-label With TF-Decision Forests by @gusthema v6 0.63374.
S3E18 Target_Encoding LB 0.65947 by @meisa0 v1 0.65947.
Boost Classifier Model by @satyaprakashshukl v7 0.64965.
PS3E18: Multiple lightgbm models + Optuna by syerramilli v4 0.64982.
s3e18_solution for overfitting public :0.64785 by @onurkoc83 v1 0.64785.
PSS3E18 : FLAML : roc_auc_weighted by @gauravduttakiit v2 0.64732.
PGS318: combiner by @kdmitrie v4 0.65350.
averaging best solutions mean vs Weighted mean by @omarrajaa v5 0.66106.
Papers
N Nath & JBO Mitchell, Is EC class predictable from reaction mechanism? BMC Bioinformatics, 13:60 (2012) doi: 10.1186/1471-2105-13-60
L De Ferrari & JBO Mitchell, From sequence to enzyme mechanism using multi-label machine learning, BMC Bioinformatics, 15:150 (2014) doi: 10.1186/1471-2105-15-150
N Nath, JBO Mitchell & G Caetano-Anollés, The Natural History of Biocatalytic Mechanisms, PLoS Computational Biology, 10, e1003642 (2014) doi: 10.1371/journal.pcbi.1003642
KE Beattie, L De Ferrari & JBO Mitchell, Why do sequence signatures predict enzyme mechanism? Homology versus Chemistry, Evolutionary Bioinformatics, 11: 267-274 (2015) doi: 10.4137/EBO.S31482
HY Mussa, L De Ferrari & JBO Mitchell, Enzyme Mechanism Prediction: A Template Matching Problem on InterPro Signature Subspaces, BMC Research Notes, 8:744 (2015) doi: 10.1186/s13104-015-1730-7
This dataset was created by Anthony TIF 2022
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Essential proteins are vital for the life and reproduction of organisms and play a crucial role in maintaining cellular functions. If the destruction of a certain protein would lead to lethality or infertility, it can be classified as essential to an organism, meaning the organism cannot survive without it. Compared to non-essential proteins, essential proteins are more likely to persist in biological evolution. They also make excellent targets for the development of new potential drugs and vaccines aimed at treating and preventing diseases.
With the advent of high-throughput technologies, such as the yeast two-hybrid system and mass spectrometry analysis, various protein-protein interaction (PPI) data have become available, facilitating the study of essential proteins at the network level.
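One simple network-level heuristic is to rank proteins by how many distinct interaction partners they have (degree centrality), on the idea that highly connected proteins are more likely to be essential. The sketch below assumes the PPI data is available as pairs of protein identifiers; the function name `degree_ranking` and that input format are assumptions, not a description of this dataset's files or of any method its authors used.

```python
from collections import defaultdict


def degree_ranking(interactions):
    """Rank proteins by number of distinct interaction partners.

    `interactions` is an iterable of (protein_a, protein_b) pairs.
    Self-interactions and repeated pairs are counted at most once.
    """
    neighbors = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # a self-interaction adds no partner
        neighbors[a].add(b)
        neighbors[b].add(a)
    # Highest degree first; ties broken alphabetically for stable output.
    return sorted(
        ((protein, len(partners)) for protein, partners in neighbors.items()),
        key=lambda item: (-item[1], item[0]),
    )
```

Degree is only a rough proxy for essentiality, so a ranking like this is better used as a baseline feature than as a final answer.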
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Vitamin D deficiency dataset using machine learning approaches
Multiple sequence alignment (MSA) is a key task in bioinformatics because it is used in various important biological analyses, such as predicting the function and structure of unknown proteins.
We will use the following proteins for MSA:
• Mouse (mouse-kiss1- NP_839991.2)
• Human (human-kiss1- NP_002247.3)
• Opossum (opposum-kiss1- NP_001137604.1)
• Frog-A (frog-kiss1- NP_001156331.1)
• Zebrafish-A (zebrafish-kiss1- NP_001106961.1)
• Frog-B (frog-kiss2 - NP_001156332.1)
• Zebrafish-B (zebrafish-kiss2 - NP_001136057.1)
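Many MSA methods start from pairwise comparisons of the input sequences. As a minimal sketch of that building block, the function below computes a global pairwise alignment score with the Needleman-Wunsch recurrence. The function name and the scoring values (match +1, mismatch -1, gap -2) are illustrative assumptions; real protein alignments normally use a substitution matrix and a dedicated alignment tool instead.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score of sequences a and b."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]
```

Scoring every pair of the seven proteins listed above would give a score matrix of the kind that progressive alignment methods use to decide which sequences to align first.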
Protein sequence database
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains 195,000+ raw records of Escherichia coli clinical isolates and their antimicrobial susceptibility test results. The data was extracted from the Bacterial and Viral Bioinformatics Resource Center (BV-BRC), a public repository funded by NIAID.
Each entry captures how a specific E. coli genome responds to a given antibiotic, along with phenotypic interpretation, lab methods, measurement values (e.g., MIC), and supporting publication links.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
All the VCF files that I was able to provide for my BLG348 Intro to Bioinformatics course term project. The Mutect variant caller didn't work properly, so I didn't add its files. The NotFıltered VCFs are previous versions of the VCFs that contain different filter values (not only PASS ones). You can also check my profile to see the plots that I used for my project report and presentation.
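The difference between the unfiltered and filtered files can be reproduced with a small filter on the FILTER column, which is the seventh tab-separated field of a VCF data line. This is a minimal sketch working on lines already read into memory; the function name `pass_only` is hypothetical and it is not the tool the author used.

```python
def pass_only(vcf_lines):
    """Keep VCF header lines and the records whose FILTER field is PASS."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # meta-information and column header lines
            continue
        fields = line.rstrip("\n").split("\t")
        # Columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, ...
        if len(fields) > 6 and fields[6] == "PASS":
            kept.append(line)
    return kept
```

To use it on a file, read the lines with `open(path)` and write the returned list back out with `writelines`.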
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains text files from the CASP12 (Critical Assessment of Techniques for Protein Structure Prediction) challenge, which is a benchmark for evaluating computational methods in protein structure prediction.
Each file corresponds to a target protein and includes:
- The amino acid sequence of the protein.
- The true structural properties, such as secondary structure or 3D coordinates (depending on the specific format).
- Standardized naming conventions for easy parsing.
All files are in plain .txt format and are ready to be used for machine learning models, such as LSTMs or other sequence-based models.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Yeasts (in this case Saccharomyces cerevisiae) are used in the production of beer, wine, and bread, and in a whole lot of biotech applications such as creating complex pharmaceuticals. They are living eukaryotic organisms (meaning quite complex). All living organisms store information in their DNA, but action within a cell is carried out by specific proteins. The path from DNA to protein (from data to action) is simple: a specific region of the DNA gets transcribed to mRNA, which gets translated to protein. A common assumption is that the translation step is linear: more mRNA means more protein. Cells actively regulate the amount of protein through the amount of mRNA they create. The expression of each gene depends on the condition the cell is in (starving, stressed, etc.). Modern methods in biology show us all the mRNA that is currently inside a cell. Assuming the process is linear, the more of a specific mRNA is available to a cell, the more of the corresponding protein we get, which makes mRNA an excellent marker for what is actually happening inside a cell. It is important to consider that mRNA is fragile and is actively replenished only when it is needed. Both mRNA and proteins are expensive for a cell to produce.
Yeasts are good model organisms for this, since they only have about 6000 genes. They are also single cells, which makes them more homogeneous, and they contain few advanced features (splice junctions, etc.).
(All of this is heavily simplified; let me know if I should go into more detail.)
The function of individual genes is a matter of dispute. Clearly, living cells are complex, and their inner workings are not visible. Gene function is commonly inferred indirectly by removing a gene and testing the cell's behavior. This is time consuming and not very precise. As you can see in the dataset, there is still much to be done to fully understand even single-celled yeasts.
The provided dataset allows for a different approach to the functional classification of genes. The label files contained in the set map each gene to a specific label. The classification is based on the official Gene Ontology associations. I simplified the nomenclature. Gene function is usually given in a hierarchical structure [inside cell --> cytoplasm --> associated to complex A ...]. I'm only keeping high-level associations and using readable terms instead of GO terms. I'll extend this if people are interested.
CC labels concern Cellular Component.
Where the gene is within a cell. The labels go into the details of the associations found. The label 'cellular_component' should be read as synonymous with 'unknown location'. CC is the easiest label to attach to a gene, because it is the one that can be studied most easily. Still, many genes are missing.
MF labels concern Molecular Function. What the gene is doing. [upcoming]
BP labels concern Biological Process. What the gene is involved in. [upcoming]
The core interest here is whether it is possible to improve the genes classification by modeling the data. A common assu...
Bioseq
Simple package to work with biological sequences.
Notebooks that use the package (to date):
- Biological Sequence Operations
- Biological Sequence Alignment
Prepare whl file using:
python setup.py bdist_wheel --universal
Install in notebook via: !pip install /path/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises 1000 hypothetical patient or sample entries, each detailing gene expression profiles and relevant clinical characteristics. It includes a mix of numerical and categorical data types, allowing for the application of diverse machine learning and statistical analysis methods.
Column Descriptions:
- PatientID (Categorical/Numerical): A unique identification number assigned to each patient.
- Age (Numerical): The patient's age. Can be used to investigate potential correlations between age and gene expression profiles.
- Gender (Categorical): The patient's gender (0: Female, 1: Male). Effects of gender on gene expression or disease status can be analyzed.
- Gene_X_Expression (Numerical): The relative expression level of a specific gene, "Gene X". This represents a hypothetical gene that might play a role in disease progression or treatment response.
- Gene_Y_Expression (Numerical): The relative expression level of another specific gene, "Gene Y". Can be studied in conjunction with or independently of Gene X.
- SmokingStatus (Categorical): The patient's smoking status (0: Non-smoker, 1: Ex-smoker, 2: Current smoker). The impact of environmental factors on gene expression and disease can be assessed.
- DiseaseStatus (Categorical): The patient's status for the target disease (0: Healthy, 1: Disease A, 2: Disease B). This can serve as the primary target variable for your predictive models.
- TreatmentResponse (Categorical/Numerical): The degree of response to applied treatment (0: No Response, 1: Partial Response, 2: Full Response). The role of gene expression profiles in predicting treatment success can be explored.
Use Cases and Potential Projects
This dataset serves as an excellent starting point for students, researchers, and enthusiasts in bioinformatics, computational biology, data science, and machine learning, enabling various projects such as:
- Disease Diagnosis/Classification: Building models to predict DiseaseStatus using gene expression levels and other clinical factors.
- Treatment Response Prediction: Forecasting how patients with specific gene expression profiles might respond to treatment (TreatmentResponse).
- Biomarker Discovery: Identifying gene expression levels (e.g., Gene_X_Expression, Gene_Y_Expression) that show strong correlations with disease or treatment response.
- Feature Engineering and Selection: Evaluating the importance of various features in the dataset and creating new ones to enhance model performance.
- Data Visualization: Generating visualizations to explore relationships between gene expression data and demographic/clinical factors.
- Regression and Correlation Analyses: Quantitatively examining the effects of factors like age and smoking status on gene expression levels.
Why Use This Dataset?
- Privacy Secure: Being entirely synthetic, it carries no privacy or ethical concerns associated with real patient data.
- Diversity: The mix of numerical and categorical variables offers a rich ground for experimenting with different analytical techniques.
- Predictive Potential: Clear target variables like DiseaseStatus and TreatmentResponse make it ideal for developing classification and regression models.
- Educational and Learning: Perfect for applying fundamental data science and machine learning concepts for anyone interested in the bioinformatics domain.
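The integer codes listed in the column descriptions are easier to read in plots and reports once they are mapped back to their labels. The following is a minimal sketch that decodes one record at a time; the `CODES` table is copied from the descriptions above, while the function name `decode_record` and the dictionary-per-row input are assumptions about how the file would be read (for example with `csv.DictReader`).

```python
# Code-to-label mappings taken from the column descriptions.
CODES = {
    "Gender": {0: "Female", 1: "Male"},
    "SmokingStatus": {0: "Non-smoker", 1: "Ex-smoker", 2: "Current smoker"},
    "DiseaseStatus": {0: "Healthy", 1: "Disease A", 2: "Disease B"},
    "TreatmentResponse": {
        0: "No Response", 1: "Partial Response", 2: "Full Response",
    },
}


def decode_record(record):
    """Return a copy of one row with coded columns replaced by labels.

    Values may be ints or numeric strings, as produced by csv.DictReader.
    Columns without a mapping are left unchanged.
    """
    decoded = dict(record)
    for column, mapping in CODES.items():
        if column in decoded:
            decoded[column] = mapping[int(decoded[column])]
    return decoded
```

Keep the original integer codes for model training and use the decoded labels only for display.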
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains gene expression levels of two genes and their correlation with cancer presence. It is designed for classification tasks, particularly in machine learning and bioinformatics applications. The data can be used to train models like K-Nearest Neighbors (KNN), SVM, and Neural Networks for cancer prediction.
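As a rough illustration of the K-Nearest Neighbors approach mentioned above, here is a minimal sketch that classifies a sample from two expression values by majority vote among its nearest training points. The function name `knn_predict` and the tuple-based input are assumptions, since the listing does not give the dataset's column names; in practice a library implementation with feature scaling and cross-validation would be the usual choice.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for `query` by majority vote of its k nearest points.

    `train_points` holds (gene_one, gene_two) expression pairs and
    `train_labels` the matching class labels (e.g. 0 or 1).
    """
    # Sort training samples by Euclidean distance to the query point.
    by_distance = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]
```

An odd `k` avoids tied votes in a two-class problem such as cancer presence versus absence.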
http://opendatacommons.org/licenses/dbcl/1.0/
This dataset was collected from the 'genome.jp' web-based database through its FTP site: https://www.genome.jp/ftp/kegg/ It includes bioinformatics and medical databases in the pathway, medical, genome, medicus, drug, and other categories.
This dataset includes 5 .txt files:
- dgroup: 'Entry_ID', 'name', 'type', and 'member' information about drugs.
- disease: 'Entry_ID', 'name', 'subgroup', 'supergroup', 'description', 'genes', and 'category' information about drugs and related diseases.
- drug: molecular information about drugs.
- network: the network of gene interactions, with 'class' and 'gene' information.
- variant: variants of the genes, with 'gene variant id', 'gene name', 'gene definition', and 'variation type' categories.
Important definitions
1. Signaling pathways: Describes a series of chemical reactions in which a group of molecules in a cell work together to control a cell function, such as cell division or cell death. A cell receives signals from its environment when a molecule, such as a hormone or growth factor, binds to a specific protein receptor on or in the cell. After the first molecule in the pathway receives a signal, it activates another molecule. This process is repeated through the entire signaling pathway until the last molecule is activated and the cell function is carried out. Abnormal activation of signaling pathways may lead to diseases, such as cancer. Drugs are being developed to target specific molecules involved in these pathways. These drugs may help keep cancer cells from growing. (https://www.cancer.gov/publications/dictionaries/cancer-terms/def/signaling-pathway)
2. Variants of gene: An alteration in the most common DNA nucleotide sequence. The term variant can be used to describe an alteration that may be benign, pathogenic, or of unknown significance. The term variant is increasingly being used in place of the term mutation. (https://www.cancer.gov/publications/dictionaries/genetics-dictionary/def/variant)
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is a dedicated resource for learning how to parse core bioinformatics file formats. It contains representative samples of FASTA and GenBank files. The goal is to provide raw data for practicing essential data extraction skills. FASTA files contain sequence data, such as DNA, RNA, or protein, in a simple text format. GenBank files include detailed sequence annotations, features, and metadata. This is an ideal starting point for anyone learning Biopython or general sequence manipulation in genomics.
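To show what parsing the simpler of the two formats involves, here is a minimal sketch of a FASTA reader using only the standard library. The function name `parse_fasta` is hypothetical, and it keeps only the first word of each header line as the record identifier. Biopython's `Bio.SeqIO.parse(handle, "fasta")` and `Bio.SeqIO.parse(handle, "genbank")` handle both formats, including the annotations this sketch ignores.

```python
def parse_fasta(lines):
    """Parse FASTA text lines into a dict of {identifier: sequence}.

    A record starts at a line beginning with '>' and its sequence is the
    concatenation of the following lines up to the next header.
    """
    records = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            words = line[1:].split()
            current = words[0] if words else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}
```

Passing an open file object works as well as a list, because both yield one line at a time.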