31 datasets found

Bioinformatics Protein Dataset - Simulated
kaggle.com
zip
Updated Dec 27, 2024
+ more versions
Cite
Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated
Explore at:
zip (12928905 bytes) Available download formats
Dataset updated
Dec 27, 2024
Authors
Rafael Gallo
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
Subtitle

"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

Description

Introduction

This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

Columns Included

ID_Protein: Unique identifier for each protein.

Sequence: String of amino acids.

Molecular_Weight: Molecular weight calculated from the sequence.

Isoelectric_Point: Estimated isoelectric point based on the sequence composition.

Hydrophobicity: Average hydrophobicity calculated from the sequence.

Total_Charge: Sum of the charges of the amino acids in the sequence.

Polar_Proportion: Percentage of polar amino acids in the sequence.

Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.

Sequence_Length: Total number of amino acids in the sequence.

Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

Inspiration and Sources

While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:
- UniProt: A comprehensive database of protein sequences and annotations.
- Kyte-Doolittle Scale: Calculations of hydrophobicity.
- Biopython: A tool for analyzing biological sequences.

Proposed Uses

This dataset is ideal for:
- Training classification models for proteins.
- Exploratory analysis of physicochemical properties of proteins.
- Building machine learning pipelines in bioinformatics.

How This Dataset Was Created

Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.

Property Calculation: Physicochemical properties were calculated using the Biopython library.

Class Assignment: Classes were randomly assigned for classification purposes.
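
The three creation steps above can be sketched roughly as follows. This is a minimal illustration, not the author's script: it uses a hand-written Kyte-Doolittle table instead of Biopython so that it runs with the standard library only, it assumes the 50 to 300 length range is inclusive, and the `PROT_` identifier format is hypothetical. Only a few of the listed columns are computed.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]

# Kyte-Doolittle hydropathy value for each of the 20 standard residues.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydrophobicity(sequence):
    """Average Kyte-Doolittle value over all residues in the sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def make_record(rng, index):
    """Generate one synthetic protein row with a randomly assigned class."""
    length = rng.randint(50, 300)  # inclusive bounds (assumption)
    sequence = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    return {
        "ID_Protein": f"PROT_{index:05d}",  # hypothetical ID format
        "Sequence": sequence,
        "Hydrophobicity": mean_hydrophobicity(sequence),
        "Sequence_Length": length,
        "Class": rng.choice(CLASSES),  # label is independent of the sequence
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
records = [make_record(rng, i) for i in range(5)]
```

Because the class is drawn independently of the sequence, no feature in such a row carries information about its label.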

Limitations

The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.

The functional classes are simulated and do not correspond to actual biological characteristics.

Data Split

The dataset is divided into two subsets:
- Training: 16,000 samples (proteinas_train.csv).
- Testing: 4,000 samples (proteinas_test.csv).
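
A quick sanity check on the two files is to count the classes and compute a majority-class baseline before training anything. The sketch below assumes only that each CSV has a `Class` column, as listed under Columns Included; the in-memory strings are tiny stand-ins for the real files, and the function names are illustrative.

```python
import csv
import io
from collections import Counter


def class_counts(csv_file):
    """Count rows per value of the Class column in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row["Class"] for row in reader)


def majority_baseline_accuracy(train_counts, test_counts):
    """Accuracy obtained by always predicting the most common training class."""
    majority_class, _ = train_counts.most_common(1)[0]
    return test_counts[majority_class] / sum(test_counts.values())


# Tiny in-memory stand-ins for proteinas_train.csv and proteinas_test.csv.
train_csv = io.StringIO("ID_Protein,Class\n1,Enzyme\n2,Enzyme\n3,Other\n")
test_csv = io.StringIO("ID_Protein,Class\n4,Enzyme\n5,Receptor\n")

train_counts = class_counts(train_csv)
test_counts = class_counts(test_csv)
baseline = majority_baseline_accuracy(train_counts, test_counts)
```

With the real files, the same functions can be called on `open("proteinas_train.csv", newline="")`. Since the classes were assigned at random, a model that scores far above this baseline on the test set more likely signals a leak or a bug than a real signal.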

Acknowledgment

This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
Dataset for practice session 1 in bioinformatics
figshare.com
txt
Updated Jul 17, 2016
Cite
Elena Sugis (2016). Dataset for practice session 1 in bioinformatics [Dataset]. http://doi.org/10.6084/m9.figshare.3490211.v3
Explore at:
txt Available download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.3490211.v3
Dataset updated
Jul 17, 2016
Dataset provided by
Figshare http://figshare.com/
Authors
Elena Sugis
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset for practice in data preprocessing and unsupervised learning, used in the Introduction to Bioinformatics course.
Table_5_Comprehensive Review of Web Servers and Bioinformatics Tools for...
figshare.com
datasetcatalog.nlm.nih.gov
+1 more
xlsx
Updated Feb 5, 2020
+ more versions
Cite
Hong Zheng; Guosen Zhang; Lu Zhang; Qiang Wang; Huimin Li; Yali Han; Longxiang Xie; Zhongyi Yan; Yongqiang Li; Yang An; Huan Dong; Wan Zhu; Xiangqian Guo (2020). Table_5_Comprehensive Review of Web Servers and Bioinformatics Tools for Cancer Prognosis Analysis.XLSX [Dataset]. http://doi.org/10.3389/fonc.2020.00068.s005
Explore at:
xlsx Available download formats
Unique identifier
https://doi.org/10.3389/fonc.2020.00068.s005
Dataset updated
Feb 5, 2020
Dataset provided by
Frontiers
Authors
Hong Zheng; Guosen Zhang; Lu Zhang; Qiang Wang; Huimin Li; Yali Han; Longxiang Xie; Zhongyi Yan; Yongqiang Li; Yang An; Huan Dong; Wan Zhu; Xiangqian Guo
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Prognostic biomarkers are of great significance for predicting the outcome of patients with cancer, guiding clinical treatment, elucidating tumorigenesis mechanisms, and identifying therapeutic targets. To screen and develop prognostic biomarkers, high-throughput profiling methods including gene microarray and next-generation sequencing have been widely applied and have shown great success. However, due to the lack of independent validation, only very few prognostic biomarkers have been applied in clinical practice. In order to cross-validate the reliability of potential prognostic biomarkers, some groups have collected omics datasets (i.e., epigenetics/transcriptome/proteome) with related follow-up data (such as OS/DSS/PFS) of clinical samples from different cohorts, and developed easy-to-use online bioinformatics tools and web servers to assist biomarker screening and validation. These tools and web servers provide great convenience for the development of prognostic biomarkers, for the study of molecular mechanisms of tumorigenesis and progression, and even for the discovery of important therapeutic targets. To help researchers quickly learn and understand the functions of these tools, the current review introduces the usage, characteristics, and algorithms of tools and web servers such as LOGpc, KM plotter, GEPIA, TCPA, OncoLnc, PrognoScan, MethSurv, SurvExpress, UALCAN, etc., and helps researchers select more suitable tools for their own research. In addition, all the tools introduced in this review can be reached at http://bioinfo.henu.edu.cn/WebServiceList.html.
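
Tools of this kind relate a biomarker to follow-up data such as overall survival, and the usual way to summarise such data is a Kaplan-Meier survival curve. The sketch below shows the estimator itself on made-up follow-up times. It is not the implementation used by any of the tools named above, and the function name and input layout are assumptions.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  -- follow-up time per patient
    events -- 1 if the event (e.g. death) was observed, 0 if censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    # Walk through the distinct follow-up times in increasing order.
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Made-up follow-up data, for illustration only.
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

Censored patients lower the number at risk without lowering the survival estimate, which is what distinguishes this estimator from a plain fraction of survivors.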
Comparison of the multiple-delivery-mode training model employed by...
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Feb 25, 2021
Cite
Lennard, Katie; Aron, Shaun; Panji, Sumir; Kennedy, Dane; Mulder, Nicola; Allali, Imane; Fields, Christopher J; Ras, Verena; Mwaikono, Kilaza Samson; Rendon, Gloria; Claassen-Weitz, Shantelle; Holmes, Jessica R.; Botha, Gerrit (2021). Comparison of the multiple-delivery-mode training model employed by H3ABioNet’s Introduction to Bioinformatics (IBT) course and the 16s rRNA Microbiome Intermediate Bioinformatics Training course (16S). [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000897705
Explore at:
Dataset updated
Feb 25, 2021
Authors
Lennard, Katie; Aron, Shaun; Panji, Sumir; Kennedy, Dane; Mulder, Nicola; Allali, Imane; Fields, Christopher J; Ras, Verena; Mwaikono, Kilaza Samson; Rendon, Gloria; Claassen-Weitz, Shantelle; Holmes, Jessica R.; Botha, Gerrit
Description
The table provides a short description of the major components of the model employed by each course, highlighting any differences between the two (deviations are indicated by an asterisk (*)).
Drug Repositioning
kaggle.com
zip
Updated Aug 28, 2024
Cite
Aria Hassanali Aragh (2024). Drug Repositioning [Dataset]. https://www.kaggle.com/datasets/ariasha/drug-repositioning
Explore at:
zip (1427936 bytes) Available download formats
Dataset updated
Aug 28, 2024
Authors
Aria Hassanali Aragh
License
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This dataset is introduced in our paper, MiRAGE: Mining Relationships for Advanced Generative Evaluation in Drug Repositioning. It contains three CSV files: diseaseInfo, drugsInfo, and mapping.

diseaseInfo: This tabular dataset contains information on over 1,500 diseases, identified by MeSH (Medical Subject Headings) IDs. It includes three main features: a description of each disease, its associated pathways, and its classification into groups as represented by the slimmapping feature.

drugsInfo: This file provides details on approximately 1,400 drugs, each identified by a DrugBank ID. The dataset covers seven key features sourced from the DrugBank database: a description of the drug, its target, pharmacodynamics, SMILES (Simplified Molecular Input Line Entry System) notation, mechanism of action, conditions treated, and category.

mapping: This file represents known drug interactions and disease interactions, derived from the Comparative Toxicogenomics Database (CTD).

These datasets are well-suited for exploring various methods for binary prediction of drug-disease interactions, a critical and emerging challenge in the field of bioinformatics. Leveraging these datasets can aid in the exploration and discovery of new drug repositioning opportunities. Additionally, researchers can compare their results with those presented in our paper to validate and benchmark their methods, facilitating advancements in this vital area of study.
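
For binary prediction, the known pairs in the mapping file give positive examples, and negatives have to be constructed. One common approach, sketched below, samples drug-disease pairs that are absent from the mapping. The identifiers are placeholders rather than rows from the actual files, the function and parameter names are hypothetical, and treating an unlisted pair as a true negative is an assumption, since an unlisted pair may simply be untested.

```python
import random


def build_pairs(drug_ids, disease_ids, known_pairs,
                negatives_per_positive=1, seed=0):
    """Return (drug, disease, label) triples with randomly sampled negatives."""
    rng = random.Random(seed)
    known = set(known_pairs)
    examples = [(drug, disease, 1) for drug, disease in sorted(known)]
    # Every drug-disease combination that is not a known interaction.
    candidates = [
        (drug, disease)
        for drug in drug_ids
        for disease in disease_ids
        if (drug, disease) not in known
    ]
    n_negatives = min(len(candidates), negatives_per_positive * len(known))
    examples += [(d, s, 0) for d, s in rng.sample(candidates, n_negatives)]
    return examples


# Placeholder identifiers, not rows from the actual CSV files.
drugs = ["DRUG_A", "DRUG_B"]
diseases = ["DISEASE_X", "DISEASE_Y"]
known = [("DRUG_A", "DISEASE_X")]
examples = build_pairs(drugs, diseases, known)
```

The ratio of negatives to positives strongly affects reported metrics, so it should be fixed and stated when results are compared with other work.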
Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...
data.niaid.nih.gov
zenodo.org
+1 more
zip
Updated Dec 7, 2023
+ more versions
Cite
Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
Explore at:
zip Available download formats
Unique identifier
https://doi.org/10.5061/dryad.w3r2280w0
Dataset updated
Dec 7, 2023
Dataset provided by
HIV Prevention Trials Network http://www.hptn.org/
National Institute of Allergy and Infectious Diseases http://www.niaid.nih.gov/
HIV Vaccine Trials Network http://www.hvtn.org/
PEPFAR
Authors
Dylan Westfall; Mullins James
License
https://spdx.org/licenses/CC0-1.0.html
Description
Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.
Methods
This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al., "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies". Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005. For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub. The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub. The consensus.tar.gz and tagged.tar.gz files were moved from sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz).
Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results. Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, 2 fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program. To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence.
By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper. Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined and used as input to create a single dataset containing all UMI information from all samples. This file dUMI_df.csv was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
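
The core idea behind UMI-based error removal can be shown in a few lines: reads sharing a UMI derive from one template, so a per-position majority vote across the family removes isolated PCR and sequencing errors. The sketch below is a deliberately simplified illustration and not the PORPIDpipeline or sUMI_dUMI_comparison code. It assumes reads in a family are already aligned and of equal length, and the `min_family_size` threshold is a hypothetical parameter.

```python
from collections import Counter, defaultdict


def umi_consensus(reads, min_family_size=3):
    """Build one majority-vote consensus sequence per UMI family.

    reads -- iterable of (umi, sequence); sequences within a family are
             assumed to be pre-aligned and of equal length.
    """
    families = defaultdict(list)
    for umi, sequence in reads:
        families[umi].append(sequence)

    consensus = {}
    for umi, sequences in families.items():
        if len(sequences) < min_family_size:
            continue  # too few reads to out-vote a random error
        columns = zip(*sequences)  # one tuple of bases per position
        consensus[umi] = "".join(
            Counter(column).most_common(1)[0][0] for column in columns
        )
    return consensus


reads = [
    ("AAGT", "ACGTAC"),
    ("AAGT", "ACGTAC"),
    ("AAGT", "ACCTAC"),  # one read carries a point error at position 3
    ("CCTA", "TTGACA"),  # singleton family, dropped by the size filter
]
result = umi_consensus(reads)
```

A production pipeline additionally has to handle indels, errors inside the UMI itself, and recombinant reads, which is what the filtering steps described above address.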
Transcriptome-based evaluation of the translatable potential of new...
search.dataone.org
borealisdata.ca
Updated Dec 28, 2023
Cite
Salkini, Ammar; Wang, Lisheng (2023). Transcriptome-based evaluation of the translatable potential of new treatments in Triple-Negative Breast Cancer [Dataset]. http://doi.org/10.5683/SP3/JSMI9B
Explore at:
Unique identifier
https://doi.org/10.5683/SP3/JSMI9B
Dataset updated
Dec 28, 2023
Dataset provided by
Borealis
Authors
Salkini, Ammar; Wang, Lisheng
Description
Full Abstract: Introduction: Triple-negative breast cancer (TNBC) is a highly metastatic type of breast cancer and one of the largest contributors to cancer mortality in women. Unlike other breast cancers, TNBC lacks any approved therapeutic targets. Scientists are rigorously attempting to decipher molecular pathways enriched in TNBC and to design clinically applicable therapeutics. Many TNBC drugs that successfully produce general antitumor effects in vitro fail to display significant long-lasting positive effects at the clinical level. This is in part because they do not effectively suppress the growth of cancer stem cells (CSCs), which have increased ability to evolve into metastatic tumors and are associated with enrichment of immunosuppressive pathways. Moreover, it has been shown that in TNBC, dormant CSCs are able to change their metabolic signature to escape the toxic effects of these drugs; these modified metabolic signatures are shown to be causally associated with increased metastasis. Therefore, a successful, clinically-applicable therapy must have the ability to selectively inhibit CSC growth, the metastatic metabolic signature, and pathways involved in immunosuppression. Objective: This study will evaluate the potential of four recently proposed TNBC treatments—which all successfully reduced tumor viability in vitro and/or in vivo—to inhibit genes involved in CSC survival, metastatic metabolic signature, and tumor immunosuppression. Methods: TNBC cell lines and/or patient-derived xenografts were treated with four different treatments: DCC-2036, 9Gy proton irradiation, miR302b+cisplatin combination, and DFX+doxorubicin combination. Genome-wide mRNA profiling (via either RNA-seq or microarray) was performed on control and treated groups. Data was obtained from publicly-deposited NCBI GEO datasets. We assessed the differential expression of over 40 genes associated with CSC growth, metastatic metabolic modifications, and immunosuppression in TNBC tumors. 
Limma statistical analysis was performed. GSEA was also used to complement results from individual gene expression analysis. Results: DCC-2036 treatment significantly induced the expression of CSC TNBC biomarkers (such as ALDH2, CD44, CCR5, and SNAI1) and genes associated with the TNBC metastatic metabolomic signature (such as PPARGC1A). DCC-2036 showed inconsistent effects on the expression of immunosuppressive markers. 9Gy proton irradiation had mixed effects on the expression of our candidate genes, yet mostly induced the expression of stemness, metastatic, and immunosuppressive markers. miR302b+cisplatin and DFX+doxorubicin both failed to inhibit the candidate genes, yet without significantly inducing their expression. GSEA analysis confirmed the results obtained for all four treatments. Conclusions: Observing cancer rebound in TNBC patients after treatment with traditional cancer drugs is common and often happens when treatments fail to inhibit CSC growth, metabolic pathways associated with metastasis, and oncogenic immunosuppressive pathways. Our analysis shows that all four treatments failed to significantly impact the expression of protein pathways associated with increased metastasis and immunosuppression. It is worth noting that the researchers did report a decrease in tumor viability due to treatment of their experimental models with all four treatments. However, these findings correspond to the viability of the whole cell culture or tumor, not the viability of specifically the CSCs; in TNBC, CSCs make up only a small proportion of the total mass of the tumor, so the reported antiproliferative effects of the treatments do not necessarily suggest the treatment has effectively targeted the CSC population. Therefore, we hypothesize that these non-targeted therapies will likely not show positive effects in clinical studies. 
Furthermore, none of the researchers performed any assays evaluating CSC growth—such as CSC-labelled flow cytometry—or metastasis—such as secondary tumor transplantation. Therefore, we encourage the researchers to perform more rigorous assays to evaluate the translatable potential of their treatments. Finally, the outline of this study provides a useful rationale for future studies to evaluate emerging TNBC therapies and serves as a motivation for further in-silico research focus.
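
The per-gene comparison between control and treated groups comes down to an effect size such as the log2 fold change, computed for each candidate gene. The sketch below shows only that quantity on invented values for placeholder genes. It is not limma, which additionally fits a linear model and moderates the variance estimates, and the `pseudocount` parameter is an assumption added to avoid division by zero.

```python
import math


def log2_fold_changes(control, treated, genes, pseudocount=1.0):
    """Return {gene: log2(mean treated / mean control)} for selected genes.

    control, treated -- {gene: [expression value per replicate]}
    """
    changes = {}
    for gene in genes:
        mean_control = sum(control[gene]) / len(control[gene])
        mean_treated = sum(treated[gene]) / len(treated[gene])
        changes[gene] = math.log2(
            (mean_treated + pseudocount) / (mean_control + pseudocount)
        )
    return changes


# Invented expression values for placeholder genes, for illustration only.
control = {"GENE_A": [7.0, 7.0], "GENE_B": [3.0, 3.0]}
treated = {"GENE_A": [15.0, 15.0], "GENE_B": [3.0, 3.0]}
changes = log2_fold_changes(control, treated, ["GENE_A", "GENE_B"])
```

A positive value means the treatment induced the gene and a negative value means it suppressed it, which is the direction the abstract reports for each marker.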
Table5_Identification of Differentially Expressed Genes and miRNAs for...
datasetcatalog.nlm.nih.gov
frontiersin.figshare.com
Updated Jun 2, 2022
+ more versions
Cite
Hu, Weitao; Chen, Xiaoqing; Fang, Taiyong (2022). Table5_Identification of Differentially Expressed Genes and miRNAs for Ulcerative Colitis Using Bioinformatics Analysis.docx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000362953
Explore at:
Dataset updated
Jun 2, 2022
Authors
Hu, Weitao; Chen, Xiaoqing; Fang, Taiyong
Description
Introduction: Ulcerative colitis (UC) is a chronic inflammatory disease of the intestine whose cause and underlying mechanisms are not fully understood. The aim of this study was to use bioinformatics analysis to identify differentially expressed genes (DEGs) with diagnostic and therapeutic potential in UC. Materials and methods: Three UC datasets (GSE179285, GSE75214, GSE48958) were downloaded from the Gene Expression Omnibus (GEO) database. DEGs between normal and UC tissues were identified using the GEO2R online tool. The Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses of the DEGs were performed using Metascape. Protein-protein interaction (PPI) network analysis and visualization were performed using STRING and Cytoscape. Finally, the miRNA gene regulatory network was constructed with Cytoscape to predict potential microRNAs (miRNAs) associated with DEGs. Results: A total of 446 DEGs were identified, consisting of 309 upregulated genes and 137 downregulated genes. The enriched functions and pathways of the DEGs include extracellular matrix, regulation of cell adhesion, inflammatory response, response to cytokine, monocarboxylic acid metabolic process, and response to toxic substance. The KEGG pathway analysis indicates that the DEGs were significantly enriched in complement and coagulation cascades, amoebiasis, TNF signaling pathway, bile secretion, and mineral absorption. Combining the results of the PPI network and CytoHubba, 9 hub genes including CXCL8, ICAM1, CXCR4, CD44, IL1B, MMP9, SPP1, TIMP1, and HIF1A were selected. Based on the DEG-miRNA network construction, 7 miRNAs including miR-335-5p, miR-204-5p, miR-93-5p, miR-106a-5p, miR-21-5p, miR-146a-5p, and miR-155-5p were identified as potential critical miRNAs. Conclusion: In summary, we identified DEGs that may be involved in the progression or occurrence of UC. A total of 446 DEGs, 9 hub genes, and 7 miRNAs were identified, which may be considered as biomarkers of UC. 
Further studies, however, are needed to elucidate the biological functions of these genes in UC.
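
Hub-gene selection from a PPI network amounts to ranking nodes by a centrality score. The abstract does not say which CytoHubba scoring method was used, so the sketch below assumes the simplest one, node degree, and runs on a placeholder edge list rather than the STRING network from the study.

```python
def top_hub_genes(edges, k):
    """Rank genes by number of distinct interaction partners; return the top k."""
    partners = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    degree = {gene: len(p) for gene, p in partners.items()}
    # Sort by degree (descending), then by name so that ties are deterministic.
    ranked = sorted(degree.items(), key=lambda item: (-item[1], item[0]))
    return [gene for gene, _ in ranked[:k]]


# Placeholder network, not the STRING network from the study.
edges = [("G1", "G2"), ("G1", "G3"), ("G1", "G4"), ("G2", "G3"), ("G3", "G2")]
hubs = top_hub_genes(edges, 2)
```

Storing partners in sets means a duplicated or reversed edge, such as the last two entries, is counted only once.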
Data from: Reference genome choice and filtering thresholds jointly...
datadryad.org
data.niaid.nih.gov
zip
Updated Nov 8, 2023
Cite
Jessica Rick; Chad Brock; Alexander Lewanski; Jimena Golcher-Benavides; Catherine Wagner (2023). Reference genome choice and filtering thresholds jointly influence phylogenomic analyses [Dataset]. http://doi.org/10.5061/dryad.djh9w0w2g
Explore at:
zip Available download formats
Unique identifier
https://doi.org/10.5061/dryad.djh9w0w2g
Dataset updated
Nov 8, 2023
Dataset provided by
Dryad
Authors
Jessica Rick; Chad Brock; Alexander Lewanski; Jimena Golcher-Benavides; Catherine Wagner
Time period covered
Sep 1, 2023
Description
Data and supplementary material here are associated with the manuscript "Reference genome choice and filtering thresholds jointly influence phylogenetic analyses". Scripts can be found at https://github.com/jessicarick/refbias_scripts, and are archived on Zenodo at https://doi.org/10.5281/zenodo.5940690.
DataSheet1_Integrative bioinformatics approaches to establish potential...
datasetcatalog.nlm.nih.gov
frontiersin.figshare.com
Updated Apr 3, 2023
Cite
Shi, Shan; Zhou, Jiao; Yu, Wenyan; Jin, Zhongwen; Qiu, Yeqing; Xie, Rongzhi; Zhang, Hongyu (2023). DataSheet1_Integrative bioinformatics approaches to establish potential prognostic immune-related genes signature and drugs in the non-small cell lung cancer microenvironment.docx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000984911
Explore at:
Dataset updated
Apr 3, 2023
Authors
Shi, Shan; Zhou, Jiao; Yu, Wenyan; Jin, Zhongwen; Qiu, Yeqing; Xie, Rongzhi; Zhang, Hongyu
Description
Introduction: Research has revealed that the tumor microenvironment (TME) is associated with the progression of malignancy. The combination of meaningful prognostic biomarkers related to the TME is expected to be a reliable direction for improving the diagnosis and treatment of non-small cell lung cancer (NSCLC). Method and Result: Therefore, to better understand the connection between the TME and survival outcomes of NSCLC, we used the “DESeq2” R package to mine the differentially expressed genes (DEGs) of two groups of NSCLC samples according to the optimal cutoff value of the immune score through the ESTIMATE algorithm. A total of 978 up-DEGs and 828 down-DEGs were eventually identified. A fifteen-gene prognostic signature was established via LASSO and Cox regression analysis and further divided the patients into two risk sets. The survival outcome of high-risk patients was significantly worse than that of low-risk patients in both the TCGA and two external validation sets (p-value < 0.05). The gene signature showed high predictive accuracy in TCGA (1-year area under the time-dependent ROC curve (AUC) = 0.722, 2-year AUC = 0.708, 3-year AUC = 0.686). The nomogram comprising the risk score and related clinicopathological information was constructed, and calibration plots and ROC curves were applied. KEGG and GSEA analyses showed that the epithelial-mesenchymal transition (EMT) pathway, E2F target pathway and immune-associated pathway were mainly involved in the high-risk group. Further somatic mutation and immune analyses were conducted to compare the differences between the two groups. Drug sensitivity provides a potential treatment basis for clinical treatment. Finally, EREG and ADH1C were selected as the key prognostic genes of the two overlapping results from PPI and multiple Cox analyses. 
They were verified by comparing the mRNA expression in cell lines and protein expression in the HPA database, and clinical validation further confirmed the effectiveness of key genes. Conclusion: In conclusion, we obtained an immune-related fifteen-gene prognostic signature, together with the potential mechanism and sensitive drugs underlying the prognosis model, which may provide accurate prognosis prediction and available strategies for NSCLC.
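
A gene signature fitted by Cox regression is usually applied as a linear risk score, the sum of each gene's coefficient times its expression, and patients are then split into two risk groups by a cutoff. The sketch below assumes a median cutoff, which the abstract does not specify, and uses hypothetical coefficients and expression values for two placeholder genes rather than the fifteen-gene signature.

```python
import statistics


def risk_scores(expression, coefficients):
    """Linear risk score per patient: sum of coefficient * expression."""
    return {
        patient: sum(coefficients[g] * values[g] for g in coefficients)
        for patient, values in expression.items()
    }


def split_by_median(scores):
    """Label each patient 'high' or 'low' relative to the median score."""
    cutoff = statistics.median(scores.values())
    return {p: ("high" if s > cutoff else "low") for p, s in scores.items()}


# Hypothetical coefficients and expression values for two placeholder genes.
coefficients = {"GENE_A": 0.5, "GENE_B": -0.25}
expression = {
    "P1": {"GENE_A": 4.0, "GENE_B": 2.0},
    "P2": {"GENE_A": 1.0, "GENE_B": 4.0},
    "P3": {"GENE_A": 2.0, "GENE_B": 0.0},
    "P4": {"GENE_A": 6.0, "GENE_B": 4.0},
}
scores = risk_scores(expression, coefficients)
groups = split_by_median(scores)
```

A gene with a positive coefficient raises the risk score as its expression increases, and a gene with a negative coefficient lowers it.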
Protein Structural Domain Classification
cathdb.info
ec.i4cologne.com
+3 more
Updated Sep 30, 2024
Cite
(2024). Protein Structural Domain Classification [Dataset]. http://identifiers.org/MIR:00100005
Explore at:
Unique identifier
https://identifiers.org/MIR:00100005
Dataset updated
Sep 30, 2024
Description
CATH Domain Classification List (latest release) - protein structural domains classified into CATH hierarchy.
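
CATH takes its name from the four main levels of its hierarchy: Class, Architecture, Topology, and Homologous superfamily. A domain's position is written as a four-part dotted code. The sketch below parses such a code into named fields. The layout of the domain list file itself is not described here, so only the code string is handled, and the type and function names are illustrative.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    class_: int  # trailing underscore avoids the reserved word "class"
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_code(code):
    """Split a four-part CATH code such as '3.40.50.300' into its levels."""
    parts = code.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(part) for part in parts))


parsed = parse_cath_code("3.40.50.300")
```

Two domains share a level of the hierarchy when their codes agree on every field up to and including that level.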
Supplementary Material for: Profiling and Bioinformatics Analysis Revealing...
datasetcatalog.nlm.nih.gov
karger.figshare.com
Updated Nov 15, 2021
Cite
Y. , Wei; H. , Huang; N. , Li; Z. , Liu; Y. , Zhang; J. , Huang; W. , Zhong; Z. , Yuan; G. , Huang; C. , Huang; X. , Chen (2021). Supplementary Material for: Profiling and Bioinformatics Analysis Revealing Differential Circular RNA Expression about Storage Lesion Regulatory in Stored Red Blood Cells [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000885453
Explore at:
Dataset updated
Nov 15, 2021
Authors
Y. , Wei; H. , Huang; N. , Li; Z. , Liu; Y. , Zhang; J. , Huang; W. , Zhong; Z. , Yuan; G. , Huang; C. , Huang; X. , Chen
Description
Introduction: Circular RNA (circRNA) plays an important role in regulating metabolism of red blood cells (RBCs) and their storage lesions, but how circRNA expression changes in stored RBCs has rarely been studied. Methods: The expression change of circRNA was systematically evaluated via high-throughput sequencing on healthy RBCs on days 0, 20, and 40. We then confirmed the reliability of the high-throughput sequencing analysis by RT-qPCR characterization of selected circRNAs. A higher parental gene enrichment was used to explore circRNA function in pathways. In addition, we deciphered a dysregulated circRNA-related ceRNA network, and identified three circRNA-miRNA-mRNA regulatory axes related to storage lesion. Results: We identified 2,586 known and 6,216 putative novel circRNAs; the expression levels of more than 100 circRNAs were shifted, and the number of downregulated circRNAs was greater with longer storage time. Furthermore, a higher parental gene enrichment related to circRNA was found in pathways including cAMP signaling pathway, ubiquitin-mediated proteolysis, apoptosis, adhesion, MAPK signaling pathway, cystine methionine metabolism, RNA degradation, RNA transport, TGF-β, and actin regulatory pathway. The hsa_circ_0007127-miR-513a-5p-SMAD4, hsa_circ_0000033-miR-19a-3p-VAMP3, and hsa_circ_0005546-miR-4720-CCND3 regulatory axes related to storage lesion were found. Conclusions: By investigating circRNA profiles and circRNA-miRNA-mRNA interactions, this study provides insights into circRNA expression changes in stored RBCs, which closely relate to the storage lesion of RBCs and their physiological functions.
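
A circRNA-miRNA-mRNA axis is the join of two pairwise relations on the shared miRNA: which circRNA is linked to which miRNA, and which miRNA targets which mRNA. The sketch below performs that join on the three axes named in the abstract, split into their two halves. How the study predicted and filtered the underlying pairs is not described here, and the function name is illustrative.

```python
from collections import defaultdict


def build_axes(circ_mirna_pairs, mirna_mrna_pairs):
    """Join circRNA-miRNA and miRNA-mRNA pairs into three-part axes."""
    targets = defaultdict(list)
    for mirna, mrna in mirna_mrna_pairs:
        targets[mirna].append(mrna)
    return [
        (circ, mirna, mrna)
        for circ, mirna in circ_mirna_pairs
        for mrna in targets.get(mirna, [])
    ]


# The three axes named in the abstract, split into their two halves.
circ_mirna = [
    ("hsa_circ_0007127", "miR-513a-5p"),
    ("hsa_circ_0000033", "miR-19a-3p"),
    ("hsa_circ_0005546", "miR-4720"),
]
mirna_mrna = [
    ("miR-513a-5p", "SMAD4"),
    ("miR-19a-3p", "VAMP3"),
    ("miR-4720", "CCND3"),
]
axes = build_axes(circ_mirna, mirna_mrna)
```

In a full network one miRNA typically links several circRNAs to several mRNAs, so the join can produce many more axes than there are input pairs.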
Supplementary Material for: Analyzing the Role of Specific DAMPs-Related...
datasetcatalog.nlm.nih.gov
karger.figshare.com
Updated Oct 4, 2024
Cite
F. , Zheng; Q. , Xie; Y. , Tang; F. , Yuan (2024). Supplementary Material for: Analyzing the Role of Specific DAMPs-Related Genes in Osteoarthritis and Investigating the Association Between β-Amyloid and ApoE Isoforms [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001276272
Explore at:
Dataset updated
Oct 4, 2024
Authors
F. Zheng; Q. Xie; Y. Tang; F. Yuan
Description
Introduction: Osteoarthritis (OA) is a prevalent chronic joint disorder. It is characterized by an immune response that maintains a low level of inflammation throughout its progression. During OA, cartilage degradation leads to the release of damage-associated molecular patterns (DAMPs), which intensify the inflammatory response. β-amyloid, a well-recognized DAMP in OA, can interact with ApoE isoforms. Methods: This study identified DAMPs-related genes in OA using bioinformatics techniques. Additionally, we examined the expression levels of β-amyloid and ApoE isoforms by ELISA. Results: We identified 10 key genes by machine learning techniques. Immune infiltration analysis revealed upregulation of various immune cell types in OA cartilage, underscoring the critical role of inflammation in OA pathogenesis. In the validation study, elevated serum levels of β-amyloid in knee osteoarthritis (KOA) patients were confirmed, showing positive correlations with ApoE2 and ApoE4. Notably, ApoE3 was identified as an independent protective factor against KOA. Conclusion: In this bioinformatics analysis, we identified the DAMPs-related genes of KOA and explored their potential functions and regulatory networks. The high expression of β-amyloid in KOA was confirmed experimentally, and the correlation between β-amyloid and ApoE2 and ApoE4 in KOA was revealed for the first time. This provides a new way to explore the pathogenesis of KOA and to study its therapeutic targets.
DataSheet2_Cancer and COVID-19 Susceptibility and Severity: A Two-Sample...
datasetcatalog.nlm.nih.gov
frontiersin.figshare.com
Updated Jan 24, 2022
Zhong, Cheng; Li, Duguang; Lin, Hui; Liang, Yuelong; Xia, Qiming; Chen, Guoqiao; Fan, Xiaoxiao; Cheng, Jiaxi; Zhang, Yiyin; Yang, Jing; Mao, Qijiang; Chen, Peng; Jin, Shengxi; Li, Yirun (2022). DataSheet2_Cancer and COVID-19 Susceptibility and Severity: A Two-Sample Mendelian Randomization and Bioinformatic Analysis.docx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000376593
Explore at:
Dataset updated
Jan 24, 2022
Authors
Zhong, Cheng; Li, Duguang; Lin, Hui; Liang, Yuelong; Xia, Qiming; Chen, Guoqiao; Fan, Xiaoxiao; Cheng, Jiaxi; Zhang, Yiyin; Yang, Jing; Mao, Qijiang; Chen, Peng; Jin, Shengxi; Li, Yirun
Description
The clinical management of patients with COVID-19 and cancer is a Gordian knot that has been discussed widely but has not reached a consensus. We introduced two-sample Mendelian randomization to investigate the causal association between a genetic predisposition to cancers and COVID-19 susceptibility and severity. Moreover, we also explored the mutation landscape, expression pattern, and prognostic implications of genes involved with COVID-19 in distinct cancers. Among all of the cancer types we analyzed, only the genetic predisposition to lung adenocarcinoma was causally associated with increased COVID-19 severity (OR = 2.93, β = 1.074, se = 0.411, p = 0.009) with no obvious heterogeneity (Q = 17.29, p = 0.24) or asymmetry of the funnel plot. In addition, the results of the pleiotropy test demonstrated that instrument SNPs were less likely to affect COVID-19 severity via approaches other than lung adenocarcinoma susceptibility (p = 0.96). Leave-one-out analysis found no outlier instrument SNPs whose elimination altered statistical significance, which further supported the reliability of the MR results. Broad mutation and differential expression of these genes were also found in cancers, which may provide valuable information for developing new treatment modalities for patients with both cancer and COVID-19. For example, ERAP2, a risk factor for COVID-19-associated death, is upregulated in lung squamous cancer and negatively associated with patient prognosis. Hence, ERAP2-targeted treatment may simultaneously reduce COVID-19 disease severity and restrain cancer progression. Our results highlighted the importance of strengthening medical surveillance for COVID-19 deterioration in patients with lung adenocarcinoma by showing their causal genetic association. For these patients, a delay in anticancer treatment, such as chemotherapy and surgery, should be considered.
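The reported effect estimate can be checked with a little arithmetic. The sketch below assumes that β is a log odds ratio and that the p-value comes from a two-sided Wald test under a normal approximation; the description does not state either, so treat both as assumptions. Under them, exp(1.074) is approximately 2.93, which agrees with the reported OR.

```python
import math

# Values quoted in the description above.
beta, se = 1.074, 0.411

# Assumption: beta is a log odds ratio, so OR = exp(beta).
odds_ratio = math.exp(beta)

# Assumption: two-sided Wald test with a standard normal reference.
z = beta / se
p_two_sided = math.erfc(z / math.sqrt(2))

# Approximate 95% confidence interval on the odds-ratio scale.
ci_low = math.exp(beta - 1.96 * se)
ci_high = math.exp(beta + 1.96 * se)

print(f"OR = {odds_ratio:.2f}, z = {z:.2f}, p = {p_two_sided:.3f}")
print(f"95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The interval is wide because the standard error is large relative to β, which is worth keeping in mind when reading the point estimate.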
Data_Sheet_1_Interdisciplinary and Transferable Concepts in Bioinformatics...
figshare.com
frontiersin.figshare.com
pdf
Updated May 31, 2023
Iain G. Johnston; Mark Slater; Jean-Baptiste Cazier (2023). Data_Sheet_1_Interdisciplinary and Transferable Concepts in Bioinformatics Education: Observations and Approaches From a UK MSc Course.pdf [Dataset]. http://doi.org/10.3389/feduc.2022.826951.s001
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/feduc.2022.826951.s001
Dataset updated
May 31, 2023
Dataset provided by
Frontiers
Authors
Iain G. Johnston; Mark Slater; Jean-Baptiste Cazier
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
United Kingdom
Description
Bioinformatics is a highly interdisciplinary subject, with substantial and growing influence in health, environmental science and society, and is utilised by scientists from many diverse academic backgrounds. Education in bioinformatics therefore necessitates effective development of skills in interdisciplinary collaboration, communication, ethics, and critical analysis of research, in addition to practical and technical skills. Insights from bioinformatics training can additionally inform developing education in the tightly aligned and emerging disciplines of data science and artificial intelligence. Here, we describe the design, implementation, and review of a module in a UK MSc-level bioinformatics programme attempting to address these goals for diverse student cohorts. Reflecting the philosophy of the field and programme, the module content was designed either as “diversity-addressing”—working toward a common foundation of knowledge—or “diversity-exploiting”—where different student viewpoints and skills were harnessed to facilitate student research projects “greater than the sum of their parts.” For a universal introduction to technical concepts, we combined a mixed lecture/immediate computational practical approach, facilitated by virtual machines, creating an efficient technical learning environment praised in student feedback for building confidence among cohorts with diverse backgrounds. Interdisciplinary group research projects where diverse students worked on real research questions were supervised in tandem with interactive contact time covering transferable skills in collaboration and communication in diverse teams, research presentation, and ethics. Multi-faceted feedback and assessment provided a constructive alignment with real peer-reviewed bioinformatics research. 
We believe that the inclusion of these transferable, interdisciplinary, and critical concepts in a bioinformatics course can help produce rounded, experienced graduates, ready for the real world and with many future options in science and society. In addition, we hope to provide some ideas and resources to facilitate such inclusion.
Data_Sheet_6_Identification and cross-validation of autophagy-related genes...
frontiersin.figshare.com
txt
Updated May 30, 2023
Yufang Yang; Min Zhang; Ziqing Li; Shen He; Xueqi Ren; Linmei Wang; Zhifei Wang; Shi Shu (2023). Data_Sheet_6_Identification and cross-validation of autophagy-related genes in cardioembolic stroke.CSV [Dataset]. http://doi.org/10.3389/fneur.2023.1097623.s006
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.3389/fneur.2023.1097623.s006
Dataset updated
May 30, 2023
Dataset provided by
Frontiers
Authors
Yufang Yang; Min Zhang; Ziqing Li; Shen He; Xueqi Ren; Linmei Wang; Zhifei Wang; Shi Shu
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Objective: Cardioembolic stroke (CE stroke, also known as cardiogenic cerebral embolism, CCE) has the highest recurrence rate and fatality rate among all subtypes of ischemic stroke, the pathogenesis of which was unclear. Autophagy plays an essential role in the development of CE stroke. We aim to identify the potential autophagy-related molecular markers of CE stroke and uncover the potential therapeutic targets through bioinformatics analysis. Methods: The mRNA expression profile dataset GSE58294 was obtained from the GEO database. The potential autophagy-related differentially expressed (DE) genes of CE stroke were screened by R software. Protein–protein interactions (PPIs), correlation analysis, and gene ontology (GO) enrichment analysis were applied to the autophagy-related DE genes. GSE66724, GSE41177, and GSE22255 were introduced for the verification of the autophagy-related DE genes in CE stroke, and the differences in values were re-calculated by Student’s t-test. Results: A total of 41 autophagy-related DE genes (37 upregulated genes and four downregulated genes) were identified between 23 cardioembolic stroke patients (≤3 h, prior to treatment) and 23 healthy controls. The KEGG and GO enrichment analysis of autophagy-related DE genes indicated several enriched terms related to autophagy, apoptosis, and ER stress. The PPI results demonstrated the interactions between these autophagy-related genes. Moreover, several hub genes, especially for CE stroke, were identified and re-calculated by Student’s t-test. Conclusion: We identified 41 potential autophagy-related genes associated with CE stroke through bioinformatics analysis. SERPINA1, WDFY3, ERN1, RHEB, and BCL2L1 were identified as the most significant DE genes that may affect the development of CE stroke by regulating autophagy. CXCR4 was identified as a hub gene of all types of strokes. ARNT, MAPK1, ATG12, ATG16L2, ATG2B, and BECN1 were identified as particular hub genes for CE stroke.
These results may provide insight into the role of autophagy in CE stroke and contribute to the discovery of potential therapeutic targets for CE stroke treatment.
Consensus signatures for LINCS L1000 perturbations
figshare.com
txt
Updated May 30, 2023
Daniel Himmelstein; Leo Brueggeman; Sergio Baranzini (2023). Consensus signatures for LINCS L1000 perturbations [Dataset]. http://doi.org/10.6084/m9.figshare.3085426.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.3085426.v1
Dataset updated
May 30, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Daniel Himmelstein; Leo Brueggeman; Sergio Baranzini
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
LINCS L1000 measures the transcriptional response to perturbations. However, a single perturbagen is often assessed at several conditions, such as dosages, timepoints, or cell lines. A consensus signature meta-analyzes several input signatures and condenses them into a single output signature.
We've computed consensus signatures for:
+ 1,170 DrugBank compounds
+ 4,326 gene knockdowns
+ 2,413 gene overexpressions
+ 38,327 L1000 perturbations
Each signature contains dysregulation z-scores for 7,467 genes (978 measured and 6,489 inferred, see genes.tsv). The consensi-{type}.tsv.bz2 files contain the perturbagen × gene matrix of z-scores. The dysreg-{type}.tsv files contain significantly dysregulated genes. The dysreg-{type}-summary.tsv files provide the counts of significantly up/down-regulated genes per perturbagen.
Our methods are available on Thinklab. The project GitHub repository contains all of the datasets here besides consensi-pert_id.tsv.bz2 due to its large file size.
If using these datasets, please attribute this figshare deposition and the LINCS L1000 project. Also please abide by the data release policy for the NIH LINCS Program.
This is not an official LINCS L1000 repository. Users are warned that our modifications may have introduced errors or removed signal that was present in the original data. We thank the L1000 team for posting their data and providing support including online office hours.
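To make the idea of condensing several input signatures into one consensus signature concrete, here is a minimal sketch using Stouffer's method for combining z-scores. This description does not say which formula the authors used, so the choice of Stouffer's method, the equal default weights, and the gene names GENE_A and GENE_B are all assumptions made for illustration.

```python
import math

def stouffer(z_scores, weights=None):
    """Combine z-scores with Stouffer's method (equal weights by default)."""
    if weights is None:
        weights = [1.0] * len(z_scores)
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    return numerator / denominator

def consensus_signature(signatures):
    """Condense several gene -> z-score signatures into one, gene by gene."""
    genes = signatures[0].keys()
    return {gene: stouffer([sig[gene] for sig in signatures]) for gene in genes}

# Hypothetical signatures for one perturbagen at two conditions.
signatures = [
    {"GENE_A": 2.0, "GENE_B": -1.0},
    {"GENE_A": 2.0, "GENE_B": 1.0},
]
consensus = consensus_signature(signatures)
print(consensus)
```

Agreeing signals reinforce each other (GENE_A ends up with a larger z-score than in either input), while conflicting signals cancel (GENE_B goes to zero).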
Summary of results.
plos.figshare.com
xls
Updated Jun 2, 2023
Mary E. Prendergast; Michael Buckley; Alison Crowther; Laurent Frantz; Heidi Eager; Ophélie Lebrasseur; Rainer Hutterer; Ardern Hulme-Beaman; Wim Van Neer; Katerina Douka; Margaret-Ashley Veall; Eriéndira M. Quintana Morales; Verena J. Schuenemann; Ella Reiter; Richard Allen; Evangelos A. Dimopoulos; Richard M. Helm; Ceri Shipton; Ogeto Mwebi; Christiane Denys; Mark Horton; Stephanie Wynne-Jones; Jeffrey Fleisher; Chantal Radimilahy; Henry Wright; Jeremy B. Searle; Johannes Krause; Greger Larson; Nicole L. Boivin (2023). Summary of results. [Dataset]. http://doi.org/10.1371/journal.pone.0182565.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0182565.t001
Dataset updated
Jun 2, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Mary E. Prendergast; Michael Buckley; Alison Crowther; Laurent Frantz; Heidi Eager; Ophélie Lebrasseur; Rainer Hutterer; Ardern Hulme-Beaman; Wim Van Neer; Katerina Douka; Margaret-Ashley Veall; Eriéndira M. Quintana Morales; Verena J. Schuenemann; Ella Reiter; Richard Allen; Evangelos A. Dimopoulos; Richard M. Helm; Ceri Shipton; Ogeto Mwebi; Christiane Denys; Mark Horton; Stephanie Wynne-Jones; Jeffrey Fleisher; Chantal Radimilahy; Henry Wright; Jeremy B. Searle; Johannes Krause; Greger Larson; Nicole L. Boivin
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Zooarchaeological and biomolecular results for the sites studied, with earliest dates for fauna confirmed via biomolecular analysis.
Table3_Capturing heart valve development with Gene Ontology.XLSX
frontiersin.figshare.com
figshare.com
xlsx
Updated Oct 17, 2023
Saadullah H. Ahmed; Alexander T. Deng; Rachael P. Huntley; Nancy H. Campbell; Ruth C. Lovering (2023). Table3_Capturing heart valve development with Gene Ontology.XLSX [Dataset]. http://doi.org/10.3389/fgene.2023.1251902.s003
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.3389/fgene.2023.1251902.s003
Dataset updated
Oct 17, 2023
Dataset provided by
Frontiers
Authors
Saadullah H. Ahmed; Alexander T. Deng; Rachael P. Huntley; Nancy H. Campbell; Ruth C. Lovering
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Introduction: The normal development of all heart valves requires highly coordinated signaling pathways and downstream mediators. While genomic variants can be responsible for congenital valve disease, environmental factors can also play a role. Later in life valve calcification is a leading cause of aortic valve stenosis, a progressive disease that may lead to heart failure. Current research into the causes of both congenital valve diseases and valve calcification is using a variety of high-throughput methodologies, including transcriptomics, proteomics and genomics. High quality genetic data from biological knowledge bases are essential to facilitate analyses and interpretation of these high-throughput datasets. The Gene Ontology (GO, http://geneontology.org/) is a major bioinformatics resource used to interpret these datasets, as it provides structured, computable knowledge describing the role of gene products across all organisms. The UCL Functional Gene Annotation team focuses on GO annotation of human gene products. Having identified that the GO annotations included in transcriptomic, proteomic and genomic data did not provide sufficient descriptive information about heart valve development, we initiated a focused project to address this issue. Methods: This project prioritized 138 proteins for GO annotation, which led to the curation of 100 peer-reviewed articles and the creation of 400 heart valve development-relevant GO annotations. Results: While the focus of this project was heart valve development, around 600 of the 1000 annotations created described the broader cellular role of these proteins, including those describing aortic valve morphogenesis, BMP signaling and endocardial cushion development.
Our functional enrichment analysis of the 28 proteins known to have a role in bicuspid aortic valve disease confirmed that this annotation project has led to an improved interpretation of a heart valve genetic dataset. Discussion: To address the needs of the heart valve research community this project has provided GO annotations to describe the specific roles of key proteins involved in heart valve development. The breadth of GO annotations created by this project will benefit many of those seeking to interpret a wide range of cardiovascular genomic, transcriptomic, proteomic and metabolomic datasets.
Data from: Mini dataset.
figshare.com
application/x-rar
Updated Nov 4, 2025
Mahsa Yaghobinejad; Mohammad Naji; Ali Mohammad Alizadeh; Soheib Aryanezhad; Solmaz Khalighfard; Parisa Asadollahi; Nasrin Takzare; Tayebeh Rastegar (2025). Mini dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0315366.s007
Explore at:
Available download formats: application/x-rar
Unique identifier
https://doi.org/10.1371/journal.pone.0315366.s007
Dataset updated
Nov 4, 2025
Dataset provided by
PLOS (http://plos.org/)
Authors
Mahsa Yaghobinejad; Mohammad Naji; Ali Mohammad Alizadeh; Soheib Aryanezhad; Solmaz Khalighfard; Parisa Asadollahi; Nasrin Takzare; Tayebeh Rastegar
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Given the urgent need for more specific, sensitive, and non-invasive markers for prostate cancer screening and differential diagnosis, circulating miRNAs have emerged as valuable candidates. Sixty-seven prostate cancer subjects at different stages were included in this study. The participants were categorized into groups based on their pathological characteristics as local, biochemical relapse, and metastatic. We retrieved eligible datasets from the GEO database to identify stage-specific differentially expressed up/down-regulated genes. Cytohubba, a built-in application of the Cytoscape software, and the Reactome pathway database were applied to select hub genes. To select upstream miRNAs, we utilized the MiRWalk and miRNet online tools. To construct the miRNA-mRNA regulatory networks, we employed rna22. Finally, three miRNAs and five target genes were validated in peripheral blood mononuclear cells of PCa patients compared with benign prostate hyperplasia. PSA level was also measured using ELISA. Our findings revealed the potential of PRC1 and UBA52 as biomarkers for the metastatic stage, and of RCC1 for both biochemical relapse and metastatic subjects. Furthermore, elevated levels of miR-124-3p and downregulation of miR-133a-3p can be introduced as identifiers of the biochemical relapse stage. We also identified the tumor suppressor role of miR-17-5p, which was associated with higher Gleason scores. We propose PRC1, UBA52, RCC1, miR-124-3p and miR-133a-3p as stage-specific PCa identifiers.

Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated

Bioinformatics Protein Dataset - Simulated

Explore at:

Available download formats: zip (12928905 bytes)

Dataset updated

Dec 27, 2024

Authors

Rafael Gallo

License

MIT License: https://opensource.org/licenses/MIT
License information was derived automatically

Description

Subtitle

"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

Description

Introduction

This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

Columns Included

ID_Protein: Unique identifier for each protein.
Sequence: String of amino acids.
Molecular_Weight: Molecular weight calculated from the sequence.
Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
Hydrophobicity: Average hydrophobicity calculated from the sequence.
Total_Charge: Sum of the charges of the amino acids in the sequence.
Polar_Proportion: Percentage of polar amino acids in the sequence.
Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
Sequence_Length: Total number of amino acids in the sequence.
Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

Inspiration and Sources

While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:
- UniProt: A comprehensive database of protein sequences and annotations.
- Kyte-Doolittle Scale: Calculations of hydrophobicity.
- Biopython: A tool for analyzing biological sequences.

Proposed Uses

This dataset is ideal for:
- Training classification models for proteins.
- Exploratory analysis of physicochemical properties of proteins.
- Building machine learning pipelines in bioinformatics.

How This Dataset Was Created

Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
Property Calculation: Physicochemical properties were calculated using the Biopython library.
Class Assignment: Classes were randomly assigned for classification purposes.
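
The three steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the author's script: residues are drawn uniformly from the 20 standard amino acids, the ID format is hypothetical, and a hand-written Kyte-Doolittle table stands in for the Biopython calculation so that the example runs with the standard library only.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]

# Kyte-Doolittle hydropathy values per residue.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def random_sequence(rng, min_len=50, max_len=300):
    """Sequence generation: a random chain of 50 to 300 residues."""
    length = rng.randint(min_len, max_len)  # both bounds inclusive
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))

def mean_hydrophobicity(sequence):
    """Property calculation: average Kyte-Doolittle value over the chain."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)

def make_record(rng, index):
    """Build one row; the class label is drawn at random, as described."""
    sequence = random_sequence(rng)
    return {
        "ID_Protein": f"P{index:05d}",  # hypothetical ID format
        "Sequence": sequence,
        "Sequence_Length": len(sequence),
        "Hydrophobicity": round(mean_hydrophobicity(sequence), 3),
        "Class": rng.choice(CLASSES),
    }

rng = random.Random(0)  # fixed seed so the sketch is reproducible
records = [make_record(rng, i) for i in range(5)]
for record in records:
    print(record["ID_Protein"], record["Sequence_Length"], record["Class"])
```

Because the label is drawn independently of the sequence in this sketch, no feature carries information about the class.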

Limitations

The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
The functional classes are simulated and do not correspond to actual biological characteristics.

Data Split

The dataset is divided into two subsets:
- Training: 16,000 samples (proteinas_train.csv).
- Testing: 4,000 samples (proteinas_test.csv).
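
The 16,000 / 4,000 division corresponds to an 80/20 split of the 20,000 proteins. The description does not say how rows were assigned to each file, so the shuffled split below is only one plausible way to reproduce the sizes; the function name split_ids and the fixed seed are assumptions for illustration.

```python
import random

def split_ids(ids, test_fraction=0.2, seed=0):
    """Shuffle the IDs and hold out a fraction of them for testing."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# 20,000 row indices stand in for the protein records.
train_ids, test_ids = split_ids(range(20000))
print(len(train_ids), len(test_ids))
```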

Acknowledgment

This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
