83 datasets found

Bioinformatics data for paper
catalog.data.gov
Updated Nov 12, 2020
Cite
U.S. EPA Office of Research and Development (ORD) (2020). Bioinformatics data for paper [Dataset]. https://catalog.data.gov/dataset/bioinformatics-data-for-paper
Dataset updated
Nov 12, 2020
Dataset provided by
United States Environmental Protection Agency (http://www.epa.gov/)
Description
Data for sequence comparison of comammox genomes and genes identified. This dataset is associated with the following publication: Camejo, P., J. Santodomingo, K. McMahon, and D. Noguera. Genome-enabled insights into the ecophysiology of the comammox bacterium Ca. Nitrospira nitrosa. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 2(5): 1-16, (2017).
Bioinformatics for Researchers in Life Sciences: Tools and Learning...
data.iadb.org
csv, pdf
Updated Apr 10, 2025
Cite
IDB Datasets (2025). Bioinformatics for Researchers in Life Sciences: Tools and Learning Resources [Dataset]. http://doi.org/10.60966/kwvb-wr19
Available download formats: csv (355108), pdf (2989058), csv (276253)
Unique identifier
https://doi.org/10.60966/kwvb-wr19
Dataset updated
Apr 10, 2025
Dataset provided by
IDB Datasets
License
Attribution-NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0), https://creativecommons.org/licenses/by-nc-nd/3.0/
License information was derived automatically
Time period covered
Jan 1, 2020 - Jan 1, 2021
Description
The COVID-19 pandemic has shown that bioinformatics--a multidisciplinary field that combines biological knowledge with computer programming concerned with the acquisition, storage, analysis, and dissemination of biological data--has a fundamental role in scientific research strategies in all disciplines involved in fighting the virus and its variants. It aids in sequencing and annotating genomes and their observed mutations; analyzing gene and protein expression; simulation and modeling of DNA, RNA, proteins and biomolecular interactions; and mining of biological literature, among many other critical areas of research. Studies suggest that bioinformatics skills in the Latin American and Caribbean region are relatively incipient, and thus its scientific systems cannot take full advantage of the increasing availability of bioinformatic tools and data. This dataset is a catalog of bioinformatics software for researchers and professionals working in life sciences. It includes more than 300 different tools for varied uses, such as data analysis, visualization, repositories and databases, data storage services, scientific communication, marketplace and collaboration, and lab resource management. Most tools are available as web-based or desktop applications, while others are programming libraries. It also includes 10 suggested entries for other third-party repositories that could be of use.
Data from: Transcriptomic and bioinformatics analysis of the early...
catalog.data.gov
agdatacommons.nal.usda.gov
+1 more
Updated Apr 21, 2025
Cite
Agricultural Research Service (2025). Data from: Transcriptomic and bioinformatics analysis of the early time-course of the response to prostaglandin F2 alpha in the bovine corpus luteum [Dataset]. https://catalog.data.gov/dataset/data-from-transcriptomic-and-bioinformatics-analysis-of-the-early-time-course-of-the-respo-cd938
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Service
Description
RNA expression analysis was performed on the corpus luteum tissue at five time points after prostaglandin F2 alpha treatment of midcycle cows using an Affymetrix Bovine Gene v1 Array. The normalized linear microarray data was uploaded to the NCBI GEO repository (GSE94069). Subsequent statistical analysis determined differentially expressed transcripts ± 1.5-fold change from saline control with P ≤ 0.05. Gene ontology of differentially expressed transcripts was annotated by DAVID and Panther. Physiological characteristics of the study animals are presented in a figure. Bioinformatic analysis by Ingenuity Pathway Analysis was curated, compiled, and presented in tables. A dataset comparison with similar microarray analyses was performed and bioinformatics analysis by Ingenuity Pathway Analysis, DAVID, Panther, and String of differentially expressed genes from each dataset as well as the differentially expressed genes common to all three datasets were curated, compiled, and presented in tables. Finally, a table comparing four bioinformatics tools' predictions of functions associated with genes common to all three datasets is presented. These data have been further analyzed and interpreted in the companion article "Early transcriptome responses of the bovine mid-cycle corpus luteum to prostaglandin F2 alpha includes cytokine signaling".
Resources in this dataset:
Resource Title: Supporting information as Excel spreadsheets and tables. File Name: Web Page, url: http://www.sciencedirect.com/science/article/pii/S2352340917304031?via=ihub#s0070
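The selection rule described above (at least a 1.5-fold change from saline control with P ≤ 0.05) can be sketched as follows. This is only an illustration, not the authors' pipeline: it assumes the fold change is expressed as a treated/control ratio, so a 1.5-fold decrease corresponds to a ratio of 1/1.5, and the transcript names and values are invented.

```python
def is_differentially_expressed(ratio, p_value, fold_cutoff=1.5, p_cutoff=0.05):
    """Return True if the treated/control ratio changes by at least
    fold_cutoff in either direction and the p-value is within p_cutoff."""
    if ratio <= 0:
        raise ValueError("expression ratio must be positive")
    changed = ratio >= fold_cutoff or ratio <= 1.0 / fold_cutoff
    return changed and p_value <= p_cutoff


# Hypothetical transcripts: (name, treated/saline ratio, p-value)
transcripts = [
    ("T1", 2.10, 0.010),  # increased and significant
    ("T2", 0.50, 0.030),  # decreased and significant
    ("T3", 1.20, 0.001),  # significant, but below the fold cutoff
    ("T4", 3.00, 0.200),  # large change, but not significant
]

selected = [name for name, ratio, p in transcripts
            if is_differentially_expressed(ratio, p)]
print(selected)
```

If the source data report signed fold changes or log2 ratios instead, the comparison inside the function would need to be adapted accordingly.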
Data from: Advancing computational biology and bioinformatics research...
datasetcatalog.nlm.nih.gov
figshare.com
Updated Sep 27, 2019
Cite
Jonchhe, Anup; Su, Andrew I.; Natoli, Ted; Macaluso, N. J. Maximilian; Briney, Bryan; Blasco, Andrea; Narayan, Rajiv; Lakhani, Karim R.; Paik, Jin H.; Endres, Michael G.; Sergeev, Rinat A.; Wu, Chunlei; Subramanian, Aravind (2019). Advancing computational biology and bioinformatics research through open innovation competitions [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000064443
Dataset updated
Sep 27, 2019
Authors
Jonchhe, Anup; Su, Andrew I.; Natoli, Ted; Macaluso, N. J. Maximilian; Briney, Bryan; Blasco, Andrea; Narayan, Rajiv; Lakhani, Karim R.; Paik, Jin H.; Endres, Michael G.; Sergeev, Rinat A.; Wu, Chunlei; Subramanian, Aravind
Description
Open data science and algorithm development competitions offer a unique avenue for rapid discovery of better computational strategies. We highlight three examples in computational biology and bioinformatics research in which the use of competitions has yielded significant performance gains over established algorithms. These include algorithms for antibody clustering, imputing gene expression data, and querying the Connectivity Map (CMap). Performance gains are evaluated quantitatively using realistic, albeit sanitized, data sets. The solutions produced through these competitions are then examined with respect to their utility and the prospects for implementation in the field. We present the decision process and competition design considerations that lead to these successful outcomes as a model for researchers who want to use competitions and non-domain crowds as collaborators to further their research.
Gene Expression Analysis and Disease Relationship
kaggle.com
zip
Updated Aug 4, 2025
Cite
asel (2025). Gene Expression Analysis and Disease Relationship [Dataset]. https://www.kaggle.com/datasets/ylmzasel/gene-expression-analysis-and-disease-relationship/code
Available download formats: zip (8740 bytes)
Dataset updated
Aug 4, 2025
Authors
asel
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset comprises 1000 hypothetical patient or sample entries, each detailing gene expression profiles and relevant clinical characteristics. It includes a mix of numerical and categorical data types, allowing for the application of diverse machine learning and statistical analysis methods.

Column Descriptions:
PatientID (Categorical/Numerical): A unique identification number assigned to each patient.
Age (Numerical): The patient's age. Can be used to investigate potential correlations between age and gene expression profiles.
Gender (Categorical): The patient's gender (0: Female, 1: Male). Effects of gender on gene expression or disease status can be analyzed.
Gene_X_Expression (Numerical): The relative expression level of a specific gene, "Gene X". This represents a hypothetical gene that might play a role in disease progression or treatment response.
Gene_Y_Expression (Numerical): The relative expression level of another specific gene, "Gene Y". Can be studied in conjunction with or independently of Gene X.
SmokingStatus (Categorical): The patient's smoking status (0: Non-smoker, 1: Ex-smoker, 2: Current smoker). Environmental factors' impact on gene expression and disease can be assessed.
DiseaseStatus (Categorical): The patient's status for the target disease (0: Healthy, 1: Disease A, 2: Disease B). This can serve as the primary target variable for your predictive models.
TreatmentResponse (Categorical/Numerical): The degree of response to applied treatment (0: No Response, 1: Partial Response, 2: Full Response). The role of gene expression profiles in predicting treatment success can be explored.

Use Cases and Potential Projects
This dataset serves as an excellent starting point for students, researchers, and enthusiasts in bioinformatics, computational biology, data science, and machine learning, enabling various projects such as:
Disease Diagnosis/Classification: Building models to predict DiseaseStatus using gene expression levels and other clinical factors.
Treatment Response Prediction: Forecasting how patients with specific gene expression profiles might respond to treatment (TreatmentResponse).
Biomarker Discovery: Identifying gene expression levels (e.g., Gene_X_Expression, Gene_Y_Expression) that show strong correlations with disease or treatment response.
Feature Engineering and Selection: Evaluating the importance of various features in the dataset and creating new ones to enhance model performance.
Data Visualization: Generating visualizations to explore relationships between gene expression data and demographic/clinical factors.
Regression and Correlation Analyses: Quantitatively examining the effects of factors like age and smoking status on gene expression levels.

Why Use This Dataset?
Privacy Secure: Being entirely synthetic, it carries no privacy or ethical concerns associated with real patient data.
Diversity: The mix of both numerical and categorical variables offers a rich ground for experimenting with different analytical techniques.
Predictive Potential: Clear target variables like DiseaseStatus and TreatmentResponse make it ideal for developing classification and regression models.
Educational and Learning: Perfect for applying fundamental data science and machine learning concepts for anyone interested in the bioinformatics domain.
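As a rough illustration of the kind of exploratory analysis this description suggests, the sketch below groups Gene_X_Expression by DiseaseStatus using only the Python standard library. The column names follow the description; the four sample rows and the helper name `mean_expression_by_status` are invented for the example, and the actual file name and layout in the Kaggle archive are not confirmed here.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical rows using the column names from the description; values are invented.
SAMPLE = """PatientID,Age,Gender,Gene_X_Expression,Gene_Y_Expression,SmokingStatus,DiseaseStatus,TreatmentResponse
1,54,0,2.0,1.0,0,0,2
2,61,1,4.0,1.5,2,1,1
3,47,1,6.0,0.5,1,1,0
4,39,0,3.0,2.5,0,2,2
"""


def mean_expression_by_status(text, gene_column="Gene_X_Expression"):
    """Average one expression column within each DiseaseStatus code."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        groups[int(row["DiseaseStatus"])].append(float(row[gene_column]))
    return {status: mean(values) for status, values in sorted(groups.items())}


result = mean_expression_by_status(SAMPLE)
print(result)
```

The same grouping could be done with pandas; the standard-library version is used here only to keep the sketch dependency-free.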
PARSING FASTA AND GENBANK FILES
kaggle.com
zip
Updated Nov 25, 2025
Cite
Dr. Nagendra (2025). PARSING FASTA AND GENBANK FILES [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/parsing-fasta-and-genbank-files
Available download formats: zip (17972831 bytes)
Dataset updated
Nov 25, 2025
Authors
Dr. Nagendra
License
MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset is a dedicated resource for learning how to parse core bioinformatics file formats. It contains representative samples of FASTA and GenBank files. The goal is to provide raw data for practicing essential data extraction skills. FASTA files contain sequence data, such as DNA, RNA, or protein, in a simple text format. GenBank files include detailed sequence annotations, features, and metadata. This is an ideal starting point for anyone learning Biopython or general sequence manipulation in genomics.
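To make the FASTA format concrete, here is a minimal parser sketch in plain Python. It assumes the simple layout described above (a header line starting with ">" followed by one or more sequence lines); the function name `parse_fasta` and the sample records are made up for illustration. For real work, Biopython's `Bio.SeqIO.parse(handle, "fasta")` and `Bio.SeqIO.parse(handle, "genbank")` cover both formats, including GenBank annotations that this sketch does not attempt.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may wrap across several lines
    if header is not None:
        yield header, "".join(chunks)


# Invented two-record example; a real file would be opened with open(path).
example = io.StringIO(">seq1 demo record\nACGT\nACGT\n>seq2\nGGCC\n")
records = list(parse_fasta(example))
for name, seq in records:
    print(name, len(seq))
```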
Data from: PseudoResistance DB: A new Database of antibiotics related to...
data.mendeley.com
Updated Nov 8, 2024
Cite
Caio Cheohen (2024). PseudoResistance DB: A new Database of antibiotics related to Pseudomonas aeruginosa antibiotic resistance [Dataset]. http://doi.org/10.17632/bxdn3p33z2.1
Unique identifier
https://doi.org/10.17632/bxdn3p33z2.1
Dataset updated
Nov 8, 2024
Authors
Caio Cheohen
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This research addresses the pressing issue of antibiotic resistance, a global health challenge that undermines the efficacy of treatments against infectious diseases. Focusing on Pseudomonas aeruginosa—a Gram-negative bacterium known for causing opportunistic infections—this study emphasizes its prioritization by the World Health Organization (WHO) as a critical-level pathogen requiring new therapeutic approaches.

To identify antibiotics associated with P. aeruginosa, the study employed text mining techniques on the Scielo database. The resulting dataset comprises 98 antibiotics, each documented with detailed textual information and referencing data. Additionally, the dataset includes structural files of the antibiotics in several formats suitable for computational modeling and simulations. These formats encompass Protein Data Bank, Partial Charge & Atom Type (PDBQT), Simplified Molecular Input Line Entry System (SMI), IUPAC International Chemical Identifier (INCHI), Molecular Design Limited Molfile (MOL2), Structure-Data File (SDF), Chemical Markup Language (CML), Cartesian Coordinates File (XYZ), Scalable Vector Graphics (SVG), Molecular File (MOL) and Protein Data Bank (PDB) files, with molecular models generated via OpenBabel to facilitate advanced studies in drug development and resistance mechanisms.
Cell_Gene_Expression_Metadata
kaggle.com
zip
Updated Sep 24, 2025
Cite
Kazi Aishikuzzaman (2025). Cell_Gene_Expression_Metadata [Dataset]. https://www.kaggle.com/datasets/kaziaishikuzzaman/cell-gene-expression-metadata
Available download formats: zip (845887409 bytes)
Dataset updated
Sep 24, 2025
Authors
Kazi Aishikuzzaman
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Overview
This dataset contains comprehensive metadata from single-cell gene expression studies, providing researchers with structured information about cellular phenotypes, experimental conditions, and sample characteristics. The data is particularly valuable for bioinformatics research, machine learning applications in genomics, and comparative studies across different cell types and conditions.

Dataset Description
The dataset comprises metadata associated with single-cell RNA sequencing (scRNA-seq) experiments, including:
Cell Type Information: Classification of different cell types and subtypes
Experimental Metadata: Details about experimental conditions, protocols, and methodologies
Sample Characteristics: Information about biological samples, including tissue origin, developmental stages, and treatment conditions
Quality Metrics: Data quality indicators and filtering parameters
Annotation Details: Standardized cell type annotations and biological classifications

Data Source and Licensing
This dataset is derived from publicly available single-cell gene expression data, potentially sourced from:
CELLxGENE Data Portal (https://cellxgene.cziscience.com/)
Gene Expression Omnibus (GEO)
European Bioinformatics Institute (EBI)
Other public genomics repositories

License: Creative Commons CC BY 4.0 (or specify the actual license)
✅ Commercial use allowed
✅ Modification allowed
✅ Distribution allowed
✅ Private use allowed
❗ Attribution required

Research Applications
Cell Type Discovery: Identify novel cell types and subtypes
Comparative Genomics: Study cellular differences across conditions, tissues, or species
Disease Research: Investigate cellular changes in disease states
Developmental Biology: Analyze cellular differentiation and development patterns

Machine Learning Applications
Classification Tasks: Predict cell types from gene expression data
Clustering Analysis: Discover cellular subpopulations and states
Dimensionality Reduction: Apply PCA, t-SNE, UMAP for visualization
Biomarker Discovery: Identify genes characteristic of specific cell types

Educational Use
Teaching bioinformatics and computational biology concepts.
Demonstrating single-cell analysis workflows.
Training in data preprocessing and quality control.

Data Quality and Preprocessing
Quality Control: Metadata has been curated and standardized
Missing Values: [Specify how missing values are handled]
Standardization: Cell type annotations follow established ontologies (e.g., Cell Ontology)
Validation: Data has been cross-referenced with original publications

Usage Guidelines

Getting Started
Load the metadata files using pandas or your preferred data analysis tool.
Explore the cell type distributions and experimental conditions.
Filter data based on quality metrics as needed.
Join with corresponding gene expression data for comprehensive analysis.

Best Practices
Always cite original data sources and publications.
Consider batch effects when combining data from different experiments.
Validate findings with independent datasets when possible.
Follow established bioinformatics workflows for single-cell analysis.

Citation and Acknowledgments
If you use this dataset in your research, please:
Cite this dataset: [Kazi Aishikuzzaman]. (2024). Cell Gene Expression Metadata. Kaggle. https://www.kaggle.com/datasets/kaziaishikuzzaman/cell-gene-expression-metadata

File Structure
dataset
─ metadata_summary.csv # Main metadata file
─ cell_type_annotations.csv # Detailed cell type information
─ experimental_conditions.csv # Experiment-specific metadata
─ quality_metrics.csv # Data quality indicators
─ README.txt # Detailed file descriptions

Technical Specifications
File Encoding: UTF-8
Separator: Comma-separated values (CSV)
Missing Values: Represented as 'NA' or empty cells
Data Types: Mixed (categorical, numerical, text)
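Given the technical specifications above (UTF-8, comma-separated, missing values written as 'NA' or left empty), a reader might normalize missing values on load roughly as follows. This is a sketch under assumptions: the column names `cell_id`, `cell_type`, and `tissue` are hypothetical placeholders, since the description does not list the actual columns, and the in-memory sample stands in for a file such as metadata_summary.csv.

```python
import csv
import io

# Markers treated as missing, per the stated convention ('NA' or empty cell).
MISSING = {"", "NA"}


def read_metadata(handle):
    """Read CSV rows as dicts, mapping 'NA' and empty cells to None."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            key: (None if value is None or value.strip() in MISSING else value.strip())
            for key, value in row.items()
        })
    return rows


# Hypothetical sample; for a real file use open(path, encoding="utf-8", newline="").
sample = io.StringIO("cell_id,cell_type,tissue\nc1,T cell,blood\nc2,NA,lung\nc3,B cell,\n")
rows = read_metadata(sample)

# Keep only rows with no missing fields.
complete = [r for r in rows if all(v is not None for v in r.values())]
print(len(rows), len(complete))
```

With pandas, `read_csv` treats both 'NA' and empty fields as missing by default, so an explicit normalization step like this is mainly useful when working without it.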

Contact and Support
For questions about this dataset:
Kaggle Profile: @kaziaishikuzzaman
Dataset Issues: Use Kaggle's discussion section
Collaboration: Open to research collaborations and improvements

Version History
v1.0: Initial release with comprehensive metadata collection
[Future versions]: Updates and additional annotations as available

Related Datasets
Consider exploring these complementary datasets:
Single-cell gene expression data (companion to this metadata)
Cell atlas datasets from major consortiums
Disease-specific single-cell studies
Multi-omics datasets with matching cell types

Keywords: single-cell, RNA-seq, genomics, cell types, metadata, bioinformatics, machine learning, computational biology
Category: Biology > Genomics
Bioinformatics Market Research Report 2033
researchintelo.com
csv, pdf, pptx
Updated Jul 24, 2025
Cite
Research Intelo (2025). Bioinformatics Market Research Report 2033 [Dataset]. https://researchintelo.com/report/bioinformatics-market
Available download formats: pdf, csv, pptx
Dataset updated
Jul 24, 2025
Dataset authored and provided by
Research Intelo
License
https://researchintelo.com/privacy-and-policy
Time period covered
2024 - 2033
Area covered
Global
Description
Bioinformatics Market Outlook

According to our latest research, the global bioinformatics market size reached USD 16.2 billion in 2024, exhibiting robust expansion driven by growing demand across various life science applications. The market is anticipated to maintain a strong momentum, registering a CAGR of 12.6% during the forecast period, and is projected to achieve a value of USD 47.3 billion by 2033. This significant growth is primarily fueled by advancements in genomics and proteomics, the proliferation of high-throughput sequencing technologies, and the rising integration of artificial intelligence and machine learning in biological data analysis. As per our latest research, the increasing need for efficient data management and analysis in drug discovery, personalized medicine, and agricultural biotechnology continues to propel the global bioinformatics market forward.
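The figures quoted above can be sanity-checked with the standard compound annual growth rate relation, end = start × (1 + rate)^years. The sketch below assumes nine compounding periods (2024 to 2033); the report does not state its exact base year or compounding convention, so this is an approximation and not the publisher's method.

```python
def project(value, rate, years):
    """Compound a starting value forward at a constant annual rate."""
    return value * (1.0 + rate) ** years


def implied_cagr(start, end, years):
    """Constant annual growth rate that takes start to end over the given years."""
    return (end / start) ** (1.0 / years) - 1.0


years = 2033 - 2024  # assumed number of compounding periods
projected = project(16.2, 0.126, years)     # USD billion, from the stated CAGR
implied = implied_cagr(16.2, 47.3, years)   # rate implied by the two stated sizes

print(f"projected 2033 size: {projected:.1f} billion USD")
print(f"implied CAGR: {implied:.2%}")
```

Under these assumptions the projection comes out at roughly USD 47.1 billion and the implied rate at roughly 12.6%, so the stated figures are mutually consistent to within rounding.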

One of the core growth drivers for the bioinformatics market is the exponential rise in biological data generation, particularly from next-generation sequencing (NGS) platforms. As sequencing costs have plummeted and throughput has soared, researchers and organizations across academia, healthcare, and agriculture are generating vast amounts of genomic, proteomic, and metabolomic data. This deluge of information necessitates robust bioinformatics tools and platforms for storage, retrieval, analysis, and interpretation. The capability to translate raw biological data into actionable insights for disease research, crop improvement, and environmental monitoring has made bioinformatics indispensable. Furthermore, collaborations between biotechnology companies, academic institutions, and IT firms are fostering innovation in software and algorithm development, amplifying the market’s growth trajectory.

Another significant growth factor is the integration of artificial intelligence (AI) and machine learning (ML) within bioinformatics platforms. AI-driven analytics are revolutionizing the way researchers interpret complex biological datasets, enabling more accurate predictions in genomics, drug discovery, and personalized medicine. The ability of ML algorithms to identify patterns, predict molecular interactions, and automate data processing is enhancing the efficiency and reliability of bioinformatics workflows. Moreover, the increasing adoption of cloud-based bioinformatics solutions is democratizing access to powerful computational resources, allowing small and medium enterprises (SMEs) and academic labs to leverage advanced analytics without heavy infrastructure investments. These technological advancements are expected to further accelerate market expansion over the coming years.

The growing focus on personalized medicine and precision healthcare is also catalyzing the demand for bioinformatics. Healthcare providers and pharmaceutical companies are increasingly utilizing bioinformatics tools to tailor treatments based on individual genetic profiles, leading to improved patient outcomes and reduced adverse effects. In drug discovery, bioinformatics accelerates target identification, biomarker discovery, and candidate screening, shortening development timelines and reducing costs. Furthermore, bioinformatics is playing a pivotal role in agricultural biotechnology, helping researchers develop genetically modified crops with enhanced traits, improved yield, and resistance to diseases. The convergence of these diverse applications underscores the strategic importance of bioinformatics across multiple sectors.

From a regional perspective, North America continues to lead the global bioinformatics market, supported by a well-established biotechnology industry, significant R&D investments, and favorable government initiatives. The United States, in particular, is home to several leading bioinformatics companies and research institutions, driving innovation and adoption. Europe follows closely, with strong contributions from countries like Germany, the UK, and France, where collaborative research projects and public-private partnerships are prevalent. Meanwhile, the Asia Pacific region is witnessing the fastest growth, propelled by expanding genomics research, increasing healthcare expenditures, and a surge in government funding for life science initiatives, particularly in China, India, and Japan.

Product & Service Analysis

The product & service segment of the bioinformatics market is broadly categorized into software, hardware, and
DataSheet1_Bioinformatic Teaching Resources – For Educators, by Educators –...
frontiersin.figshare.com
figshare.com
docx
Updated Jun 6, 2023
+ more versions
Cite
Ellen G. Dow; Elisha M. Wood-Charlson; Steven J. Biller; Timothy Paustian; Aaron Schirmer; Cody S. Sheik; Jason M. Whitham; Rose Krebs; Carlos C. Goller; Benjamin Allen; Zachary Crockett; Adam P. Arkin (2023). DataSheet1_Bioinformatic Teaching Resources – For Educators, by Educators – Using KBase, a Free, User-Friendly, Open Source Platform.DOCX [Dataset]. http://doi.org/10.3389/feduc.2021.711535.s001
Available download formats: docx
Unique identifier
https://doi.org/10.3389/feduc.2021.711535.s001
Dataset updated
Jun 6, 2023
Dataset provided by
Frontiers
Authors
Ellen G. Dow; Elisha M. Wood-Charlson; Steven J. Biller; Timothy Paustian; Aaron Schirmer; Cody S. Sheik; Jason M. Whitham; Rose Krebs; Carlos C. Goller; Benjamin Allen; Zachary Crockett; Adam P. Arkin
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Over the past year, biology educators and staff at the U.S. Department of Energy Systems Biology Knowledgebase (KBase) initiated a collaborative effort to develop a curriculum for bioinformatics education. KBase is a free web-based platform where anyone can conduct sophisticated and reproducible bioinformatic analyses via a graphical user interface. Here, we demonstrate the utility of KBase as a platform for bioinformatics education, and present a set of modular, adaptable, and customizable instructional units for teaching concepts in Genomics, Metagenomics, Pangenomics, and Phylogenetics. Each module contains teaching resources, publicly available data, analysis tools, and Markdown capability, enabling instructors to modify the lesson as appropriate for their specific course. We present initial student survey data on the effectiveness of using KBase for teaching bioinformatic concepts, provide an example case study, and detail the utility of the platform from an instructor’s perspective. Even as in-person teaching returns, KBase will continue to work with instructors, supporting the development of new active learning curriculum modules. For anyone utilizing the platform, the growing KBase Educators Organization provides an educators network, accompanied by community-sourced guidelines, instructional templates, and peer support, for instructors wishing to use KBase within a classroom at any educational level–whether virtual or in-person.
2025 Green Card Report for Bioinformatics, Biotechnology, Computer Science
myvisajobs.com
Updated Jan 16, 2025
+ more versions
Cite
MyVisaJobs (2025). 2025 Green Card Report for Bioinformatics, Biotechnology, Computer Science [Dataset]. https://www.myvisajobs.com/reports/green-card/major/bioinformatics,-biotechnology,-computer-science/
Dataset updated
Jan 16, 2025
Dataset authored and provided by
MyVisaJobs
License
https://www.myvisajobs.com/terms-of-service/
Variables measured
Major, Salary, Petitions Filed
Description
A dataset that explores Green Card sponsorship trends, salary data, and employer insights for bioinformatics, biotechnology, computer science in the U.S.
Two-step mixed model approach to analyzing differential alternative RNA...
datadryad.org
zip
Updated Sep 28, 2020
Cite
Li Luo; Huining Kang; Xichen Li; Scott Ness; Christine Stidley (2020). Two-step mixed model approach to analyzing differential alternative RNA splicing: Datasets and R scripts for analysis of alternative splicing [Dataset]. http://doi.org/10.5061/dryad.66t1g1k0h
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.66t1g1k0h
Dataset updated
Sep 28, 2020
Dataset provided by
Dryad
Authors
Li Luo; Huining Kang; Xichen Li; Scott Ness; Christine Stidley
Time period covered
Sep 26, 2020
Description
The dataset was collected through whole-transcriptome RNA-Sequencing technologies. The processing method was described in the manuscript.
Drosophila Melanogaster Genome
kaggle.com
ieee-dataport.org
zip
Updated Nov 17, 2019
Cite
Myles O'Neill (2019). Drosophila Melanogaster Genome [Dataset]. https://www.kaggle.com/mylesoneill/drosophila-melanogaster-genome
Available download formats: zip (136202106 bytes)
Dataset updated
Nov 17, 2019
Authors
Myles O'Neill
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Drosophila Melanogaster

Drosophila Melanogaster, the common fruit fly, is a model organism which has been extensively used in entomological research. It is one of the most studied organisms in biological research, particularly in genetics and developmental biology.

When it's not being used for scientific research, D. melanogaster is a common pest in homes, restaurants, and anywhere else that serves food. They are not to be confused with Tephritidae flies (also known as fruit flies).

https://en.wikipedia.org/wiki/Drosophila_melanogaster

About the Genome

This genome was first sequenced in 2000. It contains four pairs of chromosomes (2,3,4 and X/Y). More than 60% of the genome appears to be functional non-protein-coding DNA.

![D. melanogaster chromosomes][1]

The genome is maintained and frequently updated at [FlyBase][2]. This dataset is sourced from the UCSC Genome Bioinformatics download page. It uses the August 2014 version of the D. melanogaster genome (dm6, BDGP Release 6 + ISO1 MT). http://hgdownload.soe.ucsc.edu/downloads.html#fruitfly

Files were modified by Kaggle to be a better fit for analysis on Scripts. This primarily involved turning files into CSV format, with a header row, as well as converting the genome itself from 2bit format into a FASTA sequence file.

Bioinformatics

Genomic analysis can be daunting to data scientists who haven't had much experience with bioinformatics before. We have tried to give basic explanations of each of the files in this dataset, as well as links to further reading on the biological basis for each. If you haven't had the chance to study much biology before, some light reading (i.e., Wikipedia) on the following topics may be helpful to understand the nuances of the data provided here: [Genetics][3], [Genomics][4], [Chromosomes][7], [DNA][8], [RNA][9], [Genes][12], [Alleles][13], [Exons][14], [Introns][15], [Transcription][16], [Translation][17], [Peptides][18], [Proteins][19], [Gene Regulation][20], [Mutation][21], [Phylogenetics][22], and [SNPs][23].

Of course, if you've got some idea of the basics already - don't be afraid to jump right in!

Learning Bioinformatics

There are a lot of great resources for learning bioinformatics on the web. One cool site is [Rosalind][24] - a platform that gives you bioinformatic coding challenges to complete. You can use Kaggle Scripts on this dataset to easily complete the challenges on Rosalind (and see [Myles' solutions here][25] if you get stuck). We have set up [Biopython][26] on Kaggle's docker image which is a great library to help you with your analyses. Check out their [tutorial here][27] and we've also created [a python notebook with some of the tutorial applied to this dataset][28] as a reference.

Files in this Dataset

Drosophila Melanogaster Genome

genome.fa

The assembled genome itself is presented here in [FASTA format][29]. Each chromosome is a different sequence of nucleotides. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case.
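Because repeats are marked by letter case, the repeat content of a sequence can be estimated by counting lowercase bases. The sketch below is an assumption-laden illustration: the function name `softmasked_fraction` and the short test string are invented, and ambiguous bases such as N are simply counted by their case like any other letter.

```python
def softmasked_fraction(sequence):
    """Fraction of bases written in lower case (repeat-masked) in a sequence."""
    letters = [base for base in sequence if base.isalpha()]
    if not letters:
        return 0.0
    masked = sum(1 for base in letters if base.islower())
    return masked / len(letters)


# Invented 12-base example: 6 upper-case and 6 lower-case bases.
frac = softmasked_fraction("ACGTacgtacGT")
print(frac)
```

For a whole chromosome from genome.fa, the same function could be applied to each sequence after reading the file with a FASTA parser.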
Meta Information

There are 3 additional files with meta information about the genome.

meta-cpg-island-ext-unmasked.csv

This file contains descriptive information about CpG Islands in the genome.

https://en.wikipedia.org/wiki/CpG_site

meta-cytoband.csv

This file describes the positions of cytogenetic bands on each chromosome.

https://en.wikipedia.org/wiki/Cytogenetics

meta-simple-repeat.csv

This file describes simple tandem repeats in the genome.

https://en.wikipedia.org/wiki/Repeated_sequence_(DNA) https://en.wikipedia.org/wiki/Tandem_repeat
Drosophila Melanogaster mRNA Sequences

Messenger RNA (mRNA) is an intermediate molecule created as part of the cellular process of converting genomic information into proteins. Some mRNA are never translated into proteins and have functional roles in the cell on their own. Collectively, an organism's mRNA information is known as its transcriptome. The mRNA files included in this dataset give insight into the activity of genes in the organism.

https://en.wikipedia.org/wiki/Messenger_RNA

mrna-genbank.fa

This file includes all mRNA sequences from GenBank associated with Drosophila Melanogaster.

http://www.ncbi.nlm.nih.gov/genbank/

mrna-refseq.fa

This file includes all mRNA sequences from RefSeq associated with Drosophila Melanogaster.

http://www.ncbi.nlm.nih.gov/refseq/
Gene Predictions

A gene is a segment of DNA on the genome which, through mRNA, is used to create proteins in the organism. Knowing which parts of DNA are coding (genes) or non-coding is difficult, and a number of different systems for prediction exist. This da...
The GitHub repository for an integrative analysis of genomic plasticity in...
zenodo.org
zip
Updated Jan 24, 2020
Cite
Rayna Harris; Hsin-Yi Kao; Juan Marcos Alarcon; Hans Hofmann; Andre Fenton (2020). The GitHub repository for an integrative analysis of genomic plasticity in the hippocampus [Dataset]. http://doi.org/10.5281/zenodo.810407
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.810407
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Rayna Harris; Hsin-Yi Kao; Juan Marcos Alarcon; Hans Hofmann; Andre Fenton
Description
Cost-effective next-generation sequencing has made unbiased gene expression investigations possible. Gene expression studies at the level of single neurons may be especially important for understanding nervous system structure and function because of neuron-specific functionality and plasticity. While cellular dissociation is a prerequisite technical manipulation for such single-cell studies, the extent to which the process of dissociating cells affects neural gene expression has not been determined. Here, we examine the effect of cellular dissociation on gene expression in the mouse hippocampus. We also determine to what extent such changes might confound studies on the behavioral and physiological functions of the hippocampus.

This dataset contains the data, software, and results that accompany a manuscript that is in the process of submission to the journal Hippocampus.
SARS-sequence-data
kaggle.com
zip
Updated Mar 31, 2020
Cite
Scott Powers (2020). SARS-sequence-data [Dataset]. https://www.kaggle.com/datasets/spowers/sarssequencedata
Explore at:
Available download formats: zip (8071988 bytes)
Dataset updated
Mar 31, 2020
Authors
Scott Powers
Description
Context

SARS-CoV-2 is the causative agent of the current global pandemic. SARS-CoV-2, also called the novel coronavirus, is related to both SARS and bat SARS. Many datasets related to this epidemic exist on Kaggle; however, genomics data had yet to be added. NCBI is an open repository of biomedical data, including sequencing data from laboratories around the world. Many sequences have been collected for all three families of viruses mentioned; here, the data is presented in an easy-to-use format for data scientists. This dataset is a collection of those sequences, which will be updated periodically as new sequencing data is added.

Content

This dataset contains sequence data obtained from NCBI for various Coronaviridae. Of specific interest at this time are the causative agents of SARS and COVID-19 and the related family that causes bat SARS. The data specific to those three groups is contained within a CSV file, along with the full text description and NCBI accession number. Additional information about each can be obtained by searching NCBI for the specific accession number.

In addition to the CSV file, the dataset includes the original FASTA files for those sequence data, along with another for related coronaviruses.

Acknowledgements

These FASTA files were collected using a script maintained by the BioStars Handbook authors. The actual sequence data has been generated by various research and clinical groups around the world dealing with infectious diseases.

Inspiration

The BioStars Handbook nCov Analysis text is a great starting point to look at these data from a general bioinformatics perspective. Of further interest is how we can look beyond those methods and incorporate general data science techniques to gain more insight into these agents.

Sequence similarity is a good place to start to understand the evolutionary history of these organisms. This is well studied in the literature, but it can still be a useful starting point.

For features, I would recommend looking into k-mer counts as well as one-hot encoding the sequences. To one-hot encode them, the sequences might need to be padded to a common length; the classic placeholder in bioinformatics is the character N.
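The two feature ideas above can be sketched as follows. This is a minimal illustration in plain Python, not the dataset author's code: the helper names (`kmer_counts`, `pad_sequences`, `one_hot`) are hypothetical, and encoding the padding character N (or any other non-ACGT symbol) as an all-zero row is an assumption made here for simplicity.

```python
from collections import Counter

ALPHABET = "ACGT"  # assumed nucleotide alphabet for the one-hot columns


def kmer_counts(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def pad_sequences(sequences, pad="N"):
    """Right-pad all sequences with `pad` to the length of the longest one."""
    width = max((len(s) for s in sequences), default=0)
    return [s.upper().ljust(width, pad) for s in sequences]


def one_hot(sequence):
    """Encode each base as a 4-element row; N and other symbols become all zeros."""
    return [[1 if base == letter else 0 for letter in ALPHABET]
            for base in sequence.upper()]


# Toy sequences (hypothetical), not taken from the dataset.
padded = pad_sequences(["ACG", "A"])
encoded = [one_hot(seq) for seq in padded]
counts = kmer_counts("ACGTAC", 2)
```

After padding, every encoded sequence has the same number of rows, so the results can be stacked into a single array for a model. K-mer counts do not need padding, since the count vector has a fixed size for a given k.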
Pneumonia Drug Exp Data
data.mendeley.com
Updated Sep 29, 2023
+ more versions
Cite
OCHIN SHARMA (2023). Pneumonia Drug Exp Data [Dataset]. http://doi.org/10.17632/8bmpx4zvs8.1
Explore at:
Unique identifier
https://doi.org/10.17632/8bmpx4zvs8.1
Dataset updated
Sep 29, 2023
Authors
OCHIN SHARMA
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is the result of experiments conducted using Python and the RDKit library.
Data from: BBGD: an online database for blueberry genomic data
agdatacommons.nal.usda.gov
s.cnmilf.com
+2more
xls
Updated Nov 22, 2025
Cite
Nadim W. Alkharouf; Anik L. Dhanaraj; Dhananjay Naik; Christopher Overall; Benjamin F. Matthews; Lisa J. Rowland (2025). Data from: BBGD: an online database for blueberry genomic data [Dataset]. http://doi.org/10.15482/USDA.ADC/1173243
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.15482/USDA.ADC/1173243
Dataset updated
Nov 22, 2025
Dataset provided by
BMC Plant Biology
Authors
Nadim W. Alkharouf; Anik L. Dhanaraj; Dhananjay Naik; Christopher Overall; Benjamin F. Matthews; Lisa J. Rowland
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is supplemental to the article "BBGD: an online database for blueberry genomic data" (2007), which involves blueberry cold hardiness experiments. The dataset is titled "list of genes printed on microarray slides" (1471-2229-7-5-s1.xls, 663k) and lists all the genes that were printed on the microarray slides. By using the BBGD database, researchers developed EST-based markers for mapping and identified a number of "candidate" cold tolerance genes that are highly expressed in blueberry flower buds after exposure to low temperatures.

BBGD (http://bioinformatics.towson.edu/BBGD/) is a public online database developed for blueberry genomics. BBGD is both a sequence and gene expression database: it stores both EST and microarray data, and allows scientists to correlate expression profiles with gene function. Presently, the main focus of the database is the identification of genes in blueberry that are significantly induced or suppressed after low temperature exposure. Data was collected sometime between 2000 and 2007; exact dates are unknown.

Resources in this dataset:
Resource Title: List of genes printed on microarray slides, 1471-2229-7-5-s1.xls. File Name: 1471-2229-7-5-s1.xls
Resource Title: Data dictionary. File Name: BBGD-data-dictionary.csv. Resource Description: Defines fields for list of genes.
QIIME 2 Tutorial Data
registry.opendata.aws
Updated Jan 23, 2019
Cite
Caporaso Lab (2019). QIIME 2 Tutorial Data [Dataset]. https://registry.opendata.aws/qiime2/
Explore at:
Dataset updated
Jan 23, 2019
Dataset provided by
Caporaso Lab
Description
QIIME 2 (pronounced “chime two”) is a microbiome multi-omics bioinformatics and data science platform that is trusted, free, open source, extensible, and community developed and supported.
Cloud HPC For Bioinformatics Market Research Report 2033
dataintelo.com
csv, pdf, pptx
Updated Oct 1, 2025
Cite
Dataintelo (2025). Cloud HPC For Bioinformatics Market Research Report 2033 [Dataset]. https://dataintelo.com/report/cloud-hpc-for-bioinformatics-market
Explore at:
Available download formats: pdf, csv, pptx
Dataset updated
Oct 1, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Cloud HPC for Bioinformatics Market Outlook

According to our latest research, the global Cloud HPC for Bioinformatics market size was valued at USD 5.1 billion in 2024, with a robust growth rate reflected in a CAGR of 17.8% during the forecast period. Driven by the increasing adoption of high-throughput sequencing, expanding genomics research, and the surge in demand for scalable computing resources, the market is projected to reach USD 15.4 billion by 2033. This accelerated growth is primarily attributed to the convergence of cloud computing and high-performance computing (HPC) technologies, which are revolutionizing the bioinformatics landscape by enabling faster, more efficient data analysis and facilitating breakthroughs in life sciences.
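For readers who want to relate a quoted CAGR to the start and end values of a forecast, the compound-growth arithmetic can be sketched as below. This is a minimal illustration, not the report's methodology: the function names are hypothetical, and treating 2024 to 2033 as nine compounding periods is an assumption, since the report does not state its base year or compounding convention.

```python
def project_value(start_value, annual_rate, years):
    """Value after compounding `annual_rate` for `years` periods."""
    return start_value * (1.0 + annual_rate) ** years


def implied_cagr(start_value, end_value, years):
    """Constant annual rate that takes start_value to end_value in `years` periods."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures quoted in the report text above (USD billions). The number of
# compounding periods is an assumption made for this sketch.
START_2024, END_2033, QUOTED_CAGR, YEARS = 5.1, 15.4, 0.178, 9

projected = project_value(START_2024, QUOTED_CAGR, YEARS)
implied = implied_cagr(START_2024, END_2033, YEARS)
print(f"Projected 2033 value at the quoted CAGR: {projected:.1f}")
print(f"CAGR implied by the two endpoints: {implied:.1%}")
```

Comparing the two printed numbers is a quick way to see whether a quoted growth rate and quoted endpoints agree under a given choice of forecast window.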

The exponential growth in biological data, especially genomic and proteomic datasets, is a key driver for the Cloud HPC for Bioinformatics market. Next-generation sequencing (NGS) platforms and other advanced technologies generate terabytes of data per experiment, necessitating scalable and powerful computational resources. Cloud-based HPC solutions address this challenge by offering on-demand, elastic computing power, enabling researchers to process and analyze vast datasets without the need for heavy capital investment in local infrastructure. This democratization of computational resources has made advanced bioinformatics accessible to a broader spectrum of organizations, from startups to large pharmaceutical companies, thus significantly expanding the market’s user base.

Another crucial growth factor is the rising collaboration between academic institutions, research organizations, and commercial entities. The move towards open science and data sharing has increased the need for interoperable, secure, and high-speed computing environments. Cloud HPC platforms provide a collaborative space where multidisciplinary teams can work together on large-scale projects, share data securely, and accelerate discovery timelines. Moreover, the integration of artificial intelligence (AI) and machine learning (ML) algorithms into cloud-based bioinformatics workflows is enhancing the accuracy and speed of data interpretation, further fueling market expansion.

The shift in healthcare towards precision medicine is also bolstering the demand for Cloud HPC in bioinformatics. Personalized healthcare relies on the rapid analysis of individual genetic information, which requires substantial computational power. Cloud-based HPC solutions are enabling hospitals, clinics, and diagnostic labs to implement advanced bioinformatics applications without significant IT overheads. This trend is particularly pronounced in the pharmaceutical and biotechnology sectors, where high-speed analysis is critical for drug discovery and development. The growing emphasis on reducing time-to-market for new therapies and the need for cost-effective solutions are expected to sustain strong market growth through 2033.

Regionally, North America maintains its dominance in the Cloud HPC for Bioinformatics market, accounting for the largest revenue share in 2024. This leadership is driven by the presence of major cloud service providers, high R&D investment, and a mature bioinformatics ecosystem. Europe follows closely, benefiting from strong government support and collaborative research initiatives. The Asia Pacific region is emerging as the fastest-growing market, propelled by increasing investments in healthcare infrastructure, expanding genomics research, and rising adoption of cloud technologies. The Middle East & Africa and Latin America, while currently representing smaller shares, are expected to witness steady growth as digital transformation initiatives gain momentum.

Component Analysis

The Cloud HPC for Bioinformatics market by component is segmented into hardware, software, and services, each playing a vital role in enabling high-performance bioinformatics workflows. Hardware forms the backbone of cloud HPC infrastructure, encompassing servers, storage devices, and networking equipment that facilitate rapid data processing and storage. As bioinformatics applications demand ever-increasing computational power, cloud providers are investing in advanced hardware architectures, such as GPU-accelerated servers and high-speed interconnects, to meet the needs of genomics, proteomics, and molecular modeling. The ongoing evolution of hardware, including the adoption of ARM-based processors and specialized AI chips, is expected to further enhance the p
Prediction of Heart Attack
data.mendeley.com
Updated Aug 21, 2024
+ more versions
Cite
Rakin Sad Aftab (2024). Prediction of Heart Attack [Dataset]. http://doi.org/10.17632/yrwd336rkz.2
Explore at:
Unique identifier
https://doi.org/10.17632/yrwd336rkz.2
Dataset updated
Aug 21, 2024
Authors
Rakin Sad Aftab
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset consists of 1763 observations, each representing a unique patient, and 12 different attributes associated with heart disease. This dataset is a critical resource for researchers focusing on predictive analytics in cardiovascular diseases.

Variables Overview:
1. Age: A continuous variable indicating the age of the patient.
2. Sex: A categorical variable with two levels ('Male', 'Female'), indicating the gender of the patient.
3. CP (Chest Pain type): A categorical variable describing the type of chest pain experienced by the patient, with categories such as 'Asymptomatic', 'Atypical Angina', 'Typical Angina', and 'Non-Angina'.
4. TRTBPS (Resting Blood Pressure): A continuous variable indicating the resting blood pressure (in mm Hg) on admission to the hospital.
5. Chol (Serum Cholesterol): A continuous variable measuring the serum cholesterol in mg/dl.
6. FBS (Fasting Blood Sugar): A binary variable where 1 represents fasting blood sugar > 120 mg/dl, and 0 otherwise.
7. Rest ECG (Resting Electrocardiographic Results): Categorizes the resting electrocardiographic results of the patient into 'Normal', 'ST Elevation', and other categories.
8. Thalachh (Maximum Heart Rate Achieved): A continuous variable indicating the maximum heart rate achieved by the patient.
9. Exng (Exercise Induced Angina): A binary variable where 1 indicates the presence of exercise-induced angina, and 0 otherwise.
10. Oldpeak (ST Depression Induced by Exercise Relative to Rest): A continuous variable indicating the ST depression induced by exercise relative to rest.
11. Slope (Slope of the Peak Exercise ST Segment): A categorical variable with levels such as 'Flat', 'Up Sloping', representing the slope of the peak exercise ST segment.
12. Target: A binary target variable indicating the presence (1) or absence (0) of heart disease.

Descriptive Statistics: The patients' age ranges from 29 to 77 years, with a mean age of approximately 54 years. The resting blood pressure spans from 94 to 200 mm Hg, and the average cholesterol level is about 246 mg/dl. The maximum heart rate achieved varies widely among patients, from 71 to 202 beats per minute.

Importance for Research: This dataset provides a comprehensive view of various factors that could potentially be linked to heart disease, making it an invaluable resource for developing predictive models. By analyzing relationships and patterns within these variables, researchers can identify key predictors of heart disease and enhance the accuracy of diagnostic tools. This could lead to better preventive measures and treatment strategies, ultimately improving patient outcomes in cardiovascular health.
