81 datasets found

Bioinformatics data for paper
catalog.data.gov
Updated Nov 12, 2020
Cite
U.S. EPA Office of Research and Development (ORD) (2020). Bioinformatics data for paper [Dataset]. https://catalog.data.gov/dataset/bioinformatics-data-for-paper
Dataset updated
Nov 12, 2020
Dataset provided by
United States Environmental Protection Agency (http://www.epa.gov/)
Description
Data for sequence comparison of comammox genomes and genes identified. This dataset is associated with the following publication: Camejo, P., J. Santodomingo, K. McMahon, and D. Noguera. Genome-enabled insights into the ecophysiology of the comammox bacterium Ca. Nitrospira nitrosa. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 2(5): 1-16, (2017).
Bioinformatics for Researchers in Life Sciences: Tools and Learning...
data.iadb.org
csv, pdf
Updated Apr 10, 2025
Cite
IDB Datasets (2025). Bioinformatics for Researchers in Life Sciences: Tools and Learning Resources [Dataset]. http://doi.org/10.60966/kwvb-wr19
Available download formats: csv(355108), pdf(2989058), csv(276253)
Unique identifier
https://doi.org/10.60966/kwvb-wr19
Dataset updated
Apr 10, 2025
Dataset provided by
IDB Datasets
License
Attribution-NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0), https://creativecommons.org/licenses/by-nc-nd/3.0/
License information was derived automatically
Time period covered
Jan 1, 2020 - Jan 1, 2021
Description
The COVID-19 pandemic has shown that bioinformatics--a multidisciplinary field that combines biological knowledge with computer programming concerned with the acquisition, storage, analysis, and dissemination of biological data--has a fundamental role in scientific research strategies in all disciplines involved in fighting the virus and its variants. It aids in sequencing and annotating genomes and their observed mutations; analyzing gene and protein expression; simulation and modeling of DNA, RNA, proteins and biomolecular interactions; and mining of biological literature, among many other critical areas of research. Studies suggest that bioinformatics skills in the Latin American and Caribbean region are relatively incipient, and thus its scientific systems cannot take full advantage of the increasing availability of bioinformatic tools and data. This dataset is a catalog of bioinformatics software for researchers and professionals working in life sciences. It includes more than 300 different tools for varied uses, such as data analysis, visualization, repositories and databases, data storage services, scientific communication, marketplace and collaboration, and lab resource management. Most tools are available as web-based or desktop applications, while others are programming libraries. It also includes 10 suggested entries for other third-party repositories that could be of use.
Raw motif mapping bedfile data and model training set class probabilities
search.dataone.org
data.niaid.nih.gov
Updated May 6, 2025
Cite
Phillip Davis (2025). Raw motif mapping bedfile data and model training set class probabilities [Dataset]. http://doi.org/10.5061/dryad.tdz08kq3w
Unique identifier
https://doi.org/10.5061/dryad.tdz08kq3w
Dataset updated
May 6, 2025
Dataset provided by
Dryad Digital Repository
Authors
Phillip Davis
Time period covered
Jan 1, 2023
Description
Leveraging prior viral genome sequencing data to make predictions on whether an unknown, emergent virus harbors a ‘phenotype-of-concern’ has been a long-sought goal of genomic epidemiology. A predictive phenotype model built from nucleotide-level information alone is challenging with respect to RNA viruses due to the ultra-high intra-sequence variance of their genomes, even within closely related clades. We developed a degenerate k-mer method to accommodate this high intra-sequence variation of RNA virus genomes for modeling frameworks. By leveraging a taxonomy-guided ‘group-shuffle-split’ cross validation paradigm on complete coronavirus assemblies from prior to October 2018, we trained multiple regularized logistic regression classifiers at the nucleotide k-mer level. We demonstrate the feasibility of this method by finding models accurately predicting withheld SARS-CoV-2 genome sequences as human pathogens and accurately predicting withheld Swine Acute Diarrhea Syndrome coronavirus (...
Data from: Advancing computational biology and bioinformatics research...
datasetcatalog.nlm.nih.gov
figshare.com
Updated Sep 27, 2019
Cite
Jonchhe, Anup; Su, Andrew I.; Natoli, Ted; Macaluso, N. J. Maximilian; Briney, Bryan; Blasco, Andrea; Narayan, Rajiv; Lakhani, Karim R.; Paik, Jin H.; Endres, Michael G.; Sergeev, Rinat A.; Wu, Chunlei; Subramanian, Aravind (2019). Advancing computational biology and bioinformatics research through open innovation competitions [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000064443
Dataset updated
Sep 27, 2019
Authors
Jonchhe, Anup; Su, Andrew I.; Natoli, Ted; Macaluso, N. J. Maximilian; Briney, Bryan; Blasco, Andrea; Narayan, Rajiv; Lakhani, Karim R.; Paik, Jin H.; Endres, Michael G.; Sergeev, Rinat A.; Wu, Chunlei; Subramanian, Aravind
Description
Open data science and algorithm development competitions offer a unique avenue for rapid discovery of better computational strategies. We highlight three examples in computational biology and bioinformatics research in which the use of competitions has yielded significant performance gains over established algorithms. These include algorithms for antibody clustering, imputing gene expression data, and querying the Connectivity Map (CMap). Performance gains are evaluated quantitatively using realistic, albeit sanitized, data sets. The solutions produced through these competitions are then examined with respect to their utility and the prospects for implementation in the field. We present the decision process and competition design considerations that led to these successful outcomes as a model for researchers who want to use competitions and non-domain crowds as collaborators to further their research.
DataSheet1_Bioinformatic Teaching Resources – For Educators, by Educators –...
figshare.com
frontiersin.figshare.com
docx
Updated Jun 6, 2023
Cite
Ellen G. Dow; Elisha M. Wood-Charlson; Steven J. Biller; Timothy Paustian; Aaron Schirmer; Cody S. Sheik; Jason M. Whitham; Rose Krebs; Carlos C. Goller; Benjamin Allen; Zachary Crockett; Adam P. Arkin (2023). DataSheet1_Bioinformatic Teaching Resources – For Educators, by Educators – Using KBase, a Free, User-Friendly, Open Source Platform.DOCX [Dataset]. http://doi.org/10.3389/feduc.2021.711535.s001
Available download formats: docx
Unique identifier
https://doi.org/10.3389/feduc.2021.711535.s001
Dataset updated
Jun 6, 2023
Dataset provided by
Frontiers
Authors
Ellen G. Dow; Elisha M. Wood-Charlson; Steven J. Biller; Timothy Paustian; Aaron Schirmer; Cody S. Sheik; Jason M. Whitham; Rose Krebs; Carlos C. Goller; Benjamin Allen; Zachary Crockett; Adam P. Arkin
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Over the past year, biology educators and staff at the U.S. Department of Energy Systems Biology Knowledgebase (KBase) initiated a collaborative effort to develop a curriculum for bioinformatics education. KBase is a free web-based platform where anyone can conduct sophisticated and reproducible bioinformatic analyses via a graphical user interface. Here, we demonstrate the utility of KBase as a platform for bioinformatics education, and present a set of modular, adaptable, and customizable instructional units for teaching concepts in Genomics, Metagenomics, Pangenomics, and Phylogenetics. Each module contains teaching resources, publicly available data, analysis tools, and Markdown capability, enabling instructors to modify the lesson as appropriate for their specific course. We present initial student survey data on the effectiveness of using KBase for teaching bioinformatic concepts, provide an example case study, and detail the utility of the platform from an instructor’s perspective. Even as in-person teaching returns, KBase will continue to work with instructors, supporting the development of new active learning curriculum modules. For anyone utilizing the platform, the growing KBase Educators Organization provides an educators network, accompanied by community-sourced guidelines, instructional templates, and peer support, for instructors wishing to use KBase within a classroom at any educational level–whether virtual or in-person.
Biological Software Report
datainsightsmarket.com
doc, pdf, ppt
Updated Apr 21, 2025
Cite
Data Insights Market (2025). Biological Software Report [Dataset]. https://www.datainsightsmarket.com/reports/biological-software-1444091
Available download formats: ppt, pdf, doc
Dataset updated
Apr 21, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The global biological software market is experiencing robust growth, driven by the increasing adoption of advanced technologies in life sciences research and healthcare. The market, estimated at $2.5 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of approximately 12% from 2025 to 2033, reaching an estimated market value of $7 billion by 2033. This expansion is fueled by several key factors: the escalating demand for high-throughput data analysis in genomics and proteomics, the rising prevalence of chronic diseases necessitating advanced diagnostic tools, and the growing adoption of cloud-based solutions for enhanced collaboration and accessibility. Furthermore, the continuous development of sophisticated algorithms and user-friendly interfaces is making biological software more accessible to a wider range of researchers and clinicians. The segment encompassing experimental design and data analysis software holds a significant market share, reflecting the crucial role of computational tools in optimizing research workflows and extracting meaningful insights from complex biological datasets. North America currently dominates the market, owing to the robust presence of established biotechnology companies and a well-funded research infrastructure. However, Asia-Pacific is expected to witness significant growth in the coming years due to the expanding healthcare sector and increasing government investments in research and development. Market restraints include the high cost of software licenses, the requirement for specialized training to effectively utilize these tools, and the potential challenges associated with data security and integration across different platforms. Nevertheless, the ongoing innovation in software capabilities, coupled with the increasing adoption of subscription-based models and cloud-based solutions, is expected to mitigate these constraints. 
The competitive landscape is characterized by a mix of established players like Thermo Fisher Scientific and DNASTAR, along with smaller specialized companies offering niche solutions. This dynamic competitive environment fosters innovation and drives the development of advanced biological software solutions tailored to the specific needs of diverse research and clinical applications. Future growth will be influenced by factors such as advancements in artificial intelligence and machine learning within the software, integration with laboratory automation systems, and increasing collaboration between software providers and research institutions.
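The growth figures quoted in this description can be checked with the standard compound-growth formula, value = base × (1 + CAGR)^years. The sketch below is only an illustration of that arithmetic; the function names are hypothetical, and the inputs are the approximate numbers stated in the description ($2.5 billion in 2025, about 12% CAGR, $7 billion in 2033).

```python
def project_market_size(base_value: float, cagr: float, years: int) -> float:
    """Compound a starting value forward at a constant annual growth rate."""
    return base_value * (1.0 + cagr) ** years


def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Return the constant annual growth rate that links two values."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures quoted in the description, in billions of USD.
base_2025 = 2.5
target_2033 = 7.0
years = 2033 - 2025

projected = project_market_size(base_2025, 0.12, years)
rate = implied_cagr(base_2025, target_2033, years)

print(f"2033 value at exactly 12% CAGR: ${projected:.2f} billion")
print(f"CAGR implied by $2.5B -> $7B over {years} years: {rate:.1%}")
```

At exactly 12% the 2033 value comes out near $6.2 billion, while reaching $7 billion would need a rate closer to 14%. Both quoted figures are described as approximate, so the gap is best read as rounding in the report summary.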
Data from: PseudoResistance DB: A new Database of antibiotics related to...
data.mendeley.com
Updated Nov 8, 2024
Cite
Caio Cheohen (2024). PseudoResistance DB: A new Database of antibiotics related to Pseudomonas aeruginosa antibiotic resistance [Dataset]. http://doi.org/10.17632/bxdn3p33z2.1
Unique identifier
https://doi.org/10.17632/bxdn3p33z2.1
Dataset updated
Nov 8, 2024
Authors
Caio Cheohen
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This research addresses the pressing issue of antibiotic resistance, a global health challenge that undermines the efficacy of treatments against infectious diseases. Focusing on Pseudomonas aeruginosa—a Gram-negative bacterium known for causing opportunistic infections—this study emphasizes its prioritization by the World Health Organization (WHO) as a critical-level pathogen requiring new therapeutic approaches.

To identify antibiotics associated with P. aeruginosa, the study employed text mining techniques on the Scielo database. The resulting dataset comprises 98 antibiotics, each documented with detailed textual information and referencing data. Additionally, the dataset includes structural files of the antibiotics in several formats suitable for computational modeling and simulations. These formats encompass Protein Data Bank, Partial Charge & Atom Type (PDBQT), Simplified Molecular Input Line Entry System (SMI), IUPAC International Chemical Identifier (INCHI), Molecular Design Limited Molfile (MOL2), Structure-Data File (SDF), Chemical Markup Language (CML), Cartesian Coordinates File (XYZ), Scalable Vector Graphics (SVG), Molecular File (MOL) and Protein Data Bank (PDB) files, with molecular models generated via OpenBabel to facilitate advanced studies in drug development and resistance mechanisms.
Life Science IT Analytics Software Report
datainsightsmarket.com
doc, pdf, ppt
Updated Aug 23, 2025
Cite
Data Insights Market (2025). Life Science IT Analytics Software Report [Dataset]. https://www.datainsightsmarket.com/reports/life-science-it-analytics-software-543689
Available download formats: ppt, doc, pdf
Dataset updated
Aug 23, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The Life Science IT Analytics Software market is booming, projected to reach $15 billion by 2033, driven by genomic data growth and personalized medicine. Learn about key trends, top companies (Illumina, Thermo Fisher, Qiagen), and market forecasts in our comprehensive analysis.
Gene Expression Analysis and Disease Relationship
kaggle.com
zip
Updated Aug 4, 2025
Cite
asel (2025). Gene Expression Analysis and Disease Relationship [Dataset]. https://www.kaggle.com/datasets/ylmzasel/gene-expression-analysis-and-disease-relationship/code
Available download formats: zip (8740 bytes)
Dataset updated
Aug 4, 2025
Authors
asel
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset comprises 1000 hypothetical patient or sample entries, each detailing gene expression profiles and relevant clinical characteristics. It includes a mix of numerical and categorical data types, allowing for the application of diverse machine learning and statistical analysis methods.

Column Descriptions:

PatientID (Categorical/Numerical): A unique identification number assigned to each patient.
Age (Numerical): The patient's age. Can be used to investigate potential correlations between age and gene expression profiles.
Gender (Categorical): The patient's gender (0: Female, 1: Male). Effects of gender on gene expression or disease status can be analyzed.
Gene_X_Expression (Numerical): The relative expression level of a specific gene, "Gene X". This represents a hypothetical gene that might play a role in disease progression or treatment response.
Gene_Y_Expression (Numerical): The relative expression level of another specific gene, "Gene Y". Can be studied in conjunction with or independently of Gene X.
SmokingStatus (Categorical): The patient's smoking status (0: Non-smoker, 1: Ex-smoker, 2: Current smoker). The impact of environmental factors on gene expression and disease can be assessed.
DiseaseStatus (Categorical): The patient's status for the target disease (0: Healthy, 1: Disease A, 2: Disease B). This can serve as the primary target variable for predictive models.
TreatmentResponse (Categorical/Numerical): The degree of response to applied treatment (0: No Response, 1: Partial Response, 2: Full Response). The role of gene expression profiles in predicting treatment success can be explored.

Use Cases and Potential Projects

This dataset serves as a starting point for students, researchers, and enthusiasts in bioinformatics, computational biology, data science, and machine learning, enabling projects such as:

Disease Diagnosis/Classification: Building models to predict DiseaseStatus using gene expression levels and other clinical factors.
Treatment Response Prediction: Forecasting how patients with specific gene expression profiles might respond to treatment (TreatmentResponse).
Biomarker Discovery: Identifying gene expression levels (e.g., Gene_X_Expression, Gene_Y_Expression) that show strong correlations with disease or treatment response.
Feature Engineering and Selection: Evaluating the importance of various features in the dataset and creating new ones to enhance model performance.
Data Visualization: Generating visualizations to explore relationships between gene expression data and demographic/clinical factors.
Regression and Correlation Analyses: Quantitatively examining the effects of factors like age and smoking status on gene expression levels.

Why Use This Dataset?

Privacy Secure: Being entirely synthetic, it carries none of the privacy or ethical concerns associated with real patient data.
Diversity: The mix of numerical and categorical variables offers a rich ground for experimenting with different analytical techniques.
Predictive Potential: Clear target variables like DiseaseStatus and TreatmentResponse make it suitable for developing classification and regression models.
Educational and Learning: Suited to applying fundamental data science and machine learning concepts for anyone interested in the bioinformatics domain.
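As a rough illustration of how the integer codes in the column descriptions could be turned back into labels, the sketch below uses only Python's standard `csv` module. It assumes the CSV header uses the column names exactly as listed and that categorical values are stored as plain integers; neither has been checked against the actual file. The two sample rows are made up for the example, and `CODEBOOK`, `NUMERIC_COLUMNS`, and `decode_row` are hypothetical names.

```python
import csv
import io

# Code-to-label tables copied from the column descriptions.
CODEBOOK = {
    "Gender": {0: "Female", 1: "Male"},
    "SmokingStatus": {0: "Non-smoker", 1: "Ex-smoker", 2: "Current smoker"},
    "DiseaseStatus": {0: "Healthy", 1: "Disease A", 2: "Disease B"},
    "TreatmentResponse": {0: "No Response", 1: "Partial Response", 2: "Full Response"},
}
NUMERIC_COLUMNS = ("Age", "Gene_X_Expression", "Gene_Y_Expression")


def decode_row(row):
    """Convert one CSV row (all strings) into typed, human-readable values."""
    decoded = dict(row)
    for column, labels in CODEBOOK.items():
        decoded[column] = labels[int(row[column])]
    for column in NUMERIC_COLUMNS:
        decoded[column] = float(row[column])
    return decoded


# Two made-up rows for illustration only; they are not taken from the dataset.
SAMPLE_CSV = """PatientID,Age,Gender,Gene_X_Expression,Gene_Y_Expression,SmokingStatus,DiseaseStatus,TreatmentResponse
1,54,0,2.31,0.87,2,1,1
2,37,1,1.05,1.92,0,0,2
"""

for record in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    print(decode_row(record))
```

Keeping the code tables in one dictionary makes it easy to switch between the integer form (convenient for model training) and the labelled form (convenient for plots and summaries).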
Life Science Analytics Software Report
datainsightsmarket.com
doc, pdf, ppt
Updated Oct 27, 2025
Cite
Data Insights Market (2025). Life Science Analytics Software Report [Dataset]. https://www.datainsightsmarket.com/reports/life-science-analytics-software-543718
Available download formats: ppt, pdf, doc
Dataset updated
Oct 27, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The global Life Science Analytics Software market is projected to experience substantial growth, driven by an estimated market size of $12,500 million in 2025 and a Compound Annual Growth Rate (CAGR) of 12%. This expansion is fueled by an increasing volume of complex biological data generated through advancements in genomics, proteomics, and clinical research. Demand for faster drug discovery, optimized clinical trial management, and more efficient laboratory operations is a key driver. Furthermore, the burgeoning field of bioinformatics, powered by sophisticated analytical tools, is enabling researchers to derive deeper insights from vast datasets, accelerating the development of novel therapeutics and personalized medicine. The integration of artificial intelligence and machine learning into these platforms is further enhancing predictive capabilities and automating data analysis, making them indispensable for modern life science organizations.

The market is characterized by several significant trends. The rising adoption of cloud-based analytics solutions is facilitating greater scalability, accessibility, and collaboration among researchers globally. There is also a noticeable shift towards specialized analytics software tailored for specific applications like drug discovery informatics and clinical trial management, offering more targeted and efficient solutions. However, the market faces certain restraints, including the high cost of implementing and maintaining advanced analytics software, concerns regarding data security and privacy, and a shortage of skilled data scientists with expertise in life sciences.
Despite these challenges, the continuous innovation in software features, coupled with increasing investments in research and development by key players such as Revvity, IBM Corporation, and Thermo Fisher Scientific, is expected to propel the market forward, with Asia Pacific poised to emerge as a rapidly growing region due to its expanding healthcare infrastructure and increasing R&D investments. This in-depth report provides a holistic view of the global Life Science Analytics Software market, meticulously analyzing its trajectory from 2019 to 2033, with a specific focus on the Base Year of 2025 and the Forecast Period of 2025-2033. The Historical Period of 2019-2024 has been thoroughly reviewed to establish foundational market dynamics. The report offers actionable insights, market valuations in the millions, and strategic recommendations for stakeholders navigating this dynamic sector.
2025 Green Card Report for Master Of Science In Bioinformatics
myvisajobs.com
Updated Jan 16, 2025
Cite
MyVisaJobs (2025). 2025 Green Card Report for Master Of Science In Bioinformatics [Dataset]. https://www.myvisajobs.com/reports/green-card/major/master-of-science-in-bioinformatics
Dataset updated
Jan 16, 2025
Dataset authored and provided by
MyVisaJobs
License
https://www.myvisajobs.com/terms-of-service/
Variables measured
Major, Salary, Petitions Filed
Description
A dataset that explores Green Card sponsorship trends, salary data, and employer insights for the Master of Science in Bioinformatics major in the U.S.
The Network for Integrating Bioinformatics into Life Sciences Education...
qubeshub.org
Updated Jul 23, 2020
Cite
Anne Rosenwald; Elizabeth Dinsdale; William Morgan; Mark Pauley; William Tapprich; Eric Triplett; Jason Williams (2020). The Network for Integrating Bioinformatics into Life Sciences Education (NIBLSE): Barriers to Integration [Dataset]. http://doi.org/10.25334/NHB4-X766
Unique identifier
https://doi.org/10.25334/NHB4-X766
Dataset updated
Jul 23, 2020
Dataset provided by
QUBES
Authors
Anne Rosenwald; Elizabeth Dinsdale; William Morgan; Mark Pauley; William Tapprich; Eric Triplett; Jason Williams
Description
The Network for Integrating Bioinformatics into Life Sciences Education (NIBLSE) seeks to promote the use of bioinformatics and data science as a way to teach biology.
The GitHub repository for an integrative analysis of genomic plasticity in...
zenodo.org
zip
Updated Jan 24, 2020
Cite
Rayna Harris; Hsin-Yi Kao; Juan Marcos Alarcon; Hans Hofmann; Andre Fenton (2020). The GitHub repository for an integrative analysis of genomic plasticity in the hippocampus [Dataset]. http://doi.org/10.5281/zenodo.810407
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.810407
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Rayna Harris; Hsin-Yi Kao; Juan Marcos Alarcon; Hans Hofmann; Andre Fenton
Description
Cost-effective next-generation sequencing has made unbiased gene expression investigations possible. Gene expression studies at the level of single neurons may be especially important for understanding nervous system structure and function because of neuron-specific functionality and plasticity. While cellular dissociation is a prerequisite technical manipulation for such single-cell studies, the extent to which the process of dissociating cells affects neural gene expression has not been determined. Here, we examine the effect of cellular dissociation on gene expression in the mouse hippocampus. We also determine the extent to which such changes might confound studies on the behavioral and physiological functions of the hippocampus.

This dataset contains the data, software, and results that accompany a manuscript that is in the process of submission to the journal Hippocampus.
Bioinformatics Services Market - Global Growth Opportunities 2024-2030
htfmarketinsights.com
pdf & excel
Updated Oct 7, 2025
Cite
HTF Market Intelligence (2025). Bioinformatics Services Market - Global Growth Opportunities 2024-2030 [Dataset]. https://htfmarketinsights.com/report/4013511-bioinformatics-services-market
Available download formats: pdf & excel
Dataset updated
Oct 7, 2025
Dataset authored and provided by
HTF Market Intelligence
License
https://www.htfmarketinsights.com/privacy-policy
Time period covered
2019 - 2031
Area covered
Global
Description
The Global Bioinformatics Services Market is segmented by Application (Pharmaceutical Companies, Biotech Companies, Research Institutions), Type (Biotechnology, Life Sciences, Genomics, Bioinformatics, Data Science), and Geography (North America, LATAM, West Europe, Central & Eastern Europe, Northern Europe, Southern Europe, East Asia, Southeast Asia, South Asia, Central Asia, Oceania, MEA).
PARSING FASTA AND GENBANK FILES
kaggle.com
zip
Updated Nov 25, 2025
Cite
Dr. Nagendra (2025). PARSING FASTA AND GENBANK FILES [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/parsing-fasta-and-genbank-files
Available download formats: zip (17972831 bytes)
Dataset updated
Nov 25, 2025
Authors
Dr. Nagendra
License
MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset is a dedicated resource for learning how to parse core bioinformatics file formats. It contains representative samples of FASTA and GenBank files. The goal is to provide raw data for practicing essential data extraction skills. FASTA files contain sequence data, such as DNA, RNA, or protein, in a simple text format. GenBank files include detailed sequence annotations, features, and metadata. This is an ideal starting point for anyone learning Biopython or general sequence manipulation in genomics.
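To make the FASTA layout described here concrete, the following is a minimal hand-written parser using only the Python standard library. It is a sketch for learning purposes, not the dataset author's code: `parse_fasta` is a hypothetical helper, and the two records are invented. In practice Biopython's `SeqIO.parse(handle, "fasta")` covers the same ground with more validation.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream.

    A record starts at a line beginning with '>' and its sequence may be
    wrapped over any number of following lines.
    """
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Invented example records, used only to exercise the parser.
EXAMPLE = """>seq1 example DNA record
ACGTACGT
TTGA
>seq2 example protein record
MKVL
"""

for name, sequence in parse_fasta(io.StringIO(EXAMPLE)):
    print(name, len(sequence), sequence)
```

The parser is a generator, so large files can be streamed one record at a time instead of being loaded into memory in full.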
SARS-CoV-2 Surface glycoproteins Alignment Data
data.mendeley.com
Updated Aug 20, 2021
Cite
Done Stojanov (2021). SARS-CoV-2 Surface glycoproteins Alignment Data [Dataset]. http://doi.org/10.17632/btb5ffk247.1
Unique identifier
https://doi.org/10.17632/btb5ffk247.1
Dataset updated
Aug 20, 2021
Authors
Done Stojanov
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SARS-CoV-2SpikeProteinMutations.docx contains data on mutations found in aligned SARS-CoV-2 surface glycoproteins.

SARS-CoV-2SpikeProteinVariants.docx contains data on computed SARS-CoV-2 surface glycoprotein variants in Europe.
Drosophila Melanogaster Genome
kaggle.com
ieee-dataport.org
zip
Updated Nov 17, 2019
Cite
Myles O'Neill (2019). Drosophila Melanogaster Genome [Dataset]. https://www.kaggle.com/mylesoneill/drosophila-melanogaster-genome
Available download formats: zip (136202106 bytes)
Dataset updated
Nov 17, 2019
Authors
Myles O'Neill
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Drosophila Melanogaster

Drosophila melanogaster, the common fruit fly, is a model organism that has been used extensively in entomological research. It is one of the most studied organisms in biological research, particularly in genetics and developmental biology.

When it's not being used for scientific research, D. melanogaster is a common pest in homes, restaurants, and anywhere else that serves food. It is not to be confused with Tephritidae flies (also known as fruit flies).

https://en.wikipedia.org/wiki/Drosophila_melanogaster

About the Genome

This genome was first sequenced in 2000. It contains four pairs of chromosomes (2,3,4 and X/Y). More than 60% of the genome appears to be functional non-protein-coding DNA.

[Figure: D. melanogaster chromosomes]

The genome is maintained and frequently updated at [FlyBase][2]. This dataset is sourced from the UCSC Genome Bioinformatics download page. It uses the August 2014 version of the D. melanogaster genome (dm6, BDGP Release 6 + ISO1 MT). http://hgdownload.soe.ucsc.edu/downloads.html#fruitfly

Files were modified by Kaggle to be a better fit for analysis on Scripts. This primarily involved turning files into CSV format, with a header row, as well as converting the genome itself from 2bit format into a FASTA sequence file.

Bioinformatics

Genomic analysis can be daunting to data scientists who haven't had much experience with bioinformatics before. We have tried to give basic explanations of each of the files in this dataset, as well as links to further reading on the biological basis for each. If you haven't had the chance to study much biology before, some light reading (e.g. Wikipedia) on the following topics may be helpful to understand the nuances of the data provided here: [Genetics][3], [Genomics][4], [Chromosomes][7], [DNA][8], [RNA][9], [Genes][12], [Alleles][13], [Exons][14], [Introns][15], [Transcription][16], [Translation][17], [Peptides][18], [Proteins][19], [Gene Regulation][20], [Mutation][21], [Phylogenetics][22], and [SNPs][23].

Of course, if you've got some idea of the basics already - don't be afraid to jump right in!

Learning Bioinformatics

There are a lot of great resources for learning bioinformatics on the web. One cool site is [Rosalind][24] - a platform that gives you bioinformatic coding challenges to complete. You can use Kaggle Scripts on this dataset to easily complete the challenges on Rosalind (and see [Myles' solutions here][25] if you get stuck). We have set up [Biopython][26] on Kaggle's docker image which is a great library to help you with your analyses. Check out their [tutorial here][27] and we've also created [a python notebook with some of the tutorial applied to this dataset][28] as a reference.

Files in this Dataset

Drosophila Melanogaster Genome

genome.fa

The assembled genome itself is presented here in [FASTA format][29]. Each chromosome is a different sequence of nucleotides. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case.
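Because repeats are marked only by letter case, the repeat content of a sequence can be estimated by counting lower-case bases. The sketch below shows that idea on a short invented string; `soft_masked_fraction` is a hypothetical helper, and leaving ambiguous bases such as N out of the denominator is an assumption made for this example rather than something the dataset specifies.

```python
def soft_masked_fraction(sequence: str) -> float:
    """Return the share of A/C/G/T bases written in lower case.

    Lower-case bases mark soft-masked repeats; characters other than
    A, C, G, T (for example N) are ignored in both counts.
    """
    masked = sum(1 for base in sequence if base in "acgt")
    unmasked = sum(1 for base in sequence if base in "ACGT")
    total = masked + unmasked
    return masked / total if total else 0.0


# Invented sequence: four upper-case bases, four lower-case bases, four Ns.
example = "ACGTacgtNNnn"
print(f"{soft_masked_fraction(example):.2f}")  # prints 0.50

# Upper-casing a sequence discards the repeat annotation entirely.
print(f"{soft_masked_fraction(example.upper()):.2f}")  # prints 0.00
```

Applied per chromosome to genome.fa, the same function would give a quick view of how much of each sequence is annotated as repeat.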
Meta Information

There are 3 additional files with meta information about the genome.

meta-cpg-island-ext-unmasked.csv

This file contains descriptive information about CpG Islands in the genome.

https://en.wikipedia.org/wiki/CpG_site

meta-cytoband.csv

This file describes the positions of cytogenetic bands on each chromosome.

https://en.wikipedia.org/wiki/Cytogenetics

meta-simple-repeat.csv

This file describes simple tandem repeats in the genome.

https://en.wikipedia.org/wiki/Repeated_sequence_(DNA)
https://en.wikipedia.org/wiki/Tandem_repeat

Drosophila Melanogaster mRNA Sequences

Messenger RNA (mRNA) is an intermediate molecule created as part of the cellular process of converting genomic information into proteins. Some mRNA molecules are never translated into proteins and have functional roles in the cell on their own. Collectively, an organism's mRNA information is known as its transcriptome. The mRNA files included in this dataset give insight into the activity of genes in the organism.

https://en.wikipedia.org/wiki/Messenger_RNA

mrna-genbank.fa

This file includes all mRNA sequences from GenBank associated with Drosophila Melanogaster.

http://www.ncbi.nlm.nih.gov/genbank/

mrna-refseq.fa

This file includes all mRNA sequences from RefSeq associated with Drosophila Melanogaster.

http://www.ncbi.nlm.nih.gov/refseq/

Gene Predictions

A gene is a segment of DNA on the genome which, through mRNA, is used to create proteins in the organism. Knowing which parts of DNA are coding (genes) or non-coding is difficult, and a number of different systems for prediction exist. This da...
Cell_Gene_Expression_Metadata
kaggle.com
zip
Updated Sep 24, 2025
Cite
Kazi Aishikuzzaman (2025). Cell_Gene_Expression_Metadata [Dataset]. https://www.kaggle.com/datasets/kaziaishikuzzaman/cell-gene-expression-metadata
Explore at:
Available download formats: zip (845887409 bytes)
Dataset updated
Sep 24, 2025
Authors
Kazi Aishikuzzaman
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Overview

This dataset contains comprehensive metadata from single-cell gene expression studies, providing researchers with structured information about cellular phenotypes, experimental conditions, and sample characteristics. The data is particularly valuable for bioinformatics research, machine learning applications in genomics, and comparative studies across different cell types and conditions.

Dataset Description

The dataset comprises metadata associated with single-cell RNA sequencing (scRNA-seq) experiments, including:

Cell Type Information: Classification of different cell types and subtypes
Experimental Metadata: Details about experimental conditions, protocols, and methodologies
Sample Characteristics: Information about biological samples, including tissue origin, developmental stages, and treatment conditions
Quality Metrics: Data quality indicators and filtering parameters
Annotation Details: Standardized cell type annotations and biological classifications

Data Source and Licensing

This dataset is derived from publicly available single-cell gene expression data, potentially sourced from:

CELLxGENE Data Portal (https://cellxgene.cziscience.com/)
Gene Expression Omnibus (GEO)
European Bioinformatics Institute (EBI)
Other public genomics repositories

License: Creative Commons CC BY 4.0 (or specify the actual license)

✅ Commercial use allowed
✅ Modification allowed
✅ Distribution allowed
✅ Private use allowed
❗ Attribution required

Research Applications

Cell Type Discovery: Identify novel cell types and subtypes
Comparative Genomics: Study cellular differences across conditions, tissues, or species
Disease Research: Investigate cellular changes in disease states
Developmental Biology: Analyze cellular differentiation and development patterns

Machine Learning Applications

Classification Tasks: Predict cell types from gene expression data
Clustering Analysis: Discover cellular subpopulations and states
Dimensionality Reduction: Apply PCA, t-SNE, UMAP for visualization
Biomarker Discovery: Identify genes characteristic of specific cell types

Educational Use

Teaching bioinformatics and computational biology concepts.
Demonstrating single-cell analysis workflows.
Training in data preprocessing and quality control.

Data Quality and Preprocessing

Quality Control: Metadata has been curated and standardized
Missing Values: [Specify how missing values are handled]
Standardization: Cell type annotations follow established ontologies (e.g., Cell Ontology)
Validation: Data has been cross-referenced with original publications

Usage Guidelines

Getting Started:

Load the metadata files using pandas or your preferred data analysis tool.
Explore the cell type distributions and experimental conditions.
Filter data based on quality metrics as needed.
Join with corresponding gene expression data for comprehensive analysis.
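The getting-started steps above can be sketched roughly as follows. This is a minimal illustration that uses only the Python standard library so it runs anywhere; pandas would work equally well. The column names (`cell_id`, `cell_type`, `condition`, `n_genes`, `GeneA`, `GeneB`), the in-memory CSV text, and the `MIN_GENES` cutoff are all assumptions made up for this example; the dataset description does not document the actual schema.

```python
import csv
import io
from collections import Counter

# Made-up stand-ins for metadata_summary.csv, quality_metrics.csv and a
# companion expression table. All column names here are hypothetical.
METADATA_CSV = """cell_id,cell_type,condition
c1,T cell,control
c2,B cell,treated
c3,T cell,treated
c4,NK cell,control
"""
QUALITY_CSV = """cell_id,n_genes
c1,1800
c2,450
c3,2300
c4,1200
"""
EXPRESSION_CSV = """cell_id,GeneA,GeneB
c1,5,0
c2,1,7
c3,9,2
c4,0,3
"""


def load_rows(text):
    """Step 1: parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


metadata = load_rows(METADATA_CSV)
quality = {row["cell_id"]: int(row["n_genes"]) for row in load_rows(QUALITY_CSV)}
expression = {row["cell_id"]: row for row in load_rows(EXPRESSION_CSV)}

# Step 2: explore the cell type distribution.
cell_type_counts = Counter(row["cell_type"] for row in metadata)

# Step 3: keep cells that pass an illustrative quality threshold.
MIN_GENES = 500  # arbitrary cutoff chosen only for this example
passing = [row for row in metadata if quality.get(row["cell_id"], 0) >= MIN_GENES]

# Step 4: join the surviving metadata rows with expression values on cell_id.
joined = [
    {**row, "GeneA": int(expression[row["cell_id"]]["GeneA"]),
     "GeneB": int(expression[row["cell_id"]]["GeneB"])}
    for row in passing
    if row["cell_id"] in expression
]

print(cell_type_counts)
print([row["cell_id"] for row in joined])
```

Filtering before joining keeps the join small, which can help when the expression table is much larger than the metadata.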

Best Practices

Always cite original data sources and publications.
Consider batch effects when combining data from different experiments.
Validate findings with independent datasets when possible.
Follow established bioinformatics workflows for single-cell analysis.

Citation and Acknowledgments

If you use this dataset in your research, please cite this dataset: [Kazi Aishikuzzaman]. (2024). Cell Gene Expression Metadata. Kaggle. https://www.kaggle.com/datasets/kaziaishikuzzaman/cell-gene-expression-metadata

File Structure

dataset:
─ metadata_summary.csv # Main metadata file
─ cell_type_annotations.csv # Detailed cell type information
─ experimental_conditions.csv # Experiment-specific metadata
─ quality_metrics.csv # Data quality indicators
─ README.txt # Detailed file descriptions

Technical Specifications

File Encoding: UTF-8
Separator: Comma-separated values (CSV)
Missing Values: Represented as 'NA' or empty cells
Data Types: Mixed (categorical, numerical, text)
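Given the specifications above, a reader might normalize both missing-value spellings ('NA' and empty cells) to a single marker when loading a file. The sketch below shows one way to do that with the standard library; the function name `read_metadata`, the column names, and the sample rows are hypothetical and only illustrate the idea.

```python
import csv
import io

# The two missing-value spellings listed in the technical specifications.
MISSING_MARKERS = {"", "NA"}


def clean(value):
    """Map 'NA' and empty strings to None; leave other values untouched."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip() in MISSING_MARKERS:
        return None
    return value


def read_metadata(handle):
    """Read comma-separated rows and normalize missing values to None."""
    return [
        {column: clean(value) for column, value in row.items()}
        for row in csv.DictReader(handle)
    ]


# Made-up sample; for a real file use open(path, encoding="utf-8", newline="").
sample = io.StringIO("cell_id,tissue,stage\nc1,brain,NA\nc2,,adult\n")
rows = read_metadata(sample)

# Count missing entries per column as a quick completeness check.
missing_per_column = {col: sum(r[col] is None for r in rows) for col in rows[0]}
print(missing_per_column)
```

With pandas, a similar effect could be achieved through the `na_values` argument of `read_csv`, which already treats 'NA' and empty fields as missing by default.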

Contact and Support

For questions about this dataset:

Kaggle Profile: @kaziaishikuzzaman
Dataset Issues: Use Kaggle's discussion section
Collaboration: Open to research collaborations and improvements

Version History

v1.0: Initial release with comprehensive metadata collection
[Future versions]: Updates and additional annotations as available

Related Datasets

Consider exploring these complementary datasets:

Single-cell gene expression data (companion to this metadata)
Cell atlas datasets from major consortiums
Disease-specific single-cell studies
Multi-omics datasets with matching cell types

Keywords: single-cell, RNA-seq, genomics, cell types, metadata, bioinformatics, machine learning, computational biology
Category: Biology > Genomics
QIIME 2 Tutorial Data
registry.opendata.aws
Updated Jan 23, 2019
Cite
Caporaso Lab (2019). QIIME 2 Tutorial Data [Dataset]. https://registry.opendata.aws/qiime2/
Explore at:
Dataset updated
Jan 23, 2019
Dataset provided by
Caporaso Lab
Description
QIIME 2 (pronounced “chime two”) is a microbiome multi-omics bioinformatics and data science platform that is trusted, free, open source, extensible, and community developed and supported.
SARS-CoV-2 Surface glycoprotein Alignment Data Mendeley
data.mendeley.com
narcis.nl
Updated Sep 20, 2021
Cite
Done Stojanov (2021). SARS-CoV-2 Surface glycoprotein Alignment Data Mendeley [Dataset]. http://doi.org/10.17632/k7sy3sk7rx.2
Explore at:
Unique identifier
https://doi.org/10.17632/k7sy3sk7rx.2
Dataset updated
Sep 20, 2021
Authors
Done Stojanov
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SARS-CoV-2SpikeProteinMutations.xlsx contains data on mutations found in aligned SARS-CoV-2 surface glycoproteins.
