81 datasets found
  1. Bioinformatics data for paper

    • catalog.data.gov
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Bioinformatics data for paper [Dataset]. https://catalog.data.gov/dataset/bioinformatics-data-for-paper
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Data for sequence comparison of comammox genomes and genes identified. This dataset is associated with the following publication: Camejo, P., J. Santodomingo, K. McMahon, and D. Noguera. Genome-enabled insights into the ecophysiology of the comammox bacterium Ca. Nitrospira nitrosa. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 2(5): 1-16, (2017).

  2. Bioinformatics for Researchers in Life Sciences: Tools and Learning...

    • data.iadb.org
    csv, pdf
    Updated Apr 10, 2025
    Cite
    IDB Datasets (2025). Bioinformatics for Researchers in Life Sciences: Tools and Learning Resources [Dataset]. http://doi.org/10.60966/kwvb-wr19
    Available download formats: csv(355108), pdf(2989058), csv(276253)
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    IDB Datasets
    License

    Attribution-NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0): https://creativecommons.org/licenses/by-nc-nd/3.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2020 - Jan 1, 2021
    Description

    The COVID-19 pandemic has shown that bioinformatics--a multidisciplinary field that combines biological knowledge with computer programming concerned with the acquisition, storage, analysis, and dissemination of biological data--has a fundamental role in scientific research strategies in all disciplines involved in fighting the virus and its variants. It aids in sequencing and annotating genomes and their observed mutations; analyzing gene and protein expression; simulation and modeling of DNA, RNA, proteins and biomolecular interactions; and mining of biological literature, among many other critical areas of research. Studies suggest that bioinformatics skills in the Latin American and Caribbean region are relatively incipient, and thus its scientific systems cannot take full advantage of the increasing availability of bioinformatic tools and data. This dataset is a catalog of bioinformatics software for researchers and professionals working in life sciences. It includes more than 300 different tools for varied uses, such as data analysis, visualization, repositories and databases, data storage services, scientific communication, marketplace and collaboration, and lab resource management. Most tools are available as web-based or desktop applications, while others are programming libraries. It also includes 10 suggested entries for other third-party repositories that could be of use.

  3. Raw motif mapping bedfile data and model training set class probabilities

    • search.dataone.org
    • data.niaid.nih.gov
    • +1 more
    Updated May 6, 2025
    Cite
    Phillip Davis (2025). Raw motif mapping bedfile data and model training set class probabilities [Dataset]. http://doi.org/10.5061/dryad.tdz08kq3w
    Dataset updated
    May 6, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Phillip Davis
    Time period covered
    Jan 1, 2023
    Description

    Leveraging prior viral genome sequencing data to make predictions on whether an unknown, emergent virus harbors a ‘phenotype-of-concern’ has been a long-sought goal of genomic epidemiology. A predictive phenotype model built from nucleotide-level information alone is challenging with respect to RNA viruses due to the ultra-high intra-sequence variance of their genomes, even within closely related clades. We developed a degenerate k-mer method to accommodate this high intra-sequence variation of RNA virus genomes for modeling frameworks. By leveraging a taxonomy-guided ‘group-shuffle-split’ cross validation paradigm on complete coronavirus assemblies from prior to October 2018, we trained multiple regularized logistic regression classifiers at the nucleotide k-mer level. We demonstrate the feasibility of this method by finding models accurately predicting withheld SARS-CoV-2 genome sequences as human pathogens and accurately predicting withheld Swine Acute Diarrhea Syndrome coronavirus (...

  4. Data from: Advancing computational biology and bioinformatics research...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Sep 27, 2019
    Cite
    Jonchhe, Anup; Su, Andrew I.; Natoli, Ted; Macaluso, N. J. Maximilian; Briney, Bryan; Blasco, Andrea; Narayan, Rajiv; Lakhani, Karim R.; Paik, Jin H.; Endres, Michael G.; Sergeev, Rinat A.; Wu, Chunlei; Subramanian, Aravind (2019). Advancing computational biology and bioinformatics research through open innovation competitions [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000064443
    Dataset updated
    Sep 27, 2019
    Authors
    Jonchhe, Anup; Su, Andrew I.; Natoli, Ted; Macaluso, N. J. Maximilian; Briney, Bryan; Blasco, Andrea; Narayan, Rajiv; Lakhani, Karim R.; Paik, Jin H.; Endres, Michael G.; Sergeev, Rinat A.; Wu, Chunlei; Subramanian, Aravind
    Description

    Open data science and algorithm development competitions offer a unique avenue for rapid discovery of better computational strategies. We highlight three examples in computational biology and bioinformatics research in which the use of competitions has yielded significant performance gains over established algorithms. These include algorithms for antibody clustering, imputing gene expression data, and querying the Connectivity Map (CMap). Performance gains are evaluated quantitatively using realistic, albeit sanitized, data sets. The solutions produced through these competitions are then examined with respect to their utility and the prospects for implementation in the field. We present the decision process and competition design considerations that lead to these successful outcomes as a model for researchers who want to use competitions and non-domain crowds as collaborators to further their research.

  5. DataSheet1_Bioinformatic Teaching Resources – For Educators, by Educators –...

    • figshare.com
    • frontiersin.figshare.com
    docx
    Updated Jun 6, 2023
    + more versions
    Cite
    Ellen G. Dow; Elisha M. Wood-Charlson; Steven J. Biller; Timothy Paustian; Aaron Schirmer; Cody S. Sheik; Jason M. Whitham; Rose Krebs; Carlos C. Goller; Benjamin Allen; Zachary Crockett; Adam P. Arkin (2023). DataSheet1_Bioinformatic Teaching Resources – For Educators, by Educators – Using KBase, a Free, User-Friendly, Open Source Platform.DOCX [Dataset]. http://doi.org/10.3389/feduc.2021.711535.s001
    Available download formats: docx
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    Frontiers
    Authors
    Ellen G. Dow; Elisha M. Wood-Charlson; Steven J. Biller; Timothy Paustian; Aaron Schirmer; Cody S. Sheik; Jason M. Whitham; Rose Krebs; Carlos C. Goller; Benjamin Allen; Zachary Crockett; Adam P. Arkin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Over the past year, biology educators and staff at the U.S. Department of Energy Systems Biology Knowledgebase (KBase) initiated a collaborative effort to develop a curriculum for bioinformatics education. KBase is a free web-based platform where anyone can conduct sophisticated and reproducible bioinformatic analyses via a graphical user interface. Here, we demonstrate the utility of KBase as a platform for bioinformatics education, and present a set of modular, adaptable, and customizable instructional units for teaching concepts in Genomics, Metagenomics, Pangenomics, and Phylogenetics. Each module contains teaching resources, publicly available data, analysis tools, and Markdown capability, enabling instructors to modify the lesson as appropriate for their specific course. We present initial student survey data on the effectiveness of using KBase for teaching bioinformatic concepts, provide an example case study, and detail the utility of the platform from an instructor’s perspective. Even as in-person teaching returns, KBase will continue to work with instructors, supporting the development of new active learning curriculum modules. For anyone utilizing the platform, the growing KBase Educators Organization provides an educators network, accompanied by community-sourced guidelines, instructional templates, and peer support, for instructors wishing to use KBase within a classroom at any educational level–whether virtual or in-person.

  6. Biological Software Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Apr 21, 2025
    Cite
    Data Insights Market (2025). Biological Software Report [Dataset]. https://www.datainsightsmarket.com/reports/biological-software-1444091
    Available download formats: ppt, pdf, doc
    Dataset updated
    Apr 21, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global biological software market is experiencing robust growth, driven by the increasing adoption of advanced technologies in life sciences research and healthcare. The market, estimated at $2.5 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of approximately 12% from 2025 to 2033, reaching an estimated market value of $7 billion by 2033. This expansion is fueled by several key factors: the escalating demand for high-throughput data analysis in genomics and proteomics, the rising prevalence of chronic diseases necessitating advanced diagnostic tools, and the growing adoption of cloud-based solutions for enhanced collaboration and accessibility. Furthermore, the continuous development of sophisticated algorithms and user-friendly interfaces is making biological software more accessible to a wider range of researchers and clinicians. The segment encompassing experimental design and data analysis software holds a significant market share, reflecting the crucial role of computational tools in optimizing research workflows and extracting meaningful insights from complex biological datasets. North America currently dominates the market, owing to the robust presence of established biotechnology companies and a well-funded research infrastructure. However, Asia-Pacific is expected to witness significant growth in the coming years due to the expanding healthcare sector and increasing government investments in research and development. Market restraints include the high cost of software licenses, the requirement for specialized training to effectively utilize these tools, and the potential challenges associated with data security and integration across different platforms. Nevertheless, the ongoing innovation in software capabilities, coupled with the increasing adoption of subscription-based models and cloud-based solutions, is expected to mitigate these constraints. 
The competitive landscape is characterized by a mix of established players like Thermo Fisher Scientific and DNASTAR, along with smaller specialized companies offering niche solutions. This dynamic competitive environment fosters innovation and drives the development of advanced biological software solutions tailored to the specific needs of diverse research and clinical applications. Future growth will be influenced by factors such as advancements in artificial intelligence and machine learning within the software, integration with laboratory automation systems, and increasing collaboration between software providers and research institutions.
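
The growth figures in this description can be checked with the standard compound-growth formula. The following is a minimal Python sketch; the function names are illustrative, and the inputs are the report's own estimates ($2.5 billion in 2025, a 12% CAGR, and the 8 years from 2025 to 2033).

```python
def project_value(start, cagr, years):
    """Compound a starting value forward at a fixed annual growth rate."""
    return start * (1 + cagr) ** years


def implied_cagr(start, end, years):
    """Annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years) - 1


# Inputs taken from the description above (USD billions).
projected_2033 = project_value(2.5, 0.12, 8)
rate_to_reach_7 = implied_cagr(2.5, 7.0, 8)

print(f"Value in 2033 at 12% CAGR: {projected_2033:.2f} billion")
print(f"CAGR implied by 2.5 -> 7.0 billion: {rate_to_reach_7:.1%}")
```

At exactly 12%, the 2025 estimate compounds to about $6.2 billion, so the quoted $7 billion corresponds to a rate closer to 14%. The two figures are consistent only if "approximately 12%" is read loosely.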

  7. Data from: PseudoResistance DB: A new Database of antibiotics related to...

    • data.mendeley.com
    Updated Nov 8, 2024
    Cite
    Caio Cheohen (2024). PseudoResistance DB: A new Database of antibiotics related to Pseudomonas aeruginosa antibiotic resistance [Dataset]. http://doi.org/10.17632/bxdn3p33z2.1
    Dataset updated
    Nov 8, 2024
    Authors
    Caio Cheohen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This research addresses the pressing issue of antibiotic resistance, a global health challenge that undermines the efficacy of treatments against infectious diseases. Focusing on Pseudomonas aeruginosa—a Gram-negative bacterium known for causing opportunistic infections—this study emphasizes its prioritization by the World Health Organization (WHO) as a critical-level pathogen requiring new therapeutic approaches.

    To identify antibiotics associated with P. aeruginosa, the study employed text mining techniques on the Scielo database. The resulting dataset comprises 98 antibiotics, each documented with detailed textual information and referencing data. Additionally, the dataset includes structural files of the antibiotics in several formats suitable for computational modeling and simulations. These formats encompass Protein Data Bank, Partial Charge & Atom Type (PDBQT), Simplified Molecular Input Line Entry System (SMI), IUPAC International Chemical Identifier (INCHI), Molecular Design Limited Molfile (MOL2), Structure-Data File (SDF), Chemical Markup Language (CML), Cartesian Coordinates File (XYZ), Scalable Vector Graphics (SVG), Molecular File (MOL) and Protein Data Bank (PDB) files, with molecular models generated via OpenBabel to facilitate advanced studies in drug development and resistance mechanisms.

  8. Life Science IT Analytics Software Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Aug 23, 2025
    Cite
    Data Insights Market (2025). Life Science IT Analytics Software Report [Dataset]. https://www.datainsightsmarket.com/reports/life-science-it-analytics-software-543689
    Available download formats: ppt, doc, pdf
    Dataset updated
    Aug 23, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Life Science IT Analytics Software market is booming, projected to reach $15 billion by 2033, driven by genomic data growth and personalized medicine. Learn about key trends, top companies (Illumina, Thermo Fisher, Qiagen), and market forecasts in our comprehensive analysis.

  9. Gene Expression Analysis and Disease Relationship

    • kaggle.com
    zip
    Updated Aug 4, 2025
    Cite
    asel (2025). Gene Expression Analysis and Disease Relationship [Dataset]. https://www.kaggle.com/datasets/ylmzasel/gene-expression-analysis-and-disease-relationship/code
    Available download formats: zip (8740 bytes)
    Dataset updated
    Aug 4, 2025
    Authors
    asel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset comprises 1000 hypothetical patient or sample entries, each detailing gene expression profiles and relevant clinical characteristics. It includes a mix of numerical and categorical data types, allowing for the application of diverse machine learning and statistical analysis methods.

    Column Descriptions:

    • PatientID (Categorical/Numerical): A unique identification number assigned to each patient.
    • Age (Numerical): The patient's age. Can be used to investigate potential correlations between age and gene expression profiles.
    • Gender (Categorical): The patient's gender (0: Female, 1: Male). Effects of gender on gene expression or disease status can be analyzed.
    • Gene_X_Expression (Numerical): The relative expression level of a specific gene, "Gene X". This represents a hypothetical gene that might play a role in disease progression or treatment response.
    • Gene_Y_Expression (Numerical): The relative expression level of another specific gene, "Gene Y". Can be studied in conjunction with or independently of Gene X.
    • SmokingStatus (Categorical): The patient's smoking status (0: Non-smoker, 1: Ex-smoker, 2: Current smoker). Environmental factors' impact on gene expression and disease can be assessed.
    • DiseaseStatus (Categorical): The patient's status for the target disease (0: Healthy, 1: Disease A, 2: Disease B). This can serve as the primary target variable for your predictive models.
    • TreatmentResponse (Categorical/Numerical): The degree of response to applied treatment (0: No Response, 1: Partial Response, 2: Full Response). The role of gene expression profiles in predicting treatment success can be explored.

    Use Cases and Potential Projects

    This dataset serves as a starting point for students, researchers, and enthusiasts in bioinformatics, computational biology, data science, and machine learning, enabling various projects such as:

    • Disease Diagnosis/Classification: Building models to predict DiseaseStatus using gene expression levels and other clinical factors.
    • Treatment Response Prediction: Forecasting how patients with specific gene expression profiles might respond to treatment (TreatmentResponse).
    • Biomarker Discovery: Identifying gene expression levels (e.g., Gene_X_Expression, Gene_Y_Expression) that show strong correlations with disease or treatment response.
    • Feature Engineering and Selection: Evaluating the importance of various features in the dataset and creating new ones to enhance model performance.
    • Data Visualization: Generating visualizations to explore relationships between gene expression data and demographic/clinical factors.
    • Regression and Correlation Analyses: Quantitatively examining the effects of factors like age and smoking status on gene expression levels.

    Why Use This Dataset?

    • Privacy Secure: Being entirely synthetic, it carries no privacy or ethical concerns associated with real patient data.
    • Diversity: The mix of numerical and categorical variables offers a rich ground for experimenting with different analytical techniques.
    • Predictive Potential: Clear target variables like DiseaseStatus and TreatmentResponse make it suitable for developing classification and regression models.
    • Educational and Learning: Suited to applying fundamental data science and machine learning concepts for anyone interested in the bioinformatics domain.
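
The integer codes described for the categorical columns can be decoded before analysis. The sketch below uses only the Python standard library. The header names are assumed to match the English column names in the description, the three sample rows are made up for illustration, and the helper names are hypothetical; the actual file in the zip archive may differ.

```python
import csv
import io
from collections import defaultdict

# Code-to-label mappings as given in the column descriptions.
GENDER = {0: "Female", 1: "Male"}
SMOKING = {0: "Non-smoker", 1: "Ex-smoker", 2: "Current smoker"}
DISEASE = {0: "Healthy", 1: "Disease A", 2: "Disease B"}
RESPONSE = {0: "No Response", 1: "Partial Response", 2: "Full Response"}

# Made-up rows for illustration only; not taken from the real dataset.
SAMPLE_CSV = """PatientID,Age,Gender,Gene_X_Expression,Gene_Y_Expression,SmokingStatus,DiseaseStatus,TreatmentResponse
1,54,0,2.5,1.0,2,1,1
2,37,1,1.5,0.5,0,0,2
3,61,1,3.5,1.5,1,1,0
"""


def decode_rows(handle):
    """Read CSV rows and replace integer codes with readable labels."""
    rows = []
    for raw in csv.DictReader(handle):
        rows.append({
            "PatientID": raw["PatientID"],
            "Age": int(raw["Age"]),
            "Gender": GENDER[int(raw["Gender"])],
            "Gene_X_Expression": float(raw["Gene_X_Expression"]),
            "Gene_Y_Expression": float(raw["Gene_Y_Expression"]),
            "SmokingStatus": SMOKING[int(raw["SmokingStatus"])],
            "DiseaseStatus": DISEASE[int(raw["DiseaseStatus"])],
            "TreatmentResponse": RESPONSE[int(raw["TreatmentResponse"])],
        })
    return rows


def mean_expression_by_disease(rows, gene="Gene_X_Expression"):
    """Average one gene's expression level within each disease group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["DiseaseStatus"]].append(row[gene])
    return {status: sum(vals) / len(vals) for status, vals in groups.items()}


rows = decode_rows(io.StringIO(SAMPLE_CSV))
means = mean_expression_by_disease(rows)
print(means)
```

To run this against the real file, pass an open file handle to `decode_rows` in place of the `io.StringIO` sample, after confirming the header names.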

  10. Life Science Analytics Software Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Oct 27, 2025
    Cite
    Data Insights Market (2025). Life Science Analytics Software Report [Dataset]. https://www.datainsightsmarket.com/reports/life-science-analytics-software-543718
    Available download formats: ppt, pdf, doc
    Dataset updated
    Oct 27, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global Life Science Analytics Software market is projected to experience substantial growth, driven by an estimated market size of $12,500 million in 2025 and a Compound Annual Growth Rate (CAGR) of 12%. This robust expansion is fueled by an increasing volume of complex biological data generated through advancements in genomics, proteomics, and clinical research. The insatiable demand for faster drug discovery, optimized clinical trial management, and more efficient laboratory operations are key drivers. Furthermore, the burgeoning field of bioinformatics, powered by sophisticated analytical tools, is enabling researchers to derive deeper insights from vast datasets, accelerating the development of novel therapeutics and personalized medicine. The integration of artificial intelligence and machine learning into these platforms is further enhancing predictive capabilities and automating data analysis, making them indispensable for modern life science organizations. The market is characterized by several significant trends. The rising adoption of cloud-based analytics solutions is facilitating greater scalability, accessibility, and collaboration among researchers globally. There is also a noticeable shift towards specialized analytics software tailored for specific applications like drug discovery informatics and clinical trial management, offering more targeted and efficient solutions. However, the market faces certain restraints, including the high cost of implementing and maintaining advanced analytics software, concerns regarding data security and privacy, and a shortage of skilled data scientists with expertise in life sciences. 
Despite these challenges, the continuous innovation in software features, coupled with increasing investments in research and development by key players such as Revvity, IBM Corporation, and Thermo Fisher Scientific, is expected to propel the market forward, with Asia Pacific poised to emerge as a rapidly growing region due to its expanding healthcare infrastructure and increasing R&D investments. This in-depth report provides a holistic view of the global Life Science Analytics Software market, meticulously analyzing its trajectory from 2019 to 2033, with a specific focus on the Base Year of 2025 and the Forecast Period of 2025-2033. The Historical Period of 2019-2024 has been thoroughly reviewed to establish foundational market dynamics. The report offers actionable insights, market valuations in the millions, and strategic recommendations for stakeholders navigating this dynamic sector.

  11. 2025 Green Card Report for Master Of Science In Bioinformatics

    • myvisajobs.com
    Updated Jan 16, 2025
    + more versions
    Cite
    MyVisaJobs (2025). 2025 Green Card Report for Master Of Science In Bioinformatics [Dataset]. https://www.myvisajobs.com/reports/green-card/major/master-of-science-in-bioinformatics
    Dataset updated
    Jan 16, 2025
    Dataset authored and provided by
    MyVisaJobs
    License

    https://www.myvisajobs.com/terms-of-service/

    Variables measured
    Major, Salary, Petitions Filed
    Description

    A dataset that explores Green Card sponsorship trends, salary data, and employer insights for master of science in bioinformatics in the U.S.

  12. The Network for Integrating Bioinformatics into Life Sciences Education...

    • qubeshub.org
    Updated Jul 23, 2020
    Cite
    Anne Rosenwald; Elizabeth Dinsdale; William Morgan; Mark Pauley; William Tapprich; Eric Triplett; Jason Williams (2020). The Network for Integrating Bioinformatics into Life Sciences Education (NIBLSE): Barriers to Integration [Dataset]. http://doi.org/10.25334/NHB4-X766
    Dataset updated
    Jul 23, 2020
    Dataset provided by
    QUBES
    Authors
    Anne Rosenwald; Elizabeth Dinsdale; William Morgan; Mark Pauley; William Tapprich; Eric Triplett; Jason Williams
    Description

    The Network for Integrating Bioinformatics into Life Sciences Education (NIBLSE) seeks to promote the use of bioinformatics and data science as a way to teach biology.

  13. The GitHub repository for an integrative analysis of genomic plasticity in...

    • zenodo.org
    zip
    Updated Jan 24, 2020
    Cite
    Rayna Harris; Hsin-Yi Kao; Juan Marcos Alarcon; Hans Hofmann; Andre Fenton (2020). The GitHub repository for an integrative analysis of genomic plasticity in the hippocampus [Dataset]. http://doi.org/10.5281/zenodo.810407
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rayna Harris; Hsin-Yi Kao; Juan Marcos Alarcon; Hans Hofmann; Andre Fenton
    Description

    Cost-effective next-generation sequencing has made unbiased gene expression investigations possible. Gene expression studies at the level of single neurons may be especially important for understanding nervous system structure and function because of neuron-specific functionality and plasticity. While cellular dissociation is a prerequisite technical manipulation for such single-cell studies, the extent to which the process of dissociating cells affects neural gene expression has not been determined. Here, we examine the effect of cellular dissociation on gene expression in the mouse hippocampus. We also determine the extent to which such changes might confound studies on the behavioral and physiological functions of the hippocampus.

    This dataset contains the data, software, and results that accompany a manuscript that is in the process of submission to the journal Hippocampus.

  14. Bioinformatics Services Market - Global Growth Opportunities 2024-2030

    • htfmarketinsights.com
    pdf & excel
    Updated Oct 7, 2025
    + more versions
    Cite
    HTF Market Intelligence (2025). Bioinformatics Services Market - Global Growth Opportunities 2024-2030 [Dataset]. https://htfmarketinsights.com/report/4013511-bioinformatics-services-market
    Available download formats: pdf & excel
    Dataset updated
    Oct 7, 2025
    Dataset authored and provided by
    HTF Market Intelligence
    License

    https://www.htfmarketinsights.com/privacy-policy

    Time period covered
    2019 - 2031
    Area covered
    Global
    Description

    Global Bioinformatics Services Market is segmented by Application (Pharmaceutical Companies, Biotech Companies, Research Institutions), Type (Biotechnology, Life Sciences, Genomics, Bioinformatics, Data Science), and Geography (North America, LATAM, West Europe, Central & Eastern Europe, Northern Europe, Southern Europe, East Asia, Southeast Asia, South Asia, Central Asia, Oceania, MEA)

  15. PARSING FASTA AND GENBANK FILES

    • kaggle.com
    zip
    Updated Nov 25, 2025
    Cite
    Dr. Nagendra (2025). PARSING FASTA AND GENBANK FILES [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/parsing-fasta-and-genbank-files
    Available download formats: zip (17972831 bytes)
    Dataset updated
    Nov 25, 2025
    Authors
    Dr. Nagendra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset is a dedicated resource for learning how to parse core bioinformatics file formats. It contains representative samples of FASTA and GenBank files. The goal is to provide raw data for practicing essential data extraction skills. FASTA files contain sequence data, such as DNA, RNA, or protein, in a simple text format. GenBank files include detailed sequence annotations, features, and metadata. This is an ideal starting point for anyone learning Biopython or general sequence manipulation in genomics.
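
As an illustration of the FASTA layout described above (a `>` header line followed by one or more sequence lines), here is a minimal parser sketch using only the Python standard library. The function name and the two example records are made up for illustration. In practice Biopython's `Bio.SeqIO.parse(handle, "fasta")` is the usual tool, and GenBank files are better left to such a library because their annotation blocks are much more involved.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up example input; not taken from the dataset.
example = io.StringIO(
    ">seq1 example DNA\nATGC\nGGTA\n>seq2 example protein\nMKV\n"
)
records = list(parse_fasta(example))
for name, seq in records:
    print(name, len(seq))
```

Writing the parser as a generator keeps memory use low for large files, since only one record is held at a time.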

  16. SARS-CoV-2 Surface glycoproteins Alignment Data

    • data.mendeley.com
    Updated Aug 20, 2021
    + more versions
    Cite
    Done Stojanov (2021). SARS-CoV-2 Surface glycoproteins Alignment Data [Dataset]. http://doi.org/10.17632/btb5ffk247.1
    Dataset updated
    Aug 20, 2021
    Authors
    Done Stojanov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    1. SARS-CoV-2SpikeProteinMutations.docx contains data on mutations found in aligned SARS-CoV-2 surface glycoproteins.
    2. SARS-CoV-2SpikeProteinVariants.docx contains data on computed SARS-CoV-2 surface glycoprotein variants in Europe.
  17. Drosophila Melanogaster Genome

    • kaggle.com
    • ieee-dataport.org
    zip
    Updated Nov 17, 2019
    Cite
    Myles O'Neill (2019). Drosophila Melanogaster Genome [Dataset]. https://www.kaggle.com/mylesoneill/drosophila-melanogaster-genome
    Available download formats: zip (136202106 bytes)
    Dataset updated
    Nov 17, 2019
    Authors
    Myles O'Neill
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Drosophila Melanogaster

    Drosophila Melanogaster, the common fruit fly, is a model organism which has been extensively used in entomological research. It is one of the most studied organisms in biological research, particularly in genetics and developmental biology.

    When it's not being used for scientific research, D. melanogaster is a common pest in homes, restaurants, and anywhere else that serves food. They are not to be confused with Tephritidae flies (also known as fruit flies).

    https://en.wikipedia.org/wiki/Drosophila_melanogaster

    About the Genome

    This genome was first sequenced in 2000. It contains four pairs of chromosomes (2,3,4 and X/Y). More than 60% of the genome appears to be functional non-protein-coding DNA.

    ![D. melanogaster chromosomes][1]

    The genome is maintained and frequently updated at [FlyBase][2]. This dataset is sourced from the UCSC Genome Bioinformatics download page. It uses the August 2014 version of the D. melanogaster genome (dm6, BDGP Release 6 + ISO1 MT). http://hgdownload.soe.ucsc.edu/downloads.html#fruitfly

    Files were modified by Kaggle to be a better fit for analysis on Scripts. This primarily involved turning files into CSV format, with a header row, as well as converting the genome itself from 2bit format into a FASTA sequence file.

    Bioinformatics

    Genomic analysis can be daunting to data scientists who haven't had much experience with bioinformatics before. We have tried to give basic explanations of each of the files in this dataset, as well as links to further reading on the biological basis for each. If you haven't had the chance to study much biology before, some light reading (e.g. Wikipedia) on the following topics may help you understand the nuances of the data provided here: [Genetics][3], [Genomics][4], [Chromosomes][7], [DNA][8], [RNA][9], [Genes][12], [Alleles][13], [Exons][14], [Introns][15], [Transcription][16], [Translation][17], [Peptides][18], [Proteins][19], [Gene Regulation][20], [Mutation][21], [Phylogenetics][22], and [SNPs][23].

    Of course, if you've got some idea of the basics already - don't be afraid to jump right in!

    Learning Bioinformatics

    There are a lot of great resources for learning bioinformatics on the web. One cool site is [Rosalind][24] - a platform that gives you bioinformatic coding challenges to complete. You can use Kaggle Scripts on this dataset to easily complete the challenges on Rosalind (and see [Myles' solutions here][25] if you get stuck). We have set up [Biopython][26] on Kaggle's docker image which is a great library to help you with your analyses. Check out their [tutorial here][27] and we've also created [a python notebook with some of the tutorial applied to this dataset][28] as a reference.

    Files in this Dataset

    Drosophila Melanogaster Genome

    • genome.fa

    The assembled genome itself is presented here in [FASTA format][29]. Each chromosome is a separate sequence of nucleotides. Repeats from RepeatMasker and Tandem Repeats Finder (with a period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case.
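The lower-case convention described above (often called soft masking) can be checked directly from the FASTA text. The following is a minimal sketch, not part of the original dataset description: it assumes a plain FASTA layout (a `>` header line followed by sequence lines) and uses a small in-memory example rather than the real `genome.fa`; the function name `soft_masked_fraction` and the record names are hypothetical.

```python
import io


def soft_masked_fraction(handle):
    """Yield (name, total_bases, lowercase_fraction) for each FASTA record.

    Lower-case bases are counted as repeat-masked, following the
    convention stated in the dataset description.
    """
    name, total, lower = None, 0, 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, total, (lower / total if total else 0.0)
            header = line[1:].split()
            name = header[0] if header else ""
            total, lower = 0, 0
        else:
            total += len(line)
            lower += sum(1 for base in line if base.islower())
    # Emit the final record.
    if name is not None:
        yield name, total, (lower / total if total else 0.0)


# Tiny made-up example; the real file would be opened with open("genome.fa").
example = io.StringIO(">chrDemo\nACGTacgtAC\nGTac\n>chrEmpty\n")
results = list(soft_masked_fraction(example))
for record_name, n_bases, fraction in results:
    print(record_name, n_bases, round(fraction, 3))
```

Because the function streams line by line, it should also work on a large file without loading a whole chromosome into memory at once.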

    Meta Information

    There are 3 additional files with meta information about the genome.

    • meta-cpg-island-ext-unmasked.csv

    This file contains descriptive information about CpG Islands in the genome.

    https://en.wikipedia.org/wiki/CpG_site

    • meta-cytoband.csv

    This file describes the positions of cytogenetic bands on each chromosome.

    https://en.wikipedia.org/wiki/Cytogenetics

    • meta-simple-repeat.csv

    This file describes simple tandem repeats in the genome.

    https://en.wikipedia.org/wiki/Repeated_sequence_(DNA) https://en.wikipedia.org/wiki/Tandem_repeat

    Drosophila Melanogaster mRNA Sequences

    Messenger RNA (mRNA) is an intermediate molecule created as part of the cellular process of converting genomic information into proteins. Some mRNA are never translated into proteins and have functional roles in the cell on their own. Collectively, an organism's mRNA information is known as its transcriptome. The mRNA files included in this dataset give insight into the activity of genes in the organism.

    https://en.wikipedia.org/wiki/Messenger_RNA

    • mrna-genbank.fa

    This file includes all mRNA sequences from GenBank associated with Drosophila Melanogaster.

    http://www.ncbi.nlm.nih.gov/genbank/

    • mrna-refseq.fa

    This file includes all mRNA sequences from RefSeq associated with Drosophila Melanogaster.

    http://www.ncbi.nlm.nih.gov/refseq/

    Gene Predictions

    A gene is a segment of DNA on the genome which, through mRNA, is used to create proteins in the organism. Knowing which parts of DNA are coding (genes) or non-coding is difficult, and a number of different systems for prediction exist. This da...

  18. Cell_Gene_Expression_Metadata

    • kaggle.com
    zip
    Updated Sep 24, 2025
    Cite
    Kazi Aishikuzzaman (2025). Cell_Gene_Expression_Metadata [Dataset]. https://www.kaggle.com/datasets/kaziaishikuzzaman/cell-gene-expression-metadata
    Explore at:
    Available download formats: zip (845887409 bytes)
    Dataset updated
    Sep 24, 2025
    Authors
    Kazi Aishikuzzaman
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This dataset contains comprehensive metadata from single-cell gene expression studies, providing researchers with structured information about cellular phenotypes, experimental conditions, and sample characteristics. The data is particularly valuable for bioinformatics research, machine learning applications in genomics, and comparative studies across different cell types and conditions.

    Dataset Description

    The dataset comprises metadata associated with single-cell RNA sequencing (scRNA-seq) experiments, including:

    • Cell Type Information: Classification of different cell types and subtypes
    • Experimental Metadata: Details about experimental conditions, protocols, and methodologies
    • Sample Characteristics: Information about biological samples, including tissue origin, developmental stages, and treatment conditions
    • Quality Metrics: Data quality indicators and filtering parameters
    • Annotation Details: Standardized cell type annotations and biological classifications

    Data Source and Licensing

    This dataset is derived from publicly available single-cell gene expression data, potentially sourced from:

    • CELLxGENE Data Portal (https://cellxgene.cziscience.com/)
    • Gene Expression Omnibus (GEO)
    • European Bioinformatics Institute (EBI)
    • Other public genomics repositories

    License: Creative Commons CC BY 4.0 (or specify the actual license)

    • ✅ Commercial use allowed
    • ✅ Modification allowed
    • ✅ Distribution allowed
    • ✅ Private use allowed
    • ❗ Attribution required

    Research Applications

    • Cell Type Discovery: Identify novel cell types and subtypes
    • Comparative Genomics: Study cellular differences across conditions, tissues, or species
    • Disease Research: Investigate cellular changes in disease states
    • Developmental Biology: Analyze cellular differentiation and development patterns

    Machine Learning Applications

    • Classification Tasks: Predict cell types from gene expression data
    • Clustering Analysis: Discover cellular subpopulations and states
    • Dimensionality Reduction: Apply PCA, t-SNE, UMAP for visualization
    • Biomarker Discovery: Identify genes characteristic of specific cell types

    Educational Use

    • Teaching bioinformatics and computational biology concepts.
    • Demonstrating single-cell analysis workflows.
    • Training in data preprocessing and quality control.

    Data Quality and Preprocessing

    • Quality Control: Metadata has been curated and standardized
    • Missing Values: [Specify how missing values are handled]
    • Standardization: Cell type annotations follow established ontologies (e.g., Cell Ontology)
    • Validation: Data has been cross-referenced with original publications

    Usage Guidelines

    Getting Started

    • Load the metadata files using pandas or your preferred data analysis tool.
    • Explore the cell type distributions and experimental conditions.
    • Filter data based on quality metrics as needed.
    • Join with corresponding gene expression data for comprehensive analysis.

    Best Practices

    • Always cite original data sources and publications.
    • Consider batch effects when combining data from different experiments.
    • Validate findings with independent datasets when possible.
    • Follow established bioinformatics workflows for single-cell analysis.

    Citation and Acknowledgments

    If you use this dataset in your research, please cite this dataset:

    Kazi Aishikuzzaman. (2024). Cell Gene Expression Metadata. Kaggle. https://www.kaggle.com/datasets/kaziaishikuzzaman/cell-gene-expression-metadata

    File Structure

    dataset

    • metadata_summary.csv: Main metadata file
    • cell_type_annotations.csv: Detailed cell type information
    • experimental_conditions.csv: Experiment-specific metadata
    • quality_metrics.csv: Data quality indicators
    • README.txt: Detailed file descriptions

    Technical Specifications

    • File Encoding: UTF-8
    • Separator: Comma-separated values (CSV)
    • Missing Values: Represented as 'NA' or empty cells
    • Data Types: Mixed (categorical, numerical, text)
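The technical specifications above (UTF-8, comma-separated, missing values written as 'NA' or left empty) can be illustrated with a short loading sketch. This is only an assumption-laden example using Python's standard `csv` module: the column names (`cell_id`, `cell_type`, `tissue`) and the sample rows are hypothetical, since the listing does not document the actual columns of `metadata_summary.csv`.

```python
import csv
import io

# Markers the description says represent missing values.
MISSING = {"", "NA"}


def clean(value):
    """Map 'NA', empty strings, and absent fields to None; keep everything else."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip() in MISSING:
        return None
    return value


def load_rows(handle):
    """Read a comma-separated file with a header row into a list of dicts."""
    reader = csv.DictReader(handle)
    return [{key: clean(value) for key, value in row.items()} for row in reader]


def missing_counts(rows):
    """Count missing values per column."""
    counts = {}
    for row in rows:
        for key, value in row.items():
            counts[key] = counts.get(key, 0) + int(value is None)
    return counts


# Hypothetical sample; a real file would be opened with
# open("metadata_summary.csv", encoding="utf-8", newline="").
sample = io.StringIO(
    "cell_id,cell_type,tissue\n"
    "c1,T cell,blood\n"
    "c2,NA,\n"
    "c3,B cell,NA\n"
)
rows = load_rows(sample)
counts = missing_counts(rows)
print(counts)
```

Normalising both markers to `None` up front keeps later filtering and joining steps from treating the string 'NA' as a real category.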

    Contact and Support

    For questions about this dataset:

    • Kaggle Profile: @kaziaishikuzzaman
    • Dataset Issues: Use Kaggle's discussion section
    • Collaboration: Open to research collaborations and improvements

    Version History

    • v1.0: Initial release with comprehensive metadata collection
    • [Future versions]: Updates and additional annotations as available

    Related Datasets

    Consider exploring these complementary datasets:

    • Single-cell gene expression data (companion to this metadata)
    • Cell atlas datasets from major consortiums
    • Disease-specific single-cell studies
    • Multi-omics datasets with matching cell types

    Keywords: single-cell, RNA-seq, genomics, cell types, metadata, bioinformatics, machine learning, computational biology

    Category: Biology > Genomics

  19. QIIME 2 Tutorial Data

    • registry.opendata.aws
    Updated Jan 23, 2019
    Cite
    Caporaso Lab (2019). QIIME 2 Tutorial Data [Dataset]. https://registry.opendata.aws/qiime2/
    Explore at:
    Dataset updated
    Jan 23, 2019
    Dataset provided by
    Caporaso Lab
    Description

    QIIME 2 (pronounced “chime two”) is a microbiome multi-omics bioinformatics and data science platform that is trusted, free, open source, extensible, and community developed and supported.

  20. SARS-CoV-2 Surface glycoprotein Alignment Data Mendeley

    • data.mendeley.com
    • narcis.nl
    Updated Sep 20, 2021
    Cite
    Done Stojanov (2021). SARS-CoV-2 Surface glycoprotein Alignment Data Mendeley [Dataset]. http://doi.org/10.17632/k7sy3sk7rx.2
    Explore at:
    Dataset updated
    Sep 20, 2021
    Authors
    Done Stojanov
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    1. SARS-CoV-2SpikeProteinMutations.xlsx contains data on mutations found in aligned SARS-CoV-2 surface glycoproteins.