100+ datasets found
  1. Bioinformatics Protein Dataset - Simulated

    • kaggle.com
    zip
    Updated Dec 27, 2024
    + more versions
    Cite
    Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated
    Explore at:
    Available download formats: zip (12928905 bytes)
    Dataset updated
    Dec 27, 2024
    Authors
    Rafael Gallo
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Subtitle

    "Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

    Description

    Introduction

    This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

    Columns Included

    • ID_Protein: Unique identifier for each protein.
    • Sequence: String of amino acids.
    • Molecular_Weight: Molecular weight calculated from the sequence.
    • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
    • Hydrophobicity: Average hydrophobicity calculated from the sequence.
    • Total_Charge: Sum of the charges of the amino acids in the sequence.
    • Polar_Proportion: Percentage of polar amino acids in the sequence.
    • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
    • Sequence_Length: Total number of amino acids in the sequence.
    • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

    Inspiration and Sources

    While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:

    • UniProt: A comprehensive database of protein sequences and annotations.
    • Kyte-Doolittle Scale: Calculations of hydrophobicity.
    • Biopython: A tool for analyzing biological sequences.

    Proposed Uses

    This dataset is ideal for:

    • Training classification models for proteins.
    • Exploratory analysis of physicochemical properties of proteins.
    • Building machine learning pipelines in bioinformatics.

    How This Dataset Was Created

    1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
    2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
    3. Class Assignment: Classes were randomly assigned for classification purposes.
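
    The three steps above can be sketched in plain Python. This is an illustrative sketch, not the author's generation script: the description says Biopython was used, whereas this version uses only the standard library and the published Kyte-Doolittle hydropathy values. The function name `make_protein`, the ID format, the nonpolar residue grouping, and the seed are all assumptions made for the example.

```python
import random

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = "".join(KYTE_DOOLITTLE)
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]
# Assumption: one common grouping of nonpolar residues; the dataset's
# own definition of "polar" and "nonpolar" is not stated.
NONPOLAR = set("AVLIMFWPG")


def make_protein(rng, protein_id):
    """Build one synthetic record with a subset of the listed columns."""
    # Step 1: random chain of 50 to 300 residues.
    length = rng.randint(50, 300)
    seq = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    # Step 2: properties derived from the sequence.
    nonpolar_fraction = sum(aa in NONPOLAR for aa in seq) / length
    return {
        "ID_Protein": protein_id,  # hypothetical ID format
        "Sequence": seq,
        "Sequence_Length": length,
        "Hydrophobicity": sum(KYTE_DOOLITTLE[aa] for aa in seq) / length,
        "Nonpolar_Proportion": 100 * nonpolar_fraction,
        "Polar_Proportion": 100 * (1 - nonpolar_fraction),
        # Step 3: class drawn at random, independent of the sequence.
        "Class": rng.choice(CLASSES),
    }


rng = random.Random(0)
proteins = [make_protein(rng, f"P{i:05d}") for i in range(5)]
```

    Because the class label is drawn independently of the sequence, a classifier trained on these features should not be expected to do better than chance on held-out data.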

    Limitations

    • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
    • The functional classes are simulated and do not correspond to actual biological characteristics.

    Data Split

    The dataset is divided into two subsets:

    • Training: 16,000 samples (proteinas_train.csv).
    • Testing: 4,000 samples (proteinas_test.csv).
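
    The 16,000 / 4,000 division corresponds to an 80/20 split of the 20,000 proteins. A minimal sketch of such a split, assuming a simple shuffled partition (the description does not say how the published files were actually produced; `train_test_split_ids` and the seed are hypothetical):

```python
import random


def train_test_split_ids(n_samples, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into train and test lists."""
    ids = list(range(n_samples))
    random.Random(seed).shuffle(ids)
    n_test = round(n_samples * test_fraction)
    return ids[n_test:], ids[:n_test]


train_ids, test_ids = train_test_split_ids(20_000)
print(len(train_ids), len(test_ids))  # 16000 4000
```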

    Acknowledgment

    This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.

  2. Bioinformatic databases survey

    • zenodo.org
    csv
    Updated Aug 17, 2024
    Cite
    Alise Ponsero; Bonnie Hurwitz; Kiran Smelser; Karen Valencia; Lucas Jimenez Miranda; Abby McDermott (2024). Bioinformatic databases survey [Dataset]. http://doi.org/10.5281/zenodo.12790448
    Explore at:
    Available download formats: csv
    Dataset updated
    Aug 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alise Ponsero; Bonnie Hurwitz; Kiran Smelser; Karen Valencia; Lucas Jimenez Miranda; Abby McDermott
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Bioinformatic databases survey

    The dataset surveys bioinformatic databases published in the NAR database issue from 1995 to 2022. It evaluates the current number of citations and the availability of each resource.

    Data content

    The dataset is composed of two tables :

    A. Databases table : Contains the information of each database published in the NAR database issue.

    • db_id : Database ID in the dataset
    • resource_name : Name(s) of the database
    • current_access : Latest known web address of the database
    • is_a_pun : The database name is a play on words
    • available_2022 : The database was accessible online during the 2022 survey
    • last_accessible_year : If not accessible, latest point in time where the database was found online (using the Internet Archive web snapshots)
    • unavailable_message : If not accessible, the message/error returned when trying to access the resource
    • year_first_publication : Year of first publication of the database
    • year_last_publication : Year of latest publication of the database (including database update publications)
    • total_citations_2022 : Cumulative number of citations for all articles of the database
    • nb_authors_max : Maximum number of authors on any article published for that database
    • nb_articles_2022 : Number of articles published for that database in 2022

    B. Articles table : Contains the information collected for the NAR articles

    • collector : Person who added this database to the dataset
    • article_global_id : DOI of the article surveyed
    • db_id : Database ID of the resource described in the article
    • article_id : Article unique ID
    • article_year : Article publication year
    • Authors : List of authors of the article, separated by ";"
    • Author.ID : List of ORCIDs of the authors of the article, separated by ";"
    • Title : Title of the article
    • Source.title : Journal name
    • Volume : Volume number
    • Issue : Issue number
    • Funding.Details : Funding information of the article
    • Funding.Text : Funding text provided by the authors
    • PubMed.ID : Pubmed ID of the article
    • citations_2016 : Number of citations of the article in 2016 (if published)
    • citations_2022 : Number of citations of the article in 2022
    • nb_authors : Number of authors in the article
    • Index.Keywords : Keywords associated to the publication
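
    The two tables are linked through db_id, so per-article values can be rolled up to the database level. A minimal sketch of that join, assuming CSV input; the rows below are invented for illustration, and only the column names come from the description above (the encoding of available_2022 as TRUE/FALSE is also an assumption):

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows; only the column names are taken from the description.
databases_csv = """db_id,resource_name,available_2022
1,ExampleDB,TRUE
2,OtherDB,FALSE
"""
articles_csv = """article_id,db_id,article_year,citations_2022
a1,1,1998,120
a2,1,2005,80
a3,2,2010,15
"""

# Sum article-level citations per database ID.
citations_by_db = defaultdict(int)
for row in csv.DictReader(io.StringIO(articles_csv)):
    citations_by_db[row["db_id"]] += int(row["citations_2022"])

# Attach the totals to database names via the shared db_id key.
summary = {
    row["resource_name"]: citations_by_db[row["db_id"]]
    for row in csv.DictReader(io.StringIO(databases_csv))
}
print(summary)  # {'ExampleDB': 200, 'OtherDB': 15}
```

    Totals computed this way should correspond to what the description calls total_citations_2022, provided the articles table is complete for each database.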

    Data sources

    Note that the presented dataset leverages and expands on the dataset gathered and published in Imker, H.J., 2020. Who Bears the Burden of Long-Lived Molecular Biology Databases?. Data Science Journal, 19(1), p.8. The original dataset collected by Dr. Imker is available at: https://doi.org/10.13012/B2IDB-4311325_V1

    The dataset was collected and is maintained by undergraduate students of a CURE class (Course-based Undergraduate Research Experience) held at the University of Arizona. All students of the class participated in the collection, update and curation of the dataset, which is available as a database and a web portal at https://hurwitzlab.shinyapps.io/DS_Heroes/. Students could choose whether to be listed as authors of this Zenodo repository.

    The CURE class BAT102 "Data Science Heroes: An undergraduate research experience in Open Data Science Practices" gives the students an opportunity to learn about open science and investigate open data practices in bioinformatics through a survey of the databases published in the NAR database issue.

  3. Bioinformatics Services Market Industry Forecast 2034

    • polarismarketresearch.com
    Updated Aug 26, 2025
    Cite
    Polaris Market Research & Consulting, Inc. (2025). Bioinformatics Services Market Industry Forecast 2034 [Dataset]. https://www.polarismarketresearch.com/industry-analysis/bioinformatics-services-market
    Explore at:
    Dataset updated
    Aug 26, 2025
    Dataset authored and provided by
    Polaris Market Research & Consulting, Inc.
    License

    https://www.polarismarketresearch.com/privacy-policy

    Description

    Bioinformatics Services Market will grow from USD 4,399.58 Million to USD 16,297.10 Million by 2034, showing an impressive CAGR of 15.7%.
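
    The listing gives the start value, end value, and CAGR but not the base year. As a rough consistency check, the compound-growth relation end = start × (1 + CAGR)^years can be solved for the number of years. The variable names are illustrative only:

```python
import math

start_musd = 4399.58   # USD million, from the description
end_musd = 16297.10    # USD million, from the description
cagr = 0.157           # 15.7%, from the description

growth_multiple = end_musd / start_musd
# Solve end = start * (1 + cagr) ** years for years.
implied_years = math.log(growth_multiple) / math.log(1 + cagr)
print(f"multiple: {growth_multiple:.2f}, implied years: {implied_years:.1f}")
```

    The result is close to nine years, which would be consistent with a 2025 base year, although the listing itself does not state one.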

  4. Properties of bioinformatically identified candidate antigens and previously...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jun 9, 2016
    Cite
    Eren, Hasan; Karagenc, Tulin; Kinnaird, Jane; Bakırcı, Serkan; Tait, Andrew; Weir, William; Shiels, Brian; Bilgic, Huseyin Bilgin (2016). Properties of bioinformatically identified candidate antigens and previously identified antigens [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001591609
    Explore at:
    Dataset updated
    Jun 9, 2016
    Authors
    Eren, Hasan; Karagenc, Tulin; Kinnaird, Jane; Bakırcı, Serkan; Tait, Andrew; Weir, William; Shiels, Brian; Bilgic, Huseyin Bilgin
    Description

    Properties of bioinformatically identified candidate antigens and previously identified antigens

  5. Appendix V - Bioinformatic pipelines/scripts

    • figshare.le.ac.uk
    txt
    Updated Jun 2, 2020
    Cite
    Ihthisham Ali (2020). Appendix V - Bioinformatic pipelines/scripts [Dataset]. http://doi.org/10.25392/leicester.data.12363785.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 2, 2020
    Dataset provided by
    University of Leicester
    Authors
    Ihthisham Ali
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    PhD thesis: PRDM9 Diversity, Recombination Landscapes and Childhood Leukaemia by Ihthisham Ali. Appendix V contains various data pipelines and scripts used for the remapping of Illumina HiSeq2000 dataset to known PRDM9 ZnF arrays, read depth and variant calling vcf file generation, haplotype estimation and imputation of FIGNL1 coding variants in relation to the British ALL cohort, de novo assembly of read data and mapping of MinION read data.

    A. ALL study phasing and imputation
    B. Illumina HiSeq 2000 dataset - Read depth (DP) and variant calling pipeline
    C. Illumina HiSeq 2000 dataset - data treatment
    D. VelvetOptimiser best k-mer determination log (exemplary)
    E. Alignment of contigs generated by Velvet de novo assembly for the PRDM9 A/A carrier and aligned against the PRDM9 A ZnF array
    F. MinION nanopore reads - minimap2 pipeline

  6. Features of bioinformatically-defined Mycobacteriophage endolysin domains.

    • plos.figshare.com
    xls
    Updated Jun 4, 2023
    Cite
    Kimberly M. Payne; Graham F. Hatfull (2023). Features of bioinformatically-defined Mycobacteriophage endolysin domains. [Dataset]. http://doi.org/10.1371/journal.pone.0034052.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Kimberly M. Payne; Graham F. Hatfull
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Features of bioinformatically-defined Mycobacteriophage endolysin domains.

  7. Bioinformatics for Researchers in Life Sciences: Tools and Learning...

    • data.iadb.org
    csv, pdf
    Updated Apr 10, 2025
    Cite
    IDB Datasets (2025). Bioinformatics for Researchers in Life Sciences: Tools and Learning Resources [Dataset]. http://doi.org/10.60966/kwvb-wr19
    Explore at:
    Available download formats: csv (276253), pdf (2989058), csv (355108)
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    IDB Datasets
    License

    Attribution-NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0) (https://creativecommons.org/licenses/by-nc-nd/3.0/)
    License information was derived automatically

    Time period covered
    Jan 1, 2020 - Jan 1, 2021
    Description

    The COVID-19 pandemic has shown that bioinformatics--a multidisciplinary field that combines biological knowledge with computer programming concerned with the acquisition, storage, analysis, and dissemination of biological data--has a fundamental role in scientific research strategies in all disciplines involved in fighting the virus and its variants. It aids in sequencing and annotating genomes and their observed mutations; analyzing gene and protein expression; simulation and modeling of DNA, RNA, proteins and biomolecular interactions; and mining of biological literature, among many other critical areas of research. Studies suggest that bioinformatics skills in the Latin American and Caribbean region are relatively incipient, and thus its scientific systems cannot take full advantage of the increasing availability of bioinformatic tools and data. This dataset is a catalog of bioinformatics software for researchers and professionals working in life sciences. It includes more than 300 different tools for varied uses, such as data analysis, visualization, repositories and databases, data storage services, scientific communication, marketplace and collaboration, and lab resource management. Most tools are available as web-based or desktop applications, while others are programming libraries. It also includes 10 suggested entries for other third-party repositories that could be of use.

  8. Bioinformatics Services Market Report

    • marketresearchforecast.com
    doc, pdf, ppt
    Updated Oct 24, 2025
    Cite
    Market Research Forecast (2025). Bioinformatics Services Market Report [Dataset]. https://www.marketresearchforecast.com/reports/bioinformatics-services-market-10291
    Explore at:
    Available download formats: ppt, pdf, doc
    Dataset updated
    Oct 24, 2025
    Dataset authored and provided by
    Market Research Forecast
    License

    https://www.marketresearchforecast.com/privacy-policy

    Time period covered
    2026 - 2034
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The size of the Bioinformatics Services Market was valued at USD XX Billion in 2023 and is projected to reach USD XXX Billion by 2032, with an expected CAGR of 16.5% during the forecast period.

    Recent developments include:

    • June 2023 – Psomagen added a new sequencing platform, the Pacific Biosciences Revio system, to offer services such as whole genome, whole exome, single cell and bulk RNAseq, microbiome, Olink Proteomics, and others.
    • August 2023 – PacBio agreed to acquire Apton Biosystems, Inc., to accelerate the development of a next-generation, high-throughput short-read sequencer.
    • March 2023 – Emmes, a Clinical Research Organization (CRO), acquired Essex Management. Essex offers bioinformatics and Health Information Technology (HIT) consulting services to government, private sector and academic organizations.
    • November 2022 – Arima Genomics, Inc. partnered with Basepair to empower scientists with bioinformatic analysis.
    • September 2021 – Dovetail Genomics expanded its epigenetic services in the areas of bioinformatics and target enrichment to offer a one-stop solution.

    Key drivers for this market are: Growing Applications and Research Grants to Surge the Demand for These Services. Potential restraints include: Growing Applications and Research Grants to Surge the Demand for These Services. Notable trends are: Growing Applications and Research Grants to Surge the Demand for These Services.

  9. RMQS1 16S bioinformatic config files and control sample data

    • entrepot.recherche.data.gouv.fr
    application/gzip, tsv +1
    Updated Aug 22, 2024
    + more versions
    Cite
    Sébastien Terrat; Samuel Dequiedt (2024). RMQS1 16S bioinformatic config files and control sample data [Dataset]. http://doi.org/10.57745/XBFOJP
    Explore at:
    Available download formats: tsv (522347), txt (143493), tsv (8814), tsv (33093), tsv (117004), application/gzip (362535), tsv (13212), tsv (32344), tsv (266094), tsv (80032), txt (10413), tsv (16460)
    Dataset updated
    Aug 22, 2024
    Dataset provided by
    Recherche Data Gouv
    Authors
    Sébastien Terrat; Samuel Dequiedt
    License

    https://spdx.org/licenses/etalab-2.0.html

    Dataset funded by
    French National Research Agency (ANR)
    France Génomique
    French Agency for Ecological Transition (ADEME)
    Description

    RMQS: The French Soil Quality Monitoring Network (RMQS) is a national program for the assessment and long-term monitoring of the quality of French soils. This network is based on the monitoring of 2240 sites representative of French soils and their land use. These sites are spread over the whole French territory (metropolitan and overseas) along a systematic square grid of 16 km x 16 km cells. The network covers a broad spectrum of climatic, soil and land-use conditions (croplands, permanent grasslands, woodlands, orchards and vineyards, natural or scarcely anthropogenic land and urban parkland). The first sampling campaign in metropolitan France took place from 2000 to 2009.

    Dataset: This dataset contains config files used to run the bioinformatic pipeline and the control sample data that were not published before. Reference environmental DNA samples named “G4” in internal laboratory processes were added for each molecular analysis. They were used for technical validation, but not necessarily published alongside the datasets. The taxonomy and OTU abundance files for these control samples were built like the taxonomy and abundance file of the main dataset. As these internal control samples were clustered against the RMQS dataset in an open reference fashion, they contained new OTUs (noted as “OUT”) that corresponded to sequences that did not match any of 188,030 RMQS reference sequences. The sample bank association file links each sample to its sequencing library. The G4 metadata file links each G4 to its library, molecular tag and sequence repository information.

    File structure:

    • Taxonomy files rmqs1_control_taxonomy_: Taxonomy is split across five files with one line per site and one column per taxon. Each line sums to 10k (rarefaction threshold). Three supplementary columns are present:
        • Unknown: not matching any reference.
        • Unclassified: missing taxa between genus and phylum.
        • Environmental: matched to sample from environmental study, generally with only a phylum name.
    • rmqs1_16S_otu_abundance.tsv: OTU abundance per site (one column per OTU, “DB” + number for OTUs from the RMQS reference set, “OUT” for OTUs not matching any “DB” ones). Each line sums to 10k (rarefaction threshold).
    • rmqs1_16S_bank_association.tsv: Two-column file with the bank name for each sample.
    • rmqs1_16S_bank_metadata.tsv:
        • library_name: library name used in labs
        • study_accession, sample_accession, experiment_accession, run_accession: SRA EBI identifier
        • library_name_genoscope: library name used in the Genoscope sequence center
        • MID: multiplex identifier sequence
        • run_alias: Genoscope internal alias
        • ftp_link: FTP link to download library
    • Input_G4.txt: Tabulated file containing the parameters and the bioinformatic steps done by the BIOCOM-PIPE pipeline to extract, treat and analyze controls from the raw libraries detailed in rmqs1_16S_bank_metadata.tsv.
    • project_G4.tab: Comma separated file containing the needed information to generate the Input.txt file with the BIOCOM-PIPE pipeline for controls only:
        • PROJECT: Project name chosen by the user
        • LIBRARY_NAME: Library name chosen by the user
        • LIBRARY_NAME_RECEIVED: Library name chosen by the sequencing partner and used by BIOCOM-PIPE
        • SAMPLE_NAME: Sample name chosen by the user
        • MID_F: MID name or MID sequence associated to the Forward primer
        • MID_R: MID name or MID sequence associated to the Reverse primer
        • TARGET: Target gene (16S, 18S, or 23S)
        • PRIMER_F: Forward primer name used for amplification
        • PRIMER_R: Reverse primer name used for amplification
        • SEQUENCE_PRIMER_F: Forward primer sequence used for amplification
        • SEQUENCE_PRIMER_R: Reverse primer sequence used for amplification
    • Input_GLOBAL.txt: Tabulated file containing the parameters and the bioinformatic steps done by the BIOCOM-PIPE pipeline to extract, treat and analyze controls and samples from the raw libraries detailed in rmqs1_16S_bank_metadata.tsv.
    • project_GLOBAL.tab: Comma separated file containing the needed information to generate the Input.txt file for controls and samples with the BIOCOM-PIPE pipeline. Its columns are the same as those of project_G4.tab (PROJECT through SEQUENCE_PRIMER_R).

    Details: Data for three libraries (58, 59 and 69) were re-sequenced and are not detailed in the files. Some samples can be present in several libraries. We kept only the one with the highest number of sequences.
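
    Since every line of the taxonomy and OTU abundance tables is said to sum to 10k, that property can be checked after download. A minimal sketch, assuming a tab-separated layout with a header row and the site identifier in the first column (the actual column layout is not described in detail, and the sample rows and the helper `rows_not_at_depth` are hypothetical):

```python
import csv
import io

RAREFACTION_DEPTH = 10_000

# Hypothetical table: first column is the site, the rest are OTU counts.
sample_tsv = (
    "site\tDB1\tDB2\tOUT1\n"
    "siteA\t6000\t3500\t500\n"
    "siteB\t9000\t900\t50\n"
)


def rows_not_at_depth(handle, depth=RAREFACTION_DEPTH):
    """Return (site, total) for every row whose counts do not sum to depth."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    mismatches = []
    for site, *counts in reader:
        total = sum(int(c) for c in counts)
        if total != depth:
            mismatches.append((site, total))
    return mismatches


print(rows_not_at_depth(io.StringIO(sample_tsv)))  # [('siteB', 9950)]
```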

  10. Data in brief of genome and bioinformatic of vicilins from Vigna unguiculata...

    • data.mendeley.com
    Updated Mar 16, 2023
    + more versions
    Cite
    Antônio Rocha (2023). Data in brief of genome and bioinformatic of vicilins from Vigna unguiculata [Dataset]. http://doi.org/10.17632/7ysf2zbfkt.2
    Explore at:
    Dataset updated
    Mar 16, 2023
    Authors
    Antônio Rocha
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This Data in Brief repository accompanies the article: Genomic and bioinformatic analysis of Vicilin dataset, the 7S globulin from cowpea (Vigna unguiculata) seeds

  11. Bioinformatics Market Report

    • marketresearchforecast.com
    doc, pdf, ppt
    Updated Oct 26, 2025
    Cite
    Market Research Forecast (2025). Bioinformatics Market Report [Dataset]. https://www.marketresearchforecast.com/reports/bioinformatics-market-10292
    Explore at:
    Available download formats: doc, ppt, pdf
    Dataset updated
    Oct 26, 2025
    Dataset authored and provided by
    Market Research Forecast
    License

    https://www.marketresearchforecast.com/privacy-policy

    Time period covered
    2026 - 2034
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The size of the Bioinformatics Market was valued at USD 20.72 Billion in 2023 and is projected to reach USD 64.45 Billion by 2032, with an expected CAGR of 17.6% during the forecast period.

    Recent developments include:

    • October 2023 – Bionl, Inc., a pioneering company in biomedical and bioinformatics research, launched a no-code biomedical research platform that enables researchers, students, and professionals to investigate biomedicine using natural language queries.
    • October 2023 – BioBam Bioinformatics launched OmicsBox 3.1 to empower researchers, scientists, and bioinformaticians in their pursuit of advanced omics data analysis and interpretation.
    • April 2023 – Absci Corp. collaborated with Aster Insights (formerly named M2GEN) to expedite the development of new cancer medicines.
    • December 2022 – Analytical Biosciences Limited partnered with Mission Bio to co-develop bioinformatics packages for translational and clinical research applications in hematological cancers.
    • April 2022 – ATCC signed an agreement with QIAGEN to provide sequencing data from its collection of biological data. QIAGEN Digital Insights aims to establish a database from this information to develop and deliver high-value digital biology content for the biotechnology and pharmaceutical industries.

    Key drivers for this market are: Increased Funding for Genomics Research to Surge Demand for Bioinformatic Solutions. Potential restraints include: Increased Funding for Genomics Research to Surge Demand for Bioinformatic Solutions. Notable trends are: Increased Funding for Genomics Research to Surge Demand for Bioinformatic Solutions.

  12. PATRIC: Bacterial Bioinformatics Resource Center

    • datacatalog.mskcc.org
    Updated Nov 13, 2019
    Cite
    (2019). PATRIC: Bacterial Bioinformatics Resource Center [Dataset]. https://datacatalog.mskcc.org/dataset/10392
    Explore at:
    Dataset updated
    Nov 13, 2019
    Description

    PATRIC (Pathosystems Resource Integration Center) is the Bacterial Bioinformatics Resource Center, an information system designed to support the biomedical research community’s work on bacterial infectious diseases via integration of vital pathogen information with rich data and analysis tools. PATRIC sharpens and hones the scope of available bacterial phylogenomic data from numerous sources specifically for the bacterial research community, in order to save biologists time and effort when conducting comparative analyses. The freely available PATRIC platform provides an interface for biologists to discover data and information and conduct comprehensive comparative genomics and other analyses in a one-stop shop.

  13. Data from: Use of long-read sequencing simulators to assess real-world...

    • datasets.ai
    • agdatacommons.nal.usda.gov
    • +1more
    0
    Updated Mar 30, 2024
    + more versions
    Cite
    Department of Agriculture (2024). Data from: Use of long-read sequencing simulators to assess real-world applications for food safety [Dataset]. https://datasets.ai/datasets/data-from-use-of-long-read-sequencing-simulators-to-assess-real-world-applications-for-foo-35d38
    Explore at:
    0Available download formats
    Dataset updated
    Mar 30, 2024
    Dataset authored and provided by
    Department of Agriculture
    Description

    Shiga toxin-producing Escherichia coli (STEC) and Listeria monocytogenes are responsible for severe foodborne illnesses in the United States. Current identification methods require at least four days to identify STEC and six days for L. monocytogenes. Adoption of long-read, whole genome sequencing for testing could significantly reduce the time needed for identification, but method development costs are high. Therefore, the goal of this project was to use NanoSim-H software to simulate Oxford Nanopore sequencing reads to assess the feasibility of sequencing-based foodborne pathogen detection and guide experimental design. Sequencing reads were simulated for STEC, L. monocytogenes, and a 1:1 combination of STEC and Bos taurus genomes using NanoSim-H. This dataset includes all of the simulated reads generated by the project in fasta format. This dataset can be analyzed bioinformatically or used to test bioinformatic pipelines.
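
    Because the simulated reads are distributed in FASTA format, a small reader is enough to start inspecting them or feeding them to a pipeline under test. A minimal standard-library sketch; the record names and sequences below are invented, and `read_fasta` is a hypothetical helper rather than part of the dataset:

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Invented example records; sequences may span several lines.
example = io.StringIO(">read_1\nACGTAC\nGGT\n>read_2\nTTGACA\n")
lengths = {name: len(seq) for name, seq in read_fasta(example)}
print(lengths)  # {'read_1': 9, 'read_2': 6}
```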

  14. Data from: Whole-genome sequencing and bioinformatic tools powered by...

    • microbiology.figshare.com
    bin
    Updated Aug 11, 2025
    Cite
    Nishitha R Kumar; Tejashree A Balraj; Kerry K Cooper; Akila Prashant (2025). Whole-genome sequencing and bioinformatic tools powered by machine learning to identify antibiotic-resistant genes and virulence factors in Escherichia coli from sepsis [Dataset]. http://doi.org/10.6084/m9.figshare.27204585.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Aug 11, 2025
    Dataset provided by
    Microbiology Society (http://www.microbiologysociety.org/)
    Authors
    Nishitha R Kumar; Tejashree A Balraj; Kerry K Cooper; Akila Prashant
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Supplementary material for 'Whole-genome sequencing and bioinformatic tools powered by machine learning to identify antibiotic-resistant genes and virulence factors in Escherichia coli from sepsis', as described on Microbial Genomics.

  15. Bioinformatic Pipeline Scripts Amplicon Sequencing - Grey Box Grassy...

    • open.flinders.edu.au
    • researchdata.edu.au
    txt
    Updated Aug 8, 2025
    Cite
    Nicole Fickling (2025). Bioinformatic Pipeline Scripts Amplicon Sequencing - Grey Box Grassy Woodlands [Dataset]. http://doi.org/10.25451/flinders.29848280.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Aug 8, 2025
    Dataset provided by
    Flinders University
    Authors
    Nicole Fickling
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    Bioinformatic scripts for the paper titled "Habitat fragmentation shifts soil microbial composition but not richness". Scripts outline the bioinformatic pipeline using DADA2 and QIIME2 to create amplicon sequence variant tables from .fastq files. The .fastq files and sequence metadata are available on the Sequence Read Archive under project number PRJNA1298480.

  16. Data from: Post-bioinformatic methods to identify and reduce the prevalence...

    • datadryad.org
    • search.dataone.org
    zip
    Updated Mar 30, 2021
    Cite
    Lorna Drake; Jordan Cuff (2021). Post-bioinformatic methods to identify and reduce the prevalence of artefacts in metabarcoding data [Dataset]. http://doi.org/10.5061/dryad.2jm63xsp4
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 30, 2021
    Dataset provided by
    Dryad
    Authors
    Lorna Drake; Jordan Cuff
    Time period covered
    Mar 25, 2021
    Description

    Example dataset one: British otter diet

    Faecal samples were collected during otter post-mortems by the Cardiff University Otter Project. Extracted faecal DNA was amplified using two metabarcoding primer pairs designed to amplify regions of the 16S rRNA and cytochrome c oxidase subunit I (COI) genes, each primer having ten-base-pair molecular identifier tags (MID tags) to facilitate post-bioinformatic sample identification. Extraction and PCR negative controls, unused MID tag combinations, repeat samples and mock communities were included alongside the focal eDNA samples. Mock communities comprised standardised mixtures of DNA of marine species not previously detected in the diet of Eurasian otters. The resultant DNA libraries for each marker were sequenced on separate MiSeq V2 chips with 2x250bp paired-end reads.
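
    The role of the ten-base-pair MID tags can be illustrated with a toy demultiplexer. This is a deliberately simplified sketch, not the authors' workflow: it matches a single leading tag exactly, whereas the description says each primer carried a tag. The tag sequences, sample names, and the `demultiplex` helper are all hypothetical:

```python
from collections import defaultdict

TAG_LENGTH = 10  # ten-base-pair MID tags, as stated in the description

# Hypothetical tag-to-sample mapping.
tag_to_sample = {
    "ACGTACGTAC": "otter_01",
    "TGCATGCATG": "negative_control",
}


def demultiplex(reads, tag_to_sample, tag_length=TAG_LENGTH):
    """Group reads by their leading MID tag and strip the tag."""
    by_sample = defaultdict(list)
    for read in reads:
        tag, insert = read[:tag_length], read[tag_length:]
        # Reads with an unknown tag are kept aside rather than dropped.
        sample = tag_to_sample.get(tag, "unassigned")
        by_sample[sample].append(insert)
    return dict(by_sample)


reads = ["ACGTACGTACGGGTTT", "TGCATGCATGCCCAAA", "AAAAAAAAAATTTT"]
result = demultiplex(reads, tag_to_sample)
print(result)
```

    Reads that land under a negative control or an unused tag combination are the kind of signal the controls described above are meant to expose.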

    Example dataset two: cereal crop spider diet

    Money spiders (Bathyphantes, Erigone, Microlinyphia and Tenuiphantes; Araneae: Linyphiidae) and wolf spiders (Pardosa; Ar...

  17.

    Bioinformatics Market Demand, Size and Competitive Analysis | TechSci...

    • techsciresearch.com
    Updated Aug 15, 2025
    Cite
    TechSci Research (2025). Bioinformatics Market Demand, Size and Competitive Analysis | TechSci Research [Dataset]. https://www.techsciresearch.com/report/bioinformatics-market/4279.html
    Dataset updated
    Aug 15, 2025
    Dataset authored and provided by
    TechSci Research
    License

    https://www.techsciresearch.com/privacy-policy.aspx

    Description

    Bioinformatics Market was valued at USD 11.24 Billion in 2024 and is expected to reach USD 22.59 Billion by 2030 with a CAGR of 12.34%.

    Pages: 185
    Market Size (2024): USD 11.24 Billion
    Forecast Market Size (2030): USD 22.59 Billion
    CAGR (2025-2030): 12.34%
    Fastest Growing Segment: Genomics & Proteomics
    Largest Market: North America
    Key Players: 1. 3rd Millennium Inc. 2. Thermo Fisher Scientific, Inc. 3. Agilent Technologies, Inc. 4. BioWisdom Ltd 5. Quest Diagnostics (Celera Corporation) 6. Dassault Systèmes SE 7. Illumina, Inc. 8. Geneva Bioinformatics SA 9. Perkin Elmer, Inc. 10. Lineage Cell Therapeutics (BioTime Inc.)
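
The quoted growth rate can be cross-checked against the two market-size figures. The sketch below assumes the standard compound annual growth rate formula and six compounding years (2024 as the base year, 2030 as the end year); the report does not state how the rate was derived, so the year count is an assumption.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values over a number of years."""
    return (end_value / start_value) ** (1 / years) - 1

# Figures from the listing, in USD billions; the six-year span is an assumption.
rate = cagr(11.24, 22.59, 2030 - 2024)
print(f"{rate:.2%}")
```

Under that assumption the result agrees with the listed 12.34% to two decimal places.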

  18.

    Supplemental Material for the Manuscript "Genomic Characterization and...

    • kilthub.cmu.edu
    xlsx
    Updated May 30, 2023
    Cite
    Ramya Ramadoss; Fajer Almarzooqi; Basem Shomar; Valentin Alekseevich Ilyin; Annette Shoba Vincent (2023). Supplemental Material for the Manuscript "Genomic Characterization and Annotation of two Novel Bacteriophages Isolated from a Wastewater Treatment Plant in Qatar" [Dataset]. http://doi.org/10.1184/R1/16965004.v1
    Available download formats: xlsx
    Dataset updated
    May 30, 2023
    Dataset provided by
    Carnegie Mellon University
    Authors
    Ramya Ramadoss; Fajer Almarzooqi; Basem Shomar; Valentin Alekseevich Ilyin; Annette Shoba Vincent
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Qatar
    Description

    This is the Supplemental Material for the Manuscript "Genomic Characterization and Annotation of two Novel Bacteriophages Isolated from a Wastewater Treatment Plant in Qatar". Sheets "inphared_EscherichiaPhageCL1" and "inphared_EscherichiaPhageC600M2" list all the genomes related to Escherichia Phage CL1 and Escherichia Phage C600M2, respectively, identified using the get_closest_relatives.pl program in the INPHARED package (https://github.com/RyanCook94/inphared).

  19.

    Data from: A large-scale analysis of bioinformatics code on GitHub

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Oct 31, 2018
    Cite
    Carlson, Nichole E.; Harnke, Benjamin; Russell, Pamela H.; Johnson, Rachel L.; Ananthan, Shreyas (2018). A large-scale analysis of bioinformatics code on GitHub [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000639408
    Dataset updated
    Oct 31, 2018
    Authors
    Carlson, Nichole E.; Harnke, Benjamin; Russell, Pamela H.; Johnson, Rachel L.; Ananthan, Shreyas
    Description

    In recent years, the explosion of genomic data and bioinformatic tools has been accompanied by a growing conversation around reproducibility of results and usability of software. However, the actual state of the body of bioinformatics software remains largely unknown. The purpose of this paper is to investigate the state of source code in the bioinformatics community, specifically looking at relationships between code properties, development activity, developer communities, and software impact. To investigate these issues, we curated a list of 1,720 bioinformatics repositories on GitHub through their mention in peer-reviewed bioinformatics articles. Additionally, we included 23 high-profile repositories identified by their popularity in an online bioinformatics forum. We analyzed repository metadata, source code, development activity, and team dynamics using data made available publicly through the GitHub API, as well as article metadata. We found key relationships within our dataset, including: certain scientific topics are associated with more active code development and higher community interest in the repository; most of the code in the main dataset is written in dynamically typed languages, while most of the code in the high-profile set is statically typed; developer team size is associated with community engagement and high-profile repositories have larger teams; the proportion of female contributors decreases for high-profile repositories and with seniority level in author lists; and, multiple measures of project impact are associated with the simple variable of whether the code was modified at all after paper publication. In addition to providing the first large-scale analysis of bioinformatics code to our knowledge, our work will enable future analysis through publicly available data, code, and methods. Code to generate the dataset and reproduce the analysis is provided under the MIT license at https://github.com/pamelarussell/github-bioinformatics. 
    Data are available at https://doi.org/10.17605/OSF.IO/UWHX8.

  20.

    Data from: A two-tier bioinformatic pipeline to develop probes for target...

    • datasetcatalog.nlm.nih.gov
    • data.niaid.nih.gov
    • +2more
    Updated Jan 2, 2021
    Cite
    Amarasinghe, Prabha; Michelangeli, Fabian; Jantzen, Johanna; Cellinese, Nico; Reginato, Marcelo; Folk, Ryan; Soltis, Pamela S.; Soltis, Douglas (2021). A two-tier bioinformatic pipeline to develop probes for target capture of nuclear loci with applications in Melastomataceae [Dataset]. http://doi.org/10.5061/dryad.8931zcrm2
    Dataset updated
    Jan 2, 2021
    Authors
    Amarasinghe, Prabha; Michelangeli, Fabian; Jantzen, Johanna; Cellinese, Nico; Reginato, Marcelo; Folk, Ryan; Soltis, Pamela S.; Soltis, Douglas
    Description

    Premise of the study: Putatively single-copy nuclear (SCN) loci, identified using genomic resources of closely related species, are ideal for phylogenomic inference. However, suitable genomic resources are not available for many clades, including Melastomataceae. We introduce a versatile approach to identify SCN loci for clades with few genomic resources and use it to develop probes for target enrichment in the distantly related Memecylon and Tibouchina (Melastomataceae).

    Methods: We present a two-tiered pipeline. First, we identified putatively SCN loci using MarkerMiner and transcriptomes from distantly related species in Melastomataceae. Published loci and genes of functional significance were added (384 total loci). Second, using HybPiper, we retrieved 689 homologous template sequences for these loci using genome-skimming data from within the focal clades.

    Results: We sequenced 193 loci from both Memecylon and Tibouchina, with probes designed from 56 template sequences successfully targeting sequences in both clades. Probes designed from genome-skimming data within a focal clade were more successful than probes designed from other sources.

    Discussion: Our pipeline successfully identified and targeted SCN loci in Memecylon and Tibouchina, enabling phylogenomic studies in both clades and potentially across Melastomataceae. This pipeline could be easily applied to other clades with few genomic resources.

Cite
Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated

Bioinformatics Protein Dataset - Simulated

Available download formats: zip (12928905 bytes)
Dataset updated
Dec 27, 2024
Authors
Rafael Gallo
License

MIT License: https://opensource.org/licenses/MIT
License information was derived automatically

Description

Subtitle

"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

Description

Introduction

This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

Columns Included

  • ID_Protein: Unique identifier for each protein.
  • Sequence: String of amino acids.
  • Molecular_Weight: Molecular weight calculated from the sequence.
  • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
  • Hydrophobicity: Average hydrophobicity calculated from the sequence.
  • Total_Charge: Sum of the charges of the amino acids in the sequence.
  • Polar_Proportion: Percentage of polar amino acids in the sequence.
  • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
  • Sequence_Length: Total number of amino acids in the sequence.
  • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

Inspiration and Sources

While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:

  • UniProt: A comprehensive database of protein sequences and annotations.
  • Kyte-Doolittle Scale: Calculations of hydrophobicity.
  • Biopython: A tool for analyzing biological sequences.

Proposed Uses

This dataset is ideal for:

  • Training classification models for proteins.
  • Exploratory analysis of physicochemical properties of proteins.
  • Building machine learning pipelines in bioinformatics.

How This Dataset Was Created

  1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
  2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
  3. Class Assignment: Classes were randomly assigned for classification purposes.
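
The three steps above can be sketched roughly as follows. This is not the author's script: the description says Biopython was used, whereas this stand-in uses only the standard library and computes a single property (mean Kyte-Doolittle hydrophobicity). Uniform sampling of residues and classes, the ID format, and the rounding are all assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]

# Kyte-Doolittle hydropathy value for each of the 20 standard residues.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def make_protein(rng, protein_id):
    """Build one synthetic record with a subset of the dataset's columns."""
    # Step 1: random chain of 50 to 300 residues (uniform sampling assumed).
    length = rng.randint(50, 300)
    sequence = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    # Step 2: one example property, the mean hydropathy over the sequence.
    hydrophobicity = sum(KYTE_DOOLITTLE[aa] for aa in sequence) / length
    # Step 3: class label drawn at random, independent of the sequence.
    return {
        "ID_Protein": protein_id,  # hypothetical ID format
        "Sequence": sequence,
        "Hydrophobicity": round(hydrophobicity, 3),
        "Sequence_Length": length,
        "Class": rng.choice(CLASSES),
    }

rng = random.Random(42)  # fixed seed so the sketch is repeatable
proteins = [make_protein(rng, f"P{i:05d}") for i in range(5)]
for p in proteins:
    print(p["ID_Protein"], p["Sequence_Length"], p["Hydrophobicity"], p["Class"])
```

Because the label in step 3 is drawn independently of the sequence, the computed properties carry no information about the class in a scheme like this one.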

Limitations

  • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
  • The functional classes are simulated and do not correspond to actual biological characteristics.

Data Split

The dataset is divided into two subsets:

  • Training: 16,000 samples (proteinas_train.csv).
  • Testing: 4,000 samples (proteinas_test.csv).
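
A quick way to sanity-check a split after download is to count rows and class labels per file. The sketch below assumes the CSV header uses the column names listed earlier (in particular `Class`); it runs on a tiny in-memory stand-in so it does not depend on the real files.

```python
import csv
import io
from collections import Counter

def summarize(handle):
    """Return (row count, per-class counts) for one split read from a CSV handle."""
    reader = csv.DictReader(handle)
    counts = Counter(row["Class"] for row in reader)
    return sum(counts.values()), counts

# In-memory stand-in for a split file; the values are made up for illustration.
demo = io.StringIO("ID_Protein,Class\nP1,Enzyme\nP2,Other\nP3,Enzyme\n")
n_rows, class_counts = summarize(demo)
print(n_rows, dict(class_counts))
```

With the real files, the same function could be called as `summarize(open("proteinas_train.csv", newline=""))`, where 16,000 rows would be expected for the training split and 4,000 for the test split.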

Acknowledgment

This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
