100+ datasets found

Bioinformatics Protein Dataset - Simulated
kaggle.com
zip
Updated Dec 27, 2024
Cite
Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated
Available download formats: zip (12928905 bytes)
Dataset updated
Dec 27, 2024
Authors
Rafael Gallo
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Subtitle

"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

Description

Introduction

This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

Columns Included

ID_Protein: Unique identifier for each protein.

Sequence: String of amino acids.

Molecular_Weight: Molecular weight calculated from the sequence.

Isoelectric_Point: Estimated isoelectric point based on the sequence composition.

Hydrophobicity: Average hydrophobicity calculated from the sequence.

Total_Charge: Sum of the charges of the amino acids in the sequence.

Polar_Proportion: Percentage of polar amino acids in the sequence.

Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.

Sequence_Length: Total number of amino acids in the sequence.

Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.
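
The column list above maps naturally onto a typed record. The following is a minimal sketch, assuming the CSV header uses exactly the column names listed and that the numeric columns parse as plain numbers; the `ProteinRow` class, the `parse_rows` helper, and the demo record are hypothetical and introduced only for illustration.

```python
import csv
import io
from dataclasses import dataclass

# The five functional classes named in the dataset description.
CLASSES = {"Enzyme", "Transport", "Structural", "Receptor", "Other"}


@dataclass
class ProteinRow:
    id_protein: str
    sequence: str
    molecular_weight: float
    isoelectric_point: float
    hydrophobicity: float
    total_charge: float
    polar_proportion: float
    nonpolar_proportion: float
    sequence_length: int
    protein_class: str


def parse_rows(handle):
    """Yield one ProteinRow per CSV record, validating the Class column."""
    for rec in csv.DictReader(handle):
        if rec["Class"] not in CLASSES:
            raise ValueError(f"unexpected class: {rec['Class']!r}")
        yield ProteinRow(
            id_protein=rec["ID_Protein"],
            sequence=rec["Sequence"],
            molecular_weight=float(rec["Molecular_Weight"]),
            isoelectric_point=float(rec["Isoelectric_Point"]),
            hydrophobicity=float(rec["Hydrophobicity"]),
            total_charge=float(rec["Total_Charge"]),
            polar_proportion=float(rec["Polar_Proportion"]),
            nonpolar_proportion=float(rec["Nonpolar_Proportion"]),
            # int(float(...)) tolerates lengths stored as "120" or "120.0".
            sequence_length=int(float(rec["Sequence_Length"])),
            protein_class=rec["Class"],
        )


# Made-up record for illustration only (not taken from the dataset).
demo = io.StringIO(
    "ID_Protein,Sequence,Molecular_Weight,Isoelectric_Point,Hydrophobicity,"
    "Total_Charge,Polar_Proportion,Nonpolar_Proportion,Sequence_Length,Class\n"
    "P1,ACDE,436.4,3.5,0.1,-2,50.0,50.0,4,Enzyme\n"
)
print(next(parse_rows(demo)).protein_class)
```

Validating `Class` at load time catches header or encoding mismatches early, before any model training starts.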

Inspiration and Sources

While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:
- UniProt: A comprehensive database of protein sequences and annotations.
- Kyte-Doolittle Scale: Calculations of hydrophobicity.
- Biopython: A tool for analyzing biological sequences.

Proposed Uses

This dataset is ideal for:
- Training classification models for proteins.
- Exploratory analysis of physicochemical properties of proteins.
- Building machine learning pipelines in bioinformatics.

How This Dataset Was Created

Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.

Property Calculation: Physicochemical properties were calculated using the Biopython library.

Class Assignment: Classes were randomly assigned for classification purposes.
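
The three steps above can be sketched in a few lines. This is an illustrative reconstruction, not the author's script: it assumes residues and lengths are drawn uniformly, uses a hand-written Kyte-Doolittle table in place of Biopython (whose `Bio.SeqUtils.ProtParam.ProteinAnalysis` class offers comparable calculations), and the ID format, seed, and function names are hypothetical.

```python
import random

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = sorted(KYTE_DOOLITTLE)
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]


def random_sequence(rng, min_len=50, max_len=300):
    """Step 1: draw a random chain of 50 to 300 residues (uniform, assumed)."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mean_hydrophobicity(sequence):
    """Step 2 (one property only): average Kyte-Doolittle value."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def simulate(n, seed=0):
    """Build n synthetic rows; the label is drawn independently of the sequence."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        seq = random_sequence(rng)
        rows.append({
            "ID_Protein": f"P{i:05d}",  # hypothetical ID format
            "Sequence": seq,
            "Hydrophobicity": mean_hydrophobicity(seq),
            "Sequence_Length": len(seq),
            "Class": rng.choice(CLASSES),  # Step 3: random class assignment
        })
    return rows


rows = simulate(5)
print(rows[0]["Sequence_Length"], rows[0]["Class"])
```

Because the label in this sketch is drawn independently of the sequence, a classifier trained on such data would be expected to score near chance (about 20% for five balanced classes), which is worth keeping in mind when interpreting benchmark results.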

Limitations

The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.

The functional classes are simulated and do not correspond to actual biological characteristics.

Data Split

The dataset is divided into two subsets:
- Training: 16,000 samples (proteinas_train.csv).
- Testing: 4,000 samples (proteinas_test.csv).

Acknowledgment

This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
Bioinformatic databases survey
zenodo.org
csv
Updated Aug 17, 2024
Cite
Alise Ponsero; Bonnie Hurwitz; Kiran Smelser; Karen Valencia; Lucas Jimenez Miranda; Abby McDermott (2024). Bioinformatic databases survey [Dataset]. http://doi.org/10.5281/zenodo.12790448
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.12790448
Dataset updated
Aug 17, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Alise Ponsero; Bonnie Hurwitz; Kiran Smelser; Karen Valencia; Lucas Jimenez Miranda; Abby McDermott
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Bioinformatic databases survey

The dataset surveys bioinformatic databases published in the NAR database issue from 1995 to 2022. It evaluates the current citation count and availability of each resource.

Data content

The dataset is composed of two tables :

A. Databases table : Contains the information of each database published in the NAR database issue.

db_id : Database ID in the dataset

resource_name : Name(s) of the database

current_access : Latest known web address of the database

is_a_pun : The database name is a play on words

available_2022 : The database was accessible online during the 2022 survey

last_accessible_year : If not accessible, latest point in time where the database was found online (using the Internet web archive snapshots)

unavailable_message : If not accessible, the message/error returned when trying to access the resource

year_first_publication : Year of first publication of the database

year_last_publication : Year of latest publication of the database (including database update publications)

total_citations_2022 : Cumulative number of citations for all articles of the database

nb_authors_max : Maximum number of authors associated with any article published for that database

nb_articles_2022 : Number of articles published for that database in 2022

B. Articles table : Contains the information collected for the NAR articles

collector : Person who added this database to the dataset

article_global_id : DOI of the article surveyed

db_id : Database ID of the resource described in the article

article_id : Article unique ID

article_year : Article publication year

Authors : list of authors of the article. Separated by ";"

Author.ID : list of ORCID of the authors of the article. Separated by ";"

Title : Title of the article

Source.title : Journal name

Volume : Volume number

Issue : Issue number

Funding.Details : Funding information of the article

Funding.Text : Funding text provided by the authors

PubMed.ID : Pubmed ID of the article

citations_2016 : Number of citations of the article in 2016 (if published)

citations_2022 : Number of citations of the article in 2022

nb_authors : Number of authors in the article

Index.Keywords : Keywords associated with the publication
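
The two tables are linked by db_id, so per-database figures such as total_citations_2022 can in principle be recomputed from the articles table. The following is a minimal sketch of that aggregation, assuming a comma-delimited articles file whose citations_2022 column holds integers or is empty; the function name and the demo rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def citations_per_database(articles_handle):
    """Sum citations_2022 over all articles sharing the same db_id."""
    totals = defaultdict(int)
    for rec in csv.DictReader(articles_handle):
        value = (rec.get("citations_2022") or "").strip()
        if value:  # skip articles with no recorded citation count
            totals[rec["db_id"]] += int(value)
    return dict(totals)


# Made-up rows for illustration only (not taken from the survey).
demo = io.StringIO(
    "db_id,article_id,citations_2022\n"
    "1,a1,10\n"
    "1,a2,5\n"
    "2,a3,\n"
    "2,a4,7\n"
)
print(citations_per_database(demo))
```

Comparing these sums against the total_citations_2022 column of the databases table is a quick consistency check after loading both files.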

Data sources

Note that the presented dataset leverages and expands on the dataset gathered and published in Imker, H.J., 2020. Who Bears the Burden of Long-Lived Molecular Biology Databases?. Data Science Journal, 19(1), p.8. The original dataset collected by Dr. Imker is available at: https://doi.org/10.13012/B2IDB-4311325_V1

The dataset was collected and is maintained by undergraduate students of a CURE class (Course-based Undergraduate Research Experience) held at the University of Arizona. All students of the class participated in the collection, update and curation of the dataset, which is available as a database and a web portal at https://hurwitzlab.shinyapps.io/DS_Heroes/. Students could choose whether to be listed as authors of this Zenodo repository.

The CURE class BAT102 "Data Science Heroes: An undergraduate research experience in Open Data Science Practices" gives the students an opportunity to learn about open science and investigate open data practices in bioinformatics through a survey of the databases published in the NAR database issue.
Bioinformatics Services Market Industry Forecast 2034
polarismarketresearch.com
Updated Aug 26, 2025
Cite
Polaris Market Research & Consulting, Inc. (2025). Bioinformatics Services Market Industry Forecast 2034 [Dataset]. https://www.polarismarketresearch.com/industry-analysis/bioinformatics-services-market
Dataset updated
Aug 26, 2025
Dataset authored and provided by
Polaris Market Research & Consulting, Inc.
License
https://www.polarismarketresearch.com/privacy-policy
Description
Bioinformatics Services Market will grow from USD 4,399.58 Million to USD 16,297.10 Million by 2034, showing an impressive CAGR of 15.7%.
Properties of bioinformatically identified candidate antigens and previously...
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Jun 9, 2016
Cite
Eren, Hasan; Karagenc, Tulin; Kinnaird, Jane; Bakırcı, Serkan; Tait, Andrew; Weir, William; Shiels, Brian; Bilgic, Huseyin Bilgin (2016). Properties of bioinformatically identified candidate antigens and previously identified antigens [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001591609
Dataset updated
Jun 9, 2016
Authors
Eren, Hasan; Karagenc, Tulin; Kinnaird, Jane; Bakırcı, Serkan; Tait, Andrew; Weir, William; Shiels, Brian; Bilgic, Huseyin Bilgin
Description
Properties of bioinformatically identified candidate antigens and previously identified antigens
Appendix V - Bioinformatic pipelines/scripts
figshare.le.ac.uk
txt
Updated Jun 2, 2020
Cite
Ihthisham Ali (2020). Appendix V - Bioinformatic pipelines/scripts [Dataset]. http://doi.org/10.25392/leicester.data.12363785.v1
Available download formats: txt
Unique identifier
https://doi.org/10.25392/leicester.data.12363785.v1
Dataset updated
Jun 2, 2020
Dataset provided by
University of Leicester
Authors
Ihthisham Ali
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
PhD thesis: PRDM9 Diversity, Recombination Landscapes and Childhood Leukaemia by Ihthisham Ali. Appendix V contains various data pipelines and scripts used for the remapping of Illumina HiSeq2000 dataset to known PRDM9 ZnF arrays, read depth and variant calling vcf file generation, haplotype estimation and imputation of FIGNL1 coding variants in relation to the British ALL cohort, de novo assembly of read data and mapping of MinION read data.
A. ALL study phasing and imputation
B. Illumina HiSeq 2000 dataset - Read depth (DP) and variant calling pipeline
C. Illumina HiSeq 2000 dataset - data treatment
D. VelvetOptimiser best k-mer determination log (exemplary)
E. Alignment of contigs generated by Velvet de novo assembly for the PRDM9 A/A carrier and aligned against the PRDM9 A ZnF array
F. MinION nanopore reads - minimap2 pipeline
Features of bioinformatically-defined Mycobacteriophage endolysin domains.
plos.figshare.com
xls
Updated Jun 4, 2023
Cite
Kimberly M. Payne; Graham F. Hatfull (2023). Features of bioinformatically-defined Mycobacteriophage endolysin domains. [Dataset]. http://doi.org/10.1371/journal.pone.0034052.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0034052.t001
Dataset updated
Jun 4, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Kimberly M. Payne; Graham F. Hatfull
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Features of bioinformatically-defined Mycobacteriophage endolysin domains.
Bioinformatics for Researchers in Life Sciences: Tools and Learning...
data.iadb.org
csv, pdf
Updated Apr 10, 2025
Cite
IDB Datasets (2025). Bioinformatics for Researchers in Life Sciences: Tools and Learning Resources [Dataset]. http://doi.org/10.60966/kwvb-wr19
Available download formats: csv (276253), pdf (2989058), csv (355108)
Unique identifier
https://doi.org/10.60966/kwvb-wr19
Dataset updated
Apr 10, 2025
Dataset provided by
IDB Datasets
License
Attribution-NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0) (https://creativecommons.org/licenses/by-nc-nd/3.0/)
License information was derived automatically
Time period covered
Jan 1, 2020 - Jan 1, 2021
Description
The COVID-19 pandemic has shown that bioinformatics--a multidisciplinary field that combines biological knowledge with computer programming concerned with the acquisition, storage, analysis, and dissemination of biological data--has a fundamental role in scientific research strategies in all disciplines involved in fighting the virus and its variants. It aids in sequencing and annotating genomes and their observed mutations; analyzing gene and protein expression; simulation and modeling of DNA, RNA, proteins and biomolecular interactions; and mining of biological literature, among many other critical areas of research. Studies suggest that bioinformatics skills in the Latin American and Caribbean region are relatively incipient, and thus its scientific systems cannot take full advantage of the increasing availability of bioinformatic tools and data. This dataset is a catalog of bioinformatics software for researchers and professionals working in life sciences. It includes more than 300 different tools for varied uses, such as data analysis, visualization, repositories and databases, data storage services, scientific communication, marketplace and collaboration, and lab resource management. Most tools are available as web-based or desktop applications, while others are programming libraries. It also includes 10 suggested entries for other third-party repositories that could be of use.
Bioinformatics Services Market Report
marketresearchforecast.com
doc, pdf, ppt
Updated Oct 24, 2025
Cite
Market Research Forecast (2025). Bioinformatics Services Market Report [Dataset]. https://www.marketresearchforecast.com/reports/bioinformatics-services-market-10291
Available download formats: ppt, pdf, doc
Dataset updated
Oct 24, 2025
Dataset authored and provided by
Market Research Forecast
License
https://www.marketresearchforecast.com/privacy-policy
Time period covered
2026 - 2034
Area covered
Global
Variables measured
Market Size
Description
The Bioinformatics Services Market was valued at USD XX Billion in 2023 and is projected to reach USD XXX Billion by 2032, with an expected CAGR of 16.5% during the forecast period.
Recent developments include:
- June 2023 – Psomagen added a new sequencing platform, the Pacific Bioscience Revio system, to offer services such as whole genome, whole exome, single cell and bulk RNAseq, microbiome, Olink Proteomics, and others.
- August 2023 – PacBio agreed to acquire Apton Biosystems, Inc., to accelerate the development of a next-generation, high-throughput short-read sequencer.
- March 2023 – Emmes, a Clinical Research Organization (CRO), acquired Essex Management. Essex offers bioinformatics and Health Information Technology (HIT) consulting services to government, private sector and academic organizations.
- November 2022 – Arima Genomics, Inc. partnered with Basepair to empower scientists with bioinformatic analysis.
- September 2021 – Dovetails Genomics expanded its epigenetic services in the areas of bioinformatics and target enrichment to offer a one-stop solution.
Key drivers for this market are: Growing Applications and Research Grants to Surge the Demand for These Services.
Potential restraints include: Growing Applications and Research Grants to Surge the Demand for These Services.
Notable trends are: Growing Applications and Research Grants to Surge the Demand for These Services.
RMQS1 16S bioinformatic config files and control sample data
entrepot.recherche.data.gouv.fr
application/gzip, tsv +1
Updated Aug 22, 2024
Cite
Sébastien Terrat; Samuel Dequiedt (2024). RMQS1 16S bioinformatic config files and control sample data [Dataset]. http://doi.org/10.57745/XBFOJP
Available download formats: tsv (522347), txt (143493), tsv (8814), tsv (33093), tsv (117004), application/gzip (362535), tsv (13212), tsv (32344), tsv (266094), tsv (80032), txt (10413), tsv (16460)
Unique identifier
https://doi.org/10.57745/XBFOJP
Dataset updated
Aug 22, 2024
Dataset provided by
Recherche Data Gouv
Authors
Sébastien Terrat; Samuel Dequiedt
License
https://spdx.org/licenses/etalab-2.0.html
Dataset funded by
French National Research Agency (ANR)
France Génomique
French Agency for Ecological Transition (ADEME)
Description
RMQS: The French Soil Quality Monitoring Network (RMQS) is a national program for the assessment and long-term monitoring of the quality of French soils. This network is based on the monitoring of 2240 sites representative of French soils and their land use. These sites are spread over the whole French territory (metropolitan and overseas) along a systematic square grid of 16 km x 16 km cells. The network covers a broad spectrum of climatic, soil and land-use conditions (croplands, permanent grasslands, woodlands, orchards and vineyards, natural or scarcely anthropogenic land and urban parkland). The first sampling campaign in metropolitan France took place from 2000 to 2009.

Dataset: This dataset contains config files used to run the bioinformatic pipeline and the control sample data that were not published before. Reference environmental DNA samples named “G4” in internal laboratory processes were added for each molecular analysis. They were used for technical validation, but not necessarily published alongside the datasets. The taxonomy and OTU abundance files for these control samples were built like the taxonomy and abundance file of the main dataset. As these internal control samples were clustered against the RMQS dataset in an open reference fashion, they contained new OTUs (noted as “OUT”) that corresponded to sequences that did not match any of 188,030 RMQS reference sequences. The sample bank association file links each sample to its sequencing library. The G4 metadata file links each G4 to its library, molecular tag and sequence repository information.

File structure:

Taxonomy files rmqs1_control_taxonomy_: Taxonomy is split across five files with one line per site and one column per taxon. Each line sums to 10k (rarefaction threshold). Three supplementary columns are present:
- Unknown: not matching any reference.
- Unclassified: missing taxa between genus and phylum.
- Environmental: matched to sample from environmental study, generally with only a phylum name.

rmqs1_16S_otu_abundance.tsv: OTU abundance per site (one column per OTU; “DB” + number for OTUs from the RMQS reference set, “OUT” for OTUs not matching any “DB” ones). Each line sums to 10k (rarefaction threshold).

rmqs1_16S_bank_association.tsv: two-column file with the bank name for each sample.

rmqs1_16S_bank_metadata.tsv:
- library_name: library name used in labs
- study_accession, sample_accession, experiment_accession, run_accession: SRA EBI identifier
- library_name_genoscope: library name used in the Genoscope sequence center
- MID: multiplex identifier sequence
- run_alias: Genoscope internal alias
- ftp_link: FTP link to download library

Input_G4.txt: Tabulated file containing the parameters and the bioinformatic steps done by the BIOCOM-PIPE pipeline to extract, treat and analyze controls from raw libraries detailed in the rmqs1_16S_bank_metadata.tsv.

project_G4.tab: Comma separated file containing the needed information to generate the Input.txt file with the BIOCOM-PIPE pipeline for controls only:
- PROJECT: Project name chosen by the user
- LIBRARY_NAME: Library name chosen by the user
- LIBRARY_NAME_RECEIVED: Library name chosen by the sequencing partner and used by BIOCOM-PIPE
- SAMPLE_NAME: Sample name chosen by the user
- MID_F: MID name or MID sequence associated to the Forward primer
- MID_R: MID name or MID sequence associated to the Reverse primer
- TARGET: Target gene (16S, 18S, or 23S)
- PRIMER_F: Forward primer name used for amplification
- PRIMER_R: Reverse primer name used for amplification
- SEQUENCE_PRIMER_F: Forward primer sequence used for amplification
- SEQUENCE_PRIMER_R: Reverse primer sequence used for amplification

Input_GLOBAL.txt: Tabulated file containing the parameters and the bioinformatic steps done by the BIOCOM-PIPE pipeline to extract, treat and analyze controls and samples from raw libraries detailed in the rmqs1_16S_bank_metadata.tsv.

project_GLOBAL.tab: Comma separated file containing the needed information to generate the Input.txt file for controls and samples with the BIOCOM-PIPE pipeline:
- PROJECT: Project name chosen by the user
- LIBRARY_NAME: Library name chosen by the user
- LIBRARY_NAME_RECEIVED: Library name chosen by the sequencing partner and used by BIOCOM-PIPE
- SAMPLE_NAME: Sample name chosen by the user
- MID_F: MID name or MID sequence associated to the Forward primer
- MID_R: MID name or MID sequence associated to the Reverse primer
- TARGET: Target gene (16S, 18S, or 23S)
- PRIMER_F: Forward primer name used for amplification
- PRIMER_R: Reverse primer name used for amplification
- SEQUENCE_PRIMER_F: Forward primer sequence used for amplification
- SEQUENCE_PRIMER_R: Reverse primer sequence used for amplification

Details: Data from three libraries (58, 59 and 69) were re-sequenced and are not detailed in the files. Some samples can be present in several libraries. We kept only the one with the highest number of sequences.
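
The description states that each line of the taxonomy and OTU abundance tables sums to 10k, the rarefaction threshold. A quick way to verify that after download is sketched below. It assumes a tab-separated file with one header row, a site identifier in the first column, and integer counts in all remaining columns; that layout, the function name, and the demo values are assumptions, not details confirmed by the dataset.

```python
import csv
import io

RAREFACTION_DEPTH = 10_000  # "10k" in the dataset description


def rows_not_at_depth(handle, depth=RAREFACTION_DEPTH):
    """Return (site, total) for every row whose counts do not sum to depth."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    mismatches = []
    for row in reader:
        site, counts = row[0], row[1:]  # assumed layout: id first, counts after
        total = sum(int(c) for c in counts)
        if total != depth:
            mismatches.append((site, total))
    return mismatches


# Made-up table for illustration only (not taken from the dataset).
demo = io.StringIO(
    "site\tDB1\tDB2\tOUT1\n"
    "S1\t6000\t3000\t1000\n"
    "S2\t5000\t4000\t500\n"
)
print(rows_not_at_depth(demo))
```

An empty result means every row matches the stated rarefaction depth; any listed site points to a parsing problem or a file that was not rarefied as described.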
Data in brief of genome and bioinformatic of vicilins from Vigna unguiculata...
data.mendeley.com
Updated Mar 16, 2023
Cite
Antônio Rocha (2023). Data in brief of genome and bioinformatic of vicilins from Vigna unguiculata [Dataset]. http://doi.org/10.17632/7ysf2zbfkt.2
Unique identifier
https://doi.org/10.17632/7ysf2zbfkt.2
Dataset updated
Mar 16, 2023
Authors
Antônio Rocha
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This Data in Brief record is the data repository for the article: Genomic and bioinformatic analysis of Vicilin dataset, the 7S globulin from cowpea (Vigna unguiculata) seeds
Bioinformatics Market Report
marketresearchforecast.com
doc, pdf, ppt
Updated Oct 26, 2025
Cite
Market Research Forecast (2025). Bioinformatics Market Report [Dataset]. https://www.marketresearchforecast.com/reports/bioinformatics-market-10292
Available download formats: doc, ppt, pdf
Dataset updated
Oct 26, 2025
Dataset authored and provided by
Market Research Forecast
License
https://www.marketresearchforecast.com/privacy-policy
Time period covered
2026 - 2034
Area covered
Global
Variables measured
Market Size
Description
The Bioinformatics Market was valued at USD 20.72 Billion in 2023 and is projected to reach USD 64.45 Billion by 2032, with an expected CAGR of 17.6% during the forecast period.
Recent developments include:
- October 2023 – Bionl, Inc., a pioneering company in biomedical and bioinformatics research, launched a no-code biomedical research platform that enables researchers, students, and professionals to investigate biomedicine using natural language queries.
- October 2023 – BioBam Bioinformatics launched OmicsBox 3.1 to empower researchers, scientists, and bioinformaticians in their pursuit of advanced omics data analysis and interpretation.
- April 2023 – Absci Corp. collaborated with Aster Insights (formerly named M2GEN) to expedite the development of new cancer medicines.
- December 2022 – Analytical Biosciences Limited partnered with Mission Bio to co-develop bioinformatics packages for translational and clinical research applications in hematological cancers.
- April 2022 – ATCC signed an agreement with QIAGEN to provide sequencing data from its collection of biological data. QIAGEN Digital Insights aims to establish a database from this information to develop and deliver high-value digital biology content for the biotechnology and pharmaceutical industries.
Key drivers for this market are: Increased Funding for Genomics Research to Surge Demand for Bioinformatic Solutions.
Potential restraints include: Increased Funding for Genomics Research to Surge Demand for Bioinformatic Solutions.
Notable trends are: Increased Funding for Genomics Research to Surge Demand for Bioinformatic Solutions.
PATRIC: Bacterial Bioinformatics Resource Center
datacatalog.mskcc.org
Updated Nov 13, 2019
Cite
(2019). PATRIC: Bacterial Bioinformatics Resource Center [Dataset]. https://datacatalog.mskcc.org/dataset/10392
Dataset updated
Nov 13, 2019
Description
PATRIC (Pathosystems Resource Integration Center) is the Bacterial Bioinformatics Resource Center, an information system designed to support the biomedical research community’s work on bacterial infectious diseases via integration of vital pathogen information with rich data and analysis tools. PATRIC sharpens and hones the scope of available bacterial phylogenomic data from numerous sources specifically for the bacterial research community, in order to save biologists time and effort when conducting comparative analyses. The freely available PATRIC platform provides an interface for biologists to discover data and information and conduct comprehensive comparative genomics and other analyses in a one-stop shop.
Data from: Use of long-read sequencing simulators to assess real-world...
datasets.ai
agdatacommons.nal.usda.gov
Updated Mar 30, 2024
Cite
Department of Agriculture (2024). Data from: Use of long-read sequencing simulators to assess real-world applications for food safety [Dataset]. https://datasets.ai/datasets/data-from-use-of-long-read-sequencing-simulators-to-assess-real-world-applications-for-foo-35d38
Dataset updated
Mar 30, 2024
Dataset authored and provided by
Department of Agriculture
Description
Shiga toxin-producing Escherichia coli (STEC) and Listeria monocytogenes are responsible for severe foodborne illnesses in the United States. Current identification methods require at least four days to identify STEC and six days for L. monocytogenes. Adoption of long-read, whole genome sequencing for testing could significantly reduce the time needed for identification, but method development costs are high. Therefore, the goal of this project was to use NanoSim-H software to simulate Oxford Nanopore sequencing reads to assess the feasibility of sequencing-based foodborne pathogen detection and guide experimental design. Sequencing reads were simulated for STEC, L. monocytogenes, and a 1:1 combination of STEC and Bos taurus genomes using NanoSim-H. This dataset includes all of the simulated reads generated by the project in fasta format. This dataset can be analyzed bioinformatically or used to test bioinformatic pipelines.
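Since the simulated reads are distributed in fasta format, a small reader is enough to start inspecting them or feeding them to a pipeline under test. The following is a minimal standard-library sketch; the function name and the demo records are hypothetical, and it assumes plain uncompressed FASTA with header lines that start with ">".

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may wrap across several lines
    if name is not None:
        yield name, "".join(chunks)


# Made-up reads for illustration only (not taken from the dataset).
demo = io.StringIO(">read_1\nACGTAC\nGT\n>read_2\nGGCC\n")
for header, seq in read_fasta(demo):
    print(header, len(seq))
```

Yielding records one at a time keeps memory use flat, which matters for large simulated read sets.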
Data from: Whole-genome sequencing and bioinformatic tools powered by...
microbiology.figshare.com
bin
Updated Aug 11, 2025
Cite
Nishitha R Kumar; Tejashree A Balraj; Kerry K Cooper; Akila Prashant (2025). Whole-genome sequencing and bioinformatic tools powered by machine learning to identify antibiotic-resistant genes and virulence factors in Escherichia coli from sepsis [Dataset]. http://doi.org/10.6084/m9.figshare.27204585.v1
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.27204585.v1
Dataset updated
Aug 11, 2025
Dataset provided by
Microbiology Society (http://www.microbiologysociety.org/)
Authors
Nishitha R Kumar; Tejashree A Balraj; Kerry K Cooper; Akila Prashant
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Supplementary material for 'Whole-genome sequencing and bioinformatic tools powered by machine learning to identify antibiotic-resistant genes and virulence factors in Escherichia coli from sepsis', as described in Microbial Genomics.
Bioinformatic Pipeline Scripts Amplicon Sequencing - Grey Box Grassy...
open.flinders.edu.au
researchdata.edu.au
txt
Updated Aug 8, 2025
Cite
Nicole Fickling (2025). Bioinformatic Pipeline Scripts Amplicon Sequencing - Grey Box Grassy Woodlands [Dataset]. http://doi.org/10.25451/flinders.29848280.v1
Available download formats: txt
Unique identifier
https://doi.org/10.25451/flinders.29848280.v1
Dataset updated
Aug 8, 2025
Dataset provided by
Flinders University
Authors
Nicole Fickling
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Description
Bioinformatic scripts for the paper titled "Habitat fragmentation shifts soil microbial composition but not richness". Scripts outline the bioinformatic pipeline using DADA2 and QIIME2 to create amplicon sequence variant tables from .fastq files. The .fastq files and sequence metadata are available on the Sequence Read Archive under project number PRJNA1298480.
Data from: Post-bioinformatic methods to identify and reduce the prevalence...
datadryad.org
search.dataone.org
zip
Updated Mar 30, 2021
Cite
Lorna Drake; Jordan Cuff (2021). Post-bioinformatic methods to identify and reduce the prevalence of artefacts in metabarcoding data [Dataset]. http://doi.org/10.5061/dryad.2jm63xsp4
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.2jm63xsp4
Dataset updated
Mar 30, 2021
Dataset provided by
Dryad
Authors
Lorna Drake; Jordan Cuff
Time period covered
Mar 25, 2021
Description
Example dataset one: British otter diet

Faecal samples were collected during otter post-mortems by the Cardiff University Otter Project. Extracted faecal DNA was amplified using two metabarcoding primer pairs designed to amplify regions of the 16S rRNA and cytochrome c oxidase subunit I (COI) genes, each primer having ten-base-pair molecular identifier tags (MID tags) to facilitate post-bioinformatic sample identification. Extraction and PCR negative controls, unused MID tag combinations, repeat samples and mock communities were included alongside the focal eDNA samples. Mock communities comprised standardised mixtures of DNA of marine species not previously detected in the diet of Eurasian otters. The resultant DNA libraries for each marker were sequenced on separate MiSeq V2 chips with 2x250bp paired-end reads.

Example dataset two: cereal crop spider diet

Money spiders (Bathyphantes, Erigone, Microlinyphia and Tenuiphantes; Araneae: Linyphiidae) and wolf spiders (Pardosa; Ar...

Bioinformatics Market Demand, Size and Competitive Analysis | TechSci...

techsciresearch.com

Updated Aug 15, 2025

Cite

TechSci Research (2025). Bioinformatics Market Demand, Size and Competitive Analysis | TechSci Research [Dataset]. https://www.techsciresearch.com/report/bioinformatics-market/4279.html

Dataset updated

Aug 15, 2025

Dataset authored and provided by

TechSci Research

License

https://www.techsciresearch.com/privacy-policy.aspx

Description

Bioinformatics Market was valued at USD 11.24 Billion in 2024 and is expected to reach USD 22.59 Billion by 2030 with a CAGR of 12.34%.

Pages	185
Market Size	2024: USD 11.24 Billion
Forecast Market Size	2030: USD 22.59 Billion
CAGR	2025-2030: 12.34%
Fastest Growing Segment	Genomics & Proteomics
Largest Market	North America
Key Players	1. 3rd Millennium Inc. 2. Thermo Fisher Scientific, Inc. 3. Agilent Technologies, Inc. 4. BioWisdom Ltd 5. Quest Diagnostics (Celera Corporation) 6. Dassault Systèmes SE 7. Illumina, Inc. 8. Geneva Bioinformatics SA 9. Perkin Elmer, Inc. 10. Lineage Cell Therapeutics (BioTime Inc.)
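
The headline figures above can be cross-checked with the standard compound annual growth rate formula, CAGR = (end / start)^(1 / years) - 1. The sketch below assumes six annual compounding periods, from the 2024 market size to the 2030 forecast; the function name is hypothetical.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate over the given number of years."""
    return (end_value / start_value) ** (1 / years) - 1


# USD 11.24 Billion (2024) to USD 22.59 Billion (2030): six annual periods assumed.
rate = cagr(11.24, 22.59, 2030 - 2024)
print(f"{rate:.2%}")  # prints 12.34%
```

The result agrees with the 12.34% CAGR quoted in the listing, which suggests the report compounds from the 2024 base value.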

Supplemental Material for the Manuscript "Genomic Characterization and...
kilthub.cmu.edu
xlsx
Updated May 30, 2023
Ramya Ramadoss; Fajer Almarzooqi; Basem Shomar; Valentin Alekseevich Ilyin; Annette Shoba Vincent (2023). Supplemental Material for the Manuscript "Genomic Characterization and Annotation of two Novel Bacteriophages Isolated from a Wastewater Treatment Plant in Qatar" [Dataset]. http://doi.org/10.1184/R1/16965004.v1
Available download formats: xlsx
Unique identifier
https://doi.org/10.1184/R1/16965004.v1
Dataset updated
May 30, 2023
Dataset provided by
Carnegie Mellon University
Authors
Ramya Ramadoss; Fajer Almarzooqi; Basem Shomar; Valentin Alekseevich Ilyin; Annette Shoba Vincent
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Qatar
Description
This is the Supplemental Material for the Manuscript "Genomic Characterization and Annotation of two Novel Bacteriophages Isolated from a Wastewater Treatment Plant in Qatar". Sheets "inphared_EscherichiaPhageCL1" and "inphared_EscherichiaPhageC600M2" list all the genomes related to Escherichia Phage CL1 and Escherichia Phage C600M2, respectively, identified using the get_closest_relatives.pl program in the INPHARED package (https://github.com/RyanCook94/inphared).
Data from: A large-scale analysis of bioinformatics code on GitHub
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Oct 31, 2018
Carlson, Nichole E.; Harnke, Benjamin; Russell, Pamela H.; Johnson, Rachel L.; Ananthan, Shreyas (2018). A large-scale analysis of bioinformatics code on GitHub [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000639408
Dataset updated
Oct 31, 2018
Authors
Carlson, Nichole E.; Harnke, Benjamin; Russell, Pamela H.; Johnson, Rachel L.; Ananthan, Shreyas
Description
In recent years, the explosion of genomic data and bioinformatic tools has been accompanied by a growing conversation around reproducibility of results and usability of software. However, the actual state of the body of bioinformatics software remains largely unknown. The purpose of this paper is to investigate the state of source code in the bioinformatics community, specifically looking at relationships between code properties, development activity, developer communities, and software impact. To investigate these issues, we curated a list of 1,720 bioinformatics repositories on GitHub through their mention in peer-reviewed bioinformatics articles. Additionally, we included 23 high-profile repositories identified by their popularity in an online bioinformatics forum. We analyzed repository metadata, source code, development activity, and team dynamics using data made available publicly through the GitHub API, as well as article metadata. We found key relationships within our dataset, including: certain scientific topics are associated with more active code development and higher community interest in the repository; most of the code in the main dataset is written in dynamically typed languages, while most of the code in the high-profile set is statically typed; developer team size is associated with community engagement and high-profile repositories have larger teams; the proportion of female contributors decreases for high-profile repositories and with seniority level in author lists; and, multiple measures of project impact are associated with the simple variable of whether the code was modified at all after paper publication. In addition to providing the first large-scale analysis of bioinformatics code to our knowledge, our work will enable future analysis through publicly available data, code, and methods. Code to generate the dataset and reproduce the analysis is provided under the MIT license at https://github.com/pamelarussell/github-bioinformatics. 
Data are available at https://doi.org/10.17605/OSF.IO/UWHX8.
Data from: A two-tier bioinformatic pipeline to develop probes for target...
datasetcatalog.nlm.nih.gov
data.niaid.nih.gov
+2 more
Updated Jan 2, 2021
Amarasinghe, Prabha; Michelangeli, Fabian; Jantzen, Johanna; Cellinese, Nico; Reginato, Marcelo; Folk, Ryan; Soltis, Pamela S.; Soltis, Douglas (2021). A two-tier bioinformatic pipeline to develop probes for target capture of nuclear loci with applications in Melastomataceae [Dataset]. http://doi.org/10.5061/dryad.8931zcrm2
Unique identifier
https://doi.org/10.5061/dryad.8931zcrm2
Dataset updated
Jan 2, 2021
Authors
Amarasinghe, Prabha; Michelangeli, Fabian; Jantzen, Johanna; Cellinese, Nico; Reginato, Marcelo; Folk, Ryan; Soltis, Pamela S.; Soltis, Douglas
Description
Premise of the study: Putatively single-copy nuclear (SCN) loci, identified using genomic resources of closely related species, are ideal for phylogenomic inference. However, suitable genomic resources are not available for many clades, including Melastomataceae. We introduce a versatile approach to identify SCN loci for clades with few genomic resources and use it to develop probes for target enrichment in the distantly related Memecylon and Tibouchina (Melastomataceae).
Methods: We present a two-tiered pipeline. First, we identified putatively SCN loci using MarkerMiner and transcriptomes from distantly related species in Melastomataceae. Published loci and genes of functional significance were added (384 total loci). Second, using HybPiper, we retrieved 689 homologous template sequences for these loci using genome-skimming data from within the focal clades.
Results: We sequenced 193 loci from both Memecylon and Tibouchina, with probes designed from 56 template sequences successfully targeting sequences in both clades. Probes designed from genome-skimming data within a focal clade were more successful than probes designed from other sources.
Discussion: Our pipeline successfully identified and targeted SCN loci in Memecylon and Tibouchina, enabling phylogenomic studies in both clades and potentially across Melastomataceae. This pipeline could be easily applied to other clades with few genomic resources.

Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated

Bioinformatics Protein Dataset - Simulated

Available download formats: zip (12928905 bytes)

Dataset updated

Dec 27, 2024

Authors

Rafael Gallo

License

MIT License, https://opensource.org/licenses/MIT
License information was derived automatically

Description

Subtitle

"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

Description

Introduction

This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

Columns Included

ID_Protein: Unique identifier for each protein.
Sequence: String of amino acids.
Molecular_Weight: Molecular weight calculated from the sequence.
Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
Hydrophobicity: Average hydrophobicity calculated from the sequence.
Total_Charge: Sum of the charges of the amino acids in the sequence.
Polar_Proportion: Percentage of polar amino acids in the sequence.
Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
Sequence_Length: Total number of amino acids in the sequence.
Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.
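
Several of the columns above can be derived from the sequence alone. The following is a dependency-free sketch, not the author's code: it computes Sequence_Length, an average Kyte-Doolittle hydrophobicity, and the polar and nonpolar percentages. The function name `describe_sequence` is hypothetical, and the polar/nonpolar grouping is an assumed textbook convention, since the page does not say which grouping was used.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Assumption: one common grouping; conventions differ for G, C, Y and others.
NONPOLAR = set("AVLIMFWPG")
POLAR = set("STNQYCDEKRH")


def describe_sequence(sequence: str) -> dict:
    """Compute a few sequence-derived columns for one protein (hypothetical helper)."""
    seq = sequence.upper()
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if not seq or unknown:
        raise ValueError(f"empty sequence or non-standard residues: {sorted(unknown)}")
    n = len(seq)
    return {
        "Sequence_Length": n,
        # Mean hydropathy over all residues.
        "Hydrophobicity": sum(KYTE_DOOLITTLE[aa] for aa in seq) / n,
        "Polar_Proportion": 100.0 * sum(aa in POLAR for aa in seq) / n,
        "Nonpolar_Proportion": 100.0 * sum(aa in NONPOLAR for aa in seq) / n,
    }


print(describe_sequence("MKTAYIAK"))
```

Molecular_Weight, Isoelectric_Point and Total_Charge are left out because they depend on residue masses and pKa tables that the page does not specify.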

Inspiration and Sources

While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:
- UniProt: A comprehensive database of protein sequences and annotations.
- Kyte-Doolittle Scale: Calculations of hydrophobicity.
- Biopython: A tool for analyzing biological sequences.

Proposed Uses

This dataset is ideal for:
- Training classification models for proteins.
- Exploratory analysis of physicochemical properties of proteins.
- Building machine learning pipelines in bioinformatics.

How This Dataset Was Created

Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
Property Calculation: Physicochemical properties were calculated using the Biopython library.
Class Assignment: Classes were randomly assigned for classification purposes.
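
The sequence generation and class assignment steps above can be sketched as follows. This is an illustration under stated assumptions rather than the author's script: residues are drawn uniformly from the 20 standard amino acids, the `PROT_00001` identifier format and the helper name `make_protein` are hypothetical, and the seed is arbitrary.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]


def make_protein(index: int, rng: random.Random) -> dict:
    """Build one synthetic record (hypothetical helper)."""
    length = rng.randint(50, 300)  # both bounds inclusive, as on the page
    # Assumption: uniform sampling; the page only says "randomly generated".
    sequence = "".join(rng.choices(AMINO_ACIDS, k=length))
    return {
        "ID_Protein": f"PROT_{index:05d}",  # assumed ID format
        "Sequence": sequence,
        "Sequence_Length": length,
        "Class": rng.choice(CLASSES),  # random label, independent of the sequence
    }


rng = random.Random(42)  # fixed seed so the sketch is repeatable
proteins = [make_protein(i, rng) for i in range(1, 1001)]
print(proteins[0]["ID_Protein"], proteins[0]["Sequence_Length"], proteins[0]["Class"])
```

The property columns could then be filled in with Biopython, for example via `ProteinAnalysis` in `Bio.SeqUtils.ProtParam`; that step is omitted here to keep the sketch free of third-party dependencies.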

Limitations

The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
The functional classes are simulated and do not correspond to actual biological characteristics.

Data Split

The dataset is divided into two subsets:
- Training: 16,000 samples (proteinas_train.csv).
- Testing: 4,000 samples (proteinas_test.csv).
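
An 80/20 split of 20,000 records yields the 16,000 and 4,000 counts quoted above. The sketch below shows one way to produce such a split; the page does not say whether the author shuffled or stratified, so the shuffle, the seed, and the function name `split_rows` are all assumptions.

```python
import random


def split_rows(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows and cut it into train and test parts (hypothetical helper)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the 20,000 records; real rows would come from the CSV files.
rows = list(range(20_000))
train, test = split_rows(rows)
print(len(train), len(test))
```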

Acknowledgment

This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
