https://www.gnu.org/licenses/gpl-3.0.html
This item contains a test dataset based on Sumatran rhinoceros (Dicerorhinus sumatrensis) whole-genome re-sequencing data, published along with the GenErode pipeline (https://github.com/NBISweden/GenErode; Kutschera et al. 2022). We reduced the dataset in size so that users can familiarize themselves with the pipeline before analyzing their own genome-wide datasets. We extracted scaffold ‘Sc9M7eS_2_HRSCAF_41’ (40,842,778 bp) from the Sumatran rhinoceros genome assembly (Dicerorhinus sumatrensis harrissoni; GenBank accession number GCA_014189135.1) to be used as the reference genome in GenErode. Some GenErode steps require the reference genome of a closely related species, so we additionally provide three scaffolds from the White rhinoceros genome assembly (Ceratotherium simum simum; GenBank accession number GCF_000283155.1) with a combined length of 41,195,616 bp that are putatively orthologous to Sumatran rhinoceros scaffold ‘Sc9M7eS_2_HRSCAF_41’, along with gene predictions in GTF format. The repository also contains a Sumatran rhinoceros mitochondrial genome (GenBank accession number NC_012684.1) to be used as the reference for the optional mitochondrial mapping step in GenErode. The test dataset contains whole-genome re-sequencing data from three historical and three modern Sumatran rhinoceros samples from the now-extinct Malay Peninsula population (von Seth et al. 2021). The data were subsampled to paired-end reads that mapped to Sumatran rhinoceros scaffold ‘Sc9M7eS_2_HRSCAF_41’, along with a small proportion of randomly selected reads that mapped to the Sumatran rhinoceros mitochondrial genome or elsewhere in the genome. For GERP analyses, we provide scaffolds from the genome assemblies of 30 mammalian outgroup species that had reciprocal BLAST hits to gene predictions from Sumatran rhinoceros scaffold ‘Sc9M7eS_2_HRSCAF_41’.
Further, a phylogeny of the White rhinoceros and the 30 outgroup species including divergence time estimates (in billions of years) from timetree.org is available. Finally, the item contains configuration and metadata files that were used for three separate runs of GenErode to generate the results presented in Kutschera et al. (2022). Bash scripts and a workflow description for the test dataset generation are available in the GenErode GitHub repository (https://github.com/NBISweden/GenErode/docs/extras/test_dataset_generation).
References:
Kutschera VE, Kierczak M, van der Valk T, von Seth J, Dussex N, Lord E, et al. GenErode: a bioinformatics pipeline to investigate genome erosion in endangered and extinct species. BMC Bioinformatics 2022;23:228. https://doi.org/10.1186/s12859-022-04757-0
von Seth J, Dussex N, Díez-Del-Molino D, van der Valk T, Kutschera VE, Kierczak M, et al. Genomic insights into the conservation status of the world’s last remaining Sumatran rhinoceros populations. Nature Communications 2021;12:2393.
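The scaffold extraction described above can be illustrated with a minimal sketch. It is not the authors' workflow (their Bash scripts are in the GenErode repository); it only assumes a plain FASTA input and uses a toy assembly in place of real rhinoceros sequence. The helper names `read_fasta` and `extract_scaffold` are hypothetical.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The record ID is the first whitespace-delimited token of the header.
            tokens = line[1:].split()
            name, chunks = (tokens[0] if tokens else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def extract_scaffold(handle: TextIO, scaffold_id: str) -> str:
    """Return the sequence of the record whose ID equals scaffold_id."""
    for name, seq in read_fasta(handle):
        if name == scaffold_id:
            return seq
    raise KeyError(f"scaffold not found: {scaffold_id}")


# Toy assembly (made-up sequence), with the scaffold name taken from the text.
toy_assembly = io.StringIO(
    ">scaffold_A some description\nACGT\nAC\n"
    ">Sc9M7eS_2_HRSCAF_41\nGGCC\nTTAA\n"
)
scaffold_seq = extract_scaffold(toy_assembly, "Sc9M7eS_2_HRSCAF_41")
print(len(scaffold_seq), scaffold_seq)
```

For a real assembly, an indexed lookup would likely be preferable to this linear scan, since the full genome is far larger than the single scaffold of interest.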
The Genome Solver was an NSF-funded project developed as a way to train undergraduate life science faculty in basic web-based tools for bioinformatics. As part of the project we developed a one-day workshop consisting of bioinformatics modules on the theme of bacterial genomics, which we delivered to faculty at colleges and universities around the country. All of our workshop material can be accessed on the QUBESHub website: https://qubeshub.org/community/groups/genomesolver/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Assembly and annotation scripts.
Filtering steps in bioinformatics pipeline and remaining sequencing reads.
According to our latest research, the global Bioinformatics Pipelines as a Service market size was valued at USD 1.82 billion in 2024, and is anticipated to grow at a robust CAGR of 14.6% from 2025 to 2033. By the end of 2033, the market is forecasted to reach USD 5.73 billion. This growth is primarily driven by the increasing adoption of cloud computing in life sciences, the exponential rise in biological data generation, and the growing need for scalable, cost-effective, and automated bioinformatics solutions across healthcare, pharmaceutical, and research sectors.
The surge in next-generation sequencing (NGS) and other high-throughput technologies has led to an unprecedented volume of biological data, creating a pressing demand for advanced computational tools. Bioinformatics Pipelines as a Service (BPaaS) addresses this need by offering scalable, automated, and user-friendly platforms that streamline complex data analysis workflows. Researchers and clinicians are increasingly leveraging these services to accelerate genomic, proteomic, and transcriptomic studies. The shift towards precision medicine and the growing importance of biomarker discovery are key growth factors, as BPaaS platforms enable rapid and reproducible analysis, reducing time-to-insight and enhancing research productivity. Furthermore, the integration of artificial intelligence (AI) and machine learning (ML) within these pipelines is enhancing data interpretation, fostering innovation, and expanding market opportunities.
Another significant growth driver is the rising demand for cost-effective and flexible bioinformatics solutions among small and medium-sized enterprises (SMEs) and academic institutions. Traditional bioinformatics infrastructure requires substantial investment in hardware, software, and skilled personnel, which can be prohibitive for smaller organizations. BPaaS eliminates these barriers by providing on-demand access to sophisticated analytical tools and computational resources, democratizing access to advanced bioinformatics. This trend is particularly evident in emerging economies, where cloud-based solutions are enabling research institutions and biotechnology startups to participate in cutting-edge life sciences research without heavy capital expenditure. Additionally, the growing collaborations between bioinformatics service providers and pharmaceutical companies are accelerating drug discovery and development pipelines, further propelling market growth.
Regulatory compliance and data security have also become critical considerations, especially with the increasing use of patient-derived data in clinical and translational research. BPaaS providers are investing in robust security protocols, compliance certifications, and data governance frameworks to address these concerns. The adoption of cloud-based bioinformatics pipelines is being facilitated by advancements in data encryption, multi-factor authentication, and secure data storage solutions, ensuring the protection of sensitive genomic and clinical information. This has instilled greater confidence among healthcare providers and pharmaceutical companies, driving broader acceptance of BPaaS solutions in regulated environments. As a result, the market is witnessing strong demand from both developed and developing regions, with North America and Europe leading in adoption, while Asia Pacific and Latin America are rapidly emerging as high-growth markets.
From a regional perspective, North America dominated the Bioinformatics Pipelines as a Service market in 2024, accounting for approximately 44% of global revenue, followed by Europe and Asia Pacific. The presence of leading bioinformatics companies, advanced healthcare infrastructure, and substantial investments in genomics research have positioned North America as a key driver of market expansion. Europe is also witnessing significant growth due to increased funding for life sciences research and supportive regulatory frameworks. Meanwhile, Asia Pacific is projected to exhibit the highest CAGR over the forecast period, driven by expanding biotechnology industries, growing government initiatives, and rising adoption of digital health technologies in countries such as China, India, and Japan.
The emergence of Cloud-Based Multi-Omics D… (https://growthmarketreports.com/report/cloud-based-multi-omics-data-warehouse-market)
https://dataintelo.com/privacy-and-policy
According to our latest research, the Bioinformatics Pipelines as a Service market size reached USD 2.37 billion globally in 2024. The market is exhibiting robust momentum, growing at a CAGR of 13.2% during the forecast period. By 2033, the market is projected to attain a value of USD 6.71 billion. This growth trajectory is primarily driven by the increasing adoption of next-generation sequencing, expanding applications in personalized medicine, and growing demand for scalable, cloud-based bioinformatics solutions. The market's expansion is underpinned by the convergence of advanced computational tools and the exponential rise in biological data generation across various sectors.
A major growth factor fueling the Bioinformatics Pipelines as a Service market is the accelerating pace of genomic and multi-omics research worldwide. The proliferation of high-throughput sequencing technologies has resulted in an unprecedented surge in biological data. This deluge of information necessitates robust, scalable, and automated bioinformatics pipelines that can efficiently process, analyze, and interpret complex datasets. Organizations, ranging from pharmaceutical giants to academic research institutes, are increasingly turning to pipeline-as-a-service models to streamline their workflows, reduce operational overheads, and ensure data reproducibility. The ability to access cutting-edge analytical tools without heavy upfront investments in IT infrastructure is particularly attractive, fostering widespread adoption across both developed and emerging markets.
Another significant driver is the growing emphasis on personalized medicine and precision healthcare. As clinicians and researchers strive to tailor treatments to individual genetic profiles, the need for sophisticated bioinformatics analysis has never been greater. Bioinformatics Pipelines as a Service platforms enable seamless integration of diverse omics data, supporting the identification of biomarkers, therapeutic targets, and patient-specific interventions. The flexibility of these solutions, combined with their ability to adapt to rapidly evolving scientific methodologies, positions them as indispensable assets in both clinical diagnostics and drug discovery pipelines. Moreover, regulatory agencies are increasingly recognizing the value of standardized, auditable bioinformatics workflows, further accelerating market adoption.
The expanding application scope of bioinformatics pipelines in non-clinical domains, such as agriculture and crop science, is also contributing to market growth. Researchers in agrigenomics are leveraging these platforms to enhance crop yields, improve disease resistance, and accelerate breeding programs. The integration of metabolomics and proteomics data is enabling deeper insights into plant physiology and stress responses, driving innovation in sustainable agriculture. Additionally, the rise of collaborative research initiatives and public-private partnerships is fostering the development of interoperable, user-friendly pipeline solutions that cater to a broad spectrum of end-users. These trends collectively underscore the transformative potential of Bioinformatics Pipelines as a Service across diverse scientific disciplines.
From a regional perspective, North America continues to dominate the Bioinformatics Pipelines as a Service market, supported by a robust biotechnology ecosystem, substantial R&D investments, and a favorable regulatory landscape. Europe follows closely, driven by strong academic research networks and government-backed genomics initiatives. The Asia Pacific region is emerging as a high-growth market, fueled by expanding healthcare infrastructure, rising awareness of precision medicine, and increasing participation in international genomics collaborations. Meanwhile, Latin America and the Middle East & Africa are witnessing gradual adoption, with market growth primarily concentrated in major urban centers and research hubs. Despite regional disparities, the global outlook remains overwhelmingly positive, with technological advancements and cross-sector collaborations expected to drive sustained market expansion through 2033.
The Offering segment of the Bioinformatics Pipelines as a Service market is bifurcated into Platform and S
including 5
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Appendix N: Summary of the website links for the bioinformatics tools and online resources used in this thesis.
Virus reads reported by the bioinformatic pipeline.
DNA methylation, one of the most important epigenetic modifications, plays a crucial role in various biological processes. The level of DNA methylation can be measured using whole-genome bisulfite sequencing at single-base resolution. However, publicly available software for carrying out integrated methylation data analysis remains scarce. In this study, we implemented Methy-Pipe, which not only fulfills the core data analysis requirements (e.g. sequence alignment, differential methylation analysis, etc.) but also provides useful tools for methylation data annotation and visualization. Specifically, it uses the Burrows-Wheeler Transform (BWT) algorithm to directly align bisulfite sequencing reads to a reference genome and implements a novel sliding-window-based approach with statistical methods for the identification of differentially methylated regions (DMRs). Its ability to process data in parallel allows it to outperform a number of other bisulfite alignment software packages. To demonstrate its utility and performance, we applied it to both real and simulated bisulfite sequencing datasets. The results indicate that Methy-Pipe can accurately estimate methylation densities, identify DMRs and provide a variety of utility programs for downstream methylation data analysis. In summary, Methy-Pipe is a useful pipeline that can process whole-genome bisulfite sequencing data in an efficient, accurate, and user-friendly manner. Software and test dataset are available at http://sunlab.lihs.cuhk.edu.hk/methy-pipe/.
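To make the idea of windowed methylation density concrete, here is a minimal sketch. It is not Methy-Pipe's implementation: it assumes per-site counts of methylated and unmethylated reads are already available, defines density as methylated / (methylated + unmethylated) within each window, and uses hypothetical window and step sizes on toy data. DMR calling would additionally compare such densities between samples with a statistical test, which is not shown.

```python
from typing import List, Tuple

# (position, methylated read count, unmethylated read count) for one cytosine site
Site = Tuple[int, int, int]


def window_densities(sites: List[Site], window: int, step: int) -> List[Tuple[int, float]]:
    """Return (window_start, methylation_density) for every window with coverage."""
    if not sites:
        return []
    ordered = sorted(sites)
    first, last = ordered[0][0], ordered[-1][0]
    results = []
    start = first
    while start <= last:
        end = start + window
        in_window = [(m, u) for pos, m, u in ordered if start <= pos < end]
        meth = sum(m for m, _ in in_window)
        unmeth = sum(u for _, u in in_window)
        # Windows without any covered site are skipped rather than reported as 0.
        if meth + unmeth > 0:
            results.append((start, meth / (meth + unmeth)))
        start += step
    return results


# Toy counts (made up): a highly methylated region followed by a weakly methylated one.
toy_sites = [(100, 8, 2), (150, 9, 1), (300, 1, 9), (340, 0, 10)]
densities = window_densities(toy_sites, window=100, step=100)
for start, density in densities:
    print(start, round(density, 2))
```

Setting `step` smaller than `window` gives overlapping (sliding) windows; the non-overlapping setting here just keeps the toy output short.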
Amplicon high-throughput sequencing of the 16S ribosomal RNA (rRNA) gene is currently the most widely used technique to investigate complex gut microbial communities. Microbial identification might be influenced by several factors, including the choice of bioinformatic pipeline, making comparisons across studies difficult. Here, we compared four commonly used pipelines (QIIME2, Bioconductor, UPARSE and mothur) run on two operating systems (OS) (Linux and Mac), to evaluate the impact of bioinformatic pipeline and OS on the taxonomic classification of 40 human stool samples. We applied the SILVA 132 reference database for all the pipelines. We compared phyla and genera identification and relative abundances across the four pipelines using the Friedman rank sum test. QIIME2 and Bioconductor provided identical outputs on Linux and Mac OS, while UPARSE and mothur reported only minimal differences between OS. Taxa assignments were consistent at both phylum and genus level across all the pipelines. However, a difference in terms of relative abundance was identified for all phyla (p < 0.013) and for the majority of the most abundant genera (p < 0.028), such as Bacteroides (QIIME2: 24.5%, Bioconductor: 24.6%, UPARSE-linux: 23.6%, UPARSE-mac: 20.6%, mothur-linux: 22.2%, mothur-mac: 21.6%, p < 0.001). The use of different bioinformatic pipelines affects the estimation of the relative abundance of the gut microbial community, indicating that studies using different pipelines cannot be directly compared. A harmonization procedure is needed to move the field forward.
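The Friedman rank sum test mentioned above ranks the pipelines within each sample and then asks whether the rank sums differ more than chance would allow. The following is a minimal sketch of the test statistic only, written in plain Python without the tie correction that statistical packages usually apply. The abundance table is made-up toy data (rows are samples, columns are pipelines), not the study's measurements.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to k, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(table: Sequence[Sequence[float]]) -> float:
    """Friedman statistic (no tie correction) for n samples by k treatments."""
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    return 12.0 * sum(r * r for r in rank_sums) / (n * k * (k + 1)) - 3.0 * n * (k + 1)


# Toy relative abundances (%) of one genus: 3 samples x 3 hypothetical pipelines.
toy_table = [
    [24.5, 23.6, 22.2],
    [30.1, 28.0, 27.5],
    [18.2, 17.9, 16.0],
]
q_stat = friedman_statistic(toy_table)
print(q_stat)
```

Under the null hypothesis the statistic is approximately chi-square distributed with k - 1 degrees of freedom, although that approximation is poor for a table as small as this toy one.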
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary table 1. USEQ pipeline test.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description of the parameters of sebnif and their default values.
https://researchintelo.com/privacy-and-policy
According to our latest research, the Global Bioinformatics Pipelines as a Service market size was valued at $1.98 billion in 2024 and is projected to reach $7.61 billion by 2033, expanding at a robust CAGR of 16.1% during the forecast period of 2025–2033. The primary driver fueling this remarkable growth is the surging demand for scalable, automated, and highly efficient bioinformatics solutions across genomics, proteomics, and other omics research domains. The proliferation of next-generation sequencing technologies, coupled with the exponential growth in biological data generation, has necessitated advanced, cloud-based bioinformatics pipelines that can streamline data analysis, reduce turnaround times, and enhance reproducibility for both research and clinical applications. As a result, Bioinformatics Pipelines as a Service (BPaaS) has emerged as a mission-critical enabler, accelerating scientific discovery and innovation in life sciences while democratizing access to high-performance computational tools.
North America currently holds the largest share of the Bioinformatics Pipelines as a Service market, accounting for over 38% of the global revenue in 2024. This dominance can be attributed to the region’s mature biotechnology and pharmaceutical ecosystem, extensive investments in genomics research, and the presence of leading bioinformatics service providers and cloud computing giants. The United States, in particular, has established a robust regulatory and funding framework that encourages the adoption of advanced digital health solutions, including BPaaS. Major academic research centers and healthcare institutions across North America are increasingly leveraging these platforms to support precision medicine initiatives, large-scale population genomics projects, and translational research, further solidifying the region’s leadership in this market.
In contrast, the Asia Pacific region is projected to exhibit the fastest growth, with a remarkable CAGR of 19.3% between 2025 and 2033. This acceleration is underpinned by substantial investments in national genomics programs, expanding biotechnology hubs in countries such as China, India, and South Korea, and the rising adoption of cloud infrastructure. Governments and private players across Asia Pacific are actively fostering public-private partnerships, upgrading research capabilities, and incentivizing digital transformation in healthcare and life sciences. The growing pool of skilled bioinformaticians, coupled with the region’s large and genetically diverse populations, is creating significant opportunities for BPaaS providers to offer tailored solutions for disease research, drug discovery, and personalized medicine.
Emerging economies in Latin America and Middle East & Africa are gradually embracing bioinformatics pipelines as a service, although market penetration remains constrained by challenges such as limited access to high-speed internet, lower R&D funding, and fragmented healthcare infrastructure. Nonetheless, localized demand for cost-effective and scalable bioinformatics solutions is rising, particularly as academic and clinical institutions seek to participate in global genomics consortia and leverage international expertise. Regulatory harmonization efforts, capacity-building initiatives, and targeted investments in digital health infrastructure are expected to gradually bridge adoption gaps, making these regions promising markets for future expansion.
| Attributes | Details |
| --- | --- |
| Report Title | Bioinformatics Pipelines as a Service Market Research Report 2033 |
| By Component | Software, Services |
| By Deployment Mode | Cloud-based, On-Premises, Hybrid |
| By Application | Genomics, Proteomics, Transcriptomics, Metabolomics, Other |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Preprocessing report generated automatically by the iMAP to provide a summary of quality control of the reads. The iMAP pipeline automatically saved the output in the “reports” folder as “report2_read_preprocessing.html”. (HTML 3463 kb)
The repository includes the representative sequences of the UnFATE 195 genes and the baits designed from them, as well as the single-locus trees, alignments and final phylogenies for the proof-of-concept Pezizomycotina phylogeny inferred using the universal probe set and the pipeline we developed (files ending in "Pezizo_pilotTE"). It also includes the supermatrices and single-locus alignments generated by mining the 195 genes of our gene set from publicly available genomes used in published phylogenomic inferences.
File description: the following tar.gz files contain the reference sequences (unfate_markers_reference_sequences_DNA.tar.gz) obtained from the clustering approach adopted to find the best representative sequences to build the universal bait set (baits.tar.gz).
See Ametrano et al. 2025 (Systemati...
https://media.market.us/privacy-policy
The Global Bioinformatics Services Market is projected to reach USD 10.7 billion by 2033, growing from USD 2.9 billion in 2023 at a CAGR of 13.9%. Growth is being driven by the rapid expansion of genomic and health data generation across research institutions, healthcare systems, and public-health agencies. The World Health Organization’s Global Genomic Surveillance Strategy has positioned bioinformatics as a core element in detecting and responding to health threats. This policy direction is reinforcing global demand for scalable analytical platforms, secure data sharing, and sustainable workflow solutions.
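The headline projection above can be sanity-checked with the standard compound annual growth rate formula, CAGR = (end / start)^(1 / periods) - 1. This minimal sketch assumes ten compounding periods between 2023 and 2033, which is an interpretation of the quoted dates rather than something the report states; the function names are illustrative.

```python
def cagr(start_value: float, end_value: float, periods: int) -> float:
    """Compound annual growth rate implied by a start value, end value and period count."""
    if start_value <= 0 or end_value <= 0 or periods <= 0:
        raise ValueError("values and periods must be positive")
    return (end_value / start_value) ** (1.0 / periods) - 1.0


def project(start_value: float, rate: float, periods: int) -> float:
    """Value reached after compounding start_value at a fixed rate for a number of periods."""
    return start_value * (1.0 + rate) ** periods


# Figures quoted in the text: USD 2.9 billion (2023) to USD 10.7 billion (2033).
implied_rate = cagr(2.9, 10.7, periods=10)  # assumed: 10 annual periods
print(f"implied CAGR: {implied_rate:.1%}")
print(f"projection at the quoted 13.9%: {project(2.9, 0.139, 10):.2f} billion USD")
```

Under that assumption the implied rate comes out close to the quoted 13.9%. Reports that count the forecast period from a later base year will show a different implied rate for the same endpoints.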
A fundamental growth catalyst is the declining cost of sequencing. According to the U.S. National Human Genome Research Institute, the cost per genome has decreased sharply since the late 2000s. As sequencing becomes more affordable, the number of samples increases, driving demand for downstream data storage, processing, and interpretation. Consequently, outsourcing bioinformatics tasks to specialized service providers has become more common and cost-effective.
Another major factor supporting market expansion is the rise in publicly available genomic data. The NIH Sequence Read Archive (SRA) surpassed 50 petabases of data by early 2024, requiring large-scale indexing, quality control, and reanalysis. This massive data load necessitates professional expertise and infrastructure, which are primarily offered by bioinformatics service companies.
The integration of genomics into healthcare systems is further strengthening market growth. The NHS Genomic Medicine Service in England is expanding clinical genomics applications in oncology and rare disease management. This transition creates sustained demand for validated bioinformatics pipelines, variant curation, and clinical reporting services. Healthcare institutions increasingly depend on external service providers for secure, clinical-grade analysis pipelines and data governance compliance, ensuring both accuracy and confidentiality in genomic interpretation.
Public health initiatives and global investments are enhancing the bioinformatics services landscape. Programs like the U.S. CDC’s Advanced Molecular Detection and ECDC’s sequencing integration are driving large-scale genomic surveillance. These initiatives require ongoing analysis, pipeline standardization, and data-platform management, which are largely delivered through external service providers. As countries institutionalize sequencing, recurring demand for bioinformatics workflows and analytic services is expected to persist.
In low- and middle-income countries, international investment is expanding market opportunities. The World Bank’s genomic capacity-building programs in Africa are fostering sequencing and analytics infrastructure. These efforts include bioinformatics training and workflow design, ensuring long-term sustainability. Such projects significantly widen the global serviceable market for bioinformatics expertise. Similarly, large-scale national genomic initiatives like the NIH All of Us program generate billions of variants that require harmonization, annotation, and interpretation, sustaining demand for cloud-based data management and analytic platforms.
The growing focus on antimicrobial resistance (AMR) is also fueling bioinformatics adoption. Under WHO’s GLASS platform, countries are integrating whole-genome sequencing into AMR surveillance. This expansion is creating consistent demand for quality assurance, centralized analysis hubs, and workflow optimization. Furthermore, data governance reforms by the OECD and other regulatory bodies are facilitating secure secondary use of genomic data, promoting trust in data sharing and collaboration.
Strategic public funding further strengthens the market outlook. Horizon Europe’s Health Work Programme (2025) and NHGRI’s technology initiatives continue to fund large-scale, data-driven research, ensuring a steady flow of contracts for bioinformatics firms. Workforce development is also improving, with national systems such as NHS England expanding bioinformatics training. This capacity building not only supports in-house analytics but also increases outsourcing to handle peak workloads and specialized computational tasks.
In conclusion, the bioinformatics services market is benefiting from multiple converging factors—technological affordability, global health investments, regulatory clarity, and expanding data ecosystems. These structural developments are shaping a resilient, long-term demand environment for scalable, compliant, and high-quality bioinformatics services worldwide.
https://www.datainsightsmarket.com/privacy-policy
The Bioinformatics Platforms Market was valued at USD 16.36 Million in 2023 and is projected to reach USD 27.93 Million by 2032, with an expected CAGR of 7.94% during the forecast period. Recent developments include: in June 2022, California's biotechnology research startup LatchBio launched an end-to-end bioinformatics platform for handling big biotech data to accelerate scientific discovery; in March 2022, ARUP launched Rio, a bioinformatics pipeline and analytics platform for better, faster next-generation sequencing test results. Key drivers for this market are: increasing demand for nucleic acid and protein sequencing; increasing initiatives from governments and private organizations; accelerating growth of proteomics and genomics; and increasing research on molecular biology and drug discovery. Potential restraints include: lack of well-defined standards and common data formats for integration of data, data complexity concerns, and lack of user-friendly tools. Notable trends are: the Sequence Analysis Platform segment is expected to hold a significant share over the forecast period.
Targeted resequencing by massively parallel sequencing has become an effective and affordable way to survey small to large portions of the genome for genetic variation. Despite the rapid development in open source software for analysis of such data, the practical implementation of these tools through construction of sequencing analysis pipelines still remains a challenging and laborious activity, and a major hurdle for many small research and clinical laboratories. We developed TREVA (Targeted REsequencing Virtual Appliance), making pre-built pipelines immediately available as a virtual appliance. Based on virtual machine technologies, TREVA is a solution for rapid and efficient deployment of complex bioinformatics pipelines to laboratories of all sizes, enabling reproducible results. The analyses that are supported in TREVA include: somatic and germline single-nucleotide and insertion/deletion variant calling, copy number analysis, and cohort-based analyses such as pathway and significantly mutated genes analyses. TREVA is flexible and easy to use, and can be customised by Linux-based extensions if required. TREVA can also be deployed on the cloud (cloud computing), enabling instant access without investment overheads for additional hardware. TREVA is available at http://bioinformatics.petermac.org/treva/.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Software to produce the figures in the ADP-Seq paper, filter reads properly paired to restriction enzyme fragment termini, and quantify the amount of ADPr DNA modification in each restriction enzyme fragment.