100+ datasets found
  1. Data from: Semi-artificial datasets as a resource for validation of...

    • explore.openaire.eu
    • search.dataone.org
    Updated Nov 2, 2021
    Cite
    Lucie Tamisier; Annelies Haegeman; Yoika Foucart; Nicolas Fouillien; Maher Al Rwahnih; Nihal Buzkan; Thierry Candresse; Michela Chiumenti; Kris De Jonghe; Marie Lefebvre; Paolo Margaria; Jean Sébastien Reynard; Kristian Stevens; Denis Kutnjak; Sébastien Massart (2021). Semi-artificial datasets as a resource for validation of bioinformatics pipelines for plant virus detection [Dataset]. http://doi.org/10.5281/zenodo.5572591
    Explore at:
    Dataset updated
    Nov 2, 2021
    Authors
    Lucie Tamisier; Annelies Haegeman; Yoika Foucart; Nicolas Fouillien; Maher Al Rwahnih; Nihal Buzkan; Thierry Candresse; Michela Chiumenti; Kris De Jonghe; Marie Lefebvre; Paolo Margaria; Jean Sébastien Reynard; Kristian Stevens; Denis Kutnjak; Sébastien Massart
    Description

    In the last decade, High-Throughput Sequencing (HTS) has revolutionized biology and medicine. This technology allows the sequencing of huge amounts of DNA and RNA fragments at a very low price. In medicine, HTS tests for disease diagnostics have already been brought into routine practice. However, adoption in plant health diagnostics is still limited. One of the main bottlenecks is the lack of expertise and consensus on the standardization of the data analysis. The Plant Health Bioinformatic Network (PHBN) is a Euphresco project aiming to build a community network of bioinformaticians/computational biologists working in plant health. One of the main goals of the project is to develop reference datasets that can be used for validation of bioinformatics pipelines and for standardization purposes. Semi-artificial datasets have been created for this purpose (Datasets 1 to 10). They are composed of a "real" HTS dataset spiked with artificial viral reads. They allow researchers to adjust their pipelines and parameters so that the results approximate the actual viral composition of the semi-artificial datasets as closely as possible. Each semi-artificial dataset makes it possible to test one or several limitations that could prevent virus detection or correct virus identification from HTS data (i.e. low viral concentration, new viral species, incomplete genome). Eight artificial datasets composed only of viral reads (no background data) have also been created (Datasets 11 to 18). Each dataset consists of a mix of several isolates from the same viral species at different frequencies. The viral species were selected to be as divergent as possible. These datasets can be used to test haplotype reconstruction software, the goal being to reconstruct all the isolates present in a dataset. A GitLab repository (https://gitlab.com/ilvo/VIROMOCKchallenge) is available and provides a complete description of the composition of each dataset, the methods used to create them and their goals.
    Files: Dataset_x.fastq.gz are the FASTQ files of the 18 datasets. "Description of the datasets" is a Word document describing each dataset.
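
    The description says each semi-artificial dataset is a real HTS dataset spiked with artificial viral reads. The sketch below only illustrates that idea and is not the method used by the dataset authors (their method is documented in the GitLab repository). The function names, the read labels, and the 1% target fraction are all hypothetical.

```python
import random


def viral_reads_needed(n_background, viral_fraction):
    """Number of viral reads so that they make up `viral_fraction` of the mix."""
    return round(viral_fraction * n_background / (1 - viral_fraction))


def spike(background, viral, viral_fraction, seed=0):
    """Return a shuffled mix of all background reads plus sampled viral reads."""
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    n_viral = viral_reads_needed(len(background), viral_fraction)
    picked = rng.sample(viral, n_viral)  # sample without replacement
    mixed = list(background) + picked
    rng.shuffle(mixed)
    return mixed


# Hypothetical inputs: read names stand in for full FASTQ records.
background_reads = [f"bg_{i}" for i in range(990)]
viral_reads = [f"virus_{i}" for i in range(100)]

mixed_reads = spike(background_reads, viral_reads, viral_fraction=0.01)
n_spiked = sum(name.startswith("virus_") for name in mixed_reads)
print(len(mixed_reads), n_spiked)
```

    Fixing the random seed keeps the mix reproducible, which matters if the same spiked dataset is to be reused as a reference.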

  2. Data from: Transcriptomic and bioinformatics analysis of the early...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Data from: Transcriptomic and bioinformatics analysis of the early time-course of the response to prostaglandin F2 alpha in the bovine corpus luteum [Dataset]. https://catalog.data.gov/dataset/data-from-transcriptomic-and-bioinformatics-analysis-of-the-early-time-course-of-the-respo-cd938
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    RNA expression analysis was performed on corpus luteum tissue at five time points after prostaglandin F2 alpha treatment of midcycle cows using an Affymetrix Bovine Gene v1 Array. The normalized linear microarray data were uploaded to the NCBI GEO repository (GSE94069). Subsequent statistical analysis identified differentially expressed transcripts as those with a ±1.5-fold change from saline control and P ≀ 0.05. Gene ontology of differentially expressed transcripts was annotated by DAVID and Panther. Physiological characteristics of the study animals are presented in a figure. Bioinformatic analysis by Ingenuity Pathway Analysis was curated, compiled, and presented in tables. A dataset comparison with similar microarray analyses was performed, and bioinformatics analyses by Ingenuity Pathway Analysis, DAVID, Panther, and String of the differentially expressed genes from each dataset, as well as of the differentially expressed genes common to all three datasets, were curated, compiled, and presented in tables. Finally, a table comparing four bioinformatics tools' predictions of functions associated with genes common to all three datasets is presented. These data have been further analyzed and interpreted in the companion article "Early transcriptome responses of the bovine mid-cycle corpus luteum to prostaglandin F2 alpha includes cytokine signaling". Resources in this dataset: Resource Title: Supporting information as Excel spreadsheets and tables. File Name: Web Page, url: http://www.sciencedirect.com/science/article/pii/S2352340917304031?via=ihub#s0070
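
    The selection rule quoted above (a 1.5-fold change in either direction and a P value of at most 0.05) can be sketched as a simple filter. This is only an illustration: the record layout, the transcript IDs, and the signed fold-change convention (+2.0 means doubled, -2.0 means halved) are assumptions, not taken from the dataset.

```python
from typing import NamedTuple


class Transcript(NamedTuple):
    """One transcript compared against the saline control (hypothetical layout)."""
    transcript_id: str
    fold_change: float  # assumed signed linear fold change: +2.0 doubled, -2.0 halved
    p_value: float


def is_differentially_expressed(t, min_fold=1.5, max_p=0.05):
    """Keep transcripts changed at least `min_fold` in either direction with P <= max_p."""
    return abs(t.fold_change) >= min_fold and t.p_value <= max_p


# Made-up example rows, one per outcome of the two thresholds.
rows = [
    Transcript("T1", 2.3, 0.01),   # up-regulated and significant
    Transcript("T2", -1.8, 0.04),  # down-regulated and significant
    Transcript("T3", 1.2, 0.001),  # significant, but the change is too small
    Transcript("T4", -3.0, 0.20),  # large change, but not significant
]

selected = [t.transcript_id for t in rows if is_differentially_expressed(t)]
print(selected)
```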

  3. Dataset of books about Bioinformatics-Data processing

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books about Bioinformatics-Data processing [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=j0-book_subject&fop0=%3D&fval0=Bioinformatics-Data+processing&j=1&j0=book_subjects
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 4 rows and is filtered to rows where the book subject is Bioinformatics-Data processing. It features 9 columns including author, publication date, language, and book publisher.

  4. Data from: Data reuse and the open data citation advantage

    • zenodo.org
    • search.dataone.org
    bin, csv, txt
    Updated May 28, 2022
    Cite
    Heather A. Piwowar; Todd J. Vision (2022). Data from: Data reuse and the open data citation advantage [Dataset]. http://doi.org/10.5061/dryad.781pv
    Explore at:
    Available download formats: bin, csv, txt
    Dataset updated
    May 28, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Heather A. Piwowar; Todd J. Vision
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Background: Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the "citation benefit". Furthermore, little is known about patterns in data reuse over time and across datasets. Method and Results: Here, we look at citation rates while controlling for many known citation predictors, and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. 
    The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties. Conclusion: After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.

  5. Dataset of books series that contain Introduction to bioinformatics

    • workwithdata.com
    Updated Nov 25, 2024
    Cite
    Work With Data (2024). Dataset of books series that contain Introduction to bioinformatics [Dataset]. https://www.workwithdata.com/datasets/book-series?f=1&fcol0=j0-book&fop0=%3D&fval0=Introduction+to+bioinformatics&j=1&j0=books
    Explore at:
    Dataset updated
    Nov 25, 2024
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about book series. It has 2 rows and is filtered to rows where the book is Introduction to bioinformatics. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.

  6. 🧫 Promoter or not? - Bioinformatics đŸ—ƒïž Dataset

    • kaggle.com
    Updated Mar 31, 2024
    Cite
    Samira Shemirani (2024). 🧫 Promoter or not? - Bioinformatics đŸ—ƒïž Dataset [Dataset]. https://www.kaggle.com/datasets/samira1992/promoter-or-not-bioinformatics-dataset
    Explore at:
    Dataset updated
    Mar 31, 2024
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Samira Shemirani
    Description

    The promoter region is located near the transcription start site and regulates transcription initiation of the gene by controlling the binding of RNA polymerase. Thus, recognition of the promoter region is an important area of interest in the field of bioinformatics. Over the past years, many new promoter prediction programs (PPPs) have emerged. PPPs aim to identify promoter regions in a genome using computational methods. Promoter prediction is a supervised learning problem in which features are extracted in three main steps: 1) CpG islands, 2) structural features, 3) content features.
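
    As an illustration of the first feature group, a CpG-island-style feature is commonly summarized by GC content and the observed/expected CpG ratio, (number of CG dinucleotides × length) / (number of C × number of G). The sketch below computes both for a sequence. It is not taken from the Kaggle dataset, and the function name and example sequences are hypothetical.

```python
def cpg_features(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # CG dinucleotides cannot overlap, so count() is exact
    gc_content = (c + g) / n if n else 0.0
    # Observed/expected ratio; defined as 0.0 when the sequence has no C or no G.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


# Hypothetical sequences: one CpG-rich, one with no C or G at all.
rich = cpg_features("CGCGCGCG")
poor = cpg_features("ATATATAT")
print(rich, poor)
```

    Features like these would then be combined with structural and content features as inputs to a classifier.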

  7. Artificial real metagenomic reads

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, tsv
    Updated Feb 15, 2024
    Cite
    Michael B. Hall (2024). Artificial real metagenomic reads [Dataset]. http://doi.org/10.5281/zenodo.10472796
    Explore at:
    Available download formats: application/gzip, tsv
    Dataset updated
    Feb 15, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Michael B. Hall
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    Jan 9, 2024
    Description

    We created Nanopore and Illumina metagenomic datasets by combining real (non-simulated) sequencing reads into an artificial metagenomic dataset. By doing this, we can be highly confident of the true taxa for each read in the dataset.

    We used samples with matched Illumina and Nanopore sequencing to ensure that differences are driven purely by technology and not by composition.

    For the human component, we combined reads from three individuals in the 1000 Genomes Project. The Nanopore data were downloaded from the 1000G ONT Sequencing Consortium (https://millerlaboratory.com/1000G-ONT.html), and we provide URLs for each sample:

    • HG00277, Finnish, male: Illumina NovaSeq 6000 (accession: ERR3241786) and Nanopore R10.4
    • NA19318, Luhya, Kenya, male: Illumina NovaSeq 6000 (accession: ERR3239713) and Nanopore R10.4 (basecalled with Dorado v0.3.4)
    • HG03611, Bengali, Bangladesh, female: Illumina NovaSeq 6000 (accession: ERR3243073) and Nanopore R10.4 (basecalled with Dorado v0.3.4)

    Each human readset was randomly downsampled to 1Gbp using rasusa (v0.7.1). For the M. tuberculosis component we used Illumina HiSeq 4000 (accession: ERR245682) and Nanopore R10.3 (accession: ERR8170871) (note: we used R10.3 as there are no R10.4 M. tuberculosis WGS datasets publicly available). For the bacterial component, we used Illumina MiSeq (accession: ERR7255689) and Nanopore R10.4 (accession: ERR7287988) reads from the ZymoBIOMICS HMW DNA Standard D6322 (Zymo Research), which contains seven bacterial strains and one fungal strain, none of which are Mycobacterium. We removed Nanopore reads shorter than 500bp from all datasets, and the M. tuberculosis and Zymo datasets were downsampled to 3Gbp with rasusa. All human, M. tuberculosis, and Zymo reads were combined into a single artificial metagenomic file.
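
    The length filter mentioned above (dropping Nanopore reads shorter than 500bp) can be sketched in plain Python. This is a minimal illustration, not the author's tooling: it assumes simple four-line FASTQ records, and the function names and toy reads are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or an empty line) stops the parse
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def filter_by_length(records, min_len=500):
    """Keep only records whose sequence is at least `min_len` bases long."""
    for header, seq, qual in records:
        if len(seq) >= min_len:
            yield header, seq, qual


# Toy input: reads of 600, 500, and 100 bases (quality strings are placeholders).
toy = "".join(
    f"@{name}\n{'A' * n}\n+\n{'I' * n}\n"
    for name, n in [("long", 600), ("edge", 500), ("short", 100)]
)
kept = [header for header, _, _ in filter_by_length(read_fastq(io.StringIO(toy)))]
print(kept)
```

    For real data the same generator could be fed from `gzip.open(path, "rt")`, since the files are distributed gzip-compressed.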

  8. Molecular Biology Databases Published in Nucleic Acids Research between...

    • databank.illinois.edu
    Updated Feb 1, 2024
    Cite
    Heidi Imker (2024). Molecular Biology Databases Published in Nucleic Acids Research between 1991-2016 [Dataset]. http://doi.org/10.13012/B2IDB-4311325_V1
    Explore at:
    Dataset updated
    Feb 1, 2024
    Authors
    Heidi Imker
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset was developed to create a census of sufficiently documented molecular biology databases to answer several preliminary research questions. Articles published in the annual Nucleic Acids Research (NAR) “Database Issues” were used to identify a population of databases for study. Namely, the questions addressed herein include: 1) what is the historical rate of database proliferation versus rate of database attrition?, 2) to what extent do citations indicate persistence?, and 3) are databases under active maintenance and does evidence of maintenance likewise correlate to citation? An overarching goal of this study is to provide the ability to identify subsets of databases for further analysis, both as presented within this study and through subsequent use of this openly released dataset.

  9. Principles and steps for integrating bioinformatics

    • data.mendeley.com
    Updated Aug 7, 2024
    Cite
    Hang Thi Nguyen (2024). Principles and steps for integrating bioinformatics [Dataset]. http://doi.org/10.17632/wjx5h7wh22.3
    Explore at:
    Dataset updated
    Aug 7, 2024
    Authors
    Hang Thi Nguyen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Biological data is increasing at a high speed, creating a vast amount of knowledge, while the updating of knowledge in teaching is limited and classroom time remains unchanged. Therefore, integrating bioinformatics into teaching will be effective in teaching biology today. However, the big challenge is that pedagogical university students have yet to learn the basic knowledge and skills of bioinformatics, so they have difficulty and confusion when using it in biology teaching. This dataset includes survey results on high school teachers, teacher training curriculums and pedagogical students in Vietnam. The highlights of this dataset are six basic principles and four steps of bioinformatics integration in teaching biology at high schools, with illustrative examples. The principles and approaches of integrating bioinformatics into biology teaching improve the quality of biology teaching and promote STEM education in Vietnam and developing countries.

  10. Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Dec 7, 2023
    Cite
    Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    HIV Prevention Trials Network
    HIV Vaccine Trials Network: http://www.hvtn.org/
    National Institute of Allergy and Infectious Diseases: http://www.niaid.nih.gov/
    PEPFAR
    Authors
    Dylan Westfall; Mullins James
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.

    Methods

    This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies". Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005.

    For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.

    The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub. The consensus.tar.gz and tagged.tar.gz files were moved from sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz).
    Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results. Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, 2 fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program. To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence.
    By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper. Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
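
    The core idea behind UMI-based consensus calling (group reads that share a UMI, then take a per-position majority vote) can be sketched as follows. This is a deliberately simplified illustration and not the PORPIDpipeline or sUMI_dUMI_comparison implementation: it assumes reads are already demultiplexed, equal in length, and aligned, and all names and sequences are hypothetical.

```python
from collections import Counter, defaultdict


def group_by_umi(reads):
    """Group (umi, sequence) pairs into families keyed by UMI."""
    families = defaultdict(list)
    for umi, seq in reads:
        families[umi].append(seq)
    return families


def majority_consensus(seqs):
    """Column-wise majority vote; assumes equal-length, pre-aligned sequences."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*seqs))


# Toy reads: the third read in family AAAC carries a single sequencing error.
reads = [
    ("AAAC", "ACGT"),
    ("AAAC", "ACGT"),
    ("AAAC", "ACTT"),
    ("GGTA", "TTGA"),
    ("GGTA", "TTGA"),
]

consensus = {umi: majority_consensus(seqs) for umi, seqs in group_by_umi(reads).items()}
print(consensus)
```

    A real pipeline also has to discard UMIs that are themselves the product of PCR or sequencing errors and to detect recombinant templates, as the description notes; the majority vote here covers only the error-correction step within a family.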

  11. Bioinformatics Software Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Cite
    Dataintelo (2025). Bioinformatics Software Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/bioinformatics-software-market
    Explore at:
    Available download formats: csv, pptx, pdf
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Bioinformatics Software Market Outlook



    The global bioinformatics software market size was valued at approximately USD 10 billion in 2023, and it is projected to reach around USD 25 billion by 2032, growing at a robust CAGR of 11% during the forecast period. This remarkable growth is fueled by the increased application of bioinformatics in drug discovery and development, the rising demand for personalized medicine, and the ongoing advancements in sequencing technologies. The convergence of biology and information technology has led to the optimization of biological data management, propelling the market's expansion as it transforms the landscape of biotechnology and pharmaceutical research. The rapid integration of artificial intelligence and machine learning techniques to process complex biological data further accentuates the growth trajectory of this market.
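
    The figures in this paragraph can be cross-checked with the compound annual growth rate formula, end = start × (1 + rate)^years. The short sketch below assumes nine compounding years (2023 to 2032) and treats the rounded USD values from the text as inputs. The function names are illustrative.

```python
def project(value, rate, years):
    """Compound `value` at an annual growth `rate` for `years` years."""
    return value * (1 + rate) ** years


def implied_cagr(start, end, years):
    """Annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years) - 1


years = 2032 - 2023  # assumed number of compounding periods

# USD 10 billion growing at 11% per year
projected_2032 = project(10.0, 0.11, years)

# Rate implied by going from USD 10 billion to USD 25 billion
rate = implied_cagr(10.0, 25.0, years)

print(f"projected 2032 value: {projected_2032:.2f} billion USD")
print(f"implied CAGR: {rate:.2%}")
```

    Under these assumptions the 11% rate and the "around USD 25 billion" target are mutually consistent to within rounding.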



    An essential growth factor for the bioinformatics software market is the burgeoning demand for sequencing technologies. The decreasing cost of sequencing has led to a massive increase in the volume of genomic data generated, necessitating advanced software solutions to manage and interpret this data efficiently. This demand is particularly evident in genomics and proteomics, where bioinformatics software plays a critical role in analyzing and visualizing large datasets. Additionally, the adoption of cloud computing in bioinformatics offers scalable resources and cost-effective solutions for data storage and processing, further fueling market growth. The increasing collaboration between research institutions and software companies to develop innovative bioinformatics tools is also contributing positively to market expansion.



    Another significant driver is the growth of personalized medicine, which relies heavily on bioinformatics for the analysis of individual genetic information to tailor therapeutic strategies. As healthcare systems worldwide move towards precision medicine, the demand for bioinformatics software that can integrate genetic, phenotypic, and environmental data becomes more pronounced. This trend is not only transforming patient care but also significantly impacting drug development processes, as pharmaceutical companies aim to create more effective and targeted therapies. The strategic partnerships and collaborations between biotech firms and bioinformatics software providers are critical in advancing personalized medicine and enhancing patient outcomes.



    The increasing prevalence of complex diseases such as cancer and neurological disorders necessitates comprehensive research efforts, driving the need for robust bioinformatics software. These diseases require multi-omics approaches for better understanding, diagnosis, and treatment, where bioinformatics tools are indispensable. The ongoing research and development activities in this area, supported by government funding and private investments, are fostering innovation in bioinformatics solutions. Furthermore, the development of user-friendly and intuitive software interfaces is expanding the market beyond specialized research labs to include clinical settings and hospitals, broadening the potential user base and enhancing market penetration.



    From a regional perspective, North America currently leads the bioinformatics software market, thanks to its advanced technological infrastructure, significant investment in healthcare R&D, and the presence of numerous key market players. The region accounted for the largest market share in 2023 and is expected to maintain its dominance throughout the forecast period. Meanwhile, the Asia Pacific region is anticipated to exhibit the highest CAGR, driven by increasing investments in biotechnology and pharmaceutical research, expanding healthcare infrastructure, and the rising adoption of bioinformatics in emerging economies like China and India. Europe's market growth is also significant, supported by substantial funding for genomic research and a strong focus on precision medicine initiatives.



    Lifesciences Data Mining and Visualization are becoming increasingly vital in the bioinformatics software market. As the volume of biological data continues to grow exponentially, the need for sophisticated tools to mine and visualize this data is paramount. These tools enable researchers to uncover hidden patterns and insights from complex datasets, facilitating breakthroughs in genomics, proteomics, and other life sciences fields. The integration of advanced data mining techniques with visualization capabilities allows for a more intuitive

  12. Simplified illustration of how the NGS record with base UUID...

    • plos.figshare.com
    xls
    Updated Apr 21, 2025
    Cite
    Marius Zeeb; Paul Frischknecht; Suraj Balakrishna; Lisa Jörimann; Jasmin Tschumi; Levente Zsichla; Sandra E. Chaudron; Bashkim Jaha; Kathrin Neumann; Christine Leemann; Michael Huber; Karoline Leuzinger; Huldrych F. GĂŒnthard; Karin J. Metzner; Roger D. Kouyos (2025). Simplified illustration of how the NGS record with base UUID 5bfc99f6-8432-4afc-be32-3f9d2dfa4871 might have been created in the database core’s blob log as a series of UUID-named files whose content is interpreted to yield the SHCND subgraph shown in Fig 2A. The sequence number arbitrarily starts at 1000 and there are no other blobs created between the initial creation of the NGS record and the uploading and attachment of its fastq files. [Dataset]. http://doi.org/10.1371/journal.pdig.0000825.t001
    Available download formats: xls
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    PLOS Digital Health
    Authors
    Marius Zeeb; Paul Frischknecht; Suraj Balakrishna; Lisa Jörimann; Jasmin Tschumi; Levente Zsichla; Sandra E. Chaudron; Bashkim Jaha; Kathrin Neumann; Christine Leemann; Michael Huber; Karoline Leuzinger; Huldrych F. Günthard; Karin J. Metzner; Roger D. Kouyos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Simplified illustration of how the NGS record with base UUID 5bfc99f6-8432-4afc-be32-3f9d2dfa4871 might have been created in the database core’s blob log as a series of UUID-named files whose content is interpreted to yield the SHCND subgraph shown in Fig 2A. The sequence number arbitrarily starts at 1000 and there are no other blobs created between the initial creation of the NGS record and the uploading and attachment of its fastq files.

  13. Alternative Splicing Annotation Project II Database

    • rrid.site
    • neuinfo.org
    • +1more
    Updated Jun 26, 2025
    Cite
    (2025). Alternative Splicing Annotation Project II Database [Dataset]. http://identifiers.org/RRID:SCR_000322
    Dataset updated
    Jun 26, 2025
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented on 8/12/13. An expanded version of the Alternative Splicing Annotation Project (ASAP) database with a new interface and integration of comparative features using UCSC BLASTZ multiple alignments. It supports 9 vertebrate species, 4 insects, and nematodes, and provides extensive alternative splicing analysis together with the resulting splice variants. For human alternative splicing data, newly added EST libraries were classified and included in the previous tissue and cancer classification, and the lists of tissue- and cancer (vs. normal)-specific alternatively spliced genes were recalculated and updated. The authors created novel orthologous exon and intron databases, with their splice variants, based on multiple alignments among several species. These orthologous exon and intron databases can give more comprehensive homologous gene information than protein-similarity-based methods. Furthermore, splice junction and exon identity among species can be valuable resources for elucidating species-specific genes. The ASAP II database can be easily integrated with pygr (unpublished, the Python Graph Database Framework for Bioinformatics) and its powerful features such as graph query and multi-genome alignment query. ASAP II can be searched by several criteria such as gene symbol, gene name and ID (UniGene, GenBank etc.). The web interface provides 7 kinds of views: (I) user query, UniGene annotation, orthologous genes and genome browsers; (II) genome alignment; (III) exons and orthologous exons; (IV) introns and orthologous introns; (V) alternative splicing; (VI) isoform and protein sequences; (VII) tissue and cancer vs. normal specificity. ASAP II shows genome alignments of isoforms, exons, and introns in a UCSC-like genome browser. All alternative splicing relationships with supporting evidence information, types of alternative splicing patterns, and inclusion rates for skipped exons are listed in separate tables. Users can also search human data for tissue- and cancer-specific splice forms at the bottom of the gene summary page. The p-values for tissue specificity are reported as log-odds (LOD) scores, and results with LOD >= 3 and at least 3 EST sequences are highlighted.
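
    The highlighting rule in the description (LOD >= 3 and at least 3 EST sequences) can be sketched as a simple filter. This is a minimal illustration and not ASAP II code: the `SpliceForm` record, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SpliceForm:
    """Hypothetical record for one tissue-specific splice form."""
    gene: str        # placeholder gene label, not a real symbol
    tissue: str
    lod: float       # log-odds score for tissue specificity
    est_count: int   # number of supporting EST sequences


def is_highlighted(form, min_lod=3.0, min_ests=3):
    """Apply the stated rule: LOD >= 3 and at least 3 ESTs."""
    return form.lod >= min_lod and form.est_count >= min_ests


# Made-up example values, chosen to exercise both thresholds.
forms = [
    SpliceForm("GENE_A", "brain", 4.2, 5),   # passes both thresholds
    SpliceForm("GENE_B", "liver", 3.5, 2),   # too few ESTs
    SpliceForm("GENE_C", "lung", 2.9, 10),   # LOD too low
    SpliceForm("GENE_D", "brain", 3.0, 3),   # exactly at both thresholds
]

highlighted = [f.gene for f in forms if is_highlighted(f)]
print(highlighted)
```

    Both thresholds are inclusive here, which matches the ">=" and "at least" wording of the description.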

  14. Genomics England - Bioinformatics

    • healthdatagateway.org
    unknown
    Updated Mar 30, 2023
    Cite
    The 100,000 Genomes Project Protocol v3, Genomics England. doi:10.6084/m9.figshare.4530893.v3. 2017. Publications that use the Genomics England Database should include an author as Genomics England Research Consortium. Please see the publication policy. (2023). Genomics England - Bioinformatics [Dataset]. https://healthdatagateway.org/dataset/381
    Available download formats: unknown
    Dataset updated
    Mar 30, 2023
    Dataset provided by
    Genomics England
    Authors
    The 100,000 Genomes Project Protocol v3, Genomics England. doi:10.6084/m9.figshare.4530893.v3. 2017. Publications that use the Genomics England Database should include an author as Genomics England Research Consortium. Please see the publication policy.
    License

    https://www.genomicsengland.co.uk/about-gecip/joining-research-community/

    Description

    To identify and enrol participants for the 100,000 Genomes Project we have created NHS Genomic Medicine Centres (GMCs). Each centre includes several NHS Trusts and hospitals. GMCs recruit and consent patients. They then provide DNA samples and clinical information for analysis.

    Illumina, a biotechnology company, have been commissioned to sequence the DNA of participants. They return the whole genome sequences to Genomics England. We have created a secure, monitored, infrastructure to store the genome sequences and clinical data. The data is analysed within this infrastructure and any important findings, like a diagnosis, are passed back to the patient’s doctor.

    To help make sure that the project brings benefits for people who take part, we have created the Genomics England Clinical Interpretation Partnership (GeCIP). GeCIP brings together funders, researchers, NHS teams and trainees. They will analyse the data – to help ensure benefits for patients and an increased understanding of genomics. The data will also be used for medical and scientific research. This could be research into diagnosing, understanding or treating disease.

    To learn more about how we work you can read the 100,000 Genomes Project protocol. It has details of the development, delivery and operation of the project. It also sets out the patient and clinical benefit, scientific and transformational objectives, the implementation strategy and the ethical and governance frameworks.

  15. References and test datasets for the Cactus pipeline

    • figshare.scilifelab.se
    • researchdata.se
    txt
    Updated Jan 15, 2025
    Cite
    Jerome Salignon; Lluis Milan Arino; Maxime Garcia; Christian Riedel (2025). References and test datasets for the Cactus pipeline [Dataset]. http://doi.org/10.17044/scilifelab.20171347.v4
    Available download formats: txt
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    Karolinska Institutet
    Authors
    Jerome Salignon; Lluis Milan Arino; Maxime Garcia; Christian Riedel
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Overview: This item contains references and test datasets for the Cactus pipeline. Cactus (Chromatin ACcessibility and Transcriptomics Unification Software) is an mRNA-Seq and ATAC-Seq analysis pipeline that aims to provide advanced molecular insights on the conditions under study.

    Test datasets: The test datasets contain all data needed to run Cactus in each of the 4 supported organisms. This includes ATAC-Seq and mRNA-Seq data (.fastq.gz), parameter files (.yml) and design files (.tsv). They were created for each species by downloading publicly available datasets with fetchngs (Ewels et al., 2020) and subsampling reads to the minimum required to have enough DAS (Differential Analysis Subsets) for enrichment analysis. Datasets downloaded: Worm and Humans: GSE98758; Fly: GSE149339; Mouse: GSE193393.
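
    The description does not say how the reads were subsampled. As one possible illustration, the sketch below uses reservoir sampling to draw a fixed number of records from a stream without loading it all into memory. The function name `subsample_records` and the toy read identifiers are assumptions made for this example; this is not Cactus or fetchngs code.

```python
import random


def subsample_records(records, k, seed=0):
    """Return k records chosen uniformly at random from an iterable.

    Reservoir sampling (Algorithm R): keep the first k records, then
    replace a random slot with decreasing probability as more arrive.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


# Toy stand-ins for FASTQ records (hypothetical read identifiers).
reads = [f"read_{n}" for n in range(1000)]
subset = subsample_records(reads, k=50, seed=42)
print(len(subset))
```

    A fixed seed makes the subset reproducible, which matters when the subsampled files are meant to serve as stable test data.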

    References: One of the goals of Cactus is to make the analysis as simple and fast as possible for the user while providing detailed insights on molecular mechanisms. This is achieved by parsing all needed references for the 4 ENCODE (Dunham et al., 2012; Stamatoyannopoulos et al., 2012; Luo et al., 2020) and modENCODE (THE MODENCODE CONSORTIUM et al., 2010; Gerstein et al., 2010) organisms (human, M. musculus, D. melanogaster and C. elegans). This parsing step was done with a Nextflow pipeline with most tools encapsulated within containers for improved efficiency and reproducibility and to allow the creation of customized references. Genomic sequences and annotations were downloaded from Ensembl (Cunningham et al., 2022). The ENCODE API (Luo et al., 2020) was used to download the ChIP-Seq profiles of 2,714 Transcription Factors (TFs) (Landt et al., 2012; Boyle et al., 2014) and chromatin states in the form of 899 ChromHMM profiles (Boix et al., 2021; van der Velde et al., 2021) and 6 HiHMM profiles (Ho et al., 2014). Slim annotations (cell, organ, development, and system) were parsed and used to create groups of ChIP-Seq profiles that share the same annotations, allowing users to analyze only ChIP-Seq profiles relevant to their study. 2,779 TF motifs were obtained from the Cis-BP database (Lambert et al., 2019). GO terms and KEGG pathways were obtained via the R packages AnnotationHub (Morgan and Shepherd, 2021) and clusterProfiler (Yu et al., 2012; Wu et al., 2021), respectively.

    Documentation: More information on how to use Cactus and how references and test datasets were created is available on the documentation website: https://github.com/jsalignon/cactus.

  16. augmentation data for DAISM

    • data.mendeley.com
    • explore.openaire.eu
    • +1more
    Updated Jun 22, 2022
    Cite
    Yating Lin (2022). augmentation data for DAISM [Dataset]. http://doi.org/10.17632/ysjwjvpnh3.1
    Dataset updated
    Jun 22, 2022
    Authors
    Yating Lin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The purified dataset for data augmentation for DAISM-DNNXMBD can be downloaded from this repository.

    The pbmc8k dataset downloaded from 10X Genomics was processed and used for data augmentation to create training datasets for DAISM-DNN models. pbmc8k.h5ad contains 5 cell types (B.cells, CD4.T.cells, CD8.T.cells, monocytic.lineage, NK.cells), and pbmc8k_fine.h5ad contains 7 cell types (naive.B.cells, memory.B.cells, naive.CD4.T.cells, memory.CD4.T.cells, naive.CD8.T.cells, memory.CD8.T.cells, regulatory.T.cells, monocytes, macrophages, myeloid.dendritic.cells, NK.cells).

    The RNA-seq dataset contains 5 cell types (B.cells, CD4.T.cells, CD8.T.cells, monocytic.lineage, NK.cells). Raw FASTQ reads were downloaded from the NCBI website, and transcript- and gene-level expression quantification was performed using Salmon (version 0.11.3) with Gencode v29 after quality control of FASTQ reads using fastp. All tools were used with default parameters.

  17. Simulated metagenomic dataset for Smith et al. 2022

    • find.data.gov.scot
    • dtechtive.com
    txt
    Updated Apr 22, 2022
    Cite
    University of Edinburgh. The Roslin Institute (2022). Simulated metagenomic dataset for Smith et al. 2022 [Dataset]. http://doi.org/10.7488/ds/3444
    Available download formats: txt (0.0166 MB)
    Dataset updated
    Apr 22, 2022
    Dataset provided by
    University of Edinburgh. The Roslin Institute
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    UNITED KINGDOM
    Description

    This dataset is simulated metagenomic data created by Rebecca (Becky) Smith, a PhD student at the Roslin Institute in Mick Watson's group. The data is described in detail in Smith et al. 2022; briefly, the reads were simulated using InSilicoSeq (https://doi.org/10.1093/bioinformatics/bty630) with the hiseq exponential model and 150 bp. The genomes used to create this data are from the Hungate Collection (paper at https://www.nature.com/articles/nbt.4110 and sequences at https://genome.jgi.doe.gov/portal/HungateCollection/HungateCollection.info.html ).
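
    To illustrate what an exponential abundance model means for a simulated community, the sketch below draws one exponentially distributed weight per genome, normalises the weights to relative abundances, and turns them into approximate read counts. It is a standalone illustration under stated assumptions and does not reproduce InSilicoSeq's implementation; the function names, genome labels, and read total are hypothetical.

```python
import random


def exponential_abundances(genome_ids, seed=0):
    """Draw one exponential weight per genome and normalise to sum to 1."""
    rng = random.Random(seed)
    raw = {g: rng.expovariate(1.0) for g in genome_ids}
    total = sum(raw.values())
    return {g: w / total for g, w in raw.items()}


def reads_per_genome(abundances, n_reads):
    """Convert relative abundances into approximate read counts.

    Rounding means the counts may not sum exactly to n_reads.
    """
    return {g: round(a * n_reads) for g, a in abundances.items()}


# Hypothetical genome labels and read total, for illustration only.
genomes = ["genome_1", "genome_2", "genome_3", "genome_4"]
abundances = exponential_abundances(genomes, seed=1)
counts = reads_per_genome(abundances, n_reads=10_000)
for genome in genomes:
    print(genome, round(abundances[genome], 3), counts[genome])
```

    Under this kind of model a few genomes typically dominate the read pool while the rest are rare, which is why low-abundance members are harder to recover from the simulated data.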

  18. Data from: BBGD: an online database for blueberry genomic data

    • data.wu.ac.at
    • agdatacommons.nal.usda.gov
    • +1more
    html, xls
    Updated Dec 21, 2017
    Cite
    Department of Agriculture (2017). Data from: BBGD: an online database for blueberry genomic data [Dataset]. https://data.wu.ac.at/schema/data_gov/MmM3MTAyNTktNTYwMS00M2Q5LWI1OGEtNzFkNzA0NDkwYzEz
    Available download formats: html, xls
    Dataset updated
    Dec 21, 2017
    Dataset provided by
    Department of Agriculture
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset is supplemental to the article "BBGD: an online database for blueberry genomic data," (2007); it is titled "list of genes printed on microarray slides."

    The article, "BBGD: an online database for blueberry genomic data," (2007) involving blueberry cold hardiness experiments has a list of all the genes that were printed on microarray slides. This dataset, supplemental to the article, is called: "list of genes printed on microarray slides." 1471-2229-7-5-s1.xls 663k.
    By using the BBGD database, researchers developed EST-based markers for mapping, and have identified a number of "candidate" cold tolerance genes that are highly expressed in blueberry flower buds after exposure to low temperatures.

    BBGD (http://bioinformatics.towson.edu/BBGD/) is a public online database, and was developed for blueberry genomics. BBGD is both a sequence and gene expression database: it stores both EST and microarray data, and allows scientists to correlate expression profiles with gene function. Presently, the main focus of the database is the identification of genes in blueberry that are significantly induced or suppressed after low temperature exposure.

  19. Dataset of book subjects that contain Probabilistic methods for...

    • workwithdata.com
    Updated Nov 7, 2024
    Cite
    Work With Data (2024). Dataset of book subjects that contain Probabilistic methods for bioinformatics : with an introduction to Bayesian networks [Dataset]. https://www.workwithdata.com/datasets/book-subjects?f=1&fcol0=j0-book&fop0=%3D&fval0=Probabilistic+methods+for+bioinformatics+:+with+an+introduction+to+Bayesian+networks&j=1&j0=books
    Dataset updated
    Nov 7, 2024
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about book subjects. It has 3 rows and is filtered to the book Probabilistic methods for bioinformatics : with an introduction to Bayesian networks. It features 10 columns, including number of authors, number of books, earliest publication date, and latest publication date.

  20. Replication data for: "Gene Regulatory Network Inference Methodology for...

    • entrepot.recherche.data.gouv.fr
    bin, csv, text/tsv +2
    Updated Oct 23, 2024
    Cite
    Lise Pomiès; Céline Brouard; Harold Duruflé; Élise Maigné; Clément Carré; Louise Gody; Fulya Trösser; George Katsirelos; Brigitte Mangin; Nicolas Langlade; Simon de Givry (2024). Replication data for: "Gene Regulatory Network Inference Methodology for Genomic and Transcriptomic Data Acquired in Genetically Related Heterozygote Individuals", 100 simulated datasets of RNA gene expressions of sunflower hybrids [Dataset]. http://doi.org/10.15454/VRGWZ2
    Available download formats: bin, csv, text/tsv, tsv, txt
    Dataset updated
    Oct 23, 2024
    Dataset provided by
    Recherche Data Gouv
    Authors
    Lise Pomiès; Céline Brouard; Harold Duruflé; Élise Maigné; Clément Carré; Louise Gody; Fulya Trösser; George Katsirelos; Brigitte Mangin; Nicolas Langlade; Simon de Givry
    License

    https://spdx.org/licenses/etalab-2.0.html

    Description

    Replication data for: "Gene regulatory network inference methodology for genomic and transcriptomic data acquired in genetically related heterozygote individuals": 100 simulated datasets of RNA gene expressions of sunflower hybrids. This data set includes the 100 simulated datasets that were used in the paper of the same title, Bioinformatics, 2022, https://doi.org/10.1093/bioinformatics/btac445. They are artificial expression datasets created with a modified version of the data simulator SysGenSIM from the same gene regulatory network: artificialDataSet_network.csv. The files that were used to generate the 100 expression datasets are also included (activation/repression sign networkSign and heterosis effect zMatrix directories). The "networks" directory contains the learned networks. The network inference method can be found here. For the description of the files and their dimensions, see README.txt.

Cite
Lucie Tamisier; Annelies Haegeman; Yoika Foucart; Nicolas Fouillien; Maher Al Rwahnih; Nihal Buzkan; Thierry Candresse; Michela Chiumenti; Kris De Jonghe; Marie Lefebvre; Paolo Margaria; Jean Sébastien Reynard; Kristian Stevens; Denis Kutnjak; Sébastien Massart (2021). Semi-artificial datasets as a resource for validation of bioinformatics pipelines for plant virus detection [Dataset]. http://doi.org/10.5281/zenodo.5572591

Data from: Semi-artificial datasets as a resource for validation of bioinformatics pipelines for plant virus detection

23 scholarly articles cite this dataset
Dataset updated
Nov 2, 2021
Authors
Lucie Tamisier; Annelies Haegeman; Yoika Foucart; Nicolas Fouillien; Maher Al Rwahnih; Nihal Buzkan; Thierry Candresse; Michela Chiumenti; Kris De Jonghe; Marie Lefebvre; Paolo Margaria; Jean Sébastien Reynard; Kristian Stevens; Denis Kutnjak; Sébastien Massart
Description

In the last decade, High-Throughput Sequencing (HTS) has revolutionized biology and medicine. This technology allows the sequencing of huge numbers of DNA and RNA fragments at a very low price. In medicine, HTS tests for disease diagnostics have already been brought into routine practice. However, adoption in plant health diagnostics is still limited. One of the main bottlenecks is the lack of expertise and consensus on the standardization of the data analysis. The Plant Health Bioinformatic Network (PHBN) is a Euphresco project aiming to build a community network of bioinformaticians/computational biologists working in plant health. One of the main goals of the project is to develop reference datasets that can be used for validation of bioinformatics pipelines and for standardization purposes. Semi-artificial datasets have been created for this purpose (Datasets 1 to 10). They are composed of a "real" HTS dataset spiked with artificial viral reads. This allows researchers to adjust their pipelines and parameters so that the results approximate the actual viral composition of the semi-artificial datasets as closely as possible. Each semi-artificial dataset tests one or several limitations that could prevent virus detection or correct virus identification from HTS data (e.g. low viral concentration, new viral species, incomplete genome). Eight artificial datasets composed only of viral reads (no background data) have also been created (Datasets 11 to 18). Each dataset consists of a mix of several isolates from the same viral species at different frequencies. The viral species were selected to be as divergent as possible. These datasets can be used to test haplotype reconstruction software, the goal being to reconstruct all the isolates present in a dataset. A GitLab repository (https://gitlab.com/ilvo/VIROMOCKchallenge) is available and provides a complete description of the composition of each dataset, the methods used to create them and their goals.
Dataset_x.fastq.gz: the fastq files of the 18 datasets.
Description of the datasets: a Word document describing each dataset.
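
The idea of a semi-artificial dataset, a real background spiked with artificial viral reads, can be sketched in a few lines. The example below is a toy illustration only and is not the procedure documented in the VIROMOCKchallenge repository: reads are represented as plain strings, and the function names and read counts are hypothetical.

```python
import random


def spike_reads(background, viral, seed=0):
    """Mix artificial viral reads into a background read set and shuffle.

    Each read is tagged with its origin so that the true composition
    stays known, which is what makes the dataset usable for validation.
    """
    rng = random.Random(seed)
    mixed = [(read, "background") for read in background]
    mixed += [(read, "viral") for read in viral]
    rng.shuffle(mixed)
    return mixed


def viral_fraction(mixed):
    """Fraction of reads in the mixed set that came from the viral spike."""
    viral_count = sum(1 for _, origin in mixed if origin == "viral")
    return viral_count / len(mixed)


# Toy stand-ins for reads (hypothetical identifiers, not real sequences).
background_reads = [f"host_read_{n}" for n in range(980)]
viral_reads = [f"virus_read_{n}" for n in range(20)]

dataset = spike_reads(background_reads, viral_reads, seed=7)
print(len(dataset), viral_fraction(dataset))
```

Because the spiked fraction is known in advance, a pipeline's output can be scored against it, for example to check whether a virus present at low concentration is still detected.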
