In the last decade, High-Throughput Sequencing (HTS) has revolutionized biology and medicine. This technology allows the sequencing of huge amounts of DNA and RNA fragments at a very low price. In medicine, HTS tests for disease diagnostics have already been brought into routine practice. However, adoption in plant health diagnostics is still limited. One of the main bottlenecks is the lack of expertise and consensus on the standardization of the data analysis. The Plant Health Bioinformatic Network (PHBN) is a Euphresco project aiming to build a community network of bioinformaticians/computational biologists working in plant health. One of the main goals of the project is to develop reference datasets that can be used for validation of bioinformatics pipelines and for standardization purposes. Semi-artificial datasets have been created for this purpose (Datasets 1 to 10). They are composed of a "real" HTS dataset spiked with artificial viral reads. These allow researchers to adjust their pipelines/parameters to approximate, as closely as possible, the actual viral composition of the semi-artificial datasets. Each semi-artificial dataset allows testing of one or several limitations that could prevent virus detection or correct virus identification from HTS data (e.g. low viral concentration, new viral species, incomplete genome). Eight artificial datasets composed only of viral reads (no background data) have also been created (Datasets 11 to 18). Each dataset consists of a mix of several isolates from the same viral species present at different frequencies. The viral species were selected to be as divergent as possible. These datasets can be used to test haplotype reconstruction software, the goal being to reconstruct all the isolates present in a dataset. A GitLab repository (https://gitlab.com/ilvo/VIROMOCKchallenge) provides a complete description of the composition of each dataset, the methods used to create them, and their goals. Files: Dataset_x.fastq.gz are the FASTQ files of the 18 datasets; "Description of the datasets" is a Word document describing each dataset.
Shiga toxin-producing Escherichia coli (STEC) and Listeria monocytogenes are responsible for severe foodborne illnesses in the United States. Current identification methods require at least four days to identify STEC and six days for L. monocytogenes. Adoption of long-read, whole genome sequencing for testing could significantly reduce the time needed for identification, but method development costs are high. Therefore, the goal of this project was to use NanoSim-H software to simulate Oxford Nanopore sequencing reads to assess the feasibility of sequencing-based foodborne pathogen detection and guide experimental design. Sequencing reads were simulated for STEC, L. monocytogenes, and a 1:1 combination of STEC and Bos taurus genomes using NanoSim-H. This dataset includes all of the simulated reads generated by the project in fasta format. This dataset can be analyzed bioinformatically or used to test bioinformatic pipelines.
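To illustrate what read simulation involves, the following sketch samples error-free fragments of Nanopore-like lengths from a reference FASTA. This is not NanoSim-H itself, whose empirical error and length models are far more sophisticated; the file name and length parameters are placeholders.

```python
# Illustrative sketch only: a naive long-read sampler from a reference genome.
# NanoSim-H models ONT-specific error and length profiles; this sketch does not.
import random
from Bio import SeqIO  # assumes Biopython is installed

def simulate_reads(reference_fasta, n_reads=1000, mean_len=8000, sd_len=4000, min_len=500):
    records = list(SeqIO.parse(reference_fasta, "fasta"))
    genome = "".join(str(r.seq) for r in records)
    reads = []
    for i in range(n_reads):
        length = max(min_len, int(random.gauss(mean_len, sd_len)))
        start = random.randint(0, max(0, len(genome) - length))
        reads.append((f"sim_read_{i}", genome[start:start + length]))
    return reads

if __name__ == "__main__":
    # "stec_genome.fasta" is a placeholder file name
    for name, seq in simulate_reads("stec_genome.fasta", n_reads=10):
        print(f">{name}\n{seq}")
```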
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset was developed to create a census of sufficiently documented molecular biology databases to answer several preliminary research questions. Articles published in the annual Nucleic Acids Research (NAR) “Database Issues” were used to identify a population of databases for study. Namely, the questions addressed herein include: 1) what is the historical rate of database proliferation versus rate of database attrition?, 2) to what extent do citations indicate persistence?, and 3) are databases under active maintenance and does evidence of maintenance likewise correlate to citation? An overarching goal of this study is to provide the ability to identify subsets of databases for further analysis, both as presented within this study and through subsequent use of this openly released dataset.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Projects in chemo- and bioinformatics often consist of scattered data of various types that are difficult to access in a meaningful way for efficient data analysis. The data is usually too diverse even to be manipulated effectively. Sdfconf is data manipulation and analysis software that addresses this problem in a logical and robust manner. Other software commonly used for such tasks is either not designed with molecular and/or conformational data in mind or provides only a narrow set of tasks to be accomplished. Furthermore, many tools are only available within commercial software packages. Sdfconf is a flexible, robust, and free-of-charge tool for linking data from various sources for meaningful and efficient manipulation and analysis of molecular data sets. Sdfconf packages molecular structures and metadata into a complete ensemble, from which one can access both the whole data set and individual molecules and/or conformations. In this software note, we offer some practical examples of the utilization of sdfconf.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains INSDC sequence records not associated with environmental sample identifiers or host organisms. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) by querying data with search parameters: `environmental_sample=False & host=""`
EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).
The data was then processed as follows:
1. Human sequences were excluded.
2. For non-CONTIG records, the sample accession number (when available) along with the scientific name were used to identify sequence records corresponding to the same individuals (or groups of organisms of the same species in the same sample). Only one record was kept for each scientific name/sample accession number combination.
3. Contigs and whole genome shotgun (WGS) records were added individually.
4. The records that were missing some information were excluded. Only records associated with a specimen voucher or records containing both a location AND a date were kept.
5. The records associated with the same specimen vouchers were aggregated together.
6. Many of the remaining records corresponded to individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that weren't filtered out in step 2 because the sample accession number was missing. To identify those potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available). Then we excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978
7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip
More information is available here: https://github.com/gbif/embl-adapter#readme
You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md
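A minimal sketch of how the query described above might look when sent to the public ENA Portal API from Python. The field list and the small result limit are assumptions for illustration; check the API documentation for the exact export used in practice.

```python
# Sketch: query the ENA Portal API for sequence records that are not flagged as
# environmental samples and have no host, as described above. Field names are assumptions.
import requests

ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"

params = {
    "result": "sequence",
    "query": 'environmental_sample=false AND host=""',
    "fields": "accession,scientific_name,collection_date,country,specimen_voucher,sample_accession",
    "format": "tsv",
    "limit": 100,  # small limit for illustration; the real export is much larger
}

response = requests.get(ENA_SEARCH, params=params, timeout=60)
response.raise_for_status()
print(response.text[:2000])  # first rows of the TSV
```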
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We created Nanopore and Illumina metagenomic datasets by combining real (non-simulated) sequencing reads into an artificial metagenomic dataset. By doing this, we can be highly confident of the true taxa for each read in the dataset.
We used samples with matched Illumina and Nanopore sequencing to ensure that differences are driven purely by technology and not by composition. For the human component, we combined reads from three individuals in the 1000 Genomes Project, with the Nanopore data downloaded from the 1000G ONT Sequencing Consortium (https://millerlaboratory.com/1000G-ONT.html); the samples are: HG00277, a Finnish male, with Illumina NovaSeq 6000 (accession: ERR3241786) and Nanopore R10.4; NA19318, a Luhya (Kenya) male, with Illumina NovaSeq 6000 (accession: ERR3239713) and Nanopore R10.4 (basecalled with Dorado v0.3.4); and HG03611, a Bengali (Bangladesh) female, with Illumina NovaSeq 6000 (accession: ERR3243073) and Nanopore R10.4 (basecalled with Dorado v0.3.4). Each human readset was randomly downsampled to 1 Gbp using rasusa (v0.7.1). For the M. tuberculosis component, we used Illumina HiSeq 4000 (accession: ERR245682) and Nanopore R10.3 (accession: ERR8170871) reads (note: we used R10.3 as there are no R10.4 M. tuberculosis WGS datasets publicly available). For the bacterial component, we used Illumina MiSeq (accession: ERR7255689) and Nanopore R10.4 (accession: ERR7287988) reads from the ZymoBIOMICS HMW DNA Standard D6322 (Zymo Research), which contains seven bacterial strains and one fungal strain, none of which are Mycobacterium. We removed Nanopore reads shorter than 500 bp from all datasets, and the M. tuberculosis and Zymo datasets were downsampled to 3 Gbp with rasusa. All human, M. tuberculosis, and Zymo reads were combined into a single artificial metagenomic file.
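A small sketch of the read-length filter described above (removing Nanopore reads shorter than 500 bp). Downsampling to a fixed number of bases was done with rasusa and is not reproduced here; file names are placeholders.

```python
# Drop reads shorter than 500 bp from a gzipped FASTQ file.
import gzip
from Bio import SeqIO  # Biopython

MIN_LEN = 500

with gzip.open("nanopore_reads.fastq.gz", "rt") as fin, \
     gzip.open("nanopore_reads.filtered.fastq.gz", "wt") as fout:
    kept = SeqIO.write(
        (rec for rec in SeqIO.parse(fin, "fastq") if len(rec.seq) >= MIN_LEN),
        fout,
        "fastq",
    )
print(f"kept {kept} reads")
```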
Attribution-NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0): https://creativecommons.org/licenses/by-nc-nd/3.0/
License information was derived automatically
The COVID-19 pandemic has shown that bioinformatics, a multidisciplinary field that combines biological knowledge with computer programming and is concerned with the acquisition, storage, analysis, and dissemination of biological data, has a fundamental role in scientific research strategies in all disciplines involved in fighting the virus and its variants. It aids in sequencing and annotating genomes and their observed mutations; analyzing gene and protein expression; simulating and modeling DNA, RNA, proteins and biomolecular interactions; and mining the biological literature, among many other critical areas of research. Studies suggest that bioinformatics skills in the Latin American and Caribbean region are still relatively incipient, and thus its scientific systems cannot take full advantage of the increasing availability of bioinformatic tools and data. This dataset is a catalog of bioinformatics software for researchers and professionals working in the life sciences. It includes more than 300 different tools for varied uses, such as data analysis, visualization, repositories and databases, data storage services, scientific communication, marketplace and collaboration, and lab resource management. Most tools are available as web-based or desktop applications, while others are programming libraries. It also includes 10 suggested entries for other third-party repositories that could be of use.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The biomarkers for thyroid cancer are still not well characterized. These biomarkers could be targeted specifically when treating thyroid cancer. Through this project, we used bioinformatics tools to identify biomarkers associated with thyroid cancer. The Gene Expression Omnibus (GEO) database was used to find datasets related to thyroid cancer, and their expression profiles were downloaded. Four datasets, GSE3467, GSE3678, GSE33630, and GSE53157, were identified in the GEO database. GSE3467 contains nine thyroid tumor samples and nine normal thyroid tissue samples. GSE3678 contains seven thyroid tumor samples and seven normal thyroid tissue samples. GSE53157 contains twenty-four thyroid tumor samples and three normal thyroid samples. GSE33630 contains sixty thyroid tumor samples and forty-five normal thyroid samples. These four datasets were analyzed individually and then integrated to find the genes common among them. The microarray analysis of the datasets was performed using Excel. T-test analyses were performed for all four datasets individually on separate Excel sheets. The data were normalized by converting values to log scale. Differential expression analysis of all four datasets was done to identify differentially expressed genes (DEGs); only upregulated genes were taken into account. Principal component analysis (PCA) of all four datasets was performed using the raw data; the PCA was run on the T-BioInfo server and the scatterplots were prepared using Excel. RStudio was used to match the gene symbols with the corresponding probe IDs using the left_join function, and the inner_join function in R was used to find the genes shared between the four datasets. Heatmaps of all four datasets were generated using RStudio. To show the number of intersections of differentially expressed genes, an upset plot was prepared using RStudio. 74 genes with their corresponding probe IDs were found to be common among the four datasets; these genes are common to at least two datasets. These 74 common genes were analyzed using the Database for Annotation, Visualization, and Integrated Discovery (DAVID) to study their Gene Ontology (GO) functional annotations and pathways. According to the GO functional annotation results, most of the integrated upregulated genes were involved in protein binding, the plasma membrane, and integral components of the membrane. The most common pathways include extracellular matrix organization, neutrophil degranulation, the TGF-beta signaling pathway, and epithelial-to-mesenchymal transition in colorectal cancer. These 74 genes were submitted to the STRING database to find protein-protein interactions between them. The interactions between the nodes were downloaded from STRING and imported into Cytoscape. The Cytoscape analysis showed that only 19 genes had protein-protein interactions with each other. Disease-free survival analysis of the 13 genes that were common to three datasets was done using GEPIA, and boxplots of these 13 genes were also prepared using GEPIA. This showed that these differentially expressed genes have different expression in normal thyroid tissue and thyroid tumor samples. Hence, these 13 genes common to three datasets can be used as potential biomarkers for thyroid cancer. Among these 13 genes, four are implicated in cancer/cell proliferation and could be probable targets for treatment.
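The per-gene analysis described above (log transformation followed by a t-test between tumor and normal samples) can be sketched in Python as follows. Column names, the placeholder file name, and the fold-change/p-value cut-offs are assumptions, since the original work was carried out in Excel.

```python
# Sketch: identify upregulated genes from a log2-transformed expression matrix.
import numpy as np
import pandas as pd
from scipy import stats

# expr: rows = probes/genes, columns = samples (placeholder file name)
expr = pd.read_csv("GSE3467_expression.csv", index_col=0)
tumor_cols = [c for c in expr.columns if c.startswith("tumor")]
normal_cols = [c for c in expr.columns if c.startswith("normal")]

log_expr = np.log2(expr + 1)  # log-scale normalization
t, p = stats.ttest_ind(log_expr[tumor_cols], log_expr[normal_cols], axis=1)
log2_fc = log_expr[tumor_cols].mean(axis=1) - log_expr[normal_cols].mean(axis=1)

results = pd.DataFrame({"t": t, "p_value": p, "log2_fc": log2_fc}, index=expr.index)
upregulated = results[(results.p_value < 0.05) & (results.log2_fc > 1)]  # assumed cut-offs
print(upregulated.sort_values("p_value").head())
```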
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset contains all of the source code used in the analysis described in the paper "Molecular Basis of Human Complex Diseases." The dataset contains code for the three main results mentioned in the article, packaged in three separate files and numbered in the same order as they are described in the article. The first section of the code summarizes the disease-related regulatory analysis process. The second section contains code for identifying all cohort- and family-related variants. The third section describes the entire process of analyzing the single-cell data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We used the Woltka pipeline to compile the complete Basic genome dataset, consisting of 4634 genomes, with each genus represented by a single genome. After downloading all coding sequences (CDS) from the NCBI database, we extracted 8 million distinct CDS, focusing on bacteria and archaea and excluding viruses and fungi due to inadequate gene information.
To maintain accuracy, we excluded hypothetical proteins, uncharacterized proteins, and sequences without gene labels. We addressed issues with gene name inconsistencies in NCBI by keeping only genes with more than 1000 samples and ensuring each phylum had at least 350 sequences. This resulted in a curated dataset of 800,318 gene sequences from 497 genes across 2046 genera.
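A hedged sketch of the kind of count-based filtering described above (keeping only genes with more than 1000 sequences and phyla with at least 350 sequences), assuming the CDS metadata sits in a table with `gene`, `phylum`, and `genus` columns; the column and file names are assumptions.

```python
# Sketch: filter a CDS metadata table by per-gene and per-phylum sequence counts.
import pandas as pd

cds = pd.read_csv("cds_metadata.tsv", sep="\t")  # placeholder file name

# Keep genes that appear in more than 1000 sequences.
gene_counts = cds["gene"].value_counts()
cds = cds[cds["gene"].isin(gene_counts[gene_counts > 1000].index)]

# Keep phyla represented by at least 350 sequences.
phylum_counts = cds["phylum"].value_counts()
cds = cds[cds["phylum"].isin(phylum_counts[phylum_counts >= 350].index)]

print(len(cds), "sequences,", cds["gene"].nunique(), "genes,", cds["genus"].nunique(), "genera")
```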
We created four datasets to evaluate our model: a training set (Train_set); a test set (Test_set) with different samples but the same genera and genes as the training set; a Taxa_out_set containing the same genes as the training set but drawn from 18 phyla held out from it; and a Gene_out_set containing 60 genes held out from the training set but drawn from the same phyla. We ensured each CDS had only one representation per genome, removing genes with multiple representations within the same species.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains INSDC sequences associated with host organisms. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) using the methods described below.
EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).
The data was then processed as follows:
1. Human sequences were excluded.
2. For non-CONTIG records, the sample accession number (when available) along with the scientific name were used to identify sequence records corresponding to the same individuals (or groups of organisms of the same species in the same sample). Only one record was kept for each scientific name/sample accession number combination.
3. Contigs and whole genome shotgun (WGS) records were added individually.
4. The records that were missing some information were excluded. Only records associated with a specimen voucher or records containing both a location AND a date were kept.
5. The records associated with the same specimen vouchers were aggregated together.
6. Many of the remaining records corresponded to individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that weren't filtered out in step 2 because the sample accession number was missing. To identify those potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available). Then we excluded the groups that contained more than 50 records (a small sketch of this grouping step is given after the links below). The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978
7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip
More information is available here: https://github.com/gbif/embl-adapter#readme
You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md
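A minimal sketch of the grouping step described in point 6 above, assuming the records are loaded into a pandas DataFrame with the listed columns (file name and column names are assumptions); the 50-record threshold comes from the linked GitHub issue.

```python
# Sketch: drop groups of likely-duplicate records larger than 50, as in step 6.
import pandas as pd

records = pd.read_csv("embl_records.tsv", sep="\t")  # placeholder file name

group_cols = ["scientific_name", "collection_date", "location",
              "country", "identified_by", "collected_by", "sample_accession"]

# dropna=False keeps groups where some of these fields are missing.
sizes = records.groupby(group_cols, dropna=False)["scientific_name"].transform("size")
deduplicated = records[sizes <= 50]
print(f"kept {len(deduplicated)} of {len(records)} records")
```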
https://spdx.org/licenses/CC0-1.0.html
Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing, which can require extensive optimization to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR. The use of UMIs allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing, producing a highly accurate consensus sequence from each template. Handling of the large datasets produced by SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early-cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.
Methods
This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies"
Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005
For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.
The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.
The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside, which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.
Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, 2 fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py, which identifies any matched sequences that are different between the sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.
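The comparison performed by compare_seqs.py is not reproduced here, but the underlying idea (pairing rank 1 dUMI consensus sequences with their matching sUMI sequences and reporting mismatches) can be sketched as follows. The file names and the assumption that matching records share identifiers are ours, not taken from the script.

```python
# Sketch: report IDs whose sUMI and dUMI consensus sequences differ.
from Bio import SeqIO  # Biopython

sumi = {rec.id: str(rec.seq) for rec in SeqIO.parse("all_sUMI.fasta", "fasta")}
dumi = {rec.id: str(rec.seq) for rec in SeqIO.parse("all_dUMI_rank1.fasta", "fasta")}

for rec_id in sorted(set(sumi) & set(dumi)):
    if sumi[rec_id] != dumi[rec_id]:
        print(f"{rec_id}: sUMI and dUMI consensus sequences differ")
```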
To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed, and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment, and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.
Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined, and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd.
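Combining the per-sample dUMI_ranked.csv files into one table, as described above, is a straightforward concatenation. A sketch in Python is given below (the original was done in R); the directory layout is an assumption.

```python
# Sketch: combine per-sample dUMI_ranked.csv files into one table (done in R in the paper).
from pathlib import Path
import pandas as pd

frames = []
for path in Path("tagged").glob("**/dUMI_ranked.csv"):  # assumed directory layout
    df = pd.read_csv(path)
    df["sample"] = path.parent.name  # record which sample the rows came from
    frames.append(df)

dUMI_df = pd.concat(frames, ignore_index=True)
dUMI_df.to_csv("dUMI_df.csv", index=False)
```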
Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
An archive containing 100 artificial single-cell datasets. Each dataset is an R data file (.rda). The file names have the following format: splatter_<thousands_of_cells>kCells_groups_set_<set_id>.rda. For example, "splatter_1kCells_groups_set_9.rda" represents the 9th set, containing 1000 cells made using the "groups" option of splatSimulate. You can import the data into R using the load() function. Each dataset includes the following R objects:
1) counts: the number of reads or UMIs for each gene in each cell
2) gene.data: a summary of the data generated by Splatter
3) params: the parameters used to generate the dataset
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparisons of the accuracies (Acc), sensitivities (Sn) and positive predictive values (PPV) of FSA and other alignment methods on the BAliBASE 3 [24] and SABmark 1.65 [25] databases. Probalign has the highest accuracy on the commonly-used BAliBASE 3 dataset and FSA in default mode has superior accuracy on the BAliBASE 3+fp and SABmark 1.65 datasets (note that only FSA and AMAP explicitly attempt to maximize the expected accuracy). FSA has higher positive predictive values than any other program on all datasets and can additionally achieve high sensitivity when run in maximum-sensitivity mode. The BAliBASE 3+fp dataset, which mirrors BAliBASE 3 but includes a single non-homologous sequence in each alignment, was designed to test the robustness of alignment programs to incomplete homology. Traditional alignment programs, designed to maximize sensitivity, suffer greatly-increased mis-alignment when even a single non-homologous sequence is introduced; in contrast, FSA is robust to the non-homologous sequence and has an unchanged positive predictive value. Remarkably, FSA was the only tested program with a mis-alignment rate of
https://www.scilifelab.se/data/restricted-access/
Dataset description: Data consists of CRAM files from capture-based gene panel sequencing (Twist Bioscience) of 252 genes selected based on their relevance in lymphoid malignancies. The panel also included genome-wide backbone probes for copy-number analysis. The prepared libraries were subsequently sequenced in paired-end mode (2x150bp) on the Illumina NovaSeq 6000 (Illumina Inc.). BALSAMIC was used to analyze the FASTQ files and align them to the reference genome. Trimmed reads were mapped to the reference genome hg19 using BWA MEM v0.7.15. The resulting SAM files were converted to BAM files and sorted using samtools v1.6. Duplicated reads were marked using Picard MarkDuplicates v2.17.0, and the files were finally converted to CRAM using samtools v1.6.
Note: CRAM is a sequencing read file format that is highly space efficient by using reference-based compression of sequence data and offers both lossless and lossy modes of compression: https://www.ebi.ac.uk/ena/cram/
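A hedged sketch of the alignment-to-CRAM workflow described above, driven from Python via subprocess. The exact tool flags and file names are assumptions; the tool versions cited above should be used for a faithful reproduction.

```python
# Sketch: FASTQ -> sorted BAM -> duplicate marking -> CRAM, approximating the
# workflow described above. Tool flags and file names are assumptions.
import subprocess

ref = "hg19.fa"                                       # placeholder reference genome
r1, r2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"   # placeholder trimmed reads

# Align with BWA-MEM and sort the stream with samtools.
bwa = subprocess.Popen(["bwa", "mem", ref, r1, r2], stdout=subprocess.PIPE)
subprocess.run(["samtools", "sort", "-o", "sample.sorted.bam"],
               stdin=bwa.stdout, check=True)
bwa.stdout.close()
bwa.wait()

# Mark duplicates with Picard (assumes the bioconda-style `picard` wrapper script).
subprocess.run(["picard", "MarkDuplicates", "I=sample.sorted.bam",
                "O=sample.md.bam", "M=sample.md_metrics.txt"], check=True)

# Convert the duplicate-marked BAM to CRAM against the same reference.
subprocess.run(["samtools", "view", "-C", "-T", ref,
                "-o", "sample.cram", "sample.md.bam"], check=True)
```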
Data Access Statement: The data is under restricted access and can be accessed upon request through the email address below. The targeted sequence datasets are only to be used for research aimed at advancing the understanding of genetic factors in chronic lymphocytic leukemia. Applications aimed at method development, including bioinformatics, would not be considered acceptable uses of this dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This page contains all the data required to reproduce the figures included in the publication.
In the Figure_1 folder:
- input: BED files used for data simulation.
  - HACk.random.bed is for single-clone simulation.
  - c1.bed, c2.bed, and c3.bed are for multiple-clone simulation.
  - allele2.bed is an empty file required by VISOR [1] for simulating heterozygous variants.
- multiple_vcfs: VCF files from multiple-clone simulation.
- single_vcfs: VCF files from single-clone simulation.
- svc_m_rdata: SVclone output from multiple-clone simulation, saved in .RData format.
- svc_s_rdata: SVclone output from single-clone simulation, saved in .RData format.
- svcfit_time.txt: A text file storing runtime results from SVCFit.
- svclone_time.txt: A text file storing runtime results from SVclone.
* Figures 1A and 1B: input, single_vcfs, multiple_vcfs, svc_s_rdata, svc_m_rdata
* Figure 1C: The data used to create the mixtures is available upon request from the European Genome-phenome Archive (EGAD00001001343). The script used to create the mixtures is available at https://github.com/mcmero/SVclone_Rmarkdown/blob/master/make_insilico_mixtures.sh [2].
* Figure 1D: svcfit_time.txt, svclone_time.txt
In the Figure_S folder:
- HACk.random.bed: Contains structural variants included in the simulation for read depth analysis.
- depth: A folder containing read depth data for each simulation.
* Figure S2: HACk.random.bed, depth
References:
1. Bolognini, Davide, Sanders, Ashley, Korbel, Jan O., Magi, Alberto, Benes, Vladimir, and Rausch, Tobias, "VISOR: A Versatile Haplotype-Aware Structural Variant Simulator for Short- and Long-Read Sequencing," Bioinformatics, 36/4 (2020), 1267–69.
2. Cmero, Marek, Yuan, Ke, Ong, Cheng Soon, Schröder, Jan, Corcoran, Niall M., Papenfuss, Tony, et al., "Inferring Structural Variant Cancer Cell Fraction," Nature Communications, 11(1) (2020), 730.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset used in the paper "Effect of Sequence Padding on the Performance of Protein-Based Deep Learning Models" by Angela Lopez-del Rio, Maria Martin, Alexandre Perera-Lluna and Rabie Saidi. The UniprotKB/Swiss-Prot database (version 2019_05) protein entries analysed during the current study can be accessed and downloaded through the following link: http://ftp.ebi.ac.uk/pub/databases/uniprot/previous_releases/release-2019_05/knowledgebase/uniprot_sprot-only2019_05.tar.gz. Since this data needs further filtering to retain only the Archaea taxonomy, we have uploaded here the data analysed in this article. The code is publicly available at https://github.com/b2slab/padding_benchmark.
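To illustrate what sequence padding means in this context, here is a small sketch that integer-encodes protein sequences and post-pads them to a common length with zeros. The encoding scheme, maximum length, and padding strategy are arbitrary choices for illustration, not those evaluated in the paper.

```python
# Sketch: integer-encode protein sequences and zero-pad them to a fixed length.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 is reserved for padding

def encode_and_pad(sequences, max_len=200):
    batch = np.zeros((len(sequences), max_len), dtype=np.int64)
    for row, seq in enumerate(sequences):
        encoded = [AA_TO_INT.get(aa, 0) for aa in seq[:max_len]]  # truncate if too long
        batch[row, :len(encoded)] = encoded  # post-padding: zeros fill the tail
    return batch

print(encode_and_pad(["MKV", "MKVLAAGIV"], max_len=12))
```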
This item contains a test dataset based on Sumatran rhinoceros (Dicerorhinus sumatrensis) whole-genome re-sequencing data that we publish along with the GenErode pipeline (https://github.com/NBISweden/GenErode; Kutschera et al. 2022) and that we reduced in size so that users have the possibility to get familiar with the pipeline before analyzing their own genome-wide datasets.
We extracted scaffold ‘Sc9M7eS_2_HRSCAF_41’ of size 40,842,778 bp from the Sumatran rhinoceros genome assembly (Dicerorhinus sumatrensis harrissoni; GenBank accession number GCA_014189135.1) to be used as reference genome in GenErode. Some GenErode steps require the reference genome of a closely related species, so we additionally provide three scaffolds from the White rhinoceros genome assembly (Ceratotherium simum simum; GenBank accession number GCF_000283155.1) with a combined length of 41,195,616 bp that are putatively orthologous to Sumatran rhinoceros scaffold ‘Sc9M7eS_2_HRSCAF_41’, along with gene predictions in GTF format. The repository also contains a Sumatran rhinoceros mitochondrial genome (GenBank accession number NC_012684.1) to be used as reference for the optional mitochondrial mapping step in GenErode.
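Extracting a single named scaffold from an assembly FASTA, as done for 'Sc9M7eS_2_HRSCAF_41', can be sketched with Biopython as follows; file names are placeholders, and the original workflow (see the GitHub link below) should be preferred for exact reproduction.

```python
# Sketch: extract one scaffold by name from a genome assembly FASTA.
from Bio import SeqIO  # Biopython

TARGET = "Sc9M7eS_2_HRSCAF_41"

with open("sumatran_rhino_assembly.fasta") as fin, open(f"{TARGET}.fasta", "w") as fout:
    for record in SeqIO.parse(fin, "fasta"):
        if record.id == TARGET:
            SeqIO.write(record, fout, "fasta")
            break
```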
The test dataset contains whole-genome re-sequencing data from three historical and three modern Sumatran rhinoceros samples from the now-extinct Malay Peninsula population from von Seth et al. (2021) that was subsampled to paired-end reads that mapped to Sumatran rhinoceros scaffold ‘Sc9M7eS_2_HRSCAF_41’, along with a small proportion of randomly selected reads that mapped to the Sumatran rhinoceros mitochondrial genome or elsewhere in the genome.
For GERP analyses, scaffolds from the genome assemblies of 30 mammalian outgroup species are provided that had reciprocal blast hits to gene predictions from Sumatran rhinoceros scaffold ‘Sc9M7eS_2_HRSCAF_41’. Further, a phylogeny of the White rhinoceros and the 30 outgroup species including divergence time estimates (in billions of years) from timetree.org is available.
Finally, the item contains configuration and metadata files that were used for three separate runs of GenErode to generate the results presented in Kutschera et al. (2022).
Bash scripts and a workflow description for the test dataset generation are available in the GenErode GitHub repository (https://github.com/NBISweden/GenErode/docs/extras/test_dataset_generation).
References:
Kutschera VE, Kierczak M, van der Valk T, von Seth J, Dussex N, Lord E, et al. GenErode: a bioinformatics pipeline to investigate genome erosion in endangered and extinct species. BMC Bioinformatics 2022;23:228. https://doi.org/10.1186/s12859-022-04757-0
von Seth J, Dussex N, Díez-Del-Molino D, van der Valk T, Kutschera VE, Kierczak M, et al. Genomic insights into the conservation status of the world’s last remaining Sumatran rhinoceros populations. Nature Communications 2021;12:2393.
https://www.genomicsengland.co.uk/about-gecip/joining-research-community/
To identify and enrol participants for the 100,000 Genomes Project we have created NHS Genomic Medicine Centres (GMCs). Each centre includes several NHS Trusts and hospitals. GMCs recruit and consent patients. They then provide DNA samples and clinical information for analysis.
Illumina, a biotechnology company, have been commissioned to sequence the DNA of participants. They return the whole genome sequences to Genomics England. We have created a secure, monitored, infrastructure to store the genome sequences and clinical data. The data is analysed within this infrastructure and any important findings, like a diagnosis, are passed back to the patient’s doctor.
To help make sure that the project brings benefits for people who take part, we have created the Genomics England Clinical Interpretation Partnership (GeCIP). GeCIP brings together funders, researchers, NHS teams and trainees. They will analyse the data – to help ensure benefits for patients and an increased understanding of genomics. The data will also be used for medical and scientific research. This could be research into diagnosing, understanding or treating disease.
To learn more about how we work you can read the 100,000 Genomes Project protocol. It has details of the development, delivery and operation of the project. It also sets out the patient and clinical benefit, scientific and transformational objectives, the implementation strategy and the ethical and governance frameworks.