This dataset tracks the updates made on the dataset "NCBI Virus" as a repository for previous versions of the data and metadata.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A centralized sequence repository for all strains of the novel coronavirus (SARS-CoV-2) submitted to the National Center for Biotechnology Information (NCBI).
https://doi.org/10.5061/dryad.cjsxksnff
Dataset Summary:
This dataset contains 43 E gene and 44 complete genome nucleotide sequences of the dengue virus, encompassing all four serotypes (DENV-1, DENV-2, DENV-3, and DENV-4) identified in Pakistan to date. Additionally, the dataset includes four reference sequences of the dengue virus and six sequences from regions outside Pakistan to provide a broader comparative perspective. All sequences were retrieved from the Virus Pathogen Resource (ViPR) database.
Experimental Procedures:
Data Collection and Sequence Alignment: Sequences were aligned using MUSCLE for initial processing and MEGA X for detailed phylogenetic analyses. This dual approach ensures robust sequence alignment critical for accurate downstream analysis.
Phylogenetic Analysis: After alignmen...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repo contains results of biocomputational analysis of four wastewater sequencing datasets, used in the paper "Inferring the sensitivity of wastewater metagenomic sequencing for early detection of viruses: a statistical modelling study".
The bioprojects for the studies are:
Brinch 2020: PRJEB13832, PRJEB34633
Crits-Christoph 2021: PRJNA661613
Rothman 2021: PRJNA729801
Spurbeck 2023: PRJNA924011
The computational pipeline used for analysis can be found here: https://github.com/naobservatory/mgs-workflow/tree/2·1·0
Here are the methods for study selection and processing:
We performed a literature search for studies which (i) generated large (>100M read pairs), untargeted shotgun W-MGS datasets from raw treatment plant influent, (ii) used sample preparation methods well-suited for broad enrichment of viruses, and (iii) were performed in regions and time periods for which good public-health data were available.
We selected three RNA-sequencing studies which fit all of these criteria: Crits-Christoph et al. 2021, Rothman et al. 2021, and Spurbeck et al. 2023. While we were unable to find any DNA-sequencing studies that fulfilled all three criteria, we were still interested in assessing the capability of DNA sequencing to detect human-infecting viruses. Therefore, we included the DNA sequencing study by Brinch et al. 2020, which fulfils criteria (i) and (iii).
All four of these studies conducted composite sampling of municipal influent (the three RNA studies all used 24-hour composite samples, while Brinch used 12-hour composites) and sequenced processed samples with paired-end Illumina technology. The three RNA sequencing studies were conducted in the United States, sampling wastewater from California and Ohio between 2020 and 2022.
Brinch sampled wastewater in Copenhagen, Denmark from 2015 to 2018.
For these studies, we obtained sequencing reads from the European Nucleotide Archive and identified virus reads using Bowtie2 and Kraken2, with relative abundance of each virus calculated as the number of high-quality, non-duplicate reads assigned to that virus divided by the total number of sequencing reads (appendix 5 p 23).
In addition to untargeted W-MGS data, Crits-Christoph and Rothman also sequenced samples that had undergone hybridization-capture enrichment with the Illumina Respiratory Virus Panel (RVP). Data from these samples underwent the same bioinformatic analysis as the untargeted samples from the same studies.
Here is the supplement with additional details:
FASTQ files for each included study were obtained from the Sequence Read Archive and analyzed with a custom computational pipeline (see “Data Sharing”) as follows:
Raw reads were screened for adapter contamination with Cutadapt, Trimmomatic, and FASTP. Additionally, FASTP was used to trim low-quality and low-complexity sequences. Cleaned reads underwent deduplication with Clumpify.
Deduplicated reads were ribodepleted with BBDuk, using SILVA SSU and LSU sequence databases, version 138.1.
Ribodepleted reads were then separately analyzed in a taxonomic profiling and a human-infecting virus identification pipeline. In the taxonomic pipeline, paired-end reads were merged with BBMerge, with reads that failed to merge being concatenated with an intervening “N” base.
Sequences were then passed to Kraken2 for taxonomic assignment, using the Standard database (2022-12-01 build), then summarized with Bracken.
The human-infecting virus pipeline included the following steps:
Beforehand, a database of human-infecting viral genomes was generated by obtaining all human-infecting virus taxonomy identifiers from Virus-Host DB; expanding this list to include all descendant identifiers; downloading all viral genomes corresponding to these identifiers from GenBank; and filtering the resulting database to remove transgenic and contaminated sequences.
Ribodepleted reads were aligned against this database with Bowtie2 to identify putative human-infecting virus reads. Each read was assigned an NCBI taxonomy ID (taxid) corresponding to the best alignment found by Bowtie2. Putative human-infecting virus reads were filtered by aligning them to reference genomes that include human, cow, pig, mouse and E. coli, as well as various genetic engineering vectors. Alignment was performed by Bowtie2 and BBMap in series.
After filtering, read pairs were merged with BBMerge and taxonomically assigned with Kraken2 as above. Each read was either (1) assigned to a human-infecting virus (HV) taxon with Kraken2, (2) assigned to a non-HV taxon with Kraken2, or (3) not assigned to any taxon. All reads in category (2) were filtered out.
Reads were assigned HV status if (i) they were given an HV assignment by both Bowtie2 and Kraken2; or (ii) the read was unassigned by Kraken2 but aligned to an HV taxon with Bowtie2 with a length-normalized alignment score at or above a user-defined threshold of 20 (i.e. alignmentScore/ln(readLength) >= 20).
The number of reads assigned to each human-infecting virus taxon was calculated by summing all Bowtie2 assignments to that taxid and its taxonomic descendants, according to the NCBI taxonomy hierarchy. These read counts were then used to calculate RA(1%) estimates as described in Appendix 3.
The taxids used to generate such estimates are documented in Table S5.
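The HV assignment rule, the roll-up of read counts to ancestor taxa, and the relative-abundance definition described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code of the linked pipeline: the function names, the `kraken_status` labels, and the toy taxonomy are assumptions made for the example. Only the threshold of 20, the alignmentScore/ln(readLength) formula, and the "reads assigned divided by total reads" definition come from the description.

```python
import math
from collections import defaultdict

HV_SCORE_THRESHOLD = 20.0  # threshold stated in the description


def normalized_score(alignment_score, read_length):
    """Length-normalized score: alignmentScore / ln(readLength)."""
    if read_length <= 1:
        raise ValueError("read_length must be > 1 so that ln(readLength) > 0")
    return alignment_score / math.log(read_length)


def is_hv_read(bowtie_hv, kraken_status, alignment_score, read_length,
               threshold=HV_SCORE_THRESHOLD):
    """Apply the two-case HV rule.

    kraken_status is a hypothetical label: "hv", "non_hv", or "unassigned".
    """
    if not bowtie_hv or kraken_status == "non_hv":
        return False  # category (2) reads are filtered out
    if kraken_status == "hv":
        return True   # case (i): both tools agree
    # case (ii): Kraken2 unassigned, so rely on the Bowtie2 score
    return normalized_score(alignment_score, read_length) >= threshold


def rollup_counts(direct_counts, parent_of):
    """Sum each taxid's reads into itself and all of its ancestors."""
    totals = defaultdict(int)
    for taxid, n_reads in direct_counts.items():
        node, seen = taxid, set()
        while node is not None and node not in seen:
            totals[node] += n_reads
            seen.add(node)  # guards against cycles in a malformed tree
            node = parent_of.get(node)
    return dict(totals)


def relative_abundance(virus_reads, total_reads):
    """Reads assigned to a virus divided by total sequencing reads."""
    return virus_reads / total_reads


# Toy taxonomy (hypothetical ids): 3 and 4 are children of 2; 2 is a child of 1.
parent_of = {3: 2, 4: 2, 2: 1}
totals = rollup_counts({3: 5, 4: 2, 2: 1}, parent_of)
```

With the toy inputs, taxon 2 collects its own read plus the reads of its two children, so `totals[2]` is 8, and `relative_abundance(totals[2], 1_000_000)` gives the corresponding fraction of a hypothetical one-million-read sample.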