100+ datasets found

COG-UK Viral Genome Sequences
healthdatagateway.org
unknown
Updated Oct 8, 2024
Cite
COG-UK Consortium (2024). COG-UK Viral Genome Sequences [Dataset]. http://doi.org/10.1016/S2666-5247(20)30054-9
Explore at:
Available download formats: unknown
Unique identifier
https://doi.org/10.1016/S2666-5247(20)30054-9
Dataset updated
Oct 8, 2024
Dataset provided by
COVID-19 Genomics UK Consortium
Authors
COG-UK Consortium
License
https://www.cogconsortium.uk/data/
Description
The current COVID-19 pandemic, caused by the SARS-CoV-2 virus, represents a major threat to health in the UK and globally. Fully understanding the transmission and evolution of the virus requires sequencing and analysing viral genomes at scale and speed. The number of samples calls for a rapid and robust increase in the UK’s pathogen genome sequencing capacity.

To provide this increased capacity to collect, sequence and analyse the whole genomes of virus samples in the UK, the COVID-19 Genomics UK (COG-UK) consortium is pooling the world-leading knowledge and expertise in genomics of the four UK Public Health Agencies, multiple regional university hubs, and large sequencing centres such as the Wellcome Sanger Institute.
Viral genomes from GenBank (reference) - Comparative analysis of gene...
figshare.com
application/x-gzip
Updated Jun 3, 2023
Cite
Enrique Gonzalez Tortuero; Revathy Krishnamurthi; Heather Allison; Ian Goodhead; Chloë James (2023). Viral genomes from GenBank (reference) - Comparative analysis of gene prediction tools for viral genome annotation [Dataset]. http://doi.org/10.6084/m9.figshare.21353829.v1
Explore at:
Available download formats: application/x-gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.21353829.v1
Dataset updated
Jun 3, 2023
Dataset provided by
figshare
Authors
Enrique Gonzalez Tortuero; Revathy Krishnamurthi; Heather Allison; Ian Goodhead; Chloë James
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The file "viral.genomic.gbk.tar.gz" contains all the RefSeq viral database information in GenBank format, used as the gold standard for the comparisons. It should be used as is when running the script "genecounter.py" to count the number of genes, and it is the second (mandatory) input file when counting true positives (TP), false positives (FP) and false negatives (FN) via "coordinateschecker.py". It can also be used for other evaluation purposes.
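The TP/FP/FN counting mentioned above can be illustrated with a small sketch. This is not the authors' "coordinateschecker.py": the function name `count_tp_fp_fn`, the `(start, end, strand)` tuple representation and the exact-match criterion are all assumptions made for illustration.

```python
from typing import Iterable, Tuple

# Hypothetical gene representation: (start, end, strand)
Gene = Tuple[int, int, str]


def count_tp_fp_fn(predicted: Iterable[Gene], reference: Iterable[Gene]) -> Tuple[int, int, int]:
    """Count true positives, false positives and false negatives.

    Assumption: a prediction counts as a true positive only when its
    coordinates and strand match a reference gene exactly.
    """
    pred_set = set(predicted)
    ref_set = set(reference)
    tp = len(pred_set & ref_set)   # predicted and present in the reference
    fp = len(pred_set - ref_set)   # predicted but absent from the reference
    fn = len(ref_set - pred_set)   # in the reference but not predicted
    return tp, fp, fn


if __name__ == "__main__":
    reference = [(100, 400, "+"), (500, 900, "-"), (1000, 1300, "+")]
    predicted = [(100, 400, "+"), (520, 900, "-"), (1000, 1300, "+"), (1500, 1700, "+")]
    print(count_tp_fp_fn(predicted, reference))  # (2, 2, 1)
```

A real evaluation might accept partial overlaps or matching stop coordinates only; the exact-match rule here is simply the easiest criterion to state.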
IVDB - Influenza Virus Database
scicrunch.org
neuinfo.org
+1 more
Updated Dec 4, 2023
Cite
(2023). IVDB - Influenza Virus Database [Dataset]. http://identifiers.org/RRID:SCR_013404
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_013404
Dataset updated
Dec 4, 2023
Description
IVDB hosts complete genome sequences of influenza A virus generated by BGI and curates all other published influenza virus sequences after expert annotations. For the convenience of efficient data utilization, our Q-Filter system classifies and ranks all nucleotide sequences into 7 categories according to sequence content and integrity. IVDB provides a series of tools and viewers for analyzing the viral genomes, genes, genetic polymorphisms and phylogenetic relationships comparatively. A searching system is developed for users to retrieve a combination of different data types by setting various search options. To facilitate analysis of the global viral transmission and evolution, the IV Sequence Distribution Tool (IVDT) is developed to display worldwide geographic distribution of the viral genotypes and to couple genomic data with epidemiological data. The BLAST, multiple sequence alignment tools and phylogenetic analysis tools were integrated for online data analysis. Furthermore, IVDB offers instant access to the pre-computed alignments and polymorphism analysis of influenza virus genes and proteins and presents the results by SNP distribution plots and minor allele distributions. IVDB aims to be a powerful information resource and an analysis workbench for scientists working on IV genetics, evolution, diagnostics, vaccine development, and drug design.
Viral RefSeq databases for Centrifuge, Kraken2 and DIAMOND
zenodo.org
datadryad.org
application/gzip, txt
Updated Jun 5, 2022
Cite
Anna-Sapfo Malaspinas; Samuel Neuenschwander; Yami Arizmendi Cárdenas (2022). Viral RefSeq databases for Centrifuge, Kraken2 and DIAMOND [Dataset]. http://doi.org/10.5061/dryad.mkkwh711w
Explore at:
Available download formats: txt, application/gzip
Unique identifier
https://doi.org/10.5061/dryad.mkkwh711w
Dataset updated
Jun 5, 2022
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Anna-Sapfo Malaspinas; Samuel Neuenschwander; Yami Arizmendi Cárdenas
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Owing to technological advances in ancient DNA, it is now possible to sequence viruses from the past to track down their origin and evolution. However, ancient DNA data is considerably more degraded and contaminated than modern data making the identification of ancient viral genomes particularly challenging. Several methods to characterise the modern microbiome (and, within this, the virome) have been developed; in particular, tools that assign sequenced reads to specific taxa in order to characterise the organisms present in a sample of interest. While these existing tools are routinely used in modern data, their performance when applied to ancient microbiome data to screen for ancient viruses remains unknown.

In this work, we conducted an extensive simulation study using public viral sequences to establish which tool is the most suitable to screen ancient samples for human DNA viruses. We compared the performance of four widely used classifiers, namely Centrifuge, Kraken2, DIAMOND and MetaPhlAn2, in correctly assigning sequencing reads to the corresponding viruses. To do so, we simulated reads by adding noise typical of ancient DNA to a set of publicly available human DNA viral sequences and to the human genome. We fragmented the DNA into different lengths, added sequencing error and C to T and G to A deamination substitutions at the read termini. Then we measured the resulting sensitivity and precision for all classifiers.

Across most simulations, more than 228 out of the 233 simulated viruses are recovered by Centrifuge, Kraken2 and DIAMOND, in contrast to MetaPhlAn2, which recovers only around one third. Overall, Centrifuge and Kraken2 have the best performance with the highest values of sensitivity and precision. We found that deamination damage has little impact on the performance of the classifiers, less than the sequencing error and the length of the reads. Since Centrifuge can handle short reads (in contrast to DIAMOND and Kraken2 with default settings) and since it achieves the highest sensitivity and precision at the species level across all the simulations performed, it is our recommended tool. Regardless of the tool used, our simulations indicate that, for ancient human studies, users should use strict filters to remove all reads of potential human origin. Finally, we recommend verifying which species are present in the database used, as default databases may lack sequences for viruses of interest.
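The read-damage step of the simulation described above (C to T and G to A substitutions at the read termini) can be sketched as follows. This is a minimal illustration, not the simulator used in the study: the function name, the flat per-base `rate` and the fixed `n_terminal` window are assumptions.

```python
import random


def add_terminal_deamination(read: str, rate: float, n_terminal: int, rng: random.Random) -> str:
    """Apply C->T near the 5' end and G->A near the 3' end of a read.

    Assumptions (illustrative only): damage is limited to the first and last
    `n_terminal` bases, and each eligible base is substituted independently
    with probability `rate`.
    """
    bases = list(read.upper())
    n = len(bases)
    for i in range(min(n_terminal, n)):
        # 5' terminus: deaminated cytosine is read as thymine
        if bases[i] == "C" and rng.random() < rate:
            bases[i] = "T"
        # 3' terminus: the complementary damage appears as G->A
        j = n - 1 - i
        if bases[j] == "G" and rng.random() < rate:
            bases[j] = "A"
    return "".join(bases)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the demonstration is repeatable
    print(add_terminal_deamination("CCGGACGTCCGG", rate=0.3, n_terminal=2, rng=rng))
```

Real ancient DNA damage usually decays with distance from the read end, so a position-dependent rate would be a natural refinement of this flat model.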
T4-like genome database
dknet.org
scicrunch.org
Updated Oct 16, 2019
Cite
(2019). T4-like genome database [Dataset]. http://identifiers.org/RRID:SCR_005367
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_005367
Dataset updated
Oct 16, 2019
Description
THIS RESOURCE IS NO LONGER IN SERVICE, documented August 22, 2016. A database of information on bacterial phages. It contains multiple phage genomes, which users can BLAST and MegaBLAST, and also hosts a Phage Forum in which users can discuss phage data. Interactive browsing of completed phage genomes is available using the program. The browser allows users to scan the genome for particular features and to download sequence information plus analyses of those features. Views of the genome are generated showing named genes, BLAST similarities to other phages, predicted tRNAs and other sequence features.
VIRsiRNAdb
dknet.org
Updated Aug 31, 2024
Cite
(2024). VIRsiRNAdb [Dataset]. http://identifiers.org/RRID:SCR_006108
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_006108
Dataset updated
Aug 31, 2024
Description
VIRsiRNAdb is a curated database of experimentally validated viral siRNA/shRNA targeting diverse genes of 42 important human viruses including influenza, SARS and hepatitis viruses. Submissions are welcome. Currently, the database provides detailed experimental information on 1358 siRNA/shRNA, which includes siRNA sequence, virus subtype, target gene, GenBank accession, design algorithm, cell type, test object, test method and efficacy (mostly quantitative efficacies). Further, wherever available, information regarding alternative efficacies of more than 300 siRNAs derived from different assays has also been incorporated. The database has facilities such as search, advanced search (using Boolean operators AND, OR), browsing (with data sorting option), internal linking and external linking to other databases (PubMed, GenBank, ICTV). Additionally, useful siRNA analysis tools are provided, e.g. siTarAlign for aligning the siRNA sequence with reference viral genomes or user-defined sequences. VIRsiRNAdb should prove useful for RNAi researchers, especially in siRNA-based antiviral therapeutics development.
Virosaurus dataset
explore.openaire.eu
Updated Jan 17, 2022
Cite
Anne Gleizes; Philippe Le Mercier; Edouard de Castro (2022). Virosaurus dataset [Dataset]. http://doi.org/10.5281/zenodo.5863049
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.5863049
Dataset updated
Jan 17, 2022
Authors
Anne Gleizes; Philippe Le Mercier; Edouard de Castro
Description
Virosaurus (from virus thesaurus) is a curated virus genome database, aimed at facilitating clinical metagenomics analysis. The data comprises clustered and annotated sequences of vertebrate viruses, other viruses (insect, fungus, eukaryotic microorganism) or plant viruses in FASTA format. Virosaurus also provides a complete virus sequence dataset for all those viruses, which comprises complete genomes for nonsegmented viruses, and complete segments for segmented viruses. Complete sequences: This dataset contains full-length genomes (monopartite virus) or segments (segmented virus) for all vertebrate virus families. Virosaurus: Virus reference sequence databases for clinical metagenomics. All complete sequences were clustered at 90% to remove redundancy in Virosaurus Vertebrate 90 (23,615 FASTAs); or clustered at 98% in Virosaurus vertebrate 98 (73,160 FASTAs). Many clusters can belong to the same virus species. For example, there are 100 Lassa virus clusters in Virosaurus90 and 638 in Virosaurus98. The FASTA headers have been annotated with metadata to facilitate metagenomic analysis. For instance, viral nucleic acid is annotated as RNA, DNA or RNA/DNA, thereby improving interpretation from sequencing either molecule.
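To illustrate how clustering at an identity threshold removes redundancy, and why a stricter threshold (98% rather than 90%) leaves more clusters, here is a toy greedy clustering sketch. It is not the Virosaurus procedure: the similarity measure (`difflib.SequenceMatcher.ratio`, a rough stand-in for alignment-based identity), the longest-first ordering and all names are assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def toy_identity(a: str, b: str) -> float:
    """Rough similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs: Dict[str, str], threshold: float) -> Dict[str, List[str]]:
    """Greedy clustering keyed by representative name.

    Sequences are visited longest first. Each one joins the first
    representative it matches at or above `threshold`; otherwise it
    becomes the representative of a new cluster.
    """
    clusters: Dict[str, List[str]] = {}
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        for rep in clusters:
            if toy_identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


if __name__ == "__main__":
    seqs = {
        "a": "ACGT" * 5,
        "b": ("ACGT" * 5)[:-1],      # near-duplicate of "a"
        "c": "T" * 10 + "G" * 10,    # unrelated sequence
    }
    print(len(greedy_cluster(seqs, 0.90)))
    print(len(greedy_cluster(seqs, 0.98)))
```

This quadratic comparison is only workable for small examples; dedicated clustering tools use indexing heuristics to scale to tens of thousands of genomes.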
GCVDB Viruses
figshare.com
application/gzip
Updated Jan 15, 2024
Cite
Bailey Wallace (2024). GCVDB Viruses [Dataset]. http://doi.org/10.6084/m9.figshare.24968805.v1
Explore at:
Available download formats: application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.24968805.v1
Dataset updated
Jan 15, 2024
Dataset provided by
figshare
Authors
Bailey Wallace
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Viral genomes and genome fragments from the Global Coral Viruses Database (GCVDB).
VIDA
neuinfo.org
scicrunch.org
+2 more
Updated Oct 16, 2019
Cite
(2019). VIDA [Dataset]. http://identifiers.org/RRID:SCR_007111
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_007111 https://identifiers.org/RRID:SCR_007111/resolver/mentions
Dataset updated
Oct 16, 2019
Description
VIDA contains a collection of homologous protein families derived from open reading frames from complete and partial virus genomes. For each family, users can get an alignment of the conserved regions, functional and taxonomy information, and links to DNA sequences and structures.

* Search homologous protein families from particular virus families
* Links to complete genome sequence: Arteriviridae, Coronaviridae, Herpesviridae, Poxviridae

The Virus Database at University College London has been developed as a system to organize animal virus open reading frame sequences. All known and predicted protein sequences from complete and partial genomes of particular virus families are extracted from GenBank and filtered to remove 100% redundancy. On the basis of sequence similarity the sequences are then clustered into homologous protein families (HPFs). The families are enriched with annotations including function and functional classification, related protein structures, taxonomy, length of the proteins, boundaries of the conserved region/s, virus-specific gene name and links to EMBL entries and SWISSPROT.
VirGen
bioregistry.io
Updated Feb 20, 2024
Cite
(2024). VirGen [Dataset]. https://bioregistry.io/registry/virgen
Explore at:
Dataset updated
Feb 20, 2024
Description
VirGen is a comprehensive viral genome resource that organizes the ‘sequence space’ of viral genomes in a structured fashion. It has been developed to serve as an annotated and curated database for complete viral genome sequences.
Metadata record for: Domain-centric database to uncover structure of...
springernature.figshare.com
txt
Updated Jun 1, 2023
Cite
Scientific Data Curation Team (2023). Metadata record for: Domain-centric database to uncover structure of minimally characterized viral genomes [Dataset]. http://doi.org/10.6084/m9.figshare.12319631.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.12319631.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Figshare: http://figshare.com/
Authors
Scientific Data Curation Team
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset contains key characteristics about the data described in the Data Descriptor "Domain-centric database to uncover structure of minimally characterized viral genomes". Contents:

1. human readable metadata summary table in CSV format
2. machine readable metadata file in JSON format
COVID-19 Genome Sequence Dataset
registry.opendata.aws
catalog.midasnetwork.us
Updated Jul 9, 2020
Cite
National Library of Medicine (NLM) (2020). COVID-19 Genome Sequence Dataset [Dataset]. https://registry.opendata.aws/ncbi-covid-19/
Explore at:
Dataset updated
Jul 9, 2020
Dataset provided by
National Library of Medicine (NLM): http://nlm.nih.gov/
Description
This repository within the ACTIV TRACE initiative houses a comprehensive collection of datasets related to SARS-CoV-2. The processing of SARS-CoV-2 Sequence Read Archive (SRA) files has been optimized to identify genetic variations in viral samples. This information is then presented in the Variant Call Format (VCF). Each VCF file corresponds to the SRA parent-run's accession ID. Additionally, the data is available in the parquet format, making it easier to search and filter using the Amazon Athena Service. The SARS-CoV-2 Variant Calling Pipeline is designed to handle new data every six hours, with updates to the AWS ODP bucket occurring daily.
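As a rough illustration of what a record in the Variant Call Format carries, the sketch below parses the eight fixed tab-separated columns of a VCF data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The example record is made up for demonstration and is not taken from this dataset; the `Variant` type and `parse_vcf_line` function are hypothetical names.

```python
from typing import Dict, List, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alts: List[str]
    info: Dict[str, str]


def parse_vcf_line(line: str) -> Variant:
    """Parse one tab-separated VCF data line (fixed columns only)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]
    info_map: Dict[str, str] = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_map[key] = value  # flag-style entries get an empty value
    return Variant(chrom, int(pos), ref, alt.split(","), info_map)


if __name__ == "__main__":
    # Made-up example record, for illustration only
    record = "NC_045512.2\t23403\t.\tA\tG\t.\tPASS\tDP=100"
    print(parse_vcf_line(record))
```

Header lines (those starting with `#`) and per-sample genotype columns are deliberately ignored here; a maintained VCF library would be the better choice for real analyses.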
"Genome binning of viral entities from bulk metagenomics data" - CAMISIM...
data.niaid.nih.gov
Updated Jan 5, 2022
Cite
Johansen, Joachim (2022). "Genome binning of viral entities from bulk metagenomics data" - CAMISIM simulated datasets and genomes [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5676246
Explore at:
Dataset updated
Jan 5, 2022
Dataset authored and provided by
Johansen, Joachim
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Genome binning of viral entities from bulk metagenomics data

Authors

Joachim Johansen1,2, Damian R. Plichta2, Jakob Nybo Nissen1,3, Marie Louise Jespersen1,4, Shiraz A. Shah5, Ling Deng6, Jakob Stokholm5,6, Hans Bisgaard5, Dennis Sandris Nielsen6, Søren Sørensen7, Simon Rasmussen1

Affiliations

1 Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen N, Denmark

2 Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA

3 Statens Serum Institut, Viral & Microbial Special diagnostics, Copenhagen, Denmark

4 National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark

5 Copenhagen Prospective Studies on Asthma in Childhood (COPSAC), Herlev and Gentofte Hospital, University of Copenhagen, Copenhagen, Denmark

6 Section of Food Microbiology and Fermentation, Department of Food Science, Faculty of Science, University of Copenhagen, Copenhagen, Denmark

7 Section of Microbiology, Department of Biology, University of Copenhagen, Copenhagen, Denmark

Methods description

We compared the viral binning performance of VAMB and MetaBAT2 using the official CAMI consortium method to create assemblies and metagenome profiles. To this end we generated 3 different metagenome compositions with up to 308 reference genomes: one mixed with bacteria, plasmids and viruses to test binning in complex samples, i.e. high diversity (1); one with only crass-like viruses to test binning with highly similar viruses, i.e. high relatedness (2); and a set of small viruses (<6,000 bp) including members of the Microviridae family to address the bias of size (3). Bacterial genomes were gathered from NCBI's RefSeq genome repository 2021, plasmids from the PLSDB database (v. 2021_06_23) and viral genomes from the recent MGV database.

Dataset A contained a mixture of bacteria (N=8), plasmids (N=20) and viruses (N=280) to test binning in complex samples, i.e. high diversity. Dataset B contained only crass-like viruses (N=80) to test binning with highly similar viruses, i.e. high relatedness. Dataset C contained small viruses (N=50, <6,000 bp) of the Microviridae family to address the bias of size. Bacterial genomes were sampled from the RefSeq genome repository 2021, plasmids from the PLSDB database and viral genomes from the recent MGV database (Nayfach, et al. Nature Microbiology 2021).
Databases for NEXT-RSV-SEQ (RSV, HMPV, PIV)
zenodo.org
zip
Updated May 16, 2024
Cite
Stephan Fuchs; Sophie Köndgen (2024). Databases for NEXT-RSV-SEQ (RSV, HMPV, PIV) [Dataset]. http://doi.org/10.5281/zenodo.8133844
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.8133844
Dataset updated
May 16, 2024
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Stephan Fuchs; Sophie Köndgen
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Here, we provide databases ready for use in sequencing read decontamination with our Next-RSV-SEQ pipeline (https://gitlab.com/rki_bioinformatics/next-rsv-seq), designed for viral genome assembly using Illumina data.

Available Databases:

Human orthopneumovirus / Human respiratory syncytial virus (RSV): [RSV_GRCh38_2022-04-06.zip]

Human metapneumovirus (HMPV): [HMPV_2022-05-28.zip]

Parainfluenza virus (PIV): [PIV_2022-05-28.zip]

These databases were created using Kraken2 and are based on all complete viral genome sequences available on NCBI Reference Sequence Database (RefSeq) as of April 6, 2022 (for RSV), and May 28, 2022 (for HMPV and PIV). The RSV database also includes the human genome sequence GRCh38 (Genome Reference Consortium Human Build 38). The databases have been compressed into zip format for easy downloading. Before use, please unpack the respective zip archive.
Hepatitis Virus B Database
scicrunch.org
Updated Jan 29, 2022
Cite
(2022). Hepatitis Virus B Database [Dataset]. http://identifiers.org/RRID:SCR_007705
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_007705
Dataset updated
Jan 29, 2022
Description
HepSEQ is the International Repository for Hepatitis B Virus Strain Data. It is a web-accessible, quality-based molecular, clinical and epidemiological database for hepatitis B infection and provides a tool for the research community or for those involved in hepatitis B case management. This database currently has 1012 patient records and 1253 viral sequences. The quality of all submitted sequences is checked. The tools provided include:

SeqMatch: search the database for matching sequences
Genotyper: genotype HBV strains (based on HBV surface antigen genes)
Gene Mutation: display the sequences that contain mutations in HBV coding regions
Mutation Annotator: annotate sequences for mutations known to be associated with anti-viral resistance

The development of this web database is funded by the UK Department of Health; it is curated and hosted by the Health Protection Agency.
Virus+ Sequence Masked Mouse Reference Genome (GRCm38)
zenodo.org
explore.openaire.eu
application/gzip
Updated Feb 9, 2021
Cite
Scott A Handley (2021). Virus+ Sequence Masked Mouse Reference Genome (GRCm38) [Dataset]. http://doi.org/10.5281/zenodo.4116249
Explore at:
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.4116249
Dataset updated
Feb 9, 2021
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Scott A Handley
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A version of the mouse genome (GRCm38) masked for all possible viral sequences.

See Virus+ Masked Human Genome for a masked human reference database.

The following commands were used to generate the additional virus sequence masked reference database:

1) Download all RefSeq and Neighbor nucleotide records:

https://www.ncbi.nlm.nih.gov/nuccore/?term=Viruses[Organism]%20NOT%20cellular%20organisms[ORGN]%20NOT%20wgs[PROP]%20NOT%20gbdiv%20syn[prop]%20AND%20(srcdb_refseq[PROP]%20OR%20nuccore%20genome%20samespecies[Filter])

2) Shred the downloaded viral genomes using shred.sh from the bbtools package

shred.sh in=refseq_virus_reformated.fasta out=virus_shred.fasta.gz length=85 minlength=75 overlap=30

3) Map shredded virus sequence to the GRCm38 genome using bbmap.sh from the bbtools package

bbmap.sh ref=GRCm38.fa.gz in=virus_shred.fasta.gz outm=map_mouse_all_viruses.sam minid=0.90

4) Mask the virus-mapped regions of the GRCm38 genome using bbmask.sh from the bbtools package

bbmask.sh in=GRCm38.fa.gz out=GRCm38_virus_masked.fasta.gz sam=map_mouse_all_viruses.sam

5) Remove all N's to further reduce file size using seqkit

seqkit -is replace -p "n" -r "" GRCm38_virus_masked.fasta.gz > mouse_virus_masked.fasta_Ns_removed.gz
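The effect of step 5 can be illustrated in plain Python. This is a minimal sketch of the idea (dropping every masked `N`/`n` from sequence lines while leaving headers untouched), not a reimplementation of seqkit; the function name `strip_n_from_fasta` is hypothetical.

```python
from typing import Iterable, Iterator


def strip_n_from_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with every N/n removed from sequence lines.

    Header lines (starting with '>') are passed through unchanged, so an
    'N' inside a sequence name is never touched.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line
        else:
            cleaned = line.replace("N", "").replace("n", "")
            if cleaned:  # skip lines that consisted only of masked bases
                yield cleaned


if __name__ == "__main__":
    example = [">chr1 N test", "ACGTNNNN", "NNNN", "nnACgt"]
    for out_line in strip_n_from_fasta(example):
        print(out_line)
```

For a gzip-compressed reference, the input lines could come from `gzip.open(path, "rt")`. Note that removing masked bases shifts all downstream coordinates, so the result suits read filtering rather than coordinate-based annotation.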

Additional References:

bbtools

seqkit

NCBI Virus Genome RefSeq
COG-UK Viral Genome Sequences
dtechtive.com
find.data.gov.scot
Updated May 30, 2023
Cite
COVID-19 GENOMICS UK (2023). COG-UK Viral Genome Sequences [Dataset]. https://dtechtive.com/datasets/26040
Explore at:
Dataset updated
May 30, 2023
Dataset provided by
COVID-19 GENOMICS UK
Area covered
United Kingdom
Description
COG-UK Consortium has published a dataset which contains over 20K SARS-CoV-2 viral genome sequences available as open access.
Data from: Viral tagging reveals discrete populations in Synechococcus viral...
search.dataone.org
datadryad.org
+1 more
Updated Apr 19, 2025
Cite
Li Deng; J. Cesar Ignacio-Espinoza; Ann C. Gregory; Bonnie T. Poulos; Joshua S. Weitz; Philip Hugenholtz; Matthew B. Sullivan (2025). Viral tagging reveals discrete populations in Synechococcus viral genome sequence space [Dataset]. http://doi.org/10.5061/dryad.gr3ks
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.gr3ks
Dataset updated
Apr 19, 2025
Dataset provided by
Dryad Digital Repository
Authors
Li Deng; J. Cesar Ignacio-Espinoza; Ann C. Gregory; Bonnie T. Poulos; Joshua S. Weitz; Philip Hugenholtz; Matthew B. Sullivan
Time period covered
Jan 1, 2015
Description
Microbes and their viruses drive myriad processes across ecosystems ranging from oceans and soils to bioreactors and humans. Despite this importance, microbial diversity is only now being mapped at scales relevant to nature, while the viral diversity associated with any particular host remains little researched. Here we quantify host-associated viral diversity using viral-tagged metagenomics, which links viruses to specific host cells for high-throughput screening and sequencing. In a single experiment, we screened 107 Pacific Ocean viruses against a single strain of Synechococcus and found that naturally occurring cyanophage genome sequence space is statistically clustered into discrete populations. These population-based, host-linked viral ecological data suggest that, for this single host and seawater sample alone, there are at least 26 double-stranded DNA viral populations with estimated relative abundances ranging from 0.06 to 18.2%. These populations include previously cultivated...
Data from: HoloBee Database v2016.1
catalog.data.gov
agdatacommons.nal.usda.gov
+3 more
Updated Apr 21, 2025
Cite
Agricultural Research Service (2025). HoloBee Database v2016.1 [Dataset]. https://catalog.data.gov/dataset/holobee-database-v2016-1-9e8e9
Explore at:
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Service
Description
Organisms living in honey bees and honey bee colonies form large associative holobiont communities that are integral to bee biology. High-throughput sequencing approaches to characterize these holobiont communities from honey bees in various states of health and disease are now commonplace, producing large amounts of nucleotide sequence data that must be accurately and consistently analyzed in order to produce reliable and comparable reports. In addition, new species designations and revisions are actively being made from honey bee holobiont communities, complicating nomenclature in larger databases where taxonomic descriptions associated with archived sequences can quickly become outdated and misleading. To improve the accuracy and consistency of honey bee holobiont research, we have developed HoloBee: a curated database of publicly accessioned nucleotide sequences from the honey bee holobiont community. Except in rare and noted exceptions made by curators, sequences used in HoloBee were obtained from, or in association with, Apis mellifera (Western honey bee) as well as other honey bee species where available (e.g. Apis cerana, Apis dorsata, Apis laboriosa, Apis koschevnikovi, Apis florea, Apis andreniformis and Apis nigrocincta). Sources include: within or on the surface of honey bees (adult, pupae, larvae, egg), corbicular pollen, bee bread, royal jelly, honey, comb, hive surfaces (e.g. bottom board debris, frames, landing platforms), and isolates of microbes, parasites and pathogens from honey bees. HoloBee contains two non-overlapping sets of sequence data, HoloBee-Barcode and HoloBee-Mop, each of which have distinct intended uses. HoloBee-Barcode is a non-redundant database of taxonomically informative barcoding loci for all viruses, bacteria, fungi, protozoans and metazoans associated with honey bees (Apis spp.). It was created from an exhaustive master sequence archive of all valid holobiont sequences. 
Redundancy was removed from this master archive using a clustering algorithm that grouped sequences with ≥ 99% identity and retained the longest sequence from each cluster as the representative accession for that sequence type (“centroid”). These centroid sequences were concatenated into a fasta formatted file to create the HoloBee-Barcode database. Associated taxonomy for each centroid, including Superkingdom through Species and Strain/Isolate, was individually reviewed and corrected when necessary by a curator. Cross reference tables (separated according to 5 major taxonomic groups) provide a user-friendly outline of information for each centroid accession within HoloBee-Barcode including taxonomy, gene/product name, sequence length, the unaltered NCBI definition line, the number and identity of redundant sequences clustered within each centroid, and any additional information provided by the curator. HoloBee-Barcode centroid counts are: Viruses = 86; Bacteria = 496; Fungi = 41; Protozoa = 4; Metazoa = 60. HoloBee-Barcode is intended to improve and standardize quantitative and qualitative metagenomic descriptions of holobiont communities associated with honey bees by providing a curated set of barcode sequences. The goal of genetic barcoding is to associate a nucleotide sequence sample to a taxonomically valid species. Genomic regions targeted for such barcoding purposes varied by taxonomic group. The small subunit (SSU) ribosomal RNA, or 16S rRNA, is the most commonly used barcode for bacteria and is used in HB-Barcode. These 16S rRNA sequences will support the analysis of data generated with the widely used approach of amplicon-based 16S rRNA deep sequencing to study microbiota communities. Although barcode markers for fungi are less definitive than bacteria, HB-Barcode defaults to the ribosomal RNA internal transcribed spacer region (ITS), which typically includes ITS-1, 5.8S, and ITS-2. 
For some clades that cannot be resolved by this region, other barcode markers were selected. The majority of barcodes for metazoan taxa are the mitochondrial locus cytochrome c oxidase subunit I (COI). Complete mitochondrial DNA (mtDNA) sequence for Apis cerana (Asian honey bee) and Galleria mellonella (Greater wax moth) are included as barcodes for these species. We note that A. cerana mtDNA is included because it is considered a potentially invasive honey bee species and monitoring for its occurrence is in practice regionally, including in Australia, New Zealand and the USA. Protozoan barcodes include cytochrome b oxidase (Cytb), SSU, or ITS while entire genomes are used for viral barcoding. HoloBee-Mop is a database comprised mostly of chromosomal, mitochondrial and plasmid genome assemblies in order to aggregate as much honey bee holobiont genomic sequence information as possible. For a few organisms without genome assembly data, transcriptome data are included (e.g. Aethina tumida, small hive beetle). Unlike HoloBee-Barcode, redundancy removal was not performed on the HoloBee-Mop database and thus this resource provides an archive of nucleotide sequence assemblies from honey bee holobionts. However, since full viral genomes are used in HoloBee-Barcode, only redundant viral sequences occur in HoloBee-Mop. All accessions within each of these assemblies were concatenated into a single fasta formatted file to create the HoloBee-Mop database. The intended purpose of HoloBee-Mop is to improve honey bee genome and transcriptome assemblies by “mopping-up” as much viral, bacterial, fungal, protozoan and non-honey bee metazoan sequence data as possible. Therefore, sequence data remaining after processing reads through both HoloBee-Barcode and HoloBee-Mop that do not map to the honey bee genome may contain unique data from taxonomic variants or novel species. 
Details for each sequence assembly within HoloBee-Mop are tabulated in cross reference tables according to each major taxonomic group. HoloBee-Mop assembly counts are: Viruses = 2; Bacteria = 55; Fungi = 5; Protozoa = 1; Metazoa = 6.

Follow the HoloBee database on Twitter at: https://twitter.com/HoloBee_db

For questions about the HoloBee database, contact:
HoloBee database team: holobee.db@gmail.com
Jay Evans: Jay.Evans@ars.usda.gov
Anna Childers: Anna.Childers@ars.usda.gov

Resources in this dataset:

Resource Title: HoloBee_v2016.1 sequence database.
File Name: HB_v2016.1.zip
Resource Description: This compressed file contains two fasta sequence files:
HB_Bar_v2016.1.fasta (HoloBee-Barcode database)
HB_Mop_v2016.1.fasta (HoloBee-Mop database)
md5 values:
HB_v2016.1.zip: 6e372e443744282128eb51488176503f
HB_Bar_v2016.1.fasta: 109e1f686a690c70ef78fc4b5066a01f
HB_Mop_v2016.1.fasta: ced8c3f5987dce69e800c8c491471eba

Resource Title: data dictionary for HoloBee_v2016.1.
File Name: Data_Dictionary_HoloBee_v2016.1.xlsx

Resource Title: HoloBee_v2016.1 cross reference tables.
File Name: HB_v2016.1_crossref.zip
Resource Description: This compressed file contains ten spreadsheet files (.xlsx) tabulating detailed information for all centroids (HoloBee-Barcode database) and sequence assemblies (HoloBee-Mop database) used in HoloBee v2016.1:
HB_Bar_v2016.1_bacteria_crossref_2016-05-18.xlsx
HB_Bar_v2016.1_fungi_crossref_2016-05-20.xlsx
HB_Bar_v2016.1_metazoa_crossref_2016-05-16.xlsx
HB_Bar_v2016.1_protozoa_crossref_2016-05-20.xlsx
HB_Bar_v2016.1_viruses_crossref_2016-05-17.xlsx
HB_Mop_v2016.1_bacteria_crossref_2016-05-12.xlsx
HB_Mop_v2016.1_fungi_crossref_2016-05-12.xlsx
HB_Mop_v2016.1_metazoa_crossref_2016-04-15.xlsx
HB_Mop_v2016.1_protozoa_crossref_2016-04-11.xlsx
HB_Mop_v2016.1_viruses_crossref_2016-05-12.xlsx
md5 value:
HB_v2016.1_crossref.zip: a8a57d92830eb77904743afc95980465

Resource Title: data dictionary for HoloBee_v2016.1.
File Name: Data_Dictionary_HoloBee_v2016.1.csv
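
The md5 values listed for the HoloBee files can be used to check that a download arrived intact. The following is a minimal sketch, assuming the files have been saved under their published names in a single local directory; the helper names `md5_of_file` and `verify_downloads` are hypothetical and are not part of the HoloBee distribution.

```python
import hashlib
from pathlib import Path

# md5 values copied from the HoloBee v2016.1 dataset description.
EXPECTED_MD5 = {
    "HB_v2016.1.zip": "6e372e443744282128eb51488176503f",
    "HB_Bar_v2016.1.fasta": "109e1f686a690c70ef78fc4b5066a01f",
    "HB_Mop_v2016.1.fasta": "ced8c3f5987dce69e800c8c491471eba",
    "HB_v2016.1_crossref.zip": "a8a57d92830eb77904743afc95980465",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True only if it exists and its md5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    # Assumption: the downloads sit in the current working directory.
    for name, ok in verify_downloads().items():
        print(f"{name}: {'OK' if ok else 'MISSING OR MISMATCH'}")
```

A file reported as missing or mismatched is either absent from the directory or differs from the published copy, in which case downloading it again is the usual first step.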
Hepatitis C Virus Database (HCVdb)
scicrunch.org
Updated Jun 27, 2024
Cite
(2024). Hepatitis C Virus Database (HCVdb) [Dataset]. http://identifiers.org/RRID:SCR_005718
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_005718
Dataset updated
Jun 27, 2024
Description
The Hepatitis C Virus Database (HCVdb) is a cooperative project of several groups whose mission is to provide a comprehensive battery of informational and analytical tools to the scientific community studying the hepatitis C virus. The Viral Bioinformatics Resource Center (VBRC), the Immune Epitope Database and Analysis Resource (IEDB), the Broad Institute Microbial Sequencing Center (MSC), and the Los Alamos HCV Sequence Database (HCV-LANL) are combining forces to acquire and annotate data on the hepatitis C virus, and to develop and use new tools to facilitate its study.

COG-UK Viral Genome Sequences: 95 scholarly articles cite this dataset, according to Google Scholar.


