100+ datasets found
  1. Data from: VIDA: a virus database system for the organization of animal...

    • catalog.data.gov
    • odgavaprod.ogopendata.com
    Updated Sep 6, 2025
    Cite
    National Institutes of Health (2025). VIDA: a virus database system for the organization of animal virus genome open reading frames [Dataset]. https://catalog.data.gov/dataset/vida-a-virus-database-system-for-the-organization-of-animal-virus-genome-open-reading-fram
    Dataset updated
    Sep 6, 2025
    Dataset provided by
    National Institutes of Health
    Description

    VIDA is a new virus database that organizes open reading frames (ORFs) from partial and complete genomic sequences from animal viruses. Currently VIDA includes all sequences from GenBank for Herpesviridae, Coronaviridae and Arteriviridae. The ORFs are organized into homologous protein families, which are identified on the basis of sequence similarity relationships. Conserved sequence regions of potential functional importance are identified and can be retrieved as sequence alignments. We use a controlled taxonomical and functional classification for all the proteins and protein families in the database. When available, protein structures that are related to the families have also been included. The database is available for online search and sequence information retrieval at http://www.biochem.ucl.ac.uk/bsm/virus_database/VIDA.html.

  2. Kraken2 Metagenomic Virus Database

    • osti.gov
    Updated Apr 23, 2020
    Cite
    Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF) (2020). Kraken2 Metagenomic Virus Database [Dataset]. http://doi.org/10.13139/OLCF/1615774
    Dataset updated
    Apr 23, 2020
    Dataset provided by
    Office of Science (http://www.er.doe.gov/)
    Department of Energy Biological and Environmental Research Program
    Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
    Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
    Description

    The Database: Kraken2 [1] database built from a classification tree containing over 700k metagenomic viruses from JGI IMG/VR [2].

    [1] Wood, D. E., Lu, J., & Langmead, B. (2019). Improved metagenomic analysis with Kraken 2. Genome Biol., 20(1), 1–13. doi: 10.1186/s13059-019-1891-0
    [2] Paez-Espino D, Chen I-MA, Palaniappan K, Ratner A, Chu K, Szeto E, et al. IMG/VR: a database of cultured and uncultured DNA Viruses and retroviruses. Nucleic Acids Res. 2017;45:D457–65.

    For Paper:

    Title: A k-mer based approach for virus classification in metatranscriptomic and metagenomic samples identifies viral associations in the Populus phytobiome and autism brains

    Abstract

    Background

    Viruses are an underrepresented taxon in the study and identification of microbiome constituents; however, they play an important role in health, microbiome regulation, and transfer of genetic material. Only a few thousand viruses have been isolated, sequenced, and assigned a taxonomy, which further limits the ability to identify and quantify viruses in the microbiome. Additionally, the vast diversity of viruses represents a challenge for classification, not only in constructing a viral taxonomy, but also in identifying similarities between a virus' genotype and its phenotype. However, the diversity of viral sequences can be leveraged to classify their sequences in metagenomic and metatranscriptomic samples.

    Methods

    To identify viruses in transcriptomic and genomic samples, we developed a dynamic programming algorithm for creating a classification tree out of 715,672 metagenome viruses. To create the classification tree, we clustered proportional similarity scores generated from the k-mer profiles of each of the metagenome viruses. We then integrated the viral classification tree with the NCBI taxonomy for use with ParaKraken, a metagenomic/transcriptomic classifier.

    Results

    To illustrate the breadth of our utility for classifying viruses with ParaKraken, we analyzed data from a plant metagenome study identifying the differences between two Populus genotypes in three different compartments and on a human metatranscriptome study identifying the differences between Autism Spectrum Disorder patients and controls in post mortem brain biopsies. In the Populus study, we identified genotype and compartment specific viral signatures, while in the Autism study we identified a significantly increased abundance of eight viral sequences in Autism brain biopsies.

    Conclusion

    Viruses represent an important aspect of the microbiome. The ability to classify viruses represents the first step in being able to better understand their role in the microbiome. The viral classification method presented here allows for more complete identification of viral sequences for use in identifying associations between viruses and the host and viruses and other microbiome members.

    Acknowledgements and Funding

    This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. This research was also supported by the Plant-Microbe Interfaces Scientific Focus Area in the Genomic Science Program, the Office of Biological and Environmental Research (BER) in the U.S. Department of Energy Office of Science, and by the Department of Energy, Laboratory Directed Research and Development funding (ProjectID 8321), at the Oak Ridge National Laboratory. Oak Ridge National Laboratory is managed by UT-Battelle, LLC, for the US DOE under contract DE-AC05-00OR22725. This research used resources of the Compute and Data Environment for Science (CADES).
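
The Methods text mentions proportional similarity scores computed from k-mer profiles but does not define them. The following is only a minimal sketch under one common reading, in which a profile is the relative frequency of each k-mer and the similarity of two profiles is the sum of the smaller frequency over all k-mers. The function names and the choice of k are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def kmer_profile(seq, k=4):
    """Relative frequency of every k-mer in seq (hypothetical helper)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def proportional_similarity(p, q):
    """Sum of the shared relative frequency over all k-mers (range 0 to 1).

    Assumed definition; the dataset description does not spell it out.
    """
    return sum(min(freq, q.get(kmer, 0.0)) for kmer, freq in p.items())


if __name__ == "__main__":
    a = kmer_profile("ACGTACGTACGT", k=2)
    b = kmer_profile("ACGTACGTTTTT", k=2)
    print(round(proportional_similarity(a, b), 3))
```

A matrix of such pairwise scores could then be passed to a clustering routine to build a tree; that step is left out here because the source gives no details about it.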

  3. IVDB - Influenza Virus Database

    • scicrunch.org
    • neuinfo.org
    • +1 more
    Updated Dec 4, 2023
    Cite
    (2023). IVDB - Influenza Virus Database [Dataset]. http://identifiers.org/RRID:SCR_013404
    Dataset updated
    Dec 4, 2023
    Description

    IVDB hosts complete genome sequences of influenza A virus generated by BGI and curates all other published influenza virus sequences after expert annotations. For the convenience of efficient data utilization, our Q-Filter system classifies and ranks all nucleotide sequences into 7 categories according to sequence content and integrity. IVDB provides a series of tools and viewers for analyzing the viral genomes, genes, genetic polymorphisms and phylogenetic relationships comparatively. A searching system is developed for users to retrieve a combination of different data types by setting various search options. To facilitate analysis of the global viral transmission and evolution, the IV Sequence Distribution Tool (IVDT) is developed to display worldwide geographic distribution of the viral genotypes and to couple genomic data with epidemiological data. The BLAST, multiple sequence alignment tools and phylogenetic analysis tools were integrated for online data analysis. Furthermore, IVDB offers instant access to the pre-computed alignments and polymorphism analysis of influenza virus genes and proteins and presents the results by SNP distribution plots and minor allele distributions. IVDB aims to be a powerful information resource and an analysis workbench for scientists working on IV genetics, evolution, diagnostics, vaccine development, and drug design.

  4. Viral RefSeq databases for Centrifuge, Kraken2 and DIAMOND

    • zenodo.org
    • search.dataone.org
    • +1 more
    application/gzip, txt
    Updated Jun 5, 2022
    Cite
    Anna-Sapfo Malaspinas; Samuel Neuenschwander; Yami Arizmendi Cárdenas (2022). Viral RefSeq databases for Centrifuge, Kraken2 and DIAMOND [Dataset]. http://doi.org/10.5061/dryad.mkkwh711w
    Available download formats: txt, application/gzip
    Dataset updated
    Jun 5, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anna-Sapfo Malaspinas; Samuel Neuenschwander; Yami Arizmendi Cárdenas
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Owing to technological advances in ancient DNA, it is now possible to sequence viruses from the past to track down their origin and evolution. However, ancient DNA data is considerably more degraded and contaminated than modern data making the identification of ancient viral genomes particularly challenging. Several methods to characterise the modern microbiome (and, within this, the virome) have been developed; in particular, tools that assign sequenced reads to specific taxa in order to characterise the organisms present in a sample of interest. While these existing tools are routinely used in modern data, their performance when applied to ancient microbiome data to screen for ancient viruses remains unknown.

    In this work, we conducted an extensive simulation study using public viral sequences to establish which tool is the most suitable to screen ancient samples for human DNA viruses. We compared the performance of four widely used classifiers, namely Centrifuge, Kraken2, DIAMOND and MetaPhlAn2, in correctly assigning sequencing reads to the corresponding viruses. To do so, we simulated reads by adding noise typical of ancient DNA to a set of publicly available human DNA viral sequences and to the human genome. We fragmented the DNA into different lengths, added sequencing error and C to T and G to A deamination substitutions at the read termini. Then we measured the resulting sensitivity and precision for all classifiers.
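
The read simulation described above (fragmentation, sequencing error, and terminal C to T and G to A substitutions) can be sketched roughly as follows. This is only an illustrative sketch: the function names, the uniform per-base rates, and the fixed number of affected terminal positions are assumptions, and the study's actual simulator and parameters are not given in this description.

```python
import random


def fragment(genome, length, rng):
    """Cut one fragment of the given length at a random position."""
    start = rng.randrange(len(genome) - length + 1)
    return genome[start:start + length]


def deaminate(read, rate, n_ends, rng):
    """Apply C->T near the 5' end and G->A near the 3' end.

    Only the outermost n_ends positions on each side are eligible
    (a simplification; real damage decays with distance from the end).
    """
    bases = list(read)
    for i in range(min(n_ends, len(bases))):
        if bases[i] == "C" and rng.random() < rate:
            bases[i] = "T"
        j = len(bases) - 1 - i
        if bases[j] == "G" and rng.random() < rate:
            bases[j] = "A"
    return "".join(bases)


def add_errors(read, rate, rng):
    """Substitute each base with a different base with probability rate."""
    out = []
    for base in read:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)
    genome = "".join(rng.choice("ACGT") for _ in range(500))
    read = fragment(genome, 40, rng)
    damaged = add_errors(deaminate(read, 0.3, 3, rng), 0.01, rng)
    print(read)
    print(damaged)
```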

    Across most simulations, more than 228 out of the 233 simulated viruses are recovered by Centrifuge, Kraken2 and DIAMOND, in contrast to MetaPhlAn2 which recovers only around one third. Overall, Centrifuge and Kraken2 have the best performance with the highest values of sensitivity and precision. We found that deamination damage has little impact on the performance of the classifiers, less than the sequencing error and the length of the reads. Since Centrifuge can handle short reads (in contrast to DIAMOND and Kraken2 with default settings) and since it achieves the highest sensitivity and precision at the species level across all the simulations performed, it is our recommended tool. Regardless of the tool used, our simulations indicate that, for ancient human studies, users should use strict filters to remove all reads of potential human origin. Finally, we recommend verifying which species are present in the database used, as default databases may lack sequences for viruses of interest.

  5. GCVDB Viruses

    • figshare.com
    application/gzip
    Updated Jan 15, 2024
    Cite
    Bailey Wallace (2024). GCVDB Viruses [Dataset]. http://doi.org/10.6084/m9.figshare.24968805.v1
    Available download formats: application/gzip
    Dataset updated
    Jan 15, 2024
    Dataset provided by
    figshare
    Authors
    Bailey Wallace
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Viral genomes and genome fragments from the Global Coral Viruses Database (GCVDB).

  6. Viral genomes from GenBank (reference) - Comparative analysis of gene...

    • figshare.com
    application/x-gzip
    Updated Jun 3, 2023
    Cite
    Enrique Gonzalez Tortuero; Revathy Krishnamurthi; Heather Allison; Ian Goodhead; Chloë James (2023). Viral genomes from GenBank (reference) - Comparative analysis of gene prediction tools for viral genome annotation [Dataset]. http://doi.org/10.6084/m9.figshare.21353829.v1
    Available download formats: application/x-gzip
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Enrique Gonzalez Tortuero; Revathy Krishnamurthi; Heather Allison; Ian Goodhead; Chloë James
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The file "viral.genomic.gbk.tar.gz" contains the complete RefSeq viral database in GenBank format, used as the gold standard for the comparisons. It should be passed as is to the script "genecounter.py" to count the number of genes, and it is the second (mandatory) input file when counting true positives (TP), false positives (FP) and false negatives (FN) with "coordinateschecker.py". It can also be used for other evaluation purposes.
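
As a rough illustration of what counting TP, FP and FN against a gold standard involves, the sketch below compares predicted gene coordinates with reference coordinates. It does not reproduce "coordinateschecker.py", whose matching rules are not described here. The exact-match criterion on (start, end, strand) and all names are assumptions.

```python
def count_matches(predicted, reference):
    """Return (TP, FP, FN) for two collections of (start, end, strand) tuples.

    A prediction counts as a true positive only if it matches a reference
    gene exactly (assumed criterion).
    """
    pred, ref = set(predicted), set(reference)
    tp = len(pred & ref)
    fp = len(pred - ref)
    fn = len(ref - pred)
    return tp, fp, fn


def sensitivity_precision(tp, fp, fn):
    """Sensitivity = TP / (TP + FN); precision = TP / (TP + FP)."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    reference = [(1, 300, "+"), (400, 900, "-"), (1000, 1500, "+")]
    predicted = [(1, 300, "+"), (410, 900, "-"), (1000, 1500, "+"), (2000, 2200, "+")]
    tp, fp, fn = count_matches(predicted, reference)
    print(tp, fp, fn, sensitivity_precision(tp, fp, fn))
```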

  7. Data from: Deformed wing virus genome sequence data

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Deformed wing virus genome sequence data [Dataset]. https://catalog.data.gov/dataset/deformed-wing-virus-genome-sequence-data
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    The presence and replication of honeybee deformed wing virus variant A (DWV-A) was recently confirmed in the red imported fire ant, Solenopsis invicta Buren. Reported here is the complete genome sequence data of this virus from S. invicta, which is valuable for future research on DWV.

  8. VIDA

    • rrid.site
    • scicrunch.org
    • +2 more
    Updated Oct 9, 2025
    Cite
    (2025). VIDA [Dataset]. http://identifiers.org/RRID:SCR_007111
    Dataset updated
    Oct 9, 2025
    Description

    VIDA contains a collection of homologous protein families derived from open reading frames from complete and partial virus genomes. For each family, users can get an alignment of the conserved regions, functional and taxonomy information, and links to DNA sequences and structures.

    * Search homologous protein families from particular virus families
    * Links to complete genome sequence: Arteriviridae, Coronaviridae, Herpesviridae, Poxviridae

    The Virus Database at University College London has been developed as a system to organize animal virus open reading frame sequences. All known and predicted protein sequences from complete and partial genomes of particular virus families are extracted from GenBank and filtered to remove 100% redundancy. On the basis of sequence similarity the sequences are then clustered into homologous protein families (HPFs). The families are enriched with annotations including function and functional classification, related protein structures, taxonomy, length of the proteins, boundaries of the conserved region/s, virus-specific gene name and links to EMBL entries and SWISSPROT.

    THIS RESOURCE IS NO LONGER IN SERVICE. Documented on September 16, 2025.
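
The description says sequences are filtered to remove 100% redundancy before clustering. A minimal sketch of that idea follows, assuming that "100% redundancy" means identical sequences and that the first record seen is the one kept. The function name and the record layout are hypothetical, and VIDA's actual procedure is not documented here.

```python
def remove_exact_duplicates(records):
    """Keep the first (name, sequence) record for each distinct sequence.

    Comparison is case-insensitive; identical sequences under different
    names are treated as redundant (assumed reading of "100% redundancy").
    """
    seen = set()
    unique = []
    for name, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((name, seq))
    return unique


if __name__ == "__main__":
    # Made-up protein fragments, for illustration only.
    proteins = [
        ("orf1", "MKTAYIAKQR"),
        ("orf2", "MKTAYIAKQR"),
        ("orf3", "MSLLTEVETP"),
    ]
    print(remove_exact_duplicates(proteins))
```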

  9. VIRsiRNAdb

    • scicrunch.org
    • dknet.org
    • +1 more
    Updated Aug 2, 2011
    Cite
    (2011). VIRsiRNAdb [Dataset]. http://identifiers.org/RRID:SCR_006108
    Dataset updated
    Aug 2, 2011
    Description

    VIRsiRNAdb is a curated database of experimentally validated viral siRNA / shRNA targeting diverse genes of 42 important human viruses including influenza, SARS and Hepatitis viruses. Submissions are welcome. Currently, the database provides detailed experimental information for 1358 siRNA/shRNA, including siRNA sequence, virus subtype, target gene, GenBank accession, design algorithm, cell type, test object, test method and efficacy (mostly quantitative efficacies). Further, wherever available, information regarding alternative efficacies of more than 300 siRNAs derived from different assays has also been incorporated. The database has facilities like search, advanced search (using Boolean operators AND, OR), browsing (with data sorting option), internal linking and external linking to other databases (Pubmed, Genbank, ICTV). Additionally, useful siRNA analysis tools are provided, e.g. siTarAlign for aligning the siRNA sequence with reference viral genomes or user defined sequences. VIRsiRNAdb should prove useful for RNAi researchers, especially in siRNA based antiviral therapeutics development.

  10. COG-UK Viral Genome Sequences

    • healthdatagateway.org
    unknown
    Cite
    COG-UK Consortium, COG-UK Viral Genome Sequences [Dataset]. http://doi.org/10.1016/S2666-5247(20)30054-9
    Available download formats: unknown
    Dataset provided by
    COVID-19 Genomics UK Consortium
    Authors
    COG-UK Consortium
    License

    https://www.cogconsortium.uk/data/

    Description

    The current COVID-19 pandemic, caused by the SARS-CoV-2 virus, represents a major threat to health in the UK and globally. Fully understanding the transmission and evolution of the virus requires sequencing and analysing viral genomes at scale and speed. The number of samples calls for a rapid and robust increase in the UK’s pathogen genome sequencing capacity.

    To provide this increased capacity to collect, sequence and analyse the whole genomes of virus samples in the UK, the COVID-19 Genomics UK (COG-UK) consortium is pooling the world leading knowledge and expertise in genomics of the four UK Public Health Agencies, multiple regional University hubs, and large sequencing centres such as the Wellcome Sanger Institute.

  11. COVID-19 Genome Sequence Dataset

    • registry.opendata.aws
    Updated Jul 9, 2020
    Cite
    National Library of Medicine (NLM) (2020). COVID-19 Genome Sequence Dataset [Dataset]. https://registry.opendata.aws/ncbi-covid-19/
    Dataset updated
    Jul 9, 2020
    Dataset provided by
    National Library of Medicine (NLM) (http://nlm.nih.gov/)
    Description

    This repository within the ACTIV TRACE initiative houses a comprehensive collection of datasets related to SARS-CoV-2. The processing of SARS-CoV-2 Sequence Read Archive (SRA) files has been optimized to identify genetic variations in viral samples. This information is then presented in the Variant Call Format (VCF). Each VCF file corresponds to the SRA parent-run's accession ID. Additionally, the data is available in the parquet format, making it easier to search and filter using the Amazon Athena Service. The SARS-CoV-2 Variant Calling Pipeline is designed to handle new data every six hours, with updates to the AWS ODP bucket occurring daily.

  12. NCBI Genome

    • neuinfo.org
    • scicrunch.org
    • +2 more
    Updated Nov 15, 2024
    Cite
    (2024). NCBI Genome [Dataset]. http://identifiers.org/RRID:SCR_002474
    Dataset updated
    Nov 15, 2024
    Description

    Database that organizes information on genomes including sequences, maps, chromosomes, assemblies, and annotations in six major organism groups: Archaea, Bacteria, Eukaryotes, Viruses, Viroids, and Plasmids. Genomes of over 1,200 organisms can be found in this database, representing both completely sequenced organisms and those for which sequencing is in progress. Users can browse by organism, and view genome maps and protein clusters. Links to other prokaryotic and archaeal genome projects, as well as BLAST tools and access to the rest of the NCBI online resources are available.

  13. Virus+ Sequence Masked Mouse Reference Genome (GRCm38)

    • zenodo.org
    application/gzip
    Updated Feb 9, 2021
    Cite
    Scott A Handley (2021). Virus+ Sequence Masked Mouse Reference Genome (GRCm38) [Dataset]. http://doi.org/10.5281/zenodo.4116249
    Available download formats: application/gzip
    Dataset updated
    Feb 9, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Scott A Handley
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    A version of the mouse genome (GRCm38) masked for all possible viral sequences.

    See Virus+ Masked Human Genome for a masked human reference database.

    The following commands were used to generate the additional virus sequence masked reference database:

    1) Download all RefSeq and Neighbor nucleotide records:

    https://www.ncbi.nlm.nih.gov/nuccore/?term=Viruses[Organism]%20NOT%20cellular%20organisms[ORGN]%20NOT%20wgs[PROP]%20NOT%20gbdiv%20syn[prop]%20AND%20(srcdb_refseq[PROP]%20OR%20nuccore%20genome%20samespecies[Filter])

    2) Shred the downloaded viral genomes using shred.sh from the bbtools package

    shred.sh in=refseq_virus_reformated.fasta out=virus_shred.fasta.gz length=85 minlength=75 overlap=30

    3) Map shredded virus sequence to the GRCm38 genome using bbmap.sh from the bbtools package

    bbmap.sh ref=GRCm38.fa.gz in=virus_shred.fasta.gz outm=map_mouse_all_viruses.sam minid=0.90

    4) Mask the regions of the GRCm38 genome to which virus sequences mapped, using bbmask.sh from the bbtools package

    bbmask.sh in=GRCm38.fa.gz out=GRCm38_virus_masked.fasta.gz sam=map_mouse_all_viruses.sam

    5) Remove all N's to further reduce file size using seqkit

    seqkit -is replace -p "n" -r "" GRCm38_virus_masked.fasta.gz > mouse_virus_masked.fasta_Ns_removed.gz

    Additional References:

    1. bbtools
    2. seqkit
    3. NCBI Virus Genome RefSeq
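
To make the effect of step 5 concrete, here is a small sketch of what removing masked bases from FASTA text amounts to. It assumes the seqkit call deletes every "n" or "N" from sequence lines while leaving header lines untouched. The function name is hypothetical, and the sketch is not a replacement for the commands listed above.

```python
import re


def strip_n(fasta_text):
    """Delete all N/n characters from sequence lines of FASTA text.

    Header lines (starting with '>') are kept unchanged; sequence lines
    that become empty are dropped.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(line)
        else:
            cleaned = re.sub("n", "", line, flags=re.IGNORECASE)
            if cleaned:
                out.append(cleaned)
    return "\n".join(out)


if __name__ == "__main__":
    masked = ">chr_example\nACGTNNNNACGT\nnnnnnnnn\nTTNNGG"
    print(strip_n(masked))
```

Deleting bases shifts every downstream position, so a reference processed this way no longer shares coordinates with the original assembly.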

  14. RNA Virus Database

    • neuinfo.org
    • scicrunch.org
    • +1 more
    Updated Nov 9, 2024
    Cite
    (2024). RNA Virus Database [Dataset]. http://identifiers.org/RRID:SCR_007899
    Dataset updated
    Nov 9, 2024
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented August 19, 2016. It is a database and web application describing the genome organization and providing analytical tools for the 938 known species of RNA virus. It can identify submitted nucleotide sequences, can place them into multiple whole-genome alignments (in species where more than one isolate has been fully sequenced) and contains translated genome sequences for all species. It has been created for two main purposes: to facilitate the comparative analysis of RNA viruses and to become a hub for other, more specialised virus Web sites.

  15. "Genome binning of viral entities from bulk metagenomics data" - CAMISIM...

    • data.niaid.nih.gov
    Updated Jan 5, 2022
    Cite
    Johansen, Joachim (2022). "Genome binning of viral entities from bulk metagenomics data" - CAMISIM simulated datasets and genomes [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5676246
    Dataset updated
    Jan 5, 2022
    Dataset authored and provided by
    Johansen, Joachim
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Genome binning of viral entities from bulk metagenomics data

    Authors

    Joachim Johansen (1,2), Damian R. Plichta (2), Jakob Nybo Nissen (1,3), Marie Louise Jespersen (1,4), Shiraz A. Shah (5), Ling Deng (6), Jakob Stokholm (5,6), Hans Bisgaard (5), Dennis Sandris Nielsen (6), Søren Sørensen (7), Simon Rasmussen (1)

    Affiliations

    1 Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen N, Denmark

    2 Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA

    3 Statens Serum Institut, Viral & Microbial Special diagnostics, Copenhagen, Denmark

    4 National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark

    5 Copenhagen Prospective Studies on Asthma in Childhood (COPSAC), Herlev and Gentofte Hospital, University of Copenhagen, Copenhagen, Denmark

    6 Section of Food Microbiology and Fermentation, Department of Food Science, Faculty of Science, University of Copenhagen, Copenhagen, Denmark

    7 Section of Microbiology, Department of Biology, University of Copenhagen, Copenhagen, Denmark

    Methods description

    We compared the viral binning performance of VAMB and MetaBAT2 using the official CAMI consortium method to create assemblies and metagenome profiles. To this end we generated 3 different metagenome compositions with up to 308 reference genomes: one mixed with bacteria, plasmids and viruses to test binning in complex samples, i.e. high diversity (1); one with only crass-like viruses to test binning with highly similar viruses, i.e. high relatedness (2); and a set of small viruses (<6,000 bp) including members of the Microviridae family to address the bias of size (3). Bacterial genomes were gathered from NCBI's RefSeq genome repository 2021, plasmids from the PLSDB database (v. 2021_06_23) and viral genomes from the recent MGV database.

    Dataset A contained a mixture of bacteria (N=8), plasmids (N=20) and viruses (N=280) to test binning in complex samples, i.e. high diversity. Dataset B contained only crass-like viruses (N=80) to test binning with highly similar viruses, i.e. high relatedness. Dataset C contained small viruses (N=50, <6,000 bp) of the Microviridae family to address the bias of size. Bacterial genomes were sampled from the RefSeq genome repository 2021, plasmids from the PLSDB database and viral genomes from the recent MGV database (Nayfach, et al. Nature Microbiology 2021).

  16. Complete genome sequence of a novel extrachromosomal virus-like element...

    • catalog.data.gov
    • data.virginia.gov
    Updated Sep 7, 2025
    Cite
    National Institutes of Health (2025). Complete genome sequence of a novel extrachromosomal virus-like element identified in planarian [Dataset]. https://catalog.data.gov/dataset/complete-genome-sequence-of-a-novel-extrachromosomal-virus-like-element-identified-in-plan
    Dataset updated
    Sep 7, 2025
    Dataset provided by
    National Institutes of Health
    Description

    Background

    Freshwater planarians are widely used as models for investigation of pattern formation and studies on genetic variation in populations. Despite extensive information on the biology and genetics of planaria, the occurrence and distribution of viruses in these animals remains an unexplored area of research.

    Results

    Using a combination of Suppression Subtractive Hybridization (SSH) and Mirror Orientation Selection (MOS), we compared the genomes of two strains of freshwater planarian, Girardia tigrina. The novel extrachromosomal DNA-containing virus-like element denoted PEVE (Planarian Extrachromosomal Virus-like Element) was identified in one planarian strain. The PEVE genome (about 7.5 kb) consists of two unique regions (Ul and Us) flanked by inverted repeats. Sequence analyses reveal that PEVE comprises two helicase-like sequences in the genome, of which the first is a homolog of a circoviral replication initiator protein (Rep), and the second is similar to the papillomavirus E1 helicase domain. PEVE genome exists in at least two variant forms with different arrangements of single-stranded and double-stranded DNA stretches that correspond to the Us and Ul regions. Using PCR analysis and whole-mount in situ hybridization, we characterized PEVE distribution and expression in the planarian body.

    Conclusions

    PEVE is the first viral element identified in free-living flatworms. This element differs from all known viruses and viral elements, and comprises two potential helicases that are homologous to proteins from distant viral phyla. PEVE is unevenly distributed in the worm body, and is detected in specific parenchyma cells.

  17. Data from: A global dataset of sequence, diversity and biosafety...

    • figshare.com
    txt
    Updated Jun 27, 2023
    Cite
    Ying Huang; Shunlong Wang; Hong Liu; Evans Atoni; Fei Wang; Wei Chen; Zhaolin Li; Sergio Rodriguez; Zhiming Yuan; Zhaoyan Ming; Han Xia (2023). A global dataset of sequence, diversity and biosafety recommendation of arbovirus and arthropod-specific virus [Dataset]. http://doi.org/10.6084/m9.figshare.22154573.v7
    Available download formats: txt
    Dataset updated
    Jun 27, 2023
    Dataset provided by
    figshare
    Authors
    Ying Huang; Shunlong Wang; Hong Liu; Evans Atoni; Fei Wang; Wei Chen; Zhaolin Li; Sergio Rodriguez; Zhiming Yuan; Zhaoyan Ming; Han Xia
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    We built a comprehensive dataset of the arboviruses and arthropod-specific viruses by curating worldwide available data from Arbovirus Catalog, Section VIII-F of the Biosafety in Microbiological and Biomedical Laboratories 6th edition, Virus Metadata Resource of International Committee on Taxonomy of Viruses, and GenBank. This dataset includes complete information on viral taxonomy, biological characteristics, vectors and vertebrate hosts, distribution, recommended biosafety levels, genome segment, and nucleotide/amino acid sequences, which will facilitate research by scientists/researchers of arboviruses and arthropod-specific viruses in viral vector/host prediction, disease outbreak risk warning, arbovirus/arthropod-specific interactions, phylogenetic and evolutionary relationships, and biosafety risk assessment.

    This global dataset of viral sequence, diversity, distribution, and biosafety recommendation for arbovirus and ASV contains a viral information file (.xlsx), a nucleic acid sequences file (.fna) and an amino acid sequences file (.faa), as accessible from figshare [26]. The column details of the viral meta information file (.xlsx) are as follows (“NAV” in a field indicates that the value is not available):

    Taxonomy Information 1. Virus_Group: (customized field) viruses in the database are divided into two groups: arbovirus and ASV. The former has both vertebrate and arthropod hosts, the latter has only arthropod hosts. 2. Name: (source from GenBank) the virus name, each name represents a distinct virus. 3. Acronym: (source from BMBL) acronym of virus name. 4. NCBI_Taxonomy_ID: (source from GenBank) taxonomy identifier of virus from NCBI Taxonomy Database. 5. Isolate: (source from GenBank) Isolate of virus from NCBI GenBank. 6. Unified_Isolate_Number: (customized field) renumbering of the field Isolate. Each isolate of the same virus is numbered. 7. Species: (source from ICTV) species that the virus belongs to. Species of the viruses are normally different with their names. 8. Genus: (source from ICTV) genus that the virus belongs to. 9. Family: (source from ICTV) family that the virus belongs to.

    Genome Information
    10. Segmented: (customized field) whether the genome of the virus is segmented (recorded as “yes”) or unsegmented (recorded as “no”). A virus with an unknown number of segments is recorded as “NAV”.
    11. Number_of_Segments: (source from GenBank) the theoretical number of segments of the virus.
    12. Molecule_Type: (source from GenBank) molecule type of the virus genome: ssRNA(+), ssRNA(-), ssRNA(+/-), dsRNA, RNA, ssDNA(+/-), dsDNA, etc.

    Sequence Information
    13. Accession: (source from GenBank) NCBI GenBank Accession of the nucleotide sequence.
    14. Locus: (source from GenBank) the locus name of the nucleotide sequence.
    15. SRA_Accession: (source from GenBank) NCBI SRA Accession of the nucleotide sequence.
    16. Submitters: (source from GenBank) submitters of the nucleotide sequence.
    17. Sequence_Type: (source from GenBank) whether the nucleotide sequence is a reference sequence (recorded as “RefSeq”) or a non-reference sequence (recorded as “GenBank”).
    18. BioSample: (source from GenBank) NCBI BioSample Accession of the nucleotide sequence.
    19. GenBank_Title: (source from GenBank) the “DEFINITION” field of the sequence in the NCBI GenBank database.
    20. Genotype: (source from GenBank) genotype of the nucleotide sequence.
    21. Segment: (source from GenBank) segment identifier of the nucleotide sequence.
    22. Unified_Segment_Number: (customized field) renumbering of the field Segment. Each segment is assigned a new number starting from 1; the single segment of an unsegmented virus is assigned 1.

    Host Information
    23. Host_Species: (customized field) the species of the dead-end host of the virus.
    24. Host_Genus: (customized field) the genus of the dead-end host of the virus.
    25. Host_Family: (customized field) the family of the dead-end host of the virus.
    26. Host: (source from GenBank) the field from the NCBI GenBank database that represents the dead-end host or vectors.

    Biosafety Information
    27. Recommended_BSL: (customized field) recommended biosafety level of the laboratory for research on the virus (recorded as “2”, “3”, “4”, “NAV”).
    28. BMBL_Recommended_BSL: (source from BMBL) BMBL-recommended biosafety level of the laboratory for research on the virus (recorded as “2”, “2 with 3 practices”, “2b”, “3”, “3a”, “3b”, “4”, “NAV”).
    29. Basis_of_Rating: (source from BMBL) risk assessment of the virus (recorded as “A1”, “A2”, “A3”, “A4”, “A7”, “IE”, “S”, “NAV”).
    30. Antigenic_Group: (source from BMBL) the antigenic group of the virus.
    31. Isolated: (customized field) whether the virus has been isolated (“Yes” or “No”).

    Source Information
    32. Latitude_and_Longitude: (source from GenBank) latitude and longitude of the virus isolation source.
    33. State_or_Province: (customized field) state or provincial administrative unit of the virus source.
    34. Geo_Location: (source from GenBank) geographical position of the virus source.
    35. Country_or_Region: (customized field) the country or region of the virus source.
    36. Isolation_Source: (source from GenBank) the organism from which the virus was collected.
    37. Collection_Date: (source from GenBank) the date on which the virus was collected.
    38. Submit_Date: (source from GenBank) the date on which the virus was submitted.
    39. Release_Date: (source from GenBank) the date on which the virus was released or last modified.

    References
    40. Publications: (customized field) the number of publications covering research on the specific virus.
    41. Accession_URL: (customized field) the DOI leading directly to the GenBank source.
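    The “NAV” convention described above can be handled when the metadata table is processed. The following is a minimal sketch, assuming the .xlsx rows have already been loaded into dictionaries keyed by column name (reading the spreadsheet itself would need a third-party library and is not shown). The function names and the example rows are hypothetical and are not part of the dataset.

```python
from typing import Any, Dict, Iterable, List

NAV = "NAV"  # placeholder the dataset uses for "not available"


def clean_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the row with every "NAV" string replaced by None."""
    return {
        key: (None if isinstance(value, str) and value.strip() == NAV else value)
        for key, value in row.items()
    }


def filter_by_bsl(rows: Iterable[Dict[str, Any]], level: str) -> List[Dict[str, Any]]:
    """Keep rows whose Recommended_BSL equals the given level ("2", "3" or "4")."""
    cleaned = (clean_row(row) for row in rows)
    return [row for row in cleaned if str(row.get("Recommended_BSL")) == str(level)]


# Made-up rows for illustration only; real rows carry all 41 columns.
example_rows = [
    {"Name": "Virus A", "Acronym": "NAV", "Recommended_BSL": "3"},
    {"Name": "Virus B", "Acronym": "VB", "Recommended_BSL": "NAV"},
]
bsl3 = filter_by_bsl(example_rows, "3")
```

    Mapping “NAV” to None keeps missing values distinct from genuine strings, so later filters do not mistake a missing biosafety level for a real one.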

    The nucleotide sequences file and amino acid sequences file are standard FASTA files. Each sequence record consists of two lines: a header and the content. The header contains the locus and the accession, separated by '|'. The content is a specific nucleic acid or amino acid sequence. The fields in the header are defined as follows:
    1. Locus: NCBI GenBank LOCUS ID of the nucleotide sequence.
    2. Accession: NCBI GenBank Accession of the nucleotide sequence.
    3. Protein_ID: a protein sequence identification number (amino acid sequences file only).
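    The header layout described above can be read with a few lines of code. This is a minimal sketch, assuming the fields appear in the order locus, accession and (in the .faa file) protein ID, and that the header line starts with the usual FASTA ">" character. The field order, the function names and the example records are assumptions made for illustration and have not been checked against the files.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

# Assumed field order within the '|'-separated header.
HEADER_FIELDS = ["locus", "accession", "protein_id"]


def parse_header(header: str) -> Dict[str, str]:
    """Split a FASTA header such as '>LOCUS|ACCESSION' into named fields."""
    parts = header.lstrip(">").strip().split("|")
    return {name: part.strip() for name, part in zip(HEADER_FIELDS, parts)}


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[Dict[str, str], str]]:
    """Yield (header fields, sequence) pairs from FASTA-formatted lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield parse_header(header), "".join(chunks)
            header, chunks = line, []
        else:
            # Tolerate sequences wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield parse_header(header), "".join(chunks)


# Made-up records for illustration only.
example = [">LOC1|ACC1.1", "ACGT", "TTGA", ">LOC2|ACC2.1|PROT2.1", "MKV"]
records = list(read_fasta(example))
```

    Any iterable of lines works as input, so an open file handle for the .fna or .faa file can be passed to read_fasta directly.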

  18. d

    Descriptions of Plant Viruses

    • dknet.org
    Updated Jan 29, 2022
    Cite
    (2022). Descriptions of Plant Viruses [Dataset]. http://identifiers.org/RRID:SCR_006656
    Explore at:
    Dataset updated
    Jan 29, 2022
    Description

    DPVweb provides a central source of information about viruses, viroids and satellites of plants, fungi and protozoa. Comprehensive taxonomic information, including brief descriptions of each family and genus, and classified lists of virus sequences are provided. The database also holds detailed, curated information for all sequences of viruses, viroids and satellites of plants, fungi and protozoa that are complete or that contain at least one complete gene. For comparative purposes, it also contains a single representative sequence of all other fully sequenced virus species with an RNA or single-stranded DNA genome. The start and end positions of each feature (gene, non-translated region and the like) have been recorded and checked for accuracy. As far as possible, nomenclature for genes and proteins is standardized within genera and families. Sequences of features (either as DNA or amino acid sequences) can be directly downloaded from the website in FASTA format. The sequence information can also be accessed via client software for PCs (freely downloadable from the website) that enables users to easily select sequences and features of a chosen virus for further analyses. The public sequence databases contain vast amounts of data on virus genomes, but accessing and comparing the data, except for relatively small sets of related viruses, can be very time consuming. The procedure is made difficult because some of the sequences in these databases are incorrectly named, poorly annotated or redundant. The NCBI Reference Sequence project (1) provides a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA) and protein products, for major research organisms. This now includes curated information for a single sequence of each fully sequenced virus species. While this is a welcome development, it can only deal with complete sequences.
    An important feature of DPV is the opportunity to access genes (and other features) of multiple sequences quickly and accurately. Thus, for example, it is easy to obtain the nucleotide or amino acid sequences of all the available accessions of the coat protein gene of a given virus species or of a group of viruses. To increase its usefulness further, DPVweb also contains a single representative sequence of all other fully sequenced virus species with an RNA or single-stranded DNA (ssDNA) genome. Sponsors: This site is supported by the Association of Applied Biologists and the Zhejiang Academy of Agricultural Sciences, Hangzhou, People's Republic of China.

  19. d

    Data from: HoloBee Database v2016.1

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +4more
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). HoloBee Database v2016.1 [Dataset]. https://catalog.data.gov/dataset/holobee-database-v2016-1-9e8e9
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    Organisms living in honey bees and honey bee colonies form large associative holobiont communities that are integral to bee biology. High-throughput sequencing approaches to characterize these holobiont communities from honey bees in various states of health and disease are now commonplace, producing large amounts of nucleotide sequence data that must be accurately and consistently analyzed in order to produce reliable and comparable reports. In addition, new species designations and revisions are actively being made from honey bee holobiont communities, complicating nomenclature in larger databases where taxonomic descriptions associated with archived sequences can quickly become outdated and misleading. To improve the accuracy and consistency of honey bee holobiont research, we have developed HoloBee: a curated database of publicly accessioned nucleotide sequences from the honey bee holobiont community. Except for rare, noted exceptions made by curators, sequences used in HoloBee were obtained from, or in association with, Apis mellifera (Western honey bee) as well as other honey bee species where available (e.g. Apis cerana, Apis dorsata, Apis laboriosa, Apis koschevnikovi, Apis florea, Apis andreniformis and Apis nigrocincta). Sources include: within or on the surface of honey bees (adult, pupae, larvae, egg), corbicular pollen, bee bread, royal jelly, honey, comb, hive surfaces (e.g. bottom board debris, frames, landing platforms), and isolates of microbes, parasites and pathogens from honey bees. HoloBee contains two non-overlapping sets of sequence data, HoloBee-Barcode and HoloBee-Mop, each of which has distinct intended uses. HoloBee-Barcode is a non-redundant database of taxonomically informative barcoding loci for all viruses, bacteria, fungi, protozoans and metazoans associated with honey bees (Apis spp.). It was created from an exhaustive master sequence archive of all valid holobiont sequences. 
Redundancy was removed from this master archive using a clustering algorithm that grouped sequences with ≥ 99% identity and retained the longest sequence from each cluster as the representative accession for that sequence type (“centroid”). These centroid sequences were concatenated into a fasta formatted file to create the HoloBee-Barcode database. Associated taxonomy for each centroid, including Superkingdom through Species and Strain/Isolate, was individually reviewed and corrected when necessary by a curator. Cross reference tables (separated according to 5 major taxonomic groups) provide a user-friendly outline of information for each centroid accession within HoloBee-Barcode including taxonomy, gene/product name, sequence length, the unaltered NCBI definition line, the number and identity of redundant sequences clustered within each centroid, and any additional information provided by the curator. HoloBee-Barcode centroid counts are: Viruses = 86; Bacteria = 496; Fungi = 41; Protozoa = 4; Metazoa = 60. HoloBee-Barcode is intended to improve and standardize quantitative and qualitative metagenomic descriptions of holobiont communities associated with honey bees by providing a curated set of barcode sequences. The goal of genetic barcoding is to associate a nucleotide sequence sample to a taxonomically valid species. Genomic regions targeted for such barcoding purposes varied by taxonomic group. The small subunit (SSU) ribosomal RNA, or 16S rRNA, is the most commonly used barcode for bacteria and is used in HB-Barcode. These 16S rRNA sequences will support the analysis of data generated with the widely used approach of amplicon-based 16S rRNA deep sequencing to study microbiota communities. Although barcode markers for fungi are less definitive than bacteria, HB-Barcode defaults to the ribosomal RNA internal transcribed spacer region (ITS), which typically includes ITS-1, 5.8S, and ITS-2. 
For some clades that cannot be resolved by this region, other barcode markers were selected. The majority of barcodes for metazoan taxa are the mitochondrial locus cytochrome c oxidase subunit I (COI). Complete mitochondrial DNA (mtDNA) sequences for Apis cerana (Asian honey bee) and Galleria mellonella (Greater wax moth) are included as barcodes for these species. We note that A. cerana mtDNA is included because it is considered a potentially invasive honey bee species and monitoring for its occurrence is practiced regionally, including in Australia, New Zealand and the USA. Protozoan barcodes include cytochrome b oxidase (Cytb), SSU, or ITS, while entire genomes are used for viral barcoding. HoloBee-Mop is a database composed mostly of chromosomal, mitochondrial and plasmid genome assemblies in order to aggregate as much honey bee holobiont genomic sequence information as possible. For a few organisms without genome assembly data, transcriptome data are included (e.g. Aethina tumida, small hive beetle). Unlike HoloBee-Barcode, redundancy removal was not performed on the HoloBee-Mop database and thus this resource provides an archive of nucleotide sequence assemblies from honey bee holobionts. However, since full viral genomes are used in HoloBee-Barcode, only redundant viral sequences occur in HoloBee-Mop. All accessions within each of these assemblies were concatenated into a single fasta formatted file to create the HoloBee-Mop database. The intended purpose of HoloBee-Mop is to improve honey bee genome and transcriptome assemblies by “mopping-up” as much viral, bacterial, fungal, protozoan and non-honey bee metazoan sequence data as possible. Therefore, sequence data remaining after processing reads through both HoloBee-Barcode and HoloBee-Mop that do not map to the honey bee genome may contain unique data from taxonomic variants or novel species. 
Details for each sequence assembly within HoloBee-Mop are tabulated in cross reference tables according to each major taxonomic group. HoloBee-Mop assembly counts are: Viruses = 2; Bacteria = 55; Fungi = 5; Protozoa = 1; Metazoa = 6.

    Follow the HoloBee database on Twitter at: https://twitter.com/HoloBee_db

    For questions about the HoloBee database, contact:
    HoloBee database team: holobee.db@gmail.com
    Jay Evans: Jay.Evans@ars.usda.gov
    Anna Childers: Anna.Childers@ars.usda.gov

    Resources in this dataset:

    Resource Title: HoloBee_v2016.1 sequence database.
    File Name: HB_v2016.1.zip
    Resource Description: This compressed file contains two fasta sequence files:
    HB_Bar_v2016.1.fasta (HoloBee-Barcode database)
    HB_Mop_v2016.1.fasta (HoloBee-Mop database)
    md5 values:
    HB_v2016.1.zip: 6e372e443744282128eb51488176503f
    HB_Bar_v2016.1.fasta: 109e1f686a690c70ef78fc4b5066a01f
    HB_Mop_v2016.1.fasta: ced8c3f5987dce69e800c8c491471eba

    Resource Title: data dictionary for HoloBee_v2016.1.
    File Name: Data_Dictionary_HoloBee_v2016.1.xlsx

    Resource Title: HoloBee_v2016.1 cross reference tables.
    File Name: HB_v2016.1_crossref.zip
    Resource Description: This compressed file contains ten spreadsheet files (.xlsx) tabulating detailed information for all centroids (HoloBee-Barcode database) and sequence assemblies (HoloBee-Mop database) used in HoloBee v2016.1:
    HB_Bar_v2016.1_bacteria_crossref_2016-05-18.xlsx
    HB_Bar_v2016.1_fungi_crossref_2016-05-20.xlsx
    HB_Bar_v2016.1_metazoa_crossref_2016-05-16.xlsx
    HB_Bar_v2016.1_protozoa_crossref_2016-05-20.xlsx
    HB_Bar_v2016.1_viruses_crossref_2016-05-17.xlsx
    HB_Mop_v2016.1_bacteria_crossref_2016-05-12.xlsx
    HB_Mop_v2016.1_fungi_crossref_2016-05-12.xlsx
    HB_Mop_v2016.1_metazoa_crossref_2016-04-15.xlsx
    HB_Mop_v2016.1_protozoa_crossref_2016-04-11.xlsx
    HB_Mop_v2016.1_viruses_crossref_2016-05-12.xlsx
    md5 value:
    HB_v2016.1_crossref.zip: a8a57d92830eb77904743afc95980465

    Resource Title: data dictionary for HoloBee_v2016.1.
    File Name: Data_Dictionary_HoloBee_v2016.1.csv
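    The md5 values listed above can be used to check that a download is intact. This is a minimal sketch using Python's standard hashlib module, assuming the files have been saved locally under the names given in the listing. The helper names and the local path in the commented usage line are hypothetical.

```python
import hashlib
import io
from typing import BinaryIO

# md5 values copied from the resource listing above.
EXPECTED_MD5 = {
    "HB_v2016.1.zip": "6e372e443744282128eb51488176503f",
    "HB_Bar_v2016.1.fasta": "109e1f686a690c70ef78fc4b5066a01f",
    "HB_Mop_v2016.1.fasta": "ced8c3f5987dce69e800c8c491471eba",
    "HB_v2016.1_crossref.zip": "a8a57d92830eb77904743afc95980465",
}


def md5_of_stream(handle: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    for chunk in iter(lambda: handle.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: str, file_name: str) -> bool:
    """Compare the md5 of a local file with the value published for file_name."""
    with open(path, "rb") as handle:
        return md5_of_stream(handle) == EXPECTED_MD5[file_name]


# In-memory demonstration that needs no downloaded file.
demo_digest = md5_of_stream(io.BytesIO(b"hello"))

# Hypothetical usage once the archive has been downloaded:
# ok = verify_file("downloads/HB_v2016.1.zip", "HB_v2016.1.zip")
```

    A mismatch normally means the download was truncated or corrupted, in which case the file should be fetched again before use.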

  20. Z

    Data from: "Centenarians have a diverse population of gut bacteriophages...

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 25, 2022
    Cite
    Johansen, J. (2022). "Centenarians have a diverse population of gut bacteriophages that may promote healthy lifespan" - Genomes and annotation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6579479
    Explore at:
    Dataset updated
    May 25, 2022
    Dataset authored and provided by
    Johansen, J.
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    File-dump associated with the manuscript:

    "Centenarians have a diverse population of gut bacteriophages that may promote healthy lifespan" (Not yet published)

    MGVs refer to the viral genome database in the publication: https://www.nature.com/articles/s41564-021-00928-6

    Following uploaded:

    File 1: VOG Markers in vOTUs/vMAGs and MGV genomes

    File 2: Viral Tree Newick file with vOTUs/vMAGs and MGV genomes

    File 3: All vOTUs/vMAGs genomes

    File 4: Master table annotation of vOTUs/vMAGs

    File 5: Centenarian bacterial isolate proviruses
