NCBI Virus is an integrative, value-added resource designed to support retrieval, display and analysis of a curated collection of virus sequences and large sequence datasets. Its goal is to increase the usability of viral sequence data archived in GenBank and other NCBI repositories. This resource includes resources previously included in HIV-1, Human Protein Interaction Database, Influenza Virus Resource, and Virus Variation.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The file "viral.genomic.gbk.tar.gz" contains the complete RefSeq viral database in GenBank format, which serves as the gold standard for the comparisons. It should be used as is with the script "genecounter.py" to count the number of genes, and it is the second (mandatory) input file for "coordinateschecker.py", which counts true positives (TP), false positives (FP) and false negatives (FN). It can also be used for other evaluation purposes.
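As a rough illustration of the TP/FP/FN counting mentioned above, the sketch below compares predicted gene coordinates against reference coordinates. It does not reproduce the logic of "coordinateschecker.py": the exact-match criterion on (start, end, strand) and the function name count_matches are assumptions made purely for illustration.

```python
def count_matches(predicted, reference):
    """Count TP, FP and FN between two collections of (start, end, strand) tuples.

    Assumption: a prediction counts as a true positive only if it matches a
    reference gene exactly. A real evaluation script may use a more lenient rule.
    """
    pred_set = set(predicted)
    ref_set = set(reference)
    tp = len(pred_set & ref_set)   # predicted and present in the reference
    fp = len(pred_set - ref_set)   # predicted but absent from the reference
    fn = len(ref_set - pred_set)   # in the reference but not predicted
    return tp, fp, fn


# Hypothetical coordinates, not taken from any real genome.
reference = [(100, 400, "+"), (500, 900, "-"), (1000, 1300, "+")]
predicted = [(100, 400, "+"), (520, 900, "-")]
tp, fp, fn = count_matches(predicted, reference)
print(tp, fp, fn)
```

Exact matching is the strictest possible rule; relaxing it (for example, matching on the stop coordinate only) would change the counts.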
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Curated database of NCBI virus genomes, formatted for BLASTn
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Diamond NCBI Genbank Viral database
Database type: Diamond database
Database format version: 3
Label: 2023-03-18_18-40-17
Sequences: 3,191,190
Sum length: 824,564,244
Assembly summary entries: 58,201
--------------------------------------------------------
SOVAP v.1.3: GitHub
Soil Virome Analysis Pipeline
Description
The study of viral communities in complex environmental samples, such as soil, can provide valuable insights into the diversity and functions of viral communities in the ecosystem. However, processing and analyzing virome data can be challenging and requires the integration of various computational tools and techniques.
To address these challenges, we have developed the SOVAP pipeline, which uses a suite of state-of-the-art tools for processing, analysis, and annotation of viromics and metagenomics data.
It uses Fastp and Centrifuge for preprocessing and contamination removal; Megahit and CD-HIT for assembling and clustering contigs; and geNomad, Diamond and Megan for identifying and annotating viral contigs. The pipeline also estimates the abundance of viral contigs, giving a more comprehensive picture of the virome within the sample. Together, these tools offer a reliable and effective means of taxonomic classification and annotation of viral contigs, helping researchers understand the composition and function of the virome in the analyzed sample.
By integrating the SOVAP pipeline with IMG/VR and geNomad, it is possible to identify a wider range of viruses, including those that were previously unknown.
The batch-mode script allows for the processing of multiple datasets using the SOVAP pipeline. This feature is particularly useful for large-scale analyses, such as those involving multiple environmental samples or large sequencing datasets.
ODC Public Domain Dedication and Licence (PDDL) v1.0http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Includes virus protein sequence files and their corresponding annotation files, as well as the ICTV virus classification table.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a custom-built BLAST database of higher plant viruses and viroids.
A challenge associated with the bioinformatics analysis of sequencing data for diagnostic purposes is the dependency on sequence databases for the taxonomic assignment of detections. Although public databases such as the GenBank database maintained at NCBI are the most up to date, their sheer size limits their portability across different computing resources. Moreover, sequencing data submitted by users to these public databases may not be accurate, and annotations provided in the GenBank record, such as the taxonomy assignment, which is crucial for accurate diagnosis, may be inaccurate and/or out of date. Additionally, the descriptors of the sequences in the public databases are not harmonized and lack taxonomic information, which makes it harder to validate sequence homology-based pathogen detections.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Title:
Unveiling Host-Parasite Relationships through Conserved MITEs in Prokaryote and Viral Genomes
Authors:
Francisco Nadal-Molero(1), Riccardo Roselli(1), Silvia Garcia-Juan(1), Alicia Campos-Lopez(1), Ana-Belen Martin-Cuadrado(1*)
SUPPLEMENTARY FILES
Supplementary File S1. Sequences of cMITEs detected in Bacteria genomes (fasta format). The hosting microbial species and inferred NCBI-taxonomy are indicated in the name of each sequence. The structure of the MITE name is: “Accession|Genome|start|end|TSD|TIRlength|MITETracker_group|Lineage”.
Supplementary File S2. Sequences of cMITEs detected in the Archaea genomes (fasta format). The hosting microbial species and inferred NCBI-taxonomy are indicated in the name of each sequence. The structure of the MITE name is: “Accession|Genome|start|end|TSD|TIRlength|MITETracker_group|Lineage”.
Supplementary File S3. Sequences of vMITEs detected in the virus sequences from the NCBI and IMG/VR v.4.1 database (fasta format). Virus, microbial host (if known) and inferred NCBI-taxonomy are stated in the name of each sequence. The structure of the MITE name is: “Accession|Genome|start|end|TSD|TIRlength|MITETracker_group|Virus|Name|Host”.
Supplementary File S4. Sequences of si-vMITEs detected in the virus sequences from the NCBI and IMG/VR v.4.1 database (fasta format). Virus, microbial host (if known) and inferred NCBI-taxonomy are stated in the name of each sequence. The structure of the MITE name is: “Accession|Genome|start|end|Ident.Method.by.DB|Host”.
Supplementary Files S5. Cytoscape networks. (A) Figure 1A, (B) Figure 1B.
Supplementary File S6. Sequences of cMITEs obtained from 5837 genomes of Neisseriales. The structure of the MITE name is: “Accession|NucleotideID|start|end|TSD|TIRlength|MITETracker_group|Genome|Lineage”.
Supplementary File S7. Sequences of si-vMITEs obtained from 5837 genomes of Neisseriales. The structure of the MITE name is: “Accession|Genome|start|end|Host”.
Supplementary File S8. Sequences of cMITEs obtained from 46051 genomes of Bacteroidota. The structure of the MITE name is: “Accession|NucleotideID|start|end|TSD|TIRlength|MITETracker_group|Genome|Lineage”.
Supplementary File S9. Sequences of si-vMITEs obtained from 46051 genomes of Bacteroidota. The structure of the MITE name is: “Accession|Genome|start|end|Host”.
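The pipe-delimited MITE names described for the supplementary files can be split into named fields. The following is a minimal sketch, assuming that no field itself contains a "|" character; the function name parse_mite_header and the example header are hypothetical and do not come from the supplementary files.

```python
# Field layout stated for Supplementary Files S1 and S2.
CMITE_FIELDS = [
    "Accession", "Genome", "start", "end",
    "TSD", "TIRlength", "MITETracker_group", "Lineage",
]


def parse_mite_header(header, fields=CMITE_FIELDS):
    """Split a '|'-delimited MITE name into a dict keyed by field name.

    Assumption: the number of '|'-separated parts equals the number of
    fields, i.e. no field contains a literal '|'.
    """
    parts = header.lstrip(">").strip().split("|")
    if len(parts) != len(fields):
        raise ValueError(
            f"expected {len(fields)} fields, found {len(parts)}: {header!r}"
        )
    record = dict(zip(fields, parts))
    # Coordinates are converted to integers; other fields stay as strings.
    for key in ("start", "end"):
        if key in record:
            record[key] = int(record[key])
    return record


# Invented header for illustration only.
example = ">ACC0001|Example genome|120|410|TA|15|G7|Bacteria;Example"
record = parse_mite_header(example)
print(record["Accession"], record["start"], record["end"])
```

The other layouts listed above (for example “Accession|Genome|start|end|Host”) can be handled by passing a different list as the fields argument.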
This dataset tracks the updates made on the dataset "NCBI Virus" as a repository for previous versions of the data and metadata.
This repository within the ACTIV TRACE initiative houses a comprehensive collection of datasets related to SARS-CoV-2. The processing of SARS-CoV-2 Sequence Read Archive (SRA) files has been optimized to identify genetic variations in viral samples. This information is then presented in the Variant Call Format (VCF). Each VCF file corresponds to the SRA parent-run's accession ID. Additionally, the data is available in the parquet format, making it easier to search and filter using the Amazon Athena Service. The SARS-CoV-2 Variant Calling Pipeline is designed to handle new data every six hours, with updates to the AWS ODP bucket occurring daily.
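For readers unfamiliar with VCF, the sketch below parses one tab-separated data line into the eight fixed columns defined by the VCF specification. It is an illustration only: the function name parse_vcf_line and the example line are assumptions, and the actual files in the repository may carry additional columns or INFO keys that are not described here.

```python
# The eight fixed columns of a VCF data line, per the VCF specification.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Parse one VCF data line (not a '#' header line) into a dict."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, values[:8]))
    record["POS"] = int(record["POS"])
    # ALT may list several alleles separated by commas.
    record["ALT"] = record["ALT"].split(",")
    info = {}
    for item in record["INFO"].split(";"):
        if not item or item == ".":
            continue
        key, _, value = item.partition("=")
        # Flag-style INFO entries have no '=' and are stored as True.
        info[key] = value if value else True
    record["INFO"] = info
    return record


# Hypothetical line for illustration; the DP value is invented.
example = "NC_045512.2\t23403\t.\tA\tG\t.\tPASS\tDP=100"
record = parse_vcf_line(example)
print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
```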
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We built a comprehensive dataset of arboviruses and arthropod-specific viruses (ASV) by curating data available worldwide from the Arbovirus Catalog, Section VIII-F of the Biosafety in Microbiological and Biomedical Laboratories (BMBL) 6th edition, the Virus Metadata Resource of the International Committee on Taxonomy of Viruses (ICTV), and GenBank. The dataset includes complete information on viral taxonomy, biological characteristics, vectors and vertebrate hosts, distribution, recommended biosafety levels, genome segments, and nucleotide/amino acid sequences. It is intended to support research on arboviruses and arthropod-specific viruses in viral vector/host prediction, disease outbreak risk warning, arbovirus/arthropod-specific interactions, phylogenetic and evolutionary relationships, and biosafety risk assessment.
This global dataset of viral sequence, diversity, distribution, and biosafety recommendations for arboviruses and ASV contains a viral information file (.xlsx), a nucleic acid sequences file (.fna) and an amino acid sequences file (.faa), available from figshare [26]. The columns of the viral meta information file (.xlsx) are described below (“NAV” in a field indicates that the value is not available):
Taxonomy Information
1. Virus_Group: (customized field) viruses in the database are divided into two groups, arbovirus and ASV. The former has both vertebrate and arthropod hosts; the latter has only arthropod hosts.
2. Name: (source from GenBank) the virus name; each name represents a distinct virus.
3. Acronym: (source from BMBL) acronym of the virus name.
4. NCBI_Taxonomy_ID: (source from GenBank) taxonomy identifier of the virus from the NCBI Taxonomy Database.
5. Isolate: (source from GenBank) isolate of the virus from NCBI GenBank.
6. Unified_Isolate_Number: (customized field) renumbering of the field Isolate. Each isolate of the same virus is numbered.
7. Species: (source from ICTV) species that the virus belongs to. The species name usually differs from the virus name.
8. Genus: (source from ICTV) genus that the virus belongs to.
9. Family: (source from ICTV) family that the virus belongs to.
Genome Information
10. Segmented: (customized field) whether the genome of the virus is unsegmented (recorded as “no”) or segmented (recorded as “yes”). Viruses with an unknown number of segments are recorded as “NAV”.
11. Number_of_Segments: (source from GenBank) the theoretical number of segments of the virus.
12. Molecule_Type: (source from GenBank) molecule type of the virus genome: ssRNA(+), ssRNA(-), ssRNA(+/-), dsRNA, RNA, ssDNA(+/-), dsDNA, etc.
Sequence Information
13. Accession: (source from GenBank) NCBI GenBank Accession of the nucleotide sequence.
14. Locus: (source from GenBank) the locus name of the nucleotide sequence.
15. SRA_Accession: (source from GenBank) NCBI SRA Accession of the nucleotide sequence.
16. Submitters: (source from GenBank) submitters of the nucleotide sequence.
17. Sequence_Type: (source from GenBank) whether the nucleotide sequence is a reference sequence (recorded as “RefSeq”) or a non-reference sequence (recorded as “GenBank”).
18. BioSample: (source from GenBank) NCBI BioSample Accession of the nucleotide sequence.
19. GenBank_Title: (source from GenBank) the “DEFINITION” field of the sequence's NCBI GenBank record.
20. Genotype: (source from GenBank) genotype of the nucleotide sequence.
21. Segment: (source from GenBank) segment identifier of the nucleotide sequence.
22. Unified_Segment_Number: (customized field) renumbering of the field Segment. Each segment is assigned a new number starting from 1; the single segment of an unsegmented virus is assigned 1.
Host Information
23. Host_Species: (customized field) the species of the dead-end host of the virus.
24. Host_Genus: (customized field) the genus of the dead-end host of the virus.
25. Host_Family: (customized field) the family of the dead-end host of the virus.
26. Host: (source from GenBank) the field from the NCBI GenBank database that represents dead-end host or vectors.
Biosafety Information
27. Recommended_BSL: (customized field) recommended biosafety level of laboratory to research the virus (recorded as “2”, “3”, “4”, “NAV”).
28. BMBL_Recommended_BSL: (source from BMBL) BMBL recommended biosafety level of laboratory to research the virus (recorded as “2”, “2 with 3 practices”, “2b”, “3”, “3a”, “3b”, “4”, “NAV”).
29. Basis_of_Rating: (source from BMBL) risk assessment of the virus (recorded as “A1”, “A2”, “A3”, “A4”, “A7”, “IE”, “S”, “NAV”).
30. Antigenic_Group: (source from BMBL) the antigenic group of the virus.
31. Isolated: (customized field) whether the virus has been isolated (“Yes” or “No”).
Source Information
32. Latitude_and_Longitude: (source from GenBank) latitude and longitude of the virus isolation source.
33. State_or_Province: (customized field) state or provincial administrative unit of the virus source.
34. Geo_Location: (source from GenBank) geographical position of the virus source.
35. Country_or_Region: (customized field) the country or region of the virus source.
36. Isolation_Source: (source from GenBank) the organism from which the virus was collected.
37. Collection_Date: (source from GenBank) the date on which the virus was collected.
38. Submit_Date: (source from GenBank) the date on which the virus was submitted.
39. Release_Date: (source from GenBank) the date on which the virus was released or last modified.
References
40. Publications: (customized field) the number of publications and literature covering the specific virus research.
41. Accession_URL: (customized field) the DOI leading directly to the GenBank source.
The nucleotide sequences file and amino acid sequences file are standard FASTA files. Each sequence record consists of two lines: a header and the content. The header contains two types of information, locus and accession, separated by '|'. The content is the nucleic acid or amino acid sequence itself. The fields in the header are defined as follows:
1. Locus: NCBI GenBank LOCUS ID of the nucleotide sequence.
2. Accession: NCBI GenBank Accession of the nucleotide sequence.
Protein_ID: a protein sequence identification number (amino acid sequences file only).
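A two-line FASTA layout with '|'-separated header fields can be read with a few lines of code. The sketch below is a minimal example, assuming each record is exactly one header line followed by one sequence line, as described above; the function name read_two_line_fasta and the example records are hypothetical.

```python
import io


def read_two_line_fasta(handle):
    """Yield (header_fields, sequence) pairs from a two-line-per-record FASTA.

    header_fields is the header split on '|', e.g. [locus, accession].
    Assumption: every sequence occupies a single line after its header.
    """
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
        elif header is not None:
            yield header.split("|"), line
            header = None


# Invented records for illustration; these are not real accessions.
example = io.StringIO(">LOC0001|AB000001\nATGCATGC\n>LOC0002|AB000002\nGGCCTTAA\n")
for fields, sequence in read_two_line_fasta(example):
    print(fields, len(sequence))
```

Returning the header as a list rather than a fixed pair leaves room for the additional Protein_ID field in the amino acid file.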
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A data hub for searching, retrieving, and analyzing SARS-CoV-2 GenBank data.
Database that organizes information on genomes including sequences, maps, chromosomes, assemblies, and annotations in six major organism groups: Archaea, Bacteria, Eukaryotes, Viruses, Viroids, and Plasmids. Genomes of over 1,200 organisms can be found in this database, representing both completely sequenced organisms and those for which sequencing is in progress. Users can browse by organism, and view genome maps and protein clusters. Links to other prokaryotic and archaeal genome projects, as well as BLAST tools and access to the rest of the NCBI online resources are available.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
List of NCBI accession numbers for viral and host sequences used in this study.
Database for a curated classification and nomenclature that contains the names of all organisms that are represented in the public sequence databases with at least one nucleotide or protein sequence. Data provided encompasses archaea, bacteria, eukaryota, viroids and viruses. The NCBI taxonomy database is not a primary source for taxonomic or phylogenetic information. Furthermore, the database does not follow a single taxonomic treatise but rather attempts to incorporate phylogenetic and taxonomic knowledge from a variety of sources, including the published literature, web-based databases, and the advice of sequence submitters and outside taxonomy experts. Consequently, the NCBI taxonomy database is not a phylogenetic or taxonomic authority and should not be cited as such.
Database of data obtained from the NIAID Influenza Genome Sequencing Project as well as from GenBank, combined with tools for flu sequence analysis and annotation. In addition, it provides links to other resources that contain flu sequences, publications and general information about flu viruses. Users can search the Flu database, build queries, retrieve sequences, and apply analysis tools. This includes selecting influenza sequences by virus, subtype, host, and other criteria, finding complete genome sets, aligning sequence and others in the database (up to 1000 sequences), viewing clustering and phylogenetic trees, BLAST searching a flu sequence against the database, and more.
CC0 1.0 https://spdx.org/licenses/CC0-1.0.html
The dynamics of coronavirus disease 2019 (COVID-19) have been extensively researched in many settings around the world, but little is known about these patterns in Africa. We obtained 7540 complete genomes from 51 African nations from the National Center for Biotechnology Information (NCBI) and Global Initiative on Sharing Influenza Data (GISAID) databases and analysed them to examine the genetic diversity and spread dynamics of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) lineages circulating in Africa. Using a variety of clade and lineage nomenclature schemes, we examined their diversity and used maximum parsimony inference methods to reconstruct their evolutionary divergence and history. According to this study, only 465 of the 2610 Pango lineages found worldwide circulated in Africa during the first three years of the COVID-19 pandemic, with five different lineages dominating at various points during the outbreak. We identified South Africa, Kenya, and Nigeria as key sources of viral transmission between Sub-Saharan African nations. These findings provide insight into the viral strains that are circulating in Africa and their evolutionary patterns.

Methods

Dataset mining and workflow
SARS-CoV-2 genome sequences collected from Africa were obtained from the NCBI and GISAID databases on February 26, 2023. 24415 African sequences were retrieved from both databases to examine the number of lineages circulating within Africa. The two databases together held only 8044 complete genome sequences from Africa; these sequences, excluding those flagged as low coverage by NextClade, were retrieved to determine spread dynamics. 5908 sequences from 23 African countries were available in NCBI and 2137 sequences from 41 African countries in GISAID. The sequences were aligned using the online version of the MAFFT multiple sequence alignment tool, with Wuhan-Hu-1 (MN908947.3) as the reference sequence, and sequences with more than 5.0% ambiguous letters were removed. Duplicates were removed using goalign dedup, leaving only high-quality complete African sequences (n=7540).

Phylogenetic reconstruction
Phylogeny reconstruction on the dataset was performed numerous times using IQ-TREE multicore version v1.6.12 and NextClade.

Lineage classification
PANGOLin, a web application, was used to classify sequences into their lineages. Because this naming system integrates genetic and geographic data on SARS-CoV-2 dynamics, it was used to determine which SARS-CoV-2 lineages circulating in Africa are most important from an epidemiological perspective, as well as the lineage dynamics within and across the African continent.

Phylogeographic reconstruction
Variants of concern (VOC), variants of interest (VOI) and variants under monitoring (VUM) were designated based on the WHO framework as of 20 January 2022. We included one lineage, A.23.1, and labelled it as a VOI for the purposes of this analysis. This lineage was included because it demonstrated the continued evolution of African lineages into potentially more transmissible variants. VOI, VOC, and VUM that emerged on the African continent were marked. These were A.23.1 (VOI), B.1.351 and B.1.1.529 (VOC), and B.1.640 and B.1.525 (VUM). Genome sequences of these five lineages were extracted from the NCBI database for phylogeographic reconstruction. A similar approach to that described above (including alignment using online MAFFT) was employed. Phylogeographic reconstruction for all variants circulating in Africa and for all VOI, VOC, and VUM was conducted using PASTML.
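The ambiguity filter described in the methods (removing sequences with more than 5.0% ambiguous letters) can be sketched as follows. This is not the authors' code: it assumes that any character other than A, C, G or T counts as ambiguous, that alignment gaps ("-") are ignored, and the function names are hypothetical.

```python
def ambiguous_fraction(seq):
    """Return the fraction of non-ACGT characters in a nucleotide sequence.

    Assumptions: gaps ('-') are ignored, and an empty sequence is treated
    as fully ambiguous so that it is always filtered out.
    """
    seq = seq.upper().replace("-", "")
    if not seq:
        return 1.0
    unambiguous = sum(seq.count(base) for base in "ACGT")
    return (len(seq) - unambiguous) / len(seq)


def keep_sequence(seq, max_ambiguous=0.05):
    """Keep a sequence unless more than max_ambiguous of it is ambiguous."""
    return ambiguous_fraction(seq) <= max_ambiguous


# Toy sequences: 4% and 8% ambiguous letters respectively.
clean = "ACGT" * 24 + "NNNN"
noisy = "ACGT" * 23 + "N" * 8
print(keep_sequence(clean), keep_sequence(noisy))
```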
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Viral reference data for PathoLive including GI numbers and taxonomic information per sequence. Data taken from the viral part of the NCBI RefSeq downloaded on 2016-07-06.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
bowtie2 index and dudes database for the set of Fungal and Viral complete genomes from NCBI RefSeq, dating from 2017-09. The dudes database was made based on accession version numbers (DUDesDB.py option -m "av").