3 datasets found

Virus+ Sequence Masked Human Reference Genome (hg19)
data.niaid.nih.gov
zenodo.org
Updated Feb 9, 2021
Cite
Handley, Scott A. (2021). Virus+ Sequence Masked Human Reference Genome (hg19) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4116106
Dataset updated
Feb 9, 2021
Dataset authored and provided by
Handley, Scott A.
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A version of the human genome (hg19), originally masked for ribosomal, plant, animal, fungal and low-entropy sequences by Brian Bushnell (Bushnell Masked Human Genome), that has additionally been masked for all possible viral sequences.

The following commands were used to generate the additional virus sequence masked reference database:

1) Download all RefSeq and Neighbor nucleotide records:

https://www.ncbi.nlm.nih.gov/nuccore/?term=Viruses[Organism]%20NOT%20cellular%20organisms[ORGN]%20NOT%20wgs[PROP]%20NOT%20gbdiv%20syn[prop]%20AND%20(srcdb_refseq[PROP]%20OR%20nuccore%20genome%20samespecies[Filter])

2) Shred the downloaded viral genomes using shred.sh from the bbtools package

shred.sh in=refseq_virus_reformated.fasta out=virus_shred.fasta.gz length=85 minlength=75 overlap=30

3) Map the shredded virus sequences to the hg19-masked human genome using bbmap.sh from the bbtools package

bbmap.sh ref=hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz in=virus_shred.fasta.gz outm=map_human_all_viruses.sam minid=0.90

4) Mask the regions of the hg19-masked human genome to which virus sequences mapped, using bbmask.sh from the bbtools package

bbmask.sh in=hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz out=human_virus_masked.fasta.gz sam=map_human_all_viruses.sam

5) Remove all N's to further reduce file size using seqkit

seqkit -is replace -p "n" -r "" human_virus_masked.fasta.gz > human_virus_masked.fasta_Ns_removed.gz
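
To make the shredding parameters in step 2 concrete, here is a minimal Python sketch of cutting a sequence into overlapping fragments. It only illustrates what length=85, minlength=75 and overlap=30 mean; the function name shred and its handling of the final short fragment are assumptions for illustration and may differ from what shred.sh in bbtools actually does.

```python
def shred(seq, length=85, minlength=75, overlap=30):
    """Cut seq into windows of `length` bases, each overlapping the previous by `overlap`.

    Hypothetical helper for illustration only; not the bbtools implementation.
    Returns a list of (start, fragment) tuples with 0-based start positions.
    """
    if overlap >= length:
        raise ValueError("overlap must be smaller than length")
    step = length - overlap  # 85 - 30 = 55 bases between window starts
    pieces = []
    for start in range(0, len(seq), step):
        piece = seq[start:start + length]
        # Assumption: fragments shorter than minlength are discarded.
        if len(piece) >= minlength:
            pieces.append((start, piece))
        # Stop once a window has reached the end of the sequence.
        if start + length >= len(seq):
            break
    return pieces


if __name__ == "__main__":
    for start, fragment in shred("ACGT" * 50):
        print(start, len(fragment))
```

With these defaults, consecutive fragments start 55 bases apart, so every position away from the sequence ends is covered by at least one 85-base fragment before mapping.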

Additional References:

http://seqanswers.com/forums/showthread.php?t=42552 for additional information on the original masking of hg19

bbtools

seqkit

NCBI Virus Genome RefSeq
Virus+ Sequence Masked Mouse Reference Genome (GRCm38)
zenodo.org
explore.openaire.eu
application/gzip
Updated Feb 9, 2021
Cite
Scott A Handley (2021). Virus+ Sequence Masked Mouse Reference Genome (GRCm38) [Dataset]. http://doi.org/10.5281/zenodo.4116249
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.4116249
Dataset updated
Feb 9, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Scott A Handley
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A version of the mouse genome (GRCm38) masked for all possible viral sequences.

See Virus+ Masked Human Genome for a masked human reference database.

The following commands were used to generate the additional virus sequence masked reference database:

1) Download all RefSeq and Neighbor nucleotide records:

https://www.ncbi.nlm.nih.gov/nuccore/?term=Viruses[Organism]%20NOT%20cellular%20organisms[ORGN]%20NOT%20wgs[PROP]%20NOT%20gbdiv%20syn[prop]%20AND%20(srcdb_refseq[PROP]%20OR%20nuccore%20genome%20samespecies[Filter])

2) Shred the downloaded viral genomes using shred.sh from the bbtools package

shred.sh in=refseq_virus_reformated.fasta out=virus_shred.fasta.gz length=85 minlength=75 overlap=30

3) Map the shredded virus sequences to the GRCm38 genome using bbmap.sh from the bbtools package

bbmap.sh ref=GRCm38.fa.gz in=virus_shred.fasta.gz outm=map_mouse_all_viruses.sam minid=0.90

4) Mask the regions of the GRCm38 genome to which virus sequences mapped, using bbmask.sh from the bbtools package

bbmask.sh in=GRCm38.fa.gz out=GRCm38_virus_masked.fasta.gz sam=map_mouse_all_viruses.sam

5) Remove all N's to further reduce file size using seqkit
seqkit -is replace -p "n" -r "" GRCm38_virus_masked.fasta.gz > mouse_virus_masked.fasta_Ns_removed.gz
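
Steps 4 and 5 can be pictured with a short Python sketch: mapped intervals are overwritten with N, and the N's are then deleted to shrink the file. The function names and the 0-based, end-exclusive interval convention are assumptions made for this illustration; bbmask.sh and seqkit work on SAM and FASTA files rather than on Python strings.

```python
def mask_intervals(seq, intervals, mask_char="N"):
    """Overwrite each (start, end) interval of seq with mask_char.

    Hypothetical helper: intervals are assumed 0-based and end-exclusive,
    and are clipped to the sequence bounds.
    """
    chars = list(seq)
    for start, end in intervals:
        start = max(start, 0)
        end = min(end, len(chars))
        for i in range(start, end):
            chars[i] = mask_char
    return "".join(chars)


def strip_n(seq):
    """Delete every 'N' or 'n' from seq (case-insensitive removal)."""
    return seq.replace("N", "").replace("n", "")


if __name__ == "__main__":
    masked = mask_intervals("ACGTACGTAC", [(2, 5)])
    print(masked)           # masked copy of the toy sequence
    print(strip_n(masked))  # same sequence with the masked bases deleted
```

One consequence worth keeping in mind: once the N's are deleted, positions in the stripped sequence no longer correspond to coordinates in the original reference, so the result is suited to read filtering rather than coordinate-based analysis.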

Additional References:

bbtools

seqkit

NCBI Virus Genome RefSeq
MARMICRODB database for taxonomic classification of (marine) metagenomes
zenodo.org
explore.openaire.eu
Updated Mar 20, 2020
Cite
Shane L Hogle (2020). MARMICRODB database for taxonomic classification of (marine) metagenomes [Dataset]. http://doi.org/10.5281/zenodo.3520509
Available download formats: bin, application/gzip, tsv, html, bz2
Unique identifier
https://doi.org/10.5281/zenodo.3520509
Dataset updated
Mar 20, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Shane L Hogle
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Introduction:
This sequence database (MARMICRODB) was introduced in the publication JW Becker, SL Hogle, K Rosendo, and SW Chisholm. 2019. Co-culture and biogeography of Prochlorococcus and SAR11. ISME J. doi:10.1038/s41396-019-0365-4. Please see the original publication and its associated supplementary material for the original description of this resource.

Motivation:
We needed a reference database to annotate shotgun metagenomes from the Tara Oceans project [1], the GEOTRACES cruises GA02, GA03, GA10, and GP13, and the HOT and BATS time series [2]. Our interests are primarily in quantifying and annotating the free-living, oligotrophic bacterial groups Prochlorococcus, Pelagibacterales/SAR11, SAR116, and SAR86 from these samples using the protein classifier tool Kaiju [3]. Kaiju’s sensitivity and classification accuracy depend on the composition of the reference database, and the highest sensitivity is achieved when the reference database contains a comprehensive representation of expected taxa from an environment/sample of interest. However, the speed of the algorithm decreases as database size increases. Therefore, we aimed to create a reference database that maximized the representation of sequences from marine bacteria, archaea, and microbial eukaryotes, while minimizing (but not excluding) the sequences from clinical, industrial, and terrestrial host-associated samples.

Results/Description:
MARMICRODB consists of 56 million non-redundant protein sequences from 18769 bacterial/archaeal/eukaryote genome and transcriptome bins and 7492 viral genomes, optimized for use with the protein homology classifier Kaiju [3]. To ensure maximum representation of marine bacteria, archaea, and microbial eukaryotes, we included translated genes/transcripts from 5397 representative “specI” species clusters from the proGenomes database [4]; 113 transcriptomes from the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) [5]; 10509 metagenome assembled genomes from the Tara Oceans expedition [6,7], the Red Sea [8], the Baltic Sea [9], and other aquatic and terrestrial sources [10]; 994 isolate genomes from the Genomic Encyclopedia of Bacteria and Archaea [11]; 7492 viral genomes from NCBI RefSeq [12]; 786 bacterial and archaeal genomes from MarRef [13]; and 677 marine single cell genomes [14]. In order to annotate metagenomic reads at the clade/ecotype level (subspecies) for the focal taxa Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116, we generated custom MARMICRODB taxonomies based on curated genome phylogenies for each group. The curated phylogenies, Kaiju formatted Burrows-Wheeler index, translated genes, the custom taxonomy hierarchy, an interactive kronaplot of the taxonomic composition, and scripts and instructions for how to use or rebuild the resource are available from 10.5281/zenodo.3520509.

Methods:
The curation and quality control of MARMICRODB single cell, metagenome assembled, and isolate genomes was performed as described in [15]. Briefly, we downloaded all MARMICRODB genomes as raw nucleotide assemblies from NCBI. We determined an initial genome taxonomy for these assemblies using CheckM with the default lineage workflow [16]. All genome bins met the completion/contamination thresholds outlined in prior studies [7,17]. For single cell and metagenome assembled genomes, especially those from Tara Oceans Mediterranean Sea samples [18], we used the GTDB-Tk classification workflow [19] to verify the taxonomic fidelity of each genome bin. We then selected genomes with a CheckM taxonomic assignment of Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116 for further analysis and confirmed the taxonomic assignment using BLAST matches to known Prochlorococcus/Synechococcus ITS sequences and by matching 16S sequences to the SILVA database [20]. To refine our estimates of completeness/contamination of Prochlorococcus genome bins, we created a custom set of 730 single copy protein families (available from 10.5281/zenodo.3719132) from closed, isolate Prochlorococcus genomes [21] for quality assessments with CheckM. For Synechococcus we used the CheckM taxonomic-specific workflow with the genus Synechococcus. After the custom CheckM quality control, we excluded from downstream analysis any genome bins that had an estimated quality < 30, defined as %completeness - 5 x %contamination, resulting in 18769 genome/transcriptome bins. We predicted genes in the resulting genome bins using prodigal [22] and excluded protein sequences with lengths less than 20 or greater than 20000 amino acids, removed non-standard amino acid residues, and condensed redundant protein sequences to a single representative sequence to which we assigned a lowest common ancestor (LCA) taxonomy identifier from the NCBI taxonomy database [23]. The resulting protein sequences were compiled and used to build a Kaiju [3] search database.
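
The two numeric filters in this paragraph can be written down as a small Python sketch. The function names are hypothetical and are not taken from the MARMICRODB scripts; only the formula (%completeness minus five times %contamination), the cutoff of 30, and the 20 to 20000 amino acid length bounds come from the text above.

```python
def bin_quality(completeness, contamination):
    """Estimated quality as defined in the text: %completeness - 5 x %contamination."""
    return completeness - 5 * contamination


def keep_bin(completeness, contamination, min_quality=30):
    """Bins with estimated quality < 30 are excluded, so keep quality >= 30."""
    return bin_quality(completeness, contamination) >= min_quality


def keep_protein(protein, min_len=20, max_len=20000):
    """Proteins shorter than 20 or longer than 20000 residues are excluded."""
    return min_len <= len(protein) <= max_len


if __name__ == "__main__":
    # Illustrative values only, not real CheckM estimates.
    examples = [("bin_a", 90, 5), ("bin_b", 50, 5), ("bin_c", 55, 5)]
    for name, comp, cont in examples:
        print(name, bin_quality(comp, cont), keep_bin(comp, cont))
```

The factor of five means contamination is penalized much more heavily than incompleteness: a bin that is 50% complete with 5% contamination scores 25 and is dropped, while a 55% complete bin with the same contamination scores exactly 30 and is retained.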

The above filtering criteria resulted in 605 Prochlorococcus, 96 Synechococcus, 186 SAR11/Pelagibacterales, 60 SAR86, and 59 SAR116 high-quality genome bins. We constructed a high quality fixed reference phylogenetic tree for each taxonomic group based on genomes manually selected for completeness and phylogenetic diversity. For example, the Prochlorococcus and Synechococcus genomes for the fixed reference phylogeny are estimated to be > 90% complete, and SAR11 genomes are estimated to be > 70% complete. We created multiple sequence alignments of phylogenetically conserved genes from these genomes using the GTDB-Tk pipeline [19] with default settings. The pipeline identifies conserved proteins (120 bacterial proteins) and generates concatenated multi-protein alignments [17] from the genome assemblies using hmmalign from the hmmer software suite. We further filtered the resulting alignment columns using the bacterial and archaeal alignment masks from [17] (http://gtdb.ecogenomic.org/downloads). We removed columns represented by fewer than 50% of all taxa and/or columns with no single amino acid residue occurring at a frequency greater than 25%. We trimmed the alignments using trimal [24] with the automated -gappyout option to trim columns based on their gap distribution. We inferred reference phylogenies using multithreaded RAxML [25] with the GAMMA model of rate heterogeneity, empirically determined base frequencies, and the LG substitution model [26] (PROTGAMMALGF). Branch support is based on 250 resampled bootstrap trees. This tree was then pruned to only allow a maximum average distance to the closest leaf (ADCL) of 0.003 to reduce the phylogenetic redundancy in the tree [27]. We then “placed” genomes that either did not pass the completeness threshold or were considered phylogenetically redundant by ADCL within the fixed reference phylogeny for each group using pplacer [28], representing each placed genome as a pendant edge in the final tree. We then examined the resulting tree and manually selected clade/ecotype cutoffs to be as consistent as possible with clade definitions previously outlined for these groups [29–32]. We then gave clades from each taxonomic group custom taxonomic identifiers and added these identifiers to the MARMICRODB Kaiju taxonomic hierarchy.
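
The alignment column filter described above (drop columns represented by fewer than 50% of taxa, or with no residue above 25% frequency) can be sketched in Python as follows. This is an illustration under stated assumptions, not the GTDB-Tk code: the function names are hypothetical, gaps are assumed to be written as "-" or ".", and residue frequency is assumed to be measured against all sequences in the alignment rather than only the non-gap ones.

```python
from collections import Counter


def keep_column(column, min_coverage=0.5, min_top_freq=0.25, gaps="-."):
    """Return True if an alignment column passes both filters.

    Assumption: frequencies are computed relative to the total number of
    sequences (rows), including those with a gap in this column.
    """
    residues = [c for c in column if c not in gaps]
    if not residues:
        return False
    # Filter 1: column must be represented by at least 50% of taxa.
    if len(residues) < min_coverage * len(column):
        return False
    # Filter 2: the most common residue must exceed 25% frequency.
    top_count = Counter(residues).most_common(1)[0][1]
    return top_count > min_top_freq * len(column)


def filter_alignment(seqs):
    """Drop failing columns from a list of equal-length aligned sequences."""
    if not seqs:
        return []
    columns = list(zip(*seqs))
    kept = [i for i, col in enumerate(columns) if keep_column(col)]
    return ["".join(s[i] for i in kept) for s in seqs]


if __name__ == "__main__":
    toy = ["MKA-", "MRC-", "MKD-", "M-EL"]
    print(filter_alignment(toy))
```

In the toy alignment, the third column is dropped because every residue appears only once (exactly 25%, not more), and the fourth is dropped because only one of four sequences has a residue there.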

Software/databases used:
CheckM v1.0.11 [16]
HMMER v3.1b2 (http://hmmer.org/)
prodigal v2.6.3 [22]
trimAl v1.4.rev22 [24]
AliView v1.18.1 [33] [34]
Phyx v0.1 [35]
RAxML v8.2.12 [36]
Pplacer v1.1alpha [28]
GTDB-Tk v0.1.3 [19]
Kaiju v1.6.0 [34]
GTDB RS83 (https://data.ace.uq.edu.au/public/gtdb/data/releases/release83/83.0/)
NCBI Taxonomy (accessed 2018-07-02) [23]
TIGRFAM v14.0 [37]
PFAM v31.0 [38]

Discussion/Caveats:
MARMICRODB is optimized for metagenomic samples from the marine environment, in particular planktonic microbes from the pelagic euphotic zone. We expect this database may also be useful for classifying other types of marine metagenomic samples (for example mesopelagic, bathypelagic, or even benthic or marine host-associated), but it has not been tested as such. The original purpose of this database was to quantify clades/ecotypes of Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116 in metagenomes from the Tara Oceans Expedition and the GEOTRACES project. We carefully annotated and quality controlled genomes from these five groups, but the processing of the other marine taxa was largely automated and unsupervised. Taxonomy for other groups was copied over from the Genome Taxonomy Database (GTDB) [19,39] and NCBI Taxonomy [23], so any inconsistencies in those databases will be propagated to MARMICRODB. For most use cases MARMICRODB can probably be used unmodified, but if the user’s goal is to focus on a particular organism/clade that we did not curate in the database, then the user may wish to spend some time curating those genomes (i.e. checking for contamination, dereplicating, building a genome phylogeny for custom taxonomy node assignment). Currently the custom taxonomy is hardcoded in the MARMICRODB.fmi index, but if users wish to modify MARMICRODB by adding or removing genomes, or reconfiguring taxonomic ranks, the names.dmp and nodes.dmp files can easily be modified, as can the fasta file of protein sequences. However, the Kaiju index will need to be rebuilt, and the user will require a high