NCBI Datasets is a one-stop shop for finding, browsing, and downloading genomic data. Find and download taxonomy, genome, gene, transcript, and protein data, along with instructions for installing the NCBI Datasets command-line tools.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The mash reference that can be downloaded from the mash documentation is for RefSeq version 70.
I do not inherently have a problem with RefSeq version 70, but RefSeq is well past version 200 now.
RefSeq updates four times a year, and I needed an easy way to create and distribute a mash sketch file of the representative bacterial/prokaryotic genomes.

This is intended to be a place to hold the mash sketches from https://github.com/erinyoung/update_mash_dist. The mash sketch file from erinyoung/update_mash_dist requires git lfs to be installed when cloning the repository, which is cumbersome for some users.

The update frequency is intended to mirror that of RefSeq (i.e. four times a year), but is likely to be less frequent than that. Don't hesitate to submit an issue if this needs to get updated.

I do have some prior Zenodo repositories (https://zenodo.org/records/10519852 , https://zenodo.org/records/7887021 , and https://zenodo.org/records/7348463 ) which hold the same mash sketch reference, but the RefSeq version is in the title. I'd rather have one repository that gets updated than create a new repository each time.

This is how the mash reference file was created:
wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/datasets
wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/dataformat
chmod +x datasets dataformat

wget https://github.com/marbl/Mash/releases/download/v2.3/mash-Linux64-v2.3.tar
tar -xvf mash-Linux64-v2.3.tar
datasets summary genome taxon bacteria --reference --as-json-lines |
dataformat tsv genome --fields accession,organism-name --elide-header |
sed 's/\[//g' |
sed 's/]//g' |
sed 's/["'\'']//g' |
sed 's/endosymbiont of /endosymbiont_of_/g' > ids.txt
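As a rough illustration of what the sed filters do to organism names, the sketch below runs the same substitutions (with the opening bracket escaped as \[) on a made-up line. The function name clean_names and the accession GCF_000000000.1 are hypothetical and exist only for this example.

```shell
#!/usr/bin/env bash
# clean_names: hypothetical helper that applies the same sed substitutions
# as the pipeline to lines read from standard input.
clean_names() {
  sed 's/\[//g' |
    sed 's/]//g' |
    sed 's/["'\'']//g' |
    sed 's/endosymbiont of /endosymbiont_of_/g'
}

# Made-up input: placeholder accession, a tab, then an organism name
# with square brackets around the genus.
printf 'GCF_000000000.1\t[Clostridium] innocuum\n' | clean_names
```

Brackets and quotes are stripped so they cannot end up in file names later, and "endosymbiont of " is joined with underscores so it stays a single whitespace-separated field.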
# Note: ${version} is not defined in this snippet; set it before running
# (presumably to the RefSeq release number).
while read line
do
  # Field 1 is the accession; fields 2 and 3 are taken as genus and species.
  id=$(echo $line | awk '{print $1}')
  ge=$(echo $line | awk '{print $2}')
  if [ ! -n "$ge" ] ; then ge="unknown" ; fi
  sp=$(echo $line | awk '{print $3}')
  if [ ! -n "$sp" ] ; then sp="unknown" ; fi

  datasets download genome accession $id
  unzip ncbi_dataset.zip
  cp ncbi_dataset/data/*/*_genomic.fna ${ge}_${sp}_${id}.fasta

  # The first genome creates the combined sketch; later genomes are pasted onto it.
  if [ ! -f RefSeqSketches_${version}.msh ]
  then
    mash sketch ${ge}_${sp}_${id}.fasta -o RefSeqSketches_${version}
  else
    mash sketch ${ge}_${sp}_${id}.fasta -o ${ge}_${sp}_${id}
    mv RefSeqSketches_${version}.msh tmp.msh
    mash paste RefSeqSketches_${version} tmp.msh ${ge}_${sp}_${id}.msh
    rm tmp.msh ${ge}_${sp}_${id}.msh
  fi

  # Clean up the per-genome download before the next iteration.
  rm ${ge}_${sp}_${id}.fasta
  rm -rf ncbi_dataset/
  rm ncbi_dataset.zip
  rm README.md
  rm md5sum.txt
done < ids.txt
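The loop names each FASTA file from the second and third whitespace-separated fields of a line in ids.txt, falling back to "unknown" when a field is empty. A minimal sketch of just that naming step follows; the function name name_for and the accession GCF_000000000.1 are hypothetical.

```shell
#!/usr/bin/env bash
# name_for: hypothetical helper mirroring the loop's naming logic.
# Prints <genus>_<species>_<accession> for one line of ids.txt.
name_for() {
  local line="$1" id ge sp
  id=$(echo "$line" | awk '{print $1}')
  ge=$(echo "$line" | awk '{print $2}')
  sp=$(echo "$line" | awk '{print $3}')
  # Same fallback as the loop: empty fields become "unknown".
  [ -n "$ge" ] || ge="unknown"
  [ -n "$sp" ] || sp="unknown"
  echo "${ge}_${sp}_${id}"
}

name_for "GCF_900475035.1 Streptococcus pyogenes"
name_for "GCF_000000000.1 Bacillus"   # placeholder accession, no species field
```

Anything after the third field (a strain name, for example) is dropped, so the sketch names carry only genus, species, and accession.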
To use
wget
mash dist sample.fasta RefSeqSketches_.msh > mash_results.txt
sort -gk3 mash_results.txt > sorted_mash_results.txt
The results should look like the following:
2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_pyogenes_GCF_900475035.1.fasta 0.0116661 0 643/1000
2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_dysgalactiae_GCF_016128095.1.fasta 0.0782587 0 107/1000
2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_canis_GCF_900636575.1.fasta 0.132399 2.34894e-153 32/1000
2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_agalactiae_GCF_001552035.1.fasta 0.164662 1.32611e-72 16/1000
2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_castoreus_GCF_000425025.1.fasta 0.174408 2.34302e-58 13/1000
2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_didelphis_GCF_000380005.1.fasta 0.182269 8.30736e-49 11/1000
2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_uberis_GCF_900475595.1.fasta 0.186761 5.62934e-44 10/1000
2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_iniae_GCF_000831485.1.fasta 0.191731 3.33152e-39 9/1000
2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_ictaluri_GCF_000188015.2.fasta 0.197292 1.75608e-34 8/1000
2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_phocae_GCF_001302265.1.fasta 0.203604 2.46548e-30 7/1000
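To pull out only the closest reference from such a table, one option is to sort on the distance column and keep the first row. The sketch below does this on three rows copied from the example output; the function name top_hit and the temporary file are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# top_hit: hypothetical helper that prints the sketch name (column 2),
# distance (column 3), and shared-hash count (column 5) of the row with
# the smallest distance.
top_hit() {
  sort -gk3 "$1" | head -n 1 | awk '{print $2, $3, $5}'
}

# Three rows from the example output, deliberately out of order.
results=$(mktemp)
cat > "$results" <<'EOF'
2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_canis_GCF_900636575.1.fasta 0.132399 2.34894e-153 32/1000
2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_pyogenes_GCF_900475035.1.fasta 0.0116661 0 643/1000
2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_dysgalactiae_GCF_016128095.1.fasta 0.0782587 0 107/1000
EOF

top_hit "$results"
rm -f "$results"
```

The -g flag asks sort for a general numeric comparison, which is what lets it order values written in scientific notation correctly.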
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Updated database to identify Listeria using mashID (https://github.com/duceppemo/mashID).
75,767 genomes were downloaded using the NCBI Datasets CLI tool (https://www.ncbi.nlm.nih.gov/datasets/docs/v2/command-line-tools/download-and-install/).
Downloaded genomes were renamed and binned by species (https://github.com/duceppemo/ncbi/blob/master/rename_and_bin_fasta.py).
Genomes were dereplicated by species using a 0.1% similarity threshold (https://github.com/rrwick/Assembly-Dereplicator).
Updated database to identify Mycobacteriaceae using mashID (https://github.com/duceppemo/mashID).
26,101 genomes were downloaded using the NCBI Datasets CLI tool (https://www.ncbi.nlm.nih.gov/datasets/docs/v2/command-line-tools/download-and-install/).
Downloaded genomes were renamed and binned by species (https://github.com/duceppemo/ncbi/blob/master/rename_and_bin_fasta.py).
Genomes were dereplicated by species using a 0.1% similarity threshold (https://github.com/rrwick/Assembly-Dereplicator).
Global inequality rooted in legacies of colonialism and uneven development can lead to systematic biases in scientific knowledge. In ecology and evolutionary biology, findings, funding, and research effort are disproportionately concentrated at high latitudes, while biological diversity is concentrated at low latitudes. This discrepancy may have a particular influence in fields like phylogeography, molecular ecology, and conservation genetics, where the rise of genomics has increased the cost and technical expertise required to apply state-of-the-art methods. Here, we ask whether a fundamental biogeographic pattern, the latitudinal gradient of species richness in tetrapods, is reflected in available reference genomes, an important data resource for various applications of molecular tools for biodiversity research and conservation. We also ask whether sequencing approaches differ between the Global South and Global North, reviewing the last five years of conservation genetics rese...

We used the National Center for Biotechnology Information (NCBI) Datasets command-line tools v.16.19.0 (O’Leary et al. 2024) to download taxonomy metadata for the subset of species with an assembled reference genome in the following taxa: birds (Class: Aves), mammals (Class: Mammalia), squamates (Order: Squamata), amphibians (Class: Amphibia), turtles (Order: Testudines), crocodilians (Order: Crocodilia) and tuataras (Order: Rhynchocephalia). We selected these groups, together comprising extant tetrapods, to provide a snapshot of animal diversity in relatively well-studied clades with different ecologies and evolutionary histories, while restricting the total dataset to a computationally manageable size. From this initial list we retained species with an exact match to the Global Biodiversity Information Facility’s (GBIF) Backbone Taxonomy using rgbif v.3.8.0 (Chamberlain et al. 2024) and downloaded all observations of each backed by georeferenced voucher specimens in natural history muse...

# Data from: A latitudinal gradient of reference genomes
Dataset DOI: 10.5061/dryad.2v6wwpzxh
Linck and Cadena 2024 Mol. Ecol. obtained two types of data: 1) metadata on published tetrapod reference genomes from NCBI's Genome Browser; and 2) georeferenced occurrence data from the Global Biodiversity Information Facility. Data were aggregated using the NCBI Datasets command-line tools v.16.19.0 and rgbif v.3.8.0. GBIF data are used in accordance with the organization's Data user agreement (https://www.gbif.org/terms/data-user)
The georeferenced occurrence data used in the study and necessary to knit 01_analysis.Rmd are available directly from GBIF via dedicated DOIs and landing pages. These downloads are: https://doi.org/10.15468/dl.59eyey (producing 0013380-240626123714530.zip); [https://doi.org/10.15468/dl.vybgce...,
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Indexed NCBI RefSeq database, containing both bacterial and fungal genomes. To download from the command line, use:
curl "https://mediaflux.researchsoftware.unimelb.edu.au:443/mflux/share.mfjp?_token=Lqaic1pBmpDdqX8ofv1C1128247855&browser=true&filename=RefSeq_bf.zip" -d browser=false -o RefSeq_bf.zip
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In a recent manuscript, we report a draft genome of the ascomycotal fungal species Pseudopithomyces maydicus (isolate name SBW1) obtained using a culture isolate from brewery wastewater. From a 22-contig assembly, we predict 13502 protein-coding gene models, of which 4389 (32.5%) were annotated to KEGG Orthology, and we identify 39 biosynthetic gene clusters. Here we provide supplementary data from our analysis:
Supplementary Figure 1 Sequence alignment between Sanger-sequenced partial 28S LSU-rRNA sequence and the top ranked BLASTN hit from NCBI nr/nt database.
Supplementary Figure 2 Pairs plot for contig GC-content, contig coverage and contig length from the P. maydicus assembly.
Supplementary Data File 1 Table listing properties of contigs from the P. maydicus assembly.
Supplementary Data File 2 Summary of taxonomic classification analysis of recovered 18S SSU-rRNA sequences to the SILVA 138 database.
Supplementary Data File 3 Alignment of Sanger-sequenced partial 28S LSU-rRNA sequence against three 28S LSU-rRNA gene sequences recovered from the P. maydicus long read genome assembly and a set of 62 28S LSU-rRNA sequences from members of genus Pseudopithomyces (NCBI Nucleotide searched for "Pseudopithomyces AND 28S" on 30th May 2022).
Supplementary Data File 4 MASH similarity statistics obtained by comparing the P. maydicus long read genome assembly sequence to 9563 fungal genomes obtained from NCBI. The reference genomes from NCBI were downloaded using the NCBI ‘datasets’ (version 13.6.0) command-line tool (datasets_13.6.0 download genome taxon 4751 --filename fungi.zip --assembly-level complete_genome,chromosome,scaffold,contig --exclude-gff3 --exclude-protein --exclude-rna).
Supplementary Data File 5 BlastKOALA annotation data for all proteins predicted from P. maydicus long read assembly.
Supplementary Results Complete output from the antiSMASH6 analysis of the P. maydicus long read assembly.