7 datasets found

NCBI Datasets
catalog.data.gov
healthdata.gov
Updated Jul 17, 2025
Cite
National Library of Medicine (2025). NCBI Datasets [Dataset]. https://catalog.data.gov/dataset/ncbi-datasets-beta
Dataset updated
Jul 17, 2025
Dataset provided by
National Library of Medicine
Description
NCBI Datasets is a one-stop shop for finding, browsing, and downloading genomic data. Find and download taxonomy, genome, gene, transcript, and protein data; the site also covers installation of the NCBI Datasets command-line tools.
Mash Sketch of RefSeq Bacterial Reference Genomes
data.niaid.nih.gov
zenodo.org
Updated Mar 14, 2025
Cite
Young, Erin (2025). Mash Sketch of RefSeq Bacterial Reference Genomes [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13901152
Dataset updated
Mar 14, 2025
Dataset provided by
Public Health Laboratory, Department of Health and Human Services, State of Utah
Authors
Young, Erin
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The mash reference that can be downloaded from the mash documentation is for RefSeq version 70.

I do not inherently have a problem with RefSeq version 70, but RefSeq is well past version 200 now.

RefSeq updates four times a year, and I needed an easy way to create and distribute a mash sketch file of the representative bacterial/prokaryotic genomes.

This is intended to be a place to hold the mash sketches from https://github.com/erinyoung/update_mash_dist. The mash sketch file from erinyoung/update_mash_dist requires git lfs to be installed when cloning the repository, which is cumbersome for some users.

The update frequency is intended to mirror that of RefSeq (i.e. four times a year), but is likely to be less frequent than that. Don't hesitate to submit an issue if this needs to get updated.

I do have some prior zenodo repositories (https://zenodo.org/records/10519852, https://zenodo.org/records/7887021, and https://zenodo.org/records/7348463) which hold the same mash sketch reference, but the RefSeq version is in the title. I'd rather have one repository that gets updated than create new repositories each time.

This is how the mash reference file was created:

Step 1. Download Datasets and Dataformat

wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/datasets
wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/dataformat
chmod +x datasets dataformat

Step 2. Download Mash

wget https://github.com/marbl/Mash/releases/download/v2.3/mash-Linux64-v2.3.tar
tar -xvf mash-Linux64-v2.3.tar

Step 3. Get a list of all the genomes

Note: this also changes how some of the names are represented

datasets summary genome taxon bacteria --reference --as-json-lines |
dataformat tsv genome --fields accession,organism-name --elide-header |
sed 's/\[//g' |
sed 's/]//g' |
sed 's/["'\'']//g' |
sed 's/endosymbiont of /endosymbiont_of_/g' > ids.txt
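
As a minimal sketch of what the name cleanup in Step 3 does, the same sed filters can be run on a couple of made-up lines. The accessions and organism names below are illustrative placeholders rather than real datasets output, and the function name clean_names is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: applies the Step 3 name cleanup to lines on stdin.
clean_names() {
  sed 's/\[//g' |
    sed 's/]//g' |
    sed 's/["'\'']//g' |
    sed 's/endosymbiont of /endosymbiont_of_/g'
}

# Illustrative input only: placeholder accessions, tab-separated like the TSV
# produced by dataformat.
sample=$(printf 'GCF_000000001.1\t[Clostridium] scindens\nGCF_000000002.1\tendosymbiont of '\''Example host'\''\n')

# Brackets and quotes are stripped; "endosymbiont of " becomes one token.
cleaned=$(printf '%s\n' "$sample" | clean_names)
printf '%s\n' "$cleaned"
```

Removing brackets and quotes keeps these characters out of the file names that Step 4 builds from each line.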

Step 4. Download the reference files and sketch them

Note: Since this is done in GitHub Actions (GA), I need to keep everything below 30G.

The best way to do this is to download and process each reference file individually, and then combine it into the whole.

This obviously does not need to be followed if not under those same limitations.

while read line
do
  id=$(echo $line | awk '{print $1}')
  ge=$(echo $line | awk '{print $2}')
  if [ ! -n "$ge" ] ; then ge="unknown" ; fi
  sp=$(echo $line | awk '{print $3}')
  if [ ! -n "$sp" ] ; then sp="unknown" ; fi

  datasets download genome accession $id
  unzip ncbi_dataset.zip
  cp ncbi_dataset/data/*/*_genomic.fna ${ge}_${sp}_${id}.fasta
  if [ ! -f RefSeqSketches_${version}.msh ]
  then
    mash sketch ${ge}_${sp}_${id}.fasta -o RefSeqSketches_${version}
  else
    mash sketch ${ge}_${sp}_${id}.fasta -o ${ge}_${sp}_${id}
    mv RefSeqSketches_${version}.msh tmp.msh
    mash paste RefSeqSketches_${version} tmp.msh ${ge}_${sp}_${id}.msh
    rm tmp.msh ${ge}_${sp}_${id}.msh
  fi

  rm ${ge}_${sp}_${id}.fasta
  rm -rf ncbi_dataset/
  rm ncbi_dataset.zip
  rm README.md
  rm md5sum.txt
done < ids.txt
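
The loop in Step 4 derives each FASTA file name from the first three whitespace-separated fields of a line in ids.txt, falling back to "unknown" when the genus or species is missing. The sketch below isolates only that naming logic, with no downloads and no mash calls. The function name fasta_name and the second accession are hypothetical, and it uses a single awk call in place of the three separate ones in the loop.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the per-genome FASTA name used in Step 4
# from one line of ids.txt (accession, genus, species, ...).
fasta_name() {
  printf '%s\n' "$1" | awk '{
    ge = ($2 == "") ? "unknown" : $2   # genus, or "unknown" if absent
    sp = ($3 == "") ? "unknown" : $3   # species, or "unknown" if absent
    print ge "_" sp "_" $1 ".fasta"
  }'
}

# Accession taken from the example results; prints the name with all three parts.
fasta_name "GCF_900475035.1 Streptococcus pyogenes"

# Placeholder accession with no species field; the species falls back to "unknown".
fasta_name "GCF_000000001.1 Wolbachia"
```

Any fields after the third (strain names, for example) are ignored, which matches what the awk calls in the loop do.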

To use

download file

wget

mash dist sample.fasta RefSeqSketches_.msh > mash_results.txt

These results are unsorted, so many find it useful to sort them.

sort -gk3 mash_results.txt > sorted_mash_results.txt
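
As a small illustration of the sorting step, the sketch below writes three result rows (values copied from the example output, with sample.fasta standing in for the real sample name) to a temporary directory, sorts them on the distance column, and picks out the closest reference. The temporary directory and the variable names are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Work in a throwaway directory so no real results are overwritten.
workdir=$(mktemp -d)

# Three tab-separated rows, deliberately out of order.
# Columns: sample, reference, distance, p-value, shared hashes.
printf '%s\t%s\t%s\t%s\t%s\n' \
  sample.fasta Streptococcus_canis_GCF_900636575.1.fasta 0.132399 2.34894e-153 32/1000 \
  sample.fasta Streptococcus_pyogenes_GCF_900475035.1.fasta 0.0116661 0 643/1000 \
  sample.fasta Streptococcus_dysgalactiae_GCF_016128095.1.fasta 0.0782587 0 107/1000 \
  > "$workdir/mash_results.txt"

# General-numeric sort on column 3 (distance), smallest first.
sort -gk3 "$workdir/mash_results.txt" > "$workdir/sorted_mash_results.txt"

# The closest reference is now the first line; column 2 holds its name.
top_hit=$(head -n 1 "$workdir/sorted_mash_results.txt" | cut -f 2)
echo "$top_hit"

rm -rf "$workdir"
```

The -g flag compares values as floating-point numbers, so it also copes with scientific notation such as the p-values in column 4.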

They should look like the following:

2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_pyogenes_GCF_900475035.1.fasta 0.0116661 0 643/1000
2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_dysgalactiae_GCF_016128095.1.fasta 0.0782587 0 107/1000
2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_canis_GCF_900636575.1.fasta 0.132399 2.34894e-153 32/1000
2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_agalactiae_GCF_001552035.1.fasta 0.164662 1.32611e-72 16/1000
2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_castoreus_GCF_000425025.1.fasta 0.174408 2.34302e-58 13/1000
2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_didelphis_GCF_000380005.1.fasta 0.182269 8.30736e-49 11/1000
2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_uberis_GCF_900475595.1.fasta 0.186761 5.62934e-44 10/1000
2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_iniae_GCF_000831485.1.fasta 0.191731 3.33152e-39 9/1000
2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_ictaluri_GCF_000188015.2.fasta 0.197292 1.75608e-34 8/1000
2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_phocae_GCF_001302265.1.fasta 0.203604 2.46548e-30 7/1000
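
The five columns in these results are the sample, the reference, the Mash distance, the p-value, and the shared hashes. A rough sketch of how they might be filtered is shown below: it keeps only hits at or below a distance cutoff and converts the shared-hash column to a fraction. The function name filter_hits and the 0.05 cutoff are illustrative assumptions, not values from the dataset description.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep hits whose Mash distance (column 3) is at or below
# the cutoff given as $1, and print reference name, distance, and the
# shared-hash fraction.
filter_hits() {
  awk -v max="$1" '
    $3 + 0 <= max + 0 {
      split($5, h, "/")                 # "643/1000" -> h[1]=643, h[2]=1000
      printf "%s\t%s\t%.3f\n", $2, $3, h[1] / h[2]
    }'
}

# Two rows taken from the example results, with a shortened sample name.
hits=$(printf '%s\t%s\t%s\t%s\t%s\n' \
  sample.fasta Streptococcus_pyogenes_GCF_900475035.1.fasta 0.0116661 0 643/1000 \
  sample.fasta Streptococcus_canis_GCF_900636575.1.fasta 0.132399 2.34894e-153 32/1000 |
  filter_hits 0.05)

# Only the Streptococcus pyogenes row passes the illustrative 0.05 cutoff.
printf '%s\n' "$hits"
```

On a real results file the same function could be fed with sorted_mash_results.txt on standard input.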
Listeria database for mashID - 2025-02-18 update
figshare.com
bin
Updated Feb 25, 2025
Cite
Marc-Olivier Duceppe (2025). Listeria database for mashID - 2025-02-18 update [Dataset]. http://doi.org/10.6084/m9.figshare.28489262.v1
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.28489262.v1
Dataset updated
Feb 25, 2025
Dataset provided by
Figshare (http://figshare.com/)
Authors
Marc-Olivier Duceppe
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Updated database to identify Listeria using mashID (https://github.com/duceppemo/mashID).
75,767 genomes were downloaded using the NCBI datasets cli tool (https://www.ncbi.nlm.nih.gov/datasets/docs/v2/command-line-tools/download-and-install/).
Downloaded genomes were renamed and binned by species (https://github.com/duceppemo/ncbi/blob/master/rename_and_bin_fasta.py).
Genomes were dereplicated by species using a 0.1% similarity threshold (https://github.com/rrwick/Assembly-Dereplicator).
Mycobacteriaceae database for mashID - 2025-02-20 update
datasetcatalog.nlm.nih.gov
figshare.com
Updated Feb 25, 2025
Cite
Duceppe, Marc-Olivier (2025). Mycobacteriaceae database for mashID - 2025-02-20 update [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001330361
Dataset updated
Feb 25, 2025
Authors
Duceppe, Marc-Olivier
Description
Updated database to identify Mycobacteriaceae using mashID (https://github.com/duceppemo/mashID).
26,101 genomes were downloaded using the NCBI datasets cli tool (https://www.ncbi.nlm.nih.gov/datasets/docs/v2/command-line-tools/download-and-install/).
Downloaded genomes were renamed and binned by species (https://github.com/duceppemo/ncbi/blob/master/rename_and_bin_fasta.py).
Genomes were dereplicated by species using a 0.1% similarity threshold (https://github.com/rrwick/Assembly-Dereplicator).
A latitudinal gradient of reference genomes
search.dataone.org
Updated Aug 22, 2025
Cite
Ethan Linck; Carlos Daniel Cadena (2025). A latitudinal gradient of reference genomes [Dataset]. http://doi.org/10.5061/dryad.2v6wwpzxh
Unique identifier
https://doi.org/10.5061/dryad.2v6wwpzxh
Dataset updated
Aug 22, 2025
Dataset provided by
Dryad Digital Repository
Authors
Ethan Linck; Carlos Daniel Cadena
Description
Global inequality rooted in legacies of colonialism and uneven development can lead to systematic biases in scientific knowledge. In ecology and evolutionary biology, findings, funding, and research effort are disproportionately concentrated at high latitudes, while biological diversity is concentrated at low latitudes. This discrepancy may have a particular influence in fields like phylogeography, molecular ecology, and conservation genetics, where the rise of genomics has increased the cost and technical expertise required to apply state-of-the-art methods. Here, we ask whether a fundamental biogeographic pattern – the latitudinal gradient of species richness in tetrapods – is reflected in available reference genomes, an important data resource for various applications of molecular tools for biodiversity research and conservation. We also ask whether sequencing approaches differ between the Global South and Global North, reviewing the last five years of conservation genetics rese...

We used the National Center for Biotechnology Information (NCBI) Datasets command-line tools v.16.19.0 (O’Leary et al. 2024) to download taxonomy metadata for the subset of species with an assembled reference genome in the following taxa: birds (Class: Aves), mammals (Class: Mammalia), squamates (Order: Squamata), amphibians (Class: Amphibia), turtles (Order: Testudines), crocodilians (Order: Crocodilia) and tuataras (Order: Rhynchocephalia). We selected these groups, together comprising extant tetrapods, to provide a snapshot of animal diversity in relatively well-studied clades with different ecologies and evolutionary histories, while restricting the total dataset to a computationally manageable size. From this initial list we retained species with an exact match to the Global Biodiversity Information Facility’s (GBIF) Backbone Taxonomy using rgbif v.3.8.0 (Chamberlain et al. 2024) and downloaded all observations of each backed by georeferenced voucher specimens in natural history muse...

Data from: A latitudinal gradient of reference genomes

Dataset DOI: 10.5061/dryad.2v6wwpzxh

Description of the data and file structure

Linck and Cadena 2024 Mol. Ecol. obtained two types of data: 1) metadata on published tetrapod reference genomes from NCBI's Genome Browser; and 2) georeferenced occurrence data from the Global Biodiversity Information Facility. Data were aggregated using the NCBI Datasets command-line tools v.16.19.0 and rgbif v.3.8.0. GBIF data are used in accordance with the organization's Data user agreement (https://www.gbif.org/terms/data-user).

GBIF Data

The georeferenced occurrence data used in the study and necessary to knit 01_analysis.Rmd are available directly from GBIF via dedicated DOIs and landing pages. These downloads are: https://doi.org/10.15468/dl.59eyey (producing 0013380-240626123714530.zip); [https://doi.org/10.15468/dl.vybgce...
Indexed RefSeq database
figshare.unimelb.edu.au
bin
Updated Feb 28, 2024
Cite
VANESSA ROSSETTO MARCELINO (2024). Indexed RefSeq database [Dataset]. http://doi.org/10.26188/25222604.v1
Available download formats: bin
Unique identifier
https://doi.org/10.26188/25222604.v1
Dataset updated
Feb 28, 2024
Dataset provided by
The University of Melbourne
Authors
VANESSA ROSSETTO MARCELINO
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Indexed NCBI RefSeq database, containing both bacterial and fungal genomes. To download from the command line, use:

curl "https://mediaflux.researchsoftware.unimelb.edu.au:443/mflux/share.mfjp?_token=Lqaic1pBmpDdqX8ofv1C1128247855&browser=true&filename=RefSeq_bf.zip" -d browser=false -o RefSeq_bf.zip
Supplementary data related to draft genome of the ascomycotal fungal species...
data.niaid.nih.gov
Updated Jul 15, 2024
Cite
Arumugam, Krithika; Ho, Sherilyn; Bessarab, Irina; Goh, Falicia; Haryono, Mindia; Santillan, Ezequiel; Wuertz, Stefan; Chow, Yvonne; Williams, Rohan (2024). Supplementary data related to draft genome of the ascomycotal fungal species Pseudopithomyces maydicus (family Didymosphaeriaceae) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7374665
Dataset updated
Jul 15, 2024
Dataset provided by
National University of Singapore
Singapore Institute of Food and Biotechnology Innovation (SIFBI)
Nanyang Technological University
Authors
Arumugam, Krithika; Ho, Sherilyn; Bessarab, Irina; Goh, Falicia; Haryono, Mindia; Santillan, Ezequiel; Wuertz, Stefan; Chow, Yvonne; Williams, Rohan
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In a recent manuscript, we report a draft genome of the ascomycotal fungal species Pseudopithomyces maydicus (isolate name SBW1) obtained using a culture isolate from brewery wastewater. From a 22-contig assembly, we predict 13,502 protein-coding gene models, of which 4,389 (32.5%) were annotated to KEGG Orthology, and identify 39 biosynthetic gene clusters. Here we provide supplementary data from our analysis:

Supplementary Figure 1 Sequence alignment between Sanger-sequenced partial 28S LSU-rRNA sequence and the top ranked BLASTN hit from NCBI nr/nt database.

Supplementary Figure 2 Pairs plot for contig GC-content, contig coverage and contig length from the P. maydicus assembly.

Supplementary Data File 1 Table listing properties of contigs from the P. maydicus assembly.

Supplementary Data File 2 Summary of taxonomic classification analysis of recovered 18S SSU-rRNA sequences to the SILVA 138 database.

Supplementary Data File 3 Alignment of Sanger-sequenced partial 28S LSU-rRNA sequence against three 28S LSU-rRNA gene sequences recovered from the P. maydicus long read genome assembly and a set of 62 28S LSU-rRNA sequences from members of genus Pseudopithomyces (NCBI Nucleotide searched for “Pseudopithomyces AND 28S” on 30th May 2022).

Supplementary Data File 4 MASH similarity statistics obtained by comparing the P. maydicus long read genome assembly sequence to 9563 fungal genomes obtained from NCBI. The reference genomes from NCBI were downloaded using the NCBI ‘datasets’ (version 13.6.0) command line tool (datasets_13.6.0 download genome taxon 4751 --filename fungi.zip --assembly-level complete_genome,chromosome,scaffold,contig --exclude-gff3 --exclude-protein --exclude-rna).

Supplementary Data File 5 BlastKOALA annotation data for all proteins predicted from P. maydicus long read assembly.

Supplementary Results Complete output from the antiSMASH6 analysis of the P. maydicus long read assembly.