7 datasets found
  1. NCBI Datasets

    • catalog.data.gov
    • healthdata.gov
    • +2 more
    Updated Jul 17, 2025
    National Library of Medicine (2025). NCBI Datasets [Dataset]. https://catalog.data.gov/dataset/ncbi-datasets-beta
    Dataset updated
    Jul 17, 2025
    Dataset provided by
    National Library of Medicine
    Description

    NCBI Datasets is a one-stop shop for finding, browsing, and downloading genomic data. Find and download taxonomy, genome, gene, transcript, and protein data; installation instructions for the NCBI Datasets command-line tools are also provided.

  2. Mash Sketch of RefSeq Bacterial Reference Genomes

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 14, 2025
    Young, Erin (2025). Mash Sketch of RefSeq Bacterial Reference Genomes [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13901152
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    Public Health Laboratory, Department of Health and Human Services, State of Utah
    Authors
    Young, Erin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The mash reference that can be downloaded from the mash documentation is for RefSeq version 70.

    I do not inherently have a problem with RefSeq version 70, but RefSeq is well past version 200 now.

    RefSeq updates four times a year, and I needed an easy way to create and distribute a mash sketch file of the representative bacterial/prokaryotic genomes.

    This is intended to be a place to hold the mash sketches from https://github.com/erinyoung/update_mash_dist. The mash sketch file from erinyoung/update_mash_dist requires git lfs to be installed when cloning the repository, which is cumbersome for some users.

    The update frequency is intended to mirror that of RefSeq (i.e. 4 times a year), but... is likely to be less frequent than that. Don't hesitate to submit an issue if this needs to get updated.

    I do have some prior zenodo repositories (https://zenodo.org/records/10519852 , https://zenodo.org/records/7887021 , and https://zenodo.org/records/7348463 ) which hold the same mash sketch reference, but the refseq version is in the title. I'd rather have one repository that gets updated than create new repositories each time.

    This is how the mash reference file was created:

    Step 1. Download Datasets and Dataformat

    wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/datasets
    wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/dataformat
    chmod +x datasets dataformat

    Step 2. Download Mash

    wget https://github.com/marbl/Mash/releases/download/v2.3/mash-Linux64-v2.3.tar
    tar -xvf mash-Linux64-v2.3.tar

    Step 3. Get a list of all the genomes

    Note: this also changes how some of the names are represented

    datasets summary genome taxon bacteria --reference --as-json-lines |
      dataformat tsv genome --fields accession,organism-name --elide-header |
      sed 's/\[//g' |
      sed 's/\]//g' |
      sed 's/["'\'']//g' |
      sed 's/endosymbiont of /endosymbiont_of_/g' > ids.txt
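    As a rough illustration of what these substitutions do to a line of ids.txt, the sketch below wraps the same four sed expressions in a helper and runs one bracketed organism name through it. The function name clean_names and the sample accession are hypothetical and not part of the original workflow. One plausible reading of the last substitution is that it keeps "endosymbiont_of_..." together as a single word, so that the genus field picked up later is not just "endosymbiont".

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original workflow): applies the same
# clean-up substitutions to tab-separated "accession<TAB>organism name" lines
# read from standard input.
clean_names() {
  sed 's/\[//g' |
  sed 's/\]//g' |
  sed 's/["'\'']//g' |
  sed 's/endosymbiont of /endosymbiont_of_/g'
}

# Placeholder accession, for illustration only.
printf 'GCF_000000000.1\t[Clostridium] scindens\n' | clean_names
```

    The square brackets and quotes are dropped and the rest of the line is passed through unchanged.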

    Step 4. Download the reference files and sketch them

    Note: Since this is done in Github Actions (GA), I need to keep everything below 30G.

    The best way to do this is to download and process each reference file individually, and then combine it into the whole.

    This obviously does not need to be followed if not under those same limitations.

    # Note: ${version} is not defined in this snippet and must be set beforehand.
    while read line
    do
      # columns of ids.txt: accession, genus, species
      id=$(echo $line | awk '{print $1}')
      ge=$(echo $line | awk '{print $2}')
      if [ ! -n "$ge" ] ; then ge="unknown" ; fi
      sp=$(echo $line | awk '{print $3}')
      if [ ! -n "$sp" ] ; then sp="unknown" ; fi

      datasets download genome accession $id
      unzip ncbi_dataset.zip
      cp ncbi_dataset/data/*/*_genomic.fna ${ge}_${sp}_${id}.fasta
      if [ ! -f RefSeqSketches_${version}.msh ]
      then
        # first genome: start the combined sketch
        mash sketch ${ge}_${sp}_${id}.fasta -o RefSeqSketches_${version}
      else
        # later genomes: sketch individually, then paste onto the combined sketch
        mash sketch ${ge}_${sp}_${id}.fasta -o ${ge}_${sp}_${id}
        mv RefSeqSketches_${version}.msh tmp.msh
        mash paste RefSeqSketches_${version} tmp.msh ${ge}_${sp}_${id}.msh
        rm tmp.msh ${ge}_${sp}_${id}.msh
      fi

      # clean up to stay under the disk limit
      rm ${ge}_${sp}_${id}.fasta
      rm -rf ncbi_dataset/
      rm ncbi_dataset.zip
      rm README.md
      rm md5sum.txt
    done < ids.txt
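    The first part of the loop splits each line of ids.txt into an accession, a genus, and a species, falling back to "unknown" when a field is missing. A minimal sketch of just that naming step follows, pulled into a hypothetical function (fasta_name is not a name from the original workflow) so it can be tried without downloading anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the per-genome FASTA file name from one line of
# ids.txt ("accession<TAB>Genus species"), using the same awk field logic
# and the same "unknown" fallbacks as the loop.
fasta_name() {
  local line="$1" id ge sp
  id=$(echo "$line" | awk '{print $1}')
  ge=$(echo "$line" | awk '{print $2}')
  sp=$(echo "$line" | awk '{print $3}')
  [ -n "$ge" ] || ge="unknown"
  [ -n "$sp" ] || sp="unknown"
  echo "${ge}_${sp}_${id}.fasta"
}

# A complete line, then a line with only an accession.
fasta_name "$(printf 'GCF_900475035.1\tStreptococcus pyogenes')"
fasta_name 'GCF_900475035.1'
```

    Only the first three whitespace-separated words are used, so anything after the species (a strain name, for example) does not end up in the file name.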

    To use

    download file

    wget

    mash dist sample.fasta RefSeqSketches_.msh > mash_results.txt

    These results are unsorted, so many find it useful to sort them.

    sort -gk3 mash_results.txt > sorted_mash_results.txt

    The results should look like the following:
    

    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_pyogenes_GCF_900475035.1.fasta 0.0116661 0 643/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_dysgalactiae_GCF_016128095.1.fasta 0.0782587 0 107/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_canis_GCF_900636575.1.fasta 0.132399 2.34894e-153 32/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_agalactiae_GCF_001552035.1.fasta 0.164662 1.32611e-72 16/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_castoreus_GCF_000425025.1.fasta 0.174408 2.34302e-58 13/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_didelphis_GCF_000380005.1.fasta 0.182269 8.30736e-49 11/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_uberis_GCF_900475595.1.fasta 0.186761 5.62934e-44 10/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_iniae_GCF_000831485.1.fasta 0.191731 3.33152e-39 9/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_ictaluri_GCF_000188015.2.fasta 0.197292 1.75608e-34 8/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_phocae_GCF_001302265.1.fasta 0.203604 2.46548e-30 7/1000
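    The third column (the Mash distance) can be derived from the fifth (shared hashes out of the sketch size). Mash estimates the Jaccard index j as shared/sketch size and converts it to a distance with D = -(1/k) * ln(2j / (1 + j)). The sketch below assumes Mash's default k-mer size of 21 and sketch size of 1000; the "/1000" in the output is consistent with that, but the description does not state the settings explicitly. The function name mash_distance is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: Mash distance from the shared-hash count.
#   $1 = shared hashes, $2 = sketch size, $3 = k-mer size (assumed default: 21)
mash_distance() {
  awk -v w="$1" -v s="$2" -v k="${3:-21}" 'BEGIN {
    j = w / s                                   # Jaccard estimate
    if (j == 0) { print 1; exit }               # no shared hashes: maximum distance
    printf "%.6g\n", -log(2 * j / (1 + j)) / k  # log() is the natural logarithm
  }'
}

mash_distance 643 1000   # first row of the sample output
mash_distance 107 1000   # second row of the sample output
```

    For the first row (643/1000) this works out to roughly 0.0117, in line with the 0.0116661 reported above. Fewer shared hashes mean a larger distance, which is why sorting on the third column puts the closest reference first.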

  3. Listeria database for mashID - 2025-02-18 update

    • figshare.com
    bin
    Updated Feb 25, 2025
    Marc-Olivier Duceppe (2025). Listeria database for mashID - 2025-02-18 update [Dataset]. http://doi.org/10.6084/m9.figshare.28489262.v1
    Available download formats: bin
    Dataset updated
    Feb 25, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Marc-Olivier Duceppe
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Updated database to identify Listeria using mashID (https://github.com/duceppemo/mashID). 75,767 genomes were downloaded using the NCBI datasets cli tool (https://www.ncbi.nlm.nih.gov/datasets/docs/v2/command-line-tools/download-and-install/). Downloaded genomes were renamed and binned by species (https://github.com/duceppemo/ncbi/blob/master/rename_and_bin_fasta.py). Genomes were dereplicated by species using a 0.1% similarity threshold (https://github.com/rrwick/Assembly-Dereplicator).
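    To give a feel for what threshold-based dereplication means, here is a toy greedy sketch in awk. It is an illustration only and is not Assembly-Dereplicator's actual algorithm. It also assumes that the 0.1% figure corresponds to a pairwise distance of 0.001, which the description does not spell out. The function name greedy_derep and the three-column input layout (genome_a genome_b distance) are hypothetical.

```shell
#!/usr/bin/env bash
# Toy greedy dereplication (illustration only).
# Usage: greedy_derep THRESHOLD < pairs.txt
# pairs.txt holds whitespace-separated lines: genome_a genome_b distance
# A genome is kept unless it lies within THRESHOLD of an already-kept genome.
greedy_derep() {
  awk -v t="$1" '
    {
      # remember genomes in order of first appearance
      if (!($1 in seen)) { seen[$1]; order[++n] = $1 }
      if (!($2 in seen)) { seen[$2]; order[++n] = $2 }
      # store each distance symmetrically
      d[$1 SUBSEP $2] = $3
      d[$2 SUBSEP $1] = $3
    }
    END {
      for (i = 1; i <= n; i++) {
        g = order[i]; dup = 0
        for (j = 1; j <= k; j++) {
          key = g SUBSEP kept[j]
          if ((key in d) && d[key] + 0 <= t + 0) { dup = 1; break }
        }
        if (!dup) { kept[++k] = g; print g }
      }
    }'
}

# A and B are near-identical, C is distinct: A and C are kept.
printf 'A B 0.0005\nA C 0.02\nB C 0.02\n' | greedy_derep 0.001
```

    Which member of a near-identical group survives depends here purely on input order; a real tool may choose representatives differently.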

  4. Mycobacteriaceae database for mashID - 2025-02-20 update

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Feb 25, 2025
    Duceppe, Marc-Olivier (2025). Mycobacteriaceae database for mashID - 2025-02-20 update [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001330361
    Dataset updated
    Feb 25, 2025
    Authors
    Duceppe, Marc-Olivier
    Description

    Updated database to identify Mycobacteriaceae using mashID (https://github.com/duceppemo/mashID). 26,101 genomes were downloaded using the NCBI datasets cli tool (https://www.ncbi.nlm.nih.gov/datasets/docs/v2/command-line-tools/download-and-install/). Downloaded genomes were renamed and binned by species (https://github.com/duceppemo/ncbi/blob/master/rename_and_bin_fasta.py). Genomes were dereplicated by species using a 0.1% similarity threshold (https://github.com/rrwick/Assembly-Dereplicator).

  5. A latitudinal gradient of reference genomes

    • search.dataone.org
    Updated Aug 22, 2025
    Ethan Linck; Carlos Daniel Cadena (2025). A latitudinal gradient of reference genomes [Dataset]. http://doi.org/10.5061/dryad.2v6wwpzxh
    Dataset updated
    Aug 22, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Ethan Linck; Carlos Daniel Cadena
    Description

    Global inequality rooted in legacies of colonialism and uneven development can lead to systematic biases in scientific knowledge. In ecology and evolutionary biology, findings, funding, and research effort are disproportionately concentrated at high latitudes, while biological diversity is concentrated at low latitudes. This discrepancy may have a particular influence in fields like phylogeography, molecular ecology, and conservation genetics, where the rise of genomics has increased the cost and technical expertise required to apply state-of-the-art methods. Here, we ask whether a fundamental biogeographic pattern – the latitudinal gradient of species richness in tetrapods – is reflected in available reference genomes, an important data resource for various applications of molecular tools for biodiversity research and conservation. We also ask whether sequencing approaches differ between the Global South and Global North, reviewing the last five years of conservation genetics rese...

    We used the National Center for Biotechnology Information (NCBI) Datasets command-line tools v.16.19.0 (O’Leary et al. 2024) to download taxonomy metadata for the subset of species with an assembled reference genome in the following taxa: birds (Class: Aves), mammals (Class: Mammalia), squamates (Order: Squamata), amphibians (Class: Amphibia), turtles (Order: Testudines), crocodilians (Order: Crocodilia) and tuataras (Order: Rhynchocephalia). We selected these groups—together comprising extant tetrapods—to provide a snapshot of animal diversity in relatively well-studied clades with different ecologies and evolutionary histories, while restricting the total dataset to a computationally manageable size. From this initial list we retained species with an exact match to the Global Biodiversity Information Facility’s (GBIF) Backbone Taxonomy using rgbif v.3.8.0 (Chamberlain et al. 2024) and downloaded all observations of each backed by georeferenced voucher specimens in natural history muse...

    Data from: A latitudinal gradient of reference genomes

    Dataset DOI: 10.5061/dryad.2v6wwpzxh

    Description of the data and file structure

    Linck and Cadena 2024 Mol. Ecol. obtained two types of data: 1) metadata on published tetrapod reference genomes from NCBI's Genome Browser; and 2) georeferenced occurrence data from the Global Biodiversity Information Facility. Data were aggregated using the NCBI Datasets command-line tools v.16.19.0 and rgbif v.3.8.0. GBIF data are used in accordance with the organization's Data user agreement (https://www.gbif.org/terms/data-user)

    GBIF Data

    The georeferenced occurrence data used in the study and necessary to knit 01_analysis.Rmd are available directly from GBIF via dedicated DOIs and landing pages. These downloads are: https://doi.org/10.15468/dl.59eyey (producing 0013380-240626123714530.zip); [https://doi.org/10.15468/dl.vybgce...,

  6. Indexed RefSeq database

    • figshare.unimelb.edu.au
    bin
    Updated Feb 28, 2024
    VANESSA ROSSETTO MARCELINO (2024). Indexed RefSeq database [Dataset]. http://doi.org/10.26188/25222604.v1
    Available download formats: bin
    Dataset updated
    Feb 28, 2024
    Dataset provided by
    The University of Melbourne
    Authors
    VANESSA ROSSETTO MARCELINO
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Indexed NCBI RefSeq database, containing both bacterial and fungal genomes.

    To download from the command line, use:

    curl "https://mediaflux.researchsoftware.unimelb.edu.au:443/mflux/share.mfjp?_token=Lqaic1pBmpDdqX8ofv1C1128247855&browser=true&filename=RefSeq_bf.zip" -d browser=false -o RefSeq_bf.zip

  7. Supplementary data related to draft genome of the ascomycotal fungal species...

    • data.niaid.nih.gov
    Updated Jul 15, 2024
    Arumugam, Krithika; Ho, Sherilyn; Bessarab, Irina; Goh, Falicia; Haryono, Mindia; Santillan, Ezequiel; Wuertz, Stefan; Chow, Yvonne; Williams, Rohan (2024). Supplementary data related to draft genome of the ascomycotal fungal species Pseudopithomyces maydicus (family Didymosphaeriaceae) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7374665
    Dataset updated
    Jul 15, 2024
    Dataset provided by
    National University of Singapore
    Singapore Institute of Food and Biotechnology Innovation (SIFBI)
    Nanyang Technological University
    Authors
    Arumugam, Krithika; Ho, Sherilyn; Bessarab, Irina; Goh, Falicia; Haryono, Mindia; Santillan, Ezequiel; Wuertz, Stefan; Chow, Yvonne; Williams, Rohan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In a recent manuscript, we report a draft genome of the ascomycotal fungal species Pseudopithomyces maydicus (isolate name SBW1) obtained using a culture isolate from brewery wastewater. From a 22-contig assembly, we predict 13502 protein-coding gene models, of which 4389 (32.5%) were annotated to KEGG Orthology, and we identify 39 biosynthetic gene clusters. Here we provide supplementary data from our analysis:

    Supplementary Figure 1 Sequence alignment between Sanger-sequenced partial 28S LSU-rRNA sequence and the top ranked BLASTN hit from NCBI nr/nt database.

    Supplementary Figure 2 Pairs plot for contig GC-content, contig coverage and contig length from the P. maydicus assembly.

    Supplementary Data File 1 Table listing properties of contigs from the P. maydicus assembly.

    Supplementary Data File 2 Summary of taxonomic classification analysis of recovered 18S SSU-rRNA sequences to the SILVA 138 database.

    Supplementary Data File 3 Alignment of Sanger-sequenced partial 28S LSU-rRNA sequence against three 28S LSU-rRNA gene sequences recovered from the P. maydicus long read genome assembly and a set of 62 28S LSU-rRNA sequences from members of genus Pseudopithomyces (NCBI Nucleotide searched for "Pseudopithomyces AND 28S" on 30th May 2022).

    Supplementary Data File 4 MASH similarity statistics obtained by comparing the P. maydicus long read genome assembly sequence to 9563 fungal genomes obtained from NCBI. The reference genomes from NCBI were downloaded using the NCBI ‘datasets’ (version 13.6.0) command line tool (datasets_13.6.0 download genome taxon 4751 --filename fungi.zip --assembly-level complete_genome,chromosome,scaffold,contig --exclude-gff3 --exclude-protein --exclude-rna).

    Supplementary Data File 5 BlastKOALA annotation data for all proteins predicted from P. maydicus long read assembly.

    Supplementary Results Complete output from the antiSMASH6 analysis of the P. maydicus long read assembly.
