100+ datasets found
  1. Sequence Read Archive (SRA)

    • catalog.data.gov
    • healthdata.gov
    • +2more
    Updated Jun 19, 2025
    Cite
    National Library of Medicine (2025). Sequence Read Archive (SRA) [Dataset]. https://catalog.data.gov/dataset/sequence-read-archive-sra-54e4a
    Dataset updated
    Jun 19, 2025
    Dataset provided by
    National Library of Medicine
    Description

    The Sequence Read Archive (SRA) stores sequencing data from the next generation of sequencing platforms including Roche 454 GS System®, Illumina Genome Analyzer®, Life Technologies AB SOLiD System®, Helicos Biosciences Heliscope®, Complete Genomics®, and Pacific Biosciences SMRT®.

  2. Sequence Read Archive (SRA)

    • data.virginia.gov
    • healthdata.gov
    • +2more
    html
    Updated Jul 25, 2023
    Cite
    National Institutes of Health (NIH) (2023). Sequence Read Archive (SRA) [Dataset]. https://data.virginia.gov/dataset/sequence-read-archive-sra
    Available download formats: html
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    National Institutes of Health (NIH)
    Description

    The Sequence Read Archive (SRA) stores raw sequencing data from the next generation of sequencing platforms including Roche 454 GS System®, Illumina Genome Analyzer®, Applied Biosystems SOLiD® System, Helicos Heliscope®, Complete Genomics®, and Pacific Biosciences SMRT®.

  3. COVID-19 Genome Sequence Dataset

    • registry.opendata.aws
    • catalog.midasnetwork.us
    Updated Jul 9, 2020
    Cite
    National Library of Medicine (NLM) (2020). COVID-19 Genome Sequence Dataset [Dataset]. https://registry.opendata.aws/ncbi-covid-19/
    Dataset updated
    Jul 9, 2020
    Dataset provided by
    National Library of Medicine (NLM) (http://nlm.nih.gov/)
    Description

    This repository within the ACTIV TRACE initiative houses a comprehensive collection of datasets related to SARS-CoV-2. The processing of SARS-CoV-2 Sequence Read Archive (SRA) files has been optimized to identify genetic variations in viral samples. This information is then presented in the Variant Call Format (VCF). Each VCF file corresponds to the SRA parent-run's accession ID. Additionally, the data is available in the parquet format, making it easier to search and filter using the Amazon Athena Service. The SARS-CoV-2 Variant Calling Pipeline is designed to handle new data every six hours, with updates to the AWS ODP bucket occurring daily.
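
    To make the VCF description above concrete, here is a minimal sketch that parses the eight fixed columns of a single VCF data line. The example record is made up for illustration and is not taken from the bucket; the contig name in the actual files may differ from the RefSeq accession assumed here.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: list
    info: dict


def parse_vcf_line(line: str) -> Variant:
    """Parse the fixed columns of one tab-separated VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]
    info_dict = {}
    for item in info.split(";"):
        if item in ("", "."):
            continue
        key, sep, value = item.partition("=")
        # Flag-style INFO entries have no "=" and are stored as True
        info_dict[key] = value if sep else True
    return Variant(chrom, int(pos), ref, alt.split(","), info_dict)


# Hypothetical record; NC_045512.2 is the SARS-CoV-2 RefSeq reference accession
example = "NC_045512.2\t23403\t.\tA\tG\t.\tPASS\tDP=120"
variant = parse_vcf_line(example)
print(variant)
```

    Header lines (those starting with "#") would need to be skipped before calling this function on a real file.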

  4. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome...

    • data.niaid.nih.gov
    Updated Apr 9, 2021
    Cite
    (2021). Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome sequencing [Dataset]. https://data.niaid.nih.gov/resources?id=ncbi_sra_srp253798
    Dataset updated
    Apr 9, 2021
    Description

    Genomic sequence data of clinical SARS-CoV-2 samples.

  5. Data relating to RNA sequence accessions at NCBI from Ross Sea...

    • search.dataone.org
    • bco-dmo.org
    • +1more
    Updated Dec 5, 2021
    Cite
    Rebecca J. Gast (2021). Data relating to RNA sequence accessions at NCBI from Ross Sea Dinoflagellates, Phaeocystis antarctica, Pyramimons tychotreta, and Micromonas polaris (CCMP 2099) (Kleptoplasty project) [Dataset]. https://search.dataone.org/view/http%3A%2F%2Flod.bco-dmo.org%2Fid%2Fdataset%2F728427
    Dataset updated
    Dec 5, 2021
    Dataset provided by
    Biological and Chemical Oceanography Data Management Office (BCO-DMO)
    Authors
    Rebecca J. Gast
    Time period covered
    Dec 1, 1997 - Apr 7, 1998
    Description

    This dataset contains data related to RNA sequence genetic accessions at the National Center for Biotechnology Information (NCBI) including information about the host organism, collection location, and collection date.

    The accessions are the unprocessed Illumina MiSeq reads for the Ross Sea Dinoflagellate RNA-Seq experiments, Phaeocystis antarctica RNA-Seq experiments, and Pyramimonas tychotreta & Micromonas polaris (CCMP 2099) mixotrophy experiments.

    Pyramimonas tychotreta & Micromonas polaris (CCMP 2099) mixotrophy RNA sequences are available through the NCBI Sequence Read Archive (SRA) under the accession number SRP090401 (BioProject PRJNA342459).

    Ross Sea Dinoflagellate RNA sequences are available through the NCBI Sequence Read Archive (SRA) under the accession number SRP132912 (BioProject PRJNA428208).

    Phaeocystis antarctica RNA sequences are available through the NCBI Sequence Read Archive (SRA) under the accession number SRP133243 (BioProject PRJNA434497).
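
    The identifiers above follow the usual accession prefixes: SRP for an SRA study and PRJNA for an NCBI BioProject. As a rough sketch, the prefix can be used to classify an accession string. The pattern table below is a simplified assumption covering only the most common prefixes, not an exhaustive rule set.

```python
import re

# Simplified prefix patterns for common INSDC-style accessions (an assumption,
# not a complete list): S/E/D prefixes correspond to NCBI, ENA, and DDBJ.
ACCESSION_PATTERNS = {
    "study": re.compile(r"^[SED]RP\d+$"),
    "experiment": re.compile(r"^[SED]RX\d+$"),
    "run": re.compile(r"^[SED]RR\d+$"),
    "sample": re.compile(r"^[SED]RS\d+$"),
    "bioproject": re.compile(r"^PRJ(NA|EB|DB)\d+$"),
}


def classify_accession(accession: str) -> str:
    """Return the accession type implied by its prefix, or 'unknown'."""
    for kind, pattern in ACCESSION_PATTERNS.items():
        if pattern.match(accession):
            return kind
    return "unknown"


# Study/BioProject pairs quoted in the description above
pairs = [
    ("SRP090401", "PRJNA342459"),
    ("SRP132912", "PRJNA428208"),
    ("SRP133243", "PRJNA434497"),
]
for study, project in pairs:
    print(study, classify_accession(study), project, classify_accession(project))
```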

  6. Sequencing Data for Hospital Metagenomes

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Sequencing Data for Hospital Metagenomes [Dataset]. https://catalog.data.gov/dataset/sequencing-data-for-hospital-metagenomes
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Sequence data files: assembled contigs (FASTA), predicted genes (FASTA), predicted proteins (FASTA), and gene predictions (GFF v2). This dataset is not hosted here because the sequences have already been deposited in publicly available databases, which avoids duplication. The data are also quite large, with numerous files associated with each entry. The data can be accessed through the following web links: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA299404 https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP065069 http://enve-omics.ce.gatech.edu/data/showerheads. Format: The data represent genome sequencing and assembly of 180 different contigs. This dataset is associated with the following publication: Soto-Giron, M.J., L. Rodriguez, C. Luo, M. Elk, H. Ryu, J. Santodomingo, and K. Konstantinidis. Biofilms on Hospital Shower Hoses: Characterization and Implications for Nosocomial Infections. Applied and Environmental Microbiology. American Society for Microbiology, Washington, DC, USA, 82(9): 2872-2883, (2016).

  7. Data from Readsynth: short-read simulation for consideration of...

    • datadryad.org
    zip
    Updated Apr 12, 2024
    Cite
    Ryan Kuster (2024). Data from Readsynth: short-read simulation for consideration of composition-biases in reduced metagenome sequencing approaches [Dataset]. http://doi.org/10.5061/dryad.nzs7h44zk
    Available download formats: zip
    Dataset updated
    Apr 12, 2024
    Dataset provided by
    Dryad
    Authors
    Ryan Kuster
    Description

    readsynth_analysis

    https://doi.org/10.5061/dryad.nzs7h44zk

    The dataset contained here provides the necessary raw sequence data to perform analyses for the simulation software readsynth.

    The dataset includes the genomes and databases necessary to reproduce the steps in the github repository readsynth_analysis and correspond with that repository's "raw_data" directory.

    Description of the data and file structure

    The genome directory "raw_data" is broken into the following subdirectories (further descriptions below):

    .
    ├── helius
    │  └── all_2084
    │    ├── genomes
    │    └── genomes_combined
    ├── kraken_dbs
    │  ├── k2_pluspfp_20220607
    │  ├── snipen_bei_db
    │  │  └── library
    │  │    └── added
    │  └── sun_atcc_db
    │    └── library
    │      └── added
    ├── liu_RMS
    │  └── mock_community_estimate
    │    ├── 10M_bracken_profile
    │ ...
    
  8. Catalog of NCBI sequence read archive (SRA) data for salamanders at the...

    • portal.edirepository.org
    csv
    Updated Apr 9, 2024
    Cite
    Brett Addis; Madaline Cochrane; Winsor Lowe (2024). Catalog of NCBI sequence read archive (SRA) data for salamanders at the Hubbard Brook Experimental Forest 2012-2021 [Dataset]. http://doi.org/10.6073/pasta/6df7199d751ec81315395a042cbd8083
    Available download formats: csv (312227 bytes), csv (220695 bytes), csv (282251 bytes)
    Dataset updated
    Apr 9, 2024
    Dataset provided by
    EDI
    Authors
    Brett Addis; Madaline Cochrane; Winsor Lowe
    Time period covered
    2012 - 2021
    Variables measured
    strain, ecotype, isolate, lat_lon, cultivar, organism, Accession, BioProject, env_medium, sample_URL, and 8 more
    Description

    This project was designed to describe fine-scale population genetic differentiation of the stream salamander Gyrinophilus porphyriticus among five study streams in the Hubbard Brook Experimental Forest. The data are paired with intensive capture-recapture data to assess direct fitness effects of individual genetic diversity, including effects of individual multilocus heterozygosity on stage-specific survival probabilities.

    This dataset publishes a manifest of the genomic sequence reads submitted to the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). These samples are published at NCBI under the BioProject ID 1090913 (https://www.ncbi.nlm.nih.gov/bioproject/1090913). The tables here include sample metadata and the NCBI URLs to each sample.

    These data were gathered as part of the Hubbard Brook Ecosystem Study (HBES). The HBES is a collaborative effort at the Hubbard Brook Experimental Forest, which is operated and maintained by the USDA Forest Service, Northern Research Station.
    
  9. Combined antibiogram dataset from NCBI, ENA, BV-BRC, and more

    • zenodo.org
    zip
    Updated Jul 4, 2025
    Cite
    Anton Pashkov; César Aguilar (2025). Combined antibiogram dataset from NCBI, ENA, BV-BRC, and more [Dataset]. http://doi.org/10.5281/zenodo.15809334
    Available download formats: zip
    Dataset updated
    Jul 4, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anton Pashkov; César Aguilar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The antibiograms.tsv.zip dataset collects antibiograms found in NCBI, ENA and BV-BRC. Each row corresponds to an antibiotic susceptibility test (AST) for a given sample against a specific antibiotic. The dataset is a table with 14 columns:

    1. biosample. A unique identifier for the sample from the NCBI BioSample database.
    2. sra_biosample. If given, a space-separated list of sample identifiers from the NCBI SRA Sample Database.
    3. species. The species the sample belongs to, and, in some cases, with subspecies information.
    4. antibiotic. The name of the antibiotic against which the sample is tested.
    5. phenotype. The interpreted phenotype from the AST standard used during testing.
    6. measurement_sign. If given, corresponds to the sign of the raw result from the AST. Its interpretation depends on the typing method.
    7. measurement_value. If given, corresponds to the value of the raw result from the AST. Its interpretation depends on the typing method.
    8. measurement_units. If given, corresponds to the units of the raw result from the AST.
    9. typing_method. Name of the technique used for AST.
    10. typing_platform. Name of the platform used for AST.
    11. standard. Testing standard used for the interpretation of the phenotype.
    12. genomes. Space-separated list of genome identifiers from the NCBI Genome Database (starting with GCA_ or GCF_), the BV-BRC Genome Database (starting with BVBRC_), or the ENA FTP Site (starting with ftp://ftp.sra.ebi.ac.uk/vol1/analysis/).
    13. reads. Space-separated list of read run identifiers from the NCBI SRA database.
    14. read_type. Space-separated list with the same length as the reads column, storing the type of read of each corresponding read run.
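
    Because reads and read_type are parallel space-separated lists, a reader will typically want to pair them up per row. The sketch below does that with the standard csv module on a tiny inline excerpt. The sample rows, accession values, and read-type labels are hypothetical and show only a subset of the 14 columns.

```python
import csv
import io

# Hypothetical two-row excerpt; values are invented for illustration only.
SAMPLE_TSV = (
    "biosample\tantibiotic\tphenotype\treads\tread_type\n"
    "SAMN00000001\tampicillin\tresistant\tSRR0000001 SRR0000002\tillumina nanopore\n"
    "SAMN00000002\tciprofloxacin\tsusceptible\t\t\n"
)


def pair_reads(row: dict) -> list:
    """Pair each read run identifier with its read type for one table row."""
    reads = row["reads"].split()
    types = row["read_type"].split()
    if len(reads) != len(types):
        raise ValueError(
            f"reads/read_type length mismatch for {row['biosample']}"
        )
    return list(zip(reads, types))


rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
paired = [pair_reads(row) for row in rows]
print(paired)
```

    Rows without any read runs yield an empty list, so downstream code can filter on that rather than on empty strings.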

    The gn-genomes.zip file contains some extra genomes with associated AST metadata found in metadata.xlsx file within it.

  10. Mycobacterium Genome sequencing

    • data.niaid.nih.gov
    Updated Dec 1, 2021
    Cite
    (2021). Mycobacterium Genome sequencing [Dataset]. https://data.niaid.nih.gov/resources?id=ncbi_sra_srp053287
    Dataset updated
    Dec 1, 2021
    Description

    United States Department of Agriculture Mycobacterium Diagnostics

  11. "Mapping information-rich genotype-phenotype landscapes with genome-scale...

    • plus.figshare.com
    txt
    Updated May 30, 2023
    Cite
    Joseph Replogle; Jonathan Weissman (2023). "Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq" Replogle et al. 2022 SRA and GEO file manifest [Dataset]. http://doi.org/10.25452/figshare.plus.20022944.v2
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare+
    Authors
    Joseph Replogle; Jonathan Weissman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These tables list the raw fastq files (found on SRA under BioProject PRJNA831566 https://www.ncbi.nlm.nih.gov/bioproject/PRJNA831566) and associated BAM/MTX files for each lane/gemgroup of single-cell RNA-sequencing.

    In these manifest files, all FASTQ files associated with a single lane/gemgroup of 10X Genomics single-cell RNA-sequencing are listed in a single row. These files can be used as input for an aligner such as cellranger count, starsolo, or kallisto-bustools.

    The "gemgroup" field indicates the integer value which is assigned to this gemgroup for aggregation of single-cell RNA-sequencing data in the finalized h5ad dataset.

  12. Catalog of GenBank sequence read archive (SRA) entries of metagenomic DNA...

    • portal.edirepository.org
    • search-ucsb-1.dataone.org
    • +2more
    csv
    Updated Apr 6, 2021
    Cite
    Byron C Crump; Colleen TE Kellogg; Kristina Baker; James W McClelland; Kenneth H Dunton (2021). Catalog of GenBank sequence read archive (SRA) entries of metagenomic DNA sequence analyses of bacterial and archaeal water column communities along the Eastern Beaufort Sea coast, North Slope, Alaska, 2012 [Dataset]. http://doi.org/10.6073/pasta/6bfe32f7eb95a886ee5af8f099634b3c
    Available download formats: csv (17707 bytes)
    Dataset updated
    Apr 6, 2021
    Dataset provided by
    EDI
    Authors
    Byron C Crump; Colleen TE Kellogg; Kristina Baker; James W McClelland; Kenneth H Dunton
    License

    https://spdx.org/licenses/CC0-1.0

    Time period covered
    Apr 17, 2012 - Aug 15, 2012
    Variables measured
    run, bases, depth, bytes_b, latitude, run_link, biosample, env_biome, longitude, site_name, and 14 more
    Description

    In contrast to temperate systems, Arctic lagoons that span the Alaska Beaufort Sea coast face extreme seasonality. Nine months of ice cover up to ∼1.7 m thick is followed by a spring thaw that introduces an enormous pulse of freshwater, nutrients, and organic matter into these lagoons over a relatively brief 2–3 week period. Prokaryotic communities link these subsidies to lagoon food webs through nutrient uptake, heterotrophic production, and other biogeochemical processes, but little is known about how the genomic capabilities of these communities respond to seasonal variability. This study characterizes the metabolic capabilities of microbial communities across three seasons in two lagoons and one open coastal site along the eastern Alaska Beaufort Sea coast. We used metagenomic DNA sequence data of bacterial and archaeal water column communities to identify genes of relevant biogeochemical pathways.

    This data package catalogs sequence read archive (SRA) entries available through GenBank BioProject PRJNA642637 at https://www.ncbi.nlm.nih.gov/bioproject/PRJNA642637. This data package is associated with the following publication:

    Baker, Kristina D., Colleen T. E. Kellogg, James W. McClelland, Kenneth H. Dunton, and Byron C. Crump. “The Genomic Capabilities of Microbial Communities Track Seasonal Variation in Environmental Conditions of Arctic Lagoons.” Frontiers in Microbiology 12 (2021). https://doi.org/10.3389/fmicb.2021.601901.

    Environmental variables (physiochemical data from YSI and HOBO data loggers, as well as organic matter analysis and stable isotope data from discrete water samples) associated with this genomic dataset are available from the Arctic Data Center:

    Kenneth Dunton, Byron Crump, and James McClelland. Physical, chemical, and biological data from lagoons and open coastal waters in the nearshore environment of the eastern Alaska Beaufort Sea, 2011-2013. Arctic Data Center. doi:10.18739/A2DG13.

    To join the two datasets together, please use the provided site codes (column "site_name" here) and collection dates (column "collection_date" here) in each dataset.
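
    The join described above can be sketched with plain dictionaries keyed on (site_name, collection_date). The column names come from this catalog; whether the Arctic Data Center tables use the same names is an assumption, and all row values below are invented for illustration.

```python
def join_on_site_and_date(catalog_rows, env_rows):
    """Inner-join two lists of dict rows on (site_name, collection_date)."""
    env_index = {
        (row["site_name"], row["collection_date"]): row for row in env_rows
    }
    joined = []
    for row in catalog_rows:
        match = env_index.get((row["site_name"], row["collection_date"]))
        if match is not None:
            # Catalog values take precedence if both rows share a column name
            joined.append({**match, **row})
    return joined


# Hypothetical rows: site codes, run IDs, and measurements are placeholders
catalog = [
    {"run": "SRR0000001", "site_name": "SITE_A", "collection_date": "2012-04-17"},
    {"run": "SRR0000002", "site_name": "SITE_B", "collection_date": "2012-08-15"},
]
environment = [
    {"site_name": "SITE_A", "collection_date": "2012-04-17", "salinity": "30.1"},
]

joined = join_on_site_and_date(catalog, environment)
print(joined)
```

    In practice the date columns may need normalizing to a common format before they can be used as join keys.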

    Instead of citing this package, which is a catalog, please cite the original GenBank data, journal article, or related Arctic Data Center dataset as appropriate. Citation guidance for the journal article and related Arctic Data Center dataset is available on the respective publishers' websites.

  13. Data from: Metagenomic and near full-length 16S rRNA sequence data in...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +2more
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Data from: Metagenomic and near full-length 16S rRNA sequence data in support of the phylogenetic analysis of the rumen bacterial community in steers [Dataset]. https://catalog.data.gov/dataset/data-from-metagenomic-and-near-full-length-16s-rrna-sequence-data-in-support-of-the-phylog-07c7d
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    Amplicon sequencing utilizing next-generation platforms has significantly transformed how research is conducted, particularly in microbial ecology. However, primer and sequencing platform biases can confound or change the way scientists interpret these data. The Pacific Biosciences RSII instrument may also preferentially load smaller fragments, which may also be a function of PCR product exhaustion during sequencing. To further examine these biases, data are provided from 16S rRNA rumen community analyses. Specifically, data from the relative phylum-level abundances for the ruminal bacterial community are provided to determine between-sample variability. Direct sequencing of metagenomic DNA was conducted to circumvent primer-associated biases in 16S rRNA reads, and rarefaction curves were generated to demonstrate adequate coverage of each amplicon. PCR products were also subjected to reduced amplification and pooling to reduce the likelihood of PCR product exhaustion during sequencing on the Pacific Biosciences platform. The taxonomic profiles for the relative phylum-level and genus-level abundance of rumen microbiota as a function of PCR pooling for sequencing on the Pacific Biosciences RSII platform are provided. Data are within this article, and raw ruminal MiSeq sequence data are available from the NCBI Sequence Read Archive (SRA Accession SRP047292). Additional descriptive information is associated with NCBI BioProject PRJNA261425: http://www.ncbi.nlm.nih.gov/bioproject/PRJNA261425/

    Resources in this dataset: Resource Title: NCBI Sequence Read Archive (SRA Accession SRP047292). File Name: Web Page, url: https://www.ncbi.nlm.nih.gov/sra/SRX704260 1 ILLUMINA (Illumina MiSeq) run: 978,195 spots, 532.9M bases, 311.6Mb downloads.

  14. Genome assemblies and respective wg/cgMLST profiles of a diverse dataset...

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Jul 24, 2023
    Cite
    João Paulo Gomes (2023). Genome assemblies and respective wg/cgMLST profiles of a diverse dataset comprising 1,999 Escherichia coli isolates [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7120057
    Dataset updated
    Jul 24, 2023
    Dataset provided by
    Daniel Sobral
    Miguel Pinto
    Carlus Deneke
    Holger Brendebach
    Verónica Mixão
    João Paulo Gomes
    Vítor Borges
    Simon Tausch
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset comprises the genome assemblies and respective 7,601-loci whole-genome (wg) Multiple Locus Sequence Type (MLST) profiles [INNUENDO schema (Llarena et al. 2018) available in chewie-NS (Mamede et al. 2022)] of a final set of 1,999 Escherichia coli samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) at the beginning of the analysis (November 2021). This set of samples was carefully selected to cover a wide genetic diversity (assessed in terms of serotype). In total, 411 different serotypes are represented in this dataset, with O157:H7 being the most represented one, corresponding to 37.1% of the dataset.

    File “Ec_metadata.xlsx” contains metadata information for each isolate, including ENA/SRA accession number, BioProject and in-silico MLST ST and serotype.

    The directory “assemblies/” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.

    The file “profiles/Ec_profiles_wgMLST.tsv” corresponds to a tab separated file with the 7,601-loci wgMLST profiles of each isolate presented in the metadata file. The files “profiles/Ec_profiles_cgMLST_95.tsv”, “profiles/Ec_profiles_cgMLST_98.tsv” and “profiles/Ec_profiles_cgMLST_100.tsv” correspond to a 2,826-loci, 2,704-loci and 465-loci cgMLST profiles of each isolate presented in the metadata file, respectively. These profiles were determined as explained below.

    Dataset selection and curation

    With the objective of creating a diverse dataset of E. coli genome assemblies, we collected information about the genetic diversity (serotype) of the isolates available at Enterobase database in the beginning of this analysis (November 2021) and in other previous works. Based on this information, we selected an initial dataset comprising 2,688 samples associated with three BioProjects (PRJNA230969, PRJEB27020 and PRJNA248042). Their WGS data was downloaded from ENA/SRA with fastq-dl v1.0.6. Read quality control, trimming and assembly were performed with the Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the Metadata file). In total, 1,999 isolates passed this curation step and were included in the final dataset. In-silico serotyping was performed with seq_typing v2.2. wgMLST profiles of each of these isolates were determined with chewBBACA v2.8.5 (Silva et al. 2018), using the 7,601-loci INNUENDO schema available in chewie-NS (Llarena et al. 2018; Mamede et al. 2022) and downloaded on May 31st, 2022. Three cgMLST schemas were obtained with ReporTree v1.0.0 (Mixão et al. 2022) using the 7,601-loci wgMLST profiles of the 1,999 isolates as input and setting distinct “--site-inclusion” thresholds: 0.95, 0.98 and 1.0 (i.e., keep schema loci called in at least 95%, 98% and 100% of the samples, resulting in a 2,826-loci, 2,704-loci and 465-loci allelic matrices, respectively).
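
    The thresholding idea behind the three cgMLST schemas (keep a locus if it is called in at least a given fraction of samples) can be illustrated with a small sketch. This is not ReporTree's implementation; the missing-call marker "0", the function name, and the toy profiles are assumptions made for the example.

```python
def core_loci(profiles, threshold, missing="0"):
    """Return loci called in at least `threshold` fraction of samples.

    profiles maps sample name -> {locus name -> allele call}; a call equal
    to `missing` is treated as not called (an assumed convention).
    """
    n_samples = len(profiles)
    loci = list(next(iter(profiles.values())).keys())
    kept = []
    for locus in loci:
        called = sum(1 for calls in profiles.values() if calls[locus] != missing)
        if called / n_samples >= threshold:
            kept.append(locus)
    return kept


# Toy wgMLST-like profiles: 4 samples, 3 loci (all values hypothetical)
toy_profiles = {
    "sample1": {"locus_a": "1", "locus_b": "3", "locus_c": "0"},
    "sample2": {"locus_a": "2", "locus_b": "3", "locus_c": "5"},
    "sample3": {"locus_a": "1", "locus_b": "0", "locus_c": "0"},
    "sample4": {"locus_a": "4", "locus_b": "7", "locus_c": "5"},
}

strict = core_loci(toy_profiles, 1.0)
relaxed = core_loci(toy_profiles, 0.75)
print(strict, relaxed)
```

    Lowering the threshold can only add loci, which matches the pattern in the text where the 100% schema (465 loci) is far smaller than the 98% and 95% schemas.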

  15. Data from: Variation in genomic vulnerability to climate change across...

    • search.dataone.org
    • catalogue.arctic-sdi.org
    • +3more
    Updated Mar 5, 2024
    Cite
    Nicholas Jeffery; Benedikte Vercaemer; Ryan Stanley; Tony Kess; France Dufresne; Fanny Noisette; Mary O'Connor; Melisa Wong (2024). Variation in genomic vulnerability to climate change across temperate populations of eelgrass (Zostera marina) [Dataset]. http://doi.org/10.5061/dryad.xpnvx0kp2
    Dataset updated
    Mar 5, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Nicholas Jeffery; Benedikte Vercaemer; Ryan Stanley; Tony Kess; France Dufresne; Fanny Noisette; Mary O'Connor; Melisa Wong
    Time period covered
    Jan 1, 2024
    Description

    A global decline in seagrass populations has led to renewed calls for their conservation as important providers of biogenic and foraging habitat, shoreline stabilisation, and carbon storage. Eelgrass (Zostera marina) occupies the largest geographic range among seagrass species spanning a commensurately broad spectrum of environmental conditions. In Canada, eelgrass is managed as a single phylogroup despite occurring across three oceans and a range of ocean temperatures and salinity gradients. Previous research has focused on applying relatively few markers to reveal population structure of eelgrass, whereas a whole genome approach is warranted to investigate cryptic structure among populations inhabiting different ocean basins and localized environmental conditions. We used a pooled whole-genome re-sequencing approach to characterise population structure, gene flow, and environmental associations of 23 eelgrass populations ranging from the Northeast United States, to Atlantic, subarctic...

    We generated allele frequencies for 23 Zostera marina populations across North America using a pooled whole-genome sequencing approach (poolseq). Individual shoots of eelgrass were collected from plants at least 2 metres apart in the field, to minimize the potential presence of clones in the data. Genomic DNA was extracted from all individuals and pooled at the population level for sequencing on an Illumina NovaSeq platform at Genome Quebec (Canada), and SNPs were called following the GATK pipeline. Analyses were conducted with Popoolation2 and R Studio. R Studio was used for all genomic-environmental association analyses, including redundancy analyses, and calculating genomic offset sensu Capblancq and Forester (2021).

    Variation in genomic vulnerability to climate change across temperate populations of eelgrass (Zostera marina)

    https://doi.org/10.5061/dryad.xpnvx0kp2

    The data herein include sample site metadata (GPS coordinates, collection dates, personnel, and site names and codes), environmental data used for genomic-environment association analyses (redundancy analyses), as well as pairwise Fst matrices for all sampling sites. All raw DNA sequences for all 23 sampling locations are available from the National Center for Biotechnology Information (https://www.ncbi.nlm.nih.gov/sra/PRJNA891275).

    Description of the data and file structure

    The primary data in this project are the raw fastq DNA files deposited in NCBI for each of 23 populations sampled. As these fastq files represent pooled genomic DNA sequences per population, there are no individual genotypes, but rather can be processed into populat...

  16. Scallop Genome sequencing and assembly

    • data.niaid.nih.gov
    Updated Aug 4, 2022
    Cite
    (2022). Scallop Genome sequencing and assembly [Dataset]. https://data.niaid.nih.gov/resources?id=ncbi_sra_srp364753
    Dataset updated
    Aug 4, 2022
    Description

    This research is part of ongoing M10K+ genome project that is proposed by M10K+ Consortium and targets for sequencing 10K molluscan genomes.

  17. Chromosome assembly and preliminary gene and repeat annotations for Myzomela...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Jul 27, 2024
    Cite
    Elsie Shogren; Jason Sardell; Christina Muirhead; Emiliano Martí; Elizabeth Cooper; Robert Moyle; Daven Presgraves; Albert J. Uy (2024). Chromosome assembly and preliminary gene and repeat annotations for Myzomela tristrami reference genome [Dataset]. http://doi.org/10.5061/dryad.612jm64c9
    Available download formats: zip
    Dataset updated
    Jul 27, 2024
    Dataset provided by
    University of North Carolina at Charlotte
    University of Rochester
    University of Kansas
    PrecisionLife Ltd.
    Authors
    Elsie Shogren; Jason Sardell; Christina Muirhead; Emiliano Martí; Elizabeth Cooper; Robert Moyle; Daven Presgraves; Albert J. Uy
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Secondary contact between closely related taxa represents a “moment of truth” for speciation — an opportunity to test the efficacy of reproductive isolation that evolved in allopatry and to identify the genetic, behavioral, and/or ecological barriers that separate species in sympatry. Sex chromosomes are known to rapidly accumulate differences between species, an effect that may be exacerbated for neo-sex chromosomes that are transitioning from autosomal to sex-specific inheritance. Here we report that, in the Solomon Islands, two closely related bird species in the honeyeater family — Myzomela cardinalis and Myzomela tristrami — carry neo-sex chromosomes and have come into recent secondary contact after ~1.1 my of geographic isolation. Hybrids of the two species were first observed in sympatry ~100 years ago. To determine the genetic consequences of hybridization, we use population genomic analyses of individuals sampled in allopatry and in sympatry to characterize gene flow in the contact zone. Using genome-wide estimates of diversity, differentiation, and divergence, we find that the degree and direction of introgression varies dramatically across the genome. For sympatric birds, autosomal introgression is bidirectional, with phenotypic hybrids and phenotypic parentals of both species showing admixed ancestry. In other regions of the genome, however, the story is different. While introgression on the Z/neo-Z-linked sequence is limited, introgression of W/neo-W regions and mitochondrial sequence (mtDNA) is highly asymmetric, moving only from the invading M. cardinalis to the resident M. tristrami. The recent hybridization between these species has thus enabled gene flow in some genomic regions but the interaction of admixture, asymmetric mate choice, and/or natural selection has led to the variation in the amount and direction of gene flow at sex-linked regions of the genome. Methods This data repository contains Myzomela tristrami reference genome files. 
    The sequences associated with this assembly are available in the NCBI Sequence Read Archive at https://www.ncbi.nlm.nih.gov/sra/?term=SRA%20SRR29254783. We sequenced a M. tristrami female at the University of Delaware DNA Sequencing & Genotyping Center. HiFi libraries were prepared with the SMRTbell prep kit, followed by Blue Pippin size selection (15-20 kbp) before sequencing on a PacBio Sequel IIe. We generated a de novo assembly from the resulting long reads using hifiasm v0.13-r308 with default parameters (Cheng et al. 2021, 2022). We used GeMoMa (v1.8) and the annotation from zebra finch genome bTaeGut1.4.pri to infer a rough annotation of genes in the Myzomela genome. We then used these rough annotations, comparing contigs against both the zebra finch genome and the chicken genome bGalGal1.mat.broiler.GRCg7b, to infer synteny relationships, remove duplicate haplotigs, and, finally, scaffold contigs into chromosomes in Myzomela. The resulting assembly uses the zebra finch numbering system for chromosomes 1-29; chromosomes 30-40 were named in descending order of size. Final chromosomes and contigs were aligned with those of related species, the helmeted honeyeater (Lichenostomus melanops cassidix) and the blue-faced honeyeater (Entomyzon cyanotis), using Mauve (version 2015-02-25), and visualized using FastANI (v1.33) (Darling et al. 2004, Jain et al. 2018, Robledo-Ruiz et al. 2022, Burley et al. 2023). We generated repetitive DNA libraries using the RepeatModeler v2 pipeline (Flynn et al. 2020). RepeatModeler employs a combination of de novo and homology-based characterization of different classes of repeats. The repeat library was annotated and combined with Repbase and manually curated repeat libraries from other studies (Suh et al. 2018, Boman et al. 2019, Weissensteiner et al. 2020, Peona et al. 2021). We then used RepeatMasker (v4.1.0) to identify and mask repetitive regions of the genome (Smit et al. 2013).

    Boman, J., C. Frankl-Vilches, M. D. S. dos Santos, E. H. C. de Oliveira, M. Gahr, and A. Suh. 2019. The genome of Blue-capped Cordon-bleu uncovers hidden diversity of LTR retrotransposons in Zebra Finch. Genes 10.
    Burley, J. T., S. C. M. Orzechowski, S. Y. W. Sin, and S. V. Edwards. 2023. Whole-genome phylogeography of the blue-faced honeyeater (Entomyzon cyanotis) and discovery and characterization of a neo-Z chromosome. Molecular Ecology 32:1248–1270.
    Cheng, H., G. T. Concepcion, X. Feng, H. Zhang, and H. Li. 2021. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods 18:170–175.
    Cheng, H., E. D. Jarvis, O. Fedrigo, K. P. Koepfli, L. Urban, N. J. Gemmell, and H. Li. 2022. Haplotype-resolved assembly of diploid genomes without parental data. Nature Biotechnology 40:1332–1335.
    Darling, A. C. E., B. Mau, F. R. Blattner, and N. T. Perna. 2004. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Research 14:1394–1403.
    Flynn, J. M., R. Hubley, C. Goubert, J. Rosen, A. G. Clark, C. Feschotte, and A. F. Smit. 2020. RepeatModeler2 for automated genomic discovery of transposable element families. Proceedings of the National Academy of Sciences of the United States of America 117:9451–9457.
    Jain, C., L. M. Rodriguez-R, A. M. Phillippy, K. T. Konstantinidis, and S. Aluru. 2018. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nature Communications 9:5114.
    Peona, V., O. M. Palacios-Gimenez, J. Blommaert, J. Liu, T. Haryoko, K. A. Jønsson, M. Irestedt, Q. Zhou, P. Jern, and A. Suh. 2021. The avian W chromosome is a refugium for endogenous retroviruses with likely effects on female-biased mutational load and genetic incompatibilities. Philosophical Transactions of the Royal Society B: Biological Sciences 376.
    Robledo-Ruiz, D. A., H. M. Gan, P. Kaur, O. Dudchenko, D. Weisz, R. Khan, E. Lieberman Aiden, E. Osipova, M. Hiller, H. E. Morales, M. J. L. Magrath, R. H. Clarke, P. Sunnucks, and A. Pavlova. 2022. Chromosome-length genome assembly and linkage map of a critically endangered Australian bird: the helmeted honeyeater. GigaScience 11:giac025.
    Smit, A., R. Hubley, and P. Green. 2013, 2015. RepeatMasker Open-4.0.
    Suh, A., L. Smeds, and H. Ellegren. 2018. Abundant recent activity of retrovirus-like retrotransposons within and among flycatcher species implies a rich source of structural variation in songbird genomes. Molecular Ecology 27:99–111.
    Weissensteiner, M. H., I. Bunikis, A. Catalán, K. J. Francoijs, U. Knief, W. Heim, V. Peona, S. D. Pophaly, F. J. Sedlazeck, A. Suh, V. M. Warmuth, and J. B. W. Wolf. 2020. Discovery and population genomics of structural variation in a songbird genus. Nature Communications 11:1–11.

  18. Genome assemblies and respective cgMLST profiles of a diverse dataset...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jul 24, 2023
    Verónica Mixão; Holger Brendebach; Miguel Pinto; Daniel Sobral; João Paulo Gomes; Carlus Deneke; Simon Tausch; Vítor Borges (2023). Genome assemblies and respective cgMLST profiles of a diverse dataset comprising 1,874 Listeria monocytogenes isolates [Dataset]. http://doi.org/10.5281/zenodo.7116879
    Available download formats: zip
    Dataset updated
    Jul 24, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Verónica Mixão; Holger Brendebach; Miguel Pinto; Daniel Sobral; João Paulo Gomes; Carlus Deneke; Simon Tausch; Vítor Borges
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset comprises the genome assemblies and respective 1,748-loci core-genome (cg) Multiple Locus Sequence Type (MLST) profiles [Pasteur schema (Moura et al. 2016) available in chewie-NS (Mamede et al. 2022)] of a final set of 1,874 Listeria monocytogenes samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) at the beginning of the analysis (November 2021). This set of samples was carefully selected to cover a wide genetic diversity (assessed in terms of Sequence Type [ST]). In total, 204 different STs are represented in this dataset, with ST121, ST6, ST9, ST1 and ST155 being in the top 5 and, together, corresponding to 37.9% of the dataset.

    File “Lm_metadata.xlsx” contains metadata information for each isolate, including ENA/SRA accession number, BioProject and in-silico MLST ST.

    The directory “assemblies/” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.

    The file “profiles/Lm_profile.tsv” corresponds to a tab separated file with the 1,748-loci cgMLST profile of each isolate presented in the metadata file. These profiles were determined as explained below.
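    As a minimal sketch of how such a tab-separated profile file might be read, the Python snippet below assumes the first column holds the isolate identifier and every remaining column holds the allele call for one locus. That layout, the column names, and the `read_profiles` helper are assumptions for illustration; they are not taken from the dataset documentation.

```python
import csv
import io


def read_profiles(handle):
    """Read a tab-separated allelic profile table.

    Assumed layout (hypothetical): header row with an ID column followed by
    locus names; each later row is one isolate and its allele calls.
    Returns (list of locus names, dict isolate -> dict locus -> call).
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    loci = header[1:]
    profiles = {}
    for row in reader:
        if not row:  # skip blank lines
            continue
        profiles[row[0]] = dict(zip(loci, row[1:]))
    return loci, profiles


# Tiny in-memory stand-in for a real profile file (made-up values).
demo = io.StringIO(
    "FILE\tlocus_1\tlocus_2\n"
    "isolate_A\t1\t4\n"
    "isolate_B\t1\t7\n"
)
loci, profiles = read_profiles(demo)
```

    For a real file, the same function could be called on `open("profiles/Lm_profile.tsv", newline="")`, provided the file follows the assumed layout.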

    Dataset selection and curation

    With the objective of creating a diverse dataset of L. monocytogenes genome assemblies, we collected information about the genetic diversity (STs) of the isolates available in the BIGSdb-Lm database at the beginning of this analysis (November 2021) and in other previous works. Based on this information, we selected an initial dataset comprising 1,957 samples associated with three previous studies (Moura et al. 2016; Maury et al. 2017; Painset et al. 2019). Their WGS data were downloaded from ENA/SRA with fastq-dl v1.0.6. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the remaining assemblies, we noticed that a considerable proportion was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated into the final dataset (those samples are labeled in the metadata file). In total, 1,874 isolates passed the dataset curation step and were included in the final dataset. cgMLST profiles of each of these isolates were determined with chewBBACA v2.8.5 (Silva et al. 2018), using the 1,748-loci Pasteur schema (Moura et al. 2016) available in chewie-NS (Mamede et al. 2022) and downloaded on June 23rd, 2022.
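    The recovery rule described above can be sketched in Python as follows. This is only an illustration of the stated logic (keep assemblies that pass QC, and recover those that fail solely on “NumContamSNVs” while having more than 98% of reads assigned to the correct species). The `AssemblyQC` record and its field names are hypothetical and do not reflect the actual output format of the pipeline, and the authors applied the rule after manual inspection rather than as an automated filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssemblyQC:
    """Hypothetical per-assembly QC summary (field names are assumptions)."""
    sample: str
    failed_checks: frozenset  # names of failed QC parameters; empty if QC passed
    pct_reads_correct_species: float  # percentage, 0-100


def keep_in_final_dataset(qc, min_pct=98.0):
    """Return True if the assembly would be kept under the described rule."""
    if not qc.failed_checks:
        return True  # passed QC outright
    # Recover only assemblies that failed exclusively on NumContamSNVs
    # and whose reads are >98% from the expected species.
    only_contam = qc.failed_checks == frozenset({"NumContamSNVs"})
    return only_contam and qc.pct_reads_correct_species > min_pct


examples = [
    AssemblyQC("s1", frozenset(), 99.5),
    AssemblyQC("s2", frozenset({"NumContamSNVs"}), 99.1),
    AssemblyQC("s3", frozenset({"NumContamSNVs"}), 97.0),
    AssemblyQC("s4", frozenset({"NumContamSNVs", "N50"}), 99.9),
]
kept = [qc.sample for qc in examples if keep_in_final_dataset(qc)]
```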

  19. Pseudomonas sp. HOU2 spreadsheet of predicted features

    • figshare.com
    xls
    Updated Jul 22, 2024
    Van Hong Thi Dao; Son Truong Dinh (2024). Pseudomonas sp. HOU2 spreadsheet of predicted features [Dataset]. http://doi.org/10.6084/m9.figshare.26325049.v1
    Available download formats: xls
    Dataset updated
    Jul 22, 2024
    Dataset provided by
    figshare
    Authors
    Van Hong Thi Dao; Son Truong Dinh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The HOU2 spreadsheet lists the predicted features of the complete genome of Pseudomonas sp. HOU2, as annotated by RAST (Rapid Annotation using Subsystem Technology) (https://rast.nmpdr.org/) on 18 July 2024 with the following selected options:

    • Genetic code: 11
    • Annotation scheme: RASTtk
    • Preserve gene calls: no
    • Automatically fix errors: yes
    • Fix frameshifts: yes
    • Backfill gaps: yes

    The NCBI Sequence Read Archive accession for Pseudomonas sp. HOU2 is SRR29666724 (https://www.ncbi.nlm.nih.gov/sra/SRR29666724). The NCBI complete genome of Pseudomonas sp. HOU2 is CP160398.1 (https://www.ncbi.nlm.nih.gov/nuccore/CP160398).

  20. Genome assemblies and respective wg/cgMLST profiles of a diverse dataset...

    • zenodo.org
    • data.niaid.nih.gov
    xlsx, zip
    Updated Sep 28, 2022
    Verónica Mixão; Holger Brendebach; Miguel Pinto; Daniel Sobral; João Paulo Gomes; Carlus Deneke; Simon Tausch; Vítor Borges (2022). Genome assemblies and respective wg/cgMLST profiles of a diverse dataset comprising 1,434 Salmonella enterica isolates [Dataset]. http://doi.org/10.5281/zenodo.7230091
    Available download formats: zip, xlsx
    Dataset updated
    Sep 28, 2022
    Dataset provided by
    Genomics and Bioinformatics Unit, Department of Infectious Diseases, National Institute of Health Doutor Ricardo Jorge (INSA), Lisbon, Portugal
    Department Biological Safety, German Federal Institute for Risk Assessment, Berlin, Germany
    Authors
    Verónica Mixão; Holger Brendebach; Miguel Pinto; Daniel Sobral; João Paulo Gomes; Carlus Deneke; Simon Tausch; Vítor Borges
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset comprises the genome assemblies and respective 8,558-loci whole-genome (wg) Multiple Locus Sequence Type (MLST) profiles [INNUENDO schema (Llarena et al. 2018) available in chewie-NS (Mamede et al. 2022)] of a final set of 1,434 Salmonella enterica samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) at the beginning of the analysis (November 2021). This set of samples was carefully selected to cover a wide genetic diversity (assessed in terms of serotype). In total, 125 different serotypes are represented in this dataset, with Typhimurium (including monophasic), Enteritidis and Infantis being the most represented ones and, together, corresponding to 56.2% of the dataset.

    File “Se_metadata.xlsx” contains metadata information for each isolate, including ENA/SRA accession number, BioProject and in-silico MLST ST and serotype.

    The directory “assemblies/” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.

    The file “profiles/Se_profiles_wgMLST.tsv” is a tab-separated file with the 8,558-loci wgMLST profile of each isolate presented in the metadata file. The files “profiles/Se_profiles_cgMLST_95.tsv”, “profiles/Se_profiles_cgMLST_98.tsv” and “profiles/Se_profiles_cgMLST_100.tsv” contain the 3,261-loci, 3,179-loci and 874-loci cgMLST profiles of each isolate presented in the metadata file, respectively. These profiles were determined as explained below.

    Dataset selection and curation

    With the objective of creating a diverse dataset of S. enterica genome assemblies, we collected information about the genetic diversity (serotype) of the isolates available in the Enterobase database at the beginning of this analysis (November 2021) and in other previous works. Based on this information, we selected an initial dataset comprising 1,779 samples associated with four BioProjects (PRJEB16326, PRJEB20997, PRJEB30335 and PRJEB39988). Their WGS data were downloaded from ENA/SRA with fastq-dl v1.0.6. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the remaining assemblies, we noticed that a considerable proportion was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated into the final dataset (those samples are labeled in the metadata file). In total, 1,434 isolates passed this curation step and were included in the final dataset. In-silico serotyping was performed with SeqSero2 v1.2.1 (Zhang et al. 2019). wgMLST profiles of each of these isolates were determined with chewBBACA v2.8.5 (Silva et al. 2018), using the 8,558-loci INNUENDO schema available in chewie-NS (Llarena et al. 2018; Mamede et al. 2022) and downloaded on May 31st, 2022. Three cgMLST schemas were obtained with ReporTree v1.0.0 (Mixão et al. 2022) using the 8,558-loci wgMLST profiles of the 1,434 isolates as input and setting distinct “--site-inclusion” thresholds: 0.95, 0.98 and 1.0 (i.e., keeping schema loci called in at least 95%, 98% and 100% of the samples, resulting in 3,261-loci, 3,179-loci and 874-loci allelic matrices, respectively).
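    The site-inclusion idea (keep only loci called in at least a given fraction of samples) can be illustrated with a short Python sketch. This is not ReporTree's implementation; the function name, the in-memory data structure, and the use of "0" as the missing-call code are assumptions made for this example.

```python
def loci_passing_site_inclusion(profiles, threshold, missing="0"):
    """Return loci called in at least `threshold` (0-1) of the samples.

    profiles: dict sample -> dict locus -> allele call (strings).
    missing:  value treated as "not called" (assumed to be "0" here).
    """
    samples = list(profiles)
    if not samples:
        return []
    loci = list(profiles[samples[0]])  # locus order taken from the first sample
    n_samples = len(samples)
    kept = []
    for locus in loci:
        called = sum(
            1 for s in samples if profiles[s].get(locus, missing) != missing
        )
        if called / n_samples >= threshold:
            kept.append(locus)
    return kept


# Toy wgMLST-like matrix (made-up values): locus A is called in 4/4 samples,
# B in 3/4 and C in 2/4.
toy = {
    "s1": {"A": "1", "B": "2", "C": "5"},
    "s2": {"A": "1", "B": "3", "C": "0"},
    "s3": {"A": "2", "B": "0", "C": "0"},
    "s4": {"A": "2", "B": "2", "C": "6"},
}
strict = loci_passing_site_inclusion(toy, 1.0)
relaxed = loci_passing_site_inclusion(toy, 0.75)
```

    Raising the threshold towards 1.0 shrinks the retained schema, which matches the direction of the locus counts reported above for the 0.95, 0.98 and 1.0 settings.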
