The Trace Archives includes the following archives: The Sequence Read Archive (SRA) stores raw sequence data from "next-generation" sequencing technologies including 454, IonTorrent, Illumina, SOLiD, Helicos and Complete Genomics. In addition to raw sequence data, SRA now stores alignment information in the form of read placements on a reference sequence. The Trace Archive serves as the repository of sequencing data from gel/capillary platforms such as Applied Biosystems ABI 3730. The Trace Assembly Archive stores pairwise alignment and multiple alignment of sequencing reads, linking basic trace data with finished genomic sequence as found in GenBank.
This project was designed to describe fine-scale population genetic differentiation of the stream salamander Gryinophilus porphyriticus among five study streams in the Hubbard Brook Experimental Forest. The data are paired with intensive capture-recapture data to assess direct fitness effects of individual genetic diversity, including effects of individual multilocus heterozygosity on stage-specific survival probabilities.
This dataset publishes a manifest of the genomic sequence reads submitted to the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). These samples are published at NCBI under the BioProject ID 1090913 (https://www.ncbi.nlm.nih.gov/bioproject/1090913). The tables here include sample metadata and the NCBI URLs to each sample.
These data were gathered as part of the Hubbard Brook Ecosystem Study (HBES). The HBES is a collaborative effort at the Hubbard Brook Experimental Forest, which is operated and maintained by the USDA Forest Service, Northern Research Station.
These data include sample information and accession links to sequence data at The National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA).
This data submission consists of metabarcoding data for the zooplankton community in the epipelagic, mesopelagic and upper bathypelagic zones (0-1500m) of the North Pacific Subtropical Gyre. The goal of this study was to assess the hidden diversity present in zooplankton assemblages in midwaters, and detect vertical gradients in species richness, depth distributions, and community composition of the full zooplankton assemblage. Samples were collected in June 2014 from Station ALOHA (22.75, -158) using a 1 meter square Multiple Opening and Closing Nets and Environmental Sampling System (MOCNESS, 200um mesh), on R/V Falkor cruise FK140613. Next generation sequence data (Illumina MiSeq, V3 chemistry, 300-bp paired-end) of the zooplankton assemblage derive from amplicons of the V1-V2 region of 18S rRNA (primers described in Fonseca et al. 2010). The data includes sequences and read count abundance information for molecular OTUs from both holoplanktonic and meroplanktonic taxa
Related dataset containing OTU tables and fasta sequences (representative / most abundance read for each OTU):
Metabarcoding zooplankton at station ALOHA: OTU tables and fasta files
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset
This dataset comprises the genome assemblies of 308 Escherichia coli samples collected by the BeONE Consortium on behalf of the One Health European Joint Programme “BeONE: Building Integrative Tools for One Health Surveillance” (https://onehealthejp.eu/jrp-beone/). Additionally, a complementary dataset is also made available (https://zenodo.org/record/7120057), comprising genome assemblies of 1,999 E. coli samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA).
File “BeONE_Ec_metadata.xlsx” contains the genome assembly statistics for each isolate, including European Nucleotide Archive accession numbers, in-silico Multi Locus Sequence Type and Serotype.
The archive “BeONE_Ec_assemblies.zip” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.
Dataset selection and curation
This anonymized dataset of E. coli genome assemblies was generated using Next Generation Sequencing data collected within the BeONE Consortium available at the European Nucleotide Archive under BioProject Accession Number PRJEB57098. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the Metadata file). In total, 308 isolates passed the dataset curation step and were included in the final dataset. In-silico serotyping was performed with seq_typing v2.2.
Funding
This work was supported by funding from the European Union’s Horizon 2020 Research and Innovation programme under grant agreement No 773830: One Health European Joint Programme.
This dataset includes sample collection information and sequence accessions at the National Center for Biotechnology Information (NCBI) for whole genome sequencing of eelgrass (Zostera marina) collected at Bodega and Tomales Bay, California, USA from July and September of 2019. Sequence Read Archive (SRA) Experiments and BioSamples can be accessed from the NCBI BioProject PRJNA887384 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA887384/).
Results summary as described in Scheibelhut, et al. (2023): We examine genomic signals of selection in the eelgrass Zostera marina across temperature gradients in adjacent embayments. Although we find many genomic regions with signals of selection within each bay there is very little overlap in signals of selection at the SNP level, despite most polymorphisms being shared across bays. We do find overlap at the gene level, potentially suggesting multiple mutational pathways to the same phenotype. Using polygenic models we find that some sets of candidate SNPs are able to predict temperature across both bays, suggesting that small but parallel shifts in allele frequencies may be missed by independent genome scans. Together, these results highlight the continuous rather than binary nature of parallel evolution in polygenic traits and the complexity of evolutionary predictability.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset
This dataset comprises the genome assemblies of 1,426 Listeria monocytogenes samples collected by the BeONE Consortium on behalf of the One Health European Joint Programme “BeONE: Building Integrative Tools for One Health Surveillance” (https://onehealthejp.eu/jrp-beone/). Additionally, a complementary dataset is also made available (https://zenodo.org/record/7116878), comprising genome assemblies of 1,874 L. monocytogenes samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA).
File “BeONE_Lm_metadata.xlsx” contains the genome assembly statistics for each isolate, including European Nucleotide Archive accession numbers and in-silico Multi Locus Sequence Type.
The archive “BeONE_Lm_assemblies.zip” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.
Dataset selection and curation
This anonymized dataset of L. monocytogenes genome assemblies was generated using Next Generation Sequencing data collected within the BeONE Consortium available at the European Nucleotide Archive under BioProject Accession Number PRJEB57166. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the Metadata file). In total, 1,426 isolates passed the dataset curation step and were included in the final dataset.
Funding
This work was supported by funding from the European Union’s Horizon 2020 Research and Innovation programme under grant agreement No 773830: One Health European Joint Programme.
The data release details the samples, methods, and raw data used to generate high-quality genome assemblies for greater sage-grouse (Centrocercus urophasianus), white-tailed ptarmigan (Lagopus leucura), and trumpeter swan (Cygnus buccinator). The raw data have been deposited in the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI), the authoritative repository for public biological sequence data, and are not included in this data release. Instead, the accessions that link to those data via the NCBI portal (www.ncbi.nlm.nih.gov) are provided herein. The release consists of a single file, sample.metadata.txt, which maps NCBI accessions to the samples sequenced and the different types of sequencing performed to generate the assemblies and annotate their gene features.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset
This dataset comprises the genome assemblies of 610 Campylobacter jejuni samples collected by the BeONE Consortium on behalf of the One Health European Joint Programme “BeONE: Building Integrative Tools for One Health Surveillance” (https://onehealthejp.eu/jrp-beone/). Additionally, a complementary dataset is also made available (https://zenodo.org/record/7120166), comprising genome assemblies of 3,076 C. jejuni samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA).
File “BeONE_Cj_metadata.xlsx” contains the genome assembly statistics for each isolate, including European Nucleotide Archive accession numbers and in-silico Multi Locus Sequence Type, and information regarding year of sampling, country and source.
The archive “BeONE_Cj_assemblies.zip” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.
Dataset selection and curation
This anonymized dataset of C. jejuni genome assemblies was generated using Next Generation Sequencing data collected within the BeONE Consortium available at the European Nucleotide Archive under BioProject Accession Number PRJEB57119. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the Metadata file). In total, 610 isolates passed the dataset curation step and were included in the final dataset.
Funding
This work was supported by funding from the European Union’s Horizon 2020 Research and Innovation programme under grant agreement No 773830: One Health European Joint Programme.
Acknowledgements
We thank the National Distributed Computing Infrastructure of Portugal (INCD) for providing the necessary resources to run the genome assemblies. INCD was funded by FCT and FEDER under the project 22153-01/SAICT/2016.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Cannabis is a genus of flowering plants in the family Cannabaceae.
Source: https://en.wikipedia.org/wiki/Cannabis
In October 2016, Phylos Bioscience released a genomic open dataset of approximately 850 strains of Cannabis via the Open Cannabis Project. In combination with other genomics datasets made available by Courtagen Life Sciences, Michigan State University, NCBI, Sunrise Medicinal, University of Calgary, University of Toronto, and Yunnan Academy of Agricultural Sciences, the total amount of publicly available data exceeds 1,000 samples taken from nearly as many unique strains.
These data were retrieved from the National Center for Biotechnology Information’s Sequence Read Archive (NCBI SRA), processed using the BWA aligner and FreeBayes variant caller, indexed with the Google Genomics API, and exported to BigQuery for analysis. Data are available directly from Google Cloud Storage at gs://gcs-public-data--genomics/cannabis, as well as via the Google Genomics API as dataset ID 918853309083001239, and an additional duplicated subset of only transcriptome data as dataset ID 94241232795910911, as well as in the BigQuery dataset bigquery-public-data:genomics_cannabis.
All tables in the Cannabis Genomes Project dataset have a suffix like _201703. The suffix is referred to as [BUILD_DATE] in the descriptions below. The dataset is updated frequently as new releases become available.
The following tables are included in the Cannabis Genomes Project dataset:
Sample_info contains fields extracted for each SRA sample, including the SRA sample ID and other data that give indications about the type of sample. Sample types include: strain, library prep methods, and sequencing technology. See SRP008673 for an example of upstream sample data. SRP008673 is the University of Toronto sequencing of Cannabis Sativa subspecies Purple Kush.
MNPR01_reference_[BUILD_DATE] contains reference sequence names and lengths for the draft assembly of Cannabis Sativa subspecies Cannatonic produced by Phylos Bioscience. This table contains contig identifiers and their lengths.
MNPR01_[BUILD_DATE] contains variant calls for all included samples and types (genomic, transcriptomic) aligned to the MNPR01_reference_[BUILD_DATE] table. Samples can be found in the sample_info table. The MNPR01_[BUILD_DATE] table is exported using the Google Genomics BigQuery variants schema. This table is useful for general analysis of the Cannabis genome.
MNPR01_transcriptome_[BUILD_DATE] is similar to the MNPR01_[BUILD_DATE] table, but it includes only the subset transcriptomic samples. This table is useful for transcribed gene-level analysis of the Cannabis genome.
Fork this kernel to get started with this dataset.
Dataset Source: http://opencannabisproject.org/ Category: Genomics Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://www.ncbi.nlm.nih.gov/home/about/policies.shtml - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset. Update frequency: As additional data are released to GenBank View in BigQuery: https://bigquery.cloud.google.com/dataset/bigquery-public-data:genomics_cannabis View in Google Cloud Storage: gs://gcs-public-data--genomics/cannabis
Banner Photo by Rick Proctor from Unplash.
Which Cannabis samples are included in the variants table?
Which contigs in the MNPR01_reference_[BUILD_DATE] table have the highest density of variants?
How many variants does each sample have at the THC Synthase gene (THCA1) locus?
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset
This dataset comprises the genome assemblies of 1,540 Salmonella enterica samples collected by the BeONE Consortium on behalf of the One Health European Joint Programme “BeONE: Building Integrative Tools for One Health Surveillance” (https://onehealthejp.eu/jrp-beone/). Additionally, a complementary dataset is also made available (https://zenodo.org/record/7119735), comprising genome assemblies of 1,434 S. enterica samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA).
File “BeONE_Se_metadata.xlsx” contains the genome assembly statistics for each isolate, including European Nucleotide Archive accession numbers, in-silico Multi Locus Sequence Type and Serotype, and information regarding year of sampling, country and source.
The archive “BeONE_Se_assemblies.zip” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.
Dataset selection and curation
This anonymized dataset of S. enterica genome assemblies was generated using Next Generation Sequencing data collected within the BeONE Consortium available at the European Nucleotide Archive under BioProject Accession Number PRJEB57179. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the Metadata file). In total, 1,540 isolates passed the dataset curation step and were included in the final dataset. In-silico serotyping was performed with SeqSero2 v1.2.1 (Zhang et al. 2019).
Funding
This work was supported by funding from the European Union’s Horizon 2020 Research and Innovation programme under grant agreement No 773830: One Health European Joint Programme.
Acknowledgements
We thank the National Distributed Computing Infrastructure of Portugal (INCD) for providing the necessary resources to run the genome assemblies. INCD was funded by FCT and FEDER under the project 22153-01/SAICT/2016.
Biological indicator taxa have long been used for integrative assessments of water quality, particularly benthic invertebrate groups such as arthropods. While standardized protocols have been developed to calculate 'biological index' scores based on the abundances of these taxa, such systems are challenging to implement at large scales due to the sampling effort required, taxonomic expertise needed, and the need for repeated sampling to reliably discriminate sites. Many of the same taxa detected by traditional surveys can also be detected by genetic analysis of environmental DNA (eDNA), potentially allowing for an alternative formulation of biological indexes that might be faster and more economical to produce. The current data were produced to evaluate eDNA-derived biological indexes at sites within the Potomac River watershed of the eastern United States, specifically within units of the National Park Service for which previous biological assessment data were available. This data release consists of five files: 1. sample.metadata.txt, which contains sampling metadata and identifiers linking to sample-derived sequence data that has been deposited in the Sequence Read Archive of the National Center for Biotechnology Information (NCBI). This database is authoritative and comprehensive for sharing high-throughput sequence data produced with public funds. All accessions listed in the file can be searched to retrieve sample and sequence information at www.ncbi.nlm.nih.gov. 2. cox1.references.fasta, which contains reference sequences of the cytochrome c oxidase 1gene of arthropods (typically abbreviated cox1 or COI), identified from regional checklists. The file is a text file in FASTA format. 3. mt16S.references.fasta, which contains reference sequences of the mitochondrial 16S ribosomal RNA (mt16S) gene of arthropods identified from regional checklists. The file is a text file in FASTA format. 4. first.stage.counts.txt, which is a tab-delimited table of counts of sequences that are attributed to each taxon from each sample for the first stage of the study. Whether the taxon attribution is from the mt16S or cox1 locus is also indicated. 5. second.stage.counts.txt, which is a tab-delimited table of counts of sequences that are attributed to each taxon from each sample for the second stage of the study. Whether the taxon attribution is from the mt16S or cox1 locus is also indicated.
NSF Division of Ocean Sciences (NSF OCE) OCE-1737409
A survey of public honey bee sequence data was performed to detect infections by Lake Sinai Virus (LSV). The Sequence Read Archive of the National Center for Biotechnology Information (NCBI) was queried to identify accessions of RNA sequence data derived from honey bee. These were filtered as described below and then up to 50 million reads or read pairs were downloaded and searched against a reference database of conserved LSV sequence. Accessions with matches above a specified threshold were downloaded in their entirety and assembled into longer contiguous sequences (contigs). The result contigs were searched against each open reading frame (ORF) of the reference LSV genome present in the NCBI database (accession NC_032433.1) and matching regions from each contig. These ORF sequences were aligned with additional sequences identified in NCBI databases through the BLAST web service. These alignments provide the basis for computing phylogenetic trees, rates of nucleotide substitution, codon usage bias, and other evolutionary parameters.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset
This dataset comprises the genome assemblies and respective 8,558-loci whole-genome (wg) Multiple Locus Sequence Type (MLST) profiles [INNUENDO schema (Llarena et al. 2018) available in chewie-NS (Mamede et al. 2022)] of a final set of 1,434 Salmonella enterica samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) at the beginning of the analysis (November 2021). This set of samples was carefully selected to cover a wide genetic diversity (assessed in terms of serotype). In total, 125 different serotypes are represented in this dataset, with Typhimurium (including monophasic), Enteritidis and Infantis being the most represented ones and, together, corresponding to 56.2% of the dataset.
File “Se_metadata.xlsx” contains metadata information for each isolate, including ENA/SRA accession number, BioProject and in-silico MLST ST and serotype.
The directory “assemblies/” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.
The file “profiles/Se_profiles_wgMLST.tsv” corresponds to a tab separated file with the 8,558-loci wgMLST profiles of each isolate presented in the metadata file. The files “profiles/Se_profiles_cgMLST_95.tsv”, “profiles/Se_profiles_cgMLST_98.tsv” and “profiles/Se_profiles_cgMLST_100.tsv” correspond to a 3,261-loci, 3,179-loci and 874-loci cgMLST profiles of each isolate presented in the metadata file, respectively. These profiles were determined as explained below.
Dataset selection and curation
With the objective of creating a diverse dataset of S. enterica genome assemblies, we collected information about the genetic diversity (serotype) of the isolates available at Enterobase database in the beginning of this analysis (November 2021) and in other previous works. Based on this information, we selected an initial dataset comprising 1,779 samples associated with four BioProjects (PRJEB16326, PRJEB20997, PRJEB30335 and PRJEB39988). Their WGS data was downloaded from ENA/SRA with fastq-dl v1.0.6. Read quality control, trimming and assembly were performed with the Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the Metadata file). In total, 1,434 isolates passed this curation step and were included in the final dataset. In-silico serotyping was performed with SeqSero2 v1.2.1 (Zhang et al. 2019). wgMLST profiles of each of these isolates were determined with chewBBACA v2.8.5 (Silva et al. 2018), using the 8,558-loci INNUENDO schema available in chewie-NS (Llarena et al. 2018; Mamede et al. 2022) and downloaded on May 31st, 2022. Three cgMLST schemas were obtained with ReporTree v1.0.0 (Mixão et al. 2022) using the 8,558-loci wgMLST profiles of the 1,434 isolates as input and setting distinct “--site-inclusion” thresholds: 0.95, 0.98 and 1.0 (i.e., keep schema loci called in at least 95%, 98% and 100% of the samples, resulting in a 3,261-loci, 3,179-loci and 874-loci allelic matrices, respectively).
Acknowledgements
We thank the National Distributed Computing Infrastructure of Portugal (INCD) for providing the necessary resources to run the genome assemblies. INCD was funded by FCT and FEDER under the project 22153-01/SAICT/2016.
This dataset contains data related to RNA sequence genetic accessions at the National Center for Biotechnology Information (NCBI) including information about the host organism, collection location, and collection date.
The accessions are the unprocessed Illumina MiSeq reads for the Ross Sea Dinoflagellate RNA-Seq experiments, Phaeocystis antarctica RNA-Seq experiments, and Pyramimons tychotreta & Micromonas polaris (CCMP 2099) mixotrophy experiments.
Pyramimonas tychotreta & Micromonas polaris (CCMP 2099) mixotrophy RNA sequences are available through the NCBI Sequence Read Archive (SRA) under the SRA accession number SRP090401 (BioProject PRJNA342459)
Ross Sea Dinoflagellate RNA sequences are available through the NCBI Sequence Read Archive (SRA) under the accession number SRP132912 (BioProject PRJNA428208).
Phaeocystis antarctica RNA sequences are available through the NCBI Sequence Read Archive (SRA) under the accession number SRP133243 (BioProject PRJNA434497).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset
This dataset comprises the genome assemblies and respective 7,601-loci whole-genome (wg) Multiple Locus Sequence Type (MLST) profiles [INNUENDO schema (Llarena et al. 2018) available in chewie-NS (Mamede et al. 2022)] of a final set of 1,999 Escherichia coli samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) at the beginning of the analysis (November 2021). This set of samples was carefully selected to cover a wide genetic diversity (assessed in terms of serotype). In total, 411 different serotypes are represented in this dataset, with O157:H7 being the most represented one, corresponding to 37.1% of the dataset.
File “Ec_metadata.xlsx” contains metadata information for each isolate, including ENA/SRA accession number, BioProject and in-silico MLST ST and serotype.
The directory “assemblies/” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.
The file “profiles/Ec_profiles_wgMLST.tsv” corresponds to a tab separated file with the 7,601-loci wgMLST profiles of each isolate presented in the metadata file. The files “profiles/Ec_profiles_cgMLST_95.tsv”, “profiles/Ec_profiles_cgMLST_98.tsv” and “profiles/Ec_profiles_cgMLST_100.tsv” correspond to a 2,826-loci, 2,704-loci and 465-loci cgMLST profiles of each isolate presented in the metadata file, respectively. These profiles were determined as explained below.
Dataset selection and curation
With the objective of creating a diverse dataset of E. coli genome assemblies, we collected information about the genetic diversity (serotype) of the isolates available at Enterobase database in the beginning of this analysis (November 2021) and in other previous works. Based on this information, we selected an initial dataset comprising 2,688 samples associated with three BioProjects (PRJNA230969, PRJEB27020 and PRJNA248042). Their WGS data was downloaded from ENA/SRA with fastq-dl v1.0.6. Read quality control, trimming and assembly were performed with the Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the Metadata file). In total, 1,999 isolates passed this curation step and were included in the final dataset. In-silico serotyping was performed with seq_typing v2.2. wgMLST profiles of each of these isolates were determined with chewBBACA v2.8.5 (Silva et al. 2018), using the 7,601-loci INNUENDO schema available in chewie-NS (Llarena et al. 2018; Mamede et al. 2022) and downloaded on May 31st, 2022. Three cgMLST schemas were obtained with ReporTree v1.0.0 (Mixão et al. 2022) using the 7,601-loci wgMLST profiles of the 1,999 isolates as input and setting distinct “--site-inclusion” thresholds: 0.95, 0.98 and 1.0 (i.e., keep schema loci called in at least 95%, 98% and 100% of the samples, resulting in a 2,826-loci, 2,704-loci and 465-loci allelic matrices, respectively).
To capture the immediate effects of storm-driven freshwater runoff on coral and symbiont physiology, we leveraged the heavy rainfall associated with Hurricane Harvey in late August 2017 by sampling FGB coral gene expression at two time points: September 2017, when surface water salinity was reduced (∼34 ppt); and 1 month later when salinity had returned to typical levels (∼36 ppt in October 2017).
This dataset includes Sequence Read Archive (SRA) and BioSample accessions under BioProject PRJNA552981 at The National Center for Biotechnology Information. It also contains sample information and species names for samples collected the east and west banks of the Flower Garden Banks National Marine Sanctuary (FGBNMS) at 80ft.
These data were published in Wright et al. (2019).
The dataset consists of genotypes (diploid base calls at variant sites) at 3,601 anonymous sites of the Arizona Toad (Anaxyrus microscaphus) nuclear genome. The genotyped samples are representative of the range of the species and its major population units, and the genotyped loci have a high degree of completeness. This data release consists of several files: 1. sample.metadata.txt, which contains sampling metadata and identifiers linking to sequence data that has been deposited in the Sequence Read Archive of the National Center for Biotechnology Information (NCBI). This database is authoritative and comprehensive for sharing high-throughput sequence data produced with public funds. All NCBI-derived accessions listed in the file can be searched at www.ncbi.nlm.nih.gov to retrieve sample and sequence information, as can the umbrella BioProject accession PRJNA995169 2. genotypes.genepop.txt, which contains inferred genetic variants in a common and convertible text-based format
Dataset: Raw low-coverage whole genome sequencing reads
https://rightsstatements.org/vocab/UND/1.0/https://rightsstatements.org/vocab/UND/1.0/
Raptor roosts as invasion archives: insights from the first black rat mitochondrial genome sequenced from the Caribbean
The Trace Archives includes the following archives: The Sequence Read Archive (SRA) stores raw sequence data from "next-generation" sequencing technologies including 454, IonTorrent, Illumina, SOLiD, Helicos and Complete Genomics. In addition to raw sequence data, SRA now stores alignment information in the form of read placements on a reference sequence. The Trace Archive serves as the repository of sequencing data from gel/capillary platforms such as Applied Biosystems ABI 3730. The Trace Assembly Archive stores pairwise alignment and multiple alignment of sequencing reads, linking basic trace data with finished genomic sequence as found in GenBank.