The Sequence Read Archive (SRA) stores sequencing data from the next generation of sequencing platforms including Roche 454 GS System®, Illumina Genome Analyzer®, Life Technologies AB SOLiD System®, Helicos Biosciences Heliscope®, Complete Genomics®, and Pacific Biosciences SMRT®.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Study Characteristics: In this table, all publicly available data that were aggregated for this study are described, along with their Sequence Read Archive bioproject numbers, sample descriptors and average number (#) of reads.
This repository within the ACTIV TRACE initiative houses a comprehensive collection of datasets related to SARS-CoV-2. The processing of SARS-CoV-2 Sequence Read Archive (SRA) files has been optimized to identify genetic variations in viral samples. This information is then presented in the Variant Call Format (VCF). Each VCF file corresponds to the SRA parent-run's accession ID. Additionally, the data is available in the parquet format, making it easier to search and filter using the Amazon Athena Service. The SARS-CoV-2 Variant Calling Pipeline is designed to handle new data every six hours, with updates to the AWS ODP bucket occurring daily.
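As a rough illustration of what a per-run VCF file contains, the sketch below parses the eight fixed VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) from tab-separated data lines. The function names and the sample record are hypothetical and are not taken from the repository; real files may also carry per-sample columns, which this sketch ignores.

```python
# Minimal VCF data-line parser using only the standard library.
# The example record below is made up for illustration.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Turn one tab-separated VCF data line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several ALT alleles are comma-separated
    info = {}
    if record["INFO"] != ".":
        for item in record["INFO"].split(";"):
            key, sep, value = item.partition("=")
            info[key] = value if sep else True  # keys without '=' are flags
    record["INFO"] = info
    return record


def read_vcf(lines):
    """Parse all data lines, skipping blank lines and '#' header lines."""
    return [parse_vcf_line(l) for l in lines if l.strip() and not l.startswith("#")]


example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "NC_045512.2\t23403\t.\tA\tG\t.\tPASS\tDP=100;INDEL",
]
for rec in read_vcf(example):
    print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"], rec["INFO"])
```

For large-scale filtering, the Parquet copies queried through Amazon Athena are likely the more practical route; the parser above is only meant to show the record layout.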
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This table describes whether previous reports exist linking these genes to aging or neurodegeneration phenotypes in humans or in a model organism. (XLSX)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This table depicts the aging correlated genes for humans and flies sorted according to their correlation coefficient. (XLSX)
This project was designed to describe fine-scale population genetic differentiation of the stream salamander Gyrinophilus porphyriticus among five study streams in the Hubbard Brook Experimental Forest. The data are paired with intensive capture-recapture data to assess direct fitness effects of individual genetic diversity, including effects of individual multilocus heterozygosity on stage-specific survival probabilities.
This dataset publishes a manifest of the genomic sequence reads submitted to the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). These samples are published at NCBI under the BioProject ID 1090913 (https://www.ncbi.nlm.nih.gov/bioproject/1090913). The tables here include sample metadata and the NCBI URLs to each sample.
These data were gathered as part of the Hubbard Brook Ecosystem Study (HBES). The HBES is a collaborative effort at the Hubbard Brook Experimental Forest, which is operated and maintained by the USDA Forest Service, Northern Research Station.
This dataset contains data related to RNA sequence genetic accessions at the National Center for Biotechnology Information (NCBI) including information about the host organism, collection location, and collection date.
The accessions are the unprocessed Illumina MiSeq reads for the Ross Sea Dinoflagellate RNA-Seq experiments, Phaeocystis antarctica RNA-Seq experiments, and Pyramimonas tychotreta & Micromonas polaris (CCMP 2099) mixotrophy experiments.
Pyramimonas tychotreta & Micromonas polaris (CCMP 2099) mixotrophy RNA sequences are available through the NCBI Sequence Read Archive (SRA) under the accession number SRP090401 (BioProject PRJNA342459).
Ross Sea Dinoflagellate RNA sequences are available through the NCBI Sequence Read Archive (SRA) under the accession number SRP132912 (BioProject PRJNA428208).
Phaeocystis antarctica RNA sequences are available through the NCBI Sequence Read Archive (SRA) under the accession number SRP133243 (BioProject PRJNA434497).
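The accession numbers listed above can be turned into NCBI links programmatically. The sketch below is a minimal illustration: the BioProject URL pattern mirrors the BioProject links quoted elsewhere in this listing, while the SRA search pattern (`?term=`) and the function names are assumptions made for the example.

```python
from urllib.parse import urlencode


def bioproject_url(accession):
    """NCBI BioProject page for an accession such as PRJNA342459."""
    return f"https://www.ncbi.nlm.nih.gov/bioproject/{accession}"


def sra_search_url(accession):
    """SRA search link for a study accession (URL pattern is an assumption)."""
    return "https://www.ncbi.nlm.nih.gov/sra/?" + urlencode({"term": accession})


# Accession pairs (SRA study, BioProject) as given in the text above.
studies = {
    "Pyramimonas tychotreta & Micromonas polaris mixotrophy": ("SRP090401", "PRJNA342459"),
    "Ross Sea Dinoflagellate": ("SRP132912", "PRJNA428208"),
    "Phaeocystis antarctica": ("SRP133243", "PRJNA434497"),
}

for name, (srp, prj) in studies.items():
    print(name)
    print("  ", sra_search_url(srp))
    print("  ", bioproject_url(prj))
```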
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This table depicts the average R², mean squared error, median absolute error and R² 95% confidence interval across 1000 iterations of training/testing predictions. Each row represents a different way to select genetic features for age prediction, and each column represents a metric used to evaluate the predictions. (XLSX)
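For readers unfamiliar with the metrics named in this table, the sketch below shows how they are conventionally computed for one set of predictions, and how a 95% interval could be taken across many iterations. It is only an illustration: the function names are hypothetical, and the percentile-based interval is an assumption, since the source does not state how the confidence interval was derived.

```python
import statistics


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def mean_squared_error(y_true, y_pred):
    return statistics.fmean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def median_absolute_error(y_true, y_pred):
    return statistics.median(abs(t - p) for t, p in zip(y_true, y_pred))


def percentile_interval_95(values):
    """2.5th and 97.5th percentiles of per-iteration scores (assumed CI method)."""
    cuts = statistics.quantiles(values, n=40, method="inclusive")
    return cuts[0], cuts[-1]


# Toy ages and predictions, made up for illustration.
ages = [1, 2, 3, 4]
predicted = [2, 2, 3, 3]
print(r_squared(ages, predicted))
print(mean_squared_error(ages, predicted))
print(median_absolute_error(ages, predicted))
```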
In contrast to temperate systems, Arctic lagoons that span the Alaska Beaufort Sea coast face extreme seasonality. Nine months of ice cover up to ∼1.7 m thick is followed by a spring thaw that introduces an enormous pulse of freshwater, nutrients, and organic matter into these lagoons over a relatively brief 2–3 week period. Prokaryotic communities link these subsidies to lagoon food webs through nutrient uptake, heterotrophic production, and other biogeochemical processes, but little is known about how the genomic capabilities of these communities respond to seasonal variability. This study characterizes the metabolic capabilities of microbial communities across three seasons in two lagoons and one open coastal site along the eastern Alaska Beaufort Sea coast. We used metagenomic DNA sequence data of bacterial and archaeal water column communities to identify genes of relevant biogeochemical pathways. This data package catalogs sequence read archive (SRA) entries available through GenBank BioProject PRJNA642637 at https://www.ncbi.nlm.nih.gov/bioproject/PRJNA642637. This data package is associated with the following publication: Baker, Kristina D., Colleen T. E. Kellogg, James W. McClelland, Kenneth H. Dunton, and Byron C. Crump. “The Genomic Capabilities of Microbial Communities Track Seasonal Variation in Environmental Conditions of Arctic Lagoons.” Frontiers in Microbiology 12 (2021). https://doi.org/10.3389/fmicb.2021.601901. Environmental variables (physiochemical data from YSI and HOBO data loggers, as well as organic matter analysis and stable isotope data from discrete water samples) associated with this genomic dataset are available from the Arctic Data Center: Kenneth Dunton, Byron Crump, and James McClelland. Physical, chemical, and biological data from lagoons and open coastal waters in the nearshore environment of the eastern Alaska Beaufort Sea, 2011-2013. Arctic Data Center. doi:10.18739/A2DG13. 
To join the two datasets together, please use the provided site codes (column "site_name" here) and collection dates (column "collection_date" here) in each dataset. Instead of citing this package, which is a catalog, please cite the original GenBank data, journal article, or related Arctic Data Center dataset as appropriate. Citation guidance for the journal article and related Arctic Data Center dataset is available on the respective publishers' websites.
The data release details the samples, methods, and raw data used to generate high-quality genome assemblies for greater sage-grouse (Centrocercus urophasianus), white-tailed ptarmigan (Lagopus leucura), and trumpeter swan (Cygnus buccinator). The raw data have been deposited in the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI), the authoritative repository for public biological sequence data, and are not included in this data release. Instead, the accessions that link to those data via the NCBI portal (www.ncbi.nlm.nih.gov) are provided herein. The release consists of a single file, sample.metadata.txt, which maps NCBI accessions to the samples sequenced and the different types of sequencing performed to generate the assemblies and annotate their gene features.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset
This dataset comprises the genome assemblies and respective 7,601-loci whole-genome (wg) Multiple Locus Sequence Type (MLST) profiles [INNUENDO schema (Llarena et al. 2018) available in chewie-NS (Mamede et al. 2022)] of a final set of 1,999 Escherichia coli samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) at the beginning of the analysis (November 2021). This set of samples was carefully selected to cover a wide genetic diversity (assessed in terms of serotype). In total, 411 different serotypes are represented in this dataset, with O157:H7 being the most represented one, corresponding to 37.1% of the dataset.
File “Ec_metadata.xlsx” contains metadata information for each isolate, including ENA/SRA accession number, BioProject and in-silico MLST ST and serotype.
The directory “assemblies/” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.
The file “profiles/Ec_profiles_wgMLST.tsv” corresponds to a tab separated file with the 7,601-loci wgMLST profiles of each isolate presented in the metadata file. The files “profiles/Ec_profiles_cgMLST_95.tsv”, “profiles/Ec_profiles_cgMLST_98.tsv” and “profiles/Ec_profiles_cgMLST_100.tsv” correspond to a 2,826-loci, 2,704-loci and 465-loci cgMLST profiles of each isolate presented in the metadata file, respectively. These profiles were determined as explained below.
Dataset selection and curation
With the objective of creating a diverse dataset of E. coli genome assemblies, we collected information about the genetic diversity (serotype) of the isolates available at Enterobase database in the beginning of this analysis (November 2021) and in other previous works. Based on this information, we selected an initial dataset comprising 2,688 samples associated with three BioProjects (PRJNA230969, PRJEB27020 and PRJNA248042). Their WGS data were downloaded from ENA/SRA with fastq-dl v1.0.6. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the Metadata file). In total, 1,999 isolates passed this curation step and were included in the final dataset. In-silico serotyping was performed with seq_typing v2.2. wgMLST profiles of each of these isolates were determined with chewBBACA v2.8.5 (Silva et al. 2018), using the 7,601-loci INNUENDO schema available in chewie-NS (Llarena et al. 2018; Mamede et al. 2022) and downloaded on May 31st, 2022. Three cgMLST schemas were obtained with ReporTree v1.0.0 (Mixão et al. 2022) using the 7,601-loci wgMLST profiles of the 1,999 isolates as input and setting distinct “--site-inclusion” thresholds: 0.95, 0.98 and 1.0 (i.e., keep schema loci called in at least 95%, 98% and 100% of the samples, resulting in 2,826-loci, 2,704-loci and 465-loci allelic matrices, respectively).
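The site-inclusion idea described above (keep only loci called in at least a given fraction of samples) can be sketched in a few lines. This is not ReporTree's implementation; the function name, the toy profiles, and the choice of "0" as the code for a missing call are assumptions made for illustration.

```python
def filter_loci(profiles, threshold, missing="0"):
    """Return the loci called in at least `threshold` (a fraction) of samples.

    profiles: dict mapping sample name -> dict mapping locus -> allele code.
    missing:  allele code treated as "not called" (assumed to be "0" here).
    """
    samples = list(profiles)
    loci = list(profiles[samples[0]])
    kept = []
    for locus in loci:
        called = sum(1 for s in samples if profiles[s][locus] != missing)
        if called / len(samples) >= threshold:
            kept.append(locus)
    return kept


# Toy allelic matrix: 4 samples x 3 loci (values are made up).
toy = {
    "s1": {"locA": "1", "locB": "2", "locC": "0"},
    "s2": {"locA": "1", "locB": "0", "locC": "0"},
    "s3": {"locA": "3", "locB": "5", "locC": "7"},
    "s4": {"locA": "2", "locB": "4", "locC": "0"},
}

print(filter_loci(toy, 1.0))   # only loci called in every sample
print(filter_loci(toy, 0.75))  # loci called in at least 3 of 4 samples
```

Raising the threshold shrinks the schema, which matches the trend reported above: the 100% threshold yields far fewer loci than the 95% and 98% thresholds.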
A table (DP_SRA.xlsx) lists one sample per row, with columns for the BioSample accession number (NCBI), collection (date), library strategy, target (source), and sequencing (technology). The zip file (Genome_Set01.zip) contains nine (9) FASTA files (DP_bin_02.fasta, DP_bin_04.fasta, DP_bin_09.fasta, DP_bin_10.fasta, DP_bin_14.fasta, DP_bin_15.fasta, DP_bin_16a.fasta, DP_bin_20.fasta, DP_bin_23.fasta) with the contig sequences (i.e., bins) for each metagenome-assembled genome (MAG). These data are available from the NCBI Sequence Read Archive (SRA) under the BioProject (https://www.ncbi.nlm.nih.gov/bioproject) with accession number PRJNA646252 and the following BioSample numbers: SAMN15536103 to SAMN15536108. This dataset is associated with the following publication: Gomez-Alvarez, V., H. Liu, J. Pressman, and D. Wahman. Metagenomic Profile of Microbial Communities in a Drinking Water Storage Tank Sediment after Sequential Exposure to Monochloramine, Free Chlorine, and Monochloramine. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 1(5): 1283-1294, (2021).
The availability of phosphorus (P) strongly influences crop yield and quality. However, due to agricultural practices, P has accumulated in soil, mostly in inaccessible forms. Bacteria play an important role in mobilizing P. The release of P results from the bacterial need for carbon (C) and nitrogen (N) rather than from an immediate need for P. We therefore postulated that the addition of C and N would stimulate P mobilization by bacteria. To test this, we performed a metagenomic study of soils from two agricultural sites (Rostock, Freising) that had received either mineral N fertilizer only or mineral N plus organic fertilizer for more than 20 years. Metagenomic sequencing, followed by taxonomic and functional annotation of the sequences by BLAST searches against the NCBI-nr database (http://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (June 2011), revealed that, independent of site and season, the relative abundance of genes involved in P turnover was not significantly affected by the addition of fertilizers. However, the type of fertilization had a significant impact on the composition of bacterial families harboring genes coding for the different P transformation processes. This gives rise to the possibility that fertilizers can substantially change phosphorus turnover efficiency by favoring different families. Additionally, none of the families involved in phosphorus turnover covered all investigated processes. Therefore, promoting bacteria that play an essential role specifically in the mobilization of hardly accessible phosphorus could help to secure the phosphorus supply of plants in soils with low P input, as the most abundant genes involved in the acquisition of external P sources in our study were those involved in solubilization and subsequent uptake of inorganic phosphorus.
The raw sequencing data are available at the Sequence Read Archive (SRA) under the BioProject ID PRJNA385596 (SAMN06894543-SAMN06894566). Additionally, we determined dissolved organic nitrogen (DON) and carbon (DOC) contents by extracting the soil with 0.01 M CaCl2 solution (soil to liquid ratio: 1:4) and the microbial biomass carbon (Cmic) and nitrogen (Nmic) content by applying a chloroform-fumigation-extraction procedure. Our data indicate that the site, more than the treatment, changed those values, as the stability of Cmic, Nmic, DOC and DON was high across the different fertilizer regimes. Only additional P fertilization slightly increased DOC values. Data are published in Grafe, M., Goers, M., von Tucher, S., Baum, C., Zimmer, D., Leinweber, P., Vestergaard, G., Kublik, S., Schloter, M., and Schulz, S.: Bacterial potentials for uptake, solubilization and mineralization of extracellular phosphorus in agricultural soils are highly stable under different fertilization regimes, Environ. Microbiol. Rep., 10, 320-327, https://doi.org/10.1111/1758-2229.12651
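The BioSample range quoted above (SAMN06894543 to SAMN06894566) can be expanded into individual accessions when each sample needs to be looked up separately. The sketch below assumes the range is contiguous and inclusive, which the source implies but does not state outright; the function name is hypothetical.

```python
import re


def expand_accession_range(first, last):
    """List every accession from `first` to `last`, inclusive.

    Assumes both accessions share the same letter prefix and that the
    numeric part is zero-padded to a fixed width.
    """
    m_first = re.fullmatch(r"([A-Z]+)(\d+)", first)
    m_last = re.fullmatch(r"([A-Z]+)(\d+)", last)
    if not m_first or not m_last or m_first.group(1) != m_last.group(1):
        raise ValueError("accessions must share the same prefix")
    prefix = m_first.group(1)
    width = len(m_first.group(2))
    start, stop = int(m_first.group(2)), int(m_last.group(2))
    if stop < start:
        raise ValueError("range end precedes range start")
    return [f"{prefix}{n:0{width}d}" for n in range(start, stop + 1)]


biosamples = expand_accession_range("SAMN06894543", "SAMN06894566")
print(len(biosamples), biosamples[0], biosamples[-1])
```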
Table of Contents
Main Description; File Descriptions; Linked Files; Installation and Instructions
This is the Zenodo repository for the manuscript titled "A TCR β chain-directed antibody-fusion molecule that activates and expands subsets of T cells and promotes antitumor activity." The code included in the file titled marengo_code_for_paper_jan_2023.R was used to generate the figures from the single-cell RNA sequencing data.
The following libraries are required for script execution:
Seurat, scRepertoire, ggplot2, stringr, dplyr, ggridges, ggrepel, ComplexHeatmap
The code can be downloaded and opened in RStudio. The "marengo_code_for_paper_jan_2023.R" file contains all the code needed to reproduce the figures in the paper. The "Marengo_newID_March242023.rds" file is available at the following address: https://zenodo.org/badge/DOI/10.5281/zenodo.7566113.svg (Zenodo DOI: 10.5281/zenodo.7566113). The "all_res_deg_for_heat_updated_march2023.txt" file contains the unfiltered results from the DGE analysis, also used to create the heatmap with DGE and volcano plots. The "genes_for_heatmap_fig5F.xlsx" file contains the genes included in the heatmap in figure 5F.
This repository contains code for the analysis of a single-cell RNA-seq dataset. The dataset contains raw FASTQ files as well as the aligned files that were deposited in GEO. The "Rdata" or "Rds" file was deposited in Zenodo. Provided below are descriptions of the linked datasets:
Gene Expression Omnibus (GEO) ID: GSE223311 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE223311)
Title: Gene expression profile at single cell level of CD4+ and CD8+ tumor infiltrating lymphocytes (TIL) originating from the EMT6 tumor model from mSTAR1302 treatment. Description: This submission contains the "matrix.mtx", "barcodes.tsv", and "genes.tsv" files for each replicate and condition, corresponding to the aligned files for single cell sequencing data. Submission type: Private. In order to gain access to the repository, you must use a reviewer token (https://www.ncbi.nlm.nih.gov/geo/info/reviewer.html).
Sequence read archive (SRA) repository ID: SRX19088718 and SRX19088719
Title: Gene expression profile at single cell level of CD4+ and CD8+ tumor infiltrating lymphocytes (TIL) originating from the EMT6 tumor model from mSTAR1302 treatment.
Description: This submission contains the raw sequencing (.fastq.gz) files, which are gzip-compressed text files.
Submission type: Private. In order to gain access to the repository, you must use a reviewer token (https://www.ncbi.nlm.nih.gov/geo/info/reviewer.html).
Zenodo DOI: 10.5281/zenodo.7566113 (https://zenodo.org/record/7566113#.ZCcmvC2cbrJ)
Title: A TCR β chain-directed antibody-fusion molecule that activates and expands subsets of T cells and promotes antitumor activity. Description: This submission contains the "Rdata" or ".Rds" file, which is an R object file. This is a necessary file to use the code. Submission type: Restricted Access. In order to gain access to the repository, you must contact the author.
The code included in this submission requires several essential packages, as listed above. Please follow these instructions for installation:
Ensure you have R version 4.1.2 or higher for compatibility.
Although it is not essential, you can use RStudio (version 2022.12.0+353) for accessing and executing the code.
marengo_code_for_paper_jan_2023.R, Install_Packages.R, Marengo_newID_March242023.rds, genes_for_heatmap_fig5F.xlsx, all_res_deg_for_heat_updated_march2023.txt
You can use the following code to set the working directory in R:
setwd(directory)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This table describes previous literature of listed genes, along with references. (XLSX)
https://spdx.org/licenses/CC0-1.0
Microbial communities in the coastal Arctic Ocean experience extreme variability in organic matter and inorganic nutrients driven by seasonal shifts in sea ice extent and freshwater inputs. Lagoons border more than half of the Beaufort Sea coast and provide important habitats for migratory fish and seabirds; yet, little is known about the planktonic food webs supporting these higher trophic levels. To investigate seasonal changes in bacterial and protistan planktonic communities, amplicon sequences of 16S and 18S rRNA genes were generated from samples collected during periods of ice-cover (April), ice break-up (June), and open water (August) from shallow lagoons along the eastern Alaska Beaufort Sea coast from 2011 through 2013.
This data package catalogs sequence read archive (SRA) entries available through GenBank BioProject PRJNA530074 at https://www.ncbi.nlm.nih.gov/bioproject/PRJNA530074. This data package is associated with the following publication:
Kellogg CTE, McClelland JW, Dunton KH and Crump BC (2019) Strong Seasonality in Arctic Estuarine Microbial Food Webs. Front. Microbiol. 10:2628. doi: 10.3389/fmicb.2019.02628
Environmental variables (physiochemical data from YSI and HOBO data loggers, as well as organic matter analysis and stable isotope data from discrete water samples) associated with this genomic dataset are available from the Arctic Data Center:
Kenneth Dunton, Byron Crump, and James McClelland. Physical, chemical, and biological data from lagoons and open coastal waters in the nearshore environment of the eastern Alaska Beaufort Sea, 2011-2013. Arctic Data Center. doi:10.18739/A2DG13.
To join the two datasets together, please use the provided site codes (column "site_name" here) and collection dates (column "collection_date" here) in each dataset. Note that the site codes in this package are without hyphens (e.g. JAA) while site codes in the above environmental data package have hyphens (e.g. JA-A).
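The join described above can be sketched as follows: strip hyphens from the site codes so that "JA-A" and "JAA" match, then pair rows that share a site and a collection date. This is a minimal sketch. The catalog column names ("site_name", "collection_date") come from the text; the environmental column names ("station", "date"), the sample values, and the function names are hypothetical, and the dates are assumed to use the same format in both datasets.

```python
def normalize_site(code):
    """Drop hyphens so that 'JA-A' and 'JAA' compare equal."""
    return code.replace("-", "").strip().upper()


def join_on_site_and_date(catalog_rows, env_rows,
                          env_site_col="station", env_date_col="date"):
    """Pair catalog rows with environmental rows sharing site and date.

    env_site_col and env_date_col are placeholder names; check the
    environmental data package for its actual column headers.
    """
    index = {}
    for env in env_rows:
        key = (normalize_site(env[env_site_col]), env[env_date_col])
        index.setdefault(key, []).append(env)
    pairs = []
    for row in catalog_rows:
        key = (normalize_site(row["site_name"]), row["collection_date"])
        for env in index.get(key, []):
            pairs.append((row, env))
    return pairs


# Made-up rows for illustration only.
catalog = [{"site_name": "JAA", "collection_date": "2012-06-15", "run": "example-run"}]
environment = [
    {"station": "JA-A", "date": "2012-06-15", "salinity": 5.0},
    {"station": "JA-A", "date": "2012-08-10", "salinity": 28.0},
]

for row, env in join_on_site_and_date(catalog, environment):
    print(row["site_name"], row["collection_date"], env["salinity"])
```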
Instead of citing this package which is just a catalog, please cite the original GenBank data, journal article, or related Arctic Data Center dataset as appropriate. Citation guidance for the journal article and related Arctic Data Center dataset is available on the respective publishers' websites.
Secondary contact between closely related taxa represents a “moment of truth” for speciation: an opportunity to test the efficacy of reproductive isolation that evolved in allopatry and to identify the genetic, behavioral, and/or ecological barriers that separate species in sympatry. Sex chromosomes are known to rapidly accumulate differences between species, an effect that may be exacerbated for neo-sex chromosomes that are transitioning from autosomal to sex-specific inheritance. Here we report that, in the Solomon Islands, two closely related bird species in the honeyeater family, Myzomela cardinalis and Myzomela tristrami, carry neo-sex chromosomes and have come into recent secondary contact after ~1.1 my of geographic isolation. Hybrids of the two species were first observed in sympatry ~100 years ago. To determine the genetic consequences of hybridization, we use population genomic analyses of individuals sampled in allopatry and in sympatry to characterize gene flow in the con...
This data repository contains Myzomela tristrami reference genome files. The sequences associated with this assembly are available on NCBI sequence read archive at https://www.ncbi.nlm.nih.gov/sra/?term=SRA%20SRR29254783. We sequenced a M. tristrami female at the University of Delaware DNA Sequencing & Genotyping Center. HiFi libraries were prepared with SMRTbell prep kit, followed by Blue Pippin size selection (15-20 kbp) before sequencing on a PacBio Sequel IIe. We generated a de novo assembly using hifiasm v0.13-r308 with default parameters using the resulting long reads (Cheng et al. 2021, 2022). We used GeMoMa (v1.8) and the annotation from zebra finch genome bTaeGut1.4.pri to infer a rough annotation of genes in the Myzomela genome.
We then used these rough annotations, comparing contigs against both zebra finch and the chicken genome bGalGal1.mat.broiler.GRCg7b to infer synteny relationships, remove duplicate haplotigs, and, finally, scaffold contigs into chromosomes in Myzomel...

# Chromosome assembly and preliminary gene and repeat annotations for Myzomela tristrami reference genome

I. Files

(GENOME)
Mt_v1.0_MAIN.fa.gz: Primary genome, (largely) scaffolded to chromosome-level, plus other primary assembled contigs
Mt_v1.0_MAIN.gff.gz: Simple gene annotations for primary genome, annotated using GeMoMa v1.8 and a zebra finch (bTaeGut1.4.pri) annotation reference
Mt_v1.0_extra.fa.gz: Additional contigs, not for use in most analyses but some may be of interest. This set is a combination of hand-identified haplotigs of the main genome, and assembler-identified "alternate" (haplotig) contigs

(ORIGINAL_ASSEMBLY_CONTIGS)
Mt_hifi.asm.p.fa.gz: "primary" assembly contigs, output from hifiasm (v0.13-r308)
Mt_hifi.asm.a.fa.gz: "alternate" assembly contigs, output from hifiasm (v0.13-r308)

(REPEAT_MASKING)
TElib_Myzo_preliminary.fa.gz: Preliminary Myzomela-tuned TE/repeat library, generated using RepeatModeler (v.2)
Mt_v1.0_MAIN_RM_sites_to_filter.txt: List of sites masked by RepeatM...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset
This dataset comprises the genome assemblies and respective 1,748-loci core-genome (cg) Multiple Locus Sequence Type (MLST) profiles [Pasteur schema (Moura et al. 2016) available in chewie-NS (Mamede et al. 2022)] of a final set of 1,874 Listeria monocytogenes samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) at the beginning of the analysis (November 2021). This set of samples was carefully selected to cover a wide genetic diversity (assessed in terms of Sequence Type [ST]). In total, 204 different STs are represented in this dataset, with ST121, ST6, ST9, ST1 and ST155 being in the top 5 and, together, corresponding to 37.9% of the dataset.
File “Lm_metadata.xlsx” contains metadata information for each isolate, including ENA/SRA accession number, BioProject and in-silico MLST ST.
The directory “assemblies/” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.
The file “profiles/Lm_profile.tsv” corresponds to a tab separated file with the 1,748-loci cgMLST profile of each isolate presented in the metadata file. These profiles were determined as explained below.
Dataset selection and curation
With the objective of creating a diverse dataset of L. monocytogenes genome assemblies, we collected information about the genetic diversity (STs) of the isolates available at BIGSdb-Lm database in the beginning of this analysis (November 2021) and in other previous works. Based on this information, we selected an initial dataset comprising 1,957 samples associated with three previous studies (Moura et al. 2016; Maury et al. 2017; Painset et al. 2019). Their WGS data was downloaded from ENA/SRA with fastq-dl v1.0.6. Read quality control, trimming and assembly were performed with the Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the Metadata file). In total, 1,874 isolates passed the dataset curation step and were included in the final dataset. cgMLST profiles of each of these isolates were determined with chewBBACA v2.8.5 (Silva et al. 2018), using the 1,748-loci Pasteur schema (Moura et al. 2016) available in chewie-NS (Mamede et al. 2022) and downloaded on June 23rd, 2022.
Amplicon sequencing utilizing next-generation platforms has significantly transformed how research is conducted, specifically in microbial ecology. However, primer and sequencing platform biases can confound or change the way scientists interpret these data. The Pacific Biosciences RSII instrument may also preferentially load smaller fragments, which may also be a function of PCR product exhaustion during sequencing. To further examine these biases, data are provided from 16S rRNA rumen community analyses. Specifically, data from the relative phylum-level abundances for the ruminal bacterial community are provided to determine between-sample variability. Direct sequencing of metagenomic DNA was conducted to circumvent primer-associated biases in 16S rRNA reads, and rarefaction curves were generated to demonstrate adequate coverage of each amplicon. PCR products were also subjected to reduced amplification and pooling to reduce the likelihood of PCR product exhaustion during sequencing on the Pacific Biosciences platform. The taxonomic profiles for the relative phylum-level and genus-level abundance of rumen microbiota as a function of PCR pooling for sequencing on the Pacific Biosciences RSII platform are provided. Data are within this article, and raw ruminal MiSeq sequence data are available from the NCBI Sequence Read Archive (SRA Accession SRP047292). Additional descriptive information is associated with NCBI BioProject PRJNA261425. http://www.ncbi.nlm.nih.gov/bioproject/PRJNA261425/
Resources in this dataset: Resource Title: NCBI Sequence Read Archive (SRA Accession SRP047292). File Name: Web Page, url: https://www.ncbi.nlm.nih.gov/sra/SRX704260 1 ILLUMINA (Illumina MiSeq) run: 978,195 spots, 532.9M bases, 311.6Mb downloads.