The Sequence Read Archive (SRA) stores sequencing data from the next generation of sequencing platforms including Roche 454 GS System®, Illumina Genome Analyzer®, Life Technologies AB SOLiD System®, Helicos Biosciences Heliscope®, Complete Genomics®, and Pacific Biosciences SMRT®.
FASTA and GFF files containing the sequence data: assembled contigs (FASTA), predicted genes (FASTA), predicted proteins (FASTA), and gene predictions (GFF v2). This dataset is not publicly accessible because these sequences have already been deposited in publicly available databases, so replication can be avoided. The data are also quite large, and numerous files are associated with these entries; they are included in the links below. The data can be accessed through the following web links:
https://www.ncbi.nlm.nih.gov/bioproject/PRJNA299404
https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP065069
http://enve-omics.ce.gatech.edu/data/showerheads
Format: The data represent genome sequencing and assembly of 180 different contigs. This dataset is associated with the following publication: Soto-Giron, M.J., L. Rodriguez, C. Luo, M. Elk, H. Ryu, J. Santodomingo, and K. Konstantinidis. Biofilms on Hospital Shower Hoses: Characterization and Implications for Nosocomial Infections. APPLIED AND ENVIRONMENTAL MICROBIOLOGY. American Society for Microbiology, Washington, DC, USA, 82(9): 2872-2883, (2016).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The whole genome of Pseudomonas sp. HOU2 was analyzed with RAST (Rapid Annotation using Subsystem Technology) (https://rast.nmpdr.org/) on 18 July 2024 to obtain the predicted HOU2 gene sequences, with the following options selected:
Genetic code: 11
Annotation scheme: RASTtk
Preserve gene calls: no
Automatically fix errors: yes
Fix frameshifts: yes
Backfill gaps: yes
The NCBI Sequence Read Archive accession for Pseudomonas sp. HOU2 is SRR29666724 (https://www.ncbi.nlm.nih.gov/sra/SRR29666724).
The NCBI complete genome accession for Pseudomonas sp. HOU2 is CP160398.1 (https://www.ncbi.nlm.nih.gov/nuccore/CP160398).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains SNP variant data and population genomics analysis results derived from RNA-Seq data of Cinnamomum species. The processed VCF files and population genetics outputs include SNP annotations, allele frequency distributions, linkage disequilibrium analysis, and Fst calculations. These data were used in the study titled "[Your Manuscript Title]" and are made publicly available for further research and validation.
Included Files:
Processed SNP datasets in VCF format
SNP annotations and allele frequency tables
Population genomics analysis results (Fst, LD decay, AFS)
Usage: Researchers can use this dataset for comparative genomics, evolutionary studies, and population structure analysis of Cinnamomum species.
NCBI SRA Data
The raw RNA-Seq data used in this study were retrieved from the NCBI Sequence Read Archive (SRA) under the following accession numbers:
SRR10063926
SRR10063927
SRR10063928
SRR31477125
SRR31477126
SRR31477127
The datasets can be accessed via the NCBI SRA database: https://www.ncbi.nlm.nih.gov/sra
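As a convenience, the run accessions listed above can be turned into direct record links. The sketch below is a minimal illustration, not part of the dataset: the helper name `run_url` and the accession pattern are assumptions, and the URL layout simply follows the `https://www.ncbi.nlm.nih.gov/sra/<accession>` form.

```python
import re

# Run accessions copied from the dataset description.
RUN_ACCESSIONS = [
    "SRR10063926", "SRR10063927", "SRR10063928",
    "SRR31477125", "SRR31477126", "SRR31477127",
]

# Assumption: a run accession is "SRR", "ERR" or "DRR" followed by digits.
RUN_PATTERN = re.compile(r"^[SED]RR\d+$")

SRA_BASE_URL = "https://www.ncbi.nlm.nih.gov/sra"


def run_url(accession: str) -> str:
    """Return the SRA web page URL for one run accession (hypothetical helper)."""
    if not RUN_PATTERN.match(accession):
        raise ValueError(f"not a run accession: {accession!r}")
    return f"{SRA_BASE_URL}/{accession}"


if __name__ == "__main__":
    for acc in RUN_ACCESSIONS:
        print(run_url(acc))
```

Validating the accession before building the link catches copy-paste errors such as a stray space or a study accession (SRP...) pasted in place of a run accession.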
Reference Genome
The reference genome used for alignment and variant calling was obtained from:
GenBank Accession: GCA_003546025.1
Available at: NCBI Genome Database
Variant Calling and Population Genetics Analysis Tools
The variant calling, SNP annotation, and population genetics analyses were performed using the following tools:
VCFtools: Danecek P, et al. (2011) "The variant call format and VCFtools." Bioinformatics. DOI: 10.1093/bioinformatics/btr330
PLINK: Purcell S, et al. (2007) "PLINK: a tool set for whole-genome association and population-based linkage analyses." American Journal of Human Genetics. DOI: 10.1086/519795
I. Files
(GENOME)
Mt_v1.0_MAIN.fa.gz: Primary genome, (largely) scaffolded to chromosome-level, plus other primary assembled contigs
Mt_v1.0_MAIN.gff.gz: Simple gene annotations for primary genome, annotated using GeMoMa v1.8 and a zebra finch (bTaeGut1.4.pri) annotation reference
Mt_v1.0_extra.fa.gz: Additional contigs, not for use in most analyses but some may be of interest. This set is a combination of hand-identified haplotigs of the main genome, and assembler-identified "alternate" (haplotig) contigs
(ORIGINAL_ASSEMBLY_CONTIGS)
Mt_hifi.asm.p.fa.gz: "primary" assembly contigs, output from hifiasm (v0.13-r308)
Mt_hifi.asm.a.fa.gz: "alternate" assembly contigs, output from hifiasm (v0.13-r308)
(REPEAT_MASKING)
TElib_Myzo_preliminary.fa.gz: Preliminary Myzomela-tuned TE/repeat library, generated using RepeatModeler (v.2)
Mt_v1.0_MAIN_RM_sites_to_filter.txt: List of sites masked by RepeatM...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data related to RNA sequence genetic accessions at the National Center for Biotechnology Information (NCBI) including information about the host organism, collection location, and collection date.
The accessions are the unprocessed Illumina MiSeq reads for the Ross Sea Dinoflagellate RNA-Seq experiments, Phaeocystis antarctica RNA-Seq experiments, and Pyramimonas tychotreta & Micromonas polaris (CCMP 2099) mixotrophy experiments.
Pyramimonas tychotreta & Micromonas polaris (CCMP 2099) mixotrophy RNA sequences are available through the NCBI Sequence Read Archive (SRA) under the accession number SRP090401 (BioProject PRJNA342459).
Ross Sea Dinoflagellate RNA sequences are available through the NCBI Sequence Read Archive (SRA) under the accession number SRP132912 (BioProject PRJNA428208).
Phaeocystis antarctica RNA sequences are available through the NCBI Sequence Read Archive (SRA) under the accession number SRP133243 (BioProject PRJNA434497).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 2: Supplementary Table S1. List of NCBI SRA (https://www.ncbi.nlm.nih.gov/sra) studies and associated platform, research institute and country.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset
This dataset comprises the genome assemblies and respective 8,558-loci whole-genome (wg) Multiple Locus Sequence Type (MLST) profiles [INNUENDO schema (Llarena et al. 2018) available in chewie-NS (Mamede et al. 2022)] of a final set of 1,434 Salmonella enterica samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) at the beginning of the analysis (November 2021). This set of samples was carefully selected to cover a wide genetic diversity (assessed in terms of serotype). In total, 125 different serotypes are represented in this dataset, with Typhimurium (including monophasic), Enteritidis and Infantis being the most represented ones and, together, corresponding to 56.2% of the dataset.
File “Se_metadata.xlsx” contains metadata information for each isolate, including ENA/SRA accession number, BioProject and in-silico MLST ST and serotype.
The directory “assemblies/” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.
The file “profiles/Se_profiles_wgMLST.tsv” corresponds to a tab separated file with the 8,558-loci wgMLST profiles of each isolate presented in the metadata file. The files “profiles/Se_profiles_cgMLST_95.tsv”, “profiles/Se_profiles_cgMLST_98.tsv” and “profiles/Se_profiles_cgMLST_100.tsv” correspond to a 3,261-loci, 3,179-loci and 874-loci cgMLST profiles of each isolate presented in the metadata file, respectively. These profiles were determined as explained below.
Dataset selection and curation
With the objective of creating a diverse dataset of S. enterica genome assemblies, we collected information about the genetic diversity (serotype) of the isolates available in the Enterobase database at the beginning of this analysis (November 2021) and in other previous works. Based on this information, we selected an initial dataset comprising 1,779 samples associated with four BioProjects (PRJEB16326, PRJEB20997, PRJEB30335 and PRJEB39988). Their WGS data were downloaded from ENA/SRA with fastq-dl v1.0.6. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the Metadata file). In total, 1,434 isolates passed this curation step and were included in the final dataset.
In-silico serotyping was performed with SeqSero2 v1.2.1 (Zhang et al. 2019). wgMLST profiles of each of these isolates were determined with chewBBACA v2.8.5 (Silva et al. 2018), using the 8,558-loci INNUENDO schema available in chewie-NS (Llarena et al. 2018; Mamede et al. 2022) and downloaded on May 31st, 2022. Three cgMLST schemas were obtained with ReporTree v1.0.0 (Mixão et al. 2022) using the 8,558-loci wgMLST profiles of the 1,434 isolates as input and setting distinct “--site-inclusion” thresholds: 0.95, 0.98 and 1.0 (i.e., keep schema loci called in at least 95%, 98% and 100% of the samples, resulting in 3,261-loci, 3,179-loci and 874-loci allelic matrices, respectively).
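The site-inclusion rule described above (keep a locus only if it is called in at least a given fraction of samples) can be illustrated with a small sketch. This is not ReporTree's implementation: the function name `loci_passing`, the toy matrix, and the use of "0" to mark an uncalled locus are all assumptions made for illustration.

```python
from typing import Dict, List

# Assumption: "0" marks a locus that was not called in a sample.
MISSING = "0"

# Toy allelic matrix (made-up values): sample -> allele calls, one per locus.
TOY_LOCI = ["locusA", "locusB", "locusC"]
TOY_PROFILES = {
    "sample1": ["1", "2", "0"],
    "sample2": ["1", "0", "0"],
    "sample3": ["2", "3", "4"],
    "sample4": ["1", "3", "0"],
}


def loci_passing(profiles: Dict[str, List[str]], loci: List[str],
                 threshold: float) -> List[str]:
    """Return the loci called in at least `threshold` (0..1) of the samples."""
    n_samples = len(profiles)
    if n_samples == 0:
        raise ValueError("no samples in the allelic matrix")
    kept = []
    for j, locus in enumerate(loci):
        # Count the samples in which this locus has an allele call.
        called = sum(1 for alleles in profiles.values() if alleles[j] != MISSING)
        if called / n_samples >= threshold:
            kept.append(locus)
    return kept


if __name__ == "__main__":
    for threshold in (0.75, 1.0):
        print(threshold, loci_passing(TOY_PROFILES, TOY_LOCI, threshold))
```

Raising the threshold can only shrink the retained set, which matches the decreasing locus counts reported for the 0.95, 0.98 and 1.0 settings.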
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 3: Supplementary Table S2. HiSeq datasets. Publicly accessible studies deposited in NCBI SRA (https://www.ncbi.nlm.nih.gov/sra) were reviewed to account for lost read names.
Background
The application of reduced metagenomic sequencing approaches holds promise as a middle ground between targeted amplicon sequencing and whole metagenome sequencing approaches but has not been widely adopted as a technique. A major barrier to adoption is the lack of read simulation software built to handle characteristic features of these novel approaches. Reduced metagenomic sequencing (RMS) produces unique patterns of fragmentation per genome that are sensitive to restriction enzyme choice, and the non-uniform size selection of these fragments may introduce novel challenges to taxonomic assignment as well as relative abundance estimates.
Results
Through the development and application of simulation software, readsynth, we compare simulated metagenomic sequencing libraries with existing RMS data to assess the influence of multiple library preparation and sequencing steps on downstream analytical results. Based on read depth per position, readsynth achieved 0.79 Pearson’s corre...
Sequence data were collected and aggregated from publicly available NCBI SRA databases for raw sequence data (https://www.ncbi.nlm.nih.gov/sra) and NCBI RefSeq databases for reference genome assemblies (https://www.ncbi.nlm.nih.gov/refseq/). Downloaded reference genomes have been concatenated using the command-line "cat" command and indexed using the "bwa index" command.
# readsynth_analysis
https://doi.org/10.5061/dryad.nzs7h44zk
The dataset contained here provides the necessary raw sequence data to perform analyses for the simulation software readsynth.
The dataset includes the genomes and databases necessary to reproduce the steps in the github repository readsynth_analysis and correspond with that repository's "raw_data" directory.
The genome directory "raw_data" is broken into the following subdirectories (further descriptions below):
.
├── helius
│   └── all_2084
│       ├── genomes
│       └── genomes_combined
├── kraken_dbs
│   ├── k2_pluspfp_20220607
│   ├── snipen_bei_db
│   │   └── library
│   │       └── added
│   └── sun_atcc_db
│       └── library
│           └── added
├── liu_RMS
│   └── mock_community_estimate
│       ├── 10M_bracken_profile
│ ...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This file contains a de novo assembly of the Ensete ventricosum genome based on whole-genome shotgun sequencing with Illumina HiSeq paired reads, assembled using SOAPdenovo. This assembly has, in part, been submitted to GenBank under accession number AMZH00000000.1 (http://www.ncbi.nlm.nih.gov/nuccore/AMZH00000000.1/). However, because of limitations on the number of supercontigs/contigs that GenBank will accept, we did not submit supercontigs and contigs shorter than 5 kb. The raw data are available from the Sequence Read Archive under accession number SRX202265 (see http://www.ncbi.nlm.nih.gov/sra?LinkName=nuccore_sra_wgs&from_uid=440571971). Data are described in this paper: Harrison, J.; Moore, K.A.; Paszkiewicz, K.; Jones, T.; Grant, M.R.; Ambacheew, D.; Muzemil, S.; Studholme, D.J. A Draft Genome Sequence for Ensete ventricosum, the Drought-Tolerant “Tree Against Hunger”. Agronomy 2014, 4, 13-33.
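To see which sequences in the full assembly fall on either side of the 5 kb submission cutoff mentioned above, a length filter over the FASTA records is enough. The sketch below is a minimal plain-Python illustration; the function names `read_fasta` and `keep_long` are hypothetical, and the authors' actual filtering procedure is not described in the source.

```python
from typing import Iterable, Iterator, List, Tuple

MIN_LENGTH = 5000  # 5 kb cutoff mentioned in the description


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_long(records: Iterable[Tuple[str, str]],
              min_length: int = MIN_LENGTH) -> List[Tuple[str, str]]:
    """Keep records whose sequence is at least `min_length` bases long."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


if __name__ == "__main__":
    # Made-up records: one 6,000-base contig and one 4-base contig.
    toy = [">contig_long", "A" * 3000, "C" * 3000, ">contig_short", "ACGT"]
    for name, seq in keep_long(read_fasta(toy)):
        print(name, len(seq))
```

For a real file, pass an open file handle (for example `open("assembly.fasta")`, a hypothetical path) to `read_fasta` in place of the toy list.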
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Global dataset of C. auris genome sequences (raw reads) generated with Illumina platforms following a paired-end approach. Selected records were retrieved from the NCBI SRA database (https://www.ncbi.nlm.nih.gov/sra/?term=candida+auris, accessed on December 30, 2022) through the Run Selector tool. Inclusion criteria: assay type, WGS; organism, Candida auris; host, Homo sapiens; instrument, Illumina MiSeq, iSeq, HiSeq, NextSeq, NovaSeq; sequenced megabases, >190.
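The inclusion criteria above amount to a simple record filter. The sketch below shows one way to express them in Python; it is an illustration only, and the field names (`assay_type`, `organism`, `host`, `instrument`, `mbases`), the toy records, and the substring matching of instrument families are assumptions rather than the exact column names or procedure used by the authors.

```python
from typing import Dict, List

# Instrument families named in the inclusion criteria (matched case-insensitively).
INSTRUMENT_FAMILIES = ("miseq", "iseq", "hiseq", "nextseq", "novaseq")
MIN_MBASES = 190.0  # criterion: sequenced megabases > 190


def meets_criteria(record: Dict[str, str]) -> bool:
    """Return True if a metadata record satisfies every inclusion criterion."""
    instrument = record["instrument"].lower()
    return (
        record["assay_type"] == "WGS"
        and record["organism"] == "Candida auris"
        and record["host"].lower() == "homo sapiens"
        and any(family in instrument for family in INSTRUMENT_FAMILIES)
        and float(record["mbases"]) > MIN_MBASES
    )


def select_records(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Filter a list of metadata records down to the included ones."""
    return [r for r in records if meets_criteria(r)]


if __name__ == "__main__":
    # Made-up records with placeholder run names, for illustration only.
    toy = [
        {"run": "RUN_A", "assay_type": "WGS", "organism": "Candida auris",
         "host": "Homo sapiens", "instrument": "Illumina MiSeq", "mbases": "450"},
        {"run": "RUN_B", "assay_type": "WGS", "organism": "Candida auris",
         "host": "Homo sapiens", "instrument": "Illumina MiSeq", "mbases": "120"},
        {"run": "RUN_C", "assay_type": "AMPLICON", "organism": "Candida auris",
         "host": "Homo sapiens", "instrument": "NextSeq 500", "mbases": "900"},
    ]
    print([r["run"] for r in select_records(toy)])
```

Keeping each criterion as a separate condition makes it easy to audit which rule excluded a given record.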
A de novo mutation can cause the onset of a disease. This subtype rarely shows symptoms in childhood and is easily overlooked. Expanding the genotype mutation spectrum of the gene can lay a foundation for the further application of mutation screening in genetic counseling.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The antibiograms.tsv.zip dataset collects antibiograms found in NCBI, ENA and BV-BRC. Each row corresponds to an antibiotic susceptibility test (AST) for a given sample against a specific antibiotic. The dataset is a table with 14 columns:
GCA_ or GCF_), the BV-BRC Genome Database (starting with BVBRC_), or the ENA FTP Site (starting with ftp://ftp.sra.ebi.ac.uk/vol1/analysis/).
The gn-genomes.zip file contains some extra genomes with associated AST metadata, found in the metadata.xlsx file within it.
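The identifier prefixes mentioned above suggest a simple way to tell where each genome comes from. The sketch below is a hedged illustration: the function name `genome_source` is hypothetical, and mapping the GCA_/GCF_ prefixes to NCBI is an assumption based on the usual form of NCBI assembly accessions, since the start of the original sentence is missing.

```python
# Prefix of ENA analysis files, as given in the description.
ENA_FTP_PREFIX = "ftp://ftp.sra.ebi.ac.uk/vol1/analysis/"


def genome_source(identifier: str) -> str:
    """Guess the source database of a genome identifier from its prefix."""
    if identifier.startswith(("GCA_", "GCF_")):
        # Assumption: GCA_/GCF_ identifiers are NCBI assembly accessions.
        return "NCBI"
    if identifier.startswith("BVBRC_"):
        return "BV-BRC"
    if identifier.startswith(ENA_FTP_PREFIX):
        return "ENA"
    return "unknown"


if __name__ == "__main__":
    # GCA_003546025.1 appears elsewhere on this page; the others are made up.
    for example in ("GCA_003546025.1", "BVBRC_example", ENA_FTP_PREFIX + "example"):
        print(example, "->", genome_source(example))
```

Returning "unknown" instead of raising keeps the function usable on tables that mix in identifiers from other sources.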
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset
This dataset comprises the genome assemblies and respective 7,601-loci whole-genome (wg) Multiple Locus Sequence Type (MLST) profiles [INNUENDO schema (Llarena et al. 2018) available in chewie-NS (Mamede et al. 2022)] of a final set of 1,999 Escherichia coli samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) at the beginning of the analysis (November 2021). This set of samples was carefully selected to cover a wide genetic diversity (assessed in terms of serotype). In total, 411 different serotypes are represented in this dataset, with O157:H7 being the most represented one, corresponding to 37.1% of the dataset.
File “Ec_metadata.xlsx” contains metadata information for each isolate, including ENA/SRA accession number, BioProject and in-silico MLST ST and serotype.
The directory “assemblies/” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.
The file “profiles/Ec_profiles_wgMLST.tsv” corresponds to a tab separated file with the 7,601-loci wgMLST profiles of each isolate presented in the metadata file. The files “profiles/Ec_profiles_cgMLST_95.tsv”, “profiles/Ec_profiles_cgMLST_98.tsv” and “profiles/Ec_profiles_cgMLST_100.tsv” correspond to a 2,826-loci, 2,704-loci and 465-loci cgMLST profiles of each isolate presented in the metadata file, respectively. These profiles were determined as explained below.
Dataset selection and curation
With the objective of creating a diverse dataset of E. coli genome assemblies, we collected information about the genetic diversity (serotype) of the isolates available in the Enterobase database at the beginning of this analysis (November 2021) and in other previous works. Based on this information, we selected an initial dataset comprising 2,688 samples associated with three BioProjects (PRJNA230969, PRJEB27020 and PRJNA248042). Their WGS data were downloaded from ENA/SRA with fastq-dl v1.0.6. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the Metadata file). In total, 1,999 isolates passed this curation step and were included in the final dataset.
In-silico serotyping was performed with seq_typing v2.2. wgMLST profiles of each of these isolates were determined with chewBBACA v2.8.5 (Silva et al. 2018), using the 7,601-loci INNUENDO schema available in chewie-NS (Llarena et al. 2018; Mamede et al. 2022) and downloaded on May 31st, 2022. Three cgMLST schemas were obtained with ReporTree v1.0.0 (Mixão et al. 2022) using the 7,601-loci wgMLST profiles of the 1,999 isolates as input and setting distinct “--site-inclusion” thresholds: 0.95, 0.98 and 1.0 (i.e., keep schema loci called in at least 95%, 98% and 100% of the samples, resulting in 2,826-loci, 2,704-loci and 465-loci allelic matrices, respectively).
Acknowledgements
We thank the National Distributed Computing Infrastructure of Portugal (INCD) for providing the necessary resources to run the genome assemblies. INCD was funded by FCT and FEDER under the project 22153-01/SAICT/2016.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This FigShare repository contains genomic datasets for the Thylacine Genomics Project at the University of Melbourne (VIC, Australia). Currently 4 files arising from 2 publications are hosted on this repository. All assemblies arise from NCBI BioSample SAMN06049672.
1) Feigin et al. 2018: Genome of the Tasmanian tiger provides insights into the evolution and demography of an extinct marsupial carnivore [https://www.nature.com/articles/s41559-017-0417-y]
a) ThyCyn1.0: This assembly is the first de novo whole-genome assembly for the thylacine. It is a contig-level assembly and thus contains no scaffolds. It was generated from short insert paired-end reads (https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=354646). The sequencing and data pre-processing strategy is discussed in the methods of Feigin et al. 2018. This assembly was used exclusively to estimate genome size and G+C content of the thylacine genome.
b) UniMelb_Thylacine_Refassem_1/GCA_007646695.1: Because of the highly fragmentary nature (low N50) of ThyCyn1.0, UniMelb_Thylacine_Refassem_1 was generated to perform all evolutionary analyses detailed in Feigin et al. 2018. UniMelb_Thylacine_Refassem_1 was generated by mapping thylacine reads against the repeat-masked version of the previous Tasmanian devil draft genome Devil_ref v7.0 and generating reference-guided scaffolds. This is not a complete genome assembly, as it is composed only of non-repetitive genomic regions and does not include indel differences between thylacine and devil. This was done to preserve the coordinate systems between thylacine and devil, permitting the use of the already-existing Tasmanian devil gene annotations. See methods of Feigin et al. 2018 for details. Assembly is hosted on NCBI under BioProject PRJNA354646.
2) Feigin et al. 2022: A chromosome-scale hybrid genome assembly of the extinct Tasmanian tiger (Thylacinus cynocephalus) [https://www.biorxiv.org/content/10.1101/2022.03.02.482690v1.full]
a) ThyCyn2.0: This assembly is a chromosome-scale hybrid genome for the thylacine. It was generated by producing improved de novo contigs and short read-based scaffolds, which were then aligned to the Tasmanian devil reference genome mSarHar1.11. This assembly represents a substantial improvement in contiguity and completeness over both ThyCyn1.0 and UniMelb_Thylacine_Refassem_1. Assembly is hosted on NCBI under BioProject PRJNA354646.
b) ThyCyn2.0 annotation: Associated with ThyCyn2.0, we have produced a set of homology-based gene annotations using a gene model liftover procedure (see Feigin et al. 2022 for details). Briefly, exons from the Tasmanian devil genome RefSeq annotation were aligned to the thylacine assembly and gene models were created by linking exons together, filtering for those that preserved the intron-exon structure of the reference devil annotation (with an allowable distance scaling factor of 4).