The Sequence Read Archive (SRA) stores sequencing data from the next generation of sequencing platforms including Roche 454 GS System®, Illumina Genome Analyzer®, Life Technologies AB SOLiD System®, Helicos Biosciences Heliscope®, Complete Genomics®, and Pacific Biosciences SMRT®.
Species included in the analysis, including environment (freshwater [FW] or saltwater [SW]), collection location, sample size (N), NCBI Sequence Read Archive (SRA) accession numbers, and study reference.
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This table describes whether previous reports exist linking these genes to aging or neurodegeneration phenotypes in humans or in another model organism. (XLSX)
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset
This dataset comprises the genome assemblies and respective 8,558-loci whole-genome (wg) Multiple Locus Sequence Type (MLST) profiles [INNUENDO schema (Llarena et al. 2018) available in chewie-NS (Mamede et al. 2022)] of a final set of 1,434 Salmonella enterica samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) at the beginning of the analysis (November 2021). This set of samples was carefully selected to cover a wide genetic diversity (assessed in terms of serotype). In total, 125 different serotypes are represented in this dataset, with Typhimurium (including monophasic), Enteritidis and Infantis being the most represented ones and, together, corresponding to 56.2% of the dataset.
File “Se_metadata.xlsx” contains metadata information for each isolate, including ENA/SRA accession number, BioProject and in-silico MLST ST and serotype.
The directory “assemblies/” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.
The file “profiles/Se_profiles_wgMLST.tsv” is a tab-separated file with the 8,558-loci wgMLST profiles of each isolate presented in the metadata file. The files “profiles/Se_profiles_cgMLST_95.tsv”, “profiles/Se_profiles_cgMLST_98.tsv” and “profiles/Se_profiles_cgMLST_100.tsv” contain the 3,261-loci, 3,179-loci and 874-loci cgMLST profiles of each isolate presented in the metadata file, respectively. These profiles were determined as explained below.
Dataset selection and curation
With the objective of creating a diverse dataset of S. enterica genome assemblies, we collected information about the genetic diversity (serotype) of the isolates available in the Enterobase database at the beginning of this analysis (November 2021) and in other previous works. Based on this information, we selected an initial dataset comprising 1,779 samples associated with four BioProjects (PRJEB16326, PRJEB20997, PRJEB30335 and PRJEB39988). Their WGS data were downloaded from ENA/SRA with fastq-dl v1.0.6. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the metadata file). In total, 1,434 isolates passed this curation step and were included in the final dataset. In-silico serotyping was performed with SeqSero2 v1.2.1 (Zhang et al. 2019). wgMLST profiles of each of these isolates were determined with chewBBACA v2.8.5 (Silva et al. 2018), using the 8,558-loci INNUENDO schema available in chewie-NS (Llarena et al. 2018; Mamede et al. 2022) and downloaded on May 31st, 2022. Three cgMLST schemas were obtained with ReporTree v1.0.0 (Mixão et al. 2022) using the 8,558-loci wgMLST profiles of the 1,434 isolates as input and setting distinct “--site-inclusion” thresholds: 0.95, 0.98 and 1.0 (i.e., keep schema loci called in at least 95%, 98% and 100% of the samples, resulting in 3,261-loci, 3,179-loci and 874-loci allelic matrices, respectively).
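The effect of a site-inclusion threshold can be illustrated with a minimal Python sketch. It is not ReporTree's implementation. It assumes a tab-separated profile matrix with sample identifiers in the first column, and it assumes that called alleles are written as positive integers while any other value (for example "0" or a text label) marks a missing call. The function name `core_loci` and the toy matrix are hypothetical.

```python
import csv
import io


def core_loci(profile_tsv, threshold):
    """Return the loci called in at least `threshold` (0..1) of the samples.

    Assumption: a call is a positive integer; anything else is missing.
    """
    reader = csv.reader(io.StringIO(profile_tsv), delimiter="\t")
    header = next(reader)
    loci = header[1:]  # first column holds the sample identifier
    called = [0] * len(loci)
    n_samples = 0
    for row in reader:
        n_samples += 1
        for i, allele in enumerate(row[1:]):
            if allele.isdigit() and int(allele) > 0:
                called[i] += 1
    if n_samples == 0:
        return []
    return [
        locus
        for locus, count in zip(loci, called)
        if count / n_samples >= threshold
    ]


# Toy matrix (hypothetical): locusA is called in 4/4 samples,
# locusB in 3/4 and locusC in 2/4.
TOY = (
    "sample\tlocusA\tlocusB\tlocusC\n"
    "s1\t1\t2\t0\n"
    "s2\t1\t2\t5\n"
    "s3\t3\t0\t5\n"
    "s4\t1\t4\tLNF\n"
)

for t in (0.5, 0.75, 1.0):
    print(t, core_loci(TOY, t))
```

Raising the threshold can only shrink the retained set, which matches the decreasing locus counts reported for the 0.95, 0.98 and 1.0 settings.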
A table (DP_SRA.xlsx) has one row per sample and columns giving the BioSample accession number (NCBI), collection (date), library strategy, target (source), and sequencing (technology) for each individual sample. The zip file (Genome_Set01.zip) contains nine (9) FASTA files (DP_bin_02.fasta, DP_bin_04.fasta, DP_bin_09.fasta, DP_bin_10.fasta, DP_bin_14.fasta, DP_bin_15.fasta, DP_bin_16a.fasta, DP_bin_20.fasta, DP_bin_23.fasta) with the contig sequences (i.e. binning) for each metagenome-assembled genome (MAG). These data are available from the NCBI Sequence Read Archive (SRA) under the BioProject (https://www.ncbi.nlm.nih.gov/bioproject) with accession number PRJNA646252 and the following BioSample numbers: SAMN15536103 to SAMN15536108. This dataset is associated with the following publication: Gomez-Alvarez, V., H. Liu, J. Pressman, and D. Wahman. Metagenomic Profile of Microbial Communities in a Drinking Water Storage Tank Sediment after Sequential Exposure to Monochloramine, Free Chlorine, and Monochloramine. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 1(5): 1283-1294, (2021).
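For scripted lookups, the BioSample range quoted above can be expanded into individual accessions. The following is a small Python sketch, assuming the range is contiguous and inclusive; the helper name `expand_accession_range` is hypothetical, and whether every accession in the range belongs to the BioProject should still be confirmed against the DP_SRA.xlsx table or NCBI.

```python
import re


def expand_accession_range(first, last):
    """Expand an inclusive range such as SAMN15536103..SAMN15536108."""
    pattern = r"([A-Z]+)(\d+)"
    m_first = re.fullmatch(pattern, first)
    m_last = re.fullmatch(pattern, last)
    if not m_first or not m_last:
        raise ValueError("accessions must be letters followed by digits")
    if m_first.group(1) != m_last.group(1):
        raise ValueError("accession prefixes differ")
    prefix = m_first.group(1)
    width = len(m_first.group(2))  # keep any zero padding
    start, stop = int(m_first.group(2)), int(m_last.group(2))
    if stop < start:
        raise ValueError("range end precedes range start")
    return [f"{prefix}{n:0{width}d}" for n in range(start, stop + 1)]


biosamples = expand_accession_range("SAMN15536103", "SAMN15536108")
print(biosamples)
```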
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset
This dataset comprises the genome assemblies and respective 7,601-loci whole-genome (wg) Multiple Locus Sequence Type (MLST) profiles [INNUENDO schema (Llarena et al. 2018) available in chewie-NS (Mamede et al. 2022)] of a final set of 1,999 Escherichia coli samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) at the beginning of the analysis (November 2021). This set of samples was carefully selected to cover a wide genetic diversity (assessed in terms of serotype). In total, 411 different serotypes are represented in this dataset, with O157:H7 being the most represented one, corresponding to 37.1% of the dataset.
File “Ec_metadata.xlsx” contains metadata information for each isolate, including ENA/SRA accession number, BioProject and in-silico MLST ST and serotype.
The directory “assemblies/” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.
The file “profiles/Ec_profiles_wgMLST.tsv” is a tab-separated file with the 7,601-loci wgMLST profiles of each isolate presented in the metadata file. The files “profiles/Ec_profiles_cgMLST_95.tsv”, “profiles/Ec_profiles_cgMLST_98.tsv” and “profiles/Ec_profiles_cgMLST_100.tsv” contain the 2,826-loci, 2,704-loci and 465-loci cgMLST profiles of each isolate presented in the metadata file, respectively. These profiles were determined as explained below.
Dataset selection and curation
With the objective of creating a diverse dataset of E. coli genome assemblies, we collected information about the genetic diversity (serotype) of the isolates available in the Enterobase database at the beginning of this analysis (November 2021) and in other previous works. Based on this information, we selected an initial dataset comprising 2,688 samples associated with three BioProjects (PRJNA230969, PRJEB27020 and PRJNA248042). Their WGS data were downloaded from ENA/SRA with fastq-dl v1.0.6. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the metadata file). In total, 1,999 isolates passed this curation step and were included in the final dataset. In-silico serotyping was performed with seq_typing v2.2. wgMLST profiles of each of these isolates were determined with chewBBACA v2.8.5 (Silva et al. 2018), using the 7,601-loci INNUENDO schema available in chewie-NS (Llarena et al. 2018; Mamede et al. 2022) and downloaded on May 31st, 2022. Three cgMLST schemas were obtained with ReporTree v1.0.0 (Mixão et al. 2022) using the 7,601-loci wgMLST profiles of the 1,999 isolates as input and setting distinct “--site-inclusion” thresholds: 0.95, 0.98 and 1.0 (i.e., keep schema loci called in at least 95%, 98% and 100% of the samples, resulting in 2,826-loci, 2,704-loci and 465-loci allelic matrices, respectively).
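The curation rule described above can be summarized as a small decision function. This Python sketch encodes only the stated threshold logic; the original recovery followed manual inspection of a random subset, which code cannot reproduce. The argument names (`qc_status`, `failed_params`, `species_read_pct`) are hypothetical and do not reflect the actual Aquamis report format.

```python
def keep_assembly(qc_status, failed_params, species_read_pct):
    """Decide whether an assembly enters the final dataset.

    Assumptions (hypothetical field names, not the Aquamis output format):
      qc_status        -- "pass" or "fail"
      failed_params    -- names of the QC parameters that failed
      species_read_pct -- percentage of reads assigned to the expected species
    """
    if qc_status == "pass":
        return True
    # Recover assemblies that failed only on the contamination SNV count
    # and whose reads are >98% from the correct species.
    failed_only_on_contam = set(failed_params) == {"NumContamSNVs"}
    return failed_only_on_contam and species_read_pct > 98.0


examples = [
    ("pass", [], 99.5),
    ("fail", ["NumContamSNVs"], 99.1),
    ("fail", ["NumContamSNVs"], 97.0),
    ("fail", ["NumContamSNVs", "N50"], 99.9),
]
for status, failed, pct in examples:
    print(status, failed, pct, "->", keep_assembly(status, failed, pct))
```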
Acknowledgements
We thank the National Distributed Computing Infrastructure of Portugal (INCD) for providing the necessary resources to run the genome assemblies. INCD was funded by FCT and FEDER under the project 22153-01/SAICT/2016.
Table of Contents
Main Description File Descriptions Linked Files Installation and Instructions
This is the Zenodo repository for the manuscript titled "A TCR β chain-directed antibody-fusion molecule that activates and expands subsets of T cells and promotes antitumor activity.". The code included in the file titled marengo_code_for_paper_jan_2023.R was used to generate the figures from the single-cell RNA sequencing data.
The following libraries are required for script execution:
Seurat, scRepertoire, ggplot2, stringr, dplyr, ggridges, ggrepel, ComplexHeatmap
The code can be downloaded and opened in RStudio. The "marengo_code_for_paper_jan_2023.R" file contains all the code needed to reproduce the figures in the paper. The "Marengo_newID_March242023.rds" file is available at the following address: https://zenodo.org/badge/DOI/10.5281/zenodo.7566113.svg (Zenodo DOI: 10.5281/zenodo.7566113). The "all_res_deg_for_heat_updated_march2023.txt" file contains the unfiltered results from the DGE analysis, also used to create the heatmap with DGE and volcano plots. The "genes_for_heatmap_fig5F.xlsx" file contains the genes included in the heatmap in figure 5F.
This repository contains code for the analysis of a single-cell RNA-seq dataset. The dataset contains raw FASTQ files as well as the aligned files that were deposited in GEO. The "Rdata" or "Rds" file was deposited in Zenodo. Provided below are descriptions of the linked datasets:
Gene Expression Omnibus (GEO) ID: GSE223311 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE223311)
Title: Gene expression profile at single cell level of CD4+ and CD8+ tumor infiltrating lymphocytes (TIL) originating from the EMT6 tumor model from mSTAR1302 treatment. Description: This submission contains the "matrix.mtx", "barcodes.tsv", and "genes.tsv" files for each replicate and condition, corresponding to the aligned files for single cell sequencing data. Submission type: Private. In order to gain access to the repository, you must use a reviewer token (https://www.ncbi.nlm.nih.gov/geo/info/reviewer.html).
Sequence read archive (SRA) repository ID: SRX19088718 and SRX19088719
Title: Gene expression profile at single cell level of CD4+ and CD8+ tumor infiltrating lymphocytes (TIL) originating from the EMT6 tumor model from mSTAR1302 treatment.
Description: This submission contains the raw sequencing (.fastq.gz) files, which are gzip-compressed FASTQ text files.
Submission type: Private. In order to gain access to the repository, you must use a reviewer token (https://www.ncbi.nlm.nih.gov/geo/info/reviewer.html).
Zenodo DOI: 10.5281/zenodo.7566113 (https://zenodo.org/record/7566113#.ZCcmvC2cbrJ)
Title: A TCR β chain-directed antibody-fusion molecule that activates and expands subsets of T cells and promotes antitumor activity. Description: This submission contains the "Rdata" or ".Rds" file, which is an R object file. This file is required to run the code. Submission type: Restricted Access. In order to gain access to the repository, you must contact the author.
The code included in this submission requires several essential packages, as listed above. Please follow these instructions for installation:
Ensure you have R version 4.1.2 or higher for compatibility.
Although it is not essential, you can use RStudio (version 2022.12.0+353) for accessing and executing the code.
marengo_code_for_paper_jan_2023.R Install_Packages.R Marengo_newID_March242023.rds genes_for_heatmap_fig5F.xlsx all_res_deg_for_heat_updated_march2023.txt
You can use the following code to set the working directory in R:
setwd(directory)
https://media.market.us/privacy-policy
The Global Bioinformatics Services Market is projected to reach USD 10.7 billion by 2033, growing from USD 2.9 billion in 2023 at a CAGR of 13.9%. Growth is being driven by the rapid expansion of genomic and health data generation across research institutions, healthcare systems, and public-health agencies. The World Health Organization’s Global Genomic Surveillance Strategy has positioned bioinformatics as a core element in detecting and responding to health threats. This policy direction is reinforcing global demand for scalable analytical platforms, secure data sharing, and sustainable workflow solutions.
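The quoted growth rate can be checked against the two market-size figures. This Python sketch assumes ten compounding years (2023 to 2033) and annual compounding; neither assumption is stated explicitly in the text.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate implied by two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


START_USD_BN = 2.9   # 2023 market size from the text
END_USD_BN = 10.7    # 2033 projection from the text
YEARS = 10           # assumption: 2023 -> 2033, compounded annually

implied_rate = cagr(START_USD_BN, END_USD_BN, YEARS)
projected = START_USD_BN * (1 + 0.139) ** YEARS

print(f"implied CAGR: {implied_rate:.1%}")
print(f"2.9 bn grown at 13.9% for {YEARS} years: {projected:.1f} bn")
```

Under these assumptions the implied rate and the forward projection both agree with the stated 13.9% and USD 10.7 billion to one decimal place.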
A fundamental growth catalyst is the declining cost of sequencing. According to the U.S. National Human Genome Research Institute, the cost per genome has decreased sharply since the late 2000s. As sequencing becomes more affordable, the number of samples increases, driving demand for downstream data storage, processing, and interpretation. Consequently, outsourcing bioinformatics tasks to specialized service providers has become more common and cost-effective.
Another major factor supporting market expansion is the rise in publicly available genomic data. The NIH Sequence Read Archive (SRA) surpassed 50 petabases of data by early 2024, requiring large-scale indexing, quality control, and reanalysis. This massive data load necessitates professional expertise and infrastructure, which are primarily offered by bioinformatics service companies.
The integration of genomics into healthcare systems is further strengthening market growth. The NHS Genomic Medicine Service in England is expanding clinical genomics applications in oncology and rare disease management. This transition creates sustained demand for validated bioinformatics pipelines, variant curation, and clinical reporting services. Healthcare institutions increasingly depend on external service providers for secure, clinical-grade analysis pipelines and data governance compliance, ensuring both accuracy and confidentiality in genomic interpretation.
Public health initiatives and global investments are enhancing the bioinformatics services landscape. Programs like the U.S. CDC’s Advanced Molecular Detection and ECDC’s sequencing integration are driving large-scale genomic surveillance. These initiatives require ongoing analysis, pipeline standardization, and data-platform management, which are largely delivered through external service providers. As countries institutionalize sequencing, recurring demand for bioinformatics workflows and analytic services is expected to persist.
In low- and middle-income countries, international investment is expanding market opportunities. The World Bank’s genomic capacity-building programs in Africa are fostering sequencing and analytics infrastructure. These efforts include bioinformatics training and workflow design, ensuring long-term sustainability. Such projects significantly widen the global serviceable market for bioinformatics expertise. Similarly, large-scale national genomic initiatives like the NIH All of Us program generate billions of variants that require harmonization, annotation, and interpretation, sustaining demand for cloud-based data management and analytic platforms.
The growing focus on antimicrobial resistance (AMR) is also fueling bioinformatics adoption. Under WHO’s GLASS platform, countries are integrating whole-genome sequencing into AMR surveillance. This expansion is creating consistent demand for quality assurance, centralized analysis hubs, and workflow optimization. Furthermore, data governance reforms by the OECD and other regulatory bodies are facilitating secure secondary use of genomic data, promoting trust in data sharing and collaboration.
Strategic public funding further strengthens the market outlook. Horizon Europe’s Health Work Programme (2025) and NHGRI’s technology initiatives continue to fund large-scale, data-driven research, ensuring a steady flow of contracts for bioinformatics firms. Workforce development is also improving, with national systems such as NHS England expanding bioinformatics training. This capacity building not only supports in-house analytics but also increases outsourcing to handle peak workloads and specialized computational tasks.
In conclusion, the bioinformatics services market is benefiting from multiple converging factors—technological affordability, global health investments, regulatory clarity, and expanding data ecosystems. These structural developments are shaping a resilient, long-term demand environment for scalable, compliant, and high-quality bioinformatics services worldwide.
This project was designed to describe fine-scale population genetic differentiation of the stream salamander Gyrinophilus porphyriticus among five study streams in the Hubbard Brook Experimental Forest. The data are paired with intensive capture-recapture data to assess direct fitness effects of individual genetic diversity, including effects of individual multilocus heterozygosity on stage-specific survival probabilities.
This dataset publishes a manifest of the genomic sequence reads submitted to the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). These samples are published at NCBI under the BioProject ID 1090913 (https://www.ncbi.nlm.nih.gov/bioproject/1090913). The tables here include sample metadata and the NCBI URLs to each sample.
These data were gathered as part of the Hubbard Brook Ecosystem Study (HBES). The HBES is a collaborative effort at the Hubbard Brook Experimental Forest, which is operated and maintained by the USDA Forest Service, Northern Research Station.
I. Files
(GENOME)
Mt_v1.0_MAIN.fa.gz: Primary genome, (largely) scaffolded to chromosome-level, plus other primary assembled contigs
Mt_v1.0_MAIN.gff.gz: Simple gene annotations for primary genome, annotated using GeMoMa v1.8 and a zebra finch (bTaeGut1.4.pri) annotation reference
Mt_v1.0_extra.fa.gz: Additional contigs, not for use in most analyses but some may be of interest. This set is a combination of hand-identified haplotigs of the main genome, and assembler-identified "alternate" (haplotig) contigs
(ORIGINAL_ASSEMBLY_CONTIGS)
Mt_hifi.asm.p.fa.gz: "primary" assembly contigs, output from hifiasm (v0.13-r308)
Mt_hifi.asm.a.fa.gz: "alternate" assembly contigs, output from hifiasm (v0.13-r308)
(REPEAT_MASKING)
TElib_Myzo_preliminary.fa.gz: Preliminary Myzomela-tuned TE/repeat library, generated using RepeatModeler (v.2)
Mt_v1.0_MAIN_RM_sites_to_filter.txt: List of sites masked by RepeatM...
This is the GitHub repository for the single cell RNA sequencing data analysis for the human manuscript.

The following essential libraries are required for script execution:
Seurat, scRepertoire, ggplot2, dplyr, ggridges, ggrepel, ComplexHeatmap

Linked File:
--------------------------------------
This repository contains code for the analysis of a single-cell RNA-seq dataset. The dataset contains raw FASTQ files as well as the aligned files that were deposited in GEO. Provided below are descriptions of the linked datasets:

1. Gene Expression Omnibus (GEO) ID: GSE229626
- Title: Gene expression profile at single cell level of human T cells stimulated via antibodies against the T Cell Receptor (TCR)
- Description: This submission contains the matrix.mtx, barcodes.tsv, and genes.tsv files for each replicate and condition, corresponding to the aligned files for single cell sequencing data.
- Submission type: Private. In order to gain access to the repository, you must use a "reviewer token" (https://www.ncbi.nlm.nih.gov/geo/info/reviewer.html).

2. Sequence read archive (SRA) repository
- Title: Gene expression profile at single cell level of human T cells stimulated via antibodies against the T Cell Receptor (TCR)
- Description: This submission contains the raw sequencing (.fastq.gz) files, which are gzip-compressed FASTQ text files.
- Submission type: Private. In order to gain access to the repository, you must use a "reviewer token" (https://www.ncbi.nlm.nih.gov/geo/info/reviewer.html).

Please note that since the GSE submission is private, the raw data deposited at SRA may not be accessible until the embargo on GSE229626 has been lifted.

Installation and Instructions
--------------------------------------
The code included in this submission requires several essential packages, as listed above. Please follow these instructions for installation:
> Ensure you have R version 4.1.2 or higher for compatibility.
> Although it is not essential, you can use RStudio (version 2022.12.0+353) for accessing and executing the code.

The following code can be used to set the working directory in R:
> setwd(directory)

Steps:
1. Download the "Human_code_April2023.R" and "Install_Packages.R" R scripts, and the processed data from GSE229626.
2. Open RStudio (https://www.rstudio.com/tags/rstudio-ide/) or a similar integrated development environment (IDE) for R.
3. Set your working directory to where the following files are located:
- Human_code_April2023.R
- Install_Packages.R
4. Open the file titled Install_Packages.R and execute it in the R IDE. This script will attempt to install all the necessary packages and their dependencies.
5. Open the Human_code_April2023.R R script and execute commands as necessary.
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data related to RNA sequence genetic accessions at the National Center for Biotechnology Information (NCBI) including information about the host organism, collection location, and collection date.
The accessions are the unprocessed Illumina MiSeq reads for the Ross Sea Dinoflagellate RNA-Seq experiments, Phaeocystis antarctica RNA-Seq experiments, and Pyramimonas tychotreta & Micromonas polaris (CCMP 2099) mixotrophy experiments.
Pyramimonas tychotreta & Micromonas polaris (CCMP 2099) mixotrophy RNA sequences are available through the NCBI Sequence Read Archive (SRA) under the SRA accession number SRP090401 (BioProject PRJNA342459).
Ross Sea Dinoflagellate RNA sequences are available through the NCBI Sequence Read Archive (SRA) under the accession number SRP132912 (BioProject PRJNA428208).
Phaeocystis antarctica RNA sequences are available through the NCBI Sequence Read Archive (SRA) under the accession number SRP133243 (BioProject PRJNA434497).
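The three accession pairs above can be kept in a small lookup table for scripted access. This Python sketch builds BioProject landing-page links using the `https://www.ncbi.nlm.nih.gov/bioproject/<accession>` URL form; the dictionary keys and the helper name `bioproject_url` are hypothetical labels chosen for illustration.

```python
import re

# SRA study and BioProject accessions as listed in the text.
EXPERIMENTS = {
    "Pyramimonas tychotreta & Micromonas polaris mixotrophy": ("SRP090401", "PRJNA342459"),
    "Ross Sea Dinoflagellate": ("SRP132912", "PRJNA428208"),
    "Phaeocystis antarctica": ("SRP133243", "PRJNA434497"),
}


def bioproject_url(accession):
    """Return the NCBI BioProject page URL for an accession like PRJNA342459."""
    if not re.fullmatch(r"PRJ[EDN][A-Z]\d+", accession):
        raise ValueError(f"not a BioProject accession: {accession!r}")
    return f"https://www.ncbi.nlm.nih.gov/bioproject/{accession}"


for name, (sra_study, bioproject) in EXPERIMENTS.items():
    print(f"{name}: {sra_study} -> {bioproject_url(bioproject)}")
```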
The data release details the samples, methods, and raw data used to generate high-quality genome assemblies for greater sage-grouse (Centrocercus urophasianus), white-tailed ptarmigan (Lagopus leucura), and trumpeter swan (Cygnus buccinator). The raw data have been deposited in the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI), the authoritative repository for public biological sequence data, and are not included in this data release. Instead, the accessions that link to those data via the NCBI portal (www.ncbi.nlm.nih.gov) are provided herein. The release consists of a single file, sample.metadata.txt, which maps NCBI accessions to the samples sequenced and the different types of sequencing performed to generate the assemblies and annotate their gene features.
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The whole genome of Pseudomonas sp. HOU2 was analyzed with RAST (Rapid Annotation using Subsystem Technology) (https://rast.nmpdr.org/) on 18 July 2024 to obtain the predicted HOU2 gene sequences, with the following options selected:
Genetic code: 11
Annotation scheme: RASTtk
Preserve gene calls: no
Automatically fix errors: yes
Fix frameshifts: yes
Backfill gaps: yes
The NCBI Sequence Read Archive accession for Pseudomonas sp. HOU2 is SRR29666724 (https://www.ncbi.nlm.nih.gov/sra/SRR29666724).
The NCBI complete genome accession for Pseudomonas sp. HOU2 is CP160398.1 (https://www.ncbi.nlm.nih.gov/nuccore/CP160398).
U.S. Government Works, https://www.usa.gov/government-works
License information was derived automatically
Amplicon sequencing utilizing next-generation platforms has significantly transformed how research is conducted, specifically in microbial ecology. However, primer and sequencing platform biases can confound or change the way scientists interpret these data. The Pacific Biosciences RSII instrument may also preferentially load smaller fragments, which may also be a function of PCR product exhaustion during sequencing. To further examine these biases, data are provided from 16S rRNA rumen community analyses. Specifically, data from the relative phylum-level abundances for the ruminal bacterial community are provided to determine between-sample variability. Direct sequencing of metagenomic DNA was conducted to circumvent primer-associated biases in 16S rRNA reads and rarefaction curves were generated to demonstrate adequate coverage of each amplicon. PCR products were also subjected to reduced amplification and pooling to reduce the likelihood of PCR product exhaustion during sequencing on the Pacific Biosciences platform. The taxonomic profiles for the relative phylum-level and genus-level abundance of rumen microbiota as a function of PCR pooling for sequencing on the Pacific Biosciences RSII platform are provided. Data are within this article and raw ruminal MiSeq sequence data are available from the NCBI Sequence Read Archive (SRA Accession SRP047292). Additional descriptive information is associated with NCBI BioProject PRJNA261425. http://www.ncbi.nlm.nih.gov/bioproject/PRJNA261425/
Resources in this dataset:
Resource Title: NCBI Sequence Read Archive (SRA Accession SRP047292). File Name: Web Page, url: https://www.ncbi.nlm.nih.gov/sra/SRX704260 1 ILLUMINA (Illumina MiSeq) run: 978,195 spots, 532.9M bases, 311.6Mb downloads.
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This analysis comprised the identification of DNA methylation sites in the context of CpG islands and differentially methylated regions (DMRs) with MethylScore. In addition, the mRNA sequencing data were analyzed to identify differentially expressed genes. Differentially expressed genes and identified DMRs were then correlated. Finally, DMRs and their expression changes were characterized in their genomic context. All raw sequencing data used as input for analyses to obtain the data in this repository are deposited in the NCBI Sequence Read Archive (SRA) under BioProject ID PRJNA1163668 (http://www.ncbi.nlm.nih.gov/bioproject/1163668). The genome assembly and annotation data used here were obtained from DataverseNO (https://doi.org/10.18710/GXMSUH). This genome assembly is based on the raw sequencing data deposited under BioProject ID PRJNA1119394 (http://www.ncbi.nlm.nih.gov/bioproject/1119394). Together these data were used to identify CpG sites genome-wide and further identify differentially methylated regions. The corresponding mRNA data were used to identify transcriptional changes and enable a comparison of differentially methylated genes with differentially expressed genes. Scripts are available in the GitHub repository WholeGenomeBisulphiteSequencing (https://github.com/MagdalenaWinklhofer/WholeGenomeBisulphiteSequencing.git).
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This table depicts the aging correlated genes for humans and flies sorted according to their correlation coefficient. (XLSX)
Local chickens that do not conform to a breed are labelled as “village”. (XLSX)
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset
This dataset comprises the genome assemblies and respective 2,794-loci whole-genome (wg) Multiple Locus Sequence Type (MLST) profiles [INNUENDO schema (Llarena et al. 2018) available in chewie-NS (Mamede et al. 2022)] of a final set of 3,076 Campylobacter jejuni samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) at the beginning of the analysis (November 2021). This set of samples was carefully selected to cover a wide genetic diversity (assessed in terms of Sequence Type [ST]). In total, 476 different STs are represented in this dataset, with ST21, ST50, ST48, ST45 and ST257 being the most represented ones and, together, corresponding to 29.1% of the dataset.
File “Cj_metadata.xlsx” contains metadata information for each isolate, including ENA/SRA accession number, BioProject and in-silico MLST ST.
The directory “assemblies/” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.
The file “profiles/Cj_profiles_wgMLST.tsv” is a tab-separated file with the 2,794-loci wgMLST profiles of each isolate presented in the metadata file. The files “profiles/Cj_profiles_cgMLST_95.tsv”, “profiles/Cj_profiles_cgMLST_98.tsv” and “profiles/Cj_profiles_cgMLST_100.tsv” contain the 1,012-loci, 987-loci and 29-loci cgMLST profiles of each isolate presented in the metadata file, respectively. These profiles were determined as explained below.
Dataset selection and curation
With the objective of creating a diverse dataset of C. jejuni genome assemblies, we collected information about the genetic diversity (serotype) of the isolates available in the PubMLST database at the beginning of this analysis (November 2021) and in other previous works. Based on this information, we selected an initial dataset comprising 3,539 samples. The majority of them are associated with the INNUENDO project (Llarena et al. 2018). The remaining ones are associated with five BioProjects (PRJEB31119, PRJEB38253, PRJEB40238, PRJEB4165 and PRJNA350537). Their WGS data were downloaded from ENA/SRA with fastq-dl v1.0.6. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the metadata file). In total, 3,076 isolates passed this curation step and were included in the final dataset. wgMLST profiles of each of these isolates were determined with chewBBACA v2.8.5 (Silva et al. 2018), using the 2,794-loci INNUENDO schema available in chewie-NS (Llarena et al. 2018; Mamede et al. 2022) and downloaded on May 31st, 2022. Three cgMLST schemas were obtained with ReporTree v1.0.0 (Mixão et al. 2022) using the 2,794-loci wgMLST profiles of the 3,076 isolates as input and setting distinct “--site-inclusion” thresholds: 0.95, 0.98 and 1.0 (i.e., keep schema loci called in at least 95%, 98% and 100% of the samples, resulting in 1,012-loci, 987-loci and 29-loci allelic matrices, respectively).
Acknowledgements
We thank the National Distributed Computing Infrastructure of Portugal (INCD) for providing the necessary resources to run the genome assemblies. INCD was funded by FCT and FEDER under the project 22153-01/SAICT/2016.
https://spdx.org/licenses/CC0-1.0.html
Single cell genome sequencing has become a useful tool in medicine and biology studies. However, an independent library is required for each cell in single cell genome sequencing, so the cost grows in step with the number of cells. In this study, we report an efficient single-cell copy number variation (CNV) analysis based on an overlapping pooling strategy together with a branch and bound (B&B) algorithm. Single cells are pooled with overlap before sequencing, and are later sorted into specific types by estimating their CNV patterns with the B&B algorithm. Instead of constructing a library for each cell, a library is required only for each pool. As long as the number of pools is smaller than the number of cells, fewer libraries are needed and the cost is lower. Through computer simulations, we pooled 80 cells with overlap into 40 and 27 pools and classified them into cell types based on CNV pattern. The results showed that 84% of cells in 40 pools and 76.5% of cells in 27 pools were correctly classified on average, while only half or one-third of the sequencing libraries were required. Combined with traditional approaches, our method is expected to significantly improve the efficiency of single cell genome sequencing.
Methods
The dataset contains the statistics of the sequencing data and the copy number profiles of the single cells.
The single-cell sequencing data of 80 single cells from 7 tumor patients with Triple-Negative Breast Cancer (TNBC) were downloaded in FASTQ format from the National Center for Biotechnology Information (NCBI) [15] under Sequence Read Archive (SRA) accession SRP064210.
Basic statistics were computed on BAM files mapped to the human genome hg19. We followed the protocol put forward by Baslan et al. to obtain the copy number profile of single cells.