Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: The advent of RNA sequencing (RNA-Seq) has significantly advanced our understanding of the transcriptomic landscape, revealing intricate gene expression patterns across biological states and conditions. However, the complexity and volume of RNA-Seq data pose challenges in identifying differentially expressed genes (DEGs), which are critical for understanding the molecular basis of diseases like cancer. Methods: We introduce a novel Machine Learning-Enhanced Genomic Data Analysis Pipeline (ML-GAP) that incorporates autoencoders and innovative data augmentation strategies, notably the MixUp method, to overcome these challenges. By creating synthetic training examples through a linear combination of input pairs and their labels, MixUp significantly enhances the model’s ability to generalize from the training data to unseen examples. Results: Our results demonstrate ML-GAP’s superiority in accuracy, efficiency, and insights, crediting the MixUp method in particular for its substantial contribution to the pipeline’s effectiveness, greatly advancing genomic data analysis and setting a new standard in the field. Discussion: This, in turn, suggests that ML-GAP not only has the potential to detect DEGs more accurately but also offers new avenues for therapeutic intervention and research. By integrating explainable artificial intelligence (XAI) techniques, ML-GAP ensures a transparent and interpretable analysis, highlighting the significance of identified genetic markers.
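The MixUp step described above, a linear combination of input pairs and their labels, can be sketched in a few lines. This is a minimal illustration and not the ML-GAP implementation: the function names, the `alpha` value, and the choice to draw the mixing weight from a Beta(alpha, alpha) distribution (as in the original MixUp formulation) are assumptions, since the abstract does not state the settings used.

```python
import random

def mixup_pair(x_i, y_i, x_j, y_j, lam):
    """Blend two samples and their one-hot labels with weight lam in [0, 1]."""
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix

def mixup_batch(xs, ys, alpha=0.2, rng=None):
    """Return one synthetic (x, y) example per input, mixed with a random partner.

    alpha is a hypothetical default; smaller values keep mixed samples
    closer to one of the two originals.
    """
    rng = rng or random.Random(0)
    partners = list(range(len(xs)))
    rng.shuffle(partners)
    mixed = []
    for i, j in enumerate(partners):
        lam = rng.betavariate(alpha, alpha)  # mixing weight for this pair
        mixed.append(mixup_pair(xs[i], ys[i], xs[j], ys[j], lam))
    return mixed

# Toy expression vectors (two genes) with one-hot labels for two classes.
expression = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
labels = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
synthetic = mixup_batch(expression, labels)
```

Because the labels are blended with the same weight as the inputs, each synthetic label remains a valid probability vector, which is what lets the model train on the interpolated examples.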
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For identifying the genes that are regulated by a transcription factor (TF), we have established an analytical pipeline that combines genomic systematic evolution of ligands by exponential enrichment (gSELEX)-Seq and RNA-Seq. Here, SELEX was used to select DNA fragments from an Aspergillus nidulans genomic library that bound specifically to AmyR, a TF from A. nidulans. High-throughput sequencing data were obtained for the DNAs enriched through the selection, following which various in silico analyses were performed. Mapping reads to the genome revealed the binding motifs including the canonical AmyR-binding motif, CGGN8CGG, as well as the candidate promoters controlled by AmyR. In parallel, differentially expressed genes related to AmyR were identified by using RNA-Seq analysis with samples from A. nidulans WT and amyR deletant. By obtaining the intersecting set of genes detected using both gSELEX-Seq and RNA-Seq, the genes directly regulated by AmyR in A. nidulans can be identified with high reliability. This analytical pipeline is a robust platform for comprehensive genome-wide identification of the genes that are regulated by a target TF.
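Two steps of this pipeline lend themselves to a short sketch: scanning a sequence for the canonical CGGN8CGG motif, and intersecting the gene sets from the two assays. The code below is an illustration under stated assumptions, not the authors' implementation. The function names and gene identifiers are hypothetical, and only the given strand is scanned (the reverse-complement pattern would be CCGN8CCG).

```python
import re

# CGG, any 8 bases, CGG. The lookahead lets overlapping sites be reported.
AMYR_MOTIF = re.compile(r"(?=(CGG[ACGT]{8}CGG))")

def find_motif_sites(sequence):
    """Return (0-based start, matched site) for each motif hit on the given strand."""
    return [(m.start(), m.group(1)) for m in AMYR_MOTIF.finditer(sequence.upper())]

def direct_targets(gselex_genes, rnaseq_degs):
    """Genes supported by both binding (gSELEX-Seq) and expression (RNA-Seq) evidence."""
    return sorted(set(gselex_genes) & set(rnaseq_degs))

# Toy promoter fragment and hypothetical gene identifiers.
sites = find_motif_sites("ttCGGAAAATTTTCGGaa")
targets = direct_targets(["geneA", "geneB", "geneC"], ["geneB", "geneC", "geneD"])
```

Taking the intersection trades sensitivity for specificity: a gene is reported only when it is both bound in vitro and differentially expressed in the deletant.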
A critical task in high throughput sequencing is aligning millions of short reads to a reference genome. Alignment is especially complicated for RNA sequencing (RNA-Seq) because of RNA splicing. A number of RNA-Seq algorithms are available, and claim to align reads with high accuracy and efficiency while detecting splice junctions. RNA-Seq data is discrete in nature; therefore with reasonable gene models and comparative metrics RNA-Seq data can be simulated to sufficient accuracy to enable meaningful benchmarking of alignment algorithms. The exercise to rigorously compare all viable published RNA-Seq algorithms has not previously been performed. RESULTS: We developed an RNA-Seq simulator that models the main impediments to RNA alignment, including alternative splicing, insertions, deletions, substitutions, sequencing errors, and intron signal. We used this simulator to measure the accuracy and robustness of available algorithms at the base and junction levels. Additionally, we used RT-PCR and Sanger sequencing to validate the ability of the algorithms to detect novel transcript features such as novel exons and alternative splicing in RNA-Seq data from mouse retina. A pipeline based on BLAT was developed to explore the performance of established tools for this problem, and to compare it to the recently developed methods. This pipeline, the RNA-Seq Unified Mapper (RUM) performs comparably to the best current aligners and provides an advantageous combination of accuracy, speed and usability. RNA-Seq of mouse retinal RNA, as described.
Even though high-throughput transcriptome sequencing is routinely performed in many laboratories, computational analysis of such data remains a cumbersome process often executed manually, hence error-prone and lacking reproducibility. For corresponding data processing, we introduce Curare, an easy-to-use yet versatile workflow builder for analyzing high-throughput RNA-Seq data focusing on differential gene expression experiments. Data analysis with Curare is customizable and subdivided into preprocessing, quality control, mapping, and downstream analysis stages, providing multiple options for each step while ensuring the reproducibility of the workflow. For a fast and straightforward exploration and visualization of differential gene expression results, we provide the gene expression visualizer software GenExVis. GenExVis can create various charts and tables from simple gene expression tables and DESeq2 results without the requirement to upload data or install software packages.
NGS-Based RNA-Seq Market Size 2024-2028
The NGS-based RNA-seq market size is forecast to increase by USD 6.66 billion, at a CAGR of 20.52% between 2023 and 2028.
The market is witnessing significant growth, driven by the increased adoption of next-generation sequencing (NGS) methods for RNA-Seq analysis. The advanced capabilities of NGS techniques, such as high-throughput, cost-effectiveness, and improved accuracy, have made them the preferred choice for researchers and clinicians in various fields, including genomics, transcriptomics, and personalized medicine. However, the market faces challenges, primarily from the lack of clinical validation on direct-to-consumer genetic tests. As the use of NGS technology in consumer applications expands, ensuring the accuracy and reliability of results becomes crucial.
The absence of standardized protocols and regulatory oversight in this area poses a significant challenge to market growth and trust. Companies seeking to capitalize on market opportunities must focus on addressing these challenges through collaborations, partnerships, and investments in research and development to ensure the clinical validity and reliability of their NGS-based RNA-Seq offerings.
What will be the Size of the NGS-based RNA-Seq market during the forecast period?
Explore in-depth regional segment analysis with market size data - historical 2018-2022 and forecasts 2024-2028 - in the full report.
The market continues to evolve, driven by advancements in NGS technology and its applications across various sectors. Spatial transcriptomics, a novel approach to studying gene expression in its spatial context, is gaining traction in disease research and precision medicine. Splice junction detection, a critical component of RNA-seq data analysis, enhances the accuracy of gene expression profiling and differential gene expression studies. Cloud computing plays a pivotal role in handling the massive amounts of data generated by NGS platforms, enabling real-time data analysis and storage. Enrichment analysis, gene ontology, and pathway analysis facilitate the interpretation of RNA-seq data, while data normalization and quality control ensure the reliability of results.
Precision medicine and personalized therapy are key applications of RNA-seq, with single-cell RNA-seq offering unprecedented insights into the complexities of gene expression at the single-cell level. Read alignment and variant calling are essential steps in RNA-seq data analysis, while bioinformatics pipelines and RNA-seq software streamline the process. NGS technology is revolutionizing drug discovery by enabling the identification of biomarkers and gene fusion detection in various diseases, including cancer and neurological disorders. RNA-seq is also finding applications in infectious diseases, microbiome analysis, environmental monitoring, agricultural genomics, and forensic science. Sequencing costs are decreasing, making RNA-seq more accessible to researchers and clinicians.
The ongoing development of sequencing platforms, library preparation, and sample preparation kits continues to drive innovation in the field. The dynamic nature of the market ensures that it remains a vibrant and evolving field, with ongoing research and development in areas such as data visualization, clinical trials, and sequencing depth.
How is this NGS-based RNA-Seq industry segmented?
The NGS-based RNA-seq industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2024-2028, as well as historical data from 2018-2022 for the following segments.
End-user
  Academic and research centers
  Clinical research
  Pharma companies
  Hospitals
Technology
  Sequencing by synthesis
  Ion semiconductor sequencing
  Single-molecule real-time sequencing
  Others
Geography
  North America
    US
  Europe
    Germany
    UK
  APAC
    China
    Singapore
  Rest of World (ROW)
By End-user Insights
The academic and research centers segment is estimated to witness significant growth during the forecast period.
The global next-generation sequencing (NGS) market for RNA sequencing (RNA-Seq) is primarily driven by academic and research institutions, including those from universities, research institutes, government entities, biotechnology organizations, and pharmaceutical companies. These institutions utilize NGS technology for various research applications, such as whole-genome sequencing, epigenetics, and emerging fields like agrigenomics and animal research, to enhance crop yield and nutritional composition. NGS-based RNA-Seq plays a pivotal role in translational research, with significant investments from both private and public organizations fueling its growth. The technology is instrumental in disease research, enabling the identification of nov
Background High-throughput RNA-Sequencing (RNA-Seq) has become the preferred technique for studying gene expression differences between biological samples and for discovering novel isoforms, though the techniques to analyze the resulting data are still immature. One pre-processing step that is widely but heterogeneously applied is trimming, in which low quality bases, identified by the probability that they are called incorrectly, are removed. However, the impact of trimming on subsequent alignment to a genome could influence downstream analyses including gene expression estimation; we hypothesized that this might occur in an inconsistent manner across different genes, resulting in differential bias. Results To assess the effects of trimming on gene expression, we generated RNA-Seq data sets from four samples of larval Drosophila melanogaster sensory neurons, and used three trimming algorithms—SolexaQA, Trimmomatic, and ConDeTri—to perform quality-based trimming across a wide range of stringencies. After aligning the reads to the D. melanogaster genome with TopHat2, we used Cuffdiff2 to compare the original, untrimmed gene expression estimates to those following trimming. With the most aggressive trimming parameters, over ten percent of genes had significant changes in their estimated expression levels. This trend was seen with two additional RNA-Seq data sets and with alternative differential expression analysis pipelines. We found that the majority of the expression changes could be mitigated by imposing a minimum length filter following trimming, suggesting that the differential gene expression was primarily being driven by spurious mapping of short reads. Slight differences with the untrimmed data set remained after length filtering, which were associated with genes with low exon numbers and high GC content. 
Finally, an analysis of paired RNA-seq/microarray data sets suggests that no or modest trimming results in the most biologically accurate gene expression estimates. Conclusions We find that aggressive quality-based trimming has a large impact on the apparent makeup of RNA-Seq-based gene expression estimates, and that short reads can have a particularly strong impact. We conclude that implementation of trimming in RNA-Seq analysis workflows warrants caution, and if used, should be used in conjunction with a minimum read length filter to minimize the introduction of unpredictable changes in expression estimates. Four biological replicates of 100 Drosophila melanogaster larval multi-dendritic sensory neurons were profiled by mRNA-Seq
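The recommendation above, quality trimming followed by a minimum read length filter, can be sketched as follows. This is a simplified 3′-end trimmer written for illustration. It is not the algorithm used by SolexaQA, Trimmomatic, or ConDeTri, and the thresholds `min_q=20` and `min_len=36` are hypothetical values rather than settings taken from the study.

```python
def trim_3prime(seq, quals, min_q=20):
    """Drop bases from the 3' end until one reaches min_q (Phred scale)."""
    end = len(quals)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

def trim_and_filter(reads, min_q=20, min_len=36):
    """Trim each (sequence, qualities) read, then keep reads of at least min_len bases."""
    kept = []
    for seq, quals in reads:
        t_seq, t_quals = trim_3prime(seq, quals, min_q)
        if len(t_seq) >= min_len:  # discard reads too short to map reliably
            kept.append((t_seq, t_quals))
    return kept

# Toy reads: the first loses two low-quality bases, the second is trimmed to nothing.
reads = [
    ("ACGTACGTAC", [30, 30, 30, 30, 30, 30, 30, 30, 5, 5]),
    ("ACGT", [2, 2, 2, 2]),
]
filtered = trim_and_filter(reads, min_q=20, min_len=6)
```

The length filter is applied after trimming, so a read that was long on the sequencer but short after trimming is removed before it can map spuriously.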
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GENE-counter is a complete Perl-based computational pipeline for analyzing RNA-Sequencing (RNA-Seq) data for differential gene expression. In addition to its use in studying transcriptomes of eukaryotic model organisms, GENE-counter is applicable for prokaryotes and non-model organisms without an available genome reference sequence. For alignments, GENE-counter is configured for CASHX, Bowtie, and BWA, but an end user can use any Sequence Alignment/Map (SAM)-compliant program of preference. To analyze data for differential gene expression, GENE-counter can be run with any one of three statistics packages that are based on variations of the negative binomial distribution. The default method is a new and simple statistical test we developed based on an over-parameterized version of the negative binomial distribution. GENE-counter also includes three different methods for assessing differentially expressed features for enriched gene ontology (GO) terms. Results are transparent and data are systematically stored in a MySQL relational database to facilitate additional analyses as well as quality assessment. We used next generation sequencing to generate a small-scale RNA-Seq dataset derived from the heavily studied defense response of Arabidopsis thaliana and used GENE-counter to process the data. Collectively, the support from analysis of microarrays as well as the observed and substantial overlap in results from each of the three statistics packages demonstrates that GENE-counter is well suited for handling the unique characteristics of small sample sizes and high variability in gene counts.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Single cell RNA-seq data generated and reported as part of the manuscript entitled "Dissecting the mechanisms underlying the Cytokine Release Syndrome (CRS) mediated by T Cell Bispecific Antibodies" by Leclercq-Cohen et al. 2023. Raw and processed (filtered and annotated) data are provided as AnnData objects which can be directly ingested to reproduce the findings of the paper or for ab initio data reuse:
1- raw.zip provides concatenated raw/unfiltered counts for the 20 samples in the standard Market Exchange Format (MEX).
2- 230330_sw_besca2_LowFil_raw.h5ad contains filtered cells and raw counts in the HDF5 format.
3- 221124_sw_besca2_LowFil.annotated.h5ad contains filtered cells and log normalized counts, along with cell type annotation in the HDF5 format.
scRNAseq data generation: Whole blood from 4 donors was treated with 0.2 μg/mL CD20-TCB, or incubated in the absence of CD20-TCB. At baseline (before addition of TCB) and assay endpoints (2, 4, 6, and 20 hrs), blood was collected for total leukocyte isolation using EasySep™ red blood cell depletion reagent (Stemcell). Cells were counted and processed for single cell RNA sequencing using the BD Rhapsody platform. To load several samples on a single BD Rhapsody cartridge, sample cells were labelled with sample tags (BD Human Single-Cell Multiplexing Kit) following the manufacturer’s protocol prior to pooling. Briefly, 1 × 10^6 cells from each sample were resuspended in 180 μL FBS Stain Buffer (BD, PharMingen) and sample tags were added to the respective samples and incubated for 20 min at RT. After incubation, 2 successive washes were performed by addition of 2 mL stain buffer and centrifugation for 5 min at 300 g. Cells were then resuspended in 620 μL cold BD Sample Buffer, stained with 3.1 μL of both 2 mM Calcein AM (Thermo Fisher Scientific) and 0.3 mM Draq7 (BD Biosciences) and finally counted on the BD Rhapsody scanner. Samples were then diluted and/or pooled equally in 650 μL cold BD Sample Buffer. The BD Rhapsody cartridges were then loaded with up to 40,000–50,000 cells. Single cells were isolated using Single-Cell Capture and cDNA Synthesis with the BD Rhapsody Express Single-Cell Analysis System according to the manufacturer’s recommendations (BD Biosciences). cDNA libraries were prepared using the Whole Transcriptome Analysis Amplification Kit following the BD Rhapsody System mRNA Whole Transcriptome Analysis (WTA) and Sample Tag Library Preparation Protocol (BD Biosciences). Indexed WTA and sample tag libraries were quantified and quality controlled on the Qubit Fluorometer using the Qubit dsDNA HS Assay, and on the Agilent 2100 Bioanalyzer system using the Agilent High Sensitivity DNA Kit. Sequencing was performed on a Novaseq 6000 (Illumina) in paired-end mode (64-8-58) with Novaseq6000 S2 v1 or Novaseq6000 SP v1.5 reagent kits (100 cycles).
scRNAseq data analysis: Sequencing data was processed using the BD Rhapsody Analysis pipeline (v 1.0 https://www.bd.com/documents/guides/user-guides/GMX_BD-Rhapsody-genomics-informatics_UG_EN.pdf) on the Seven Bridges Genomics platform. Briefly, read pairs with low sequencing quality were first removed and the cell label and UMI identified for further quality check and filtering. Valid reads were then mapped to the human reference genome (GRCh38-PhiX-gencodev29) using the aligner Bowtie2 v2.2.9, and reads with the same cell label, same UMI sequence and same gene were collapsed into a single raw molecule while undergoing further error correction and quality checks. Cell labels were filtered with a multi-step algorithm to distinguish those associated with putative cells from those associated with noise. After determining the putative cells, each cell was assigned to the sample of origin through the sample tag (only for cartridges with multiplex loading). Finally, the single-cell gene expression matrices were generated and a metrics summary was provided. After pre-processing with BD’s pipeline, the count matrices and metadata of each sample were aggregated into a single adata object and loaded into the besca v2.3 pipeline for the single cell RNA sequencing analysis (43). First, we filtered low quality cells with fewer than 200 genes, fewer than 500 counts or more than 30% of mitochondrial reads. This permissive filtering was used in order to preserve the neutrophils. We further excluded potential multiplets (cells with more than 5,000 genes or 20,000 counts), and genes expressed in fewer than 30 cells. Normalization, log-transformed UMI counts per 10,000 reads [log(CP10K+1)], was applied before downstream analysis.
After normalization, technical variance was removed by regressing out the effects of total UMI counts and percentage of mitochondrial reads, and gene expression was scaled. The 2,507 most variable genes (having a minimum mean expression of 0.0125, a maximum mean expression of 3 and a minimum dispersion of 0.5) were used for principal component analysis. Finally, the first 50 PCs were used as input for calculating the 10 nearest neighbours and the neighbourhood graph was then embedded into the two-dimensional space using the UMAP algorithm at a resolution of 2. Cell type annotation was performed using the Sig-annot semi-automated besca module, which is a signature-based hierarchical cell annotation method. The used signatures, configuration and nomenclature files can be found at https://github.com/bedapub/besca/tree/master/besca/datasets. For more details, please refer to the publication.
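The cell and gene filters and the log(CP10K+1) normalization described above can be written out explicitly. The sketch below uses the thresholds quoted in the text but is not the besca code: the function names are hypothetical, and the use of the natural logarithm is an assumption, since the description only says "log".

```python
import math

def keep_cell(n_genes, n_counts, pct_mito):
    """Apply the per-cell thresholds quoted in the text."""
    low_quality = n_genes < 200 or n_counts < 500 or pct_mito > 30.0
    multiplet = n_genes > 5000 or n_counts > 20000
    return not (low_quality or multiplet)

def keep_gene(counts_across_cells, min_cells=30):
    """Keep a gene only if it is detected (count > 0) in at least min_cells cells."""
    return sum(1 for c in counts_across_cells if c > 0) >= min_cells

def log_cp10k(counts):
    """log(CP10K + 1) for one cell's UMI counts (natural log assumed)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(c / total * 10000.0) for c in counts]

# Toy cells given as (genes detected, total counts, percent mitochondrial reads).
cells = [(1500, 4000, 5.0), (150, 4000, 5.0), (6000, 15000, 5.0), (800, 900, 35.0)]
passed = [keep_cell(*cell) for cell in cells]
```

Scaling every cell to 10,000 counts before the log transform removes differences in sequencing depth between cells, so the later regression and PCA steps operate on comparable values.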
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Background: mRNA interactions with each other and with other signaling molecules define different biological pathways and functions. Researchers have been investigating various tools to analyze these types of interactions. In particular, gene co-expression network methods have proved useful in finding and analyzing these molecular interactions. Many different analytical pipelines to identify these interaction networks have been proposed with the aim of identifying an optimal partition of the network where the individual modules are neither too small to support any general inference nor too large to be biologically interpretable. Results: In this study we propose a new pipeline to perform gene co-expression network analysis. The proposed pipeline uses WGCNA, a widely used software package that performs different aspects of gene co-expression network analysis, together with a modularity maximization algorithm to analyze novel RNA-Seq data to understand the effects of low-dose 56Fe ion irradiation on the formation of hepatocellular carcinoma in mice. The network results, along with experimental validation, show that using WGCNA combined with Modularity provides a more biologically interpretable network in our dataset. Our pipeline showed better performance than the existing clustering algorithm in WGCNA in finding modules and identified a module with mitochondrial subunits that is supported by a mitochondrial complex assay. Conclusions: We present a pipeline that can reduce the problem of parameter selection with the existing algorithm in WGCNA for comparable RNA-Seq datasets, which may assist in future research to discover novel mRNA interactions and their downstream molecular effects.
C57BL/6 males were placed into 2 treatment groups and received the following irradiation treatments at Brookhaven National Laboratories (Long Island, NY): 600 MeV/n 56Fe (0.2 Gy) and no irradiation. Left liver lobes were collected at 30, 60, 120, 270, and 360 days post-irradiation, flash frozen, and stored at -80 °C until they could be processed for RNA-Seq. Livers were sampled by taking two 40-micron thick slices using a cryotome at -20 °C. This allowed multiple sampling of the tissue without the tissue going through multiple freeze/thaw cycles. Total RNA was isolated from the liver slices using the RNAqueous™ Total RNA Isolation Kit (ThermoFisher Scientific, Waltham, MA) and rRNA was removed via the Ribo-Zero™ rRNA Removal Kit (Illumina, San Diego, CA) prior to library preparation with the Illumina TruSeq RNA Library kit. Samples were sequenced in a paired-end 50 base format on an Illumina HiSeq 1500. Reads were aligned to the mouse GRCm38 reference genome using the STAR alignment program version 2.5.3a with the recommended ENCODE options. The --quantMode GeneCounts option was used to obtain read counts per gene based on the Gencode release M14 annotation file. The total number of reads used in the analysis varies between 23 and 35 million.
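With the GeneCounts option, STAR writes a tab-separated per-sample table (ReadsPerGene.out.tab) with four columns: gene ID, unstranded counts, and counts for each of the two strand orientations, preceded by four summary rows whose names begin with N_. A minimal parser is sketched below. The function name and the toy rows are hypothetical, and which count column is appropriate depends on the library's strandedness, which the description above does not state.

```python
def parse_reads_per_gene(lines, column=1):
    """Split STAR GeneCounts rows into summary rows (N_*) and per-gene counts.

    column: 1 = unstranded, 2 = read 1 on the RNA strand, 3 = read 2 on the RNA strand.
    """
    summary, genes = {}, {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        target = summary if fields[0].startswith("N_") else genes
        target[fields[0]] = int(fields[column])
    return summary, genes

# Toy rows with made-up numbers and hypothetical gene IDs.
example = [
    "N_unmapped\t100\t100\t100",
    "N_multimapping\t50\t50\t50",
    "N_noFeature\t10\t200\t20",
    "N_ambiguous\t5\t1\t2",
    "geneA\t12\t0\t12",
    "geneB\t7\t1\t6",
]
summary, genes = parse_reads_per_gene(example)
```

Comparing the N_noFeature values across the three columns is a common way to check strandedness: the column matching the library protocol usually has the fewest reads without a feature.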
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We developed a single-cell transcriptomics pipeline for high-throughput pharmacotranscriptomic screening. We explored the transcriptional landscape of three HGSOC models (JHOS2, a representative cell line; PDC2 and PDC3, two patient-derived samples) after treating their cells for 24 hours with 45 drugs representing 13 distinct classes of mechanism of action. Our work establishes a new precision oncology framework for the study of molecular mechanisms activated by a broad array of drug responses in cancer.
.
├── 3D UMAPs/ → Interactive 3D UMAPs of cells treated with the 45 drugs used for multiplexed scRNA-seq. Related to Figure 4. Coordinates: x = UMAP 1; y = UMAP 2; z = UMAP 3. Legend: green = PDC1; blue = PDC2; red = JHOS2.
│   ├── DMSO_3D_UMAP_Dini.et.al.html → 3D UMAP of untreated cells.
│   └── drug_3D_UMAP_Dini.et.al.html → 3D UMAP of cells treated with (drug).
├── QC_plots/ → Diagnostic plots. Related to Figures 2–4.
│   ├── model_QC_violin_plot_2023.pdf → Violin plots of the QC metrics used to filter the data.
│   ├── model_col_HTO or model_row_HTO before and after filt → Heatmaps of the row or column HTO expression in each cell.
│   └── model_counts_histogram_2023.pdf → Histogram of the distribution of the total counts per cell after filtering for high-quality cells.
├── scRNAseq/ → scRNA-seq data. Related to Figures 2–4.
│   ├── AllData_subsampled_DGE_edgeR.csv.gz → Differential gene expression analyses results between treated and untreated cells via pseudobulk of aggregate subsamples, for each of the three models. Related to Figure 3.
│   └── All_vs_all_RNAclusters_DEG_signif.txt → Differential gene expression analysis results (p.adj < 0.05) of FindAllMarkers for the Leiden/RNA clusters.
├── PDCs.transcript.counts.tsv → Bulk RNA-seq count data for PDCs 1–3 processed by Kallisto. Related to Figure S6.
└── PDCs.transcript.TPM.tsv → Bulk RNA-seq TPM data for PDCs 1–3 processed by Kallisto. Related to Figure S6.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets produced during the validation of CWL-based pipelines designed for the analysis of data from RNA-Seq, ChIP-Seq and germline variant calling experiments. Specifically, the workflows were tested using publicly available high-throughput sequencing (HTS) data from published studies on Chronic Lymphocytic Leukemia (CLL) (accession numbers: E-MTAB-6962, GSE115772) and Genome in a Bottle (GIAB) project samples (accession numbers: SRR6794144, SRR22476789, SRR22476790, SRR22476791).
The supporting data include:
Differential transcript and gene expression results produced during the analysis with the CWL-based RNA-Seq pipeline
Bigwig and narrowPeak files, differential binding results, table of consensus peaks and read counts of EZH2 and H3K27me3, produced during the analysis with the CWL-based ChIP-Seq pipeline
VCF files containing the detected and filtered variants, along with the respective hap.py results regarding comparisons against the GIAB golden standard truth sets for both CWL-based germline variant calling pipelines
These datasets were generated by ReapTEC (read-level pre-filtering and transcribed enhancer call) using 5′ single-cell RNA-seq data on human heterogeneous CD4+ T cells. By taking advantage of a unique “cap signature” derived from the 5′-end of a transcript, ReapTEC simultaneously profiles gene expression and enhancer activity at nucleotide resolution using 5′-end single-cell RNA-sequencing (5′ scRNA-seq). The details of the ReapTEC pipeline are described at https://github.com/MurakawaLab/ReapTEC.
README: Transcription start site analysis for heterogeneous CD4+ T cells using 5′ scRNA-seq
https://doi.org/10.5061/dryad.gtht76hv9
Description of the data and file structure
Data_summary.xlsx.zip: Summary of single-cell experiments in this study.
5scCTSSbed_All.zip: There are 102 files containing count data for analyzing transcription start site (TSS) signals. Details are as follows.
Our original raw sequencing data and processed 5′ scRNA-seq data have been deposited in the National Bioscience Database Center (NBDC) Human Database (accession code: hum0350). Raw sequencing data originating from human subjects have been deposited in the Japanese Genotype-phenotype Archive (JGA, accession code: JGAS000689). We retrieved 5′ scRNA-seq data for human memory CD4+ T cells stimulated with viral antigens from the Gene Expression Omnibus database (accession number GSE152522). In total, 102 5′ scRNA-seq datasets were processed by the ReapTEC pipeline (https://github.com/MurakawaLab/ReapTEC)....
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of four samples of GEO accession GSE119855 with the IBU RNA-seq pipeline
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This upload contains the data and code used to generate the statistics and graphs discussed in the paper "How tool combinations in different pipeline versions affect the outcome in RNA-seq analysis". The paper was pre-published on bioRxiv and submitted for publication to NAR Genomics and Bioinformatics.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The emergence of next-generation sequencing technology has generated much interest in the exploration of transcriptomes. Currently, Illumina Inc. (San Diego, CA) provides one of the most widely utilized sequencing platforms for gene expression analysis. While Illumina reagents and protocols perform adequately in RNA-sequencing (RNA-seq), alternative reagents and protocols promise a higher throughput at a much lower cost. We have developed a low-cost and robust protocol to produce Illumina-compatible (GAIIx and HiSeq2000 platforms) RNA-seq libraries by combining several recent improvements. First, we designed balanced adapter sequences for multiplexing of samples; second, dUTP incorporation in second-strand synthesis was used to enforce strand-specificity; third, we simplified the RNA purification, fragmentation and library size-selection steps, thus drastically reducing the time and increasing the throughput of library construction; fourth, we included an RNA spike-in control for validation and normalization purposes. To streamline informatics analysis for the community, we established a pipeline within the iPlant Collaborative. These scripts are easily customized to meet specific research needs and improve on existing informatics and statistical treatments of RNA-seq data. In particular, we apply significance tests for determining differential gene expression and intron retention events. To demonstrate the potential of both the library-construction protocol and the data-analysis pipeline, we characterized the transcriptome of the rice leaf. Our data support novel gene models and can be used to improve the current rice genome annotation. Additionally, using the rice transcriptome data, we compared different methods of calculating gene expression and discuss the advantages of a strand-specific approach for detecting bona fide antisense transcripts and intron retention events.
Our results demonstrate the potential of this low-cost and robust method for RNA-seq library construction and data analysis.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Pipeline comparisons for gene expression data are highly valuable for applied real data analyses, as they enable the selection of suitable analysis strategies for the dataset at hand. To give a global insight into pipeline performance, such pipelines should include mapping of reads, counting and differential gene expression analysis for RNA-Seq data, or preprocessing, normalization and differential gene expression analysis for microarray data.
Methods: Four commonly used RNA-Seq pipelines (STAR/HTSeq-Count/edgeR, STAR/RSEM/edgeR, Sailfish/edgeR, TopHat2/Cufflinks/CuffDiff) were investigated on multiple levels (alignment and counting) and cross-compared with the microarray counterpart on the level of gene expression and gene ontology enrichment. For these comparisons we generated two matched microarray and RNA-Seq datasets: Burkitt Lymphoma cell line data and rectal cancer patient data.
Results: The overall mapping rate of STAR was 98.98% for the cell line dataset and 98.49% for the patient dataset. TopHat2's overall mapping rate was 97.02% and 96.73%, respectively, while Sailfish had an overall mapping rate of only 84.81% and 54.44%. The correlation of gene expression in microarray and RNA-Seq data was moderately worse for the patient dataset (ρ = 0.67–0.69) than for the cell line dataset (ρ = 0.87–0.88). An exception was Cufflinks, whose correlation results were substantially lower (ρ = 0.21–0.29 and 0.34–0.53). For both datasets we identified very low numbers of differentially expressed genes using the microarray platform. For RNA-Seq we checked the agreement of differentially expressed genes identified in the different pipelines and of GO-term enrichment results.
Conclusion: The combination of the STAR aligner with HTSeq-Count, followed by STAR with RSEM and then Sailfish, generated differentially expressed genes best suited to the dataset at hand and in agreement with most of the other transcriptomics pipelines.
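The ρ values reported above are rank correlations between platforms. As a minimal sketch of how such a value can be computed for one sample, assuming Spearman's ρ with average ranks for ties (the abstract does not state the exact estimator), and using made-up expression values and hypothetical function names:

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Illustrative values only: five genes measured on both platforms.
microarray_log_intensity = [7.1, 8.4, 5.2, 9.9, 6.3]
rnaseq_counts = [120, 340, 40, 900, 15]
rho = spearman_rho(microarray_log_intensity, rnaseq_counts)
```

Because only ranks enter the calculation, the differing scales of microarray intensities and RNA-Seq counts do not need to be harmonized first.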
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Gene expression data and associated supplementary files from RNAseq of breast cancer samples from Staaf et al. npj Breast Cancer 2022 (source reference below). Library preparation for mRNA-sequencing was done by a stranded dUTP mRNA protocol or by the Illumina stranded TruSeq mRNA protocol. Expression data (Fragments Per Kilobase per Million reads, FPKM) were generated by an analysis pipeline utilizing Hisat/StringTie with the GRCh38 human genome primary assembly and GENCODE Release 27 transcripts/genes. Gene expression data are summarized per GENCODE gene identifier. Gene and transcript definitions and gene annotations are from GENCODE Release 27.
A detailed description, including material and methods for RNAseq, the Hisat/StringTie analysis pipeline, and the development of the Single Sample Predictor (SSP) models for Breast Cancer, is available in Staaf et al. npj Breast Cancer 2022 (source reference below).
The developed SSP models are available as an R package on GitHub (reference below).
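The FPKM unit used for the expression data above can be illustrated with the textbook formula. This is only a sketch: StringTie derives its values from its own abundance model, so the function below (a hypothetical helper with invented inputs) shows the definition of the unit, not the pipeline's actual computation.

```python
def fpkm(fragments, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    FPKM = fragments / (length in kb) / (total fragments in millions)
         = fragments * 1e9 / (length in bp * total fragments)
    """
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragments * 1e9 / (gene_length_bp * total_mapped_fragments)


# Illustrative numbers: 500 fragments on a 2,000 bp gene in a library
# of 10 million mapped fragments.
example_value = fpkm(500, 2_000, 10_000_000)
```

Dividing by both gene length and library size is what makes values comparable between genes of different lengths and between samples sequenced to different depths.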
RNAseq comparing the wt strain PcPCL1606 and the derivative mutant ΔdarB, defective in HPR production. RNA was extracted from the rhizosphere samples using a PowerSoil® RNA extraction kit (Qiagen Iberia S.L., Madrid, Spain) following the manufacturer's instructions, and its amount was quantified using a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA). For the RNAseq experiment, the quantity and quality of RNA were verified by the Genomics and Ultrasequencing Service Unit (University of Malaga), and the RNA was subsequently sequenced using NextSeq550 equipment (Illumina). Processing of the raw reads was carried out by the Centre for Supercomputing and Bioinnovation (University of Malaga). The bacterial RNAseq data analysis was performed with a series of software packages adapted to the experimental model. The software components of the RNAseq analysis pipeline were SeqTrimNext (v.2.0.6), to remove low-quality reads, adapters, organellar DNA and contaminant sequences; BOWTIE (v.2.2.9), to align reads to the genomic reference; Samtools (v. 0.1.19), a package of programs for reading, writing, editing and viewing alignment files in SAM/BAM format (http://www.htslib.org/); and TUXEDO tools (http://cole-trapnell-lab.github.io/cufflinks/manual/), used to assign the aligned RNAseq reads to the different transcripts and estimate their abundance. The abundance of the transcripts was measured in fragments per kilobase of exon per million reads (FPKM). Once the transcripts had been assembled and their FPKM estimated, they were annotated with the known reference set of genes obtained from the annotated reference file.
This pipeline is a tool developed by the Andalusian Platform for Bioinformatics (PAB; http://www.scbi.uma.es/site/omics/bioinformatics) for differential expression analysis of RNAseq data against a genomic reference. The subsequent analysis of differential expression, with a method analogous to differentially expressed sequences, and the graphical representation of the expression results were done using the 'cummeRbund' R package (v. 2.42.0). The resulting FPKM matrix was used to obtain the list of differentially expressed genes with a p-value less than 0.05.
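The final selection step described above, keeping genes with a p-value below 0.05, can be sketched as follows. The gene identifiers and p-values are invented for illustration, and `select_degs` is a hypothetical helper, not part of cummeRbund or the PAB pipeline.

```python
def select_degs(p_values_by_gene, alpha=0.05):
    """Return the sorted gene identifiers whose p-value is strictly below alpha."""
    return sorted(gene for gene, p in p_values_by_gene.items() if p < alpha)


# Illustrative test results only; not data from the experiment.
example_results = {
    "gene_0001": 0.001,
    "gene_0002": 0.20,
    "gene_0003": 0.049,
    "gene_0004": 0.05,
}
selected = select_degs(example_results)
```

The comparison is strict, so a gene with a p-value of exactly 0.05 is not selected, matching the "less than 0.05" wording of the description.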
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This workflow adapts the approach and parameter settings of Trans-Omics for Precision Medicine (TOPMed). The RNA-seq pipeline originated from the Broad Institute. There are in total five steps in the workflow starting from:
For testing and analysis, the workflow authors provided example data created by down-sampling the read files of a TOPMed public-access dataset. Chromosome 12 was extracted from the Homo sapiens assembly 38 reference sequence and provided by the workflow authors. The required GTF and RSEM reference data files are also provided. The workflow is well documented, and a detailed set of instructions for the steps performed to down-sample the data is also provided for transparency. The availability of example input data, the use of containerization for the underlying software and the detailed documentation were important factors in choosing this specific CWL workflow for CWLProv evaluation.
This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance, see https://w3id.org/cwl/prov/0.5.0 or use https://pypi.org/project/cwl
Steps to reproduce
To build the research object again, use Python 3 on macOS. Built with:
Install cwltool
pip3 install cwltool==1.0.20180912090223
Install Git LFS
Downloading the data with the git repository requires Git LFS to be installed:
https://www.atlassian.com/git/tutorials/git-lfs#installing-git-lfs
Get the data and make the analysis environment ready:
git clone https://github.com/FarahZKhan/cwl_workflows.git
cd cwl_workflows/
git checkout CWLProvTesting
./topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/download_examples.sh
Run the following commands to create the CWLProv Research Object:
cwltool --provenance rnaseqwf_0.5.0_mac --tmp-outdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp --tmpdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp topmed-workflows/TOPMed_RNAseq_pipeline/rnaseq_pipeline_fastq.cwl topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/Dockstore.json
zip -r rnaseqwf_0.5.0_mac.zip rnaseqwf_0.5.0_mac
sha256sum rnaseqwf_0.5.0_mac.zip > rnaseqwf_0.5.0_mac_mac.zip.sha256
The https://github.com/FarahZKhan/cwl_workflows repository is a frozen snapshot from https://github.com/heliumdatacommons/TOPMed_RNAseq_CWL commit 027e8af41b906173aafdb791351fb29efc044120
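The checksum file written by the last command in the steps above can later be used to confirm that a downloaded or rebuilt archive is intact. A minimal sketch in Python, assuming the checksum file follows the usual `sha256sum` layout (hex digest, whitespace, file name); the function names are hypothetical:

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum_file(archive_path, checksum_path):
    """Compare an archive's digest with the first field of a sha256sum-style file."""
    expected = Path(checksum_path).read_text().split()[0].lower()
    return sha256_of_file(archive_path) == expected
```

With the file names used in the steps above, the check would be `matches_checksum_file("rnaseqwf_0.5.0_mac.zip", "rnaseqwf_0.5.0_mac_mac.zip.sha256")`. Reading in chunks keeps memory use flat even for large research-object archives.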