Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
One of the key challenges for transcriptomics-based research is not only the processing of large data but also modeling the complexity of features that are sources of variation across samples, which is required for an accurate statistical analysis. Therefore, our goal is to foster access for wet lab researchers to bioinformatics tools, in order to enhance their ability to explore biological aspects and validate hypotheses with robust analysis. In this context, user-friendly interfaces can enable researchers to apply computational biology methods without requiring bioinformatics expertise. Such bespoke platforms can improve the quality of the findings by allowing the researcher to freely explore the data and test new hypotheses independently. Simplicity DiffExpress is a data-driven software platform dedicated to enabling non-bioinformaticians to take ownership of the differential expression analysis (DEA) step in a transcriptomics experiment while presenting the results in a comprehensible layout, which supports efficient exploration of results, information storage, and reproducibility. Simplicity DiffExpress’ key component is the bespoke statistical model validation that guides the user through any necessary alteration in the dataset or model, tackling the challenges behind complex data analysis. The software utilizes edgeR, and it is implemented as part of the Simplicity™ platform, providing a dynamic interface with well-organized results that are easy to navigate and share. Computational biologists and bioinformaticians can also benefit from its use since the data validation is more informative than the usual DEA resources. Wet-lab collaborators can benefit from receiving their results in an organized interface. Simplicity DiffExpress is freely available for academic use, and it is cloud-based (https://simplicity.nsilico.com/dea).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
Warden and Wu Preprint: v1
In general, this primarily focuses on the following types of comparisons:
Cell line experiments with over-expression or knock-down to define a known causal gene, with processing starting with public reads.
Processed TCGA (The Cancer Genome Atlas) data for breast cancer (BRCA) to compare gene expression by immunohistochemistry status (ER/ESR1, PR/PGR, or HER2/ERBB2).
Differential expression methods include the following:
edgeR (GLM)
edgeR-robust (GLM)
edgeR (QL)
edgeR-robust (QL)
DESeq1
DESeq2
limma-voom
limma-trend (CPM)
limma-trend (FPKM/RPKM)
ANOVA (log2 FPKM/RPKM)
The most common preprocessing strategies include STAR, TopHat2, and Salmon. However, a limited amount of additional processing with HISAT2, kallisto, Bowtie2 (+eXpress), and Bowtie1 (+RSEM) is also provided.
Most STAR and TopHat2 alignments are quantified with htseq-count and are also run through cuffdiff (for single-variable, 2-group comparisons). However, a limited amount of additional processing with featureCounts is also provided.
Most STAR and TopHat2 alignments start with the public forward reads, even if paired-end data was available.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
EdgeR results from MMGs. Differential expression results calculated by edgeR for MMG counts produced by the stage 2 analysis. Can be downloaded from [43]. (XLSX 428 kb)
Data set 1. Transcript expression across human RNA-Seq samples: estimated read counts. The file contains estimated read counts, generated by kallisto (https://pachterlab.github.io/kallisto/), for human transcripts and RNA-Seq samples used in this study (see Additional file 2 of the accompanying publication). The format is a compressed (GZIP) tab-separated transcript-by-sample matrix. Ensembl transcript identifiers and a combined Sequence Read Archive study/sample name identifier serve as row and column names, respectively.
Data set 2. Transcript expression across murine RNA-Seq samples: estimated read counts. As in Data set 1, but for mouse transcripts.
Data set 3. Transcript expression across simian RNA-Seq samples: estimated read counts. As in Data set 1, but for chimpanzee transcripts.
Data set 4. Transcript expression across human RNA-Seq samples: estimated transcript abundances. As in Data set 1, but instead of read counts, transcript abundances in transcripts per million (TPM), as estimated by kallisto (https://pachterlab.github.io/kallisto/), are listed. Format, column and row names as in Data set 1.
Data set 5. Transcript expression across murine RNA-Seq samples: estimated transcript abundances. As in Data set 4, but for mouse transcripts.
Data set 6. Transcript expression across simian RNA-Seq samples: estimated transcript abundances. As in Data set 4, but for chimpanzee transcripts.
Data set 7. Differential expression analyses across human RNA-Seq sample groups: log fold changes. The file contains log fold changes, inferred by edgeR (http://bioconductor.org/packages/release/bioc/html/edgeR.html), for human genes and the RNA-Seq sample group contrasts listed in Additional file 3 of the accompanying publication in a compressed (GZIP) TSV gene-by-comparison matrix. Ensembl gene identifiers and a descriptive contrast identifier serve as row and column names, respectively.
Data set 8. Differential expression analyses across murine RNA-Seq sample groups: log fold changes. As in Data set 7, but for mouse genes.
Data set 9. Differential expression analyses across simian RNA-Seq sample groups: log fold changes. As in Data set 7, but for chimpanzee genes.
Data set 10. Differential expression analyses across human RNA-Seq sample groups: false discovery rates. The file contains false discovery rates (FDR) for the differential expression analyses summarized in Data set 7. Format, column and row names as in Data set 7.
Data set 11. Differential expression analyses across murine RNA-Seq sample groups: false discovery rates. As in Data set 10, but for mouse genes.
Data set 12. Differential expression analyses across simian RNA-Seq sample groups: false discovery rates. As in Data set 10, but for chimpanzee genes.
Data set 13. Quantification of alternative splicing events across human RNA-Seq samples. The file contains ‘percent spliced in’ (PSI) values computed by SUPPA (https://github.com/comprna/SUPPA) for annotated alternative splicing events (inferred from the transcript annotation of the human genome, Ensembl release 84; http://www.ensembl.org/). The format is a compressed (GZIP) tab-separated transcript-by-sample matrix. SUPPA-provided event identifiers and a combined Sequence Read Archive study/sample name identifier serve as row and column names, respectively.
Data set 14. Quantification of alternative splicing events across murine RNA-Seq samples. As in Data set 13, but for mouse alternative splicing events.
Data set 15. Differential splicing analyses across human RNA-Seq sample groups: differences in ‘percent spliced in’ (ΔPSI). The file contains ΔPSI values for human alternative splicing events (as in Data set 13). The RNA-Seq sample group contrasts are listed in Additional file 3 of the accompanying publication. Values were inferred by SUPPA’s diffSplice functionality (https://github.com/comprna/SUPPA). The format is a compressed (GZIP) tab-separated gene-by-comparison matrix. SUPPA event identifiers and a descriptive contrast identifier serve as row and column names, respectively.
Data set 16. Differential splicing analyses across murine RNA-Seq sample groups: differences in ‘percent spliced in’ (ΔPSI). As in Data set 15, but for mouse alternative splicing events.
Data set 17. Differential splicing analyses across human RNA-Seq sample groups: P values. The file contains P values for the differential splicing analysis of human alternative splicing events summarized in Data set 15. Format, column and row names as in Data set 15.
Data set 18. Differential splicing analyses across murine RNA-Seq sample groups: P values. The file contains P values for the differential splicing analysis of mouse alternative splicing events summarized in Data set 16. Format, column and row names as in Data set 15.
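The data sets above are all described as GZIP-compressed, tab-separated matrices with identifiers as row and column names. A minimal sketch of loading such a file with the Python standard library is given here. It assumes that the first row is a header whose first cell is a corner label, and that missing values are written as an empty cell, "NA", or "nan"; neither assumption is confirmed by the record, and the function name read_matrix is hypothetical.

```python
import csv
import gzip


def _to_float(cell):
    """Convert one cell to float; treat common missing-value markers as NaN (assumed markers)."""
    return float("nan") if cell in ("", "NA", "nan", "NaN") else float(cell)


def read_matrix(path):
    """Read a gzipped TSV matrix.

    Returns (column_names, rows), where rows maps each row identifier
    to a list of floats in column order.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        column_names = header[1:]  # assumes the first header cell is a corner label
        rows = {row[0]: [_to_float(v) for v in row[1:]] for row in reader if row}
    return column_names, rows
```

A call such as read_matrix("data_set_1.tsv.gz") (a hypothetical file name) would return the sample identifiers and a transcript-to-values mapping. If the header turns out to have one cell fewer than the data rows, the header[1:] slice would need to become header.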
Background: Pipeline comparisons for gene expression data are highly valuable for applied real data analyses, as they enable the selection of suitable analysis strategies for the dataset at hand. Such pipelines for RNA-Seq data should include mapping of reads, counting and differential gene expression analysis or preprocessing, normalization and differential gene expression in case of microarray analysis, in order to give a global insight into pipeline performances. Methods: Four commonly used RNA-Seq pipelines (STAR/HTSeq-Count/edgeR, STAR/RSEM/edgeR, Sailfish/edgeR, TopHat2/Cufflinks/CuffDiff) were investigated on multiple levels (alignment and counting) and cross-compared with the microarray counterpart on the level of gene expression and gene ontology enrichment. For these comparisons we generated two matched microarray and RNA-Seq datasets: Burkitt Lymphoma cell line data and rectal cancer patient data. Results: The overall mapping rate of STAR was 98.98% for the cell line dataset and 98.49% for the patient dataset. TopHat2’s overall mapping rate was 97.02% and 96.73%, respectively, while Sailfish had an overall mapping rate of only 84.81% and 54.44%. The correlation of gene expression in microarray and RNA-Seq data was moderately worse for the patient dataset (ρ = 0.67–0.69) than for the cell line dataset (ρ = 0.87–0.88). The exception was Cufflinks, whose correlation results were substantially lower (ρ = 0.21–0.29 and 0.34–0.53). For both datasets we identified very low numbers of differentially expressed genes using the microarray platform. For RNA-Seq we checked the agreement of differentially expressed genes identified in the different pipelines and of GO-term enrichment results. Conclusion: The combination of the STAR aligner with HTSeq-Count, followed by the STAR aligner with RSEM and by Sailfish, generated differentially expressed genes best suited for the dataset at hand and in agreement with most of the other transcriptomics pipelines.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
Figure S1, Venn diagram showing the number of differentially expressed genes identified by two versions of Cuffdiff2. Figure S2, The effects of biological replicates on the differential expression analysis for Cuffdiff v2.0.2. Figure S3, The detected fold changes of all the differentially expressed genes identified by three tools were compared and shown, including DESeq vs. edgeR (top panel), DESeq vs. Cuffdiff2 (middle panel) and edgeR vs. Cuffdiff2 (bottom panel). File S1, Analysis pipelines, methods and examples of commands for differential expression analysis, subsampling fastq files and generating SAM/BAM files based on simulated count values. File S2, The raw count values for genes with high fold changes were picked up by edgeR but not by DESeq. Genes with high fold changes (the absolute value of log2 fold changes larger than 2) identified as DEGs by edgeR but not by DESeq are listed in the file. The gene ID, the log2 fold changes (logFC) and FDR from DESeq, the logFC and FDR from edgeR, the raw count values for the four replicates of sample K (K1–K4) and sample N (N1–N4) are shown in each of the columns. Table S1, Numbers of reads for the human hbr and uhr samples from the MAQC dataset. Table S2, Numbers of reads for the mouse neurosphere samples for treatment groups of K and N (the K_N dataset). Table S3, The number of reads for each individual sample of the LCL3 dataset. Table S4, The definition for TP, FP, TN, FN, TPR and FPR. Table S5, The false positive rate for Cuffdiff2, DESeq and edgeR based on the LCL1 dataset. (ZIP)
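Table S4 is listed as defining TP, FP, TN, FN, TPR and FPR, but the definitions themselves are not reproduced in this record. The sketch below uses the standard confusion-matrix definitions, which are assumed (not confirmed) to match the table; the function name and the toy gene identifiers are hypothetical.

```python
def confusion_rates(called, truly_de, all_genes):
    """Compare genes called differentially expressed with a known truth set.

    TP: called and truly DE; FP: called but not truly DE;
    FN: truly DE but not called; TN: neither called nor truly DE.
    TPR = TP / (TP + FN); FPR = FP / (FP + TN).
    """
    called, truly_de, all_genes = set(called), set(truly_de), set(all_genes)
    tp = len(called & truly_de)
    fp = len(called - truly_de)
    fn = len(truly_de - called)
    tn = len(all_genes - called - truly_de)
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn, "TPR": tpr, "FPR": fpr}


# Toy example: 10 genes, 4 truly DE, 4 called (3 correctly).
genes = [f"g{i}" for i in range(1, 11)]
rates = confusion_rates(
    called={"g1", "g2", "g3", "g7"},
    truly_de={"g1", "g2", "g3", "g4"},
    all_genes=genes,
)
```

In a setting where no gene is truly differentially expressed, every call is a false positive and the FPR reduces to the fraction of genes called.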
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
EdgeR results from unique counts. Differential expression results calculated by edgeR for gene counts produced by the stage 1 analysis. Can be downloaded from [43]. (XLSX 2159 kb)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
Tables “Genes” (List of genome fragments identified by TopHat2, their genomic location, RPKM values in six samples and annotation), “Cufflinks” (Differentially expressed genes identified by Cufflinks), “EdgeR” (Differentially expressed genes identified by EdgeR), Table “UpRegDEGS_Cufflinks & EdgeR” (Differentially expressed genes identified by Cufflinks and EdgeR, upregulated in BLP line), and “DownRegDEGS_Cufflinks & EdgeR” (Differentially expressed genes identified by Cufflinks and EdgeR, downregulated in BLP line). (XLSX 2557 kb)
Male sterility is an important mechanism in watermelon for the production of hybrid seed. While fruit development studies have been widely performed in watermelon, there are no reports profiling gene expression in its floral organs. RNA-seq analysis was performed to identify differentially expressed genes (DEGs) related to male sterility between two watermelon lines: the genetic male-sterile (GMS) DAH3615-MS line and the male-fertile DAH3615 line. This study employed TopHat and edgeR for transcriptome analysis of next-generation RNA-seq data, which included 2 tissues obtained from 2 different breeds of watermelon.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
Average AUC values for simulation data with various options. Average AUC values of 100 trials are shown. The suggested (or default) options and the highest AUC values are in bold. Sheet 1: E-E (edgeR), Sheet 2: D-D (DESeq), Sheet 3: S-S (DESeq2). (XLSX 12 kb)
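The record does not say how the AUC values were computed. As an illustration only, the sketch below uses the rank-based definition of AUC (the probability that a truly differentially expressed gene receives a higher score than a non-DE gene, with ties counted as one half) and averages it over trials. The function names and toy scores are hypothetical.

```python
from statistics import mean


def auc(scores, labels):
    """Rank-based AUC: fraction of (DE, non-DE) pairs in which the DE gene scores higher."""
    pos = [s for s, is_de in zip(scores, labels) if is_de]
    neg = [s for s, is_de in zip(scores, labels) if not is_de]
    if not pos or not neg:
        raise ValueError("need at least one DE and one non-DE gene")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_auc(trials):
    """Average AUC over an iterable of (scores, labels) trials."""
    return mean(auc(scores, labels) for scores, labels in trials)


# Toy trial: higher score means stronger evidence of differential expression.
example_auc = auc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

When genes are ranked by p-value, a score such as 1 minus the p-value can be used so that higher scores correspond to stronger evidence.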
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
Output data from RNA-Seq analyses of various datasets used in the study "SMG1:SMG8:SMG9-complex integrity maintains robustness of nonsense-mediated mRNA decay".
Data are organized in zip folders, one for each individual RNA-Seq dataset; please see SMG189_datasets.csv or SMG189_samples.csv for dataset/sample metadata. Each zip folder contains the output of Salmon transcript quantification, DESeq2 DGE and edgeR DTE analyses, log files, quality control analyses and metadata files (design.txt and experiment.txt).
Combined with the scripts and additional information found at https://github.com/boehmv/2024_SMG189, these data should allow reproducing the analysis and plots in the manuscript.
Additional helper files (e.g. annotation) used in the analyses and too big for GitHub are provided here as well. E.g. 2024-10-28_SMG189_datasources.rds contains the DESeq2 and edgeR output ready for import via R.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
RNA-seq FPKM values of mouse cell lines Ink4a.1, Met25, Met35, Met36, and Met38. Pipeline of data: Ink4a.1 and Met cell lines were profiled using standard RNA-seq completed by Azenta Life Sciences (formerly known as Genewiz, South Plainfield, NJ). After quality check of the reads using FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), we used Salmon to quantify transcript-level expression and edgeR to identify genes with significantly differential expression between pairs of conditions based on replicated count data from bulk RNA-seq profiling. The normalized data were passed to the R package GAGE for gene-set enrichment and pathway analysis. The p-values were corrected for multiple testing using FDR.
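The record states only that p-values were corrected "using FDR" and does not name the procedure. The sketch below shows the Benjamini-Hochberg adjustment, a common choice for FDR control; its use here is an assumption, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Each p-value is scaled by m / rank (rank 1 = smallest), then a running
    minimum from the largest rank downward enforces monotonicity.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example with four tests.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Genes whose adjusted value falls below a chosen threshold are then reported as significant at that FDR level.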
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
WT, wild-type strain Br48; Myc, mycelia; GT, germination tubes.
The edgeR package was used to identify genes differentially expressed or enriched with the corrected p-value cutoffs of 0.01 (ChIP-seq analysis) or 0.001 (RNA-seq analysis).
*, Data from two Δmoset1 complemented strains (Δmoset1-TF2 and -TF3) with FLAG-tagged MoSET1 were used in the analysis.
**, Numbers in parentheses represent the number of MoSET1-dependent genes (see text).
Number of genes differentially enriched for H3K4me2/H3K4me3 in ChIP-seq analysis and differentially expressed in RNA-seq analysis between mycelia and germination tubes.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
edgeR detected top 1000 differentially expressed genes (FDR cutoff = 0.01) in GBM samples without TIN correction. FC = Fold Change; CPM = Count Per Million; FDR = False Discovery Rate. (XLS 143 kb)
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
Primary bone marrow-derived macrophages (BMDMs) were isolated from femurs of Nrp2fl/flLyz2-cre and Nrp2fl/fl mice and cultured in DMEM supplemented with 10% FBS, penicillin-streptomycin (50 U/mL) and 20% L929 supernatant as a source of M-CSF for 7 days of maturation. After 7 days, BMDMs were treated with 200 ng/ml LPS for 18 h in vitro. After polarization, BMDMs were collected for RNA sequencing analysis. All animal experiments were approved by the Institutional Animal Care and Use Committee of Sichuan University.
UID RNA-seq experiments, high-throughput sequencing, and data analysis were conducted by Seqhealth Technology Co., LTD (Wuhan, China). Briefly, total RNA was extracted from samples using TRIzol Reagent. 2 μg of total RNA was used for stranded RNA sequencing library preparation using the KC-Digital™ Stranded mRNA Library Prep Kit for Illumina® following the manufacturer’s instructions. Raw sequencing data were first filtered by Trimmomatic (version 0.36). Low-quality reads were discarded, and reads contaminated with adaptor sequences were trimmed. Clean reads were further treated with in-house scripts to eliminate duplication bias introduced in library preparation and sequencing. The de-duplicated consensus sequences were used for standard RNA-seq analysis. Reads mapped to the exon regions of each gene were counted by featureCounts (Subread-1.5.1; Bioconductor) and then RPKM was calculated. Genes differentially expressed between groups were identified using the edgeR package (version 3.12.1). A p-value cutoff of 0.05 and a fold-change cutoff of 2 were used to judge the statistical significance of gene expression differences.
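The RPKM calculation and the significance cutoffs mentioned above can be sketched as follows. RPKM uses the standard definition (reads per kilobase of exon per million mapped reads). Applying the fold-change cutoff symmetrically (at least 2-fold up or down) and treating the p-value cutoff as strict are assumptions, since the record does not specify either; the function names are hypothetical.

```python
import math


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def is_significant(p_value, fold_change, p_cutoff=0.05, fc_cutoff=2.0):
    """True if p < p_cutoff and the gene changes at least fc_cutoff-fold in either direction.

    fold_change is a ratio (treated / control) and must be positive.
    """
    return p_value < p_cutoff and abs(math.log2(fold_change)) >= math.log2(fc_cutoff)


# 500 reads on a 2,000 bp gene in a library of 10 million mapped reads.
example_rpkm = rpkm(500, 2000, 10_000_000)
```

With these defaults, a gene with p = 0.01 and a fold change of 0.25 (4-fold down) passes, while one with a fold change of 1.5 does not.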
Purpose: The goals of this study are to compare transcriptomes using RNA-seq of mouse myoblasts (C2C12 cell line) in undifferentiated and differentiated states and with siRNA-mediated knockdown of the RNA binding proteins Rbfox1 (only expressed in the differentiated state) and Rbfox2 (expressed in both undifferentiated and differentiated states). Methods: Differentiated and undifferentiated C2C12 cultures treated with Rbfox1 (differentiated only) or Rbfox2 siRNAs or a mock siRNA transfection were used for RNA-Seq analysis using Illumina HiSeq2000. 101x2 paired-end RNA-seq reads were first uniquely aligned to the mouse genome (mm9) using TopHat 1.4.1. RSEM was used to count the number of reads mapped to genes using the UCSC database, followed by edgeR to call differentially expressed genes with a false discovery rate less than 0.01. Cufflinks was used to reconstruct isoforms and analyze alternative splicing, and percent spliced in (PSI) was calculated. PSI values were validated by RT-PCR. Results: 58-88% of the RNA-seq reads from technical and biological replicates mapped uniquely to the mouse genome. Analyses of gene expression and alternative splicing changes are published in Singh et al. Molecular Cell (2014). Conclusions: Our study identified gene expression and alternative splicing transitions that occur during myoblast differentiation, demonstrated that 30% of the splicing transitions are regulated by Rbfox2, demonstrated that Rbfox2 is required for a late step of myoblast differentiation, and identified two Rbfox2-regulated splicing transitions that are required for differentiation. Overall design: Undifferentiated and differentiated C2C12 cultures with Rbfox2 depletion or Rbfox1 depletion (differentiated only), in at least duplicate samples, were analyzed by deep sequencing on Illumina HiSeq2000.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
edgeR detected 665 differentially expressed genes (FDR cutoff = 0.01) in mCRPC samples (without TIN correction). FC = Fold Change; CPM = Count Per Million; FDR = False Discovery Rate. (XLS 97 kb)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
Correlation coefficients for each method compared to RT-qPCR. (XLSX 10 kb)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
edgeR detected top 117 differentially expressed genes (FDR cutoff = 0.01) in GBM samples using 3′ count method (3TC). FC = Fold Change; CPM = Count Per Million; FDR = False Discovery Rate. (XLS 22 kb)
Conventional (bulk) RNA-sequencing was performed on unfractionated cell suspension or snap frozen whole tissue material. Total RNA was isolated with TRIzol reagent followed by purification over PureLink RNA Mini Kit columns (Invitrogen). RNA-seq was performed using a polyA-enriched strand-specific library construction protocol (doi: 10.1016/j.ccell.2016.02.009) and paired-end 75bp sequencing on an Illumina HiSeq 2500 instrument. Raw reads were aligned to the reference human genome assembly GRCh37 (hg19) using STAR (v2.5.2.a). To improve spliced alignment, STAR was provided with exon junction coordinates from the reference annotations (Gencode v19). We applied a modified version of a bioinformatics workflow for normalization of raw read counts and differential gene expression analysis (doi: 10.12688/f1000research.9005.3). Gene-level read counts were quantified using HTSEQ-count (v0.11.0; intersection-strict, reverse mode) (doi: 10.1093/bioinformatics/btu638). Genes showing low read counts (i.e., genes not showing counts per million (cpm) > 1.0 in at least 10% of samples) were removed from further analysis. Raw counts from expressed genes were then TMM-normalized and scaled to counts per million (CPM) using the edgeR (v3.22.2) package (doi: 10.1093/bioinformatics/btp616). Sample IDs correspond to those referenced in Wang X et al, Nature Communications (2022).
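The low-count filter described above (keep genes with CPM > 1.0 in at least 10% of samples) can be sketched in plain Python. This is an illustration under stated assumptions, not the workflow's own code: CPM is computed from raw library sizes without TMM scaling factors, "at least 10%" is rounded up to a whole number of samples, and the function names and toy counts are hypothetical.

```python
import math


def counts_per_million(counts):
    """counts: dict mapping gene -> list of raw counts (one per sample). Returns CPM values."""
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [1e6 * c / lib_sizes[j] if lib_sizes[j] else 0.0 for j, c in enumerate(row)]
        for gene, row in counts.items()
    }


def filter_low_expression(counts, min_cpm=1.0, min_fraction=0.10):
    """Keep genes whose CPM exceeds min_cpm in at least min_fraction of samples."""
    cpm = counts_per_million(counts)
    n_samples = len(next(iter(counts.values())))
    min_samples = math.ceil(min_fraction * n_samples)
    return {
        gene: counts[gene]
        for gene, row in cpm.items()
        if sum(value > min_cpm for value in row) >= min_samples
    }


# Toy matrix: each sample has a library size of exactly 1,000,000 reads.
toy_counts = {
    "geneA": [500, 400, 600, 450],
    "geneB": [0, 0, 1, 0],  # CPM never exceeds 1.0, so it is removed
    "geneC": [999500, 999600, 999399, 999550],
}
kept = filter_low_expression(toy_counts)
```

The filter returns raw counts rather than CPM values so that the surviving genes can be passed to a normalization step afterwards.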