RNA-seq gene count datasets built using the raw data from 18 different studies. The raw sequencing data (.fastq files) were processed with Myrna to obtain tables of counts for each gene. For ease of statistical analysis, we combined each count table with sample phenotype data to form an R object of class ExpressionSet. The count tables, ExpressionSets, and phenotype tables are ready to use and freely available. By taking care of several preprocessing steps and combining many datasets into one easily accessible website, we make finding and analyzing RNA-seq data considerably more straightforward.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RNA sequencing (RNA-seq) is widely used for RNA quantification in the environmental, biological and medical sciences. It enables the description of genome-wide patterns of expression and the identification of regulatory interactions and networks. The aim of RNA-seq data analyses is to achieve rigorous quantification of genes/transcripts to allow a reliable prediction of differential expression (DE), despite variation in levels of noise and inherent biases in sequencing data. This can be especially challenging for datasets in which gene expression differences are subtle, as in the behavioural transcriptomics test dataset from D. melanogaster that we used here. We investigated the power of existing approaches for quality checking mRNA-seq data and explored additional, quantitative quality checks. To accommodate nested, multi-level experimental designs, we incorporated sample layout into our analyses. We employed a normalization based on subsampling without replacement and an identification of DE that accounted for the hierarchy and amplitude of effect sizes within samples, then evaluated the resulting differential expression calls in comparison to existing approaches. In a final step to test for broader applicability, we applied our approaches to a published set of H. sapiens mRNA-seq samples. The dataset-tailored methods improved sample comparability and delivered a robust prediction of subtle gene expression changes. The proposed approaches have the potential to improve key steps in the analysis of RNA-seq data by incorporating the structure and characteristics of biological experiments.
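The idea of normalizing by subsampling without replacement can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy counts, and the choice of the smallest library size as the common depth are all assumptions made for illustration. Each read is drawn at most once, so a subsampled count can never exceed the original count.

```python
import random
from collections import Counter

def subsample_counts(counts, depth, rng):
    """Draw `depth` reads without replacement from one sample's gene counts.

    `counts` maps gene name -> integer read count (hypothetical toy input).
    """
    genes = list(counts)
    total = sum(counts.values())
    if depth > total:
        raise ValueError("target depth exceeds the sample's total read count")
    # With `counts=`, random.sample treats each gene as repeated counts[gene]
    # times, so every individual read is selected at most once (Python 3.9+).
    drawn = rng.sample(genes, k=depth, counts=[counts[g] for g in genes])
    tally = Counter(drawn)
    return {g: tally.get(g, 0) for g in genes}

def normalize_by_subsampling(samples, seed=0):
    """Subsample every sample down to the smallest library size (an assumption)."""
    rng = random.Random(seed)
    depth = min(sum(c.values()) for c in samples.values())
    return {name: subsample_counts(c, depth, rng) for name, c in samples.items()}

# Toy data, invented for illustration only.
samples = {
    "s1": {"geneA": 50, "geneB": 30, "geneC": 20},
    "s2": {"geneA": 500, "geneB": 300, "geneC": 200},
}
normalized = normalize_by_subsampling(samples)
```

After this step both toy samples have the same total read count, which is what makes their per-gene counts directly comparable. Repeating the draw with different seeds would give a sense of how much the subsampling itself contributes to the variation.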
This record includes training materials associated with the Australian BioCommons workshop 'Single cell RNAseq analysis in R'. This workshop took place over two 3.5-hour sessions on 26 and 27 October 2023.
Event description
Analysis and interpretation of single cell RNAseq (scRNAseq) data requires dedicated workflows. In this hands-on workshop we will show you how to perform single cell analysis using Seurat, an R package for QC, analysis, and exploration of single-cell RNAseq data. We will discuss the 'why' behind each step and cover reading in the count data, quality control, filtering, normalisation, clustering, UMAP layout and identification of cluster markers. We will also explore various ways of visualising single cell expression data. This workshop is presented by the Australian BioCommons, Queensland Cyber Infrastructure Foundation (QCIF) and the Monash Genomics and Bioinformatics Platform with the assistance of a network of facilitators from the national Bioinformatics Training Cooperative.
Lead trainers: Sarah Williams, Adele Barugahare, Paul Harrison, Laura Perlaza Jimenez
Facilitators: Nick Matigan, Valentine Murigneux, Magdalena (Magda) Antczak
Infrastructure provision: Uwe Winter
Coordinator: Melissa Burke
Training materials
Materials are shared under a Creative Commons Attribution 4.0 International agreement unless otherwise specified and were current at the time of the event.
Files and materials included in this record:
Event metadata (PDF): Information about the event including description, event URL, learning objectives, prerequisites, technical requirements etc.
Index of training materials (PDF): List and description of all materials associated with this event including the name, format, location and a brief description of each file.
scRNAseq_Schedule (PDF): A breakdown of the topics and timings for the workshop.
Materials shared elsewhere:
This workshop follows the tutorial 'scRNAseq Analysis in R with Seurat' https://swbioinf.github.io/scRNAseqInR_Doco/index.html
Slides used to introduce key topics are available via GitHub https://github.com/swbioinf/scRNAseqInR_Doco/tree/main/slides
This material is based on the introductory Guided Clustering Tutorial from Seurat. It also draws on a similar workshop held by Monash Bioinformatics Platform, Single-Cell-Workshop, with material here.
For methodological details, see S1 Text, paragraph "RNA-Seq Analysis". (XLSX)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
One of the key challenges for transcriptomics-based research is not only the processing of large data but also modeling the complexity of features that are sources of variation across samples, which is required for an accurate statistical analysis. Therefore, our goal is to foster access for wet lab researchers to bioinformatics tools, in order to enhance their ability to explore biological aspects and validate hypotheses with robust analysis. In this context, user-friendly interfaces can enable researchers to apply computational biology methods without requiring bioinformatics expertise. Such bespoke platforms can improve the quality of the findings by allowing the researcher to freely explore the data and test new hypotheses independently. Simplicity DiffExpress is a data-driven software platform dedicated to enabling non-bioinformaticians to take ownership of the differential expression analysis (DEA) step in a transcriptomics experiment while presenting the results in a comprehensible layout, which supports efficient exploration of results, information storage, and reproducibility. Simplicity DiffExpress’ key component is the bespoke statistical model validation that guides the user through any necessary alteration in the dataset or model, tackling the challenges behind complex data analysis. The software utilizes edgeR, and it is implemented as part of the Simplicity™ platform, providing a dynamic interface with well-organized results that are easy to navigate and share. Computational biologists and bioinformaticians can also benefit from its use since the data validation is more informative than the usual DEA resources. Wet-lab collaborators can benefit from receiving their results in an organized interface. Simplicity DiffExpress is freely available for academic use, and it is cloud-based (https://simplicity.nsilico.com/dea).
Some datasets for the SAOD (Statistical Analysis of Omics Data) course (Aix-Marseille Université, D. Puthier). The file Homo_sapiens.GRCh38.110.chr.tsv was produced using the following commands:
gtftk retrieve -r 110
gtftk convert_ensembl -i Homo_sapiens.GRCh38.110.chr.gtf.gz | gtftk nb_exons | gtftk feature_size -t mature_rna | gtftk feature_size -t transcript -k tx_genomic_size | gtftk exon_sizes | gtftk intron_sizes | gtftk select_by_key -t | gtftk tabulate -k '*' -u -x > Homo_sapiens.GRCh38.110.chr.tsv
This dataset contains the supplementary data for the research paper "Haploinsufficiency of the intellectual disability gene SETD5 disturbs developmental gene expression and cognition".
The contained files have the following content:
'Supplementary Figures.pdf' Additional figures (as referenced in the paper).
'Supplementary Table 1. Statistics.xlsx' Details on statistical tests performed in the paper.
'Supplementary Table 2. Differentially expressed gene analysis.xlsx' Results for the differential gene expression analysis for embryonic (E9.5; analysis with edgeR) and in vitro (ESCs, EBs, NPCs; analysis with DESeq2) samples.
'Supplementary Table 3. Gene Ontology (GO) term enrichment analysis.xlsx' Results for the GO term enrichment analysis for differentially expressed genes in embryonic (GO E9.5) and in vitro (GO ESC, GO EBs, GO NPCs) samples. Differentially expressed genes for in vitro samples were split into upregulated and downregulated genes (up/down) and the analysis was performed on each subset (e.g. GO ESC up / GO ESC down).
'Supplementary Table 4. Differentially expressed gene analysis for CFC samples.xlsx' Results for the differential gene expression analysis for samples from adult mice before (HC - Homecage) and 1h and 3h after contextual fear conditioning (1h and 3h, respectively). Each sheet shows the results for a different comparison. Sheets 1-3 show results for comparisons between timepoints for wild type (WT) samples only and sheets 4-6 for the same comparisons in mutant (Het) samples. Sheets 7-9 show results for comparisons between genotypes at each time point and sheet 10 contains the results for the analysis of differential expression trajectories between wild type and mutant.
'Supplementary Table 5. Cluster identification.xlsx' Results for k-means clustering of genes by expression. Sheet 1 shows clustering of only the genes with significantly different expression trajectories between genotypes. Sheet 2 shows clustering of all genes that are significantly differentially expressed in any of the comparisons (this also includes genes with the same trajectories).
'Supplementary Table 6. GO term cluster analysis.xlsx' Results for the GO term enrichment analysis and EWCE analysis for enrichment of cell type specific genes for each cluster identified by clustering genes with different expression trajectories (see Table S5, sheet 1).
'Supplementary Table 7. Setd5 mass spectrometry results.xlsx' Results showing proteins interacting with Setd5 as identified by mass spectrometry. Sheet 1 shows protein-protein interaction data generated from these results (combined with data from the STRING database). Sheet 2 shows the results of the statistical analysis with limma.
'Supplementary Table 8. PolII ChIP-seq analysis.xlsx' Results for the ChIP-seq analysis for binding of RNA polymerase II (PolII). Sheet 1 shows results for differential binding of PolII at the transcription start site (TSS) between genotypes and sheets 2+3 show the corresponding GO enrichment analysis for these differentially bound genes. Sheet 4 shows RNAseq counts for genes with increased binding of PolII at the TSS.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data provided here are part of a Galaxy Training Network tutorial that analyzes RNA-Seq data from a study published by Brooks et al. 2011 to identify genes and exons that are regulated by the Pasilla gene.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Alignment statistics of the RNA-Seq analysis.
Summary statistics of RNA-seq (quantification) library sequencing and mapping.
https://www.scilifelab.se/data/restricted-access/
Data Set Description
These data were collected from a total of 70 participants (47 adult; 23 pediatric), all of whom had relapsed or primary resistant acute myeloid leukemia. The data, which here are separated into an adult and a pediatric dataset, were generated as part of a study by Stratmann et al. (https://doi.org/10.1182/bloodadvances.2021004962). The Stratmann et al. study is currently pre-published here: https://ashpublications.org/bloodadvances/article/doi/10.1182/bloodadvances.2021004962/477210/Transcriptomic-analysis-reveals-pro-inflammatory Please note that separate applications are necessary for the adult and pediatric datasets. When applying for access, please indicate which dataset the application concerns. The adult dataset contains transcriptome sequencing (RNA-seq) data from 25 diagnosis (D), 45 relapse (R1/R2/R3) and five (5) primary resistant (PR) leukemic samples from 47 patients, as well as five (5) normal CD34+ bone marrow control samples. The pediatric dataset contains RNA-seq data from 18 diagnosis (D), 22 relapse (R1/R2), six (6) persistent relapse (R1/2-P) and one (1) primary resistant (PR) leukemic samples from 23 patients, as well as five (5) normal CD34+ bone marrow control samples. The leukemic samples originate from bone marrow or peripheral blood. The normal RNA samples originate from purified CD34+ bone marrow cells from five different healthy individuals. Further details regarding the samples are available in the Supplemental Information part of Stratmann et al. (https://doi.org/10.1182/bloodadvances.2021004962). RNA-seq library preparation and the associated next-generation sequencing were carried out by the SNP&SEQ Technology platform, SciLifeLab, National Genomics Infrastructure Uppsala, Sweden. Libraries were prepared using the TruSeq stranded total RNA library preparation kit with ribosomal depletion by RiboZero Gold (Illumina).
Sequencing of adult samples was carried out on the Illumina HiSeq2500 platform, generating paired-end 125bp reads using v4 sequencing chemistry. Sequencing of pediatric samples was carried out on the Illumina NovaSeq6000 platform (S2 flowcell), generating paired-end 100bp reads using the v1 sequencing chemistry. The CD34+ bone marrow control samples were sequenced using both platforms (Illumina HiSeq2500 and NovaSeq6000). Further, all of these acute myeloid leukemia samples have also been characterized by whole genome sequencing or whole exome sequencing, with the datasets available under controlled access through doi.org/10.17044/scilifelab.12292778.
Terms for access
The adult and pediatric datasets are only to be used for research that seeks to advance the understanding of the influence of genetic and transcriptomic factors on human acute myeloid leukemia etiology and biology. Use of the protected pediatric dataset is restricted to research projects that can only be conducted using pediatric acute myeloid leukemia data, and for which the research objectives cannot be accomplished using data from adults. Applications aimed at method development would therefore not be considered acceptable for use of the pediatric dataset. Further, the pediatric dataset may not be used for research investigating predisposition for acute myeloid leukemia based on germline variants.
For conditional access to the adult and/or pediatric dataset in this publication, please contact datacentre@scilifelab.se
RNA-seq (RNA sequencing) uses high-throughput sequencing (HTS) data to reveal the presence and quantity of RNA in a biological sample at a given moment in time. In the training available at http://galaxyproject.github.io/RNA-Seq/tutorials/ref_based, we introduce the bioinformatics methods to analyze RNA-seq data using a reference genome. The toy datasets were extracted from the study of Brooks et al. 2011.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Tables and data corresponding to the manuscript "The SpliZ generalizes “Percent Spliced In” to reveal regulated splicing at single-cell resolution" (https://www.biorxiv.org/content/10.1101/2020.11.10.377572v2.full.pdf)
spliz_scores.xlsx: Separate tables with the SpliZ and SpliZVD score for each cell and gene for both individuals 1 and 2 and both 10x and Smart-seq2 technologies. The cell, gene, cell type, SpliZ, and SpliZVD are given by the "cell", "geneR1A", "ontology", "scZ", and "svd_z0" columns, respectively. A: individual 1 10x; B: individual 2 10x; C: individual 1 Smart-seq2; D: individual 2 Smart-seq2.
leafcutter_calls.xlsx: To call significant splicing events for Leafcutter, the “p” column was required to be < 10e-10 and the “max_logef” value was required to be > 1.5. The “called” column is True if the splicing event was called by Leafcutter using these cutoffs and False otherwise. For the SpliZ, genes were called as differentially alternatively spliced if perm_pval_adj_scZ < 0.05. A: Leafcutter results for individual 1. B: Leafcutter results for individual 2. C: Leafcutter results for channel P3_2_S10 from individual 2. D: SpliZ results for channel P3_2_S10 from individual 2.
The HLCA*.pq files are the input data for the pipeline (both individuals, 10x and Smart-seq2 data) required to reproduce these results.
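The calling rules described for leafcutter_calls.xlsx can be written as two small predicates. This is a sketch of the stated cutoffs only, not code from the SpliZ pipeline: the function names and the example rows are hypothetical, while the column names ("p", "max_logef", "perm_pval_adj_scZ") and thresholds are taken from the description above. The cutoff is kept as the literal 10e-10 used in the text, which is numerically 1e-9.

```python
def leafcutter_called(p, max_logef, p_cutoff=10e-10, logef_cutoff=1.5):
    """Return True if a Leafcutter event passes both stated cutoffs."""
    return p < p_cutoff and max_logef > logef_cutoff

def spliz_called(perm_pval_adj_scZ, alpha=0.05):
    """Return True if a gene is called differentially alternatively spliced."""
    return perm_pval_adj_scZ < alpha

# Hypothetical rows, invented for illustration; they are not taken from the tables.
rows = [
    {"p": 1e-12, "max_logef": 2.0},  # passes both cutoffs
    {"p": 1e-12, "max_logef": 1.0},  # effect size too small
    {"p": 1e-8, "max_logef": 2.0},   # p-value not small enough
]
called = [leafcutter_called(r["p"], r["max_logef"]) for r in rows]
```

Both Leafcutter conditions must hold for an event to be called, so a very small p-value alone is not enough.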
The root apex is an important section of the plant root, involved in environmental sensing and cellular development. Analyzing the gene expression profile of the root apex in diverse environments is important but challenging, especially when samples are limited and precious, as in spaceflight. This study examined the feasibility of using tiny root sections for transcriptome analysis. To understand the gene expression profiles of the root apex, Arabidopsis thaliana Col-0 roots were sectioned into Zone-I (0.5 mm; root cap and meristematic zone) and Zone-II (1.5 mm; transition, elongation and growth-terminating zone). Gene expression was analyzed using microarrays and RNA-Seq. Both techniques identified 4180 common genes as differentially expressed (with > two-fold changes) between the zones. In addition, RNA-Seq identified 771 unique genes and 19 novel TARs as differentially expressed that were not detected by the arrays. Single root tip zones can be used for full transcriptome analysis; further, the root apex zones are functionally very distinct from each other. RNA-Seq provided novel information about the transcripts compared to the arrays. These data will help optimize transcriptome techniques for dealing with small, rare samples.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of small RNA sequencing data analysis.
Reference is regularly made to the power of new genomic sequencing approaches. Using powerful technology, however, is not the same as having the necessary power to address a research question with statistical robustness. In the rush to adopt new and improved genomic research methods, limitations of technology and experimental design may be initially neglected. Here, we review these issues with regard to RNA sequencing (RNA-seq). RNA-seq adds large-scale transcriptomics to the toolkit of ecological and evolutionary biologists, enabling differential gene expression (DE) studies in non-model species without the need for prior genomic resources. High biological variance is typical of field-based gene expression studies and means that larger sample sizes are often needed to achieve the same degree of statistical power as clinical studies based on data from cell lines or inbred animal models. Sequencing costs have plummeted, yet RNA-seq studies still underutilise biological replication. Finit...
Comprehensive introduction to the processing and analysis of bulk RNA-seq data including basic information about Illumina-based short read sequencing, common file formats (FASTQ, SAM/BAM, BED, ...) and quality controls. Contains ready-to-use UNIX and R code; covers the most common application of bulk RNA-seq to identify genes that are differentially expressed when comparing two conditions.
Pathway analysis (Ingenuity.com) of RNA-Seq data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all the Seurat objects that were used for generating all the figures in Pal et al. 2021 (https://doi.org/10.15252/embj.2020107333). All the Seurat objects were created under R v3.6.1 using the Seurat package v3.1.1. The detailed information of each object is listed in a table in Chen et al. 2021.
Summary of the RNA-sequencing data.