Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data repository for the scMappR manuscript:
Abstract from biorXiv (https://www.biorxiv.org/content/10.1101/2020.08.24.265298v1.full).
RNA sequencing (RNA-seq) is widely used to identify differentially expressed genes (DEGs) and reveal biological mechanisms underlying complex biological processes. RNA-seq is often performed on heterogeneous samples and the resulting DEGs do not necessarily indicate the cell types where the differential expression occurred. While single-cell RNA-seq (scRNA-seq) methods solve this problem, technical and cost constraints currently limit its widespread use. Here we present single cell Mapper (scMappR), a method that assigns cell-type specificity scores to DEGs obtained from bulk RNA-seq by integrating cell-type expression data generated by scRNA-seq and existing deconvolution methods. After benchmarking scMappR using RNA-seq data obtained from sorted blood cells, we asked if scMappR could reveal known cell-type specific changes that occur during kidney regeneration. We found that scMappR appropriately assigned DEGs to cell-types involved in kidney regeneration, including a relatively small proportion of immune cells. While scMappR can work with any user supplied scRNA-seq data, we curated scRNA-seq expression matrices for ∼100 human and mouse tissues to facilitate its use with bulk RNA-seq data alone. Overall, scMappR is a user-friendly R package that complements traditional differential expression analysis available at CRAN.
Table of Contents
1. Main Description
---------------------------
This is the Zenodo repository for the manuscript titled "A TCR β chain-directed antibody-fusion molecule that activates and expands subsets of T cells and promotes antitumor activity.". The code included in the file titled `marengo_code_for_paper_jan_2023.R` was used to generate the figures from the single-cell RNA sequencing data.
The following libraries are required for script execution:
File Descriptions
---------------------------
Linked Files
---------------------
This repository contains code for the analysis of single cell RNA-seq dataset. The dataset contains raw FASTQ files, as well as, the aligned files that were deposited in GEO. The "Rdata" or "Rds" file was deposited in Zenodo. Provided below are descriptions of the linked datasets:
Gene Expression Omnibus (GEO) ID: GSE223311(https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE223311)
Sequence read archive (SRA) repository ID: SRX19088718 and SRX19088719
Zenodo DOI: 10.5281/zenodo.7566113(https://zenodo.org/record/7566113#.ZCcmvC2cbrJ)
Installation and Instructions
--------------------------------------
The code included in this submission requires several essential packages, as listed above. Please follow these instructions for installation:
> Ensure you have R version 4.1.2 or higher for compatibility.
> Although it is not essential, you can use R-Studios (Version 2022.12.0+353 (2022.12.0+353)) for accessing and executing the code.
1. Download the *"Rdata" or ".Rds" file from Zenodo (https://zenodo.org/record/7566113#.ZCcmvC2cbrJ) (Zenodo DOI: 10.5281/zenodo.7566113).
2. Open R-Studios (https://www.rstudio.com/tags/rstudio-ide/) or a similar integrated development environment (IDE) for R.
3. Set your working directory to where the following files are located:
You can use the following code to set the working directory in R:
> setwd(directory)
4. Open the file titled "Install_Packages.R" and execute it in R IDE. This script will attempt to install all the necessary pacakges, and its dependencies in order to set up an environment where the code in "marengo_code_for_paper_jan_2023.R" can be executed.
5. Once the "Install_Packages.R" script has been successfully executed, re-start R-Studios or your IDE of choice.
6. Open the file "marengo_code_for_paper_jan_2023.R" file in R-studios or your IDE of choice.
7. Execute commands in the file titled "marengo_code_for_paper_jan_2023.R" in R-Studios or your IDE of choice to generate the plots.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This workflow adapts the approach and parameter settings of Trans-Omics for precision Medicine (TOPMed). The RNA-seq pipeline originated from the Broad Institute. There are in total five steps in the workflow starting from:
Read alignment using STAR which produces aligned BAM files including the Genome BAM and Transcriptome BAM.
The Genome BAM file is processed using Picard MarkDuplicates. producing an updated BAM file containing information on duplicate reads (such reads can indicate biased interpretation).
SAMtools index is then employed to generate an index for the BAM file, in preparation for the next step.
The indexed BAM file is processed further with RNA-SeQC which takes the BAM file, human genome reference sequence and Gene Transfer Format (GTF) file as inputs to generate transcriptome-level expression quantifications and standard quality control metrics.
In parallel with transcript quantification, isoform expression levels are quantified by RSEM. This step depends only on the output of the STAR tool, and additional RSEM reference sequences.
For testing and analysis, the workflow author provided example data created by down-sampling the read files of a TOPMed public access data. Chromosome 12 was extracted from the Homo Sapien Assembly 38 reference sequence and provided by the workflow authors. The required GTF and RSEM reference data files are also provided. The workflow is well-documented with a detailed set of instructions of the steps performed to down-sample the data are also provided for transparency. The availability of example input data, use of containerization for underlying software and detailed documentation are important factors in choosing this specific CWL workflow for CWLProv evaluation.
This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance, see https://w3id.org/cwl/prov/0.5.0 or use https://pypi.org/project/cwl
Steps to reproduce
To build the research object again, use Python 3 on macOS. Built with:
Processor 2.8GHz Intel Core i7
Memory: 16GB
OS: macOS High Sierra, Version 10.13.3
Storage: 250GB
Install cwltool
pip3 install cwltool==1.0.20180912090223
Install git lfs The data download with the git repository requires the installation of Git lfs: https://www.atlassian.com/git/tutorials/git-lfs#installing-git-lfs
Get the data and make the analysis environment ready:
git clone https://github.com/FarahZKhan/cwl_workflows.git cd cwl_workflows/ git checkout CWLProvTesting ./topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/download_examples.sh
Run the following commands to create the CWLProv Research Object:
cwltool --provenance rnaseqwf_0.6.0_linux --tmp-outdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp --tmpdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp topmed-workflows/TOPMed_RNAseq_pipeline/rnaseq_pipeline_fastq.cwl topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/Dockstore.json
zip -r rnaseqwf_0.5.0_mac.zip rnaseqwf_0.5.0_mac sha256sum rnaseqwf_0.5.0_mac.zip > rnaseqwf_0.5.0_mac_mac.zip.sha256
The https://github.com/FarahZKhan/cwl_workflows repository is a frozen snapshot from https://github.com/heliumdatacommons/TOPMed_RNAseq_CWL commit 027e8af41b906173aafdb791351fb29efc044120
Premise of the study: The root apex is an important region involved in environmental sensing, but comprises a very small part of the root. Obtaining root apex transcriptomes is therefore challenging when the samples are limited. The feasibility of using tiny root sections for transcriptome analysis was examined, comparing RNA sequencing (RNA-Seq) to microarrays in characterizing genes that are relevant to spaceflight.Methods:Arabidopsis thaliana Columbia ecotype (Col-0) roots were sectioned into Zone 1 (0.5 mm; root cap and meristematic zone) and Zone 2 (1.5 mm; transition, elongation, and growth-terminating zone). Differential gene expression in each was compared.Results: Both microarrays and RNA-Seq proved applicable to the small samples. A total of 4180 genes were differentially expressed (with fold changes of 2 or greater) between Zone 1 and Zone 2. In addition, 771 unique genes and 19 novel transcriptionally active regions were identified by RNA-Seq that were not detected in microarrays. However, microarrays detected spaceflight-relevant genes that were missed in RNA-Seq. Discussion: Single root tip subsections can be used for transcriptome analysis using either RNA-Seq or microarrays. Both RNA-Seq and microarrays provided novel information. These data suggest that techniques for dealing with small, rare samples from spaceflight can be further enhanced, and that RNA-Seq may miss some spaceflight-relevant changes in gene expression.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The following datasets were created for Project Cognoma:expression-matrix.tsv.bz2
is a sample × gene matrix indicating a gene's expression level for a given sample. This dataset will be the feature/x/predictor for Project Cognoma.mutation-matrix.tsv.bz2
is a sample × gene matrix indicating whether a gene is mutated for a given sample. Select columns (or unions of several columns) in this dataset will be the status/y/outcome for Project Cognoma.These are preliminary datasets for development use and machine learning. The data was retrieved from the UCSC Xena Browser.All original work in the data is released under CC0. However, the license of TCGA and Xena data is currently unclear.These two datasets were created by the GitHub repository commit below, although they were not tracked due to large file size. See the download directory of the cancer-data repository for metadata files with the version info for the Xena downloads this release is based on.See the data/subset directory of the cancer-data repository on GitHub to browse small subsets of these datasets.
https://ega-archive.org/dacs/EGAC00001003376https://ega-archive.org/dacs/EGAC00001003376
This dataset contains RNA sequencing (RNAseq) data of 814 patients from the CheckMate 649 clinical trial whose ICF allows data deposition into a public repository. Gene expression profiling was performed retrospectively using RNAseq on a subset of baseline tumor samples. Paired-end FASTQ files were processed on Seven Bridges platform (Seven Bridges Genomics).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this Zenodo repository, we share the data that is required to reproduce all the analyses from our publication "Differential detection workflows for multi-sample single-cell RNA-seq data".
This repository includes all* input data, intermediate results and final outputs that are represented in our manuscript. For a more elaborate description of the data, we refer to the companion GitHub. https://github.com/statOmics/DD_benchmarks for the benchmarks and https://github.com/statOmics/DD_cases for the case studies, respectively.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
H.sapien normalized counts RNA seq data matrix from NASA Genelab's open science data repository. Created using R.
Female C57BL/6CR mice were flown onboard STS-135 for 13 days and returned to Earth for analysis. Livers were collected within 3-4 hours of landing and snap frozen in liquid nitrogen. Liver tissue samples that were used for microarray analysis for GLDS-25 were provided to GeneLab. GeneLab extracted RNA, added ERCC control spike-in to the samples, and performed RNA-Seq analysis.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
D.melanogaster normalized counts RNA seq data matrix developed from NASA Genelab's open science data repository. Created using R.
Software tool as a repository of uniformly processed RNA-seq data mined from public data obtained from NCBI Short Read Archive . DEE2 consists of three parts: Webserver where end-users can search for and obtain data-sets of interest, Pipeline that can download and process SRA data as well as users own fastq files, Back-end that collects, filters and organises data provided by contributing worker nodes.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Analysis files related to figures 1-8
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We analysed the field of expression profiling by high throughput sequencing, or HT-seq, in terms of replicability and reproducibility, using data from the NCBI GEO (Gene Expression Omnibus) repository.
- This release includes GEO series up to Dec-31, 2020;
- Fixed xlrd missing optional dependency, which affected import of some xls files, previously we were using only openpyxl (thanks to anonymous reviewer);
- All files in supplementary _RAW.tar files were checked for p values, previously _RAW.tar files were completely omitted, alas (thanks to anonymous reviewer).
Archived dataset contains following files:
- output/parsed_suppfiles.csv, p-value histograms, histogram classes, estimated number of true null hypotheses (pi0).
- output/document_summaries.csv, document summaries of NCBI GEO series
- output/publications.csv, publication info of NCBI GEO series
- output/scopus_citedbycount.csv, Scopus citation info of NCBI GEO series
- output/single-cell.csv, single cell experiments
- spots.csv, NCBI SRA sequencing run metadata
- suppfilenames.txt, list of all supplementary file names of NCBI GEO submissions. One filename per row.
- suppfilenames_filtered.txt, list of supplementary file names used for downloading files from NCBI GEO. One filename per row.
We sought to determine whether the spaceflight environment can induce alterations in small extracellular vesicles (sEV) smallRNA content and their utility as biomarkers. Using small RNA sequencing (sRNAseq), we evaluated the impact of the spaceflight environment on sEV miRNA content in peripheral blood (PB) plasma of 14 astronauts, who flew STS missions between 1998-2001. Samples were collected at three-time points:10 days before the launch (L-10), the day of return (R-0), and three days post-landing (R+3).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RNA sequencing data showing the gene expression in p62 knockout vs. wild-type hearts.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
M.musculus unnormalized counts RNA seq data matrix from NASA Genelab's open science data repository. Created using R.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data repository contains coexpression networks from publicly-available RNA-Seq datasets (obtained from the recount2 database) that were generated using the best workflows identified in the benchmarking study: Johnson KA, Krishnan A (2020) Robust normalization and transformation techniques for constructing gene coexpression networks from RNA-seq data. bioRxiv 10.1101/2020.09.22.308577.
GTEx coexpression networks
There are 62 coexpression networks built from 31 GTEx datasets (each dataset corresponding to one GTEx tissue) reconstructed using two different network-building workflows: i) CTF_CLR: Counts adjusted using TMM Factors followed by CLR transformation of the Pearson correlation coefficients; ii) CTF: Counts adjusted using TMM Factors (without any further transformation).
SRA coexpression networks
There are 256 coexpression networks built from 256 SRA datasets. Each dataset corresponds to a set of samples generated as part of the same transcriptome experiment from the same tissue. These networks are reconstructed using the top-performing workflow: CTF, Counts adjusted using TMM Factors.
Refer to the preprint for more details on the workflows and the steps used for obtaining the original datasets.
Each of 70 cell samples either at the control condition or treated with FDA-approved cancer drugs is sequenced by the single-ended random-primed mRNA-sequencing method with a read length of 100 base pairs, and a total of 70 raw sequence data files in the FASTQ format are generated. These sequence data files are then analyzed by a high-performance computational pipeline and ranked lists of gene signatures and biological processes related to drug-induced cardiotoxicity are generated for each drug. The raw sequence datasets and the analysis results have been carefully controlled for data quality, and they are made publicly available at the Gene Expression Omnibus (GEO) database repository of NIH. As such, this broad drug-stimulated transcriptomi dataset is valuable for the prediction of drug toxicities and their mitigations.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data accompanying the manuscript describing MIX-Seq, a method for transcriptional profiling of mixtures of cancer cell lines treated with small molecule and genetic perturbations (McFarland and Paolella et al., Nat Commun, 2020). Data consists of single-cell RNA-sequencing (UMI count matrices), and associated drug sensitivity and genomic features of the cancer cell lines.See README file for more information on dataset contents.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository holds the additional raw data used to generate figures in the publication "Cell-type-specific co-expression inference from single cell RNA-sequencing data" (preprint version: https://www.biorxiv.org/content/10.1101/2022.12.13.520181v1).
Table of contents:
Figure_1B.rds:
raw data of Figure 1B
co-expression estimates of 500*499/2 gene pairs across 100 replicates for 7 methods under two settings of sequencing detph variations
Supplementary_Figure_1B.rds:
raw data of Supplementary Figure 1B
co-expression estimates of 500*499/2 gene pairs across 100 replicates for 7 methods under two settings of sequencing detph variations
Supplementary_Figure_2.rds:
raw data of Supplementary Figure 2
empirical power evaluated for 4999 gene pairs and 6 methods
Figure_3B.rds:
raw data of Figure 3B
co-expression estimates of a network of 500 genes for 9 methods across 100 replicates
Additional_Raw_Data.xlsx
raw data of Figure 3A: (geometric mean expression levels, co-expression estimates) for 4999 gene pairs and 11 methods
raw data of Figure 3C: running times for 11 methods
raw data of Supplementary Figure 3: (geometric mean expression levels, co-expression estimates) for 4999 gene pairs and 11 methods under two settings of sequencing detph variations
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data repository for the scMappR manuscript:
Abstract from biorXiv (https://www.biorxiv.org/content/10.1101/2020.08.24.265298v1.full).
RNA sequencing (RNA-seq) is widely used to identify differentially expressed genes (DEGs) and reveal biological mechanisms underlying complex biological processes. RNA-seq is often performed on heterogeneous samples and the resulting DEGs do not necessarily indicate the cell types where the differential expression occurred. While single-cell RNA-seq (scRNA-seq) methods solve this problem, technical and cost constraints currently limit its widespread use. Here we present single cell Mapper (scMappR), a method that assigns cell-type specificity scores to DEGs obtained from bulk RNA-seq by integrating cell-type expression data generated by scRNA-seq and existing deconvolution methods. After benchmarking scMappR using RNA-seq data obtained from sorted blood cells, we asked if scMappR could reveal known cell-type specific changes that occur during kidney regeneration. We found that scMappR appropriately assigned DEGs to cell-types involved in kidney regeneration, including a relatively small proportion of immune cells. While scMappR can work with any user supplied scRNA-seq data, we curated scRNA-seq expression matrices for ∼100 human and mouse tissues to facilitate its use with bulk RNA-seq data alone. Overall, scMappR is a user-friendly R package that complements traditional differential expression analysis available at CRAN.