Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data normalization is vital to single-cell sequencing, addressing limitations presented by low input material and various forms of bias or noise present in the sequencing process. Several such normalization methods exist, some of which rely on spike-in genes, molecules added in known quantities to serve as a basis for a normalization model. Depending on available information and the type of data, some methods may express certain advantages over others. We compare the effectiveness of seven available normalization methods designed specifically for single-cell sequencing using two real data sets containing spike-in genes and one simulation study. Additionally, we test those methods not dependent on spike-in genes using a real data set with three distinct cell-cycle states and a real data set under the 10X Genomics GemCode platform with multiple cell types represented. We demonstrate the differences in effectiveness for the featured methods using visualization and classification assessment and conclude which methods are preferable for normalizing a certain type of data for further downstream analysis, such as classification or differential analysis. The comparison in computational time for all methods is addressed as well.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data normalization is vital to single-cell sequencing, addressing limitations presented by low input material and various forms of bias or noise present in the sequencing process. Several such normalization methods exist, some of which rely on spike-in genes, molecules added in known quantities to serve as a basis for a normalization model. Depending on available information and the type of data, some methods may express certain advantages over others. We compare the effectiveness of seven available normalization methods designed specifically for single-cell sequencing using two real data sets containing spike-in genes and one simulation study. Additionally, we test those methods not dependent on spike-in genes using a real data set with three distinct cell-cycle states and a real data set under the 10X Genomics GemCode platform with multiple cell types represented. We demonstrate the differences in effectiveness for the featured methods using visualization and classification assessment and conclude which methods are preferable for normalizing a certain type of data for further downstream analysis, such as classification or differential analysis. The comparison in computational time for all methods is addressed as well.
Normalization of RNA-sequencing data is essential for accurate downstream inference, but the assumptions upon which most methods are based do not hold in the single-cell setting. Consequently, applying existing normalization methods to single-cell RNA-seq data introduces artifacts that bias downstream analyses. To address this, we introduce SCnorm for accurate and efficient normalization of scRNA-seq data. Total 183 single cells (92 H1 cells, 91 H9 cells), sequenced twice, were used to evaluate SCnorm in normalizing single cell RNA-seq experiments. Total 48 bulk H1 samples were used to compare bulk and single cell properties. For single-cell RNA-seq, the identical single-cell indexed and fragmented cDNA were pooled at 96 cells per lane or at 24 cells per lane to test the effects of sequencing depth, resulting in approximately 1 million and 4 million mapped reads per cell in the two pooling groups, respectively.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data normalization is vital to single-cell sequencing, addressing limitations presented by low input material and various forms of bias or noise present in the sequencing process. Several such normalization methods exist, some of which rely on spike-in genes, molecules added in known quantities to serve as a basis for a normalization model. Depending on available information and the type of data, some methods may express certain advantages over others. We compare the effectiveness of seven available normalization methods designed specifically for single-cell sequencing using two real data sets containing spike-in genes and one simulation study. Additionally, we test those methods not dependent on spike-in genes using a real data set with three distinct cell-cycle states and a real data set under the 10X Genomics GemCode platform with multiple cell types represented. We demonstrate the differences in effectiveness for the featured methods using visualization and classification assessment and conclude which methods are preferable for normalizing a certain type of data for further downstream analysis, such as classification or differential analysis. The comparison in computational time for all methods is addressed as well.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Single cell RNA-sequencing dataset of peripheral blood mononuclear cells (pbmc: T, B, NK and monocytes) extracted from two healthy donors.
Cells labeled as C26 come from a 30 years old female and cells labeled as C27 come from a 53 years old male. Cells have been isolated from blood using ficoll. Samples were sequenced using standard 3' v3 chemistry protocols by 10x genomics. Cellranger v4.0.0 was used for the processing, and reads were aligned to the ensembl GRCg38 human genome (GRCg38_r98-ensembl_Sept2019). QC metrics were calculated on the count matrix generated by cellranger (filtered_feature_bc_matrix). Cells with less than 3 genes per cells, less than 500 reads per cell and more than 20% of mithocondrial genes were discarded.
The processing steps was performed with the R package Seurat (https://satijalab.org/seurat/), including sample integration, data normalisation and scaling, dimensional reduction, and clustering. SCTransform method was adopted for the normalisation and scaling steps. The clustered cells were manually annotated using known cell type markers.
Files content:
- raw_dataset.csv: raw gene counts
- normalized_dataset.csv: normalized gene counts (single cell matrix)
- cell_types.csv: cell types identified from annotated cell clusters
- cell_types_macro.csv: cell macro types
- UMAP_coordinates.csv: 2d cell coordinates computed with UMAP algorithm in Seurat
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data normalization is a crucial step in the gene expression analysis as it ensures the validity of its downstream analyses. Although many metrics have been designed to evaluate the existing normalization methods, different metrics or different datasets by the same metric yield inconsistent results, particularly for the single-cell RNA sequencing (scRNA-seq) data. The worst situations could be that one method evaluated as the best by one metric is evaluated as the poorest by another metric, or one method evaluated as the best using one dataset is evaluated as the poorest using another dataset. Here raises an open question: principles need to be established to guide the evaluation of normalization methods. In this study, we propose a principle that one normalization method evaluated as the best by one metric should also be evaluated as the best by another metric (the consistency of metrics) and one method evaluated as the best using scRNA-seq data should also be evaluated as the best using bulk RNA-seq data or microarray data (the consistency of datasets). Then, we designed a new metric named Area Under normalized CV threshold Curve (AUCVC) and applied it with another metric mSCC to evaluate 14 commonly used normalization methods using both scRNA-seq data and bulk RNA-seq data, satisfying the consistency of metrics and the consistency of datasets. Our findings paved the way to guide future studies in the normalization of gene expression data with its evaluation. The raw gene expression data, normalization methods, and evaluation metrics used in this study have been included in an R package named NormExpression. NormExpression provides a framework and a fast and simple way for researchers to select the best method for the normalization of their gene expression data based on the evaluation of different methods (particularly some data-driven methods or their own methods) in the principle of the consistency of metrics and the consistency of datasets.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Single cell RNA-seq data, like data from other sequencing technology, contain systematic technical noise. Such noise results from a combined effect of unequal efficiencies in the capturing and counting of mRNA molecules, such as extraction/amplification efficiency and sequencing depth. We show that such technical effects are not only cell-specific, but also affect genes differently, thus a simple cell-wise size factor adjustment may not be sufficient. We present a non-linear normalization approach that provides a cell- and gene-specific normalization factor for each gene in each cell. We show that the proposed normalization method (implemented in “SC2P" package) reduces more technical variation than competing methods, without reducing biological variation. When technical effects such as sequencing depths are not balanced between cell populations, SC2P normalization also removes the bias due to uneven technical noise. This method is applicable to scRNA-seq experiments that do not use unique molecular identifier (UMI) thus retain amplification biases.
Background & Aims: Pancreatic ductal adenocarcinomas (PDAC) are characterized by fibrosis and an abundance of cancer-associated fibroblasts (CAFs). We investigated strategies to disrupt interactions among CAFs, the immune system, and cancer cells, focusing on adhesion molecule cadherin 11 (CDH11), which has been associated with other fibrotic disorders and is expressed by activated fibroblasts. Methods: We compared levels of CDH11mRNA in human pancreatitis and pancreatic cancer tissues and cells, compared with normal pancreas, and measured levels of CDH11 protein in human and mouse pancreatic lesions and normal tissues. We crossed p48-Cre;LSL-KrasG12D/+;LSL-Trp53R172H/+(KPC) mice with CDH11-knockout mice and measured survival times of offspring. Pancreata were collected and analyzed by histology, immunohistochemistry, and (single-cell) RNA sequencing; RNA and proteins were identified by imaging mass cytometry. Some mice were given injections of PD1 antibody or gemcitabine and survival was monitored. Pancreatic cancer cells from KPC mice were subcutaneously injected into Cdh11+/+ and Cdh11–/– mice and tumor growth was monitored. Pancreatic cancer cells (mT3) from KPC mice (C57BL/6), were subcutaneously injected into Cdh11+/+ (C57BL/6J) mice and mice were given injections of antibody against CDH11, gemcitabine, or small molecule inhibitor of CDH11 (SD133) and tumor growth was monitored. Results: Levels of CDH11mRNA and protein were significantly higher in CAFs than in pancreatic cancer epithelial cells, human or mouse pancreatic cancer cell lines, or immune cells. KPC/Cdh11+/– and KPC/Cdh11–/– mice survived significantly longer than KPC/Cdh11+/+ mice. Markers of stromal activation entirely surrounded pancreatic intraepithelial neoplasias in KPC/Cdh11+/+ mice and incompletely in KPC/Cdh11+/– and KPC/Cdh11–/– mice, whose lesions also contained fewer FOXP3+cells in the tumor center. Compared with pancreatic tumors inKPC/Cdh11+/+ mice, tumors of KPC/Cdh11+/– mice had increased markers of antigen processing and presentation; more lymphocytes and associated cytokines; decreased extracellular matrix components; and reductions in markers and cytokines associated with immunosuppression. Administration of the PD1 antibody did not prolong survival of KPC mice with 0, 1, or 2 alleles of Cdh11. Gemcitabine extended survival only of KPC/Cdh11+/– and KPC/Cdh11–/– mice or reduced subcutaneous tumor growth in mT3 engrafted Cdh11+/+ mice given in combination with the CDH11 antibody. A small molecule inhibitor of CDH11 reduced growth of pre-established mT3 subcutaneous tumors only if T and B cells were present in mice. Conclusions: Knockout or inhibition of CDH11, which is expressed by CAFs in the pancreatic tumor stroma, reduces growth of pancreatic tumors, increases their response to gemcitabine, and significantly extends survival of mice. CDH11 promotes immunosuppression and extracellular matrix deposition, and might be developed as a therapeutic target for pancreatic cancer mT3 tumor was generated by injecting 25,000 mT3 cells (derived from a PDAC of a KPC C57BL/6 mouse) subcutaneously into the back flank of 10-week-old female C57BL/6 mice in a 1:1 suspension of Matrigel (Cat# 354234, Corning) and PBS. At 3 weeks post injection, the tumor was dissected and processed as described before to obtain single cell suspensions. Subsequently, immune cells and blood cells were removed by CD45+ magnetic bead-based depletion (Cat# 130-052-301, Miltenyi Biotech) and ACK lysis buffer (Cat# A1049201, Gibco), respectively, following manufacturer’s guidelines. Remaining cells were prepared for single cell sequencing using Chromium Single Cell 3ʹ GEM, Library & Gel Bead Kit v3 (Cat# 1000075, 10X Genomics) on a 10X Genomics Chromium Controller following manufacturers protocol and sequenced using Illumina NextSeq 500 sequencer. The Cell Ranger Single-Cell Software Suite (10X Genomics) was used to perform sample demultiplexing, barcode processing, and single-cell 3′ gene counting. Sequencing data was aligned to the mouse reference genome (mm10) using “cellranger mkfastq” with default parameters. Unique molecular identifier (UMI) counts were generated using “cellranger count”. Further analysis was performed in R using the Seurat package. Briefly, cells with fewer than 500 detected genes per cell and genes that were expressed by fewer than 5 cells were filtered out. Subsequently, cells with >7800 genes were filtered out to remove noise from droplets containing more than one cell. Dead cells were excluded by retaining cells with <5% mitochondrial reads. The data was subsequently normalized by employing a global-scaling normalization method “LogNormalize” followed by identification of 2,000 most variable genes in the dataset, data scaling and subsequently dimensionality reduction by principal component analysis (PCA) using the 2000 variable genes. Then, a gra...
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
We performed CODEX (co-detection by indexing) multiplexed imaging on 64 sections of the human intestine (~16 mm2) from 8 donors (B004, B005, B006, B008, B009, B010, B011, and B012) using a panel of 57 oligonucleotide-barcoded antibodies. Subsequently, images underwent standard CODEX image processing (tile stitching, drift compensation, cycle concatenation, background subtraction, deconvolution, and determination of best focal plane), single cell segmentation, and column marker z-normalization by tissue. The outputs of this process were data frames of 2.6 million cells with 57 antibody fluorescence values quantified from each marker. Each cell has its cell type, cellular neighborhood, community of neighborhooods, and tissue unit defined with x, y coordinates representing pixel location in the original image. This is from a total of 25 cell types, 20 multicellular neighborhoods, 10 communities of neighborhoods, and 3 tissue segments that could be used to understand the cellular interactions, composition, and structure of the human intestine from the duodenum to the sigmoid colon and understand differences between different areas of the intestine. This data could be used as a healthy baseline to compare other single-cell datasets of the human intestine, particularly multiplexed imaging ones. The overall structure of the datasets is individual cells segmented out in each row. Columns MUC2 through CD161 are the markers used for clustering the cell types. These are the columns that are the values of the antibody staining the target protein within the tissue quantified at the single-cell level. This value is the per cell/area averaged fluorescent intensity that has subsequently been z normalized along each column as described above. OLFM4 through MUC6 were captured in the quantification but not used within the clustering of cell types. Other columns are explained in the table in the Usage Notes section below. Along with this main data table, there is also a donor metadata table that links the donor ids to clinical metadata such as: age, sex, race, BMI, history of diabetes, history of cancer, history of hypertension, and history of gastorintestinal disease. The raw imaging data can be found at (https://portal.hubmapconsortium.org/). We have created a landing page with links to all the raw dataset IDs and the HuBMAP ID for this Collection is HBM692.JRZB.356 and the DOI is:10.35079/HBM692.JRZB.356. This can be used to also pair it with the matched snRNAseq and snATACseq for each section of tissue. Methods For a detailed description of each of the steps of protocols and processes to obtain this data see the detailed materials and methods in the associated manuscript. Briefly, intestine pieces from 8 different sites across the small intestine and colon were taken and frozen in OCT. These were assembled into an array of 4 tissues, cut into 7 um slices, and stained with a panel of 54 CODEX DNA-oligonucleotide barcoded antibodies. Tissues were imaged with a Keyence microscope at 20x objective and then processed using image stitching, drift compensation, deconvolution, and cycle concatenation. Processed data were then segmented using CellVisionSegmenter, a neural network R-CNN-based single-cell segmentation algorithm. Cell type analysis was completed on B004, 5, and 6 by z normalization of protein markers used for clustering and then overclustered using leiden-based clustering. The cell type labels were verified by looking back at the original image. Cell type labels were transferred to other donors using STELLAR framework for annotating spatially resolved single-cell data as described in detail in the companion Nature Methods manuscript. With set cell type labels we performed neighborhood analysis by clustering windows of the 10 nearest neighbors around a given cell and were named based on cell type enrichment and location in the tissue. Similarly, communities of neighborhoods were determined by taking the 100 nearest neighbors with the neighborhood labels and clustered. Finally, tissue segments were determined through multiple rounds of clustering the 300 nearest neighbors of the community labels of each cell. Broad categories for cell types, neighborhoods, and communities were expert annotated based on epithelial, immune, or other stromal compartments.
We performed CODEX (co-detection by indexing) multiplexed imaging on four sections of the human colon (ascending, transverse, descending, and sigmoid) using a panel of 47 oligonucleotide-barcoded antibodies. Subsequently images underwent standard CODEX image processing (tile stitching, drift compensation, cycle concatenation, background subtraction, deconvolution, and determination of best focal plane), and single cell segmentation. Output of this process was a dataframe of nearly 130,000 cells with fluorescence values quantified from each marker. We used this dataframe as input to 1 of the 5 normalization techniques of which we compared z, double-log(z), min/max, and arcsinh normalizations to the original unmodified dataset. We used these normalized dataframes as inputs for 4 unsupervised clustering algorithms: k-means, leiden, X-shift euclidian, and X-shift angular.
From the clustering outputs, we then labeled the clusters that resulted for cells observed in the data producing 20 u...
(1) qPCR Gene Expression Data The THP-1 cell line was sub-cloned and one clone (#5) was selected for its ability to differentiate relatively homogeneously in response to phorbol 12-myristate-13-acetate (PMA) (Sigma). THP-1.5 was used for all subsequent experiments. THP-1.5 cells were cultured in RPMI, 10% FBS, Penicillin/Streptomycin, 10mM HEPES, 1mM Sodium Pyruvate, 50uM 2-Mercaptoethanol. THP-1.5 were treated with 30ng/ml PMA over a time-course of 96h. Total cell lysates were harvested in TRIzol reagent at 1, 2, 4, 6, 12, 24, 48, 72, 96 hours, including an undifferentiated control. Undifferentiated cells were harvested in TRIzol reagent at the beginning of the LPS time-course. One biological replicate was prepared for each time point. Total RNA was purified from TRIzol lysates according to manufacturer’s instructions. Genespecific primer pairs were designed using Primer3 software, with an optimal primer size of 20 bases, amplification size of 140bp, and annealing temperature of 60°C. Primer sequences were designed for 2,396 candidate genes including four potential controls: GAPDH, beta actin (ACTB), beta-2-microglobulin (B2M), phosphoglycerate kinase 1 (PGK1). The RNA samples were reverse transcribed to produce cDNA and then subjected to quantitative PCR using SYBR Green (Molecular Probes) using the ABI Prism 7900HT system (Applied Biosystems, Foster City, CA, USA) with a 384-well amplification plate; genes for each sample were assayed in triplicate. Reactions were carried out in 20μL volumes in 384-well plates; each reaction contained: 0.5 U of HotStar Taq DNA polymerase (Qiagen) and the manufacturer’s 1× amplification buffer adjusted to a final concentration of 1mM MgCl2, 160μM dNTPs, 1/38000 SYBR Green I (Molecular Probes), 7% DMSO, 0.4% ROX Reference Dye (Invitrogen), 300 nM of each primer (forward and reverse), and 2μL of 40-fold diluted first-strand cDNA synthesis reaction mixture (12.5ng total RNA equivalent). Polymerase activation at 95ºC for 15 min was followed by 40 cycles of 15 s at 94ºC, 30 s at 60ºC, and 30 s at 72ºC. The dissociation curve analysis, which evaluates each PCR product to be amplified from single cDNA, was carried out in accordance with the manufacturer’s protocol. Expression levels were reported as Ct values. The large number of genes assayed and the replicates measures required that samples be distributed across multiple amplification plates, with an average of twelve plates per sample. Because it was envisioned that GAPDH would serve as a single-gene normalization control, this gene was included on each plate. All primer pairs were replicated in triplicates. Raw qPCR expression measures were quantified using Applied Biosystems SDS software and reported as Ct values. The Ct value represents the number of cycles or rounds of amplification required for the fluorescence of a gene or primer pair to surpass an arbitrary threshold. The magnitude of the Ct value is inversely proportional to the expression level so that a gene expressed at a high level will have a low Ct value and vice versa. Replicate Ct values were combined by averaging, with additional quality control constraints imposed by a standard filtering method developed by the RIKEN group for the preprocessing of their qPCR data. Briefly this method entails: 1. Sort the triplicate Ct values in ascending order: Ct1, Ct2, Ct3. Calculate differences between consecutive Ct values: difference1 = Ct2 – Ct1 and difference2 = Ct3 – Ct2. 2. Four regions are defined (where Region4 overrides the other regions): Region1: difference ≦ 0.2, Region2: 0.2 < difference ≦ 1.0, Region3: 1.0 < difference, Region4: one of the Ct values in the difference calculation is 40 If difference1 and difference2 fall in the same region, then the three replicate Ct values are averaged to give a final representative measure. If difference1 and difference2 are in different regions, then the two replicate Ct values that are in the small number region are averaged instead. This particular filtering method is specific to the data set we used here and does not represent a part of the normalization procedure itself; Alternate methods of filtering can be applied if appropriate prior to normalization. Moreover while the presentation in this manuscript has used Ct values as an example, any measure of transcript abundance, including those corrected for primer efficiency can be used as input to our data-driven methods. (2) Quantile Normalization Algorithm Quantile normalization proceeds in two stages. First, if samples are distributed across multiple plates, normalization is applied to all of the genes assayed for each sample to remove plate-to-plate effects by enforcing the same quantile distribution on each plate. Then, an overall quantile normalization is applied between samples, assuring that each sample has the same distribution of expression values as all of the other samples to be compared. A similar approach using quantile ormalization has been previously described in the context of microarray normalization. Briefly, our method entails the following steps: i) qPCR data from a single RNA sample are stored in a matrix M of dimension k (maximum number of genes or primer pairs on a plate) rows by p (number of plates) columns. Plates with differing numbers of genes are made equivalent by padded plates with missing values to constrain M to a rectangular structure. ii) Each column is sorted into ascending order and stored in matrix M’. The sorted columns correspond to the quantile distribution of each plate. The missing values are placed at the end of each ordered column. All calculations in quantile normalization are performed on non-missing values. iii) The average quantile distribution is calculated by taking the average of each row in M’. Each column in M’ is replaced by this average quantile distribution and rearranged to have the same ordering as the original row order in M. This gives the within-sample normalized data from one RNA sample. iv) Steps analogous to 1 – 3 are repeated for each sample. Between-sample normalization is performed by storing the within-normalized data as a new matrix N of dimension k (total number of genes, in our example k = 2,396) rows by n (number of samples) columns. Steps 2 and 3 are then applied to this matrix. (3) Rank-Invariant Set Normalization Algorithm We describe an extension of this method for use on qPCR data with any number of experimental conditions or samples in which we identify a set of stably-expressed genes from within the measured expression data and then use these to adjust expression between samples. Briefly, i) qPCR data from all samples are stored in matrix R of dimension g (total number of genes or primer pairs used for all plates) rows by s (total number of samples). ii) We first select gene sets that are rank-invariant across a single sample compared to a common reference. The reference may be chosen in a variety of ways, depending on the experimental design and aims of the experiment. As described in Tseng et al., the reference may be designated as a particular sample from the experiment (e.g. time zero in a time course experiment), the average or median of all samples, or selecting the sample which is closest to the average or median of all samples. Genes are considered to be rank-invariant if they retain their ordering or rank with respect to expression across the experimental sample versus the common reference sample. We collect sets of rank-invariant genes for all of the s pairwise comparisons, relative to a common reference. We take the intersection of all s sets to obtain the final set of rank-invariant genes that is used for normalization. iii) Let αj represent the average expression value of the rank-invariant genes in sample j. (α1, …, αs) then represents the vector of rank-invariant average expression values for all conditions 1 to s iv) We calculate the scale f The THP-1 cell line was sub-cloned and one clone (#5) was selected for its ability to differentiate relatively homogeneously in response to phorbol 12-myristate-13-acetate (PMA) (Sigma). THP-1.5 was used for all subsequent experiments. THP-1.5 cells were cultured in RPMI, 10% FBS, Penicillin/Streptomycin, 10mM HEPES, 1mM Sodium Pyruvate, 50uM 2-Mercaptoethanol. THP-1.5 were treated with 30ng/ml PMA over a time-course of 96h. Total cell lysates were harvested in TRIzol reagent at 1, 2, 4, 6, 12, 24, 48, 72, 96 hours, including an undifferentiated control. Total RNA was purifed from TRIzol lysates according to manufacturer’s instructions. The RNA samples were reverse transcribed to produce cDNA and then subjected to quantitative PCR using SYBR Green (Molecular Probes) using the ABI Prism 7900HT system (Applied Biosystems, Foster City, CA,USA) with a 384-well amplification plate; genes for each sample were assayed in triplicate.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Single-cell transcriptomics has unveiled a vast landscape of cellular heterogeneity in which the cell cycle is a significant component. We trained a high-resolution cell cycle classifier (ccAFv2) using single cell RNA-seq (scRNA-seq) characterized human neural stem cells. The features of this classifier are that it classifies six cell cycle states (G1, Late G1, S, S/G2, G2/M, and M/Early G1) and a quiescent-like G0 state, and it incorporates a tunable parameter to filter out less certain classifications. The ccAFv2 classifier performed better than or equivalent to other state-of-the-art methods even while classifying more cell cycle states, including G0. We showcased the versatility of ccAFv2 by successfully applying it to classify cells, nuclei, and spatial transcriptomics data in humans and mice, using various normalization methods and gene identifiers. We provide methods to regress the cell cycle expression patterns out of single cell or nuclei data to uncover underlying biological signals. The classifier can be used either as an R package integrated with Seurat (https://github.com/plaisier-lab/ccafv2_R) or a PyPI package integrated with scanpy (https://pypi.org/project/ccAF/). We proved that ccAFv2 has enhanced accuracy, flexibility, and adaptability across various experimental conditions, establishing ccAFv2 as a powerful tool for dissecting complex biological systems, unraveling cellular heterogeneity, and deciphering the molecular mechanisms by which proliferation and quiescence affect cellular processes.
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
This collection of datasets comprises results from four single-cell spatial experiments conducted on mouse brains: two spatial transcriptomics experiments and two spatial proteomics experiments. These experiments were performed using the Bruker Nanostring CosMx technology on 10µm coronal brain sections from the following mouse models: (1) 14-month-old male 5xFAD;ApoeCh mice and genotype controls, and (2) 9-month-old PS19;ApoeCh mice and genotype controls. Each dataset is provided as an RDS file which includes raw and corrected counts for the RNA data and mean fluorescent intensity for the protein data, along with comprehensive metadata. Metadata includes mouse genotype, sample ID, cell type annotations, sex (for PS19;ApoeCh dataset), and X-Y coordinates of each cell. Results from differential gene expression analysis for each cell type between genotypes using MAST are also included as .csv files. Methods Sample preparation: Isopentane fresh-frozen brain hemispheres were embedded in optimal cutting temperature (OCT) compound (Tissue-Tek, Sakura Fintek, Torrance, CA), and 10µm thick coronal sections were prepared using a cryostat (CM1950, LeicaBiosystems, Deer Park, IL). Six hemibrains were mounted onto each VWR Superfrost Plus microscope slide (Avantor, 48311-703) and kept at -80°C until fixation. For both 5xFAD (14 months old, males) and PS19 (9 months old, females and 1 male ApoeCh) models, n=3 mice per genotype except for n=2 for PS19;ApoeCh (wild-type, ApoeCh HO, 5xFAD HEMI or PS19 HEMI, and 5xFAD HEMI; ApoeCh HO or PS19 HEMI;ApoeCh HO) were used for transcriptomics and proteomics. The same mice were used for both transcriptomics and proteomics. Tissues were processed according to the Nanostring CosMx fresh-frozen slide preparation manual for RNA and protein assays (NanoString University). Data processing: Spatial transcriptomics datasets were filtered using the AtoMx RNA Quality Control module to flag outlier negative probes (control probes targeting non-existent sequences to quantify non-specific hybridization), lowly-expressing cells, FOVs, and target genes. Datasets were then normalized and scaled using Seurat 5.0.1 SCTransform to account for differences in library size across cell types [31]. Principal component analysis (PCA) and uniform manifold approximation and projection (UMAP) analysis were performed to reduce dimensionality and visualize clusters in space. Unsupervised clustering at 1.0 resolution yielded 33 clusters for the 5xFAD dataset and 40 clusters for the PS19 dataset. Clusters were manually annotated based on gene expression and spatial location. Spatial proteomics data were filtered using the AtoMx Protein Quality Control module to flag unreliable cells based on segmented cell area, negative probe expression, and overly high/low protein expression. Mean fluorescence intensity data were hyperbolic arcsine transformed with the AtoMx Protein Normalization module. Cell types were automatically annotated based on marker gene expression using the CELESTA algorithm.
A Transcriptome Database for Astrocytes, Neurons, and Oligodendrocytes: A New Resource for Understanding Brain Development and Function Understanding the cell-cell interactions that control CNS development and function has long been limited by the lack of methods to cleanly separate astrocytes, neurons, and oligodendrocytes. Here we describe the first method for the isolation and purification of developing and mature astrocytes from mouse forebrain. This method takes advantage of the expression of S100β by astrocytes. We used fluorescent activated cell sorting (FACS) to isolate EGFP positive cells from transgenic mice that express EGFP under the control of an S100β promoter. By depletion of astrocytes and oligodendrocytes we obtained purified populations of neurons, while by panning with oligodendrocyte-specific antibodies we obtained purified populations of oligodendrocytes. Using GeneChip Arrays we then created a transcriptome database of the expression levels of over 20,000 genes by gene profiling these three main CNS neural cell types at postnatal ages day 1 to 30. This database provides the first global characterization of the genes expressed by mammalian astrocytes in vivo and is the first direct comparison between the astrocyte, neuron, and oligodendrocyte transcriptomes. We demonstrate that Aldh1L1, a highly expressed astrocyte gene, is a highly specific antigenic marker for astrocytes with a substantially broader, and therefore potentially more useful, pattern of astrocyte expression than the traditional astrocyte marker GFAP. This transcriptome database of acutely isolated and highly pure populations of astrocytes, neurons and oligodendrocytes provides a resource to the neuroscience community by providing improved cell type specific markers and for better understanding of neural development, function, and disease. We acutely purified mouse astrocytes from early postnatal ages (P1) to later postnatal ages (P30), when astrocyte differentiation is morphologically complete (Bushong et al., 2004), and acutely purified mouse OL-lineage cells from stages ranging from OPCs to newly differentiated OLs to myelinating OLs. We extracted RNA from each of these highly purified, acutely isolated cell types and used GeneChip Arrays to determine the expression levels of over 20,000 genes and construct a comprehensive database of cell type specific gene expression in the mouse forebrain. Analysis of this database confirms cell type specific expression of many well characterized and functionally important genes. In addition, we have identified thousands of new cell type enriched genes, thereby providing important new information about astrocyte, OL, and neuron interactions, metabolism, development, and function. This database provides a comparison of the genome-wide transcriptional profiles of the main CNS cell types and is a resource to the neuroscience community for better understanding the development, physiology, and pathology of the CNS. Keywords: Developmental CNS Cell type comparision FACS purification of astrocytes: Dissociated forebrains from S100β-EGFP mice were resuspended in panning buffer (DBPS containing 0.02% BSA and 12.5 U/ml DNase) and sequentially incubated on the following panning plates: secondary antibody only plate to deplete microglia, O4 plate to deplete OLs, PDGFRα plate to deplete OPCs, and a second O4 plate to deplete any remaining OLs. This procedure was sufficient to deplete all OL-lineage cells from animals P8 and younger, however, in older animals that had begun to myelinate, additional depletion of OLs and myelin debris was accomplished as follows. The nonadherent cells from the last O4 dish were harvested by centrifugation, and the cells were resuspended in panning buffer containing GalC, MOG, and O1 supernatant and incubated for 15 minutes at room temperature. The cell suspension was washed and then resuspended in panning buffer containing 20 μg donkey anti-mouse APC for 15 minutes. The cells were washed and resuspended in panning buffer containing propidium iodide (PI). EGFP+ astrocytes were then purified by fluorescence activated cell sorting (FACS). Dead cells were gated out using high PI staining and forward light scatter. Astrocytes were identified based on high EGFP fluorescence and negative APC fluorescence from indirect immunostaining for OL markers GalC, MOG, and O1. Cells were sorted twice and routinely yielded >99.5% purity based on reanalysis of double sorted cells.; FACS purification of neurons: EGFP- cells were the remaining forebrain cells after microglia, OLs, and astrocytes had been removed, and were primarily composed of neurons, and to a lesser extent, endothelial cells (we estimate < 4% endothelial cells at P7 and < 20% endothelial cells at P17). EGFP- cells from S100β-EGFP dissociated forebrain were FACS purified in parallel with astrocyte purification and were sorted based on their negative EGFP fluorescence immunofluorescence. Cells were sorted twice and routinely yielded >99.9% purity. In independent preparations, the EGFP- cell population was additionally depleted of endothelial cells and pericytes by sequentially labeling with biotin-BSL1 lectin and streptavidin-APC while also labeling for OL markers as described above. Cells were sorted twice and routinely yielded >99.9% purity.; Panning purification of oligodendrocyte lineage cells: Dissociated mouse forebrains were resuspended in panning buffer. In order to deplete microglia, the single-cell suspension was sequentially panned on four BSL1 panning plates. The cell suspension was then sequentially incubated on two PDGFRα plates (to purify and deplete OPCs), one A2B5 plate (to deplete any remaining OPCs), two MOG plates (to purify and deplete myelinating OLs), and one GalC plate (to purify the remaining PDGFRα-, MOG-, OLs). The adherent cells on the first PDGFRα, MOG, and GalC plates were washed to remove all antigen-negative nonadherent cells. The cells were then lysed while still attached to the panning plate with Qiagen RLT lysis buffer, and total RNA was purified. Purified OPCs were >95% NG2 positive and 0% MOG positive. Purified Myelin OLs were 100% MOG positive, >95% MBP positive, and 0% NG2 positive. Purified GalC OLs depleted of OPCs and Myelin OLs were <10% MOG positive and ~50% weakly NG2 positive, a reflection of their recent development as early OLs.; Data normalization and analysis: Raw image files were processed using Affymetrix GCOS and the MAS 5.0 algorithm. Intensity data was normalized per chip to a target intensity TGT value of 500, and expression data and absent/present calls for individual probe sets were determined. Gene expression values were normalized and modeled across arrays using the dChip software package with invariant-set normalization and a PM model. (www.dchip.org, Li and Wong, 2001). The 29 samples were grouped into 9 sample types: Astros P7-P8, Astros P17, Astros P17-gray matter (P17g), Neurons P7, Neurons P17, Neurons-endothelial cell depleted (P7n, P17n), OPCs, GalC-OLs, and MOG-OLs. Gene filtering was performed to select probe sets that were consistently expressed in at least one cell type, where consistently expressed was defined as being called present and having a MAS 5.0 intensity level greater than 200 in at least two-thirds of the samples in the cell type. We identified 20,932 of the 45,037 probe sets that were consistently expressed in at least one of the nine cell types. The Significance Analysis of Microarrays (SAM) method (Tusher et al., 2001) was used to determine genes that were significantly differentially expressed between different cell types (see Supplemental Table S2 for SAM cell type groupings). Clustering was performed using the hclust method with complete linkage in R. Expression values were transformed for clustering by computing a mean expression value for the gene using those samples in the corresponding SAM statistical analysis, and then subtracting the mean from expression intensities. In order to preserve the log2 scale of the data, unless otherwise indicated, no normalization by variance was performed. Plots were created using the gplots package in R. The Bioconductor software package (Gentleman et al., 2004) was used throughout the expression analyses. Functional analyses were performed through the use of Ingenuity Pathways Analysis (Ingenuity® Systems, www.ingenuity.com).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Single cell RNA-seq data generated and reported as part of the manuscript entitled "Dissecting the mechanisms underlying the Cytokine Release Syndrome (CRS) mediated by T Cell Bispecific Antibodies" by Leclercq-Cohen et al 2023. Raw and processed (filtered and annotated) data are provided as AnnData objects which can be directly ingested to reproduce the findings of the paper or for ab initio data reuse: 1- raw.zip provides concatenated raw/unfiltered counts for the 20 samples in the standard Market Exchange Format (MEX) format. 2- 230330_sw_besca2_LowFil_raw.h5ad contains filtered cells and raw counts in the HDF5 format. 3- 221124_sw_besca2_LowFil.annotated.h5ad contains filtered cells and log normalized counts, along with cell type annotation in the HDF5 format.
scRNAseq data generation: Whole blood from 4 donors was treated with 0.2 μg/mL CD20-TCB, or incubated in the absence of CD20- TCB. At baseline (before addition of TCB) and assay endpoints (2, 4, 6, and 20 hrs), blood was collected for total leukocyte isolation using EasySepTM red blood cell depletion reagent (Stemcell). Briefly, cells were counted and processed for single cell RNA sequencing using the BD Rhapsody platform. To load several samples on a single BD Rhapsody cartridge, sample cells were labelled with sample tags (BD Human Single-Cell Multiplexing Kit) following the manufacturer’s protocol prior to pooling. Briefly, 1x106 cells from each sample were re-suspended in 180 μL FBS Stain Buffer (BD, PharMingen) and sample tags were added to the respective samples and incubated for 20 min at RT. After incubation, 2 successive washes were performed by addition of 2 mL stain buffer and centrifugation for 5 min at 300 g. Cells were then re- suspended in 620 μL cold BD Sample Buffer, stained with 3.1 μL of both 2 mM Calcein AM (Thermo Fisher Scientific) and 0.3 mM Draq7 (BD Biosciences) and finally counted on the BD Rhapsody scanner. Samples were then diluted and/or pooled equally in 650 μL cold BD Sample Buffer. The BD Rhapsody cartridges were then loaded with up to 40 000 – 50 000 cells. Single cells were isolated using Single-Cell Capture and cDNA Synthesis with the BD Rhapsody Express Single-Cell Analysis System according to the manufacturer’s recommendations (BD Biosciences). cDNA libraries were prepared using the Whole Transcriptome Analysis Amplification Kit following the BD Rhapsody System mRNA Whole Transcriptome Analysis (WTA) and Sample Tag Library Preparation Protocol (BD Biosciences). Indexed WTA and sample tags libraries were quantified and quality controlled on the Qubit Fluorometer using the Qubit dsDNA HS Assay, and on the Agilent 2100 Bioanalyzer system using the Agilent High Sensitivity DNA Kit. Sequencing was performed on a Novaseq 6000 (Illumina) in paired-end mode (64-8- 58) with Novaseq6000 S2 v1 or Novaseq6000 SP v1.5 reagents kits (100 cycles). scRNAseq data analysis: Sequencing data was processed using the BD Rhapsody Analysis pipeline (v 1.0 https://www.bd.com/documents/guides/user-guides/GMX_BD-Rhapsody-genomics- informatics_UG_EN.pdf) on the Seven Bridges Genomics platform. Briefly, read pairs with low sequencing quality were first removed and the cell label and UMI identified for further quality check and filtering. Valid reads were then mapped to the human reference genome (GRCh38-PhiX-gencodev29) using the aligner Bowtie2 v2.2.9, and reads with the same cell label, same UMI sequence and same gene were collapsed into a single raw molecule while undergoing further error correction and quality checks. Cell labels were filtered with a multi-step algorithm to distinguish those associated with putative cells from those associated with noise. After determining the putative cells, each cell was assigned to the sample of origin through the sample tag (only for cartridges with multiplex loading). Finally, the single-cell gene expression matrices were generated and a metrics summary was provided. After pre-processing with BD’s pipeline, the count matrices and metadata of each sample were aggregated into a single adata object and loaded into the besca v2.3 pipeline for the single cell RNA sequencing analysis (43). First, we filtered low quality cells with less than 200 genes, less than 500 counts or more than 30% of mitochondrial reads. This permissive filtering was used in order to preserve the neutrophils. We further excluded potential multiplets (cells with more than 5,000 genes or 20,000 counts), and genes expressed in less than 30 cells. Normalization, log-transformed UMI counts per 10,000 reads [log(CP10K+1)], was applied before downstream analysis. After normalization, technical variance was removed by regressing out the effects of total UMI counts and percentage of mitochondrial reads, and gene expression was scaled. The 2,507 most variable genes (having a minimum mean expression of 0.0125, a maximum mean expression of 3 and a minimum dispersion of 0.5) were used for principal component analysis. Finally, the first 50 PCs were used as input for calculating the 10 nearest neighbours and the neighbourhood graph was then embedded into the two-dimensional space using the UMAP algorithm at a resolution of 2. Cell type annotation was performed using the Sig-annot semi-automated besca module, which is a signature- based hierarchical cell annotation method. The used signatures, configuration and nomenclature files can be found at https://github.com/bedapub/besca/tree/master/besca/datasets. For more details, please refer to the publication.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets linked to publication "Revealing viral and cellular dynamics of HIV-1 at the single-cell level during early treatment periods", Otte et al 2023 published in Cell Reports Methods pre-ART (antiretroviral therapy) cryo-conserved and and whole blood specimen were sampled for HIV-1 virus reservoir determination in HIV-1 positive individuals from the Swiss HIV Study Cohort. Patients were monitored for proviral (DNA), poly-A transcripts (RNA), late protein translation (Gag and Envelope reactivation co-detection assay, GERDA) and intact viruses (golden standard: viral outgrowth assay, VOA). In this dataset we deposited the pipeline for the multidimensional data analysis of our newly established GERDA method, using DBScan and tSNE. For further comprehension NGS and Sanger sequencing data were attached as processed and raw data (GenBank).
Resubmitted to Cell Reports Methods (Jan-2023), accepted in principal (Mar-2023)
GERDA is a new detection method to decipher the HIV-1 cellular reservoir in blood (tissue or any other specimen). It integrates HIV-1 Gag and Env co-detection along with cellular surface markers to reveal 1) what cells still contain HIV-1 translation competent virus and 2) which marker the respective infected cells express. The phenotypic marker repertoire of the cells allow to make predictions on potential homing and to assess the HIV-1 (tissue) reservoir. All FACS data were acquired on a LSRFortessa BD FACS machine (markers: CCR7, CD45RA, CD28, CD4, CD25, PD1, IntegrinB7, CLA, HIV-1 Env, HIV-1 Gag) Raw FACS data (pre-gated CD4CD3+ T-cells) were arcsin transformed and dimensionally reduced using optsne. Data was further clustered using DBSCAN and either individual clusters were further analyzed for individual marker expression or expression profiles of all relevant clusters were analyzed by heatmaps. Sequences before/after therapy initiation and during viral outgrowth cultures were monitored for individuals P01-46 and P04-56 by Next-generation sequencing (NGS of HIV-1 Envelope V3 loop only) and by Sanger (single genome amplification, SGA)
data normalization code (by Julian Spagnuolo) FACS normalized data as CSV (XXX_arcsin.csv) OMIQ conText file (_OMIQ-context_XXX) arcsin normalized FACS data after optsne dimension reduction with OMIQ.ai as CSV file (XXXarcsin.csv.csv) R pipeline with codes (XXX_commented.R) P01_46-NGS and Sanger sequences P04_56-NGS and Sanger sequences
NGS-Based Rna-Seq Market Size 2024-2028
The NGS-based RNA-seq market size is forecast to increase by USD 6.66 billion, at a CAGR of 20.52% between 2023 and 2028.
The market is witnessing significant growth, driven by the increased adoption of next-generation sequencing (NGS) methods for RNA-Seq analysis. The advanced capabilities of NGS techniques, such as high-throughput, cost-effectiveness, and improved accuracy, have made them the preferred choice for researchers and clinicians in various fields, including genomics, transcriptomics, and personalized medicine. However, the market faces challenges, primarily from the lack of clinical validation on direct-to-consumer genetic tests. As the use of NGS technology in consumer applications expands, ensuring the accuracy and reliability of results becomes crucial.
The absence of standardized protocols and regulatory oversight in this area poses a significant challenge to market growth and trust. Companies seeking to capitalize on market opportunities must focus on addressing these challenges through collaborations, partnerships, and investments in research and development to ensure the clinical validity and reliability of their NGS-based RNA-Seq offerings.
What will be the Size of the NGS-based RNA-Seq market during the forecast period?
Explore in-depth regional segment analysis with market size data - historical 2018-2022 and forecasts 2024-2028 - in the full report.
Request Free Sample
The market continues to evolve, driven by advancements in NGS technology and its applications across various sectors. Spatial transcriptomics, a novel approach to studying gene expression in its spatial context, is gaining traction in disease research and precision medicine. Splice junction detection, a critical component of RNA-seq data analysis, enhances the accuracy of gene expression profiling and differential gene expression studies. Cloud computing plays a pivotal role in handling the massive amounts of data generated by NGS platforms, enabling real-time data analysis and storage. Enrichment analysis, gene ontology, and pathway analysis facilitate the interpretation of RNA-seq data, while data normalization and quality control ensure the reliability of results.
Precision medicine and personalized therapy are key applications of RNA-seq, with single-cell RNA-seq offering unprecedented insights into the complexities of gene expression at the single-cell level. Read alignment and variant calling are essential steps in RNA-seq data analysis, while bioinformatics pipelines and RNA-seq software streamline the process. NGS technology is revolutionizing drug discovery by enabling the identification of biomarkers and gene fusion detection in various diseases, including cancer and neurological disorders. RNA-seq is also finding applications in infectious diseases, microbiome analysis, environmental monitoring, agricultural genomics, and forensic science. Sequencing costs are decreasing, making RNA-seq more accessible to researchers and clinicians.
The ongoing development of sequencing platforms, library preparation, and sample preparation kits continues to drive innovation in the field. The dynamic nature of the market ensures that it remains a vibrant and evolving field, with ongoing research and development in areas such as data visualization, clinical trials, and sequencing depth.
How is this NGS-based RNA-Seq industry segmented?
The NGS-based RNA-seq industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2024-2028, as well as historical data from 2018-2022 for the following segments.
End-user
Acamedic and research centers
Clinical research
Pharma companies
Hospitals
Technology
Sequencing by synthesis
Ion semiconductor sequencing
Single-molecule real-time sequencing
Others
Geography
North America
US
Europe
Germany
UK
APAC
China
Singapore
Rest of World (ROW)
.
By End-user Insights
The acamedic and research centers segment is estimated to witness significant growth during the forecast period.
The global next-generation sequencing (NGS) market for RNA sequencing (RNA-Seq) is primarily driven by academic and research institutions, including those from universities, research institutes, government entities, biotechnology organizations, and pharmaceutical companies. These institutions utilize NGS technology for various research applications, such as whole-genome sequencing, epigenetics, and emerging fields like agrigenomics and animal research, to enhance crop yield and nutritional composition. NGS-based RNA-Seq plays a pivotal role in translational research, with significant investments from both private and public organizations fueling its growth. The technology is instrumental in disease research, enabling the identification
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Our research used CODEX (Co-Detection by Indexing) multiplexed imaging to gain insights into the cellular microenvironment surrounding T cell stimulating hydrogels. These hydrogels were engineered with signals that could locally expand antigen-specific T cells for use in tumor immunotherapy. CODEX imaging involves an iterative process of annealing and stripping fluorophore-labeled oligonucleotide barcodes, complementing the barcodes attached to over 40 antibodies used for tissue staining. Subsequently, images underwent standard CODEX image processing (tile stitching, drift compensation, cycle concatenation, background subtraction, deconvolution, and determination of best focal plane), single cell segmentation, and column marker z-normalization by tissue. Our datasets comprise individual cells as rows, each characterized by 40+ antibody fluorescence values quantified from various markers evaluated for each study. These markers correspond to the antibodies targeting specific proteins within the tissue, quantified at the single-cell level. The values represent per-cell/area-averaged fluorescent intensities, z-normalized along each column. Each cell is mapped with its cell type, defined by x and y coordinates representing pixel locations in the original image. We then used this data to investigate how different proportions of the cell types change over time in response to the stimulating hydrogel injection with antigen-specific T cells. These data could be used to understand the cellular interactions, composition, and structure of T cell stimulating biomaterials for antigen-specific immunotherapy and with adoptive T cell transfer. These datasets offer valuable insights for researchers interested in engineering T cell stimulating microenvironments, immune responses, and therapeutic interventions such as T cell therapies. We investigate the dynamic interplay between immune responses, antigen-specific T cell interactions, and hydrogel environment in a murine melanoma model. We injected antigen-specific T cells with microparticle T cell stimulating hydrogels into mice subcutaneously. Injection sites were take out at different time points day=0 (just after injection), day=3, and day=9 (n=3-6 per time point). Our 51-plex CODEX antibody panel characterizes immune cell types, T cell phenotypes, and stromal cell types, resulting in a rich dataset of 241,685 cells across 51 marker channels. Methods
For a detailed description of each of the steps of protocols and processes to obtain this data see the detailed materials and methods in the associated manuscript. Briefly, murine injection sites were extracted from sacrificed mice at indicated timepoint post therapy and frozen in OCT. These were assembled into an array of 4-7 tissues, cut into 7 um slices, and stained with a panel of 40+ CODEX DNA-oligonucleotide barcoded antibodies. Tissues were imaged with a Phenocycler Fusion imaging system from Akoya Biosciences and processed using manufacturer’s v1.6.0 (e.g., background-subtracted, stitched, and shading correction). Processed data were then segmented using CellVisionSegmenter, a neural network R-CNN-based single-cell segmentation algorithm. Cell type analysis was completed by z normalization of protein markers used for clustering and then overclustered using leiden-based clustering. The cell type labels were verified by looking back at the original image.
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
The mechanisms underlying ETS-driven prostate cancer initiation and progression remain poorly understood due to a lack of model systems that recapitulate this phenotype. We generated a genetically engineered mouse with prostate-specific expression of the ETS factor, ETV4, at lower and higher protein dosages through mutation of its degron. Lower-level expression of ETV4 caused mild luminal cell expansion without histologic abnormalities and higher-level expression of stabilized ETV4 caused prostatic intraepithelial neoplasia (mPIN) with 100% penetrance within 1 week. Tumor progression was limited by p53-mediated senescence and Trp53 deletion cooperated with stabilized ETV4. The neoplastic cells expressed differentiation markers such as Nkx3.1 recapitulating luminal gene expression features of untreated human prostate cancer. Single-cell and bulk RNA-sequencing showed stabilized ETV4 induced a novel luminal-derived expression cluster with signatures of the cell cycle, senescence, and epithelial to mesenchymal transition. These data suggest that ETS overexpression alone, at sufficient dosage, can initiate prostate neoplasia. Methods Mouse prostate digestion: Intraperitoneal injection of tamoxifen was administered in 8-week-old mice. 2 weeks after tamoxifen treatment, the mouse prostate was digested 1 hour with Collagenase/Hyaluronidase (STEMCELL, #07912), and then 30 minutes with TrypLETM Express Enzyme (Thermo Fischer, # 12605028) at 37°C to isolate single prostate cells. The prostate cells were stained with PE/Cy7 conjugated anti-mouse CD326 (EpCAM) antibody (BioLegend, 118216) and then, CD326 and EYFP double positive cells were sorted out by flow cytometry, which are luminal cells mainly from the anterior prostate and dorsal prostate. The mRNA or genomic DNA were extracted from these double-positive cells and then were used for ATAC-sequencing and RNA-sequencing analysis. ATAC-seq and primary data processing: ATAC-seq was performed as previously described. Primary data processing and peak calling were performed using ENCODE ATAC-seq pipeline (https://github.com/kundajelab/atac_dnase_pipelines). Briefly, paired-end reads were trimmed, filtered, and aligned against mm9 using Bowtie2. PCR duplicates and reads mapped to mitochondrial chromosome or repeated regions were removed. Mapped reads were shifted +4/-5 to correct for the Tn5 transposase insertion. Peak calling was performed using MACS2, with p-value < 0.01 as the cutoff. Reproducible peaks from two biological replicates were defined as peaks that overlapped by more than 50%. On average 25 million uniquely mapped pairs of reads were remained after filtering. The distribution of inserted fragment length shows a typical nucleosome banding pattern, and the TSS enrichment score (reads that are enriched around TSS against background) ranges between 28 and 33, suggesting the libraries have high quality and were able to capture the majority of regions of interest. Differential peak accessibility: Reads aligned to peak regions were counted using R package GenomicAlignments_v1.12.2. Read count normalization and differential accessible peaks were called with DESeq2_v1.16.1 in R 3.4.1. Differential peaks were defined as peaks with adjusted p-value < 0.01 and |log2(FC)| > 2. For visualization, coverage bigwig files were generated using bamCoverage command from deepTools2, normalizing using the size factor generated by DESeq2. The differential ATAC-seq peak density plot was generated with deepTools2, using regions that were significantly more or less accessible in ETV4AAA samples relative to EYFP samples. Motif analysis: Enriched motif was performed using MEME-ChIP 5.0.0 with differentially accessible regions in ETV4AAA relative to EYFP. ATAC-seq footprinting was performed using TOBIAS. First, ACACCorrect was run to correct Tn5 bias, followed by ScoreBigwig to calculate footprint score, and finally BindDetect to generate differential footprint across regions. RNA-seq analysis: The extracted RNA was processed for RNA-sequencing by the Integrated Genomics Core Facility at MSKCC. The libraries were sequenced on an Illumina HiSeq-2500 platform with 51 bp paired-end reads to obtain a minimum yield of 40 million reads per sample. The sequenced data were aligned using STAR v2.3 with GRCm38.p6 as annotation. DESeq2_v1.16.1 was subsequently applied on read counts for normalization and the identification of differentially expressed genes between ETV4AAA and EYFP groups, with an adjusted p-value < 0.05 as the threshold. Genes were ranked by sign(log2(FC)) * (-log(p-value)) as input for GSEA analysis using ‘Run GSEA Pre-ranked’ with 1000 permutations (48). The custom gene sets used in GSEA analysis are shown in Table S2. Unsupervised hierarchical clustering: To get an overall sample clustering as part of QC, hierarchical clustering was performed using pheatmap_v1.0.10 package in R on normalized ATAC-seq or RNA-seq data. It was done using all peaks or all genes, with Spearman or Pearson correlation as the distance metric. To have an overview of the differential gene expression from the RNA-seq data, unsupervised clustering was also performed on a matrix with all samples as columns and scaled normalized read counts of differentially expressed genes between ETV4AAA and EYFP as rows. Integrative analysis of ATAC-seq, RNA-seq, and ChIP-seq data: ERG ChIP-seq peaks were called using MACS 2.1, with an FDR cutoff of q < 10-3 and the removal of peaks mapped to blacklist regions. Reproducible peaks between two biological replicates were identified as ETV4AAA ATAC-seq peaks. ERG ChIP-seq peaks and ETV4AAA ATAC-seq peaks were considered as overlap if peak summits were within 250bp. To determine whether the overlap was significant, enrichment analysis was done using regioneR_v1.8.1 in R, which counted the number of overlapped peaks between a set of randomly selected regions in the genome (excluding blacklist regions) and the ERG-ChIP seq peaks or ETV4AAA ATAC-seq peaks. A null distribution was formed using 1000 permutation tests to compute the p-value and z-score of the original evaluation. To assign ATAC-seq peaks to genes, ChIPseeker_v1.12.1 in R was used. Each peak was unambiguously assigned to one gene with a TSS or 3’ end closest to that peak. Differential gene expression between ETV4AAA and EYFP was evaluated using log2(FC) calculated by DESeq2. p-values were estimated with Wilcoxon rank t-test and Student t-test. scRNA-sequencing: Tmprss2-CreERT2, EYFP; Tmprss2-CreERT2, ETV4WT; Tmprss2-CreERT2, ETV4AAA; and Tmprss2-CreERT2, ETV4AAA; Trp53L/L mice were euthanized 2 weeks or 4 months after tamoxifen treatment (n=3 mice for each genotype and time point). After euthanasia, the prostates were dissected out and minced with scalpel, and then processed for 1h digestion with collagenase/hyaluronidase (#07912, STEMCELL Technologies) and 30min digestion with TrypLE (#12605010, Gibco). Live single prostate cells were sorted out by flow cytometry as DAPI-. For each mouse, 5,000 cells were directly processed with 10X genomics Chromium Single Cell 3’ GEM, Library & Gel Bead Kit v3 according to manufacturer’s specifications. For each sample, 200 million reads were acquired on NovaSeq platform S4 flow cell. Reads obtained from the 10x Genomics scRNAseq platform were mapped to mouse genome (mm9) using the Cell Ranger package (10X Genomics). True cells are distinguished from empty droplets using scCB2 package. The levels of mitochondrial reads and numbers of unique molecular identifiers (UMIs) were similar among the samples, which indicates that there were no systematic biases in the libraries from mice with different genotypes. Cells were removed if they expressed fewer than 600 unique genes, less than 1,500 total counts, more than 50,000 total counts, or greater than 20% mitochondrial reads. Genes detected in less than 10 cells and all mitochondrial genes were removed for subsequent analyses. Putative doublets were removed using the Doublet Detection package. The average gene detection in each cell type was similar among the samples. Combining samples in the entire cohort yielded a filtered count matrix of 48,926 cells by 19,854 genes, with a median of 6,944 counts and a median of 1,973 genes per cell, and a median of 2,039 cells per sample. The count matrix was then normalized to CPM (counts per million), and log2(X+1) transformed for analysis of the combined dataset. The top 1000 highly variable genes were found using SCANPY (version 1.6.1) (77). Principal Component Analysis (PCA) was performed on the 1,000 most variable genes with the top 50 principal components (PCs) retained with 29% variance explained. To visualize single cells of the global atlas, we used UMAP projections (https://arxiv.org/abs/1802.03426). We then performed Leiden clustering. Marker genes for each cluster were found with scanpy.tl.rank_genes_groups. Cell types were determined using the SCSA package, an automatic tool, based on a score annotation model combining differentially expressed genes (DEGs) and confidence levels of cell markers from both known and user-defined information. Heat-map were performed for single cells based on log-normalized and scaled expression values of marker genes curated from literature or identified as highly differentially expressed. Differentially expressed genes between different clusters were found using MAST package, which were shown in heat-map. The logFC of MAST output was used for the ranked gene list in GSEA analysis (48). The custom gene sets used in GSEA analysis are shown in Table S2. Gene imputation was performed using MAGIC (Markov affinity-based graph imputation of cells) package, and imputated gene expression were used in the heatmap. Analysis of public human gene expression datasets: To analyze TP53 RNA expression in human prostate cancer samples, we obtained normalized RNA-seq data from prostate cancer TCGA (www.firebrowse.org) (3). To assess the role of TP53 loss on
This dataset could be used to test machine learning algorithms for cell type label transfer accuracy methods. It could also be used to look at cell type relationships in tonsil, intestine, and Barrett's esophagus tissues. The overall structure of the datasets are individual cells segmented out in each row. Then there are columns for the X, Y position in pixels in the overall montage image of the dataset. There are also columns to indicate which region the data came from. There are also cell type labels generated from expert annotations. The other columns are the values of the antibody staining the target protein within the tissue quantified at the single-cell level. This value is the per cell/area averaged fluorescent intensity that has subsequently been z normalized along each column as described above. For the B004_training_dryad.csv dataset, data from donor B004 was expert annotated for cell types within the small intestine and colon (~250,000 cells) and contains cell type labels in addition to protein marker expressions and x, y positions. Each donor has 4 regions from the colon and 4 regions from the small intestine. For the intra-region comparisons we looked at B004 regions in the colon with training 3 regions and then predicting on the fourth. For the B0056_unnanotated_dryad.csv dataset, data from donors B005 and B006 were unnanotated samples we transferred cell type labels to from the B004 training dataset. This means that B0056 has the quantified and preprocessed single-cell quantified protein expression with x, y positions from CODEX imaging, but no cell type annotation labels associated yet. The TonsilBE_processed_0.5 pkl is the preconstructed graph for the tonsil and BE datasets so that running a demo example of STELLAR found on our github (https://github.com/snap-stanford/stellar) runs faster and is included as supplementary data. We performed CODEX (co-detection by indexing) multiplexed imaging on 24 sections of the human intestine from 3 donors (B004, B005, B006) using a panel of 47 oligonucleotide-barcoded antibodies. We also performed CODEX imaging on both human tonsil and Barrett's esophagus (BE) using a panel of 57 oligonucleotide-barcoded antibodies. Subsequently images underwent standard CODEX image processing (tile stitching, drift compensation, cycle concatenation, background subtraction, deconvolution, and determination of best focal plane), single cell segmentation, and column marker z-normalization by tissue. Output of this process were dataframes of 870,000 cells and 220,000 cells respectively with fluorescence values quantified from each marker. See README file.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data normalization is vital to single-cell sequencing, addressing limitations presented by low input material and various forms of bias or noise present in the sequencing process. Several such normalization methods exist, some of which rely on spike-in genes, molecules added in known quantities to serve as a basis for a normalization model. Depending on available information and the type of data, some methods may express certain advantages over others. We compare the effectiveness of seven available normalization methods designed specifically for single-cell sequencing using two real data sets containing spike-in genes and one simulation study. Additionally, we test those methods not dependent on spike-in genes using a real data set with three distinct cell-cycle states and a real data set under the 10X Genomics GemCode platform with multiple cell types represented. We demonstrate the differences in effectiveness for the featured methods using visualization and classification assessment and conclude which methods are preferable for normalizing a certain type of data for further downstream analysis, such as classification or differential analysis. The comparison in computational time for all methods is addressed as well.