https://spdx.org/licenses/CC0-1.0.html
Skeletal muscle repair is driven by the coordinated self-renewal and fusion of myogenic stem and progenitor cells. Single-cell gene expression analyses of myogenesis have been hampered by the poor sampling of rare and transient cell states that are critical for muscle repair, and do not inform the spatial context that is important for myogenic differentiation. Here, we demonstrate how large-scale integration of single-cell and spatial transcriptomic data can overcome these limitations. We created a single-cell transcriptomic dataset of mouse skeletal muscle by integration, consensus annotation, and analysis of 23 newly collected scRNAseq datasets and 88 publicly available single-cell (scRNAseq) and single-nucleus (snRNAseq) RNA-sequencing datasets. The resulting dataset includes more than 365,000 cells and spans a wide range of ages, injury, and repair conditions. Together, these data enabled identification of the predominant cell types in skeletal muscle, and resolved cell subtypes, including endothelial subtypes distinguished by vessel-type of origin, fibro/adipogenic progenitors defined by functional roles, and many distinct immune populations. The representation of different experimental conditions and the depth of transcriptome coverage enabled robust profiling of sparsely expressed genes. We built a densely sampled transcriptomic model of myogenesis, from stem cell quiescence to myofiber maturation and identified rare, transitional states of progenitor commitment and fusion that are poorly represented in individual datasets. We performed spatial RNA sequencing of mouse muscle at three time points after injury and used the integrated dataset as a reference to achieve a high-resolution, local deconvolution of cell subtypes. We also used the integrated dataset to explore ligand-receptor co-expression patterns and identify dynamic cell-cell interactions in muscle injury response. 
We provide a public web tool to enable interactive exploration and visualization of the data. Our work supports the utility of large-scale integration of single-cell transcriptomic data as a tool for biological discovery.
Methods
Mice. The Cornell University Institutional Animal Care and Use Committee (IACUC) approved all animal protocols, and experiments were performed in compliance with its institutional guidelines. Adult C57BL/6J mice (Mus musculus) were obtained from Jackson Laboratories (#000664; Bar Harbor, ME) and were used at 4-7 months of age. Aged C57BL/6J mice were obtained from the National Institute on Aging (NIA) Rodent Aging Colony and were used at 20 months of age. For new scRNAseq experiments, female mice were used in each experiment.
Mouse injuries and single-cell isolation. To induce muscle injury, both tibialis anterior (TA) muscles of old (20 months) C57BL/6J mice were injected with 10 µl of notexin (10 µg/ml; Latoxan; France). At 0, 1, 2, 3.5, 5, or 7 days post-injury (dpi), mice were sacrificed and TA muscles were collected and processed independently to generate single-cell suspensions. Muscles were digested with 8 mg/ml Collagenase D (Roche; Switzerland) and 10 U/ml Dispase II (Roche; Switzerland), followed by manual dissociation to generate cell suspensions. Cell suspensions were sequentially filtered through 100 and 40 μm filters (Corning Cellgro #431752 and #431750) to remove debris. Erythrocytes were removed through incubation in erythrocyte lysis buffer (IBI Scientific #89135-030).
Single-cell RNA-sequencing library preparation. After digestion, single-cell suspensions were washed and resuspended in 0.04% BSA in PBS at a concentration of 10^6 cells/ml. Cells were counted manually with a hemocytometer to determine their concentration. Single-cell RNA-sequencing libraries were prepared using the Chromium Single Cell 3’ reagent kit v3 (10x Genomics, PN-1000075; Pleasanton, CA) following the manufacturer’s protocol. Cells were diluted into the Chromium Single Cell A Chip to yield a recovery of 6,000 single-cell transcriptomes. After preparation, libraries were sequenced on a NextSeq 500 (Illumina; San Diego, CA) using 75 cycle high output kits (Index 1 = 8, Read 1 = 26, and Read 2 = 58). Details on estimated sequencing saturation and the number of reads per sample are shown in Sup. Data 1.
Spatial RNA sequencing library preparation. Tibialis anterior muscles of adult (5 mo) C57BL/6J mice were injected with 10 µl notexin (10 µg/ml) at 2, 5, and 7 days prior to collection. Upon collection, tibialis anterior muscles were isolated, embedded in OCT, and frozen fresh in liquid nitrogen. Spatially tagged cDNA libraries were built using the Visium Spatial Gene Expression 3’ Library Construction v1 Kit (10x Genomics, PN-1000187; Pleasanton, CA) (Fig. S7). Optimal tissue permeabilization time for 10 µm thick sections was found to be 15 minutes using the 10x Genomics Visium Tissue Optimization Kit (PN-1000193). H&E stained tissue sections were imaged using a Zeiss PALM MicroBeam laser capture microdissection system, and the images were stitched and processed using Fiji ImageJ software. cDNA libraries were sequenced on an Illumina NextSeq 500 using 150 cycle high output kits (Read 1 = 28 bp, Read 2 = 120 bp, Index 1 = 10 bp, and Index 2 = 10 bp). Frames around the capture area on the Visium slide were aligned manually and spots covering the tissue were selected using Loop Browser v4.0.0 software (10x Genomics). Sequencing data were then aligned to the mouse reference genome (mm10) using the spaceranger v1.0.0 pipeline to generate a feature-by-spot-barcode expression matrix (10x Genomics).
Download and alignment of single-cell RNA sequencing data. For all samples available via SRA, parallel-fastq-dump (github.com/rvalieris/parallel-fastq-dump) was used to download raw .fastq files. Samples which were only available as .bam files were converted to .fastq format using bamtofastq from 10x Genomics (github.com/10XGenomics/bamtofastq). Raw reads were aligned to the mm10 reference using cellranger (v3.1.0).
Preprocessing and batch correction of single-cell RNA sequencing datasets. First, ambient RNA signal was removed using the default SoupX (v1.4.5) workflow (autoEstCounts and adjustCounts; github.com/constantAmateur/SoupX). Samples were then preprocessed using the standard Seurat (v3.2.1) workflow (NormalizeData, ScaleData, FindVariableFeatures, RunPCA, FindNeighbors, FindClusters, and RunUMAP; github.com/satijalab/seurat). Cells with fewer than 750 features, fewer than 1000 transcripts, or more than 30% of unique transcripts derived from mitochondrial genes were removed. After preprocessing, DoubletFinder (v2.0) was used to identify putative doublets in each dataset individually. BCmvn optimization was used for pK parameterization. Estimated doublet rates were computed by fitting the total number of cells after quality filtering to a linear regression of the expected doublet rates published in the 10x Chromium handbook. Estimated homotypic doublet rates were also accounted for using the modelHomotypic function. The default pN value (0.25) was used. Putative doublets were then removed from each individual dataset. After preprocessing and quality filtering, we merged the datasets and performed batch correction with three tools, independently: Harmony (github.com/immunogenomics/harmony) (v1.0), Scanorama (github.com/brianhie/scanorama) (v1.3), and BBKNN (github.com/Teichlab/bbknn) (v1.3.12). We then used Seurat to process the integrated data. After initial integration, we removed the noisy cluster and re-integrated the data using each of the three batch-correction tools.
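The per-cell quality thresholds above can be illustrated with a short sketch. The original workflow used Seurat in R; the NumPy function below is only an illustrative stand-in, and the function name, argument names, and the dense cells-by-genes layout are assumptions made for this example.

```python
import numpy as np

def qc_keep_mask(counts, is_mito, min_features=750, min_counts=1000,
                 max_mito_frac=0.30):
    """Return a boolean mask of cells that pass the three QC thresholds.

    counts  : cells x genes array of raw UMI counts (hypothetical layout)
    is_mito : boolean array marking mitochondrial genes
    """
    counts = np.asarray(counts)
    n_features = (counts > 0).sum(axis=1)        # genes detected per cell
    n_counts = counts.sum(axis=1)                # transcripts per cell
    mito_counts = counts[:, is_mito].sum(axis=1)
    # Guard against division by zero for empty barcodes.
    mito_frac = np.divide(mito_counts, n_counts,
                          out=np.zeros(len(n_counts), dtype=float),
                          where=n_counts > 0)
    # Cells are removed if they have fewer than 750 features, fewer than
    # 1000 transcripts, or more than 30% mitochondrial transcripts.
    return ((n_features >= min_features)
            & (n_counts >= min_counts)
            & (mito_frac <= max_mito_frac))
```

Applying `counts[qc_keep_mask(counts, is_mito)]` would then retain only the passing cells.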
Cell type annotation. Cell types were determined for each integration method independently. For Harmony and Scanorama, dimensions accounting for 95% of the total variance were used to generate SNN graphs (Seurat::FindNeighbors). Louvain clustering was then performed on the output graphs (including the corrected graph output by BBKNN) using Seurat::FindClusters. A clustering resolution of 1.2 was used for Harmony (25 initial clusters), BBKNN (28 initial clusters), and Scanorama (38 initial clusters). Cell types were determined based on expression of canonical genes (Fig. S3). Clusters which had similar canonical marker gene expression patterns were merged.
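As a rough illustration of how the "dimensions accounting for 95% of the total variance" might be chosen, the sketch below counts the leading components needed to reach a target fraction of variance. It assumes that "total variance" means the sum over the computed components, which the text does not specify; the function name is hypothetical.

```python
import numpy as np

def n_dims_for_variance(stdev, target=0.95):
    """Smallest number of leading components whose cumulative variance
    fraction reaches `target`.

    stdev : per-component standard deviations, sorted in decreasing order
            (for example, the values a PCA routine reports)
    """
    var = np.asarray(stdev, dtype=float) ** 2
    cum_frac = np.cumsum(var) / var.sum()
    # searchsorted returns the first index whose cumulative fraction >= target
    return int(np.searchsorted(cum_frac, target) + 1)
```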
Pseudotime workflow. Cells were subset based on the consensus cell types between all three integration methods. Harmony embedding values from the dimensions accounting for 95% of the total variance were used for further dimensional reduction with PHATE, using phateR (v1.0.4) (github.com/KrishnaswamyLab/phateR).
Deconvolution of spatial RNA sequencing spots. Spot deconvolution was performed using the deconvolution module in BayesPrism (previously known as “Tumor microEnvironment Deconvolution”, TED, v1.0; github.com/Danko-Lab/TED). First, myogenic cells were re-labeled, according to binning along the first PHATE dimension, as “Quiescent MuSCs” (bins 4-5), “Activated MuSCs” (bins 6-7), “Committed Myoblasts” (bins 8-10), and “Fusing Myocytes” (bins 11-18). Culture-associated muscle stem cells were ignored and myonuclei labels were retained as “Myonuclei (Type IIb)” and “Myonuclei (Type IIx)”. Next, highly and differentially expressed genes across the 25 groups of cells were identified with differential gene expression analysis using Seurat (FindAllMarkers, using Wilcoxon Rank Sum Test; results in Sup. Data 2). The resulting genes were filtered based on average log2-fold change (avg_logFC > 1) and the percentage of cells within the cluster which express each gene (pct.expressed > 0.5), yielding 1,069 genes. Mitochondrial and ribosomal protein genes were also removed from this list, in line with recommendations in the BayesPrism vignette. For each of the cell types, mean raw counts were calculated across the 1,069 genes to generate a gene expression profile for BayesPrism. Raw counts for each spot were then passed to the run.Ted function, using
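The bin relabeling and marker filtering described in this paragraph can be sketched as follows. This is a plain-Python illustration, not the Seurat or BayesPrism code used in the study; the record layout (dictionaries with `gene`, `avg_logFC`, and `pct` keys) and the function names are assumptions made for the example.

```python
# PHATE-bin ranges and labels taken from the paragraph above
# (range upper bounds are exclusive).
BIN_LABELS = [
    (range(4, 6), "Quiescent MuSCs"),
    (range(6, 8), "Activated MuSCs"),
    (range(8, 11), "Committed Myoblasts"),
    (range(11, 19), "Fusing Myocytes"),
]

def label_for_bin(bin_index):
    """Map a bin along the first PHATE dimension to a myogenic label.

    Returns None for bins the text does not assign to a label.
    """
    for bins, label in BIN_LABELS:
        if bin_index in bins:
            return label
    return None

def filter_markers(markers, min_logfc=1.0, min_pct=0.5):
    """Keep genes with avg_logFC > 1 and expressed in > 50% of the cluster.

    markers : iterable of dicts with hypothetical keys
              'gene', 'avg_logFC', 'pct'
    """
    return sorted({m["gene"] for m in markers
                   if m["avg_logFC"] > min_logfc and m["pct"] > min_pct})
```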
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Preprocessed .h5ad data from https://github.com/LieberInstitute/spatialLIBD used for the benchmarking of the spatial clustering methods in CellCharter.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
You can find here the datasets used in the publication:
Valdeolivas, A., Amberg, B., Giroud, N. et al. Profiling the heterogeneity of colorectal cancer consensus molecular subtypes using spatial transcriptomics. npj Precis. Onc. 8, 10 (2024). https://doi.org/10.1038/s41698-023-00488-4
This record contains the raw Spatial Transcriptomics data, the spot categorization made by the pathologist, the results of the deconvolution, and the intermediary files required to run the analysis described in our manuscript and available on GitHub:
https://github.com/alberto-valdeolivas/ST_CRC_CMS
In particular, you will find here several zip compressed files with the following content:
- Intermediary_FileObjects.zip: The intermediary files generated in the scripts hosted in the github repo and required to run some later scripts.
- IntermediaryFiles_ST_CRC_LiverMetastasis.zip: The intermediary files generated in the scripts hosted in the github repo and required to run some of the scripts dealing with the external CRC ST dataset used in our manuscript.
- Pathology_SpotAnnotations.zip: The anatomical categories (tumor, stroma, non-neoplastic mucosa...) assigned by the pathologists to all the spots across our set of ST samples.
- SN048_A121573_Rep1.zip, SN048_A121573_Rep2.zip, SN048_A416371_Rep1.zip, SN048_A416371_Rep2.zip, SN123_A551763_Rep1.zip, SN123_A595688_Rep1.zip, SN123_A798015_Rep1.zip, SN123_A938797_Rep1_X.zip, SN124_A551763_Rep2.zip, SN124_A595688_Rep2.zip, SN124_A798015_Rep2.zip, SN124_A938797_Rep2.zip, SN84_A120838_Rep1.zip, SN84_A120838_Rep2.zip: The output of Space Ranger, including processed count data matrices and histological images, for the ST data generated in this study.
- DeconvolutionResults_ST_CRC_BelgianCohort.zip, DeconvolutionResults_ST_CRC_KoreanCohort.zip, DeconvolutionResults_ST_CRC_LiverMetastasis.zip: These files contain the main results obtained when using the Cell2Location deconvolution approach in our samples (with two different references: Korean and Belgian cohorts) and in the external set of CRC ST samples (only Korean cohort)
- We have also uploaded the whole slide images (WSI). These are the files with an ndpi extension:
Visium Frozen_SN V10B01-048_new CRC_2021_02_16.ndp ... (samples A121573_Rep1, A121573_Rep2, A416371_Rep1 and A416371_Rep2), Visium Frozen_SN V19S23-084.ndpi (samples A120838_Rep1 and A120838_Rep2), Visium Frozen_SN V19S23-123.ndpi (samples A551763_Rep1, A595688_Rep1, A798015_Rep1, A938797_Rep1) and Visium Frozen_SN V19S23-124.ndpi (samples A551763_Rep2, A595688_Rep2, A798015_Rep2 and A938797_Rep2)
- We have now included the fastq and Bam files for the different samples, excluding replicate 1 of the A938797 sample whose fastq files are missing:
IMPORTANT: Fastq files are in version 1, while bam files are in version 2 of the dashboards reported below:
Here, we summarise available data and source code regarding the publication "Spatially Resolved Transcriptomics Mining in 3D and Virtual Reality Environments with VR-Omics". Abstract: Spatially resolved transcriptomics (SRT) technologies produce complex, multi-dimensional data sets of gene expression information that can be obtained at subcellular spatial resolution. While several computational tools are available to process and analyse SRT data, no platforms facilitate the visualisation and interaction with SRT data in an immersive manner. Here we present VR-Omics, a computational platform that supports the analysis, visualisation, exploration, and interpretation of SRT data and is compatible with any SRT technology. VR-Omics is the first tool capable of analysing and visualising data generated by multiple SRT platforms in both 2D desktop and virtual reality environments. It incorporates an in-built workflow to automatically pre-process and spatially mine the data within a user-friendly graphical user interface. Benchmarking VR-Omics against other comparable software demonstrates its seamless end-to-end analysis of SRT data, hence making SRT data processing and mining universally accessible. VR-Omics is an open-source software freely available at: https://ramialison-lab.github.io/pages/vromics.html or below. Publicly available data were used for the development of VR-Omics. The Visium data from 10x Genomics is available at the 10x Genomics website: https://www.10xgenomics.com/resources/datasets. The 10x Genomics Xenium dataset is available under: https://www.10xgenomics.com/products/xenium-in-situ/preview-dataset-human-breast. The STOmics database is available at: https://db.cngb.org/stomics. The Vizgen MERFISH data release program can be accessed via: https://vizgen.com/data-release-program/. The Tomo-seq data is available via their publication https://doi.org/10.1016/j.cell.2014.09.038 which also contains the MATLAB code for the 3D data reconstruction.
The Visium demo was adapted from Asp et al. and can be accessed via the related publication https://doi.org/10.1016/j.cell.2019.11.025 or at https://data.mendeley.com/datasets/zkzvyprd5z/1. The demo datasets generated for VR-Omics can be found at: https://doi.org/10.26180/22207579.v1 or below for download. The 3D Visium data set of the human developing heart adapted from Asp et al. can be found within the application and can be accessed from the main menu following the Visium, Demo context menu. The complete standalone version of VR-Omics (containing Python AW and Visualiser) can be downloaded at https://ramialison-lab.github.io/pages/vromics.html or at https://doi.org/10.26180/20220312.v1 or below for download. Alternatively, the code is available at GitHub (https://github.com/Ramialison-Lab/VR-Omics). To use the GitHub version, an installation of Unity Gaming Engine (version 2021.3.11f1) is required. This version does not include the Python AW. The Python AW can be accessed at: https://doi.org/10.26180/22207903.v1. More information on running VR-Omics via Unity can be found in the full documentation accessible at https://ramialison-lab.github.io/pages/vromics.html.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This Zenodo record consists of:
The following datasets are included:
The CosMx human liver dataset was obtained from the NanoString website (https://nanostring.com/products/cosmx-spatial-molecular-imager/ffpe-dataset/human-liver-rna-ffpe-dataset). The dataset consists of 2 Formalin-Fixed Paraffin-Embedded (FFPE) samples from 2 patients, one being normal liver and the other from a hepatocellular carcinoma patient with grade G3 cancer. Data was generated on the CosMx platform using the Human Universal Cell Characterization Panel 1000 plex. The ground truth annotations were computationally identified using Mclust clustering on the frequency of each cell type among its 200 nearest neighbors. See NanoString_Data_License_Agreement.pdf for the license terms.
The CosMx human non-small-cell lung cancer dataset was obtained from the NanoString website (https://staging.nanostring.com/products/cosmx-spatial-molecular-imager/ffpe-dataset/nsclc-ffpe-dataset). The dataset consists of 8 FFPE samples from 5 patients presenting with non-small-cell lung cancer grade G1-G3. Data was generated on a CosMx prototype instrument using a 960 gene panel [1]. The ground truth annotations were computationally identified using Mclust clustering on the frequency of each cell type among its 200 nearest neighbors. See NanoString_Data_License_Agreement.pdf for the license terms.
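The neighborhood-composition features behind the two CosMx ground-truth annotations can be sketched as below. This is a minimal NumPy illustration under stated assumptions: it uses a brute-force distance matrix (suitable only for small inputs), it excludes each cell from its own neighborhood (the source does not say whether the cell itself is counted), and it stops before the Mclust clustering step, which is not reproduced here. The function name is hypothetical.

```python
import numpy as np

def neighborhood_composition(coords, labels, k=200):
    """Frequency of each cell type among the k nearest neighbors of each cell.

    coords : cells x 2 array of spatial positions (at least 2 cells)
    labels : per-cell cell-type labels
    Returns (types, freq), where freq[i, j] is the fraction of cell i's
    neighbors that have type types[j].
    """
    coords = np.asarray(coords, dtype=float)
    types, codes = np.unique(np.asarray(labels), return_inverse=True)
    k = min(k, len(coords) - 1)
    # Brute-force squared distances; a spatial index would be needed at scale.
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)          # assumption: exclude the cell itself
    nn = np.argpartition(d2, k - 1, axis=1)[:, :k]
    onehot = np.eye(len(types))[codes]    # cells x types indicator matrix
    freq = onehot[nn].mean(axis=1)        # average over each neighborhood
    return types, freq
```

The rows of `freq` are the per-cell feature vectors that a model-based clustering method would then group into regions.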
The MERFISH mouse brain thalamus dataset [2] was obtained from the Brain Knowledge Platform (https://alleninstitute.github.io/abc_atlas_access/descriptions/MERFISH-C57BL6J-638850.html). The dataset consists of 59 fresh frozen (FF) serial full coronal sections at 200-µm intervals spanning one entire mouse brain. Data was generated on a Vizgen MERSCOPE instrument using a custom gene panel of 500 genes. The ground truth annotations were identified by aligning the MERFISH data to the CCFv3 coordinate space and labeling cells with the corresponding CCFv3 anatomical parcellation term [3]. Only the thalamus (TH; CCFv3 structure ID 549) and hypothalamic zona incerta (ZI; CCFv3 structure ID 797) were analyzed in this study. Spatially variable genes in the thalamus were identified by differential gene expression analysis on neighboring consensus clusters.
The MERFISH human developmental heart dataset [4] was obtained from Dryad (https://datadryad.org/stash/dataset/doi:10.5061/dryad.w0vt4b8vp). The dataset consists of 4 FF samples from 2 donors at 13 and 15 post-conception weeks (PCW). Data was generated using MERFISH with a custom 238-gene panel. The ground truth annotations (referred to as cellular communities in the original study) were computationally identified using k-means clustering of relative cell-type composition within 150µm of each cell.
The STARmap PLUS mouse brain dataset [5] was obtained from Zenodo (https://zenodo.org/records/8327576). The dataset consists of 20 FF samples from 3 mice. Data was generated using STARmap PLUS using a custom 1,022 gene panel. The ground truth annotations were manually identified by aligning the data to the CCFv3.
The Xenium breast cancer dataset was obtained from the 10x website (https://www.10xgenomics.com/datasets/xenium-ffpe-human-breast-with-custom-add-on-panel-1-standard). The dataset consists of 1 FFPE sample from a patient with infiltrating ductal carcinoma breast cancer. Data was generated on a Xenium Analyzer using the Xenium human breast gene expression panel v1 (280 genes) with 100 additional custom genes. The ground truth annotation was manually identified using the matched histopathology image, annotating for eight region types: ductal carcinoma in-situ, invasive tumor, normal ducts, immune cells, cysts, blood vessels, adipose tissue, and stroma [6].
The Xenium mouse brain dataset was obtained from the 10x Genomics website (https://www.10xgenomics.com/datasets/fresh-frozen-mouse-brain-replicates-1-standard). The dataset consists of 1 FF sample of a full coronal section. Data was generated on a Xenium Analyzer using the v1 mouse brain gene expression panel (247 genes). The ground truth annotation was manually identified using the mouse coronal P56 sample from Allen Brain Atlas [3] to specify anatomical regions [7].
The Slide-seqV2 mouse brain olfactory bulb dataset [8] was obtained from the STOmicsDB website (https://db.cngb.org/stomics/datasets/STDS0000172/data). The dataset consists of 20 samples of a mouse olfactory bulb evenly spaced along the anterior-posterior axis. Data was generated using Slide-seqV2 and sequenced using paired-end reads on an Illumina NovaSeq 6000 instrument, targeting 200 million reads per sample. The ground truth annotations were manually identified based on the expression of marker genes.
The Stereo-seq mouse liver dataset [9] was obtained from the STOmicsDB website (https://db.cngb.org/stomics/lista/spatial). The dataset consists of 6 FF samples. Data was generated on Stereo-seq chips and sequenced using paired-end reads on a DIPSEQ T1 instrument. The ground truth annotations were computationally identified: zonation layers were annotated based on the differences between the scores of pericentral and periportal hepatocyte landmark genes.
The Stereo-seq mouse embryo dataset [10] was obtained from the STOmicsDB website (https://ftp.cngb.org/pub/SciRAID/stomics/STDS0000058/stomics). The dataset consists of 53 FF samples from mouse embryos spanning E9.5–E16.5 with one-day intervals. Data was generated on Stereo-seq chips and sequenced using paired-end reads on a MGI DNBSEQ-Tx sequencer. The ground truth annotations were computationally identified using Spatially Constrained Clustering (SCC), which is built on top of the Leiden clustering algorithm.
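One way to picture the zonation annotation for the liver dataset is a per-spot score difference that is then binned into layers. The sketch below is only illustrative: the mean-expression scores, the quantile binning, and the `n_layers` default are all assumptions, since the description does not state how scores were computed or how many layers were used.

```python
import numpy as np

def zonation_layers(expr, pericentral_idx, periportal_idx, n_layers=9):
    """Bin spots into zonation layers by a landmark-gene score difference.

    expr            : spots x genes expression matrix
    pericentral_idx : column indices of pericentral landmark genes
    periportal_idx  : column indices of periportal landmark genes
    n_layers        : hypothetical number of layers (not from the source)
    Returns (diff, layer); layer 0 is the most periportal-like bin.
    """
    expr = np.asarray(expr, dtype=float)
    diff = (expr[:, pericentral_idx].mean(axis=1)
            - expr[:, periportal_idx].mean(axis=1))
    # Interior quantile cut points; equal-sized bins are an assumption.
    edges = np.quantile(diff, np.linspace(0, 1, n_layers + 1)[1:-1])
    return diff, np.digitize(diff, edges)
```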
The Visium human brain LIBD DLPFC dataset 1 [11] was obtained from the spatialLIBD Bioconductor package (https://research.libd.org/spatialLIBD). The dataset consists of 12 FF samples from 3 donors. The data was generated on Visium chips and sequenced using paired-end reads on an Illumina NovaSeq 6000 instrument. The ground truth annotations were manually identified based on cytoarchitecture and selected gene markers.
The osmFISH mouse brain somatosensory cortex dataset [12] was obtained from the Linnarsson Lab website (https://linnarssonlab.org/osmFISH). The dataset consists of a single FF sample from the mouse brain somatosensory cortex. Data was generated using osmFISH using a custom 33-gene panel. The ground truth annotation was computationally identified using an iterative graph-based algorithm.
The Visium human breast cancer dataset, originally from 10x Genomics (https://www.10xgenomics.com/resources/datasets/human-breast-cancer-block-a-section-1-1-standard-1-1-0), was obtained from GitHub (https://github.com/JinmiaoChenLab/SEDR_analyses). The dataset consists of a single FF sample of invasive ductal carcinoma breast tissue. The data was generated on a Visium chip and sequenced
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets to use in the tutorials of scCellFie:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Spatial transcriptomics (ST) technologies enable high throughput gene expression characterization within thin tissue sections. However, comparing spatial observations across sections, samples, and technologies remains challenging. To address this challenge, we developed STalign to align ST datasets in a manner that accounts for partially matched tissue sections and other local non-linear distortions using diffeomorphic metric mapping. We apply STalign to align ST datasets within and across technologies as well as to align ST datasets to a 3D common coordinate framework. We show that STalign achieves high gene expression and cell-type correspondence across matched spatial locations that is significantly improved over landmark-based affine alignments. Applying STalign to align ST datasets of the mouse brain to the 3D common coordinate framework from the Allen Brain Atlas, we highlight how STalign can be used to lift over brain region annotations and enable the interrogation of compositional heterogeneity across anatomical structures. STalign is available as an open-source Python toolkit at https://github.com/JEFworks-Lab/STalign and as supplementary software with additional documentation and tutorials available at https://jef.works/STalign.
Here we have included alignment results that were used in performance analysis of STalign:
We aligned Slice 2 Replicate 3 to Slice 2 Replicate 2 of the MERFISH mouse coronal brain sections available from Vizgen Data Release V1.0. May 2021 (https://info.vizgen.com/mouse-brain-map).
Additionally, we aligned Slice 2 Replicate 3 to a Visium dataset of an FFPE-preserved adult mouse brain, obtained from the 10x Datasets website (Spatial Gene Expression Dataset by Space Ranger 1.3.0; https://www.10xgenomics.com/resources/datasets/adult-mouse-brain-ffpe-1-standard-1-3-0).
Furthermore, we performed alignments with the 50 µm resolution 3D Allen Reference Atlas Nissl common coordinate framework, CCF (https://help.brain-map.org/display/mouseconnectivity/API). We applied STalign to align the Allen CCF to each of the 9 MERFISH slices (3 slice locations with 3 biological replicates) provided by Vizgen. Because the Allen CCF has annotated brain regions, we were able to lift over those brain region annotations to label all cells in the MERFISH datasets.
Also, since the STalign mappings from the Allen CCF to the MERFISH slices are invertible, for each slice we can apply the inverse of the mapping to get cell positions in the Allen CCF coordinates.
To evaluate the 3D CCF alignment, we performed unified transcriptional clustering analysis and cell-type annotation. All MERFISH datasets were combined. Transcriptional clustering analysis and cell-type annotation were performed using the SCANPY package [version 1.9.1]. Data were normalized to counts per million (scanpy: normalize_total) and log transformed (scanpy: log1p). PCA (scanpy: pca) was computed on the cell by gene matrix. A neighborhood graph of cells using the top 10 PCs and 10 nearest neighbors was created (scanpy: neighbors), and Leiden clustering was performed on this graph (scanpy: leiden) to identify 29 clusters. Differentially expressed genes were extracted from each cluster (scanpy: rank_genes_groups), and cell types were annotated based on marker genes in each cluster.
This updated (v2) cell-type annotation file contains a new column with simplified cell types. Briefly, we fixed typos, standardized lower case/upper case formats, and merged the subclasses of each cell type. For example, subclasses of astrocytes, which were originally labeled as “Astrocytes”, “Astrocytes(1)”, “Astrocytes(2)”, “Astrocytes(3)”, are all labeled as “Astrocytes” in the added column.
Note: Cell IDs may have been altered from the original string of numbers through reading and writing across programming languages that handle numbers with different precision. If using R to read the files shared here, one can find the cells in STalign_celltypeannotations_merfishslices_v2.csv.gz that correspond with STalign_SXRX_with_structure_id_name.csv.gz when cell IDs are formatted as a double in scientific notation, which is how R will read the file automatically.
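The first steps of this pipeline (counts-per-million normalization, log transform, and a 10-component PCA) can be sketched in plain NumPy. This is a stand-in for illustration only, not SCANPY's implementation, and the function names are hypothetical; the neighbor graph, Leiden clustering, and marker-gene ranking are not reproduced.

```python
import numpy as np

def cpm_log1p(counts):
    """Normalize each cell to one million counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0             # leave all-zero cells at zero
    return np.log1p(counts / totals * 1e6)

def top_pcs(x, n_pcs=10):
    """Project cells onto the leading principal components via SVD."""
    centered = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    n = min(n_pcs, len(s))
    return u[:, :n] * s[:n]               # cells x n principal-component scores
```

The resulting score matrix is the kind of input from which a 10-nearest-neighbor graph would be built before clustering.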
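The precision issue in this note can be demonstrated with a short sketch. The cell IDs below are hypothetical examples of long numeric strings, and the helper names are made up for illustration; the point is only that a 64-bit double keeps roughly 16 significant digits, so long integer IDs do not survive a round trip and nearby IDs can collapse to the same value.

```python
def survives_double(cell_id: str) -> bool:
    """True if the ID is unchanged after being parsed as a 64-bit double."""
    return int(float(cell_id)) == int(cell_id)

def same_double(id_a: str, id_b: str) -> bool:
    """True if two IDs become indistinguishable once parsed as doubles."""
    return float(id_a) == float(id_b)

# Hypothetical 39-digit IDs that differ only in the final digits.
ID_A = "158338042824236264719696604356349910479"
ID_B = "158338042824236264719696604356349910480"
```

Here `survives_double(ID_A)` is false and `same_double(ID_A, ID_B)` is true, which is why matching across the two files works only when both sides use the same double representation. Reading the ID column as text where possible avoids the problem.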
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The present dataset is the result of applying the reference-based RCTD cell-typing method to the 8 µm bins of the public P2 human CRC dataset, as downloaded from the 10x website. Public single-cell HTAN CRC data (validation cohort) from CellXGene was used as a reference to assign cell labels. The choice of CRC for this illustration is purely random.
The dataset structure is as follows:
Interested parties are invited to look at `squidpy/visium-hd-crc-p2/squidpy.h5ad`; it contains the original counts matrix in `adata.X`, associated spatial information, the most abundant cell type in `adata.obs['cell_type']`, and raw RCTD outputs in `adata.uns['cell_types']`. Furthermore, `example.ipynb` contains some toy scripts to get started with analysis. Other files are untransformed and transformed versions of source data and some basic squidpy plots.
Not seeing a result you expected?
Learn how you can add new datasets to our index.
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Skeletal muscle repair is driven by the coordinated self-renewal and fusion of myogenic stem and progenitor cells. Single-cell gene expression analyses of myogenesis have been hampered by the poor sampling of rare and transient cell states that are critical for muscle repair, and do not inform the spatial context that is important for myogenic differentiation. Here, we demonstrate how large-scale integration of single-cell and spatial transcriptomic data can overcome these limitations. We created a single-cell transcriptomic dataset of mouse skeletal muscle by integration, consensus annotation, and analysis of 23 newly collected scRNAseq datasets and 88 publicly available single-cell (scRNAseq) and single-nucleus (snRNAseq) RNA-sequencing datasets. The resulting dataset includes more than 365,000 cells and spans a wide range of ages, injury, and repair conditions. Together, these data enabled identification of the predominant cell types in skeletal muscle, and resolved cell subtypes, including endothelial subtypes distinguished by vessel-type of origin, fibro/adipogenic progenitors defined by functional roles, and many distinct immune populations. The representation of different experimental conditions and the depth of transcriptome coverage enabled robust profiling of sparsely expressed genes. We built a densely sampled transcriptomic model of myogenesis, from stem cell quiescence to myofiber maturation and identified rare, transitional states of progenitor commitment and fusion that are poorly represented in individual datasets. We performed spatial RNA sequencing of mouse muscle at three time points after injury and used the integrated dataset as a reference to achieve a high-resolution, local deconvolution of cell subtypes. We also used the integrated dataset to explore ligand-receptor co-expression patterns and identify dynamic cell-cell interactions in muscle injury response. 
We provide a public web tool to enable interactive exploration and visualization of the data. Our work supports the utility of large-scale integration of single-cell transcriptomic data as a tool for biological discovery.
Methods Mice. The Cornell University Institutional Animal Care and Use Committee (IACUC) approved all animal protocols, and experiments were performed in compliance with its institutional guidelines. Adult C57BL/6J mice (mus musculus) were obtained from Jackson Laboratories (#000664; Bar Harbor, ME) and were used at 4-7 months of age. Aged C57BL/6J mice were obtained from the National Institute of Aging (NIA) Rodent Aging Colony and were used at 20 months of age. For new scRNAseq experiments, female mice were used in each experiment.
Mouse injuries and single-cell isolation. To induce muscle injury, both tibialis anterior (TA) muscles of old (20 months) C57BL/6J mice were injected with 10 µl of notexin (10 µg/ml; Latoxan; France). At 0, 1, 2, 3.5, 5, or 7 days post-injury (dpi), mice were sacrificed and TA muscles were collected and processed independently to generate single-cell suspensions. Muscles were digested with 8 mg/ml Collagenase D (Roche; Switzerland) and 10 U/ml Dispase II (Roche; Switzerland), followed by manual dissociation to generate cell suspensions. Cell suspensions were sequentially filtered through 100 and 40 μm filters (Corning Cellgro #431752 and #431750) to remove debris. Erythrocytes were removed through incubation in erythrocyte lysis buffer (IBI Scientific #89135-030).
Single-cell RNA-sequencing library preparation. After digestion, single-cell suspensions were washed and resuspended in 0.04% BSA in PBS at a concentration of 10^6 cells/ml. Cells were counted manually with a hemocytometer to determine their concentration. Single-cell RNA-sequencing libraries were prepared using the Chromium Single Cell 3’ reagent kit v3 (10x Genomics, PN-1000075; Pleasanton, CA) following the manufacturer’s protocol. Cells were diluted into the Chromium Single Cell A Chip to yield a recovery of 6,000 single-cell transcriptomes. After preparation, libraries were sequenced on a NextSeq 500 (Illumina; San Diego, CA) using 75 cycle high output kits (Index 1 = 8, Read 1 = 26, and Read 2 = 58). Details on estimated sequencing saturation and the number of reads per sample are shown in Sup. Data 1.
Spatial RNA sequencing library preparation. Tibialis anterior muscles of adult (5 mo) C57BL/6J mice were injected with 10 µl notexin (10 µg/ml) at 2, 5, and 7 days prior to collection. Upon collection, tibialis anterior muscles were isolated, embedded in OCT, and frozen fresh in liquid nitrogen. Spatially tagged cDNA libraries were built using the Visium Spatial Gene Expression 3’ Library Construction v1 Kit (10x Genomics, PN-1000187; Pleasanton, CA) (Fig. S7). The optimal tissue permeabilization time for 10 µm thick sections was found to be 15 minutes using the 10x Genomics Visium Tissue Optimization Kit (PN-1000193). H&E stained tissue sections were imaged using a Zeiss PALM MicroBeam laser capture microdissection system, and the images were stitched and processed using Fiji ImageJ software. cDNA libraries were sequenced on an Illumina NextSeq 500 using 150 cycle high output kits (Read 1 = 28 bp, Read 2 = 120 bp, Index 1 = 10 bp, and Index 2 = 10 bp). Frames around the capture area on the Visium slide were aligned manually and spots covering the tissue were selected using Loupe Browser v4.0.0 software (10x Genomics). Sequencing data were then aligned to the mouse reference genome (mm10) using the spaceranger v1.0.0 pipeline (10x Genomics) to generate a feature-by-spot-barcode expression matrix.
Download and alignment of single-cell RNA sequencing data. For all samples available via SRA, parallel-fastq-dump (github.com/rvalieris/parallel-fastq-dump) was used to download raw .fastq files. Samples which were only available as .bam files were converted to .fastq format using bamtofastq from 10x Genomics (github.com/10XGenomics/bamtofastq). Raw reads were aligned to the mm10 reference using cellranger (v3.1.0).
Preprocessing and batch correction of single-cell RNA sequencing datasets. First, ambient RNA signal was removed using the default SoupX (v1.4.5) workflow (autoEstCont and adjustCounts; github.com/constantAmateur/SoupX). Samples were then preprocessed using the standard Seurat (v3.2.1) workflow (NormalizeData, ScaleData, FindVariableFeatures, RunPCA, FindNeighbors, FindClusters, and RunUMAP; github.com/satijalab/seurat). Cells with fewer than 750 features, fewer than 1000 transcripts, or more than 30% of unique transcripts derived from mitochondrial genes were removed. After preprocessing, DoubletFinder (v2.0) was used to identify putative doublets in each dataset individually. BCmvn optimization was used to select the pK parameter. Estimated doublet rates were computed by fitting a linear regression to the expected doublet rates published in the 10x Chromium handbook and evaluating it at the total number of cells remaining after quality filtering. Estimated homotypic doublet rates were also accounted for using the modelHomotypic function. The default pN value (0.25) was used. Putative doublets were then removed from each individual dataset. After preprocessing and quality filtering, we merged the datasets and performed batch correction with three tools, independently: Harmony (v1.0; github.com/immunogenomics/harmony), Scanorama (v1.3; github.com/brianhie/scanorama), and BBKNN (v1.3.12; github.com/Teichlab/bbknn). We then used Seurat to process the integrated data. After initial integration, we removed the noisy cluster and re-integrated the data using each of the three batch-correction tools.
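The quality thresholds and the doublet-rate regression described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Seurat/DoubletFinder code used in the study: the function names (qc_keep_mask, fit_line, estimate_doublet_rate) are hypothetical, "transcripts" is read as total UMI counts per cell, mitochondrial genes are assumed to carry the mouse "mt-" prefix, and the doublet-rate table is a placeholder that should be replaced with the values published by 10x Genomics.

```python
from typing import List, Sequence, Tuple


def qc_keep_mask(counts: Sequence[Sequence[float]], gene_names: Sequence[str],
                 min_features: int = 750, min_counts: float = 1000,
                 max_pct_mito: float = 30.0) -> List[bool]:
    """Return True for each cell (row of counts) that passes all three filters."""
    # Assumption: mouse mitochondrial genes are named with an "mt-" prefix.
    is_mito = [g.lower().startswith("mt-") for g in gene_names]
    keep = []
    for row in counts:
        n_features = sum(1 for c in row if c > 0)   # genes detected in the cell
        n_counts = sum(row)                         # total UMI counts in the cell
        mito = sum(c for c, m in zip(row, is_mito) if m)
        pct_mito = 100.0 * mito / n_counts if n_counts > 0 else 0.0
        keep.append(n_features >= min_features
                    and n_counts >= min_counts
                    and pct_mito <= max_pct_mito)
    return keep


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# PLACEHOLDER table (cells recovered, expected doublet fraction).
# These numbers are illustrative assumptions, not the published 10x values.
PLACEHOLDER_CELLS = [1000, 2000, 4000, 8000]
PLACEHOLDER_RATES = [0.008, 0.016, 0.032, 0.064]


def estimate_doublet_rate(n_cells: int,
                          cells: Sequence[float] = PLACEHOLDER_CELLS,
                          rates: Sequence[float] = PLACEHOLDER_RATES) -> float:
    """Evaluate the fitted line at the number of cells left after QC."""
    slope, intercept = fit_line(cells, rates)
    return slope * n_cells + intercept
```

In this sketch the estimated rate for a dataset is obtained by calling estimate_doublet_rate with the number of cells that pass qc_keep_mask; the homotypic adjustment performed by modelHomotypic is not reproduced here.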
Cell type annotation. Cell types were determined for each integration method independently. For Harmony and Scanorama, dimensions accounting for 95% of the total variance were used to generate SNN graphs (Seurat::FindNeighbors). Louvain clustering was then performed on the output graphs (including the corrected graph output by BBKNN) using Seurat::FindClusters. A clustering resolution of 1.2 was used for Harmony (25 initial clusters), BBKNN (28 initial clusters), and Scanorama (38 initial clusters). Cell types were determined based on expression of canonical genes (Fig. S3). Clusters which had similar canonical marker gene expression patterns were merged.
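One plausible reading of "dimensions accounting for 95% of the total variance" is the smallest number of leading dimensions whose cumulative variance reaches 95% of the variance summed over all computed dimensions. The sketch below illustrates that reading only; the function name n_dims_for_variance is hypothetical, and the use of per-dimension standard deviations as input (as Seurat stores for a reduction) is an assumption about how the cutoff was computed.

```python
from typing import Sequence


def n_dims_for_variance(stdevs: Sequence[float], fraction: float = 0.95) -> int:
    """Smallest number of leading dimensions whose cumulative variance
    reaches `fraction` of the total variance across all given dimensions."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    variances = [s * s for s in stdevs]   # variance = standard deviation squared
    total = sum(variances)
    if total <= 0:
        raise ValueError("total variance must be positive")
    cumulative = 0.0
    for i, v in enumerate(variances, start=1):
        cumulative += v
        if cumulative >= fraction * total:
            return i
    return len(variances)
```

The returned count would then be passed as the range of dimensions (1 to n) when building the neighbor graph.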
Pseudotime workflow. Cells were subset based on the consensus cell types between all three integration methods. Harmony embedding values from the dimensions accounting for 95% of the total variance were used for further dimensional reduction with PHATE, using phateR (v1.0.4) (github.com/KrishnaswamyLab/phateR).
Deconvolution of spatial RNA sequencing spots. Spot deconvolution was performed using the deconvolution module in BayesPrism (previously known as “Tumor microEnvironment Deconvolution”, TED, v1.0; github.com/Danko-Lab/TED). First, myogenic cells were re-labeled, according to binning along the first PHATE dimension, as “Quiescent MuSCs” (bins 4-5), “Activated MuSCs” (bins 6-7), “Committed Myoblasts” (bins 8-10), and “Fusing Myocytes” (bins 11-18). Culture-associated muscle stem cells were ignored and myonuclei labels were retained as “Myonuclei (Type IIb)” and “Myonuclei (Type IIx)”. Next, highly and differentially expressed genes across the 25 groups of cells were identified with differential gene expression analysis using Seurat (FindAllMarkers, using the Wilcoxon Rank Sum Test; results in Sup. Data 2). The resulting genes were filtered based on average log2-fold change (avg_logFC > 1) and the fraction of cells within the cluster that express each gene (pct.expressed > 0.5), yielding 1,069 genes. Mitochondrial and ribosomal protein genes were also removed from this list, in line with recommendations in the BayesPrism vignette. For each of the cell types, mean raw counts were calculated across the 1,069 genes to generate a gene expression profile for BayesPrism. Raw counts for each spot were then passed to the run.Ted function, using