8 datasets found
  1. Data from: Large-scale integration of single-cell transcriptomic data...

    • data.niaid.nih.gov
    • dataone.org
    • +1 more
    zip
    Updated Dec 14, 2021
    Cite
    David McKellar; Iwijn De Vlaminck; Benjamin Cosgrove (2021). Large-scale integration of single-cell transcriptomic data captures transitional progenitor states in mouse skeletal muscle regeneration [Dataset]. http://doi.org/10.5061/dryad.t4b8gtj34
    Available download formats: zip
    Dataset updated
    Dec 14, 2021
    Dataset provided by
    Cornell University
    Authors
    David McKellar; Iwijn De Vlaminck; Benjamin Cosgrove
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Skeletal muscle repair is driven by the coordinated self-renewal and fusion of myogenic stem and progenitor cells. Single-cell gene expression analyses of myogenesis have been hampered by the poor sampling of rare and transient cell states that are critical for muscle repair, and do not inform the spatial context that is important for myogenic differentiation. Here, we demonstrate how large-scale integration of single-cell and spatial transcriptomic data can overcome these limitations. We created a single-cell transcriptomic dataset of mouse skeletal muscle by integration, consensus annotation, and analysis of 23 newly collected scRNAseq datasets and 88 publicly available single-cell (scRNAseq) and single-nucleus (snRNAseq) RNA-sequencing datasets. The resulting dataset includes more than 365,000 cells and spans a wide range of ages, injury, and repair conditions. Together, these data enabled identification of the predominant cell types in skeletal muscle, and resolved cell subtypes, including endothelial subtypes distinguished by vessel-type of origin, fibro/adipogenic progenitors defined by functional roles, and many distinct immune populations. The representation of different experimental conditions and the depth of transcriptome coverage enabled robust profiling of sparsely expressed genes. We built a densely sampled transcriptomic model of myogenesis, from stem cell quiescence to myofiber maturation and identified rare, transitional states of progenitor commitment and fusion that are poorly represented in individual datasets. We performed spatial RNA sequencing of mouse muscle at three time points after injury and used the integrated dataset as a reference to achieve a high-resolution, local deconvolution of cell subtypes. We also used the integrated dataset to explore ligand-receptor co-expression patterns and identify dynamic cell-cell interactions in muscle injury response. 
    We provide a public web tool to enable interactive exploration and visualization of the data. Our work supports the utility of large-scale integration of single-cell transcriptomic data as a tool for biological discovery.

    Methods

    Mice. The Cornell University Institutional Animal Care and Use Committee (IACUC) approved all animal protocols, and experiments were performed in compliance with its institutional guidelines. Adult C57BL/6J mice (Mus musculus) were obtained from Jackson Laboratories (#000664; Bar Harbor, ME) and were used at 4-7 months of age. Aged C57BL/6J mice were obtained from the National Institute on Aging (NIA) Rodent Aging Colony and were used at 20 months of age. For new scRNAseq experiments, female mice were used in each experiment.

    Mouse injuries and single-cell isolation. To induce muscle injury, both tibialis anterior (TA) muscles of old (20 months) C57BL/6J mice were injected with 10 µl of notexin (10 µg/ml; Latoxan; France). At 0, 1, 2, 3.5, 5, or 7 days post-injury (dpi), mice were sacrificed and TA muscles were collected and processed independently to generate single-cell suspensions. Muscles were digested with 8 mg/ml Collagenase D (Roche; Switzerland) and 10 U/ml Dispase II (Roche; Switzerland), followed by manual dissociation to generate cell suspensions. Cell suspensions were sequentially filtered through 100 and 40 μm filters (Corning Cellgro #431752 and #431750) to remove debris. Erythrocytes were removed through incubation in erythrocyte lysis buffer (IBI Scientific #89135-030).

    Single-cell RNA-sequencing library preparation. After digestion, single-cell suspensions were washed and resuspended in 0.04% BSA in PBS at a concentration of 10^6 cells/ml. Cells were counted manually with a hemocytometer to determine their concentration. Single-cell RNA-sequencing libraries were prepared using the Chromium Single Cell 3’ reagent kit v3 (10x Genomics, PN-1000075; Pleasanton, CA) following the manufacturer’s protocol. Cells were diluted into the Chromium Single Cell A Chip to yield a recovery of 6,000 single-cell transcriptomes. After preparation, libraries were sequenced on a NextSeq 500 (Illumina; San Diego, CA) using 75 cycle high output kits (Index 1 = 8, Read 1 = 26, and Read 2 = 58). Details on estimated sequencing saturation and the number of reads per sample are shown in Sup. Data 1.

    Spatial RNA sequencing library preparation. Tibialis anterior muscles of adult (5 mo) C57BL/6J mice were injected with 10 µl notexin (10 µg/ml) at 2, 5, and 7 days prior to collection. Upon collection, tibialis anterior muscles were isolated, embedded in OCT, and frozen fresh in liquid nitrogen. Spatially tagged cDNA libraries were built using the Visium Spatial Gene Expression 3’ Library Construction v1 Kit (10x Genomics, PN-1000187; Pleasanton, CA) (Fig. S7). Optimal tissue permeabilization time for 10 µm thick sections was found to be 15 minutes using the 10x Genomics Visium Tissue Optimization Kit (PN-1000193). H&E stained tissue sections were imaged using a Zeiss PALM MicroBeam laser capture microdissection system, and the images were stitched and processed using Fiji ImageJ software. cDNA libraries were sequenced on an Illumina NextSeq 500 using 150 cycle high output kits (Read 1 = 28 bp, Read 2 = 120 bp, Index 1 = 10 bp, and Index 2 = 10 bp). Frames around the capture area on the Visium slide were aligned manually and spots covering the tissue were selected using Loop Browser v4.0.0 software (10x Genomics). Sequencing data was then aligned to the mouse reference genome (mm10) using the spaceranger v1.0.0 pipeline (10x Genomics) to generate a feature-by-spot-barcode expression matrix.

    Download and alignment of single-cell RNA sequencing data. For all samples available via SRA, parallel-fastq-dump (github.com/rvalieris/parallel-fastq-dump) was used to download raw .fastq files. Samples which were only available as .bam files were converted to .fastq format using bamtofastq from 10x Genomics (github.com/10XGenomics/bamtofastq). Raw reads were aligned to the mm10 reference using cellranger (v3.1.0).

    Preprocessing and batch correction of single-cell RNA sequencing datasets. First, ambient RNA signal was removed using the default SoupX (v1.4.5) workflow (autoEstCounts and adjustCounts; github.com/constantAmateur/SoupX). Samples were then preprocessed using the standard Seurat (v3.2.1) workflow (NormalizeData, ScaleData, FindVariableFeatures, RunPCA, FindNeighbors, FindClusters, and RunUMAP; github.com/satijalab/seurat). Cells with fewer than 750 features, fewer than 1000 transcripts, or more than 30% of unique transcripts derived from mitochondrial genes were removed. After preprocessing, DoubletFinder (v2.0) was used to identify putative doublets in each dataset individually. BCmvn optimization was used for pK parameterization. Estimated doublet rates were computed by fitting the total number of cells after quality filtering to a linear regression of the expected doublet rates published in the 10x Chromium handbook. Estimated homotypic doublet rates were also accounted for using the modelHomotypic function. The default pN value (0.25) was used. Putative doublets were then removed from each individual dataset. After preprocessing and quality filtering, we merged the datasets and performed batch correction with three tools, independently: Harmony (github.com/immunogenomics/harmony) (v1.0), Scanorama (github.com/brianhie/scanorama) (v1.3), and BBKNN (github.com/Teichlab/bbknn) (v1.3.12). We then used Seurat to process the integrated data. After initial integration, we removed the noisy cluster and re-integrated the data using each of the three batch-correction tools.
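
    The quality filter described above can be sketched in a few lines. This is only an illustration of the three stated thresholds in plain Python, not the Seurat code the authors ran; the `CellQC` record and its field names are hypothetical.

```python
from dataclasses import dataclass

# Thresholds quoted in the Methods text.
MIN_FEATURES = 750
MIN_TRANSCRIPTS = 1000
MAX_MITO_PERCENT = 30.0


@dataclass
class CellQC:
    """Per-cell QC summary (hypothetical field names, for illustration only)."""
    barcode: str
    n_features: int      # number of genes detected in the cell
    n_transcripts: int   # number of unique transcripts (UMIs)
    mito_percent: float  # percent of unique transcripts from mitochondrial genes


def passes_qc(cell: CellQC) -> bool:
    """Return True if the cell survives all three filters."""
    return (
        cell.n_features >= MIN_FEATURES
        and cell.n_transcripts >= MIN_TRANSCRIPTS
        and cell.mito_percent <= MAX_MITO_PERCENT
    )


def filter_cells(cells):
    """Keep only the cells that pass QC, preserving input order."""
    return [c for c in cells if passes_qc(c)]
```

    Cells sitting exactly on a threshold are kept here, because the text removes only cells with "fewer than" or "more than" the stated values.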

    Cell type annotation. Cell types were determined for each integration method independently. For Harmony and Scanorama, dimensions accounting for 95% of the total variance were used to generate SNN graphs (Seurat::FindNeighbors). Louvain clustering was then performed on the output graphs (including the corrected graph output by BBKNN) using Seurat::FindClusters. A clustering resolution of 1.2 was used for Harmony (25 initial clusters), BBKNN (28 initial clusters), and Scanorama (38 initial clusters). Cell types were determined based on expression of canonical genes (Fig. S3). Clusters which had similar canonical marker gene expression patterns were merged.
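
    As a rough sketch of the "dimensions accounting for 95% of the total variance" rule, the function below picks the smallest number of leading components whose cumulative variance share reaches a target fraction. It assumes the input is a list of per-component standard deviations and that "total variance" means the sum over the components that were computed; the function name is hypothetical and this is not the authors' code.

```python
def dims_for_variance(stdevs, fraction=0.95):
    """Smallest number of leading components whose variance share reaches `fraction`.

    `stdevs` holds per-component standard deviations, ordered from the
    first component to the last (an assumption about the input format).
    """
    if not stdevs:
        raise ValueError("stdevs must not be empty")
    variances = [s * s for s in stdevs]
    total = sum(variances)
    running = 0.0
    for n_dims, variance in enumerate(variances, start=1):
        running += variance
        if running / total >= fraction:
            return n_dims
    return len(variances)
```

    The returned count would then be used as the upper bound of the dimension range passed to the neighbor-graph step.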

    Pseudotime workflow. Cells were subset based on the consensus cell types between all three integration methods. Harmony embedding values from the dimensions accounting for 95% of the total variance were used for further dimensional reduction with PHATE, using phateR (v1.0.4) (github.com/KrishnaswamyLab/phateR).

    Deconvolution of spatial RNA sequencing spots. Spot deconvolution was performed using the deconvolution module in BayesPrism (previously known as “Tumor microEnvironment Deconvolution”, TED, v1.0; github.com/Danko-Lab/TED). First, myogenic cells were re-labeled, according to binning along the first PHATE dimension, as “Quiescent MuSCs” (bins 4-5), “Activated MuSCs” (bins 6-7), “Committed Myoblasts” (bins 8-10), and “Fusing Myocytes” (bins 11-18). Culture-associated muscle stem cells were ignored and myonuclei labels were retained as “Myonuclei (Type IIb)” and “Myonuclei (Type IIx)”. Next, highly and differentially expressed genes across the 25 groups of cells were identified with differential gene expression analysis using Seurat (FindAllMarkers, using Wilcoxon Rank Sum Test; results in Sup. Data 2). The resulting genes were filtered based on average log2-fold change (avg_logFC > 1) and the percentage of cells within the cluster which express each gene (pct.expressed > 0.5), yielding 1,069 genes. Mitochondrial and ribosomal protein genes were also removed from this list, in line with recommendations in the BayesPrism vignette. For each of the cell types, mean raw counts were calculated across the 1,069 genes to generate a gene expression profile for BayesPrism. Raw counts for each spot were then passed to the run.Ted function, using

  2. Visium DLPFC preprocessed .h5ad

    • figshare.com
    hdf
    Updated May 30, 2023
    + more versions
    Cite
    Marco Varrone (2023). Visium DLPFC preprocessed .h5ad [Dataset]. http://doi.org/10.6084/m9.figshare.22004273.v2
    Available download formats: hdf
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Authors
    Marco Varrone
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Preprocessed .h5ad data from https://github.com/LieberInstitute/spatialLIBD used for the benchmarking of the spatial clustering methods in CellCharter.

  3. Data from: Profiling the Heterogeneity of Colorectal Cancer Consensus...

    • zenodo.org
    zip
    Updated Dec 20, 2024
    Cite
    Alberto Valdeolivas (2024). Profiling the Heterogeneity of Colorectal Cancer Consensus Molecular Subtypes using Spatial Transcriptomics: datasets [Dataset]. http://doi.org/10.5281/zenodo.7760264
    Available download formats: zip
    Dataset updated
    Dec 20, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alberto Valdeolivas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    You can find here the datasets used in the publication:

    Valdeolivas, A., Amberg, B., Giroud, N. et al. Profiling the heterogeneity of colorectal cancer consensus molecular subtypes using spatial transcriptomics. npj Precis. Onc. 8, 10 (2024). https://doi.org/10.1038/s41698-023-00488-4

    This record contains the raw Spatial Transcriptomics data, the spot categorization made by pathologists, the deconvolution results, and the intermediary files required to run the analysis described in our manuscript, which is available on GitHub:

    https://github.com/alberto-valdeolivas/ST_CRC_CMS

    In particular, you will find here several zip compressed files with the following content:

    - Intermediary_FileObjects.zip: The intermediary files generated in the scripts hosted in the github repo and required to run some later scripts.

    - IntermediaryFiles_ST_CRC_LiverMetastasis.zip: The intermediary files generated in the scripts hosted in the github repo and required to run some of the scripts dealing with the external CRC ST dataset used in our manuscript.

    - Pathology_SpotAnnotations.zip: The anatomical category (tumor, stroma, non-neoplastic mucosa...) assigned by the pathologists to each spot across our set of ST samples.

    - SN048_A121573_Rep1.zip, SN048_A121573_Rep2.zip, SN048_A416371_Rep1.zip, SN048_A416371_Rep2.zip, SN123_A551763_Rep1.zip, SN123_A595688_Rep1.zip, SN123_A798015_Rep1.zip, SN123_A938797_Rep1_X.zip, SN124_A551763_Rep2.zip, SN124_A595688_Rep2.zip, SN124_A798015_Rep2.zip, SN124_A938797_Rep2.zip, SN84_A120838_Rep1.zip, SN84_A120838_Rep2.zip: The output of Space Ranger, including processed count data matrices and histological images, for the ST data generated in this study.

    - DeconvolutionResults_ST_CRC_BelgianCohort.zip, DeconvolutionResults_ST_CRC_KoreanCohort.zip, DeconvolutionResults_ST_CRC_LiverMetastasis.zip: These files contain the main results obtained when using the Cell2Location deconvolution approach in our samples (with two different references: Korean and Belgian cohorts) and in the external set of CRC ST samples (only Korean cohort)

    - We have also uploaded the whole slide images (WSI). These are the files with an ndpi extension:


    Visium Frozen_SN V10B01-048_new CRC_2021_02_16.ndp ... (samples A121573_Rep1, A121573_Rep2, A416371_Rep1 and A416371_Rep2), Visium Frozen_SN V19S23-084.ndpi (samples A120838_Rep1 and A120838_Rep2), Visium Frozen_SN V19S23-123.ndpi (samples A551763_Rep1, A595688_Rep1, A798015_Rep1, A938797_Rep1) and Visium Frozen_SN V19S23-124.ndpi (samples A551763_Rep2, A595688_Rep2, A798015_Rep2 and A938797_Rep2)

    - We have now included the FASTQ and BAM files for the different samples, excluding replicate 1 of the A938797 sample, whose FASTQ files are missing:

    IMPORTANT: FASTQ files are in version 1, while BAM files are in version 2 of the dashboards reported below:

    1. Sample S1_Cec (A551763)
    2. Sample S2_Col_R (A595688)
    3. Sample S3_Col_R (A416371)
    4. Sample S4_Col_Sig (A120838)
    5. Sample S5_Rec (A121573)
    6. Sample S6_Rec (A938797)
    7. Sample S7_Rec/Sig (A798015)
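
    To relate the Space Ranger archives listed above to the seven samples, the archive names can be parsed and grouped by sample ID. The sketch below uses only the file names and sample labels given in this description; the regular expression and the reading of the leading `SN…` token as a run or slide prefix are assumptions, not something the record documents.

```python
import re
from collections import defaultdict

# Sample labels copied from the numbered list in the description.
SAMPLE_LABELS = {
    "A551763": "S1_Cec",
    "A595688": "S2_Col_R",
    "A416371": "S3_Col_R",
    "A120838": "S4_Col_Sig",
    "A121573": "S5_Rec",
    "A938797": "S6_Rec",
    "A798015": "S7_Rec/Sig",
}

# Space Ranger output archives named in the description.
ARCHIVES = [
    "SN048_A121573_Rep1.zip", "SN048_A121573_Rep2.zip",
    "SN048_A416371_Rep1.zip", "SN048_A416371_Rep2.zip",
    "SN123_A551763_Rep1.zip", "SN123_A595688_Rep1.zip",
    "SN123_A798015_Rep1.zip", "SN123_A938797_Rep1_X.zip",
    "SN124_A551763_Rep2.zip", "SN124_A595688_Rep2.zip",
    "SN124_A798015_Rep2.zip", "SN124_A938797_Rep2.zip",
    "SN84_A120838_Rep1.zip", "SN84_A120838_Rep2.zip",
]

# Assumed pattern: <prefix>_<sample id>_Rep<n>, with an optional suffix such as "_X".
ARCHIVE_PATTERN = re.compile(r"^(SN\d+)_(A\d+)_Rep(\d+)(?:_\w+)?\.zip$")


def group_archives(names):
    """Map each sample label to its (prefix, replicate, file name) tuples."""
    grouped = defaultdict(list)
    for name in names:
        match = ARCHIVE_PATTERN.match(name)
        if match is None:
            raise ValueError(f"unexpected archive name: {name}")
        prefix, sample_id, replicate = match.groups()
        grouped[SAMPLE_LABELS[sample_id]].append((prefix, int(replicate), name))
    # Sort each sample's archives by replicate number.
    return {label: sorted(items, key=lambda item: item[1])
            for label, items in grouped.items()}
```

    Calling `group_archives(ARCHIVES)` gives one entry per sample, each holding its two replicate archives.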

  4. Spatially Resolved Transcriptomics Mining in 3D and Virtual Reality...

    • b2find.eudat.eu
    Updated Aug 28, 2024
    Cite
    (2024). Spatially Resolved Transcriptomics Mining in 3D and Virtual Reality Environments with VR-Omics (Software and Data) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/a6b7d012-f18a-56ff-a3f1-68353d954e46
    Dataset updated
    Aug 28, 2024
    Description

    Here, we summarise available data and source code regarding the publication "Spatially Resolved Transcriptomics Mining in 3D and Virtual Reality Environments with VR-Omics".

    Abstract: Spatially resolved transcriptomics (SRT) technologies produce complex, multi-dimensional data sets of gene expression information that can be obtained at subcellular spatial resolution. While several computational tools are available to process and analyse SRT data, no platforms facilitate the visualisation and interaction with SRT data in an immersive manner. Here we present VR-Omics, a computational platform that supports the analysis, visualisation, exploration, and interpretation of SRT data compatible with any SRT technology. VR-Omics is the first tool capable of analysing and visualising data generated by multiple SRT platforms in both 2D desktop and virtual reality environments. It incorporates an in-built workflow to automatically pre-process and spatially mine the data within a user-friendly graphical user interface. Benchmarking VR-Omics against other comparable software demonstrates its seamless end-to-end analysis of SRT data, hence making SRT data processing and mining universally accessible. VR-Omics is an open-source software freely available at: https://ramialison-lab.github.io/pages/vromics.html or below.

    For development of VR-Omics, publicly available data was used. The Visium data from 10x Genomics is available at the 10x Genomics website: https://www.10xgenomics.com/resources/datasets. The 10x Genomics Xenium dataset is available under: https://www.10xgenomics.com/products/xenium-in-situ/preview-dataset-human-breast. The STOmics database is available at: https://db.cngb.org/stomics. The Vizgen MERFISH data release program can be accessed via: https://vizgen.com/data-release-program/. The Tomo-seq data is available via the publication https://doi.org/10.1016/j.cell.2014.09.038, which also contains the MATLAB code for the 3D data reconstruction. The Visium demo was adapted from Asp et al. and can be accessed via the related publication https://doi.org/10.1016/j.cell.2019.11.025 or at https://data.mendeley.com/datasets/zkzvyprd5z/1. The demo datasets generated for VR-Omics can be found at: https://doi.org/10.26180/22207579.v1 or below for download. The 3D Visium data set of the human developing heart adapted from Asp et al. can be found within the application and can be accessed from the main menu following the Visium, Demo context menu.

    The complete standalone version of VR-Omics (containing Python AW and Visualiser) can be downloaded at https://ramialison-lab.github.io/pages/vromics.html or at https://doi.org/10.26180/20220312.v1 or below for download. Alternatively, the code is available at GitHub (https://github.com/Ramialison-Lab/VR-Omics). To use the GitHub version, an installation of Unity Gaming Engine (version 2021.3.11f1) is required. This version does not include the Python AW. The Python AW can be accessed at: https://doi.org/10.26180/22207903.v1. More information on running VR-Omics via Unity can be found in the full documentation accessible at https://ramialison-lab.github.io/pages/vromics.html.

  5. Beyond benchmarking: an expert-guided consensus approach to spatially aware...

    • zenodo.org
    pdf, tsv, zip
    Updated Jun 27, 2025
    Cite
    Jieran Sun; Kirti Biharie; Peiying Cai; Niklas Müller-Bötticher; Paul Kiessling; Meghan Turner; Søren Helweg Dam; Florian Heyl; Sarusan Kathirchelvan; Martin Emons; Samuel Gunz; Sven Twardziok; Amin El-Heliebi; Martin Zacharias; Roland Eils; Marcel Reinders; Raphael Gottardo; Christoph Kuppe; Brian Long; Ahmed Mahfouz; Mark Robinson; Naveed Ishaque (2025). Beyond benchmarking: an expert-guided consensus approach to spatially aware clustering - Supporting Data [Dataset]. http://doi.org/10.5281/zenodo.15487520
    Available download formats: zip, pdf, tsv
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jieran Sun; Kirti Biharie; Peiying Cai; Niklas Müller-Bötticher; Paul Kiessling; Meghan Turner; Søren Helweg Dam; Florian Heyl; Sarusan Kathirchelvan; Martin Emons; Samuel Gunz; Sven Twardziok; Amin El-Heliebi; Martin Zacharias; Roland Eils; Marcel Reinders; Raphael Gottardo; Christoph Kuppe; Brian Long; Ahmed Mahfouz; Mark Robinson; Naveed Ishaque
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This Zenodo record consists of:

    The following datasets are included:

    CosMx human liver dataset (cosmx_liver)

    The CosMx human liver dataset was obtained from the NanoString website (https://nanostring.com/products/cosmx-spatial-molecular-imager/ffpe-dataset/human-liver-rna-ffpe-dataset). The dataset consists of 2 Formalin-Fixed Paraffin-Embedded (FFPE) samples from 2 patients, one being normal liver and the other from a hepatocellular carcinoma patient with grade G3 cancer. Data was generated on the CosMx platform using the Human Universal Cell Characterization Panel 1000 plex. The ground truth annotations were computationally identified using Mclust clustering on the frequency of each cell type among its 200 nearest neighbors. See NanoString_Data_License_Agreement.pdf for the license terms.
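
    The neighborhood feature described above, the frequency of each cell type among a cell's nearest neighbors, can be illustrated with a brute-force sketch. It assumes 2D cell centroids and that a cell is not counted among its own neighbors (the description does not say); the function name is hypothetical, and the subsequent Mclust clustering step is not shown.

```python
import math
from collections import Counter


def knn_composition(coords, labels, k=200):
    """For each cell, return the fraction of each cell type among its k nearest neighbors.

    coords: list of (x, y) centroids; labels: cell type per cell.
    Brute-force O(n^2) search, meant for illustration on small inputs.
    """
    if not 0 < k < len(coords):
        raise ValueError("k must be between 1 and the number of cells minus 1")
    cell_types = sorted(set(labels))
    compositions = []
    for i, point in enumerate(coords):
        # Distances to every other cell, excluding the cell itself (assumption).
        neighbors = sorted(
            (math.dist(point, other), j)
            for j, other in enumerate(coords)
            if j != i
        )
        counts = Counter(labels[j] for _, j in neighbors[:k])
        compositions.append({t: counts[t] / k for t in cell_types})
    return compositions
```

    Each returned dictionary is one row of the feature matrix that a clustering method would then partition into spatial domains.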

    CosMx human non-small-cell lung cancer dataset (cosmx_lung)

    The CosMx human non-small-cell lung cancer dataset was obtained from the NanoString website (https://staging.nanostring.com/products/cosmx-spatial-molecular-imager/ffpe-dataset/nsclc-ffpe-dataset). The dataset consists of 8 FFPE samples from 5 patients presenting with non-small-cell lung cancer grade G1-G3. Data was generated on a CosMx prototype instrument using a 960 gene panel [1]. The ground truth annotations were computationally identified using Mclust clustering on the frequency of each cell type among its 200 nearest neighbors. See NanoString_Data_License_Agreement.pdf for the license terms.

    MERSCOPE mouse brain thalamus (abc_atlas_wmb_thalamus)

    The MERFISH mouse brain thalamus dataset [2] was obtained from the Brain Knowledge Platform (https://alleninstitute.github.io/abc_atlas_access/descriptions/MERFISH-C57BL6J-638850.html). The dataset consists of 59 fresh frozen (FF) serial full coronal sections at 200-µm intervals spanning one entire mouse brain. Data was generated on a Vizgen MERSCOPE instrument using a custom gene panel of 500 genes. The ground truth annotations were identified by aligning the MERFISH data to the CCFv3 coordinate space and labeling cells with the corresponding CCFv3 anatomical parcellation term [3]. Only the thalamus (TH; CCFv3 structure ID 549) and hypothalamic zona incerta (ZI; CCFv3 structure ID 797) were analyzed in this study. Spatially variable genes in the thalamus were identified by differential gene expression analysis on neighboring consensus clusters.

    MERFISH human developmental heart dataset (merfish_devheart)

    The MERFISH human developmental heart dataset [4] was obtained from Dryad (https://datadryad.org/stash/dataset/doi:10.5061/dryad.w0vt4b8vp). The dataset consists of 4 FF samples from 2 donors at 13 and 15 post-conception weeks (PCW). Data was generated using MERFISH with a custom 238-gene panel. The ground truth annotations (referred to as cellular communities in the original study) were computationally identified using k-means clustering of relative cell-type composition within 150µm of each cell.
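
    A fixed-radius variant of the same idea matches the "relative cell-type composition within 150µm of each cell" used here. The sketch below assumes coordinates are in micrometers and counts the cell itself as part of its neighborhood so that no neighborhood is empty; both are assumptions, the function name is hypothetical, and the k-means step is not shown.

```python
import math
from collections import Counter


def radius_composition(coords, labels, radius=150.0):
    """Relative cell-type composition within `radius` of each cell.

    Returns the sorted cell-type names and, per cell, a list of fractions
    in that order. The cell itself is included (assumption).
    """
    cell_types = sorted(set(labels))
    compositions = []
    for point in coords:
        counts = Counter(
            label
            for other, label in zip(coords, labels)
            if math.dist(point, other) <= radius
        )
        n_neighbors = sum(counts.values())
        compositions.append([counts[t] / n_neighbors for t in cell_types])
    return cell_types, compositions
```

    Unlike the k-nearest-neighbor version, the number of cells contributing to each row varies with local density.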

    STARmap PLUS mouse brain dataset (STARmap_plus)

    The STARmap PLUS mouse brain dataset [5] was obtained from Zenodo (https://zenodo.org/records/8327576). The dataset consists of 20 FF samples from 3 mice. Data was generated using STARmap PLUS with a custom 1,022-gene panel. The ground truth annotations were manually identified by aligning the data to the CCFv3.

    Xenium human breast cancer dataset (xenium-ffpe-bc-idc)

    The Xenium breast cancer dataset was obtained from the 10x website (https://www.10xgenomics.com/datasets/xenium-ffpe-human-breast-with-custom-add-on-panel-1-standard). The dataset consists of 1 FFPE sample from a patient with infiltrating ductal carcinoma breast cancer. Data was generated on a Xenium Analyzer using the Xenium human breast gene expression panel v1 (280 genes) with 100 additional custom genes. The ground truth annotation was manually identified using the matched histopathology image, annotating for eight region types: ductal carcinoma in-situ, invasive tumor, normal ducts, immune cells, cysts, blood vessels, adipose tissue, and stroma [6].

    Xenium mouse brain dataset (xenium-mouse-brain-SergioSalas)

    The Xenium mouse brain dataset was obtained from the 10x genomics website (https://www.10xgenomics.com/datasets/fresh-frozen-mouse-brain-replicates-1-standard). The dataset consists of 1 FF sample of a full coronal section. Data was generated on a Xenium Analyzer using the v1 mouse brain gene expression panel (247 genes). The ground truth annotation was manually identified using the mouse coronal P56 sample from Allen Brain Atlas [3] to specify anatomical regions [7].

    Slide-seqV2 mouse brain olfactory bulb dataset (slideseq2_olfactory_bulb)

    The Slide-seqV2 mouse brain olfactory bulb dataset [8] was obtained from the STOmicsDB website (https://db.cngb.org/stomics/datasets/STDS0000172/data). The dataset consists of 20 samples of a mouse olfactory bulb evenly spaced along the anterior-posterior axis. Data was generated using Slide-seqV2 and sequenced using paired-end reads on an Illumina NovaSeq 6000 instrument, targeting 200 million reads per sample. The ground truth annotations were manually identified based on the expression of marker genes.

    Stereo-seq mouse liver dataset (stereoseq_liver)

    The Stereo-seq mouse liver dataset [9] was obtained from the STOmicsDB website (https://db.cngb.org/stomics/lista/spatial). The dataset consists of 6 FF samples. Data was generated on Stereo-seq chips and sequenced using paired-end reads on a DIPSEQ T1 instrument. The ground truth annotations were computationally identified: zonation layers were annotated based on the differences between the scores of pericentral and periportal hepatocyte landmark genes.
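
    A score-difference annotation of this kind can be sketched as follows. Everything specific here is an assumption made for illustration: the score is taken as the mean expression of each landmark gene set, the layers are equal-width bins of the difference, and the function names, gene names, and number of layers are hypothetical. The original study may define its scores and layer boundaries differently.

```python
def zonation_difference(expression, pericentral_genes, periportal_genes):
    """Pericentral score minus periportal score for one spot or bin.

    expression: dict mapping gene name to normalized expression.
    Each score is the mean over its landmark genes (assumption).
    """
    pericentral = sum(expression.get(g, 0.0) for g in pericentral_genes) / len(pericentral_genes)
    periportal = sum(expression.get(g, 0.0) for g in periportal_genes) / len(periportal_genes)
    return pericentral - periportal


def assign_layers(differences, n_layers=3):
    """Bin score differences into equal-width layers (0 = most periportal)."""
    low, high = min(differences), max(differences)
    if high == low:
        return [0] * len(differences)
    width = (high - low) / n_layers
    # Clamp so the maximum value falls into the last layer.
    return [min(int((d - low) / width), n_layers - 1) for d in differences]
```

    With real data, `differences` would hold one value per spot, computed with the landmark gene lists from the source publication.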

    Stereo-seq mouse embryo dataset (stereoseq_mouse_embryo)

    The Stereo-seq mouse embryo dataset [10] was obtained from the STOmicsDB website (https://ftp.cngb.org/pub/SciRAID/stomics/STDS0000058/stomics). The dataset consists of 53 FF samples from mouse embryos spanning E9.5–E16.5 with one-day intervals. Data was generated on Stereo-seq chips and sequenced using paired-end reads on an MGI DNBSEQ-Tx sequencer. The ground truth annotations were computationally identified using Spatially Constrained Clustering (SCC), which is built on top of the Leiden clustering algorithm.

    Visium human brain LIBD DLPFC dataset 1 (libd_dlpfc)

    The Visium human brain LIBD DLPFC dataset 1 [11] was obtained from the spatialLIBD Bioconductor package (https://research.libd.org/spatialLIBD). The dataset consists of 12 FF samples from 3 donors. The data was generated on Visium chips and sequenced using paired-end reads on an Illumina NovaSeq 6000 instrument. The ground truth annotations were manually identified based on cytoarchitecture and selected gene markers.

    osmFISH mouse brain somatosensory cortex dataset (osmfish_Ssp)

    The osmFISH mouse brain somatosensory cortex dataset [12] was obtained from the Linnarsson Lab website (https://linnarssonlab.org/osmFISH). The dataset consists of a single FF sample from the mouse brain somatosensory cortex. Data was generated using osmFISH with a custom 33-gene panel. The ground truth annotation was computationally identified using an iterative graph-based algorithm.

    Visium human breast cancer (visium_breast_cancer_SEDR)

    The Visium human breast cancer dataset, originally from 10x Genomics (https://www.10xgenomics.com/resources/datasets/human-breast-cancer-block-a-section-1-1-standard-1-1-0), was obtained from GitHub (https://github.com/JinmiaoChenLab/SEDR_analyses). The dataset consists of a single FF sample of invasive ductal carcinoma breast tissue. The data was generated on a Visium chip and sequenced

  6. Data for scCellFie Tutorials

    • zenodo.org
    bin
    Updated May 2, 2025
    Cite
    Erick Armingol (2025). Data for scCellFie Tutorials [Dataset]. http://doi.org/10.5281/zenodo.15330688
    Available download formats: bin
    Dataset updated
    May 2, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Erick Armingol
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets to use in the tutorials of scCellFie:

    • Here we provide a subset for the HECA dataset, a full version of it is publicly available at The Reproductive Cell Atlas
    • Here we provide a Visium dataset of a Mouse Whole Embryo, which is publicly available at 10X Genomics

  7. Data from: STalign: Alignment of spatial transcriptomics data using...

    • zenodo.org
    application/gzip
    Updated Feb 28, 2024
    Cite
    Kalen Clifton; Manjari Anant; Gohta Aihara; Jean Fan (2024). STalign: Alignment of spatial transcriptomics data using diffeomorphic metric mapping [Dataset]. http://doi.org/10.5281/zenodo.10724029
    Available download formats: application/gzip
    Dataset updated
    Feb 28, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kalen Clifton; Manjari Anant; Gohta Aihara; Jean Fan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Spatial transcriptomics (ST) technologies enable high throughput gene expression characterization within thin tissue sections. However, comparing spatial observations across sections, samples, and technologies remains challenging. To address this challenge, we developed STalign to align ST datasets in a manner that accounts for partially matched tissue sections and other local non-linear distortions using diffeomorphic metric mapping. We apply STalign to align ST datasets within and across technologies as well as to align ST datasets to a 3D common coordinate framework. We show that STalign achieves high gene expression and cell-type correspondence across matched spatial locations that is significantly improved over landmark-based affine alignments. Applying STalign to align ST datasets of the mouse brain to the 3D common coordinate framework from the Allen Brain Atlas, we highlight how STalign can be used to lift over brain region annotations and enable the interrogation of compositional heterogeneity across anatomical structures. STalign is available as an open-source Python toolkit at https://github.com/JEFworks-Lab/STalign and as supplementary software with additional documentation and tutorials available at https://jef.works/STalign.

    Here we have included alignment results that were used in performance analysis of STalign:

    We aligned Slice 2 Replicate 3 to Slice 2 Replicate 2 of the MERFISH mouse coronal brain sections available from Vizgen Data Release V1.0. May 2021 (https://info.vizgen.com/mouse-brain-map).

    • STalign_S2R3_to_S2R2.csv.gz contains cell ids, original cell centroid positions of S2R3, cell positions of S2R3 after alignment to S2R2 with STalign, cell positions of S2R3 after supervised affine alignment to S2R2, and counts for genes and blanks.
    • STalign_S2R2.csv.gz contains cell ids, cell centroid positions of S2R2 and counts for genes and blanks.

    Additionally, we aligned Slice 2 Replicate 3 to a Visium dataset of an FFPE-preserved adult mouse brain, obtained from the 10X Datasets website for Spatial Gene Expression Dataset by Space Ranger 1.3.0 (https://www.10xgenomics.com/resources/datasets/adult-mouse-brain-ffpe-1-standard-1-3-0).

    • STalign_S2R3_to_Visium.csv.gz contains cell ids, original cell centroid positions of S2R3, cell positions of S2R3 after alignment to Visium H&E staining with STalign, and counts for genes and blanks.

    Furthermore, we performed alignments with the 50 µm resolution 3D Allen Reference Atlas Nissl common coordinate framework, CCF (https://help.brain-map.org/display/mouseconnectivity/API). We applied STalign to align the Allen CCF to each of the 9 MERFISH slices (3 slice locations with 3 biological replicates) provided by Vizgen. Because the Allen CCF has annotated brain regions, we were able to lift over those brain region annotations to label all cells in the MERFISH datasets.

    Also, since the STalign mappings from the Allen CCF to the MERFISH slices are invertible, for each slice we can apply the inverse of the mapping to get cell positions in the Allen CCF coordinates.

    • STalign_SXRX_with_structure_id_name.csv.gz contains cell ids for Slice X Replicate X, original cell centroid positions, cell xyz-coordinates in Allen CCF, brain structure id per cell, brain structure acronym

    To evaluate the 3D CCF alignment, we performed unified transcriptional clustering analysis and cell-type annotation. All MERFISH datasets were combined. Transcriptional clustering analysis and cell-type annotation were performed using the SCANPY package [version 1.9.1]. Data were normalized to counts per million (scanpy: normalize_total) and log transformed (scanpy: log1p). PCA (scanpy: pca) was computed on the cell by gene matrix. A neighborhood graph of cells using the top 10 PCs and 10 nearest neighbors was created (scanpy: neighbors), and Leiden clustering was performed on this graph (scanpy: leiden) to identify 29 clusters. Differentially expressed genes were extracted from each cluster (scanpy: rank_genes_groups), and cell types were annotated based on marker genes in each cluster.
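The first two steps of that workflow (counts-per-million normalization followed by a log transform) can be illustrated with a minimal standard-library sketch. It is not the authors' code: the function name `cpm_log1p` and the toy matrix are hypothetical, and the target of 1,000,000 is inferred from the phrase "counts per million". In scanpy itself, the corresponding calls would be `sc.pp.normalize_total(adata, target_sum=1e6)` followed by `sc.pp.log1p(adata)`.

```python
import math

def cpm_log1p(counts_per_cell, target_sum=1_000_000):
    """Scale each cell's counts to sum to `target_sum`, then apply log(1 + x)."""
    normalized = []
    for counts in counts_per_cell:
        total = sum(counts)
        # Cells with zero total counts stay all-zero to avoid dividing by zero.
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(c * scale) for c in counts])
    return normalized

# Toy cell-by-gene matrix (hypothetical values): 2 cells x 3 genes.
toy_counts = [[1, 3, 0], [10, 0, 10]]
toy_norm = cpm_log1p(toy_counts)
```

Because every cell is rescaled to the same total before the log transform, cells with very different sequencing depth become comparable in the later PCA and neighbor-graph steps.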

    • STalign_celltypeannotations_merfishslices_v2.csv.gz contains for all nine slices cell ids and cell type annotations

    This updated (v2) cell-type annotation file contains a new column with simplified cell types. Briefly, we fixed typos, standardized lower-case/upper-case formats, and merged the subclasses of each cell type. For example, subclasses of astrocytes, originally labeled “Astrocytes”, “Astrocytes(1)”, “Astrocytes(2)”, and “Astrocytes(3)”, are all labeled “Astrocytes” in the added column.

    Note: Cell ids may have been altered from the original string of digits through reading and writing across programming languages that handle numbers with different precision. If using R to read the files shared here, one can find the cells in STalign_celltypeannotations_merfishslices_v2.csv.gz that correspond with STalign_SXRX_with_structure_id_name.csv.gz when cell ids are formatted as a double in scientific notation, which is how R will read the file automatically.
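The precision issue described in the note can be reproduced with a short standard-library sketch. The 39-digit id, the column name `cell_id`, and the helper names below are hypothetical and only illustrate the mechanism; they are not taken from the shared files.

```python
import csv
import io

# Hypothetical long integer cell id (not taken from the dataset).
cell_id = "110883424764611924400221639916314253469"

# A 64-bit float keeps only about 15-17 significant digits,
# so converting to float and back changes the trailing digits.
as_float = float(cell_id)
round_tripped = str(int(as_float))

def read_ids_as_strings(csv_text, id_column):
    """Read the id column as text so that no digits are lost."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[id_column] for row in reader]

def float_key(id_text):
    """Matching key for ids that were already rounded to doubles elsewhere."""
    return float(id_text)

toy_csv = "cell_id,x\n" + cell_id + ",1.5\n"
ids = read_ids_as_strings(toy_csv, "cell_id")
```

Reading ids as text is the safer default. Matching on `float_key` mirrors the workaround in the note, with the caveat that two distinct ids could in principle round to the same double.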

  8. Has this cell type annotation worked?

    • zenodo.org
    zip
    Updated Jul 17, 2025
    Dmitrijs Lvovs (2025). Has this cell type annotation worked? [Dataset]. http://doi.org/10.5281/zenodo.16033194
    Available download formats
    zip
    Dataset updated
    Jul 17, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dmitrijs Lvovs
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is the result of applying the reference-based RCTD cell-typing method to the 8 µm bin of the public P2 human CRC dataset, as downloaded from the 10x website. Public single-cell HTAN CRC data (validation cohort) from CellXGene were used as a reference to assign cell labels. The choice of CRC for this illustration is purely random.

    The dataset structure is as follows:

    Interested parties are invited to look at `squidpy/visium-hd-crc-p2/squidpy.h5ad`, which contains the original counts matrix in `adata.X`, the associated spatial information, the most abundant cell type in `adata.obs['cell_type']`, and the raw RCTD outputs in `adata.uns['cell_types']`. Furthermore, `example.ipynb` contains some toy scripts to get started with analysis. The other files are untransformed and transformed versions of the source data and some basic squidpy plots.


David McKellar; Iwijn De Vlaminck; Benjamin Cosgrove (2021). Large-scale integration of single-cell transcriptomic data captures transitional progenitor states in mouse skeletal muscle regeneration [Dataset]. http://doi.org/10.5061/dryad.t4b8gtj34

Data from: Large-scale integration of single-cell transcriptomic data captures transitional progenitor states in mouse skeletal muscle regeneration

Available download formats
zip
Dataset updated
Dec 14, 2021
Dataset provided by
Cornell University
Authors
David McKellar; Iwijn De Vlaminck; Benjamin Cosgrove
License

https://spdx.org/licenses/CC0-1.0.html

Description

Skeletal muscle repair is driven by the coordinated self-renewal and fusion of myogenic stem and progenitor cells. Single-cell gene expression analyses of myogenesis have been hampered by the poor sampling of rare and transient cell states that are critical for muscle repair, and do not inform the spatial context that is important for myogenic differentiation. Here, we demonstrate how large-scale integration of single-cell and spatial transcriptomic data can overcome these limitations. We created a single-cell transcriptomic dataset of mouse skeletal muscle by integration, consensus annotation, and analysis of 23 newly collected scRNAseq datasets and 88 publicly available single-cell (scRNAseq) and single-nucleus (snRNAseq) RNA-sequencing datasets. The resulting dataset includes more than 365,000 cells and spans a wide range of ages, injury, and repair conditions. Together, these data enabled identification of the predominant cell types in skeletal muscle, and resolved cell subtypes, including endothelial subtypes distinguished by vessel-type of origin, fibro/adipogenic progenitors defined by functional roles, and many distinct immune populations. The representation of different experimental conditions and the depth of transcriptome coverage enabled robust profiling of sparsely expressed genes. We built a densely sampled transcriptomic model of myogenesis, from stem cell quiescence to myofiber maturation and identified rare, transitional states of progenitor commitment and fusion that are poorly represented in individual datasets. We performed spatial RNA sequencing of mouse muscle at three time points after injury and used the integrated dataset as a reference to achieve a high-resolution, local deconvolution of cell subtypes. We also used the integrated dataset to explore ligand-receptor co-expression patterns and identify dynamic cell-cell interactions in muscle injury response. 
We provide a public web tool to enable interactive exploration and visualization of the data. Our work supports the utility of large-scale integration of single-cell transcriptomic data as a tool for biological discovery.

Methods

Mice. The Cornell University Institutional Animal Care and Use Committee (IACUC) approved all animal protocols, and experiments were performed in compliance with its institutional guidelines. Adult C57BL/6J mice (Mus musculus) were obtained from Jackson Laboratories (#000664; Bar Harbor, ME) and were used at 4-7 months of age. Aged C57BL/6J mice were obtained from the National Institute of Aging (NIA) Rodent Aging Colony and were used at 20 months of age. For new scRNAseq experiments, female mice were used in each experiment.

Mouse injuries and single-cell isolation. To induce muscle injury, both tibialis anterior (TA) muscles of old (20 months) C57BL/6J mice were injected with 10 µl of notexin (10 µg/ml; Latoxan; France). At 0, 1, 2, 3.5, 5, or 7 days post-injury (dpi), mice were sacrificed and TA muscles were collected and processed independently to generate single-cell suspensions. Muscles were digested with 8 mg/ml Collagenase D (Roche; Switzerland) and 10 U/ml Dispase II (Roche; Switzerland), followed by manual dissociation to generate cell suspensions. Cell suspensions were sequentially filtered through 100 and 40 μm filters (Corning Cellgro #431752 and #431750) to remove debris. Erythrocytes were removed through incubation in erythrocyte lysis buffer (IBI Scientific #89135-030).

Single-cell RNA-sequencing library preparation. After digestion, single-cell suspensions were washed and resuspended in 0.04% BSA in PBS at a concentration of 10^6 cells/ml. Cells were counted manually with a hemocytometer to determine their concentration. Single-cell RNA-sequencing libraries were prepared using the Chromium Single Cell 3’ reagent kit v3 (10x Genomics, PN-1000075; Pleasanton, CA) following the manufacturer’s protocol. Cells were diluted into the Chromium Single Cell A Chip to yield a recovery of 6,000 single-cell transcriptomes. After preparation, libraries were sequenced on a NextSeq 500 (Illumina; San Diego, CA) using 75 cycle high output kits (Index 1 = 8, Read 1 = 26, and Read 2 = 58). Details on estimated sequencing saturation and the number of reads per sample are shown in Sup. Data 1.

Spatial RNA sequencing library preparation. Tibialis anterior muscles of adult (5 mo) C57BL/6J mice were injected with 10 µl notexin (10 µg/ml) at 2, 5, and 7 days prior to collection. Upon collection, tibialis anterior muscles were isolated, embedded in OCT, and frozen fresh in liquid nitrogen. Spatially tagged cDNA libraries were built using the Visium Spatial Gene Expression 3’ Library Construction v1 Kit (10x Genomics, PN-1000187; Pleasanton, CA) (Fig. S7). Optimal tissue permeabilization time for 10 µm thick sections was found to be 15 minutes using the 10x Genomics Visium Tissue Optimization Kit (PN-1000193). H&E stained tissue sections were imaged using a Zeiss PALM MicroBeam laser capture microdissection system, and the images were stitched and processed using Fiji ImageJ software. cDNA libraries were sequenced on an Illumina NextSeq 500 using 150 cycle high output kits (Read 1=28bp, Read 2=120bp, Index 1=10bp, and Index 2=10bp). Frames around the capture area on the Visium slide were aligned manually and spots covering the tissue were selected using Loop Browser v4.0.0 software (10x Genomics). Sequencing data were then aligned to the mouse reference genome (mm10) using the spaceranger v1.0.0 pipeline to generate a feature-by-spot-barcode expression matrix (10x Genomics).

Download and alignment of single-cell RNA sequencing data. For all samples available via SRA, parallel-fastq-dump (github.com/rvalieris/parallel-fastq-dump) was used to download raw .fastq files. Samples which were only available as .bam files were converted to .fastq format using bamtofastq from 10x Genomics (github.com/10XGenomics/bamtofastq). Raw reads were aligned to the mm10 reference using cellranger (v3.1.0).

Preprocessing and batch correction of single-cell RNA sequencing datasets. First, ambient RNA signal was removed using the default SoupX (v1.4.5) workflow (autoEstCounts and adjustCounts; github.com/constantAmateur/SoupX). Samples were then preprocessed using the standard Seurat (v3.2.1) workflow (NormalizeData, ScaleData, FindVariableFeatures, RunPCA, FindNeighbors, FindClusters, and RunUMAP; github.com/satijalab/seurat). Cells with fewer than 750 features, fewer than 1000 transcripts, or more than 30% of unique transcripts derived from mitochondrial genes were removed. After preprocessing, DoubletFinder (v2.0) was used to identify putative doublets in each dataset, individually. BCmvn optimization was used for pK parameterization. Estimated doublet rates were computed by fitting the total number of cells after quality filtering to a linear regression of the expected doublet rates published in the 10x Chromium handbook. Estimated homotypic doublet rates were also accounted for using the modelHomotypic function. The default pN value (0.25) was used. Putative doublets were then removed from each individual dataset. After preprocessing and quality filtering, we merged the datasets and performed batch correction with three tools, independently: Harmony (github.com/immunogenomics/harmony) (v1.0), Scanorama (github.com/brianhie/scanorama) (v1.3), and BBKNN (github.com/Teichlab/bbknn) (v1.3.12). We then used Seurat to process the integrated data. After initial integration, we removed the noisy cluster and re-integrated the data using each of the three batch-correction tools.
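The cell-level quality filter stated above can be written out as a small sketch. This is plain Python rather than the Seurat workflow the authors used; the function name, the dictionary keys, and the toy cells are hypothetical, while the three thresholds come from the text.

```python
def passes_qc(n_features, n_transcripts, pct_mito,
              min_features=750, min_transcripts=1000, max_pct_mito=30.0):
    """Return True if a cell survives the three quality thresholds.

    Cells are removed when they have fewer than `min_features` detected
    genes, fewer than `min_transcripts` transcripts, or more than
    `max_pct_mito` percent of transcripts from mitochondrial genes.
    """
    return (n_features >= min_features
            and n_transcripts >= min_transcripts
            and pct_mito <= max_pct_mito)

# Toy per-cell summaries (hypothetical values).
toy_cells = [
    {"id": "a", "n_features": 1200, "n_transcripts": 3000, "pct_mito": 5.0},
    {"id": "b", "n_features": 700, "n_transcripts": 3000, "pct_mito": 5.0},
    {"id": "c", "n_features": 1200, "n_transcripts": 900, "pct_mito": 5.0},
    {"id": "d", "n_features": 1200, "n_transcripts": 3000, "pct_mito": 42.0},
]
kept_ids = [c["id"] for c in toy_cells
            if passes_qc(c["n_features"], c["n_transcripts"], c["pct_mito"])]
```

In this toy input only cell "a" is kept; each of the other three fails exactly one of the thresholds.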

Cell type annotation. Cell types were determined for each integration method independently. For Harmony and Scanorama, dimensions accounting for 95% of the total variance were used to generate SNN graphs (Seurat::FindNeighbors). Louvain clustering was then performed on the output graphs (including the corrected graph output by BBKNN) using Seurat::FindClusters. A clustering resolution of 1.2 was used for Harmony (25 initial clusters), BBKNN (28 initial clusters), and Scanorama (38 initial clusters). Cell types were determined based on expression of canonical genes (Fig. S3). Clusters which had similar canonical marker gene expression patterns were merged.
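One way to read "dimensions accounting for 95% of the total variance" is as the smallest number of leading components whose cumulative variance reaches that fraction. The sketch below assumes this reading and assumes that "total variance" means the variance summed over the computed components; the function name and the toy standard deviations are hypothetical.

```python
def n_dims_for_variance(stdevs, fraction=0.95):
    """Smallest number of leading components whose cumulative variance
    reaches `fraction` of the variance summed over all given components."""
    variances = [s * s for s in stdevs]
    total = sum(variances)
    cumulative = 0.0
    for n_dims, variance in enumerate(variances, start=1):
        cumulative += variance
        if cumulative / total >= fraction:
            return n_dims
    return len(variances)

# Toy per-component standard deviations, sorted in decreasing order.
toy_stdevs = [3.0, 2.0, 1.0, 0.5]
n_dims = n_dims_for_variance(toy_stdevs)
```

With these toy values the variances are 9, 4, 1, and 0.25, so the first two components cover about 91% and the first three about 98%, giving three dimensions at the 95% cutoff.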

Pseudotime workflow. Cells were subset based on the consensus cell types between all three integration methods. Harmony embedding values from the dimensions accounting for 95% of the total variance were used for further dimensional reduction with PHATE, using phateR (v1.0.4) (github.com/KrishnaswamyLab/phateR).

Deconvolution of spatial RNA sequencing spots. Spot deconvolution was performed using the deconvolution module in BayesPrism (previously known as “Tumor microEnvironment Deconvolution”, TED, v1.0; github.com/Danko-Lab/TED). First, myogenic cells were re-labeled, according to binning along the first PHATE dimension, as “Quiescent MuSCs” (bins 4-5), “Activated MuSCs” (bins 6-7), “Committed Myoblasts” (bins 8-10), and “Fusing Myoctes” (bins 11-18). Culture-associated muscle stem cells were ignored and myonuclei labels were retained as “Myonuclei (Type IIb)” and “Myonuclei (Type IIx)”. Next, highly and differentially expressed genes across the 25 groups of cells were identified with differential gene expression analysis using Seurat (FindAllMarkers, using Wilcoxon Rank Sum Test; results in Sup. Data 2). The resulting genes were filtered based on average log2-fold change (avg_logFC > 1) and the percentage of cells within the cluster which express each gene (pct.expressed > 0.5), yielding 1,069 genes. Mitochondrial and ribosomal protein genes were also removed from this list, in line with recommendations in the BayesPrism vignette. For each of the cell types, mean raw counts were calculated across the 1,069 genes to generate a gene expression profile for BayesPrism. Raw counts for each spot were then passed to the run.Ted function, using
