Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Large-scale, comprehensive single-cell experiments are often resource-intensive and require the involvement of many laboratories and/or measurements taken at different times. This inevitably leads to batch effects: systematic variations in the data that may arise from different technology platforms, reagent lots, or handling personnel. Such technical differences confound the biological variation of interest and need to be corrected during data integration. Data integration is challenging because biological and technical factors overlap, which makes it difficult to separate their individual contributions to the observed effect. Moreover, the choice of integration method may affect downstream analyses, including the search for differentially expressed genes. From the existing data integration methods, we selected only those that return the full expression matrix. We evaluated six methods in terms of their influence on the performance of differential gene expression analysis in two single-cell datasets that share the same biological study design and differ only in how the measurements were taken: one dataset shows strong batch effects because each sample was measured at a different time. Integrated data were visualized using UMAP. The evaluation was performed at the individual gene level, using parametric and non-parametric approaches for finding differentially expressed genes, and at the gene set level, using gene set enrichment analysis. As evaluation metrics, we used two correlation coefficients, Pearson and Spearman, computed on the test statistics obtained for the reference, test, and corrected studies. Visual comparison of UMAP plots highlighted ComBat-seq, limma, and MNN, which reduced batch effects while preserving differences between biological conditions. Most of the tested methods changed the data distribution after integration, which negatively affects the use of parametric methods in the analysis. Two algorithms, MNN and Scanorama, gave very poor results in differential analysis at both the gene and gene set levels. Finally, we highlight ComBat-seq, as it led to the highest correlation of test statistics between the reference and corrected datasets among the tested methods. Moreover, it does not distort the original distribution of the gene expression data, so it can be used in all types of downstream analyses.
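The evaluation metric described above can be illustrated with a minimal sketch. It computes the Pearson and Spearman coefficients between two vectors of per-gene test statistics using only the Python standard library. The numbers in `reference_stats` and `corrected_stats` are made up for illustration and are not results from the study; the function names are likewise hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up per-gene test statistics (illustration only).
reference_stats = [2.1, -0.5, 3.3, 0.0, -1.8]
corrected_stats = [1.9, -0.2, 2.5, 0.4, -2.0]

r_pearson = pearson(reference_stats, corrected_stats)
r_spearman = spearman(reference_stats, corrected_stats)
```

A higher coefficient means the corrected data reproduce the gene ranking (Spearman) or the statistic values (Pearson) of the reference more closely.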
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sample Group Description
The dataset includes the following human and non-human samples:
Human samples:
-Live healthy donors (LHD; n = 8), labeled as M
-Patients with liver pathology (adjacent normal tissue sampled; n = 8), labeled as P
Non-human samples:
-Wild boar (n = 2), labeled as non_human_P
-Cow (n = 2), labeled as non_human_C
-Domesticated pig (n = 3), labeled as non_human_PD
Visium:
LHDs and adjacent normal samples:
Human loupe files:
This folder includes the Loupe Browser-compatible files (16 total) corresponding to the human samples (M1–M8, P2, P3, P6, P7, P14, P17, P18, P21) for downstream visualization and exploration.
Human_h5_files:
This folder contains .h5-formatted output files from Space Ranger for the 16 human samples.
Human_Spatial_transcriptomics_data:
This folder contains spatial transcriptomics data from 16 human samples.
For each sample, the following files are included:
counts_ALL.csv – full gene expression matrix
counts_UTT.csv – filtered matrix (UTT: under the tissue)
tissue_positions_list.csv – spatial barcode coordinates
scalefactors_json.json – image scaling information
tissue_hires_image.png – high-resolution histology image
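As a rough sketch of how the position and scaling files relate, the snippet below maps barcodes under the tissue to pixel coordinates on the high-resolution image. It assumes the usual legacy Space Ranger layout: a header-less tissue_positions_list.csv with the column order given in `COLUMNS`, and a `tissue_hires_scalef` key in scalefactors_json.json. Check both against the actual files before relying on it. The barcodes and numbers are made up.

```python
import csv
import io
import json

# Assumed column order of the header-less tissue_positions_list.csv.
COLUMNS = [
    "barcode", "in_tissue", "array_row", "array_col",
    "pxl_row_in_fullres", "pxl_col_in_fullres",
]


def hires_coordinates(positions_text, scalefactors_text):
    """Return {barcode: (row, col)} in hires-image pixels for in-tissue spots."""
    scale = json.loads(scalefactors_text)["tissue_hires_scalef"]
    coords = {}
    for row in csv.reader(io.StringIO(positions_text)):
        if not row:
            continue  # skip blank lines
        record = dict(zip(COLUMNS, row))
        if record["in_tissue"] != "1":
            continue  # spot is not under the tissue
        coords[record["barcode"]] = (
            float(record["pxl_row_in_fullres"]) * scale,
            float(record["pxl_col_in_fullres"]) * scale,
        )
    return coords


# Made-up file contents (illustration only).
positions_text = (
    "AAACAAGTATCTCCCA-1,1,50,102,1000,2000\n"
    "AAACAATCTACTAGCA-1,0,3,43,300,400\n"
)
scalefactors_text = '{"tissue_hires_scalef": 0.5}'

coords = hires_coordinates(positions_text, scalefactors_text)
```

With real data, the two strings would be read from the sample's tissue_positions_list.csv and scalefactors_json.json files.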
Non_Human_Loupe_files:
This folder contains Loupe Browser-compatible files for the 7 non-human samples (C1, C2, P1, P2, PD1, PD2, PD3).
Non_human_h5_files:
This folder contains .h5-formatted output files from Space Ranger for the 7 non-human samples.
Non_Human_Spatial_transcriptomics_data:
This folder includes spatial transcriptomics data from 7 non-human samples (C1, C2, P1, P2, PD1, PD2, PD3).
counts_ALL.csv – full gene expression matrix
counts_UTT.csv – filtered matrix (UTT: under the tissue)
tissue_positions_list.csv – spatial barcode coordinates
scalefactors_json.json – image scaling information
tissue_hires_image.png – high-resolution histology image
VisiumHD:
This folder contains spatial transcriptomics data from 10x Genomics Visium HD for human liver samples:
M1_VisiumHD.cloupe – Loupe Browser visualization file for patient M1, showing spatial transcriptomics data at 8 μm bin resolution.
M2_VisiumHD.cloupe – Loupe Browser visualization file for patient M2, showing spatial transcriptomics data at 8 μm bin resolution.
M6_VisiumHD.cloupe – Loupe Browser file for a Visium HD slide that includes two tissue sections. The tissue at the bottom of the slide corresponds to patient M6, which is the one analyzed in the downstream dataset (marked under ‘patients’ as ‘M6-high quality’). Data is shown at 8 μm bin resolution.
visiumHD_data_M2_M6.h5ad – A filtered and integrated .h5ad file containing single-cell–resolved spatial gene expression data from both M2 and M6 samples.
-Resolution: Cells. Based on single-cell segmentation (see “Liver Cell Atlas using Visium HD” method).
-Cell filtering: Only cells that were detected via segmentation and passed quality filters were included.
-UMI threshold: Cells with fewer than 200 UMIs were excluded.
-Batch correction: Harmony was applied to correct sample-specific effects prior to UMAP visualization.
-Format: .h5ad (AnnData format, compatible with Scanpy).
-Includes: Single-cell expression matrix, spatial coordinates, Harmony-corrected UMAP, cluster identity, and metadata.
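The UMI threshold listed above can be sketched as follows. This is an illustrative stand-alone filter, not the pipeline used to build the .h5ad file. The input layout (a dictionary mapping each cell to its per-gene UMI counts), the cell names, and the gene names are assumptions made for the example.

```python
def filter_cells_by_umi(counts_per_cell, min_umis=200):
    """Keep cells whose total UMI count is at least `min_umis`.

    `counts_per_cell` maps a cell ID to a {gene: UMI count} dictionary.
    Cells with fewer than `min_umis` UMIs in total are dropped.
    """
    return {
        cell: genes
        for cell, genes in counts_per_cell.items()
        if sum(genes.values()) >= min_umis
    }


# Made-up cells and counts (illustration only).
counts_per_cell = {
    "cell_1": {"GENE_A": 150, "GENE_B": 49},   # 199 UMIs, excluded
    "cell_2": {"GENE_A": 150, "GENE_B": 50},   # 200 UMIs, kept
    "cell_3": {"GENE_A": 900, "GENE_C": 400},  # 1300 UMIs, kept
}

kept = filter_cells_by_umi(counts_per_cell)
```

In practice the same filter would typically be applied to an AnnData object in Scanpy; the plain-dictionary version only shows where the cut-off falls.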
M6:
This folder contains spatial transcriptomics data (8 × 8 μm bins) for sample M6, generated using the 10x Genomics Visium HD platform.
NOTES:
-The gene expression matrices (*.h5) come from the full slide output of Space Ranger, including both tissue sections (like the M6 Loupe file).
-The spatial metadata files (*.json, *.tif, *.csv) refer to the cropped region, corresponding to the bottom tissue, which is the actual M6 sample used in downstream analysis.
-This is the raw Space Ranger output, prior to cell segmentation or high-level filtering (apart from the default filtered feature matrix).
-This data reflects raw 8 μm resolution bins, not single-cell segmentations.
-For downstream analysis based on cell segmentation, refer to the visiumHD_data_M2_M6.h5ad file in the top-level VisiumHD folder.
Gene Expression Matrices (uncropped – both tissues included):
filtered_feature_bc_matrix_8um.h5
raw_feature_bc_matrix_8um.h5
Spatial Metadata (cropped – M6 tissue only):
scalefactors_json.json
Images: tissue_hires_image.tif / tissue_lowres_image.tif / tissue_fullres_image.tif
tissue_positions.csv
- Barcode-to-position table corresponding only to the cropped region, i.e., the M6 tissue. Only the barcodes listed in this file are relevant to M6 and should be used to extract or analyze this tissue’s expression data from the full matrix.
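A minimal sketch of the barcode selection described above: given the barcodes of the full (uncropped) matrix and the contents of tissue_positions.csv, it returns the matrix column indices that belong to the M6 tissue. It assumes the CSV has a header row with a column named `barcode`; adjust `barcode_column` if the actual file differs. The barcodes shown are made up.

```python
import csv
import io


def m6_barcode_indices(matrix_barcodes, positions_text, barcode_column="barcode"):
    """Indices of matrix barcodes that are listed in the cropped positions table."""
    reader = csv.DictReader(io.StringIO(positions_text))
    keep = {row[barcode_column] for row in reader}
    return [i for i, bc in enumerate(matrix_barcodes) if bc in keep]


# Made-up barcodes (illustration only): the full matrix covers both tissues,
# while the positions table lists only bins in the cropped M6 region.
matrix_barcodes = ["bin_0001", "bin_0002", "bin_0003", "bin_0004"]
positions_text = (
    "barcode,in_tissue,array_row,array_col\n"
    "bin_0002,1,10,11\n"
    "bin_0004,1,10,12\n"
)

m6_indices = m6_barcode_indices(matrix_barcodes, positions_text)
```

The returned indices can then be used to subset the columns (or rows, depending on orientation) of the matrix loaded from filtered_feature_bc_matrix_8um.h5.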
M2:
This folder contains the full, unmodified output of the 10x Genomics Visium HD Space Ranger for the M2 liver tissue sample (8 × 8 μm resolution).
filtered_feature_bc_matrix_8um.h5
raw_feature_bc_matrix_8um.h5
scalefactors_json.json
Images: tissue_hires_image.tif / tissue_lowres_image.tif
tissue_positions_orig.csv
M1:
This folder contains the full, unmodified output of the 10x Genomics Visium HD Space Ranger for the M1 liver tissue sample (8 × 8 μm resolution).
filtered_feature_bc_matrix_8um.h5
raw_feature_bc_matrix_8um.h5
scalefactors_json.json
Images: tissue_hires_image.tif / tissue_lowres_image.tif
tissue_positions_orig.csv
snRNAseq:
This folder contains single-nucleus RNA-seq (snRNA-seq) data from four human liver samples (M5, M6, M7, M8). Data was generated using Cell Ranger multi.
single_nuc_RNAseq.cloupe
- Output from Cell Ranger multi, containing data from all four samples.
snRNAseq.h5ad
- Processed and filtered .h5ad file containing single-nucleus expression data from M5–M8, integrated into one dataset.
-Filtering includes standard QC (e.g., low-gene/UMI exclusion, mitochondrial content).
-Batch correction: Harmony was applied to correct sample-specific effects prior to UMAP visualization.
-Format: .h5ad (AnnData format, compatible with Scanpy).
-Includes: expression matrix, spatial coordinates, Harmony-corrected UMAP, cluster identity and metadata.
M5, M6, M7, M8
Each sample folder contains raw and filtered matrices generated by Cell Ranger:
-sample_filtered_feature_bc_matrix
-sample_raw_feature_bc_matrix
MERFISH:
For both samples (M5 and M8), each sample folder contains:
-cell_by_gene.csv
-cell_metadata.csv
-detected_transcripts.csv
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This scRNA-seq dataset is composed of disease-unaffected epidermal samples from 96 skin biopsies: 18 from published datasets (GSE173706, GSE249279) and 78 newly generated ones. Biopsy sample and protocol details, and curated cell-type signature genes, are available in the scRNASeq_source_info_FigShare spreadsheet of this dataset. Processed Seurat objects are provided herein. Raw data are available in SRA (id PRJNA1054546). Biopsies originated from seven body sites (face, scalp, axilla, palmoplantar, arm, leg, and back). The skin biopsies were separated into epidermis and dermis before being dissociated and enriched for various cell fractions (keratinocytes, fibroblasts, and endothelial cells) and immune cells (myeloid and lymphoid cells) to upsample rare cell types. In total, across body sites, 274,834 cells were profiled, including 96,194 keratinocytes. Seurat v3.0 was used to normalize, scale, and reduce the dimensionality of the data. Low-quality cells containing fewer than 200 genes or more than 5,000 genes per cell were filtered out. Cells whose mitochondrial gene content exceeded the permitted quantile of 0.05 were removed. Ambient RNA was removed using the R package SoupX v1.6.2. Doublets were removed using scDblFinder v1.12.0. Principal components (PCs) were obtained from the 2,000 most variable genes, and the Uniform Manifold Approximation and Projection (UMAP) dimensional reduction technique was applied to the dataset reduced to the top 30 PCs. Batch effect correction was performed with harmony v1.0, using donor as batch. After batch correction, cells were clustered using shared nearest neighbor modularity optimization-based clustering. Cluster marker genes were identified with FindAllMarkers; the cell type corresponding to each cluster was identified by comparing marker genes to curated cell-type signature genes.
Differential expression by keratinocyte subtype was performed with the Seurat (v4.3.0) FindMarkers function by comparing each keratinocyte subtype to non-keratinocyte clusters. The log fold-change of the average expression in a keratinocyte subtype cluster compared to the rest of the clusters is used as the keratinocyte-subtype gene expression statistic.
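To make the gene expression statistic concrete, here is a simplified sketch of a log fold-change of average expression between one cluster and the remaining cells. It is not Seurat's FindMarkers implementation: the base-2 logarithm and the pseudocount of 1 are assumptions chosen for the example, and the expression values are made up.

```python
from math import log2


def log2_fold_change(cluster_values, rest_values, pseudocount=1.0):
    """log2 ratio of mean expression in a cluster versus all other cells.

    A pseudocount is added to both means so that genes with zero mean
    expression in one group do not produce a division by zero or log(0).
    """
    mean_in = sum(cluster_values) / len(cluster_values)
    mean_out = sum(rest_values) / len(rest_values)
    return log2((mean_in + pseudocount) / (mean_out + pseudocount))


# Made-up expression values for one gene (illustration only).
subtype_cells = [3.0, 3.0, 3.0]  # cells in the keratinocyte subtype cluster
other_cells = [1.0, 1.0]         # cells in the remaining clusters

lfc = log2_fold_change(subtype_cells, other_cells)
```

A positive value indicates higher average expression in the subtype cluster, a negative value lower expression.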
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Quality Control of Single-Cell Data
Raw sequencing data were processed using SCOPE-tools (v1.4.0) to generate a gene expression matrix. After extracting and correcting barcodes and unique molecular identifiers (UMIs), adapter sequences and poly(A) tails were removed. The trimmed reads were aligned to the chicken reference genome (GRCg6a) using the integrated STAR (v2.7.9a) algorithm in CellRanger (v5.0.0). Gene mapping was performed with featureCounts, followed by UMI correction and quantification to produce a complete gene expression matrix. The processed data were then compiled into a matrix file. The expression matrix was further analyzed using the Seurat (v4.3.0.1) package to ensure data quality. Cells were filtered based on gene count thresholds (min.cells > 3 and min.features > 200). Cells with fewer than 1,000 UMIs or a log10GenesPerUMI value exceeding 0.7 were excluded. Additionally, cells with mitochondrial gene content exceeding 25% were removed. These quality control measures ensured the reliability of downstream analyses.
Dimensionality Reduction and Clustering of Single-Cell Data
To reduce technical noise and ensure high data quality, the gene expression matrix was normalized and scaled using the NormalizeData and ScaleData functions in the Seurat package. The FindVariableFeatures function was applied to calculate the mean expression and dispersion for each gene, identifying 2,000 highly variable genes. Principal component analysis (PCA) was then performed on the high-dimensional data, retaining the top 20 principal components. Simulated doublet data were generated to match the expected doublet rate, and these were integrated with the original dataset. Each cell was assigned a doublet score using a k-nearest neighbor (k-NN) classifier. Potential doublets were identified using the doubletFinder_v3 function with the parameter pN = 0.25 and removed based on the expected doublet threshold, resulting in a final dataset of 70,361 high-quality cells for downstream analyses. To correct for potential batch effects, the Harmony algorithm was applied. For clustering, the FindClusters function was used with a resolution of 0.4, followed by dimensionality reduction and visualization using uniform manifold approximation and projection (UMAP) and t-distributed stochastic neighbor embedding. The UMAP algorithm was optimized with a neighborhood size of 20 to achieve optimal cell clustering and clear visual representation of the cell populations.
Differential Gene Screening
To characterize the functional properties of different cell clusters, we identified differentially expressed genes (DEGs) using the "FindAllMarkers" function in the Seurat package. The selection criteria required genes to be expressed in more than 25% of cells in the target cell subpopulation (min.pct = 0.25) and to exhibit significantly higher expression levels in the target cluster compared to others (test.use = "MAST"). To ensure the biological relevance of the results, more stringent thresholds were applied: p-value < 0.05 and |log2 fold change| > 1. Cell types were annotated by integrating literature-supported evidence and classical marker genes, allowing for accurate classification of cell populations and elucidation of their biological functions. The expression patterns of marker genes were visualized using the DoHeatmap, DotPlot, and VlnPlot functions in the Seurat package. These visualizations further clarified cell identities and highlighted their functional characteristics.
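The DEG thresholds stated above (p-value < 0.05 and |log2 fold change| > 1) can be sketched as a simple filter over a marker table. The field names `p_val` and `avg_log2FC` mirror the column names Seurat v4 uses in its marker output, but the record layout, gene names, and values here are made up for illustration.

```python
def select_degs(markers, max_p=0.05, min_abs_log2fc=1.0):
    """Return genes with p-value below `max_p` and |log2FC| above the cut-off."""
    return [
        m["gene"]
        for m in markers
        if m["p_val"] < max_p and abs(m["avg_log2FC"]) > min_abs_log2fc
    ]


# Made-up marker records (illustration only).
markers = [
    {"gene": "GENE_A", "p_val": 0.001, "avg_log2FC": 2.3},   # passes both
    {"gene": "GENE_B", "p_val": 0.200, "avg_log2FC": 3.0},   # p-value too high
    {"gene": "GENE_C", "p_val": 0.010, "avg_log2FC": 0.4},   # effect too small
    {"gene": "GENE_D", "p_val": 0.030, "avg_log2FC": -1.5},  # passes both
]

degs = select_degs(markers)
```

Both comparisons are strict, so a gene with a p-value of exactly 0.05 or a log2 fold change of exactly ±1 is not selected.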