Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Assigns identifiers to collections of datasets indexed by CELLxGENE.
CELLxGENE is an interactive data visualization and exploration tool developed by the Chan Zuckerberg Initiative that enables researchers to analyze and share single-cell genomics datasets. It provides a user-friendly interface for biologists and computational scientists to interrogate gene expression patterns across different cell types.
Portal used to find and download any of the datasets published on CELLxGENE. It lets users download and visually explore data to understand the functionality of human tissues at the cellular level, and it is optimized for finding, exploring, and reusing single-cell data. The Collections page lists the collections hosted on CELLxGENE Discover and the metadata that define tissue, assay, disease, organism, and cell count for each collection. Once you find a published dataset of interest on CELLxGENE Discover, you can click the Explore button below the dataset description to explore the cells of that dataset in CELLxGENE Explorer.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
GCS LINK:
gs://kds-6860773353013302b6e19605df3e5195ee14d269d4d746edb218f8ff
A curated dataset of approximately 700,000 healthy human single cells (approx. 100,000 per tissue) sourced from the CellXGene Census, covering seven major tissues: heart, blood, brain, lung, kidney, intestine, and pancreas.
This is 1 of 4 datasets focusing on providing progressively larger, ready-to-use collections of healthy human single-cell RNA sequencing data in the H5AD format.
The goal is to offer standardized benchmarks/datasets derived from CellXGene for exploring fundamental scRNA-seq analysis, understanding multi-tissue cellular composition, developing and testing computational models, and evaluating method scalability across different orders of magnitude.
This dataset provides a focused collection of single-cell transcriptomic profiles representing healthy human tissues, curated from the latest stable release (Jan 2025) of the CZ CELLxGENE Discover Census (CellXGene). It includes data exclusively from Homo sapiens cells annotated as 'normal' or 'healthy' and in 'cell' suspension.
With its manageable size (approx. 700k total cells), this dataset serves as a middle ground for exploration, model development, and scaling to larger use cases.
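The selection described above (Homo sapiens, healthy, 'cell' suspension, seven tissues) can be sketched as a Census query. This is a minimal sketch, not the curators' actual pipeline: it assumes the `cellxgene_census` Python package, the Census `obs` columns `tissue_general`, `disease`, and `suspension_type`, and it assumes that the tissue names listed above match the Census labels exactly. It does not reproduce the subsampling to roughly 100,000 cells per tissue, so an unrestricted call would return far more cells than the dataset contains.

```python
# Tissue names copied from the description; whether they match the Census
# `tissue_general` labels exactly is an assumption.
TISSUES = ["heart", "blood", "brain", "lung", "kidney", "intestine", "pancreas"]


def build_obs_filter(tissues, disease="normal", suspension_type="cell"):
    """Build a value-filter string selecting healthy cells from the given tissues."""
    tissue_list = ", ".join(f"'{t}'" for t in tissues)
    return (
        f"tissue_general in [{tissue_list}] "
        f"and disease == '{disease}' "
        f"and suspension_type == '{suspension_type}'"
    )


OBS_FILTER = build_obs_filter(TISSUES)


def fetch_healthy_cells(obs_filter=OBS_FILTER, census_version="stable"):
    """Query the Census and return an AnnData object (potentially very large)."""
    import cellxgene_census  # imported lazily so the filter builder works without it

    with cellxgene_census.open_soma(census_version=census_version) as census:
        return cellxgene_census.get_anndata(
            census,
            organism="Homo sapiens",
            obs_value_filter=obs_filter,
        )


print(OBS_FILTER)
```

In practice one would probably query one tissue at a time (for example `build_obs_filter(["lung"])`) and subsample before concatenating, to keep memory use bounded.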
Assigns identifiers to datasets indexed by CELLxGENE, such as those resulting from scRNA-seq experiments.
https://mit-license.org
This project uses the scCompass and CELLxGENE datasets at data scales of 100K, 200K, 500K, 1M, 2M, and 5M to pre-train the scGPT model.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The h5ad file can be used as demo input for Cellxgene VIP. The dataset was reprocessed from the raw fastq files of the Schirmer et al., Nature 2019 paper. Details for reproducing the h5ad file can be found at https://github.com/interactivereport/cellxgene_VIP/blob/master/notebook/MS_Nature_Rowitch_snRNAseq.ipynb. Two rds files are also included; they are the input files for the sample differential expression (DE) analysis scripts (glmmTMB and Nebula).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Four Visium Spatial Transcriptomics datasets downloaded from the 10X Genomics data site and organized for use as Cellxgene VIP input:
- 10X_demo_data_Breast_Cancer_Block_A_Section_1
- 10X_demo_data_Breast_Cancer_Block_A_Section_2
- 10X_demo_data_Human_Heart
- 10X_demo_data_Human_Lymph_Node
https://mit-license.org
ScCompass and CELLxGENE Training Datasets: Human and Mouse for scGPT.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
h5ad objects for cellxgene visualization of scDRS results:
- scdrs_tmsfacs_thin.h5ad: scDRS results for the TMS FACS data of 110,096 cells (gene count matrix removed to save space)
- scdrs_demo.h5ad: demo scDRS results for 3 TMS FACS cell types and 3 diseases (gene count matrix removed to save space)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset contains comprehensive metadata from single-cell gene expression studies, providing researchers with structured information about cellular phenotypes, experimental conditions, and sample characteristics. The data is particularly valuable for bioinformatics research, machine learning applications in genomics, and comparative studies across different cell types and conditions.

Dataset Description
The dataset comprises metadata associated with single-cell RNA sequencing (scRNA-seq) experiments, including:
- Cell Type Information: Classification of different cell types and subtypes
- Experimental Metadata: Details about experimental conditions, protocols, and methodologies
- Sample Characteristics: Information about biological samples, including tissue origin, developmental stages, and treatment conditions
- Quality Metrics: Data quality indicators and filtering parameters
- Annotation Details: Standardized cell type annotations and biological classifications

Data Source and Licensing
This dataset is derived from publicly available single-cell gene expression data, potentially sourced from:
- CELLxGENE Data Portal (https://cellxgene.cziscience.com/)
- Gene Expression Omnibus (GEO)
- European Bioinformatics Institute (EBI)
- Other public genomics repositories

License: Creative Commons CC BY 4.0 (or specify the actual license)
- ✅ Commercial use allowed
- ✅ Modification allowed
- ✅ Distribution allowed
- ✅ Private use allowed
- ❗ Attribution required
Research Applications
- Cell Type Discovery: Identify novel cell types and subtypes
- Comparative Genomics: Study cellular differences across conditions, tissues, or species
- Disease Research: Investigate cellular changes in disease states
- Developmental Biology: Analyze cellular differentiation and development patterns

Machine Learning Applications
- Classification Tasks: Predict cell types from gene expression data
- Clustering Analysis: Discover cellular subpopulations and states
- Dimensionality Reduction: Apply PCA, t-SNE, UMAP for visualization
- Biomarker Discovery: Identify genes characteristic of specific cell types

Educational Use
- Teaching bioinformatics and computational biology concepts
- Demonstrating single-cell analysis workflows
- Training in data preprocessing and quality control

Data Quality and Preprocessing
- Quality Control: Metadata has been curated and standardized
- Missing Values: [Specify how missing values are handled]
- Standardization: Cell type annotations follow established ontologies (e.g., Cell Ontology)
- Validation: Data has been cross-referenced with original publications
Usage Guidelines

Getting Started
- Load the metadata files using pandas or your preferred data analysis tool.
- Explore the cell type distributions and experimental conditions.
- Filter data based on quality metrics as needed.
- Join with corresponding gene expression data for comprehensive analysis.

Best Practices
- Always cite original data sources and publications.
- Consider batch effects when combining data from different experiments.
- Validate findings with independent datasets when possible.
- Follow established bioinformatics workflows for single-cell analysis.

Citation and Acknowledgments
If you use this dataset in your research, please cite it: Kazi Aishikuzzaman. (2024). Cell Gene Expression Metadata. Kaggle. https://www.kaggle.com/datasets/kaziaishikuzzaman/cell-gene-expression-metadata
File Structure
dataset/
├── metadata_summary.csv # Main metadata file
├── cell_type_annotations.csv # Detailed cell type information
├── experimental_conditions.csv # Experiment-specific metadata
├── quality_metrics.csv # Data quality indicators
└── README.txt # Detailed file descriptions
Technical Specifications
- File Encoding: UTF-8
- Separator: Comma-separated values (CSV)
- Missing Values: Represented as 'NA' or empty cells
- Data Types: Mixed (categorical, numerical, text)
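Given the specifications above (UTF-8, comma-separated, missing values written as 'NA' or left empty), a first look at one of the metadata files could be sketched as follows. The sketch uses only the Python standard library as a dependency-free alternative to pandas. The column names `cell_id`, `cell_type`, and `tissue` and the inline sample rows are hypothetical, since the actual columns are not listed here.

```python
import csv
import io
from collections import Counter

MISSING_MARKERS = {"", "NA"}  # per the specification: 'NA' or empty cells


def clean(value):
    """Strip whitespace and map missing-value markers to None."""
    text = (value or "").strip()
    return None if text in MISSING_MARKERS else text


def read_metadata(handle):
    """Read comma-separated metadata rows into a list of dicts."""
    return [{key: clean(value) for key, value in row.items()}
            for row in csv.DictReader(handle)]


# Hypothetical stand-in for metadata_summary.csv; for a real file use
# open("metadata_summary.csv", encoding="utf-8", newline="") instead.
SAMPLE = io.StringIO(
    "cell_id,cell_type,tissue\n"
    "c1,T cell,blood\n"
    "c2,NA,lung\n"
    "c3,B cell,\n"
    "c4,T cell,blood\n"
)

rows = read_metadata(SAMPLE)
cell_type_counts = Counter(r["cell_type"] for r in rows if r["cell_type"] is not None)
print(cell_type_counts)
```

Normalizing both missing-value markers to `None` at read time keeps later filtering and counting steps from having to special-case the string 'NA'.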
Contact and Support
For questions about this dataset:
- Kaggle Profile: @kaziaishikuzzaman
- Dataset Issues: Use Kaggle's discussion section
- Collaboration: Open to research collaborations and improvements

Version History
- v1.0: Initial release with comprehensive metadata collection
- [Future versions]: Updates and additional annotations as available

Related Datasets
Consider exploring these complementary datasets:
- Single-cell gene expression data (companion to this metadata)
- Cell atlas datasets from major consortiums
- Disease-specific single-cell studies
- Multi-omics datasets with matching cell types

Keywords: single-cell, RNA-seq, genomics, cell types, metadata, bioinformatics, machine learning, computational biology
Category: Biology > Genomics
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
snRNASeq data generated at Biogen from 3 control mouse brains. Three brain regions were sampled from each brain.
Animal IDs 1, 4 and 7
Brain region codes: W: WhiteMatter, H: Hippo, G: GreyMatter
The 10X standard mm10 (3.0.0) reference was used with cellranger 5.0.0 and the --include-introns option enabled.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Single Cell Alzheimer's Disease Data Portal is an aggregated data portal created as part of the Enfield EU-funded program for research on the single-cell Generative Pretrained Transformer (scGPT-AD) model. The portal contains data from the ssREAD data portal, along with single-cell AD data from recent studies (dharsini et al, pan et al, rexach et al). The data from the individual studies were accessed through the cellXgene data portal, a large portal for single-cell data. The data have been uploaded as two separate .zip files (part1, part2).
The single-cell data follow the Annotated Data (AnnData) format. The core data for each sample is the gene-expression matrix, which gives the expression level of each gene in each cell. Additionally, the dataset contains the `.obs` attribute, which includes core cell metadata for each sample (cell type, brain region, Braak stage, donor age, disease condition, donor gender, etc.), along with the gene names, accessed via the `.var` attribute.
The source data have been processed to create a unified data portal ready to be used as training dataset for a Transformer model. The main processing steps were:
| Total Cells | AD Cells | Control Cells | Unique Genes | Donors |
|---|---|---|---|---|
| 2.3M | 1.2M | 1.1M | 91k | 166 |

| Data Source | Unique Genes | Total Cells | AD Cells | Control Cells | Donors | Cell Type Label | Brain Region | Tissue Type | Braak Stage | Donor ID | Donor Gender | Donor Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rexach et al | 30k | 217k | 118k | 99k | 20 | ✅ | ✘ | ✅ | ✘ | ✅ | ✅ | ✅ |
| pan et al | 61k | 43k | 11k | 32k | 7 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| dharsini et al | 61k | 425k | 311k | 114k | 46 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| ssREAD | 62k | 2.42M | 1.14M | 1.28M | 135 | ✅ | ✅ | ✘ | ✅ | ✅ | ✅ | ✅ |
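As a quick sanity check on the per-source figures, the sketch below confirms that AD and control cells add up to the total for each source and reports the AD fraction, which is useful to know when balancing classes for model training. The counts are transcribed from the per-source table in thousands of cells; the variable and function names are illustrative only.

```python
# Per-source counts from the table above, in thousands of cells.
SOURCES = {
    "rexach et al": {"total": 217, "ad": 118, "control": 99},
    "pan et al": {"total": 43, "ad": 11, "control": 32},
    "dharsini et al": {"total": 425, "ad": 311, "control": 114},
    "ssREAD": {"total": 2420, "ad": 1140, "control": 1280},
}


def is_consistent(counts):
    """True when AD and control cells add up to the reported total."""
    return counts["ad"] + counts["control"] == counts["total"]


checks = {name: is_consistent(counts) for name, counts in SOURCES.items()}
ad_fraction = {name: round(counts["ad"] / counts["total"], 2)
               for name, counts in SOURCES.items()}

for name in SOURCES:
    print(f"{name}: consistent={checks[name]}, AD fraction={ad_fraction[name]}")
```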
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Bladder Tissue from Tabula Muris Senis
Tabula Muris Senis is a mammalian aging single-cell gene expression dataset, downloaded from https://cellxgene.cziscience.com/collections/0b9d8a04-bb9d-44da-aa27-705bb65b54eb. This dataset represents the Bladder tissue, using the SmartSeq2 full-length mRNA library preparation method for single cells. Code to download and process this dataset is available in: https://github.com/seanome/2025-longevity-x-ai-hackathon
Ageing is characterized by a… See the full description on the dataset page: https://huggingface.co/datasets/longevity-db/tabula-muris-senis-bladder-smartseq2.
https://www.datainsightsmarket.com/privacy-policy
Discover the booming single-cell analysis software market! Our in-depth report reveals key trends, growth drivers, leading companies (Cellenics, BioTuring Browser, 10x Genomics Loupe Browser, etc.), and future projections through 2033. Learn about market segmentation and regional analysis to gain a competitive edge.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset hosts files needed to reproduce the Human Retina Cell Atlas (HRCA) reference model using scArches. The HRCA data can be accessed through several interactive browsers, including HCA Data Portal, CELLxGENE, UCSC Cell Browser, and the Broad Single Cell Portal. Please use these browsers for atlas exploration and visualization. For more information on HRCA, please refer to the HRCA paper (Li et al., bioRxiv 2023) and the Github repository at https://github.com/RCHENLAB/HRCA_reproducibility. This dataset has been used in the tutorial for the HRCA reference model at https://github.com/RCHENLAB/HRCA_reproducibility/tree/main/scArches.
Data description:
1. HRCA_snRNA_allcells_rawcounts.h5ad
This file contains the cell-by-gene count matrix for over 3.1 million single nuclei and more than 36,000 gene features of the HRCA. Gene features are represented by gene symbols. Please refer to the interactive browsers for atlas exploration, where gene features are mapped to Ensembl IDs. In the cell metadata, "sampleid" indicates sample batches of cells, and "celltype" specifies 123 retina cell types.
2. model.pt
This file is the trained reference model using scArches, incorporating 10,000 highly variable features from the full count matrix. It can be directly used for cell type annotation of new retina samples.
3. HRCA_snRNA_allcells_rawcounts_latent.h5ad
This file contains the embeddings of all 3.1 million reference single nuclei generated by the trained reference model using scArches. These embeddings can be used to compare with the embeddings of query data for exploration.
4. HRCA_reference_model_gene_id_and_symbol.csv
This file contains the mapping of Ensembl IDs to gene symbols for the 10,000 features used in the reference model. This mapping can be used to convert the gene features in a query .h5ad file from gene IDs to gene symbols, allowing cell type labels to be predicted using the trained reference model, which uses gene symbols as gene features.
5. query.h5ad
This file contains a cell-by-gene count matrix for a query dataset, designed to support reproducibility in the HRCA reference model tutorial. The "majorclass" column includes pre-annotated major cell classes. Additional details on the tutorial are available at https://github.com/RCHENLAB/HRCA_reproducibility/tree/main/scArches.
6. query_latent.h5ad
This file contains the embeddings of the query data against the trained reference model. These embeddings can be compared with the reference data embeddings for exploration and visualization.
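The conversion described for file 4 (mapping a query dataset's Ensembl IDs to the gene symbols expected by the reference model) can be sketched as below. This is an illustration under stated assumptions rather than the tutorial's own code: the CSV column names `gene_id` and `gene_symbol` are hypothetical, and the IDs and symbols in the sample are placeholders, not real genes.

```python
import csv
import io


def load_id_to_symbol(handle, id_column="gene_id", symbol_column="gene_symbol"):
    """Read the mapping CSV into a dict of Ensembl ID -> gene symbol."""
    return {row[id_column]: row[symbol_column] for row in csv.DictReader(handle)}


def rename_features(gene_ids, id_to_symbol):
    """Return positions of query genes found in the mapping, and their symbols."""
    kept = [i for i, gene_id in enumerate(gene_ids) if gene_id in id_to_symbol]
    symbols = [id_to_symbol[gene_ids[i]] for i in kept]
    return kept, symbols


# Placeholder stand-in for HRCA_reference_model_gene_id_and_symbol.csv.
MAPPING_CSV = io.StringIO(
    "gene_id,gene_symbol\n"
    "ENSG00000000001,GENE_A\n"
    "ENSG00000000002,GENE_B\n"
)
id_to_symbol = load_id_to_symbol(MAPPING_CSV)

# Hypothetical query feature order; the middle gene is not in the reference.
query_gene_ids = ["ENSG00000000002", "ENSG00000000009", "ENSG00000000001"]
kept, symbols = rename_features(query_gene_ids, id_to_symbol)
print(kept, symbols)
```

With an AnnData query object, the same result would presumably be applied by subsetting the columns to `kept` and assigning `symbols` as the new variable names; query genes absent from the mapping are dropped because the reference model has no corresponding feature.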
Remark 1: for cell cycle analysis, see the paper https://arxiv.org/abs/2208.05229 "Computational challenges of cell cycle analysis using single cell transcriptomics", Alexander Chervov, Andrei Zinovyev
Remark 2: The first part of the data can be found at https://www.kaggle.com/alexandervc/scrnaseq-tabula-sapiens-human-500-000-cells
Data: results of single-cell RNA sequencing. Rows correspond to cells and columns to genes (or vice versa). Each value of the matrix shows how strongly the corresponding gene is expressed in the corresponding cell. https://en.wikipedia.org/wiki/Single-cell_transcriptomics
Particular data: the "Tabula Sapiens" project: https://tabula-sapiens-portal.ds.czbiohub.org/
Data section for download: https://figshare.com/articles/dataset/Tabula_Sapiens_release_1_0/14267219
Paper: https://www.science.org/doi/10.1126/science.abl4896 https://www.biorxiv.org/content/10.1101/2021.07.19.452956v2
Tabula Sapiens is a benchmark, first-draft human cell atlas of nearly 500,000 cells from 24 organs of 15 normal human subjects. This work is the product of the Tabula Sapiens Consortium. Special thanks to the Chan Zuckerberg Initiative for funding this project and to the CZI Science Technology team for creating cellxgene, the tool that makes the visualization of this research possible.
Course at the Sanger Institute: https://scrnaseq-course.cog.sanger.ac.uk/website/tabula-muris.html
Course at CZ-hub: https://chanzuckerberg.github.io/scRNA-python-workshop/intro/about
On Kaggle, copies of the notebooks and data from the course above: https://www.kaggle.com/aayush9753/singlecell-rnaseq-data-from-mouse-brain
Single-cell RNA sequencing is an important technology in modern biology; see e.g. "Eleven grand challenges in single-cell data science" https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-1926-6
Also see the review by P. Kharchenko in Nature Methods: "The triumphs and limitations of computational methods for scRNA-seq" https://www.nature.com/articles/s41592-021-01171-x
This single-cell RNA-sequencing (scRNA-seq) dataset comprises two files: an RData file (combined_data.RData), which can be loaded into RStudio to generate a Seurat object, and an h5ad object (annotated_combined_adata_full.h5ad) for downstream analysis in Scanpy or cellxgene. The dataset contains previously published data and five new samples derived from kidney allografts undergoing graft nephrectomies. Overall, 217,411 human kidney cells are included, including 151,038 ‘control’ cells from living donor biopsies or non-tumorous regions of tumour nephrectomies and 66,373 cells from diseased samples, including chronic kidney disease and different aetiologies of transplant rejection. For full information on generation of the dataset, please see the associated preprint, which has been uploaded to bioRxiv and is available at: https://www.biorxiv.org/content/10.1101/2022.10.28.514222v2. The code used for scRNA-seq analysis is available at: https://github.com/daniyal-jafree1995/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CellxGene 45M Collection
A curated subset of CellxGene (~45M cells) used to align the Stack model after pretraining on full human scBaseCount.
Selection Criteria
- ≥ 50,000 cells per dataset
- ≥ 5 donors per dataset
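The two selection criteria can be expressed as a simple per-dataset filter. The sketch below assumes both thresholds must hold at once and that per-dataset cell and donor counts are already available; the dataset records shown are made-up examples, not entries from the actual collection.

```python
MIN_CELLS = 50_000   # at least 50,000 cells per dataset
MIN_DONORS = 5       # at least 5 donors per dataset


def passes_selection(n_cells, n_donors):
    """True when a dataset meets both thresholds (assumed to apply jointly)."""
    return n_cells >= MIN_CELLS and n_donors >= MIN_DONORS


# Made-up records for illustration.
datasets = [
    {"id": "ds_a", "n_cells": 120_000, "n_donors": 12},
    {"id": "ds_b", "n_cells": 50_000, "n_donors": 4},
    {"id": "ds_c", "n_cells": 49_999, "n_donors": 30},
    {"id": "ds_d", "n_cells": 50_000, "n_donors": 5},
]

selected = [d["id"] for d in datasets
            if passes_selection(d["n_cells"], d["n_donors"])]
print(selected)
```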
Cell Type Annotations
Author-annotated coarse-grained cell type labels were heuristically identified and transferred to adata.obs["author_cell_type"].
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains scRNA-Seq data related to the Jenkins et al. 2024 study "Single cell and spatial analysis of immune-hot and immune-cold tumours identifies fibroblast subtypes associated with distinct immunological niches and positive immunotherapy response".
HNSCC_fibroblasts_integ_srt.RDS - Seurat object containing fibroblasts from integrated analysis of EPG dataset (https://cellxgene.cziscience.com/collections/3c34e6f1-6827-47dd-8e19-9edcd461893f) with GSE164690 - Relating to Figure 2.
PCFA_srt_obj.RDS - Seurat object containing Pan-Cancer Fibroblast Atlas (PCFA) - Relating to Figures 5-7.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Single-cell lung (CellxGene Census) — Zarr
This dataset was exported from the CellxGene Census as a chunked + compressed Zarr store intended for easy streaming access.
- Source: CellxGene Census API
- Organism: Homo sapiens
- Filter: tissue_general == 'lung' and is_primary_data == True
- Shape: 100,000 cells × 61,497 genes
- Zarr path: lung.zarr
Compression
- Uncompressed (dense float32): 22.91 GB
- Compressed Zarr: ~307 MB (322 MB on Hub)
- Compression ratio: ~76× (Blosc zstd on… See the full description on the dataset page: https://huggingface.co/datasets/KokosDev/single-cell-lung-zarr.
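The quoted sizes can be reproduced from the matrix shape. A dense float32 matrix stores 4 bytes per entry, so the arithmetic below recovers the uncompressed figure and the compression ratio. It assumes that "GB" and "MB" in the listing are binary units (GiB and MiB), which is the reading under which the numbers line up.

```python
N_CELLS = 100_000
N_GENES = 61_497
BYTES_PER_FLOAT32 = 4

# Size of the dense matrix if every entry were stored explicitly.
dense_bytes = N_CELLS * N_GENES * BYTES_PER_FLOAT32
dense_gib = dense_bytes / 2**30

# Compressed size as quoted in the listing, read here as MiB (an assumption).
compressed_mib = 307
ratio = dense_gib * 1024 / compressed_mib

print(f"dense: {dense_gib:.2f} GiB, compression ratio: ~{ratio:.0f}x")
```

Most of the saving is expected for single-cell count matrices, which are dominated by zeros and therefore compress very well.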