CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
"*.csv" files contain the single-cell gene expression values (log2(tpm+1)) for all genes in each cell from melanoma and head and neck squamous cell carcinoma (HNSCC) tumors. The cell type and tumor of origin for each cell are also included in the "*.csv" files. The "MalignantCellSubtypes.xlsx" file defines the tumor subtype. "CCLE_RNAseq_rsem_genes_tpm_20180929.zip" was downloaded from the CCLE database.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Contains loom files and preprocessed adata objects to compare methods for temporal gene expression integration. Loom files can be accessed using the 'read' function in Scvelo. Preprocessed adata objects can be accessed using the 'read_h5ad' function in Scanpy.
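The two access routes named above can be sketched as a small dispatcher. This is a minimal illustration, assuming files are distinguished by their `.loom` and `.h5ad` extensions; the helper names `pick_reader` and `load` are hypothetical and not part of the deposited code.

```python
from pathlib import Path


def pick_reader(path: str) -> str:
    """Return the name of the reader described above for a given file."""
    suffix = Path(path).suffix.lower()
    if suffix == ".loom":
        return "scvelo.read"
    if suffix == ".h5ad":
        return "scanpy.read_h5ad"
    raise ValueError(f"unsupported file type: {suffix!r}")


def load(path: str):
    """Load a file with the matching reader (scvelo or scanpy must be installed)."""
    if pick_reader(path) == "scvelo.read":
        import scvelo as scv  # imported lazily so pick_reader works without it
        return scv.read(path)
    import scanpy as sc
    return sc.read_h5ad(path)
```

Calling `load` on either file type should return an AnnData object, so downstream code does not need to care which format a given dataset was stored in.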
The raw single-cell RNA sequencing datasets can be found under the following accession codes.
Mouse embryonic cell cycle dataset from Ref. (https://doi.org/10.1038/nbt.3102) was originally downloaded from ArrayExpress with the accession code E-MTAB-2805.
Hematopoiesis differentiation dataset from Ref. (https://doi.org/10.1182/blood-2016-05-716480) was originally downloaded from the Gene Expression Omnibus with the accession code GSE81682.
NKT cell differentiation dataset from Ref. (https://doi.org/10.1038/ni.3437) was originally downloaded from the Gene Expression Omnibus with the accession code GSE74596.
Hematopoiesis differentiation dataset from Ref. (https://doi.org/10.1038/nature19348) was originally downloaded from the Gene Expression Omnibus with the accession codes GSE70236, GSE70240, GSE70244.
LPS stimulation dataset from Ref. (https://doi.org/10.1016/j.cels.2017.03.010) was originally downloaded from the Gene Expression Omnibus with the accession code GSE94383.
IFN-gamma stimulation dataset from Ref. (https://doi.org/10.1038/s41587-020-00803-5) was originally downloaded from the Gene Expression Omnibus with the accession code GSE161465.
AML chemotherapy dataset from Ref. (https://doi.org/10.1038/s41591-018-0233-1) was originally downloaded from the Gene Expression Omnibus with the accession code GSE116481.
AML diagnosis/relapse dataset from Ref. (https://doi.org/10.1038/s41375-021-01338-7) was originally downloaded from the Gene Expression Omnibus with the accession code GSE126068.
MS case control PBMC and CSF datasets from Ref. (https://doi.org/10.1038/s41467-019-14118-w) were originally downloaded from the Gene Expression Omnibus with the accession code GSE138266.
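For convenience, the GEO accession codes listed above can be turned into record links. The sketch below assumes the standard GEO accession-viewer URL pattern (the same pattern used for GSE223311 elsewhere on this page); `geo_url` is a hypothetical helper. The ArrayExpress code E-MTAB-2805 is left out because it is not a GEO series.

```python
# GEO series accessions copied from the list above.
GEO_ACCESSIONS = [
    "GSE81682", "GSE74596", "GSE70236", "GSE70240", "GSE70244",
    "GSE94383", "GSE161465", "GSE116481", "GSE126068", "GSE138266",
]


def geo_url(accession: str) -> str:
    """Build the GEO accession-viewer link for a GSE series code."""
    if not (accession.startswith("GSE") and accession[3:].isdigit()):
        raise ValueError(f"not a GEO series accession: {accession!r}")
    return "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=" + accession


urls = {acc: geo_url(acc) for acc in GEO_ACCESSIONS}
```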
https://spdx.org/licenses/CC0-1.0.html
Skeletal muscle repair is driven by the coordinated self-renewal and fusion of myogenic stem and progenitor cells. Single-cell gene expression analyses of myogenesis have been hampered by the poor sampling of rare and transient cell states that are critical for muscle repair, and do not inform the spatial context that is important for myogenic differentiation. Here, we demonstrate how large-scale integration of single-cell and spatial transcriptomic data can overcome these limitations. We created a single-cell transcriptomic dataset of mouse skeletal muscle by integration, consensus annotation, and analysis of 23 newly collected scRNAseq datasets and 88 publicly available single-cell (scRNAseq) and single-nucleus (snRNAseq) RNA-sequencing datasets. The resulting dataset includes more than 365,000 cells and spans a wide range of ages, injury, and repair conditions. Together, these data enabled identification of the predominant cell types in skeletal muscle, and resolved cell subtypes, including endothelial subtypes distinguished by vessel-type of origin, fibro/adipogenic progenitors defined by functional roles, and many distinct immune populations. The representation of different experimental conditions and the depth of transcriptome coverage enabled robust profiling of sparsely expressed genes. We built a densely sampled transcriptomic model of myogenesis, from stem cell quiescence to myofiber maturation and identified rare, transitional states of progenitor commitment and fusion that are poorly represented in individual datasets. We performed spatial RNA sequencing of mouse muscle at three time points after injury and used the integrated dataset as a reference to achieve a high-resolution, local deconvolution of cell subtypes. We also used the integrated dataset to explore ligand-receptor co-expression patterns and identify dynamic cell-cell interactions in muscle injury response. 
We provide a public web tool to enable interactive exploration and visualization of the data. Our work supports the utility of large-scale integration of single-cell transcriptomic data as a tool for biological discovery.
Methods
Mice. The Cornell University Institutional Animal Care and Use Committee (IACUC) approved all animal protocols, and experiments were performed in compliance with its institutional guidelines. Adult C57BL/6J mice (Mus musculus) were obtained from Jackson Laboratories (#000664; Bar Harbor, ME) and were used at 4-7 months of age. Aged C57BL/6J mice were obtained from the National Institute on Aging (NIA) Rodent Aging Colony and were used at 20 months of age. For new scRNAseq experiments, female mice were used in each experiment.
Mouse injuries and single-cell isolation. To induce muscle injury, both tibialis anterior (TA) muscles of old (20 months) C57BL/6J mice were injected with 10 µl of notexin (10 µg/ml; Latoxan; France). At 0, 1, 2, 3.5, 5, or 7 days post-injury (dpi), mice were sacrificed and TA muscles were collected and processed independently to generate single-cell suspensions. Muscles were digested with 8 mg/ml Collagenase D (Roche; Switzerland) and 10 U/ml Dispase II (Roche; Switzerland), followed by manual dissociation to generate cell suspensions. Cell suspensions were sequentially filtered through 100 and 40 μm filters (Corning Cellgro #431752 and #431750) to remove debris. Erythrocytes were removed through incubation in erythrocyte lysis buffer (IBI Scientific #89135-030).
Single-cell RNA-sequencing library preparation. After digestion, single-cell suspensions were washed and resuspended in 0.04% BSA in PBS at a concentration of 10^6 cells/ml. Cells were counted manually with a hemocytometer to determine their concentration. Single-cell RNA-sequencing libraries were prepared using the Chromium Single Cell 3’ reagent kit v3 (10x Genomics, PN-1000075; Pleasanton, CA) following the manufacturer’s protocol. Cells were diluted into the Chromium Single Cell A Chip to yield a recovery of 6,000 single-cell transcriptomes. After preparation, libraries were sequenced on a NextSeq 500 (Illumina; San Diego, CA) using 75 cycle high output kits (Index 1 = 8, Read 1 = 26, and Read 2 = 58). Details on estimated sequencing saturation and the number of reads per sample are shown in Sup. Data 1.
Spatial RNA sequencing library preparation. Tibialis anterior muscles of adult (5 mo) C57BL/6J mice were injected with 10 µl notexin (10 µg/ml) at 2, 5, and 7 days prior to collection. Upon collection, tibialis anterior muscles were isolated, embedded in OCT, and frozen fresh in liquid nitrogen. Spatially tagged cDNA libraries were built using the Visium Spatial Gene Expression 3’ Library Construction v1 Kit (10x Genomics, PN-1000187; Pleasanton, CA) (Fig. S7). Optimal tissue permeabilization time for 10 µm thick sections was found to be 15 minutes using the 10x Genomics Visium Tissue Optimization Kit (PN-1000193). H&E stained tissue sections were imaged using a Zeiss PALM MicroBeam laser capture microdissection system, and the images were stitched and processed using Fiji ImageJ software. cDNA libraries were sequenced on an Illumina NextSeq 500 using 150 cycle high output kits (Read 1=28bp, Read 2=120bp, Index 1=10bp, and Index 2=10bp). Frames around the capture area on the Visium slide were aligned manually and spots covering the tissue were selected using Loupe Browser v4.0.0 software (10x Genomics). Sequencing data were then aligned to the mouse reference genome (mm10) using the spaceranger v1.0.0 pipeline to generate a feature-by-spot-barcode expression matrix (10x Genomics).
Download and alignment of single-cell RNA sequencing data. For all samples available via SRA, parallel-fastq-dump (github.com/rvalieris/parallel-fastq-dump) was used to download raw .fastq files. Samples which were only available as .bam files were converted to .fastq format using bamtofastq from 10x Genomics (github.com/10XGenomics/bamtofastq). Raw reads were aligned to the mm10 reference using cellranger (v3.1.0).
Preprocessing and batch correction of single-cell RNA sequencing datasets. First, ambient RNA signal was removed using the default SoupX (v1.4.5) workflow (autoEstCounts and adjustCounts; github.com/constantAmateur/SoupX). Samples were then preprocessed using the standard Seurat (v3.2.1) workflow (NormalizeData, ScaleData, FindVariableFeatures, RunPCA, FindNeighbors, FindClusters, and RunUMAP; github.com/satijalab/seurat). Cells with fewer than 750 features, fewer than 1000 transcripts, or more than 30% of unique transcripts derived from mitochondrial genes were removed. After preprocessing, DoubletFinder (v2.0) was used to identify putative doublets in each dataset individually. BCmvn optimization was used for pK parameterization. Estimated doublet rates were computed by fitting the total number of cells after quality filtering to a linear regression of the expected doublet rates published in the 10x Chromium handbook. Estimated homotypic doublet rates were also accounted for using the modelHomotypic function. The default pN value (0.25) was used. Putative doublets were then removed from each individual dataset. After preprocessing and quality filtering, we merged the datasets and performed batch correction with three tools, independently: Harmony (github.com/immunogenomics/harmony) (v1.0), Scanorama (github.com/brianhie/scanorama) (v1.3), and BBKNN (github.com/Teichlab/bbknn) (v1.3.12). We then used Seurat to process the integrated data. After initial integration, we removed the noisy cluster and re-integrated the data using each of the three batch-correction tools.
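The quality thresholds stated above can be written out as a simple per-cell predicate. This is a language-neutral restatement in Python, not the authors' Seurat code; the `CellStats` record and the example cells are made up for illustration.

```python
from dataclasses import dataclass

# Thresholds as stated in the text above.
MIN_FEATURES = 750       # cells with fewer detected features are removed
MIN_TRANSCRIPTS = 1000   # cells with fewer transcripts are removed
MAX_PCT_MITO = 30.0      # cells with a higher mitochondrial percentage are removed


@dataclass
class CellStats:
    n_features: int
    n_transcripts: int
    pct_mito: float


def passes_qc(cell: CellStats) -> bool:
    """Keep a cell only if it clears all three thresholds."""
    return (
        cell.n_features >= MIN_FEATURES
        and cell.n_transcripts >= MIN_TRANSCRIPTS
        and cell.pct_mito <= MAX_PCT_MITO
    )


# Hypothetical cells: one good, one per failure mode.
cells = [
    CellStats(1200, 3500, 4.2),   # passes
    CellStats(600, 2000, 3.0),    # too few features
    CellStats(900, 950, 5.0),     # too few transcripts
    CellStats(2000, 8000, 41.0),  # too much mitochondrial signal
]
kept = [c for c in cells if passes_qc(c)]
```

Cells sitting exactly on a threshold (750 features, 1000 transcripts, 30% mitochondrial) are kept here, following the "fewer than" and "more than" wording.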
Cell type annotation. Cell types were determined for each integration method independently. For Harmony and Scanorama, dimensions accounting for 95% of the total variance were used to generate SNN graphs (Seurat::FindNeighbors). Louvain clustering was then performed on the output graphs (including the corrected graph output by BBKNN) using Seurat::FindClusters. A clustering resolution of 1.2 was used for Harmony (25 initial clusters), BBKNN (28 initial clusters), and Scanorama (38 initial clusters). Cell types were determined based on expression of canonical genes (Fig. S3). Clusters which had similar canonical marker gene expression patterns were merged.
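Choosing the "dimensions accounting for 95% of the total variance" can be sketched as a cumulative sum over per-component variances. The sketch assumes that "total variance" means the summed variance of the computed components and that per-component standard deviations are available; `n_dims_for_variance` is a hypothetical helper and the example values are invented.

```python
def n_dims_for_variance(stdevs, target=0.95):
    """Smallest number of leading components whose variance share reaches target."""
    variances = [s * s for s in stdevs]
    total = sum(variances)
    cumulative = 0.0
    for i, variance in enumerate(variances, start=1):
        cumulative += variance
        if cumulative / total >= target:
            return i
    return len(variances)


# Made-up standard deviations for five components, sorted from largest to smallest.
example_stdevs = [6.0, 3.0, 2.0, 1.0, 1.0]
n_dims = n_dims_for_variance(example_stdevs)
```

In this toy example the first two components explain 45/51 of the variance and the first three explain 49/51, so three dimensions are selected.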
Pseudotime workflow. Cells were subset based on the consensus cell types between all three integration methods. Harmony embedding values from the dimensions accounting for 95% of the total variance were used for further dimensional reduction with PHATE, using phateR (v1.0.4) (github.com/KrishnaswamyLab/phateR).
Deconvolution of spatial RNA sequencing spots. Spot deconvolution was performed using the deconvolution module in BayesPrism (previously known as “Tumor microEnvironment Deconvolution”, TED, v1.0; github.com/Danko-Lab/TED). First, myogenic cells were re-labeled, according to binning along the first PHATE dimension, as “Quiescent MuSCs” (bins 4-5), “Activated MuSCs” (bins 6-7), “Committed Myoblasts” (bins 8-10), and “Fusing Myocytes” (bins 11-18). Culture-associated muscle stem cells were ignored and myonuclei labels were retained as “Myonuclei (Type IIb)” and “Myonuclei (Type IIx)”. Next, highly and differentially expressed genes across the 25 groups of cells were identified with differential gene expression analysis using Seurat (FindAllMarkers, using Wilcoxon Rank Sum Test; results in Sup. Data 2). The resulting genes were filtered based on average log2-fold change (avg_logFC > 1) and the percentage of cells within the cluster which express each gene (pct.expressed > 0.5), yielding 1,069 genes. Mitochondrial and ribosomal protein genes were also removed from this list, in line with recommendations in the BayesPrism vignette. For each of the cell types, mean raw counts were calculated across the 1,069 genes to generate a gene expression profile for BayesPrism. Raw counts for each spot were then passed to the run.Ted function, using
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This page includes the data and code necessary to reproduce the results of the following paper: Yang Liao, Dinesh Raghu, Bhupinder Pal, Lisa Mielke and Wei Shi. cellCounts: fast and accurate quantification of 10x Chromium single-cell RNA sequencing data. Under review.
A Linux computer running CentOS 7 (or later) or Ubuntu 20.04 (or later) is recommended for running this analysis. The computer should have >2 TB of disk space and >64 GB of RAM.
The following software packages need to be installed before running the analysis. Software executables generated after installation should be included in the $PATH environment variable.
- R (v4.0.0 or newer) https://www.r-project.org/
- Rsubread (v2.12.2 or newer) http://bioconductor.org/packages/3.16/bioc/html/Rsubread.html
- CellRanger (v6.0.1) https://support.10xgenomics.com/single-cell-gene-expression/software/overview/welcome
- STARsolo (v2.7.10a) https://github.com/alexdobin/STAR
- sra-tools (v2.10.0 or newer) https://github.com/ncbi/sra-tools
- Seurat (v3.0.0 or newer) https://satijalab.org/seurat/
- edgeR (v3.30.0 or newer) https://bioconductor.org/packages/edgeR/
- limma (v3.44.0 or newer) https://bioconductor.org/packages/limma/
- mltools (v0.3.5 or newer) https://cran.r-project.org/web/packages/mltools/index.html
Reference packages generated by 10x Genomics are also required for this analysis and they can be downloaded from the following link (the 2020-A version of the individual human and mouse reference packages should be selected): https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest
Once all of this is done, you can simply run the shell script ‘test-all-new.bash’ to perform all the analyses carried out in the paper. This script will automatically download the mixture scRNA-seq data from the SRA database, and it will output a text file called ‘test-all.log’ that contains all the screen outputs and speed/accuracy results of CellRanger, STARsolo and cellCounts.
1. Main Description
---------------------------
This is the Zenodo repository for the manuscript titled "A TCR β chain-directed antibody-fusion molecule that activates and expands subsets of T cells and promotes antitumor activity." The code included in the file titled `marengo_code_for_paper_jan_2023.R` was used to generate the figures from the single-cell RNA sequencing data.
The following libraries are required for script execution:
File Descriptions
---------------------------
Linked Files
---------------------
This repository contains code for the analysis of the single-cell RNA-seq dataset. The dataset contains raw FASTQ files, as well as the aligned files that were deposited in GEO. The "Rdata" or "Rds" file was deposited in Zenodo. Provided below are descriptions of the linked datasets:
Gene Expression Omnibus (GEO) ID: GSE223311 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE223311)
Sequence read archive (SRA) repository ID: SRX19088718 and SRX19088719
Zenodo DOI: 10.5281/zenodo.7566113 (https://zenodo.org/record/7566113#.ZCcmvC2cbrJ)
Installation and Instructions
--------------------------------------
The code included in this submission requires several essential packages, as listed above. Please follow these instructions for installation:
> Ensure you have R version 4.1.2 or higher for compatibility.
> Although it is not essential, you can use RStudio (version 2022.12.0+353) for accessing and executing the code.
1. Download the "Rdata" or "Rds" file from Zenodo (https://zenodo.org/record/7566113#.ZCcmvC2cbrJ) (Zenodo DOI: 10.5281/zenodo.7566113).
2. Open RStudio (https://www.rstudio.com/tags/rstudio-ide/) or a similar integrated development environment (IDE) for R.
3. Set your working directory to where the following files are located:
You can use the following code to set the working directory in R:
> setwd(directory)
4. Open the file titled "Install_Packages.R" and execute it in your R IDE. This script will attempt to install all the necessary packages and their dependencies in order to set up an environment where the code in "marengo_code_for_paper_jan_2023.R" can be executed.
5. Once the "Install_Packages.R" script has been successfully executed, restart RStudio or your IDE of choice.
6. Open the file "marengo_code_for_paper_jan_2023.R" in RStudio or your IDE of choice.
7. Execute the commands in "marengo_code_for_paper_jan_2023.R" in RStudio or your IDE of choice to generate the plots.
Bulk data of human pancreas
The dataset from Fadista et al. (2014) contains raw read counts from bulk RNA-seq of human pancreatic islets, collected to study glucose metabolism in healthy and hyper-hypoglycemic conditions. For the purpose of this vignette, the dataset is pre-processed and made available on the data download page. In addition to read counts, this dataset also contains HbA1c levels, BMI, gender and age information for each subject.
Single cell data of human pancreas
The single cell data are from Segerstolpe et al. (2016), which contains read counts for 25453 genes across 2209 cells. Here we only include the 1097 cells from 6 healthy subjects. The read counts are available on the data download page, in the form of an ExpressionSet. Another single cell dataset is from Xin et al. (2016), which has 39849 genes and 1492 cells. The read counts are available on the data download page, in the form of an ExpressionSet.
The deconvolution of 89 subjects from Fadista et al. (2014) is performed with bulk data GSE50244.bulk.eset and single cell reference EMTAB.eset. We constrained our estimation to 6 major cell types: alpha, beta, delta, gamma, acinar and ductal, which make up over 90% of the whole islet.
This is the GitHub repository for the single cell RNA sequencing data analysis for the human manuscript.
The following essential libraries are required for script execution:
- Seurat
- scRepertoire
- ggplot2
- dplyr
- ggridges
- ggrepel
- ComplexHeatmap
Linked File
--------------------------------------
This repository contains code for the analysis of the single-cell RNA-seq dataset. The dataset contains raw FASTQ files, as well as the aligned files that were deposited in GEO. Provided below are descriptions of the linked datasets:
1. Gene Expression Omnibus (GEO) ID: GSE229626
- Title: Gene expression profile at single cell level of human T cells stimulated via antibodies against the T Cell Receptor (TCR)
- Description: This submission contains the matrix.mtx, barcodes.tsv, and genes.tsv files for each replicate and condition, corresponding to the aligned files for single cell sequencing data.
- Submission type: Private. In order to gain access to the repository, you must use a "reviewer token" (https://www.ncbi.nlm.nih.gov/geo/info/reviewer.html).
2. Sequence read archive (SRA) repository
- Title: Gene expression profile at single cell level of human T cells stimulated via antibodies against the T Cell Receptor (TCR)
- Description: This submission contains the "raw sequencing" or .fastq.gz files, which are gzip-compressed text files.
- Submission type: Private. In order to gain access to the repository, you must use a "reviewer token" (https://www.ncbi.nlm.nih.gov/geo/info/reviewer.html).
Please note that since the GSE submission is private, the raw data deposited at SRA may not be accessible until the embargo on GSE229626 has been lifted.
Installation and Instructions
--------------------------------------
The code included in this submission requires several essential packages, as listed above. Please follow these instructions for installation:
> Ensure you have R version 4.1.2 or higher for compatibility.
> Although it is not essential, you can use RStudio (version 2022.12.0+353) for accessing and executing the code.
The following code can be used to set the working directory in R:
> setwd(directory)
Steps:
1. Download the "Human_code_April2023.R" and "Install_Packages.R" R scripts, and the processed data from GSE229626.
2. Open RStudio (https://www.rstudio.com/tags/rstudio-ide/) or a similar integrated development environment (IDE) for R.
3. Set your working directory to where the following files are located:
- Human_code_April2023.R
- Install_Packages.R
4. Open the file titled Install_Packages.R and execute it in your R IDE. This script will attempt to install all the necessary packages and their dependencies.
5. Open the Human_code_April2023.R script and execute commands as necessary.
https://ega-archive.org/dacs/EGAC00001001974
Single-cell RNA sequencing of 26 primary breast cancers from the Wu et al. (2021) study. Data were generated using the Chromium controller (10X Genomics) and sequenced on the NextSeq 500 platform.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We published 3 protocols illustrating how MetaNeighbor can be used to quantify cell type replicability across single cell transcriptomic datasets. The data files included here are needed to run the R version of the protocols available on GitHub (https://github.com/gillislab/MetaNeighbor-Protocol) in RMarkdown (.Rmd) and Jupyter (.ipynb) notebook format. To run the protocols, download the protocols from GitHub, download the data from Figshare, place the data and protocol files in the same directory, then run the notebooks in RStudio or Jupyter. The scripts used to generate the data are included in the GitHub directory. Briefly:
- full_biccn_hvg.rds contains a single cell transcriptomic dataset published by the Brain Initiative Cell Census Network (in SingleCellExperiment format). It combines data from 7 datasets obtained in the mouse primary motor cortex (https://www.biorxiv.org/content/10.1101/2020.02.29.970558v2). Note that this dataset only contains highly variable genes.
- biccn_hvgs.txt: highly variable genes from the BICCN dataset described above (computed with the MetaNeighbor library).
- biccn_gaba.rds: same dataset as full_biccn_hvg.rds, but restricted to GABAergic neurons. The dataset contains all genes common to the 7 BICCN datasets (not just highly variable genes).
- go_mouse.rds: gene ontology annotations, stored as a list of gene symbols (one element per gene set).
- functional_aurocs.txt: results of the MetaNeighbor functional analysis in protocol 3.
The dataset contains an integrated, annotated Seurat v4 object. One can load the dataset into the R environment using the code below:
seurat_obj <- readRDS('PATH/TO/DOWNLOAD/seurat.rds')
The object has three assays: (I) RNA, (II) SCT and (III) integrated.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Custom genome and gene annotations for single cell ATAC and RNA-seq analyses by BaDoi Phan (badoi dot phan at pitt dot edu)
This KiltHub upload is a clone of the GitHub repository, where this project may be updated or corrected in the future: https://github.com/pfenninglab/custom_ArchR_genomes_and_annotations
Premise: Not all single-cell ATAC-seq work in biomedical molecular epigenetics is done in the human and mouse genomes, for which high quality genomes and gene annotations exist. For other species that are still highly relevant to the study of health and disease, here are some ArchR annotations to make analyzing snATAC-seq data with ArchR less frustrating.
Strategy for better gene annotations: We can use the property that related mammalian species tend to have orthologous gene elements (TSS, exons, genes). For example, the house mouse (Mus musculus) is a median of 15.4 MY diverged from the Norway rat (Rattus norvegicus), according to TimeTree. Humans are a median of 28.9 MY diverged from rhesus macaques. To borrow the higher quality and more complete gene annotations, we can use a gene-aware method of lifting gene annotations from one genome to another, liftoff (Shumate and Salzberg, 2021). For the source of "high quality" gene annotation, we use the NCBI Refseq annotations from the hg38/GRCh38 and mm10/GRCm38 annotations downloaded from the UCSC Genome Browser.
For single cell RNA-seq, He, Kleyman et al. 2021 Current Biology (https://pubmed.ncbi.nlm.nih.gov/34727523/) found that using a regular liftOver of the human NCBI Refseq to rheMac10 recovered a higher number of UMI counts assigned to genes. This is likely due to incomplete annotations in either the rheMac8 or rheMac10 genomes for the 3' UTRs that are usually targeted by common single cell/nucleus RNA-seq technologies. This allows more reads that would otherwise be found "outside" a gene because of incomplete 3' UTRs in a target species to be appropriately attributed to that gene using the orthologs of that gene from a more complete annotation in a related species. Furthermore, complex splicing is better measured in humans, so more reads annotated as "intergenic" by the rheMac10 annotations became "intronic" and could be better mapped to an annotation lifted over from human. For this reason, we create alternate annotations for the rhesus macaque, marmoset, and rat genomes, borrowing orthology as identified with the newer liftoff method from more complete human or mouse annotations.
Similarly, for single cell ATAC-seq, a more complete map of genes and transcription start sites (TSS) enables aggregate metrics like a "gene score" to better calculate gene-based measures for co-clustering with a single cell RNA-seq dataset. A more complete annotation would be able to accurately discern single cell open chromatin regions and not falsely report exonic regions or alternate promoters that were missed from primary transcriptomic data in monkey, marmoset, or rat but can be bioinformatically inferred.
Lastly, work by the ENCODE Consortium on the large human and mouse epigenomic datasets has found that certain regions of the genome in these species have artifactual signals and need to be excluded from epigenomic analyses (Amemiya et al., 2021). These regions were pulled for human and mouse from here and mapped to the target genomes below using liftOver, for simplicity.
List of resources by file name
Surprisingly, all these files are small enough to put on GitHub for a couple of custom genomes. The files are organized as follows:
- *.gtf.gz and *.gff3.gz: the gzipped annotation lifted from the higher quality annotations to the target genome using liftoff
- *liftOver*blacklist.v2.bed: the ENCODE regions to exclude from epigenomic analyses mapped to the target genome using liftOver
- *ArchRGenome.R: the Rscript used to make the custom ArchR annotations
- *ArchR_annotations.rda: the R Data object that contains the geneAnnotation and objects to use with ArchR::createArrowFiles()
List of species/genomes/source files
For most of these files, the genome fasta sequences were grabbed from the UCSC Genome Browser at https://hgdownload.soe.ucsc.edu/goldenPath/${GENOME_VERSION}/, where ${GENOME_VERSION} is any of the versions below except mCalJac1. Some of these genomes were updated by the Vertebrate Genomes Project, which seeks to create complete rather than draft genome assemblies of all mammals on the planet (Rhie et al. 2021). These genomes are marked VGP and listed under the VGP name where there is an alternate naming scheme. The VGP is pretty cool and they make good genome assemblies.
rheMac8: rhesus macaque v8
rheMac10: rhesus macaque v10
mCalJac1: marmoset VGP genome, fasta from the maternal assembly here
The Seattle Alzheimer's Disease Brain Cell Atlas (SEA-AD) consortium strives to gain a deep molecular and cellular understanding of the early pathogenesis of Alzheimer's disease and is funded by the National Institute on Aging (NIA U19AG060909). The SEA-AD datasets available here comprise single cell profiling (transcriptomics and epigenomics) and quantitative neuropathology. To explore gene expression and chromatin accessibility information, the single-cell profiling data includes snRNAseq and snATAC-seq data from the SEA-AD donor cohort (aged brains which span the spectrum of Alzheimer's disease pathology) and neurotypical reference brains. To explore key pathological proteins and cell types of interest to Alzheimer's disease, the neuropathology data includes full resolution brightfield images, images processed and segmented in HALO image analysis software, image annotations, and quantification summary files for the relevant stains including Abeta (6E10), IBA1, a-Synuclein, GFAP, H&E-LFB, NeuN, pTau(AT8), and pTDP43.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data for research XX. Notice: the filter_pan_cancer file can be downloaded from DOI 10.6084/m9.figshare.28889885
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
A single-cell transcriptomic atlas characterizes ageing tissues in the mouse
Code to download and process this dataset is available at: https://github.com/seanome/2025-longevity-x-ai-hackathon
The dataset structure is originally from AnnData. Descriptions of each data file are below.
Data Files
This dataset contains multiple parquet files, one for each sheet in the original Excel file: gene-expression-single-cell-mouse_*.parquet - Data files containing gene expression and… See the full description on the dataset page: https://huggingface.co/datasets/longevity-db/gene-expression-single-cell-mouse.
The dataset was downloaded from https://amp.pharm.mssm.edu/archs4/download.html
The methods are described in a Nature Communications paper: https://www.nature.com/articles/s41467-018-03751-6
ARCHS4 provides user-friendly access to gene expression data from the GEO database (https://www.ncbi.nlm.nih.gov/geo/). While most data in GEO are stored in raw formats, ARCHS4 provides prepared count matrices. While GEO stores data separately for each research paper, ARCHS4 collects all the information in one single matrix. One may consult the main site for further information.
Main data files are in H5 (HDF5, Hierarchical Data Format) file format (see https://en.wikipedia.org/wiki/Hierarchical_Data_Format). They contain expression data, as well as annotation data and further meta-information. There are several other auxiliary files, such as a 3D t-SNE projection (in CSV format) and gene correlation matrices for human and mouse in feather format.
The main file (for human), human_matrix.h5, contains the data matrix, which is 238522 samples by 35238 genes, as well as various meta-information: gene names, sample information (tissue, etc.), and references to the GEO database IDs where all the details can be found.
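A matrix of this size is worth inspecting before loading it whole. The sketch below estimates the dense in-memory footprint and lists the datasets inside an HDF5 file without reading their contents. The 4-byte value size is an assumption (the stored dtype is not stated here), and `dense_matrix_gib` and `list_h5_datasets` are hypothetical helpers; the listing function needs the h5py package.

```python
def dense_matrix_gib(n_samples: int, n_genes: int, bytes_per_value: int = 4) -> float:
    """Approximate size in GiB of a dense matrix (4 bytes per value is an assumption)."""
    return n_samples * n_genes * bytes_per_value / 2**30


def list_h5_datasets(path: str):
    """Return (name, shape, dtype) for every dataset in an HDF5 file."""
    import h5py  # imported lazily; only needed when a file is actually opened

    found = []

    def visit(name, obj):
        # Groups are skipped; only leaf datasets are recorded.
        if isinstance(obj, h5py.Dataset):
            found.append((name, obj.shape, str(obj.dtype)))

    with h5py.File(path, "r") as handle:
        handle.visititems(visit)
    return found


# Dimensions quoted above for human_matrix.h5.
size_gib = dense_matrix_gib(238522, 35238)
```

Under the 4-byte assumption the full matrix comes to roughly 31 GiB, so reading selected samples or genes by slicing the HDF5 dataset is usually more practical than loading everything at once.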
There is also similar data for mouse, csv files with TSNE images, correlation matrices for genes.
The ARCHS4 project is by:
'Alexander Lachmann', 'alexander.lachmann@mssm.edu', update: '2020-02-06'
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the processed data from the publication: "Unified single-cell analysis of testis gene regulation and pathology in 5 mouse strains" (https://doi.org/10.1101/393769)
The raw data are available at GEO: GSE113293
Associated software is at https://zenodo.org/badge/latestdoi/140632831
SDA_objects.zip contains key tables required for many functions, download this to use the shiny app. Contents:
Other R objects include:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This resource contains the pre-processed A. thaliana root, H. sapiens aortic valve, PBMC Covid atlas, and public 10x datasets used in the paper "SCEMENT: Scalable and Memory Efficient Integration of Large-scale Single Cell RNA-sequencing Data". The raw datasets provided in the links below are pre-processed for quality control with respect to both cells and genes.
A. thaliana datasets are sourced from the following locations at Single-cell Gene expression Atlas and Gene Expression Omnibus (GEO):
H. sapiens datasets are obtained from the NCBI database: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA562645/
All COVID atlas datasets are from: http://covid19.cancer-pku.cn
covid_atlas_data1.zip contains the h5ad files and covid_atlas_data2.zip contains the Seurat rds files.
PBMC datasets are from the following public sources:
References for the Datasets:
Single cell RNA-seq study of induced pluripotent stem cell derived neural stem cells. Analysis of gene expression across cell clusters identified the inherent presence of neurogenic and gliogenic progenitors in established neural stem cells. This study helps explain the heterogeneity of neural stem cell identity and resolves gene expression enrichment in subpopulations of diverse progenitors. Processed and quality controlled data sets were used for generating figure 2 in the published article. Single cell raw data files for the experiments are not available for public download.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
AnnData object containing single-cell gene expression levels from a large-scale Perturb-seq CRISPR experiment from "Exploring genetic interaction manifolds constructed from rich single-cell phenotypes". Cells were labeled according to perturbation categories provided by the original authors. Data were preprocessed as described in "Isolating salient variations of interest in single-cell data with contrastiveVI" (Nature Methods 2023).
Gene expression data portal developed for stem cell community, containing public gene expression datasets derived from microarray, RNA sequencing and single cell profiling technologies. Portal to visualize and download curated stem cell data. Provides easy to use and intuitive tools for biologists to visually explore data, including interactive gene expression profiles, principal component analysis plots and hierarchical clusters, among others.