GWHed/geoquery dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
We use an LLM to generate text descriptions of satellite imagery and then run semantic search using only text embeddings. We also use image embeddings to validate how well the image descriptions can mimic them. We provide precomputed image and text embeddings for 48k locations around the world, together with their Sentinel-2 RGB imagery in chips of 512x512 pixels at 10 m/pixel. See our GitHub repo at rramosp/geoquery-poc for notebooks and examples of how to use this data. This is an example.… See the full description on the dataset page: https://huggingface.co/datasets/rramosp/geoquery-48k.
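The semantic search described above can be sketched as a nearest-neighbour lookup over embeddings. This is a minimal illustration, not the code from the geoquery-poc repository: it assumes cosine similarity as the ranking score, and the chip identifiers, the 3-dimensional vectors, and the function names are hypothetical stand-ins for the real precomputed embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_embedding, chip_embeddings, top_k=3):
    """Return the top_k (chip_id, score) pairs, best match first."""
    scored = [
        (chip_id, cosine_similarity(query_embedding, embedding))
        for chip_id, embedding in chip_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (hypothetical): in practice these would be the precomputed
# text embeddings of the image descriptions, one vector per chip.
chip_embeddings = {
    "chip_forest": [0.9, 0.1, 0.0],
    "chip_city": [0.1, 0.9, 0.1],
    "chip_coast": [0.2, 0.2, 0.9],
}

# Hypothetical embedding of a text query, produced by the same text model.
query = [0.8, 0.2, 0.1]
results = search(query, chip_embeddings, top_k=2)
```

Because the query and the descriptions live in the same text-embedding space, no image model is needed at query time; the image embeddings only serve as a reference for validation.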
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Processed single-cell datasets for a tool benchmark.
Raw data was downloaded from GEO:
1) https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE122960
2) https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE128033
3) https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE135893
4) https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE136831
The normalization procedure can be found here: https://github.com/mora-lab/cell-cell-interactions/tree/main/benchmark-workflow/R
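The GEO links listed above all follow the same accession-query pattern, so they can be generated from the accession IDs. A minimal sketch, assuming the endpoint shown in those links; the helper name geo_accession_url is hypothetical.

```python
from urllib.parse import urlencode

# Endpoint taken from the links above.
GEO_ACCESSION_ENDPOINT = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"


def geo_accession_url(accession):
    """Build the GEO accession page URL for an ID such as 'GSE122960'."""
    return f"{GEO_ACCESSION_ENDPOINT}?{urlencode({'acc': accession})}"


# The four series used as raw data for the benchmark.
raw_series = ["GSE122960", "GSE128033", "GSE135893", "GSE136831"]
urls = [geo_accession_url(accession) for accession in raw_series]
```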
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description:
A set of transcriptomics studies from the Gene Expression Omnibus (GEO) platform that are related to infectious or neurodegenerative diseases.
Columns:
Columns whose names end in "_QID" hold the Wikidata IDs for the values in the corresponding column.
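The "_QID" convention can be used to pair each descriptive column with its Wikidata companion. A minimal sketch, assuming "_QID" is a suffix appended to the base column name; the column names and values in the sample below are placeholders, not taken from the actual spreadsheet.

```python
import csv
import io

QID_SUFFIX = "_QID"


def pair_qid_columns(header):
    """Map each base column to its '<name>_QID' companion, if present."""
    columns = set(header)
    return {
        name: name + QID_SUFFIX
        for name in header
        if not name.endswith(QID_SUFFIX) and name + QID_SUFFIX in columns
    }


# Placeholder table (hypothetical column names and values).
sample_csv = (
    "accession,disease,disease_QID,organism,organism_QID\n"
    "GSE0000,example disease,Q0,example organism,Q1\n"
)
reader = csv.DictReader(io.StringIO(sample_csv))
pairs = pair_qid_columns(reader.fieldnames)
```

Columns without a companion (here, accession) are simply left out of the mapping.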
Google Sheets:
https://docs.google.com/spreadsheets/d/1LjF4h8n6Sy4PgTJoC-fJ7mGnqGCvqmWiatf2zD-5RM8
Funding:
This curation and release were supported by the grants #2018/10257-2 and #2019/26284-1 from the São Paulo Research Foundation (FAPESP).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Uploaded files are raw (*_raw.h5mu), filtered (*_filtered.h5mu), and trained (*_trained.h5mu) h5mu objects, as processed by mTopic.
Single-cell data processed here are publicly available from the Gene Expression Omnibus (GEO) and 10x Genomics:
P22 mouse brain datasets (P22_Mouse_Brain_H3K4me3_RNA, P22_Mouse_Brain_H3K27me3_RNA, P22_Mouse_Brain_H3K27ac_RNA) [1]
RNA data: GSE218593
– GSM6753043 for ATAC-RNA
– GSM6753046 for H3K4me3-RNA
– GSM6753044 for H3K27me3-RNA
– GSM6753045 for H3K27ac-RNA
ATAC/histone modification data: GSE205055
– GSM6758285 for ATAC-RNA
– GSM6704980 for H3K4me3-RNA
– GSM6704978 for H3K27me3-RNA
– GSM6704979 for H3K27ac-RNA
Human PBMC dataset (Human_PBMC_ATAC_RNA_Protein) [2]
GSE166188
– GSM5065524 for ATAC
– GSM5065525 for RNA
– GSM5065526 for protein
Human tonsil dataset (Human_Tonsil_RNA_Protein)
Available from 10x Genomics here.
References
[1] Zhang D, Deng Y, Kukanja P, Agirre E et al. Spatial epigenome-transcriptome co-profiling of mammalian tissues. Nature 2023 Apr;616(7955):113-122. PMID: 36922587
[2] Mimitou EP, Lareau CA, Chen KY, Zorzetto-Fernandes AL et al. Scalable, multimodal profiling of chromatin accessibility, gene expression and protein levels in single cells. Nat Biotechnol 2021 Oct;39(10):1246-1258. PMID: 34083792
Details on data processing and analysis can be found in the associated article.
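The raw, filtered, and trained file suffixes described above make it easy to group the uploaded files by processing stage. A minimal sketch, assuming each file is named as the dataset name followed by the stage suffix (the example filename is inferred from that pattern, not confirmed by the source); split_stage is a hypothetical helper.

```python
from pathlib import PurePath

STAGES = ("raw", "filtered", "trained")


def split_stage(filename):
    """Return (dataset_name, stage) for '<dataset>_<stage>.h5mu', else None."""
    path = PurePath(filename)
    if path.suffix != ".h5mu":
        return None
    # Split on the last underscore so dataset names may contain underscores.
    dataset, separator, stage = path.stem.rpartition("_")
    if separator and stage in STAGES:
        return dataset, stage
    return None


# Assumed filename built from a dataset name listed above plus '_raw.h5mu'.
example = split_stage("P22_Mouse_Brain_H3K4me3_RNA_raw.h5mu")
```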
Although localized to the mineralized matrix of bone, osteocytes are able to respond to systemic factors such as the calciotropic hormones 1,25(OH)2D3 and PTH. In the present studies, we examine the transcriptomic response to PTH in an osteocyte cell model and found that this hormone regulated an extensive panel of genes. Surprisingly, PTH uniquely modulated two cohorts of genes, one that was expressed and associated with the osteoblast to osteocyte transition and the other a cohort that was expressed only in the mature osteocyte. Interestingly, PTH's effects were largely to oppose the expression of differentiation-related genes in the former cohort, while potentiating the expression of osteocyte-specific genes in the latter cohort. A comparison of the transcriptional effects of PTH with those obtained previously with 1,25(OH)2D3 revealed a subset of genes that was strongly overlapping. While 1,25(OH)2D3 potentiated the expression of osteocyte-specific genes similar to that seen with PTH, the overlap between the two hormones was more limited. Additional experiments identified the PKA-activated phospho-CREB (pCREB) cistrome, revealing that while many of the differentiation-related PTH regulated genes were apparent targets of a PKA-mediated signaling pathway, a reduction in pCREB binding at sites associated with osteocyte-specific PTH targets appeared to involve alternative PTH activation pathways. That pCREB binding activities positioned near important hormone-regulated gene cohorts were localized to control regions of genes was reinforced by the presence of epigenetic enhancer signatures exemplified by unique modifications at histones H3 and H4. These studies suggest that both PTH and 1,25(OH)2D3 may play important and perhaps cooperative roles in limiting osteocyte differentiation from its precursors while simultaneously exerting distinct roles in regulating mature osteocyte function.
Our results provide new insight into transcription factor-associated mechanisms through which PTH and 1,25(OH)2D3 regulate a plethora of genes important to the osteoblast/osteocyte lineage. Fully differentiated IDG-SW3 cells were treated in biological triplicate with 100 nM PTH for 24 hours prior to mRNA isolation and sequencing. Vehicle-treated samples were previously published in GSE54783: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1323967, http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1323968, and http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1323969
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
scRNA data from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE223922 (Sur et al. 2023), see a detailed description of the study here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10055256/
Data were downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE223922, loaded into an R Seurat object, and converted into an AnnData (h5ad) file so that they can be analysed with, e.g., the Python scanpy package.
If you use this data, please cite Sur et al. 2023.
Journal article published in PLOS One, Vol 20, Issue 5, e0320862, 2025; DOI: https://doi.org/10.1371/journal.pone.0320862; PMC12064016. The datasets generated and analyzed during the current study are provided in Supplemental S1 File. The RNA-seq data is Protein Atlas Version 23 from the Human Protein Atlas website (https://www.proteinatlas.org/about/download, “RNA HPA cell line gene data” released 2023.06.19). All FASTQ files and aligned counts for the U.S. EPA TempO-seq data have been deposited into NCBI Gene Expression Omnibus under the accession number GSE288929 and are publicly available at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE288929. The R code is available through FigShare at: https://doi.org/10.23645/epacomptox.27341970.v1. This dataset is associated with the following publication: Word, L., C. Willis, R. Judson, L. Everett, S. Davidson-Fritz, D. Haggard, B. Chambers, J. Rogers, J. Bundy, I. Shah, N. Sipes, and J. Harrill. TempO-seq and RNA-seq Gene Expression Levels are Highly Correlated for Most Genes: A Comparison Using 39 Human Cell Lines. PLOS ONE. Public Library of Science, San Francisco, CA, USA, 20(5): e0320862, (2025).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Uploaded files are raw (*_raw.rds) and filtered (*_filtered.rds) RDS objects used for R tutorials of mTopic.
Single-cell data processed here are publicly available from the Gene Expression Omnibus (GEO) and 10x Genomics:
P22 mouse brain dataset (P22_Mouse_Brain_ATAC_RNA) [1]
– GSE218593 (GSM6753043) for RNA
– GSE205055 (GSM6758285) for ATAC
Human tonsil dataset (Human_Tonsil_RNA_Protein)
Available from 10x Genomics here.
Human PBMC dataset (Human_PBMC_ATAC_RNA_Protein) [2]
GSE166188
– GSM5065524 for ATAC
– GSM5065525 for RNA
– GSM5065526 for protein
References
[1] Zhang D, Deng Y, Kukanja P, Agirre E et al. Spatial epigenome-transcriptome co-profiling of mammalian tissues. Nature 2023 Apr;616(7955):113-122. PMID: 36922587
[2] Mimitou EP, Lareau CA, Chen KY, Zorzetto-Fernandes AL et al. Scalable, multimodal profiling of chromatin accessibility, gene expression and protein levels in single cells. Nat Biotechnol 2021 Oct;39(10):1246-1258. PMID: 34083792
Collection of gene expression and similar datasets related to brain tumors, in particular medulloblastoma, the most common malignant brain tumor in childhood. Files are typically CSV matrices of genes x samples.
GSE124814 WOW! Integration of many (all?) medulloblastoma datasets(!): 1641 samples, of which 1350 samples represent primary medulloblastomas and 291 samples represent normal brain
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE124814 Weishaupt H, Johansson P, Sundström A, Lubovac-Pilav Z et al. Batch-normalization of cerebellar and medulloblastoma gene expression datasets utilizing empirically defined negative control genes. Bioinformatics 2019 Sep 15;35(18):3357-3364. PMID: 30715209 https://doi.org/10.1093/bioinformatics/btz066 We downloaded a total of 1796 CEL files from previously published GEO or ArrayExpress records: GSE85217(n=763), GSE25219(n=154), GSE60862(n=130), GSE12992(n=40), GSE67850(n=22), GSE10327(n=62), GSE30074(n=30), E-MTAB-292(n=19), GSE74195(n=30), GSE37418(n=76), GSE4036(n=14), GSE62803(n=52), GSE21140(n=103), GSE37382(n=50), GSE22569(n=24), GSE35974(n=50), GSE73038(n=46), GSE50161(n=24), GSE3526(n=9), GSE50765(n=12), GSE49243(n=58), GSE41842(n=19), GSE44971(n=9). After preprocessing of all CEL files, we averaged the expression profiles of samples that mapped to the same patient in a single dataset, producing a final expression array comprising 1641 samples, of which 1350 samples represent primary medulloblastomas and 291 samples represent normal brain (cerebellum/upper rhombic lip). Also discussed in paper: A transcriptome-based classifier to determine molecular subtypes in medulloblastoma https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008263
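The per-record CEL file counts quoted above can be checked against the stated totals. A minimal sketch that parses the "accession(n=count)" list as written in the text; the regular expression is an assumption that covers only the two accession styles that appear there (GSE and E-MTAB).

```python
import re

# List copied from the description above.
records_text = (
    "GSE85217(n=763), GSE25219(n=154), GSE60862(n=130), GSE12992(n=40), "
    "GSE67850(n=22), GSE10327(n=62), GSE30074(n=30), E-MTAB-292(n=19), "
    "GSE74195(n=30), GSE37418(n=76), GSE4036(n=14), GSE62803(n=52), "
    "GSE21140(n=103), GSE37382(n=50), GSE22569(n=24), GSE35974(n=50), "
    "GSE73038(n=46), GSE50161(n=24), GSE3526(n=9), GSE50765(n=12), "
    "GSE49243(n=58), GSE41842(n=19), GSE44971(n=9)"
)

# Matches GEO series (GSE...) and ArrayExpress (E-MTAB-...) accessions.
record_pattern = re.compile(r"(GSE\d+|E-MTAB-\d+)\(n=(\d+)\)")

cel_counts = {
    accession: int(count)
    for accession, count in record_pattern.findall(records_text)
}

total_cel_files = sum(cel_counts.values())  # stated in the text as 1796
final_samples = 1350 + 291                  # tumors + normal brain, stated as 1641
```

The gap between the 1796 input files and the 1641 final samples reflects the averaging of profiles that mapped to the same patient within a dataset.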
GSE85217 (Cavalli ... Taylor) 763 samples, 2016 (Affymetrix Human Gene 1.1 ST Array) Cavalli FMG, Remke M, Rampasek L, Peacock J et al. Intertumoral Heterogeneity within Medulloblastoma Subgroups. Cancer Cell 2017 Jun 12;31(6):737-754.e6. PMID: 28609654 Ramaswamy V, Taylor MD. Bioinformatic Strategies for the Genomic and Epigenomic Characterization of Brain Tumors. Methods Mol Biol 2019;1869:37-56. PMID: 30324512 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85217
GSE202043 (Pomeroy) 214 samples, 2011 (Expression profiling by array) Cho YJ, Tsherniak A, Tamayo P, Santagata S et al. Integrative genomic analysis of medulloblastoma identifies a molecular subgroup that drives poor clinical outcome. J Clin Oncol 2011 Apr 10;29(11):1424-30. PMID: 21098324 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE202043
GSE12992 (Fattet ... Delattre) 72 samples, 2009 (Expression profiling by array) Fattet S, Haberler C, Legoix P, Varlet P et al. Beta-catenin status in paediatric medulloblastomas: correlation of immunohistochemical expression with mutational status, genetic profiles, and clinical characteristics. J Pathol 2009 May;218(1):86-94. PMID: 19197950 A series of 72 pediatric medulloblastoma tumors has been studied at the genomic level (array-CGH), screened for CTNNB1 mutations and beta-catenin expression (immunohistochemistry). A subset of 40 tumor samples has been analyzed at the RNA expression level (Affymetrix HG U133 Plus 2.0). https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE12992
GSE37382 (Northcott ... Taylor) 2012 (Expression profiling by array, Affymetrix Human Gene 1.1 ST Array profiling of 285 primary medulloblastoma samples.) Northcott PA, Shih DJ, Peacock J, Garzia L et al. Subgroup-specific structural variation across 1,000 medulloblastoma genomes. Nature 2012 Aug 2;488(7409):49-56. PMID: 22832581 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37382
GSE10327 (M. Kool) 62 samples, 2008 (Expression profiling by array) (beware: it is sometimes cited as GSE10237 in the original paper and in several references; that accession is an error). Kool M, Koster J, Bunt J, Hasselt NE et al. Integrated genomics identifies five medulloblastoma subtypes with distinct genetic profiles, pathway signatures and clinicopathological features. PLoS One 2008 Aug 28;3(8):e3088. PMID: 18769486 Rack PG, Ni J, Payumo AY, Nguyen V et al. Arhgap36-dependent activation of Gli transcription factors. Proc Natl Acad Sci U S A 2014 Jul 29;111(30):11061-6. PMID: 25024229 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE10327
Other datasets (not yet loaded):
GSE37385 (47.1 Gb, 2012) (Expression profiling by array, Genome variation profiling by SNP array, SNP genotyping by SNP array) Northcott PA, Shih DJ, Peacock J, Garzia L et al. Subgroup-specific structural variation across 1,000 medulloblastoma genomes. Nature 2012 Aug 2;488(7409):49-56. PMID: 22832581 Here we report somatic copy number aberrations (SCNAs) in 1087 unique medulloblastomas. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37385
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The expression matrix and gene list for the GitHub demo code at https://github.com/thamnguy/l-PGC. The dataset contains peripheral blood single cells from healthy donors, taken from dataset GSM4710729 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4710729).
scRNA data from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE199308 (Huang et al. 2023); see a detailed description of the study here: https://atlas.gs.washington.edu/mmca_v2/public/about.html. Data were downloaded from https://atlas.gs.washington.edu/mmca_v2/public/download.html to create an AnnData (h5ad) file with metadata so that they can be analysed with, e.g., the Python scanpy package. If you use this data, please cite Huang et al. 2023.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Harmony-integrated single-cell dataset containing all cells from 4 different murine fibrotic disease models. The dataset is largely publicly available but also contains unpublished (cardiac) samples: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE137720 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE104154 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE138826 https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-9816 https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-7895
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Peaks called for human uvCLAP (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85155) data using PEAKachu (https://github.com/tbischler/PEAKachu).
Sequences from this study are available at the NCBI GEO under accession series GSE131846 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?&acc=GSE131846
https://spdx.org/licenses/etalab-2.0.html
The annotations.ods spreadsheet contains the annotation report provided by the sigReannot pipeline in November 2018 for the GPL16524 (Agilent-037880) and GPL10162 (Agilent-020109) chips, using the pig data available in Ensembl 92 (Sscrofa11.1). Chip data are available at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL16524 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL10162
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Combined and converted scRNA data from Packer and Zhu et al. 2019; see a detailed description of the study here: https://www.science.org/doi/full/10.1126/science.aax1971
Data were downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE126954, converted into a Seurat object, and then converted into loom and AnnData (h5ad) files so that they can be analysed with, e.g., the Python scanpy package.
If you use this data, please cite Packer and Zhu et al. 2019.
The data discussed in this publication have been deposited in NCBI's Gene Expression Omnibus (GEO) and are accessible through GEO Series accession number GSE204989 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE204989).
Sequence Read Archive (SRA) data, BioSamples, and GEO holdings can be accessed from the NCBI BioProject PRJNA843039 (http://www.ncbi.nlm.nih.gov/bioproject/PRJNA843039).
This experiment contains rhesus macaque organism part samples and strand-specific RNA-seq data from experiment E-GEOD-41637 (https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-41637/), which aimed at assessing tissue-specific transcriptome variation across mammals, with chicken used as an outgroup in evolutionary analyses. Each organism part was sourced from three different animals as biological replicates. This data set was originally submitted to NCBI Gene Expression Omnibus under accession number GSE41637 (http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE41637) and later imported to ArrayExpress as E-GEOD-41637.