39 datasets found

BRAQUE: Bayesian Reduction for Amplified Quantization in UMAP Embedding....
board.unimib.it
Updated Jan 23, 2023
Cite
Lorenzo Dall'Olio (2023). BRAQUE: Bayesian Reduction for Amplified Quantization in UMAP Embedding. Supplementary data. [Dataset]. http://doi.org/10.17632/j8xbwb93x9.1
Explore at:
Unique identifier
https://doi.org/10.17632/j8xbwb93x9.1
Dataset updated
Jan 23, 2023
Authors
Lorenzo Dall'Olio
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We propose Bayesian Reduction for Amplified Quantization in UMAP Embedding (BRAQUE) as a novel integrative approach, from data preprocessing to phenotype classification. BRAQUE starts with an innovative preprocessing step, named Lognormal Shrinkage, which enhances input fragmentation by fitting a lognormal mixture model and shrinking each component towards its median, helping the subsequent clustering step find more clearly separated clusters. The BRAQUE pipeline consists of a dimensionality reduction step performed using UMAP, and a clustering step performed using HDBSCAN on the UMAP embedding. These SUPPLEMENTAL DATA contain the CSV image data files for seven lymphoid tissues, the antibody list and an MTA agreement letter for the CyBorgh software.
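To make the idea of shrinking each mixture component towards its median more concrete, here is a minimal Python sketch. It is not the BRAQUE implementation: it assumes the component label of every value is already known (the lognormal mixture fit itself is not shown), and the `strength` parameter is a hypothetical shrink factor introduced only for illustration.

```python
import math
from statistics import median

def shrink_towards_median(values, labels, strength=0.5):
    """Shrink positive values towards their component's median in log space.

    values   -- positive intensities
    labels   -- component index per value (assumed to come from an already
                fitted lognormal mixture model; the fit is not part of this sketch)
    strength -- hypothetical shrink factor in [0, 1]: 0 leaves the data
                unchanged, 1 collapses each value onto its component median
    """
    logs = [math.log(v) for v in values]
    # Median of the log-values of each component.
    medians = {
        k: median(l for l, lab in zip(logs, labels) if lab == k)
        for k in set(labels)
    }
    # Move every log-value a fraction `strength` of the way to its median.
    shrunk = [
        medians[lab] + (1.0 - strength) * (l - medians[lab])
        for l, lab in zip(logs, labels)
    ]
    return [math.exp(s) for s in shrunk]
```

With `strength=0.5`, two well-separated components stay in place while the spread inside each component is halved in log space, which is the kind of effect that could make density-based clustering easier.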
glove.6B.50d.umap.2d
huggingface.co
Updated Jan 31, 2024
Cite
Mario Tormo Romero (2024). glove.6B.50d.umap.2d [Dataset]. https://huggingface.co/datasets/mt0rm0/glove.6B.50d.umap.2d
Explore at:
Croissant, a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 31, 2024
Authors
Mario Tormo Romero
License
https://choosealicense.com/licenses/cc0-1.0/
Description
Dataset Card

This dataset is a UMAP 2D-projection of the glove.6B.50d embeddings from Stanford. It is intended as a fast reference for visualizing embeddings in a workshop from the AI Service Center Berlin-Brandenburg at the Hasso Plattner Institute.

Dataset Details

Dataset Description

The embeddings have a vocabulary of 400k tokens, with 2 dimensions per token.
Curated by: Mario Tormo Romero
License: cc0-1.0

Dataset Sources

This Dataset has been… See the full description on the dataset page: https://huggingface.co/datasets/mt0rm0/glove.6B.50d.umap.2d.
ProjecTILs murine reference atlas of tumor-infiltrating T cells, version 1
figshare.com
application/gzip
Updated Jun 29, 2023
Cite
Massimo Andreatta; Santiago Carmona (2023). ProjecTILs murine reference atlas of tumor-infiltrating T cells, version 1 [Dataset]. http://doi.org/10.6084/m9.figshare.12478571.v2
Explore at:
Available download formats: application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.12478571.v2
Dataset updated
Jun 29, 2023
Dataset provided by
figshare
Authors
Massimo Andreatta; Santiago Carmona
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We have developed ProjecTILs, a computational approach to project new data sets into a reference map of T cells, enabling their direct comparison in a stable, annotated system of coordinates. Because new cells are embedded in the same space of the reference, ProjecTILs enables the classification of query cells into annotated, discrete states, but also over a continuous space of intermediate states. By comparing multiple samples over the same map, and across alternative embeddings, the method allows exploring the effect of cellular perturbations (e.g. as the result of therapy or genetic engineering) and identifying genetic programs significantly altered in the query compared to a control set or to the reference map. We illustrate the projection of several data sets from recent publications over two cross-study murine T cell reference atlases: the first describing tumor-infiltrating T lymphocytes (TILs), the second characterizing acute and chronic viral infection.
To construct the reference TIL atlas, we obtained single-cell gene expression matrices from the following GEO entries: GSE124691, GSE116390, GSE121478, GSE86028; and entry E-MTAB-7919 from Array-Express. Data from GSE124691 contained samples from tumor and from tumor-draining lymph nodes, and were therefore treated as two separate datasets. For the TIL projection examples (OVA Tet+, miR-155 KO and Regnase-KO), we obtained the gene expression counts from entries GSE122713, GSE121478 and GSE137015, respectively.
Prior to dataset integration, single-cell data from individual studies were filtered using TILPRED-1.0 (https://github.com/carmonalab/TILPRED), which removes cells not enriched in T cell markers (e.g. Cd2, Cd3d, Cd3e, Cd3g, Cd4, Cd8a, Cd8b1) and cells enriched in non-T cell genes (e.g. Spi1, Fcer1g, Csf1r, Cd19). Dataset integration was performed using STACAS (https://github.com/carmonalab/STACAS), a batch-correction algorithm based on Seurat 3.
For the TIL reference map, we specified 600 variable genes per dataset, excluding cell cycling genes, mitochondrial, ribosomal and non-coding genes, as well as genes expressed in fewer than 0.1% or more than 90% of the cells of a given dataset. For integration, a total of 800 variable genes were derived as the intersection of the 600 variable genes of individual datasets, prioritizing genes found in multiple datasets and, in case of ties, those derived from the largest datasets. We determined pairwise dataset anchors using STACAS with default parameters, and filtered anchors using an anchor score threshold of 0.8. Integration was performed using the IntegrateData function in Seurat3, providing the anchor set determined by STACAS, and a custom integration tree to initiate alignment from the largest and most heterogeneous datasets.
Next, we performed unsupervised clustering of the integrated cell embeddings using the Shared Nearest Neighbor (SNN) clustering method implemented in Seurat 3 with parameters {resolution=0.6, reduction="umap", k.param=20}. We then manually annotated individual clusters (merging clusters when necessary) based on several criteria: i) average expression of key marker genes in individual clusters; ii) gradients of gene expression over the UMAP representation of the reference map; iii) gene-set enrichment analysis to determine over- and under-expressed genes per cluster using MAST. In order to have access to predictive methods for UMAP, we recomputed PCA and UMAP embeddings independently of Seurat3 using respectively the prcomp function from the base R package “stats”, and the “umap” R package (https://github.com/tkonopka/umap).
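The gene filtering and prioritization rules described above can be illustrated with a small Python sketch. This is not the STACAS or Seurat code: the function names and input layout are hypothetical, and only the thresholds (0.1% and 90%) and the ranking rule (number of datasets first, then size of the largest contributing dataset) are taken from the description.

```python
from collections import Counter

def filter_genes(expr_fraction, low=0.001, high=0.90):
    """Keep genes expressed in at least `low` and at most `high` of the cells.

    expr_fraction -- dict: gene name -> fraction of cells expressing the gene
    """
    return {g for g, frac in expr_fraction.items() if low <= frac <= high}

def select_integration_genes(variable_genes, dataset_sizes, n_genes):
    """Rank genes by the number of datasets listing them as variable.

    Ties are broken in favour of genes that appear in the largest dataset,
    then alphabetically so that the result is deterministic.

    variable_genes -- dict: dataset name -> iterable of variable gene names
    dataset_sizes  -- dict: dataset name -> number of cells
    """
    counts = Counter()
    largest = {}
    for name, genes in variable_genes.items():
        for g in set(genes):
            counts[g] += 1
            largest[g] = max(largest.get(g, 0), dataset_sizes[name])
    ranked = sorted(counts, key=lambda g: (-counts[g], -largest[g], g))
    return ranked[:n_genes]
```

For example, a gene listed as variable in three datasets is ranked ahead of one listed in two, and among genes listed once, the one coming from the dataset with the most cells comes first.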
laion-aesthetics-12m-umap
huggingface.co
Updated Apr 7, 2023
Cite
David McClure (2023). laion-aesthetics-12m-umap [Dataset]. https://huggingface.co/datasets/dclure/laion-aesthetics-12m-umap
Explore at:
Croissant, a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 7, 2023
Authors
David McClure
License
MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Description
LAION-Aesthetics :: CLIP → UMAP

This dataset is a CLIP (text) → UMAP embedding of the LAION-Aesthetics dataset - specifically the improved_aesthetics_6plus version, which filters the full dataset to images with scores of > 6 under the "aesthetic" filtering model. Thanks LAION for this amazing corpus!

The dataset here includes coordinates for 3x separate UMAP fits using different values for the n_neighbors parameter - 10, 30, and 60 - which are broken out as separate columns with… See the full description on the dataset page: https://huggingface.co/datasets/dclure/laion-aesthetics-12m-umap.
Dataset and trained models belonging to the article 'Distant reading...
data.niaid.nih.gov
zenodo.org
Updated Sep 28, 2021
Cite
Ros, Ruben (2021). Dataset and trained models belonging to the article 'Distant reading patterns of iconicity in 940.000 online circulations of 26 iconic photographs' [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4244000
Explore at:
Dataset updated
Sep 28, 2021
Dataset provided by
Smits, Thomas
Ros, Ruben
Description
Quantifying Iconicity - Zenodo

The Dataset

This dataset contains the material collected for the article "Distant reading 940,000 online circulations of 26 iconic photographs" (to be) published in New Media & Society (DOI: 10.1177/14614448211049459). We identified 26 iconic photographs based on earlier work (Van der Hoeven, 2019). The Google Cloud Vision (GCV) API was subsequently used to identify webpages that host a reproduction of the iconic image. The GCV API uses computer vision methods and the Google index to retrieve these reproductions. The code for calling the API and parsing the data can be found on GitHub: https://github.com/rubenros1795/ReACT_GCV.

The core dataset consists of .tsv-files with the URLs that refer to the webpages. The files also contain other metadata provided by the GCV API, as well as manually generated metadata. This includes:
- the URL that refers specifically to the image. This can be a URL that refers to a full match or a partial match
- the title of the page
- the iteration number. Because the GCV API puts a limit on its output, we had to reupload the identified images to the API to extend our search. We continued these iterations until no more new unique URLs were found
- the language found by the langid Python module, along with the normalized score
- the labels associated with the image by Google
- the scrape date

Alongside the .tsv-files, there are several other elements in the following folder structure:

├── data
│   ├── embeddings
│   │   ├── doc2vec
│   │   ├── input-text
│   │   ├── metadata
│   │   └── umap
│   ├── evaluation
│   ├── results
│   │   ├── diachronic-plots
│   │   └── top-words
│   └── tsv

The /embeddings folder contains the doc2vec models, the training input for the models, the metadata (id, URL, date) and the UMAP embeddings used in the GMM clustering. Please note that the date parser was not able to find dates for all webpages and for this reason not all training texts have associated metadata.

The /evaluation folder contains the AIC and BIC scores for GMM clustering with different numbers of clusters.
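For readers unfamiliar with these scores, the following Python sketch shows the standard AIC and BIC definitions that are commonly used to compare Gaussian mixture models with different numbers of clusters. The parameter count assumes full covariance matrices, which is an assumption of this sketch; the covariance type used by the authors is not stated here.

```python
import math

def gmm_free_parameters(n_components, n_dims):
    """Free parameters of a Gaussian mixture with full covariance matrices
    (assumed covariance type; other types have fewer parameters)."""
    means = n_components * n_dims
    covariances = n_components * n_dims * (n_dims + 1) // 2
    weights = n_components - 1  # mixing weights sum to 1
    return means + covariances + weights

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_samples) - 2 * log_likelihood
```

Lower values are better for both criteria; BIC penalizes extra parameters more strongly than AIC once the sample size exceeds about eight points.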

The /results folder contains the top words associated with the clusters and the diachronic cluster prominence plots.

Data Cleaning and Curation

Our pipeline contained several interventions to prevent noise in the data. First, in between the iterations we manually checked the scraped photos for relevance. We did so because reuploading an iconic image that is paired with another, irrelevant, one results in reproductions of the irrelevant one in the next iteration. Because we did not catch all noise, we used Scale Invariant Feature Transform (SIFT), a basic computer vision algorithm, to remove images that did not meet a threshold of ten keypoints. By doing so we removed completely unrelated photographs, but left room for variations of the original (such as painted versions of Che Guevara, or cropped versions of the Napalm Girl image). Another issue was the parsing of webpage texts. After experimenting with different webpage parsers that aim to extract 'relevant' text it proved too difficult to use one solution for all our webpages. Therefore we simply parsed all the text contained in commonly used html-tags, such as , etc.
ProjecTILs murine reference atlas of virus-specific CD8 T cells, version 2
figshare.com
application/gzip
Updated Jul 26, 2023
+ more versions
Cite
Massimo Andreatta; Santiago Carmona (2023). ProjecTILs murine reference atlas of virus-specific CD8 T cells, version 2 [Dataset]. http://doi.org/10.6084/m9.figshare.23764572.v1
Explore at:
Available download formats: application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.23764572.v1
Dataset updated
Jul 26, 2023
Dataset provided by
figshare
Authors
Massimo Andreatta; Santiago Carmona
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We have developed ProjecTILs, a computational approach to project new data sets into a reference map of T cells, enabling their direct comparison in a stable, annotated system of coordinates. Because new cells are embedded in the same space of the reference, ProjecTILs enables the classification of query cells into annotated, discrete states, but also over a continuous space of intermediate states. By comparing multiple samples over the same map, and across alternative embeddings, the method allows exploring the effect of cellular perturbations (e.g. as the result of therapy or genetic engineering) and identifying genetic programs significantly altered in the query compared to a control set or to the reference map. We illustrate the projection of several data sets from recent publications over two cross-study murine T cell reference atlases: the first describing tumor-infiltrating T lymphocytes (TILs), the second characterizing acute and chronic viral infection. Single-cell data to build the virus-specific CD8 T cell reference map were downloaded from GEO under the following entries: GSE131535, GSE134139 and GSE119943, selecting only samples in wild type conditions. Data for the Ptpn2-KO, Tox-KO and CD4-depletion projections were obtained from entries GSE134139, GSE119943, and GSE137007 and were not included in the construction of the reference map. To construct the LCMV reference map, we split the dataset into five batches that displayed strong batch effects, and applied STACAS (https://github.com/carmonalab/STACAS) to mitigate its confounding effects. We computed 800 variable genes per batch, excluding cell cycling genes, ribosomal and mitochondrial genes, and computed pairwise anchors using 200 integration genes, and otherwise default STACAS parameters. Anchors were filtered at the default threshold 0.8 percentile, and integration was performed with the IntegrateData Seurat3 function with the guide tree suggested by STACAS. 
Next, we performed unsupervised clustering of the integrated cell embeddings using the Shared Nearest Neighbor (SNN) clustering method implemented in Seurat 3 with parameters {resolution=0.4, reduction="pca", k.param=20}. We then manually annotated individual clusters (merging clusters when necessary) based on several criteria: i) average expression of key marker genes in individual clusters; ii) gradients of gene expression over the UMAP representation of the reference map; iii) gene-set enrichment analysis to determine over- and under-expressed genes per cluster using MAST. In order to have access to predictive methods for UMAP, we recomputed PCA and UMAP embeddings independently of Seurat3 using respectively the prcomp function from the base R package “stats”, and the “umap” R package (https://github.com/tkonopka/umap).
Data_Sheet_1_Manifold learning for fMRI time-varying functional...
frontiersin.figshare.com
docx
Updated Jul 11, 2023
Cite
Javier Gonzalez-Castillo; Isabel S. Fernandez; Ka Chun Lam; Daniel A. Handwerker; Francisco Pereira; Peter A. Bandettini (2023). Data_Sheet_1_Manifold learning for fMRI time-varying functional connectivity.docx [Dataset]. http://doi.org/10.3389/fnhum.2023.1134012.s001
Explore at:
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fnhum.2023.1134012.s001
Dataset updated
Jul 11, 2023
Dataset provided by
Frontiers
Authors
Javier Gonzalez-Castillo; Isabel S. Fernandez; Ka Chun Lam; Daniel A. Handwerker; Francisco Pereira; Peter A. Bandettini
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Whole-brain functional connectivity (FC) measured with functional MRI (fMRI) evolves over time in meaningful ways at temporal scales going from years (e.g., development) to seconds [e.g., within-scan time-varying FC (tvFC)]. Yet, our ability to explore tvFC is severely constrained by its large dimensionality (several thousands). To overcome this difficulty, researchers often seek to generate low dimensional representations (e.g., 2D and 3D scatter plots) hoping those will retain important aspects of the data (e.g., relationships to behavior and disease progression). Limited prior empirical work suggests that manifold learning techniques (MLTs)—namely those seeking to infer a low dimensional non-linear surface (i.e., the manifold) where most of the data lies—are good candidates for accomplishing this task. Here we explore this possibility in detail. First, we discuss why one should expect tvFC data to lie on a low dimensional manifold. Second, we estimate what is the intrinsic dimension (ID; i.e., minimum number of latent dimensions) of tvFC data manifolds. Third, we describe the inner workings of three state-of-the-art MLTs: Laplacian Eigenmaps (LEs), T-distributed Stochastic Neighbor Embedding (T-SNE), and Uniform Manifold Approximation and Projection (UMAP). For each method, we empirically evaluate its ability to generate neuro-biologically meaningful representations of tvFC data, as well as their robustness against hyper-parameter selection. Our results show that tvFC data has an ID that ranges between 4 and 26, and that ID varies significantly between rest and task states. We also show how all three methods can effectively capture subject identity and task being performed: UMAP and T-SNE can capture these two levels of detail concurrently, but LE could only capture one at a time. We observed substantial variability in embedding quality across MLTs, and within-MLT as a function of hyper-parameter selection. 
To help alleviate this issue, we provide heuristics that can inform future studies. Finally, we also demonstrate the importance of feature normalization when combining data across subjects and the role that temporal autocorrelation plays in the application of MLTs to tvFC data. Overall, we conclude that while MLTs can be useful to generate summary views of labeled tvFC data, their application to unlabeled data such as resting-state remains challenging.
Joint embedding of vertebrate brain single-cell RNA-Seq using sequence or...
data.niaid.nih.gov
Updated Aug 18, 2023
Cite
Sun, Dennis (2023). Joint embedding of vertebrate brain single-cell RNA-Seq using sequence or structure [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7838975
Explore at:
Dataset updated
Aug 18, 2023
Dataset authored and provided by
Sun, Dennis
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Embeddings of single-cell RNA-Seq data from three adult vertebrate brain datasets into Orthogroup feature space or Structural cluster feature space. Orthogroups were generated using OrthoFinder v5.5.0; Structural clusters were assigned by using FoldSeek to cluster AlphaFold-v4 structural predictions.

The three datasets used as the basis for these embeddings were:

sample "Brain8" from the Jiang et al. 2021 zebrafish cell atlas (files beginning with GSM3768152)

sample "Brain1" from the Han et al. 2018 mouse cell atlas (files beginning with GSM2906405)

sample "Xenopus_brain_COL65" from the Liao et al. 2022 Xenopus laevis adult cell atlas (files beginning with GSM6214268)

For each dataset, we also generated a standardized cell type annotation file based on the author's originally provided cell type annotation data. The first column is the cell barcode for that species and the second column is the original study's cell type annotation for that cell.

For the Xenopus brain data, we removed around 18k cells that were not annotated in the original data to simplify data analyses. These are reflected in the files with the "subsampled" suffix. Subsampled versions of the data are also available for the joint embedding space (prefixed with "DrerMmusXlae").

For the final datasets used in our analyses, we also provide features x cell matrices as .h5ad files for smaller file sizes and faster loading using Scanpy.

For visualizing our UMAP plots of our top200 embedding space, we provide ".tsv" files with a variety of metrics and the x and y positions of each cell in the UMAP. See "DrerMmusXlae_adultbrain_FoldSeek_plotlydata.tsv" and "DrerMmusXlae_adultbrain_OrthoFinder_plotlydata.tsv"

These data are part of the Arcadia Science Pub titled "Comparing gene expression across species based on protein structure instead of sequence".
s1K-1.1-850
huggingface.co
Updated Mar 7, 2025
Cite
InfiX.ai (2025). s1K-1.1-850 [Dataset]. https://huggingface.co/datasets/InfiX-ai/s1K-1.1-850
Explore at:
Dataset updated
Mar 7, 2025
Dataset provided by
InfiX.ai
License
MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Description
This data is derived from simplescaling/s1K-1.1. Compared with the original simplescaling/s1K-1.1 data, our filtered version uses less data and achieves better results.

What we did

Text Embedding Generation: We use all-MiniLM-L6-v2 (from the SentenceTransformers library) to generate "input" embeddings.

Dimensionality reduction: We use the UMAP approach, which preserves both local and global data structure.

n_components=2, n_neighbors=15, min_dist=0.1

Data Sparsification (Dense Points… See the full description on the dataset page: https://huggingface.co/datasets/InfiX-ai/s1K-1.1-850.
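The dimensionality-reduction step listed above, with its stated parameters, could be sketched in Python as follows. This is an illustration rather than the dataset authors' code: it assumes the umap-learn package is installed and that the text embeddings have already been computed as an array of shape (n_texts, n_features); the function name `reduce_embeddings` is hypothetical.

```python
# Parameters as stated in the dataset description.
UMAP_PARAMS = {"n_components": 2, "n_neighbors": 15, "min_dist": 0.1}

def reduce_embeddings(vectors):
    """Project precomputed embedding vectors to 2D with UMAP.

    `vectors` is assumed to be an array-like of shape (n_texts, n_features),
    for example sentence embeddings computed beforehand.
    """
    import umap  # provided by the umap-learn package; imported lazily

    reducer = umap.UMAP(**UMAP_PARAMS)
    return reducer.fit_transform(vectors)
```

Importing `umap` inside the function keeps the parameter dictionary usable even where umap-learn is not installed.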
Results of the proposed methods compared to other methods from the...
figshare.com
xls
Updated Apr 3, 2024
Cite
Mohammed Shalaby; Mohamed Farouk; Hatem A. Khater (2024). Results of the proposed methods compared to other methods from the literature on the USPS dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0300641.t007
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0300641.t007
Dataset updated
Apr 3, 2024
Dataset provided by
PLOS ONE
Authors
Mohammed Shalaby; Mohamed Farouk; Hatem A. Khater
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Results of the proposed methods compared to other methods from the literature on the USPS dataset.
Data from: Data related to Panzer: A Machine Learning Based Approach to...
darus.uni-stuttgart.de
Updated Nov 27, 2024
+ more versions
Cite
Tim Panzer (2024). Data related to Panzer: A Machine Learning Based Approach to Analyze Supersecondary Structures of Proteins [Dataset]. http://doi.org/10.18419/DARUS-4576
Explore at:
Croissant, a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.18419/DARUS-4576
Dataset updated
Nov 27, 2024
Dataset provided by
DaRUS
Authors
Tim Panzer
License
https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18419/DARUS-4576
Time period covered
Nov 1, 1976 - Feb 29, 2024
Dataset funded by
DFG
Description
This entry contains the data used to implement the bachelor thesis, which investigated how embeddings can be used to analyze supersecondary structures. Abstract of the thesis: This thesis analyzes the behavior of supersecondary structures in the context of embeddings. For this purpose, data from the Protein Topology Graph Library was provided with embeddings. This resulted in a structured graph database, which will be used for future work and analyses. In addition, different projections were made into two-dimensional space to analyze how the embeddings behave there. The Jupyter Notebook 1_data_retrival.ipynb contains the download process of the graph files from the Protein Topology Graph Library (https://ptgl.uni-frankfurt.de). The downloaded .gml files can also be found in graph_files.zip. These form graphs that represent the relationships of supersecondary structures in the proteins, and they are the data basis for further analyses. The graph files are then processed in the Jupyter Notebook 2_data_storage_and_embeddings.ipynb and entered into a graph database. The sequences of the supersecondary and secondary structures from the PTGL can be found in fastas.zip. The embeddings were calculated using the ESM model of the Facebook Research Group (huggingface.co/facebook/esm2_t12_35M_UR50D) and can be found in three .h5 files. These are subsequently added to the database. The whole process in this notebook serves to build up the database, which can then be searched using Cypher queries. In the Jupyter Notebook 3_data_science.ipynb, different visualizations and analyses are then carried out with the help of UMAP. For the installation of all dependencies, it is recommended to create a Conda environment and then install all packages there. To use the project, PyEED should be installed using the snapshot of the original repository (source repository: https://github.com/PyEED/pyeed).
The best way to install PyEED is to execute the pip install -e . command in the pyeed_BT folder. The dependencies can also be installed by using poetry and the .toml file. In addition, seaborn, h5py and umap-learn are required. These can be installed using the following commands:
pip install h5py==3.12.1
pip install seaborn==0.13.2 umap-learn==0.5.7
Test dataset for omero-vitessce
zenodo.org
data.niaid.nih.gov
csv, json, png
Updated Sep 24, 2024
Cite
Michele Bortolomeazzi (2024). Test dataset for omero-vitessce [Dataset]. http://doi.org/10.5281/zenodo.13832665
Explore at:
Available download formats: png, json, csv
Unique identifier
https://doi.org/10.5281/zenodo.13832665
Dataset updated
Sep 24, 2024
Dataset provided by
Zenodo, http://zenodo.org/
Authors
Michele Bortolomeazzi
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Test datasets for omero-vitessce

Dataset designed for testing the omero-vitessce (https://github.com/NFDI4BIOIMAGE/omero-vitessce) plugin for OMERO (https://www.openmicroscopy.org/omero/). The omero-vitessce repository contains a cropped version of this dataset for automated testing (https://github.com/NFDI4BIOIMAGE/omero-vitessce/tree/main/test/data/MB266).

Files

MAX_MBEN_ff_Xenium_0018446_MB-266_DAPI_2024-01-23_12.47.34_Fused_405nm_corr_cropped.png = PNG image with the DAPI channel.

MAX_MBEN_ff_Xenium_0018446_MB-266_DAPI_2024-01-23_12.47.34_Fused_405nm_corr_cropped_cp_masks.png = Cell segmentation mask (pixel values correspond to cell identities, 0 = background).

cells.csv =

embeddings.csv = UMAP embeddings for drawing an interactive scatterplot.

feature_matrix.csv = Transcript counts in each cell.

transcripts.csv = Gene name and coordinates (pixel) of each transcript.

VitessceConfig.json = Example configuration file generated by the omero-vitessce plugin for Vitessce. An equivalent file can be generated by using the form provided by the plugin in OMERO.web.

See the repository README file for more details on the formats of these files: https://github.com/NFDI4BIOIMAGE/omero-vitessce?tab=readme-ov-file#config-files
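As an illustration of the mask convention described in the file list (pixel value = cell identity, 0 = background), the Python sketch below counts the pixels belonging to each cell. It assumes the mask has already been loaded into rows of integers; reading the PNG itself is left out, and the function name `cell_areas` is hypothetical.

```python
from collections import Counter

def cell_areas(mask):
    """Count pixels per cell ID in a label mask.

    mask -- rows of integer labels, where 0 marks background and any other
            value is the identity of the cell covering that pixel
    """
    counts = Counter(value for row in mask for value in row)
    counts.pop(0, None)  # drop the background label
    return dict(counts)

# Tiny made-up mask with two cells (IDs 1 and 2).
example_mask = [
    [0, 0, 1, 1],
    [0, 2, 1, 1],
    [2, 2, 0, 0],
]
areas = cell_areas(example_mask)
```

The resulting dictionary maps each cell ID to its area in pixels, which can be cross-checked against the per-cell tables.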

Usage

Add the omero-web-zarr and omero-vitessce plugins to your OMERO.web installation.

Import the images into OMERO in the same dataset.

Attach all the .csv data files.

Use the form in the "Vitessce" tab of the right-panel to generate a configuration file and open the Vitessce viewer.

See the repository README file for more details on usage (https://github.com/NFDI4BIOIMAGE/omero-vitessce?tab=readme-ov-file#usage) and installation (https://github.com/NFDI4BIOIMAGE/omero-vitessce?tab=readme-ov-file#installation)

Data Sources

Adapted from the full original data at: https://www.ebi.ac.uk/biostudies/bioimages/studies/S-BIAD1093 (10.6019/S-BIAD1093).

The original data were produced and analysed in the course of this study:

https://www.biorxiv.org/content/10.1101/2024.04.03.586404v1
Replication Data for: Measuring the impact of campaign finance on...
search.dataone.org
dataverse.harvard.edu
Updated Nov 8, 2023
Cite
Lalisse, Matthias (2023). Replication Data for: Measuring the impact of campaign finance on congressional voting: A machine learning approach [Dataset]. http://doi.org/10.7910/DVN/DHQQHX
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/DHQQHX
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Lalisse, Matthias
Description
Replication data for the paper: "Measuring the impact of campaign finance on congressional voting: A machine learning approach". Includes:
* metadata for legislators and bills,
* text embeddings for legislative summaries (sourced from ProPublica Congress Database). Includes 768d LongFormer embeddings and 2d embeddings for visualization (UMAP and Isomap),
* legislator embeddings: 100d PCA on legislators' financial disclosures, as well as 2d visualization embeddings (UMAP and Isomap),
* scripts for running the classification and RSA analyses.
Up to 100d embeddings are provided from the output of PCA for both bills and legislators. See README.ipynb for a tour of the datasets as well as starter code.
GeoWaVe Cytometry Benchmark Data
data.niaid.nih.gov
zenodo.org
Updated Oct 2, 2022
Cite
Matthew Morgan (2022). GeoWaVe Cytometry Benchmark Data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7134722
Explore at:
Dataset updated
Oct 2, 2022
Dataset provided by
Simone Cuff
Matthew Morgan
Ross Jake Burton
Matthias Eberl
Andreas Artemiou
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Contained within this folder are six benchmark datasets (Levine13, Levine32, Samusik, Sepsis, and PD) used for the evaluation of the GeoWaVe ensemble clustering algorithm, part of the cytocluster (https://github.com/burtonrj/CytoCluster) package.

The data are compensated, arc-sine transformed, and debris and dead cells removed. See manuscript for details: https://doi.org/10.1101/2022.06.30.496829

Each dataset is available as a CSV file and includes two additional columns: UMAP1 and UMAP2. The UMAP columns contain embeddings generated using UMAP (2 components and n_neighbours=30) and were used for visualisation purposes. The column 'population' contains the original population labels generated using manual gating.
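A CSV file with this layout could be read as in the following Python sketch, which groups the UMAP coordinates by manually gated population. The column names UMAP1, UMAP2 and population come from the description; the marker column and the rows in `SAMPLE` are made up for illustration, and the function name is hypothetical.

```python
import csv
import io
from collections import defaultdict

def umap_by_population(csv_file):
    """Group (UMAP1, UMAP2) coordinates by the 'population' label column."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        point = (float(row["UMAP1"]), float(row["UMAP2"]))
        groups[row["population"]].append(point)
    return dict(groups)

# Made-up rows; "CD3" stands in for the real marker columns.
SAMPLE = (
    "CD3,UMAP1,UMAP2,population\n"
    "0.5,1.0,2.0,T cells\n"
    "0.1,-3.0,0.5,B cells\n"
    "0.7,1.5,2.5,T cells\n"
)
groups = umap_by_population(io.StringIO(SAMPLE))
```

For a real file, an open file handle (for example `open(path, newline="")`) can be passed in place of the in-memory sample.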
Data from: A highly resolved integrated single-cell atlas of HPV-negative...
zenodo.org
Updated Dec 30, 2024
Cite
Lina Kroehling; Stefano Monti (2024). A highly resolved integrated single-cell atlas of HPV-negative Head and Neck Cancer [Dataset]. http://doi.org/10.5281/zenodo.14579515
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.14579515
Dataset updated
Dec 30, 2024
Dataset provided by
Zenodo, http://zenodo.org/
Authors
Lina Kroehling; Stefano Monti
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A single cell transcriptomic atlas of HPV-negative Head and Neck Squamous Carcinomas.

The atlas is a Seurat (v 4.1.0) object stored as a .rds file, which can be loaded in R.

The data can be loaded into R as follows:

atlas <- readRDS('FullHNSCCAtlas.rds')

The immune and nonimmune compartments can be separated as follows:

immune <- subset(atlas, cells = rownames(atlas@meta.data[atlas@meta.data$Compartment == "Immune",]))
immune[["umap"]]@cell.embeddings <- as.matrix(immune@meta.data[,c("CompartmentUMAP1_Coordinates", "CompartmentUMAP2_Coordinates")])
colnames(x = immune[["umap"]]@cell.embeddings) <- paste0("UMAP_", 1:2)

nonimmune <- subset(atlas, cells = rownames(atlas@meta.data[atlas@meta.data$Compartment == "nonImmune",]))
nonimmune[["umap"]]@cell.embeddings <- as.matrix(nonimmune@meta.data[,c("CompartmentUMAP1_Coordinates", "CompartmentUMAP2_Coordinates")])
colnames(x = nonimmune[["umap"]]@cell.embeddings) <- paste0("UMAP_", 1:2)
Results of the proposed methods compared to other methods from the...
figshare.com
xls
Updated Apr 3, 2024
Cite
Mohammed Shalaby; Mohamed Farouk; Hatem A. Khater (2024). Results of the proposed methods compared to other methods from the literature on the Banana dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0300641.t003
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0300641.t003
Dataset updated
Apr 3, 2024
Dataset provided by
PLOS ONE
Authors
Mohammed Shalaby; Mohamed Farouk; Hatem A. Khater
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Results of the proposed methods compared to other methods from the literature on the Banana dataset.
LungMAP Azimuth Reference - Human Adult Lung scRNA-Seq
zenodo.org
explore.openaire.eu
bin
Updated Nov 6, 2021
Cite
LungMAP (2021). LungMAP Azimuth Reference - Human Adult Lung scRNA-Seq [Dataset]. http://doi.org/10.5281/zenodo.5649206
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.5649206
Dataset updated
Nov 6, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
LungMAP
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The LungMAP scRNA-Seq reference associated with the Lung CellCards resource. The initial reference integrated 259k cells from 72 donors from five published (PMIDs: 32726565, 32427931, 30554520, 32832599, 32832598) and one unpublished single cell RNA-seq cohort. The data comprise single-cell 10x Genomics captures (3’ v2 and v3) of healthy, non-diseased adult and pediatric lung. Cells from different donors were integrated using batchelor. Preliminary cell types were called based on Leiden clustering analysis and expression patterns of LungMAP cell card markers. UMAP embeddings were generated using monocle3. This reference, along with a corresponding single-nucleus specific version of this atlas, is under active construction. We expect to release the beta version of the reference in November 2021. The reference conforms to the Azimuth reference data structure described at https://github.com/satijalab/azimuth/wiki/Azimuth-Reference-Format.
Pareto set of different methods on the USPS dataset.
plos.figshare.com
xls
Updated Apr 3, 2024
Cite
Mohammed Shalaby; Mohamed Farouk; Hatem A. Khater (2024). Pareto set of different methods on the USPS dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0300641.t008
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0300641.t008
Dataset updated
Apr 3, 2024
Dataset provided by
PLOS ONE
Authors
Mohammed Shalaby; Mohamed Farouk; Hatem A. Khater
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Pareto set of different methods on the USPS dataset.

Dataset on Bibliographic, Textual, and Embedding Data for General Relativity...

zenodo.org

bin

Updated Dec 31, 2024

Cite

Raphael Schlattmann (2024). Dataset on Bibliographic, Textual, and Embedding Data for General Relativity and Gravitation Publications (1911–2000) [Dataset]. http://doi.org/10.5281/zenodo.14581503

Explore at:

Available download formats: bin

Unique identifier

https://doi.org/10.5281/zenodo.14581503

Dataset updated

Dec 31, 2024

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Raphael Schlattmann

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Dataset Overview

This dataset supplements the paper “Trajectories of Change: Approaches for Tracking Knowledge Evolution,” currently under review. It includes bibliographic, textual, and embedding data for 180,785 publications in General Relativity and Gravitation (GRG) spanning 1911 to 2000, and is based on NASA/ADS. The file is in Parquet format with 33 columns.

Usage

The dataset is directly compatible with the UnigramKLD and EmbeddingDensities classes of the semanticlayertools Python package.

Data Structure

Column	Format	Description	Example
Bibcode	string	Unique publication identifier.	`"1995PASP..107..803U"`
Author	string	Authors listed as comma-separated names.	`"Urry CM, Padovani P"`
Title	string	Title of the publication.	`"Unified Schemes for Radio-Loud Active Galactic Nuclei"`
Title_en	string	Title translated into English.	`"Unified Schemes for Radio-Loud Active Galactic Nuclei"`
Year	integer	Year of publication.	`1995`
Journal	string	Journal name.	`"Publications of the Astronomical Society of the Pacific"`
Journal Abbreviation	string	Abbreviated journal name.	`"PASP"`
Volume	string	Volume number (if applicable).	`"107"`
Issue	string	Issue number (if applicable).	`"19"`
First Page	string	Starting page.	`"803"`
Last Page	string	Ending page.	`"25"`
Abstract	string	Abstract text.	`"The appearance of active galactic nuclei (AGN) depends strongly on orientation, dominating classification..."`
Abstract_en	string	Abstract translated into English.	`"The appearance of active galactic nuclei (AGN) depends strongly on orientation, dominating classification..."`
Keywords	string	Comma-separated keywords.	`"galaxies: active, galaxies: fundamental parameters, astrophysics"`
DOI	string	Digital Object Identifier.	`"10.1086/133630"`
Affiliation	string	Author affiliations.	`"AA(University of XYZ), AB(-)"`
Category	string	Publication type (e.g., article, book).	`"article"`
Citation Count	float	Number of citations.	`4380.0`
References	array of strings	List of cited Bibcodes.	`["1966Natur.209..751H", "1966Natur.211..468R", "1968ApJ...151..393S"]`
PDF_URL	string	Link to the publication PDF.	`"https://ui.adsabs.harvard.edu/link_gateway/1995PASP..107..803U/ADS_PDF"`
Title_lang	string	Language of the title.	`"en"`
Abstract_lang	string	Language of the abstract.	`"en"`
full_text	string	Full text of the publication (where available).	`"Unified Schemes for Radio-Loud Active Galactic Nuclei. The appearance of AGN depends so strongly on..."`
tokens	array of strings	Tokenized text of the title and abstract for computational analysis.	`["unify", "schemes", "radio", "loud", "active", "galactic", "nuclei"]`
UMAP-1	float32	UMAP embedding coordinate 1.	`10.423940`
UMAP-2	float32	UMAP embedding coordinate 2.	`7.890975`
Cluster	integer	Cluster label for topic modeling or grouping.	`15`
Name	string	Descriptive cluster name.	`"15_radio_quasars_sources_galaxies"`
KeyBERT	string	Key phrases extracted via KeyBERT.	`"radio galaxies, high redshift, radio sources, optical imaging"`
OpenAI	string	Embedding-based descriptive phrases.	`"Cosmological Evolution of Radio-Loud Quasars"`
MMR	string	Extracted key phrases using Maximal Marginal Relevance (MMR).	`"quasars, radio sources, redshift, luminosity, star formation"`
POS	string	Key terms extracted via part-of-speech tagging.	`"radio, quasars, sources, galaxies, redshift, optical"`
full_embeddings	array of floats	Text embeddings generated using OpenAI's text-embedding-3-large model.	`"[ 0.01164897 -0.00343577 -0.03168862 ... 0.00237622]"`
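The Bibcode column relates to several other columns in the table (Year, Journal Abbreviation, Volume, First Page). The sketch below splits a Bibcode into those parts, assuming the usual 19-character ADS layout (4-character year, 5-character journal, 4-character volume, 1-character qualifier, 4-character page, first-author initial, with dots as padding). That layout is an assumption not stated in the dataset description, and the names Bibcode and parse_bibcode are hypothetical.

```python
from typing import NamedTuple


class Bibcode(NamedTuple):
    year: int
    journal: str
    volume: str
    qualifier: str
    page: str
    author_initial: str


def parse_bibcode(bibcode: str) -> Bibcode:
    """Split a bibcode into fields, assuming the 19-character ADS layout."""
    if len(bibcode) != 19:
        raise ValueError(f"expected 19 characters, got {len(bibcode)}: {bibcode!r}")
    return Bibcode(
        year=int(bibcode[0:4]),
        journal=bibcode[4:9].strip("."),    # dots pad short journal codes
        volume=bibcode[9:13].strip("."),
        qualifier=bibcode[13].strip("."),   # empty string when unused
        page=bibcode[14:18].strip("."),
        author_initial=bibcode[18],
    )


# Example values taken from the Bibcode and References rows of the table above.
example = parse_bibcode("1995PASP..107..803U")
cited = parse_bibcode("1966Natur.209..751H")
print(example)
print(cited)
```

For the example row, the parsed year, journal, volume and page agree with the Year, Journal Abbreviation, Volume and First Page values shown in the table.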

Transcriptomics of Alveolar Immune Cells Reveals Insight into Mechanisms of...

zenodo.org

Updated May 5, 2025

Cite

Peter Allen (2025). Transcriptomics of Alveolar Immune Cells Reveals Insight into Mechanisms of Human Pulmonary Fibrosis [Dataset]. http://doi.org/10.5281/zenodo.15339001

Explore at:

Unique identifier

https://doi.org/10.5281/zenodo.15339001

Dataset updated

May 5, 2025

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Peter Allen

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This repository contains the preprocessed single-cell RNA data from the manuscript Transcriptomics of Alveolar Immune Cells Reveals Insight into Mechanisms of Human Pulmonary Fibrosis (under submission).

Data Descriptions:

BAL_FINAL.rds	Seurat Object with the entire dataset
BAL_FINAL.h5ad	Scanpy Object with the entire dataset
BAL_FINAL_metadata.txt	Metadata for each cell
mlm_umap_embeddings.csv	UMAP embeddings for the monocyte-derived clusters
ipf_allen2022.full_score.gz	scDRS results for each cell using the Richard Allen et al. 2022 IPF GWAS summary statistics
