23 datasets found

UMAP visualization of WBC and RBC morphotypes
figshare.com
txt
Updated Sep 5, 2022
Cite
José Almeida (2022). UMAP visualization of WBC and RBC morphotypes [Dataset]. http://doi.org/10.6084/m9.figshare.20939335.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.20939335.v1
Dataset updated
Sep 5, 2022
Dataset provided by
Figshare (http://figshare.com/)
Authors
José Almeida
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
JSON files containing UMAP coordinates for WBC and RBC morphometry and their morphotype classification.
umap-learn
kaggle.com
zip
Updated Oct 19, 2025
Cite
HyeongChan Kim (2025). umap-learn [Dataset]. https://www.kaggle.com/kozistr/umaplearn
Explore at:
Available download formats: zip (46934808 bytes)
Dataset updated
Oct 19, 2025
Authors
HyeongChan Kim
Description
UMAP

Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualization similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data:

1. The data is uniformly distributed on a Riemannian manifold;
2. The Riemannian metric is locally constant (or can be approximated as such);
3. The manifold is locally connected.

From these assumptions, it is possible to model the manifold with a fuzzy topological structure. The embedding is found by searching for a low dimensional projection of the data that has the closest possible equivalent fuzzy topological structure.

The details for the underlying mathematics can be found in our paper on ArXiv:

McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018
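The "fuzzy topological structure" mentioned above can be illustrated with a small pure-Python sketch. It follows the membership-strength construction described in the McInnes and Healy paper as we understand it: each point's neighbour distances are turned into weights in (0, 1], with a per-point scale `sigma` chosen so the weights sum to log2(k). The function names and the bisection details are illustrative assumptions, not the umap-learn implementation.

```python
import math

def membership_strengths(distances, n_iter=64, tol=1e-5):
    """Turn one point's k neighbour distances into fuzzy membership strengths.

    Assumes at least one distance is strictly positive.
    """
    target = math.log2(len(distances))
    # rho: distance to the closest neighbour, so that neighbour gets strength 1.
    rho = min(d for d in distances if d > 0.0)
    lo, hi, sigma = 0.0, math.inf, 1.0
    for _ in range(n_iter):
        total = sum(math.exp(-max(0.0, d - rho) / sigma) for d in distances)
        if abs(total - target) < tol:
            break
        if total > target:
            # Weights too large: shrink the scale.
            hi = sigma
            sigma = (lo + hi) / 2.0
        else:
            # Weights too small: grow the scale.
            lo = sigma
            sigma = sigma * 2.0 if hi == math.inf else (lo + hi) / 2.0
    return [math.exp(-max(0.0, d - rho) / sigma) for d in distances]

def fuzzy_union(a, b):
    """Symmetrise two directed strengths with the probabilistic t-conorm."""
    return a + b - a * b

weights = membership_strengths([0.5, 1.0, 1.5, 2.0, 2.5])
```

The low-dimensional embedding is then optimised so that its own fuzzy structure matches these weights as closely as possible; that optimisation step is not shown here.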
Additional file 3 of Mugen-UMAP: UMAP visualization and clustering of...
figshare.com
springernature.figshare.com
csv
Updated Sep 28, 2024
Cite
Teng Li; Yiran Zou; Xianghan Li; Thomas K. F. Wong; Allen G. Rodrigo (2024). Additional file 3 of Mugen-UMAP: UMAP visualization and clustering of mutated genes in single-cell DNA sequencing data [Dataset]. http://doi.org/10.6084/m9.figshare.27123950.v1
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.6084/m9.figshare.27123950.v1
Dataset updated
Sep 28, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Teng Li; Yiran Zou; Xianghan Li; Thomas K. F. Wong; Allen G. Rodrigo
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Supplementary file 3. AnnData format of the 12 NSCLC patients dataset.
Data from: MSI-VISUAL: New visualization methods for Mass Spectrometry...
data.niaid.nih.gov
nde-dev.biothings.io
+1 more
xml
Updated Mar 14, 2025
Cite
Jens Pahnke (2025). MSI-VISUAL: New visualization methods for Mass Spectrometry Imaging and tools for interactive mapping and exploration of m/z values [Dataset]. https://data.niaid.nih.gov/resources?id=pxd056609
Explore at:
Available download formats: xml
Dataset updated
Mar 14, 2025
Dataset provided by
University of Oslo, www.pahnkelab.eu
University of Oslo, University of Lübeck, University of Latvia, Tel Aviv University, www.pahnkelab.eu
Authors
Jens Pahnke
Variables measured
Proteomics
Description
Mass spectrometry imaging dataset from fresh frozen mouse brain sections for development of a novel spatial segmentation computational pipeline.
Additional file 5 of GECO: gene expression clustering optimization app for...
springernature.figshare.com
txt
Updated May 31, 2023
Cite
A. N. Habowski; T. J. Habowski; M. L. Waterman (2023). Additional file 5 of GECO: gene expression clustering optimization app for non-linear data visualization of patterns [Dataset]. http://doi.org/10.6084/m9.figshare.13642382.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.13642382.v1
Dataset updated
May 31, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
A. N. Habowski; T. J. Habowski; M. L. Waterman
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Additional file 5: CSV file of bulk RNA-seq data of F. nucleatum infection time course used for GECO UMAP generation.
Additional file 4 of GECO: gene expression clustering optimization app for...
springernature.figshare.com
txt
Updated May 31, 2023
Cite
A. N. Habowski; T. J. Habowski; M. L. Waterman (2023). Additional file 4 of GECO: gene expression clustering optimization app for non-linear data visualization of patterns [Dataset]. http://doi.org/10.6084/m9.figshare.13642379.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.13642379.v1
Dataset updated
May 31, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
A. N. Habowski; T. J. Habowski; M. L. Waterman
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Additional file 4: CSV file of colon crypt bulk RNA-seq data used for GECO UMAP generation.
Acoustic features as a tool to visualize and explore marine soundscapes:...
data.niaid.nih.gov
search.dataone.org
+1 more
zip
Updated Feb 15, 2024
Cite
Simone Cominelli; Nicolo' Bellin; Carissa D. Brown; Jack Lawson (2024). Acoustic features as a tool to visualize and explore marine soundscapes: Applications illustrated using marine mammal Passive Acoustic Monitoring datasets [Dataset]. http://doi.org/10.5061/dryad.3bk3j9kn8
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.3bk3j9kn8
Dataset updated
Feb 15, 2024
Dataset provided by
Memorial University of Newfoundland
Fisheries and Oceans Canada
University of Parma
Authors
Simone Cominelli; Nicolo' Bellin; Carissa D. Brown; Jack Lawson
License
https://spdx.org/licenses/CC0-1.0.html
Description
Passive Acoustic Monitoring (PAM) is emerging as a solution for monitoring species and environmental change over large spatial and temporal scales. However, drawing rigorous conclusions based on acoustic recordings is challenging, as there is no consensus over which approaches and indices are best suited for characterizing marine and terrestrial acoustic environments. Here, we describe the application of multiple machine-learning techniques to the analysis of a large PAM dataset. We combine pre-trained acoustic classification models (VGGish, NOAA & Google Humpback Whale Detector), dimensionality reduction (UMAP), and balanced random forest algorithms to demonstrate how machine-learned acoustic features capture different aspects of the marine environment. The UMAP dimensions derived from VGGish acoustic features exhibited good performance in separating marine mammal vocalizations according to species and locations. RF models trained on the acoustic features performed well for labelled sounds in the 8 kHz range; however, low and high-frequency sounds could not be classified using this approach. The workflow presented here shows how acoustic feature extraction, visualization, and analysis allow for establishing a link between ecologically relevant information and PAM recordings at multiple scales. The datasets and scripts provided in this repository allow replicating the results presented in the publication.

Methods

Data acquisition and preparation

We collected all records available in the Watkins Marine Mammal Database (WMD) website listed under the “all cuts” page. For each audio file in the WMD the associated metadata included a label for the sound sources present in the recording (biological, anthropogenic, and environmental), as well as information related to the location and date of recording. To minimize the presence of unwanted sounds in the samples, we only retained audio files with a single source listed in the metadata.
We then labelled the selected audio clips according to taxonomic group (Odontocetae, Mysticetae), and species. We limited the analysis to 12 marine mammal species by discarding data when a species: had less than 60 s of audio available, had a vocal repertoire extending beyond the resolution of the acoustic classification model (VGGish), or was recorded in a single country. To determine if a species was suited for analysis using VGGish, we inspected the Mel-spectrograms of 3-s audio samples and only retained species with vocalizations that could be captured in the Mel-spectrogram (Appendix S1). The vocalizations of species that produce very low-frequency or very high-frequency sounds were not captured by the Mel-spectrogram, thus we removed them from the analysis. To ensure that records included the vocalizations of multiple individuals for each species, we only considered species with records from two or more different countries. Lastly, to avoid overrepresentation of sperm whale vocalizations, we excluded 30,000 sperm whale recordings collected in the Dominican Republic. The resulting dataset consisted of 19,682 audio clips with a duration of 960 milliseconds each (0.96 s) (Table 1).

The Placentia Bay Database (PBD) includes recordings collected by Fisheries and Oceans Canada in Placentia Bay (Newfoundland, Canada), in 2019. The dataset consisted of two months of continuous recordings (1230 hours), starting on July 1st, 2019, and ending on August 31st, 2019. The data was collected using an AMAR G4 hydrophone (sensitivity: -165.02 dB re 1V/µPa at 250 Hz) deployed at 64 m of depth. The hydrophone was set to operate following 15 min cycles, with the first 60 s sampled at 512 kHz, and the remaining 14 min sampled at 64 kHz. For the purpose of this study, we limited the analysis to the 64 kHz recordings.
Acoustic feature extraction

The audio files from the WMD and PBD databases were used as input for VGGish (Abu-El-Haija et al., 2016; Chung et al., 2018), a CNN developed and trained to perform general acoustic classification. VGGish was trained on the YouTube-8M dataset, containing more than two million user-labelled audio-video files. Rather than focusing on the final output of the model (i.e., the assigned labels), here the model was used as a feature extractor (Sethi et al., 2020). VGGish converts audio input into a semantically meaningful vector consisting of 128 features. The model returns features at multiple resolutions: ~1 s (960 ms); ~5 s (4,800 ms); ~1 min (59,520 ms); ~5 min (299,520 ms). All of the visualizations and results pertaining to the WMD were prepared using the finest feature resolution of ~1 s. The visualizations and results pertaining to the PBD were prepared using the ~5 s features for the humpback whale detection example, and were then averaged to an interval of 30 min in order to match the temporal resolution of the environmental measures available for the area.

UMAP ordination and visualization

UMAP is a non-linear dimensionality reduction algorithm based on the concept of topological data analysis which, unlike other dimensionality reduction techniques (e.g., tSNE), preserves both the local and global structure of multivariate datasets (McInnes et al., 2018). To allow for data visualization and to reduce the 128 features to two dimensions for further analysis, we applied Uniform Manifold Approximation and Projection (UMAP) to both datasets and inspected the resulting plots. The UMAP algorithm generates a low-dimensional representation of a multivariate dataset while maintaining the relationships between points in the global dataset structure (i.e., the 128 features extracted from VGGish).
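The feature resolutions listed above are all whole multiples of the 960 ms base frame (5, 62, and 312 frames respectively). The sketch below checks that arithmetic and shows one way coarser features could be derived from the finest ones. Averaging consecutive frames is an assumption made for illustration (the text only states that the ~5 s features were averaged to 30 min), and the helper names are hypothetical.

```python
def frames_per_window(window_ms, frame_ms=960):
    """Number of base frames that fit exactly into a coarser window."""
    if window_ms % frame_ms != 0:
        raise ValueError("window is not a whole number of frames")
    return window_ms // frame_ms

def average_windows(frames, n):
    """Average consecutive groups of n feature vectors.

    A trailing group with fewer than n frames is dropped.
    """
    averaged = []
    for start in range(0, len(frames) - n + 1, n):
        group = frames[start:start + n]
        # zip(*group) iterates over feature columns across the group.
        averaged.append([sum(column) / n for column in zip(*group)])
    return averaged

# Toy example: eleven 2-feature frames instead of 128-feature VGGish vectors.
toy_frames = [[float(i), 2.0 * i] for i in range(11)]
five_second_features = average_windows(toy_frames, frames_per_window(4800))
```

By the same arithmetic, a 30 min interval corresponds to 375 of the ~5 s (4,800 ms) feature vectors.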
Each point in a UMAP plot in this paper represents an audio sample with duration of ~ 1 second (WMD dataset), ~ 5 seconds (PBD dataset, humpback whale detections), or 30 minutes (PBD dataset, environmental variables). Each point in the two-dimensional UMAP space also represents a vector of 128 VGGish features. The nearer two points are in the plot space, the nearer the two points are in the 128-dimensional space, and thus the distance between two points in UMAP reflects the degree of similarity between two audio samples in our datasets. Areas with a high density of samples in UMAP space should, therefore, contain sounds with similar characteristics, and such similarity should decrease with increasing point distance. Previous studies illustrated how VGGish and UMAP can be applied to the analysis of terrestrial acoustic datasets (Heath et al., 2021; Sethi et al., 2020). The visualizations and classification trials presented here illustrate how the two techniques (VGGish and UMAP) can be used together for marine ecoacoustics analysis. UMAP visualizations were prepared using the umap-learn package for the Python programming language (version 3.10). All UMAP visualizations presented in this study were generated using the algorithm’s default parameters.
Labelling sound sources

The labels for the WMD records (i.e., taxonomic group, species, location) were obtained from the database metadata. For the PBD recordings, we obtained measures of wind speed, surface temperature, and current speed from an oceanographic buoy located in proximity of the recorder (Fig 1). We chose these three variables for their different contributions to background noise in marine environments. Wind speed contributes to underwater background noise at multiple frequencies, ranging from 500 Hz to 20 kHz (Hildebrand et al., 2021). Sea surface temperature contributes to background noise at frequencies between 63 Hz and 125 Hz (Ainslie et al., 2021), while ocean currents contribute to ambient noise at frequencies below 50 Hz (Han et al., 2021). Prior to analysis, we categorized the environmental variables and assigned the categories as labels to the acoustic features (Table 2). Humpback whale vocalizations in the PBD recordings were processed using the humpback whale acoustic detector created by NOAA and Google (Allen et al., 2021), providing a model score for every ~5 s sample. This model was trained on a large dataset (14 years and 13 locations) using humpback whale recordings annotated by experts (Allen et al., 2021). The model returns scores ranging from 0 to 1 indicating the confidence in the predicted humpback whale presence. We used the results of this detection model to label the PBD samples according to presence of humpback whale vocalizations. To verify the model results, we inspected all audio files that contained a 5 s sample with a model score higher than 0.9 for the month of July. If the presence of a humpback whale was confirmed, we labelled the segment as a model detection. We labelled any additional humpback whale vocalization present in the inspected audio files as a visual detection, while we labelled other sources and background noise samples as absences. In total, we labelled 4.6 hours of recordings.
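The selection step described above (inspect every audio file containing at least one ~5 s sample with a detector score above 0.9) can be sketched as follows. The record layout and the file names are hypothetical, chosen only for illustration; the detector itself is not called here.

```python
from collections import defaultdict

def files_to_inspect(scores, threshold=0.9):
    """Group high-scoring sample indices by audio file.

    `scores` is an iterable of (file_name, sample_index, score) tuples,
    a hypothetical layout for the detector output.
    """
    flagged = defaultdict(list)
    for file_name, sample_index, score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        # Strictly "higher than" the threshold, as in the text.
        if score > threshold:
            flagged[file_name].append(sample_index)
    return dict(flagged)

detector_output = [
    ("july_01_0000.wav", 0, 0.95),
    ("july_01_0000.wav", 1, 0.20),
    ("july_01_0015.wav", 0, 0.90),
    ("july_01_0030.wav", 3, 0.99),
]
inspection_list = files_to_inspect(detector_output)
```

Files in the resulting dictionary would then be reviewed manually, and the flagged samples confirmed or rejected.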
We reserved the recordings collected in August to test the precision of the final predictive model.

Label prediction performance

We used Balanced Random Forest models (BRF) provided in the imbalanced-learn Python package (Lemaître et al., 2017) to predict humpback whale presence and environmental conditions from the acoustic features generated by VGGish. We chose BRF because it is suited for datasets characterized by class imbalance. The BRF algorithm performs undersampling of the majority class prior to prediction, which helps overcome class imbalance (Lemaître et al., 2017). For each model run, the PBD dataset was split into training (80%) and testing (20%) sets. The training datasets were used to fine-tune the models through a nested k-fold cross validation approach with ten folds in the outer loop, and five folds in the inner loop. We selected nested cross validation as it allows optimizing model hyperparameters and performing model evaluation in a single step. We used the default parameters of the BRF algorithm, except for the ‘n_estimators’ hyperparameter, for which we tested
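The undersampling idea mentioned in the description above can be shown with a short standard-library sketch: every class is randomly reduced to the size of the smallest class before a model is trained. This is only an illustration of the principle under that assumption; it is not the imbalanced-learn implementation, and the function name is hypothetical.

```python
import random
from collections import defaultdict

def undersample(samples, labels, seed=0):
    """Randomly reduce every class to the size of the rarest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    n_min = min(len(members) for members in by_class.values())
    balanced_samples, balanced_labels = [], []
    for label, members in by_class.items():
        # Draw n_min members without replacement from each class.
        for sample in rng.sample(members, n_min):
            balanced_samples.append(sample)
            balanced_labels.append(label)
    return balanced_samples, balanced_labels

# Toy data: 90 "absence" samples and 10 "presence" samples.
toy_samples = list(range(100))
toy_labels = ["absence"] * 90 + ["presence"] * 10
balanced_x, balanced_y = undersample(toy_samples, toy_labels)
```

In a forest-based variant, a fresh balanced subset like this would typically be drawn for each tree, so that different trees see different majority-class samples.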
Vietnamese Curated Dataset
kaggle.com
zip
Updated Jan 26, 2025
+ more versions
Cite
Daniel Henry (2025). Vietnamese Curated Dataset [Dataset]. https://www.kaggle.com/datasets/ndy001/vietnamese-curated-dataset-2
Explore at:
Available download formats: zip (31037919590 bytes)
Dataset updated
Jan 26, 2025
Authors
Daniel Henry
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset Description

Vietnamese Curated Text Dataset. This dataset is collected from multiple open Vietnamese datasets and curated with NeMo Curator.

Developed by: Viettel Solutions

Language: Vietnamese

Details

Please visit our Tech Blog post on NVIDIA's blog page for details. Link

Data Collection

We utilize a combination of datasets that contain samples in the Vietnamese language, ensuring a robust and representative text corpus. These datasets include:
- The Vietnamese subset of the C4 dataset.
- The Vietnamese subset of the OSCAR dataset, version 23.01.
- Wikipedia's Vietnamese articles.
- Binhvq's Vietnamese news corpus.

Preprocessing

We use NeMo Curator to curate the collected data. The data curation pipeline includes these key steps:
1. Unicode Reformatting: Texts are standardized into a consistent Unicode format to avoid encoding issues.
2. Exact Deduplication: Removes exact duplicates to reduce redundancy.
3. Quality Filtering:
   - Heuristic Filtering: Applies rules-based filters to remove low-quality content.
   - Classifier-Based Filtering: Uses machine learning to classify and filter documents based on quality.
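The first steps of the pipeline above can be sketched with the Python standard library. This is not the NeMo Curator API: the function names, the choice of NFC normalization, and the two heuristic thresholds are all assumptions made for illustration, and the classifier-based filtering step is omitted.

```python
import hashlib
import unicodedata

def normalize_unicode(text):
    """Standardize text to one Unicode form (NFC is an assumed choice)."""
    return unicodedata.normalize("NFC", text)

def exact_dedup(documents):
    """Keep the first occurrence of each byte-identical document."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def passes_heuristics(text, min_words=3, max_symbol_ratio=0.3):
    """Rule-based quality check; both thresholds are hypothetical."""
    if len(text.split()) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio

def curate(documents):
    """Normalize, deduplicate, then filter, in that order."""
    normalized = [normalize_unicode(doc) for doc in documents]
    return [doc for doc in exact_dedup(normalized) if passes_heuristics(doc)]

composed = unicodedata.normalize("NFC", "Xin chào thế giới")
decomposed = unicodedata.normalize("NFD", composed)
curated = curate([composed, decomposed, "!!! ### $$$", "ok"])
```

Normalizing before deduplicating matters for Vietnamese text: the composed and decomposed spellings above look identical on screen but differ byte for byte, so they only collapse into one document after normalization.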

Notebook

Dataset Statistics

Content diversity

Figure: Domain proportion in curated dataset (https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/mW6Pct3uyP_XDdGmE8EP3.png)

Character based metrics

Figure: Box plots of percentage of symbols, numbers, and whitespace characters compared to the total characters, word counts and average word lengths (https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/W9TQjM2vcC7uXozyERHSQ.png)

Token count distribution

Figure: Distribution of document sizes (in terms of token count) (https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/PDelYpBI0DefSmQgFONgE.png)

Embedding visualization

Figure: UMAP visualization of 5% of the dataset (https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/sfeoZWuQ7DcSpbmUOJ12r.png)
Data_Sheet_1_Manifold learning for fMRI time-varying functional...
frontiersin.figshare.com
docx
Updated Jul 11, 2023
Cite
Javier Gonzalez-Castillo; Isabel S. Fernandez; Ka Chun Lam; Daniel A. Handwerker; Francisco Pereira; Peter A. Bandettini (2023). Data_Sheet_1_Manifold learning for fMRI time-varying functional connectivity.docx [Dataset]. http://doi.org/10.3389/fnhum.2023.1134012.s001
Explore at:
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fnhum.2023.1134012.s001
Dataset updated
Jul 11, 2023
Dataset provided by
Frontiers Media (http://www.frontiersin.org/)
Authors
Javier Gonzalez-Castillo; Isabel S. Fernandez; Ka Chun Lam; Daniel A. Handwerker; Francisco Pereira; Peter A. Bandettini
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Whole-brain functional connectivity (FC) measured with functional MRI (fMRI) evolves over time in meaningful ways at temporal scales going from years (e.g., development) to seconds [e.g., within-scan time-varying FC (tvFC)]. Yet, our ability to explore tvFC is severely constrained by its large dimensionality (several thousands). To overcome this difficulty, researchers often seek to generate low dimensional representations (e.g., 2D and 3D scatter plots) hoping those will retain important aspects of the data (e.g., relationships to behavior and disease progression). Limited prior empirical work suggests that manifold learning techniques (MLTs)—namely those seeking to infer a low dimensional non-linear surface (i.e., the manifold) where most of the data lies—are good candidates for accomplishing this task. Here we explore this possibility in detail. First, we discuss why one should expect tvFC data to lie on a low dimensional manifold. Second, we estimate what is the intrinsic dimension (ID; i.e., minimum number of latent dimensions) of tvFC data manifolds. Third, we describe the inner workings of three state-of-the-art MLTs: Laplacian Eigenmaps (LEs), T-distributed Stochastic Neighbor Embedding (T-SNE), and Uniform Manifold Approximation and Projection (UMAP). For each method, we empirically evaluate its ability to generate neuro-biologically meaningful representations of tvFC data, as well as their robustness against hyper-parameter selection. Our results show that tvFC data has an ID that ranges between 4 and 26, and that ID varies significantly between rest and task states. We also show how all three methods can effectively capture subject identity and task being performed: UMAP and T-SNE can capture these two levels of detail concurrently, but LE could only capture one at a time. We observed substantial variability in embedding quality across MLTs, and within-MLT as a function of hyper-parameter selection. 
To help alleviate this issue, we provide heuristics that can inform future studies. Finally, we also demonstrate the importance of feature normalization when combining data across subjects and the role that temporal autocorrelation plays in the application of MLTs to tvFC data. Overall, we conclude that while MLTs can be useful to generate summary views of labeled tvFC data, their application to unlabeled data such as resting-state remains challenging.
Twitter Airline Sentiment Dataset
kaggle.com
zip
Updated Nov 14, 2025
Cite
Chandana Ramakrishna (2025). Twitter Airline Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/chandana890/twitter-airline-sentiment-dataset
Explore at:
Available download formats: zip (1134990 bytes)
Dataset updated
Nov 14, 2025
Authors
Chandana Ramakrishna
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Overview

This dataset contains tweets related to major US airlines and is widely used for NLP and sentiment analysis tasks. Each record includes the tweet text, timestamp, airline name, and sentiment label (positive, negative, neutral). This uploaded version is prepared to support advanced text processing, machine learning, and anomaly detection experiments.

What's Included

Tweets.csv – Full collection of airline-related tweets

Text content suitable for NLP tasks

Timestamp information (useful for time-based analysis)

Sentiment labels for classification and evaluation

Cleaned text field for direct use in ML pipelines

Purpose of This Dataset

This dataset is used in a machine learning workflow focused on:
- sentiment analysis
- embedding generation (transformers)
- dimensionality reduction (PCA, UMAP)
- clustering and visualization
- unsupervised anomaly detection using Isolation Forest

It is especially suited for exploring changes in public sentiment, event detection, and contextual analysis in social media data.

Key Use Cases

Building and testing NLP models

Semantic similarity and embedding-based analysis

Sentiment classification

Detecting anomalous posts or time periods

Visualizing tweet clusters using UMAP

Studying customer feedback patterns in the airline industry

Source

Originally derived from the Twitter US Airline Sentiment dataset on Kaggle.
This uploaded version is intended for educational, analytical, and research purposes.

Notes

If you're using this dataset in a notebook, ensure you update your file path accordingly:

```python
import pandas as pd

df = pd.read_csv("/kaggle/input/twitter-airline-sentiment-dataset/Tweets.csv")
```
Replication Data for: Measuring the impact of campaign finance on...
dataverse.harvard.edu
search.dataone.org
Updated Mar 31, 2022
Cite
Matthias Lalisse (2022). Replication Data for: Measuring the impact of campaign finance on congressional voting: A machine learning approach [Dataset]. http://doi.org/10.7910/DVN/DHQQHX
Explore at:
Croissant (a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.7910/DVN/DHQQHX
Dataset updated
Mar 31, 2022
Dataset provided by
Harvard Dataverse
Authors
Matthias Lalisse
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Replication data for the paper: "Measuring the impact of campaign finance on congressional voting: A machine learning approach". Includes:
* metadata for legislators and bills,
* text embeddings for legislative summaries (sourced from ProPublica Congress Database). Includes 768d LongFormer embeddings and 2d embeddings for visualization (UMAP and Isomap),
* legislator embeddings: 100d PCA on legislators' financial disclosures, as well as 2d visualization embeddings (UMAP and Isomap),
* scripts for running the classification and RSA analyses.

Up to 100d embeddings are provided from the output of PCA for both bills and legislators. See README.ipynb for a tour of the datasets as well as starter code.
wikipos
huggingface.co
Updated Sep 15, 2025
Cite
Philip Gerdes (2025). wikipos [Dataset]. https://huggingface.co/datasets/whatphiliptrains/wikipos
Explore at:
Dataset updated
Sep 15, 2025
Authors
Philip Gerdes
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Dataset Card for WikiPos

Dataset Summary

WikiPos is a processed version of the Wikimedia Wikipedia dataset that includes 2D spatial coordinates generated through dimensionality reduction techniques. The dataset contains Wikipedia articles with their original text content plus x,y coordinates derived from sentence embeddings using UMAP and t-SNE algorithms. The dataset enables spatial visualization and exploration of Wikipedia content, allowing researchers to analyze… See the full description on the dataset page: https://huggingface.co/datasets/whatphiliptrains/wikipos.
MNIST-Curation
huggingface.co
Updated Nov 22, 2025
Cite
Constantin (2025). MNIST-Curation [Dataset]. https://huggingface.co/datasets/Consscht/MNIST-Curation
Explore at:
Dataset updated
Nov 22, 2025
Authors
Constantin
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Curation of the famous MNIST Dataset

The curation was done using qualitative analysis of the dataset, following visualization techniques like PCA and UMAP and score-based categorization of the samples using metrics like hardness, mistakenness, or uniqueness. The code of the curation can be found on GitHub:👉 https://github.com/Conscht/MNIST_Curation_Repo/tree/main
This curated version of MNIST introduces an additional IDK (“I Don’t Know”) label for digits that are ambiguous, noisy… See the full description on the dataset page: https://huggingface.co/datasets/Consscht/MNIST-Curation.
Dataset name, reference, dimensions and cell type composition.
plos.figshare.com
xls
Updated Dec 13, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yuta Hozumi; Guo-Wei Wei (2024). Dataset name, reference, dimensions and cell type composition. [Dataset]. http://doi.org/10.1371/journal.pone.0311791.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0311791.t001
Dataset updated
Dec 13, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Yuta Hozumi; Guo-Wei Wei
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset name, reference, dimensions and cell type composition.
Two CyTOF benchmark data sets for analysis.
plos.figshare.com
xls
Updated Jun 15, 2023
Cite
Lijun Cheng; Pratik Karkhanis; Birkan Gokbag; Yueze Liu; Lang Li (2023). Two CyTOF benchmark data sets for analysis. [Dataset]. http://doi.org/10.1371/journal.pcbi.1008885.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pcbi.1008885.t001
Dataset updated
Jun 15, 2023
Dataset provided by
PLOS Computational Biology
Authors
Lijun Cheng; Pratik Karkhanis; Birkan Gokbag; Yueze Liu; Lang Li
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Two CyTOF benchmark data sets for analysis.
Comparison of machine-learning methods by different measurements for CyTOF...
plos.figshare.com
xls
Updated Jun 5, 2023
Cite
Lijun Cheng; Pratik Karkhanis; Birkan Gokbag; Yueze Liu; Lang Li (2023). Comparison of machine-learning methods by different measurements for CyTOF Dataset 1 (13 biomarkers, 24 labeled cell types). [Dataset]. http://doi.org/10.1371/journal.pcbi.1008885.t004
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pcbi.1008885.t004
Dataset updated
Jun 5, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Lijun Cheng; Pratik Karkhanis; Birkan Gokbag; Yueze Liu; Lang Li
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Comparison of machine-learning methods by different measurements for CyTOF Dataset 1 (13 biomarkers, 24 labeled cell types).
Comparison of methods for averaging performance in the identification of...
plos.figshare.com
xls
Updated Jun 5, 2023
Cite
Lijun Cheng; Pratik Karkhanis; Birkan Gokbag; Yueze Liu; Lang Li (2023). Comparison of methods for averaging performance in the identification of known cell types in training and testing data by different measurements for CyTOF1 and CyTOF2 datasets. [Dataset]. http://doi.org/10.1371/journal.pcbi.1008885.t003
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pcbi.1008885.t003
Dataset updated
Jun 5, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Lijun Cheng; Pratik Karkhanis; Birkan Gokbag; Yueze Liu; Lang Li
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Comparison of methods for averaging performance in the identification of known cell types in training and testing data by different measurements for CyTOF1 and CyTOF2 datasets.
Calibration of cell types utilizing calibration feedback for CyTOF1 and...
plos.figshare.com
xls
Updated May 31, 2023
Cite
Lijun Cheng; Pratik Karkhanis; Birkan Gokbag; Yueze Liu; Lang Li (2023). Calibration of cell types utilizing calibration feedback for CyTOF1 and CyTOF2 data. [Dataset]. http://doi.org/10.1371/journal.pcbi.1008885.t006
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pcbi.1008885.t006
Dataset updated
May 31, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Lijun Cheng; Pratik Karkhanis; Birkan Gokbag; Yueze Liu; Lang Li
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Calibration of cell types utilizing calibration feedback for CyTOF1 and CyTOF2 data.
Additional file 1 of Choice of pre-processing pipeline influences clustering...
springernature.figshare.com
zip
Updated Jun 5, 2023
Cite
Inbal Shainer; Manuel Stemmer (2023). Additional file 1 of Choice of pre-processing pipeline influences clustering quality of scRNA-seq datasets [Dataset]. http://doi.org/10.6084/m9.figshare.16620628.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.16620628.v1
Dataset updated
Jun 5, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Inbal Shainer; Manuel Stemmer
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Additional file 1: Fig. S1 Total gene detection of all datasets compared after processing with either kallisto or Cell Ranger. The Venn diagrams show commonly detected number of genes by both pipelines and uniquely detected genes. Fig. S2 Violin-plots showing distribution of gene and UMI detection per cell of all the analyzed datasets (Table 1) run with the Cell Ranger pipeline. Fig. S3 Violin-plots showing distribution of gene and UMI detection per cell of all the analyzed datasets (Table 1) run with the kallisto pipeline. Fig. S4 Cell counts of all datasets compared after processing with either kallisto forced or Cell Ranger. The Venn diagrams show commonly detected cell barcodes by both pipelines and uniquely detected cell barcodes. Fig. S5 Alignment results of all datasets (Table 1) run with either Cell Ranger or kallisto forced against Ensembl reference. a Percent alignment rates of all reads against the reference transcriptome. b Total gene detection. c Median gene counts over all cells per dataset. d Median UMI counts over all cells per dataset. e Total cell counts of each dataset. Fig. S6 Total gene detection of all datasets compared after processing with either kallisto forced or Cell Ranger. The Venn diagrams show commonly detected number of genes by both pipelines and uniquely detected genes. Fig. S7 Violin-plots showing distribution of gene and UMI detection per cell of all the analyzed datasets (Table 1) run with the kallisto forced pipeline. Fig. S8 Violin-plots showing distribution of gene and UMI detection per cell of the dr_pineal_s2 dataset after additional filtering for downstream analysis. Run with either Cell Ranger (a), kallisto (b) or kallisto forced (c). Fig. S9 Downstream analysis of dr_pineal_s2 before cluster merging. a 2D visualization using UMAP of Cell Ranger analyzed clusters before merging, with resolution equal to 0.9. Each point represents a single cell, colored according to cell type. The cells were clustered into 21 types. 
b Expression profile of marker genes according to cluster [7] of (a). Clusters 0, 1, 8 and 18 are all rod-like PhR subclusters. They expressed rod-like PhR markers (exorh, gant1, gngt1), but the expression levels differed and resulted in their separation. For simplicity, they were merged and are referred to as a single rod-like PhRs cluster in the main text. Similarly, clusters 7 and 12 were merged into a single Müller-glia like cluster, clusters 2, 5 and 16 were merged into a single RPE-like cluster, clusters 3 and 10 were merged into a single habenula kiss1 cluster and clusters 11 and 19 were merged into a single leukocytes cluster. c 2D visualization using UMAP of Cell Ranger analyzed clusters, with resolution equal to 2. The cells were clustered into 31 types. However, the two different cone-like PhR cell types are still not distinguished from one another. d Expression profile of marker genes according to cluster of (c). e 2D visualization using UMAP of kallisto analyzed dr_pineal_s2 clusters before merging, with resolution equal to 0.9. The cells were clustered into 24 types. f Expression profile of marker genes according to cluster of (c). Similar to that described above, clusters 1, 2, 3, 7 and 21 were merged into a single rod-like PhRs cluster, clusters 0, 9 and 17 were merged into a single RPE-like cluster, clusters 11 and 12 were merged into a single Müller-glia like cluster, clusters 4, 5 and 20 were merged into a single habenula kiss1 cluster and clusters 13 and 22 were merged into a single leukocytes cluster. g 2D visualization using UMAP of kallisto forced analyzed dr_pineal_s2 clusters, with resolution equal to 1.2. The cells were clustered into 27 types. h Expression profile of marker genes according to cluster of (g). The col14a1b gene was only detected in the kallisto and kallisto forced datasets and is the strongest DE marker within the red cone-like cluster (f, h). Fig. S10 Heatmap of genes with higher counts in kallisto pre-processed pineal data. 
All the UMI counts for both kallisto and Cell Ranger were summed, and the diff_ratio value was calculated as \(\frac{kallisto\_counts - CellRanger\_counts}{kallisto\_counts + CellRanger\_counts}\) for each gene (Additional file 1: Fig. 10). The top 80 diff_ratio genes, as well as the top 20 genes uniquely identified in kallisto were plotted according to the average scaled expression per cluster. Fig. S11 Heatmap of genes with higher counts in Cell Ranger pre-processed pineal data. All the UMI counts for both kallisto and Cell Ranger were summed, and the diff_ratio value was calculated as \(\frac{kallisto\_counts - CellRanger\_counts}{kallisto\_counts + CellRanger\_counts}\) for each gene (Additional file 1: Fig. S11). The top 80 diff_ratio genes, as well as the top 20 genes uniquely identified in Cell Ranger were plotted according to the average scaled expression per cluster.
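The diff_ratio described for Fig. S10 and Fig. S11 can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the input layout (a dict of per-gene summed UMI counts) and the gene names are hypothetical, and returning None for a gene with zero counts in both pipelines is an assumption made here to avoid division by zero.

```python
def diff_ratio(kallisto_counts, cellranger_counts):
    """(kallisto - CellRanger) / (kallisto + CellRanger) for one gene.

    Returns None when both summed counts are zero (assumption: such
    genes are simply left unscored).
    """
    total = kallisto_counts + cellranger_counts
    if total == 0:
        return None
    return (kallisto_counts - cellranger_counts) / total


def top_diff_ratio_genes(summed_counts, n=80):
    """Rank genes by diff_ratio, highest (most kallisto-biased) first.

    summed_counts maps gene name -> (kallisto_sum, cellranger_sum).
    """
    scored = []
    for gene, (k_sum, c_sum) in summed_counts.items():
        ratio = diff_ratio(k_sum, c_sum)
        if ratio is not None:
            scored.append((gene, ratio))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n]


# Hypothetical summed UMI counts per gene: (kallisto, Cell Ranger)
example = {
    "geneA": (30, 10),   # higher in kallisto
    "geneB": (10, 30),   # higher in Cell Ranger
    "geneC": (50, 0),    # detected only by kallisto
    "geneD": (0, 0),     # undetected in both, skipped
}
ranked = top_diff_ratio_genes(example, n=2)
```

The ratio is bounded between -1 (counts only in Cell Ranger) and 1 (counts only in kallisto), so a gene detected by just one pipeline always sits at an extreme of the ranking. Sorting in ascending order instead would give the Cell Ranger-biased genes.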
Multicellular ecotypes shape progression of lung adenocarcinoma from...
figshare.com
bin
Updated Feb 26, 2024
Cite
Yulan Deng (2024). Multicellular ecotypes shape progression of lung adenocarcinoma from ground-glass opacity towards advanced stages [Dataset]. http://doi.org/10.6084/m9.figshare.25287325.v1
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.25287325.v1
Dataset updated
Feb 26, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Yulan Deng
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Single-cell RNA sequencing and read processing. The cell suspension was loaded into Chromium microfluidic chips with 3’ (v3) chemistry and barcoded with a 10× Chromium Controller (10X Genomics). RNA from the barcoded cells was subsequently reverse-transcribed and sequencing libraries were constructed with reagents from a Chromium Single Cell 3’ v3 reagent kit (10X Genomics) according to the manufacturer’s instructions. Sequencing was performed with Illumina Novaseq 6000, according to the manufacturer’s instructions (Illumina). The 10X Genomics CellRanger software pipeline (v 5.0.1) was used to demultiplex cell barcodes and reads were mapped to the hg38 human genome using STAR aligner (v2.7.10a) (Dobin et al., 2013). Filtering, normalization, integration and clustering of scRNA-seq data. Seurat (v3.1.0) was used for filtering, selecting variable genes, dataset integration, dimensionality reduction, clustering, cell type annotation, differential expression, and visualization. We applied quality measures on the raw gene-cell-barcode matrix for each cell: mitochondrial genes (≤20%), unique molecular identifiers (UMIs), and gene count (ranging from 200 to 6000). We excluded genes with min.cells < 3 and removed mitochondrial as well as ribosomal genes in the subsequent analysis. For the remaining cells and genes, we defined relative expression by centering the gene count using the ‘ScaleData’ function. In the integration step, function ‘SelectIntegrationFeatures’ was used to select features, which were then scaled (function ‘ScaleData’) and used to compute the principal components (PCs, function ‘RunPCA’). When identifying integration anchors, one sample from each clinical stage was randomly selected as reference (function ‘FindIntegrationAnchors’, reduction = "rpca"). Then all scRNA-seq datasets were integrated using the above anchors (function ‘IntegrateData’). 
Cell clustering, tSNE visualization and UMAP visualization were performed using the FindClusters, RunTSNE and RunUMAP functions, respectively. The annotations of cell identity on each cluster were defined by the expression of known marker genes, including: EPCAM, KRT19, KRT18, CDH1 for epithelial cells; CD3D, CD3E, CD3G for T cells; TRAC, LYZ, MARCO, CD68, FCGR3A for myeloid cells; CD79A for B cells; DCN, THY1, COL1A1, COL1A2 for fibroblasts; PECAM1, FLT1 for endothelial cells; KIT, MS4A2, GATA2 for mast cells; NKG7, NCAM1, KLRD1 for NK cells.
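The quality-control thresholds in this description (mitochondrial fraction ≤20%, 200 to 6000 detected genes per cell, genes kept only if detected in at least 3 cells) can be sketched in plain Python. This is an illustration only, not the Seurat code used for the dataset: the function names and the nested-dict count layout are hypothetical, and treating the 200 and 6000 bounds as inclusive is an assumption.

```python
def passes_cell_qc(n_genes, pct_mito,
                   min_genes=200, max_genes=6000, max_pct_mito=20.0):
    """True if a cell meets the gene-count and mitochondrial thresholds.

    Bounds are treated as inclusive (assumption).
    """
    return min_genes <= n_genes <= max_genes and pct_mito <= max_pct_mito


def genes_detected_in_min_cells(counts, min_cells=3):
    """Return the set of genes with a non-zero count in >= min_cells cells.

    counts maps cell barcode -> {gene name: UMI count}.
    """
    n_cells_with_gene = {}
    for gene_counts in counts.values():
        for gene, umi in gene_counts.items():
            if umi > 0:
                n_cells_with_gene[gene] = n_cells_with_gene.get(gene, 0) + 1
    return {g for g, n in n_cells_with_gene.items() if n >= min_cells}


# Hypothetical toy matrix: three cells, two genes
toy_counts = {
    "cell1": {"g1": 1, "g2": 0},
    "cell2": {"g1": 2, "g2": 1},
    "cell3": {"g1": 1},
}
kept_genes = genes_detected_in_min_cells(toy_counts)
```

In Seurat itself, the gene-level cutoff corresponds to the `min.cells` argument mentioned in the description, while the cell-level thresholds are typically applied by subsetting on per-cell metadata.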