37 datasets found

Table_1_A t-SNE Based Classification Approach to Compositional Microbiome...
datasetcatalog.nlm.nih.gov
frontiersin.figshare.com
Updated Dec 14, 2020
Cite
Yang, Zhenyu; Xie, Zhongming; Li, Dongfang; Xu, Ximing; Xu, Xueli (2020). Table_1_A t-SNE Based Classification Approach to Compositional Microbiome Data.DOCX [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000506264
Dataset updated
Dec 14, 2020
Authors
Yang, Zhenyu; Xie, Zhongming; Li, Dongfang; Xu, Ximing; Xu, Xueli
Description
As a data-driven dimensionality reduction and visualization tool, t-distributed stochastic neighborhood embedding (t-SNE) has been successfully applied to a variety of fields. In recent years, it has also received increasing attention for classification and regression analysis. This study presented a t-SNE based classification approach for compositional microbiome data, which enabled us to build classifiers and classify new samples in the reduced dimensional space produced by t-SNE. The Aitchison distance was employed to modify the conditional probabilities in t-SNE to account for the compositionality of microbiome data. To classify a new sample, its low-dimensional features were obtained as the weighted mean vector of its nearest neighbors in the training set. Using the low-dimensional features as input, three commonly used machine learning algorithms, logistic regression (LR), support vector machine (SVM), and decision tree (DT) were considered for classification tasks in this study. The proposed approach was applied to two disease-associated microbiome datasets, achieving better classification performance compared with the classifiers built in the original high-dimensional space. The analytic results also showed that t-SNE with Aitchison distance led to improvement of classification accuracy in both datasets. In conclusion, we have developed a t-SNE based classification approach that is suitable for compositional microbiome data and may also serve as a baseline for more complex classification models.
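The two ingredients the abstract names, the Aitchison distance and placing a new sample at the weighted mean of its nearest training neighbors, can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the pseudocount, the inverse-distance weights, the value of k, and the toy data are all assumptions, and the training embedding is a hand-written stand-in for real t-SNE output.

```python
import math

def clr(composition, pseudocount=1e-6):
    """Centered log-ratio transform of one compositional sample.

    The pseudocount (an assumption) guards against log(0) for zero counts.
    """
    logs = [math.log(x + pseudocount) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def aitchison_distance(a, b):
    """Euclidean distance between the CLR transforms of two compositions."""
    return math.dist(clr(a), clr(b))

def embed_new_sample(new_sample, train_samples, train_embedding, k=3):
    """Place a new sample at the weighted mean of its k nearest training points."""
    dists = [aitchison_distance(new_sample, s) for s in train_samples]
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    # Inverse-distance weights are an assumption; the epsilon avoids division by zero.
    weights = [1.0 / (dists[i] + 1e-12) for i in nearest]
    total = sum(weights)
    dim = len(train_embedding[0])
    return [
        sum(w * train_embedding[i][d] for w, i in zip(weights, nearest)) / total
        for d in range(dim)
    ]

# Toy compositions (rows sum to 1) and a stand-in 2-D embedding for them.
train = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7], [0.1, 0.3, 0.6]]
embedding = [[-5.0, 0.0], [-4.0, 1.0], [5.0, 0.0], [4.0, -1.0]]
new_point = embed_new_sample([0.65, 0.25, 0.10], train, embedding, k=2)
print(new_point)
```

The resulting low-dimensional coordinates could then be passed to any classifier trained on the embedded training set, as the abstract describes for LR, SVM, and DT.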
t-SNE for Data Visualization in Structural Engineering - Dataset - LDM
service.tib.eu
resodate.org
Updated Dec 2, 2024
Cite
(2024). t-SNE for Data Visualization in Structural Engineering - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/t-sne-for-data-visualization-in-structural-engineering
Dataset updated
Dec 2, 2024
Description
The dataset is a binary data set stemming from computational models of earthquake ground motions in structural engineering.
Additional file 5 of GECO: gene expression clustering optimization app for...
springernature.figshare.com
txt
Updated May 31, 2023
Cite
A. N. Habowski; T. J. Habowski; M. L. Waterman (2023). Additional file 5 of GECO: gene expression clustering optimization app for non-linear data visualization of patterns [Dataset]. http://doi.org/10.6084/m9.figshare.13642382.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.13642382.v1
Dataset updated
May 31, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
A. N. Habowski; T. J. Habowski; M. L. Waterman
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Additional file 5: CSV file of bulk RNA-seq data of F. nucleatum infection time course used for GECO UMAP generation.
Data from: Visualizing histopathologic deep learning classification and...
data.niaid.nih.gov
zenodo.org
Updated Jan 24, 2020
Cite
Faust, Kevin; Xie, Quin; Han, Dominick; Goyle, Kartikay; Volynskaya, Zoya; Djuric, Ugljesa; Diamandis, Phedias (2020). Visualizing histopathologic deep learning classification and anomaly detection using nonlinear feature space dimensionality reduction [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_1237975
Dataset updated
Jan 24, 2020
Dataset provided by
University Health Network
University of Toronto
Authors
Faust, Kevin; Xie, Quin; Han, Dominick; Goyle, Kartikay; Volynskaya, Zoya; Djuric, Ugljesa; Diamandis, Phedias
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Training image dataset used in the manuscript "Visualizing histopathologic deep learning classification and anomaly detection using nonlinear feature space dimensionality reduction"
Cluster tendency assessment in neuronal spike data
plos.figshare.com
pdf
Updated Jun 5, 2023
Cite
Sara Mahallati; James C. Bezdek; Milos R. Popovic; Taufik A. Valiante (2023). Cluster tendency assessment in neuronal spike data [Dataset]. http://doi.org/10.1371/journal.pone.0224547
Available download formats: pdf
Unique identifier
https://doi.org/10.1371/journal.pone.0224547
Dataset updated
Jun 5, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Sara Mahallati; James C. Bezdek; Milos R. Popovic; Taufik A. Valiante
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Sorting spikes from extracellular recording into clusters associated with distinct single units (putative neurons) is a fundamental step in analyzing neuronal populations. Such spike sorting is intrinsically unsupervised, as the number of neurons is not known a priori. Therefore, any spike sorting is an unsupervised learning problem that requires either of the two approaches: specification of a fixed value k for the number of clusters to seek, or generation of candidate partitions for several possible values of c, followed by selection of a best candidate based on various post-clustering validation criteria. In this paper, we investigate the first approach and evaluate the utility of several methods for providing lower dimensional visualization of the cluster structure and on subsequent spike clustering. We also introduce a visualization technique called improved visual assessment of cluster tendency (iVAT) to estimate possible cluster structures in data without the need for dimensionality reduction. Experiments are conducted on two datasets with ground truth labels. In data with a relatively small number of clusters, iVAT is beneficial in estimating the number of clusters to inform the initialization of clustering algorithms. With larger numbers of clusters, iVAT gives a useful estimate of the coarse cluster structure but sometimes fails to indicate the presumptive number of clusters. We show that noise associated with recording extracellular neuronal potentials can disrupt computational clustering schemes, highlighting the benefit of probabilistic clustering models. Our results show that t-Distributed Stochastic Neighbor Embedding (t-SNE) provides representations of the data that yield more accurate visualization of potential cluster structure to inform the clustering stage. Moreover, the clusters obtained using t-SNE features were more reliable than the clusters obtained using the other methods, which indicates that t-SNE can potentially be used for both visualization and to extract features to be used by any clustering algorithm.
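The idea behind visual assessment of cluster tendency can be illustrated with a small sketch: reorder a dissimilarity matrix so that similar points sit next to each other, after replacing each distance with a path-based (minimax) distance. This is a simplified illustration in the spirit of iVAT and not the authors' implementation; the function names and the toy points are assumptions, and the minimax step uses a plain cubic-time triple loop rather than any optimized recursion.

```python
import math

def vat_order(D):
    """Return a VAT-style ordering for a symmetric dissimilarity matrix D."""
    n = len(D)
    # Start from one endpoint of the largest dissimilarity.
    start = max(range(n), key=lambda i: max(D[i]))
    order = [start]
    remaining = [i for i in range(n) if i != start]
    while remaining:
        # Prim-like step: add the unvisited point closest to any visited point.
        nxt = min(remaining, key=lambda j: min(D[i][j] for i in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def minimax_distances(D):
    """Replace each distance by the smallest achievable 'largest hop' over all paths."""
    n = len(D)
    M = [row[:] for row in D]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                M[i][j] = min(M[i][j], max(M[i][k], M[k][j]))
    return M

# Toy data: two well-separated groups, given in interleaved order.
points = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.1, 0.3)]
D = [[math.dist(p, q) for q in points] for p in points]
order = vat_order(minimax_distances(D))
print(order)
```

Displaying the reordered matrix as an image would show dark blocks along the diagonal, one per candidate cluster, which is how such a plot suggests a number of clusters.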
Additional file 4 of GECO: gene expression clustering optimization app for...
springernature.figshare.com
txt
Updated May 31, 2023
Cite
A. N. Habowski; T. J. Habowski; M. L. Waterman (2023). Additional file 4 of GECO: gene expression clustering optimization app for non-linear data visualization of patterns [Dataset]. http://doi.org/10.6084/m9.figshare.13642379.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.13642379.v1
Dataset updated
May 31, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
A. N. Habowski; T. J. Habowski; M. L. Waterman
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Additional file 4: CSV file of colon crypt bulk RNA-seq data used for GECO UMAP generation.

🌆 City Lifestyle Segmentation Dataset

kaggle.com

zip

Updated Nov 15, 2025

Cite

UmutUygurr (2025). 🌆 City Lifestyle Segmentation Dataset [Dataset]. https://www.kaggle.com/datasets/umuttuygurr/city-lifestyle-segmentation-dataset

Available download formats: zip (11274 bytes)

Dataset updated

Nov 15, 2025

Authors

UmutUygurr

License

CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

Description

🌆 About This Dataset

This synthetic dataset simulates 300 global cities across 6 major geographic regions, designed specifically for unsupervised machine learning and clustering analysis. It explores how economic status, environmental quality, infrastructure, and digital access shape urban lifestyles worldwide.

🎯 Perfect For:

📊 K-Means, DBSCAN, Agglomerative Clustering
🔬 PCA & t-SNE Dimensionality Reduction
🗺️ Geospatial Visualization (Plotly, Folium)
📈 Correlation Analysis & Feature Engineering
🎓 Educational Projects (Beginner to Intermediate)

📦 What's Inside?

Feature	Description	Range
10 Features	Economic, environmental & social indicators	Realistically scaled
300 Cities	Europe, Asia, Americas, Africa, Oceania	Diverse distributions
Strong Correlations	Income ↔ Rent (+0.8), Density ↔ Pollution (+0.6)	ML-ready
No Missing Values	Clean, preprocessed data	Ready for analysis
4-5 Natural Clusters	Metropolitan hubs, eco-towns, developing centers	Pre-validated

🔥 Key Features

✅ Realistic Correlations: Income strongly predicts rent (+0.8), internet access (+0.7), and happiness (+0.6)
✅ Regional Diversity: Each region has distinct economic and environmental characteristics
✅ Clustering-Ready: Naturally separable into 4-5 lifestyle archetypes
✅ Beginner-Friendly: No data cleaning required, includes example code
✅ Documented: Comprehensive README with methodology and use cases

🚀 Quick Start Example

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load and prepare
df = pd.read_csv('city_lifestyle_dataset.csv')
X = df.drop(['city_name', 'country'], axis=1)
X_scaled = StandardScaler().fit_transform(X)

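# Cluster (n_init is set explicitly so results do not depend on the scikit-learn default)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)

# Analyze (numeric_only=True skips the text columns city_name and country)
print(df.groupby('cluster').mean(numeric_only=True))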

🎓 Learning Outcomes

After working with this dataset, you will be able to:
1. Apply K-Means, DBSCAN, and Hierarchical Clustering
2. Use PCA for dimensionality reduction and visualization
3. Interpret correlation matrices and feature relationships
4. Create geographic visualizations with cluster assignments
5. Profile and name discovered clusters based on characteristics
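For the correlation outcome in particular, a minimal sketch of a Pearson correlation matrix is given here using only the standard library. The column names and values are made up for illustration and are not taken from the dataset; with the real CSV, `df.corr(numeric_only=True)` in pandas would give the same kind of table.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a {name: values} mapping."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical columns and values, for illustration only.
toy = {
    "income": [20, 35, 50, 65, 80],
    "rent": [6, 9, 16, 18, 26],
    "density": [9, 4, 7, 3, 6],
}
corr = correlation_matrix(toy)
for name, row in corr.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Values close to +1 or -1 indicate features that carry overlapping information, which is worth knowing before clustering because strongly correlated features effectively get counted twice in distance calculations.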

📚 Ideal For These Projects

🏆 Kaggle Competitions: Practice clustering techniques
📝 Academic Projects: Urban planning, sociology, environmental science
💼 Portfolio Work: Showcase ML skills to employers
🎓 Learning: Hands-on practice with unsupervised learning
🔬 Research: Urban lifestyle segmentation studies

🌍 Expected Clusters

Cluster	Characteristics	Example Cities
Metropolitan Tech Hubs	High income, density, rent	Silicon Valley, Singapore
Eco-Friendly Towns	Low density, clean air, high happiness	Nordic cities
Developing Centers	Mid income, high density, poor air	Emerging markets
Low-Income Suburban	Low infrastructure, income	Rural areas
Industrial Mega-Cities	Very high density, pollution	Manufacturing hubs

🛠️ Technical Details

Format: CSV (UTF-8)
Size: ~300 rows × 10 columns
Missing Values: 0%
Data Types: 2 categorical, 8 numerical
Target Variable: None (unsupervised)
Correlation Strength: Pre-validated (r: 0.4 to 0.8)

📖 What Makes This Dataset Special?

Unlike random synthetic data, this dataset was carefully engineered with:
✨ Realistic correlation structures based on urban research
🌍 Regional characteristics matching real-world patterns
🎯 Optimal cluster separability (validated via silhouette scores)
📚 Comprehensive documentation and starter code

🏅 Use This Dataset If You Want To:

✓ Learn clustering without data cleaning hassles
✓ Practice PCA and dimensionality reduction
✓ Create beautiful geographic visualizations
✓ Understand feature correlation in real-world contexts
✓ Build a portfolio project with clear business insights

📊 Acknowledgments

This dataset was designed for educational purposes in machine learning and data science. While synthetic, it reflects real patterns observed in global urban development research.

Happy Clustering! 🎉

umap-learn
kaggle.com
zip
Updated Oct 19, 2025
Cite
HyeongChan Kim (2025). umap-learn [Dataset]. https://www.kaggle.com/kozistr/umaplearn
Available download formats: zip (46934808 bytes)
Dataset updated
Oct 19, 2025
Authors
HyeongChan Kim
Description
UMAP

Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualization similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data:

1. The data is uniformly distributed on a Riemannian manifold;
2. The Riemannian metric is locally constant (or can be approximated as such);
3. The manifold is locally connected.

From these assumptions, it is possible to model the manifold with a fuzzy topological structure. The embedding is found by searching for a low dimensional projection of the data that has the closest possible equivalent fuzzy topological structure.

The details for the underlying mathematics can be found in our paper on ArXiv:

McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018
Collected dimension and attribute.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Nov 2, 2023
Cite
Cheng, Ching-Hsue; Tsai, Ming-Chi; Chang, Yuan-Shao (2023). Collected dimension and attribute. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000941483
Dataset updated
Nov 2, 2023
Authors
Cheng, Ching-Hsue; Tsai, Ming-Chi; Chang, Yuan-Shao
Description
The hotel industry is essential for tourism. With the rapid expansion of the internet, consumers simply search for their desired keywords on a website when they are trying to find a hotel to stay at, and the relevant hotel information appears. To quickly respond to the changing market and consumer habits, each hotel must focus on its website information and information quality. This study proposes a novel methodology that uses rough set theory (RST), principal component analysis, t-Distributed Stochastic Neighbor Embedding (t-SNE), and attribute performance visualization to explore the relationship between hotel star ratings and hotel website information quality. The collected data are based on the star-rated hotels of the Taiwanstay website, and the checklists of hotel website services are used to obtain the relevant attribute data. The results show that there are significant differences in information quality between hotels below two stars and those above four stars. The information quality provided by the higher-star hotels was more detailed than that offered by low-star hotels. Based on the attribute performance matrix, the one-star and two-star hotels have advantage attributes in their landscape, reply time, restaurant information, social media, and compensation. Furthermore, the three- to five-star hotels have advantage attributes in their operational support, compensation, restaurant information, traffic information, and room information. These results could be provided to the stakeholders as a reference.
The results of different classifiers.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Nov 2, 2023
Cite
Tsai, Ming-Chi; Chang, Yuan-Shao; Cheng, Ching-Hsue (2023). The results of different classifiers. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000941519
Dataset updated
Nov 2, 2023
Authors
Tsai, Ming-Chi; Chang, Yuan-Shao; Cheng, Ching-Hsue
Description
The hotel industry is essential for tourism. With the rapid expansion of the internet, consumers simply search for their desired keywords on a website when they are trying to find a hotel to stay at, and the relevant hotel information appears. To quickly respond to the changing market and consumer habits, each hotel must focus on its website information and information quality. This study proposes a novel methodology that uses rough set theory (RST), principal component analysis, t-Distributed Stochastic Neighbor Embedding (t-SNE), and attribute performance visualization to explore the relationship between hotel star ratings and hotel website information quality. The collected data are based on the star-rated hotels of the Taiwanstay website, and the checklists of hotel website services are used to obtain the relevant attribute data. The results show that there are significant differences in information quality between hotels below two stars and those above four stars. The information quality provided by the higher-star hotels was more detailed than that offered by low-star hotels. Based on the attribute performance matrix, the one-star and two-star hotels have advantage attributes in their landscape, reply time, restaurant information, social media, and compensation. Furthermore, the three- to five-star hotels have advantage attributes in their operational support, compensation, restaurant information, traffic information, and room information. These results could be provided to the stakeholders as a reference.
OsterlundJBC_Figure 8
search.dataone.org
borealisdata.ca
Updated Dec 28, 2023
Cite
Osterlund, Elizabeth (2023). OsterlundJBC_Figure 8 [Dataset]. http://doi.org/10.5683/SP3/A7TRJF
Unique identifier
https://doi.org/10.5683/SP3/A7TRJF
Dataset updated
Dec 28, 2023
Dataset provided by
Borealis
Authors
Osterlund, Elizabeth
Description
Figure 8 data includes:
- Panel A: Visualization of multidimensional Red channel MCF-7 image data after dimension reduction using the t-SNE algorithm. Each point represents 1 of the 3000 cells randomly selected from the 143,211 total cells for each of the 4 landmarks (listed in legend). For visualization, the 300 cells closest to the centroid were displayed. See Supplementary Fig7C for more data.
- Panel B: Confusion matrix, plotting Predicted versus True (known localization of the red channel landmark) labels for the validation set of Red channel single cell images, which were not included in the initial training step. Data was normalized to the total number of cells from the true label of each landmark. See Supplementary Fig7D for original data.
- Panels C-D: Original pie charts and the number of cells analyzed per transfection for each cell line (per pie chart), exported from Python. "PredictedLabel" indicates the label predicted by the classifier. "TrueLabel" indicates the ground truth label (known from the Platemap for each landmark in the Red Channel).
The single-cell image dataset was too large to upload. Contact Elizabeth.osterlund@gmail.com, if desired.
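The normalization described for Panel B, dividing each count by the number of cells with that true label, can be sketched as follows. This is an illustrative sketch rather than the code used for the figure; the landmark names and label lists are hypothetical.

```python
from collections import Counter

def normalized_confusion(true_labels, predicted_labels):
    """Confusion matrix whose rows (true labels) each sum to 1."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    counts = Counter(zip(true_labels, predicted_labels))
    totals = Counter(true_labels)
    matrix = {}
    for t in labels:
        # A label that never occurs as a true label gets an all-zero row.
        matrix[t] = {
            p: (counts[(t, p)] / totals[t] if totals[t] else 0.0) for p in labels
        }
    return matrix

# Hypothetical landmark names and classifier outputs, for illustration only.
true = ["landmark_A"] * 4 + ["landmark_B"] * 2
pred = ["landmark_A", "landmark_A", "landmark_A", "landmark_B",
        "landmark_B", "landmark_A"]
matrix = normalized_confusion(true, pred)
for t, row in matrix.items():
    print(t, row)
```

Normalizing by the true-label totals makes the diagonal read directly as per-class recall, so classes with very different cell counts can be compared on the same scale.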
wikipos
huggingface.co
Updated Sep 15, 2025
Cite
Philip Gerdes (2025). wikipos [Dataset]. https://huggingface.co/datasets/whatphiliptrains/wikipos
Dataset updated
Sep 15, 2025
Authors
Philip Gerdes
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Dataset Card for WikiPos

Dataset Summary

WikiPos is a processed version of the Wikimedia Wikipedia dataset that includes 2D spatial coordinates generated through dimensionality reduction techniques. The dataset contains Wikipedia articles with their original text content plus x,y coordinates derived from sentence embeddings using UMAP and t-SNE algorithms. The dataset enables spatial visualization and exploration of Wikipedia content, allowing researchers to analyze… See the full description on the dataset page: https://huggingface.co/datasets/whatphiliptrains/wikipos.
🔢🖊️ Digital Recognition: MNIST Dataset
kaggle.com
zip
Updated Nov 13, 2025
Cite
Wasiq Ali (2025). 🔢🖊️ Digital Recognition: MNIST Dataset [Dataset]. https://www.kaggle.com/datasets/wasiqaliyasir/digital-mnist-dataset
Available download formats: zip (2278207 bytes)
Dataset updated
Nov 13, 2025
Authors
Wasiq Ali
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Handwritten Digits Pixel Dataset - Documentation

Overview

The Handwritten Digits Pixel Dataset is a collection of numerical data representing handwritten digits from 0 to 9. Unlike image datasets that store actual image files, this dataset contains pixel intensity values arranged in a structured tabular format, making it ideal for machine learning and data analysis applications.

Dataset Description

Basic Information

Format: CSV (Comma-Separated Values)

Total Samples: [Number of rows based on your dataset]

Features: 784 pixel columns (28×28 pixels) + 1 label column

Label Range: Digits 0-9

Pixel Value Range: 0-255 (grayscale intensity)

File Structure

Column Description

label: The target variable representing the digit (0-9)

pixel columns: 784 columns named in the format [row]x[column]

Each pixel column contains integer values from 0-255 representing grayscale intensity

Data Characteristics

Label Distribution

The dataset contains handwritten digit samples with the following distribution:

Digit 0: [X] samples

Digit 1: [X] samples

Digit 2: [X] samples

Digit 3: [X] samples

Digit 4: [X] samples

Digit 5: [X] samples

Digit 6: [X] samples

Digit 7: [X] samples

Digit 8: [X] samples

Digit 9: [X] samples

(Note: Actual distribution counts would be calculated from your specific dataset)

Data Quality

Missing Values: No missing values detected

Data Type: All values are integers

Normalization: Pixel values range from 0-255 (can be normalized to 0-1 for ML models)

Consistency: Uniform 28×28 grid structure across all samples

Technical Specifications

Data Preprocessing Requirements

Normalization: Scale pixel values from 0-255 to 0-1 range

Reshaping: Convert 1D pixel arrays to 2D 28×28 matrices for visualization

Train-Test Split: Recommended 80-20 or 70-30 split for model development

Recommended Machine Learning Approaches

Classification Algorithms:

Random Forest

Support Vector Machines (SVM)

Neural Networks

K-Nearest Neighbors (KNN)

Deep Learning Architectures:

Convolutional Neural Networks (CNNs)

Multi-layer Perceptrons (MLPs)

Dimensionality Reduction:

PCA (Principal Component Analysis)

t-SNE for visualization

Usage Examples

Loading the Dataset

import pandas as pd

# Load the dataset
df = pd.read_csv('/kaggle/input/handwritten_digits_pixel_dataset/mnist.csv')

# Separate features and labels
X = df.drop('label', axis=1)
y = df['label']

# Normalize pixel values
X_normalized = X / 255.0
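The reshaping and train-test split steps listed under the preprocessing requirements can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: the 784 pixel values are taken to be stored in row-major order, and the helper names, the seed, and the sample row are hypothetical. With the DataFrame from the loading example, each row of `X` converted to a list could be passed to `to_grid`.

```python
import random

def to_grid(pixels, side=28):
    """Turn a flat list of 0-255 values into a side x side grid scaled to 0-1.

    Assumes row-major order, i.e. the first `side` values form the top row.
    """
    if len(pixels) != side * side:
        raise ValueError(f"expected {side * side} values, got {len(pixels)}")
    return [[pixels[r * side + c] / 255.0 for c in range(side)] for r in range(side)]

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into (train, test) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

# A made-up sample: all black except one fully white pixel at row 1, column 3.
sample = [0] * 784
sample[1 * 28 + 3] = 255
grid = to_grid(sample)

train_idx, test_idx = train_test_split_indices(100, test_fraction=0.2)
print(len(grid), len(grid[0]), len(train_idx), len(test_idx))
```

Splitting by index rather than by copying rows keeps features and labels aligned, since the same index lists can be applied to both.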
Data_Sheet_1_Manifold learning for fMRI time-varying functional...
frontiersin.figshare.com
docx
Updated Jul 11, 2023
Cite
Javier Gonzalez-Castillo; Isabel S. Fernandez; Ka Chun Lam; Daniel A. Handwerker; Francisco Pereira; Peter A. Bandettini (2023). Data_Sheet_1_Manifold learning for fMRI time-varying functional connectivity.docx [Dataset]. http://doi.org/10.3389/fnhum.2023.1134012.s001
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fnhum.2023.1134012.s001
Dataset updated
Jul 11, 2023
Dataset provided by
Frontiers Media (http://www.frontiersin.org/)
Authors
Javier Gonzalez-Castillo; Isabel S. Fernandez; Ka Chun Lam; Daniel A. Handwerker; Francisco Pereira; Peter A. Bandettini
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Whole-brain functional connectivity (FC) measured with functional MRI (fMRI) evolves over time in meaningful ways at temporal scales going from years (e.g., development) to seconds [e.g., within-scan time-varying FC (tvFC)]. Yet, our ability to explore tvFC is severely constrained by its large dimensionality (several thousands). To overcome this difficulty, researchers often seek to generate low dimensional representations (e.g., 2D and 3D scatter plots) hoping those will retain important aspects of the data (e.g., relationships to behavior and disease progression). Limited prior empirical work suggests that manifold learning techniques (MLTs)—namely those seeking to infer a low dimensional non-linear surface (i.e., the manifold) where most of the data lies—are good candidates for accomplishing this task. Here we explore this possibility in detail. First, we discuss why one should expect tvFC data to lie on a low dimensional manifold. Second, we estimate what is the intrinsic dimension (ID; i.e., minimum number of latent dimensions) of tvFC data manifolds. Third, we describe the inner workings of three state-of-the-art MLTs: Laplacian Eigenmaps (LEs), T-distributed Stochastic Neighbor Embedding (T-SNE), and Uniform Manifold Approximation and Projection (UMAP). For each method, we empirically evaluate its ability to generate neuro-biologically meaningful representations of tvFC data, as well as their robustness against hyper-parameter selection. Our results show that tvFC data has an ID that ranges between 4 and 26, and that ID varies significantly between rest and task states. We also show how all three methods can effectively capture subject identity and task being performed: UMAP and T-SNE can capture these two levels of detail concurrently, but LE could only capture one at a time. We observed substantial variability in embedding quality across MLTs, and within-MLT as a function of hyper-parameter selection. 
To help alleviate this issue, we provide heuristics that can inform future studies. Finally, we also demonstrate the importance of feature normalization when combining data across subjects and the role that temporal autocorrelation plays in the application of MLTs to tvFC data. Overall, we conclude that while MLTs can be useful to generate summary views of labeled tvFC data, their application to unlabeled data such as resting-state remains challenging.
Replication Data for: Continuous Distributed Representation of Biological...
search.dataone.org
dataverse.harvard.edu
Updated Nov 21, 2023
Cite
Asgari, Ehsaneddin (2023). Replication Data for: Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics [Dataset]. http://doi.org/10.7910/DVN/JMFHTN
Unique identifier
https://doi.org/10.7910/DVN/JMFHTN
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
Asgari, Ehsaneddin
Description
Users should cite: Asgari E, Mofrad MRK. Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS ONE 10(11): e0141287. doi:10.1371/journal.pone.0141287. This archive also contains the family classification data that we used in the above mentioned PLoS ONE paper. This data can be used as a benchmark for family classification task.
Cosmetics datasets
kaggle.com
zip
Updated Dec 16, 2020
Cite
Abid Ali Awan (2020). Cosmetics datasets [Dataset]. https://www.kaggle.com/kingabzpro/cosmetics-datasets
Available download formats: zip (269637 bytes)
Dataset updated
Dec 16, 2020
Authors
Abid Ali Awan
License
GNU General Public License v2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Description
Context

Whenever I want to try a new cosmetic item, it's so difficult to choose. It's actually more than difficult. It's sometimes scary because new items that I've never tried end up giving me skin trouble. We know the information we need is on the back of each product, but it's really hard to interpret those ingredient lists unless you're a chemist. You may be able to relate to this situation.

Content

We are going to create a content-based recommendation system where the 'content' will be the chemical components of cosmetics. Specifically, we will process ingredient lists for 1472 cosmetics on Sephora via word embedding, then visualize ingredient similarity using a machine learning method called t-SNE and an interactive visualization library called Bokeh. Let's inspect our data first.
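To make the content-based idea concrete, here is a simplified sketch that compares products by their ingredient lists. It swaps in plain binary ingredient sets with cosine similarity in place of the word embedding and t-SNE steps mentioned above, so it illustrates the principle rather than the described pipeline. The product names and ingredient strings are hypothetical.

```python
import math

def tokenize(ingredient_list):
    """Split a comma-separated ingredient string into a set of lowercase names."""
    return {item.strip().lower() for item in ingredient_list.split(",") if item.strip()}

def cosine_similarity(a, b):
    """Cosine similarity between two ingredient sets treated as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def most_similar(name, products):
    """Return the other product whose ingredients are closest to `name`."""
    target = tokenize(products[name])
    scores = {
        other: cosine_similarity(target, tokenize(text))
        for other, text in products.items()
        if other != name
    }
    return max(scores, key=scores.get)

# Hypothetical products, for illustration only.
products = {
    "cream_a": "Water, Glycerin, Niacinamide",
    "cream_b": "Water, Glycerin, Squalane",
    "oil_c": "Squalane, Jojoba Oil",
}
print(most_similar("cream_a", products))
```

A recommendation for someone who tolerated one product well would then be the nearest neighbors of that product under this similarity.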

Acknowledgements

DataCamp
Reduct set of two classes with 23 attributes.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Nov 2, 2023
Cite
Tsai, Ming-Chi; Cheng, Ching-Hsue; Chang, Yuan-Shao (2023). Reduct set of two classes with 23 attributes. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000941523
Dataset updated
Nov 2, 2023
Authors
Tsai, Ming-Chi; Cheng, Ching-Hsue; Chang, Yuan-Shao
Description
The hotel industry is essential for tourism. With the rapid expansion of the internet, consumers simply search for their desired keywords on a website when they are trying to find a hotel to stay at, and the relevant hotel information appears. To quickly respond to the changing market and consumer habits, each hotel must focus on its website information and information quality. This study proposes a novel methodology that uses rough set theory (RST), principal component analysis, t-Distributed Stochastic Neighbor Embedding (t-SNE), and attribute performance visualization to explore the relationship between hotel star ratings and hotel website information quality. The collected data are based on the star-rated hotels of the Taiwanstay website, and the checklists of hotel website services are used to obtain the relevant attribute data. The results show that there are significant differences in information quality between hotels below two stars and those above four stars. The information quality provided by the higher-star hotels was more detailed than that offered by low-star hotels. Based on the attribute performance matrix, the one-star and two-star hotels have advantage attributes in their landscape, reply time, restaurant information, social media, and compensation. Furthermore, the three- to five-star hotels have advantage attributes in their operational support, compensation, restaurant information, traffic information, and room information. These results could be provided to the stakeholders as a reference.
AI Developer Performance Dataset
kaggle.com
zip
Updated May 27, 2025
Cite
Shahzad Aslam (2025). AI Developer Performance Dataset [Dataset]. https://www.kaggle.com/datasets/zeesolver/ai-developer-dataset
Available download formats: zip (5992 bytes)
Dataset updated
May 27, 2025
Authors
Shahzad Aslam
License
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

This dataset contains 500 records and 9 features related to the productivity of developers using AI tools. It appears to analyze how factors like working habits, caffeine intake, and AI usage affect developer performance.

Suggested Machine Learning Tasks

Binary classification (task_success)

Regression (e.g., predicting cognitive_load)

Clustering of work patterns

Correlation analysis & feature importance

Time series simulation & rolling averages (useful with synthetic date column)

Exploratory Data Analysis (EDA)

Anomaly detection (e.g., outliers in bugs_reported)

Multi-output regression (predicting commits and bugs_reported)

Dimensionality reduction (PCA or t-SNE for pattern visualization)

Decision rule extraction (e.g., tree-based rules for task_success)

🧠 Inspiration

Developers with balanced AI usage, sleep, and moderate coffee intake show higher task success. Overuse of AI or caffeine increases cognitive load, reducing effectiveness. Productivity thrives on smart work, not just hard work.

📊 Column Descriptions

hours_coding – Daily coding hours (float).

coffee_intake_mg – Daily caffeine intake in milligrams (integer).

distractions – Number of distractions experienced (integer).

sleep_hours – Average sleep hours per day (float).

commits – Number of code commits per day (integer).

bugs_reported – Number of bugs reported (integer).

ai_usage_hours – Daily AI tool usage hours (float).

cognitive_load – Measured cognitive load on a scale (float).

task_success – Binary variable indicating task completion success (1 = success, 0 = fail).
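
The column descriptions above can be read as a small typed schema. The following is a minimal sketch, assuming the CSV header uses exactly the nine names listed; the `DeveloperRecord` class, the helper functions, and the two sample rows are hypothetical and not taken from the dataset.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class DeveloperRecord:
    """One row of the dataset, typed as the column descriptions suggest."""
    hours_coding: float
    coffee_intake_mg: int
    distractions: int
    sleep_hours: float
    commits: int
    bugs_reported: int
    ai_usage_hours: float
    cognitive_load: float
    task_success: int  # 1 = success, 0 = fail


def parse_records(text: str) -> list[DeveloperRecord]:
    """Parse CSV text whose header matches the nine column names (assumed)."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append(
            DeveloperRecord(
                hours_coding=float(row["hours_coding"]),
                coffee_intake_mg=int(row["coffee_intake_mg"]),
                distractions=int(row["distractions"]),
                sleep_hours=float(row["sleep_hours"]),
                commits=int(row["commits"]),
                bugs_reported=int(row["bugs_reported"]),
                ai_usage_hours=float(row["ai_usage_hours"]),
                cognitive_load=float(row["cognitive_load"]),
                task_success=int(row["task_success"]),
            )
        )
    return records


def success_rate(records: list[DeveloperRecord]) -> float:
    """Fraction of records with task_success == 1."""
    return sum(r.task_success for r in records) / len(records)


# Two made-up rows for illustration only; they are not real dataset values.
SAMPLE_CSV = """hours_coding,coffee_intake_mg,distractions,sleep_hours,commits,bugs_reported,ai_usage_hours,cognitive_load,task_success
5.0,400,2,7.0,4,1,1.5,3.2,1
9.5,600,5,5.0,1,3,4.0,7.8,0
"""

records = parse_records(SAMPLE_CSV)
print(success_rate(records))
```

Parsing into typed records first makes a malformed value fail at load time rather than later in a classification or regression step.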
Additional file 1 of Integrative single-cell RNA-seq and ATAC-seq analysis...
datasetcatalog.nlm.nih.gov
Updated Apr 13, 2023
Wang, Xiaoyu; Lin, Zhuhu; Chen, Meilin; Tong, Xian; Li, Jianhao; Zhu, Qi; Duo, Tianqi; Li, Enru; Cai, Shufang; Liu, Tongni; Liu, Xiaohong; Xu, Rong; Mo, Delin; Chen, Yaosheng; Hu, Bin; Liang, Ziyun (2023). Additional file 1 of Integrative single-cell RNA-seq and ATAC-seq analysis of myogenic differentiation in pig [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001023879
Dataset updated
Apr 13, 2023
Authors
Wang, Xiaoyu; Lin, Zhuhu; Chen, Meilin; Tong, Xian; Li, Jianhao; Zhu, Qi; Duo, Tianqi; Li, Enru; Cai, Shufang; Liu, Tongni; Liu, Xiaohong; Xu, Rong; Mo, Delin; Chen, Yaosheng; Hu, Bin; Liang, Ziyun
Description
Additional file 1:

Figure S1. Quality control and batch effect correction in scRNA-Seq, related to Figure 1. A. Violin plots showing the number of expressed genes, the number of reads uniquely mapped against the reference genome, and the fraction of mitochondrial genes compared to all genes per cell in scRNA-Seq data. B. Box plot showing the number of genes (left) and the number of uniquely mapped reads (right) per cell in each identified cell type in scRNA-Seq data. C. tSNE plot visualization of the sample source for all 70,201 cells. Each dot is a cell. Different colors represent different samples. D. tSNE plot visualization of unsupervised clustering analysis for all 70,201 cells based on scRNA-Seq data after quality control, which gave rise to 31 distinct clusters.

Figure S2. Gene Ontology (GO) analysis of the DEGs for each cell type was performed and the representative enriched GO terms are presented, related to Figure 1.

Figure S3. Expression of selected marker genes along the differentiation trajectory, related to Figure 2. A. tSNE plot demonstrating cell cycle regression (left). Visualization of myogenic differentiation trajectory by cell cycle phases (G1, S, and G2/M) (right). B. Donut plots showing the percentages of cells in G1, S, and G2/M phase at different cell states. C. Expression levels of cell cycle-related genes in the myogenic cells organized into the Monocle trajectory. D. Expression levels of muscle-related genes in the myogenic cells organized into the Monocle trajectory.

Figure S4. Unsupervised clustering analysis for all cells in scATAC-Seq data and myogenic-specific scATAC-seq peaks, related to Figure 4. A-C. tSNE plot visualization of the sample source for all 48,514 cells in scATAC-Seq. Each dot is a cell. Different colors represent different pigs (A), different embryonic stages (B), or different samples (C). D. tSNE plot visualization of unsupervised clustering analysis for all 48,514 cells after quality control in scATAC-Seq data, which gave rise to 15 distinct clusters. E. tSNE plot visualization of myogenic cells and other cells. Clusters 4 and 8 in Figure S4D were annotated as myogenic cells due to their high levels of accessibility of marker genes associated with myogenic lineage. F. Genome browser view of myogenic-specific peaks at the TSS of MyoG and Myf5 for myogenic cells and other cells in the scATAC-seq dataset.

Figure S5. Percentage distribution of open chromatin elements in scATAC-Seq data, related to Figure 4. A. Distribution of open chromatin elements in each snATAC-seq sample. B. Distribution of open chromatin elements in snATAC-seq of myogenic cell types. C. Percentage distribution of open chromatin elements among DAPs in myogenic cell types.

Figure S6. Integrative analysis of transcription factors and target genes, related to Figure 5. A. tSNE depiction of regulon activity (“on-blue”, “off-gray”), TF gene expression (red scale), and expression of predicted target genes (purple scale) of MyoG, FOSB, and TCF12. B. Corresponding chromatin accessibility in scATAC data for TFs and predicted target genes are depicted.

Figure S7. Pseudotime-dependent chromatin accessibility and gene expression changes, related to Figure 7. The first column shows the dynamics of the 10× Genomics TF enrichment score. The second column shows the dynamics of TF gene expression values, and the third and fourth columns represent the dynamics of the SCENIC-reported target gene expression values of corresponding TFs, respectively.

Figure S8. Myogenesis-related gene expression in DMD (Duchenne muscular dystrophy) mice. Comparison of RNA-seq data of flexor digitorum brevis (FDB), extensor digitorum longus (EDL), and soleus (SOL) in DMD and wild-type mice at 2 and 5 months of age. A. The expression levels of myogenesis-related genes (Myod1, Myog, Myf5, Pax7). B. The expression levels of related genes that were upregulated during porcine embryonic myogenesis (EGR1, RHOB, KLF4, SOX8, NGFR, MAX, RBFOX2, ANXA6, HES6, RASSF4, PLS3, SPG21). C. The expression levels of related genes that were downregulated during porcine embryonic myogenesis (COX5A, HOMER2, BNIP3, CNCS). Data were obtained from the GEO database (GSE162455; WT, n = 4; DMD, n = 7).

Figure S9. Genome browser view of differentially accessible peaks at the TSS of EGR1 and RHOB between myogenic cells in the scATAC-seq dataset, related to Figure 8.

Figure S10. Functional analysis of EGR1 in myogenesis, related to Figure 8. A-B. EdU assays for the proliferation of pig primary myogenic cells (A) and C2C12 myoblasts (B) following EGR1 overexpression. C. qPCR analysis of the mRNA levels of cell cycle regulators in C2C12 cells following EGR1 overexpression. D. Immunofluorescence staining for MyHC in C2C12 cells following EGR1 overexpression and differentiation for 3 d. Then, the fusion index was calculated.

Figure S11. Functional analysis of RHOB in myogenesis, related to Figure 8. A-B. EdU assays for the proliferation of pig primary myogenic cells (A) and C2C12 myoblasts (B) following RHOB overexpression. C. qPCR analysis of the mRNA levels of cell cycle regulators in C2C12 cells following RHOB overexpression. D. Immunofluorescence staining for MyHC in C2C12 cells following RHOB overexpression and differentiation for 3 d. Then, the fusion index was calculated.
Data from: A Benchmark Dataset for Multilingual Tokenization Energy and...
observatorio-cientifico.ua.es
zenodo.org
Updated 2025
Quesada Granja, Carlos (2025). A Benchmark Dataset for Multilingual Tokenization Energy and Efficiency Across 23 Models and 325 Languages [Dataset]. https://observatorio-cientifico.ua.es/documentos/688b604617bb6239d2d4a92a
Dataset updated
2025
Authors
Quesada Granja, Carlos
Description
This repository contains a benchmark dataset and processing scripts for analyzing the energy consumption and processing efficiency of 23 Hugging Face tokenizers applied to 8,212 standardized text chunks across 325 languages.

The dataset includes raw and normalized energy measurements, tokenization times, token counts, and rich structural metadata for each chunk, including script composition, entropy, compression ratios, and character-level features. All experiments were conducted in a controlled environment using PyRAPL on a Linux workstation, with baseline CPU consumption subtracted via linear interpolation.
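
The baseline subtraction mentioned above can be sketched as follows. This is a minimal illustration, assuming baseline samples are aligned to chunk positions and that positions outside the sampled range are clamped to the nearest sample; the function names, the alignment variable, and all numbers are hypothetical, and the repository's R scripts may do this differently.

```python
from bisect import bisect_right


def interpolate(x, xs, ys):
    """Linearly interpolate y at x from sorted sample points (xs, ys).

    Values outside the sampled range are clamped to the nearest sample.
    """
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x)  # now xs[i - 1] <= x < xs[i]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def net_energy(raw, positions, base_positions, base_energy):
    """Subtract the interpolated baseline from each raw measurement."""
    return [
        energy - interpolate(pos, base_positions, base_energy)
        for energy, pos in zip(raw, positions)
    ]


# Made-up numbers: sparse baseline samples and four raw measurements.
base_positions = [0, 50, 100]
base_energy = [10.0, 12.0, 11.0]
positions = [0, 25, 50, 75]
raw = [30.0, 31.0, 32.0, 33.5]

net = net_energy(raw, positions, base_positions, base_energy)
print(net)
```

Interpolating between sparse baseline samples gives every measurement its own baseline estimate, so slow drift in background CPU consumption is not attributed to the tokenizer.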

The resulting data enables fine-grained comparison of tokenizers across linguistic and script diversity. It also supports downstream tasks such as energy-aware NLP, script-sensitive modeling, and benchmarking of future tokenization methods.

Accompanying R scripts are provided to reproduce data processing, regression models, and clustering analyses. Visualization outputs and cluster-level summaries are also included.

All contents are organized into clearly structured folders to support easy access, interpretability, and reuse by the research community.

01_processing_scripts/

R scripts to transform raw data, subtract baseline energy, and produce clean metrics.

multimodel_tokenization_energy.py ⤷ Python script used to tokenize all chunks with 23 models while logging energy and time.

adapting_original_dataset.R ⤷ Reads raw logs and metadata, computes net energy, and outputs cleaned files.

energy_patterns.R ⤷ Performs clustering, regression, t-SNE, and generates all visualizations.

02_raw_data/

Raw output from the tokenization experiment and baseline profiler.

all_models_tokenization.csv ⤷ Full log of 42M+ tokenization runs (23 models × 225 reps × 8,212 chunks).

baseline.csv ⤷ Background CPU energy samples, one per 50 chunks. Used for normalization.

03_clean_data/

Cleaned, enriched, and reshaped datasets ready for analysis.

net_energy.csv ⤷ Raw tokenization results after baseline energy subtraction (per run).

tokenization_long.csv ⤷ One row per chunk × tokenizer, with medians + token counts.

tokenization_wide.csv ⤷ Wide-format matrix: one row per chunk, one column per tokenizer × metric.

complete.csv ⤷ Fully enriched dataset joining all metrics, metadata, and script distributions.

metadata.csv ⤷ Structural features and script-based character stats per chunk.

04_cluster_outputs/

Outputs from clustering and dimensionality reduction over tokenizer energy profiles.

tokenizer_dendrogram.pdf ⤷ Hierarchical clustering of 23 tokenizers based on energy profiles.

tokenizer_tsne.pdf ⤷ t-SNE projection of tokenizers grouped by energy usage.

mean_energy_per_cluster.csv ⤷ Mean energy consumption (mJ) per language × tokenizer cluster.

sd_energy_per_cluster.csv ⤷ Standard deviation of energy consumption (mJ) per language × cluster.

grid.pdf ⤷ Heatmap of script-wise energy deltas (relative to Latin) for all tokenizers.
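
The relationship between the long and wide tables in 03_clean_data/ can be illustrated with a small pivot. This is only a sketch: the column names (`chunk_id`, `tokenizer`, `median_energy_mj`, `token_count`), the tokenizer labels, and the values are assumptions made for the example, not the actual headers of tokenization_long.csv or tokenization_wide.csv.

```python
from collections import defaultdict


def long_to_wide(rows, metrics):
    """Pivot one-row-per-(chunk, tokenizer) records into one row per chunk.

    Each output row gets one column per tokenizer x metric combination,
    named "<tokenizer>_<metric>".
    """
    wide = defaultdict(dict)
    for row in rows:
        for metric in metrics:
            column = f'{row["tokenizer"]}_{metric}'
            wide[row["chunk_id"]][column] = row[metric]
    return [{"chunk_id": cid, **cols} for cid, cols in sorted(wide.items())]


# Made-up long-format rows; names and values are hypothetical.
long_rows = [
    {"chunk_id": 1, "tokenizer": "tok_a", "median_energy_mj": 4.0, "token_count": 120},
    {"chunk_id": 1, "tokenizer": "tok_b", "median_energy_mj": 5.5, "token_count": 98},
    {"chunk_id": 2, "tokenizer": "tok_a", "median_energy_mj": 3.0, "token_count": 80},
]

wide_rows = long_to_wide(long_rows, ["median_energy_mj", "token_count"])
for row in wide_rows:
    print(row)
```

The wide layout suits clustering or t-SNE over per-chunk feature vectors, while the long layout is easier to group and model per tokenizer.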
