As a data-driven dimensionality reduction and visualization tool, t-distributed stochastic neighbor embedding (t-SNE) has been successfully applied to a variety of fields. In recent years, it has also received increasing attention for classification and regression analysis. This study presented a t-SNE based classification approach for compositional microbiome data, which enabled us to build classifiers and classify new samples in the reduced dimensional space produced by t-SNE. The Aitchison distance was employed to modify the conditional probabilities in t-SNE to account for the compositionality of microbiome data. To classify a new sample, its low-dimensional features were obtained as the weighted mean vector of its nearest neighbors in the training set. Using the low-dimensional features as input, three commonly used machine learning algorithms, logistic regression (LR), support vector machine (SVM), and decision tree (DT), were considered for classification tasks in this study. The proposed approach was applied to two disease-associated microbiome datasets, achieving better classification performance than classifiers built in the original high-dimensional space. The analytic results also showed that t-SNE with Aitchison distance improved classification accuracy in both datasets. In conclusion, we have developed a t-SNE based classification approach that is suitable for compositional microbiome data and may also serve as a baseline for more complex classification models.
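The two ingredients named in the abstract, the Aitchison distance and the placement of a new sample at the weighted mean of its nearest training neighbors, can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: the function names, the pseudocount used to guard against zero abundances, the choice of `k`, and the inverse-distance weights are all assumptions made for the example. It relies on the standard identity that the Aitchison distance equals the Euclidean distance between centered log-ratio (CLR) transformed compositions.

```python
import math


def clr(composition, pseudocount=1e-6):
    """Centered log-ratio transform (the pseudocount for zeros is an assumption)."""
    logs = [math.log(x + pseudocount) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison_distance(a, b):
    """Euclidean distance between the CLR-transformed compositions."""
    return math.dist(clr(a), clr(b))


def embed_new_sample(new_sample, train_samples, train_embedding, k=3):
    """Place a new sample at the weighted mean of its k nearest training points.

    Inverse-distance weighting is an assumption; the abstract only says
    "weighted mean vector of its nearest neighbors".
    """
    dists = [aitchison_distance(new_sample, s) for s in train_samples]
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    weights = [1.0 / (dists[i] + 1e-12) for i in nearest]
    total = sum(weights)
    dim = len(train_embedding[0])
    return [
        sum(w * train_embedding[i][d] for w, i in zip(weights, nearest)) / total
        for d in range(dim)
    ]


# Made-up compositions and a made-up 2D embedding, purely for illustration.
train_samples = [[0.2, 0.3, 0.5], [0.21, 0.29, 0.5], [0.7, 0.2, 0.1]]
train_embedding = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]]
embedded = embed_new_sample([0.2, 0.3, 0.5], train_samples, train_embedding, k=2)
scale_check = aitchison_distance([1, 2, 3], [2, 4, 6])  # same proportions
```

Because the distance depends only on ratios between components, rescaling a composition leaves it essentially unchanged, which is the property that makes it suitable for relative-abundance data. The embedded coordinates could then be passed to any classifier, such as the LR, SVM, or DT models mentioned above.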
The dataset is a binary data set stemming from computational models of earthquake ground motions in structural engineering.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 5: CSV file of bulk RNA-seq data of F. nucleatum infection time course used for GECO UMAP generation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Training image dataset used in the manuscript "Visualizing histopathologic deep learning classification and anomaly detection using nonlinear feature space dimensionality reduction"
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sorting spikes from extracellular recordings into clusters associated with distinct single units (putative neurons) is a fundamental step in analyzing neuronal populations. Such spike sorting is intrinsically unsupervised, as the number of neurons is not known a priori. Therefore, any spike sorting is an unsupervised learning problem that requires one of two approaches: specification of a fixed value c for the number of clusters to seek, or generation of candidate partitions for several possible values of c, followed by selection of the best candidate based on various post-clustering validation criteria. In this paper, we investigate the first approach and evaluate the utility of several methods for providing lower dimensional visualization of the cluster structure and for subsequent spike clustering. We also introduce a visualization technique called improved visual assessment of cluster tendency (iVAT) to estimate possible cluster structures in data without the need for dimensionality reduction. Experiments are conducted on two datasets with ground truth labels. In data with a relatively small number of clusters, iVAT is beneficial in estimating the number of clusters to inform the initialization of clustering algorithms. With larger numbers of clusters, iVAT gives a useful estimate of the coarse cluster structure but sometimes fails to indicate the presumptive number of clusters. We show that noise associated with recording extracellular neuronal potentials can disrupt computational clustering schemes, highlighting the benefit of probabilistic clustering models. Our results show that t-Distributed Stochastic Neighbor Embedding (t-SNE) provides representations of the data that yield more accurate visualization of potential cluster structure to inform the clustering stage. Moreover, the clusters obtained using t-SNE features were more reliable than the clusters obtained using the other methods, which indicates that t-SNE can potentially be used both for visualization and to extract features to be used by any clustering algorithm.
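To give a feel for how a VAT-style method exposes cluster structure without dimensionality reduction, the sketch below follows the commonly described recipe: replace each pairwise dissimilarity with the minimax path distance (the smallest achievable "largest hop" between two points), then reorder the points so that nearby ones sit next to each other. This is a simplified illustration under our own assumptions, not the paper's code; the function names and the toy one-dimensional data are hypothetical, and the cubic-time transform is only practical for small inputs.

```python
def minimax_distances(dissimilarity):
    """Replace each entry with the minimax path distance (Floyd-Warshall variant)."""
    n = len(dissimilarity)
    result = [row[:] for row in dissimilarity]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(result[i][k], result[k][j])
                if via_k < result[i][j]:
                    result[i][j] = via_k
    return result


def vat_order(dissimilarity):
    """Prim-like ordering: start at an endpoint of the largest dissimilarity,
    then repeatedly append the point closest to those already ordered."""
    n = len(dissimilarity)
    start = max(range(n), key=lambda i: max(dissimilarity[i]))
    order = [start]
    remaining = [i for i in range(n) if i != start]
    while remaining:
        nxt = min(remaining, key=lambda j: min(dissimilarity[i][j] for i in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Toy data: two well-separated groups on a line (illustrative values only).
points = [0.0, 1.0, 2.0, 10.0, 11.0]
D = [[abs(a - b) for b in points] for a in points]
T = minimax_distances(D)
order = vat_order(T)
reordered = [[T[i][j] for j in order] for i in order]
```

Displayed as an image, `reordered` would show dark blocks along the diagonal, one per candidate cluster; counting the blocks gives a rough estimate of c.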
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 4: CSV file of colon crypt bulk RNA-seq data used for GECO UMAP generation.
https://creativecommons.org/publicdomain/zero/1.0/
This synthetic dataset simulates 300 global cities across 6 major geographic regions, designed specifically for unsupervised machine learning and clustering analysis. It explores how economic status, environmental quality, infrastructure, and digital access shape urban lifestyles worldwide.
| Feature | Description | Range |
|---|---|---|
| 10 Features | Economic, environmental & social indicators | Realistically scaled |
| 300 Cities | Europe, Asia, Americas, Africa, Oceania | Diverse distributions |
| Strong Correlations | Income ↔ Rent (+0.8), Density ↔ Pollution (+0.6) | ML-ready |
| No Missing Values | Clean, preprocessed data | Ready for analysis |
| 4-5 Natural Clusters | Metropolitan hubs, eco-towns, developing centers | Pre-validated |
✅ Realistic Correlations: Income strongly predicts rent (+0.8), internet access (+0.7), and happiness (+0.6)
✅ Regional Diversity: Each region has distinct economic and environmental characteristics
✅ Clustering-Ready: Naturally separable into 4-5 lifestyle archetypes
✅ Beginner-Friendly: No data cleaning required, includes example code
✅ Documented: Comprehensive README with methodology and use cases
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Load and prepare
df = pd.read_csv('city_lifestyle_dataset.csv')
X = df.drop(['city_name', 'country'], axis=1)
X_scaled = StandardScaler().fit_transform(X)
# Cluster
kmeans = KMeans(n_clusters=5, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)
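# Analyze: average each numeric feature per cluster
# (numeric_only=True skips the text columns city_name and country)
print(df.groupby('cluster').mean(numeric_only=True))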
After working with this dataset, you will be able to:
1. Apply K-Means, DBSCAN, and Hierarchical Clustering
2. Use PCA for dimensionality reduction and visualization
3. Interpret correlation matrices and feature relationships
4. Create geographic visualizations with cluster assignments
5. Profile and name discovered clusters based on characteristics
| Cluster | Characteristics | Example Cities |
|---|---|---|
| Metropolitan Tech Hubs | High income, density, rent | Silicon Valley, Singapore |
| Eco-Friendly Towns | Low density, clean air, high happiness | Nordic cities |
| Developing Centers | Mid income, high density, poor air | Emerging markets |
| Low-Income Suburban | Low infrastructure, income | Rural areas |
| Industrial Mega-Cities | Very high density, pollution | Manufacturing hubs |
Unlike random synthetic data, this dataset was carefully engineered with:
- ✨ Realistic correlation structures based on urban research
- 🌍 Regional characteristics matching real-world patterns
- 🎯 Optimal cluster separability (validated via silhouette scores)
- 📚 Comprehensive documentation and starter code
✓ Learn clustering without data cleaning hassles
✓ Practice PCA and dimensionality reduction
✓ Create beautiful geographic visualizations
✓ Understand feature correlation in real-world contexts
✓ Build a portfolio project with clear business insights
This dataset was designed for educational purposes in machine learning and data science. While synthetic, it reflects real patterns observed in global urban development research.
Happy Clustering! 🎉
Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualization similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data:
1. The data is uniformly distributed on a Riemannian manifold;
2. The Riemannian metric is locally constant (or can be approximated as such);
3. The manifold is locally connected.
From these assumptions, it is possible to model the manifold with a fuzzy topological structure. The embedding is found by searching for a low dimensional projection of the data that has the closest possible equivalent fuzzy topological structure.
The details for the underlying mathematics can be found in our paper on ArXiv:
McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018
The hotel industry is essential for tourism. With the rapid expansion of the internet, consumers looking for a hotel simply search for their desired keywords on a website, and the relevant hotel information appears. To respond quickly to the changing market and consumer habits, each hotel must focus on its website information and information quality. This study proposes a novel methodology that uses rough set theory (RST), principal component analysis, t-Distributed Stochastic Neighbor Embedding (t-SNE), and attribute performance visualization to explore the relationship between hotel star ratings and hotel website information quality. The collected data are based on the star-rated hotels of the Taiwanstay website, and checklists of hotel website services are used to obtain the relevant attribute data. The results show significant differences in information quality between hotels below two stars and those above four stars. The information provided by higher-star hotels was more detailed than that offered by low-star hotels. Based on the attribute performance matrix, one-star and two-star hotels have advantage attributes in landscape, reply time, restaurant information, social media, and compensation, whereas three- to five-star hotels have advantage attributes in operational support, compensation, restaurant information, traffic information, and room information. These results could serve as a reference for stakeholders.
Figure 8 data includes:
- Panel A: Visualization of multidimensional Red channel MCF-7 image data after dimension reduction using the t-SNE algorithm. Each point represents 1 of the 3000 cells randomly selected from the 143,211 total cells for each of the 4 landmarks (listed in legend). For visualization, the 300 cells closest to the centroid were displayed. See Supplementary Fig7C for more data.
- Panel B: Confusion matrix plotting Predicted versus True (known localization of the red channel landmark) labels for the validation set of Red channel single-cell images, which were not included in the initial training step. Data was normalized to the total number of cells from the true label of each landmark. See Supplementary Fig7D for original data.
- Panels C-D: Original pie charts and the number of cells analyzed per transfection for each cell line (per pie chart), exported from Python. "PredictedLabel" indicates the label predicted by the classifier. "TrueLabel" indicates the ground truth label (known from the Platemap for each landmark in the Red Channel).
The single-cell image dataset was too large to upload. Contact Elizabeth.osterlund@gmail.com, if desired.
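The normalization described for the confusion matrix (each count divided by the total number of cells carrying that true label) can be sketched as follows. The counts are made up, the function name is hypothetical, and the layout with rows as true labels and columns as predicted labels is an assumption, since the description does not state the orientation.

```python
def normalize_by_true_label(confusion):
    """Divide each row (assumed to be one true label) by its row total."""
    normalized = []
    for row in confusion:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


# Made-up counts for two landmarks: rows = true label, columns = predicted label.
counts = [[90, 10], [5, 45]]
normalized = normalize_by_true_label(counts)
```

After this step each row sums to 1, so the diagonal reads directly as the per-landmark recall regardless of how many cells each landmark contributed.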
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for WikiPos
Dataset Summary
WikiPos is a processed version of the Wikimedia Wikipedia dataset that includes 2D spatial coordinates generated through dimensionality reduction techniques. The dataset contains Wikipedia articles with their original text content plus x,y coordinates derived from sentence embeddings using UMAP and t-SNE algorithms. The dataset enables spatial visualization and exploration of Wikipedia content, allowing researchers to analyze… See the full description on the dataset page: https://huggingface.co/datasets/whatphiliptrains/wikipos.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Handwritten Digits Pixel Dataset is a collection of numerical data representing handwritten digits from 0 to 9. Unlike image datasets that store actual image files, this dataset contains pixel intensity values arranged in a structured tabular format, making it ideal for machine learning and data analysis applications.
The dataset contains handwritten digit samples with the following distribution:
(Note: Actual distribution counts would be calculated from your specific dataset)
import pandas as pd
# Load the dataset
df = pd.read_csv('/kaggle/input/handwritten_digits_pixel_dataset/mnist.csv')
# Separate features and labels
X = df.drop('label', axis=1)
y = df['label']
# Normalize pixel values
X_normalized = X / 255.0
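Since the per-digit distribution is not listed here, a short tally can recover it from the label column. The sketch below uses only the standard library and a tiny in-memory stand-in for the CSV; the three rows and the `pixel0`/`pixel1` column names are invented for illustration, and only the `label` column name comes from the loading snippet.

```python
import csv
import io
from collections import Counter

# Tiny stand-in for the real file; the rows and pixel columns are made up.
sample_csv = io.StringIO("label,pixel0,pixel1\n3,0,255\n7,12,0\n3,0,0\n")

# Count how many samples carry each digit label.
reader = csv.DictReader(sample_csv)
label_counts = Counter(int(row["label"]) for row in reader)

for digit, count in sorted(label_counts.items()):
    print(f"digit {digit}: {count} samples")
```

With the DataFrame already loaded as `df`, `df['label'].value_counts().sort_index()` would give the same tally in pandas.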
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Whole-brain functional connectivity (FC) measured with functional MRI (fMRI) evolves over time in meaningful ways at temporal scales going from years (e.g., development) to seconds [e.g., within-scan time-varying FC (tvFC)]. Yet, our ability to explore tvFC is severely constrained by its large dimensionality (several thousands). To overcome this difficulty, researchers often seek to generate low dimensional representations (e.g., 2D and 3D scatter plots) hoping those will retain important aspects of the data (e.g., relationships to behavior and disease progression). Limited prior empirical work suggests that manifold learning techniques (MLTs)—namely those seeking to infer a low dimensional non-linear surface (i.e., the manifold) where most of the data lies—are good candidates for accomplishing this task. Here we explore this possibility in detail. First, we discuss why one should expect tvFC data to lie on a low dimensional manifold. Second, we estimate what is the intrinsic dimension (ID; i.e., minimum number of latent dimensions) of tvFC data manifolds. Third, we describe the inner workings of three state-of-the-art MLTs: Laplacian Eigenmaps (LEs), T-distributed Stochastic Neighbor Embedding (T-SNE), and Uniform Manifold Approximation and Projection (UMAP). For each method, we empirically evaluate its ability to generate neuro-biologically meaningful representations of tvFC data, as well as their robustness against hyper-parameter selection. Our results show that tvFC data has an ID that ranges between 4 and 26, and that ID varies significantly between rest and task states. We also show how all three methods can effectively capture subject identity and task being performed: UMAP and T-SNE can capture these two levels of detail concurrently, but LE could only capture one at a time. We observed substantial variability in embedding quality across MLTs, and within-MLT as a function of hyper-parameter selection. 
To help alleviate this issue, we provide heuristics that can inform future studies. Finally, we also demonstrate the importance of feature normalization when combining data across subjects and the role that temporal autocorrelation plays in the application of MLTs to tvFC data. Overall, we conclude that while MLTs can be useful to generate summary views of labeled tvFC data, their application to unlabeled data such as resting-state remains challenging.
Users should cite: Asgari E, Mofrad MRK. Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS ONE 10(11): e0141287. doi:10.1371/journal.pone.0141287. This archive also contains the family classification data used in the above-mentioned PLoS ONE paper. This data can be used as a benchmark for the family classification task.
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Whenever I want to try a new cosmetic item, it's so difficult to choose. It's actually more than difficult. It's sometimes scary because new items that I've never tried end up giving me skin trouble. We know the information we need is on the back of each product, but it's really hard to interpret those ingredient lists unless you're a chemist. You may be able to relate to this situation.
We are going to create a content-based recommendation system where the 'content' will be the chemical components of cosmetics. Specifically, we will process ingredient lists for 1472 cosmetics on Sephora via word embedding, then visualize ingredient similarity using a machine learning method called t-SNE and an interactive visualization library called Bokeh. Let's inspect our data first.
DataCamp
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains 500 records and 9 features related to the productivity of developers using AI tools. It appears to analyze how factors like working habits, caffeine intake, and AI usage affect developer performance.
Additional file 1: Figure S1. Quality control and batch effect correction in scRNA-Seq, related to Figure 1 A. Violin plots showing the number of expressed genes, the number of reads uniquely mapped against the reference genome, and the fraction of mitochondrial genes compared to all genes per cell in scRNA-Seq data. B. Box plot showing the number of genes (left) and the number of uniquely mapped reads (right) per cell in each identified cell type in scRNA-Seq data. C. tSNE plot visualization of the sample source for all 70,201 cells. Each dot is a cell. Different colors represent different samples. D. tSNE plot visualization of unsupervised clustering analysis for all 70,201 cells based on scRNA-Seq data after quality control, which gave rise to 31 distinct clusters. Figure S2. Gene Ontology (GO) analysis of the DEGs for each cell type was performed and the representative enriched GO terms are presented, related to Figure 1. Figure S3. Expression of selected marker genes along the differentiation trajectory, related to Figure 2 A. tSNE plot demonstrating cell cycle regression (left). Visualization of myogenic differentiation trajectory by cell cycle phases (G1, S, and G2/M) (right). B. Donut plots showing the percentages of cells in G1, S, and G2M phase at different cell states. C. Expression levels of cell cycle-related genes in the myogenic cells organized into the Monocle trajectory. D. Expression levels of muscle related genes in the myogenic cells organized into the Monocle trajectory. Figure S4. Unsupervised clustering analysis for all cells in scATAC-Seq data and myogenic-specific scATAC-seq peaks, related to Figure 4 A-C. tSNE plot visualization of the sample source for all 48514 cells in scATAC-Seq. Each dot is a cell. Different colors represent different pigs (A), different embryonic stages (B), or different samples (C). D.
tSNE plot visualization of unsupervised clustering analysis for all 48514 cells after quality control in scATAC-Seq data, which gave rise to 15 distinct clusters. E. tSNE plot visualization of myogenic cells and other cells. Clusters 4 and 8 in Figure S4D were annotated as myogenic cells due to their high levels of accessibility of marker genes associated with myogenic lineage. F. Genome browser view of myogenic-specific peaks at the TSS of MyoG and Myf5 for myogenic cells and other cells in the scATAC-seq dataset. Figure S5. Percentage distribution of open chromatin elements in scATAC-Seq data, related to Figure 4 A. Distribution of open chromatin elements in each snATAC-seq sample. B. Distribution of open chromatin elements in snATAC-seq of myogenic cell types. C. Percentage distribution of open chromatin elements among DAPs in myogenic cell types. Figure S6. Integrative analysis of transcription factors and target genes, related to Figure 5 A. tSNE depiction of regulon activity (“on-blue”, “off-gray”), TF gene expression (red scale), and expression of predicted target genes (purple scale) of MyoG, FOSB, and TCF12. B. Corresponding chromatin accessibility in scATAC data for TFs and predicted target genes are depicted. Figure S7. Pseudotime-dependent chromatin accessibility and gene expression changes, related to Figure 7. The first column shows the dynamics of the 10× Genomics TF enrichment score. The second column shows the dynamics of TF gene expression values, and the third and fourth columns represent the dynamics of the SCENIC-reported target gene expression values of corresponding TFs, respectively. Figure S8. Myogenesis related gene expression in DMD (Duchenne muscular dystrophy) mice. Comparison of RNA-seq data of flexor digitorum short (FDB), extensor digitorum long (EDL), and soleus (SOL) in DMD and wild-type mice including 2- month and 5-month age. A. The expression levels of myogenesis related genes (Myod1, Myog, Myf5, Pax7). B. 
The expression levels of related genes that were upregulated during porcine embryonic myogenesis (EGR1, RHOB, KLF4, SOX8, NGFR, MAX, RBFOX2, ANXA6, HES6, RASSF4, PLS3, SPG21). C. The expression levels of related genes that were downregulated during porcine embryonic myogenesis (COX5A, HOMER2, BNIP3, CNCS). Data were obtained from the GEO database (GSE162455; WT, n = 4; DMD, n = 7). Figure S9. Genome browser view of differentially accessible peaks at the TSS of EGR1 and RHOB between myogenic cells in the scATAC-seq dataset, related to Figure 8. Figure S10. Functional analysis of EGR1 in myogenesis, related to Figure 8 A-B. EdU assays for the proliferation of pig primary myogenic cells (A) and C2C12 myoblasts (B) following EGR1 overexpression. C. qPCR analysis of the mRNA levels of cell cycle regulators in C2C12 cells following EGR1 overexpression. D. Immunofluorescence staining for MyHC in C2C12 cells following EGR1 overexpression and differentiation for 3 d. Then, the fusion index was calculated. Figure S11. Functional analysis of RHOB in myogenesis, related to Figure 8 A-B. EdU assays for proliferation of pig primary myogenic cells (A) and C2C12 myoblasts (B) following RHOB overexpression. C. qPCR analysis of the mRNA levels of cell-cycle regulators in C2C12 cells following RHOB overexpression. D. Immunofluorescence staining for MyHC in C2C12 cells following RHOB overexpression and differentiation for 3 d. Then, the fusion index was calculated.
This repository contains a benchmark dataset and processing scripts for analyzing the energy consumption and processing efficiency of 23 Hugging Face tokenizers applied to 8,212 standardized text chunks across 325 languages.
The dataset includes raw and normalized energy measurements, tokenization times, token counts, and rich structural metadata for each chunk, including script composition, entropy, compression ratios, and character-level features. All experiments were conducted in a controlled environment using PyRAPL on a Linux workstation, with baseline CPU consumption subtracted via linear interpolation.
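The baseline subtraction mentioned above can be illustrated with a small sketch. It assumes the baseline was sampled at known chunk positions (the file listing further down mentions one sample per 50 chunks) and that the baseline for any chunk in between is estimated by linear interpolation before being subtracted from the raw reading. The function names, the clamping at the ends, the indexing by chunk position, and the numbers are illustrative assumptions rather than the repository's actual R code.

```python
from bisect import bisect_right


def interpolate_baseline(sample_positions, sample_values, position):
    """Linearly interpolate the baseline at `position`, clamping at both ends."""
    if position <= sample_positions[0]:
        return sample_values[0]
    if position >= sample_positions[-1]:
        return sample_values[-1]
    hi = bisect_right(sample_positions, position)
    lo = hi - 1
    x0, x1 = sample_positions[lo], sample_positions[hi]
    y0, y1 = sample_values[lo], sample_values[hi]
    return y0 + (y1 - y0) * (position - x0) / (x1 - x0)


def net_energy(raw_energy, chunk_index, sample_positions, sample_values):
    """Raw measurement minus the interpolated background consumption."""
    baseline = interpolate_baseline(sample_positions, sample_values, chunk_index)
    return raw_energy - baseline


# Made-up baseline samples taken at chunks 0, 50, and 100 (arbitrary units).
positions = [0, 50, 100]
values = [10.0, 20.0, 10.0]
baseline_at_25 = interpolate_baseline(positions, values, 25)
net_at_25 = net_energy(40.0, 25, positions, values)
```

Interpolating between samples, rather than reusing the most recent baseline reading, avoids step changes in the corrected energy at every sampling boundary.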
The resulting data enables fine-grained comparison of tokenizers across linguistic and script diversity. It also supports downstream tasks such as energy-aware NLP, script-sensitive modeling, and benchmarking of future tokenization methods.
Accompanying R scripts are provided to reproduce data processing, regression models, and clustering analyses. Visualization outputs and cluster-level summaries are also included.
All contents are organized into clearly structured folders to support easy access, interpretability, and reuse by the research community.
01_processing_scripts/
R scripts to transform raw data, subtract baseline energy, and produce clean metrics.
multimodel_tokenization_energy.py⤷ Python script used to tokenize all chunks with 23 models while logging energy and time.
adapting_original_dataset.R⤷ Reads raw logs and metadata, computes net energy, and outputs cleaned files.
energy_patterns.R⤷ Performs clustering, regression, t-SNE, and generates all visualizations.
02_raw_data/
Raw output from the tokenization experiment and baseline profiler.
all_models_tokenization.csv⤷ Full log of 42M+ tokenization runs (23 models × 225 reps × 8,212 chunks).
baseline.csv⤷ Background CPU energy samples, one per 50 chunks. Used for normalization.
03_clean_data/
Cleaned, enriched, and reshaped datasets ready for analysis.
net_energy.csv⤷ Raw tokenization results after baseline energy subtraction (per run).
tokenization_long.csv⤷ One row per chunk × tokenizer, with medians + token counts.
tokenization_wide.csv⤷ Wide-format matrix: one row per chunk, one column per tokenizer × metric.
complete.csv⤷ Fully enriched dataset joining all metrics, metadata, and script distributions.
metadata.csv⤷ Structural features and script-based character stats per chunk.
04_cluster_outputs/
Outputs from clustering and dimensionality reduction over tokenizer energy profiles.
tokenizer_dendrogram.pdf⤷ Hierarchical clustering of 23 tokenizers based on energy profiles.
tokenizer_tsne.pdf⤷ t-SNE projection of tokenizers grouped by energy usage.
mean_energy_per_cluster.csv⤷ Mean energy consumption (mJ) per language × tokenizer cluster.
sd_energy_per_cluster.csv⤷ Standard deviation of energy consumption (mJ) per language × cluster.
grid.pdf⤷ Heatmap of script-wise energy deltas (relative to Latin) for all tokenizers.