37 datasets found
  1. Table_1_A t-SNE Based Classification Approach to Compositional Microbiome...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Dec 14, 2020
    + more versions
    Cite
    Yang, Zhenyu; Xie, Zhongming; Li, Dongfang; Xu, Ximing; Xu, Xueli (2020). Table_1_A t-SNE Based Classification Approach to Compositional Microbiome Data.DOCX [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000506264
    Dataset updated
    Dec 14, 2020
    Authors
    Yang, Zhenyu; Xie, Zhongming; Li, Dongfang; Xu, Ximing; Xu, Xueli
    Description

    As a data-driven dimensionality reduction and visualization tool, t-distributed stochastic neighborhood embedding (t-SNE) has been successfully applied to a variety of fields. In recent years, it has also received increasing attention for classification and regression analysis. This study presented a t-SNE based classification approach for compositional microbiome data, which enabled us to build classifiers and classify new samples in the reduced dimensional space produced by t-SNE. The Aitchison distance was employed to modify the conditional probabilities in t-SNE to account for the compositionality of microbiome data. To classify a new sample, its low-dimensional features were obtained as the weighted mean vector of its nearest neighbors in the training set. Using the low-dimensional features as input, three commonly used machine learning algorithms, logistic regression (LR), support vector machine (SVM), and decision tree (DT) were considered for classification tasks in this study. The proposed approach was applied to two disease-associated microbiome datasets, achieving better classification performance compared with the classifiers built in the original high-dimensional space. The analytic results also showed that t-SNE with Aitchison distance led to improvement of classification accuracy in both datasets. In conclusion, we have developed a t-SNE based classification approach that is suitable for compositional microbiome data and may also serve as a baseline for more complex classification models.
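    The two steps described above (a compositional distance, and placement of a new sample at the weighted mean of its nearest training neighbors) can be sketched roughly as follows. This is a minimal NumPy sketch, not the authors' implementation: the pseudocount used to avoid log(0), the inverse-distance weights, the default `k=5`, and all function names are assumptions made for illustration.

```python
import numpy as np

def clr(x, pseudocount=1e-6):
    """Centered log-ratio transform of compositional rows.
    The pseudocount (an assumption) guards against log(0)."""
    x = np.asarray(x, dtype=float)
    x = x / x.sum(axis=1, keepdims=True)        # close each row to sum to 1
    logx = np.log(x + pseudocount)
    return logx - logx.mean(axis=1, keepdims=True)

def aitchison_distances(a, b):
    """Pairwise Aitchison distances: Euclidean distances between CLR rows."""
    ca, cb = clr(a), clr(b)
    diff = ca[:, None, :] - cb[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def embed_new_sample(x_new, train_x, train_embedding, k=5):
    """Place a new sample at the weighted mean of the low-dimensional
    coordinates of its k nearest training samples.
    Inverse-distance weighting is an illustrative assumption."""
    x_new = np.asarray(x_new, dtype=float)
    d = aitchison_distances(x_new[None, :], train_x)[0]
    nearest = np.argsort(d)[:k]
    w = 1.0 / (d[nearest] + 1e-12)
    w /= w.sum()
    return w @ np.asarray(train_embedding, dtype=float)[nearest]

# Hypothetical toy data: three compositions and their 2D embedding.
train_x = np.array([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [1.0, 1.0, 1.0]])
train_embedding = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
new_point = embed_new_sample(np.array([2.0, 2.0, 1.0]), train_x, train_embedding, k=3)
```

    A distance matrix built this way could, for instance, be passed to scikit-learn's `TSNE` with `metric="precomputed"` and `init="random"` to obtain the training embedding, after which any standard classifier can be fitted on the low-dimensional coordinates.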

  2. t-SNE for Data Visualization in Structural Engineering - Dataset - LDM

    • service.tib.eu
    • resodate.org
    Updated Dec 2, 2024
    Cite
    (2024). t-SNE for Data Visualization in Structural Engineering - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/t-sne-for-data-visualization-in-structural-engineering
    Dataset updated
    Dec 2, 2024
    Description

    The dataset is a binary data set stemming from computational models of earthquake ground motions in structural engineering.

  3. Additional file 5 of GECO: gene expression clustering optimization app for...

    • springernature.figshare.com
    txt
    Updated May 31, 2023
    Cite
    A. N. Habowski; T. J. Habowski; M. L. Waterman (2023). Additional file 5 of GECO: gene expression clustering optimization app for non-linear data visualization of patterns [Dataset]. http://doi.org/10.6084/m9.figshare.13642382.v1
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    A. N. Habowski; T. J. Habowski; M. L. Waterman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 5: CSV file of bulk RNA-seq data of F. nucleatum infection time course used for GECO UMAP generation.

  4. Data from: Visualizing histopathologic deep learning classification and...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    + more versions
    Cite
    Faust, Kevin; Xie, Quin; Han, Dominick; Goyle, Kartikay; Volynskaya, Zoya; Djuric, Ugljesa; Diamandis, Phedias (2020). Visualizing histopathologic deep learning classification and anomaly detection using nonlinear feature space dimensionality reduction [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_1237975
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    University Health Network
    University of Toronto
    Authors
    Faust, Kevin; Xie, Quin; Han, Dominick; Goyle, Kartikay; Volynskaya, Zoya; Djuric, Ugljesa; Diamandis, Phedias
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Training image dataset used in the manuscript "Visualizing histopathologic deep learning classification and anomaly detection using nonlinear feature space dimensionality reduction"

  5. Cluster tendency assessment in neuronal spike data

    • plos.figshare.com
    pdf
    Updated Jun 5, 2023
    Cite
    Sara Mahallati; James C. Bezdek; Milos R. Popovic; Taufik A. Valiante (2023). Cluster tendency assessment in neuronal spike data [Dataset]. http://doi.org/10.1371/journal.pone.0224547
    Available download formats: pdf
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Sara Mahallati; James C. Bezdek; Milos R. Popovic; Taufik A. Valiante
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sorting spikes from extracellular recording into clusters associated with distinct single units (putative neurons) is a fundamental step in analyzing neuronal populations. Such spike sorting is intrinsically unsupervised, as the number of neurons is not known a priori. Therefore, any spike sorting is an unsupervised learning problem that requires one of two approaches: specification of a fixed value k for the number of clusters to seek, or generation of candidate partitions for several possible values of c, followed by selection of a best candidate based on various post-clustering validation criteria. In this paper, we investigate the first approach and evaluate the utility of several methods for providing lower dimensional visualization of the cluster structure and for subsequent spike clustering. We also introduce a visualization technique called improved visual assessment of cluster tendency (iVAT) to estimate possible cluster structures in data without the need for dimensionality reduction. Experiments are conducted on two datasets with ground truth labels. In data with a relatively small number of clusters, iVAT is beneficial in estimating the number of clusters to inform the initialization of clustering algorithms. With larger numbers of clusters, iVAT gives a useful estimate of the coarse cluster structure but sometimes fails to indicate the presumptive number of clusters. We show that noise associated with recording extracellular neuronal potentials can disrupt computational clustering schemes, highlighting the benefit of probabilistic clustering models. Our results show that t-Distributed Stochastic Neighbor Embedding (t-SNE) provides representations of the data that yield more accurate visualization of potential cluster structure to inform the clustering stage. Moreover, the clusters obtained using t-SNE features were more reliable than the clusters obtained using the other methods, which indicates that t-SNE can potentially be used both for visualization and to extract features to be used by any clustering algorithm.

  6. Additional file 4 of GECO: gene expression clustering optimization app for...

    • springernature.figshare.com
    txt
    Updated May 31, 2023
    Cite
    A. N. Habowski; T. J. Habowski; M. L. Waterman (2023). Additional file 4 of GECO: gene expression clustering optimization app for non-linear data visualization of patterns [Dataset]. http://doi.org/10.6084/m9.figshare.13642379.v1
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    A. N. Habowski; T. J. Habowski; M. L. Waterman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 4: CSV file of colon crypt bulk RNA-seq data used for GECO UMAP generation.

  7. 🌆 City Lifestyle Segmentation Dataset

    • kaggle.com
    zip
    Updated Nov 15, 2025
    Cite
    UmutUygurr (2025). 🌆 City Lifestyle Segmentation Dataset [Dataset]. https://www.kaggle.com/datasets/umuttuygurr/city-lifestyle-segmentation-dataset
    Available download formats: zip (11274 bytes)
    Dataset updated
    Nov 15, 2025
    Authors
    UmutUygurr
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description


    🌆 About This Dataset

    This synthetic dataset simulates 300 global cities across 6 major geographic regions, designed specifically for unsupervised machine learning and clustering analysis. It explores how economic status, environmental quality, infrastructure, and digital access shape urban lifestyles worldwide.

    🎯 Perfect For:

    • 📊 K-Means, DBSCAN, Agglomerative Clustering
    • 🔬 PCA & t-SNE Dimensionality Reduction
    • 🗺️ Geospatial Visualization (Plotly, Folium)
    • 📈 Correlation Analysis & Feature Engineering
    • 🎓 Educational Projects (Beginner to Intermediate)

    📦 What's Inside?

    Feature | Description | Range
    10 Features | Economic, environmental & social indicators | Realistically scaled
    300 Cities | Europe, Asia, Americas, Africa, Oceania | Diverse distributions
    Strong Correlations | Income ↔ Rent (+0.8), Density ↔ Pollution (+0.6) | ML-ready
    No Missing Values | Clean, preprocessed data | Ready for analysis
    4-5 Natural Clusters | Metropolitan hubs, eco-towns, developing centers | Pre-validated

    🔥 Key Features

    Realistic Correlations: Income strongly predicts rent (+0.8), internet access (+0.7), and happiness (+0.6)
    Regional Diversity: Each region has distinct economic and environmental characteristics
    Clustering-Ready: Naturally separable into 4-5 lifestyle archetypes
    Beginner-Friendly: No data cleaning required, includes example code
    Documented: Comprehensive README with methodology and use cases

    🚀 Quick Start Example

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler
    
    # Load and prepare
    df = pd.read_csv('city_lifestyle_dataset.csv')
    X = df.drop(['city_name', 'country'], axis=1)
    X_scaled = StandardScaler().fit_transform(X)
    
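    # Cluster (n_init is set explicitly so results do not depend on the library default)
    kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
    df['cluster'] = kmeans.fit_predict(X_scaled)
    
    # Analyze: average numeric columns only, since city_name and country are text
    print(df.groupby('cluster').mean(numeric_only=True))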
    

    🎓 Learning Outcomes

    After working with this dataset, you will be able to:
    1. Apply K-Means, DBSCAN, and Hierarchical Clustering
    2. Use PCA for dimensionality reduction and visualization
    3. Interpret correlation matrices and feature relationships
    4. Create geographic visualizations with cluster assignments
    5. Profile and name discovered clusters based on characteristics
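    As a rough illustration of the PCA step mentioned in the learning outcomes, the sketch below projects standardized features onto their first two principal components using only NumPy. The synthetic `X_demo` array is a hypothetical stand-in for the dataset's numeric columns (it is not the real data), and `pca_project` is an illustrative helper name; in practice `sklearn.decomposition.PCA` would be the more usual choice.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Standardize the columns, then project onto the top principal components.
    Returns the projected coordinates and the explained-variance ratios."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0                       # leave constant columns unscaled
    Z = (X - X.mean(axis=0)) / std
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = (S ** 2) / (S ** 2).sum()
    return Z @ Vt[:n_components].T, explained[:n_components]

# Hypothetical stand-in data: two correlated features plus one independent one.
rng = np.random.default_rng(42)
income = rng.normal(size=300)
rent = 0.8 * income + 0.6 * rng.normal(size=300)
noise = rng.normal(size=300)
X_demo = np.column_stack([income, rent, noise])

coords, ratio = pca_project(X_demo)           # coords can be passed to a scatter plot
```

    Because the first two columns of the stand-in data are strongly correlated, most of the variance should load onto the first component, which is the kind of structure a 2D scatter plot of `coords` would make visible.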

    📚 Ideal For These Projects

    • 🏆 Kaggle Competitions: Practice clustering techniques
    • 📝 Academic Projects: Urban planning, sociology, environmental science
    • 💼 Portfolio Work: Showcase ML skills to employers
    • 🎓 Learning: Hands-on practice with unsupervised learning
    • 🔬 Research: Urban lifestyle segmentation studies

    🌍 Expected Clusters

    Cluster | Characteristics | Example Cities
    Metropolitan Tech Hubs | High income, density, rent | Silicon Valley, Singapore
    Eco-Friendly Towns | Low density, clean air, high happiness | Nordic cities
    Developing Centers | Mid income, high density, poor air | Emerging markets
    Low-Income Suburban | Low infrastructure, income | Rural areas
    Industrial Mega-Cities | Very high density, pollution | Manufacturing hubs

    🛠️ Technical Details

    • Format: CSV (UTF-8)
    • Size: ~300 rows × 10 columns
    • Missing Values: 0%
    • Data Types: 2 categorical, 8 numerical
    • Target Variable: None (unsupervised)
    • Correlation Strength: Pre-validated (r: 0.4 to 0.8)

    📖 What Makes This Dataset Special?

    Unlike random synthetic data, this dataset was carefully engineered with:
    • ✨ Realistic correlation structures based on urban research
    • 🌍 Regional characteristics matching real-world patterns
    • 🎯 Optimal cluster separability (validated via silhouette scores)
    • 📚 Comprehensive documentation and starter code
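    The silhouette validation mentioned above could be checked along the following lines. This is a small NumPy sketch of the mean silhouette score, written for illustration (the helper name and the toy points are assumptions, not part of the dataset); `sklearn.metrics.silhouette_score` offers the same measure for real use.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Mean silhouette over all points: (b - a) / max(a, b), where a is the mean
    distance to the point's own cluster and b the mean distance to the nearest
    other cluster. Points in singleton clusters score 0."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                       # exclude the point itself
        if not same.any():
            continue
        a = dist[i, same].mean()
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores.mean()

# Hypothetical toy points forming two well-separated groups.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = mean_silhouette(points, [0, 0, 1, 1])  # labels follow the groups
bad = mean_silhouette(points, [0, 1, 0, 1])   # labels cut across the groups
```

    Scores close to 1 suggest compact, well-separated clusters, while values near or below 0 suggest that the chosen labels do not match the structure of the data.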

    🏅 Use This Dataset If You Want To:

    ✓ Learn clustering without data cleaning hassles
    ✓ Practice PCA and dimensionality reduction
    ✓ Create beautiful geographic visualizations
    ✓ Understand feature correlation in real-world contexts
    ✓ Build a portfolio project with clear business insights

    📊 Acknowledgments

    This dataset was designed for educational purposes in machine learning and data science. While synthetic, it reflects real patterns observed in global urban development research.

    Happy Clustering! 🎉

  8. umap-learn

    • kaggle.com
    zip
    Updated Oct 19, 2025
    Cite
    HyeongChan Kim (2025). umap-learn [Dataset]. https://www.kaggle.com/kozistr/umaplearn
    Available download formats: zip (46934808 bytes)
    Dataset updated
    Oct 19, 2025
    Authors
    HyeongChan Kim
    Description

    UMAP

    Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualization similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data:

    • The data is uniformly distributed on a Riemannian manifold;
    • The Riemannian metric is locally constant (or can be approximated as such);
    • The manifold is locally connected.

    From these assumptions, it is possible to model the manifold with a fuzzy topological structure. The embedding is found by searching for a low dimensional projection of the data that has the closest possible equivalent fuzzy topological structure.

    The details for the underlying mathematics can be found in our paper on ArXiv:

    McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018

  9. Collected dimension and attribute.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Nov 2, 2023
    Cite
    Cheng, Ching-Hsue; Tsai, Ming-Chi; Chang, Yuan-Shao (2023). Collected dimension and attribute. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000941483
    Dataset updated
    Nov 2, 2023
    Authors
    Cheng, Ching-Hsue; Tsai, Ming-Chi; Chang, Yuan-Shao
    Description

    The hotel industry is essential for tourism. With the rapid expansion of the internet, consumers need only search for their desired keywords on a website when they are trying to find a hotel to stay in, which causes the relevant hotel information to appear. To quickly respond to the changing market and consumer habits, each hotel must focus on its website information and information quality. This study proposes a novel methodology that uses rough set theory (RST), principal component analysis, t-Distributed Stochastic Neighbor Embedding (t-SNE), and attribute performance visualization to explore the relationship between hotel star ratings and hotel website information quality. The collected data are based on the star-rated hotels of the Taiwanstay website, and the checklists of hotel website services are used to obtain the relevant attribute data. The results show that there are significant differences in information quality between hotels below two stars and those above four stars. The information quality provided by the higher-star hotels was more detailed than that offered by low-star hotels. Based on the attribute performance matrix, the one-star and two-star hotels have advantage attributes in their landscape, reply time, restaurant information, social media, and compensation. Furthermore, the three- to five-star hotels have advantage attributes in their operational support, compensation, restaurant information, traffic information, and room information. These results could be provided to the stakeholders as a reference.

  10. The results of different classifiers.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Nov 2, 2023
    Cite
    Tsai, Ming-Chi; Chang, Yuan-Shao; Cheng, Ching-Hsue (2023). The results of different classifiers. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000941519
    Dataset updated
    Nov 2, 2023
    Authors
    Tsai, Ming-Chi; Chang, Yuan-Shao; Cheng, Ching-Hsue
    Description

    The hotel industry is essential for tourism. With the rapid expansion of the internet, consumers need only search for their desired keywords on a website when they are trying to find a hotel to stay in, which causes the relevant hotel information to appear. To quickly respond to the changing market and consumer habits, each hotel must focus on its website information and information quality. This study proposes a novel methodology that uses rough set theory (RST), principal component analysis, t-Distributed Stochastic Neighbor Embedding (t-SNE), and attribute performance visualization to explore the relationship between hotel star ratings and hotel website information quality. The collected data are based on the star-rated hotels of the Taiwanstay website, and the checklists of hotel website services are used to obtain the relevant attribute data. The results show that there are significant differences in information quality between hotels below two stars and those above four stars. The information quality provided by the higher-star hotels was more detailed than that offered by low-star hotels. Based on the attribute performance matrix, the one-star and two-star hotels have advantage attributes in their landscape, reply time, restaurant information, social media, and compensation. Furthermore, the three- to five-star hotels have advantage attributes in their operational support, compensation, restaurant information, traffic information, and room information. These results could be provided to the stakeholders as a reference.

  11. OsterlundJBC_Figure 8

    • search.dataone.org
    • borealisdata.ca
    Updated Dec 28, 2023
    Cite
    Osterlund, Elizabeth (2023). OsterlundJBC_Figure 8 [Dataset]. http://doi.org/10.5683/SP3/A7TRJF
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Osterlund, Elizabeth
    Description

    Figure 8 data includes:
    • Panel A: Visualization of multidimensional Red channel MCF-7 image data after dimension reduction using the t-SNE algorithm. Each point represents 1 of the 3000 cells randomly selected from the 143,211 total cells for each of the 4 landmarks (listed in legend). For visualization, the 300 cells closest to the centroid were displayed. See Supplementary Fig7C for more data.
    • Panel B: Confusion matrix, plotting Predicted versus True (known localization of the red channel landmark) labels for the validation set of Red channel single-cell images, which were not included in the initial training step. Data was normalized to the total number of cells from the true label of each landmark. See Supplementary Fig7D for original data.
    • Panels C-D: Original pie charts and the number of cells analyzed per transfection for each cell line (per pie chart), exported from Python. "PredictedLabel" indicates the label predicted by the classifier. "TrueLabel" indicates the ground truth label (known from the Platemap for each landmark in the Red Channel).
    Single-cell image dataset was too large to upload. Contact Elizabeth.osterlund@gmail.com, if desired.
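    The normalization described for Panel B (each count divided by the total number of cells with that true label) can be illustrated with a short NumPy sketch. The function name and the toy labels are hypothetical and are not taken from the deposited data.

```python
import numpy as np

def row_normalized_confusion(true_labels, predicted_labels, n_classes):
    """Confusion matrix with rows = true label, columns = predicted label,
    each row divided by the number of samples carrying that true label."""
    counts = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(true_labels, predicted_labels):
        counts[t, p] += 1
    row_totals = counts.sum(axis=1, keepdims=True)
    # Rows for classes with no samples are left as zeros.
    return np.divide(counts, row_totals, out=np.zeros_like(counts), where=row_totals > 0)

# Hypothetical labels for three landmark classes.
true_demo = [0, 0, 1, 1, 1, 2]
pred_demo = [0, 1, 1, 1, 0, 2]
cm = row_normalized_confusion(true_demo, pred_demo, n_classes=3)
```

    With this convention every non-empty row sums to 1, so the diagonal entries can be read directly as per-class recall.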

  12. wikipos

    • huggingface.co
    Updated Sep 15, 2025
    Cite
    Philip Gerdes (2025). wikipos [Dataset]. https://huggingface.co/datasets/whatphiliptrains/wikipos
    Dataset updated
    Sep 15, 2025
    Authors
    Philip Gerdes
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset Card for WikiPos

      Dataset Summary
    

    WikiPos is a processed version of the Wikimedia Wikipedia dataset that includes 2D spatial coordinates generated through dimensionality reduction techniques. The dataset contains Wikipedia articles with their original text content plus x,y coordinates derived from sentence embeddings using UMAP and t-SNE algorithms. The dataset enables spatial visualization and exploration of Wikipedia content, allowing researchers to analyze… See the full description on the dataset page: https://huggingface.co/datasets/whatphiliptrains/wikipos.

  13. 🔢🖊️ Digital Recognition: MNIST Dataset

    • kaggle.com
    zip
    Updated Nov 13, 2025
    Cite
    Wasiq Ali (2025). 🔢🖊️ Digital Recognition: MNIST Dataset [Dataset]. https://www.kaggle.com/datasets/wasiqaliyasir/digital-mnist-dataset
    Available download formats: zip (2278207 bytes)
    Dataset updated
    Nov 13, 2025
    Authors
    Wasiq Ali
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Handwritten Digits Pixel Dataset - Documentation

    Overview

    The Handwritten Digits Pixel Dataset is a collection of numerical data representing handwritten digits from 0 to 9. Unlike image datasets that store actual image files, this dataset contains pixel intensity values arranged in a structured tabular format, making it ideal for machine learning and data analysis applications.

    Dataset Description

    Basic Information

    • Format: CSV (Comma-Separated Values)
    • Total Samples: [Number of rows based on your dataset]
    • Features: 784 pixel columns (28×28 pixels) + 1 label column
    • Label Range: Digits 0-9
    • Pixel Value Range: 0-255 (grayscale intensity)

    File Structure

    Column Description

    • label: The target variable representing the digit (0-9)
    • pixel columns: 784 columns named in the format [row]x[column]
    • Each pixel column contains integer values from 0-255 representing grayscale intensity

    Data Characteristics

    Label Distribution

    The dataset contains handwritten digit samples with the following distribution:

    • Digit 0: [X] samples
    • Digit 1: [X] samples
    • Digit 2: [X] samples
    • Digit 3: [X] samples
    • Digit 4: [X] samples
    • Digit 5: [X] samples
    • Digit 6: [X] samples
    • Digit 7: [X] samples
    • Digit 8: [X] samples
    • Digit 9: [X] samples

    (Note: Actual distribution counts would be calculated from your specific dataset)

    Data Quality

    • Missing Values: No missing values detected
    • Data Type: All values are integers
    • Normalization: Pixel values range from 0-255 (can be normalized to 0-1 for ML models)
    • Consistency: Uniform 28×28 grid structure across all samples

    Technical Specifications

    Data Preprocessing Requirements

    • Normalization: Scale pixel values from 0-255 to 0-1 range
    • Reshaping: Convert 1D pixel arrays to 2D 28×28 matrices for visualization
    • Train-Test Split: Recommended 80-20 or 70-30 split for model development

    Recommended Machine Learning Approaches

    Classification Algorithms:

    • Random Forest
    • Support Vector Machines (SVM)
    • Neural Networks
    • K-Nearest Neighbors (KNN)

    Deep Learning Architectures:

    • Convolutional Neural Networks (CNNs)
    • Multi-layer Perceptrons (MLPs)

    Dimensionality Reduction:

    • PCA (Principal Component Analysis)
    • t-SNE for visualization

    Usage Examples

    Loading the Dataset

    import pandas as pd
    
    # Load the dataset
    df = pd.read_csv('/kaggle/input/handwritten_digits_pixel_dataset/mnist.csv')
    
    # Separate features and labels
    X = df.drop('label', axis=1)
    y = df['label']
    
    # Normalize pixel values
    X_normalized = X / 255.0
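
    The remaining preprocessing steps listed above (reshaping to 28×28 and an 80-20 split) might look roughly like this. It is a sketch on randomly generated stand-in pixels rather than the real CSV, and `split_indices` is a hypothetical helper; `sklearn.model_selection.train_test_split` would be the usual tool for the split.

```python
import numpy as np

# Stand-in data with the documented shape: 784 pixel columns, values 0-255.
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(100, 784))
y = rng.integers(0, 10, size=100)

# Normalize to the 0-1 range, then reshape each row into a 28x28 image.
X_normalized = X / 255.0
images = X_normalized.reshape(-1, 28, 28)

def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle the row indices and return (train_indices, test_indices)."""
    order = np.random.default_rng(seed).permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return order[n_test:], order[:n_test]

train_idx, test_idx = split_indices(len(X_normalized))
X_train, X_test = X_normalized[train_idx], X_normalized[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```

    A single image such as `images[0]` can then be handed to a plotting function, while the flat `X_train` and `X_test` arrays suit classifiers that expect one row per sample.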
    
  14. Data_Sheet_1_Manifold learning for fMRI time-varying functional...

    • frontiersin.figshare.com
    docx
    Updated Jul 11, 2023
    Cite
    Javier Gonzalez-Castillo; Isabel S. Fernandez; Ka Chun Lam; Daniel A. Handwerker; Francisco Pereira; Peter A. Bandettini (2023). Data_Sheet_1_Manifold learning for fMRI time-varying functional connectivity.docx [Dataset]. http://doi.org/10.3389/fnhum.2023.1134012.s001
    Available download formats: docx
    Dataset updated
    Jul 11, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Javier Gonzalez-Castillo; Isabel S. Fernandez; Ka Chun Lam; Daniel A. Handwerker; Francisco Pereira; Peter A. Bandettini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Whole-brain functional connectivity (FC) measured with functional MRI (fMRI) evolves over time in meaningful ways at temporal scales going from years (e.g., development) to seconds [e.g., within-scan time-varying FC (tvFC)]. Yet, our ability to explore tvFC is severely constrained by its large dimensionality (several thousands). To overcome this difficulty, researchers often seek to generate low dimensional representations (e.g., 2D and 3D scatter plots) hoping those will retain important aspects of the data (e.g., relationships to behavior and disease progression). Limited prior empirical work suggests that manifold learning techniques (MLTs)—namely those seeking to infer a low dimensional non-linear surface (i.e., the manifold) where most of the data lies—are good candidates for accomplishing this task. Here we explore this possibility in detail. First, we discuss why one should expect tvFC data to lie on a low dimensional manifold. Second, we estimate what is the intrinsic dimension (ID; i.e., minimum number of latent dimensions) of tvFC data manifolds. Third, we describe the inner workings of three state-of-the-art MLTs: Laplacian Eigenmaps (LEs), T-distributed Stochastic Neighbor Embedding (T-SNE), and Uniform Manifold Approximation and Projection (UMAP). For each method, we empirically evaluate its ability to generate neuro-biologically meaningful representations of tvFC data, as well as their robustness against hyper-parameter selection. Our results show that tvFC data has an ID that ranges between 4 and 26, and that ID varies significantly between rest and task states. We also show how all three methods can effectively capture subject identity and task being performed: UMAP and T-SNE can capture these two levels of detail concurrently, but LE could only capture one at a time. We observed substantial variability in embedding quality across MLTs, and within-MLT as a function of hyper-parameter selection. 
    To help alleviate this issue, we provide heuristics that can inform future studies. Finally, we also demonstrate the importance of feature normalization when combining data across subjects and the role that temporal autocorrelation plays in the application of MLTs to tvFC data. Overall, we conclude that while MLTs can be useful to generate summary views of labeled tvFC data, their application to unlabeled data such as resting-state remains challenging.

  15. Replication Data for: Continuous Distributed Representation of Biological...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Asgari, Ehsaneddin (2023). Replication Data for: Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics [Dataset]. http://doi.org/10.7910/DVN/JMFHTN
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Asgari, Ehsaneddin
    Description

    Users should cite: Asgari E, Mofrad MRK. Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS ONE 10(11): e0141287. doi:10.1371/journal.pone.0141287. This archive also contains the family classification data that we used in the above-mentioned PLoS ONE paper. This data can be used as a benchmark for the family classification task.

  16. Cosmetics datasets

    • kaggle.com
    zip
    Updated Dec 16, 2020
    Cite
    Abid Ali Awan (2020). Cosmetics datasets [Dataset]. https://www.kaggle.com/kingabzpro/cosmetics-datasets
    Available download formats: zip (269637 bytes)
    Dataset updated
    Dec 16, 2020
    Authors
    Abid Ali Awan
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Context

    Whenever I want to try a new cosmetic item, it's so difficult to choose. It's actually more than difficult. It's sometimes scary because new items that I've never tried end up giving me skin trouble. We know the information we need is on the back of each product, but it's really hard to interpret those ingredient lists unless you're a chemist. You may be able to relate to this situation.

    Content

    We are going to create a content-based recommendation system where the 'content' will be the chemical components of cosmetics. Specifically, we will process ingredient lists for 1472 cosmetics on Sephora via word embedding, then visualize ingredient similarity using a machine learning method called t-SNE and an interactive visualization library called Bokeh. Let's inspect our data first.
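    One simple way to turn ingredient lists into vectors that can be compared (and later passed to t-SNE) is sketched below. It swaps the word-embedding step for a plain binary ingredient matrix with cosine similarity, purely for illustration; the product strings are invented examples and the function names are hypothetical.

```python
import numpy as np

def ingredient_matrix(ingredient_lists):
    """Build a binary product-by-ingredient matrix from comma-separated lists."""
    tokens = [[t.strip().lower() for t in s.split(",")] for s in ingredient_lists]
    vocab = sorted({t for row in tokens for t in row})
    index = {t: j for j, t in enumerate(vocab)}
    M = np.zeros((len(tokens), len(vocab)))
    for i, row in enumerate(tokens):
        for t in row:
            M[i, index[t]] = 1.0
    return M, vocab

def cosine_similarity(M):
    """Pairwise cosine similarity between the rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    unit = M / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T

# Invented ingredient lists, for illustration only.
example_lists = ["Water, Glycerin", "water, glycerin", "Alcohol"]
M, vocab = ingredient_matrix(example_lists)
similarity = cosine_similarity(M)
```

    Products with near-identical ingredient sets score close to 1, and the rows of `M` (or a denser embedding derived from them) are what a 2D projection such as t-SNE would take as input.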

    Acknowledgements

    DataCamp

  17. Reduct set of two classes with 23 attributes.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Nov 2, 2023
    Cite
    Tsai, Ming-Chi; Cheng, Ching-Hsue; Chang, Yuan-Shao (2023). Reduct set of two classes with 23 attributes. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000941523
    Dataset updated
    Nov 2, 2023
    Authors
    Tsai, Ming-Chi; Cheng, Ching-Hsue; Chang, Yuan-Shao
    Description

    The hotel industry is essential for tourism. With the rapid expansion of the internet, consumers simply search a website for their desired keywords when trying to find a hotel to stay in, and the relevant hotel information appears. To respond quickly to the changing market and consumer habits, each hotel must focus on its website information and information quality. This study proposes a novel methodology that uses rough set theory (RST), principal component analysis, t-Distributed Stochastic Neighbor Embedding (t-SNE), and attribute performance visualization to explore the relationship between hotel star ratings and hotel website information quality. The collected data are based on the star-rated hotels of the Taiwanstay website, and checklists of hotel website services are used to obtain the relevant attribute data. The results show that there are significant differences in information quality between hotels below two stars and those above four stars. The information quality provided by the higher-star hotels was more detailed than that offered by low-star hotels. Based on the attribute performance matrix, the one-star and two-star hotels have advantage attributes in their landscape, reply time, restaurant information, social media, and compensation. Furthermore, the three- to five-star hotels have advantage attributes in their operational support, compensation, restaurant information, traffic information, and room information. These results could be provided to stakeholders as a reference.

  18. AI Developer Performance Dataset

    • kaggle.com
    zip
    Updated May 27, 2025
    Cite
    Shahzad Aslam (2025). AI Developer Performance Dataset [Dataset]. https://www.kaggle.com/datasets/zeesolver/ai-developer-dataset
    Available download formats: zip (5992 bytes)
    Dataset updated
    May 27, 2025
    Authors
    Shahzad Aslam
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset contains 500 records and 9 features related to the productivity of developers using AI tools. It appears to analyze how factors like working habits, caffeine intake, and AI usage affect developer performance.

    Suggested Machine Learning Tasks

    • Binary classification (task_success)
    • Regression (e.g., predicting cognitive_load)
    • Clustering of work patterns
    • Correlation analysis & feature importance
    • Time series simulation & rolling averages (useful with synthetic date column)
    • Exploratory Data Analysis (EDA)
    • Anomaly detection (e.g., outliers in bugs_reported)
    • Multi-output regression (predicting commits and bugs_reported)
    • Dimensionality reduction (PCA or t-SNE for pattern visualization)
    • Decision rule extraction (e.g., tree-based rules for task_success)

    🧠 Inspiration

    Developers with balanced AI usage, sleep, and moderate coffee intake show higher task success. Overuse of AI or caffeine increases cognitive load, reducing effectiveness. Productivity thrives on smart work, not just hard work.

    📊 Column Descriptions

    • hours_coding – Daily coding hours (float).
    • coffee_intake_mg – Daily caffeine intake in milligrams (integer).
    • distractions – Number of distractions experienced (integer).
    • sleep_hours – Average sleep hours per day (float).
    • commits – Number of code commits per day (integer).
    • bugs_reported – Number of bugs reported (integer).
    • ai_usage_hours – Daily AI tool usage hours (float).
    • cognitive_load – Measured cognitive load on a scale (float).
    • task_success – Binary variable indicating task completion success (1 = success, 0 = fail).
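
As a rough illustration of the correlation analysis suggested above, the sketch below reads rows with the listed column names and computes a Pearson correlation using only the standard library. The four inline rows are invented and deliberately perfectly linear so the result is easy to check; they are not values from the dataset, and the helper names `column` and `pearson` are hypothetical.

```python
import csv
import io
import math

# Invented sample rows using two of the column names described above.
SAMPLE_CSV = """coffee_intake_mg,cognitive_load
100,2.0
200,4.0
300,6.0
400,8.0
"""


def column(rows, name):
    """Extract one column from a list of dict rows as floats."""
    return [float(row[name]) for row in rows]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / (spread_x * spread_y)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
r = pearson(column(rows, "coffee_intake_mg"), column(rows, "cognitive_load"))
```

To run this against the real file, replace `io.StringIO(SAMPLE_CSV)` with an open file handle for the downloaded CSV; the file name is not stated in the listing, so it is left out here.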

  19. Additional file 1 of Integrative single-cell RNA-seq and ATAC-seq analysis...

    • datasetcatalog.nlm.nih.gov
    Updated Apr 13, 2023
    Cite
    Wang, Xiaoyu; Lin, Zhuhu; Chen, Meilin; Tong, Xian; Li, Jianhao; Zhu, Qi; Duo, Tianqi; Li, Enru; Cai, Shufang; Liu, Tongni; Liu, Xiaohong; Xu, Rong; Mo, Delin; Chen, Yaosheng; Hu, Bin; Liang, Ziyun (2023). Additional file 1 of Integrative single-cell RNA-seq and ATAC-seq analysis of myogenic differentiation in pig [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001023879
    Dataset updated
    Apr 13, 2023
    Authors
    Wang, Xiaoyu; Lin, Zhuhu; Chen, Meilin; Tong, Xian; Li, Jianhao; Zhu, Qi; Duo, Tianqi; Li, Enru; Cai, Shufang; Liu, Tongni; Liu, Xiaohong; Xu, Rong; Mo, Delin; Chen, Yaosheng; Hu, Bin; Liang, Ziyun
    Description

    Additional file 1: Figure S1. Quality control and batch effect correction in scRNA-Seq, related to Figure 1. A. Violin plots showing the number of expressed genes, the number of reads uniquely mapped against the reference genome, and the fraction of mitochondrial genes compared to all genes per cell in scRNA-Seq data. B. Box plot showing the number of genes (left) and the number of uniquely mapped reads (right) per cell in each identified cell type in scRNA-Seq data. C. tSNE plot visualization of the sample source for all 70,201 cells. Each dot is a cell. Different colors represent different samples. D. tSNE plot visualization of unsupervised clustering analysis for all 70,201 cells based on scRNA-Seq data after quality control, which gave rise to 31 distinct clusters.

    Figure S2. Gene Ontology (GO) analysis of the DEGs for each cell type was performed and the representative enriched GO terms are presented, related to Figure 1.

    Figure S3. Expression of selected marker genes along the differentiation trajectory, related to Figure 2. A. tSNE plot demonstrating cell cycle regression (left). Visualization of myogenic differentiation trajectory by cell cycle phases (G1, S, and G2/M) (right). B. Donut plots showing the percentages of cells in G1, S, and G2M phase at different cell states. C. Expression levels of cell cycle-related genes in the myogenic cells organized into the Monocle trajectory. D. Expression levels of muscle related genes in the myogenic cells organized into the Monocle trajectory.

    Figure S4. Unsupervised clustering analysis for all cells in scATAC-Seq data and myogenic-specific scATAC-seq peaks, related to Figure 4. A-C. tSNE plot visualization of the sample source for all 48514 cells in scATAC-Seq. Each dot is a cell. Different colors represent different pigs (A), different embryonic stages (B), or different samples (C). D. tSNE plot visualization of unsupervised clustering analysis for all 48514 cells after quality control in scATAC-Seq data, which gave rise to 15 distinct clusters. E. tSNE plot visualization of myogenic cells and other cells. Clusters 4 and 8 in Figure S4D were annotated as myogenic cells due to their high levels of accessibility of marker genes associated with myogenic lineage. F. Genome browser view of myogenic-specific peaks at the TSS of MyoG and Myf5 for myogenic cells and other cells in the scATAC-seq dataset.

    Figure S5. Percentage distribution of open chromatin elements in scATAC-Seq data, related to Figure 4. A. Distribution of open chromatin elements in each snATAC-seq sample. B. Distribution of open chromatin elements in snATAC-seq of myogenic cell types. C. Percentage distribution of open chromatin elements among DAPs in myogenic cell types.

    Figure S6. Integrative analysis of transcription factors and target genes, related to Figure 5. A. tSNE depiction of regulon activity (“on-blue”, “off-gray”), TF gene expression (red scale), and expression of predicted target genes (purple scale) of MyoG, FOSB, and TCF12. B. Corresponding chromatin accessibility in scATAC data for TFs and predicted target genes are depicted.

    Figure S7. Pseudotime-dependent chromatin accessibility and gene expression changes, related to Figure 7. The first column shows the dynamics of the 10× Genomics TF enrichment score. The second column shows the dynamics of TF gene expression values, and the third and fourth columns represent the dynamics of the SCENIC-reported target gene expression values of corresponding TFs, respectively.

    Figure S8. Myogenesis related gene expression in DMD (Duchenne muscular dystrophy) mice. Comparison of RNA-seq data of flexor digitorum short (FDB), extensor digitorum long (EDL), and soleus (SOL) in DMD and wild-type mice including 2-month and 5-month age. A. The expression levels of myogenesis related genes (Myod1, Myog, Myf5, Pax7). B. The expression levels of related genes that were upregulated during porcine embryonic myogenesis (EGR1, RHOB, KLF4, SOX8, NGFR, MAX, RBFOX2, ANXA6, HES6, RASSF4, PLS3, SPG21). C. The expression levels of related genes that were downregulated during porcine embryonic myogenesis (COX5A, HOMER2, BNIP3, CNCS). Data were obtained from the GEO database (GSE162455; WT, n = 4; DMD, n = 7).

    Figure S9. Genome browser view of differentially accessible peaks at the TSS of EGR1 and RHOB between myogenic cells in the scATAC-seq dataset, related to Figure 8.

    Figure S10. Functional analysis of EGR1 in myogenesis, related to Figure 8. A-B. EdU assays for the proliferation of pig primary myogenic cells (A) and C2C12 myoblasts following EGR1 overexpression. C. qPCR analysis of the mRNA levels of cell cycle regulators in C2C12 cells following EGR1 overexpression. D. Immunofluorescence staining for MyHC in C2C12 cells following EGR1 overexpression and differentiation for 3 d. Then, the fusion index was calculated.

    Figure S11. Functional analysis of RHOB in myogenesis, related to Figure 8. A-B. EdU assays for proliferation of pig primary myogenic cells (A) and C2C12 myoblasts following RHOB overexpression. C. qPCR analysis of the mRNA levels of cell-cycle regulators in C2C12 cells following RHOB overexpression. D. Immunofluorescence staining for MyHC in C2C12 cells following RHOB overexpression and differentiation for 3 d. Then, the fusion index was calculated.

  20. Data from: A Benchmark Dataset for Multilingual Tokenization Energy and...

    • observatorio-cientifico.ua.es
    • zenodo.org
    Updated 2025
    Cite
    Quesada Granja, Carlos; Quesada Granja, Carlos (2025). A Benchmark Dataset for Multilingual Tokenization Energy and Efficiency Across 23 Models and 325 Languages [Dataset]. https://observatorio-cientifico.ua.es/documentos/688b604617bb6239d2d4a92a
    Dataset updated
    2025
    Authors
    Quesada Granja, Carlos; Quesada Granja, Carlos
    Description

    This repository contains a benchmark dataset and processing scripts for analyzing the energy consumption and processing efficiency of 23 Hugging Face tokenizers applied to 8,212 standardized text chunks across 325 languages.

    The dataset includes raw and normalized energy measurements, tokenization times, token counts, and rich structural metadata for each chunk, including script composition, entropy, compression ratios, and character-level features. All experiments were conducted in a controlled environment using PyRAPL on a Linux workstation, with baseline CPU consumption subtracted via linear interpolation.
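
Baseline subtraction via linear interpolation, as mentioned above, might look roughly like the sketch below. The repository's actual scripts are not shown in this listing, so the function names, the choice to clamp outside the sampled range, and the example numbers are all assumptions made for illustration.

```python
from bisect import bisect_right


def interpolate_baseline(sample_positions, sample_values, position):
    """Linearly interpolate the baseline at `position`.

    `sample_positions` must be sorted in increasing order. Positions outside
    the sampled range are clamped to the nearest sample (an assumption).
    """
    if position <= sample_positions[0]:
        return sample_values[0]
    if position >= sample_positions[-1]:
        return sample_values[-1]
    upper = bisect_right(sample_positions, position)
    lower = upper - 1
    x0, x1 = sample_positions[lower], sample_positions[upper]
    y0, y1 = sample_values[lower], sample_values[upper]
    fraction = (position - x0) / (x1 - x0)
    return y0 + fraction * (y1 - y0)


def net_energy(raw_energy, position, sample_positions, sample_values):
    """Subtract the interpolated baseline from a raw energy reading."""
    return raw_energy - interpolate_baseline(sample_positions, sample_values, position)


# Hypothetical baseline samples taken at chunk indices 0, 50, and 100.
POSITIONS = [0, 50, 100]
BASELINE = [10.0, 20.0, 10.0]

baseline_at_25 = interpolate_baseline(POSITIONS, BASELINE, 25)
net_at_25 = net_energy(40.0, 25, POSITIONS, BASELINE)
```

With sparse baseline samples, interpolation estimates the background consumption at each measurement instead of reusing the nearest sample unchanged, so a slow drift in background load is removed gradually rather than in steps.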

    The resulting data enables fine-grained comparison of tokenizers across linguistic and script diversity. It also supports downstream tasks such as energy-aware NLP, script-sensitive modeling, and benchmarking of future tokenization methods.

    Accompanying R scripts are provided to reproduce data processing, regression models, and clustering analyses. Visualization outputs and cluster-level summaries are also included.

    All contents are organized into clearly structured folders to support easy access, interpretability, and reuse by the research community.

    01_processing_scripts/

    R scripts to transform raw data, subtract baseline energy, and produce clean metrics.

    multimodel_tokenization_energy.py ⤷ Python script used to tokenize all chunks with 23 models while logging energy and time.

    adapting_original_dataset.R ⤷ Reads raw logs and metadata, computes net energy, and outputs cleaned files.

    energy_patterns.R ⤷ Performs clustering, regression, t-SNE, and generates all visualizations.

    02_raw_data/

    Raw output from the tokenization experiment and baseline profiler.

    all_models_tokenization.csv ⤷ Full log of 42M+ tokenization runs (23 models × 225 reps × 8,212 chunks).

    baseline.csv ⤷ Background CPU energy samples, one per 50 chunks. Used for normalization.

    03_clean_data/

    Cleaned, enriched, and reshaped datasets ready for analysis.

    net_energy.csv ⤷ Raw tokenization results after baseline energy subtraction (per run).

    tokenization_long.csv ⤷ One row per chunk × tokenizer, with medians + token counts.

    tokenization_wide.csv ⤷ Wide-format matrix: one row per chunk, one column per tokenizer × metric.

    complete.csv ⤷ Fully enriched dataset joining all metrics, metadata, and script distributions.

    metadata.csv ⤷ Structural features and script-based character stats per chunk.

    04_cluster_outputs/

    Outputs from clustering and dimensionality reduction over tokenizer energy profiles.

    tokenizer_dendrogram.pdf ⤷ Hierarchical clustering of 23 tokenizers based on energy profiles.

    tokenizer_tsne.pdf ⤷ t-SNE projection of tokenizers grouped by energy usage.

    mean_energy_per_cluster.csv ⤷ Mean energy consumption (mJ) per language × tokenizer cluster.

    sd_energy_per_cluster.csv ⤷ Standard deviation of energy consumption (mJ) per language × cluster.

    grid.pdf ⤷ Heatmap of script-wise energy deltas (relative to Latin) for all tokenizers.
