46 datasets found
  1. K Means - Data Blobs

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    txt
    Updated Feb 2, 2022
    Cite
    Jesus Rogel-Salazar (2022). K Means - Data Blobs [Dataset]. http://doi.org/10.6084/m9.figshare.19102187.v3
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 2, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jesus Rogel-Salazar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example data to understand the implementation of K Means
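
    As a rough illustration of what an implementation of K Means involves, here is a minimal NumPy sketch of Lloyd's algorithm. The `kmeans` helper, its parameters, and the synthetic three-blob data are assumptions made for this example; they are not taken from the figshare file.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points; keep it if the cluster is empty
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Synthetic stand-in for the blob data: three 2D groups (hypothetical values)
rng = np.random.default_rng(42)
blobs = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
                   for c in [(0, 0), (5, 5), (0, 5)]])
labels, centroids = kmeans(blobs, k=3)
```

    Because the starting centroids are random, different seeds can converge to different partitions; library implementations usually run several restarts and keep the best one.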

  2. wine-clustering

    • huggingface.co
    Updated Sep 12, 2024
    + more versions
    Cite
    Trevor (2024). wine-clustering [Dataset]. https://huggingface.co/datasets/mltrev23/wine-clustering
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 12, 2024
    Authors
    Trevor
    Description

    Wine Clustering Dataset

      Overview
    

    The Wine Clustering Dataset contains data on various chemical properties of wines, intended for use in clustering tasks. This dataset is ideal for exploring clustering algorithms such as K-Means, hierarchical clustering, and others, to group wines based on their chemical composition.

      Dataset Structure
    

    The dataset is provided as a single CSV file named wine-clustering.csv. It contains 178 entries, each representing a unique wine… See the full description on the dataset page: https://huggingface.co/datasets/mltrev23/wine-clustering.

  3. Customer Segmentation : Clustering

    • kaggle.com
    zip
    Updated Jan 13, 2024
    Cite
    Vishakh Patel (2024). Customer Segmentation : Clustering [Dataset]. https://www.kaggle.com/datasets/vishakhdapat/customer-segmentation-clustering
    Explore at:
    Available download formats: zip (63448 bytes)
    Dataset updated
    Jan 13, 2024
    Authors
    Vishakh Patel
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Customer Personality Analysis involves a thorough examination of a company's optimal customer profiles. This analysis facilitates a deeper understanding of customers, enabling businesses to tailor products to meet the distinct needs, behaviors, and concerns of various customer types.

    By conducting a Customer Personality Analysis, businesses can refine their products based on the preferences of specific customer segments. Rather than allocating resources to market a new product to the entire customer database, companies can identify the segments most likely to be interested in the product. Subsequently, targeted marketing efforts can be directed toward those particular segments, optimizing resource utilization and increasing the likelihood of successful product adoption.

    The features are described below:

    • Id: Unique identifier for each individual in the dataset.
    • Year_Birth: The birth year of the individual.
    • Education: The highest level of education attained by the individual.
    • Marital_Status: The marital status of the individual.
    • Income: The annual income of the individual.
    • Kidhome: The number of young children in the household.
    • Teenhome: The number of teenagers in the household.
    • Dt_Customer: The date when the customer was first enrolled or became a part of the company's database.
    • Recency: The number of days since the last purchase or interaction.
    • MntWines: The amount spent on wines.
    • MntFruits: The amount spent on fruits.
    • MntMeatProducts: The amount spent on meat products.
    • MntFishProducts: The amount spent on fish products.
    • MntSweetProducts: The amount spent on sweet products.
    • MntGoldProds: The amount spent on gold products.
    • NumDealsPurchases: The number of purchases made with a discount or as part of a deal.
    • NumWebPurchases: The number of purchases made through the company's website.
    • NumCatalogPurchases: The number of purchases made through catalogs.
    • NumStorePurchases: The number of purchases made in physical stores.
    • NumWebVisitsMonth: The number of visits to the company's website in a month.
    • AcceptedCmp3: Binary indicator (1 or 0) whether the individual accepted the third marketing campaign.
    • AcceptedCmp4: Binary indicator (1 or 0) whether the individual accepted the fourth marketing campaign.
    • AcceptedCmp5: Binary indicator (1 or 0) whether the individual accepted the fifth marketing campaign.
    • AcceptedCmp1: Binary indicator (1 or 0) whether the individual accepted the first marketing campaign.
    • AcceptedCmp2: Binary indicator (1 or 0) whether the individual accepted the second marketing campaign.
    • Complain: Binary indicator (1 or 0) whether the individual has made a complaint.
    • Z_CostContact: A constant cost associated with contacting a customer.
    • Z_Revenue: A constant revenue associated with a successful campaign response.
    • Response: Binary indicator (1 or 0) whether the individual responded to the marketing campaign.
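
    A common first step with features like these is to aggregate the spending columns and standardize before clustering. The sketch below uses randomly generated rows with a few of the column names listed above; the `TotalSpend` feature, the choice of four segments, and the values themselves are illustrative assumptions, not properties of the Kaggle file.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in rows (hypothetical values, real column names)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Income": rng.normal(50_000, 15_000, n).clip(min=5_000),
    "Recency": rng.integers(0, 100, n),
    "MntWines": rng.integers(0, 1_000, n),
    "MntMeatProducts": rng.integers(0, 800, n),
    "NumWebPurchases": rng.integers(0, 15, n),
    "NumStorePurchases": rng.integers(0, 15, n),
})

# Derived feature (an assumption, not a dataset column): total spend
mnt_cols = [c for c in df.columns if c.startswith("Mnt")]
df["TotalSpend"] = df[mnt_cols].sum(axis=1)

features = ["Income", "Recency", "TotalSpend", "NumWebPurchases", "NumStorePurchases"]
# Standardize so that Income does not dominate the Euclidean distance
X = StandardScaler().fit_transform(df[features])
df["segment"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Per-segment averages, the usual starting point for naming segments
profile = df.groupby("segment")[features].mean()
```

    With the real file, the categorical columns (Education, Marital_Status) would need encoding, and missing Income values would need handling, before a step like this.
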
  4. Data from: Galaxy clustering

    • kaggle.com
    zip
    Updated Jan 3, 2023
    Cite
    The Devastator (2023). Galaxy clustering [Dataset]. https://www.kaggle.com/datasets/thedevastator/clustering-polygons-utilizing-iris-moon-and-circ
    Explore at:
    Available download formats: zip (6339 bytes)
    Dataset updated
    Jan 3, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Galaxy clustering

    Iris, Moon, and Circles datasets for Galaxy clustering tutorial

    About this dataset

    This dataset can be used to explore the effectiveness of various clustering algorithms. It includes numerical measurements (X, Y, Sepal.Length, and Petal.Length) and a categorical variable (Species), so it is possible to investigate the relationship between different types of variables and clustering performance. Additionally, by comparing results across the 3 datasets provided, moon.csv (which contains x and y coordinates), iris.csv (which contains measurements for sepal and petal lengths), and circles.csv, we can gain insight into how different data distributions affect clustering techniques such as K-Means or hierarchical clustering.

    How to use the dataset

    This dataset can also serve as a starting point for exploring more complex clusters in higher-dimensional feature spaces, using variables such as color or texture that appear in other datasets (not included here) and that can help form more accurate groups. It could also support visualization projects that need generated clusters, such as plotting mapped data points or examining the relationship between two variables within a region of a chart.

    To use this dataset effectively, make sure you understand how your chosen algorithm works: some algorithms require parameters such as the number of clusters to be specified beforehand, while others determine them automatically, and a poor choice can invalidate the interpretation. Also familiarize yourself with evaluation metrics such as the silhouette score, which measures how well separated the clusters are without needing labels, and the Rand index, which measures agreement between a clustering and a reference labeling or another clustering. These let you judge whether a result is good enough. Good luck!
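
    The two metrics mentioned above can be computed with scikit-learn. This sketch uses `make_moons` as a stand-in for moon.csv (an assumption; the file itself is not loaded) and the adjusted variant of the Rand index, which corrects for chance agreement.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Two interleaving half-circles with known membership (synthetic stand-in)
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)          # internal metric: needs no ground truth
ari = adjusted_rand_score(y_true, labels)  # external metric: compares with known labels

print(f"silhouette={sil:.3f}  adjusted Rand index={ari:.3f}")
```

    K-Means separates points with a straight boundary, so on moon-shaped data the agreement with the true groups is typically well below 1 even when the silhouette score looks acceptable. Comparing both numbers is one way to notice that mismatch.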

    Research Ideas

    • Utilizing the sepal and petal lengths and widths to perform flower recognition, or as part of a larger image recognition pipeline.
    • Classifying the data points in each dataset by the X-Y coordinates using clustering algorithms to analyze galaxy locations or overall formation patterns for stars, planets, or galaxies.
    • Exploring correlations between species of flowers in terms of sepal/petal lengths by performing supervised learning tasks such as classification with this dataset.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0), Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

    Columns

    File: moon.csv

    | Column name | Description |
    |:------------|:------------|
    | X | X coordinate of the data point. (Numeric) |
    | Y | Y coordinate of the data point. (Numeric) |

    File: iris.csv

    | Column name | Description |
    |:------------|:------------|
    | Sepal.Length | Length of the sepal of the flower. (Numeric) |
    | Petal.Length | Length of the petal of the flower. (Numeric) |
    | Species | Species of the flower. (Categorical) |

  5. Data for A method for assessment of the general circulation model quality...

    • data.taltech.ee
    • data.niaid.nih.gov
    Updated Mar 11, 2025
    + more versions
    Cite
    Ilja Maljutenko; Urmas Raudsepp (2025). Data for A method for assessment of the general circulation model quality using k-means clustering algorithm [Dataset]. http://doi.org/10.5281/zenodo.4588510
    Explore at:
    Dataset updated
    Mar 11, 2025
    Dataset provided by
    TalTech Data Repository
    Authors
    Ilja Maljutenko; Urmas Raudsepp
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2021
    Description

    The dataset consists of simulated and observed salinity/temperature data which were used in the manuscript "A method for assessment of the general circulation model quality using k-means clustering algorithm" submitted to Geoscientific Model Development.
    The model simulation dataset is from long-term 3D circulation model simulation (Maljutenko and Raudsepp 2014, 2019). The observations are from the "Baltic Sea - Eutrophication and Acidity aggregated datasets 1902/2017 v2018" SMHI (2018).

    The files are in simple comma separated table format without headers.
    The Dout-t_z_lat_lon_Smod_Sobs_Tmod_Tobs.csv file contains columns with the following variables [units]:
    Time [Matlab datenum units], vertical coordinate [m], latitude [°N], longitude [°E], model salinity [g/kg], observed salinity [g/kg], model temperature [°C], observed temperature [°C].

    The Dout-t_z_lat_lon_dS_dT_K1_K2_K3_K4_K5_K6_K7_K8_K9.csv file contains columns with the following variables [units]:
    The first 4 columns are the same as in the previous file, followed by salinity error [g/kg] and temperature error [°C]; columns 7-8 are integers showing the cluster to which the error pair is designated.
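
    Since the files have no header row, a reader has to supply the column names and convert the Matlab time stamps. The following is a sketch under stated assumptions: the column names are invented from the description above, the two rows are made up, and the error is taken as model minus observation, which the description does not confirm.

```python
import io
import pandas as pd

# Column names are assumptions inferred from the file name and description
columns = ["time", "depth_m", "lat", "lon",
           "sal_model", "sal_obs", "temp_model", "temp_obs"]

# Two made-up rows standing in for Dout-t_z_lat_lon_Smod_Sobs_Tmod_Tobs.csv
sample = io.StringIO(
    "730486.0,5.0,59.5,24.8,6.1,6.3,4.2,4.0\n"
    "730486.5,10.0,59.5,24.8,6.4,6.5,4.1,4.3\n"
)
df = pd.read_csv(sample, header=None, names=columns)

# Matlab datenum counts days from year 0; 719529 is the datenum of 1970-01-01,
# so subtracting it gives days since the Unix epoch
df["time"] = pd.to_datetime(df["time"] - 719529, unit="D")

# Error pair (sign convention assumed here: model minus observation)
df["dS"] = df["sal_model"] - df["sal_obs"]
df["dT"] = df["temp_model"] - df["temp_obs"]
```

    To read the real file, the `io.StringIO` object would be replaced by the path to the downloaded CSV.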

    do_clust_valid_DataFig.m is a Matlab script which reads the two csv files (and optionally the mask file Model_mask.mat), performs the clustering analysis, and creates the plots used in the manuscript. The script is organized into %% blocks which can be executed separately (default: ctrl+enter).

    The k-means function from the Matlab Statistics and Machine Learning Toolbox is used.

    Additional software used in the do_clust_valid_DataFig.m:

    Author's auxiliary formatting scripts script/
    datetick_cst.m
    do_fitfig.m
    do_skipticks.m
    do_skipticks_y.m

    Colormaps are generated using cbrewer.m (Charles, 2021).
    Moving average smoothing is performed using nanmoving_average.m (Aguilera, 2021).

  6. 2D Clustering Dataset Collection

    • kaggle.com
    zip
    Updated Jan 21, 2025
    Cite
    SAMOILOV MIKHAIL (2025). 2D Clustering Dataset Collection [Dataset]. https://www.kaggle.com/datasets/samoilovmikhail/2d-clustering-dataset-collection
    Explore at:
    Available download formats: zip (136543 bytes)
    Dataset updated
    Jan 21, 2025
    Authors
    SAMOILOV MIKHAIL
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset collection comprises 15 diverse two-dimensional datasets specifically designed for clustering analysis. Each dataset contains three columns: x, y, and target, where x and y represent the coordinates of the data points, and target indicates the cluster label.

  7. k-means clustering of cohort mean temperature coefficients.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jul 6, 2023
    Cite
    Schnable, Patrick S.; Yeh, Cheng-Ting “Eddy”; Kusmec, Aaron; Attigala, Lakshmi; Dai, Xiongtao; Srinivasan, Srikant (2023). k-means clustering of cohort mean temperature coefficients. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000995763
    Explore at:
    Dataset updated
    Jul 6, 2023
    Authors
    Schnable, Patrick S.; Yeh, Cheng-Ting “Eddy”; Kusmec, Aaron; Attigala, Lakshmi; Dai, Xiongtao; Srinivasan, Srikant
    Description

    Data underlying Fig 2D. Cohort means are the weighted average temperature coefficient for all hybrids that first appeared in the dataset in the indicated year. Only temperature bins 30 to >41°C inclusive are included. Columns include year, temperature (°C), cohort mean coefficient, and cluster. (CSV)
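
    The weighted average described above can be written out in a few lines. The column names, the weights, and the numbers below are invented for illustration; the description does not say what the weights are.

```python
import numpy as np
import pandas as pd

# Hypothetical per-hybrid coefficients for one temperature bin
hybrids = pd.DataFrame({
    "first_year": [2000, 2000, 2001, 2001, 2001],
    "coef":       [-0.10, -0.30, -0.20, -0.40, -0.60],
    "weight":     [1.0, 3.0, 1.0, 1.0, 2.0],
})

# Cohort mean = weighted average over hybrids that first appeared in that year
cohort_mean = {
    year: np.average(group["coef"], weights=group["weight"])
    for year, group in hybrids.groupby("first_year")
}
```

    For the 2000 cohort this gives (-0.10 * 1 + -0.30 * 3) / 4 = -0.25, and for 2001 it gives -0.45.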

  8. ChatGPT API and BERT NLP

    • figshare.com
    application/csv
    Updated Mar 13, 2024
    Cite
    Carmen Atkins (2024). ChatGPT API and BERT NLP [Dataset]. http://doi.org/10.6084/m9.figshare.25403407.v2
    Explore at:
    Available download formats: application/csv
    Dataset updated
    Mar 13, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Carmen Atkins
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    • input_prompts.csv provides the inputs for the ChatGPT API (countries and their respective prompts).
    • topic_consolidations.csv contains the 4,018 unique topics listed across all ChatGPT responses to prompts in our study and their corresponding cluster labels after applying K-means++ clustering (n = 50) via natural language processing with Bidirectional Encoder Representations from Transformers (BERT). ChatGPT response topics come from both versions (3.5 and 4) over 10 iterations each (per each country).
    • ChatGPT_prompt_automation.ipynb is the Jupyter notebook of Python code used to run the API to prompt ChatGPT and gather responses.
    • topic_consolidation_BERT.ipynb is the Jupyter notebook of Python code used to process the 4,018 unique topics gathered through BERT NLP. This code was adapted from Vimal Pillar on Kaggle (https://www.kaggle.com/code/vimalpillai/text-clustering-with-sentence-bert).

  9. Patient Dataset for Clustering (Raw Data)

    • kaggle.com
    Updated Aug 10, 2023
    Cite
    Arjunn Sharma (2023). Patient Dataset for Clustering (Raw Data) [Dataset]. https://www.kaggle.com/datasets/arjunnsharma/patient-dataset-for-clustering-raw-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 10, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Arjunn Sharma
    Description

    About Dataset

    • Based on patient symptoms, the goal is to identify patients needing immediate resuscitation, to assign patients to a predesignated patient care area (thereby prioritizing their care), and to initiate diagnostic/therapeutic measures as appropriate.
    • Three individual datasets were used, one for each of three urgent illnesses/injuries. Each dataset has its own features and symptoms for each patient, and they were merged to find the most severe symptoms for each illness and give them priority of treatment.

    PROJECT SUMMARY
    Triage refers to the sorting of injured or sick people according to their need for emergency medical attention. It is a method of determining priority for who gets care first.

    BACKGROUND
    Triage is the prioritization of patient care (or victims during a disaster) based on illness/injury, symptoms, severity, prognosis, and resource availability. The purpose of triage is to identify patients needing immediate resuscitation; to assign patients to a predesignated patient care area, thereby prioritizing their care; and to initiate diagnostic/therapeutic measures as appropriate.

    BUSINESS CHALLENGE
    Based on patient symptoms, identify patients needing immediate resuscitation; to assign patients to a predesignated patient care area, thereby prioritizing their care; and to initiate diagnostic/therapeutic measures as appropriate.

  10. K-means derived GECO Scores.

    • plos.figshare.com
    txt
    Updated Jun 9, 2023
    Cite
    Jason Bennett; Mikhail Pomaznoy; Akul Singhania; Bjoern Peters (2023). K-means derived GECO Scores. [Dataset]. http://doi.org/10.1371/journal.pcbi.1009459.s008
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Jason Bennett; Mikhail Pomaznoy; Akul Singhania; Bjoern Peters
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The table of GECO scores generated after clustering with k-means using each value of k listed in the first column ‘k’. (CSV)

  11. Student Performance and Learning Behavior Dataset for Educational Analytics

    • zenodo.org
    bin, csv
    Updated Aug 13, 2025
    + more versions
    Cite
    Kamal NAJEM (2025). Student Performance and Learning Behavior Dataset for Educational Analytics [Dataset]. http://doi.org/10.5281/zenodo.16459132
    Explore at:
    Available download formats: bin, csv
    Dataset updated
    Aug 13, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kamal NAJEM
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jul 26, 2025
    Description

    The dataset used in this study integrates quantitative data on student learning behaviors, engagement patterns, demographics, and academic performance. It was compiled by merging two publicly available Kaggle datasets, resulting in a combined file (“merged_dataset.csv”) containing 14,003 student records with 16 attributes. All records are anonymized and contain no personally identifiable information.

    The dataset covers the following categories of variables:

    • Study behaviors and engagement: StudyHours, Attendance, Extracurricular, AssignmentCompletion, OnlineCourses, Discussions
    • Resource access and learning environment: Resources, Internet, EduTech

    • Motivation and psychological factors: Motivation, StressLevel

    • Demographic information: Gender, Age (ranging from 18 to 30 years)

    • Learning preference classification: LearningStyle

    • Academic performance indicators: ExamScore, FinalGrade

    In this study, “ExamScore” and “FinalGrade” served as the primary performance indicators. The remaining variables were used to derive behavioral and contextual profiles, which were clustered using unsupervised machine learning techniques.

    The analysis and modeling were implemented in Python through a structured Jupyter Notebook (“Project.ipynb”), which included the following main steps:

    1. Environment Setup – Import of essential libraries (NumPy, pandas, Matplotlib, Seaborn, SciPy, StatsModels, scikit-learn, imbalanced-learn) and visualization configuration.

    2. Data Import and Integration – Loading the two source CSV files, harmonizing columns, removing irrelevant attributes, aligning formats, handling missing values, and merging them into a unified dataset (merged_dataset.csv).

    3. Data Preprocessing

      • Encoding categorical variables using LabelEncoder.

      • Scaling features using both z-score standardization (for statistical tests and PCA) and Min–Max normalization (for clustering).

      • Detecting and removing duplicates.

    4. Clustering Analysis

      • Applying K-Means clustering to segment learners into distinct profiles.

      • Determining the optimal number of clusters using the Elbow Method and Silhouette Score.

      • Evaluating cluster quality with internal metrics (Silhouette Score, Davies–Bouldin Index).

    5. Dimensionality Reduction & Visualization – Using PCA for 2D/3D cluster visualization and feature importance exploration.

    6. Mapping Clusters to Learning Styles – Associating each identified cluster with the most relevant learning style model based on feature patterns and alignment scores.

    7. Statistical Analysis – Conducting ANOVA and regression to test for significant differences in performance between clusters.

    8. Interpretation & Practical Recommendations – Analyzing cluster-specific characteristics and providing implications for adaptive and mobile learning integration.
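
    Steps 3 and 4 above can be sketched with scikit-learn as follows. The data here comes from `make_blobs` as a stand-in for the encoded student features, and the range of k values is an arbitrary choice; none of this is taken from Project.ipynb.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the encoded student features (not the real data)
X_raw, _ = make_blobs(n_samples=400, centers=4, n_features=6, random_state=0)
X = MinMaxScaler().fit_transform(X_raw)  # Min-Max normalization before clustering

results = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    results[k] = {
        "inertia": model.inertia_,                                 # for the elbow plot
        "silhouette": silhouette_score(X, model.labels_),          # higher is better
        "davies_bouldin": davies_bouldin_score(X, model.labels_),  # lower is better
    }

# One simple selection rule: the k with the highest silhouette score
best_k = max(results, key=lambda k: results[k]["silhouette"])
print(best_k, results[best_k])
```

    The elbow method is a visual judgment on the inertia curve, so in practice it is read from a plot of `inertia` against k rather than computed by a rule like the one used for `best_k`.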

  12. Replication Data for: Singapore Soundscape Site Selection Survey (S5):...

    • researchdata.ntu.edu.sg
    tsv
    Updated Jun 14, 2022
    Cite
    Kenneth Ooi; Bhan Lam; Joo Young Hong; Karn N. Watcharasupat; Zhen-Ting Ong; Woon-Seng Gan (2022). Replication Data for: Singapore Soundscape Site Selection Survey (S5): Identification of Characteristic Soundscapes of Singapore via Weighted k-means Clustering [Dataset]. http://doi.org/10.21979/N9/BBBPMO
    Explore at:
    Available download formats: tsv (63781)
    Dataset updated
    Jun 14, 2022
    Dataset provided by
    DR-NTU (Data)
    Authors
    Kenneth Ooi; Bhan Lam; Joo Young Hong; Karn N. Watcharasupat; Zhen-Ting Ong; Woon-Seng Gan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Area covered
    Singapore
    Dataset funded by
    National Research Foundation (NRF)
    Ministry of National Development (MND)
    Description

    This dataset contains the data used for all statistical analysis in our publication "Singapore Soundscape Site Selection Survey (S5): Identification of Characteristic Soundscapes of Singapore via Weighted k-means Clustering", summarised in a single .csv file.

    For more details on the study methodology, please refer to our manuscript: Ooi, K.; Lam, B.; Hong, J.; Watcharasupat, K. N.; Ong, Z.-T.; Gan, W.-S. Singapore Soundscape Site Selection Survey (S5): Identification of Characteristic Soundscapes of Singapore via Weighted k-means Clustering. Sustainability, 2022.

    For our replication code utilising this data, please refer to our Github repository: https://github.com/ntudsp/singapore-soundscape-site-selection-survey

    A short explanation of the columns in the .csv file is as follows:

    • Full of life & exciting [Latitude]: The latitude, in degrees, of the location chosen by the participant as "Full of life & exciting".
    • Full of life & exciting [Longitude]: The longitude, in degrees, of the location chosen by the participant as "Full of life & exciting".
    • Full of life & exciting [# times visited]: The number of times that the participant had visited the chosen location they considered "Full of life & exciting" before, as reported by the participant.
    • Full of life & exciting [Duration]: The average duration per visit to the chosen location the participant considered "Full of life & exciting", as reported by the participant.
    • Chaotic & restless [Latitude]: The latitude, in degrees, of the location chosen by the participant as "Chaotic & restless".
    • Chaotic & restless [Longitude]: The longitude, in degrees, of the location chosen by the participant as "Chaotic & restless".
    • Chaotic & restless [# times visited]: The number of times that the participant had visited the chosen location they considered "Chaotic & restless" before, as reported by the participant.
    • Chaotic & restless [Duration]: The average duration per visit to the chosen location the participant considered "Chaotic & restless", as reported by the participant.
    • Calm & tranquil [Latitude]: The latitude, in degrees, of the location chosen by the participant as "Calm & tranquil".
    • Calm & tranquil [Longitude]: The longitude, in degrees, of the location chosen by the participant as "Calm & tranquil".
    • Calm & tranquil [# times visited]: The number of times that the participant had visited the chosen location they considered "Calm & tranquil" before, as reported by the participant.
    • Calm & tranquil [Duration]: The average duration per visit to the chosen location the participant considered "Calm & tranquil", as reported by the participant.
    • Boring & lifeless [Latitude]: The latitude, in degrees, of the location chosen by the participant as "Boring & lifeless".
    • Boring & lifeless [Longitude]: The longitude, in degrees, of the location chosen by the participant as "Boring & lifeless".
    • Boring & lifeless [# times visited]: The number of times that the participant had visited the chosen location they considered "Boring & lifeless" before, as reported by the participant.
    • Boring & lifeless [Duration]: The average duration per visit to the chosen location the participant considered "Boring & lifeless", as reported by the participant.
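
    scikit-learn's `KMeans` accepts per-sample weights, which gives one simple way to run a weighted k-means on coordinates like these. The points below are made up, and weighting by the number of visits is only an assumption for illustration; the weighting scheme actually used in the study is described in the manuscript and may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Made-up (latitude, longitude) picks scattered around two spots
coords = np.vstack([
    rng.normal((1.30, 103.85), 0.01, size=(40, 2)),
    rng.normal((1.35, 103.70), 0.01, size=(40, 2)),
])
# Hypothetical weights, standing in for the "# times visited" column
times_visited = rng.integers(1, 20, size=len(coords))

model = KMeans(n_clusters=2, n_init=10, random_state=0)
# sample_weight makes heavily weighted points pull the centroids harder
labels = model.fit_predict(coords, sample_weight=times_visited)
centres = model.cluster_centers_
```

    Each centroid is then the weighted mean of its members, so a location reported by frequent visitors shifts the cluster centre more than one reported by occasional visitors.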

  13. 🌆 City Lifestyle Segmentation Dataset

    • kaggle.com
    zip
    Updated Nov 15, 2025
    Cite
    UmutUygurr (2025). 🌆 City Lifestyle Segmentation Dataset [Dataset]. https://www.kaggle.com/datasets/umuttuygurr/city-lifestyle-segmentation-dataset
    Explore at:
    Available download formats: zip (11274 bytes)
    Dataset updated
    Nov 15, 2025
    Authors
    UmutUygurr
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🌆 About This Dataset

    This synthetic dataset simulates 300 global cities across 6 major geographic regions, designed specifically for unsupervised machine learning and clustering analysis. It explores how economic status, environmental quality, infrastructure, and digital access shape urban lifestyles worldwide.

    🎯 Perfect For:

    • 📊 K-Means, DBSCAN, Agglomerative Clustering
    • 🔬 PCA & t-SNE Dimensionality Reduction
    • 🗺️ Geospatial Visualization (Plotly, Folium)
    • 📈 Correlation Analysis & Feature Engineering
    • 🎓 Educational Projects (Beginner to Intermediate)

    📦 What's Inside?

    | Feature | Description | Range |
    |:--------|:------------|:------|
    | 10 Features | Economic, environmental & social indicators | Realistically scaled |
    | 300 Cities | Europe, Asia, Americas, Africa, Oceania | Diverse distributions |
    | Strong Correlations | Income ↔ Rent (+0.8), Density ↔ Pollution (+0.6) | ML-ready |
    | No Missing Values | Clean, preprocessed data | Ready for analysis |
    | 4-5 Natural Clusters | Metropolitan hubs, eco-towns, developing centers | Pre-validated |

    🔥 Key Features

    Realistic Correlations: Income strongly predicts rent (+0.8), internet access (+0.7), and happiness (+0.6)
    Regional Diversity: Each region has distinct economic and environmental characteristics
    Clustering-Ready: Naturally separable into 4-5 lifestyle archetypes
    Beginner-Friendly: No data cleaning required, includes example code
    Documented: Comprehensive README with methodology and use cases

    🚀 Quick Start Example

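    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Load the data and keep only the numeric feature columns
    df = pd.read_csv('city_lifestyle_dataset.csv')
    X = df.drop(['city_name', 'country'], axis=1)
    # Standardize so that no single feature dominates the distance metric
    X_scaled = StandardScaler().fit_transform(X)

    # Cluster (n_init is set explicitly so the result does not depend on the library default)
    kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
    df['cluster'] = kmeans.fit_predict(X_scaled)

    # Analyze: average the numeric columns only, since city_name and country are text
    print(df.groupby('cluster').mean(numeric_only=True))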
    

    🎓 Learning Outcomes

    After working with this dataset, you will be able to:

    1. Apply K-Means, DBSCAN, and Hierarchical Clustering
    2. Use PCA for dimensionality reduction and visualization
    3. Interpret correlation matrices and feature relationships
    4. Create geographic visualizations with cluster assignments
    5. Profile and name discovered clusters based on characteristics
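
    For the PCA outcome in particular, a minimal sketch might look like the following. Random numbers stand in for the 8 numerical features, so the projection itself is meaningless; only the sequence of calls is of interest.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))  # stand-in for 300 cities x 8 numerical features
X_scaled = StandardScaler().fit_transform(X)

# Project onto the two directions of largest variance for a 2D scatter plot
pca = PCA(n_components=2)
coords_2d = pca.fit_transform(X_scaled)

# Share of the total variance that survives the projection
explained = pca.explained_variance_ratio_.sum()
```

    Coloring a scatter plot of `coords_2d` by cluster label is a common way to inspect a clustering visually; a low `explained` value is a warning that the 2D picture hides much of the structure.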

    📚 Ideal For These Projects

    • 🏆 Kaggle Competitions: Practice clustering techniques
    • 📝 Academic Projects: Urban planning, sociology, environmental science
    • 💼 Portfolio Work: Showcase ML skills to employers
    • 🎓 Learning: Hands-on practice with unsupervised learning
    • 🔬 Research: Urban lifestyle segmentation studies

    🌍 Expected Clusters

    | Cluster | Characteristics | Example Cities |
    |:--------|:----------------|:---------------|
    | Metropolitan Tech Hubs | High income, density, rent | Silicon Valley, Singapore |
    | Eco-Friendly Towns | Low density, clean air, high happiness | Nordic cities |
    | Developing Centers | Mid income, high density, poor air | Emerging markets |
    | Low-Income Suburban | Low infrastructure, income | Rural areas |
    | Industrial Mega-Cities | Very high density, pollution | Manufacturing hubs |

    🛠️ Technical Details

    • Format: CSV (UTF-8)
    • Size: ~300 rows × 10 columns
    • Missing Values: 0%
    • Data Types: 2 categorical, 8 numerical
    • Target Variable: None (unsupervised)
    • Correlation Strength: Pre-validated (r: 0.4 to 0.8)

    📖 What Makes This Dataset Special?

    Unlike random synthetic data, this dataset was carefully engineered with:

    • ✨ Realistic correlation structures based on urban research
    • 🌍 Regional characteristics matching real-world patterns
    • 🎯 Optimal cluster separability (validated via silhouette scores)
    • 📚 Comprehensive documentation and starter code

    🏅 Use This Dataset If You Want To:

    ✓ Learn clustering without data cleaning hassles
    ✓ Practice PCA and dimensionality reduction
    ✓ Create beautiful geographic visualizations
    ✓ Understand feature correlation in real-world contexts
    ✓ Build a portfolio project with clear business insights

    📊 Acknowledgments

    This dataset was designed for educational purposes in machine learning and data science. While synthetic, it reflects real patterns observed in global urban development research.

    Happy Clustering! 🎉

  14. DrCyZ: Techniques for analyzing and extracting useful information from CyZ.

    • data.niaid.nih.gov
    Updated Jan 19, 2022
    de Curtò, J.; de Zarzà, I. (2022). DrCyZ: Techniques for analyzing and extracting useful information from CyZ. [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5816857
    Dataset updated
    Jan 19, 2022
    Dataset provided by
    Universitat Oberta de Catalunya
    Authors
    de Curtò, J.; de Zarzà, I.
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    DrCyZ: Techniques for analyzing and extracting useful information from CyZ.

    Samples from NASA Perseverance and a set of GAN-generated synthetic images from Neural Mars.

    Repository: https://github.com/decurtoidiaz/drcyz

    Subset of samples from (includes tools to visualize and analyse the dataset):

    CyZ: MARS Space Exploration Dataset. [https://doi.org/10.5281/zenodo.5655473]

    Images from NASA missions of the celestial body.

    Repository: https://github.com/decurtoidiaz/cyz

    Authors:

    J. de Curtò c@decurto.be

    I. de Zarzà z@dezarza.be

    File Information from DrCyZ-1.1

    • Subset of samples from Perseverance (drcyz/c).
      ∙ png (drcyz/c/png).
        PNG files (5025) selected from NASA Perseverance (CyZ-1.1) after t-SNE and K-means Clustering. 
      ∙ csv (drcyz/c/csv).
        CSV file.
    
    
    • Resized samples from Perseverance (drcyz/c+).
      ∙ png 64x64; 128x128; 256x256; 512x512; 1024x1024 (drcyz/c+/drcyz_64-1024).
        PNG files resized at the corresponding size. 
      ∙ TFRecords 64x64; 128x128; 256x256; 512x512; 1024x1024 (drcyz/c+/tfr_drcyz_64-1024).
        TFRecord files resized to the corresponding sizes for import into TensorFlow.
    
    
    • Synthetic images from Neural Mars generated using Stylegan2-ada (drcyz/drcyz+).
      ∙ png 100; 1000; 10000 (drcyz/drcyz+/drcyz_256_100-10000)
        PNG files subset of 100, 1000 and 10000 at size 256x256.
    
    
    • Network Checkpoint from Stylegan2-ada trained at size 256x256 (drcyz/model_drcyz).
      ∙ network-snapshot-000798-drcyz.pkl
    
    
    • Python notebooks to analyse the original dataset and reproduce the experiments: K-means Clustering, t-SNE, PCA, synthetic generation using Stylegan2-ada and instance segmentation using Deeplab (https://github.com/decurtoidiaz/drcyz/tree/main/dr_cyz+).
      ∙ clustering_curiosity_de_curto_and_de_zarza.ipynb
        K-means Clustering and PCA(2) with images from Curiosity.
      ∙ clustering_perseverance_de_curto_and_de_zarza.ipynb
        K-means Clustering and PCA(2) with images from Perseverance.
      ∙ tsne_curiosity_de_curto_and_de_zarza.ipynb
        t-SNE and PCA (components selected to explain 99% of variance) with images from Curiosity.
      ∙ tsne_perseverance_de_curto_and_de_zarza.ipynb
        t-SNE and PCA (components selected to explain 99% of variance) with images from Perseverance.
      ∙ Stylegan2-ada_de_curto_and_de_zarza.ipynb
        Stylegan2-ada trained on a subset of images from NASA Perseverance (DrCyZ).
      ∙ statistics_perseverance_de_curto_and_de_zarza.ipynb
        Compute statistics from synthetic samples generated by Stylegan2-ada (DrCyZ) and images from NASA Perseverance (CyZ).
      ∙ DeepLab_TFLite_ADE20k_de_curto_and_de_zarza.ipynb
        Example of instance segmentation using Deeplab with a sample from NASA Perseverance (DrCyZ).
    
  15. Additional TAU datasets for Wi-Fi fingerprinting-based positioning

    • data.niaid.nih.gov
    Updated May 13, 2020
    Lohan (2020). Additional TAU datasets for Wi-Fi fingerprinting-based positioning [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3819916
    Dataset updated
    May 13, 2020
    Dataset provided by
    TAU
    Authors
    Lohan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    1. Contents

    This document describes two datasets collected at Tampere University facilities with samples taken from a Wi-Fi network interface for experiments with indoor positioning based on Wi-Fi fingerprinting.

    To reference this dataset, please use

    E.S. Lohan et al. “Additional TAU datasets for Wi-Fi fingerprinting-based positioning” 10.5281/zenodo.3819917

    Additional reference using these datasets

    Torres-Sospedra, J.; Quezada-Gaibor, D.; Mendoza-Silva, G. M.; Nurmi, J.; Koucheryavy, Y. and Huerta, J. "New Cluster Selection and Fine-grained Search for k-Means Clustering and Wi-Fi Fingerprinting". Proceedings of the Tenth International Conference on Localization and GNSS (ICL-GNSS), 2020.

    Dataset format

    Two independent datasets are provided, they are in different folders, namely “Database_Building01” and “Database_Building02” respectively. Each dataset includes two sets of samples:

    radio map – a set of Wi-Fi samples collected at a grid of points (reference points);

    evaluation – a set of Wi-Fi samples randomly collected in the evaluation area.

    Each set consists of two files, one holding the RSS vectors and one holding the coordinates. File names for the radio map start with “rm_”; file names for the evaluation set start with “eval_”. For instance, the radio map files are:

    rm_crd.csv: holds the coordinates (x, y) and the floor identifier (z) where the samples were collected;

    rm_rss.csv: holds the measured RSSI values from each of the Access Points (AP) detected in each sample;

    All the files use the same format, and all are CSV (Comma Separated Values) plain text (UTF-8).

    Coordinates: Each sample is associated with a pair of coordinates in a 2D Euclidean reference system. The origin of the reference system was chosen arbitrarily for convenience. The units are meters, so distances between points can be easily calculated. Moreover, the floor identifier is included to enable 3D positioning.

    RSSI values: The RSSI values are provided as read from the Wi-Fi network interface through the Android API. In each sample, a value of +100 was assigned to each AP not detected during a measurement. No information is provided about the MAC addresses of the APs. However, the same column order is used for all samples in the files, meaning that the values in each column are all associated with the same AP.

    Both datasets are independent and none of the provided files include an identifier for each sample. The values in the two provided files are associated by the line number, meaning that the coordinates and RSSI values in the same line, in each file, refer to the same sample.
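
    The pairing-by-line-number rule and the +100 placeholder described above can be sketched as follows. This is a minimal illustration using only the Python standard library; it assumes (the description does not say) that neither file has a header row and that each coordinate line holds exactly x, y and the floor identifier. The function names and the tiny in-memory data are made up for the example.

```python
import csv
import io
import math

NOT_DETECTED = 100.0  # per the description, +100 marks an AP not detected in a sample


def load_samples(crd_file, rss_file):
    """Pair coordinate rows and RSSI rows by line number.

    Assumption: no header row in either file; each coordinate row is x, y, floor.
    """
    crd_rows = [[float(v) for v in row] for row in csv.reader(crd_file) if row]
    rss_rows = [[float(v) for v in row] for row in csv.reader(rss_file) if row]
    if len(crd_rows) != len(rss_rows):
        raise ValueError("coordinate and RSSI files must have the same number of lines")

    samples = []
    for (x, y, floor), rss in zip(crd_rows, rss_rows):
        # Replace the +100 placeholder with None so it is not mistaken for a strong signal.
        cleaned = [None if v == NOT_DETECTED else v for v in rss]
        samples.append({"x": x, "y": y, "floor": floor, "rss": cleaned})
    return samples


def distance_m(a, b):
    """Euclidean distance in meters between two samples in the 2D reference system."""
    return math.hypot(a["x"] - b["x"], a["y"] - b["y"])


# Made-up stand-ins for rm_crd.csv and rm_rss.csv (two samples, three APs).
crd = io.StringIO("0.0,0.0,1\n3.0,4.0,1\n")
rss = io.StringIO("-60,100,-75\n100,-50,-80\n")

samples = load_samples(crd, rss)
print(samples[0]["rss"])
print(distance_m(samples[0], samples[1]))
```

    With the real files, the two `io.StringIO` objects would be replaced by `open("rm_crd.csv", newline="")` and `open("rm_rss.csv", newline="")`. Treating +100 as missing rather than as a number matters because real RSSI readings are negative, so an untreated +100 would look like the strongest signal in the row.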

  16. K-means clustering V-measure scores.

    • plos.figshare.com
    csv
    Updated Sep 26, 2025
    Alina Troglio; Peter Konradi; Andrea Fiebig; Ariadna Pérez Garriga; Rainer Röhrig; James Dunham; Ekaterina Kutafina; Barbara Namer (2025). K-means clustering V-measure scores. [Dataset]. http://doi.org/10.1371/journal.pone.0329537.s009
    Available download formats: csv
    Dataset updated
    Sep 26, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Alina Troglio; Peter Konradi; Andrea Fiebig; Ariadna Pérez Garriga; Rainer Röhrig; James Dunham; Ekaterina Kutafina; Barbara Namer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    V-measure values for k-means clustering, provided as a CSV file with comma-separated values. (CSV)
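
    For readers unfamiliar with the metric, the V-measure is the harmonic mean of homogeneity (each cluster contains only one class) and completeness (each class falls in only one cluster). The sketch below is a dependency-free illustration of that definition, not the code used to produce this file; the function names and the toy labels are assumptions made for the example.

```python
import math
from collections import Counter


def _entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """Conditional entropy H(target | given) in nats."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items())


def v_measure(true_labels, cluster_labels):
    """Harmonic mean of homogeneity and completeness (equal weighting)."""
    h_true = _entropy(true_labels)
    h_clusters = _entropy(cluster_labels)
    # By convention a score is 1.0 when the corresponding entropy is zero.
    homogeneity = 1.0 if h_true == 0 else (
        1.0 - _conditional_entropy(true_labels, cluster_labels) / h_true
    )
    completeness = 1.0 if h_clusters == 0 else (
        1.0 - _conditional_entropy(cluster_labels, true_labels) / h_clusters
    )
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Toy labels: cluster ids are arbitrary, so a relabelled perfect clustering scores 1.
perfect = v_measure([0, 0, 1, 1], [1, 1, 0, 0])
# Putting everything in one cluster is complete but not homogeneous, so V is 0.
one_cluster = v_measure([0, 0, 1, 1], [0, 0, 0, 0])
print(perfect, one_cluster)
```

    In practice, scikit-learn's `sklearn.metrics.v_measure_score` computes the same quantity and is the more convenient choice when that library is available.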

  17. K-means clustering ARI scores.

    • plos.figshare.com
    csv
    Updated Sep 26, 2025
    Alina Troglio; Peter Konradi; Andrea Fiebig; Ariadna Pérez Garriga; Rainer Röhrig; James Dunham; Ekaterina Kutafina; Barbara Namer (2025). K-means clustering ARI scores. [Dataset]. http://doi.org/10.1371/journal.pone.0329537.s007
    Available download formats: csv
    Dataset updated
    Sep 26, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Alina Troglio; Peter Konradi; Andrea Fiebig; Ariadna Pérez Garriga; Rainer Röhrig; James Dunham; Ekaterina Kutafina; Barbara Namer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Adjusted Rand Index (ARI) values for k-means clustering, provided as a CSV file with comma-separated values. (CSV)
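
    As a reminder of what these scores measure, the Adjusted Rand Index counts the pairs of samples on which two labelings agree and corrects that count for chance agreement. The following is a small standard-library sketch of the usual pair-counting formula, offered as an illustration rather than as the code behind this file; the function name and toy labels are invented for the example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected pair-counting agreement between two labelings."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same samples")
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: nothing to disagree on

    # Pairs that share a cell of the contingency table, a row, or a column.
    same_in_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    same_in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_in_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = same_in_a * same_in_b / total_pairs
    maximum = (same_in_a + same_in_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (same_in_both - expected) / (maximum - expected)


# Toy labels: identical partitions with swapped ids agree perfectly.
same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A clustering that splits every true pair scores below zero (worse than chance).
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_partition, crossed)
```

    scikit-learn exposes the same metric as `sklearn.metrics.adjusted_rand_score`, which is the usual choice outside of a teaching context.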

  18. Track dataset of four regional varieties of South Asian monsoon low-pressure...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Jun 8, 2021
    Kieran M. R. Hunt; Akshay Deoras; Andrew G. Turner (2021). Track dataset of four regional varieties of South Asian monsoon low-pressure systems [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_4572899
    Dataset updated
    Jun 8, 2021
    Dataset provided by
    Department of Meteorology, University of Reading, UK
    National Centre for Atmospheric Science & Department of Meteorology, University of Reading, UK
    Authors
    Kieran M. R. Hunt; Akshay Deoras; Andrew G. Turner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    South Asia
    Description

    This dataset contains tracks and intensities of four regional varieties of South Asian monsoon low-pressure systems (LPSs), as identified in the ERA-Interim reanalysis dataset. A feature-tracking algorithm (Hunt et al., 2016; 2018), which identifies and links track points featuring an 850 hPa relative vorticity maximum, is used to identify LPSs. A k-means clustering technique is then used to group LPSs into four LPS varieties (Hunt and Fletcher, 2019). Only LPSs whose genesis occurred during June–September of 1979–2018 are retained in this dataset. LPSs in this dataset include monsoon low-pressure areas, depressions and deep depressions. The temporal resolution of ERA-Interim is six-hourly. A full description of the four regional LPS varieties can be found here: https://doi.org/10.1002/wea.3997

    Files

    arabian.csv: contains track details of LPSs occurring over the Arabian Sea

    bob_long.csv: contains track details of long-lived LPSs that propagate over India after their genesis over the head of the Bay of Bengal and nearby coastal regions

    bob_short.csv: contains track details of short-lived LPSs that propagate over India after their genesis over the head of the Bay of Bengal and nearby coastal regions

    srilankan.csv: contains track details of LPSs occurring over Sri Lanka and adjoining parts of the Bay of Bengal

    Columns:

    time: a time stamp showing when an LPS was present

    lon: the longitude of an LPS at a given time step

    lat: the latitude of an LPS at a given time step

    candidate_id: a random identity number for each LPS

    vort: the 850 hPa relative vorticity at the centre of an LPS at a given time step

    For further details, contact Dr Kieran M. R. Hunt (k.m.r.hunt@reading.ac.uk) or Akshay Deoras (deorasakshay@gmail.com).
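
    Because each row is one time step of one system, a track is recovered by grouping rows on candidate_id. The sketch below shows that grouping with the standard library only. It assumes (the description does not confirm this) that each CSV starts with a header row using the column names listed above; the timestamp format and the sample values are invented for illustration, so time is kept as a plain string.

```python
import csv
import io
from collections import defaultdict


def load_tracks(csv_file):
    """Group rows into tracks keyed by candidate_id, preserving file order.

    Assumption: a header row with the columns time, lon, lat, candidate_id, vort.
    """
    tracks = defaultdict(list)
    for row in csv.DictReader(csv_file):
        tracks[row["candidate_id"]].append(
            {
                "time": row["time"],  # left as text; the real format is not documented here
                "lon": float(row["lon"]),
                "lat": float(row["lat"]),
                "vort": float(row["vort"]),
            }
        )
    return dict(tracks)


def peak_vorticity(track):
    """Largest 850 hPa relative vorticity reached along one track."""
    return max(point["vort"] for point in track)


# Made-up stand-in for one of the files (e.g. bob_long.csv): two systems, three rows.
sample = io.StringIO(
    "time,lon,lat,candidate_id,vort\n"
    "1979-06-01 00:00,88.0,20.0,17,2.5e-05\n"
    "1979-06-01 06:00,87.5,20.5,17,4.0e-05\n"
    "1979-07-10 00:00,70.0,15.0,42,3.0e-05\n"
)

tracks = load_tracks(sample)
for lps_id, track in tracks.items():
    print(lps_id, len(track), peak_vorticity(track))
```

    With the real data, `sample` would be replaced by something like `open("bob_long.csv", newline="")`. Since candidate_id is described as a random identity number, it is treated here as an opaque string key rather than as a quantity.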

  19. Table5_Comparative analysis of tissue-specific genes in maize based on...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated May 9, 2023
    + more versions
    Li, Hongfu; Jiang, Yi; Wang, Zijie; Zhu, Yuzhi; Tang, Xinqiang; Liu, Zhule (2023). Table5_Comparative analysis of tissue-specific genes in maize based on machine learning models: CNN performs technically best, LightGBM performs biologically soundest.csv [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001005747
    Dataset updated
    May 9, 2023
    Authors
    Li, Hongfu; Jiang, Yi; Wang, Zijie; Zhu, Yuzhi; Tang, Xinqiang; Liu, Zhule
    Description

    Introduction: With the advancement of RNA-seq technology and machine learning, training machine learning models on large-scale RNA-seq data from databases can generally identify genes with important regulatory roles that were previously missed by standard linear analytic methodologies. Finding tissue-specific genes could improve our comprehension of the relationship between tissues and genes. However, few machine learning models for transcriptome data have been deployed and compared to identify tissue-specific genes, particularly for plants.

    Methods: In this study, an expression matrix was processed with linear models (Limma), machine learning models (LightGBM), and deep learning models (CNN) with information gain and the SHAP strategy based on 1,548 maize multi-tissue RNA-seq data obtained from a public database to identify tissue-specific genes. In terms of validation, V-measure values were computed based on k-means clustering of the gene sets to evaluate their technical complementarity. Furthermore, GO analysis and literature retrieval were used to validate the functions and research status of these genes.

    Results: Based on clustering validation, the convolutional neural network outperformed the others with a higher V-measure value of 0.647, indicating that its gene set could cover as many specific properties of various tissues as possible, whereas LightGBM discovered key transcription factors. The combination of the three gene sets produced 78 core tissue-specific genes that had previously been shown in the literature to be biologically significant.

    Discussion: Different tissue-specific gene sets were identified due to the distinct interpretation strategies of the machine learning models, and researchers may use multiple methodologies and strategies for tissue-specific gene sets based on their goals, types of data, and computational resources. This study provided comparative insight for large-scale data mining of transcriptome datasets, shedding light on resolving high dimensions and bias difficulties in bioinformatics data processing.

  20. Additional file 1 of An additional k-means clustering step improves the...

    • springernature.figshare.com
    txt
    Updated May 31, 2023
    + more versions
    Juan Botía; Jana Vandrovcova; Paola Forabosco; Sebastian Guelfi; Karishma D’Sa; John Hardy; Cathryn Lewis; Mina Ryten; Michael Weale (2023). Additional file 1 of An additional k-means clustering step improves the biological features of WGCNA gene co-expression networks [Dataset]. http://doi.org/10.6084/m9.figshare.c.3741080_D1.v1
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Juan Botía; Jana Vandrovcova; Paola Forabosco; Sebastian Guelfi; Karishma D’Sa; John Hardy; Cathryn Lewis; Mina Ryten; Michael Weale
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Lists tissues, samples and genes used for the creation of each GCN. (CSV 4 kb)
