Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example data for understanding the implementation of K-Means.
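A minimal illustration with scikit-learn, using synthetic blobs as stand-in data since the example data's columns are not described here:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in: 300 points scattered around 3 centres.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Fit k-means with k=3; n_init=10 keeps behaviour stable across sklearn versions.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # the three learned centroids
print(kmeans.labels_[:10])      # cluster assignments of the first ten points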
Wine Clustering Dataset
Overview
The Wine Clustering Dataset contains data on various chemical properties of wines, intended for use in clustering tasks. This dataset is ideal for exploring clustering algorithms such as K-Means, hierarchical clustering, and others, to group wines based on their chemical composition.
Dataset Structure
The dataset is provided as a single CSV file named wine-clustering.csv. It contains 178 entries, each representing a unique wine… See the full description on the dataset page: https://huggingface.co/datasets/mltrev23/wine-clustering.
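A minimal clustering sketch for this file, assuming (as the description suggests) that every column is a numeric chemical measurement:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# All columns are numeric chemical properties, so the frame can be scaled directly.
df = pd.read_csv('wine-clustering.csv')
X = StandardScaler().fit_transform(df)

# The classic wine data comes from three cultivars, so k=3 is a natural first try.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(pd.Series(labels).value_counts())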
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Customer Personality Analysis involves a thorough examination of a company's optimal customer profiles. This analysis facilitates a deeper understanding of customers, enabling businesses to tailor products to meet the distinct needs, behaviors, and concerns of various customer types.
By conducting a Customer Personality Analysis, businesses can refine their products based on the preferences of specific customer segments. Rather than allocating resources to market a new product to the entire customer database, companies can identify the segments most likely to be interested in the product. Subsequently, targeted marketing efforts can be directed toward those particular segments, optimizing resource utilization and increasing the likelihood of successful product adoption.
Details of Features are as below:
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
By [source]
This dataset contains a wealth of information that can be used to explore the effectiveness of various clustering algorithms. With its inclusion of numerical measurements (X, Y, Sepal.Length, and Petal.Length) and categorical values (Species), it is possible to investigate the relationship between different types of variables and clustering performance. Additionally, by comparing results across the three datasets provided, moon.csv (x and y coordinates), iris.csv (sepal and petal length measurements), and circles.csv, we can gain insight into how different data distributions affect clustering techniques such as K-Means and hierarchical clustering.
This dataset can also be a great starting point for exploring more complex clusters using higher-dimensional variables, such as colour or texture, that are present in other datasets not included here and that can help form more accurate groups in cluster analysis. It could also assist in visualization projects where clusters need to be generated, such as plotting mapped data points or examining relationships between two variables within a region drawn on a chart.
To use this dataset effectively, it is important to understand how your chosen algorithm works, since some algorithms require parameters to be specified beforehand while others handle those details automatically; otherwise, the interpretation may be invalid. Furthermore, familiarize yourself with concepts like the silhouette score and the Rand index: these are commonly used metrics that measure a clustering's performance against other clustering models, so you know whether your results reach an acceptable level of accuracy. Good luck!
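For instance, both metrics are available in scikit-learn. A short sketch, using scikit-learn's two-moons generator as a stand-in for moon.csv:

from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Two-moons data as a stand-in for moon.csv (x and y columns).
X, true_labels = make_moons(n_samples=200, noise=0.05, random_state=0)
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Silhouette: internal quality, no ground truth needed (higher is better, max 1).
print(silhouette_score(X, pred))
# Adjusted Rand index: agreement with known labels (1 = perfect match).
print(adjusted_rand_score(true_labels, pred))

On the two-moons shape, k-means scores poorly by design; exposing such failure modes is precisely what benchmark datasets like these are for.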
- Utilizing the sepal and petal lengths and widths to perform flower recognition or part of a larger image recognition pipeline.
- Classifying the data points in each dataset by the X-Y coordinates using clustering algorithms to analyze galaxy locations or overall formation patterns for stars, planets, or galaxies.
- Exploring correlations between species of flowers in terms of sepal/petal lengths by performing supervised learning tasks such as classification with this dataset.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.
File: moon.csv

| Column name | Description |
|:------------|:------------------------------------------|
| X | X coordinate of the data point. (Numeric) |
| Y | Y coordinate of the data point. (Numeric) |
File: iris.csv

| Column name | Description |
|:-------------|:----------------------------------------------|
| Sepal.Length | Length of the sepal of the flower. (Numeric) |
| Petal.Length | Length of the petal of the flower. (Numeric) |
| Species | Species of the flower. (Categorical) |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of simulated and observed salinity/temperature data which were used in the manuscript "A method for assessment of the general circulation model quality using k-means clustering algorithm" submitted to Geoscientific Model Development.
The model simulation dataset is from a long-term 3D circulation model simulation (Maljutenko and Raudsepp, 2014, 2019). The observations are from the "Baltic Sea - Eutrophication and Acidity aggregated datasets 1902/2017 v2018" (SMHI, 2018).
The files are simple comma-separated tables without headers.
The Dout-t_z_lat_lon_Smod_Sobs_Tmod_Tobs.csv file contains columns with the following variables [units]:
Time [Matlab datenum units], vertical coordinate [m], latitude [°N], longitude [°E], model salinity [g/kg], observed salinity [g/kg], model temperature [°C], observed temperature [°C].
The Dout-t_z_lat_lon_dS_dT_K1_K2_K3_K4_K5_K6_K7_K8_K9.csv file contains columns with the following variables [units]:
The first four columns are the same as in the previous file, followed by the salinity error [g/kg] and temperature error [°C]; the remaining columns (K1-K9) are integers showing the cluster to which each error pair is assigned.
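For working with these header-less files outside Matlab, a minimal pandas sketch (the column names below are invented labels following the order just described):

import pandas as pd

# Invented column labels matching the documented order; the file has no header row.
cols = ['time', 'depth_m', 'lat', 'lon', 'S_mod', 'S_obs', 'T_mod', 'T_obs']
df = pd.read_csv('Dout-t_z_lat_lon_Smod_Sobs_Tmod_Tobs.csv', header=None, names=cols)

# Matlab datenum counts days from year 0; 719529 is the datenum of 1970-01-01.
df['time'] = pd.to_datetime(df['time'] - 719529, unit='D')
print(df.head())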
do_clust_valid_DataFig.m is a Matlab script which reads the two CSV files (and, optionally, the mask file Model_mask.mat), performs the clustering analysis, and creates the plots used in the manuscript. The script is organized into %% blocks which can be executed separately (by default, Ctrl+Enter).
The k-means function used is from the Matlab Statistics and Machine Learning Toolbox.
Additional software used in do_clust_valid_DataFig.m:
Author's auxiliary formatting scripts (in script/):
datetick_cst.m
do_fitfig.m
do_skipticks.m
do_skipticks_y.m
Colormaps are generated using cbrewer.m (Charles, 2021).
Moving average smoothing is performed using nanmoving_average.m (Aguilera, 2021).
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset collection comprises 15 diverse two-dimensional datasets specifically designed for clustering analysis. Each dataset contains three columns: x, y, and target, where x and y represent the coordinates of the data points, and target indicates the cluster label.
[Figure: visualisation of the 15 datasets]
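A short sketch of how one of these files might be used (the filename is a placeholder for any of the 15 datasets):

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

df = pd.read_csv('dataset.csv')  # placeholder: substitute any of the 15 files
X = df[['x', 'y']].values

k = df['target'].nunique()  # take k from the provided labels
pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
print(adjusted_rand_score(df['target'], pred))  # 1.0 means perfect recovery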
Data underlying Fig 2D. Cohort means are the weighted average temperature coefficient for all hybrids that first appeared in the dataset in the indicated year. Only temperature bins 30 to >41°C inclusive are included. Columns include year, temperature (°C), cohort mean coefficient, and cluster. (CSV)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
- input_prompts.csv provides the inputs for the ChatGPT API (countries and their respective prompts).
- topic_consolidations.csv contains the 4,018 unique topics listed across all ChatGPT responses to prompts in our study and their corresponding cluster labels after applying K-means++ clustering (n = 50) via natural language processing with Bidirectional Encoder Representations from Transformers (BERT). ChatGPT response topics come from both versions (3.5 and 4) over 10 iterations each (per country).
- ChatGPT_prompt_automation.ipynb is the Jupyter notebook of Python code used to run the API to prompt ChatGPT and gather responses.
- topic_consolidation_BERT.ipynb is the Jupyter notebook of Python code used to process the 4,018 unique topics gathered, through BERT NLP. This code was adapted from Vimal Pillar on Kaggle (https://www.kaggle.com/code/vimalpillai/text-clustering-with-sentence-bert).
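A rough sketch of the topic-clustering step; the sentence-BERT model and the 'topic' column name are illustrative assumptions (the notebooks above define the actual pipeline):

import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Column name 'topic' is assumed; check the CSV header before running.
topics = pd.read_csv('topic_consolidations.csv')['topic'].tolist()

# Any sentence-BERT encoder yields usable embeddings; this model choice is illustrative.
embeddings = SentenceTransformer('all-MiniLM-L6-v2').encode(topics)

# k-means++ initialisation with n = 50 clusters, as in the study.
labels = KMeans(n_clusters=50, init='k-means++', n_init=10, random_state=0).fit_predict(embeddings)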
About Dataset
- Based on patient symptoms, identify patients needing immediate resuscitation; assign patients to a predesignated patient care area, thereby prioritizing their care; and initiate diagnostic/therapeutic measures as appropriate.
- Three individual datasets were used for three urgent illnesses/injuries. Each dataset has its own features and symptoms for each patient, and we merged them to determine the most severe symptoms for each illness and give them treatment priority.
PROJECT SUMMARY
Triage refers to the sorting of injured or sick people according to their need for emergency medical attention. It is a method of determining priority for who gets care first.
BACKGROUND
Triage is the prioritization of patient care (or of victims during a disaster) based on illness/injury, symptoms, severity, prognosis, and resource availability. The purpose of triage is to identify patients needing immediate resuscitation; to assign patients to a predesignated patient care area, thereby prioritizing their care; and to initiate diagnostic/therapeutic measures as appropriate.
BUSINESS CHALLENGE
Based on patient symptoms, identify patients needing immediate resuscitation; assign patients to a predesignated patient care area, thereby prioritizing their care; and initiate diagnostic/therapeutic measures as appropriate.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The table of GECO scores generated after clustering with k-means using each value of k listed in the first column ‘k’. (CSV)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset used in this study integrates quantitative data on student learning behaviors, engagement patterns, demographics, and academic performance. It was compiled by merging two publicly available Kaggle datasets, resulting in a combined file (“merged_dataset.csv”) containing 14,003 student records with 16 attributes. All records are anonymized and contain no personally identifiable information.
The dataset covers the following categories of variables:
Resource access and learning environment: Resources, Internet, EduTech
Motivation and psychological factors: Motivation, StressLevel
Demographic information: Gender, Age (ranging from 18 to 30 years)
Learning preference classification: LearningStyle
Academic performance indicators: ExamScore, FinalGrade
In this study, “ExamScore” and “FinalGrade” served as the primary performance indicators. The remaining variables were used to derive behavioral and contextual profiles, which were clustered using unsupervised machine learning techniques.
The analysis and modeling were implemented in Python through a structured Jupyter Notebook (“Project.ipynb”), which included the following main steps:
Environment Setup – Import of essential libraries (NumPy, pandas, Matplotlib, Seaborn, SciPy, StatsModels, scikit-learn, imbalanced-learn) and visualization configuration.
Data Import and Integration – Loading the two source CSV files, harmonizing columns, removing irrelevant attributes, aligning formats, handling missing values, and merging them into a unified dataset (merged_dataset.csv).
Data Preprocessing –
Encoding categorical variables using LabelEncoder.
Scaling features using both z-score standardization (for statistical tests and PCA) and Min–Max normalization (for clustering).
Detecting and removing duplicates.
Clustering Analysis –
Applying K-Means clustering to segment learners into distinct profiles.
Determining the optimal number of clusters using the Elbow Method and Silhouette Score.
Evaluating cluster quality with internal metrics (Silhouette Score, Davies–Bouldin Index); a minimal sketch of these clustering steps follows this list.
Dimensionality Reduction & Visualization – Using PCA for 2D/3D cluster visualization and feature importance exploration.
Mapping Clusters to Learning Styles – Associating each identified cluster with the most relevant learning style model based on feature patterns and alignment scores.
Statistical Analysis – Conducting ANOVA and regression to test for significant differences in performance between clusters.
Interpretation & Practical Recommendations – Analyzing cluster-specific characteristics and providing implications for adaptive and mobile learning integration.
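As referenced in step 4 above, a minimal sketch of the scaling, clustering, and PCA steps (assuming merged_dataset.csv is available and categorical columns have already been label-encoded as in step 3):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.decomposition import PCA

df = pd.read_csv('merged_dataset.csv')
X = MinMaxScaler().fit_transform(df.select_dtypes('number'))  # Min-Max scaling for clustering

# Scan candidate k; higher silhouette and lower Davies-Bouldin indicate better clusters.
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels), davies_bouldin_score(X, labels))

# 2D PCA projection of the scaled features for visualisation.
coords = PCA(n_components=2).fit_transform(X)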
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset contains the data used for all statistical analyses in our publication "Singapore Soundscape Site Selection Survey (S5): Identification of Characteristic Soundscapes of Singapore via Weighted k-means Clustering", summarised in a single .csv file.
For more details on the study methodology, please refer to our manuscript: Ooi, K.; Lam, B.; Hong, J.; Watcharasupat, K. N.; Ong, Z.-T.; Gan, W.-S. Singapore Soundscape Site Selection Survey (S5): Identification of Characteristic Soundscapes of Singapore via Weighted k-means Clustering. Sustainability, 2022.
For our replication code utilising this data, please refer to our GitHub repository: https://github.com/ntudsp/singapore-soundscape-site-selection-survey
The .csv file contains, for each of the four soundscape descriptors ("Full of life & exciting", "Chaotic & restless", "Calm & tranquil", "Boring & lifeless"), four columns:
- [Latitude]: the latitude, in degrees, of the location chosen by the participant for that descriptor.
- [Longitude]: the longitude, in degrees, of the location chosen by the participant for that descriptor.
- [# times visited]: the number of times the participant had visited the chosen location before, as reported by the participant.
- [Duration]: the average duration per visit to the chosen location, as reported by the participant.
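scikit-learn's KMeans accepts per-sample weights, so a weighted clustering of, say, the "Full of life & exciting" locations might look like the sketch below; the filename, the number of clusters, and the use of visit counts as weights are illustrative assumptions, not the paper's exact scheme.

import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv('s5_data.csv')  # placeholder name for the provided .csv file
X = df[['Full of life & exciting [Latitude]',
        'Full of life & exciting [Longitude]']]
w = df['Full of life & exciting [# times visited]']

# sample_weight makes frequently visited locations pull the centroids harder.
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X, sample_weight=w)
print(km.cluster_centers_)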
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This synthetic dataset simulates 300 global cities across 6 major geographic regions, designed specifically for unsupervised machine learning and clustering analysis. It explores how economic status, environmental quality, infrastructure, and digital access shape urban lifestyles worldwide.
| Aspect | Description | Notes |
|---|---|---|
| 10 Features | Economic, environmental & social indicators | Realistically scaled |
| 300 Cities | Europe, Asia, Americas, Africa, Oceania | Diverse distributions |
| Strong Correlations | Income ↔ Rent (+0.8), Density ↔ Pollution (+0.6) | ML-ready |
| No Missing Values | Clean, preprocessed data | Ready for analysis |
| 4-5 Natural Clusters | Metropolitan hubs, eco-towns, developing centers | Pre-validated |
✅ Realistic Correlations: Income strongly predicts rent (+0.8), internet access (+0.7), and happiness (+0.6)
✅ Regional Diversity: Each region has distinct economic and environmental characteristics
✅ Clustering-Ready: Naturally separable into 4-5 lifestyle archetypes
✅ Beginner-Friendly: No data cleaning required, includes example code
✅ Documented: Comprehensive README with methodology and use cases
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load and prepare: drop the non-numeric identifier columns before scaling.
df = pd.read_csv('city_lifestyle_dataset.csv')
X = df.drop(['city_name', 'country'], axis=1)
X_scaled = StandardScaler().fit_transform(X)

# Cluster: n_init=10 gives stable results across scikit-learn versions.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)

# Analyze: numeric_only=True avoids errors on the string columns still in df.
print(df.groupby('cluster').mean(numeric_only=True))
After working with this dataset, you will be able to:
1. Apply K-Means, DBSCAN, and Hierarchical Clustering
2. Use PCA for dimensionality reduction and visualization
3. Interpret correlation matrices and feature relationships
4. Create geographic visualizations with cluster assignments
5. Profile and name discovered clusters based on characteristics
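Continuing from the starter snippet above (reusing its X_scaled and df), a quick PCA projection for objective 2 might look like:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# X_scaled and df come from the starter snippet above.
coords = PCA(n_components=2).fit_transform(X_scaled)
plt.scatter(coords[:, 0], coords[:, 1], c=df['cluster'], cmap='tab10', s=15)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('City lifestyle clusters in PCA space')
plt.show()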
| Cluster | Characteristics | Example Cities |
|---|---|---|
| Metropolitan Tech Hubs | High income, density, rent | Silicon Valley, Singapore |
| Eco-Friendly Towns | Low density, clean air, high happiness | Nordic cities |
| Developing Centers | Mid income, high density, poor air | Emerging markets |
| Low-Income Suburban | Low infrastructure, income | Rural areas |
| Industrial Mega-Cities | Very high density, pollution | Manufacturing hubs |
Unlike random synthetic data, this dataset was carefully engineered with:
- ✨ Realistic correlation structures based on urban research
- 🌍 Regional characteristics matching real-world patterns
- 🎯 Optimal cluster separability (validated via silhouette scores)
- 📚 Comprehensive documentation and starter code
✓ Learn clustering without data cleaning hassles
✓ Practice PCA and dimensionality reduction
✓ Create beautiful geographic visualizations
✓ Understand feature correlation in real-world contexts
✓ Build a portfolio project with clear business insights
This dataset was designed for educational purposes in machine learning and data science. While synthetic, it reflects real patterns observed in global urban development research.
Happy Clustering! 🎉
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
DrCyZ: Techniques for analyzing and extracting useful information from CyZ.
Samples from NASA Perseverance and a set of GAN-generated synthetic images from Neural Mars.
Repository: https://github.com/decurtoidiaz/drcyz
Subset of samples from the following dataset (which includes tools to visualize and analyse the data):
CyZ: MARS Space Exploration Dataset. [https://doi.org/10.5281/zenodo.5655473]
Images from NASA missions of the celestial body.
Repository: https://github.com/decurtoidiaz/cyz
Authors:
J. de Curtò c@decurto.be
I. de Zarzà z@dezarza.be
• Subset of samples from Perseverance (drcyz/c).
∙ png (drcyz/c/png).
PNG files (5025) selected from NASA Perseverance (CyZ-1.1) after t-SNE and K-means Clustering.
∙ csv (drcyz/c/csv).
CSV file.
• Resized samples from Perseverance (drcyz/c+).
∙ png 64x64; 128x128; 256x256; 512x512; 1024x1024 (drcyz/c+/drcyz_64-1024).
PNG files resized at the corresponding size.
∙ TFRecords 64x64; 128x128; 256x256; 512x512; 1024x1024 (drcyz/c+/tfr_drcyz_64-1024).
TFRecords resized at the corresponding size, for import into TensorFlow.
• Synthetic images from Neural Mars generated using Stylegan2-ada (drcyz/drcyz+).
∙ png 100; 1000; 10000 (drcyz/drcyz+/drcyz_256_100-10000)
PNG files subset of 100, 1000 and 10000 at size 256x256.
• Network Checkpoint from Stylegan2-ada trained at size 256x256 (drcyz/model_drcyz).
∙ network-snapshot-000798-drcyz.pkl
• Notebooks in Python to analyse the original dataset and reproduce the experiments: K-means Clustering, t-SNE, PCA, synthetic generation using Stylegan2-ada, and instance segmentation using Deeplab (https://github.com/decurtoidiaz/drcyz/tree/main/dr_cyz+).
∙ clustering_curiosity_de_curto_and_de_zarza.ipynb
K-means Clustering and PCA(2) with images from Curiosity.
∙ clustering_perseverance_de_curto_and_de_zarza.ipynb
K-means Clustering and PCA(2) with images from Perseverance.
∙ tsne_curiosity_de_curto_and_de_zarza.ipynb
t-SNE and PCA (components selected to explain 99% of variance) with images from Curiosity.
∙ tsne_perseverance_de_curto_and_de_zarza.ipynb
t-SNE and PCA (components selected to explain 99% of variance) with images from Perseverance.
∙ Stylegan2-ada_de_curto_and_de_zarza.ipynb
Stylegan2-ada trained on a subset of images from NASA Perseverance (DrCyZ).
∙ statistics_perseverance_de_curto_and_de_zarza.ipynb
Compute statistics from synthetic samples generated by Stylegan2-ada (DrCyZ) and images from NASA Perseverance (CyZ).
∙ DeepLab_TFLite_ADE20k_de_curto_and_de_zarza.ipynb
Example of instance segmentation using Deeplab with a sample from NASA Perseverance (DrCyZ).
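A condensed sketch of the PCA-plus-k-means pipeline those notebooks implement (the path, image size, and number of clusters below are illustrative, not the notebooks' exact settings):

import glob
import numpy as np
from PIL import Image
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Load the resized 64x64 PNGs as flat grayscale vectors (path is illustrative).
files = sorted(glob.glob('drcyz/c+/drcyz_64/*.png'))
X = np.array([np.asarray(Image.open(f).convert('L'), dtype=float).ravel()
              for f in files])

# PCA keeping components that explain 99% of the variance, then k-means.
Xp = PCA(n_components=0.99).fit_transform(X)
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(Xp)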
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This document describes two datasets collected at Tampere University facilities with samples taken from a Wi-Fi network interface for experiments with indoor positioning based on Wi-Fi fingerprinting.
To reference this dataset, please use:
E. S. Lohan et al., "Additional TAU datasets for Wi-Fi fingerprinting-based positioning", doi:10.5281/zenodo.3819917
An additional reference using these datasets:
Torres-Sospedra, J.; Quezada-Gaibor, D.; Mendoza-Silva, G. M.; Nurmi, J.; Koucheryavy, Y.; Huerta, J. "New Cluster Selection and Fine-grained Search for k-Means Clustering and Wi-Fi Fingerprinting", Proceedings of the Tenth International Conference on Localization and GNSS (ICL-GNSS), 2020.
Dataset format
Two independent datasets are provided in separate folders, namely "Database_Building01" and "Database_Building02". Each dataset includes two sets of samples:
radio map – a set of Wi-Fi samples collected at a grid of points (reference points);
evaluation – a set of Wi-Fi samples randomly collected in the evaluation area.
Two files are provided for each set: one with the RSS vectors and one with the coordinates. The radio map files have names starting with "rm_"; the evaluation files have names starting with "eval_". For instance, for the radio map they are:
rm_crd.csv: holds the coordinates (x, y) and floor identifier (z) where the samples were collected;
rm_rss.csv: holds the measured RSSI values from each of the Access Points (APs) detected in each sample.
All files share the same format: CSV (comma-separated values) plain text (UTF-8).
Coordinates: Each sample is associated with a pair of coordinates in a 2D Euclidean reference system. The origin of the reference system was chosen arbitrarily for convenience. The units are meters, so distances between points can be easily calculated. Moreover, the floor identifier is included to enable 3D positioning.
RSSI values: The RSSI values are provided as read from the Wi-Fi network interface through the Android API. In each sample, a value of +100 was assigned to each AP not detected during the measurement. No information is provided about the MAC addresses of the APs; however, the same column order is used for all samples, so the values in each column are all associated with the same AP.
Both datasets are independent, and none of the provided files includes an identifier for each sample. The values in the two provided files are associated by line number: the coordinates and RSSI values on the same line, in each file, refer to the same sample.
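A small loading sketch in Python (whether the files carry a header row is not stated, so header=None, the -105 dBm floor, and the cluster count are assumptions):

import pandas as pd
from sklearn.cluster import KMeans

rss = pd.read_csv('Database_Building01/rm_rss.csv', header=None)
crd = pd.read_csv('Database_Building01/rm_crd.csv', header=None)

# +100 marks an AP not detected in a sample; replace it with a weak floor value.
rss = rss.replace(100, -105)

# Coarse k-means over the radio map, in the spirit of the cluster-selection paper above.
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(rss)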
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
V-measure values for k-means clustering, provided as a CSV file with comma-separated values. (CSV)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Adjusted Rand Index (ARI) values for k-means clustering, provided as a CSV file with comma-separated values. (CSV)
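For reference, both of these external validation metrics (the ARI here and the V-measure above) are available in scikit-learn; a toy illustration:

from sklearn.metrics import adjusted_rand_score, v_measure_score

true_labels   = [0, 0, 1, 1, 2, 2]  # toy ground-truth partition
kmeans_labels = [1, 1, 0, 0, 2, 2]  # toy k-means output with permuted cluster ids

# Both metrics ignore label permutations, so identical partitions score 1.0.
print(adjusted_rand_score(true_labels, kmeans_labels))  # -> 1.0
print(v_measure_score(true_labels, kmeans_labels))      # -> 1.0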
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains tracks and intensities of four regional varieties of South Asian monsoon low-pressure systems (LPSs), as identified in the ERA-Interim reanalysis. A feature-tracking algorithm (Hunt et al., 2016; 2018), based on identifying and linking track points featuring an 850 hPa relative vorticity maximum, is used to identify LPSs. A k-means clustering technique is then used to group the LPSs into four varieties (Hunt and Fletcher, 2019). Only those LPSs whose genesis occurred during June–September of 1979–2018 are retained in this dataset. LPSs in this dataset include monsoon low-pressure areas, depressions, and deep depressions. The temporal resolution of ERA-Interim is six-hourly. A full description of the four regional LPS varieties can be found here: https://doi.org/10.1002/wea.3997
Files
arabian.csv: contains track details of LPSs occurring over the Arabian Sea
bob_long.csv: contains track details of long-lived LPSs that propagate over India after their genesis over the head of the Bay of Bengal and nearby coastal regions
bob_short.csv: contains track details of short-lived LPSs that propagate over India after their genesis over the head of the Bay of Bengal and nearby coastal regions
srilankan.csv: contains track details of LPSs occurring over Sri Lanka and adjoining parts of the Bay of Bengal
Columns:
time: a time stamp showing when an LPS was present
lon: the longitude of an LPS at a given time step
lat: the latitude of an LPS at a given time step
candidate_id: a random identity number for each LPS
vort: the 850 hPa relative vorticity at the centre of an LPS at a given time step
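A short pandas sketch for reconstructing individual tracks (assuming the CSV header matches the column names listed above):

import pandas as pd

tracks = pd.read_csv('arabian.csv')

# Group the six-hourly track points by LPS and summarise lifetime and peak intensity.
per_lps = tracks.groupby('candidate_id').agg(
    n_steps=('time', 'size'),   # number of six-hourly time steps
    peak_vort=('vort', 'max'),  # maximum 850 hPa relative vorticity
)
print(per_lps.sort_values('peak_vort', ascending=False).head())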
For further details, contact Dr Kieran M. R. Hunt (k.m.r.hunt@reading.ac.uk) or Akshay Deoras (deorasakshay@gmail.com).
Introduction: With the advancement of RNA-seq technology and machine learning, training large-scale RNA-seq data from databases with machine learning models can often identify genes with important regulatory roles that were previously missed by standard linear analytic methodologies. Finding tissue-specific genes could improve our comprehension of the relationship between tissues and genes. However, few machine learning models for transcriptome data have been deployed and compared for identifying tissue-specific genes, particularly for plants.
Methods: In this study, an expression matrix was processed with linear models (Limma), machine learning models (LightGBM), and deep learning models (CNN), with information gain and the SHAP strategy, based on 1,548 maize multi-tissue RNA-seq samples obtained from a public database, to identify tissue-specific genes. For validation, V-measure values were computed based on k-means clustering of the gene sets to evaluate their technical complementarity. Furthermore, GO analysis and literature retrieval were used to validate the functions and research status of these genes.
Results: Based on clustering validation, the convolutional neural network outperformed the others with the highest V-measure value of 0.647, indicating that its gene set could cover as many specific properties of various tissues as possible, whereas LightGBM discovered key transcription factors. The combination of the three gene sets produced 78 core tissue-specific genes that had previously been shown in the literature to be biologically significant.
Discussion: Different tissue-specific gene sets were identified owing to the distinct interpretation strategies of the machine learning models; researchers may use multiple methodologies and strategies for tissue-specific gene sets depending on their goals, types of data, and computational resources. This study provides comparative insight for large-scale data mining of transcriptome datasets, shedding light on resolving high-dimensionality and bias difficulties in bioinformatics data processing.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Lists tissues, samples and genes used for the creation of each GCN. (CSV 4 kb)