Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example data to understand the implementation of K-Means.
This dataset was created by Syed Touqeer.
This dataset was created by Prasun Varshney.
Wine Clustering Dataset
Overview
The Wine Clustering Dataset contains data on various chemical properties of wines, intended for use in clustering tasks. This dataset is ideal for exploring clustering algorithms such as K-Means, hierarchical clustering, and others, to group wines based on their chemical composition.
Dataset Structure
The dataset is provided as a single CSV file named wine-clustering.csv. It contains 178 entries, each representing a unique wine… See the full description on the dataset page: https://huggingface.co/datasets/mltrev23/wine-clustering.
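As a quick illustration of the intended use, the sketch below standardizes the measurements and runs K-Means with scikit-learn. It assumes the CSV contains only numeric chemical-property columns, as described above; the actual column layout should be checked against the file.

```python
# Minimal K-Means sketch for wine-clustering.csv. Assumes every column is
# a numeric chemical measurement (no label column), per the description.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("wine-clustering.csv")

# The chemical properties live on very different scales, so standardize
# before computing Euclidean distances.
X = StandardScaler().fit_transform(df)

# The classic wine data has three cultivars, so k=3 is a natural first guess.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(X)

# Per-cluster means of each chemical property characterize the groups.
print(df.groupby("cluster").mean().round(2))
```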
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Customer Personality Analysis involves a thorough examination of a company's optimal customer profiles. This analysis facilitates a deeper understanding of customers, enabling businesses to tailor products to meet the distinct needs, behaviors, and concerns of various customer types.
By conducting a Customer Personality Analysis, businesses can refine their products based on the preferences of specific customer segments. Rather than allocating resources to market a new product to the entire customer database, companies can identify the segments most likely to be interested in the product. Subsequently, targeted marketing efforts can be directed toward those particular segments, optimizing resource utilization and increasing the likelihood of successful product adoption.
Details of the features are as follows:
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
The datasets are generated using either Gaussian or Uniform distributions. Each dataset contains several known sub-groups intended for testing centroid-based clustering results and cluster validity indices.
Cluster analysis is a popular machine learning technique for segmenting datasets by placing similar data points in the same group. For those who are familiar with R, there is a new R package called "UniversalCVI" (https://CRAN.R-project.org/package=UniversalCVI) for cluster evaluation. This package provides algorithms for checking the accuracy of a clustering result against known classes, computing cluster validity indices, and generating plots for comparing them. The package is compatible with K-means, fuzzy C-means, EM clustering, and hierarchical clustering (single, average, and complete linkage). To use the "UniversalCVI" package, one can follow the instructions provided in the R documentation.
For more in-depth details of the package and cluster evaluation, please see the papers https://doi.org/10.1016/j.patcog.2023.109910 and https://arxiv.org/abs/2308.14785, and the accompanying repository: https://github.com/O-PREEDASAWAKUL/FuzzyDatasets.git
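The workflow these datasets support can also be sketched in Python (UniversalCVI itself is R-only, so this is an analogue, not a port): generate Gaussian data with known sub-groups, cluster it, and score the result both against the known classes and with an internal validity index.

```python
# Illustrative Python analogue of the evaluation workflow described above:
# Gaussian data with known sub-groups, K-means, then an external index
# (against known classes) and an internal validity index. This is not a
# port of the UniversalCVI R package.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Three Gaussian sub-groups with known membership.
X, y_true = make_blobs(n_samples=600, centers=3, cluster_std=1.2, random_state=42)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# External index: agreement between the clustering and the known classes.
print("Adjusted Rand index:", round(adjusted_rand_score(y_true, labels), 3))
# Internal index: cohesion/separation computed without the true labels.
print("Silhouette score:", round(silhouette_score(X, labels), 3))
```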
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that generally the $R^2$ can be artificially inflated by linearly transforming the data by "stretching" and by projecting. Also, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering for misspecified models. Several simulation illustrations are provided highlighting weaknesses in the clustering $R^2$, especially in high-dimensional settings. A functional data example is given showing how the $R^2$ for clustering can vary dramatically depending on how the curves are estimated.
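A minimal sketch of the statistic and of the stretching effect, assuming the usual definition $R^2 = 1 - SS_{within}/SS_{total}$ with the partition held fixed:

```python
# Sketch of the clustering R^2 and of inflation by "stretching".
# R^2 = 1 - SS_within / SS_total, with the same partition used throughout.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

def cluster_r2(X, labels):
    sst = ((X - X.mean(axis=0)) ** 2).sum()
    ssw = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
              for k in np.unique(labels))
    return 1.0 - ssw / sst

# Two clusters whose means differ only along the first coordinate.
X, _ = make_blobs(n_samples=400, centers=[[-4, 0], [4, 0]],
                  cluster_std=2.0, random_state=1)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("R^2, original data: ", round(cluster_r2(X, labels), 3))

# Stretch the separating coordinate: the identical partition now
# "explains" a larger share of the total variance.
X_stretched = X.copy()
X_stretched[:, 0] *= 10
print("R^2, stretched data:", round(cluster_r2(X_stretched, labels), 3))
```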
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
In this work, we propose the centralized shared delivery service model. As a centralized model, the delivery service is handled by central management, so that coordination of the delivery process among the vehicles can be more efficient.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Project Overview: Customer Segmentation Using K-Means Clustering
Introduction
In this project, I analysed customer data from a retail store to identify distinct customer segments. The dataset includes key attributes such as age, city, and total sales of the customers. By leveraging K-Means clustering, an unsupervised machine learning technique, I aim to group customers based on their age and sales metrics. These insights will enable the creation of targeted marketing campaigns tailored to the specific needs and behaviours of each customer segment.
Objectives
- Cluster Customers: Use K-Means clustering to group customers based on age and total sales.
- Analyse Segments: Examine the characteristics of each customer segment.
- Targeted Marketing: Develop strategies for personalized marketing campaigns targeting each identified customer group.
Data Description
The dataset comprises:
Methodology
- Data Preprocessing: Clean and preprocess the data to handle any missing or inconsistent entries.
- Feature Selection: Focus on age and total sales as primary features for clustering.
- K-Means Clustering: Apply the K-Means algorithm to identify distinct customer segments.
- Cluster Analysis: Analyse the resulting clusters to understand the demographic and sales characteristics of each group.
- Marketing Strategy Development: Create targeted marketing strategies for each customer segment to enhance engagement and sales.
Expected Outcomes
- Customer Segments: Clear identification of customer groups based on age and purchasing behaviour.
- Insights for Marketing: Detailed understanding of each segment to inform targeted marketing efforts.
- Business Impact: Enhanced ability to tailor marketing campaigns, potentially leading to increased customer satisfaction and sales.
By clustering customers based on age and total sales, this project aims to provide actionable insights for personalized marketing, ultimately driving better customer engagement and higher sales for the retail store.
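A condensed sketch of the pipeline described above; the file name and the `age`/`total_sales` column names are placeholders, not taken from the actual dataset.

```python
# Condensed sketch of the project pipeline. The file name and the
# "age"/"total_sales" columns are placeholders for the real dataset.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("customers.csv")  # hypothetical file name

# Data preprocessing: keep the selected features, drop missing entries.
features = df[["age", "total_sales"]].dropna().copy()

# Age and sales sit on different scales, so standardize before clustering.
X = StandardScaler().fit_transform(features)

# K-Means clustering; k=4 is a starting point to be tuned (e.g. via the
# elbow method or silhouette scores).
features["segment"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Cluster analysis: per-segment profiles to guide targeted marketing.
print(features.groupby("segment").agg(["mean", "count"]))
```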
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We propose a new unsupervised learning method for clustering a large number of time series based on a latent factor structure. Each cluster is characterized by its own cluster-specific factors in addition to some common factors which impact all the time series concerned. Our setting also offers the flexibility that some time series may not belong to any cluster. The consistency with explicit convergence rates is established for the estimation of the common factors, the cluster-specific factors, and the latent clusters. Numerical illustration with both simulated data and a real data example is also reported. As a spin-off, the proposed new approach also significantly advances the statistical inference for the factor model of Lam and Yao. Supplementary materials for this article are available online.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset captures Instagram users' visit scores and their spending rank (on a scale of 0 to 100). The goal is to analyze and group users into distinct clusters based on their behaviors, enabling insights into user engagement and spending potential. The dataset is suitable for unsupervised machine learning techniques like K-Means clustering, which can help identify patterns and group users effectively.
Dataset Highlights:
- Ideal for practicing clustering algorithms.
- Small and easy-to-handle dataset.
- Includes key metrics for user behavior analysis.
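For instance, a sketch of the K-Means workflow on this data, choosing k with the elbow method; the file and column names (`visit_score`, `spending_rank`) are assumptions, not documented fields.

```python
# Sketch: choose k with the elbow method, then cluster the users.
# File and column names are assumptions about this dataset.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("instagram_users.csv")  # hypothetical file name
X = StandardScaler().fit_transform(df[["visit_score", "spending_rank"]])

# Elbow method: inertia (within-cluster sum of squares) for a range of k;
# pick the k where the decrease visibly levels off.
for k in range(1, 9):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"k={k}: inertia={inertia:.1f}")

# Cluster with the k chosen from the elbow (3 is only an example).
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```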
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Stratified K-Means Diverse Reasoning Dataset (100K-1M)
A carefully balanced subset of NVIDIA's Llama-Nemotron Post-Training Dataset, featuring square-root rebalanced sampling across math, code, science, instruction-following, chat, and safety tasks at multiple scales.
Authors
Aman Priyanshu
Supriti Vijay
Overview
This dataset provides stratified subsets at 50k, 100k, 250k, 500k, and 1M scales from the Llama-Nemotron… See the full description on the dataset page: https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-reasoning-100K-1M.
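The square-root rebalancing itself is easy to sketch: each task category receives a sampling quota proportional to the square root of its size, which narrows the gap between large and small categories compared to proportional sampling. The sketch below is illustrative only and is not the authors' pipeline; the `(item, category)` input format is an assumption.

```python
# Sketch of square-root rebalanced sampling: each category's quota is
# proportional to sqrt(category size), so dominant categories are
# down-weighted relative to proportional sampling. Illustrative only;
# not the authors' actual pipeline.
import math
import random
from collections import defaultdict

def sqrt_rebalanced_sample(examples, target_size, seed=0):
    """examples: iterable of (item, category) pairs."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for item, cat in examples:
        by_cat[cat].append(item)

    # Quota per category ~ sqrt(category size), normalized to target_size.
    weights = {cat: math.sqrt(len(items)) for cat, items in by_cat.items()}
    total = sum(weights.values())

    sample = []
    for cat, items in by_cat.items():
        quota = min(len(items), round(target_size * weights[cat] / total))
        sample.extend(rng.sample(items, quota))
    return sample
```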
Stratified K-Means Diverse Instruction-Following Dataset (100K-1M)
A carefully balanced subset combining Tulu-3 SFT Mixture and Orca AgentInstruct, featuring embedding-based k-means sampling across diverse instruction-following tasks at multiple scales.
Authors
Aman Priyanshu
Supriti Vijay
Overview
This dataset provides stratified subsets at 50k, 100k, 250k, 500k, and 1M scales, combining high-quality instruction-following data from… See the full description on the dataset page: https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-instruction-following-100K-1M.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of data mining analysis is always to find patterns in the data using certain kinds of techniques, such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before doing any work on the data, the data has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance has not improved much. The reason could be that the features we selected to perform clustering on are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

From the dimensionality reduction perspective: clustering is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique for reducing the data dimension can lose a lot of information, since clustering techniques are based on a metric of "distance", and at high dimensions Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

From the creating-new-features perspective: clustering analysis creates labels based on the patterns of the data, and it brings uncertainties into the data. When using clustering prior to classification, the decision on the number of clusters will strongly affect the performance of the clustering, and in turn the performance of classification. If the subset of features we apply clustering techniques to is well suited for it, it might increase the overall performance of classification. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better.

We did not lock in the clustering outputs using a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, maybe the data just does not cluster well with the methods selected at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering in the data preprocessing.

Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model's real-world effectiveness and also to continue to revise the models from time to time as things change.
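A sketch of the kind of comparison described above, on synthetic data: cluster-derived features (here, distances to K-Means centroids) are appended before classification and the cross-validated score is compared with the baseline. This is an illustration of the approach, not the project's actual code.

```python
# Sketch of the comparison discussed above: append cluster-derived
# features (distances to K-means centroids) before classification and
# compare cross-validated scores. Synthetic data stands in for the
# project's dataset; this is an illustration, not the project code.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X = StandardScaler().fit_transform(X)

baseline = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

# fit_transform returns each point's distance to every centroid, a soft
# version of "mapping data points to cluster numbers". (For a rigorous
# comparison, fit K-means inside each CV fold, e.g. with a Pipeline.)
dists = KMeans(n_clusters=8, n_init=10, random_state=0).fit_transform(X)
augmented = cross_val_score(LogisticRegression(max_iter=1000),
                            np.hstack([X, dists]), y, cv=5).mean()

print(f"baseline:  {baseline:.3f}")
print(f"augmented: {augmented:.3f}")  # often no better, as observed above
```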
Stratified K-Means Diverse Pre-Training Dataset (100K-1M)
A carefully balanced subset combining FineWeb-Edu and Proof-Pile-2, featuring embedding-based k-means sampling to ensure diverse representation across educational and mathematical/scientific content at multiple scales.
Authors
Aman Priyanshu
Supriti Vijay
Overview
This dataset provides stratified subsets at 50k, 100k, 250k, 500k, and 1M scales, combining high-quality… See the full description on the dataset page: https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-pretraining-100K-1M.
Web-Instruct-Kmean-V0-raw Dataset
A large-scale instruction dataset containing 850,591 question-answer pairs categorized by academic disciplines using k-means clustering.
Dataset Description
This dataset contains instruction-following examples collected from various web sources, with each example categorized into academic disciplines using k-means clustering and manual task categorization.
Features
The dataset includes the following fields:
orig_question:… See the full description on the dataset page: https://huggingface.co/datasets/II-Vietnam/Web-Instruct-Kmean-V0-raw.
The K-means algorithm is one of the most popular clustering algorithms in current use, as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions, whilst remaining almost as fast and simple. This novel algorithm, which we call MAP-DP (maximum a posteriori Dirichlet process mixtures), is statistically rigorous, as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example, binary, count or ordinal data. Also, it can efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared to K-means, with MAP-DP convergence typically achieved in the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross-validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism.
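MAP-DP itself is the authors' algorithm and is not reproduced here; a quick way to see the underlying idea with off-the-shelf tools is scikit-learn's truncated Dirichlet process Gaussian mixture, which likewise infers the effective number of clusters from the data (a related method, not MAP-DP, and limited to continuous data):

```python
# NOT the authors' MAP-DP implementation: a truncated Dirichlet process
# Gaussian mixture that switches off unneeded components, so the
# effective K is inferred from the data rather than fixed a priori.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=7)

dpgmm = BayesianGaussianMixture(
    n_components=10,  # upper bound on K, not the final number
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
)
labels = dpgmm.fit_predict(X)

# Components retaining non-negligible weight are the clusters in use.
print("effective clusters:", int(np.sum(dpgmm.weights_ > 1e-2)))
```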
The file consists of the locations of 300 places in the US. Each location is a two-dimensional point representing the longitude and latitude of the place. For example, "-112.1,33.5" means the longitude of the place is -112.1 and the latitude is 33.5. From the course Data Mining / Cluster Analysis by the University of Illinois at Urbana-Champaign.
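A minimal sketch for clustering these points, assuming a headerless file named `places.txt` with one `longitude,latitude` pair per line (the actual file name is not specified):

```python
# Minimal sketch for clustering the 300 places. Assumes a headerless
# file "places.txt" with one "longitude,latitude" pair per line.
import numpy as np
from sklearn.cluster import KMeans

points = np.loadtxt("places.txt", delimiter=",")  # shape (300, 2)

# Plain Euclidean distance on lon/lat is a tolerable approximation at
# city scale; for continent-scale data, project or use haversine instead.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)

for k in range(3):
    members = points[labels == k]
    print(f"cluster {k}: {len(members)} places, center {members.mean(axis=0).round(2)}")
```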
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of literature review on K-means hybridization with metaheuristic algorithms.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and code needed to recreate all analyses and figures presented in the manuscript 'Consistency of clustering analysis of complex 3D ocean datasets'.
'all_data_for_paper.nc': model data, 2000-2004 mean of all variables used, provided at all depth levels.
'mesh_mask.nc': domain and depth data file to be used alongside model data.
Tool to classify marine biogeochemical output from numerical models
Written by rmi, dapa & dmof
preprocess_amm7_functions.py
Functions needed to run different preprocessing scripts.
preprocess_all_depths.py
First script to run. Extracts relevant variables and takes the temporal mean for physical, biogeochemical, and ecological variables. For physical variables, calculates PAR from qsr.
preprocess_amm7_mean.py
Use for surface biogeochemical and ecological sets (faster)
preprocess_DI_DA.py
Use for depth-integrated, depth-averaged, and bottom biogeochemical and ecological sets. Can also be used for surface sets, but is slower.
preprocess_amm7_mean_one_depth.py
Extracts data at specified depth (numeric). Works for biogeochemical and ecological variables.
preprocess_physics.py
Takes all_depths_physics and calculates physics data at different depths.
silhouette_nvars.py
Calculates the silhouette score for inputs with different numbers of variables and clusters.
rand_index.py
rand_index_depth.py
remove_one_var.py
Calculates the Rand index between cluster sets with one variable removed and the original set.
Modelviz.py
Contains functions for applying clustering to data.
kmeans-paper-plots.ipynb
Produces figure 4
kmeans-paper-plots-illustrate-normalisation.ipynb
Produces figure 2
kmeans-paper-plots-depths.ipynb
Produces figures 5-7
plot_silhouette.ipynb
Produces figure 3