100+ datasets found
  1. K Means - Data Blobs

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    txt
    Updated Feb 2, 2022
    Cite
    Jesus Rogel-Salazar (2022). K Means - Data Blobs [Dataset]. http://doi.org/10.6084/m9.figshare.19102187.v3
    Explore at:
    txt (available download formats)
    Dataset updated
    Feb 2, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jesus Rogel-Salazar
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example data to understand the implementation of K Means
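    As an illustration (not the author's generation code), 2-D "data blobs" of this kind can be produced and clustered with scikit-learn's `make_blobs` and `KMeans`:

    ```python
    # Hypothetical sketch: generate blob data and fit K-means with scikit-learn.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Three well-separated Gaussian blobs in two dimensions.
    X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

    # Fit K-means with the known number of blobs.
    km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
    print(km.cluster_centers_.shape)  # (3, 2)
    ```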

  2. CLUSTERING: K-MEANS

    • kaggle.com
    zip
    Updated Apr 24, 2020
    Cite
    Syed Touqeer (2020). CLUSTERING: K-MEANS [Dataset]. https://www.kaggle.com/datasets/syedtouqeer/clustering-kmeans
    Explore at:
    zip, 1599 bytes (available download formats)
    Dataset updated
    Apr 24, 2020
    Authors
    Syed Touqeer
    Description

    Dataset

    This dataset was created by Syed Touqeer

    Contents

  3. K-MEANS CLUSTERING

    • kaggle.com
    zip
    Updated Apr 27, 2019
    Cite
    Prasun Varshney (2019). K-MEANS CLUSTERING [Dataset]. https://www.kaggle.com/varshneyprasun/kmeans-clustering
    Explore at:
    zip, 1088 bytes (available download formats)
    Dataset updated
    Apr 27, 2019
    Authors
    Prasun Varshney
    Description

    Dataset

    This dataset was created by Prasun Varshney

    Contents

  4. wine-clustering

    • huggingface.co
    Updated Sep 12, 2024
    + more versions
    Cite
    Trevor (2024). wine-clustering [Dataset]. https://huggingface.co/datasets/mltrev23/wine-clustering
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 12, 2024
    Authors
    Trevor
    Description

    Wine Clustering Dataset

      Overview
    

    The Wine Clustering Dataset contains data on various chemical properties of wines, intended for use in clustering tasks. This dataset is ideal for exploring clustering algorithms such as K-Means, hierarchical clustering, and others, to group wines based on their chemical composition.

      Dataset Structure
    

    The dataset is provided as a single CSV file named wine-clustering.csv. It contains 178 entries, each representing a unique wine… See the full description on the dataset page: https://huggingface.co/datasets/mltrev23/wine-clustering.
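    A minimal sketch, using scikit-learn's built-in wine data (also 178 samples with 13 chemical features) as a stand-in for wine-clustering.csv; the column names may differ from the actual CSV:

    ```python
    # Sketch: cluster wines by chemical composition, as the description suggests.
    # sklearn's load_wine() stands in for the dataset's CSV file here.
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_wine
    from sklearn.preprocessing import StandardScaler

    X = load_wine().data                        # 178 wines x 13 chemical features
    X_std = StandardScaler().fit_transform(X)   # scale: features have mixed units
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_std)
    ```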

  5. Customer Segmentation : Clustering

    • kaggle.com
    zip
    Updated Jan 13, 2024
    Cite
    Vishakh Patel (2024). Customer Segmentation : Clustering [Dataset]. https://www.kaggle.com/datasets/vishakhdapat/customer-segmentation-clustering
    Explore at:
    zip, 63448 bytes (available download formats)
    Dataset updated
    Jan 13, 2024
    Authors
    Vishakh Patel
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Customer Personality Analysis involves a thorough examination of a company's optimal customer profiles. This analysis facilitates a deeper understanding of customers, enabling businesses to tailor products to meet the distinct needs, behaviors, and concerns of various customer types.

    By conducting a Customer Personality Analysis, businesses can refine their products based on the preferences of specific customer segments. Rather than allocating resources to market a new product to the entire customer database, companies can identify the segments most likely to be interested in the product. Subsequently, targeted marketing efforts can be directed toward those particular segments, optimizing resource utilization and increasing the likelihood of successful product adoption.

    Details of Features are as below:

    • Id: Unique identifier for each individual in the dataset.
    • Year_Birth: The birth year of the individual.
    • Education: The highest level of education attained by the individual.
    • Marital_Status: The marital status of the individual.
    • Income: The annual income of the individual.
    • Kidhome: The number of young children in the household.
    • Teenhome: The number of teenagers in the household.
    • Dt_Customer: The date when the customer was first enrolled or became a part of the company's database.
    • Recency: The number of days since the last purchase or interaction.
    • MntWines: The amount spent on wines.
    • MntFruits: The amount spent on fruits.
    • MntMeatProducts: The amount spent on meat products.
    • MntFishProducts: The amount spent on fish products.
    • MntSweetProducts: The amount spent on sweet products.
    • MntGoldProds: The amount spent on gold products.
    • NumDealsPurchases: The number of purchases made with a discount or as part of a deal.
    • NumWebPurchases: The number of purchases made through the company's website.
    • NumCatalogPurchases: The number of purchases made through catalogs.
    • NumStorePurchases: The number of purchases made in physical stores.
    • NumWebVisitsMonth: The number of visits to the company's website in a month.
    • AcceptedCmp3: Binary indicator (1 or 0) whether the individual accepted the third marketing campaign.
    • AcceptedCmp4: Binary indicator (1 or 0) whether the individual accepted the fourth marketing campaign.
    • AcceptedCmp5: Binary indicator (1 or 0) whether the individual accepted the fifth marketing campaign.
    • AcceptedCmp1: Binary indicator (1 or 0) whether the individual accepted the first marketing campaign.
    • AcceptedCmp2: Binary indicator (1 or 0) whether the individual accepted the second marketing campaign.
    • Complain: Binary indicator (1 or 0) whether the individual has made a complaint.
    • Z_CostContact: A constant cost associated with contacting a customer.
    • Z_Revenue: A constant revenue associated with a successful campaign response.
    • Response: Binary indicator (1 or 0) whether the individual responded to the marketing campaign.
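    Segmentation on features like those above can be sketched as follows; the tiny DataFrame is made-up illustration data, not the Kaggle file:

    ```python
    # Sketch: K-means segmentation on a few of the spending features listed
    # above. Values below are invented for illustration only.
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    df = pd.DataFrame({
        "Income":          [30000, 32000, 90000, 95000, 61000, 58000],
        "MntWines":        [40, 55, 800, 900, 300, 280],
        "MntMeatProducts": [20, 30, 500, 450, 150, 170],
    })
    X = StandardScaler().fit_transform(df)      # put features on one scale
    df["Segment"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(df["Segment"].nunique())  # 3
    ```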
  6. Benchmarks datasets for cluster analysis

    • kaggle.com
    zip
    Updated Nov 15, 2023
    Cite
    Onthada Preedasawakul (2023). Benchmarks datasets for cluster analysis [Dataset]. https://www.kaggle.com/datasets/onthada/benchmarks-datasets-for-clustering
    Explore at:
    zip, 608532 bytes (available download formats)
    Dataset updated
    Nov 15, 2023
    Authors
    Onthada Preedasawakul
    License

    Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    25 Artificial Datasets

    The datasets are generated using either Gaussian or Uniform distributions. Each dataset contains several known sub-groups intended for testing centroid-based clustering results and cluster validity indices.

    Cluster analysis is a popular machine learning technique used to segment a dataset so that similar data points fall in the same group. For those who are familiar with R, there is a new R package called "UniversalCVI" (https://CRAN.R-project.org/package=UniversalCVI) for cluster evaluation. This package provides algorithms for checking the accuracy of a clustering result with known classes, computing cluster validity indices, and generating plots for comparing them. The package is compatible with K-means, fuzzy C-means, EM clustering, and hierarchical clustering (single, average, and complete linkage). To use the "UniversalCVI" package, one can follow the instructions provided in the R documentation.

    For more in-depth details of the package and cluster evaluation, please see the papers https://doi.org/10.1016/j.patcog.2023.109910 and https://arxiv.org/abs/2308.14785

    All the datasets are also available on GitHub at

    https://github.com/O-PREEDASAWAKUL/FuzzyDatasets.git .

    (Figure: realplot.jpeg)

  7. Dataset for: Some Remarks on the R2 for Clustering

    • wiley.figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Nicola Loperfido; Thaddeus Tarpey (2023). Dataset for: Some Remarks on the R2 for Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.6124508.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Wiley (https://www.wiley.com/)
    Authors
    Nicola Loperfido; Thaddeus Tarpey
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that generally the $R^2$ can be artificially inflated by linearly transforming the data by "stretching" and by projecting. Also, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering for misspecified models. Several simulation illustrations are provided highlighting weaknesses in the clustering $R^2$, especially in high-dimensional settings. A functional data example is given showing how the $R^2$ for clustering can vary dramatically depending on how the curves are estimated.
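    The stretching effect described above can be reproduced in a small sketch (synthetic data; the $R^2$ here is the usual between/total sum-of-squares ratio, written as 1 minus within/total):

    ```python
    # Sketch: the clustering R^2 is inflated by linearly "stretching" the data.
    import numpy as np
    from sklearn.cluster import KMeans

    def clustering_r2(X, labels):
        """Proportion of total variance explained by the cluster means."""
        total = ((X - X.mean(axis=0)) ** 2).sum()
        within = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
                     for k in np.unique(labels))
        return 1.0 - within / total

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))                 # one blob: no real clusters
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    base = clustering_r2(X, labels)

    # Stretch one axis by 10x: the same structureless data now yields a much
    # larger R^2, illustrating the inflation the note describes.
    X_s = X * np.array([10.0, 1.0])
    labels_s = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_s)
    inflated = clustering_r2(X_s, labels_s)
    ```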

  8. Modified k-means clustering model in multi stire delivery service

    • dataverse.telkomuniversity.ac.id
    pdf
    Updated Jul 14, 2022
    + more versions
    Cite
    Telkom University Dataverse (2022). Modified k-means clustering model in multi stire delivery service [Dataset]. http://doi.org/10.34820/FK2/UWEYZR
    Explore at:
    pdf, 251276 bytes (available download formats)
    Dataset updated
    Jul 14, 2022
    Dataset provided by
    Telkom University Dataverse
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    In this work, we propose the centralized shared delivery service model. As a centralized model, the delivery service is handled by central management, so that coordination of the delivery process among vehicles can be more efficient.

  9. Customer Segmentation for Targeted Campaigns

    • kaggle.com
    zip
    Updated May 21, 2024
    Cite
    Mani Devesh (2024). Customer Segmentation for Targeted Campaigns [Dataset]. https://www.kaggle.com/datasets/manidevesh/customer-sales-data
    Explore at:
    zip, 914292 bytes (available download formats)
    Dataset updated
    May 21, 2024
    Authors
    Mani Devesh
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Project Overview: Customer Segmentation Using K-Means Clustering

    Introduction In this project, I analysed customer data from a retail store to identify distinct customer segments. The dataset includes key attributes such as age, city, and total sales of the customers. By leveraging K-Means clustering, an unsupervised machine learning technique, I aim to group customers based on their age and sales metrics. These insights will enable the creation of targeted marketing campaigns tailored to the specific needs and behaviours of each customer segment.

    Objectives - Cluster Customers: Use K-Means clustering to group customers based on age and total sales. - Analyse Segments: Examine the characteristics of each customer segment. - Targeted Marketing: Develop strategies for personalized marketing campaigns targeting each identified customer group.

    Data Description The dataset comprises:

    • Age: The age of the customers.
    • City: The city where the customers reside.
    • Total Sales: The total sales generated by each customer.

    Methodology - Data Preprocessing: Clean and preprocess the data to handle any missing or inconsistent entries. - Feature Selection: Focus on age and total sales as primary features for clustering. - K-Means Clustering: Apply the K-Means algorithm to identify distinct customer segments. - Cluster Analysis: Analyse the resulting clusters to understand the demographic and sales characteristics of each group. - Marketing Strategy Development: Create targeted marketing strategies for each customer segment to enhance engagement and sales.

    Expected Outcomes - Customer Segments: Clear identification of customer groups based on age and purchasing behaviour. - Insights for Marketing: Detailed understanding of each segment to inform targeted marketing efforts. - Business Impact: Enhanced ability to tailor marketing campaigns, potentially leading to increased customer satisfaction and sales.

    By clustering customers based on age and total sales, this project aims to provide actionable insights for personalized marketing, ultimately driving better customer engagement and higher sales for the retail store.
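    The methodology above can be sketched on made-up (Age, Total Sales) data, including a simple inertia ("elbow") check for choosing k; none of these numbers come from the actual dataset:

    ```python
    # Sketch: cluster customers on Age and Total Sales (synthetic data), with an
    # inertia curve as a rough guide to the number of segments.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(1)
    age = np.concatenate([rng.normal(25, 3, 50), rng.normal(55, 4, 50)])
    sales = np.concatenate([rng.normal(200, 30, 50), rng.normal(900, 80, 50)])
    X = StandardScaler().fit_transform(np.column_stack([age, sales]))

    inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_
                for k in (1, 2, 3, 4)}
    # Inertia always decreases with k; the "elbow" is where the drop flattens.
    assert inertias[1] > inertias[2] > inertias[3] > inertias[4]
    ```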

  10. Data from: Factor Modeling for Clustering High-Dimensional Time Series

    • tandf.figshare.com
    zip
    Updated Feb 9, 2024
    Cite
    Bo Zhang; Guangming Pan; Qiwei Yao; Wang Zhou (2024). Factor Modeling for Clustering High-Dimensional Time Series [Dataset]. http://doi.org/10.6084/m9.figshare.22141184.v4
    Explore at:
    zip (available download formats)
    Dataset updated
    Feb 9, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Bo Zhang; Guangming Pan; Qiwei Yao; Wang Zhou
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We propose a new unsupervised learning method for clustering a large number of time series based on a latent factor structure. Each cluster is characterized by its own cluster-specific factors in addition to some common factors which impact on all the time series concerned. Our setting also offers the flexibility that some time series may not belong to any clusters. The consistency with explicit convergence rates is established for the estimation of the common factors, the cluster-specific factors, and the latent clusters. Numerical illustration with both simulated data as well as a real data example is also reported. As a spin-off, the proposed new approach also advances significantly the statistical inference for the factor model of Lam and Yao. Supplementary materials for this article are available online.

  11. Dataset For KMeans Clustering

    • kaggle.com
    zip
    Updated Jan 8, 2025
    Cite
    Mit Gandhi (2025). Dataset For KMeans Clustering [Dataset]. https://www.kaggle.com/datasets/mitgandhi10/dataset-for-kmeans-clustering/code
    Explore at:
    zip, 25556 bytes (available download formats)
    Dataset updated
    Jan 8, 2025
    Authors
    Mit Gandhi
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset captures Instagram users' visit scores and their spending rank (on a scale of 0 to 100). The goal is to analyze and group users into distinct clusters based on their behaviors, enabling insights into user engagement and spending potential. The dataset is suitable for unsupervised machine learning techniques like K-Means clustering, which can help identify patterns and group users effectively.

    Dataset Highlights: Ideal for practicing clustering algorithms. Small and easy-to-handle dataset. Includes key metrics for user behavior analysis.

  12. stratified-kmeans-diverse-reasoning-100K-1M

    • huggingface.co
    Updated Oct 8, 2025
    Cite
    Aman Priyanshu (2025). stratified-kmeans-diverse-reasoning-100K-1M [Dataset]. https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-reasoning-100K-1M
    Explore at:
    Dataset updated
    Oct 8, 2025
    Authors
    Aman Priyanshu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Stratified K-Means Diverse Reasoning Dataset (100K-1M)

    A carefully balanced subset of NVIDIA's Llama-Nemotron Post-Training Dataset, featuring square-root rebalanced sampling across math, code, science, instruction-following, chat, and safety tasks at multiple scales.

      👥 Follow the Authors
    

    Aman Priyanshu

    Supriti Vijay

      Overview
    

    This dataset provides stratified subsets at 50k, 100k, 250k, 500k, and 1M scales from the Llama-Nemotron… See the full description on the dataset page: https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-reasoning-100K-1M.
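    The general idea of embedding-based stratified k-means sampling can be sketched as follows; this is an illustration of the concept, not the authors' pipeline, and the embeddings are random stand-ins:

    ```python
    # Sketch: cluster document embeddings, then draw an equal share of examples
    # from each cluster to build a diverse subset.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(1000, 16))    # stand-in document embeddings
    labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(embeddings)

    per_cluster = 20                            # target subset: up to 10 * 20
    parts = []
    for k in range(10):
        idx = np.where(labels == k)[0]
        take = min(per_cluster, idx.size)       # guard against small clusters
        parts.append(rng.choice(idx, size=take, replace=False))
    subset = np.concatenate(parts)
    ```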

  13. stratified-kmeans-diverse-instruction-following-100K-1M

    • huggingface.co
    Updated Oct 8, 2025
    Cite
    Aman Priyanshu (2025). stratified-kmeans-diverse-instruction-following-100K-1M [Dataset]. https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-instruction-following-100K-1M
    Explore at:
    Dataset updated
    Oct 8, 2025
    Authors
    Aman Priyanshu
    Description

    Stratified K-Means Diverse Instruction-Following Dataset (100K-1M)

    A carefully balanced subset combining Tulu-3 SFT Mixture and Orca AgentInstruct, featuring embedding-based k-means sampling across diverse instruction-following tasks at multiple scales.

      👥 Follow the Authors
    

    Aman Priyanshu

    Supriti Vijay

      Overview
    

    This dataset provides stratified subsets at 50k, 100k, 250k, 500k, and 1M scales, combining high-quality instruction-following data from… See the full description on the dataset page: https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-instruction-following-100K-1M.

  14. Educational Attainment in North Carolina Public Schools: Use of statistical...

    • data.mendeley.com
    Updated Nov 14, 2018
    Cite
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Explore at:
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The purpose of data mining analysis is always to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset; before doing any work on the data, it has to be pre-processed, which normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. In our project, using clustering prior to classification did not improve performance much. The reason may be that the features we selected for clustering are not well suited to it. Because of the nature of the data, classification tasks provide more information to work with in terms of improving knowledge and overall performance metrics.

    From the dimensionality reduction perspective: clustering differs from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension can lose a lot of information, since clustering techniques are based on a metric of "distance", and at high dimensions Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

    From the creating-new-features perspective: clustering analysis creates labels based on the patterns of the data, which brings uncertainties into the data. When using clustering prior to classification, the decision on the number of clusters will strongly affect the performance of the clustering, and in turn the performance of the classification. If the subset of features we apply clustering to is well suited for it, it might increase the overall classification performance. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better.

    We did not lock in the clustering outputs with a random_state, in order to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, maybe the data just does not cluster well with the selected methods at all. In practice, the ramification we saw was that our results were not much better than random when applying clustering in the data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to revise the models from time to time as things change.
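    The workflow discussed above, clustering as a source of new features for a downstream classifier, can be sketched with synthetic data; distance-to-centroid columns from `KMeans.transform` stand in for the derived features, and as the authors found, any gain may be small:

    ```python
    # Sketch: append K-means distance-to-centroid features and compare
    # cross-validated classification accuracy with and without them.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=400, n_features=10, random_state=0)

    km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
    X_aug = np.hstack([X, km.transform(X)])   # add distance-to-centroid columns

    base = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    aug = cross_val_score(LogisticRegression(max_iter=1000), X_aug, y, cv=5).mean()
    # Whether `aug` beats `base` depends heavily on how well the data clusters.
    ```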

  15. stratified-kmeans-diverse-pretraining-100K-1M

    • huggingface.co
    Updated Oct 8, 2025
    Cite
    Aman Priyanshu (2025). stratified-kmeans-diverse-pretraining-100K-1M [Dataset]. https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-pretraining-100K-1M
    Explore at:
    Dataset updated
    Oct 8, 2025
    Authors
    Aman Priyanshu
    Description

    Stratified K-Means Diverse Pre-Training Dataset (100K-1M)

    A carefully balanced subset combining FineWeb-Edu and Proof-Pile-2, featuring embedding-based k-means sampling to ensure diverse representation across educational and mathematical/scientific content at multiple scales.

      👥 Follow the Authors
    

    Aman Priyanshu

    Supriti Vijay

      Overview
    

    This dataset provides stratified subsets at 50k, 100k, 250k, 500k, and 1M scales, combining high-quality… See the full description on the dataset page: https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-pretraining-100K-1M.

  16. Web-Instruct-Kmean-V0-raw

    • huggingface.co
    Cite
    II Vietnam, Web-Instruct-Kmean-V0-raw [Dataset]. https://huggingface.co/datasets/II-Vietnam/Web-Instruct-Kmean-V0-raw
    Explore at:
    Dataset authored and provided by
    II Vietnam
    Description

    Web-Instruct-Kmean-V0-raw Dataset

    A large-scale instruction dataset containing 850,591 question-answer pairs categorized by academic disciplines using k-means clustering.

      Dataset Description
    

    This dataset contains instruction-following examples collected from various web sources, with each example categorized into academic disciplines using k-means clustering and manual task categorization.

      Features
    

    The dataset includes the following fields:

    orig_question:… See the full description on the dataset page: https://huggingface.co/datasets/II-Vietnam/Web-Instruct-Kmean-V0-raw.

  17. Data from: What to Do When K-Means Clustering Fails: A Simple yet Principled...

    • researchdata.aston.ac.uk
    • plos.figshare.com
    Updated Sep 26, 2016
    Cite
    Yordan Raykov; Alexis Boukouvalas; Fahd Baig; Max Little (2016). What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm [Dataset]. https://researchdata.aston.ac.uk/id/eprint/152/
    Explore at:
    Dataset updated
    Sep 26, 2016
    Authors
    Yordan Raykov; Alexis Boukouvalas; Fahd Baig; Max Little
    Area covered
    Birmingham, United Kingdom
    Description

    The K-means algorithm is one of the most popular clustering algorithms in current use as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions, whilst remaining almost as fast and simple. This novel algorithm which we call MAP-DP (maximum a-posteriori Dirichlet process mixtures), is statistically rigorous as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a-priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example, binary, count or ordinal data. Also, it can efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared to K-means with MAP-DP convergence typically achieved in the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism.
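    MAP-DP itself is not in scikit-learn, but a rough analogue of the paper's central point, inferring the number of clusters from the data instead of fixing K a priori, can be sketched with sklearn's Dirichlet-process Gaussian mixture; this is not the authors' algorithm:

    ```python
    # Sketch: a Dirichlet-process mixture lets the data decide how many of the
    # allowed components to actually use, unlike K-means where K is fixed.
    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

    dp = BayesianGaussianMixture(
        n_components=10,                                  # upper bound, not K
        weight_concentration_prior_type="dirichlet_process",
        random_state=0,
    ).fit(X)
    # Components with non-negligible weight are the clusters actually used.
    effective = int((dp.weights_ > 0.05).sum())
    ```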

  18. 300 Places in the US for K-means Clustering

    • kaggle.com
    zip
    Updated Aug 16, 2022
    Cite
    Dongou (2022). 300 Places in the US for K-means Clustering [Dataset]. https://www.kaggle.com/datasets/adamxing2021/300places
    Explore at:
    zip, 2610 bytes (available download formats)
    Dataset updated
    Aug 16, 2022
    Authors
    Dongou
    Area covered
    United States
    Description

    The file consists of the locations of 300 places in the US. Each location is a two-dimensional point that represents the longitude and latitude of the place. For example, "-112.1,33.5" means the longitude of the place is -112.1, and the latitude is 33.5. From the course "Data Mining / Cluster Analysis" by the University of Illinois at Urbana-Champaign.
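    Parsing lines in the stated "longitude,latitude" format and clustering them might look like this (the three sample strings are invented, in that format, rather than taken from the file):

    ```python
    # Sketch: parse "longitude,latitude" strings, then cluster the coordinates.
    from sklearn.cluster import KMeans

    lines = ["-112.1,33.5", "-80.2,25.8", "-112.0,33.4"]
    points = [tuple(map(float, ln.split(","))) for ln in lines]
    assert points[0] == (-112.1, 33.5)

    # With the real 300-place file, a larger n_clusters would make sense.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    ```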

  19. Summary of literature review on K-means hybridization with metaheuristic...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Abiodun M. Ikotun; Absalom E. Ezugwu (2023). Summary of literature review on K-means hybridization with metaheuristic algorithms. [Dataset]. http://doi.org/10.1371/journal.pone.0272861.t001
    Explore at:
    xls (available download formats)
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Abiodun M. Ikotun; Absalom E. Ezugwu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary of literature review on K-means hybridization with metaheuristic algorithms.

  20. Dataset and code to accompany the manuscript 'Consistency of clustering...

    • zenodo.org
    bin, nc +1
    Updated Oct 2, 2025
    Cite
    Rebecca Millington; Rebecca Millington; Dale Partridge; Dale Partridge; Helen R. Powley; Helen R. Powley; Gennadi Lessin; Gennadi Lessin; David Moffat; David Moffat; Jeremy Blackford; Jeremy Blackford (2025). Dataset and code to accompany the manuscript 'Consistency of clustering analysis of complex 3D ocean datasets' [Dataset]. http://doi.org/10.5281/zenodo.17227522
    Explore at:
    text/x-python, bin, nc (available download formats)
    Dataset updated
    Oct 2, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rebecca Millington; Rebecca Millington; Dale Partridge; Dale Partridge; Helen R. Powley; Helen R. Powley; Gennadi Lessin; Gennadi Lessin; David Moffat; David Moffat; Jeremy Blackford; Jeremy Blackford
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data and code needed to recreate all analyses and figures presented in the manuscript 'Consistency of clustering analysis of complex 3D ocean datasets'.

    Data

    'all_data_for_paper.nc': model data, 2000-2004 mean of all variables used, provided at all depth levels.

    'mesh_mask.nc': domain and depth data file to be used alongside model data.

    ModelViz (code)

    Tool to classify marine biogeochemical output from numerical models

    Written by rmi, dapa & dmof

    Preprocessing

    preprocess_amm7_functions.py
    Functions needed to run different preprocessing scripts.
    preprocess_all_depths.py
    First script to run. Extracts relevant variables and takes the temporal mean for physical, biogeochemical, and ecological variables. For physical variables, calculates PAR from qsr.
    preprocess_amm7_mean.py
    Use for surface biogeochemical and ecological sets (faster)
    preprocess_DI_DA.py
    Use for depth integrated, depth averaged and bottom biogeochemical and ecological sets. Can use for surface but slower.
    preprocess_amm7_mean_one_depth.py
    Extracts data at specified depth (numeric). Works for biogeochemical and ecological variables.
    preprocess_physics.py
    Takes all_depths_physics and calculates physics data at different depths.

    Metrics

    silhouette_nvars.py
    Calculates silhouette score for inputs with different numbers of variables and clusters
    rand_index.py
    rand_index_depth.py
    remove_one_var.py
    Calculates rand index between cluster sets with one variable removed and original set

    Clustering

    Modelviz.py
    Contains functions for applying clustering to data

    Plotting

    kmeans-paper-plots.ipynb
    Produces figure 4
    kmeans-paper-plots-illustrate-normalisation.ipynb
    Produces figure 2
    kmeans-paper-plots-depths.ipynb
    Produces figures 5-7
    plot_silhouette.ipynb
    Produces figure 3
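    The metrics computed by the scripts above (silhouette score across numbers of clusters, and a Rand index comparing two cluster sets) can be sketched with scikit-learn on synthetic blobs; this is an analogy to the listed scripts, not the repository's code:

    ```python
    # Sketch: silhouette score for several k, plus an adjusted Rand index
    # between two clusterings (e.g. original vs. a perturbed variable set).
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import adjusted_rand_score, silhouette_score

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    scores = {k: silhouette_score(
                    X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
              for k in (2, 3, 4, 5)}
    best_k = max(scores, key=scores.get)   # k with the highest silhouette

    # Adjusted Rand index between two K-means runs with different seeds.
    a = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    b = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)
    ari = adjusted_rand_score(a, b)
    ```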
