License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
Overview figure: https://i.imgur.com/ZUX61cD.png
Grouping similar data points into clusters is called clustering. You can create dummy data for cluster classification with functions from the sklearn package, but that takes some effort.
For users building challenging test cases for clustering examples, I think this dataset will help.
Try to select a meaningful number of clusters and divide the data into them. Here are some exercises for you.
All CSV files contain many rows of x, y, and color, which are plotted in the figures above.
If you want to use the positions as integers, scale them and round off, e.g., x = round(x * 100).
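For example, a minimal pandas sketch (the file name cluster_example.csv is only a placeholder, and the 0-1 coordinate range is an assumption) might look like this:

import pandas as pd

# Hypothetical example file with columns x, y and color.
df = pd.read_csv("cluster_example.csv")

# Scale the coordinates and round them off to integers.
df["x_int"] = (df["x"] * 100).round().astype(int)
df["y_int"] = (df["y"] * 100).round().astype(int)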
Furthermore, here is a GUI tool to generate 2D points for clustering; you can make your own dataset with it: https://www.joonas.io/cluster-paint
Stay tuned for further updates! Also, if you have any ideas, feel free to leave me a comment.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Example data to understand the implementation of K Means
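As a rough illustration (the file name kmeans_example.csv and the column names are assumptions, not part of this dataset's description), such example data could be loaded and clustered with scikit-learn's KMeans:

import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical file and column names for the example data.
data = pd.read_csv("kmeans_example.csv")
X = data[["x", "y"]].values

# Fit k-means with an assumed number of clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)
print(kmeans.labels_[:10])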
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
In scientific data analysis, clusters identified computationally often substantiate existing hypotheses or motivate new ones. Yet the combinatorial nature of the clustering result, which is a partition rather than a set of parameters or a function, blurs notions of mean and variance. This intrinsic difficulty hinders the development of methods to improve clustering by aggregation or to assess the uncertainty of clusters generated. We overcome that barrier by aligning clusters via optimal transport. Equipped with this technique, we propose a new algorithm to enhance clustering by any baseline method using bootstrap samples. Cluster alignment enables us to quantify variation in the clustering result at the levels of both overall partitions and individual clusters. Set relationships between clusters such as one-to-one match, split, and merge can be revealed. A covering point set for each cluster, a concept kin to the confidence interval, is proposed. The tools we have developed here will help address the crucial question of whether any cluster is an intrinsic or spurious pattern. Experimental results on both simulated and real datasets are provided.
This dataset is part of the paper "McPhraSy: Multi-Context Phrase Similarity and Clustering" by DN Cohen et al. (2022). The purpose of PCD is to evaluate the quality of semantic-based clustering of noun phrases. The phrases were collected from the Amazon Review Dataset.
Please download the text file. It contains 22 sentences related to the «cat» topic: the cat (animal); the UNIX utility cat for displaying the contents of files; and the versions of the OS X operating system named after the feline family. Your task is to find the two sentences that are closest in meaning to the first sentence in the document («In comparison to dogs, cats have not undergone .......»). We will use the cosine distance as a measure of proximity.
Steps:
1. Open the file.
2. Each line is one sentence. Convert them all to lower case using the string function lower(). EXAMPLE: in comparison to dogs, cats have not undergone major changes during the domestication process.
3. Tokenization, i.e., splitting the sentences into words. For that purpose you can use regular expressions, which can split on spaces or any other symbols that are not letters: re.split('[^a-z]', t). Do not forget to remove empty words. EXAMPLE: ['in', 'comparison', 'to', 'dogs', '', 'cats', 'have', 'not', 'undergone', 'major', 'changes', 'during', 'the', 'domestication', 'process'].
4. Make a list of all the words that appear in the sentences. Note: all the words are unique. Give each word an index from 0 to the number of unique words minus 1. You can use a dict. Example: {0: 'mac', 1: 'permanently', 2: 'osx', 3: 'download', 4: 'between', 5: 'based', 6: 'which', ............., 252: 'safer', 253: 'will'}. Hint: we have 254 unique words.
5. Create a matrix with N x D dimensions, where N is the number of sentences and D is the number of unique words (22 x 254). Fill it in: the element with index (i, j) in this matrix must be equal to the number of occurrences of the j-th word in the i-th sentence (bag of words).
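A minimal Python sketch of steps 1-5 (the file name sentences.txt is an assumption) could look like this:

import re
import numpy as np

# Steps 1-2: read the file and lower-case every sentence.
with open("sentences.txt") as f:
    sentences = [line.lower() for line in f if line.strip()]

# Step 3: tokenize on any non-letter character and drop empty words.
tokenized = [[w for w in re.split('[^a-z]', s) if w] for s in sentences]

# Step 4: give every unique word an index.
word_index = {}
for tokens in tokenized:
    for w in tokens:
        if w not in word_index:
            word_index[w] = len(word_index)

# Step 5: build the N x D bag-of-words matrix.
matrix = np.zeros((len(tokenized), len(word_index)))
for i, tokens in enumerate(tokenized):
    for w in tokens:
        matrix[i, word_index[w]] += 1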
Of course, in Task 1 we implemented a very simple method. For example, in this method «cat» and «cats» are two different words, but the meaning is the same.
For the second task, please repeat steps 1-4 from Task 1. In this task you will create a Term Frequency–Inverse Document Frequency (TF-IDF) matrix. Find the cosine distance from the first sentence to all the other sentences. Which two sentences are closest to the first sentence? You can use scipy.spatial.distance.cosine. Is there any difference from the result of the previous task? Note: you should not use any existing libraries for TF-IDF. All the other steps are similar to the previous example.
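One possible sketch of the TF-IDF step, continuing from the bag-of-words sketch above and reusing its `matrix` (the exact TF-IDF weighting is an assumption; the exercise leaves it open):

import numpy as np
from scipy.spatial.distance import cosine

# Term frequency: counts normalized by sentence length.
tf = matrix / matrix.sum(axis=1, keepdims=True)

# Inverse document frequency: log(N / number of sentences containing the word).
doc_freq = (matrix > 0).sum(axis=0)
idf = np.log(matrix.shape[0] / doc_freq)
tfidf = tf * idf

# Cosine distance from the first sentence to every other sentence.
distances = [cosine(tfidf[0], tfidf[i]) for i in range(1, tfidf.shape[0])]
closest_two = np.argsort(distances)[:2] + 1  # sentence indices of the two closest
print(closest_two)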
Please run a hierarchical clustering algorithm for Task 1 and Task 2, and plot the dendrogram. Please explain your results.
NOTE: by default, scipy.cluster.hierarchy uses the Euclidean distance. You should change it to the cosine distance.
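A rough sketch with SciPy, using average linkage so that the cosine metric is valid (the linkage method is our choice, not prescribed by the exercise):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hierarchical clustering of the sentences with cosine distance
# (run once on the bag-of-words matrix and once on the TF-IDF matrix).
Z = linkage(matrix, method='average', metric='cosine')
dendrogram(Z)
plt.show()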
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This artifact bundles the five dataset archives used in our private federated clustering evaluation, corresponding to the real-world benchmarks, scaling experiments, ablation studies, and timing performance tests described in the paper. The real_datasets.tar.xz includes ten established clustering benchmarks drawn from UCI and the Clustering basic benchmark (DOI: https://doi.org/10.1007/s10489-018-1238-7); scale_datasets.tar.xz contains the SynthNew family generated to assess scalability via the R clusterGeneration package; ablate_datasets.tar.xz holds the AblateSynth sets varying cluster separation for ablation analysis, also powered by clusterGeneration; g2_datasets.tar.xz packages the G2 sets (Gaussian clusters of size 2048 across dimensions 2–1024 with two clusters each), collected from the Clustering basic benchmark (DOI: https://doi.org/10.1007/s10489-018-1238-7); and timing_datasets.tar.xz includes the real s1 and lsun datasets alongside TimeSynth files (balanced synthetic clusters for timing), as per Mohassel et al.'s experimental framework.
Contains ten real-world benchmark datasets, formatted as one sample per line with space-separated features:
iris.txt: 150 samples, 4 features, 3 classes; classic UCI Iris dataset for petal/sepal measurements.
lsun.txt: 400 samples, 2 features, 3 clusters; two-dimensional variant of the LSUN dataset for clustering experiments.
s1.txt: 5,000 samples, 2 features, 15 clusters; synthetic benchmark from Fränti’s S1 series.
house.txt: 1,837 samples, 3 features, 3 clusters; housing data transformed for clustering tasks.
adult.txt: 48,842 samples, 6 features, 3 clusters; UCI Census Income (“Adult”) dataset for income bracket prediction.
wine.txt: 178 samples, 13 features, 3 cultivars; UCI Wine dataset with chemical analysis features.
breast.txt: 569 samples, 9 features, 2 classes; Wisconsin Diagnostic Breast Cancer dataset.
yeast.txt: 1,484 samples, 8 features, 10 localization sites; yeast protein localization data.
mnist.txt: 10,000 samples, 784 features (28×28 pixels), 10 digit classes; MNIST handwritten digits.
birch2.txt: a random 25,000-sample subset of the 100,000-sample set, 2 features, 100 clusters; synthetic BIRCH2 dataset for high-cluster-count evaluation.
Holds the SynthNew_{k}_{d}_{s}.txt files for scaling experiments, where:
$k \in \{2,4,8,16,32\}$ is the number of clusters,
$d \in \{2,4,8,16,32,64,128,256,512\}$ is the dimensionality,
$s \in \{1,2,3\}$ are different random seeds.
These are generated with the R clusterGeneration package with cluster sizes following a $1:2:...:k$ ratio. We incorporate a random number (in $[0, 100]$) of randomly sampled outliers and set the cluster separation degrees randomly in $[0.16, 0.26]$, spanning partially overlapping to separated clusters.
Contains the AblateSynth_{k}_{d}_{sep}.txt files for ablation studies, with:
$k \in \{2,4,8,16\}$ clusters,
$d \in \{2,4,8,16\}$ dimensions,
$sep \in \{0.25, 0.5, 0.75\}$ controlling cluster separation degrees.
Also generated via clusterGeneration.
Packages the G2 synthetic sets (g2-{dim}-{var}.txt) from the clustering-data benchmarks:
$N=2048$ samples, $k=2$ Gaussian clusters,
Dimensions $d \in \{1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024\}$
Includes:
s1.txt, lsun.txt: two real datasets for baseline timing.
timesynth_{k}_{d}_{n}.txt: synthetic timing datasets with balanced cluster sizes $C_{avg} = N/k$, varying:
$k \in \{2, 5\}$
$d \in \{2, 5\}$
$N \in \{10000, 100000\}$
Generated similarly to the scaling sets, following Mohassel et al.'s timing experiment protocol.
Usage:
Unpack any archive with tar -xJf <archive>.tar.xz
This dataset was created by Manohar Reddy.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Big data, with N × P dimension where N is extremely large, has created new challenges for data analysis, particularly in the realm of creating meaningful clusters of data. Clustering techniques, such as K-means or hierarchical clustering are popular methods for performing exploratory analysis on large datasets. Unfortunately, these methods are not always possible to apply to big data due to memory or time constraints generated by calculations of order $PN(N-1)/2$. To circumvent this problem, typically the clustering technique is applied to a random sample drawn from the dataset; however, a weakness is that the structure of the dataset, particularly at the edges, is not necessarily maintained. We propose a new solution through the concept of “data nuggets”, which reduces a large dataset into a small collection of nuggets of data, each containing a center, weight, and scale parameter. The data nuggets are then input into algorithms that compute methods such as principal components analysis and clustering in a more computationally efficient manner. We show the consistency of the data nuggets based covariance estimator and apply the methodology of data nuggets to perform exploratory analysis of a flow cytometry dataset containing over one million observations using PCA and K-means clustering for weighted observations. Supplementary materials for this article are available online.
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that generally the $R^2$ can be artificially inflated by linearly transforming the data by "stretching" and by projecting. Also, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering for misspecified models. Several simulation illustrations are provided highlighting weaknesses in the clustering $R^2$, especially in high-dimensional settings. A functional data example is given showing how the $R^2$ for clustering can vary dramatically depending on how the curves are estimated.
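For reference, one standard way to write this statistic (our notation, not necessarily the note's) is

$$ R^2 = 1 - \frac{\sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \bar{x}_k \rVert^2}{\sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2}, $$

where $C_k$ is the $k$-th cluster with mean $\bar{x}_k$ and $\bar{x}$ is the overall mean, so $R^2$ is the between-cluster share of the total sum of squares.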
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
Small 2-dimensional clustering dataset for examples and case studies.
Created using https://www.joonas.io/cluster-paint/
I used this in my introduction to k-Means clustering notebook here: https://www.kaggle.com/samuelcortinhas/k-means-from-scratch
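For orientation, a bare-bones Lloyd's-algorithm sketch of the kind such a from-scratch notebook might implement (this is our own sketch, not the notebook's code):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centers as k distinct random points from the data.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its nearest center.
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        # Recompute each center as the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Example usage on random 2D points.
labels, centers = kmeans(np.random.default_rng(1).random((200, 2)), k=3)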
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The purpose of data mining analysis is always to find patterns in the data using certain kinds of techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before doing any work on the data, the data has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance has not improved much. The reason why it has not improved could be that the features we selected to perform clustering on are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

From the dimensionality reduction perspective: it is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique for reducing the data dimension will lose a lot of information, since clustering techniques are based on a metric of 'distance'. At high dimensions, Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

From the creating-new-features perspective: clustering analysis creates labels based on the patterns of the data, and it brings uncertainty into the data. When using clustering prior to classification, the decision on the number of clusters will highly affect the performance of the clustering, which in turn affects the performance of classification. If the subset of features we apply clustering techniques to is well suited for it, it might increase the overall classification performance. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better.

We did not lock in the clustering outputs using a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, maybe the data just does not cluster well with the methods selected at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering to the data preprocessing.

Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model's real-world effectiveness and also to continue to revise the models from time to time as things change.
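As a rough, hedged illustration of the idea of using cluster labels as an extra feature prior to classification (the dataset, classifier, and number of clusters here are arbitrary choices, not the project's actual setup):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Cluster the features and append the cluster label as a new column.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
X_aug = np.hstack([X, labels.reshape(-1, 1)])

clf = RandomForestClassifier(random_state=0)
print("baseline accuracy:", cross_val_score(clf, X, y).mean())
print("with cluster feature:", cross_val_score(clf, X_aug, y).mean())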
This dataset is generated with the help of the make_blobs() function.
The make_blobs() function in Python is used to generate isotropic Gaussian blobs for clustering. It can be used to generate a dataset with a specified number of clusters, number of samples per cluster, and standard deviation for each cluster. The make_blobs() function can be used to generate datasets for a variety of clustering algorithms, and it is a useful tool for testing and evaluating clustering algorithms.
The syntax for the make_blobs() function is as follows:
from sklearn.datasets import make_blobs
# Generate 100 samples grouped into 3 clusters, each with standard deviation 1.0;
# cluster centers are sampled uniformly from the box (-10, 10).
X, y = make_blobs(n_samples=100, centers=3, center_box=(-10.0, 10.0), cluster_std=1.0)
This will generate a dataset with 100 samples, 3 clusters, and a standard deviation of 1.0 for each cluster. The center of each cluster will be a random point in the box (-10, 10).
Figure: graph of the generated data with K=3 (https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F13406058%2Fbb816848d18527ab108d27b69400218b%2FScreenshot%202023-10-23%20230428.png?generation=1698082507938342&alt=media)
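As a rough sketch of using the generated blobs to test a clustering algorithm (the random_state is an arbitrary choice, not part of this dataset's description):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate the blobs and fit k-means with K=3, matching the figure above.
X, y_true = make_blobs(n_samples=100, centers=3, center_box=(-10.0, 10.0),
                       cluster_std=1.0, random_state=42)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print(y_pred[:10])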
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Clustering methods are valuable tools for the identification of patterns in high-dimensional data with applications in many scientific fields. However, quantifying uncertainty in clustering is a challenging problem, particularly when dealing with high-dimension low sample size (HDLSS) data. We develop a U-statistics based clustering approach that assesses statistical significance in clustering and is specifically tailored to HDLSS scenarios. These nonparametric methods rely on very few assumptions about the data, and thus can be applied to a wide range of datasets for which the Euclidean distance captures relevant features. Our main result is the development of a hierarchical significance clustering method. To do so, we first introduce an extension of a relevant U-statistic and develop its asymptotic theory. Additionally, as a preliminary step, we propose a binary nonnested significance clustering method and show its optimality in terms of expected values. Our approach is tested through multiple simulations and found to have more statistical power than competing alternatives in all scenarios considered. Our methods are further showcased in three applications ranging from genetics to image recognition problems. Code for these methods is available in R-package uclust. Supplementary materials for this article are available online.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Analysis of data measured on different scales is a relevant challenge. Biomedical studies often focus on high-throughput datasets of, e.g., quantitative measurements. However, the need for integration of other features possibly measured on different scales, e.g. clinical or cytogenetic factors, becomes increasingly important. The analysis results (e.g. a selection of relevant genes) are then visualized, while adding further information, like clinical factors, on top. However, a more integrative approach is desirable, where all available data are analyzed jointly, and where also in the visualization different data sources are combined in a more natural way. Here we specifically target integrative visualization and present a heatmap-style graphic display. To this end, we develop and explore methods for clustering mixed-type data, with special focus on clustering variables. Clustering of variables does not receive as much attention in the literature as does clustering of samples. We extend the variables clustering methodology by two new approaches, one based on the combination of different association measures and the other on distance correlation. With simulation studies we evaluate and compare different clustering strategies. Applying specific methods for mixed-type data proves to be comparable and in many cases beneficial as compared to standard approaches applied to corresponding quantitative or binarized data. Our two novel approaches for mixed-type variables show similar or better performance than the existing methods ClustOfVar and bias-corrected mutual information. Further, in contrast to ClustOfVar, our methods provide dissimilarity matrices, which is an advantage, especially for the purpose of visualization. Real data examples aim to give an impression of various kinds of potential applications for the integrative heatmap and other graphical displays based on dissimilarity matrices. We demonstrate that the presented integrative heatmap provides more information than common data displays about the relationship among variables and samples. The described clustering and visualization methods are implemented in our R package CluMix available from https://cran.r-project.org/web/packages/CluMix.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
We propose a new unsupervised learning method for clustering a large number of time series based on a latent factor structure. Each cluster is characterized by its own cluster-specific factors in addition to some common factors which impact on all the time series concerned. Our setting also offers the flexibility that some time series may not belong to any clusters. The consistency with explicit convergence rates is established for the estimation of the common factors, the cluster-specific factors, and the latent clusters. Numerical illustration with both simulated data as well as a real data example is also reported. As a spin-off, the proposed new approach also advances significantly the statistical inference for the factor model of Lam and Yao. Supplementary materials for this article are available online.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Scientific investigations in medicine and beyond increasingly require observations to be described by more features than can be simultaneously visualized. Simply reducing the dimensionality by projections destroys essential relationships in the data. Similarly, traditional clustering algorithms introduce data bias that prevents detection of natural structures expected from generic nonlinear processes. We examine how these problems can best be addressed, where in particular we focus on two recent clustering approaches, Phenograph and Hebbian learning clustering, applied to synthetic and natural data examples. Our results reveal that already for very basic questions, minimizing clustering bias is essential, but that results can benefit further from biased post-processing.
Evaluating Correlation Between Measurement Samples in Reverberation Chambers Using Clustering
Abstract: Traditionally, autocorrelation or correlation-matrix methods have been applied to evaluate measurement correlation in reverberation chamber (RC) measurements. In this article, we introduce the use of clustering based on correlative distance to group correlated measurements. We apply the method to measurements taken in an RC using one and two paddles to stir the electromagnetic fields and applying decreasing angular steps between consecutive paddle positions. The results using varying correlation threshold values demonstrate that the method calculates the number of effective samples and allows discerning outliers, i.e., uncorrelated measurements, and clusters of correlated measurements. This calculation method, if verified, will allow non-sequential stir sequence design and, thereby, reduce testing time.
Keywords: Correlation, Pearson correlation coefficient (PCC), reverberation chambers (RC), mode-stirring samples, correlative distance, clustering analysis, adjacency matrix.
This R code and example data file are provided to apply a "Genogeographic clustering approach" to discerning multiple common spatial patterns of diversity among a large number of species. It permits more rigorous comparative studies from diverse published data and can be easily extended to a wide variety of alternative measures of genetic diversity or divergence. We see the approach as best used as an exploratory method, to uncover the patterns often hidden in multi-species communities, likely to be followed by more targeted model-testing analyses. The dataset was built from previously published data of single-species population genetics studies. For each species, a genetic diversity measure was computed within each geographic location. The measure of genetic diversity used depended on the genetic marker available. Haplotype diversity “H” was used for mitochondrial DNA, and the analogous allelic expected heterozygosity “He” for nuclear DNA microsatellites. Collectively, these values are he...
We propose two probability-like measures of individual cluster-membership certainty which can be applied to a hard partition of the sample such as that obtained from the Partitioning Around Medoids (PAM) algorithm, hierarchical clustering or k-means clustering. One measure extends the individual silhouette widths and the other is obtained directly from the pairwise dissimilarities in the sample. Unlike the classic silhouette, however, the measures behave like probabilities and can be used to investigate an individual's tendency to belong to a cluster. We also suggest two possible ways to evaluate the hard partition using these measures. We evaluate the performance of both measures in individuals with ambiguous cluster membership, using simulated binary datasets that have been partitioned by the PAM algorithm or continuous datasets that have been partitioned by hierarchical clustering and k-means clustering. For comparison, we also present results from soft-clustering algorithms such as soft analysis clustering (FANNY) and two model-based clustering methods. Our proposed measures perform comparably to the posterior-probability estimators from either FANNY or the model-based clustering methods. We also illustrate the proposed measures by applying them to Fisher's classic data set on irises.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
This dataset can be used as a benchmark for clustering word embeddings for German. The dataset contains book titles and is based on the dataset from the GermEval 2019 Shared Task on Hierarchical Classification of Blurbs. It contains 18'084 unique samples, 28 splits with 177 to 16'425 samples, and 4 to 93 unique classes. Splits are built similarly to MTEB's ArxivClusteringP2P. Have a look at the German Text Embedding Clustering Benchmark (Github, Paper) for more info, datasets and evaluation… See the full description on the dataset page: https://huggingface.co/datasets/slvnwhrl/blurbs-clustering-p2p.
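As a tentative sketch, the data could be pulled with the Hugging Face datasets library (assuming the default configuration loads directly; check the dataset page for the actual split names):

from datasets import load_dataset

# Load the clustering benchmark from the Hugging Face Hub.
ds = load_dataset("slvnwhrl/blurbs-clustering-p2p")
print(ds)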