100+ datasets found
  1. Clustering Exercises

    • kaggle.com
    zip
    Updated Apr 29, 2022
    Cite
    Joonas (2022). Clustering Exercises [Dataset]. https://www.kaggle.com/datasets/joonasyoon/clustering-exercises
    Explore at:
    Available download formats: zip (3602272 bytes)
    Dataset updated
    Apr 29, 2022
    Authors
    Joonas
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview

    Overview image: https://i.imgur.com/ZUX61cD.png

    Context

    Clustering is the task of grouping similar data points. You can create dummy data for cluster classification with methods from the sklearn package, but that takes some effort. This dataset is meant to help users who need hard test cases for clustering examples.

    Try to select a meaningful number of clusters and divide the data into those clusters. Here are exercises for you.

    Dataset

    All CSV files contain many rows of x, y and color, which you can see in the figures above.

    If you want to use the positions as integers, scale them and round off to integers, e.g. x = round(x * 100).
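    For example, with pandas (the file name is illustrative):

      import pandas as pd

      df = pd.read_csv('basic1.csv')  # hypothetical file from this dataset

      # Scale the coordinates and round off to integers
      df['x'] = (df['x'] * 100).round().astype(int)
      df['y'] = (df['y'] * 100).round().astype(int)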

    Furthermore, there is a GUI tool for generating 2D points for clustering; you can make your own dataset with it: https://www.joonas.io/cluster-paint

    Stay tuned for further updates! If you have any ideas, feel free to leave a comment.

  2. K Means - Data Blobs

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    txt
    Updated Feb 2, 2022
    Cite
    Jesus Rogel-Salazar (2022). K Means - Data Blobs [Dataset]. http://doi.org/10.6084/m9.figshare.19102187.v3
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 2, 2022
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Jesus Rogel-Salazar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example data to understand the implementation of K Means

  3. Dataset for: Optimal Transport, Mean Partition, and Uncertainty Assessment in Cluster Analysis

    • wiley.figshare.com
    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Jia Li; Beomseok Seo; Lin Lin (2023). Dataset for: Optimal Transport, Mean Partition, and Uncertainty Assessment in Cluster Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.8038925
    Explore at:
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Wiley: https://www.wiley.com/
    Authors
    Jia Li; Beomseok Seo; Lin Lin
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    In scientific data analysis, clusters identified computationally often substantiate existing hypotheses or motivate new ones. Yet the combinatorial nature of the clustering result, which is a partition rather than a set of parameters or a function, blurs notions of mean and variance. This intrinsic difficulty hinders the development of methods to improve clustering by aggregation or to assess the uncertainty of clusters generated. We overcome that barrier by aligning clusters via optimal transport. Equipped with this technique, we propose a new algorithm to enhance clustering by any baseline method using bootstrap samples. Cluster alignment enables us to quantify variation in the clustering result at the levels of both overall partitions and individual clusters. Set relationships between clusters such as one-to-one match, split, and merge can be revealed. A covering point set for each cluster, a concept akin to the confidence interval, is proposed. The tools we have developed here will help address the crucial question of whether any cluster is an intrinsic or spurious pattern. Experimental results on both simulated and real datasets are provided.

  4. Phrase Clustering Dataset (PCD)

    • registry.opendata.aws
    Updated Nov 21, 2022
    Cite
    Amazon (2022). Phrase Clustering Dataset (PCD) [Dataset]. https://registry.opendata.aws/pcd/
    Explore at:
    Dataset updated
    Nov 21, 2022
    Dataset provided by
    Amazon.com: http://amazon.com/
    Description

    This dataset is part of the paper "McPhraSy: Multi-Context Phrase Similarity and Clustering" by DN Cohen et al (2022). The purpose of PCD is to evaluate the quality of semantic-based clustering of noun phrases. The phrases were collected from the Amazon Review Dataset.

  5. Document Clustering

    • kaggle.com
    zip
    Updated Mar 7, 2022
    Cite
    Arai Seisenbek (2022). Document Clustering [Dataset]. https://www.kaggle.com/datasets/nenriki/document-clustering
    Explore at:
    Available download formats: zip (1587 bytes)
    Dataset updated
    Mar 7, 2022
    Authors
    Arai Seisenbek
    Description

    Assignment 1. Text similarity and Agglomerative Document Clustering.

    Learning outcomes:

    1. Read texts from a file and split them into words.
    2. Transform texts into vector spaces and calculate distances in these spaces.
    3. Build bag-of-words and TF-IDF vectorizers.

    Task 1.

    Please download the text file. It contains 22 sentences related to the «cat» topic: the cat (animal), the UNIX utility cat for displaying the contents of files, and versions of the OS X operating system named after the feline family. Your task is to find the two sentences that are closest in meaning to the first sentence in the document («In comparison to dogs, cats have not undergone .......»). We will use cosine distance as the measure of proximity.

    Steps:

    1. Open the file.
    2. Each line is one sentence. Convert them all to lower case using the string function lower(). EXAMPLE: in comparison to dogs, cats have not undergone major changes during the domestication process.
    3. Tokenization: split the sentences into words. For this you can use regular expressions that split on spaces or any other symbols that are not letters: re.split('[^a-z]', t). Do not forget to remove empty words. EXAMPLE: ['in', 'comparison', 'to', 'dogs', '', 'cats', 'have', 'not', 'undergone', 'major', 'changes', 'during', 'the', 'domestication', 'process'].
    4. Make a list of all the words that appear in the sentences (all words are unique) and assign each word an index from 0 to the number of unique words minus one. You can use a dict. EXAMPLE: {0: 'mac', 1: 'permanently', 2: 'osx', 3: 'download', 4: 'between', 5: 'based', 6: 'which', ............., 252: 'safer', 253: 'will'}. Hint: there are 254 unique words.
    5. Create a matrix with N x D dimensions, where N is the number of sentences and D is the number of unique words (22 x 254). Fill it in: the element with index (i, j) must equal the number of occurrences of the j-th word in the i-th sentence (bag of words).

    6. Find the cosine distance from the first sentence to all the other sentences. Which two sentences are closest to the first one? You can use scipy.spatial.distance.cosine.

    Of course, in Task 1 we implemented a very simple method: for example, «cat» and «cats» are treated as two different words even though the meaning is the same.
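    A minimal sketch of these steps (the file name sentences.txt is an assumption):

      import re
      import numpy as np
      from scipy.spatial.distance import cosine

      # Steps 1-2: read the file; each line is one sentence, lower-cased
      with open('sentences.txt') as f:  # hypothetical file name
          sentences = [line.lower() for line in f if line.strip()]

      # Step 3: split on non-letters and drop empty words
      tokens = [[w for w in re.split('[^a-z]', s) if w] for s in sentences]

      # Step 4: index each unique word (dicts preserve insertion order)
      vocab = {w: j for j, w in enumerate(dict.fromkeys(w for t in tokens for w in t))}

      # Step 5: bag-of-words matrix, entry (i, j) = count of word j in sentence i
      bow = np.zeros((len(sentences), len(vocab)))
      for i, t in enumerate(tokens):
          for w in t:
              bow[i, vocab[w]] += 1

      # Step 6: cosine distances from the first sentence to all others
      dists = sorted((cosine(bow[0], bow[i]), i) for i in range(1, len(sentences)))
      print(dists[:2])  # the two closest sentences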

    Task 2.

    For the second task, repeat steps 1-4 from Task 1. This time, build a Term Frequency - Inverse Document Frequency (TF-IDF) matrix instead. Find the cosine distance from the first sentence to all the other sentences. Which two sentences are closest to the first one? You can use scipy.spatial.distance.cosine. Is there any difference from the result of the previous task? Note: you should not use any existing libraries for TF-IDF. All the steps are similar to the previous example.
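    A sketch of a hand-rolled TF-IDF matrix built on top of the bag-of-words matrix bow from Task 1 (the assignment does not fix an exact TF-IDF variant; this is one common formulation):

      import numpy as np
      from scipy.spatial.distance import cosine

      tf = bow / bow.sum(axis=1, keepdims=True)  # term frequency per sentence
      df = (bow > 0).sum(axis=0)                 # number of sentences containing each word
      idf = np.log(len(bow) / df)                # inverse document frequency
      tfidf = tf * idf

      dists = sorted((cosine(tfidf[0], tfidf[i]), i) for i in range(1, len(tfidf)))
      print(dists[:2])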

    Task 3.

    Please run a hierarchical clustering algorithm on the matrices from Task 1 and Task 2, plot the dendrograms, and explain your results.

    NOTE: by default, scipy.cluster.hierarchy uses Euclidean distance. You should change it to cosine distance.
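    For example (using the bag-of-words matrix; 'average' linkage is one choice that works with the cosine metric):

      import matplotlib.pyplot as plt
      from scipy.cluster.hierarchy import linkage, dendrogram

      # metric='cosine' overrides the Euclidean default
      Z = linkage(bow, method='average', metric='cosine')
      dendrogram(Z)
      plt.show()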

  6. FastLloyd Clustering Datasets

    • zenodo.org
    xz
    Updated May 28, 2025
    Cite
    Abdulrahman Diaa; Thomas Humphries; Florian Kerschbaum (2025). FastLloyd Clustering Datasets [Dataset]. http://doi.org/10.5281/zenodo.15530593
    Explore at:
    Available download formats: xz
    Dataset updated
    May 28, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Abdulrahman Diaa; Thomas Humphries; Florian Kerschbaum
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This artifact bundles the five dataset archives used in our private federated clustering evaluation, corresponding to the real-world benchmarks, scaling experiments, ablation studies, and timing performance tests described in the paper. real_datasets.tar.xz includes ten established clustering benchmarks drawn from UCI and the Clustering basic benchmark (DOI: https://doi.org/10.1007/s10489-018-1238-7); scale_datasets.tar.xz contains the SynthNew family generated to assess scalability via the R clusterGeneration package; ablate_datasets.tar.xz holds the AblateSynth sets varying cluster separation for ablation analysis, also powered by clusterGeneration; g2_datasets.tar.xz packages the G2 sets (Gaussian clusters of size 2048 across dimensions 2–1024 with two clusters each), collected from the Clustering basic benchmark (DOI: https://doi.org/10.1007/s10489-018-1238-7); and timing_datasets.tar.xz includes the real s1 and lsun datasets alongside TimeSynth files (balanced synthetic clusters for timing), as per Mohassel et al.'s experimental framework.

    Contents

    1. real_datasets.tar.xz

    Contains ten real-world benchmark datasets, formatted as one sample per line with space-separated features:

    • iris.txt: 150 samples, 4 features, 3 classes; classic UCI Iris dataset for petal/sepal measurements.

    • lsun.txt: 400 samples, 2 features, 3 clusters; two-dimensional variant of the LSUN dataset for clustering experiments.

    • s1.txt: 5,000 samples, 2 features, 15 clusters; synthetic benchmark from Fränti’s S1 series.

    • house.txt: 1,837 samples, 3 features, 3 clusters; housing data transformed for clustering tasks.

    • adult.txt: 48,842 samples, 6 features, 3 clusters; UCI Census Income (“Adult”) dataset for income bracket prediction.

    • wine.txt: 178 samples, 13 features, 3 cultivars; UCI Wine dataset with chemical analysis features.

    • breast.txt: 569 samples, 9 features, 2 classes; Wisconsin Diagnostic Breast Cancer dataset.

    • yeast.txt: 1,484 samples, 8 features, 10 localization sites; yeast protein localization data.

    • mnist.txt: 10,000 samples, 784 features (28×28 pixels), 10 digit classes; MNIST handwritten digits.

    • birch2.txt: a random 25,000-sample subset of the 100,000-sample synthetic BIRCH2 dataset; 2 features, 100 clusters; for high-cluster-count evaluation.

    2. scale_datasets.tar.xz

    Holds the SynthNew_{k}_{d}_{s}.txt files for scaling experiments, where:

    • $k \in \{2,4,8,16,32\}$ is the number of clusters,

    • $d \in \{2,4,8,16,32,64,128,256,512\}$ is the dimensionality,

    • $s \in \{1,2,3\}$ are different random seeds.

    These are generated with the R clusterGeneration package with cluster sizes following a $1:2:...:k$ ratio. We incorporate a random number (in $[0, 100]$) of randomly sampled outliers and set the cluster separation degrees randomly in $[0.16, 0.26]$, spanning partially overlapping to separated clusters.

    3. ablate_datasets.tar.xz

    Contains the AblateSynth_{k}_{d}_{sep}.txt files for ablation studies, with:

    • $k \in \{2,4,8,16\}$ clusters,

    • $d \in \{2,4,8,16\}$ dimensions,

    • $sep \in \{0.25, 0.5, 0.75\}$ controlling cluster separation degrees.

    Also generated via clusterGeneration.

    4. g2_datasets.tar.xz

    Packages the G2 synthetic sets (g2-{dim}-{var}.txt) from the clustering-data benchmarks:

    • $N=2048$ samples, $k=2$ Gaussian clusters,

    • Dimensions $d \in \{1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024\}$

    • Cluster overlap $var \in \{10, 20, 30, 40, 50, 60, 70, 80, 90, 100\}$

    5. timing_datasets.tar.xz

    Includes:

    • s1.txt, lsun.txt: two real datasets for baseline timing.

    • timesynth_{k}_{d}_{n}.txt: synthetic timing datasets with balanced cluster sizes $C_{avg} = N/K$, varying:

      • $k \in \{2,5\}$

      • $d \in \{2,5\}$

      • $N \in \{10000, 100000\}$

    Generated similarly to the scaling sets, following Mohassel et al.’s timing experiment protocol.

    Usage:

    Unpack any archive with tar -xJf, for example: tar -xJf real_datasets.tar.xz
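    Once unpacked, each .txt file can be loaded directly; a minimal sketch (the path is illustrative):

      import numpy as np

      # One sample per line, space-separated features
      X = np.loadtxt('iris.txt')
      print(X.shape)  # expected (150, 4) for iris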

  7. Clustering Data Sets With 2 Examples

    • kaggle.com
    zip
    Updated Sep 9, 2019
    Cite
    Manohar Reddy (2019). Clustering Data Sets With 2 Examples [Dataset]. https://www.kaggle.com/manohar676/clustering-data-sets-with-2-examples
    Explore at:
    Available download formats: zip (1905 bytes)
    Dataset updated
    Sep 9, 2019
    Authors
    Manohar Reddy
    Description

    Dataset

    This dataset was created by Manohar Reddy


  8. Data from: Data Nuggets: A Method for Reducing Big Data While Preserving Data Structure

    • tandf.figshare.com
    tar
    Updated Jun 11, 2024
    Cite
    Traymon E. Beavers; Ge Cheng; Yajie Duan; Javier Cabrera; Mariusz Lubomirski; Dhammika Amaratunga; Jeffrey E. Teigler (2024). Data Nuggets: A Method for Reducing Big Data While Preserving Data Structure [Dataset]. http://doi.org/10.6084/m9.figshare.25594361.v1
    Explore at:
    Available download formats: tar
    Dataset updated
    Jun 11, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Traymon E. Beavers; Ge Cheng; Yajie Duan; Javier Cabrera; Mariusz Lubomirski; Dhammika Amaratunga; Jeffrey E. Teigler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Big data, with N × P dimension where N is extremely large, has created new challenges for data analysis, particularly in the realm of creating meaningful clusters of data. Clustering techniques, such as K-means or hierarchical clustering, are popular methods for performing exploratory analysis on large datasets. Unfortunately, these methods are not always possible to apply to big data due to memory or time constraints generated by calculations of order $P \cdot N(N-1)/2$. To circumvent this problem, typically the clustering technique is applied to a random sample drawn from the dataset; however, a weakness is that the structure of the dataset, particularly at the edges, is not necessarily maintained. We propose a new solution through the concept of “data nuggets”, which reduces a large dataset into a small collection of nuggets of data, each containing a center, weight, and scale parameter. The data nuggets are then input into algorithms that compute methods such as principal components analysis and clustering in a more computationally efficient manner. We show the consistency of the data nuggets based covariance estimator and apply the methodology of data nuggets to perform exploratory analysis of a flow cytometry dataset containing over one million observations using PCA and K-means clustering for weighted observations. Supplementary materials for this article are available online.

  9. Dataset for: Some Remarks on the R2 for Clustering

    • wiley.figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Nicola Loperfido; Thaddeus Tarpey (2023). Dataset for: Some Remarks on the R2 for Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.6124508.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Wiley: https://www.wiley.com/
    Authors
    Nicola Loperfido; Thaddeus Tarpey
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that generally the $R^2$ can be artificially inflated by linearly transforming the data by “stretching” and by projecting. Also, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering for misspecified models. Several simulation illustrations are provided highlighting weaknesses in the clustering $R^2$, especially in high-dimensional settings. A functional data example is given showing how the $R^2$ for clustering can vary dramatically depending on how the curves are estimated.

  10. 2D clustering data

    • kaggle.com
    zip
    Updated Sep 11, 2022
    Cite
    Samuel Cortinhas (2022). 2D clustering data [Dataset]. https://www.kaggle.com/datasets/samuelcortinhas/2d-clustering-data/versions/2
    Explore at:
    Available download formats: zip (6686 bytes)
    Dataset updated
    Sep 11, 2022
    Authors
    Samuel Cortinhas
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Small 2-dimensional clustering dataset for examples and case studies.

    Created using https://www.joonas.io/cluster-paint/

    I used this in my introduction to k-Means clustering notebook here: https://www.kaggle.com/samuelcortinhas/k-means-from-scratch

  11. Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets

    • data.mendeley.com
    Updated Nov 14, 2018
    Cite
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Explore at:
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The purpose of data mining analysis is always to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset: before doing any work on the data, it has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance did not improve much. The reason may be that the features we selected for clustering are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

    From the dimensionality reduction perspective: clustering differs from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension can lose a lot of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses pretty much all meaning. Therefore "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

    From the creating new features perspective: clustering analysis creates labels based on the patterns of the data, which brings uncertainty into the data. When using clustering prior to classification, the choice of the number of clusters strongly affects the performance of the clustering, and in turn the performance of the classification. If the subset of features we apply clustering techniques to is well suited for it, it might increase the overall performance on classification. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better.

    We did not lock in the clustering outputs using a random_state, in order to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the selected methods at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering in the data preprocessing.

    Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and also to continue to revise the models from time to time as things change.
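    A minimal sketch of the stability check described above (X is a hypothetical feature matrix; KMeans is run several times without a fixed random_state and the runs are compared with the adjusted Rand index):

      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.metrics import adjusted_rand_score

      X = np.random.rand(500, 10)  # stand-in for the preprocessed features

      # Cluster several times without locking in random_state
      labelings = [KMeans(n_clusters=5, n_init=10).fit_predict(X) for _ in range(5)]

      # Pairwise agreement between runs; values near 0 indicate unstable clusters
      for i in range(len(labelings)):
          for j in range(i + 1, len(labelings)):
              print(i, j, round(adjusted_rand_score(labelings[i], labelings[j]), 3))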

  12. K-Means Cluster Dataset

    • kaggle.com
    zip
    Updated Oct 23, 2023
    Cite
    Saquib Hussain (2023). K-Means Cluster Dataset [Dataset]. https://www.kaggle.com/datasets/saquib7hussain/k-mean-cluster-dataset/code
    Explore at:
    Available download formats: zip (9965 bytes)
    Dataset updated
    Oct 23, 2023
    Authors
    Saquib Hussain
    Description

    This dataset was generated with the help of the make_blobs() function.

    The make_blobs() function

    The make_blobs() function in Python is used to generate isotropic Gaussian blobs for clustering. It can generate a dataset with a specified number of clusters, number of samples, and standard deviation for each cluster, which makes it a useful tool for testing and evaluating a variety of clustering algorithms.

    The syntax for the make_blobs() function is as follows: from sklearn.datasets import make_blobs; X, y = make_blobs(n_samples=100, centers=3, cluster_std=1.0, center_box=(-10.0, 10.0))

    This will generate a dataset with 100 samples, 3 clusters, and a standard deviation of 1.0 for each cluster. The center of each cluster will be a random point in the box (-10.0, 10.0).
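    A quick usage sketch with the same parameters (a random_state is added here for reproducibility):

      import matplotlib.pyplot as plt
      from sklearn.datasets import make_blobs

      # 100 samples drawn from 3 isotropic Gaussian clusters
      X, y = make_blobs(n_samples=100, centers=3, cluster_std=1.0,
                        center_box=(-10.0, 10.0), random_state=42)

      # Scatter plot colored by the true cluster label
      plt.scatter(X[:, 0], X[:, 1], c=y)
      plt.show()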

    In the graph below, K=3: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F13406058%2Fbb816848d18527ab108d27b69400218b%2FScreenshot%202023-10-23%20230428.png?generation=1698082507938342&alt=media

  13. Data from: U-Statistical Inference for Hierarchical Clustering

    • tandf.figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    Marcio Valk; Gabriela Bettella Cybis (2023). U-Statistical Inference for Hierarchical Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.12844523.v3
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Marcio Valk; Gabriela Bettella Cybis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Clustering methods are valuable tools for the identification of patterns in high-dimensional data with applications in many scientific fields. However, quantifying uncertainty in clustering is a challenging problem, particularly when dealing with high-dimension low sample size (HDLSS) data. We develop a U-statistics based clustering approach that assesses statistical significance in clustering and is specifically tailored to HDLSS scenarios. These nonparametric methods rely on very few assumptions about the data, and thus can be applied to a wide range of datasets for which the Euclidean distance captures relevant features. Our main result is the development of a hierarchical significance clustering method. To do so, we first introduce an extension of a relevant U-statistic and develop its asymptotic theory. Additionally, as a preliminary step, we propose a binary nonnested significance clustering method and show its optimality in terms of expected values. Our approach is tested through multiple simulations and found to have more statistical power than competing alternatives in all scenarios considered. Our methods are further showcased in three applications ranging from genetics to image recognition problems. Code for these methods is available in R-package uclust. Supplementary materials for this article are available online.

  14. Clustering of samples and variables with mixed-type data

    • plos.figshare.com
    tiff
    Updated Jun 1, 2023
    Cite
    Manuela Hummel; Dominic Edelmann; Annette Kopp-Schneider (2023). Clustering of samples and variables with mixed-type data [Dataset]. http://doi.org/10.1371/journal.pone.0188274
    Explore at:
    Available download formats: tiff
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS: http://plos.org/
    Authors
    Manuela Hummel; Dominic Edelmann; Annette Kopp-Schneider
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of data measured on different scales is a relevant challenge. Biomedical studies often focus on high-throughput datasets of, e.g., quantitative measurements. However, the need for integration of other features possibly measured on different scales, e.g. clinical or cytogenetic factors, becomes increasingly important. The analysis results (e.g. a selection of relevant genes) are then visualized, while adding further information, like clinical factors, on top. However, a more integrative approach is desirable, where all available data are analyzed jointly, and where also in the visualization different data sources are combined in a more natural way. Here we specifically target integrative visualization and present a heatmap-style graphic display. To this end, we develop and explore methods for clustering mixed-type data, with special focus on clustering variables. Clustering of variables does not receive as much attention in the literature as does clustering of samples. We extend the variables clustering methodology by two new approaches, one based on the combination of different association measures and the other on distance correlation. With simulation studies we evaluate and compare different clustering strategies. Applying specific methods for mixed-type data proves to be comparable and in many cases beneficial as compared to standard approaches applied to corresponding quantitative or binarized data. Our two novel approaches for mixed-type variables show similar or better performance than the existing methods ClustOfVar and bias-corrected mutual information. Further, in contrast to ClustOfVar, our methods provide dissimilarity matrices, which is an advantage, especially for the purpose of visualization. Real data examples aim to give an impression of various kinds of potential applications for the integrative heatmap and other graphical displays based on dissimilarity matrices. We demonstrate that the presented integrative heatmap provides more information than common data displays about the relationship among variables and samples. The described clustering and visualization methods are implemented in our R package CluMix available from https://cran.r-project.org/web/packages/CluMix.

  15. Data from: Factor Modeling for Clustering High-Dimensional Time Series

    • tandf.figshare.com
    zip
    Updated Feb 9, 2024
    Cite
    Bo Zhang; Guangming Pan; Qiwei Yao; Wang Zhou (2024). Factor Modeling for Clustering High-Dimensional Time Series [Dataset]. http://doi.org/10.6084/m9.figshare.22141184.v4
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 9, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Bo Zhang; Guangming Pan; Qiwei Yao; Wang Zhou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We propose a new unsupervised learning method for clustering a large number of time series based on a latent factor structure. Each cluster is characterized by its own cluster-specific factors in addition to some common factors which impact on all the time series concerned. Our setting also offers the flexibility that some time series may not belong to any clusters. The consistency with explicit convergence rates is established for the estimation of the common factors, the cluster-specific factors, and the latent clusters. Numerical illustration with both simulated data as well as a real data example is also reported. As a spin-off, the proposed new approach also advances significantly the statistical inference for the factor model of Lam and Yao. Supplementary materials for this article are available online.

  16. 8d synthetic dataset labels from Clustering: how much bias do we need?

    • rs.figshare.com
    txt
    Updated Jun 3, 2023
    Cite
    Tom Lorimer; Jenny Held; Ruedi Stoop (2023). 8d synthetic dataset labels from Clustering: how much bias do we need?. [Dataset]. http://doi.org/10.6084/m9.figshare.4806571.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    The Royal Society
    Authors
    Tom Lorimer; Jenny Held; Ruedi Stoop
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Scientific investigations in medicine and beyond increasingly require observations to be described by more features than can be simultaneously visualized. Simply reducing the dimensionality by projections destroys essential relationships in the data. Similarly, traditional clustering algorithms introduce data bias that prevents detection of natural structures expected from generic nonlinear processes. We examine how these problems can best be addressed, where in particular we focus on two recent clustering approaches, Phenograph and Hebbian learning clustering, applied to synthetic and natural data examples. Our results reveal that already for very basic questions, minimizing clustering bias is essential, but that results can benefit further from biased post-processing.

  17. Evaluating Correlation Between Measurement Samples in Reverberation Chambers Using Clustering

    • catalog.data.gov
    • datasets.ai
    Updated May 9, 2023
    Cite
    National Institute of Standards and Technology (2023). Evaluating Correlation Between Measurement Samples in Reverberation Chambers Using Clustering [Dataset]. https://catalog.data.gov/dataset/evaluating-correlation-between-measurement-samples-in-reverberation-chambers-using-cluster
    Explore at:
    Dataset updated
    May 9, 2023
    Dataset provided by
    National Institute of Standards and Technology: http://www.nist.gov/
    Description

    Abstract: Traditionally, in reverberation chamber (RC) measurements, autocorrelation or correlation-matrix methods have been applied to evaluate measurement correlation. In this article, we introduce the use of clustering based on correlative distance to group correlated measurements. We apply the method to measurements taken in an RC using one and two paddles to stir the electromagnetic fields, applying decreasing angular steps between consecutive paddle positions. The results using varying correlation threshold values demonstrate that the method calculates the number of effective samples and allows discerning outliers, i.e., uncorrelated measurements, and clusters of correlated measurements. This calculation method, if verified, will allow non-sequential stir sequence design and, thereby, reduce testing time.

    Keywords: correlation, Pearson correlation coefficient (PCC), reverberation chambers (RC), mode-stirring samples, correlative distance, clustering analysis, adjacency matrix.

  18. R code and example data for using genogeographic clustering approach

    • datadryad.org
    • search.dataone.org
    zip
    Updated Feb 14, 2022
    Cite
    Vanessa Arranz (2022). R code and example data for using genogeographic clustering approach [Dataset]. http://doi.org/10.5061/dryad.b2rbnzsbr
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 14, 2022
    Dataset provided by
    Dryad
    Authors
    Vanessa Arranz
    Time period covered
    Jun 8, 2020
    Description

    This R code and example data file are provided for using the "genogeographic clustering approach" to discern multiple common spatial patterns of diversity among a large number of species. It permits more rigorous comparative studies from diverse published data, and can be easily extended to a wide variety of alternative measures of genetic diversity or divergence. We see the approach best used as an exploratory method, to uncover the patterns often hidden in multi-species communities, likely to be followed by more targeted model-testing analyses. The dataset was built from previously published data of single-species population genetics studies. For each species, a genetic diversity measure was computed within each geographic location; the measure of genetic diversity used depended on the genetic marker available. Haplotype diversity “H” was used for mitochondrial DNA, and the analogous allelic expected heterozygosity “He” for nuclear DNA microsatellites. Collectively, these values are he...

  19. Data from: Simple Measures of Individual Cluster-Membership Certainty for Hard Partitional Clustering

    • datasetcatalog.nlm.nih.gov
    • tandf.figshare.com
    Updated Jul 9, 2018
    Cite
    Graham, Jinko; Liu, Dongmeng (2018). Simple Measures of Individual Cluster-Membership Certainty for Hard Partitional Clustering [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000690043
    Explore at:
    Dataset updated
    Jul 9, 2018
    Authors
    Graham, Jinko; Liu, Dongmeng
    Description

    We propose two probability-like measures of individual cluster-membership certainty which can be applied to a hard partition of the sample such as that obtained from the Partitioning Around Medoids (PAM) algorithm, hierarchical clustering or k-means clustering. One measure extends the individual silhouette widths and the other is obtained directly from the pairwise dissimilarities in the sample. Unlike the classic silhouette, however, the measures behave like probabilities and can be used to investigate an individual's tendency to belong to a cluster. We also suggest two possible ways to evaluate the hard partition using these measures. We evaluate the performance of both measures in individuals with ambiguous cluster membership, using simulated binary datasets that have been partitioned by the PAM algorithm or continuous datasets that have been partitioned by hierarchical clustering and k-means clustering. For comparison, we also present results from soft-clustering algorithms such as soft analysis clustering (FANNY) and two model-based clustering methods. Our proposed measures perform comparably to the posterior-probability estimators from either FANNY or the model-based clustering methods. We also illustrate the proposed measures by applying them to Fisher's classic data set on irises.

  20. blurbs-clustering-p2p

    • huggingface.co
    Updated Apr 22, 2023
    Cite
    Silvan (2023). blurbs-clustering-p2p [Dataset]. https://huggingface.co/datasets/slvnwhrl/blurbs-clustering-p2p
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 22, 2023
    Authors
    Silvan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset can be used as a benchmark for clustering word embeddings for German. The dataset contains book titles and is based on the dataset from the GermEval 2019 Shared Task on Hierarchical Classification of Blurbs. It contains 18'084 unique samples, 28 splits with 177 to 16'425 samples, and 4 to 93 unique classes. Splits are built similarly to MTEB's ArxivClusteringP2P. Have a look at the German Text Embedding Clustering Benchmark (Github, Paper) for more info, datasets and evaluation… See the full description on the dataset page: https://huggingface.co/datasets/slvnwhrl/blurbs-clustering-p2p.
