License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset collection comprises 15 diverse two-dimensional datasets specifically designed for clustering analysis. Each dataset contains three columns: x, y, and target, where x and y represent the coordinates of the data points, and target indicates the cluster label.
Visualisation of the data: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F20292402%2F3cc81328beabc815fe500973fee1f7ac%2Fdescription.png?generation=1737484616903723&alt=media
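A minimal usage sketch, assuming one of the 15 CSVs is saved locally (the file name below is a placeholder): load a dataset, run k-means with the true number of clusters, and score the result against the target column.

```python
# Minimal sketch: load one of the 2D datasets and compare k-means labels
# with the provided ground truth. "blobs.csv" is a placeholder name;
# substitute any of the 15 CSVs in the collection.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

df = pd.read_csv("blobs.csv")          # columns: x, y, target
X = df[["x", "y"]].to_numpy()

k = df["target"].nunique()             # use the true number of clusters
pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

print("ARI:", adjusted_rand_score(df["target"], pred))
```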
License: Open Data Commons Database Contents License (DbCL) v1.0 (http://opendatacommons.org/licenses/dbcl/1.0/)
Clustering benchmark datasets published by the School of Computing, University of Eastern Finland.
2D scatter points plus labels; the raw files need some format processing before use.
Find more at https://cs.joensuu.fi/sipu/datasets/
@misc{ClusteringDatasets,
  author = {Pasi Fr{\"a}nti et al.},
  title  = {Clustering datasets},
  year   = {2015},
  url    = {http://cs.uef.fi/sipu/datasets/}
}
With these standard, well-known benchmarks, various clustering algorithms can be run and compared through a number of kernels.
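As a starting point, a hedged loading sketch; the file names below are placeholders, and the exact layout varies per dataset, so inspect the raw files first.

```python
# Minimal sketch, assuming a SIPU-style layout: a whitespace-separated
# file of 2D points plus a companion ground-truth partition file with one
# integer label per point. File names are placeholders; some partition
# (.pa) files carry a short text header, so pass skiprows= if needed.
import numpy as np

points = np.loadtxt("data.txt")                 # shape (n, 2): x, y
labels = np.loadtxt("data-gt.pa", dtype=int)    # one label per point

assert len(points) == len(labels)
print(points.shape, np.unique(labels))
```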
MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE CLUSTERING
MOHAMMAD SALIM AHMED, LATIFUR KHAN, NIKUNJ OZA, AND MANDAVA RAJESWARI
Abstract. There has been a lot of research targeting text classification. Much of it focuses on a particular characteristic of text data: multi-labelity. This arises due to the fact that a document may be associated with multiple classes at the same time. The consequence of this characteristic is the low performance of traditional binary or multi-class classification techniques on multi-label text data. In this paper, we propose a text classification technique that considers this characteristic and provides very good performance. Our multi-label text classification approach is an extension of our previously formulated [3] multi-class text classification approach called SISC (Semi-supervised Impurity based Subspace Clustering). We call this new classification model SISC-ML (SISC Multi-Label). Empirical evaluation on the real-world multi-label NASA ASRS (Aviation Safety Reporting System) data set reveals that our approach outperforms state-of-the-art text classification as well as subspace clustering algorithms.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
IFCB Plankton Labeled (Cluster-Sorted)
This dataset contains labeled images of phytoplankton collected with the Planktivore Imaging System. Images were preprocessed with zero-padding and resized to the standard input size used by ViT_b_16. The dataset was originally constructed by clustering unlabeled ROI images using deep features from a ViT model. Clusters were then saved locally and manually curated into taxonomic labels and higher-order groups.
Dataset Summary… See the full description on the dataset page: https://huggingface.co/datasets/patcdaniel/synchro-April2025-cluster-labeled-highMag.
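A hedged sketch of the described preprocessing (zero-padding to a square, then resizing to the 224x224 input that torchvision's vit_b_16 expects); the authors' exact parameters may differ.

```python
# Hedged sketch of the described preprocessing: zero-pad each ROI image
# to a square, then resize to 224x224 (the standard vit_b_16 input size).
# The dataset authors' exact padding/resize settings are not given here.
from PIL import Image
from torchvision import transforms

def pad_to_square(img: Image.Image) -> Image.Image:
    w, h = img.size
    side = max(w, h)
    canvas = Image.new(img.mode, (side, side), 0)   # zero (black) padding
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))
    return canvas

preprocess = transforms.Compose([
    transforms.Lambda(pad_to_square),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```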
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset was collected from PriceRunner, a popular product comparison platform. It includes 35,311 product offers from 10 categories, provided by 306 different merchants. The dataset provides an ideal testbed for evaluating classification, clustering, and entity matching algorithms. Although it contains product-related data, it can be applied to any problem involving text/short-text mining.
| Variable Name | Role | Type | Description | Units | Missing Values |
|---|---|---|---|---|---|
| Product ID | Feature | Integer | Unique identifier for each product | | No |
| Product Title | Feature | Categorical | Title/name of the product | | No |
| Merchant ID | Feature | Integer | Unique identifier for each merchant | | No |
| Cluster ID | Feature | Integer | Identifier for product clusters | | No |
| Cluster Label | Feature | Categorical | Label for product clusters | | No |
| Category ID | Feature | Integer | Unique identifier for each category | | No |
| Category Label | Feature | Categorical | Label for product category | | No |
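A minimal loading sketch; the file name and exact column spellings are assumptions, so adjust them to the downloaded CSV.

```python
# Minimal sketch: load the PriceRunner offers and inspect cluster sizes,
# e.g. as a starting point for entity-matching evaluation. File name and
# column spellings are assumptions about the distributed CSV.
import pandas as pd

offers = pd.read_csv("pricerunner_aggregate.csv")
sizes = offers.groupby("Cluster ID")["Product ID"].count()
print(sizes.describe())                         # offers per product cluster
print(offers["Category Label"].value_counts())  # 10 categories expected
```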
With the advent and expansion of social networking, the amount of generated text data has seen a sharp increase. In order to handle such a huge volume of text data, new and improved text mining techniques are a necessity. One of the characteristics of text data that makes text mining difficult is multi-labelity. In order to build a robust and effective text classification method, which is an integral part of text mining research, we must consider this property more closely. This property is not unique to text data, as it can be found in non-text (e.g., numeric) data as well; however, it is most prevalent in text data. It also puts the text classification problem in the domain of multi-label classification (MLC), where each instance is associated with a subset of class labels instead of a single class, as in conventional classification. In this paper, we explore how the generation of pseudo labels (i.e., combinations of existing class labels) can help us perform better text classification, and under what circumstances. During classification, the high and sparse dimensionality of text data has also been considered. Although we propose and evaluate a text classification technique here, our main focus is on handling the multi-labelity of text data while utilizing the correlation among the multiple labels existing in the data set. Our text classification technique is called pseudo-LSC (pseudo-Label Based Subspace Clustering). It is a subspace clustering algorithm that considers the high and sparse dimensionality as well as the correlation among different class labels during the classification process to provide better performance than existing approaches. Results on three real-world multi-label data sets provide insight into how multi-labelity is handled in our classification process and show the effectiveness of our approach.
License: CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
We performed CODEX (co-detection by indexing) multiplexed imaging on four sections of the human colon (ascending, transverse, descending, and sigmoid) using a panel of 47 oligonucleotide-barcoded antibodies. Subsequently, images underwent standard CODEX image processing (tile stitching, drift compensation, cycle concatenation, background subtraction, deconvolution, and determination of the best focal plane) and single-cell segmentation. The output of this process was a dataframe of nearly 130,000 cells with fluorescence values quantified for each marker. We used this dataframe as input to five normalization conditions, comparing z, double-log(z), min/max, and arcsinh normalizations to the original unmodified dataset. We used these normalized dataframes as inputs for four unsupervised clustering algorithms: k-means, Leiden, X-shift Euclidean, and X-shift angular.
From the clustering outputs, we then labeled the resulting clusters of cells observed in the data, producing 20 unique cell type labels. We also labeled cell types by hierarchically hand-gating the data within CellEngine (cellengine.com). We also created another gold standard for comparison by overclustering unnormalized data with X-shift angular clustering. Finally, we created one last label as the major cell type call for each cell from all 21 cell type labels in the dataset.
Consequently, the dataset has individual cells segmented out in each row. There are columns for the X, Y position in pixels in the overall montage image of the dataset, and columns indicating which region the data came from (4 total). The rest are labels generated by all the clustering and normalization techniques used in the manuscript, which were compared to each other. These were also the data used for the neighborhood analysis in the last figure of the manuscript. Labels are provided at all four levels of cell type granularity (from 7 cell types to 35 cell types).
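For illustration, hedged implementations of the four normalizations named above, applied per marker column; the exact formulas (the double-log(z) composition, the arcsinh cofactor) are assumptions rather than the authors' code.

```python
# Hedged sketch of the four normalizations named above, applied per marker
# column; treat these as illustrative, not the study's implementation.
import numpy as np

def z_norm(x):
    return (x - x.mean()) / x.std()

def double_log_z(x):
    # assumed composition: two log1p transforms of shifted values, then z-scored
    return z_norm(np.log1p(np.log1p(x - x.min())))

def min_max(x):
    return (x - x.min()) / (x.max() - x.min())

def arcsinh_norm(x, cofactor=150.0):   # cofactor choice is an assumption
    return np.arcsinh(x / cofactor)

# e.g., normalized = df[marker_cols].apply(arcsinh_norm)
```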
License: Unknown (https://choosealicense.com/licenses/unknown/)
StackExchangeClustering.v2: an MTEB (Massive Text Embedding Benchmark) dataset
Clustering of titles from 121 Stack Exchange sites: 25 sets, each with 10-50 classes, and each class with 100-1000 sentences.
Task category: t2c
Domains: Web, Written. Reference: https://arxiv.org/abs/2104.07081
How to evaluate on this task
You can evaluate an embedding model on this dataset using the following code: import mteb
task = … See the full description on the dataset page: https://huggingface.co/datasets/mteb/stackexchange-clustering.
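A hedged completion of the truncated snippet, following MTEB's documented usage pattern (API names can differ across mteb versions); check the dataset page for the authoritative version. The model name is a placeholder.

```python
# Hedged completion of the truncated snippet above, following MTEB's
# documented usage pattern; the model name is a placeholder.
import mteb

tasks = mteb.get_tasks(tasks=["StackExchangeClustering.v2"])
model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")
```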
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Topic labelled online social network (OSN) data sets are useful to evaluate topic modelling and document clustering tasks. We provide three data sets with topic labels from two online social networks: Twitter and Reddit. To comply with Twitter's terms and conditions, we only publish the tweet identifiers along with the topic label. The Reddit data is supplied with the full text and the topic label. The first Twitter data set was collected from the Twitter API by filtering for the hashtag #Auspol, used to tag political discussion tweets in Australia. The second Twitter data set was originally used in the RepLab 2013 competition and contains expert annotated topics. The Reddit data set consists of 40,000 Reddit parent comments from May 2015 belonging to 5 subreddit pages, which are used as topic labels.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Background: A hierarchy, characterized by tree-like relationships, is a natural method of organizing data in various domains. When considering an unsupervised machine learning routine, such as clustering, a bottom-up hierarchical (BU, agglomerative) algorithm is used as a default and is often the only method applied.
Methodology/Principal Findings: We show that hierarchical clustering approaches that involve global considerations, such as top-down (TD, divisive) or glocal (global-local) algorithms, are better suited to reveal meaningful patterns in the data. This is demonstrated by testing the correspondence between the results of several algorithms (TD, glocal, and BU) and the correct annotations provided by experts. The correspondence was tested in multiple domains, including gene expression experiments, stock trade records, and functional protein families. The performance of each algorithm is evaluated by statistical criteria that are assigned to clusters (nodes of the hierarchy tree) based on expert-labeled data. Whereas TD algorithms perform better on global patterns, BU algorithms perform well and are advantageous when finer granularity of the data is sought. In addition, a novel TD algorithm based on the genuine density of the data points is presented and shown to outperform other divisive and agglomerative methods. Application of the algorithm to more than 500 protein sequences belonging to ion channels illustrates the potential of the method for inferring overlooked functional annotations. ClustTree, a graphical Matlab toolbox for applying various hierarchical clustering algorithms and testing their quality, is made available.
Conclusions: Although currently rarely used, global approaches, in particular TD or glocal algorithms, should be considered in the exploratory process of clustering. In general, applying unsupervised clustering methods can leverage the quality of manually created mappings of protein families. As demonstrated, it can also provide insights into erroneous and missed annotations.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Federated clustering is a distributed clustering algorithm that does not require the transmission of raw data and is widely used. However, it struggles to handle Non-IID data effectively, because it is difficult to obtain accurate global consistency measures under Non-Independent and Identically Distributed (Non-IID) conditions. To address this issue, we propose a federated k-means clustering algorithm based on a cluster backbone, called FKmeansCB. First, we add Laplace noise to all the local data and run k-means clustering on the client side to obtain cluster centers, which faithfully represent the cluster backbone (i.e., the data structures of the clusters). The cluster backbone represents the client's features and can approximately capture the features of data points with different labels in Non-IID situations. We then upload these cluster centers to the server. Subsequently, the server aggregates all cluster centers and runs the k-means clustering algorithm to obtain global cluster centers, which are then sent back to the clients. Finally, each client assigns all of its data points to the nearest global cluster center to produce the final clustering results. We have validated the performance of the proposed algorithm on six datasets, including the large-scale MNIST dataset. Compared with the leading non-federated and federated clustering algorithms, FKmeansCB offers significant advantages in both clustering accuracy and running time.
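The steps above translate directly into code. An illustrative sketch, not the paper's implementation; the k values and the noise scale are assumptions.

```python
# Illustrative sketch of the FKmeansCB steps described above: Laplace
# noise on local data, client-side k-means for cluster-backbone centers,
# server-side k-means over all uploaded centers, then final assignment.
# k values and noise scale are assumptions, not the paper's settings.
import numpy as np
from sklearn.cluster import KMeans

def fkmeans_cb(client_datasets, k_local=10, k_global=3, noise_scale=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Clients: perturb local data and extract cluster-backbone centers.
    uploaded = []
    for X in client_datasets:
        X_noisy = X + rng.laplace(0.0, noise_scale, size=X.shape)
        km = KMeans(n_clusters=k_local, n_init=10, random_state=seed).fit(X_noisy)
        uploaded.append(km.cluster_centers_)
    # Server: aggregate all centers and cluster them into global centers.
    server_km = KMeans(n_clusters=k_global, n_init=10, random_state=seed)
    server_km.fit(np.vstack(uploaded))
    centers = server_km.cluster_centers_
    # Clients: assign each raw point to its nearest global center.
    return [np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
            for X in client_datasets]
```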
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
This dataset contains performance, attendance, and participation metrics of 300 students, intended for clustering, exploratory data analysis (EDA), and educational analytics. It can be used to explore relationships between quizzes, exams, GPA, attendance, lab sessions, and other academic indicators.
This dataset is ideal for unsupervised learning exercises, clustering students based on performance patterns, or for demonstrating educational analytics workflows.
Note: This is a small dataset (300 rows) and is not suitable for training large-scale supervised models.
File Name: student_performance.csv
Format: CSV (Comma-Separated Values)
Rows: 300
Columns: 16 features + optional identifier columns
Column Details:
| Column Name | Type | Description |
| ----------------------- | ------- | -------------------------------------------------------- |
| student_id | int64 | Unique student identifier |
| name | object | Student name (should be anonymized before use) |
| age | int64 | Age of the student (years) |
| gender | object | Gender of the student |
| quiz1_marks | float64 | Marks obtained in Quiz 1 (0-10) |
| quiz2_marks | float64 | Marks obtained in Quiz 2 (0-10) |
| quiz3_marks | float64 | Marks obtained in Quiz 3 (0-10) |
| total_assignments | int64 | Total number of assignments assigned |
| assignments_submitted | float64 | Number of assignments submitted (NaN in current dataset) |
| midterm_marks | float64 | Marks obtained in midterm exam (0-30) |
| final_marks | float64 | Marks obtained in final exam (0-50) |
| previous_gpa | float64 | GPA from previous semester (0-4 scale) |
| total_lectures | int64 | Total number of lectures scheduled |
| lectures_attended | int64 | Number of lectures attended |
| total_lab_sessions | int64 | Total lab sessions assigned |
| labs_attended | int64 | Number of lab sessions attended |
Suggested Usage:
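As one suggested usage, a minimal clustering sketch over the numeric columns above; the feature choice and k=4 are illustrative only.

```python
# One possible usage: scale the numeric performance features and cluster
# students with k-means. Column selection and k are illustrative choices.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("student_performance.csv")
features = ["quiz1_marks", "quiz2_marks", "quiz3_marks",
            "midterm_marks", "final_marks", "previous_gpa",
            "lectures_attended", "labs_attended"]
X = StandardScaler().fit_transform(df[features].fillna(df[features].median()))
df["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(df.groupby("cluster")[features].mean())
```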
License: CC BY 4.0. Free to use, share, and adapt with proper attribution.
Citation: Muhammad Khubaib Ahmad, "Student Performance and Clustering Dataset", 2025, Kaggle. DOI: https://doi.org/10.34740/kaggle/dsv/13489035
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Manually labeled 555 metamodels mined from GitHub in April 2017.
Domains: (1) bibliography, (2) conference management, (3) bug/issue tracker, (4) build systems, (5) document/office products, (6) requirement/use case, (7) database/sql, (8) state machines, (9) petri nets
Procedure for constructing the dataset: fully manual, by searching for certain keywords and regexes (e.g. "state" and "transition" for state machines) in the metamodels and inspecting the results for inclusion.
Format for the file names: ABSINDEX_CLUSTER_ITEMINDEX_name_hash.ecore
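A hedged parsing sketch for this naming scheme; the character classes for name and hash are assumptions, and the example file name is made up.

```python
# Hedged sketch parsing the stated file-name format
# ABSINDEX_CLUSTER_ITEMINDEX_name_hash.ecore; the character classes for
# "name" and "hash" are assumptions, and the example name is made up.
import re

PATTERN = re.compile(r"^(\d+)_(\d+)_(\d+)_(.+)_([0-9a-f]+)\.ecore$")

def parse_name(filename: str):
    m = PATTERN.match(filename)
    if m is None:
        return None
    abs_index, cluster, item_index, name, digest = m.groups()
    return {"abs_index": int(abs_index), "cluster": int(cluster),
            "item_index": int(item_index), "name": name, "hash": digest}

print(parse_name("12_8_7_statemachine_a1b2c3.ecore"))
```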
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset contains data related to the experiment conducted in the paper Towards the Systematic Testing of Virtual Reality Programs.
It contains an implementation of an approach for predicting defect proneness on unlabeled datasets: Average Clustering and Labeling (ACL).
ACL models achieve good prediction performance and are comparable to typical supervised learning models in terms of F-measure. ACL offers a viable choice for defect prediction on unlabeled datasets.
This dataset also contains analyses related to code smells in C# repositories. Please check the paper for further information.
License: GNU General Public License v2.0 (http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html)
The continuous growth of the e-commerce industry has rendered the problem of product retrieval particularly important. As more enterprises move their activities on the Web, the volume and diversity of the product-related information increase quickly. These factors make it difficult for the users to identify and compare the features of their desired products. Clustering, classification, and product matching are useful algorithms that can contribute to the organization of product-related information and consequently, enhance the retrieval effectiveness.
This repository is designed to provide multiple datasets which are suitable for such algorithms. Each dataset is accompanied by its corresponding ground truth file that can be used for evaluation purposes.
This repository includes 18 real-world datasets from different product categories, acquired from two online product comparison platforms: PriceRunner and Skroutz. In particular, we partially crawled these two platforms and we constructed 8 datasets out of each one. Each of these 16 datasets represents a specific product category. The categories were selected with two criteria, in order to: i) study the performance difference of the same methods on similar products that were provided by different vendors, and ii) examine the effectiveness of the algorithms on products from diverse categories. For this reason, we included products from both identical and different categories. Moreover, we created one aggregate dataset per platform that contains all the products from all 8 categories combined. These datasets enable the examination of the performance on heterogeneous datasets.
The datasets are provided in standard CSV and XML formats. Each CSV/XML entry includes the following pieces of information:
* Product ID
* Product Title as it appears in the respective product comparison platform (but in lower case and with punctuation removed)
* Vendor ID: the ID of the electronic store that provides (sells) the product. The Vendor ID can be used for refinement purposes, such as the verification algorithm that we developed in [1].
* Cluster ID: the ID of the cluster that the product belongs to. Useful for entity matching and clustering tasks.
* Cluster Label: the title of the aforementioned cluster.
* Category ID: the ID of the category that the product belongs to. It is meaningful mainly in the two aggregate datasets that contain products from multiple categories. Useful for classification and categorization tasks.
* Category Label: the title of the aforementioned category.
The datasets are licensed under the General Public License (GPL 2.0) and can be used by anybody. Nevertheless, in case they are used for research purposes, the researchers are kindly requested to include the following articles in the references list of their published paper(s):
[1] L. Akritidis, A. Fevgas, P. Bozanis, C. Makris, "A Self-Verifying Clustering Approach to Unsupervised Matching of Product Titles", Artificial Intelligence Review (Springer), pp. 1-44, 2020.
[2] L. Akritidis, P. Bozanis, "Effective Unsupervised Matching of Product Titles with k-Combinations and Permutations", In Proceedings of the 14th IEEE International Conference on Innovations in Intelligent Systems and Applications (INISTA), pp. 1-10, 2018.
[3] L. Akritidis, A. Fevgas, P. Bozanis, "Effective Product Categorization with Importance Scores and Morphological Analysis of the Titles", In Proceedings of the 30th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pp. 213-220, 2018.
This is a text document classification dataset containing 2225 text documents in five categories: politics, sport, tech, entertainment, and business. It can be used for document classification and document clustering.
About Dataset:
- The dataset contains two features: text and label.
- No. of rows: 2225
- No. of columns: 2
Text: contains the text of the documents across the different categories. Label: contains the category label as an integer (0, 1, 2, 3, or 4).
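A minimal sketch clustering the documents with TF-IDF features and comparing against the labels; the file and column names are assumptions about the distributed CSV.

```python
# Minimal sketch: cluster the 2225 documents with TF-IDF + k-means and
# compare against the five labels. File/column names are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

df = pd.read_csv("df_file.csv")            # assumed columns: Text, Label
X = TfidfVectorizer(stop_words="english", max_features=20000).fit_transform(df["Text"])
pred = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
print("ARI vs. labels:", adjusted_rand_score(df["Label"], pred))
```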
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The ACCESS-AM2 (Australian Community Climate and Earth-System Simulator - Atmospheric Model Version 2) data and k-means analysis used for the study described in Fiddes et al. 2022, 'Southern Ocean cloud and shortwave radiation biases in a nudged climate model simulation: does the model ever get it right?'.
Included files:
The code that performs the analysis/generates this data and has instructions for where to download MODIS data can be found here: https://github.com/sfiddes/code_for_publications_2022/tree/main/ACCESS_cloud_radiation_eval
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Is it possible to cluster all the photos in your phone automatically without labeling?
This small dataset includes 80 photos: dogs (10), cats (10), family (20), alone (20), and food (20). There is no labeling info, but the groups are clearly visible.
All the photos are from Pixabay (https://pixabay.com/). They are free to use under some restrictions; please see the Pixabay license page (https://pixabay.com/ko/service/license/).
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Dataset Source: These datasets are from the following paper: A. Noroozi, S. M. Esha, and M. Ghari, 'Predictors of childhood vaccination uptake in England: an explainable machine learning analysis of regional data (2021-2024)', Vaccine, vol. 68, p. 127902, Dec. 2025, doi: 10.1016/j.vaccine.2025.127902. The original paper is available at: https://www.sciencedirect.com/science/article/pii/S0264410X25011995
Description: Vacstat2021-22, Vacstat2022-23, and Vacstat2023-24 contain the vaccination data of 150 Upper Tier Local Authorities (UTLA) in England for 14 types of diseases for children under 5 years old, for 2021-2022, 2022-2023, and 2023-2024, respectively. GDSC.csv contains the GDSC data mentioned in the paper: geographical, demographic, socioeconomic, and cultural (ethnic) data for the same 150 UTLAs (regions) in England.
Tasks: 1. You can use the vaccination data to cluster the vaccination rates of the UTLAs (regions) and assign a cluster label to each region; this label represents the level of vaccination coverage for that UTLA. 2. You can use the GDSC data to classify the vaccination cluster of the UTLAs (regions).
License: This dataset is released under a custom Dataset License Agreement (see LICENSE_DATA.txt). The Figshare setting 'CC BY 4.0' applies only insofar as it is consistent with the custom license. Commercial use, redistribution, or hosting of the dataset elsewhere is not permitted, even with attribution. Users should share the official Figshare DOI link instead. By downloading or using this dataset, you agree to the terms of the Dataset License Agreement.
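A hedged sketch of the two tasks; the file layouts, index alignment, and k are assumptions about the released files.

```python
# Hedged sketch of the two suggested tasks: (1) cluster UTLAs by their 14
# vaccination-rate columns, (2) classify those cluster labels from GDSC
# features. File layouts and index alignment are assumptions.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

vac = pd.read_csv("Vacstat2023-24.csv", index_col=0)   # assumed: 150 UTLAs x 14 rates
gdsc = (pd.read_csv("GDSC.csv", index_col=0)
          .loc[vac.index]                # assumes a shared UTLA identifier index
          .select_dtypes("number"))      # keep numeric GDSC features only

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vac)
clf = RandomForestClassifier(random_state=0)
print("CV accuracy:", cross_val_score(clf, gdsc, labels, cv=5).mean())
```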