100+ datasets found
  1. 2D Clustering Dataset Collection

    • kaggle.com
    zip
    Updated Jan 21, 2025
    Cite
    SAMOILOV MIKHAIL (2025). 2D Clustering Dataset Collection [Dataset]. https://www.kaggle.com/datasets/samoilovmikhail/2d-clustering-dataset-collection
    Available download formats: zip (136543 bytes)
    Dataset updated
    Jan 21, 2025
    Authors
    SAMOILOV MIKHAIL
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset collection comprises 15 diverse two-dimensional datasets specifically designed for clustering analysis. Each dataset contains three columns: x, y, and target, where x and y represent the coordinates of the data points, and target indicates the cluster label.
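
    As a minimal sketch of how one file with this layout could be read, assuming each dataset is a CSV with a header row of x, y, target (the file format is an assumption, and the inline sample is invented for illustration, not taken from the collection):

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the described x, y, target layout (not real data).
SAMPLE_CSV = """x,y,target
0.1,0.2,0
0.3,0.1,0
5.0,5.2,1
5.1,4.9,1
"""

def load_points(text):
    """Group (x, y) points by their cluster label."""
    clusters = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        clusters[int(row["target"])].append((float(row["x"]), float(row["y"])))
    return dict(clusters)

def centroid(points):
    """Return the mean x and mean y of a list of points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

clusters = load_points(SAMPLE_CSV)
centroids = {label: centroid(pts) for label, pts in clusters.items()}
```

    The ground-truth target column makes it possible to compare the output of any clustering algorithm against the intended grouping.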

    Figure: Visualisation of data (https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F20292402%2F3cc81328beabc815fe500973fee1f7ac%2Fdescription.png?generation=1737484616903723&alt=media)

  2. Clustering benchmark datasets

    • kaggle.com
    zip
    Updated Feb 12, 2018
    Cite
    Krid Jin (2018). Clustering benchmark datasets [Dataset]. https://www.kaggle.com/vasopikof/clustering-benchmark-datasets
    Available download formats: zip (87717 bytes)
    Dataset updated
    Feb 12, 2018
    Authors
    Krid Jin
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    Clustering benchmark datasets published by School of Computing, University of Eastern Finland

    Content

    2D scatter points and labels. The files need some format processing before use.

    Find more at https://cs.joensuu.fi/sipu/datasets/

    Acknowledgements

    @misc{ClusteringDatasets, author = {Pasi Fr{\"a}nti et al}, title = {Clustering datasets}, year = {2015}, url = {http://cs.uef.fi/sipu/datasets/}, }

    Inspiration

    With these standard, well-known benchmarks, various clustering algorithms can be run and compared through a number of kernels.

  3. MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE...

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Apr 11, 2025
    + more versions
    Cite
    Dashlink (2025). MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE CLUSTERING [Dataset]. https://catalog.data.gov/dataset/multi-label-asrs-dataset-classification-using-semi-supervised-subspace-clustering
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE CLUSTERING
    MOHAMMAD SALIM AHMED, LATIFUR KHAN, NIKUNJ OZA, AND MANDAVA RAJESWARI

    Abstract. There has been a lot of research targeting text classification. Much of it focuses on a particular characteristic of text data: multi-labelity. This arises because a document may be associated with multiple classes at the same time. The consequence of such a characteristic is the low performance of traditional binary or multi-class classification techniques on multi-label text data. In this paper, we propose a text classification technique that considers this characteristic and provides very good performance. Our multi-label text classification approach is an extension of our previously formulated [3] multi-class text classification approach called SISC (Semi-supervised Impurity based Subspace Clustering). We call this new classification model SISC-ML (SISC Multi-Label). Empirical evaluation on the real-world multi-label NASA ASRS (Aviation Safety Reporting System) data set reveals that our approach outperforms state-of-the-art text classification as well as subspace clustering algorithms.

  4. synchro-April2025-cluster-labeled-highMag

    • huggingface.co
    Updated Apr 15, 2025
    Cite
    Patrick Daniel (2025). synchro-April2025-cluster-labeled-highMag [Dataset]. https://huggingface.co/datasets/patcdaniel/synchro-April2025-cluster-labeled-highMag
    Dataset updated
    Apr 15, 2025
    Authors
    Patrick Daniel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    IFCB Plankton Labeled (Cluster-Sorted)

    This dataset contains labeled images of phytoplankton collected with the Planktivore Imaging System. Images were preprocessed with zero-padding and resized to the standard size used for ViT_b_16. The dataset was originally constructed by clustering unlabeled ROI images using deep features from a ViT model. Clusters were then saved locally and manually curated into taxonomic labels and higher-order groups.

      Dataset Summary… See the full description on the dataset page: https://huggingface.co/datasets/patcdaniel/synchro-April2025-cluster-labeled-highMag.
    
  5. Product Classification and Clustering

    • kaggle.com
    zip
    Updated Apr 24, 2024
    Cite
    Elza (2024). Product Classification and Clustering [Dataset]. https://www.kaggle.com/datasets/nayanack/product-classification-and-clustering/code
    Available download formats: zip (634599 bytes)
    Dataset updated
    Apr 24, 2024
    Authors
    Elza
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was collected from PriceRunner, a popular product comparison platform. It includes 35311 product offers from 10 categories, provided by 306 different merchants. This dataset offers an ideal ground for evaluating classification, clustering, and entity matching algorithms. Although it contains product-related data, it can still be applied to any problem involving text/short-text mining.

    Column Information

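    | Variable Name | Role | Type | Description | Units | Missing Values |
    | -------------- | ------- | ----------- | ---------------------------------- | ----- | -------------- |
    | Product ID | Feature | Integer | Unique identifier for each product | | No |
    | Product Title | Feature | Categorical | Title/name of the product | | No |
    | Merchant ID | Feature | Integer | Unique identifier for each merchant | | No |
    | Cluster ID | Feature | Integer | Identifier for product clusters | | No |
    | Cluster Label | Feature | Categorical | Label for product clusters | | No |
    | Category ID | Feature | Integer | Unique identifier for each category | | No |
    | Category Label | Feature | Categorical | Label for product category | | No |
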
  6. Pseudo-Label Generation for Multi-Label Text Classification - Dataset - NASA...

    • data.nasa.gov
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Pseudo-Label Generation for Multi-Label Text Classification - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/pseudo-label-generation-for-multi-label-text-classification
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    With the advent and expansion of social networking, the amount of generated text data has seen a sharp increase. In order to handle such a huge volume of text data, new and improved text mining techniques are a necessity. One of the characteristics of text data that makes text mining difficult is multi-labelity. In order to build a robust and effective text classification method, which is an integral part of text mining research, we must consider this property more closely. This kind of property is not unique to text data as it can be found in non-text (e.g., numeric) data as well. However, in text data, it is most prevalent. This property also puts the text classification problem in the domain of multi-label classification (MLC), where each instance is associated with a subset of class-labels instead of a single class, as in conventional classification. In this paper, we explore how the generation of pseudo labels (i.e., combinations of existing class labels) can help us in performing better text classification and under what kind of circumstances. During the classification, the high and sparse dimensionality of text data has also been considered. Although we are proposing and evaluating a text classification technique here, our main focus is on the handling of the multi-labelity of text data while utilizing the correlation among multiple labels existing in the data set. Our text classification technique is called pseudo-LSC (pseudo-Label Based Subspace Clustering). It is a subspace clustering algorithm that considers the high and sparse dimensionality as well as the correlation among different class labels during the classification process to provide better performance than existing approaches. Results on three real-world multi-label data sets provide insight into how the multi-labelity is handled in our classification process and show the effectiveness of our approach.

  7. Cell type labels for all clustering and normalization combinations compared...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Nov 17, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    John Hickey (2022). Cell type labels for all clustering and normalization combinations compared for CODEX multiplexed imaging [Dataset]. http://doi.org/10.5061/dryad.dfn2z352c
    Available download formats: zip
    Dataset updated
    Nov 17, 2022
    Dataset provided by
    Stanford University
    Authors
    John Hickey
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    We performed CODEX (co-detection by indexing) multiplexed imaging on four sections of the human colon (ascending, transverse, descending, and sigmoid) using a panel of 47 oligonucleotide-barcoded antibodies. Subsequently, images underwent standard CODEX image processing (tile stitching, drift compensation, cycle concatenation, background subtraction, deconvolution, and determination of best focal plane) and single-cell segmentation. The output of this process was a dataframe of nearly 130,000 cells with fluorescence values quantified from each marker. We used this dataframe as input to 5 normalization conditions, comparing z, double-log(z), min/max, and arcsinh normalizations to the original unmodified dataset. We used these normalized dataframes as inputs for 4 unsupervised clustering algorithms: k-means, Leiden, X-shift Euclidean, and X-shift angular.

    From the clustering outputs, we then labeled the resulting clusters according to the cells observed in the data, producing 20 unique cell type labels. We also labeled cell types by hierarchical hand-gating of the data within cellengine (cellengine.com). We also created another gold standard for comparison by overclustering unnormalized data with X-shift angular clustering. Finally, we created one last label as the major cell type call for each cell from all 21 cell type labels in the dataset.

    Consequently the dataset has individual cells segmented out in each row. Then there are columns for the X, Y position in pixels in the overall montage image of the dataset. There are also columns to indicate which region the data came from (4 total). The rest are labels generated by all the clustering and normalization techniques used in the manuscript and what were compared to each other. These also were the data that were used for neighborhood analysis for the last figure of the manuscript. These are provided at all four levels of cell type level granularity (from 7 cell types to 35 cell types).
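
    The per-marker normalizations named above can be sketched as follows. This is an illustrative sketch only: the population standard deviation in the z-score and the plain arcsinh without a cofactor are assumptions, double-log(z) is omitted because the description does not define it, and the marker values are invented.

```python
import math
import statistics

def z_normalize(values):
    """Z-score: subtract the mean, divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # assumes the values are not all identical
    return [(v - mean) / sd for v in values]

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(v - lo) / (hi - lo) for v in values]

def arcsinh_normalize(values):
    """Inverse hyperbolic sine transform, applied value by value."""
    return [math.asinh(v) for v in values]

# Invented fluorescence values for one marker (not real data).
marker = [0.0, 2.0, 4.0, 6.0, 8.0]
z_scores = z_normalize(marker)
scaled = min_max_normalize(marker)
transformed = arcsinh_normalize(marker)
```

    Each normalized column would then be passed to the clustering step in place of the raw values.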

  8. Pseudo-Label Generation for Multi-Label Text Classification

    • catalog.data.gov
    • datasets.ai
    • +1more
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). Pseudo-Label Generation for Multi-Label Text Classification [Dataset]. https://catalog.data.gov/dataset/pseudo-label-generation-for-multi-label-text-classification
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    With the advent and expansion of social networking, the amount of generated text data has seen a sharp increase. In order to handle such a huge volume of text data, new and improved text mining techniques are a necessity. One of the characteristics of text data that makes text mining difficult is multi-labelity. In order to build a robust and effective text classification method, which is an integral part of text mining research, we must consider this property more closely. This kind of property is not unique to text data as it can be found in non-text (e.g., numeric) data as well. However, in text data, it is most prevalent. This property also puts the text classification problem in the domain of multi-label classification (MLC), where each instance is associated with a subset of class-labels instead of a single class, as in conventional classification. In this paper, we explore how the generation of pseudo labels (i.e., combinations of existing class labels) can help us in performing better text classification and under what kind of circumstances. During the classification, the high and sparse dimensionality of text data has also been considered. Although we are proposing and evaluating a text classification technique here, our main focus is on the handling of the multi-labelity of text data while utilizing the correlation among multiple labels existing in the data set. Our text classification technique is called pseudo-LSC (pseudo-Label Based Subspace Clustering). It is a subspace clustering algorithm that considers the high and sparse dimensionality as well as the correlation among different class labels during the classification process to provide better performance than existing approaches. Results on three real-world multi-label data sets provide insight into how the multi-labelity is handled in our classification process and show the effectiveness of our approach.

  9. stackexchange-clustering

    • huggingface.co
    Updated Apr 28, 2022
    + more versions
    Cite
    Massive Text Embedding Benchmark (2022). stackexchange-clustering [Dataset]. https://huggingface.co/datasets/mteb/stackexchange-clustering
    Dataset updated
    Apr 28, 2022
    Dataset authored and provided by
    Massive Text Embedding Benchmark
    License

    https://choosealicense.com/licenses/unknown/

    Description

    StackExchangeClustering.v2
    An MTEB dataset (Massive Text Embedding Benchmark)

    Clustering of titles from 121 stackexchanges. Clustering of 25 sets, each with 10-50 classes, and each class with 100-1000 sentences.

    Task category: t2c
    Domains: Web, Written
    Reference: https://arxiv.org/abs/2104.07081

      How to evaluate on this task
    

    You can evaluate an embedding model on this dataset using the following code: import mteb

    task =… See the full description on the dataset page: https://huggingface.co/datasets/mteb/stackexchange-clustering.

  10. Data for: An evaluation of document clustering and topic modelling in two...

    • data.mendeley.com
    Updated Apr 17, 2019
    + more versions
    Cite
    Stephan Curiskis (2019). Data for: An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit [Dataset]. http://doi.org/10.17632/85njyhj45m.1
    Dataset updated
    Apr 17, 2019
    Authors
    Stephan Curiskis
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Topic labelled online social network (OSN) data sets are useful to evaluate topic modelling and document clustering tasks. We provide three data sets with topic labels from two online social networks: Twitter and Reddit. To comply with Twitter’s terms and conditions, we only publish the tweet identifiers along with the topic label. The Reddit data is supplied with the full text and the topic label. The first Twitter data set was collected from the Twitter API by filtering for the hashtag #Auspol, used to tag political discussion tweets in Australia. The second Twitter data set was originally used in the RepLab 2013 competition and contains expert annotated topics. The Reddit data set consists of 40,000 Reddit parent comments from May 2015 belonging to 5 subreddit pages, which are used as topic labels.

  11. Global Considerations in Hierarchical Clustering Reveal Meaningful Patterns...

    • plos.figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Roy Varshavsky; David Horn; Michal Linial (2023). Global Considerations in Hierarchical Clustering Reveal Meaningful Patterns in Data [Dataset]. http://doi.org/10.1371/journal.pone.0002247
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Roy Varshavsky; David Horn; Michal Linial
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: A hierarchy, characterized by tree-like relationships, is a natural method of organizing data in various domains. When considering an unsupervised machine learning routine, such as clustering, a bottom-up hierarchical (BU, agglomerative) algorithm is used as a default and is often the only method applied.

    Methodology/Principal Findings: We show that hierarchical clustering that involves global considerations, such as top-down (TD, divisive) or glocal (global-local) algorithms, is better suited to reveal meaningful patterns in the data. This is demonstrated by testing the correspondence between the results of several algorithms (TD, glocal and BU) and the correct annotations provided by experts. The correspondence was tested in multiple domains including gene expression experiments, stock trade records and functional protein families. The performance of each of the algorithms is evaluated by statistical criteria that are assigned to clusters (nodes of the hierarchy tree) based on expert-labeled data. Whereas TD algorithms perform better on global patterns, BU algorithms perform well and are advantageous when finer granularity of the data is sought. In addition, a novel TD algorithm that is based on genuine density of the data points is presented and is shown to outperform other divisive and agglomerative methods. Application of the algorithm to more than 500 protein sequences belonging to ion-channels illustrates the potential of the method for inferring overlooked functional annotations. ClustTree, a graphical Matlab toolbox for applying various hierarchical clustering algorithms and testing their quality, is made available.

    Conclusions: Although currently rarely used, global approaches, in particular TD or glocal algorithms, should be considered in the exploratory process of clustering. In general, applying unsupervised clustering methods can leverage the quality of manually-created mapping of protein families. As demonstrated, it can also provide insights into erroneous and missed annotations.

  12. Dataset features.

    • plos.figshare.com
    xls
    Updated Jun 12, 2025
    Cite
    Zilong Deng; Yizhang Wang; Mustafa Muwafak Alobaedy (2025). Dataset features. [Dataset]. http://doi.org/10.1371/journal.pone.0326145.t001
    Available download formats: xls
    Dataset updated
    Jun 12, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Zilong Deng; Yizhang Wang; Mustafa Muwafak Alobaedy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Federated clustering is a distributed clustering algorithm that does not require the transmission of raw data and is widely used. However, it struggles to handle Non-IID data effectively because it is difficult to obtain accurate global consistency measures under Non-Independent and Identically Distributed (Non-IID) conditions. To address this issue, we propose a federated k-means clustering algorithm based on a cluster backbone called FKmeansCB. First, we add Laplace noise to all the local data, and run k-means clustering on the client side to obtain cluster centers, which faithfully represent the cluster backbone (i.e., the data structures of the clusters). The cluster backbone represents the client’s features and can approximatively capture the features of different labeled data points in Non-IID situations. We then upload these cluster centers to the server. Subsequently, the server aggregates all cluster centers and runs the k-means clustering algorithm to obtain global cluster centers, which are then sent back to the client. Finally, the client assigns all data points to the nearest global cluster center to produce the final clustering results. We have validated the performance of our proposed algorithm using six datasets, including the large-scale MNIST dataset. Compared with the leading non-federated and federated clustering algorithms, FKmeansCB offers significant advantages in both clustering accuracy and running time.
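
    The steps described above (noise on the client, local k-means, upload of centers, server-side k-means, final assignment) can be sketched roughly as follows. This is not the authors' FKmeansCB implementation: the noise scale, the random initialization, the fixed iteration count, and every function name are assumptions made for illustration, and the toy data is invented.

```python
import random

def laplace(rng, scale):
    # The difference of two i.i.d. exponentials follows Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: dist2(point, centers[i]))

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd iterations with random initial centers."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center when a group is empty.
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers

def federated_kmeans(clients, k, noise_scale, seed=0):
    rng = random.Random(seed)
    uploaded = []
    for data in clients:
        # Client side: perturb local data, then cluster it locally.
        noisy = [tuple(v + laplace(rng, noise_scale) for v in p) for p in data]
        uploaded.extend(kmeans(noisy, k, rng))  # only centers leave the client
    # Server side: cluster the uploaded centers into global centers.
    global_centers = kmeans(uploaded, k, rng)
    # Client side: assign each raw point to its nearest global center.
    labels = [[nearest(p, global_centers) for p in data] for data in clients]
    return global_centers, labels

# Invented toy data: two clients, two well-separated groups.
clients = [
    [(0.0, 0.1), (0.2, 0.0), (10.0, 10.1), (9.9, 10.0)],
    [(0.1, 0.1), (10.1, 9.9), (10.0, 10.0)],
]
centers, labels = federated_kmeans(clients, k=2, noise_scale=0.01)
```

    Only cluster centers cross the client boundary in this sketch, which mirrors the stated goal of avoiding the transmission of raw data.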

  13. Student Performance and Clustering Dataset

    • kaggle.com
    zip
    Updated Oct 24, 2025
    Cite
    Muhammad Khubaib Ahmad (2025). Student Performance and Clustering Dataset [Dataset]. https://www.kaggle.com/datasets/muhammadkhubaibahmad/student-performance-and-clustering-dataset
    Available download formats: zip (7906 bytes)
    Dataset updated
    Oct 24, 2025
    Authors
    Muhammad Khubaib Ahmad
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Description:

    This dataset contains performance, attendance, and participation metrics of 300 students, intended for clustering, exploratory data analysis (EDA), and educational analytics. It can be used to explore relationships between quizzes, exams, GPA, attendance, lab sessions, and other academic indicators.

    This dataset is ideal for unsupervised learning exercises, clustering students based on performance patterns, or for demonstrating educational analytics workflows.

    Note: This is a small dataset (300 rows) and is not suitable for training large-scale supervised models.

    File Information:

    File Name: student_performance.csv
    Format: CSV (Comma-Separated Values)
    Rows: 300
    Columns: 16 features + optional identifier columns

    Column Details:

    | Column Name | Type | Description |
    | ----------------------- | ------- | -------------------------------------------------------- |
    | student_id | int64 | Unique student identifier |
    | name | object | Student name (should be anonymized before use) |
    | age | int64 | Age of the student (years) |
    | gender | object | Gender of the student |
    | quiz1_marks | float64 | Marks obtained in Quiz 1 (0–10) |
    | quiz2_marks | float64 | Marks obtained in Quiz 2 (0–10) |
    | quiz3_marks | float64 | Marks obtained in Quiz 3 (0–10) |
    | total_assignments | int64 | Total number of assignments assigned |
    | assignments_submitted | float64 | Number of assignments submitted (NaN in current dataset) |
    | midterm_marks | float64 | Marks obtained in midterm exam (0–30) |
    | final_marks | float64 | Marks obtained in final exam (0–50) |
    | previous_gpa | float64 | GPA from previous semester (0–4 scale) |
    | total_lectures | int64 | Total number of lectures scheduled |
    | lectures_attended | int64 | Number of lectures attended |
    | total_lab_sessions | int64 | Total lab sessions assigned |
    | labs_attended | int64 | Number of lab sessions attended |

    Suggested Usage:

    • Clustering: Group students based on performance metrics, attendance, and GPA trends.
    • Exploratory Data Analysis (EDA): Analyze correlations between attendance, quizzes, midterm/final scores, and GPA.
    • Educational Analytics: Derive participation rates, average scores, and performance trends.
    • Feature Engineering: Compute additional metrics like average quiz score, total participation, or engagement ratios.

    Preprocessing Notes:

    • Drop or impute assignments_submitted if using for ML.
    • Anonymize name to maintain privacy.
    • Categorical variable gender can be label encoded or one-hot encoded if needed.
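
    The feature-engineering suggestion above could look like the following minimal sketch. The column names come from the column table; the derived feature names and the example row are invented for illustration.

```python
def engineer_features(row):
    """Derive an average quiz score and two attendance ratios from one student row."""
    quizzes = [row["quiz1_marks"], row["quiz2_marks"], row["quiz3_marks"]]
    return {
        "avg_quiz": sum(quizzes) / len(quizzes),
        "lecture_attendance": row["lectures_attended"] / row["total_lectures"],
        "lab_attendance": row["labs_attended"] / row["total_lab_sessions"],
    }

# Invented example row using the documented column names (not real data).
student = {
    "quiz1_marks": 6.0,
    "quiz2_marks": 8.0,
    "quiz3_marks": 10.0,
    "total_lectures": 30,
    "lectures_attended": 27,
    "total_lab_sessions": 12,
    "labs_attended": 9,
}
features = engineer_features(student)
```

    Ratios like these put attendance on a common 0 to 1 scale, which tends to matter for distance-based clustering.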

    License: CC BY 4.0 – Free to use, share, and adapt with proper attribution.

    Citation: Muhammad Khubaib Ahmad, "Student Performance and Clustering Dataset", 2025, Kaggle. DOI: https://doi.org/10.34740/kaggle/dsv/13489035

  14. A labeled Ecore metamodel dataset for domain clustering

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Önder Babur (2020). A labeled Ecore metamodel dataset for domain clustering [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2585431
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Eindhoven University of Technology
    Authors
    Önder Babur
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Manually labeled 555 metamodels mined from GitHub in April 2017.

    Domains: (1) bibliography, (2) conference management, (3) bug/issue tracker, (4) build systems, (5) document/office products, (6) requirement/use case, (7) database/sql, (8) state machines, (9) petri nets

    Procedure for constructing the dataset: fully manual, by searching for certain keywords and regexes (e.g. "state" and "transition" for state machines) in the metamodels and inspecting the results for inclusion.

    Format for the file names: ABSINDEX_CLUSTER_ITEMINDEX_name_hash.ecore
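
    A file name in this format could be split into its parts as sketched below. The example file name is hypothetical, and the sketch assumes that only the name part may itself contain underscores; the dataset description does not confirm this.

```python
from pathlib import PurePath

def parse_metamodel_name(filename):
    """Split ABSINDEX_CLUSTER_ITEMINDEX_name_hash.ecore into its fields."""
    stem = PurePath(filename).stem  # drop the .ecore extension
    # The first three fields are taken from the left, the hash from the right,
    # so underscores inside the name part are preserved (an assumption).
    abs_index, cluster, item_index, rest = stem.split("_", 3)
    name, file_hash = rest.rsplit("_", 1)
    return {
        "abs_index": abs_index,
        "cluster": cluster,
        "item_index": item_index,
        "name": name,
        "hash": file_hash,
    }

# Hypothetical file name, invented for illustration.
parsed = parse_metamodel_name("12_8_3_state_machine_ab12cd.ecore")
```

    The cluster field then serves as the ground-truth domain label when evaluating a clustering of the metamodels.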

  15. Dataset - Towards the Systematic Testing of Virtual Reality Programs

    • data.mendeley.com
    Updated Sep 16, 2021
    + more versions
    Cite
    Stevão Andrade (2021). Dataset - Towards the Systematic Testing of Virtual Reality Programs [Dataset]. http://doi.org/10.17632/4myfs585s9.2
    Dataset updated
    Sep 16, 2021
    Authors
    Stevão Andrade
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data related to the experiment conducted in the paper Towards the Systematic Testing of Virtual Reality Programs.

    It contains an implementation of an approach for predicting defect proneness on unlabeled datasets: Average Clustering and Labeling (ACL).

    ACL models achieve good prediction performance and are comparable to typical supervised learning models in terms of F-measure. ACL offers a viable choice for defect prediction on unlabeled datasets.

    This dataset also contains analyses related to code smells in C# repositories. Please check the paper for further information.

  16. Product Clustering, Matching & Classification

    • kaggle.com
    zip
    Updated Mar 8, 2020
    Cite
    Leonidas Akritidis (2020). Product Clustering, Matching & Classification [Dataset]. https://www.kaggle.com/lakritidis/product-clustering-matching-classification
    Available download formats: zip (22794103 bytes)
    Dataset updated
    Mar 8, 2020
    Authors
    Leonidas Akritidis
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Introduction

    The continuous growth of the e-commerce industry has rendered the problem of product retrieval particularly important. As more enterprises move their activities to the Web, the volume and diversity of product-related information increase quickly. These factors make it difficult for users to identify and compare the features of their desired products. Clustering, classification, and product matching are useful techniques that can contribute to the organization of product-related information and, consequently, enhance retrieval effectiveness.

    This repository is designed to provide multiple datasets which are suitable for such algorithms. Each dataset is accompanied by its corresponding ground truth file that can be used for evaluation purposes.

    Content

    This repository includes 18 real-world datasets from different product categories, acquired from two online product comparison platforms: PriceRunner and Skroutz. In particular, we partially crawled these two platforms and we constructed 8 datasets out of each one. Each of these 16 datasets represents a specific product category. The categories were selected with two criteria, in order to: i) study the performance difference of the same methods on similar products that were provided by different vendors, and ii) examine the effectiveness of the algorithms on products from diverse categories. For this reason, we included products from both identical and different categories. Moreover, we created one aggregate dataset per platform that contains all the products from all 8 categories combined. These datasets enable the examination of the performance on heterogeneous datasets.

    The datasets are provided in standard CSV and XML formats. Each CSV/XML entry includes the following pieces of information:

    • Product ID
    • Product Title as it appears in the respective product comparison platform (but in lower case and with punctuation removed)
    • Vendor ID: this is the ID of the electronic store that provides (sells) the product. Vendor ID can be used for refinement purposes, such as the verification algorithm that we developed in [1].
    • Cluster ID: this is the ID of the cluster that the product belongs to. Useful for entity matching and clustering tasks.
    • Cluster Label: The title of the aforementioned cluster.
    • Category ID: this is the ID of the category that the product belongs to. It is meaningful mainly in the two aggregate datasets that contain products from multiple categories. Useful for classification and categorization tasks.
    • Category Label: The title of the aforementioned category.

    Licence

    The datasets are licensed under General Public License (GPL 2.0) and can be used by anybody. Nevertheless, in the case they are used for research purposes, the researchers are kindly requested to include the following articles into the References list of their published paper/s:

    [1] L. Akritidis, A. Fevgas, P. Bozanis, C. Makris, "A Self-Verifying Clustering Approach to Unsupervised Matching of Product Titles", Artificial Intelligence Review (Springer), pp. 1-44, 2020.

    [2] L. Akritidis, P. Bozanis, "Effective Unsupervised Matching of Product Titles with k-Combinations and Permutations", In Proceedings of the 14th IEEE International Conference on Innovations in Intelligent Systems and Applications (INISTA), pp. 1-10, 2018.

    [3] L. Akritidis, A. Fevgas, P. Bozanis, "Effective Product Categorization with Importance Scores and Morphological Analysis of the Titles", In Proceedings of the 30th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pp. 213-220, 2018.

  17. Text Document Classification Dataset

    • kaggle.com
    zip
    Updated Dec 4, 2023
    Cite
    sunil thite (2023). Text Document Classification Dataset [Dataset]. https://www.kaggle.com/datasets/sunilthite/text-document-classification-dataset
    Explore at:
    Available download formats: zip (1941393 bytes)
    Dataset updated
    Dec 4, 2023
    Authors
    sunil thite
    Description

    This is a text document classification dataset that contains 2225 text documents in five categories: politics, sport, tech, entertainment, and business. It can be used for document classification and document clustering.

    About Dataset

    • The dataset contains two features: text and label.
    • No. of rows: 2225
    • No. of columns: 2

    Text: the document text, drawn from the different categories.
    Label: the category label, one of 0, 1, 2, 3, 4:

    1. Politics = 0
    2. Sport = 1
    3. Technology = 2
    4. Entertainment = 3
    5. Business = 4
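
    The integer labels are easier to read once mapped back to category names. A minimal sketch, assuming the labels have already been loaded into a Python list; the example labels below are made up and are not drawn from the dataset.

```python
from collections import Counter

# Mapping taken from the dataset description.
LABEL_NAMES = {
    0: "Politics",
    1: "Sport",
    2: "Technology",
    3: "Entertainment",
    4: "Business",
}


def label_counts(labels):
    """Count documents per category name given integer labels."""
    return Counter(LABEL_NAMES[label] for label in labels)


# Made-up labels for illustration only.
example_labels = [0, 1, 1, 4, 2, 3, 4]
counts = label_counts(example_labels)
print(counts)
```
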
  18. ACCESS-AM2 Southern Ocean cloud and radiation data for k-means clustering...

    • zenodo.org
    • data-staging.niaid.nih.gov
    • +1 more
    nc
    Updated Oct 21, 2022
    Cite
    Sonya L. Fiddes (2022). ACCESS-AM2 Southern Ocean cloud and radiation data for k-means clustering and analysis [Dataset]. http://doi.org/10.5281/zenodo.6004062
    Explore at:
    Available download formats: nc
    Dataset updated
    Oct 21, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sonya L. Fiddes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Southern Ocean
    Description

    The ACCESS-AM2 (Australian Community Climate and Earth-System Simulator - Atmospheric Model Version 2) data and k-means analysis used for the study described in Fiddes et al. 2022, 'Southern Ocean cloud and shortwave radiation biases in a nudged climate model simulation: does the model ever get it right?'.

    Included files:

    • modis_cluster_centres_2015-2019.nc - k-means-derived cluster centres for MODIS
    • modis_cluster_labels_2015-2019.nc - k-means-derived cluster labels for MODIS
    • bx400_cluster_labels_2015-2019.nc - k-means-fitted cluster labels for the model
    • COSP_vars_bx400_2015-2019.nc - model data for analysis

    The code that performs the analysis and generates this data, along with instructions for where to download the MODIS data, can be found here: https://github.com/sfiddes/code_for_publications_2022/tree/main/ACCESS_cloud_radiation_eval
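
    The file list separates cluster centres derived from MODIS from labels "fitted" for the model. One plausible reading is that each model sample is assigned to the nearest existing centre. The sketch below shows only that assignment step in plain Python. It is not the authors' code, and the centres, samples, and function name are hypothetical.

```python
import math


def assign_to_nearest_centre(samples, centres):
    """Return, for each sample, the index of the closest centre (Euclidean)."""
    labels = []
    for sample in samples:
        distances = [math.dist(sample, centre) for centre in centres]
        labels.append(distances.index(min(distances)))
    return labels


# Hypothetical centres and samples (two features each), for illustration only.
centres = [(0.2, 0.1), (0.8, 0.9)]
samples = [(0.25, 0.0), (0.7, 1.0), (0.1, 0.3)]
labels = assign_to_nearest_centre(samples, centres)
print(labels)
```

    Reusing centres fitted on one data source keeps the cluster definitions fixed, so cluster frequencies in the two sources can be compared directly.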

  19. Small image dataset for unsupervised clustering

    • kaggle.com
    zip
    Updated Oct 29, 2022
    Cite
    Won-Du Chang (2022). Small image dataset for unsupervised clustering [Dataset]. https://www.kaggle.com/datasets/heavensky/image-dataset-for-unsupervised-clustering
    Explore at:
    Available download formats: zip (6947440 bytes)
    Dataset updated
    Oct 29, 2022
    Authors
    Won-Du Chang
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Is it possible to cluster all the photos in your phone automatically without labeling?

    This small dataset includes 80 photos: dogs (10), cats (10), family (20), alone (20), and food (20). No labels are provided, but the categories are easy to tell apart by eye.

    All the photos are from Pixabay (https://pixabay.com/). They are free to use under some restrictions; please see the Pixabay license page (https://pixabay.com/ko/service/license/).

  20. Vaccination Cluster Classification Dataset

    • figshare.com
    csv
    Updated Oct 29, 2025
    Cite
    Amin Noroozi (2025). Vaccination Cluster Classification Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.30472496.v1
    Explore at:
    Available download formats: csv
    Dataset updated
    Oct 29, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Amin Noroozi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Source: These datasets are from the following paper: A. Noroozi, S. M. Esha, and M. Ghari, ‘Predictors of childhood vaccination uptake in England: an explainable machine learning analysis of regional data (2021–2024)’, Vaccine, vol. 68, p. 127902, Dec. 2025, doi: 10.1016/j.vaccine.2025.127902. The original paper is available at: https://www.sciencedirect.com/science/article/pii/S0264410X25011995

    Description: Vacstat2021-22, Vacstat2022-23, and Vacstat2023-24 contain the vaccination data of 150 Upper Tier Local Authorities (UTLA) in England for 14 types of diseases for children under 5 years old for 2021-2022, 2022-2023, and 2023-2024, respectively. GDSC.csv contains the GDSC data as mentioned in the paper, which includes geographical, demographic, socioeconomic, and cultural (ethnic) data for the same 150 UTLAs (regions) in England.

    Tasks:

    1. You can use the vaccination data to cluster the vaccination rates of UTLAs (regions) and assign a cluster label to each region. This label represents the level of vaccination coverage for that UTLA.
    2. You can use the GDSC data to classify the vaccination cluster of the UTLAs (regions).

    License: This dataset is released under a custom Dataset License Agreement (see LICENSE_DATA.txt). The Figshare setting “CC BY 4.0” applies only insofar as it is consistent with the custom license. Commercial use, redistribution, or hosting of the dataset elsewhere is not permitted, even with attribution. Users should share the official Figshare DOI link instead. By downloading or using this dataset, you agree to the terms of the Dataset License Agreement.
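
    The first task (turning vaccination rates into coverage-level labels) can be sketched with a plain one-dimensional k-means. This is an illustrative stand-in, not necessarily the clustering method used in the paper; the rates below are made up, and the function name and level names are hypothetical.

```python
def kmeans_1d(values, k, iterations=50):
    """Lloyd's algorithm on scalars (k >= 2); returns (sorted centres, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread initial centres across the sorted values.
    centres = [ordered[(i * (len(ordered) - 1)) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centres[c]))
            groups[nearest].append(v)
        # Keep the old centre if a group ends up empty.
        new_centres = [
            sum(g) / len(g) if g else centres[i] for i, g in enumerate(groups)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    centres = sorted(centres)
    labels = [min(range(k), key=lambda c: abs(v - centres[c])) for v in values]
    return centres, labels


# Made-up coverage rates (percent) for six hypothetical regions.
rates = [95.1, 94.3, 88.0, 87.2, 79.5, 80.1]
centres, labels = kmeans_1d(rates, k=3)
LEVELS = ["low", "medium", "high"]  # index 0 is the lowest centre
levels = [LEVELS[label] for label in labels]
print(centres, levels)
```

    Because the centres are sorted before labelling, a larger label always means higher coverage, which makes the labels usable as ordered classes for the second task.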

Cite
SAMOILOV MIKHAIL (2025). 2D Clustering Dataset Collection [Dataset]. https://www.kaggle.com/datasets/samoilovmikhail/2d-clustering-dataset-collection

2D Clustering Dataset Collection

An Educational Dataset for Mastering Clustering Techniques

Explore at:
Available download formats: zip (136543 bytes)
Dataset updated
Jan 21, 2025
Authors
SAMOILOV MIKHAIL
License

Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

This dataset collection comprises 15 diverse two-dimensional datasets specifically designed for clustering analysis. Each dataset contains three columns: x, y, and target, where x and y represent the coordinates of the data points, and target indicates the cluster label.

Visualisation of data: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F20292402%2F3cc81328beabc815fe500973fee1f7ac%2Fdescription.png?generation=1737484616903723&alt=media
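
A minimal sketch of working with the x, y, target layout described above: it computes the centroid of each labelled cluster. The rows are made up, and the presence of a header row named x,y,target in the actual files is an assumption.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the documented x, y, target layout.
SAMPLE = """x,y,target
0.0,0.0,0
2.0,0.0,0
10.0,10.0,1
12.0,14.0,1
"""


def centroids_by_target(csv_text):
    """Return {target: (mean_x, mean_y)} for CSV text with x, y, target columns."""
    points = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        points[row["target"]].append((float(row["x"]), float(row["y"])))
    return {
        target: (
            sum(x for x, _ in pts) / len(pts),
            sum(y for _, y in pts) / len(pts),
        )
        for target, pts in points.items()
    }


centroids = centroids_by_target(SAMPLE)
print(centroids)
```

Comparing these reference centroids with the centres found by a clustering algorithm is one simple way to check a result against the provided target column.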
