100+ datasets found
  1. Data from: Galaxy clustering

    • kaggle.com
    zip
    Updated Jan 3, 2023
    Cite
    The Devastator (2023). Galaxy clustering [Dataset]. https://www.kaggle.com/datasets/thedevastator/clustering-polygons-utilizing-iris-moon-and-circ
    Explore at:
    zip (6339 bytes)
    Dataset updated
    Jan 3, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Galaxy clustering

    Iris, Moon, and Circles datasets for Galaxy clustering tutorial


    About this dataset

    This dataset contains a wealth of information for exploring the effectiveness of various clustering algorithms. With its mix of numerical measurements (X, Y, Sepal.Length, and Petal.Length) and categorical values (Species), it is possible to investigate how different types of variables affect clustering performance. Additionally, by comparing results across the three datasets provided, moon.csv (x and y coordinates), iris.csv (sepal and petal length measurements), and circles.csv, we can gain insight into how different data distributions affect clustering techniques such as K-Means or hierarchical clustering.


    How to use the dataset

    This dataset can also be a great starting point for exploring more complex clusters: higher-dimensional variables such as color or texture, present in other datasets not included here, can help cluster-analysis algorithms form more accurate groups. It can likewise assist in visualization projects where clusters need to be generated, such as plotting mapped data points or examining the relationship between two variables within a region drawn on a chart.

    To use this dataset effectively, it is important to understand how your chosen algorithm works: some require parameters to be specified beforehand, while others handle those details automatically, and the interpretation may be invalid otherwise, depending on the methods you use alongside clustering. Furthermore, familiarize yourself with concepts like the silhouette score and the Rand index; these commonly used metrics measure a clustering's performance against other clustering models, so you can tell whether your results reach an acceptable level of accuracy. Good luck!
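
The workflow described above can be sketched quickly with scikit-learn (an assumption; the dataset itself only ships CSVs). Here `make_moons` stands in for the X/Y columns of moon.csv, and the silhouette score compares K-Means runs with different cluster counts:

```python
# Sketch: compare K-Means clusterings via the silhouette score on
# moon-shaped data similar to moon.csv (X, Y columns). Assumes scikit-learn.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher is better, range [-1, 1]

best_k = max(scores, key=scores.get)
```

The same loop applies to circles.csv or the iris measurements; for non-convex shapes like the moons, a low silhouette at every k is itself a useful signal that K-Means is a poor fit.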

    Research Ideas

    • Utilizing the sepal and petal lengths and widths to perform flower recognition or part of a larger image recognition pipeline.
    • Classifying the data points in each dataset by the X-Y coordinates using clustering algorithms to analyze galaxy locations or overall formation patterns for stars, planets, or galaxies.
    • Exploring correlations between species of flowers in terms of sepal/petal lengths by performing supervised learning tasks such as classification with this dataset.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.

    Columns

    File: moon.csv

    | Column name | Description                               |
    |:------------|:------------------------------------------|
    | X           | X coordinate of the data point. (Numeric) |
    | Y           | Y coordinate of the data point. (Numeric) |

    File: iris.csv

    | Column name  | Description                                  |
    |:-------------|:---------------------------------------------|
    | Sepal.Length | Length of the sepal of the flower. (Numeric) |
    | Petal.Length | Length of the petal of the flower. (Numeric) |
    | Species      | Species of the flower. (Categorical)         |


  2. Data from: EEG signature of grouping strategies in numerosity perception

    • data.europa.eu
    unknown
    Updated Feb 19, 2024
    Cite
    Zenodo (2024). EEG signature of grouping strategies in numerosity perception [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-7913116?locale=lv
    Explore at:
    unknown (11798)
    Dataset updated
    Feb 19, 2024
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Behavioral data: each row of the Excel file contains data from an individual participant. We report the average response (columns B-E) and a precision index (Weber fractions; columns G-J) for 6 and 8 items, both grouped and ungrouped.

    EEG data: each row of the Excel file contains data from an individual participant. We report the N1 latency (columns B-K), N1 amplitude (columns M-V), and P2p amplitude (columns X-AG). Values are reported for the subitizing range (3 and 4 items) and the estimation range (6 and 8 items); in the estimation range, values are reported separately by spatial arrangement (grouped vs. ungrouped) and number of subgroups (3 or 4).

  3. Data from: Grouping strategies in number estimation extend the subitizing...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Nov 30, 2020
    Cite
    Maldonado Moscoso Paula Andrea; Castaldi Elisa; Burr David C.; Arrighi Roberto; Anobile Giovanni (2020). Grouping strategies in number estimation extend the subitizing range [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_4292116
    Explore at:
    Dataset updated
    Nov 30, 2020
    Dataset provided by
    Department of Neuroscience, Psychology, Pharmacology and Child Health, University of Florence, Florence, Italy
    Authors
    Maldonado Moscoso Paula Andrea; Castaldi Elisa; Burr David C.; Arrighi Roberto; Anobile Giovanni
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In the calculation folder: each file contains a matrix called “MATR”. Each row of the matrix “MATR” is a trial.

    The columns contain the following information:

    1st: Number of trial

    2nd: Subject response

    4th: Response time

    5th: first number

    6th: math symbol (1 = *; 2 = +; 3 = −)

    7th: second number

    8th: third number

    In the calculation folder: each file contains a matrix called “matr”. Each row of the matrix “matr” is a trial.

    The columns contain the following information:

    1st: subject response in the numerosity task

    2nd: the presented numerosity

    3rd: subject response in the numerosity task

    4th: zero

    5th: stimulus duration

    6th: Response time in the numerosity task

    7th: Grouped (1) or random (2) presentation

    8th: 1

    9th: 1

    10th: Number of items of the upper-left quadrant

    11th: Number of items of the lower-left quadrant

    12th: Number of items of the upper-right quadrant

    13th: Number of items of the lower-right quadrant

    14th: odd shape presented (1 = diamond; 2 = triangle; 3 = circle)

    15th: subject response in the shape task

    16th: 0.2 in the single task; response time in the shape task in the dual task

    17th: single (0) or dual (1) task

    18th: time stimulus on

    19th: time stimulus off

  4. Replication data for: “Role grouping experiments: A new method for studying...

    • dataverse.no
    • dataverse.azure.uit.no
    • +1 more
    docx, pdf, txt, xlsx
    Updated Jan 13, 2025
    Cite
    Nicolay Worren; Nicolay Worren (2025). Replication data for: “Role grouping experiments: A new method for studying organization re-design decisions” [Dataset]. http://doi.org/10.18710/GURHXD
    Explore at:
    txt (121670), pdf (391557), pdf (129952), pdf (137725), docx (45321), txt (10183), xlsx (226265)
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    DataverseNO
    Authors
    Nicolay Worren; Nicolay Worren
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Oslo, Norway
    Description

    We developed an experimental method that can be used to study organization design and grouping decisions more specifically. We demonstrate the method in a study with 285 participants. The participants were asked to group a set of nine roles into units using card-sorting. The role descriptions indicated that there were interdependencies between some of the roles. Participants’ grouping decisions were quantified and compared against an algorithmic solution that minimized coordination costs. It was found that a relatively small difference in task complexity between groups greatly affected participants’ performance. The files uploaded here contain the raw data and "distance scores" for a study of how people make organization design decisions. See the appendices of the article for tips on how to set up similar studies.

  5. MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE...

    • catalog.data.gov
    • s.cnmilf.com
    • +1 more
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE CLUSTERING [Dataset]. https://catalog.data.gov/dataset/multi-label-asrs-dataset-classification-using-semi-supervised-subspace-clustering
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE CLUSTERING. MOHAMMAD SALIM AHMED, LATIFUR KHAN, NIKUNJ OZA, AND MANDAVA RAJESWARI. Abstract. There has been a lot of research targeting text classification. Much of it focuses on a particular characteristic of text data: multi-labelity, which arises from the fact that a document may be associated with multiple classes at the same time. The consequence of this characteristic is the low performance of traditional binary or multi-class classification techniques on multi-label text data. In this paper, we propose a text classification technique that takes this characteristic into account and provides very good performance. Our multi-label text classification approach is an extension of our previously formulated [3] multi-class text classification approach called SISC (Semi-supervised Impurity based Subspace Clustering). We call this new classification model SISC-ML (SISC Multi-Label). Empirical evaluation on the real-world multi-label NASA ASRS (Aviation Safety Reporting System) dataset reveals that our approach outperforms state-of-the-art text classification as well as subspace clustering algorithms.

  6. Educational Attainment in North Carolina Public Schools: Use of statistical...

    • data.mendeley.com
    Updated Nov 14, 2018
    Cite
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Explore at:
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The purpose of data mining analysis is to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset: before any modeling, the data has to be pre-processed, which normally involves feature selection and dimensionality reduction. We tried to use clustering both to reduce the dimension of the data and to create new features. In our project, using clustering prior to classification did not improve performance much; one possible reason is that the features we selected for clustering were not well suited to it. Given the nature of the data, classification tasks provide more information to work with in terms of improving knowledge and overall performance metrics.

    From the dimensionality reduction perspective: clustering differs from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with minimal loss of information. Using clusters to reduce the data dimension can lose a great deal of information, since clustering techniques are based on a metric of distance, and at high dimensions Euclidean distance loses nearly all meaning. Mapping data points to cluster numbers as a form of "reducing" dimensionality is therefore not always good, since you may lose almost all the information.

    From the feature creation perspective: clustering analysis creates labels based on patterns in the data, which introduces uncertainty. When clustering precedes classification, the choice of the number of clusters strongly affects clustering performance, and in turn classification performance. If the subset of features used for clustering is well suited to it, the overall classification performance might improve; for example, if the features used with k-means are numerical and low-dimensional, classification may benefit.

    We deliberately did not fix the clustering outputs with a random_state, in order to see whether they were stable. Our assumption was that if the results vary strongly from run to run, which they definitely did, the data may simply not cluster well with the selected methods. In practice, our results with clustering in the preprocessing step were not much better than random. Finally, it is important to put a feedback loop in place to continuously collect the same data in the same format from which the models were created; this loop can be used to measure the models' real-world effectiveness and to revise them over time as things change.
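
A minimal sketch of the pipeline discussed above: append k-means cluster assignments as extra (one-hot) features before classification, then compare cross-validated accuracy. Synthetic data stands in for the North Carolina school datasets, and scikit-learn is assumed:

```python
# Sketch: clustering as feature creation before classification.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Baseline: classify on the raw features.
base = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

# Add the k-means cluster id as new one-hot features.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
X_aug = np.hstack([X, np.eye(4)[labels]])
aug = cross_val_score(LogisticRegression(max_iter=1000), X_aug, y, cv=5).mean()
```

As the description observes, the augmented score is not guaranteed to beat the baseline; whether it helps depends on how well the chosen features actually cluster.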

  7. VK group classificaton

    • kaggle.com
    zip
    Updated Jul 12, 2021
    Cite
    Nikita Alenkin (2021). VK group classificaton [Dataset]. https://www.kaggle.com/datasets/nikitaalenkin/vk-group-classificaton
    Explore at:
    zip (71881011 bytes)
    Dataset updated
    Jul 12, 2021
    Authors
    Nikita Alenkin
    Description

    Context

    Using the VK API, the author parsed the first 1000 posts from 9 very popular Russian groups. The data contains easy-to-handle information about group content and the first 100 user comments below every post. With this data you can practice your NLP, data preprocessing, and ML skills, as well as your ability to extract insights from data.

    Two datasets are present:

    • group_data.xlsx
    • parsing_comments.csv

    The author also plans to add a detailed notebook exploring the available data, with the motive of establishing a general workflow for data preprocessing, visualization, NLP, and ML techniques.

    Feel free to experiment with interesting methods in data analytics, visualization, etc with the available data.

  8. Data from: Data Nuggets: A Method for Reducing Big Data While Preserving...

    • tandf.figshare.com
    tar
    Updated Jun 11, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Traymon E. Beavers; Ge Cheng; Yajie Duan; Javier Cabrera; Mariusz Lubomirski; Dhammika Amaratunga; Jeffrey E. Teigler (2024). Data Nuggets: A Method for Reducing Big Data While Preserving Data Structure [Dataset]. http://doi.org/10.6084/m9.figshare.25594361.v1
    Explore at:
    tar
    Dataset updated
    Jun 11, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Traymon E. Beavers; Ge Cheng; Yajie Duan; Javier Cabrera; Mariusz Lubomirski; Dhammika Amaratunga; Jeffrey E. Teigler
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Big data, with N × P dimension where N is extremely large, has created new challenges for data analysis, particularly in the realm of creating meaningful clusters of data. Clustering techniques, such as K-means or hierarchical clustering, are popular methods for performing exploratory analysis on large datasets. Unfortunately, these methods are not always possible to apply to big data due to memory or time constraints generated by calculations of order P·N(N−1)/2. To circumvent this problem, the clustering technique is typically applied to a random sample drawn from the dataset; a weakness, however, is that the structure of the dataset, particularly at the edges, is not necessarily maintained. We propose a new solution through the concept of “data nuggets”, which reduces a large dataset into a small collection of nuggets of data, each containing a center, weight, and scale parameter. The data nuggets are then input into algorithms that compute methods such as principal components analysis and clustering in a more computationally efficient manner. We show the consistency of the data-nugget-based covariance estimator and apply the methodology of data nuggets to perform exploratory analysis of a flow cytometry dataset containing over one million observations, using PCA and K-means clustering for weighted observations. Supplementary materials for this article are available online.
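
The two-step idea behind data nuggets can be illustrated with a hedged sketch (not the authors' implementation): first collapse a large dataset into a small set of weighted representatives, then run the final clustering on those representatives using their weights, which scikit-learn's K-Means supports via `sample_weight`:

```python
# Sketch of the data-nugget idea: reduce N points to a few weighted
# representatives, then cluster the representatives. Assumes scikit-learn.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=20000, centers=3, random_state=0)

# Step 1: reduce N = 20000 points to 200 "nuggets" (centers + weights).
reducer = KMeans(n_clusters=200, n_init=3, random_state=0).fit(X)
nuggets = reducer.cluster_centers_
weights = np.bincount(reducer.labels_, minlength=200)  # points per nugget

# Step 2: run the final clustering on the nuggets, weighted by size.
final = KMeans(n_clusters=3, n_init=10, random_state=0).fit(
    nuggets, sample_weight=weights)
```

The final clustering now touches only 200 rows instead of 20000, which is the computational point of the reduction.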

  9. Data from: A clustering method for repeat analysis in DNA sequences

    • catalog.data.gov
    • odgavaprod.ogopendata.com
    • +1 more
    Updated Sep 6, 2025
    Cite
    National Institutes of Health (2025). A clustering method for repeat analysis in DNA sequences [Dataset]. https://catalog.data.gov/dataset/a-clustering-method-for-repeat-analysis-in-dna-sequences
    Explore at:
    Dataset updated
    Sep 6, 2025
    Dataset provided by
    National Institutes of Health
    Description

    Background: A computational system for analysis of the repetitive structure of genomic sequences is described. The method uses suffix trees to organize and search the input sequences; this data structure has been used previously for efficient computation of exact and degenerate repeats. Results: The resulting software tool collects all repeat classes and outputs summary statistics as well as a multi-FASTA file containing multiple sequences that can be used as the target of searches. Its use is demonstrated here on several complete microbial genomes, the entire Arabidopsis thaliana genome, and a large collection of rice bacterial artificial chromosome end sequences. Conclusions: We propose a new clustering method for analysis of the repeat data captured in suffix trees. This method has been incorporated into a system that can find repeats in individual genome sequences or sets of sequences, and that can organize those repeats into classes. It quickly and accurately creates repeat databases from both small and large genomes. The associated software (RepeatFinder) should prove helpful in the analysis of repeat structure for both complete and partial genome sequences.
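
The tool itself builds suffix trees; as a deliberately simplified stand-in, exact repeated k-mers can be collected by plain substring counting, which illustrates what "collecting repeat classes" means:

```python
# Simplified repeat finder: count every length-k substring and keep those
# that occur at least twice. (The paper's system uses suffix trees instead,
# which handle all repeat lengths at once and scale to whole genomes.)
from collections import Counter

def repeated_kmers(seq, k):
    """Return {kmer: count} for every length-k substring occurring >= 2 times."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n >= 2}

repeats = repeated_kmers("ACGTACGTTTACGT", 4)  # → {'ACGT': 3, 'TACG': 2}
```

This naive approach costs O(k·n) per chosen k; the suffix-tree organization is what makes the full system efficient on large genomes.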

  10. Data_Sheet_1_Improved space breakdown method – A robust clustering technique...

    • frontiersin.figshare.com
    docx
    Updated Jun 1, 2023
    Cite
    Eugen-Richard Ardelean; Ana-Maria Ichim; Mihaela Dînşoreanu; Raul Cristian Mureşan (2023). Data_Sheet_1_Improved space breakdown method – A robust clustering technique for spike sorting.docx [Dataset]. http://doi.org/10.3389/fncom.2023.1019637.s001
    Explore at:
    docx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Eugen-Richard Ardelean; Ana-Maria Ichim; Mihaela Dînşoreanu; Raul Cristian Mureşan
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Space Breakdown Method (SBM) is a clustering algorithm developed specifically for low-dimensional neuronal spike sorting. Cluster overlap and imbalance are common characteristics of neuronal data that create difficulties for clustering methods. SBM is able to identify overlapping clusters through its design of cluster centre identification and the expansion of these centres. SBM’s approach is to divide the distribution of values of each feature into chunks of equal size; in each chunk the number of points is counted, and based on these counts the centres of clusters are found and expanded. SBM has been shown to be competitive with other well-known clustering algorithms, especially in the two-dimensional case, while being too computationally expensive for high-dimensional data. Here, we present two main improvements to the original algorithm that increase its ability to deal with high-dimensional data while preserving its performance: the initial array structure is replaced with a graph structure, and the number of partitions is made feature-dependent; we call this improved version the Improved Space Breakdown Method (ISBM). In addition, we propose a clustering validation metric that does not punish overclustering and thus obtains more suitable evaluations of clustering for spike sorting. Extracellular data recorded from the brain is unlabelled, so we chose simulated neural data, for which we have the ground truth, to evaluate performance more accurately. Evaluations conducted on synthetic data indicate that the proposed improvements reduce the space and time complexity of the original algorithm while simultaneously leading to increased performance on neural data compared with other state-of-the-art algorithms. Code available at https://github.com/ArdeleanRichard/Space-Breakdown-Method.
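
The chunk-counting idea at SBM's core can be illustrated with a rough sketch (not the authors' code, which is linked above): split each feature's range into equal chunks, count points per cell, and treat the densest cells as candidate cluster centres.

```python
# Rough illustration of chunk-counting centre detection (hypothetical
# simplification; the real SBM/ISBM also expands centres and uses a graph).
import numpy as np
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.5, random_state=1)

n_chunks = 10
counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=n_chunks)

# Take the 3 densest cells as candidate cluster centres.
flat = np.argsort(counts.ravel())[::-1][:3]
ix, iy = np.unravel_index(flat, counts.shape)
centres = np.column_stack([
    (xedges[ix] + xedges[ix + 1]) / 2,
    (yedges[iy] + yedges[iy + 1]) / 2,
])

# Assign every point to its nearest candidate centre.
labels = np.argmin(((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1),
                   axis=1)
```

Counting per cell is what keeps the method cheap in low dimensions, and also why a fixed grid blows up as dimensionality grows, motivating ISBM's graph structure.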

  11. Data from: A virtual multi-label approach to imbalanced data classification

    • tandf.figshare.com
    text/x-tex
    Updated Feb 28, 2024
    Cite
    Elizabeth P. Chou; Shan-Ping Yang (2024). A virtual multi-label approach to imbalanced data classification [Dataset]. http://doi.org/10.6084/m9.figshare.19390561.v1
    Explore at:
    text/x-tex
    Dataset updated
    Feb 28, 2024
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Elizabeth P. Chou; Shan-Ping Yang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    One of the most challenging issues in machine learning is imbalanced data analysis. In this type of research, correctly predicting minority labels is usually more critical than correctly predicting majority labels. However, traditional machine learning techniques easily lead to learning bias: traditional classifiers tend to place all subjects in the majority group, resulting in biased predictions. Machine learning studies are typically conducted from one of two perspectives, a data-based perspective or a model-based perspective. Oversampling and undersampling are examples of data-based approaches, while adding costs, penalties, or weights to optimize the algorithm is typical of a model-based approach. Some ensemble methods have been studied recently, but these methods cause various problems, such as overfitting, the omission of some information, and long computation times, and they do not apply to all kinds of datasets. To address these problems, the virtual labels (ViLa) approach for the majority label is proposed to solve the imbalance problem. A new multiclass classification approach with the equal K-means clustering method is demonstrated in the study. The proposed method is compared with commonly used imbalance-problem methods, such as sampling methods (oversampling, undersampling, and SMOTE) and classifier methods (SVM and one-class SVM). The results show that the proposed method performs better as the degree of data imbalance increases and gradually outperforms the other methods.

  12. Data from: The Advantages of Using Group Means in Estimating the Lorenz...

    • tandf.figshare.com
    pdf
    Updated Jun 3, 2023
    Cite
    Merritt Lyon; Li C. Cheung; Joseph L. Gastwirth (2023). The Advantages of Using Group Means in Estimating the Lorenz Curve and Gini Index From Grouped Data [Dataset]. http://doi.org/10.6084/m9.figshare.1583396
    Explore at:
    pdf
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Merritt Lyon; Li C. Cheung; Joseph L. Gastwirth
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A recent article proposed a histogram-based method for estimating the Lorenz curve and Gini index from grouped data that did not use the group means reported by government agencies. When comparing their method to one based on group means, the authors assume a uniform density in each grouping interval, which leads to an overestimate of the overall average income. After reviewing the additional information in the group means, it will be shown that as the number of groups increases, the bounds on the Gini index obtained from the group means become narrower. This is not necessarily true for the histogram method. Two simple interpolation methods using the group means are described and the accuracy of the estimated Gini index they yield and the histogram-based one are compared to the published Gini index for the 1967–2013 period. The average absolute errors of the estimated Gini index obtained from the two methods using group means are noticeably less than that of the histogram-based method. Supplementary materials for this article are available online. [Received August 2014. Revised September 2015.]
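
The group-means approach the article builds on can be sketched numerically: form the Lorenz curve from cumulative population and income shares, then integrate by trapezoids (illustrative quintile numbers, not the article's data):

```python
# Minimal sketch: Gini index from grouped data using group means.
import numpy as np

pop_shares = np.array([0.2, 0.2, 0.2, 0.2, 0.2])        # quintiles
group_means = np.array([10.0, 20.0, 30.0, 45.0, 95.0])  # sorted ascending

income_shares = pop_shares * group_means
income_shares /= income_shares.sum()

# Lorenz curve points, starting at (0, 0).
x = np.concatenate([[0.0], np.cumsum(pop_shares)])
L = np.concatenate([[0.0], np.cumsum(income_shares)])

# Gini = 1 - 2 * (area under the Lorenz curve), via trapezoids.
gini = 1.0 - np.sum(np.diff(x) * (L[1:] + L[:-1]))  # → 0.39
```

Because the trapezoids assume equality within each group, this estimate is a lower bound on the true Gini; as the article notes for group-means bounds, it tightens as the number of groups increases.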

  13. Clustering of samples and variables with mixed-type data

    • plos.figshare.com
    tiff
    Updated Jun 1, 2023
    Cite
    Manuela Hummel; Dominic Edelmann; Annette Kopp-Schneider (2023). Clustering of samples and variables with mixed-type data [Dataset]. http://doi.org/10.1371/journal.pone.0188274
    Explore at:
    tiff
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Manuela Hummel; Dominic Edelmann; Annette Kopp-Schneider
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of data measured on different scales is a relevant challenge. Biomedical studies often focus on high-throughput datasets of, e.g., quantitative measurements. However, the need for integration of other features possibly measured on different scales, e.g. clinical or cytogenetic factors, becomes increasingly important. The analysis results (e.g. a selection of relevant genes) are then visualized, while adding further information, like clinical factors, on top. However, a more integrative approach is desirable, where all available data are analyzed jointly, and where also in the visualization different data sources are combined in a more natural way. Here we specifically target integrative visualization and present a heatmap-style graphic display. To this end, we develop and explore methods for clustering mixed-type data, with special focus on clustering variables. Clustering of variables does not receive as much attention in the literature as does clustering of samples. We extend the variables clustering methodology by two new approaches, one based on the combination of different association measures and the other on distance correlation. With simulation studies we evaluate and compare different clustering strategies. Applying specific methods for mixed-type data proves to be comparable and in many cases beneficial as compared to standard approaches applied to corresponding quantitative or binarized data. Our two novel approaches for mixed-type variables show similar or better performance than the existing methods ClustOfVar and bias-corrected mutual information. Further, in contrast to ClustOfVar, our methods provide dissimilarity matrices, which is an advantage, especially for the purpose of visualization. Real data examples aim to give an impression of various kinds of potential applications for the integrative heatmap and other graphical displays based on dissimilarity matrices. 
We demonstrate that the presented integrative heatmap provides more information than common data displays about the relationship among variables and samples. The described clustering and visualization methods are implemented in our R package CluMix available from https://cran.r-project.org/web/packages/CluMix.
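
CluMix itself is an R package; as a hedged Python sketch of the same workflow, a Gower-style dissimilarity for mixed-type samples can be built by hand (numeric variables contribute range-scaled differences, categorical ones simple mismatches) and fed to hierarchical clustering, which accepts precomputed distances:

```python
# Sketch: mixed-type dissimilarity matrix + hierarchical clustering.
# (Gower-style construction assumed for illustration; not CluMix's code.)
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

num = np.array([1.0, 1.2, 5.0, 5.3])   # a numeric variable
cat = np.array(["a", "a", "b", "b"])   # a categorical variable

n = len(num)
rng = num.max() - num.min()
D = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        d_num = abs(num[i] - num[j]) / rng  # range-scaled difference
        d_cat = float(cat[i] != cat[j])     # simple mismatch
        D[i, j] = (d_num + d_cat) / 2       # average over variables

Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
```

Having an explicit dissimilarity matrix, rather than only cluster labels, is exactly what enables the heatmap-style displays described above.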

  14. Benchmarks datasets for cluster analysis

    • kaggle.com
    zip
    Updated Nov 15, 2023
    Cite
    Onthada Preedasawakul (2023). Benchmarks datasets for cluster analysis [Dataset]. https://www.kaggle.com/datasets/onthada/benchmarks-datasets-for-clustering
    Explore at:
    zip (608532 bytes)
    Dataset updated
    Nov 15, 2023
    Authors
    Onthada Preedasawakul
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    25 Artificial Datasets

    The datasets are generated using either Gaussian or Uniform distributions. Each dataset contains several known sub-groups intended for testing centroid-based clustering results and cluster validity indices.

    Cluster analysis is a popular machine learning used for segmenting datasets with similar data points in the same group. For those who are familiar with R, there is a new R package called "UniversalCVI" https://CRAN.R-project.org/package=UniversalCVI used for cluster evaluation. This package provides algorithms for checking the accuracy of a clustering result with known classes, computing cluster validity indices, and generating plots for comparing them. The package is compatible with K-means, fuzzy C means, EM clustering, and hierarchical clustering (single, average, and complete linkage). To use the "UniversalCVI" package, one can follow the instructions provided in the R documentation.

    For more in-depth details of the package and cluster evaluation, see the papers at https://doi.org/10.1016/j.patcog.2023.109910 and https://arxiv.org/abs/2308.14785.

    All the datasets are also available on GitHub at https://github.com/O-PREEDASAWAKUL/FuzzyDatasets.git.
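The evaluation workflow these benchmarks are designed for (generate data with known sub-groups, run a centroid-based method, score the result against the known classes) can be sketched in Python with scikit-learn rather than the R package itself; the uniform two-group layout below is illustrative, not one of the 25 datasets:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(7)

# A uniform-distribution benchmark with two known sub-groups:
# two unit squares separated by a gap along the x-axis.
a = rng.uniform(low=[0.0, 0.0], high=[1.0, 1.0], size=(150, 2))
b = rng.uniform(low=[2.0, 0.0], high=[3.0, 1.0], size=(150, 2))
X = np.vstack([a, b])
known = np.repeat([0, 1], 150)

# Centroid-based clustering, then accuracy against the known classes.
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
ari = adjusted_rand_score(known, pred)
```

The adjusted Rand index is label-permutation invariant, so it does not matter which cluster K-means happens to call 0 or 1.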


  15. Data on students' group project preferences

    • datarepository.eur.nl
    • dataverse.nl
    Updated May 30, 2023
    Cite
    Tim M. Benning (2023). Data on students' group project preferences [Dataset]. http://doi.org/10.25397/eur.20342649.v1
    Explore at:
    Dataset updated
    May 30, 2023
    Dataset provided by
    Erasmus University Rotterdam (EUR)
    Authors
    Tim M. Benning
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data files contain information about the preferences of first- and second-year bachelor students, obtained via a discrete choice experiment (12 choice tasks per respondent), demographic characteristics of the sample and population, experiences with free-riding, attitude towards teamwork, and a measure of individualism/collectivism. Students were presented with a different grade weight before each choice task (i.e., 10%, 30%, or 100%). The data were collected from mid-June to mid-July 2021.

    Access to the data is subject to the approval of a data sharing agreement due to the personal information contained in the dataset.

    A summary of the publication can be found below: Reducing free-riding is an important challenge for educators who use group projects. In this study, we measure students’ preferences for group project characteristics and investigate if characteristics that better help to reduce free-riding become more important for students when stakes increase. We used a discrete choice experiment based on twelve choice tasks in which students chose between two group projects that differed on five characteristics of which each level had its own effect on free-riding. A different group project grade weight was presented before each choice task to manipulate how much there was at stake for students in the group project. Data of 257 student respondents were used in the analysis. Based on random parameter logit model estimates we find that students prefer (in order of importance) assignment based on schedule availability and motivation or self-selection (instead of random assignment), the use of one or two peer process evaluations (instead of zero), a small team size of three or two students (instead of four), a common grade (instead of a divided grade), and a discussion with the course coordinator without a sanction as a method to handle free-riding (instead of member expulsion). Furthermore, we find that the characteristic team formation approach becomes even more important (especially self-selection) when student stakes increase. Educators can use our findings to design group projects that better help to reduce free-riding by (1) avoiding random assignment as team formation approach, (2) using (one or two) peer process evaluations, and (3) creating small(er) teams.

  16. Supplementary data for the publication: A Grouping Method for Optimization...

    • data.4tu.nl
    zip
    Updated Feb 1, 2021
    + more versions
    Cite
    Tom van Woudenberg; Frans van der Meer (2021). Supplementary data for the publication: A Grouping Method for Optimization of Steel Skeletal Structures by Applying a Combinatorial Search Algorithm Based on a Fully Stressed Design [Dataset]. http://doi.org/10.4121/12718790.v3
    Explore at:
    zipAvailable download formats
    Dataset updated
    Feb 1, 2021
    Dataset provided by
    4TU.Centre for Research Data
    Authors
    Tom van Woudenberg; Frans van der Meer
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset belongs to the publication: A Grouping Method for Optimization of Steel Skeletal Structures by Applying a Combinatorial Search Algorithm Based on a Fully Stressed Design. It contains all input data for the eight benchmark problems used and the results of the numerical experiments.

  17. Data from: The influence of a priori grouping on inference of genetic...

    • datadryad.org
    • search.dataone.org
    • +1more
    zip
    Updated Aug 21, 2020
    Cite
    Joshua Miller; Catherine Cullingham; Rhiannon Peery (2020). The influence of a priori grouping on inference of genetic clusters: simulation study and literature review of the DAPC method [Dataset]. http://doi.org/10.5061/dryad.4tmpg4f76
    Explore at:
    zipAvailable download formats
    Dataset updated
    Aug 21, 2020
    Dataset provided by
    Dryad
    Authors
    Joshua Miller; Catherine Cullingham; Rhiannon Peery
    Time period covered
    Jul 19, 2020
    Description

    A detailed description of the files is given in Miller_et_al_Dryad_Read_Me.txt.

  18. Credit Card Customer Data

    • kaggle.com
    zip
    Updated May 15, 2021
    Cite
    Arya Shah (2021). Credit Card Customer Data [Dataset]. https://www.kaggle.com/datasets/aryashah2k/credit-card-customer-data/discussion
    Explore at:
    zip(6431 bytes)Available download formats
    Dataset updated
    May 15, 2021
    Authors
    Arya Shah
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    A customer credit card information dataset that can be used for identifying loyal customers, customer segmentation, targeted marketing, and other such use cases in the marketing industry.

    A few tasks that can be performed using this dataset are as follows:
    - Perform data cleaning, preprocessing, visualization, and feature engineering on the dataset.
    - Implement hierarchical clustering and K-means clustering models.
    - Create an RFM (Recency, Frequency, Monetary) matrix to identify loyal customers.

    Content

    The attributes include:
    - Sl_No
    - Customer Key
    - AvgCreditLimit
    - TotalCreditCards
    - Totalvisitsbank
    - Totalvisitsonline
    - Totalcallsmade
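A minimal sketch of the suggested segmentation workflow (scaling, then K-means and hierarchical clustering), using synthetic stand-in data with the attribute names listed above; the values are fabricated and scikit-learn is an assumption, not part of the dataset:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Fabricated stand-in for the real file, reusing its attribute names.
df = pd.DataFrame({
    "AvgCreditLimit": rng.integers(3_000, 200_000, size=200),
    "TotalCreditCards": rng.integers(1, 11, size=200),
    "Totalvisitsbank": rng.integers(0, 6, size=200),
    "Totalvisitsonline": rng.integers(0, 16, size=200),
    "Totalcallsmade": rng.integers(0, 11, size=200),
})

# Scale first: distance-based methods would otherwise be dominated by AvgCreditLimit.
X = StandardScaler().fit_transform(df)
df["kmeans_segment"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
df["hier_segment"] = AgglomerativeClustering(n_clusters=3).fit_predict(X)
```

The identifier columns (Sl_No, Customer Key) are deliberately left out of the feature matrix, since they carry no behavioral information.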

  19. Performance comparison of different model based clustering methods on wine...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    • +1more
    Updated May 18, 2022
    Cite
    Pal, Samyajoy; Heumann, Christian (2022). Performance comparison of different model based clustering methods on wine data. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000322074
    Explore at:
    Dataset updated
    May 18, 2022
    Authors
    Pal, Samyajoy; Heumann, Christian
    Description

    Performance comparison of different model based clustering methods on wine data.

  20. Data from: U-Statistical Inference for Hierarchical Clustering

    • tandf.figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    Marcio Valk; Gabriela Bettella Cybis (2023). U-Statistical Inference for Hierarchical Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.12844523.v3
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Marcio Valk; Gabriela Bettella Cybis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Clustering methods are valuable tools for the identification of patterns in high-dimensional data with applications in many scientific fields. However, quantifying uncertainty in clustering is a challenging problem, particularly when dealing with high-dimension low sample size (HDLSS) data. We develop a U-statistics based clustering approach that assesses statistical significance in clustering and is specifically tailored to HDLSS scenarios. These nonparametric methods rely on very few assumptions about the data, and thus can be applied to a wide range of datasets for which the Euclidean distance captures relevant features. Our main result is the development of a hierarchical significance clustering method. To do so, we first introduce an extension of a relevant U-statistic and develop its asymptotic theory. Additionally, as a preliminary step, we propose a binary nonnested significance clustering method and show its optimality in terms of expected values. Our approach is tested through multiple simulations and found to have more statistical power than competing alternatives in all scenarios considered. Our methods are further showcased in three applications ranging from genetics to image recognition problems. Code for these methods is available in R-package uclust. Supplementary materials for this article are available online.


How to use the dataset


To use this dataset effectively, it is important to understand how your chosen algorithm works, since some require parameters to be specified beforehand while others handle those details automatically; otherwise your interpretation of the results may be invalid. Furthermore, familiarize yourself with concepts like the silhouette score (an internal measure of cluster cohesion and separation) and the Rand index (which measures agreement between a clustering and reference labels), so you can judge whether your clustering reaches an acceptable level of accuracy. Good luck!
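Both metrics can be computed in a few lines; this sketch assumes scikit-learn and synthetic Gaussian data rather than the files in this dataset:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

X, y = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

# Silhouette: internal cohesion vs. separation, no ground truth needed
# (ranges from -1 to 1; higher is better).
sil = silhouette_score(X, labels)

# Adjusted Rand index: agreement between two labelings
# (1.0 means identical partitions, ~0 means chance-level agreement).
ari = adjusted_rand_score(y, labels)
```

The silhouette score is useful when no true labels exist (e.g. for choosing the number of clusters), while the Rand index needs a reference labeling such as the Species column in iris.csv.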

Research Ideas

  • Utilizing the sepal and petal lengths and widths to perform flower recognition or part of a larger image recognition pipeline.
  • Classifying the data points in each dataset by the X-Y coordinates using clustering algorithms to analyze galaxy locations or overall formation patterns for stars, planets, or galaxies.
  • Exploring correlations between species of flowers in terms of sepal/petal lengths by performing supervised learning tasks such as classification with this dataset.

Acknowledgements

If you use this dataset in your research, please credit the original authors.

License

License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.

Columns

File: moon.csv

| Column name | Description                               |
|:------------|:------------------------------------------|
| X           | X coordinate of the data point. (Numeric) |
| Y           | Y coordinate of the data point. (Numeric) |

File: iris.csv

| Column name  | Description                                  |
|:-------------|:---------------------------------------------|
| Sepal.Length | Length of the sepal of the flower. (Numeric) |
| Petal.Length | Length of the petal of the flower. (Numeric) |
| Species      | Species of the flower. (Categorical)         |
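To illustrate how data distributions like moon.csv affect the choice of algorithm, here is a hedged sketch assuming scikit-learn and its `make_moons` generator as a stand-in for the actual file:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# moon.csv-style data: two interleaving half-circles. A centroid-based method
# cuts a straight boundary through them, while a density-based method follows
# each arc (eps here is illustrative and tuned to this noise level).
X, y = make_moons(n_samples=300, noise=0.05, random_state=0)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db = DBSCAN(eps=0.3).fit_predict(X)

ari_km = adjusted_rand_score(y, km)  # well below 1: convex boundary fails
ari_db = adjusted_rand_score(y, db)  # much higher: both moons recovered
```

The same contrast holds for circles.csv-style concentric rings, which is exactly why this dataset bundles several distribution shapes for comparing algorithms.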

