100+ datasets found

Data from: Galaxy clustering
kaggle.com
zip
Updated Jan 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Devastator (2023). Galaxy clustering [Dataset]. https://www.kaggle.com/datasets/thedevastator/clustering-polygons-utilizing-iris-moon-and-circ
Explore at:
zip(6339 bytes)Available download formats
Dataset updated
Jan 3, 2023
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Galaxy clustering

Iris, Moon, and Circles datasets for Galaxy clustering tutorial

By [source]

About this dataset

This dataset contains a wealth of information that can be used to explore the effectiveness of various clustering algorithms. With its inclusion of numerical measurements (X, Y, Sepal.Length, and Petal.Length) and categorical values (Species), it is possible to investigate the relationship between different types of variables and clustering performance. Additionally, by comparing results for the 3 datasets provided - moon.csv (which contains x and y coordinates), iris.csv (which contains measurements for sepal and petal lengths),and circles.csv - we can gain insights into how different data distributions affect clustering techniques such as K-Means or Hierarchical Clustering among others!

More Datasets

For more datasets, click here.

Featured Notebooks

🚨 Your notebook can be here! 🚨!

How to use the dataset

This dataset can also be a great starting point to further explore more complex clusters by using higher dimensional space variables such as color or texture that may be present in other datasets not included here but which can help to form more accurate groups when using cluster-analysis algorithms. Additionally, it could also assist in visualization projects where clusters may need to be generated such as plotting mapped data points or examining relationships between two different variables within a certain region drawn on a chart.

To use this dataset effectively it is important to understand how exactly your chosen algorithm works since some require specifying parameters beforehand while others take care of those details automatically; otherwise the interpretation may be invalid depending on the methods used alongside clustering you intend for your project. Furthermore, familiarize yourself with concepts like silhouette score and rand index - these are commonly used metrics that measure your cluster’s performance against other clusterings models so you know if what you have done so far satisfies an acceptable level of accuracy or not yet! Good luck!

Research Ideas

Utilizing the sepal and petal lengths and widths to perform flower recognition or part of a larger image recognition pipeline.

Classifying the data points in each dataset by the X-Y coordinates using clustering algorithms to analyze galaxy locations or overall formation patterns for stars, planets, or galaxies.

Exploring correlations between species of flowers in terms of sepal/petal lengths by performing supervised learning tasks such as classification with this dataset

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: moon.csv | Column name | Description | |:--------------|:------------------------------------------| | X | X coordinate of the data point. (Numeric) | | Y | Y coordinate of the data point. (Numeric) |

File: iris.csv | Column name | Description | |:-----------------|:---------------------------------------------| | Sepal.Length | Length of the sepal of the flower. (Numeric) | | Petal.Length | Length of the petal of the flower. (Numeric) | | Species | Species of the flower. (Categorical) |

Acknowledgements

If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit .
Data from: EEG signature of grouping strategies in numerosity perception
data.europa.eu
unknown
Updated Feb 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zenodo (2024). EEG signature of grouping strategies in numerosity perception [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-7913116?locale=lv
Explore at:
unknown(11798)Available download formats
Dataset updated
Feb 19, 2024
Dataset authored and provided by
Zenodohttp://zenodo.org/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Behavioral Data The excel file contains for each row data from individual participant. We reported the average response (columns B-E) and precision index (Weber fractions; columns G-J) for 6 and 8 items, both grouped and ungrouped. EEG Data The excel file contains for each row data from individual participant. We reported the N1 latency (columns B-K), N1 amplitude (columns M-V) and P2p amplitude (columns X-AG). Values are reported for the subitizing range (3 and 4 items) and the estimation range (6 and 8 items). In the estimation range values are separately reported for spatial arrangement (grouped and ungrouped) and number of subgroups (3 or 4 subgroups).
Z
Data from: Grouping strategies in number estimation extend the subitizing...
data-staging.niaid.nih.gov
data.niaid.nih.gov
Updated Nov 30, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Maldonado Moscoso Paula Andrea; Castaldi Elisa; Burr David C.; Arrighi Roberto; Anobile Giovanni (2020). Grouping strategies in number estimation extend the subitizing range [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_4292116
Explore at:
Dataset updated
Nov 30, 2020
Dataset provided by
Department of Neuroscience, Psychology, Pharmacology and Child Health, University of Florence, Florence, Italy
Authors
Maldonado Moscoso Paula Andrea; Castaldi Elisa; Burr David C.; Arrighi Roberto; Anobile Giovanni
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In the calculation folder: each file contains a matrix called “MATR”. Each row of the matrix “MATR” is a trial.

The columns contain the following information:

1st: Number of trial

2nd: Subject response

4th: Response time

5th: first number

6th: math symbol (1=*; 2= +; 3= –)

7th: second number

8th: third number

In the calculation folder: each file contains a matrix called “matr”. Each row of the matrix “matr” is a trial.

The columns contain the following information:

1st: subject response in the numerosity task

2nd: the presented numerosity

3rd: subject response in the numerosity task

4th: zero

5th: stimulus duration

6th: Response time in the numerosity task

7th: Grouped (1) or random (2) presentation

8th: 1

9th: 1

10th: Number of items of the upper-left quadrant

11th: Number of items of the lower-left quadrant

12th: Number of items of the upper-right quadrant

13th: Number of items of the lower - right quadrant

14th: odd shape presented (1=diamond; 2=triangle; 3=circle)

15th: subject response in the shape task

16th: 0.2 in the single task response time in the shape task when dual task

17th: single (0) or dual (1) task

18th: time stimulus on

19th: time stimulus off
D
Replication data for: “Role grouping experiments: A new method for studying...
dataverse.no
dataverse.azure.uit.no
+1more
docx, pdf, txt, xlsx
Updated Jan 13, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nicolay Worren; Nicolay Worren (2025). Replication data for: “Role grouping experiments: A new method for studying organization re-design decisions” [Dataset]. http://doi.org/10.18710/GURHXD
Explore at:
txt(121670), pdf(391557), pdf(129952), pdf(137725), docx(45321), txt(10183), xlsx(226265)Available download formats
Unique identifier
https://doi.org/10.18710/GURHXD
Dataset updated
Jan 13, 2025
Dataset provided by
DataverseNO
Authors
Nicolay Worren; Nicolay Worren
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Oslo, Norway
Description
We developed an experimental method that can be used to study organization design and grouping decisions more specifically. We demonstrate the method in a study with 285 participants. The participants were asked to group a set of nine roles into units using card-sorting. The role descriptions indicated that there were interdependencies between some of the roles. Participants’ grouping decisions were quantified and compared against an algorithmic solution that minimized coordination costs. It was found that a relatively small difference in task complexity between groups greatly affected participants’ performance. The files that are uploaded here contain the raw data and "distance scores" for study of how people make organization design decisions. See the appendices in the article for tips on how to set up similar studies.
d
MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE...
catalog.data.gov
s.cnmilf.com
+1more
Updated Apr 11, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dashlink (2025). MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE CLUSTERING [Dataset]. https://catalog.data.gov/dataset/multi-label-asrs-dataset-classification-using-semi-supervised-subspace-clustering
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE CLUSTERING MOHAMMAD SALIM AHMED, LATIFUR KHAN, NIKUNJ OZA, AND MANDAVA RAJESWARI Abstract. There has been a lot of research targeting text classification. Many of them focus on a particular characteristic of text data - multi-labelity. This arises due to the fact that a document may be associated with multiple classes at the same time. The consequence of such a characteristic is the low performance of traditional binary or multi-class classification techniques on multi-label text data. In this paper, we propose a text classification technique that considers this characteristic and provides very good performance. Our multi-label text classification approach is an extension of our previously formulated [3] multi-class text classification approach called SISC (Semi-supervised Impurity based Subspace Clustering). We call this new classification model as SISC-ML(SISC Multi-Label). Empirical evaluation on real world multi-label NASA ASRS (Aviation Safety Reporting System) data set reveals that our approach outperforms state-of-theart text classification as well as subspace clustering algorithms.
m
Educational Attainment in North Carolina Public Schools: Use of statistical...
data.mendeley.com
Updated Nov 14, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
Explore at:
Unique identifier
https://doi.org/10.17632/6cm9wyd5g5.1
Dataset updated
Nov 14, 2018
Authors
Scott Herford
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The purpose of data mining analysis is always to find patterns of the data using certain kind of techiques such as classification or regression. It is not always feasible to apply classification algorithms directly to dataset. Before doing any work on the data, the data has to be pre-processed and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance has not improved much. The reason why it has not improved could be the features we selected to perform clustering are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics. From the dimensionality reduction perspective: It is different from Principle Component Analysis which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique of reducing the data dimension will lose a lot of information since clustering techniques are based a metric of 'distance'. At high dimensions euclidean distance loses pretty much all meaning. Therefore using clustering as a "Reducing" dimensionality by mapping data points to cluster numbers is not always good since you may lose almost all the information. From the creating new features perspective: Clustering analysis creates labels based on the patterns of the data, it brings uncertainties into the data. By using clustering prior to classification, the decision on the number of clusters will highly affect the performance of the clustering, then affect the performance of classification. If the part of features we use clustering techniques on is very suited for it, it might increase the overall performance on classification. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not lock in the clustering outputs using a random_state in the effort to see if they were stable. Our assumption was that if the results vary highly from run to run which they definitely did, maybe the data just does not cluster well with the methods selected at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering to the data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model real world effectiveness and also to continue to revise the models from time to time as things change.
VK group classificaton
kaggle.com
zip
Updated Jul 12, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nikita Alenkin (2021). VK group classificaton [Dataset]. https://www.kaggle.com/datasets/nikitaalenkin/vk-group-classificaton
Explore at:
zip(71881011 bytes)Available download formats
Dataset updated
Jul 12, 2021
Authors
Nikita Alenkin
Description
Context

Using VK API, author was able to parse the first 1000 posts from 9 very popular Russian groups. Data contains easy to handle information about group content and first 100 comments of users below every post. Using this data you can train your NLP, Data preprocessing and ML skills as well as ability to extract insights from data.

A total of 2 Datasets are present:

group_data.xlsx parsing_comments.csv

Author also aims at adding a detailed notebook playing around with the available data with the motive of establishing a general workflow of how to perform data preprocessing, visualization, NLP and ML techniques.

Feel free to experiment with interesting methods in data analytics, visualization, etc with the available data.
f
Data from: Data Nuggets: A Method for Reducing Big Data While Preserving...
tandf.figshare.com
tar
Updated Jun 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Traymon E. Beavers; Ge Cheng; Yajie Duan; Javier Cabrera; Mariusz Lubomirski; Dhammika Amaratunga; Jeffrey E. Teigler (2024). Data Nuggets: A Method for Reducing Big Data While Preserving Data Structure [Dataset]. http://doi.org/10.6084/m9.figshare.25594361.v1
Explore at:
tarAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.25594361.v1
Dataset updated
Jun 11, 2024
Dataset provided by
Taylor & Francis
Authors
Traymon E. Beavers; Ge Cheng; Yajie Duan; Javier Cabrera; Mariusz Lubomirski; Dhammika Amaratunga; Jeffrey E. Teigler
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Big data, with N × P dimension where N is extremely large, has created new challenges for data analysis, particularly in the realm of creating meaningful clusters of data. Clustering techniques, such as K-means or hierarchical clustering are popular methods for performing exploratory analysis on large datasets. Unfortunately, these methods are not always possible to apply to big data due to memory or time constraints generated by calculations of order P*N(N−1)2. To circumvent this problem, typically the clustering technique is applied to a random sample drawn from the dataset; however, a weakness is that the structure of the dataset, particularly at the edges, is not necessarily maintained. We propose a new solution through the concept of “data nuggets”, which reduces a large dataset into a small collection of nuggets of data, each containing a center, weight, and scale parameter. The data nuggets are then input into algorithms that compute methods such as principal components analysis and clustering in a more computationally efficient manner. We show the consistency of the data nuggets based covariance estimator and apply the methodology of data nuggets to perform exploratory analysis of a flow cytometry dataset containing over one million observations using PCA and K-means clustering for weighted observations. Supplementary materials for this article are available online.
d
Data from: A clustering method for repeat analysis in DNA sequences
catalog.data.gov
odgavaprod.ogopendata.com
+1more
Updated Sep 6, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institutes of Health (2025). A clustering method for repeat analysis in DNA sequences [Dataset]. https://catalog.data.gov/dataset/a-clustering-method-for-repeat-analysis-in-dna-sequences
Explore at:
Dataset updated
Sep 6, 2025
Dataset provided by
National Institutes of Health
Description
Background A computational system for analysis of the repetitive structure of genomic sequences is described. The method uses suffix trees to organize and search the input sequences; this data structure has been used previously for efficient computation of exact and degenerate repeats. Results The resulting software tool collects all repeat classes and outputs summary statistics as well as a file containing multiple sequences (multi fasta), that can be used as the target of searches. Its use is demonstrated here on several complete microbial genomes, the entire Arabidopsis thaliana genome, and a large collection of rice bacterial artificial chromosome end sequences. Conclusions We propose a new clustering method for analysis of the repeat data captured in suffix trees. This method has been incorporated into a system that can find repeats in individual genome sequences or sets of sequences, and that can organize those repeats into classes. It quickly and accurately creates repeat databases from small and large genomes. The associated software (RepeatFinder), should prove helpful in the analysis of repeat structure for both complete and partial genome sequences.
f
Data_Sheet_1_Improved space breakdown method – A robust clustering technique...
frontiersin.figshare.com
docx
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Eugen-Richard Ardelean; Ana-Maria Ichim; Mihaela Dînşoreanu; Raul Cristian Mureşan (2023). Data_Sheet_1_Improved space breakdown method – A robust clustering technique for spike sorting.docx [Dataset]. http://doi.org/10.3389/fncom.2023.1019637.s001
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.3389/fncom.2023.1019637.s001
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers
Authors
Eugen-Richard Ardelean; Ana-Maria Ichim; Mihaela Dînşoreanu; Raul Cristian Mureşan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Space Breakdown Method (SBM) is a clustering algorithm that was developed specifically for low-dimensional neuronal spike sorting. Cluster overlap and imbalance are common characteristics of neuronal data that produce difficulties for clustering methods. SBM is able to identify overlapping clusters through its design of cluster centre identification and the expansion of these centres. SBM’s approach is to divide the distribution of values of each feature into chunks of equal size. In each of these chunks, the number of points is counted and based on this number the centres of clusters are found and expanded. SBM has been shown to be a contender for other well-known clustering algorithms especially for the particular case of two dimensions while being too computationally expensive for high-dimensional data. Here, we present two main improvements to the original algorithm in order to increase its ability to deal with high-dimensional data while preserving its performance: the initial array structure was substituted with a graph structure and the number of partitions has been made feature-dependent, denominating this improved version as the Improved Space Breakdown Method (ISBM). In addition, we propose a clustering validation metric that does not punish overclustering and such obtains more suitable evaluations of clustering for spike sorting. Extracellular data recorded from the brain is unlabelled, therefore we have chosen simulated neural data, to which we have the ground truth, to evaluate more accurately the performance. Evaluations conducted on synthetic data indicate that the proposed improvements reduce the space and time complexity of the original algorithm, while simultaneously leading to an increased performance on neural data when compared with other state-of-the-art algorithms.Code available athttps://github.com/ArdeleanRichard/Space-Breakdown-Method.
Data from: A virtual multi-label approach to imbalanced data classification
tandf.figshare.com
text/x-tex
Updated Feb 28, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Elizabeth P. Chou; Shan-Ping Yang (2024). A virtual multi-label approach to imbalanced data classification [Dataset]. http://doi.org/10.6084/m9.figshare.19390561.v1
Explore at:
text/x-texAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.19390561.v1
Dataset updated
Feb 28, 2024
Dataset provided by
Taylor & Francishttps://taylorandfrancis.com/
Authors
Elizabeth P. Chou; Shan-Ping Yang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
One of the most challenging issues in machine learning is imbalanced data analysis. Usually, in this type of research, correctly predicting minority labels is more critical than correctly predicting majority labels. However, traditional machine learning techniques easily lead to learning bias. Traditional classifiers tend to place all subjects in the majority group, resulting in biased predictions. Machine learning studies are typically conducted from one of two perspectives: a data-based perspective or a model-based perspective. Oversampling and undersampling are examples of data-based approaches, while the addition of costs, penalties, or weights to optimize the algorithm is typical of a model-based approach. Some ensemble methods have been studied recently. These methods cause various problems, such as overfitting, the omission of some information, and long computation times. In addition, these methods do not apply to all kinds of datasets. Based on this problem, the virtual labels (ViLa) approach for the majority label is proposed to solve the imbalanced problem. A new multiclass classification approach with the equal K-means clustering method is demonstrated in the study. The proposed method is compared with commonly used imbalance problem methods, such as sampling methods (oversampling, undersampling, and SMOTE) and classifier methods (SVM and one-class SVM). The results show that the proposed method performs better when the degree of data imbalance increases and will gradually outperform other methods.
Data from: The Advantages of Using Group Means in Estimating the Lorenz...
tandf.figshare.com
pdf
Updated Jun 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Merritt Lyon; Li C. Cheung; Joseph L. Gastwirth (2023). The Advantages of Using Group Means in Estimating the Lorenz Curve and Gini Index From Grouped Data [Dataset]. http://doi.org/10.6084/m9.figshare.1583396
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.1583396
Dataset updated
Jun 3, 2023
Dataset provided by
Taylor & Francishttps://taylorandfrancis.com/
Authors
Merritt Lyon; Li C. Cheung; Joseph L. Gastwirth
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A recent article proposed a histogram-based method for estimating the Lorenz curve and Gini index from grouped data that did not use the group means reported by government agencies. When comparing their method to one based on group means, the authors assume a uniform density in each grouping interval, which leads to an overestimate of the overall average income. After reviewing the additional information in the group means, it will be shown that as the number of groups increases, the bounds on the Gini index obtained from the group means become narrower. This is not necessarily true for the histogram method. Two simple interpolation methods using the group means are described and the accuracy of the estimated Gini index they yield and the histogram-based one are compared to the published Gini index for the 1967–2013 period. The average absolute errors of the estimated Gini index obtained from the two methods using group means are noticeably less than that of the histogram-based method. Supplementary materials for this article are available online. [Received August 2014. Revised September 2015.]
Clustering of samples and variables with mixed-type data
plos.figshare.com
tiff
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Manuela Hummel; Dominic Edelmann; Annette Kopp-Schneider (2023). Clustering of samples and variables with mixed-type data [Dataset]. http://doi.org/10.1371/journal.pone.0188274
Explore at:
tiffAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0188274
Dataset updated
Jun 1, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Manuela Hummel; Dominic Edelmann; Annette Kopp-Schneider
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of data measured on different scales is a relevant challenge. Biomedical studies often focus on high-throughput datasets of, e.g., quantitative measurements. However, the need for integration of other features possibly measured on different scales, e.g. clinical or cytogenetic factors, becomes increasingly important. The analysis results (e.g. a selection of relevant genes) are then visualized, while adding further information, like clinical factors, on top. However, a more integrative approach is desirable, where all available data are analyzed jointly, and where also in the visualization different data sources are combined in a more natural way. Here we specifically target integrative visualization and present a heatmap-style graphic display. To this end, we develop and explore methods for clustering mixed-type data, with special focus on clustering variables. Clustering of variables does not receive as much attention in the literature as does clustering of samples. We extend the variables clustering methodology by two new approaches, one based on the combination of different association measures and the other on distance correlation. With simulation studies we evaluate and compare different clustering strategies. Applying specific methods for mixed-type data proves to be comparable and in many cases beneficial as compared to standard approaches applied to corresponding quantitative or binarized data. Our two novel approaches for mixed-type variables show similar or better performance than the existing methods ClustOfVar and bias-corrected mutual information. Further, in contrast to ClustOfVar, our methods provide dissimilarity matrices, which is an advantage, especially for the purpose of visualization. Real data examples aim to give an impression of various kinds of potential applications for the integrative heatmap and other graphical displays based on dissimilarity matrices. We demonstrate that the presented integrative heatmap provides more information than common data displays about the relationship among variables and samples. The described clustering and visualization methods are implemented in our R package CluMix available from https://cran.r-project.org/web/packages/CluMix.
Benchmarks datasets for cluster analysis
kaggle.com
zip
Updated Nov 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Onthada Preedasawakul (2023). Benchmarks datasets for cluster analysis [Dataset]. https://www.kaggle.com/datasets/onthada/benchmarks-datasets-for-clustering
Explore at:
zip(608532 bytes)Available download formats
Dataset updated
Nov 15, 2023
Authors
Onthada Preedasawakul
License
Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Description
25 Artificial Datasets

The datasets are generated using either Gaussian or Uniform distributions. Each dataset contains several known sub-groups intended for testing centroid-based clustering results and cluster validity indices.

Cluster analysis is a popular machine learning used for segmenting datasets with similar data points in the same group. For those who are familiar with R, there is a new R package called "UniversalCVI" https://CRAN.R-project.org/package=UniversalCVI used for cluster evaluation. This package provides algorithms for checking the accuracy of a clustering result with known classes, computing cluster validity indices, and generating plots for comparing them. The package is compatible with K-means, fuzzy C means, EM clustering, and hierarchical clustering (single, average, and complete linkage). To use the "UniversalCVI" package, one can follow the instructions provided in the R documentation.

For more in-depth details of the package and cluster evaluation, please see the papers https://doi.org/10.1016/j.patcog.2023.109910 and https://arxiv.org/abs/2308.14785

All the datasets are also available on GitHub at

https://github.com/O-PREEDASAWAKUL/FuzzyDatasets.git .

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F17645646%2Fa2f87fbad212a908718535589681a703%2Frealplot.jpeg?generation=1700111724010268&alt=media" alt="">
e
Data on students' group project preferences
datarepository.eur.nl
dataverse.nl
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tim M. Benning (2023). Data on students' group project preferences [Dataset]. http://doi.org/10.25397/eur.20342649.v1
Explore at:
Unique identifier
https://doi.org/10.25397/eur.20342649.v1
Dataset updated
May 30, 2023
Dataset provided by
Erasmus University Rotterdam (EUR)
Authors
Tim M. Benning
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The data files contain information about the preferences of bachelor 1 and 2 students obtained via a discrete choice experiment (12 choice tasks per respondent), demographic characteristics of the sample and population, experiences with free-riding, attitude towards teamwork, and a measure of individualism/collectivism. Students were presented a different grade weight before each choice task (i.e., 10%, 30%, or 100%). The data was collected from mid-June to mid-July 2021.

Access to the data is subject to the approval of a data sharing agreement due to the personal information contained in the dataset.

A summary of the publication can be found below: Reducing free-riding is an important challenge for educators who use group projects. In this study, we measure students’ preferences for group project characteristics and investigate if characteristics that better help to reduce free-riding become more important for students when stakes increase. We used a discrete choice experiment based on twelve choice tasks in which students chose between two group projects that differed on five characteristics of which each level had its own effect on free-riding. A different group project grade weight was presented before each choice task to manipulate how much there was at stake for students in the group project. Data of 257 student respondents were used in the analysis. Based on random parameter logit model estimates we find that students prefer (in order of importance) assignment based on schedule availability and motivation or self-selection (instead of random assignment), the use of one or two peer process evaluations (instead of zero), a small team size of three or two students (instead of four), a common grade (instead of a divided grade), and a discussion with the course coordinator without a sanction as a method to handle free-riding (instead of member expulsion). Furthermore, we find that the characteristic team formation approach becomes even more important (especially self-selection) when student stakes increase. Educators can use our findings to design group projects that better help to reduce free-riding by (1) avoiding random assignment as team formation approach, (2) using (one or two) peer process evaluations, and (3) creating small(er) teams.
4
Supplementary data for the publication: A Grouping Method for Optimization...
data.4tu.nl
zip
Updated Feb 1, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tom van Woudenberg; Frans van der Meer (2021). Supplementary data for the publication: A Grouping Method for Optimization of Steel Skeletal Structures by Applying a Combinatorial Search Algorithm Based on a Fully Stressed Design [Dataset]. http://doi.org/10.4121/12718790.v3
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.4121/12718790.v3
Dataset updated
Feb 1, 2021
Dataset provided by
4TU.Centre for Research Data
Authors
Tom van Woudenberg; Frans van der Meer
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset belongs to the publication: A Grouping Method for Optimization of Steel Skeletal Structures by Applying a Combinatorial Search Algorithm Based on a Fully Stressed Design. It contains all input data of the eight benchmarks problems used and the results of the numerical experiments.
d
Data from: The influence of a priori grouping on inference of genetic...
datadryad.org
search.dataone.org
+1more
zip
Updated Aug 21, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Joshua Miller; Catherine Cullingham; Rhiannon Peery (2020). The influence of a priori grouping on inference of genetic clusters: simulation study and literature review of the DAPC method [Dataset]. http://doi.org/10.5061/dryad.4tmpg4f76
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.4tmpg4f76
Dataset updated
Aug 21, 2020
Dataset provided by
Dryad
Authors
Joshua Miller; Catherine Cullingham; Rhiannon Peery
Time period covered
Jul 19, 2020
Description
Detailed description of files are given in Miller_et_al_Dryad_Read_Me.txt.
Credit Card Customer Data
kaggle.com
zip
Updated May 15, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Arya Shah (2021). Credit Card Customer Data [Dataset]. https://www.kaggle.com/datasets/aryashah2k/credit-card-customer-data/discussion
Explore at:
zip(6431 bytes)Available download formats
Dataset updated
May 15, 2021
Authors
Arya Shah
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

A Customer Credit Card Information Dataset which can be used for Identifying Loyal Customers, Customer Segmentation, Targeted Marketing and other such use cases in the Marketing Industry.

A few tasks that can be performed using this dataset is as follows: - Perform Data-Cleaning,Preprocessing,Visualizing and Feature Engineering on the Dataset. - Implement Heirarchical Clustering, K-Means Clustering models. - Create RFM (Recency,Frequency,Monetary) Matrix to identify Loyal Customers.

Content

The Attributes Include: - Sl_No - Customer Key - AvgCreditLimit - TotalCreditCards - Totalvisitsbank - Totalvisitsonline - Totalcallsmade
f
Performance comparison of different model based clustering methods on wine...
datasetcatalog.nlm.nih.gov
plos.figshare.com
+1more
Updated May 18, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Pal, Samyajoy; Heumann, Christian (2022). Performance comparison of different model based clustering methods on wine data. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000322074
Explore at:
Dataset updated
May 18, 2022
Authors
Pal, Samyajoy; Heumann, Christian
Description
Performance comparison of different model based clustering methods on wine data.
f
Data from: U-Statistical Inference for Hierarchical Clustering
tandf.figshare.com
zip
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Marcio Valk; Gabriela Bettella Cybis (2023). U-Statistical Inference for Hierarchical Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.12844523.v3
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.12844523.v3
Dataset updated
Jun 1, 2023
Dataset provided by
Taylor & Francis
Authors
Marcio Valk; Gabriela Bettella Cybis
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Clustering methods are valuable tools for the identification of patterns in high-dimensional data with applications in many scientific fields. However, quantifying uncertainty in clustering is a challenging problem, particularly when dealing with high-dimension low sample size (HDLSS) data. We develop a U-statistics based clustering approach that assesses statistical significance in clustering and is specifically tailored to HDLSS scenarios. These nonparametric methods rely on very few assumptions about the data, and thus can be applied to a wide range of datasets for which the Euclidean distance captures relevant features. Our main result is the development of a hierarchical significance clustering method. To do so, we first introduce an extension of a relevant U-statistic and develop its asymptotic theory. Additionally, as a preliminary step, we propose a binary nonnested significance clustering method and show its optimality in terms of expected values. Our approach is tested through multiple simulations and found to have more statistical power than competing alternatives in all scenarios considered. Our methods are further showcased in three applications ranging from genetics to image recognition problems. Code for these methods is available in R-package uclust. Supplementary materials for this article are available online.

Facebook

Twitter

Click to copy link

Link copied

Cite

The Devastator (2023). Galaxy clustering [Dataset]. https://www.kaggle.com/datasets/thedevastator/clustering-polygons-utilizing-iris-moon-and-circ

Data from: Galaxy clustering

Iris, Moon, and Circles datasets for Galaxy clustering tutorial

Explore at:

zip(6339 bytes)Available download formats

Dataset updated

Jan 3, 2023

Authors

The Devastator

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

Galaxy clustering

Iris, Moon, and Circles datasets for Galaxy clustering tutorial

By [source]

About this dataset

This dataset contains a wealth of information that can be used to explore the effectiveness of various clustering algorithms. With its inclusion of numerical measurements (X, Y, Sepal.Length, and Petal.Length) and categorical values (Species), it is possible to investigate the relationship between different types of variables and clustering performance. Additionally, by comparing results for the 3 datasets provided - moon.csv (which contains x and y coordinates), iris.csv (which contains measurements for sepal and petal lengths),and circles.csv - we can gain insights into how different data distributions affect clustering techniques such as K-Means or Hierarchical Clustering among others!

More Datasets

For more datasets, click here.

Featured Notebooks

🚨 Your notebook can be here! 🚨!

How to use the dataset

This dataset can also be a great starting point to further explore more complex clusters by using higher dimensional space variables such as color or texture that may be present in other datasets not included here but which can help to form more accurate groups when using cluster-analysis algorithms. Additionally, it could also assist in visualization projects where clusters may need to be generated such as plotting mapped data points or examining relationships between two different variables within a certain region drawn on a chart.

To use this dataset effectively it is important to understand how exactly your chosen algorithm works since some require specifying parameters beforehand while others take care of those details automatically; otherwise the interpretation may be invalid depending on the methods used alongside clustering you intend for your project. Furthermore, familiarize yourself with concepts like silhouette score and rand index - these are commonly used metrics that measure your cluster’s performance against other clusterings models so you know if what you have done so far satisfies an acceptable level of accuracy or not yet! Good luck!

Research Ideas

Utilizing the sepal and petal lengths and widths to perform flower recognition or part of a larger image recognition pipeline.

Classifying the data points in each dataset by the X-Y coordinates using clustering algorithms to analyze galaxy locations or overall formation patterns for stars, planets, or galaxies.

Exploring correlations between species of flowers in terms of sepal/petal lengths by performing supervised learning tasks such as classification with this dataset

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: moon.csv | Column name | Description | |:--------------|:------------------------------------------| | X | X coordinate of the data point. (Numeric) | | Y | Y coordinate of the data point. (Numeric) |

File: iris.csv | Column name | Description | |:-----------------|:---------------------------------------------| | Sepal.Length | Length of the sepal of the flower. (Numeric) | | Petal.Length | Length of the petal of the flower. (Numeric) | | Species | Species of the flower. (Categorical) |

Acknowledgements

If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit .

Clear search

Close search

Google apps

Main menu

Data from: Galaxy clustering

Galaxy clustering

Iris, Moon, and Circles datasets for Galaxy clustering tutorial

About this dataset

More Datasets

Featured Notebooks

How to use the dataset

Research Ideas

Acknowledgements

License

Columns

Acknowledgements

Data from: EEG signature of grouping strategies in numerosity perception

Data from: Grouping strategies in number estimation extend the subitizing...

Replication data for: “Role grouping experiments: A new method for studying...

MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE...

Educational Attainment in North Carolina Public Schools: Use of statistical...

VK group classificaton

Context

Data from: Data Nuggets: A Method for Reducing Big Data While Preserving...

Data from: A clustering method for repeat analysis in DNA sequences

Data_Sheet_1_Improved space breakdown method – A robust clustering technique...

Data from: A virtual multi-label approach to imbalanced data classification

Data from: The Advantages of Using Group Means in Estimating the Lorenz...

Clustering of samples and variables with mixed-type data

Benchmarks datasets for cluster analysis

25 Artificial Datasets

All the datasets are also available on GitHub at

Data on students' group project preferences

Supplementary data for the publication: A Grouping Method for Optimization...

Data from: The influence of a priori grouping on inference of genetic...

Credit Card Customer Data

Context

Content

Performance comparison of different model based clustering methods on wine...

Data from: U-Statistical Inference for Hierarchical Clustering

Data from: Galaxy clustering

Iris, Moon, and Circles datasets for Galaxy clustering tutorial

Galaxy clustering

Iris, Moon, and Circles datasets for Galaxy clustering tutorial

About this dataset

More Datasets

Featured Notebooks

How to use the dataset

Research Ideas

Acknowledgements

License

Columns

Acknowledgements