License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
In scientific data analysis, clusters identified computationally often substantiate existing hypotheses or motivate new ones. Yet the combinatorial nature of the clustering result, which is a partition rather than a set of parameters or a function, blurs notions of mean and variance. This intrinsic difficulty hinders the development of methods to improve clustering by aggregation or to assess the uncertainty of clusters generated. We overcome that barrier by aligning clusters via optimal transport. Equipped with this technique, we propose a new algorithm to enhance clustering by any baseline method using bootstrap samples. Cluster alignment enables us to quantify variation in the clustering result at the levels of both overall partitions and individual clusters. Set relationships between clusters such as one-to-one match, split, and merge can be revealed. A covering point set for each cluster, a concept kin to the confidence interval, is proposed. The tools we have developed here will help address the crucial question of whether any cluster is an intrinsic or spurious pattern. Experimental results on both simulated and real datasets are provided.
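The alignment idea can be illustrated without optimal transport machinery: for two partitions with the same number of clusters, matching labels by maximal overlap is a small assignment problem. The sketch below is an illustrative stand-in, not the paper's algorithm; `align_clusters` is a hypothetical helper that brute-forces the assignment with `itertools.permutations`, which is feasible only for a handful of clusters.

```python
from itertools import permutations

def align_clusters(labels_a, labels_b):
    """Relabel partition B so its clusters best overlap those of A.
    Assumes both partitions have the same number of clusters; brute-force
    over label permutations, so only suitable for small cluster counts."""
    ks_a, ks_b = sorted(set(labels_a)), sorted(set(labels_b))
    # contingency counts: overlap[(i, j)] = points in cluster i of A and j of B
    overlap = {(i, j): sum(1 for a, b in zip(labels_a, labels_b) if a == i and b == j)
               for i in ks_a for j in ks_b}
    best, best_score = None, -1
    for perm in permutations(ks_b):
        score = sum(overlap[(i, j)] for i, j in zip(ks_a, perm))
        if score > best_score:
            best, best_score = dict(zip(perm, ks_a)), score
    return [best[b] for b in labels_b]

a = [0, 0, 1, 1, 2, 2]
b = [2, 2, 0, 0, 1, 1]  # same partition, different label names
print(align_clusters(a, b))  # -> [0, 0, 1, 1, 2, 2]
```

With labels aligned in this way, cluster-wise comparisons across bootstrap replicates (match, split, merge) become meaningful.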
License: Attribution-NonCommercial 3.0 (CC BY-NC 3.0) (https://creativecommons.org/licenses/by-nc/3.0/)
182 simulated datasets (the first set contains small datasets, the second large datasets) with different cluster compositions, i.e., different numbers of clusters and separation values, generated using the clusterGeneration package in R. Each set consists of 91 datasets in comma-separated values (CSV) format (182 CSV files in total) with 3-15 clusters and separation values from 0.1 to 0.7. Separation values can range over (−0.999, 0.999), where a higher separation value indicates a cluster structure with more separable clusters.
The size of the dataset, the number of clusters, and the separation value of the clusters are encoded in the file name, size_X_n_Y_sepval_Z.csv, where X is the size of the dataset, Y is the number of clusters, and Z is the separation value.
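Since the metadata lives in the file names, a small parser is handy when loading these files. A minimal sketch (the concrete example name below is hypothetical):

```python
import re

def parse_sim_filename(name):
    """Extract size, cluster count, and separation value from names of the
    form size_X_n_Y_sepval_Z.csv described above."""
    m = re.match(r"size_(\d+)_n_(\d+)_sepval_(-?[\d.]+)\.csv$", name)
    if m is None:
        raise ValueError(f"unrecognized filename: {name}")
    return {"size": int(m.group(1)),
            "n_clusters": int(m.group(2)),
            "sepval": float(m.group(3))}

print(parse_sim_filename("size_500_n_7_sepval_0.3.csv"))
# -> {'size': 500, 'n_clusters': 7, 'sepval': 0.3}
```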
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains a wealth of information that can be used to explore the effectiveness of various clustering algorithms. With its inclusion of numerical measurements (X, Y, Sepal.Length, and Petal.Length) and categorical values (Species), it is possible to investigate the relationship between different types of variables and clustering performance. Additionally, by comparing results for the three datasets provided - moon.csv (which contains x and y coordinates), iris.csv (which contains measurements for sepal and petal lengths), and circles.csv - we can gain insights into how different data distributions affect clustering techniques such as K-Means or Hierarchical Clustering, among others!
This dataset can also be a great starting point for exploring more complex clusters using higher-dimensional variables, such as color or texture, that may be present in other datasets not included here but which can help form more accurate groups with cluster-analysis algorithms. It could also assist in visualization projects where clusters need to be generated, such as plotting mapped data points or examining relationships between two variables within a region drawn on a chart.
To use this dataset effectively, it is important to understand how your chosen algorithm works: some require parameters to be specified beforehand, while others handle those details automatically; otherwise, the interpretation may be invalid depending on the methods you use alongside clustering. Furthermore, familiarize yourself with concepts like the silhouette score and the Rand index - commonly used metrics that measure your clustering's performance against other clustering models, so you know whether your results reach an acceptable level of accuracy. Good luck!
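To make the Rand index mentioned above concrete, here is a minimal pure-Python version (in practice scikit-learn's `adjusted_rand_score` is the usual choice; this unadjusted variant simply counts agreeing point pairs):

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which two clusterings agree
    (both put the pair together, or both keep it apart)."""
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in combinations(range(len(labels_true)), 2)
    )
    n_pairs = len(labels_true) * (len(labels_true) - 1) // 2
    return agree / n_pairs

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # identical partitions -> 1.0
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # poorly matching partitions
```

Note that label names do not matter, only which points are grouped together.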
- Utilizing the sepal and petal lengths and widths to perform flower recognition, or as part of a larger image recognition pipeline.
- Grouping the data points in each dataset by their X-Y coordinates using clustering algorithms, for example to analyze galaxy locations or overall formation patterns for stars, planets, or galaxies.
- Exploring correlations between species of flowers in terms of sepal/petal lengths by performing supervised learning tasks such as classification with this dataset.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.
File: moon.csv

| Column name | Description |
|:------------|:------------------------------------------|
| X           | X coordinate of the data point. (Numeric) |
| Y           | Y coordinate of the data point. (Numeric) |

File: iris.csv

| Column name  | Description |
|:-------------|:---------------------------------------------|
| Sepal.Length | Length of the sepal of the flower. (Numeric) |
| Petal.Length | Length of the petal of the flower. (Numeric) |
| Species      | Species of the flower. (Categorical)         |
If you use this dataset in your research, please credit the original authors.
Data used for the Meta-Analysis of 46 Career Pathways Impact and data from four large nationally representative longitudinal surveys, as well as licensed data on occupational transitions from online career profiles, to examine workers’ career paths and wages for the Career Trajectories and Occupational Transitions Study.
License: Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
The datasets are generated using either Gaussian or Uniform distributions. Each dataset contains several known sub-groups intended for testing centroid-based clustering results and cluster validity indices.
Cluster analysis is a popular machine learning technique for segmenting a dataset so that similar data points fall into the same group. For those familiar with R, the package "UniversalCVI" (https://CRAN.R-project.org/package=UniversalCVI) can be used for cluster evaluation. It provides algorithms for checking the accuracy of a clustering result against known classes, computing cluster validity indices, and generating plots for comparing them. The package is compatible with K-means, fuzzy C-means, EM clustering, and hierarchical clustering (single, average, and complete linkage). To use the "UniversalCVI" package, follow the instructions provided in the R documentation.
For more in-depth details of the package and of cluster evaluation, please see the papers https://doi.org/10.1016/j.patcog.2023.109910 and https://arxiv.org/abs/2308.14785.
https://github.com/O-PREEDASAWAKUL/FuzzyDatasets.git
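As a rough illustration of what a cluster validity index measures, here is a minimal pure-Python Dunn index (one index among the many such packages implement; this is not code from UniversalCVI). Higher values suggest tighter, better-separated clusters.

```python
from math import dist  # Euclidean distance, Python 3.8+

def dunn_index(points, labels):
    """Dunn index: smallest between-cluster distance divided by the
    largest within-cluster diameter."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    groups = list(clusters.values())
    # minimum distance between points belonging to different clusters
    inter = min(dist(p, q)
                for i in range(len(groups)) for j in range(i + 1, len(groups))
                for p in groups[i] for q in groups[j])
    # maximum distance between points of the same cluster (diameter)
    intra = max(dist(p, q) for g in groups for p in g for q in g)
    return inter / intra

pts = [(0, 0), (0, 1), (10, 0), (10, 1)]  # two well-separated pairs
print(dunn_index(pts, [0, 0, 1, 1]))  # -> 10.0
```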
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Abstract
The dataset was collected from 332 compute nodes throughout May 19 - 23, 2023. May 19 - 22 characterizes normal compute cluster behavior, while May 23 includes an anomalous event. The dataset includes 8 CPU, 11 disk, 47 memory, and 22 Slurm metrics. It represents five distinct hardware configurations and contains over one million records, totaling more than 180 GB of raw data.
Background
Motivated by the goal of developing a digital twin of a compute cluster, the dataset was collected using a Prometheus server (1) scraping the Thomas Jefferson National Accelerator Facility (JLab) batch cluster, which runs an assortment of physics analysis and simulation jobs: analysis workloads leverage data generated by the laboratory's electron accelerator, while simulation workloads generate large amounts of flat data that is then carved to verify amplitudes. Metrics were scraped from the cluster throughout May 19 - 23, 2023. Data from May 19 to May 22 primarily reflected normal system behavior, while May 23, 2023, recorded a notable anomaly, severe enough to necessitate intervention by JLab IT Operations staff.
The metrics were collected from CPU, disk, memory, and Slurm. The CPU, disk, and memory metrics provide insight into the status of individual compute nodes, while the Slurm metrics collected from the network can expose anomalies that may propagate to compute nodes executing the same job.
Usage Notes
While the data from May 19 - 22 characterize normal compute cluster behavior and May 23 includes anomalous observations, the dataset cannot be considered labeled data: the exact set of affected nodes, and the start and end times at which those nodes exhibit abnormal effects, are unclear. The dataset could therefore be used to develop unsupervised machine-learning algorithms to detect anomalous events in a batch cluster.
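As a trivial unsupervised baseline on a single metric stream, one might start with z-score thresholding before moving to richer models (the metric values below are made up, not drawn from the dataset):

```python
from statistics import mean, stdev

def flag_anomalies(series, threshold=3.0):
    """Flag indices whose value deviates from the series mean by more than
    `threshold` standard deviations -- a simple unsupervised baseline."""
    mu, sigma = mean(series), stdev(series)
    return [i for i, v in enumerate(series) if abs(v - mu) > threshold * sigma]

# hypothetical normalized CPU load samples with one spike
cpu_load = [0.30, 0.32, 0.31, 0.29, 0.30, 0.31, 0.95, 0.30, 0.28]
print(flag_anomalies(cpu_load, threshold=2.0))  # -> [6]
```

Real detectors for this dataset would need to handle multivariate metrics, per-node baselines, and temporal correlation; this only illustrates the unsupervised framing.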
The Catalog of Open Cluster Data (COCD) is a result of studies of the wide neighborhoods of 513 open clusters and 7 compact associations carried out in the high-precision homogeneous All-Sky Compiled Catalog of 2.5 Million Stars (ASCC-2.5, Kharchenko 2001, CDS Cat. ). On the basis of data on about 33,000 possible members (including about 10,000 most probable ones) and homogeneous methods of cluster parameter determination, the angular sizes of cluster cores and coronae, cluster heliocentric distances, mean proper motions, mean radial velocities and ages were established and collected in the COCD. These include cluster distances for 200 clusters, average cluster radial velocities for 94 clusters, and cluster ages for 195 clusters derived for the first time. Clusters in the catalogue are sequenced in their Right Ascension J2000 order. The Open Cluster Diagrams Atlas (OCDA) presents a set of open cluster diagrams used for the determination of parameters of the 513 open clusters and 7 compact associations, and is intended to illustrate the quality of the constructed cluster membership (Kharchenko et al. 2004, CDS Cat.
The Regional Innovation Clusters serve a diverse group of sectors and geographies. Three of the initial pilot clusters, termed Advanced Defense Technology clusters, are specifically focused on meeting the needs of the defense industry. The Wood Products Cluster, which debuted in 2015, supports the White House’s Partnerships for Opportunity and Workforce and Economic Revitalization (POWER) Initiative for coal communities. All of the clusters support small businesses by fostering a synergistic network of small and large businesses, university researchers, regional economic organizations, stakeholders, and investors, while providing matchmaking, business training, counseling, mentoring, and other services to help small businesses expand and grow.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
ASSESSMENT OF THE ECONOMIC EFFICIENCY OF FEDERAL FMCG RETAIL CHAINS IN RUSSIA: A CLUSTER APPROACH
This dataset was created by DAT DO
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Six trajectory clustering datasets (Morris and Trivedi, 2009) are provided for benchmarking trajectory clustering algorithms. These datasets cover a diverse range of scenes to enable thorough evaluation of different algorithms: three simulated highway scenes, a real highway scene, a simulated intersection scene, and an indoor omnidirectional camera scene. The datasets originate from the work by Morris and Trivedi (2009), described in their paper "Learning trajectory patterns by clustering: Experimental studies and comparative evaluation," presented at the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. They provide a valuable resource for the trajectory clustering community, enabling comparative evaluation and benchmarking of various clustering algorithms, and can be accessed through the following DOI link: https://doi.org/10.1109/CVPR.2009.5206559.
About Dataset
- Based on patient symptoms: identify patients needing immediate resuscitation, assign patients to a predesignated patient care area, thereby prioritizing their care, and initiate diagnostic/therapeutic measures as appropriate.
- Three individual datasets are used, one for each urgent illness/injury. Each dataset has its own features and symptoms per patient; we merged them to determine the most severe symptoms for each illness and give them priority of treatment.
PROJECT SUMMARY: Triage refers to the sorting of injured or sick people according to their need for emergency medical attention; it is a method of determining who gets care first.
BACKGROUND: Triage is the prioritization of patient care (or of victims during a disaster) based on illness/injury, symptoms, severity, prognosis, and resource availability. Its purpose is to identify patients needing immediate resuscitation, to assign patients to a predesignated patient care area, thereby prioritizing their care, and to initiate diagnostic/therapeutic measures as appropriate.
BUSINESS CHALLENGE: Based on patient symptoms, identify patients needing immediate resuscitation, assign patients to a predesignated patient care area, thereby prioritizing their care, and initiate diagnostic/therapeutic measures as appropriate.
This table contains a list of 130 Galactic open clusters, found in the All-Sky Compiled Catalogue of 2.5 Million Stars (ASCC-2.5) and not included in the original Catalog of Open Cluster Data (COCD): it is known as the 1st Extension of the COCD (COCD-1). For these new clusters, the authors determined a homogeneous set of astrophysical parameters such as size, membership, motion, distance and age. In their previous work (the Browse table COCD based on the CDS Cat. J/A+A/438/1163), 520 already-known open clusters out of a sample of 1700 clusters from the literature were confirmed in the ASCC-2.5 using independent, objective methods. Using these same methods, the whole sky was systematically screened for new clusters. The newly detected clusters show the same distribution over the sky as the known ones. It is found that, without a priori knowledge of existing clusters, the authors' search led to clusters which are, on average, brighter, have more members and cover larger angular radii than the 520 previously-known ones. On the basis of data on about 6,200 possible members (including about 2,200 most probable ones) and homogeneous methods of cluster parameter determination, the angular sizes of cluster cores and coronae, cluster heliocentric distances, colour-excesses, mean proper motions, and ages of 130 clusters and mean radial velocities of 69 clusters were established and collected in the COCD-1. Clusters in the catalogue are numbered in order of increasing J2000.0 Right Ascension. The 1st Extension of the Open Cluster Diagrams Atlas (OCDA-1) presents a set of open cluster diagrams used for the determination of parameters of the 130 newly discovered open clusters, and is intended to illustrate the quality of the constructed cluster membership, and the accuracy of the derived cluster parameters. Every diagram presents relations between various stellar data from the all-sky catalog ASCC-2.5 (Kharchenko, 2001, CDS Cat. ) in the area of the specific cluster. There are five diagrams provided for every cluster in the Atlas: the area map, the density profile, the vector point diagram, the "magnitude equation" (proper motion in each coordinate versus V magnitude) diagram, and the color-magnitude diagram. The 130 OCDA-1 PostScript plots (one file per cluster) are available as a remote data product for all of the entries in this table. This table was created by the HEASARC in May 2011 based on CDS Catalog J/A+A/440/403 files cluster.dat and notes.dat. This is a service provided by NASA HEASARC.
Background: Microarray technologies are emerging as a promising tool for genomic studies. The challenge now is how to analyze the resulting large amounts of data. Clustering techniques have been widely applied in analyzing microarray gene-expression data. However, normal mixture model-based cluster analysis has not been widely used for such data, although it has a solid probabilistic foundation. Here, we introduce and illustrate its use in detecting differentially expressed genes. In particular, we do not cluster gene-expression patterns but a summary statistic, the t-statistic.
Results: The method is applied to a data set containing expression levels of 1,176 genes of rats with and without pneumococcal middle-ear infection. Three clusters were found, two of which contain more than 95% of the genes, with almost no altered gene-expression levels, whereas the third has 30 genes with more or less differential gene-expression levels.
Conclusions: Our results indicate that model-based clustering of t-statistics (and possibly of other summary statistics) can be a useful statistical tool to exploit differential gene expression in microarray data.
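The core idea, fitting a mixture of normals to t-statistics so that a component with near-zero mean captures unchanged genes, can be sketched with a bare-bones EM for a two-component 1-D Gaussian mixture. This is illustrative only: the t-statistic values below are made up, and the paper's model may differ in detail.

```python
import math

def em_two_gaussians(x, n_iter=200):
    """Bare-bones EM for a two-component 1-D Gaussian mixture, sketching
    model-based clustering of summary statistics such as t-statistics."""
    mu = [min(x), max(x)]            # crude but effective initialization
    sigma = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point
        resp = []
        for xi in x:
            w = [pi[k] / (sigma[k] * math.sqrt(2 * math.pi))
                 * math.exp(-0.5 * ((xi - mu[k]) / sigma[k]) ** 2)
                 for k in (0, 1)]
            s = sum(w)
            resp.append([wk / s for wk in w])
        # M-step: re-estimate weights, means, and standard deviations
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * xi for r, xi in zip(resp, x)) / nk
            var = sum(r[k] * (xi - mu[k]) ** 2 for r, xi in zip(resp, x)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)   # guard against collapse
            pi[k] = nk / len(x)
    return mu, sigma, pi

# made-up t-statistics: most near 0 (unchanged), a few large (differential)
ts = [-0.2, 0.1, 0.0, -0.1, 0.3, -0.3, 0.2, 4.8, 5.1, 5.3]
mu, sigma, pi = em_two_gaussians(ts)
print(sorted(mu))  # one mean near 0, one near the differential group
```

Genes assigned high responsibility under the non-zero-mean component would be flagged as differentially expressed.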
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
This file is the dataset for a cluster analysis to cluster and characterize autonomous last mile concepts in a standardized and holistic manner.
The dataset describes autonomous last mile concepts using a newly developed holistic taxonomy.
The developed clusters make it possible to better categorize and compare individual autonomous last mile concepts. Furthermore, the taxonomy and the cluster analysis allow an evaluation of the interrelations between characteristics of autonomous last mile concepts, supporting the design of new concepts as well as the adaptation or selection of a concept for a specific use case.
Adult mouse visual cortex (RPKM values for 24,057 genes and 1,679 cells) with cluster information, taken from https://singlecell.broadinstitute.org/single_cell/study/SCP6/a-transcriptomic-taxonomy-of-adult-mouse-visual-cortex-visp#study-download
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
## Overview
Apple Flower Clusters is a dataset for instance segmentation tasks - it contains Flower Clusters annotations for 202 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Determining intrinsic number of clusters in a multidimensional dataset is a commonly encountered problem in exploratory data analysis. Unsupervised clustering algorithms often rely on specification of cluster number as an input parameter. However, this is typically not known a priori. Many methods have been proposed to estimate cluster number, including statistical and information-theoretic approaches such as the gap statistic, but these methods are not always reliable when applied to non-normally distributed datasets containing outliers or noise. In this study, I propose a novel method called hierarchical linkage regression, which uses regression to estimate the intrinsic number of clusters in a multidimensional dataset. The method operates on the hypothesis that the organization of data into clusters can be inferred from the hierarchy generated by partitioning the dataset, and therefore does not directly depend on the specific values of the data or their distribution, but on their relative ranking within the partitioned set. Moreover, the technique does not require empirical data to train on, but can use synthetic data generated from random distributions to fit regression coefficients. The trained hierarchical linkage regression model is able to infer cluster number in test datasets of varying complexity and differing distributions, for image, text and numeric data, using the same regression model without retraining. The method performs favourably against other cluster number estimation techniques, and is also robust to parameter changes, as demonstrated by sensitivity analysis. The apparent robustness and generalizability of hierarchical linkage regression make it a promising tool for unsupervised exploratory data analysis and discovery.
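The premise that cluster structure is recoverable from the merge hierarchy rather than from raw values can be illustrated with a much simpler heuristic: run single-linkage agglomeration and cut at the largest gap in merge heights. This is a naive O(n^3) sketch, not the hierarchical linkage regression method itself, and the example points are made up.

```python
from math import dist  # Euclidean distance, Python 3.8+

def estimate_k_by_linkage_gap(points):
    """Estimate cluster count by single-linkage agglomeration, cutting at
    the largest jump between consecutive merge distances."""
    clusters = [[p] for p in points]
    merge_heights = []
    while len(clusters) > 1:
        # find the closest pair of clusters (single linkage)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merge_heights.append(d)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    # the largest jump separates within-cluster merges from between-cluster ones
    jumps = [b - a for a, b in zip(merge_heights, merge_heights[1:])]
    cut = jumps.index(max(jumps))
    return len(points) - 1 - cut  # clusters remaining before the big merges

pts = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (9, 0), (9, 0.1)]
print(estimate_k_by_linkage_gap(pts))  # -> 3
```

The text's method goes much further, regressing on the whole ranked hierarchy rather than a single gap, which is what allows it to generalize across distributions without retraining.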
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Numbers indicate what fraction of the whole dataset is grouped in a given cluster (column “all”) and the ratio of sequences of a given phenotype to all sequences in the respective cluster.
This study tests the efficacy of an intervention, Safe Public Spaces (SPS), focused on improving the safety of public spaces in schools, such as hallways, cafeterias, and stairwells. Twenty-four schools with middle grades in a large urban area were recruited for participation, pair-matched, and then assigned to either treatment or control. The study comprises four components: an implementation evaluation, a cost study, an impact study, and a community crime study.
Community crime study: The community crime study used juvenile arrest data from the NYPD (New York Police Department), available at https://data.cityofnewyork.us/Public-Safety/NYPD-Arrests-Data-Historic-/8h9b-rp9u. The data include all juvenile arrests during the life of the intervention. The 12 matched schools were identified and geo-mapped using Quantum GIS (QGIS) 3.8 software. Block groups in the 2010 US Census in which the schools reside, and neighboring block groups, were mapped into micro-areas. This resulted in 12 experimental school blocks and 11 control blocks in which the schools reside (two of the control schools were in the same census block group). Additionally, neighboring blocks were geo-mapped into 70 experimental and 77 control adjacent block groups (see map). Finally, juvenile arrests were mapped into experimental and control areas. Using the ARIMA time-series method in the Stata 15 statistical software package, arrest data were analyzed to compare the change in juvenile arrests in the experimental and control sites.
Cost study: For the cost study, information from the implementing organization (Engaging Schools) was combined with data from phone conversations and follow-up communications with staff in school sites to populate a Resource Cost Model. The Resource Cost Model Excel file will be provided for archiving. This file contains details on the staff time and materials allocated to the intervention, as well as the NYC prices, in 2018 US dollars, associated with each element. Prices were gathered from multiple sources, including actual NYC DOE data on salaries for position types for which these data were available, and district salary schedules for the other staff types. Census data were used to calculate benefits.
Impact evaluation: The impact evaluation was conducted using data from the Research Alliance for New York City Schools, which builds and maintains a unique archive of longitudinal data about NYC schools to support ongoing research. Its agreement with the New York City Department of Education (NYC DOE) outlines the data it receives, the process used to obtain it, and the security measures that keep it safe.
Implementation study: The implementation study comprises the baseline survey and observation data. Interview transcripts are not archived.