License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
In scientific data analysis, clusters identified computationally often substantiate existing hypotheses or motivate new ones. Yet the combinatorial nature of the clustering result, which is a partition rather than a set of parameters or a function, blurs notions of mean and variance. This intrinsic difficulty hinders the development of methods to improve clustering by aggregation or to assess the uncertainty of clusters generated. We overcome that barrier by aligning clusters via optimal transport. Equipped with this technique, we propose a new algorithm to enhance clustering by any baseline method using bootstrap samples. Cluster alignment enables us to quantify variation in the clustering result at the levels of both overall partitions and individual clusters. Set relationships between clusters such as one-to-one match, split, and merge can be revealed. A covering point set for each cluster, a concept kin to the confidence interval, is proposed. The tools we have developed here will help address the crucial question of whether any cluster is an intrinsic or spurious pattern. Experimental results on both simulated and real datasets are provided.
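The alignment idea can be illustrated without optimal transport machinery: for two partitions with the same number of clusters, matching labels by maximal overlap is a small assignment problem. The sketch below is an illustrative stand-in, not the paper's algorithm; `align_clusters` is a hypothetical helper that brute-forces the assignment with `itertools.permutations`, which is feasible only for a handful of clusters.

```python
from itertools import permutations

def align_clusters(labels_a, labels_b):
    """Relabel partition B so its clusters best overlap those of A.
    Assumes both partitions have the same number of clusters; brute-force
    over label permutations, so only suitable for small cluster counts."""
    ks_a, ks_b = sorted(set(labels_a)), sorted(set(labels_b))
    # contingency counts: overlap[(i, j)] = points in cluster i of A and j of B
    overlap = {(i, j): sum(1 for a, b in zip(labels_a, labels_b) if a == i and b == j)
               for i in ks_a for j in ks_b}
    best, best_score = None, -1
    for perm in permutations(ks_b):
        score = sum(overlap[(i, j)] for i, j in zip(ks_a, perm))
        if score > best_score:
            best, best_score = dict(zip(perm, ks_a)), score
    return [best[b] for b in labels_b]

a = [0, 0, 1, 1, 2, 2]
b = [2, 2, 0, 0, 1, 1]  # same partition, different label names
print(align_clusters(a, b))  # -> [0, 0, 1, 1, 2, 2]
```

With labels aligned in this way, cluster-wise comparisons across bootstrap replicates (match, split, merge) become meaningful.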
License: Attribution-NonCommercial 3.0 (CC BY-NC 3.0) (https://creativecommons.org/licenses/by-nc/3.0/)
182 simulated datasets (the first set contains small datasets, the second large datasets) with different cluster compositions, i.e., different numbers of clusters and separation values, generated using the clusterGeneration package in R. Each set consists of 91 datasets in comma-separated values (CSV) format (182 CSV files in total) with 3-15 clusters and separation values from 0.1 to 0.7. Separation values can range over (−0.999, 0.999), where a higher separation value indicates a cluster structure with more separable clusters.
The size of the dataset, the number of clusters, and the separation value of the clusters are encoded in the file name, size_X_n_Y_sepval_Z.csv, where X is the size of the dataset, Y is the number of clusters, and Z is the separation value.
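Since the metadata lives in the file names, a small parser is handy when loading these files. A minimal sketch (the concrete example name below is hypothetical):

```python
import re

def parse_sim_filename(name):
    """Extract size, cluster count, and separation value from names of the
    form size_X_n_Y_sepval_Z.csv described above."""
    m = re.match(r"size_(\d+)_n_(\d+)_sepval_(-?[\d.]+)\.csv$", name)
    if m is None:
        raise ValueError(f"unrecognized filename: {name}")
    return {"size": int(m.group(1)),
            "n_clusters": int(m.group(2)),
            "sepval": float(m.group(3))}

print(parse_sim_filename("size_500_n_7_sepval_0.3.csv"))
# -> {'size': 500, 'n_clusters': 7, 'sepval': 0.3}
```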
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains a wealth of information that can be used to explore the effectiveness of various clustering algorithms. With its inclusion of numerical measurements (X, Y, Sepal.Length, and Petal.Length) and categorical values (Species), it is possible to investigate the relationship between different types of variables and clustering performance. Additionally, by comparing results for the three datasets provided - moon.csv (which contains x and y coordinates), iris.csv (which contains measurements for sepal and petal lengths), and circles.csv - we can gain insights into how different data distributions affect clustering techniques such as K-Means or Hierarchical Clustering, among others!
This dataset can also be a great starting point for exploring more complex clusters using higher-dimensional variables, such as color or texture, that may be present in other datasets not included here but which can help form more accurate groups with cluster-analysis algorithms. It could also assist in visualization projects where clusters need to be generated, such as plotting mapped data points or examining relationships between two variables within a region drawn on a chart.
To use this dataset effectively, it is important to understand how your chosen algorithm works: some require parameters to be specified beforehand, while others handle those details automatically; otherwise, the interpretation may be invalid depending on the methods you use alongside clustering. Furthermore, familiarize yourself with concepts like the silhouette score and the Rand index - commonly used metrics that measure your clustering's performance against other clustering models, so you know whether your results reach an acceptable level of accuracy. Good luck!
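To make the Rand index mentioned above concrete, here is a minimal pure-Python version (in practice scikit-learn's `adjusted_rand_score` is the usual choice; this unadjusted variant simply counts agreeing point pairs):

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which two clusterings agree
    (both put the pair together, or both keep it apart)."""
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in combinations(range(len(labels_true)), 2)
    )
    n_pairs = len(labels_true) * (len(labels_true) - 1) // 2
    return agree / n_pairs

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # identical partitions -> 1.0
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # poorly matching partitions
```

Note that label names do not matter, only which points are grouped together.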
- Utilizing the sepal and petal lengths and widths to perform flower recognition, or as part of a larger image recognition pipeline.
- Grouping the data points in each dataset by their X-Y coordinates using clustering algorithms, for example to analyze galaxy locations or overall formation patterns for stars, planets, or galaxies.
- Exploring correlations between species of flowers in terms of sepal/petal lengths by performing supervised learning tasks such as classification with this dataset.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.
File: moon.csv

| Column name | Description |
|:------------|:------------------------------------------|
| X           | X coordinate of the data point. (Numeric) |
| Y           | Y coordinate of the data point. (Numeric) |

File: iris.csv

| Column name  | Description |
|:-------------|:---------------------------------------------|
| Sepal.Length | Length of the sepal of the flower. (Numeric) |
| Petal.Length | Length of the petal of the flower. (Numeric) |
| Species      | Species of the flower. (Categorical)         |
If you use this dataset in your research, please credit the original authors.
Data used for the Meta-Analysis of 46 Career Pathways Impact and data from four large nationally representative longitudinal surveys, as well as licensed data on occupational transitions from online career profiles, to examine workers’ career paths and wages for the Career Trajectories and Occupational Transitions Study.
License: Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
The datasets are generated using either Gaussian or Uniform distributions. Each dataset contains several known sub-groups intended for testing centroid-based clustering results and cluster validity indices.
Cluster analysis is a popular machine learning technique for segmenting a dataset so that similar data points fall into the same group. For those familiar with R, the package "UniversalCVI" (https://CRAN.R-project.org/package=UniversalCVI) can be used for cluster evaluation. It provides algorithms for checking the accuracy of a clustering result against known classes, computing cluster validity indices, and generating plots for comparing them. The package is compatible with K-means, fuzzy C-means, EM clustering, and hierarchical clustering (single, average, and complete linkage). To use the "UniversalCVI" package, follow the instructions provided in the R documentation.
For more in-depth details of the package and of cluster evaluation, please see the papers https://doi.org/10.1016/j.patcog.2023.109910 and https://arxiv.org/abs/2308.14785.
https://github.com/O-PREEDASAWAKUL/FuzzyDatasets.git
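As a rough illustration of what a cluster validity index measures, here is a minimal pure-Python Dunn index (one index among the many such packages implement; this is not code from UniversalCVI). Higher values suggest tighter, better-separated clusters.

```python
from math import dist  # Euclidean distance, Python 3.8+

def dunn_index(points, labels):
    """Dunn index: smallest between-cluster distance divided by the
    largest within-cluster diameter."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    groups = list(clusters.values())
    # minimum distance between points belonging to different clusters
    inter = min(dist(p, q)
                for i in range(len(groups)) for j in range(i + 1, len(groups))
                for p in groups[i] for q in groups[j])
    # maximum distance between points of the same cluster (diameter)
    intra = max(dist(p, q) for g in groups for p in g for q in g)
    return inter / intra

pts = [(0, 0), (0, 1), (10, 0), (10, 1)]  # two well-separated pairs
print(dunn_index(pts, [0, 0, 1, 1]))  # -> 10.0
```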
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Abstract
The dataset was collected from 332 compute nodes throughout May 19 - 23, 2023. May 19 - 22 characterizes normal compute cluster behavior, while May 23 includes an anomalous event. The dataset includes 8 CPU, 11 disk, 47 memory, and 22 Slurm metrics. It represents five distinct hardware configurations and contains over one million records, totaling more than 180 GB of raw data.
Background
Motivated by the goal of developing a digital twin of a compute cluster, the dataset was collected using a Prometheus server (1) scraping the Thomas Jefferson National Accelerator Facility (JLab) batch cluster, which runs an assortment of physics analysis and simulation jobs: analysis workloads leverage data generated by the laboratory's electron accelerator, while simulation workloads generate large amounts of flat data that is then carved to verify amplitudes. Metrics were scraped from the cluster throughout May 19 - 23, 2023. Data from May 19 to May 22 primarily reflected normal system behavior, while May 23, 2023, recorded a notable anomaly, severe enough to necessitate intervention by JLab IT Operations staff.
The metrics were collected from CPU, disk, memory, and Slurm. The CPU, disk, and memory metrics provide insight into the status of individual compute nodes, while the Slurm metrics collected from the network can expose anomalies that may propagate to compute nodes executing the same job.
Usage Notes
While the data from May 19 - 22 characterize normal compute cluster behavior and May 23 includes anomalous observations, the dataset cannot be considered labeled data: the exact set of affected nodes, and the start and end times at which those nodes exhibit abnormal effects, are unclear. The dataset could therefore be used to develop unsupervised machine-learning algorithms to detect anomalous events in a batch cluster.
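As a trivial unsupervised baseline on a single metric stream, one might start with z-score thresholding before moving to richer models (the metric values below are made up, not drawn from the dataset):

```python
from statistics import mean, stdev

def flag_anomalies(series, threshold=3.0):
    """Flag indices whose value deviates from the series mean by more than
    `threshold` standard deviations -- a simple unsupervised baseline."""
    mu, sigma = mean(series), stdev(series)
    return [i for i, v in enumerate(series) if abs(v - mu) > threshold * sigma]

# hypothetical normalized CPU load samples with one spike
cpu_load = [0.30, 0.32, 0.31, 0.29, 0.30, 0.31, 0.95, 0.30, 0.28]
print(flag_anomalies(cpu_load, threshold=2.0))  # -> [6]
```

Real detectors for this dataset would need to handle multivariate metrics, per-node baselines, and temporal correlation; this only illustrates the unsupervised framing.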
The Catalog of Open Cluster Data (COCD) is a result of studies of the wide neighborhoods of 513 open clusters and 7 compact associations carried out in the high-precision homogeneous All-Sky Compiled Catalog of 2.5 Million Stars (ASCC-2.5, Kharchenko 2001, CDS Cat. ). On the basis of data on about 33,000 possible members (including about 10,000 most probable ones) and homogeneous methods of cluster parameter determination, the angular sizes of cluster cores and coronae, cluster heliocentric distances, mean proper motions, mean radial velocities and ages were established and collected in the COCD. These include cluster distances for 200 clusters, average cluster radial velocities for 94 clusters, and cluster ages for 195 clusters derived for the first time. Clusters in the catalogue are sequenced in their Right Ascension J2000 order. The Open Cluster Diagrams Atlas (OCDA) presents a set of open cluster diagrams used for the determination of parameters of the 513 open clusters and 7 compact associations, and is intended to illustrate the quality of the constructed cluster membership (Kharchenko et al. 2004, CDS Cat.
The Regional Innovation Clusters serve a diverse group of sectors and geographies. Three of the initial pilot clusters, termed Advanced Defense Technology clusters, are specifically focused on meeting the needs of the defense industry. The Wood Products Cluster, which debuted in 2015, supports the White House’s Partnerships for Opportunity and Workforce and Economic Revitalization (POWER) Initiative for coal communities. All of the clusters support small businesses by fostering a synergistic network of small and large businesses, university researchers, regional economic organizations, stakeholders, and investors, while providing matchmaking, business training, counseling, mentoring, and other services to help small businesses expand and grow.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
ASSESSMENT OF THE ECONOMIC EFFICIENCY OF FEDERAL FMCG RETAIL CHAINS IN RUSSIA: A CLUSTER APPROACH
This dataset was created by DAT DO
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Six trajectory clustering datasets (Morris and Trivedi, 2009) are provided for benchmarking trajectory clustering algorithms. These datasets cover a diverse range of scenes to enable thorough evaluation of different algorithms: three simulated highway scenes, a real highway scene, a simulated intersection scene, and an indoor omnidirectional camera scene. The datasets originate from the work by Morris and Trivedi (2009), described in their paper "Learning trajectory patterns by clustering: Experimental studies and comparative evaluation," presented at the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. They provide a valuable resource for the trajectory clustering community, enabling comparative evaluation and benchmarking of various clustering algorithms, and can be accessed through the following DOI link: https://doi.org/10.1109/CVPR.2009.5206559.
About Dataset
- Based on patient symptoms: identify patients needing immediate resuscitation, assign patients to a predesignated patient care area, thereby prioritizing their care, and initiate diagnostic/therapeutic measures as appropriate.
- Three individual datasets are used, one for each urgent illness/injury. Each dataset has its own features and symptoms per patient; we merged them to determine the most severe symptoms for each illness and give them priority of treatment.
PROJECT SUMMARY: Triage refers to the sorting of injured or sick people according to their need for emergency medical attention; it is a method of determining who gets care first.
BACKGROUND: Triage is the prioritization of patient care (or of victims during a disaster) based on illness/injury, symptoms, severity, prognosis, and resource availability. Its purpose is to identify patients needing immediate resuscitation, to assign patients to a predesignated patient care area, thereby prioritizing their care, and to initiate diagnostic/therapeutic measures as appropriate.
BUSINESS CHALLENGE: Based on patient symptoms, identify patients needing immediate resuscitation, assign patients to a predesignated patient care area, thereby prioritizing their care, and initiate diagnostic/therapeutic measures as appropriate.
This table contains a list of 130 Galactic open clusters, found in the All-Sky Compiled Catalogue of 2.5 Million Stars (ASCC-2.5) and not included in the original Catalog of Open Cluster Data (COCD): it is known as the 1st Extension of the COCD (COCD-1). For these new clusters, the authors determined a homogeneous set of astrophysical parameters such as size, membership, motion, distance and age. In their previous work (the Browse table COCD based on the CDS Cat. J/A+A/438/1163), 520 already-known open clusters out of a sample of 1700 clusters from the literature were confirmed in the ASCC-2.5 using independent, objective methods. Using these same methods, the whole sky was systematically screened for new clusters. The newly detected clusters show the same distribution over the sky as the known ones. It is found that, without a priori knowledge of existing clusters, the authors' search led to clusters which are, on average, brighter, have more members and cover larger angular radii than the 520 previously-known ones. On the basis of data on about 6,200 possible members (including about 2,200 most probable ones) and homogeneous methods of cluster parameter determination, the angular sizes of cluster cores and coronae, cluster heliocentric distances, colour-excesses, mean proper motions, and ages of 130 clusters and mean radial velocities of 69 clusters were established and collected in the COCD-1. Clusters in the catalogue are numbered in order of increasing J2000.0 Right Ascension. The 1st Extension of the Open Cluster Diagrams Atlas (OCDA-1) presents a set of open cluster diagrams used for the determination of parameters of the 130 newly discovered open clusters, and is intended to illustrate the quality of the constructed cluster membership, and the accuracy of the derived cluster parameters. Every diagram presents relations between various stellar data from the all-sky catalog ASCC-2.5 (Kharchenko, 2001, CDS Cat. ) in the area of the specific cluster. There are five diagrams provided for every cluster in the Atlas: the area map, the density profile, the vector point diagram, the "magnitude equation" (proper motion in each coordinate versus V magnitude) diagram, and the color-magnitude diagram. The 130 OCDA-1 PostScript plots (one file per cluster) are available as a remote data product for all of the entries in this table. This table was created by the HEASARC in May 2011 based on CDS Catalog J/A+A/440/403 files cluster.dat and notes.dat. This is a service provided by NASA HEASARC.
Background: Microarray technologies are emerging as a promising tool for genomic studies. The challenge now is how to analyze the resulting large amounts of data. Clustering techniques have been widely applied in analyzing microarray gene-expression data. However, normal mixture model-based cluster analysis has not been widely used for such data, although it has a solid probabilistic foundation. Here, we introduce and illustrate its use in detecting differentially expressed genes. In particular, we do not cluster gene-expression patterns but a summary statistic, the t-statistic.
Results: The method is applied to a data set containing expression levels of 1,176 genes of rats with and without pneumococcal middle-ear infection. Three clusters were found, two of which contain more than 95% of the genes, with almost no altered gene-expression levels, whereas the third has 30 genes with more or less differential gene-expression levels.
Conclusions: Our results indicate that model-based clustering of t-statistics (and possibly of other summary statistics) can be a useful statistical tool to exploit differential gene expression in microarray data.
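The core idea, fitting a mixture of normals to t-statistics so that a component with near-zero mean captures unchanged genes, can be sketched with a bare-bones EM for a two-component 1-D Gaussian mixture. This is illustrative only: the t-statistic values below are made up, and the paper's model may differ in detail.

```python
import math

def em_two_gaussians(x, n_iter=200):
    """Bare-bones EM for a two-component 1-D Gaussian mixture, sketching
    model-based clustering of summary statistics such as t-statistics."""
    mu = [min(x), max(x)]            # crude but effective initialization
    sigma = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point
        resp = []
        for xi in x:
            w = [pi[k] / (sigma[k] * math.sqrt(2 * math.pi))
                 * math.exp(-0.5 * ((xi - mu[k]) / sigma[k]) ** 2)
                 for k in (0, 1)]
            s = sum(w)
            resp.append([wk / s for wk in w])
        # M-step: re-estimate weights, means, and standard deviations
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * xi for r, xi in zip(resp, x)) / nk
            var = sum(r[k] * (xi - mu[k]) ** 2 for r, xi in zip(resp, x)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)   # guard against collapse
            pi[k] = nk / len(x)
    return mu, sigma, pi

# made-up t-statistics: most near 0 (unchanged), a few large (differential)
ts = [-0.2, 0.1, 0.0, -0.1, 0.3, -0.3, 0.2, 4.8, 5.1, 5.3]
mu, sigma, pi = em_two_gaussians(ts)
print(sorted(mu))  # one mean near 0, one near the differential group
```

Genes assigned high responsibility under the non-zero-mean component would be flagged as differentially expressed.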
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
This file is the dataset for a cluster analysis to cluster and characterize autonomous last mile concepts in a standardized and holistic manner.
The dataset describes autonomous last mile concepts using a newly developed holistic taxonomy.
The developed clusters make it possible to better categorize and compare individual autonomous last mile concepts. Furthermore, the taxonomy and the cluster analysis allow an evaluation of the interrelations between characteristics of autonomous last mile concepts, supporting the design of new concepts as well as the adaptation or selection of a concept for a specific use case.
Adult mouse visual cortex (RPKM values for 24,057 genes and 1,679 cells) with cluster information, taken from https://singlecell.broadinstitute.org/single_cell/study/SCP6/a-transcriptomic-taxonomy-of-adult-mouse-visual-cortex-visp#study-download
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
## Overview
Apple Flower Clusters is a dataset for instance segmentation tasks - it contains Flower Clusters annotations for 202 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Determining intrinsic number of clusters in a multidimensional dataset is a commonly encountered problem in exploratory data analysis. Unsupervised clustering algorithms often rely on specification of cluster number as an input parameter. However, this is typically not known a priori. Many methods have been proposed to estimate cluster number, including statistical and information-theoretic approaches such as the gap statistic, but these methods are not always reliable when applied to non-normally distributed datasets containing outliers or noise. In this study, I propose a novel method called hierarchical linkage regression, which uses regression to estimate the intrinsic number of clusters in a multidimensional dataset. The method operates on the hypothesis that the organization of data into clusters can be inferred from the hierarchy generated by partitioning the dataset, and therefore does not directly depend on the specific values of the data or their distribution, but on their relative ranking within the partitioned set. Moreover, the technique does not require empirical data to train on, but can use synthetic data generated from random distributions to fit regression coefficients. The trained hierarchical linkage regression model is able to infer cluster number in test datasets of varying complexity and differing distributions, for image, text and numeric data, using the same regression model without retraining. The method performs favourably against other cluster number estimation techniques, and is also robust to parameter changes, as demonstrated by sensitivity analysis. The apparent robustness and generalizability of hierarchical linkage regression make it a promising tool for unsupervised exploratory data analysis and discovery.
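The premise that cluster structure is recoverable from the merge hierarchy rather than from raw values can be illustrated with a much simpler heuristic: run single-linkage agglomeration and cut at the largest gap in merge heights. This is a naive O(n^3) sketch, not the hierarchical linkage regression method itself, and the example points are made up.

```python
from math import dist  # Euclidean distance, Python 3.8+

def estimate_k_by_linkage_gap(points):
    """Estimate cluster count by single-linkage agglomeration, cutting at
    the largest jump between consecutive merge distances."""
    clusters = [[p] for p in points]
    merge_heights = []
    while len(clusters) > 1:
        # find the closest pair of clusters (single linkage)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merge_heights.append(d)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    # the largest jump separates within-cluster merges from between-cluster ones
    jumps = [b - a for a, b in zip(merge_heights, merge_heights[1:])]
    cut = jumps.index(max(jumps))
    return len(points) - 1 - cut  # clusters remaining before the big merges

pts = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (9, 0), (9, 0.1)]
print(estimate_k_by_linkage_gap(pts))  # -> 3
```

The text's method goes much further, regressing on the whole ranked hierarchy rather than a single gap, which is what allows it to generalize across distributions without retraining.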
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Numbers indicate what fraction of the whole dataset is grouped in a given cluster (column “all”) and the ratio of sequences of a given phenotype to all sequences in the respective cluster.
This study tests the efficacy of an intervention, Safe Public Spaces (SPS), focused on improving the safety of public spaces in schools, such as hallways, cafeterias, and stairwells. Twenty-four schools with middle grades in a large urban area were recruited for participation, pair-matched, and then assigned to either treatment or control. The study comprises four components: an implementation evaluation, a cost study, an impact study, and a community crime study.
Community crime study: The community crime study used juvenile arrest data from the NYPD (New York Police Department), available at https://data.cityofnewyork.us/Public-Safety/NYPD-Arrests-Data-Historic-/8h9b-rp9u. The data include all juvenile arrests during the life of the intervention. The 12 matched schools were identified and geo-mapped using Quantum GIS (QGIS) 3.8 software. Block groups in the 2010 US Census in which the schools reside, and neighboring block groups, were mapped into micro-areas. This resulted in 12 experimental school blocks and 11 control blocks in which the schools reside (two of the control schools were in the same census block group). Additionally, neighboring blocks were geo-mapped into 70 experimental and 77 control adjacent block groups (see map). Finally, juvenile arrests were mapped into experimental and control areas. Using the ARIMA time-series method in the Stata 15 statistical software package, arrest data were analyzed to compare the change in juvenile arrests in the experimental and control sites.
Cost study: For the cost study, information from the implementing organization (Engaging Schools) was combined with data from phone conversations and follow-up communications with staff in school sites to populate a Resource Cost Model. The Resource Cost Model Excel file will be provided for archiving. This file contains details on the staff time and materials allocated to the intervention, as well as the NYC prices, in 2018 US dollars, associated with each element. Prices were gathered from multiple sources, including actual NYC DOE data on salaries for position types for which these data were available, and district salary schedules for the other staff types. Census data were used to calculate benefits.
Impact evaluation: The impact evaluation was conducted using data from the Research Alliance for New York City Schools, which builds and maintains a unique archive of longitudinal data about NYC schools to support ongoing research. Its agreement with the New York City Department of Education (NYC DOE) outlines the data it receives, the process used to obtain it, and the security measures that keep it safe.
Implementation study: The implementation study comprises the baseline survey and observation data. Interview transcripts are not archived.