100+ datasets found
  1. Dataset for: Optimal Transport, Mean Partition, and Uncertainty Assessment...

    • wiley.figshare.com
    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Jia Li; Beomseok Seo; Lin Lin (2023). Dataset for: Optimal Transport, Mean Partition, and Uncertainty Assessment in Cluster Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.8038925
    Explore at:
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Wiley (https://www.wiley.com/)
    Authors
    Jia Li; Beomseok Seo; Lin Lin
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    In scientific data analysis, clusters identified computationally often substantiate existing hypotheses or motivate new ones. Yet the combinatorial nature of the clustering result, which is a partition rather than a set of parameters or a function, blurs notions of mean and variance. This intrinsic difficulty hinders the development of methods to improve clustering by aggregation or to assess the uncertainty of clusters generated. We overcome that barrier by aligning clusters via optimal transport. Equipped with this technique, we propose a new algorithm to enhance clustering by any baseline method using bootstrap samples. Cluster alignment enables us to quantify variation in the clustering result at the levels of both overall partitions and individual clusters. Set relationships between clusters such as one-to-one match, split, and merge can be revealed. A covering point set for each cluster, a concept kin to the confidence interval, is proposed. The tools we have developed here will help address the crucial question of whether any cluster is an intrinsic or spurious pattern. Experimental results on both simulated and real datasets are provided.

  2. Data for: kluster: An Efficient Scalable Procedure for Approximating the...

    • data.mendeley.com
    Updated Jun 19, 2018
    + more versions
    Cite
    Hossein Estiri (2018). Data for: kluster: An Efficient Scalable Procedure for Approximating the Number of Clusters in Unsupervised Learning [Dataset]. http://doi.org/10.17632/vfx46vcwpp.1
    Explore at:
    Dataset updated
    Jun 19, 2018
    Authors
    Hossein Estiri
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0) (https://creativecommons.org/licenses/by-nc/3.0/)
    License information was derived automatically

    Description

    182 simulated datasets in two sets (the first contains small datasets, the second large ones) with different cluster compositions, i.e., different numbers of clusters and separation values, generated using the clusterGeneration package in R. Each set consists of 91 datasets in comma-separated values (csv) format (182 csv files in total) with 3-15 clusters and separation values from 0.1 to 0.7. Separation values can range over (−0.999, 0.999); a higher value indicates more separable clusters.

    The size of the dataset, the number of clusters, and the separation value are encoded in each file name, size_X_n_Y_sepval_Z.csv, where X is the size of the dataset, Y is the number of clusters, and Z is the separation value.
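
    The file-naming scheme described above can be unpacked programmatically. The following is a minimal Python sketch, assuming every file name matches size_X_n_Y_sepval_Z.csv exactly with numeric X, Y, and Z; the names `SimInfo` and `parse_kluster_filename` and the sample file name are hypothetical and do not come from the dataset.

```python
import re
from typing import NamedTuple


class SimInfo(NamedTuple):
    """Metadata encoded in one simulated-dataset file name."""
    size: int          # X: number of rows in the dataset
    n_clusters: int    # Y: number of clusters
    sepval: float      # Z: separation value


# Assumed pattern: size_<int>_n_<int>_sepval_<decimal>.csv
_PATTERN = re.compile(r"^size_(\d+)_n_(\d+)_sepval_(-?\d+(?:\.\d+)?)\.csv$")


def parse_kluster_filename(name: str) -> SimInfo:
    """Extract size, cluster count, and separation value from a file name."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return SimInfo(int(match.group(1)), int(match.group(2)), float(match.group(3)))


if __name__ == "__main__":
    # Hypothetical file name used only for illustration.
    print(parse_kluster_filename("size_600_n_3_sepval_0.1.csv"))
```

    Raising on an unexpected name, rather than returning a partial result, keeps mislabeled files from slipping silently into an evaluation run.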

  3. Data from: Galaxy clustering

    • kaggle.com
    zip
    Updated Jan 3, 2023
    Cite
    The Devastator (2023). Galaxy clustering [Dataset]. https://www.kaggle.com/datasets/thedevastator/clustering-polygons-utilizing-iris-moon-and-circ
    Explore at:
    Available download formats: zip (6339 bytes)
    Dataset updated
    Jan 3, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Galaxy clustering

    Iris, Moon, and Circles datasets for Galaxy clustering tutorial

    By [source]

    About this dataset

    This dataset can be used to explore the effectiveness of various clustering algorithms. It includes numerical measurements (X, Y, Sepal.Length, and Petal.Length) and a categorical value (Species), so the relationship between variable types and clustering performance can be investigated. Comparing results across the three files provided, moon.csv (x and y coordinates), iris.csv (sepal and petal lengths), and circles.csv, gives insight into how different data distributions affect clustering techniques such as K-Means or hierarchical clustering.


    How to use the dataset

    This dataset can also be a great starting point to further explore more complex clusters by using higher dimensional space variables such as color or texture that may be present in other datasets not included here but which can help to form more accurate groups when using cluster-analysis algorithms. Additionally, it could also assist in visualization projects where clusters may need to be generated such as plotting mapped data points or examining relationships between two different variables within a certain region drawn on a chart.

    To use this dataset effectively, understand how your chosen algorithm works: some algorithms require parameters to be specified beforehand, while others determine them automatically, and overlooking the difference can invalidate the interpretation. Also become familiar with the silhouette score and the Rand index, two commonly used metrics for judging a clustering; the Rand index does so by comparing it against another clustering or a set of known labels.
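
    As an illustration of the Rand index mentioned above, here is a minimal pure-Python sketch. It computes the plain (unadjusted) Rand index, the fraction of point pairs on which two labelings agree about "same cluster" versus "different cluster"; the function name `rand_index` and the toy labels are hypothetical, and the chance-corrected adjusted Rand index is a different quantity.

```python
from itertools import combinations
from typing import Hashable, Sequence


def rand_index(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Plain Rand index between two labelings of the same points."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same points")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two points to form a pair")

    agreements = 0
    pairs = 0
    for i, j in combinations(range(n), 2):
        same_in_a = labels_a[i] == labels_a[j]
        same_in_b = labels_b[i] == labels_b[j]
        # A pair agrees when both labelings group it, or both split it.
        agreements += same_in_a == same_in_b
        pairs += 1
    return agreements / pairs


if __name__ == "__main__":
    # Identical partitions with renamed cluster ids still score 1.0.
    print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

    Because the index counts pairs rather than matching label names, it is unaffected by how cluster ids happen to be numbered.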

    Research Ideas

    • Utilizing the sepal and petal lengths and widths to perform flower recognition or part of a larger image recognition pipeline.
    • Classifying the data points in each dataset by the X-Y coordinates using clustering algorithms to analyze galaxy locations or overall formation patterns for stars, planets, or galaxies.
    • Exploring correlations between species of flowers in terms of sepal/petal lengths by performing supervised learning tasks such as classification with this dataset

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: moon.csv

    | Column name | Description |
    |:------------|:------------|
    | X | X coordinate of the data point. (Numeric) |
    | Y | Y coordinate of the data point. (Numeric) |

    File: iris.csv

    | Column name | Description |
    |:------------|:------------|
    | Sepal.Length | Length of the sepal of the flower. (Numeric) |
    | Petal.Length | Length of the petal of the flower. (Numeric) |
    | Species | Species of the flower. (Categorical) |


  4. Dashboard cluster dataset

    • catalog.data.gov
    • datasets.ai
    Updated Oct 8, 2022
    + more versions
    Cite
    Office of Assistant Secretary for Policy (2022). Dashboard cluster dataset [Dataset]. https://catalog.data.gov/dataset/dashboard-cluster-dataset
    Explore at:
    Dataset updated
    Oct 8, 2022
    Dataset provided by
    Office of Assistant Secretary for Policy
    Description

    Data used for the Meta-Analysis of 46 Career Pathways Impact and data from four large nationally representative longitudinal surveys, as well as licensed data on occupational transitions from online career profiles, to examine workers’ career paths and wages for the Career Trajectories and Occupational Transitions Study.

  5. Benchmarks datasets for cluster analysis

    • kaggle.com
    zip
    Updated Nov 15, 2023
    Cite
    Onthada Preedasawakul (2023). Benchmarks datasets for cluster analysis [Dataset]. https://www.kaggle.com/datasets/onthada/benchmarks-datasets-for-clustering
    Explore at:
    Available download formats: zip (608532 bytes)
    Dataset updated
    Nov 15, 2023
    Authors
    Onthada Preedasawakul
    License

    Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
    License information was derived automatically

    Description

    25 Artificial Datasets

    The datasets are generated using either Gaussian or Uniform distributions. Each dataset contains several known sub-groups intended for testing centroid-based clustering results and cluster validity indices.
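
    To make the idea of Gaussian-generated data with known sub-groups concrete, here is a minimal Python sketch using only the standard library. It is not the generator used for these 25 datasets; the function name `make_gaussian_groups`, the centers, and the standard deviation are hypothetical choices for illustration.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def make_gaussian_groups(
    centers: Sequence[Point],
    n_per_group: int,
    std: float,
    seed: int = 0,
) -> Tuple[List[Point], List[int]]:
    """Draw 2-D points around each center and return them with true labels."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    points: List[Point] = []
    labels: List[int] = []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_group):
            points.append((rng.gauss(cx, std), rng.gauss(cy, std)))
            labels.append(label)
    return points, labels


if __name__ == "__main__":
    # Three well-separated groups of 50 points each (illustrative values).
    pts, lbls = make_gaussian_groups([(0, 0), (5, 5), (0, 5)], 50, 0.5)
    print(len(pts), sorted(set(lbls)))
```

    Keeping the true labels alongside the points is what allows a clustering result or a validity index to be checked against the known grouping.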

    Cluster analysis is a popular machine learning technique for segmenting datasets so that similar data points fall in the same group. For those who are familiar with R, there is an R package called "UniversalCVI" (https://CRAN.R-project.org/package=UniversalCVI) for cluster evaluation. This package provides algorithms for checking the accuracy of a clustering result with known classes, computing cluster validity indices, and generating plots for comparing them. The package is compatible with K-means, fuzzy C-means, EM clustering, and hierarchical clustering (single, average, and complete linkage). To use the "UniversalCVI" package, follow the instructions provided in the R documentation.

    For more in-depth details of the package and cluster evaluation, please see the papers https://doi.org/10.1016/j.patcog.2023.109910 and https://arxiv.org/abs/2308.14785

    All the datasets are also available on GitHub at

    https://github.com/O-PREEDASAWAKUL/FuzzyDatasets.git .


  6. Dataset for Investigating Anomalies in Compute Clusters

    • zenodo.org
    • data.niaid.nih.gov
    tar
    Updated Nov 29, 2023
    Cite
    Diana McSpadden; Alanazi Yasir; Bryan Hess; Laura Hild; Mark Jones; Yiyang Lu; Ahmed Mohammed; Wesley Moore; Jie Ren; Malachi Schram; Evgenia Smirni (2023). Dataset for Investigating Anomalies in Compute Clusters [Dataset]. http://doi.org/10.5281/zenodo.10058230
    Explore at:
    Available download formats: tar
    Dataset updated
    Nov 29, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Diana McSpadden; Alanazi Yasir; Bryan Hess; Laura Hild; Mark Jones; Yiyang Lu; Ahmed Mohammed; Wesley Moore; Jie Ren; Malachi Schram; Evgenia Smirni
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Abstract

    The dataset was collected for 332 compute nodes throughout May 19 - 23, 2023. May 19 - 22 characterizes normal compute cluster behavior, while May 23 includes an anomalous event. The dataset includes eight CPU, 11 disk, 47 memory, and 22 Slurm metrics. It represents five distinct hardware configurations and contains over one million records, totaling more than 180GB of raw data.

    Background

    Motivated by the goal to develop a digital twin of a compute cluster, the dataset was collected using a Prometheus server (1) scraping the Thomas Jefferson National Accelerator Facility (JLab) batch cluster used to run an assortment of physics analysis and simulation jobs, where analysis workloads leverage data generated from the laboratory's electron accelerator, and simulation workloads generate large amounts of flat data that is then carved to verify amplitudes. Metrics were scraped from the cluster throughout May 19 - 23, 2023. Data from May 19 to May 22 primarily reflected normal system behavior, while May 23, 2023, recorded a notable anomaly. This anomaly was severe enough to necessitate intervention by JLab IT Operations staff.

    The metrics were collected from CPU, disk, memory, and Slurm. Metrics related to CPU, disk, and memory provide insights into the status of individual compute nodes. Furthermore, Slurm metrics collected from the network have the capability to detect anomalies that may propagate to compute nodes executing the same job.

    Usage Notes

    While the data from May 19 - 22 characterizes normal compute cluster behavior, and May 23 includes anomalous observations, the dataset cannot be considered labeled data. Which nodes were affected, and the exact start and end times at which they show abnormal effects, are unclear. Thus, the dataset could be used to develop unsupervised machine-learning algorithms to detect anomalous events in a batch cluster.

    https://doi.org/10.48550/arXiv.2311.16129
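
    One very simple way to start on the unsupervised use suggested in the usage notes is a per-metric z-score baseline: estimate the mean and standard deviation of a metric from the days described as normal, then flag later readings that deviate strongly. The sketch below assumes a single numeric metric series per node; the function name `flag_anomalies`, the threshold of 3.0, and the sample values are hypothetical, and this is not the method of the dataset authors.

```python
from statistics import mean, stdev
from typing import List, Sequence


def flag_anomalies(
    baseline: Sequence[float],
    observed: Sequence[float],
    threshold: float = 3.0,
) -> List[bool]:
    """Flag observed values whose z-score against the baseline exceeds threshold."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation; needs >= 2 values
    if sigma == 0:
        raise ValueError("baseline has zero variance; z-score is undefined")
    return [abs(x - mu) / sigma > threshold for x in observed]


if __name__ == "__main__":
    # Illustrative values only, not taken from the dataset.
    normal_days = [10, 11, 9, 10, 12, 10, 9, 11]
    later_day = [10, 50, 11]
    print(flag_anomalies(normal_days, later_day))
```

    A baseline like this treats each metric independently, so it cannot capture anomalies that only show up as unusual combinations of metrics across nodes.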

  7. COCD: Catalog of Open Cluster Data - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Apr 1, 2025
    + more versions
    Cite
    nasa.gov (2025). COCD: Catalog of Open Cluster Data - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/cocd-catalog-of-open-cluster-data
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    The Catalog of Open Cluster Data (COCD) is a result of studies of the wide neighborhoods of 513 open clusters and 7 compact associations carried out in the high-precision homogeneous All-Sky Compiled Catalog of 2.5 Million Stars (ASCC-2.5, Kharchenko 2001, CDS Cat. ). On the basis of data on about 33,000 possible members (including about 10,000 most probable ones) and homogeneous methods of cluster parameter determination, the angular sizes of cluster cores and coronae, cluster heliocentric distances, mean proper motions, mean radial velocities and ages were established and collected in the COCD. These include cluster distances for 200 clusters, average cluster radial velocities for 94 clusters, and cluster ages for 195 clusters derived for the first time. Clusters in the catalogue are sequenced in their Right Ascension J2000 order. The Open Cluster Diagrams Atlas (OCDA) presents a set of open cluster diagrams used for the determination of parameters of the 513 open clusters and 7 compact associations, and is intended to illustrate the quality of the constructed cluster membership (Kharchenko et al. 2004, CDS Cat.

  8. Regional Innovation Clusters

    • catalog.data.gov
    • datasets.ai
    • +2 more
    Updated Feb 9, 2023
    Cite
    Small Business Administration (2023). Regional Innovation Clusters [Dataset]. https://catalog.data.gov/dataset/regional-innovation-clusters
    Explore at:
    Dataset updated
    Feb 9, 2023
    Dataset provided by
    Small Business Administration (https://www.sba.gov/)
    Description

    The Regional Innovation Clusters serve a diverse group of sectors and geographies. Three of the initial pilot clusters, termed Advanced Defense Technology clusters, are specifically focused on meeting the needs of the defense industry. The Wood Products Cluster, debuted in 2015, supports the White House’s Partnerships for Opportunity and Workforce and Economic Revitalization (POWER) Initiative for coal communities. All of the clusters support small businesses by fostering a synergistic network of small and large businesses, university researchers, regional economic organizations, stakeholders, and investors, while providing matchmaking, business training, counseling, mentoring, and other services to help small businesses expand and grow.

  9. Detailed Data for Cluster Analysis in IBM SPSS Statistics

    • figshare.com
    xlsx
    Updated Sep 22, 2024
    Cite
    Vitaliy Kolomiets (2024). Detailed Data for Cluster Analysis in IBM SPSS Statistics [Dataset]. http://doi.org/10.6084/m9.figshare.27083131.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Sep 22, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Vitaliy Kolomiets
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    ASSESSMENT OF THE ECONOMIC EFFICIENCY OF FEDERAL FMCG RETAIL CHAINS IN RUSSIA: A CLUSTER APPROACH

  10. cluster-data

    • kaggle.com
    zip
    Updated Dec 13, 2023
    + more versions
    Cite
    DAT DO (2023). cluster-data [Dataset]. https://www.kaggle.com/datasets/ducanger/cluster-data
    Explore at:
    Available download formats: zip (4788393 bytes)
    Dataset updated
    Dec 13, 2023
    Authors
    DAT DO
    Description

    Dataset

    This dataset was created by DAT DO

    Contents

  11. CVRR dataset for trajectory clustering

    • figshare.com
    zip
    Updated May 15, 2024
    Cite
    Chen (2024). CVRR dataset for trajectory clustering [Dataset]. http://doi.org/10.6084/m9.figshare.25826839.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 15, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Chen
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Six trajectory clustering datasets (Morris and Trivedi, 2009) are provided for benchmarking trajectory clustering algorithms. These datasets cover a diverse range of scenes to enable thorough evaluation of different algorithms. The datasets include three simulated highway scenes, a real highway scene, a simulated intersection scene, and an indoor omnidirectional camera scene. They originate from the work by Morris and Trivedi (2009), described in their paper "Learning trajectory patterns by clustering: Experimental studies and comparative evaluation," presented at the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. The datasets provide a valuable resource for the trajectory clustering community, enabling comparative evaluation and benchmarking of various clustering algorithms. The datasets can be accessed through the following DOI link: https://doi.org/10.1109/CVPR.2009.5206559.

  12. Patient Dataset for Clustering (Raw Data)

    • kaggle.com
    Updated Aug 10, 2023
    Cite
    Arjunn Sharma (2023). Patient Dataset for Clustering (Raw Data) [Dataset]. https://www.kaggle.com/datasets/arjunnsharma/patient-dataset-for-clustering-raw-data
    Explore at:
    Croissant, a format for machine-learning datasets (see mlcommons.org/croissant)
    Dataset updated
    Aug 10, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Arjunn Sharma
    Description

    About Dataset

    ● Based on patient symptoms, identify patients needing immediate resuscitation; assign patients to a predesignated patient care area, thereby prioritizing their care; and initiate diagnostic/therapeutic measures as appropriate.
    ● Three individual datasets were used for three urgent illnesses/injuries. Each dataset has its own features and symptoms for each patient; they were merged to determine the most severe symptoms for each illness and to prioritize treatment accordingly.

    PROJECT SUMMARY

    Triage refers to the sorting of injured or sick people according to their need for emergency medical attention. It is a method of determining priority for who gets care first.

    BACKGROUND

    Triage is the prioritization of patient care (or victims during a disaster) based on illness/injury, symptoms, severity, prognosis, and resource availability. The purpose of triage is to identify patients needing immediate resuscitation; to assign patients to a predesignated patient care area, thereby prioritizing their care; and to initiate diagnostic/therapeutic measures as appropriate.

    BUSINESS CHALLENGE

    Based on patient symptoms, identify patients needing immediate resuscitation; assign patients to a predesignated patient care area, thereby prioritizing their care; and initiate diagnostic/therapeutic measures as appropriate.

  13. COCD: Catalog of Open Cluster Data First Extension - Dataset - NASA Open...

    • data.nasa.gov
    Updated Apr 1, 2025
    + more versions
    Cite
    nasa.gov (2025). COCD: Catalog of Open Cluster Data First Extension - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/cocd-catalog-of-open-cluster-data-first-extension
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    This table contains a list of 130 Galactic open clusters found in the All-Sky Compiled Catalogue of 2.5 Million Stars (ASCC-2.5) and not included in the original Catalog of Open Cluster Data (COCD); it is known as the 1st Extension of the COCD (COCD-1). For these new clusters, the authors determined a homogeneous set of astrophysical parameters such as size, membership, motion, distance and age. In their previous work (the Browse table COCD based on the CDS Cat. J/A+A/438/1163), 520 already-known open clusters out of a sample of 1700 clusters from the literature were confirmed in the ASCC-2.5 using independent, objective methods. Using these same methods, the whole sky was systematically screened for new clusters. The newly detected clusters show the same distribution over the sky as the known ones. It is found that, without a priori knowledge about existing clusters, the authors' search led to clusters which are, on average, brighter, have more members and cover larger angular radii than the 520 previously-known ones. On the basis of data on about 6,200 possible members (including about 2,200 most probable ones) and homogeneous methods of cluster parameter determination, the angular sizes of cluster cores and coronae, cluster heliocentric distances, colour-excesses, mean proper motions, and ages of 130 clusters and mean radial velocities of 69 clusters were established and collected in the COCD-1. Clusters in the catalogue are numbered in order of increasing J2000.0 Right Ascension. The 1st Extension of the Open Cluster Diagrams Atlas (OCDA-1) presents a set of open cluster diagrams used for the determination of parameters of the 130 newly discovered open clusters, and is intended to illustrate the quality of the constructed cluster membership, and the accuracy of the derived cluster parameters. Every diagram presents relations between various stellar data from the all-sky catalog ASCC-2.5 (Kharchenko, 2001, CDS Cat. ) in the area of the specific cluster. There are five diagrams provided for every cluster in the Atlas: the area map, the density profile, the vector point diagram, the "magnitude equation" (proper motion in each coordinate versus V magnitude) diagram, and the color-magnitude diagram. The 130 OCDA-1 PostScript plots (one file per cluster) are available as a remote data product for all of the entries in this table. This table was created by the HEASARC in May 2011 based on CDS Catalog J/A+A/440/403/ files cluster.dat and notes.dat. This is a service provided by NASA HEASARC.

  14. Model-based cluster analysis of microarray gene-expression data

    • catalog.data.gov
    • data.virginia.gov
    • +1 more
    Updated Sep 7, 2025
    Cite
    National Institutes of Health (2025). Model-based cluster analysis of microarray gene-expression data [Dataset]. https://catalog.data.gov/dataset/model-based-cluster-analysis-of-microarray-gene-expression-data
    Explore at:
    Dataset updated
    Sep 7, 2025
    Dataset provided by
    National Institutes of Health
    Description

    Background Microarray technologies are emerging as a promising tool for genomic studies. The challenge now is how to analyze the resulting large amounts of data. Clustering techniques have been widely applied in analyzing microarray gene-expression data. However, normal mixture model-based cluster analysis has not been widely used for such data, although it has a solid probabilistic foundation. Here, we introduce and illustrate its use in detecting differentially expressed genes. In particular, we do not cluster gene-expression patterns but a summary statistic, the t-statistic. Results The method is applied to a data set containing expression levels of 1,176 genes of rats with and without pneumococcal middle-ear infection. Three clusters were found, two of which contain more than 95% genes with almost no altered gene-expression levels, whereas the third one has 30 genes with more or less differential gene-expression levels. Conclusions Our results indicate that model-based clustering of t-statistics (and possibly other summary statistics) can be a useful statistical tool to exploit differential gene expression for microarray data.
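
    The description above clusters a per-gene summary statistic, the t-statistic, rather than the expression patterns themselves. As a sketch of that first step only, the following Python computes a two-sample t-statistic for one gene, assuming Welch's unequal-variance form; the source does not say which variant the authors used, and the function name `welch_t` and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def welch_t(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Welch two-sample t-statistic for one gene's expression levels."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    # Standard error of the difference in means, with separate variances.
    std_err = sqrt(variance(group_a) / n_a + variance(group_b) / n_b)
    if std_err == 0:
        raise ValueError("both groups have zero variance")
    return (mean(group_a) - mean(group_b)) / std_err


if __name__ == "__main__":
    # Illustrative expression levels for one gene, infected vs. control.
    print(welch_t([4.0, 5.0, 6.0], [1.0, 2.0, 3.0]))
```

    Applying such a function to every gene yields one number per gene, and it is that one-dimensional collection of statistics that a mixture model would then cluster.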

  15. Dataset for a clusteranalysis to cluster and characterize autonomous last...

    • data.4tu.nl
    zip
    Updated Mar 23, 2023
    Cite
    Julian Maas (2023). Dataset for a clusteranalysis to cluster and characterize autonomous last mile concepts in a standardized and holistic manner [Dataset]. http://doi.org/10.4121/21293736.v2
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 23, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Julian Maas
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This file is the dataset for a cluster analysis to cluster and characterize autonomous last mile concepts in a standardized and holistic manner.

    The dataset includes autonomous last mile concepts described by a newly developed holistic taxonomy.

    With the developed clusters it is possible to better categorize and compare individual autonomous last mile concepts. Furthermore, the developed taxonomy of autonomous last mile concepts and the cluster analysis allow an evaluation of the interrelations between characteristics of autonomous last mile concepts, which supports the design of new concepts as well as the adaptation or selection of a concept for a specific use case.

  16. Neuronal dataset and cluster data information

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated May 5, 2024
    Cite
    Padron-Manrique, Cristian (2024). Neuronal dataset and cluster data information [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001442464
    Explore at:
    Dataset updated
    May 5, 2024
    Authors
    Padron-Manrique, Cristian
    Description

    Adult mouse visual cortex (RPKM values for 24,057 genes and 1,679 cells) with cluster information taken from https://singlecell.broadinstitute.org/single_cell/study/SCP6/a-transcriptomic-taxonomy-of-adult-mouse-visual-cortex-visp#study-download

  17. Apple Flower Clusters Dataset

    • universe.roboflow.com
    zip
    Updated Dec 22, 2023
    Cite
    CTU (2023). Apple Flower Clusters Dataset [Dataset]. https://universe.roboflow.com/ctu-b0lvz/apple-flower-clusters
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 22, 2023
    Dataset authored and provided by
    CTU
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Variables measured
    Flower Clusters Polygons
    Description

    Apple Flower Clusters

    ## Overview
    
    Apple Flower Clusters is a dataset for instance segmentation tasks - it contains Flower Clusters annotations for 202 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  18. Blind method for discovering number of clusters in multidimensional datasets...

    • plos.figshare.com
    docx
    Updated Jun 4, 2023
    Cite
    Osbert C. Zalay (2023). Blind method for discovering number of clusters in multidimensional datasets by regression on linkage hierarchies generated from random data [Dataset]. http://doi.org/10.1371/journal.pone.0227788
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Osbert C. Zalay
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Determining intrinsic number of clusters in a multidimensional dataset is a commonly encountered problem in exploratory data analysis. Unsupervised clustering algorithms often rely on specification of cluster number as an input parameter. However, this is typically not known a priori. Many methods have been proposed to estimate cluster number, including statistical and information-theoretic approaches such as the gap statistic, but these methods are not always reliable when applied to non-normally distributed datasets containing outliers or noise. In this study, I propose a novel method called hierarchical linkage regression, which uses regression to estimate the intrinsic number of clusters in a multidimensional dataset. The method operates on the hypothesis that the organization of data into clusters can be inferred from the hierarchy generated by partitioning the dataset, and therefore does not directly depend on the specific values of the data or their distribution, but on their relative ranking within the partitioned set. Moreover, the technique does not require empirical data to train on, but can use synthetic data generated from random distributions to fit regression coefficients. The trained hierarchical linkage regression model is able to infer cluster number in test datasets of varying complexity and differing distributions, for image, text and numeric data, using the same regression model without retraining. The method performs favourably against other cluster number estimation techniques, and is also robust to parameter changes, as demonstrated by sensitivity analysis. The apparent robustness and generalizability of hierarchical linkage regression make it a promising tool for unsupervised exploratory data analysis and discovery.

  19. Five largest clusters in the NSI/SI and R5/X4 dataset clustering.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 10, 2023
    Cite
    Katarzyna Bozek; Alexander Thielen; Saleta Sierra; Rolf Kaiser; Thomas Lengauer (2023). Five largest clusters in the NSI/SI and R5/X4 dataset clustering. [Dataset]. http://doi.org/10.1371/journal.pone.0007387.t001
    Available download formats: xls
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Katarzyna Bozek; Alexander Thielen; Saleta Sierra; Rolf Kaiser; Thomas Lengauer
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Numbers indicate the fraction of the whole dataset grouped in a given cluster (column “all”) and the ratio of sequences of a given phenotype to all sequences in the respective cluster.

  20. Data from: A Cluster Randomized Controlled Trial of the Safe Public Spaces...

    • catalog.data.gov
    • icpsr.umich.edu
    Updated Mar 12, 2025
    Cite
    National Institute of Justice (2025). A Cluster Randomized Controlled Trial of the Safe Public Spaces in Schools Program, New York City, 2016-2018 [Dataset]. https://catalog.data.gov/dataset/a-cluster-randomized-controlled-trial-of-the-safe-public-spaces-in-schools-program-ne-2016-f67d7
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    National Institute of Justice
    Area covered
    New York
    Description

    This study tests the efficacy of an intervention, Safe Public Spaces (SPS), focused on improving the safety of public spaces in schools, such as hallways, cafeterias, and stairwells. Twenty-four schools with middle grades in a large urban area were recruited for participation, pair-matched, and then assigned to either treatment or control. The study comprises four components: an implementation evaluation, a cost study, an impact study, and a community crime study.

    Community crime study: The community crime study used juvenile arrest data from the NYPD (New York Police Department). The data can be found at https://data.cityofnewyork.us/Public-Safety/NYPD-Arrests-Data-Historic-/8h9b-rp9u. Data include all arrests for juvenile crime during the life of the intervention. The 12 matched schools were identified and geo-mapped using Quantum GIS (QGIS) 3.8 software. Block groups in the 2010 US Census in which the schools reside, and neighboring block groups, were mapped into micro-areas. This resulted in 12 experimental school blocks and 11 control blocks in which the schools reside (two of the control schools were located in the same census block group). Additionally, neighboring blocks were geo-mapped into 70 experimental and 77 control adjacent block groups (see map). Finally, juvenile arrests were mapped into experimental and control areas. Using the ARIMA time-series method in the Stata 15 statistical software package, arrest data were analyzed to compare the change in juvenile arrests in the experimental and control sites.

    Cost study: For the cost study, information from the implementing organization (Engaging Schools) was combined with data from phone conversations and follow-up communications with staff in school sites to populate a Resource Cost Model. The Resource Cost Model Excel file will be provided for archiving. This file contains details on the staff time and materials allocated to the intervention, as well as the NYC prices in 2018 US dollars associated with each element. Prices were gathered from multiple sources, including actual NYC DOE data on salaries for position types for which these data were available and district salary schedules for the other staff types. Census data were used to calculate benefits.

    Impact evaluation: The impact evaluation was conducted using data from the Research Alliance for New York City Schools. Among the core functions of the Research Alliance is building and maintaining a unique archive of longitudinal data on NYC schools to support ongoing research. Their agreement with the New York City Department of Education (NYC DOE) outlines the data they receive, the process they use to obtain it, and the security measures to keep it safe.

    Implementation study: The implementation study comprises the baseline survey and observation data. Interview transcripts are not archived.
