100+ datasets found

Data from: Galaxy clustering
kaggle.com
zip
Updated Jan 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Devastator (2023). Galaxy clustering [Dataset]. https://www.kaggle.com/datasets/thedevastator/clustering-polygons-utilizing-iris-moon-and-circ
Explore at:
zip(6339 bytes)Available download formats
Dataset updated
Jan 3, 2023
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Galaxy clustering

Iris, Moon, and Circles datasets for Galaxy clustering tutorial

By [source]

About this dataset

This dataset contains a wealth of information that can be used to explore the effectiveness of various clustering algorithms. With its inclusion of numerical measurements (X, Y, Sepal.Length, and Petal.Length) and categorical values (Species), it is possible to investigate the relationship between different types of variables and clustering performance. Additionally, by comparing results for the 3 datasets provided - moon.csv (which contains x and y coordinates), iris.csv (which contains measurements for sepal and petal lengths),and circles.csv - we can gain insights into how different data distributions affect clustering techniques such as K-Means or Hierarchical Clustering among others!

More Datasets

For more datasets, click here.

Featured Notebooks

🚨 Your notebook can be here! 🚨!

How to use the dataset

This dataset can also be a great starting point to further explore more complex clusters by using higher dimensional space variables such as color or texture that may be present in other datasets not included here but which can help to form more accurate groups when using cluster-analysis algorithms. Additionally, it could also assist in visualization projects where clusters may need to be generated such as plotting mapped data points or examining relationships between two different variables within a certain region drawn on a chart.

To use this dataset effectively it is important to understand how exactly your chosen algorithm works since some require specifying parameters beforehand while others take care of those details automatically; otherwise the interpretation may be invalid depending on the methods used alongside clustering you intend for your project. Furthermore, familiarize yourself with concepts like silhouette score and rand index - these are commonly used metrics that measure your cluster’s performance against other clusterings models so you know if what you have done so far satisfies an acceptable level of accuracy or not yet! Good luck!

Research Ideas

Utilizing the sepal and petal lengths and widths to perform flower recognition or part of a larger image recognition pipeline.

Classifying the data points in each dataset by the X-Y coordinates using clustering algorithms to analyze galaxy locations or overall formation patterns for stars, planets, or galaxies.

Exploring correlations between species of flowers in terms of sepal/petal lengths by performing supervised learning tasks such as classification with this dataset

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: moon.csv | Column name | Description | |:--------------|:------------------------------------------| | X | X coordinate of the data point. (Numeric) | | Y | Y coordinate of the data point. (Numeric) |

File: iris.csv | Column name | Description | |:-----------------|:---------------------------------------------| | Sepal.Length | Length of the sepal of the flower. (Numeric) | | Petal.Length | Length of the petal of the flower. (Numeric) | | Species | Species of the flower. (Categorical) |

Acknowledgements

If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit .
Patient Dataset for Clustering (Raw Data)
kaggle.com
Updated Aug 10, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Arjunn Sharma (2023). Patient Dataset for Clustering (Raw Data) [Dataset]. https://www.kaggle.com/datasets/arjunnsharma/patient-dataset-for-clustering-raw-data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 10, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Arjunn Sharma
Description
About Dataset ● Based on patient symptoms, identify patients needing immediate resuscitation; to assign patients to a predesignated patient care area, thereby prioritizing their care; and to initiate diagnostic/therapeutic measures as appropriate. ● Three individual datasets used for three urgent illness/injury, each dataset has its own features and symptoms for each patient and we merged them to know what are the most severe symptoms for each illness and give them priority of treatment.

PROJECT SUMMARY Triage refers to the sorting of injured or sick people according to their need for emergency medical attention. It is a method of determining priority for who gets care first. BACKGROUND Triage is the prioritization of patient care (or victims during a disaster) based on illness/injury, symptoms, severity, prognosis, and resource availability. The purpose of triage is to identify patients needing immediate resuscitation; to assign patients to a predesignated patient care area, thereby prioritizing their care; and to initiate diagnostic/therapeutic measures as appropriate. BUSINESS CHALLENGE Based on patient symptoms, identify patients needing immediate resuscitation; to assign patients to a predesignated patient care area, thereby prioritizing their care; and to initiate diagnostic/therapeutic measures as appropriate.
f
Data_Sheet_1_Improved space breakdown method – A robust clustering technique...
frontiersin.figshare.com
docx
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Eugen-Richard Ardelean; Ana-Maria Ichim; Mihaela Dînşoreanu; Raul Cristian Mureşan (2023). Data_Sheet_1_Improved space breakdown method – A robust clustering technique for spike sorting.docx [Dataset]. http://doi.org/10.3389/fncom.2023.1019637.s001
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.3389/fncom.2023.1019637.s001
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers
Authors
Eugen-Richard Ardelean; Ana-Maria Ichim; Mihaela Dînşoreanu; Raul Cristian Mureşan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Space Breakdown Method (SBM) is a clustering algorithm that was developed specifically for low-dimensional neuronal spike sorting. Cluster overlap and imbalance are common characteristics of neuronal data that produce difficulties for clustering methods. SBM is able to identify overlapping clusters through its design of cluster centre identification and the expansion of these centres. SBM’s approach is to divide the distribution of values of each feature into chunks of equal size. In each of these chunks, the number of points is counted and based on this number the centres of clusters are found and expanded. SBM has been shown to be a contender for other well-known clustering algorithms especially for the particular case of two dimensions while being too computationally expensive for high-dimensional data. Here, we present two main improvements to the original algorithm in order to increase its ability to deal with high-dimensional data while preserving its performance: the initial array structure was substituted with a graph structure and the number of partitions has been made feature-dependent, denominating this improved version as the Improved Space Breakdown Method (ISBM). In addition, we propose a clustering validation metric that does not punish overclustering and such obtains more suitable evaluations of clustering for spike sorting. Extracellular data recorded from the brain is unlabelled, therefore we have chosen simulated neural data, to which we have the ground truth, to evaluate more accurately the performance. Evaluations conducted on synthetic data indicate that the proposed improvements reduce the space and time complexity of the original algorithm, while simultaneously leading to an increased performance on neural data when compared with other state-of-the-art algorithms.Code available athttps://github.com/ArdeleanRichard/Space-Breakdown-Method.
Reference list of 265 sources used for the discovery of relationships...
doi.pangaea.de
search.dataone.org
Updated Jul 8, 2012
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jürgen Bernard; Tobias Ruppert; Tobias Schreck; Maximilian Scherer; Jörn Kohlhammer (2012). Reference list of 265 sources used for the discovery of relationships between data clusters and metadata properties [Dataset]. http://doi.org/10.1594/PANGAEA.785666
Explore at:
Unique identifier
https://doi.org/10.1594/PANGAEA.785666
Dataset updated
Jul 8, 2012
Dataset provided by
PANGAEA
Authors
Jürgen Bernard; Tobias Ruppert; Tobias Schreck; Maximilian Scherer; Jörn Kohlhammer
License
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Time period covered
Jan 1, 2006 - Dec 31, 2006
Area covered
Description
Visual cluster analysis provides valuable tools that help analysts to understand large data sets in terms of representative clusters and relationships thereof. Often, the found clusters are to be understood in context of belonging categorical, numerical or textual metadata which are given for the data elements. While often not part of the clustering process, such metadata play an important role and need to be considered during the interactive cluster exploration process. Traditionally, linked-views allow to relate (or loosely speaking: correlate) clusters with metadata or other properties of the underlying cluster data. Manually inspecting the distribution of metadata for each cluster in a linked-view approach is tedious, specially for large data sets, where a large search problem arises. Fully interactive search for potentially useful or interesting cluster to metadata relationships may constitute a cumbersome and long process. To remedy this problem, we propose a novel approach for guiding users in discovering interesting relationships between clusters and associated metadata. Its goal is to guide the analyst through the potentially huge search space. We focus in our work on metadata of categorical type, which can be summarized for a cluster in form of a histogram. We start from a given visual cluster representation, and compute certain measures of interestingness defined on the distribution of metadata categories for the clusters. These measures are used to automatically score and rank the clusters for potential interestingness regarding the distribution of categorical metadata. Identified interesting relationships are highlighted in the visual cluster representation for easy inspection by the user. We present a system implementing an encompassing, yet extensible, set of interestingness scores for categorical metadata, which can also be extended to numerical metadata. Appropriate visual representations are provided for showing the visual correlations, as well as the calculated ranking scores. Focusing on clusters of time series data, we test our approach on a large real-world data set of time-oriented scientific research data, demonstrating how specific interesting views are automatically identified, supporting the analyst discovering interesting and visually understandable relationships.
m
Educational Attainment in North Carolina Public Schools: Use of statistical...
data.mendeley.com
Updated Nov 14, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
Explore at:
Unique identifier
https://doi.org/10.17632/6cm9wyd5g5.1
Dataset updated
Nov 14, 2018
Authors
Scott Herford
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The purpose of data mining analysis is always to find patterns of the data using certain kind of techiques such as classification or regression. It is not always feasible to apply classification algorithms directly to dataset. Before doing any work on the data, the data has to be pre-processed and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance has not improved much. The reason why it has not improved could be the features we selected to perform clustering are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics. From the dimensionality reduction perspective: It is different from Principle Component Analysis which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique of reducing the data dimension will lose a lot of information since clustering techniques are based a metric of 'distance'. At high dimensions euclidean distance loses pretty much all meaning. Therefore using clustering as a "Reducing" dimensionality by mapping data points to cluster numbers is not always good since you may lose almost all the information. From the creating new features perspective: Clustering analysis creates labels based on the patterns of the data, it brings uncertainties into the data. By using clustering prior to classification, the decision on the number of clusters will highly affect the performance of the clustering, then affect the performance of classification. If the part of features we use clustering techniques on is very suited for it, it might increase the overall performance on classification. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not lock in the clustering outputs using a random_state in the effort to see if they were stable. Our assumption was that if the results vary highly from run to run which they definitely did, maybe the data just does not cluster well with the methods selected at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering to the data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model real world effectiveness and also to continue to revise the models from time to time as things change.

ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of...

zenodo.org
elki-project.github.io
+1more

application/gzip

Updated May 2, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Erich Schubert; Erich Schubert; Arthur Zimek; Arthur Zimek (2024). ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of Object Images (ALOI) [Dataset]. http://doi.org/10.5281/zenodo.6355684

Explore at:

application/gzipAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.6355684

Dataset updated

May 2, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Erich Schubert; Erich Schubert; Arthur Zimek; Arthur Zimek

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered

2022

Description

These data sets were originally created for the following publications:

M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek
Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?
In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, 2010.

H.-P. Kriegel, E. Schubert, A. Zimek
Evaluation of Multiple Clustering Solutions
In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with ECML PKDD 2011, Athens, Greece, 2011.

The outlier data set versions were introduced in:

E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel
On Evaluation of Outlier Rankings and Outlier Scores
In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, 2012.

They are derived from the original image data available at https://aloi.science.uva.nl/

The image acquisition process is documented in the original ALOI work: J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders, The Amsterdam library of object images, Int. J. Comput. Vision, 61(1), 103-112, January, 2005

Additional information is available at: https://elki-project.github.io/datasets/multi_view

The following views are currently available:

Feature type	Description	Files
Object number	Sparse 1000 dimensional vectors that give the true object assignment	objs.arff.gz
RGB color histograms	Standard RGB color histograms (uniform binning)	aloi-8d.csv.gz aloi-27d.csv.gz aloi-64d.csv.gz aloi-125d.csv.gz aloi-216d.csv.gz aloi-343d.csv.gz aloi-512d.csv.gz aloi-729d.csv.gz aloi-1000d.csv.gz
HSV color histograms	Standard HSV/HSB color histograms in various binnings	aloi-hsb-2x2x2.csv.gz aloi-hsb-3x3x3.csv.gz aloi-hsb-4x4x4.csv.gz aloi-hsb-5x5x5.csv.gz aloi-hsb-6x6x6.csv.gz aloi-hsb-7x7x7.csv.gz aloi-hsb-7x2x2.csv.gz aloi-hsb-7x3x3.csv.gz aloi-hsb-14x3x3.csv.gz aloi-hsb-8x4x4.csv.gz aloi-hsb-9x5x5.csv.gz aloi-hsb-13x4x4.csv.gz aloi-hsb-14x5x5.csv.gz aloi-hsb-10x6x6.csv.gz aloi-hsb-14x6x6.csv.gz
Color similiarity	Average similarity to 77 reference colors (not histograms) 18 colors x 2 sat x 2 bri + 5 grey values (incl. white, black)	aloi-colorsim77.arff.gz (feature subsets are meaningful here, as these features are computed independently of each other)
Haralick features	First 13 Haralick features (radius 1 pixel)	aloi-haralick-1.csv.gz
Front to back	Vectors representing front face vs. back faces of individual objects	front.arff.gz
Basic light	Vectors indicating basic light situations	light.arff.gz
Manual annotations	Manually annotated object groups of semantically related objects such as cups	manual1.arff.gz

Outlier Detection Versions

Additionally, we generated a number of subsets for outlier detection:

Feature type	Description	Files
RGB Histograms	Downsampled to 100000 objects (553 outliers)	aloi-27d-100000-max10-tot553.csv.gz aloi-64d-100000-max10-tot553.csv.gz
	Downsampled to 75000 objects (717 outliers)	aloi-27d-75000-max4-tot717.csv.gz aloi-64d-75000-max4-tot717.csv.gz
	Downsampled to 50000 objects (1508 outliers)	aloi-27d-50000-max5-tot1508.csv.gz aloi-64d-50000-max5-tot1508.csv.gz

h
wine-clustering
huggingface.co
Updated Sep 12, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Trevor (2024). wine-clustering [Dataset]. https://huggingface.co/datasets/mltrev23/wine-clustering
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 12, 2024
Authors
Trevor
Description
Wine Clustering Dataset

Overview

The Wine Clustering Dataset contains data on various chemical properties of wines, intended for use in clustering tasks. This dataset is ideal for exploring clustering algorithms such as K-Means, hierarchical clustering, and others, to group wines based on their chemical composition.

Dataset Structure

The dataset is provided as a single CSV file named wine-clustering.csv. It contains 178 entries, each representing a unique wine… See the full description on the dataset page: https://huggingface.co/datasets/mltrev23/wine-clustering.
H
Replication data for: Nearest Neighbor Networks: clustering expression data...
dataverse.harvard.edu
Updated Feb 8, 2010
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
C. Huttenhower; A. Flamholz; J. Landis; S. Sahi; C. Myers; K. Olszewski; M. Hibbs; Siemers, N.,; O. Troyanskaya; H Coller (2010). Replication data for: Nearest Neighbor Networks: clustering expression data based on gene neighborhoods [Dataset]. http://doi.org/10.7910/DVN/PO4EWY
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/PO4EWY
Dataset updated
Feb 8, 2010
Dataset provided by
Harvard Dataverse
Authors
C. Huttenhower; A. Flamholz; J. Landis; S. Sahi; C. Myers; K. Olszewski; M. Hibbs; Siemers, N.,; O. Troyanskaya; H Coller
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Background The availability of microarrays measuring thousands of genes simultaneously across hundreds of biological conditions represents an opportunity to understand both individual biological pathways and the integrated workings of the cell. However, translating this amount of data into biological insight remains a daunting task. An important initial step in the analysis of microarray data is clustering of genes with similar behavior. A number of classical techniques are commonly used to perform this task, particularly hierarchical and K-means clustering, and many novel approaches have been suggested recently. While these approaches are useful, they are not without drawbacks; these methods can find clusters in purely random data, and even clusters enriched for biological functions can be skewed towards a small number of processes (e.g. ribosomes). Results We developed Nearest Neighbor Networks (NNN), a graph-based algorithm to generate clusters of genes with similar expression profiles. This method produces clusters based on overlapping cliques within an interaction network generated from mutual nearest neighborhoods. This focus on nearest neighbors rather than on absolute distance measures allows us to capture clusters with high connectivity even when they are spatially separated, and requiring mutual nearest neighbors allows gene s with no sufficiently similar partners to remain unclustered. We compared the clusters generated by NNN with those generated by eight other clustering methods. NNN was particularly successful at generating functionally coherent clusters with high precision, and these clusters generally represented a much broader selection of biological processes than those recovered by other methods. Conclusion The Nearest Neighbor Networks algorithm is a valuable clustering method that effectively groups genes that are likely to be functionally related. It is particularly attractive due to its simplicity, its success in the analysis of large datasets, and its ability to span a wide range of biological functions with high precision
D
Replication Data for: The semantic structuring of minimizing constructions...
dataverse.no
search.dataone.org
Updated Sep 2, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Margot Van den Heede; Margot Van den Heede; Peter Lauwers; Peter Lauwers (2025). Replication Data for: The semantic structuring of minimizing constructions in present-day Netherlandic Dutch: a distribution-based cluster analysis [Dataset]. http://doi.org/10.18710/GIKMKM
Explore at:
text/comma-separated-values(95929), type/x-r-syntax(1369), txt(6170), txt(13467)Available download formats
Unique identifier
https://doi.org/10.18710/GIKMKM
Dataset updated
Sep 2, 2025
Dataset provided by
DataverseNO
Authors
Margot Van den Heede; Margot Van den Heede; Peter Lauwers; Peter Lauwers
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Netherlands
Dataset funded by
Special Research Fund for Concerted Research Actions - Ghent University
Description
Dataset abstract: This dataset contains the data files that were used for the cluster analysis of the Dutch minimizing construction, as described in the publication cited below. In addition to a ReadMe file, it contains three files: A txt file is provided with the corpus queries that were used to find tokens of the minimizing constructions in the Dutch Web 2014 (nlTenTen14) corpus, available via Sketch Engine (more information about the TenTen corpora: Jakubíček, M., A. Kilgarriff, V. Kovář, P. Rychlý & V. Suchomel (2013). The TenTen corpus family. In: 7th International Corpus Linguistics Conference CL. Lancaster, 125–127). A csv file is provided that forms the input file for the cluster analysis. It contains a list of 5,863 minimizer-predicate combinations, more specifically a list of the predicates that are combined with the minimizers that have a token frequency of at least 10 in my dataset. An R-script is provided with the code to perform the cluster analysis in R. Article abstract: This paper examines the semantic structuring of a paradigm of 89 minimizers, i.e., nouns that reinforce sentential negation in present-day Netherlandic Dutch, such as meter ‘meter’ in voor geen meter vertrouwen ‘not to trust for a meter’. Cosine distances are computed on the basis of the predicates the minimizers combine with in a sample of 100 tokens downloaded from the Dutch Web corpus 2014 (nlTenTen14) and clustered according to the Partitioning Around Medoids (PAM) algorithm into nine semantic clusters. The clusters largely correspond to semantic categories such as taboo terms or units of money. This suggests that, in general, minimizers belonging to the same semantic domain are combined with a similar (core) set of predicates. Based on the shared predicates per cluster, we detect signs of analogical attraction between minimizers or, conversely, competition. Crucially, low silhouette widths enable us to identify outliers in their respective clusters, for instance, minimizing nouns that exhibit signs of context expansion, as shown by their combination with semantically non-harmonious verbs. As such, this paper provides a synchronic snapshot of the semantic processes involved in (incipient) grammaticalization of minimizing nouns and, more in general, it illustrates how distributional semantics offers a heuristic to analyze the structure of a network of comparable micro-constructions.
f
Data from: Integrating Multidimensional Data for Clustering Analysis With...
tandf.figshare.com
pdf
Updated Feb 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Seyoung Park; Hao Xu; Hongyu Zhao (2024). Integrating Multidimensional Data for Clustering Analysis With Applications to Cancer Patient Data [Dataset]. http://doi.org/10.6084/m9.figshare.11873859.v2
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.11873859.v2
Dataset updated
Feb 19, 2024
Dataset provided by
Taylor & Francis
Authors
Seyoung Park; Hao Xu; Hongyu Zhao
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Advances in high-throughput genomic technologies coupled with large-scale studies including The Cancer Genome Atlas (TCGA) project have generated rich resources of diverse types of omics data to better understand cancer etiology and treatment responses. Clustering patients into subtypes with similar disease etiologies and/or treatment responses using multiple omics data types has the potential to improve the precision of clustering than using a single data type. However, in practice, patient clustering is still mostly based on a single type of omics data or ad hoc integration of clustering results from individual data types, leading to potential loss of information. By treating each omics data type as a different informative representation from patients, we propose a novel multi-view spectral clustering framework to integrate different omics data types measured from the same subject. We learn the weight of each data type as well as a similarity measure between patients via a nonconvex optimization framework. We solve the proposed nonconvex problem iteratively using the ADMM algorithm and show the convergence of the algorithm. The accuracy and robustness of the proposed clustering method is studied both in theory and through various synthetic data. When our method is applied to the TCGA data, the patient clusters inferred by our method show more significant differences in survival times between clusters than those inferred from existing clustering methods. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Clustering results of real datasets.
plos.figshare.com
xls
Updated Jun 5, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wei Xingqiong; Li Kang (2025). Clustering results of real datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0325161.t004
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0325161.t004
Dataset updated
Jun 5, 2025
Dataset provided by
PLOShttp://plos.org/
Authors
Wei Xingqiong; Li Kang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Clustering is a fundamental tool in data mining, widely used in various fields such as image segmentation, data science, pattern recognition, and bioinformatics. Density Peak Clustering (DPC) is a density-based method that identifies clusters by calculating the local density of data points and selecting cluster centers based on these densities. However, DPC has several limitations. First, it requires a cutoff distance to calculate local density, and this parameter varies across datasets, which requires manual tuning and affects the algorithm’s performance. Second, the number of cluster centers must be manually specified, as the algorithm cannot automatically determine the optimal number of clusters, making the algorithm dependent on human intervention. To address these issues, we propose an adaptive Density Peak Clustering (DPC) method, which automatically adjusts parameters like cutoff distance and the number of clusters, based on the Delaunay graph. This approach uses the Delaunay graph to calculate the connectivity between data points and prunes the points based on these connections, automatically determining the number of cluster centers. Additionally, by optimizing clustering indices, the algorithm automatically adjusts its parameters, enabling clustering without any manual input. Experimental results on both synthetic and real-world datasets demonstrate that the proposed algorithm outperforms similar methods in terms of both efficiency and clustering accuracy.
Clustering Exercises
kaggle.com
zip
Updated Apr 29, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Joonas (2022). Clustering Exercises [Dataset]. https://www.kaggle.com/datasets/joonasyoon/clustering-exercises
Explore at:
zip(3602272 bytes)Available download formats
Dataset updated
Apr 29, 2022
Authors
Joonas
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Overview

https://i.imgur.com/ZUX61cD.png" alt="Overview">

Context

The method of disuniting similar data is called clustering. you can create dummy data for classifying clusters by method from sklearn package but it needs to put your effort into job.

For users who making hard test cases for example of clustering, I think this dataset helps them.

Try out to select a meaningful number of clusters, and dividing the data into clusters. Here are exercises for you.

Dataset

All csv files contain a lots of x, y and color, and you can see above figures.

If you want to use position as type of integer, scale it and round off to integer as like x = round(x * 100).

Furthermore, here is GUI Tool to generate 2D points for clustering. you can make your dataset with this tool. https://www.joonas.io/cluster-paint

Stay tuned for further updates! also if any idea, you can comment me.
f
Data from: A model-based clustering method to detect infectious disease...
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Nov 13, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Poon, Art F. Y.; McCloskey, Rosemary M. (2017). A model-based clustering method to detect infectious disease transmission outbreaks from sequence variation [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001817256
Explore at:
Dataset updated
Nov 13, 2017
Authors
Poon, Art F. Y.; McCloskey, Rosemary M.
Description
Clustering infections by genetic similarity is a popular technique for identifying potential outbreaks of infectious disease, in part because sequences are now routinely collected for clinical management of many infections. A diverse number of nonparametric clustering methods have been developed for this purpose. These methods are generally intuitive, rapid to compute, and readily scale with large data sets. However, we have found that nonparametric clustering methods can be biased towards identifying clusters of diagnosis—where individuals are sampled sooner post-infection—rather than the clusters of rapid transmission that are meant to be potential foci for public health efforts. We develop a fundamentally new approach to genetic clustering based on fitting a Markov-modulated Poisson process (MMPP), which represents the evolution of transmission rates along the tree relating different infections. We evaluated this model-based method alongside five nonparametric clustering methods using both simulated and actual HIV sequence data sets. For simulated clusters of rapid transmission, the MMPP clustering method obtained higher mean sensitivity (85%) and specificity (91%) than the nonparametric methods. When we applied these clustering methods to published sequences from a study of HIV-1 genetic clusters in Seattle, USA, we found that the MMPP method categorized about half (46%) as many individuals to clusters compared to the other methods. Furthermore, the mean internal branch lengths that approximate transmission rates were significantly shorter in clusters extracted using MMPP, but not by other methods. We determined that the computing time for the MMPP method scaled linearly with the size of trees, requiring about 30 seconds for a tree of 1,000 tips and about 20 minutes for 50,000 tips on a single computer. This new approach to genetic clustering has significant implications for the application of pathogen sequence analysis to public health, where it is critical to robustly and accurately identify clusters for the most cost-effective deployment of outbreak management and prevention resources.
Data from: Skeleton Clustering: Dimension-Free Density-Aided Clustering
tandf.figshare.com
pdf
Updated Jun 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zeyu Wei; Yen-Chi Chen (2023). Skeleton Clustering: Dimension-Free Density-Aided Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.21976961.v1
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.21976961.v1
Dataset updated
Jun 4, 2023
Dataset provided by
Taylor & Francishttps://taylorandfrancis.com/
Authors
Zeyu Wei; Yen-Chi Chen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We introduce a density-aided clustering method called Skeleton Clustering that can detect clusters in multivariate and even high-dimensional data with irregular shapes. To bypass the curse of dimensionality, we propose surrogate density measures that are less dependent on the dimension but have intuitive geometric interpretations. The clustering framework constructs a concise representation of the given data as an intermediate step and can be thought of as a combination of prototype methods, density-based clustering, and hierarchical clustering. We show by theoretical analysis and empirical studies that the skeleton clustering leads to reliable clusters in multivariate and high-dimensional scenarios. Supplementary materials for this article are available online.
f
The details of experiment datasets.
plos.figshare.com
xls
Updated Jun 5, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wei Xingqiong; Li Kang (2025). The details of experiment datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0325161.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0325161.t001
Dataset updated
Jun 5, 2025
Dataset provided by
PLOS ONE
Authors
Wei Xingqiong; Li Kang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Clustering is a fundamental tool in data mining, widely used in various fields such as image segmentation, data science, pattern recognition, and bioinformatics. Density Peak Clustering (DPC) is a density-based method that identifies clusters by calculating the local density of data points and selecting cluster centers based on these densities. However, DPC has several limitations. First, it requires a cutoff distance to calculate local density, and this parameter varies across datasets, which requires manual tuning and affects the algorithm’s performance. Second, the number of cluster centers must be manually specified, as the algorithm cannot automatically determine the optimal number of clusters, making the algorithm dependent on human intervention. To address these issues, we propose an adaptive Density Peak Clustering (DPC) method, which automatically adjusts parameters like cutoff distance and the number of clusters, based on the Delaunay graph. This approach uses the Delaunay graph to calculate the connectivity between data points and prunes the points based on these connections, automatically determining the number of cluster centers. Additionally, by optimizing clustering indices, the algorithm automatically adjusts its parameters, enabling clustering without any manual input. Experimental results on both synthetic and real-world datasets demonstrate that the proposed algorithm outperforms similar methods in terms of both efficiency and clustering accuracy.
Dataset and code to accompany the manuscript 'Consistency of clustering...
zenodo.org
bin, nc +1
Updated Oct 2, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rebecca Millington; Rebecca Millington; Dale Partridge; Dale Partridge; Helen R. Powley; Helen R. Powley; Gennadi Lessin; Gennadi Lessin; David Moffat; David Moffat; Jeremy Blackford; Jeremy Blackford (2025). Dataset and code to accompany the manuscript 'Consistency of clustering analysis of complex 3D ocean datasets' [Dataset]. http://doi.org/10.5281/zenodo.17227522
Explore at:
text/x-python, bin, ncAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.17227522
Dataset updated
Oct 2, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Rebecca Millington; Rebecca Millington; Dale Partridge; Dale Partridge; Helen R. Powley; Helen R. Powley; Gennadi Lessin; Gennadi Lessin; David Moffat; David Moffat; Jeremy Blackford; Jeremy Blackford
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data and code needed to recreate all analyses and figures presented in the manuscript 'Consistency of clustering analysis of complex 3D ocean datasets'.

Data

'all_data_for_paper.nc': model data, 2000-2004 mean of all variables used, provided at all depth levels.

'mesh_mask.nc': domain and depth data file to be used alongside model data.

ModelViz (code)

Tool to classify marine biogeochemical output from numerical models

Written by rmi, dapa & dmof

Preprocessing

preprocess_amm7_functions.py
Functions needed to run different preprocessing scripts.
preprocess_all_depths.py
First script to run. Extracts relevant variables and takes temporal mean for physical,biogeochemical and ecological variables. For physical - calculates PAR from qsr.
preprocess_amm7_mean.py
Use for surface biogeochemical and ecological sets (faster)
preprocess_DI_DA.py
Use for depth integrated, depth averaged and bottom biogeochemical and ecological sets. Can use for surface but slower.
preprocess_amm7_mean_one_depth.py
Extracts data at specified depth (numeric). Works for biogeochemical and ecological variables.
preprocess_physics.py
Takes all_depths_physics and calculates physics data at different depths.

Metrics

silhouette_nvars.py
Calculates silhouette score for inputs with different numbers of variables and clusters
rand_index.py
rand_index_depth.py
remove_one_var.py
Calculates rand index between cluster sets with one variable removed and original set

Clustering

Modelviz.py
Contains functions for applying clustering to data

Plotting

kmeans-paper-plots.ipynb
Produces figure 4
kmeans-paper-plots-illustrate-normalisation.ipynb
Produces figure 2
kmeans-paper-plots-depths.ipynb
Produces figures 5-7
plot_silhouette.ipynb
Produces figure 3
f
Data from: Pattern Recognition for Steam Flooding Field Applications Based...
datasetcatalog.nlm.nih.gov
acs.figshare.com
Updated May 25, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hao, Jian; Wei, Mingzhen; Bai, Baojun; Zhang, Na; Jia, Shun; Wang, Xiaopeng (2022). Pattern Recognition for Steam Flooding Field Applications Based on Hierarchical Clustering and Principal Component Analysis [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000412528
Explore at:
Dataset updated
May 25, 2022
Authors
Hao, Jian; Wei, Mingzhen; Bai, Baojun; Zhang, Na; Jia, Shun; Wang, Xiaopeng
Description
Steam flooding is a complex process that has been considered as an effective enhanced oil recovery technique in both heavy oil and light oil reservoirs. Many studies have been conducted on different sets of steam flooding projects using the conventional data analysis methods, while the implementation of machine learning algorithms to find the hidden patterns is rarely found. In this study, a hierarchical clustering algorithm (HCA) coupled with principal component analysis is used to analyze the steam flooding projects worldwide. The goal of this research is to group similar steam flooding projects into the same cluster so that valuable operational design experiences and production performance from the analogue cases can be referenced for decision-making. Besides, hidden patterns embedded in steam flooding applications can be revealed based on data characteristics of each cluster for different reservoir/fluid conditions. In this research, principal component analysis is applied to project original data to a new feature space, which finds two principal components to represent the eight reservoir/fluid parameters (8D) but still retain about 90% of the variance. HCA is implemented with the optimized design of five clusters, Euclidean distance, and Ward’s linkage method. The results of the hierarchical clustering depict that each cluster detects a unique range of each property, and the analogue cases present that fields under similar reservoir/fluid conditions could share similar operational design and production performance.
d
Replication Data for the Cluster Analysis
search.dataone.org
dataverse.harvard.edu
+1more
Updated Nov 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Singleton, Alex (2023). Replication Data for the Cluster Analysis [Dataset]. http://doi.org/10.7910/DVN/VJCJB3
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/VJCJB3
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Singleton, Alex
Description
For each postcode in Great Britain, 64 variables were created through the application of a convolutional autoencoder to a set of Sentinel 2 images surrounding a 160m buffered area around each postcode. These then formed input into a cluster analysis.
Hierarchical clustering of 7 Million Proteins
kaggle.com
zip
Updated Aug 9, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rajasankar Viswanathan (2017). Hierarchical clustering of 7 Million Proteins [Dataset]. https://www.kaggle.com/rajasankar/hierarchical-clustering-of-7-million-proteins
Explore at:
zip(32711247 bytes)Available download formats
Dataset updated
Aug 9, 2017
Authors
Rajasankar Viswanathan
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Context

Clustering using distance needs all-against-all matching. New algorithm can cluster 7 Million proteins using approximate clustering under one hour.

Content

cat: contains Hierarchical sequence. protein_names : List of proteins in the group. Original data can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/env_nr.gz

Inspiration

Researchers can use the data to find relationships between proteins more easily.

Data Structure Description

Data set has two files. protein_groupings file is the clustered data. This file has only names. Sequences for the names can be found in protein_name_letter file.

How it is created

Data is downloaded from the NCBI site and fasta format was converted into full length sequence. Sequences were fed into the clustering algorithm.

Cluster mappings

As this is the Hierarchical clustering, relationship between the sequences can be found by comparing the values in gn_list .

All the groups start with cluster_id:0 , split:0 and progress into matched splits. Difference between the splits would indicate that how much two sequences can match. Comparing the cluster_id would check if the sequences belong to same group or different group.

cluster_id = unique id for cluster. split = approximate similarity between the sequences. This is an absolute value. 63 would mean there is 63 letters would match between the sequences. Higher the value more similarity. inner_cluster_id = unique id to compare inner cluster matches. total clusters = number of clusters after approximate match is generated.

Due to space restrictions in Kaggle, this data set has only 9093 groups containing 129696 sequences.

One sequence may be in more than cluster because similarity is calculated as if all-against-all comparison is used.

Ex : For A, B, C , if A ~ B = 50, B~ C = 50 and A~C =0 then clustering will have two groups [A,B] and [B,C]

If you need full dataset for your research, contact me.

What is the issue with previous dataset

Previous dataset had issues with similarity comparisons between intra-clusters. Inner cluster comparison worked. This is fixed in the new version.
Z
Data from: Dataset for Investigating Anomalies in Compute Clusters
data.niaid.nih.gov
zenodo.org
Updated Nov 29, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
McSpadden, Diana; Yasir, Alanazi; Hess, Bryan; Hild, Laura; Jones, Mark; Lu, Yiyang; Mohammed, Ahmed; Moore, Wesley; Ren, Jie; Schram, Malachi; Smirni, Evgenia (2023). Dataset for Investigating Anomalies in Compute Clusters [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10058229
Explore at:
Dataset updated
Nov 29, 2023
Dataset provided by
William & Mary
Thomas Jefferson National Accelerator Facility
Authors
McSpadden, Diana; Yasir, Alanazi; Hess, Bryan; Hild, Laura; Jones, Mark; Lu, Yiyang; Mohammed, Ahmed; Moore, Wesley; Ren, Jie; Schram, Malachi; Smirni, Evgenia
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Abstract The dataset was collected for 332 compute nodes throughout May 19 - 23, 2023. May 19 - 22 characterizes normal compute cluster behavior, while May 23 includes an anomalous event. The dataset includes eight CPU, 11 disk, 47 memory, and 22 Slurm metrics. It represents five distinct hardware configurations and contains over one million records, totaling more than 180GB of raw data. Background Motivated by the goal to develop a digital twin of a compute cluster, the dataset was collected using a Prometheus server (1) scraping the Thomas Jefferson National Accelerator Facility (JLab) batch cluster used to run an assortment of physics analysis and simulation jobs, where analysis workloads leverage data generated from the laboratory's electron accelerator, and simulation workloads generate large amounts of flat data that is then carved to verify amplitudes. Metrics were scraped from the cluster throughout May 19 - 23, 2023. Data from May 19 to May 22 primarily reflected normal system behavior, while May 23, 2023, recorded a notable anomaly. This anomaly was severe enough to necessitate intervention by JLab IT Operations staff. The metrics were collected from CPU, disk, memory, and Slurm. Metrics related to CPU, disk, and memory provide insights into the status of individual compute nodes. Furthermore, Slurm metrics collected from the network have the capability to detect anomalies that may propagate to compute nodes executing the same job. Usage Notes While the data from May 19 - 22 characterizes normal compute cluster behavior, and May 23 includes anomalous observations, the dataset cannot be considered labeled data. The set of nodes and the exact start and end time affected nodes demonstrate abnormal effects are unclear. Thus, the dataset could be used to develop unsupervised machine-learning algorithms to detect anomalous events in a batch cluster. https://doi.org/10.48550/arXiv.2311.16129

Facebook

Twitter

Click to copy link

Link copied

Cite

The Devastator (2023). Galaxy clustering [Dataset]. https://www.kaggle.com/datasets/thedevastator/clustering-polygons-utilizing-iris-moon-and-circ

Data from: Galaxy clustering

Iris, Moon, and Circles datasets for Galaxy clustering tutorial

Explore at:

zip(6339 bytes)Available download formats

Dataset updated

Jan 3, 2023

Authors

The Devastator

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

Galaxy clustering

Iris, Moon, and Circles datasets for Galaxy clustering tutorial

By [source]

About this dataset

This dataset contains a wealth of information that can be used to explore the effectiveness of various clustering algorithms. With its inclusion of numerical measurements (X, Y, Sepal.Length, and Petal.Length) and categorical values (Species), it is possible to investigate the relationship between different types of variables and clustering performance. Additionally, by comparing results for the 3 datasets provided - moon.csv (which contains x and y coordinates), iris.csv (which contains measurements for sepal and petal lengths),and circles.csv - we can gain insights into how different data distributions affect clustering techniques such as K-Means or Hierarchical Clustering among others!

More Datasets

For more datasets, click here.

Featured Notebooks

🚨 Your notebook can be here! 🚨!

How to use the dataset

This dataset can also be a great starting point to further explore more complex clusters by using higher dimensional space variables such as color or texture that may be present in other datasets not included here but which can help to form more accurate groups when using cluster-analysis algorithms. Additionally, it could also assist in visualization projects where clusters may need to be generated such as plotting mapped data points or examining relationships between two different variables within a certain region drawn on a chart.

To use this dataset effectively it is important to understand how exactly your chosen algorithm works since some require specifying parameters beforehand while others take care of those details automatically; otherwise the interpretation may be invalid depending on the methods used alongside clustering you intend for your project. Furthermore, familiarize yourself with concepts like silhouette score and rand index - these are commonly used metrics that measure your cluster’s performance against other clusterings models so you know if what you have done so far satisfies an acceptable level of accuracy or not yet! Good luck!

Research Ideas

Utilizing the sepal and petal lengths and widths to perform flower recognition or part of a larger image recognition pipeline.

Classifying the data points in each dataset by the X-Y coordinates using clustering algorithms to analyze galaxy locations or overall formation patterns for stars, planets, or galaxies.

Exploring correlations between species of flowers in terms of sepal/petal lengths by performing supervised learning tasks such as classification with this dataset

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: moon.csv | Column name | Description | |:--------------|:------------------------------------------| | X | X coordinate of the data point. (Numeric) | | Y | Y coordinate of the data point. (Numeric) |

File: iris.csv | Column name | Description | |:-----------------|:---------------------------------------------| | Sepal.Length | Length of the sepal of the flower. (Numeric) | | Petal.Length | Length of the petal of the flower. (Numeric) | | Species | Species of the flower. (Categorical) |

Acknowledgements

If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit .

Clear search

Close search

Google apps

Main menu

Data from: Galaxy clustering

Galaxy clustering

Iris, Moon, and Circles datasets for Galaxy clustering tutorial

About this dataset

More Datasets

Featured Notebooks

How to use the dataset

Research Ideas

Acknowledgements

License

Columns

Acknowledgements

Patient Dataset for Clustering (Raw Data)

Data_Sheet_1_Improved space breakdown method – A robust clustering technique...

Reference list of 265 sources used for the discovery of relationships...

Educational Attainment in North Carolina Public Schools: Use of statistical...

ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of...

wine-clustering

Replication data for: Nearest Neighbor Networks: clustering expression data...

Replication Data for: The semantic structuring of minimizing constructions...

Data from: Integrating Multidimensional Data for Clustering Analysis With...

Clustering results of real datasets.

Clustering Exercises

Overview

Context

Dataset

Data from: A model-based clustering method to detect infectious disease...

Data from: Skeleton Clustering: Dimension-Free Density-Aided Clustering

The details of experiment datasets.

Dataset and code to accompany the manuscript 'Consistency of clustering...

Data

ModelViz (code)

Preprocessing

Metrics

Clustering

Plotting

Data from: Pattern Recognition for Steam Flooding Field Applications Based...

Replication Data for the Cluster Analysis

Hierarchical clustering of 7 Million Proteins

Context

Content

Inspiration

Data Structure Description

How it is created

Cluster mappings

What is the issue with previous dataset

Data from: Dataset for Investigating Anomalies in Compute Clusters

Data from: Galaxy clustering

Iris, Moon, and Circles datasets for Galaxy clustering tutorial

Galaxy clustering

Iris, Moon, and Circles datasets for Galaxy clustering tutorial

About this dataset

More Datasets

Featured Notebooks

How to use the dataset

Research Ideas

Acknowledgements

License

Columns

Acknowledgements