Wine Clustering Dataset
Overview
The Wine Clustering Dataset contains data on various chemical properties of wines, intended for use in clustering tasks. This dataset is ideal for exploring clustering algorithms such as K-Means, hierarchical clustering, and others, to group wines based on their chemical composition.
Dataset Structure
The dataset is provided as a single CSV file named wine-clustering.csv. It contains 178 entries, each representing a unique wine… See the full description on the dataset page: https://huggingface.co/datasets/mltrev23/wine-clustering.
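A minimal usage sketch (assuming pandas and scikit-learn; the chemical feature columns are not enumerated here, so all numeric columns are used, and three clusters is simply a common choice for this 178-wine data):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("wine-clustering.csv")

# Chemical measurements are on very different scales, so standardize first.
X = StandardScaler().fit_transform(df.select_dtypes("number"))

# Fit K-Means with an assumed k of 3 and inspect the cluster sizes.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(pd.Series(labels).value_counts())
```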
About Dataset
- Based on patient symptoms, identify patients needing immediate resuscitation, assign patients to a predesignated patient care area to prioritize their care, and initiate diagnostic/therapeutic measures as appropriate.
- Three individual datasets are used, one per urgent illness/injury. Each dataset has its own features and symptoms for each patient; they were merged to identify the most severe symptoms for each illness and assign treatment priority.
PROJECT SUMMARY
Triage refers to the sorting of injured or sick people according to their need for emergency medical attention; it is a method of determining priority for who gets care first.
BACKGROUND
Triage is the prioritization of patient care (or of victims during a disaster) based on illness/injury, symptoms, severity, prognosis, and resource availability. The purpose of triage is to identify patients needing immediate resuscitation, to assign patients to a predesignated patient care area and thereby prioritize their care, and to initiate diagnostic/therapeutic measures as appropriate.
BUSINESS CHALLENGE
Based on patient symptoms, identify patients needing immediate resuscitation, assign them to a predesignated patient care area to prioritize their care, and initiate diagnostic/therapeutic measures as appropriate.
https://creativecommons.org/publicdomain/zero/1.0/
By [source]
This dataset contains a wealth of information for exploring the effectiveness of various clustering algorithms. With its mix of numerical measurements (X, Y, Sepal.Length, and Petal.Length) and categorical values (Species), it is possible to investigate the relationship between variable types and clustering performance. Additionally, by comparing results across the three datasets provided (moon.csv with x and y coordinates, iris.csv with sepal and petal length measurements, and circles.csv), we can gain insight into how different data distributions affect clustering techniques such as K-Means and hierarchical clustering.
This dataset can also serve as a starting point for exploring more complex clusters in higher-dimensional spaces, using variables such as color or texture that may be present in other datasets; such features can help form more accurate groups in cluster analysis. It could also assist in visualization projects where clusters need to be generated, such as plotting mapped data points or examining relationships between two variables within a region of a chart.
To use this dataset effectively, it is important to understand how your chosen algorithm works: some require parameters to be specified beforehand, while others handle those details automatically; otherwise, the interpretation of results may be invalid. Also familiarize yourself with metrics such as the silhouette score and the Rand index, which measure a clustering's quality and allow comparison against other clustering models, so you can judge whether your results reach an acceptable level of accuracy. Good luck!
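As a concrete illustration of those metrics (a sketch assuming scikit-learn and the iris.csv columns described below):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Load the iris subset; column names follow the table below.
df = pd.read_csv("iris.csv")
X = StandardScaler().fit_transform(df[["Sepal.Length", "Petal.Length"]])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Internal metric: how well separated the clusters are (range -1 to 1).
print("silhouette:", silhouette_score(X, labels))
# External metric: agreement with the known species labels.
print("adjusted Rand:", adjusted_rand_score(df["Species"], labels))
```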
- Utilizing the sepal and petal lengths and widths for flower recognition, or as part of a larger image recognition pipeline.
- Classifying the data points in each dataset by their X-Y coordinates using clustering algorithms, e.g., to analyze galaxy locations or formation patterns for stars, planets, or galaxies.
- Exploring correlations between flower species in terms of sepal/petal lengths by performing supervised learning tasks such as classification with this dataset.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (Public Domain Dedication). No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.
File: moon.csv

| Column name | Description |
|:---|:---|
| X | X coordinate of the data point. (Numeric) |
| Y | Y coordinate of the data point. (Numeric) |
File: iris.csv

| Column name | Description |
|:---|:---|
| Sepal.Length | Length of the sepal of the flower. (Numeric) |
| Petal.Length | Length of the petal of the flower. (Numeric) |
| Species | Species of the flower. (Categorical) |
If you use this dataset in your research, please credit the original authors.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example data for understanding the implementation of K-Means.
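A minimal K-Means sketch (assuming scikit-learn; the dataset's own columns are not specified here, so synthetic blobs stand in):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate toy 2D data with three well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Fit K-Means and inspect the learned centers and assignments.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("centers:\n", km.cluster_centers_)
print("first 10 labels:", km.labels_[:10])
```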
This dataset was created by Deepansh Saxena.
https://creativecommons.org/publicdomain/zero/1.0/
Overview figure: https://i.imgur.com/ZUX61cD.png
Clustering is the method of grouping similar data points together. You can create dummy data for cluster classification with functions from the sklearn package, but that requires some effort.
This dataset is meant to help users who are building hard test cases for clustering examples.
As an exercise, try to select a meaningful number of clusters and divide the data into clusters accordingly.
All CSV files contain many rows of x, y, and color values, as shown in the figure above.
If you want to use positions as integers, scale them and round to the nearest integer, e.g., x = round(x * 100).
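For instance (a sketch assuming pandas; the file name cluster_points.csv is hypothetical):

```python
import pandas as pd

df = pd.read_csv("cluster_points.csv")  # hypothetical file name

# Scale coordinates by 100 and round to the nearest integer.
df["x"] = (df["x"] * 100).round().astype(int)
df["y"] = (df["y"] * 100).round().astype(int)
```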
Furthermore, there is a GUI tool to generate 2D points for clustering; you can make your own dataset with it: https://www.joonas.io/cluster-paint
Stay tuned for further updates! If you have any ideas, feel free to leave a comment.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset collection comprises 15 diverse two-dimensional datasets specifically designed for clustering analysis. Each dataset contains three columns: x, y, and target, where x and y represent the coordinates of the data points, and target indicates the cluster label.
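A quick way to inspect one of these datasets (a sketch assuming pandas and matplotlib; dataset1.csv is a hypothetical file name for one of the 15 files):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("dataset1.csv")  # hypothetical file name

# Color each point by its ground-truth cluster label.
plt.scatter(df["x"], df["y"], c=df["target"], cmap="tab10", s=10)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Ground-truth clusters")
plt.show()
```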
Visualisation of data: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F20292402%2F3cc81328beabc815fe500973fee1f7ac%2Fdescription.png?generation=1737484616903723&alt=media
The dataset includes:
1. learning data containing e-commerce user sessions (DATASET-X-session_visit.csv files)
2. clustering results (including metric values and customer clusters), per algorithm tested
3. calculations (xlsx file)
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
182 simulated datasets (the first set contains small datasets and the second set contains large datasets) with different cluster compositions, i.e., different numbers of clusters and separation values, generated using the clusterGeneration package in R. Each set consists of 91 datasets in comma-separated values (CSV) format (182 CSV files in total), with 3-15 clusters and separation values from 0.1 to 0.7. Separation values can range between (-0.999, 0.999); a higher separation value indicates a cluster structure with more separable clusters. The size of the dataset, the number of clusters, and the separation value are encoded in the file name, size_X_n_Y_sepval_Z.csv (a parsing sketch follows this list):
- size of the dataset = X
- number of clusters in the dataset = Y
- separation value of the clusters in the dataset = Z
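A sketch of recovering these parameters from the file names (assuming Python; the folder name simulated_datasets and the exact token format of X are assumptions):

```python
import re
from pathlib import Path

# Recover size, number of clusters, and separation value from the
# size_X_n_Y_sepval_Z.csv naming convention described above.
pattern = re.compile(r"size_([^_]+)_n_(\d+)_sepval_([\d.]+)\.csv")

for path in Path("simulated_datasets").glob("*.csv"):  # hypothetical folder
    m = pattern.match(path.name)
    if m:
        size, n_clusters, sepval = m.group(1), int(m.group(2)), float(m.group(3))
        print(path.name, "->", size, n_clusters, sepval)
```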
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data sets were originally created for the following publications:
M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek
Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?
In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, 2010.
H.-P. Kriegel, E. Schubert, A. Zimek
Evaluation of Multiple Clustering Solutions
In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with ECML PKDD 2011, Athens, Greece, 2011.
The outlier data set versions were introduced in:
E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel
On Evaluation of Outlier Rankings and Outlier Scores
In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, 2012.
They are derived from the original image data available at https://aloi.science.uva.nl/
The image acquisition process is documented in the original ALOI work: J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders, The Amsterdam library of object images, Int. J. Comput. Vision, 61(1), 103-112, January, 2005
Additional information is available at: https://elki-project.github.io/datasets/multi_view
The following views are currently available:
| Feature type | Description | Files |
|---|---|---|
| Object number | Sparse 1000 dimensional vectors that give the true object assignment | objs.arff.gz |
| RGB color histograms | Standard RGB color histograms (uniform binning) | aloi-8d.csv.gz aloi-27d.csv.gz aloi-64d.csv.gz aloi-125d.csv.gz aloi-216d.csv.gz aloi-343d.csv.gz aloi-512d.csv.gz aloi-729d.csv.gz aloi-1000d.csv.gz |
| HSV color histograms | Standard HSV/HSB color histograms in various binnings | aloi-hsb-2x2x2.csv.gz aloi-hsb-3x3x3.csv.gz aloi-hsb-4x4x4.csv.gz aloi-hsb-5x5x5.csv.gz aloi-hsb-6x6x6.csv.gz aloi-hsb-7x7x7.csv.gz aloi-hsb-7x2x2.csv.gz aloi-hsb-7x3x3.csv.gz aloi-hsb-14x3x3.csv.gz aloi-hsb-8x4x4.csv.gz aloi-hsb-9x5x5.csv.gz aloi-hsb-13x4x4.csv.gz aloi-hsb-14x5x5.csv.gz aloi-hsb-10x6x6.csv.gz aloi-hsb-14x6x6.csv.gz |
| Color similarity | Average similarity to 77 reference colors (not histograms): 18 colors x 2 sat x 2 bri + 5 grey values (incl. white, black) | aloi-colorsim77.arff.gz (feature subsets are meaningful here, as these features are computed independently of each other) |
| Haralick features | First 13 Haralick features (radius 1 pixel) | aloi-haralick-1.csv.gz |
| Front to back | Vectors representing front face vs. back faces of individual objects | front.arff.gz |
| Basic light | Vectors indicating basic light situations | light.arff.gz |
| Manual annotations | Manually annotated object groups of semantically related objects such as cups | manual1.arff.gz |
Outlier Detection Versions
Additionally, we generated a number of subsets for outlier detection:
| Feature type | Description | Files |
|---|---|---|
| RGB Histograms | Downsampled to 100000 objects (553 outliers) | aloi-27d-100000-max10-tot553.csv.gz aloi-64d-100000-max10-tot553.csv.gz |
| | Downsampled to 75000 objects (717 outliers) | aloi-27d-75000-max4-tot717.csv.gz aloi-64d-75000-max4-tot717.csv.gz |
| | Downsampled to 50000 objects (1508 outliers) | aloi-27d-50000-max5-tot1508.csv.gz aloi-64d-50000-max5-tot1508.csv.gz |
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Customer Personality Analysis involves a thorough examination of a company's optimal customer profiles. This analysis facilitates a deeper understanding of customers, enabling businesses to tailor products to meet the distinct needs, behaviors, and concerns of various customer types.
By conducting a Customer Personality Analysis, businesses can refine their products based on the preferences of specific customer segments. Rather than allocating resources to market a new product to the entire customer database, companies can identify the segments most likely to be interested in the product. Subsequently, targeted marketing efforts can be directed toward those particular segments, optimizing resource utilization and increasing the likelihood of successful product adoption.
Details of the features are given below:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The synthetic data set has 600 points that form 20 clusters with 30 points each in 2 dimensions. The offset between a given point and its true center in each dimension is determined by Rand[0.02, 0.04] ∗ G where G is a random Gaussian number.
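A sketch of how such data could be generated (assuming NumPy; the original's center placement and RNG conventions are not specified, so uniform centers in the unit square are an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

n_clusters, pts_per_cluster, dim = 20, 30, 2
centers = rng.uniform(0, 1, size=(n_clusters, dim))  # assumed center placement

points = []
for c in centers:
    # Offset per dimension: Rand[0.02, 0.04] * G, with G a standard Gaussian.
    scale = rng.uniform(0.02, 0.04, size=(pts_per_cluster, dim))
    g = rng.standard_normal((pts_per_cluster, dim))
    points.append(c + scale * g)

X = np.vstack(points)  # 600 x 2 array
print(X.shape)
```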
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
input_prompts.csv provides the inputs for the ChatGPT API (countries and their respective prompts).
topic_consolidations.csv contains the 4,018 unique topics listed across all ChatGPT responses to prompts in our study, along with their corresponding cluster labels after applying K-means++ clustering (n = 50) via natural language processing with Bidirectional Encoder Representations from Transformers (BERT). ChatGPT response topics come from both versions (3.5 and 4) over 10 iterations each (per country).
ChatGPT_prompt_automation.ipynb is the Jupyter notebook of Python code used to run the API to prompt ChatGPT and gather responses.
topic_consolidation_BERT.ipynb is the Jupyter notebook of Python code used to process the 4,018 unique topics gathered through BERT NLP. This code was adapted from Vimal Pillar on Kaggle (https://www.kaggle.com/code/vimalpillai/text-clustering-with-sentence-bert).
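A hedged sketch of the general pipeline (sentence embeddings followed by K-means++; the model name all-MiniLM-L6-v2 and the topic column name are assumptions, not the study's exact configuration):

```python
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

topics = pd.read_csv("topic_consolidations.csv")["topic"]  # column name assumed

# Embed each topic string with a BERT-based sentence encoder.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
embeddings = model.encode(topics.tolist(), show_progress_bar=True)

# Cluster the embeddings into 50 groups with k-means++ initialization.
km = KMeans(n_clusters=50, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(embeddings)
print(pd.Series(labels).value_counts().head())
```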
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Validation data for the Astro scientific publication clustering benchmark dataset
This is the dataset used in the publication Donner, P. "Validation of the Astro dataset clustering solutions with external data", Scientometrics, DOI 10.1007/s11192-020-03780-3
Certain data included herein are derived from Clarivate Web of Science. © Copyright Clarivate 2020. All rights reserved.
Published with permission from Clarivate.
The original Astro dataset is not contained in this data. It can be obtained from http://topic-challenge.info/ and requires permission from Clarivate Analytics for use.
This dataset collection consists of four files. Each file contains an independent dataset that relates to the Astro dataset via Web of Science (WoS) record identifiers. These identifiers are called UTs. All files are tabular data in CSV format. In each, at least one column contains UT data. This should be used to link to the Astro dataset or other WoS data. The datasets are discussed in detail in the journal publication.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These files contain the 1-minute resolution dataset (“labeled_sunside_data.csv”) and the 15-minute-or-longer region list (“_region_list.csv”) for Toy-Edens et al.'s Classifying 8 years of MMS Dayside Plasma Regions via Unsupervised Machine Learning. The 1-minute resolution file contains the rolled-up 1-minute epoch, the probe name (mms1, mms2, mms3, mms4), the features that go into the clustering and post-cleansing methods, spacecraft positions (in GSE, GSM, and magnetic latitude/local time), raw and cleansed clustering labels, and the transition name. The 15+ minute region list contains the name of the plasma region type, the probe name (mms1, mms2, mms3, mms4), and the start and stop epochs of intervals of at least 15 minutes during which the probe is solidly within that region.
We ask that if you use any parts of the dataset that you cite Toy-Edens et al.'s Classifying 8 years of MMS Dayside Plasma Regions via Unsupervised Machine Learning (Submitted for review and publication to Journal of Geophysical Research: Space Research 1/2024 - DOI to be created upon acceptance).
This work was funded by grant 2225463 from the NSF GEM program.
The following tables detail the contents of the described files:
labeled_sunside_data.csv description
| Column Name | Description |
|---|---|
| Epoch | Epoch in datetime |
| probe | MMS probe name |
| ratio_max_width | Ratio of the width of the most prominent ion spectra peak (in number of energy channels) to the max number of energy channels. See paper for more information |
| ratio_high_low | Ratio of the mean of the log intensity of high energies in the ion spectra to the mean of the log intensity of low energies in the ion spectra. See paper for more information |
| norm_Btot | Magnitude of the total magnetic field normalized to 50 nT. See paper for more information |
| small_energy_mean | The denominator in ratio_high_low |
| large_energy_mean | The numerator in ratio_high_low |
| temp_total | Total temperature from the DIS moments. See paper for more information |
| r_gse_x | x position of the spacecraft in GSE |
| r_gse_y | y position of the spacecraft in GSE |
| r_gse_z | z position of the spacecraft in GSE |
| r_gsm_x | x position of the spacecraft in GSM |
| r_gsm_y | y position of the spacecraft in GSM |
| r_gsm_z | z position of the spacecraft in GSM |
| mlat | Magnetic latitude of spacecraft |
| mlt | Magnetic local time of spacecraft |
| raw_named_label | Raw cluster-assigned plasma region label (allowed values: magnetosheath, magnetosphere, solar wind, ion foreshock) |
| modified_named_label | Cleansed cluster-assigned plasma region label (use these unless you have a specific reason to use the raw labels). See paper for more information |
| transition_name | Transition names (e.g., quasi-perpendicular bow shock, magnetopause). See paper for more information |
_region_list.csv description
| Column Name | Description |
|---|---|
| start | Starting Epoch in datetime |
| stop | Stopping Epoch in datetime |
| probe | MMS probe name |
| region | Cleansed cluster name associated with the 1-minute resolution “modified_named_label” |
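A sketch for working with both files (assuming pandas; column names follow the tables above):

```python
import pandas as pd

data = pd.read_csv("labeled_sunside_data.csv", parse_dates=["Epoch"])
regions = pd.read_csv("_region_list.csv", parse_dates=["start", "stop"])

# Use the cleansed labels, as recommended above.
msh = data[(data["modified_named_label"] == "magnetosheath")
           & (data["probe"] == "mms1")]

# Corresponding >= 15 minute magnetosheath intervals for the same probe.
msh_intervals = regions[(regions["region"] == "magnetosheath")
                        & (regions["probe"] == "mms1")]
print(len(msh), "minutes;", len(msh_intervals), "intervals")
```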
This is a text document classification dataset containing 2225 documents in five categories: politics, sport, tech, entertainment, and business. It can be used for document classification and document clustering.
About Dataset
- The dataset contains two features: text and label.
- No. of rows: 2225
- No. of columns: 2
Text: the document text, spanning the different categories.
Label: integer label for the five categories: 0, 1, 2, 3, 4.
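A hedged sketch of clustering such documents (assuming scikit-learn; documents.csv is a hypothetical file name, with the text/label columns described above):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

df = pd.read_csv("documents.csv")  # hypothetical file name

# Represent each document as a TF-IDF vector.
X = TfidfVectorizer(stop_words="english", max_features=5000).fit_transform(df["text"])

# Cluster into five groups and compare against the known labels.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
print("ARI vs. true labels:", adjusted_rand_score(df["label"], labels))
```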
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset used in this study integrates quantitative data on student learning behaviors, engagement patterns, demographics, and academic performance. It was compiled by merging two publicly available Kaggle datasets, resulting in a combined file (“merged_dataset.csv”) containing 14,003 student records with 16 attributes. All records are anonymized and contain no personally identifiable information.
The dataset covers the following categories of variables:
Resource access and learning environment: Resources, Internet, EduTech
Motivation and psychological factors: Motivation, StressLevel
Demographic information: Gender, Age (ranging from 18 to 30 years)
Learning preference classification: LearningStyle
Academic performance indicators: ExamScore, FinalGrade
In this study, “ExamScore” and “FinalGrade” served as the primary performance indicators. The remaining variables were used to derive behavioral and contextual profiles, which were clustered using unsupervised machine learning techniques.
The analysis and modeling were implemented in Python through a structured Jupyter Notebook (“Project.ipynb”), which included the following main steps:
1. Environment Setup – Import of essential libraries (NumPy, pandas, Matplotlib, Seaborn, SciPy, StatsModels, scikit-learn, imbalanced-learn) and visualization configuration.
2. Data Import and Integration – Loading the two source CSV files, harmonizing columns, removing irrelevant attributes, aligning formats, handling missing values, and merging them into a unified dataset (merged_dataset.csv).
3. Data Preprocessing –
   - Encoding categorical variables using LabelEncoder.
   - Scaling features using both z-score standardization (for statistical tests and PCA) and Min–Max normalization (for clustering).
   - Detecting and removing duplicates.
4. Clustering Analysis –
   - Applying K-Means clustering to segment learners into distinct profiles.
   - Determining the optimal number of clusters using the Elbow Method and Silhouette Score (see the sketch after this list).
   - Evaluating cluster quality with internal metrics (Silhouette Score, Davies–Bouldin Index).
5. Dimensionality Reduction & Visualization – Using PCA for 2D/3D cluster visualization and feature importance exploration.
6. Mapping Clusters to Learning Styles – Associating each identified cluster with the most relevant learning style model based on feature patterns and alignment scores.
7. Statistical Analysis – Conducting ANOVA and regression to test for significant differences in performance between clusters.
8. Interpretation & Practical Recommendations – Analyzing cluster-specific characteristics and providing implications for adaptive and mobile learning integration.
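A hedged sketch of the cluster-number selection step (assuming scikit-learn; Min–Max scaling is applied to the numeric columns of merged_dataset.csv, as described above):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("merged_dataset.csv")
X = MinMaxScaler().fit_transform(df.select_dtypes("number"))  # numeric columns only

inertias, silhouettes = [], []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)                          # elbow criterion
    silhouettes.append(silhouette_score(X, km.labels_))   # cohesion/separation

best_k = range(2, 11)[int(np.argmax(silhouettes))]
print("best k by silhouette:", best_k)
```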
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of simulated and observed salinity/temperature data which were used in the manuscript "A method for assessment of the general circulation model quality using k-means clustering algorithm" submitted to Geoscientific Model Development.
The model simulation dataset is from long-term 3D circulation model simulation (Maljutenko and Raudsepp 2014, 2019). The observations are from the "Baltic Sea - Eutrophication and Acidity aggregated datasets 1902/2017 v2018" SMHI (2018).
The files are in simple comma-separated table format without headers.
The Dout-t_z_lat_lon_Smod_Sobs_Tmod_Tobs.csv file contains columns with the following variables [units]:
Time [Matlab datenum units], vertical coordinate [m], latitude [°N], longitude [°E], model salinity [g/kg], observed salinity [g/kg], model temperature [°C], observed temperature [°C].
The Dout-t_z_lat_lon_dS_dT_K1_K2_K3_K4_K5_K6_K7_K8_K9.csv file contains columns with the following variables [units]:
The first 4 columns are the same as in the previous file, followed by salinity error [g/kg] and temperature error [°C]; the remaining columns (K1-K9) are integers indicating the cluster to which each error pair is assigned.
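If these CSVs are read outside Matlab, the datenum time column needs converting; a sketch in Python (the short column names are invented for illustration; the 366-day datenum offset is the standard recipe):

```python
from datetime import datetime, timedelta

import pandas as pd

# Hypothetical short names for the eight headerless columns described above.
cols = ["t", "z", "lat", "lon", "S_mod", "S_obs", "T_mod", "T_obs"]
df = pd.read_csv("Dout-t_z_lat_lon_Smod_Sobs_Tmod_Tobs.csv", header=None, names=cols)

def datenum_to_datetime(dn: float) -> datetime:
    # Matlab datenum counts days from year 0; Python ordinals start at year 1.
    return datetime.fromordinal(int(dn)) + timedelta(days=dn % 1) - timedelta(days=366)

df["time"] = df["t"].map(datenum_to_datetime)
print(df["time"].head())
```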
do_clust_valid_DataFig.m is a Matlab script which reads the two CSV files (and, optionally, the mask file Model_mask.mat), performs the clustering analysis, and creates the plots used in the manuscript. The script is organized into %% blocks which can be executed separately (default: ctrl+enter).
k-means function is used from the Matlab Statistics and Machine Learning Toolbox.
Additional software used in do_clust_valid_DataFig.m:
Author's auxiliary formatting scripts (in script/):
datetick_cst.m
do_fitfig.m
do_skipticks.m
do_skipticks_y.m
Colormaps are generated using cbrewer.m (Charles, 2021).
Moving average smoothing is performed using nanmoving_average.m (Aguilera, 2021).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
How does Facebook always seem to know what the next funny video should be to sustain your attention on the platform? Facebook has not asked you whether you like videos of cats doing something funny: they just seem to know. In fact, Facebook learns from your behavior on the platform (e.g., how long you have engaged with similar videos, which posts you have previously liked or commented on, etc.). As a result, Facebook is able to sustain its users' attention for a long time. The typical mHealth app, on the other hand, suffers from rapidly collapsing user engagement levels. To sustain engagement, mHealth apps nowadays employ all sorts of intervention strategies. It would of course be powerful to know, as Facebook knows, which strategy should be presented to which individual to sustain their engagement. A first step toward that goal is to cluster similar users (and then derive intervention strategies from there). This dataset was collected through a single mHealth app over 8 different mHealth campaigns (i.e., scientific studies). Using this dataset, one could derive clusters from app user event data. One approach is to differentiate between two phases: a process mining phase and a clustering phase. In the process mining phase, one derives from the dataset the processes (i.e., sequences of app actions) that users undertake. In the clustering phase, one clusters similar users based on the processes they engaged in (i.e., users that perform similar sequences of app actions).
List of files
0-list-of-variables.pdf includes an overview of different variables within the dataset.
1-description-of-endpoints.pdf includes a description of the unique endpoints that appear in the dataset.
2-requests.csv includes the dataset with actual app user event data.
2-requests-by-session.csv includes the dataset with actual app user event data with a session variable, to differentiate between user requests that were made in the same session.
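A hedged sketch of the clustering phase (assuming pandas and scikit-learn; the column names user_id, session_id, and endpoint are assumptions about 2-requests-by-session.csv):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

events = pd.read_csv("2-requests-by-session.csv")

# Represent each user as the concatenation of the endpoints they requested,
# in order, so that similar action sequences yield similar token profiles.
traces = (events.sort_values(["user_id", "session_id"])
                .groupby("user_id")["endpoint"]
                .apply(" ".join))

# Bag-of-actions representation; bigrams capture short sub-sequences.
X = CountVectorizer(ngram_range=(1, 2), token_pattern=r"\S+").fit_transform(traces)

# Cluster users with similar behavior; k = 5 is a free choice here.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
print(pd.Series(labels).value_counts())
```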
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data for urban clustering used in the paper "Learning to clusterize urban areas: two competitive approaches and an empirical validation". We release two datasets for urban clustering based on data acquired in Santiago de Chile. The first dataset is computed at the level of urban blocks; the second at the level of individuals, using a uniform sample of Santiago inhabitants. Both datasets comprise features based on social characteristics (e.g., SES), land use, and aesthetic visual perception of the city. The features of each data unit (blocks or individuals) are provided using row packing (each row is a data unit) in CSV files. We release PCA (Principal Component Analysis) features for both datasets.