CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
In scientific data analysis, clusters identified computationally often substantiate existing hypotheses or motivate new ones. Yet the combinatorial nature of the clustering result, which is a partition rather than a set of parameters or a function, blurs notions of mean and variance. This intrinsic difficulty hinders the development of methods to improve clustering by aggregation or to assess the uncertainty of clusters generated. We overcome that barrier by aligning clusters via optimal transport. Equipped with this technique, we propose a new algorithm to enhance clustering by any baseline method using bootstrap samples. Cluster alignment enables us to quantify variation in the clustering result at the levels of both overall partitions and individual clusters. Set relationships between clusters such as one-to-one match, split, and merge can be revealed. A covering point set for each cluster, a concept akin to the confidence interval, is proposed. The tools we have developed here will help address the crucial question of whether any cluster is an intrinsic or spurious pattern. Experimental results on both simulated and real datasets are provided.
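As a minimal illustration of the alignment idea: in the balanced case with equal cluster weights, matching two labelings one-to-one reduces to the assignment problem, which `scipy.optimize.linear_sum_assignment` solves. The sketch below (synthetic labels; the helper `align_labels` is hypothetical, not the paper's code) relabels one partition to best match another by maximizing overlap. The paper's optimal-transport formulation is more general, allowing soft and many-to-many matchings; this is only the one-to-one special case.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_labels(ref, other, k):
    """Relabel `other` so its clusters best match `ref` (one-to-one case)."""
    # Overlap matrix: entry [i, j] counts points in ref-cluster i and other-cluster j.
    overlap = np.zeros((k, k), dtype=int)
    for r, o in zip(ref, other):
        overlap[r, o] += 1
    # Maximizing total overlap is an assignment problem (minimize the negative).
    row, col = linear_sum_assignment(-overlap)
    mapping = dict(zip(col, row))
    return np.array([mapping[o] for o in other])

ref = np.array([0, 0, 1, 1, 2, 2])
other = np.array([2, 2, 0, 0, 1, 1])  # the same partition under permuted labels
aligned = align_labels(ref, other, 3)
print(aligned)  # → [0 0 1 1 2 2], identical to ref
```

Once labels are aligned across bootstrap replicates, per-cluster statistics such as covering point sets become straightforward to accumulate.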
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Clustering methods are valuable tools for the identification of patterns in high-dimensional data with applications in many scientific fields. However, quantifying uncertainty in clustering is a challenging problem, particularly when dealing with high-dimension low sample size (HDLSS) data. We develop a U-statistics based clustering approach that assesses statistical significance in clustering and is specifically tailored to HDLSS scenarios. These nonparametric methods rely on very few assumptions about the data, and thus can be applied to a wide range of datasets for which the Euclidean distance captures relevant features. Our main result is the development of a hierarchical significance clustering method. To do so, we first introduce an extension of a relevant U-statistic and develop its asymptotic theory. Additionally, as a preliminary step, we propose a binary nonnested significance clustering method and show its optimality in terms of expected values. Our approach is tested through multiple simulations and found to have more statistical power than competing alternatives in all scenarios considered. Our methods are further showcased in three applications ranging from genetics to image recognition problems. Code for these methods is available in R-package uclust. Supplementary materials for this article are available online.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Ridlo Wahyudi Wibowo
Released under CC0: Public Domain
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
The purpose of data mining analysis is to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset: before doing any work on the data, it has to be pre-processed, which normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. In our project, using clustering prior to classification did not improve performance much. One reason could be that the features we selected for clustering are not well suited to it. Given the nature of the data, classification tasks provide more information to work with in terms of improving knowledge and overall performance metrics.

From the dimensionality reduction perspective: clustering differs from Principal Component Analysis, which finds the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension can lose a great deal of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses much of its meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always a good idea, since you may lose almost all the information.

From the creating-new-features perspective: clustering analysis creates labels based on patterns in the data, which brings uncertainty into the data. When clustering is used prior to classification, the choice of the number of clusters strongly affects the clustering performance, and in turn the classification performance. If the subset of features we cluster on is well suited to it, clustering might increase the overall classification performance.
For example, if the features we run k-means on are numerical and low-dimensional, the overall classification performance may be better. We did not fix the clustering outputs with a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the methods selected. In short, the outcome we observed was that our results are not much better than random when applying clustering during data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to continue to revise the models from time to time as things change.
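A rough sketch of the experiment described above, assuming synthetic data and scikit-learn (not the project's actual dataset or pipeline): classify on the raw features, then on the cluster-id "reduced" representation, and compare accuracies.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the project's data.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline: classify on the raw features.
base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc_raw = accuracy_score(y_te, base.predict(X_te))

# "Clustering as dimensionality reduction": replace each point by its cluster id.
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X_tr)
reduced = LogisticRegression(max_iter=1000).fit(
    km.predict(X_tr).reshape(-1, 1), y_tr)
acc_cluster = accuracy_score(
    y_te, reduced.predict(km.predict(X_te).reshape(-1, 1)))

print(f"raw features: {acc_raw:.2f}  cluster ids: {acc_cluster:.2f}")
```

Feeding the integer cluster id directly to a linear model is deliberately crude; one-hot encoding the ids is the more common variant, but the information loss the text describes shows up either way.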
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Introduction: Adolescents’ physical activity (PA) behavior can be driven by several psychosocial determinants at the same time. Most analyses use a variable-based approach that examines relations between PA-related determinants and PA behavior on the between-person level. Using this approach, possible coexistences of different psychosocial determinants within one person cannot be examined. Therefore, by applying a person-oriented approach, this study examined (a) which profiles regarding PA-related psychosocial variables typically occur in female sixth-graders, (b) if these profiles deliver a self-consistent picture according to theoretical assumptions, and (c) if the profiles contribute to the explanation of PA.
Materials and Methods: The sample comprised 475 female sixth-graders. Seventeen PA-related variables were assessed: support for autonomy, competence and relatedness in PE as well as their satisfaction in PE and leisure-time; behavioral regulation of exercise (five subscales); self-efficacy and social support from friends and family (two subscales). Moderate-to-vigorous PA was measured using accelerometers. Data were analyzed using self-organizing maps (SOM), a cluster analysis based on an unsupervised algorithm for non-linear models.
Results: According to the respective level of psychosocial resources, a positive, a medium and a negative cluster were identified. This superordinate cluster solution represented a self-consistent picture that was in line with theoretical assumptions. The three-cluster solution contributed to the explanation of PA behavior, with the positive cluster accumulating an average of 6 min more moderate-to-vigorous PA per day than the medium cluster and 10 min more than the negative cluster.
Additionally, SOM detected a subgroup within the positive cluster that benefited from a specific combination of intrinsic and external regulations with regard to PA.
Discussion: The results underline the relevance of the assessed psychosocial determinants of PA behavior in female sixth-graders. The results further indicate that the different psychosocial resources within a given person do not develop independently of one another, which supports the use of a person-oriented approach. In addition, the SOM analysis identified subgroups with specific characteristics, which would have remained undetected using variable-based approaches. Thus, this approach offers the possibility to reduce data complexity without overlooking subgroups with special demands that go beyond the superordinate cluster solution.
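The person-oriented clustering idea can be illustrated with a minimal self-organizing map written from scratch; this is a toy stand-in, not the study's SOM pipeline, and the grid size, learning-rate schedule, and the three synthetic "profiles" are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(X, grid=(3, 1), epochs=50, lr0=0.5, sigma0=1.0):
    """Minimal SOM: each grid node holds a weight vector updated toward data."""
    gx, gy = grid
    coords = np.array([(i, j) for i in range(gx) for j in range(gy)], float)
    W = rng.normal(size=(gx * gy, X.shape[1]))
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)                       # decaying learning rate
        sigma = sigma0 * (1 - t / epochs) + 1e-3          # shrinking neighborhood
        for x in rng.permutation(X):
            bmu = np.argmin(((W - x) ** 2).sum(axis=1))   # best-matching unit
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            h = np.exp(-d2 / (2 * sigma ** 2))            # neighborhood kernel
            W += lr * h[:, None] * (x - W)
    return W

# Three well-separated 2-D "psychosocial profiles" (synthetic).
X = np.vstack([rng.normal(m, 0.1, size=(30, 2)) for m in (-2, 0, 2)])
W = train_som(X, grid=(3, 1))
labels = np.argmin(((X[:, None, :] - W[None, :, :]) ** 2).sum(-1), axis=1)
print(np.bincount(labels, minlength=3))  # node assignments per person
```

Each grid node plays the role of a cluster prototype; inspecting the node weights is what "profiling" a cluster amounts to.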
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
This artifact bundles the five dataset archives used in our private federated clustering evaluation, corresponding to the real-world benchmarks, scaling experiments, ablation studies, and timing performance tests described in the paper. The real_datasets.tar.xz includes ten established clustering benchmarks drawn from UCI and the Clustering basic benchmark (DOI: https://doi.org/10.1007/s10489-018-1238-7); scale_datasets.tar.xz contains the SynthNew family generated to assess scalability via the R clusterGeneration package; ablate_datasets.tar.xz holds the AblateSynth sets varying cluster separation for ablation analysis, also powered by clusterGeneration; g2_datasets.tar.xz packages the G2 sets, Gaussian clusters of size 2048 across dimensions 2-1024 with two clusters each, collected from the Clustering basic benchmark (DOI: https://doi.org/10.1007/s10489-018-1238-7); and timing_datasets.tar.xz includes the real s1 and lsun datasets alongside TimeSynth files (balanced synthetic clusters for timing), as per Mohassel et al.'s experimental framework.
Contains ten real-world benchmark datasets, formatted as one sample per line with space-separated features:
iris.txt: 150 samples, 4 features, 3 classes; classic UCI Iris dataset for petal/sepal measurements.
lsun.txt: 400 samples, 2 features, 3 clusters; two-dimensional variant of the LSUN dataset for clustering experiments.
s1.txt: 5,000 samples, 2 features, 15 clusters; synthetic benchmark from Fränti’s S1 series.
house.txt: 1,837 samples, 3 features, 3 clusters; housing data transformed for clustering tasks.
adult.txt: 48,842 samples, 6 features, 3 clusters; UCI Census Income (“Adult”) dataset for income bracket prediction.
wine.txt: 178 samples, 13 features, 3 cultivars; UCI Wine dataset with chemical analysis features.
breast.txt: 569 samples, 9 features, 2 classes; Wisconsin Diagnostic Breast Cancer dataset.
yeast.txt: 1,484 samples, 8 features, 10 localization sites; yeast protein localization data.
mnist.txt: 10,000 samples, 784 features (28×28 pixels), 10 digit classes; MNIST handwritten digits.
birch2.txt: a random 25,000-sample subset of the 100,000-sample BIRCH2 set, 2 features, 100 clusters; synthetic dataset for high-cluster-count evaluation.
Holds the SynthNew_{k}_{d}_{s}.txt files for scaling experiments, where:
$k \in \{2,4,8,16,32\}$ is the number of clusters,
$d \in \{2,4,8,16,32,64,128,256,512\}$ is the dimensionality,
$s \in \{1,2,3\}$ are different random seeds.
These are generated with the R clusterGeneration package with cluster sizes following a $1:2:...:k$ ratio. We incorporate a random number (in $[0, 100]$) of randomly sampled outliers and set the cluster separation degrees randomly in $[0.16, 0.26]$, spanning partially overlapping to separated clusters.
Contains the AblateSynth_{k}_{d}_{sep}.txt files for ablation studies, with:
$k \in \{2,4,8,16\}$ clusters,
$d \in \{2,4,8,16\}$ dimensions,
$sep \in \{0.25, 0.5, 0.75\}$ controlling cluster separation degrees.
Also generated via clusterGeneration.
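The SynthNew/AblateSynth generation recipes above come from the R clusterGeneration package; the numpy sketch below is a rough analog for readers without R (the separation scaling and noise level are assumptions, not the package's algorithm).

```python
import numpy as np

rng = np.random.default_rng(1)

def synth_new(k=4, d=8, sep=0.2, n_per_unit=50):
    """Rough numpy analog of the clusterGeneration setup: k Gaussian clusters
    with sizes in a 1:2:...:k ratio, plus a random number of uniform outliers."""
    sizes = [n_per_unit * (i + 1) for i in range(k)]
    centers = rng.uniform(-1, 1, size=(k, d))
    centers *= 1.0 + 10 * sep                    # crude separation control
    X = np.vstack([rng.normal(c, 0.5, size=(n, d))
                   for c, n in zip(centers, sizes)])
    y = np.repeat(np.arange(k), sizes)
    n_out = rng.integers(0, 101)                 # 0-100 outliers, as in the paper
    outliers = rng.uniform(X.min(), X.max(), size=(n_out, d))
    return np.vstack([X, outliers]), y, n_out

X, y, n_out = synth_new()
print(X.shape, y.shape, n_out)
```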
Packages the G2 synthetic sets (g2-{dim}-{var}.txt) from the clustering-data benchmarks:
$N=2048$ samples, $k=2$ Gaussian clusters,
Dimensions $d \in \{1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024\}$
Includes:
s1.txt, lsun.txt: two real datasets for baseline timing.
timesynth_{k}_{d}_{n}.txt: synthetic timing datasets with balanced cluster sizes $C_{avg} = N/k$, varying:
$k \in \{2,5\}$
$d \in \{2,5\}$
$N \in \{10000, 100000\}$
Generated similarly to the scaling sets, following Mohassel et al.’s timing experiment protocol.
Usage:
Unpack any archive with tar -xJf <archive>.tar.xz
A pilot study using discriminant analysis on major oxide and minor element data from 67 chert samples in the Rampart area, southeastern Tanana and southwest Livengood Quadrangles, western Yukon-Tanana Upland, Alaska, generally indicates a unique geochemical signature for the cherts of a given unit. Chert samples from five known type locales were used as standards of comparison: 1) Livengood Dome Chert (Ordovician), 2) Amy Creek unit (Proterozoic to early Paleozoic), 3) Rampart Group (Mississippian to Triassic), 4) Troublesome Creek unit (Devonian), and 5) Permian-Triassic clastic unit (associated with the Triassic-dated gabbro). Samples from the above units were compared to chert from Tanana B-1 area units of unknown or uncertain affinity. We have determined that discriminant analysis of chert geochemistry can assign chert profiles to specific units with only minor exceptions, and is useful in geologic mapping of the Tanana B-1 Quadrangle (Reifenstuhl and others, 1997).
According to the Food and Agricultural Organization (FAO), 123 million Chinese remained undernourished in 2003-2005, representing 14% of the global total. UNICEF states that 7.2 million of the world's stunted children are located in China. In absolute terms, China continues to rank among the top countries carrying the global burden of under-nutrition. China must, and still can, reduce under-nutrition, thus contributing even further to the global attainment of MDG1. It is in this context that the United Nations Joint Programme, in partnership with the Chinese government, has conducted this study. The key objective is to improve evidence of household food security through a baseline study in six pilot counties in rural China. The results will be used to guide policy and programmes aimed at reducing household food insecurity in the most vulnerable populations in China. The study is not meant to be an exhaustive analysis of the food security situation in the country, but to provide a demonstrative example of food assessment tools that may be replicated or scaled up to other places.
Six rural counties
The survey covered household heads and women aged 15-49 resident in the household. A household is defined as a group of people currently living and eating together "under the same roof" (or in the same compound if the household has two structures).
Sample survey data [ssd]
The required sample size for the survey was calculated using standard sample size calculations with each county representing a stratum. After the sample size was calculated, a two-stage clustering approach was applied. The first stage is the selection of villages using the probability proportional to size (PPS) method to create a self-weighted sample in which larger population clusters (villages) have a greater chance of selection, proportional to their size. Following the selection of the villages, 12 households within the village were selected using simple random selection.
Floods and landslides prevented the team from visiting two of the selected villages, one in Wuding and one in Panxian, so they substituted them with replacement villages.
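The two-stage design described above, PPS selection of villages followed by simple random selection of 12 households, can be sketched as follows (the village sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

def pps_sample(village_sizes, n_villages, households_per_village=12):
    """Stage 1: villages drawn with probability proportional to size (PPS).
    Stage 2: simple random sample of households within each chosen village."""
    sizes = np.asarray(village_sizes, dtype=float)
    chosen = rng.choice(len(sizes), size=n_villages, replace=False,
                        p=sizes / sizes.sum())
    return {int(v): sorted(rng.choice(int(sizes[v]),
                                      size=households_per_village,
                                      replace=False).tolist())
            for v in chosen}

villages = [120, 300, 80, 450, 60, 210]  # hypothetical household counts
picked = pps_sample(villages, n_villages=3)
print(picked)  # village index -> 12 sampled household indices
```

Drawing villages with probability proportional to size, then a fixed number of households per village, is what makes the sample approximately self-weighting.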
Face-to-face [f2f]
The household questionnaire was administered to all households in the survey and included modules on demography, education, migration and remittances, housing and facilities, household assets, agriculture, income activities, expenditure, food sources and consumption, shocks and coping strategies.
The objective of the village questionnaire was to gather contextual information on the six counties for descriptive purposes. In each village visited, a focus group discussion took place on topics including: population of the village, migrants, access to social services such as education and health, infrastructure, access to markets, difficulties facing the village, information on local agricultural practices.
The questionnaires were developed by WFP and Chinese Academy of Agricultural Sciences (CAAS) with inputs from partnering agencies. They were originally formulated in English and then translated into Mandarin. They were pilot tested in the field and corrected as needed. The final interviews were administered in Mandarin with translation provided in the local language when needed.
All questionnaires and modules are provided as external resources.
After data collection, data entry was carried out by CAAS staff in Beijing using EpiData software. The datasets were then exported into SPSS for analysis. Data cleaning was an iterative process throughout the data entry and analysis phases.
Descriptive analysis, correlation analysis, principal component analysis, cluster analysis and various other forms of analysis were conducted using SPSS.
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
Dataset abstract: This dataset contains the data files that were used for the cluster analysis of the Dutch minimizing construction, as described in the publication cited below. In addition to a ReadMe file, it contains three files: A txt file is provided with the corpus queries that were used to find tokens of the minimizing constructions in the Dutch Web 2014 (nlTenTen14) corpus, available via Sketch Engine (more information about the TenTen corpora: Jakubíček, M., A. Kilgarriff, V. Kovář, P. Rychlý & V. Suchomel (2013). The TenTen corpus family. In: 7th International Corpus Linguistics Conference CL. Lancaster, 125–127). A csv file is provided that forms the input file for the cluster analysis. It contains a list of 5,863 minimizer-predicate combinations, more specifically a list of the predicates that are combined with the minimizers that have a token frequency of at least 10 in my dataset. An R-script is provided with the code to perform the cluster analysis in R. Article abstract: This paper examines the semantic structuring of a paradigm of 89 minimizers, i.e., nouns that reinforce sentential negation in present-day Netherlandic Dutch, such as meter ‘meter’ in voor geen meter vertrouwen ‘not to trust for a meter’. Cosine distances are computed on the basis of the predicates the minimizers combine with in a sample of 100 tokens downloaded from the Dutch Web corpus 2014 (nlTenTen14) and clustered according to the Partitioning Around Medoids (PAM) algorithm into nine semantic clusters. The clusters largely correspond to semantic categories such as taboo terms or units of money. This suggests that, in general, minimizers belonging to the same semantic domain are combined with a similar (core) set of predicates. Based on the shared predicates per cluster, we detect signs of analogical attraction between minimizers or, conversely, competition. 
Crucially, low silhouette widths enable us to identify outliers in their respective clusters, for instance, minimizing nouns that exhibit signs of context expansion, as shown by their combination with semantically non-harmonious verbs. As such, this paper provides a synchronic snapshot of the semantic processes involved in (incipient) grammaticalization of minimizing nouns and, more generally, it illustrates how distributional semantics offers a heuristic to analyze the structure of a network of comparable micro-constructions.
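The pipeline described, cosine distances over predicate profiles, clustering, then silhouette widths to flag outliers, can be sketched in Python. Since PAM is not in scikit-learn, average-linkage hierarchical clustering on the precomputed distance matrix serves as a stand-in, and the count matrix is synthetic rather than the article's corpus data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_samples
from sklearn.metrics.pairwise import cosine_distances

rng = np.random.default_rng(3)

# Toy minimizer-by-predicate count matrix (rows: minimizers, cols: predicates).
counts = rng.poisson(2.0, size=(30, 12)).astype(float)
D = cosine_distances(counts)
np.fill_diagonal(D, 0.0)

# PAM stand-in: average-linkage clustering on the precomputed distances.
Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=4, criterion="maxclust")

# Low silhouette widths flag minimizers sitting uneasily in their cluster.
sil = silhouette_samples(D, labels, metric="precomputed")
print(np.where(sil < 0.05)[0])  # candidate outliers / context-expanding items
```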
This dataset was created by Dr Reza Rafiee
Evaluating Correlation Between Measurement Samples in Reverberation Chambers Using Clustering
Abstract: Traditionally, in reverberation chamber (RC) measurements, autocorrelation or correlation-matrix methods have been applied to evaluate measurement correlation. In this article, we introduce the use of clustering based on correlative distance to group correlated measurements. We apply the method to measurements taken in an RC using one and two paddles to stir the electromagnetic fields, applying decreasing angular steps between consecutive paddle positions. The results using varying correlation threshold values demonstrate that the method calculates the number of effective samples and allows discerning outliers, i.e., uncorrelated measurements, from clusters of correlated measurements. This calculation method, if verified, will allow non-sequential stir sequence design and, thereby, reduce testing time.
Keywords: Correlation, Pearson correlation coefficient (PCC), reverberation chambers (RC), mode-stirring samples, correlative distance, clustering analysis, adjacency matrix.
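A sketch of the correlative-distance idea on assumed toy data: the distance is taken as 1 - |PCC| and the dendrogram is cut at a correlation threshold (0.37, i.e. 1/e, is a common RC choice). This is an illustration, not the authors' code.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(4)

# Toy stirred-field samples: three groups of five highly correlated measurements.
base = rng.normal(size=(3, 200))
samples = np.vstack([b + 0.1 * rng.normal(size=(5, 200)) for b in base])

# Correlative distance: 1 - |Pearson correlation coefficient|.
D = 1.0 - np.abs(np.corrcoef(samples))
np.fill_diagonal(D, 0.0)

# Cut the dendrogram at the classic |PCC| = 0.37 (1/e) threshold,
# i.e. at a correlative distance of 1 - 0.37 = 0.63.
Z = linkage(squareform(D, checks=False), method="average")
clusters = fcluster(Z, t=0.63, criterion="distance")
n_eff = len(np.unique(clusters))  # cluster count ≈ number of effective samples
print(n_eff)
```

Each cluster of mutually correlated measurements contributes roughly one effective sample; singleton clusters are the uncorrelated outliers the abstract mentions.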
Objective: Esophageal cancer (EC) is characterized by a high degree of malignancy and poor prognosis. N6-methyladenosine (m6A), a prominent post-transcriptional modification of mRNA in mammalian cells, plays a pivotal role in regulating various cellular and biological processes. Similarly, cuproptosis has garnered attention for its potential implications in cancer biology. This study seeks to elucidate the impact of m6A- and cuproptosis-related long non-coding RNAs (m6aCRLncs) on the prognosis of patients with EC.
Methods: The EC transcriptional data and corresponding clinical information were retrieved from The Cancer Genome Atlas (TCGA) database, comprising 11 normal samples and 159 EC samples. Data on 23 m6A regulators and 25 cuproptosis-related genes were sourced from the latest literature. The m6aCRLncs linked to EC were identified through co-expression analysis. Differentially expressed m6aCRLncs associated with EC prognosis were screened using the limma package in R and univariate Cox regression analysis. Subtype clustering was performed to classify EC patients, enabling the investigation of differences in clinical outcomes and immune microenvironment across patient clusters. A risk prognostic model was constructed using least absolute shrinkage and selection operator (LASSO) regression. Its robustness was evaluated through survival analysis, risk stratification curves, and receiver operating characteristic (ROC) curves. Additionally, the model’s applicability across various clinical features and molecular subtypes of EC patients was assessed. To further explore the model’s utility in predicting the immune microenvironment, single-sample gene set enrichment analysis (ssGSEA), immune cell infiltration analysis, and immune checkpoint differential expression analysis were conducted. Drug sensitivity analysis was performed to identify potential therapeutic agents for EC.
Finally, the mRNA expression levels of m6aCRLncs in EC cell lines were validated using reverse transcription quantitative polymerase chain reaction (RT-qPCR).
Results: We developed a prognostic risk model based on five m6aCRLncs, namely ELF3-AS1, HNF1A-AS1, LINC00942, LINC01389, and MIR181A2HG, to predict survival outcomes and characterize the immune microenvironment in EC patients. Analysis of molecular subtypes and clinical features revealed significant differences in cluster distribution, disease stage, and N stage between high- and low-risk groups. Immune profiling further identified distinct immune cell populations and functional pathways associated with risk scores, including positive correlations with naive B cells, resting CD4+ T cells, and plasma cells, and negative correlations with macrophages M0 and M1. Additionally, we identified key immune checkpoint-related genes with significant differential expression between risk groups, including TNFRSF14, TNFSF15, TNFRSF18, LGALS9, CD44, HHLA2, and CD40. Furthermore, nine candidate drugs with potential therapeutic efficacy in EC were identified: Bleomycin, Cisplatin, Cyclopamine, PLX4720, Erlotinib, Gefitinib, RO.3306, XMD8.85, and WH.4.023. Finally, RT-qPCR validation of the mRNA expression levels of m6aCRLncs in EC cell lines demonstrated that ELF3-AS1 expression was significantly upregulated in the EC cell lines KYSE-30 and KYSE-180 compared to normal esophageal epithelial cells.
Conclusion: This study elucidates the role of m6aCRLncs in shaping the prognostic outcomes and immune microenvironment of EC. Furthermore, it identifies potential therapeutic agents with efficacy against EC. These findings hold significant promise for enhancing the survival of EC patients and provide valuable insights to inform clinical decision-making in the management of this disease.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
This dataset combines multiple parameters measured in groundwater and surface water samples, interpreted using Hierarchical Cluster Analysis. It combines groundwater and surface water hydrochemical data (major ions, EC, TDS) sourced from GA datasets that were originally sourced from the NT government database and other sources listed under the column 'Source'.
Geological and Bioregional Assessment Program
A hydrochemical dataset composed of 857 groundwater samples and 21 surface water samples was used for Hierarchical Cluster Analysis to investigate inter-aquifer and GW-SW connectivity. Groundwater samples are assigned to the following hydrostratigraphic units: Antrim Plateau Volcanics/Helen Springs/Jindare, Bukalara, Cenozoic, Cretaceous, Jinduckin/AnthonyLagoon/HookerCreek, Proterozoic and Tindall/GumRidge/Montejinni. Variables used for the analysis included major ions (Ca, Mg, Na, K, HCO3, Cl, SO4), electrical conductivity (EC) and pH, with data normalised for statistical calculation. Datapoints with a charge balance error above 10% were removed prior to interpretation. The interpretation resulted in five clusters with distinct geochemical properties, as described in the respective section of the technical report Hydrogeology of the Beetaloo GBA region. Check the 'Source' column to identify the data origin, with references in the above-mentioned report.
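The workflow described, normalise the variables, then hierarchical clustering cut into five clusters, can be sketched with hypothetical hydrochemistry values (the lognormal parameters below are illustrative, not the GBA data; Ward linkage is an assumption, as the report's linkage method is not stated here).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import zscore

rng = np.random.default_rng(5)

# Hypothetical hydrochemistry table: Ca, Mg, Na, K, HCO3, Cl, SO4, EC, pH.
X = np.column_stack([
    rng.lognormal(mean=m, sigma=0.5, size=60)
    for m in (3.5, 2.8, 4.0, 1.5, 5.5, 4.5, 3.0, 6.5, 0.0)
])

# Normalise each variable before clustering, as in the dataset description.
Xz = zscore(X, axis=0)

# Hierarchical clustering, cut into five clusters.
Z = linkage(Xz, method="ward")
clusters = fcluster(Z, t=5, criterion="maxclust")
print(np.bincount(clusters)[1:])  # samples per cluster
```

Normalising first matters: without it, high-magnitude variables such as EC would dominate the Euclidean distances.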
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Background: Clustering of gene expression data is widely used to identify novel subtypes of cancer. Many clustering approaches have been proposed, but there is a lack of knowledge regarding their relative merits and how data characteristics influence the performance. We evaluate how cluster analysis choices affect the performance by studying four publicly available human cancer data sets: breast, brain, kidney and stomach cancer. In particular, we focus on how the sample size, distribution of subtypes and sample heterogeneity affect the performance.
Results: In general, increasing the sample size had limited effect on the clustering performance, e.g. for the breast cancer data similar performance was obtained for n = 40 as for n = 330. The relative distribution of the subtypes had a noticeable effect on the ability to identify the disease subtypes, and data with disproportionate cluster sizes turned out to be difficult to cluster. Both the choice of clustering method and selection method affected the ability to identify the subtypes, but the relative performance varied between data sets, making it difficult to rank the approaches. For some data sets, the performance was substantially higher when the clustering was based on data from only one sex compared to data from a mixed population. This suggests that homogeneous data are easier to cluster than heterogeneous data and that clustering males and females individually may be beneficial and increase the chance to detect novel subtypes. It was also observed that the performance often differed substantially between females and males.
Conclusions: The number of samples seems to have a limited effect on the performance, while heterogeneity, at least with respect to sex, is important for the performance. Hence, by analyzing the sexes separately, the possible loss caused by having fewer samples could be outweighed by the benefit of more homogeneous data.
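A minimal version of this kind of benchmark, recovering known subtypes at two sample sizes and scoring with the adjusted Rand index, might look like this (synthetic blobs rather than the four cancer data sets; the methods and sizes shown are only examples):

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

results = {}
for n in (40, 330):  # the two sample sizes mentioned in the abstract
    X, y = make_blobs(n_samples=n, centers=4, cluster_std=2.0, random_state=0)
    for name, model in [
        ("k-means", KMeans(n_clusters=4, n_init=10, random_state=0)),
        ("ward", AgglomerativeClustering(n_clusters=4)),
    ]:
        # ARI compares the recovered partition against the known subtype labels.
        results[(n, name)] = adjusted_rand_score(y, model.fit_predict(X))

for (n, name), ari in sorted(results.items()):
    print(f"n={n:<4} {name:<8} ARI={ari:.2f}")
```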
https://creativecommons.org/publicdomain/zero/1.0/
This synthetic dataset simulates 300 global cities across 6 major geographic regions, designed specifically for unsupervised machine learning and clustering analysis. It explores how economic status, environmental quality, infrastructure, and digital access shape urban lifestyles worldwide.
| Feature | Description | Range |
|---|---|---|
| 10 Features | Economic, environmental & social indicators | Realistically scaled |
| 300 Cities | Europe, Asia, Americas, Africa, Oceania | Diverse distributions |
| Strong Correlations | Income ↔ Rent (+0.8), Density ↔ Pollution (+0.6) | ML-ready |
| No Missing Values | Clean, preprocessed data | Ready for analysis |
| 4-5 Natural Clusters | Metropolitan hubs, eco-towns, developing centers | Pre-validated |
✅ Realistic Correlations: Income strongly predicts rent (+0.8), internet access (+0.7), and happiness (+0.6)
✅ Regional Diversity: Each region has distinct economic and environmental characteristics
✅ Clustering-Ready: Naturally separable into 4-5 lifestyle archetypes
✅ Beginner-Friendly: No data cleaning required, includes example code
✅ Documented: Comprehensive README with methodology and use cases
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Load and prepare
df = pd.read_csv('city_lifestyle_dataset.csv')
X = df.drop(['city_name', 'country'], axis=1)
X_scaled = StandardScaler().fit_transform(X)
# Cluster
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)
# Analyze (numeric_only avoids errors on the string columns)
print(df.groupby('cluster').mean(numeric_only=True))
After working with this dataset, you will be able to:
1. Apply K-Means, DBSCAN, and Hierarchical Clustering
2. Use PCA for dimensionality reduction and visualization
3. Interpret correlation matrices and feature relationships
4. Create geographic visualizations with cluster assignments
5. Profile and name discovered clusters based on characteristics
| Cluster | Characteristics | Example Cities |
|---|---|---|
| Metropolitan Tech Hubs | High income, density, rent | Silicon Valley, Singapore |
| Eco-Friendly Towns | Low density, clean air, high happiness | Nordic cities |
| Developing Centers | Mid income, high density, poor air | Emerging markets |
| Low-Income Suburban | Low infrastructure, income | Rural areas |
| Industrial Mega-Cities | Very high density, pollution | Manufacturing hubs |
Unlike random synthetic data, this dataset was carefully engineered with: - ✨ Realistic correlation structures based on urban research - 🌍 Regional characteristics matching real-world patterns - 🎯 Optimal cluster separability (validated via silhouette scores) - 📚 Comprehensive documentation and starter code
✓ Learn clustering without data cleaning hassles
✓ Practice PCA and dimensionality reduction
✓ Create beautiful geographic visualizations
✓ Understand feature correlation in real-world contexts
✓ Build a portfolio project with clear business insights
This dataset was designed for educational purposes in machine learning and data science. While synthetic, it reflects real patterns observed in global urban development research.
Happy Clustering! 🎉
Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
We propose two probability-like measures of individual cluster-membership certainty which can be applied to a hard partition of the sample, such as that obtained from the Partitioning Around Medoids (PAM) algorithm, hierarchical clustering or k-means clustering. One measure extends the individual silhouette widths and the other is obtained directly from the pairwise dissimilarities in the sample. Unlike the classic silhouette, however, the measures behave like probabilities and can be used to investigate an individual's tendency to belong to a cluster. We also suggest two possible ways to evaluate the hard partition using these measures. We evaluate the performance of both measures on individuals with ambiguous cluster membership, using simulated binary datasets partitioned by the PAM algorithm and continuous datasets partitioned by hierarchical clustering and k-means clustering. For comparison, we also present results from soft-clustering algorithms such as fuzzy analysis clustering (FANNY) and two model-based clustering methods. Our proposed measures perform comparably to the posterior-probability estimators from either FANNY or the model-based clustering methods. We also illustrate the proposed measures by applying them to Fisher's classic data set on irises.
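In the spirit of the dissimilarity-based measure (though not the paper's exact estimator), a probability-like membership can be formed by normalising each point's inverse mean dissimilarity to every cluster:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances

# Synthetic hard partition to attach memberships to.
X, _ = make_blobs(n_samples=120, centers=3, cluster_std=1.5, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

D = pairwise_distances(X)
memberships = np.zeros((len(X), 3))
for k in range(3):
    # Mean dissimilarity from each point to the members of cluster k;
    # closer clusters get proportionally higher membership.
    mean_d = D[:, labels == k].mean(axis=1)
    memberships[:, k] = 1.0 / (mean_d + 1e-12)
memberships /= memberships.sum(axis=1, keepdims=True)

ambiguous = np.where(memberships.max(axis=1) < 0.5)[0]
print(len(ambiguous))  # points with no clear majority membership
```

Because the rows sum to one, these values can be read as cluster-membership certainties, which is the behaviour the abstract contrasts with the classic silhouette.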
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset features a collection of R scripts designed to leverage the Affinity Propagation machine-learning algorithm for clustering similar window photographs. These scripts employ an unsupervised clustering analysis, culminating in the manual tagging of clusters. Following this, the scripts process the tags and clusters to generate tabulated data in .csv format. Additionally, the code includes tools for reapplying EXIF metadata to images and organizing them efficiently. The dataset also provides a sample output, derived from processing the scripts with data from the study “Image dataset: Year-long hourly facade photos of a university building” (https://doi.org/10.1016/j.dib.2024.110798).
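The dataset's pipeline is implemented in R, but the clustering step it performs can be sketched in Python with scikit-learn's implementation of the same Affinity Propagation algorithm. This is an illustrative sketch under assumptions: the random feature vectors stand in for per-photo descriptors (e.g. brightness histograms), which the actual scripts derive from the window photographs.

```python
# Illustrative sketch (the dataset's real pipeline is in R): Affinity
# Propagation clustering of toy photo-like feature vectors using
# scikit-learn. Affinity Propagation picks exemplar points and does
# not require the number of clusters in advance.
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(1)
# Stand-in for per-photo feature vectors (e.g. brightness histograms).
features = np.vstack([
    rng.normal(loc=m, scale=0.2, size=(20, 4))
    for m in (0.0, 1.0, 2.0)
])

ap = AffinityPropagation(damping=0.9, max_iter=500, random_state=1)
ap.fit(features)
n_clusters = len(ap.cluster_centers_indices_)
```

After this step, the real scripts hand the resulting clusters to a human for manual tagging and then tabulate tags and clusters into .csv files.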
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset contains the data for the hierarchical cluster analysis as explained in the article "A panorama of inchoative constructions in Spanish: Cluster analysis as an answer to the near-synonymy puzzle". In total, the dataset contains 3955 observations, which are tokens of the inchoative construction for the following auxiliaries: comenzar, empezar, meter, poner, echar(se), liar, arrancar and romper. The data originates from the Spanish Web corpus (esTenTen18), accessed via Sketch Engine; only the European Spanish subcorpus was selected. The search syntax used to detect the inchoative construction was the following: “[lemma="empezar"] [tag="R.*"]{0,3}"a"[tag="V.*"] within " (replacing the concrete lemma "empezar" by other lemmas for each auxiliary; see Spinc_queries_20221202.txt for all concrete corpus queries). After downloading samples of 10,000 tokens per auxiliary, the samples were manually cleaned, and only 500 tokens per auxiliary were retained in the dataset. Next, the data were annotated for the infinitive observed after the preposition 'a' and for the semantic class to which this infinitive belongs, following the existing ADESSE classification (see below), besides other criteria that are not taken into account for this study. Concretely, the variables 'INF' (infinitive) and 'Class' were used as input for the hierarchical cluster analysis (see data-specific sections below for more information about the variables).
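The kind of hierarchical cluster analysis described can be sketched as follows. This is a minimal illustration with hypothetical toy counts (not the article's data): auxiliaries are clustered by the relative-frequency profiles of the semantic classes ('Class') of their infinitives.

```python
# Hedged sketch with invented toy counts: hierarchical (Ward) clustering
# of auxiliaries by the distribution of semantic classes among their
# infinitives, mirroring the analysis described in the dataset notes.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

aux = ["comenzar", "empezar", "poner", "echar"]
# Rows: auxiliaries; columns: hypothetical semantic-class token counts.
counts = np.array([
    [30, 10,  5],
    [28, 12,  6],
    [ 5, 25, 20],
    [ 4, 27, 18],
], dtype=float)
# Normalize to relative frequencies so profile shape, not sample size, drives distance.
profiles = counts / counts.sum(axis=1, keepdims=True)

Z = linkage(pdist(profiles, metric="euclidean"), method="ward")
groups = fcluster(Z, t=2, criterion="maxclust")  # cut dendrogram into 2 groups
```

With profiles like these, auxiliaries with similar semantic-class preferences end up in the same branch of the dendrogram, which is the near-synonymy structure the article investigates.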