CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
In scientific data analysis, clusters identified computationally often substantiate existing hypotheses or motivate new ones. Yet the combinatorial nature of the clustering result, which is a partition rather than a set of parameters or a function, blurs notions of mean and variance. This intrinsic difficulty hinders the development of methods to improve clustering by aggregation or to assess the uncertainty of clusters generated. We overcome that barrier by aligning clusters via optimal transport. Equipped with this technique, we propose a new algorithm to enhance clustering by any baseline method using bootstrap samples. Cluster alignment enables us to quantify variation in the clustering result at the levels of both overall partitions and individual clusters. Set relationships between clusters such as one-to-one match, split, and merge can be revealed. A covering point set for each cluster, a concept akin to the confidence interval, is proposed. The tools we have developed here will help address the crucial question of whether any cluster is an intrinsic or spurious pattern. Experimental results on both simulated and real datasets are provided.
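As a minimal illustration of the alignment idea: in the balanced case with equal cluster weights, matching two labelings one-to-one reduces to the assignment problem, which `scipy.optimize.linear_sum_assignment` solves. The sketch below (synthetic labels; the helper `align_labels` is hypothetical, not the paper's code) relabels one partition to best match another by maximizing overlap. The paper's optimal-transport formulation is more general, allowing soft and many-to-many matchings; this is only the one-to-one special case.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_labels(ref, other, k):
    """Relabel `other` so its clusters best match `ref` (one-to-one case)."""
    # Overlap matrix: entry [i, j] counts points in ref-cluster i and other-cluster j.
    overlap = np.zeros((k, k), dtype=int)
    for r, o in zip(ref, other):
        overlap[r, o] += 1
    # Maximizing total overlap is an assignment problem (minimize the negative).
    row, col = linear_sum_assignment(-overlap)
    mapping = dict(zip(col, row))
    return np.array([mapping[o] for o in other])

ref = np.array([0, 0, 1, 1, 2, 2])
other = np.array([2, 2, 0, 0, 1, 1])  # the same partition under permuted labels
aligned = align_labels(ref, other, 3)
print(aligned)  # → [0 0 1 1 2 2], identical to ref
```

Once labels are aligned across bootstrap replicates, per-cluster statistics such as covering point sets become straightforward to accumulate.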
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Clustering methods are valuable tools for the identification of patterns in high-dimensional data with applications in many scientific fields. However, quantifying uncertainty in clustering is a challenging problem, particularly when dealing with high-dimension low sample size (HDLSS) data. We develop a U-statistics based clustering approach that assesses statistical significance in clustering and is specifically tailored to HDLSS scenarios. These nonparametric methods rely on very few assumptions about the data, and thus can be applied to a wide range of datasets for which the Euclidean distance captures relevant features. Our main result is the development of a hierarchical significance clustering method. To do so, we first introduce an extension of a relevant U-statistic and develop its asymptotic theory. Additionally, as a preliminary step, we propose a binary nonnested significance clustering method and show its optimality in terms of expected values. Our approach is tested through multiple simulations and found to have more statistical power than competing alternatives in all scenarios considered. Our methods are further showcased in three applications ranging from genetics to image recognition problems. Code for these methods is available in R-package uclust. Supplementary materials for this article are available online.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Ridlo Wahyudi Wibowo
Released under CC0: Public Domain
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
The purpose of data mining analysis is to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset: before doing any work on the data, it has to be pre-processed, which normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. In our project, using clustering prior to classification did not improve performance much. One reason could be that the features we selected for clustering are not well suited to it. Given the nature of the data, classification tasks provide more information to work with in terms of improving knowledge and overall performance metrics.

From the dimensionality reduction perspective: clustering differs from Principal Component Analysis, which finds the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension can lose a great deal of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses much of its meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always a good idea, since you may lose almost all the information.

From the creating-new-features perspective: clustering analysis creates labels based on patterns in the data, which brings uncertainty into the data. When clustering is used prior to classification, the choice of the number of clusters strongly affects the clustering performance, and in turn the classification performance. If the subset of features we cluster on is well suited to it, clustering might increase the overall classification performance.
For example, if the features we run k-means on are numerical and low-dimensional, the overall classification performance may be better. We did not fix the clustering outputs with a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the methods selected. In short, the outcome we observed was that our results are not much better than random when applying clustering during data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to continue to revise the models from time to time as things change.
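A rough sketch of the experiment described above, assuming synthetic data and scikit-learn (not the project's actual dataset or pipeline): classify on the raw features, then on the cluster-id "reduced" representation, and compare accuracies.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the project's data.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline: classify on the raw features.
base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc_raw = accuracy_score(y_te, base.predict(X_te))

# "Clustering as dimensionality reduction": replace each point by its cluster id.
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X_tr)
reduced = LogisticRegression(max_iter=1000).fit(
    km.predict(X_tr).reshape(-1, 1), y_tr)
acc_cluster = accuracy_score(
    y_te, reduced.predict(km.predict(X_te).reshape(-1, 1)))

print(f"raw features: {acc_raw:.2f}  cluster ids: {acc_cluster:.2f}")
```

Feeding the integer cluster id directly to a linear model is deliberately crude; one-hot encoding the ids is the more common variant, but the information loss the text describes shows up either way.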
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Introduction: Adolescents’ physical activity (PA) behavior can be driven by several psychosocial determinants at the same time. Most analyses use a variable-based approach that examines relations between PA-related determinants and PA behavior on the between-person level. Using this approach, possible coexistences of different psychosocial determinants within one person cannot be examined. Therefore, by applying a person-oriented approach, this study examined (a) which profiles regarding PA-related psychosocial variables typically occur in female sixth-graders, (b) if these profiles deliver a self-consistent picture according to theoretical assumptions, and (c) if the profiles contribute to the explanation of PA.
Materials and Methods: The sample comprised 475 female sixth-graders. Seventeen PA-related variables were assessed: support for autonomy, competence and relatedness in PE as well as their satisfaction in PE and leisure-time; behavioral regulation of exercise (five subscales); self-efficacy and social support from friends and family (two subscales). Moderate-to-vigorous PA was measured using accelerometers. Data were analyzed using self-organizing maps (SOM), a cluster analysis based on an unsupervised algorithm for non-linear models.
Results: According to the respective level of psychosocial resources, a positive, a medium and a negative cluster were identified. This superordinate cluster solution represented a self-consistent picture that was in line with theoretical assumptions. The three-cluster solution contributed to the explanation of PA behavior, with the positive cluster accumulating an average of 6 min more moderate-to-vigorous PA per day than the medium cluster and 10 min more than the negative cluster.
Additionally, SOM detected a subgroup within the positive cluster that benefited from a specific combination of intrinsic and external regulations with regard to PA.
Discussion: The results underline the relevance of the assessed psychosocial determinants of PA behavior in female sixth-graders. The results further indicate that the different psychosocial resources within a given person do not develop independently of one another, which supports the use of a person-oriented approach. In addition, the SOM analysis identified subgroups with specific characteristics, which would have remained undetected using variable-based approaches. Thus, this approach offers the possibility to reduce data complexity without overlooking subgroups with special demands that go beyond the superordinate cluster solution.
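The person-oriented clustering idea can be illustrated with a minimal self-organizing map written from scratch; this is a toy stand-in, not the study's SOM pipeline, and the grid size, learning-rate schedule, and the three synthetic "profiles" are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(X, grid=(3, 1), epochs=50, lr0=0.5, sigma0=1.0):
    """Minimal SOM: each grid node holds a weight vector updated toward data."""
    gx, gy = grid
    coords = np.array([(i, j) for i in range(gx) for j in range(gy)], float)
    W = rng.normal(size=(gx * gy, X.shape[1]))
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)                       # decaying learning rate
        sigma = sigma0 * (1 - t / epochs) + 1e-3          # shrinking neighborhood
        for x in rng.permutation(X):
            bmu = np.argmin(((W - x) ** 2).sum(axis=1))   # best-matching unit
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            h = np.exp(-d2 / (2 * sigma ** 2))            # neighborhood kernel
            W += lr * h[:, None] * (x - W)
    return W

# Three well-separated 2-D "psychosocial profiles" (synthetic).
X = np.vstack([rng.normal(m, 0.1, size=(30, 2)) for m in (-2, 0, 2)])
W = train_som(X, grid=(3, 1))
labels = np.argmin(((X[:, None, :] - W[None, :, :]) ** 2).sum(-1), axis=1)
print(np.bincount(labels, minlength=3))  # node assignments per person
```

Each grid node plays the role of a cluster prototype; inspecting the node weights is what "profiling" a cluster amounts to.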
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
This artifact bundles the five dataset archives used in our private federated clustering evaluation, corresponding to the real-world benchmarks, scaling experiments, ablation studies, and timing performance tests described in the paper. The real_datasets.tar.xz includes ten established clustering benchmarks drawn from UCI and the Clustering basic benchmark (DOI: https://doi.org/10.1007/s10489-018-1238-7); scale_datasets.tar.xz contains the SynthNew family generated to assess scalability via the R clusterGeneration package; ablate_datasets.tar.xz holds the AblateSynth sets varying cluster separation for ablation analysis, also powered by clusterGeneration; g2_datasets.tar.xz packages the G2 sets, Gaussian clusters of size 2048 across dimensions 2-1024 with two clusters each, collected from the Clustering basic benchmark (DOI: https://doi.org/10.1007/s10489-018-1238-7); and timing_datasets.tar.xz includes the real s1 and lsun datasets alongside TimeSynth files (balanced synthetic clusters for timing), as per Mohassel et al.'s experimental framework.
Contains ten real-world benchmark datasets, formatted as one sample per line with space-separated features:
iris.txt: 150 samples, 4 features, 3 classes; classic UCI Iris dataset for petal/sepal measurements.
lsun.txt: 400 samples, 2 features, 3 clusters; two-dimensional variant of the LSUN dataset for clustering experiments.
s1.txt: 5,000 samples, 2 features, 15 clusters; synthetic benchmark from Fränti’s S1 series.
house.txt: 1,837 samples, 3 features, 3 clusters; housing data transformed for clustering tasks.
adult.txt: 48,842 samples, 6 features, 3 clusters; UCI Census Income (“Adult”) dataset for income bracket prediction.
wine.txt: 178 samples, 13 features, 3 cultivars; UCI Wine dataset with chemical analysis features.
breast.txt: 569 samples, 9 features, 2 classes; Wisconsin Diagnostic Breast Cancer dataset.
yeast.txt: 1,484 samples, 8 features, 10 localization sites; yeast protein localization data.
mnist.txt: 10,000 samples, 784 features (28×28 pixels), 10 digit classes; MNIST handwritten digits.
birch2.txt: a random 25,000-sample subset of the 100,000-sample BIRCH2 set, 2 features, 100 clusters; synthetic dataset for high-cluster-count evaluation.
Holds the SynthNew_{k}_{d}_{s}.txt files for scaling experiments, where:
$k \in \{2,4,8,16,32\}$ is the number of clusters,
$d \in \{2,4,8,16,32,64,128,256,512\}$ is the dimensionality,
$s \in \{1,2,3\}$ are different random seeds.
These are generated with the R clusterGeneration package with cluster sizes following a $1:2:...:k$ ratio. We incorporate a random number (in $[0, 100]$) of randomly sampled outliers and set the cluster separation degrees randomly in $[0.16, 0.26]$, spanning partially overlapping to separated clusters.
Contains the AblateSynth_{k}_{d}_{sep}.txt files for ablation studies, with:
$k \in \{2,4,8,16\}$ clusters,
$d \in \{2,4,8,16\}$ dimensions,
$sep \in \{0.25, 0.5, 0.75\}$ controlling cluster separation degrees.
Also generated via clusterGeneration.
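The SynthNew/AblateSynth generation recipes above come from the R clusterGeneration package; the numpy sketch below is a rough analog for readers without R (the separation scaling and noise level are assumptions, not the package's algorithm).

```python
import numpy as np

rng = np.random.default_rng(1)

def synth_new(k=4, d=8, sep=0.2, n_per_unit=50):
    """Rough numpy analog of the clusterGeneration setup: k Gaussian clusters
    with sizes in a 1:2:...:k ratio, plus a random number of uniform outliers."""
    sizes = [n_per_unit * (i + 1) for i in range(k)]
    centers = rng.uniform(-1, 1, size=(k, d))
    centers *= 1.0 + 10 * sep                    # crude separation control
    X = np.vstack([rng.normal(c, 0.5, size=(n, d))
                   for c, n in zip(centers, sizes)])
    y = np.repeat(np.arange(k), sizes)
    n_out = rng.integers(0, 101)                 # 0-100 outliers, as in the paper
    outliers = rng.uniform(X.min(), X.max(), size=(n_out, d))
    return np.vstack([X, outliers]), y, n_out

X, y, n_out = synth_new()
print(X.shape, y.shape, n_out)
```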
Packages the G2 synthetic sets (g2-{dim}-{var}.txt) from the clustering-data benchmarks:
$N=2048$ samples, $k=2$ Gaussian clusters,
Dimensions $d \in \{1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024\}$
Includes:
s1.txt, lsun.txt: two real datasets for baseline timing.
timesynth_{k}_{d}_{n}.txt: synthetic timing datasets with balanced cluster sizes $C_{avg} = N/k$, varying:
$k \in \{2,5\}$
$d \in \{2,5\}$
$N \in \{10000, 100000\}$
Generated similarly to the scaling sets, following Mohassel et al.’s timing experiment protocol.
Usage:
Unpack any archive with tar -xJf <archive>.tar.xz
A pilot study using discriminant analysis on major oxide and minor element data from 67 chert samples in the Rampart area, southeastern Tanana and southwest Livengood Quadrangles, western Yukon-Tanana Upland, Alaska, generally indicates a unique geochemical signature for the cherts of a given unit. Chert samples from five known type locales were used as standards of comparison: 1) Livengood Dome Chert (Ordovician), 2) Amy Creek unit (Proterozoic to early Paleozoic), 3) Rampart Group (Mississippian to Triassic), 4) Troublesome Creek unit (Devonian), and 5) Permian-Triassic clastic unit (associated with the Triassic-dated gabbro). Samples from the above units were compared to chert from Tanana B-1 area units of unknown or uncertain affinity. We have determined that discriminant analysis of chert geochemistry can assign chert profiles to specific units with only minor exceptions, and is useful in geologic mapping of the Tanana B-1 Quadrangle (Reifenstuhl and others, 1997).
According to the Food and Agricultural Organization (FAO), 123 million Chinese remained undernourished in 2003-2005, representing 14% of the global total. UNICEF states that 7.2 million of the world's stunted children are located in China. In absolute terms, China continues to rank among the top countries carrying the global burden of under-nutrition. China must, and still can, reduce under-nutrition, thus contributing even further to the global attainment of MDG1. It is in this context that the United Nations Joint Programme, in partnership with the Chinese government, has conducted this study. The key objective is to improve evidence of household food security through a baseline study in six pilot counties in rural China. The results will be used to guide policy and programmes aimed at reducing household food insecurity in the most vulnerable populations in China. The study is not meant to be an exhaustive analysis of the food security situation in the country, but to provide a demonstrative example of food assessment tools that may be replicated or scaled up to other places.
Six rural counties
The survey covered household heads and women aged 15-49 resident in the household. A household is defined as a group of people currently living and eating together "under the same roof" (or in the same compound if the household has two structures).
Sample survey data [ssd]
The required sample size for the survey was calculated using standard sample size calculations with each county representing a stratum. After the sample size was calculated, a two-stage clustering approach was applied. The first stage is the selection of villages using the probability proportional to size (PPS) method to create a self-weighted sample in which larger population clusters (villages) have a greater chance of selection, proportional to their size. Following the selection of the villages, 12 households within the village were selected using simple random selection.
Floods and landslides prevented the team from visiting two of the selected villages, one in Wuding and one in Panxian, so they substituted them with replacement villages.
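The two-stage design described above, PPS selection of villages followed by simple random selection of 12 households, can be sketched as follows (the village sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

def pps_sample(village_sizes, n_villages, households_per_village=12):
    """Stage 1: villages drawn with probability proportional to size (PPS).
    Stage 2: simple random sample of households within each chosen village."""
    sizes = np.asarray(village_sizes, dtype=float)
    chosen = rng.choice(len(sizes), size=n_villages, replace=False,
                        p=sizes / sizes.sum())
    return {int(v): sorted(rng.choice(int(sizes[v]),
                                      size=households_per_village,
                                      replace=False).tolist())
            for v in chosen}

villages = [120, 300, 80, 450, 60, 210]  # hypothetical household counts
picked = pps_sample(villages, n_villages=3)
print(picked)  # village index -> 12 sampled household indices
```

Drawing villages with probability proportional to size, then a fixed number of households per village, is what makes the sample approximately self-weighting.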
Face-to-face [f2f]
The household questionnaire was administered to all households in the survey and included modules on demography, education, migration and remittances, housing and facilities, household assets, agriculture, income activities, expenditure, food sources and consumption, shocks and coping strategies.
The objective of the village questionnaire was to gather contextual information on the six counties for descriptive purposes. In each village visited, a focus group discussion took place on topics including: population of the village, migrants, access to social services such as education and health, infrastructure, access to markets, difficulties facing the village, information on local agricultural practices.
The questionnaires were developed by WFP and Chinese Academy of Agricultural Sciences (CAAS) with inputs from partnering agencies. They were originally formulated in English and then translated into Mandarin. They were pilot tested in the field and corrected as needed. The final interviews were administered in Mandarin with translation provided in the local language when needed.
All questionnaires and modules are provided as external resources.
After data collection, data entry was carried out by CAAS staff in Beijing using EpiData software. The datasets were then exported into SPSS for analysis. Data cleaning was an iterative process throughout the data entry and analysis phases.
Descriptive analysis, correlation analysis, principal component analysis, cluster analysis and various other forms of analysis were conducted using SPSS.
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
Dataset abstract: This dataset contains the data files that were used for the cluster analysis of the Dutch minimizing construction, as described in the publication cited below. In addition to a ReadMe file, it contains three files: A txt file is provided with the corpus queries that were used to find tokens of the minimizing constructions in the Dutch Web 2014 (nlTenTen14) corpus, available via Sketch Engine (more information about the TenTen corpora: Jakubíček, M., A. Kilgarriff, V. Kovář, P. Rychlý & V. Suchomel (2013). The TenTen corpus family. In: 7th International Corpus Linguistics Conference CL. Lancaster, 125–127). A csv file is provided that forms the input file for the cluster analysis. It contains a list of 5,863 minimizer-predicate combinations, more specifically a list of the predicates that are combined with the minimizers that have a token frequency of at least 10 in my dataset. An R-script is provided with the code to perform the cluster analysis in R. Article abstract: This paper examines the semantic structuring of a paradigm of 89 minimizers, i.e., nouns that reinforce sentential negation in present-day Netherlandic Dutch, such as meter ‘meter’ in voor geen meter vertrouwen ‘not to trust for a meter’. Cosine distances are computed on the basis of the predicates the minimizers combine with in a sample of 100 tokens downloaded from the Dutch Web corpus 2014 (nlTenTen14) and clustered according to the Partitioning Around Medoids (PAM) algorithm into nine semantic clusters. The clusters largely correspond to semantic categories such as taboo terms or units of money. This suggests that, in general, minimizers belonging to the same semantic domain are combined with a similar (core) set of predicates. Based on the shared predicates per cluster, we detect signs of analogical attraction between minimizers or, conversely, competition. 
Crucially, low silhouette widths enable us to identify outliers in their respective clusters, for instance, minimizing nouns that exhibit signs of context expansion, as shown by their combination with semantically non-harmonious verbs. As such, this paper provides a synchronic snapshot of the semantic processes involved in (incipient) grammaticalization of minimizing nouns and, more generally, it illustrates how distributional semantics offers a heuristic to analyze the structure of a network of comparable micro-constructions.
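The pipeline described, cosine distances over predicate profiles, clustering, then silhouette widths to flag outliers, can be sketched in Python. Since PAM is not in scikit-learn, average-linkage hierarchical clustering on the precomputed distance matrix serves as a stand-in, and the count matrix is synthetic rather than the article's corpus data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_samples
from sklearn.metrics.pairwise import cosine_distances

rng = np.random.default_rng(3)

# Toy minimizer-by-predicate count matrix (rows: minimizers, cols: predicates).
counts = rng.poisson(2.0, size=(30, 12)).astype(float)
D = cosine_distances(counts)
np.fill_diagonal(D, 0.0)

# PAM stand-in: average-linkage clustering on the precomputed distances.
Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=4, criterion="maxclust")

# Low silhouette widths flag minimizers sitting uneasily in their cluster.
sil = silhouette_samples(D, labels, metric="precomputed")
print(np.where(sil < 0.05)[0])  # candidate outliers / context-expanding items
```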
This dataset was created by Dr Reza Rafiee
Evaluating Correlation Between Measurement Samples in Reverberation Chambers Using Clustering
Abstract: Traditionally, in reverberation chamber (RC) measurements, autocorrelation or correlation-matrix methods have been applied to evaluate measurement correlation. In this article, we introduce the use of clustering based on correlative distance to group correlated measurements. We apply the method to measurements taken in an RC using one and two paddles to stir the electromagnetic fields, applying decreasing angular steps between consecutive paddle positions. The results using varying correlation threshold values demonstrate that the method calculates the number of effective samples and allows discerning outliers, i.e., uncorrelated measurements, from clusters of correlated measurements. This calculation method, if verified, will allow non-sequential stir sequence design and, thereby, reduce testing time.
Keywords: Correlation, Pearson correlation coefficient (PCC), reverberation chambers (RC), mode-stirring samples, correlative distance, clustering analysis, adjacency matrix.
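A sketch of the correlative-distance idea on assumed toy data: the distance is taken as 1 - |PCC| and the dendrogram is cut at a correlation threshold (0.37, i.e. 1/e, is a common RC choice). This is an illustration, not the authors' code.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(4)

# Toy stirred-field samples: three groups of five highly correlated measurements.
base = rng.normal(size=(3, 200))
samples = np.vstack([b + 0.1 * rng.normal(size=(5, 200)) for b in base])

# Correlative distance: 1 - |Pearson correlation coefficient|.
D = 1.0 - np.abs(np.corrcoef(samples))
np.fill_diagonal(D, 0.0)

# Cut the dendrogram at the classic |PCC| = 0.37 (1/e) threshold,
# i.e. at a correlative distance of 1 - 0.37 = 0.63.
Z = linkage(squareform(D, checks=False), method="average")
clusters = fcluster(Z, t=0.63, criterion="distance")
n_eff = len(np.unique(clusters))  # cluster count ≈ number of effective samples
print(n_eff)
```

Each cluster of mutually correlated measurements contributes roughly one effective sample; singleton clusters are the uncorrelated outliers the abstract mentions.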
Objective: Esophageal cancer (EC) is characterized by a high degree of malignancy and poor prognosis. N6-methyladenosine (m6A), a prominent post-transcriptional modification of mRNA in mammalian cells, plays a pivotal role in regulating various cellular and biological processes. Similarly, cuproptosis has garnered attention for its potential implications in cancer biology. This study seeks to elucidate the impact of m6A- and cuproptosis-related long non-coding RNAs (m6aCRLncs) on the prognosis of patients with EC.
Methods: The EC transcriptional data and corresponding clinical information were retrieved from The Cancer Genome Atlas (TCGA) database, comprising 11 normal samples and 159 EC samples. Data on 23 m6A regulators and 25 cuproptosis-related genes were sourced from the latest literature. The m6aCRLncs linked to EC were identified through co-expression analysis. Differentially expressed m6aCRLncs associated with EC prognosis were screened using the limma package in R and univariate Cox regression analysis. Subtype clustering was performed to classify EC patients, enabling the investigation of differences in clinical outcomes and immune microenvironment across patient clusters. A risk prognostic model was constructed using least absolute shrinkage and selection operator (LASSO) regression. Its robustness was evaluated through survival analysis, risk stratification curves, and receiver operating characteristic (ROC) curves. Additionally, the model’s applicability across various clinical features and molecular subtypes of EC patients was assessed. To further explore the model’s utility in predicting the immune microenvironment, single-sample gene set enrichment analysis (ssGSEA), immune cell infiltration analysis, and immune checkpoint differential expression analysis were conducted. Drug sensitivity analysis was performed to identify potential therapeutic agents for EC.
Finally, the mRNA expression levels of m6aCRLncs in EC cell lines were validated using reverse transcription quantitative polymerase chain reaction (RT-qPCR).
Results: We developed a prognostic risk model based on five m6aCRLncs, namely ELF3-AS1, HNF1A-AS1, LINC00942, LINC01389, and MIR181A2HG, to predict survival outcomes and characterize the immune microenvironment in EC patients. Analysis of molecular subtypes and clinical features revealed significant differences in cluster distribution, disease stage, and N stage between high- and low-risk groups. Immune profiling further identified distinct immune cell populations and functional pathways associated with risk scores, including positive correlations with naive B cells, resting CD4+ T cells, and plasma cells, and negative correlations with macrophages M0 and M1. Additionally, we identified key immune checkpoint-related genes with significant differential expression between risk groups, including TNFRSF14, TNFSF15, TNFRSF18, LGALS9, CD44, HHLA2, and CD40. Furthermore, nine candidate drugs with potential therapeutic efficacy in EC were identified: Bleomycin, Cisplatin, Cyclopamine, PLX4720, Erlotinib, Gefitinib, RO.3306, XMD8.85, and WH.4.023. Finally, RT-qPCR validation of the mRNA expression levels of m6aCRLncs in EC cell lines demonstrated that ELF3-AS1 expression was significantly upregulated in the EC cell lines KYSE-30 and KYSE-180 compared to normal esophageal epithelial cells.
Conclusion: This study elucidates the role of m6aCRLncs in shaping the prognostic outcomes and immune microenvironment of EC. Furthermore, it identifies potential therapeutic agents with efficacy against EC. These findings hold significant promise for enhancing the survival of EC patients and provide valuable insights to inform clinical decision-making in the management of this disease.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
This dataset combines multiple parameters measured in groundwater and surface water samples, interpreted using Hierarchical Cluster Analysis. It combines groundwater and surface water hydrochemical data (major ions, EC, TDS) sourced from GA datasets that were originally sourced from the NT government database and other sources listed under the column 'Source'.
Geological and Bioregional Assessment Program
A hydrochemical dataset composed of 857 groundwater samples and 21 surface water samples was used for Hierarchical Cluster Analysis to investigate inter-aquifer and GW-SW connectivity. Groundwater samples are assigned to the following hydrostratigraphic units: Antrim Plateau Volcanics/Helen Springs/Jindare, Bukalara, Cenozoic, Cretaceous, Jinduckin/AnthonyLagoon/HookerCreek, Proterozoic and Tindall/GumRidge/Montejinni. Variables used for the analysis included major ions (Ca, Mg, Na, K, HCO3, Cl, SO4), electrical conductivity (EC) and pH, with data normalised for statistical calculation. Datapoints with a charge balance error above 10% were removed prior to interpretation. The interpretation resulted in five clusters with distinct geochemical properties, as described in the respective section of the technical report Hydrogeology of the Beetaloo GBA region. Check the 'Source' column to identify the data origin, with references in the above-mentioned report.
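The workflow described, normalise the variables, then hierarchical clustering cut into five clusters, can be sketched with hypothetical hydrochemistry values (the lognormal parameters below are illustrative, not the GBA data; Ward linkage is an assumption, as the report's linkage method is not stated here).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import zscore

rng = np.random.default_rng(5)

# Hypothetical hydrochemistry table: Ca, Mg, Na, K, HCO3, Cl, SO4, EC, pH.
X = np.column_stack([
    rng.lognormal(mean=m, sigma=0.5, size=60)
    for m in (3.5, 2.8, 4.0, 1.5, 5.5, 4.5, 3.0, 6.5, 0.0)
])

# Normalise each variable before clustering, as in the dataset description.
Xz = zscore(X, axis=0)

# Hierarchical clustering, cut into five clusters.
Z = linkage(Xz, method="ward")
clusters = fcluster(Z, t=5, criterion="maxclust")
print(np.bincount(clusters)[1:])  # samples per cluster
```

Normalising first matters: without it, high-magnitude variables such as EC would dominate the Euclidean distances.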
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Background: Clustering of gene expression data is widely used to identify novel subtypes of cancer. Many clustering approaches have been proposed, but there is a lack of knowledge regarding their relative merits and how data characteristics influence the performance. We evaluate how cluster analysis choices affect the performance by studying four publicly available human cancer data sets: breast, brain, kidney and stomach cancer. In particular, we focus on how the sample size, distribution of subtypes and sample heterogeneity affect the performance.
Results: In general, increasing the sample size had limited effect on the clustering performance, e.g. for the breast cancer data similar performance was obtained for n = 40 as for n = 330. The relative distribution of the subtypes had a noticeable effect on the ability to identify the disease subtypes, and data with disproportionate cluster sizes turned out to be difficult to cluster. Both the choice of clustering method and selection method affected the ability to identify the subtypes, but the relative performance varied between data sets, making it difficult to rank the approaches. For some data sets, the performance was substantially higher when the clustering was based on data from only one sex compared to data from a mixed population. This suggests that homogeneous data are easier to cluster than heterogeneous data and that clustering males and females individually may be beneficial and increase the chance to detect novel subtypes. It was also observed that the performance often differed substantially between females and males.
Conclusions: The number of samples seems to have a limited effect on the performance, while heterogeneity, at least with respect to sex, is important for the performance. Hence, by analyzing the sexes separately, the possible loss caused by having fewer samples could be outweighed by the benefit of more homogeneous data.
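A minimal version of this kind of benchmark, recovering known subtypes at two sample sizes and scoring with the adjusted Rand index, might look like this (synthetic blobs rather than the four cancer data sets; the methods and sizes shown are only examples):

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

results = {}
for n in (40, 330):  # the two sample sizes mentioned in the abstract
    X, y = make_blobs(n_samples=n, centers=4, cluster_std=2.0, random_state=0)
    for name, model in [
        ("k-means", KMeans(n_clusters=4, n_init=10, random_state=0)),
        ("ward", AgglomerativeClustering(n_clusters=4)),
    ]:
        # ARI compares the recovered partition against the known subtype labels.
        results[(n, name)] = adjusted_rand_score(y, model.fit_predict(X))

for (n, name), ari in sorted(results.items()):
    print(f"n={n:<4} {name:<8} ARI={ari:.2f}")
```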
https://creativecommons.org/publicdomain/zero/1.0/
This synthetic dataset simulates 300 global cities across 6 major geographic regions, designed specifically for unsupervised machine learning and clustering analysis. It explores how economic status, environmental quality, infrastructure, and digital access shape urban lifestyles worldwide.
| Feature | Description | Range |
|---|---|---|
| 10 Features | Economic, environmental & social indicators | Realistically scaled |
| 300 Cities | Europe, Asia, Americas, Africa, Oceania | Diverse distributions |
| Strong Correlations | Income ↔ Rent (+0.8), Density ↔ Pollution (+0.6) | ML-ready |
| No Missing Values | Clean, preprocessed data | Ready for analysis |
| 4-5 Natural Clusters | Metropolitan hubs, eco-towns, developing centers | Pre-validated |
✅ Realistic Correlations: Income strongly predicts rent (+0.8), internet access (+0.7), and happiness (+0.6)
✅ Regional Diversity: Each region has distinct economic and environmental characteristics
✅ Clustering-Ready: Naturally separable into 4-5 lifestyle archetypes
✅ Beginner-Friendly: No data cleaning required, includes example code
✅ Documented: Comprehensive README with methodology and use cases
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Load and prepare
df = pd.read_csv('city_lifestyle_dataset.csv')
X = df.drop(['city_name', 'country'], axis=1)
X_scaled = StandardScaler().fit_transform(X)
# Cluster
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)
# Analyze (numeric_only avoids errors on the string columns)
print(df.groupby('cluster').mean(numeric_only=True))
After working with this dataset, you will be able to:
1. Apply K-Means, DBSCAN, and Hierarchical Clustering
2. Use PCA for dimensionality reduction and visualization
3. Interpret correlation matrices and feature relationships
4. Create geographic visualizations with cluster assignments
5. Profile and name discovered clusters based on characteristics
| Cluster | Characteristics | Example Cities |
|---|---|---|
| Metropolitan Tech Hubs | High income, density, rent | Silicon Valley, Singapore |
| Eco-Friendly Towns | Low density, clean air, high happiness | Nordic cities |
| Developing Centers | Mid income, high density, poor air | Emerging markets |
| Low-Income Suburban | Low infrastructure, income | Rural areas |
| Industrial Mega-Cities | Very high density, pollution | Manufacturing hubs |
Unlike random synthetic data, this dataset was carefully engineered with: - ✨ Realistic correlation structures based on urban research - 🌍 Regional characteristics matching real-world patterns - 🎯 Optimal cluster separability (validated via silhouette scores) - 📚 Comprehensive documentation and starter code
✓ Learn clustering without data cleaning hassles
✓ Practice PCA and dimensionality reduction
✓ Create beautiful geographic visualizations
✓ Understand feature correlation in real-world contexts
✓ Build a portfolio project with clear business insights
This dataset was designed for educational purposes in machine learning and data science. While synthetic, it reflects real patterns observed in global urban development research.
Happy Clustering! 🎉
Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
We propose two probability-like measures of individual cluster-membership certainty which can be applied to a hard partition of the sample, such as that obtained from the Partitioning Around Medoids (PAM) algorithm, hierarchical clustering or k-means clustering. One measure extends the individual silhouette widths and the other is obtained directly from the pairwise dissimilarities in the sample. Unlike the classic silhouette, however, the measures behave like probabilities and can be used to investigate an individual's tendency to belong to a cluster. We also suggest two possible ways to evaluate the hard partition using these measures. We evaluate the performance of both measures on individuals with ambiguous cluster membership, using simulated binary datasets partitioned by the PAM algorithm and continuous datasets partitioned by hierarchical clustering and k-means clustering. For comparison, we also present results from soft-clustering algorithms such as fuzzy analysis clustering (FANNY) and two model-based clustering methods. Our proposed measures perform comparably to the posterior-probability estimators from either FANNY or the model-based clustering methods. We also illustrate the proposed measures by applying them to Fisher's classic data set on irises.
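In the spirit of the dissimilarity-based measure (though not the paper's exact estimator), a probability-like membership can be formed by normalising each point's inverse mean dissimilarity to every cluster:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances

# Synthetic hard partition to attach memberships to.
X, _ = make_blobs(n_samples=120, centers=3, cluster_std=1.5, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

D = pairwise_distances(X)
memberships = np.zeros((len(X), 3))
for k in range(3):
    # Mean dissimilarity from each point to the members of cluster k;
    # closer clusters get proportionally higher membership.
    mean_d = D[:, labels == k].mean(axis=1)
    memberships[:, k] = 1.0 / (mean_d + 1e-12)
memberships /= memberships.sum(axis=1, keepdims=True)

ambiguous = np.where(memberships.max(axis=1) < 0.5)[0]
print(len(ambiguous))  # points with no clear majority membership
```

Because the rows sum to one, these values can be read as cluster-membership certainties, which is the behaviour the abstract contrasts with the classic silhouette.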
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset features a collection of R scripts designed to leverage the Affinity Propagation machine-learning algorithm for clustering similar window photographs. These scripts employ an unsupervised clustering analysis, culminating in the manual tagging of clusters. Following this, the scripts process the tags and clusters to generate tabulated data in .csv format. Additionally, the code includes tools for reapplying EXIF metadata to images and organizing them efficiently. The dataset also provides a sample output, derived from processing the scripts with data from the study “Image dataset: Year-long hourly facade photos of a university building” (https://doi.org/10.1016/j.dib.2024.110798).
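The dataset's pipeline is implemented in R, but the clustering step it performs can be sketched in Python with scikit-learn's implementation of the same Affinity Propagation algorithm. This is an illustrative sketch under assumptions: the random feature vectors stand in for per-photo descriptors (e.g. brightness histograms), which the actual scripts derive from the window photographs.

```python
# Illustrative sketch (the dataset's real pipeline is in R): Affinity
# Propagation clustering of toy photo-like feature vectors using
# scikit-learn. Affinity Propagation picks exemplar points and does
# not require the number of clusters in advance.
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(1)
# Stand-in for per-photo feature vectors (e.g. brightness histograms).
features = np.vstack([
    rng.normal(loc=m, scale=0.2, size=(20, 4))
    for m in (0.0, 1.0, 2.0)
])

ap = AffinityPropagation(damping=0.9, max_iter=500, random_state=1)
ap.fit(features)
n_clusters = len(ap.cluster_centers_indices_)
```

After this step, the real scripts hand the resulting clusters to a human for manual tagging and then tabulate tags and clusters into .csv files.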
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset contains the data for the hierarchical cluster analysis as explained in the article "A panorama of inchoative constructions in Spanish: Cluster analysis as an answer to the near-synonymy puzzle". In total, the dataset contains 3955 observations, which are tokens of the inchoative construction for the following auxiliaries: comenzar, empezar, meter, poner, echar(se), liar, arrancar and romper. The data originates from the Spanish Web corpus (esTenTen18), accessed via Sketch Engine; only the European Spanish subcorpus was selected. The search syntax used to detect the inchoative construction was the following: “[lemma="empezar"] [tag="R.*"]{0,3}"a"[tag="V.*"] within " (replacing the concrete lemma "empezar" by other lemmas for each auxiliary; see Spinc_queries_20221202.txt for all concrete corpus queries). After downloading samples of 10,000 tokens per auxiliary, the samples were manually cleaned, and only 500 tokens per auxiliary were retained in the dataset. Next, the data were annotated for the infinitive observed after the preposition 'a' and for the semantic class to which this infinitive belongs, following the existing ADESSE classification (see below), besides other criteria that are not taken into account for this study. Concretely, the variables 'INF' (infinitive) and 'Class' were used as input for the hierarchical cluster analysis (see data-specific sections below for more information about the variables).
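The kind of hierarchical cluster analysis described can be sketched as follows. This is a minimal illustration with hypothetical toy counts (not the article's data): auxiliaries are clustered by the relative-frequency profiles of the semantic classes ('Class') of their infinitives.

```python
# Hedged sketch with invented toy counts: hierarchical (Ward) clustering
# of auxiliaries by the distribution of semantic classes among their
# infinitives, mirroring the analysis described in the dataset notes.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

aux = ["comenzar", "empezar", "poner", "echar"]
# Rows: auxiliaries; columns: hypothetical semantic-class token counts.
counts = np.array([
    [30, 10,  5],
    [28, 12,  6],
    [ 5, 25, 20],
    [ 4, 27, 18],
], dtype=float)
# Normalize to relative frequencies so profile shape, not sample size, drives distance.
profiles = counts / counts.sum(axis=1, keepdims=True)

Z = linkage(pdist(profiles, metric="euclidean"), method="ward")
groups = fcluster(Z, t=2, criterion="maxclust")  # cut dendrogram into 2 groups
```

With profiles like these, auxiliaries with similar semantic-class preferences end up in the same branch of the dendrogram, which is the near-synonymy structure the article investigates.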