100+ datasets found
  1. Dataset for: Optimal Transport, Mean Partition, and Uncertainty Assessment...

    • wiley.figshare.com
    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Jia Li; Beomseok Seo; Lin Lin (2023). Dataset for: Optimal Transport, Mean Partition, and Uncertainty Assessment in Cluster Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.8038925
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Wiley (https://www.wiley.com/)
    Authors
    Jia Li; Beomseok Seo; Lin Lin
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    In scientific data analysis, clusters identified computationally often substantiate existing hypotheses or motivate new ones. Yet the combinatorial nature of the clustering result, which is a partition rather than a set of parameters or a function, blurs notions of mean and variance. This intrinsic difficulty hinders the development of methods to improve clustering by aggregation or to assess the uncertainty of clusters generated. We overcome that barrier by aligning clusters via optimal transport. Equipped with this technique, we propose a new algorithm to enhance clustering by any baseline method using bootstrap samples. Cluster alignment enables us to quantify variation in the clustering result at the levels of both overall partitions and individual clusters. Set relationships between clusters such as one-to-one match, split, and merge can be revealed. A covering point set for each cluster, a concept kin to the confidence interval, is proposed. The tools we have developed here will help address the crucial question of whether any cluster is an intrinsic or spurious pattern. Experimental results on both simulated and real datasets are provided.

  2. Data from: U-Statistical Inference for Hierarchical Clustering

    • tandf.figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    Marcio Valk; Gabriela Bettella Cybis (2023). U-Statistical Inference for Hierarchical Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.12844523.v3
    Available download formats: zip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Marcio Valk; Gabriela Bettella Cybis
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Clustering methods are valuable tools for the identification of patterns in high-dimensional data with applications in many scientific fields. However, quantifying uncertainty in clustering is a challenging problem, particularly when dealing with high-dimension low sample size (HDLSS) data. We develop a U-statistics based clustering approach that assesses statistical significance in clustering and is specifically tailored to HDLSS scenarios. These nonparametric methods rely on very few assumptions about the data, and thus can be applied to a wide range of datasets for which the Euclidean distance captures relevant features. Our main result is the development of a hierarchical significance clustering method. To do so, we first introduce an extension of a relevant U-statistic and develop its asymptotic theory. Additionally, as a preliminary step, we propose a binary nonnested significance clustering method and show its optimality in terms of expected values. Our approach is tested through multiple simulations and found to have more statistical power than competing alternatives in all scenarios considered. Our methods are further showcased in three applications ranging from genetics to image recognition problems. Code for these methods is available in R-package uclust. Supplementary materials for this article are available online.

  3. fake dataset for clustering

    • kaggle.com
    zip
    Updated Feb 26, 2022
    Cite
    Ridlo Wahyudi Wibowo (2022). fake dataset for clustering [Dataset]. https://www.kaggle.com/datasets/ridloww/fake-dataset-untuk-clustering
    Available download formats: zip (24078 bytes)
    Dataset updated
    Feb 26, 2022
    Authors
    Ridlo Wahyudi Wibowo
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Ridlo Wahyudi Wibowo

    Released under CC0: Public Domain


  4. Educational Attainment in North Carolina Public Schools: Use of statistical...

    • data.mendeley.com
    Updated Nov 14, 2018
    Cite
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The purpose of data mining analysis is to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before any work is done on the data, it has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. In our project, using clustering prior to classification did not improve performance much. One possible reason is that the features we selected for clustering are not well suited to it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics. From the dimensionality reduction perspective: clustering differs from Principal Component Analysis, which finds the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension loses a lot of information, since clustering techniques are based on a 'distance' metric, and at high dimensions Euclidean distance loses much of its meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always a good idea, since you may lose almost all the information. From the creating new features perspective: clustering analysis creates labels based on the patterns of the data, so it brings uncertainty into the data. When clustering is used prior to classification, the choice of the number of clusters strongly affects the performance of the clustering, and in turn the performance of the classification. If the subset of features we cluster on is well suited to it, overall classification performance might increase. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not lock in the clustering outputs using a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the methods selected. Basically, the ramification we saw was that our results are not much better than random when applying clustering to the data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model's real-world effectiveness and to revise the models from time to time as things change.
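
    The run-to-run stability check described above can be sketched as follows. This is a minimal illustration, not the author's code: the description mentions random_state, which suggests scikit-learn, but the sketch uses a small pure-Python k-means to stay self-contained. The function names (kmeans, rand_index, stability) and the toy data are assumptions made for illustration. Stability is measured as the mean Rand index over all pairs of runs with different seeds; values well below 1 mean the partitions change from run to run.

```python
import random
from itertools import combinations


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed, iters=100):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # seed-dependent start
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree (same/different)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def stability(points, k, seeds):
    """Mean pairwise Rand index across runs; needs at least two seeds."""
    runs = [kmeans(points, k, s) for s in seeds]
    scores = [rand_index(x, y) for x, y in combinations(runs, 2)]
    return sum(scores) / len(scores)


# Toy data (hypothetical): two loose groups of 2-D points.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
print(stability(data, k=2, seeds=range(10)))
```

    The Rand index ignores label names, so two runs that find the same groups under different cluster numbers still count as identical.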

  5. Data_Sheet_1_Physical Activity-Related Profiles of Female Sixth-Graders...

    • frontiersin.figshare.com
    txt
    Updated May 31, 2023
    Cite
    Joachim Bachner; David J. Sturm; Xavier García-Massó; Javier Molina-García; Yolanda Demetriou (2023). Data_Sheet_1_Physical Activity-Related Profiles of Female Sixth-Graders Regarding Motivational Psychosocial Variables: A Cluster Analysis Within the CReActivity Project.CSV [Dataset]. http://doi.org/10.3389/fpsyg.2020.580563.s001
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Joachim Bachner; David J. Sturm; Xavier García-Massó; Javier Molina-García; Yolanda Demetriou
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Introduction: Adolescents’ physical activity (PA) behavior can be driven by several psychosocial determinants at the same time. Most analyses use a variable-based approach that examines relations between PA-related determinants and PA behavior on the between-person level. Using this approach, possible coexistences of different psychosocial determinants within one person cannot be examined. Therefore, by applying a person-oriented approach, this study examined (a) which profiles regarding PA-related psychosocial variables typically occur in female sixth-graders, (b) if these profiles deliver a self-consistent picture according to theoretical assumptions, and (c) if the profiles contribute to the explanation of PA.

    Materials and Methods: The sample comprised 475 female sixth-graders. Seventeen PA-related variables were assessed: support for autonomy, competence and relatedness in PE as well as their satisfaction in PE and leisure-time; behavioral regulation of exercise (five subscales); self-efficacy and social support from friends and family (two subscales). Moderate-to-vigorous PA was measured using accelerometers. Data were analyzed using the self-organizing maps (SOM) analysis, a cluster analysis including an unsupervised algorithm for non-linear models.

    Results: According to the respective level of psychosocial resources, a positive, a medium and a negative cluster were identified. This superordinate cluster solution represented a self-consistent picture that was in line with theoretical assumptions. The three-cluster solution contributed to the explanation of PA behavior, with the positive cluster accumulating an average of 6 min more moderate-to-vigorous PA per day than the medium cluster and 10 min more than the negative cluster. Additionally, SOM detected a subgroup within the positive cluster that benefited from a specific combination of intrinsic and external regulations with regard to PA.

    Discussion: The results underline the relevance of the assessed psychosocial determinants of PA behavior in female sixth-graders. The results further indicate that the different psychosocial resources within a given person do not develop independently of one another, which supports the use of a person-oriented approach. In addition, the SOM analysis identified subgroups with specific characteristics, which would have remained undetected using variable-based approaches. Thus, this approach offers the possibility to reduce data complexity without overlooking subgroups with special demands that go beyond the superordinate cluster solution.

  6. FastLloyd Clustering Datasets

    • zenodo.org
    xz
    Updated May 28, 2025
    Cite
    Abdulrahman Diaa; Thomas Humphries; Florian Kerschbaum (2025). FastLloyd Clustering Datasets [Dataset]. http://doi.org/10.5281/zenodo.15530593
    Available download formats: xz
    Dataset updated
    May 28, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Abdulrahman Diaa; Thomas Humphries; Florian Kerschbaum
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This artifact bundles the five dataset archives used in our private federated clustering evaluation, corresponding to the real-world benchmarks, scaling experiments, ablation studies, and timing performance tests described in the paper. The real_datasets.tar.xz archive includes ten established clustering benchmarks drawn from UCI and the Clustering basic benchmark (DOI: https://doi.org/10.1007/s10489-018-1238-7); scale_datasets.tar.xz contains the SynthNew family generated to assess scalability via the R clusterGeneration package; ablate_datasets.tar.xz holds the AblateSynth sets varying cluster separation for ablation analysis, also generated with clusterGeneration; g2_datasets.tar.xz packages the G2 sets (Gaussian clusters of size 2048 across dimensions 2–1024 with two clusters each), collected from the Clustering basic benchmark (DOI: https://doi.org/10.1007/s10489-018-1238-7); and timing_datasets.tar.xz includes the real s1 and lsun datasets alongside TimeSynth files (balanced synthetic clusters for timing), as per Mohassel et al.’s experimental framework.

    Contents

    1. real_datasets.tar.xz

    Contains ten real-world benchmark datasets, formatted as one sample per line with space-separated features:

    • iris.txt: 150 samples, 4 features, 3 classes; classic UCI Iris dataset for petal/sepal measurements.

    • lsun.txt: 400 samples, 2 features, 3 clusters; two-dimensional variant of the LSUN dataset for clustering experiments.

    • s1.txt: 5,000 samples, 2 features, 15 clusters; synthetic benchmark from Fränti’s S1 series.

    • house.txt: 1,837 samples, 3 features, 3 clusters; housing data transformed for clustering tasks.

    • adult.txt: 48,842 samples, 6 features, 3 clusters; UCI Census Income (“Adult”) dataset for income bracket prediction.

    • wine.txt: 178 samples, 13 features, 3 cultivars; UCI Wine dataset with chemical analysis features.

    • breast.txt: 569 samples, 9 features, 2 classes; Wisconsin Diagnostic Breast Cancer dataset.

    • yeast.txt: 1,484 samples, 8 features, 10 localization sites; yeast protein localization data.

    • mnist.txt: 10,000 samples, 784 features (28×28 pixels), 10 digit classes; MNIST handwritten digits.

    • birch2.txt: a random subset of 25,000 of the 100,000 samples, 2 features, 100 clusters; synthetic BIRCH2 dataset for high-cluster-count evaluation.

    2. scale_datasets.tar.xz

    Holds the SynthNew_{k}_{d}_{s}.txt files for scaling experiments, where:

    • $k \in \{2,4,8,16,32\}$ is the number of clusters,

    • $d \in \{2,4,8,16,32,64,128,256,512\}$ is the dimensionality,

    • $s \in \{1,2,3\}$ are different random seeds.

    These are generated with the R clusterGeneration package with cluster sizes following a $1:2:...:k$ ratio. We incorporate a random number (in $[0, 100]$) of randomly sampled outliers and set the cluster separation degrees randomly in $[0.16, 0.26]$, spanning partially overlapping to separated clusters.
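
    As a rough aid for working with this archive, the parameter grid above can be enumerated in a few lines. This is a sketch under the assumption that every (k, d, s) combination is present and that file names follow the pattern literally; the description does not state either point explicitly. The helper names (synthnew_filenames, ratio_sizes) and the unit parameter are hypothetical.

```python
from itertools import product

CLUSTERS = [2, 4, 8, 16, 32]                      # k
DIMS = [2, 4, 8, 16, 32, 64, 128, 256, 512]       # d
SEEDS = [1, 2, 3]                                 # s


def synthnew_filenames():
    """All file names implied by the SynthNew_{k}_{d}_{s}.txt pattern."""
    return [
        f"SynthNew_{k}_{d}_{s}.txt" for k, d, s in product(CLUSTERS, DIMS, SEEDS)
    ]


def ratio_sizes(k, unit):
    """Cluster sizes in a 1:2:...:k ratio; `unit` (hypothetical) is the
    size of the smallest cluster."""
    return [unit * i for i in range(1, k + 1)]


names = synthnew_filenames()
print(len(names), names[0], names[-1])
print(ratio_sizes(4, unit=100))
```

    If the archive turns out to contain fewer files than the full grid, the list can be filtered against the extracted directory.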

    3. ablate_datasets.tar.xz

    Contains the AblateSynth_{k}_{d}_{sep}.txt files for ablation studies, with:

    • $k \in \{2,4,8,16\}$ clusters,

    • $d \in \{2,4,8,16\}$ dimensions,

    • $sep \in \{0.25, 0.5, 0.75\}$ controlling cluster separation degrees.

    Also generated via clusterGeneration.

    4. g2_datasets.tar.xz

    Packages the G2 synthetic sets (g2-{dim}-{var}.txt) from the clustering-data benchmarks:

    • $N=2048$ samples, $k=2$ Gaussian clusters,

    • Dimensions $d \in \{1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024\}$

    • Cluster overlap $var \in \{10, 20, 30, 40, 50, 60, 70, 80, 90, 100\}$

    5. timing_datasets.tar.xz

    Includes:

    • s1.txt, lsun.txt: two real datasets for baseline timing.

    • timesynth_{k}_{d}_{n}.txt: synthetic timing datasets with balanced cluster sizes $C_{avg}=N/k$, varying:

      • $k \in \{2,5\}$

      • $d \in \{2,5\}$

      • $N \in \{10000, 100000\}$

    Generated similarly to the scaling sets, following Mohassel et al.’s timing experiment protocol.

    Usage:

    Unpack any archive with tar -xJf
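
    For readers who prefer to stay in Python, the unpack-and-load step might look like the sketch below. It relies only on the standard-library tarfile module (mode "r:xz" reads xz-compressed tar archives) and on the "one sample per line, space-separated" layout stated for real_datasets.tar.xz. The function names are illustrative, and the sketch assumes every column is numeric; whether any file carries a label column is not stated in the description. Only extract archives from a source you trust.

```python
import tarfile
from pathlib import Path


def unpack(archive_path, dest_dir):
    """Extract a .tar.xz archive, the Python counterpart of `tar -xJf`."""
    with tarfile.open(archive_path, mode="r:xz") as tar:
        tar.extractall(dest_dir)


def load_samples(path):
    """Read one sample per line; values are separated by whitespace."""
    rows = []
    with open(path) as fh:
        for line in fh:
            if line.strip():  # skip blank lines
                rows.append([float(v) for v in line.split()])
    return rows


# Example usage (paths are hypothetical):
#   unpack("real_datasets.tar.xz", "data")
#   iris = load_samples(Path("data") / "iris.txt")
```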

  7. Chert geochemistry discriminant analysis and K-meas cluster analysis:...

    • catalog.data.gov
    Updated Jul 5, 2023
    Cite
    State of Alaska, Department of Natural Resources, Division of Geological & Geophysical Surveys (Point of Contact) (2023). Chert geochemistry discriminant analysis and K-meas cluster analysis: Rampart Project area, Tanana B-1 Quadrangle, east-central Alaska [Dataset]. https://catalog.data.gov/dataset/chert-geochemistry-discriminant-analysis-and-k-meas-cluster-analysis-rampart-project-area-tanan1
    Dataset updated
    Jul 5, 2023
    Dataset provided by
    State of Alaska, Department of Natural Resources, Division of Geological & Geophysical Surveys (Point of Contact)
    Area covered
    Central, Tanana, Alaska
    Description

    A pilot study using discriminant analysis on major oxide and minor element data from 67 chert samples in the Rampart area, southeastern Tanana and southwest Livengood Quadrangles, western Yukon-Tanana Upland, Alaska, generally indicates a unique geochemical signature for the cherts of a given unit. Chert samples from five known type locales were used as standards of comparison: 1) Livengood Dome Chert (Ordovician), 2) Amy Creek unit (Proterozoic to early Paleozoic), 3) Rampart Group (Mississippian to Triassic), 4) Troublesome Creek unit (Devonian), and 5) Permian-Triassic clastic unit (associated with the Triassic-dated gabbro). Samples from the above units were compared to chert from Tanana B-1 area units of unknown or uncertain affinity. We have determined that discriminant analysis of chert geochemistry can assign chert profiles to specific units with only minor exceptions, and is useful in geologic mapping of the Tanana B-1 Quadrangle (Reifenstuhl and others, 1997).

  8. Comprehensive Food Security and Vulnerability Analysis 2010 - China

    • catalog.ihsn.org
    Updated Mar 29, 2019
    Cite
    World Food Programme (2019). Comprehensive Food Security and Vulnerability Analysis 2010 - China [Dataset]. https://catalog.ihsn.org/catalog/4350
    Dataset updated
    Mar 29, 2019
    Dataset authored and provided by
    World Food Programme (http://da.wfp.org/)
    Time period covered
    2010
    Area covered
    China
    Description

    Abstract

    According to the Food and Agriculture Organization (FAO), 123 million Chinese remained undernourished in 2003-2005. That represents 14% of the global total. UNICEF states that 7.2 million of the world's stunted children are located in China. In absolute terms, China continues to rank among the top countries carrying the global burden of under-nutrition. China must, and still can, reduce under-nutrition, thus contributing even further to the global attainment of MDG1. It is in this context that the United Nations Joint Programme, in partnership with the Chinese government, has conducted this study. The key objective is to improve evidence of household food security through a baseline study in six pilot counties in rural China. The results will be used to guide policy and programmes aimed at reducing household food insecurity in the most vulnerable populations in China. The study is not meant to be an exhaustive analysis of the food security situation in the country, but to provide a demonstrative example of food assessment tools that may be replicated or scaled up to other places.

    Geographic coverage

    Six rural counties

    Analysis unit

    • Household
    • Village

    Universe

    The survey covered household heads and women between 15-49 years resident of that household. A household is defined as a group of people currently living and eating together "under the same roof" (or in same compound if the household has 2 structures).

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The required sample size for the survey was calculated using standard sample size calculations with each county representing a stratum. After the sample size was calculated, a two-stage clustering approach was applied. The first stage is the selection of villages using the probability proportional to size (PPS) method to create a self-weighted sample in which larger population clusters (villages) have a greater chance of selection, proportional to their size. Following the selection of the villages, 12 households within the village were selected using simple random selection.
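
    The two-stage design described above can be illustrated with a short sketch. It uses systematic PPS selection, which is one common way to implement probability-proportional-to-size sampling; the survey documentation does not say which PPS variant was used, so that choice, the function names, and the village sizes below are all assumptions for illustration.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def pps_systematic(sizes, n_select, rng):
    """Systematic PPS: return n_select unit indices, each chosen with
    probability proportional to its size (large units may repeat)."""
    cumulative = list(accumulate(sizes))
    interval = cumulative[-1] / n_select
    start = rng.random() * interval            # random start in [0, interval)
    points = [start + i * interval for i in range(n_select)]
    last = len(sizes) - 1                      # guard against float rounding
    return [min(bisect_right(cumulative, p), last) for p in points]


def two_stage_sample(village_sizes, n_villages, per_village, seed=0):
    """Stage 1: villages by PPS. Stage 2: simple random sample of
    households within each selected village."""
    rng = random.Random(seed)
    chosen = pps_systematic(village_sizes, n_villages, rng)
    return [
        (v, sorted(rng.sample(range(village_sizes[v]), per_village)))
        for v in chosen
    ]


# Hypothetical household counts for five villages in one county.
sizes = [120, 300, 80, 450, 210]
for village, households in two_stage_sample(sizes, n_villages=2, per_village=12):
    print(village, households)
```

    The sample is self-weighted because the two stages cancel: a village of size M_i out of M total households is selected with probability n·M_i/M, and each household within it with probability 12/M_i, so every household has overall probability 12·n/M regardless of village size (as long as no village is larger than the sampling interval).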

    Sampling deviation

    Floods and landslides prevented the team from visiting two of the selected villages, one in Wuding and one in Panxian, so they substituted them with replacement villages.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    The household questionnaire was administered to all households in the survey and included modules on demography, education, migration and remittances, housing and facilities, household assets, agricultural, income activities, expenditure, food sources and consumption, shocks and coping strategies.

    The objective of the village questionnaire was to gather contextual information on the six counties for descriptive purposes. In each village visited, a focus group discussion took place on topics including: population of the village, migrants, access to social services such as education and health, infrastructure, access to markets, difficulties facing the village, information on local agricultural practices.

    The questionnaires were developed by WFP and Chinese Academy of Agricultural Sciences (CAAS) with inputs from partnering agencies. They were originally formulated in English and then translated into Mandarin. They were pilot tested in the field and corrected as needed. The final interviews were administered in Mandarin with translation provided in the local language when needed.

    All questionnaires and modules are provided as external resources.

    Cleaning operations

    After data collection, data entry was carried out by CAAS staff in Beijing using EpiData software. The datasets were then exported into SPSS for analysis. Data cleaning was an iterative process throughout the data entry and analysis phases.

    Descriptive analysis, correlation analysis, principal component analysis, cluster analysis and various other forms of analyses were conducted using SPSS.

  9. Replication Data for: The semantic structuring of minimizing constructions...

    • dataverse.no
    • search.dataone.org
    Updated Sep 2, 2025
    Cite
    Margot Van den Heede; Peter Lauwers (2025). Replication Data for: The semantic structuring of minimizing constructions in present-day Netherlandic Dutch: a distribution-based cluster analysis [Dataset]. http://doi.org/10.18710/GIKMKM
    Available download formats: text/comma-separated-values (95929), type/x-r-syntax (1369), txt (6170), txt (13467)
    Dataset updated
    Sep 2, 2025
    Dataset provided by
    DataverseNO
    Authors
    Margot Van den Heede; Peter Lauwers
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Netherlands
    Dataset funded by
    Special Research Fund for Concerted Research Actions - Ghent University
    Description

    Dataset abstract: This dataset contains the data files that were used for the cluster analysis of the Dutch minimizing construction, as described in the publication cited below. In addition to a ReadMe file, it contains three files: A txt file is provided with the corpus queries that were used to find tokens of the minimizing constructions in the Dutch Web 2014 (nlTenTen14) corpus, available via Sketch Engine (more information about the TenTen corpora: Jakubíček, M., A. Kilgarriff, V. Kovář, P. Rychlý & V. Suchomel (2013). The TenTen corpus family. In: 7th International Corpus Linguistics Conference CL. Lancaster, 125–127). A csv file is provided that forms the input file for the cluster analysis. It contains a list of 5,863 minimizer-predicate combinations, more specifically a list of the predicates that are combined with the minimizers that have a token frequency of at least 10 in my dataset. An R-script is provided with the code to perform the cluster analysis in R. Article abstract: This paper examines the semantic structuring of a paradigm of 89 minimizers, i.e., nouns that reinforce sentential negation in present-day Netherlandic Dutch, such as meter ‘meter’ in voor geen meter vertrouwen ‘not to trust for a meter’. Cosine distances are computed on the basis of the predicates the minimizers combine with in a sample of 100 tokens downloaded from the Dutch Web corpus 2014 (nlTenTen14) and clustered according to the Partitioning Around Medoids (PAM) algorithm into nine semantic clusters. The clusters largely correspond to semantic categories such as taboo terms or units of money. This suggests that, in general, minimizers belonging to the same semantic domain are combined with a similar (core) set of predicates. Based on the shared predicates per cluster, we detect signs of analogical attraction between minimizers or, conversely, competition. 
Crucially, low silhouette widths enable us to identify outliers in their respective clusters, for instance, minimizing nouns that exhibit signs of context expansion, as shown by their combination with semantically non-harmonious verbs. As such, this paper provides a synchronic snapshot of the semantic processes involved in (incipient) grammaticalization of minimizing nouns and, more in general, it illustrates how distributional semantics offers a heuristic to analyze the structure of a network of comparable micro-constructions.
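
    The distance step mentioned in the abstract (cosine distances between minimizers, computed from the predicates they combine with) can be sketched as below. This covers only the distance matrix, not the PAM clustering or the silhouette widths, and it is not the authors' R script. The minimizer labels and predicate counts are invented for illustration, and the sketch assumes no all-zero vectors.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(vectors):
    """Symmetric matrix of pairwise cosine distances."""
    n = len(vectors)
    return [
        [cosine_distance(vectors[i], vectors[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical predicate counts: rows are minimizers, columns are predicates.
counts = {
    "minimizer_a": [12, 0, 3, 1],
    "minimizer_b": [10, 1, 4, 0],
    "minimizer_c": [0, 9, 0, 7],
}
matrix = distance_matrix(list(counts.values()))
for name, row in zip(counts, matrix):
    print(name, [round(x, 3) for x in row])
```

    A matrix like this is the kind of input a medoid-based method such as PAM operates on, since PAM needs only pairwise dissimilarities rather than coordinates.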

  10. ICW2_Dataset | 2009 samples with 440 features

    • kaggle.com
    zip
    Updated Nov 17, 2022
    Cite
    Dr Reza Rafiee (2022). ICW2_Dataset | 2009 samples with 440 features [Dataset]. https://www.kaggle.com/datasets/rezarafiee/immune2009
    Available download formats: zip (3983894 bytes)
    Dataset updated
    Nov 17, 2022
    Authors
    Dr Reza Rafiee
    Description

    Dataset

    This dataset was created by Dr Reza Rafiee


  11. Evaluating Correlation Between Measurement Samples in Reverberation Chambers...

    • catalog.data.gov
    • datasets.ai
    Updated May 9, 2023
    Cite
    National Institute of Standards and Technology (2023). Evaluating Correlation Between Measurement Samples in Reverberation Chambers Using Clustering [Dataset]. https://catalog.data.gov/dataset/evaluating-correlation-between-measurement-samples-in-reverberation-chambers-using-cluster
    Dataset updated
    May 9, 2023
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    Evaluating Correlation Between Measurement Samples in Reverberation Chambers Using Clustering

    Abstract: Traditionally, in reverberation chambers (RC), measurement autocorrelation or correlation-matrix methods have been applied to evaluate measurement correlation. In this article, we introduce the use of clustering based on correlative distance to group correlated measurements. We apply the method to measurements taken in an RC using one and two paddles to stir the electromagnetic fields and applying decreasing angular steps between consecutive paddle positions. The results using varying correlation threshold values demonstrate that the method calculates the number of effective samples and allows discerning outliers, i.e., uncorrelated measurements, and clusters of correlated measurements. This calculation method, if verified, will allow non-sequential stir sequence design and, thereby, reduce testing time.

    Keywords: Correlation, Pearson correlation coefficient (PCC), reverberation chambers (RC), mode-stirring samples, correlative distance, clustering analysis, adjacency matrix.
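
    One plausible reading of threshold-based grouping with an adjacency matrix is sketched below. It is not the NIST method: the abstract does not define its correlative distance or its clustering rule, so the use of the absolute Pearson coefficient, the connected-components grouping, the function names, and the sample values are all assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def adjacency(samples, threshold):
    """adj[i][j] is True when |PCC| of samples i and j reaches the threshold."""
    n = len(samples)
    return [
        [i != j and abs(pearson(samples[i], samples[j])) >= threshold
         for j in range(n)]
        for i in range(n)
    ]


def connected_groups(adj):
    """Connected components of the adjacency graph; singletons are outliers."""
    n, seen, groups = len(adj), set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, group = [start], []
        seen.add(start)
        while stack:
            i = stack.pop()
            group.append(i)
            for j in range(n):
                if adj[i][j] and j not in seen:
                    seen.add(j)
                    stack.append(j)
        groups.append(sorted(group))
    return groups


# Hypothetical measurements: the first two are nearly proportional.
samples = [[1, 2, 3, 4], [2, 4, 6, 8.1], [4, 1, 3, 2]]
print(connected_groups(adjacency(samples, threshold=0.9)))
```

    Under this reading, raising the threshold splits groups apart and leaves more singletons, which matches the abstract's remark that results depend on the correlation threshold chosen.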

  12. Data Sheet 2_Subtype cluster analysis unveiled the correlation between m6A-...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Feb 17, 2025
    Cite
    Yang, Zhi; Liu, Lin; Shao, Xiaolong; An, Peng; Zhang, Ming; Wen, Pengfei; Yang, Mingyi; Su, Yani; Jing, Wensen; Yang, Peng (2025). Data Sheet 2_Subtype cluster analysis unveiled the correlation between m6A- and cuproptosis-related lncRNAs and the prognosis, immune microenvironment, and treatment sensitivity of esophageal cancer.zip [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001284208
    Dataset updated
    Feb 17, 2025
    Authors
    Yang, Zhi; Liu, Lin; Shao, Xiaolong; An, Peng; Zhang, Ming; Wen, Pengfei; Yang, Mingyi; Su, Yani; Jing, Wensen; Yang, Peng
    Description

    Objective: Esophageal cancer (EC) is characterized by a high degree of malignancy and poor prognosis. N6-methyladenosine (m6A), a prominent post-transcriptional modification of mRNA in mammalian cells, plays a pivotal role in regulating various cellular and biological processes. Similarly, cuproptosis has garnered attention for its potential implications in cancer biology. This study seeks to elucidate the impact of m6A- and cuproptosis-related long non-coding RNAs (m6aCRLncs) on the prognosis of patients with EC.

    Methods: The EC transcriptional data and corresponding clinical information were retrieved from The Cancer Genome Atlas (TCGA) database, comprising 11 normal samples and 159 EC samples. Data on 23 m6A regulators and 25 cuproptosis-related genes were sourced from the latest literature. The m6aCRLncs linked to EC were identified through co-expression analysis. Differentially expressed m6aCRLncs associated with EC prognosis were screened using the limma package in R and univariate Cox regression analysis. Subtype clustering was performed to classify EC patients, enabling the investigation of differences in clinical outcomes and immune microenvironment across patient clusters. A risk prognostic model was constructed using least absolute shrinkage and selection operator (LASSO) regression. Its robustness was evaluated through survival analysis, risk stratification curves, and receiver operating characteristic (ROC) curves. Additionally, the model’s applicability across various clinical features and molecular subtypes of EC patients was assessed. To further explore the model’s utility in predicting the immune microenvironment, single-sample gene set enrichment analysis (ssGSEA), immune cell infiltration analysis, and immune checkpoint differential expression analysis were conducted. Drug sensitivity analysis was performed to identify potential therapeutic agents for EC. Finally, the mRNA expression levels of m6aCRLncs in EC cell lines were validated using reverse transcription quantitative polymerase chain reaction (RT-qPCR).

    Results: We developed a prognostic risk model based on five m6aCRLncs, namely ELF3-AS1, HNF1A-AS1, LINC00942, LINC01389, and MIR181A2HG, to predict survival outcomes and characterize the immune microenvironment in EC patients. Analysis of molecular subtypes and clinical features revealed significant differences in cluster distribution, disease stage, and N stage between high- and low-risk groups. Immune profiling further identified distinct immune cell populations and functional pathways associated with risk scores, including positive correlations with naive B cells, resting CD4+ T cells, and plasma cells, and negative correlations with macrophages M0 and M1. Additionally, we identified key immune checkpoint-related genes with significant differential expression between risk groups, including TNFRSF14, TNFSF15, TNFRSF18, LGALS9, CD44, HHLA2, and CD40. Furthermore, nine candidate drugs with potential therapeutic efficacy in EC were identified: Bleomycin, Cisplatin, Cyclopamine, PLX4720, Erlotinib, Gefitinib, RO.3306, XMD8.85, and WH.4.023. Finally, RT-qPCR validation of the mRNA expression levels of m6aCRLncs in EC cell lines demonstrated that ELF3-AS1 expression was significantly upregulated in the EC cell lines KYSE-30 and KYSE-180 compared to normal esophageal epithelial cells.

    Conclusion: This study elucidates the role of m6aCRLncs in shaping the prognostic outcomes and immune microenvironment of EC. Furthermore, it identifies potential therapeutic agents with efficacy against EC. These findings hold significant promise for enhancing the survival of EC patients and provide valuable insights to inform clinical decision-making in the management of this disease.

  13. Groundwater and surface water sample locations included in hydrochemical...

    • data.gov.au
    zip
    Updated May 20, 2020
    Cite
    Bioregional Assessment Program (2020). Groundwater and surface water sample locations included in hydrochemical cluster analysis [Dataset]. https://data.gov.au/data/dataset/52556c6b-a44c-4ad4-8fb4-f9f609b1e228
    Explore at:
    zip(122638) Available download formats
    Dataset updated
    May 20, 2020
    Dataset provided by
    Bioregional Assessment Program
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    This dataset combines multiple parameters measured in groundwater and surface water samples and interpreted using Hierarchical Cluster Analysis. The hydrochemical data (major ions, EC, TDS) were sourced from GA datasets that were originally sourced from the NT government database and other sources listed under the column 'Source'.

    Attribution

    Geological and Bioregional Assessment Program

    History

    A hydrochemical dataset composed of 857 groundwater samples and 21 surface water samples was used for Hierarchical Cluster Analysis to investigate inter-aquifer and GW-SW connectivity. Groundwater samples are assigned to the following hydrostratigraphic units: Antrim Plateau Volcanics/Helen Springs/Jindare, Bukalara, Cenozoic, Cretaceous, Jinduckin/AnthonyLagoon/HookerCreek, Proterozoic and Tindall/GumRidge/Montejinni. Variables used for the analysis included major ions (Ca, Mg, Na, K, HCO3, Cl, SO4), electrical conductivity (EC) and pH, with data normalised for statistical calculation. Data points with a charge balance error above 10% were removed prior to interpretation. The interpretation resulted in five clusters with distinct geochemical properties, as described in the respective section of the technical report Hydrogeology of the Beetaloo GBA region. Check the 'Source' column to identify the data origin, with references in the above-mentioned report.
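    The charge balance screening mentioned above can be illustrated with a minimal sketch. The source does not state which formula was used; this assumes the commonly used definition, 100 × (cations − anions) / (cations + anions) in meq/L, and assumes concentrations are given in mg/L. The function names and the sample layout are hypothetical.

```python
# Equivalent weights in mg per meq (molar mass divided by ionic charge).
EQ_WEIGHT = {
    "Ca": 20.04, "Mg": 12.15, "Na": 22.99, "K": 39.10,
    "HCO3": 61.02, "Cl": 35.45, "SO4": 48.03,
}
CATIONS = ("Ca", "Mg", "Na", "K")
ANIONS = ("HCO3", "Cl", "SO4")


def charge_balance_error(sample_mg_per_l):
    """Return the charge balance error in percent for one sample.

    Assumes `sample_mg_per_l` maps each major ion to a concentration in mg/L.
    Raises ZeroDivisionError if every concentration is zero.
    """
    meq = {ion: sample_mg_per_l[ion] / EQ_WEIGHT[ion] for ion in EQ_WEIGHT}
    cations = sum(meq[ion] for ion in CATIONS)
    anions = sum(meq[ion] for ion in ANIONS)
    return 100.0 * (cations - anions) / (cations + anions)


def keep_balanced(samples, max_abs_cbe=10.0):
    """Keep only samples whose absolute charge balance error is within the limit."""
    return [s for s in samples if abs(charge_balance_error(s)) <= max_abs_cbe]


# Hypothetical samples: the first is balanced, the second is not.
balanced = {"Ca": 40.08, "Mg": 0, "Na": 0, "K": 0, "HCO3": 122.04, "Cl": 0, "SO4": 0}
unbalanced = {"Ca": 40.08, "Mg": 0, "Na": 0, "K": 0, "HCO3": 0, "Cl": 35.45, "SO4": 0}
kept = keep_balanced([balanced, unbalanced])
```

    Only the samples that pass this screen would then be normalised and passed to the clustering step.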

  14. Cluster analysis on high dimensional RNA-seq data with applications to...

    • plos.figshare.com
    xlsx
    Updated Jun 5, 2023
    Cite
    Linda Vidman; David Källberg; Patrik Rydén (2023). Cluster analysis on high dimensional RNA-seq data with applications to cancer research - An evaluation study [Dataset]. http://doi.org/10.1371/journal.pone.0219102
    Explore at:
    xlsx Available download formats
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOS http://plos.org/
    Authors
    Linda Vidman; David Källberg; Patrik Rydén
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background

    Clustering of gene expression data is widely used to identify novel subtypes of cancer. Plenty of clustering approaches have been proposed, but there is a lack of knowledge regarding their relative merits and how data characteristics influence the performance. We evaluate how cluster analysis choices affect the performance by studying four publicly available human cancer data sets: breast, brain, kidney and stomach cancer. In particular, we focus on how the sample size, distribution of subtypes and sample heterogeneity affect the performance.

    Results

    In general, increasing the sample size had limited effect on the clustering performance, e.g. for the breast cancer data similar performance was obtained for n = 40 as for n = 330. The relative distribution of the subtypes had a noticeable effect on the ability to identify the disease subtypes, and data with disproportionate cluster sizes turned out to be difficult to cluster. Both the choice of clustering method and selection method affected the ability to identify the subtypes, but the relative performance varied between data sets, making it difficult to rank the approaches. For some data sets, the performance was substantially higher when the clustering was based on data from only one sex compared to data from a mixed population. This suggests that homogeneous data are easier to cluster than heterogeneous data and that clustering males and females individually may be beneficial and increase the chance to detect novel subtypes. It was also observed that the performance often differed substantially between females and males.

    Conclusions

    The number of samples seems to have a limited effect on the performance, while the heterogeneity, at least with respect to sex, is important for the performance. Hence, by analyzing the genders separately, the possible loss caused by having fewer samples could be outweighed by the benefit of more homogeneous data.

  15. 🌆 City Lifestyle Segmentation Dataset

    • kaggle.com
    zip
    Updated Nov 15, 2025
    Cite
    UmutUygurr (2025). 🌆 City Lifestyle Segmentation Dataset [Dataset]. https://www.kaggle.com/datasets/umuttuygurr/city-lifestyle-segmentation-dataset
    Explore at:
    zip(11274 bytes) Available download formats
    Dataset updated
    Nov 15, 2025
    Authors
    UmutUygurr
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🌆 About This Dataset

    This synthetic dataset simulates 300 global cities across 6 major geographic regions, designed specifically for unsupervised machine learning and clustering analysis. It explores how economic status, environmental quality, infrastructure, and digital access shape urban lifestyles worldwide.

    🎯 Perfect For:

    • 📊 K-Means, DBSCAN, Agglomerative Clustering
    • 🔬 PCA & t-SNE Dimensionality Reduction
    • 🗺️ Geospatial Visualization (Plotly, Folium)
    • 📈 Correlation Analysis & Feature Engineering
    • 🎓 Educational Projects (Beginner to Intermediate)

    📦 What's Inside?

    Feature | Description | Range
    10 Features | Economic, environmental & social indicators | Realistically scaled
    300 Cities | Europe, Asia, Americas, Africa, Oceania | Diverse distributions
    Strong Correlations | Income ↔ Rent (+0.8), Density ↔ Pollution (+0.6) | ML-ready
    No Missing Values | Clean, preprocessed data | Ready for analysis
    4-5 Natural Clusters | Metropolitan hubs, eco-towns, developing centers | Pre-validated

    🔥 Key Features

    Realistic Correlations: Income strongly predicts rent (+0.8), internet access (+0.7), and happiness (+0.6)
    Regional Diversity: Each region has distinct economic and environmental characteristics
    Clustering-Ready: Naturally separable into 4-5 lifestyle archetypes
    Beginner-Friendly: No data cleaning required, includes example code
    Documented: Comprehensive README with methodology and use cases
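
    The correlation figures quoted above (for example +0.8 between income and rent) are Pearson coefficients in the usual sense. As a rough sketch of how such a value is computed, the following uses only the standard library; the function name `pearson_r` and the toy numbers are illustrative assumptions and are not taken from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values, not taken from the dataset.
income = [1200, 2500, 3100, 4800, 5200]
rent = [400, 900, 1000, 1700, 1600]
r_income_rent = pearson_r(income, rent)
```

    On the real file, pandas' `DataFrame.corr()` applied to the numeric columns should give the full correlation matrix in one call.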

    🚀 Quick Start Example

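    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Load the data and drop the two text columns, leaving the numeric features
    df = pd.read_csv('city_lifestyle_dataset.csv')
    X = df.drop(['city_name', 'country'], axis=1)
    # Standardize so that no single feature dominates the distance computation
    X_scaled = StandardScaler().fit_transform(X)

    # Cluster (n_init is set explicitly so results do not depend on the library default)
    kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
    df['cluster'] = kmeans.fit_predict(X_scaled)

    # Analyze: per-cluster feature means (numeric_only skips the text columns)
    print(df.groupby('cluster').mean(numeric_only=True))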
    

    🎓 Learning Outcomes

    After working with this dataset, you will be able to:
    1. Apply K-Means, DBSCAN, and Hierarchical Clustering
    2. Use PCA for dimensionality reduction and visualization
    3. Interpret correlation matrices and feature relationships
    4. Create geographic visualizations with cluster assignments
    5. Profile and name discovered clusters based on characteristics

    📚 Ideal For These Projects

    • 🏆 Kaggle Competitions: Practice clustering techniques
    • 📝 Academic Projects: Urban planning, sociology, environmental science
    • 💼 Portfolio Work: Showcase ML skills to employers
    • 🎓 Learning: Hands-on practice with unsupervised learning
    • 🔬 Research: Urban lifestyle segmentation studies

    🌍 Expected Clusters

    Cluster | Characteristics | Example Cities
    Metropolitan Tech Hubs | High income, density, rent | Silicon Valley, Singapore
    Eco-Friendly Towns | Low density, clean air, high happiness | Nordic cities
    Developing Centers | Mid income, high density, poor air | Emerging markets
    Low-Income Suburban | Low infrastructure, income | Rural areas
    Industrial Mega-Cities | Very high density, pollution | Manufacturing hubs

    🛠️ Technical Details

    • Format: CSV (UTF-8)
    • Size: ~300 rows × 10 columns
    • Missing Values: 0%
    • Data Types: 2 categorical, 8 numerical
    • Target Variable: None (unsupervised)
    • Correlation Strength: Pre-validated (r: 0.4 to 0.8)

    📖 What Makes This Dataset Special?

    Unlike random synthetic data, this dataset was carefully engineered with:
    • ✨ Realistic correlation structures based on urban research
    • 🌍 Regional characteristics matching real-world patterns
    • 🎯 Optimal cluster separability (validated via silhouette scores)
    • 📚 Comprehensive documentation and starter code
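
    The description does not say how the silhouette validation was carried out. As a hedged illustration of what a silhouette score measures, here is a small standard-library sketch that assumes Euclidean distance; the function name and the toy points are hypothetical. For each point it compares the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b) and averages (b − a) / max(a, b).

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Mean distance from point i to the given members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = mean_dist(i, own)
        b = min(mean_dist(i, members)
                for other, members in clusters.items() if other != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two well-separated pairs of points.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(toy_points, [0, 0, 1, 1])
bad = silhouette_score(toy_points, [0, 1, 0, 1])
```

    Values close to 1 indicate compact, well-separated clusters, and negative values suggest points assigned to the wrong cluster. In practice, `sklearn.metrics.silhouette_score` computes the same quantity on the scaled feature matrix and can be compared across candidate values of k.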

    🏅 Use This Dataset If You Want To:

    ✓ Learn clustering without data cleaning hassles
    ✓ Practice PCA and dimensionality reduction
    ✓ Create beautiful geographic visualizations
    ✓ Understand feature correlation in real-world contexts
    ✓ Build a portfolio project with clear business insights

    📊 Acknowledgments

    This dataset was designed for educational purposes in machine learning and data science. While synthetic, it reflects real patterns observed in global urban development research.

    Happy Clustering! 🎉

  16. Data from: Supplementary Material for "Sonification for Exploratory Data...

    • search.datacite.org
    • pub.uni-bielefeld.de
    Updated Feb 5, 2019
    Cite
    Thomas Hermann (2019). Supplementary Material for "Sonification for Exploratory Data Analysis" [Dataset]. http://doi.org/10.4119/unibi/2920448
    Explore at:
    Dataset updated
    Feb 5, 2019
    Dataset provided by
    DataCite https://www.datacite.org/
    Bielefeld University
    Authors
    Thomas Hermann
    License

    Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Sonification for Exploratory Data Analysis

    Chapter 8: Sonification Models

    In Chapter 8 of the thesis, 6 sonification models are presented to give some examples for the framework of Model-Based Sonification, developed in Chapter 7. Sonification models determine the rendering of the sonification and possible interactions. The "model in mind" helps the user to interpret the sound with respect to the data.

    8.1 Data Sonograms

    Data Sonograms use spherical expanding shock waves to excite linear oscillators which are represented by point masses in model space.

    • Table 8.2, page 87: Sound examples for Data Sonograms
      File:
        Iris dataset: started in plot (a) at S0 (b) at S1 (c) at S2
        10d noisy circle dataset: started in plot (c) at S0 (mean) (d) at S1 (edge)
        10d Gaussian: plot (d) started at S0
        3 clusters: Example 1
        3 clusters: invisible columns used as output variables: Example 2
      Description: Data Sonogram Sound examples for synthetic datasets and the Iris dataset
      Duration: about 5 s

    8.2 Particle Trajectory Sonification Model

    This sonification model explores features of a data distribution by computing the trajectories of test particles which are injected into model space and move according to Newton's laws of motion in a potential given by the dataset.

    • Sound example: page 93, PTSM-Ex-1 Audification of 1 particle in the potential of phi(x).
    • Sound example: page 93, PTSM-Ex-2 Audification of a sequence of 15 particles in the potential of a dataset with 2 clusters.
    • Sound example: page 94, PTSM-Ex-3 Audification of 25 particles simultaneously in a potential of a dataset with 2 clusters.
    • Sound example: page 94, PTSM-Ex-4 Audification of 25 particles simultaneously in a potential of a dataset with 1 cluster.
    • Sound example: page 95, PTSM-Ex-5 sigma-step sequence for a mixture of three Gaussian clusters
    • Sound example: page 95, PTSM-Ex-6 sigma-step sequence for a Gaussian cluster
    • Sound example: page 96, PTSM-Iris-1 Sonification for the Iris Dataset with 20 particles per step.
    • Sound example: page 96, PTSM-Iris-2 Sonification for the Iris Dataset with 3 particles per step.
    • Sound example: page 96, PTSM-Tetra-1 Sonification for a 4d tetrahedron clusters dataset.

    8.3 Markov chain Monte Carlo Sonification

    The McMC Sonification Model defines an exploratory process in the domain of a given density p such that the acoustic representation summarizes features of p, particularly concerning the modes of p by sound.

    • Sound Example: page 105, MCMC-Ex-1 McMC Sonification, stabilization of amplitudes.
    • Sound Example: page 106, MCMC-Ex-2 Trajectory Audification for 100 McMC steps in 3 cluster dataset
    • McMC Sonification for Cluster Analysis, dataset with three clusters, page 107
        • Stream 1 MCMC-Ex-3.1
        • Stream 2 MCMC-Ex-3.2
        • Stream 3 MCMC-Ex-3.3
        • Mix MCMC-Ex-3.4
    • McMC Sonification for Cluster Analysis, dataset with three clusters, T=0.002s, page 107
        • Stream 1 MCMC-Ex-4.1 (stream 1)
        • Stream 2 MCMC-Ex-4.2 (stream 2)
        • Stream 3 MCMC-Ex-4.3 (stream 3)
        • Mix MCMC-Ex-4.4
    • McMC Sonification for Cluster Analysis, density with 6 modes, T=0.008s, page 107
        • Stream 1 MCMC-Ex-5.1 (stream 1)
        • Stream 2 MCMC-Ex-5.2 (stream 2)
        • Stream 3 MCMC-Ex-5.3 (stream 3)
        • Mix MCMC-Ex-5.4
    • McMC Sonification for the Iris dataset, page 108
        • MCMC-Ex-6.1
        • MCMC-Ex-6.2
        • MCMC-Ex-6.3
        • MCMC-Ex-6.4
        • MCMC-Ex-6.5
        • MCMC-Ex-6.6
        • MCMC-Ex-6.7
        • MCMC-Ex-6.8

    8.4 Principal Curve Sonification

    Principal Curve Sonification represents data by synthesizing the soundscape while a virtual listener moves along the principal curve of the dataset through the model space.

    • Noisy Spiral dataset, PCS-Ex-1.1, page 113
    • Noisy Spiral dataset with variance modulation PCS-Ex-1.2, page 114
    • 9d tetrahedron cluster dataset (10 clusters) PCS-Ex-2, page 114
    • Iris dataset, class label used as pitch of auditory grains PCS-Ex-3, page 114

    8.5 Data Crystallization Sonification Model

    • Table 8.6, page 122: Sound examples for Crystallization Sonification for 5d Gaussian distribution
      File: DCS started at center, in tail, from far outside
      Description: DCS for dataset sampled from N{0, I_5} excited at different locations
      Duration: 1.4 s
    • Mixture of 2 Gaussians, page 122
        • DCS started at point A DCS-Ex1A
        • DCS started at point B DCS-Ex1B
    • Table 8.7, page 124: Sound examples for DCS on variation of the harmonics factor
      File: h_omega = 1, 2, 3, 4, 5, 6
      Description: DCS for a mixture of two Gaussians with varying harmonics factor
      Duration: 1.4 s
    • Table 8.8, page 124: Sound examples for DCS on variation of the energy decay time
      File: tau_(1/2) = 0.001, 0.005, 0.01, 0.05, 0.1, 0.2
      Description: DCS for a mixture of two Gaussians varying the energy decay time tau_(1/2)
      Duration: 1.4 s
    • Table 8.9, page 125: Sound examples for DCS on variation of the sonification time
      File: T = 0.2, 0.5, 1, 2, 4, 8
      Description: DCS for a mixture of two Gaussians on varying the duration T
      Duration: 0.2s -- 8s
    • Table 8.10, page 125: Sound examples for DCS on variation of model space dimension
      File: selected columns of the dataset: (x0) (x0,x1) (x0,...,x2) (x0,...,x3) (x0,...,x4) (x0,...,x5)
      Description: DCS for a mixture of two Gaussians varying the dimension
      Duration: 1.4 s
    • Table 8.11, page 126: Sound examples for DCS for different excitation locations
      File: starting point: C0, C1, C2
      Description: DCS for a mixture of three Gaussians in 10d space with different rank(S) = {2,4,8}
      Duration: 1.9 s
    • Table 8.12, page 126: Sound examples for DCS for the mixture of a 2d distribution and a 5d cluster
      File: condensation nucleus in (x0,x1)-plane at: (-6,0)=C1, (-3,0)=C2, (0,0)=C0
      Description: DCS for a mixture of a uniform 2d and a 5d Gaussian
      Duration: 2.16 s
    • Table 8.13, page 127: Sound examples for DCS for the cancer dataset
      File: condensation nucleus in (x0,x1)-plane at:
        benign 1, benign 2
        malignant 1, malignant 2
      Description: DCS for a mixture of a uniform 2d and a 5d Gaussian
      Duration: 2.16 s

    8.6 Growing Neural Gas Sonification

    • Table 8.14, page 133: Sound examples for GNGS Probing
      File:
        Cluster C0 (2d): a, b, c
        Cluster C1 (4d): a, b, c
        Cluster C2 (8d): a, b, c
      Description: GNGS for a mixture of 3 Gaussians in 10d space
      Duration: 1 s
    • Table 8.15, page 134: Sound examples for GNGS for the noisy spiral dataset
      File:
        (a) GNG with 3 neurons 1, 2
        (b) GNG with 20 neurons end, middle, inner end
        (c) GNG with 45 neurons outer end, middle, close to inner end, at inner end
        (d) GNG with 150 neurons outer end, in the middle, inner end
        (e) GNG with 20 neurons outer end, in the middle, inner end
        (f) GNG with 45 neurons outer end, in the middle, inner end
      Description: GNG probing sonification for 2d noisy spiral dataset
      Duration: 1 s
    • Table 8.16, page 136: Sound examples for GNG Process Monitoring Sonification for different data distributions
      File:
        Noisy spiral with 1 rotation: sound
        Noisy spiral with 2 rotations: sound
        Gaussian in 5d: sound
        Mixture of 5d and 2d distributions: sound
      Description: GNG process sonification examples
      Duration: 5 s

    Chapter 9: Extensions

    In this chapter, two extensions for Parameter Mapping

  17. Data from: Simple Measures of Individual Cluster-Membership Certainty for...

    • datasetcatalog.nlm.nih.gov
    • tandf.figshare.com
    Updated Jul 9, 2018
    Cite
    Graham, Jinko; Liu, Dongmeng (2018). Simple Measures of Individual Cluster-Membership Certainty for Hard Partitional Clustering [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000690043
    Explore at:
    Dataset updated
    Jul 9, 2018
    Authors
    Graham, Jinko; Liu, Dongmeng
    Description

    We propose two probability-like measures of individual cluster-membership certainty which can be applied to a hard partition of the sample such as that obtained from the Partitioning Around Medoids (PAM) algorithm, hierarchical clustering or k-means clustering. One measure extends the individual silhouette widths and the other is obtained directly from the pairwise dissimilarities in the sample. Unlike the classic silhouette, however, the measures behave like probabilities and can be used to investigate an individual's tendency to belong to a cluster. We also suggest two possible ways to evaluate the hard partition using these measures. We evaluate the performance of both measures in individuals with ambiguous cluster membership, using simulated binary datasets that have been partitioned by the PAM algorithm or continuous datasets that have been partitioned by hierarchical clustering and k-means clustering. For comparison, we also present results from soft-clustering algorithms such as fuzzy analysis clustering (FANNY) and two model-based clustering methods. Our proposed measures perform comparably to the posterior-probability estimators from either FANNY or the model-based clustering methods. We also illustrate the proposed measures by applying them to Fisher's classic data set on irises.

  18. Data from: "Size" and "shape" in the measurement of multivariate...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Jul 4, 2025
    Cite
    Michael Greenacre (2025). "Size" and "shape" in the measurement of multivariate proximity [Dataset]. http://doi.org/10.5061/dryad.6r5j8
    Explore at:
    Dataset updated
    Jul 4, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Michael Greenacre
    Time period covered
    Mar 16, 2018
    Description
    1. Ordination and clustering methods are widely applied to ecological data that are nonnegative, for example species abundances or biomasses. These methods rely on a measure of multivariate proximity that quantifies differences between the sampling units (e.g. individuals, stations, time points), leading to results such as: (i) ordinations of the units, where interpoint distances optimally display the measured differences; (ii) clustering the units into homogeneous clusters; or (iii) assessing differences between pre-specified groups of units (e.g., regions, periods, treatment-control groups). 2. These methods all conceal a fundamental question: To what extent are the differences between the sampling units, computed according to the chosen proximity function, capturing the "size" in the multivariate observations, or their "shape"? "Size" means the overall level of the measurements: for example, some samples contain higher total abundances or more biomass, others less. "Shape" mea...
  19. R scripts to cluster similar window photos using the affinity propagation...

    • dataverse.csuc.cat
    tsv, txt +1
    Updated Sep 20, 2024
    Cite
    Marc Roca Musach; Isabel Crespo Cabillo; Helena Coch Roura (2024). R scripts to cluster similar window photos using the affinity propagation machine-learning algorithm [Dataset]. http://doi.org/10.34810/data1736
    Explore at:
    type/x-r-syntax(4715), type/x-r-syntax(1054), tsv(449380), type/x-r-syntax(2270), tsv(283403), type/x-r-syntax(10434), tsv(291418), tsv(332514), type/x-r-syntax(6637), type/x-r-syntax(1874), type/x-r-syntax(5495), type/x-r-syntax(4511), type/x-r-syntax(6802), txt(9703), type/x-r-syntax(2178) Available download formats
    Dataset updated
    Sep 20, 2024
    Dataset provided by
    CORA.Repositori de Dades de Recerca
    Authors
    Marc Roca Musach; Isabel Crespo Cabillo; Helena Coch Roura
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    https://ror.org/003x0zc53
    https://ror.org/01bg62x04
    Description

    This dataset features a collection of R scripts designed to leverage the Affinity Propagation machine-learning algorithm for clustering similar window photographs. These scripts employ an unsupervised clustering analysis, culminating in the manual tagging of clusters. Following this, the scripts process the tags and clusters to generate tabulated data in .csv format. Additionally, the code includes tools for reapplying EXIF metadata to images and organizing them efficiently. The dataset also provides a sample output, derived from processing the scripts with data from the study “Image dataset: Year-long hourly facade photos of a university building” (https://doi.org/10.1016/j.dib.2024.110798).

  20. Replication Data for: A panorama of inchoative constructions in Spanish:...

    • dataverse.azure.uit.no
    • dataverse.no
    csv, txt +1
    Updated Sep 5, 2023
    Cite
    Sven Van Hulle (2023). Replication Data for: A panorama of inchoative constructions in Spanish: Cluster analysis as an answer to the near-synonymy puzzle. [Dataset]. http://doi.org/10.18710/DR8QKQ
    Explore at:
    txt(2182), csv(80915), type/x-r-syntax(1260), csv(62008), txt(8147) Available download formats
    Dataset updated
    Sep 5, 2023
    Dataset provided by
    DataverseNO
    Authors
    Sven Van Hulle
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Spain
    Description

    The dataset contains the data for the hierarchical cluster analysis as explained in the article "A panorama of inchoative constructions in Spanish: Cluster analysis as an answer to the near-synonymy puzzle". In total, the dataset contains 3955 observations, which are tokens of the inchoative construction for the following auxiliaries: comenzar, empezar, meter, poner, echar(se), liar, arrancar and romper. The data originate from the Spanish Web corpus (esTenTen18), accessed via Sketch Engine. Only the European Spanish subcorpus was selected. The search syntax that was used to detect the inchoative construction was the following: “[lemma="empezar"] [tag="R.*"]{0,3}"a"[tag="V.*"] within " (replacing the concrete lemma "empezar" with other lemmas for each auxiliary; see Spinc_queries_20221202.txt for all concrete corpus queries). After downloading samples of 10,000 tokens per auxiliary, the samples were manually cleaned. Only 500 tokens per auxiliary were retained in the dataset. Next, the data were annotated for the infinitive observed after the preposition 'a' and for the semantic class to which this infinitive belongs, following the existing ADESSE classification (see below), besides other criteria that are not taken into account for this study. Concretely, the variables 'INF' (infinitive) and 'Class' were used as input for the hierarchical cluster analysis (see data-specific sections below for more information about the variables).
