20 datasets found
  1. Blind method for discovering number of clusters in multidimensional datasets...

    • plos.figshare.com
    docx
    Updated Jun 4, 2023
    Cite
    Osbert C. Zalay (2023). Blind method for discovering number of clusters in multidimensional datasets by regression on linkage hierarchies generated from random data [Dataset]. http://doi.org/10.1371/journal.pone.0227788
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Osbert C. Zalay
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Determining intrinsic number of clusters in a multidimensional dataset is a commonly encountered problem in exploratory data analysis. Unsupervised clustering algorithms often rely on specification of cluster number as an input parameter. However, this is typically not known a priori. Many methods have been proposed to estimate cluster number, including statistical and information-theoretic approaches such as the gap statistic, but these methods are not always reliable when applied to non-normally distributed datasets containing outliers or noise. In this study, I propose a novel method called hierarchical linkage regression, which uses regression to estimate the intrinsic number of clusters in a multidimensional dataset. The method operates on the hypothesis that the organization of data into clusters can be inferred from the hierarchy generated by partitioning the dataset, and therefore does not directly depend on the specific values of the data or their distribution, but on their relative ranking within the partitioned set. Moreover, the technique does not require empirical data to train on, but can use synthetic data generated from random distributions to fit regression coefficients. The trained hierarchical linkage regression model is able to infer cluster number in test datasets of varying complexity and differing distributions, for image, text and numeric data, using the same regression model without retraining. The method performs favourably against other cluster number estimation techniques, and is also robust to parameter changes, as demonstrated by sensitivity analysis. The apparent robustness and generalizability of hierarchical linkage regression make it a promising tool for unsupervised exploratory data analysis and discovery.

  2. Unsupervised clustering of type II SNe LCs - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Nov 3, 2023
    Cite
    (2023). Unsupervised clustering of type II SNe LCs - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/5651110e-056d-5be1-b26f-a844dba9c84c
    Explore at:
    Dataset updated
    Nov 3, 2023
    Description

    As new facilities come online, the astronomical community will be provided with extremely large data sets of well-sampled light curves (LCs) of transients. This motivates systematic studies of the LCs of supernovae (SNe) of all types, including the early rising phase. We performed unsupervised k-means clustering on a sample of 59 R-band SNII LCs and find that the rise to peak plays an important role in classifying LCs. Our sample can be divided into three classes: slowly rising (II-S), fast rise/slow decline (II-FS), and fast rise/fast decline (II-FF). We also identify three outliers based on the algorithm. The II-FF and II-FS classes are disjoint in their decline rates, while the II-S class is intermediate and "bridges the gap." This may explain recent conflicting results regarding II-P/II-L populations. The II-FS class is also significantly less luminous than the other two classes. Performing clustering on the first two principal component analysis components gives equivalent results to using the full LC morphologies. This indicates that Type II LCs could possibly be reduced to two parameters. We present several important caveats to the technique, and find that the division into these classes is not fully robust. Moreover, these classes have some overlap, and are defined in the R band only. It is currently unclear if they represent distinct physical classes, and more data is needed to study these issues. However, we show that the outliers are actually composed of slowly evolving SN IIb, demonstrating the potential of such methods. The slowly evolving SNe IIb may arise from single massive progenitors.
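
The description above reports k-means clustering of light-curve features. As a minimal sketch of what the k-means (Lloyd) iteration does, the block below clusters a handful of two-feature points. The data values, the helper names `sq_dist` and `kmeans`, and the farthest-first initialisation are all assumptions made for illustration; the description does not state how the original analysis was initialised or which features were used.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_iter=100):
    # Farthest-first initialisation: a deterministic heuristic chosen for this
    # sketch, not something taken from the dataset description.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the iteration has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical (rise time, decline rate) pairs, invented for illustration only.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
          (9.0, 1.0), (9.1, 1.2), (8.9, 0.8)]
labels, centroids = kmeans(points, 3)
```

On these well-separated toy points the three groups end up in three different clusters; real light curves overlap far more, which is consistent with the caveat in the description that the classes are not fully robust.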

  3. Outlier Set Two-step Method (OSTI)

    • orda.shef.ac.uk
    application/x-rar
    Updated Jul 1, 2025
    Cite
    Amal Sarfraz; Abigail Birnbaum; Flannery Dolan; Jonathan Lamontagne; Lyudmila Mihaylova; Charles Rouge (2025). Outlier Set Two-step Method (OSTI) [Dataset]. http://doi.org/10.15131/shef.data.28227974.v3
    Explore at:
    Available download formats: application/x-rar
    Dataset updated
    Jul 1, 2025
    Dataset provided by
    The University of Sheffield
    Authors
    Amal Sarfraz; Abigail Birnbaum; Flannery Dolan; Jonathan Lamontagne; Lyudmila Mihaylova; Charles Rouge
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These files are supplements to the paper titled 'A Robust Two-step Method for Detection of Outlier Sets'. This paper identifies and addresses the need for a robust method that identifies sets of points that collectively deviate from typical patterns in a dataset, which it calls "outlier sets", while excluding individual points from detection. This new methodology, Outlier Set Two-step Identification (OSTI), employs a two-step approach to detect and label these outlier sets. First, it uses Gaussian Mixture Models for probabilistic clustering, identifying candidate outlier sets based on cluster weights below a predetermined threshold. Second, OSTI measures the inter-cluster Mahalanobis distance between each candidate outlier set's centroid and the overall dataset mean. OSTI then tests the null hypothesis that this distance does not significantly differ from its theoretical chi-square distribution, enabling the formal detection of outlier sets. We test OSTI systematically on 8,000 synthetic 2D datasets across various inlier configurations and thousands of possible outlier set characteristics. Results show OSTI robustly and consistently detects outlier sets with an average F1 score of 0.92 and an average purity (the degree to which outlier sets identified correspond to those generated synthetically, i.e., our ground truth) of 98.58%. We also compare OSTI with state-of-the-art outlier detection methods, to illuminate how OSTI fills a gap as a tool for the exclusive detection of outlier sets.
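
The second step described above (a Mahalanobis distance compared against a chi-square distribution) can be sketched for the 2D case as follows. This is an illustration under stated assumptions, not the OSTI implementation: the choice of the sample covariance of all points, the function names, and the toy coordinates are hypothetical. It relies on the fact that a chi-square variable with 2 degrees of freedom has survival function exp(-x/2).

```python
import math

def mean_2d(points):
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def covariance_2d(points):
    """Sample covariance of 2D points, returned as (sxx, sxy, syy)."""
    mx, my = mean_2d(points)
    n = len(points)
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_sq(point, center, cov):
    """Squared Mahalanobis distance using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

def chi2_sf_2dof(x):
    """P(X > x) for a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)

# Hypothetical data: a square of inliers and one candidate outlier-set centroid.
data = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
candidate_centroid = (6.0, 1.0)
d2 = mahalanobis_sq(candidate_centroid, mean_2d(data), covariance_2d(data))
p_value = chi2_sf_2dof(d2)  # a small value suggests the centroid is unusually far
```

A small `p_value` would lead to rejecting the hypothesis that the centroid sits at a typical distance from the dataset mean. The significance threshold and the covariance estimate actually used by OSTI are not given in the description.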

  4. Velocities of galaxies in Abell 520 - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Aug 18, 2023
    Cite
    (2023). Velocities of galaxies in Abell 520 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/35b0b507-eedf-5f7c-af4a-fd1233627b76
    Explore at:
    Dataset updated
    Aug 18, 2023
    Description

    The connection of cluster mergers with the presence of extended, diffuse radio sources in galaxy clusters is still debated. An interesting case is the rich, merging cluster Abell 520, which contains a radio halo. A recent gravitational analysis has shown the presence in this cluster of a massive dark core, suggested to be a possible problem for the current cold dark matter paradigm. We aim to obtain new insights into the internal dynamics of Abell 520 by analyzing velocities and positions of member galaxies. Our analysis is based on redshift data for 293 galaxies in the cluster field, obtained by combining new redshift data for 8 galaxies acquired at the TNG with data obtained by the CNOC team and a few other data from the literature. We also use new photometric data obtained at the INT telescope. We combine galaxy velocities and positions to select 167 cluster members around z~0.201. We analyze the cluster structure using the weighted gap analysis, the KMM method, the Dressler-Shectman statistics and the analysis of the velocity dispersion profiles. We compare our results with those from X-ray, radio and gravitational lensing analyses.

  5. Supplementary Dataset for “Revealing Principal Components, Patterns, and...

    • figshare.com
    xlsx
    Updated Jul 16, 2025
    Cite
    Adel A. Nasser; Mijahed Nasser Aljober; Abed Saif Ahmed Alghawli; Amani A. K. Elsayed (2025). Supplementary Dataset for “Revealing Principal Components, Patterns, and Structural Gaps in Health Security among High-Income Countries” [Dataset]. http://doi.org/10.6084/m9.figshare.29582498.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jul 16, 2025
    Dataset provided by
    figshare
    Authors
    Adel A. Nasser; Mijahed Nasser Aljober; Abed Saif Ahmed Alghawli; Amani A. K. Elsayed
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This supplementary file provides comprehensive support for the findings and methodology presented in the study. It includes detailed outputs from the Principal Component Analysis (PCA), such as factor loadings, eigenvalues, and the percentage of variance explained, along with a full classification of the 37 Global Health Security Index (GHSI) indicators across the nine identified principal components. Additionally, it contains visualizations and datasets for all three clustering scenarios: one based on countries’ average scores across the nine extracted components, another using the 13 high-loading indicators from the first principal component, and a third based on aggregated scores from the six original GHSI categories. The file also presents the resulting cluster centroids, validation comparisons, and identified performance patterns. Together, these materials strengthen the credibility of the analytical approach and ensure transparency for replication, deeper analysis, and peer validation. All data are integrated into a single Excel-based tool that includes the underlying values used to generate the study’s tables and figures. This supplementary resource serves as a detailed and practical reference to replicate the study’s procedures and validate its results.

  6. Model spectra for identifying age spreads - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Apr 29, 2023
    Cite
    (2023). Model spectra for identifying age spreads - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/5a66576b-1d64-5c08-bb2a-67b48c3729fa
    Explore at:
    Dataset updated
    Apr 29, 2023
    Description

    In this third paper of a series on the precision of obtaining ages of stellar populations using the full spectrum fitting technique, we examine the precision of this technique in deriving possible age spreads within a star cluster. We test how well an internal age spread can be resolved as a function of cluster age, population mass fraction, and signal-to-noise (S/N) ratio. For this test, the two ages (Age (SSP1) and Age (SSP2)) are free parameters along with the mass fraction of SSP1. We perform the analysis on 118,800 mock star clusters covering all ages in the range 6.8<log(age/yr)9.6 for any mass fraction or S/N, due to the similarity of SED shapes for those ages. In terms of the recovery of age spreads, we find that the derived age spreads are often larger than the real ones, especially for log(age/yr)<8.0 and high mass fractions of SSP1. Increasing the age gap in the mock clusters improves the derived parameters, but Age (SSP2) is still overestimated for the younger ages.

  7. SHIPS: Spectral Hierarchical Clustering for the Inference of Population...

    • plos.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Matthieu Bouaziz; Caroline Paccard; Mickael Guedj; Christophe Ambroise (2023). SHIPS: Spectral Hierarchical Clustering for the Inference of Population Structure in Genetic Studies [Dataset]. http://doi.org/10.1371/journal.pone.0045685
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Matthieu Bouaziz; Caroline Paccard; Mickael Guedj; Christophe Ambroise
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Inferring the structure of populations has many applications for genetic research. In addition to providing information for evolutionary studies, it can be used to account for the bias induced by population stratification in association studies. To this end, many algorithms have been proposed to cluster individuals into genetically homogeneous sub-populations. The parametric algorithms, such as Structure, are very popular but their underlying complexity and their high computational cost led to the development of faster parametric alternatives such as Admixture. Alternatives to these methods are the non-parametric approaches. Among this category, AWclust has proven efficient but fails to properly identify population structure for complex datasets. We present in this article a new clustering algorithm called Spectral Hierarchical clustering for the Inference of Population Structure (SHIPS), based on a divisive hierarchical clustering strategy, allowing a progressive investigation of population structure. This method takes genetic data as input to cluster individuals into homogeneous sub-populations and with the use of the gap statistic estimates the optimal number of such sub-populations. SHIPS was applied to a set of simulated discrete and admixed datasets and to real SNP datasets, that are data from the HapMap and Pan-Asian SNP consortium. The programs Structure, Admixture, AWclust and PCAclust were also investigated in a comparison study. SHIPS and the parametric approach Structure were the most accurate when applied to simulated datasets both in terms of individual assignments and estimation of the correct number of clusters. The analysis of the results on the real datasets highlighted that the clusterings of SHIPS were the more consistent with the population labels or those produced by the Admixture program. 
    The performance of SHIPS when applied to SNP data, along with its relatively low computational cost and its ease of use, makes this method a promising solution for inferring fine-scale genetic patterns.

  8. Properties of LMC star clusters - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Apr 30, 2023
    Cite
    (2023). Properties of LMC star clusters - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/2855c0f3-1a67-5da5-81b9-472985eaf813
    Explore at:
    Dataset updated
    Apr 30, 2023
    Description

    The YMCA (Yes, Magellanic Clouds Again) and STEP (The SMC in Time: Evolution of a Prototype interacting late-type dwarf galaxy) projects are deep g, i photometric surveys carried out with the VLT Survey Telescope (VST) and devoted to studying the outskirts of the Magellanic System. A main goal of YMCA and STEP is to identify candidate stellar clusters and complete their census out to the outermost regions of the Magellanic Clouds. We adopted a specific overdensity search technique coupled with a visual inspection of the colour-magnitude diagrams to select the best candidates and estimate their ages. To date, we have analysed a region of 23 square degrees in the outskirts of the Large Magellanic Cloud, detecting 85 candidate clusters, 16 of which have estimated ages falling in the so-called age gap. We use these objects together with literature data to gain insight into the formation and interaction history of the Magellanic Clouds.

  9. Data for the paper « An all-Africa dataset of energy model "supply regions"...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 14, 2025
    + more versions
    Cite
    Mohamed Elabbas (2025). Data for the paper « An all-Africa dataset of energy model "supply regions" for solar PV and wind power » [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6452116
    Explore at:
    Dataset updated
    Feb 14, 2025
    Dataset provided by
    Mohamed Elabbas
    Sebastian Sterl
    Bilal Hussain
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Africa
    Description

    This dataset contains data provided alongside the paper "An all-Africa dataset of energy model “supply regions” for solar PV and wind power" by Sterl et al. (2022).

    It concerns a novel representative subset of attractive sites for solar PV and onshore wind power for the entire African continent. We refer to these sites as “Model Supply Regions” (MSRs). This MSR dataset was created from an in-depth analysis of various existing datasets on resource potential, grid infrastructure, land use, topography and others (see Methods), and achieves hourly temporal resolution and kilometre-scale spatial resolution. This dataset fills an important research need by closing the gap between comprehensive datasets on African VRE potential (such as the Global Solar Atlas and Global Wind Atlas) on the one hand, and the input needed to run cost-optimisation models on the other. It also allows a detailed analysis of the trade-offs involved in exploiting excellent, but far-from-grid resources as compared to mediocre but more accessible resources, which is a crucial component of power systems planning to be elaborated for many African countries.

    Five separate datasets are included:

    Folder (1) provides shapefiles of each country's overall feasible area for developing solar and wind power projects, under the restrictions/criteria mentioned above and described in Sterl et al. (2022).

    Folder (2) provides the best 5% ("best" measured by expected LCOE, from lowest to highest, including grid and road extension costs; 5% measured in terms of coverage of a country's area) of each country's solar and wind development potential, including hourly time series for model input.

    Folder (3) provides the corresponding shapefiles.

    Folder (4) provides simplified/aggregated results in terms of MSR clusters (see Sterl et al. 2022 for details), alongside hourly time series based on the meteorological year 2018. The number of clusters was chosen to be 2, 5 or 10 depending on country size.

    Folder (5) provides PDF-file maps at the country level, showing resource strength and clustering outcomes by MSR (post-screening).

    Explanations of the headers in any spreadsheet files are provided in the Supplementary Information of Sterl et al. (2022).

    Countries/territories included in the dataset:

    Algeria, Angola, Benin, Botswana, Burkina Faso, Burundi, Cameroon, Central African Republic, Chad, Congo Republic, Democratic Republic of the Congo, Djibouti, Egypt, Equatorial Guinea, Eritrea, Eswatini, Ethiopia, Gabon, The Gambia, Ghana, Guinea, Guiné-Bissau, Côte d'Ivoire, Kenya, Lesotho, Liberia, Libya, Madagascar, Malawi, Mali, Mauritania, Morocco, Mozambique, Namibia, Niger, Nigeria, Rwanda, Senegal, Sierra Leone, Somalia, South Africa, South Sudan, Sudan, Togo, Tunisia, Uganda, Tanzania, Zambia, Zimbabwe

    References

    Sterl, S., Hussain, B., Miketa, A. et al. An all-Africa dataset of energy model “supply regions” for solar photovoltaic and wind power. Sci Data 9, 664 (2022). https://doi.org/10.1038/s41597-022-01786-5

    See also

    Sterl, S. (2024). Solar PV and wind power Model Supply Region (MSR) dataset as energy model input for countries in Central and South America (1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10650822

  10. Supplementary material for: Building alternative consensus trees and...

    • explore.openaire.eu
    • search.dataone.org
    • +1more
    Updated Mar 28, 2021
    Cite
    Vladimir Makarenkov (2021). Supplementary material for: Building alternative consensus trees and supertrees using k-means and Robinson and Foulds distance [Dataset]. http://doi.org/10.5061/dryad.ffbg79ctw
    Explore at:
    Dataset updated
    Mar 28, 2021
    Authors
    Vladimir Makarenkov
    Description

    Each gene has its own evolutionary history which can substantially differ from the evolutionary histories of other genes. For example, some individual genes or operons can be affected by specific horizontal gene transfer and recombination events. Thus, the evolutionary history of each gene should be represented by its own phylogenetic tree which may display different evolutionary patterns from the species tree that accounts for the main patterns of vertical descent. The output of traditional consensus tree or supertree inference methods is a unique consensus tree or supertree. Here, we describe a new efficient method for inferring multiple alternative consensus trees and supertrees to best represent the most important evolutionary patterns of a given set of phylogenetic trees (i.e. additive trees or X-trees). We show how a specific version of the popular k-means clustering algorithm, based on some interesting properties of the Robinson and Foulds topological distance, can be used to partition a given set of trees into one (when the data are homogeneous) or multiple (when the data are heterogeneous) cluster(s) of trees. We adapt the popular Caliński-Harabasz, Silhouette, Ball and Hall, and Gap cluster validity indices to tree clustering with k-means. A special attention is paid to the relevant but very challenging problem of inferring alternative supertrees, built from phylogenies constructed for different, but mutually overlapping, sets of taxa. The use of the Euclidean approximation in the objective function of the method makes it faster than the existing tree clustering techniques, and thus perfectly suitable for the analysis of large genomic datasets.
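
The Robinson and Foulds distance mentioned above counts the groupings of taxa that appear in one tree but not in the other. A minimal sketch follows, with two simplifying assumptions that are not taken from the paper: trees are rooted and written as nested tuples of leaf names, and the distance is computed on clades (leaf sets below each internal node) rather than on the bipartitions of unrooted trees. The function names are hypothetical.

```python
def clades(tree):
    """Return (leaf set of `tree`, set of clades of all its internal nodes)."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), set()  # a leaf contributes no clade
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)  # the clade rooted at this internal node
    return leaves, found

def rf_distance(t1, t2):
    """Number of clades present in exactly one of the two rooted trees."""
    leaves1, c1 = clades(t1)
    leaves2, c2 = clades(t2)
    if leaves1 != leaves2:
        raise ValueError("trees must share the same leaf set")
    return len(c1 ^ c2)

# Toy gene trees on four taxa, invented for illustration.
tree_a = (("A", "B"), ("C", "D"))
tree_b = (("A", "C"), ("B", "D"))
tree_c = ((("A", "B"), "C"), "D")
```

Here `rf_distance(tree_a, tree_b)` is 4 because the trees share no clade other than the root, while `rf_distance(tree_a, tree_c)` is 2 because both contain the clade {A, B}. A distance of this kind is what a tree-clustering method would feed into its objective function.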

  11. Mapping the Unseen: Identifying Data Gaps and Proposing New Sampling Points...

    • dataone.org
    • borealisdata.ca
    Updated Jul 24, 2024
    Cite
    Yang, Stephanie (2024). Mapping the Unseen: Identifying Data Gaps and Proposing New Sampling Points in Northern Boreal Mountain Eco-province, BC Using K-Means Clustering and cLHS [Dataset]. http://doi.org/10.5683/SP3/ZYIQ2U
    Explore at:
    Dataset updated
    Jul 24, 2024
    Dataset provided by
    Borealis
    Authors
    Yang, Stephanie
    Description

    This study investigates the soil variability within the Northern Boreal Mountains Ecoprovince in British Columbia, with a particular focus on wetland soils and soil organic carbon mapping. Utilizing the BCSOIL2020 dataset and an array of environmental covariates, we employed Principal Component Analysis (PCA), k-means clustering, and conditioned Latin Hypercube Sampling (cLHS) to develop a comprehensive environmental covariate space. This approach allowed for the evaluation of the BCSOIL2020 dataset's representativeness of the current distribution of wetland soils and the generation of new, strategically placed sampling plots aimed at enhancing future research efforts. Through this methodology, the study identifies critical data gaps in existing datasets and proposes a methodological framework for improving soil mapping practices, thereby contributing to more informed resource management and conservation strategies.

  12. Benchmark-Dataset FAN-01: Low pressure Axial Fan in a short Duct

    • zenodo.org
    pdf, tar +2
    Updated Jul 6, 2024
    Cite
    Florian Krömer; Clemens Junger; Stefan Becker; Manfred Kaltenbacher; Felix Czwielong; Stefan Schoder (2024). Benchmark-Dataset FAN-01: Low pressure Axial Fan in a short Duct [Dataset]. http://doi.org/10.5281/zenodo.10787093
    Explore at:
    Available download formats: tar, zip, text/x-python, pdf
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Florian Krömer; Clemens Junger; Stefan Becker; Manfred Kaltenbacher; Felix Czwielong; Stefan Schoder
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The case consists of a generic axial fan for industrial applications. The measurement data provided include unsteady pressure probe data from the rotor's tip gap, distributions of velocity and turbulent kinetic energy obtained by laser Doppler anemometry, and acoustic results obtained with microphones and beamforming.

    A detailed description of the dataset with references can be found in the PDF file. The rotor geometry is available as an IGS or Parasolid file. The measurement data are available, including those listed in the PDF description file (LDA data, pressure probes, acoustic microphones, performance).

    Citation of the fan and the data:

    Zenger, Florian, et al. A benchmark case for aerodynamics and aeroacoustics of a low pressure axial fan. No. 2016-01-1805. SAE Technical Paper, 2016.

    Citation of the microphone array measurements:

    Krömer, Florian J. Sound emission of low-pressure axial fans under distorted inflow conditions. FAU University Press, 2018.

    Citation of the python scripts:

    Junger, Clemens. Computational aeroacoustics for the characterization of noise sources in rotating systems. Diss. Technische Universität Wien, 2019.

    Related work and existing publications:

    Schoder, Stefan, Clemens Junger, and Manfred Kaltenbacher. "Computational aeroacoustics of the EAA benchmark case of an axial fan." Acta Acustica 4.5 (2020): 22. https://doi.org/10.1051/aacus/2020021

    Schoder, Stefan, and Felix Czwielong. "Dataset fan-01: Revisiting the EAA benchmark for a low-pressure axial fan." arXiv preprint arXiv:2211.12014 (2022). https://doi.org/10.48550/arXiv.2211.12014

    Kaltenbacher, Manfred, and Stefan Schoder. "EAA Benchmark for an axial fan." e-Forum Acusticum 2020. 2020. https://hal.science/hal-03221387/document

    Tieghi, Lorenzo, et al. "Machine-learning clustering methods applied to detection of noise sources in low-speed axial fan." Journal of Engineering for Gas Turbines and Power 145.3 (2023): 031020. https://doi.org/10.1115/1.4055417

    Antoniou, E., Romani, G., Jantzen, A., Czwielong, F., & Schoder, S. (2023). Numerical flow noise simulation of an axial fan with a Lattice-Boltzmann solver. Acta Acustica, 7, 65. https://doi.org/10.1051/aacus/2023060

    Data curation and Questions about the Dataset

    Data curated by Stefan Schoder. Please send any questions related to the dataset to stefan.schoder@tugraz.at.

  13. Datasheet1_Unlocking high-value football fans: unsupervised machine learning...

    • frontiersin.figshare.com
    • figshare.com
    pdf
    Updated Aug 23, 2024
    Cite
    Karim Chouaten; Cristian Rodriguez Rivero; Frank Nack; Max Reckers (2024). Datasheet1_Unlocking high-value football fans: unsupervised machine learning for customer segmentation and lifetime value.pdf [Dataset]. http://doi.org/10.3389/fspor.2024.1362489.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Aug 23, 2024
    Dataset provided by
    Frontiers
    Authors
    Karim Chouaten; Cristian Rodriguez Rivero; Frank Nack; Max Reckers
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: In the modern competitive landscape of football, clubs are increasingly leveraging data-driven decision-making to strengthen their commercial positions, particularly against rival clubs. The strategic allocation of resources to attract and retain profitable fans who exhibit long-term loyalty is crucial for advancing a club's marketing efforts. While the Recency, Frequency, and Monetary (RFM) customer segmentation technique has seen widespread application in various industries for predicting customer behavior, its adoption within the football industry remains underexplored. This study aims to address this gap by introducing an adjusted RFM approach, enhanced with the Analytic Hierarchy Process (AHP) and unsupervised machine learning, to effectively segment football fans based on Customer Lifetime Value (CLV).

    Methods: This research employs a novel weighted RFM method where the significance of each RFM component is quantified using the AHP method. The study utilizes a dataset comprising 500,591 anonymized merchandising transactions from Amsterdamsche Football Club Ajax (AFC Ajax). The derived weights for the RFM variables are 0.409 for Monetary, 0.343 for Frequency, and 0.248 for Recency. These weights are then integrated into a clustering framework using unsupervised machine learning algorithms to segment fans based on their weighted RFM values. The simple weighted sum approach is subsequently applied to estimate the CLV ranking for each fan, enabling the identification of distinct fan segments.

    Results: The analysis reveals eight distinct fan clusters, each characterized by unique behaviors and value contributions. The Golden Fans (clusters 1 and 2) exhibit the most favourable scores across the recency, frequency, and monetary metrics, making them relatively the most valuable. They are critical to the club's profitability and should be rewarded through loyalty programs and exclusive services. The Promising segment (cluster 3) shows potential to ascend to Golden Fan status with increased spending. Targeted marketing campaigns and incentives can stimulate this transition. The Needs Attention segment (cluster 4) consists of formerly loyal fans whose engagement has diminished. Re-engagement strategies are vital to prevent further churn. The New Fans segment (clusters 5 and 6) consists of fans who have recently transacted and show potential for growth with proper engagement and personalized offerings. Lastly, the Churned/Low Value segment (clusters 7 and 8) consists of fans who contribute relatively the least and may require price incentives to potentially re-engage, though they hold relatively lower priority compared to other segments.

    Discussion: The findings validate the proposed method's utility through its application to AFC Ajax's Customer Relationship Management (CRM) data and provide a robust framework for fan segmentation in the football industry. The approach offers actionable insights that can significantly enhance marketing strategies by identifying and prioritizing high-value segments based on the club's preferences and requirements. By maintaining the loyalty of Golden Fans and nurturing the Promising segment, football clubs can achieve substantial gains in profitability and fan engagement. Additionally, the study underscores the necessity of re-engaging formerly loyal fans and fostering new fans' growth to enable long-term commercial success. This methodology not only aims to bridge a research gap, but also equips marketing practitioners with data-driven tools for effective and efficient customer segmentation in the football industry.
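
The weighted-sum step of the method can be illustrated with the three AHP weights quoted above (0.409 for Monetary, 0.343 for Frequency, 0.248 for Recency). The sketch below is a hedged example rather than the study's code: it assumes each RFM component has already been scaled to the range 0 to 1 with higher values being better, and the fan names and scores are invented.

```python
# AHP-derived weights as reported in the description.
WEIGHTS = {"recency": 0.248, "frequency": 0.343, "monetary": 0.409}

def clv_score(rfm):
    """Weighted sum of pre-scaled RFM components (assumed to lie in [0, 1])."""
    return sum(WEIGHTS[name] * rfm[name] for name in WEIGHTS)

# Hypothetical fans with made-up, already normalised RFM scores.
fans = {
    "fan_a": {"recency": 0.9, "frequency": 0.8, "monetary": 0.9},
    "fan_b": {"recency": 0.2, "frequency": 0.1, "monetary": 0.3},
    "fan_c": {"recency": 1.0, "frequency": 0.2, "monetary": 0.1},
}

# Rank fans from highest to lowest estimated value.
ranking = sorted(fans, key=lambda fan: clv_score(fans[fan]), reverse=True)
```

With these invented scores, `fan_c` outranks `fan_b` despite spending less, because its recency is much higher. How the study normalised the raw transaction data before weighting is not stated in the description.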

  14. Ages and metallicities of old stellar systems - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Apr 18, 2012
    (2012). Ages and metallicities of old stellar systems - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/c2465ef6-49c1-59a0-8ba7-bd558321566a
    Dataset updated
    Apr 18, 2012
    Description

    We present a statistical analysis of the properties of a large sample of dynamically hot old stellar systems, from globular clusters (GCs) to giant ellipticals, which was performed in order to investigate the origin of ultracompact dwarf galaxies (UCDs). The data were mostly drawn from Forbes et al. (2008, Cat. J/MNRAS/389/1924). We recalculated some of the effective radii, computed mean surface brightnesses and mass-to-light ratios, and estimated ages and metallicities. We completed the sample with GCs of M31. We used a multivariate statistical technique (K-Means clustering), together with a new algorithm (Gap Statistics) for finding the optimum number of homogeneous sub-groups in the sample, using a total of six parameters (absolute magnitude, effective radius, virial mass-to-light ratio, stellar mass-to-light ratio, and metallicity). We found six groups. FK1 and FK5 are composed of high- and low-mass elliptical galaxies, respectively. FK3 and FK6 are composed of high-metallicity and low-metallicity objects, respectively, and both include GCs and UCDs. Two very small groups, FK2 and FK4, are composed of Local Group dwarf spheroidals. Our groups differ in their mean masses and virial mass-to-light ratios. The relations between these two parameters are also different for the various groups. The probability density distributions of metallicity for the four groups of galaxies are similar to those of the GCs and UCDs. The brightest low-metallicity GCs and UCDs tend to follow the mass-metallicity relation like elliptical galaxies. The objects of FK3 are more metal-rich per unit effective luminosity density than high-mass ellipticals. Cone search capability for table J/ApJ/750/91/table1 (Photometric and structural parameters of the objects of the sample studied.)

  15. List of benchmarking dataset from TCGA and the gold-standard subtype as well...

    • plos.figshare.com
    xls
    Updated Aug 15, 2024
    Teemu J. Rintala; Vittorio Fortino (2024). List of benchmarking dataset from TCGA and the gold-standard subtype as well as survival type considered (OS–overall survival; PFI—Progression-free interval). [Dataset]. http://doi.org/10.1371/journal.pcbi.1012275.t001
    Dataset updated
    Aug 15, 2024
    Dataset provided by
    PLOS Computational Biology
    Authors
    Teemu J. Rintala; Vittorio Fortino
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    List of benchmarking dataset from TCGA and the gold-standard subtype as well as survival type considered (OS–overall survival; PFI—Progression-free interval).

  16. Diffusion MRI data supporting "Tractography Processing with the Sparse...

    • nih.figshare.com
    zip
    Updated Jul 6, 2020
    Ryan Cabeen; Arthur W. Toga; David H. Laidlaw (2020). Diffusion MRI data supporting "Tractography Processing with the Sparse Closest Point Transform" [Dataset]. http://doi.org/10.35092/yhjc.12441953.v1
    Dataset updated
    Jul 6, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ryan Cabeen; Arthur W. Toga; David H. Laidlaw
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Diffusion MRI data supporting the manuscript entitled "Tractography Processing with the Sparse Closest Point Transform". This includes a population-averaged diffusion tensor imaging dataset, tractography reconstructions of fiber bundles, and classification model parameters for selecting bundles in subject data.

    Abstract: "We propose a novel approach for processing diffusion MRI tractography datasets using the sparse closest point transform (SCPT). Tractography enables the 3D geometry of white matter pathways to be reconstructed; however, algorithms for processing them are often highly customized, and thus, do not leverage the existing wealth of machine learning (ML) algorithms. We investigated a vector-space tractography representation that aims to bridge this gap by using the SCPT, which consists of two steps: first, extracting sparse and representative landmarks from a tractography dataset, and second transforming curves relative to these landmarks with a closest point transform. We explore its use in three typical tasks: fiber bundle clustering, simplification, and selection across a population. The clustering algorithm groups fibers from single whole-brain datasets using a non-parametric k-means clustering algorithm, with performance compared with three alternative methods and across four datasets. The simplification algorithm removes redundant curves to improve interactive visualization, with performance gauged relative to random subsampling. The selection algorithm extracts bundles across a population using a one-class Gaussian classifier derived from an atlas prototype, with performance gauged by scan-rescan reliability and sensitivity to normal aging, as compared to manual mask-based selection. Our results demonstrate how the SCPT enables the novel application of existing vector-space ML algorithms to create effective and efficient tools for tractography processing. Our experimental data is available online, and our software implementation is available in the Quantitative Imaging Toolkit."

  17. Performance measure after applying SMOTE.

    • figshare.com
    xls
    Updated May 31, 2024
    Sumya Akter; Hossen A. Mustafa (2024). Performance measure after applying SMOTE. [Dataset]. http://doi.org/10.1371/journal.pone.0300670.t009
    Dataset updated
    May 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Sumya Akter; Hossen A. Mustafa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Thyroid disease classification plays a crucial role in early diagnosis and effective treatment of thyroid disorders. Machine learning (ML) techniques have demonstrated remarkable potential in this domain, offering accurate and efficient diagnostic tools. Most real-life datasets have imbalanced characteristics that hamper the overall performance of the classifiers. Existing data balancing techniques process the whole dataset at once, which sometimes causes overfitting and underfitting. In addition, the complexity of some ML models, often referred to as “black boxes,” raises concerns about their interpretability and clinical applicability. This paper presents a comprehensive study focused on the analysis and interpretability of various ML models for classifying thyroid diseases. In our work, we first applied a new data-balancing mechanism using a clustering technique and then analyzed the performance of different ML algorithms. To address the interpretability challenge, we explored techniques for model explanation and feature importance analysis using eXplainable Artificial Intelligence (XAI) tools globally as well as locally. Finally, the XAI results were validated with domain experts. Experimental results have shown that our proposed mechanism is efficient in diagnosing thyroid disease and can explain the models effectively. The findings can contribute to bridging the gap between adopting advanced ML techniques and the clinical requirements of transparency and accountability in diagnostic decision-making.

  18. Data from: Machine Learning Prediction and Validation of Plasma...

    • acs.figshare.com
    xlsx
    Updated May 9, 2025
    Hiroaki Iwata; Michiharu Kageyama; Koichi Handa (2025). Machine Learning Prediction and Validation of Plasma Concentration–Time Profiles [Dataset]. http://doi.org/10.1021/acs.molpharmaceut.4c01431.s002
    Dataset updated
    May 9, 2025
    Dataset provided by
    ACS Publications
    Authors
    Hiroaki Iwata; Michiharu Kageyama; Koichi Handa
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Recent research has increasingly focused on using machine learning for covariate selection in population pharmacokinetics (PPK) analysis. However, few studies have explored the prediction of plasma concentration profiles of drugs using nonlinear mixed-effect models combined with machine learning. This gap includes limited validation of prediction accuracy and applicability to diverse patient populations and dosing conditions. This study addresses these gaps by using remifentanil as a model drug and applying machine learning models to predict plasma concentration profiles based on virtual and real-world data. We created various training data sets for the virtual data by clustering based on the size and diversity of the test data set. Our results demonstrated high prediction accuracy for virtual and real-world data sets using Random Forest models. These results suggest that machine learning models are effective for large-scale data sets and real-world data with variable dosing times and amounts per patient. Considering the efficiency of machine learning, it offers a fit-for-purpose approach alongside traditional PPK methods, potentially enhancing future pharmacokinetic and pharmacodynamic studies.

  19. Clustering results.

    • plos.figshare.com
    xls
    Updated Dec 30, 2024
    Hyeonbin Ji; Ingeun Hwang; Junghwon Kim; Suan Lee; Wookey Lee (2024). Clustering results. [Dataset]. http://doi.org/10.1371/journal.pone.0314931.t004
    Dataset updated
    Dec 30, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Hyeonbin Ji; Ingeun Hwang; Junghwon Kim; Suan Lee; Wookey Lee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In the contemporary manufacturing landscape, the advent of artificial intelligence and big data analytics has been a game-changer in enhancing product quality. Despite these advancements, their application in diagnosing failure probability and risk remains underexplored. The current practice of failure risk diagnosis is impeded by the manual intervention of managers, leading to varying evaluations for identical products or similar facilities. This study aims to bridge this gap by implementing advanced data analysis techniques on maintenance data from an aluminum extruder. We have employed text embedding, dimensionality reduction, and feature extraction methods, integrating the K-means algorithm with the Silhouette Score for risk level classification. Our findings reveal that the combination of Word2Vec for embedding and Contractive Auto Encoder for dimensionality reduction and feature extraction yields high-performance results. The optimal cluster count, identified as three, achieved the highest Silhouette Score. Statistical analysis using one-way ANOVA confirmed the significance of these findings with a p-value of 5.3213 × e−6, well within the 5% significance threshold. Furthermore, this study utilized BERTopic for topic modeling to extract principal topics from each cluster, facilitating an in-depth analysis of the clusters in relation to the extruder’s characteristics. The outcome of this research offers a novel methodology for facility managers to objectively diagnose equipment failures. By minimizing subjective judgment, this approach is poised to significantly enhance the efficacy of quality assurance systems in manufacturing, leveraging the robust capabilities of artificial intelligence.
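
    As a rough illustration of how a Silhouette Score can be used to pick a cluster count, the sketch below computes the mean silhouette coefficient in plain Python and compares two candidate labelings. The points and labelings are invented for the example; in the study the labels would come from K-means applied to embedded maintenance records, which is not reproduced here.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: its silhouette is taken as 0
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Hypothetical 2-D points forming three well-separated groups.
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]

# Hand-written candidate labelings standing in for K-means output.
candidates = {
    2: [0, 0, 0, 1, 1, 1, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],
}

score_2 = silhouette_score(points, candidates[2])
score_3 = silhouette_score(points, candidates[3])
# Choose the cluster count whose labeling has the highest silhouette.
best_k = max(candidates, key=lambda k: silhouette_score(points, candidates[k]))
print(best_k)
```

    The same selection rule (keep the cluster count with the highest mean silhouette) is what the abstract describes when it reports three as the optimal count.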

  20. Data from: Diagrammatic Simplification of Linearized Coupled Cluster Theory

    • figshare.com
    • acs.figshare.com
    xlsx
    Updated Jun 26, 2025
    Kevin Carter-Fenk (2025). Diagrammatic Simplification of Linearized Coupled Cluster Theory [Dataset]. http://doi.org/10.1021/acs.jpca.5c03203.s001
    Dataset updated
    Jun 26, 2025
    Dataset provided by
    ACS Publications
    Authors
    Kevin Carter-Fenk
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Linearized Coupled Cluster Doubles (LinCCD) often provides near-singular energies in small-gap systems that exhibit static correlation. This has been attributed to the lack of quadratic T̂₂² terms that typically balance out small energy denominators in the CCD amplitude equations. Herein, I show that exchange contributions to ring and crossed-ring contractions (not small denominators per se) cause the divergent behavior of LinCC(S)D approaches. Rather than omitting exchange terms, I recommend a regular and size-consistent method that retains only linear ladder diagrams. As LinCCD and configuration interaction doubles (CID) equations are isomorphic, this also implies that simplification (rather than quadratic extensions) of CID amplitude equations can lead to a size-consistent theory. Linearized ladder CCD (LinLCCD) is robust in statically correlated systems and can be made O(n_occ⁴ n_vir²) with a hole–hole approximation. The results presented here show that LinLCCD and its hole–hole approximation can accurately capture energy differences, even outperforming full CCD and CCSD for noncovalent interactions in small-to-medium sized molecules, setting the stage for further adaptations of these approaches that incorporate more dynamical correlation.

Osbert C. Zalay (2023). Blind method for discovering number of clusters in multidimensional datasets by regression on linkage hierarchies generated from random data [Dataset]. http://doi.org/10.1371/journal.pone.0227788

2 scholarly articles cite this dataset.