Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Determining intrinsic number of clusters in a multidimensional dataset is a commonly encountered problem in exploratory data analysis. Unsupervised clustering algorithms often rely on specification of cluster number as an input parameter. However, this is typically not known a priori. Many methods have been proposed to estimate cluster number, including statistical and information-theoretic approaches such as the gap statistic, but these methods are not always reliable when applied to non-normally distributed datasets containing outliers or noise. In this study, I propose a novel method called hierarchical linkage regression, which uses regression to estimate the intrinsic number of clusters in a multidimensional dataset. The method operates on the hypothesis that the organization of data into clusters can be inferred from the hierarchy generated by partitioning the dataset, and therefore does not directly depend on the specific values of the data or their distribution, but on their relative ranking within the partitioned set. Moreover, the technique does not require empirical data to train on, but can use synthetic data generated from random distributions to fit regression coefficients. The trained hierarchical linkage regression model is able to infer cluster number in test datasets of varying complexity and differing distributions, for image, text and numeric data, using the same regression model without retraining. The method performs favourably against other cluster number estimation techniques, and is also robust to parameter changes, as demonstrated by sensitivity analysis. The apparent robustness and generalizability of hierarchical linkage regression make it a promising tool for unsupervised exploratory data analysis and discovery.
As new facilities come online, the astronomical community will be provided with extremely large data sets of well-sampled light curves (LCs) of transients. This motivates systematic studies of the LCs of supernovae (SNe) of all types, including the early rising phase. We performed unsupervised k-means clustering on a sample of 59 R-band SNII LCs and find that the rise to peak plays an important role in classifying LCs. Our sample can be divided into three classes: slowly rising (II-S), fast rise/slow decline (II-FS), and fast rise/fast decline (II-FF). We also identify three outliers based on the algorithm. The II-FF and II-FS classes are disjoint in their decline rates, while the II-S class is intermediate and "bridges the gap." This may explain recent conflicting results regarding II-P/II-L populations. The II-FS class is also significantly less luminous than the other two classes. Performing clustering on the first two principal component analysis components gives equivalent results to using the full LC morphologies. This indicates that Type II LCs could possibly be reduced to two parameters. We present several important caveats to the technique, and find that the division into these classes is not fully robust. Moreover, these classes have some overlap, and are defined in the R band only. It is currently unclear if they represent distinct physical classes, and more data is needed to study these issues. However, we show that the outliers are actually composed of slowly evolving SN IIb, demonstrating the potential of such methods. The slowly evolving SNe IIb may arise from single massive progenitors.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These files are supplements to the paper titled 'A Robust Two-step Method for Detection of Outlier Sets'. This paper identifies and addresses the need for a robust method that identifies sets of points that collectively deviate from typical patterns in a dataset, which it calls "outlier sets", while excluding individual points from detection. This new methodology, Outlier Set Two-step Identification (OSTI), employs a two-step approach to detect and label these outlier sets. First, it uses Gaussian Mixture Models for probabilistic clustering, identifying candidate outlier sets based on cluster weights below a predetermined threshold. Second, OSTI measures the inter-cluster Mahalanobis distance between each candidate outlier set's centroid and the overall dataset mean. OSTI then tests the null hypothesis that this distance does not significantly differ from its theoretical chi-square distribution, enabling the formal detection of outlier sets. We test OSTI systematically on 8,000 synthetic 2D datasets across various inlier configurations and thousands of possible outlier set characteristics. Results show OSTI robustly and consistently detects outlier sets with an average F1 score of 0.92 and an average purity (the degree to which outlier sets identified correspond to those generated synthetically, i.e., our ground truth) of 98.58%. We also compare OSTI with state-of-the-art outlier detection methods, to illuminate how OSTI fills a gap as a tool for the exclusive detection of outlier sets.
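The second OSTI step described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the authors' implementation: it assumes the mixture-model step has already produced a cluster label for every 2D point, uses a hypothetical weight threshold of 0.1 and significance level of 0.05, and treats the squared Mahalanobis distance of a centroid as chi-square distributed with 2 degrees of freedom (for which the survival function is exactly exp(-x/2)). The function names are invented for this sketch.

```python
import math

def mean_and_cov(points):
    """Sample mean and covariance (sxx, sxy, syy) of a list of 2D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det

def flag_outlier_sets(points, labels, weight_threshold=0.1, alpha=0.05):
    """Return labels of low-weight clusters whose centroid lies far from the data mean.

    weight_threshold and alpha are assumed values chosen for illustration.
    """
    n = len(points)
    mean, cov = mean_and_cov(points)
    flagged = []
    for label in sorted(set(labels)):
        members = [p for p, l in zip(points, labels) if l == label]
        if len(members) / n >= weight_threshold:
            continue  # step 1: only low-weight clusters are candidates
        centroid = (sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members))
        d2 = mahalanobis_sq(centroid, mean, cov)
        p_value = math.exp(-d2 / 2.0)  # chi-square survival function, 2 dof
        if p_value < alpha:
            flagged.append(label)
    return flagged

# Toy data: a 5x5 grid of inliers (label 0) and a distant pair (label 1).
points = [(x, y) for x in range(-2, 3) for y in range(-2, 3)] + [(10, 10), (10.5, 10.5)]
labels = [0] * 25 + [1] * 2
outlier_sets = flag_outlier_sets(points, labels)
```

Because the overall mean and covariance are computed from all points, a large outlier set inflates the covariance and can mask itself; a robust covariance estimate would be a natural refinement.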
The connection of cluster mergers with the presence of extended, diffuse radio sources in galaxy clusters is still debated. An interesting case is the rich, merging cluster Abell 520, which contains a radio halo. A recent gravitational analysis has shown the presence in this cluster of a massive dark core, suggested to be a possible problem for the current cold dark matter paradigm. We aim to obtain new insights into the internal dynamics of Abell 520 by analyzing velocities and positions of member galaxies. Our analysis is based on redshift data for 293 galaxies in the cluster field, obtained by combining new redshift data for 8 galaxies acquired at the TNG with data obtained by the CNOC team and a few other data from the literature. We also use new photometric data obtained at the INT telescope. We combine galaxy velocities and positions to select 167 cluster members around z~0.201. We analyze the cluster structure using the weighted gap analysis, the KMM method, the Dressler-Shectman statistics and the analysis of the velocity dispersion profiles. We compare our results with those from X-ray, radio and gravitational lensing analyses.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This supplementary file provides comprehensive support for the findings and methodology presented in the study. It includes detailed outputs from the Principal Component Analysis (PCA), such as factor loadings, eigenvalues, and the percentage of variance explained, along with a full classification of the 37 Global Health Security Index (GHSI) indicators across the nine identified principal components. Additionally, it contains visualizations and datasets for all three clustering scenarios: one based on countries’ average scores across the nine extracted components, another using the 13 high-loading indicators from the first principal component, and a third based on aggregated scores from the six original GHSI categories. The file also presents the resulting cluster centroids, validation comparisons, and identified performance patterns. Together, these materials strengthen the credibility of the analytical approach and ensure transparency for replication, deeper analysis, and peer validation. All data are integrated into a single Excel-based tool that includes the underlying values used to generate the study’s tables and figures. This supplementary resource serves as a detailed and practical reference to replicate the study’s procedures and validate its results.
In this third paper of a series on the precision of obtaining ages of stellar populations using the full spectrum fitting technique, we examine the precision of this technique in deriving possible age spreads within a star cluster. We test how well an internal age spread can be resolved as a function of cluster age, population mass fraction, and signal-to-noise (S/N) ratio. For this test, the two ages (Age (SSP1) and Age (SSP2)) are free parameters along with the mass fraction of SSP1. We perform the analysis on 118,800 mock star clusters covering all ages in the range 6.8<log(age/yr)9.6 for any mass fraction or S/N, due to the similarity of SED shapes for those ages. In terms of the recovery of age spreads, we find that the derived age spreads are often larger than the real ones, especially for log(age/yr)<8.0 and high mass fractions of SSP1. Increasing the age gap in the mock clusters improves the derived parameters, but Age (SSP2) is still overestimated for the younger ages.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Inferring the structure of populations has many applications for genetic research. In addition to providing information for evolutionary studies, it can be used to account for the bias induced by population stratification in association studies. To this end, many algorithms have been proposed to cluster individuals into genetically homogeneous sub-populations. The parametric algorithms, such as Structure, are very popular but their underlying complexity and their high computational cost led to the development of faster parametric alternatives such as Admixture. Alternatives to these methods are the non-parametric approaches. Within this category, AWclust has proven efficient but fails to properly identify population structure for complex datasets. We present in this article a new clustering algorithm called Spectral Hierarchical clustering for the Inference of Population Structure (SHIPS), based on a divisive hierarchical clustering strategy, allowing a progressive investigation of population structure. This method takes genetic data as input to cluster individuals into homogeneous sub-populations and, with the use of the gap statistic, estimates the optimal number of such sub-populations. SHIPS was applied to a set of simulated discrete and admixed datasets and to real SNP datasets, namely data from the HapMap and Pan-Asian SNP consortium. The programs Structure, Admixture, AWclust and PCAclust were also investigated in a comparison study. SHIPS and the parametric approach Structure were the most accurate when applied to simulated datasets, both in terms of individual assignments and estimation of the correct number of clusters. The analysis of the results on the real datasets highlighted that the clusterings of SHIPS were the most consistent with the population labels or those produced by the Admixture program.
The performance of SHIPS when applied to SNP data, along with its relatively low computational cost and its ease of use, makes this method a promising solution to infer fine-scale genetic patterns.
The YMCA (Yes, Magellanic Clouds Again) and STEP (The SMC in Time: Evolution of a Prototype interacting late-type dwarf galaxy) projects are deep g, i photometric surveys carried out with the VLT Survey Telescope (VST) and devoted to the study of the outskirts of the Magellanic System. A main goal of YMCA and STEP is to identify candidate stellar clusters and complete their census out to the outermost regions of the Magellanic Clouds. We adopted a specific overdensity search technique coupled with a visual inspection of the colour-magnitude diagrams to select the best candidates and estimate their ages. To date, we have analysed a region of 23 square degrees in the outskirts of the Large Magellanic Cloud, detecting 85 candidate clusters, 16 of which have estimated ages falling in the so-called age gap. We use these objects together with literature data to gain insight into the formation and interaction history of the Magellanic Clouds.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data provided alongside the paper "An all-Africa dataset of energy model “supply regions” for solar PV and wind power" by Sterl et al. (2022).
It concerns a novel representative subset of attractive sites for solar PV and onshore wind power for the entire African continent. We refer to these sites as “Model Supply Regions” (MSRs). This MSR dataset was created from an in-depth analysis of various existing datasets on resource potential, grid infrastructure, land use, topography and others (see Methods), and achieves hourly temporal resolution and kilometre-scale spatial resolution. This dataset fills an important research need by closing the gap between comprehensive datasets on African VRE potential (such as the Global Solar Atlas and Global Wind Atlas) on the one hand, and the input needed to run cost-optimisation models on the other. It also allows a detailed analysis of the trade-offs involved in exploiting excellent, but far-from-grid resources as compared to mediocre but more accessible resources, which is a crucial component of power systems planning to be elaborated for many African countries.
Five separate datasets are included:
Folder (1) provides shapefiles of each country's overall feasible area for developing solar and wind power projects, under the restrictions/criteria mentioned above and described in Sterl et al. (2022).
Folder (2) provides the best 5% ("best" measured by expected LCOE, from lowest to highest, including grid and road extension costs; 5% measured in terms of coverage of a country's area) of each country's solar and wind development potential, including hourly time series for model input.
Folder (3) provides the corresponding shapefiles.
Folder (4) provides simplified/aggregated results in terms of MSR clusters (see Sterl et al. 2022 for details), alongside hourly time series based on the meteorological year 2018. The number of clusters was chosen to be 2, 5 or 10 depending on country size.
Folder (5) provides PDF-file maps at the country level, showing resource strength and clustering outcomes by MSR (post-screening).
Explanations of the headers in any spreadsheet files are provided in the Supplementary Information of Sterl et al. (2022).
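The "best 5%" selection described for Folder (2) can be illustrated with a short sketch: rank candidate sites by expected LCOE and keep the cheapest ones until their combined area reaches 5% of the country's area. This is only an illustration of the selection rule as worded above; the field names (`lcoe`, `area_km2`), the stopping rule at the first site that would exceed the budget, and the sample values are assumptions, and the actual procedure is the one documented in Sterl et al. (2022).

```python
def select_best_share(sites, country_area_km2, share=0.05):
    """Keep the lowest-LCOE sites whose combined area fits within `share` of the country.

    `sites` is a list of dicts with hypothetical keys 'id', 'lcoe' (cost, assumed
    to already include grid and road extension costs) and 'area_km2'.
    """
    budget_km2 = share * country_area_km2
    chosen, used_km2 = [], 0.0
    for site in sorted(sites, key=lambda s: s["lcoe"]):
        if used_km2 + site["area_km2"] > budget_km2:
            break  # the next-cheapest site no longer fits in the area budget
        chosen.append(site)
        used_km2 += site["area_km2"]
    return chosen

# Made-up example values for a country of 1000 km2 (area budget: 50 km2).
sites = [
    {"id": "a", "lcoe": 40.0, "area_km2": 20.0},
    {"id": "b", "lcoe": 55.0, "area_km2": 25.0},
    {"id": "c", "lcoe": 35.0, "area_km2": 10.0},
    {"id": "d", "lcoe": 70.0, "area_km2": 30.0},
]
best = select_best_share(sites, country_area_km2=1000.0)
best_ids = [s["id"] for s in best]
```

The same structure makes the trade-off mentioned above visible: a far-from-grid site with an excellent resource only enters the selection if its LCOE, extension costs included, still beats more accessible sites.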
Countries/territories included in the dataset:
Algeria, Angola, Benin, Botswana, Burkina Faso, Burundi, Cameroon, Central African Republic, Chad, Congo Republic, Democratic Republic of the Congo, Djibouti, Egypt, Equatorial Guinea, Eritrea, Eswatini, Ethiopia, Gabon, The Gambia, Ghana, Guinea, Guiné-Bissau, Côte d'Ivoire, Kenya, Lesotho, Liberia, Libya, Madagascar, Malawi, Mali, Mauritania, Morocco, Mozambique, Namibia, Niger, Nigeria, Rwanda, Senegal, Sierra Leone, Somalia, South Africa, South Sudan, Sudan, Togo, Tunisia, Uganda, Tanzania, Zambia, Zimbabwe
References
Sterl, S., Hussain, B., Miketa, A. et al. An all-Africa dataset of energy model “supply regions” for solar photovoltaic and wind power. Sci Data 9, 664 (2022). https://doi.org/10.1038/s41597-022-01786-5
See also
Sterl, S. (2024). Solar PV and wind power Model Supply Region (MSR) dataset as energy model input for countries in Central and South America (1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10650822
Each gene has its own evolutionary history which can substantially differ from the evolutionary histories of other genes. For example, some individual genes or operons can be affected by specific horizontal gene transfer and recombination events. Thus, the evolutionary history of each gene should be represented by its own phylogenetic tree, which may display different evolutionary patterns from the species tree that accounts for the main patterns of vertical descent. The output of traditional consensus tree or supertree inference methods is a unique consensus tree or supertree. Here, we describe a new efficient method for inferring multiple alternative consensus trees and supertrees to best represent the most important evolutionary patterns of a given set of phylogenetic trees (i.e. additive trees or X-trees). We show how a specific version of the popular k-means clustering algorithm, based on some interesting properties of the Robinson and Foulds topological distance, can be used to partition a given set of trees into one (when the data are homogeneous) or multiple (when the data are heterogeneous) cluster(s) of trees. We adapt the popular Caliński-Harabasz, Silhouette, Ball and Hall, and Gap cluster validity indices to tree clustering with k-means. Special attention is paid to the relevant but very challenging problem of inferring alternative supertrees, built from phylogenies constructed for different, but mutually overlapping, sets of taxa. The use of the Euclidean approximation in the objective function of the method makes it faster than the existing tree clustering techniques, and thus well suited to the analysis of large genomic datasets.
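For readers unfamiliar with the Robinson and Foulds distance that underlies this tree clustering, the following minimal sketch computes it for two trees on the same taxa. It covers only the distance itself, not the k-means variant or the Euclidean approximation from the abstract. The nested-tuple tree encoding and the function names are assumptions made for this example.

```python
def leaves(tree):
    """Set of taxon names below a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def splits(tree):
    """Non-trivial bipartitions of a tree, each stored as the side
    that does not contain a fixed reference taxon."""
    all_taxa = leaves(tree)
    ref = min(all_taxa)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaves(node)
        side = clade if ref not in clade else all_taxa - clade
        # Skip trivial splits (empty side, single leaf, or all but one leaf).
        if 1 < len(side) < len(all_taxa) - 1:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found

def robinson_foulds(tree_a, tree_b):
    """Number of bipartitions present in exactly one of the two trees."""
    if leaves(tree_a) != leaves(tree_b):
        raise ValueError("both trees must be defined on the same taxa")
    return len(splits(tree_a) ^ splits(tree_b))

# Three unrooted 5-taxon trees written as nested tuples.
t1 = (("A", "B"), "C", ("D", "E"))
t2 = (("A", "C"), "B", ("D", "E"))
t3 = (("A", "D"), "B", ("C", "E"))
d12 = robinson_foulds(t1, t2)
d13 = robinson_foulds(t1, t3)
```

Here `t1` and `t2` share the split separating D and E from the rest, so they differ by two bipartitions, while `t1` and `t3` share none.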
This study investigates the soil variability within the Northern Boreal Mountains Ecoprovince in British Columbia, with a particular focus on wetland soils and soil organic carbon mapping. Utilizing the BCSOIL2020 dataset and an array of environmental covariates, we employed Principal Component Analysis (PCA), k-means clustering, and conditioned Latin Hypercube Sampling (cLHS) to develop a comprehensive environmental covariate space. This approach allowed for the evaluation of the BCSOIL2020 dataset's representativeness of the current distribution of wetland soils and the generation of new, strategically placed sampling plots aimed at enhancing future research efforts. Through this methodology, the study identifies critical data gaps in existing datasets and proposes a methodological framework for improving soil mapping practices, thereby contributing to more informed resource management and conservation strategies.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The case consists of a generic axial fan for industrial applications. Provided measurement data include instationary pressure probes in the rotor's tip gap, distributions of velocity and turbulent kinetic energy gained by laser Doppler anemometry, as well as acoustic results gained by microphones and beamforming.
A detailed description of the dataset with references can be found in the PDF file. The rotor geometry is available as an IGS or Parasolid file. The measurement data listed in the PDF description file (LDA data, pressure probes, acoustic microphones, performance) are also available.
Citation of the fan and the data:
Zenger, Florian, et al. A benchmark case for aerodynamics and aeroacoustics of a low pressure axial fan. No. 2016-01-1805. SAE Technical Paper, 2016.
Citation of the microphone array measurements:
Krömer, Florian J. Sound emission of low-pressure axial fans under distorted inflow conditions. FAU University Press, 2018.
Citation of the python scripts:
Junger, Clemens. Computational aeroacoustics for the characterization of noise sources in rotating systems. Diss. Technische Universität Wien, 2019.
Related work and existing publications:
Schoder, Stefan, Clemens Junger, and Manfred Kaltenbacher. "Computational aeroacoustics of the EAA benchmark case of an axial fan." Acta Acustica 4.5 (2020): 22. https://doi.org/10.1051/aacus/2020021
Schoder, Stefan, and Felix Czwielong. "Dataset fan-01: Revisiting the EAA benchmark for a low-pressure axial fan." arXiv preprint arXiv:2211.12014 (2022). https://doi.org/10.48550/arXiv.2211.12014
Kaltenbacher, Manfred, and Stefan Schoder. "EAA Benchmark for an axial fan." e-Forum Acusticum 2020. 2020. https://hal.science/hal-03221387/document
Tieghi, Lorenzo, et al. "Machine-learning clustering methods applied to detection of noise sources in low-speed axial fan." Journal of Engineering for Gas Turbines and Power 145.3 (2023): 031020. https://doi.org/10.1115/1.4055417
Antoniou, E., Romani, G., Jantzen, A., Czwielong, F., & Schoder, S. (2023). Numerical flow noise simulation of an axial fan with a Lattice-Boltzmann solver. Acta Acustica, 7, 65. https://doi.org/10.1051/aacus/2023060
Data curation and Questions about the Dataset
Data curated by Stefan Schoder. Please send any questions related to the dataset to stefan.schoder@tugraz.at.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
In the modern competitive landscape of football, clubs are increasingly leveraging data-driven decision-making to strengthen their commercial positions, particularly against rival clubs. The strategic allocation of resources to attract and retain profitable fans who exhibit long-term loyalty is crucial for advancing a club's marketing efforts. While the Recency, Frequency, and Monetary (RFM) customer segmentation technique has seen widespread application in various industries for predicting customer behavior, its adoption within the football industry remains underexplored. This study aims to address this gap by introducing an adjusted RFM approach, enhanced with the Analytic Hierarchy Process (AHP) and unsupervised machine learning, to effectively segment football fans based on Customer Lifetime Value (CLV).
Methods
This research employs a novel weighted RFM method where the significance of each RFM component is quantified using the AHP method. The study utilizes a dataset comprising 500,591 anonymized merchandising transactions from Amsterdamsche Football Club Ajax (AFC Ajax). The derived weights for the RFM variables are 0.409 for Monetary, 0.343 for Frequency, and 0.248 for Recency. These weights are then integrated into a clustering framework using unsupervised machine learning algorithms to segment fans based on their weighted RFM values. The simple weighted sum approach is subsequently applied to estimate the CLV ranking for each fan, enabling the identification of distinct fan segments.
Results
The analysis reveals eight distinct fan clusters, each characterized by unique behaviors and value contributions. The Golden Fans (clusters 1 and 2) exhibit the most favourable scores across the recency, frequency, and monetary metrics, making them relatively the most valuable. They are critical to the club's profitability and should be rewarded through loyalty programs and exclusive services. The Promising segment (cluster 3) shows potential to ascend to Golden Fan status with increased spending. Targeted marketing campaigns and incentives can stimulate this transition. The Needs Attention segment (cluster 4) consists of formerly loyal fans whose engagement has diminished. Re-engagement strategies are vital to prevent further churn. The New Fans segment (clusters 5 and 6) consists of fans who have recently transacted and show potential for growth with proper engagement and personalized offerings. Lastly, the Churned/Low Value segment (clusters 7 and 8) consists of fans who contribute relatively the least and may require price incentives to re-engage, though they hold lower priority than the other segments.
Discussion
The findings validate the proposed method's utility through its application to AFC Ajax's Customer Relationship Management (CRM) data and provide a robust framework for fan segmentation in the football industry. The approach offers actionable insights that can significantly enhance marketing strategies by identifying and prioritizing high-value segments based on the club's preferences and requirements. By maintaining the loyalty of Golden Fans and nurturing the Promising segment, football clubs can achieve substantial gains in profitability and fan engagement. Additionally, the study underscores the necessity of re-engaging formerly loyal fans and fostering new fans' growth to enable long-term commercial success. This methodology not only aims to bridge a research gap, but also equips marketing practitioners with data-driven tools for effective and efficient customer segmentation in the football industry.
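The weighted-sum step of the weighted RFM approach can be sketched as follows. The three weights are the ones reported in the abstract; everything else is an assumption made for illustration, including the min-max scaling, the inversion of recency so that recent purchases score higher, the field names, and the three made-up fans. The study's own preprocessing and clustering are not reproduced here.

```python
# AHP-derived weights as reported in the abstract.
WEIGHTS = {"recency": 0.248, "frequency": 0.343, "monetary": 0.409}

def min_max(values, higher_is_better=True):
    """Scale values to [0, 1]; optionally invert so that small raw values score high."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]  # no spread: give every fan a neutral score
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]

def weighted_rfm_scores(fans):
    """Weighted sum of scaled R, F and M per fan (field names are hypothetical)."""
    r = min_max([f["days_since_last"] for f in fans], higher_is_better=False)
    fr = min_max([f["n_orders"] for f in fans])
    m = min_max([f["total_spend"] for f in fans])
    return [
        WEIGHTS["recency"] * ri + WEIGHTS["frequency"] * fi + WEIGHTS["monetary"] * mi
        for ri, fi, mi in zip(r, fr, m)
    ]

# Made-up fans for illustration only.
fans = [
    {"id": "A", "days_since_last": 5, "n_orders": 12, "total_spend": 900.0},
    {"id": "B", "days_since_last": 200, "n_orders": 1, "total_spend": 40.0},
    {"id": "C", "days_since_last": 30, "n_orders": 4, "total_spend": 300.0},
]
scores = weighted_rfm_scores(fans)
ranked = sorted(zip(scores, fans), key=lambda pair: pair[0], reverse=True)
ranking = [fan["id"] for _, fan in ranked]
```

Because the weights sum to one and each component is scaled to [0, 1], the resulting score also lies in [0, 1], which makes rankings comparable across fans.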
We present a statistical analysis of the properties of a large sample of dynamically hot old stellar systems, from globular clusters (GCs) to giant ellipticals, which was performed in order to investigate the origin of ultracompact dwarf galaxies (UCDs). The data were mostly drawn from Forbes et al. (2008, Cat. J/MNRAS/389/1924). We recalculated some of the effective radii, computed mean surface brightnesses and mass-to-light ratios, and estimated ages and metallicities. We completed the sample with GCs of M31. We used a multivariate statistical technique (K-Means clustering), together with a new algorithm (Gap Statistics) for finding the optimum number of homogeneous sub-groups in the sample, using a total of six parameters (absolute magnitude, effective radius, virial mass-to-light ratio, stellar mass-to-light ratio, and metallicity). We found six groups. FK1 and FK5 are composed of high- and low-mass elliptical galaxies, respectively. FK3 and FK6 are composed of high-metallicity and low-metallicity objects, respectively, and both include GCs and UCDs. Two very small groups, FK2 and FK4, are composed of Local Group dwarf spheroidals. Our groups differ in their mean masses and virial mass-to-light ratios. The relations between these two parameters are also different for the various groups. The probability density distributions of metallicity for the four groups of galaxies are similar to those of the GCs and UCDs. The brightest low-metallicity GCs and UCDs tend to follow the mass-metallicity relation like elliptical galaxies. The objects of FK3 are more metal-rich per unit effective luminosity density than high-mass ellipticals. Cone search capability for table J/ApJ/750/91/table1 (Photometric and structural parameters of the objects of the sample studied.)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
List of benchmarking datasets from TCGA, with the gold-standard subtype and the survival type considered (OS: overall survival; PFI: progression-free interval).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Diffusion MRI data supporting the manuscript entitled "Tractography Processing with the Sparse Closest Point Transform". This includes a population-averaged diffusion tensor imaging dataset, tractography reconstructions of fiber bundles, and classification model parameters for selecting bundles in subject data. Abstract: "We propose a novel approach for processing diffusion MRI tractography datasets using the sparse closest point transform (SCPT). Tractography enables the 3D geometry of white matter pathways to be reconstructed; however, algorithms for processing them are often highly customized, and thus, do not leverage the existing wealth of machine learning (ML) algorithms. We investigated a vector-space tractography representation that aims to bridge this gap by using the SCPT, which consists of two steps: first, extracting sparse and representative landmarks from a tractography dataset, and second, transforming curves relative to these landmarks with a closest point transform. We explore its use in three typical tasks: fiber bundle clustering, simplification, and selection across a population. The clustering algorithm groups fibers from single whole-brain datasets using a non-parametric k-means clustering algorithm, with performance compared with three alternative methods and across four datasets. The simplification algorithm removes redundant curves to improve interactive visualization, with performance gauged relative to random subsampling. The selection algorithm extracts bundles across a population using a one-class Gaussian classifier derived from an atlas prototype, with performance gauged by scan-rescan reliability and sensitivity to normal aging, as compared to manual mask-based selection. Our results demonstrate how the SCPT enables the novel application of existing vector-space ML algorithms to create effective and efficient tools for tractography processing.
Our experimental data is available online, and our software implementation is available in the Quantitative Imaging Toolkit."
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Thyroid disease classification plays a crucial role in early diagnosis and effective treatment of thyroid disorders. Machine learning (ML) techniques have demonstrated remarkable potential in this domain, offering accurate and efficient diagnostic tools. Most of the real-life datasets have imbalanced characteristics that hamper the overall performance of the classifiers. Existing data balancing techniques process the whole dataset at a time that sometimes causes overfitting and underfitting. However, the complexity of some ML models, often referred to as “black boxes,” raises concerns about their interpretability and clinical applicability. This paper presents a comprehensive study focused on the analysis and interpretability of various ML models for classifying thyroid diseases. In our work, we first applied a new data-balancing mechanism using a clustering technique and then analyzed the performance of different ML algorithms. To address the interpretability challenge, we explored techniques for model explanation and feature importance analysis using eXplainable Artificial Intelligence (XAI) tools globally as well as locally. Finally, the XAI results are validated with the domain experts. Experimental results have shown that our proposed mechanism is efficient in diagnosing thyroid disease and can explain the models effectively. The findings can contribute to bridging the gap between adopting advanced ML techniques and the clinical requirements of transparency and accountability in diagnostic decision-making.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Recent research has increasingly focused on using machine learning for covariate selection in population pharmacokinetics (PPK) analysis. However, few studies have explored the prediction of plasma concentration profiles of drugs using nonlinear mixed-effect models combined with machine learning. This gap includes limited validation of prediction accuracy and applicability to diverse patient populations and dosing conditions. This study addresses these gaps by using remifentanil as a model drug and applying machine learning models to predict plasma concentration profiles based on virtual and real-world data. We created various training data sets for the virtual data by clustering based on the size and diversity of the test data set. Our results demonstrated high prediction accuracy for virtual and real-world data sets using Random Forest models. These results suggest that machine learning models are effective for large-scale data sets and real-world data with variable dosing times and amounts per patient. Considering the efficiency of machine learning, it offers a fit-for-purpose approach alongside traditional PPK methods, potentially enhancing future pharmacokinetic and pharmacodynamic studies.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the contemporary manufacturing landscape, the advent of artificial intelligence and big data analytics has been a game-changer in enhancing product quality. Despite these advancements, their application in diagnosing failure probability and risk remains underexplored. The current practice of failure risk diagnosis is impeded by the manual intervention of managers, leading to varying evaluations for identical products or similar facilities. This study aims to bridge this gap by implementing advanced data analysis techniques on maintenance data from an aluminum extruder. We have employed text embedding, dimensionality reduction, and feature extraction methods, integrating the K-means algorithm with the Silhouette Score for risk level classification. Our findings reveal that the combination of Word2Vec for embedding and Contractive Auto Encoder for dimensionality reduction and feature extraction yields high-performance results. The optimal cluster count, identified as three, achieved the highest Silhouette Score. Statistical analysis using one-way ANOVA confirmed the significance of these findings with a p-value of 5.3213 × 10⁻⁶, well below the 5% significance threshold. Furthermore, this study utilized BERTopic for topic modeling to extract principal topics from each cluster, facilitating an in-depth analysis of the clusters in relation to the extruder’s characteristics. The outcome of this research offers a novel methodology for facility managers to objectively diagnose equipment failures. By minimizing subjective judgment, this approach is poised to significantly enhance the efficacy of quality assurance systems in manufacturing, leveraging the robust capabilities of artificial intelligence.
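Choosing the cluster count by the highest Silhouette Score, as described above, can be sketched in plain Python. The sketch below is not the study's pipeline: it skips the embedding and autoencoder stages entirely, runs on a made-up 2D toy dataset, and uses a deterministic farthest-point initialisation for K-means, which is an assumption chosen to keep the example reproducible.

```python
import math

def kmeans(points, k, iters=50):
    """Plain K-means with farthest-point initialisation (an illustrative choice)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the old center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; points in singleton clusters contribute 0."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        total += (b - a) / max(a, b)
    return total / len(points)

def best_k(points, candidates=range(2, 6)):
    """Return the candidate k with the highest mean silhouette, plus all scores."""
    scores = {k: silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores

# Toy data: three well-separated groups of four points each.
offsets = [(0, 0), (1, 0), (0, 1), (1, 1)]
points = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (10, 0), (0, 10)] for dx, dy in offsets]
k_opt, scores = best_k(points)
```

On real embeddings the silhouette curve is rarely this clear-cut, so it is worth inspecting the scores for all candidate counts rather than only the maximum.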
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Linearized Coupled Cluster Doubles (LinCCD) often provides near-singular energies in small-gap systems that exhibit static correlation. This has been attributed to the lack of quadratic $\hat{T}_2^2$ terms that typically balance out small energy denominators in the CCD amplitude equations. Herein, I show that exchange contributions to ring and crossed-ring contractions (not small denominators per se) cause the divergent behavior of LinCC(S)D approaches. Rather than omitting exchange terms, I recommend a regular and size-consistent method that retains only linear ladder diagrams. As LinCCD and configuration interaction doubles (CID) equations are isomorphic, this also implies that simplification (rather than quadratic extensions) of CID amplitude equations can lead to a size-consistent theory. Linearized ladder CCD (LinLCCD) is robust in statically correlated systems and can be made $O(n_{\mathrm{occ}}^4 n_{\mathrm{vir}}^2)$ with a hole–hole approximation. The results presented here show that LinLCCD and its hole–hole approximation can accurately capture energy differences, even outperforming full CCD and CCSD for noncovalent interactions in small-to-medium sized molecules, setting the stage for further adaptations of these approaches that incorporate more dynamical correlation.