Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of data mining analysis is to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset: before any modelling, the data has to be pre-processed, and this normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and to create new features. In our project, using clustering prior to classification did not improve performance much; one likely reason is that the features we selected for clustering were not well suited for it. Because of the nature of the data, classification tasks provide more information to work with in terms of improving knowledge and overall performance metrics. From the dimensionality reduction perspective: clustering differs from Principal Component Analysis, which finds the best linear transformation for reducing the number of dimensions with minimal loss of information. Using clusters to reduce the data dimension can lose a lot of information, since clustering techniques are based on a distance metric, and in high dimensions Euclidean distance loses most of its meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always a good idea, since you may lose almost all of the information. From the feature creation perspective: clustering creates labels based on patterns in the data, which introduces uncertainty. When clustering is used prior to classification, the choice of the number of clusters strongly affects the quality of the clustering, and in turn the performance of the classifier. If the subset of features we cluster on is well suited to the method, it may improve the overall classification performance; for example, if the features we run k-means on are numerical and low-dimensional, the overall classification performance may be better. We did not fix the clustering outputs with a random_state, in order to see whether they were stable; our reasoning was that if the results vary strongly from run to run, which they clearly did, the data may simply not cluster well with the selected methods. In short, the outcome we observed was that applying clustering in the data preprocessing step left our results not much better than random. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data, in the same format, from which the models were created. This feedback loop can be used to measure the model's real-world effectiveness and to revise the models from time to time as things change.
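As a rough illustration of the preprocessing step described above, the following minimal sketch (assuming scikit-learn and a synthetic dataset, not our actual data) appends k-means cluster labels as an extra feature before training a classifier; the choice of k and the features used are illustrative only.

# Sketch: k-means cluster labels as an engineered feature before classification.
# Synthetic data; scikit-learn assumed available. k is an arbitrary guess and,
# as noted above, strongly influences downstream performance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit k-means on the training features only.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_train)

# Append the cluster label as one extra feature (a crude "dimensionality
# reduction" would instead replace X with the labels, losing far more information).
X_train_aug = np.column_stack([X_train, km.predict(X_train)])
X_test_aug = np.column_stack([X_test, km.predict(X_test)])

clf = LogisticRegression(max_iter=1000).fit(X_train_aug, y_train)
print("accuracy with cluster feature:", clf.score(X_test_aug, y_test))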
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The retrieval of important information from a dataset requires a data mining technique known as data clustering (DC). DC groups similar objects into clusters with similar characteristics, typically by grouping the data around k cluster centres that are selected randomly. The shortcomings of conventional DC have recently prompted a search for alternative solutions. One such alternative is a nature-inspired optimization algorithm named the Black Hole Algorithm (BHA), developed to address several well-known optimization problems. The BHA is a population-based metaheuristic that mimics the behaviour around the natural phenomenon of a black hole, whereby individual stars represent candidate solutions revolving around the solution space. The original BHA showed better performance than comparable algorithms when applied to benchmark datasets, despite its poor exploration capability. Hence, this paper presents a multi-population generalization of BHA, called MBHA, in which the performance of the algorithm does not depend on a single best-found solution but on a set of generated best solutions. The proposed method was tested on a set of nine widespread and popular benchmark test functions. The experimental outcomes indicate that the method produces highly precise results compared to BHA and the other algorithms in the study, as well as excellent robustness. Furthermore, the proposed MBHA achieved a high rate of convergence on six real datasets (collected from the UCI machine learning repository), making it suitable for DC problems. Overall, the evaluations indicate that the proposed algorithm is well suited to solving DC problems.
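For context, here is a minimal sketch of one iteration of the basic, single-population BHA movement and event-horizon rule as commonly described in the literature; it is not the multi-population MBHA proposed in the paper, and the search bounds and fitness function are placeholders. For clustering, each star would encode k flattened cluster centres and the fitness would be a within-cluster distance sum.

# Sketch of one iteration of the basic Black Hole Algorithm (assumptions noted).
import numpy as np

rng = np.random.default_rng(0)

def bha_step(stars, fitness):
    # stars: (n_stars, dim) candidate solutions; fitness: objective to minimise
    f = np.array([fitness(s) for s in stars])
    bh = stars[np.argmin(f)].copy()                      # best star acts as the black hole
    stars = stars + rng.random((len(stars), 1)) * (bh - stars)   # attraction toward the black hole
    f = np.array([fitness(s) for s in stars])
    best = np.argmin(f)
    radius = f[best] / f.sum()                           # event-horizon radius
    dist = np.linalg.norm(stars - stars[best], axis=1)
    swallowed = dist < radius
    swallowed[best] = False                              # the black hole itself survives
    # swallowed stars are replaced by new random stars; [-1, 1] bounds are illustrative
    stars[swallowed] = rng.uniform(-1, 1, (swallowed.sum(), stars.shape[1]))
    return stars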
This dataset was created by ddpr raju
It contains the following files:
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This file contains a number of randomly generated datasets. The properties of each dataset are indicated in the name of each respective file: 'C' indicates the number of classes, 'F' indicates the number of features, 'Ne' indicates the number of objects contained in each class, 'A' is related to the average separation between classes and 'R' is an index used to differentiate distinct random trials. So, for instance, the file C2F10N2Ne5A1.2R0 is a dataset containing 2 classes, 10 features, 5 objects for each class and having a typical separation between classes of 1.2. The methodology used for generating the datasets is described in the accompanying reference.
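A small, hypothetical helper for parsing this naming convention might look as follows; note that the example file name also contains an 'N' token that is not documented above, so the pattern simply skips it (an assumption on my part).

# Sketch: parsing the dataset-name convention described above.
import re

pattern = re.compile(
    r"C(?P<classes>\d+)F(?P<features>\d+)(?:N\d+)?"      # undocumented 'N' token skipped
    r"Ne(?P<per_class>\d+)A(?P<separation>[\d.]+)R(?P<trial>\d+)")

m = pattern.match("C2F10N2Ne5A1.2R0")
if m:
    print(m.groupdict())
    # {'classes': '2', 'features': '10', 'per_class': '5', 'separation': '1.2', 'trial': '0'}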
This paper proposes a scalable, local privacy preserving algorithm for distributed Peer-to-Peer (P2P) data aggregation useful for many advanced data mining/analysis tasks such as average/sum computation, decision tree induction, feature selection, and more. Unlike most multi-party privacy-preserving data mining algorithms, this approach works in an asynchronous manner through local interactions and it is highly scalable. It particularly deals with the distributed computation of the sum of a set of numbers stored at different peers in a P2P network in the context of a P2P web mining application. The proposed optimization based privacy-preserving technique for computing the sum allows different peers to specify different privacy requirements without having to adhere to a global set of parameters for the chosen privacy model. Since distributed sum computation is a frequently used primitive, the proposed approach is likely to have significant impact on many data mining tasks such as multi-party privacy-preserving clustering, frequent itemset mining, and statistical aggregate computation.
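For illustration only, the following sketch shows the distributed-sum primitive in its simplest random-masking form; it is not the optimization-based, per-peer-privacy protocol proposed in the paper, just the kind of sum computation such protocols build on.

# Generic illustration of a privacy-masked sum over a ring of peers.
import random

def ring_sum(private_values, modulus=10**9):
    mask = random.randrange(modulus)          # initiator's random mask hides the running total
    running = mask
    for v in private_values:                  # each peer adds its private value modulo m
        running = (running + v) % modulus
    return (running - mask) % modulus         # initiator removes the mask to reveal only the sum

print(ring_sum([12, 7, 30, 5]))               # -> 54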
The file consists of the locations of 300 places in the US. Each location is a two-dimensional point representing the longitude and latitude of the place; for example, "-112.1,33.5" means the longitude of the place is -112.1 and the latitude is 33.5. The data comes from the Data Mining / Cluster Analysis course by the University of Illinois at Urbana-Champaign.
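A minimal sketch of loading and clustering these points, assuming the file is named places.txt and scikit-learn is available (both assumptions):

# Sketch: read "longitude,latitude" lines and cluster the 300 places.
import numpy as np
from sklearn.cluster import KMeans

points = np.loadtxt("places.txt", delimiter=",")   # columns: longitude, latitude
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)
for (lon, lat), c in zip(points[:5], labels[:5]):
    print(f"{lon},{lat} -> cluster {c}")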
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In a large network of computers, wireless sensors, or mobile devices, each of the components (hence, peers) has some data about the global status of the system. Many of the functions of the system, such as routing decisions, search strategies, data cleansing, and the assignment of mutual trust, depend on the global status. Therefore, it is essential that the system be able to detect, and react to, changes in its global status. Computing global predicates in such systems is usually very costly, mainly because of their scale and, in some cases (e.g., sensor networks), because of the high cost of communication. The cost increases further when the data changes rapidly (due to state changes, node failure, etc.) and the computation has to follow these changes. In this paper we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm which detects when the L2 norm of the average data surpasses a threshold. Then, we use this algorithm as a feedback loop for the monitoring of complex predicates on the data, such as the data's k-means clustering. The efficiency of the L2 algorithm guarantees that, so long as the clustering results represent the data (i.e., the data is stationary), few resources are required. When the data undergoes an epoch change (a change in the underlying distribution) and the model no longer represents it, the feedback loop indicates this and the model is rebuilt. Furthermore, the existence of a feedback loop allows using approximate and "best-effort" methods for constructing the model; if an ill-fitting model is built, the feedback loop indicates so and the model is rebuilt.
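A rough, centralized sketch of the feedback-loop idea in this and the following abstract: keep using the current k-means model while the data looks stationary, and rebuild it when a simple L2-norm drift statistic crosses a threshold. The threshold, batch structure, and use of scikit-learn are illustrative assumptions, not the paper's local algorithm.

# Sketch: rebuild a k-means model when an L2 drift statistic exceeds a threshold.
import numpy as np
from sklearn.cluster import KMeans

def monitor(stream_batches, k=3, threshold=1.0):
    model = None
    for batch in stream_batches:              # batch: (n, d) array of newly arrived data
        if model is None:
            model = KMeans(n_clusters=k, n_init=10).fit(batch)
            continue
        centres = model.cluster_centers_[model.predict(batch)]
        drift = np.linalg.norm((batch - centres).mean(axis=0))   # L2 norm of the average residual
        if drift > threshold:                 # epoch change: the model no longer fits, rebuild it
            model = KMeans(n_clusters=k, n_init=10).fit(batch)
        yield model, drift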
In a large network of computers or wireless sensors, each of the components (henceforth, peers) has some data about the global state of the system. Much of the system's functionality, such as message routing, information retrieval and load sharing, relies on modeling the global state. We refer to the outcome of the function (e.g., the load experienced by each peer) as the model of the system. Since the state of the system is constantly changing, it is necessary to keep the models up to date. Computing global data mining models (e.g., decision trees, k-means clustering) in large distributed systems may be very costly due to the scale of the system and the potentially high cost of communication. The cost increases further in a dynamic scenario when the data changes rapidly. In this paper we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm which can be used to monitor a wide class of data mining models. Then, we use this algorithm as a feedback loop for the monitoring of complex functions of the data, such as its k-means clustering. The theoretical claims are corroborated with a thorough experimental analysis.
https://www.archivemarketresearch.com/privacy-policy
The unsupervised learning market is experiencing robust growth, driven by the increasing need for businesses to extract meaningful insights from large, unstructured datasets. This market is projected to be valued at approximately $15 billion in 2025, exhibiting a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033. This significant expansion is fueled by several key factors. The proliferation of big data and the need for efficient data analysis are primary drivers. Businesses across various sectors, including finance, healthcare, and retail, are increasingly adopting unsupervised learning techniques such as clustering and anomaly detection to identify patterns, predict customer behavior, and optimize operational efficiency. Furthermore, advancements in machine learning algorithms, improved computational power, and the availability of cloud-based solutions are further accelerating market growth. The cloud-based segment is growing faster than the on-premise segment, reflecting a broader industry shift toward cloud computing and its scalability advantages. Large enterprises represent a significant portion of the market, owing to their greater resources and willingness to invest in sophisticated analytics capabilities. However, challenges remain, including the complexity of implementing and interpreting unsupervised learning models, the need for specialized expertise, and concerns regarding data privacy and security. Despite these challenges, the long-term outlook for the unsupervised learning market remains positive. The continuous evolution of machine learning algorithms and the increasing availability of user-friendly tools are expected to lower the barrier to entry for businesses of all sizes. Furthermore, the growing adoption of artificial intelligence (AI) across various industries will further fuel demand for unsupervised learning solutions. The market is witnessing considerable geographic expansion, with North America currently holding a significant market share due to the presence of major technology companies and a well-established IT infrastructure. However, other regions, particularly Asia-Pacific, are also witnessing substantial growth, driven by rapid digitalization and increasing investment in data analytics. Competition in the market is intense, with established players like Microsoft, IBM, and Google vying for market share alongside specialized vendors like RapidMiner and H2O.ai. The continued innovation and development of advanced algorithms and platforms will shape the competitive landscape in the coming years.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Clustering is a fundamental tool in data mining, widely used in fields such as image segmentation, data science, pattern recognition, and bioinformatics. Density Peak Clustering (DPC) is a density-based method that identifies clusters by calculating the local density of data points and selecting cluster centers based on these densities. However, DPC has several limitations. First, it requires a cutoff distance to calculate local density; this parameter varies across datasets, requires manual tuning, and affects the algorithm's performance. Second, the number of cluster centers must be manually specified, as the algorithm cannot automatically determine the optimal number of clusters, making it dependent on human intervention. To address these issues, we propose an adaptive Density Peak Clustering method that automatically adjusts parameters such as the cutoff distance and the number of clusters based on the Delaunay graph. This approach uses the Delaunay graph to calculate the connectivity between data points and prunes the points based on these connections, automatically determining the number of cluster centers. Additionally, by optimizing clustering indices, the algorithm automatically adjusts its parameters, enabling clustering without any manual input. Experimental results on both synthetic and real-world datasets demonstrate that the proposed algorithm outperforms similar methods in terms of both efficiency and clustering accuracy.
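For reference, here is a small sketch of the two quantities at the heart of standard DPC: the local density rho (points within the cutoff distance d_c) and delta (distance to the nearest point of higher density). The adaptive, Delaunay-based parameter selection proposed in the paper is not reproduced here; NumPy and SciPy are assumed.

# Sketch: standard DPC local density and delta computation.
import numpy as np
from scipy.spatial.distance import cdist

def dpc_rho_delta(X, d_c):
    D = cdist(X, X)                           # pairwise distances
    rho = (D < d_c).sum(axis=1) - 1           # local density (exclude the point itself)
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]    # points with strictly higher density
        delta[i] = D[i, higher].min() if len(higher) else D[i].max()
    return rho, delta                          # cluster centres: large rho and large delta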
https://www.datainsightsmarket.com/privacy-policy
The Data Mining and Modeling market is experiencing robust growth, driven by the exponential increase in data volume and the rising need for businesses to extract actionable insights for strategic decision-making. The market, estimated at $25 billion in 2025, is projected to expand at a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $75 billion by 2033. This growth is fueled by several key factors, including the increasing adoption of cloud-based data mining solutions, the development of sophisticated analytical tools capable of handling big data, and the growing demand for predictive analytics across diverse sectors such as finance, healthcare, and retail. Furthermore, advancements in artificial intelligence (AI) and machine learning (ML) are significantly enhancing the capabilities of data mining and modeling tools, enabling more accurate predictions and deeper insights. The market is segmented by various deployment models (cloud, on-premise), analytical techniques (regression, classification, clustering), and industry verticals. The major restraints on market growth include the high cost of implementation and maintenance of data mining and modeling solutions, the scarcity of skilled professionals proficient in advanced analytical techniques, and concerns about data privacy and security. However, these challenges are being gradually addressed through the development of user-friendly tools, the emergence of specialized training programs, and the increasing adoption of robust security measures. The competitive landscape is characterized by a mix of established players like SAS and IBM, along with several specialized providers like Symbrium, Coheris, and Expert System. These companies are constantly innovating to enhance their offerings and cater to the evolving needs of businesses across various industries. The market's trajectory indicates a promising future driven by ongoing technological advancements and the increasing importance of data-driven decision-making in a rapidly evolving business environment.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SDOclust Evaluation Tests v2, conducted for the paper: Parameterization-Free Clustering with Sparse Data Observers
Context and methodology
SDOclust is a clustering extension of the Sparse Data Observers (SDO) algorithm. SDOclust uses data observers as graph nodes and clusters them based on connected components and local thresholding. Observers' labels are subsequently propagated to the data points. In this repository, SDOclust is evaluated with 235 datasets (both synthetic and real) taken from the literature on clustering evaluation, and compared with HDBSCAN, k-means--, CLASSIX, N2D (deep learning clustering), fuzzy clustering, and hierarchical clustering algorithms. This repository is framed within research on the following domains: algorithm evaluation, clustering, unsupervised learning, machine learning, data mining, data analysis. Datasets and algorithms can be used for experiment replication and for further clustering evaluation and comparison.
Technical details
Experiments are conducted in Python 3. The file and folder structure is as follows:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SDOstreamclust Evaluation Tests, conducted for the paper: Stream Clustering Robust to Concept Drift
SDOstreamclust is a stream clustering algorithm able to process data incrementally or in batches. It is a combination of the previous SDOstream (anomaly detection in data streams) and SDOclust (static clustering). SDOstreamclust retains the characteristics of SDO algorithms: lightweight, intuitive, self-adjusting, resistant to noise, capable of identifying non-convex clusters, and built upon robust parameters and interpretable models. Moreover, it shows excellent adaptation to concept drift.
In this repository, SDOstreamclust is evaluated with 165 datasets (both synthetic and real) and compared with CluStream, DBstream, DenStream, and StreamKMeans.
This repository is framed within the research on the following domains: algorithm evaluation, stream clustering, unsupervised learning, machine learning, data mining, streaming data analysis. Datasets and algorithms can be used for experiment replication and for further evaluation and comparison.
Docker
A Docker version is also available at: https://hub.docker.com/r/fiv5/sdostreamclust
Experiments are conducted in Python v3.8.14. The file and folder structure is as follows:
- [algorithms] contains a script with functions related to algorithm configurations.
The CC-BY license applies to all data generated with MDCgen. All distributed code is under the GPLv3+ license.
The publication of the Gaia Data Release 2 (Gaia DR2) opens a new era in astronomy. It includes precise astrometric data (positions, proper motions, and parallaxes) for more than 1.3 billion sources, mostly stars. To analyse such a vast amount of new data, the use of data-mining techniques and machine-learning algorithms is mandatory. A great example of the application of such techniques and algorithms is the search for open clusters (OCs), groups of stars that were born and move together, located in the disc. Our aim is to develop a method to automatically explore the data space, requiring minimal manual intervention. We explore the performance of a density-based clustering algorithm, DBSCAN, to find clusters in the data together with a supervised learning method such as an artificial neural network (ANN) to automatically distinguish between real OCs and statistical clusters. The development and implementation of this method in a five-dimensional space (l, b, parallax, pmRA*, pmDE) with the Tycho-Gaia Astrometric Solution (TGAS) data, and a posterior validation using Gaia DR2 data, lead to the proposal of a set of new nearby OCs. We have developed a method to find OCs in astrometric data, designed to be applied to the full Gaia DR2 archive.
Cone search capability for table J/A+A/618/A59/members (Members for the reported UBC clusters)
Cone search capability for table J/A+A/618/A59/centers (Mean parameters for the reported UBC clusters; table 2 of the paper)
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The fragile balance of endorheic lakes in highly managed semi-arid basins with transboundary water issues has been altered by the intertwined effects of global warming and long-term water mismanagement to support agricultural and industrial demand. The alarming rate of global endorheic lakes' depletion in recent decades necessitates formulating mitigation strategies for ecosystem restoration. However, detecting and quantifying the relative contribution of causal factors (climate variability and anthropogenic stressors) is challenging. This study developed a diagnostic multivariate framework to identify major hydrologic drivers of lake depletion in a highly managed endorheic basin with a complex water distribution system. The framework integrates Soil and Water Assessment Tool (SWAT) simulations with time-series decomposition and clustering methods to identify the major drivers of change. This diagnostic framework was applied to the Salton Sea Transboundary Basin (SSTB), host of the world's most impaired inland lake. The results showed signs of depletion across the SSTB since late 1998 with no significant changes in climate conditions. The time-series data mining of the SSTB water balance components indicated that decreases in lake tributary inflows (−16.4 Mm³ yr⁻²) in response to the decline in Colorado River inflows, associated with state water transfer agreements, are causing the Salton Sea to shrink, not changes in irrigation operations as commonly believed. The developed multivariate detection and attribution framework is useful for identifying major drivers of change in coupled natural-human systems.
This paper proposes a scalable, local privacy-preserving algorithm for distributed peer-to-peer (P2P) data aggregation useful for many advanced data mining/analysis tasks such as average/sum computation, decision tree induction, feature selection, and more. Unlike most multi-party privacy-preserving data mining algorithms, this approach works in an asynchronous manner through local interactions and is, therefore, highly scalable. It particularly deals with the distributed computation of the sum of a set of numbers stored at different peers in a P2P network in the context of a P2P web mining application. The proposed optimization-based privacy-preserving technique for computing the sum allows different peers to specify different privacy requirements without having to adhere to a global set of parameters for the chosen privacy model. Since distributed sum computation is a frequently used primitive, the proposed approach is likely to have significant impact on many data mining tasks such as multi-party privacy-preserving clustering, frequent itemset mining, and statistical aggregate computation.
IMS: Inductive Monitoring System The Inductive Monitoring System (IMS) is a tool that uses a data mining technique called clustering to extract models of normal system operation from archived data. IMS works with vectors of data values. IMS analyzes data collected during periods of normal system operation to build a system model. It characterizes how the parameters relate to one another during normal operation by finding areas in the vector space where nominal data tends to fall. These areas are called nominal operating regions and correspond to clusters of similar points found by the IMS clustering algorithm. These nominal operating regions are stored in a knowledge base that IMS uses for real-time telemetry monitoring or archived data analysis. During the monitoring operation, IMS reads real-time or archived data values, formats them into the predefined vector structure, and searches the knowledge base of nominal operating regions to see how well the new data fits the nominal system characterization. For each input vector, IMS returns the distance that vector falls from the nearest nominal operating region. Data that matches the normal training data well will have a deviation distance of zero. If one or more of the data parameters is slightly outside of expected values, a small non-zero result is returned. As incoming data deviates further from the normal system data, indicating a possible malfunction, IMS will return a higher deviation value to alert users of the anomaly. IMS also calculates the contribution of each individual parameter to the overall deviation, which can help isolate the cause of the anomaly.
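A rough sketch of the IMS monitoring idea, using ordinary k-means spheres as a stand-in for IMS's own clustering and region representation (an assumption): learn nominal operating regions from nominal training vectors, then report the distance of each new vector from the nearest region, with zero meaning the vector falls inside nominal behaviour.

# Sketch: learn nominal regions from nominal data, then report deviation distances.
import numpy as np
from sklearn.cluster import KMeans

def build_knowledge_base(nominal_vectors, k=8):
    km = KMeans(n_clusters=k, n_init=10).fit(nominal_vectors)
    radii = np.array([                         # radius of each nominal operating region
        np.linalg.norm(nominal_vectors[km.labels_ == c] - km.cluster_centers_[c],
                       axis=1).max()
        for c in range(k)])
    return km.cluster_centers_, radii

def deviation(vector, centers, radii):
    d = np.linalg.norm(centers - vector, axis=1) - radii
    return max(0.0, d.min())                   # 0 inside a region, larger values flag anomalies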
The Gaia Data Release 2 (DR2) provided an unprecedented volume of precise astrometric and excellent photometric data. In terms of data mining the Gaia catalogue, machine learning methods have been shown to be a powerful tool, for instance in the search for unknown stellar structures. In particular, combining supervised and unsupervised learning methods significantly improves the detection rate of open clusters. We systematically scan Gaia DR2 in a region covering the Galactic anticentre and the Perseus arm (120° <= l <= 205° and -10° <= b <= 10°), with the goal of finding any open clusters that may exist in this region, fine-tuning a methodology previously proposed and successfully applied to TGAS data, and adapting it to regions of different density. Our methodology uses an unsupervised, density-based clustering algorithm, DBSCAN, that identifies overdensities in the five-dimensional astrometric parameter space (l, b, parallax, pmRA*, pmDE) that may correspond to physical clusters. The overdensities are separated into physical clusters (open clusters) or random statistical clusters using an artificial neural network that recognises the isochrone pattern open clusters show in a colour-magnitude diagram. The method is able to recover more than 75% of the open clusters confirmed in the search area. Moreover, we detected 53 open clusters unknown prior to Gaia DR2, which represents an increase of more than 22% with respect to the already catalogued clusters in this region. We find that the census of nearby open clusters is not complete. Different machine learning methodologies for a blind search of open clusters are complementary; no single method is able to detect 100% of the existing groups. Our methodology has proven to be a reliable tool for the automatic detection of open clusters, designed to be applied to the full Gaia DR2 catalogue.
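A minimal sketch of the density-based search step only, assuming a pandas table with hypothetical column names; the eps and min_samples values are placeholders, and the ANN that separates real open clusters from statistical overdensities is not reproduced here.

# Sketch: flag candidate overdensities with DBSCAN on standardised astrometric columns.
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def find_candidates(catalogue: pd.DataFrame, eps=0.3, min_samples=8):
    cols = ["l", "b", "parallax", "pmra", "pmdec"]   # assumed column names
    X = StandardScaler().fit_transform(catalogue[cols])
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return labels                                    # -1 marks field stars (noise)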