Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of data mining analysis is to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset: before any modelling, the data has to be pre-processed, and this normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and to create new features. In our project, using clustering prior to classification did not improve performance much; one likely reason is that the features we selected for clustering were not well suited for it. Because of the nature of the data, classification tasks provide more information to work with in terms of improving knowledge and overall performance metrics. From the dimensionality reduction perspective: clustering differs from Principal Component Analysis, which finds the best linear transformation for reducing the number of dimensions with minimal loss of information. Using clusters to reduce the data dimension can lose a lot of information, since clustering techniques are based on a distance metric, and in high dimensions Euclidean distance loses most of its meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always a good idea, since you may lose almost all of the information. From the feature creation perspective: clustering creates labels based on patterns in the data, which introduces uncertainty. When clustering is used prior to classification, the choice of the number of clusters strongly affects the quality of the clustering, and in turn the performance of the classifier. If the subset of features we cluster on is well suited to the method, it may improve the overall classification performance; for example, if the features we run k-means on are numerical and low-dimensional, the overall classification performance may be better. We did not fix the clustering outputs with a random_state, in order to see whether they were stable; our reasoning was that if the results vary strongly from run to run, which they clearly did, the data may simply not cluster well with the selected methods. In short, the outcome we observed was that applying clustering in the data preprocessing step left our results not much better than random. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data, in the same format, from which the models were created. This feedback loop can be used to measure the model's real-world effectiveness and to revise the models from time to time as things change.
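As a rough illustration of the preprocessing step described above, the following minimal sketch (assuming scikit-learn and a synthetic dataset, not our actual data) appends k-means cluster labels as an extra feature before training a classifier; the choice of k and the features used are illustrative only.

# Sketch: k-means cluster labels as an engineered feature before classification.
# Synthetic data; scikit-learn assumed available. k is an arbitrary guess and,
# as noted above, strongly influences downstream performance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit k-means on the training features only.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_train)

# Append the cluster label as one extra feature (a crude "dimensionality
# reduction" would instead replace X with the labels, losing far more information).
X_train_aug = np.column_stack([X_train, km.predict(X_train)])
X_test_aug = np.column_stack([X_test, km.predict(X_test)])

clf = LogisticRegression(max_iter=1000).fit(X_train_aug, y_train)
print("accuracy with cluster feature:", clf.score(X_test_aug, y_test))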
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The retrieval of important information from a dataset requires a data mining technique known as data clustering (DC). DC groups similar objects into clusters with similar characteristics, typically by grouping the data around k cluster centres that are selected randomly. The shortcomings of conventional DC have recently prompted a search for alternative solutions. One such alternative is a nature-inspired optimization algorithm named the Black Hole Algorithm (BHA), developed to address several well-known optimization problems. The BHA is a population-based metaheuristic that mimics the behaviour around the natural phenomenon of a black hole, whereby individual stars represent candidate solutions revolving around the solution space. The original BHA showed better performance than comparable algorithms when applied to benchmark datasets, despite its poor exploration capability. Hence, this paper presents a multi-population generalization of BHA, called MBHA, in which the performance of the algorithm does not depend on a single best-found solution but on a set of generated best solutions. The proposed method was tested on a set of nine widespread and popular benchmark test functions. The experimental outcomes indicate that the method produces highly precise results compared to BHA and the other algorithms in the study, as well as excellent robustness. Furthermore, the proposed MBHA achieved a high rate of convergence on six real datasets (collected from the UCI machine learning repository), making it suitable for DC problems. Overall, the evaluations indicate that the proposed algorithm is well suited to solving DC problems.
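For context, here is a minimal sketch of one iteration of the basic, single-population BHA movement and event-horizon rule as commonly described in the literature; it is not the multi-population MBHA proposed in the paper, and the search bounds and fitness function are placeholders. For clustering, each star would encode k flattened cluster centres and the fitness would be a within-cluster distance sum.

# Sketch of one iteration of the basic Black Hole Algorithm (assumptions noted).
import numpy as np

rng = np.random.default_rng(0)

def bha_step(stars, fitness):
    # stars: (n_stars, dim) candidate solutions; fitness: objective to minimise
    f = np.array([fitness(s) for s in stars])
    bh = stars[np.argmin(f)].copy()                      # best star acts as the black hole
    stars = stars + rng.random((len(stars), 1)) * (bh - stars)   # attraction toward the black hole
    f = np.array([fitness(s) for s in stars])
    best = np.argmin(f)
    radius = f[best] / f.sum()                           # event-horizon radius
    dist = np.linalg.norm(stars - stars[best], axis=1)
    swallowed = dist < radius
    swallowed[best] = False                              # the black hole itself survives
    # swallowed stars are replaced by new random stars; [-1, 1] bounds are illustrative
    stars[swallowed] = rng.uniform(-1, 1, (swallowed.sum(), stars.shape[1]))
    return stars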
This dataset was created by ddpr raju
It contains the following files:
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This file contains a number of randomly generated datasets. The properties of each dataset are indicated in the name of each respective file: 'C' indicates the number of classes, 'F' indicates the number of features, 'Ne' indicates the number of objects contained in each class, 'A' is related to the average separation between classes and 'R' is an index used to differentiate distinct random trials. So, for instance, the file C2F10N2Ne5A1.2R0 is a dataset containing 2 classes, 10 features, 5 objects for each class and having a typical separation between classes of 1.2. The methodology used for generating the datasets is described in the accompanying reference.
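A small, hypothetical helper for parsing this naming convention might look as follows; note that the example file name also contains an 'N' token that is not documented above, so the pattern simply skips it (an assumption on my part).

# Sketch: parsing the dataset-name convention described above.
import re

pattern = re.compile(
    r"C(?P<classes>\d+)F(?P<features>\d+)(?:N\d+)?"      # undocumented 'N' token skipped
    r"Ne(?P<per_class>\d+)A(?P<separation>[\d.]+)R(?P<trial>\d+)")

m = pattern.match("C2F10N2Ne5A1.2R0")
if m:
    print(m.groupdict())
    # {'classes': '2', 'features': '10', 'per_class': '5', 'separation': '1.2', 'trial': '0'}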
This paper proposes a scalable, local privacy preserving algorithm for distributed Peer-to-Peer (P2P) data aggregation useful for many advanced data mining/analysis tasks such as average/sum computation, decision tree induction, feature selection, and more. Unlike most multi-party privacy-preserving data mining algorithms, this approach works in an asynchronous manner through local interactions and it is highly scalable. It particularly deals with the distributed computation of the sum of a set of numbers stored at different peers in a P2P network in the context of a P2P web mining application. The proposed optimization based privacy-preserving technique for computing the sum allows different peers to specify different privacy requirements without having to adhere to a global set of parameters for the chosen privacy model. Since distributed sum computation is a frequently used primitive, the proposed approach is likely to have significant impact on many data mining tasks such as multi-party privacy-preserving clustering, frequent itemset mining, and statistical aggregate computation.
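For illustration only, the following sketch shows the distributed-sum primitive in its simplest random-masking form; it is not the optimization-based, per-peer-privacy protocol proposed in the paper, just the kind of sum computation such protocols build on.

# Generic illustration of a privacy-masked sum over a ring of peers.
import random

def ring_sum(private_values, modulus=10**9):
    mask = random.randrange(modulus)          # initiator's random mask hides the running total
    running = mask
    for v in private_values:                  # each peer adds its private value modulo m
        running = (running + v) % modulus
    return (running - mask) % modulus         # initiator removes the mask to reveal only the sum

print(ring_sum([12, 7, 30, 5]))               # -> 54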
The file consists of the locations of 300 places in the US. Each location is a two-dimensional point representing the longitude and latitude of the place; for example, "-112.1,33.5" means the longitude of the place is -112.1 and the latitude is 33.5. The data comes from the Data Mining / Cluster Analysis course by the University of Illinois at Urbana-Champaign.
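A minimal sketch of loading and clustering these points, assuming the file is named places.txt and scikit-learn is available (both assumptions):

# Sketch: read "longitude,latitude" lines and cluster the 300 places.
import numpy as np
from sklearn.cluster import KMeans

points = np.loadtxt("places.txt", delimiter=",")   # columns: longitude, latitude
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)
for (lon, lat), c in zip(points[:5], labels[:5]):
    print(f"{lon},{lat} -> cluster {c}")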
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In a large network of computers, wireless sensors, or mobile devices, each of the components (hence, peers) has some data about the global status of the system. Many of the functions of the system, such as routing decisions, search strategies, data cleansing, and the assignment of mutual trust, depend on the global status. Therefore, it is essential that the system be able to detect, and react to, changes in its global status. Computing global predicates in such systems is usually very costly, mainly because of their scale and, in some cases (e.g., sensor networks), because of the high cost of communication. The cost increases further when the data changes rapidly (due to state changes, node failure, etc.) and the computation has to follow these changes. In this paper we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm which detects when the L2 norm of the average data surpasses a threshold. Then, we use this algorithm as a feedback loop for the monitoring of complex predicates on the data, such as the data's k-means clustering. The efficiency of the L2 algorithm guarantees that, so long as the clustering results represent the data (i.e., the data is stationary), few resources are required. When the data undergoes an epoch change (a change in the underlying distribution) and the model no longer represents it, the feedback loop indicates this and the model is rebuilt. Furthermore, the existence of a feedback loop allows using approximate and "best-effort" methods for constructing the model; if an ill-fitting model is built, the feedback loop indicates so and the model is rebuilt.
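A rough, centralized sketch of the feedback-loop idea in this and the following abstract: keep using the current k-means model while the data looks stationary, and rebuild it when a simple L2-norm drift statistic crosses a threshold. The threshold, batch structure, and use of scikit-learn are illustrative assumptions, not the paper's local algorithm.

# Sketch: rebuild a k-means model when an L2 drift statistic exceeds a threshold.
import numpy as np
from sklearn.cluster import KMeans

def monitor(stream_batches, k=3, threshold=1.0):
    model = None
    for batch in stream_batches:              # batch: (n, d) array of newly arrived data
        if model is None:
            model = KMeans(n_clusters=k, n_init=10).fit(batch)
            continue
        centres = model.cluster_centers_[model.predict(batch)]
        drift = np.linalg.norm((batch - centres).mean(axis=0))   # L2 norm of the average residual
        if drift > threshold:                 # epoch change: the model no longer fits, rebuild it
            model = KMeans(n_clusters=k, n_init=10).fit(batch)
        yield model, drift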
In a large network of computers or wireless sensors, each of the components (henceforth, peers) has some data about the global state of the system. Much of the system's functionality, such as message routing, information retrieval and load sharing, relies on modeling the global state. We refer to the outcome of the function (e.g., the load experienced by each peer) as the model of the system. Since the state of the system is constantly changing, it is necessary to keep the models up to date. Computing global data mining models (e.g., decision trees, k-means clustering) in large distributed systems may be very costly due to the scale of the system and the potentially high cost of communication. The cost increases further in a dynamic scenario when the data changes rapidly. In this paper we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm which can be used to monitor a wide class of data mining models. Then, we use this algorithm as a feedback loop for the monitoring of complex functions of the data, such as its k-means clustering. The theoretical claims are corroborated with a thorough experimental analysis.
https://www.archivemarketresearch.com/privacy-policy
The unsupervised learning market is experiencing robust growth, driven by the increasing need for businesses to extract meaningful insights from large, unstructured datasets. This market is projected to be valued at approximately $15 billion in 2025, exhibiting a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033. This significant expansion is fueled by several key factors. The proliferation of big data and the need for efficient data analysis are primary drivers. Businesses across various sectors, including finance, healthcare, and retail, are increasingly adopting unsupervised learning techniques such as clustering and anomaly detection to identify patterns, predict customer behavior, and optimize operational efficiency. Furthermore, advancements in machine learning algorithms, improved computational power, and the availability of cloud-based solutions are further accelerating market growth. The cloud-based segment is growing faster than the on-premise segment, reflecting a broader industry shift toward cloud computing and its scalability advantages. Large enterprises represent a significant portion of the market, owing to their greater resources and willingness to invest in sophisticated analytics capabilities. However, challenges remain, including the complexity of implementing and interpreting unsupervised learning models, the need for specialized expertise, and concerns regarding data privacy and security. Despite these challenges, the long-term outlook for the unsupervised learning market remains positive. The continuous evolution of machine learning algorithms and the increasing availability of user-friendly tools are expected to lower the barrier to entry for businesses of all sizes. Furthermore, the growing adoption of artificial intelligence (AI) across various industries will further fuel demand for unsupervised learning solutions. The market is witnessing considerable geographic expansion, with North America currently holding a significant market share due to the presence of major technology companies and a well-established IT infrastructure. However, other regions, particularly Asia-Pacific, are also witnessing substantial growth, driven by rapid digitalization and increasing investment in data analytics. Competition in the market is intense, with established players like Microsoft, IBM, and Google vying for market share alongside specialized vendors like RapidMiner and H2O.ai. The continued innovation and development of advanced algorithms and platforms will shape the competitive landscape in the coming years.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Clustering is a fundamental tool in data mining, widely used in fields such as image segmentation, data science, pattern recognition, and bioinformatics. Density Peak Clustering (DPC) is a density-based method that identifies clusters by calculating the local density of data points and selecting cluster centers based on these densities. However, DPC has several limitations. First, it requires a cutoff distance to calculate local density; this parameter varies across datasets, requires manual tuning, and affects the algorithm's performance. Second, the number of cluster centers must be manually specified, as the algorithm cannot automatically determine the optimal number of clusters, making it dependent on human intervention. To address these issues, we propose an adaptive Density Peak Clustering method that automatically adjusts parameters such as the cutoff distance and the number of clusters based on the Delaunay graph. This approach uses the Delaunay graph to calculate the connectivity between data points and prunes the points based on these connections, automatically determining the number of cluster centers. Additionally, by optimizing clustering indices, the algorithm automatically adjusts its parameters, enabling clustering without any manual input. Experimental results on both synthetic and real-world datasets demonstrate that the proposed algorithm outperforms similar methods in terms of both efficiency and clustering accuracy.
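For reference, here is a small sketch of the two quantities at the heart of standard DPC: the local density rho (points within the cutoff distance d_c) and delta (distance to the nearest point of higher density). The adaptive, Delaunay-based parameter selection proposed in the paper is not reproduced here; NumPy and SciPy are assumed.

# Sketch: standard DPC local density and delta computation.
import numpy as np
from scipy.spatial.distance import cdist

def dpc_rho_delta(X, d_c):
    D = cdist(X, X)                           # pairwise distances
    rho = (D < d_c).sum(axis=1) - 1           # local density (exclude the point itself)
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]    # points with strictly higher density
        delta[i] = D[i, higher].min() if len(higher) else D[i].max()
    return rho, delta                          # cluster centres: large rho and large delta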
https://www.datainsightsmarket.com/privacy-policy
The Data Mining and Modeling market is experiencing robust growth, driven by the exponential increase in data volume and the rising need for businesses to extract actionable insights for strategic decision-making. The market, estimated at $25 billion in 2025, is projected to expand at a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $75 billion by 2033. This growth is fueled by several key factors, including the increasing adoption of cloud-based data mining solutions, the development of sophisticated analytical tools capable of handling big data, and the growing demand for predictive analytics across diverse sectors such as finance, healthcare, and retail. Furthermore, advancements in artificial intelligence (AI) and machine learning (ML) are significantly enhancing the capabilities of data mining and modeling tools, enabling more accurate predictions and deeper insights. The market is segmented by various deployment models (cloud, on-premise), analytical techniques (regression, classification, clustering), and industry verticals. The major restraints on market growth include the high cost of implementation and maintenance of data mining and modeling solutions, the scarcity of skilled professionals proficient in advanced analytical techniques, and concerns about data privacy and security. However, these challenges are being gradually addressed through the development of user-friendly tools, the emergence of specialized training programs, and the increasing adoption of robust security measures. The competitive landscape is characterized by a mix of established players like SAS and IBM, along with several specialized providers like Symbrium, Coheris, and Expert System. These companies are constantly innovating to enhance their offerings and cater to the evolving needs of businesses across various industries. The market's trajectory indicates a promising future driven by ongoing technological advancements and the increasing importance of data-driven decision-making in a rapidly evolving business environment.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SDOclust Evaluation Tests v2, conducted for the paper: Parameterization-Free Clustering with Sparse Data Observers
Context and methodology
SDOclust is a clustering extension of the Sparse Data Observers (SDO) algorithm. SDOclust uses data observers as graph nodes and clusters them based on connected components and local thresholding. Observers' labels are subsequently propagated to the data points. In this repository, SDOclust is evaluated with 235 datasets (both synthetic and real) taken from the literature on clustering evaluation, and compared with HDBSCAN, k-means--, CLASSIX, N2D (deep learning clustering), fuzzy clustering, and hierarchical clustering algorithms. This repository is framed within research on the following domains: algorithm evaluation, clustering, unsupervised learning, machine learning, data mining, data analysis. Datasets and algorithms can be used for experiment replication and for further clustering evaluation and comparison.
Technical details
Experiments are conducted in Python 3. The file and folder structure is as follows:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SDOstreamclust Evaluation Tests, conducted for the paper: Stream Clustering Robust to Concept Drift
SDOstreamclust is a stream clustering algorithm able to process data incrementally or in batches. It is a combination of the previous SDOstream (anomaly detection in data streams) and SDOclust (static clustering). SDOstreamclust retains the characteristics of SDO algorithms: lightweight, intuitive, self-adjusting, resistant to noise, capable of identifying non-convex clusters, and built upon robust parameters and interpretable models. Moreover, it shows excellent adaptation to concept drift.
In this repository, SDOstreamclust is evaluated with 165 datasets (both synthetic and real) and compared with CluStream, DBstream, DenStream, and StreamKMeans.
This repository is framed within the research on the following domains: algorithm evaluation, stream clustering, unsupervised learning, machine learning, data mining, streaming data analysis. Datasets and algorithms can be used for experiment replication and for further evaluation and comparison.
Docker
A Docker version is also available at: https://hub.docker.com/r/fiv5/sdostreamclust
Experiments are conducted in Python v3.8.14. The file and folder structure is as follows:
- [algorithms] contains a script with functions related to algorithm configurations.
The CC-BY license applies to all data generated with MDCgen. All distributed code is under the GPLv3+ license.
The publication of the Gaia Data Release 2 (Gaia DR2) opens a new era in astronomy. It includes precise astrometric data (positions, proper motions, and parallaxes) for more than 1.3 billion sources, mostly stars. To analyse such a vast amount of new data, the use of data-mining techniques and machine-learning algorithms is mandatory. A great example of the application of such techniques and algorithms is the search for open clusters (OCs), groups of stars that were born and move together, located in the disc. Our aim is to develop a method to automatically explore the data space, requiring minimal manual intervention. We explore the performance of a density-based clustering algorithm, DBSCAN, to find clusters in the data together with a supervised learning method such as an artificial neural network (ANN) to automatically distinguish between real OCs and statistical clusters. The development and implementation of this method in a five-dimensional space (l, b, parallax, pmRA*, pmDE) with the Tycho-Gaia Astrometric Solution (TGAS) data, and a posterior validation using Gaia DR2 data, lead to the proposal of a set of new nearby OCs. We have developed a method to find OCs in astrometric data, designed to be applied to the full Gaia DR2 archive.
Cone search capability for table J/A+A/618/A59/members (Members for the reported UBC clusters)
Cone search capability for table J/A+A/618/A59/centers (Mean parameters for the reported UBC clusters; table 2 of the paper)
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The fragile balance of endorheic lakes in highly managed semi-arid basins with transboundary water issues has been altered by the intertwined effects of global warming and long-term water mismanagement to support agricultural and industrial demand. The alarming rate of global endorheic lakes' depletion in recent decades necessitates formulating mitigation strategies for ecosystem restoration. However, detecting and quantifying the relative contribution of causal factors (climate variability and anthropogenic stressors) is challenging. This study developed a diagnostic multivariate framework to identify major hydrologic drivers of lake depletion in a highly managed endorheic basin with a complex water distribution system. The framework integrates Soil and Water Assessment Tool (SWAT) simulations with time-series decomposition and clustering methods to identify the major drivers of change. This diagnostic framework was applied to the Salton Sea Transboundary Basin (SSTB), host of the world's most impaired inland lake. The results showed signs of depletion across the SSTB since late 1998 with no significant changes in climate conditions. The time-series data mining of the SSTB water balance components indicated that decreases in lake tributary inflows (−16.4 Mm³ yr⁻²) in response to the decline in Colorado River inflows, associated with state water transfer agreements, are causing the Salton Sea to shrink, not changes in irrigation operations as commonly believed. The developed multivariate detection and attribution framework is useful for identifying major drivers of change in coupled natural-human systems.
This paper proposes a scalable, local privacy-preserving algorithm for distributed peer-to-peer (P2P) data aggregation useful for many advanced data mining/analysis tasks such as average/sum computation, decision tree induction, feature selection, and more. Unlike most multi-party privacy-preserving data mining algorithms, this approach works in an asynchronous manner through local interactions and is, therefore, highly scalable. It particularly deals with the distributed computation of the sum of a set of numbers stored at different peers in a P2P network in the context of a P2P web mining application. The proposed optimization-based privacy-preserving technique for computing the sum allows different peers to specify different privacy requirements without having to adhere to a global set of parameters for the chosen privacy model. Since distributed sum computation is a frequently used primitive, the proposed approach is likely to have significant impact on many data mining tasks such as multi-party privacy-preserving clustering, frequent itemset mining, and statistical aggregate computation.
IMS: Inductive Monitoring System The Inductive Monitoring System (IMS) is a tool that uses a data mining technique called clustering to extract models of normal system operation from archived data. IMS works with vectors of data values. IMS analyzes data collected during periods of normal system operation to build a system model. It characterizes how the parameters relate to one another during normal operation by finding areas in the vector space where nominal data tends to fall. These areas are called nominal operating regions and correspond to clusters of similar points found by the IMS clustering algorithm. These nominal operating regions are stored in a knowledge base that IMS uses for real-time telemetry monitoring or archived data analysis. During the monitoring operation, IMS reads real-time or archived data values, formats them into the predefined vector structure, and searches the knowledge base of nominal operating regions to see how well the new data fits the nominal system characterization. For each input vector, IMS returns the distance that vector falls from the nearest nominal operating region. Data that matches the normal training data well will have a deviation distance of zero. If one or more of the data parameters is slightly outside of expected values, a small non-zero result is returned. As incoming data deviates further from the normal system data, indicating a possible malfunction, IMS will return a higher deviation value to alert users of the anomaly. IMS also calculates the contribution of each individual parameter to the overall deviation, which can help isolate the cause of the anomaly.
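A rough sketch of the IMS monitoring idea, using ordinary k-means spheres as a stand-in for IMS's own clustering and region representation (an assumption): learn nominal operating regions from nominal training vectors, then report the distance of each new vector from the nearest region, with zero meaning the vector falls inside nominal behaviour.

# Sketch: learn nominal regions from nominal data, then report deviation distances.
import numpy as np
from sklearn.cluster import KMeans

def build_knowledge_base(nominal_vectors, k=8):
    km = KMeans(n_clusters=k, n_init=10).fit(nominal_vectors)
    radii = np.array([                         # radius of each nominal operating region
        np.linalg.norm(nominal_vectors[km.labels_ == c] - km.cluster_centers_[c],
                       axis=1).max()
        for c in range(k)])
    return km.cluster_centers_, radii

def deviation(vector, centers, radii):
    d = np.linalg.norm(centers - vector, axis=1) - radii
    return max(0.0, d.min())                   # 0 inside a region, larger values flag anomalies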
The Gaia Data Release 2 (DR2) provided an unprecedented volume of precise astrometric and excellent photometric data. In terms of data mining the Gaia catalogue, machine learning methods have been shown to be a powerful tool, for instance in the search for unknown stellar structures. In particular, combining supervised and unsupervised learning methods significantly improves the detection rate of open clusters. We systematically scan Gaia DR2 in a region covering the Galactic anticentre and the Perseus arm (120° <= l <= 205° and -10° <= b <= 10°), with the goal of finding any open clusters that may exist in this region, fine-tuning a methodology previously proposed and successfully applied to TGAS data, and adapting it to regions of different density. Our methodology uses an unsupervised, density-based clustering algorithm, DBSCAN, that identifies overdensities in the five-dimensional astrometric parameter space (l, b, parallax, pmRA*, pmDE) that may correspond to physical clusters. The overdensities are separated into physical clusters (open clusters) or random statistical clusters using an artificial neural network that recognises the isochrone pattern open clusters show in a colour-magnitude diagram. The method is able to recover more than 75% of the open clusters confirmed in the search area. Moreover, we detected 53 open clusters unknown prior to Gaia DR2, which represents an increase of more than 22% with respect to the already catalogued clusters in this region. We find that the census of nearby open clusters is not complete. Different machine learning methodologies for a blind search of open clusters are complementary; no single method is able to detect 100% of the existing groups. Our methodology has proven to be a reliable tool for the automatic detection of open clusters, designed to be applied to the full Gaia DR2 catalogue.
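A minimal sketch of the density-based search step only, assuming a pandas table with hypothetical column names; the eps and min_samples values are placeholders, and the ANN that separates real open clusters from statistical overdensities is not reproduced here.

# Sketch: flag candidate overdensities with DBSCAN on standardised astrometric columns.
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def find_candidates(catalogue: pd.DataFrame, eps=0.3, min_samples=8):
    cols = ["l", "b", "parallax", "pmra", "pmdec"]   # assumed column names
    X = StandardScaler().fit_transform(catalogue[cols])
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return labels                                    # -1 marks field stars (noise)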