100+ datasets found

Data from: Comparison of predictive performance of data mining algorithms in...
scielo.figshare.com
jpeg
Updated Jun 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Senol Celik; Ecevit Eyduran; Koksal Karadas; Mohammad Masood Tariq (2023). Comparison of predictive performance of data mining algorithms in predicting body weight in Mengali rams of Pakistan [Dataset]. http://doi.org/10.6084/m9.figshare.5719009.v1
Explore at:
jpegAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.5719009.v1
Dataset updated
Jun 4, 2023
Dataset provided by
SciELOhttp://www.scielo.org/
Authors
Senol Celik; Ecevit Eyduran; Koksal Karadas; Mohammad Masood Tariq
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Pakistan
Description
ABSTRACT The present study aimed at comparing predictive performance of some data mining algorithms (CART, CHAID, Exhaustive CHAID, MARS, MLP, and RBF) in biometrical data of Mengali rams. To compare the predictive capability of the algorithms, the biometrical data regarding body (body length, withers height, and heart girth) and testicular (testicular length, scrotal length, and scrotal circumference) measurements of Mengali rams in predicting live body weight were evaluated by most goodness of fit criteria. In addition, age was considered as a continuous independent variable. In this context, MARS data mining algorithm was used for the first time to predict body weight in two forms, without (MARS_1) and with interaction (MARS_2) terms. The superiority order in the predictive accuracy of the algorithms was found as CART > CHAID ≈ Exhaustive CHAID > MARS_2 > MARS_1 > RBF > MLP. Moreover, all tested algorithms provided a strong predictive accuracy for estimating body weight. However, MARS is the only algorithm that generated a prediction equation for body weight. Therefore, it is hoped that the available results might present a valuable contribution in terms of predicting body weight and describing the relationship between the body weight and body and testicular measurements in revealing breed standards and the conservation of indigenous gene sources for Mengali sheep breeding. Therefore, it will be possible to perform more profitable and productive sheep production. Use of data mining algorithms is useful for revealing the relationship between body weight and testicular traits in describing breed standards of Mengali sheep.
d
Data from: A Generic Local Algorithm for Mining Data Streams in Large...
catalog.data.gov
datasets.ai
+2more
Updated Apr 10, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dashlink (2025). A Generic Local Algorithm for Mining Data Streams in Large Distributed Systems [Dataset]. https://catalog.data.gov/dataset/a-generic-local-algorithm-for-mining-data-streams-in-large-distributed-systems
Explore at:
Dataset updated
Apr 10, 2025
Dataset provided by
Dashlink
Description
In a large network of computers or wireless sensors, each of the components (henceforth, peers) has some data about the global state of the system. Much of the system's functionality such as message routing, information retrieval and load sharing relies on modeling the global state. We refer to the outcome of the function (e.g., the load experienced by each peer) as the emph{model} of the system. Since the state of the system is constantly changing, it is necessary to keep the models up-to-date. Computing global data mining models e.g. decision trees, k-means clustering in large distributed systems may be very costly due to the scale of the system and due to communication cost, which may be high. The cost further increases in a dynamic scenario when the data changes rapidly. In this paper we describe a two step approach for dealing with these costs. First, we describe a highly efficient emph{local} algorithm which can be used to monitor a wide class of data mining models. Then, we use this algorithm as a feedback loop for the monitoring of complex functions of the data such as its k-means clustering. The theoretical claims are corroborated with a thorough experimental analysis.
f
The performance of data mining algorithm given worst case/best case...
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated May 4, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zhou, Shang-Ming; O’Neill, Terence W.; Cooksey, Roxanne; Choy, Ernest; Denaxas, Spiros; Dixon, William G.; Sudlow, Cathie; Siebert, Stefan; Kennedy, Jonathan; Brophy, Sinead; Fernandez-Gutierrez, Fabiola; Atkinson, Mark (2016). The performance of data mining algorithm given worst case/best case assumptions. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001549392
Explore at:
Dataset updated
May 4, 2016
Authors
Zhou, Shang-Ming; O’Neill, Terence W.; Cooksey, Roxanne; Choy, Ernest; Denaxas, Spiros; Dixon, William G.; Sudlow, Cathie; Siebert, Stefan; Kennedy, Jonathan; Brophy, Sinead; Fernandez-Gutierrez, Fabiola; Atkinson, Mark
Description
The performance of data mining algorithm given worst case/best case assumptions.
Designing a more efficient, effective and safe Medical Emergency Team (MET)...
plos.figshare.com
pdf
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Christoph Bergmeir; Irma Bilgrami; Christopher Bain; Geoffrey I. Webb; Judit Orosz; David Pilcher (2023). Designing a more efficient, effective and safe Medical Emergency Team (MET) service using data analysis [Dataset]. http://doi.org/10.1371/journal.pone.0188688
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0188688
Dataset updated
Jun 1, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Christoph Bergmeir; Irma Bilgrami; Christopher Bain; Geoffrey I. Webb; Judit Orosz; David Pilcher
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
IntroductionHospitals have seen a rise in Medical Emergency Team (MET) reviews. We hypothesised that the commonest MET calls result in similar treatments. Our aim was to design a pre-emptive management algorithm that allowed direct institution of treatment to patients without having to wait for attendance of the MET team and to model its potential impact on MET call incidence and patient outcomes.MethodsData was extracted for all MET calls from the hospital database. Association rule data mining techniques were used to identify the most common combinations of MET call causes, outcomes and therapies.ResultsThere were 13,656 MET calls during the 34-month study period in 7936 patients. The most common MET call was for hypotension [31%, (2459/7936)]. These MET calls were strongly associated with the immediate administration of intra-venous fluid (70% [1714/2459] v 13% [739/5477] p
A Local Distributed Peer-to-Peer Algorithm Using Multi-Party Optimization...
data.nasa.gov
Updated Mar 31, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
nasa.gov (2025). A Local Distributed Peer-to-Peer Algorithm Using Multi-Party Optimization Based Privacy Preservation for Data Mining Primitive Computation - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/a-local-distributed-peer-to-peer-algorithm-using-multi-party-optimization-based-privacy-pr
Explore at:
Dataset updated
Mar 31, 2025
Dataset provided by
NASAhttp://nasa.gov/
Description
This paper proposes a scalable, local privacy-preserving algorithm for distributed peer-to-peer (P2P) data aggregation useful for many advanced data mining/analysis tasks such as average/sum computation, decision tree induction, feature selection, and more. Unlike most multi-party privacy-preserving data mining algorithms, this approach works in an asynchronous manner through local interactions and therefore, is highly scalable. It particularly deals with the distributed computation of the sum of a set of numbers stored at different peers in a P2P network in the context of a P2P web mining application. The proposed optimization-based privacy-preserving technique for computing the sum allows different peers to specify different privacy requirements without having to adhere to a global set of parameters for the chosen privacy model. Since distributed sum computation is a frequently used primitive, the proposed approach is likely to have significant impact on many data mining tasks such as multi-party privacypreserving clustering, frequent itemset mining, and statistical aggregate computation.
d
Discovering Anomalous Aviation Safety Events Using Scalable Data Mining...
catalog.data.gov
s.cnmilf.com
+3more
Updated Apr 10, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dashlink (2025). Discovering Anomalous Aviation Safety Events Using Scalable Data Mining Algorithms [Dataset]. https://catalog.data.gov/dataset/discovering-anomalous-aviation-safety-events-using-scalable-data-mining-algorithms
Explore at:
Dataset updated
Apr 10, 2025
Dataset provided by
Dashlink
Description
The worldwide civilian aviation system is one of the most complex dynamical systems created. Most modern commercial aircraft have onboard flight data recorders that record several hundred discrete and continuous parameters at approximately 1Hz for the entire duration of the flight. These data contain information about the flight control systems, actuators, engines, landing gear, avionics, and pilot commands. In this paper, recent advances in the development of a novel knowledge discovery process consisting of a suite of data mining techniques for identifying precursors to aviation safety incidents are discussed. The data mining techniques include scalable multiple-kernel learning for large-scale distributed anomaly detection. A novel multivariate time-series search algorithm is used to search for signatures of discovered anomalies on massive datasets. The process can identify operationally significant events due to environmental, mechanical, and human factors issues in the high-dimensional flight operations quality assurance data. All discovered anomalies are validated by a team of independent domain experts. This novel automated knowledge discovery process is aimed at complementing the state-of-the-art human-generated exceedance-based analysis that fails to discover previously unknown aviation safety incidents. In this paper, the discovery pipeline, the methods used, and some of the significant anomalies detected on real-world commercial aviation data are discussed.
Table 2 - Modeling and comparing data mining algorithms for prediction of...
figshare.com
xls
Updated Jun 5, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alireza Mosayebi; Barat Mojaradi; Ali Bonyadi Naeini; Seyed Hamid Khodadad Hosseini (2023). Table 2 - Modeling and comparing data mining algorithms for prediction of recurrence of breast cancer [Dataset]. http://doi.org/10.1371/journal.pone.0237658.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0237658.t002
Dataset updated
Jun 5, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Alireza Mosayebi; Barat Mojaradi; Ali Bonyadi Naeini; Seyed Hamid Khodadad Hosseini
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Table 2 - Modeling and comparing data mining algorithms for prediction of recurrence of breast cancer
d
Distributed Data Mining in Peer-to-Peer Networks
catalog.data.gov
Updated Apr 11, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dashlink (2025). Distributed Data Mining in Peer-to-Peer Networks [Dataset]. https://catalog.data.gov/dataset/distributed-data-mining-in-peer-to-peer-networks
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
Peer-to-peer (P2P) networks are gaining popularity in many applications such as file sharing, e-commerce, and social networking, many of which deal with rich, distributed data sources that can benefit from data mining. P2P networks are, in fact,well-suited to distributed data mining (DDM), which deals with the problem of data analysis in environments with distributed data,computing nodes,and users. This article offers an overview of DDM applications and algorithms for P2P environments,focusing particularly on local algorithms that perform data analysis by using computing primitives with limited communication overhead. The authors describe both exact and approximate local P2P data mining algorithms that work in a decentralized and communication-efficient manner.
t
SDOstreamclust: Stream Clustering Robust to Concept Drift - Evaluation Tests...
researchdata.tuwien.ac.at
zip
Updated Nov 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Felix Iglesias Vazquez; Felix Iglesias Vazquez (2025). SDOstreamclust: Stream Clustering Robust to Concept Drift - Evaluation Tests [Dataset]. http://doi.org/10.48436/xh0w2-q5x18
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.48436/xh0w2-q5x18
Dataset updated
Nov 25, 2025
Dataset provided by
TU Wien
Authors
Felix Iglesias Vazquez; Felix Iglesias Vazquez
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SDOstreamclust Evaluation Tests

conducted for the paper: Stream Clustering Robust to Concept Drift. Please refer to:

Iglesias Vazquez, F., Konzett, S., Zseby, T., & Bifet, A. (2025). Stream Clustering Robust to Concept Drift. In 2025 International Joint Conference on Neural Networks (IJCNN) (pp. 1–10). IEEE. https://doi.org/10.1109/IJCNN64981.2025.11227664

Context and methodology

SDOstreamclust is a stream clustering algorithm able to process data incrementally or per batches. It is a combination of the previous SDOstream (anomaly detection in data streams) and SDOclust (static clustering). SDOstreamclust holds the characteristics of SDO algoritmhs: lightweight, intuitive, self-adjusting, resistant to noise, capable of identifying non-convex clusters, and constructed upon robust parameters and interpretable models. Moreover, it shows excellent adaptation to concept drift

In this repository, SDOclust is evaluated with 165 datasets (both synthetic and real) and compared with CluStream, DBstream, DenStream, StreamKMeans.

This repository is framed within the research on the following domains: algorithm evaluation, stream clustering, unsupervised learning, machine learning, data mining, streaming data analysis. Datasets and algorithms can be used for experiment replication and for further evaluation and comparison.

Docker

A Docker version is also available in: https://hub.docker.com/r/fiv5/sdostreamclust

Technical details

Experiments are conducted in Python v3.8.14. The file and folder structure is as follows:- [algorithms] contains a script with functions related to algorithm configurations.

[data] contains datasets in ARFF format.

[results] contains CSV files with algorithms' performances obtained from running the "run.sh" script (as shown in the paper).

"dependencies.sh" lists and installs python dependencies.

"pysdoclust-stream-main.zip" contains the SDOstreamclust python package.

"README.md" shows details and intructions to use this repository.

"run.sh" runs the complete experiments.

"run_comp.py"for running experiments specified by arguments.

"TSindex.py" implements functions for the Temporal Silhouette index.

Note: if codes in SDOstreamclust are modified, SWIG (v4.2.1) wrappers have to be rebuilt and SDOstreamclust consequently reinstalled with pip.

License

The CC-BY license applies to all data generated with MDCgen. All distributed code is under the GPLv3+ license.
Real Market Data for Association Rules
kaggle.com
zip
Updated Sep 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ruken Missonnier (2023). Real Market Data for Association Rules [Dataset]. https://www.kaggle.com/datasets/rukenmissonnier/real-market-data
Explore at:
zip(3068 bytes)Available download formats
Dataset updated
Sep 15, 2023
Authors
Ruken Missonnier
Description
1. Introduction

Within the confines of this document, we embark on a comprehensive journey delving into the intricacies of a dataset meticulously curated for the purpose of association rules mining. This sophisticated data mining technique is a linchpin in the realms of market basket analysis. The dataset in question boasts an array of items commonly found in retail transactions, each meticulously encoded as a binary variable, with "1" denoting presence and "0" indicating absence in individual transactions.

2. Dataset Overview

Our dataset unfolds as an opulent tapestry of distinct columns, each dedicated to the representation of a specific item:

Bread

Honey

Bacon

Toothpaste

Banana

Apple

Hazelnut

Cheese

Meat

Carrot

Cucumber

Onion

Milk

Butter

ShavingFoam

Salt

Flour

HeavyCream

Egg

Olive

Shampoo

Sugar

3. Purpose of the Dataset

The raison d'être of this dataset is to serve as a catalyst for the discovery of intricate associations and patterns concealed within the labyrinthine network of customer transactions. Each row in this dataset mirrors a solitary transaction, while the values within each column serve as sentinels, indicating whether a particular item was welcomed into a transaction's embrace or relegated to the periphery.

4. Data Format

The data within this repository is rendered in a binary symphony, where the enigmatic "1" enunciates the acquisition of an item, and the stoic "0" signifies its conspicuous absence. This binary manifestation serves to distill the essence of the dataset, centering the focus on item presence, rather than the quantum thereof.

5. Potential Applications

This dataset unfurls its wings to encompass an assortment of prospective applications, including but not limited to:

Market Basket Analysis: Discerning items that waltz together in shopping carts, thus bestowing enlightenment upon the orchestration of product placement and marketing strategies.

Recommender Systems: Crafting bespoke product recommendations, meticulously tailored to each customer's historical transactional symphony.

Inventory Management: Masterfully fine-tuning stock levels for items that find kinship in frequent co-acquisition, thereby orchestrating a harmonious reduction in carrying costs and stockouts.

Customer Behavior Analysis: Peering into the depths of customer proclivities and purchase patterns, paving the way for the sculpting of exquisite marketing campaigns.

6. Analysis Techniques

The treasure trove of this dataset beckons the deployment of quintessential techniques, among them the venerable Apriori and FP-Growth algorithms. These stalwart algorithms are proficient at ferreting out the elusive frequent itemsets and invaluable association rules, shedding light on the arcane symphony of customer behavior and item co-occurrence patterns.

7. Conclusion

In closing, the association rules dataset unfurled before you offers an alluring odyssey, replete with the promise of discovering priceless patterns and affiliations concealed within the tapestry of transactional data. Through the artistry of data mining algorithms, businesses and analysts stand poised to unearth hitherto latent insights capable of steering the helm of strategic decisions, elevating the pantheon of customer experiences, and orchestrating the symphony of operational optimization.
Grocery Store dataset for data mining
kaggle.com
zip
Updated Mar 9, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Honey Patel (2021). Grocery Store dataset for data mining [Dataset]. https://www.kaggle.com/honeypatel2158/grocery-store-dataset-for-data-mining
Explore at:
zip(7990 bytes)Available download formats
Dataset updated
Mar 9, 2021
Authors
Honey Patel
Description
Dataset

This dataset was created by Honey Patel

Contents
Artificial dataset for clustering algorithms(Complete)
figshare.com
zip
Updated Sep 27, 2018
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mayra Zegarra Rodriguez; Dalcimar Casanova; Cesar Henrique Comin; Odemir M. Bruno; Diego Raphael Amancio; Luciano da Fontoura Costa; Francisco Aparecido Rodrigues (2018). Artificial dataset for clustering algorithms(Complete) [Dataset]. http://doi.org/10.6084/m9.figshare.7139510.v2
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.7139510.v2
Dataset updated
Sep 27, 2018
Dataset provided by
Figsharehttp://figshare.com/
figshare
Authors
Mayra Zegarra Rodriguez; Dalcimar Casanova; Cesar Henrique Comin; Odemir M. Bruno; Diego Raphael Amancio; Luciano da Fontoura Costa; Francisco Aparecido Rodrigues
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This file contains a number of randomly generated datasets. The properties of each dataset are indicated in the name of each respective file: 'C' indicates the number of classes, 'F' indicates the number of features, 'Ne' indicates the number of objects contained in each class, 'A' is related to the average separation between classes and 'R' is an index used to differentiate distinct random trials. So, for instance, the file C2F10N2Ne5A1.2R0 is a dataset containing 2 classes, 10 features, 5 objects for each class and having a typical separation between classes of 1.2. The methodology used for generating the datasets is described in the accompanying reference.
G
Data Mining Tools Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Aug 4, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Growth Market Reports (2025). Data Mining Tools Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/data-mining-tools-market
Explore at:
pdf, csv, pptxAvailable download formats
Dataset updated
Aug 4, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Data Mining Tools Market Outlook

According to our latest research, the global Data Mining Tools market size reached USD 1.93 billion in 2024, reflecting robust industry momentum. The market is expected to grow at a CAGR of 12.7% from 2025 to 2033, reaching a projected value of USD 5.69 billion by 2033. This growth is primarily driven by the increasing adoption of advanced analytics across diverse industries, rapid digital transformation, and the necessity for actionable insights from massive data volumes.

One of the pivotal growth factors propelling the Data Mining Tools market is the exponential rise in data generation, particularly through digital channels, IoT devices, and enterprise applications. Organizations across sectors are leveraging data mining tools to extract meaningful patterns, trends, and correlations from structured and unstructured data. The need for improved decision-making, operational efficiency, and competitive advantage has made data mining an essential component of modern business strategies. Furthermore, advancements in artificial intelligence and machine learning are enhancing the capabilities of these tools, enabling predictive analytics, anomaly detection, and automation of complex analytical tasks, which further fuels market expansion.

Another significant driver is the growing demand for customer-centric solutions in industries such as retail, BFSI, and healthcare. Data mining tools are increasingly being used for customer relationship management, targeted marketing, fraud detection, and risk management. By analyzing customer behavior and preferences, organizations can personalize their offerings, optimize marketing campaigns, and mitigate risks. The integration of data mining tools with cloud platforms and big data technologies has also simplified deployment and scalability, making these solutions accessible to small and medium-sized enterprises (SMEs) as well as large organizations. This democratization of advanced analytics is creating new growth avenues for vendors and service providers.

The regulatory landscape and the increasing emphasis on data privacy and security are also shaping the development and adoption of Data Mining Tools. Compliance with frameworks such as GDPR, HIPAA, and CCPA necessitates robust data governance and transparent analytics processes. Vendors are responding by incorporating features like data masking, encryption, and audit trails into their solutions, thereby enhancing trust and adoption among regulated industries. Additionally, the emergence of industry-specific data mining applications, such as fraud detection in BFSI and predictive diagnostics in healthcare, is expanding the addressable market and fostering innovation.

From a regional perspective, North America currently dominates the Data Mining Tools market owing to the early adoption of advanced analytics, strong presence of leading technology vendors, and high investments in digital transformation. However, the Asia Pacific region is emerging as a lucrative market, driven by rapid industrialization, expansion of IT infrastructure, and growing awareness of data-driven decision-making in countries like China, India, and Japan. Europe, with its focus on data privacy and digital innovation, also represents a significant market share, while Latin America and the Middle East & Africa are witnessing steady growth as organizations in these regions modernize their operations and adopt cloud-based analytics solutions.

Component Analysis

The Component segment of the Data Mining Tools market is bifurcated into Software and Services. Software remains the dominant segment, accounting for the majority of the market share in 2024. This dominance is attributed to the continuous evolution of data mining algorithms, the proliferation of user-friendly graphical interfaces, and the integration of advanced analytics capabilities such as machine learning, artificial intelligence, and natural language pro
m
Educational Attainment in North Carolina Public Schools: Use of statistical...
data.mendeley.com
Updated Nov 14, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
Explore at:
Unique identifier
https://doi.org/10.17632/6cm9wyd5g5.1
Dataset updated
Nov 14, 2018
Authors
Scott Herford
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The purpose of data mining analysis is always to find patterns of the data using certain kind of techiques such as classification or regression. It is not always feasible to apply classification algorithms directly to dataset. Before doing any work on the data, the data has to be pre-processed and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance has not improved much. The reason why it has not improved could be the features we selected to perform clustering are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics. From the dimensionality reduction perspective: It is different from Principle Component Analysis which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique of reducing the data dimension will lose a lot of information since clustering techniques are based a metric of 'distance'. At high dimensions euclidean distance loses pretty much all meaning. Therefore using clustering as a "Reducing" dimensionality by mapping data points to cluster numbers is not always good since you may lose almost all the information. From the creating new features perspective: Clustering analysis creates labels based on the patterns of the data, it brings uncertainties into the data. By using clustering prior to classification, the decision on the number of clusters will highly affect the performance of the clustering, then affect the performance of classification. If the part of features we use clustering techniques on is very suited for it, it might increase the overall performance on classification. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not lock in the clustering outputs using a random_state in the effort to see if they were stable. Our assumption was that if the results vary highly from run to run which they definitely did, maybe the data just does not cluster well with the methods selected at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering to the data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model real world effectiveness and also to continue to revise the models from time to time as things change.
d
Data from: Mining Distance-Based Outliers in Near Linear Time
catalog.data.gov
datasets.ai
Updated Apr 11, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dashlink (2025). Mining Distance-Based Outliers in Near Linear Time [Dataset]. https://catalog.data.gov/dataset/mining-distance-based-outliers-in-near-linear-time
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
Application Research of Clustering on kmeans
kaggle.com
zip
Updated Feb 27, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ddpr raju (2021). Application Research of Clustering on kmeans [Dataset]. https://www.kaggle.com/ddprraju/tirupati-compus-school
Explore at:
zip(34507 bytes)Available download formats
Dataset updated
Feb 27, 2021
Authors
ddpr raju
Description
Dataset

This dataset was created by ddpr raju

Contents
m
Amharic text dataset extracted from memes for hate speech detection or...
data.mendeley.com
Updated Jun 8, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mequanent Degu (2023). Amharic text dataset extracted from memes for hate speech detection or classification [Dataset]. http://doi.org/10.17632/gw3fdtw5v7.2
Explore at:
Unique identifier
https://doi.org/10.17632/gw3fdtw5v7.2
Dataset updated
Jun 8, 2023
Authors
Mequanent Degu
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
the dataset is collected from social media such as facebook and telegram. the dataset is further processed. the collection are orginal_cleaned: this dataset is neither stemed nor stopword are remove: stopword_removed: in this dataset stopwords are removed but not stemmed and in stemed datset is stemmed and stopwords are removed. stemming is done using hornmorpho developed by Michael Gesser( available at https://github.com/hltdi/HornMorpho) all datasets are normalized and free from noise such as punctuation marks and emojs.
S
Trajectory Hotspot Mining Algorithm Based on Co-location Pattern
scidb.cn
Updated Sep 26, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
yan rui bin (2022). Trajectory Hotspot Mining Algorithm Based on Co-location Pattern [Dataset]. http://doi.org/10.57760/sciencedb.j00133.00127
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57760/sciencedb.j00133.00127
Dataset updated
Sep 26, 2022
Dataset provided by
Science Data Bank
Authors
yan rui bin
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Trajectory hotspot mining algorithms NDTTJ and NDTTT based on Co-location Pattern, trajectory hotspot mining algorithm TTHS based on graph databaseAlgorithm experiment processMain references
Comparison of the performance of the algorithms on the initial dataset and...
plos.figshare.com
figshare.com
xls
Updated Jun 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nacer Farajzadeh; Nima Sadeghzadeh (2023). Comparison of the performance of the algorithms on the initial dataset and the one with features selection (no. class = 2). [Dataset]. http://doi.org/10.1371/journal.pone.0284588.t004
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0284588.t004
Dataset updated
Jun 15, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Nacer Farajzadeh; Nima Sadeghzadeh
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Comparison of the performance of the algorithms on the initial dataset and the one with features selection (no. class = 2).
r
Data from: Scaling data mining in massively parallel dataflow systems
resodate.org
Updated Feb 5, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sebastian Schelter (2016). Scaling data mining in massively parallel dataflow systems [Dataset]. http://doi.org/10.14279/depositonce-4982
Explore at:
Unique identifier
https://doi.org/10.14279/depositonce-4982
Dataset updated
Feb 5, 2016
Dataset provided by
Technische Universität Berlin
DepositOnce
Authors
Sebastian Schelter
Description
This thesis lays the ground work for enabling scalable data mining in massively parallel dataflow systems, using large datasets. Such datasets have become ubiquitous. We illustrate common fallacies with respect to scalable data mining: It is in no way sufficient to naively implement textbook algorithms on parallel systems; bottlenecks on all layers of the stack prevent the scalability of such naive implementations. We argue that scalability in data mining is a multi-leveled problem and must therefore be approached on the interplay of algorithms, systems, and applications. We therefore discuss a selection of scalability problems on these different levels. We investigate algorithm-specific scalability aspects of collaborative filtering algorithms for computing recommendations, a popular data mining use case with many industry deployments. We show how to efficiently execute the two most common approaches, namely neighborhood methods and latent factor models on MapReduce, and describe a specialized architecture for scaling collaborative filtering to extremely large datasets which we implemented at Twitter. We turn to system-specific scalability aspects, where we improve system performance during the distributed execution of a special class of iterative algorithms by drastically reducing the overhead required for guaranteeing fault tolerance. Therefore we propose a novel optimistic approach to fault-tolerance which exploits the robust convergence properties of a large class of fixpoint algorithms and does not incur measurable overhead in failure-free cases. Finally, we present work on an application-specific scalability aspect of scalable data mining. A common problem when deploying machine learning applications in real-world scenarios is that the prediction quality of ML models heavily depends on hyperparameters that have to be chosen in advance. We propose an algorithmic framework for an important subproblem occuring during hyperparameter search at scale: efficiently generating samples from block-partitioned matrices in a shared-nothing environment. For every selected problem, we show how to execute the resulting computation automatically in a parallel and scalable manner, and evaluate our proposed solution on large datasets with billions of datapoints.

Facebook

Twitter

Click to copy link

Link copied

Cite

Senol Celik; Ecevit Eyduran; Koksal Karadas; Mohammad Masood Tariq (2023). Comparison of predictive performance of data mining algorithms in predicting body weight in Mengali rams of Pakistan [Dataset]. http://doi.org/10.6084/m9.figshare.5719009.v1

Data from: Comparison of predictive performance of data mining algorithms in predicting body weight in Mengali rams of Pakistan

Explore at:

jpegAvailable download formats

Unique identifier

https://doi.org/10.6084/m9.figshare.5719009.v1

Dataset updated

Jun 4, 2023

Dataset provided by

SciELOhttp://www.scielo.org/

Authors

Senol Celik; Ecevit Eyduran; Koksal Karadas; Mohammad Masood Tariq

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Area covered

Pakistan

Description

ABSTRACT The present study aimed at comparing predictive performance of some data mining algorithms (CART, CHAID, Exhaustive CHAID, MARS, MLP, and RBF) in biometrical data of Mengali rams. To compare the predictive capability of the algorithms, the biometrical data regarding body (body length, withers height, and heart girth) and testicular (testicular length, scrotal length, and scrotal circumference) measurements of Mengali rams in predicting live body weight were evaluated by most goodness of fit criteria. In addition, age was considered as a continuous independent variable. In this context, MARS data mining algorithm was used for the first time to predict body weight in two forms, without (MARS_1) and with interaction (MARS_2) terms. The superiority order in the predictive accuracy of the algorithms was found as CART > CHAID ≈ Exhaustive CHAID > MARS_2 > MARS_1 > RBF > MLP. Moreover, all tested algorithms provided a strong predictive accuracy for estimating body weight. However, MARS is the only algorithm that generated a prediction equation for body weight. Therefore, it is hoped that the available results might present a valuable contribution in terms of predicting body weight and describing the relationship between the body weight and body and testicular measurements in revealing breed standards and the conservation of indigenous gene sources for Mengali sheep breeding. Therefore, it will be possible to perform more profitable and productive sheep production. Use of data mining algorithms is useful for revealing the relationship between body weight and testicular traits in describing breed standards of Mengali sheep.

Clear search

Close search

Google apps

Main menu

Data from: Comparison of predictive performance of data mining algorithms in...

Data from: A Generic Local Algorithm for Mining Data Streams in Large...

The performance of data mining algorithm given worst case/best case...

Designing a more efficient, effective and safe Medical Emergency Team (MET)...

A Local Distributed Peer-to-Peer Algorithm Using Multi-Party Optimization...

Discovering Anomalous Aviation Safety Events Using Scalable Data Mining...

Table 2 - Modeling and comparing data mining algorithms for prediction of...

Distributed Data Mining in Peer-to-Peer Networks

SDOstreamclust: Stream Clustering Robust to Concept Drift - Evaluation Tests...

SDOstreamclust Evaluation Tests

Context and methodology

Technical details

License

Real Market Data for Association Rules

1. Introduction

2. Dataset Overview

3. Purpose of the Dataset

4. Data Format

5. Potential Applications

6. Analysis Techniques

7. Conclusion

Grocery Store dataset for data mining

Dataset

Contents

Artificial dataset for clustering algorithms(Complete)

Data Mining Tools Market Research Report 2033

Data Mining Tools Market Outlook

Component Analysis

Educational Attainment in North Carolina Public Schools: Use of statistical...

Data from: Mining Distance-Based Outliers in Near Linear Time

Application Research of Clustering on kmeans

Dataset

Contents

Amharic text dataset extracted from memes for hate speech detection or...

Trajectory Hotspot Mining Algorithm Based on Co-location Pattern

Comparison of the performance of the algorithms on the initial dataset and...

Data from: Scaling data mining in massively parallel dataflow systems

Data from: Comparison of predictive performance of data mining algorithms in predicting body weight in Mengali rams of Pakistan