100+ datasets found
  1. Ensemble Data Mining Methods

    • data.nasa.gov
    • catalog.data.gov
    • +2 more
    application/rdfxml +5
    Updated Jun 26, 2018
    Cite
    (2018). Ensemble Data Mining Methods [Dataset]. https://data.nasa.gov/dataset/Ensemble-Data-Mining-Methods/m82a-r3bm
    Explore at:
    Available download formats: json, xml, application/rssxml, tsv, csv, application/rdfxml
    Dataset updated
    Jun 26, 2018
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Description

    Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods that leverage the power of multiple models to achieve better prediction accuracy than any of the individual models could on their own. The basic goal when designing an ensemble is the same as when establishing a committee of people: each member of the committee should be as competent as possible, but the members should be complementary to one another. If the members are not complementary, i.e., if they always agree, then the committee is unnecessary---any one member is sufficient. If the members are complementary, then when one or a few members make an error, the probability is high that the remaining members can correct this error. Research in ensemble methods has largely revolved around designing ensembles consisting of competent yet complementary models.
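
    As a hedged illustration of the committee idea (the scikit-learn models and synthetic data below are stand-ins, not part of this NASA record), three complementary classifiers vote and the majority can outvote a single member's error:

    ```python
    # Minimal committee sketch: three different model families vote.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    members = [("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
               ("nb", GaussianNB()),
               ("lr", LogisticRegression(max_iter=1000))]
    committee = VotingClassifier(estimators=members, voting="hard").fit(X_tr, y_tr)

    for name, model in members:
        print(name, round(model.fit(X_tr, y_tr).score(X_te, y_te), 3))
    print("committee", round(committee.score(X_te, y_te), 3))
    ```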

  2. Educational Attainment in North Carolina Public Schools: Use of statistical...

    • data.mendeley.com
    Updated Nov 14, 2018
    Cite
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Explore at:
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    North Carolina
    Description

    The purpose of data mining analysis is always to find patterns in the data using certain kinds of techniques, such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset: before doing any work on the data, it has to be pre-processed, and this normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. In our project, after using clustering prior to classification, the performance did not improve much; the reason may be that the features we selected to perform clustering on are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

    From the dimensionality reduction perspective: clustering differs from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique for reducing the data dimension can lose a lot of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

    From the creating-new-features perspective: clustering analysis creates labels based on the patterns in the data, which brings uncertainty into the data. When using clustering prior to classification, the choice of the number of clusters will highly affect the performance of the clustering, and in turn the performance of the classification. If the subset of features we apply clustering to is well suited for it, it might increase the overall classification performance; for example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not lock in the clustering outputs with a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the selected methods at all. Basically, the ramification we saw was that our results were not much better than random when applying clustering in the data preprocessing.

    Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and also to revise the models from time to time as things change.
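
    A hedged sketch of the pipeline evaluated above (synthetic data and scikit-learn estimators are assumptions; the actual NC school data is not reproduced): k-means labels are appended as a new feature before classification and compared against the raw features. As in the study, random_state is deliberately left unset for k-means, so re-running shows how unstable the cluster labels and downstream scores can be.

    ```python
    # Compare classification accuracy with and without k-means cluster labels.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=2000, n_features=15, random_state=0)

    baseline = cross_val_score(LogisticRegression(max_iter=1000), X, y).mean()

    # No random_state on purpose: re-runs reveal how unstable the clusters are.
    labels = KMeans(n_clusters=8, n_init=10).fit_predict(X)
    X_aug = np.column_stack([X, labels])
    augmented = cross_val_score(LogisticRegression(max_iter=1000), X_aug, y).mean()

    print(f"baseline {baseline:.3f} vs. clustering-augmented {augmented:.3f}")
    ```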

  3. Model assessment measures

    • paper.erudition.co.in
    html
    Updated Dec 2, 2023
    Cite
    Einetic (2023). Model assessment measures [Dataset]. https://paper.erudition.co.in/makaut/btech-in-computer-science-and-engineering-artificial-intelligence-and-machine-learning/6/data-mining
    Explore at:
    Available download formats: html
    Dataset updated
    Dec 2, 2023
    Dataset authored and provided by
    Einetic
    License

    https://paper.erudition.co.in/terms

    Description

    Question paper solutions for the chapter "Model assessment measures" of Data Mining, 6th Semester, B.Tech in Computer Science & Engineering (Artificial Intelligence and Machine Learning)

  4. Experimental data for "Software Data Analytics: Architectural Model...

    • figshare.com
    • data.4tu.nl
    zip
    Updated Jun 6, 2023
    Cite
    Cong Liu (2023). Experimental data for "Software Data Analytics: Architectural Model Discovery and Design Pattern Detection" [Dataset]. http://doi.org/10.4121/uuid:ca1b0690-d9c5-4626-a067-525ec9d5881b
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Cong Liu
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset includes all experimental data used for the PhD thesis of Cong Liu, entitled "Software Data Analytics: Architectural Model Discovery and Design Pattern Detection". These data were generated by instrumenting both synthetic and real-life software systems, and are formatted according to the IEEE XES standard. See http://www.xes-standard.org/ and https://www.win.tue.nl/ieeetfpm/lib/exe/fetch.php?media=shared:downloads:2017-06-22-xes-software-event-v5-2.pdf for more explanation.
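
    A minimal sketch for opening such logs with pm4py, a common XES toolkit (the thesis does not prescribe this library, and the filename is a hypothetical placeholder):

    ```python
    # Load an IEEE XES event log; recent pm4py versions return a pandas DataFrame.
    import pm4py

    log = pm4py.read_xes("instrumented_system_log.xes")  # hypothetical filename
    print(log["concept:name"].value_counts().head())     # most frequent events
    ```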

  5. Webis Simulation Data Mining Bridge Models Corpus 2012 (Webis-SDMbridge-12)

    • live.european-language-grid.eu
    Updated Mar 19, 2024
    Cite
    (2024). Webis Simulation Data Mining Bridge Models Corpus 2012 (Webis-SDMbridge-12) [Dataset]. https://live.european-language-grid.eu/catalogue/ld/7937
    Explore at:
    Dataset updated
    Mar 19, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This corpus provides the simulation data mining community with a collection of 14641 bridge models and simulated behavior.

    1. Folder "1-designs":

    The text files in this directory should contain all information for the independent variables of any machine learning experiment. For reference, all 14641 IFC models are supplied in subfolders 001 to 147.

    2. Folder "2-simulation":

    This folder contains samples of the simulation output that may be viewed in Paraview (http://www.paraview.org). The original model contains the "Org" filename fragment, and the maximum and minimum behaviors are indicated with "Max" and "Min" filename fragments. Displacement, strain, and stress behaviors are all given. Only three of the 14641 models are given as the file sizes are around 1.4 to 2.2 megabytes each. The complete data (approximately 81 gigabytes) can be regenerated and provided if necessary on request (email webis@medien.uni-weimar.de).

    3. Folder "3-aggregation":

    Maximum displacement, strain, and stress measurements are given in the text files individually, and together in the files with the "vtk" filename fragment. This data should be sufficient for the dependent variables of any machine learning experiment.
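
    A loose illustration of how the three folders could map onto an (X, y) pair for a learning experiment; every filename and the whitespace-delimited layout below are assumptions, so the corpus's own documentation governs the real formats:

    ```python
    # Assemble independent variables (folder 1) and a dependent variable
    # (folder 3) into arrays for a regression experiment.
    from pathlib import Path

    import pandas as pd

    designs = pd.read_csv(Path("1-designs") / "designs.txt", sep=r"\s+")                # hypothetical
    targets = pd.read_csv(Path("3-aggregation") / "max_displacement.txt", sep=r"\s+")   # hypothetical

    X = designs.to_numpy()           # bridge design parameters
    y = targets.to_numpy().ravel()   # e.g., maximum displacement per model
    print(X.shape, y.shape)
    ```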

  6. Data from: A Proposed Churn Prediction Model

    • figshare.com
    pdf
    Updated Feb 24, 2019
    Cite
    Mona Nasr; Essam Shaaban; Yehia Helmy; Dr. Ayman Khedr (2019). A Proposed Churn Prediction Model [Dataset]. http://doi.org/10.6084/m9.figshare.7763183.v2
    Explore at:
    Available download formats: pdf
    Dataset updated
    Feb 24, 2019
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Mona Nasr; Essam Shaaban; Yehia Helmy; Dr. Ayman Khedr
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Churn prediction aims to detect customers who intend to leave a service provider. Acquiring a new customer costs an organization 5 to 10 times more than retaining an existing one. Predictive models can correctly identify likely churners in the near future so that a retention solution can be offered. This paper presents a new prediction model based on Data Mining (DM) techniques. The proposed model is composed of six steps: identifying the problem domain, data selection, investigating the data set, classification, clustering, and knowledge usage. A data set with 23 attributes and 5000 instances is used: 4000 instances for training the model and 1000 instances as a testing set. The predicted churners are clustered into 3 categories for use in a retention strategy. The data mining techniques used in this paper are Decision Tree, Support Vector Machine, and Neural Network, applied through the open-source software WEKA.
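
    A hedged sketch of the classification step, with scikit-learn standing in for WEKA and synthetic data standing in for the churn set (which is not included here); only the 23 attributes and the 4000/1000 split mirror the description:

    ```python
    # Train the three classifier families on 4000 instances, test on 1000.
    from sklearn.datasets import make_classification
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=5000, n_features=23, random_state=0)
    X_train, y_train, X_test, y_test = X[:4000], y[:4000], X[4000:], y[4000:]

    for model in (DecisionTreeClassifier(random_state=0), SVC(),
                  MLPClassifier(max_iter=500, random_state=0)):
        acc = model.fit(X_train, y_train).score(X_test, y_test)
        print(type(model).__name__, round(acc, 3))
    ```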

  7. Privacy Preserving Distributed Data Mining

    • catalog.data.gov
    • datadiscoverystudio.org
    • +2 more
    Updated Dec 6, 2023
    Cite
    Dashlink (2023). Privacy Preserving Distributed Data Mining [Dataset]. https://catalog.data.gov/dataset/privacy-preserving-distributed-data-mining
    Explore at:
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    Dashlink
    Description

    Distributed data mining from privacy-sensitive multi-party data is likely to play an important role in the next generation of integrated vehicle health monitoring systems. For example, consider an aircraft manufacturer $\mathcal{C}$ manufacturing an aircraft model $A$ and selling it to five different airline operating companies $\mathcal{V}_1 \dots \mathcal{V}_5$. These aircraft, during their operation, generate huge amounts of data. Mining this data can reveal useful information regarding the health and operability of the aircraft, which can be useful for disaster management and prediction of efficient operating regimes. Now if the manufacturer $\mathcal{C}$ wants to analyze the performance data collected from different aircraft of model-type $A$ belonging to different airlines, then central collection of data for subsequent analysis may not be an option. It should be noted that the result of this analysis may be statistically more significant if the data for aircraft model $A$ across all companies were available to $\mathcal{C}$. The potential problems arising out of such a data mining scenario are:

  8. Video-to-Model Data Set

    • figshare.com
    • commons.datacite.org
    xml
    Updated Mar 24, 2020
    Cite
    Sönke Knoch; Shreeraman Ponpathirkoottam; Tim Schwartz (2020). Video-to-Model Data Set [Dataset]. http://doi.org/10.6084/m9.figshare.12026850.v1
    Explore at:
    Available download formats: xml
    Dataset updated
    Mar 24, 2020
    Dataset provided by
    figshare
    Authors
    Sönke Knoch; Shreeraman Ponpathirkoottam; Tim Schwartz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set belongs to the paper "Video-to-Model: Unsupervised Trace Extraction from Videos for Process Discovery and Conformance Checking in Manual Assembly", submitted on March 24, 2020, to the 18th International Conference on Business Process Management (BPM).

    Abstract: Manual activities are often hidden deep down in discrete manufacturing processes. For the elicitation and optimization of process behavior, complete information about the execution of manual activities is required. Thus, an approach is presented for extracting execution-level information from videos of manual assembly. The goal is the generation of a log that can be used in state-of-the-art process mining tools. The test bed for the system was lightweight and scalable, consisting of an assembly workstation equipped with a single RGB camera recording only the hand movements of the worker from the top. A neural-network-based real-time object classifier was trained to detect the worker's hands. The hand detector delivers the input for an algorithm which generates trajectories reflecting the movement paths of the hands. Those trajectories are automatically assigned to work steps using the position of material boxes on the assembly shelf as reference points and hierarchical clustering of similar behaviors with dynamic time warping. The system has been evaluated in a task-based study with ten participants in a laboratory, but under realistic conditions. The generated logs have been loaded into the process mining toolkit ProM to discover the underlying process model and to detect deviations from both instructions and ground truth using conformance checking. The results show that process mining delivers insights about the assembly process and the system's precision.

    The data set contains the generated and the annotated logs based on the video material gathered during the user study. In addition, the Petri nets from the process discovery and conformance checking conducted with ProM (http://www.promtools.org) and the reference nets modeled with Yasper (http://www.yasper.org/) are provided.
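
    A minimal sketch of the trajectory-comparison step: a plain dynamic time warping (DTW) distance between two 2-D hand paths, the kind of pairwise metric the described pipeline feeds into hierarchical clustering (the hand detector and the clustering itself are not reproduced here):

    ```python
    # Plain DTW distance between two 2-D trajectories (lists of hand positions).
    import numpy as np

    def dtw(a: np.ndarray, b: np.ndarray) -> float:
        """Dynamic time warping distance between two (n, 2) point sequences."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])  # pointwise distance
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return float(D[n, m])

    path_a = np.array([[0, 0], [1, 1], [2, 2], [3, 2]], dtype=float)
    path_b = np.array([[0, 0], [1, 1], [3, 2]], dtype=float)
    print(dtw(path_a, path_b))  # small value: similar movement paths
    ```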

  9. CONCEPT-DM2 DATA MODEL TO ANALYSE HEALTHCARE PATHWAYS OF TYPE 2 DIABETES

    • zenodo.org
    bin, png, zip
    Updated Jul 12, 2024
    + more versions
    Cite
    Berta Ibáñez-Beroiz; Asier Ballesteros-Domínguez; Ignacio Oscoz-Villanueva; Ibai Tamayo; Julián Librero; Mónica Enguita-Germán; Francisco Estupiñán-Romero; Enrique Bernal-Delgado (2024). CONCEPT-DM2 DATA MODEL TO ANALYSE HEALTHCARE PATHWAYS OF TYPE 2 DIABETES [Dataset]. http://doi.org/10.5281/zenodo.7778291
    Explore at:
    Available download formats: bin, png, zip
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo
    Authors
    Berta Ibáñez-Beroiz; Asier Ballesteros-Domínguez; Ignacio Oscoz-Villanueva; Ibai Tamayo; Julián Librero; Mónica Enguita-Germán; Francisco Estupiñán-Romero; Enrique Bernal-Delgado
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Technical notes and documentation on the common data model of the project CONCEPT-DM2.

    This publication corresponds to the Common Data Model (CDM) specification of the CONCEPT-DM2 project for the implementation of a federated network analysis of the healthcare pathway of type 2 diabetes.

    Aims of the CONCEPT-DM2 project:

    General aim: To analyse chronic care effectiveness and efficiency of care pathways in diabetes, assuming the relevance of care pathways as independent factors of health outcomes, using real-world data (RWD) from five Spanish Regional Health Systems.

    Main specific aims:

    • To characterize the care pathways in patients with diabetes through the whole care system in terms of process indicators and pharmacologic recommendations
    • To compare these observed care pathways with the theoretical clinical pathways derived from the clinical practice guidelines
    • To assess whether adherence to clinical guidelines influences important health outcomes, such as cardiovascular hospitalizations.
    • To compare the traditional analytical methods with process mining methods in terms of modeling quality, prediction performance and information provided.

    Study Design: It is a population-based retrospective observational study centered on all T2D patients diagnosed in five Regional Health Services within the Spanish National Health Service. We will include all the contacts of these patients with the health services using the electronic medical record systems including Primary Care data, Specialized Care data, Hospitalizations, Urgent Care data, Pharmacy Claims, and also other registers such as the mortality and the population register.

    Cohort definition: All patients with a Type 2 Diabetes code in the clinical health records

    • Inclusion criteria: patients that, at 01/01/2017 or during the follow-up from 01/01/2017 to 31/12/2022, had an active health card (active TIS, tarjeta sanitaria activa) and a type 2 diabetes code (T2D; DM2 in Spanish) in the clinical records of primary care (CIAP2 T90 in case of using the CIAP code system)
    • Exclusion criteria:
      • patients with no contact with the health system from 01/01/2017 to 31/12/2022
      • patients with a T1D (DM1) code opened after the T2D code during the follow-up
    • Study period: from 01/01/2017 to 31/12/2022

    Files included in this publication:

    • Datamodel_CONCEPT_DM2_diagram.png
    • Common data model specification (Datamodel_CONCEPT_DM2_v.0.1.0.xlsx)
    • Synthetic datasets (Datamodel_CONCEPT_DM2_sample_data)
      • sample_data1_dm_patient.csv
      • sample_data2_dm_param.csv
      • sample_data3_dm_patient.csv
      • sample_data4_dm_param.csv
      • sample_data5_dm_patient.csv
      • sample_data6_dm_param.csv
      • sample_data7_dm_param.csv
      • sample_data8_dm_param.csv
    • Datamodel_CONCEPT_DM2_explanation.pptx
  10. Data from: A Generic Local Algorithm for Mining Data Streams in Large...

    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • datasets.ai
    • +2 more
    Updated Feb 18, 2025
    Cite
    nasa.gov (2025). A Generic Local Algorithm for Mining Data Streams in Large Distributed Systems [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/a-generic-local-algorithm-for-mining-data-streams-in-large-distributed-systems
    Explore at:
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    In a large network of computers or wireless sensors, each of the components (henceforth, peers) has some data about the global state of the system. Much of the system's functionality, such as message routing, information retrieval, and load sharing, relies on modeling the global state. We refer to the outcome of the function (e.g., the load experienced by each peer) as the model of the system. Since the state of the system is constantly changing, it is necessary to keep the models up to date. Computing global data mining models, e.g., decision trees or k-means clustering, in large distributed systems may be very costly due to the scale of the system and to communication cost, which may be high. The cost further increases in a dynamic scenario when the data changes rapidly. In this paper we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm which can be used to monitor a wide class of data mining models. Then, we use this algorithm as a feedback loop for the monitoring of complex functions of the data, such as its k-means clustering. The theoretical claims are corroborated with a thorough experimental analysis.

  11. Supplementary Material: Predictive model using Cross Industry Standard...

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1 more
    Updated Apr 22, 2022
    Cite
    (2022). Supplementary Material: Predictive model using Cross Industry Standard Process for Data Mining [Dataset]. http://doi.org/10.5281/zenodo.6478177
    Explore at:
    Dataset updated
    Apr 22, 2022
    Description

    The Supplementary Material of the paper "Predictive model using Cross Industry Standard Process for Data Mining" includes: 1) Appendix 1: SQL statements for data extraction, and Appendix 2: Interview for operating staff; 2) the dataset of the normalized data used to define the predictive model.

  12. Data from: Estimation of sediment discharge using a tree-based model

    • tandf.figshare.com
    xlsx
    Updated Aug 21, 2023
    Cite
    Eun-Kyung Jang; Un Ji; Woonkwang Yeo (2023). Estimation of sediment discharge using a tree-based model [Dataset]. http://doi.org/10.6084/m9.figshare.23501628.v2
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Aug 21, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Eun-Kyung Jang; Un Ji; Woonkwang Yeo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The model tree (MT) approach, a data mining technique used to analyse relationships between input and output variables in a disordered and large database, was adopted in this study to predict sediment discharge with field measurement data. The derived models were analysed for accuracy according to the goodness of fit based on training, testing, and modelling processes. When the flow velocity, depth, water surface slope, channel width, and median bed material were selected as the river’s system variables, the model results of sediment discharge resembled the measured values. The results demonstrate that developing and using the sediment discharge estimation with the MT constitutes the most effective method if long-term sediment data are of sufficient validity.
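
    A hedged sketch of the workflow only: scikit-learn ships no M5-style model tree, so a plain regression tree stands in, and the five listed predictors are filled with synthetic values rather than the field measurements:

    ```python
    # Regression-tree stand-in for the MT approach: fit on five predictors.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    # Columns: velocity, depth, surface slope, width, median bed material (illustrative).
    X = rng.random((500, 5))
    y = X @ np.array([2.0, 1.5, 3.0, 0.5, 1.0]) + rng.normal(0, 0.1, 500)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    mt = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_tr, y_tr)
    print("R^2 on test:", round(mt.score(X_te, y_te), 3))
    ```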

  13. Measures of model performance in the test phase.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Lúcia Adriana dos Santos Gruginskie; Guilherme Luís Roehe Vaccaro (2023). Measures of model performance in the test phase. [Dataset]. http://doi.org/10.1371/journal.pone.0198122.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Lúcia Adriana dos Santos Gruginskie; Guilherme Luís Roehe Vaccaro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Measures of model performance in the test phase.

  14. Data from: Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems

    • catalog.data.gov
    • datasets.ai
    • +1 more
    Updated Dec 6, 2023
    Cite
    Dashlink (2023). Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems [Dataset]. https://catalog.data.gov/dataset/local-l2-thresholding-based-data-mining-in-peer-to-peer-systems
    Explore at:
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    Dashlink
    Description

    In a large network of computers, wireless sensors, or mobile devices, each of the components (henceforth, peers) has some data about the global status of the system. Many of the functions of the system, such as routing decisions, search strategies, data cleansing, and the assignment of mutual trust, depend on the global status. Therefore, it is essential that the system be able to detect, and react to, changes in its global status. Computing global predicates in such systems is usually very costly, mainly because of their scale, and in some cases (e.g., sensor networks) also because of the high cost of communication. The cost further increases when the data changes rapidly (due to state changes, node failure, etc.) and computation has to follow these changes. In this paper we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm which detects when the L2 norm of the average data surpasses a threshold. Then, we use this algorithm as a feedback loop for the monitoring of complex predicates on the data, such as the data's k-means clustering. The efficiency of the L2 algorithm guarantees that as long as the clustering results represent the data (i.e., the data is stationary), few resources are required. When the data undergoes an epoch change (a change in the underlying distribution) and the model no longer represents it, the feedback loop indicates this and the model is rebuilt. Furthermore, the existence of a feedback loop allows using approximate and "best-effort" methods for constructing the model; if an ill-fitting model is built, the feedback loop indicates so, and the model is rebuilt.
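
    A minimal, centralised sketch of the feedback loop (the actual algorithm is local and distributed across peers; this mirrors only the control flow, not the communication scheme): the k-means model is rebuilt only when the average of the data shifts past a threshold in L2 norm.

    ```python
    # Centralised mock-up of the monitor-and-rebuild loop.
    import numpy as np
    from sklearn.cluster import KMeans

    THRESHOLD = 0.5          # L2 threshold on the shift of the data average
    model, reference_mean = None, None

    def on_new_window(window: np.ndarray) -> KMeans:
        """Rebuild the clustering only when the monitored predicate fires."""
        global model, reference_mean
        mean = window.mean(axis=0)
        if model is None or np.linalg.norm(mean - reference_mean) > THRESHOLD:
            model = KMeans(n_clusters=3, n_init=10).fit(window)  # rebuild
            reference_mean = mean
        return model

    rng = np.random.default_rng(0)
    on_new_window(rng.normal(0.0, 1, (200, 2)))  # initial model
    on_new_window(rng.normal(0.1, 1, (200, 2)))  # stationary: model kept
    on_new_window(rng.normal(3.0, 1, (200, 2)))  # epoch change: model rebuilt
    ```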

  15. Data from: Discovering System Health Anomalies using Data Mining Techniques

    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • gimi9.com
    • +4 more
    Updated Feb 18, 2025
    Cite
    data.staging.idas-ds1.appdat.jsc.nasa.gov (2025). Discovering System Health Anomalies using Data Mining Techniques [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/discovering-system-health-anomalies-using-data-mining-techniques
    Explore at:
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    We discuss a statistical framework that underlies envelope detection schemes as well as dynamical models based on Hidden Markov Models (HMM) that can encompass both discrete and continuous sensor measurements for use in Integrated System Health Management (ISHM) applications. The HMM allows for the rapid assimilation, analysis, and discovery of system anomalies. We motivate our work with a discussion of an aviation problem where the identification of anomalous sequences is essential for safety reasons. The data in this application are discrete and continuous sensor measurements and can be dealt with seamlessly using the methods described here to discover anomalous flights. We specifically treat the problem of discovering anomalous features in the time series that may be hidden from the sensor suite and compare those methods to standard envelope detection methods on test data designed to accentuate the differences between the two methods. Identification of these hidden anomalies is crucial to building stable, reusable, and cost-efficient systems. We also discuss a data mining framework for the analysis and discovery of anomalies in high-dimensional time series of sensor measurements that would be found in an ISHM system. We conclude with recommendations that describe the tradeoffs in building an integrated scalable platform for robust anomaly detection in ISHM applications.
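
    A hedged sketch of the HMM scoring idea using hmmlearn, one plausible toolkit (the description names no implementation): fit on nominal sensor measurements, then flag sequences whose per-sample log-likelihood is unusually low.

    ```python
    # Fit an HMM on nominal data, then score new sequences for anomalies.
    import numpy as np
    from hmmlearn import hmm

    rng = np.random.default_rng(0)
    nominal = rng.normal(0, 1, (500, 2))  # stand-in continuous sensor readings

    model = hmm.GaussianHMM(n_components=3, n_iter=50, random_state=0)
    model.fit(nominal)

    def anomaly_score(seq: np.ndarray) -> float:
        """Negative mean log-likelihood: higher means more anomalous."""
        return -model.score(seq) / len(seq)

    print(anomaly_score(rng.normal(0, 1, (100, 2))))  # low: nominal behaviour
    print(anomaly_score(rng.normal(4, 1, (100, 2))))  # high: anomalous sequence
    ```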

  16. Fuzzy Spatiotemporal Data Mining to Activity Recognition in Smart Homes

    • catalog.data.gov
    • data.nasa.gov
    Updated Dec 6, 2023
    + more versions
    Cite
    Dashlink (2023). Fuzzy Spatiotemporal Data Mining to Activity Recognition in Smart Homes [Dataset]. https://catalog.data.gov/dataset/fuzzy-spatiotemporal-data-mining-to-activity-recognition-in-smart-homes
    Explore at:
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    Dashlink
    Description

    A primary goal in designing smart homes is to provide automatic assistance for the residents, enabling them to live independently at home. Activity recognition is performed to achieve this goal, and to provide assistance we need three sorts of information: first, the goal of the resident; second, the pattern that the resident should follow to achieve that goal; and third, the deviations from the previously known patterns. In this paper, spatiotemporal aspects of daily activities are surveyed to mine the patterns of activities realized by the smart home's residents. The data necessary to model the spatiotemporal aspects of daily activities is provided by sensors embedded in the smart home. We believe that specific objects are used to accomplish daily activities, and that by analyzing the movement of objects and resident(s) we can obtain valuable information to model the daily activities of the smart home's residents.

  17. Data for: Identification of hindered internal rotational mode for complex...

    • data.mendeley.com
    Updated Nov 21, 2017
    + more versions
    Cite
    Lam Huynh (2017). Data for: Identification of hindered internal rotational mode for complex chemical species: A data mining approach with multivariate logistic regression model [Dataset]. http://doi.org/10.17632/snstf5rd5n.1
    Explore at:
    Dataset updated
    Nov 21, 2017
    Authors
    Lam Huynh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The "Dataset_HIR" folder contains the data to reproduce the results of the data mining approach proposed in the manuscript titled "Identification of hindered internal rotational mode for complex chemical species: A data mining approach with multivariate logistic regression model".

    More specifically, the folder contains the raw electronic structure calculation input data provided by the domain experts as well as the training and testing dataset with the extracted features.

    The "Dataset_HIR" folder contains the following subfolders namely:

    1. Electronic structure calculation input data: contains the electronic structure calculation input generated by the Gaussian program

      1.1. Training data: contains the raw data of all training species (each stored in a separate folder) used for extracting the dataset for the training and validation phases.

      1.2. Testing data: contains the raw data of all testing species (each is stored in a separate folder) used for extracting data for the testing phase.

    2. Dataset

      2.1. Training dataset: used to produce the results in Tables 3 and 4 in the manuscript.

      + datasetTrain_raw.csv: contains the features for all vibrational modes, associated with the corresponding labeled species, so that the chemists can easily select the Hindered Internal Rotor from the list for the training and validation steps.

      + datasetTrain.csv: refines datasetTrain_raw.csv by removing all species names, transforming the dataset into an appropriate form for the modeling and validation steps.
      

      2.2. Testing dataset: used to produce the results of the data mining approach in Table 5 in the manuscript.

      + datasetTest_raw.csv: contains the features for all vibrational modes of each labeled species to let the chemists select the Hindered Internal Rotor from the list for the testing step.
      
      + datasetTest.csv: refines datasetTest_raw.csv by removing all species names, transforming the dataset into an appropriate form for the testing step.
      

    Note on the Result feature in the dataset: 1 marks a mode that needs to be treated as a Hindered Internal Rotor, and 0 otherwise.
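
    A minimal sketch of the modeling step under the stated file layout: a multivariate logistic regression trained on datasetTrain.csv to predict the Result label. The scikit-learn estimator and the assumption that every non-Result column is a numeric feature are mine, not the manuscript's.

    ```python
    # Train on datasetTrain.csv, evaluate on datasetTest.csv (1 = hindered rotor).
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    train = pd.read_csv("datasetTrain.csv")
    test = pd.read_csv("datasetTest.csv")

    X_tr, y_tr = train.drop(columns=["Result"]), train["Result"]
    X_te, y_te = test.drop(columns=["Result"]), test["Result"]

    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("test accuracy:", round(clf.score(X_te, y_te), 3))
    ```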

  18. Process Discovery Contest @ BPM [1st Edition]

    • data.mendeley.com
    Updated Mar 13, 2017
    + more versions
    Cite
    KINGSLEY OKOYE (2017). Process Discovery Contest @ BPM [1st Edition] [Dataset]. http://doi.org/10.17632/dybhxv665z.2
    Explore at:
    Dataset updated
    Mar 13, 2017
    Authors
    KINGSLEY OKOYE
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The process discovery approach described in the submitted document is directed towards discovering process models from a training event log representing 10 different real-time business process executions, and cross-validating the derived model with a set of two test event logs provided for evaluation of the process discovery technique. Each of the test event logs ((test_log_april_1 to test_log_april_10) and (test_log_may_1 to test_log_may_10)) represents part of the model from the training log, with a total of 20 traces per log: 10 traces that can be replayed (allowed) and 10 traces that cannot be replayed (disallowed) by the model. The total number of traces for the test event logs (i.e., the April and May logs) is therefore ((10 logs x 20 traces) x 2) = 400 traces.

    Our aim is to carry out a classification task to determine the 400 individual traces that make up the two test event logs, and then provide a Petri net representation of the training model as well as a Business Process Model and Notation (BPMN) mapping that allows for testing and evaluation of the behaviours/traces recorded in the test logs. The objective of the proposed approach is to discover and provide process models that match the original process models in terms of balancing "overfitting" and "underfitting". A process model is overfitting (the event log) if it is too restrictive, disallowing behaviour which is part of the underlying process. On the other hand, it is underfitting (the reality) if it is not restrictive enough, allowing behaviour which is not part of the underlying process. Following this challenge, we aim to provide a model which is as good at balancing "overfitting" and "underfitting" as it is at correctly classifying the traces that can be replayed in the "test" event log. Thus:

    • Given a trace (t) representing real process behaviour, the process model (m) classifies it as allowed, or
    • Given a trace (t) representing behaviour not related to the process, the process model (m) classifies it as disallowed.

    The submitted document contains the classification attempts for the event logs provided and discusses the replaying semantics of the process modelling notation that has been employed. In other words, we discuss how, given any process trace t (from a test event log) and process model m (from the training log) in the discovered Petri net and BPMN replaying notation, it can be unambiguously determined whether or not trace t can be replayed on model m. We also provide a description of the tools used to discover the process models as well as to check the result of the classification task.
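
    A hedged sketch of such replay-based classification using pm4py (one possible toolkit; the submission used its own Petri net/BPMN tooling, and the filenames below are placeholders): discover a model from the training log, then call a test trace allowed iff token-based replay fits.

    ```python
    # Discover a Petri net from the training log, then classify test traces.
    import pm4py

    train_log = pm4py.read_xes("training_log.xes")      # hypothetical filenames
    test_log = pm4py.read_xes("test_log_april_1.xes")

    net, im, fm = pm4py.discover_petri_net_inductive(train_log)
    replay = pm4py.conformance_diagnostics_token_based_replay(test_log, net, im, fm)

    for i, result in enumerate(replay, start=1):
        print(f"trace {i}:", "allowed" if result["trace_is_fit"] else "disallowed")
    ```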

  19. Additional file 1 of Opening the black box: interpretable machine learning...

    • figshare.com
    txt
    Updated Feb 14, 2024
    + more versions
    Cite
    Yan Zhang; Xiaoxu Zhang; Jaina Razbek; Deyang Li; Wenjun Xia; Liangliang Bao; Hongkai Mao; Mayisha Daken; Mingqin Cao (2024). Additional file 1 of Opening the black box: interpretable machine learning for predictor finding of metabolic syndrome [Dataset]. http://doi.org/10.6084/m9.figshare.20677966.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 14, 2024
    Dataset provided by
    figshare
    Authors
    Yan Zhang; Xiaoxu Zhang; Jaina Razbek; Deyang Li; Wenjun Xia; Liangliang Bao; Hongkai Mao; Mayisha Daken; Mingqin Cao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 1: Supplementary file S1. The original dataset for this study.

  20. Data from: Multi-objective optimization based privacy preserving distributed...

    • catalog.data.gov
    • data.nasa.gov
    • +1 more
    Updated Dec 7, 2023
    + more versions
    Cite
    Dashlink (2023). Multi-objective optimization based privacy preserving distributed data mining in Peer-to-Peer networks [Dataset]. https://catalog.data.gov/dataset/multi-objective-optimization-based-privacy-preserving-distributed-data-mining-in-peer-to-p
    Explore at:
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    Dashlink
    Description

    This paper proposes a scalable, local privacy preserving algorithm for distributed Peer-to-Peer (P2P) data aggregation useful for many advanced data mining/analysis tasks such as average/sum computation, decision tree induction, feature selection, and more. Unlike most multi-party privacy-preserving data mining algorithms, this approach works in an asynchronous manner through local interactions and it is highly scalable. It particularly deals with the distributed computation of the sum of a set of numbers stored at different peers in a P2P network in the context of a P2P web mining application. The proposed optimization based privacy-preserving technique for computing the sum allows different peers to specify different privacy requirements without having to adhere to a global set of parameters for the chosen privacy model. Since distributed sum computation is a frequently used primitive, the proposed approach is likely to have significant impact on many data mining tasks such as multi-party privacy-preserving clustering, frequent itemset mining, and statistical aggregate computation.
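
    A toy sketch of the distributed-sum primitive this work builds on, using additive masking, a standard building block (the paper's optimization-based scheme with per-peer privacy parameters is richer than this): each peer splits its private value into random shares, so no single party ever sees another's raw input.

    ```python
    # Additive-masking secure sum: no peer reveals its raw private value.
    import random

    def share(value: float, n_peers: int) -> list[float]:
        """Split value into n_peers random additive shares."""
        shares = [random.uniform(-1e6, 1e6) for _ in range(n_peers - 1)]
        shares.append(value - sum(shares))
        return shares

    peer_values = [12.0, 7.5, 3.25]                  # each peer's private input
    n = len(peer_values)
    all_shares = [share(v, n) for v in peer_values]  # peer i sends one share to each peer j

    # Each peer publishes only the sum of the shares it received.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) for j in range(n)]
    print(sum(partial_sums))                         # equals sum(peer_values) = 22.75
    ```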
