100+ datasets found

Data from: Discovering System Health Anomalies using Data Mining Techniques
data.staging.idas-ds1.appdat.jsc.nasa.gov
gimi9.com
+4more
Updated Feb 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
data.staging.idas-ds1.appdat.jsc.nasa.gov (2025). Discovering System Health Anomalies using Data Mining Techniques [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/discovering-system-health-anomalies-using-data-mining-techniques
Explore at:
Dataset updated
Feb 18, 2025
Dataset provided by
NASAhttp://nasa.gov/
Description
We discuss a statistical framework that underlies envelope detection schemes as well as dynamical models based on Hidden Markov Models (HMM) that can encompass both discrete and continuous sensor measurements for use in Integrated System Health Management (ISHM) applications. The HMM allows for the rapid assimilation, analysis, and discovery of system anomalies. We motivate our work with a discussion of an aviation problem where the identification of anomalous sequences is essential for safety reasons. The data in this application are discrete and continuous sensor measurements and can be dealt with seamlessly using the methods described here to discover anomalous flights. We specifically treat the problem of discovering anomalous features in the time series that may be hidden from the sensor suite and compare those methods to standard envelope detection methods on test data designed to accentuate the differences between the two methods. Identification of these hidden anomalies is crucial to building stable, reusable, and cost-efficient systems. We also discuss a data mining framework for the analysis and discovery of anomalies in high-dimensional time series of sensor measurements that would be found in an ISHM system. We conclude with recommendations that describe the tradeoffs in building an integrated scalable platform for robust anomaly detection in ISHM applications.
d
Ensemble Data Mining Methods
catalog.data.gov
data.nasa.gov
+2more
Updated Dec 6, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dashlink (2023). Ensemble Data Mining Methods [Dataset]. https://catalog.data.gov/dataset/ensemble-data-mining-methods
Explore at:
Dataset updated
Dec 6, 2023
Dataset provided by
Dashlink
Description
Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods that leverage the power of multiple models to achieve better prediction accuracy than any of the individual models could on their own. The basic goal when designing an ensemble is the same as when establishing a committee of people: each member of the committee should be as competent as possible, but the members should be complementary to one another. If the members are not complementary, i.e., if they always agree, then the committee is unnecessary---any one member is sufficient. If the members are complementary, then when one or a few members make an error, the probability is high that the remaining members can correct this error. Research in ensemble methods has largely revolved around designing ensembles consisting of competent yet complementary models.
Data from: Peer-to-Peer Data Mining, Privacy Issues, and Games
data.staging.idas-ds1.appdat.jsc.nasa.gov
data.nasa.gov
+2more
Updated Feb 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
nasa.gov (2025). Peer-to-Peer Data Mining, Privacy Issues, and Games [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/peer-to-peer-data-mining-privacy-issues-and-games
Explore at:
Dataset updated
Feb 18, 2025
Dataset provided by
NASAhttp://nasa.gov/
Description
Peer-to-Peer (P2P) networks are gaining increasing popularity in many distributed applications such as file-sharing, network storage, web caching, sear- ching and indexing of relevant documents and P2P network-threat analysis. Many of these applications require scalable analysis of data over a P2P network. This paper starts by offering a brief overview of distributed data mining applications and algorithms for P2P environments. Next it discusses some of the privacy concerns with P2P data mining and points out the problems of existing privacy-preserving multi-party data mining techniques. It further points out that most of the nice assumptions of these existing privacy preserving techniques fall apart in real-life applications of privacy-preserving distributed data mining (PPDM). The paper offers a more realistic formulation of the PPDM problem as a multi-party game and points out some recent results.
f
Table 2 - Modeling and comparing data mining algorithms for prediction of...
figshare.com
xls
Updated Jun 5, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alireza Mosayebi; Barat Mojaradi; Ali Bonyadi Naeini; Seyed Hamid Khodadad Hosseini (2023). Table 2 - Modeling and comparing data mining algorithms for prediction of recurrence of breast cancer [Dataset]. http://doi.org/10.1371/journal.pone.0237658.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0237658.t002
Dataset updated
Jun 5, 2023
Dataset provided by
PLOS ONE
Authors
Alireza Mosayebi; Barat Mojaradi; Ali Bonyadi Naeini; Seyed Hamid Khodadad Hosseini
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Table 2 - Modeling and comparing data mining algorithms for prediction of recurrence of breast cancer
Measure List.
plos.figshare.com
xls
Updated Jun 10, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fernando Benites; Svenja Simon; Elena Sapozhnikova (2023). Measure List. [Dataset]. http://doi.org/10.1371/journal.pone.0084475.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0084475.t001
Dataset updated
Jun 10, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Fernando Benites; Svenja Simon; Elena Sapozhnikova
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Measure List.
m
Educational Attainment in North Carolina Public Schools: Use of statistical...
data.mendeley.com
Updated Nov 14, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
Explore at:
Unique identifier
https://doi.org/10.17632/6cm9wyd5g5.1
Dataset updated
Nov 14, 2018
Authors
Scott Herford
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
North Carolina
Description
The purpose of data mining analysis is always to find patterns of the data using certain kind of techiques such as classification or regression. It is not always feasible to apply classification algorithms directly to dataset. Before doing any work on the data, the data has to be pre-processed and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance has not improved much. The reason why it has not improved could be the features we selected to perform clustering are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics. From the dimensionality reduction perspective: It is different from Principle Component Analysis which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique of reducing the data dimension will lose a lot of information since clustering techniques are based a metric of 'distance'. At high dimensions euclidean distance loses pretty much all meaning. Therefore using clustering as a "Reducing" dimensionality by mapping data points to cluster numbers is not always good since you may lose almost all the information. From the creating new features perspective: Clustering analysis creates labels based on the patterns of the data, it brings uncertainties into the data. By using clustering prior to classification, the decision on the number of clusters will highly affect the performance of the clustering, then affect the performance of classification. If the part of features we use clustering techniques on is very suited for it, it might increase the overall performance on classification. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not lock in the clustering outputs using a random_state in the effort to see if they were stable. Our assumption was that if the results vary highly from run to run which they definitely did, maybe the data just does not cluster well with the methods selected at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering to the data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model real world effectiveness and also to continue to revise the models from time to time as things change.
d
Data mining approaches to quantifying the formation of secondary organic...
datasets.ai
catalog.data.gov
53
Updated Sep 18, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Environmental Protection Agency (2024). Data mining approaches to quantifying the formation of secondary organic aerosol [Dataset]. https://datasets.ai/datasets/data-mining-approaches-to-quantifying-the-formation-of-secondary-organic-aerosol
Explore at:
53Available download formats
Dataset updated
Sep 18, 2024
Dataset authored and provided by
U.S. Environmental Protection Agency
Description
This research used data mining approaches to better understand factors affecting the formation of secondary organic aerosol (SOA). Although numerous laboratory and computational studies have been completed on SOA formation, it is still challenging to determine factors that most influence SOA formation. Experimental data were based on previous work described by Offenberg et al. (2017), where volume concentrations of SOA were measured in 139 laboratory experiments involving the oxidation of single hydrocarbons under different operating conditions. Three different data mining methods were used, including nearest neighbor, decision tree, and pattern mining. Both decision tree and pattern mining approaches identified similar chemical and experimental conditions that were important to SOA formation. Among these important factors included the number of methyl groups, the number of rings and the presence of dinitrogen pentoxide (N2O5).

This dataset is associated with the following publication: Olson, D., J. Offenberg, M. Lewandowski, T. Kleindienst, K. Docherty, M. Jaoui, J.D. Krug, and T. Riedel. Data mining approaches to understanding the formation of secondary organic aerosol. ATMOSPHERIC ENVIRONMENT. Elsevier Science Ltd, New York, NY, USA, 252: 118345, (2021).
f
Variables associating with the length of hospital stay.
figshare.com
xls
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hyunyoung Baek; Minsu Cho; Seok Kim; Hee Hwang; Minseok Song; Sooyoung Yoo (2023). Variables associating with the length of hospital stay. [Dataset]. http://doi.org/10.1371/journal.pone.0195901.t007
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0195901.t007
Dataset updated
May 30, 2023
Dataset provided by
PLOS ONE
Authors
Hyunyoung Baek; Minsu Cho; Seok Kim; Hee Hwang; Minseok Song; Sooyoung Yoo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Variables associating with the length of hospital stay.
s
Data from: Social Media Data Mining Becomes Ordinary
orda.shef.ac.uk
docx
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Helen Kennedy (2023). Social Media Data Mining Becomes Ordinary [Dataset]. http://doi.org/10.15131/shef.data.5195032.v1
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.15131/shef.data.5195032.v1
Dataset updated
May 30, 2023
Dataset provided by
The University of Sheffield
Authors
Helen Kennedy
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This research explored what happens when social media data mining becomes ordinary and is carried out by organisations that might be seen as the pillars of everyday life. The interviews on which the transcripts are based are discussed in Chapter 6 of the book. The referenced book contains a description of the methods. No other publications resulted from working with these transcripts.
Discovering Anomalous Aviation Safety Events Using Scalable Data Mining...
data.nasa.gov
datadiscoverystudio.org
+6more
application/rdfxml +5
Updated Jun 26, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2018). Discovering Anomalous Aviation Safety Events Using Scalable Data Mining Algorithms [Dataset]. https://data.nasa.gov/dataset/Discovering-Anomalous-Aviation-Safety-Events-Using/2bmi-rg8d
Explore at:
tsv, csv, json, application/rdfxml, application/rssxml, xmlAvailable download formats
Dataset updated
Jun 26, 2018
License
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Description
The worldwide civilian aviation system is one of the most complex dynamical systems created. Most modern commercial aircraft have onboard flight data recorders that record several hundred discrete and continuous parameters at approximately 1Hz for the entire duration of the flight. These data contain information about the flight control systems, actuators, engines, landing gear, avionics, and pilot commands. In this paper, recent advances in the development of a novel knowledge discovery process consisting of a suite of data mining techniques for identifying precursors to aviation safety incidents are discussed. The data mining techniques include scalable multiple-kernel learning for large-scale distributed anomaly detection. A novel multivariate time-series search algorithm is used to search for signatures of discovered anomalies on massive datasets. The process can identify operationally significant events due to environmental, mechanical, and human factors issues in the high-dimensional flight operations quality assurance data. All discovered anomalies are validated by a team of independent domain experts. This novel automated knowledge discovery process is aimed at complementing the state-of-the-art human-generated exceedance-based analysis that fails to discover previously unknown aviation safety incidents. In this paper, the discovery pipeline, the methods used, and some of the significant anomalies detected on real-world commercial aviation data are discussed.
Multi-objective optimization based privacy preserving distributed data...
data.staging.idas-ds1.appdat.jsc.nasa.gov
Updated Feb 19, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
data.staging.idas-ds1.appdat.jsc.nasa.gov (2025). Multi-objective optimization based privacy preserving distributed data mining in Peer-to-Peer networks - Dataset - NASA Open Data Portal [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/multi-objective-optimization-based-privacy-preserving-distributed-data-mining-in-peer-to-p
Explore at:
Dataset updated
Feb 19, 2025
Dataset provided by
NASAhttp://nasa.gov/
Description
This paper proposes a scalable, local privacy preserving algorithm for distributed Peer-to-Peer (P2P) data aggregation useful for many advanced data mining/analysis tasks such as average/sum computation, decision tree induction, feature selection, and more. Unlike most multi-party privacy-preserving data mining algorithms, this approach works in an asynchronous manner through local interactions and it is highly scalable. It particularly deals with the distributed computation of the sum of a set of numbers stored at different peers in a P2P network in the context of a P2P web mining application. The proposed optimization based privacy-preserving technique for computing the sum allows different peers to specify different privacy requirements without having to adhere to a global set of parameters for the chosen privacy model. Since distributed sum computation is a frequently used primitive, the proposed approach is likely to have significant impact on many data mining tasks such as multi-party privacy-preserving clustering, frequent itemset mining, and statistical aggregate computation.
u
Hidden Room Educational Data Mining Analysis
produccioncientifica.uca.es
figshare.com
Updated 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Palomo-Duarte, Manuel; Berns, Anke; Palomo-Duarte, Manuel; Berns, Anke (2016). Hidden Room Educational Data Mining Analysis [Dataset]. https://produccioncientifica.uca.es/documentos/668fc475b9e7c03b01bde1d4
Explore at:
Dataset updated
2016
Authors
Palomo-Duarte, Manuel; Berns, Anke; Palomo-Duarte, Manuel; Berns, Anke
Description
Histograms and results of k-means and Ward's clustering for Hidden Room game

The fileset contains information from three sources:

1. Histograms files:
* Lexical_histogram.png (histogram of lexical error ratios)
* Grammatical_histogram.png (histogram of grammatical error ratios)

2. K-means clustering files:
* elbow-lex kmeans.png (clustering by lexical aspects: error curves obtained for applying elbow method to determinate the optimal number of clusters)
* cube-lex kmeans.png (clustering by lexical aspects: a three-dimensional representation of clusters obtained after applying k-means method)
* Lexical_clusters (table) kmeans.xls (clustering by lexical aspects: centroids, standard deviations and number of instances assigned to each cluster)
* elbow-gram kmeans.png (clustering by grammatical aspects: error curves obtained for applying elbow method to determinate the optimal number of clusters)
* cube-gramm kmeans.png (clustering by grammatical aspects: a three-dimensional representation of clusters obtained after applying k-means method)
* Grammatical_clusters (table) kmeans.xls (clustering by grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
* elbow-lexgram kmeans.png (clustering by lexical and grammatical aspects: error curves obtained for applying elbow method to determinate the optimal number of clusters)
* Lexical_Grammatical_clusters (table) kmeans.xls (clustering by lexical and grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
* Grammatical_clusters_number_of_words (table) kmeans.xls : number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to grammatical error ratios.
* Lexical_clusters_number_of_words (table) kmeans.xls : number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to lexical error ratios.
* Lexical_Grammatical_clusters_number_of_words (table) kmeans.xls : number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to lexical and grammatical error ratios.

3. Ward’s Agglomerative Hierarchical Clustering files:
* Lexical_Cluster_Dendrogram_ward.png (clustering by lexical aspects: dendrogram obtained after applying Ward's clustering method).
* Grammatical_Cluster_Dendrogram_ward.png (clustering by grammatical aspects: dendrogram obtained after applying Ward's clustering method)
* Lexical_Grammatical_Cluster_Dendrogram_ward.png (clustering by lexical and grammatical aspects: dendrogram obtained after applying Ward's clustering method)
* Lexical_Grammatical_clusters (table) ward.xls: Centroids (from column 2 to 7) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
* Grammatical_clusters (table) ward.xls: Centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
* Lexical_clusters (table) ward.xls: Centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
* Lexical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
* Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
* Lexical_Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
Data Mining for IVHM using Sparse Binary Ensembles, Phase I
data.nasa.gov
application/rdfxml +5
Updated Jun 26, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2018). Data Mining for IVHM using Sparse Binary Ensembles, Phase I [Dataset]. https://data.nasa.gov/dataset/Data-Mining-for-IVHM-using-Sparse-Binary-Ensembles/qfus-evzq
Explore at:
xml, tsv, csv, application/rssxml, application/rdfxml, jsonAvailable download formats
Dataset updated
Jun 26, 2018
License
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Description
In response to NASA SBIR topic A1.05, "Data Mining for Integrated Vehicle Health Management", Michigan Aerospace Corporation (MAC) asserts that our unique SPADE (Sparse Processing Applied to Data Exploitation) technology meets a significant fraction of the stated criteria and has functionality that enables it to handle many applications within the aircraft lifecycle. SPADE distills input data into highly quantized features and uses MAC's novel techniques for constructing Ensembles of Decision Trees to develop extremely accurate diagnostic/prognostic models for classification, regression, clustering, anomaly detection and semi-supervised learning tasks. These techniques are currently being employed to do Threat Assessment for satellites in conjunction with researchers at the Air Force Research Lab. Significant advantages to this approach include: 1) completely data driven; 2) training and evaluation are faster than conventional methods; 3) operates effectively on huge datasets (> billion samples X > million features), 4) proven to be as accurate as state-of-the-art techniques in many significant real-world applications. The specific goals for Phase 1 will be to work with domain experts at NASA and with our partners Boeing, SpaceX and GMV Space Systems to delineate a subset of problems that are particularly well-suited to this approach and to determine requirements for deploying algorithms on platforms of opportunity.
d
Data from: Multi-objective optimization based privacy preserving distributed...
catalog.data.gov
data.nasa.gov
+1more
Updated Dec 7, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dashlink (2023). Multi-objective optimization based privacy preserving distributed data mining in Peer-to-Peer networks [Dataset]. https://catalog.data.gov/dataset/multi-objective-optimization-based-privacy-preserving-distributed-data-mining-in-peer-to-p
Explore at:
Dataset updated
Dec 7, 2023
Dataset provided by
Dashlink
Description
This paper proposes a scalable, local privacy preserving algorithm for distributed Peer-to-Peer (P2P) data aggregation useful for many advanced data mining/analysis tasks such as average/sum computation, decision tree induction, feature selection, and more. Unlike most multi-party privacy-preserving data mining algorithms, this approach works in an asynchronous manner through local interactions and it is highly scalable. It particularly deals with the distributed computation of the sum of a set of numbers stored at different peers in a P2P network in the context of a P2P web mining application. The proposed optimization based privacy-preserving technique for computing the sum allows different peers to specify different privacy requirements without having to adhere to a global set of parameters for the chosen privacy model. Since distributed sum computation is a frequently used primitive, the proposed approach is likely to have significant impact on many data mining tasks such as multi-party privacy-preserving clustering, frequent itemset mining, and statistical aggregate computation.
n
Data for: Identification of hindered internal rotational mode for complex...
narcis.nl
data.mendeley.com
Updated Nov 8, 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Le, T (via Mendeley Data) (2017). Data for: Identification of hindered internal rotational mode for complex chemical species: A data mining approach with multivariate logistic regression model [Dataset]. http://doi.org/10.17632/d37mzs3b3m.2
Explore at:
Unique identifier
https://doi.org/10.17632/d37mzs3b3m.2
Dataset updated
Nov 8, 2017
Dataset provided by
Data Archiving and Networked Services (DANS)
Authors
Le, T (via Mendeley Data)
Description
The "Dataset_HIR" folder contains the data to reproduce the results of the data mining approach proposed in the manuscript titled "Identification of hindered internal rotational mode for complex chemical species: A data mining approach with multivariate logistic regression model".

More specifically, the folder contains the raw electronic structure calculation input data provided by the domain experts as well as the training and testing dataset with the extracted features.

The "Dataset_HIR" folder contains the following subfolders namely:

Electronic structure calculation input data: contains the electronic structure calculation input generated by the Gaussian program

1.1. Testing data: contains the raw data of all training species (each is stored in a separate folder) used for extracting dataset for training and validation phase.

1.2. Testing data: contains the raw data of all testing species (each is stored in a separate folder) used for extracting data for the testing phase.

Dataset 2.1. Training dataset: used to produce the results in Tables 3 and 4 in the manuscript

+ datasetTrain_raw.csv: contains the features for all vibrational modes associated with corresponding labeled species to let the chemists select the Hindered Internal Rotor from the list easily for the training and validation steps. + datasetTrain.csv: refines the datasetTrain_raw.csv where the names of the species are all removed to transform the dataset into an appropriate form for the modeling and validation steps.

2.2. Testing dataset: used to produce the results of the data mining approach in Table 5 in the manuscript.

+ datasetTest_raw.csv: contains the features for all vibrational modes of each labeled species to let the chemists select the Hindered Internal Rotor from the list for the testing step. + datasetTest.csv: refines the datasetTest_raw.csv where the names of the species are all removed to transform the dataset into an appropriate form for the testing step.

Note for the Result feature in the dataset: 1 is for the mode needed to be treated as Hindered Internal Rotor, and 0 otherwise.
d
Data from: Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems
catalog.data.gov
datasets.ai
+1more
Updated Dec 6, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dashlink (2023). Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems [Dataset]. https://catalog.data.gov/dataset/local-l2-thresholding-based-data-mining-in-peer-to-peer-systems
Explore at:
Dataset updated
Dec 6, 2023
Dataset provided by
Dashlink
Description
In a large network of computers, wireless sensors, or mobile devices, each of the components (hence, peers) has some data about the global status of the system. Many of the functions of the system, such as routing decisions, search strategies, data cleansing, and the assignment of mutual trust, depend on the global status. Therefore, it is essential that the system be able to detect, and react to, changes in its global status. Computing global predicates in such systems is usually very costly. Mainly because of their scale, and in some cases (e.g., sensor networks) also because of the high cost of communication. The cost further increases when the data changes rapidly (due to state changes, node failure, etc.) and computation has to follow these changes. In this paper we describe a two step approach for dealing with these costs. First, we describe a highly efficient local algorithm which detect when the L2 norm of the average data surpasses a threshold. Then, we use this algorithm as a feedback loop for the monitoring of complex predicates on the data – such as the data’s k-means clustering. The efficiency of the L2 algorithm guarantees that so long as the clustering results represent the data (i.e., the data is stationary) few resources are required. When the data undergoes an epoch change – a change in the underlying distribution – and the model no longer represents it, the feedback loop indicates this and the model is rebuilt. Furthermore, the existence of a feedback loop allows using approximate and “best-effort ” methods for constructing the model; if an ill-fit model is built the feedback loop would indicate so, and the model would be rebuilt.
Leveraging data mining, active learning, and domain adaptation for efficient...
data.niaid.nih.gov
zip
Updated Mar 5, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rui Ding; Jianguo Liu; Kang Hua; Xuebin Wang; Xiaoben Zhang; Minhua Shao; Yuxin Chen; Junhong Chen (2025). Leveraging data mining, active learning, and domain adaptation for efficient discovery of advanced oxygen evolution electrocatalysts [Dataset]. http://doi.org/10.5061/dryad.nk98sf83g
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.nk98sf83g
Dataset updated
Mar 5, 2025
Dataset provided by
Hong Kong University of Science and Technology
University of Chicago
North China Electric Power University
Nanjing University
Authors
Rui Ding; Jianguo Liu; Kang Hua; Xuebin Wang; Xiaoben Zhang; Minhua Shao; Yuxin Chen; Junhong Chen
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Developing advanced catalysts for acidic oxygen evolution reaction (OER) is crucial for sustainable hydrogen production. This study presents a multi-stage machine learning (ML) approach to streamline the discovery and optimization of complex multi-metallic catalysts. Our method integrates data mining, active learning, and domain adaptation throughout the materials discovery process. Unlike traditional trial-and-error methods, this approach systematically narrows the exploration space using domain knowledge with minimized reliance on subjective intuition. Then the active learning module efficiently refines element composition and synthesis conditions through iterative experimental feedback. The process culminated in the discovery of a promising Ru-Mn-Ca-Pr oxide catalyst. Our workflow also enhances theoretical simulations with domain adaptation strategy, providing deeper mechanistic insights aligned with experimental findings. By leveraging diverse data sources and multiple ML strategies, we demonstrate an efficient pathway for electrocatalyst discovery and optimization. This comprehensive, data-driven approach represents a paradigm shift and potentially a benchmark in electrocatalysts research.
l
LScDC Word-Category RIG Matrix
figshare.le.ac.uk
pdf
Updated Apr 28, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
LScDC Word-Category RIG Matrix [Dataset]. https://figshare.le.ac.uk/articles/dataset/LScDC_Word-Category_RIG_Matrix/12133431
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.25392/leicester.data.12133431.v2
Dataset updated
Apr 28, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
LScDC Word-Category RIG MatrixApril 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)Supervised by Prof Alexander Gorban and Dr Evgeny MirkesGetting StartedThis file describes the Word-Category RIG Matrix for theLeicester Scientific Corpus (LSC) [1], the procedure to build the matrix and introduces the Leicester Scientific Thesaurus (LScT) with the construction process. The Word-Category RIG Matrix is a 103,998 by 252 matrix, where rows correspond to words of Leicester Scientific Dictionary-Core (LScDC) [2] and columns correspond to 252 Web of Science (WoS) categories [3, 4, 5]. Each entry in the matrix corresponds to a pair (category,word). Its value for the pair shows the Relative Information Gain (RIG) on the belonging of a text from the LSC to the category from observing the word in this text. The CSV file of Word-Category RIG Matrix in the published archive is presented with two additional columns of the sum of RIGs in categories and the maximum of RIGs over categories (last two columns of the matrix). So, the file ‘Word-Category RIG Matrix.csv’ contains a total of 254 columns.This matrix is created to be used in future research on quantifying of meaning in scientific texts under the assumption that words have scientifically specific meanings in subject categories and the meaning can be estimated by information gains from word to categories. LScT (Leicester Scientific Thesaurus) is a scientific thesaurus of English. The thesaurus includes a list of 5,000 words from the LScDC. We consider ordering the words of LScDC by the sum of their RIGs in categories. That is, words are arranged in their informativeness in the scientific corpus LSC. Therefore, meaningfulness of words evaluated by words’ average informativeness in the categories. We have decided to include the most informative 5,000 words in the scientific thesaurus. Words as a Vector of Frequencies in WoS CategoriesEach word of the LScDC is represented as a vector of frequencies in WoS categories. Given the collection of the LSC texts, each entry of the vector consists of the number of texts containing the word in the corresponding category.It is noteworthy that texts in a corpus do not necessarily belong to a single category, as they are likely to correspond to multidisciplinary studies, specifically in a corpus of scientific texts. In other words, categories may not be exclusive. There are 252 WoS categories and a text can be assigned to at least 1 and at most 6 categories in the LSC. Using the binary calculation of frequencies, we introduce the presence of a word in a category. We create a vector of frequencies for each word, where dimensions are categories in the corpus.The collection of vectors, with all words and categories in the entire corpus, can be shown in a table, where each entry corresponds to a pair (word,category). This table is build for the LScDC with 252 WoS categories and presented in published archive with this file. The value of each entry in the table shows how many times a word of LScDC appears in a WoS category. The occurrence of a word in a category is determined by counting the number of the LSC texts containing the word in a category. Words as a Vector of Relative Information Gains Extracted for CategoriesIn this section, we introduce our approach to representation of a word as a vector of relative information gains for categories under the assumption that meaning of a word can be quantified by their information gained for categories.For each category, a function is defined on texts that takes the value 1, if the text belongs to the category, and 0 otherwise. For each word, a function is defined on texts that takes the value 1 if the word belongs to the text, and 0 otherwise. Consider LSC as a probabilistic sample space (the space of equally probable elementary outcomes). For the Boolean random variables, the joint probability distribution, the entropy and information gains are defined.The information gain about the category from the word is the amount of information on the belonging of a text from the LSC to the category from observing the word in the text [6]. We used the Relative Information Gain (RIG) providing a normalised measure of the Information Gain. This provides the ability of comparing information gains for different categories. The calculations of entropy, Information Gains and Relative Information Gains can be found in the README file in the archive published. Given a word, we created a vector where each component of the vector corresponds to a category. Therefore, each word is represented as a vector of relative information gains. It is obvious that the dimension of vector for each word is the number of categories. The set of vectors is used to form the Word-Category RIG Matrix, in which each column corresponds to a category, each row corresponds to a word and each component is the relative information gain from the word to the category. In Word-Category RIG Matrix, a row vector represents the corresponding word as a vector of RIGs in categories. We note that in the matrix, a column vector represents RIGs of all words in an individual category. If we choose an arbitrary category, words can be ordered by their RIGs from the most informative to the least informative for the category. As well as ordering words in each category, words can be ordered by two criteria: sum and maximum of RIGs in categories. The top n words in this list can be considered as the most informative words in the scientific texts. For a given word, the sum and maximum of RIGs are calculated from the Word-Category RIG Matrix.RIGs for each word of LScDC in 252 categories are calculated and vectors of words are formed. We then form the Word-Category RIG Matrix for the LSC. For each word, the sum (S) and maximum (M) of RIGs in categories are calculated and added at the end of the matrix (last two columns of the matrix). The Word-Category RIG Matrix for the LScDC with 252 categories, the sum of RIGs in categories and the maximum of RIGs over categories can be found in the database.Leicester Scientific Thesaurus (LScT)Leicester Scientific Thesaurus (LScT) is a list of 5,000 words form the LScDC [2]. Words of LScDC are sorted in descending order by the sum (S) of RIGs in categories and the top 5,000 words are selected to be included in the LScT. We consider these 5,000 words as the most meaningful words in the scientific corpus. In other words, meaningfulness of words evaluated by words’ average informativeness in the categories and the list of these words are considered as a ‘thesaurus’ for science. The LScT with value of sum can be found as CSV file with the published archive. Published archive contains following files:1) Word_Category_RIG_Matrix.csv: A 103,998 by 254 matrix where columns are 252 WoS categories, the sum (S) and the maximum (M) of RIGs in categories (last two columns of the matrix), and rows are words of LScDC. Each entry in the first 252 columns is RIG from the word to the category. Words are ordered as in the LScDC.2) Word_Category_Frequency_Matrix.csv: A 103,998 by 252 matrix where columns are 252 WoS categories and rows are words of LScDC. Each entry of the matrix is the number of texts containing the word in the corresponding category. Words are ordered as in the LScDC.3) LScT.csv: List of words of LScT with sum (S) values. 4) Text_No_in_Cat.csv: The number of texts in categories. 5) Categories_in_Documents.csv: List of WoS categories for each document of the LSC.6) README.txt: Description of Word-Category RIG Matrix, Word-Category Frequency Matrix and LScT and forming procedures.7) README.pdf (same as 6 in PDF format)References[1] Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2[2] Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v3[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/[4] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html [5] Suzen, N., Mirkes, E. M., & Gorban, A. N. (2019). LScDC-new large scientific dictionary. arXiv preprint arXiv:1912.06858. [6] Shannon, C. E. (1948). A mathematical theory of communication. Bell system technical journal, 27(3), 379-423.
f
Hierarchical Measures Example.
plos.figshare.com
xls
Updated May 30, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fernando Benites; Svenja Simon; Elena Sapozhnikova (2023). Hierarchical Measures Example. [Dataset]. http://doi.org/10.1371/journal.pone.0084475.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0084475.t002
Dataset updated
May 30, 2023
Dataset provided by
PLOS ONE
Authors
Fernando Benites; Svenja Simon; Elena Sapozhnikova
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Hierarchical Measures Example.
d
Replication Data for: \"Unraveling spatial, structural, and social...
search.dataone.org
dataverse.harvard.edu
Updated Nov 9, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
PÁJARO, Agustin; DURAN, Ignacio J.; RODRIGO, Pablo (2023). Replication Data for: \"Unraveling spatial, structural, and social country-level conditions for the emergence of the foreign fighter phenomenon: an exploratory data mining approach to the case of ISIS\" [Dataset]. https://search.dataone.org/view/sha256%3A14ddbd4f4acade3b624c9d518cddfeb2bb845be0043a741211c5afe242ad0bd4
Explore at:
Dataset updated
Nov 9, 2023
Dataset provided by
Harvard Dataverse
Authors
PÁJARO, Agustin; DURAN, Ignacio J.; RODRIGO, Pablo
Description
Data from the article "Unraveling spatial, structural, and social country-level conditions for the emergence of the foreign fighter phenomenon: an exploratory data mining approach to the case of ISIS", by Agustin Pájaro, Ignacio J. Duran and Pablo Rodrigo, published in Revista DADOS, v. 65, n. 3, 2022.

Facebook

Twitter

Click to copy link

Link copied

Cite

data.staging.idas-ds1.appdat.jsc.nasa.gov (2025). Discovering System Health Anomalies using Data Mining Techniques [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/discovering-system-health-anomalies-using-data-mining-techniques

Data from: Discovering System Health Anomalies using Data Mining Techniques

Explore at:

Dataset updated

Feb 18, 2025

Dataset provided by

NASAhttp://nasa.gov/

Description

We discuss a statistical framework that underlies envelope detection schemes as well as dynamical models based on Hidden Markov Models (HMM) that can encompass both discrete and continuous sensor measurements for use in Integrated System Health Management (ISHM) applications. The HMM allows for the rapid assimilation, analysis, and discovery of system anomalies. We motivate our work with a discussion of an aviation problem where the identification of anomalous sequences is essential for safety reasons. The data in this application are discrete and continuous sensor measurements and can be dealt with seamlessly using the methods described here to discover anomalous flights. We specifically treat the problem of discovering anomalous features in the time series that may be hidden from the sensor suite and compare those methods to standard envelope detection methods on test data designed to accentuate the differences between the two methods. Identification of these hidden anomalies is crucial to building stable, reusable, and cost-efficient systems. We also discuss a data mining framework for the analysis and discovery of anomalies in high-dimensional time series of sensor measurements that would be found in an ISHM system. We conclude with recommendations that describe the tradeoffs in building an integrated scalable platform for robust anomaly detection in ISHM applications.

Clear search

Close search

Google apps

Main menu

Data from: Discovering System Health Anomalies using Data Mining Techniques

Ensemble Data Mining Methods

Data from: Peer-to-Peer Data Mining, Privacy Issues, and Games

Table 2 - Modeling and comparing data mining algorithms for prediction of...

Measure List.

Educational Attainment in North Carolina Public Schools: Use of statistical...

Data mining approaches to quantifying the formation of secondary organic...

Variables associating with the length of hospital stay.

Data from: Social Media Data Mining Becomes Ordinary

Discovering Anomalous Aviation Safety Events Using Scalable Data Mining...

Multi-objective optimization based privacy preserving distributed data...

Hidden Room Educational Data Mining Analysis

Data Mining for IVHM using Sparse Binary Ensembles, Phase I

Data from: Multi-objective optimization based privacy preserving distributed...

Data for: Identification of hindered internal rotational mode for complex...

Data from: Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems

Leveraging data mining, active learning, and domain adaptation for efficient...

LScDC Word-Category RIG Matrix

Hierarchical Measures Example.

Replication Data for: \"Unraveling spatial, structural, and social...

Data from: Discovering System Health Anomalies using Data Mining Techniques