100+ datasets found
  1. d

    Ensemble Data Mining Methods

    • catalog.data.gov
    • data.nasa.gov
    • +2more
    Updated Dec 6, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dashlink (2023). Ensemble Data Mining Methods [Dataset]. https://catalog.data.gov/dataset/ensemble-data-mining-methods
    Explore at:
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    Dashlink
    Description

    Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods that leverage the power of multiple models to achieve better prediction accuracy than any of the individual models could on their own. The basic goal when designing an ensemble is the same as when establishing a committee of people: each member of the committee should be as competent as possible, but the members should be complementary to one another. If the members are not complementary, i.e., if they always agree, then the committee is unnecessary---any one member is sufficient. If the members are complementary, then when one or a few members make an error, the probability is high that the remaining members can correct this error. Research in ensemble methods has largely revolved around designing ensembles consisting of competent yet complementary models.

  2. Data from: Discovering System Health Anomalies using Data Mining Techniques

    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • gimi9.com
    • +4more
    Updated Feb 18, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    data.staging.idas-ds1.appdat.jsc.nasa.gov (2025). Discovering System Health Anomalies using Data Mining Techniques [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/discovering-system-health-anomalies-using-data-mining-techniques
    Explore at:
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    NASAhttp://nasa.gov/
    Description

    We discuss a statistical framework that underlies envelope detection schemes as well as dynamical models based on Hidden Markov Models (HMM) that can encompass both discrete and continuous sensor measurements for use in Integrated System Health Management (ISHM) applications. The HMM allows for the rapid assimilation, analysis, and discovery of system anomalies. We motivate our work with a discussion of an aviation problem where the identification of anomalous sequences is essential for safety reasons. The data in this application are discrete and continuous sensor measurements and can be dealt with seamlessly using the methods described here to discover anomalous flights. We specifically treat the problem of discovering anomalous features in the time series that may be hidden from the sensor suite and compare those methods to standard envelope detection methods on test data designed to accentuate the differences between the two methods. Identification of these hidden anomalies is crucial to building stable, reusable, and cost-efficient systems. We also discuss a data mining framework for the analysis and discovery of anomalies in high-dimensional time series of sensor measurements that would be found in an ISHM system. We conclude with recommendations that describe the tradeoffs in building an integrated scalable platform for robust anomaly detection in ISHM applications.

  3. d

    Data from: Peer-to-Peer Data Mining, Privacy Issues, and Games

    • catalog.data.gov
    • data.nasa.gov
    • +2more
    Updated Dec 6, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dashlink (2023). Peer-to-Peer Data Mining, Privacy Issues, and Games [Dataset]. https://catalog.data.gov/dataset/peer-to-peer-data-mining-privacy-issues-and-games
    Explore at:
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    Dashlink
    Description

    Peer-to-Peer (P2P) networks are gaining increasing popularity in many distributed applications such as file-sharing, network storage, web caching, sear- ching and indexing of relevant documents and P2P network-threat analysis. Many of these applications require scalable analysis of data over a P2P network. This paper starts by offering a brief overview of distributed data mining applications and algorithms for P2P environments. Next it discusses some of the privacy concerns with P2P data mining and points out the problems of existing privacy-preserving multi-party data mining techniques. It further points out that most of the nice assumptions of these existing privacy preserving techniques fall apart in real-life applications of privacy-preserving distributed data mining (PPDM). The paper offers a more realistic formulation of the PPDM problem as a multi-party game and points out some recent results.

  4. f

    Public data sets.

    • plos.figshare.com
    xls
    Updated Jun 5, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Talayeh Razzaghi; Oleg Roderick; Ilya Safro; Nicholas Marko (2023). Public data sets. [Dataset]. http://doi.org/10.1371/journal.pone.0155119.t002
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Talayeh Razzaghi; Oleg Roderick; Ilya Safro; Nicholas Marko
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Public data sets.

  5. m

    Educational Attainment in North Carolina Public Schools: Use of statistical...

    • data.mendeley.com
    Updated Nov 14, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Explore at:
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    North Carolina
    Description

    The purpose of data mining analysis is always to find patterns of the data using certain kind of techiques such as classification or regression. It is not always feasible to apply classification algorithms directly to dataset. Before doing any work on the data, the data has to be pre-processed and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance has not improved much. The reason why it has not improved could be the features we selected to perform clustering are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics. From the dimensionality reduction perspective: It is different from Principle Component Analysis which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique of reducing the data dimension will lose a lot of information since clustering techniques are based a metric of 'distance'. At high dimensions euclidean distance loses pretty much all meaning. Therefore using clustering as a "Reducing" dimensionality by mapping data points to cluster numbers is not always good since you may lose almost all the information. From the creating new features perspective: Clustering analysis creates labels based on the patterns of the data, it brings uncertainties into the data. By using clustering prior to classification, the decision on the number of clusters will highly affect the performance of the clustering, then affect the performance of classification. If the part of features we use clustering techniques on is very suited for it, it might increase the overall performance on classification. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not lock in the clustering outputs using a random_state in the effort to see if they were stable. Our assumption was that if the results vary highly from run to run which they definitely did, maybe the data just does not cluster well with the methods selected at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering to the data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model real world effectiveness and also to continue to revise the models from time to time as things change.

  6. Application of data analytics and mining across procurement process globally...

    • statista.com
    Updated Jul 7, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2023). Application of data analytics and mining across procurement process globally 2017 [Dataset]. https://www.statista.com/statistics/728137/worldwide-application-of-data-analytics-and-mining-across-procurement-process/
    Explore at:
    Dataset updated
    Jul 7, 2023
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    2017
    Area covered
    Worldwide
    Description

    This statistic displays the various applications of data analytics and mining across procurement processes, according to chief procurement officers (CPOs) worldwide, as of 2017. Fifty-seven percent of the CPOs asked agreed that data analytics and mining had been applied to intelligent and advanced analytics for negotiations, and 40 percent of them indicated data analytics and mining had been applied to supplier portfolio optimization processes.

  7. s

    Data from: Social Media Data Mining Becomes Ordinary

    • orda.shef.ac.uk
    docx
    Updated May 30, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Helen Kennedy (2023). Social Media Data Mining Becomes Ordinary [Dataset]. http://doi.org/10.15131/shef.data.5195032.v1
    Explore at:
    docxAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    The University of Sheffield
    Authors
    Helen Kennedy
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This research explored what happens when social media data mining becomes ordinary and is carried out by organisations that might be seen as the pillars of everyday life. The interviews on which the transcripts are based are discussed in Chapter 6 of the book. The referenced book contains a description of the methods. No other publications resulted from working with these transcripts.

  8. d

    Data mining approaches to quantifying the formation of secondary organic...

    • datasets.ai
    • catalog.data.gov
    53
    Updated Sep 18, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Environmental Protection Agency (2024). Data mining approaches to quantifying the formation of secondary organic aerosol [Dataset]. https://datasets.ai/datasets/data-mining-approaches-to-quantifying-the-formation-of-secondary-organic-aerosol
    Explore at:
    53Available download formats
    Dataset updated
    Sep 18, 2024
    Dataset authored and provided by
    U.S. Environmental Protection Agency
    Description

    This research used data mining approaches to better understand factors affecting the formation of secondary organic aerosol (SOA). Although numerous laboratory and computational studies have been completed on SOA formation, it is still challenging to determine factors that most influence SOA formation. Experimental data were based on previous work described by Offenberg et al. (2017), where volume concentrations of SOA were measured in 139 laboratory experiments involving the oxidation of single hydrocarbons under different operating conditions. Three different data mining methods were used, including nearest neighbor, decision tree, and pattern mining. Both decision tree and pattern mining approaches identified similar chemical and experimental conditions that were important to SOA formation. Among these important factors included the number of methyl groups, the number of rings and the presence of dinitrogen pentoxide (N2O5).

    This dataset is associated with the following publication: Olson, D., J. Offenberg, M. Lewandowski, T. Kleindienst, K. Docherty, M. Jaoui, J.D. Krug, and T. Riedel. Data mining approaches to understanding the formation of secondary organic aerosol. ATMOSPHERIC ENVIRONMENT. Elsevier Science Ltd, New York, NY, USA, 252: 118345, (2021).

  9. Numbers of detected logic relationships by Logicome Profiler.

    • plos.figshare.com
    • figshare.com
    xls
    Updated May 30, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Tsukasa Fukunaga; Wataru Iwasaki (2023). Numbers of detected logic relationships by Logicome Profiler. [Dataset]. http://doi.org/10.1371/journal.pone.0232106.t001
    Explore at:
    xlsAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Tsukasa Fukunaga; Wataru Iwasaki
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Numbers of detected logic relationships by Logicome Profiler.

  10. Review results of the manuscript "A Systematic Review on Privacy-Preserving...

    • figshare.com
    txt
    Updated Aug 26, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Chang Sun (2021). Review results of the manuscript "A Systematic Review on Privacy-Preserving Distributed Data Mining" [Dataset]. http://doi.org/10.6084/m9.figshare.14239937.v4
    Explore at:
    txtAvailable download formats
    Dataset updated
    Aug 26, 2021
    Dataset provided by
    Figsharehttp://figshare.com/
    figshare
    Authors
    Chang Sun
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the review results of the manuscript of "A Systematic Review on Privacy-Preserving Distributed Data Mining" authored by Chang Sun, Lianne Ippel, Andre Dekker, Michel Dumontier, Johan van Soest. In the datasets, there are 231 published articles about privacy-perserving distributed data mining. Variables include article DOI, title, authors, keywords, user scenarios, distributed data scenarios, privacy/security definition/proof/analysis, privacy statement, privacy-preserving methods category, privacy-preserving methods (specific), data mining problem, data mining/machine learning methods, experiment data information, accuracy of the methods, efficiency (computation and communication cost), and scalability. The search method and evaluation criteria are described in the paper "A Systematic Review on Privacy-Preserving Distributed Data Mining". The DOI and link to the paper will be provided when the paper gets published.

  11. z

    CONCEPT- DM2 DATA MODEL TO ANALYSE HEALTHCARE PATHWAYS OF TYPE 2 DIABETES

    • zenodo.org
    bin, png, zip
    Updated Jul 12, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Berta Ibáñez-Beroiz; Berta Ibáñez-Beroiz; Asier Ballesteros-Domínguez; Asier Ballesteros-Domínguez; Ignacio Oscoz-Villanueva; Ignacio Oscoz-Villanueva; Ibai Tamayo; Ibai Tamayo; Julián Librero; Julián Librero; Mónica Enguita-Germán; Mónica Enguita-Germán; Francisco Estupiñán-Romero; Francisco Estupiñán-Romero; Enrique Bernal-Delgado; Enrique Bernal-Delgado (2024). CONCEPT- DM2 DATA MODEL TO ANALYSE HEALTHCARE PATHWAYS OF TYPE 2 DIABETES [Dataset]. http://doi.org/10.5281/zenodo.7778291
    Explore at:
    bin, png, zipAvailable download formats
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo
    Authors
    Berta Ibáñez-Beroiz; Berta Ibáñez-Beroiz; Asier Ballesteros-Domínguez; Asier Ballesteros-Domínguez; Ignacio Oscoz-Villanueva; Ignacio Oscoz-Villanueva; Ibai Tamayo; Ibai Tamayo; Julián Librero; Julián Librero; Mónica Enguita-Germán; Mónica Enguita-Germán; Francisco Estupiñán-Romero; Francisco Estupiñán-Romero; Enrique Bernal-Delgado; Enrique Bernal-Delgado
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Technical notes and documentation on the common data model of the project CONCEPT-DM2.

    This publication corresponds to the Common Data Model (CDM) specification of the CONCEPT-DM2 project for the implementation of a federated network analysis of the healthcare pathway of type 2 diabetes.

    Aims of the CONCEPT-DM2 project:

    General aim: To analyse chronic care effectiveness and efficiency of care pathways in diabetes, assuming the relevance of care pathways as independent factors of health outcomes using data from real life world (RWD) from five Spanish Regional Health Systems.

    Main specific aims:

    • To characterize the care pathways in patients with diabetes through the whole care system in terms of process indicators and pharmacologic recommendations
    • To compare these observed care pathways with the theoretical clinical pathways derived from the clinical practice guidelines
    • To assess if the adherence to clinical guidelines influence on important health outcomes, such as cardiovascular hospitalizations.
    • To compare the traditional analytical methods with process mining methods in terms of modeling quality, prediction performance and information provided.

    Study Design: It is a population-based retrospective observational study centered on all T2D patients diagnosed in five Regional Health Services within the Spanish National Health Service. We will include all the contacts of these patients with the health services using the electronic medical record systems including Primary Care data, Specialized Care data, Hospitalizations, Urgent Care data, Pharmacy Claims, and also other registers such as the mortality and the population register.

    Cohort definition: All patients with code of Type 2 Diabetes in the clinical health records

    • Inclusion criteria: patients that, at 01/01/2017 or during the follow-up from 01/01/2017 to 31/12/2022 had active health card (active TIS - tarjeta sanitaria activa) and code of type 2 diabetes (T2D, DM2 in spanish) in the clinical records of primary care (CIAP2 T90 in case of using CIAP code system)
    • Exclusion criteria:
      • patients with no contact with the health system from 01/01/2017 to 31/12/2022
      • patients that had a T1D (DM1) code opened after the T2D code during the follow-up.
    • Study period. From 01/01/2017 to 31/12/2022

    Files included in this publication:

    • Datamodel_CONCEPT_DM2_diagram.png
    • Common data model specification (Datamodel_CONCEPT_DM2_v.0.1.0.xlsx)
    • Synthetic datasets (Datamodel_CONCEPT_DM2_sample_data)
      • sample_data1_dm_patient.csv
      • sample_data2_dm_param.csv
      • sample_data3_dm_patient.csv
      • sample_data4_dm_param.csv
      • sample_data5_dm_patient.csv
      • sample_data6_dm_param.csv
      • sample_data7_dm_param.csv
      • sample_data8_dm_param.csv
    • Datamodel_CONCEPT_DM2_explanation.pptx
  12. Process mining application areas in companies in Russia 2021

    • statista.com
    Updated Feb 15, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2022). Process mining application areas in companies in Russia 2021 [Dataset]. https://www.statista.com/statistics/1289110/process-mining-application-areas-russia/
    Explore at:
    Dataset updated
    Feb 15, 2022
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    Sep 2021 - Oct 2021
    Area covered
    Russia
    Description

    Nearly two thirds of surveyed top managers of large companies operating in Russia viewed process mining as useful for purchasing, in 2021. Furthermore, over 60 percent of respondents saw the technology's potential in improving the customer journey map and IT processes.

  13. u

    Hidden Room Educational Data Mining Analysis

    • produccioncientifica.uca.es
    • figshare.com
    Updated 2016
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Palomo-Duarte, Manuel; Berns, Anke; Palomo-Duarte, Manuel; Berns, Anke (2016). Hidden Room Educational Data Mining Analysis [Dataset]. https://produccioncientifica.uca.es/documentos/668fc475b9e7c03b01bde1d4
    Explore at:
    Dataset updated
    2016
    Authors
    Palomo-Duarte, Manuel; Berns, Anke; Palomo-Duarte, Manuel; Berns, Anke
    Description

    Histograms and results of k-means and Ward's clustering for Hidden Room game

    The fileset contains information from three sources:

    1. Histograms files:
    * Lexical_histogram.png (histogram of lexical error ratios)
    * Grammatical_histogram.png (histogram of grammatical error ratios)

    2. K-means clustering files:
    * elbow-lex kmeans.png (clustering by lexical aspects: error curves obtained for applying elbow method to determinate the optimal number of clusters)
    * cube-lex kmeans.png (clustering by lexical aspects: a three-dimensional representation of clusters obtained after applying k-means method)
    * Lexical_clusters (table) kmeans.xls (clustering by lexical aspects: centroids, standard deviations and number of instances assigned to each cluster)
    * elbow-gram kmeans.png (clustering by grammatical aspects: error curves obtained for applying elbow method to determinate the optimal number of clusters)
    * cube-gramm kmeans.png (clustering by grammatical aspects: a three-dimensional representation of clusters obtained after applying k-means method)
    * Grammatical_clusters (table) kmeans.xls (clustering by grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
    * elbow-lexgram kmeans.png (clustering by lexical and grammatical aspects: error curves obtained for applying elbow method to determinate the optimal number of clusters)
    * Lexical_Grammatical_clusters (table) kmeans.xls (clustering by lexical and grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
    * Grammatical_clusters_number_of_words (table) kmeans.xls : number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to grammatical error ratios.
    * Lexical_clusters_number_of_words (table) kmeans.xls : number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to lexical error ratios.
    * Lexical_Grammatical_clusters_number_of_words (table) kmeans.xls : number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to lexical and grammatical error ratios.

    3. Ward’s Agglomerative Hierarchical Clustering files:
    * Lexical_Cluster_Dendrogram_ward.png (clustering by lexical aspects: dendrogram obtained after applying Ward's clustering method).
    * Grammatical_Cluster_Dendrogram_ward.png (clustering by grammatical aspects: dendrogram obtained after applying Ward's clustering method)
    * Lexical_Grammatical_Cluster_Dendrogram_ward.png (clustering by lexical and grammatical aspects: dendrogram obtained after applying Ward's clustering method)
    * Lexical_Grammatical_clusters (table) ward.xls: Centroids (from column 2 to 7) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
    * Grammatical_clusters (table) ward.xls: Centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
    * Lexical_clusters (table) ward.xls: Centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
    * Lexical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
    * Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
    * Lexical_Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.

  14. d

    Discovering Anomalous Aviation Safety Events Using Scalable Data Mining...

    • catalog.data.gov
    • datadiscoverystudio.org
    • +6more
    Updated Dec 6, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Discovering Anomalous Aviation Safety Events Using Scalable Data Mining Algorithms [Dataset]. https://catalog.data.gov/dataset/discovering-anomalous-aviation-safety-events-using-scalable-data-mining-algorithms
    Explore at:
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    Dashlink
    Description

    The worldwide civilian aviation system is one of the most complex dynamical systems created. Most modern commercial aircraft have onboard flight data recorders that record several hundred discrete and continuous parameters at approximately 1Hz for the entire duration of the flight. These data contain information about the flight control systems, actuators, engines, landing gear, avionics, and pilot commands. In this paper, recent advances in the development of a novel knowledge discovery process consisting of a suite of data mining techniques for identifying precursors to aviation safety incidents are discussed. The data mining techniques include scalable multiple-kernel learning for large-scale distributed anomaly detection. A novel multivariate time-series search algorithm is used to search for signatures of discovered anomalies on massive datasets. The process can identify operationally significant events due to environmental, mechanical, and human factors issues in the high-dimensional flight operations quality assurance data. All discovered anomalies are validated by a team of independent domain experts. This novel automated knowledge discovery process is aimed at complementing the state-of-the-art human-generated exceedance-based analysis that fails to discover previously unknown aviation safety incidents. In this paper, the discovery pipeline, the methods used, and some of the significant anomalies detected on real-world commercial aviation data are discussed.

  15. Leveraging data mining, active learning, and domain adaptation for efficient...

    • data.niaid.nih.gov
    zip
    Updated Mar 5, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rui Ding; Jianguo Liu; Kang Hua; Xuebin Wang; Xiaoben Zhang; Minhua Shao; Yuxin Chen; Junhong Chen (2025). Leveraging data mining, active learning, and domain adaptation for efficient discovery of advanced oxygen evolution electrocatalysts [Dataset]. http://doi.org/10.5061/dryad.nk98sf83g
    Explore at:
    zipAvailable download formats
    Dataset updated
    Mar 5, 2025
    Dataset provided by
    Hong Kong University of Science and Technology
    University of Chicago
    Nanjing University
    North China Electric Power University
    Authors
    Rui Ding; Jianguo Liu; Kang Hua; Xuebin Wang; Xiaoben Zhang; Minhua Shao; Yuxin Chen; Junhong Chen
    License

    https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

    Description

    Developing advanced catalysts for acidic oxygen evolution reaction (OER) is crucial for sustainable hydrogen production. This study presents a multi-stage machine learning (ML) approach to streamline the discovery and optimization of complex multi-metallic catalysts. Our method integrates data mining, active learning, and domain adaptation throughout the materials discovery process. Unlike traditional trial-and-error methods, this approach systematically narrows the exploration space using domain knowledge with minimized reliance on subjective intuition. Then the active learning module efficiently refines element composition and synthesis conditions through iterative experimental feedback. The process culminated in the discovery of a promising Ru-Mn-Ca-Pr oxide catalyst. Our workflow also enhances theoretical simulations with domain adaptation strategy, providing deeper mechanistic insights aligned with experimental findings. By leveraging diverse data sources and multiple ML strategies, we demonstrate an efficient pathway for electrocatalyst discovery and optimization. This comprehensive, data-driven approach represents a paradigm shift and potentially a benchmark in electrocatalysts research.

  16. Process mining integration plans in companies in Russia 2021

    • statista.com
    Updated Feb 15, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2022). Process mining integration plans in companies in Russia 2021 [Dataset]. https://www.statista.com/statistics/1289097/process-mining-integration-plans-russia/
    Explore at:
    Dataset updated
    Feb 15, 2022
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    Sep 2021 - Oct 2021
    Area covered
    Russia
    Description

    One fifth of surveyed top managers of large companies operating in Russia stated that their businesses either already had integrated process mining or were in the process of implementing it in 2021. Furthermore, nearly 30 percent revealed plans to integrate it in the following three years.

  17. f

    Statistical analysis of normotension vs. hypertension in women and men.

    • plos.figshare.com
    xls
    Updated Jun 4, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bum Ju Lee; Jong Yeol Kim (2023). Statistical analysis of normotension vs. hypertension in women and men. [Dataset]. http://doi.org/10.1371/journal.pone.0084897.t002
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Bum Ju Lee; Jong Yeol Kim
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The AUC values were calculated using 10-fold cross validation. OR: odds ratios; AUC: area under the receiver operating characteristic curve; LR: logistic regression; NB: naïve Bayes.

  18. H

    Data from: Mining texts to efficiently generate global data on political...

    • dataverse.harvard.edu
    Updated Jul 8, 2015
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shahryar Minhas; Jay Ulfelder; Michael D. Ward (2015). Mining texts to efficiently generate global data on political regime types [Dataset]. http://doi.org/10.7910/DVN/8MC1LO
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 8, 2015
    Dataset provided by
    Harvard Dataverse
    Authors
    Shahryar Minhas; Jay Ulfelder; Michael D. Ward
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    We describe the design and results of an experiment in using text-mining and machine-learning techniques to generate annual measures of national political regime types. Valid and reliable measures of countries’ forms of national government are essential to cross-national and dynamic analysis of many phenomena of great interest to political scientists, including civil war, interstate war, democratization, and coups d’état. Unfortunately, traditional measures of regime type are very expensive to produce, and observations for ambiguous cases are often sharply contested. In this project, we train a series of support vector machine (SVM) classifiers to infer regime type from textual data sources. To train the classifiers, we used vectorized textual reports from Freedom House and the State Department as features for a training set of prelabeled regime type data. To validate our SVM classifiers, we compare their predictions in an out-of-sample context, and the performance results across a variety of metrics (accuracy, precision, recall) are very high. The results of this project highlight the ability of these techniques to contribute to producing real-time data sources for use in political science that can also be routinely updated at much lower cost than human-coded data. To this end, we set up a text-processing pipeline that pulls updated textual data from selected sources, conducts feature extraction, and applies supervised machine learning methods to produce measures of regime type. This pipeline, written in Python, can be pulled from the Github repository associated with this project and easily extended as more data becomes available.

  19. Data Mining for IVHM using Sparse Binary Ensembles, Phase I

    • data.nasa.gov
    application/rdfxml +5
    Updated Jun 26, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2018). Data Mining for IVHM using Sparse Binary Ensembles, Phase I [Dataset]. https://data.nasa.gov/dataset/Data-Mining-for-IVHM-using-Sparse-Binary-Ensembles/qfus-evzq
    Explore at:
    xml, tsv, csv, application/rssxml, application/rdfxml, jsonAvailable download formats
    Dataset updated
    Jun 26, 2018
    License

    U.S. Government Workshttps://www.usa.gov/government-works
    License information was derived automatically

    Description

    In response to NASA SBIR topic A1.05, "Data Mining for Integrated Vehicle Health Management", Michigan Aerospace Corporation (MAC) asserts that our unique SPADE (Sparse Processing Applied to Data Exploitation) technology meets a significant fraction of the stated criteria and has functionality that enables it to handle many applications within the aircraft lifecycle. SPADE distills input data into highly quantized features and uses MAC's novel techniques for constructing Ensembles of Decision Trees to develop extremely accurate diagnostic/prognostic models for classification, regression, clustering, anomaly detection and semi-supervised learning tasks. These techniques are currently being employed to do Threat Assessment for satellites in conjunction with researchers at the Air Force Research Lab. Significant advantages to this approach include: 1) completely data driven; 2) training and evaluation are faster than conventional methods; 3) operates effectively on huge datasets (> billion samples X > million features), 4) proven to be as accurate as state-of-the-art techniques in many significant real-world applications. The specific goals for Phase 1 will be to work with domain experts at NASA and with our partners Boeing, SpaceX and GMV Space Systems to delineate a subset of problems that are particularly well-suited to this approach and to determine requirements for deploying algorithms on platforms of opportunity.

  20. Data from: Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems

    • data.nasa.gov
    • datasets.ai
    • +1more
    application/rdfxml +5
    Updated Jun 26, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2018). Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems [Dataset]. https://data.nasa.gov/w/exfs-d3uz/default?cur=Q-fHQ7jiCL7
    Explore at:
    application/rssxml, application/rdfxml, tsv, csv, json, xmlAvailable download formats
    Dataset updated
    Jun 26, 2018
    License

    U.S. Government Workshttps://www.usa.gov/government-works
    License information was derived automatically

    Description

    In a large network of computers, wireless sensors, or mobile devices, each of the components (hence, peers) has some data about the global status of the system. Many of the functions of the system, such as routing decisions, search strategies, data cleansing, and the assignment of mutual trust, depend on the global status. Therefore, it is essential that the system be able to detect, and react to, changes in its global status. Computing global predicates in such systems is usually very costly. Mainly because of their scale, and in some cases (e.g., sensor networks) also because of the high cost of communication. The cost further increases when the data changes rapidly (due to state changes, node failure, etc.) and computation has to follow these changes. In this paper we describe a two step approach for dealing with these costs. First, we describe a highly efficient local algorithm which detect when the L2 norm of the average data surpasses a threshold. Then, we use this algorithm as a feedback loop for the monitoring of complex predicates on the data – such as the data’s k-means clustering. The efficiency of the L2 algorithm guarantees that so long as the clustering results represent the data (i.e., the data is stationary) few resources are required. When the data undergoes an epoch change – a change in the underlying distribution – and the model no longer represents it, the feedback loop indicates this and the model is rebuilt. Furthermore, the existence of a feedback loop allows using approximate and “best-effort ” methods for constructing the model; if an ill-fit model is built the feedback loop would indicate so, and the model would be rebuilt.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Dashlink (2023). Ensemble Data Mining Methods [Dataset]. https://catalog.data.gov/dataset/ensemble-data-mining-methods

Ensemble Data Mining Methods

Explore at:
Dataset updated
Dec 6, 2023
Dataset provided by
Dashlink
Description

Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods that leverage the power of multiple models to achieve better prediction accuracy than any of the individual models could on their own. The basic goal when designing an ensemble is the same as when establishing a committee of people: each member of the committee should be as competent as possible, but the members should be complementary to one another. If the members are not complementary, i.e., if they always agree, then the committee is unnecessary---any one member is sufficient. If the members are complementary, then when one or a few members make an error, the probability is high that the remaining members can correct this error. Research in ensemble methods has largely revolved around designing ensembles consisting of competent yet complementary models.

Search
Clear search
Close search
Google apps
Main menu