100+ datasets found
  1. Data from: Peer-to-Peer Data Mining, Privacy Issues, and Games

    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • data.nasa.gov
    • +2more
    Updated Feb 18, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    nasa.gov (2025). Peer-to-Peer Data Mining, Privacy Issues, and Games [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/peer-to-peer-data-mining-privacy-issues-and-games
    Explore at:
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    NASAhttp://nasa.gov/
    Description

    Peer-to-Peer (P2P) networks are gaining increasing popularity in many distributed applications such as file-sharing, network storage, web caching, sear- ching and indexing of relevant documents and P2P network-threat analysis. Many of these applications require scalable analysis of data over a P2P network. This paper starts by offering a brief overview of distributed data mining applications and algorithms for P2P environments. Next it discusses some of the privacy concerns with P2P data mining and points out the problems of existing privacy-preserving multi-party data mining techniques. It further points out that most of the nice assumptions of these existing privacy preserving techniques fall apart in real-life applications of privacy-preserving distributed data mining (PPDM). The paper offers a more realistic formulation of the PPDM problem as a multi-party game and points out some recent results.

  2. Review results of the manuscript "A Systematic Review on Privacy-Preserving...

    • figshare.com
    txt
    Updated Aug 26, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Chang Sun (2021). Review results of the manuscript "A Systematic Review on Privacy-Preserving Distributed Data Mining" [Dataset]. http://doi.org/10.6084/m9.figshare.14239937.v4
    Explore at:
    txtAvailable download formats
    Dataset updated
    Aug 26, 2021
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Chang Sun
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the review results of the manuscript of "A Systematic Review on Privacy-Preserving Distributed Data Mining" authored by Chang Sun, Lianne Ippel, Andre Dekker, Michel Dumontier, Johan van Soest. In the datasets, there are 231 published articles about privacy-perserving distributed data mining. Variables include article DOI, title, authors, keywords, user scenarios, distributed data scenarios, privacy/security definition/proof/analysis, privacy statement, privacy-preserving methods category, privacy-preserving methods (specific), data mining problem, data mining/machine learning methods, experiment data information, accuracy of the methods, efficiency (computation and communication cost), and scalability. The search method and evaluation criteria are described in the paper "A Systematic Review on Privacy-Preserving Distributed Data Mining". The DOI and link to the paper will be provided when the paper gets published.

  3. d

    Privacy Preserving Distributed Data Mining

    • catalog.data.gov
    • datadiscoverystudio.org
    • +2more
    Updated Dec 6, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dashlink (2023). Privacy Preserving Distributed Data Mining [Dataset]. https://catalog.data.gov/dataset/privacy-preserving-distributed-data-mining
    Explore at:
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    Dashlink
    Description

    Distributed data mining from privacy-sensitive multi-party data is likely to play an important role in the next generation of integrated vehicle health monitoring systems. For example, consider an airline manufacturer [tex]$\mathcal{C}$[/tex] manufacturing an aircraft model [tex]$A$[/tex] and selling it to five different airline operating companies [tex]$\mathcal{V}_1 \dots \mathcal{V}_5$[/tex]. These aircrafts, during their operation, generate huge amount of data. Mining this data can reveal useful information regarding the health and operability of the aircraft which can be useful for disaster management and prediction of efficient operating regimes. Now if the manufacturer [tex]$\mathcal{C}$[/tex] wants to analyze the performance data collected from different aircrafts of model-type [tex]$A$[/tex] belonging to different airlines then central collection of data for subsequent analysis may not be an option. It should be noted that the result of this analysis may be statistically more significant if the data for aircraft model [tex]$A$[/tex] across all companies were available to [tex]$\mathcal{C}$[/tex]. The potential problems arising out of such a data mining scenario are:

  4. f

    Confusion matrix.

    • figshare.com
    xls
    Updated Jul 7, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shaoxia Mou; Heming Zhang (2023). Confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0288140.t002
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jul 7, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Shaoxia Mou; Heming Zhang
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Due to the inherent characteristics of accumulation sequence of unbalanced data, the mining results of this kind of data are often affected by a large number of categories, resulting in the decline of mining performance. To solve the above problems, the performance of data cumulative sequence mining is optimized. The algorithm for mining cumulative sequence of unbalanced data based on probability matrix decomposition is studied. The natural nearest neighbor of a few samples in the unbalanced data cumulative sequence is determined, and the few samples in the unbalanced data cumulative sequence are clustered according to the natural nearest neighbor relationship. In the same cluster, new samples are generated from the core points of dense regions and non core points of sparse regions, and then new samples are added to the original data accumulation sequence to balance the data accumulation sequence. The probability matrix decomposition method is used to generate two random number matrices with Gaussian distribution in the cumulative sequence of balanced data, and the linear combination of low dimensional eigenvectors is used to explain the preference of specific users for the data sequence; At the same time, from a global perspective, the AdaBoost idea is used to adaptively adjust the sample weight and optimize the probability matrix decomposition algorithm. Experimental results show that the algorithm can effectively generate new samples, improve the imbalance of data accumulation sequence, and obtain more accurate mining results. Optimizing global errors as well as more efficient single-sample errors. When the decomposition dimension is 5, the minimum RMSE is obtained. The proposed algorithm has good classification performance for the cumulative sequence of balanced data, and the average ranking of index F value, G mean and AUC is the best.

  5. m

    Educational Attainment in North Carolina Public Schools: Use of statistical...

    • data.mendeley.com
    Updated Nov 14, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Explore at:
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    North Carolina
    Description

    The purpose of data mining analysis is always to find patterns of the data using certain kind of techiques such as classification or regression. It is not always feasible to apply classification algorithms directly to dataset. Before doing any work on the data, the data has to be pre-processed and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance has not improved much. The reason why it has not improved could be the features we selected to perform clustering are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics. From the dimensionality reduction perspective: It is different from Principle Component Analysis which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique of reducing the data dimension will lose a lot of information since clustering techniques are based a metric of 'distance'. At high dimensions euclidean distance loses pretty much all meaning. Therefore using clustering as a "Reducing" dimensionality by mapping data points to cluster numbers is not always good since you may lose almost all the information. From the creating new features perspective: Clustering analysis creates labels based on the patterns of the data, it brings uncertainties into the data. By using clustering prior to classification, the decision on the number of clusters will highly affect the performance of the clustering, then affect the performance of classification. If the part of features we use clustering techniques on is very suited for it, it might increase the overall performance on classification. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not lock in the clustering outputs using a random_state in the effort to see if they were stable. Our assumption was that if the results vary highly from run to run which they definitely did, maybe the data just does not cluster well with the methods selected at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering to the data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model real world effectiveness and also to continue to revise the models from time to time as things change.

  6. r

    Results

    • researchdata.edu.au
    • bridges.monash.edu
    Updated May 5, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Chang Wei Tan (2022). Results [Dataset]. http://doi.org/10.26180/5c30a56c0bda8
    Explore at:
    Dataset updated
    May 5, 2022
    Dataset provided by
    Monash University
    Authors
    Chang Wei Tan
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the results for the FastEE paper.

  7. Data from: Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems

    • data.nasa.gov
    • datasets.ai
    • +1more
    application/rdfxml +5
    Updated Jun 26, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2018). Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems [Dataset]. https://data.nasa.gov/w/exfs-d3uz/default?cur=Q-fHQ7jiCL7
    Explore at:
    application/rssxml, application/rdfxml, tsv, csv, json, xmlAvailable download formats
    Dataset updated
    Jun 26, 2018
    License

    U.S. Government Workshttps://www.usa.gov/government-works
    License information was derived automatically

    Description

    In a large network of computers, wireless sensors, or mobile devices, each of the components (hence, peers) has some data about the global status of the system. Many of the functions of the system, such as routing decisions, search strategies, data cleansing, and the assignment of mutual trust, depend on the global status. Therefore, it is essential that the system be able to detect, and react to, changes in its global status. Computing global predicates in such systems is usually very costly. Mainly because of their scale, and in some cases (e.g., sensor networks) also because of the high cost of communication. The cost further increases when the data changes rapidly (due to state changes, node failure, etc.) and computation has to follow these changes. In this paper we describe a two step approach for dealing with these costs. First, we describe a highly efficient local algorithm which detect when the L2 norm of the average data surpasses a threshold. Then, we use this algorithm as a feedback loop for the monitoring of complex predicates on the data – such as the data’s k-means clustering. The efficiency of the L2 algorithm guarantees that so long as the clustering results represent the data (i.e., the data is stationary) few resources are required. When the data undergoes an epoch change – a change in the underlying distribution – and the model no longer represents it, the feedback loop indicates this and the model is rebuilt. Furthermore, the existence of a feedback loop allows using approximate and “best-effort ” methods for constructing the model; if an ill-fit model is built the feedback loop would indicate so, and the model would be rebuilt.

  8. m

    Data for: Identification of hindered internal rotational mode for complex...

    • data.mendeley.com
    • narcis.nl
    Updated Nov 8, 2017
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Triet Le (2017). Data for: Identification of hindered internal rotational mode for complex chemical species: A data mining approach with multivariate logistic regression model [Dataset]. http://doi.org/10.17632/d37mzs3b3m.2
    Explore at:
    Dataset updated
    Nov 8, 2017
    Authors
    Triet Le
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The "Dataset_HIR" folder contains the data to reproduce the results of the data mining approach proposed in the manuscript titled "Identification of hindered internal rotational mode for complex chemical species: A data mining approach with multivariate logistic regression model".

    More specifically, the folder contains the raw electronic structure calculation input data provided by the domain experts as well as the training and testing dataset with the extracted features.

    The "Dataset_HIR" folder contains the following subfolders namely:

    1. Electronic structure calculation input data: contains the electronic structure calculation input generated by the Gaussian program

      1.1. Testing data: contains the raw data of all training species (each is stored in a separate folder) used for extracting dataset for training and validation phase.

      1.2. Testing data: contains the raw data of all testing species (each is stored in a separate folder) used for extracting data for the testing phase.

    2. Dataset 2.1. Training dataset: used to produce the results in Tables 3 and 4 in the manuscript

      + datasetTrain_raw.csv: contains the features for all vibrational modes associated with corresponding labeled species to let the chemists select the Hindered Internal Rotor from the list easily for the training and validation steps.  
      
      + datasetTrain.csv: refines the datasetTrain_raw.csv where the names of the species are all removed to transform the dataset into an appropriate form for the modeling and validation steps.
      

      2.2. Testing dataset: used to produce the results of the data mining approach in Table 5 in the manuscript.

      + datasetTest_raw.csv: contains the features for all vibrational modes of each labeled species to let the chemists select the Hindered Internal Rotor from the list for the testing step.
      
      + datasetTest.csv: refines the datasetTest_raw.csv where the names of the species are all removed to transform the dataset into an appropriate form for the testing step.
      

    Note for the Result feature in the dataset: 1 is for the mode needed to be treated as Hindered Internal Rotor, and 0 otherwise.

  9. u

    Hidden Room Educational Data Mining Analysis

    • produccioncientifica.uca.es
    • figshare.com
    Updated 2016
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Palomo-Duarte, Manuel; Berns, Anke; Palomo-Duarte, Manuel; Berns, Anke (2016). Hidden Room Educational Data Mining Analysis [Dataset]. https://produccioncientifica.uca.es/documentos/668fc475b9e7c03b01bde1d4
    Explore at:
    Dataset updated
    2016
    Authors
    Palomo-Duarte, Manuel; Berns, Anke; Palomo-Duarte, Manuel; Berns, Anke
    Description

    Histograms and results of k-means and Ward's clustering for Hidden Room game

    The fileset contains information from three sources:

    1. Histograms files:
    * Lexical_histogram.png (histogram of lexical error ratios)
    * Grammatical_histogram.png (histogram of grammatical error ratios)

    2. K-means clustering files:
    * elbow-lex kmeans.png (clustering by lexical aspects: error curves obtained for applying elbow method to determinate the optimal number of clusters)
    * cube-lex kmeans.png (clustering by lexical aspects: a three-dimensional representation of clusters obtained after applying k-means method)
    * Lexical_clusters (table) kmeans.xls (clustering by lexical aspects: centroids, standard deviations and number of instances assigned to each cluster)
    * elbow-gram kmeans.png (clustering by grammatical aspects: error curves obtained for applying elbow method to determinate the optimal number of clusters)
    * cube-gramm kmeans.png (clustering by grammatical aspects: a three-dimensional representation of clusters obtained after applying k-means method)
    * Grammatical_clusters (table) kmeans.xls (clustering by grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
    * elbow-lexgram kmeans.png (clustering by lexical and grammatical aspects: error curves obtained for applying elbow method to determinate the optimal number of clusters)
    * Lexical_Grammatical_clusters (table) kmeans.xls (clustering by lexical and grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
    * Grammatical_clusters_number_of_words (table) kmeans.xls : number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to grammatical error ratios.
    * Lexical_clusters_number_of_words (table) kmeans.xls : number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to lexical error ratios.
    * Lexical_Grammatical_clusters_number_of_words (table) kmeans.xls : number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to lexical and grammatical error ratios.

    3. Ward’s Agglomerative Hierarchical Clustering files:
    * Lexical_Cluster_Dendrogram_ward.png (clustering by lexical aspects: dendrogram obtained after applying Ward's clustering method).
    * Grammatical_Cluster_Dendrogram_ward.png (clustering by grammatical aspects: dendrogram obtained after applying Ward's clustering method)
    * Lexical_Grammatical_Cluster_Dendrogram_ward.png (clustering by lexical and grammatical aspects: dendrogram obtained after applying Ward's clustering method)
    * Lexical_Grammatical_clusters (table) ward.xls: Centroids (from column 2 to 7) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
    * Grammatical_clusters (table) ward.xls: Centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
    * Lexical_clusters (table) ward.xls: Centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
    * Lexical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
    * Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
    * Lexical_Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.

  10. d

    2011–2016 Single Well Aquifer Tests: Pumping Schedules, Water-Level Data in...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Jul 6, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Geological Survey (2024). 2011–2016 Single Well Aquifer Tests: Pumping Schedules, Water-Level Data in Aquifer Test Wells, and Analysis Results from Tests Conducted near Long Canyon, Goshute Valley, Northeastern Nevada [Dataset]. https://catalog.data.gov/dataset/20112016-single-well-aquifer-tests-pumping-schedules-water-level-data-in-aquifer-test-well
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Area covered
    Goshute Valley
    Description

    This dataset presents tabular data and Excel workbooks used to analyze single-well aquifer tests in pumping wells and slug tests in monitoring wells near Long Canyon. The data also include pdf outputs from the analysis program, Aqtesolv (Duffield, 2007). The data are presented in two zipped files, (1) single-well aquifer tests in pumping wells and (2) slug tests in monitoring wells. The slug-test data were supplied by Newmont Mining Corporation and collected by Golder and Associates in 2011. Reference Cited: Duffield, G.M., 2007, AQTESOLV for windows: Version 4.5 User’s Guide, HydroSOLV, Inc. Reston, VA, p. 530, at, http://www.aqtesolv.com/download/aqtw20070719.pdf.

  11. f

    Datasets: Mining chemical activity status from high-throughput screening...

    • figshare.com
    zip
    Updated Jan 20, 2016
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Othman Soufan (2016). Datasets: Mining chemical activity status from high-throughput screening assays [Dataset]. http://doi.org/10.6084/m9.figshare.1598200.v1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jan 20, 2016
    Dataset provided by
    figshare
    Authors
    Othman Soufan
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    High-throughput screening (HTS) experiments provide a valuable resource that reports biological activity of numerous chemical compounds relative to their molecular targets. Building computational models that accurately predict such activity status (active vs. inactive) in specific assays is a challenging task given the large volume of data and frequently small proportion of active compounds relative to the inactive ones. We developed a method, DRAMOTE, to predict activity status of chemical compounds in HTP activity assays. For a class of HTP assays, our method achieves considerably better results than the current state-of-the-art-solutions. We achieved this by modification of a minority oversampling technique. To demonstrate that DRAMOTE is performing better than the other methods, we performed a comprehensive comparison analysis with several other methods and evaluated them on data from 11 PubChem assays through 1,350 experiments that involved approximately 500,000 interactions between chemicals and their target proteins. As an example of potential use, we applied DRAMOTE to develop robust models for predicting FDA approved drugs that have high probability to interact with the thyroid stimulating hormone receptor (TSHR) in humans. Our findings are further partially and indirectly supported by 3D docking results and literature information. The results based on approximately 500,000 interactions suggest that DRAMOTE has performed the best and that it can be used for developing robust virtual screening models. The datasets and implementation of all solutions are available as a MATLAB toolbox online at www.cbrc.kaust.edu.sa/dramote

  12. r

    Results

    • researchdata.edu.au
    • bridges.monash.edu
    Updated May 5, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Chang Wei Tan (2022). Results [Dataset]. http://doi.org/10.26180/5be4d0a9b1937
    Explore at:
    Dataset updated
    May 5, 2022
    Dataset provided by
    Monash University
    Authors
    Chang Wei Tan
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the result for the paper "Elastic band across the path: A new framework to lower bound DTW"

  13. H

    Data from: Mining texts to efficiently generate global data on political...

    • dataverse.harvard.edu
    Updated Jul 8, 2015
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shahryar Minhas; Jay Ulfelder; Michael D. Ward (2015). Mining texts to efficiently generate global data on political regime types [Dataset]. http://doi.org/10.7910/DVN/8MC1LO
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 8, 2015
    Dataset provided by
    Harvard Dataverse
    Authors
    Shahryar Minhas; Jay Ulfelder; Michael D. Ward
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    We describe the design and results of an experiment in using text-mining and machine-learning techniques to generate annual measures of national political regime types. Valid and reliable measures of countries’ forms of national government are essential to cross-national and dynamic analysis of many phenomena of great interest to political scientists, including civil war, interstate war, democratization, and coups d’état. Unfortunately, traditional measures of regime type are very expensive to produce, and observations for ambiguous cases are often sharply contested. In this project, we train a series of support vector machine (SVM) classifiers to infer regime type from textual data sources. To train the classifiers, we used vectorized textual reports from Freedom House and the State Department as features for a training set of prelabeled regime type data. To validate our SVM classifiers, we compare their predictions in an out-of-sample context, and the performance results across a variety of metrics (accuracy, precision, recall) are very high. The results of this project highlight the ability of these techniques to contribute to producing real-time data sources for use in political science that can also be routinely updated at much lower cost than human-coded data. To this end, we set up a text-processing pipeline that pulls updated textual data from selected sources, conducts feature extraction, and applies supervised machine learning methods to produce measures of regime type. This pipeline, written in Python, can be pulled from the Github repository associated with this project and easily extended as more data becomes available.

  14. Data from: PADMINI: A PEER-TO-PEER DISTRIBUTED ASTRONOMY DATA MINING SYSTEM...

    • data.nasa.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • +2more
    application/rdfxml +5
    Updated Jun 26, 2018
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2018). PADMINI: A PEER-TO-PEER DISTRIBUTED ASTRONOMY DATA MINING SYSTEM AND A CASE STUDY [Dataset]. https://data.nasa.gov/dataset/PADMINI-A-PEER-TO-PEER-DISTRIBUTED-ASTRONOMY-DATA-/r38j-jwis
    Explore at:
    csv, xml, application/rdfxml, application/rssxml, tsv, jsonAvailable download formats
    Dataset updated
    Jun 26, 2018
    License

    U.S. Government Workshttps://www.usa.gov/government-works
    License information was derived automatically

    Description

    PADMINI: A PEER-TO-PEER DISTRIBUTED ASTRONOMY DATA MINING SYSTEM AND A CASE STUDY

    TUSHAR MAHULE*, KIRK BORNE**, SANDIPAN DEY*, SUGANDHA ARORA*, AND HILLOL KARGUPTA***

    Abstract. Peer-to-Peer (P2P) networks are appealing for astronomy data mining from virtual observatories because of the large volume of the data, compute-intensive tasks, potentially large number of users, and distributed nature of the data analysis process. This paper offers a brief overview of PADMINI—a Peer-to-Peer Astronomy Data MINIng system. It also presents a case study on PADMINI for distributed outlier detection using astronomy data. PADMINI is a webbased system powered by Google Sky and distributed data mining algorithms that run on a collection of computing nodes. This paper offers a case study of the PADMINI evaluating the architecture and the performance of the overall system. Detailed experimental results are presented in order to document the utility and scalability of the system.

  15. Data Mining in Systems Health Management

    • data.nasa.gov
    • catalog.data.gov
    • +1more
    application/rdfxml +5
    Updated Jun 26, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2018). Data Mining in Systems Health Management [Dataset]. https://data.nasa.gov/dataset/Data-Mining-in-Systems-Health-Management/f97y-q7ff
    Explore at:
    csv, tsv, application/rdfxml, json, application/rssxml, xmlAvailable download formats
    Dataset updated
    Jun 26, 2018
    License

    U.S. Government Workshttps://www.usa.gov/government-works
    License information was derived automatically

    Description

    This chapter presents theoretical and practical aspects associated to the implementation of a combined model-based/data-driven approach for failure prognostics based on particle filtering algorithms, in which the current esti- mate of the state PDF is used to determine the operating condition of the system and predict the progression of a fault indicator, given a dynamic state model and a set of process measurements. In this approach, the task of es- timating the current value of the fault indicator, as well as other important changing parameters in the environment, involves two basic steps: the predic- tion step, based on the process model, and an update step, which incorporates the new measurement into the a priori state estimate. This framework allows to estimate of the probability of failure at future time instants (RUL PDF) in real-time, providing information about time-to- failure (TTF) expectations, statistical confidence intervals, long-term predic- tions; using for this purpose empirical knowledge about critical conditions for the system (also referred to as the hazard zones). This information is of paramount significance for the improvement of the system reliability and cost-effective operation of critical assets, as it has been shown in a case study where feedback correction strategies (based on uncertainty measures) have been implemented to lengthen the RUL of a rotorcraft transmission system with propagating fatigue cracks on a critical component. Although the feed- back loop is implemented using simple linear relationships, it is helpful to provide a quick insight into the manner that the system reacts to changes on its input signals, in terms of its predicted RUL. The method is able to manage non-Gaussian pdf’s since it includes concepts such as nonlinear state estimation and confidence intervals in its formulation. Real data from a fault seeded test showed that the proposed framework was able to anticipate modifications on the system input to lengthen its RUL. Results of this test indicate that the method was able to successfully suggest the correction that the system required. In this sense, future work will be focused on the development and testing of similar strategies using different input-output uncertainty metrics.

  16. Video-to-Model Data Set

    • figshare.com
    • commons.datacite.org
    xml
    Updated Mar 24, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sönke Knoch; Shreeraman Ponpathirkoottam; Tim Schwartz (2020). Video-to-Model Data Set [Dataset]. http://doi.org/10.6084/m9.figshare.12026850.v1
    Explore at:
    xmlAvailable download formats
    Dataset updated
    Mar 24, 2020
    Dataset provided by
    figshare
    Authors
    Sönke Knoch; Shreeraman Ponpathirkoottam; Tim Schwartz
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set belongs to the paper "Video-to-Model: Unsupervised Trace Extraction from Videos for Process Discovery and Conformance Checking in Manual Assembly", submitted on March 24, 2020, to the 18th International Conference on Business Process Management (BPM).Abstract: Manual activities are often hidden deep down in discrete manufacturing processes. For the elicitation and optimization of process behavior, complete information about the execution of Manual activities are required. Thus, an approach is presented on how execution level information can be extracted from videos in manual assembly. The goal is the generation of a log that can be used in state-of-the-art process mining tools. The test bed for the system was lightweight and scalable consisting of an assembly workstation equipped with a single RGB camera recording only the hand movements of the worker from top. A neural network based real-time object classifier was trained to detect the worker’s hands. The hand detector delivers the input for an algorithm, which generates trajectories reflecting the movement paths of the hands. Those trajectories are automatically assigned to work steps using the position of material boxes on the assembly shelf as reference points and hierarchical clustering of similar behaviors with dynamic time warping. The system has been evaluated in a task-based study with ten participants in a laboratory, but under realistic conditions. The generated logs have been loaded into the process mining toolkit ProM to discover the underlying process model and to detect deviations from both, instructions and ground truth, using conformance checking. The results show that process mining delivers insights about the assembly process and the system’s precision.The data set contains the generated and the annotated logs based on the video material gathered during the user study. In addition, the petri nets from the process discovery and conformance checking conducted with ProM (http://www.promtools.org) and the reference nets modeled with Yasper (http://www.yasper.org/) are provided.

  17. Comparative Reviews Dataset's

    • kaggle.com
    zip
    Updated Jan 22, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Umair Younis (2019). Comparative Reviews Dataset's [Dataset]. https://www.kaggle.com/umairyounis/comparative-reviews-datasets
    Explore at:
    zip(205233 bytes)Available download formats
    Dataset updated
    Jan 22, 2019
    Authors
    Umair Younis
    Description

    Context

    To get improved results on Machine Learning Algorithms, and other techniques used in Data Mining.

    Content

    Comprises of two columns, the First row consists of comparative reviews, the second row contains polarities.

    Acknowledgements

    I pay thanks to my supervisor, Dr Muhammad Zubair Asghar, Assitant Professor, ICIT, Gomal University (KPK). Di.Khan. Without his guidance, I can't accomplish this task.

    Inspiration

    Comparative opinion mining is becoming the most popular research area in the field of Data Mining. These three comparative reviews datasets will help the researchers who are working in the area of opinion mining and sentiment analysis.

  18. u

    Data from: IJEE Educational Data Mining

    • produccioncientifica.uca.es
    Updated 2016
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Palomo-Duarte, Manuel; Palomo-Duarte, Manuel (2016). IJEE Educational Data Mining [Dataset]. https://produccioncientifica.uca.es/documentos/668fc475b9e7c03b01bde195
    Explore at:
    Dataset updated
    2016
    Authors
    Palomo-Duarte, Manuel; Palomo-Duarte, Manuel
    Description

    Histograms and results of k-means and Ward's clustering for IJEE special issue

    The fileset contains information from three sources:

    1. Histograms (two files):
    * Lexical_histogram.png (histogram of lexical error ratios)
    * Grammatical_histogram.png (histogram of grammatical error ratios)

    2. K-means clustering (eight files):
    * elbow-lex.png (clustering by lexical aspects: error curves obtained for applying elbow method to determinate the optimal number of clusters)
    * cube-lex.png (clustering by lexical aspects: a three-dimensional representation of clusters obtained after applying k-means method)
    * Lexical_clusters (table).xls (clustering by lexical aspects: centroids, standard deviations and number of instances assigned to each cluster)
    * elbow-gram.png (clustering by grammatical aspects: error curves obtained for applying elbow method to determinate the optimal number of clusters)
    * cube-gramm.png (clustering by grammatical aspects: a three-dimensional representation of clusters obtained after applying k-means method)
    * Grammatical_clusters (table).xls (clustering by grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
    * elbow-lexgram.png (clustering by lexical and grammatical aspects: error curves obtained for applying elbow method to determinate the optimal number of clusters)
    * Lexical_Grammatical_clusters (table).xls (clustering by lexical and grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
    * Grammatical_clusters_number_of_words (table) kmeans.xls : number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to grammatical error ratios.
    * Lexical_clusters_number_of_words (table) kmeans.xls : number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to lexical error ratios.
    * Lexical_Grammatical_clusters_number_of_words (table) kmeans.xls : number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to lexical and grammatical error ratios.

    3. Ward’s Agglomerative Hierarchical Clustering (three files):
    * Lexical_Cluster_Dendrogram_ward.png (clustering by lexical aspects: dendrogram obtained after applying Ward's clustering method).
    * Grammatical_Cluster_Dendrogram_ward.png (clustering by grammatical aspects: dendrogram obtained after applying Ward's clustering method)
    * Lexical_Grammatical_Cluster_Dendrogram_ward.png (clustering by lexical and grammatical aspects: dendrogram obtained after applying Ward's clustering method)
    * Lexical_Grammatical_clusters (table).xls: Centroids (from column 2 to 7) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
    * Grammatical_clusters (table).xls: Centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
    * Lexical_clusters (table).xls: Centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
    * Lexical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
    * Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
    * Lexical_Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.

  19. Leveraging data mining, active learning, and domain adaptation for efficient...

    • data.niaid.nih.gov
    zip
    Updated Mar 5, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rui Ding; Jianguo Liu; Kang Hua; Xuebin Wang; Xiaoben Zhang; Minhua Shao; Yuxin Chen; Junhong Chen (2025). Leveraging data mining, active learning, and domain adaptation for efficient discovery of advanced oxygen evolution electrocatalysts [Dataset]. http://doi.org/10.5061/dryad.nk98sf83g
    Explore at:
    zipAvailable download formats
    Dataset updated
    Mar 5, 2025
    Dataset provided by
    Hong Kong University of Science and Technology
    University of Chicago
    Nanjing University
    North China Electric Power University
    Authors
    Rui Ding; Jianguo Liu; Kang Hua; Xuebin Wang; Xiaoben Zhang; Minhua Shao; Yuxin Chen; Junhong Chen
    License

    https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

    Description

    Developing advanced catalysts for acidic oxygen evolution reaction (OER) is crucial for sustainable hydrogen production. This study presents a multi-stage machine learning (ML) approach to streamline the discovery and optimization of complex multi-metallic catalysts. Our method integrates data mining, active learning, and domain adaptation throughout the materials discovery process. Unlike traditional trial-and-error methods, this approach systematically narrows the exploration space using domain knowledge with minimized reliance on subjective intuition. Then the active learning module efficiently refines element composition and synthesis conditions through iterative experimental feedback. The process culminated in the discovery of a promising Ru-Mn-Ca-Pr oxide catalyst. Our workflow also enhances theoretical simulations with domain adaptation strategy, providing deeper mechanistic insights aligned with experimental findings. By leveraging diverse data sources and multiple ML strategies, we demonstrate an efficient pathway for electrocatalyst discovery and optimization. This comprehensive, data-driven approach represents a paradigm shift and potentially a benchmark in electrocatalysts research.

  20. A pre-trained sound event detection neural network

    • search.datacite.org
    • figshare.com
    Updated Jul 26, 2017
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ian McLoughlin (2017). A pre-trained sound event detection neural network [Dataset]. http://doi.org/10.6084/m9.figshare.5245789
    Explore at:
    Dataset updated
    Jul 26, 2017
    Dataset provided by
    DataCitehttps://www.datacite.org/
    Figsharehttp://figshare.com/
    Authors
    Ian McLoughlin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a network trained earlier on clean data. It does not give very good results but is enough to show the system working. It can achieve above 90% on clean sounds and probably about 80% accuracy in 0dB SNR.
    The network was saved manually (using the MATLAB 'save' command) after running the training code, and before running the testing code.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
nasa.gov (2025). Peer-to-Peer Data Mining, Privacy Issues, and Games [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/peer-to-peer-data-mining-privacy-issues-and-games
Organization logo

Data from: Peer-to-Peer Data Mining, Privacy Issues, and Games

Related Article
Explore at:
Dataset updated
Feb 18, 2025
Dataset provided by
NASAhttp://nasa.gov/
Description

Peer-to-Peer (P2P) networks are gaining increasing popularity in many distributed applications such as file-sharing, network storage, web caching, sear- ching and indexing of relevant documents and P2P network-threat analysis. Many of these applications require scalable analysis of data over a P2P network. This paper starts by offering a brief overview of distributed data mining applications and algorithms for P2P environments. Next it discusses some of the privacy concerns with P2P data mining and points out the problems of existing privacy-preserving multi-party data mining techniques. It further points out that most of the nice assumptions of these existing privacy preserving techniques fall apart in real-life applications of privacy-preserving distributed data mining (PPDM). The paper offers a more realistic formulation of the PPDM problem as a multi-party game and points out some recent results.

Search
Clear search
Close search
Google apps
Main menu