https://academictorrents.com/nolicensespecified
The dataset consists of data produced by nine cyclists, exported directly from their Strava or Garmin Connect accounts. Sport activities are stored in GPX or TCX format, both of which are XML formats adapted to specific purposes. From each activity the following information can be obtained: GPS location, elevation, duration, distance, and average and maximal heart rate; some workouts also include data from power meters.
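As a rough starting point for working with such exports, the sketch below parses a single GPX activity with the gpxpy library and collects per-point location and elevation. The file name and the choice of gpxpy are assumptions, not part of the dataset description; TCX files and heart-rate or power extensions would need a different parser.

```python
# Minimal sketch: reading one GPX activity export (hypothetical file name) with gpxpy.
import gpxpy

with open("activity_0001.gpx") as f:   # hypothetical file name, not from the dataset
    gpx = gpxpy.parse(f)

points = []
for track in gpx.tracks:
    for segment in track.segments:
        for p in segment.points:
            points.append((p.time, p.latitude, p.longitude, p.elevation))

print(f"parsed {len(points)} track points")
if points:
    print("first point:", points[0])
```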
This paper proposes a scalable, local privacy-preserving algorithm for distributed Peer-to-Peer (P2P) data aggregation useful for many advanced data mining/analysis tasks such as average/sum computation, decision tree induction, feature selection, and more. Unlike most multi-party privacy-preserving data mining algorithms, this approach works asynchronously through local interactions and is highly scalable. It particularly deals with the distributed computation of the sum of a set of numbers stored at different peers in a P2P network, in the context of a P2P web mining application. The proposed optimization-based privacy-preserving technique for computing the sum allows different peers to specify different privacy requirements without having to adhere to a global set of parameters for the chosen privacy model. Since distributed sum computation is a frequently used primitive, the proposed approach is likely to have significant impact on many data mining tasks such as multi-party privacy-preserving clustering, frequent itemset mining, and statistical aggregate computation.
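To illustrate the distributed-sum primitive itself, the sketch below uses generic additive secret sharing: each peer splits its private value into random shares so that only the overall total is revealed. This is not the optimization-based technique proposed in the paper, just a minimal, centrally simulated illustration of the primitive.

```python
# Illustrative sketch only (not the paper's algorithm): sum across peers via
# additive secret sharing mod a large prime, simulated in one process.
import random

PRIME = 2**61 - 1  # arbitrary large modulus for the shares

def make_shares(value, n_peers):
    """Split a non-negative private value into n_peers additive shares mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_peers - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def private_sum(private_values):
    n = len(private_values)
    all_shares = [make_shares(v, n) for v in private_values]  # each peer splits its own value
    # Peer j collects the j-th share from every peer and publishes only their sum.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)]
    # Anyone can add the published partial sums; individual values are never exposed.
    return sum(partial_sums) % PRIME

values = [12, 7, 30, 5]     # hypothetical per-peer private numbers
print(private_sum(values))  # prints 54 without any peer revealing its own value
```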
Distributed data mining from privacy-sensitive multi-party data is likely to play an important role in the next generation of integrated vehicle health monitoring systems. For example, consider an airline manufacturer [tex]$\mathcal{C}$[/tex] manufacturing an aircraft model [tex]$A$[/tex] and selling it to five different airline operating companies [tex]$\mathcal{V}_1 \dots \mathcal{V}_5$[/tex]. These aircraft, during their operation, generate huge amounts of data. Mining this data can reveal useful information regarding the health and operability of the aircraft, which can be useful for disaster management and prediction of efficient operating regimes. Now if the manufacturer [tex]$\mathcal{C}$[/tex] wants to analyze the performance data collected from different aircraft of model type [tex]$A$[/tex] belonging to different airlines, then central collection of the data for subsequent analysis may not be an option. It should be noted that the result of this analysis may be statistically more significant if the data for aircraft model [tex]$A$[/tex] across all companies were available to [tex]$\mathcal{C}$[/tex]. The potential problems arising out of such a data mining scenario are:
https://academictorrents.com/nolicensespecified
A collection of sport activity datasets for data analysis and data mining 2017a
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset comes from a questionnaire structured into 24 questions, which can be accessed at https://forms.gle/bUgYMfoNHh7r6ebs6. The questionnaire was distributed to teachers and completed by 956 respondents, and it aims to analyze the online activities carried out during March and April 2020.
Each question is designed to reveal different aspects of the experiences, skills, and perspectives of teaching staff regarding online teaching and learning.
To protect the identity of the respondents and to obtain accurate responses, all data collected from teachers was anonymous. We did not collect any personal information whatsoever. This aspect was made clear to the respondents in the description of the questionnaire.
https://data.gov.tw/license
The Exploration and Mining Division provides a fee schedule for the provision of exploration data to external parties.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset with 72,000 pins from 117 users of Pinterest. Each pin contains a short raw text and an image. The images are processed using a pretrained Convolutional Neural Network and transformed into vectors of 4096 features.
This dataset was used in the paper "User Identification in Pinterest Through the Refinement of a Cascade Fusion of Text and Images" to identify specific users given their comments. The paper was published in the Research in Computing Science journal as part of the LKE 2017 conference. The dataset includes the splits used in the paper.
There are nine files. text_test, text_train and text_val contain the raw text of each pin in the corresponding split of the data. imag_test, imag_train and imag_val contain the image features of each pin in the corresponding split. train_user and val_test_users contain the index of the user of each pin (between 0 and 116). There is a one-to-one correspondence among the test, train and validation files for images, text and users. There are 400 pins per user in the train set, and 100 pins per user in each of the validation and test sets.
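A minimal loading sketch showing how the split files line up row by row. The serialization is an assumption here (plain whitespace-separated text files); adjust the readers to the actual format of the release.

```python
# Hedged sketch: align text, image features, and user labels by row index.
# File format is assumed to be plain whitespace-separated text; adapt as needed.
import numpy as np

with open("text_train") as f:
    train_text = [line.rstrip("\n") for line in f]      # one pin's raw text per line (assumed)

train_imgs = np.loadtxt("imag_train")                   # expected shape: (n_pins, 4096)
train_users = np.loadtxt("train_user", dtype=int)       # user index per pin, 0..116

assert len(train_text) == train_imgs.shape[0] == train_users.shape[0]
print("pins per user:", np.bincount(train_users))       # should be 400 for each of the 117 users
```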
If you have questions regarding the data, write to: jc dot gomez at ugto dot mx
Peer-to-Peer (P2P) networks are gaining increasing popularity in many distributed applications such as file-sharing, network storage, web caching, searching and indexing of relevant documents, and P2P network-threat analysis. Many of these applications require scalable analysis of data over a P2P network. This paper starts by offering a brief overview of distributed data mining applications and algorithms for P2P environments. Next it discusses some of the privacy concerns with P2P data mining and points out the problems of existing privacy-preserving multi-party data mining techniques. It further points out that most of the convenient assumptions of these existing privacy-preserving techniques fall apart in real-life applications of privacy-preserving distributed data mining (PPDM). The paper offers a more realistic formulation of the PPDM problem as a multi-party game and points out some recent results.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This research explored what happens when social media data mining becomes ordinary and is carried out by organisations that might be seen as the pillars of everyday life. The interviews on which the transcripts are based are discussed in Chapter 6 of the book. The referenced book contains a description of the methods. No other publications resulted from working with these transcripts.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of data mining analysis is always to find patterns in the data using certain kinds of techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before doing any work on the data, the data has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, performance did not improve much after using clustering prior to classification. The reason may be that the features we selected for clustering are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

From the dimensionality reduction perspective: this approach differs from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension can lose a lot of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all of the information.

From the perspective of creating new features: clustering analysis creates labels based on the patterns of the data, which brings uncertainty into the data. When clustering is used prior to classification, the choice of the number of clusters strongly affects the performance of the clustering, and in turn the performance of the classification. If the subset of features we apply clustering to is well suited for it, it might increase the overall classification performance. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not lock in the clustering outputs using a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the selected methods. Essentially, the ramification we saw was that our results are not much better than random when applying clustering in the data preprocessing. An example of this comparison is sketched below.

Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model's real-world effectiveness and also to continue to revise the models from time to time as things change.
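The sketch below shows the kind of comparison described above using scikit-learn on synthetic data; the dataset, classifier choice, number of clusters, and fixed random_state are stand-ins, not the project's actual pipeline. A baseline classifier on the raw features is compared against one that additionally receives the k-means cluster label as a feature.

```python
# Sketch: does adding k-means cluster labels as an extra feature help a classifier?
# Synthetic data and model choices are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=30, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Baseline: classify on the raw features.
base = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("baseline accuracy:", accuracy_score(y_te, base.predict(X_te)))

# Clustering prior to classification: append the k-means label as an extra feature.
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X_tr)
X_tr_aug = np.column_stack([X_tr, km.predict(X_tr)])
X_te_aug = np.column_stack([X_te, km.predict(X_te)])
aug = RandomForestClassifier(random_state=0).fit(X_tr_aug, y_tr)
print("with cluster feature:", accuracy_score(y_te, aug.predict(X_te_aug)))
```

Unlike the project described above, the sketch fixes random_state so the run is reproducible; dropping it and re-running several times is one way to probe how stable the clustering, and therefore the derived feature, actually is.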
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data mining-Mathematics is a book subject. It includes 13 books, written by 11 different authors.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
As semiconductor devices are miniaturized, the importance of atomic layer deposition (ALD) technology is growing. When designing ALD precursors, it is important to consider the melting point, because precursors should have melting points lower than the process temperature. However, obtaining melting point data is challenging due to experimental sensitivity and high computational costs. As a result, a comprehensive and well-organized database of melting points for OMCs has not yet been reported. Therefore, in this study, we constructed a database of melting points for 1,845 OMCs, covering 58 metal and 6 metalloid elements. The database contains CAS numbers, molecular formulas, and structural information and was constructed through automatic extraction and systematic curation. The melting point information was extracted using two methods: 1) 1,434 materials from 11 chemical vendor databases, and 2) 411 materials identified through natural language processing (NLP) techniques, with an accuracy of 86.3%, based on 2,096 scientific papers published over the past 29 years. In our database, the OMCs contain up to around 250 atoms and have melting points that range from −170 to 1610 °C. The main source is the Chemsrc database, accounting for 607 materials (32.9%), and Fe is the most common central metal or metalloid element (15.0%), followed by Si (11.6%) and B (6.7%). To validate the utility of the constructed database, a multimodal neural network model integrating graph-based and feature-based information as descriptors was developed to predict the melting points of the OMCs, but it achieved only moderate performance. We believe the current approach reduces the time and cost associated with manual data collection and processing, contributing to effective screening of potentially promising ALD precursors and providing crucial information for the advancement of the semiconductor industry.
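A highly simplified sketch of the fusion idea only: random placeholder arrays stand in for the paper's graph-based and feature-based descriptors, and an off-the-shelf MLP stands in for the multimodal network, so the reported error is meaningless; only the shape of the pipeline (concatenate two descriptor modalities, then regress the melting point) is shown.

```python
# Placeholder sketch: concatenate a graph-derived embedding with tabular features
# and regress the melting point. Data and model are stand-ins, not the paper's.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 1845                                  # number of OMCs in the database
graph_emb = rng.normal(size=(n, 64))      # stand-in for graph-based descriptors
tab_feats = rng.normal(size=(n, 12))      # stand-in for feature-based descriptors
mp = rng.uniform(-170, 1610, size=n)      # random targets within the reported range (deg C)

X = np.hstack([graph_emb, tab_feats])     # simple "multimodal" fusion by concatenation
X_tr, X_te, y_tr, y_te = train_test_split(X, mp, test_size=0.2, random_state=0)

model = MLPRegressor(hidden_layer_sizes=(128, 64), max_iter=500, random_state=0).fit(X_tr, y_tr)
# With random placeholders the score tells us nothing; it only exercises the pipeline.
print("MAE on held-out set:", mean_absolute_error(y_te, model.predict(X_te)))
```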
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MusicOSet is an open and enhanced dataset of musical elements (artists, songs and albums) based on musical popularity classification. It provides a directly accessible collection of data suitable for numerous tasks in music data mining (e.g., data visualization, classification, clustering, similarity search, MIR, HSS and so forth). To create MusicOSet, the potential information sources were divided into three main categories: music popularity sources, metadata sources, and acoustic and lyrical feature sources. Data from all three categories were initially collected between January and May 2019 and then updated and enhanced in June 2019.
The attractive features of MusicOSet include:
| Data | # Records |
|:-----------------:|:---------:|
| Songs | 20,405 |
| Artists | 11,518 |
| Albums | 26,522 |
| Lyrics | 19,664 |
| Acoustic Features | 20,405 |
| Genres | 1,561 |
NASA has some of the largest and most complex data sources in the world, ranging from the earth sciences and space sciences to massive distributed engineering data sets from commercial aircraft and spacecraft. This talk will discuss some of the issues and algorithms developed to analyze and discover patterns in these data sets. We will also provide an overview of a large research program in Integrated Vehicle Health Management. The goal of this program is to develop advanced technologies to automatically detect, diagnose, predict, and mitigate adverse events during the flight of an aircraft. A case study will be presented on a recent data mining analysis performed to support the Flight Readiness Review of the Space Shuttle Mission STS-119.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set belongs to the paper "Video-to-Model: Unsupervised Trace Extraction from Videos for Process Discovery and Conformance Checking in Manual Assembly", submitted on March 24, 2020, to the 18th International Conference on Business Process Management (BPM).

Abstract: Manual activities are often hidden deep down in discrete manufacturing processes. For the elicitation and optimization of process behavior, complete information about the execution of manual activities is required. Thus, an approach is presented for extracting execution-level information from videos of manual assembly. The goal is the generation of a log that can be used in state-of-the-art process mining tools. The test bed for the system was lightweight and scalable, consisting of an assembly workstation equipped with a single RGB camera recording only the hand movements of the worker from the top. A neural-network-based real-time object classifier was trained to detect the worker's hands. The hand detector delivers the input for an algorithm which generates trajectories reflecting the movement paths of the hands. Those trajectories are automatically assigned to work steps using the position of material boxes on the assembly shelf as reference points and hierarchical clustering of similar behaviors with dynamic time warping. The system was evaluated in a task-based study with ten participants in a laboratory, but under realistic conditions. The generated logs were loaded into the process mining toolkit ProM to discover the underlying process model and to detect deviations from both instructions and ground truth using conformance checking. The results show that process mining delivers insights about the assembly process and the system's precision.

The data set contains the generated and the annotated logs based on the video material gathered during the user study. In addition, the Petri nets from the process discovery and conformance checking conducted with ProM (http://www.promtools.org) and the reference nets modeled with Yasper (http://www.yasper.org/) are provided.
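A small sketch of the general trajectory-grouping idea, not the authors' implementation: pairwise dynamic time warping distances between 2-D hand trajectories, followed by hierarchical clustering with SciPy. The trajectories here are random placeholders standing in for the extracted hand movement paths.

```python
# Sketch of the general idea only: DTW distances between 2-D trajectories,
# then hierarchical clustering of similar movements.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dtw(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping on 2-D point sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(0)
# Random walk placeholders for extracted hand trajectories of varying length.
trajectories = [rng.normal(size=(rng.integers(20, 40), 2)).cumsum(axis=0) for _ in range(12)]

k = len(trajectories)
dist = np.zeros((k, k))
for i in range(k):
    for j in range(i + 1, k):
        dist[i, j] = dist[j, i] = dtw(trajectories[i], trajectories[j])

# Average-linkage clustering on the condensed DTW distance matrix, cut into 3 groups.
labels = fcluster(linkage(squareform(dist), method="average"), t=3, criterion="maxclust")
print("cluster label per trajectory:", labels)
```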
Retrofitting is an essential element of any comprehensive strategy for improving residential energy efficiency. The residential retrofit market is still developing, and program managers must develop innovative strategies to increase uptake and promote economies of scale. Residential retrofitting remains a challenging proposition to sell to homeowners, because awareness levels are low and financial incentives are lacking. The U.S. Department of Energy's Building America research team, Alliance for Residential Building Innovation (ARBI), implemented a project to increase residential retrofits in Davis, California. The project used a neighborhood-focused strategy for implementation and a low-cost retrofit program that focused on upgraded attic insulation and duct sealing. ARBI worked with a community partner, the not-for-profit Cool Davis Initiative, as well as selected area contractors, to implement a strategy that sought to capitalize on the strong local expertise of partners and the unique aspects of the Davis, California, community. Working with community partners also allowed ARBI to collect and analyze data about effective messaging tactics for community-based retrofit programs. ARBI expected this project, called Retrofit Your Attic, to achieve higher uptake than other retrofit projects, because it emphasized a low-cost, one-measure retrofit program. However, this was not the case. The program used a strategy that focused on attics (including air sealing, duct sealing, and attic insulation) as a low-cost entry for homeowners to complete home retrofits. The price was kept below $4,000 after incentives; both contractors in the program offered the same price. The program completed only five retrofits. Interestingly, none of those homeowners used the one-measure strategy. All five homeowners were concerned about cost, comfort, and energy savings and included additional measures in their retrofits. The low-cost, one-measure strategy did not increase uptake among homeowners, even in a well-educated, affluent community such as Davis. This project has two primary components. One is to complete attic retrofits on a community scale in the hot-dry climate of Davis, CA. Sufficient data will be collected on these projects to include them in the BAFDR. Additionally, ARBI is working with contractors to obtain building and utility data from a large set of retrofit projects in CA (hot-dry). These projects are to be uploaded into the BAFDR.
RapidMiner process files and an XML test set, including the predicted labels, for the Linked Data Mining Challenge 2015.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books and is filtered to the book subject Data mining-Social aspects. It features 9 columns, including author, BNB id, book, book publisher, and book subjects. The preview is ordered by publication date (descending).
PADMINI: A Peer-to-Peer Distributed Astronomy Data Mining System and a Case Study. Tushar Mahule, Kirk Borne, Sandipan Dey, Sugandha Arora, and Hillol Kargupta. Abstract: Peer-to-Peer (P2P) networks are appealing for astronomy data mining from virtual observatories because of the large volume of the data, compute-intensive tasks, potentially large number of users, and distributed nature of the data analysis process. This paper offers a brief overview of PADMINI, a Peer-to-Peer Astronomy Data MINIng system. It also presents a case study on PADMINI for distributed outlier detection using astronomy data. PADMINI is a web-based system powered by Google Sky and distributed data mining algorithms that run on a collection of computing nodes. The paper offers a case study of PADMINI evaluating the architecture and the performance of the overall system. Detailed experimental results are presented in order to document the utility and scalability of the system.
This statistic displays the various applications of data analytics and mining across procurement processes, according to chief procurement officers (CPOs) worldwide, as of 2017. Fifty-seven percent of the CPOs asked agreed that data analytics and mining had been applied to intelligent and advanced analytics for negotiations, and 40 percent of them indicated data analytics and mining had been applied to supplier portfolio optimization processes.