100+ datasets found
  1. Privacy Preserving Distributed Data Mining

    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • datadiscoverystudio.org
    • +2 more
    Updated Feb 18, 2025
    Cite
    data.staging.idas-ds1.appdat.jsc.nasa.gov (2025). Privacy Preserving Distributed Data Mining [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/privacy-preserving-distributed-data-mining
    Explore at:
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    Distributed data mining from privacy-sensitive multi-party data is likely to play an important role in the next generation of integrated vehicle health monitoring systems. For example, consider an aircraft manufacturer $\mathcal{C}$ that builds an aircraft model $A$ and sells it to five different airline operating companies $\mathcal{V}_1 \dots \mathcal{V}_5$. During operation, these aircraft generate huge amounts of data. Mining this data can reveal useful information regarding the health and operability of the aircraft, which can support disaster management and the prediction of efficient operating regimes. If the manufacturer $\mathcal{C}$ wants to analyze the performance data collected from different aircraft of model type $A$ belonging to different airlines, then central collection of the data for subsequent analysis may not be an option, even though the result of this analysis would likely be statistically more significant if the data for aircraft model $A$ across all companies were available to $\mathcal{C}$. The potential problems arising out of such a data mining scenario are:

  2. DataSheet_2_The TargetMine Data Warehouse: Enhancement and Updates.pdf

    • figshare.com
    pdf
    Updated May 31, 2023
    + more versions
    Cite
    Yi-An Chen; Lokesh P. Tripathi; Takeshi Fujiwara; Tatsuya Kameyama; Mari N. Itoh; Kenji Mizuguchi (2023). DataSheet_2_The TargetMine Data Warehouse: Enhancement and Updates.pdf [Dataset]. http://doi.org/10.3389/fgene.2019.00934.s002
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Yi-An Chen; Lokesh P. Tripathi; Takeshi Fujiwara; Tatsuya Kameyama; Mari N. Itoh; Kenji Mizuguchi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Biological data analysis is the key to new discoveries in disease biology and drug discovery. The rapid proliferation of high-throughput ‘omics’ data has created a need for tools and platforms that allow researchers to combine and analyse different types of biological data and obtain biologically relevant knowledge. We previously developed TargetMine, an integrative data analysis platform for target prioritisation and broad-based biological knowledge discovery. Here, we describe the newly modelled biological data types and the enhanced visual and analytical features of TargetMine. These enhancements include an expanded coverage of gene–gene relations, small molecule metabolite to pathway mappings, an improved literature survey feature, and in silico prediction of gene functional associations such as protein–protein interactions and global gene co-expression. We also describe two usage examples on trans-omics data analysis and extraction of gene–disease associations using MeSH term descriptors. These examples demonstrate how the new enhancements in TargetMine contribute to a more expansive coverage of the biological data space and can help interpret genotype–phenotype relations. TargetMine with its auxiliary toolkit is available at https://targetmine.mizuguchilab.org. The TargetMine source code is available at https://github.com/chenyian-nibio/targetmine-gradle.

  3. Data Mining Tools Global Market Report 2025

    • thebusinessresearchcompany.com
    pdf,excel,csv,ppt
    Updated Jan 14, 2025
    Cite
    The Business Research Company (2025). Data Mining Tools Global Market Report 2025 [Dataset]. https://www.thebusinessresearchcompany.com/report/data-mining-tools-global-market-report
    Explore at:
    Available download formats: pdf, excel, csv, ppt
    Dataset updated
    Jan 14, 2025
    Dataset authored and provided by
    The Business Research Company
    License

    https://www.thebusinessresearchcompany.com/privacy-policy

    Description

    The Data Mining Tools Market is projected to grow at 12.9% CAGR, reaching $2.13 Billion by 2029. Where is the industry heading next? Get the sample report now!

  4. Educational Attainment in North Carolina Public Schools: Use of statistical...

    • data.mendeley.com
    Updated Nov 14, 2018
    Cite
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Explore at:
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    North Carolina
    Description

    The purpose of data mining analysis is always to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset: before doing any work on the data, it usually has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. In our project, using clustering prior to classification did not improve performance much; a likely reason is that the features we selected for clustering are not well suited to it. Because of the nature of the data, classification tasks provide more information to work with in terms of improving knowledge and overall performance metrics.

    From the dimensionality reduction perspective: clustering differs from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension loses a lot of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses pretty much all meaning. Therefore "reducing" dimensionality by mapping data points to cluster numbers is not always a good idea, since you may lose almost all the information.

    From the perspective of creating new features: clustering analysis creates labels based on the patterns in the data, which brings uncertainty into the data. When clustering is used prior to classification, the decision on the number of clusters strongly affects the performance of the clustering and, in turn, the performance of the classification. If the subset of features used for clustering is well suited to it, it might increase the overall classification performance; for example, if the features used for k-means are numerical and the dimension is small, the overall classification performance may be better. We did not lock in the clustering outputs with a random_state, in order to see whether they were stable. Our assumption was that if the results vary strongly from run to run, which they definitely did, the data may simply not cluster well with the methods selected. The practical consequence we observed was that our results were not much better than random when applying clustering in the data preprocessing.

    Finally, it is important to ensure a feedback loop is in place to continuously collect the same data, in the same format, from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to revise the models from time to time as things change.
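    The trade-off described above can be illustrated with a short, hedged sketch (not the authors' code): k-means cluster labels are appended as an extra engineered feature before classification, and cross-validated accuracy is compared with and without them. The dataset and all parameters below are placeholders.

```python
# Illustrative sketch only: k-means cluster labels as an extra feature before
# classification, compared against the raw features. Not the authors' pipeline;
# the dataset and parameter choices here are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=0)          # stand-in for the school data
X_scaled = StandardScaler().fit_transform(X)

# Baseline: classify on the raw (scaled) features.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
baseline = cross_val_score(clf, X_scaled, y, cv=5).mean()

# Clustering as feature creation: append the k-means label as a new column.
# (Fixing random_state here is what the description says was NOT done in the
# project, which is why the cluster assignments varied between runs.)
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_scaled)
X_aug = np.column_stack([X_scaled, labels])
augmented = cross_val_score(clf, X_aug, y, cv=5).mean()

print(f"accuracy without cluster feature: {baseline:.3f}")
print(f"accuracy with cluster feature:    {augmented:.3f}")
```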

  5. Datasets: Mining chemical activity status from high-throughput screening...

    • figshare.com
    zip
    Updated Jan 20, 2016
    + more versions
    Cite
    Othman Soufan (2016). Datasets: Mining chemical activity status from high-throughput screening assays [Dataset]. http://doi.org/10.6084/m9.figshare.1598200.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 20, 2016
    Dataset provided by
    figshare
    Authors
    Othman Soufan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    High-throughput screening (HTS) experiments provide a valuable resource that reports the biological activity of numerous chemical compounds relative to their molecular targets. Building computational models that accurately predict such activity status (active vs. inactive) in specific assays is a challenging task given the large volume of data and the frequently small proportion of active compounds relative to inactive ones. We developed a method, DRAMOTE, to predict the activity status of chemical compounds in HTS activity assays. For a class of HTS assays, our method achieves considerably better results than the current state-of-the-art solutions. We achieved this by modifying a minority oversampling technique. To demonstrate that DRAMOTE performs better than the other methods, we carried out a comprehensive comparison with several other methods and evaluated them on data from 11 PubChem assays through 1,350 experiments that involved approximately 500,000 interactions between chemicals and their target proteins. As an example of potential use, we applied DRAMOTE to develop robust models for predicting FDA-approved drugs that have a high probability of interacting with the thyroid stimulating hormone receptor (TSHR) in humans. Our findings are further partially and indirectly supported by 3D docking results and literature information. The results based on approximately 500,000 interactions suggest that DRAMOTE performed the best and that it can be used for developing robust virtual screening models. The datasets and implementation of all solutions are available as a MATLAB toolbox online at www.cbrc.kaust.edu.sa/dramote.
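    DRAMOTE itself is distributed as the MATLAB toolbox linked above; the sketch below only illustrates the general idea of minority oversampling before classification (SMOTE-style interpolation between minority-class neighbours) on placeholder data, and is not the DRAMOTE algorithm.

```python
# Generic minority-oversampling sketch (SMOTE-like interpolation), illustrating
# the idea of rebalancing an HTS-style active/inactive dataset before training.
# This is NOT the DRAMOTE algorithm; data and parameters are placeholders.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_inactive = rng.normal(0.0, 1.0, size=(950, 10))   # majority class (inactive)
X_active = rng.normal(1.5, 1.0, size=(50, 10))      # rare minority class (active)

def oversample_minority(X_min, n_new, k=5, rng=rng):
    """Create synthetic minority points by interpolating towards a random
    minority-class neighbour, as SMOTE does."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)                    # idx[:, 0] is the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]           # one of its k nearest neighbours
        lam = rng.random()
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_new = oversample_minority(X_active, n_new=len(X_inactive) - len(X_active))
X = np.vstack([X_inactive, X_active, X_new])
y = np.array([0] * len(X_inactive) + [1] * (len(X_active) + len(X_new)))

clf = LinearSVC(C=1.0, max_iter=5000).fit(X, y)      # activity-status classifier
print("training accuracy on balanced data:", clf.score(X, y))
```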

  6. Confusion matrix.

    • figshare.com
    xls
    Updated Jul 7, 2023
    + more versions
    Cite
    Shaoxia Mou; Heming Zhang (2023). Confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0288140.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jul 7, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Shaoxia Mou; Heming Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Due to the inherent characteristics of cumulative sequences of unbalanced data, the mining results for this kind of data are often skewed towards the majority categories, resulting in a decline in mining performance. To address this problem, the performance of data cumulative sequence mining is optimised: an algorithm for mining cumulative sequences of unbalanced data based on probability matrix decomposition is studied. The natural nearest neighbours of the minority samples in the unbalanced data cumulative sequence are determined, and the minority samples are clustered according to the natural nearest neighbour relationship. Within each cluster, new samples are generated from the core points of dense regions and the non-core points of sparse regions, and the new samples are added to the original data cumulative sequence to balance it. The probability matrix decomposition method is used to generate two random matrices with Gaussian distributions from the balanced cumulative sequence, and a linear combination of low-dimensional eigenvectors is used to explain the preference of specific users for the data sequence. At the same time, from a global perspective, the AdaBoost idea is used to adaptively adjust the sample weights and optimise the probability matrix decomposition algorithm, optimising global errors as well as single-sample errors. Experimental results show that the algorithm can effectively generate new samples, reduce the imbalance of the data cumulative sequence, and obtain more accurate mining results. The minimum RMSE is obtained when the decomposition dimension is 5. The proposed algorithm has good classification performance on the balanced data cumulative sequence, and its average ranking on the F value, G-mean and AUC indices is the best.
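    As a rough, illustrative sketch of the matrix-decomposition step described above (and not the paper's implementation), the following fits two Gaussian-initialised low-rank factors by weighted gradient descent, with per-entry weights standing in for the AdaBoost-style reweighting; all data and dimensions are placeholders.

```python
# Minimal probabilistic-matrix-factorisation sketch: R ~ U @ V.T with two
# Gaussian-initialised low-rank factors, fitted by weighted gradient descent.
# The per-entry weights w stand in for the AdaBoost-style reweighting mentioned
# above. Illustrative only; not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 60, 40, 5                # decomposition dimension k = 5
R = rng.random((n_users, n_items))             # placeholder preference matrix
w = np.ones_like(R)                            # sample weights (updated adaptively)

U = rng.normal(0, 0.1, (n_users, k))           # Gaussian-distributed factors
V = rng.normal(0, 0.1, (n_items, k))
lr, reg = 0.02, 0.05

for epoch in range(200):
    E = R - U @ V.T                            # residual matrix
    # Weighted gradient step with L2 regularisation.
    U += lr * ((w * E) @ V - reg * U)
    V += lr * ((w * E).T @ U - reg * V)
    # Crude adaptive reweighting: emphasise poorly reconstructed entries,
    # loosely in the spirit of AdaBoost's weight updates.
    w = 1.0 + np.abs(E) / (np.abs(E).mean() + 1e-12)

rmse = np.sqrt(np.mean((R - U @ V.T) ** 2))
print(f"RMSE at k={k}: {rmse:.4f}")
```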

  7. Data from: Mining Distance-Based Outliers in Near Linear Time

    • data.nasa.gov
    • datasets.ai
    • +2 more
    application/rdfxml +5
    Updated Jun 26, 2018
    + more versions
    Cite
    (2018). Mining Distance-Based Outliers in Near Linear Time [Dataset]. https://data.nasa.gov/dataset/Mining-Distance-Based-Outliers-in-Near-Linear-Time/kkj8-xyjb
    Explore at:
    Available download formats: csv, json, xml, tsv, application/rssxml, application/rdfxml
    Dataset updated
    Jun 26, 2018
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule

    Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
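    The following is a hedged reimplementation sketch of the idea in the abstract, not the released code: examples are visited in random order, an example's outlier score is its distance to its k-th nearest neighbour, and the inner loop abandons an example as soon as that score can no longer exceed the weakest of the current top-n outliers.

```python
# Illustrative sketch of randomised nested-loop outlier mining with pruning:
# an example's outlier score is its (squared) distance to its k-th nearest
# neighbour, and the inner loop abandons an example as soon as that score drops
# below the weakest of the current top-n outliers. Not the released code.
import heapq
import numpy as np

def top_outliers(X, k=5, n_outliers=10, rng=np.random.default_rng(0)):
    X = X[rng.permutation(len(X))]               # random order is key to pruning
    top = []                                     # min-heap of (score, index)
    cutoff = 0.0
    for i, x in enumerate(X):
        knn = []                                 # max-heap (negated) of k smallest distances
        pruned = False
        for j, other in enumerate(X):
            if i == j:
                continue
            d = float(np.sum((x - other) ** 2))
            if len(knn) < k:
                heapq.heappush(knn, -d)
            elif d < -knn[0]:
                heapq.heapreplace(knn, -d)
            # Pruning rule: once the running k-th NN distance is already below
            # the cutoff, x cannot be one of the top-n outliers.
            if len(knn) == k and -knn[0] < cutoff:
                pruned = True
                break
        if not pruned:
            score = -knn[0]                      # distance to k-th nearest neighbour
            heapq.heappush(top, (score, i))
            if len(top) > n_outliers:
                heapq.heappop(top)
            if len(top) == n_outliers:
                cutoff = top[0][0]
    return sorted(top, reverse=True)

data = np.vstack([np.random.default_rng(1).normal(size=(500, 3)),
                  np.array([[8.0, 8.0, 8.0]])])  # one obvious outlier
print(top_outliers(data)[:3])
```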

  8. Data Mining for IVHM using Sparse Binary Ensembles, Phase I

    • data.nasa.gov
    application/rdfxml +5
    Updated Jun 26, 2018
    Cite
    (2018). Data Mining for IVHM using Sparse Binary Ensembles, Phase I [Dataset]. https://data.nasa.gov/dataset/Data-Mining-for-IVHM-using-Sparse-Binary-Ensembles/qfus-evzq
    Explore at:
    Available download formats: xml, tsv, csv, application/rssxml, application/rdfxml, json
    Dataset updated
    Jun 26, 2018
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    In response to NASA SBIR topic A1.05, "Data Mining for Integrated Vehicle Health Management", Michigan Aerospace Corporation (MAC) asserts that our unique SPADE (Sparse Processing Applied to Data Exploitation) technology meets a significant fraction of the stated criteria and has functionality that enables it to handle many applications within the aircraft lifecycle. SPADE distills input data into highly quantized features and uses MAC's novel techniques for constructing Ensembles of Decision Trees to develop extremely accurate diagnostic/prognostic models for classification, regression, clustering, anomaly detection and semi-supervised learning tasks. These techniques are currently being employed to do Threat Assessment for satellites in conjunction with researchers at the Air Force Research Lab. Significant advantages to this approach include: 1) completely data driven; 2) training and evaluation are faster than conventional methods; 3) operates effectively on huge datasets (> billion samples X > million features), 4) proven to be as accurate as state-of-the-art techniques in many significant real-world applications. The specific goals for Phase 1 will be to work with domain experts at NASA and with our partners Boeing, SpaceX and GMV Space Systems to delineate a subset of problems that are particularly well-suited to this approach and to determine requirements for deploying algorithms on platforms of opportunity.
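    A generic sketch of the two ingredients named above, coarse feature quantisation followed by an ensemble of decision trees, is shown below; it is not MAC's SPADE, and the data and bin counts are placeholders.

```python
# Sketch of the general idea only (not MAC's SPADE): coarsely quantise the
# input features into a few discrete levels, then train an ensemble of
# decision trees on the quantised representation. Data are placeholders.
from sklearn.datasets import make_classification
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Highly quantised features: each continuous input mapped to 4 ordinal levels.
quant = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile").fit(X_tr)
ensemble = ExtraTreesClassifier(n_estimators=300, random_state=0)
ensemble.fit(quant.transform(X_tr), y_tr)
print("held-out accuracy:", ensemble.score(quant.transform(X_te), y_te))
```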

  9. Spatial Data Mining: Predicting Mud Content across the Southwest Australian...

    • ecat.ga.gov.au
    Updated Jan 1, 2009
    Cite
    Commonwealth of Australia (Geoscience Australia) (2009). Spatial Data Mining: Predicting Mud Content across the Southwest Australian Margin [Dataset]. https://ecat.ga.gov.au/geonetwork/static/api/records/a05f7892-eced-7506-e044-00144fdd4fa6
    Explore at:
    Dataset updated
    Jan 1, 2009
    Dataset provided by
    Geoscience Australia (http://ga.gov.au/)
    EGD
    Area covered
    Australia
    Description

    Spatially continuous data on environmental variables are often required for marine conservation and management. However, information on environmental variables is usually collected by point sampling, particularly for the deep ocean. Thus, methods that generate such spatially continuous data by using point samples to estimate values for unknown locations become essential tools. Such methods are, however, often data- or even variable-specific, and it is difficult to select an appropriate method for any given dataset. In this study, 14 methods (37 sub-methods) are compared using samples of mud content with five levels of sample density across the southwest Australian margin. Bathymetry, distance to coast, and slope were used as secondary variables. Ten-fold cross-validation with relative mean absolute error (RMAE) and visual examination were used to assess the performance of these methods. A total of 1,850 prediction datasets were produced and used to assess the performance of the methods. Considering both the accuracy and the visual examination, we found that a combined method, random forest and ordinary kriging (RKrf), is the most robust method. No threshold in sample density was detected in relation to prediction accuracy. No consistent patterns were observed between the performance of the methods and data variation. The RMAE of the three most accurate methods is about 30% lower than that of the best methods in previous publications, highlighting the robustness of the methods selected in this study. The limitations of this study were discussed and a number of suggestions were provided for further studies.
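    A hedged sketch of the "random forest plus kriging of residuals" combination found most robust above: a forest fitted on the secondary variables, with spatial interpolation of its residuals (a Gaussian process is used here as a simple stand-in for ordinary kriging) and RMAE computed as mean absolute error relative to the mean observed value. All data below are synthetic placeholders, not the Geoscience Australia samples.

```python
# Sketch of regression-kriging-style prediction (random forest on secondary
# variables plus spatial interpolation of residuals). A Gaussian process is
# used here as a simple stand-in for ordinary kriging; data are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 800
coords = rng.uniform(0, 100, size=(n, 2))                  # x, y locations
covars = rng.normal(size=(n, 3))                           # bathymetry, distance to coast, slope
mud = 30 + 5 * covars[:, 0] - 3 * covars[:, 2] + rng.normal(0, 2, n)  # % mud (synthetic)

c_tr, c_te, x_tr, x_te, y_tr, y_te = train_test_split(coords, covars, mud, random_state=0)

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(x_tr, y_tr)
resid = y_tr - rf.predict(x_tr)

# Interpolate the residual field over space (stand-in for ordinary kriging).
gp = GaussianProcessRegressor(kernel=RBF(20.0) + WhiteKernel(1.0), normalize_y=True)
gp.fit(c_tr, resid)

pred = rf.predict(x_te) + gp.predict(c_te)
rmae = np.mean(np.abs(pred - y_te)) / np.mean(np.abs(y_te)) * 100   # relative MAE, %
print(f"RMAE: {rmae:.1f}%")
```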

  10. LScD (Leicester Scientific Dictionary)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    + more versions
    Cite
    Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
    Explore at:
    Available download formats: docx
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    LScD (Leicester Scientific Dictionary)April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk/suzenneslihan@hotmail.com)Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus) - Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in description of Version 2 below. We did not repeat the explanation. After pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are also same as described as for LScD Version 2 below.* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2[Version 2] Getting StartedThis document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). This dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for usage of the code are available in [2]. The code can be also used for list of texts from other sources, amendments to the code may be required.LSC is a collection of abstracts of articles and proceeding papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.LScD is an ordered list of words from texts of abstracts in LSC.The dictionary stores 974,238 unique words, is sorted by the number of documents containing the word in descending order. All words in the LScD are in stemmed form of words. The LScD contains the following information:1.Unique words in abstracts2.Number of documents containing each word3.Number of appearance of a word in the entire corpusProcessing the LSCStep 1.Downloading the LSC Online: Use of the LSC is subject to acceptance of request of the link by email. To access the LSC for research purposes, please email to ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.Step 2.Importing the Corpus to R: The full R code for processing the corpus can be found in the GitHub [2].All following steps can be applied for arbitrary list of texts from any source with changes of parameter. The structure of the corpus such as file format and names (also the position) of fields should be taken into account to apply our code. The organisation of CSV files of LSC is described in README file for LSC [1].Step 3.Extracting Abstracts and Saving Metadata: Metadata that include all fields in a document excluding abstracts and the field of abstracts are separated. Metadata are then saved as MetaData.R. 
Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.Step 4.Text Pre-processing Steps on the Collection of Abstracts: In this section, we presented our approaches to pre-process abstracts of the LSC.1.Removing punctuations and special characters: This is the process of substitution of all non-alphanumeric characters by space. We did not substitute the character “-” in this step, because we need to keep words like “z-score”, “non-payment” and “pre-processing” in order not to lose the actual meaning of such words. A processing of uniting prefixes with words are performed in later steps of pre-processing.2.Lowercasing the text data: Lowercasing is performed to avoid considering same words like “Corpus”, “corpus” and “CORPUS” differently. Entire collection of texts are converted to lowercase.3.Uniting prefixes of words: Words containing prefixes joined with character “-” are united as a word. The list of prefixes united for this research are listed in the file “list_of_prefixes.csv”. The most of prefixes are extracted from [4]. We also added commonly used prefixes: ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.4.Substitution of words: Some of words joined with “-” in the abstracts of the LSC require an additional process of substitution to avoid losing the meaning of the word before removing the character “-”. Some examples of such words are “z-test”, “well-known” and “chi-square”. These words have been substituted to “ztest”, “wellknown” and “chisquare”. Identification of such words is done by sampling of abstracts form LSC. The full list of such words and decision taken for substitution are presented in the file “list_of_substitution.csv”.5.Removing the character “-”: All remaining character “-” are replaced by space.6.Removing numbers: All digits which are not included in a word are replaced by space. All words that contain digits and letters are kept because alphanumeric characters such as chemical formula might be important for our analysis. Some examples are “co2”, “h2o” and “21st”.7.Stemming: Stemming is the process of converting inflected words into their word stem. This step results in uniting several forms of words with similar meaning into one form and also saving memory space and time [5]. All words in the LScD are stemmed to their word stem.8.Stop words removal: Stop words are words that are extreme common but provide little value in a language. Some common stop words in English are ‘I’, ‘the’, ‘a’ etc. We used ‘tm’ package in R to remove stop words [6]. There are 174 English stop words listed in the package.Step 5.Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written in the file “LScD.csv”.The Organisation of the LScDThe total number of words in the file “LScD.csv” is 974,238. Each field is described below:Word: It contains unique words from the corpus. All words are in lowercase and their stem forms. The field is sorted by the number of documents that contain words in descending order.Number of Documents Containing the Word: In this content, binary calculation is used: if a word exists in an abstract then there is a count of 1. If the word exits more than once in a document, the count is still 1. 
Total number of document containing the word is counted as the sum of 1s in the entire corpus.Number of Appearance in Corpus: It contains how many times a word occurs in the corpus when the corpus is considered as one large document.Instructions for R CodeLScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as RData file and in CSV format. Outputs of the code are:Metadata File: It includes all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.File of Abstracts: It contains all abstracts after pre-processing steps defined in the step 4.DTM: It is the Document Term Matrix constructed from the LSC[6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.LScD: An ordered list of words from LSC as defined in the previous section.The code can be used by:1.Download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’2.Open LScD_Creation.R script3.Change parameters in the script: replace with the full path of the directory with source files and the full path of the directory to write output files4.Run the full code.References[1]N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1[2]N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION[3]Web of Science. (15 July). Available: https://apps.webofknowledge.com/[4]A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.[5]C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved porter’s stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.[6]I. Feinerer, "Introduction to the tm Package Text Mining in R," Accessible en ligne: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
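    The published pipeline is the R script LScD_Creation.R referenced in [2]; the following is only a rough Python sketch of the same pre-processing sequence (punctuation removal, lowercasing, hyphen and prefix handling, digit removal, stemming, stop-word removal, and document counts), with tiny placeholder prefix, substitution and stop-word lists rather than the published CSVs.

```python
# Rough Python sketch of the pre-processing sequence described above
# (the actual LScD pipeline is the R script LScD_Creation.R linked in [2]).
# Prefix/substitution/stop-word lists here are tiny placeholders.
import re
from collections import Counter
from nltk.stem import PorterStemmer

abstracts = ["Pre-processing of z-score data is well-known in text mining.",
             "We measured CO2 uptake 21 times in the corpus."]
prefixes = {"pre", "non", "self"}                  # placeholder for list_of_prefixes.csv
substitutions = {"z-score": "zscore", "well-known": "wellknown"}  # list_of_substitution.csv
stop_words = {"of", "is", "in", "the", "we"}       # stand-in for the tm package's list
stemmer = PorterStemmer()

def preprocess(text):
    text = re.sub(r"[^\w\s-]", " ", text).lower()          # drop punctuation, keep '-' for now
    for word, repl in substitutions.items():               # unite special hyphenated words
        text = text.replace(word, repl)
    text = re.sub(r"\b(" + "|".join(prefixes) + r")-", r"\1", text)  # unite prefixes
    text = text.replace("-", " ")                           # remove remaining hyphens
    text = re.sub(r"\b\d+\b", " ", text)                    # remove digit-only tokens, keep "co2"
    return [stemmer.stem(t) for t in text.split() if t not in stop_words]

doc_tokens = [preprocess(a) for a in abstracts]
doc_freq = Counter(w for toks in doc_tokens for w in set(toks))    # documents containing word
total_freq = Counter(w for toks in doc_tokens for w in toks)       # total occurrences
for word, df in doc_freq.most_common():
    print(word, df, total_freq[word])
```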

  11. LScDC Word-Category RIG Matrix

    • figshare.le.ac.uk
    pdf
    Updated Apr 28, 2020
    Cite
    LScDC Word-Category RIG Matrix [Dataset]. https://figshare.le.ac.uk/articles/dataset/LScDC_Word-Category_RIG_Matrix/12133431
    Explore at:
    Available download formats: pdf
    Dataset updated
    Apr 28, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LScDC Word-Category RIG MatrixApril 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)Supervised by Prof Alexander Gorban and Dr Evgeny MirkesGetting StartedThis file describes the Word-Category RIG Matrix for theLeicester Scientific Corpus (LSC) [1], the procedure to build the matrix and introduces the Leicester Scientific Thesaurus (LScT) with the construction process. The Word-Category RIG Matrix is a 103,998 by 252 matrix, where rows correspond to words of Leicester Scientific Dictionary-Core (LScDC) [2] and columns correspond to 252 Web of Science (WoS) categories [3, 4, 5]. Each entry in the matrix corresponds to a pair (category,word). Its value for the pair shows the Relative Information Gain (RIG) on the belonging of a text from the LSC to the category from observing the word in this text. The CSV file of Word-Category RIG Matrix in the published archive is presented with two additional columns of the sum of RIGs in categories and the maximum of RIGs over categories (last two columns of the matrix). So, the file ‘Word-Category RIG Matrix.csv’ contains a total of 254 columns.This matrix is created to be used in future research on quantifying of meaning in scientific texts under the assumption that words have scientifically specific meanings in subject categories and the meaning can be estimated by information gains from word to categories. LScT (Leicester Scientific Thesaurus) is a scientific thesaurus of English. The thesaurus includes a list of 5,000 words from the LScDC. We consider ordering the words of LScDC by the sum of their RIGs in categories. That is, words are arranged in their informativeness in the scientific corpus LSC. Therefore, meaningfulness of words evaluated by words’ average informativeness in the categories. We have decided to include the most informative 5,000 words in the scientific thesaurus. Words as a Vector of Frequencies in WoS CategoriesEach word of the LScDC is represented as a vector of frequencies in WoS categories. Given the collection of the LSC texts, each entry of the vector consists of the number of texts containing the word in the corresponding category.It is noteworthy that texts in a corpus do not necessarily belong to a single category, as they are likely to correspond to multidisciplinary studies, specifically in a corpus of scientific texts. In other words, categories may not be exclusive. There are 252 WoS categories and a text can be assigned to at least 1 and at most 6 categories in the LSC. Using the binary calculation of frequencies, we introduce the presence of a word in a category. We create a vector of frequencies for each word, where dimensions are categories in the corpus.The collection of vectors, with all words and categories in the entire corpus, can be shown in a table, where each entry corresponds to a pair (word,category). This table is build for the LScDC with 252 WoS categories and presented in published archive with this file. The value of each entry in the table shows how many times a word of LScDC appears in a WoS category. The occurrence of a word in a category is determined by counting the number of the LSC texts containing the word in a category. 
Words as a Vector of Relative Information Gains Extracted for CategoriesIn this section, we introduce our approach to representation of a word as a vector of relative information gains for categories under the assumption that meaning of a word can be quantified by their information gained for categories.For each category, a function is defined on texts that takes the value 1, if the text belongs to the category, and 0 otherwise. For each word, a function is defined on texts that takes the value 1 if the word belongs to the text, and 0 otherwise. Consider LSC as a probabilistic sample space (the space of equally probable elementary outcomes). For the Boolean random variables, the joint probability distribution, the entropy and information gains are defined.The information gain about the category from the word is the amount of information on the belonging of a text from the LSC to the category from observing the word in the text [6]. We used the Relative Information Gain (RIG) providing a normalised measure of the Information Gain. This provides the ability of comparing information gains for different categories. The calculations of entropy, Information Gains and Relative Information Gains can be found in the README file in the archive published. Given a word, we created a vector where each component of the vector corresponds to a category. Therefore, each word is represented as a vector of relative information gains. It is obvious that the dimension of vector for each word is the number of categories. The set of vectors is used to form the Word-Category RIG Matrix, in which each column corresponds to a category, each row corresponds to a word and each component is the relative information gain from the word to the category. In Word-Category RIG Matrix, a row vector represents the corresponding word as a vector of RIGs in categories. We note that in the matrix, a column vector represents RIGs of all words in an individual category. If we choose an arbitrary category, words can be ordered by their RIGs from the most informative to the least informative for the category. As well as ordering words in each category, words can be ordered by two criteria: sum and maximum of RIGs in categories. The top n words in this list can be considered as the most informative words in the scientific texts. For a given word, the sum and maximum of RIGs are calculated from the Word-Category RIG Matrix.RIGs for each word of LScDC in 252 categories are calculated and vectors of words are formed. We then form the Word-Category RIG Matrix for the LSC. For each word, the sum (S) and maximum (M) of RIGs in categories are calculated and added at the end of the matrix (last two columns of the matrix). The Word-Category RIG Matrix for the LScDC with 252 categories, the sum of RIGs in categories and the maximum of RIGs over categories can be found in the database.Leicester Scientific Thesaurus (LScT)Leicester Scientific Thesaurus (LScT) is a list of 5,000 words form the LScDC [2]. Words of LScDC are sorted in descending order by the sum (S) of RIGs in categories and the top 5,000 words are selected to be included in the LScT. We consider these 5,000 words as the most meaningful words in the scientific corpus. In other words, meaningfulness of words evaluated by words’ average informativeness in the categories and the list of these words are considered as a ‘thesaurus’ for science. The LScT with value of sum can be found as CSV file with the published archive. 
Published archive contains following files:1) Word_Category_RIG_Matrix.csv: A 103,998 by 254 matrix where columns are 252 WoS categories, the sum (S) and the maximum (M) of RIGs in categories (last two columns of the matrix), and rows are words of LScDC. Each entry in the first 252 columns is RIG from the word to the category. Words are ordered as in the LScDC.2) Word_Category_Frequency_Matrix.csv: A 103,998 by 252 matrix where columns are 252 WoS categories and rows are words of LScDC. Each entry of the matrix is the number of texts containing the word in the corresponding category. Words are ordered as in the LScDC.3) LScT.csv: List of words of LScT with sum (S) values. 4) Text_No_in_Cat.csv: The number of texts in categories. 5) Categories_in_Documents.csv: List of WoS categories for each document of the LSC.6) README.txt: Description of Word-Category RIG Matrix, Word-Category Frequency Matrix and LScT and forming procedures.7) README.pdf (same as 6 in PDF format)References[1] Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2[2] Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v3[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/[4] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html [5] Suzen, N., Mirkes, E. M., & Gorban, A. N. (2019). LScDC-new large scientific dictionary. arXiv preprint arXiv:1912.06858. [6] Shannon, C. E. (1948). A mathematical theory of communication. Bell system technical journal, 27(3), 379-423.
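    A small worked sketch of the RIG computation described above, for a single (word, category) pair built from Boolean word-in-text and text-in-category indicators; the normalisation shown (information gain divided by the category's entropy) is one standard reading of "relative information gain", and the vectors are toy data rather than the 103,998 by 252 LSC matrix.

```python
# Worked sketch of Relative Information Gain (RIG) for one (word, category)
# pair from Boolean indicators: RIG = IG(category; word) / H(category).
# Toy vectors only, not the LSC data.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def rig(word_in_text, text_in_category):
    """Boolean arrays over all texts in the corpus."""
    h_cat = entropy(np.array([text_in_category.mean(), 1 - text_in_category.mean()]))
    # Conditional entropy of the category given presence/absence of the word.
    h_cond = 0.0
    for w in (True, False):
        mask = word_in_text == w
        if mask.any():
            p_cat = text_in_category[mask].mean()
            h_cond += mask.mean() * entropy(np.array([p_cat, 1 - p_cat]))
    ig = h_cat - h_cond                        # information gain
    return ig / h_cat if h_cat > 0 else 0.0    # relative information gain

rng = np.random.default_rng(0)
texts_in_cat = rng.random(1000) < 0.2                       # 20% of texts in the category
word_present = (rng.random(1000) < 0.1) | (texts_in_cat & (rng.random(1000) < 0.5))
print(f"RIG(word -> category) = {rig(word_present, texts_in_cat):.3f}")
```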

  12. Survey on mine-water geothermal in Poland

    • data.subak.org
    • metadata.bgs.ac.uk
    • +1 more
    Updated Feb 15, 2023
    Cite
    Government Digital Service (2023). Survey on mine-water geothermal in Poland [Dataset]. https://data.subak.org/dataset/survey-on-mine-water-geothermal-in-poland
    Explore at:
    Dataset updated
    Feb 15, 2023
    Dataset provided by
    Government Digital Service
    Area covered
    Poland
    Description

    The study was conducted in three coal mining regions: Lower Silesia, Upper Silesia and Lublin (each N=500), using Computer Assisted Web Interviews (CAWI). The questionnaire includes blocks of questions concerning mine water awareness, climate change and local/place attachment. The online survey took 15 to 20 minutes and was prepared after in-depth pilot research among participants with different education levels from the mining regions. We used an uninformed approach to the survey, so there were no additional questions or requirements for participants prior to the survey. Since mine water energy extraction is a technical issue that is neither well known nor commonly present in the narratives of Poles, we tested the survey questions with pilot cognitive interviews to remove technical language and reduce the number of replies given without understanding. The interviews were conducted with 10 participants in July 2020 and, following the pilot's recommendations and results, we implemented additional changes in the final version of the questionnaire. Specifically, some questions were simplified, and the background information on mine water extraction was simplified and shortened. The CAWI survey was completed by adults aged 18-65 (N=1500) between 14-19 August 2020 by the Kantar Research Agency. The sample was constructed using Kantar's internet panel, profiled for basic demographics such as gender, age, and town size. Particular attention was paid to the quality of the panel, which is reflected in its structure: Kantar's internet panel reflects the profile of the Polish population of Internet users in terms of its participants' demographic characteristics. The sample from each region was 500 respondents, and within the full sample (N=1500) only 192 people chose "mining areas" as the best description of the area where they live. Although the three voivodships were chosen for their mining industry, the selected sample covers each region in general, in which mining communities are statistically not fully represented. We also asked about respondents' subjective perception of the area they live in, which we further analysed with its spatial distribution. The dataset was created within the SECURe project (Subsurface Evaluation of CCS and Unconventional Risks) - https://www.securegeoenergy.eu/. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 764531.

  13. Webis Simulation Data Mining Bridge Models Corpus 2012 (Webis-SDMbridge-12)

    • live.european-language-grid.eu
    Updated Mar 19, 2024
    Cite
    (2024). Webis Simulation Data Mining Bridge Models Corpus 2012 (Webis-SDMbridge-12) [Dataset]. https://live.european-language-grid.eu/catalogue/ld/7937
    Explore at:
    Dataset updated
    Mar 19, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This corpus provides the simulation data mining community with a collection of 14641 bridge models and simulated behavior.

    1. Folder "1-designs":

    The text files in this directory should contain all information for the independent variables of any machine learning experiment. For reference, all 14641 IFC models are supplied in subfolders 001 to 147.

    2. Folder "2-simulation":

    This folder contains samples of the simulation output that may be viewed in Paraview (http://www.paraview.org). The original model contains the "Org" filename fragment, and the maximum and minimum behaviors are indicated with "Max" and "Min" filename fragments. Displacement, strain, and stress behaviors are all given. Only three of the 14641 models are given as the file sizes are around 1.4 to 2.2 megabytes each. The complete data (approximately 81 gigabytes) can be regenerated and provided if necessary on request (email webis@medien.uni-weimar.de).

    3. Folder "3-aggregation":

    Maximum displacement, strain, and stress measurements are given in the text files individually, and together in the files with the "vtk" filename fragment. This data should be sufficient for the dependent variables of any machine learning experiment.

  14. Data from: Mining texts to efficiently generate global data on political...

    • dataverse.harvard.edu
    Updated Jul 8, 2015
    Cite
    Shahryar Minhas; Jay Ulfelder; Michael D. Ward (2015). Mining texts to efficiently generate global data on political regime types [Dataset]. http://doi.org/10.7910/DVN/8MC1LO
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 8, 2015
    Dataset provided by
    Harvard Dataverse
    Authors
    Shahryar Minhas; Jay Ulfelder; Michael D. Ward
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    We describe the design and results of an experiment in using text-mining and machine-learning techniques to generate annual measures of national political regime types. Valid and reliable measures of countries’ forms of national government are essential to cross-national and dynamic analysis of many phenomena of great interest to political scientists, including civil war, interstate war, democratization, and coups d’état. Unfortunately, traditional measures of regime type are very expensive to produce, and observations for ambiguous cases are often sharply contested. In this project, we train a series of support vector machine (SVM) classifiers to infer regime type from textual data sources. To train the classifiers, we used vectorized textual reports from Freedom House and the State Department as features for a training set of prelabeled regime type data. To validate our SVM classifiers, we compare their predictions in an out-of-sample context, and the performance results across a variety of metrics (accuracy, precision, recall) are very high. The results of this project highlight the ability of these techniques to contribute to producing real-time data sources for use in political science that can also be routinely updated at much lower cost than human-coded data. To this end, we set up a text-processing pipeline that pulls updated textual data from selected sources, conducts feature extraction, and applies supervised machine learning methods to produce measures of regime type. This pipeline, written in Python, can be pulled from the Github repository associated with this project and easily extended as more data becomes available.
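    A generic sketch of this kind of pipeline (text vectorisation followed by an SVM classifier) is given below with placeholder documents and labels; the project's own Python pipeline is the one in its GitHub repository.

```python
# Generic sketch of the approach described above: vectorise country-report
# texts and train an SVM to predict regime type. Documents and labels here
# are placeholders; the project's own pipeline lives in its GitHub repo.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

reports = [
    "elections were free and fair with broad participation",
    "opposition parties were banned and journalists detained",
    "a military council suspended the constitution",
    "peaceful transfer of power followed the vote",
]
regime_type = ["democracy", "autocracy", "autocracy", "democracy"]  # pre-labelled training data

svm = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1), LinearSVC(C=1.0))
scores = cross_val_score(svm, reports, regime_type, cv=2)   # out-of-sample check
print("accuracy per fold:", scores)
svm.fit(reports, regime_type)
print(svm.predict(["the ruling party dissolved parliament and jailed critics"]))
```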

  15. Lisbon, Portugal, hotel’s customer dataset with three years of personal,...

    • data.mendeley.com
    Updated Nov 18, 2020
    Cite
    Nuno Antonio (2020). Lisbon, Portugal, hotel’s customer dataset with three years of personal, behavioral, demographic, and geographic information [Dataset]. http://doi.org/10.17632/j83f5fsh6c.1
    Explore at:
    Dataset updated
    Nov 18, 2020
    Authors
    Nuno Antonio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Portugal, Lisbon
    Description

    Hotel customer dataset with 31 variables describing a total of 83,590 instances (customers). It comprises three full years of customer behavioral data. In addition to personal and behavioral information, the dataset also contains demographic and geographical information. This dataset contributes to reducing the lack of real-world business data that can be used for educational and research purposes. The dataset can be used in data mining, machine learning, and other analytical problems in the scope of data science. Due to its unit of analysis, it is a dataset especially suitable for building customer segmentation models, including clustering and RFM (Recency, Frequency, and Monetary value) models, but it can also be used in classification and regression problems.
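    A sketch of the RFM-style segmentation this dataset is suited for is shown below; the column names used (DaysSinceLastStay, BookingsCheckedIn, LodgingRevenue) are hypothetical placeholders, not necessarily the dataset's actual variables.

```python
# Sketch of RFM-based customer segmentation. Column names are hypothetical
# placeholders, not necessarily the variables in this dataset.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

customers = pd.DataFrame({
    "DaysSinceLastStay": [12, 300, 45, 700, 30],              # Recency
    "BookingsCheckedIn": [8, 1, 4, 1, 10],                    # Frequency
    "LodgingRevenue": [2400.0, 150.0, 900.0, 80.0, 3100.0],   # Monetary value
})

rfm = StandardScaler().fit_transform(customers)
customers["segment"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(rfm)
print(customers.groupby("segment").mean())                    # profile each segment
```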

  16. Hierarchical Measures Example.

    • plos.figshare.com
    xls
    Updated May 30, 2023
    + more versions
    Cite
    Fernando Benites; Svenja Simon; Elena Sapozhnikova (2023). Hierarchical Measures Example. [Dataset]. http://doi.org/10.1371/journal.pone.0084475.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Fernando Benites; Svenja Simon; Elena Sapozhnikova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Hierarchical Measures Example.

  17. Application of image processing and machine learning techniques to...

    • search.dataone.org
    • data.griidc.org
    Updated Feb 5, 2025
    Cite
    Daly, Kendra (2025). Application of image processing and machine learning techniques to distinguish suspected oil droplets from plankton and other particles for the SIPPER imaging system [Dataset]. http://doi.org/10.7266/N74X55RS
    Explore at:
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    GRIIDC
    Authors
    Daly, Kendra
    Description

    Image classification features and examples of statistical results for the data mining approach using a one-versus-one strategy to implement a SVM (support vector machine) multi-class classifier. Data published in: Fefilatyev, S., K. Kramer, L. Hall, D. Goldgof, R. Kasturi, A. Remsen, K. Daly. 2011. Detection of Anomalous Particles from the Deepwater Horizon Oil Spill Using the SIPPER3 Underwater Imaging Platform. Proceedings of International Conference on Data Mining Workshops, p. 741-748. Awarded Data Mining Practice Prize at the IEEE International Conference on Data Mining (ICDM), Vancouver, Canada, December 11-14, 2011. DOI 10.1109/ICDMW.2011.65.
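    A hedged sketch of the one-versus-one SVM multi-class strategy mentioned above, applied to placeholder feature vectors rather than the actual SIPPER image features:

```python
# Sketch of the one-versus-one SVM multi-class strategy, on placeholder
# image-feature vectors (not the SIPPER features or the published model).
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Stand-in for extracted particle features with classes such as
# "oil droplet", "copepod", "marine snow", "other".
X, y = make_classification(n_samples=2000, n_features=40, n_informative=15,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ovo_svm = OneVsOneClassifier(LinearSVC(C=1.0, max_iter=5000))  # one binary SVM per class pair
ovo_svm.fit(X_tr, y_tr)
print(classification_report(y_te, ovo_svm.predict(X_te)))
```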

  18. LSC (Leicester Scientific Corpus)

    • figshare.le.ac.uk
    Updated Apr 15, 2020
    Cite
    LSC (Leicester Scientific Corpus) [Dataset]. https://figshare.le.ac.uk/articles/dataset/LSC_Leicester_Scientific_Corpus_/9449639
    Explore at:
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    The LSC (Leicester Scientific Corpus)

    April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk) Supervised by Prof Alexander Gorban and Dr Evgeny MirkesThe data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.[Version 2] A further cleaning is applied in Data Processing for LSC Abstracts in Version 1*. Details of cleaning procedure are explained in Step 6.* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1.Getting StartedThis text provides the information on the LSC (Leicester Scientific Corpus) and pre-processing steps on abstracts, and describes the structure of files to organise the corpus. This corpus is created to be used in future work on the quantification of the meaning of research texts and make it available for use in Natural Language Processing projects.LSC is a collection of abstracts of articles and proceeding papers published in 2014, and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:1. Authors: The list of authors of the paper2. Title: The title of the paper 3. Abstract: The abstract of the paper 4. Categories: One or more category from the list of categories [2]. Full list of categories is presented in file ‘List_of _Categories.txt’. 5. Research Areas: One or more research area from the list of research areas [3]. Full list of research areas is presented in file ‘List_of_Research_Areas.txt’. 6. Total Times cited: The number of times the paper was cited by other items from all databases within Web of Science platform [4] 7. Times cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]The corpus was collected in July 2018 online and contains the number of citations from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,350.Data ProcessingStep 1: Downloading of the Data Online

    The dataset is collected manually by exporting documents as Tab-delimitated files online. All documents are available online.Step 2: Importing the Dataset to R

    The LSC was collected as TXT files. All documents are extracted to R.Step 3: Cleaning the Data from Documents with Empty Abstract or without CategoryAs our research is based on the analysis of abstracts and categories, all documents with empty abstracts and documents without categories are removed.Step 4: Identification and Correction of Concatenate Words in AbstractsEspecially medicine-related publications use ‘structured abstracts’. Such type of abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion etc. Used tool for extracting abstracts leads concatenate words of section headings with the first word of the section. For instance, we observe words such as ConclusionHigher and ConclusionsRT etc. The detection and identification of such words is done by sampling of medicine-related publications with human intervention. Detected concatenate words are split into two words. For instance, the word ‘ConclusionHigher’ is split into ‘Conclusion’ and ‘Higher’.The section headings in such abstracts are listed below:

    Background Method(s) Design Theoretical Measurement(s) Location Aim(s) Methodology Process Abstract Population Approach Objective(s) Purpose(s) Subject(s) Introduction Implication(s) Patient(s) Procedure(s) Hypothesis Measure(s) Setting(s) Limitation(s) Discussion Conclusion(s) Result(s) Finding(s) Material (s) Rationale(s) Implications for health and nursing policyStep 5: Extracting (Sub-setting) the Data Based on Lengths of AbstractsAfter correction, the lengths of abstracts are calculated. ‘Length’ indicates the total number of words in the text, calculated by the same rule as for Microsoft Word ‘word count’ [5].According to APA style manual [6], an abstract should contain between 150 to 250 words. In LSC, we decided to limit length of abstracts from 30 to 500 words in order to study documents with abstracts of typical length ranges and to avoid the effect of the length to the analysis.

    Step 6: [Version 2] Cleaning Copyright Notices, Permission polices, Journal Names and Conference Names from LSC Abstracts in Version 1Publications can include a footer of copyright notice, permission policy, journal name, licence, author’s right or conference name below the text of abstract by conferences and journals. Used tool for extracting and processing abstracts in WoS database leads to attached such footers to the text. For example, our casual observation yields that copyright notices such as ‘Published by Elsevier ltd.’ is placed in many texts. To avoid abnormal appearances of words in further analysis of words such as bias in frequency calculation, we performed a cleaning procedure on such sentences and phrases in abstracts of LSC version 1. We removed copyright notices, names of conferences, names of journals, authors’ rights, licenses and permission policies identified by sampling of abstracts.Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Lengths of AbstractsThe cleaning procedure described in previous step leaded to some abstracts having less than our minimum length criteria (30 words). 474 texts were removed.Step 8: Saving the Dataset into CSV FormatDocuments are saved into 34 CSV files. In CSV files, the information is organised with one record on each line and parts of abstract, title, list of authors, list of categories, list of research areas, and times cited is recorded in fields.To access the LSC for research purposes, please email to ns433@le.ac.uk.References[1]Web of Science. (15 July). Available: https://apps.webofknowledge.com/ [2]WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html [3]Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html [4]Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US [5]Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3 [6]A. P. Association, Publication manual. American Psychological Association Washington, DC, 1983.
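    Two of the cleaning steps above (splitting words where a structured-abstract heading was concatenated to the following word, and filtering abstracts to the 30-500 word range) can be sketched as follows; this is illustrative Python, whereas the LSC itself was processed in R, and the heading list here is a small placeholder.

```python
# Sketch of two cleaning steps described above: splitting concatenated
# structured-abstract headings (e.g. "ConclusionHigher" -> "Conclusion Higher")
# and keeping only abstracts of 30-500 words. Illustrative only; the LSC was
# actually processed in R, and HEADINGS is a placeholder list.
import re

HEADINGS = ("Background", "Method", "Methods", "Objective", "Objectives",
            "Results", "Conclusion", "Conclusions", "Introduction", "Discussion")

def split_concatenated_headings(text):
    # Insert a space when a known heading is glued to a capitalised word.
    pattern = r"\b(" + "|".join(HEADINGS) + r")(?=[A-Z][a-z])"
    return re.sub(pattern, r"\1 ", text)

def keep_by_length(abstract, lo=30, hi=500):
    return lo <= len(abstract.split()) <= hi

raw = "ConclusionHigher doses were associated with better outcomes."
cleaned = split_concatenated_headings(raw)
print(cleaned)                       # "Conclusion Higher doses were ..."
print(keep_by_length(cleaned))       # False: this toy abstract is far too short
```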

  19. Sample data (five types of features of one participant)

    • figshare.com
    txt
    Updated Mar 29, 2022
    Cite
    Hua Liao (2022). Sample data (five types of features of one participant) [Dataset]. http://doi.org/10.6084/m9.figshare.19443503.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Mar 29, 2022
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Hua Liao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sample data (five types of features of one participant)

  20. Averaged performance of rule classifiers for supervised discretisation with...

    • plos.figshare.com
    xls
    Updated Jun 4, 2023
    Cite
    Urszula Stańczyk; Beata Zielosko; Grzegorz Baron (2023). Averaged performance of rule classifiers for supervised discretisation with transformations of the test sets based on discrete data models obtained for the training samples [%]. [Dataset]. http://doi.org/10.1371/journal.pone.0231788.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Urszula Stańczyk; Beata Zielosko; Grzegorz Baron
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Averaged performance of rule classifiers for supervised discretisation with transformations of the test sets based on discrete data models obtained for the training samples [%].
