44 datasets found

Data from: A Generic Local Algorithm for Mining Data Streams in Large...
catalog.data.gov
datasets.ai
Updated Apr 10, 2025
Cite
Dashlink (2025). A Generic Local Algorithm for Mining Data Streams in Large Distributed Systems [Dataset]. https://catalog.data.gov/dataset/a-generic-local-algorithm-for-mining-data-streams-in-large-distributed-systems
Explore at:
Dataset updated
Apr 10, 2025
Dataset provided by
Dashlink
Description
In a large network of computers or wireless sensors, each of the components (henceforth, peers) has some data about the global state of the system. Much of the system's functionality, such as message routing, information retrieval and load sharing, relies on modeling the global state. We refer to the outcome of the function (e.g., the load experienced by each peer) as the model of the system. Since the state of the system is constantly changing, it is necessary to keep the models up-to-date. Computing global data mining models (e.g., decision trees, k-means clustering) in large distributed systems may be very costly due to the scale of the system and due to communication cost, which may be high. The cost further increases in a dynamic scenario when the data changes rapidly. In this paper we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm which can be used to monitor a wide class of data mining models. Then, we use this algorithm as a feedback loop for the monitoring of complex functions of the data such as its k-means clustering. The theoretical claims are corroborated with a thorough experimental analysis.
Educational Attainment in North Carolina Public Schools: Use of statistical...
data.mendeley.com
Updated Nov 14, 2018
Cite
Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
Explore at:
Unique identifier
https://doi.org/10.17632/6cm9wyd5g5.1
Dataset updated
Nov 14, 2018
Authors
Scott Herford
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The purpose of data mining analysis is to find patterns in data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before any work is done on the data, it has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. In our project, using clustering prior to classification did not improve performance much. One possible reason is that the features we selected for clustering are not well suited to it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics. From the dimensionality reduction perspective: clustering is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension loses a lot of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information. From the creating new features perspective: clustering analysis creates labels based on the patterns of the data, which brings uncertainty into the data. When clustering is used prior to classification, the choice of the number of clusters strongly affects the performance of the clustering, and in turn the performance of the classification. If the subset of features we cluster on is well suited to it, clustering might increase the overall classification performance.
For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not lock in the clustering outputs using a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the methods selected. Basically, the ramification we saw was that our results are not much better than random when applying clustering to the data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to revise the models from time to time as things change.
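The stability check described above (running the clustering without a fixed seed and comparing the runs) can be sketched as follows. This is a minimal pure-Python illustration and not the author's code: the small `kmeans` function, the `pair_agreement` measure (the Rand index) and the `stability` helper are all assumptions introduced for the example, and the description does not say which library or agreement measure was actually used.

```python
import random
from itertools import combinations


def kmeans(points, k, seed, iters=50):
    """Plain Lloyd's k-means on a list of numeric tuples; returns one label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers depend on the seed
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # an empty cluster keeps its old center
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def pair_agreement(labels_a, labels_b):
    """Rand index: fraction of point pairs that both labelings treat the same way."""
    pairs = list(combinations(range(len(labels_a)), 2))
    same = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return same / len(pairs)


def stability(points, k, seeds):
    """Mean Rand index over all pairs of runs started from different seeds."""
    runs = [kmeans(points, k, s) for s in seeds]
    scores = [pair_agreement(a, b) for a, b in combinations(runs, 2)]
    return sum(scores) / len(scores)
```

A mean agreement close to 1 means that different seeds reproduce essentially the same partition; values well below that would be consistent with the run-to-run variation reported in the description. The Rand index is used here because it ignores how cluster numbers are permuted between runs.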
A predictive model for opal exploration in Australia from a data mining...
researchdata.edu.au
Updated May 1, 2015
Cite
Thomas Landgrebe; Adriana Dutkiewicz; Dietmar Muller (2015). A predictive model for opal exploration in Australia from a data mining approach [Dataset]. http://doi.org/10.4227/11/5587A86C0FDF1
Explore at:
Unique identifier
https://doi.org/10.4227/11/5587A86C0FDF1
Dataset updated
May 1, 2015
Dataset provided by
The University of Sydney
Authors
Thomas Landgrebe; Adriana Dutkiewicz; Dietmar Muller
License
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Area covered

Dataset funded by
Australian Research Council
Description
This data collection is associated with the publications: Merdith, A. S., Landgrebe, T. C. W., Dutkiewicz, A., & Müller, R. D. (2013). Towards a predictive model for opal exploration using a spatio-temporal data mining approach. Australian Journal of Earth Sciences, 60(2), 217-229. doi: 10.1080/08120099.2012.754793
and
Landgrebe, T. C. W., Merdith, A., Dutkiewicz, A., & Müller, R. D. (2013). Relationships between palaeogeography and opal occurrence in Australia: A data-mining approach. Computers & Geosciences, 56(0), 76-82. doi: 10.1016/j.cageo.2013.02.002
Publication Abstract - Merdith et al. (2013)
Opal is Australia's national gemstone, however most significant opal discoveries were made in the early 1900's - more than 100 years ago - until recently. Currently there is no formal exploration model for opal, meaning there are no widely accepted concepts or methodologies available to suggest where new opal fields may be found. As a consequence opal mining in Australia is a cottage industry with the majority of opal exploration focused around old opal fields. The EarthByte Group has developed a new opal exploration methodology for the Great Artesian Basin. The work is based on the concept of applying “big data mining” approaches to data sets relevant for identifying regions that are prospective for opal. The group combined a multitude of geological and geophysical data sets that were jointly analysed to establish associations between particular features in the data with known opal mining sites. A “training set” of known opal localities (1036 opal mines) was assembled, using those localities, which were featured in published reports and on maps. The data used include rock types, soil type, regolith type, topography, radiometric data and a stack of digital palaeogeographic maps. The different data layers were analysed via spatio-temporal data mining combining the GPlates PaleoGIS software (www.gplates.org) with the Orange data mining software (orange.biolab.si) to produce the first opal prospectivity map for the Great Artesian Basin. One of the main results of the study is that the geological conditions favourable for opal were found to be related to a particular sequence of surface environments over geological time. These conditions involved alternating shallow seas and river systems followed by uplift and erosion. The approach reduces the entire area of the Great Artesian Basin to a mere 6% that is deemed to be prospective for opal exploration. The work is described in two companion papers in the Australian Journal of Earth Sciences and Computers and Geosciences.
Publication Abstract - Landgrebe et al. (2013)
Age-coded multi-layered geological datasets are becoming increasingly prevalent with the surge in open-access geodata, yet there are few methodologies for extracting geological information and knowledge from these data. We present a novel methodology, based on the open-source GPlates software in which age-coded digital palaeogeographic maps are used to “data-mine” spatio-temporal patterns related to the occurrence of Australian opal. Our aim is to test the concept that only a particular sequence of depositional/erosional environments may lead to conditions suitable for the formation of gem quality sedimentary opal. Time-varying geographic environment properties are extracted from a digital palaeogeographic dataset of the eastern Australian Great Artesian Basin (GAB) at 1036 opal localities. We obtain a total of 52 independent ordinal sequences sampling 19 time slices from the Early Cretaceous to the present-day. We find that 95% of the known opal deposits are tied to only 27 sequences all comprising fluvial and shallow marine depositional sequences followed by a prolonged phase of erosion. We then map the total area of the GAB that matches these 27 opal-specific sequences, resulting in an opal-prospective region of only about 10% of the total area of the basin. The key patterns underlying this association involve only a small number of key environmental transitions. We demonstrate that these key associations are generally absent at arbitrary locations in the basin. This new methodology allows for the simplification of a complex time-varying geological dataset into a single map view, enabling straightforward application for opal exploration and for future co-assessment with other datasets/geological criteria. This approach may help unravel the poorly understood opal formation process using an empirical spatio-temporal data-mining methodology and readily available datasets to aid hypothesis testing.
Authors and Institutions
Andrew Merdith - EarthByte Research Group, School of Geosciences, The University of Sydney, Australia. ORCID: 0000-0002-7564-8149
Thomas Landgrebe - EarthByte Research Group, School of Geosciences, The University of Sydney, Australia
Adriana Dutkiewicz - EarthByte Research Group, School of Geosciences, The University of Sydney, Australia
R. Dietmar Müller - EarthByte Research Group, School of Geosciences, The University of Sydney, Australia. ORCID: 0000-0002-3334-5764
Overview of Resources Contained
This collection contains geological data from Australia used for data mining in the publications Merdith et al. (2013) and Landgrebe et al. (2013). The resulting maps of opal prospectivity are also included.
List of Resources
Note: For details on the files included in this data collection, see “Description_of_Resources.txt”.
Note: For information on file formats and what programs to use to interact with various file formats, see “File_Formats_and_Recommended_Programs.txt”.
Map of Barfield region, Australia (.jpg, 270 KB)
Map overviewing the Great Artesian basins and main opal mining camps (.png, 82 KB)
Maps showing opal prospectivity data mining results for different geological datasets (.tif, 23.1 MB)
Map of opal prospectivity from palaeogeography data mining (.pdf, 2.6 MB)
Raster of palaeogeography target regions for viewing in Google Earth (.jpg, 418 KB)
Opal mine locations (.gpml, .txt, .kmz, .shp, total 15.6 MB)
Map of opal prospectivity from all data mining results as a Google Earth overlay (.kmz, 12 KB)
Map of probability of opal occurrence in prospective regions from all data mining results (.tif, 5.9 MB)
Paleogeography of Australia (.gpml, .txt, .shp, total 114.2 MB)
Radiometric data showing potassium concentration contrasts (.tif, .kmz, total 311.3 MB)
Regolith data (.gpml, .txt, .kml, .shp, total 7.1 MB)
Soil type data (.gpml, .txt, .kml, .shp, total 7.1 MB)
For more information on this data collection, and links to other datasets from the EarthByte Research Group please visit EarthByte
For more information about using GPlates, including tutorials and a user manual please visit GPlates or EarthByte
Datasets(Original, Mean, Median, Most Frequent).zip
figshare.com
zip
Updated May 31, 2023
Cite
Omar Elzeki (2023). Datasets(Original, Mean, Median, Most Frequent).zip [Dataset]. http://doi.org/10.6084/m9.figshare.8118710.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.8118710.v1
Dataset updated
May 31, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Omar Elzeki
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset is transformed into Matlab format. The data are stored as cells; each cell is a matrix in which each column represents a gene and each row a subject. Each dataset is organized in a separate directory. The directory contains four versions: a) Original dataset, b) Dataset imputed by MEAN, c) Dataset imputed by MEDIAN, d) Dataset imputed by Most Frequent.
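The three imputed versions listed above correspond to standard column-wise fill strategies. The sketch below shows one way such versions could be derived from a matrix with gaps. It rests on several assumptions not stated in the description: that missing entries are encoded as `None`, that the fill value is computed per gene (per column), and that the hypothetical `impute` helper mirrors what was done in Matlab.

```python
import statistics


def impute(matrix, strategy):
    """Fill None entries column by column (column = gene, row = subject).

    strategy is one of "mean", "median" or "most_frequent".
    """
    fill_fn = {
        "mean": statistics.mean,
        "median": statistics.median,
        # statistics.mode returns the first mode on ties in Python 3.8+.
        "most_frequent": statistics.mode,
    }[strategy]
    n_cols = len(matrix[0])
    fills = []
    for j in range(n_cols):
        # Only observed values contribute to the fill statistic.
        observed = [row[j] for row in matrix if row[j] is not None]
        fills.append(fill_fn(observed))
    # Build a new matrix; the input is left unchanged.
    return [
        [fills[j] if value is None else value for j, value in enumerate(row)]
        for row in matrix
    ]
```

A column with no observed values at all makes the `statistics` functions raise `StatisticsError`, so such columns would need to be dropped or handled separately before calling the helper.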
LScD (Leicester Scientific Dictionary)
figshare.le.ac.uk
docx
Updated Apr 15, 2020
Cite
Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
Explore at:
Available download formats: docx
Unique identifier
https://doi.org/10.25392/leicester.data.9746900.v3
Dataset updated
Apr 15, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Leicester
Description
LScD (Leicester Scientific Dictionary)
April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk/suzenneslihan@hotmail.com)
Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes
[Version 3]
The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus) - Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below. We did not repeat the explanation. After the pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are also the same as described for LScD Version 2 below.
* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2
[Version 2]
Getting Started
This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). This dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for usage of the code are available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.
LSC is a collection of abstracts of articles and proceeding papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.
LScD is an ordered list of words from texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing the word in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:
1. Unique words in abstracts
2. Number of documents containing each word
3. Number of appearances of a word in the entire corpus
Processing the LSC
Step 1. Downloading the LSC Online: Use of the LSC is subject to acceptance of request of the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.
Step 2. Importing the Corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source with changes of parameters. The structure of the corpus, such as file format and names (also the position) of fields, should be taken into account to apply our code. The organisation of CSV files of LSC is described in the README file for LSC [1].
Step 3. Extracting Abstracts and Saving Metadata: Metadata, which include all fields in a document excluding abstracts, and the field of abstracts are separated. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
Step 4. Text Pre-processing Steps on the Collection of Abstracts: In this section, we present our approaches to pre-processing abstracts of the LSC.
1. Removing punctuation and special characters: This is the process of substitution of all non-alphanumeric characters by space. We did not substitute the character “-” in this step, because we need to keep words like “z-score”, “non-payment” and “pre-processing” in order not to lose the actual meaning of such words. The uniting of prefixes with words is performed in later steps of pre-processing.
2. Lowercasing the text data: Lowercasing is performed to avoid considering the same words like “Corpus”, “corpus” and “CORPUS” differently. The entire collection of texts is converted to lowercase.
3. Uniting prefixes of words: Words containing prefixes joined with the character “-” are united as a word. The prefixes united for this research are listed in the file “list_of_prefixes.csv”. Most of the prefixes are extracted from [4]. We also added commonly used prefixes: ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.
4. Substitution of words: Some words joined with “-” in the abstracts of the LSC require an additional process of substitution to avoid losing the meaning of the word before removing the character “-”. Some examples of such words are “z-test”, “well-known” and “chi-square”. These words have been substituted by “ztest”, “wellknown” and “chisquare”. Identification of such words is done by sampling abstracts from the LSC. The full list of such words and the decisions taken for substitution are presented in the file “list_of_substitution.csv”.
5. Removing the character “-”: All remaining “-” characters are replaced by space.
6. Removing numbers: All digits which are not included in a word are replaced by space. All words that contain digits and letters are kept because alphanumeric strings such as chemical formulae might be important for our analysis. Some examples are “co2”, “h2o” and “21st”.
7. Stemming: Stemming is the process of converting inflected words into their word stem. This step results in uniting several forms of words with similar meaning into one form and also saving memory space and time [5]. All words in the LScD are stemmed to their word stem.
8. Stop words removal: Stop words are words that are extremely common but provide little value in a language. Some common stop words in English are ‘I’, ‘the’, ‘a’ etc. We used the ‘tm’ package in R to remove stop words [6]. There are 174 English stop words listed in the package.
Step 5. Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written in the file “LScD.csv”.
The Organisation of the LScD
The total number of words in the file “LScD.csv” is 974,238. Each field is described below:
Word: It contains unique words from the corpus. All words are in lowercase and in their stem forms. The field is sorted by the number of documents that contain words in descending order.
Number of Documents Containing the Word: In this context, binary calculation is used: if a word exists in an abstract then there is a count of 1. If the word exists more than once in a document, the count is still 1. The total number of documents containing the word is counted as the sum of 1s in the entire corpus.
Number of Appearance in Corpus: It contains how many times a word occurs in the corpus when the corpus is considered as one large document.
Instructions for R Code
LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. Outputs of the code are:
Metadata File: It includes all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
File of Abstracts: It contains all abstracts after the pre-processing steps defined in Step 4.
DTM: It is the Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.
LScD: An ordered list of words from LSC as defined in the previous section.
The code can be used as follows:
1. Download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’
2. Open the LScD_Creation.R script
3. Change parameters in the script: replace with the full path of the directory with source files and the full path of the directory to write output files
4. Run the full code.
References
[1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
[2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
[5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved porter’s stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
[6] I. Feinerer, "Introduction to the tm Package Text Mining in R," Available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
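The pre-processing steps of Step 4 and the two counts stored in LScD.csv can be illustrated with a short sketch. The original pipeline is the R script LScD_Creation.R [2]; what follows is only a rough Python approximation under stated assumptions. The `PREFIXES` and `STOP_WORDS` sets are small illustrative subsets (the real lists are in “list_of_prefixes.csv” and the ‘tm’ package), stemming (step 7) is left out, only ASCII letters and digits are treated as alphanumeric, and the function names and the alphabetical tie-break are hypothetical.

```python
import re
from collections import Counter

# Illustrative subsets only; the actual lists come from the files named above.
PREFIXES = {"e", "extra", "per", "self", "ultra", "non", "pre"}
SUBSTITUTIONS = {"z-test": "ztest", "well-known": "wellknown", "chi-square": "chisquare"}
STOP_WORDS = {"i", "the", "a", "of", "and", "is", "in"}


def preprocess(text):
    """Apply steps 1-6 and 8 of Step 4 to one abstract; returns a list of tokens."""
    text = re.sub(r"[^A-Za-z0-9\-]", " ", text)  # 1. drop punctuation, keep "-"
    text = text.lower()                          # 2. lowercase
    tokens = []
    for tok in text.split():
        head, sep, tail = tok.partition("-")
        if sep and tail and head in PREFIXES:    # 3. unite prefix with its word
            tok = head + tail
        tok = SUBSTITUTIONS.get(tok, tok)        # 4. listed substitutions
        tokens.extend(tok.replace("-", " ").split())  # 5. remaining "-" to space
    tokens = [t for t in tokens if not t.isdigit()]   # 6. drop stand-alone numbers
    # 7. stemming is omitted in this sketch
    return [t for t in tokens if t not in STOP_WORDS]  # 8. stop word removal


def build_dictionary(documents):
    """Return (word, documents containing it, total occurrences) rows."""
    doc_freq = Counter()
    total = Counter()
    for doc in documents:
        tokens = preprocess(doc)
        total.update(tokens)          # every occurrence counts
        doc_freq.update(set(tokens))  # binary count: at most 1 per document
    # Sort by document frequency, descending; ties broken alphabetically (assumption).
    return sorted(
        ((w, doc_freq[w], total[w]) for w in doc_freq),
        key=lambda row: (-row[1], row[0]),
    )
```

Calling `build_dictionary` on a list of abstract strings yields rows in the same three-field layout as the file description: the word, the number of documents containing it, and its number of appearances in the whole corpus.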
Data from: Hidden Room game data clustering in University of Cadiz (Spain)...
figshare.com
png
Updated Apr 30, 2018
Cite
Manuel Palomo-duarte; Anke Berns (2018). Hidden Room game data clustering in University of Cadiz (Spain) by DeutschUCA [Dataset]. http://doi.org/10.6084/m9.figshare.6194573.v2
Explore at:
Available download formats: png
Unique identifier
https://doi.org/10.6084/m9.figshare.6194573.v2
Dataset updated
Apr 30, 2018
Dataset provided by
Figshare (http://figshare.com/)
Authors
Manuel Palomo-duarte; Anke Berns
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description

Histograms and results of k-means and Ward's clustering for Hidden Room game in University of Cadiz (Spain) by DeutschUCA
The fileset contains information from three sources:
1. Histograms files:
* Lexical_histogram.png (histogram of lexical error ratios)
* Grammatical_histogram.png (histogram of grammatical error ratios)
2. K-means clustering files:
* elbow-lex kmeans.png (clustering by lexical aspects: error curves obtained by applying the elbow method to determine the optimal number of clusters)
* cube-lex kmeans.png (clustering by lexical aspects: a three-dimensional representation of clusters obtained after applying k-means method)
* Lexical_clusters (table) kmeans.xls (clustering by lexical aspects: centroids, standard deviations and number of instances assigned to each cluster)
* elbow-gram kmeans.png (clustering by grammatical aspects: error curves obtained by applying the elbow method to determine the optimal number of clusters)
* cube-gramm kmeans.png (clustering by grammatical aspects: a three-dimensional representation of clusters obtained after applying k-means method)
* Grammatical_clusters (table) kmeans.xls (clustering by grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
* elbow-lexgram kmeans.png (clustering by lexical and grammatical aspects: error curves obtained by applying the elbow method to determine the optimal number of clusters)
* Lexical_Grammatical_clusters (table) kmeans.xls (clustering by lexical and grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
* Grammatical_clusters_number_of_words (table) kmeans.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to grammatical error ratios.
* Lexical_clusters_number_of_words (table) kmeans.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to lexical error ratios.
* Lexical_Grammatical_clusters_number_of_words (table) kmeans.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to lexical and grammatical error ratios.
3. Ward’s Agglomerative Hierarchical Clustering files:
* Lexical_Cluster_Dendrogram_ward.png (clustering by lexical aspects: dendrogram obtained after applying Ward's clustering method)
* Grammatical_Cluster_Dendrogram_ward.png (clustering by grammatical aspects: dendrogram obtained after applying Ward's clustering method)
* Lexical_Grammatical_Cluster_Dendrogram_ward.png (clustering by lexical and grammatical aspects: dendrogram obtained after applying Ward's clustering method)
* Lexical_Grammatical_clusters (table) ward.xls: Centroids (from column 2 to 7) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
* Grammatical_clusters (table) ward.xls: Centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
* Lexical_clusters (table) ward.xls: Centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
* Lexical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
* Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
* Lexical_Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
Data Mining Applied to Life Cycle Inventory Modeling for Cumene and Sodium...
catalog.data.gov
gimi9.com
Updated Mar 4, 2021
Cite
U.S. EPA Office of Research and Development (ORD) (2021). Data Mining Applied to Life Cycle Inventory Modeling for Cumene and Sodium Hydroxide Manufacturing, Version 1, 09/2018 [Dataset]. https://catalog.data.gov/dataset/data-mining-applied-to-life-cycle-inventory-modeling-for-cumene-and-sodium-hydroxide-ma-09
Explore at:
Dataset updated
Mar 4, 2021
Dataset provided by
United States Environmental Protection Agency (http://www.epa.gov/)
Description
This file contains the life cycle inventories (LCIs) developed for an associated journal article. Potential users of the data are referred to the journal article for a full description of the modeling methodology. LCIs were developed for cumene and sodium hydroxide manufacturing using data mining with metadata-based data preprocessing. The inventory data were collected from US EPA's 2012 Chemical Data Reporting database, 2011 National Emissions Inventory, 2011 Toxics Release Inventory, 2011 Electronic Greenhouse Gas Reporting Tool, 2011 Discharge Monitoring Report, and the 2011 Biennial Report generated from the RCRAinfo hazardous waste tracking system. The U.S. average cumene gate-to-gate inventories are provided without (baseline) and with process allocation applied using metadata-based filtering. In 2011, there were 8 facilities reporting public production volumes of cumene in the U.S., totaling to 2,609,309,687 kilograms of cumene produced that year. The U.S. average sodium hydroxide gate-to-gate inventories are also provided without (baseline) and with process allocation applied using metadata-based filtering. In 2011, there were 24 facilities reporting public production volumes of sodium hydroxide in the U.S., totaling to 3,878,021,614 kilograms of sodium hydroxide produced that year. Process allocation was only conducted for the top 12 facilities producing sodium hydroxide, which represents 97% of the public production of sodium hydroxide. The data have not been compiled in the formal Federal Commons LCI Template to avoid users interpreting the template to mean the data have been fully reviewed according to LCA standards and can be directly applied to all types of assessments and decision needs without additional review by industry and potential stakeholders. This dataset is associated with the following publication: Meyer, D.E., S. Cashman, and A. Gaglione. 
Improving the reliability of chemical manufacturing life cycle inventory constructed using secondary data. JOURNAL OF INDUSTRIAL ECOLOGY. Berkeley Electronic Press, Berkeley, CA, USA, 25(1): 20-35, (2021).
Mean Commuting Time for Workers (5-year estimate) in Miner County, SD
fred.stlouisfed.org
json
Updated Dec 12, 2024
Cite
(2024). Mean Commuting Time for Workers (5-year estimate) in Miner County, SD [Dataset]. https://fred.stlouisfed.org/series/B080ACS046097
Explore at:
Available download formats: json
Dataset updated
Dec 12, 2024
License
https://fred.stlouisfed.org/legal/#copyright-public-domain
Area covered
South Dakota, Miner County
Description
Graph and download economic data for Mean Commuting Time for Workers (5-year estimate) in Miner County, SD (B080ACS046097) from 2009 to 2023 about Miner County, SD; commuting time; SD; workers; average; 5-year; and USA.
Application Research of Clustering on kmeans
kaggle.com
zip
Updated Feb 27, 2021
Cite
ddpr raju (2021). Application Research of Clustering on kmeans [Dataset]. https://www.kaggle.com/ddprraju/tirupati-compus-school
Explore at:
Available download formats: zip (34507 bytes)
Dataset updated
Feb 27, 2021
Authors
ddpr raju
Description
Dataset

This dataset was created by ddpr raju

Contents

It contains the following files:
Mean Commuting Time for Workers (5-year estimate) in Miner County, SD
tradingeconomics.com
csv, excel, json, xml
Updated Feb 13, 2020
Cite
TRADING ECONOMICS (2020). Mean Commuting Time for Workers (5-year estimate) in Miner County, SD [Dataset]. https://tradingeconomics.com/united-states/mean-commuting-time-for-workers-in-miner-county-sd-fed-data.html
Explore at:
Available download formats: csv, json, excel, xml
Dataset updated
Feb 13, 2020
Dataset authored and provided by
TRADING ECONOMICS
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 1976 - Dec 31, 2025
Area covered
South Dakota, Miner County
Description
Mean Commuting Time for Workers (5-year estimate) in Miner County, SD was 19.31188 minutes in January of 2023, according to the United States Federal Reserve. Historically, Mean Commuting Time for Workers (5-year estimate) in Miner County, SD reached a record high of 21.50298 in January of 2021 and a record low of 15.93110 in January of 2012. Trading Economics provides the current actual value, a historical data chart, and related indicators for Mean Commuting Time for Workers (5-year estimate) in Miner County, SD, last updated from the United States Federal Reserve in August of 2025.
Income Distribution by Quintile: Mean Household Income in Miner County, SD...
neilsberg.com
csv, json
Updated Mar 3, 2025
+ more versions
Cite
Neilsberg Research (2025). Income Distribution by Quintile: Mean Household Income in Miner County, SD // 2025 Edition [Dataset]. https://www.neilsberg.com/insights/miner-county-sd-median-household-income/
Explore at:
Available download formats: json, csv
Dataset updated
Mar 3, 2025
Dataset authored and provided by
Neilsberg Research
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
South Dakota, Miner County
Variables measured
Income Level, Mean Household Income
Measurement technique
The data presented in this dataset is derived from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. It delineates income distributions across income quintiles (mentioned above) following an initial analysis and categorization. Subsequently, we adjusted these figures for inflation using the Consumer Price Index retroactive series via current methods (R-CPI-U-RS). For additional information about these estimations, please contact us via email at research@neilsberg.com
Dataset funded by
Neilsberg Research
Description
About this dataset

Context

The dataset presents the mean household income for each of the five quintiles in Miner County, SD, as reported by the U.S. Census Bureau. The dataset highlights the variation in mean household income across quintiles, offering valuable insights into income distribution and inequality.

Key observations

Income disparities: The mean income of the lowest quintile (the 20% of households with the lowest income) is 13,754, while the mean income of the highest quintile (the 20% of households with the highest income) is 184,982. The highest quintile therefore earns about 13 times as much as the lowest quintile.

Top 5%: The mean household income for the wealthiest population (top 5%) is 323,065, which is 174.65% of the mean for the highest quintile (74.65% higher) and 2,348.88% of the mean for the lowest quintile (2,248.88% higher).
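
The comparisons among the lowest quintile, the highest quintile, and the top 5% can be checked with a short calculation. The sketch below is illustrative only; the helper names `percent_of` and `percent_higher` are assumptions introduced here, and the three income figures are the ones quoted in the key observations. It separates "percent of" from "percent higher", two quantities that are easy to conflate.

```python
def percent_of(value, base):
    """Return value expressed as a percentage of base."""
    return 100.0 * value / base


def percent_higher(value, base):
    """Return how much larger value is than base, as a percentage of base."""
    return 100.0 * (value - base) / base


# Mean household incomes quoted in the dataset description.
lowest_quintile = 13_754
highest_quintile = 184_982
top_5_percent = 323_065

# Ratio of the highest to the lowest quintile (roughly 13 to 1).
quintile_ratio = highest_quintile / lowest_quintile

# Top 5% relative to the highest quintile: "percent of" vs. "percent higher".
top5_vs_highest_of = percent_of(top_5_percent, highest_quintile)
top5_vs_highest_higher = percent_higher(top_5_percent, highest_quintile)

# Top 5% relative to the lowest quintile.
top5_vs_lowest_of = percent_of(top_5_percent, lowest_quintile)
top5_vs_lowest_higher = percent_higher(top_5_percent, lowest_quintile)
```

The two measures always differ by exactly 100 percentage points, so a figure such as 174.65% is only meaningful once it is clear which of the two it reports.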

Content

When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

Income Levels:

Lowest Quintile

Second Quintile

Third Quintile

Fourth Quintile

Highest Quintile

Top 5 Percent

Variables / Data Columns

Income Level: This column lists the income levels (as listed above).

Mean Household Income: Mean household income, in 2023 inflation-adjusted dollars for the specific income level.

Good to know

Margin of Error

Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

Custom data

If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

Recommended for further research

This dataset is part of the main dataset for Miner County median household income. You can refer to the same here
Albania Enterprises: Mining and Quarrying: Investment: Means of Transport
ceicdata.com
Updated Feb 7, 2018
Cite
CEICdata.com (2018). Albania Enterprises: Mining and Quarrying: Investment: Means of Transport [Dataset]. https://www.ceicdata.com/en/albania/enterprises-income-and-investment-by-industry-nace-2/enterprises-mining-and-quarrying-investment-means-of-transport
Explore at:
Dataset updated
Feb 7, 2018
Dataset provided by
CEICdata.com
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Dec 1, 2012 - Dec 1, 2022
Area covered
Albania
Variables measured
Enterprises Statistics
Description
Albania Enterprises: Mining and Quarrying: Investment: Means of Transport data was reported at 422.000 ALL mn in 2022. This records an increase from the previous number of 236.211 ALL mn for 2021. The data is updated yearly, with a median of 363.000 ALL mn from Dec 2012 to 2022 across 11 observations. The data reached an all-time high of 1,157.821 ALL mn in 2019 and a record low of 230.000 ALL mn in 2016. The data remains in active status in CEIC and is reported by the Institute of Statistics. The data is categorized under Global Database’s Albania – Table AL.O011: Enterprises Income and Investment: by Industry: NACE 2.
Data from: Making the Case for Process Analytics: A Use Case in Court...
data.mendeley.com
Updated Aug 15, 2025
Cite
Milda Aleknonyte-Resch (2025). Making the Case for Process Analytics: A Use Case in Court Proceedings [Dataset]. http://doi.org/10.17632/3mcvbrhr7c.2
Explore at:
Unique identifier
https://doi.org/10.17632/3mcvbrhr7c.2
Dataset updated
Aug 15, 2025
Authors
Milda Aleknonyte-Resch
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data was extracted in PDF format with personal information redacted to ensure privacy. The raw dataset consisted of 260 cases from three chambers within a single German social law court. The data originates from a single judge, who typically oversees five to six chambers, meaning that this dataset represents only a subset of the judge’s total caseload. Optical Character Recognition (OCR) was used to extract the document text, which was organized into an event log according to the tabular structure of the documents. In the dataset, a single timestamp is recorded for each activity, commonly indicating only the date of occurrence rather than a precise timestamp. This limits the granularity of time-based analyses and the accuracy of calculated activity durations. As the analysis focuses on the overall durations of cases, which typically range from multiple months to years, the impact of the timestamp imprecisions was negligible in our use case. After extraction, the event log was further processed in consultation with domain experts to ensure anonymity, remove noise, and raise it to an abstraction level appropriate for analysis. All remaining personal identifiers, such as expert witness names, were removed from the log to ensure anonymity. Additionally, timestamps were systematically perturbed to further enhance data privacy. Originally, the event log contained 22,664 recorded events and 290 unique activities. Activities that were extremely rare (i.e., occurring fewer than 30 times) were excluded to focus on frequently observed procedural steps. Furthermore, the domain experts reviewed the list of unique activity labels, based on which similar activities were merged, and terminology was standardized across cases. The refinement of the activity labels reduced the number of unique activities to 59. Finally, duplicate events were removed. These steps collectively reduced the dataset to 19,947 events. 
The final anonymized and processed dataset includes 260 cases, 19,947 events from three chambers and 59 unique activities.
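
The filtering steps described above (dropping rare activities, merging similar labels, removing duplicate events) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the event layout `(case_id, activity, date)`, the function name `preprocess_event_log`, and the `label_map` dictionary standing in for the expert-driven label merging are all hypothetical. Only the threshold of 30 occurrences comes from the description.

```python
from collections import Counter


def preprocess_event_log(events, min_count=30, label_map=None):
    """Filter rare activities, merge labels, and drop duplicate events.

    events: list of (case_id, activity, date) tuples (assumed layout).
    min_count: activities occurring fewer times than this are excluded.
    label_map: optional dict mapping raw labels to standardized labels
               (a stand-in for the manual review by domain experts).
    """
    label_map = label_map or {}

    # Step 1: exclude activities that occur fewer than min_count times.
    counts = Counter(activity for _, activity, _ in events)
    frequent = [e for e in events if counts[e[1]] >= min_count]

    # Step 2: merge similar activities under a standardized label.
    merged = [(case, label_map.get(act, act), date) for case, act, date in frequent]

    # Step 3: remove exact duplicate events while preserving order.
    seen = set()
    unique = []
    for event in merged:
        if event not in seen:
            seen.add(event)
            unique.append(event)
    return unique
```

Because each activity carries only a date rather than a precise timestamp, two genuinely distinct events of the same type on the same day in the same case would be indistinguishable from a duplicate in this sketch.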
Onset of mining operations
data.niaid.nih.gov
zenodo.org
Updated Mar 17, 2024
Cite
Remelgado, Ruben (2024). Onset of mining operations [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8214548
Explore at:
Dataset updated
Mar 17, 2024
Dataset provided by
Remelgado, Ruben
Meyer, Carsten
Description
Motivation

Maus et al. created the first database of the spatial extent of mining areas by mobilizing nearly 20 years of Landsat data. This dataset is imperative for GlobES, as mining areas are specified in the IUCN habitat class scheme. Yet, this dataset is temporally static. To address this limitation, we mined the Landsat archive to infer the first observable year of mining.

Approach

For each mining area polygon, we collected 50 random samples within it and 50 random samples along its borders. This was meant to reflect increasing spectral differences between areas within and outside a mining exploration after its onset. Then, for each sample, we used Google Earth Engine to extract spectral profiles for every available acquisition between 1990 and 2020.

After completing the extraction, we estimated mean spectral profiles for each acquisition date, once for the samples “inside” the mining area and once for those “outside” of it. In this process, we masked pixels affected by clouds and cloud shadows using Landsat's quality information.

Using the time series of mean profiles, at each mining site and for each unique date, we normalized the “inside” and “outside” multi-spectral averages and estimated the Root Mean Square Error (RMSE) between them. The normalization step aimed at emphasizing differences in the shape of the spectral profiles rather than in specific values, which can be related to radiometric inaccuracies or simply to differences in acquisition dates. This resulted in an RMSE time series for each mining site.
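
The per-date comparison of the “inside” and “outside” mean profiles can be sketched as below. The description does not say which normalization was applied, so the min-max scaling used here is an assumption, as are the function names `normalize` and `profile_rmse`. The point of the sketch is only that two profiles with the same shape but different magnitudes yield an RMSE of zero after normalization.

```python
import math


def normalize(profile):
    """Min-max scale a spectral profile to [0, 1].

    Assumption: the source does not specify the normalization; min-max
    scaling is one simple choice that preserves the profile's shape.
    """
    lo, hi = min(profile), max(profile)
    if hi == lo:
        # A flat profile has no shape to compare; map it to zeros.
        return [0.0 for _ in profile]
    return [(v - lo) / (hi - lo) for v in profile]


def profile_rmse(inside, outside):
    """RMSE between the normalized inside and outside mean profiles."""
    if len(inside) != len(outside):
        raise ValueError("profiles must cover the same spectral bands")
    a, b = normalize(inside), normalize(outside)
    squared = [(x - y) ** 2 for x, y in zip(a, b)]
    return math.sqrt(sum(squared) / len(squared))
```

Calling `profile_rmse` once per acquisition date gives one value per date, which together form the RMSE time series for a mining site.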

We then used these data to infer the first mining year. To achieve this, we first derived a cumulative sum of the RMSE time series with the intent of removing noise while preserving abrupt directional changes. For example, if a mine was introduced in a forest, it would drive an increase in the RMSE due to the removal of trees, whereas the outskirts of the mine would remain forested. In this example, the accumulated values would tilt upwards. However, if a mining exploration was accompanied by the removal of vegetation along its outskirts where bare land was common, a downward shift in RMSE values is more likely as the landscape becomes more homogeneous.
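
The cumulative-sum step can be illustrated with a toy series. This is a sketch only: the function name `cumulative_rmse` and the sample values are made up for illustration and are not taken from the dataset.

```python
from itertools import accumulate


def cumulative_rmse(rmse_series):
    """Running total of an RMSE time series, in acquisition order."""
    return list(accumulate(rmse_series))


# Toy series: small inside/outside differences, then a sustained increase,
# as might follow the clearing of forest inside a mining polygon.
toy_rmse = [0.1, 0.1, 0.1, 0.5, 0.5, 0.5]
toy_cumulative = cumulative_rmse(toy_rmse)

# Step-to-step slope of the accumulated curve; it changes where the shift occurs.
slopes = [b - a for a, b in zip(toy_cumulative, toy_cumulative[1:])]
```

Single noisy dates barely alter the accumulated curve, whereas a sustained change in RMSE alters its slope, which is the bend the subsequent detection step looks for.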

To detect the date marking a shift in RMSE values, we used a knee/elbow detection algorithm implemented in the Python package kneebow, which uses curve rotation to infer the inflection/deflection point of a time series. Here, downward trends correspond to the elbow and upward trends to the knee. To determine which of these metrics was the most adequate, we used the Area Under the Curve (AUC). An elbow is characterized by a convex shape of the time series, which makes the AUC greater than 50%. However, if the shape of the curve is concave, the knee is the most adequate metric. We limited the detection of shifts to time series with at least 100 time steps. Below this threshold, we assumed the mine (or the conditions to sustain it) had been present since 1990.
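
The idea behind rotation-based knee/elbow detection can be sketched without the kneebow package. The code below is an independent approximation of the general technique, not kneebow's own implementation or API: the curve is scaled to the unit square, rotated so that the line from its first to its last point is horizontal, and the lowest (elbow) or highest (knee) rotated point is taken as the bend. All function names are hypothetical. The choice between knee and elbow is left to the caller because the exact AUC computation is not given in the description; the 100-step threshold and the 1990 default are taken from it.

```python
import math


def rotate_curve(ys):
    """Return the rotated y-values of a curve sampled at equal steps."""
    n = len(ys)
    if n < 3:
        raise ValueError("need at least three points to locate a bend")
    lo, hi = min(ys), max(ys)
    span = (hi - lo) or 1.0  # avoid division by zero for flat curves
    xs = [i / (n - 1) for i in range(n)]
    scaled = [(y - lo) / span for y in ys]
    # Angle of the chord joining the first and last points.
    theta = math.atan2(scaled[-1] - scaled[0], xs[-1] - xs[0])
    sin_t, cos_t = math.sin(-theta), math.cos(-theta)
    x0, y0 = xs[0], scaled[0]
    # Rotate by -theta about the first point; only the new y is needed.
    return [(x - x0) * sin_t + (y - y0) * cos_t for x, y in zip(xs, scaled)]


def elbow_index(ys):
    """Index of the point farthest below the chord after rotation."""
    rotated = rotate_curve(ys)
    return min(range(len(rotated)), key=rotated.__getitem__)


def knee_index(ys):
    """Index of the point farthest above the chord after rotation."""
    rotated = rotate_curve(ys)
    return max(range(len(rotated)), key=rotated.__getitem__)


def onset_year(years, cumulative, use_knee, min_steps=100, default_year=1990):
    """Year of the detected bend, or default_year for short time series."""
    if len(cumulative) < min_steps:
        return default_year
    idx = knee_index(cumulative) if use_knee else elbow_index(cumulative)
    return years[idx]
```

For example, a curve that stays flat and then rises bends below its chord, so `elbow_index` locates the last flat point, while a curve that rises and then plateaus bends above it and is located by `knee_index`.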

Content

This repository contains the infrastructure used to infer the start of a mining operation, which is organized as follows:

00_data - Contains the base data required for the operation, including a SHP file with the mining area outlines, and validation samples.

01_analysis - Contains several outputs of our analysis:

xy.tar.gz - Sample locations for each mining site.

sr.tar.gz - Spectral profiles for each sample location.

mine_start.csv - First year when we detected the start of mining.

02_code - Includes all code used in our analysis.

requirements.txt - Python module requirements that can be fed to pip to replicate our study.

config.yml - Configuration file, including information on the Landsat products used.
Comparison of the running time(in ms) of the three algorithms.
plos.figshare.com
xls
Updated Jun 11, 2023
Cite
Yaling Zhang; Jin Han (2023). Comparison of the running time(in ms) of the three algorithms. [Dataset]. http://doi.org/10.1371/journal.pone.0248737.t004
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0248737.t004
Dataset updated
Jun 11, 2023
Dataset provided by
PLOS ONE
Authors
Yaling Zhang; Jin Han
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Comparison of the running time(in ms) of the three algorithms.
Net profit margin of the top mining companies 2002-2024
statista.com
Updated Dec 13, 2024
+ more versions
Cite
M. Garside (2024). Net profit margin of the top mining companies 2002-2024 [Dataset]. https://www.statista.com/topics/1143/mining/
Explore at:
Dataset updated
Dec 13, 2024
Dataset provided by
Statista (http://statista.com/)
Authors
M. Garside
Description
In 2011, the net profit margin of the mining industry's 40 leading companies was approximately 24 percent. Twelve years later, in 2023, the net profit margin stood at 11 percent.

Profits of the top mining companies

The net profit margin (also known as profit margin, net margin, or net profit ratio) is a measure of a company's profitability. It is calculated by dividing net income by total revenue (or net profit by sales). For 2023, this means that the top 40 mining companies kept 11 cents of profit out of every U.S. dollar they earned. The average net profit margin of the world’s top 40 mining companies stood at some seven percent in 2014, decreased to negative seven percent in 2015, and then rebounded to 11 percent in 2023. These figures are a distinct decrease compared with earlier years. In 2023, the top 40 mining companies in the world generated a net profit of approximately 90 billion U.S. dollars.

The global top 40 mining companies, which represent the vast majority of the industry, generated more than 840 billion U.S. dollars of revenue in 2023. In terms of quantity, these companies produce the majority of coal (including thermal and metallurgical coal), iron ore, and bauxite.
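
The net profit margin definition can be checked against the rounded 2023 figures quoted in the description. This is a rough sketch: the function name `net_profit_margin` is introduced here for illustration, and because the description gives only approximate totals ("approximately 90 billion", "more than 840 billion"), the result is an estimate rather than the published value.

```python
def net_profit_margin(net_income, revenue):
    """Net profit margin as a fraction: net income divided by total revenue."""
    if revenue == 0:
        raise ValueError("revenue must be non-zero")
    return net_income / revenue


# Rounded 2023 totals for the top 40 mining companies, in U.S. dollars.
net_profit_2023 = 90e9
revenue_2023 = 840e9

margin_2023 = net_profit_margin(net_profit_2023, revenue_2023)
margin_percent = round(100 * margin_2023)  # whole-percent figure

# Profit kept out of each dollar earned, in cents.
cents_per_dollar = round(100 * margin_2023)
```

With these rounded inputs the margin comes out at roughly 11 percent, in line with the figure stated for 2023.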
Descriptions of the datasets.
plos.figshare.com
xls
Updated Jun 10, 2023
+ more versions
Cite
Yaling Zhang; Jin Han (2023). Descriptions of the datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0248737.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0248737.t002
Dataset updated
Jun 10, 2023
Dataset provided by
PLOS ONE
Authors
Yaling Zhang; Jin Han
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Descriptions of the datasets.
Raw data of the ships of priority II in 2017
explore.openaire.eu
figshare.com
Updated Mar 30, 2020
Cite
Jiaqi Mu (2020). Raw data of the ships of priority II in 2017 [Dataset]. http://doi.org/10.4121/uuid:92a6ce98-5001-4768-bec1-49881a367d36
Explore at:
Unique identifier
https://doi.org/10.4121/uuid:92a6ce98-5001-4768-bec1-49881a367d36
Dataset updated
Mar 30, 2020
Authors
Jiaqi Mu
Description
See the article "Targeting model based on principal component analysis and extreme learning machine" for the meaning of the data.
Replication Data for: Policy Diffusion: The Issue-Definition Stage
search.dataone.org
dataverse.harvard.edu
Updated Nov 22, 2023
Cite
Gilardi, Fabrizio; Shipan, Charles R.; Wüest, Bruno (2023). Replication Data for: Policy Diffusion: The Issue-Definition Stage [Dataset]. http://doi.org/10.7910/DVN/QEMNP1
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/QEMNP1
Dataset updated
Nov 22, 2023
Dataset provided by
Harvard Dataverse
Authors
Gilardi, Fabrizio; Shipan, Charles R.; Wüest, Bruno
Description
We put forward a new approach to studying issue definition within the context of policy diffusion. Most studies of policy diffusion---which is the process by which policymaking in one government affects policymaking in other governments---have focused on policy adoptions. We shift the focus to an important but neglected aspect of this process: the issue-definition stage. We use topic models to estimate how policies are framed during this stage and how these frames are predicted by prior policy adoptions. Focusing on smoking restriction in U.S. states, our analysis draws upon an original dataset of over 52,000 paragraphs from newspapers covering 49 states between 1996 and 2013. We find that frames regarding the policy's concrete implications are predicted by prior adoptions in other states, while frames regarding its normative justifications are not. Our approach and findings open the way for a new perspective to studying policy diffusion in many different areas.
Data from: Quantitative Wirtschaftsgeschichte des Ruhrkohlenbergbaus im 19....
search.gesis.org
pollux-fid.de
+1more
Updated Apr 13, 2010
+ more versions
Cite
Holtfrerich, Carl-Ludwig (2010). Quantitative Wirtschaftsgeschichte des Ruhrkohlenbergbaus im 19. Jahrhundert [Dataset]. http://doi.org/10.4232/1.8207
Explore at:
(93874) Available download formats
Unique identifier
https://doi.org/10.4232/1.8207
Dataset updated
Apr 13, 2010
Dataset provided by
GESIS Data Archive
GESIS search
Authors
Holtfrerich, Carl-Ludwig
License
https://www.gesis.org/en/institute/data-usage-terms
Time period covered
1816 - 1913
Description
Holtfrerich first presents Rostow's concept of the leading sector and then sketches the development of mining in the Ruhr area using theoretical approaches to production, price, and investment. In doing so, the author attempts to quantify the connections between the development of coal mining in the Ruhr district and other important sectors by means of an input-output scheme. He then examines how far the development of mining in the Ruhr area during its major growth phase in the 19th century complies with Rostow's criteria for a leading sector. Finally, Holtfrerich tests the assumption that mining in the Ruhr district was a leading sector of German industrialisation.

Chart register
Chart 01: Coal mining in the OBAB Dortmund, the Saar area, and the Kingdom of Prussia (1816-1913)
Chart 02: Annual average price of coal in the OBAB Dortmund, nominal and real development (1816-1913)
Chart 03: Number of operating coal mines in the OBAB Dortmund, and average production of each mine (1816-1892)
Chart 04: Share of the five and ten largest mines in the total coal production of the mines in the OBAB Dortmund, in percent (1852-1890)
Chart 05: Contributions of coal mines in the OBAB Dortmund in 1,000 marks (1850-1895)
Chart 06: Tax burden for coal mining in the Lower Rhine region and in Westphalia (1880-1903)
Chart 07: Burden of the coal mines in the OBAB Dortmund; coal mine contributions (“Bergwerksabgaben”) and taxes in percent of coal sales value (1816-1913)
Chart 08: Annually licensed basic capital of the “Montan-Aktiengesellschaften” (coal, iron and steel corporations) founded in the Ruhr (1840-1870)
Chart 10: Average number of workers per year (including mine officials) in the field of coal mining in the OBAB Dortmund (1816-1913)
Chart 11: Average annual net payroll and annual net basic wages of the miners in the OBAB Dortmund (1850-1913)
Chart 12: Wages in coal mining within the OBAB Dortmund (1850-1903)
Chart 13: Working hours in coal mining within the OBAB Dortmund (1852-1892)
Chart 14: Labour productivity in coal mining in the OBAB Dortmund (1816-1913)
Chart 15: Development of capital investment: disposable steam machines (combined engine power in HP) of coal mines within the OBAB Dortmund (1851-1892)
Chart 16: Development of investment: annual increase of steam machine power (in HP) (1852-1892)
Chart 18: Development of capital productivity and capital intensity (1851-1892)
Chart 19: Data on net value added and capital income in the field of coal mining in the OBAB Dortmund (1850-1903)
Chart 20: Capital income/dividends and profits per produced ton of coal for coal mining in the Ruhr area (1850-1892)
Chart 21: Proportion of the total coal produced in the Lower Rhine/Westphalian basin that was coked by the colliery itself or, from 1882 on, formed into briquettes (1861-1892)
Chart 22: Percentage of propulsion power in HP applied in coal mining within the OBAB Dortmund (1875-1895)
Chart 23: Own consumption of coal of mines within the OBAB Dortmund (1852-1892)
Chart 24: Development of the profit indicator for coal mining in the Ruhr district (1851-1892)
Chart 25: Expansion of the German railway system (1835-1892)
Chart 26: Figures on the development of Prussian railways (1844-1882)
Chart 27: Development of average revenues for the transport of coal on various railways (1861-1877)
Chart 28: Development of the shares of the means of transport in the transport of coal from the Ruhr area (1851-1889)
Chart 29: Division of domestic sales of the “Rheinisch-Westfälisches Kohlensyndikat” (Coal Syndicate of the Rhineland and Westphalia) by consumption group, in percent (1902-1906)
Chart 30: Wrought iron production and steel production from coal in the OBAB Dortmund and in the OBAB Bonn (part on the right bank of the Rhine) (1852-1882)
Chart 31: Crude iron production in the Ruhr area, OBAB Dortmund (1837-1900)
Chart 32: Price development for crude iron, bar iron and cast steel in the Ruhr district (1850-1892)
Chart 33: Bar iron production in the OBAB Dortmund and in the OBAB Bonn by means of the charcoal hearth process and the “Puddelverfahren”, a method to produce steel from crude iron (1835-1870)
Chart 34: The importance of the economic sectors according to their respective employment figures (1852-1875)
