44 datasets found
  1. Data from: A Generic Local Algorithm for Mining Data Streams in Large...

    • catalog.data.gov
    • datasets.ai
    Updated Apr 10, 2025
    Cite
    Dashlink (2025). A Generic Local Algorithm for Mining Data Streams in Large Distributed Systems [Dataset]. https://catalog.data.gov/dataset/a-generic-local-algorithm-for-mining-data-streams-in-large-distributed-systems
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    In a large network of computers or wireless sensors, each of the components (henceforth, peers) has some data about the global state of the system. Much of the system's functionality, such as message routing, information retrieval, and load sharing, relies on modeling the global state. We refer to the outcome of the function (e.g., the load experienced by each peer) as the model of the system. Since the state of the system is constantly changing, it is necessary to keep the models up to date. Computing global data mining models (e.g., decision trees, k-means clustering) in large distributed systems may be very costly due to the scale of the system and the potentially high communication cost. The cost further increases in a dynamic scenario when the data changes rapidly. In this paper we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm which can be used to monitor a wide class of data mining models. Then, we use this algorithm as a feedback loop for the monitoring of complex functions of the data such as its k-means clustering. The theoretical claims are corroborated with a thorough experimental analysis.

  2. Educational Attainment in North Carolina Public Schools: Use of statistical...

    • data.mendeley.com
    Updated Nov 14, 2018
    Cite
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The purpose of data mining analysis is to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before doing any work on the data, the data has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. In our project, using clustering prior to classification did not improve performance much. One possible reason is that the features we selected for clustering are not well suited to it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

    From the dimensionality reduction perspective: clustering differs from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension loses a lot of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

    From the creating new features perspective: clustering analysis creates labels based on the patterns of the data, which brings uncertainty into the data. When clustering is used prior to classification, the choice of the number of clusters strongly affects the performance of the clustering, which in turn affects the performance of classification. If the subset of features we cluster on is well suited to it, clustering might increase the overall classification performance. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better.

    We did not lock in the clustering outputs using a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, maybe the data just does not cluster well with the methods selected. Basically, the ramification we saw was that our results are not much better than random when applying clustering to the data preprocessing.

    Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to continue to revise the models from time to time as things change.
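
    The remark above that Euclidean distance loses meaning in high dimensions can be illustrated with a small experiment. The following is a minimal sketch, not the author's code: it assumes points drawn uniformly from the unit hypercube (a hypothetical stand-in for real features) and measures the relative contrast between the farthest and nearest neighbour of a query point. The function name and parameters are illustrative only.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for distances from one random query point.

    Assumption: synthetic points drawn uniformly from [0, 1]^dim.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)


# As the dimension grows, the nearest and farthest points become almost
# equally far away, so distance-based cluster assignments carry less signal.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

    On such synthetic data the contrast typically shrinks as the dimension grows, which is one reason cluster labels built from many features can discard most of the information.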

  3. A predictive model for opal exploration in Australia from a data mining...

    • researchdata.edu.au
    Updated May 1, 2015
    Cite
    Thomas Landgrebe; Thomas Landgrebe; Adriana Dutkiewicz; Dietmar Muller (2015). A predictive model for opal exploration in Australia from a data mining approach [Dataset]. http://doi.org/10.4227/11/5587A86C0FDF1
    Dataset updated
    May 1, 2015
    Dataset provided by
    The University of Sydney
    Authors
    Thomas Landgrebe; Thomas Landgrebe; Adriana Dutkiewicz; Dietmar Muller
    License

    Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Dataset funded by
    Australian Research Council
    Description

    This data collection is associated with the publications: Merdith, A. S., Landgrebe, T. C. W., Dutkiewicz, A., & Müller, R. D. (2013). Towards a predictive model for opal exploration using a spatio-temporal data mining approach. Australian Journal of Earth Sciences, 60(2), 217-229. doi: 10.1080/08120099.2012.754793

    and

    Landgrebe, T. C. W., Merdith, A., Dutkiewicz, A., & Müller, R. D. (2013). Relationships between palaeogeography and opal occurrence in Australia: A data-mining approach. Computers & Geosciences, 56(0), 76-82. doi: 10.1016/j.cageo.2013.02.002

    Publication Abstract - Merdith et al. (2013)

    Opal is Australia's national gemstone; however, until recently, most significant opal discoveries were made in the early 1900s, more than 100 years ago. Currently there is no formal exploration model for opal, meaning there are no widely accepted concepts or methodologies available to suggest where new opal fields may be found. As a consequence, opal mining in Australia is a cottage industry, with the majority of opal exploration focused around old opal fields. The EarthByte Group has developed a new opal exploration methodology for the Great Artesian Basin. The work is based on the concept of applying “big data mining” approaches to data sets relevant for identifying regions that are prospective for opal. The group combined a multitude of geological and geophysical data sets that were jointly analysed to establish associations between particular features in the data and known opal mining sites. A “training set” of known opal localities (1036 opal mines) was assembled, using those localities which were featured in published reports and on maps. The data used include rock types, soil type, regolith type, topography, radiometric data and a stack of digital palaeogeographic maps. The different data layers were analysed via spatio-temporal data mining combining the GPlates PaleoGIS software (www.gplates.org) with the Orange data mining software (orange.biolab.si) to produce the first opal prospectivity map for the Great Artesian Basin. One of the main results of the study is that the geological conditions favourable for opal were found to be related to a particular sequence of surface environments over geological time. These conditions involved alternating shallow seas and river systems followed by uplift and erosion. The approach reduces the entire area of the Great Artesian Basin to a mere 6% that is deemed to be prospective for opal exploration. The work is described in two companion papers in the Australian Journal of Earth Sciences and Computers & Geosciences.

    Publication Abstract - Landgrebe et al. (2013)

    Age-coded multi-layered geological datasets are becoming increasingly prevalent with the surge in open-access geodata, yet there are few methodologies for extracting geological information and knowledge from these data. We present a novel methodology, based on the open-source GPlates software in which age-coded digital palaeogeographic maps are used to “data-mine” spatio-temporal patterns related to the occurrence of Australian opal. Our aim is to test the concept that only a particular sequence of depositional/erosional environments may lead to conditions suitable for the formation of gem quality sedimentary opal. Time-varying geographic environment properties are extracted from a digital palaeogeographic dataset of the eastern Australian Great Artesian Basin (GAB) at 1036 opal localities. We obtain a total of 52 independent ordinal sequences sampling 19 time slices from the Early Cretaceous to the present-day. We find that 95% of the known opal deposits are tied to only 27 sequences all comprising fluvial and shallow marine depositional sequences followed by a prolonged phase of erosion. We then map the total area of the GAB that matches these 27 opal-specific sequences, resulting in an opal-prospective region of only about 10% of the total area of the basin. The key patterns underlying this association involve only a small number of key environmental transitions. We demonstrate that these key associations are generally absent at arbitrary locations in the basin. This new methodology allows for the simplification of a complex time-varying geological dataset into a single map view, enabling straightforward application for opal exploration and for future co-assessment with other datasets/geological criteria. This approach may help unravel the poorly understood opal formation process using an empirical spatio-temporal data-mining methodology and readily available datasets to aid hypothesis testing.
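
    The counting step behind a statement such as "95% of the known opal deposits are tied to only 27 sequences" can be sketched as follows. This is an illustration under stated assumptions, not the authors' GPlates/Orange workflow: the environment codes ("M" marine, "F" fluvial, "E" erosional), the toy localities, and the function name are all hypothetical, and the abstract does not say how the 27 sequences were selected.

```python
from collections import Counter


def sequences_covering(locality_sequences, target=0.95):
    """Pick the most frequent sequences until they cover the target share of localities.

    locality_sequences: one tuple of environment codes per locality (hypothetical encoding).
    Returns (chosen sequences, share of localities they cover).
    """
    counts = Counter(locality_sequences)
    total = len(locality_sequences)
    chosen, covered = [], 0
    for sequence, n in counts.most_common():
        if covered / total >= target:
            break
        chosen.append(sequence)
        covered += n
    return chosen, covered / total


# Toy data: 21 localities, each described by a 3-step environment sequence.
localities = (
    [("M", "F", "E")] * 12
    + [("F", "M", "E")] * 8
    + [("M", "M", "M")] * 1
)
chosen, share = sequences_covering(localities)
print(len(chosen), "sequences cover", round(share, 3), "of localities")
```

    In the study itself the sequences span 19 time slices at 1036 localities; the toy input above only shows the shape of the computation.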

    Authors and Institutions

    Andrew Merdith - EarthByte Research Group, School of Geosciences, The University of Sydney, Australia. ORCID: 0000-0002-7564-8149

    Thomas Landgrebe - EarthByte Research Group, School of Geosciences, The University of Sydney, Australia

    Adriana Dutkiewicz - EarthByte Research Group, School of Geosciences, The University of Sydney, Australia

    R. Dietmar Müller - EarthByte Research Group, School of Geosciences, The University of Sydney, Australia. ORCID: 0000-0002-3334-5764

    Overview of Resources Contained

    This collection contains geological data from Australia used for data mining in the publications Merdith et al. (2013) and Landgrebe et al. (2013). The resulting maps of opal prospectivity are also included.

    List of Resources

    Note: For details on the files included in this data collection, see “Description_of_Resources.txt”.

    Note: For information on file formats and what programs to use to interact with various file formats, see “File_Formats_and_Recommended_Programs.txt”.

    • Map of Barfield region, Australia (.jpg, 270 KB)
    • Map overviewing the Great Artesian basins and main opal mining camps (.png, 82 KB)
    • Maps showing opal prospectivity data mining results for different geological datasets (.tif, 23.1 MB)
    • Map of opal prospectivity from palaeogeography data mining (.pdf, 2.6 MB)
    • Raster of palaeogeography target regions for viewing in Google Earth (.jpg, 418 KB)
    • Opal mine locations (.gpml, .txt, .kmz, .shp, total 15.6 MB)
    • Map of opal prospectivity from all data mining results as a Google Earth overlay (.kmz, 12 KB)
    • Map of probability of opal occurrence in prospective regions from all data mining results (.tif, 5.9 MB)
    • Paleogeography of Australia (.gpml, .txt, .shp, total 114.2 MB)
    • Radiometric data showing potassium concentration contrasts (.tif, .kmz, total 311.3 MB)
    • Regolith data (.gpml, .txt, .kml, .shp, total 7.1 MB)
    • Soil type data (.gpml, .txt, .kml, .shp, total 7.1 MB)

    For more information on this data collection, and links to other datasets from the EarthByte Research Group please visit EarthByte

    For more information about using GPlates, including tutorials and a user manual please visit GPlates or EarthByte

  4. Datasets(Original, Mean, Median, Most Frequent).zip

    • figshare.com
    zip
    Updated May 31, 2023
    Cite
    Omar Elzeki (2023). Datasets(Original, Mean, Median, Most Frequent).zip [Dataset]. http://doi.org/10.6084/m9.figshare.8118710.v1
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare http://figshare.com/
    Authors
    Omar Elzeki
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The datasets are converted to Matlab format and stored as cell arrays. Each cell is a matrix in which each column represents a gene and each row a subject. Each dataset is organized in a separate directory containing four versions: a) original dataset, b) dataset imputed by mean, c) dataset imputed by median, d) dataset imputed by most frequent value.
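
    The three imputation strategies named above can be sketched as below. This is a minimal illustration in Python rather than Matlab, and it assumes missing values are marked as None; the description does not state how missing entries are encoded in the original files. The function name and the sample column are hypothetical.

```python
from statistics import mean, median, mode


def impute(column, strategy):
    """Fill None entries of one gene column with a summary of the observed values.

    strategy: "mean", "median" or "most_frequent".
    """
    observed = [value for value in column if value is not None]
    summary = {"mean": mean, "median": median, "most_frequent": mode}[strategy]
    fill = summary(observed)  # mode() returns the first mode if several tie (Python 3.8+)
    return [fill if value is None else value for value in column]


# One hypothetical gene column across five subjects, with one missing value.
gene = [1.0, 1.0, None, 2.0, 6.0]
for strategy in ("mean", "median", "most_frequent"):
    print(strategy, impute(gene, strategy))
```

    Applying the function column by column to a subjects-by-genes matrix would yield the three imputed versions alongside the original.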

  5. LScD (Leicester Scientific Dictionary)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    Cite
    Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
    Available download formats: docx
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    LScD (Leicester Scientific Dictionary)

    April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)

    Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes

    [Version 3]

    The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus) - Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation. After the pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are also the same as described for LScD Version 2 below.

    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2

    ** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

    [Version 2]

    Getting Started

    This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). This dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for usage of the code are available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.

    LSC is a collection of abstracts of articles and proceeding papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.

    LScD is an ordered list of words from texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing the word in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:

    1. Unique words in abstracts
    2. Number of documents containing each word
    3. Number of appearances of a word in the entire corpus

    Processing the LSC

    Step 1. Downloading the LSC Online: Use of the LSC is subject to acceptance of request of the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

    Step 2. Importing the Corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source with changes of parameters. The structure of the corpus, such as file format and names (also the position) of fields, should be taken into account when applying our code. The organisation of the CSV files of LSC is described in the README file for LSC [1].

    Step 3. Extracting Abstracts and Saving Metadata: Metadata, which include all fields in a document excluding abstracts, and the field of abstracts are separated. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

    Step 4. Text Pre-processing Steps on the Collection of Abstracts: In this section, we present our approaches to pre-processing the abstracts of the LSC.

    1. Removing punctuation and special characters: This is the process of substituting all non-alphanumeric characters with a space. We did not substitute the character “-” in this step, because we need to keep words like “z-score”, “non-payment” and “pre-processing” in order not to lose the actual meaning of such words. Uniting prefixes with words is performed in later steps of pre-processing.
    2. Lowercasing the text data: Lowercasing is performed to avoid treating the same word, such as “Corpus”, “corpus” and “CORPUS”, differently. The entire collection of texts is converted to lowercase.
    3. Uniting prefixes of words: Words containing prefixes joined with the character “-” are united as one word. The prefixes united for this research are listed in the file “list_of_prefixes.csv”. Most of the prefixes are extracted from [4]. We also added commonly used prefixes: ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.
    4. Substitution of words: Some words joined with “-” in the abstracts of the LSC require an additional substitution step to avoid losing the meaning of the word before removing the character “-”. Some examples of such words are “z-test”, “well-known” and “chi-square”. These words have been substituted with “ztest”, “wellknown” and “chisquare”. Such words were identified by sampling abstracts from LSC. The full list of such words and the decisions taken for substitution are presented in the file “list_of_substitution.csv”.
    5. Removing the character “-”: All remaining “-” characters are replaced by a space.
    6. Removing numbers: All digits which are not included in a word are replaced by a space. All words that contain digits and letters are kept because alphanumeric strings such as chemical formulae might be important for our analysis. Some examples are “co2”, “h2o” and “21st”.
    7. Stemming: Stemming is the process of converting inflected words into their word stem. This step unites several forms of words with similar meaning into one form and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
    8. Stop words removal: Stop words are words that are extremely common but provide little value in a language. Some common stop words in English are ‘I’, ‘the’, ‘a’ etc. We used the ‘tm’ package in R to remove stop words [6]. There are 174 English stop words listed in the package.

    Step 5. Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file “LScD.csv”.

    The Organisation of the LScD

    The total number of words in the file “LScD.csv” is 974,238. Each field is described below:

    Word: It contains unique words from the corpus. All words are in lowercase and in their stem forms. The field is sorted by the number of documents that contain words in descending order.

    Number of Documents Containing the Word: In this context, binary calculation is used: if a word exists in an abstract then there is a count of 1. If the word exists more than once in a document, the count is still 1. The total number of documents containing the word is counted as the sum of 1s in the entire corpus.

    Number of Appearance in Corpus: It contains how many times a word occurs in the corpus when the corpus is considered as one large document.

    Instructions for R Code

    LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. Outputs of the code are:

    Metadata File: It includes all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

    File of Abstracts: It contains all abstracts after the pre-processing steps defined in Step 4.

    DTM: It is the Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.

    LScD: An ordered list of words from LSC as defined in the previous section.

    The code can be used by:

    1. Download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’
    2. Open LScD_Creation.R script
    3. Change parameters in the script: replace with the full path of the directory with source files and the full path of the directory to write output files
    4. Run the full code.

    References

    [1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
    [2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
    [3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
    [5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved porter’s stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
    [6] I. Feinerer, "Introduction to the tm Package Text Mining in R," Available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
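
    The text pre-processing steps listed under Step 4 above, and the two counts stored in the LScD, can be sketched in a few lines. This is an illustration in Python, not the author's R script LScD_Creation.R. The prefix set, substitution table and stop-word list below are tiny hypothetical stand-ins for list_of_prefixes.csv, list_of_substitution.csv and the 174 stop words of the tm package, and stemming (step 7) is deliberately left out.

```python
import re
from collections import Counter

# Hypothetical miniature stand-ins for the real lists used by the author.
PREFIXES = {"non", "pre", "self"}
SUBSTITUTIONS = {"z-test": "ztest", "well-known": "wellknown", "chi-square": "chisquare"}
STOP_WORDS = {"i", "the", "a", "is", "of", "and", "in"}


def preprocess(text):
    """Apply steps 1-6 and 8 of the description to one abstract; return its tokens."""
    text = re.sub(r"[^A-Za-z0-9\-]", " ", text)          # 1. punctuation to space, keep "-"
    text = text.lower()                                   # 2. lowercase
    prefix_pattern = r"\b(" + "|".join(sorted(PREFIXES)) + r")-(?=[a-z0-9])"
    text = re.sub(prefix_pattern, r"\1", text)            # 3. unite prefixes with words
    for old, new in SUBSTITUTIONS.items():                # 4. substitute listed words
        text = text.replace(old, new)
    text = text.replace("-", " ")                         # 5. remaining "-" to space
    tokens = [t for t in text.split() if not t.isdigit()]  # 6. drop standalone numbers
    # 7. stemming is omitted in this sketch
    return [t for t in tokens if t not in STOP_WORDS]     # 8. remove stop words


def build_dictionary(abstracts):
    """Return (word, documents containing it, appearances in corpus), most documents first."""
    doc_count, corpus_count = Counter(), Counter()
    for abstract in abstracts:
        tokens = preprocess(abstract)
        corpus_count.update(tokens)       # every occurrence counts
        doc_count.update(set(tokens))     # at most 1 per document (binary count)
    words = sorted(doc_count, key=lambda w: (-doc_count[w], w))
    return [(w, doc_count[w], corpus_count[w]) for w in words]


abstracts = [
    "The z-test is Well-Known; CO2 and co2 in 2014.",
    "Non-payment of CO2 pre-processing.",
]
for row in build_dictionary(abstracts):
    print(row)
```

    The binary document count and the total corpus count differ only for words repeated within an abstract, which is why the sketch keeps two separate counters.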

  6. Data from: Hidden Room game data clustering in University of Cadiz (Spain)...

    • figshare.com
    png
    Updated Apr 30, 2018
    Cite
    Manuel Palomo-duarte; Anke Berns (2018). Hidden Room game data clustering in University of Cadiz (Spain) by DeutschUCA [Dataset]. http://doi.org/10.6084/m9.figshare.6194573.v2
    Available download formats: png
    Dataset updated
    Apr 30, 2018
    Dataset provided by
    Figshare http://figshare.com/
    Authors
    Manuel Palomo-duarte; Anke Berns
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    Histograms and results of k-means and Ward's clustering for the Hidden Room game at the University of Cadiz (Spain) by DeutschUCA.

    The fileset contains information from three sources:

    1. Histogram files:
    • Lexical_histogram.png (histogram of lexical error ratios)
    • Grammatical_histogram.png (histogram of grammatical error ratios)

    2. K-means clustering files:
    • elbow-lex kmeans.png (clustering by lexical aspects: error curves obtained by applying the elbow method to determine the optimal number of clusters)
    • cube-lex kmeans.png (clustering by lexical aspects: a three-dimensional representation of the clusters obtained after applying the k-means method)
    • Lexical_clusters (table) kmeans.xls (clustering by lexical aspects: centroids, standard deviations and number of instances assigned to each cluster)
    • elbow-gram kmeans.png (clustering by grammatical aspects: error curves obtained by applying the elbow method to determine the optimal number of clusters)
    • cube-gramm kmeans.png (clustering by grammatical aspects: a three-dimensional representation of the clusters obtained after applying the k-means method)
    • Grammatical_clusters (table) kmeans.xls (clustering by grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
    • elbow-lexgram kmeans.png (clustering by lexical and grammatical aspects: error curves obtained by applying the elbow method to determine the optimal number of clusters)
    • Lexical_Grammatical_clusters (table) kmeans.xls (clustering by lexical and grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
    • Grammatical_clusters_number_of_words (table) kmeans.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to grammatical error ratios.
    • Lexical_clusters_number_of_words (table) kmeans.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to lexical error ratios.
    • Lexical_Grammatical_clusters_number_of_words (table) kmeans.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to lexical and grammatical error ratios.

    3. Ward's Agglomerative Hierarchical Clustering files:
    • Lexical_Cluster_Dendrogram_ward.png (clustering by lexical aspects: dendrogram obtained after applying Ward's clustering method)
    • Grammatical_Cluster_Dendrogram_ward.png (clustering by grammatical aspects: dendrogram obtained after applying Ward's clustering method)
    • Lexical_Grammatical_Cluster_Dendrogram_ward.png (clustering by lexical and grammatical aspects: dendrogram obtained after applying Ward's clustering method)
    • Lexical_Grammatical_clusters (table) ward.xls: centroids (from column 2 to 7) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
    • Grammatical_clusters (table) ward.xls: centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
    • Lexical_clusters (table) ward.xls: centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
    • Lexical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
    • Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
    • Lexical_Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
  7. Data Mining Applied to Life Cycle Inventory Modeling for Cumene and Sodium...

    • catalog.data.gov
    • gimi9.com
    Updated Mar 4, 2021
    Cite
    U.S. EPA Office of Research and Development (ORD) (2021). Data Mining Applied to Life Cycle Inventory Modeling for Cumene and Sodium Hydroxide Manufacturing, Version 1, 09/2018 [Dataset]. https://catalog.data.gov/dataset/data-mining-applied-to-life-cycle-inventory-modeling-for-cumene-and-sodium-hydroxide-ma-09
    Dataset updated
    Mar 4, 2021
    Dataset provided by
    United States Environmental Protection Agency http://www.epa.gov/
    Description

    This file contains the life cycle inventories (LCIs) developed for an associated journal article. Potential users of the data are referred to the journal article for a full description of the modeling methodology. LCIs were developed for cumene and sodium hydroxide manufacturing using data mining with metadata-based data preprocessing. The inventory data were collected from US EPA's 2012 Chemical Data Reporting database, 2011 National Emissions Inventory, 2011 Toxics Release Inventory, 2011 Electronic Greenhouse Gas Reporting Tool, 2011 Discharge Monitoring Report, and the 2011 Biennial Report generated from the RCRAinfo hazardous waste tracking system. The U.S. average cumene gate-to-gate inventories are provided without (baseline) and with process allocation applied using metadata-based filtering. In 2011, there were 8 facilities reporting public production volumes of cumene in the U.S., totaling 2,609,309,687 kilograms of cumene produced that year. The U.S. average sodium hydroxide gate-to-gate inventories are also provided without (baseline) and with process allocation applied using metadata-based filtering. In 2011, there were 24 facilities reporting public production volumes of sodium hydroxide in the U.S., totaling 3,878,021,614 kilograms of sodium hydroxide produced that year. Process allocation was only conducted for the top 12 facilities producing sodium hydroxide, which represent 97% of the public production of sodium hydroxide. The data have not been compiled in the formal Federal Commons LCI Template, so that users do not interpret the template to mean the data have been fully reviewed according to LCA standards and can be directly applied to all types of assessments and decision needs without additional review by industry and potential stakeholders.

    This dataset is associated with the following publication: Meyer, D.E., S. Cashman, and A. Gaglione. Improving the reliability of chemical manufacturing life cycle inventory constructed using secondary data. JOURNAL OF INDUSTRIAL ECOLOGY. Berkeley Electronic Press, Berkeley, CA, USA, 25(1): 20-35, (2021).

  8. Mean Commuting Time for Workers (5-year estimate) in Miner County, SD

    • fred.stlouisfed.org
    json
    Updated Dec 12, 2024
    Cite
    (2024). Mean Commuting Time for Workers (5-year estimate) in Miner County, SD [Dataset]. https://fred.stlouisfed.org/series/B080ACS046097
    Explore at:
    jsonAvailable download formats
    Dataset updated
    Dec 12, 2024
    License

    https://fred.stlouisfed.org/legal/#copyright-public-domain

    Area covered
    South Dakota, Miner County
    Description

    Graph and download economic data for Mean Commuting Time for Workers (5-year estimate) in Miner County, SD (B080ACS046097) from 2009 to 2023 about Miner County, SD; commuting time; SD; workers; average; 5-year; and USA.

  9. Application Research of Clustering on kmeans

    • kaggle.com
    zip
    Updated Feb 27, 2021
    Cite
    ddpr raju (2021). Application Research of Clustering on kmeans [Dataset]. https://www.kaggle.com/ddprraju/tirupati-compus-school
    Explore at:
    zip(34507 bytes)Available download formats
    Dataset updated
    Feb 27, 2021
    Authors
    ddpr raju
    Description

    Dataset

    This dataset was created by ddpr raju

    Contents

    It contains the following files:

  10. Mean Commuting Time for Workers (5-year estimate) in Miner County, SD

    • tradingeconomics.com
    csv, excel, json, xml
    Updated Feb 13, 2020
    Cite
    TRADING ECONOMICS (2020). Mean Commuting Time for Workers (5-year estimate) in Miner County, SD [Dataset]. https://tradingeconomics.com/united-states/mean-commuting-time-for-workers-in-miner-county-sd-fed-data.html
    Explore at:
    csv, json, excel, xmlAvailable download formats
    Dataset updated
    Feb 13, 2020
    Dataset authored and provided by
    TRADING ECONOMICS
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1976 - Dec 31, 2025
    Area covered
    South Dakota, Miner County
    Description

    Mean Commuting Time for Workers (5-year estimate) in Miner County, SD was 19.31188 Minutes in January of 2023, according to the United States Federal Reserve. Historically, Mean Commuting Time for Workers (5-year estimate) in Miner County, SD reached a record high of 21.50298 in January of 2021 and a record low of 15.93110 in January of 2012. Trading Economics provides the current actual value, a historical data chart and related indicators for Mean Commuting Time for Workers (5-year estimate) in Miner County, SD, last updated from the United States Federal Reserve in August of 2025.

  11. Income Distribution by Quintile: Mean Household Income in Miner County, SD...

    • neilsberg.com
    csv, json
    Updated Mar 3, 2025
    Cite
    Neilsberg Research (2025). Income Distribution by Quintile: Mean Household Income in Miner County, SD // 2025 Edition [Dataset]. https://www.neilsberg.com/insights/miner-county-sd-median-household-income/
    Explore at:
    json, csvAvailable download formats
    Dataset updated
    Mar 3, 2025
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    South Dakota, Miner County
    Variables measured
    Income Level, Mean Household Income
    Measurement technique
    The data presented in this dataset is derived from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. It delineates income distributions across income quintiles (mentioned above) following an initial analysis and categorization. Subsequently, we adjusted these figures for inflation using the Consumer Price Index retroactive series via current methods (R-CPI-U-RS). For additional information about these estimations, please contact us via email at research@neilsberg.com
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset presents the mean household income for each of the five quintiles in Miner County, SD, as reported by the U.S. Census Bureau. The dataset highlights the variation in mean household income across quintiles, offering valuable insights into income distribution and inequality.

    Key observations

    • Income disparities: The mean income of the lowest quintile (20% of households with the lowest income) is 13,754, while the mean income for the highest quintile (20% of households with the highest income) is 184,982. This indicates that the top earners earn about 13 times as much as the lowest earners.
    • Top 5%: The mean household income for the wealthiest population (top 5%) is 323,065, which is 174.65% of the mean for the highest quintile and 2348.88% of the mean for the lowest quintile.
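
    The ratios behind these observations can be recomputed from the quoted means. The following is a minimal sketch; the variable and function names are illustrative and are not part of the dataset.

```python
# Mean household income figures quoted in the key observations.
lowest_quintile = 13_754
highest_quintile = 184_982
top_5_percent = 323_065


def percent_of(value, base):
    """Return `value` expressed as a percentage of `base`, to two decimals."""
    return round(100 * value / base, 2)


# How many times the lowest-quintile mean fits into the highest-quintile mean.
top_to_bottom = highest_quintile / lowest_quintile
print(f"Highest vs. lowest quintile: {top_to_bottom:.1f}x")

# Top 5% mean as a percentage of each quintile mean.
print(percent_of(top_5_percent, highest_quintile))
print(percent_of(top_5_percent, lowest_quintile))
```

    A value of 174.65 here means the top 5% mean is about 1.75 times the highest-quintile mean, that is, roughly 75% above it.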
    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Income Levels:

    • Lowest Quintile
    • Second Quintile
    • Third Quintile
    • Fourth Quintile
    • Highest Quintile
    • Top 5 Percent

    Variables / Data Columns

    • Income Level: This column showcases the income levels (As mentioned above).
    • Mean Household Income: Mean household income, in 2023 inflation-adjusted dollars for the specific income level.

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    Neilsberg Research Team curates, analyzes, and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is part of the main dataset for Miner County median household income. You can refer to the main dataset here

  12. Albania Enterprises: Mining and Quarrying: Investment: Means of Transport

    • ceicdata.com
    Updated Feb 7, 2018
    Cite
    CEICdata.com (2018). Albania Enterprises: Mining and Quarrying: Investment: Means of Transport [Dataset]. https://www.ceicdata.com/en/albania/enterprises-income-and-investment-by-industry-nace-2/enterprises-mining-and-quarrying-investment-means-of-transport
    Explore at:
    Dataset updated
    Feb 7, 2018
    Dataset provided by
    CEICdata.com
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 1, 2012 - Dec 1, 2022
    Area covered
    Albania
    Variables measured
    Enterprises Statistics
    Description

    Albania Enterprises: Mining and Quarrying: Investment: Means of Transport data was reported at 422.000 ALL mn in 2022. This records an increase from the previous number of 236.211 ALL mn for 2021. Albania Enterprises: Mining and Quarrying: Investment: Means of Transport data is updated yearly, averaging 363.000 ALL mn (median) from Dec 2012 to 2022, with 11 observations. The data reached an all-time high of 1,157.821 ALL mn in 2019 and a record low of 230.000 ALL mn in 2016. Albania Enterprises: Mining and Quarrying: Investment: Means of Transport data remains in active status in CEIC and is reported by Institute of Statistics. The data is categorized under Global Database’s Albania – Table AL.O011: Enterprises Income and Investment: by Industry: NACE 2.

  13. Data from: Making the Case for Process Analytics: A Use Case in Court...

    • data.mendeley.com
    Updated Aug 15, 2025
    Cite
    Milda Aleknonyte-Resch (2025). Making the Case for Process Analytics: A Use Case in Court Proceedings [Dataset]. http://doi.org/10.17632/3mcvbrhr7c.2
    Explore at:
    Dataset updated
    Aug 15, 2025
    Authors
    Milda Aleknonyte-Resch
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data was extracted in PDF format with personal information redacted to ensure privacy. The raw dataset consisted of 260 cases from three chambers within a single German social law court. The data originates from a single judge, who typically oversees five to six chambers, meaning that this dataset represents only a subset of the judge’s total caseload. Optical Character Recognition (OCR) was used to extract the document text, which was organized into an event log according to the tabular structure of the documents. In the dataset, a single timestamp is recorded for each activity, commonly indicating only the date of occurrence rather than a precise timestamp. This limits the granularity of time-based analyses and the accuracy of calculated activity durations. As the analysis focuses on the overall durations of cases, which typically range from multiple months to years, the impact of the timestamp imprecisions was negligible in our use case. After extraction, the event log was further processed in consultation with domain experts to ensure anonymity, remove noise, and raise it to an abstraction level appropriate for analysis. All remaining personal identifiers, such as expert witness names, were removed from the log to ensure anonymity. Additionally, timestamps were systematically perturbed to further enhance data privacy. Originally, the event log contained 22,664 recorded events and 290 unique activities. Activities that were extremely rare (i.e., occurring fewer than 30 times) were excluded to focus on frequently observed procedural steps. Furthermore, the domain experts reviewed the list of unique activity labels, based on which similar activities were merged, and terminology was standardized across cases. The refinement of the activity labels reduced the number of unique activities to 59. Finally, duplicate events were removed. These steps collectively reduced the dataset to 19,947 events. 
    The final anonymized and processed dataset includes 260 cases, 19,947 events from three chambers and 59 unique activities.
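
    Two of the processing steps described above, dropping activities that occur fewer than 30 times and removing duplicate events, can be sketched as follows. This is an illustration only: the `(case_id, activity, date)` tuple layout and the function name are assumptions, not the published schema, and the expert-driven merging of activity labels is not reproduced.

```python
from collections import Counter


def clean_event_log(events, min_count=30):
    """Drop rare activities, then remove exact duplicate events.

    `events` is a list of (case_id, activity, date) tuples. This layout is
    hypothetical and chosen only for the example.
    """
    # Count how often each activity label occurs across the whole log.
    counts = Counter(activity for _, activity, _ in events)
    frequent = [event for event in events if counts[event[1]] >= min_count]
    # dict.fromkeys keeps the first occurrence of each event, in order.
    return list(dict.fromkeys(frequent))


# Toy log: one duplicated "filing" event, 30 distinct "hearing" events,
# and a single rare event.
toy_log = (
    [("c1", "filing", "2020-01-01")] * 2
    + [(f"c{i}", "hearing", "2020-02-01") for i in range(30)]
    + [("c1", "rare step", "2020-03-01")]
)
print(len(clean_event_log(toy_log)))
```

    With the default threshold only the 30 "hearing" events survive, because "filing" and "rare step" both occur fewer than 30 times.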

  14. Onset of mining operations

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 17, 2024
    Cite
    Remelgado, Ruben (2024). Onset of mining operations [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8214548
    Explore at:
    Dataset updated
    Mar 17, 2024
    Dataset provided by
    Remelgado, Ruben
    Meyer, Carsten
    Description

    Motivation

    Maus et al. created the first database of the spatial extent of mining areas by mobilizing nearly 20 years of Landsat data. This dataset is imperative for GlobES, as mining areas are specified in the IUCN habitat class scheme. Yet, this dataset is temporally static. To tackle this flaw, we mined the Landsat archive to infer the first observable year of mining.

    Approach

    For each mining area polygon, we collected 50 random samples within it and 50 random samples along its borders. This was meant to reflect increasing spectral differences between areas within and outside a mining exploration after its onset. Then, for each sample, we used Google Earth Engine to extract spectral profiles for every available acquisition between 1990 and 2020.

    After completing the extraction, we estimate mean spectral profiles for each acquisition date, once for the samples “inside” the mining area, and another for those “outside” of it. In this process, we masked pixels afflicted by clouds and cloud shadows using Landsat's quality information.

    Using the time-series of mean profiles, at each mining site and for each unique date, we normalized the “inside” and “outside” multi-spectral averages and estimated the Root Mean Square Error (RMSE) between them. The normalization step aimed at emphasizing differences in the shape of the spectral profiles rather than in specific values, which can be related to radiometric inaccuracies, or simply to differences in acquisition dates. This resulted in an RMSE time-series for each mining site.
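
    The per-date comparison can be sketched as below. The description does not say which normalization was used, so min-max scaling is an assumption made here purely for illustration, and the function names are hypothetical.

```python
import math


def normalize(profile):
    """Min-max scale a spectral profile to [0, 1] (assumed normalization)."""
    lo, hi = min(profile), max(profile)
    if hi == lo:
        # A flat profile carries no shape information.
        return [0.0 for _ in profile]
    return [(value - lo) / (hi - lo) for value in profile]


def profile_rmse(inside, outside):
    """RMSE between the normalized inside and outside mean profiles."""
    if len(inside) != len(outside):
        raise ValueError("profiles must cover the same bands")
    a, b = normalize(inside), normalize(outside)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Profiles with the same shape but different magnitudes give an RMSE of zero,
# which is the point of normalizing before comparing.
print(profile_rmse([1, 2, 3], [10, 20, 30]))
```

    Computing this value for every acquisition date at a site yields the RMSE time-series referred to above.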

    We then used these data to infer the first mining year. To achieve this, we first derived a cumulative sum of the RMSE time-series with the intent of removing noise while preserving abrupt directional changes. For example, if a mine was introduced in a forest, it would drive an increase in the RMSE due to the removal of trees, whereas the outskirts of the mine would remain forested. In this example, the accumulated values would tilt upwards. However, if a mining exploration was accompanied by the removal of vegetation along its outskirts where bare land was common, a downward shift in RMSE values is more likely as the landscape becomes more homogeneous.

    To detect the date marking a shift in RMSE values, we used a knee/elbow detection algorithm implemented in the Python package kneebow, which uses curve rotation to infer the inflection/deflection point of a time series. Here, downward trends correspond to the elbow and upward trends to the knee. To determine which of these metrics was the most adequate, we used the Area Under the Curve (AUC). An elbow is characterized by a convex shape of a time-series, which makes the AUC greater than 50%. However, if the shape of the curve is concave, the knee is the most adequate metric. We limited the detection of shifts to time-series with at least 100 time steps. When below this threshold, we assumed the mine (or the conditions to sustain it) was present since 1990.
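
    The curve-rotation idea can be illustrated without the kneebow package. The sketch below is a re-implementation under stated assumptions, not the authors' code and not kneebow's API: it accumulates the RMSE series, scales the curve to the unit square, rotates it so the chord from the first to the last point lies flat, and reports the points farthest below and above that chord. Labelling those two points "elbow" and "knee" is an assumption about the convention, and the AUC-based choice between them is left out.

```python
import math
from itertools import accumulate


def rotated_heights(ys):
    """Scale a curve to the unit square, rotate it so the chord from its
    first to its last point lies flat, and return each point's height."""
    n = len(ys)
    lo, hi = min(ys), max(ys)
    span = (hi - lo) or 1.0  # avoid dividing by zero on a flat curve
    sy = [(y - lo) / span for y in ys]
    sx = [i / (n - 1) for i in range(n)]
    theta = math.atan2(sy[-1] - sy[0], sx[-1] - sx[0])
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [(y - sy[0]) * cos_t - (x - sx[0]) * sin_t for x, y in zip(sx, sy)]


def shift_candidates(rmse_series, min_steps=100):
    """Return (elbow_index, knee_index) on the cumulative RMSE curve, or
    None when the series has fewer than `min_steps` time steps."""
    if len(rmse_series) < max(min_steps, 3):
        return None
    heights = rotated_heights(list(accumulate(rmse_series)))
    indices = range(len(heights))
    elbow = min(indices, key=heights.__getitem__)  # farthest below the chord
    knee = max(indices, key=heights.__getitem__)   # farthest above the chord
    return elbow, knee


# Toy series: RMSE jumps from 1 to 5 after the fourth acquisition.
print(shift_candidates([1.0] * 4 + [5.0] * 4, min_steps=8))
```

    In the toy series the first returned index marks the last acquisition before the jump. Returning `None` for short series mirrors the 100-step threshold described above, below which the mine is assumed to have been present since 1990.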

    Content

    This repository contains the infrastructure used to infer the start of a mining operation, which is organized as follows:

    00_data - Contains the base data required for the operation, including a SHP file with the mining area outlines, and validation samples.

    01_analysis - Contains several outputs of our analysis:

    xy.tar.gz - Sample locations for each mining site.

    sr.tar.gz - Spectral profiles for each sample location.

    mine_start.csv - First year when we detected the start of mining.

    02_code - Includes all code used in our analysis.

    requirements.txt - Python module requirements that can be fed to pip to replicate our study.

    config.yml - Configuration file, including information on the Landsat products used.

  15. Comparison of the running time(in ms) of the three algorithms.

    • plos.figshare.com
    xls
    Updated Jun 11, 2023
    Cite
    Yaling Zhang; Jin Han (2023). Comparison of the running time(in ms) of the three algorithms. [Dataset]. http://doi.org/10.1371/journal.pone.0248737.t004
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Yaling Zhang; Jin Han
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison of the running time(in ms) of the three algorithms.

  16. Net profit margin of the top mining companies 2002-2024

    • statista.com
    Updated Dec 13, 2024
    Cite
    M. Garside (2024). Net profit margin of the top mining companies 2002-2024 [Dataset]. https://www.statista.com/topics/1143/mining/
    Explore at:
    Dataset updated
    Dec 13, 2024
    Dataset provided by
    Statista http://statista.com/
    Authors
    M. Garside
    Description

    In 2011, the net profit margin of the mining industry's 40 leading companies was approximately 24 percent. Twelve years later, in 2023, the net profit margin stood at 11 percent.

    Profits of the top mining companies

    The net profit margin (also known as profit margin, net margin, or net profit ratio) measures the profitability of a company. It is calculated by dividing net income by total revenue (or net profit by sales). For 2023, this means that the top 40 mining companies kept 11 cents of profit out of every U.S. dollar they earned. The average net profit margin of the world’s top 40 mining companies stood at some seven percent in 2014, fell to negative seven percent in 2015, and then rebounded to 11 percent in 2023. These figures are a distinct decrease when compared to the years before. In 2023, the top 40 mining companies in the world generated a net profit of approximately 90 billion U.S. dollars. The global top 40 mining companies, which represent the vast majority of the industry, generated more than 840 billion U.S. dollars of revenue in 2023. In terms of quantity, these companies produce the majority of coal (including thermal and metallurgical coal), iron ore, and bauxite.
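
    The margin calculation can be checked against the rounded 2023 figures quoted above. This is a minimal sketch: the function name is illustrative, and the inputs are approximate, since the text gives profit as "approximately 90" and revenue as "more than 840" billion U.S. dollars.

```python
def net_profit_margin(net_income, total_revenue):
    """Net profit margin as a percentage of total revenue."""
    if total_revenue == 0:
        raise ValueError("total revenue must be non-zero")
    return 100 * net_income / total_revenue


# Approximate 2023 figures, in billions of U.S. dollars.
margin_2023 = net_profit_margin(90, 840)
print(f"{margin_2023:.1f}%")
```

    The result is consistent with the reported margin of about 11 percent, or roughly 11 cents of profit per dollar of revenue.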

  17. Descriptions of the datasets.

    • plos.figshare.com
    xls
    Updated Jun 10, 2023
    Cite
    Yaling Zhang; Jin Han (2023). Descriptions of the datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0248737.t002
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Yaling Zhang; Jin Han
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Descriptions of the datasets.

  18. Raw data of the ships of priority II in 2017

    • explore.openaire.eu
    • figshare.com
    Updated Mar 30, 2020
    Cite
    Jiaqi Mu (2020). Raw data of the ships of priority II in 2017 [Dataset]. http://doi.org/10.4121/uuid:92a6ce98-5001-4768-bec1-49881a367d36
    Explore at:
    Dataset updated
    Mar 30, 2020
    Authors
    Jiaqi Mu
    Description

    See the article "Targeting model based on principal component analysis and extreme learning machine" for the meaning of the data.

  19. Replication Data for: Policy Diffusion: The Issue-Definition Stage

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    Gilardi, Fabrizio; Shipan, Charles R.; Wüest, Bruno (2023). Replication Data for: Policy Diffusion: The Issue-Definition Stage [Dataset]. http://doi.org/10.7910/DVN/QEMNP1
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Gilardi, Fabrizio; Shipan, Charles R.; Wüest, Bruno
    Description

    We put forward a new approach to studying issue definition within the context of policy diffusion. Most studies of policy diffusion (the process by which policymaking in one government affects policymaking in other governments) have focused on policy adoptions. We shift the focus to an important but neglected aspect of this process: the issue-definition stage. We use topic models to estimate how policies are framed during this stage and how these frames are predicted by prior policy adoptions. Focusing on smoking restriction in U.S. states, our analysis draws upon an original dataset of over 52,000 paragraphs from newspapers covering 49 states between 1996 and 2013. We find that frames regarding the policy's concrete implications are predicted by prior adoptions in other states, while frames regarding its normative justifications are not. Our approach and findings open the way for a new perspective on studying policy diffusion in many different areas.

  20. Data from: Quantitative Wirtschaftsgeschichte des Ruhrkohlenbergbaus im 19....

    • search.gesis.org
    • pollux-fid.de
    • +1more
    Updated Apr 13, 2010
    Cite
    Holtfrerich, Carl-Ludwig (2010). Quantitative Wirtschaftsgeschichte des Ruhrkohlenbergbaus im 19. Jahrhundert [Dataset]. http://doi.org/10.4232/1.8207
    Explore at:
    (93874)Available download formats
    Dataset updated
    Apr 13, 2010
    Dataset provided by
    GESIS Data Archive
    GESIS search
    Authors
    Holtfrerich, Carl-Ludwig
    License

    https://www.gesis.org/en/institute/data-usage-terms

    Time period covered
    1816 - 1913
    Description

    First, Holtfrerich presents Rostow's concept of the leading sector; he then sketches the development of mining in the Ruhr area using theoretical approaches to production, price, and investment. In doing so, the author attempts to quantify the connections between the development of coal mining in the Ruhr district and other important sectors by means of an input-output scheme. He then examines how far the development of mining in the Ruhr area during its major growth phase in the 19th century meets Rostow's criteria for a leading sector. Finally, Holtfrerich verifies the assumption that mining in the Ruhr district was a leading sector of German industrialisation.

    Chart register

    Chart 01: Coal mining in the OBAB Dortmund, the Saar area, and the Kingdom of Prussia (1816-1913)
    Chart 02: Annual average price of coal in the OBAB Dortmund, nominal and real development (1816-1813)
    Chart 03: Number of operating coal mines in the OBAB Dortmund, and average production of each mine (1816-1892)
    Chart 04: Proportion of the five and ten greatest mines as to the total coal production of the mines in the OBAB Dortmund; in percent (1852-1890)
    Chart 05: Contributions of coal mines in the OBAB Dortmund in 1,000 marks (1850-1895)
    Chart 06: Tax burden for coal mining in the Lower Rhine region and in Westphalia (1880-1903)
    Chart 07: Burden of the coal mines in the OBAB Dortmund; coal mine contributions (“Bergwerksabgaben”) and taxes in percent of coal sales value (1816-1913)
    Chart 08: Annually licenced basic capital of the “Montan-Aktiengesellschaften” (coal, iron and steel corporations) founded in the Ruhr (1840-1870)
    Chart 10: Average number of workers per year (including mine officials) in the field of coal mining in the OBAB Dortmund (1816-1913)
    Chart 11: Average annual net payroll and annual net basic wages of the miners in the OBAB Dortmund (1850-1913)
    Chart 12: Wages in coal mining within the OBAB Dortmund (1850-1903)
    Chart 13: Working hours in coal mining within the OBAB Dortmund (1852-1892)
    Chart 14: Labour productivity in coal mining in the OBAB Dortmund (1816-1913)
    Chart 15: Development of capital investment: disposable steam machines (combined engine power in HP) of coal mines within the OBAB Dortmund (1851-1892)
    Chart 16: Development of investment: annual increase of steam machine power (in HP) (1852-1892)
    Chart 18: Development of capital productivity and capital intensity (1851-1892)
    Chart 19: Data on net value added and capital income in the field of coal mining in the OBAB Dortmund (1850-1903)
    Chart 20: Capital income/dividends and profits per produced ton of coal for coal mining in the Ruhr area (1850-1892)
    Chart 21: Proportion of the total coal produced in the Lower Rhine/Westphalian basin that was coked by the colliery itself or, from 1882 on, formed into briquettes (1861-1892)
    Chart 22: Percentage of propulsion power in HP applied in coal mining within the OBAB Dortmund (1875-1895)
    Chart 23: Own consumption of coal of mines within the OBAB Dortmund (1852-1892)
    Chart 24: Development of the profit indicator for coal mining in the Ruhr district (1851-1892)
    Chart 25: Expansion of the German railway system (1835-1892)
    Chart 26: Figures on the development of Prussian railways (1844-1882)
    Chart 27: Development of average revenues for the transport of coal on various railways (1861-1877)
    Chart 28: Development of the proportion of means of transport with regard to the transport of coal from the Ruhr area (1851-1889)
    Chart 29: Division of domestic sales of the “Rheinisch-Westfälisches Kohlensyndikat” (Coal Syndicate of the Rhineland and Westphalia) per consumption groups in percent (1902-1906)
    Chart 30: Wrought iron production and steel production from coal in the OBAB Dortmund and in the OBAB Bonn (part on the right bank of the Rhine) (1852-1882)
    Chart 31: Crude iron production in the Ruhr area, OBAB Dortmund (1837-1900)
    Chart 32: Price development for crude iron, bar iron and cast steel in the Ruhr district (1850-1892)
    Chart 33: Bar iron production in the OBAB Dortmund and in the OBAB Bonn by means of the charcoal hearth process and the “Puddelverfahren”, a method to produce steel from crude iron (1835-1870)
    Chart 34: The importance of the economic sectors according to their respective employment figures (1852-1875)
