100+ datasets found
  1. Table_1_Data Mining Techniques in Analyzing Process Data: A Didactic.pdf

    • frontiersin.figshare.com
    pdf
    Updated Jun 7, 2023
    Cite
    Xin Qiao; Hong Jiao (2023). Table_1_Data Mining Techniques in Analyzing Process Data: A Didactic.pdf [Dataset]. http://doi.org/10.3389/fpsyg.2018.02231.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 7, 2023
    Dataset provided by
    Frontiers
    Authors
    Xin Qiao; Hong Jiao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Due to the increasing use of technology-enhanced educational assessment, data mining methods have been explored to analyse process data in log files from such assessments. However, most studies have been limited to one data mining technique under one specific scenario. The current study demonstrates the usage of four frequently used supervised techniques, namely Classification and Regression Trees (CART), gradient boosting, random forest, and support vector machine (SVM), and two unsupervised methods, Self-organizing Map (SOM) and k-means, all fitted to a single assessment dataset. The USA sample (N = 426) from the 2012 Programme for International Student Assessment (PISA) responding to problem-solving items is extracted to demonstrate the methods. After concrete feature generation and feature selection, classifier development procedures are implemented using the illustrated techniques. Results show satisfactory classification accuracy for all the techniques. Suggestions for the selection of classifiers are presented based on the research questions and on the interpretability and simplicity of the classifiers. Interpretations of the results from both supervised and unsupervised learning methods are provided.
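    The following minimal Python sketch (scikit-learn) illustrates the kind of workflow the abstract describes: fitting several supervised classifiers and one unsupervised clusterer to a single feature matrix. The data, features, and parameters are invented placeholders, not the study's PISA process data, and SOM is omitted because scikit-learn has no SOM implementation (a third-party package such as MiniSom would be needed).

    ```python
    # Sketch only: synthetic stand-ins for the engineered process features.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier  # CART
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(426, 10))    # 426 students, 10 engineered features (invented)
    y = rng.integers(0, 2, size=426)  # binary outcome label (invented)

    supervised = {
        "CART": DecisionTreeClassifier(max_depth=4),
        "gradient boosting": GradientBoostingClassifier(),
        "random forest": RandomForestClassifier(n_estimators=200),
        "SVM": SVC(kernel="rbf"),
    }
    for name, clf in supervised.items():
        acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
        print(f"{name}: mean CV accuracy = {acc:.3f}")

    # Unsupervised: cluster the same feature matrix (SOM omitted; see lead-in).
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print("k-means cluster sizes:", np.bincount(labels))
    ```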

  2. A predictive model for opal exploration in Australia from a data mining...

    • researchdata.edu.au
    Updated May 1, 2015
    Cite
    Thomas Landgrebe; Thomas Landgrebe; Adriana Dutkiewicz; Dietmar Muller (2015). A predictive model for opal exploration in Australia from a data mining approach [Dataset]. http://doi.org/10.4227/11/5587A86C0FDF1
    Explore at:
    Dataset updated
    May 1, 2015
    Dataset provided by
    The University of Sydney
    Authors
    Thomas Landgrebe; Thomas Landgrebe; Adriana Dutkiewicz; Dietmar Muller
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Area covered
    Dataset funded by
    Australian Research Council
    Description

    This data collection is associated with the publications: Merdith, A. S., Landgrebe, T. C. W., Dutkiewicz, A., & Müller, R. D. (2013). Towards a predictive model for opal exploration using a spatio-temporal data mining approach. Australian Journal of Earth Sciences, 60(2), 217-229. doi: 10.1080/08120099.2012.754793

    and

    Landgrebe, T. C. W., Merdith, A., Dutkiewicz, A., & Müller, R. D. (2013). Relationships between palaeogeography and opal occurrence in Australia: A data-mining approach. Computers & Geosciences, 56(0), 76-82. doi: 10.1016/j.cageo.2013.02.002

    Publication Abstract - Merdith et al. (2013)

    Opal is Australia's national gemstone; however, until recently most significant opal discoveries were made in the early 1900s, more than 100 years ago. Currently there is no formal exploration model for opal, meaning there are no widely accepted concepts or methodologies available to suggest where new opal fields may be found. As a consequence, opal mining in Australia is a cottage industry, with the majority of opal exploration focused around old opal fields. The EarthByte Group has developed a new opal exploration methodology for the Great Artesian Basin. The work is based on the concept of applying “big data mining” approaches to data sets relevant for identifying regions that are prospective for opal. The group combined a multitude of geological and geophysical data sets that were jointly analysed to establish associations between particular features in the data and known opal mining sites. A “training set” of known opal localities (1036 opal mines) was assembled, using localities featured in published reports and on maps. The data used include rock types, soil type, regolith type, topography, radiometric data and a stack of digital palaeogeographic maps. The different data layers were analysed via spatio-temporal data mining, combining the GPlates PaleoGIS software (www.gplates.org) with the Orange data mining software (orange.biolab.si), to produce the first opal prospectivity map for the Great Artesian Basin. One of the main results of the study is that the geological conditions favourable for opal were found to be related to a particular sequence of surface environments over geological time. These conditions involved alternating shallow seas and river systems followed by uplift and erosion. The approach reduces the entire area of the Great Artesian Basin to a mere 6% that is deemed to be prospective for opal exploration. The work is described in two companion papers in the Australian Journal of Earth Sciences and Computers and Geosciences.

    Publication Abstract - Landgrebe et al. (2013)

    Age-coded multi-layered geological datasets are becoming increasingly prevalent with the surge in open-access geodata, yet there are few methodologies for extracting geological information and knowledge from these data. We present a novel methodology, based on the open-source GPlates software in which age-coded digital palaeogeographic maps are used to “data-mine” spatio-temporal patterns related to the occurrence of Australian opal. Our aim is to test the concept that only a particular sequence of depositional/erosional environments may lead to conditions suitable for the formation of gem quality sedimentary opal. Time-varying geographic environment properties are extracted from a digital palaeogeographic dataset of the eastern Australian Great Artesian Basin (GAB) at 1036 opal localities. We obtain a total of 52 independent ordinal sequences sampling 19 time slices from the Early Cretaceous to the present-day. We find that 95% of the known opal deposits are tied to only 27 sequences all comprising fluvial and shallow marine depositional sequences followed by a prolonged phase of erosion. We then map the total area of the GAB that matches these 27 opal-specific sequences, resulting in an opal-prospective region of only about 10% of the total area of the basin. The key patterns underlying this association involve only a small number of key environmental transitions. We demonstrate that these key associations are generally absent at arbitrary locations in the basin. This new methodology allows for the simplification of a complex time-varying geological dataset into a single map view, enabling straightforward application for opal exploration and for future co-assessment with other datasets/geological criteria. This approach may help unravel the poorly understood opal formation process using an empirical spatio-temporal data-mining methodology and readily available datasets to aid hypothesis testing.
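    As a hedged illustration of the sequence idea in both abstracts (each location reduces to an ordinal sequence of palaeo-environments across time slices, and cells whose sequences match those observed at known opal mines are flagged as prospective), here is a minimal Python sketch; the environment codes and sequences are invented, not the study's 27 opal-specific sequences.

    ```python
    # Sketch only: invented environment codes and sequences.
    SHALLOW_SEA, FLUVIAL, EROSION = "S", "F", "E"

    # Sequences of environments (oldest -> youngest) observed at known opal mines.
    opal_site_sequences = [
        (SHALLOW_SEA, FLUVIAL, EROSION, EROSION),
        (FLUVIAL, SHALLOW_SEA, FLUVIAL, EROSION),
        (SHALLOW_SEA, FLUVIAL, EROSION, EROSION),  # duplicates collapse below
    ]
    prospective_sequences = set(opal_site_sequences)

    def is_prospective(cell_sequence):
        """A grid cell is prospective if its environment sequence matches one
        observed at known opal localities."""
        return tuple(cell_sequence) in prospective_sequences

    basin_cells = {
        "cell_A": (SHALLOW_SEA, FLUVIAL, EROSION, EROSION),  # matches
        "cell_B": (EROSION, EROSION, SHALLOW_SEA, FLUVIAL),  # does not
    }
    for cell, seq in basin_cells.items():
        print(cell, "prospective" if is_prospective(seq) else "not prospective")
    ```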

    Authors and Institutions

    Andrew Merdith - EarthByte Research Group, School of Geosciences, The University of Sydney, Australia. ORCID: 0000-0002-7564-8149

    Thomas Landgrebe - EarthByte Research Group, School of Geosciences, The University of Sydney, Australia

    Adriana Dutkiewicz - EarthByte Research Group, School of Geosciences, The University of Sydney, Australia

    R. Dietmar Müller - EarthByte Research Group, School of Geosciences, The University of Sydney, Australia. ORCID: 0000-0002-3334-5764

    Overview of Resources Contained

    This collection contains geological data from Australia used for data mining in the publications Merdith et al. (2013) and Landgrebe et al. (2013). The resulting maps of opal prospectivity are also included.

    List of Resources

    Note: For details on the files included in this data collection, see “Description_of_Resources.txt”.

    Note: For information on file formats and what programs to use to interact with various file formats, see “File_Formats_and_Recommended_Programs.txt”.

    • Map of Barfield region, Australia (.jpg, 270 KB)
    • Map overviewing the Great Artesian basins and main opal mining camps (.png, 82 KB)
    • Maps showing opal prospectivity data mining results for different geological datasets (.tif, 23.1 MB)
    • Map of opal prospectivity from palaeogeography data mining (.pdf, 2.6 MB)
    • Raster of palaeogeography target regions for viewing in Google Earth (.jpg, 418 KB)
    • Opal mine locations (.gpml, .txt, .kmz, .shp, total 15.6 MB)
    • Map of opal prospectivity from all data mining results as a Google Earth overlay (.kmz, 12 KB)
    • Map of probability of opal occurrence in prospective regions from all data mining results (.tif, 5.9 MB)
    • Paleogeography of Australia (.gpml, .txt, .shp, total 114.2 MB)
    • Radiometric data showing potassium concentration contrasts (.tif, .kmz, total 311.3 MB)
    • Regolith data (.gpml, .txt, .kml, .shp, total 7.1 MB)
    • Soil type data (.gpml, .txt, .kml, .shp, total 7.1 MB)

    For more information on this data collection, and links to other datasets from the EarthByte Research Group, please visit EarthByte.

    For more information about using GPlates, including tutorials and a user manual, please visit GPlates or EarthByte.

  3. Data supporting the Master thesis "Monitoring von Open Data Praktiken -...

    • zenodo.org
    zip
    Updated Nov 21, 2024
    Cite
    Katharina Zinke; Katharina Zinke (2024). Data supporting the Master thesis "Monitoring von Open Data Praktiken - Herausforderungen beim Auffinden von Datenpublikationen am Beispiel der Publikationen von Forschenden der TU Dresden" [Dataset]. http://doi.org/10.5281/zenodo.14196539
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 21, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Katharina Zinke; Katharina Zinke
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data supporting the Master thesis "Monitoring von Open Data Praktiken - Herausforderungen beim Auffinden von Datenpublikationen am Beispiel der Publikationen von Forschenden der TU Dresden" (Monitoring open data practices - challenges in finding data publications using the example of publications by researchers at TU Dresden) - Katharina Zinke, Institut für Bibliotheks- und Informationswissenschaften, Humboldt-Universität Berlin, 2023

    This ZIP file contains the data on which the thesis is based, interim exports of the results, and the R script with all pre-processing, data merging and analyses carried out. The documentation of the additional, explorative analysis is also included. The actual PDFs and text files of the scientific papers used are not included, as they are published open access.

    The folder structure is shown below with the file names and a brief description of the contents of each file. For details concerning the analysis approach, please refer to the master's thesis (publication following soon).

    ## Data sources

    Folder 01_SourceData/

    - PLOS-Dataset_v2_Mar23.csv (PLOS-OSI dataset)

    - ScopusSearch_ExportResults.csv (export of Scopus search results from Scopus)

    - ScopusSearch_ExportResults.ris (export of Scopus search results from Scopus)

    - Zotero_Export_ScopusSearch.csv (export of the file names and DOIs of the Scopus search results from Zotero)

    ## Automatic classification

    Folder 02_AutomaticClassification/

    - (NOT INCLUDED) PDFs folder (Folder for PDFs of all publications identified by the Scopus search, named AuthorLastName_Year_PublicationTitle_Title)

    - (NOT INCLUDED) PDFs_to_text folder (Folder for all texts extracted from the PDFs by ODDPub, named AuthorLastName_Year_PublicationTitle_Title)

    - PLOS_ScopusSearch_matched.csv (merge of the Scopus search results with the PLOS_OSI dataset for the files contained in both)

    - oddpub_results_wDOIs.csv (results file of the ODDPub classification)

    - PLOS_ODDPub.csv (merge of the results file of the ODDPub classification with the PLOS-OSI dataset for the publications contained in both)

    ## Manual coding

    Folder 03_ManualCheck/

    - CodeSheet_ManualCheck.txt (Code sheet with descriptions of the variables for manual coding)

    - ManualCheck_2023-06-08.csv (Manual coding results file)

    - PLOS_ODDPub_Manual.csv (Merge of the results file of the ODDPub and PLOS-OSI classification with the results file of the manual coding)

    ## Explorative analysis for the discoverability of open data

    Folder 04_FurtherAnalyses/

    - Proof_of_of_Concept_Open_Data_Monitoring.pdf (Description of the explorative analysis of the discoverability of open data publications using the example of a researcher; in German)

    ## R-Script

    Analyses_MA_OpenDataMonitoring.R (R-Script for preparing, merging and analyzing the data and for performing the ODDPub algorithm)
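    For illustration only: the merges listed above (e.g., PLOS_ScopusSearch_matched.csv) are DOI-keyed joins. The thesis performs them in R; a minimal pandas sketch of the same kind of join looks as follows, with column names that are assumptions rather than those of the actual files.

    ```python
    # Sketch only: hypothetical column names, not those of the thesis files.
    import pandas as pd

    scopus = pd.DataFrame(
        {"doi": ["10.1/A ", "10.1/b"], "title": ["Paper A", "Paper B"]}
    )
    plos_osi = pd.DataFrame(
        {"doi": ["10.1/a"], "data_availability": ["repository"]}
    )

    # Normalize DOIs before joining; case and whitespace differences are common.
    for df in (scopus, plos_osi):
        df["doi"] = df["doi"].str.strip().str.lower()

    matched = scopus.merge(plos_osi, on="doi", how="inner")
    print(matched)
    ```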

  4. Data from: A Generic Local Algorithm for Mining Data Streams in Large...

    • catalog.data.gov
    • datasets.ai
    • +3 more
    Updated Apr 10, 2025
    + more versions
    Cite
    Dashlink (2025). A Generic Local Algorithm for Mining Data Streams in Large Distributed Systems [Dataset]. https://catalog.data.gov/dataset/a-generic-local-algorithm-for-mining-data-streams-in-large-distributed-systems
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    In a large network of computers or wireless sensors, each of the components (henceforth, peers) has some data about the global state of the system. Much of the system's functionality, such as message routing, information retrieval and load sharing, relies on modeling the global state. We refer to the outcome of the function (e.g., the load experienced by each peer) as the model of the system. Since the state of the system is constantly changing, it is necessary to keep the models up to date. Computing global data mining models (e.g., decision trees, k-means clustering) in large distributed systems may be very costly due to the scale of the system and the potentially high cost of communication. The cost further increases in a dynamic scenario when the data changes rapidly. In this paper we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm which can be used to monitor a wide class of data mining models. Then, we use this algorithm as a feedback loop for the monitoring of complex functions of the data, such as its k-means clustering. The theoretical claims are corroborated with a thorough experimental analysis.
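    A minimal sketch of the two-step idea described above: a cheap check guards an expensive global rebuild of the model (here, k-means centroids). The tolerance, data, and drift are illustrative assumptions, not the paper's local algorithm, which decides this from each peer's partial information.

    ```python
    # Sketch only: a cheap fit test gating an expensive global model rebuild.
    import numpy as np
    from sklearn.cluster import KMeans

    def model_still_fits(batch, centroids, tol=1.0):
        """Cheap check: mean distance of new points to their nearest centroid."""
        d = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=2)
        return d.min(axis=1).mean() < tol

    def rebuild_model(all_data, k=2):
        """Expensive global step, run only when the cheap check fails."""
        return KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_data).cluster_centers_

    rng = np.random.default_rng(1)
    data = rng.normal(size=(500, 2))
    centroids = rebuild_model(data)

    stream_batch = rng.normal(loc=3.0, size=(50, 2))  # distribution shift
    if not model_still_fits(stream_batch, centroids):
        centroids = rebuild_model(np.vstack([data, stream_batch]))
        print("epoch change detected: model rebuilt")
    ```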

  5. Application Research of Clustering on kmeans

    • kaggle.com
    zip
    Updated Feb 27, 2021
    Cite
    ddpr raju (2021). Application Research of Clustering on kmeans [Dataset]. https://www.kaggle.com/ddprraju/tirupati-compus-school
    Explore at:
    Available download formats: zip (34,507 bytes)
    Dataset updated
    Feb 27, 2021
    Authors
    ddpr raju
    Description

    Dataset

    This dataset was created by ddpr raju

    Contents

    It contains the following files:

  6. Datasets(Original, Mean, Median, Most Frequent).zip

    • figshare.com
    zip
    Updated May 31, 2023
    Cite
    Omar Elzeki (2023). Datasets(Original, Mean, Median, Most Frequent).zip [Dataset]. http://doi.org/10.6084/m9.figshare.8118710.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Omar Elzeki
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The datasets are transformed into MATLAB format and are designed as cell arrays. Each cell is a matrix whose columns represent genes and whose rows represent subjects. Each dataset is organized in a separate directory containing four versions: a) the original dataset, b) the dataset imputed by MEAN, c) the dataset imputed by MEDIAN, and d) the dataset imputed by MOST FREQUENT.
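    For illustration, the three imputed variants named above can be reproduced from a matrix with missing entries using scikit-learn's SimpleImputer, which supports exactly these three strategies; the toy matrix below stands in for the gene-expression data.

    ```python
    # Sketch only: a toy matrix standing in for the gene-expression data
    # (rows = subjects, columns = genes, as described above).
    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[1.0, np.nan, 3.0],
                  [4.0, 5.0, np.nan],
                  [7.0, 8.0, 9.0]])

    for strategy in ("mean", "median", "most_frequent"):
        X_imputed = SimpleImputer(strategy=strategy).fit_transform(X)
        print(strategy, X_imputed.tolist())
    ```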

  7. Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems - Dataset -...

    • data.nasa.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    Updated Mar 31, 2025
    Cite
    data.nasa.gov (2025). Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/local-l2-thresholding-based-data-mining-in-peer-to-peer-systems
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    In a large network of computers, wireless sensors, or mobile devices, each of the components (hence, peers) has some data about the global status of the system. Many of the functions of the system, such as routing decisions, search strategies, data cleansing, and the assignment of mutual trust, depend on the global status. Therefore, it is essential that the system be able to detect, and react to, changes in its global status. Computing global predicates in such systems is usually very costly, mainly because of their scale and, in some cases (e.g., sensor networks), because of the high cost of communication. The cost further increases when the data changes rapidly (due to state changes, node failure, etc.) and computation has to follow these changes. In this paper we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm which detects when the L2 norm of the average data surpasses a threshold. Then, we use this algorithm as a feedback loop for the monitoring of complex predicates on the data, such as the data's k-means clustering. The efficiency of the L2 algorithm guarantees that so long as the clustering results represent the data (i.e., the data is stationary), few resources are required. When the data undergoes an epoch change (a change in the underlying distribution) and the model no longer represents it, the feedback loop indicates this and the model is rebuilt. Furthermore, the existence of a feedback loop allows using approximate and “best-effort” methods for constructing the model; if an ill-fit model is built, the feedback loop will indicate so, and the model will be rebuilt.
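    A minimal sketch of the core test named in the abstract: flag when the L2 norm of the average data vector surpasses a threshold. In the paper each peer decides this locally from partial information; the sketch below shows only the global version of the test, on synthetic data.

    ```python
    # Sketch only: the global version of the L2 thresholding test.
    import numpy as np

    def l2_alert(vectors, threshold):
        """True when the L2 norm of the average data vector exceeds the threshold."""
        return np.linalg.norm(vectors.mean(axis=0)) > threshold

    rng = np.random.default_rng(2)
    stationary = rng.normal(loc=0.0, size=(1000, 3))
    shifted = rng.normal(loc=0.5, size=(1000, 3))

    print(l2_alert(stationary, threshold=0.25))  # False: mean stays near the origin
    print(l2_alert(shifted, threshold=0.25))     # True: epoch change in the stream
    ```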

  8. d

    Data mining approaches to quantifying the formation of secondary organic...

    • datasets.ai
    • catalog.data.gov
    Updated Sep 18, 2024
    Cite
    U.S. Environmental Protection Agency (2024). Data mining approaches to quantifying the formation of secondary organic aerosol [Dataset]. https://datasets.ai/datasets/data-mining-approaches-to-quantifying-the-formation-of-secondary-organic-aerosol
    Explore at:
    Available download formats
    Dataset updated
    Sep 18, 2024
    Dataset authored and provided by
    U.S. Environmental Protection Agency
    Description

    This research used data mining approaches to better understand factors affecting the formation of secondary organic aerosol (SOA). Although numerous laboratory and computational studies have been completed on SOA formation, it is still challenging to determine the factors that most influence SOA formation. Experimental data were based on previous work described by Offenberg et al. (2017), where volume concentrations of SOA were measured in 139 laboratory experiments involving the oxidation of single hydrocarbons under different operating conditions. Three different data mining methods were used: nearest neighbor, decision tree, and pattern mining. Both the decision tree and pattern mining approaches identified similar chemical and experimental conditions that were important to SOA formation. These important factors included the number of methyl groups, the number of rings, and the presence of dinitrogen pentoxide (N2O5).
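    As a hedged sketch of the decision-tree side of this approach, the snippet below trains a small tree on descriptors named in the abstract (methyl groups, rings, presence of N2O5); the values and the labeling rule are synthetic, not the study's measurements.

    ```python
    # Sketch only: synthetic values and a toy labeling rule, not the study's data.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(3)
    n = 139  # the abstract's experiment count; the rows here are synthetic
    X = np.column_stack([
        rng.integers(0, 4, n),  # number of methyl groups
        rng.integers(0, 3, n),  # number of rings
        rng.integers(0, 2, n),  # N2O5 present (0/1)
    ])
    y = (X[:, 0] + 2 * X[:, 2] >= 3).astype(int)  # toy "high SOA" rule

    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
    print(export_text(tree, feature_names=["methyl_groups", "rings", "n2o5"]))
    ```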

    This dataset is associated with the following publication: Olson, D., J. Offenberg, M. Lewandowski, T. Kleindienst, K. Docherty, M. Jaoui, J.D. Krug, and T. Riedel. Data mining approaches to understanding the formation of secondary organic aerosol. ATMOSPHERIC ENVIRONMENT. Elsevier Science Ltd, New York, NY, USA, 252: 118345, (2021).

  9. Data Mining For Business

    • kaggle.com
    Updated May 7, 2022
    Cite
    Balal H (2022). Data Mining For Business [Dataset]. https://www.kaggle.com/datasets/balalh/data-mining-for-business
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 7, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Balal H
    Description

    Dataset

    This dataset was created by Balal H

    Contents

  10. Data Mining Applied to Life Cycle Inventory Modeling for Cumene and Sodium...

    • catalog.data.gov
    • gimi9.com
    Updated Mar 4, 2021
    Cite
    U.S. EPA Office of Research and Development (ORD) (2021). Data Mining Applied to Life Cycle Inventory Modeling for Cumene and Sodium Hydroxide Manufacturing, Version 1, 09/2018 [Dataset]. https://catalog.data.gov/dataset/data-mining-applied-to-life-cycle-inventory-modeling-for-cumene-and-sodium-hydroxide-ma-09
    Explore at:
    Dataset updated
    Mar 4, 2021
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This file contains the life cycle inventories (LCIs) developed for an associated journal article. Potential users of the data are referred to the journal article for a full description of the modeling methodology. LCIs were developed for cumene and sodium hydroxide manufacturing using data mining with metadata-based data preprocessing. The inventory data were collected from US EPA's 2012 Chemical Data Reporting database, 2011 National Emissions Inventory, 2011 Toxics Release Inventory, 2011 Electronic Greenhouse Gas Reporting Tool, 2011 Discharge Monitoring Report, and the 2011 Biennial Report generated from the RCRAinfo hazardous waste tracking system. The U.S. average cumene gate-to-gate inventories are provided without (baseline) and with process allocation applied using metadata-based filtering. In 2011, there were 8 facilities reporting public production volumes of cumene in the U.S., totaling 2,609,309,687 kilograms of cumene produced that year. The U.S. average sodium hydroxide gate-to-gate inventories are also provided without (baseline) and with process allocation applied using metadata-based filtering. In 2011, there were 24 facilities reporting public production volumes of sodium hydroxide in the U.S., totaling 3,878,021,614 kilograms of sodium hydroxide produced that year. Process allocation was only conducted for the top 12 facilities producing sodium hydroxide, which represent 97% of the public production of sodium hydroxide. The data have not been compiled in the formal Federal Commons LCI Template, to avoid users interpreting the template to mean the data have been fully reviewed according to LCA standards and can be directly applied to all types of assessments and decision needs without additional review by industry and potential stakeholders. This dataset is associated with the following publication: Meyer, D.E., S. Cashman, and A. Gaglione. Improving the reliability of chemical manufacturing life cycle inventory constructed using secondary data. JOURNAL OF INDUSTRIAL ECOLOGY. Berkeley Electronic Press, Berkeley, CA, USA, 25(1): 20-35, (2021).

  11. Data from: IJEE Educational Data Mining

    • figshare.com
    • produccioncientifica.uca.es
    pdf
    Updated Mar 4, 2016
    Cite
    Manuel Palomo-duarte (2016). IJEE Educational Data Mining [Dataset]. http://doi.org/10.6084/m9.figshare.3084319.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Mar 4, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Manuel Palomo-duarte
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Results of k-means and Ward's clustering for IJEE Educational Data Mining

  12. Data from: Generation of Pairwise Potentials Using Multidimensional Data...

    • acs.figshare.com
    xlsx
    Updated Jun 2, 2023
    Cite
    Zheng Zheng; Jun Pei; Nupur Bansal; Hao Liu; Lin Frank Song; Kenneth M. Merz (2023). Generation of Pairwise Potentials Using Multidimensional Data Mining [Dataset]. http://doi.org/10.1021/acs.jctc.8b00516.s009
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Zheng Zheng; Jun Pei; Nupur Bansal; Hao Liu; Lin Frank Song; Kenneth M. Merz
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The rapid development of molecular structural databases provides the chemistry community with access to an enormous array of experimental data that can be used to build and validate computational models. Using radial distribution functions collected from experimentally available X-ray and NMR structures, a number of so-called statistical potentials have been developed over the years using the structural data mining strategy. These potentials have been developed within the context of the two-particle Kirkwood equation by extending its original use for isotropic monatomic systems to anisotropic biomolecular systems. However, the accuracy and the unclear physical meaning of statistical potentials have long formed the central arguments against such methods. In this work, we present a new approach to generate molecular energy functions using structural data mining. Instead of employing the Kirkwood equation and introducing the “reference state” approximation, we model the multidimensional probability distributions of the molecular system using graphical models and generate the target pairwise Boltzmann probabilities using the Bayesian field theory. Different from the current statistical potentials that mimic the “knowledge-based” PMF based on the 2-particle Kirkwood equation, the graphical-model-based structure-derived potential developed in this study focuses on the generation of lower-dimensional Boltzmann distributions of atoms through reduction of dimensionality. We have named this new scoring function GARF, and in this work we focus on the mathematical derivation of our novel approach followed by validation studies on its ability to predict protein–ligand interactions.
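    For orientation, the classical construction that this abstract argues against can be written in a few lines: a knowledge-based potential of mean force derived from a radial distribution function, w(r) = -kT ln g(r). GARF itself replaces this reference-state construction with graphical models; the sketch below shows only the baseline idea, with invented g(r) values.

    ```python
    # Sketch only: the classical reference-state construction, with invented g(r).
    import numpy as np

    kT = 0.593  # kcal/mol at ~298 K
    r = np.linspace(2.0, 8.0, 7)                                  # distance bins (Angstrom)
    g_observed = np.array([0.05, 0.6, 1.8, 1.3, 1.0, 0.95, 1.0])  # toy pair g(r)

    w = -kT * np.log(g_observed)  # potential of mean force per bin
    for ri, wi in zip(r, w):
        print(f"r = {ri:.1f} A   w(r) = {wi:+.2f} kcal/mol")
    ```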

  13. SA Mineral and/or Opal Exploration Licence Applications

    • data.gov.au
    • researchdata.edu.au
    • +1 more
    zip
    Updated Apr 13, 2022
    + more versions
    Cite
    Bioregional Assessment Program (2022). SA Mineral and/or Opal Exploration Licence Applications [Dataset]. https://data.gov.au/data/dataset/064b4ce1-cbf9-4cd3-ad2e-4d1a677c70b8
    Explore at:
    Available download formats: zip (435,509 bytes)
    Dataset updated
    Apr 13, 2022
    Dataset authored and provided by
    Bioregional Assessment Program
    License
    Description

    Abstract

    This dataset and its metadata statement were supplied to the Bioregional Assessment Programme by a third party and are presented here as originally supplied.

    Location of all current mineral exploration licences issued under the Mining Act 1971. Exploration licences provide exclusive tenure rights to explore for mineral resources for up to a maximum of 5 years. Comment is sought on applications for exploration licences from numerous sources before granting. Exploration programs are subject to strict environmental and heritage conditions. Exploitation of identified resources must be made under separate mineral production leases.

    Purpose

    Purpose:

    The dataset was developed to record information necessary for the administration of the Mining Act.

    Use:

    Used to supply government, industry and the general public with an up-to-date status and extent of mineral and/or opal exploration licence application activities throughout the state.

    Use limitation:

    The data should not be used at a scale larger than 1:50 000.

    Dataset History

    Lineage:

    Source data history: Exploration Licence application boundaries were sourced from the official Mining Register licence application documents. Licence application boundaries are legally defined to follow lines of latitude and longitude. The register has existed since 1930.

    Processing steps: Coordinates were entered by keyboard from licence application documents. Linework was cleaned to remove duplicate arcs. Data were adjusted for an accurate state border and coastline. Where appropriate, cadastral parcels were removed from licence application polygons. Associated attribute data were also captured from licence application documents.

    Dataset Citation

    SA Department of Primary Industries and Resources (2014) SA Mineral and/or Opal Exploration Licence Applications. Bioregional Assessment Source Dataset. Viewed 12 December 2018, http://data.bioregionalassessments.gov.au/dataset/064b4ce1-cbf9-4cd3-ad2e-4d1a677c70b8.

  14. LScD (Leicester Scientific Dictionary)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    + more versions
    Cite
    Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
    Explore at:
    Available download formats: docx
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    LScD (Leicester Scientific Dictionary)

    April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    [Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation. After pre-processing, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are the same as described for LScD Version 2 below.

    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
    ** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

    [Version 2] Getting Started

    This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). This dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from the LSC, with instructions for its usage, is available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains a title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.

    LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing each word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:

    1. Unique words in abstracts
    2. Number of documents containing each word
    3. Number of appearances of a word in the entire corpus

    Processing the LSC

    Step 1. Downloading the LSC Online: Use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

    Step 2. Importing the Corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source with changes of parameters. The structure of the corpus, such as file format and the names (and positions) of fields, should be taken into account when applying our code. The organisation of the CSV files of LSC is described in the README file for LSC [1].

    Step 3. Extracting Abstracts and Saving Metadata: Metadata, which include all fields in a document except the abstract, are separated from the abstracts. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

    Step 4. Text Pre-processing Steps on the Collection of Abstracts: In this section, we present our approaches to pre-processing the abstracts of the LSC.

    1. Removing punctuation and special characters: This is the substitution of all non-alphanumeric characters by a space. We did not substitute the character “-” in this step, because we need to keep words like “z-score”, “non-payment” and “pre-processing” so as not to lose the actual meaning of such words. A process of uniting prefixes with words is performed in later steps of pre-processing.
    2. Lowercasing the text data: Lowercasing is performed to avoid treating words like “Corpus”, “corpus” and “CORPUS” differently. The entire collection of texts is converted to lowercase.
    3. Uniting prefixes of words: Words containing prefixes joined with the character “-” are united into a single word. The prefixes united for this research are listed in the file “list_of_prefixes.csv”. Most prefixes are extracted from [4]. We also added commonly used prefixes: ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.
    4. Substitution of words: Some words joined with “-” in the abstracts of the LSC require an additional substitution step to avoid losing their meaning when the character “-” is removed. Some examples of such words are “z-test”, “well-known” and “chi-square”; these have been substituted by “ztest”, “wellknown” and “chisquare”. Identification of such words is done by sampling abstracts from LSC. The full list of such words and the decisions taken for substitution are presented in the file “list_of_substitution.csv”.
    5. Removing the character “-”: All remaining “-” characters are replaced by a space.
    6. Removing numbers: All digits not included in a word are replaced by a space. All words that contain both digits and letters are kept, because alphanumeric tokens such as chemical formulas might be important for our analysis. Some examples are “co2”, “h2o” and “21st”.
    7. Stemming: Stemming is the process of converting inflected words into their word stem. This step unites several forms of words with similar meaning into one form, and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
    8. Stop-word removal: Stop words are words that are extremely common but provide little value in a language. Some common stop words in English are ‘I’, ‘the’, ‘a’, etc. We used the ‘tm’ package in R to remove stop words [6]; there are 174 English stop words listed in the package.

    Step 5. Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file “LScD.csv”.

    The Organisation of the LScD

    The total number of words in the file “LScD.csv” is 974,238. Each field is described below.

    Word: Contains the unique words from the corpus. All words are in lowercase and in their stem forms. The field is sorted by the number of documents containing each word, in descending order.

    Number of Documents Containing the Word: A binary count is used: if a word exists in an abstract, it counts as 1; if the word exists more than once in a document, the count is still 1. The total number of documents containing the word is counted as the sum of these 1s over the entire corpus.

    Number of Appearances in Corpus: How many times a word occurs in the corpus when the corpus is considered as one large document.

    Instructions for R Code

    LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. Outputs of the code are:

    Metadata File: Includes all fields in a document except the abstract. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
    File of Abstracts: Contains all abstracts after the pre-processing steps defined in Step 4.
    DTM: The Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.
    LScD: An ordered list of words from LSC as defined in the previous section.

    The code can be used as follows:

    1. Download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’.
    2. Open the LScD_Creation.R script.
    3. Change the parameters in the script: replace with the full path of the directory containing the source files and the full path of the directory for output files.
    4. Run the full code.

    References

    [1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
    [2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
    [3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
    [5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
    [6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
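    The original pipeline is an R script [2]; purely as an illustration, the following Python sketch mirrors pre-processing steps 1-8 above on a toy string. The prefix and substitution lists are tiny stand-ins for “list_of_prefixes.csv” and “list_of_substitution.csv”.

    ```python
    # Sketch only: Python paraphrase of steps 1-8; requires nltk and a prior
    # nltk.download("stopwords").
    import re
    from collections import Counter
    from nltk.stem import PorterStemmer    # step 7: stemming
    from nltk.corpus import stopwords      # step 8: stop-word removal

    PREFIXES = ["pre", "non", "self"]                               # stand-in list
    SUBSTITUTIONS = {"z-test": "ztest", "chi-square": "chisquare"}  # stand-in list

    def preprocess(text):
        text = text.lower()                                  # step 2: lowercasing
        for old, new in SUBSTITUTIONS.items():               # step 4: substitution
            text = text.replace(old, new)
        text = re.sub(r"[^a-z0-9\- ]", " ", text)            # step 1: keep "-" for now
        text = re.sub(r"\b(" + "|".join(PREFIXES) + r")-(\w+)", r"\1\2", text)  # step 3
        text = text.replace("-", " ")                        # step 5: drop remaining "-"
        text = re.sub(r"\b\d+\b", " ", text)                 # step 6: standalone numbers
        stemmer, stops = PorterStemmer(), set(stopwords.words("english"))
        return [stemmer.stem(w) for w in text.split() if w not in stops]  # steps 7-8

    print(Counter(preprocess("Pre-processing the z-test corpus: 21st-century data.")))
    ```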

  15. Data from: Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems

    • datasets.ai
    • catalog.data.gov
    Updated Oct 21, 2005
    + more versions
    Cite
    National Aeronautics and Space Administration (2005). Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems [Dataset]. https://datasets.ai/datasets/local-l2-thresholding-based-data-mining-in-peer-to-peer-systems
    Explore at:
    Available download formats
    Dataset updated
    Oct 21, 2005
    Dataset authored and provided by
    National Aeronautics and Space Administration
    Description

    In a large network of computers, wireless sensors, or mobile devices, each of the components (hence, peers) has some data about the global status of the system. Many of the functions of the system, such as routing decisions, search strategies, data cleansing, and the assignment of mutual trust, depend on the global status. Therefore, it is essential that the system be able to detect, and react to, changes in its global status. Computing global predicates in such systems is usually very costly, mainly because of their scale and, in some cases (e.g., sensor networks), because of the high cost of communication. The cost further increases when the data changes rapidly (due to state changes, node failure, etc.) and computation has to follow these changes. In this paper we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm which detects when the L2 norm of the average data surpasses a threshold. Then, we use this algorithm as a feedback loop for the monitoring of complex predicates on the data, such as the data's k-means clustering. The efficiency of the L2 algorithm guarantees that so long as the clustering results represent the data (i.e., the data is stationary), few resources are required. When the data undergoes an epoch change (a change in the underlying distribution) and the model no longer represents it, the feedback loop indicates this and the model is rebuilt. Furthermore, the existence of a feedback loop allows using approximate and “best-effort” methods for constructing the model; if an ill-fit model is built, the feedback loop will indicate so, and the model will be rebuilt.

  16. Hurricane Season Geospatial Data Mining Workshop - Datasets - AmericaView -...

    • ckan.americaview.org
    Updated Nov 11, 2021
    Cite
    ckan.americaview.org (2021). Hurricane Season Geospatial Data Mining Workshop - Datasets - AmericaView - CKAN [Dataset]. https://ckan.americaview.org/dataset/2020-hurricane-season-geospatial-data-mining-workshop
    Explore at:
    Dataset updated
    Nov 11, 2021
    Dataset provided by
    CKAN (https://ckan.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Through the cooperation of the LouisianaView consortium members, and co-sponsored with the local USGS liaison, this annual workshop is offered free to everyone interested in up-to-date information on data availability for the geospatial emergency responder. This 4-day virtual workshop hosts speakers from multiple federal, state and private response teams, each presenting their data, websites, links, and contacts while also fielding questions live from attendees, proving again and again what a cohesive and informed network of geospatial responders can mean to the inhabitants and economic base of Louisiana, the Gulf of Mexico region and the Caribbean.

  17. OceanXtremes: Oceanographic Data-Intensive Anomaly Detection and Analysis...

    • data.amerigeoss.org
    • data.wu.ac.at
    html
    Updated Jul 25, 2019
    Cite
    United States[old] (2019). OceanXtremes: Oceanographic Data-Intensive Anomaly Detection and Analysis Portal [Dataset]. https://data.amerigeoss.org/pl/dataset/0f24d562-556c-4895-955a-74fec4cc9993
    Explore at:
    Available download formats: html
    Dataset updated
    Jul 25, 2019
    Dataset provided by
    United States[old]
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Description

    Anomaly detection is a process of identifying items, events or observations which do not conform to an expected pattern in a dataset or time series. Current and future missions and our research communities challenge us to rapidly identify features and anomalies in complex and voluminous observations to further science and improve decision support. Given this data-intensive reality, we propose to develop an anomaly detection system, called OceanXtremes, powered by an intelligent, elastic Cloud-based analytic service backend that enables execution of domain-specific, multi-scale anomaly and feature detection algorithms across the entire archive of ocean science datasets. A parallel analytics engine will be developed as the key computational and data-mining core of OceanXtremes' backend processing. This analytic engine will demonstrate three new technology ideas to provide rapid turnaround on climatology computation and anomaly detection:

    1. An adaptation of the Hadoop/MapReduce framework for parallel data mining of science datasets, typically large 3- or 4-dimensional arrays packaged in NetCDF and HDF.

    2. An algorithm profiling service to efficiently and cost-effectively scale up hybrid Cloud computing resources based on the needs of scheduled jobs (CPU, memory, network, and bursting from a private Cloud computing cluster to a public cloud provider like Amazon Cloud services).

    3. An extension to industry-standard search solutions (OpenSearch and faceted search) to provide support for shared discovery and exploration of ocean phenomena and anomalies, along with unexpected correlations between key measured variables.

    We will use a hybrid Cloud compute cluster (private Eucalyptus on-premise at JPL, bursting to Amazon Web Services) as the operational backend. The key idea is that the parallel data-mining operations will be run 'near' the ocean data archives (a local 'network' hop) so that we can efficiently access the thousands of (say, daily) files making up a three-decade time series, and then cache key variables and pre-computed climatologies in a high-performance parallel database. OceanXtremes will be equipped with both web portal and web service interfaces for users and applications/systems to register and retrieve oceanographic anomaly data. By leveraging technology such as Datacasting (Bingham et al., 2007), users can also subscribe to anomaly or 'event' types of interest and have newly computed anomaly metrics and other information delivered to them by metadata feeds packaged in standard Rich Site Summary (RSS) format. Upon receiving new feed entries, users can examine the metrics and download relevant variables, by simply clicking on a link, to begin further analyzing the event. The OceanXtremes web portal will allow users to define their own anomaly or feature types, with continuous backend processing scheduled to populate each new user-defined anomaly type by executing the chosen data mining algorithm (i.e., differences from climatology or gradients above a specified threshold). Metadata on the identified anomalies will be cataloged, including temporal and geospatial profiles, key physical metrics, related observational artifacts and other relevant metadata, to facilitate discovery, extraction, and visualization. Products created by the anomaly detection algorithm will be made explorable and subsettable using Webification (Huang et al., 2014) and OPeNDAP (http://opendap.org) technologies. Using this platform, scientists can efficiently search for anomalies or ocean phenomena, compute data metrics for events or over time series of ocean variables, and efficiently find and access all of the data relevant to their study (and then download only that data).
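    The simplest anomaly rule mentioned above (differences from climatology above a specified threshold) can be sketched in a few lines; the arrays below are synthetic stand-ins for a gridded ocean variable, not OceanXtremes code.

    ```python
    # Sketch only: synthetic stand-in for a gridded ocean variable (e.g., SST).
    import numpy as np

    rng = np.random.default_rng(4)
    climatology = 15 + 5 * np.sin(np.linspace(0, 2 * np.pi, 365))  # daily means, deg C
    observed = climatology + rng.normal(0, 0.5, 365)
    observed[200:205] += 4.0  # injected warm event

    anomaly = observed - climatology
    event_days = np.flatnonzero(np.abs(anomaly) > 2.0)  # threshold in deg C
    print("anomalous days:", event_days)
    ```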

  18. Data from: DATA MINING THE GALAXY ZOO MERGERS

    • catalog.data.gov
    • gimi9.com
    • +3 more
    Updated Apr 10, 2025
    Cite
    Dashlink (2025). DATA MINING THE GALAXY ZOO MERGERS [Dataset]. https://catalog.data.gov/dataset/data-mining-the-galaxy-zoo-mergers
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    DATA MINING THE GALAXY ZOO MERGERS

    Steven Baehr, Arun Vedachalam, Kirk Borne, and Daniel Sponseller

    Abstract. Collisions between pairs of galaxies usually end in the coalescence (merger) of the two galaxies. Collisions and mergers are rare phenomena, yet they may signal the ultimate fate of most galaxies, including our own Milky Way. With the onset of massive collection of astronomical data, a computerized and automated method will be necessary for identifying those colliding galaxies worthy of more detailed study. This project researches methods to accomplish that goal. Astronomical data from the Sloan Digital Sky Survey (SDSS) and human-provided classifications on merger status from the Galaxy Zoo project are combined and processed with machine learning algorithms. The goal is to determine indicators of merger status based solely on discovering those automated pipeline-generated attributes in the astronomical database that correlate most strongly with the patterns identified through visual inspection by the Galaxy Zoo volunteers. In the end, we aim to provide a new and improved automated procedure for classification of collisions and mergers in future petascale astronomical sky surveys. Both information-gain analysis (via the C4.5 decision tree algorithm) and cluster analysis (via the Davies-Bouldin Index) are explored as techniques for finding the strongest correlations between human-identified patterns and existing database attributes. Galaxy attributes measured in the SDSS green waveband images are found to represent the most influential of the attributes for correct classification of collisions and mergers. Only a nominal information gain is noted in this research; however, there is a clear indication of which attributes contribute, so that a direction for further study is apparent.
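    As an illustration of the cluster-analysis technique named in the abstract, the snippet below scores a k-means partition with the Davies-Bouldin index (scikit-learn); the features are synthetic, not SDSS attributes.

    ```python
    # Sketch only: synthetic features, not SDSS attributes.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import davies_bouldin_score

    rng = np.random.default_rng(5)
    X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(4, 1, (100, 3))])

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print("Davies-Bouldin index:", davies_bouldin_score(X, labels))  # lower = tighter
    ```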

  19. development high risk (national coal mining database) planning - open data

    • data.stirling.gov.uk
    Updated Jul 25, 2024
    Cite
    Stirling Council - insights by location (2024). development high risk (national coal mining database) planning - open data [Dataset]. https://data.stirling.gov.uk/datasets/33b602fe652c41d99716f9d276a77833
    Explore at:
    Dataset updated
    Jul 25, 2024
    Dataset authored and provided by
    Stirling Council - insights by location
    Area covered
    Description

    This dataset is published as Open Data.

    Abstract: The Development High Risk Area is the part of the coal mining reporting area which contains one or more recorded coal mining related features with the potential for instability or a degree of risk to the surface from the legacy of coal mining operations. The features included in this composite area comprise mine entries; shallow coal workings (recorded and probable); recorded coal mining related hazards; recorded mine gas sites; fissures and breaklines; and previous surface mining sites. New development in this defined area needs to demonstrate that it will be safe and stable, taking full account of former coal mining activities. This area was formerly known as the Development Referral Area.

    Purpose: The development high risk areas have been defined to enable developers and planners to understand and consider the potential for instability or degree of risk from the legacy of coal mining operations. This information is also provided to asset managers for the management of the land assets of public bodies and major landowners.

    Supplementary Information: The National Coal Mining Database, which is based on the records held at The Coal Authority offices in Mansfield, Nottinghamshire, is updated on a regular basis. This dataset has been extracted from this dynamic database on the date stated below and therefore represents a snapshot in time.

    Status of the data: Extract of data from the National Coal Mining Database. Data update frequency: as needed.

  20. Replication Data for: Policy Diffusion: The Issue-Definition Stage

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    Gilardi, Fabrizio; Shipan, Charles R.; Wüest, Bruno (2023). Replication Data for: Policy Diffusion: The Issue-Definition Stage [Dataset]. http://doi.org/10.7910/DVN/QEMNP1
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Gilardi, Fabrizio; Shipan, Charles R.; Wüest, Bruno
    Description

    We put forward a new approach to studying issue definition within the context of policy diffusion. Most studies of policy diffusion, which is the process by which policymaking in one government affects policymaking in other governments, have focused on policy adoptions. We shift the focus to an important but neglected aspect of this process: the issue-definition stage. We use topic models to estimate how policies are framed during this stage and how these frames are predicted by prior policy adoptions. Focusing on smoking restrictions in U.S. states, our analysis draws upon an original dataset of over 52,000 paragraphs from newspapers covering 49 states between 1996 and 2013. We find that frames regarding the policy's concrete implications are predicted by prior adoptions in other states, while frames regarding its normative justifications are not. Our approach and findings open the way for a new perspective on studying policy diffusion in many different areas.
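    For illustration, the frame-estimation step described above can be sketched with a standard topic model: fit LDA to paragraphs and read off each paragraph's topic (frame) mixture. The four toy paragraphs below stand in for the 52,000-paragraph newspaper corpus; this is not the authors' exact model specification.

    ```python
    # Sketch only: four toy paragraphs stand in for the newspaper corpus.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    paragraphs = [
        "smoking ban protects restaurant workers from secondhand smoke",
        "business owners say the smoking ban will cut bar revenue",
        "health officials cite cancer risks of secondhand smoke",
        "the ban raises questions about personal freedom and choice",
    ]
    counts = CountVectorizer(stop_words="english").fit_transform(paragraphs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

    doc_topics = lda.transform(counts)  # per-paragraph frame mixture
    print(doc_topics.round(2))
    ```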
