39 datasets found
  1. DatasetofDatasets (DoD)

    • kaggle.com
    zip
    Updated Aug 12, 2024
    Konstantinos Malliaridis (2024). DatasetofDatasets (DoD) [Dataset]. https://www.kaggle.com/terminalgr/datasetofdatasets-124-1242024
    Explore at:
    Available download formats: zip (7583 bytes)
    Dataset updated
    Aug 12, 2024
    Authors
    Konstantinos Malliaridis
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset is essentially the metadata of 164 datasets. Each of its rows describes one dataset, from which 22 features have been extracted; these features are used to classify each dataset into one of the categories 0-Unmanaged, 2-INV, 3-SI, or 4-NOA (DatasetType).

    This dataset consists of 164 rows; each row is the metadata of another dataset. The target column is DatasetType, which has four values indicating the dataset type. These are:

    2 - Invoice detail (INV): This dataset type is a special report (usually called a Detailed Sales Statement) produced by company accounting or Enterprise Resource Planning (ERP) software. Using an INV-type dataset directly for ARM is extremely convenient for users, as it relieves them of the tedious work of transforming data into another, more suitable form. INV-type data input typically includes a header, but only two of its attributes are essential for data mining. The first attribute serves as the grouping identifier creating a unique transaction (e.g., Invoice ID, Order Number), while the second attribute contains the items utilized for data mining (e.g., Product Code, Product Name, Product ID).

    3 - Sparse Item (SI): This type is widespread in Association Rules Mining (ARM). It involves a header and a fixed number of columns. Each item corresponds to a column. Each row represents a transaction. The typical cell stores a value, usually one character in length, that depicts the presence or absence of the item in the corresponding transaction. The absence character must be identified or declared before the Association Rules Mining process takes place.

    4 - Nominal Attributes (NOA): This type is commonly used in Machine Learning and Data Mining tasks. It involves a fixed number of columns. Each column registers nominal/categorical values. The presence of a header row is optional. However, in cases where no header is provided, there is a risk of extracting incorrect rules if similar values exist in different attributes of the dataset. The potential values for each attribute can vary.

    0 - Unmanaged for ARM: On the other hand, not all datasets are suitable for extracting useful association rules or frequent itemsets; for instance, datasets characterized predominantly by numerical features with arbitrary values, or datasets that involve fragmented or mixed data types. For such datasets, ARM processing becomes possible only by introducing a data discretization stage, which in turn introduces information loss. Datasets of this kind are not considered in the present treatise and are termed (0) Unmanaged in the sequel.

    Determining the dataset type is crucial for ARM, and the current dataset is used to classify a dataset's type with a supervised machine learning model.

    There is also another dataset type, 1 - Market Basket List (MBL), where each dataset row is a transaction. A transaction involves a variable number of items. Due to this characteristic, these datasets can be easily categorized using procedural programming, so DoD does not include instances of them. For more details about dataset types, please refer to the article "WebApriori: a web application for association rules mining": https://link.springer.com/chapter/10.1007/978-3-030-49663-0_44
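    As noted above, an MBL file can be recognized procedurally because its rows have a variable number of items, so no trained model is needed for that case. A minimal Python sketch of the check (the function name and sample data are invented for illustration):

```python
import csv
import io

def looks_like_mbl(text, delimiter=","):
    # A Market Basket List has a variable number of items per row,
    # so differing row lengths are the procedural giveaway.
    lengths = {len(row) for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row}
    return len(lengths) > 1

# Invented samples: a basket list vs. a fixed-width sparse-item table.
mbl_sample = "bread,milk\nbeer\nbread,beer,diapers,milk\n"
si_sample = "item_a,item_b,item_c\n1,0,1\n0,1,1\n"

print(looks_like_mbl(mbl_sample))  # True: rows have 2, 1, and 4 items
print(looks_like_mbl(si_sample))   # False: every row has 3 columns
```

    Fixed-length row types (SI, NOA, INV) all fail this check, which is why the remaining distinction needs the learned classifier.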

  2. Data mining as a hatchery process evaluation tool

    • scielo.figshare.com
    jpeg
    Updated Jun 3, 2023
    Daniela Regina Klein; Marcos Martinez do Vale; Mariana Fernandes Ribas da Silva; Micheli Faccin Kuhn; Tatiane Branco; Mauricio Portella dos Santos (2023). Data mining as a hatchery process evaluation tool [Dataset]. http://doi.org/10.6084/m9.figshare.10258280.v1
    Explore at:
    Available download formats: jpeg
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    SciELO (http://www.scielo.org/)
    Authors
    Daniela Regina Klein; Marcos Martinez do Vale; Mariana Fernandes Ribas da Silva; Micheli Faccin Kuhn; Tatiane Branco; Mauricio Portella dos Santos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT The hatchery is one of the most important segments of the poultry chain, and it generates an abundance of data which, when analyzed, allow for identifying critical points of the process. The aim of this study was to evaluate the applicability of the data mining technique to databases of egg incubation of broiler breeders and laying hen breeders. The study uses a database recording egg incubation from broiler breeders housed in pens with shavings used for litter in a natural mating system, as well as laying hen breeders housed in cages using an artificial insemination mating system. The data mining technique (DM) was applied to analyses in a classification task, using the type of breeder and housing system for delineating classes. The database was analyzed in three different ways: original database, attribute selection, and expert analysis. Models were selected on the basis of model precision and class accuracy. The data mining technique allowed for the classification of hatchery fertile eggs from different genetic groups, as well as hatching rates and the percentage of fertile eggs (the attributes with the greatest classification power). Broiler breeders showed higher fertility (> 95 %), but higher embryonic mortality between the third and seventh day post-hatching (> 0.5 %) when compared to laying hen breeders’ eggs. In conclusion, when applying data mining to the hatchery process, selection of attributes and strategies based on the experience of experts can improve model performance.

  3. H1B Disclosure Dataset

    • kaggle.com
    zip
    Updated Dec 31, 2017
    Charmi (2017). H1B Disclosure Dataset [Dataset]. https://www.kaggle.com/trivedicharmi/h1b-disclosure-dataset
    Explore at:
    Available download formats: zip (44804316 bytes)
    Dataset updated
    Dec 31, 2017
    Authors
    Charmi
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Project Description:

    1) Data Background

    In the Data Mining class, we had the opportunity to analyze data by applying data mining algorithms to a dataset. Our dataset is from the Office of Foreign Labor Certification (OFLC), a division of the U.S. Department of Labor. The main duty of the OFLC is to assist the Secretary of Labor in enforcing part of the Immigration and Nationality Act (INA), which requires that certain labor conditions exist before employers can hire foreign workers. H-1B is a visa category in the United States of America under INA section 101(a)(15)(H), which allows U.S. employers to employ foreign workers. The first step an employer must take to hire a foreign worker is to file the Labor Condition Application. In this project, we will analyze the data from the Labor Condition Application.

    1.1) Introduction to H1B Dataset

    The H-1B dataset selected for this project contains data from employers’ Labor Condition Applications and the case certification determinations processed by the Office of Foreign Labor Certification (OFLC), where the determination was issued on or after October 1, 2016 and on or before June 30, 2017.

    The Labor Condition Application (LCA) is a document that a prospective H-1B employer files with the U.S. Department of Labor Employment and Training Administration (DOLETA) when it seeks to employ non-immigrant workers at a specific job occupation in an area of intended employment for not more than three years.

    1.2) Goal of the Project

    Our goal for this project is to predict the case status of an application submitted by the employer to hire non-immigrant workers under the H-1B visa program. Employer can hire non-immigrant workers only after their LCA petition is approved. The approved LCA petition is then submitted as part of the Petition for a Non-immigrant Worker application for work authorizations for H-1B visa status.

    We want to uncover insights that can help employers understand the process of getting their LCA approved. We will use WEKA software to run data mining algorithms to understand the relationship between attributes and the target variable.

    2) Dataset Information:

    a) Source: Office of Foreign Labor Certification, U.S. Department of Labor Employment and Training Administration
    b) List Link: https://www.foreignlaborcert.doleta.gov/performancedata.cfm
    c) Dataset Type: Record – Transaction Data
    d) Number of Attributes: 40
    e) Number of Instances: 528,147
    f) Date Created: July 2017

    3) Attribute List:

    The detailed description of each attribute below is given in the Record Layout file available in the zip folder H1B Disclosure Dataset Files.

    The H-1B dataset from OFLC contained 40 attributes and 528,147 instances. The attributes are in the table below. The attributes highlighted bold were removed during the data cleaning process.

    1) CASE_NUMBER
    2) CASE_SUBMITTED
    3) DECISION_DATE
    4) VISA_CLASS
    5) EMPLOYMENT_START_DATE
    6) EMPLOYMENT_END_DATE
    7) EMPLOYER_NAME
    8) EMPLOYER_ADDRESS
    9) EMPLOYER_CITY
    10) EMPLOYER_STATE
    11) EMPLOYER_POSTAL_CODE
    12) EMPLOYER_COUNTRY
    13) EMPLOYER_PROVINCE
    14) EMPLOYER_PHONE
    15) EMPLOYER_PHONE_EXT
    16) AGENT_ATTORNEY_NAME
    17) AGENT_ATTORNEY_CITY
    18) AGENT_ATTORNEY_STATE
    19) JOB_TITLE
    20) SOC_CODE
    21) SOC_NAME
    22) NAICS_CODE
    23) TOTAL_WORKERS
    24) FULL_TIME_POSITION
    25) PREVAILING_WAGE
    26) PW_UNIT_OF_PAY
    27) PW_SOURCE
    28) PW_SOURCE_YEAR
    29) PW_SOURCE_OTHER
    30) WAGE_RATE_OF_PAY_FROM
    31) WAGE_RATE_OF_PAY_TO
    32) WAGE_UNIT_OF_PAY
    33) H-1B_DEPENDENT
    34) WILLFUL_VIOLATOR
    35) WORKSITE_CITY
    36) WORKSITE_COUNTY
    37) WORKSITE_STATE
    38) WORKSITE_POSTAL_CODE
    39) ORIGINAL_CERT_DATE
    40) CASE_STATUS* (class attribute, to be predicted)

    3.1) Class Attribute

    For the H-1B dataset, our class attribute is ‘CASE_STATUS’. There are four categories of case status. The values of the CASE_STATUS attribute are:

    1) Certified
    2) Certified_Withdrawn
    3) Withdrawn
    4) Denied

    Certified means the LCA of an employer was approved. Certified Withdrawn means the case was withdrawn after it was certified by OFLC. Withdrawn means the case was withdrawn by the employer. Denied means the case was denied by OFLC.
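    Before modeling, it helps to tabulate the class distribution over these four values. A minimal Python sketch with invented toy rows (only the CASE_NUMBER and CASE_STATUS column names come from the attribute list above; the values are made up):

```python
import csv
import io
from collections import Counter

# Invented stand-in rows for the real disclosure file.
sample = """CASE_NUMBER,CASE_STATUS
I-200-00001,CERTIFIED
I-200-00002,CERTIFIED
I-200-00003,DENIED
I-200-00004,CERTIFIED-WITHDRAWN
I-200-00005,WITHDRAWN
"""

# Count how many applications fall into each class.
counts = Counter(row["CASE_STATUS"] for row in csv.DictReader(io.StringIO(sample)))
for status, n in counts.most_common():
    print(status, n)
```

    On the real 528,147-instance file the same loop reveals how imbalanced the classes are, which matters when evaluating any classifier trained on CASE_STATUS.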

  4. The Insurance Company (TIC) Benchmark

    • kaggle.com
    zip
    Updated May 27, 2020
    Kush Shah (2020). The Insurance Company (TIC) Benchmark [Dataset]. https://www.kaggle.com/datasets/kushshah95/the-insurance-company-tic-benchmark/code
    Explore at:
    Available download formats: zip (268454 bytes)
    Dataset updated
    May 27, 2020
    Authors
    Kush Shah
    Description

    This data set, used in the CoIL 2000 Challenge, contains information on customers of an insurance company. The data consists of 86 variables and includes product usage data and socio-demographic data.

    DETAILED DATA DESCRIPTION

    THE INSURANCE COMPANY (TIC) 2000

    (c) Sentient Machine Research 2000

    DISCLAIMER

    This dataset is owned and supplied by the Dutch data mining company Sentient Machine Research, and is based on real-world business data. You are allowed to use this dataset and accompanying information for non-commercial research and education purposes only. It is explicitly not allowed to use this dataset for commercial education or demonstration purposes. For any other use, please contact Peter van der Putten, info@smr.nl.

    This dataset has been used in the CoIL Challenge 2000 data mining competition. For papers describing results on this dataset, see the TIC 2000 homepage: http://www.wi.leidenuniv.nl/~putten/library/cc2000/

    REFERENCE P. van der Putten and M. van Someren (eds). CoIL Challenge 2000: The Insurance Company Case. Published by Sentient Machine Research, Amsterdam. Also a Leiden Institute of Advanced Computer Science Technical Report 2000-09. June 22, 2000. See http://www.liacs.nl/~putten/library/cc2000/

    RELEVANT FILES

    tic_2000_train_data.csv: Dataset to train and validate prediction models and build a description (5822 customer records). Each record consists of 86 attributes, containing sociodemographic data (attributes 1-43) and product ownership (attributes 44-86). The sociodemographic data is derived from zip codes: all customers living in areas with the same zip code have the same sociodemographic attributes. Attribute 86, "CARAVAN: Number of mobile home policies", is the target variable.

    tic_2000_eval_data.csv: Dataset for predictions (4000 customer records). It has the same format as TICDATA2000.txt, except that the target is missing. Participants are supposed to return the list of predicted targets only. All datasets are in CSV format. The meaning of the attributes and attribute values is given in dictionary.csv.

    tic_2000_target_data.csv Targets for the evaluation set.

    dictionary.txt: Data description with the numerically labeled category descriptions. It contains the column descriptions and the labels of the dummy/label encoding.
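    The attribute split described above (sociodemographic attributes 1-43, product ownership 44-85, CARAVAN target as attribute 86) can be sketched as follows. The A1..A86 header names and the toy record are placeholders, not the real names from the dictionary:

```python
import csv
import io

# Build a one-record stand-in for tic_2000_train_data.csv with 86 columns.
header = [f"A{i}" for i in range(1, 87)]
values = ["0"] * 86
values[85] = "1"  # toy value for attribute 86, the CARAVAN target
text = ",".join(header) + "\n" + ",".join(values) + "\n"

record = next(csv.DictReader(io.StringIO(text)))
socio = {k: v for k, v in record.items() if int(k[1:]) <= 43}           # attributes 1-43
product = {k: v for k, v in record.items() if 44 <= int(k[1:]) <= 85}   # attributes 44-85
target = record["A86"]                                                  # attribute 86

print(len(socio), len(product), target)  # 43 42 1
```

    Keeping the split positional like this mirrors how the challenge files are laid out; the human-readable attribute names then come from dictionary.txt.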

    Original Task description Link: http://liacs.leidenuniv.nl/~puttenpwhvander/library/cc2000/problem.html UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Insurance+Company+Benchmark+%28COIL+2000%29

  5. winequality-white

    • kaggle.com
    zip
    Updated Oct 12, 2024
    Vitalii Puzhenko (2024). winequality-white [Dataset]. https://www.kaggle.com/datasets/vitaliipuzhenko/winequality-white/suggestions?status=pending
    Explore at:
    Available download formats: zip (73187 bytes)
    Dataset updated
    Oct 12, 2024
    Authors
    Vitalii Puzhenko
    Description

    P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

    Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

    1. Title: Wine Quality

    2. Sources Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009

    3. Past Usage:

      P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

      In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. PH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T), etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity analysis procedure).

    4. Relevant Information:

      The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

      These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant, so it could be interesting to test feature selection methods.

    5. Number of Instances: red wine - 1599; white wine - 4898.

    6. Number of Attributes: 11 + output attribute

      Note: several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.

    7. Attribute information:

      For more information, read [Cortez et al., 2009].

      Input variables (based on physicochemical tests):
      1 - fixed acidity
      2 - volatile acidity
      3 - citric acid
      4 - residual sugar
      5 - chlorides
      6 - free sulfur dioxide
      7 - total sulfur dioxide
      8 - density
      9 - pH
      10 - sulphates
      11 - alcohol

      Output variable (based on sensory data):
      12 - quality (score between 0 and 10)

    8. Missing Attribute Values: None
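    A minimal Python sketch of loading this table and separating the 11 inputs from the quality output. The single stand-in row is invented, and the semicolon separator follows the UCI distribution of this data; the Kaggle zip may be packaged differently:

```python
import csv
import io

# Invented one-row stand-in for winequality-white.csv (UCI files are
# semicolon-separated: 11 physicochemical inputs, then "quality").
text = (
    "fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;"
    "free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality\n"
    "7.0;0.27;0.36;20.7;0.045;45;170;1.001;3.0;0.45;8.8;6\n"
)

rows = list(csv.DictReader(io.StringIO(text), delimiter=";"))
X = [{k: float(v) for k, v in r.items() if k != "quality"} for r in rows]  # 11 inputs
y = [int(r["quality"]) for r in rows]                                      # ordered score

print(len(X[0]), y)  # 11 [6]
```

    Keeping quality as an integer preserves the ordered nature of the classes, so the same split works for either a regression or a classification framing.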

  6. Types and attributes of event log data.

    • plos.figshare.com
    xls
    Updated Jun 3, 2023
    Hyunyoung Baek; Minsu Cho; Seok Kim; Hee Hwang; Minseok Song; Sooyoung Yoo (2023). Types and attributes of event log data. [Dataset]. http://doi.org/10.1371/journal.pone.0195901.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Hyunyoung Baek; Minsu Cho; Seok Kim; Hee Hwang; Minseok Song; Sooyoung Yoo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Types and attributes of event log data.

  7. Data_Sheet_1_Cross-Species Meta-Analysis of Transcriptomic Data in...

    • frontiersin.figshare.com
    txt
    Updated Jun 2, 2023
    + more versions
    Mohammad Farhadian; Seyed A. Rafat; Karim Hasanpur; Mansour Ebrahimi; Esmaeil Ebrahimie (2023). Data_Sheet_1_Cross-Species Meta-Analysis of Transcriptomic Data in Combination With Supervised Machine Learning Models Identifies the Common Gene Signature of Lactation Process.CSV [Dataset]. http://doi.org/10.3389/fgene.2018.00235.s001
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Mohammad Farhadian; Seyed A. Rafat; Karim Hasanpur; Mansour Ebrahimi; Esmaeil Ebrahimie
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Lactation, a physiologically complex process, takes place in the mammary gland after parturition. The expression profile of the genes effective in lactation has not been comprehensively elucidated. Herein, a meta-analysis using publicly available microarray data was conducted to identify the differentially expressed genes (DEGs) between pre- and post-peak milk production. Three microarray datasets of Rat, Bos taurus, and Tammar wallaby were used. Samples related to pre-peak (n = 85) and post-peak (n = 24) milk production were selected. Meta-analysis revealed 31 DEGs across the studied species. Interestingly, 10 genes, including MRPS18B, SF1, UQCRC1, NUCB1, RNF126, ADSL, TNNC1, FIS1, HES5 and THTPA, were not detected in the original studies, which highlights the power of meta-analysis in biosignature discovery. Common target and regulator analysis highlighted the high connectivity of CTNNB1, CDD4 and LPL as gene network hubs. As the data originally came from three different species, 10 attribute weighting (machine learning) algorithms were applied to check the effects of heterogeneous data sources on the DEGs. Attribute weighting results showed that the type of organism had no or little effect on the selected gene list. Systems biology analysis suggested that these DEGs affect milk production by improving immune system performance and mammary cell growth. This is the first study employing both meta-analysis and machine learning approaches for comparative analysis of the gene expression pattern of mammary glands at two important time points of the lactation process. The findings may pave the way to the use of publicly available data to elucidate the underlying molecular mechanisms of physiologically complex traits such as lactation in mammals.

  8. Market Basket Analysis

    • kaggle.com
    zip
    Updated Dec 9, 2021
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    Available download formats: zip (23875170 bytes)
    Dataset updated
    Dec 9, 2021
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a dataset containing a retailer's transaction data, covering all the transactions that happened over a period of time. The retailer will use the results to grow its business and to make itemset suggestions to customers, allowing us to increase customer engagement, improve customer experience, and identify customer behavior. I will solve this problem with association rules, an unsupervised learning technique that checks for the dependency of one data item on another.

    Introduction

    Association rule mining is most often used when you want to discover associations between different objects in a set. It works well for finding frequent patterns in a transaction database: it can tell you which items customers frequently buy together, and it allows the retailer to identify relationships between items.

    An Example of Association Rules

    Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat":

    • support = P(mouse & mat) = 8/100 = 0.08
    • confidence = support / P(mouse) = 0.08/0.10 = 0.8
    • lift = confidence / P(mat) = 0.8/0.09 ≈ 8.9

    This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
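    The arithmetic for the mouse => mat rule can be checked in a few lines of Python (confidence of X => Y is P(X and Y) / P(X), and lift divides that by P(Y)):

```python
# Worked example: 100 customers, 10 bought a computer mouse,
# 9 bought a mouse mat, 8 bought both.
n, n_mouse, n_mat, n_both = 100, 10, 9, 8

support = n_both / n             # P(mouse & mat) = 8/100 = 0.08
confidence = n_both / n_mouse    # P(mat | mouse) = 8/10  = 0.8
lift = confidence / (n_mat / n)  # 0.8 / 0.09 ≈ 8.9

print(support, confidence, round(lift, 2))  # 0.08 0.8 8.89
```

    A lift well above 1 (here about 8.9) means the two items co-occur far more often than they would if purchases were independent.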

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that it is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of Rows: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.


    Libraries in R

    First, we need to load the required libraries. Below, I briefly describe each library.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and item-sets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.


    Data Pre-processing

    Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.


    Next, we will clean our data frame by removing missing values.


    To apply association rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
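    In outline, that conversion is just "collect the item names per BillNo". A hedged Python sketch of the same grouping (the original walkthrough does this in R with arules; the sample rows here are invented):

```python
import csv
import io
from collections import defaultdict

# Invented slice of the retail file, using the BillNo and Itemname
# columns from the attribute list above.
sample = """BillNo,Itemname
536365,WHITE HANGING HEART
536365,WHITE METAL LANTERN
536366,HAND WARMER
"""

# One basket (transaction) per invoice number.
baskets = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample)):
    baskets[row["BillNo"]].append(row["Itemname"])

for bill, items in sorted(baskets.items()):
    print(bill, items)
```

    Each resulting basket corresponds to one "transactions" object row in arules, which is the shape the Apriori algorithm expects.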

  9. Estimated Use of Water in the United States County-Level Data for 2015

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 12, 2025
    + more versions
    U.S. Geological Survey (2025). Estimated Use of Water in the United States County-Level Data for 2015 [Dataset]. https://catalog.data.gov/dataset/estimated-use-of-water-in-the-united-states-county-level-data-for-2015
    Explore at:
    Dataset updated
    Nov 12, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    United States
    Description

    This dataset contains water-use estimates for 2015 that are aggregated to the county level in the United States. The U.S. Geological Survey's (USGS's) National Water Use Science Project is responsible for compiling and disseminating the Nation's water-use data. Working in cooperation with local, State, and Federal agencies, the USGS has published an estimate of water use in the United States every 5 years, beginning in 1950. Water-use estimates aggregated to the State level are presented in USGS Circular 1441, "Estimated Use of Water in the United States in 2015" (Dieter and others, 2018). This dataset contains the county-level water-use data that support the state-level estimates in Dieter and others, 2018. This dataset contains data for public supply, domestic, irrigation, thermoelectric power, industrial, mining, livestock, and aquaculture water-use categories.

    First posted September 28, 2017, ver. 1.0. Revised June 19, 2018, ver. 2.0.

    Version 2.0: This version of the dataset contains total population data and water-use estimates for 2015 for the following categories: public supply, domestic, irrigation, thermoelectric power, industrial, mining, livestock, and aquaculture. Data are aggregated to the county level. A value of "--" denotes that values were not estimated for an optional attribute. Some values in the public supply and domestic categories have been updated from those found in version 1.0 of this dataset.

    Version 1.0: This version of the dataset contains total population data and water-use estimates for the public supply and domestic categories for 2015 that are aggregated to the county level in the United States. A "--" in the attributes "PS-GWPop" or "PS-SWPop" denotes that values were not estimated for an optional attribute. All other occurrences of "--" denote data for an attribute in a water-use category that has not yet been released. Version 1.0 data are available upon request.
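    When parsing the file programmatically, the "--" sentinel should be mapped to a missing value rather than read as data. A small Python sketch with an invented two-county excerpt (only the PS-GWPop/PS-SWPop attribute names come from the description above):

```python
import csv
import io

# Invented excerpt; "--" marks values that were not estimated.
sample = """STATE,COUNTY,PS-GWPop,PS-SWPop
AL,Autauga County,10.2,--
AL,Baldwin County,--,85.3
"""

def parse(value):
    # Treat the dataset's "--" sentinel as missing.
    return None if value == "--" else value

rows = [{k: parse(v) for k, v in r.items()} for r in csv.DictReader(io.StringIO(sample))]
print(rows[0]["PS-SWPop"], rows[1]["PS-GWPop"])  # None None
```

    Doing this up front keeps the "not estimated" and "not yet released" markers from silently polluting any numeric aggregation.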

  10. Characteristics that Favor Freq-Itemset Algorithms

    • kaggle.com
    Updated Oct 24, 2020
    Jeff Heaton (2020). Characteristics that Favor Freq-Itemset Algorithms [Dataset]. https://www.kaggle.com/jeffheaton/characteristics-that-favor-freqitemset-algorithms
    Explore at:
    Croissant, a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 24, 2020
    Dataset provided by
    Kaggle
    Authors
    Jeff Heaton
    License

    LGPL 3.0 (http://www.gnu.org/licenses/lgpl-3.0.html)

    Description

    Source Paper

    This dataset is from my paper:

    Heaton, J. (2016, March). Comparing dataset characteristics that favor the Apriori, Eclat or FP-Growth frequent itemset mining algorithms. In SoutheastCon 2016 (pp. 1-7). IEEE.

    Frequent itemset mining is a popular data mining technique. Apriori, Eclat, and FP-Growth are among the most common algorithms for frequent itemset mining. Considerable research has been performed to compare the relative performance between these three algorithms, by evaluating the scalability of each algorithm as the dataset size increases. While scalability as data size increases is important, previous papers have not examined the performance impact of similarly sized datasets that contain different itemset characteristics. This paper explores the effects that two dataset characteristics can have on the performance of these three frequent itemset algorithms. To perform this empirical analysis, a dataset generator is created to measure the effects of frequent item density and the maximum transaction size on performance. The generated datasets contain the same number of rows. This provides some insight into dataset characteristics that are conducive to each algorithm. The results of this paper's research demonstrate Eclat and FP-Growth both handle increases in maximum transaction size and frequent itemset density considerably better than the Apriori algorithm.

    Files Generated

    We generated two datasets that allow us to adjust two independent variables to create a total of 20 different transaction sets. We also provide the Python script that generated this data in a notebook. This Python script accepts the following parameters to specify the transaction set to produce:

    • Transaction/Basket count: 5 million default
    • Number of items: 50,000 default
    • Number of frequent sets: 100 default
    • Max transaction/basket size: independent variable, 5-100 range
    • Frequent set density: independent variable, 0.1 to 0.8 range

    Files contained in this dataset reside in two folders:

    • freq-items-pct - We vary the frequent set density in these transaction sets.
    • freq-items-tsz - We change the maximum number of items per basket in these transaction sets.

    While you can vary the basket count, the number of frequent sets, and the number of items in the script, they remain fixed at the values stated above for this paper. We determined that the basket count had only a small positive correlation with performance.

    File Content

    The following listing shows the type of data generated for this research. Here we present an example file created with ten baskets out of 100 items, two frequent itemsets, a maximum basket size of 10, and a density of 0.5.

    I36 I94 
    I71 I13 I91 I89 I34
    F6 F5 F3 F4 
    I86 
    I39 I16 I49 I62 I31 I54 I91 
    I22 I31 
    I70 I85 I78 I63 
    F4 F3 F1 F6 F0 I69 I44 
    I82 I50 I9 I31 I57 I20 
    F4 F3 F1 F6 F0 I87
    

    As you can see from the above file, items are prefixed with either “I” or “F.” The “F” prefix indicates that the line contains one of the intentional frequent itemsets. Items with the “I” prefix are not part of an intentional frequent itemset; of course, “I”-prefixed items might still form frequent itemsets, since they are uniformly sampled from the item pool to fill out non-frequent lines.

    Each basket is assigned a random size, up to the maximum basket size. The frequent itemset density specifies the probability that a line contains one of the intentional frequent itemsets. Because we used a density of 0.5, approximately half of the lines above include one of the two intentional frequent itemsets. A frequent itemset line may have additional random “I”-prefixed items appended so that the line reaches the randomly chosen length for that line. If the selected frequent itemset already exceeds the line's randomly chosen length, no truncation occurs. The intentional frequent itemsets are all constructed to be no longer than the maximum basket size.
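    The generation process described above can be sketched in a few lines of Python. This is a minimal illustration under the stated parameters, not the paper's actual script; the function name `gen_baskets` and the exact sampling details are invented:

```python
import random

def gen_baskets(n_baskets=10, n_items=100, n_freq_sets=2,
                max_basket=10, density=0.5, seed=7):
    """Generate transaction lines in the I/F-prefixed format shown above."""
    rng = random.Random(seed)
    # Intentional frequent itemsets, built from "F"-prefixed items and
    # constrained to be no longer than the maximum basket size.
    f_pool = [f"F{i}" for i in range(10)]
    freq_sets = [rng.sample(f_pool, rng.randint(2, max_basket))
                 for _ in range(n_freq_sets)]
    lines = []
    for _ in range(n_baskets):
        size = rng.randint(1, max_basket)  # random length chosen for this line
        basket = []
        # With probability `density`, seed the line with a frequent itemset;
        # if the itemset is already longer than `size`, it is not truncated.
        if rng.random() < density:
            basket = list(rng.choice(freq_sets))
        # Pad with uniformly sampled "I"-prefixed items up to the chosen length.
        while len(basket) < size:
            item = f"I{rng.randrange(n_items)}"
            if item not in basket:
                basket.append(item)
        lines.append(" ".join(basket))
    return lines

lines = gen_baskets()
```

    Each returned line corresponds to one basket; at a density of 0.5, roughly half of the lines begin with one of the intentional frequent itemsets.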

  11. Association analysis of high-low outlier road intersection crashes within...

    • zivahub.uct.ac.za
    xlsx
    Updated Jun 7, 2024
    + more versions
    Cite
    Simone Vieira; Simon Hull; Roger Behrens (2024). Association analysis of high-low outlier road intersection crashes within the CoCT in 2017, 2018, 2019 and 2021 [Dataset]. http://doi.org/10.25375/uct.25975741.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 7, 2024
    Dataset provided by
    University of Cape Town
    Authors
    Simone Vieira; Simon Hull; Roger Behrens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    City of Cape Town
    Description

    This dataset provides comprehensive information on road intersection crashes recognised as "high-low" outliers within the City of Cape Town. It includes detailed records of all intersection crashes and their corresponding crash attribute combinations, which were prevalent in at least 5% of the total "high-low" outlier road intersection crashes for the years 2017, 2018, 2019, and 2021. The dataset is organised according to support metric values, ranging from 0,05 to 0,0278, with entries presented in descending order.

    Data Specifics
    Data Type: Geospatial-temporal categorical data
    File Format: Excel document (.xlsx)
    Size: 0,99 MB
    Number of Files: The dataset contains a total of 10212 association rules
    Date Created: 23rd May 2024

    Methodology
    Data Collection Method: The descriptive road traffic crash data per crash victim involved in the crashes was obtained from the City of Cape Town Network Information
    Software: ArcGIS Pro, Python
    Processing Steps: Following the spatio-temporal analyses and the derivation of "high-low" outlier fishnet grid cells from a cluster and outlier analysis, all the road intersection crashes that occurred within the "high-low" outlier fishnet grid cells were extracted to be processed by association analysis. The association analysis of the "high-low" outlier road intersection crashes was processed using Python software and involved the use of a 0,05 support metric value. Consequently, commonly occurring crash attributes among at least 5% of the "high-low" outlier road intersection crashes were extracted for inclusion in this dataset.

    Geospatial Information
    Spatial Coverage:
    West Bounding Coordinate: 18°20'E
    East Bounding Coordinate: 19°05'E
    North Bounding Coordinate: 33°25'S
    South Bounding Coordinate: 34°25'S
    Coordinate System: South African Reference System (Lo19) using the Universal Transverse Mercator projection

    Temporal Information
    Temporal Coverage:
    Start Date: 01/01/2017
    End Date: 31/12/2021 (2020 data omitted)
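    The association-analysis step — extracting attribute combinations whose support meets a threshold — can be reproduced in outline with plain Python. The crash attributes below are invented for illustration; the actual analysis used the City of Cape Town crash data and, per the description, a 0,05 support threshold:

```python
from itertools import combinations
from collections import Counter

def frequent_combinations(transactions, min_support=0.05, max_len=3):
    """Return every attribute combination (up to max_len items) whose support,
    i.e. the fraction of transactions containing it, meets min_support."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    n = len(transactions)
    return {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}

# Hypothetical crash records: each record is a set of attribute=value tags.
crashes = [
    {"severity=fatal", "time=night", "weather=wet"},
    {"severity=minor", "time=night"},
    {"severity=fatal", "time=night"},
    {"severity=minor", "time=day", "weather=dry"},
]
freq = frequent_combinations(crashes, min_support=0.5)
```

    Ordering the resulting rules by support in descending order, as done in this dataset, is then a single `sorted(freq.items(), key=lambda kv: -kv[1])`.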

  12. Surface disturbance linear features - Catalogue - Canadian Urban Data...

    • data.urbandatacentre.ca
    Updated Oct 1, 2024
    Cite
    (2024). Surface disturbance linear features - Catalogue - Canadian Urban Data Catalogue (CUDC) [Dataset]. https://data.urbandatacentre.ca/dataset/gov-canada-adabb0f8-af79-1645-ac51-3b2a77deeb3b
    Explore at:
    Dataset updated
    Oct 1, 2024
    Description

    This data shows anthropogenic polyline disturbance features. Features were digitized using high resolution satellite imagery and orthophotos. Features from the National Road Network (NRN) and the National Railway Network (NRWN) were adapted and included. The following data was not included in the dataset: proposed features.

    Table 1. A list of attributes, associated domains, and descriptions.

    | Attribute | Data Type | Domains | Description |
    | --- | --- | --- | --- |
    | REF_ID | Text (20) | | Unique feature reference ID |
    | DATABASE | Text (20) | Historic, Most Recent, Retired | Sub-database to which the feature belongs |
    | TYPE_INDUSTRY | Text (50) | Table 2.3.2 | Major classification of disturbance feature by industry |
    | TYPE_DISTURBANCE | Text (50) | Table 2.3.2 | Sub classification of disturbance feature |
    | WIDTH_M* | Double | | Width of feature in meters |
    | WIDTH_CLASS** | Text (5) | HIGH, MED, LOW | Width of feature by classification |
    | SCALE_CAPTURED | Long | | Scale at which the feature was digitized |
    | DATA_SOURCE | Text (10) | Imagery, GPS, Other | Data source: digitized from imagery, captured by GPS, or obtained by other means |
    | IMAGE_NAME | Text (100) | | Filename of source imagery |
    | IMAGE_DATE | Date | | Date that imagery was captured (YYYYMMDD) |
    | IMAGE_RESOLUTION | Double | | Resolution of source imagery in meters |
    | IMAGE_SENSOR | Text (35) | | Name of sensor that captured source imagery |

    *WIDTH_M: Linear features must be attributed with a width measurement. The width of the feature can be estimated in meters, rounded to the nearest whole number.

    **WIDTH_CLASS: This field employs a classification scheme used by previous contractors. This classification scheme was discussed and agreed upon by Mammoth Mapping and the Project Manager in 2011-2013. The width values are the following.

    Table 2. Width classification breakdown.

    | WIDTH_CLASS | Anticipated Value Range (meters) |
    | --- | --- |
    | LOW | <4 |
    | MED | 4-8 |
    | HIGH | >8 |

    Table 3. A list of disturbance feature types and their descriptions.

    | TYPE_INDUSTRY | TYPE_DISTURBANCE | DESCRIPTION |
    | --- | --- | --- |
    | Mining | Survey / Cutline | A linear cleared area through undeveloped land, used for line-of-sight surveying; impossible to distinguish whether associated with quartz or placer mining (overlapping or unclear claims information) |
    | Mining | Survey / Cutline - Placer | A linear cleared area through undeveloped land, used for line-of-sight surveying; associated with placer mining (identified using claims information and/or other indicators) |
    | Mining | Survey / Cutline - Quartz | A linear cleared area through undeveloped land, used for line-of-sight surveying; associated with quartz mining (identified using claims information and/or other indicators) |
    | Mining | Trench | A long, narrow excavation dug to expose vein or ore structure |
    | Mining | Unknown | Unknown linear mining disturbance |
    | Oil and Gas | Pipeline | Visible pipeline or pipeline Right-of-Way (above- or below-ground) |
    | Oil and Gas | Seismic Line | Seismic lines |
    | Rural | Driveway | A driveway in a rural area |
    | Rural | Fence | A fence in a rural area |
    | Transportation | Access Assumed | A linear feature that is assumed to be an access road, but could also be a trail |
    | Transportation | Access Road | A road or narrow passage whose primary function is to provide access for resource extraction (i.e. mining, forestry) and may also have served in providing public access to the backcountry |
    | Transportation | Arterial Road | A major thoroughfare with medium to large traffic capacity |
    | Transportation | Local Road | A low-speed thoroughfare that provides access to the front of properties, including those with potential public restrictions such as trailer parks, First Nations land, private estates, seasonal residences, and gravel pits (NRN definition for Local Street/Local Strata/Local Unknown). Shows signs of regular use |
    | Transportation | Right of Way For Road | Rights as attributed in the land parcels ancillary data |
    | Transportation | Trail | Path or track (typically <1.5 m wide) used for walking, cycling, ORV, or other backcountry activities. (Note: trails used for mining activities are Access Roads.) |
    | Transportation | Unpaved Road | Dirt or gravel road (typically >1.5 m wide) that does not necessarily access remote resources |
    | Unknown | Right of Way | A right of way with unknown industry type |
    | Unknown | Survey / Cutline | A linear cleared area through undeveloped land, used for line-of-sight surveying. A cutline may not always be associated with mineral exploration; therefore, Type: Unknown was used to differentiate all cutlines outside of mineral exploration |
    | Unknown | Unknown | Unclassified, or unable to identify type based on imagery, but suspected to be anthropogenic |
    | Utility | Electric Utility Corridor | Corridor usually running parallel to highway, where transmission lines or other utilities are visible |
    | Utility | Unknown | Unknown linear feature assumed to be a utility corridor; ancillary data is unclear |

    Distributed from GeoYukon by the Government of Yukon. Discover more digital map data and interactive maps from Yukon's digital map data collection. For more information: geomatics.help@yukon.ca

  13. The StreamCat Dataset: Accumulated Attributes for NHDPlusV2 (Version 2.1)...

    • catalog.data.gov
    • datasets.ai
    • +1more
    Updated Feb 4, 2025
    + more versions
    Cite
    U.S. Environmental Protection Agency, Office of Research and Development (ORD), Center for Public Health and Environmental Assessment (CPHEA), Pacific Ecological Systems Division (PESD), (2025). The StreamCat Dataset: Accumulated Attributes for NHDPlusV2 (Version 2.1) Catchments for the Conterminous United States: Mine Density Active Mines and Mineral Plants in the US [Dataset]. https://catalog.data.gov/dataset/the-streamcat-dataset-accumulated-attributes-for-nhdplusv2-version-2-1-catchments-for-the--6c68a
    Explore at:
    Dataset updated
    Feb 4, 2025
    Dataset provided by
    U.S. Environmental Protection Agency, Office of Research and Development (ORD), Center for Public Health and Environmental Assessment (CPHEA), Pacific Ecological Systems Division (PESD),
    Area covered
    United States
    Description

    This dataset represents the mine density within individual, local NHDPlusV2 catchments and upstream, contributing watersheds based on mine plants and operations monitored by the USGS National Minerals Information Center. The National Minerals Information Center canvasses the nonfuel mining and mineral-processing industry in the United States for data on mineral production, consumption, recycling, stocks, and shipments. Mine plants and operations for commodities are expressed as points in a shapefile that was downloaded from the USGS directly. The mine points were summarized per catchment (mines / catchment) and accumulated into watersheds to produce local catchment-level and watershed-level metrics as a point data type.
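    The "summarized and accumulated into watersheds" step can be sketched as a traversal of a catchment flow table. The flow table and counts below are invented for illustration; StreamCat's actual processing operates on NHDPlusV2 COMIDs and is more involved:

```python
def accumulate(local_counts, flow):
    """Watershed-level count for a catchment = its local count plus the local
    counts of every catchment that eventually drains into it."""
    # Invert the flow table: downstream ID -> direct upstream IDs.
    upstream = {c: [] for c in local_counts}
    for c, down in flow.items():
        if down is not None:
            upstream[down].append(c)

    def total(c):
        return local_counts[c] + sum(total(u) for u in upstream[c])

    return {c: total(c) for c in local_counts}

# Hypothetical 4-catchment network: 1 and 2 drain into 3, which drains into 4.
local = {1: 2, 2: 0, 3: 1, 4: 5}    # mines per local catchment
flow = {1: 3, 2: 3, 3: 4, 4: None}  # catchment -> downstream (None = outlet)
watershed = accumulate(local, flow)
```

    The watershed metric for the outlet catchment is then the sum over every upstream local count, which is what "accumulated" means in this context.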

  14. Surface disturbance linear features

    • ouvert.canada.ca
    • catalogue.arctic-sdi.org
    • +1more
    esri rest, html
    Updated Nov 19, 2025
    + more versions
    Cite
    Government of Yukon (2025). Surface disturbance linear features [Dataset]. https://ouvert.canada.ca/data/dataset/09eb6891-dc10-4d67-aca2-ad8d4aef19a2
    Explore at:
    Available download formats: html, esri rest
    Dataset updated
    Nov 19, 2025
    Dataset provided by
    Government of Yukon
    License

    Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
    License information was derived automatically

    Description

    This data shows anthropogenic polyline disturbance features. Features were digitized using high resolution satellite imagery and orthophotos. Features from the National Road Network (NRN) and the National Railway Network (NRWN) were adapted and included. The following data was not included in the dataset: proposed features.

    Table 1. A list of attributes, associated domains, and descriptions.

    | Attribute | Data Type | Domains | Description |
    | --- | --- | --- | --- |
    | REF_ID | Text (20) | | Unique feature reference ID |
    | DATABASE | Text (20) | Historic, Most Recent, Retired | Sub-database to which the feature belongs |
    | TYPE_INDUSTRY | Text (50) | Table 2.3.2 | Major classification of disturbance feature by industry |
    | TYPE_DISTURBANCE | Text (50) | Table 2.3.2 | Sub classification of disturbance feature |
    | WIDTH_M* | Double | | Width of feature in meters |
    | WIDTH_CLASS** | Text (5) | HIGH, MED, LOW | Width of feature by classification |
    | SCALE_CAPTURED | Long | | Scale at which the feature was digitized |
    | DATA_SOURCE | Text (10) | Imagery, GPS, Other | Data source: digitized from imagery, captured by GPS, or obtained by other means |
    | IMAGE_NAME | Text (100) | | Filename of source imagery |
    | IMAGE_DATE | Date | | Date that imagery was captured (YYYYMMDD) |
    | IMAGE_RESOLUTION | Double | | Resolution of source imagery in meters |
    | IMAGE_SENSOR | Text (35) | | Name of sensor that captured source imagery |

    *WIDTH_M: Linear features must be attributed with a width measurement. The width of the feature can be estimated in meters, rounded to the nearest whole number.

    **WIDTH_CLASS: This field employs a classification scheme used by previous contractors. This classification scheme was discussed and agreed upon by Mammoth Mapping and the Project Manager in 2011-2013. The width values are the following.

    Table 2. Width classification breakdown.

    | WIDTH_CLASS | Anticipated Value Range (meters) |
    | --- | --- |
    | LOW | <4 |
    | MED | 4-8 |
    | HIGH | >8 |

    Table 3. A list of disturbance feature types and their descriptions.

    | TYPE_INDUSTRY | TYPE_DISTURBANCE | DESCRIPTION |
    | --- | --- | --- |
    | Mining | Survey / Cutline | A linear cleared area through undeveloped land, used for line-of-sight surveying; impossible to distinguish whether associated with quartz or placer mining (overlapping or unclear claims information) |
    | Mining | Survey / Cutline - Placer | A linear cleared area through undeveloped land, used for line-of-sight surveying; associated with placer mining (identified using claims information and/or other indicators) |
    | Mining | Survey / Cutline - Quartz | A linear cleared area through undeveloped land, used for line-of-sight surveying; associated with quartz mining (identified using claims information and/or other indicators) |
    | Mining | Trench | A long, narrow excavation dug to expose vein or ore structure |
    | Mining | Unknown | Unknown linear mining disturbance |
    | Oil and Gas | Pipeline | Visible pipeline or pipeline Right-of-Way (above- or below-ground) |
    | Oil and Gas | Seismic Line | Seismic lines |
    | Rural | Driveway | A driveway in a rural area |
    | Rural | Fence | A fence in a rural area |
    | Transportation | Access Assumed | A linear feature that is assumed to be an access road, but could also be a trail |
    | Transportation | Access Road | A road or narrow passage whose primary function is to provide access for resource extraction (i.e. mining, forestry) and may also have served in providing public access to the backcountry |
    | Transportation | Arterial Road | A major thoroughfare with medium to large traffic capacity |
    | Transportation | Local Road | A low-speed thoroughfare that provides access to the front of properties, including those with potential public restrictions such as trailer parks, First Nations land, private estates, seasonal residences, and gravel pits (NRN definition for Local Street/Local Strata/Local Unknown). Shows signs of regular use |
    | Transportation | Right of Way For Road | Rights as attributed in the land parcels ancillary data |
    | Transportation | Trail | Path or track (typically <1.5 m wide) used for walking, cycling, ORV, or other backcountry activities. (Note: trails used for mining activities are Access Roads.) |
    | Transportation | Unpaved Road | Dirt or gravel road (typically >1.5 m wide) that does not necessarily access remote resources |
    | Unknown | Right of Way | A right of way with unknown industry type |
    | Unknown | Survey / Cutline | A linear cleared area through undeveloped land, used for line-of-sight surveying. A cutline may not always be associated with mineral exploration; therefore, Type: Unknown was used to differentiate all cutlines outside of mineral exploration |
    | Unknown | Unknown | Unclassified, or unable to identify type based on imagery, but suspected to be anthropogenic |
    | Utility | Electric Utility Corridor | Corridor usually running parallel to highway, where transmission lines or other utilities are visible |
    | Utility | Unknown | Unknown linear feature assumed to be a utility corridor; ancillary data is unclear |

    Distributed from GeoYukon by the Government of Yukon. Discover more digital map data and interactive maps from Yukon's digital map data collection. For more information: geomatics.help@yukon.ca

  15. Best Books Ever Dataset

    • zenodo.org
    csv
    Updated Nov 10, 2020
    + more versions
    Cite
    Lorena Casanova Lozano; Sergio Costa Planells; Lorena Casanova Lozano; Sergio Costa Planells (2020). Best Books Ever Dataset [Dataset]. http://doi.org/10.5281/zenodo.4265096
    Explore at:
    Available download formats: csv
    Dataset updated
    Nov 10, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lorena Casanova Lozano; Sergio Costa Planells; Lorena Casanova Lozano; Sergio Costa Planells
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The dataset was collected as part of Prac1 of the subject Typology and Data Life Cycle of the Master's Degree in Data Science at the Universitat Oberta de Catalunya (UOC).

    The dataset contains 25 variables and 52478 records corresponding to books on the GoodReads Best Books Ever list (the largest list on the site).

    Original code used to retrieve the dataset can be found in the GitHub repository: github.com/scostap/goodreads_bbe_dataset

    The data was retrieved in two sets: the first 30000 books and then the remaining 22478. Dates were not parsed and reformatted on the second chunk, so publishDate and firstPublishDate are represented in a mm/dd/yyyy format for the first 30000 records and as Month Day Year for the rest.
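    Because the two chunks use different date layouts, a normalization step like the following sketch may be useful. `parse_date` is an illustrative helper, not part of the dataset's code, and it assumes the second layout carries no ordinal suffixes:

```python
from datetime import datetime

def parse_date(text):
    """Normalize 'mm/dd/yyyy' (first 30000 records) and
    'Month Day Year', e.g. 'November 10 2020' (remaining records)."""
    text = text.strip()
    for fmt in ("%m/%d/%Y", "%B %d %Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None  # unparseable or missing value

d1 = parse_date("11/10/2020")        # first-chunk layout
d2 = parse_date("November 10 2020")  # second-chunk layout
```

    Both calls above resolve to the same calendar date, so the two chunks can be merged on a single normalized date column.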

    Book cover images can be optionally downloaded from the url in the 'coverImg' field. Python code for doing so and an example can be found on the github repo.

    The 25 fields of the dataset are:

    | Attributes | Definition | Completeness |
    | ------------- | ------------- | ------------- | 
    | bookId | Book Identifier as in goodreads.com | 100 |
    | title | Book title | 100 |
    | series | Series Name | 45 |
    | author | Book's Author | 100 |
    | rating | Global goodreads rating | 100 |
    | description | Book's description | 97 |
    | language | Book's language | 93 |
    | isbn | Book's ISBN | 92 |
    | genres | Book's genres | 91 |
    | characters | Main characters | 26 |
    | bookFormat | Type of binding | 97 |
    | edition | Type of edition (ex. Anniversary Edition) | 9 |
    | pages | Number of pages | 96 |
    | publisher | Editorial | 93 |
    | publishDate | publication date | 98 |
    | firstPublishDate | Publication date of first edition | 59 |
    | awards | List of awards | 20 |
    | numRatings | Number of total ratings | 100 |
    | ratingsByStars | Number of ratings by stars | 97 |
    | likedPercent | Derived field, percent of ratings over 2 stars (as in GoodReads) | 99 |
    | setting | Story setting | 22 |
    | coverImg | URL to cover image | 99 |
    | bbeScore | Score in Best Books Ever list | 100 |
    | bbeVotes | Number of votes in Best Books Ever list | 100 |
    | price | Book's price (extracted from Iberlibro) | 73 |

  16. The LakeCat Dataset: Accumulated Attributes for NHDPlusV2 (Version 2.1)...

    • catalog.data.gov
    Updated Feb 5, 2025
    Cite
    U.S. Environmental Protection Agency, Office of Research and Development (ORD), Center for Public Health and Environmental Assessment (CPHEA), Pacific Ecological Systems Division (PESD), (2025). The LakeCat Dataset: Accumulated Attributes for NHDPlusV2 (Version 2.1) Catchments for the Conterminous United States: Mine Density: Active Mines and Mineral Plants in the US [Dataset]. https://catalog.data.gov/dataset/the-lakecat-dataset-accumulated-attributes-for-nhdplusv2-version-2-1-catchments-for-the-co-e3814
    Explore at:
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    U.S. Environmental Protection Agency, Office of Research and Development (ORD), Center for Public Health and Environmental Assessment (CPHEA), Pacific Ecological Systems Division (PESD),
    Area covered
    Contiguous United States, United States
    Description

    This dataset represents mine density within individual local and accumulated upstream catchments for NHDPlusV2 Waterbodies based on mine plants and operations monitored by the USGS National Minerals Information Center. Catchment boundaries in LakeCat are defined in one of two ways, on-network or off-network. The on-network catchment boundaries follow the catchments provided in the NHDPlusV2 and the metrics for these lakes mirror metrics from StreamCat, but will substitute the COMID of the NHDWaterbody for that of the NHDFlowline. The off-network catchment framework uses the NHDPlusV2 flow direction rasters to define non-overlapping lake-catchment boundaries and then links them through an off-network flow table. The National Minerals Information Center canvasses the nonfuel mining and mineral-processing industry in the United States for data on mineral production, consumption, recycling, stocks, and shipments. Mine plants and operations for commodities are expressed as points in a shapefile that was downloaded from the USGS directly. The mine points were summarized per catchment (mines / catchment) and accumulated into watersheds to produce local catchment-level and watershed-level metrics as a point data type.

  17. NSW Petroleum Boreholes 20140815

    • data.gov.au
    • researchdata.edu.au
    • +1more
    Updated Nov 20, 2019
    Cite
    Bioregional Assessment Program (2019). NSW Petroleum Boreholes 20140815 [Dataset]. https://data.gov.au/data/dataset/activity/7629e8db-b01e-4d6e-b926-a4df07e7eebc
    Explore at:
    Dataset updated
    Nov 20, 2019
    Dataset provided by
    Bioregional Assessment Program
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    New South Wales
    Description

    Abstract

    This dataset and its metadata statement were supplied to the Bioregional Assessment Programme by a third party and are presented here as originally supplied.

    The Petroleum Wells database stores information about petroleum wells within NSW. The NSW Petroleum Wells Database has been developed to record summary data on petroleum exploration and development drilling data for holes drilled in New South Wales. The database contains summary information about each drillhole such as location, total depth, azimuth/dip, geochemistry, wireline logs and DIGS references.

    Full metadata available at: http://minview.minerals.nsw.gov.au/mv2web/mv2?cmd=MDDetail&lid=petbore

    This dataset has been provided to the BA Programme for use within the programme only. Third parties should contact the NSW Department of Industry. http://www.industry.nsw.gov.au/

    Dataset History

    Data Quality

    Lineage:

    The primary data source for the petroleum borehole database was company exploration reports supplied to the Department of Mineral Resources as part of the statutory reporting requirements for exploration licences. Holes drilled by Industry and Investment NSW (Minerals) are also included in the database. Initially, data was held in the COGENT Oracle database; the data was then exported into spreadsheets and maintained in this environment for some time. The database is now on a SQL Server platform. Data can be entered or edited using the GBIS program.

    Positional Accuracy:

    Drillhole coordinates have been obtained from a number of sources including exploration companies and consultants. The accuracy of the drillholes varies and is often unknown. In most instances the drillhole coordinates have been measured off company exploration maps or parish/county maps, and these holes have an accuracy of between 25m and 500m depending on the quality of the map displaying the holes. Where a number of holes have been drilled on one prospect, they will be located accurately relative to each other.

    Attribute Accuracy:

    The dataset contains a number of attributes (see attribute definitions below); these values have been derived from a large number of sources (see data lineage). Overall accuracy of the attributes will vary according to many factors, such as the company that reported the drillhole and the type of lease the hole was drilled on. Some companies provide the Department with exploration reports of a very high standard, while others submit the bare minimum of information. Likewise, exploration licences (ELs) have fairly stringent reporting requirements, while older types of licences and mining leases often have reports of poor quality.

    Logical Consistency:

    There should be a high degree of logical consistency in this data set, as data from one drillhole does not depend upon other information.

    Completeness:

    As mentioned under Attribute Accuracy, the completeness of the data depends on the quality of the data supplied to the Department in company reports. It is not always possible to obtain all the information required for each drillhole.

    Attribute Definitions

    Name Description

    OBJECTID An identification code given by the database

    PROJECT A group of drillholes - eg. PET=Petroleum, MIN=Minerals, COAL=Coal

    SITE_ID A text value that uniquely defines a drillhole within the project that it belongs to

    CONFIDENTIAL_YN Indicates if the hole is confidential - ie. not available to the public

    PROGRAM Used to identify an area/prospect where a hole/series of holes was drilled. eg Cow Flat, Angus Place

    HOLE_NAME Hole number in a program or hole name in a program (DD97GR01)

    OLD_NAME A name of a drillhole that has been superseded by a new name

    BUS_PURPOSE Business reason for drilling the hole - eg. COAL, CSM, PETROLEUM

    DRILL_TYPE Type of drill used

    HOLE_STATUS Status of the drillhole - eg. CASE=cased, PLUG=plugged - codes listed in GSL_HOLE_STATUS

    TITLE_TYPE Code of type of licence where hole was drilled - eg. EL, ML, ATOE

    TITLE_NO The number of the title where the hole was drilled

    LICENCEE Company that holds the Exploration or Mining Title

    OPERATOR Company doing the work eg joint venture partner

    LICENCEE_ID Code for the company that holds the Exploration or Mining Title

    OPERATOR_ID Code for company doing the work eg joint venture partner

    TARGET The target the hole was drilled to intersect - eg. GEOCHEM, GOSSAN

    CORELIB Core library where the core is stored

    STARTPOINT Starting point of the drill - eg. GRND=natural ground surface, UNDG=underground

    HOLE_TESTS
    
    REPORTS
    

    COMMENTS Comments about the drillhole/well

    TOP_STRAT Letter Symbol for the top stratigraphic unit intersected in the hole

    TOPSTRAT The top stratigraphic unit intersected in the hole

    BASE_STRAT Letter Symbol for the bottom stratigraphic unit intersected in the hole

    BASESTRAT The bottom stratigraphic unit intersected in the hole

    COMMENCED_DT Date the drilling of the hole commenced

    COMPLETED_DT Date the drilling of the hole was completed

    YEAR_DRILLED The year the hole was drilled if the commenced date is not known

    KELLY_LEVEL Elevation in metres above sea level of the Kelly Bushing or Rotary Table

    START_DEPTH Depth in metres that the hole started at - usually 0 but can be different depth for a wedge

    END_DEPTH Depth in metres that the hole was stopped at

    GEOPHYS_YN Is a geophysical log available for the drillhole?

    TEXTLOG_YN Is a written log available for the drillhole?

    GRAPHLOG_YN Is a graphic log available for drillhole?

    COREPHOTO_YN Are photos of the core available?

    RECOVD_COST Costs recovered by the department for holes drilled by the department

    LAT94
    
    LNG94
    

    TITLEREF Concatenation of TITLE_TYPE and TITLE_NO

    COMPANY Concatenation of LICENCEE and OPERATOR

    Miscellaneous

    Source Pathname:

    SDE GS_SPATIAL.GBV_Drillhole via FME from SQLServer GBIS Boreholes

    Dataset Citation

    NSW Trade and Investment (2014) NSW Petroleum Boreholes 20140815. Bioregional Assessment Source Dataset. Viewed 22 June 2018, http://data.bioregionalassessments.gov.au/dataset/7629e8db-b01e-4d6e-b926-a4df07e7eebc.

  18. Association analysis of high-high cluster road intersection crashes within...

    • zivahub.uct.ac.za
    xlsx
    Updated Jun 7, 2024
    + more versions
    Cite
    Simone Vieira; Simon Hull; Roger Behrens (2024). Association analysis of high-high cluster road intersection crashes within the CoCT in 2017, 2018, 2019 and 2021 [Dataset]. http://doi.org/10.25375/uct.25975285.v2
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 7, 2024
    Dataset provided by
    University of Cape Town
    Authors
    Simone Vieira; Simon Hull; Roger Behrens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    City of Cape Town
    Description

    This dataset provides comprehensive information on road intersection crashes recognised as "high-high" clusters within the City of Cape Town. It includes detailed records of all intersection crashes and their corresponding crash attribute combinations, which were prevalent in at least 5% of the total "high-high" cluster road intersection crashes for the years 2017, 2018, 2019, and 2021. The dataset is organised according to support metric values, ranging from 0,05 to 0,0235, with entries presented in descending order.

    Data Specifics
    Data Type: Geospatial-temporal categorical data
    File Format: Excel document (.xlsx)
    Size: 499 KB
    Number of Files: The dataset contains a total of 7186 association rules
    Date Created: 23rd May 2024

    Methodology
    Data Collection Method: The descriptive road traffic crash data per crash victim involved in the crashes was obtained from the City of Cape Town Network Information
    Software: ArcGIS Pro, Python
    Processing Steps: Following the spatio-temporal analyses and the derivation of "high-high" cluster fishnet grid cells from a cluster and outlier analysis, all the road intersection crashes that occurred within the "high-high" cluster fishnet grid cells were extracted to be processed by association analysis. The association analysis of these crashes was processed using Python software and involved the use of a 0,05 support metric value. Consequently, commonly occurring crash attributes among at least 5% of the "high-high" cluster road intersection crashes were extracted for inclusion in this dataset.

    Geospatial Information
    Spatial Coverage:
    West Bounding Coordinate: 18°20'E
    East Bounding Coordinate: 19°05'E
    North Bounding Coordinate: 33°25'S
    South Bounding Coordinate: 34°25'S
    Coordinate System: South African Reference System (Lo19) using the Universal Transverse Mercator projection

    Temporal Information
    Temporal Coverage:
    Start Date: 01/01/2017
    End Date: 31/12/2021 (2020 data omitted)

  19.

    Association analysis of high-high cluster road intersection pedestrian...

    • zivahub.uct.ac.za
    xlsx
    Updated Jun 7, 2024
    + more versions
    Cite
    Simone Vieira; Simon Hull; Roger Behrens (2024). Association analysis of high-high cluster road intersection pedestrian crashes within the CoCT in 2017, 2018, 2019 and 2021 [Dataset]. http://doi.org/10.25375/uct.25976263.v1
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jun 7, 2024
    Dataset provided by
    University of Cape Town
    Authors
    Simone Vieira; Simon Hull; Roger Behrens
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    City of Cape Town
    Description

    This dataset provides comprehensive information on road intersection pedestrian crashes recognised as "high-high" clusters within the City of Cape Town. It includes detailed records of all intersection crashes and their corresponding crash attribute combinations, which were prevalent in at least 10% of the total "high-high" cluster pedestrian road intersection crashes for the years 2017, 2018, 2019, and 2021. The dataset is organised by support metric value, ranging from 0,10 to 0,13, with entries presented in descending order.

    Data Specifics
    Data Type: Geospatial-temporal categorical data
    File Format: Excel document (.xlsx)
    Size: 15,0 KB
    Number of Files: The dataset contains a total of 162 association rules
    Date Created: 24th May 2024

    Methodology
    Data Collection Method: The descriptive road traffic crash data per crash victim involved in the crashes was obtained from the City of Cape Town Network Information
    Software: ArcGIS Pro, Python
    Processing Steps: Following the spatio-temporal analyses and the derivation of "high-high" cluster fishnet grid cells from a cluster and outlier analysis, all road intersection pedestrian crashes that occurred within the "high-high" cluster fishnet grid cells were extracted for association analysis. The association analysis was run in Python with a 0,10 support metric value; crash attributes common to at least 10% of the "high-high" cluster road intersection pedestrian crashes were extracted for inclusion in this dataset.

    Geospatial Information
    Spatial Coverage:
    West Bounding Coordinate: 18°20'E
    East Bounding Coordinate: 19°05'E
    North Bounding Coordinate: 33°25'S
    South Bounding Coordinate: 34°25'S
    Coordinate System: South African Reference System (Lo19) using the Universal Transverse Mercator projection

    Temporal Information
    Temporal Coverage:
    Start Date: 01/01/2017
    End Date: 31/12/2021 (2020 data omitted)

  20.

    Data from: Biosolids in the establishment of herbaceous species and in...

    • resodate.org
    • figshare.com
    Updated Jan 1, 2021
    Cite
    Natália Caron Kitamura; Cledimar Rogério Lourenzi; Alcenir Claudio Bueno; Antonio Lourenço Pinto; Cláudio Roberto Fonseca Sousa Soares; Admir José Giachini (2021). Biosolids in the establishment of herbaceous species and in chemical and microbiological attributes in soil impacted by coal mining [Dataset]. http://doi.org/10.6084/M9.FIGSHARE.14283698
    Explore at:
    Dataset updated
    Jan 1, 2021
    Dataset provided by
    SciELO journals
    Authors
    Natália Caron Kitamura; Cledimar Rogério Lourenzi; Alcenir Claudio Bueno; Antonio Lourenço Pinto; Cláudio Roberto Fonseca Sousa Soares; Admir José Giachini
    Description

    ABSTRACT This study aims to evaluate the effects of different concentrations of sewage sludge biosolids, subjected to thermal treatment, on the establishment of herbaceous species (black oats, vetches, and ryegrass) and on the chemical and microbiological attributes of a soil degraded by coal mining. The experiment was installed in an area degraded by coal mining, in Treviso/SC, with treatments composed of concentrations of 0; 6.25; 100; 250; and 500 Mg ha-1 of biosolids, in 2×2 m plots. Black oat, vetch, and ryegrass were grown in consortium (intercropped), and plant parameters and chemical attributes of the soil were evaluated at depths of 0-5, 5-10, and 10-20 cm. The biosolids provided improvements in soil fertility, such as pH elevation and increased available levels of P, K, and total organic carbon, while not influencing mycorrhizal colonization, basal soil respiration, or root nodulation. The use of biosolid waste as a substrate in degraded areas is an alternative for its final disposal, owing to the savings from its use as a fertilizer and to the environmental benefits associated with its use.

Cite
Konstantinos Malliaridis (2024). DatasetofDatasets (DoD) [Dataset]. https://www.kaggle.com/terminalgr/datasetofdatasets-124-1242024

DatasetofDatasets (DoD)

This dataset is essentially the metadata from other datasets.

Explore at:
zip(7583 bytes)Available download formats
Dataset updated
Aug 12, 2024
Authors
Konstantinos Malliaridis
License

MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically

Description

This dataset is essentially the metadata of 164 other datasets. Each of its 164 rows holds 22 features extracted from one dataset; these features are used to classify that dataset into one of the categories 0-Unmanaged, 2-INV, 3-SI, 4-NOA. The target column, datasetType, takes four values indicating the dataset type:

2 - Invoice detail (INV): This dataset type is a special report (usually called a Detailed Sales Statement) produced by company accounting or Enterprise Resource Planning (ERP) software. Using an INV-type dataset directly for ARM is extremely convenient for users, as it relieves them of the tedious work of transforming the data into another, more suitable form. INV-type input typically includes a header, but only two of its attributes are essential for data mining. The first attribute serves as the grouping identifier that defines a unique transaction (e.g., Invoice ID, Order Number), while the second contains the items used for data mining (e.g., Product Code, Product Name, Product ID).
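As a minimal sketch of this grouping step (the identifiers and item codes below are invented for illustration, not taken from DoD), INV-type rows can be folded into transactions like so:

```python
from collections import defaultdict

# Hypothetical INV-type rows: (InvoiceID, ProductCode) pairs.
# Column names and values are illustrative assumptions.
rows = [
    (101, "A"), (101, "B"),
    (102, "A"), (102, "C"), (102, "B"),
]

# Group item codes by the transaction identifier;
# each invoice becomes one basket for ARM.
baskets = defaultdict(list)
for invoice_id, product_code in rows:
    baskets[invoice_id].append(product_code)

transactions = list(baskets.values())
print(transactions)  # [['A', 'B'], ['A', 'C', 'B']]
```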

3 - Sparse Item (SI): This type is widespread in Association Rules Mining (ARM). It involves a header and a fixed number of columns, with each item corresponding to a column and each row representing a transaction. A typical cell stores a value, usually one character long, that indicates the presence or absence of the item in the corresponding transaction. The absence character must be identified or declared before the Association Rules Mining process takes place.
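A hedged sketch of turning an SI-type matrix into transaction lists, assuming "?" has been declared as the absence character (the header and cell values are invented):

```python
# Hypothetical SI-type data: a header of item names, then one row per
# transaction; "1" marks presence, "?" is the declared absence character.
header = ["bread", "milk", "eggs"]
rows = [
    ["1", "?", "1"],
    ["?", "1", "1"],
]
ABSENCE = "?"  # must be known before mining starts

# Keep the item name for every cell that is not the absence character.
transactions = [
    [item for item, cell in zip(header, row) if cell != ABSENCE]
    for row in rows
]
print(transactions)  # [['bread', 'eggs'], ['milk', 'eggs']]
```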

4 - Nominal Attributes (NOA): This type is commonly used in Machine Learning and Data Mining tasks. It involves a fixed number of columns. Each column registers nominal/categorical values. The presence of a header row is optional. However, in cases where no header is provided, there is a risk of extracting incorrect rules if similar values exist in different attributes of the dataset. The potential values for each attribute can vary.
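A brief sketch of the attribute=value encoding that avoids the ambiguity of similar values in different columns (column names and values are invented):

```python
# Hypothetical NOA-type data: fixed nominal columns. Prefixing each value
# with its attribute name prevents wrong rules when the same value
# ("yes") appears in different columns.
header = ["smoker", "exercises"]
rows = [
    ["yes", "no"],
    ["no", "yes"],
]

# Encode each row as a transaction of unambiguous attribute=value items.
transactions = [
    [f"{attr}={value}" for attr, value in zip(header, row)]
    for row in rows
]
print(transactions)  # [['smoker=yes', 'exercises=no'], ['smoker=no', 'exercises=yes']]
```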

0 - Unmanaged for ARM: On the other hand, not all datasets are suitable for extracting useful association rules or frequent item sets — for instance, datasets characterized predominantly by numerical features with arbitrary values, or datasets with fragmented or mixed data types. For such datasets, ARM processing becomes possible only by introducing a data discretization stage, which in turn introduces information loss. Such datasets are not considered in the present treatise and are termed (0) Unmanaged in the sequel.
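To illustrate the discretization stage mentioned above, here is a minimal equal-width binning sketch (the values and bin count are arbitrary); note the information loss: distinct numeric values collapse into the same nominal bin:

```python
# Equal-width discretization: map arbitrary numeric values to nominal bins
# so a numeric column becomes usable for ARM (with information loss).
values = [1.2, 3.7, 9.9, 5.0, 7.3]
BINS = 3
lo, hi = min(values), max(values)
width = (hi - lo) / BINS

def bucket(v):
    # Clamp the top edge so the maximum falls in the last bin.
    return min(int((v - lo) / width), BINS - 1)

labels = [f"bin{bucket(v)}" for v in values]
print(labels)  # ['bin0', 'bin0', 'bin2', 'bin1', 'bin2']
```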

Determining the dataset type is a crucial step for ARM, and the current dataset is used to classify a dataset's type using a Supervised Machine Learning model.
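The classification idea can be sketched very roughly as follows; this is not the authors' actual model, and the feature vectors and labels below are invented purely for illustration:

```python
import math

# Toy sketch of the supervised idea: classify a dataset's type from a few
# extracted metadata features using 1-nearest-neighbour. The two features
# and the training points are hypothetical, not DoD's real 22 features.
train = [
    ((2.0, 0.9), 2),   # e.g. (avg row length, share of repeated IDs) -> INV
    ((50.0, 0.0), 3),  # wide, fixed-width presence matrix -> SI
    ((8.0, 0.1), 4),   # nominal attributes -> NOA
]

def classify(features):
    # Return the label of the closest training vector (Euclidean distance).
    return min(train, key=lambda t: math.dist(t[0], features))[1]

print(classify((3.0, 0.8)))  # -> 2 (INV-like)
```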

There is also another dataset type, 1 - Market Basket List (MBL), where each dataset row is a transaction involving a variable number of items. Due to this characteristic, such datasets can easily be categorized using procedural programming, and DoD does not include instances of them. For more details about dataset types, please refer to the article "WebApriori: a web application for association rules mining", https://link.springer.com/chapter/10.1007/978-3-030-49663-0_44
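The procedural check alluded to for MBL data can be as simple as testing whether row lengths vary (the file content below is an invented example):

```python
import csv
import io

# An MBL file can be recognised procedurally because its rows have a
# variable number of items; fixed-width types (SI, NOA) have one length.
mbl_text = "bread,milk\neggs\nbread,eggs,beer\n"
rows = list(csv.reader(io.StringIO(mbl_text)))

row_lengths = {len(r) for r in rows}
is_mbl_like = len(row_lengths) > 1
print(is_mbl_like)  # True
```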
