92 datasets found
  1. Collection of example datasets used for the book - R Programming - Statistical Data Analysis in Research

    • figshare.com
    txt
    Updated Dec 4, 2023
    Cite
    Kingsley Okoye; Samira Hosseini (2023). Collection of example datasets used for the book - R Programming - Statistical Data Analysis in Research [Dataset]. http://doi.org/10.6084/m9.figshare.24728073.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Dec 4, 2023
    Dataset provided by
    figshare
    Authors
    Kingsley Okoye; Samira Hosseini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers on how to perform different types of statistical data analysis for research purposes using the R programming language. R is open-source software and an object-oriented programming language with a development environment (IDE) called RStudio for computing statistics and graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provide a wide range of functions for programming and analyzing data. Unlike many existing statistical software packages, R has the added benefit of allowing users to write more efficient code by using command-line scripting and vectors. It has several built-in functions and libraries that are extensible and allow users to define their own (customized) functions for how they expect the program to behave while handling the data, which can also be stored in the simple object system.

    For all intents and purposes, this book serves as both a textbook and a manual for R statistics, particularly in academic research, data analytics, and computer programming, intended to help inform and guide the work of R users and statisticians. It provides information about the different types of statistical data analysis and methods, and the best scenarios for using each of them in R. It gives a hands-on, step-by-step practical guide to identifying and conducting the different parametric and non-parametric procedures. This includes a description of the conditions or assumptions that are necessary for performing the various statistical methods or tests, and how to understand the results of those methods. The book also covers the different data formats and sources, and how to test for the reliability and validity of the available datasets. Different research experiments, case scenarios, and examples are explained in this book. It is the first book to provide a comprehensive description and step-by-step practical hands-on guide to carrying out the different types of statistical analysis in R, particularly for research purposes, with examples: from how to import and store datasets in R as objects, how to code and call the methods or functions for manipulating the datasets or objects, factorization, and vectorization, to better reasoning, interpretation, and storage of the results for future use, and graphical visualizations and representations. Thus, it brings statistics and computer programming together for research.

  2. Large Datasets in R - Plant Phenology & Temperature Data from NEON

    • qubeshub.org
    Updated May 10, 2018
    Cite
    Megan Jones Patterson; Lee Stanish; Natalie Robinson; Katherine Jones; Cody Flagg (2018). Large Datasets in R - Plant Phenology & Temperature Data from NEON [Dataset]. http://doi.org/10.25334/Q4DQ3F
    Explore at:
    Dataset updated
    May 10, 2018
    Dataset provided by
    QUBES
    Authors
    Megan Jones Patterson; Lee Stanish; Natalie Robinson; Katherine Jones; Cody Flagg
    Description

    This module series covers how to import, manipulate, format, and plot time series data stored in .csv format in R. Originally designed to teach researchers to use NEON plant phenology and air temperature data, it has also been used in undergraduate classrooms.
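
    For orientation, a minimal sketch of the kind of workflow the module teaches (not the module's own code; the file name and column names below are hypothetical):

    # Read a CSV of phenology/temperature observations and plot it as a time series
    library(ggplot2)
    pheno <- read.csv("NEON-temp.csv")                        # hypothetical file name
    pheno$date <- as.Date(pheno$date, format = "%Y-%m-%d")    # convert text to Date
    ggplot(pheno, aes(x = date, y = airTemperature)) +
      geom_line() +
      labs(x = "Date", y = "Air temperature (C)")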

  3. Basic R for Data Analysis

    • kaggle.com
    Updated Dec 8, 2024
    Cite
    Kebba Ndure (2024). Basic R for Data Analysis [Dataset]. https://www.kaggle.com/datasets/kebbandure/basic-r-for-data-analysis/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 8, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kebba Ndure
    Description

    ABOUT DATASET

    This is an R Markdown notebook. It contains a step-by-step guide for working on data analysis with R. It helps you install the relevant packages and shows how to load them. It also provides a detailed summary of the "dplyr" commands that you can use to manipulate your data in the R environment.

    Anyone who is new to R and wishes to carry out some data analysis in R can check it out!
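
    As a hedged illustration of the kind of "dplyr" pipeline the notebook summarizes (using R's built-in mtcars data rather than the notebook's own data):

    library(dplyr)
    mtcars %>%
      filter(cyl > 4) %>%                  # keep rows matching a condition
      group_by(gear) %>%                   # group before summarising
      summarise(mean_mpg = mean(mpg))      # one summary row per group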

  4. Randomized Hourly Load Data for use with Taxonomy Distribution Feeders

    • datadiscoverystudio.org
    • data.wu.ac.at
    Updated Aug 29, 2017
    Cite
    (2017). Randomized Hourly Load Data for use with Taxonomy Distribution Feeders. [Dataset]. http://datadiscoverystudio.org/geoportal/rest/metadata/item/bc873dbf6a1f44c190153d3345fbbafd/html
    Explore at:
    Dataset updated
    Aug 29, 2017
    Description

    This dataset was developed by NREL's distributed energy systems integration group as part of a study on high penetrations of distributed solar PV [1]. It consists of hourly load data in CSV format for use with the PNNL taxonomy of distribution feeders [2]. These feeders were developed in the open-source GridLAB-D modelling language [3]. In this dataset each of the load points in the taxonomy feeders is populated with hourly averaged load data from a utility in the feeder's geographical region, scaled and randomized to emulate real load profiles. For more information on the scaling and randomization process, see [1]. The taxonomy feeders are statistically representative of the various types of distribution feeders found in five geographical regions of the U.S. Efforts are underway (possibly complete) to translate these feeders into the OpenDSS modelling language.

    This dataset consists of one large CSV file for each feeder. Within each CSV, each column represents one load bus on the feeder. The header row lists the name of the load bus, and the subsequent 8760 rows represent the loads for each hour of the year. The loads were scaled and randomized using a Python script, so each load series represents only one of many possible randomizations. In the header row, "rl" = residential load and "cl" = commercial load. Commercial loads are followed by a phase letter (A, B, or C). For regions 1-3, the data is from 2009; for regions 4-5, the data is from 2000.

    For use in GridLAB-D, each column will need to be separated into its own CSV file without a header. The load value goes in the second column, and the corresponding datetime values go in the first column, as shown in the sample file, sample_individual_load_file.csv. Only the first value in the time column needs to be written as an absolute time; subsequent times may be written in relative format (i.e. "+1h", as in the sample). The load should be written in P+Qj format, as seen in the sample CSV, in units of Watts (W) and Volt-amps reactive (VAr). This dataset was derived from metered load data and hence includes only real power; reactive power can be generated by assuming an appropriate power factor. These loads were used with GridLAB-D version 2.2.

    Browse files in this dataset, accessible as individual files and as a single ZIP file. This dataset is approximately 242MB compressed or 475MB uncompressed. For questions about this dataset, contact andy.hoke@nrel.gov. If you find this dataset useful, please mention NREL and cite [1] in your work.

    References: [1] A. Hoke, R. Butler, J. Hambrick, and B. Kroposki, "Steady-State Analysis of Maximum Photovoltaic Penetration Levels on Typical Distribution Feeders," IEEE Transactions on Sustainable Energy, April 2013, available at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6357275. [2] K. Schneider, D. P. Chassin, R. Pratt, D. Engel, and S. Thompson, "Modern Grid Initiative Distribution Taxonomy Final Report," PNNL, Nov. 2008. Accessed April 27, 2012: http://www.gridlabd.org/models/feeders/taxonomy of prototypical feeders.pdf [3] K. Schneider, D. Chassin, Y. Pratt, and J. C. Fuller, "Distribution power flow for smart grid technologies," IEEE/PES Power Systems Conference and Exposition, Seattle, WA, Mar. 15-18, 2009, pp. 1-7.
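
    A rough R sketch of the per-column split described above (an assumption-laden illustration, not NREL's processing code; the input file name is hypothetical and every load is written with zero reactive power):

    feeder <- read.csv("feeder_loads.csv", check.names = FALSE)   # hypothetical file name
    times  <- seq(as.POSIXct("2009-01-01 00:00"), by = "1 hour", length.out = nrow(feeder))
    for (bus in names(feeder)) {
      out <- data.frame(time = format(times, "%Y-%m-%d %H:%M:%S"),
                        load = sprintf("%.1f+0j", feeder[[bus]]))   # P+Qj, reactive part omitted
      write.table(out, paste0(bus, ".csv"), sep = ",",
                  row.names = FALSE, col.names = FALSE, quote = FALSE)
    }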

  5. Working with Datasets in R swirl

    • qubeshub.org
    Updated Jul 4, 2019
    Cite
    Caitlin Pries (2019). Working with Datasets in R swirl [Dataset]. http://doi.org/10.25334/Q4KF2V
    Explore at:
    Dataset updated
    Jul 4, 2019
    Dataset provided by
    QUBES
    Authors
    Caitlin Pries
    Description

    The goal of this lesson is to learn how to import datasets into R, understand variable types, make adjustments to variables, perform basic calculations, and begin data visualization. The exercise uses an over 100 year time series of climate data.
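
    A hedged sketch of the kinds of steps covered (not the swirl lesson's own code; the file and column names are hypothetical):

    clim <- read.csv("climate.csv")                       # import the dataset
    str(clim)                                             # inspect variable types
    clim$year <- as.integer(clim$year)                    # adjust a variable's type
    mean(clim$temperature, na.rm = TRUE)                  # a basic calculation
    plot(temperature ~ year, data = clim, type = "l")     # first visualization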

  6. Storage and Transit Time Data and Code

    • zenodo.org
    zip
    Updated Oct 29, 2024
    + more versions
    Cite
    Andrew Felton; Andrew Felton (2024). Storage and Transit Time Data and Code [Dataset]. http://doi.org/10.5281/zenodo.14009758
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 29, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrew Felton; Andrew Felton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Author: Andrew J. Felton
    Date: 10/29/2024

    This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:

    "Global estimates of the storage and transit time of water through vegetation"

    Please note that 'turnover' and 'transit' are used interchangeably. Also please note that this R project has been updated multiple times as the analysis has been updated.

    Data information:

    The data folder contains key data sets used for analysis. In particular:

    "data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.

    Code information:

    Python scripts can be found in the "supporting_code" folder.

    Each R script in this project has a role:

    "01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).

    "02_functions.R": This script contains custom functions. Load this using the
    `source()` function in the 01_start.R script.

    "03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
    `source()` function in the 01_start.R script.

    "04_figures_tables.R": This is the main workhouse for figure/table production and
    supporting analyses. This script generates the key figures and summary statistics
    used in the study that then get saved in the manuscript_figures folder. Note that all
    maps were produced using Python code found in the "supporting_code"" folder.

    "supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.

    "supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.

  7. titanic5 Dataset

    • paperswithcode.com
    Cite
    titanic5 Dataset [Dataset]. https://paperswithcode.com/dataset/titanic5-dataset
    Explore at:
    Description

    titanic5 Dataset Created by David Beltran del Rio March 2016.

    Notes This is the final (for now) version of my update to the Titanic data. I think it’s finally ready for publishing if you’d like. What I did was to strip all the passenger and crew data from the Encyclopedia Titanica (ET) web pages (excluding channel crossing passengers), create a unique ID for each passenger and crew member (Name_ID), then (painstakingly and hopefully 100% correctly) match to your earlier titanic3 dataset, in order to compare the two and to get your sibsp and parch variables. Since the ET is updated occasionally the work put into the ID and matching can be reused and refined later. I did eventually hear back from the ET people, they are willing to make the underlying database available in the future, I have not yet taken them up on it.

    The two datasets line up nicely, most of the differences in the newer titanic5 dataset are in the age variable, as I had mentioned before - the new set has less missing ages - 51 missing (vs 263) out of 1309.

    I am in the process of refining my analysis of the data as well, based on your comments below and your Regression Modeling Strategies example.

    titanic3_wID data can be matched to titanic5 using the Name_ID variable. Tab titanic5 Metadata has the variable descriptions and allowable values for Class and Class/Dept.
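
    In R, that match might look like the following (a hedged sketch; the object names assume each file has already been read into a data frame):

    merged <- merge(titanic3_wID, titanic5, by = "Name_ID", all = TRUE)   # full outer join on Name_ID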

    A note about the ages - instead of using the add 0.5 trick to indicate estimated birth day / date I have a flag that indicates how the “final” age (Age_F) was arrived at. It’s the Age_F_Code variable - the allowable values are in the Titanic5_metadata tab in the attached excel. The reason for this is that I already had some fractional ages for infants where I had age in months instead of years and I wanted to avoid confusion for 6 month old infants, although I don’t think there are any in the data! Also, I was thinking to make fractional ages or age in days for all passengers for whom I have DoB, but I have not yet done so.

    Here’s what the tabs are:

    • Titanic5_all - all (mostly cleaned) Titanic passenger and crew records
    • Titanic5_work - working dataset, crew removed, unnecessary variables removed - this is the one I import into SAS / R to work on
    • Titanic5_metadata - variable descriptions and allowable values
    • titanic3_wID - original titanic3 dataset with Name_ID added for merging to Titanic5

    I have a csv, R dataset, and SAS dataset, but the variable names are an older version, so I won't send those along for now to avoid confusion.

    If it helps send my contact info along to your student in case any questions arise. Gmail address probably best, on weekends for sure: davebdr@gmail.com

    The tabs in titanic5.xls are

    • Titanic5_all
    • Titanic5_passenger (the one to be used for analysis)
    • Titanic5_metadata (used during analysis file creation)
    • Titanic3_wID

  8. Dataset

    • datasets.ai
    • catalog.data.gov
    • +1more
    0, 33
    Updated Sep 8, 2024
    Cite
    U.S. Environmental Protection Agency (2024). Dataset [Dataset]. https://datasets.ai/datasets/dataset
    Explore at:
    Available download formats: 0, 33
    Dataset updated
    Sep 8, 2024
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Authors
    U.S. Environmental Protection Agency
    Description

    Publicly available weblinks used to develop the research project.

    This dataset is associated with the following publication: Hall, E., R. Hall, J. Aron, S. Swanson, M. Philbin, R. Schaefer, T. Jones-Lepp, D. Heggem, J. Lin, E. Wilson, and H. Kahan. An Ecological Function Approach to Managing Harmful Cyanobacteria in Three Oregon Lakes: Beyond Water Quality Advisories and Total Maximum Daily Loads (TMDLs). WATER. MDPI AG, Basel, SWITZERLAND, 11(6): 1125, (2019).

  9. all-recipes-xs

    • huggingface.co
    Updated Apr 9, 2024
    + more versions
    Cite
    AWeirdDev (2024). all-recipes-xs [Dataset]. https://huggingface.co/datasets/AWeirdDev/all-recipes-xs
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 9, 2024
    Authors
    AWeirdDev
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    all-recipes-xs (500)

    All Recipes dataset (extra small).

    # Load the dataset
    from datasets import load_dataset
    dataset = load_dataset("AWeirdDev/all-recipes-xs")

    # Alternatively, load with pickle from _frozen.pkl
    import pickle
    import requests

    r = requests.get("https://huggingface.co/datasets/AWeirdDev/all-recipes-xs/resolve/main/_frozen.pkl")
    dataset = pickle.loads(r.content)

    Features

    Note: Empty values are presented as "unknown" instead of None (normally, unless… See the full description on the dataset page: https://huggingface.co/datasets/AWeirdDev/all-recipes-xs.

  10. How does cognitive load affect social interactions? Dataset and Analysis

    • figshare.com
    txt
    Updated Jan 18, 2016
    Cite
    Kathryn Mills (2016). How does cognitive load affect social interactions? Dataset and Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.757787.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 18, 2016
    Dataset provided by
    figshare
    Authors
    Kathryn Mills
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Project abstract: Many situations involve processing social and non-social information simultaneously. However, it is not known how performance is affected in such situations. Here, we examined how our ability to process social information is affected by the need to keep track of non-social information. Participants were instructed to carry out two tasks within each trial. The social task involved referential communication, requiring participants to use social cues to guide their decisions. At the same time, cognitive load was manipulated by requiring participants to remember non-social information in the form of either one or three two-digit numbers visually presented before each social task stimulus. Results indicate that the cognitive demands of simultaneously processing social and non-social information impair social information processing. Specifically, keeping in mind three numbers slowed participants' ability to use another person's perspective to guide decisions. These results suggest that social information processing requires domain-general resources that are depleted under cognitive load.

    Data: These files include our dataset, as well as the scripts used to analyze the data and create graphs of the results. You will need to download R (http://www.r-project.org/) to use these files. Data are from 29 adult participants. Participants completed an adapted version of the "Director Task" (Dumontheil, Hillebrandt, Apperly, & Blakemore, 2012) with an embedded working memory (WM) task component. Afterwards, participants completed a verbal reverse digit-span task as a measure of WM capacity and the Interpersonal Reactivity Index questionnaire to assess individual differences in trait perspective taking (Davis, 1980).

    Data analysis: We used the lme4 package in R (Bates, Maechler, & Bolker, 2013) to perform a linear mixed effects analysis of the relationship between our factors of interest and accuracy and RT for both tasks. RT data from correct trials only were analyzed. To create approximately normally distributed residuals, we used a log or reciprocal function to transform RT data. We performed a two-step procedure: first, we created a global model including main and interactive effects of cognitive load (low vs. high), condition (Director Present vs. Director Absent), trial type (1-object vs. 3-object), and perspective (same vs. different) as fixed effects, and each model included a random intercept for each participant. We then compared all possible combinations[1] of the variables within our global model using an automated model selection procedure (MuMIn 1.9.0; Barton, 2013). Models were ranked using the second-order Akaike Information Criterion (AICc; Burnham & Anderson, 2002). Second, after determining the best-fitting model for each outcome of interest, we tested whether WM capacity or trait perspective taking explained any additional variance through likelihood ratio tests. All p-values were obtained by likelihood ratio tests comparing the best-fitting model against a baseline model. [1] Interactions were always accompanied by their respective main effects and all lower-order terms.

    Update (August 8, 2013): There was a minor error in the original SocialDualTaskData.R file, which has now been corrected.

  11. R programming code for analyzing output from the Stochastic Empirical Loading Dilution Model

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). R programming code for analyzing output from the Stochastic Empirical Loading Dilution Model created for U.S. Geological Survey Scientific Investigations Report 2019-5053, 116 p., https://doi.org/10.3133/sir20195053 [Dataset]. https://catalog.data.gov/dataset/r-programming-code-for-analyzing-output-from-the-stochastic-empirical-loading-dilution-mod
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    This R script can be used to analyze SELDM results. The script is specifically tailored for the SELDM simulations used in the publication: Stonewall, A.J., and Granato, G.E., 2018, Assessing potential effects of highway and urban runoff on receiving streams in total maximum daily load watersheds in Oregon using the Stochastic Empirical Loading and Dilution Model: U.S. Geological Survey Scientific Investigations Report 2019-5053, 116 p., https://doi.org/10.3133/sir20195053

  12. Market Basket Analysis

    • kaggle.com
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 9, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for itemsets that they are most likely to purchase. I was given a retailer's dataset; the transaction data covers all transactions that occurred over a period of time. The retailer will use the results to grow its business and to offer customers suggestions on itemsets, enabling it to increase customer engagement, improve customer experience, and identify customer behaviour. I solve this problem using association rules, an unsupervised learning technique that checks for the dependency of one data item on another data item.

    Introduction

    Association rules are most often used when you want to discover associations between different objects in a set, i.e. to find frequent patterns in a transaction database. They can tell you which items customers frequently buy together and allow the retailer to identify relationships between items.

    An Example of Association Rules

    Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule {mouse mat} => {computer mouse}: support = P(mouse & mat) = 8/100 = 0.08; confidence = support / P(mouse mat) = 0.08/0.09 = 0.89; lift = confidence / P(computer mouse) = 0.89/0.10 = 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
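
    The same arithmetic in R, for the rule {mouse mat} => {computer mouse}:

    n          <- 100
    p_mouse    <- 10 / n               # P(computer mouse)
    p_mat      <- 9 / n                # P(mouse mat)
    support    <- 8 / n                # P(both) = 0.08
    confidence <- support / p_mat      # 0.08 / 0.09 = 0.89
    lift       <- confidence / p_mouse # 0.89 / 0.10 = 8.9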

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: . xlsx
    • Number of Row: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.

    (image: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png)

    Libraries in R

    First, we need to load required libraries. Shortly I describe all libraries.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - An opinionated collection of R packages designed for data science; installing and loading it makes the core 'tidyverse' packages available in a single step.
    • readxl - Read Excel files in R.
    • plyr - Tools for splitting, applying and combining data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics and which graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic report generation in R.
    • magrittr - Provides a mechanism for chaining commands with a forward-pipe operator, %>%. This operator forwards a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data-frame-like objects, both in memory and out of memory.

    (image: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png)

    Data Pre-processing

    Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.

    (images: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png, https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png)

    Next, we clean the data frame and remove missing values.

    (image: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png)

    To apply association rule mining, we need to convert the data frame into transaction data so that all items bought together in one invoice will be in ...
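
    A hedged sketch of that conversion and of running the algorithm (not the notebook's own code; the support and confidence thresholds are arbitrary placeholders):

    library(readxl)
    library(arules)

    retail <- read_excel("Assignment-1_Data.xlsx")                  # load the dataset
    retail <- retail[complete.cases(retail), ]                      # remove missing values
    # Group item names by invoice (BillNo) and coerce to a 'transactions' object
    trans <- as(split(retail$Itemname, retail$BillNo), "transactions")
    rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.8))
    inspect(head(sort(rules, by = "lift"), 5))                      # strongest rules first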

  13. Turbidity, Discharge, Suspended Sediment Concentration, Suspended Sediment Load in Duwamish River, Tukwila, Washington

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Turbidity, Discharge, Suspended Sediment Concentration, Suspended Sediment Load in Duwamish River, Tukwila, Washington [Dataset]. https://catalog.data.gov/dataset/turbidity-discharge-suspended-sediment-concentration-suspended-sediment-load-in-duwamish-r
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Duwamish River, Tukwila, Washington
    Description

    The Green/Duwamish River transports watershed-derived sediment to the Lower Duwamish Waterway Superfund site near Seattle, Washington. Understanding the amount of sediment transported by the river is essential information to the bed sediment cleanup process. Turbidity, discharge, suspended-sediment concentration (SSC) and particle-size data were collected by the U.S. Geological Survey (USGS) between February 2013 and January 2017 at the Duwamish River, Washington, within the tidal influence at river kilometer 16.7 (USGS site 12113390; Duwamish River at Golf Course at Tukwila, WA). This report quantifies the timing and magnitude of suspended-sediment transported in the Duwamish River. Regression models were developed between SSC and turbidity as well as SSC and discharge to estimate 15-minute SSC. Suspended-sediment loads were calculated from the computed SSC and time-series discharge data for every 15-minute interval during the study period. This dataset is the primary appendix to Open-File Report, Suspended Sediment Transport by the Green-Duwamish River to the Lower Duwamish Waterway, Seattle, Washington, 2013-2017 (include link).
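
    As a hedged illustration of the general approach (not the report's actual regression models), a log-log fit of SSC on turbidity and a load calculation might look like the sketch below; the small data frame is a synthetic stand-in for the real 15-minute record:

    obs <- data.frame(turbidity = c(5, 20, 80, 150),         # FNU
                      q         = c(300, 800, 2500, 4000),   # discharge, ft3/s
                      ssc       = c(8, 35, 160, 320))        # mg/L, from physical samples
    fit <- lm(log10(ssc) ~ log10(turbidity), data = obs)     # SSC-turbidity regression
    obs$ssc_est <- 10^predict(fit, newdata = obs)            # estimated SSC per interval
    obs$load <- obs$ssc_est * obs$q * 0.0027                 # approx. tons/day from mg/L x ft3/s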

  14. Petre_Slide_CategoricalScatterplotFigShare.pptx

    • figshare.com
    pptx
    Updated Sep 19, 2016
    Cite
    Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
    Explore at:
    Available download formats: pptx
    Dataset updated
    Sep 19, 2016
    Dataset provided by
    figshare
    Authors
    Benj Petre; Aurore Coince; Sophien Kamoun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Categorical scatterplots with R for biologists: a step-by-step guide

    Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

    1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

    Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

    Protocol

    • Step 1: format the data set as a .csv file. Store the data in a three-column excel file as shown in Powerpoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.

    • Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in Powerpoint slide and paste it in the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

    • Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See Powerpoint slide for an example.

    Notes

    • Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

    • Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

    # 7 Display the graph in a separate window. Dot colors indicate replicates

    graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
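
    For orientation, a minimal sketch of what the full script from Step 2 might look like (not the authors' exact script; it assumes the three-column Replicate / Condition / Value layout from Step 1):

    library(ggplot2)
    data  <- read.csv(file.choose())                         # select the .csv from Step 1
    graph <- ggplot(data, aes(x = Condition, y = Value))     # base plot object
    graph + geom_boxplot(outlier.colour = 'black', colour = 'black') +
      geom_jitter(aes(col = Replicate)) + theme_bw()         # categorical scatterplot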

    References

    Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

    Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

    Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

    https://cran.r-project.org/

    http://ggplot2.org/

  15. Load profile data of 50 industrial plants in Germany for one year

    • zenodo.org
    • explore.openaire.eu
    • +1more
    csv
    Updated Jun 18, 2020
    Cite
    Fritz Braeuer; Fritz Braeuer (2020). Load profile data of 50 industrial plants in Germany for one year [Dataset]. http://doi.org/10.5281/zenodo.3899018
    Explore at:
    Available download formats: csv
    Dataset updated
    Jun 18, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Fritz Braeuer; Fritz Braeuer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Germany
    Description

    This dataset holds the electric load profiles of 50 small and mid-size enterprises in Germany. The load profiles are in 15-minute time resolution for one year. The load is shown in kW as an average over 15 minutes.

    The dataset is divided into two parts:

    • LoadProfile_20IPs_2016 shows load profiles of 20 industrial plants (IP) for the year 2016.
    • LoadProfile_30IPs_2017 shows load profiles of 30 industrial plants (IP) for the year 2017.

    The IPs from the dataset for 2016 do not reappear in the dataset for 2017.
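
    A hedged sketch of reading one of the files and converting the 15-minute average power values to annual energy (the exact file layout and column names are assumptions, not taken from the dataset):

    lp <- read.csv("LoadProfile_20IPs_2016.csv")            # assumed: first column is the timestamp
    energy_kwh <- colSums(lp[, -1], na.rm = TRUE) * 0.25    # mean kW over 15 min -> kWh per plant and year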

    The dataset LoadProfile_20IPs_2016 is evaluated in the following publication:

    • Covic, N., Braeuer, F., McKenna, R., Pandzic, H., Optimizing Industrial Facilities’ Active Participation
      in Electricity Markets under Uncertainty, 2020.

    Both datasets together are evaluated in multiple publications:

    • Braeuer, F., Finck, R., McKenna, R., Comparing empirical and model-based approaches for calculating dynamic grid emission factors: An application to CO2-minimizing storage dispatch in Germany, Journal of Cleaner Production, Volume 266, 2020, 121588, ISSN 0959-6526, https://doi.org/10.1016/j.jclepro.2020.121588.
    • Braeuer, F., Rominger, J., McKenna, R.,Fichtner, W., Battery storage systems: An economic model-based analysis of parallel revenue streams and general implications for industry, Applied Energy, Volume 239, 2019, Pages 1424-1440, ISSN 0306-2619, https://doi.org/10.1016/j.apenergy.2019.01.050.

    Enjoy.

  16. Artificial single-cell datasets with cell clusters used for singleCellHaystack manuscript

    • figshare.com
    application/x-gzip
    Updated Jun 11, 2023
    + more versions
    Cite
    Alexis Vandenbon (2023). Artificial single-cell datasets with cell clusters used for singleCellHaystack manuscript [Dataset]. http://doi.org/10.6084/m9.figshare.12319787.v1
    Explore at:
    Available download formats: application/x-gzip
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Alexis Vandenbon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    An archive containing 100 artificial single-cell datasets. The data of each dataset is an R data file (.rda). The file names have the format splatter_<thousands_of_cells>kCells_groups_set_<set_id>.rda. For example, "splatter_1kCells_groups_set_9.rda" represents the 9th set, containing 1,000 cells, made using the "groups" option of splatSimulate. You can import the data into R using the load() function. Each dataset includes the following R objects: 1) counts: the number of reads or UMIs for each gene in each cell; 2) gene.data: a summary of the data generated by Splatter; 3) params: the parameters used to generate the dataset.
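
    For example, one set can be loaded and inspected like this (the file name is taken from the description above; the objects' exact classes are an assumption):

    load("splatter_1kCells_groups_set_9.rda")   # creates counts, gene.data and params
    dim(counts)                                 # genes x cells count matrix
    str(params)                                 # Splatter parameters for this simulation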

  17. Datasets for assessment of nutrient load estimation approaches for small urban streams in Durham, North Carolina, 2009-2020

    • data.usgs.gov
    • datasets.ai
    • +1more
    Updated Aug 7, 2024
    + more versions
    Cite
    Stephen Harden; Jessica Diaz; William Hamilton (2024). Datasets for assessment of nutrient load estimation approaches for small urban streams in Durham, North Carolina, 2009-2020 [Dataset]. http://doi.org/10.5066/P9F0Q501
    Explore at:
    Dataset updated
    Aug 7, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Authors
    Stephen Harden; Jessica Diaz; William Hamilton
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Time period covered
    Jan 1, 2009 - Dec 31, 2020
    Area covered
    Durham, North Carolina
    Description

    In cooperation with the City of Durham Public Works Department Stormwater Division, the U.S. Geological Survey (USGS) conducted a study to evaluate whether alternate monitoring strategies that incorporated samples collected across an increased range of streamflows would improve nutrient load estimates for Ellerbe and Sandy Creeks, two small, highly urbanized streams in the City of Durham, North Carolina. This data release provides the associated datasets described in the Scientific Investigations Report, "Assessment of Nutrient Load Estimation Approaches for Small Urban Streams in Durham, North Carolina". Water-quality and streamflow data collected between January 2009 and December 2020 were used to develop instream nutrient-load models using the U.S. Geological Survey R-LOADEST program (Runkel and others, 2004; Runkel, 2013; Lorenz and others, 2017). The datasets contain water-quality data, streamflow data, input files for model calibration and prediction, and output files for mo ...

  18. Replication Data for: Reining in the Rascals: Challenger Parties' Path to Power

    • search.dataone.org
    Updated Mar 6, 2024
    Cite
    Hjorth, Frederik; Jacob Nyrup; Martin Vinæs Larsen (2024). Replication Data for: Reining in the Rascals: Challenger Parties' Path to Power [Dataset]. http://doi.org/10.7910/DVN/FLGPW8
    Explore at:
    Dataset updated
    Mar 6, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Hjorth, Frederik; Jacob Nyrup; Martin Vinæs Larsen
    Description
    Information for replicating the analysis for "Reining in the Rascals: Challenger Parties' Path to Power" (The Journal of Politics), by Frederik Hjorth, Jacob Nyrup & Martin Vinæs Larsen.

    All code to replicate the analysis is written in R. 14 files in total are used to replicate the analysis in the article: 5 R scripts and 9 data files. The scripts use the R package "pacman" to install and load relevant packages, which is handled by the function pacman::p_load(). To make sure the function runs, the replicator should have "pacman" installed. The scripts use the R package "here" to automatically set the working directory to the replication folder. If "here" fails to locate the appropriate folder, simply set the working directory to the folder containing scripts and data using setwd(). When running the analysis it is important that 00-helperfunctions.R is loaded into R. This file contains a list of extra functions used throughout the analysis.

    List of R scripts: 00-helperfunctions.R, 01-comparativeanalysis.R, 02-mainanalysis.R, 03-mechanismanalysis.R, 04-appendix.R

    List of datasets: df_comparative.xlsx, df_main.rds, df_mainretroactive.rds, dkvaa13txtdf.rds, dkvaa17txtdf.rds, dkvaa2013.xlsx, dkvaa2017.xlsx, irtposbyparty.rds, municodelist.txt
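
    A hedged sketch of the conventions described above (illustrative only, not the replication scripts themselves):

    pacman::p_load(tidyverse, here)     # installs (if missing) and loads packages
    setwd(here::here())                 # or set the folder manually if here() cannot locate it
    source("00-helperfunctions.R")      # extra functions needed throughout the analysis
    source("02-mainanalysis.R")         # then run an analysis script
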
  19. imagenet-r

    • huggingface.co
    Updated Jun 18, 2024
    Cite
    Weixiong Lin (2024). imagenet-r [Dataset]. https://huggingface.co/datasets/axiong/imagenet-r
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 18, 2024
    Authors
    Weixiong Lin
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    ImageNet-R

    This repo is made to facilitate the evaluation of various pretraining models. It's constructed from the source file provided by the official implementation.

      Usage
    

    from datasets import load_dataset

    dataset = load_dataset('axiong/imagenet-r')

      Dataset Summary
    

    ImageNet-R(endition) contains art, cartoons, deviantart, graffiti, embroidery, graphics, origami, paintings, patterns, plastic objects, plush objects, sculptures, sketches, tattoos, toys, and video… See the full description on the dataset page: https://huggingface.co/datasets/axiong/imagenet-r.

  20. imagenet_r

    • tensorflow.org
    Updated Jun 1, 2024
    + more versions
    Cite
    (2024). imagenet_r [Dataset]. https://www.tensorflow.org/datasets/catalog/imagenet_r
    Explore at:
    Dataset updated
    Jun 1, 2024
    Description

    ImageNet-R is a set of images labelled with ImageNet labels that were obtained by collecting art, cartoons, deviantart, graffiti, embroidery, graphics, origami, paintings, patterns, plastic objects, plush objects, sculptures, sketches, tattoos, toys, and video game renditions of ImageNet classes. ImageNet-R has renditions of 200 ImageNet classes, resulting in 30,000 images. For more details please refer to the paper.

    The label space is the same as that of ImageNet2012. Each example is represented as a dictionary with the following keys:

    • 'image': The image, a (H, W, 3)-tensor.
    • 'label': An integer in the range [0, 1000).
    • 'file_name': A unique string identifying the example within the dataset.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('imagenet_r', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    (visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenet_r-0.2.0.png)
