61 datasets found
  1. Tutorial: Data Manipulation with dplyr

    • qubeshub.org
    Updated Nov 11, 2025
    Cite
    Drew LaMar (2025). Tutorial: Data Manipulation with dplyr [Dataset]. http://doi.org/10.25334/H0V0-M514
    Dataset provided by
    QUBES
    Authors
    Drew LaMar
    Description

    In this tutorial, we will explore the tidyverse data manipulation package dplyr.
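    As a taste of what the tutorial covers, here is a minimal sketch of the core dplyr verbs (using the built-in mtcars data, not the tutorial's own materials):

    library(dplyr)

    mtcars %>%
      filter(cyl == 4) %>%                      # keep four-cylinder cars
      mutate(kpl = mpg * 0.4251) %>%            # derive kilometres per litre
      group_by(gear) %>%
      summarise(mean_kpl = mean(kpl), n = n())  # summarise by gear count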

  2. REMNet Tutorial, R Part 5: Normalizing Microbiome Data in R 5.2.19

    • qubeshub.org
    Updated Aug 28, 2019
    Cite
    Jessica Joyner (2019). REMNet Tutorial, R Part 5: Normalizing Microbiome Data in R 5.2.19 [Dataset]. http://doi.org/10.25334/M13H-XT81
    Dataset provided by
    QUBES
    Authors
    Jessica Joyner
    Description

    Video on normalizing microbiome data from the Research Experiences in Microbiomes Network

  3. R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart

    • researchdata.edu.au
    • bridges.monash.edu
    Updated Apr 1, 2019
    Cite
    Gede Primahadi Wijaya Rajeg; Gede Primahadi Wijaya Rajeg (2019). R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart [Dataset]. http://doi.org/10.26180/5c844c7a81768
    Dataset provided by
    Monash University
    Authors
    Gede Primahadi Wijaya Rajeg; Gede Primahadi Wijaya Rajeg
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Publication


    Primahadi Wijaya R., Gede. 2014. Visualisation of diachronic constructional change using Motion Chart. In Zane Goebel, J. Herudjati Purwoko, Suharno, M. Suryadi & Yusuf Al Aried (eds.). Proceedings: International Seminar on Language Maintenance and Shift IV (LAMAS IV), 267-270. Semarang: Universitas Diponegoro. doi: https://doi.org/10.4225/03/58f5c23dd8387

    Description of R codes and data files in the repository

    This repository is imported from its GitHub repo. Versioning of this figshare repository is tied to the GitHub repo's Releases, so check the Releases page for updates (the next version will include a unified, tidyverse-based version of the codes from the first release).

    The raw input data consist of two files (will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of the top-200 infinitival collocates for will and be going to, respectively, across the twenty decades of the Corpus of Historical American English (COHA, from the 1810s to the 2000s).

    These two input files are used in the R code file 1-script-create-input-data-raw.r. The code preprocesses and combines the two files into a long-format data frame consisting of the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (for frequency of the collocates with be going to), and (iv) will (for frequency of the collocates with will); the result is available in input_data_raw.txt.

    Then, the script 2-script-create-motion-chart-input-data.R processes the input_data_raw.txt for normalising the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output from the second script is input_data_futurate.txt.
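    As a rough illustration of this normalisation step (a sketch only; the actual column names in the repository's files may differ):

    library(tidyverse)

    raw  <- read_tsv("input_data_raw.txt")  # decade, coll, `BE going to`, will
    coha <- read_tsv("coha_size.txt")       # assumed here: decade, size (words per decade)

    normalised <- raw %>%
      left_join(coha, by = "decade") %>%
      mutate(across(c(`BE going to`, will), ~ .x / size * 1e6))  # per million words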

    Next, input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).

    The repository adopts the project-oriented workflow in RStudio; double-click on the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.

  4. Trend Detection and Forecasting

    • search.dataone.org
    • hydroshare.org
    Updated Dec 5, 2021
    Cite
    Gabriela Garcia; Kateri Salk (2021). Trend Detection and Forecasting [Dataset]. https://search.dataone.org/view/sha256%3Acc6ce10bf4642cd85c69fc697a24b519ad086342c5da54012eb613d2f4f81e70
    Dataset provided by
    Hydroshare
    Authors
    Gabriela Garcia; Kateri Salk
    Description

    Trend Detection and Forecasting

    This lesson was adapted from educational material written by Dr. Kateri Salk for her Fall 2019 Hydrologic Data Analysis course at Duke University. This is the second part of a two-part exercise focusing on time series analysis.

    Introduction

    Time series are a special class of dataset, where a response variable is tracked over time. Time series analysis is a powerful technique that can be used to understand the various temporal patterns in our data by decomposing data into different cyclic trends. Time series analysis can also be used to predict how levels of a variable will change in the future, taking into account what has happened in the past.
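    For instance, a generic decomposition-and-forecast sketch in base R (AirPassengers stands in for the lesson's hydrologic data):

    decomp <- stl(AirPassengers, s.window = "periodic")
    plot(decomp)                       # seasonal, trend, and remainder components

    fit <- HoltWinters(AirPassengers)  # exponential smoothing with seasonality
    predict(fit, n.ahead = 12)         # forecast the next 12 months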

    Learning Objectives

    1. Choose appropriate time series analyses for trend detection and forecasting
    2. Discuss the influence of seasonality on time series analysis
    3. Interpret and communicate results of time series analyses
  5. Additional file 2 of tidyMicro: a pipeline for microbiome data analysis and visualization using the tidyverse in R

    • springernature.figshare.com
    txt
    Updated Jun 4, 2023
    Cite
    Charlie M. Carpenter; Daniel N. Frank; Kayla Williamson; Jaron Arbet; Brandie D. Wagner; Katerina Kechris; Miranda E. Kroehl (2023). Additional file 2 of tidyMicro: a pipeline for microbiome data analysis and visualization using the tidyverse in R [Dataset]. http://doi.org/10.6084/m9.figshare.13685090.v1
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Charlie M. Carpenter; Daniel N. Frank; Kayla Williamson; Jaron Arbet; Brandie D. Wagner; Katerina Kechris; Miranda E. Kroehl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 2. Model estimates table. Column 1: Taxa names. Column 2: Model coefficients. Column 3: Estimated rate ratios from exponentiated β estimates. For models with interaction terms, the appropriate β estimates are summed before being exponentiated. Column 4: Exponentiated 95% Wald confidence intervals. For models with interaction terms, the appropriate β estimates and covariance terms are summed for the Wald intervals. Column 5: Z-statistics from β estimates. Column 6: False discovery rate adjusted p-values.
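    The arithmetic behind columns 3-5 can be reproduced from any fitted count model; a generic sketch (the model, data, and names below are illustrative, not the paper's):

    fit <- MASS::glm.nb(count ~ group, data = d)  # d is a hypothetical data frame

    exp(coef(fit))                          # rate ratios (exponentiated betas)
    exp(confint.default(fit))               # exponentiated 95% Wald intervals
    summary(fit)$coefficients[, "z value"]  # Z-statistics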

  6. Storage and Transit Time Data and Code

    • zenodo.org
    zip
    Updated Nov 15, 2024
    Cite
    Andrew Felton; Andrew Felton (2024). Storage and Transit Time Data and Code [Dataset]. http://doi.org/10.5281/zenodo.14171251
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrew Felton; Andrew Felton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Author: Andrew J. Felton
    Date: 11/15/2024

    This R project contains the primary code and data (following pre-processing in python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:

    "Global estimates of the storage and transit time of water through vegetation"

    Please note that 'turnover' and 'transit' are used interchangeably. Also please note that this R project has been updated multiple times as the analysis has been revised throughout the peer review process.

    #Data information:

    The data folder contains key data sets used for analysis. In particular:

    "data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.

    #Code information

    Python scripts can be found in the "supporting_code" folder.

    Each R script in this project has a role:

    "01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).

    "02_functions.R": This script contains custom functions. Load this using the `source()` function in the 01_start.R script.

    "03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
    `source()` function in the 01_start.R script.

    "04_figures_tables.R": This is the main workhouse for figure/table production and supporting analyses. This script generates the key figures and summary statistics used in the study that then get saved in the "manuscript_figures" folder. Note that all maps were produced using Python code found in the "supporting_code"" folder. Also note that within the "manuscript_figures" folder there is an "extended_data" folder, which contains tables of the summary statistics (e.g., quartiles and sample sizes) behind figures containing box plots or depicting regression coefficients.

    "supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.

    "supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.

  7. Market Basket Analysis

    • kaggle.com
    zip
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Available download formats: zip (23875170 bytes)
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a dataset containing a retailer's transaction data; it covers all transactions that occurred over a period of time. The retailer will use the results to grow its business: by suggesting relevant itemsets to customers, it can increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem using Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another data item.

    Introduction

    Association rule mining is most useful when you want to discover associations between different objects in a set. It works well for finding frequent patterns in a transaction database: it can tell you which items customers frequently buy together, and it allows the retailer to identify relationships between the items.

    An Example of Association Rules

    Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule bought computer mouse => bought mouse mat: support = P(mouse & mat) = 8/100 = 0.08; confidence = support / P(computer mouse) = 0.08/0.10 = 0.8; lift = confidence / P(mouse mat) = 0.8/0.09 ≈ 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
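    The same numbers in R, for concreteness:

    n <- 100; mouse <- 10; mat <- 9; both <- 8

    support    <- both / n               # 0.08
    confidence <- support / (mouse / n)  # 0.08 / 0.10 = 0.8
    lift       <- confidence / (mat / n) # 0.8 / 0.09 = 8.9 (approximately)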

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data, so that it is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of rules

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of Rows: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.


    Libraries in R

    First, we need to load the required libraries. Each is briefly described below.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - An opinionated collection of R packages designed for data science; it makes it easy to install and load multiple 'tidyverse' packages in a single step.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.


    Data Pre-processing

    Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.


    Next we will clean our data frame by removing missing values.


    To apply association rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
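    A sketch of that conversion and the subsequent mining with arules (assuming the data frame is named retaildata as above; the thresholds are illustrative):

    library(arules)

    items <- split(retaildata$Itemname, retaildata$BillNo)  # group items by invoice
    trans <- as(items, "transactions")

    rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.8, minlen = 2))
    inspect(head(sort(rules, by = "lift"), 5))              # strongest rules first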

  8. Physical Properties of Lakes: Exploratory Data Visualization

    • hydroshare.org
    • search.dataone.org
    zip
    Updated Jan 29, 2021
    Cite
    Gabriela Garcia; Kateri Salk (2021). Physical Properties of Lakes: Exploratory Data Visualization [Dataset]. https://www.hydroshare.org/resource/e22442bc4e4940609003b43747b366e0
    Available download formats: zip (2.9 MB)
    Dataset provided by
    HydroShare
    Authors
    Gabriela Garcia; Kateri Salk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    May 27, 1984 - Aug 17, 2016
    Description

    Exploratory Data Visualization for the Physical Properties of Lakes

    This lesson was adapted from educational material written by Dr. Kateri Salk for her Fall 2019 Hydrologic Data Analysis course at Duke University. This is the second part of a two-part exercise focusing on the physical properties of lakes.

    Introduction

    The field of limnology, the study of inland waters, uses a unique graph format to display relationships of variables by depth in a lake (the field of oceanography uses the same convention). Depth is placed on the y-axis in reverse order and the other variable(s) are placed on the x-axis. In this manner, the graph appears as if a cross section were taken from that point in the lake, with the surface at the top of the graph. This lesson introduces physical properties of lakes, namely stratification, and its visualization using the package ggplot2.
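    In ggplot2, that convention amounts to reversing the y-axis; a minimal sketch with a hypothetical lake_profile data frame:

    library(ggplot2)

    ggplot(lake_profile, aes(x = temperature_C, y = depth_m)) +
      geom_path() +
      scale_y_reverse() +  # surface at the top, depth increasing downward
      labs(x = "Temperature (°C)", y = "Depth (m)")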

    Learning Objectives

    After successfully completing this notebook, you will be able to:

    1. Investigate the concepts of lake stratification and mixing by analyzing monitoring data
    2. Apply data analytics skills to applied questions about physical properties of lakes
    3. Communicate findings with peers through oral, visual, and written modes
  9. Divvy Bikeshare

    • kaggle.com
    zip
    Updated Dec 14, 2022
    Cite
    Justina Rosario (2022). Divvy Bikeshare [Dataset]. https://www.kaggle.com/datasets/justinarosario/divvy-bikeshare
    Available download formats: zip (53940438 bytes)
    Authors
    Justina Rosario
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is my very first data analytics project. The data for it is found at https://artscience.blog/home/divvy-dataviz-case-study. I followed the R script written by Kevin Hartman; his analysis is based on the Divvy case study "'Sophisticated, Clear, and Polished': Divvy and Data Visualization".

  10. Writing Clean Code in R Workshop

    • qubeshub.org
    Updated Oct 15, 2019
    Cite
    Max Joseph; Leah Wasser (2019). Writing Clean Code in R Workshop [Dataset]. https://qubeshub.org/publications/1442
    Dataset provided by
    QUBES
    Authors
    Max Joseph; Leah Wasser
    Description

    When working with data, you often spend most of your time cleaning it. Learn how to write more efficient code using the tidyverse in R.

  11. Data from: Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jan 9, 2022
    Cite
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
    Dataset provided by
    Massachusetts General Hospital
    Harvard Medical School
    Authors
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

    Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

    Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

    Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.

    Methods eLAB Development and Source Code (R statistical software):

    eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

    eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.

    Functions were written to remap EHR bulk lab data pulls/queries from several sources including Clarity/Crystal reports or institutional EDW including Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

    The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).

    Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
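    A sketch of that key-value remapping step with tidyverse tools (the table below shows three of the ~300 lab subtypes; labs_raw and the column names are hypothetical):

    library(tidyverse)

    lookup <- tribble(
      ~ehr_name,            ~dd_code,
      "Potassium",          "potassium",
      "Potassium-External", "potassium",
      "Potassium(POC)",     "potassium"
    )

    labs_remapped <- labs_raw %>%
      left_join(lookup, by = c("lab_name" = "ehr_name")) %>%
      filter(!is.na(dd_code))  # keep only labs pre-defined by the Data Dictionary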

    Data Dictionary (DD)

    EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.

    Study Cohort

    This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

    Statistical Analysis

    OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.

  12. Physical Properties of Rivers: Querying Metadata and Discharge Data

    • hydroshare.org
    • search.dataone.org
    zip
    Updated Jan 29, 2021
    Cite
    Gabriela Garcia; Kateri Salk (2021). Physical Properties of Rivers: Querying Metadata and Discharge Data [Dataset]. https://www.hydroshare.org/resource/20dc4af8451e44b3950b182a8f506296
    Available download formats: zip (1.7 MB)
    Dataset provided by
    HydroShare
    Authors
    Gabriela Garcia; Kateri Salk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Physical Properties of Rivers: Querying Metadata and Discharge Data

    This lesson was adapted from educational material written by Dr. Kateri Salk for her Fall 2019 Hydrologic Data Analysis course at Duke University. This is the second part of a two-part exercise focusing on the physical properties of rivers.

    Introduction

    Rivers are bodies of freshwater flowing from higher elevations to lower elevations due to the force of gravity. One of the most important physical characteristics of a stream or river is discharge, the volume of water moving through the river or stream over a given amount of time. Discharge can be measured directly by measuring the velocity of flow at several spots in a stream and multiplying the flow velocity by the cross-sectional area of the stream. However, this method is effort-intensive. This exercise will demonstrate how to approximate discharge by developing a rating curve for a stream at a given sampling point. You will also learn to query metadata from and compare discharge patterns in climatically different regions of the United States.

    Learning Objectives

    After successfully completing this exercise, you will be able to:

    1. Execute queries to pull a variety of National Water Information System (NWIS) and Water Quality Portal (WQP) data into R (see the sketch below).
    2. Analyze seasonal and interannual characteristics of stream discharge and compare discharge patterns in different regions of the United States.
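    As a flavour of objective 1, one way such a query looks with the dataRetrieval package (the site number and dates here are illustrative):

    library(dataRetrieval)

    q <- readNWISdv(siteNumbers = "02085070",  # a USGS gage, for illustration
                    parameterCd = "00060",     # daily mean discharge
                    startDate = "2010-01-01",
                    endDate   = "2019-12-31")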
  13. RUNNING"calorie:heartrate

    • kaggle.com
    zip
    Updated Jan 6, 2022
    Cite
    romechris34 (2022). RUNNING"calorie:heartrate [Dataset]. https://www.kaggle.com/datasets/romechris34/wellness
    Available download formats: zip (25272804 bytes)
    Authors
    romechris34
    Description

    title: 'BellaBeat Fitbit'
    author: 'C Romero'
    date: '`r Sys.Date()`'
    output:
      html_document:
        number_sections: true
        toc: true

    ##Installation of the packages used in this analysis
    install.packages("base")       # base R utilities (normally pre-installed)
    install.packages("ggplot2")    # data visualisation
    install.packages("lubridate")  # easier handling of dates and times
    install.packages("tidyverse")  # the tidyverse metapackage
    install.packages("dplyr")      # data manipulation
    install.packages("readr")      # read rectangular text data
    install.packages("tidyr")      # data tidying
    

    Importing packages (tidyverse is a metapackage of all tidyverse packages)

    library(base)       # base R
    library(lubridate)  # make dealing with dates a little easier
    library(ggplot2)    # create elegant data visualisations using the grammar of graphics
    library(dplyr)      # a grammar of data manipulation
    library(readr)      # read rectangular text data
    library(tidyr)      # tidy messy data

    
    ## Running code

    In a notebook, you can run a single code cell by clicking in the cell and then hitting
    the blue arrow to the left, or by clicking in the cell and pressing Shift+Enter. In a script,
    you can run code by highlighting the code you want to run and then clicking the blue arrow
    at the bottom of this window.

    ## Reading in files

    list.files(path = "../input")

    # load the activity and sleep data sets
    dailyActivity <- read_csv("../input/wellness/dailyActivity_merge.csv")
    sleepDay <- read_csv("../input/wellness/sleepDay_merged.csv")
    
    

    check for duplicates and NAs

    sum(duplicated(dailyActivity))
    sum(duplicated(sleepDay))
    sum(is.na(dailyActivity))
    sum(is.na(sleepDay))

    now we will remove duplicates from sleepDay & create a new dataframe

    sleepy <- sleepDay %>% distinct()
    head(sleepy)
    head(dailyActivity)

    count the number of distinct ids in the sleepy & dailyActivity frames

    n_distinct(dailyActivity$Id)
    n_distinct(sleepy$Id)

    get the total steps and total distance for each member id

    dailyActivity %>%
      group_by(Id) %>%
      summarise(freq = sum(TotalSteps)) %>%
      arrange(-freq)

    Tot_dist <- dailyActivity %>%
      mutate(Id = as.character(Id)) %>%
      group_by(Id) %>%
      summarise(dizzy = sum(TotalDistance)) %>%
      arrange(-dizzy)

    now get the total minutes asleep & time lying in bed

    sleepy %>%
      group_by(Id) %>%
      summarise(Msleep = sum(TotalMinutesAsleep)) %>%
      arrange(Msleep)

    sleepy %>%
      group_by(Id) %>%
      summarise(inBed = sum(TotalTimeInBed)) %>%
      arrange(inBed)

    plot graphs for "in bed and sleep data" & "total steps and distance"

    ggplot(Tot_dist) +
      geom_count(mapping = aes(y = dizzy, x = Id, color = Id, fill = Id, size = 2)) +
      labs(x = "member id's", title = "distance miles") +
      theme(axis.text.x = element_text(angle = 90))
    
  14. Multilevel modeling of time-series cross-sectional data reveals the dynamic interaction between ecological threats and democratic development

    • data.niaid.nih.gov
    • dataone.org
    zip
    Updated Mar 6, 2020
    Cite
    Kodai Kusano (2020). Multilevel modeling of time-series cross-sectional data reveals the dynamic interaction between ecological threats and democratic development [Dataset]. http://doi.org/10.5061/dryad.547d7wm3x
    Dataset provided by
    University of Nevada, Reno
    Authors
    Kodai Kusano
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    What is the relationship between environment and democracy? The framework of cultural evolution suggests that societal development is an adaptation to ecological threats. Pertinent theories assume that democracy emerges as societies adapt to ecological factors such as higher economic wealth, lower pathogen threats, less demanding climates, and fewer natural disasters. However, previous research confused within-country processes with between-country processes and erroneously interpreted between-country findings as if they generalize to within-country mechanisms. In this article, we analyze a time-series cross-sectional dataset to study the dynamic relationship between environment and democracy (1949-2016), accounting for previous misconceptions in levels of analysis. By separating within-country processes from between-country processes, we find that the relationship between environment and democracy not only differs by countries but also depends on the level of analysis. Economic wealth predicts increasing levels of democracy in between-country comparisons, but within-country comparisons show that democracy declines as countries become wealthier over time. This relationship is only prevalent among historically wealthy countries but not among historically poor countries, whose wealth also increased over time. By contrast, pathogen prevalence predicts lower levels of democracy in both between-country and within-country comparisons. Our longitudinal analyses identifying temporal precedence reveal that not only reductions in pathogen prevalence drive future democracy, but also democracy reduces future pathogen prevalence and increases future wealth. These nuanced results contrast with previous analyses using narrow, cross-sectional data. As a whole, our findings illuminate the dynamic process by which environment and democracy shape each other.

    Methods Our Time-Series Cross-Sectional data combine various online databases. Country names were first identified and matched using the R package “countrycode” (Arel-Bundock, Enevoldsen, & Yetman, 2018) before all datasets were merged. Occasionally, we modified unidentified country names to be consistent across datasets. We then transformed “wide” data into “long” data and merged them using R’s Tidyverse framework (Wickham, 2014). Our analysis begins with the year 1949, occasioned by the fact that one of the key time-variant level-1 variables, pathogen prevalence, was only available from 1949 on. See our Supplemental Material for all data, Stata syntax, R-markdown for visualization, supplemental analyses and detailed results (available at https://osf.io/drt8j/).
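    A sketch of that wide-to-long step with tidyr (wide_democracy and its columns are hypothetical stand-ins for the merged databases):

    library(tidyverse)

    long <- wide_democracy %>%
      pivot_longer(cols = -country,
                   names_to = "year",
                   values_to = "democracy",
                   names_transform = list(year = as.integer))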

  15. Data and code from: Extending irrigation reservoir histories for improved groundwater modeling and conjunctive water management in two Arkansas critical groundwater areas

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Data and code from: Extending irrigation reservoir histories for improved groundwater modeling and conjunctive water management in two Arkansas critical groundwater areas [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-extending-irrigation-reservoir-histories-for-improved-groundwater-model-7e431
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    This dataset contains all data and code required to reproduce the time-to-event analysis in the associated manuscript. More detail is found in the associated README.md file.

    Contents of repository

    • Yaeger_ReservoirDataset_Oct2022.csv: comma-separated data file with reservoir characteristics, construction times, and water table depths and percent saturation at 5-year intervals from 1975-2015.
    • CGA_reservoir_analysis.Rmd: RMarkdown notebook with all code required to reproduce the time-to-event analysis in the manuscript and generate the associated plots.
    • CGA_reservoir_analysis.html: HTML file rendered from the .Rmd notebook.
    • README.md: additional details, including column descriptions from the CSV file.

    Software versions used

    • R version 4.1.2 (https://cran.r-project.org/bin/windows/base/old/4.1.2)
    • R packages: data.table v1.14.8 (https://rdatatable.gitlab.io/data.table/); ggplot2 v3.4.4 (https://ggplot2.tidyverse.org/); sf v1.0-14 (https://r-spatial.github.io/sf/); survival v3.5-5 (https://cran.r-project.org/package=survival); icenReg v2.0.15 (https://cran.r-project.org/package=icenReg)
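    Because construction times are known only to 5-year intervals, the time-to-event analysis is interval-censored; a hedged sketch of one such fit with the survival package (the column names are illustrative, not those of the CSV, and the manuscript's own analysis uses icenReg):

    library(survival)

    d <- read.csv("Yaeger_ReservoirDataset_Oct2022.csv")
    fit <- survreg(Surv(t_lower, t_upper, type = "interval2") ~ pct_saturation,
                   data = d, dist = "weibull")  # parametric interval-censored fit
    summary(fit)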

  16. Google Data Analytics: Case Study 1(Cyclistics)

    • kaggle.com
    zip
    Updated Sep 9, 2022
    Cite
    Nuyang Rai (2022). Google Data Analytics: Case Study 1(Cyclistics) [Dataset]. https://www.kaggle.com/datasets/nuyangrai/google-data-analytics-case-study-1cyclistics
    Available download formats: zip (419500618 bytes)
    Authors
    Nuyang Rai
    Description

    I downloaded the divvy_trip_data 2021 from Jan to Dec. Since Google Sheets won't let me edit and upload these files (they are way too big, exceeding 1 GB) and Google BigQuery won't let me use DML with the free account, I will be using RStudio for all the analysis, especially:
    - Tidyverse for data manipulation, exploration, and visualization
    - Palmerpenguins (if necessary, and same as tidyverse)
    - Lubridate for dates and times
    - ggplot for visualization

  17. Module M.1 R basics for data exploration and management

    • qubeshub.org
    Updated Jun 26, 2023
    Cite
    Raisa Hernández-Pacheco; Alexandra Bland (2023). Module M.1 R basics for data exploration and management [Dataset]. http://doi.org/10.25334/M9B9-8073
    Dataset provided by
    QUBES
    Authors
    Raisa Hernández-Pacheco; Alexandra Bland
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Introduction to Primate Data Exploration and Linear Modeling with R was created with the goal of providing training to undergraduate biology students on data management and statistical analysis using authentic data of Cayo Santiago rhesus macaques. Module M.1 introduces basic functions from R, as well as from its package tidyverse, for data exploration and management.

  18. Introduction to Ancient Metagenomics Textbook (Edition 2024): Introduction to R and the Tidyverse

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 13, 2024
    Cite
    Clemens Schmid (2024). Introduction to Ancient Metagenomics Textbook (Edition 2024): Introduction to R and the Tidyverse [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8413026
    Dataset provided by
    Max Planck Institute for Evolutionary Anthropology
    Authors
    Clemens Schmid
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data and conda software environment file for the chapter 'Introduction to R and the Tidyverse' of the SPAAM Community's textbook: Introduction to Ancient Metagenomics (https://www.spaam-community.org/intro-to-ancient-metagenomics-book).

  19. SPAAM Summer School 2022: Introduction to Ancient Metagenomics - 3b1 Introduction to R and the Tidyverse

    • data.niaid.nih.gov
    Updated Aug 12, 2022
    Cite
    Schmid, Clemens (2022). SPAAM Summer School 2022: Introduction to Ancient Metagenomics - 3b1 Introduction to R and the Tidyverse [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6983148
    Dataset provided by
    Max Planck Institute for Evolutionary Anthropology / Max Planck Institute for the Science of Human History
    Authors
    Schmid, Clemens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Teaching data for practical session: "3b1 Introduction to R and the Tidyverse" of the 2022 SPAAM Summer School: Introduction to Ancient Metagenomics (Aug. 1-5 2022).

    See: https://spaam-community.github.io/wss-summer-school/#/2022/ or https://doi.org/10.5281/zenodo.6976711 for slides.

    Once downloaded, run:

    tar xvfz .tar.gz

    to decompress the data directory for the session.

  20. Brisbane Library Checkout Data

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Nicholas Tierney (2020). Brisbane Library Checkout Data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2437859
    Dataset provided by
    Monash University
    Authors
    Nicholas Tierney
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Brisbane
    Description

    This has been copied from the README.md file

    bris-lib-checkout

    This provides tidied up data from the Brisbane library checkouts

    Retrieving and cleaning the data

    The script for retrieving and cleaning the data is made available in scrape-library.R.

    The data

    The data/ folder contains the tidy data

    The data-raw/ folder contains the raw data

    data/

    This contains four tidied up dataframes:

    tidy-brisbane-library-checkout.csv

    metadata_branch.csv

    metadata_heading.csv

    metadata_item_type.csv

    tidy-brisbane-library-checkout.csv contains the following columns, with the metadata file metadata_heading containing the description of these columns.

    knitr::kable(readr::read_csv("data/metadata_heading.csv"))

    > Parsed with column specification:
    > cols(
    >   heading = col_character(),
    >   heading_explanation = col_character()
    > )

    | heading          | heading_explanation                         |
    |------------------|---------------------------------------------|
    | Title            | Title of Item                               |
    | Author           | Author of Item                              |
    | Call Number      | Call Number of Item                         |
    | Item id          | Unique Item Identifier                      |
    | Item Type        | Type of Item (see next column)              |
    | Status           | Current Status of Item                      |
    | Language         | Published language of item (if not English) |
    | Age              | Suggested audience                          |
    | Checkout Library | Checkout branch                             |
    | Date             | Checkout date                               |

    We also added year, month, and day columns.
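    Such columns can be derived from the checkout timestamp with lubridate (a sketch; the repository's scrape-library.R may do this differently):

    library(dplyr)
    library(lubridate)

    checkouts <- checkouts %>%
      mutate(year  = year(datetime),
             month = month(datetime),
             day   = wday(datetime, label = TRUE))  # day stored as a label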

    The remaining data are all metadata files that contain meta information on the columns in the checkout data:

    library(tidyverse)

    > ── Attaching packages ────────────── tidyverse 1.2.1 ──
    > ✔ ggplot2 3.1.0           ✔ purrr   0.2.5
    > ✔ tibble  1.4.99.9006     ✔ dplyr   0.7.8
    > ✔ tidyr   0.8.2           ✔ stringr 1.3.1
    > ✔ readr   1.3.0           ✔ forcats 0.3.0
    > ── Conflicts ───────────────── tidyverse_conflicts() ──
    > ✖ dplyr::filter() masks stats::filter()
    > ✖ dplyr::lag()    masks stats::lag()

    knitr::kable(readr::read_csv("data/metadata_branch.csv"))

    > Parsed with column specification:
    > cols(
    >   branch_code = col_character(),
    >   branch_heading = col_character()
    > )

    | branch_code | branch_heading          |
    |-------------|-------------------------|
    | ANN         | Annerley                |
    | ASH         | Ashgrove                |
    | BNO         | Banyo                   |
    | BRR         | BrackenRidge            |
    | BSQ         | Brisbane Square Library |
    | BUL         | Bulimba                 |
    | CDA         | Corinda                 |
    | CDE         | Chermside               |
    | CNL         | Carindale               |
    | CPL         | Coopers Plains          |
    | CRA         | Carina                  |
    | EPK         | Everton Park            |
    | FAI         | Fairfield               |
    | GCY         | Garden City             |
    | GNG         | Grange                  |
    | HAM         | Hamilton                |
    | HPK         | Holland Park            |
    | INA         | Inala                   |
    | IPY         | Indooroopilly           |
    | MBG         | Mt. Coot-tha            |
    | MIT         | Mitchelton              |
    | MTG         | Mt. Gravatt             |
    | MTO         | Mt. Ommaney             |
    | NDH         | Nundah                  |
    | NFM         | New Farm                |
    | SBK         | Sunnybank Hills         |
    | SCR         | Stones Corner           |
    | SGT         | Sandgate                |
    | VAN         | Mobile Library          |
    | TWG         | Toowong                 |
    | WND         | West End                |
    | WYN         | Wynnum                  |
    | ZIL         | Zillmere                |

    knitr::kable(readr::read_csv("data/metadata_item_type.csv"))

    > Parsed with column specification:
    > cols(
    >   item_type_code = col_character(),
    >   item_type_explanation = col_character()
    > )

    | item_type_code | item_type_explanation                     |
    |----------------|-------------------------------------------|
    | AD-FICTION     | Adult Fiction                             |
    | AD-MAGS        | Adult Magazines                           |
    | AD-PBK         | Adult Paperback                           |
    | BIOGRAPHY      | Biography                                 |
    | BSQCDMUSIC     | Brisbane Square CD Music                  |
    | BSQCD-ROM      | Brisbane Square CD Rom                    |
    | BSQ-DVD        | Brisbane Square DVD                       |
    | CD-BOOK        | Compact Disc Book                         |
    | CD-MUSIC       | Compact Disc Music                        |
    | CD-ROM         | CD Rom                                    |
    | DVD            | DVD                                       |
    | DVD_R18+       | DVD Restricted - 18+                      |
    | FASTBACK       | Fastback                                  |
    | GAYLESBIAN     | Gay and Lesbian Collection                |
    | GRAPHICNOV     | Graphic Novel                             |
    | ILL            | InterLibrary Loan                         |
    | JU-FICTION     | Junior Fiction                            |
    | JU-MAGS        | Junior Magazines                          |
    | JU-PBK         | Junior Paperback                          |
    | KITS           | Kits                                      |
    | LARGEPRINT     | Large Print                               |
    | LGPRINTMAG     | Large Print Magazine                      |
    | LITERACY       | Literacy                                  |
    | LITERACYAV     | Literacy Audio Visual                     |
    | LOCSTUDIES     | Local Studies                             |
    | LOTE-BIO       | Languages Other than English Biography    |
    | LOTE-BOOK      | Languages Other than English Book         |
    | LOTE-CDMUS     | Languages Other than English CD Music     |
    | LOTE-DVD       | Languages Other than English DVD          |
    | LOTE-MAG       | Languages Other than English Magazine     |
    | LOTE-TB        | Languages Other than English Taped Book   |
    | MBG-DVD        | Mt Coot-tha Botanical Gardens DVD         |
    | MBG-MAG        | Mt Coot-tha Botanical Gardens Magazine    |
    | MBG-NF         | Mt Coot-tha Botanical Gardens Non Fiction |
    | MP3-BOOK       | MP3 Audio Book                            |
    | NONFIC-SET     | Non Fiction Set                           |
    | NONFICTION     | Non Fiction                               |
    | PICTURE-BK     | Picture Book                              |
    | PICTURE-NF     | Picture Book Non Fiction                  |
    | PLD-BOOK       | Public Libraries Division Book            |
    | YA-FICTION     | Young Adult Fiction                       |
    | YA-MAGS        | Young Adult Magazine                      |
    | YA-PBK         | Young Adult Paperback                     |

    Example usage

    Let’s explore the data

    bris_libs <- readr::read_csv("data/bris-lib-checkout.csv")

    > Parsed with column specification:
    > cols(
    >   title = col_character(),
    >   author = col_character(),
    >   call_number = col_character(),
    >   item_id = col_double(),
    >   item_type = col_character(),
    >   status = col_character(),
    >   language = col_character(),
    >   age = col_character(),
    >   library = col_character(),
    >   date = col_double(),
    >   datetime = col_datetime(format = ""),
    >   year = col_double(),
    >   month = col_double(),
    >   day = col_character()
    > )
    > Warning: 20 parsing failures.
    >    row     col expected  actual file
    > 587795 item_id a double REFRESH 'data/bris-lib-checkout.csv'
    > 590579 item_id a double REFRESH 'data/bris-lib-checkout.csv'
    > 590597 item_id a double REFRESH 'data/bris-lib-checkout.csv'
    > 595774 item_id a double REFRESH 'data/bris-lib-checkout.csv'
    > 597567 item_id a double REFRESH 'data/bris-lib-checkout.csv'
    > ...... ....... ........ ....... ............................
    > See problems(...) for more details.

    We can count the number of titles, item types, suggested age, and the library given:

    library(dplyr)
    count(bris_libs, title, sort = TRUE)

    > # A tibble: 121,046 x 2
    >    title                                 n
    >  1 Australian house and garden        1469
    >  2 New scientist (Australasian ed.)   1380
    >  3 Australian home beautiful          1331
    >  4 Country style                      1229
    >  5 The New idea                       1186
    >  6 Hello                              1133
    >  7 Woman's day                        1096
    >  8 Country life                       1056
    >  9 Better homes and gardens. (AU)     1041
    > 10 Yi Zhou Kan                         884
    > # … with 121,036 more rows

    count(bris_libs, item_type, sort = TRUE)

    > # A tibble: 69 x 2
    >    item_type       n
    >  1 PICTURE-BK 121126
    >  2 DVD         98283
    >  3 AD-PBK      91671
    >  4 JU-PBK      88402
    >  5 NONFICTION  76168
    >  6 AD-MAGS     60516
    >  7 AD-FICTION  53090
    >  8 LARGEPRINT  19113
    >  9 JU-FICTION  17261
    > 10 LOTE-BOOK   12303
    > # … with 59 more rows

    count(bris_libs, age, sort = TRUE)

    > # A tibble: 5 x 2
    >   age           n
    > 1 ADULT    420287
    > 2 JUVENILE 283902
    > 3 YA        13715
    > 4             147
    > 5 UNKNOWN      36

    count(bris_libs, library, sort = TRUE)

    > # A tibble: 38 x 2
    >    library     n
    >  1 SBK     49154
    >  2 BSQ     45968
    >  3 CNL     45642
    >  4 IPY     44569
    >  5 GCY     43090
    >  6 CDE     42775
    >  7 ASH     42086
    >  8 WYN     35124
    >  9 KEN     33947
    > 10 MTO     31201
    > # … with 28 more rows

    License

    This data is provided under a CC BY 4.0 license

    It has been downloaded from Brisbane library checkouts, and tidied up using the code in data-raw.
