61 datasets found
  1. Tutorial: Data Manipulation with dplyr

    • qubeshub.org
    Updated Nov 11, 2025
    Cite
    Drew LaMar (2025). Tutorial: Data Manipulation with dplyr [Dataset]. http://doi.org/10.25334/H0V0-M514
    Dataset provided by
    QUBES
    Authors
    Drew LaMar
    Description

    In this tutorial, we will explore the tidyverse data manipulation package dplyr.
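    As a taste of what the tutorial covers, here is a minimal sketch of the core dplyr verbs (using the built-in mtcars data, not the tutorial's own materials):

    library(dplyr)

    mtcars %>%
      filter(cyl == 4) %>%                      # keep four-cylinder cars
      mutate(kpl = mpg * 0.4251) %>%            # derive kilometres per litre
      group_by(gear) %>%
      summarise(mean_kpl = mean(kpl), n = n())  # summarise by gear count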

  2. REMNet Tutorial, R Part 5: Normalizing Microbiome Data in R 5.2.19

    • qubeshub.org
    Updated Aug 28, 2019
    Cite
    Jessica Joyner (2019). REMNet Tutorial, R Part 5: Normalizing Microbiome Data in R 5.2.19 [Dataset]. http://doi.org/10.25334/M13H-XT81
    Dataset provided by
    QUBES
    Authors
    Jessica Joyner
    Description

    Video on normalizing microbiome data from the Research Experiences in Microbiomes Network

  3. R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart

    • researchdata.edu.au
    • bridges.monash.edu
    Updated Apr 1, 2019
    Cite
    Gede Primahadi Wijaya Rajeg; Gede Primahadi Wijaya Rajeg (2019). R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart [Dataset]. http://doi.org/10.26180/5c844c7a81768
    Dataset provided by
    Monash University
    Authors
    Gede Primahadi Wijaya Rajeg; Gede Primahadi Wijaya Rajeg
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Publication


    Primahadi Wijaya R., Gede. 2014. Visualisation of diachronic constructional change using Motion Chart. In Zane Goebel, J. Herudjati Purwoko, Suharno, M. Suryadi & Yusuf Al Aried (eds.). Proceedings: International Seminar on Language Maintenance and Shift IV (LAMAS IV), 267-270. Semarang: Universitas Diponegoro. doi: https://doi.org/10.4225/03/58f5c23dd8387

    Description of R codes and data files in the repository

    This repository is imported from its GitHub repo. Versioning of this figshare repository is tied to the GitHub repo's Releases, so check the Releases page for updates (the next version will include a unified, tidyverse-based version of the codes from the first release).

    The raw input data consist of two files (will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of the top-200 infinitival collocates for will and be going to, respectively, across the twenty decades of the Corpus of Historical American English (COHA, from the 1810s to the 2000s).

    These two input files are used in the R code file 1-script-create-input-data-raw.r. The code preprocesses and combines the two files into a long-format data frame consisting of the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (for frequency of the collocates with be going to), and (iv) will (for frequency of the collocates with will); the result is available in input_data_raw.txt.

    Then, the script 2-script-create-motion-chart-input-data.R processes the input_data_raw.txt for normalising the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output from the second script is input_data_futurate.txt.
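    As a rough illustration of this normalisation step (a sketch only; the actual column names in the repository's files may differ):

    library(tidyverse)

    raw  <- read_tsv("input_data_raw.txt")  # decade, coll, `BE going to`, will
    coha <- read_tsv("coha_size.txt")       # assumed here: decade, size (words per decade)

    normalised <- raw %>%
      left_join(coha, by = "decade") %>%
      mutate(across(c(`BE going to`, will), ~ .x / size * 1e6))  # per million words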

    Next, input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).

    The repository adopts the project-oriented workflow in RStudio; double-click on the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.

  4. Trend Detection and Forecasting

    • search.dataone.org
    • hydroshare.org
    Updated Dec 5, 2021
    Cite
    Gabriela Garcia; Kateri Salk (2021). Trend Detection and Forecasting [Dataset]. https://search.dataone.org/view/sha256%3Acc6ce10bf4642cd85c69fc697a24b519ad086342c5da54012eb613d2f4f81e70
    Dataset provided by
    Hydroshare
    Authors
    Gabriela Garcia; Kateri Salk
    Description

    Trend Detection and Forecasting

    This lesson was adapted from educational material written by Dr. Kateri Salk for her Fall 2019 Hydrologic Data Analysis course at Duke University. This is the second part of a two-part exercise focusing on time series analysis.

    Introduction

    Time series are a special class of dataset, where a response variable is tracked over time. Time series analysis is a powerful technique that can be used to understand the various temporal patterns in our data by decomposing data into different cyclic trends. Time series analysis can also be used to predict how levels of a variable will change in the future, taking into account what has happened in the past.
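    For instance, a generic decomposition-and-forecast sketch in base R (AirPassengers stands in for the lesson's hydrologic data):

    decomp <- stl(AirPassengers, s.window = "periodic")
    plot(decomp)                       # seasonal, trend, and remainder components

    fit <- HoltWinters(AirPassengers)  # exponential smoothing with seasonality
    predict(fit, n.ahead = 12)         # forecast the next 12 months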

    Learning Objectives

    1. Choose appropriate time series analyses for trend detection and forecasting
    2. Discuss the influence of seasonality on time series analysis
    3. Interpret and communicate results of time series analyses
  5. Additional file 2 of tidyMicro: a pipeline for microbiome data analysis and visualization using the tidyverse in R

    • springernature.figshare.com
    txt
    Updated Jun 4, 2023
    Cite
    Charlie M. Carpenter; Daniel N. Frank; Kayla Williamson; Jaron Arbet; Brandie D. Wagner; Katerina Kechris; Miranda E. Kroehl (2023). Additional file 2 of tidyMicro: a pipeline for microbiome data analysis and visualization using the tidyverse in R [Dataset]. http://doi.org/10.6084/m9.figshare.13685090.v1
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Charlie M. Carpenter; Daniel N. Frank; Kayla Williamson; Jaron Arbet; Brandie D. Wagner; Katerina Kechris; Miranda E. Kroehl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 2. Model estimates table. Column 1: Taxa names. Column 2: Model coefficients. Column 3: Estimated rate ratios from exponentiated β estimates. For models with interaction terms, the appropriate β estimates are summed before being exponentiated. Column 4: Exponentiated 95% Wald confidence intervals. For models with interaction terms, the appropriate β estimates and covariance terms are summed for the Wald intervals. Column 5: Z-statistics from β estimates. Column 6: False discovery rate adjusted p-values.
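    The arithmetic behind columns 3-5 can be reproduced from any fitted count model; a generic sketch (the model, data, and names below are illustrative, not the paper's):

    fit <- MASS::glm.nb(count ~ group, data = d)  # d is a hypothetical data frame

    exp(coef(fit))                          # rate ratios (exponentiated betas)
    exp(confint.default(fit))               # exponentiated 95% Wald intervals
    summary(fit)$coefficients[, "z value"]  # Z-statistics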

  6. Storage and Transit Time Data and Code

    • zenodo.org
    zip
    Updated Nov 15, 2024
    Cite
    Andrew Felton; Andrew Felton (2024). Storage and Transit Time Data and Code [Dataset]. http://doi.org/10.5281/zenodo.14171251
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrew Felton; Andrew Felton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Author: Andrew J. Felton
    Date: 11/15/2024

    This R project contains the primary code and data (following pre-processing in python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:

    "Global estimates of the storage and transit time of water through vegetation"

    Please note that 'turnover' and 'transit' are used interchangeably. Also please note that this R project has been updated multiple times as the analysis has been revised throughout the peer review process.

    #Data information:

    The data folder contains key data sets used for analysis. In particular:

    "data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.

    #Code information

    Python scripts can be found in the "supporting_code" folder.

    Each R script in this project has a role:

    "01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).

    "02_functions.R": This script contains custom functions. Load this using the `source()` function in the 01_start.R script.

    "03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
    `source()` function in the 01_start.R script.

    "04_figures_tables.R": This is the main workhouse for figure/table production and supporting analyses. This script generates the key figures and summary statistics used in the study that then get saved in the "manuscript_figures" folder. Note that all maps were produced using Python code found in the "supporting_code"" folder. Also note that within the "manuscript_figures" folder there is an "extended_data" folder, which contains tables of the summary statistics (e.g., quartiles and sample sizes) behind figures containing box plots or depicting regression coefficients.

    "supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.

    "supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.

  7. Market Basket Analysis

    • kaggle.com
    zip
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Available download formats: zip (23875170 bytes)
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a dataset containing a retailer's transaction data; it covers all transactions that occurred over a period of time. The retailer will use the results to grow its business: by suggesting relevant itemsets to customers, it can increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem using Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another data item.

    Introduction

    Association rule mining is most useful when you want to discover associations between different objects in a set. It works well for finding frequent patterns in a transaction database: it can tell you which items customers frequently buy together, and it allows the retailer to identify relationships between the items.

    An Example of Association Rules

    Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule bought computer mouse => bought mouse mat: support = P(mouse & mat) = 8/100 = 0.08; confidence = support / P(computer mouse) = 0.08/0.10 = 0.8; lift = confidence / P(mouse mat) = 0.8/0.09 ≈ 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
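    The same numbers in R, for concreteness:

    n <- 100; mouse <- 10; mat <- 9; both <- 8

    support    <- both / n               # 0.08
    confidence <- support / (mouse / n)  # 0.08 / 0.10 = 0.8
    lift       <- confidence / (mat / n) # 0.8 / 0.09 = 8.9 (approximately)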

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data, so that it is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of rules

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of Rows: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.


    Libraries in R

    First, we need to load the required libraries. Each is briefly described below.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - An opinionated collection of R packages designed for data science; it makes it easy to install and load multiple 'tidyverse' packages in a single step.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.


    Data Pre-processing

    Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.


    Next we will clean our data frame by removing missing values.


    To apply association rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
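    A sketch of that conversion and the subsequent mining with arules (assuming the data frame is named retaildata as above; the thresholds are illustrative):

    library(arules)

    items <- split(retaildata$Itemname, retaildata$BillNo)  # group items by invoice
    trans <- as(items, "transactions")

    rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.8, minlen = 2))
    inspect(head(sort(rules, by = "lift"), 5))              # strongest rules first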

  8. Physical Properties of Lakes: Exploratory Data Visualization

    • hydroshare.org
    • search.dataone.org
    zip
    Updated Jan 29, 2021
    Cite
    Gabriela Garcia; Kateri Salk (2021). Physical Properties of Lakes: Exploratory Data Visualization [Dataset]. https://www.hydroshare.org/resource/e22442bc4e4940609003b43747b366e0
    Available download formats: zip (2.9 MB)
    Dataset provided by
    HydroShare
    Authors
    Gabriela Garcia; Kateri Salk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    May 27, 1984 - Aug 17, 2016
    Description

    Exploratory Data Visualization for the Physical Properties of Lakes

    This lesson was adapted from educational material written by Dr. Kateri Salk for her Fall 2019 Hydrologic Data Analysis course at Duke University. This is the second part of a two-part exercise focusing on the physical properties of lakes.

    Introduction

    The field of limnology, the study of inland waters, uses a unique graph format to display relationships of variables by depth in a lake (the field of oceanography uses the same convention). Depth is placed on the y-axis in reverse order and the other variable(s) are placed on the x-axis. In this manner, the graph appears as if a cross section were taken from that point in the lake, with the surface at the top of the graph. This lesson introduces physical properties of lakes, namely stratification, and its visualization using the package ggplot2.
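    In ggplot2, that convention amounts to reversing the y-axis; a minimal sketch with a hypothetical lake_profile data frame:

    library(ggplot2)

    ggplot(lake_profile, aes(x = temperature_C, y = depth_m)) +
      geom_path() +
      scale_y_reverse() +  # surface at the top, depth increasing downward
      labs(x = "Temperature (°C)", y = "Depth (m)")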

    Learning Objectives

    After successfully completing this notebook, you will be able to:

    1. Investigate the concepts of lake stratification and mixing by analyzing monitoring data
    2. Apply data analytics skills to applied questions about physical properties of lakes
    3. Communicate findings with peers through oral, visual, and written modes
  9. Divvy Bikeshare

    • kaggle.com
    zip
    Updated Dec 14, 2022
    Cite
    Justina Rosario (2022). Divvy Bikeshare [Dataset]. https://www.kaggle.com/datasets/justinarosario/divvy-bikeshare
    Available download formats: zip (53940438 bytes)
    Authors
    Justina Rosario
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is my very first data analytics project. The data for it is found at https://artscience.blog/home/divvy-dataviz-case-study. I followed the R script written by Kevin Hartman; his analysis is based on the Divvy case study "'Sophisticated, Clear, and Polished': Divvy and Data Visualization".

  10. Writing Clean Code in R Workshop

    • qubeshub.org
    Updated Oct 15, 2019
    Cite
    Max Joseph; Leah Wasser (2019). Writing Clean Code in R Workshop [Dataset]. https://qubeshub.org/publications/1442
    Dataset provided by
    QUBES
    Authors
    Max Joseph; Leah Wasser
    Description

    When working with data, you often spend most of your time cleaning it. Learn how to write more efficient code using the tidyverse in R.

  11. Data from: Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jan 9, 2022
    Cite
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
    Dataset provided by
    Massachusetts General Hospital
    Harvard Medical School
    Authors
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

    Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

    Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

    Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.

    Methods eLAB Development and Source Code (R statistical software):

    eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

    eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.

    Functions were written to remap EHR bulk lab data pulls/queries from several sources including Clarity/Crystal reports or institutional EDW including Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

    The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).

    Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
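    A sketch of that key-value remapping step with tidyverse tools (the table below shows three of the ~300 lab subtypes; labs_raw and the column names are hypothetical):

    library(tidyverse)

    lookup <- tribble(
      ~ehr_name,            ~dd_code,
      "Potassium",          "potassium",
      "Potassium-External", "potassium",
      "Potassium(POC)",     "potassium"
    )

    labs_remapped <- labs_raw %>%
      left_join(lookup, by = c("lab_name" = "ehr_name")) %>%
      filter(!is.na(dd_code))  # keep only labs pre-defined by the Data Dictionary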

    Data Dictionary (DD)

    EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.

    Study Cohort

    This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

    Statistical Analysis

    OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.

  12. Physical Properties of Rivers: Querying Metadata and Discharge Data

    • hydroshare.org
    • search.dataone.org
    zip
    Updated Jan 29, 2021
    Cite
    Gabriela Garcia; Kateri Salk (2021). Physical Properties of Rivers: Querying Metadata and Discharge Data [Dataset]. https://www.hydroshare.org/resource/20dc4af8451e44b3950b182a8f506296
    Available download formats: zip (1.7 MB)
    Dataset provided by
    HydroShare
    Authors
    Gabriela Garcia; Kateri Salk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Physical Properties of Rivers: Querying Metadata and Discharge Data

    This lesson was adapted from educational material written by Dr. Kateri Salk for her Fall 2019 Hydrologic Data Analysis course at Duke University. This is the second part of a two-part exercise focusing on the physical properties of rivers.

    Introduction

    Rivers are bodies of freshwater flowing from higher elevations to lower elevations due to the force of gravity. One of the most important physical characteristics of a stream or river is discharge, the volume of water moving through the river or stream over a given amount of time. Discharge can be measured directly by measuring the velocity of flow at several spots in a stream and multiplying the flow velocity by the cross-sectional area of the stream. However, this method is effort-intensive. This exercise will demonstrate how to approximate discharge by developing a rating curve for a stream at a given sampling point. You will also learn to query metadata from and compare discharge patterns in climatically different regions of the United States.

    Learning Objectives

    After successfully completing this exercise, you will be able to:

    1. Execute queries to pull a variety of National Water Information System (NWIS) and Water Quality Portal (WQP) data into R (see the sketch below).
    2. Analyze seasonal and interannual characteristics of stream discharge and compare discharge patterns in different regions of the United States.
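    As a flavour of objective 1, one way such a query looks with the dataRetrieval package (the site number and dates here are illustrative):

    library(dataRetrieval)

    q <- readNWISdv(siteNumbers = "02085070",  # a USGS gage, for illustration
                    parameterCd = "00060",     # daily mean discharge
                    startDate = "2010-01-01",
                    endDate   = "2019-12-31")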
  13. RUNNING"calorie:heartrate

    • kaggle.com
    zip
    Updated Jan 6, 2022
    Cite
    romechris34 (2022). RUNNING"calorie:heartrate [Dataset]. https://www.kaggle.com/datasets/romechris34/wellness
    Available download formats: zip (25272804 bytes)
    Authors
    romechris34
    Description

    title: 'BellaBeat Fitbit'
    author: 'C Romero'
    date: '`r Sys.Date()`'
    output:
      html_document:
        number_sections: true
        toc: true

    ##Installation of the packages used in this analysis
    install.packages("base")       # base R utilities (normally pre-installed)
    install.packages("ggplot2")    # data visualisation
    install.packages("lubridate")  # easier handling of dates and times
    install.packages("tidyverse")  # the tidyverse metapackage
    install.packages("dplyr")      # data manipulation
    install.packages("readr")      # read rectangular text data
    install.packages("tidyr")      # data tidying
    

    Importing packages (tidyverse is a metapackage of all tidyverse packages)

    library(base)       # base R
    library(lubridate)  # make dealing with dates a little easier
    library(ggplot2)    # create elegant data visualisations using the grammar of graphics
    library(dplyr)      # a grammar of data manipulation
    library(readr)      # read rectangular text data
    library(tidyr)      # tidy messy data

    
    ## Running code

    In a notebook, you can run a single code cell by clicking in the cell and then hitting
    the blue arrow to the left, or by clicking in the cell and pressing Shift+Enter. In a script,
    you can run code by highlighting the code you want to run and then clicking the blue arrow
    at the bottom of this window.

    ## Reading in files

    list.files(path = "../input")

    # load the activity and sleep data sets
    dailyActivity <- read_csv("../input/wellness/dailyActivity_merge.csv")
    sleepDay <- read_csv("../input/wellness/sleepDay_merged.csv")
    
    

    check for duplicates and NAs

    sum(duplicated(dailyActivity))
    sum(duplicated(sleepDay))
    sum(is.na(dailyActivity))
    sum(is.na(sleepDay))

    now we will remove duplicates from sleepDay & create a new dataframe

    sleepy <- sleepDay %>% distinct()
    head(sleepy)
    head(dailyActivity)

    count the number of distinct ids in the sleepy & dailyActivity frames

    n_distinct(dailyActivity$Id)
    n_distinct(sleepy$Id)

    get the total steps and total distance for each member id

    dailyActivity %>%
      group_by(Id) %>%
      summarise(freq = sum(TotalSteps)) %>%
      arrange(-freq)

    Tot_dist <- dailyActivity %>%
      mutate(Id = as.character(Id)) %>%
      group_by(Id) %>%
      summarise(dizzy = sum(TotalDistance)) %>%
      arrange(-dizzy)

    now get the total minutes asleep & time lying in bed

    sleepy %>%
      group_by(Id) %>%
      summarise(Msleep = sum(TotalMinutesAsleep)) %>%
      arrange(Msleep)

    sleepy %>%
      group_by(Id) %>%
      summarise(inBed = sum(TotalTimeInBed)) %>%
      arrange(inBed)

    plot graphs for "in bed and sleep data" & "total steps and distance"

    ggplot(Tot_dist) +
      geom_count(mapping = aes(y = dizzy, x = Id, color = Id, fill = Id, size = 2)) +
      labs(x = "member id's", title = "distance miles") +
      theme(axis.text.x = element_text(angle = 90))
    
  14. Multilevel modeling of time-series cross-sectional data reveals the dynamic interaction between ecological threats and democratic development

    • data.niaid.nih.gov
    • dataone.org
    zip
    Updated Mar 6, 2020
    Cite
    Kodai Kusano (2020). Multilevel modeling of time-series cross-sectional data reveals the dynamic interaction between ecological threats and democratic development [Dataset]. http://doi.org/10.5061/dryad.547d7wm3x
    Dataset provided by
    University of Nevada, Reno
    Authors
    Kodai Kusano
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    What is the relationship between environment and democracy? The framework of cultural evolution suggests that societal development is an adaptation to ecological threats. Pertinent theories assume that democracy emerges as societies adapt to ecological factors such as higher economic wealth, lower pathogen threats, less demanding climates, and fewer natural disasters. However, previous research confused within-country processes with between-country processes and erroneously interpreted between-country findings as if they generalize to within-country mechanisms. In this article, we analyze a time-series cross-sectional dataset to study the dynamic relationship between environment and democracy (1949-2016), accounting for previous misconceptions in levels of analysis. By separating within-country processes from between-country processes, we find that the relationship between environment and democracy not only differs by countries but also depends on the level of analysis. Economic wealth predicts increasing levels of democracy in between-country comparisons, but within-country comparisons show that democracy declines as countries become wealthier over time. This relationship is only prevalent among historically wealthy countries but not among historically poor countries, whose wealth also increased over time. By contrast, pathogen prevalence predicts lower levels of democracy in both between-country and within-country comparisons. Our longitudinal analyses identifying temporal precedence reveal that not only reductions in pathogen prevalence drive future democracy, but also democracy reduces future pathogen prevalence and increases future wealth. These nuanced results contrast with previous analyses using narrow, cross-sectional data. As a whole, our findings illuminate the dynamic process by which environment and democracy shape each other.

    Methods Our Time-Series Cross-Sectional data combine various online databases. Country names were first identified and matched using the R package “countrycode” (Arel-Bundock, Enevoldsen, & Yetman, 2018) before all datasets were merged. Occasionally, we modified unidentified country names to be consistent across datasets. We then transformed “wide” data into “long” data and merged them using R’s Tidyverse framework (Wickham, 2014). Our analysis begins with the year 1949, occasioned by the fact that one of the key time-variant level-1 variables, pathogen prevalence, was only available from 1949 on. See our Supplemental Material for all data, Stata syntax, R-markdown for visualization, supplemental analyses and detailed results (available at https://osf.io/drt8j/).
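    A sketch of that wide-to-long step with tidyr (wide_democracy and its columns are hypothetical stand-ins for the merged databases):

    library(tidyverse)

    long <- wide_democracy %>%
      pivot_longer(cols = -country,
                   names_to = "year",
                   values_to = "democracy",
                   names_transform = list(year = as.integer))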

  15. Data and code from: Extending irrigation reservoir histories for improved groundwater modeling and conjunctive water management in two Arkansas critical groundwater areas

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Data and code from: Extending irrigation reservoir histories for improved groundwater modeling and conjunctive water management in two Arkansas critical groundwater areas [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-extending-irrigation-reservoir-histories-for-improved-groundwater-model-7e431
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    This dataset contains all data and code required to reproduce the time-to-event analysis in the associated manuscript. More detail is found in the associated README.md file.

    Contents of repository

    • Yaeger_ReservoirDataset_Oct2022.csv: comma-separated data file with reservoir characteristics, construction times, and water table depths and percent saturation at 5-year intervals from 1975-2015.
    • CGA_reservoir_analysis.Rmd: RMarkdown notebook with all code required to reproduce the time-to-event analysis in the manuscript and generate the associated plots.
    • CGA_reservoir_analysis.html: HTML file rendered from the .Rmd notebook.
    • README.md: additional details, including column descriptions from the CSV file.

    Software versions used

    • R version 4.1.2 (https://cran.r-project.org/bin/windows/base/old/4.1.2)
    • R packages: data.table v1.14.8 (https://rdatatable.gitlab.io/data.table/); ggplot2 v3.4.4 (https://ggplot2.tidyverse.org/); sf v1.0-14 (https://r-spatial.github.io/sf/); survival v3.5-5 (https://cran.r-project.org/package=survival); icenReg v2.0.15 (https://cran.r-project.org/package=icenReg)
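    Because construction times are known only to 5-year intervals, the time-to-event analysis is interval-censored; a hedged sketch of one such fit with the survival package (the column names are illustrative, not those of the CSV, and the manuscript's own analysis uses icenReg):

    library(survival)

    d <- read.csv("Yaeger_ReservoirDataset_Oct2022.csv")
    fit <- survreg(Surv(t_lower, t_upper, type = "interval2") ~ pct_saturation,
                   data = d, dist = "weibull")  # parametric interval-censored fit
    summary(fit)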

  16. Google Data Analytics: Case Study 1(Cyclistics)

    • kaggle.com
    zip
    Updated Sep 9, 2022
    Cite
    Nuyang Rai (2022). Google Data Analytics: Case Study 1(Cyclistics) [Dataset]. https://www.kaggle.com/datasets/nuyangrai/google-data-analytics-case-study-1cyclistics
    Available download formats: zip (419500618 bytes)
    Authors
    Nuyang Rai
    Description

    I downloaded the divvy_trip_data 2021 from Jan to Dec. Since Google Sheets won't let me edit and upload these files (they are way too big, exceeding 1 GB) and Google BigQuery won't let me use DML with the free account, I will be using RStudio for all the analysis, especially:
    - Tidyverse for data manipulation, exploration, and visualization
    - Palmerpenguins (if necessary, and same as tidyverse)
    - Lubridate for dates and times
    - ggplot for visualization

  17. Module M.1 R basics for data exploration and management

    • qubeshub.org
    Updated Jun 26, 2023
    Cite
    Raisa Hernández-Pacheco; Alexandra Bland (2023). Module M.1 R basics for data exploration and management [Dataset]. http://doi.org/10.25334/M9B9-8073
    Dataset provided by
    QUBES
    Authors
    Raisa Hernández-Pacheco; Alexandra Bland
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Introduction to Primate Data Exploration and Linear Modeling with R was created with the goal of providing training to undergraduate biology students on data management and statistical analysis using authentic data of Cayo Santiago rhesus macaques. Module M.1 introduces basic functions from R, as well as from its package tidyverse, for data exploration and management.

  18. Introduction to Ancient Metagenomics Textbook (Edition 2024): Introduction to R and the Tidyverse

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 13, 2024
    Cite
    Clemens Schmid (2024). Introduction to Ancient Metagenomics Textbook (Edition 2024): Introduction to R and the Tidyverse [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8413026
    Dataset provided by
    Max Planck Institute for Evolutionary Anthropology
    Authors
    Clemens Schmid
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data and conda software environment file for the chapter 'Introduction to R and the Tidyverse' of the SPAAM Community's textbook: Introduction to Ancient Metagenomics (https://www.spaam-community.org/intro-to-ancient-metagenomics-book).

  19. SPAAM Summer School 2022: Introduction to Ancient Metagenomics - 3b1 Introduction to R and the Tidyverse

    • data.niaid.nih.gov
    Updated Aug 12, 2022
    Cite
    Schmid, Clemens (2022). SPAAM Summer School 2022: Introduction to Ancient Metagenomics - 3b1 Introduction to R and the Tidyverse [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6983148
    Dataset provided by
    Max Planck Institute for Evolutionary Anthropology / Max Planck Institute for the Science of Human History
    Authors
    Schmid, Clemens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Teaching data for practical session: "3b1 Introduction to R and the Tidyverse" of the 2022 SPAAM Summer School: Introduction to Ancient Metagenomics (Aug. 1-5 2022).

    See: https://spaam-community.github.io/wss-summer-school/#/2022/ or https://doi.org/10.5281/zenodo.6976711 for slides.

    Once downloaded, run:

    tar xvfz .tar.gz

    to decompress the data directory for the session.

  20. Brisbane Library Checkout Data

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Nicholas Tierney (2020). Brisbane Library Checkout Data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2437859
    Dataset provided by
    Monash University
    Authors
    Nicholas Tierney
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Brisbane
    Description

    This has been copied from the README.md file

    bris-lib-checkout

    This provides tidied up data from the Brisbane library checkouts

    Retrieving and cleaning the data

    The script for retrieving and cleaning the data is made available in scrape-library.R.

    The data

    The data/ folder contains the tidy data

    The data-raw/ folder contains the raw data

    data/

    This contains four tidied up dataframes:

    tidy-brisbane-library-checkout.csv

    metadata_branch.csv

    metadata_heading.csv

    metadata_item_type.csv

    tidy-brisbane-library-checkout.csv contains the following columns, with the metadata file metadata_heading containing the description of these columns.

    knitr::kable(readr::read_csv("data/metadata_heading.csv"))

    > Parsed with column specification:
    > cols(
    >   heading = col_character(),
    >   heading_explanation = col_character()
    > )

    | heading          | heading_explanation                         |
    |------------------|---------------------------------------------|
    | Title            | Title of Item                               |
    | Author           | Author of Item                              |
    | Call Number      | Call Number of Item                         |
    | Item id          | Unique Item Identifier                      |
    | Item Type        | Type of Item (see next column)              |
    | Status           | Current Status of Item                      |
    | Language         | Published language of item (if not English) |
    | Age              | Suggested audience                          |
    | Checkout Library | Checkout branch                             |
    | Date             | Checkout date                               |

    We also added year, month, and day columns.
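    Such columns can be derived from the checkout timestamp with lubridate (a sketch; the repository's scrape-library.R may do this differently):

    library(dplyr)
    library(lubridate)

    checkouts <- checkouts %>%
      mutate(year  = year(datetime),
             month = month(datetime),
             day   = wday(datetime, label = TRUE))  # day stored as a label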

    The remaining data are all metadata files that contain meta information on the columns in the checkout data:

    library(tidyverse)

    > ── Attaching packages ────────────── tidyverse 1.2.1 ──
    > ✔ ggplot2 3.1.0           ✔ purrr   0.2.5
    > ✔ tibble  1.4.99.9006     ✔ dplyr   0.7.8
    > ✔ tidyr   0.8.2           ✔ stringr 1.3.1
    > ✔ readr   1.3.0           ✔ forcats 0.3.0
    > ── Conflicts ───────────────── tidyverse_conflicts() ──
    > ✖ dplyr::filter() masks stats::filter()
    > ✖ dplyr::lag()    masks stats::lag()

    knitr::kable(readr::read_csv("data/metadata_branch.csv"))

    > Parsed with column specification:
    > cols(
    >   branch_code = col_character(),
    >   branch_heading = col_character()
    > )

    | branch_code | branch_heading          |
    |-------------|-------------------------|
    | ANN         | Annerley                |
    | ASH         | Ashgrove                |
    | BNO         | Banyo                   |
    | BRR         | BrackenRidge            |
    | BSQ         | Brisbane Square Library |
    | BUL         | Bulimba                 |
    | CDA         | Corinda                 |
    | CDE         | Chermside               |
    | CNL         | Carindale               |
    | CPL         | Coopers Plains          |
    | CRA         | Carina                  |
    | EPK         | Everton Park            |
    | FAI         | Fairfield               |
    | GCY         | Garden City             |
    | GNG         | Grange                  |
    | HAM         | Hamilton                |
    | HPK         | Holland Park            |
    | INA         | Inala                   |
    | IPY         | Indooroopilly           |
    | MBG         | Mt. Coot-tha            |
    | MIT         | Mitchelton              |
    | MTG         | Mt. Gravatt             |
    | MTO         | Mt. Ommaney             |
    | NDH         | Nundah                  |
    | NFM         | New Farm                |
    | SBK         | Sunnybank Hills         |
    | SCR         | Stones Corner           |
    | SGT         | Sandgate                |
    | VAN         | Mobile Library          |
    | TWG         | Toowong                 |
    | WND         | West End                |
    | WYN         | Wynnum                  |
    | ZIL         | Zillmere                |

    knitr::kable(readr::read_csv("data/metadata_item_type.csv"))

    > Parsed with column specification:
    > cols(
    >   item_type_code = col_character(),
    >   item_type_explanation = col_character()
    > )

    | item_type_code | item_type_explanation                     |
    |----------------|-------------------------------------------|
    | AD-FICTION     | Adult Fiction                             |
    | AD-MAGS        | Adult Magazines                           |
    | AD-PBK         | Adult Paperback                           |
    | BIOGRAPHY      | Biography                                 |
    | BSQCDMUSIC     | Brisbane Square CD Music                  |
    | BSQCD-ROM      | Brisbane Square CD Rom                    |
    | BSQ-DVD        | Brisbane Square DVD                       |
    | CD-BOOK        | Compact Disc Book                         |
    | CD-MUSIC       | Compact Disc Music                        |
    | CD-ROM         | CD Rom                                    |
    | DVD            | DVD                                       |
    | DVD_R18+       | DVD Restricted - 18+                      |
    | FASTBACK       | Fastback                                  |
    | GAYLESBIAN     | Gay and Lesbian Collection                |
    | GRAPHICNOV     | Graphic Novel                             |
    | ILL            | InterLibrary Loan                         |
    | JU-FICTION     | Junior Fiction                            |
    | JU-MAGS        | Junior Magazines                          |
    | JU-PBK         | Junior Paperback                          |
    | KITS           | Kits                                      |
    | LARGEPRINT     | Large Print                               |
    | LGPRINTMAG     | Large Print Magazine                      |
    | LITERACY       | Literacy                                  |
    | LITERACYAV     | Literacy Audio Visual                     |
    | LOCSTUDIES     | Local Studies                             |
    | LOTE-BIO       | Languages Other than English Biography    |
    | LOTE-BOOK      | Languages Other than English Book         |
    | LOTE-CDMUS     | Languages Other than English CD Music     |
    | LOTE-DVD       | Languages Other than English DVD          |
    | LOTE-MAG       | Languages Other than English Magazine     |
    | LOTE-TB        | Languages Other than English Taped Book   |
    | MBG-DVD        | Mt Coot-tha Botanical Gardens DVD         |
    | MBG-MAG        | Mt Coot-tha Botanical Gardens Magazine    |
    | MBG-NF         | Mt Coot-tha Botanical Gardens Non Fiction |
    | MP3-BOOK       | MP3 Audio Book                            |
    | NONFIC-SET     | Non Fiction Set                           |
    | NONFICTION     | Non Fiction                               |
    | PICTURE-BK     | Picture Book                              |
    | PICTURE-NF     | Picture Book Non Fiction                  |
    | PLD-BOOK       | Public Libraries Division Book            |
    | YA-FICTION     | Young Adult Fiction                       |
    | YA-MAGS        | Young Adult Magazine                      |
    | YA-PBK         | Young Adult Paperback                     |

    Example usage

    Let’s explore the data

    bris_libs <- readr::read_csv("data/bris-lib-checkout.csv")

    > Parsed with column specification:
    > cols(
    >   title = col_character(),
    >   author = col_character(),
    >   call_number = col_character(),
    >   item_id = col_double(),
    >   item_type = col_character(),
    >   status = col_character(),
    >   language = col_character(),
    >   age = col_character(),
    >   library = col_character(),
    >   date = col_double(),
    >   datetime = col_datetime(format = ""),
    >   year = col_double(),
    >   month = col_double(),
    >   day = col_character()
    > )
    > Warning: 20 parsing failures.
    >    row     col expected  actual file
    > 587795 item_id a double REFRESH 'data/bris-lib-checkout.csv'
    > 590579 item_id a double REFRESH 'data/bris-lib-checkout.csv'
    > 590597 item_id a double REFRESH 'data/bris-lib-checkout.csv'
    > 595774 item_id a double REFRESH 'data/bris-lib-checkout.csv'
    > 597567 item_id a double REFRESH 'data/bris-lib-checkout.csv'
    > ...... ....... ........ ....... ............................
    > See problems(...) for more details.

    We can count the number of titles, item types, suggested age, and the library given:

    library(dplyr)
    count(bris_libs, title, sort = TRUE)

    > # A tibble: 121,046 x 2
    >    title                                 n
    >  1 Australian house and garden        1469
    >  2 New scientist (Australasian ed.)   1380
    >  3 Australian home beautiful          1331
    >  4 Country style                      1229
    >  5 The New idea                       1186
    >  6 Hello                              1133
    >  7 Woman's day                        1096
    >  8 Country life                       1056
    >  9 Better homes and gardens. (AU)     1041
    > 10 Yi Zhou Kan                         884
    > # … with 121,036 more rows

    count(bris_libs, item_type, sort = TRUE)

    > # A tibble: 69 x 2
    >    item_type       n
    >  1 PICTURE-BK 121126
    >  2 DVD         98283
    >  3 AD-PBK      91671
    >  4 JU-PBK      88402
    >  5 NONFICTION  76168
    >  6 AD-MAGS     60516
    >  7 AD-FICTION  53090
    >  8 LARGEPRINT  19113
    >  9 JU-FICTION  17261
    > 10 LOTE-BOOK   12303
    > # … with 59 more rows

    count(bris_libs, age, sort = TRUE)

    > # A tibble: 5 x 2
    >   age           n
    > 1 ADULT    420287
    > 2 JUVENILE 283902
    > 3 YA        13715
    > 4             147
    > 5 UNKNOWN      36

    count(bris_libs, library, sort = TRUE)

    > # A tibble: 38 x 2
    >    library     n
    >  1 SBK     49154
    >  2 BSQ     45968
    >  3 CNL     45642
    >  4 IPY     44569
    >  5 GCY     43090
    >  6 CDE     42775
    >  7 ASH     42086
    >  8 WYN     35124
    >  9 KEN     33947
    > 10 MTO     31201
    > # … with 28 more rows

    License

    This data is provided under a CC BY 4.0 license

    It has been downloaded from Brisbane library checkouts, and tidied up using the code in data-raw.
