35 datasets found

Collection of example datasets used for the book - R Programming -...
figshare.com
txt
Updated Dec 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kingsley Okoye; Samira Hosseini (2023). Collection of example datasets used for the book - R Programming - Statistical Data Analysis in Research [Dataset]. http://doi.org/10.6084/m9.figshare.24728073.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.24728073.v1
Dataset updated
Dec 4, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Kingsley Okoye; Samira Hosseini
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers on how to perform different types of statistical data analysis for research purposes using the R programming language. R is an open-source software and object-oriented programming language with a development environment (IDE) called RStudio for computing statistics and graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provides a wide range of functions for programming and analyzing of data. Unlike many of the existing statistical softwares, R has the added benefit of allowing the users to write more efficient codes by using command-line scripting and vectors. It has several built-in functions and libraries that are extensible and allows the users to define their own (customized) functions on how they expect the program to behave while handling the data, which can also be stored in the simple object system.For all intents and purposes, this book serves as both textbook and manual for R statistics particularly in academic research, data analytics, and computer programming targeted to help inform and guide the work of the R users or statisticians. It provides information about different types of statistical data analysis and methods, and the best scenarios for use of each case in R. It gives a hands-on step-by-step practical guide on how to identify and conduct the different parametric and non-parametric procedures. This includes a description of the different conditions or assumptions that are necessary for performing the various statistical methods or tests, and how to understand the results of the methods. The book also covers the different data formats and sources, and how to test for reliability and validity of the available datasets. Different research experiments, case scenarios and examples are explained in this book. It is the first book to provide a comprehensive description and step-by-step practical hands-on guide to carrying out the different types of statistical analysis in R particularly for research purposes with examples. Ranging from how to import and store datasets in R as Objects, how to code and call the methods or functions for manipulating the datasets or objects, factorization, and vectorization, to better reasoning, interpretation, and storage of the results for future use, and graphical visualizations and representations. Thus, congruence of Statistics and Computer programming for Research.
Bellabeat Case Study II Google Capstone Project
kaggle.com
zip
Updated Nov 18, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
NUR SİMAİŞ (2022). Bellabeat Case Study II Google Capstone Project [Dataset]. https://www.kaggle.com/datasets/nursma/bellabeat-case-study-ii-google-capstone-project
Explore at:
zip(25278847 bytes)Available download formats
Dataset updated
Nov 18, 2022
Authors
NUR SİMAİŞ
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset is retrieved from the user Mobius page, where it's generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. I woıuld like to thank Möbius and everyone responsible for the work.

Bellabeat Case Study 1 2022-11-14 1. Introduction Hello everyone, my name is Nur Simais and this project is part of Google Data Analytics Professional Certificate. There have been multiple skills and skillsets learned throughout this course that can mainly be categorized under soft and hard skills. Also, this case study I have chosen is about the company calles “Bellabeat”, a fitness tracker device. The company is founded in 2013 by Urška Sršen and Sando Mur. It gradually gained recognition and expanded in many countires.(https://bellabeat.com/) Adding this brief info about the company, I’d like to say that doing the business analysis will help the company to see how it can achieve it’s goals and what can be done as to improve more.

During the analysis process, I will be using the Google’s “Ask-Prepare-Process-Analyze-Share-Act” Framework that I have learned throughout this certification and apply the tools and skillsets into it.

1.ASK

1.1 Business Task The goal of this project is to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices and how to apply these insights into Bellabeat’s marketing strategy using these three questions:

What are some trends in smart device usage? How could these trends apply to Bellabeat customers? How could these trends help influence Bellabeat marketing strategy?

2.PREPARE Prepare the Data and Libraries in RStudio Collect the data required for analysis but since the data is available on Kaggle publicly, FitBit Fitness Tracker Data (CC0: Public Domain) and download the dataset.

There are 18 packages but after examining the excel docs, I decided to use these 8 datasets: dailyActivity_merged.csv, heartrate_seconds_merged.csv, hourlyCalories_merged.csv, hourlyIntensities_merged.csv, hourlySteps_merged.csv, minuteMETsNarrow_merged.csv, sleepDay_merged.csv, weightLogInfo_merged.csv 2.1 Install and load the packages Install the RStudio libraries for analysis and visualizations

install.packages("tidyverse") # core package for cleaning and analysis

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'

(as 'lib' is unspecified)

install.packages("lubridate") # date library mdy()

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'

(as 'lib' is unspecified)

install.packages("janitor") # clean_names() to consists only _, character, numbers, and letters.

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'

(as 'lib' is unspecified)

install.packages("dplyr") #helps to check the garmmar of data manioulation

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'

(as 'lib' is unspecified)

Load the libraries

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──

✔ ggplot2 3.4.0 ✔ purrr 0.3.5

✔ tibble 3.1.8 ✔ dplyr 1.0.10

✔ tidyr 1.2.1 ✔ stringr 1.4.1

✔ readr 2.1.3 ✔ forcats 0.5.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──

✖ dplyr::filter() masks stats::filter()

✖ dplyr::lag() masks stats::lag()

library(janitor) ##

Attaching package: 'janitor'

##

The following objects are masked from 'package:stats':

##

chisq.test, fisher.test

library(lubridate)

Loading required package: timechange

##

Attaching package: 'lubridate'

##

The following objects are masked from 'package:base':

##

date, intersect, setdiff, union

library(dplyr) Having loaded tidyverse package, the rest of the essential packages (ggplot2, dplyr, and tidyr) are loaded as well.

2.2 Importing and Preparing the Dataset Upload the archived dataset to RStudio by clicking the Upload button in the bottom right pane.

File will be saved in a new folder named “Fitabase Data 4.12.16-5.12.16”. Importing the datasets and renaming them.

Loading your CSV files

daily_activity <- read.csv("dailyActivity_merged.csv") heartrate_seconds <- read_csv("heartrate_seconds_merged.csv")

Rows: 2483658 Columns: 3

── Column specification ────────────────────────────────────────────────────────

Delimiter: ","

chr (1): Time

dbl (2): Id, Value

##

ℹ Use spec() to retrieve the full column specification for this data.

ℹ Specify the column types or set show_col_types = FALSE to quiet this message.

hourly_calories <- read_csv("hourlyCalories_merged.csv")

Rows: 22099 Columns: 3

── Column specification ─────────────────────────────────────...
f
Database and Rstudio code
figshare.com
xlsx
Updated May 8, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Iker Madinabeitia Cabrera (2024). Database and Rstudio code [Dataset]. http://doi.org/10.6084/m9.figshare.25772268.v1
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.25772268.v1
Dataset updated
May 8, 2024
Dataset provided by
figshare
Authors
Iker Madinabeitia Cabrera
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: Executive functions, notably inhibition, significantly influence decision-making and behavioral regulation in team sports. However, more research must be conducted on individual player characteristics such as experience and motor skills. This study assessed how accumulated practical experience moderates inhibition in response to varying task difficulty levels. Methods: Forty-four university students (age: 20.36 ± 3.13 years) participated in this study with two sessions: one followed standard 1 × 1 basketball rules (“Regular Practice”), while the other imposed motor, temporal, and spatial restrictions (“Restriction Practice”). Functional difficulty was controlled by grouping pairs with similar skill levels. Flanker and Go-Nogo tests were used. Results: Increasing complexity worsened cognitive performance (inhibition). “Restriction Practice” showed a significantly slower and less accurate performance in both tests than “Regular Practice” (p < 0.001). Experience positively impacted test speed and accuracy (p < 0.001). Conclusion: In sports, acute cognitive impacts are intrinsically linked to the task’s complexity and the athlete’s cognitive resources. In this sense, it is essential to adjust individually the cognitive demands of the tasks, considering each athlete’s specific cognitive abilities and capacities.
q
Large Datasets in R - Plant Phenology & Temperature Data from NEON
qubeshub.org
Updated May 10, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Megan Jones Patterson; Lee Stanish; Natalie Robinson; Katherine Jones; Cody Flagg (2018). Large Datasets in R - Plant Phenology & Temperature Data from NEON [Dataset]. http://doi.org/10.25334/Q4DQ3F
Explore at:
Unique identifier
https://doi.org/10.25334/Q4DQ3F
Dataset updated
May 10, 2018
Dataset provided by
QUBES
Authors
Megan Jones Patterson; Lee Stanish; Natalie Robinson; Katherine Jones; Cody Flagg
Description
This module series covers how to import, manipulate, format and plot time series data stored in .csv format in R. Originally designed to teach researchers to use NEON plant phenology and air temperature data; has been used in undergraduate classrooms.
Using Descriptive Statistics to Analyse Data in R
kaggle.com
zip
Updated May 9, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Enrico68 (2024). Using Descriptive Statistics to Analyse Data in R [Dataset]. https://www.kaggle.com/datasets/enrico68/using-descriptive-statistics-to-analyse-data-in-r
Explore at:
zip(105561 bytes)Available download formats
Dataset updated
May 9, 2024
Authors
Enrico68
Description
Load and view a real-world dataset in RStudio

• Calculate “Measure of Frequency” metrics

• Calculate “Measure of Central Tendency” metrics

• Calculate “Measure of Dispersion” metrics

• Use R’s in-built functions for additional data quality metrics

• Create a custom R function to calculate descriptive statistics on any given dataset

Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

zenodo.org

application/gzip, bin +2

Updated Aug 2, 2024

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788

Explore at:

bin, application/gzip, zip, text/x-pythonAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.1419788

Dataset updated

Aug 2, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb

License

https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.htmlhttps://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

Description

Replication pack, FSE2018 submission #164:
------------------------------------------

**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. 
Link to the code will be included in the Camera Ready version as well.


Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
 described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
 This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
 statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
 themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data 
  (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
  `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
  **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)

Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on detalization level (see Step 2 for more details):
- up to 2Tb of disk space (see Step 2 detalization levels)
- at least 16Gb of RAM (64 preferable)
- few hours to few month of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

   git clone https://gitlab.com/user2589/ghd.git
   git checkout 0.1.0
 
 `cd` into the extracted folder. 
 All commands below assume it as a current directory.
  
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
- install docker. For Ubuntu Linux, the command is 
  `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
 Without this dependency, you might get an error on the next step, 
 but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt` . 
- disable all APIs except GitHub (Bitbucket and Gitlab support were
 not yet implemented when this study was in progress): edit
 `scraper/init.py`, comment out everything except GitHub support
 in `PROVIDERS`.

Step 2 - obtaining the dataset
-----------------------------

The ultimate goal of this step is to get output of the Python function 
`common.utils.survival_data()` and save it into a CSV file:

  # copy and paste into a Python console
  from common import utils
  survival_data = utils.survival_data('pypi', '2008', smoothing=6)
  survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speedup
the process:

####Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

####Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table.
The whole process will take 15..30 minutes.

- create a folder `

f
Open data: Visual load effects on the auditory steady-state responses to...
su.figshare.com
demo.researchdata.se
+2more
txt
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Stefan Wiens; Malina Szychowska (2023). Open data: Visual load effects on the auditory steady-state responses to 20-, 40-, and 80-Hz amplitude-modulated tones [Dataset]. http://doi.org/10.17045/sthlmuni.12582002.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.17045/sthlmuni.12582002.v1
Dataset updated
May 30, 2023
Dataset provided by
Stockholm University
Authors
Stefan Wiens; Malina Szychowska
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The main results file are saved separately:- ASSR2.html: R output of the main analyses (N = 33)- ASSR2_subset.html: R output of the main analyses for the smaller sample (N = 25)FIGSHARE METADATACategories- Biological psychology- Neuroscience and physiological psychology- Sensory processes, perception, and performanceKeywords- crossmodal attention- electroencephalography (EEG)- early-filter theory- task difficulty- envelope following responseReferences- https://doi.org/10.17605/OSF.IO/6FHR8- https://github.com/stamnosslin/mn- https://doi.org/10.17045/sthlmuni.4981154.v3- https://biosemi.com/- https://www.python.org/- https://mne.tools/stable/index.html#- https://www.r-project.org/- https://rstudio.com/products/rstudio/GENERAL INFORMATION1. Title of Dataset:Open data: Visual load effects on the auditory steady-state responses to 20-, 40-, and 80-Hz amplitude-modulated tones2. Author Information A. Principal Investigator Contact Information Name: Stefan Wiens Institution: Department of Psychology, Stockholm University, Sweden Internet: https://www.su.se/profiles/swiens-1.184142 Email: sws@psychology.su.se B. Associate or Co-investigator Contact Information Name: Malina Szychowska Institution: Department of Psychology, Stockholm University, Sweden Internet: https://www.researchgate.net/profile/Malina_Szychowska Email: malina.szychowska@psychology.su.se3. Date of data collection: Subjects (N = 33) were tested between 2019-11-15 and 2020-03-12.4. Geographic location of data collection: Department of Psychology, Stockholm, Sweden5. Information about funding sources that supported the collection of the data:Swedish Research Council (Vetenskapsrådet) 2015-01181SHARING/ACCESS INFORMATION1. Licenses/restrictions placed on the data: CC BY 4.02. Links to publications that cite or use the data: Szychowska M., & Wiens S. (2020). Visual load effects on the auditory steady-state responses to 20-, 40-, and 80-Hz amplitude-modulated tones. Submitted manuscript.The study was preregistered:https://doi.org/10.17605/OSF.IO/6FHR83. Links to other publicly accessible locations of the data: N/A4. Links/relationships to ancillary data sets: N/A5. Was data derived from another source? No 6. Recommended citation for this dataset: Wiens, S., & Szychowska M. (2020). Open data: Visual load effects on the auditory steady-state responses to 20-, 40-, and 80-Hz amplitude-modulated tones. Stockholm: Stockholm University. https://doi.org/10.17045/sthlmuni.12582002DATA & FILE OVERVIEWFile List:The files contain the raw data, scripts, and results of main and supplementary analyses of an electroencephalography (EEG) study. Links to the hardware and software are provided under methodological information.ASSR2_experiment_scripts.zip: contains the Python files to run the experiment. ASSR2_rawdata.zip: contains raw datafiles for each subject- data_EEG: EEG data in bdf format (generated by Biosemi)- data_log: logfiles of the EEG session (generated by Python)ASSR2_EEG_scripts.zip: Python-MNE scripts to process the EEG dataASSR2_EEG_preprocessed_data.zip: EEG data in fif format after preprocessing with Python-MNE scriptsASSR2_R_scripts.zip: R scripts to analyze the data together with the main datafiles. The main files in the folder are: - ASSR2.html: R output of the main analyses- ASSR2_subset.html: R output of the main analyses but after excluding eight subjects who were recorded as pilots before preregistering the studyASSR2_results.zip: contains all figures and tables that are created by Python-MNE and R.METHODOLOGICAL INFORMATION1. Description of methods used for collection/generation of data:The auditory stimuli were amplitude-modulated tones with a carrier frequency (fc) of 500 Hz and modulation frequencies (fm) of 20.48 Hz, 40.96 Hz, or 81.92 Hz. The experiment was programmed in python: https://www.python.org/ and used extra functions from here: https://github.com/stamnosslin/mnThe EEG data were recorded with an Active Two BioSemi system (BioSemi, Amsterdam, Netherlands; www.biosemi.com) and saved in .bdf format.For more information, see linked publication.2. Methods for processing the data:We conducted frequency analyses and computed event-related potentials. See linked publication3. Instrument- or software-specific information needed to interpret the data:MNE-Python (Gramfort A., et al., 2013): https://mne.tools/stable/index.html#Rstudio used with R (R Core Team, 2020): https://rstudio.com/products/rstudio/Wiens, S. (2017). Aladins Bayes Factor in R (Version 3). https://www.doi.org/10.17045/sthlmuni.4981154.v34. Standards and calibration information, if appropriate:For information, see linked publication.5. Environmental/experimental conditions:For information, see linked publication.6. Describe any quality-assurance procedures performed on the data:For information, see linked publication.7. People involved with sample collection, processing, analysis and/or submission:- Data collection: Malina Szychowska with assistance from Jenny Arctaedius.- Data processing, analysis, and submission: Malina Szychowska and Stefan WiensDATA-SPECIFIC INFORMATION:All relevant information can be found in the MNE-Python and R scripts (in EEG_scripts and analysis_scripts folders) that process the raw data. For example, we added notes to explain what different variables mean.
Walmart Data Set
kaggle.com
zip
Updated Jan 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Matthew Garrett Carter (2023). Walmart Data Set [Dataset]. https://www.kaggle.com/datasets/matthewgarrettcarter/walmart-data-set
Explore at:
zip(272320 bytes)Available download formats
Dataset updated
Jan 4, 2023
Authors
Matthew Garrett Carter
Description
Introduction

The purpose of this project was added practice in learning new and demonstrate R Data analytical skills. The data set was located in Kaggle and shows sales information from the years 2010 to 2012. The weekly sales have two categories: holiday and non holiday representing 1 and 0 in that column respectfully.

The main question for this exercise was were there any factors that affected weekly sales for the stores? Those factors included temperature, fuel prices, and unemployment rates.

The following packages required for this project:

install.packages("tidyverse") install.packages("dplyr") install.packages("tsibble")

The following libraries required:

library("tidyverse") library(readr) library(dplyr) library(ggplot2) library(readr) library(lubridate) library(tsibble)

Downloading data set into RStudio:

Walmart <- read.csv("C:/Users/matth/OneDrive/Desktop/Case Study/Walmart.csv")

Data Inspection

Compared column names of each file to verify consistency.

colnames(Walmart) colnames(Walmart) dim(Walmart) str(Walmart) head(Walmart) which(is.na(Walmart$Date)) sum(is.na(Walmart))

There is NA data in the set.

Turning Store and Holiday_flag into factors:

Walmart$Store<-as.factor(Walmart$Store) Walmart$Holiday_Flag<-as.factor(Walmart$Holiday_Flag)

Splicing the date into Year and weekyear:

Walmart$week<-yearweek(as.Date(Walmart$Date,tryFormats=c("%d-%m-%Y"))) # make sure to install "tsibble" Walmart$year<-format(as.Date(Walmart$Date,tryFormats=c("%d-%m-%Y")),"%Y")

Filered Holiday_Flag Column to include only holidays weeks:

Walmart_Holiday<- filter(Walmart, Holiday_Flag==1)

Filered Holiday_Flag Column to include only non holidays Weeks:

Walmart_Non_Holiday<- filter(Walmart, Holiday_Flag==0)

Lets review all 45 stores' weekly sales and compare them. Using dataset Walmart

ggplot(Walmart, aes(x=Weekly_Sales, y=Store))+geom_boxplot()+ labs(title = 'Weekly Sales Accross 45 Stores', x='Weekly sales', y='Store')+theme_bw()

Results

From observation of the boxplot, it shows that Store 14 had max sales while Store 33 had the min sales.

Lets verify the results via slice_max and slice_min:

Walmart %>% slice_max(Weekly_Sales) Walmart %>% slice_min(Weekly_Sales)

It looks the information was correct. Lets check the mean for the weekly_sales column:

mean(Walmart$Weekly_Sales)

The mean for Weekly_Sales column for the Walmart dataset was 1046965.

Lets check for the MIN and MAX of Weekly Sales but only if they are holiday sales weeks:

ggplot(Walmart_Holiday, aes(x=Weekly_Sales, y=Store))+geom_boxplot()+ labs(title = 'Holiday Sales Accross 45 Stores', x='Weekly sales', y='Store')+theme_bw()

Result

Store 4 had the highest weekly sales during a holiday week based on the boxplot. Boxplot shows stores 33 and 5 as some of the lowest holiday sales.Lets reverify with slice_max and slice_min:

Walmart_Holiday %>% slice_max(Weekly_Sales) Walmart_Holiday %>% slice_min(Weekly_Sales)

The results match what is given on the boxplot. Lets find the mean:

mean(Walmart_Holiday$Weekly_Sales)

The result was that the mean was 1122888.

Lets check for the MIN and MAX of Weekly Sales but only if they are non holiday sales weeks:

ggplot(Walmart_Non_Holiday, aes(x=Weekly_Sales, y=Store))+geom_boxplot()+ labs(title = 'Non Holiday Sales Accross 45 Stores', x='Weekly sales', y='Store')+theme_bw()

Lets matched the results of the Walmart dataset that had both non holiday weeks and holiday weeks. Store 14 had the max sales and store 33 had the minimum sales. Lets verify the results and find the mean:

Walmart_Non_Holiday %>% slice_max(Weekly_Sales) Walmart_Non_Holiday %>% slice_min(Weekly_Sales) mean(Walmart_Non_Holiday$Weekly_Sales)

Results matched. And the mean for weekly sales was 1041256.

Which Year had the most sales?

ggplot(data = Walmart) + geom_point(mapping = aes(x=year, y=Weekly_Sales))

According the plot, 2010 had the most sales. Lets use a boxplot to see more.

ggplot(Walmart, aes(x=year, y=Weekly_Sales))+geom_boxplot()+ labs(title = 'Weekly Sales for Years 2010 - 2012', x='Year', y='Weekly Sales')

2010 Saw higher sales numbers and higher medium

Is there any differance between Sales during no Holiday weeks and Holiday weeks?

Lets start with holiday weekly sales:

ggplot(Walmart_Holiday, aes(x=year, y=Weekly_Sales))+geom_boxplot()+ labs(title = 'Holiday Weekly Sales for Years ...
t
Manipulating data using R
test.researchdata.tuwien.at
bin, pdf, txt
Updated Nov 27, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vseslav Levchenko; Vseslav Levchenko; Vseslav Levchenko; Vseslav Levchenko (2024). Manipulating data using R [Dataset]. http://doi.org/10.70124/5rrjk-ey181
Explore at:
bin, pdf, txtAvailable download formats
Unique identifier
https://doi.org/10.70124/5rrjk-ey181
Dataset updated
Nov 27, 2024
Dataset provided by
TU Wien
Authors
Vseslav Levchenko; Vseslav Levchenko; Vseslav Levchenko; Vseslav Levchenko
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Oct 30, 2023
Description
Data created during Computer Statistics assignment

Context and methodology

This is used for the project in the context of the "Introduction to Research Data Management" course, 2024 winter semester. Originally it was made for a homework assignment in the "Computer Statistics" course, 2023 winter semester.

The dataset consists of the following: code (and comment) written in the R markdown language that is to be compiled and executed in order to generate the 2 datasets created in the project; .pdf file generated from compiling and executing the aforementioned R code using RStudio; .txt file generated as part of one of the exercises in the assignment, also by compiling and executing the R code.

The code was written by Vseslav Levchenko in R, using RStudio.

Technical details

The code was written in RStudio and it is recommended to use it when working with R, however it is not strictly necessary. However, it is required to install the R language itself. For the other files, standard software like Microsoft Excel and any PDF reader are all that is needed.

The code also contains necessary comments, and a .pdf file with the assignment's tasks is provided separately.
Data set: St. Louis River Watershed, MN Conductivity Assessment March 2022
catalog.data.gov
datasets.ai
Updated Jul 18, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2025). Data set: St. Louis River Watershed, MN Conductivity Assessment March 2022 [Dataset]. https://catalog.data.gov/dataset/data-set-st-louis-river-watershed-mn-conductivity-assessment-march-2022
Explore at:
Dataset updated
Jul 18, 2025
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Area covered
Minnesota, Saint Louis River
Description
Data used to evaluate potential downstream impacts of the NorthMet Mine, by USEPA Office of Research and Development is providing, for USEPA Region 5’s use, including a characterization of stream specific conductivity (SC) levels, least disturbed background SC, and SC levels that may exceed the Fond du Lac Band’s WQ standards and adversely affect aquatic life, including brook trout (Salvelinus fontinalis), lake sturgeon (Acipenser fulvescens), and benthic macroinvertebrates. Keywords: Conductivity, St. Louis River, benthic invertebrates; mining The attached Excel Pedigree includes: _Datasets: Data file uploaded to EPA Science Hub and/or Environmental Data Set Gateway _R : Clean R scripts used to generate document figures and tables _Tables_Figures: Files generated from R script and used in the Region 5 memo 20220325 R Code and Data: All additional files used for this project, including original files, intermediate files, extra output files, and extra functions. The "_R" folder contains four subfolders. Each subfolder has several R scripts, input and output files, and an R project file. Users can run R scripts directly from each subfolder by installing R, RStudio, and associated R packages. Data Dictionary: See tab DataDictionary in Excel file Datasets: Simplified language is used in the text to identify parent data sets. Source and File names are retained in this pedigree in original form to enable R-scripts to retain functionality. • Thingvold et al. (1975-1977) • Griffith (1998-2009) • Predicted background (2000-2015) • Water Quality Portal (1996-2021) • Water Quality Portal Less Disturbed (1996-2021) • Minnesota Pollution Control Agency (MPCA) (1996-2013) • Mid-Atlantic Highlands (1990 to 2014). This dataset is associated with the following publication: Cormier, S., and Y. Wang. Appendix C: ORD Specific Conductance Memo, from Susan Cormier to Tera Fong. March 15, 2022. Assessment of effects of increased ion concentrations in the St. Louis River Watershed with special attention to potential mining influence and the jurisdiction of the Fond du Lac Band of Lake Superior Chippewa. U.S. Environmental Protection Agency, Washington, DC, USA, 2022.
Data and tools for studying isograms
figshare.com
Updated Jul 31, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
Explore at:
application/x-sqlite3Available download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.5245810.v1
Dataset updated
Jul 31, 2017
Dataset provided by
Figsharehttp://figshare.com/
Authors
Florian Breit
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A collection of datasets and python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC).Below follows a brief description, first, of the included datasets and, second, of the included scripts.1. DatasetsThe data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.1.1 CSV formatThe CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.The CSV files contain one row per data point, with the colums separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure, see section below):

Label Data type Description

isogramy int The order of isogramy, e.g. "2" is a second order isogram

length int The length of the word in letters

word text The actual word/isogram in ASCII

source_pos text The Part of Speech tag from the original corpus

count int Token count (total number of occurences)

vol_count int Volume count (number of different sources which contain the word)

count_per_million int Token count per million words

vol_count_as_percent int Volume count as percentage of the total number of volumes

is_palindrome bool Whether the word is a palindrome (1) or not (0)

is_tautonym bool Whether the word is a tautonym (1) or not (0)

The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

Label

Data type

Description

!total_1grams

int

The total number of words in the corpus

!total_volumes

int

The total number of volumes (individual sources) in the corpus

!total_isograms

int

The total number of isograms found in the corpus (before compacting)

!total_palindromes

int

How many of the isograms found are palindromes

!total_tautonyms

int

How many of the isograms found are tautonyms

The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.1.2 SQLite database formatOn the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:• Compacted versions of each dataset, where identical headwords are combined into a single entry.• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.The intersected dataset is by far the least noisy, but is missing some real isograms, too.The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above.To get an idea of the various ways the database can be queried for various bits of data see the R script described below, which computes statistics based on the SQLite database.2. ScriptsThere are three scripts: one for tiding Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).2.1 Source dataThe scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and [https://www.kilgarriff.co.uk/bnc-readme.html], (download all.al.gz).For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.2.2 Data preparationBefore processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format.Tidying and reformatting can be done by running one of the following commands:python isograms.py --ngrams --indir=INDIR --outfile=OUTFILEpython isograms.py --bnc --indir=INFILE --outfile=OUTFILEReplace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.2.3 Isogram ExtractionAfter preparing the data as above, isograms can be extracted from by running the following command on the reformatted and tidied files:python isograms.py --batch --infile=INFILE --outfile=OUTFILEHere INFILE should refer the the output from the previosu data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.2.4 Creating a SQLite3 databaseThe output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:1. Make sure the files with the Ngrams and BNC data are named “ngrams-isograms.csv” and “bnc-isograms.csv” respectively. (The script assumes you have both of them, if you only want to load one, just create an empty file for the other one).2. Copy the “create-database.sql” script into the same directory as the two data files.3. On the command line, go to the directory where the files and the SQL script are. 4. Type: sqlite3 isograms.db 5. This will create a database called “isograms.db”.See the section 1 for a basic descript of the output data and how to work with the database.2.5 Statistical processingThe repository includes an R script (R version 3) named “statistics.r” that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
[Dataset] Does Volunteer Engagement Pay Off? An Analysis of User...
data.europa.eu
recerca.uoc.edu
+3more
unknown
Updated Nov 27, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zenodo (2022). [Dataset] Does Volunteer Engagement Pay Off? An Analysis of User Participation in Online Citizen Science Projects [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-7357747?locale=el
Explore at:
unknown(10386572)Available download formats
Dataset updated
Nov 27, 2022
Dataset authored and provided by
Zenodohttp://zenodo.org/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Explanation/Overview: Corresponding dataset for the analyses and results achieved in the CS Track project in the research line on participation analyses, which is also reported in the publication "Does Volunteer Engagement Pay Off? An Analysis of User Participation in Online Citizen Science Projects", a conference paper for the conference CollabTech 2022: Collaboration Technologies and Social Computing and published as part of the Lecture Notes in Computer Science book series (LNCS,volume 13632) here. The usernames have been anonymised. Purpose: The purpose of this dataset is to provide the basis to reproduce the results reported in the associated deliverable, and in the above-mentioned publication. As such, it does not represent raw data, but rather files that already include certain analysis steps (like calculated degrees or other SNA-related measures), ready for analysis, visualisation and interpretation with R. Relatedness: The data of the different projects was derived from the forums of 7 Zooniverse projects based on similar discussion board features. The projects are: 'Galaxy Zoo', 'Gravity Spy', 'Seabirdwatch', 'Snapshot Wisconsin', 'Wildwatch Kenya', 'Galaxy Nurseries', 'Penguin Watch'. Content: In this Zenodo entry, several files can be found. The structure is as follows (files and folders and descriptions). corresponding_calculations.html Quarto-notebook to view in browser corresponding_calculations.qmd Quarto-notebook to view in RStudio assets data annotations annotations.csv List of annotations made per day for each of the analysed projects comments comments.csv Total list of comments with several data fields (i.e., comment id, text, reply_user_id) rolechanges 478_rolechanges.csv List of roles per user to determine number of role changes 1104_rolechanges.csv ... ... totalnetworkdata Edges 478_edges.csv Network data (edge set) for the given projects (without time slices) 1104_edges.csv ... ... Nodes 478_nodes.csv Network data (node set) for the given projects (without time slices) 1104_nodes.csv ... ... trajectories Network data (edge and node sets) for the given projects and all time slices (Q1 2016 - Q4 2021) 478 Edges edges_4782016_q1.csv edges_4782016_q2.csv edges_4782016_q3.csv edges_4782016_q4.csv ... Nodes nodes_4782016_q1.csv nodes_4782016_q4.csv nodes_4782016_q3.csv nodes_4782016_q2.csv ... 1104 Edges ... Nodes ... ... scripts datavizfuncs.R script for the data visualisation functions, automatically executed from within corresponding_calculations.qmd import.R script for the import of data, automatically executed from within corresponding_calculations.qmd corresponding_calculations_files files for the html/qmd view in the browser/RStudio Grouping: The data is grouped according to given criteria (e.g., project_title or time). Accordingly, the respective files can be found in the data structure
r
Data from: Working with a linguistic corpus using R: An introductory note...
researchdata.edu.au
bridges.monash.edu
Updated May 5, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gede Primahadi Wijaya Rajeg; I Made Rajeg; Karlina Denistia (2022). Working with a linguistic corpus using R: An introductory note with Indonesian Negating Construction [Dataset]. http://doi.org/10.4225/03/5a7ee2ac84303
Explore at:
Unique identifier
https://doi.org/10.4225/03/5a7ee2ac84303
Dataset updated
May 5, 2022
Dataset provided by
Monash University
Authors
Gede Primahadi Wijaya Rajeg; I Made Rajeg; Karlina Denistia
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This is a repository for codes and datasets for the open-access paper in Linguistik Indonesia, the flagship journal for the Linguistic Society of Indonesia (Masyarakat Linguistik Indonesia [MLI]) (cf. the link in the references below).

To cite the paper (in APA 6th style):
Rajeg, G. P. W., Denistia, K., & Rajeg, I. M. (2018). Working with a linguistic corpus using R: An introductory note with Indonesian negating construction. Linguistik Indonesia, 36(1), 1–36. doi: 10.26499/li.v36i1.71

To cite this repository:
Click on the Cite (dark-pink button on the top-left) and select the citation style through the dropdown button (default style is Datacite option (right-hand side)

This repository consists of the following files:
1. Source R Markdown Notebook (.Rmd file) used to write the paper and containing the R codes to generate the analyses in the paper.
2. Tutorial to download the Leipzig Corpus file used in the paper. It is freely available on the Leipzig Corpora Collection Download page.
3. Accompanying datasets as images and .rds format so that all code-chunks in the R Markdown file can be run.
4. BibLaTeX and .csl files for the referencing and bibliography (with APA 6th style).
5. A snippet of the R session info after running all codes in the R Markdown file.
6. RStudio project file (.Rproj). Double click on this file to open an RStudio session associated with the content of this repository. See here and here for details on Project-based workflow in RStudio.
7. A .docx template file following the basic stylesheet for Linguistik Indonesia

Put all these files in the same folder (including the downloaded Leipzig corpus file)!

To render the R Markdown into MS Word document, we use the bookdown R package (Xie, 2018). Make sure this package is installed in R.

Yihui Xie (2018). bookdown: Authoring Books and Technical Documents with R Markdown. R package version 0.6.
Bitter Creek Analysis Pedigree Data
catalog.data.gov
s.cnmilf.com
Updated Sep 25, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2022). Bitter Creek Analysis Pedigree Data [Dataset]. https://catalog.data.gov/dataset/bitter-creek-analysis-pedigree-data
Explore at:
Dataset updated
Sep 25, 2022
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Description
These data sets contain raw and processed data used in for analyses, figures, and tables in the Region 8 Memo: Characterization of chloride and conductivity levels in the Bitter Creek Watershed, WY. However, these data may be used for other analyses alone or in combination with other or new data. These data were used to assess whether chloride levels are naturally high in streams in the Bitter Creek, WY watershed and how chloride concentrations expected to protect 95 percent of aquatic genera in these streams compare to Wyoming’s chloride criteria applicable to the Bitter Creek watershed. Owing to the arid conditions, background conductivity and chloride levels were characterized for surface flow and ground water flow conditions. Natural chloride levels were found to be less than current water quality criteria for Wyoming. Although the report was prepared for USEPA Region 8 and OST, Office of Water, the report will be of interest to the WDEQ, Sweetwater County Conservation District, and the regulated community. No formal metadata standard was used. Pedigree.xlsx contains: 1. NOTES: Description of work and other worksheets. 2. Pedigree_Summary: Source files used to create figures and tables. 3. DataFiles: Data files used in the R code for creating the figures and tables 4. R_Script: Summary of the R scripts. 5. DataDictionary: Data file titles in all data files Folders: _Datasets Data file uploaded to Environmental Dataset Gateway "A list of subfolders: _R: Clean R scripts used to generate document figures and tables _Tables_Figures: Files generated from R script and used in the Region 6 memo R Code and Data: All additional files used for this project, including original files, intermediate files, extra output files, and extra functions the ""_R"" folder stores R scripts for input and output files and an R project file.. Users can open the R project and run R scripts directly from the ""_R"" folder or the XC95 folder by installing R, RStudio, and associated R packages."
d
Data from: Plant-pollinator specialization: Origin and measurement of...
datadryad.org
dataone.org
+1more
zip
Updated Oct 26, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mannfred Boehm; Jill Jankowski; Quentin Cronk (2021). Plant-pollinator specialization: Origin and measurement of curvature [Dataset]. http://doi.org/10.5061/dryad.g1jwstqrr
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.g1jwstqrr
Dataset updated
Oct 26, 2021
Dataset provided by
Dryad
Authors
Mannfred Boehm; Jill Jankowski; Quentin Cronk
Time period covered
Sep 24, 2021
Description
This dataset is primarily an RStudio project. To run the .R files, we recommend first opening the .RProj file in RStudio and installing the package here. This will allow you to run all of the .R scripts without changing any of the working directories.
E
Data from: The Low Dimensionality of Development
edmond.mpg.de
application/gzip, tar
Updated Mar 11, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Guido Kraemer; Markus Reichstein; Gustau Camps-Valls; Jeroen Smits; Miguel Mahecha; Guido Kraemer; Markus Reichstein; Gustau Camps-Valls; Jeroen Smits; Miguel Mahecha (2020). The Low Dimensionality of Development [Dataset]. http://doi.org/10.17617/3.3I
Explore at:
tar(2420220416), application/gzip(65695884)Available download formats
Unique identifier
https://doi.org/10.17617/3.3I
Dataset updated
Mar 11, 2020
Dataset provided by
Edmond
Authors
Guido Kraemer; Markus Reichstein; Gustau Camps-Valls; Jeroen Smits; Miguel Mahecha; Guido Kraemer; Markus Reichstein; Gustau Camps-Valls; Jeroen Smits; Miguel Mahecha
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This contains a docker container and the prepared data to do the analysis in the paper title: "The Low Dimensionality of Development" published in "Social Indicators Research" 2020 by Kraemer et al.. If you use code or data from in this repository cite the paper. This repository contains a copy of the World Development Indicators database from May 2018 [1]. The analysis was run on a 48 core cluster node with 256GB of ram and takes some hours to complete. Afterthis generating all results, they can be loaded into slightly over 16GB of RAM. Unpack 'docker_data.tar.gz' into a '/path/to/data' on 'host' and run as: docker load -i dockerimage.tar docker run -v "/path/to/data":/home/rstudio/data -v "/path/to/figures":/home/rstudio/fig -p 8989:8787 -e PASSWORD=secret_password development-indicators then open 'host:8989' in your browser and an RStudio session should appear. You can login with the user "rstudio" and the password you specified before. NOTE: We did not set a random seed, so slight variations between runs will occur. [1] License: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/) By: The World Bank The original data can be found here: https://datacatalog.worldbank.org/dataset/world-development-indicators
FakeCovid Fact-Checked News Dataset
kaggle.com
zip
Updated Feb 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Devastator (2023). FakeCovid Fact-Checked News Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/fakecovid-fact-checked-news-dataset
Explore at:
zip(19911252 bytes)Available download formats
Dataset updated
Feb 1, 2023
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
FakeCovid Fact-Checked News Dataset

International Coverage of COVID-19 in 40 Languages from 105 Countries

By [source]

About this dataset

The FakeCovid dataset is an unparalleled compilation of 7623 fact-checked news articles related to COVID-19. Obtained from 92 fact-checking websites located in 105 countries, this comprehensive collection covers a wide range of sources and languages, including locations across Africa, Europe, Asia, The Americas and Oceania. With data gathered from references on Poynter and Snopes, this unique dataset is an invaluable resource for researching the accuracy of global news related to the pandemic. It offers an invaluable insight into the international nature of COVID information with its column headers covering country's involved; categories such as coronavirus health updates or political interference during coronavirus; URLs for referenced articles; verifiers employed by websites; article classes that can range from true to false or even mixed evaluations; publication dates ; article sources injected with credibility verification as well as article text and language standardization. This one-of-a kind dataset serves as an essential tool in understanding both global information flow around the world concerning COVID 19 while simultaneously offering transparency into whose interests guide it

More Datasets

For more datasets, click here.

Featured Notebooks

🚨 Your notebook can be here! 🚨!

How to use the dataset

The FakeCovid dataset is a multilingual cross-domain collection of 7623 fact-checked news articles related to COVID-19. It is collected from 92 fact-checking websites and covers a wide range of sources and countries, including locations in Africa, Asia, Europe, The Americas, and Oceania. This dataset can be used for research related to understanding the truth and accuracy of news sources related to COVID-19 in different countries and languages.

To use this dataset effectively, you will need basic knowledge of data science principles such as data manipulation with pandas or Python libraries such as NumPy or ScikitLearn. The data is in CSV (comma separated values) format that can be read by most spreadsheet applications or text editor like Notepad++. Here are some steps on how to get started: - Access the FakeCovid Fact Checked News Dataset from Kaggle: https://www.kaggle.com/c/fakecovidfactcheckednewsdataset/data - Download the provided CSV file containing all fact checked news articles and place it into your desired folder location - Load the CSV file into your preferred software application like Jupyter Notebook or RStudio 4)Explore your dataset using built-in functions within data science libraries such as Pandas & matplotlib – find meaningful information through statistical analysis &//or create visualizations 5)Modify parameters within the csv file if required & save 6)Share your creative projects through Gitter chatroom #fakecovidauthors 7 )Publish any interesting discoveries you find within open source repositories like GitHub 8 )Engage with our Hangouts group #FakeCoviDFactCheckersClub 9 )Show off fun graphics via Twitter hashtag #FakeCovidiauthors 10 )Reach out if you have further questions via email contactfakecovidadatateam 11 )Stay connected by joining our mailing list#FakeCoviDAuthorsGroup

We hope this guide helps you better understand how to use our FakeCoviD Fact Checked News Dataset for generating meaningful insights relating to COVID-19 news articles worldwide!

Research Ideas

Developing an automated algorithm to detect fake news related to COVID-19 by leveraging the fact-checking flags and other results included in this dataset for machine learning and natural language processing tasks.

Training a sentiment analysis model on the data to categorize articles according to their sentiments which can be used for further investigations into why certain news topics or countries have certain outcomes, motivations, or behaviors due to their content relatedness or author biasness(if any).

Using unsupervised clustering techniques, this dataset could be used as a tool for identifying any discrepancies between news circulated in different populations in different countries (langauge and regions) so that publicists can focus more on providing factual information rather than spreading false rumors or misinformation about the pandemic

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

**License: [CC0 1.0 Universal (CC0 1.0) - Public Do...
Replication Package for 'Anchoring Code Understandability Evaluations...
zenodo.org
zip
Updated Mar 25, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anonymous; Anonymous (2022). Replication Package for 'Anchoring Code Understandability Evaluations Through Task Descriptions' [Dataset]. http://doi.org/10.5281/zenodo.5877314
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5877314
Dataset updated
Mar 25, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Anonymous; Anonymous
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Replication package for:

"Anchoring Code Understandability Evaluations Through Task Descriptions", submission under review.

The data folder contains dataset and Rmarkdown analysis script. We recommend loading the script with RStudio. For your convenience, we include a self-contained HTML outputted file.

The code snippets folder contains the experimental code snippets and a representation with syntax highlighting in HTML. We have implemented the code in this way in LimeSurvey and think that for replication purposes it may be useful to include the markup in this replication package (see `LimeSurvey` subdirectory).
m
Short-range Early Phase COVID-19 Forecasting R-Project and Data
data.mendeley.com
Updated Dec 15, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Christopher Lynch (2020). Short-range Early Phase COVID-19 Forecasting R-Project and Data [Dataset]. http://doi.org/10.17632/cytrb8p42g.2
Explore at:
Unique identifier
https://doi.org/10.17632/cytrb8p42g.2
Dataset updated
Dec 15, 2020
Authors
Christopher Lynch
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
This R-Project and its data files are provided in support of ongoing research efforts for forecasting COVID-19 cumulative case growth at varied geographic levels. All code and data files are provided to facilitate reproducibility of current research findings. Seven forecasting methods are evaluated with respect to their effectiveness at forecasting one-, three-, and seven-day cumulative COVID-19 cases, including: (1) a Naïve approach; (2) Holt-Winters exponential smoothing; (3) growth rate; (4) moving average (MA); (5) autoregressive (AR); (6) autoregressive moving average (ARMA); and (7) autoregressive integrated moving average (ARIMA). This package is developed to be directly opened and run in RStudio through the provided RProject file. Code developed using R version 3.6.3.

This software generates the findings of the article entitled "Short-range forecasting of coronavirus disease 2019 (COVID-19) during early onset at county, health district, and state geographic levels: Comparative forecasting approach using seven forecasting methods" using cumulative case counts reported by The New York Times up to April 22, 2020. This package provides two avenues for reproducing results: 1) Regenerate the forecasts from scratch using the provided code and data files and then run the analyses; or 2) Load the saved forecast data and run the analyses on the existing data

License info can be viewed from the "License Info.txt" file.

The "RProject" folder contains the RProject file which opens the project in RStudio with the desired working directory set.

README files are contained in each sub-folder which provide additoinal detail on the contents of the folder.

Copyright (c) 2020 Christopher J. Lynch and Ross Gore

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Except as contained in this notice, the name(s) of the above copyright holders shall not be used in advertising or otherwise to promote the sale, use, or other dealings in this Software without prior written authorization.
Automated_Parametric_Analysis_Pipeline In R Studio
kaggle.com
zip
Updated Nov 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dr. Nagendra (2025). Automated_Parametric_Analysis_Pipeline In R Studio [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/automated-parametric-analysis-pipeline-in-r-studio
Explore at:
zip(22695 bytes)Available download formats
Dataset updated
Nov 29, 2025
Authors
Dr. Nagendra
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
A comprehensive and automated data analysis pipeline developed in R Studio. Designed for rapid, reproducible execution of common parametric statistical tests. The pipeline handles sequential steps from data loading to final report generation. It includes functions for critical assumptions checking (e.g., normality, homogeneity). Focuses on essential parametric methods like T-tests, ANOVA, and linear models. Offers a streamlined, efficient workflow for students, researchers, and data analysts. Promotes high standards of statistical reporting and reproducibility in data science.

Facebook

Twitter

Click to copy link

Link copied

Cite

Kingsley Okoye; Samira Hosseini (2023). Collection of example datasets used for the book - R Programming - Statistical Data Analysis in Research [Dataset]. http://doi.org/10.6084/m9.figshare.24728073.v1

Collection of example datasets used for the book - R Programming - Statistical Data Analysis in Research

Explore at:

txtAvailable download formats

Unique identifier

https://doi.org/10.6084/m9.figshare.24728073.v1

Dataset updated

Dec 4, 2023

Dataset provided by

Figsharehttp://figshare.com/

Authors

Kingsley Okoye; Samira Hosseini

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers on how to perform different types of statistical data analysis for research purposes using the R programming language. R is an open-source software and object-oriented programming language with a development environment (IDE) called RStudio for computing statistics and graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provides a wide range of functions for programming and analyzing of data. Unlike many of the existing statistical softwares, R has the added benefit of allowing the users to write more efficient codes by using command-line scripting and vectors. It has several built-in functions and libraries that are extensible and allows the users to define their own (customized) functions on how they expect the program to behave while handling the data, which can also be stored in the simple object system.For all intents and purposes, this book serves as both textbook and manual for R statistics particularly in academic research, data analytics, and computer programming targeted to help inform and guide the work of the R users or statisticians. It provides information about different types of statistical data analysis and methods, and the best scenarios for use of each case in R. It gives a hands-on step-by-step practical guide on how to identify and conduct the different parametric and non-parametric procedures. This includes a description of the different conditions or assumptions that are necessary for performing the various statistical methods or tests, and how to understand the results of the methods. The book also covers the different data formats and sources, and how to test for reliability and validity of the available datasets. Different research experiments, case scenarios and examples are explained in this book. It is the first book to provide a comprehensive description and step-by-step practical hands-on guide to carrying out the different types of statistical analysis in R particularly for research purposes with examples. Ranging from how to import and store datasets in R as Objects, how to code and call the methods or functions for manipulating the datasets or objects, factorization, and vectorization, to better reasoning, interpretation, and storage of the results for future use, and graphical visualizations and representations. Thus, congruence of Statistics and Computer programming for Research.

Clear search

Close search

Google apps

Main menu

Collection of example datasets used for the book - R Programming -...

Bellabeat Case Study II Google Capstone Project

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'

(as 'lib' is unspecified)

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'

(as 'lib' is unspecified)

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'

(as 'lib' is unspecified)

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'

(as 'lib' is unspecified)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──

✔ ggplot2 3.4.0 ✔ purrr 0.3.5

✔ tibble 3.1.8 ✔ dplyr 1.0.10

✔ tidyr 1.2.1 ✔ stringr 1.4.1

✔ readr 2.1.3 ✔ forcats 0.5.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──

✖ dplyr::filter() masks stats::filter()

✖ dplyr::lag() masks stats::lag()

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

chisq.test, fisher.test

Loading required package: timechange

Attaching package: 'lubridate'

The following objects are masked from 'package:base':

date, intersect, setdiff, union

Loading your CSV files

Rows: 2483658 Columns: 3

── Column specification ────────────────────────────────────────────────────────

Delimiter: ","

chr (1): Time

dbl (2): Id, Value

ℹ Use spec() to retrieve the full column specification for this data.

ℹ Specify the column types or set show_col_types = FALSE to quiet this message.

Rows: 22099 Columns: 3

── Column specification ─────────────────────────────────────...

Database and Rstudio code

Large Datasets in R - Plant Phenology & Temperature Data from NEON

Using Descriptive Statistics to Analyse Data in R

Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

Open data: Visual load effects on the auditory steady-state responses to...

Walmart Data Set

Introduction

The following packages required for this project:

The following libraries required:

Downloading data set into RStudio:

Data Inspection

Turning Store and Holiday_flag into factors:

Splicing the date into Year and weekyear:

Filered Holiday_Flag Column to include only holidays weeks:

Filered Holiday_Flag Column to include only non holidays Weeks:

Lets review all 45 stores' weekly sales and compare them. Using dataset Walmart

Results

Lets check for the MIN and MAX of Weekly Sales but only if they are holiday sales weeks:

Result

Lets check for the MIN and MAX of Weekly Sales but only if they are non holiday sales weeks:

Which Year had the most sales?

Is there any differance between Sales during no Holiday weeks and Holiday weeks?

Manipulating data using R

Data created during Computer Statistics assignment

Context and methodology

Technical details

Data set: St. Louis River Watershed, MN Conductivity Assessment March 2022

Data and tools for studying isograms

[Dataset] Does Volunteer Engagement Pay Off? An Analysis of User...

Data from: Working with a linguistic corpus using R: An introductory note...

Bitter Creek Analysis Pedigree Data

Data from: Plant-pollinator specialization: Origin and measurement of...

Data from: The Low Dimensionality of Development

FakeCovid Fact-Checked News Dataset

FakeCovid Fact-Checked News Dataset

International Coverage of COVID-19 in 40 Languages from 105 Countries

About this dataset

More Datasets

Featured Notebooks

How to use the dataset

Research Ideas

Acknowledgements

License

Replication Package for 'Anchoring Code Understandability Evaluations...

Short-range Early Phase COVID-19 Forecasting R-Project and Data

ℹ Use `spec()` to retrieve the full column specification for this data.

ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.