License: https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34 GB unpacked. This dataset still doesn't include the PyPI packages themselves, which take around 2 TB.
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)
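A quick sanity check of the minimal dataset in R before sourcing `build_model.r` (a sketch only; the column layout of `survival_data.csv` is not documented here, so inspect it first):

    # sketch only: load and inspect the aggregated survival table
    survival_data <- read.csv("survival_data.csv")
    str(survival_data)       # check which columns are available
    summary(survival_data)   # basic sanity check before running build_model.r / helpers.r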
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used; 3.2 is known to be incompatible)

Depending on the level of detail (see Step 2 for more details):

- up to 2 TB of disk space
- at least 16 GB of RAM (64 GB preferable)
- a few hours to a few months of processing time

Step 1 - software
-----------------

- unpack **ghd-0.1.0.zip**, or clone from GitLab:

        git clone https://gitlab.com/user2589/ghd.git
        git checkout 0.1.0

  `cd` into the extracted folder. All commands below assume it as the current directory.
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- install Docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`
- install libarchive and its headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`. Without this dependency you might get an error on the next step, but it is safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`
- disable all APIs except GitHub (Bitbucket and GitLab support were not yet implemented when this study was in progress): edit `scraper/init.py` and comment out everything except GitHub support in `PROVIDERS`.

Step 2 - obtaining the dataset
------------------------------

The ultimate goal of this step is to get the output of the Python function `common.utils.survival_data()` and save it into a CSV file:

        # copy and paste into a Python console
        from common import utils
        survival_data = utils.survival_data('pypi', '2008', smoothing=6)
        survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15-30 minutes.

- create a folder `
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers on how to perform different types of statistical data analysis for research purposes using the R programming language. R is an open-source software and object-oriented programming language with a development environment (IDE) called RStudio for computing statistics and graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provide a wide range of functions for programming and analyzing data. Unlike many existing statistical software packages, R has the added benefit of allowing users to write more efficient code by using command-line scripting and vectors. It has several built-in functions and libraries that are extensible, and it allows users to define their own (customized) functions for how they expect the program to behave while handling the data, which can also be stored in the simple object system.
For all intents and purposes, this book serves as both a textbook and a manual for R statistics, particularly in academic research, data analytics, and computer programming, targeted to help inform and guide the work of R users and statisticians. It provides information about the different types of statistical data analysis and methods, and the best scenarios for using each of them in R. It gives a hands-on, step-by-step practical guide on how to identify and conduct the different parametric and non-parametric procedures, including a description of the conditions or assumptions that are necessary for performing the various statistical methods or tests, and how to understand their results. The book also covers the different data formats and sources, and how to test for the reliability and validity of the available datasets. Different research experiments, case scenarios, and examples are explained in this book. It is the first book to provide a comprehensive description and step-by-step practical hands-on guide to carrying out the different types of statistical analysis in R, particularly for research purposes, with examples ranging from how to import and store datasets in R as objects, how to code and call the methods or functions for manipulating the datasets or objects, factorization, and vectorization, to better reasoning, interpretation, and storage of the results for future use, and graphical visualizations and representations. In short, it brings statistics and computer programming together for research.
License: CC0 1.0, https://creativecommons.org/publicdomain/zero/1.0/
This dataset was retrieved from the user Mobius's page, where it was generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. I would like to thank Möbius and everyone responsible for the work.
Bellabeat Case Study 1
2022-11-14
1. Introduction
Hello everyone, my name is Nur Simais and this project is part of the Google Data Analytics Professional Certificate. Multiple skills and skillsets were learned throughout this course, which can mainly be categorized as soft and hard skills. The case study I have chosen is about the company "Bellabeat", a maker of fitness tracker devices. The company was founded in 2013 by Urška Sršen and Sando Mur, and it gradually gained recognition and expanded into many countries (https://bellabeat.com/). With this brief info about the company in mind, the business analysis will help the company see how it can achieve its goals and what can be done to improve further.
During the analysis process, I will be using Google's "Ask-Prepare-Process-Analyze-Share-Act" framework that I learned throughout this certification, applying the relevant tools and skillsets within it.
1. ASK
1.1 Business Task
The goal of this project is to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices and how to apply these insights to Bellabeat's marketing strategy, guided by three questions:
What are some trends in smart device usage? How could these trends apply to Bellabeat customers? How could these trends help influence Bellabeat's marketing strategy?
2. PREPARE
Prepare the data and libraries in RStudio. The data is publicly available on Kaggle as FitBit Fitness Tracker Data (CC0: Public Domain), so collect it by downloading that dataset.
There are 18 files in the dataset, but after examining them in Excel I decided to use these 8 datasets: dailyActivity_merged.csv, heartrate_seconds_merged.csv, hourlyCalories_merged.csv, hourlyIntensities_merged.csv, hourlySteps_merged.csv, minuteMETsNarrow_merged.csv, sleepDay_merged.csv, weightLogInfo_merged.csv
2.1 Install and load the packages
Install the R packages used for analysis and visualizations.
install.packages("tidyverse") # core package for cleaning and analysis
install.packages("lubridate") # date library mdy()
install.packages("janitor") # clean_names() to consists only _, character, numbers, and letters.
install.packages("dplyr") #helps to check the garmmar of data manioulation
Load the libraries
library(tidyverse)
library(janitor)
library(lubridate)
library(dplyr)
Having loaded the tidyverse package, the rest of the essential packages (ggplot2, dplyr, and tidyr) are loaded as well.
2.2 Importing and Preparing the Dataset
Upload the archived dataset to RStudio by clicking the Upload button in the bottom right pane.
The files will be saved in a new folder named "Fitabase Data 4.12.16-5.12.16". Import the datasets and rename them.
daily_activity <- read.csv("dailyActivity_merged.csv")
heartrate_seconds <- read_csv("heartrate_seconds_merged.csv")
hourly_calories <- read_csv("hourlyCalories_merged.csv")
(read_csv prints a column specification message; use spec() to retrieve the full column specification or set show_col_types = FALSE to quiet it.)
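The remaining five files from the list in section 2 can be imported the same way (the object names below are illustrative):
hourly_intensities <- read_csv("hourlyIntensities_merged.csv")
hourly_steps <- read_csv("hourlySteps_merged.csv")
minute_mets <- read_csv("minuteMETsNarrow_merged.csv")
sleep_day <- read_csv("sleepDay_merged.csv")
weight_log <- read_csv("weightLogInfo_merged.csv")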
This module series covers how to import, manipulate, format and plot time series data stored in .csv format in R. Originally designed to teach researchers to use NEON plant phenology and air temperature data; has been used in undergraduate classrooms.
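As a rough illustration of the kind of workflow the module teaches (this is not the NEON lesson code; the file and column names are placeholders):
library(ggplot2)
temps <- read.csv("neon_daily_airtemp.csv")        # placeholder file name
temps$date <- as.Date(temps$date)                  # parse the date column
ggplot(temps, aes(x = date, y = airtemp_mean)) +   # placeholder column names
  geom_line() +
  labs(x = "Date", y = "Mean air temperature (C)")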
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The following data shows riding information for members vs. casual riders at the company Cyclistic (a made-up name). This is a dataset used as a case study for the Google Data Analytics certificate.
The changes made to the data in Excel:
- Removed all duplicates (none were found)
- Added a ride_length column by subtracting started_at from ended_at using the formula "=C2-B2", then changed the cell format to Time (37:30:55)
- Added a day_of_week column using the formula "=WEEKDAY(B2,1)" to display the day the ride took place on, 1 = Sunday through 7 = Saturday
- Some cells display as ########; that data was left unchanged. It simply represents negative values and should be treated as 0.
Processing the data in RStudio:
- Installed the required packages: tidyverse for data import and wrangling, lubridate for date functions, and ggplot2 for visualization.
- Step 1: Read the CSV files into R to collect the data
- Step 2: Made sure the files all contained the same column names, because I want to merge them into one
- Step 3: Renamed columns so they align, then merged the files into one combined data frame
- Step 4: More data cleaning and analyzing
- Step 5: Once the data was cleaned and clearly telling a story, I began to visualize it. The visualizations can be seen below.
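A rough sketch of Steps 1 through 3 in R (file names are placeholders; the ride_length and day_of_week definitions follow the Excel steps above):
library(tidyverse)
library(lubridate)
q1 <- read_csv("divvy_trips_q1.csv")              # Step 1: read the csv files (placeholder names)
q2 <- read_csv("divvy_trips_q2.csv")
identical(colnames(q1), colnames(q2))             # Step 2: confirm the column names match
all_trips <- bind_rows(q1, q2)                    # Step 3: merge into one combined data frame
all_trips <- all_trips %>%
  mutate(ride_length = difftime(ended_at, started_at),
         day_of_week = wday(started_at))          # 1 = Sunday through 7 = Saturday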
Load and view a real-world dataset in RStudio
• Calculate “Measure of Frequency” metrics
• Calculate “Measure of Central Tendency” metrics
• Calculate “Measure of Dispersion” metrics
• Use R’s in-built functions for additional data quality metrics
• Create a custom R function to calculate descriptive statistics on any given dataset
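One possible shape for such a custom function (illustrative only, not the exercise's official solution):
describe_numeric <- function(x) {
  x <- x[!is.na(x)]
  c(n      = length(x),        # measure of frequency
    mean   = mean(x),          # measures of central tendency
    median = median(x),
    sd     = sd(x),            # measures of dispersion
    range  = diff(range(x)),
    IQR    = IQR(x))
}
describe_numeric(mtcars$mpg)   # example on a built-in dataset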
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Explanation/Overview: Corresponding dataset for the analyses and results achieved in the CS Track project in the research line on participation analyses, which is also reported in the publication "Does Volunteer Engagement Pay Off? An Analysis of User Participation in Online Citizen Science Projects", a conference paper for the conference CollabTech 2022: Collaboration Technologies and Social Computing, published as part of the Lecture Notes in Computer Science book series (LNCS, volume 13632). The usernames have been anonymised.

Purpose: The purpose of this dataset is to provide the basis to reproduce the results reported in the associated deliverable and in the above-mentioned publication. As such, it does not represent raw data, but rather files that already include certain analysis steps (like calculated degrees or other SNA-related measures), ready for analysis, visualisation and interpretation with R.

Relatedness: The data of the different projects was derived from the forums of 7 Zooniverse projects based on similar discussion board features. The projects are: 'Galaxy Zoo', 'Gravity Spy', 'Seabirdwatch', 'Snapshot Wisconsin', 'Wildwatch Kenya', 'Galaxy Nurseries', 'Penguin Watch'.

Content: In this Zenodo entry, several files can be found. The structure is as follows (files, folders, and descriptions):
- corresponding_calculations.html: Quarto notebook to view in a browser
- corresponding_calculations.qmd: Quarto notebook to view in RStudio
- assets
  - data
    - annotations: annotations.csv, a list of annotations made per day for each of the analysed projects
    - comments: comments.csv, the total list of comments with several data fields (i.e., comment id, text, reply_user_id)
    - rolechanges: 478_rolechanges.csv, 1104_rolechanges.csv, ... lists of roles per user to determine the number of role changes
    - totalnetworkdata
      - Edges: 478_edges.csv, 1104_edges.csv, ... network data (edge sets) for the given projects (without time slices)
      - Nodes: 478_nodes.csv, 1104_nodes.csv, ... network data (node sets) for the given projects (without time slices)
    - trajectories: network data (edge and node sets) for the given projects and all time slices (Q1 2016 - Q4 2021)
      - 478
        - Edges: edges_4782016_q1.csv, edges_4782016_q2.csv, edges_4782016_q3.csv, edges_4782016_q4.csv, ...
        - Nodes: nodes_4782016_q1.csv, nodes_4782016_q2.csv, nodes_4782016_q3.csv, nodes_4782016_q4.csv, ...
      - 1104
        - Edges: ...
        - Nodes: ...
  - scripts
    - datavizfuncs.R: script for the data visualisation functions, automatically executed from within corresponding_calculations.qmd
    - import.R: script for the import of data, automatically executed from within corresponding_calculations.qmd
- corresponding_calculations_files: files for the html/qmd view in the browser/RStudio

Grouping: The data is grouped according to given criteria (e.g., project_title or time). Accordingly, the respective files can be found in the data structure above.
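As an illustration, one of the SNA measures mentioned above (degree) could be recomputed from an edge and node set roughly like this; the paths and column layouts are assumptions, so check the files first:
library(igraph)
edges <- read.csv("assets/data/totalnetworkdata/Edges/478_edges.csv")   # path is an assumption
nodes <- read.csv("assets/data/totalnetworkdata/Nodes/478_nodes.csv")   # first columns assumed to hold user ids
g <- graph_from_data_frame(d = edges, vertices = nodes, directed = FALSE)
summary(degree(g))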
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The main results files are saved separately:
GENERAL INFORMATION
Title of Dataset: Open data: Visual load effects on the auditory steady-state responses to 20-, 40-, and 80-Hz amplitude-modulated tones
Author Information
A. Principal Investigator Contact Information
Name: Stefan Wiens
Institution: Department of Psychology, Stockholm University, Sweden
Internet: https://www.su.se/profiles/swiens-1.184142
Email: sws@psychology.su.se
B. Associate or Co-investigator Contact Information
Name: Malina Szychowska
Institution: Department of Psychology, Stockholm University, Sweden
Internet: https://www.researchgate.net/profile/Malina_Szychowska
Email: malina.szychowska@psychology.su.se
Date of data collection: Subjects (N = 33) were tested between 2019-11-15 and 2020-03-12.
Geographic location of data collection: Department of Psychology, Stockholm, Sweden
Information about funding sources that supported the collection of the data: Swedish Research Council (Vetenskapsrådet) 2015-01181
SHARING/ACCESS INFORMATION
Licenses/restrictions placed on the data: CC BY 4.0
Links to publications that cite or use the data: Szychowska M., & Wiens S. (2020). Visual load effects on the auditory steady-state responses to 20-, 40-, and 80-Hz amplitude-modulated tones. Submitted manuscript.
The study was preregistered: https://doi.org/10.17605/OSF.IO/6FHR8
Links to other publicly accessible locations of the data: N/A
Links/relationships to ancillary data sets: N/A
Was data derived from another source? No
Recommended citation for this dataset: Wiens, S., & Szychowska M. (2020). Open data: Visual load effects on the auditory steady-state responses to 20-, 40-, and 80-Hz amplitude-modulated tones. Stockholm: Stockholm University. https://doi.org/10.17045/sthlmuni.12582002
DATA & FILE OVERVIEW
File List: The files contain the raw data, scripts, and results of main and supplementary analyses of an electroencephalography (EEG) study. Links to the hardware and software are provided under methodological information.
ASSR2_experiment_scripts.zip: contains the Python files to run the experiment.
ASSR2_rawdata.zip: contains raw datafiles for each subject
ASSR2_EEG_scripts.zip: Python-MNE scripts to process the EEG data
ASSR2_EEG_preprocessed_data.zip: EEG data in fif format after preprocessing with Python-MNE scripts
ASSR2_R_scripts.zip: R scripts to analyze the data together with the main datafiles. The main files in the folder are:
ASSR2_results.zip: contains all figures and tables that are created by Python-MNE and R.
METHODOLOGICAL INFORMATION
The EEG data were recorded with an Active Two BioSemi system (BioSemi, Amsterdam, Netherlands; www.biosemi.com) and saved in .bdf format. For more information, see linked publication.
Methods for processing the data: We conducted frequency analyses and computed event-related potentials. See the linked publication.
Instrument- or software-specific information needed to interpret the data:
- MNE-Python (Gramfort A., et al., 2013): https://mne.tools/stable/index.html#
- RStudio used with R (R Core Team, 2020): https://rstudio.com/products/rstudio/
- Wiens, S. (2017). Aladins Bayes Factor in R (Version 3). https://www.doi.org/10.17045/sthlmuni.4981154.v3
Standards and calibration information, if appropriate: For information, see linked publication.
Environmental/experimental conditions: For information, see linked publication.
Describe any quality-assurance procedures performed on the data: For information, see linked publication.
People involved with sample collection, processing, analysis and/or submission:
DATA-SPECIFIC INFORMATION: All relevant information can be found in the MNE-Python and R scripts (in EEG_scripts and analysis_scripts folders) that process the raw data. For example, we added notes to explain what different variables mean.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This contains a Docker container and the prepared data to do the analysis in the paper titled "The Low Dimensionality of Development", published in Social Indicators Research (2020) by Kraemer et al. If you use code or data from this repository, cite the paper. This repository contains a copy of the World Development Indicators database from May 2018 [1]. The analysis was run on a 48-core cluster node with 256 GB of RAM and takes some hours to complete. After generating all results, they can be loaded into slightly over 16 GB of RAM.

Unpack 'docker_data.tar.gz' into a '/path/to/data' on 'host' and run:

docker load -i dockerimage.tar
docker run -v "/path/to/data":/home/rstudio/data -v "/path/to/figures":/home/rstudio/fig -p 8989:8787 -e PASSWORD=secret_password development-indicators

Then open 'host:8989' in your browser and an RStudio session should appear. You can log in with the user "rstudio" and the password you specified before.

NOTE: We did not set a random seed, so slight variations between runs will occur.

[1] License: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/). By: The World Bank. The original data can be found here: https://datacatalog.worldbank.org/dataset/world-development-indicators
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Open data: Visual load does not decrease the auditory steady state response to 40-Hz amplitude-modulated tones

The main results files are saved separately:
- ASSR_study1.html: R output of the main analyses
- ASSR_study1_subset_subjects.html: R output of the main analyses
- ASSR_study2.html: R output of the main analyses

The studies were preregistered:
Study 1: https://doi.org/10.17605/OSF.IO/UYJVA
Study 2: https://doi.org/10.17605/OSF.IO/JVMFD

DATA & FILE OVERVIEW

File List: The files contain the raw data, scripts, and results of main and supplementary analyses of two electroencephalography (EEG) studies (Study1, Study2). Links to the hardware and software are provided under methodological information.

ASSR_study1_experiment_scripts.zip: contains the Python files to run the experiment.

ASSR_study1_rawdata.zip: contains raw datafiles for each subject
- data_EEG: EEG data in bdf format (generated by Biosemi)
- data_log: logfiles of the EEG session (generated by Python)
- data_WMC: logfiles of the working memory capacity task (generated by Python)

ASSR_study1_EEG_scripts.zip: Python-MNE scripts to process the EEG data

ASSR_study1_EEG_preprocessed.zip: Preprocessed EEG data from Python-MNE

ASSR_study1_analysis_scripts.zip: R scripts to analyze the data together with the main datafiles. The main files in the folder are:
- ASSR_study1.html: R output of the main analyses
- ASSR_study1_subset_subjects.html: R output of the main analyses but after excluding five subjects who were excluded because of stricter, preregistered artifact rejection criteria

ASSR_study1_figures.zip: contains all figures and tables that are created by Python-MNE and R.

ASSR_study2_experiment_scripts.zip: contains the Python files to run the experiment

ASSR_study2_rawdata.zip: contains raw datafiles for each subject
- data_EEG: EEG data in bdf format (generated by Biosemi)
- data_log: logfiles of the EEG session (generated by Python)
- data_WMC: logfiles of the working memory capacity task (generated by Python)

ASSR_study2_EEG_scripts.zip: Python-MNE scripts to process the EEG data

ASSR_study2_EEG_preprocessed.zip: Preprocessed EEG data from Python-MNE

ASSR_study2_analysis_scripts.zip: R scripts to analyze the data together with the main datafiles. The main files in the folder are:
- ASSR_study2.html: R output of the main analyses
- ASSR_compare_performance_between_studies.html: R output of analyses that compare behavioral performance between study 1 and study 2.

ASSR_study2_figures.zip: contains all figures and tables that are created by Python-MNE and R.

Instrument- or software-specific information needed to interpret the data:
- MNE-Python (Gramfort A., et al., 2013): https://mne.tools/stable/index.html#
- RStudio used with R (R Core Team, 2016): https://rstudio.com/products/rstudio/
- Wiens, S. (2017). Aladins Bayes Factor in R (Version 3). https://www.doi.org/10.17045/sthlmuni.4981154.v3
The purpose of this project was added practice in learning new skills and demonstrating R data analysis skills. The data set was located on Kaggle and shows sales information for the years 2010 to 2012. The weekly sales have two categories, holiday and non-holiday, represented by 1 and 0 in that column respectively.
The main question for this exercise was: were there any factors that affected weekly sales for the stores? Those factors included temperature, fuel prices, and unemployment rates.
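One quick way to examine those factors once the data is loaded (the loading steps follow below) is a correlation matrix; the column names used here are assumed to match the public Kaggle Walmart dataset:
cor(Walmart[, c("Weekly_Sales", "Temperature", "Fuel_Price", "Unemployment")],
    use = "complete.obs")  # column names assumed from the Kaggle Walmart dataset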
install.packages("tidyverse")
install.packages("dplyr")
install.packages("tsibble")
library("tidyverse")
library(readr)
library(dplyr)
library(ggplot2)
library(readr)
library(lubridate)
library(tsibble)
Walmart <- read.csv("C:/Users/matth/OneDrive/Desktop/Case Study/Walmart.csv")
Compared column names of each file to verify consistency.
colnames(Walmart)
dim(Walmart)
str(Walmart)
head(Walmart)
which(is.na(Walmart$Date))
sum(is.na(Walmart))
There is NA data in the set.
Walmart$Store<-as.factor(Walmart$Store)
Walmart$Holiday_Flag<-as.factor(Walmart$Holiday_Flag)
Walmart$week<-yearweek(as.Date(Walmart$Date,tryFormats=c("%d-%m-%Y"))) # make sure to install "tsibble"
Walmart$year<-format(as.Date(Walmart$Date,tryFormats=c("%d-%m-%Y")),"%Y")
Walmart_Holiday<-
filter(Walmart, Holiday_Flag==1)
Walmart_Non_Holiday<-
filter(Walmart, Holiday_Flag==0)
ggplot(Walmart, aes(x=Weekly_Sales, y=Store))+geom_boxplot()+ labs(title = 'Weekly Sales Across 45 Stores',
x='Weekly sales', y='Store')+theme_bw()
The boxplot shows that Store 14 had the maximum sales while Store 33 had the minimum sales.
Let's verify the results via slice_max() and slice_min():
Walmart %>% slice_max(Weekly_Sales)
Walmart %>% slice_min(Weekly_Sales)
It looks like the information was correct. Let's check the mean of the Weekly_Sales column:
mean(Walmart$Weekly_Sales)
The mean of the Weekly_Sales column for the Walmart dataset was 1046965.
ggplot(Walmart_Holiday, aes(x=Weekly_Sales, y=Store))+geom_boxplot()+ labs(title = 'Holiday Sales Across 45 Stores',
x='Weekly sales', y='Store')+theme_bw()
Based on the boxplot, Store 4 had the highest weekly sales during a holiday week, while stores 33 and 5 had some of the lowest holiday sales. Let's verify again with slice_max() and slice_min():
Walmart_Holiday %>% slice_max(Weekly_Sales)
Walmart_Holiday %>% slice_min(Weekly_Sales)
The results match what is shown on the boxplot. Let's find the mean:
mean(Walmart_Holiday$Weekly_Sales)
The mean was 1122888.
ggplot(Walmart_Non_Holiday, aes(x=Weekly_Sales, y=Store))+geom_boxplot()+ labs(title = 'Non Holiday Sales Across 45 Stores', x='Weekly sales', y='Store')+theme_bw()
This matches the results from the full Walmart dataset, which had both non-holiday and holiday weeks: Store 14 had the maximum sales and Store 33 had the minimum sales. Let's verify the results and find the mean:
Walmart_Non_Holiday %>% slice_max(Weekly_Sales)
Walmart_Non_Holiday %>% slice_min(Weekly_Sales)
mean(Walmart_Non_Holiday$Weekly_Sales)
The results matched, and the mean for weekly sales was 1041256.
ggplot(data = Walmart) + geom_point(mapping = aes(x=year, y=Weekly_Sales))
According to the plot, 2010 had the most sales. Let's use a boxplot to see more.
ggplot(Walmart, aes(x=year, y=Weekly_Sales))+geom_boxplot()+ labs(title = 'Weekly Sales for Years 2010 - 2012',
x='Year', y='Weekly Sales')
2010 saw higher sales numbers and a higher median.
Let's start with holiday weekly sales:
ggplot(Walmart_Holiday, aes(x=year, y=Weekly_Sales))+geom_boxplot()+ labs(title = 'Holiday Weekly Sales for Years ...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
A collection of datasets and python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

1. Datasets

The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

1.1 CSV format

The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):
Label Data type Description
isogramy int The order of isogramy, e.g. "2" is a second order isogram
length int The length of the word in letters
word text The actual word/isogram in ASCII
source_pos text The Part of Speech tag from the original corpus
count int Token count (total number of occurrences)
vol_count int Volume count (number of different sources which contain the word)
count_per_million int Token count per million words
vol_count_as_percent int Volume count as percentage of the total number of volumes
is_palindrome bool Whether the word is a palindrome (1) or not (0)
is_tautonym bool Whether the word is a tautonym (1) or not (0)
The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:
Label Data type Description
!total_1grams int The total number of words in the corpus
!total_volumes int The total number of volumes (individual sources) in the corpus
!total_isograms int The total number of isograms found in the corpus (before compacting)
!total_palindromes int How many of the isograms found are palindromes
!total_tautonyms int How many of the isograms found are tautonyms
The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

1.2 SQLite database format

On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:
• Compacted versions of each dataset, where identical headwords are combined into a single entry.
• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.
The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

2. Scripts

There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

2.1 Source data

The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

2.2 Data preparation

Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

2.3 Isogram extraction

After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

python isograms.py --batch --infile=INFILE --outfile=OUTFILE

Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

2.4 Creating a SQLite3 database

The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:
1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
2. Copy the "create-database.sql" script into the same directory as the two data files.
3. On the command line, go to the directory where the files and the SQL script are.
4. Type: sqlite3 isograms.db
5. This will create a database called "isograms.db".

See section 1 for a basic description of the output data and how to work with the database.

2.5 Statistical processing

The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
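As a starting point for querying the database from R (the table name below is a placeholder; use dbListTables() to see the tables that create-database.sql actually creates):
library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "isograms.db")
dbListTables(con)                                  # list the actual table names
res <- dbGetQuery(con, "SELECT length, COUNT(*) AS n FROM some_table GROUP BY length")  # placeholder table name
dbDisconnect(con)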
This child page contains a zipped folder which contains all of the items necessary to run load estimation using R-LOADEST to produce results that are published in U.S. Geological Survey Investigations Report 2021-XXXX [Tatge, W.S., Nustad, R.A., and Galloway, J.M., 2021, Evaluation of Salinity and Nutrient Conditions in the Heart River Basin, North Dakota, 1970-2020: U.S. Geological Survey Scientific Investigations Report 2021-XXXX, XX p]. The folder contains an allsiteinfo.table.csv file, a "datain" folder, and a "scripts" folder. The allsiteinfo.table.csv file can be used to cross reference the sites with the main report (Tatge and others, 2021). The "datain" folder contains all the input data necessary to reproduce the load estimation results. The naming convention in the "datain" folder is site_MI_rloadest or site_NUT_rloadest for either the major ion loads or the nutrient loads. The .Rdata files are used in the scripts to run the estimations and the .csv files can be used to look at the data. The "scripts" folder contains the written R scripts to produce the results of the load estimation from the main report. R-LOADEST is a software package for analyzing loads in streams and an accompanying report (Runkel and others, 2004) serves as the formal documentation for R-LOADEST. The package is a collection of functions written in R (R Development Core Team, 2019), an open source language and a general environment for statistical computing and graphics.

The following system requirements are necessary for producing results:
- Windows 10 operating system
- R (version 3.4 or later; 64-bit recommended)
- RStudio (version 1.1.456 or later)
- R-LOADEST program (available at https://github.com/USGS-R/rloadest)

Runkel, R.L., Crawford, C.G., and Cohn, T.A., 2004, Load Estimator (LOADEST): A FORTRAN Program for Estimating Constituent Loads in Streams and Rivers: U.S. Geological Survey Techniques and Methods Book 4, Chapter A5, 69 p. [Also available at https://pubs.usgs.gov/tm/2005/tm4A5/pdf/508final.pdf.]
R Development Core Team, 2019, R—A language and environment for statistical computing: Vienna, Austria, R Foundation for Statistical Computing, accessed December 7, 2020, at https://www.r-project.org.
[Note 2023-08-14 - Supersedes version 1, https://doi.org/10.15482/USDA.ADC/1528086]

This dataset contains all code and data necessary to reproduce the analyses in the manuscript: Mengistu, A., Read, Q. D., Sykes, V. R., Kelly, H. M., Kharel, T., & Bellaloui, N. (2023). Cover crop and crop rotation effects on tissue and soil population dynamics of Macrophomina phaseolina and yield under no-till system. Plant Disease. https://doi.org/10.1094/pdis-03-23-0443-re

The .zip archive cropping-systems-1.0.zip contains data and code files.

Data
- stem_soil_CFU_by_plant.csv: Soil disease load (SoilCFUg) and stem tissue disease load (StemCFUg) for individual plants in CFU per gram, with columns indicating year, plot ID, replicate, row, plant ID, previous crop treatment, cover crop treatment, and comments. Missing data are indicated with .
- yield_CFU_by_plot.csv: Yield data (YldKgHa) at the plot level in units of kg/ha, with columns indicating year, plot ID, replicate, and treatments, as well as means of soil and stem disease load at the plot level.

Code
- cropping_system_analysis_v3.0.Rmd: RMarkdown notebook with all data processing, analysis, and visualization code
- equations.Rmd: RMarkdown notebook with formatted equations
- formatted_figs_revision.R: R script to produce figures formatted exactly as they appear in the manuscript

The Rproject file cropping-systems.Rproj is used to organize the RStudio project. Scripts and notebooks used in older versions of the analysis are found in the testing/ subdirectory. Excel spreadsheets containing raw data from which the cleaned CSV files were created are found in the raw_data subdirectory.
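A minimal sketch of reading the two data files in R, using the missing-data marker noted above:
stem_soil <- read.csv("stem_soil_CFU_by_plant.csv", na.strings = ".")  # "." marks missing data
yield     <- read.csv("yield_CFU_by_plot.csv")
summary(stem_soil[, c("SoilCFUg", "StemCFUg")])
summary(yield$YldKgHa)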
License: Apache License 2.0, http://www.apache.org/licenses/LICENSE-2.0
This replication package contains all necessary scripts and data to replicate the main figures and tables presented in the paper.
1_scripts: This folder contains all scripts required to replicate the main figures and tables of the paper. The scripts are numbered with a prefix (e.g. "1_") in the order they should be run. Output will also be produced in this folder.
- 0_init.Rmd: An R Markdown file that installs and loads all packages necessary for the subsequent scripts.
- 1_fig_1.Rmd: Primarily produces Figure 1 (Zipf's plots) and conducts statistical tests to support underlying statistical claims made through the figure.
- 2_fig_2_to_4.Rmd: Primarily produces Figures 2 to 4 (average levels of expression) and conducts statistical tests to support underlying statistical claims made through the figures. This includes conducting t-tests to establish subgroup differences. The script also produces the file table_controlling_how.csv, which contains the full set of regression results for the analysis of subgroup differences in political stances, controlling for emotionality, egocentrism, and toxicity. This file includes effect sizes, standard errors, confidence intervals, and p-values for each stance, group variable, and confounder.
- 3_fig_5_to_6.Rmd: Primarily produces Figures 5 to 6 (trends in expression) and conducts statistical tests to support underlying statistical claims made through the figures. This includes conducting t-tests to establish subgroup differences.
- 4_tab_1_to_2.Rmd: Produces Tables 1 to 2, and shows code for Table A5 (descriptive tables).
Expected run time for each script is under 3 minutes and requires around 4GB RAM. Script 3_fig_5_to_6.Rmd can take up to 3-4 minutes and requires up to 6GB RAM. Installation of each package for the first time user may take around 2 minutes each, except 'tidyverse', which may take around 4 minutes.
We have not provided a demo since the actual dataset used for analysis is small enough and computations are efficient enough to be run in most systems.
Each script starts with a layperson explanation to overview the functionality of the code and a pseudocode for a detailed procedure, followed by the actual code.
2_data: This folder contains all data used to replicate the main results. The data is called by the respective scripts automatically using relative paths.
- data_dictionary.txt: Provides a description of all variables as they are coded in the various datasets, especially the main author-by-time level dataset called repl_df.csv.
- Processed data aggregated at the individual author by time (year by month) level are provided, as raw data containing raw tweets cannot be shared.
This project uses R and RStudio. Make sure you have the following installed:
Once R and RStudio are installed, use the R Markdown script '0_init.Rmd' to ensure the correct versions of the required packages are installed. This script will install the remotes package (if not already installed) and then install the specified versions of the required packages.
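The gist of what 0_init.Rmd does, per the description above (the package name and version here are placeholders; the actual list lives in 0_init.Rmd):
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_version("data.table", version = "1.14.8")   # placeholder package/version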
This project is licensed under the Apache License 2.0 - see the license.txt file for details.
License: MIT, https://opensource.org/licenses/MIT
This R-Project and its data files are provided in support of ongoing research efforts for forecasting COVID-19 cumulative case growth at varied geographic levels. All code and data files are provided to facilitate reproducibility of current research findings. Seven forecasting methods are evaluated with respect to their effectiveness at forecasting one-, three-, and seven-day cumulative COVID-19 cases, including: (1) a Naïve approach; (2) Holt-Winters exponential smoothing; (3) growth rate; (4) moving average (MA); (5) autoregressive (AR); (6) autoregressive moving average (ARMA); and (7) autoregressive integrated moving average (ARIMA). This package is developed to be directly opened and run in RStudio through the provided RProject file. Code developed using R version 3.6.3.
This software generates the findings of the article entitled "Short-range forecasting of coronavirus disease 2019 (COVID-19) during early onset at county, health district, and state geographic levels: Comparative forecasting approach using seven forecasting methods" using cumulative case counts reported by The New York Times up to April 22, 2020. This package provides two avenues for reproducing results: 1) Regenerate the forecasts from scratch using the provided code and data files and then run the analyses; or 2) Load the saved forecast data and run the analyses on the existing data
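This is not the authors' code, but a minimal illustration of two of the seven methods named above (Holt-Winters exponential smoothing and ARIMA) applied to a toy cumulative series using the forecast package:
library(forecast)
cases <- ts(cumsum(c(1, 2, 3, 5, 8, 13, 21, 30, 42, 55, 70, 88, 110, 135)))  # toy cumulative counts
fc_hw    <- forecast(HoltWinters(cases, gamma = FALSE), h = 7)   # Holt-Winters (no seasonality)
fc_arima <- forecast(auto.arima(cases), h = 7)                   # ARIMA
fc_arima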
License info can be viewed from the "License Info.txt" file.
The "RProject" folder contains the RProject file which opens the project in RStudio with the desired working directory set.
README files are contained in each sub-folder, providing additional detail on the contents of the folder.
Copyright (c) 2020 Christopher J. Lynch and Ross Gore
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Except as contained in this notice, the name(s) of the above copyright holders shall not be used in advertising or otherwise to promote the sale, use, or other dealings in this Software without prior written authorization.
This dataset is primarily an RStudio project. To run the .R files, we recommend first opening the .RProj file in RStudio and installing the here package. This will allow you to run all of the .R scripts without changing any of the working directories.
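A short illustration of how the here package removes the need to change working directories (the data path is a placeholder):
install.packages("here")
library(here)
here()                                       # reports the project root (where the .RProj file lives)
df <- read.csv(here("data", "example.csv"))  # placeholder path, resolved relative to the project root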
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
DESCRIPTION
This repository contains analysis scripts (with outputs), figures from the manuscript, and supplementary files for the HIV Pain (HIP) Intervention Study. All analysis scripts (and their outputs, in the /outputs subdirectory) are found in HIP-study.zip, while PDF copies of the analysis outputs that are cited in the manuscript as supplementary material are found in the relevant supplement-*.pdf file.

Note: Participant consent did not provide for the publication of their data, and hence neither the original nor cleaned data have been made available. However, we do not wish to bar access to the data unnecessarily and we will judge requests to access the data on a case-by-case basis. Examples of potential use cases include independent assessments of our analyses, and secondary data analyses. Please contact Peter Kamerman (peter.kamerman@gmail.com) or Dr Tory Madden (torymadden@gmail.com), or open an issue on the GitHub repo (https://github.com/kamermanpr/HIP-study/issues).

BIBLIOGRAPHIC INFORMATION

Repository citation: Kamerman PR, Madden VJ, Parker R, Devan D, Cameron S, Jackson K, Reardon C, Wadley A. Analysis scripts and supplementary files: Barriers to implementing clinical trials on non-pharmacological treatments in developing countries – lessons learnt from addressing pain in HIV. DOI: 10.6084/m9.figshare.7654637.

Manuscript citation: Parker R, Madden VJ, Devan D, Cameron S, Jackson K, Kamerman P, Reardon C, Wadley A. Barriers to implementing clinical trials on non-pharmacological treatments in developing countries – lessons learnt from addressing pain in HIV. Pain Reports [submitted 2019-01-31]

Manuscript abstract:
Introduction: Pain affects over half of people living with HIV/AIDS (LWHA) and pharmacological treatment has limited efficacy. Preliminary evidence supports non-pharmacological interventions. We previously piloted a multimodal intervention in amaXhosa women LWHA and chronic pain in South Africa, with improvements seen in all outcomes, in both intervention and control groups.
Methods: A multicentre, single-blind randomised controlled trial with 160 participants recruited was conducted to determine whether the multimodal peer-led intervention reduced pain in different populations of both male and female South Africans LWHA. Participants were followed up at Weeks 4, 8, 12, 24 and 48 to evaluate effects on the primary outcome of pain, and on depression, self-efficacy and health-related quality of life.
Results: We were unable to assess the efficacy of the intervention due to a 58% loss to follow up (LTFU). Secondary analysis of the LTFU found that sociocultural factors were not predictive of LTFU. Depression, however, did associate with LTFU, with greater severity of depressive symptoms predicting LTFU at week 8 (p=0.01).
Discussion: We were unable to evaluate the effectiveness of the intervention due to the high LTFU and the risk of retention bias. The different sociocultural context in South Africa may warrant a different approach to interventions for pain in HIV compared to resource-rich countries, including a concurrent strategy to address barriers to health care service delivery. We suggest that assessment of pain and depression need to occur simultaneously in those with pain in HIV. We suggest investigation of the effect of social inclusion on pain and depression.

USING DOCKER TO RUN THE HIP-STUDY ANALYSIS SCRIPTS
These instructions are for running the analysis on your local machine. You need to have Docker installed on your computer.
To do so, go to docker.com (https://www.docker.com/community-edition#/download) and follow the instructions for downloading and installing Docker for your operating system. Once Docker has been installed, follow the steps below, noting that Docker commands are entered in a terminal window (Linux and OSX/macOS) or command prompt window (Windows). Windows users also may wish to install GNU Make (http://gnuwin32.sourceforge.net/downlinks/make.php) (required for the make method of running the scripts) and Git (https://gitforwindows.org/) version control software (not essential).

Download the latest image
Enter: docker pull kamermanpr/docker-hip-study:v2.0.0

Run the container
Enter: docker run -d -p 8787:8787 -v <path>:/home/rstudio --name threshold -e USER=hip -e PASSWORD=study kamermanpr/docker-hip-study:v2.0.0
Where <path> refers to the path to the HIP-study directory on your computer, which you either cloned from GitHub (https://github.com/kamermanpr/HIP-study.git), git clone https://github.com/kamermanpr/HIP-study, or downloaded and extracted from figshare (https://doi.org/10.6084/m9.figshare.7654637).

Login to RStudio Server
- Open a web browser window and navigate to: localhost:8787
- Use the following login credentials:
  - Username: hip
  - Password: study

Prepare the HIP-study directory
The HIP-study directory comes with the outputs for all the analysis scripts in the /outputs directory (html and md formats). However, should you wish to run the scripts yourself, there are several preparatory steps that are required:
1. Acquire the data. The data required to run the scripts have not been included in the repo because participants in the studies did not consent to public release of their data. However, the data are available on request from Peter Kamerman (peter.kamerman@gmail.com). Once the data have been obtained, the files should be copied into a subdirectory named /data-original.
2. Clean the /outputs directory by entering make clean in the Terminal tab in RStudio.

Run the HIP-study analysis scripts
To run all the scripts (including the data cleaning scripts), enter make all in the Terminal tab in RStudio.

To run individual RMarkdown scripts (*.Rmd files):
1. Generate the cleaned data using one of the following methods:
   - Enter make data-cleaned/demographics.rds in the Terminal tab in RStudio.
   - Enter source('clean-data-script.R') in the Console tab in RStudio.
   - Open the clean-data-script.R script through the File tab in RStudio, and then click the 'Source' button on the right of the Script console in RStudio for each script.
2. Run the individual script by:
   - Entering make outputs/<script-name>.html in the Terminal tab in RStudio, OR
   - Opening the relevant *.Rmd file through the File tab in RStudio, and then clicking the 'knit' button on the left of the Script console in RStudio.

Shutting down
Once done, log out of RStudio Server and enter the following into a terminal to stop the Docker container: docker stop hip. If you then want to remove the container, enter: docker rm threshold. If you also want to remove the Docker image you downloaded, enter: docker rmi kamermanpr/docker-hip-study:v2.0.0
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
The folder contains everything that is needed to reproduce the findings, figures and tables presented in the following publication: Krasselt, J., & Dreesen, Ph. (2024). Topic models indicate textual aboutness and pragmatics: Valuation practices in Islamophobic discourse. Journal of Cultural Analytics.

In detail:
R script to reproduce figures and tables: supplementary_material_cultural_analytics.rmd
- the script is also provided as a commented html markdown version: supplementary_material_cultural_analytics.html
- to run the script, open the file supplementary_material_cultural_analytics.rmd in RStudio, install the necessary packages and run each chunk

LDA topic model
- document-topic distribution: doc_topics_df.csv
- topic list (top 20 words): top_words_df.csv
- word-topic assignment: topic_word_assignment.csv (columns with actual tokens and lemmata were deleted due to copyright)

Sura citations
- a file containing 3-grams counted for sura citations only: citations_suras_3grams.txt
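A quick way to peek at the provided topic-model outputs in R before running the notebook (column layouts are not documented above, so inspect them first):
doc_topics <- read.csv("doc_topics_df.csv")   # document-topic distribution
top_words  <- read.csv("top_words_df.csv")    # top 20 words per topic
str(doc_topics)
head(top_words)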