License: https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34 GB unpacked. This dataset still doesn't include the PyPI packages themselves, which take around 2 TB.
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)
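A quick sanity check of the minimal dataset in R before sourcing `build_model.r` (a sketch only; the column layout of `survival_data.csv` is not documented here, so inspect it first):

    # sketch only: load and inspect the aggregated survival table
    survival_data <- read.csv("survival_data.csv")
    str(survival_data)       # check which columns are available
    summary(survival_data)   # basic sanity check before running build_model.r / helpers.r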
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used; 3.2 is known to be incompatible)

Depending on the level of detail (see Step 2 for more details):

- up to 2 TB of disk space
- at least 16 GB of RAM (64 GB preferable)
- a few hours to a few months of processing time

Step 1 - software
-----------------

- unpack **ghd-0.1.0.zip**, or clone from GitLab:

        git clone https://gitlab.com/user2589/ghd.git
        git checkout 0.1.0

  `cd` into the extracted folder. All commands below assume it as the current directory.
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- install Docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`
- install libarchive and its headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`. Without this dependency you might get an error on the next step, but it is safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`
- disable all APIs except GitHub (Bitbucket and GitLab support were not yet implemented when this study was in progress): edit `scraper/init.py` and comment out everything except GitHub support in `PROVIDERS`.

Step 2 - obtaining the dataset
------------------------------

The ultimate goal of this step is to get the output of the Python function `common.utils.survival_data()` and save it into a CSV file:

        # copy and paste into a Python console
        from common import utils
        survival_data = utils.survival_data('pypi', '2008', smoothing=6)
        survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15-30 minutes.

- create a folder `
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers on how to perform different types of statistical data analysis for research purposes using the R programming language. R is an open-source software and object-oriented programming language with a development environment (IDE) called RStudio for computing statistics and graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provide a wide range of functions for programming and analyzing data. Unlike many existing statistical software packages, R has the added benefit of allowing users to write more efficient code by using command-line scripting and vectors. It has several built-in functions and libraries that are extensible, and it allows users to define their own (customized) functions for how they expect the program to behave while handling the data, which can also be stored in the simple object system.
For all intents and purposes, this book serves as both a textbook and a manual for R statistics, particularly in academic research, data analytics, and computer programming, targeted to help inform and guide the work of R users and statisticians. It provides information about the different types of statistical data analysis and methods, and the best scenarios for using each of them in R. It gives a hands-on, step-by-step practical guide on how to identify and conduct the different parametric and non-parametric procedures, including a description of the conditions or assumptions that are necessary for performing the various statistical methods or tests, and how to understand their results. The book also covers the different data formats and sources, and how to test for the reliability and validity of the available datasets. Different research experiments, case scenarios, and examples are explained in this book. It is the first book to provide a comprehensive description and step-by-step practical hands-on guide to carrying out the different types of statistical analysis in R, particularly for research purposes, with examples ranging from how to import and store datasets in R as objects, how to code and call the methods or functions for manipulating the datasets or objects, factorization, and vectorization, to better reasoning, interpretation, and storage of the results for future use, and graphical visualizations and representations. In short, it brings statistics and computer programming together for research.
License: CC0 1.0, https://creativecommons.org/publicdomain/zero/1.0/
This dataset was retrieved from the user Mobius's page, where it was generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. I would like to thank Möbius and everyone responsible for the work.
Bellabeat Case Study 1
2022-11-14
1. Introduction
Hello everyone, my name is Nur Simais and this project is part of the Google Data Analytics Professional Certificate. Multiple skills and skillsets were learned throughout this course, which can mainly be categorized as soft and hard skills. The case study I have chosen is about the company "Bellabeat", a maker of fitness tracker devices. The company was founded in 2013 by Urška Sršen and Sando Mur, and it gradually gained recognition and expanded into many countries (https://bellabeat.com/). With this brief info about the company in mind, the business analysis will help the company see how it can achieve its goals and what can be done to improve further.
During the analysis process, I will be using Google's "Ask-Prepare-Process-Analyze-Share-Act" framework that I learned throughout this certification, applying the relevant tools and skillsets within it.
1. ASK
1.1 Business Task
The goal of this project is to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices and how to apply these insights to Bellabeat's marketing strategy, guided by three questions:
What are some trends in smart device usage? How could these trends apply to Bellabeat customers? How could these trends help influence Bellabeat's marketing strategy?
2. PREPARE
Prepare the data and libraries in RStudio. The data is publicly available on Kaggle as FitBit Fitness Tracker Data (CC0: Public Domain), so collect it by downloading that dataset.
There are 18 files in the dataset, but after examining them in Excel I decided to use these 8 datasets: dailyActivity_merged.csv, heartrate_seconds_merged.csv, hourlyCalories_merged.csv, hourlyIntensities_merged.csv, hourlySteps_merged.csv, minuteMETsNarrow_merged.csv, sleepDay_merged.csv, weightLogInfo_merged.csv
2.1 Install and load the packages
Install the R packages used for analysis and visualizations.
install.packages("tidyverse") # core package for cleaning and analysis
install.packages("lubridate") # date library mdy()
install.packages("janitor") # clean_names() to consists only _, character, numbers, and letters.
install.packages("dplyr") #helps to check the garmmar of data manioulation
Load the libraries
library(tidyverse)
library(janitor)
library(lubridate)
library(dplyr)
Having loaded the tidyverse package, the rest of the essential packages (ggplot2, dplyr, and tidyr) are loaded as well.
2.2 Importing and Preparing the Dataset
Upload the archived dataset to RStudio by clicking the Upload button in the bottom right pane.
The files will be saved in a new folder named "Fitabase Data 4.12.16-5.12.16". Import the datasets and rename them.
daily_activity <- read.csv("dailyActivity_merged.csv")
heartrate_seconds <- read_csv("heartrate_seconds_merged.csv")
hourly_calories <- read_csv("hourlyCalories_merged.csv")
(read_csv prints a column specification message; use spec() to retrieve the full column specification or set show_col_types = FALSE to quiet it.)
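The remaining five files from the list in section 2 can be imported the same way (the object names below are illustrative):
hourly_intensities <- read_csv("hourlyIntensities_merged.csv")
hourly_steps <- read_csv("hourlySteps_merged.csv")
minute_mets <- read_csv("minuteMETsNarrow_merged.csv")
sleep_day <- read_csv("sleepDay_merged.csv")
weight_log <- read_csv("weightLogInfo_merged.csv")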
This module series covers how to import, manipulate, format and plot time series data stored in .csv format in R. Originally designed to teach researchers to use NEON plant phenology and air temperature data; has been used in undergraduate classrooms.
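As a rough illustration of the kind of workflow the module teaches (this is not the NEON lesson code; the file and column names are placeholders):
library(ggplot2)
temps <- read.csv("neon_daily_airtemp.csv")        # placeholder file name
temps$date <- as.Date(temps$date)                  # parse the date column
ggplot(temps, aes(x = date, y = airtemp_mean)) +   # placeholder column names
  geom_line() +
  labs(x = "Date", y = "Mean air temperature (C)")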
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The following data shows riding information for members vs. casual riders at the company Cyclistic (a made-up name). This is a dataset used as a case study for the Google Data Analytics certificate.
The changes made to the data in Excel:
- Removed all duplicates (none were found)
- Added a ride_length column by subtracting started_at from ended_at using the formula "=C2-B2", then changed the cell format to Time (37:30:55)
- Added a day_of_week column using the formula "=WEEKDAY(B2,1)" to display the day the ride took place on, 1 = Sunday through 7 = Saturday
- Some cells display as ########; that data was left unchanged. It simply represents negative values and should be treated as 0.
Processing the data in RStudio:
- Installed the required packages: tidyverse for data import and wrangling, lubridate for date functions, and ggplot2 for visualization.
- Step 1: Read the CSV files into R to collect the data
- Step 2: Made sure the files all contained the same column names, because I want to merge them into one
- Step 3: Renamed columns so they align, then merged the files into one combined data frame
- Step 4: More data cleaning and analyzing
- Step 5: Once the data was cleaned and clearly telling a story, I began to visualize it. The visualizations can be seen below.
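A rough sketch of Steps 1 through 3 in R (file names are placeholders; the ride_length and day_of_week definitions follow the Excel steps above):
library(tidyverse)
library(lubridate)
q1 <- read_csv("divvy_trips_q1.csv")              # Step 1: read the csv files (placeholder names)
q2 <- read_csv("divvy_trips_q2.csv")
identical(colnames(q1), colnames(q2))             # Step 2: confirm the column names match
all_trips <- bind_rows(q1, q2)                    # Step 3: merge into one combined data frame
all_trips <- all_trips %>%
  mutate(ride_length = difftime(ended_at, started_at),
         day_of_week = wday(started_at))          # 1 = Sunday through 7 = Saturday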
Load and view a real-world dataset in RStudio
• Calculate “Measure of Frequency” metrics
• Calculate “Measure of Central Tendency” metrics
• Calculate “Measure of Dispersion” metrics
• Use R’s in-built functions for additional data quality metrics
• Create a custom R function to calculate descriptive statistics on any given dataset
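One possible shape for such a custom function (illustrative only, not the exercise's official solution):
describe_numeric <- function(x) {
  x <- x[!is.na(x)]
  c(n      = length(x),        # measure of frequency
    mean   = mean(x),          # measures of central tendency
    median = median(x),
    sd     = sd(x),            # measures of dispersion
    range  = diff(range(x)),
    IQR    = IQR(x))
}
describe_numeric(mtcars$mpg)   # example on a built-in dataset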
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Explanation/Overview: Corresponding dataset for the analyses and results achieved in the CS Track project in the research line on participation analyses, which is also reported in the publication "Does Volunteer Engagement Pay Off? An Analysis of User Participation in Online Citizen Science Projects", a conference paper for the conference CollabTech 2022: Collaboration Technologies and Social Computing, published as part of the Lecture Notes in Computer Science book series (LNCS, volume 13632). The usernames have been anonymised.

Purpose: The purpose of this dataset is to provide the basis to reproduce the results reported in the associated deliverable and in the above-mentioned publication. As such, it does not represent raw data, but rather files that already include certain analysis steps (like calculated degrees or other SNA-related measures), ready for analysis, visualisation and interpretation with R.

Relatedness: The data of the different projects was derived from the forums of 7 Zooniverse projects based on similar discussion board features. The projects are: 'Galaxy Zoo', 'Gravity Spy', 'Seabirdwatch', 'Snapshot Wisconsin', 'Wildwatch Kenya', 'Galaxy Nurseries', 'Penguin Watch'.

Content: In this Zenodo entry, several files can be found. The structure is as follows (files, folders, and descriptions):
- corresponding_calculations.html: Quarto notebook to view in a browser
- corresponding_calculations.qmd: Quarto notebook to view in RStudio
- assets
  - data
    - annotations: annotations.csv, a list of annotations made per day for each of the analysed projects
    - comments: comments.csv, the total list of comments with several data fields (i.e., comment id, text, reply_user_id)
    - rolechanges: 478_rolechanges.csv, 1104_rolechanges.csv, ... lists of roles per user to determine the number of role changes
    - totalnetworkdata
      - Edges: 478_edges.csv, 1104_edges.csv, ... network data (edge sets) for the given projects (without time slices)
      - Nodes: 478_nodes.csv, 1104_nodes.csv, ... network data (node sets) for the given projects (without time slices)
    - trajectories: network data (edge and node sets) for the given projects and all time slices (Q1 2016 - Q4 2021)
      - 478
        - Edges: edges_4782016_q1.csv, edges_4782016_q2.csv, edges_4782016_q3.csv, edges_4782016_q4.csv, ...
        - Nodes: nodes_4782016_q1.csv, nodes_4782016_q2.csv, nodes_4782016_q3.csv, nodes_4782016_q4.csv, ...
      - 1104
        - Edges: ...
        - Nodes: ...
  - scripts
    - datavizfuncs.R: script for the data visualisation functions, automatically executed from within corresponding_calculations.qmd
    - import.R: script for the import of data, automatically executed from within corresponding_calculations.qmd
- corresponding_calculations_files: files for the html/qmd view in the browser/RStudio

Grouping: The data is grouped according to given criteria (e.g., project_title or time). Accordingly, the respective files can be found in the data structure above.
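As an illustration, one of the SNA measures mentioned above (degree) could be recomputed from an edge and node set roughly like this; the paths and column layouts are assumptions, so check the files first:
library(igraph)
edges <- read.csv("assets/data/totalnetworkdata/Edges/478_edges.csv")   # path is an assumption
nodes <- read.csv("assets/data/totalnetworkdata/Nodes/478_nodes.csv")   # first columns assumed to hold user ids
g <- graph_from_data_frame(d = edges, vertices = nodes, directed = FALSE)
summary(degree(g))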
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The main results files are saved separately:
GENERAL INFORMATION
Title of Dataset: Open data: Visual load effects on the auditory steady-state responses to 20-, 40-, and 80-Hz amplitude-modulated tones
Author Information
A. Principal Investigator Contact Information
Name: Stefan Wiens
Institution: Department of Psychology, Stockholm University, Sweden
Internet: https://www.su.se/profiles/swiens-1.184142
Email: sws@psychology.su.se
B. Associate or Co-investigator Contact Information
Name: Malina Szychowska
Institution: Department of Psychology, Stockholm University, Sweden
Internet: https://www.researchgate.net/profile/Malina_Szychowska
Email: malina.szychowska@psychology.su.se
Date of data collection: Subjects (N = 33) were tested between 2019-11-15 and 2020-03-12.
Geographic location of data collection: Department of Psychology, Stockholm, Sweden
Information about funding sources that supported the collection of the data: Swedish Research Council (Vetenskapsrådet) 2015-01181
SHARING/ACCESS INFORMATION
Licenses/restrictions placed on the data: CC BY 4.0
Links to publications that cite or use the data: Szychowska M., & Wiens S. (2020). Visual load effects on the auditory steady-state responses to 20-, 40-, and 80-Hz amplitude-modulated tones. Submitted manuscript.
The study was preregistered: https://doi.org/10.17605/OSF.IO/6FHR8
Links to other publicly accessible locations of the data: N/A
Links/relationships to ancillary data sets: N/A
Was data derived from another source? No
Recommended citation for this dataset: Wiens, S., & Szychowska M. (2020). Open data: Visual load effects on the auditory steady-state responses to 20-, 40-, and 80-Hz amplitude-modulated tones. Stockholm: Stockholm University. https://doi.org/10.17045/sthlmuni.12582002
DATA & FILE OVERVIEW
File List: The files contain the raw data, scripts, and results of main and supplementary analyses of an electroencephalography (EEG) study. Links to the hardware and software are provided under methodological information.
ASSR2_experiment_scripts.zip: contains the Python files to run the experiment.
ASSR2_rawdata.zip: contains raw datafiles for each subject
ASSR2_EEG_scripts.zip: Python-MNE scripts to process the EEG data
ASSR2_EEG_preprocessed_data.zip: EEG data in fif format after preprocessing with Python-MNE scripts
ASSR2_R_scripts.zip: R scripts to analyze the data together with the main datafiles. The main files in the folder are:
ASSR2_results.zip: contains all figures and tables that are created by Python-MNE and R.
METHODOLOGICAL INFORMATION
The EEG data were recorded with an Active Two BioSemi system (BioSemi, Amsterdam, Netherlands; www.biosemi.com) and saved in .bdf format. For more information, see linked publication.
Methods for processing the data: We conducted frequency analyses and computed event-related potentials. See the linked publication.
Instrument- or software-specific information needed to interpret the data:
- MNE-Python (Gramfort A., et al., 2013): https://mne.tools/stable/index.html#
- RStudio used with R (R Core Team, 2020): https://rstudio.com/products/rstudio/
- Wiens, S. (2017). Aladins Bayes Factor in R (Version 3). https://www.doi.org/10.17045/sthlmuni.4981154.v3
Standards and calibration information, if appropriate: For information, see linked publication.
Environmental/experimental conditions: For information, see linked publication.
Describe any quality-assurance procedures performed on the data: For information, see linked publication.
People involved with sample collection, processing, analysis and/or submission:
DATA-SPECIFIC INFORMATION: All relevant information can be found in the MNE-Python and R scripts (in EEG_scripts and analysis_scripts folders) that process the raw data. For example, we added notes to explain what different variables mean.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This contains a Docker container and the prepared data to do the analysis in the paper titled "The Low Dimensionality of Development", published in Social Indicators Research (2020) by Kraemer et al. If you use code or data from this repository, cite the paper. This repository contains a copy of the World Development Indicators database from May 2018 [1]. The analysis was run on a 48-core cluster node with 256 GB of RAM and takes some hours to complete. After generating all results, they can be loaded into slightly over 16 GB of RAM.

Unpack 'docker_data.tar.gz' into a '/path/to/data' on 'host' and run:

docker load -i dockerimage.tar
docker run -v "/path/to/data":/home/rstudio/data -v "/path/to/figures":/home/rstudio/fig -p 8989:8787 -e PASSWORD=secret_password development-indicators

Then open 'host:8989' in your browser and an RStudio session should appear. You can log in with the user "rstudio" and the password you specified before.

NOTE: We did not set a random seed, so slight variations between runs will occur.

[1] License: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/). By: The World Bank. The original data can be found here: https://datacatalog.worldbank.org/dataset/world-development-indicators
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Open data: Visual load does not decrease the auditory steady state response to 40-Hz amplitude-modulated tones

The main results files are saved separately:
- ASSR_study1.html: R output of the main analyses
- ASSR_study1_subset_subjects.html: R output of the main analyses
- ASSR_study2.html: R output of the main analyses

The studies were preregistered:
Study 1: https://doi.org/10.17605/OSF.IO/UYJVA
Study 2: https://doi.org/10.17605/OSF.IO/JVMFD

DATA & FILE OVERVIEW

File List: The files contain the raw data, scripts, and results of main and supplementary analyses of two electroencephalography (EEG) studies (Study1, Study2). Links to the hardware and software are provided under methodological information.

ASSR_study1_experiment_scripts.zip: contains the Python files to run the experiment.

ASSR_study1_rawdata.zip: contains raw datafiles for each subject
- data_EEG: EEG data in bdf format (generated by Biosemi)
- data_log: logfiles of the EEG session (generated by Python)
- data_WMC: logfiles of the working memory capacity task (generated by Python)

ASSR_study1_EEG_scripts.zip: Python-MNE scripts to process the EEG data

ASSR_study1_EEG_preprocessed.zip: Preprocessed EEG data from Python-MNE

ASSR_study1_analysis_scripts.zip: R scripts to analyze the data together with the main datafiles. The main files in the folder are:
- ASSR_study1.html: R output of the main analyses
- ASSR_study1_subset_subjects.html: R output of the main analyses but after excluding five subjects who were excluded because of stricter, preregistered artifact rejection criteria

ASSR_study1_figures.zip: contains all figures and tables that are created by Python-MNE and R.

ASSR_study2_experiment_scripts.zip: contains the Python files to run the experiment

ASSR_study2_rawdata.zip: contains raw datafiles for each subject
- data_EEG: EEG data in bdf format (generated by Biosemi)
- data_log: logfiles of the EEG session (generated by Python)
- data_WMC: logfiles of the working memory capacity task (generated by Python)

ASSR_study2_EEG_scripts.zip: Python-MNE scripts to process the EEG data

ASSR_study2_EEG_preprocessed.zip: Preprocessed EEG data from Python-MNE

ASSR_study2_analysis_scripts.zip: R scripts to analyze the data together with the main datafiles. The main files in the folder are:
- ASSR_study2.html: R output of the main analyses
- ASSR_compare_performance_between_studies.html: R output of analyses that compare behavioral performance between study 1 and study 2.

ASSR_study2_figures.zip: contains all figures and tables that are created by Python-MNE and R.

Instrument- or software-specific information needed to interpret the data:
- MNE-Python (Gramfort A., et al., 2013): https://mne.tools/stable/index.html#
- RStudio used with R (R Core Team, 2016): https://rstudio.com/products/rstudio/
- Wiens, S. (2017). Aladins Bayes Factor in R (Version 3). https://www.doi.org/10.17045/sthlmuni.4981154.v3
The purpose of this project was added practice in learning new skills and demonstrating R data analysis skills. The data set was located on Kaggle and shows sales information for the years 2010 to 2012. The weekly sales have two categories, holiday and non-holiday, represented by 1 and 0 in that column respectively.
The main question for this exercise was: were there any factors that affected weekly sales for the stores? Those factors included temperature, fuel prices, and unemployment rates.
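One quick way to examine those factors once the data is loaded (the loading steps follow below) is a correlation matrix; the column names used here are assumed to match the public Kaggle Walmart dataset:
cor(Walmart[, c("Weekly_Sales", "Temperature", "Fuel_Price", "Unemployment")],
    use = "complete.obs")  # column names assumed from the Kaggle Walmart dataset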
install.packages("tidyverse")
install.packages("dplyr")
install.packages("tsibble")
library("tidyverse")
library(readr)
library(dplyr)
library(ggplot2)
library(readr)
library(lubridate)
library(tsibble)
Walmart <- read.csv("C:/Users/matth/OneDrive/Desktop/Case Study/Walmart.csv")
Compared column names of each file to verify consistency.
colnames(Walmart)
dim(Walmart)
str(Walmart)
head(Walmart)
which(is.na(Walmart$Date))
sum(is.na(Walmart))
There is NA data in the set.
Walmart$Store<-as.factor(Walmart$Store)
Walmart$Holiday_Flag<-as.factor(Walmart$Holiday_Flag)
Walmart$week<-yearweek(as.Date(Walmart$Date,tryFormats=c("%d-%m-%Y"))) # make sure to install "tsibble"
Walmart$year<-format(as.Date(Walmart$Date,tryFormats=c("%d-%m-%Y")),"%Y")
Walmart_Holiday<-
filter(Walmart, Holiday_Flag==1)
Walmart_Non_Holiday<-
filter(Walmart, Holiday_Flag==0)
ggplot(Walmart, aes(x=Weekly_Sales, y=Store))+geom_boxplot()+ labs(title = 'Weekly Sales Across 45 Stores',
x='Weekly sales', y='Store')+theme_bw()
The boxplot shows that Store 14 had the maximum sales while Store 33 had the minimum sales.
Let's verify the results via slice_max() and slice_min():
Walmart %>% slice_max(Weekly_Sales)
Walmart %>% slice_min(Weekly_Sales)
It looks like the information was correct. Let's check the mean of the Weekly_Sales column:
mean(Walmart$Weekly_Sales)
The mean of the Weekly_Sales column for the Walmart dataset was 1046965.
ggplot(Walmart_Holiday, aes(x=Weekly_Sales, y=Store))+geom_boxplot()+ labs(title = 'Holiday Sales Across 45 Stores',
x='Weekly sales', y='Store')+theme_bw()
Based on the boxplot, Store 4 had the highest weekly sales during a holiday week, while stores 33 and 5 had some of the lowest holiday sales. Let's verify again with slice_max() and slice_min():
Walmart_Holiday %>% slice_max(Weekly_Sales)
Walmart_Holiday %>% slice_min(Weekly_Sales)
The results match what is shown on the boxplot. Let's find the mean:
mean(Walmart_Holiday$Weekly_Sales)
The mean was 1122888.
ggplot(Walmart_Non_Holiday, aes(x=Weekly_Sales, y=Store))+geom_boxplot()+ labs(title = 'Non Holiday Sales Across 45 Stores', x='Weekly sales', y='Store')+theme_bw()
This matches the results from the full Walmart dataset, which had both non-holiday and holiday weeks: Store 14 had the maximum sales and Store 33 had the minimum sales. Let's verify the results and find the mean:
Walmart_Non_Holiday %>% slice_max(Weekly_Sales)
Walmart_Non_Holiday %>% slice_min(Weekly_Sales)
mean(Walmart_Non_Holiday$Weekly_Sales)
The results matched, and the mean for weekly sales was 1041256.
ggplot(data = Walmart) + geom_point(mapping = aes(x=year, y=Weekly_Sales))
According to the plot, 2010 had the most sales. Let's use a boxplot to see more.
ggplot(Walmart, aes(x=year, y=Weekly_Sales))+geom_boxplot()+ labs(title = 'Weekly Sales for Years 2010 - 2012',
x='Year', y='Weekly Sales')
2010 saw higher sales numbers and a higher median.
Let's start with holiday weekly sales:
ggplot(Walmart_Holiday, aes(x=year, y=Weekly_Sales))+geom_boxplot()+ labs(title = 'Holiday Weekly Sales for Years ...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
A collection of datasets and python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

1. Datasets

The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

1.1 CSV format

The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):
Label Data type Description
isogramy int The order of isogramy, e.g. "2" is a second order isogram
length int The length of the word in letters
word text The actual word/isogram in ASCII
source_pos text The Part of Speech tag from the original corpus
count int Token count (total number of occurrences)
vol_count int Volume count (number of different sources which contain the word)
count_per_million int Token count per million words
vol_count_as_percent int Volume count as percentage of the total number of volumes
is_palindrome bool Whether the word is a palindrome (1) or not (0)
is_tautonym bool Whether the word is a tautonym (1) or not (0)
The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:
Label Data type Description
!total_1grams int The total number of words in the corpus
!total_volumes int The total number of volumes (individual sources) in the corpus
!total_isograms int The total number of isograms found in the corpus (before compacting)
!total_palindromes int How many of the isograms found are palindromes
!total_tautonyms int How many of the isograms found are tautonyms
The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

1.2 SQLite database format

On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:
• Compacted versions of each dataset, where identical headwords are combined into a single entry.
• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.
The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

2. Scripts

There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

2.1 Source data

The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

2.2 Data preparation

Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

2.3 Isogram extraction

After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

python isograms.py --batch --infile=INFILE --outfile=OUTFILE

Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

2.4 Creating a SQLite3 database

The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:
1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
2. Copy the "create-database.sql" script into the same directory as the two data files.
3. On the command line, go to the directory where the files and the SQL script are.
4. Type: sqlite3 isograms.db
5. This will create a database called "isograms.db".

See section 1 for a basic description of the output data and how to work with the database.

2.5 Statistical processing

The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
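As a starting point for querying the database from R (the table name below is a placeholder; use dbListTables() to see the tables that create-database.sql actually creates):
library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "isograms.db")
dbListTables(con)                                  # list the actual table names
res <- dbGetQuery(con, "SELECT length, COUNT(*) AS n FROM some_table GROUP BY length")  # placeholder table name
dbDisconnect(con)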
This child page contains a zipped folder which contains all of the items necessary to run load estimation using R-LOADEST to produce results that are published in U.S. Geological Survey Investigations Report 2021-XXXX [Tatge, W.S., Nustad, R.A., and Galloway, J.M., 2021, Evaluation of Salinity and Nutrient Conditions in the Heart River Basin, North Dakota, 1970-2020: U.S. Geological Survey Scientific Investigations Report 2021-XXXX, XX p]. The folder contains an allsiteinfo.table.csv file, a "datain" folder, and a "scripts" folder. The allsiteinfo.table.csv file can be used to cross reference the sites with the main report (Tatge and others, 2021). The "datain" folder contains all the input data necessary to reproduce the load estimation results. The naming convention in the "datain" folder is site_MI_rloadest or site_NUT_rloadest for either the major ion loads or the nutrient loads. The .Rdata files are used in the scripts to run the estimations and the .csv files can be used to look at the data. The "scripts" folder contains the written R scripts to produce the results of the load estimation from the main report. R-LOADEST is a software package for analyzing loads in streams and an accompanying report (Runkel and others, 2004) serves as the formal documentation for R-LOADEST. The package is a collection of functions written in R (R Development Core Team, 2019), an open source language and a general environment for statistical computing and graphics.

The following system requirements are necessary for producing results:
- Windows 10 operating system
- R (version 3.4 or later; 64-bit recommended)
- RStudio (version 1.1.456 or later)
- R-LOADEST program (available at https://github.com/USGS-R/rloadest)

Runkel, R.L., Crawford, C.G., and Cohn, T.A., 2004, Load Estimator (LOADEST): A FORTRAN Program for Estimating Constituent Loads in Streams and Rivers: U.S. Geological Survey Techniques and Methods Book 4, Chapter A5, 69 p. [Also available at https://pubs.usgs.gov/tm/2005/tm4A5/pdf/508final.pdf.]
R Development Core Team, 2019, R—A language and environment for statistical computing: Vienna, Austria, R Foundation for Statistical Computing, accessed December 7, 2020, at https://www.r-project.org.
[Note 2023-08-14 - Supersedes version 1, https://doi.org/10.15482/USDA.ADC/1528086]

This dataset contains all code and data necessary to reproduce the analyses in the manuscript: Mengistu, A., Read, Q. D., Sykes, V. R., Kelly, H. M., Kharel, T., & Bellaloui, N. (2023). Cover crop and crop rotation effects on tissue and soil population dynamics of Macrophomina phaseolina and yield under no-till system. Plant Disease. https://doi.org/10.1094/pdis-03-23-0443-re

The .zip archive cropping-systems-1.0.zip contains data and code files.

Data
- stem_soil_CFU_by_plant.csv: Soil disease load (SoilCFUg) and stem tissue disease load (StemCFUg) for individual plants in CFU per gram, with columns indicating year, plot ID, replicate, row, plant ID, previous crop treatment, cover crop treatment, and comments. Missing data are indicated with .
- yield_CFU_by_plot.csv: Yield data (YldKgHa) at the plot level in units of kg/ha, with columns indicating year, plot ID, replicate, and treatments, as well as means of soil and stem disease load at the plot level.

Code
- cropping_system_analysis_v3.0.Rmd: RMarkdown notebook with all data processing, analysis, and visualization code
- equations.Rmd: RMarkdown notebook with formatted equations
- formatted_figs_revision.R: R script to produce figures formatted exactly as they appear in the manuscript

The Rproject file cropping-systems.Rproj is used to organize the RStudio project. Scripts and notebooks used in older versions of the analysis are found in the testing/ subdirectory. Excel spreadsheets containing raw data from which the cleaned CSV files were created are found in the raw_data subdirectory.
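A minimal sketch of reading the two data files in R, using the missing-data marker noted above:
stem_soil <- read.csv("stem_soil_CFU_by_plant.csv", na.strings = ".")  # "." marks missing data
yield     <- read.csv("yield_CFU_by_plot.csv")
summary(stem_soil[, c("SoilCFUg", "StemCFUg")])
summary(yield$YldKgHa)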
License: Apache License 2.0, http://www.apache.org/licenses/LICENSE-2.0
This replication package contains all necessary scripts and data to replicate the main figures and tables presented in the paper.
1_scripts: This folder contains all scripts required to replicate the main figures and tables of the paper. The scripts are numbered with a prefix (e.g. "1_") in the order they should be run. Output will also be produced in this folder.
- 0_init.Rmd: An R Markdown file that installs and loads all packages necessary for the subsequent scripts.
- 1_fig_1.Rmd: Primarily produces Figure 1 (Zipf's plots) and conducts statistical tests to support underlying statistical claims made through the figure.
- 2_fig_2_to_4.Rmd: Primarily produces Figures 2 to 4 (average levels of expression) and conducts statistical tests to support underlying statistical claims made through the figures. This includes conducting t-tests to establish subgroup differences. The script also produces the file table_controlling_how.csv, which contains the full set of regression results for the analysis of subgroup differences in political stances, controlling for emotionality, egocentrism, and toxicity. This file includes effect sizes, standard errors, confidence intervals, and p-values for each stance, group variable, and confounder.
- 3_fig_5_to_6.Rmd: Primarily produces Figures 5 to 6 (trends in expression) and conducts statistical tests to support underlying statistical claims made through the figures. This includes conducting t-tests to establish subgroup differences.
- 4_tab_1_to_2.Rmd: Produces Tables 1 to 2, and shows code for Table A5 (descriptive tables).
Expected run time for each script is under 3 minutes and requires around 4GB RAM. Script 3_fig_5_to_6.Rmd can take up to 3-4 minutes and requires up to 6GB RAM. Installation of each package for the first time user may take around 2 minutes each, except 'tidyverse', which may take around 4 minutes.
We have not provided a demo since the actual dataset used for analysis is small enough and computations are efficient enough to be run in most systems.
Each script starts with a layperson explanation to overview the functionality of the code and a pseudocode for a detailed procedure, followed by the actual code.
2_data: This folder contains all data used to replicate the main results. The data is called by the respective scripts automatically using relative paths.
- data_dictionary.txt: Provides a description of all variables as they are coded in the various datasets, especially the main author-by-time level dataset called repl_df.csv.
- Processed data aggregated at the individual author by time (year by month) level are provided, as raw data containing raw tweets cannot be shared.
This project uses R and RStudio. Make sure you have the following installed:
Once R and RStudio are installed, use the R Markdown script '0_init.Rmd' to ensure the correct versions of the required packages are installed. This script will install the remotes package (if not already installed) and then install the specified versions of the required packages.
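The gist of what 0_init.Rmd does, per the description above (the package name and version here are placeholders; the actual list lives in 0_init.Rmd):
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_version("data.table", version = "1.14.8")   # placeholder package/version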
This project is licensed under the Apache License 2.0 - see the license.txt file for details.
License: MIT, https://opensource.org/licenses/MIT
This R-Project and its data files are provided in support of ongoing research efforts for forecasting COVID-19 cumulative case growth at varied geographic levels. All code and data files are provided to facilitate reproducibility of current research findings. Seven forecasting methods are evaluated with respect to their effectiveness at forecasting one-, three-, and seven-day cumulative COVID-19 cases, including: (1) a Naïve approach; (2) Holt-Winters exponential smoothing; (3) growth rate; (4) moving average (MA); (5) autoregressive (AR); (6) autoregressive moving average (ARMA); and (7) autoregressive integrated moving average (ARIMA). This package is developed to be directly opened and run in RStudio through the provided RProject file. Code developed using R version 3.6.3.
This software generates the findings of the article entitled "Short-range forecasting of coronavirus disease 2019 (COVID-19) during early onset at county, health district, and state geographic levels: Comparative forecasting approach using seven forecasting methods" using cumulative case counts reported by The New York Times up to April 22, 2020. This package provides two avenues for reproducing results: 1) Regenerate the forecasts from scratch using the provided code and data files and then run the analyses; or 2) Load the saved forecast data and run the analyses on the existing data
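This is not the authors' code, but a minimal illustration of two of the seven methods named above (Holt-Winters exponential smoothing and ARIMA) applied to a toy cumulative series using the forecast package:
library(forecast)
cases <- ts(cumsum(c(1, 2, 3, 5, 8, 13, 21, 30, 42, 55, 70, 88, 110, 135)))  # toy cumulative counts
fc_hw    <- forecast(HoltWinters(cases, gamma = FALSE), h = 7)   # Holt-Winters (no seasonality)
fc_arima <- forecast(auto.arima(cases), h = 7)                   # ARIMA
fc_arima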
License info can be viewed from the "License Info.txt" file.
The "RProject" folder contains the RProject file which opens the project in RStudio with the desired working directory set.
README files are contained in each sub-folder, providing additional detail on the contents of the folder.
Copyright (c) 2020 Christopher J. Lynch and Ross Gore
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Except as contained in this notice, the name(s) of the above copyright holders shall not be used in advertising or otherwise to promote the sale, use, or other dealings in this Software without prior written authorization.
This dataset is primarily an RStudio project. To run the .R files, we recommend first opening the .RProj file in RStudio and installing the here package. This will allow you to run all of the .R scripts without changing any of the working directories.
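A short illustration of how the here package removes the need to change working directories (the data path is a placeholder):
install.packages("here")
library(here)
here()                                       # reports the project root (where the .RProj file lives)
df <- read.csv(here("data", "example.csv"))  # placeholder path, resolved relative to the project root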
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
DESCRIPTION
This repository contains analysis scripts (with outputs), figures from the manuscript, and supplementary files for the HIV Pain (HIP) Intervention Study. All analysis scripts (and their outputs, in the /outputs subdirectory) are found in HIP-study.zip, while PDF copies of the analysis outputs that are cited in the manuscript as supplementary material are found in the relevant supplement-*.pdf file.

Note: Participant consent did not provide for the publication of their data, and hence neither the original nor cleaned data have been made available. However, we do not wish to bar access to the data unnecessarily and we will judge requests to access the data on a case-by-case basis. Examples of potential use cases include independent assessments of our analyses, and secondary data analyses. Please contact Peter Kamerman (peter.kamerman@gmail.com) or Dr Tory Madden (torymadden@gmail.com), or open an issue on the GitHub repo (https://github.com/kamermanpr/HIP-study/issues).

BIBLIOGRAPHIC INFORMATION

Repository citation: Kamerman PR, Madden VJ, Parker R, Devan D, Cameron S, Jackson K, Reardon C, Wadley A. Analysis scripts and supplementary files: Barriers to implementing clinical trials on non-pharmacological treatments in developing countries – lessons learnt from addressing pain in HIV. DOI: 10.6084/m9.figshare.7654637.

Manuscript citation: Parker R, Madden VJ, Devan D, Cameron S, Jackson K, Kamerman P, Reardon C, Wadley A. Barriers to implementing clinical trials on non-pharmacological treatments in developing countries – lessons learnt from addressing pain in HIV. Pain Reports [submitted 2019-01-31]

Manuscript abstract:
Introduction: Pain affects over half of people living with HIV/AIDS (LWHA) and pharmacological treatment has limited efficacy. Preliminary evidence supports non-pharmacological interventions. We previously piloted a multimodal intervention in amaXhosa women LWHA and chronic pain in South Africa, with improvements seen in all outcomes, in both intervention and control groups.
Methods: A multicentre, single-blind randomised controlled trial with 160 participants recruited was conducted to determine whether the multimodal peer-led intervention reduced pain in different populations of both male and female South Africans LWHA. Participants were followed up at Weeks 4, 8, 12, 24 and 48 to evaluate effects on the primary outcome of pain, and on depression, self-efficacy and health-related quality of life.
Results: We were unable to assess the efficacy of the intervention due to a 58% loss to follow up (LTFU). Secondary analysis of the LTFU found that sociocultural factors were not predictive of LTFU. Depression, however, did associate with LTFU, with greater severity of depressive symptoms predicting LTFU at week 8 (p=0.01).
Discussion: We were unable to evaluate the effectiveness of the intervention due to the high LTFU and the risk of retention bias. The different sociocultural context in South Africa may warrant a different approach to interventions for pain in HIV compared to resource-rich countries, including a concurrent strategy to address barriers to health care service delivery. We suggest that assessment of pain and depression need to occur simultaneously in those with pain in HIV. We suggest investigation of the effect of social inclusion on pain and depression.

USING DOCKER TO RUN THE HIP-STUDY ANALYSIS SCRIPTS
These instructions are for running the analysis on your local machine. You need to have Docker installed on your computer.
To do so, go to docker.com (https://www.docker.com/community-edition#/download) and follow the instructions for downloading and installing Docker for your operating system. Once Docker has been installed, follow the steps below, noting that Docker commands are entered in a terminal window (Linux and OSX/macOS) or command prompt window (Windows). Windows users also may wish to install GNU Make (http://gnuwin32.sourceforge.net/downlinks/make.php) (required for the make method of running the scripts) and Git (https://gitforwindows.org/) version control software (not essential).

Download the latest image
Enter: docker pull kamermanpr/docker-hip-study:v2.0.0

Run the container
Enter: docker run -d -p 8787:8787 -v <path>:/home/rstudio --name threshold -e USER=hip -e PASSWORD=study kamermanpr/docker-hip-study:v2.0.0
Where <path> refers to the path to the HIP-study directory on your computer, which you either cloned from GitHub (https://github.com/kamermanpr/HIP-study.git), git clone https://github.com/kamermanpr/HIP-study, or downloaded and extracted from figshare (https://doi.org/10.6084/m9.figshare.7654637).

Login to RStudio Server
- Open a web browser window and navigate to: localhost:8787
- Use the following login credentials:
  - Username: hip
  - Password: study

Prepare the HIP-study directory
The HIP-study directory comes with the outputs for all the analysis scripts in the /outputs directory (html and md formats). However, should you wish to run the scripts yourself, there are several preparatory steps that are required:
1. Acquire the data. The data required to run the scripts have not been included in the repo because participants in the studies did not consent to public release of their data. However, the data are available on request from Peter Kamerman (peter.kamerman@gmail.com). Once the data have been obtained, the files should be copied into a subdirectory named /data-original.
2. Clean the /outputs directory by entering make clean in the Terminal tab in RStudio.

Run the HIP-study analysis scripts
To run all the scripts (including the data cleaning scripts), enter make all in the Terminal tab in RStudio.

To run individual RMarkdown scripts (*.Rmd files):
1. Generate the cleaned data using one of the following methods:
   - Enter make data-cleaned/demographics.rds in the Terminal tab in RStudio.
   - Enter source('clean-data-script.R') in the Console tab in RStudio.
   - Open the clean-data-script.R script through the File tab in RStudio, and then click the 'Source' button on the right of the Script console in RStudio for each script.
2. Run the individual script by:
   - Entering make outputs/<script-name>.html in the Terminal tab in RStudio, OR
   - Opening the relevant *.Rmd file through the File tab in RStudio, and then clicking the 'knit' button on the left of the Script console in RStudio.

Shutting down
Once done, log out of RStudio Server and enter the following into a terminal to stop the Docker container: docker stop hip. If you then want to remove the container, enter: docker rm threshold. If you also want to remove the Docker image you downloaded, enter: docker rmi kamermanpr/docker-hip-study:v2.0.0
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
The folder contains everything that is needed to reproduce the findings, figures and tables presented in the following publication: Krasselt, J., & Dreesen, Ph. (2024). Topic models indicate textual aboutness and pragmatics: Valuation practices in Islamophobic discourse. Journal of Cultural Analytics.

In detail:
R script to reproduce figures and tables: supplementary_material_cultural_analytics.rmd
- the script is also provided as a commented html markdown version: supplementary_material_cultural_analytics.html
- to run the script, open the file supplementary_material_cultural_analytics.rmd in RStudio, install the necessary packages and run each chunk

LDA topic model
- document-topic distribution: doc_topics_df.csv
- topic list (top 20 words): top_words_df.csv
- word-topic assignment: topic_word_assignment.csv (columns with actual tokens and lemmata were deleted due to copyright)

Sura citations
- a file containing 3-grams counted for sura citations only: citations_suras_3grams.txt
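A quick way to peek at the provided topic-model outputs in R before running the notebook (column layouts are not documented above, so inspect them first):
doc_topics <- read.csv("doc_topics_df.csv")   # document-topic distribution
top_words  <- read.csv("top_words_df.csv")    # top 20 words per topic
str(doc_topics)
head(top_words)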