https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used; 3.2 is known to be incompatible)

Depending on the level of detail (see Step 2 for more details):

- up to 2TB of disk space (see the Step 2 detail levels)
- at least 16GB of RAM (64 preferable)
- a few hours to a few months of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from GitLab:

      git clone https://gitlab.com/user2589/ghd.git
      git checkout 0.1.0

  `cd` into the extracted folder. All commands below assume it as the current directory.
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`. Without this dependency, you might get an error on the next step, but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`
- disable all APIs except GitHub (Bitbucket and GitLab support were not yet implemented when this study was in progress): edit `scraper/__init__.py` and comment out everything except GitHub support in `PROVIDERS`.

Step 2 - obtaining the dataset
------------------------------

The ultimate goal of this step is to get the output of the Python function `common.utils.survival_data()` and save it into a CSV file:

    # copy and paste into a Python console
    from common import utils
    survival_data = utils.survival_data('pypi', '2008', smoothing=6)
    survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15..30 minutes.

- create a folder `
This child page contains a zipped folder with all of the items necessary to run load estimation using R-LOADEST to produce results that are published in U.S. Geological Survey Scientific Investigations Report 2021-XXXX [Tatge, W.S., Nustad, R.A., and Galloway, J.M., 2021, Evaluation of Salinity and Nutrient Conditions in the Heart River Basin, North Dakota, 1970-2020: U.S. Geological Survey Scientific Investigations Report 2021-XXXX, XX p.].

The folder contains an allsiteinfo.table.csv file, a "datain" folder, and a "scripts" folder. The allsiteinfo.table.csv file can be used to cross-reference the sites with the main report (Tatge and others, 2021). The "datain" folder contains all the input data necessary to reproduce the load estimation results. The naming convention in the "datain" folder is site_MI_rloadest or site_NUT_rloadest for either the major ion loads or the nutrient loads. The .Rdata files are used in the scripts to run the estimations, and the .csv files can be used to look at the data. The "scripts" folder contains the written R scripts that produce the load estimation results from the main report.

R-LOADEST is a software package for analyzing loads in streams, and an accompanying report (Runkel and others, 2004) serves as the formal documentation for R-LOADEST. The package is a collection of functions written in R (R Development Core Team, 2019), an open-source language and a general environment for statistical computing and graphics.

The following system requirements are necessary for producing results:
- Windows 10 operating system
- R (version 3.4 or later; 64-bit recommended)
- RStudio (version 1.1.456 or later)
- R-LOADEST program (available at https://github.com/USGS-R/rloadest)

References:
Runkel, R.L., Crawford, C.G., and Cohn, T.A., 2004, Load Estimator (LOADEST): A FORTRAN Program for Estimating Constituent Loads in Streams and Rivers: U.S. Geological Survey Techniques and Methods Book 4, Chapter A5, 69 p. [Also available at https://pubs.usgs.gov/tm/2005/tm4A5/pdf/508final.pdf.]
R Development Core Team, 2019, R—A language and environment for statistical computing: Vienna, Austria, R Foundation for Statistical Computing, accessed December 7, 2020, at https://www.r-project.org.
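For orientation only, a minimal sketch of how a calibration object might be used with the rloadest package is shown below. The file, object, column, constituent, and station names are placeholders (assumptions), not the actual contents of the "datain" folder; the provided scripts should be used to reproduce the published results.

```r
# Minimal sketch: fit and apply a LOADEST regression with rloadest.
# File, object and column names are assumptions, not the repository's actual contents.
library(rloadest)

load("datain/site_MI_rloadest.Rdata")          # hypothetical .Rdata file name
# suppose it provides a calibration data frame 'calib.data' with DATES, FLOW and a constituent column

fit <- loadReg(Chloride ~ model(1),            # predefined LOADEST model no. 1
               data = calib.data,
               flow = "FLOW", dates = "DATES",
               conc.units = "mg/L",
               station = "Heart River site")   # placeholder station label

print(fit)                                     # regression summary and bias diagnostics
loads <- predLoad(fit, newdata = calib.data, by = "day")   # daily load estimates
```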
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

1. Datasets

The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

1.1 CSV format

The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):
Label Data type Description
isogramy int The order of isogramy, e.g. "2" is a second order isogram
length int The length of the word in letters
word text The actual word/isogram in ASCII
source_pos text The Part of Speech tag from the original corpus
count int Token count (total number of occurrences)
vol_count int Volume count (number of different sources which contain the word)
count_per_million int Token count per million words
vol_count_as_percent int Volume count as percentage of the total number of volumes
is_palindrome bool Whether the word is a palindrome (1) or not (0)
is_tautonym bool Whether the word is a tautonym (1) or not (0)
The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:
Label Data type Description
!total_1grams int The total number of words in the corpus
!total_volumes int The total number of volumes (individual sources) in the corpus
!total_isograms int The total number of isograms found in the corpus (before compacting)
!total_palindromes int How many of the isograms found are palindromes
!total_tautonyms int How many of the isograms found are tautonyms
The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

1.2 SQLite database format

On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:
• Compacted versions of each dataset, where identical headwords are combined into a single entry.
• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.
The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

2. Scripts

There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

2.1 Source data

The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

2.2 Data preparation

Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

2.3 Isogram extraction

After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

python isograms.py --batch --infile=INFILE --outfile=OUTFILE

Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

2.4 Creating a SQLite3 database

The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:
1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
2. Copy the "create-database.sql" script into the same directory as the two data files.
3. On the command line, go to the directory where the files and the SQL script are.
4. Type: sqlite3 isograms.db
5. This will create a database called "isograms.db".

See section 1 for a basic description of the output data and how to work with the database.

2.5 Statistical processing

The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
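For quick orientation, a minimal R sketch of querying the database with RSQLite is given below. The table name `ngrams_isograms` is an assumption (use `dbListTables()` to see the actual names); the column names follow the table in section 1.1.

```r
# Minimal sketch: query the isogram database with RSQLite (table name is assumed)
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "isograms.db")
dbListTables(con)                     # inspect the actual table names first

# Example: the ten most frequent second-order isograms that are also palindromes
# ('ngrams_isograms' is a placeholder table name; adjust to what dbListTables() reports)
res <- dbGetQuery(con, "
  SELECT word, length, count
  FROM   ngrams_isograms
  WHERE  isogramy = 2 AND is_palindrome = 1
  ORDER  BY count DESC
  LIMIT  10")
print(res)

dbDisconnect(con)
```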
These data support poscrptR (Wright et al. 2021). poscrptR is a Shiny app that predicts the probability of post-fire conifer regeneration for fire data supplied by the user. The predictive model was fit using presence/absence data collected in 4.4 m radius plots (60 square meters). Please refer to Stewart et al. (2020) for more details concerning field data collection, the model fitting process, and limitations. Learn more about Shiny apps at https://shiny.rstudio.com.

The app is designed to simplify the process of predicting post-fire conifer regeneration under different precipitation and seed production scenarios. The app requires the user to upload two input data sets: 1. a raster of Relativized differenced Normalized Burn Ratio (RdNBR), and 2. a .zip folder containing a fire perimeter shapefile. The app was designed to use Rapid Assessment of Vegetative Condition (RAVG) data inputs. The RAVG website (https://fsapps.nwcg.gov/ravg) has both RdNBR and fire perimeter data sets available for all fires with at least 1,000 acres of National Forest land from 2007 to the present. The fire perimeter must be a zipped shapefile (.zip file; include all shapefile components: .cpg, .dbf, .prj, .sbn, .sbx, .shp, and .shx). RdNBR must be 30 m resolution, and both the RdNBR and fire perimeter must use the USA Contiguous Albers Equal Area Conic coordinate reference system (USGS version). RdNBR must be aligned (same origin) with the RAVG raster data.

References: Stewart, J., van Mantgem, P., Young, D., Shive, K., Preisler, H., Das, A., Stephenson, N., Keeley, J., Safford, H., Welch, K., Thorne, J., 2020. Effects of postfire climate and seed availability on postfire conifer regeneration. Ecological Applications. Wright, M.C., Stewart, J.E., van Mantgem, P.J., Young, D.J., Shive, K.L., Preisler, H.K., Das, A.J., Stephenson, N.L., Keeley, J.E., Safford, H.D., Welch, K.R., and Thorne, J.H. 2021. poscrptR. R package version 0.1.3.
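Before uploading, it can help to confirm that the RdNBR raster meets the stated requirements. The sketch below uses the raster package with a hypothetical file name (rdnbr.tif) and is not part of poscrptR itself.

```r
# Sketch only (hypothetical file name): check an RdNBR raster before uploading it to the app
library(raster)

rdnbr <- raster("rdnbr.tif")      # RAVG RdNBR export; the name is a placeholder
res(rdnbr)                        # should be c(30, 30) for 30 m resolution
crs(rdnbr)                        # should be USA Contiguous Albers Equal Area Conic (USGS version)
origin(rdnbr)                     # origin must match the RAVG raster grid
```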
Contains data and code for the manuscript 'Mean landscape-scale incidence of species in discrete habitats is patch size dependent'. Raw data consist of 202 published datasets collated from primary and secondary (e.g., government technical reports) sources. These sources summarise metacommunity structure for different taxonomic groups (birds, invertebrates, non-avian vertebrates or plants) in different types of discrete metacommunities, including 'true' islands (i.e., inland, continental or oceanic archipelagos), habitat islands (e.g., ponds, wetlands, sky islands) and fragments (e.g., forest/woodland or grass/shrubland habitat remnants). The aim of the study was to test whether the size of a habitat patch influences the mean incidences of species within it, relative to the incidence of all species across the landscape. In other words, whether high-incidence (widespread) or low-incidence (narrow-range) species are found more often than expected in smaller or larger patches. To achieve th...

Details regarding keyword and other search strategies used to collate the raw database from published sources were presented in Deane, D.C. & He, F. (2018) Loss of only the smallest patches will reduce species diversity in most discrete habitat networks. Glob Chang Biol, 24, 5802-5814, and in Deane, D.C. (2022) Species accumulation in small-large vs large-small order: more species but not all species? Oecologia, 200, 273-284. Minimum data requirements were presence/absence records for all species in all patches and the area of each habitat patch. The database consists of 202 published datasets. The first column in each dataset is the area of the patch in question (in hectares); other columns record presence and absence of each species in each patch. In the study, a metric was calculated for every patch that quantifies how the incidence of species in each patch compares with an expectation derived from the occupancy of all species in all patches (called mean species landscape-scale incid...).

All provided files are intended for use within the R programming environment. The raw database records required to run the analysis from scratch, along with processed data used to run regression models, are saved as R data objects (i.e., extension '.RData'). The fitted model obtained in analysis and used to generate results is also an R object, but of class 'brmsfit' (requiring the R package brms to be loaded into the R workspace). Both object types can be opened in R (RStudio, etc.).

# Data from 'Species representation in discrete habitats is patch size dependent'
Contains the raw data and code used to reproduce the analysis and results in the manuscript.
The simplest way to do this is to save all files provided to a single folder. Code needed to run the analyses in the paper is in scr_R_code_Dryad_R01.txt. Change the file extension from .txt to .R; the script can then be opened directly in R/RStudio, and it includes code to load all objects described and to run all analyses.
Description of data files:
Data: Datha.RData - an R object of class 'list', each element of the list representing one of 202 p/a datasets obtained from published sources. Datasets are saved as sites x species dataframes, with patch area (in hectares) in the first column and species data in column numbers 2 to N+1, where N is the total number of species in that dataset (the +1 reflects the area data in the first column). Note, the nu...
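For orientation, a minimal sketch of inspecting the provided list object in R is shown below. It assumes only that Datha.RData sits in the working directory and makes no assumption about the internal object name, since load() returns it.

```r
# Sketch: load the list of 202 presence/absence datasets and inspect the first one
objs <- load("Datha.RData")      # returns the name(s) of the loaded object(s)
datasets <- get(objs[1])         # the list of 202 sites x species data frames

length(datasets)                 # expected: 202
d1 <- datasets[[1]]
head(d1[, 1:min(5, ncol(d1))])   # column 1 = patch area (ha), columns 2..N+1 = species
```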
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Explanation/Overview:
Corresponding dataset for the analyses and results achieved in the CS Track project in the research line on participation analyses, which is also reported in the publication "Does Volunteer Engagement Pay Off? An Analysis of User Participation in Online Citizen Science Projects", a conference paper for CollabTech 2022: Collaboration Technologies and Social Computing, published as part of the Lecture Notes in Computer Science book series (LNCS, volume 13632). The usernames have been anonymised.
Purpose:
The purpose of this dataset is to provide the basis to reproduce the results reported in the associated deliverable, and in the above-mentioned publication. As such, it does not represent raw data, but rather files that already include certain analysis steps (like calculated degrees or other SNA-related measures), ready for analysis, visualisation and interpretation with R.
Relatedness:
The data of the different projects was derived from the forums of 7 Zooniverse projects based on similar discussion board features. The projects are: 'Galaxy Zoo', 'Gravity Spy', 'Seabirdwatch', 'Snapshot Wisconsin', 'Wildwatch Kenya', 'Galaxy Nurseries', 'Penguin Watch'.
Content:
In this Zenodo entry, several files can be found. The structure is as follows (files, folders, and descriptions):

- corresponding_calculations.html: Quarto notebook to view in the browser
- corresponding_calculations.qmd: Quarto notebook to view in RStudio
- assets
- data
  - annotations
    - annotations.csv: list of annotations made per day for each of the analysed projects
  - comments
    - comments.csv: total list of comments with several data fields (i.e., comment id, text, reply_user_id)
  - rolechanges
    - 478_rolechanges.csv: list of roles per user to determine the number of role changes
    - 1104_rolechanges.csv
    - ...
  - totalnetworkdata
    - Edges
      - 478_edges.csv: network data (edge set) for the given projects (without time slices)
      - 1104_edges.csv
      - ...
    - Nodes
      - 478_nodes.csv: network data (node set) for the given projects (without time slices)
      - 1104_nodes.csv
      - ...
  - trajectories: network data (edge and node sets) for the given projects and all time slices (Q1 2016 - Q4 2021)
    - 478
      - Edges
        - edges_4782016_q1.csv
        - edges_4782016_q2.csv
        - edges_4782016_q3.csv
        - edges_4782016_q4.csv
        - ...
      - Nodes
        - nodes_4782016_q1.csv
        - nodes_4782016_q2.csv
        - nodes_4782016_q3.csv
        - nodes_4782016_q4.csv
        - ...
    - 1104
      - Edges
        - ...
      - Nodes
        - ...
    - ...
- scripts
  - datavizfuncs.R: script for the data visualisation functions, automatically executed from within corresponding_calculations.qmd
  - import.R: script for the import of data, automatically executed from within corresponding_calculations.qmd
- corresponding_calculations_files: files for the html/qmd view in the browser/RStudio
Grouping:
The data is grouped according to given criteria (e.g., project_title or time); the respective files can be found accordingly in the data structure above.
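To give an idea of how the network files fit together, here is a minimal R sketch that builds one project's total network from its node and edge sets with igraph. The paths follow the listing above and may need adjusting (e.g., if the folders sit under assets), and the exact column layout of the CSVs is an assumption: igraph expects the first two edge columns to be the endpoints and the first node column to be the vertex id.

```r
# Sketch (assumed paths and column layout): build one project's forum interaction network
library(igraph)

edges <- read.csv("data/totalnetworkdata/Edges/478_edges.csv")
nodes <- read.csv("data/totalnetworkdata/Nodes/478_nodes.csv")

# graph_from_data_frame() uses the first two edge columns as endpoints and the
# first node column as the vertex id; remaining columns become attributes.
g <- graph_from_data_frame(d = edges, vertices = nodes, directed = TRUE)

vcount(g); ecount(g)                       # basic size of the network
head(sort(degree(g), decreasing = TRUE))   # most active users by degree
```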
[Note 2023-08-14 - Supersedes version 1, https://doi.org/10.15482/USDA.ADC/1528086]

This dataset contains all code and data necessary to reproduce the analyses in the manuscript: Mengistu, A., Read, Q. D., Sykes, V. R., Kelly, H. M., Kharel, T., & Bellaloui, N. (2023). Cover crop and crop rotation effects on tissue and soil population dynamics of Macrophomina phaseolina and yield under no-till system. Plant Disease. https://doi.org/10.1094/pdis-03-23-0443-re

The .zip archive cropping-systems-1.0.zip contains data and code files.

Data
- stem_soil_CFU_by_plant.csv: Soil disease load (SoilCFUg) and stem tissue disease load (StemCFUg) for individual plants in CFU per gram, with columns indicating year, plot ID, replicate, row, plant ID, previous crop treatment, cover crop treatment, and comments. Missing data are indicated with '.'.
- yield_CFU_by_plot.csv: Yield data (YldKgHa) at the plot level in units of kg/ha, with columns indicating year, plot ID, replicate, and treatments, as well as means of soil and stem disease load at the plot level.

Code
- cropping_system_analysis_v3.0.Rmd: RMarkdown notebook with all data processing, analysis, and visualization code
- equations.Rmd: RMarkdown notebook with formatted equations
- formatted_figs_revision.R: R script to produce figures formatted exactly as they appear in the manuscript

The Rproject file cropping-systems.Rproj is used to organize the RStudio project. Scripts and notebooks used in older versions of the analysis are found in the testing/ subdirectory. Excel spreadsheets containing raw data from which the cleaned CSV files were created are found in the raw_data subdirectory.
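A minimal sketch of reading the two CSV files in R is shown below. Only the column names explicitly listed above (SoilCFUg, StemCFUg, YldKgHa) are used, and treating '.' as the missing-value marker follows the data description.

```r
# Sketch: read the plant- and plot-level files, treating '.' as missing
plants <- read.csv("stem_soil_CFU_by_plant.csv", na.strings = ".")
plots  <- read.csv("yield_CFU_by_plot.csv",      na.strings = ".")

summary(plants[, c("SoilCFUg", "StemCFUg")])   # disease loads in CFU per gram
summary(plots$YldKgHa)                         # plot-level yield in kg/ha
```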
http://www.apache.org/licenses/LICENSE-2.0
# Replication Package for 'Political Expression of Academics on Social Media' by Prashant Garg and Thiemo Fetzer.
## Overview
This replication package contains all necessary scripts and data to replicate the main figures and tables presented in the paper.
## Folder Structure
### 1. `1_scripts`
This folder contains all scripts required to replicate the main figures and tables of the paper. The scripts are numbered with a prefix (e.g. "1_") in the order they should be run. Output will also be produced in this folder.
- `0_init.Rmd`: An R Markdown file that installs and loads all packages necessary for the subsequent scripts.
- `1_fig_1.Rmd`: Primarily produces Figure 1 (Zipf's plots) and conducts statistical tests to support underlying statistical claims made through the figure.
- `2_fig_2_to_4.Rmd`: Primarily produces Figures 2 to 4 (average levels of expression) and conducts statistical tests to support underlying statistical claims made through the figures. This includes conducting t-tests to establish subgroup differences.
The script also produces the file `table_controlling_how.csv`, which contains the full set of regression results for the analysis of subgroup differences in political stances, controlling for emotionality, egocentrism, and toxicity. This file includes effect sizes, standard errors, confidence intervals, and p-values for each stance, group variable, and confounder.
- `3_fig_5_to_6.Rmd`: Primarily produces Figures 5 to 6 (trends in expression) and conducts statistical tests to support underlying statistical claims made through the figures. This includes conducting t-tests to establish subgroup differences.
- `4_tab_1_to_2.Rmd`: Produces Tables 1 to 2, and shows code for Table A5 (descriptive tables).
Each script is expected to run in under 3 minutes and to require around 4 GB of RAM. Script `3_fig_5_to_6.Rmd` can take up to 3-4 minutes and requires up to 6 GB of RAM. For a first-time user, installing each package may take around 2 minutes, except 'tidyverse', which may take around 4 minutes.
We have not provided a demo since the actual dataset used for analysis is small enough and computations are efficient enough to be run in most systems.
Each script starts with a layperson explanation to overview the functionality of the code and a pseudocode for a detailed procedure, followed by the actual code.
### 2. `2_data`
This folder contains all data used to replicate the main results. The data is called by the respective scripts automatically using relative paths.
- `data_dictionary.txt`: Provides a description of all variables as they are coded in the various datasets, especially the main author by time level dataset called `repl_df.csv`.
- Processed data are provided as aggregated measures at the individual author by time (year by month) level, as raw data containing the underlying tweets cannot be shared.
## Installation Instructions
### Prerequisites
This project uses R and RStudio. Make sure you have the following installed:
- [R](https://cran.r-project.org/) (version 4.0.0 or later)
- [RStudio](https://www.rstudio.com/products/rstudio/download/)
Once installed, to ensure the correct versions of the required packages are installed, use the following R markdown script '0_init.Rmd'. This script will install the `remotes` package (if not already installed) and then install the specified versions of the required packages.
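Pinning package versions as described above is typically done with `remotes::install_version()`; a hedged sketch is shown below with a placeholder package and version (the authoritative list lives in `0_init.Rmd`).

```r
# Sketch only: pin a package version the way 0_init.Rmd does (package and version below are placeholders)
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_version("data.table", version = "1.14.8")  # example pin, not the actual list
library(data.table)
```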
## Running the Scripts
Open 0_init.Rmd in RStudio and run all chunks to install and load the required packages.
Run the remaining scripts (1_fig_1.Rmd, 2_fig_2_to_4.Rmd, 3_fig_5_to_6.Rmd, and 4_tab_1_to_2.Rmd) in the order they are listed to reproduce the figures and tables from the paper.
# Contact
For any questions, feel free to contact Prashant Garg at prashant.garg@imperial.ac.uk.
# License
This project is licensed under the Apache License 2.0 - see the license.txt file for details.
This repository provides the code used for the article "Monitoring cropland daily carbon dioxide exchange at field scales with Sentinel-2 satellite imagery" by Pia Gottschalk, Aram Kalhori, Zhan Li, Christian Wille, and Torsten Sachs. In the article, the authors present how local carbon dioxide (CO2) ground measurements and satellite data can be linked to project CO2 emissions spatially for agricultural fields; the data are used to exemplify how ground-measured CO2 fluxes of an agricultural field can be linked with remotely sensed vegetation indices to provide an upscaling approach for spatial CO2-flux projection.

The codes are provided for:
- footprint analysis and raw flux data quality control (MATLAB codes);
- retrieving Sentinel-2 vegetation indices via Google Earth Engine (GEE code);
- subsequent quality control, gap-filling and flux partitioning following the MDS approach by Reichstein et al. 2005, implemented by the R package "REddyProc" (R codes);
- statistical analyses of combined EC and Sentinel-2 data (R codes);
- code for all figures as displayed in the manuscript (R codes).

This software is written in MATLAB, R and JavaScript (GEE). Running the codes (R and .m files (Code)) and loading the data files (CSV files and .mat files (Data)) requires the pre-installation of R, RStudio and MATLAB. The GEE script runs in a browser and can also be opened/downloaded here: https://code.earthengine.google.com/858361ae4aac7c3fe5227076c9733040. The RStudio 2021.09.0 Build 351 version was used for developing the R scripts. The land cover classification work was performed in QGIS, v.3.16.11-Hannover. Data were analyzed in both MATLAB and R, and plots were created with R (R Core Development Team 2020) in RStudio. The R codes in this repository rely on a suite of external R packages ("zoo", "REddyProc", "Hmisc", "PerformanceAnalytics") which are required for the data analysis in this manuscript. The data to run the codes are published with the DOI https://doi.org/10.5880/GFZ.1.4.2023.008 (Gottschalk et al., 2023).
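For readers unfamiliar with REddyProc, a generic, heavily simplified sketch of MDS gap-filling is shown below. The data frame, site ID and column names are assumptions, and the repository's own R scripts should be used to reproduce the published processing.

```r
# Generic sketch (not the authors' scripts): MDS gap-filling of NEE with REddyProc
library(REddyProc)

# 'flux_df' is a hypothetical half-hourly data frame with a POSIXct 'DateTime' column
# plus NEE, Rg, Tair, VPD and Ustar columns, as expected by REddyProc.
EProc <- sEddyProc$new("SiteID", flux_df, c("NEE", "Rg", "Tair", "VPD", "Ustar"))
EProc$sMDSGapFill("NEE")                # marginal distribution sampling (Reichstein et al. 2005)
filled <- EProc$sExportResults()        # gap-filled series, e.g. the NEE_f column
```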
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Software Requirements

You will need to install R from http://cran.r-project.org/ onto your computer. We've tested the workflow on R version 3.4.1. You will also need RStudio from http://www.rstudio.com/products/rstudio/download/.

After installing, open RStudio, which opens an R session, and paste the following commands into the Console window. Hitting enter will execute the lines.

source("https://bioconductor.org/biocLite.R")
biocLite("biomaRt")
biocLite("DESeq2")
biocLite("org.Dr.eg.db")
biocLite("topGO")
biocLite("tximport")
install.packages("rbokeh")
install.packages("readr")
install.packages("rjson")

Once installed, scripts will be able to utilize the libraries via loading, which occurs each time you execute a given workflow or script. In our case, that environment loads in the following section:

library("readr")
library("rjson")
library("tximport")
library("DESeq2")
library("biomaRt")
library("rbokeh")
library("topGO")
library("org.Dr.eg.db")

These lines should execute without error. If you encounter an error like:

> Error in library("topGO") : there is no package called 'topGO'

then that package failed to install and you can retry it from above:

biocLite("topGO")

And read the console for any prompts.

Other Downloads

Download the osu-workshop.zip file and extract it. This contains the data, some intermediate data, and our working examples for this workshop. The hands-on portion will be going through the DE.Rmd document in RStudio.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This is the full dataset used in the manuscript Wintle et al. (2018) Global synthesis of conservation studies reveals the importance of small habitat patches for biodiversity. To reproduce the analysis, simply load all of the files into a folder called 'data_for_web', open the R project in RStudio, and run the script 'small patches model - FINAL.R'. You may need to install and load the R libraries 'raster', 'effects' and 'spdep'. Contact brendanw@unimelb.edu.au if you have any problems.
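If those packages are missing, a one-time install along these lines should suffice (CRAN versions assumed):

```r
# Install and load the packages needed by 'small patches model - FINAL.R'
install.packages(c("raster", "effects", "spdep"))
library(raster); library(effects); library(spdep)
```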
Kim_Shin et al. Data-Code-FiguresThis ZIP archive includes all the raw data and R code used to generate the figures for our paper. We recommend loading and running each Rmarkdown (.Rmd) file in the RStudio development environment (https://www.rstudio.com).
http://www.apache.org/licenses/LICENSE-2.0
# Replication Package for 'Political Expression of Academics on Social Media'
A repository with replication material for the paper "Political Expression of Academics on Social Media" (2024) by Prashant Garg and Thiemo Fetzer
## Overview
This replication package contains all necessary scripts and data to replicate the main figures and tables presented in the paper.
## Folder Structure
### 1. `1_scripts`
This folder contains all scripts required to replicate the main figures and tables of the paper. The scripts are arranged in the order they should be run.
- `0_init.Rmd`: An R Markdown file that installs and loads all packages necessary for the subsequent scripts.
- `1_fig_1.Rmd`: Produces Figure 1 (Zipf's plots).
- `2_fig_2_to_4.Rmd`: Produces Figures 2 to 4 (average levels of expression).
- `3_fig_5_to_6.Rmd`: Produces Figures 5 to 6 (trends in expression).
- `4_tab_1_to_3.Rmd`: Produces Tables 1 to 3 (descriptive tables).
Each script is expected to run in under 2 minutes and to require around 4 GB of RAM. Script `3_fig_5_to_6.Rmd` can take up to 3-4 minutes and requires up to 6 GB of RAM. For a first-time user, installing each package may take around 2 minutes, except 'tidyverse', which may take around 4 minutes.
We have not provided a demo since the actual dataset used for analysis is small enough and computations are efficient enough to be run in most systems.
Each script starts with a layperson explanation to overview the functionality of the code and a pseudocode for a detailed procedure, followed by the actual code.
### 2. `2_data`
This folder contains data used to replicate the main results. The data is called by the respective scripts automatically using relative paths.
- `data_dictionary.txt`: Provides a description of all variables as they are coded in the various datasets, especially the main author by time level dataset called `repl_df.csv`.
- Processed data are provided as aggregated measures at the individual author by time (year by month) level, as raw data containing the underlying tweets cannot be shared.
## Installation Instructions
### Prerequisites
This project uses R and RStudio. Make sure you have the following installed:
- [R](https://cran.r-project.org/) (version 4.0.0 or later)
- [RStudio](https://www.rstudio.com/products/rstudio/download/)
Once installed, to ensure the correct versions of the required packages are installed, use the following R markdown script '0_init.Rmd'. This script will install the `remotes` package (if not already installed) and then install the specified versions of the required packages.
## Running the Scripts
Open 0_init.Rmd in RStudio and run all chunks to install and load the required packages.
Run the remaining scripts (1_fig_1.Rmd, 2_fig_2_to_4.Rmd, 3_fig_5_to_6.Rmd, and 4_tab_1_to_3.Rmd) in the order they are listed to reproduce the figures and tables from the paper.
# Contact
For any questions, feel free to contact Prashant Garg at prashant.garg@imperial.ac.uk.
# License
This project is licensed under the Apache License 2.0 - see the license.txt file for details.
To run the provided scripts, the open-access R programming environment (v4.1.2) and the RStudio desktop application (build 353) are recommended. To begin, "Virag_Karadi_2023_CONODONTS_script.R" should be opened with RStudio; if run step by step, it will automatically load the necessary packages (Morpho), custom functions ("Virag_Karadi_2023_CONODONTS_functions.R") and data files (all provided .csv files) during the session.
Environmental changes, such as climate warming and higher herbivory pressure, are altering the carbon balance of Arctic ecosystems; yet how these drivers modify the carbon balance among different habitats remains uncertain. This dataset is used to investigate how spring goose grubbing and summer warming – two key environmental-change drivers in the Arctic – alter CO2-fluxes in three tundra habitats varying in soil moisture and plant-community composition.

Where: CO2-flux data were gathered from a full-factorial randomized-block experiment simulating spring goose grubbing and summer warming in high-Arctic Svalbard.

When: CO2-flux data were gathered at each of three sampling occasions (early, peak, and late summer) in summer 2016 and summer 2017.

Data collection and processing: CO2-fluxes were assessed using a closed system made of a clear acrylic chamber (25 cm × 25 cm area × 35 cm height), including a fan for air mixing, connected through an air pump (L052C-11, Parker Corp, Cleveland, Ohio, USA; ~1 l min-1 flow rate) to a CO2 infrared gas analyzer (LI-840A, LICOR, Lincoln, Nebraska, USA). We calculated CO2-fluxes for each measurement by fitting linear regression models based on the ideal gas law.

How to use this dataset: All analyses of this dataset were run in the R statistical and computing environment v. 4.3.0 (https://www.r-project.org). To use this dataset, download the most recent version of R and RStudio and import the dataset into your workspace. Additional information on how to analyze these data is given in the related publication.
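As background, the standard closed-chamber calculation behind such a regression-based flux estimate is sketched below. This is a generic illustration with made-up numbers (CO2 record, air temperature, pressure), not the authors' processing code, and their exact implementation may differ.

```r
# Generic closed-chamber flux sketch (made-up example values, not the study's code)
# Chamber geometry from the description: 25 cm x 25 cm base, 35 cm height
area_m2   <- 0.25 * 0.25          # m^2
volume_m3 <- area_m2 * 0.35       # m^3

# Hypothetical 2-minute record of CO2 mole fraction (ppm) at 1 Hz
t_s     <- 0:119
co2_ppm <- 410 + 0.05 * t_s + rnorm(120, sd = 0.3)

dCdt <- coef(lm(co2_ppm ~ t_s))[2]   # ppm per second from the linear regression

# Ideal gas law converts the ppm/s slope to a molar flux per unit ground area
P  <- 101325     # Pa (assumed)
Tk <- 278.15     # K  (assumed, ~5 degC)
R  <- 8.314      # J mol^-1 K^-1
flux_umol_m2_s <- dCdt * 1e-6 * P * volume_m3 / (R * Tk * area_m2) * 1e6
flux_umol_m2_s   # positive = net CO2 release in this sketch
```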
The dataset was derived by the Bioregional Assessment Programme. This dataset was derived from multiple datasets. You can find a link to the parent datasets in the Lineage Field in this metadata statement. The History Field in this metadata statement describes how this dataset was derived.
The difference between NSW Office of Water GW licences - CLM v2 and v3 is that an additional column has been added, 'Asset Class' that aggregates the purpose of the licence into the set classes for the Asset Database. Also the 'Completed_Depth' has been added, which is the total depth of the groundwater bore. These columns were added for the purpose of the Asset Register.
The aim of this dataset was to be able to map each groundwater works with the volumetric entitlement without double counting the volume and to aggregate/ disaggregate the data depending on the final use.
This has not been clipped to the CLM PAE, therefore the number of economic assets/ relevant licences will drastically reduce once this occurs.
The Clarence Moreton groundwater licences includes an extract of all licences that fell within the data management acquisition area as provided by BA to NSW Office of Water.
Aim: To get a one to one ratio of licences numbers to bore IDs.
Important notes about data:
Data has not been clipped to the PAE.
No decisions have been made in regard to which groundwater purposes should be protected. Therefore the purpose currently includes groundwater bores that have been drilled for non-extractive purposes, including experimental research, test, monitoring bore, teaching, mineral exploration and groundwater exploration.
No volume has been included for domestic & stock as it is a basic right. Therefore an arbitrary volume could be applied to account for D&S use.
Licence Number - Each sheet in the Original Data has a licence number, this is assumed to be the actual licence number. Some are old because they have not been updated to the new WA. Some are new (From_Spreadsheet_WALs). This is the reason for the different codes.
WA/CA - This number is the 'works' number. It is assumed that the number indicates the bore permit or works approval. This is why there can be multiple works to a licence and multiple licences to a works number. (For the complete glossary see http://registers.water.nsw.gov.au/wma/Glossary.jsp). Originally, the aim was to make sure that, where there was more than one licence per works number or multiple works per licence, the multiple instances were complete.
Clarence Moreton worksheet links the individual licence to a works and a volumetric entitlement. For most sites, this can be linked to a bore which can be found in the NGIS through the HydroID. (\\wron\Project\BA\BA_all\Hydrogeology\_National_Groundwater_Information_System_v1.1_Sept2013). This will allow analysis of depths, lithology and hydrostratigraphy where the data exists.
We can aggregate the data based on water source and water management zone as can be seen in the other worksheets.
Data available:
Original Data: any data that was brought in from NSW Office of Water; it includes:
Spatial locations provided by NoW - This is exported data from the submitted shapefiles. Includes the licence (LICENCE) numbers and the bore ID (WORK_NUO). (Refer to lineage NSW Office of Water Groundwater Entitlements Spatial Locations).
Spreadsheet_WAL - The spread sheet from the submitted data, WLS-EXTRACT_WALs_volume. (Refer to Lineage NSW Office of Water Groundwater Licence Extract CLM- Oct 2013)
WLS_extracts - The combined spread sheets from the submitted data, WLS-EXTRACT . (Refer to Lineage NSW Office of Water Groundwater Licence Extract CLM- Oct 2013)
Aggregated share component to water sharing plan, water source and water management zone
The difference between NSW Office of Water GW licences - CLM v2 and v3 is that an additional column has been added, 'Asset Class' that aggregates the purpose of the licence into the set classes for the Asset Database.
Where purpose = domestic; or domestic & stock; or stock then it was classed as 'basic water right'. Where it is listed as both a domestic/stock and a licensed use such as irrigation, it was classed as a 'water access right.' All other take and use were classed as a 'water access right'. Where purpose = drainage, waste disposal, groundwater remediation, experimental research, null, conveyancing, test bore - these were not given an asset class. Monitoring bores were classed as 'Water supply and monitoring infrastructure'
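The classification rules above translate roughly into the R sketch below; the column and level names are assumptions for illustration, not the actual spreadsheet headers, and the "both domestic/stock and licensed use" case is handled only implicitly by the fallback.

```r
# Sketch of the asset-class rules (column and level names are assumed, not the real headers)
basic <- c("domestic", "stock", "domestic & stock")
nocls <- c("drainage", "waste disposal", "groundwater remediation",
           "experimental research", "conveyancing", "test bore")  # plus purposes recorded as null

licences$AssetClass <- with(licences, ifelse(
  PURPOSE %in% basic, "basic water right",
  ifelse(PURPOSE == "monitoring bore", "Water supply and monitoring infrastructure",
         ifelse(PURPOSE %in% nocls, NA_character_, "water access right"))))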
Depth has also been included which is the completed depth of the bore.
Instructions
Procedure: refer to Bioregional assessment data conversion script.docx
1) Original spreadsheets have multiple licence instances if there is more than one WA/CA number. This means that there is more than one works or permit for the licence. The aim is to have only one licence instance.
2) The individual licence numbers were combined into one column
3) Using the new column of licence numbers, several vlookups were created to bring in other data. Where the columns are identical in the original spreadsheets, they are combined. The only ones that are not combined are the Share/Entitlement/Allocation columns, as these mean different things.
4) A hydro ID column was created; this is a code that links the NSW data to the NGIS and is basically a ".1.1" appended to the end of the bore code.
5) All 'cancelled' licences were removed
6) A count of the number of works per licence and number of bores were included in the spreadsheet.
7) Where the ShareComponent = NA, the Entitlement = 0, Allocation = 0 and there was more than one instance of the same bore, this means that the original licence assigned to the bore has been replaced by a new licence with a share component. Where these criteria were met, the instances were removed
8) A volume-per-works value ensures that the volume of the licence is not repeated for each works but is divided by the number of works.
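Step 8 can be expressed compactly in R; the sketch below uses hypothetical column names (LICENCE_NO, VOLUME) and simply divides each licence's volume by its number of works.

```r
# Sketch of step 8 (volume per works); LICENCE_NO and VOLUME are placeholder column names
works_per_licence <- table(licences$LICENCE_NO)
licences$WORKS_COUNT      <- as.integer(works_per_licence[as.character(licences$LICENCE_NO)])
licences$VOLUME_PER_WORKS <- licences$VOLUME / licences$WORKS_COUNT
```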
Bioregional assessment data conversion script
Aim: The following document is the RStudio script for the conversion and merging of the bioregional assessment data.

Requirements: The user will need RStudio. Some basic knowledge of R is recommended; if that is lacking, the only things that really need to be changed are the file locations and names. The way that R reads file paths differs from Windows, and the locations RStudio reads from depend on where RStudio was originally installed to point. This needs to be set up properly before the script can be run.

Procedure: The information below the dashed line is the script. It can be copied and pasted directly into RStudio. Any text starting with '#' is not executed, so instructions can be added in and read as comments.
###########
# 18/2/2014
# Code by Brendan Dimech
#
# Script to merge extract files from submitted NSW bioregional
# assessment and convert data into required format. Also use a 'vlookup'
# process to get Bore and Location information from NGIS.
#
# There are 3 scripts, one for each of the individual regions.
#
############
# CLARENCE MORTON
# Opening of files. Location can be changed if needed.
# arc.file is the exported *.csv from the NGIS data which has bore data and Lat/long information.
# Lat/long weren't in the file natively so were added to the table using Arc Toolbox tools.

arc.folder = '/data/cdc_cwd_wra/awra/wra_share_01/GW_licencing_and_use_data/Rstudio/Data/Vlookup/Data'
arc.file = "Moreton.csv"

# Files from NSW came through in two types. WALS files, this included 'newer' licences that had a share component.
# The 'OTH' files were older licences that had just an allocation. Some data was similar and this was combined,
# and other information that wasn't similar from the datasets was removed.
# This section is locating and importing the WALS and OTH files.

WALS.folder = '/data/cdc_cwd_wra/awra/wra_share_01/GW_licencing_and_use_data/Rstudio/Data/Vlookup/Data'
WALS.file = "GW_Clarence_Moreton_WLS-EXTRACT_4_WALs_volume.xls"
OTH.file.1 = "GW_Clarence_Moreton_WLS-EXTRACT_1.xls"
OTH.file.2 = "GW_Clarence_Moreton_WLS-EXTRACT_2.xls"
OTH.file.3 = "GW_Clarence_Moreton_WLS-EXTRACT_3.xls"
OTH.file.4 = "GW_Clarence_Moreton_WLS-EXTRACT_4.xls"
newWALS.folder = '/data/cdc_cwd_wra/awra/wra_share_01/GW_licencing_and_use_data/Rstudio/Data/Vlookup/Products'
newWALS.file = "Clarence_Moreton.csv"

arc <- read.csv(paste(arc.folder, arc.file, sep = "/"), header = TRUE, sep = ",")
WALS <- read.table(paste(WALS.folder, WALS.file, sep = "/"), header = TRUE, sep = "\t")

# Merge any individual WALS and OTH files into a single WALS or OTH file if there were more than one.
OTH1 <- read.table(paste(WALS.folder, OTH.file.1, sep = "/"), header = TRUE, sep = "\t")
OTH2 <- read.table(paste(WALS.folder, OTH.file.2, sep = "/"), header = TRUE, sep = "\t")
OTH3 <- read.table(paste(WALS.folder, OTH.file.3, sep = "/"), header = TRUE, sep = "\t")
OTH4 <- read.table(paste(WALS.folder, OTH.file.4, sep = "/"), header = TRUE, sep = "\t")

OTH <- merge(OTH1, OTH2, all.y = TRUE, all.x = TRUE)
OTH <- merge(OTH, OTH3, all.y = TRUE, all.x = TRUE)
OTH <- merge(OTH, OTH4, all.y = TRUE, all.x = TRUE)

# Add new columns to OTH for the BORE, LAT and LONG. Then use 'merge' as a vlookup to add the corresponding
# bore and location from the arc file. The WALS and OTH files are slightly different because the arc file has
# a different licence number added in.
OTH <- data.frame(OTH, BORE = "", LAT = "", LONG = "")
OTH$BORE <- arc$WORK_NO[match(OTH$LICENSE.APPROVAL, arc$LICENSE)]
OTH$LAT <-
This repository contains R code and instructions to automatically collate and calculate quality control indices across multiple output files generated from BioTek microplate readers with Gen5 software (version 3.04). This is applicable for researchers conducting laboratory assays with this specific equipment and software. The worked example was developed for outputs from a cortisol enzyme immunoassay, but it can be adapted for a variety of assays and analytes. The code will efficiently merge and process results and perform quality checks across multiple plates in a batch. The code was developed by Delaney Glass at the University of Washington Center for Studies in Demography & Ecology Biodemography Lab, under the advisement of Lab Co-PIs Melanie Martin and Tiffany Pan. Below you will find example code, a readme file with instructions and an outlined use case, as well as de-identified example data files.
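As a generic illustration of the kind of collation and quality check the repository automates, the R sketch below reads multiple plate exports and computes a coefficient of variation per sample. The folder and column names are placeholders, not the repository's actual file layout; refer to the included readme and example code for the real workflow.

```r
# Generic sketch (placeholder folder/column names): collate plate exports and compute per-sample CVs
files  <- list.files("plate_exports", pattern = "\\.csv$", full.names = TRUE)
plates <- do.call(rbind, lapply(files, function(f) {
  x <- read.csv(f)
  x$plate <- basename(f)           # keep track of which plate each row came from
  x
}))

# Intra-assay %CV across duplicate wells, assuming 'sample_id' and 'concentration' columns
cv_tab <- aggregate(concentration ~ sample_id + plate, data = plates,
                    FUN = function(x) 100 * sd(x) / mean(x))
subset(cv_tab, concentration > 15)  # flag samples exceeding a common 15% CV threshold
```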
MIT License: https://opensource.org/licenses/MIT
This data set contains the ShinyFMBN app and related material. The ShinyFMBN app allows you to access FoodMicrobionet 3.1, a repository of data on food microbiome studies. To run the app you need to install R and RStudio.
This compressed folder contains:
a. folder data: contains a .RDS file of data extracted from FoodMicrobionet, to be used with the FMBNanalyzer script (see below)
b. folder FMBNanalyzer: contains FMBNanalyzer_v_2_1.R, which can be used for graphical and statistical analysis of data extracted from FoodMicrobionet
c. folder Gephi_network: contains a .gml file extracted from FoodMicrobionet using the ShinyFMBN app, a .gephi file created by importing it, and an example figure of the network
d. folder merge_phyloseq_objs: contains a proof-of-concept script which can be used to merge phyloseq objects extracted from FoodMicrobionet using the ShinyFMBN app, together with example data
e. folder ShinyFMBN: contains the app folder, the runShinyFMBN_2_1_4.R script (an R script to install all needed packages and run the app) and the app manual in .htm format

This version includes an improved version of the Shiny app and incorporates changes to the taxa table, which is now aligned to SILVA taxonomy (https://www.arb-silva.de/documentation/silva-taxonomy/). This change has become necessary to improve compatibility with new accessions to FoodMicrobionet, which now assigns taxonomy based on SILVA v138.
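In practice, launching the app should amount to sourcing the provided runner script from the extracted folder, roughly as below. The relative path and the readRDS() file name are placeholders, since the .RDS file's actual name is not listed here.

```r
# Sketch: run the ShinyFMBN app via the provided runner script
source("ShinyFMBN/runShinyFMBN_2_1_4.R")   # installs required packages, then starts the app

# To work with the extracted data directly (file name is a placeholder):
fmbn <- readRDS("data/FMBN_extract.RDS")
str(fmbn, max.level = 1)
```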
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
An R script to extract and parse the Postal Code Conversion File (PCCF) from Statistics Canada/Canada Post.