https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** the link to the data artifacts is already included in the paper. A link to the code will be included in the camera-ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below
- **settings.py** - settings template for the code archive
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34GB unpacked. This dataset still doesn't include the PyPI packages themselves, which take around 2TB
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semi-structured interviews
- **LICENSE** - text of GPL v3, under which this dataset is published
- **INSTALL.md** - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used; 3.2 is known to be incompatible)

Depending on the level of detail (see Step 2 for more details):

- up to 2TB of disk space
- at least 16GB of RAM (64GB preferable)
- a few hours to a few months of processing time

Step 1 - software
-----------------

- unpack **ghd-0.1.0.zip**, or clone from GitLab:

      git clone https://gitlab.com/user2589/ghd.git
      git checkout 0.1.0

  `cd` into the extracted folder. All commands below assume it as the current directory.
- copy `settings.py` into the extracted folder. Edit the file:
    * set `DATASET_PATH` to the path of some newly created folder
    * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- install Docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`. Without this dependency, you might get an error on the next step, but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`
- disable all APIs except GitHub (Bitbucket and GitLab support were not yet implemented when this study was in progress): edit `scraper/__init__.py` and comment out everything except GitHub support in `PROVIDERS`.

Step 2 - obtaining the dataset
------------------------------

The ultimate goal of this step is to get the output of the Python function `common.utils.survival_data()` and save it into a CSV file:

    # copy and paste into a Python console
    from common import utils
    survival_data = utils.survival_data('pypi', '2008', smoothing=6)
    survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15-30 minutes.

- create a folder `
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

1. Datasets

The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

1.1 CSV format

The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):
| Label | Data type | Description |
|---|---|---|
| isogramy | int | The order of isogramy, e.g. "2" is a second-order isogram |
| length | int | The length of the word in letters |
| word | text | The actual word/isogram in ASCII |
| source_pos | text | The Part of Speech tag from the original corpus |
| count | int | Token count (total number of occurrences) |
| vol_count | int | Volume count (number of different sources which contain the word) |
| count_per_million | int | Token count per million words |
| vol_count_as_percent | int | Volume count as a percentage of the total number of volumes |
| is_palindrome | bool | Whether the word is a palindrome (1) or not (0) |
| is_tautonym | bool | Whether the word is a tautonym (1) or not (0) |
The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:
| Label | Data type | Description |
|---|---|---|
| !total_1grams | int | The total number of words in the corpus |
| !total_volumes | int | The total number of volumes (individual sources) in the corpus |
| !total_isograms | int | The total number of isograms found in the corpus (before compacting) |
| !total_palindromes | int | How many of the isograms found are palindromes |
| !total_tautonyms | int | How many of the isograms found are tautonyms |
The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

1.2 SQLite database format

The SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:

• Compacted versions of each dataset, where identical headwords are combined into a single entry.
• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.

The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

2. Scripts

There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

2.1 Source data

The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

2.2 Data preparation

Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

2.3 Isogram extraction

After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

python isograms.py --batch --infile=INFILE --outfile=OUTFILE

Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files: one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

2.4 Creating a SQLite3 database

The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:

1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
2. Copy the "create-database.sql" script into the same directory as the two data files.
3. On the command line, go to the directory where the files and the SQL script are.
4. Type: sqlite3 isograms.db
5. This will create a database called "isograms.db".

See section 1 for a basic description of the output data and how to work with the database.

2.5 Statistical processing

The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
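For orientation, a hedged sketch of how the database can be queried from R with RSQLite follows; it is not part of the repository, and the table name used below is a guess that should be checked against create-database.sql.

```r
# Query the isogram database via DBI/RSQLite, as statistics.r does.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "isograms.db")
dbListTables(con)   # inspect which tables create-database.sql actually created

# Example: longest isograms in a (hypothetical) table named "ngrams"
longest <- dbGetQuery(con,
  "SELECT word, length, count FROM ngrams ORDER BY length DESC LIMIT 10")

dbDisconnect(con)
```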
[Note 2023-08-14 - Supersedes version 1, https://doi.org/10.15482/USDA.ADC/1528086]

This dataset contains all code and data necessary to reproduce the analyses in the manuscript: Mengistu, A., Read, Q. D., Sykes, V. R., Kelly, H. M., Kharel, T., & Bellaloui, N. (2023). Cover crop and crop rotation effects on tissue and soil population dynamics of Macrophomina phaseolina and yield under no-till system. Plant Disease. https://doi.org/10.1094/pdis-03-23-0443-re

The .zip archive cropping-systems-1.0.zip contains data and code files.

Data

- stem_soil_CFU_by_plant.csv: Soil disease load (SoilCFUg) and stem tissue disease load (StemCFUg) for individual plants in CFU per gram, with columns indicating year, plot ID, replicate, row, plant ID, previous crop treatment, cover crop treatment, and comments. Missing data are indicated with .
- yield_CFU_by_plot.csv: Yield data (YldKgHa) at the plot level in units of kg/ha, with columns indicating year, plot ID, replicate, and treatments, as well as means of soil and stem disease load at the plot level.

Code

- cropping_system_analysis_v3.0.Rmd: RMarkdown notebook with all data processing, analysis, and visualization code
- equations.Rmd: RMarkdown notebook with formatted equations
- formatted_figs_revision.R: R script to produce figures formatted exactly as they appear in the manuscript

The R project file cropping-systems.Rproj is used to organize the RStudio project. Scripts and notebooks used in older versions of the analysis are found in the testing/ subdirectory. Excel spreadsheets containing raw data from which the cleaned CSV files were created are found in the raw_data subdirectory.
This data release supports an analysis of changes in dissolved organic carbon (DOC) and nitrate concentrations in the Buck Creek watershed near Inlet, New York, from 2001 to 2021. The Buck Creek watershed is a 310-hectare forested watershed that is recovering from acidic deposition within the Adirondack region. The data release includes pre-processed model inputs and model outputs for the Weighted Regressions on Time, Discharge and Season (WRTDS) model (Hirsch and others, 2010) used to estimate daily flow-normalized concentrations of DOC and nitrate during a 20-year period of analysis. WRTDS uses daily discharge and concentration observations, implemented through the Exploration and Graphics for River Trends (EGRET) R package, to predict solute concentration using decimal time and discharge as explanatory variables (Hirsch and De Cicco, 2015; Hirsch and others, 2010). Discharge and concentration data are available from the U.S. Geological Survey National Water Information System (NWIS) database (U.S. Geological Survey, 2016). The time series data were analyzed for the entire period, water years 2001 (WY2001) to WY2021, where WY2001 is the period from October 1, 2000 to September 30, 2001. This data release contains 5 comma-separated values (CSV) files, one R script, and one XML metadata file. There are four input files ("Daily.csv", "INFO.csv", "Sample_doc.csv", and "Sample_nitrate.csv") that contain site information, daily mean discharge, and mean daily DOC or nitrate concentrations. The R script ("Buck Creek WRTDS R script.R") uses the four input datasets and functions from the EGRET R package to generate estimates of flow-normalized concentrations. The output file ("WRTDS_results.csv") contains model output at daily time steps for each sub-watershed and for each solute. Files are automatically associated with the R script when opened in RStudio using the provided R project file ("Files.Rproj"). All input, output, and R files are in the "Files.zip" folder.
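The actual processing lives in "Buck Creek WRTDS R script.R". As a rough, hedged illustration of the EGRET workflow described above (not the release's own script), a minimal sketch might look like the following, assuming the input CSVs follow EGRET's readUser* conventions:

```r
# Sketch of a WRTDS run with the EGRET package for one solute (DOC).
library(EGRET)

INFO   <- readUserInfo(filePath = ".", fileName = "INFO.csv")
Daily  <- readUserDaily(filePath = ".", fileName = "Daily.csv")
Sample <- readUserSample(filePath = ".", fileName = "Sample_doc.csv")

eList <- mergeReport(INFO, Daily, Sample)   # bundle inputs into an eList
eList <- modelEstimation(eList)             # fit the WRTDS model

# Flow-normalized daily concentrations are stored in eList$Daily$FNConc
head(eList$Daily[, c("Date", "ConcDay", "FNConc")])
```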
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Explanation/Overview:
Corresponding dataset for the analyses and results achieved in the CS Track project's research line on participation analyses, which are also reported in the publication "Does Volunteer Engagement Pay Off? An Analysis of User Participation in Online Citizen Science Projects", a conference paper presented at CollabTech 2022: Collaboration Technologies and Social Computing and published as part of the Lecture Notes in Computer Science book series (LNCS, volume 13632). The usernames have been anonymised.
Purpose:
The purpose of this dataset is to provide the basis to reproduce the results reported in the associated deliverable, and in the above-mentioned publication. As such, it does not represent raw data, but rather files that already include certain analysis steps (like calculated degrees or other SNA-related measures), ready for analysis, visualisation and interpretation with R.
Relatedness:
The data of the different projects was derived from the forums of 7 Zooniverse projects based on similar discussion board features. The projects are: 'Galaxy Zoo', 'Gravity Spy', 'Seabirdwatch', 'Snapshot Wisconsin', 'Wildwatch Kenya', 'Galaxy Nurseries', 'Penguin Watch'.
Content:
In this Zenodo entry, several files can be found. The structure is as follows (files and folders and descriptions).
- corresponding_calculations.html: Quarto notebook to view in the browser
- corresponding_calculations.qmd: Quarto notebook to view in RStudio
- assets
- data
  - annotations
    - annotations.csv: List of annotations made per day for each of the analysed projects
  - comments
    - comments.csv: Total list of comments with several data fields (i.e., comment id, text, reply_user_id)
  - rolechanges
    - 478_rolechanges.csv: List of roles per user to determine the number of role changes
    - 1104_rolechanges.csv
    - ...
  - totalnetworkdata
    - Edges
      - 478_edges.csv: Network data (edge set) for the given projects (without time slices)
      - 1104_edges.csv
      - ...
    - Nodes
      - 478_nodes.csv: Network data (node set) for the given projects (without time slices)
      - 1104_nodes.csv
      - ...
  - trajectories: Network data (edge and node sets) for the given projects and all time slices (Q1 2016 - Q4 2021)
    - 478
      - Edges
        - edges_4782016_q1.csv
        - edges_4782016_q2.csv
        - edges_4782016_q3.csv
        - edges_4782016_q4.csv
        - ...
      - Nodes
        - nodes_4782016_q1.csv
        - nodes_4782016_q2.csv
        - nodes_4782016_q3.csv
        - nodes_4782016_q4.csv
        - ...
    - 1104
      - Edges
        - ...
      - Nodes
        - ...
    - ...
- scripts
  - datavizfuncs.R: script for the data visualisation functions, automatically executed from within corresponding_calculations.qmd
  - import.R: script for the import of data, automatically executed from within corresponding_calculations.qmd
- corresponding_calculations_files: files for the html/qmd view in the browser/RStudio
Grouping:
The data is grouped according to given criteria (e.g., project_title or time). Accordingly, the respective files can be found at the corresponding places in the data structure above.
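As an illustration only (not part of the included scripts, which handle import via import.R), one of the total network files could be loaded and its degrees computed in R roughly as follows; the file paths and column names are assumptions and may differ from the actual CSVs:

```r
# Rebuild one per-project network from the node/edge CSVs and compute degree.
library(igraph)

nodes <- read.csv("data/totalnetworkdata/Nodes/478_nodes.csv")
edges <- read.csv("data/totalnetworkdata/Edges/478_edges.csv")

g   <- graph_from_data_frame(edges, directed = TRUE, vertices = nodes)
deg <- degree(g, mode = "all")   # degree per (anonymised) user
summary(deg)
```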
The Data Package includes a set of CSV files of input meteorological parameters for the locations of 17 meteorological stations within the East River watershed. These meteorological datasets were downloaded from (1) the PRISM database (monthly precipitation; air temperature (minimum, mean, and maximum); vapor pressure deficit (minimum and maximum); and dewpoint temperature) and (2) the NCEP/NCAR Reanalysis database (wind). The datasets were used for calculations of the Potential Evapotranspiration (ETo), Actual Evapotranspiration (ET), Standard Precipitation Index (SPI), and Standard Evapotranspiration-Precipitation Index (SPEI) for the period from 1966 to 2021. The main research questions addressed are the evaluation of the long-term temporal trends of climatic parameters, hierarchical clustering, and areal mapping/zonation of the East River watershed. Calculations were conducted in the RStudio environment. The dataset additionally includes a file-level metadata (flmd.csv) file that lists each file contained in the dataset with associated metadata, and a data dictionary (dd.csv) file that contains the column/row headers used throughout the files along with a definition, units, and data type. The input datasets were downloaded from (a) the PRISM database (the Northwest Alliance for Computational Science and Engineering at Oregon State University) and (b) the NCEP/NCAR Reanalysis database.
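As a rough, hedged illustration of the kind of drought-index calculation described above (not the code used to produce this package), the SPEI R package could be applied to one station's monthly series like this; the station file name, column names, and latitude are placeholders:

```r
# Compute 12-month SPI and SPEI from monthly precipitation and temperature.
library(SPEI)

met   <- read.csv("station_01_monthly.csv")            # hypothetical input file
prcp  <- ts(met$precip_mm, start = c(1966, 1), frequency = 12)
tmean <- ts(met$tmean_c,   start = c(1966, 1), frequency = 12)

pet    <- thornthwaite(tmean, lat = 38.9)   # potential ET (mm), illustrative latitude
spi12  <- spi(prcp, scale = 12)             # 12-month SPI
spei12 <- spei(prcp - pet, scale = 12)      # 12-month SPEI from the climatic balance

plot(spei12)
```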
The dataset was derived by the Bioregional Assessment Programme. This dataset was derived from multiple datasets. You can find a link to the parent datasets in the Lineage Field in this metadata statement. The History Field in this metadata statement describes how this dataset was derived.
The difference between NSW Office of Water GW licences - CLM v2 and v3 is that an additional column, 'Asset Class', has been added that aggregates the purpose of the licence into the set classes for the Asset Database. Also, 'Completed_Depth' has been added, which is the total depth of the groundwater bore. These columns were added for the purpose of the Asset Register.
The aim of this dataset was to be able to map each groundwater works with the volumetric entitlement without double counting the volume and to aggregate/ disaggregate the data depending on the final use.
This has not been clipped to the CLM PAE, therefore the number of economic assets/ relevant licences will drastically reduce once this occurs.
The Clarence Moreton groundwater licences includes an extract of all licences that fell within the data management acquisition area as provided by BA to NSW Office of Water.
Aim: To get a one to one ratio of licences numbers to bore IDs.
Important notes about data:
Data has not been clipped to the PAE.
No decisions have been made in regard to which purposes of groundwater use should be protected. Therefore the purpose currently includes groundwater bores that have been drilled for non-extractive purposes including experimental research, test, monitoring bore, teaching, mineral explore and groundwater explore.
No volume has been included for domestic & stock as it is a basic right. Therefore an arbitrary volume could be applied to account for D&S use.
Licence Number - Each sheet in the Original Data has a licence number; this is assumed to be the actual licence number. Some are old because they have not been updated to the new WA. Some are new (From_Spreadsheet_WALs). This is the reason for the different codes.
WA/CA - This number is the 'works' number. It is assumed that the number indicates the bore permit or works approval. This is why there can be multiple works to a licence and multiple licences to a works number. (For the complete glossary see http://registers.water.nsw.gov.au/wma/Glossary.jsp). Originally, the aim was to make sure that when there was more than one licence to a works number, or multiple works to a licence, the multiple instances were complete.
Clarence Moreton worksheet links the individual licence to a works and a volumetric entitlement. For most sites, this can be linked to a bore which can be found in the NGIS through the HydroID. (\wron\Project\BA\BA_all\Hydrogeology_National_Groundwater_Information_System_v1.1_Sept2013). This will allow analysis of depths, lithology and hydrostratigraphy where the data exists.
We can aggregate the data based on water source and water management zone as can be seen in the other worksheets.
Data available:
Original Data: any data that was brought in from the NSW Office of Water, which includes:
Spatial locations provided by NoW - this is exported data from the submitted shape files. Includes the licence (LICENCE) numbers and the bore ID (WORK_NO). (Refer to lineage NSW Office of Water Groundwater Entitlements Spatial Locations).
Spreadsheet_WAL - The spreadsheet from the submitted data, WLS-EXTRACT_WALs_volume. (Refer to Lineage NSW Office of Water Groundwater Licence Extract CLM - Oct 2013)
WLS_extracts - The combined spreadsheets from the submitted data, WLS-EXTRACT. (Refer to Lineage NSW Office of Water Groundwater Licence Extract CLM - Oct 2013)
Aggregated share component to water sharing plan, water source and water management zone
The difference between NSW Office of Water GW licences - CLM v2 and v3 is that an additional column has been added, 'Asset Class' that aggregates the purpose of the licence into the set classes for the Asset Database.
Where purpose = domestic; or domestic & stock; or stock then it was classed as 'basic water right'. Where it is listed as both a domestic/stock and a licensed use such as irrigation, it was classed as a 'water access right.' All other take and use were classed as a 'water access right'. Where purpose = drainage, waste disposal, groundwater remediation, experimental research, null, conveyancing, test bore - these were not given an asset class. Monitoring bores were classed as 'Water supply and monitoring infrastructure'
Depth has also been included which is the completed depth of the bore.
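For illustration only (this is not part of the original data processing), the 'Asset Class' mapping described above could be expressed in R roughly as follows; the file and column names are assumptions:

```r
# Map licence purposes to asset classes following the rules described above.
lic <- read.csv("Clarence_Moreton.csv", stringsAsFactors = FALSE)

no_class <- c("drainage", "waste disposal", "groundwater remediation",
              "experimental research", "conveyancing", "test bore")

lic$AssetClass <- ifelse(
  lic$Purpose %in% c("domestic", "domestic & stock", "stock"),
  "basic water right",
  ifelse(lic$Purpose == "monitoring bore",
         "Water supply and monitoring infrastructure",
         ifelse(lic$Purpose %in% no_class | is.na(lic$Purpose),
                NA, "water access right")))
```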
Instructions
Procedure: refer to Bioregional assessment data conversion script.docx
1) Original spreadsheets have multiple licence instances if there is more than one WA/CA number. This means that there is more than one works or permit to the licence. The aim is to have only one licence instance.
2) The individual licence numbers were combined into one column
3) Using the new column of licence numbers, several vlookups were created to bring in other data. Where the columns are identical in the original spreadsheets, they are combined. The only ones that aren't are Share/Entitlement/Allocation, as these mean different things.
4) A HydroID column was created; this is a code that links the NSW data to the NGIS and is basically a ".1.1" appended to the end of the bore code.
5) All 'cancelled' licences were removed
6) A count of the number of works per licence and number of bores were included in the spreadsheet.
7) Where the ShareComponent = NA, the Entitlement = 0, Allocation = 0 and there was more than one instance of the same bore, this means that the original licence assigned to the bore has been replaced by a new licence with a share component. Where these criteria were met, the instances were removed
8) A volume per works ensures that the volume of the licence is not repeated for each works but is divided by the number of works (see the sketch below).
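A minimal R sketch of this volume-per-works step, for illustration only; the column names are assumptions:

```r
# Divide each licence's entitlement by its number of works so the volume
# is not double counted across rows.
lic <- read.csv("Clarence_Moreton.csv", stringsAsFactors = FALSE)

works_per_licence  <- ave(lic$WorksNo, lic$Licence, FUN = length)
lic$VolumePerWorks <- lic$Entitlement / works_per_licence

head(lic[, c("Licence", "WorksNo", "Entitlement", "VolumePerWorks")])
```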
Bioregional assessment data conversion script
Aim: The following document is the R-Studio script for the conversion and merging of the bioregional assessment data.
Requirements: The user will need RStudio. Some basic knowledge of R is recommended; without it, the only things that would really need to be changed are the file locations and names. The way that R reads file paths differs from Windows, and the locations RStudio reads from depend on where RStudio is originally installed to point. This would need to be set up properly before the script can be run.
Procedure: The information below the dashed line is the script. This can be copied and pasted directly into R-Studio. Any text with '#' will not be read as a script, so that can be added in and read as an instruction.
###########
# 18/2/2014
# Code by Brendan Dimech
#
# Script to merge extract files from submitted NSW bioregional
# assessment and convert data into required format. Also use a 'vlookup'
# process to get Bore and Location information from NGIS.
#
# There are 3 scripts, one for each of the individual regions.
#
############
# CLARENCE MORTON
# Opening of files. Location can be changed if needed.
# arc.file is the exported *.csv from the NGIS data which has bore data and Lat/long information.
# Lat/long weren't in the file natively so were added to the table using Arc Toolbox tools.
arc.folder = '/data/cdc_cwd_wra/awra/wra_share_01/GW_licencing_and_use_data/Rstudio/Data/Vlookup/Data'
arc.file = "Moreton.csv"
# Files from NSW came through in two types. WALS files, this included 'newer' licences that had a share component.
# The 'OTH' files were older licences that had just an allocation. Some data was similar and this was combined,
# and other information that wasn't similar from the datasets was removed.
# This section is locating and importing the WALS and OTH files.
WALS.folder = '/data/cdc_cwd_wra/awra/wra_share_01/GW_licencing_and_use_data/Rstudio/Data/Vlookup/Data'
WALS.file = "GW_Clarence_Moreton_WLS-EXTRACT_4_WALs_volume.xls"
OTH.file.1 = "GW_Clarence_Moreton_WLS-EXTRACT_1.xls"
OTH.file.2 = "GW_Clarence_Moreton_WLS-EXTRACT_2.xls"
OTH.file.3 = "GW_Clarence_Moreton_WLS-EXTRACT_3.xls"
OTH.file.4 = "GW_Clarence_Moreton_WLS-EXTRACT_4.xls"
newWALS.folder = '/data/cdc_cwd_wra/awra/wra_share_01/GW_licencing_and_use_data/Rstudio/Data/Vlookup/Products'
newWALS.file = "Clarence_Moreton.csv"
arc <- read.csv(paste(arc.folder, arc.file, sep="/" ), header =TRUE, sep = ",")
WALS <- read.table(paste(WALS.folder, WALS.file, sep="/" ), header =TRUE, sep = "\t")
# Merge any individual WALS and OTH files into a single WALS or OTH file if there were more than one.
OTH1 <- read.table(paste(WALS.folder, OTH.file.1, sep="/" ), header =TRUE, sep = "\t")
OTH2 <- read.table(paste(WALS.folder, OTH.file.2, sep="/" ), header =TRUE, sep = "\t")
OTH3 <- read.table(paste(WALS.folder, OTH.file.3, sep="/" ), header =TRUE, sep = "\t")
OTH4 <- read.table(paste(WALS.folder, OTH.file.4, sep="/" ), header =TRUE, sep = "\t")
OTH <- merge(OTH1,OTH2, all.y = TRUE, all.x = TRUE)
OTH <- merge(OTH,OTH3, all.y = TRUE, all.x = TRUE)
OTH <- merge(OTH,OTH4, all.y = TRUE, all.x = TRUE)
# Add new columns to OTH for the BORE, LAT and LONG. Then use 'merge' as a vlookup to add the corresponding
# bore and location from the arc file. The WALS and OTH files are slightly different because the arc file has
# a different licence number added in.
OTH <- data.frame(OTH, BORE = "", LAT = "", LONG = "")
OTH$BORE <- arc$WORK_NO[match(OTH$LICENSE.APPROVAL, arc$LICENSE)]
OTH$LAT <- arc$POINT_X[match(OTH$LICENSE.APPROVAL, arc$LICENSE)]
OTH$LONG <-
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Enzyme-linked immunosorbent assay (ELISA) is a technique to detect the presence of an antigen or antibody in a sample. ELISA is a simple and cost-effective method that has been used for evaluating vaccine efficacy by detecting the presence of antibodies against viral/bacterial antigens and diagnosis of disease stages. Traditional ELISA data analysis utilizes a standard curve of known analyte, and the concentration of the unknown sample is determined by comparing its observed optical density against the standard curve. However, in the case of vaccine research for complicated bacteria such as Mycobacterium tuberculosis (Mtb), there is no prior information regarding the antigen against which high-affinity antibodies are generated and therefore plotting a standard curve is not feasible. Consequently, the analysis of ELISA data in this instance is based on a comparison between vaccinated and unvaccinated groups. However, to the best of our knowledge, no robust data analysis method exists for “non-standard curve” ELISA. In this paper, we provide a straightforward R-based ELISA data analysis method with open access that incorporates end-point titer determination and curve-fitting models. Our modified method allows for direct measurement data input from the instrument, cleaning and arranging the dataset in the required format, and preparing the final report with calculations while leaving the raw data file unchanged. As an illustration of our method, we provide an example from our published data in which we successfully used our method to compare anti-Mtb antibodies in vaccinated vs non-vaccinated mice.
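As a generic, hedged sketch of one way such a curve fit and end-point titer read-off could look in R (using the drc package; this is not the published method's code, and the data frame, column names, and cutoff are illustrative):

```r
# Fit a four-parameter logistic curve to OD vs. dilution for one sample.
library(drc)

elisa <- data.frame(
  dilution = rep(c(100, 200, 400, 800, 1600, 3200), each = 2),
  od       = c(2.1, 2.0, 1.7, 1.6, 1.1, 1.0, 0.6, 0.55, 0.3, 0.28, 0.15, 0.14)
)

fit <- drm(od ~ dilution, data = elisa, fct = LL.4())
plot(fit, log = "x")

# Dilution at which OD falls to a chosen cutoff (e.g. 0.2),
# i.e. an end-point-titer-style estimate.
ED(fit, respLev = 0.2, type = "absolute")
```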
http://www.apache.org/licenses/LICENSE-2.0
# Replication Package for 'Political Expression of Academics on Social Media' by Prashant Garg and Thiemo Fetzer.
## Overview
This replication package contains all necessary scripts and data to replicate the main figures and tables presented in the paper.
## Folder Structure
### 1. `1_scripts`
This folder contains all scripts required to replicate the main figures and tables of the paper. The scripts are numbered with a prefix (e.g. "1_") in the order they should be run. Output will also be produced in this folder.
- `0_init.Rmd`: An R Markdown file that installs and loads all packages necessary for the subsequent scripts.
- `1_fig_1.Rmd`: Primarily produces Figure 1 (Zipf's plots) and conducts statistical tests to support underlying statistical claims made through the figure.
- `2_fig_2_to_4.Rmd`: Primarily produces Figures 2 to 4 (average levels of expression) and conducts statistical tests to support underlying statistical claims made through the figures. This includes conducting t-tests to establish subgroup differences.
Also included is the file `table_controlling_how.csv`, which contains the full set of regression results for the analysis of subgroup differences in political stances, controlling for emotionality, egocentrism, and toxicity. This file includes effect sizes, standard errors, confidence intervals, and p-values for each stance, group variable, and confounder.
- `3_fig_5_to_6.Rmd`: Primarily produces Figures 5 to 6 (trends in expression) and conducts statistical tests to support underlying statistical claims made through the figures. This includes conducting t-tests to establish subgroup differences.
- `4_tab_1_to_2.Rmd`: Produces Tables 1 to 2, and shows code for Table A5 (descriptive tables).
Each script has an expected run time of under 3 minutes and requires around 4 GB of RAM. Script `3_fig_5_to_6.Rmd` can take up to 3-4 minutes and requires up to 6 GB of RAM. Installing each package for the first time may take around 2 minutes, except 'tidyverse', which may take around 4 minutes.
We have not provided a demo since the actual dataset used for analysis is small enough and computations are efficient enough to be run in most systems.
Each script starts with a layperson explanation to overview the functionality of the code and a pseudocode for a detailed procedure, followed by the actual code.
### 2. `2_data`
This folder contains all data used to replicate the main results. The data is called by the respective scripts automatically using relative paths.
- `data_dictionary.txt`: Provides a description of all variables as they are coded in the various datasets, especially the main author by time level dataset called `repl_df.csv`.
- Processed, aggregated measures at the individual author by time (year by month) level are provided, as the raw data containing raw tweets cannot be shared.
## Installation Instructions
### Prerequisites
This project uses R and RStudio. Make sure you have the following installed:
- [R](https://cran.r-project.org/) (version 4.0.0 or later)
- [RStudio](https://www.rstudio.com/products/rstudio/download/)
Once these are installed, use the R Markdown script `0_init.Rmd` to ensure the correct versions of the required packages are installed. This script will install the `remotes` package (if not already installed) and then install the specified versions of the required packages, along the lines of the sketch below.
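For orientation, version pinning with `remotes` typically looks like the following sketch; the package names and version numbers here are placeholders, and the authoritative list lives in `0_init.Rmd` itself.

```r
# Install pinned package versions with remotes::install_version().
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")

pkgs <- c(tidyverse = "2.0.0", data.table = "1.14.8")  # example versions only
for (p in names(pkgs)) {
  if (!requireNamespace(p, quietly = TRUE)) {
    remotes::install_version(p, version = pkgs[[p]], upgrade = "never")
  }
}
```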
## Running the Scripts
Open 0_init.Rmd in RStudio and run all chunks to install and load the required packages.
Run the remaining scripts (1_fig_1.Rmd, 2_fig_2_to_4.Rmd, 3_fig_5_to_6.Rmd, and 4_tab_1_to_2.Rmd) in the order they are listed to reproduce the figures and tables from the paper.
# Contact
For any questions, feel free to contact Prashant Garg at prashant.garg@imperial.ac.uk.
# License
This project is licensed under the Apache License 2.0 - see the license.txt file for details.
Microsoft Excel (CSV files) and R in RStudio v3.5.3 statistical software.