https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on the detalization level (see Step 2 for more details):

- up to 2Tb of disk space (see Step 2 detalization levels)
- at least 16Gb of RAM (64 preferable)
- a few hours to a few months of processing time

Step 1 - software
-----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

        git clone https://gitlab.com/user2589/ghd.git
        git checkout 0.1.0

  `cd` into the extracted folder. All commands below assume it as the current directory.
- copy `settings.py` into the extracted folder. Edit the file:
    * set `DATASET_PATH` to some newly created folder path
    * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`. Without this dependency, you might get an error on the next step, but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`
- disable all APIs except GitHub (Bitbucket and Gitlab support were not yet implemented when this study was in progress): edit `scraper/__init__.py` and comment out everything except GitHub support in `PROVIDERS`.

Step 2 - obtaining the dataset
------------------------------

The ultimate goal of this step is to get the output of the Python function `common.utils.survival_data()` and save it into a CSV file:

    # copy and paste into a Python console
    from common import utils
    survival_data = utils.survival_data('pypi', '2008', smoothing=6)
    survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15..30 minutes.

- create a folder `
To make this a seamless process, I cleaned the data and deleted many variables that I thought were not important to our dataset. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data file has both Lyft and Uber, but it is still a cleaned version of the dataset we downloaded from Kaggle.
You can easily subset the data into the car types that you will be modeling by first loading the CSV into R; here is the code for how you do this:
# load the cleaned Uber data
df <- read.csv('uber.csv')

# keep only the car type you are modeling (here, Uber Black)
df_black <- subset(df, df$name == 'Black')

# write the subset out; the file goes to your current working directory
write.csv(df_black, "nameofthefileyouwanttosaveas.csv")

# check which folder that is
getwd()
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset was developed by NREL's distributed energy systems integration group as part of a study on high penetrations of distributed solar PV [1]. It consists of hourly load data in CSV format for use with the PNNL taxonomy of distribution feeders [2]. These feeders were developed in the open source GridLAB-D modelling language [3]. In this dataset each of the load points in the taxonomy feeders is populated with hourly averaged load data from a utility in the feeder’s geographical region, scaled and randomized to emulate real load profiles. For more information on the scaling and randomization process, see [1].
The taxonomy feeders are statistically representative of the various types of distribution feeders found in five geographical regions of the U.S. Efforts are underway (possibly complete) to translate these feeders into the OpenDSS modelling language.
This data set consists of one large CSV file for each feeder. Within each CSV, each column represents one load bus on the feeder. The header row lists the name of the load bus. The subsequent 8760 rows represent the loads for each hour of the year. The loads were scaled and randomized using a Python script, so each load series represents only one of many possible randomizations. In the header row, "rl" = residential load and "cl" = commercial load. Commercial loads are followed by a phase letter (A, B, or C). For regions 1-3, the data is from 2009. For regions 4-5, the data is from 2000.
For use in GridLAB-D, each column will need to be separated into its own CSV file without a header. The load value goes in the second column, and corresponding datetime values go in the first column, as shown in the sample file, sample_individual_load_file.csv. Only the first value in the time column needs to be written as an absolute time; subsequent times may be written in relative format (i.e. "+1h", as in the sample). The load should be written in P+Qj format, as seen in the sample CSV, in units of Watts (W) and Volt-amps reactive (VAr). This dataset was derived from metered load data and hence includes only real power; reactive power can be generated by assuming an appropriate power factor. These loads were used with GridLAB-D version 2.2.
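As an illustration of that preparation step, here is a minimal R sketch (not part of the dataset); the feeder file name, the 0.95 power factor, and the exact absolute-time string are assumptions, so match the time format to sample_individual_load_file.csv:

    # Minimal sketch: split one feeder CSV into per-load, headerless GridLAB-D player files.
    # Assumptions: file named "feeder_R1.csv", data year 2009, power factor 0.95.
    feeder <- read.csv("feeder_R1.csv", check.names = FALSE)

    q_per_w <- tan(acos(0.95))   # VAr per W at the assumed power factor

    # first timestamp absolute (format should match sample_individual_load_file.csv),
    # subsequent timestamps relative
    times <- c("2009-01-01 0:00:00", rep("+1h", nrow(feeder) - 1))

    for (bus in names(feeder)) {
      p <- feeder[[bus]]                          # real power, W
      q <- p * q_per_w                            # synthesized reactive power, VAr
      load_pqj <- sprintf("%+.3f%+.3fj", p, q)    # P+Qj format
      write.table(data.frame(times, load_pqj),
                  file = paste0(bus, ".csv"), sep = ",",
                  row.names = FALSE, col.names = FALSE, quote = FALSE)
    }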
Browse files in this dataset, accessible as individual files and as a single ZIP file. This dataset is approximately 242MB compressed or 475MB uncompressed.
For questions about this dataset, contact andy.hoke@nrel.gov.
If you find this dataset useful, please mention NREL and cite [1] in your work.
References:
[1] A. Hoke, R. Butler, J. Hambrick, and B. Kroposki, “Steady-State Analysis of Maximum Photovoltaic Penetration Levels on Typical Distribution Feeders,” IEEE Transactions on Sustainable Energy, April 2013, available at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6357275 .
[2] K. Schneider, D. P. Chassin, R. Pratt, D. Engel, and S. Thompson, “Modern Grid Initiative Distribution Taxonomy Final Report”, PNNL, Nov. 2008. Accessed April 27, 2012: http://www.gridlabd.org/models/feeders/taxonomy of prototypical feeders.pdf
[3] K. Schneider, D. Chassin, Y. Pratt, and J. C. Fuller, “Distribution power flow for smart grid technologies”, IEEE/PES Power Systems Conference and Exposition, Seattle, WA, 15-18 Mar. 2009, pp. 1-7.
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
analyze the consumer expenditure survey (ce) with r

the consumer expenditure survey (ce) is the primo data source to understand how americans spend money. participating households keep a running diary about every little purchase over the year. those diaries are then summed up into precise expenditure categories. how else are you gonna know that the average american household spent $34 (±2) on bacon, $826 (±17) on cellular phones, and $13 (±2) on digital e-readers in 2011? an integral component of the market basket calculation in the consumer price index, this survey recently became available as public-use microdata and they're slowly releasing historical files back to 1996. hooray!

for a taste of what's possible with ce data, look at the quick tables listed on their main page - these tables contain approximately a bazillion different expenditure categories broken down by demographic groups. guess what? i just learned that americans living in households with $5,000 to $9,999 of annual income spent an average of $283 (±90) on pets, toys, hobbies, and playground equipment (pdf page 3). you can often get close to your statistic of interest from these web tables. but say you wanted to look at domestic pet expenditure among only households with children between 12 and 17 years old. another one of the thirteen web tables - the consumer unit composition table - shows a few different breakouts of households with kids, but none matching that exact population of interest. the bureau of labor statistics (bls) (the survey's designers) and the census bureau (the survey's administrators) have provided plenty of the major statistics and breakouts for you, but they're not psychic. if you want to comb through this data for specific expenditure categories broken out by a you-defined segment of the united states' population, then let a little r into your life. fun starts now.

fair warning: only analyze the consumer expenditure survey if you are nerd to the core. the microdata ship with two different survey types (interview and diary), each containing five or six quarterly table formats that need to be stacked, merged, and manipulated prior to a methodologically-correct analysis. the scripts in this repository contain examples to prepare 'em all, just be advised that magnificent data like this will never be no-assembly-required. the folks at bls have posted an excellent summary of what's available - read it before anything else. after that, read the getting started guide. don't skim. a few of the descriptions below refer to sas programs provided by the bureau of labor statistics. you'll find these in the C:\My Directory\CES\2011\docs directory after you run the download program.
this new github repository contains three scripts:

2010-2011 - download all microdata.R
- loop through every year and download every file hosted on the bls's ce ftp site
- import each of the comma-separated value files into r with read.csv
- depending on user-settings, save each table as an r data file (.rda) or stata-readable file (.dta)

2011 fmly intrvw - analysis examples.R
- load the r data files (.rda) necessary to create the 'fmly' table shown in the ce macros program documentation.doc file
- construct that 'fmly' table, using five quarters of interviews (q1 2011 thru q1 2012)
- initiate a replicate-weighted survey design object
- perform some lovely li'l analysis examples
- replicate the %mean_variance() macro found in "ce macros.sas" and provide some examples of calculating descriptive statistics using unimputed variables
- replicate the %compare_groups() macro found in "ce macros.sas" and provide some examples of performing t-tests using unimputed variables
- create an rsqlite database (to minimize ram usage) containing the five imputed variable files, after identifying which variables were imputed based on pdf page 3 of the user's guide to income imputation
- initiate a replicate-weighted, database-backed, multiply-imputed survey design object
- perform a few additional analyses that highlight the modified syntax required for multiply-imputed survey designs
- replicate the %mean_variance() macro found in "ce macros.sas" and provide some examples of calculating descriptive statistics using imputed variables
- replicate the %compare_groups() macro found in "ce macros.sas" and provide some examples of performing t-tests using imputed variables
- replicate the %proc_reg() and %proc_logistic() macros found in "ce macros.sas" and provide some examples of regressions and logistic regressions using both unimputed and imputed variables

replicate integrated mean and se.R
- match each step in the bls-provided sas program "integrated mean and se.sas" but with r instead of sas
- create an rsqlite database when the expenditure table gets too large for older computers to handle in ram
- export a table "2011 integrated mean and se.csv" that exactly matches the contents of the sas-produced "2011 integrated mean and se.lst" text file

click here to view these three scripts for...
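as a hedged illustration (not the repository's code) of what "initiate a replicate-weighted survey design object" looks like with the survey package - the weight column names (finlwt21, wtrep01-wtrep44) follow the usual ce interview-file naming, and the loaded file name plus the expenditure column in the example are placeholders:

    # hedged sketch: build a replicate-weighted (brr) design from one saved .rda table
    library(survey)

    load("fmly.rda")   # placeholder name; the analysis script constructs the real 'fmly' table

    fmly.design <-
        svrepdesign(
            repweights = "wtrep[0-9]+",   # the 44 replicate weight columns (assumed names)
            weights = ~finlwt21,          # full-sample weight (assumed name)
            data = fmly,
            type = "BRR",
            combined.weights = TRUE
        )

    # example: a weighted mean of one expenditure column (placeholder name)
    svymean(~totexppq, fmly.design, na.rm = TRUE)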
This dataset includes all the data and R code needed to reproduce the analyses in a forthcoming manuscript: Copes, W. E., Q. D. Read, and B. J. Smith. Environmental influences on drying rate of spray applied disinfestants from horticultural production services. PhytoFrontiers, DOI pending.

Study description: Instructions for disinfestants typically specify a dose and a contact time to kill plant pathogens on production surfaces. A problem occurs when disinfestants are applied to large production areas where the evaporation rate is affected by weather conditions. The common contact time recommendation of 10 min may not be achieved under hot, sunny conditions that promote fast drying. This study is an investigation into how the evaporation rates of six commercial disinfestants vary when applied to six types of substrate materials under cool to hot and cloudy to sunny weather conditions. Initially, disinfestants with low surface tension spread out to provide 100% coverage and disinfestants with high surface tension beaded up to provide about 60% coverage when applied to hard smooth surfaces. Disinfestants applied to porous materials, such as wood and concrete, were quickly absorbed into the body of the material. Even though disinfestants evaporated faster under hot sunny conditions than under cool cloudy conditions, coverage was reduced considerably in the first 2.5 min under most weather conditions and reduced to less than or equal to 50% coverage by 5 min.

Dataset contents: This dataset includes R code to import the data and fit Bayesian statistical models using the model fitting software CmdStan, interfaced with R using the packages brms and cmdstanr. The models (one for 2022 and one for 2023) compare how quickly different spray-applied disinfestants dry, depending on what chemical was sprayed, what surface material it was sprayed onto, and what the weather conditions were at the time. Next, the statistical models are used to generate predictions and compare mean drying rates between the disinfestants, surface materials, and weather conditions. Finally, tables and figures are created.

These files are included:
- Drying2022.csv: drying rate data for the 2022 experimental run
- Weather2022.csv: weather data for the 2022 experimental run
- Drying2023.csv: drying rate data for the 2023 experimental run
- Weather2023.csv: weather data for the 2023 experimental run
- disinfestant_drying_analysis.Rmd: RMarkdown notebook with all data processing, analysis, and table creation code
- disinfestant_drying_analysis.html: rendered output of notebook
- MS_figures.R: additional R code to create figures formatted for journal requirements
- fit2022_discretetime_weather_solar.rds: fitted brms model object for 2022. This will allow users to reproduce the model prediction results without having to refit the model, which was originally fit on a high-performance computing cluster
- fit2023_discretetime_weather_solar.rds: fitted brms model object for 2023
- data_dictionary.xlsx: descriptions of each column in the CSV data files
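As a hedged illustration of reusing the fitted model objects without refitting (the calls below are minimal defaults, not the manuscript's exact prediction workflow):

    # Requires the brms package; loads the fitted 2022 model shipped with the dataset
    library(brms)

    fit2022 <- readRDS("fit2022_discretetime_weather_solar.rds")

    summary(fit2022)                  # coefficients and convergence diagnostics

    # posterior expected values for the observed data (or pass newdata = ...)
    pred2022 <- posterior_epred(fit2022)
    dim(pred2022)                     # posterior draws x observations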
This archive contains code and data for reproducing the analysis for "Replication Data for Revisiting 'The Rise and Decline' in a Population of Peer Production Projects". Depending on what you hope to do with the data you probably do not want to download all of the files. Depending on your computational resources you may not be able to run all stages of the analysis.

The code for all stages of the analysis, including typesetting the manuscript and running the analysis, is in code.tar. If you only want to run the final analysis or to play with the datasets used in the analysis of the paper, you want intermediate_data.7z or the uncompressed tab and csv files.

The data files are created in a four-stage process. The first stage uses the program "wikiq" to parse mediawiki xml dumps and create tsv files that have edit data for each wiki. The second stage generates the all.edits.RDS file, which combines these tsvs into a dataset of edits from all the wikis. This file is expensive to generate and, at 1.5GB, is pretty big. The third stage builds smaller intermediate files that contain the analytical variables from these tsv files. The fourth stage uses the intermediate files to generate smaller RDS files that contain the results. Finally, knitr and latex typeset the manuscript.

A stage will only run if the outputs from the previous stages do not exist. So if the intermediate files exist they will not be regenerated; only the final analysis will run. The exception is that stage 4, fitting models and generating plots, always runs. If you only want to replicate from the second stage onward, you want wikiq_tsvs.7z. If you want to replicate everything, you want wikia_mediawiki_xml_dumps.7z.001, wikia_mediawiki_xml_dumps.7z.002, and wikia_mediawiki_xml_dumps.7z.003.

These instructions work backwards from building the manuscript using knitr, loading the datasets, running the analysis, to building the intermediate datasets.

Building the manuscript using knitr

This requires working latex, latexmk, and knitr installations. Depending on your operating system you might install these packages in different ways. On Debian Linux you can run apt install r-cran-knitr latexmk texlive-latex-extra. Alternatively, you can upload the necessary files to a project on Overleaf.com.

Download code.tar. This has everything you need to typeset the manuscript. Unpack the tar archive. On a unix system this can be done by running tar xf code.tar. Navigate to code/paper_source. Install R dependencies: in R, run install.packages(c("data.table","scales","ggplot2","lubridate","texreg")). On a unix system you should be able to run make to build the manuscript generalizable_wiki.pdf. Otherwise you should try uploading all of the files (including the tables, figure, and knitr folders) to a new project on Overleaf.com.

Loading intermediate datasets

The intermediate datasets are found in the intermediate_data.7z archive. They can be extracted on a unix system using the command 7z x intermediate_data.7z. The files are 95MB uncompressed. These are RDS (R data set) files and can be loaded in R using readRDS. For example, newcomer.ds <- readRDS("newcomers.RDS"). If you wish to work with these datasets using a tool other than R, you might prefer to work with the .tab files.

Running the analysis

Fitting the models may not work on machines with less than 32GB of RAM. If you have trouble, you may find the functions in lib-01-sample-datasets.R useful to create stratified samples of data for fitting models. See line 89 of 02_model_newcomer_survival.R for an example. Download code.tar and intermediate_data.7z to your working folder and extract both archives. On a unix system this can be done with the command tar xf code.tar && 7z x intermediate_data.7z. Install R dependencies: install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). On a unix system you can simply run regen.all.sh to fit the models, build the plots and create the RDS files.

Generating datasets

Building the intermediate files

The intermediate files are generated from all.edits.RDS. This process requires about 20GB of memory. Download all.edits.RDS, userroles_data.7z, selected.wikis.csv, and code.tar. Unpack code.tar and userroles_data.7z. On a unix system this can be done using tar xf code.tar && 7z x userroles_data.7z. Install R dependencies: in R, run install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). Run 01_build_datasets.R.

Building all.edits.RDS

The intermediate RDS files used in the analysis are created from all.edits.RDS. To replicate building all.edits.RDS, you only need to run 01_build_datasets.R when the int... Visit https://dataone.org/datasets/sha256%3Acfa4980c107154267d8eb6dc0753ed0fde655a73a062c0c2f5af33f237da3437 for complete metadata about this dataset.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Categorical scatterplots with R for biologists: a step-by-step guide
Benjamin Petre1, Aurore Coince2, Sophien Kamoun1
1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK
Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.
Protocol
• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.
• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.
• Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
Notes
• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.
• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
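For readers without access to the PowerPoint slide, below is a minimal self-contained sketch of the kind of script the protocol describes; it assumes the three columns are named Replicate, Condition, and Value as in Step 1, and the original script may differ in its details.

    # Minimal sketch of the protocol (assumed column names: Replicate, Condition, Value)
    library(ggplot2)

    # Step 1 input: the .csv file exported from Excel (a dialog box opens)
    data <- read.csv(file.choose())

    # Step 2: categorical scatterplot with superimposed boxplots;
    # dots are individual values, colours indicate replicates
    p <- ggplot(data, aes(x = Condition, y = Value)) +
      geom_boxplot(outlier.colour = 'black', colour = 'black') +
      geom_jitter(aes(col = Replicate)) +
      theme_bw()
    print(p)

    # Step 3: export as .pdf
    ggsave("scatterplot.pdf", p)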
References
Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.
Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035
Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
This is a new draft version of the data files for "Food washing monkeys recognize the law of diminishing returns" by Rosien et al.
The original reviewed pre-print was published on the eLife website on 22 July 2024: https://elifesciences.org/reviewed-preprints/98520. The data stored here are for the updated version of record.
The published text contains methods justifications and supporting citations.
This dataset was revised based on the recommendations of three reviewers. It now contains:
A general note: when running the scripts, the file paths you use will differ from the ones in the current text, because they depend on where the actual .csv files are stored on your computer. The "read.csv" command in the R code will need to be customized to your particular file path.
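For example (the path and file name below are placeholders, not the actual dataset file names):

    # point read.csv at wherever the .csv files are stored on your computer
    dat <- read.csv("/path/to/your/local/copy/data_file.csv")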
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
analyze the basic stand alone medicare claims public use files (bsapufs) with r and monetdb

the centers for medicare and medicaid services (cms) took the plunge. the famous medicare 5% sample has been released to the public, free of charge. jfyi - medicare is the u.s. government program that provides health insurance to 50 million elderly and disabled americans. the basic stand alone medicare claims public use files (bsapufs) contain either person- or event-level data on inpatient stays, durable medical equipment purchases, prescription drug fills, hospice users, doctor visits, home health provision, outpatient hospital procedures, skilled nursing facility short-term residents, as well as aggregated statistics for medicare beneficiaries with chronic conditions and medicare beneficiaries living in nursing homes. oh sorry, there's one catch: they only provide sas scripts to analyze everything. cue the villain music. that bored old game of monopoly ends today.

the initial release of the 2008 bsapufs was accompanied by some major fanfare in the world of health policy, a big win for government transparency. unfortunately, the final files that cleared the confidentiality hurdles are heavily de-identified and obfuscated. prime examples: none of the files can be linked to any other file - not across years, not across expenditure categories. costs are rounded to the nearest fifth or tenth dollar at lower values, nearest thousandth at higher values. ages are categorized into five year bands. so these files are baldly inferior to the unsquelched, linkable data only available through an expensive formal application process. any researcher with a budget flush enough to afford a sas license (the only statistical software mentioned in the cms official documentation) can probably also cough up the money to buy the identifiable data through resdac (resdac, btw, rocks).

soapbox: cms released free public data sets that could only be analyzed with a software package costing thousands of dollars. so even though the actual data sets were free, researchers still needed deep pockets to buy sas. meanwhile, the unsquelched and therefore superior data sets are also available for many thousands of dollars. researchers with funding would (reasonably) just buy the better data. researchers without any financial resources - the target audience of free, public data - were left out in the cold. no wonder these bsapufs haven't been used much. that ends now.

using r, monetdb, and the personal computer you already own (mine cost $700 in 2009), researchers can, for the first time, seriously analyze these medicare public use files without spending another dime. woah. plus hey guess what all you researcher fat-cats with your federal grant streams and your proprietary software licenses: r + monetdb runs one heckuva lot faster than sas. woah^2. dump your sas license water wings and learn how to swim.

the scripts below require monetdb. click here for step-by-step instructions of how to install it on windows and click here for speed tests. vroom. since the bsapufs comprise 5% of the medicare population, ya generally need to multiply any counts or sums by twenty. although the individuals represented in these claims are randomly sampled, this data should not be treated like a complex survey sample, meaning that the creation of a survey object is unnecessary. most bsapufs generalize to either the total or fee-for-service medicare population, but each file is different so give the documentation a hard stare before that eureka moment.
this new github repository contains three scripts:

2008 - download all csv files.R
- loop through and download every zip file hosted by cms
- unzip the contents of each zipped file to the working directory

2008 - import all csv files into monetdb.R
- create the batch (.bat) file needed to initiate the monet database in the future
- loop through each csv file in the current working directory and import them into the monet database
- create a well-documented block of code to re-initiate the monetdb server in the future

2008 - replicate cms publications.R
- initiate the same monetdb server instance, using the same well-documented block of code as above
- replicate nine sets of statistics found in data tables provided by cms

click here to view these three scripts (https://github.com/ajdamico/usgsd/tree/master/Basic%20Stand%20Alone%20Medicare%20Claims%20Public%20Use%20Files)

for more detail about the basic stand alone medicare claims public use files (bsapufs), visit:
- the centers for medicare and medicaid's bsapuf homepage
- a joint academyhealth webinar given by the organizations that partnered to create these files - cms, impaq, norc

notes: the replication script has oodles of easily-modified syntax and should be viewed for analysis examples. if you know the name of the data table you want to examine, you can quickly modify these general monetdb analysis examples too. just run sql queries - sas users, that's "proc...
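as a hedged illustration of that workflow (not the repository's code): the dbConnect() arguments, database name, and table name below are assumptions - take the real connection details from the well-documented block of code that the import script writes out.

    # hedged sketch: query the monetdb database with plain sql through DBI
    library(DBI)
    library(MonetDB.R)

    # connection details are assumptions; see the import script's connection block
    con <- dbConnect(MonetDB.R(), dbname = "bsapuf", host = "localhost",
                     port = 50000, user = "monetdb", password = "monetdb")

    # count records in the 5% sample (table name is an assumption),
    # then multiply by twenty to approximate the full medicare population
    n_sample <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM inpatient_2008")
    n_sample$n * 20

    dbDisconnect(con)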
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials
Background
This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels and one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database and data from the individual experiments is also included.
The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).
Usage
Included Files
File Format: Downsampled Data
These are the "LP_
These data files can be easily loaded using the pandas library in Python through:
import pandas
data = pandas.read_csv(data_file, index_col=0)
The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.
File Format: Unreduced Data
These are the "LP_
The data can be loaded and used similarly to the downsampled data.
File Format: Overall_Summary
The overall summary file provides data on all the test specimens in the database. The columns include:
File Format: Summarized_Mechanical_Props_Campaign
Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,
tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1],
keep_default_na=False, na_values='')
Caveats
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
To get the consumption model from Section 3.1, one needs to execute the file consumption_data.R. It loads the data for the 3 phases (./data/CONSUMPTION/PL1.csv, PL2.csv, PL3.csv), transforms the data, and builds the model (starting at line 225). The final consumption data can be found in one file for each year in ./data/CONSUMPTION/MEGA_CONS_list.Rdata.
To get the results for the optimization problem, one needs to execute the file analyze_data.R. It provides the functions to compare production and consumption data, and to optimize for the different values (PV, MBC, ...).
To reproduce the figures, one needs to execute the file visualize_results.R. It provides the functions to reproduce the figures.
To calculate the solar radiation that is needed in the Section Production Data, follow the file calculate_total_radiation.R.
To reproduce the radiation data from ERA5 that can be found in data.zip, do the following steps: 1. ERA5 - download the reanalysis datasets as GRIB files. For FDIR select "Total sky direct solar radiation at surface", for GHI select "Surface solar radiation downwards", and for ALBEDO select "Forecast albedo". 2. Convert GRIB to csv with the file era5toGRID.sh. 3. Convert the csv file to the data that is used in this paper with the file convert_year_to_grid.R.
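A minimal R sketch for loading the consumption inputs and outputs referenced above (no further processing is shown, since the column layout is documented only in the scripts):

    # load the three phase files and the final consumption data
    pl1 <- read.csv("./data/CONSUMPTION/PL1.csv")
    pl2 <- read.csv("./data/CONSUMPTION/PL2.csv")
    pl3 <- read.csv("./data/CONSUMPTION/PL3.csv")

    load("./data/CONSUMPTION/MEGA_CONS_list.Rdata")
    ls()   # inspect the per-year objects created by the .Rdata file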
This child page contains a zipped folder which contains all of the items necessary to run load estimation using R-LOADEST to produce results that are published in U.S. Geological Survey Investigations Report 2021-XXXX [Tatge, W.S., Nustad, R.A., and Galloway, J.M., 2021, Evaluation of Salinity and Nutrient Conditions in the Heart River Basin, North Dakota, 1970-2020: U.S. Geological Survey Scientific Investigations Report 2021-XXXX, XX p]. The folder contains an allsiteinfo.table.csv file, a "datain" folder, and a "scripts" folder. The allsiteinfo.table.csv file can be used to cross reference the sites with the main report (Tatge and others, 2021). The "datain" folder contains all the input data necessary to reproduce the load estimation results. The naming convention in the "datain" folder is site_MI_rloadest or site_NUT_rloadest for either the major ion loads or the nutrient loads. The .Rdata files are used in the scripts to run the estimations and the .csv files can be used to look at the data. The "scripts" folder contains the written R scripts to produce the results of the load estimation from the main report.

R-LOADEST is a software package for analyzing loads in streams and an accompanying report (Runkel and others, 2004) serves as the formal documentation for R-LOADEST. The package is a collection of functions written in R (R Development Core Team, 2019), an open source language and a general environment for statistical computing and graphics.

The following system requirements are necessary for producing results:
- Windows 10 operating system
- R (version 3.4 or later; 64-bit recommended)
- RStudio (version 1.1.456 or later)
- R-LOADEST program (available at https://github.com/USGS-R/rloadest)

Runkel, R.L., Crawford, C.G., and Cohn, T.A., 2004, Load Estimator (LOADEST): A FORTRAN Program for Estimating Constituent Loads in Streams and Rivers: U.S. Geological Survey Techniques and Methods Book 4, Chapter A5, 69 p. [Also available at https://pubs.usgs.gov/tm/2005/tm4A5/pdf/508final.pdf.]

R Development Core Team, 2019, R—A language and environment for statistical computing: Vienna, Austria, R Foundation for Statistical Computing, accessed December 7, 2020, at https://www.r-project.org.
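A hedged sketch of the setup step (the input file name is a placeholder following the site_MI_rloadest / site_NUT_rloadest naming convention described above):

    # install the R-LOADEST package from the repository listed above
    # install.packages("remotes")
    remotes::install_github("USGS-R/rloadest")
    library(rloadest)

    # load one prepared input from the "datain" folder (placeholder file name)
    load("datain/site_MI_rloadest.Rdata")
    ls()   # inspect the objects used by the scripts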
We present a flora and fauna dataset for the Mira-Mataje binational basins. This is an area shared between southwestern Colombia and northwestern Ecuador, where both the Chocó and Tropical Andes biodiversity hotspots converge. Information from 120 sources was systematized in the Darwin Core Archive (DwC-A) standard and geospatial vector data format for geographic information systems (GIS) (shapefiles). Sources included natural history museums, published literature, and citizen science repositories across 18 countries. The resulting database has 33,460 records from 5,281 species, of which 1,083 are endemic and 680 threatened. The diversity represented in the dataset is equivalent to 10% of the total plant species and 26% of the total terrestrial vertebrate species in the hotspots. It corresponds to 0.07% of their total area. The dataset can be used to estimate and compare biodiversity patterns with environmental parameters and provide value to ecosystems, ecoregions, and protected areas. The dataset is a baseline for future assessments of biodiversity in the face of environmental degradation, climate change, and accelerated extinction processes.

The data has been formally presented in the manuscript entitled "The Tropical Andes Biodiversity Hotspot: A Comprehensive Dataset for the Mira-Mataje Binational Basins" in the journal "Scientific Data". To maintain DOI integrity, this version will not change after publication of the manuscript and therefore we cannot provide further references on volume, issue, and DOI of manuscript publication.

- Data format 1: The .rds file extension saves a single object to be read in R and provides better compression, serialization, and integration within the R environment than simple .csv files. The description of file names is in the original manuscript.
-- m_m_flora_2021_voucher_ecuador.rds
-- m_m_flora_2021_observation_ecuador.rds
-- m_m_flora_2021_total_ecuador.rds
-- m_m_fauna_2021_ecuador.rds
- Data format 2: The .csv file has been encoded in UTF-8, and is an ASCII file with text separated by commas. The description of file names is in the original manuscript.
-- m_m_flora_fauna_2021_all.zip. This file includes all biodiversity datasets.
-- m_m_flora_2021_voucher_ecuador.csv
-- m_m_flora_2021_observation_ecuador.csv
-- m_m_flora_2021_total_ecuador.csv
-- m_m_fauna_2021_ecuador.csv
- Data format 3: We consolidated a shapefile for the basin containing layers for vegetation ecosystems and the total number of occurrences, species, and endemic and threatened species for each ecosystem.
-- biodiversity_measures_mira_mataje.zip. This file includes the .shp file and accessory geomatic files.
- A set of 3D shaded-relief map representations of the data in the shapefile can be found at https://doi.org/10.6084/m9.figshare.23499180.v4

Three taxonomic data tables were used in our technical validation of the presented dataset. These three files are:
1) the_catalog_of_life.tsv (Source: Bánki, O. et al. Catalogue of life checklist (version 2024-03-26). https://doi.org/10.48580/dfz8d (2024))
2) world_checklist_of_vascular_plants_names.csv (we are also including ancillary tables "world_checklist_of_vascular_plants_distribution.csv" and "README_world_checklist_of_vascular_plants_.xlsx") (Source: Govaerts, R., Lughadha, E. N., Black, N., Turner, R. & Paton, A. The World Checklist of Vascular Plants is a continuously updated resource for exploring global plant diversity. Sci. Data 8, 215, 10.1038/s41597-021-00997-6 (2021).)
3) world_flora_online.csv (Source: The World Flora Online Consortium et al. World flora online plant list December 2023, 10.5281/zenodo.10425161 (2023).)
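A minimal example of reading the two tabular data formats into R (file names as listed above; the .csv files are UTF-8 encoded):

    # .rds: a single R object per file
    fauna <- readRDS("m_m_fauna_2021_ecuador.rds")
    str(fauna)          # inspect the Darwin Core fields

    # .csv: UTF-8 encoded, comma separated
    flora_total <- read.csv("m_m_flora_2021_total_ecuador.csv", fileEncoding = "UTF-8")
    nrow(flora_total)   # number of records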
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
- prompt folder, the two bash scripts used to run it on the university HPC
- results folder, which contains the log files, and one csv file per model
- platine_check, indicating which gold results were re-checked
- figure_tables.R
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Data files and Python and R scripts are provided for Case Study 1 of the openENTRANCE project. The data covers 10 residential devices on the NUTS2 level for the EU27 + UK + TR + NO + CH from 2020-2050. The devices included are full battery electric vehicles (EV), storage heater (SH), water heater with storage capabilities (WH), air conditioning (AC), heat circulation pump (CP), air-to-air heat pump (HP), refrigeration (includes refrigerators (RF) and freezers (FR)), dish washer (DW), washing machine (WM), and tumble drier (TD). The data for the study uses representative hours to describe load expectations and constraints for each residential device - hourly granularity from 2020 to 2050 for a representative day for each month (i.e. 24 hours for an average day in each month).
The aggregated final results are in Full_potential.V9.csv and acheivable_NUTS2_summary.csv. The file metaData.Full_Potential.csv is provided to guide users on the nomenclature in Full_potential.V9.csv and the disaggregated data sets. The disaggregated loads can be found in d_ACV8.csv, d_CPV6.csv, d_DWV6.csv, d_EVV7.csv, d_FRV5.csv, d_HPV4.csv, d_RFV5.csv, d_SHV7.csv, d_TDV6.csv, d_WHV7.csv, d_WMV6.csv, while the disaggregated maximum capacities are in p_ACV8.csv, p_CPV6.csv, p_DWV6.csv, p_EVV7.csv, p_FRV5.csv, p_HPV4.csv, p_RFV5.csv, p_SHV7.csv, p_TDV6.csv, p_WHV7.csv, p_WMV6.csv.
Full_potential.V9.csv shows the NUTS2 level unadjusted loads for the residential devices using representative hours from 2020-2050. The loads provided here have not been adjusted with the direct load participation rates (see paper for more details). More details on the dataset can be found in the metaData.Full_Potential.csv file.
The acheivable_NUTS2_summary.csv shows the NUTS2 level achievable direct load control potentials for the average hour in the respective year (years: 2020, 2022, 2030, 2040, 2050). These summaries have already adjusted the disaggregated loads with direct load participation rates from participation_rates_country.csv.
A detailed overview of the data files is provided below. Where possible, a brief description, input data, and the script used to generate the data are provided. If questions arise, first refer to the publication. If something still needs clarification, send an email to ryano18@vt.edu.
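As a quick start, a minimal R sketch for loading the aggregated result files named above (the column nomenclature is documented in metaData.Full_Potential.csv):

    # aggregated results and their metadata (file names as distributed)
    full_potential <- read.csv("Full_potential.V9.csv")
    achievable     <- read.csv("acheivable_NUTS2_summary.csv")
    meta           <- read.csv("metaData.Full_Potential.csv")

    str(full_potential)   # column meanings are described in the metadata file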
Description of data provided
Achievable_NUTS2_summary.csv
Description
Average hourly achievable direct load potentials for each NUTS2 region and device for 2020, 2022, 2030,2040, 2050
Data input
Full_potential.V9.csv
participation_rates_country.csv
P_inc_SH.csv
P_inc_WH.csv
P_inc_HP.csv
P_inc_DW.csv
P_inc_WM.csv
P_inc_TD.csv
Script
NUTS2_acheivable.R
COP_.1deg_11-21_V1.csv
Description
NUTS2 average coefficient of performance estimates from 2011-2021 daily temperature
Data
tg_ens_mean_0.1deg_reg_2011-2021_v24.0e.nc
NUTS_RG_01M_2021_3857.shp
nhhV2.csv
Script
COP_from_E-OBS.R
Country dd projections.csv
Description
Assumptions for annual change in CDD and HDD
Spinoni, J., Vogt, J. V., Barbosa, P., Dosio, A., McCormick, N., Bigano, A., & Füssel, H. M. (2018). Changes of heating and cooling degree‐days in Europe from 1981 to 2100. International Journal of Climatology, 38, e191-e208.
Expectations for future HDD and CDD used the long-run averages and country level expected changes in the rcp45 scenario
EV NUTS projectionsV5.csv
Description
NUTS2 level EV projections 2018-2050
Data input
EV projectionsV5_ave.csv
Country level EV projections
NUTS 2 regional share of national vehicle fleet
Eurostat - Vehicle Nuts.xlsx
Script
EVprojections_NUTS_V5.py
EV_NVF_EV_path.xlsx
Description
Country level – EV share of new passenger vehicle fleet
From: Mathieu, L., & Poliscanova, J. (2020). Mission (almost) accomplished. Carmakers’ Race to Meet the, 21.
EV_parameters.xlsx
Description
Parameters used to calculate future loads from EVs
Wunit_EV – represents annual kWh per EV
evLIFE_150kkm (number of years) – represents usable life if the EV only lasted 150 thousand km. Hence, 150,000 / average km traveled per year with respect to country (this variable is dropped and not used for estimation).
Average age/#years assuming 150k life (number of years) – represents the average between evLIFE_150kkm and the average age of vehicle with respect to the country
full_potentialV9.csv
Description
Final data that shows hourly demand (Maximum Reduction) and Maximum Dispatch for each device, region, and year.
This data has not been adjusted with participation_rates_country.csv
Maximum dispatch is equal to max capacity – hourly demand with respect to the device, region, year, and hour.
Script
Full_potentialV9.py
gils projection assumptions.xlsx
Description
Data from: Gils, H. C. (2015). Balancing of intermittent renewable power generation by demand response and thermal energy storage.
A linear extrapolation was used to determine values for every year and country 2020-2050. AC – Air Conditioning, SH – Storage Heater, WH – Water heater with storage capability, CP – heat circulation pump, TD – Tumble Drier, WM – Washing Machine, DW -Dish Washer, FR – Freezer, RF – Refrigerator. The results are in the files shown below.
nflh – full load hours
nflh_ac.csv
nflh_cp.csv
wunit – annual energy consumption
Wunit_rf_fr.csv
Pcycle – power demand per cycle
Pcycle_wm.csv
Pcycle_dw.csv
Pcycle_td.csv
Punit – power demand for device
Punit_ac.csv
Punit_cp.csv
r – country level household ownership rates of residential device
rfr.csv
rrf.csv
rwm.csv
rtd.csv
rdw.csv
rac.csv
rwh.csv
rcp.csv
rsh.csv
Script
openENTRANCE projections.py
heat_pump_hourly_share.csv
Description
Hourly share of daily energy demand
From ENTSOs TYNDP – Charts and Figures
https://2020.entsos-tyndp-scenarios.eu/download-data/#download
hourlyEVshares.csv
Description
Hourly share of daily energy demand
From My Electric Avenue Study
https://eatechnology.com/consultancy-insights/my-electric-avenue/
HP_transitionV2.csv
Description
Used to create Qhp_thermal_MWh_projectedV2.csv
Final_energy_15-19
Average final energy demand for the residential heating sector between 2015-2019
Final_energy_15-19_nonEE
Average final energy demand for the residential heating sector for energy sources that are not energy efficient between 2015-2019 (see paper for sources)
Final_energy_15-19_nonEE_share
share of inefficient heating sources
HP_thermal_2018
Thermal energy provided by residential heat pumps in 2018
HP_thermal_2019
Thermal energy provided by residential heat pumps in 2019
See publication for data sources
Nflh_ac.csv, nflh_cp.csv
See gils projection assumptions.xlsx
nhhV2
Description
Expected number of households for NUTS2 regions for 2020-2050
See publication for data sources
Script
EUROSTAT_POP2NUTSV2.R
NUTS0_thermal_heat_annum.csv
Description
Country level residential annual thermal heat requirements in kWh
Used to determine maximum dispatch in openENTRANCE final V14.py
Mantzos, L., Wiesenthal, T., Matei, N. A., Tchung-Ming, S., Rozsai, M., Russ, P., & Ramirez, A. S. (2017). JRC-IDEES: Integrated Database of the European Energy Sector: Methodological Note (No. JRC108244). Joint Research Centre (Seville site).
p_ACV8.csv, p_CPV6.csv, p_DWV6.csv, p_EVV7.csv, p_FRV5.csv, p_HPV4.csv, p_RFV5.csv, p_SHV7.csv, p_TDV6.csv, p_WHV7.csv, p_WMV6.csv
Description
Maximum capacity – load for a device can never exceed maximum capacity
Data
gils projection assumptions.xlsx
Script
openENTRANCE final V14.py
P_inc_DW.csv, P_inc_HP.csv, P_inc_SH.csv, P_inc_TD.csv, P_inc_WH.csv, P_inc_WM.csv, SAMPLE_PINC.csv
Description
Unadjusted average hourly potential for increase by NUTS2 region for 2018-2050
Data
d_ACV8.csv, d_CPV6.csv, d_DWV6.csv, d_EVV7.csv, d_FRV5.csv, d_HPV4.csv, d_RFV5.csv, d_SHV7.csv, d_TDV6.csv, d_WHV7.csv, d_WMV6.csv
Theoretical maximum reduction / load of the respective device
p_ACV8.csv, p_CPV6.csv, p_DWV6.csv, p_EVV7.csv, p_FRV5.csv, p_HPV4.csv, p_RFV5.csv, p_SHV7.csv, p_TDV6.csv, p_WHV7.csv, p_WMV6.csv
Maximum capacity
Script
P_increaseV2.py
Pcycle_dw.csv, Pcycle_td.csv, Pcycle_wm.csv
Description
power demand per cycle (kWh)
See gils projection assumptions.xlsx
Punit_ac.csv, Punit_cp.csv
Description
Unit capacities (kWh)
See gils projection assumptions.xlsx
Qhp_thermal_MWh_projectedV2.csv
Description
NUTS2 expectations for thermal energy demand met by heat pumps for 2022-2050
Assumes a linear decomposition of non-renewable and non-energy efficient heating sources until 2050
Data
HP_transitionV2.csv
nhhV2.csv
Script
HP_projection_nuts.py
rac.csv, rcp.csv, rdw.csv, rfr.csv, rrf.csv, rsh.csv, rtd.csv, rwh.csv, rwm.csv
Description
Household ownership rates
See gils projection assumptions.xlsx
This data package is associated with the publication “Investigating the impacts of solid phase extraction on dissolved organic matter optical signatures and the pairing with high-resolution mass spectrometry data in a freshwater system” submitted to “Limnology and Oceanography: Methods.” This data is an extension of the River Corridor and Watershed Biogeochemistry SFA’s Spatial Study 2021 (https://doi.org/10.15485/1898914). Other associated data and field metadata can be found at the link provided.

The goal of this manuscript is to assess the impact of solid phase extraction (SPE) on the ability to pair ultra-high resolution mass spectrometry data collected from SPE extracts with optical properties collected on ambient stream samples. Forty-seven samples collected from within the Yakima River Basin, Washington were analyzed for dissolved organic carbon (DOC, measured as non-purgeable organic carbon, NPOC), absorbance, and fluorescence. Samples were subsequently concentrated with SPE and reanalyzed for each measurement. The extraction efficiency for the DOC and common optical indices were calculated. In addition, SPE samples were subject to ultra-high resolution mass spectrometry and compared with the ambient and SPE generated optical data. Finally, in addition to this cross-platform inter-comparison, we further performed an intra-comparison among the high-resolution mass spectrometry data to determine the impact of sample preparation on the interpretability of results. Here, the SPE samples were prepared at 40 milligrams per liter (mg/L) based on the known DOC extraction efficiency of the samples (ranging from ~30 to ~75%) compared to the common practice of assuming the DOC extraction efficiency of freshwater samples at 60%.

This data package folder consists of one main data folder with one subfolder (Data_Input). The main data folder contains (1) readme; (2) data dictionary (dd); (3) file-level metadata (flmd); (4) final data summary output from processing script; and (5) the processing script. The R-markdown processing script (SPE_Manuscript_Rmarkdown_Data_Package.rmd) contains all code needed to reproduce manuscript statistics and figures (with the exception of that stated below). The Data_Input folder has two subfolders: (1) FTICR and (2) Optics. Additionally, the Data_Input folder contains dissolved organic carbon (DOC, measured as non-purgeable organic carbon, NPOC) data (SPS_NPOC_Summary.csv) and relevant supporting Solid Phase Extraction Volume information (SPS_SPE_Volumes.csv). Methods information for the optical and FTICR data is embedded in the header rows of SPS_EEMs_Methods.csv and SPS_FTICR_Methods.csv, respectively. In addition, the data dictionary (SPS_SPE_dd.csv), file level metadata (SPS_SPE_flmd.csv), and methods codes (SPS_SPE_Methods_codes.csv) are provided.

The FTICR subfolder contains all raw FTICR data as well as instructions for processing. In addition, post-processed FTICR molecular information (Processed_FTICRMS_Mol.csv) and sample data (Processed_FTICRMS_Data.csv) are provided that can be directly read into R with the associated R-markdown file. The Optics subfolder contains all Absorbance and Fluorescence Spectra. Fluorescence spectra have been blank corrected, inner filter corrected, and undergone scatter removal. In addition, this folder contains Matlab code used to make a portion of Figure 1 within the manuscript, derive various spectral parameters used within the manuscript, and used for parallel factor analysis (PARAFAC) modeling.
Spectral indices (SPS_SpectralIndices.csv) and PARAFAC outputs (SPS_PARAFAC_Model_Loadings.csv and SPS_PARAFAC_Sample_Scores.csv) are directly read into the associated R-markdown file.
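A hedged sketch of reading those processed tables into R (the relative paths are assumptions based on the folder layout described above):

    # processed FTICR-MS tables, spectral indices, and NPOC data
    mol     <- read.csv("Data_Input/FTICR/Processed_FTICRMS_Mol.csv")
    ms_data <- read.csv("Data_Input/FTICR/Processed_FTICRMS_Data.csv")
    indices <- read.csv("Data_Input/Optics/SPS_SpectralIndices.csv")
    npoc    <- read.csv("Data_Input/SPS_NPOC_Summary.csv")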
This dataset accompanies the paper “Self-Attribution of Distorted Reaching Movements in Immersive Virtual Reality” published in the Computers & Graphics journal from Elsevier. It contains 3 datasets related to the experiments described in the paper. All datasets are in “.csv” format and can be easily loaded by statistical analysis tools (e.g. a dataset can be loaded in R using the command read.csv("filename.csv")). It also contains the C# Unity implementation of the distortion function presented in the paper. Paper reference: Galvan Debarba H, Boulic R, Salomon R, Blanke O, Herbelin B. Self-Attribution of Distorted Reaching Movements in Immersive Virtual Reality. Computers & Graphics. 2018; ISSN 0097-8493. Elsevier. DOI: doi.org/10.1016/j.cag.2018.09.001
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
The purpose of this dataset is to provide a detailed picture of the characteristics of Syrian towns in the years preceding the 2011 Syrian uprising and ensuing civil war. It incorporates the 2004 national census, the last before the uprising, and a newly collected set of data on ethnic identity. The level of analysis is the town (the Syrian Census Bureau’s fourth administrative level).

TECHNICAL NOTE: The .csv files in this data package contain both Arabic and English, so are encoded in UTF-8. The Arabic script should render if opened directly in Open Office, Numbers, Google Drive, or R statistical software. To read the Arabic in Excel, you can open the .csv file in any of these applications and save it as an .xlsx file, or open it through Excel using the following steps: (1) open a blank Excel document; (2) import the data using “Data -> Get External Data -> Import text file”; (3) select “File Origin: Unicode (UTF-8)”; (4) select “Delimiters: comma”; (5) select the top left cell to place the data. See the following post for further details: https://stackoverflow.com/questions/6002256/is-it-possible-to-force-excel-recognize-utf-8-csv-files-automatically
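In R, a minimal example (the file name is a placeholder for any of the .csv files in the package):

    # read a UTF-8 encoded file so the Arabic text renders correctly
    towns <- read.csv("towns.csv", fileEncoding = "UTF-8", stringsAsFactors = FALSE)
    head(towns)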
This module series covers how to import, manipulate, format and plot time series data stored in .csv format in R. Originally designed to teach researchers to use NEON plant phenology and air temperature data; has been used in undergraduate classrooms.
[Note 2023-08-14 - Supersedes version 1, https://doi.org/10.15482/USDA.ADC/1528086] This dataset contains all code and data necessary to reproduce the analyses in the manuscript: Mengistu, A., Read, Q. D., Sykes, V. R., Kelly, H. M., Kharel, T., & Bellaloui, N. (2023). Cover crop and crop rotation effects on tissue and soil population dynamics of Macrophomina phaseolina and yield under no-till system. Plant Disease. https://doi.org/10.1094/pdis-03-23-0443-re

The .zip archive cropping-systems-1.0.zip contains data and code files.

Data
- stem_soil_CFU_by_plant.csv: Soil disease load (SoilCFUg) and stem tissue disease load (StemCFUg) for individual plants in CFU per gram, with columns indicating year, plot ID, replicate, row, plant ID, previous crop treatment, cover crop treatment, and comments. Missing data are indicated with .
- yield_CFU_by_plot.csv: Yield data (YldKgHa) at the plot level in units of kg/ha, with columns indicating year, plot ID, replicate, and treatments, as well as means of soil and stem disease load at the plot level.

Code
- cropping_system_analysis_v3.0.Rmd: RMarkdown notebook with all data processing, analysis, and visualization code
- equations.Rmd: RMarkdown notebook with formatted equations
- formatted_figs_revision.R: R script to produce figures formatted exactly as they appear in the manuscript

The Rproject file cropping-systems.Rproj is used to organize the RStudio project. Scripts and notebooks used in older versions of the analysis are found in the testing/ subdirectory. Excel spreadsheets containing raw data from which the cleaned CSV files were created are found in the raw_data subdirectory.
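A minimal R sketch for reading the two data files, using the missing-data convention noted above:

    # "." marks missing values in the plant-level file
    stem_soil <- read.csv("stem_soil_CFU_by_plant.csv", na.strings = ".")
    yield_dat <- read.csv("yield_CFU_by_plot.csv")

    summary(stem_soil$SoilCFUg)   # soil disease load, CFU per gram
    summary(yield_dat$YldKgHa)    # yield, kg/ha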