67 datasets found
  1. Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

    • zenodo.org
    application/gzip, bin +2
    Updated Aug 2, 2024
    + more versions
    Cite
    Marat Valiev; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788
    Available download formats: bin, application/gzip, zip, text/x-python
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marat Valiev; Bogdan Vasilescu; James Herbsleb
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** A link to the data artifacts is already included in the paper. 
    A link to the code will be included in the camera-ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34 GB unpacked. This dataset still doesn't include the PyPI packages
     themselves, which take around 2 TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semi-structured interviews.
    - **LICENSE** - text of GPL v3, under which this dataset is published
    - **INSTALL.md** - replication guide (~2 pages)
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on the level of detail (see Step 2 for more details):
    - up to 2 TB of disk space (see Step 2 detail levels)
    - at least 16 GB of RAM (64 GB preferable)
    - a few hours to a few months of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       git checkout 0.1.0
     
     `cd` into the extracted folder.
     All commands below assume it is the current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file:
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt`.
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/init.py`, comment out everything except GitHub support
     in `PROVIDERS`.
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speed up
    the process:
    
    #### Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
    
    #### Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15-30 minutes.
    
    - create a folder `
  2. Data Mining Project - Boston

    • kaggle.com
    Updated Nov 25, 2019
    Cite
    SophieLiu (2019). Data Mining Project - Boston [Dataset]. https://www.kaggle.com/sliu65/data-mining-project-boston/discussion
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 25, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    SophieLiu
    Area covered
    Boston
    Description

    Context

    To make this a seamless process, I cleaned the data and deleted many variables that I thought were not important to our dataset. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data file has both Lyft and Uber, but it is still a cleaned version of the dataset we downloaded from Kaggle.

    Use of Data Files

    You can easily subset the data into the car types that you will be modeling by first loading the CSV into R. Here is the code for doing this:

    This loads the file into R

    df<-read.csv('uber.csv')

    The next piece of code subsets the data into specific car types. The example below keeps only the Uber 'Black' car type.

    df_black<-subset(df, df$name == 'Black')

    This next portion of code writes the data frame to a CSV file on your computer so that it can be loaded back into R later.

    write.csv(df_black, "nameofthefileyouwanttosaveas.csv")

    The file will appear in your working directory. If you are not sure where your working directory is, run this code:

    getwd()

    The output will be the file path to your working directory. You will find the file you just created in that folder.
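
    Putting the pieces together, the whole workflow is only a few lines of R (the input and output file names here are examples, not fixed by the dataset):

      # Load the Uber file, keep only the 'Black' car type, and write the subset out.
      df <- read.csv('uber.csv')
      df_black <- subset(df, df$name == 'Black')
      write.csv(df_black, 'uber_black.csv', row.names = FALSE)  # output name is an example
      getwd()  # shows the folder the new CSV was written to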

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  3. Randomized Hourly Load Data for use with Taxonomy Distribution Feeders

    • data.wu.ac.at
    application/unknown
    Updated Aug 29, 2017
    Cite
    Department of Energy (2017). Randomized Hourly Load Data for use with Taxonomy Distribution Feeders [Dataset]. https://data.wu.ac.at/schema/data_gov/NWYwYmFmYTItOWRkMC00OWM0LTk3OGYtZDcyYzZiOWY5N2Ez
    Available download formats: application/unknown
    Dataset updated
    Aug 29, 2017
    Dataset provided by
    Department of Energy
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This dataset was developed by NREL's distributed energy systems integration group as part of a study on high penetrations of distributed solar PV [1]. It consists of hourly load data in CSV format for use with the PNNL taxonomy of distribution feeders [2]. These feeders were developed in the open source GridLAB-D modelling language [3]. In this dataset each of the load points in the taxonomy feeders is populated with hourly averaged load data from a utility in the feeder’s geographical region, scaled and randomized to emulate real load profiles. For more information on the scaling and randomization process, see [1].

    The taxonomy feeders are statistically representative of the various types of distribution feeders found in five geographical regions of the U.S. Efforts are underway (possibly complete) to translate these feeders into the OpenDSS modelling language.

    This data set consists of one large CSV file for each feeder. Within each CSV, each column represents one load bus on the feeder. The header row lists the name of the load bus. The subsequent 8760 rows represent the loads for each hour of the year. The loads were scaled and randomized using a Python script, so each load series represents only one of many possible randomizations. In the header row, "rl" = residential load and "cl" = commercial load. Commercial loads are followed by a phase letter (A, B, or C). For regions 1-3, the data is from 2009. For regions 4-5, the data is from 2000.

    For use in GridLAB-D, each column will need to be separated into its own CSV file without a header. The load value goes in the second column, and corresponding datetime values go in the first column, as shown in the sample file, sample_individual_load_file.csv. Only the first value in the time column needs to be written as an absolute time; subsequent times may be written in relative format (i.e. "+1h", as in the sample). The load should be written in P+Qj format, as seen in the sample CSV, in units of Watts (W) and Volt-amps reactive (VAr). This dataset was derived from metered load data and hence includes only real power; reactive power can be generated by assuming an appropriate power factor. These loads were used with GridLAB-D version 2.2.
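
    As a sketch of that conversion (not part of the dataset; the feeder file name, the start timestamp format, and the assumption of zero reactive power are all illustrative), each column of a feeder CSV can be split into its own header-less two-column file in R:

      # Split one feeder CSV into per-load files: the first row gets an absolute time,
      # later rows use the relative "+1h" format; Q is assumed zero (real power only).
      feeder <- read.csv("R1-12.47-1_loads.csv", check.names = FALSE)  # hypothetical file name
      times <- c("2009-01-01 0:00:00", rep("+1h", nrow(feeder) - 1))   # 2009 applies to regions 1-3
      for (bus in names(feeder)) {
        values <- sprintf("%.2f+0j", feeder[[bus]])                    # P+Qj with Q = 0
        write.table(data.frame(times, values), file = paste0(bus, ".csv"),
                    sep = ",", row.names = FALSE, col.names = FALSE, quote = FALSE)
      }

    Check the included sample_individual_load_file.csv for the exact timestamp and P+Qj formatting expected by GridLAB-D before relying on this sketch.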

    Browse files in this dataset, accessible as individual files and as a single ZIP file. This dataset is approximately 242MB compressed or 475MB uncompressed.

    For questions about this dataset, contact andy.hoke@nrel.gov.

    If you find this dataset useful, please mention NREL and cite [1] in your work.

    References:

    [1] A. Hoke, R. Butler, J. Hambrick, and B. Kroposki, “Steady-State Analysis of Maximum Photovoltaic Penetration Levels on Typical Distribution Feeders,” IEEE Transactions on Sustainable Energy, April 2013, available at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6357275 .

    [2] K. Schneider, D. P. Chassin, R. Pratt, D. Engel, and S. Thompson, “Modern Grid Initiative Distribution Taxonomy Final Report”, PNNL, Nov. 2008. Accessed April 27, 2012: http://www.gridlabd.org/models/feeders/taxonomy of prototypical feeders.pdf

    [3] K. Schneider, D. Chassin, Y. Pratt, and J. C. Fuller, “Distribution power flow for smart grid technologies”, IEEE/PES Power Systems Conference and Exposition, Seattle, WA, Mar. 2009, pp. 1-7, 15-18.

  4. Consumer Expenditure Survey (CE)

    • dataverse.harvard.edu
    Updated May 30, 2013
    Cite
    Anthony Damico (2013). Consumer Expenditure Survey (CE) [Dataset]. http://doi.org/10.7910/DVN/UTNJAH
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 30, 2013
    Dataset provided by
    Harvard Dataverse
    Authors
    Anthony Damico
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    analyze the consumer expenditure survey (ce) with r the consumer expenditure survey (ce) is the primo data source to understand how americans spend money. participating households keep a running diary about every little purchase over the year. those diaries are then summed up into precise expenditure categories. how else are you gonna know that the average american household spent $34 (±2) on bacon, $826 (±17) on cellular phones, and $13 (±2) on digital e-readers in 2011? an integral component of the market basket calculation in the consumer price index, this survey recently became available as public-use microdata and they're slowly releasing historical files back to 1996. hooray! for a taste of what's possible with ce data, look at the quick tables listed on their main page - these tables contain approximately a bazillion different expenditure categories broken down by demographic groups. guess what? i just learned that americans living in households with $5,000 to $9,999 of annual income spent an average of $283 (±90) on pets, toys, hobbies, and playground equipment (pdf page 3). you can often get close to your statistic of interest from these web tables. but say you wanted to look at domestic pet expenditure among only households with children between 12 and 17 years old. another one of the thirteen web tables - the consumer unit composition table - shows a few different breakouts of households with kids, but none matching that exact population of interest. the bureau of labor statistics (bls) (the survey's designers) and the census bureau (the survey's administrators) have provided plenty of the major statistics and breakouts for you, but they're not psychic. if you want to comb through this data for specific expenditure categories broken out by a you-defined segment of the united states' population, then let a little r into your life. fun starts now. fair warning: only analyze the consumer expenditure survey if you are nerd to the core. the microdata ship with two different survey types (interview and diary), each containing five or six quarterly table formats that need to be stacked, merged, and manipulated prior to a methodologically-correct analysis. the scripts in this repository contain examples to prepare 'em all, just be advised that magnificent data like this will never be no-assembly-required. the folks at bls have posted an excellent summary of what's available - read it before anything else. after that, read the getting started guide. don't skim. a few of the descriptions below refer to sas programs provided by the bureau of labor statistics. you'll find these in the C:\My Directory\CES\2011\docs directory after you run the download program.

    this new github repository contains three scripts:

    - 2010-2011 - download all microdata.R: loop through every year and download every file hosted on the bls's ce ftp site; import each of the comma-separated value files into r with read.csv; depending on user-settings, save each table as an r data file (.rda) or stata-readable file (.dta)

    - 2011 fmly intrvw - analysis examples.R: load the r data files (.rda) necessary to create the 'fmly' table shown in the ce macros program documentation.doc file; construct that 'fmly' table, using five quarters of interviews (q1 2011 thru q1 2012); initiate a replicate-weighted survey design object; perform some lovely li'l analysis examples; replicate the %mean_variance() macro found in "ce macros.sas" and provide some examples of calculating descriptive statistics using unimputed variables; replicate the %compare_groups() macro found in "ce macros.sas" and provide some examples of performing t-tests using unimputed variables; create an rsqlite database (to minimize ram usage) containing the five imputed variable files, after identifying which variables were imputed based on pdf page 3 of the user's guide to income imputation; initiate a replicate-weighted, database-backed, multiply-imputed survey design object; perform a few additional analyses that highlight the modified syntax required for multiply-imputed survey designs; replicate the %mean_variance() and %compare_groups() macros again, this time with examples of calculating descriptive statistics and performing t-tests using imputed variables; replicate the %proc_reg() and %proc_logistic() macros found in "ce macros.sas" and provide some examples of regressions and logistic regressions using both unimputed and imputed variables

    - replicate integrated mean and se.R: match each step in the bls-provided sas program "integrated mean and se.sas" but with r instead of sas; create an rsqlite database when the expenditure table gets too large for older computers to handle in ram; export a table "2011 integrated mean and se.csv" that exactly matches the contents of the sas-produced "2011 integrated mean and se.lst" text file

    click here to view these three scripts for...
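
    as a rough illustration of the "replicate-weighted survey design" step above (this is not the repository's code; the file names are hypothetical and the variable names - finlwt21, wtrep01 thru wtrep44, totexppq - follow ce conventions but should be checked against the bls documentation):

      # sketch only: stack two quarterly interview files and build a brr replicate design
      library(survey)
      load("fmli111x.rda")                      # hypothetical .rda files saved by the download script
      load("fmli112.rda")
      fmly <- rbind(fmli111x, fmli112)
      fmly.design <-
        svrepdesign(
          weights = ~finlwt21,                  # full-sample weight
          repweights = "wtrep[0-9]+",           # the 44 replicate weight columns
          type = "BRR",
          data = fmly
        )
      svymean(~totexppq, fmly.design)           # a ce expenditure total (check the data dictionary)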

  5. Data from: Data and code from: Environmental influences on drying rate of...

    • catalog.data.gov
    • datasetcatalog.nlm.nih.gov
    • +1more
    Updated Apr 21, 2025
    + more versions
    Cite
    Agricultural Research Service (2025). Data and code from: Environmental influences on drying rate of spray applied disinfestants from horticultural production services [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-environmental-influences-on-drying-rate-of-spray-applied-disinfestants-
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    This dataset includes all the data and R code needed to reproduce the analyses in a forthcoming manuscript: Copes, W. E., Q. D. Read, and B. J. Smith. Environmental influences on drying rate of spray applied disinfestants from horticultural production services. PhytoFrontiers, DOI pending.

    Study description: Instructions for disinfestants typically specify a dose and a contact time to kill plant pathogens on production surfaces. A problem occurs when disinfestants are applied to large production areas where the evaporation rate is affected by weather conditions. The common contact time recommendation of 10 min may not be achieved under hot, sunny conditions that promote fast drying. This study is an investigation into how the evaporation rates of six commercial disinfestants vary when applied to six types of substrate materials under cool to hot and cloudy to sunny weather conditions. Initially, disinfestants with low surface tension spread out to provide 100% coverage and disinfestants with high surface tension beaded up to provide about 60% coverage when applied to hard smooth surfaces. Disinfestants applied to porous materials, such as wood and concrete, were quickly absorbed into the body of the material. Even though disinfestants evaporated faster under hot sunny conditions than under cool cloudy conditions, coverage was reduced considerably in the first 2.5 min under most weather conditions and reduced to less than or equal to 50% coverage by 5 min.

    Dataset contents: This dataset includes R code to import the data and fit Bayesian statistical models using the model fitting software CmdStan, interfaced with R using the packages brms and cmdstanr. The models (one for 2022 and one for 2023) compare how quickly different spray-applied disinfestants dry, depending on what chemical was sprayed, what surface material it was sprayed onto, and what the weather conditions were at the time. Next, the statistical models are used to generate predictions and compare mean drying rates between the disinfestants, surface materials, and weather conditions. Finally, tables and figures are created. These files are included:

    - Drying2022.csv: drying rate data for the 2022 experimental run
    - Weather2022.csv: weather data for the 2022 experimental run
    - Drying2023.csv: drying rate data for the 2023 experimental run
    - Weather2023.csv: weather data for the 2023 experimental run
    - disinfestant_drying_analysis.Rmd: RMarkdown notebook with all data processing, analysis, and table creation code
    - disinfestant_drying_analysis.html: rendered output of the notebook
    - MS_figures.R: additional R code to create figures formatted for journal requirements
    - fit2022_discretetime_weather_solar.rds: fitted brms model object for 2022. This allows users to reproduce the model prediction results without having to refit the model, which was originally fit on a high-performance computing cluster
    - fit2023_discretetime_weather_solar.rds: fitted brms model object for 2023
    - data_dictionary.xlsx: descriptions of each column in the CSV data files
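
    For example, the pre-fitted model objects can be inspected in R without refitting (a minimal sketch, assuming the brms package is installed and the .rds file sits in the working directory):

      # Load the fitted 2022 model and look at posterior summaries and fitted values.
      library(brms)
      fit2022 <- readRDS("fit2022_discretetime_weather_solar.rds")
      summary(fit2022)        # posterior summaries of the model parameters
      head(fitted(fit2022))   # fitted values with 95% credible intervals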

  6. Replication Data for: Revisiting 'The Rise and Decline' in a Population of...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    TeBlunthuis, Nathan; Aaron Shaw; Benjamin Mako Hill (2023). Replication Data for: Revisiting 'The Rise and Decline' in a Population of Peer Production Projects [Dataset]. http://doi.org/10.7910/DVN/SG3LP1
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    TeBlunthuis, Nathan; Aaron Shaw; Benjamin Mako Hill
    Description

    This archive contains code and data for reproducing the analysis for "Replication Data for Revisiting 'The Rise and Decline' in a Population of Peer Production Projects". Depending on what you hope to do with the data you probably do not want to download all of the files. Depending on your computation resources you may not be able to run all stages of the analysis. The code for all stages of the analysis, including typesetting the manuscript and running the analysis, is in code.tar. If you only want to run the final analysis or to play with the datasets used in the analysis of the paper, you want intermediate_data.7z or the uncompressed tab and csv files.

    The data files are created in a four-stage process. The first stage uses the program "wikiq" to parse mediawiki xml dumps and create tsv files that have edit data for each wiki. The second stage generates the all.edits.RDS file, which combines these tsvs into a dataset of edits from all the wikis. This file is expensive to generate and, at 1.5GB, is pretty big. The third stage builds smaller intermediate files that contain the analytical variables from these tsv files. The fourth stage uses the intermediate files to generate smaller RDS files that contain the results. Finally, knitr and latex typeset the manuscript. A stage will only run if the outputs from the previous stages do not exist, so if the intermediate files exist they will not be regenerated and only the final analysis will run. The exception is that stage 4, fitting models and generating plots, always runs. If you only want to replicate from the second stage onward, you want wikiq_tsvs.7z. If you want to replicate everything, you want wikia_mediawiki_xml_dumps.7z.001, wikia_mediawiki_xml_dumps.7z.002, and wikia_mediawiki_xml_dumps.7z.003. These instructions work backwards from building the manuscript using knitr, loading the datasets, running the analysis, to building the intermediate datasets.

    Building the manuscript using knitr: This requires working latex, latexmk, and knitr installations. Depending on your operating system you might install these packages in different ways; on Debian Linux you can run apt install r-cran-knitr latexmk texlive-latex-extra. Alternatively, you can upload the necessary files to a project on Overleaf.com. Download code.tar; this has everything you need to typeset the manuscript. Unpack the tar archive (on a unix system this can be done by running tar xf code.tar) and navigate to code/paper_source. Install the R dependencies: in R, run install.packages(c("data.table","scales","ggplot2","lubridate","texreg")). On a unix system you should be able to run make to build the manuscript generalizable_wiki.pdf. Otherwise you should try uploading all of the files (including the tables, figure, and knitr folders) to a new project on Overleaf.com.

    Loading intermediate datasets: The intermediate datasets are found in the intermediate_data.7z archive. They can be extracted on a unix system using the command 7z x intermediate_data.7z. The files are 95MB uncompressed. These are RDS (R data set) files and can be loaded in R using readRDS, for example newcomer.ds <- readRDS("newcomers.RDS"). If you wish to work with these datasets using a tool other than R, you might prefer to work with the .tab files.

    Running the analysis: Fitting the models may not work on machines with less than 32GB of RAM. If you have trouble, you may find the functions in lib-01-sample-datasets.R useful to create stratified samples of data for fitting models; see line 89 of 02_model_newcomer_survival.R for an example. Download code.tar and intermediate_data.7z to your working folder and extract both archives (on a unix system this can be done with the command tar xf code.tar && 7z x intermediate_data.7z). Install the R dependencies: install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). On a unix system you can simply run regen.all.sh to fit the models, build the plots and create the RDS files.

    Generating datasets - Building the intermediate files: The intermediate files are generated from all.edits.RDS. This process requires about 20GB of memory. Download all.edits.RDS, userroles_data.7z, selected.wikis.csv, and code.tar. Unpack code.tar and userroles_data.7z (on a unix system this can be done using tar xf code.tar && 7z x userroles_data.7z). Install the R dependencies: in R run install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). Run 01_build_datasets.R.

    Building all.edits.RDS: The intermediate RDS files used in the analysis are created from all.edits.RDS. To replicate building all.edits.RDS, you only need to run 01_build_datasets.R when the int... Visit https://dataone.org/datasets/sha256%3Acfa4980c107154267d8eb6dc0753ed0fde655a73a062c0c2f5af33f237da3437 for complete metadata about this dataset.
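
    Collected from the steps above, a minimal R snippet for getting started with the intermediate data (it assumes code.tar and intermediate_data.7z have already been extracted into the working directory):

      # One-time setup: install the R dependencies listed above.
      install.packages(c("data.table", "ggplot2", "urltools", "texreg", "optimx", "lme4",
                         "bootstrap", "scales", "effects", "lubridate", "devtools", "roxygen2"))
      # Load one of the intermediate RDS datasets and inspect it.
      newcomer.ds <- readRDS("newcomers.RDS")
      str(newcomer.ds)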

  7. Petre_Slide_CategoricalScatterplotFigShare.pptx

    • figshare.com
    pptx
    Updated Sep 19, 2016
    Cite
    Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
    Available download formats: pptx
    Dataset updated
    Sep 19, 2016
    Dataset provided by
    figshare
    Authors
    Benj Petre; Aurore Coince; Sophien Kamoun
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Categorical scatterplots with R for biologists: a step-by-step guide

    Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

    1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

    Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

    Protocol

    • Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.

    • Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

    • Step 3: save the graph as a .pdf file. Adjust the window to your convenience and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.

    Notes

    • Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

    • Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

    7 Display the graph in a separate window. Dot colors indicate replicates

    graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
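
    The script itself is distributed on the PowerPoint slide rather than reproduced in this listing; a minimal equivalent sketch, assuming the three-column layout (Replicate, Condition, Value) described in Step 1, looks like this:

      # Pick the input .csv via a dialog box and draw the categorical scatterplot.
      library(ggplot2)
      dat <- read.csv(file.choose())
      graph <- ggplot(dat, aes(x = Condition, y = Value))
      graph + geom_boxplot(outlier.colour = 'black', colour = 'black') +
        geom_jitter(aes(col = Replicate)) + theme_bw()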

    References

    Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

    Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

    Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

    https://cran.r-project.org/

    http://ggplot2.org/

  8. Updated Data for: Food-washing monkeys recognize the law of diminishing...

    • zenodo.org
    bin, csv
    Updated Sep 11, 2024
    Cite
    Luke Fannin (2024). Updated Data for: Food-washing monkeys recognize the law of diminishing returns [Dataset]. http://doi.org/10.5281/zenodo.13748538
    Available download formats: csv, bin
    Dataset updated
    Sep 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Luke Fannin
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This is a new draft version of the data files for "Food-washing monkeys recognize the law of diminishing returns" by Rosien et al.

    The original reviewed preprint was published on the eLife website on 22 July 2024: https://elifesciences.org/reviewed-preprints/98520. The data stored here are for the updated version of record.

    The published text contains methods justifications and supporting citations.

    This dataset was revised based on the recommendations of three reviewers. It now contains:

    • Two text files, to be run in the R programming environment (version of record is 4.4.1), containing code to replicate the GLMM analyses and produce the figure files displayed in the paper.
    • Two .csv files for running the GLMM statistics included in the revised text.
    • One .csv file for figure 1, which contains sand geometric and compositional data.
    • One .csv file containing intake rate data.
    • Six .csv files used to create figures 2 and S2 in the text.
    • One Mathematica notebook file for producing the optimal cleaning model of figure 3.

    A general note: when running the scripts, the file paths you use will differ from the ones in the current text, as they depend on where the actual .csv files are stored on your computer. The "read.csv" command in the R code will need to be customized to a particular file path.
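
    For instance (the path and file name below are hypothetical):

      # Edit the path to wherever the downloaded .csv files live on your machine.
      intake <- read.csv("~/Downloads/monkey_data/intake_rate.csv")
      head(intake)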

  9. Basic Stand Alone Medicare Claims Public Use Files (BSAPUFs)

    • dataverse.harvard.edu
    Updated May 30, 2013
    Cite
    Anthony Damico (2013). Basic Stand Alone Medicare Claims Public Use Files (BSAPUFs) [Dataset]. http://doi.org/10.7910/DVN/BGP8EB
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 30, 2013
    Dataset provided by
    Harvard Dataverse
    Authors
    Anthony Damico
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    analyze the basic stand alone medicare claims public use files (bsapufs) with r and monetdb the centers for medicare and medicaid services (cms) took the plunge. the famous medicare 5% sample has been released to the public, free of charge. jfyi - medicare is the u.s. government program that provides health insurance to 50 million elderly and disabled americans. the basic stand alone medicare claims public use files (bsapufs) contain either person- or event-level data on inpatient stays, durable medical equipment purchases, prescription drug fills, hospice users, doctor visits, home health provision, outpatient hospital procedures, skilled nursing facility short-term residents, as well as aggregated statistics for medicare beneficiaries with chronic conditions and medicare beneficiaries living in nursing homes. oh sorry, there's one catch: they only provide sas scripts to analyze everything. cue the villain music. that bored old game of monopoly ends today. the initial release of the 2008 bsapufs was accompanied by some major fanfare in the world of health policy, a big win for government transparency. unfortunately, the final files that cleared the confidentiality hurdles are heavily de-identified and obfuscated. prime examples: none of the files can be linked to any other file, not across years, not across expenditure categories; costs are rounded to the nearest fifth or tenth dollar at lower values, nearest thousandth at higher values; ages are categorized into five year bands. so these files are baldly inferior to the unsquelched, linkable data only available through an expensive formal application process. any researcher with a budget flush enough to afford a sas license (the only statistical software mentioned in the cms official documentation) can probably also cough up the money to buy the identifiable data through resdac (resdac, btw, rocks). soapbox: cms released free public data sets that could only be analyzed with a software package costing thousands of dollars. so even though the actual data sets were free, researchers still needed deep pockets to buy sas. meanwhile, the unsquelched and therefore superior data sets are also available for many thousands of dollars. researchers with funding would (reasonably) just buy the better data. researchers without any financial resources - the target audience of free, public data - were left out in the cold. no wonder these bsapufs haven't been used much. that ends now. using r, monetdb, and the personal computer you already own (mine cost $700 in 2009), researchers can, for the first time, seriously analyze these medicare public use files without spending another dime. woah. plus hey guess what all you researcher fat-cats with your federal grant streams and your proprietary software licenses: r + monetdb runs one heckuva lot faster than sas. woah^2. dump your sas license water wings and learn how to swim. the scripts below require monetdb. click here for step-by-step instructions of how to install it on windows and click here for speed tests. vroom. since the bsapufs comprise 5% of the medicare population, ya generally need to multiply any counts or sums by twenty. although the individuals represented in these claims are randomly sampled, this data should not be treated like a complex survey sample, meaning that the creation of a survey object is unnecessary.

    most bsapufs generalize to either the total or fee-for-service medicare population, but each file is different so give the documentation a hard stare before that eureka moment. this new github repository contains three scripts:

    - 2008 - download all csv files.R: loop through and download every zip file hosted by cms; unzip the contents of each zipped file to the working directory

    - 2008 - import all csv files into monetdb.R: create the batch (.bat) file needed to initiate the monet database in the future; loop through each csv file in the current working directory and import them into the monet database; create a well-documented block of code to re-initiate the monetdb server in the future

    - 2008 - replicate cms publications.R: initiate the same monetdb server instance, using the same well-documented block of code as above; replicate nine sets of statistics found in data tables provided by cms

    click here to view these three scripts (https://github.com/ajdamico/usgsd/tree/master/Basic%20Stand%20Alone%20Medicare%20Claims%20Public%20Use%20Files)

    for more detail about the basic stand alone medicare claims public use files (bsapufs), visit: the centers for medicare and medicaid's bsapuf homepage; a joint academyhealth webinar given by the organizations that partnered to create these files - cms, impaq, norc

    notes: the replication script has oodles of easily-modified syntax and should be viewed for analysis examples. if you know the name of the data table you want to examine, you can quickly modify these general monetdb analysis examples too. just run sql queries - sas users, that's "proc...

  10. Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic...

    • zenodo.org
    bin, csv, zip
    Updated Dec 24, 2022
    Cite
    Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos (2022). Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials [Dataset]. http://doi.org/10.5281/zenodo.6965147
    Available download formats: bin, zip, csv
    Dataset updated
    Dec 24, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials

    Background

    This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels and one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database and data from the individual experiments is also included.

    The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).

    Usage

    • The data is licensed through the Creative Commons Attribution 4.0 International.
    • If you have used our data and are publishing your work, we ask that you please reference both:
      1. this database through its DOI, and
      2. any publication that is associated with the experiments. See the Overall_Summary and Database_References files for the associated publication references.

    Included Files

    • Overall_Summary_2022-08-25_v1-0-0.csv: summarises the specimen information for all experiments in the database.
    • Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv: summarises the average initial yield stress and average initial elastic modulus per campaign.
    • Unreduced_Data-#_v1-0-0.zip: contain the original (not downsampled) data
      • Where # is one of: 1, 2, 3, 4, 5, 6. The unreduced data is broken into separate archives because of upload limitations to Zenodo. Together they provide all the experimental data.
      • We recommend you un-zip all the folders and place them in one "Unreduced_Data" directory, similar to the "Clean_Data" directory.
      • The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
      • There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the unreduced data.
      • The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
    • Clean_Data_v1-0-0.zip: contains all the downsampled data
      • The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
      • There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the clean data.
      • The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
    • Database_References_v1-0-0.bib
      • Contains a bibtex reference for many of the experiments in the database. Corresponds to the "citekey" entry in the summary files.

    File Format: Downsampled Data

    These are the "LP_

    • The header of the first column is empty: the first column corresponds to the index of the sample point in the original (unreduced) data
    • Time[s]: time in seconds since the start of the test
    • e_true: true strain
    • Sigma_true: true stress in MPa
    • (optional) Temperature[C]: the surface temperature in degC

    These data files can be easily loaded using the pandas library in Python through:

    import pandas
    data = pandas.read_csv(data_file, index_col=0)

    The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.

    File Format: Unreduced Data

    These are the "LP_

    • The first column is the index of each data point
    • S/No: sample number recorded by the DAQ
    • System Date: Date and time of sample
    • Time[s]: time in seconds since the start of the test
    • C_1_Force[kN]: load cell force
    • C_1_Déform1[mm]: extensometer displacement
    • C_1_Déplacement[mm]: cross-head displacement
    • Eng_Stress[MPa]: engineering stress
    • Eng_Strain[]: engineering strain
    • e_true: true strain
    • Sigma_true: true stress in MPa
    • (optional) Temperature[C]: specimen surface temperature in degC

    The data can be loaded and used similarly to the downsampled data.

    File Format: Overall_Summary

    The overall summary file provides data on all the test specimens in the database. The columns include:

    • hidden_index: internal reference ID
    • grade: material grade
    • spec: specifications for the material
    • source: base material for the test specimen
    • id: internal name for the specimen
    • lp: load protocol
    • size: type of specimen (M8, M12, M20)
    • gage_length_mm_: unreduced section length in mm
    • avg_reduced_dia_mm_: average measured diameter for the reduced section in mm
    • avg_fractured_dia_top_mm_: average measured diameter of the top fracture surface in mm
    • avg_fractured_dia_bot_mm_: average measured diameter of the bottom fracture surface in mm
    • fy_n_mpa_: nominal yield stress
    • fu_n_mpa_: nominal ultimate stress
    • t_a_deg_c_: ambient temperature in degC
    • date: date of test
    • investigator: person(s) who conducted the test
    • location: laboratory where test was conducted
    • machine: setup used to conduct test
    • pid_force_k_p, pid_force_t_i, pid_force_t_d: PID parameters for force control
    • pid_disp_k_p, pid_disp_t_i, pid_disp_t_d: PID parameters for displacement control
    • pid_extenso_k_p, pid_extenso_t_i, pid_extenso_t_d: PID parameters for extensometer control
    • citekey: reference corresponding to the Database_References.bib file
    • yield_stress_mpa_: computed yield stress in MPa
    • elastic_modulus_mpa_: computed elastic modulus in MPa
    • fracture_strain: computed average true strain across the fracture surface
    • c,si,mn,p,s,n,cu,mo,ni,cr,v,nb,ti,al,b,zr,sn,ca,h,fe: chemical compositions in units of %mass
    • file: file name of corresponding clean (downsampled) stress-strain data

    File Format: Summarized_Mechanical_Props_Campaign

    Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,

    tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
              index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1],
              keep_default_na=False, na_values='')
    • citekey: reference in "Campaign_References.bib".
    • Grade: material grade.
    • Spec.: specifications (e.g., J2+N).
    • Yield Stress [MPa]: initial yield stress in MPa
      • size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign
    • Elastic Modulus [MPa]: initial elastic modulus in MPa
      • size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

    Caveats

    • The files in the following directories were tested before the protocol was established. Therefore, only the true stress-strain is available for each:
      • A500
      • A992_Gr50
      • BCP325
      • BCR295
      • HYP400
      • S460NL
      • S690QL/25mm
      • S355J2_Plates/S355J2_N_25mm and S355J2_N_50mm
  11. Data from: Solar self-sufficient households as a driving factor for...

    • data.uni-hannover.de
    .zip, r, rdata +2
    Updated Dec 12, 2024
    + more versions
    Cite
    Institut für Kartographie und Geoinformatik (2024). Solar self-sufficient households as a driving factor for sustainability transformation [Dataset]. https://data.uni-hannover.de/sk/dataset/19503682-5752-4352-97f6-511ae31d97df
    Available download formats: r(3406), rdata(408277), r(24773), r(6280), r(21968), rdata(1024592), r(63854), .zip, txt(1431), text/x-sh(183), rdata(426)
    Dataset updated
    Dec 12, 2024
    Dataset authored and provided by
    Institut für Kartographie und Geoinformatik
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    To get the consumption model from Section 3.1, one needs to execute the file consumption_data.R: load the data for the 3 phases (./data/CONSUMPTION/PL1.csv, PL2.csv, PL3.csv), transform the data, and build the model (starting at line 225). The final consumption data can be found, in one file for each year, in ./data/CONSUMPTION/MEGA_CONS_list.Rdata
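
    For example, the prepared consumption data can be loaded directly in R (a minimal sketch; the object names inside the .Rdata file are defined in consumption_data.R):

      # Load the final consumption data and list the objects it contains.
      load("./data/CONSUMPTION/MEGA_CONS_list.Rdata")
      ls()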

    To get the results for the optimization problem, one needs to execute the file analyze_data.R. It provides the functions to compare production and consumption data, and to optimize for the different values (PV, MBC, ...).

    To reproduce the figures one needs to execute the file visualize_results.R. It provides the functions to reproduce the figures.

    To calculate the solar radiation that is needed in the Section Production Data, follow the file calculate_total_radiation.R.

    To reproduce the radiation data from ERA5, which can be found in data.zip, do the following steps:
    1. ERA5 - download the reanalysis datasets as GRIB files. For FDIR select "Total sky direct solar radiation at surface", for GHI select "Surface solar radiation downwards", and for ALBEDO select "Forecast albedo".
    2. Convert the GRIB files to csv with the file era5toGRID.sh.
    3. Convert the csv file to the data that is used in this paper with the file convert_year_to_grid.R.

  12. R-LOADEST files to produce results in the Heart River Basin, North Dakota,...

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). R-LOADEST files to produce results in the Heart River Basin, North Dakota, 1970-2020 [Dataset]. https://catalog.data.gov/dataset/r-loadest-files-to-produce-results-in-the-heart-river-basin-north-dakota-1970-2020
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    North Dakota, Heart River
    Description

    This child page contains a zipped folder which contains all of the items necessary to run load estimation using R-LOADEST to produce results that are published in U.S. Geological Survey Scientific Investigations Report 2021-XXXX [Tatge, W.S., Nustad, R.A., and Galloway, J.M., 2021, Evaluation of Salinity and Nutrient Conditions in the Heart River Basin, North Dakota, 1970-2020: U.S. Geological Survey Scientific Investigations Report 2021-XXXX, XX p]. The folder contains an allsiteinfo.table.csv file, a "datain" folder, and a "scripts" folder. The allsiteinfo.table.csv file can be used to cross-reference the sites with the main report (Tatge and others, 2021).

    The "datain" folder contains all the input data necessary to reproduce the load estimation results. The naming convention in the "datain" folder is site_MI_rloadest or site_NUT_rloadest for either the major ion loads or the nutrient loads. The .Rdata files are used in the scripts to run the estimations and the .csv files can be used to look at the data. The "scripts" folder contains the written R scripts to produce the results of the load estimation from the main report.

    R-LOADEST is a software package for analyzing loads in streams, and an accompanying report (Runkel and others, 2004) serves as the formal documentation for R-LOADEST. The package is a collection of functions written in R (R Development Core Team, 2019), an open source language and a general environment for statistical computing and graphics. The following system requirements are necessary for producing results:

    - Windows 10 operating system
    - R (version 3.4 or later; 64-bit recommended)
    - RStudio (version 1.1.456 or later)
    - R-LOADEST program (available at https://github.com/USGS-R/rloadest)

    Runkel, R.L., Crawford, C.G., and Cohn, T.A., 2004, Load Estimator (LOADEST): A FORTRAN Program for Estimating Constituent Loads in Streams and Rivers: U.S. Geological Survey Techniques and Methods Book 4, Chapter A5, 69 p. [Also available at https://pubs.usgs.gov/tm/2005/tm4A5/pdf/508final.pdf.]

    R Development Core Team, 2019, R—A language and environment for statistical computing: Vienna, Austria, R Foundation for Statistical Computing, accessed December 7, 2020, at https://www.r-project.org.
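
    As a sketch of what a single-site estimation looks like with R-LOADEST (the file, object, and column names below are placeholders; the exact calls used for the report are in the "scripts" folder):

      # Fit one of the predefined LOADEST regression models to a site's calibration data.
      library(rloadest)
      load("datain/site_NUT_rloadest.Rdata")          # hypothetical file from the "datain" folder
      fit <- loadReg(Nitrate ~ model(7),              # model(7) is one of the predefined LOADEST models
                     data = calib_data,               # placeholder name for the calibration data frame
                     flow = "FLOW", dates = "DATES",
                     conc.units = "mg/L", load.units = "kg",
                     station = "Heart River site")
      print(fit)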

  13. Data from: The Tropical Andes Biodiversity Hotspot: A Comprehensive Dataset...

    • knb.ecoinformatics.org
    • dataone.org
    • +4more
    Updated May 30, 2024
    + more versions
    Cite
    Pablo Jarrín-V.; Mario H Yánez-Muñoz (2024). The Tropical Andes Biodiversity Hotspot: A Comprehensive Dataset for the Mira-Mataje Binational Basins [Dataset]. http://doi.org/10.5063/F14F1P6H
    Dataset updated
    May 30, 2024
    Dataset provided by
    Knowledge Network for Biocomplexity
    Authors
    Pablo Jarrín-V.; Mario H Yánez-Muñoz
    Time period covered
    Jun 11, 2022 - Jun 11, 2023
    Area covered
    Description

    We present a flora and fauna dataset for the Mira-Mataje binational basins. This is an area shared between southwestern Colombia and northwestern Ecuador, where both the Chocó and Tropical Andes biodiversity hotspots converge. Information from 120 sources was systematized in the Darwin Core Archive (DwC-A) standard and geospatial vector data format for geographic information systems (GIS) (shapefiles). Sources included natural history museums, published literature, and citizen science repositories across 18 countries. The resulting database has 33,460 records from 5,281 species, of which 1,083 are endemic and 680 threatened. The diversity represented in the dataset is equivalent to 10% of the total plant species and 26% of the total terrestrial vertebrate species in the hotspots. It corresponds to 0.07% of their total area. The dataset can be used to estimate and compare biodiversity patterns with environmental parameters and provide value to ecosystems, ecoregions, and protected areas. The dataset is a baseline for future assessments of biodiversity in the face of environmental degradation, climate change, and accelerated extinction processes. The data has been formally presented in the manuscript entitled "The Tropical Andes Biodiversity Hotspot: A Comprehensive Dataset for the Mira-Mataje Binational Basins" in the journal "Scientific Data". To maintain DOI integrity, this version will not change after publication of the manuscript and therefore we cannot provide further references on volume, issue, and DOI of manuscript publication.

    - Data format 1: The .rds file extension saves a single object to be read in R and provides better compression, serialization, and integration within the R environment than simple .csv files. The description of file names is in the original manuscript.
      -- m_m_flora_2021_voucher_ecuador.rds
      -- m_m_flora_2021_observation_ecuador.rds
      -- m_m_flora_2021_total_ecuador.rds
      -- m_m_fauna_2021_ecuador.rds
    - Data format 2: The .csv files have been encoded in UTF-8 and are ASCII files with text separated by commas. The description of file names is in the original manuscript.
      -- m_m_flora_fauna_2021_all.zip. This file includes all biodiversity datasets.
      -- m_m_flora_2021_voucher_ecuador.csv
      -- m_m_flora_2021_observation_ecuador.csv
      -- m_m_flora_2021_total_ecuador.csv
      -- m_m_fauna_2021_ecuador.csv
    - Data format 3: We consolidated a shapefile for the basin containing layers for vegetation ecosystems and the total number of occurrences, species, and endemic and threatened species for each ecosystem.
      -- biodiversity_measures_mira_mataje.zip. This file includes the .shp file and accessory geomatic files.
    - A set of 3D shaded-relief map representations of the data in the shapefile can be found at https://doi.org/10.6084/m9.figshare.23499180.v4

    Three taxonomic data tables were used in our technical validation of the presented dataset. These three files are:
    1) the_catalog_of_life.tsv (Source: Bánki, O. et al. Catalogue of life checklist (version 2024-03-26). https://doi.org/10.48580/dfz8d (2024))
    2) world_checklist_of_vascular_plants_names.csv (we are also including the ancillary tables "world_checklist_of_vascular_plants_distribution.csv" and "README_world_checklist_of_vascular_plants_.xlsx") (Source: Govaerts, R., Lughadha, E. N., Black, N., Turner, R. & Paton, A. The World Checklist of Vascular Plants is a continuously updated resource for exploring global plant diversity. Sci. Data 8, 215, 10.1038/s41597-021-00997-6 (2021).)
    3) world_flora_online.csv (Source: The World Flora Online Consortium et al. World flora online plant list December 2023, 10.5281/zenodo.10425161 (2023).)
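
    For example, the .rds tables can be read directly in R (a minimal sketch; the file must first be downloaded into the working directory):

      # Read the fauna table and inspect its Darwin Core fields.
      fauna <- readRDS("m_m_fauna_2021_ecuador.rds")
      str(fauna)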

  14. data and code for "Beyond Human Gold Standards: A Multi-Model Framework for...

    • zenodo.org
    zip
    Updated Jul 7, 2025
    Cite
    Denis Mongin; Denis Mongin (2025). data and code for "Beyond Human Gold Standards: A Multi-Model Framework for Automated Abstract Classification and Information Extraction" article [Dataset]. http://doi.org/10.5281/zenodo.15829040
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Denis Mongin; Denis Mongin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    This is the public repository for the article "Beyond Human Gold Standards: A Multi-Model Framework for Automated Abstract Classification and Information Extraction" by Delphine S. Courvoisier, Diana Buitrago Garcia, Nils Burgisser, Clément P. Buclin, Michele Iudici, and Denis Mongin.
    The up-to-date repository can be found here: https://gitlab.unige.ch/trial_integrity/llm_majority_public

    The structure of the repository is as follows:



    - The folder LLM_inference contains the LLM inferences for the two tasks performed on the abstracts listed in the abstract csv file, by the LLMs described in the model_list.csv file. The two tasks are the classification of the intervention (folder abstract_classification) and the extraction of the number of participants (folder participant_numbers). The initial list contained 1080 abstracts, some of which were not considered in the final analysis because they were protocols and not randomized.
    - Both task folders contain the Python script used for the inference (using the prompts in the prompt folder) and the two bash scripts used to run it on the university HPC.
    - All inference results are in the results folder, which contains the log files and one csv file per model.
    - The file gold.csv contains, for the final list of 1020 abstracts, the tasks performed by each reviewer, the human gold standard, and the platine standard, with a 0/1 variable platine_check indicating which gold results were re-checked.
    - The folder R_analysis contains the R files to perform the analysis and produce the tables and figures:
    - the file analysis.R contains the code to read the LLM inference results and calculate the accuracy for the different model combinations (a hedged sketch of this kind of comparison follows the list). It outputs a file in the results folder.
    - the file figure_tables.R contains the R code that uses the output of analysis.R to produce the tables and figures of the article. The figures and tables are created in the figures_tables folder. The file trial_publication_info.csv contains information about the RCTs used for this analysis, coming from the data of the study https://doi.org/10.1016/j.jclinepi.2024.111586.
    - the file help_func.R contains the functions used to format the table results, and is loaded in figure_tables.R.
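
    A hedged sketch (not the repository's analysis.R) of how one model's classifications could be scored against the human gold standard; the file paths and the column names abstract_id and label are illustrative assumptions, since the real files may be organised differently:

      # Compare one model's output with gold.csv and compute a simple accuracy.
      gold  <- read.csv("gold.csv")
      model <- read.csv("results/some_model.csv")   # one csv per model, per the description

      merged   <- merge(gold, model, by = "abstract_id", suffixes = c("_gold", "_model"))
      accuracy <- mean(merged$label_gold == merged$label_model, na.rm = TRUE)
      accuracy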

  15. openENTRANCE - Case Study 1 - Residential Demand Response - Data and Scripts...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 27, 2023
    Cite
    O'Reilly, Ryan (2023). openENTRANCE - Case Study 1 - Residential Demand Response - Data and Scripts [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7182593
    Explore at:
    Dataset updated
    Apr 27, 2023
    Dataset provided by
    Reichl, Johannes
    Cohen, Jed
    O'Reilly, Ryan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data files and Python and R scripts are provided for Case Study 1 of the openENTRANCE project. The data covers 10 residential devices at the NUTS2 level for the EU27 + UK + TR + NO + CH from 2020-2050. The devices included are full battery electric vehicles (EV), storage heaters (SH), water heaters with storage capabilities (WH), air conditioning (AC), heat circulation pumps (CP), air-to-air heat pumps (HP), refrigeration (refrigerators (RF) and freezers (FR)), dish washers (DW), washing machines (WM), and tumble driers (TD). The study uses representative hours to describe load expectations and constraints for each residential device: hourly granularity from 2020 to 2050 for a representative day in each month (i.e. 24 hours for an average day in each month).

    The aggregated final results are in Full_potential.V9.csv and acheivable_NUTS2_summary.csv. The file metaData.Full_Potential.csv is provided to guide users on the nomenclature in Full_potential.V9.csv and the disaggregated data sets. The disaggregated loads can be found in d_ACV8.csv, d_CPV6.csv, d_DWV6.csv, d_EVV7.csv, d_FRV5.csv, d_HPV4.csv, d_RFV5.csv, d_SHV7.csv, d_TDV6.csv, d_WHV7.csv, and d_WMV6.csv, while the disaggregated maximum capacities are in p_ACV8.csv, p_CPV6.csv, p_DWV6.csv, p_EVV7.csv, p_FRV5.csv, p_HPV4.csv, p_RFV5.csv, p_SHV7.csv, p_TDV6.csv, p_WHV7.csv, and p_WMV6.csv.

    Full_potential.V9.csv shows the NUTS2 level unadjusted loads for the residential devices using representative hours from 2020-2050. The loads provided here have not been adjusted with the direct load participation rates (see paper for more details). More details on the dataset can be found in the metaData.Full_Potential.csv file.

    The acheivable_NUTS2_summary.csv shows the NUTS2 level achievable direct load control potentials for the average hour in the respective year (years 2020, 2022, 2030, 2040, and 2050). These summaries have already adjusted the disaggregated loads with the direct load participation rates from participation_rates_country.csv.

    A detailed overview of the data files is provided below. Where possible, a brief description, the input data, and the script used to generate the data are provided. If questions arise, first refer to the publication. If something still needs clarification, send an email to ryano18@vt.edu.

    Description of data provided

    Achievable_NUTS2_summary.csv

    Description

    Average hourly achievable direct load potentials for each NUTS2 region and device for 2020, 2022, 2030, 2040, and 2050

    Data input

    Full_potential.V9.csv

    participation_rates_country.csv

    P_inc_SH.csv

    P_inc_WH.csv

    P_inc_HP.csv

    P_inc_DW.csv

    P_inc_WM.csv

    P_inc_TD.csv

    Script

    NUTS2_acheivable.R

    COP_.1deg_11-21_V1.csv

    Description

    NUTS2 average coefficient of performance estimates from 2011-2021 daily temperature

    Data

    tg_ens_mean_0.1deg_reg_2011-2021_v24.0e.nc

    NUTS_RG_01M_2021_3857.shp

    nhhV2.csv

    Script

    COP_from_E-OBS.R

    Country dd projections.csv

    Description

    Assumptions for annual change in CDD and HDD

    Spinoni, J., Vogt, J. V., Barbosa, P., Dosio, A., McCormick, N., Bigano, A., & Füssel, H. M. (2018). Changes of heating and cooling degree‐days in Europe from 1981 to 2100. International Journal of Climatology, 38, e191-e208.

    Expectations for future HDD and CDD used the long-run averages and country level expected changes in the rcp45 scenario

    EV NUTS projectionsV5.csv

    Description

    NUTS2 level EV projections 2018-2050

    Data input

    EV projectionsV5_ave.csv

    Country level EV projections

    NUTS 2 regional share of national vehicle fleet

    Eurostat - Vehicle Nuts.xlsx

    Script

    EVprojections_NUTS_V5.py

    EV_NVF_EV_path.xlsx

    Description

    Country level – EV share of new passenger vehicle fleet

    From: Mathieu, L., & Poliscanova, J. (2020). Mission (almost) accomplished: Carmakers' race to meet the 2020/21 CO2 targets and the EU electric cars market. Transport & Environment.

    EV_parameters.xlsx

    Description

    Parameters used to calculate future loads from EVs

    Wunit_EV – represents annual kWh per EV

    evLIFE_150kkm (number of years) – represents usable life if the EV only lasted 150 thousand km, i.e. 150,000 / average km traveled per year with respect to the country (this variable is dropped and not used for estimation)

    Average age/#years assuming 150k life (number of years) – represents the average between evLIFE_150kkm and the average age of vehicle with respect to the country

    full_potentialV9.csv

    Description

    Final data that shows hourly demand (Maximum Reduction) and Maximum Dispatch for each device, region, and year.

    This data has not been adjusted with participation_rates_country.csv

    Maximum dispatch is equal to maximum capacity minus hourly demand with respect to the device, region, year, and hour (a hedged sketch of this calculation follows this file block).

    Script

    Full_potentialV9.py
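
    A hedged illustration of that relationship (not the authors' Full_potentialV9.py; the shared row layout and the column name "value" are assumptions):

      # Sketch: maximum dispatch = maximum capacity - hourly demand, shown for
      # air conditioning. Assumes d_ACV8.csv and p_ACV8.csv are row-aligned by
      # region, year, and hour, with the quantity in a column called "value".
      d_ac <- read.csv("d_ACV8.csv")   # disaggregated hourly load
      p_ac <- read.csv("p_ACV8.csv")   # maximum capacity

      max_dispatch_ac <- p_ac$value - d_ac$value
      summary(max_dispatch_ac)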

    gils projection assumptions.xlsx

    Description

    Data from: Gils, H. C. (2015). Balancing of intermittent renewable power generation by demand response and thermal energy storage.

    A linear extrapolation was used to determine values for every year and country 2020-2050. AC – Air Conditioning, SH – Storage Heater, WH – Water heater with storage capability, CP – heat circulation pump, TD – Tumble Drier, WM – Washing Machine, DW – Dish Washer, FR – Freezer, RF – Refrigerator. The results are in the files shown below.

    nflh – full load hours

    nflh_ac.csv

    nflh_cp.csv

    wunit – annual energy consumption

    Wunit_rf_fr.csv

    Pcycle – power demand per cycle

    Pcycle_wm.csv

    Pcycle_dw.csv

    Pcycle_td.csv

    Punit – power demand for device

    Punit_ac.csv

    Punit_cp.csv

    r – country level household ownership rates of residential devices

    rfr.csv

    rrf.csv

    rwm.csv

    rtd.csv

    rdw.csv

    rac.csv

    rwh.csv

    rcp.csv

    rsh.csv

    Script

    openENTRANCE projections.py

    heat_pump_hourly_share.csv

    Description

    Hourly share of daily energy demand

    From ENTSOs TYNDP – Charts and Figures

    https://2020.entsos-tyndp-scenarios.eu/download-data/#download

    hourlyEVshares.csv

    Description

    Hourly share of daily energy demand

    From My Electric Avenue Study

    https://eatechnology.com/consultancy-insights/my-electric-avenue/

    HP_transitionV2.csv

    Description

    Used to create Qhp_thermal_MWh_projectedV2.csv

    Final_energy_15-19

    Average final energy demand for the residential heating sector between 2015-2019

    Final_energy_15-19_nonEE

    Average final energy demand for the residential heating sector for energy sources that are not energy efficient between 2015-2019 (see paper for sources)

    Final_energy_15-19_nonEE_share

    share of inefficient heating sources

    HP_thermal_2018

    Thermal energy provided by residential heat pumps in 2018

    HP_thermal_2019

    Thermal energy provided by residential heat pumps in 2019

    See publication for data sources

    Nflh_ac.csv, nflh_cp.csv

    See gils projection assumptions.xlsx

    nhhV2

    Description

    Expected number of households for NUTS2 regions for 2020-2050

    See publication for data sources

    Script

    EUROSTAT_POP2NUTSV2.R

    NUTS0_thermal_heat_annum.csv

    Description

    Country level residential annual thermal heat requirements in kWh

    Used to determine maximum dispatch in openENTRANCE final V14.py

    Mantzos, L., Wiesenthal, T., Matei, N. A., Tchung-Ming, S., Rozsai, M., Russ, P., & Ramirez, A. S. (2017). JRC-IDEES: Integrated Database of the European Energy Sector: Methodological Note (No. JRC108244). Joint Research Centre (Seville site).

    p_ACV8.csv, p_CPV6.csv, p_DWV6.csv, p_EVV7.csv, p_FRV5.csv, p_HPV4.csv, p_RFV5.csv, p_SHV7.csv, p_TDV6.csv, p_WHV7.csv, p_WMV6.csv

    Description

    Maximum capacity – load for a device can never exceed maximum capacity

    Data

    gils projection assumptions.xlsx

    Script

    openENTRANCE final V14.py

    P_inc_DW.csv, P_inc_HP.csv, P_inc_SH.csv, P_inc_TD.csv, P_inc_WH.csv, P_inc_WM.csv, SAMPLE_PINC.csv

    Description

    Unadjusted average hourly potential for increase by NUTS2 region for 2018-2050

    Data

    d_ACV8.csv, d_CPV6.csv, d_DWV6.csv, d_EVV7.csv, d_FRV5.csv, d_HPV4.csv, d_RFV5.csv, d_SHV7.csv, d_TDV6.csv, d_WHV7.csv, d_WMV6.csv

    Theoretical maximum reduction / load of the respective device

    p_ACV8.csv, p_CPV6.csv, p_DWV6.csv, p_EVV7.csv, p_FRV5.csv, p_HPV4.csv, p_RFV5.csv, p_SHV7.csv, p_TDV6.csv, p_WHV7.csv, p_WMV6.csv

    Maximum capacity

    Script

    P_increaseV2.py

    Pcycle_dw.csv, Pcycle_td.csv, Pcycle_wm.csv

    Description

    Power demand per cycle (kWh)

    See gils projection assumptions.xlsx

    Punit_ac.csv, Punit_cp.csv

    Description

    Unit capacities (kWh)

    See gils projection assumptions.xlsx

    Qhp_thermal_MWh_projectedV2.csv

    Description

    NUTS2 expectations for thermal energy demand met by heat pumps for 2022-2050

    Assumes a linear decomposition of non-renewable and non-energy efficient heating sources until 2050

    Data

    HP_transitionV2.csv

    nhhV2.csv

    Script

    HP_projection_nuts.py

    rac.csv, rcp.csv, rdw.csv, rfr.csv, rrf.csv, rsh.csv, rtd.csv, rwh.csv, rwm.csv

    Description

    Household ownership rates

    See gils projection assumptions.xlsx

  16. Data and scripts associated with a manuscript investigating impacts of solid...

    • search.dataone.org
    • osti.gov
    Updated Aug 21, 2023
    Cite
    Alan Roebuck; Brieanne Forbes; Vanessa A. Garayburu-Caruso; Samantha Grieger; Khadijah Homolka; James C. Stegen; Allison Myers-Pigg (2023). Data and scripts associated with a manuscript investigating impacts of solid phase extraction on freshwater organic matter optical signatures and mass spectrometry pairing [Dataset]. http://doi.org/10.15485/1995543
    Explore at:
    Dataset updated
    Aug 21, 2023
    Dataset provided by
    ESS-DIVE
    Authors
    Alan Roebuck; Brieanne Forbes; Vanessa A. Garayburu-Caruso; Samantha Grieger; Khadijah Homolka; James C. Stegen; Allison Myers-Pigg
    Time period covered
    Aug 30, 2021 - Sep 15, 2021
    Area covered
    Description

    This data package is associated with the publication “Investigating the impacts of solid phase extraction on dissolved organic matter optical signatures and the pairing with high-resolution mass spectrometry data in a freshwater system” submitted to “Limnology and Oceanography: Methods.” This data is an extension of the River Corridor and Watershed Biogeochemistry SFA’s Spatial Study 2021 (https://doi.org/10.15485/1898914); other associated data and field metadata can be found at that link. The goal of the manuscript is to assess the impact of solid phase extraction (SPE) on the ability to pair ultra-high resolution mass spectrometry data collected from SPE extracts with optical properties collected on ambient stream samples. Forty-seven samples collected within the Yakima River Basin, Washington were analyzed for dissolved organic carbon (DOC, measured as non-purgeable organic carbon, NPOC), absorbance, and fluorescence. Samples were subsequently concentrated with SPE and reanalyzed for each measurement. The extraction efficiencies for DOC and common optical indices were calculated. In addition, SPE samples were subjected to ultra-high resolution mass spectrometry and compared with the ambient and SPE-generated optical data. Finally, in addition to this cross-platform inter-comparison, we performed an intra-comparison among the high-resolution mass spectrometry data to determine the impact of sample preparation on the interpretability of results. Here, the SPE samples were prepared at 40 milligrams per liter (mg/L) based on the known DOC extraction efficiency of the samples (ranging from ~30 to ~75%), compared to the common practice of assuming a DOC extraction efficiency of 60% for freshwater samples.
    This data package consists of one main data folder with one subfolder (Data_Input). The main data folder contains (1) a readme; (2) a data dictionary (dd); (3) file-level metadata (flmd); (4) the final data summary output from the processing script; and (5) the processing script. The R-markdown processing script (SPE_Manuscript_Rmarkdown_Data_Package.rmd) contains all code needed to reproduce manuscript statistics and figures (with the exception noted below). The Data_Input folder has two subfolders: (1) FTICR and (2) Optics. Additionally, the Data_Input folder contains dissolved organic carbon (DOC, measured as NPOC) data (SPS_NPOC_Summary.csv) and relevant supporting solid phase extraction volume information (SPS_SPE_Volumes.csv). Methods information for the optical and FTICR data is embedded in the header rows of SPS_EEMs_Methods.csv and SPS_FTICR_Methods.csv, respectively. In addition, the data dictionary (SPS_SPE_dd.csv), file-level metadata (SPS_SPE_flmd.csv), and methods codes (SPS_SPE_Methods_codes.csv) are provided.
    The FTICR subfolder contains all raw FTICR data as well as instructions for processing. In addition, post-processed FTICR molecular information (Processed_FTICRMS_Mol.csv) and sample data (Processed_FTICRMS_Data.csv) are provided and can be read directly into R with the associated R-markdown file. The Optics subfolder contains all absorbance and fluorescence spectra; fluorescence spectra have been blank corrected, inner-filter corrected, and undergone scatter removal. This folder also contains the Matlab code used to make a portion of Figure 1 within the manuscript, derive various spectral parameters used within the manuscript, and perform parallel factor analysis (PARAFAC) modeling. Spectral indices (SPS_SpectralIndices.csv) and PARAFAC outputs (SPS_PARAFAC_Model_Loadings.csv and SPS_PARAFAC_Sample_Scores.csv) are read directly into the associated R-markdown file.
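
    A hedged sketch of the extraction-efficiency calculation described above (not the packaged processing script; the column names sample_id, npoc_ambient_mgL, npoc_spe_mgL, sample_volume_L, and extract_volume_L are assumptions):

      # DOC extraction efficiency: fraction of ambient DOC recovered after SPE,
      # after normalising the SPE extract back to the original sample volume.
      npoc    <- read.csv("Data_Input/SPS_NPOC_Summary.csv")
      volumes <- read.csv("Data_Input/SPS_SPE_Volumes.csv")

      spe <- merge(npoc, volumes, by = "sample_id")
      spe$extraction_efficiency_pct <-
        100 * (spe$npoc_spe_mgL * spe$extract_volume_L) /
              (spe$npoc_ambient_mgL * spe$sample_volume_L)
      summary(spe$extraction_efficiency_pct)   # reported range is roughly 30-75%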

  17. Data from: Self-Attribution Of Distorted Reaching Movements In Immersive...

    • explore.openaire.eu
    Updated Sep 5, 2018
    Cite
    Henrique Galvan Debarba; Ronan Boulic; Roy Salomon; Olaf Blanke; Bruno Herbelin (2018). Self-Attribution Of Distorted Reaching Movements In Immersive Virtual Reality Dataset [Dataset]. http://doi.org/10.5281/zenodo.1409823
    Explore at:
    Dataset updated
    Sep 5, 2018
    Authors
    Henrique Galvan Debarba; Ronan Boulic; Roy Salomon; Olaf Blanke; Bruno Herbelin
    Description

    This dataset accompanies the paper “Self-Attribution of Distorted Reaching Movements in Immersive Virtual Reality” published in the Computers & Graphics journal from Elsevier. It contains 3 datasets related to the experiments described in the paper. All datasets are in “.csv” format and can be easily loaded by statistical analysis tools (e.g. a dataset can be loaded in R using the command read.csv("filename.csv")). It also contains the C# Unity implementation of the distortion function presented in the paper. Paper reference: Galvan Debarba H, Boulic R, Salomon R, Blanke O, Herbelin B. Self-Attribution of Distorted Reaching Movements in Immersive Virtual Reality. Computers & Graphics. 2018; ISSN 0097-8493. Elsevier. DOI: https://doi.org/10.1016/j.cag.2018.09.001

  18. Syria town database

    • dataverse.harvard.edu
    Updated Nov 22, 2018
    Cite
    Kheder Khaddour; Kevin Mazur (2018). Syria town database [Dataset]. http://doi.org/10.7910/DVN/YQQ07L
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 22, 2018
    Dataset provided by
    Harvard Dataverse
    Authors
    Kheder Khaddour; Kevin Mazur
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Syria
    Description

    The purpose of this dataset is to provide a detailed picture of the characteristics of Syrian towns in the years preceding the 2011 Syrian uprising and ensuing civil war. It incorporates the 2004 national census, the last before the uprising, and a newly collected set of data on ethnic identity. The level of analysis is the town (the Syrian Census Bureau’s fourth administrative level).

    TECHNICAL NOTE: The .csv files in this data package contain both Arabic and English, so they are encoded in UTF-8. The Arabic script should render if the files are opened directly in Open Office, Numbers, Google Drive, or R statistical software. To read the Arabic in Excel, you can open the .csv file in any of these applications and save it as an .xlsx file, or open it through Excel using the following steps: (1) open a blank Excel document; (2) import the data using “Data -> Get External Data -> Import text file”; (3) select “File Origin: Unicode (UTF-8)”; (4) select “Delimiters: comma”; (5) select the top left cell to place the data. See the following post for further details: https://stackoverflow.com/questions/6002256/is-it-possible-to-force-excel-recognize-utf-8-csv-files-automatically
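
    For R users, a minimal sketch for loading one of these UTF-8 files so the Arabic text renders correctly (the file name is a placeholder, not one of the package's actual file names):

      # Read a UTF-8 encoded .csv containing Arabic and English text.
      towns <- read.csv("syria_towns.csv",
                        fileEncoding = "UTF-8", stringsAsFactors = FALSE)
      head(towns)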

  19. Large Datasets in R - Plant Phenology & Temperature Data from NEON

    • qubeshub.org
    Updated May 10, 2018
    Cite
    Megan Jones Patterson; Lee Stanish; Natalie Robinson; Katherine Jones; Cody Flagg (2018). Large Datasets in R - Plant Phenology & Temperature Data from NEON [Dataset]. http://doi.org/10.25334/Q4DQ3F
    Explore at:
    Dataset updated
    May 10, 2018
    Dataset provided by
    QUBES
    Authors
    Megan Jones Patterson; Lee Stanish; Natalie Robinson; Katherine Jones; Cody Flagg
    Description

    This module series covers how to import, manipulate, format, and plot time series data stored in .csv format in R. Originally designed to teach researchers to use NEON plant phenology and air temperature data, it has also been used in undergraduate classrooms.
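
    The kind of workflow the modules walk through can be sketched as follows (the file and column names are placeholders, not the actual NEON data layout):

      # Import a time-series .csv, convert the date column, and plot it.
      dat <- read.csv("phenology_temp.csv", stringsAsFactors = FALSE)
      dat$date <- as.Date(dat$date)                 # character -> Date

      plot(dat$date, dat$mean_temperature, type = "l",
           xlab = "Date", ylab = "Mean daily air temperature")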

  20. Data and code from: Cover crop and crop rotation effects on...

    • catalog.data.gov
    Updated Apr 21, 2025
    + more versions
    Cite
    Agricultural Research Service (2025). Data and code from: Cover crop and crop rotation effects on tissue and soil population dynamics of Macrophomina phaseolina and yield in no-till system - V2 [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-cover-crop-and-crop-rotation-effects-on-tissue-and-soil-population-dyna-831b9
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    [Note 2023-08-14 - Supersedes version 1, https://doi.org/10.15482/USDA.ADC/1528086] This dataset contains all code and data necessary to reproduce the analyses in the manuscript: Mengistu, A., Read, Q. D., Sykes, V. R., Kelly, H. M., Kharel, T., & Bellaloui, N. (2023). Cover crop and crop rotation effects on tissue and soil population dynamics of Macrophomina phaseolina and yield under no-till system. Plant Disease. https://doi.org/10.1094/pdis-03-23-0443-re

    The .zip archive cropping-systems-1.0.zip contains the data and code files.

    Data
    - stem_soil_CFU_by_plant.csv: soil disease load (SoilCFUg) and stem tissue disease load (StemCFUg) for individual plants in CFU per gram, with columns indicating year, plot ID, replicate, row, plant ID, previous crop treatment, cover crop treatment, and comments. Missing data are indicated with ".".
    - yield_CFU_by_plot.csv: yield data (YldKgHa) at the plot level in units of kg/ha, with columns indicating year, plot ID, replicate, and treatments, as well as means of soil and stem disease load at the plot level.

    Code
    - cropping_system_analysis_v3.0.Rmd: RMarkdown notebook with all data processing, analysis, and visualization code
    - equations.Rmd: RMarkdown notebook with formatted equations
    - formatted_figs_revision.R: R script to produce figures formatted exactly as they appear in the manuscript

    The Rproject file cropping-systems.Rproj is used to organize the RStudio project. Scripts and notebooks used in older versions of the analysis are found in the testing/ subdirectory. Excel spreadsheets containing the raw data from which the cleaned CSV files were created are found in the raw_data subdirectory.
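
    A minimal sketch (not the packaged RMarkdown notebook) for reading the two data files in R, treating "." as missing; the paths assume cropping-systems-1.0.zip has been extracted into the working directory:

      # Load plant-level disease data and plot-level yield data.
      plants <- read.csv("stem_soil_CFU_by_plant.csv", na.strings = ".")
      plots  <- read.csv("yield_CFU_by_plot.csv", na.strings = ".")

      summary(plants$SoilCFUg)   # soil disease load, CFU per gram
      summary(plants$StemCFUg)   # stem tissue disease load, CFU per gram
      summary(plots$YldKgHa)     # yield, kg per hectare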

Cite
Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788

Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem

Explore at:
bin, application/gzip, zip, text/x-pythonAvailable download formats
Dataset updated
Aug 2, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb
License

https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.htmlhttps://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

Description
Replication pack, FSE2018 submission #164:
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. 
Link to the code will be included in the Camera Ready version as well.


Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
 described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
 This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
 statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
 themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data 
  (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
  `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
  **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on detalization level (see Step 2 for more details):
- up to 2Tb of disk space (see Step 2 detalization levels)
- at least 16Gb of RAM (64 preferable)
- few hours to few month of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

   git clone https://gitlab.com/user2589/ghd.git
   git checkout 0.1.0
 
 `cd` into the extracted folder. 
 All commands below assume it as a current directory.
  
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
- install docker. For Ubuntu Linux, the command is 
  `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
 Without this dependency, you might get an error on the next step, 
 but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`.
- disable all APIs except GitHub (Bitbucket and Gitlab support were
 not yet implemented when this study was in progress): edit
 `scraper/init.py`, comment out everything except GitHub support
 in `PROVIDERS`.

Step 2 - obtaining the dataset
-----------------------------

The ultimate goal of this step is to get the output of the Python function 
`common.utils.survival_data()` and save it into a CSV file:

  # copy and paste into a Python console
  from common import utils
  survival_data = utils.survival_data('pypi', '2008', smoothing=6)
  survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up
the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table.
The whole process will take 15..30 minutes.

- create a folder `