36 datasets found
  1. Petre_Slide_CategoricalScatterplotFigShare.pptx

    • figshare.com
    pptx
    Updated Sep 19, 2016
    Cite
    Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
    Available download formats: pptx
    Dataset updated
    Sep 19, 2016
    Dataset provided by
    figshare
    Authors
    Benj Petre; Aurore Coince; Sophien Kamoun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Categorical scatterplots with R for biologists: a step-by-step guide

    Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

    1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

    Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

    Protocol

    • Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.

    • Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from Step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers. (A minimal sketch of such a script appears after Step 3.)

    • Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
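
    The script itself is provided on the PowerPoint slide. As a rough, hedged sketch only of what such a script might look like (the column names Condition, Value, and Replicate come from Step 1, and the object name graph matches Note 2; the rest, including the install line, is assumed rather than taken from the authors' code):

    # install.packages('ggplot2')   # one-off installation; see Note 1 for the GUI route
    library(ggplot2)

    # Import the .csv file prepared in Step 1 (a dialog box opens to pick the file)
    data <- read.csv(file.choose(), header = TRUE)

    # Build the categorical scatterplot: one category per Condition,
    # jittered dots coloured by Replicate, with a boxplot superimposed
    graph <- ggplot(data, aes(x = Condition, y = Value))

    # 7 Display the graph in a separate window. Dot colors indicate replicates
    graph + geom_boxplot(outlier.colour = 'black', colour = 'black') +
      geom_jitter(aes(col = Replicate)) + theme_bw()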

    Notes

    • Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

    • Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

    # 7 Display the graph in a separate window. Dot colors indicate replicates
    graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

    References

    Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

    Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

    Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

    https://cran.r-project.org/

    http://ggplot2.org/

  2. Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Dec 7, 2023
    Cite
    Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
    Available download formats: zip
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    HIV Prevention Trials Network
    HIV Vaccine Trials Network (http://www.hvtn.org/)
    National Institute of Allergy and Infectious Diseases (http://www.niaid.nih.gov/)
    PEPFAR
    Authors
    Dylan Westfall; Mullins James
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing, which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR, and the use of UMIs allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.

    Methods

    This serves as an overview of the analysis performed on the PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al., "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies".

    Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005.

    For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.

    The demultiplexed read collections from the chunked_demux pipeline, or CCS read files from the datasets which were not indexed (M1567, M004, M005), were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.

    The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside, which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.

    Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the Python script compare_seqs.py, which identifies any matched sequences that are different between the sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.

    To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed, and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.

    Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined, and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
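
    As described above, Identifying_Recombinant_Reads.Rmd combines the per-sample dUMI_ranked.csv tables into a single dUMI_df.csv. A minimal sketch of that kind of step (the file names come from the description; the directory layout and the sample-ID logic are assumptions, not the authors' code) might look like:

    library(dplyr)

    # Locate every per-sample dUMI_ranked.csv extracted from the tagged.tar.gz archives
    files <- list.files("Pipeline_Outputs", pattern = "dUMI_ranked\\.csv$",
                        recursive = TRUE, full.names = TRUE)

    # Read each table, record which sample it came from, and stack them
    dUMI_df <- bind_rows(lapply(files, function(f) {
      tbl <- read.csv(f, stringsAsFactors = FALSE)
      tbl$sample <- basename(dirname(f))   # assumed: parent folder names the sample
      tbl
    }))

    # Single dataset with UMI information from all samples, used as input by Figures.Rmd
    write.csv(dUMI_df, "dUMI_df.csv", row.names = FALSE)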

  3. Integrated Hourly Meteorological Database of 20 Meteorological Stations...

    • search.dataone.org
    • osti.gov
    Updated Jan 17, 2025
    Cite
    Boris Faybishenko; Dylan O'Ryan (2025). Integrated Hourly Meteorological Database of 20 Meteorological Stations (1981-2022) for Watershed Function SFA Hydrological Modeling [Dataset]. http://doi.org/10.15485/2502101
    Dataset updated
    Jan 17, 2025
    Dataset provided by
    ESS-DIVE
    Authors
    Boris Faybishenko; Dylan O'Ryan
    Time period covered
    Jan 1, 1981 - Dec 31, 2022
    Area covered
    Description

    This dataset contains (a) a script “R_met_integrated_for_modeling.R”, and (b) associated input CSV files: 3 CSV files per location to create a 5-variable integrated meteorological dataset file (air temperature, precipitation, wind speed, relative humidity, and solar radiation) for 19 meteorological stations and 1 location within Trail Creek from the modeling team within the East River Community Observatory as part of the Watershed Function Scientific Focus Area (SFA).

    As meteorological forcings varied across the watershed, a high-frequency database is needed to ensure consistency in the data analysis and modeling. We evaluated several data sources, including gridded meteorological products and field data from meteorological stations, and determined that our modeling efforts required multiple data sources to meet all their needs.

    As output, this dataset contains (c) a single CSV data file (*_1981-2022.csv) for each location (20 CSV output files total) containing hourly time series data for 1981 to 2022 and (d) five PNG files of time series and density plots for each variable per location (100 PNG files). Detailed location metadata is contained within the Integrated_Met_Database_Locations.csv file for each point location included within this dataset, obtained from Varadharajan et al., 2023, doi:10.15485/1660962. This dataset also includes (e) a file-level metadata (flmd.csv) file that lists each file contained in the dataset with associated metadata and (f) a data dictionary (dd.csv) file that contains the column/row headers used throughout the files along with a definition, units, and data type. Review the (g) ReadMe_Integrated_Met_Database.pdf file for additional details on the script, methods, and structure of the dataset.

    The script integrates the Northwest Alliance for Computational Science and Engineering’s PRISM gridded data product, the National Oceanic and Atmospheric Administration’s NCEP-NCAR Reanalysis 1 gridded data product (through the RNCEP R package, Kemp et al., doi:10.32614/CRAN.package.RNCEP), and analytical-based calculations. Further, this script downscales the input data into hourly frequency, which is necessary for the modeling efforts.
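
    The integration itself is performed by “R_met_integrated_for_modeling.R”. Purely as an illustrative sketch (the file and column names below are hypothetical and not taken from the script), merging the three per-location input CSVs into a single hourly table could look like:

    library(dplyr)

    # Hypothetical per-location input files (3 CSVs per location, as described above)
    prism <- read.csv("prism_location01.csv")   # e.g. air temperature, precipitation
    rncep <- read.csv("rncep_location01.csv")   # e.g. wind speed, relative humidity
    solar <- read.csv("solar_location01.csv")   # e.g. solar radiation

    # Join on a shared timestamp column and keep the five modeling variables
    met <- prism %>%
      inner_join(rncep, by = "datetime") %>%
      inner_join(solar, by = "datetime") %>%
      select(datetime, air_temperature, precipitation,
             wind_speed, relative_humidity, solar_radiation)

    # One integrated hourly file per location, e.g. location01_1981-2022.csv
    write.csv(met, "location01_1981-2022.csv", row.names = FALSE)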

  4. Dataset 5: R script and input data

    • figshare.com
    txt
    Updated Jan 21, 2021
    Cite
    Ulrike Bayr (2021). Dataset 5: R script and input data [Dataset]. http://doi.org/10.6084/m9.figshare.12301307.v1
    Available download formats: txt
    Dataset updated
    Jan 21, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ulrike Bayr
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    1. CSV files as input for the R code, containing results from the GIS analysis for points and polygons
    2. CSV file with 3D errors based on DTM 1 m and DTM 10 m (output from the WSL Monoplotting Tool)
    3. R script

  5. Child 1: Nutrient and streamflow model-input data

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Child 1: Nutrient and streamflow model-input data [Dataset]. https://catalog.data.gov/dataset/child-1-nutrient-and-streamflow-model-input-data
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    Trends in nutrient fluxes and streamflow for selected tributaries in the Lake Erie watershed were calculated using monitoring data at 10 locations. Trends in flow-normalized nutrient fluxes were determined by applying a weighted regression approach called WRTDS (Weighted Regression on Time, Discharge, and Season).

    Site information and streamflow and water-quality records are contained in 3 zipped files named as follows: INFO (site information), Daily (daily streamflow records), and Sample (water-quality records). The INFO, Daily (flow), and Sample files contain the input data, by water-quality parameter and by site as .csv files, used to run the trend analyses. These files were generated by the R (version 3.1.2) software package called EGRET - Exploration and Graphics for River Trends (version 2.5.1) (Hirsch and De Cicco, 2015), and can be used directly as input to run graphical procedures and WRTDS trend analyses using the EGRET R software. The .csv files are identified according to water-quality parameter (TP, SRP, TN, NO23, and TKN) and site reference number (e.g. TPfiles.1.INFO.csv, SRPfiles.1.INFO.csv, TPfiles.2.INFO.csv, etc.). Water-quality parameter abbreviations and site reference numbers are defined in the file "Site-summary_table.csv" on the landing page, where there is also a site-location map ("Site_map.pdf"). Parameter information details, including abbreviation definitions, appear in the abstract on the landing page. SRP data records were available at only 6 of the 10 trend sites, which are identified in the file "Site-summary_table.csv" (see landing page) as monitored by the organization NCWQR (National Center for Water Quality Research). The SRP sites are: RAIS, MAUW, SAND, HONE, ROCK, and CUYA.

    The model-input dataset is presented in 3 parts:
    1. INFO.zip (site information)
    2. Daily.zip (daily streamflow records)
    3. Sample.zip (water-quality records)

    Reference: Hirsch, R.M., and De Cicco, L.A., 2015 (revised). User Guide to Exploration and Graphics for RivEr Trends (EGRET) and dataRetrieval: R Packages for Hydrologic Data, Version 2.0. U.S. Geological Survey Techniques and Methods, 4-A10. U.S. Geological Survey, Reston, VA., 93 p. (at: http://dx.doi.org/10.3133/tm4A10).
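
    Because the INFO, Daily, and Sample files are already formatted as EGRET inputs, they can be loaded directly with the EGRET package. Below is a minimal sketch of the standard EGRET workflow for one site and parameter (total phosphorus at site 1); the folder name is an assumption, and the Daily and Sample file names are assumed to follow the same pattern as the INFO files:

    library(EGRET)

    # Load the three model-input files for one site/parameter combination
    INFO   <- readUserInfo("input", "TPfiles.1.INFO.csv")
    Daily  <- readUserDaily("input", "TPfiles.1.Daily.csv")
    Sample <- readUserSample("input", "TPfiles.1.Sample.csv")

    # Combine into an eList and fit the WRTDS model
    eList <- mergeReport(INFO, Daily, Sample)
    eList <- modelEstimation(eList)

    # Flow-normalized concentration and flux histories
    plotConcHist(eList)
    plotFluxHist(eList)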

  6. case study 1 bike share

    • kaggle.com
    Updated Oct 8, 2022
    Cite
    mohamed osama (2022). case study 1 bike share [Dataset]. https://www.kaggle.com/ososmm/case-study-1-bike-share/discussion
    Available download formats: Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    Oct 8, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    mohamed osama
    Description

    Cyclistic: Google Data Analytics Capstone Project

    Cyclistic - Google Data Analytics Certification Capstone Project
    Moirangthem Arup Singh

    How Does a Bike-Share Navigate Speedy Success?

    Background: This project is for the Google Data Analytics Certification capstone project. I am wearing the hat of a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. Cyclistic is a bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, my team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve the recommendations, so they must be backed up with compelling data insights and professional data visualizations.

    This project will be completed by using the six data analytics stages:
    Ask: Identify the business task and determine the key stakeholders.
    Prepare: Collect the data, identify how it’s organized, determine the credibility of the data.
    Process: Select the tool for data cleaning, check for errors and document the cleaning process.
    Analyze: Organize and format the data, aggregate the data so that it’s useful, perform calculations and identify trends and relationships.
    Share: Use design thinking principles and a data-driven storytelling approach, present the findings with effective visualization. Ensure the analysis has answered the business task.
    Act: Share the final conclusion and the recommendations.

    Ask:
    Business Task: Recommend marketing strategies aimed at converting casual riders into annual members by better understanding how annual members and casual riders use Cyclistic bikes differently.
    Stakeholders:
    Lily Moreno: The director of marketing and my manager.
    Cyclistic executive team: A detail-oriented executive team who will decide whether to approve the recommended marketing program.
    Cyclistic marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Cyclistic’s marketing strategy.

    Prepare:
    For this project, I will use the public data of Cyclistic’s historical trip data to analyze and identify trends. The data has been made available by Motivate International Inc. under license. I downloaded the ZIP files containing the csv files from the above link, but while uploading the files to Kaggle (as I am using a Kaggle notebook), I got a warning that the dataset is already available on Kaggle. So I will be using the cyclistic-bike-share dataset from Kaggle. The dataset has 13 csv files from April 2020 to April 2021. For the purpose of my analysis I will use the csv files from April 2020 to March 2021. The source csv files are in Kaggle, so I can rely on their integrity.

    I am using Microsoft Excel to get a glimpse of the data. There is one csv file for each month, with information about each bike ride: the ride id, rideable type, start and end time, start and end station, and the latitude and longitude of the start and end stations.

    Process:
    I will use R in the Kaggle notebook to import the dataset, check how it’s organized, whether all the columns have appropriate data types, find outliers, and check whether any of these data have sampling bias. I will be using the R libraries below.

    Load the tidyverse, lubridate, ggplot2, sqldf and psych libraries

    library(tidyverse)
    library(lubridate)
    library(ggplot2)
    library(plotrix)

    ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

    ✔ ggplot2 3.3.5 ✔ purrr 0.3.4 ✔ tibble 3.1.4 ✔ dplyr 1.0.7 ✔ tidyr 1.1.3 ✔ stringr 1.4.0 ✔ readr 2.0.1 ✔ forcats 0.5.1

    ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag()

    Attaching package: ‘lubridate’

    The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union
    

    Set the working directory

    setwd("/kaggle/input/cyclistic-bike-share")

    Import the csv files

    r_202004 <- read.csv("202004-divvy-tripdata.csv")
    r_202005 <- read.csv("20...
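
    The import continues for the remaining monthly files (truncated above). As a sketch of how the monthly tables might then be combined and a ride-length field derived for the analysis described in the Process step (the column names started_at and ended_at are assumptions based on the description of the data, not confirmed from the files):

    # Combine the monthly data frames into one table covering April 2020 to March 2021
    all_trips <- bind_rows(r_202004, r_202005)   # add the remaining monthly data frames here

    # Derive ride length (in minutes) and day of week from the start/end timestamps
    all_trips <- all_trips %>%
      mutate(
        started_at  = ymd_hms(started_at),
        ended_at    = ymd_hms(ended_at),
        ride_length = as.numeric(difftime(ended_at, started_at, units = "mins")),
        day_of_week = wday(started_at, label = TRUE)
      ) %>%
      filter(ride_length > 0)   # drop zero or negative durations as data-entry errors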

  7. Streamflow, Dissolved Organic Carbon, and Nitrate Input Datasets and Model...

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Streamflow, Dissolved Organic Carbon, and Nitrate Input Datasets and Model Results Using the Weighted Regressions on Time, Discharge, and Season (WRTDS) Model for Buck Creek Watersheds, Adirondack Park, New York, 2001 to 2021 [Dataset]. https://catalog.data.gov/dataset/streamflow-dissolved-organic-carbon-and-nitrate-input-datasets-and-model-results-using-the
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    This data release supports an analysis of changes in dissolved organic carbon (DOC) and nitrate concentrations in Buck Creek watershed near Inlet, New York 2001 to 2021. The Buck Creek watershed is a 310-hectare forested watershed that is recovering from acidic deposition within the Adirondack region. The data release includes pre-processed model inputs and model outputs for the Weighted Regressions on Time, Discharge and Season (WRTDS) model (Hirsch and others, 2010) to estimate daily flow normalized concentrations of DOC and nitrate during a 20-year period of analysis. WRTDS uses daily discharge and concentration observations implemented through the Exploration and Graphics for River Trends R package (EGRET) to predict solute concentration using decimal time and discharge as explanatory variables (Hirsch and De Cicco, 2015; Hirsch and others, 2010). Discharge and concentration data are available from the U.S. Geological Survey National Water Information System (NWIS) database (U.S. Geological Survey, 2016). The time series data were analyzed for the entire period, water years 2001 (WY2001) to WY2021 where WY2001 is the period from October 1, 2000 to September 30, 2001. This data release contains 5 comma-separated values (CSV) files, one R script, and one XML metadata file. There are four input files (“Daily.csv”, “INFO.csv”, “Sample_doc.csv”, and “Sample_nitrate.csv”) that contain site information, daily mean discharge, and mean daily DOC or nitrate concentrations. The R script (“Buck Creek WRTDS R script.R”) uses the four input datasets and functions from the EGRET R package to generate estimations of flow normalized concentrations. The output file (“WRTDS_results.csv”) contains model output at daily time steps for each sub-watershed and for each solute. Files are automatically associated with the R script when opened in RStudio using the provided R project file ("Files.Rproj"). All input, output, and R files are in the "Files.zip" folder.
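
    In practice, the workflow described above amounts to opening the provided project file, running the script, and inspecting the output. A minimal sketch (the output column names are not documented here, so only generic inspection is shown):

    # Open Files.Rproj in RStudio so the working directory and file paths are set, then:
    source("Buck Creek WRTDS R script.R")   # regenerates WRTDS_results.csv from the four input CSVs

    # Inspect the flow-normalized model output (see the file header for column definitions)
    results <- read.csv("WRTDS_results.csv")
    str(results)
    head(results)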

  8. 2010 County and City-Level Water-Use Data and Associated Explanatory...

    • catalog.data.gov
    • data.usgs.gov
    • +3more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). 2010 County and City-Level Water-Use Data and Associated Explanatory Variables [Dataset]. https://catalog.data.gov/dataset/2010-county-and-city-level-water-use-data-and-associated-explanatory-variables
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    This data release contains the input-data files and R scripts associated with the analysis presented in [citation of manuscript]. The spatial extent of the data is the contiguous U.S. The input-data files include one comma separated value (csv) file of county-level data, and one csv file of city-level data.

    The county-level csv (“county_data.csv”) contains data for 3,109 counties. This data includes two measures of water use, descriptive information about each county, three grouping variables (climate region, urban class, and economic dependency), and contains 18 explanatory variables: proportion of population growth from 2000-2010, fraction of withdrawals from surface water, average daily water yield, mean annual maximum temperature from 1970-2010, 2005-2010 maximum temperature departure from the 40-year maximum, mean annual precipitation from 1970-2010, 2005-2010 mean precipitation departure from the 40-year mean, Gini income disparity index, percent of county population with at least some college education, Cook Partisan Voting Index, housing density, median household income, average number of people per household, median age of structures, percent of renters, percent of single family homes, percent apartments, and a numeric version of urban class.

    The city-level csv (city_data.csv) contains data for 83 cities. This data includes descriptive information for each city, water-use measures, one grouping variable (climate region), and 6 explanatory variables: type of water bill (increasing block rate, decreasing block rate, or uniform), average price of water bill, number of requirement-oriented water conservation policies, number of rebate-oriented water conservation policies, aridity index, and regional price parity.

    The R scripts construct fixed-effects and Bayesian Hierarchical regression models. The primary difference between these models relates to how they handle possible clustering in the observations that define unique water-use settings. Fixed-effects models address possible clustering in one of two ways. In a "fully pooled" fixed-effects model, any clustering by group is ignored, and a single, fixed estimate of the coefficient for each covariate is developed using all of the observations. Conversely, in an unpooled fixed-effects model, separate coefficient estimates are developed only using the observations in each group. A hierarchical model provides a compromise between these two extremes. Hierarchical models extend single-level regression to data with a nested structure, whereby the model parameters vary at different levels in the model, including a lower level that describes the actual data and an upper level that influences the values taken by parameters in the lower level.

    The county-level models were compared using the Watanabe-Akaike information criterion (WAIC), which is derived from the log pointwise predictive density of the models and can be shown to approximate out-of-sample predictive performance. All script files are intended to be used with R statistical software (R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org) and Stan probabilistic modeling software (Stan Development Team. 2017. RStan: the R interface to Stan. R package version 2.16.2. http://mc-stan.org).
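
    The models in this release are fit with R and Stan (RStan). Purely as a conceptual illustration of the pooling distinction described above, and not the authors' code, the three approaches can be sketched with base R and lme4; the file name county_data.csv comes from the description, while the column names water_use, housing_density, and climate_region are placeholders:

    library(lme4)

    counties <- read.csv("county_data.csv")   # placeholder column names assumed below

    # Fully pooled: a single fixed coefficient estimated from all observations, clustering ignored
    m_pooled <- lm(water_use ~ housing_density, data = counties)

    # Unpooled: separate coefficients estimated only from the observations in each group
    m_unpooled <- lm(water_use ~ housing_density * climate_region, data = counties)

    # Hierarchical (partial pooling): group-level intercepts and slopes drawn from an
    # upper-level distribution, a compromise between the two extremes
    m_hier <- lmer(water_use ~ housing_density + (1 + housing_density | climate_region),
                   data = counties)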

  9. Data and Code for "A Ray-Based Input Distance Function to Model Zero-Valued...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 17, 2023
    Cite
    Henningsen, Arne (2023). Data and Code for "A Ray-Based Input Distance Function to Model Zero-Valued Output Quantities: Derivation and an Empirical Application" [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_7882078
    Dataset updated
    Jun 17, 2023
    Dataset provided by
    Henningsen, Arne
    Price, Juan José
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data and code archive provides all the data and code for replicating the empirical analysis that is presented in the journal article "A Ray-Based Input Distance Function to Model Zero-Valued Output Quantities: Derivation and an Empirical Application" authored by Juan José Price and Arne Henningsen and published in the Journal of Productivity Analysis (DOI: 10.1007/s11123-023-00684-1).

    We conducted the empirical analysis with the "R" statistical software (version 4.3.0) using the add-on packages "combinat" (version 0.0.8), "miscTools" (version 0.6.28), "quadprog" (version 1.5.8), sfaR (version 1.0.0), stargazer (version 5.2.3), and "xtable" (version 1.8.4) that are available at CRAN. We created the R package "micEconDistRay" that provides the functions for empirical analyses with ray-based input distance functions that we developed for the above-mentioned paper. Also this R package is available at CRAN (https://cran.r-project.org/package=micEconDistRay).

    This replication package contains the following files and folders:

    README This file

    MuseumsDk.csv The original data obtained from the Danish Ministry of Culture and from Statistics Denmark. It includes the following variables:

    museum: Name of the museum.

    type: Type of museum (Kulturhistorisk museum = cultural history museum; Kunstmuseer = arts museum; Naturhistorisk museum = natural history museum; Blandet museum = mixed museum).

    munic: Municipality, in which the museum is located.

    yr: Year of the observation.

    units: Number of visit sites.

    resp: Whether or not the museum has special responsibilities (0 = no special responsibilities; 1 = at least one special responsibility).

    vis: Number of (physical) visitors.

    aarc: Number of articles published (archeology).

    ach: Number of articles published (cultural history).

    aah: Number of articles published (art history).

    anh: Number of articles published (natural history).

    exh: Number of temporary exhibitions.

    edu: Number of primary school classes on educational visits to the museum.

    ev: Number of events other than exhibitions.

    ftesc: Scientific labor (full-time equivalents).

    ftensc: Non-scientific labor (full-time equivalents).

    expProperty: Running and maintenance costs [1,000 DKK].

    expCons: Conservation expenditure [1,000 DKK].

    ipc: Consumer Price Index in Denmark (the value for year 2014 is set to 1).

    prepare_data.R This R script imports the data set MuseumsDk.csv, prepares it for the empirical analysis (e.g., removing unsuitable observations, preparing variables), and saves the resulting data set as DataPrepared.csv.

    DataPrepared.csv This data set is prepared and saved by the R script prepare_data.R. It is used for the empirical analysis.

    make_table_descriptive.R This R script imports the data set DataPrepared.csv and creates the LaTeX table /tables/table_descriptive.tex, which provides summary statistics of the variables that are used in the empirical analysis.

    IO_Ray.R This R script imports the data set DataPrepared.csv, estimates a ray-based Translog input distance function with the 'optimal' ordering of outputs, imposes monotonicity on this distance function, creates the LaTeX table /tables/idfRes.tex that presents the estimated parameters of this function, and creates several figures in the folder /figures/ that illustrate the results.

    IO_Ray_ordering_outputs.R This R script imports the data set DataPrepared.csv, estimates a ray-based Translog input distance function and imposes monotonicity for each of the 720 possible orderings of the outputs, and saves all the estimation results as (a huge) R object allOrderings.rds.

    allOrderings.rds (not included in the ZIP file, uploaded separately) This is a saved R object created by the R script IO_Ray_ordering_outputs.R that contains the estimated ray-based Translog input distance functions (with and without monotonicity imposed) for each of the 720 possible orderings.

    IO_Ray_model_averaging.R This R script loads the R object allOrderings.rds that contains the estimated ray-based Translog input distance functions for each of the 720 possible orderings, does model averaging, and creates several figures in the folder /figures/ that illustrate the results.

    /tables/ This folder contains the two LaTeX tables table_descriptive.tex and idfRes.tex (created by R scripts make_table_descriptive.R and IO_Ray.R, respectively) that provide summary statistics of the data set and the estimated parameters (without and with monotonicity imposed) for the 'optimal' ordering of outputs.

    /figures/ This folder contains 48 figures (created by the R scripts IO_Ray.R and IO_Ray_model_averaging.R) that illustrate the results obtained with the 'optimal' ordering of outputs and the model-averaged results and that compare these two sets of results.

  10. Input data, model output, and R scripts for a machine learning streamflow...

    • datasets.ai
    • data.usgs.gov
    • +1more
    Updated Sep 11, 2024
    + more versions
    Cite
    Department of the Interior (2024). Input data, model output, and R scripts for a machine learning streamflow model on the Wyoming Range, Wyoming, 2012–17 [Dataset]. https://datasets.ai/datasets/input-data-model-output-and-r-scripts-for-a-machine-learning-streamflow-model-on-the-wyomi
    Dataset updated
    Sep 11, 2024
    Dataset authored and provided by
    Department of the Interior
    Area covered
    Wyoming Range, Wyoming
    Description

    A machine learning streamflow (MLFLOW) model was developed in R (model is in the Rscripts folder) for modeling monthly streamflow from 2012 to 2017 in three watersheds on the Wyoming Range in the upper Green River basin. Geospatial information for 125 site features (vector data are in the Sites.shp file) and discrete streamflow observation data and environmental predictor data were used in fitting the MLFLOW model and predicting with the fitted model. Tabular calibration and validation data are in the Model_Fitting_Site_Data.csv file, totaling 971 discrete observations and predictions of monthly streamflow. Geospatial information for 17,518 stream grid cells (raster data are in the Streams.tif file) and environmental predictor data were used for continuous streamflow predictions with the MLFLOW model. Tabular prediction data for all the study area (17,518 stream grid cells) and study period (72 months; 2012–17) are in the Model_Prediction_Stream_Data.csv file, totaling 1,261,296 predictions of spatially and temporally continuous monthly streamflow. Additional information about the datasets is in the metadata included in the four zipped dataset files and about the MLFLOW model is in the readme included in the zipped model archive folder.

  11. Replication Data for: The Political Violence Cycle

    • dataverse.harvard.edu
    • dataone.org
    Updated May 18, 2018
    Cite
    Andrew Little (2018). Replication Data for: The Political Violence Cycle [Dataset]. http://doi.org/10.7910/DVN/GA0X38
    Available download formats: Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    May 18, 2018
    Dataset provided by
    Harvard Dataverse
    Authors
    Andrew Little
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Replication files and code for "The Political Violence Cycle", by S.P. Harish and Andrew Little, APSR.

    Input files:
    id & q-wide.dta
    scad.dta
    scad_lar.dta
    scad_nelda_basic.dta
    scad_nelda_basic_year.dta
    scad_nelda_basic_yearmonth.dta

    Output files used to make graphs:
    439year.csv
    452year.csv
    475year.csv
    501year.csv
    dist2elec30antigovt2.csv
    dist2elec30contest.csv
    dist2elec30NOcontest.csv
    dist2elec45progovt.csv
    dist2elec180contest.csv
    dist2elec180NOcontest.csv
    dist2elec365contest.csv
    dist2elec365NOcontest.csv

    Code to clean data and create files to input to R for final graphs:
    scad_prep.do
    nelda_prep.do
    scad_nelda_merge.do
    gengraphdata.do
    timing_replication.do (this runs the other do files in the correct order)

    Code to produce final graphs:
    code_for_figures.R (all but figure 5)
    simulated_comparative_statics_and_fig5.R (figure 5 and graphs in supplemental information)

  12. Data Supplement: Demeter v1.2.0 Example Data

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated May 20, 2021
    Cite
    Chris. R. Vernon; Chris. R. Vernon (2021). Data Supplement: Demeter v1.2.0 Example Data [Dataset]. http://doi.org/10.5281/zenodo.4773639
    Available download formats: zip
    Dataset updated
    May 20, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Chris. R. Vernon; Chris. R. Vernon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set is to accompany the version 1.2.0 release of Demeter. Demeter can be accessed on GitHub here: https://github.com/JGCRI/demeter

    ## Structure
    demeter_v1.2.0_data_supplement
    ├── config_gcam_reference.ini
    ├── example.py
    ├── inputs
    │  ├── allocation
    │  │  ├── gcam_regbasin_modis_v6_type5_5arcmin_constraint_alloc.csv
    │  │  ├── gcam_regbasin_modis_v6_type5_5arcmin_kernel_weighting.csv
    │  │  ├── gcam_regbasin_modis_v6_type5_5arcmin_observed_alloc.csv
    │  │  ├── gcam_regbasin_modis_v6_type5_5arcmin_order_alloc.csv
    │  │  ├── gcam_regbasin_modis_v6_type5_5arcmin_projected_alloc.csv
    │  │  └── gcam_regbasin_modis_v6_type5_5arcmin_transition_alloc.csv
    │  ├── constraints
    │  │  ├── 000_nutrientavail_hswd_5arcmin.csv
    │  │  └── 001_soilquality_hswd_5arcmin.csv
    │  ├── observed
    │  │  └── gcam_reg32_basin235_modis_v6_2010_5arcmin_sqdeg_wgs84_11Jul2019.zip
    │  └── projected
    │    └── gcam_ref_scenario_reg32_basin235_v5p1p3.csv
    └── outputs
      └── gcam5p1p3_ref_2021-05-19_10h57m15s
        ├── log_files
        │  └── logfile_gcam5p1p3_ref_2021-05-19_10h57m15s.log
        └── spatial_landcover_tabular
          ├── landcover_2010_timestep.csv
          ├── landcover_2015_timestep.csv
          ├── landcover_2020_timestep.csv
          └── landcover_2025_timestep.csv
    
    ## Context
    - **demeter_v1.2.0_data_supplement**: root directory
    - **config_gcam_reference.ini**: example configuration file for Demeter. The `run_dir` needs to be set to the directory where this data has been downloaded to.
    - **example.py**: sample Python script to run Demeter 
    - **inputs**: directory containing the model input files 
    - **allocation**: directory containing the model allocation input files 
    - **gcam_regbasin_modis_v6_type5_5arcmin_constraint_alloc.csv**: allocation file to weight the influence of constraint. See https://github.com/JGCRI/demeter#constraint-weighting
    - **gcam_regbasin_modis_v6_type5_5arcmin_kernel_weighting.csv**: allocation file to weight the influence of kernel density on extensification. See https://github.com/JGCRI/demeter#kernel-density-weighting
    - **gcam_regbasin_modis_v6_type5_5arcmin_observed_alloc.csv**: allocation file to bin land classes in the observed spatial data input to Demeter's final output land types. See https://github.com/JGCRI/demeter#observational-spatial-data-class-allocation
    - **gcam_regbasin_modis_v6_type5_5arcmin_order_alloc.csv**: allocation file having the order in which land classes will be processed. See https://github.com/JGCRI/demeter#treatment-order
    - **gcam_regbasin_modis_v6_type5_5arcmin_projected_alloc.csv**: allocation file to bin the land classes in the projected data input to Demeter's final output land classes. See https://github.com/JGCRI/demeter#projected-land-class-allocation
    - **gcam_regbasin_modis_v6_type5_5arcmin_transition_alloc.csv**: allocation file that defines the order in which one land class should transition to another. See https://github.com/JGCRI/demeter#transition-priority
    - **constraints**: directory containing the model constraints 
    - **000_nutrientavail_hswd_5arcmin.csv**: constraint file for nutrient availability from data provided in the Harmonized World Soil Database (HWSD) and described in Le Page et al. 2016 containing a feature id matching that in the observed spatial data and a weight per grid cell that describes how suitable a grid cell is for the target constraint. 
    - **001_soilquality_hswd_5arcmin.csv**: constraint file for soil workability from data provided in the Harmonized World Soil Database (HWSD) and described in Le Page et al. 2016 containing a feature id matching that in the observed spatial data and a weight per grid cell that describes how suitable a grid cell is for the target constraint.
    - **observed**: directory containing the observed spatial base layer data for Demeter 
    - **gcam_reg32_basin235_modis_v6_2010_5arcmin_sqdeg_wgs84_11Jul2019.zip**: compressed CSV file of a 5-arcmin (0.083333-degree) resolution dataset of land cells that have a region/basin designation in GCAM. Values are in units of square degrees. The observed data was derived from MODIS v6 Classification Scheme 5 data (Friedl et al. 2010) and binned into Demeter's final land cover types for each 5-arcmin grid cell.
    - **projected**: directory containing the projected data file
    - **gcam_ref_scenario_reg32_basin235_v5p1p3.csv**: projected land allocation data for years 2010 through 2100 derived from a Global Change Analysis Model (GCAM; Calvin et al. 2019) version 5.1.3 output database for a Reference scenario. Demeter can build this file directly from a GCAM output XML database by adding the `gcam_database` option with a path to a target database into the `[PROJECTED]` section of the configuration file if so desired. GCAM output databases can be rather large; therefore, we provide the output of that query as a CSV in this data record.
    - **outputs**: directory containing the expected outputs from a Demeter run under the example configuration for years 2010 through 2025 in 5 year timesteps
    - **gcam5p1p3_ref_2021-05-19_10h57m15s**: directory from the example Demeter run to compare against. Other runs will generate a new directory with the appropriate time stamp.
    - **log_files**: directory containing any log files 
    - **logfile_gcam5p1p3_ref_2021-05-19_10h57m15s.log**: log file from the example run 
    - **spatial_landcover_tabular**: directory containing the outputs from Demeter 
    - **landcover_2010_timestep.csv**: output file from Demeter containing the fraction of land for each 5-arcmin grid cell for each final land type from Demeter for year 2010. Units are in fraction of a grid cell. Spatial reference is EPSG:4326 (WGS84) for the latitude and longitude of each grid cell centroid in degrees.
    - **landcover_2015_timestep.csv**: output file from Demeter containing the fraction of land for each 5-arcmin grid cell for each final land type from Demeter for year 2015. Units are in fraction of a grid cell. Spatial reference is EPSG:4326 (WGS84) for the latitude and longitude of each grid cell centroid in degrees.
    - **landcover_2020_timestep.csv**: output file from Demeter containing the fraction of land for each 5-arcmin grid cell for each final land type from Demeter for year 2020. Units are in fraction of a grid cell. Spatial reference is EPSG:4326 (WGS84) for the latitude and longitude of each grid cell centroid in degrees.
    - **landcover_2025_timestep.csv**: output file from Demeter containing the fraction of land for each 5-arcmin grid cell for each final land type from Demeter for year 2025. Units are in fraction of a grid cell. Spatial reference is EPSG:4326 (WGS84) for the latitude and longitude of each grid cell centroid in degrees.
    
    ## References
    Calvin, K., Patel, P., Clarke, L., Asrar, G., Bond-Lamberty, B., Cui, R. Y., Di Vittorio, A., Dorheim, K., Edmonds, J., Hartin, C., Hejazi, M., Horowitz, R., Iyer, G., Kyle, P., Kim, S., Link, R., McJeon, H., Smith, S. J., Snyder, A., Waldhoff, S., and Wise, M.: GCAM v5.1: representing the linkages between energy, water, land, climate, and economic systems, Geosci. Model Dev., 12, 677–698, https://doi.org/10.5194/gmd-12-677-2019, 2019.
    
    Friedl, M. A., Sulla-Menashe, D., Tan, B., Schneider, A., Ramankutty, N., Sibley, A., and Huang, X. (2010). MODIS Collection 5 global land cover: Algorithm refinements and characterization of new datasets. Remote Sensing of Environment, 114, 168–182.
    
    Le Page, Y, West, T, Link, R and Patel, P 2016 Downscaling land use and land cover from the Global Change Assessment Model for coupling with Earth system models. Geosci. Model Dev., 9: 3055–3069. DOI: https://doi.org/10.5194/gmd-9-3055-2016
    

  13. German Question-Answer Context Dataset

    • opendatabay.com
    Updated Jun 24, 2025
    Cite
    Datasimple (2025). German Question-Answer Context Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/eebd00ea-b68f-4b29-9aa1-7fe980e60502
    Dataset updated
    Jun 24, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    The dataset provided is a comprehensive collection of German question-answer pairs with their corresponding context. It has been specifically compiled for the purpose of enhancing and facilitating natural language processing (NLP) tasks in the German language. The dataset includes two main files: train.csv and test.csv.

    The train.csv file contains a substantial amount of data, consisting of numerous entries that comprise various contexts along with their corresponding questions and answers in German. The contextual information may range from paragraphs to concise sentences, providing a well-rounded representation of different scenarios.

    Similarly, the test.csv file also contains a significant number of question-answer pairs in German along with their respective contexts. This file can be utilized for model evaluation and testing purposes, ensuring the robustness and accuracy of NLP models developed using this dataset.

    Both train.csv and test.csv provide valuable resources for training machine learning models in order to improve question-answering systems or any other NLP application specific to the German language. The inclusion of multiple context fields enhances diversity within the dataset and enables more thorough analysis by accounting for varying linguistic structures.

    Ultimate objectives behind creating this rich dataset involve fostering advancements in machine learning techniques applied to natural language understanding in German. Researchers, developers, and enthusiasts working on NLP tasks can leverage this extensive collection to explore state-of-the-art methodologies or develop novel approaches focused on understanding complex questions within given contextual frameworks accurately.

    How to use the dataset

    Understanding the Dataset Structure: The dataset consists of two files, train.csv and test.csv. Both files contain question-answer pairs along with their corresponding context.

    Columns: Each file has multiple columns that provide important information about the data:

    context: This column contains the context in which the question is being asked. It can be a paragraph, a sentence, or any other relevant information.

    answers: This column contains the answer(s) to the given question in the corresponding context. The answers could be single or multiple.

    Exploring and Analyzing Data: Before diving into any analysis or modeling tasks, it's recommended to explore and analyze the dataset thoroughly:

    Load both train.csv and test.csv files into your preferred programming environment (Python/R).

    Check for missing values (NaN) or any inconsistencies in data.

    Analyze statistical properties of different columns such as count, mean, standard deviation etc., to understand variations within your dataset.

    Preprocessing Text Data: Since this dataset contains text data (questions, answers), preprocessing steps might be required before further analysis (see the R sketch after these steps).

    Process text by removing punctuation marks, special characters and converting all words to lowercase for better consistency.

    Tokenize text data by splitting sentences into individual words/tokens using libraries like NLTK or SpaCy.

    Remove stop words (commonly occurring irrelevant words like 'the', 'is', etc.) from your text using available stop word lists.

    Building Models: Once you have preprocessed your data appropriately, you can proceed with building models using a variety of techniques based on your goals and requirements. Some common approaches include:

    Building question-answering systems using machine learning algorithms like Natural Language Processing (NLP) or transformers.

    Utilizing pre-trained language models such as BERT, GPT, etc., for more accurate predictions.

    Implementing deep learning architectures like LSTM or CNN for better contextual understanding.

    Model Evaluation: After training your models, evaluate their performance by utilizing appropriate evaluation metrics and techniques.

    Iterative Process: Most often, the process of building effective question-answering systems is iterative, with repeated rounds of preprocessing, modeling, and evaluation.
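
    As referenced in the preprocessing step, here is a minimal R sketch of the loading and preprocessing steps above (the tidytext and stopwords packages stand in for the NLTK/SpaCy workflow mentioned; the context column comes from the column list, everything else is illustrative):

    library(readr)
    library(dplyr)
    library(tidytext)
    library(stopwords)

    train <- read_csv("train.csv")
    test  <- read_csv("test.csv")

    # Basic checks: missing values and column summaries
    colSums(is.na(train))
    summary(train)

    # Lowercase, strip punctuation, tokenize the context field, and drop German stop words
    tokens <- train %>%
      mutate(context = tolower(gsub("[[:punct:]]", " ", context))) %>%
      unnest_tokens(word, context) %>%
      filter(!word %in% stopwords::stopwords("de"))

    count(tokens, word, sort = TRUE)   # most frequent remaining words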

    Research Ideas

    Language understanding and translation: This dataset can be used to train models for German language understanding and translation tasks. By providing context, question, and answer pairs, the models can learn to understand the meaning of sentences in German and generate accurate translations.

    Question-answering systems: The dataset can be used to build question-answering systems in German. By training a model on this dataset, it can learn to read the context, understand the question being asked, and generate accurate answers based on the given context.

    Information retrieval: With this dataset, information retrieval systems can be built that retrieve relevant information based on user queries in German. The models trained on this dataset can process user questions and return relevant answers from the provided contexts.

    By utilizing this dataset in these ways, it enables advancements in natural language processing tasks specific to German language understanding.

  14. LScD (Leicester Scientific Dictionary)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    + more versions
    Cite
    Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
    Available download formats: docx
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    LScD (Leicester Scientific Dictionary), April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    [Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; the explanation is not repeated here. After pre-processing, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are the same as those described for LScD Version 2 below.

    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
    ** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

    [Version 2] Getting Started

    This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). The dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from the LSC, and instructions for using the code, are available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains the title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.

    LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing each word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information: 1) unique words in abstracts; 2) the number of documents containing each word; 3) the number of appearances of each word in the entire corpus.

    Processing the LSC

    Step 1. Downloading the LSC online: Use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

    Step 2. Importing the corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source, with changes of parameters. The structure of the corpus, such as the file format and the names (and positions) of fields, should be taken into account when applying our code. The organisation of the CSV files of LSC is described in the README file for LSC [1].

    Step 3. Extracting abstracts and saving metadata: Metadata (all fields in a document excluding the abstract) and the abstract field are separated. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

    Step 4. Text pre-processing steps on the collection of abstracts: This section presents our approach to pre-processing the abstracts of the LSC.

    1. Removing punctuation and special characters: all non-alphanumeric characters are substituted by a space. The character “-” is not substituted in this step, because words like “z-score”, “non-payment” and “pre-processing” need to be kept so that their actual meaning is not lost. Uniting prefixes with words is performed in a later pre-processing step.
    2. Lowercasing the text data: lowercasing is performed to avoid treating words like “Corpus”, “corpus” and “CORPUS” differently. The entire collection of texts is converted to lowercase.
    3. Uniting prefixes of words: words containing prefixes joined with the character “-” are united into a single word. The prefixes united for this research are listed in the file “list_of_prefixes.csv”. Most of the prefixes are extracted from [4]. We also added the commonly used prefixes ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.
    4. Substitution of words: some words joined with “-” in the abstracts of the LSC require an additional substitution step so that their meaning is not lost before the character “-” is removed. Examples of such words are “z-test”, “well-known” and “chi-square”, which are substituted by “ztest”, “wellknown” and “chisquare”. Such words were identified by sampling abstracts from the LSC. The full list of such words and the substitutions chosen are presented in the file “list_of_substitution.csv”.
    5. Removing the character “-”: all remaining “-” characters are replaced by a space.
    6. Removing numbers: all digits that are not part of a word are replaced by a space. Words containing both digits and letters are kept, because alphanumeric tokens such as chemical formulas (e.g. “co2”, “h2o” and “21st”) may be important for our analysis.
    7. Stemming: stemming converts inflected words into their word stem. This unites several forms of a word with similar meaning into one form and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
    8. Stop word removal: stop words are words that are extremely common but provide little value in a language, such as ‘I’, ‘the’ and ‘a’. The ‘tm’ package in R is used to remove stop words [6]; there are 174 English stop words listed in the package.
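    The following is a minimal, illustrative R sketch of the kind of pre-processing described in Step 4. It is not the published LScD_Creation.R script, and the tiny prefix/substitution lists below stand in for list_of_prefixes.csv and list_of_substitution.csv:

    # Illustrative only: a toy version of the Step 4 pipeline using the 'tm' package
    # (the real workflow is in LScD_Creation.R; see reference [2]).
    library(tm)
    library(SnowballC)   # needed by tm's stemDocument

    abstracts <- c("Pre-processing of 1,000 abstracts: z-test and chi-square statistics (2014).",
                   "The corpus contains CO2 and H2O measurements from well-known sources.")

    clean <- gsub("[^[:alnum:]-]", " ", abstracts)           # 1) drop punctuation, keep "-"
    clean <- tolower(clean)                                  # 2) lowercase
    clean <- gsub("\\bpre-", "pre", clean)                   # 3) unite prefixes (toy list)
    clean <- gsub("\\bz-test\\b", "ztest", clean)            # 4) substitutions (toy list)
    clean <- gsub("\\bchi-square\\b", "chisquare", clean)
    clean <- gsub("\\bwell-known\\b", "wellknown", clean)
    clean <- gsub("-", " ", clean)                           # 5) remove remaining "-"
    clean <- gsub("\\b[0-9]+\\b", " ", clean)                # 6) drop standalone numbers (co2, h2o kept)

    corpus <- VCorpus(VectorSource(clean))
    corpus <- tm_map(corpus, removeWords, stopwords("english"))  # 8) stop-word removal
    corpus <- tm_map(corpus, stemDocument)                       # 7) stemming (order simplified here)
    dtm <- DocumentTermMatrix(corpus)                            # DTM as listed among the outputs
    head(sort(colSums(as.matrix(dtm)), decreasing = TRUE))       # most frequent stems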
    Step 5. Writing the LScD into CSV format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file “LScD.csv”.

    The Organisation of the LScD

    The total number of words in the file “LScD.csv” is 974,238. Each field is described below.

    Word: the unique words from the corpus. All words are in lowercase and in their stemmed form. The field is sorted by the number of documents containing each word, in descending order.

    Number of Documents Containing the Word: a binary count is used: if a word exists in an abstract, it contributes a count of 1; if the word occurs more than once in a document, the count is still 1. The total number of documents containing the word is the sum of these counts over the entire corpus.

    Number of Appearances in Corpus: how many times the word occurs in the corpus when the corpus is considered as one large document.

    Instructions for R Code

    LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as RData files and in CSV format. The outputs are:

    Metadata File: includes all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

    File of Abstracts: contains all abstracts after the pre-processing steps defined in Step 4.

    DTM: the Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.

    LScD: an ordered list of words from the LSC as defined in the previous section.

    To use the code: 1) download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’; 2) open the LScD_Creation.R script; 3) change the parameters in the script, giving the full path of the directory with source files and the full path of the directory for output files; 4) run the full code.

    References

    [1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
    [2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
    [3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
    [5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter’s stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
    [6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.

  15. d

    QA/QC-ed Groundwater Level Time Series in PLM-1 and PLM-6 Monitoring Wells,...

    • search.dataone.org
    • dataone.org
    • +2more
    Updated Oct 13, 2022
    Cite
    Boris Faybishenko; Roelof Versteeg; Kenneth Williams; Rosemary Carroll; Wenming Dong; Tetsu Tokunaga; Dylan O'Ryan (2022). QA/QC-ed Groundwater Level Time Series in PLM-1 and PLM-6 Monitoring Wells, East River, Colorado (2016-2020) [Dataset]. http://doi.org/10.15485/1866836
    Explore at:
    Dataset updated
    Oct 13, 2022
    Dataset provided by
    ESS-DIVE
    Authors
    Boris Faybishenko; Roelof Versteeg; Kenneth Williams; Rosemary Carroll; Wenming Dong; Tetsu Tokunaga; Dylan O'Ryan
    Time period covered
    Nov 30, 2016 - Sep 22, 2020
    Area covered
    Description

    This data set contains QA/QC-ed (Quality Assurance and Quality Control) water level data for the PLM1 and PLM6 wells. PLM1 and PLM6 are location identifiers used by the Watershed Function SFA project for two groundwater monitoring wells along an elevation gradient in the lower montane life zone of a hillslope near the Pumphouse location in the East River Watershed, Colorado, USA. These wells are used to monitor subsurface water and carbon inventories and fluxes, and to determine the seasonally dependent flow of groundwater under the PLM hillslope. The downslope flow of groundwater, in combination with data on groundwater chemistry (see related references), can be used to estimate rates of solute export from the hillslope to the floodplain and river.

    QA/QC analysis of the measured groundwater levels in monitoring wells PLM-1 and PLM-6 included identification and flagging of duplicated timestamps, gap filling of missing timestamps and water levels, and removal of abnormal/bad values and outliers from the measured water levels. The QA/QC analysis also tested the application of different QA/QC methods and the development of regular (1-hour and 1-day) time series datasets, which can serve as a benchmark for testing other QA/QC techniques and will be applicable for ecohydrological modeling. The package includes a Readme file, two R Markdown html files, and a series of 8 csv files: two input files of irregular measured groundwater levels, two output files with QA/QC flagging of the original data, and four QA/QC-ed regular time series datasets. The raw data from these wells were published as a separate dataset on ESS-DIVE, Williams et al. (2020): 10.15485/1818367. This dataset replaces the previously published raw data time series and is the final groundwater data product for the PLM wells in the East River. Complete metadata information on the PLM1 and PLM6 wells is available in a related dataset on ESS-DIVE: Varadharajan C, et al (2022), https://doi.org/10.15485/1660962. These data products are part of the Watershed Function Scientific Focus Area collection effort to further scientific understanding of biogeochemical dynamics from genome to watershed scales.

    2022/09/09 Update: Converted data files using ESS-DIVE’s Hydrological Monitoring Reporting Format. With the adoption of this reporting format, three new files (v1_20220909_flmd.csv, V1_20220909_dd.csv, and InstallationMethods.csv) were added. The file-level metadata file (v1_20220909_flmd.csv) contains information specific to the files contained within the dataset. The data dictionary file (v1_20220909_dd.csv) contains definitions of column headers and other terms across the dataset. The installation methods file (InstallationMethods.csv) contains a description of methods associated with installation and deployment at the PLM1 and PLM6 wells. Additionally, eight data files were re-formatted to follow the reporting format guidance (er_plm1_waterlevel_2016-2020.csv, er_plm1_waterlevel_1-hour_2016-2020.csv, er_plm1_waterlevel_daily_2016-2020.csv, QA_PLM1_Flagging.csv, er_plm6_waterlevel_2016-2020.csv, er_plm6_waterlevel_1-hour_2016-2020.csv, er_plm6_waterlevel_daily_2016-2020.csv, QA_PLM6_Flagging.csv). The major changes to the data files include the addition of header_rows above the data containing metadata about the particular well, units, and sensor description.
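    As an illustration of the QA/QC steps described above (not the published R Markdown workflow), a minimal R sketch might look as follows; the column names, the number of header rows, and the interpolation and outlier rules are assumptions made purely for this example:

    # Illustrative only: flag duplicate timestamps, build a regular 1-hour series,
    # and flag crude outliers. Column names and header-row count are placeholders.
    library(dplyr)

    wl <- read.csv("er_plm1_waterlevel_2016-2020.csv", skip = 8)       # header_rows count assumed
    wl$DateTime <- as.POSIXct(wl$DateTime, tz = "UTC")                 # column name assumed
    wl$dup_flag <- duplicated(wl$DateTime)                             # flag duplicated timestamps

    hourly <- wl %>%
      filter(!dup_flag) %>%
      mutate(DateTime = as.POSIXct(round(as.numeric(DateTime) / 3600) * 3600,
                                   origin = "1970-01-01", tz = "UTC")) %>%
      group_by(DateTime) %>%
      summarise(WaterLevel = mean(WaterLevel, na.rm = TRUE), .groups = "drop") %>%
      tidyr::complete(DateTime = seq(min(DateTime), max(DateTime), by = "hour")) %>%
      mutate(WaterLevel = zoo::na.approx(WaterLevel, na.rm = FALSE))   # simple gap filling

    roll <- zoo::rollmean(hourly$WaterLevel, k = 24, fill = NA)        # 24-hour rolling mean
    hourly$outlier_flag <- abs(hourly$WaterLevel - roll) >
      3 * sd(hourly$WaterLevel - roll, na.rm = TRUE)                   # crude 3-sigma flag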

  16. Supplement to "Managing discharge data quality in sparsely gauged river...

    • zenodo.org
    zip
    Updated Jun 7, 2025
    Cite
    Richard Cornelio; Richard Cornelio (2025). Supplement to "Managing discharge data quality in sparsely gauged river basins using residual-based diagnostics and model-agnostic uncertainty analysis" [Dataset]. http://doi.org/10.5281/zenodo.15611919
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Richard Cornelio; Richard Cornelio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This deposit contains the code and data used in "Managing discharge data quality in sparsely gauged river basins using residual-based diagnostics and model-agnostic uncertainty analysis," which has been submitted for peer review. The following directories and files are herein available:

    • MRB_Q_RC.R is the R script used for formal analysis. Inline comments are provided to guide the user through the workflow. Please change the directory path names as needed. Please also download and install the required libraries prior to execution.

    • MRB_Q_RC_gg.R is the R script used to reproduce the figures used in the manuscript. Please note that MRB_Q_RC.R must be run first, as MRB_Q_RC_gg.R depends on the objects generated in the initial script.

    • MRB_CSVs is a directory containing all the required input data to MRB_Q_RC.R

      • MRB_DMs_in is a subdirectory containing the discharge measurements (“DMs”) or gaugings used to build the rating curves at each of the Mindanao River Basin (“MRB”) gauging stations used in the study: Buluan River (Bln_DMs.csv), Alip River (Alp_DMs.csv), Kapingkong River Basin (Kpn_DMs.csv), and Lanon River Basin (Lnn_DMs.csv). A minimal read-and-fit sketch using one of these files is shown after this list. All CSV files have the same attributes:

        • “Date” — YYYY-MM-DD

        • “H” — [num] stage (m)

        • “Q_lps” — [num] discharge (L/s)

        • “Q_cms” — [num] discharge (m3/s)

        • “RC_ofcl” — [chr] official rating curve; either “RC_A” or “RC_B”

        • “Outside_val_per” — [log] if TRUE, the gauging (or more specifically the date of gauging) is outside the validity period of the rating curve.

      • MRB_Q_dpwh_in is a subdirectory containing the rated or “processed” discharge data from the Department of Public Works and Highways. Data on all four stations are available: Lnn_DPWH.csv, Kpn_DPWH.csv, Alp_DPWH.csv, and Bln_DPWH.csv. All CSV files have the same attributes: “YEAR” [YYYY], “DAY” [D], and month columns (“JAN” through “DEC”). The values under each month column are discharge estimates in L/s.

      • MRB_RTs_in is a subdirectory containing the rating tables (“RTs”) for each official rating curve at each gauging station considered: Kapingkong (Kpn_RC_A.csv and Kpn_RC_B.csv); Alip (Alp_RC_A.csv and Alp_RC_B.csv); Lanon (Lnn_RC_A.csv and Lnn_RC_B.csv); and Buluan (Kpn_RC_A.csv and Bln_RC_B.csv). All CSV files have the same attributes:

        • “H” — [num] stage (m)

        • “Q_lps” — [num] discharge (L/s)

        • “Q_cms” — [num] discharge (m3/s)

      • MRB_RCs_summ.csv is a CSV file containing a summary table of the gaugings and rating curves used at each station. The file has the following attributes:

        • “River” — [chr] river name

        • “Station” — [chr] abbreviated station name based on the river name.

        • “Start_year_Q_ts” — [date] the start date of the discharge record (YYYY-MM-DD)

        • “End_year_Q_ts” — [date] the end date of the discharge record (YYYY-MM-DD)

        • “Year_count” — [num] the approximate number of years on record

        • “RCs_count” — [num] the number of official rating curves developed at each station

        • “RC_A_start” — [date] the start date of the validity period of the first official rating curve considered (YYYY-MM-DD)

        • “RC_A_end” — [date] the end date of the validity period of the first official rating curve considered (YYYY-MM-DD)

        • “RC_B_start” — [date] the start date of the validity period of the second official rating curve considered (YYYY-MM-DD)

        • “RC_B_end” — [date] the end date of the validity period of the second official rating curve considered (YYYY-MM-DD)

        • “RC_A_start_2” — [date] the start date of the second validity period of the first official rating curve considered (YYYY-MM-DD); applies only to the Buluan station.

        • “RC_A_end_2” — [date] the end date of the second validity period of the first official rating curve considered (YYYY-MM-DD); applies only to the Buluan station.

      • MRB_RCs_rval.csv is a CSV file similar to MRB_RCs_summ.csv, though the data are pivoted wide to long and slightly modified. It has the following attributes: “Station” [chr], “Subperiod” [chr], “RC_ofcl” [chr], “RC_start” in YYYY-MM-DD and “RC_end” in YYYY-MM-DD.
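    As a purely illustrative complement to the workflow in MRB_Q_RC.R (which remains the authoritative analysis), a minimal R sketch for reading one gaugings file and fitting a conventional power-law rating curve Q = a*(H - h0)^b could look like the following; the power-law form and the starting values are assumptions for this example, not necessarily the model used in the manuscript:

    # Illustrative only: fit a simple rating curve to the Buluan gaugings.
    dms <- read.csv("MRB_CSVs/MRB_DMs_in/Bln_DMs.csv")

    # Keep gaugings attributed to RC_A and inside its validity period
    dms_val <- subset(dms, RC_ofcl == "RC_A" & Outside_val_per == FALSE)

    fit <- nls(Q_cms ~ a * (H - h0)^b,
               data  = dms_val,
               start = list(a = 1, b = 1.5, h0 = min(dms_val$H) - 0.1))   # illustrative starting values

    plot(Q_cms ~ H, data = dms_val, xlab = "Stage H (m)", ylab = "Discharge (m3/s)")
    curve(coef(fit)["a"] * (x - coef(fit)["h0"])^coef(fit)["b"], add = TRUE)
    summary(residuals(fit))   # residuals feed the kind of diagnostics the study describes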

  17. d

    R scripts, input and output data for: Season of death, pathogen persistence...

    • search.dataone.org
    • data.niaid.nih.gov
    • +2more
    Updated Jan 30, 2024
    Cite
    Amelie Dolfi (2024). R scripts, input and output data for: Season of death, pathogen persistence and wildlife behaviour alter number of anthrax secondary infections from environmental reservoirs [Dataset]. http://doi.org/10.5061/dryad.3n5tb2rqt
    Explore at:
    Dataset updated
    Jan 30, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Amelie Dolfi
    Time period covered
    Jan 1, 2024
    Description

    An important part of infectious disease management is predicting factors that influence disease outbreaks, such as R, the number of secondary infections arising from an infected individual. Estimating R is particularly challenging for environmentally transmitted pathogens given time lags between cases and subsequent infections. Here, we calculated R for Bacillus anthracis infections arising from anthrax carcass sites in Etosha National Park, Namibia. Combining host behavioural data, pathogen concentrations, and simulation models, we show that R is spatially and temporally variable, driven by spore concentrations at death, host visitation rates and early preference for foraging at infectious sites. While spores were detected up to a decade after death, most secondary infections occurred within two years. Transmission simulations under scenarios combining site infectiousness and host exposure risk under different environmental conditions led to dramatically different outbreak dynamics, fr...

    Camera trap data were collected from 13 sites between 2010–2013 in Etosha National Park, Namibia. Soil samples were collected to analyse the concentration of Bacillus anthracis spores between 2010–2022 in Etosha National Park. Statistical analysis was used to analyse the camera trap data, and a simulation model was built to estimate the reproduction number of anthrax in Etosha National Park.

    Reproduction Number analysis and model

    Files to run the model from the manuscript "Season of death, pathogen persistence and wildlife behavior alter number of anthrax secondary infections from environmental reservoirs".

    Six CSV data files and one R script are used to run the analysis and modeling.

    Eleven CSV files represent the output of the model (R script 'R_estimation_Code.R', using the input data CSV files). Those outputs are used in the R code 'Plots_R_estimation_Code.R' to plot the figures in the MS and the SI.

    Description of the data and file structure

    Input files:

    • Camera_all_individuals.csv: Records each individual zebra and wildebeest observed on the camera (a minimal summary sketch using this file appears after the column list below). The columns are:
      • Date = Date of recording;
      • ID = Camera ID;
      • Treatment = Carcass or Control site;
      • Tini = Time at which the individual entered the zone;
      • Tend = Time at which the individual exited the zone;
      • Individual = When herd visiting the site at the same time, t...
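    As a small illustration of how the column structure above (truncated in this listing) can be used, the sketch below summarises visits per camera and site type. It is not the published R_estimation_Code.R, and the date/time formats of Tini and Tend are assumed:

    # Illustrative only: per-camera visit counts and mean visit duration.
    library(dplyr)

    cam <- read.csv("Camera_all_individuals.csv")

    cam %>%
      mutate(entry = as.POSIXct(paste(Date, Tini)),     # "YYYY-MM-DD HH:MM:SS" format assumed
             exit  = as.POSIXct(paste(Date, Tend)),
             duration_min = as.numeric(difftime(exit, entry, units = "mins"))) %>%
      group_by(ID, Treatment) %>%
      summarise(n_visits = n(),
                mean_duration_min = mean(duration_min, na.rm = TRUE),
                .groups = "drop")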
  18. Z

    Food and Agriculture Biomass Input–Output (FABIO) database

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 8, 2022
    Cite
    Kuschnig, Nikolas (2022). Food and Agriculture Biomass Input–Output (FABIO) database [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2577066
    Explore at:
    Dataset updated
    Jun 8, 2022
    Dataset provided by
    Bruckner, Martin
    Kuschnig, Nikolas
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This data repository provides the Food and Agriculture Biomass Input Output (FABIO) database, a global set of multi-regional physical supply-use and input-output tables covering global agriculture and forestry.

    The work is based on mostly freely available data from FAOSTAT, IEA, EIA, and UN Comtrade/BACI. FABIO currently covers 191 countries + RoW, 118 processes and 125 commodities (raw and processed agricultural and food products) for 1986-2013. All R code and auxiliary data are available on GitHub. For more information please refer to https://fabio.fineprint.global.

    The database consists of the following main components, in compressed .rds format:

    Z: the inter-commodity input-output matrix, displaying the relationships of intermediate use of each commodity in the production of each commodity, in physical units (tons). The matrix has 24000 rows and columns (125 commodities x 192 regions), and is available in two versions, based on the method to allocate inputs to outputs in production processes: Z_mass (mass allocation) and Z_value (value allocation). Note that the row sums of the Z matrix (= total intermediate use by commodity) are identical in both versions.

    Y: the final demand matrix, denoting the consumption of all 24000 commodities by destination country and final use category. There are six final use categories (yielding 192 x 6 = 1152 columns): 1) food use, 2) other use (non-food), 3) losses, 4) stock addition, 5) balancing, and 6) unspecified.

    X: the total output vector of all 24000 commodities. Total output is equal to the sum of intermediate and final use by commodity.

    L: the Leontief inverse, computed as (I – A)^(-1), where A is the matrix of input coefficients derived from Z and X. Again, there are two versions, depending on the underlying version of Z (L_mass and L_value).

    E: environmental extensions for each of the 24000 commodities, including four resource categories: 1) primary biomass extraction (in tons), 2) land use (in hectares), 3) blue water use (in m3), and 4) green water use (in m3).

    mr_sup_mass/mr_sup_value: For each allocation method (mass/value), the supply table gives the physical supply quantity of each commodity by producing process, with processes in the rows (118 processes x 192 regions = 22656 rows) and commodities in columns (24000 columns).

    mr_use: the use table captures the quantities of each commodity (rows) used as an input in each process (columns).

    A description of the included countries and commodities (i.e. the rows and columns of the Z matrix) can be found in the auxiliary file io_codes.csv. Separate lists of the country sample (including ISO3 codes and continental grouping) and commodities (including moisture content) are given in the files regions.csv and items.csv, respectively. For information on the individual processes, see auxiliary file su_codes.csv. RDS files can be opened in R. Information on how to read these files can be obtained here: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/readRDS

    Except for X.rds, which contains a matrix, all variables are organized as lists, where each element contains a sparse matrix. Please note that values are always given in physical units, i.e. tonnes or head, as specified in items.csv. The suffixes value and mass only indicate the form of allocation chosen for the construction of the symmetric IO tables (for more details see Bruckner et al. 2019). Product, process and country classifications can be found in the file fabio_classifications.xlsx.

    Footprint results are not contained in the database but can be calculated, e.g. by using this script: https://github.com/martinbruckner/fabio_comparison/blob/master/R/fabio_footprints.R
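    For orientation, a minimal sketch of such a footprint calculation is shown below; the year indexing of the list elements, the extension column name, and the file names are assumptions, and the linked fabio_footprints.R script is the authoritative implementation:

    # Illustrative only: a land-use footprint for one year using the mass allocation.
    library(Matrix)                               # the matrices are sparse Matrix objects

    L_mass <- readRDS("L_mass.rds")[["2013"]]     # 24000 x 24000 Leontief inverse (year indexing assumed)
    Y      <- readRDS("Y.rds")[["2013"]]          # 24000 x 1152 final demand
    E      <- readRDS("E.rds")[["2013"]]          # extensions; 'landuse' column assumed
    X      <- readRDS("X.rds")[, "2013"]          # total output vector for the year

    q <- ifelse(X > 0, E$landuse / X, 0)          # direct land-use intensity (ha per tonne)
    footprint <- as.vector(q %*% L_mass %*% Y)    # land use embodied in each final-demand column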

    How to cite:

    To cite FABIO work please refer to this paper:

    Bruckner, M., Wood, R., Moran, D., Kuschnig, N., Wieland, H., Maus, V., Börner, J. 2019. FABIO – The Construction of the Food and Agriculture Input–Output Model. Environmental Science & Technology 53(19), 11302–11312. DOI: 10.1021/acs.est.9b03554

    License:

    This data repository is distributed under the CC BY-NC-SA 4.0 License. You are free to share and adapt the material for non-commercial purposes using proper citation. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original. In case you are interested in a collaboration, I am happy to receive enquiries at martin.bruckner@wu.ac.at.

    Known issues:

    The underlying FAO data have been manipulated to the minimum extent necessary. However, data filling and supply-use balancing required some adaptations. These are documented in the code and are also reflected in the balancing item in the final demand matrices. For proper use of the database, I recommend distributing the balancing item over all other uses proportionally and doing analyses with and without balancing to illustrate uncertainties.

  19. Data from: Replication package for the paper: "A Study on the Pythonic...

    • zenodo.org
    zip
    Updated Nov 10, 2023
    Cite
    Anonymous; Anonymous (2023). Replication package for the paper: "A Study on the Pythonic Functional Constructs' Understandability" [Dataset]. http://doi.org/10.5281/zenodo.10101383
    Explore at:
    zipAvailable download formats
    Dataset updated
    Nov 10, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Replication Package for A Study on the Pythonic Functional Constructs' Understandability

    This package contains several folders and files with code and data used in the study.


    examples/
    Contains the code snippets used as objects of the study, named as reported in Table 1, summarizing the experiment design.

    RQ1-RQ2-files-for-statistical-analysis/
    Contains three .csv files used as input for conducting the statistical analysis and drawing the graphs for addressing the first two research questions of the study. Specifically:

    - ConstructUsage.csv contains the declared frequency usage of the three functional constructs object of the study. This file is used to draw Figure 4.
    - RQ1.csv contains the collected data used for the mixed-effect logistic regression relating the use of functional constructs with the correctness of the change task, and the logistic regression relating the use of map/reduce/filter functions with the correctness of the change task.
    - RQ1Paired-RQ2.csv contains the collected data used for the ordinal logistic regression of the relationship between the perceived ease of understanding of the functional constructs and (i) participants' usage frequency, and (ii) constructs' complexity (except for map/reduce/filter).

    inter-rater-RQ3-files/
    Contains four .csv files used as input for computing the inter-rater agreement for the manual labeling used for addressing RQ3. Specifically, you will find one file for each functional construct, i.e., comprehension.csv, lambda.csv, and mrf.csv, and a different file used for highlighting the reasons why participants prefer to use the procedural paradigm, i.e., procedural.csv.

    Questionnaire-Example.pdf
    This file contains the questionnaire submitted to one of the ten experimental groups within our controlled experiment. Other questionnaires are similar, except for the code snippets used for the first section, i.e., change tasks, and the second section, i.e., comparison tasks.

    RQ2ManualValidation.csv
    This file contains the results of the manual validation being done to sanitize the answers provided by our participants used for addressing RQ2. Specifically, we coded the behavior description using four different levels: (i) correct, (ii) somewhat correct, (iii) wrong, and (iv) automatically generated.

    RQ3ManualValidation.xlsx
    This file contains the results of the open coding applied to address our third research question. Specifically, you will find four sheets, one for each functional construct and one for the procedural paradigm. For each sheet, you will find the provided answers together with the categories assigned to them.

    Appendix.pdf
    This file contains the results of the logistic regression relating the use of map, filter, and reduce functions with the correctness of the change task, not shown in the paper.

    FuncConstructs-Statistics.r
    This file contains an R script that you can reuse to re-run all the analyses conducted and discussed in the paper.

    FuncConstructs-Statistics.ipynb
    This file contains a notebook version of the code to re-execute all the analyses conducted in the paper.
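    For readers who only want a feel for the statistical models named above, the sketch below shows the general shape of a mixed-effect logistic regression on RQ1.csv; the column names are hypothetical, and the full, authoritative analysis is in FuncConstructs-Statistics.r:

    # Illustrative only: correctness of the change task vs. functional construct,
    # with a random intercept per participant (column names are placeholders).
    library(lme4)

    rq1 <- read.csv("RQ1-RQ2-files-for-statistical-analysis/RQ1.csv")
    m <- glmer(correct ~ construct + (1 | participant),
               data = rq1, family = binomial)
    summary(m)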

  20. d

    Statistical Methods in Water Resources - Supporting Materials

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). Statistical Methods in Water Resources - Supporting Materials [Dataset]. https://catalog.data.gov/dataset/statistical-methods-in-water-resources-supporting-materials
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    This dataset contains all of the supporting materials to accompany Helsel, D.R., Hirsch, R.M., Ryberg, K.R., Archfield, S.A., and Gilroy, E.J., 2020, Statistical methods in water resources: U.S. Geological Survey Techniques and Methods, book 4, chapter A3, 454 p., https://doi.org/10.3133/tm4a3. [Supersedes USGS Techniques of Water-Resources Investigations, book 4, chapter A3, version 1.1.]. Supplemental material (SM) for each chapter is available to re-create all examples and figures, and to solve the exercises at the end of each chapter, with relevant datasets provided in an electronic format readable by R. The SM provides (1) datasets as .Rdata files for immediate input into R, (2) datasets as .csv files for input into R or for use with other software programs, (3) R functions that are used in the textbook but not part of a published R package, (4) R scripts to produce virtually all of the figures in the book, and (5) solutions to the exercises as .html and .Rmd files. The suffix .Rmd refers to the file format for code written in the R Markdown language; the .Rmd file that is provided in the SM was used to generate the .html file containing the solutions to the exercises. All data used in the in-text examples, figures, and exercises are not new and are already available through publicly available data portals.
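    A minimal sketch of how such supplemental materials are typically loaded into R is shown below; the file names are placeholders, not the actual SM file names:

    # Illustrative only: loading SM-style materials into an R session.
    load("chapterA_data.Rdata")            # .Rdata files restore objects directly into the workspace
    dat <- read.csv("chapterA_data.csv")   # the same data are also provided as .csv
    source("sm_helper_functions.R")        # functions used in the textbook but not in a package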
