Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Categorical scatterplots with R for biologists: a step-by-step guide
Benjamin Petre1, Aurore Coince2, Sophien Kamoun1
1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK
Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.
Protocol
• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates; in the example, it gives the month and year in which each replicate was performed. The second column ‘Condition’ indicates the experimental conditions (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains the continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.
• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide, paste it into the R console, and execute it. In the dialog box, select the input .csv file from Step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.
• Step 3: save the graph as a .pdf file. Resize the window as desired and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
Notes
• Note 1: install the ggplot2 package. The R script requires the ‘ggplot2’ package to be installed. To install it, go to Packages & Data -> Package Installer, enter ‘ggplot2’ in the Package Search field, and click ‘Get List’. Select ‘ggplot2’ in the Package column and click ‘Install Selected’. Install all dependencies as well.
• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, replace command line #7 in the script with the line below.
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
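For readers without the PowerPoint slide at hand, a minimal sketch of a script along the lines described above (the object names and the read.csv call are assumptions, not the slide's original code):
# Minimal sketch: categorical scatterplot of Value by Condition, colored by Replicate
library(ggplot2)
data <- read.csv(file.choose(), header = TRUE)        # Step 2 dialog box: select the .csv from Step 1
data$Replicate <- as.factor(data$Replicate)           # treat replicates as categories
graph <- ggplot(data, aes(x = Condition, y = Value))
# the final line corresponds to command line #7 mentioned in Note 2
graph + geom_boxplot(outlier.colour = 'black', colour = 'black') + geom_jitter(aes(col = Replicate)) + theme_bw()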
References
Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.
Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035
Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128
To make this a seamless process, I cleaned the data and deleted many variables that I thought were not important for our dataset. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data file has both Lyft and Uber, but it is still a cleaned version of the dataset we downloaded from Kaggle.
You can easily subset the data into the car types that you will be modeling by first loading the CSV into R; here is the code for how to do this:
df <- read.csv('uber.csv')                   # load the cleaned Uber data
df_black <- subset(df, df$name == 'Black')   # keep only the car type you are modeling
write.csv(df_black, "nameofthefileyouwanttosaveas.csv")
getwd()                                      # shows the folder the file was written to
This data release supports an analysis of changes in dissolved organic carbon (DOC) and nitrate concentrations in Buck Creek watershed near Inlet, New York 2001 to 2021. The Buck Creek watershed is a 310-hectare forested watershed that is recovering from acidic deposition within the Adirondack region. The data release includes pre-processed model inputs and model outputs for the Weighted Regressions on Time, Discharge and Season (WRTDS) model (Hirsch and others, 2010) to estimate daily flow normalized concentrations of DOC and nitrate during a 20-year period of analysis. WRTDS uses daily discharge and concentration observations implemented through the Exploration and Graphics for River Trends R package (EGRET) to predict solute concentration using decimal time and discharge as explanatory variables (Hirsch and De Cicco, 2015; Hirsch and others, 2010). Discharge and concentration data are available from the U.S. Geological Survey National Water Information System (NWIS) database (U.S. Geological Survey, 2016). The time series data were analyzed for the entire period, water years 2001 (WY2001) to WY2021 where WY2001 is the period from October 1, 2000 to September 30, 2001. This data release contains 5 comma-separated values (CSV) files, one R script, and one XML metadata file. There are four input files (“Daily.csv”, “INFO.csv”, “Sample_doc.csv”, and “Sample_nitrate.csv”) that contain site information, daily mean discharge, and mean daily DOC or nitrate concentrations. The R script (“Buck Creek WRTDS R script.R”) uses the four input datasets and functions from the EGRET R package to generate estimations of flow normalized concentrations. The output file (“WRTDS_results.csv”) contains model output at daily time steps for each sub-watershed and for each solute. Files are automatically associated with the R script when opened in RStudio using the provided R project file ("Files.Rproj"). All input, output, and R files are in the "Files.zip" folder.
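The provided R script contains the exact options used; as a rough, hedged sketch of the standard EGRET user-data workflow it builds on (file paths and plotting choices here are assumptions, not the script itself):
library(EGRET)
INFO   <- readUserInfo(filePath = ".", fileName = "INFO.csv")          # site information
Daily  <- readUserDaily(filePath = ".", fileName = "Daily.csv")        # daily mean discharge
Sample <- readUserSample(filePath = ".", fileName = "Sample_doc.csv")  # DOC; use Sample_nitrate.csv for nitrate
eList  <- mergeReport(INFO, Daily, Sample)
eList  <- modelEstimation(eList)   # fits WRTDS; flow-normalized daily concentrations end up in eList$Daily
plotConcHist(eList)                # quick look at annual mean and flow-normalized concentrations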
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"
## Root directory
- `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements (a sketch of these two computations follows this file list)
- `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)
- `script`: directory containing all the scripts used to collect and process data. For further details, see README file inside the script directory.
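As a hedged illustration of the two computations `statistics.r` performs (this is not the original script; the column and rater names are assumptions about the CSV layouts):
# Spearman correlation between number of reusing projects and downloads
usage <- read.csv("Dataset/Dataset_model-download_num-prj_correlation.csv")
cor.test(usage$num_projects, usage$downloads, method = "spearman")   # assumed column names
# Cohen's kappa for one of the RQ2 inter-rater agreement files
library(irr)                                  # provides kappa2()
ratings <- read.csv("RQ2/RQ2_isBiased.csv")
kappa2(ratings[, c("rater1", "rater2")])      # assumed rater column names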
## Dataset
- `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed
- `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library
- `Dataset/Dataset_github-Prj_model-Used.csv`: contains usage pairs: project, model
- `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project
- `Dataset/Dataset_model-download_num-prj_correlation.csv`: contains, for each model used by GitHub projects, the name, the task, the number of reusing projects, and the number of downloads
## RQ1
- `RQ1/RQ1_dataset-list.txt`: list of HF datasets
- `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets
- `RQ1/RQ1_analyzeDatasetTags.py`: Python script to analyze model tags for the presence of datasets. It requires unzipping `modelsInfo.zip` into a directory with the same name (`modelsInfo`) at the root of the replication package folder. It writes its output to stdout; redirect it to a file to be analyzed by the `RQ2/countDataset.py` script
- `RQ1/RQ1_countDataset.py`: given the output of `RQ2/analyzeDatasetTags.py` (passed as an argument), produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
- `RQ1/RQ1_datasetTags.csv`: output of `RQ2/analyzeDatasetTags.py`
- `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ2/countDataset.py`
## RQ2
- `RQ2/tableBias.pdf`: table detailing the number of occurrences of different types of bias by model Task
- `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling
- `RQ2/RQ2_isBiased.csv`: file to compute the inter-rater agreement of whether or not a model documents Bias
- `RQ2/RQ2_biasAgrLabels.csv`: file to compute the inter-rater agreement related to bias categories
- `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category
## RQ3
- `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses
- `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different permissiveness
- `RQ3/RQ3_prjs_license.csv`: for each project linked to models, among other fields it indicates the license tag and name
- `RQ3/RQ3_models_license.csv`: for each model, indicates, among other information, whether the model has a license and, if so, what kind of license
- `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)
- `RQ3/RQ3_models_prjs_licenses_with_type.csv`: pairs project-model, with their respective licenses and permissiveness level
## scripts
Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README
The U.S. Geological Survey, in cooperation with the U.S. Environmental Protection Agency's Long Island Sound Study (https://longislandsoundstudy.net), characterized nitrogen export from forested watersheds and whether nitrogen loading has been increasing or decreasing to help inform Long Island Sound management strategies. The Weighted Regressions on Time, Discharge, and Season (WRTDS; Hirsch and others, 2010) method was used to estimate annual concentrations and fluxes of nitrogen species using long-term records (14 to 37 years in length) of stream total nitrogen, dissolved organic nitrogen, nitrate, and ammonium concentrations and daily discharge data from 17 watersheds located in the Long Island Sound basin or in nearby areas of Massachusetts, New Hampshire, or New York. This data release contains the input water-quality and discharge data, annual outputs (including concentrations, fluxes, yields, and confidence intervals about these estimates), statistical tests for trends between the periods of water years 1999-2000 and 2016-2018, and model diagnostic statistics. These datasets are organized into one zip file (WRTDSeLists.zip) and six comma-separated values (csv) data files (StationInformation.csv, AnnualResults.csv, TrendResults.csv, ModelStatistics.csv, InputWaterQuality.csv, and InputStreamflow.csv). The csv file (StationInformation.csv) contains information about the stations and input datasets. Finally, a short R script (SampleScript.R) is included to facilitate viewing the input and output data and to re-run the model. Reference: Hirsch, R.M., Moyer, D.L., and Archfield, S.A., 2010, Weighted Regressions on Time, Discharge, and Season (WRTDS), with an application to Chesapeake Bay River inputs: Journal of the American Water Resources Association, v. 46, no. 5, p. 857–880.
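The provided SampleScript.R is the authoritative way to view these data and re-run the model; a hedged sketch of simply inspecting the annual output table (the column names used here, such as Station, Year, and FNConc, are assumptions about the CSV layout):
annual <- read.csv("AnnualResults.csv")
site1  <- subset(annual, Station == unique(annual$Station)[1])   # pick one of the 17 watersheds
plot(site1$Year, site1$FNConc, type = "l",
     xlab = "Water year", ylab = "Flow-normalized concentration")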
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Corresponding peer-reviewed publication
This dataset corresponds to all the RAPID input and output files that were used in the study reported in:
David, Cédric H., Florence Habets, David R. Maidment and Zong-Liang Yang (2011), RAPID applied to the SIM-France model, Hydrological Processes, 25(22), 3412-3425. DOI: 10.1002/hyp.8070.
When making use of any of the files in this dataset, please cite both the aforementioned article and the dataset herein.
Time format
The times reported in this description all follow the ISO 8601 format. For example 2000-01-01T16:00-06:00 represents 4:00 PM (16:00) on Jan 1st 2000 (2000-01-01), Central Standard Time (-06:00). Additionally, when time ranges with inner time steps are reported, the first time corresponds to the beginning of the first time step, and the second time corresponds to the end of the last time step. For example, the 3-hourly time range from 2000-01-01T03:00+00:00 to 2000-01-01T09:00+00:00 contains two 3-hourly time steps. The first one starts at 3:00 AM and finishes at 6:00AM on Jan 1st 2000, Universal Time; the second one starts at 6:00 AM and finishes at 9:00AM on Jan 1st 2000, Universal Time.
Data sources
The following sources were used to produce files in this dataset:
The hydrographic network of SIM-France, as published in Habets, F., A. Boone, J. L. Champeaux, P. Etchevers, L. Franchistéguy, E. Leblois, E. Ledoux, P. Le Moigne, E. Martin, S. Morel, J. Noilhan, P. Quintana Seguí, F. Rousset-Regimbeau, and P. Viennot (2008), The SAFRAN-ISBA-MODCOU hydrometeorological model applied over France, Journal of Geophysical Research: Atmospheres, 113(D6), DOI: 10.1029/2007JD008548.
The observed flows are from Banque HYDRO, Service Central d'Hydrométéorologie et d'Appui à la Prévision des Inondations. Available at http://www.hydro.eaufrance.fr/index.php.
Outputs from a simulation using SIM-France (Habets et al. 2008). The simulation was run by Florence Habets, and produced 3-hourly time steps from 1995-08-01T00:00+02:00 to 2005-07-31T21:00+02:00. Further details on the inputs and options used for this simulation are provided in David et al. (2011).
Software
The following software were used to produce files in this dataset:
The Routing Application for Parallel computation of Discharge (RAPID, David et al. 2011, http://rapid-hub.org), Version 1.1.0. Further details on the inputs and options used for this series of simulations are provided below and in David et al. (2011).
ESRI ArcGIS (http://www.arcgis.com).
Microsoft Excel (https://products.office.com/en-us/excel).
The GNU Compiler Collection (https://gcc.gnu.org) and the Intel compilers (https://software.intel.com/en-us/intel-compilers).
Study domain
The files in this dataset correspond to one study domain:
The river network of SIM-France is made of 24264 river reaches. The temporal range corresponding to this domain is from 1995-08-01T00:00+02:00 to 2005-07-31T21:00+02:00.
Description of files
All files below were prepared by Cédric H. David, using the data sources and software mentioned above.
rapid_connect_France.csv. This CSV file contains the river network connectivity information and is based on the unique IDs of the SIM-France river reaches (the IDs). For each river reach, this file specifies: the ID of the reach, the ID of the unique downstream reach, the number of upstream reaches with a maximum of four reaches, and the IDs of all upstream reaches. A value of zero is used in place of NoData. The river reaches are sorted in increasing value of ID. The values were computed based on the SIM-France FICVID file. This file was prepared using a Fortran program.
m3_riv_France_1995_2005_ksat_201101_c_zvol_ext.nc. This netCDF file contains the 3-hourly accumulated inflows of water (in cubic meters) from surface and subsurface runoff into the upstream point of each river reach. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The time range for this file is from 1995-08-01T00:00+02:00 to 2005-07-31T21:00+02:00. The values were computed using the outputs of SIM-France. This file was prepared using a Fortran program.
kfac_modcou_1km_hour.csv. This CSV file contains a first guess of Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: ID, size of the side of the grid cell, Equation (5) in David et al. (2011), and using a wave celerity of 1 km/h. This file was prepared using a Fortran program.
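As a rough illustration of the idea behind this first guess, dividing a reach length scale by the 1 km/h wave celerity to get k in seconds (this is not the original Fortran program, and Equation (5) may differ in detail; the reach lengths below are hypothetical):
reach_length_m <- c(1200, 850, 3000)        # hypothetical reach lengths in meters
celerity_m_s   <- 1000 / 3600               # 1 km/h expressed in meters per second
kfac_s         <- reach_length_m / celerity_m_s   # first-guess Muskingum k, in seconds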
kfac_modcou_ttra_length.csv. This CSV file contains a second guess of Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: ID, size of the side of the grid cell, travel time, and Equation (9) in David et al. (2011).
k_modcou_0.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_1.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_2.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_3.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_4.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_a.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_b.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_c.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_0.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_1.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_2.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_3.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_4.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_a.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_b.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on Table (2) in David et al. (2011).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the raw data for the study:
Characterizing Intraspecific Resource Utilization in an Aquatic Consumer Using High-Throughput Phenotyping
Data are provided separately for the first experiment (numerical response experiment with 16 rotifer clones across six food concentrations) and the second experiment (growth rate measurements with 98 rotifer clones across two food concentrations).
Contents of first_experiment.zip
input/ This folder contains raw count data (output of the Wellcounter software):
popgrowth_
output/ output files produced by the R-script 'first_experiment_analysis.Rmd'
wellcounter/ contains the Wellcounter software (programs and configuration files) that were used for running the raw analysis of this dataset on a High Performance Computing cluster
first_experiment_analysis.Rmd R-Markdown file with data processing and statistical analysis of the first experiment
numerical_response_2par.R A function required by 'first_experiment_analysis.Rmd'
Contents of second_experiment.zip
input/ This folder contains raw count and behavioral data (output of the Wellcounter software):
popgrowth_
output/ output files produced by the R-script 'second_experiment_analysis.Rmd'
wellcounter/ contains the Wellcounter software (programs and configuration files) that were used for running the raw analysis (image and motion analysis) of this dataset on a High Performance Computing cluster
second_experiment_prep_run1.Rmd R-Markdown file for preprocessing the data from run1
second_experiment_prep_run2.Rmd R-Markdown file for preprocessing the data from run2
second_experiment_analysis.Rmd R-Markdown file with data processing and statistical analysis of the second experiment
extract_fixed_effects_table.R A function required by 'second_experiment_analysis.Rmd'
A routine was developed in R ('bathy_plots.R') to plot bathymetry data over time during individual CEAMARC events, so that benthic data can be analysed in relation to habitat, i.e. did we trawl over a slope or was the sea floor relatively flat. Note that the depth range in the plots is autoscaled to the data, so a small range in depths appears as a scattering of points; as long as you look at the depth scale, interpretation will be fine. The R script needs a bathymetry file, '200708V3_one_minute.csv', which contains a data export from the underway PostgreSQL ship database, and 'events.csv', which is a stripped-down version of the events export from the shipboard events database. If you wish to run the code again you may need to change the pathnames in the R script to relevant locations. If you have opened the csv files in Excel at any stage and the R script gets an error, you may need to format the date/time columns as yyyy-mm-dd hh:mm:ss, save and close the file as csv without opening it again, and then run the R script. However, all output files are provided here for every CEAMARC event. Filenames contain a reference to the CEAMARC event id. Files are in EPS format and can be viewed using Ghostview, which is available as a free download on the internet.
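A hedged sketch of the plotting idea in 'bathy_plots.R' (this is not the original routine; the column names date_time, depth, start_time, end_time, and event_id are assumptions about the two csv files):
bathy  <- read.csv('200708V3_one_minute.csv', stringsAsFactors = FALSE)
events <- read.csv('events.csv', stringsAsFactors = FALSE)
bathy$date_time <- as.POSIXct(bathy$date_time, format = '%Y-%m-%d %H:%M:%S', tz = 'UTC')
ev  <- events[1, ]                               # one CEAMARC event
sel <- bathy$date_time >= as.POSIXct(ev$start_time, tz = 'UTC') &
       bathy$date_time <= as.POSIXct(ev$end_time, tz = 'UTC')
plot(bathy$date_time[sel], bathy$depth[sel], type = 'l',
     xlab = 'Time', ylab = 'Depth (m)',
     main = paste('CEAMARC event', ev$event_id))  # depth axis autoscales to the data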
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We investigate apparent-time tone variation in the Black Lahu language (Loloish/Ngwi, Tibeto-Burman) of Yunnan, China. These are the supplementary materials for the paper "Generational differences in the low tones of Black Lahu," accepted for publication in Linguistics Vanguard.
Appendices:
Script files contained in the analysis:
Data files contained in this analysis:
This data set contains QA/QC-ed (Quality Assurance and Quality Control) water level data for the PLM1 and PLM6 wells. PLM1 and PLM6 are location identifiers used by the Watershed Function SFA project for two groundwater monitoring wells along an elevation gradient located along the lower montane life zone of a hillslope near the Pumphouse location at the East River Watershed, Colorado, USA. These wells are used to monitor subsurface water and carbon inventories and fluxes, and to determine the seasonally dependent flow of groundwater under the PLM hillslope. The downslope flow of groundwater in combination with data on groundwater chemistry (see related references) can be used to estimate rates of solute export from the hillslope to the floodplain and river. QA/QC analysis of measured groundwater levels in monitoring wells PLM-1 and PLM-6 included identification and flagging of duplicated timestamps, gap filling of missing timestamps and water levels, and removal of abnormal/bad values and outliers from the measured water levels. The QA/QC analysis also tested the application of different QA/QC methods and the development of regular (5-minute, 1-hour, and 1-day) time series datasets, which can serve as a benchmark for testing other QA/QC techniques, and will be applicable for ecohydrological modeling. The package includes a Readme file, one R code file used to perform QA/QC, a series of 8 data csv files (six QA/QC-ed regular time series datasets of varying intervals (5-min, 1-hr, 1-day) and two files with QA/QC flagging of original data), and three files for the reporting format adoption of this dataset (InstallationMethods, file level metadata (flmd), and data dictionary (dd) files). QA/QC-ed data herein were derived from the original/raw data publication available at Williams et al., 2020 (DOI: 10.15485/1818367). For more information about running the R code file (10.15485_1866836_QAQC_PLM1_PLM6.R) to reproduce QA/QC output files, see the README (QAQC_PLM_readme.docx). This dataset replaces the previously published raw data time series, and is the final groundwater data product for the PLM wells in the East River. Complete metadata information on the PLM1 and PLM6 wells is available in a related dataset on ESS-DIVE: Varadharajan C, et al (2022). https://doi.org/10.15485/1660962. These data products are part of the Watershed Function Scientific Focus Area collection effort to further scientific understanding of biogeochemical dynamics from genome to watershed scales. 2022/09/09 Update: Converted data files using ESS-DIVE’s Hydrological Monitoring Reporting Format. With the adoption of this reporting format, three new files (v1_20220909_flmd.csv, v1_20220909_dd.csv, and InstallationMethods.csv) were added. The file-level metadata file (v1_20220909_flmd.csv) contains information specific to the files contained within the dataset. The data dictionary file (v1_20220909_dd.csv) contains definitions of column headers and other terms across the dataset. The installation methods file (InstallationMethods.csv) contains a description of methods associated with installation and deployment at the PLM1 and PLM6 wells. Additionally, eight data files were re-formatted to follow the reporting format guidance (er_plm1_waterlevel_2016-2020.csv, er_plm1_waterlevel_1-hour_2016-2020.csv, er_plm1_waterlevel_daily_2016-2020.csv, QA_PLM1_Flagging.csv, er_plm6_waterlevel_2016-2020.csv, er_plm6_waterlevel_1-hour_2016-2020.csv, er_plm6_waterlevel_daily_2016-2020.csv, QA_PLM6_Flagging.csv).
The major changes to the data files include the addition of header_rows above the data containing metadata about the particular well, units, and sensor description. 2023/01/18 Update: Dataset updated to include additional QA/QC-ed water level data up until 2022-10-12 for ER-PLM1 and 2022-10-13 for ER-PLM6. Reporting format specific files (v2_20230118_flmd.csv, v2_20230118_dd.csv, v2_20230118_InstallationMethods.csv) were updated to reflect the additional data. R code file (QAQC_PLM1_PLM6.R) was added to replace the previously uploaded HTML files to enable execution of the associated code. R code file (QAQC_PLM1_PLM6.R) and ReadMe file (QAQC_PLM_readme.docx) were revised to clarify where original data was retrieved from and to remove local file paths.
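A hedged illustration of two of the QA/QC steps described above, flagging duplicated timestamps and building a regular 1-hour series (this is not the published R code; the column names timestamp and water_level, and the number of header rows to skip, are assumptions):
wl <- read.csv('er_plm1_waterlevel_2016-2020.csv')   # may need skip = n for the header_rows above the data
wl$timestamp <- as.POSIXct(wl$timestamp, tz = 'UTC')
wl$dup_flag  <- duplicated(wl$timestamp)             # flag duplicated timestamps
wl <- wl[!wl$dup_flag, ]
hours <- seq(min(wl$timestamp), max(wl$timestamp), by = '1 hour')
wl_hourly <- data.frame(
  timestamp   = hours,
  water_level = approx(as.numeric(wl$timestamp), wl$water_level,
                       xout = as.numeric(hours))$y   # linear gap filling onto the regular grid
)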
State Harvest Data (csv): commercial snapping turtle harvest data (in individuals) for eleven states from 1998 to 2013. States reporting are Arkansas, Delaware, Iowa, Maryland, Massachusetts, Michigan, Minnesota, New Jersey, North Carolina, Pennsylvania, and Virginia. File: StateHarvestData.csv
Input and execution code for Colteaux_Johnson_2016: attached R file includes the code described in the listed publication. The companion JAGS (Just Another Gibbs Sampler) code is also stored in this repository under separate cover. File: ColteauxJohnsonNatureConservation.R
JAGS model code for Colteaux_Johnson_2016: attached R file includes the JAGS (Just Another Gibbs Sampler) code described in the listed publication. The companion input and execution code is also stored in this repository under separate cover. File: ColteauxJohnsonNatureConservationJAGS.R
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Corresponding peer-reviewed publication
This dataset corresponds to all the RAPID input and output files that were used in the study reported in:
David, Cédric H., David R. Maidment, Guo-Yue Niu, Zong-Liang Yang, Florence Habets and Victor Eijkhout (2011), River Network Routing on the NHDPlus Dataset, Journal of Hydrometeorology, 12(5), 913-934. DOI: 10.1175/2011JHM1345.1.
When making use of any of the files in this dataset, please cite both the aforementioned article and the dataset herein.
Time format
The times reported in this description all follow the ISO 8601 format. For example 2000-01-01T16:00-06:00 represents 4:00 PM (16:00) on Jan 1st 2000 (2000-01-01), Central Standard Time (-06:00). Additionally, when time ranges with inner time steps are reported, the first time corresponds to the beginning of the first time step, and the second time corresponds to the end of the last time step. For example, the 3-hourly time range from 2000-01-01T03:00+00:00 to 2000-01-01T09:00+00:00 contains two 3-hourly time steps. The first one starts at 3:00 AM and finishes at 6:00AM on Jan 1st 2000, Universal Time; the second one starts at 6:00 AM and finishes at 9:00AM on Jan 1st 2000, Universal Time.
Data sources
The following sources were used to produce files in this dataset:
The National Hydrography Dataset Plus (NHDPlus) Version 1, obtained from http://www.horizon-systems.com/nhdplus.
The National Water Information System (NWIS), obtained from http://waterdata.usgs.gov/nwis.
Outputs from a simulation using the community Noah land surface model with multiparameterization options (Noah-MP, Niu et al. 2011, http://www.jsg.utexas.edu/noah-mp). The simulation was run by Guo-Yue Niu, and produced 3-hourly time steps from 2004-01-01T00:00+00:00 to 2008-01-01T00:00+00:00. Further details on the inputs and options used for this simulation are provided in David et al. (2011).
Software
The following software were used to produce files in this dataset:
The Routing Application for Parallel computation of Discharge (RAPID, David et al. 2011, http://rapid-hub.org), Version 1.0.0. Further details on the inputs and options used for this series of simulations are provided below and in David et al. (2011).
ESRI ArcGIS (http://www.arcgis.com).
Microsoft Excel (https://products.office.com/en-us/excel).
CUAHSI HydroGET (http://his.cuahsi.org/hydroget.html).
The GNU Compiler Collection (https://gcc.gnu.org) and the Intel compilers (https://software.intel.com/en-us/intel-compilers).
Study domain
The files in this dataset correspond to two study domains:
The combination of the San Antonio and Guadalupe River Basins, TX. RAPID can only use the river reaches of NHDPlus that have a known flow direction, so the focus here is on these reaches (a total of 5,175). The temporal range corresponding to this domain is from 2004-01-01T00:00-06:00 to 2007-12-31T00:00-06:00.
The Upper Mississippi River Basin. RAPID can only use the river reaches of NHDPlus that have a known flow direction, so the focus here is on these reaches (a total of 182,240). The temporal range corresponding to this domain spans 100 fictitious days.
Description of files for the San Antonio and Guadalupe River Basins
All files below were prepared by Cédric H. David, using the data sources and software mentioned above.
rapid_connect_San_Guad.csv. This CSV file contains the river network connectivity information and is based on the unique IDs of NHDPlus reaches (the COMIDs). For each river reach, this file specifies: the COMID of the reach, the COMID of the unique downstream reach, the number of upstream reaches with a maximum of four reaches, and the COMIDs of all upstream reaches. A value of zero is used in place of NoData. The river reaches are sorted in increasing value of COMID. The values were computed using a combination of the following NHDPlus fields: COMID, DIVERGENCE, FROMNODE and TONODE. This file was prepared using ArcGIS and Excel.
m3_riv_San_Guad_2004_2007_cst.nc. This netCDF file contains the 3-hourly accumulated inflows of water (in cubic meters) from surface and subsurface runoff into the upstream point of each river reach. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The time range for this file is from 2004-01-01T00:00-06:00 to 2007-12-31T18:00-06:00. The values were computed by superimposing a 900-m gridded map of NHDPlus catchments to the outputs of Noah-MP. This file was prepared using ArcGIS and a Fortran program.
kfac_San_Guad_1km_hour.csv. This CSV file contains a first guess of Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following NHDPlus fields: COMID, LENGTHKM, Equation (13) in David et al. (2011), and using a wave celerity of 1 km/h. This file was prepared using a Fortran program.
kfac_San_Guad_celerity.csv. This CSV file contains a first guess of Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following NHDPlus fields: COMID, LENGTHKM, Equation (13) in David et al. (2011), and using the wave celerity numbers of Table 2 in David et al. (2011). This file was prepared using a Fortran program.
k_San_Guad_2004_1.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following NHDPlus fields: COMID, LENGTHKM, and using Equation (17) in David et al. (2011). This file was prepared using a Fortran program.
k_San_Guad_2004_2.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following NHDPlus fields: COMID, LENGTHKM, and using Equation (18) in David et al. (2011). This file was prepared using a Fortran program.
k_San_Guad_2004_3.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following NHDPlus fields: COMID, LENGTHKM, and using Equation (19) in David et al. (2011). This file was prepared using a Fortran program.
k_San_Guad_2004_4.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following NHDPlus fields: COMID, LENGTHKM, and using Equation (21) in David et al. (2011). This file was prepared using a Fortran program.
x_San_Guad_2004_1.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on Equation (17) in David et al. (2011). This file was prepared using a Fortran program.
x_San_Guad_2004_2.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on Equation (18) in David et al. (2011). This file was prepared using a Fortran program.
x_San_Guad_2004_3.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on Equation (19) in David et al. (2011). This file was prepared using a Fortran program.
x_San_Guad_2004_4.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on Equation (21) in David et al. (2011). This file was prepared using a Fortran program.
basin_id_San_Guad_hydroseq.csv. This CSV file contains the list of unique IDs of NHDPlus river reaches (COMID) in the San Antonio and Guadalupe River Basins. The river reaches are sorted from upstream to downstream. The values were computed using the following NHDPlus fields: COMID and HYDROSEQ. This file was prepared using Excel.
Qout_San_Guad_1460days_p1_dtR=900s.nc. This netCDF file contains the 3-hourly averaged outputs (in cubic meters per second) from RAPID corresponding to the downstream point of each reach. The river reaches have the same COMIDs and are sorted similarly to basin_id_San_Guad_hydroseq.csv. The time range for this file is from 2004-01-01T00:00-06:00 to 2007-12-31T00:00-06:00. The values were computed using the Muskingum method with parameters of Equation (17) in David et al. (2011). This file was prepared using RAPID v1.0.0 running with the preonly ILU solver on one core.
Qout_San_Guad_1460days_p2_dtR=900s.nc. This netCDF file contains the 3-hourly averaged outputs (in cubic meters per second) from RAPID corresponding to the downstream point of each reach. The river reaches have the same COMIDs and are sorted similarly to basin_id_San_Guad_hydroseq.csv. The time range for this file is from 2004-01-01T00:00-06:00 to 2007-12-31T00:00-06:00. The values were computed using the Muskingum method with parameters of Equation (18) in David et al. (2011). This file was prepared using RAPID v1.0.0 running with the preonly ILU solver on one core.
Qout_San_Guad_1460days_p3_dtR=900s.nc. This netCDF file contains the 3-hourly averaged outputs (in cubic meters per second) from RAPID corresponding to the downstream point of each reach.
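For readers who want to inspect one of these Qout files in R, a hedged sketch using the ncdf4 package (the variable name 'Qout' and the reach-by-time dimension order are assumptions about the file layout):
library(ncdf4)
nc <- nc_open('Qout_San_Guad_1460days_p1_dtR=900s.nc')
print(names(nc$var))                 # list the variables actually present in the file
qout <- ncvar_get(nc, 'Qout')        # discharge in m3/s; assumed dimensions: reach x time
nc_close(nc)
plot(qout[1, ], type = 'l', xlab = '3-hourly time step', ylab = 'Discharge (m3/s)')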
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
S1 File. SI_C01_SPD_KDE_models: R script for analysing radiocarbon dates. The code computes over-regional and regional SPD and KDE models and exports them to CSV files (Rmd).
S2 File. SI_C02_aoristic_dating: R script for exporting aoristic time series derived from typochronologically dated archaeological material as CSV files (Rmd).
S3 File. SI_C03_vegetation_openness_score_example: R script computing a vegetation openness score from pollen records and exporting the generated time series as a CSV file (Rmd).
S4 File. SI_C04_data_preparation: Jupyter Notebook importing and transforming the relevant data to visualize the plots exhibited in the paper (ipynb).
S5 File. SI_C05_figures_extra: Jupyter Notebook visualizing the plots exhibited in the paper (ipynb).
S1 Data. SI_D01_reg_data_no_dups: spreadsheet holding radiocarbon dates, with the information of laboratory identification, site name, geographical coordinates, site type, material, source, and regional affiliation (csv).
S2 Data. SI_D02_reg_axe_dagger_graves: spreadsheet holding entries of axes and daggers, with the information of context, site, parish, artefact identification, type, subtype, absolute dating, typochronological dating, references, geographical coordinates, and regional affiliation (csv).
S3 Data. SI_D03_pollen_example: spreadsheet holding sample entries of the pollen records from Krageholm (Neotoma Site ID 3204) and Bjäresjöholmsjön (Neotoma Site ID 3017) for an example run of S3 File. Records can be accessed via the Neotoma Explorer (https://apps.neotomadb.org/explorer/) with their given IDs. Each entry holds the record type, regional affiliation, absolute BP and BCE dating, as well as the counts of given plant taxa (csv).
S4 Data. SI_D04_PAP_303600_TOC_LOI: table holding sample entries of TOC content, LOI, and SST reconstruction of sediment core PAP_303600 for correlations of population development with Baltic Sea surface temperature. Available via 10.1594/PANGAEA.883292 (tab).
S5 Data. SI_D05_vos_[…]: spreadsheets holding the vegetation openness score time series of lake Belau, Vinge, Northern Jutland, and Zealand (csv). (ZIP)
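The listing above does not name the R package the SPD/KDE scripts rely on; a common workflow for summed probability distributions uses the rcarbon package, sketched here with hypothetical column names and an arbitrary cal BP time range:
library(rcarbon)
dates <- read.csv('SI_D01_reg_data_no_dups.csv')            # assumed columns: c14age, c14error
cal   <- calibrate(x = dates$c14age, errors = dates$c14error,
                   calCurves = 'intcal20')                  # calibrate the radiocarbon dates
spd_all <- spd(cal, timeRange = c(6000, 3000))              # summed probability distribution, cal BP
plot(spd_all)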
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
MCF10A non-tumorigenic breast cells were dosed with environmental toxicants and stained with multiple cellular stains to study morphological perturbations. Following up on the feature results, MCF10A cells were stained with an anti-beta-catenin antibody to study beta-catenin nuclear translocation. CellProfiler software was used to measure and export per-cell data in .CSV format for further analysis in BMDExpress2 and RStudio.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package contains all datasets, scripts, and documentation used in our empirical study. The package is organized into five main folders, corresponding to the major stages of the study: dataset construction and the analyses for RQ1–RQ4.
1. Dataset_construction. This folder contains scripts for constructing the dataset.
- find_ai_prs_with_100_stars.py: filters PRs from repositories with at least 100 GitHub stars. Output: ai_pull_requests_over_100stars.csv
- applying_security_keywords.py: expands the filtered dataset by applying a comprehensive list of security-related keywords. Output: ai_prs_security_candidates_expanded.csv
- applying_gemini_to_get_final_dataset.py: uses Gemini-based model validation to confirm whether each candidate PR is truly security-related. Output: final_dataset.csv
- annotator_security_prs_sample_245.xlsx: contains a manually annotated subset (n=245) of PRs used for validation of model predictions.
- dataset_agreement_analysis.py: computes inter-annotator agreement metrics (Cohen’s κ) for the manual sample.
2. RQ1_analysis. This folder contains scripts used to answer RQ1. Input: final_dataset.csv
- run_semgrep.py: runs Semgrep across all PR code changes. Output: all_prs_with_semgrep.csv
- RQ1_analysis.py: aggregates and analyzes vulnerability types to address RQ1.
3. RQ2_analysis. This folder analyzes RQ2.
- Subfolder feature_extraction/ (input: final_dataset.csv): find_factors.py extracts PR- and repository-level features. Output: ai_factors.csv
- Subfolder regression_analysis/ (input: ai_factors.csv): PR_latency.R, PR_acceptance.R, and common.R perform the regression analyses.
4. RQ3_analysis. This folder contains scripts for RQ3.
- find_commits.py: extracts all commits associated with PRs listed in final_dataset.csv. Output: ai_security_prs_with_commits.csv
- C-Good.py: replicates the pretrained commit message quality model (C-Good) proposed by Tian et al.
- preprocessor_step1.py, preprocessor_step2.py, preprocessor_step3.py: sequentially preprocess the original messages.csv dataset (from Tian et al.) for model training. Output: trained model bert_commit_model.pth
- preprocessing_my_dataset.py: applies the same preprocessing pipeline to ai_security_prs_with_commits.csv. Output: ai_security_prs_with_commits_preprocessed.csv
- testing_my_dataset.py: loads bert_commit_model.pth, evaluates commits, and produces commit-level quality labels. Output: ai_security_prs_with_commits_predictions.csv
- commit_message_sample_339.csv: contains a random sample of 339 commit messages manually reviewed to verify the accuracy of the model predictions.
- manual_verification.py: analyzes commit_message_sample_339.csv by comparing manual ratings with model-predicted labels.
- rq3_analysis.py: performs the analysis of commit message quality.
5. RQ4_analysis. This folder contains scripts and resources for RQ4.
- find_rejected_prs_comments.py: identifies PRs closed without merging from final_dataset.csv and collects maintainer review comments. Output: rejected_pr_comments.csv
- Rubrics.docx: defines categories and guidelines for manual annotation.
- Manual annotation: two annotators manually review rejected_pr_comments.csv following the rubric. Output: manual_labeling.xlsx
- rq4_analysis.py: analyzes the annotated dataset (manual_labeling.xlsx).
BSD 2-Clause License: https://opensource.org/licenses/BSD-2-Clause
This data repository feeds into the meta-repository set up for post-processing of GCAM-SAM outputs. The GitHub link of the meta-repository is: https://github.com/JGCRI/Kyle-etal_2022_EF
Folders: model/ is the static version of the model used to simulate the 8 scenarios; see the GCAM-SAM GitHub repository to follow active development of this model. inputs/ contains input datasets and scripts used to prepare files during post-processing; it is to be used with the GitHub post-processing meta-repository. outdata/ contains GCAM-SAM output and post-processed output files used to plot figures.
Key files: SAM-matrix.dat is the consolidated GCAM-SAM output; use proj_load.R in the meta-repository to read the file. region_vals.csv has all 8 indicators in all 8 scenarios for years 2020 through 2100 at a 10-year time step.
Short introduction to the study:
In this paper, the Sustainable Agriculture Matrix (SAM) is estimated to 2100 using the Global Change Analysis Model (GCAM). We model combinatorial variations of yield intensification, dietary shift, and greenhouse gas mitigation scenarios. Findings include scenarios with significant tradeoffs across multiple environmental, economic, and social dimensions. Assessment of these multi-dimensional tradeoffs in a consistent framework improves the quality of information for decision-making.
Should you have any questions, feel free to reach out to Page Kyle at pkyle@pnnl.gov.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Replication Package
This repository contains data and source files needed to replicate our work described in the paper "Unboxing Default Argument Breaking Changes in Scikit Learn".
Requirements
We recommend the following requirements to replicate our study:
Package Structure
We relied on Docker containers to provide a working environment that is easier to replicate. Specifically, we configure the following containers:
- data-analysis, an R-based container we used to run our data analysis.
- data-collection, a Python container we used to collect Scikit's default arguments and detect them in client applications.
- database, a Postgres container we used to store clients' data, obtained from Grotov et al.
- storage, a directory used to store the data processed in data-analysis and data-collection. This directory is shared by both containers.
- docker-compose.yml, the Docker file that configures all containers used in the package.

In the remainder of this document, we describe how to set up each container properly.
Using VSCode to Set Up the Package
We selected VSCode as the IDE of choice because its extensions allow us to implement our scripts directly inside the containers. In this package, we provide configuration parameters for both the data-analysis and data-collection containers. This way, you can directly open and run each container inside VSCode without any specific configuration.
You first need to set up the containers:
$ cd /replication/package/folder
$ docker-compose build
$ docker-compose up
# Wait docker creating and running all containers
Then, you can open them in Visual Studio Code:
If you want/need a more customized organization, the remainder of this file describes it in detail.
Longest Road: Manual Package Setup
Database Setup
The database container will automatically restore the dump in dump_matroskin.tar on its first launch. To set up and run the container, you should:
Build an image:
$ cd ./database
$ docker build --tag 'dabc-database' .
$ docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
dabc-database latest b6f8af99c90d 50 minutes ago 18.5GB
Create and enter inside the container:
$ docker run -it --name dabc-database-1 dabc-database
$ docker exec -it dabc-database-1 /bin/bash
root# psql -U postgres -h localhost -d jupyter-notebooks
jupyter-notebooks=# \dt
List of relations
Schema | Name | Type | Owner
--------+-------------------+-------+-------
public | Cell | table | root
public | Code_cell | table | root
public | Md_cell | table | root
public | Notebook | table | root
public | Notebook_features | table | root
public | Notebook_metadata | table | root
public | repository | table | root
If you got the tables list as above, your database is properly set up.
It is important to mention that this database is extended from the one provided by Grotov et al. Basically, we added three columns to the table Notebook_features (API_functions_calls, defined_functions_calls, and other_functions_calls) containing the function calls performed by each client in the database.
Data Collection Setup
This container is responsible for collecting the data to answer our research questions. It has the following structure:
- dabcs.py, extracts DABCs from Scikit Learn source code and exports them to a CSV file.
- dabcs-clients.py, extracts function calls from clients and exports them to a CSV file. We rely on a modified version of Matroskin to leverage the function calls. You can find the tool's source code in the `matroskin` directory.
- Makefile, commands to set up and run both dabcs.py and dabcs-clients.py.
- matroskin, the directory containing the modified version of the matroskin tool. We extended the library to collect the function calls performed in the client notebooks of Grotov's dataset.
- storage, a docker volume where data-collection should save the exported data. This data will be used later in Data Analysis.
- requirements.txt, Python dependencies adopted in this module.

Note that the container will automatically configure this module for you, e.g., install dependencies, configure matroskin, download the Scikit Learn source code, etc. For this, you must run the following commands:
$ cd ./data-collection
$ docker build --tag "data-collection" .
$ docker run -it -d --name data-collection-1 -v $(pwd)/:/data-collection -v $(pwd)/../storage/:/data-collection/storage/ data-collection
$ docker exec -it data-collection-1 /bin/bash
$ ls
Dockerfile Makefile config.yml dabcs-clients.py dabcs.py matroskin storage requirements.txt utils.py
If you see project files, it means the container is configured accordingly.
Data Analysis Setup
We use this container to conduct the analysis over the data produced by the Data Collection container. It has the following structure:
- dependencies.R, an R script containing the dependencies used in our data analysis.
- data-analysis.Rmd, the R notebook we used to perform our data analysis.
- datasets, a docker volume pointing to the storage directory.

Execute the following commands to run this container:
$ cd ./data-analysis
$ docker build --tag "data-analysis" .
$ docker run -it -d --name data-analysis-1 -v $(pwd)/:/data-analysis -v $(pwd)/../storage/:/data-collection/datasets/ data-analysis
$ docker exec -it data-analysis-1 /bin/bash
$ ls
data-analysis.Rmd datasets dependencies.R Dockerfile figures Makefile
If you see project files, it means the container is configured accordingly.
A note on storage shared folder
As mentioned, the storage folder is mounted as a volume and shared between the data-collection and data-analysis containers. We compressed the contents of this folder due to space constraints. Therefore, before starting work on Data Collection or Data Analysis, make sure you have extracted the compressed files. You can do this by running the Makefile inside the storage folder.
$ make unzip # extract files
$ ls
clients-dabcs.csv clients-validation.csv dabcs.csv Makefile scikit-learn-versions.csv versions.csv
$ make zip # compress files
$ ls
csv-files.tar.gz Makefile
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data supporting the Master thesis "Monitoring von Open Data Praktiken - Herausforderungen beim Auffinden von Datenpublikationen am Beispiel der Publikationen von Forschenden der TU Dresden" (Monitoring open data practices - challenges in finding data publications using the example of publications by researchers at TU Dresden) - Katharina Zinke, Institut für Bibliotheks- und Informationswissenschaften, Humboldt-Universität Berlin, 2023
This ZIP file contains the data the thesis is based on, interim exports of the results, and the R script with all pre-processing, data merging, and analyses carried out. The documentation of the additional, explorative analysis is also available. The actual PDFs and text files of the scientific papers used are not included, as they are published open access.
The folder structure is shown below with the file names and a brief description of the contents of each file. For details concerning the analysis approach, please refer to the master's thesis (publication following soon).
Folder 01_SourceData/
PLOS-Dataset_v2_Mar23.csv (PLOS-OSI dataset)
ScopusSearch_ExportResults.csv (export of Scopus search results from Scopus)
ScopusSearch_ExportResults.ris (export of Scopus search results from Scopus)
Zotero_Export_ScopusSearch.csv (export of the file names and DOIs of the Scopus search results from Zotero)
Folder 02_AutomaticClassification/
(NOT INCLUDED) PDFs folder (Folder for PDFs of all publications identified by the Scopus search, named AuthorLastName_Year_PublicationTitle_Title)
(NOT INCLUDED) PDFs_to_text folder (Folder for all texts extracted from the PDFs by ODDPub, named AuthorLastName_Year_PublicationTitle_Title)
PLOS_ScopusSearch_matched.csv (merge of the Scopus search results with the PLOS_OSI dataset for the files contained in both)
oddpub_results_wDOIs.csv (results file of the ODDPub classification)
PLOS_ODDPub.csv (merge of the results file of the ODDPub classification with the PLOS-OSI dataset for the publications contained in both)
Folder 03_ManualCheck/
CodeSheet_ManualCheck.txt (Code sheet with descriptions of the variables for manual coding)
ManualCheck_2023-06-08.csv (Manual coding results file)
PLOS_ODDPub_Manual.csv (Merge of the results file of the ODDPub and PLOS-OSI classification with the results file of the manual coding)
Proof_of_of_Concept_Open_Data_Monitoring.pdf (Description of the explorative analysis of the discoverability of open data publications using the example of a researcher) - in German
Analyses_MA_OpenDataMonitoring.R (R-Script for preparing, merging and analyzing the data and for performing the ODDPub algorithm)
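The included R script performs, among other steps, the ODDPub classification; a hedged sketch of that step following the oddpub package's documented workflow (folder names taken from the structure above; this is not the thesis script itself):
# Convert PDFs to text, load the sentences, and run the open data detection
oddpub::pdf_convert('02_AutomaticClassification/PDFs/', '02_AutomaticClassification/PDFs_to_text/')
pdf_sentences  <- oddpub::pdf_load('02_AutomaticClassification/PDFs_to_text/')
oddpub_results <- oddpub::open_data_search(pdf_sentences)
write.csv(oddpub_results, 'oddpub_results.csv', row.names = FALSE)  # the published results file additionally carries DOIs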
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To replicate the analysis, the results, and the figures of the paper:
Download input data from this Zenodo repository and code from Github https://github.com/giacfalk/urban_green_space_mapping_and_tracking
Optional data extraction steps (processed output data are already available in the Zenodo repository):
Adjust your working directory
Run [lines 4-11] of workflow/sourcer.R
Run the JavaScript scripts written by the string_generator_training.R and string_generator_prediction.R files in Google Earth Engine (https://code.earthengine.google.com) and complete the export-to-Drive tasks to generate the output .csv files
Run workflow/sourcer.R [lines 15-46] to train the ML model and make predictions (including figures and tables replication)