License: GNU General Public License v2.0, https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** the link to the data artifacts is already included in the paper. A link to the code will be included in the camera-ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below.
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI).
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34 GB unpacked. This dataset still does not include the PyPI packages themselves, which take around 2 TB.
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**).
- **Interview protocol.pdf** - approximate protocol used for semi-structured interviews.
- LICENSE - text of GPL v3, under which this dataset is published.
- INSTALL.md - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used; 3.2 is known to be incompatible)

Depending on the level of detail (see Step 2 for more details):

- up to 2 TB of disk space
- at least 16 GB of RAM (64 GB preferable)
- a few hours to a few months of processing time

Step 1 - software
-----------------

- Unpack **ghd-0.1.0.zip**, or clone from GitLab:

      git clone https://gitlab.com/user2589/ghd.git
      git checkout 0.1.0

  `cd` into the extracted folder. All commands below assume it is the current directory.
- Copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to the path of a newly created folder
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- Install Docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`.
- Install libarchive and its headers: `sudo apt-get install libarchive-dev`
- (optional) To replicate on NPM, install yajl: `sudo apt-get install yajl-tools`. Without this dependency you might get an error on the next step, but it is safe to ignore.
- Install Python libraries: `pip install --user -r requirements.txt`
- Disable all APIs except GitHub (Bitbucket and GitLab support were not yet implemented when this study was in progress): edit `scraper/__init__.py` and comment out everything except GitHub support in `PROVIDERS`.

Step 2 - obtaining the dataset
------------------------------

The ultimate goal of this step is to get the output of the Python function `common.utils.survival_data()` and save it into a CSV file:

    # copy and paste into a Python console
    from common import utils
    survival_data = utils.survival_data('pypi', '2008', smoothing=6)
    survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15-30 minutes.

- create a folder `
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Categorical scatterplots with R for biologists: a step-by-step guide
Benjamin Petre1, Aurore Coince2, Sophien Kamoun1
1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK
Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.
Protocol
• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed are indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains the continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.
• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from Step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.
• Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
Notes
• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.
• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
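For orientation, a minimal sketch of the kind of script described in Step 2 is shown below. This is not the authors' actual script (which is distributed as a PowerPoint slide); it only assumes a .csv file with the column names from Step 1 (Replicate, Condition, Value) and uses file.choose() to stand in for the file dialog.

library(ggplot2)
data <- read.csv(file.choose())               # dialog box: pick the .csv file from Step 1
data$Replicate <- as.factor(data$Replicate)   # treat replicates as categories, not numbers
graph <- ggplot(data, aes(x = Condition, y = Value))
graph + geom_boxplot(outlier.colour = 'black', colour = 'black') +
  geom_jitter(aes(col = Replicate)) +
  theme_bw()

With a log-scaled y-axis, the final plotting command is replaced by the Note 2 command above.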
References
Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.
Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035.
Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128.
To make this a seamless process, I cleaned the data and deleted many variables that I thought were not important to our dataset. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data file has both Lyft and Uber, but it is still a cleaned version of the dataset we downloaded from Kaggle.
You can easily subset the data into the car types you will be modeling by first loading the CSV into R. Here is the code for how to do this:
df <- read.csv('uber.csv')                   # load the cleaned Uber data
df_black <- subset(df, df$name == 'Black')   # keep only the 'Black' car type
write.csv(df_black, "nameofthefileyouwanttosaveas.csv")   # save the subset to a new CSV
getwd()                                      # prints the folder the file was written to
This data release supports an analysis of changes in dissolved organic carbon (DOC) and nitrate concentrations in the Buck Creek watershed near Inlet, New York, from 2001 to 2021. The Buck Creek watershed is a 310-hectare forested watershed that is recovering from acidic deposition within the Adirondack region. The data release includes pre-processed model inputs and model outputs for the Weighted Regressions on Time, Discharge and Season (WRTDS) model (Hirsch and others, 2010) to estimate daily flow-normalized concentrations of DOC and nitrate during a 20-year period of analysis. WRTDS uses daily discharge and concentration observations implemented through the Exploration and Graphics for River Trends R package (EGRET) to predict solute concentration using decimal time and discharge as explanatory variables (Hirsch and De Cicco, 2015; Hirsch and others, 2010). Discharge and concentration data are available from the U.S. Geological Survey National Water Information System (NWIS) database (U.S. Geological Survey, 2016). The time series data were analyzed for the entire period, water years 2001 (WY2001) to WY2021, where WY2001 is the period from October 1, 2000 to September 30, 2001. This data release contains 5 comma-separated values (CSV) files, one R script, and one XML metadata file. There are four input files (“Daily.csv”, “INFO.csv”, “Sample_doc.csv”, and “Sample_nitrate.csv”) that contain site information, daily mean discharge, and mean daily DOC or nitrate concentrations. The R script (“Buck Creek WRTDS R script.R”) uses the four input datasets and functions from the EGRET R package to generate estimations of flow-normalized concentrations. The output file (“WRTDS_results.csv”) contains model output at daily time steps for each sub-watershed and for each solute. Files are automatically associated with the R script when opened in RStudio using the provided R project file ("Files.Rproj"). All input, output, and R files are in the "Files.zip" folder.
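The release includes its own R script, which should be used to reproduce the published results. Purely for orientation, a generic EGRET/WRTDS workflow of the kind described above might look like the following sketch; the file names follow the description, but whether the default read options match these files is an assumption.

library(EGRET)

# Read the user-supplied model inputs described above
Daily  <- readUserDaily(filePath = ".", fileName = "Daily.csv")
Sample <- readUserSample(filePath = ".", fileName = "Sample_doc.csv")   # or Sample_nitrate.csv
INFO   <- readUserInfo(filePath = ".", fileName = "INFO.csv")

# Combine into an EGRET eList and fit the WRTDS model
eList <- mergeReport(INFO, Daily, Sample)
eList <- modelEstimation(eList)

# Daily flow-normalized concentrations are then available in eList$Daily$FNConc
summary(eList$Daily$FNConc)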
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Corresponding peer-reviewed publication
This dataset corresponds to all the RAPID input and output files that were used in the study reported in:
David, Cédric H., Florence Habets, David R. Maidment and Zong-Liang Yang (2011), RAPID applied to the SIM-France model, Hydrological Processes, 25(22), 3412-3425. DOI: 10.1002/hyp.8070.
When making use of any of the files in this dataset, please cite both the aforementioned article and the dataset herein.
Time format
The times reported in this description all follow the ISO 8601 format. For example 2000-01-01T16:00-06:00 represents 4:00 PM (16:00) on Jan 1st 2000 (2000-01-01), Central Standard Time (-06:00). Additionally, when time ranges with inner time steps are reported, the first time corresponds to the beginning of the first time step, and the second time corresponds to the end of the last time step. For example, the 3-hourly time range from 2000-01-01T03:00+00:00 to 2000-01-01T09:00+00:00 contains two 3-hourly time steps. The first one starts at 3:00 AM and finishes at 6:00AM on Jan 1st 2000, Universal Time; the second one starts at 6:00 AM and finishes at 9:00AM on Jan 1st 2000, Universal Time.
Data sources
The following sources were used to produce files in this dataset:
The hydrographic network of SIM-France, as published in Habets, F., A. Boone, J. L. Champeaux, P. Etchevers, L. Franchistéguy, E. Leblois, E. Ledoux, P. Le Moigne, E. Martin, S. Morel, J. Noilhan, P. Quintana Seguí, F. Rousset-Regimbeau, and P. Viennot (2008), The SAFRAN-ISBA-MODCOU hydrometeorological model applied over France, Journal of Geophysical Research: Atmospheres, 113(D6), DOI: 10.1029/2007JD008548.
The observed flows are from Banque HYDRO, Service Central d'Hydrométéorologie et d'Appui à la Prévision des Inondations. Available at http://www.hydro.eaufrance.fr/index.php.
Outputs from a simulation using SIM-France (Habets et al. 2008). The simulation was run by Florence Habets, and produced 3-hourly time steps from 1995-08-01T00:00+02:00 to 2005-07-31T21:00+02:00. Further details on the inputs and options used for this simulation are provided in David et al. (2011).
Software
The following software were used to produce files in this dataset:
The Routing Application for Parallel computation of Discharge (RAPID, David et al. 2011, http://rapid-hub.org), Version 1.1.0. Further details on the inputs and options used for this series of simulations are provided below and in David et al. (2011).
ESRI ArcGIS (http://www.arcgis.com).
Microsoft Excel (https://products.office.com/en-us/excel).
The GNU Compiler Collection (https://gcc.gnu.org) and the Intel compilers (https://software.intel.com/en-us/intel-compilers).
Study domain
The files in this dataset correspond to one study domain:
The river network of SIM-France is made of 24,264 river reaches. The temporal range corresponding to this domain is from 1995-08-01T00:00+02:00 to 2005-07-31T21:00+02:00.
Description of files
All files below were prepared by Cédric H. David, using the data sources and software mentioned above.
rapid_connect_France.csv. This CSV file contains the river network connectivity information and is based on the unique IDs of the SIM-France river reaches (the IDs). For each river reach, this file specifies: the ID of the reach, the ID of the unique downstream reach, the number of upstream reaches with a maximum of four reaches, and the IDs of all upstream reaches. A value of zero is used in place of NoData. The river reaches are sorted in increasing value of ID. The values were computed based on the SIM-France FICVID file. This file was prepared using a Fortran program.
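As a purely hypothetical illustration of this row layout (the reach IDs below are invented, and the absence of a header row is an assumption based on the description), the file can be inspected in R as follows:

# rapid_connect_France.csv: ID, downstream ID, number of upstream reaches (max 4),
# and up to four upstream IDs, with 0 standing in for NoData, e.g.
#   101, 205, 2, 55, 66, 0, 0
#   205, 0,   1, 101, 0, 0, 0
con <- read.csv("rapid_connect_France.csv", header = FALSE)
head(con)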
m3_riv_France_1995_2005_ksat_201101_c_zvol_ext.nc. This netCDF file contains the 3-hourly accumulated inflows of water (in cubic meters) from surface and subsurface runoff into the upstream point of each river reach. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The time range for this file is from 1995-08-01T00:00+02:00 to 2005-07-31T21:00+02:00. The values were computed using the outputs of SIM-France. This file was prepared using a Fortran program.
kfac_modcou_1km_hour.csv. This CSV file contains a first guess of Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: ID, size of the side of the grid cell, Equation (5) in David et al. (2011), and using a wave celerity of 1 km/h. This file was prepared using a Fortran program.
kfac_modcou_ttra_length.csv. This CSV file contains a second guess of Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: ID, size of the side of the grid cell, travel time, and Equation (9) in David et al. (2011).
k_modcou_0.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_1.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_2.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_3.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_4.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_a.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_b.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_c.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_0.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_1.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_2.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_3.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_4.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_a.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_b.csv. This CSV file contains Muskingum x values
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This repository contains the raw data for the study:
Characterizing Intraspecific Resource Utilization in an Aquatic Consumer Using High-Throughput Phenotyping
Data are provided separately for the first experiment (numerical response experiment with 16 rotifer clones across six food concentrations) and the second experiment (growth rate measurements with 98 rotifer clones across two food concentrations).
Contents of first_experiment.zip
input/ This folder contains raw count data (output of the Wellcounter software):
popgrowth_
output/ output files produced by the R-script 'first_experiment_analysis.Rmd'
wellcounter/ contains the Wellcounter software (programs and configuration files) that were used for running the raw analysis of this dataset on a High Performance Computing cluster
first_experiment_analysis.Rmd R-Markdown file with data processing and statistical analysis of the first experiment
numerical_response_2par.R A function required by 'first_experiment_analysis.Rmd'
Contents of second_experiment.zip
input/ This folder contains raw count and behavioral data (output of the Wellcounter software):
popgrowth_
output/ output files produced by the R-script 'second_experiment_analysis.Rmd'
wellcounter/ contains the Wellcounter software (programs and configuration files) that were used for running the raw analysis (image and motion analysis) of this dataset on a High Performance Computing cluster
second_experiment_prep_run1.Rmd R-Markdown file for preprocessing the data from run1
second_experiment_prep_run2.Rmd R-Markdown file for preprocessing the data from run2
second_experiment_analysis.Rmd R-Markdown file with data processing and statistical analysis of the second experiment
extract_fixed_effects_table.R A function required by 'second_experiment_analysis.Rmd'
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Corresponding peer-reviewed publication
This dataset corresponds to all the RAPID input and output files that were used in the study reported in:
David, Cédric H., David R. Maidment, Guo-Yue Niu, Zong-Liang Yang, Florence Habets and Victor Eijkhout (2011), River Network Routing on the NHDPlus Dataset, Journal of Hydrometeorology, 12(5), 913-934. DOI: 10.1175/2011JHM1345.1.
When making use of any of the files in this dataset, please cite both the aforementioned article and the dataset herein.
Time format
The times reported in this description all follow the ISO 8601 format. For example 2000-01-01T16:00-06:00 represents 4:00 PM (16:00) on Jan 1st 2000 (2000-01-01), Central Standard Time (-06:00). Additionally, when time ranges with inner time steps are reported, the first time corresponds to the beginning of the first time step, and the second time corresponds to the end of the last time step. For example, the 3-hourly time range from 2000-01-01T03:00+00:00 to 2000-01-01T09:00+00:00 contains two 3-hourly time steps. The first one starts at 3:00 AM and finishes at 6:00AM on Jan 1st 2000, Universal Time; the second one starts at 6:00 AM and finishes at 9:00AM on Jan 1st 2000, Universal Time.
Data sources
The following sources were used to produce files in this dataset:
The National Hydrography Dataset Plus (NHDPlus) Version 1, obtained from http://www.horizon-systems.com/nhdplus.
The National Water Information System (NWIS), obtained from http://waterdata.usgs.gov/nwis.
Outputs from a simulation using the community Noah land surface model with multiparameterization options (Noah-MP, Niu et al. 2011, http://www.jsg.utexas.edu/noah-mp). The simulation was run by Guo-Yue Niu, and produced 3-hourly time steps from 2004-01-01T00:00+00:00 to 2008-01-01T00:00+00:00. Further details on the inputs and options used for this simulation are provided in David et al. (2011).
Software
The following software were used to produce files in this dataset:
The Routing Application for Parallel computation of Discharge (RAPID, David et al. 2011, http://rapid-hub.org), Version 1.0.0. Further details on the inputs and options used for this series of simulations are provided below and in David et al. (2011).
ESRI ArcGIS (http://www.arcgis.com).
Microsoft Excel (https://products.office.com/en-us/excel).
CUAHSI HydroGET (http://his.cuahsi.org/hydroget.html).
The GNU Compiler Collection (https://gcc.gnu.org) and the Intel compilers (https://software.intel.com/en-us/intel-compilers).
Study domain
The files in this dataset correspond to two study domains:
The combination of the San Antonio and Guadalupe River Basins, TX. RAPID can only use the river reaches of NHDPlus that have a known flow direction and focus is made on these reaches here (a total of 5,175). The temporal range corresponding to this domain is from 2004-01-01T00:00-06:00 to 2007-12-31 T00:00-06:00.
The Upper Mississippi River Basin. RAPID can only use the river reaches of NHDPlus that have a known flow direction and focus is made on these reaches here (a total of 182,240). The temporal range corresponding to this domain spans 100 fictitious days.
Description of files for the San Antonio and Guadalupe River Basins
All files below were prepared by Cédric H. David, using the data sources and software mentioned above.
rapid_connect_San_Guad.csv. This CSV file contains the river network connectivity information and is based on the unique IDs of NHDPlus reaches (the COMIDs). For each river reach, this file specifies: the COMID of the reach, the COMID of the unique downstream reach, the number of upstream reaches with a maximum of four reaches, and the COMIDs of all upstream reaches. A value of zero is used in place of NoData. The river reaches are sorted in increasing value of COMID. The values were computed using a combination of the following NHDPlus fields: COMID, DIVERGENCE, FROMNODE and TONODE. This file was prepared using ArcGIS and Excel.
m3_riv_San_Guad_2004_2007_cst.nc. This netCDF file contains the 3-hourly accumulated inflows of water (in cubic meters) from surface and subsurface runoff into the upstream point of each river reach. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The time range for this file is from 2004-01-01T00:00-06:00 to 2007-12-31T18:00-06:00. The values were computed by superimposing a 900-m gridded map of NHDPlus catchments to the outputs of Noah-MP. This file was prepared using ArcGIS and a Fortran program.
kfac_San_Guad_1km_hour.csv. This CSV file contains a first guess of Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following NHDPlus fields: COMID, LENGTHKM, Equation (13) in David et al. (2011), and using a wave celerity of 1 km/h. This file was prepared using a Fortran program.
kfac_San_Guad_celerity.csv. This CSV file contains a first guess of Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following NHDPlus fields: COMID, LENGTHKM, Equation (13) in David et al. (2011), and using the wave celerity numbers of Table 2 in David et al. (2011). This file was prepared using a Fortran program.
k_San_Guad_2004_1.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following NHDPlus fields: COMID, LENGTHKM, and using Equation (17) in David et al. (2011). This file was prepared using a Fortran program.
k_San_Guad_2004_2.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following NHDPlus fields: COMID, LENGTHKM, and using Equation (18) in David et al. (2011). This file was prepared using a Fortran program.
k_San_Guad_2004_3.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following NHDPlus fields: COMID, LENGTHKM, and using Equation (19) in David et al. (2011). This file was prepared using a Fortran program.
k_San_Guad_2004_4.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following NHDPlus fields: COMID, LENGTHKM, and using Equation (21) in David et al. (2011). This file was prepared using a Fortran program.
x_San_Guad_2004_1.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on Equation (17) in David et al. (2011). This file was prepared using a Fortran program.
x_San_Guad_2004_2.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on Equation (18) in David et al. (2011). This file was prepared using a Fortran program.
x_San_Guad_2004_3.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on Equation (19) in David et al. (2011). This file was prepared using a Fortran program.
x_San_Guad_2004_4.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on Equation (21) in David et al. (2011). This file was prepared using a Fortran program.
basin_id_San_Guad_hydroseq.csv. This CSV file contains the list of unique IDs of NHDPlus river reaches (COMID) in the San Antonio and Guadalupe River Basins. The river reaches are sorted from upstream to downstream. The values were computed using the following NHDPlus fields: COMID and HYDROSEQ. This file was prepared using Excel.
Qout_San_Guad_1460days_p1_dtR=900s.nc. This netCDF file contains the 3-hourly averaged outputs (in cubic meters per second) from RAPID corresponding to the downstream point of each reach. The river reaches have the same COMIDs and are sorted similarly to basin_id_San_Guad_hydroseq.csv. The time range for this file is from 2004-01-01T00:00-06:00 to 2007-12-31T00:00-06:00. The values were computed using the Muskingum method with parameters of Equation (17) in David et al. (2011). This file was prepared using RAPID v1.0.0 running with the preonly ILU solver on one core.
Qout_San_Guad_1460days_p2_dtR=900s.nc. This netCDF file contains the 3-hourly averaged outputs (in cubic meters per second) from RAPID corresponding to the downstream point of each reach. The river reaches have the same COMIDs and are sorted similarly to basin_id_San_Guad_hydroseq.csv. The time range for this file is from 2004-01-01T00:00-06:00 to 2007-12-31T00:00-06:00. The values were computed using the Muskingum method with parameters of Equation (18) in David et al. (2011). This file was prepared using RAPID v1.0.0 running with the preonly ILU solver on one core.
Qout_San_Guad_1460days_p3_dtR=900s.nc. This netCDF file contains the 3-hourly averaged outputs (in cubic meters per second) from RAPID corresponding to the downstream
License: U.S. Government Works, https://www.usa.gov/government-works
This child page contains a zipped folder which contains all of the items necessary to run load estimation using R-LOADEST to produce results that are published in U.S. Geological Survey Investigations Report 2021-XXXX [Tatge, W.S., Nustad, R.A., and Galloway, J.M., 2021, Evaluation of Salinity and Nutrient Conditions in the Heart River Basin, North Dakota, 1970-2020: U.S. Geological Survey Scientific Investigations Report 2021-XXXX, XX p]. The folder contains an allsiteinfo.table.csv file, a "datain" folder, and a "scripts" folder. The allsiteinfo.table.csv file can be used to cross reference the sites with the main report (Tatge and others, 2021). The "datain" folder contains all the input data necessary to reproduce the load estimation results. The naming convention in the "datain" folder is site_MI_rloadest or site_NUT_rloadest for either the major ion loads or the nutrient loads. The .Rdata files are used in the scripts to run the estimations and the .csv files can be used to ...
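The scripts in the "scripts" folder should be used to reproduce the published results. Purely as orientation, a generic rloadest call of the kind used for such load estimation might look like the sketch below; the data frame, column names (DATES, FLOW, NO3), model number, and station label are all placeholders invented for illustration, not the contents of the "datain" folder.

library(rloadest)

# Synthetic calibration data purely for illustration
set.seed(1)
calib <- data.frame(
  DATES = seq(as.Date("2015-01-15"), by = "month", length.out = 48),
  FLOW  = rlnorm(48, meanlog = 4, sdlog = 0.8)
)
calib$NO3 <- exp(0.5 * log(calib$FLOW) + rnorm(48, 0, 0.2))  # made-up concentrations

# Fit one of the predefined LOADEST regression models (model(9) is a placeholder choice)
fit <- loadReg(NO3 ~ model(9), data = calib,
               flow = "FLOW", dates = "DATES",
               conc.units = "mg/L", station = "illustrative site")
print(fit)

# Load estimates for each day in the (illustrative) estimation data set
predLoad(fit, newdata = calib, by = "day")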
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
General information

The script runs with R (Version 3.1.1; 2014-07-10) and packages plyr (Version 1.8.1), XLConnect (Version 0.2-9), utilsMPIO (Version 0.0.25), sp (Version 1.0-15), rgdal (Version 0.8-16), tools (Version 3.1.1) and lattice (Version 0.20-29). Questions can be directed to Martin Bulla (bulla.mar@gmail.com).

Data collection and how the individual variables were derived are described in:
Steiger, S.S., et al., When the sun never sets: diverse activity rhythms under continuous daylight in free-living arctic-breeding birds. Proceedings of the Royal Society B: Biological Sciences, 2013. 280(1764): 20131016.
Dale, J., et al., The effects of life history and sexual selection on male and female plumage colouration. Nature, 2015.

Data are available as an RData file. Missing values are NA. For better readability, the subsections of the script can be collapsed.

Description of the method

1 - Data are visualized in an interactive actogram with time of day on the x-axis and one panel for each day of data.
2 - A red rectangle indicates the active field; clicking with the mouse in that field on the depicted light signal generates a data point that is automatically saved in the csv file (via a custom-made function). For this data extraction it is recommended to always click on the bottom line of the red rectangle, as data are always available there due to a dummy variable ("lin") that creates continuous data at the bottom of the active panel. The data are captured only if a greenish vertical bar appears and a new line of data appears in the R console.
3 - To extract incubation bouts, the first click in a new plot has to be the start of incubation, the next click depicts the end of incubation, and a click on the same spot starts the incubation for the other sex. If the end and start of incubation are at different times, the data will still be extracted, but the sex, logger and bird_ID will be wrong; these need to be changed manually in the csv file. Similarly, the first bout for a given plot will always be assigned to the male (if no data are present in the csv file) or based on previous data. Hence, whenever data from a new plot are extracted, it is worth checking at the first mouse click whether the sex, logger and bird_ID information is correct and, if not, adjusting it manually.
4 - If all information from one day (panel) is extracted, right-click on the plot and choose "stop". This will activate the following day (panel) for extraction.
5 - To end extraction before going through all the rectangles, just press "escape".

Annotations of data files from turnstone_2009_Barrow_nest-t401_transmitter.RData

dfr - contains raw data on signal strength from the radio tags attached to the rumps of the female and male, and information about when the birds were captured and the incubation stage of the nest:
1. who: identifies whether the recording refers to female, male, capture or start of hatching
2. datetime_: date and time of each recording
3. logger: unique identity of the radio tag
4. signal_: signal strength of the radio tag
5. sex: sex of the bird (f = female, m = male)
6. nest: unique identity of the nest
7. day: datetime_ variable truncated to year-month-day format
8. time: time of day in hours
9. datetime_utc: date and time of each recording, but in UTC time
10. cols: colors assigned to "who"

m - contains metadata for a given nest:
1. sp: identifies species (RUTU = Ruddy Turnstone)
2. nest: unique identity of the nest
3. year_: year of observation
4. IDfemale: unique identity of the female
5. IDmale: unique identity of the male
6. lat: latitude coordinate of the nest
7. lon: longitude coordinate of the nest
8. hatch_start: date and time when the hatching of the eggs started
9. scinam: scientific name of the species
10. breeding_site: unique identity of the breeding site (barr = Barrow, Alaska)
11. logger: type of device used to record incubation (IT - radio tag)
12. sampling: mean incubation sampling interval in seconds

s - contains metadata for the incubating parents:
1. year_: year of capture
2. species: identifies species (RUTU = Ruddy Turnstone)
3. author: identifies the author who measured the bird
4. nest: unique identity of the nest
5. caught_date_time: date and time when the bird was captured
6. recapture: was the bird captured before? (0 - no, 1 - yes)
7. sex: sex of the bird (f = female, m = male)
8. bird_ID: unique identity of the bird
9. logger: unique identity of the radio tag
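A minimal sketch of how these objects can be inspected in R (object names as annotated above):

load("turnstone_2009_Barrow_nest-t401_transmitter.RData")  # loads dfr, m and s into the workspace
str(dfr)   # raw radio-tag signal data
str(m)     # nest metadata
str(s)     # metadata for the incubating parents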
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
MaxEnt response curves (output): xlsx files and a MATLAB script describing the data post-processing (from maxent2xlsx), an R script to produce response-curve graphs in R format, and CSV files describing the response curves (average and standard deviation), which are the input for the R script.
This child page contains a zipped folder which contains all items necessary to run trend models and produce results published in U.S. Geological Survey Scientific Investigations Report 2021–XXXX [Tatge, W.S., Nustad, R.A., and Galloway, J.M., 2021, Evaluation of Salinity and Nutrient Conditions in the Heart River Basin, North Dakota, 1970-2020: U.S. Geological Survey Scientific Investigations Report 2021-XXXX, XX p.]. To run the R-QWTREND program in R, 6 files are required, and each is included in this child page: prepQWdataV4.txt, runQWmodelV4XXUEP.txt, plotQWtrendV4XXUEP.txt, qwtrend2018v4.exe, salflibc.dll, and StartQWTrendV4.R (Vecchia and Nustad, 2020). The folder contains: six items required to run the R–QWTREND trend analysis tool; a readme.txt file; a flowtrendData.RData file; an allsiteinfo.table.csv file; a folder called "scripts"; and a folder called "waterqualitydata". The "scripts" folder contains the scripts that can be used to reproduce the results found in the USGS Scientific Investigations Report referenced above. The "waterqualitydata" folder contains .csv files with the naming convention of site_ions or site_nuts for major ion and nutrient constituents and contains machine-readable files with the water-quality data used for the trend analysis at each site. R–QWTREND is a software package for analyzing trends in stream-water quality. The package is a collection of functions written in R (R Development Core Team, 2019), an open source language and a general environment for statistical computing and graphics. The following system requirements are necessary for using R–QWTREND:
• Windows 10 operating system
• R (version 3.4 or later; 64 bit recommended)
• RStudio (version 1.1.456 or later)
An accompanying report (Vecchia and Nustad, 2020) serves as the formal documentation for R–QWTREND.
Vecchia, A.V., and Nustad, R.A., 2020, Time-series model, statistical methods, and software documentation for R–QWTREND—An R package for analyzing trends in stream-water quality: U.S. Geological Survey Open-File Report 2020–1014, 51 p., https://doi.org/10.3133/ofr20201014
R Development Core Team, 2019, R—A language and environment for statistical computing: Vienna, Austria, R Foundation for Statistical Computing, accessed December 7, 2020, at https://www.r-project.org.
A routine was developed in R ('bathy_plots.R') to plot bathymetry data over time during individual CEAMARC events, so that benthic data can be analysed in relation to habitat, i.e. did we trawl over a slope or was the sea floor relatively flat? Note that the depth range in the plots is autoscaled to the data, so a small range in depths appears as a scattering of points; as long as you look at the depth scale, interpretation will be fine. The R files need a file of bathymetry data, '200708V3_one_minute.csv', which contains a data export from the underway PostgreSQL ship database, and 'events.csv', which is a stripped-down version of the events export from the ship-board events database. If you wish to run the code again you may need to change the pathnames in the R script to relevant locations. If you have opened the csv files in Excel at any stage and the R script gets an error, you may need to format the date/time columns as yyyy-mm-dd hh:mm:ss, save and close the file as csv without opening it again, and then run the R script. However, all output files are here for every CEAMARC event. Filenames contain a reference to the CEAMARC event id. Files are in eps format and can be viewed using Ghostview, which is available as a free download on the internet.
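The routine itself ('bathy_plots.R') is included with the data; the sketch below only illustrates the kind of depth-over-time plot it produces. The column names ("datetime", "depth") are assumptions, not the actual headers of '200708V3_one_minute.csv'.

# Minimal sketch; adjust column names and paths to the real files.
bathy <- read.csv("200708V3_one_minute.csv")
bathy$datetime <- as.POSIXct(bathy$datetime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
plot(bathy$datetime, bathy$depth, type = "l",
     xlab = "Time", ylab = "Depth (m)")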
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
One persistent challenge in untargeted metabolomics is the identification of compounds from their mass spectrometry (MS) signal, which is necessary for biological data interpretation. This process can be facilitated by building in-house libraries of metabolite standards containing retention time (RT) information, which is orthogonal and complementary to large, published MS/MS spectra repositories. Creating such libraries can require substantial effort and is time intensive. To streamline this process, we developed metScribeR, an R package with a Shiny application to accelerate the creation of RT and m/z libraries. metScribeR provides an easy, user-friendly interface for peak finding, filtering, and comprehensive quality review of the MS data. Uniquely, metScribeR does not require MS/MS spectral information and reports an identification probability estimate for each adduct. In our benchmarking, metScribeR required approximately 10 s of computational and manual effort per standard, showed a correlation of 0.99 between manual and metScribeR-derived RTs, and appropriately filtered out poor quality peaks. The metScribeR output is a .csv file including the identity, m/z, RT, and peak quality information for standards, along with MS/MS spectra retrieved from MassBank of North America (MoNA). metScribeR is open source and available for download on GitHub at https://github.com/ncats/metScribeR.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
We investigate apparent-time tone variation in the Black Lahu language (Loloish/Ngwi, Tibeto-Burman) of Yunnan, China. These are the supplementary materials for the paper "Generational differences in the low tones of Black Lahu," accepted for publication in Linguistics Vanguard.
Appendices:
Script files contained in the analysis:
Data files contained in this analysis:
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
This data repository provides the Food and Agriculture Biomass Input Output (FABIO) database, a global set of multi-regional physical supply-use and input-output tables covering global agriculture and forestry. The work is based on mostly freely available data from FAOSTAT, IEA, EIA, and UN Comtrade/BACI. FABIO currently covers 191 countries + RoW, 118 processes and 125 commodities (raw and processed agricultural and food products) for 1986-2013. All R codes and auxiliary data are available on GitHub. For more information please refer to https://fabio.fineprint.global.

The database consists of the following main components, in compressed .rds format:

Z: the inter-commodity input-output matrix, displaying the relationships of intermediate use of each commodity in the production of each commodity, in physical units (tons). The matrix has 24000 rows and columns (125 commodities x 192 regions), and is available in two versions, based on the method used to allocate inputs to outputs in production processes: Z_mass (mass allocation) and Z_value (value allocation). Note that the row sums of the Z matrix (= total intermediate use by commodity) are identical in both versions.

Y: the final demand matrix, denoting the consumption of all 24000 commodities by destination country and final use category. There are six final use categories (yielding 192 x 6 = 1152 columns): 1) food use, 2) other use (non-food), 3) losses, 4) stock addition, 5) balancing, and 6) unspecified.

X: the total output vector of all 24000 commodities. Total output is equal to the sum of intermediate and final use by commodity.

L: the Leontief inverse, computed as (I - A)^-1, where A is the matrix of input coefficients derived from Z and X. Again, there are two versions, depending on the underlying version of Z (L_mass and L_value).

E: environmental extensions for each of the 24000 commodities, including four resource categories: 1) primary biomass extraction (in tons), 2) land use (in hectares), 3) blue water use (in m3), and 4) green water use (in m3).

mr_sup_mass/mr_sup_value: for each allocation method (mass/value), the supply table gives the physical supply quantity of each commodity by producing process, with processes in the rows (118 processes x 192 regions = 22656 rows) and commodities in columns (24000 columns).

mr_use: the use table captures the quantities of each commodity (rows) used as an input in each process (columns).

A description of the included countries and commodities (i.e. the rows and columns of the Z matrix) can be found in the auxiliary file io_codes.csv. Separate lists of the country sample (including ISO3 codes and continental grouping) and commodities (including moisture content) are given in the files regions.csv and items.csv, respectively. For information on the individual processes, see the auxiliary file su_codes.csv. RDS files can be opened in R; information on how to read these files can be obtained here: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/readRDS. Except for X.rds, which contains a matrix, all variables are organized as lists, where each element contains a sparse matrix. Please note that values are always given in physical units, i.e. tonnes or head, as specified in items.csv. The suffixes value and mass only indicate the form of allocation chosen for the construction of the symmetric IO tables (for more details see Bruckner et al. 2019). Product, process and country classifications can be found in the file fabio_classifications.xlsx.
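For orientation, reading a few of the components into R might look like the following sketch. The file names are assumed to follow the component names listed above (e.g. Z_mass.rds, Y.rds); adjust them to the actual file names in the repository.

library(Matrix)   # the list elements are sparse matrices

X      <- readRDS("X.rds")        # total output (a matrix, per the note above)
Z_list <- readRDS("Z_mass.rds")   # list of sparse inter-commodity matrices (mass allocation)
Y_list <- readRDS("Y.rds")        # list of sparse final demand matrices

# Pick one element from the lists (indexing by year is an assumption)
Z <- Z_list[[1]]
Y <- Y_list[[1]]
dim(Z)   # expected: 24000 x 24000 (125 commodities x 192 regions)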
Footprint results are not contained in the database but can be calculated, e.g. by using this script: https://github.com/martinbruckner/fabio_comparison/blob/master/R/fabio_footprints.R

How to cite: To cite FABIO work please refer to this paper: Bruckner, M., Wood, R., Moran, D., Kuschnig, N., Wieland, H., Maus, V., Börner, J. 2019. FABIO – The Construction of the Food and Agriculture Input–Output Model. Environmental Science & Technology 53(19), 11302–11312. DOI: 10.1021/acs.est.9b03554

License: This data repository is distributed under the CC BY-NC-SA 4.0 License. You are free to share and adapt the material for non-commercial purposes with proper citation. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original. In case you are interested in a collaboration, I am happy to receive enquiries at martin.bruckner@wu.ac.at.

Known issues: The underlying FAO data have been manipulated to the minimum extent necessary. Data filling and supply-use balancing, however, required some adaptations. These are documented in the code and are also reflected in the balancing item of the final demand matrices. For proper use of the database, I recommend distributing the balancing item over all other uses proportionally and running analyses with and without balancing to illustrate uncertainties.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
The following seven zip files are compressed folders containing the input datasets/trees, main output files and the scripts of the related analyses performed in this study.

I. ancestral_microhabitat_reconstruction.zip: contains four files, including two input files (microhabitats.csv, timetree.tre) and a script (simmap_microhabitat.R) for ancestral state reconstruction of microhabitat by make.simmap implemented in the R package phytools v1.5, as well as the main output file (ancestral_microhabitats.csv).
1. ancestral_microhabitats.csv: reconstructed ancestral microhabitats for each node.
2. microhabitats.csv: microhabitats of the studied species.
3. simmap_microhabitat.R: the R script of make.simmap for ancestral microhabitat reconstruction.
4. timetree.tre: dated tree used for ancestral state reconstruction of microhabitat and morphological characters.

II. ancestral_morphology_reconstruction.zip: contains six files, including an input file (morphology.csv) and a script (simmap_morphology.R) for ancestral state reconstruction of morphology by make.simmap implemented in the R package phytools v1.5, as well as four main output files (forewing_ancestral_state.csv, frontal_sutures_ancestral_state.csv, hind_wing_ancestral_state.csv, ocellus_ancestral_state.csv).
1. forewing_ancestral_state.csv: reconstructed ancestral states of the development of the forewing for each node.
2. frontal_sutures_ancestral_state.csv: reconstructed ancestral states of the development of frontal sutures for each node.
3. hind_wing_ancestral_state.csv: reconstructed ancestral states of the development of the hind wing for each node.
4. morphology.csv: the states of the development of ocellus, forewing, hind wing and frontal sutures for each studied species.
5. ocellus_ancestral_state.csv: reconstructed ancestral states of the development of the ocellus for each node.
6. simmap_morphology.R: the R script of make.simmap for ancestral state reconstruction of morphology.

III. biogeographic_reconstruction.zip: contains four files, including three input files (dispersal_probablity.txt, distributions.csv, timetree_noOutgroup.tre) used for a stratified biogeographic analysis by BioGeoBEARS in RASP v4.2 and the main output file (DIVELIKE_result.txt).
1. dispersal_probablity.txt: relative dispersal probabilities among biogeographical regions at different geological epochs.
2. distributions.csv: current distributions of the studied species.
3. DIVELIKE_result.txt: BioGeoBEARS result of ancestral areas based on the DIVELIKE model.
4. timetree_noOutgroup.tre: the dated tree with the outgroup lineage (Eurymelinae) excluded.

IV. coalescent_analysis.zip: contains a folder and two files, including a folder (individual_gene_alignment) of input files used to construct gene trees, an input file (MLtree_BS70.tre) used for the multi-species coalescent analysis by ASTRAL v4.10.5, and the main output file (coalescent_species_tree.tre).
1. coalescent_species_tree.tre: the species tree generated by the multi-species coalescent analysis, with the quartet support, effective number of genes and the local posterior probability indicated.
2. individual_gene_alignment: a folder containing 427 FASTA files, each representing the nucleotide alignment for a gene. Hyphens are used to represent gaps. These files were used to construct gene trees using IQ-TREE v1.6.12.
3. MLtree_BS70.tre: 165 gene trees with average SH-aLRT and ultrafast bootstrap values of ≥ 70%. This file was used to estimate the species tree by ASTRAL v4.10.5.

V. divergence_time_estimation.zip: contains five files, including two input files (treefile_rooted_noBranchLength.tre, treefile_rooted.tre) and two control files (baseml.ctl, mcmctree.ctl) used for divergence time estimation by BASEML and MCMCTREE in PAML v4.9, as well as the main output file (timetree_with95%HPD.tre).
1. baseml.ctl: the control file used for the estimation of substitution rates by BASEML in PAML v4.9.
2. mcmctree.ctl: the control file used for the estimation of divergence times by MCMCTREE in PAML v4.9.
3. timetree_with95%HPD.tre: dated tree with the 95% highest posterior density confidence intervals indicated.
4. treefile_rooted_noBranchLength.tre: the maximum likelihood tree based on the concatenated nucleotide dataset with calibrations for the crown and internal nodes. Branch lengths and support values are not indicated.
5. treefile_rooted.tre: the maximum likelihood tree based on the concatenated nucleotide dataset with a secondary calibration on the root age. Branch support values are not indicated.

VI. maximum_likelihood_analysis_aa.zip: contains three files, including two input files (concatenated_aa_partition.nex, concatenated_aa.phy) used for the maximum likelihood analysis by IQ-TREE v1.6.12 and the main output file (MLtree_aa.tre).
1. concatenated_aa_partition.nex: the partitioning scheme for the maximum likelihood analysis using concatenated_aa.phy. This file partitions the 52,024 amino acid positions into 427 character sets.
2. concatenated_aa.phy: a concatenated amino acid dataset with 52,024 amino acid positions. Hyphens are used to represent gaps. This dataset was used for the maximum likelihood analysis.
3. MLtree_aa.tre: the maximum likelihood tree based on the concatenated amino acid dataset, with SH-aLRT values and ultrafast bootstrap values indicated.

VII. maximum_likelihood_analysis_nt.zip: contains three files, including two input files (concatenated_nt_partition.nex, concatenated_nt.phy) used for the maximum likelihood analysis by IQ-TREE v1.6.12 and the main output file (MLtree_nt.tre).
1. concatenated_nt_partition.nex: the partitioning scheme for the maximum likelihood analysis using concatenated_nt.phy. This file partitions the 156,072 nucleotide positions into 427 character sets.
2. concatenated_nt.phy: a concatenated nucleotide dataset with 156,072 nucleotide positions. Hyphens are used to represent gaps. This dataset was used for the maximum likelihood analysis as well as for divergence time estimation.
3. MLtree_nt.tre: the maximum likelihood tree based on the concatenated nucleotide dataset, with SH-aLRT values and ultrafast bootstrap values indicated.

VIII. Taxon_sampling.csv: contains the sample IDs (1st column) which were used in the alignments and the taxonomic information (2nd to 6th columns).
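The repository includes the actual scripts (simmap_microhabitat.R, simmap_morphology.R), which should be used to reproduce the analyses. Purely as orientation, a make.simmap call of the kind described for the microhabitat reconstruction might look like the sketch below; the layout of microhabitats.csv, the tree-reading function, the model, and the number of simulations are all assumptions.

library(phytools)

tree   <- read.tree("timetree.tre")                    # dated tree (read.nexus may be needed if NEXUS)
traits <- read.csv("microhabitats.csv", row.names = 1) # assumes species names in the first column

# Named vector of microhabitat states, names matching the tree's tip labels
states <- setNames(traits[tree$tip.label, 1], tree$tip.label)

# Stochastic character mapping; model and nsim are placeholder choices
maps <- make.simmap(tree, states, model = "ER", nsim = 100)
summary(maps)   # posterior probabilities of ancestral states at each node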
This dataset contains (a) a script “R_met_integrated_for_modeling.R”, and (b) associated input CSV files: 3 CSV files per location to create a 5-variable integrated meteorological dataset file (air temperature, precipitation, wind speed, relative humidity, and solar radiation) for 19 meteorological stations and 1 location within Trail Creek from the modeling team within the East River Community Observatory as part of the Watershed Function Scientific Focus Area (SFA). As meteorological forcings varied across the watershed, a high-frequency database is needed to ensure consistency in the data analysis and modeling. We evaluated several data sources, including gridded meteorological products and field data from meteorological stations. We determined that our modeling efforts required multiple data sources to meet all their needs. As output, this dataset contains (c) a single CSV data file (*_1981-2022.csv) for each location (20 CSV output files total) containing hourly time series data for 1981 to 2022 and (d) five PNG files of time series and density plots for each variable per location (100 PNG files). Detailed location metadata is contained within the Integrated_Met_Database_Locations.csv file for each point location included within this dataset, obtained from Varadharajan et al., 2023 doi:10.15485/1660962. This dataset also includes (e) a file-level metadata (flmd.csv) file that lists each file contained in the dataset with associated metadata and (f) a data dictionary (dd.csv) file that contains column/row headers used throughout the files along with a definition, units, and data type. Review the (g) ReadMe_Integrated_Met_Database.pdf file for additional details on the script, methods, and structure of the dataset. The script integrates Northwest Alliance for Computational Science and Engineering’s PRISM gridded data product, National Oceanic and Atmospheric Administration’s NCEP-NCAR Reanalysis 1 gridded data product (through the RCNEP R package, Kemp et al., doi:10.32614/CRAN.package.RNCEP), and analytical-based calculations. Further, this script downscales the input data into hourly frequency, which is necessary for the modeling efforts.
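The full integration is performed by the dataset's own script (R_met_integrated_for_modeling.R). Purely as an illustration of the downscaling idea mentioned above (daily values interpolated to an hourly time step), the sketch below uses made-up numbers and a simple linear interpolation; the actual script may use different methods and data sources.

# Illustrative only: linear interpolation of a daily series to hourly steps.
daily <- data.frame(
  date = seq(as.Date("2022-06-01"), as.Date("2022-06-05"), by = "day"),
  tair = c(12.1, 13.4, 11.8, 14.2, 15.0)   # made-up daily mean air temperature (degC)
)
hourly_time <- seq(as.POSIXct("2022-06-01 00:00", tz = "UTC"),
                   as.POSIXct("2022-06-05 00:00", tz = "UTC"), by = "hour")
hourly_tair <- approx(x = as.numeric(as.POSIXct(daily$date, tz = "UTC")),
                      y = daily$tair,
                      xout = as.numeric(hourly_time))$y
head(hourly_tair)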
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset includes all data as well as the scripts used to produce the figures from both the main text and the supporting information of Zhang et al. (2023) "Fungal genome size and composition reflect ecological strategies along soil fertility gradients".
Detailed description:
genome_size_lifestyle.csv: file containing genome size and lifestyle data for all fungi.
ncbi_fungi_GC_content.csv: file containing GC content data from NCBI.
Fungi_all_gs_gc_match.csv: file containing the fungal species that have both GC content and genome size values.
genome_size_fungal_traits.csv: file containing other fungal trait data for the species that have genome size data.
All files above are raw data. By contrast, the following files are derived data: Australian_Microbiome_GC_CWM.csv, Australian_Microbiome_Genomesize_CWM.csv, Appendix_S1.csv, and Appendix_S3.csv; the code in Fig3.R can be used to regenerate them. Fig4_slope_data.csv was generated from FigS4_stats.R, FigS5_stats.R, and FigS6_stats.R.
Each figure has its own R script and the reorganized dataset used to produce it. The R scripts produce Figures 1-4 in the main text and most of the supporting information figures.
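As an illustration of how the raw tables relate to the matched table, the hedged R sketch below joins the genome size and GC content files; the shared species column name is an assumption and is not taken from the files themselves.

    # Hedged sketch, not one of the archived scripts; "species" column name assumed
    library(dplyr)

    genome_size <- read.csv("genome_size_lifestyle.csv")
    gc_content  <- read.csv("ncbi_fungi_GC_content.csv")

    # Keep only species present in both tables (mirrors Fungi_all_gs_gc_match.csv)
    gs_gc_match <- inner_join(genome_size, gc_content, by = "species")

    write.csv(gs_gc_match, "Fungi_all_gs_gc_match_rebuilt.csv", row.names = FALSE)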
Sediment diatoms are widely used to track environmental histories of lakes and their watersheds, but merging datasets generated by different researchers for further large-scale studies is challenging because of the taxonomic discrepancies caused by rapidly evolving diatom nomenclature and taxonomic concepts. Here we collated five datasets of lake sediment diatoms from the northeastern USA using a harmonization process that included updating synonyms, tracking the identity of inconsistently identified taxa, and grouping those that could not be resolved taxonomically. The dataset consists of a Portable Document Format (.pdf) file of the Voucher Flora, six Microsoft Excel (.xlsx) data files, an R script, and five output Comma Separated Values (.csv) files.
The Voucher Flora documents the morphological species concepts in the dataset using diatom images compiled into plates (NE_Lakes_Voucher_Flora_102421.pdf) and the translation scheme of the OTU codes to diatom scientific or provisional names with identification sources, references, and notes (VoucherFloraTranslation_102421.xlsx).
The file Slide_accession_numbers_102421.xlsx has slide accession numbers in the ANS Diatom Herbarium.
The “DiatomHarmonization_032222_files for R.zip” archive contains four Excel input data files, the R code, and a subfolder “OUTPUT” with five .csv files. The file Counts_original_long_102421.xlsx contains the original diatom count data in long format. The file Harmonization_102421.xlsx is the taxonomic harmonization scheme with notes and references. The file SiteInfo_031922.xlsx contains sampling site- and sample-level information. WaterQualityData_021822.xlsx is a supplementary file with water quality data. The R code (DiatomHarmonization_032222.R) applies the harmonization scheme to the original diatom counts to produce the output files: four wide-format files containing diatom count data at different harmonization steps (Counts_1327_wide.csv, Step1_1327_wide.csv, Step2_1327_wide.csv, Step3_1327_wide.csv) and a summary of the Indicator Species Analysis (INDVAL_RESULT.csv). The harmonization scheme (Harmonization_102421.xlsx) can be further modified based on additional taxonomic investigations, while the associated R code (DiatomHarmonization_032222.R) provides a straightforward mechanism for versioning the diatom data.
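The R sketch below is a simplified stand-in for DiatomHarmonization_032222.R: it only shows the general pattern of joining a harmonization scheme to long-format counts and pivoting the result to wide format. All column names in it are assumptions.

    # Conceptual sketch; column names ("OTU_original", "OTU_step1", "sample_id", "count") assumed
    library(dplyr)
    library(tidyr)
    library(readxl)

    counts <- read_excel("Counts_original_long_102421.xlsx")
    scheme <- read_excel("Harmonization_102421.xlsx")

    # Map original OTU codes to harmonized names, then sum counts that merge
    harmonized <- counts %>%
      left_join(scheme, by = "OTU_original") %>%
      group_by(sample_id, OTU_step1) %>%
      summarise(count = sum(count), .groups = "drop")

    # Reshape to the wide (samples x taxa) layout of the output .csv files
    wide <- pivot_wider(harmonized, names_from = OTU_step1,
                        values_from = count, values_fill = 0)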
This dataset is associated with the following publication: Potapova, M., S. Lee, S. Spaulding, and N. Schulte. A harmonized dataset of sediment diatoms from hundreds of lakes in the northeastern United States. Scientific Data. Springer Nature, New York, NY, 9(540): 1-8, (2022).
The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
This dataset contains all the scripts used to carry out the uncertainty analysis for the maximum drawdown and time to maximum drawdown at the groundwater receptors in the Hunter subregion, together with all the resulting posterior predictions. This is described in product 2.6.2 Groundwater numerical modelling (Herron et al. 2016). See History for a detailed explanation of the dataset contents.
References:
Herron N, Crosbie R, Peeters L, Marvanek S, Ramage A and Wilkins A (2016) Groundwater numerical modelling for the Hunter subregion. Product 2.6.2 for the Hunter subregion from the Northern Sydney Basin Bioregional Assessment. Department of the Environment, Bureau of Meteorology, CSIRO and Geoscience Australia, Australia.
This dataset uses the results of the design of experiment runs of the groundwater model of the Hunter subregion to train emulators to (a) constrain the prior parameter ensembles into the posterior parameter ensembles and to (b) generate the predictive posterior ensembles of maximum drawdown and time to maximum drawdown. This is described in product 2.6.2 Groundwater numerical modelling (Herron et al. 2016).
A flow chart of the way the various files and scripts interact is provided in HUN_GW_UA_Flowchart.png (editable version in HUN_GW_UA_Flowchart.gliffy).
R-script HUN_DoE_Parameters.R creates the set of parameters for the design of experiment in HUN_DoE_Parameters.csv. Each of these parameter combinations is evaluated with the groundwater model (dataset HUN GW Model v01). Associated with this spreadsheet is the file HUN_GW_Parameters.csv, which lists, for each parameter, whether it is included in the sensitivity analysis, whether it is tied to another parameter, the initial value and range, the transformation, and the type of prior distribution with its mean and covariance structure.
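The archived HUN_DoE_Parameters.R defines the actual design; purely as a generic illustration of how a parameter table with ranges can be turned into a space-filling design, a hedged R sketch is given below (the column names are assumptions and Latin hypercube sampling is chosen arbitrarily, not taken from the study).

    # Generic illustration only; "include", "min", "max" and "name" columns are assumed
    library(lhs)

    pars   <- read.csv("HUN_GW_Parameters.csv")
    active <- subset(pars, include == 1)

    n_runs <- 100
    unit   <- randomLHS(n_runs, nrow(active))              # values in [0, 1]

    # Rescale each column to the parameter's range
    design <- sweep(unit, 2, active$max - active$min, "*")
    design <- sweep(design, 2, active$min, "+")
    colnames(design) <- active$name
    write.csv(design, "DoE_parameters_example.csv", row.names = FALSE)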
The results of the design of experiment model runs are summarised in files HUN_GW_dmax_DoE_Predictions.csv, HUN_GW_tmax_DoE_Predictions.csv, HUN_GW_DoE_Observations.csv and HUN_GW_DoE_mean_BL_BF_hist.csv, which contain, respectively, the maximum additional drawdown for each receptor, the time to maximum additional drawdown for each receptor, the simulated equivalents to observed groundwater levels, and the simulated equivalents to observed SW-GW fluxes. These are generated with post-processing scripts in dataset HUN GW Model v01 from the model output (as exemplified in dataset HUN GW Model simulate ua999 pawsey v01).
Spreadsheets HUN_GW_dmax_Predictions.csv and HUN_GW_tmax_Predictions.csv capture additional information on each prediction: the name of the prediction; the transformation; the min, max and median of the design of experiment runs; a boolean indicating whether the prediction is to be included in the uncertainty analysis; the layer it is assigned to; and the objective function used to constrain the prediction.
Spreadsheet HUN_GW_Observations.csv has additional information on each observation: the name of the observation; a boolean indicating whether to use the observation; the min and max of the design of experiment runs; a metadata statement describing the observation; the spatial coordinates; the observed value; and the number of observations at this location (from dataset HUN bores v01). It also lists the distance of each bore to the nearest blue line network and the distance to each prediction (both in km). Spreadsheet HUN_GW_mean_BL_BF_hist.csv has similar information, but for the SW-GW flux; the observed values are from dataset HUN Groundwater Flowrate Time Series v01.
These files are used in script HUN_GW_SI.py to generate sensitivity indices (based on the method of Plischke et al. (2013)) for each group of observations and predictions. The indices are saved in spreadsheets HUN_GW_dmax_SI.csv, HUN_GW_tmax_SI.csv, HUN_GW_hobs_SI.csv and HUN_GW_mean_BF_hist_SI.csv.
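HUN_GW_SI.py is the implementation actually used; the R sketch below only illustrates a binned, density-based sensitivity index in the spirit of Plischke et al. (2013), with the file and column layout assumed.

    # Hedged sketch of a density-based sensitivity index (delta-style estimator)
    delta_index <- function(x, y, n_classes = 10, n_bins = 30) {
      breaks_y <- seq(min(y), max(y), length.out = n_bins + 1)
      p_all    <- as.vector(table(cut(y, breaks_y, include.lowest = TRUE))) / length(y)

      # Condition on x by splitting it into (roughly) equally filled classes
      class_breaks <- unique(quantile(x, probs = seq(0, 1, length.out = n_classes + 1)))
      classes      <- cut(x, class_breaks, include.lowest = TRUE)

      sep <- tapply(y, classes, function(y_m) {
        p_m <- as.vector(table(cut(y_m, breaks_y, include.lowest = TRUE))) / length(y_m)
        sum(abs(p_all - p_m))   # L1 separation of conditional vs unconditional distribution
      })
      w <- as.vector(table(classes)) / length(y)
      0.5 * sum(w * sep, na.rm = TRUE)
    }

    # Example use (file and column choice assumed): index of each DoE parameter for one prediction
    doe  <- read.csv("HUN_DoE_Parameters.csv")
    dmax <- read.csv("HUN_GW_dmax_DoE_Predictions.csv")
    si   <- sapply(doe, delta_index, y = dmax[[1]])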
Script HUN_GW_dmax_ObjFun.py calculates the objective function values for the design of experiment runs. Each prediction has a tailored objective function, which is a weighted sum of the residuals between observations and predictions, with weights based on the distance between observation and prediction. In addition, there is an objective function for the baseflow rates. The results are stored in HUN_GW_DoE_ObjFun.csv and HUN_GW_ObjFun.csv.
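As a hedged illustration of the distance-weighted objective function idea (the actual weighting and decay length used in the study are not restated here), a minimal R function could look like this:

    # Sketch only; the archived HUN_GW_dmax_ObjFun.py is the real implementation
    weighted_objfun <- function(residuals, dist_km, decay_km = 10) {
      # Closer observations get more weight; the exponential decay length is an assumption
      w <- exp(-dist_km / decay_km)
      sum(w * abs(residuals)) / sum(w)
    }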
The latter files are used in script HUN_GW_dmax_CreatePosteriorParameters.R to carry out the Monte Carlo sampling of the prior parameter distributions with the Approximate Bayesian Computation methodology described in Herron et al. (2016), by generating and applying emulators for each objective function. The script uses the scripts in dataset R-scripts for uncertainty analysis v01 and is run on the high performance computation cluster with batch file HUN_GW_dmax_CreatePosterior.slurm. These runs result in posterior parameter combinations for each objective function, stored in directory PosteriorParameters with filename convention HUN_GW_dmax_Posterior_Parameters_OO_$OFName$.csv, where $OFName$ is the name of the objective function. Python script HUN_GW_PosteriorParameters_Percentiles.py summarizes these posterior parameter combinations and stores the results in HUN_GW_PosteriorParameters_Percentiles.csv.
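The rejection step can be pictured with the hedged R sketch below; it swaps the study's emulators for a plain linear model and assumes a uniform prior over the design-of-experiment ranges, so it is a conceptual outline rather than the archived workflow.

    # Conceptual ABC-rejection outline; emulator type, prior and threshold are assumptions
    doe_pars   <- read.csv("HUN_DoE_Parameters.csv")     # DoE parameter sets
    doe_objfun <- read.csv("HUN_GW_DoE_ObjFun.csv")      # matching objective function values

    # Simple linear emulator as a stand-in for the emulators used in the study
    emulator <- lm(doe_objfun[[1]] ~ ., data = doe_pars)

    # Draw from the prior (here: uniform over the DoE ranges)
    n_prior <- 10000
    prior   <- as.data.frame(lapply(doe_pars, function(p) runif(n_prior, min(p), max(p))))

    # Accept prior samples whose emulated objective function falls below a threshold
    pred_of   <- predict(emulator, newdata = prior)
    threshold <- quantile(doe_objfun[[1]], 0.1)          # acceptance threshold, assumed
    posterior <- prior[pred_of <= threshold, ]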
The same set of spreadsheets is used to test convergence of the emulator performance with script HUN_GW_emulator_convergence.R and batch file HUN_GW_emulator_convergence.slurm to produce spreadsheet HUN_GW_convergence_objfun_BF.csv.
The posterior parameter distributions are sampled with script HUN_GW_dmax_tmax_MCsampler.R and the associated .slurm batch file. The script creates and applies an emulator for each prediction. The emulators and results are stored in directory Emulators. This directory is not part of this dataset but can be regenerated by running the scripts on the high performance computation clusters. A single emulator and its associated output are included for illustrative purposes.
Script HUN_GW_collate_predictions.csv collates all posterior predictive distributions into spreadsheets HUN_GW_dmax_PosteriorPredictions.csv and HUN_GW_tmax_PosteriorPredictions.csv. These files are further summarised in spreadsheet HUN_GW_dmax_tmax_excprob.csv with script HUN_GW_exc_prob. For each prediction this spreadsheet contains the coordinates, the layer, the number of samples in the posterior parameter distribution, the 5th, 50th and 95th percentiles of dmax and tmax, the probability of exceeding 1 cm and 20 cm of drawdown, the maximum dmax value from the design of experiment, the threshold of the objective function, and the acceptance rate.
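A hedged R sketch of this summary step, assuming one column of posterior dmax samples per prediction and drawdown expressed in metres:

    # Sketch only; the actual collation and summary are done by the scripts named above
    post <- read.csv("HUN_GW_dmax_PosteriorPredictions.csv")

    summarise_prediction <- function(dmax_samples) {
      c(quantile(dmax_samples, c(0.05, 0.50, 0.95)),
        p_exceed_1cm  = mean(dmax_samples > 0.01),   # 1 cm, drawdown in metres assumed
        p_exceed_20cm = mean(dmax_samples > 0.20))   # 20 cm
    }

    # One row of summaries per prediction (one column per prediction assumed)
    summary_table <- t(sapply(post, summarise_prediction))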
The script HUN_GW_dmax_tmax_MCsampler.R is also used to evaluate the parameter distributions HUN_GW_dmax_Posterior_Parameters_HUN_OF_probe439.csv and HUN_GW_dmax_Posterior_Parameters_Mackie_OF_probe439.csv. These are two different parameter distributions for a single prediction, the latter representing local information. The corresponding dmax values are stored in HUN_GW_dmax_probe439_HUN.csv and HUN_GW_dmax_probe439_Mackie.csv.
Bioregional Assessment Programme (XXXX) HUN GW Uncertainty Analysis v01. Bioregional Assessment Derived Dataset. Viewed 09 October 2018, http://data.bioregionalassessments.gov.au/dataset/c25db039-5082-4dd6-bb9d-de7c37f6949a.
Derived From HUN GW Model code v01
Derived From NSW Office of Water Surface Water Entitlements Locations v1_Oct2013
Derived From NSW Office of Water - National Groundwater Information System 20140701
Derived From Travelling Stock Route Conservation Values
Derived From HUN GW Model v01
Derived From NSW Wetlands
Derived From Climate Change Corridors Coastal North East NSW
Derived From Communities of National Environmental Significance Database - RESTRICTED - Metadata only
Derived From Climate Change Corridors for Nandewar and New England Tablelands
Derived From National Groundwater Dependent Ecosystems (GDE) Atlas
Derived From R-scripts for uncertainty analysis v01
Derived From Asset database for the Hunter subregion on 27 August 2015
Derived From Birds Australia - Important Bird Areas (IBA) 2009
Derived From Estuarine Macrophytes of Hunter Subregion NSW DPI Hunter 2004
Derived From Hunter CMA GDEs (DRAFT DPI pre-release)
Derived From Camerons Gorge Grassy White Box Endangered Ecological Community (EEC) 2008
Derived From Atlas of Living Australia NSW ALA Portal 20140613
Derived From Spatial Threatened Species and Communities (TESC) NSW