100+ datasets found
  1. R script that creates a wrapper function to automate the generation of boxplots of change factors for all ArcHydro Enhanced Database (AHED) basins (basin_boxplot.R)

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). R script that creates a wrapper function to automate the generation of boxplots of change factors for all ArcHydro Enhanced Database (AHED) basins (basin_boxplot.R) [Dataset]. https://catalog.data.gov/dataset/r-script-that-creates-a-wrapper-function-to-automate-the-generation-of-boxplots-of-change-
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    The South Florida Water Management District (SFWMD) and the U.S. Geological Survey have developed projected future change factors for precipitation depth-duration-frequency (DDF) curves at 174 National Oceanic and Atmospheric Administration (NOAA) Atlas 14 stations in central and south Florida. The change factors were computed as the ratio of projected future to historical extreme precipitation depths fitted to extreme precipitation data from various downscaled climate datasets using a constrained maximum likelihood (CML) approach. The change factors correspond to the period 2050-2089 (centered in the year 2070) as compared to the 1966-2005 historical period. An R script (basin_boxplot.R) is provided as an example of how to create a wrapper function that will automate the generation of boxplots of change factors for all AHED basins. The wrapper script sources the file create_boxplot.R and calls the function create_boxplot() one AHED basin at a time to create a figure with boxplots of change factors for all durations (1, 3, and 7 days) and return periods (5, 10, 25, 50, 100, and 200 years) evaluated as part of this project. An example is also provided in the code that shows how to generate a figure showing boxplots of change factors for a single duration and return period. A Microsoft Word file documenting code usage is also provided within this data release (Documentation_R_script_create_boxplot.docx). As described in the documentation, the R script relies on some of the Microsoft Excel spreadsheets published as part of this data release.
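    A minimal sketch of what such a wrapper might look like (create_boxplot() and create_boxplot.R are named in the release; the basin names, the function's argument, and the helper below are illustrative assumptions, not the published code):

      # Hypothetical wrapper: loop over AHED basins, one call to create_boxplot() per basin.
      # Only create_boxplot() and create_boxplot.R come from the data release description.
      source("create_boxplot.R")

      basins <- c("Basin_A", "Basin_B", "Basin_C")   # placeholder AHED basin names

      make_all_boxplots <- function(basin_names) {
        for (basin in basin_names) {
          create_boxplot(basin)   # one AHED basin at a time, as described above
        }
      }

      make_all_boxplots(basins)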

  2. Storage and Transit Time Data and Code

    • zenodo.org
    zip
    Updated Oct 29, 2024
    Cite
    Andrew Felton; Andrew Felton (2024). Storage and Transit Time Data and Code [Dataset]. http://doi.org/10.5281/zenodo.14009758
    Dataset updated
    Oct 29, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrew Felton; Andrew Felton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Author: Andrew J. Felton
    Date: 10/29/2024

    This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, and analysis, and for figure production for the study entitled:

    "Global estimates of the storage and transit time of water through vegetation"

    Please note that 'turnover' and 'transit' are used interchangeably. Also please note that this R project has been updated multiple times as the analysis has been updated.

    Data information:

    The data folder contains key data sets used for analysis. In particular:

    "data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.

    Code information:

    Python scripts can be found in the "supporting_code" folder.

    Each R script in this project has a role:

    "01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).

    "02_functions.R": This script contains custom functions. Load this using the
    `source()` function in the 01_start.R script.

    "03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
    `source()` function in the 01_start.R script.
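    A minimal sketch of how 01_start.R might chain these scripts (the working-directory path is a placeholder; only the script names, the tidyverse load, and the use of source() come from the description above):

      # Hypothetical sketch of 01_start.R based on the description above.
      setwd("~/storage_transit_time_project")   # placeholder project path
      library(tidyverse)                        # remaining packages are called via ::

      source("02_functions.R")     # load the custom functions
      source("03_import_data.R")   # import/process the .csv transit data -> annual_turnover_2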

    "04_figures_tables.R": This is the main workhouse for figure/table production and
    supporting analyses. This script generates the key figures and summary statistics
    used in the study that then get saved in the manuscript_figures folder. Note that all
    maps were produced using Python code found in the "supporting_code"" folder.

    "supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.

    "supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.

  3. Data from: Source code for R tutorials and dataset for empirical case study on Malurus elegans (red-winged fairy wren)

    • datadryad.org
    • data.niaid.nih.gov
    zip
    Updated Jul 27, 2021
    Cite
    Martijn van de Pol; Lyanne Brouwer (2021). Source code for R tutorials and dataset for empirical case study on Malurus elegans (red-winged fairy wren) [Dataset]. http://doi.org/10.5061/dryad.7h44j0ztw
    Dataset updated
    Jul 27, 2021
    Dataset provided by
    Dryad
    Authors
    Martijn van de Pol; Lyanne Brouwer
    Time period covered
    Jul 23, 2021
    Description

    Biological processes exhibit complex temporal dependencies due to the sequential nature of allocation decisions in organisms’ life-cycles, feedback loops, and two-way causality. Consequently, longitudinal data often contain cross-lags: the predictor variable depends on the response variable of the previous time-step. Although statisticians have warned that regression models that ignore such covariate endogeneity in time series are likely to be inappropriate, this has received relatively little attention in biology. Furthermore, the resulting degree of estimation bias remains largely unexplored.

    We use a graphical model and numerical simulations to understand why and how regression models that ignore cross-lags can be biased, and how this bias depends on the length and number of time series. Ecological and evolutionary examples are provided to illustrate that cross-lags may be more common than is typically appreciated and that they occur in functionally different ways.

    We show that rou...
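    A small illustrative simulation (not the authors' code) of the kind of cross-lag described above, where the predictor at time t depends on the response at time t-1 and a naive regression ignores that dependence:

      # Illustrative only: simulate a time series with a cross-lag, then fit a naive model.
      set.seed(1)
      n <- 100
      y <- numeric(n); x <- numeric(n)
      y[1] <- rnorm(1); x[1] <- rnorm(1)
      for (t in 2:n) {
        x[t] <- 0.6 * y[t - 1] + rnorm(1)   # predictor depends on the previous response
        y[t] <- 0.3 * x[t] + rnorm(1)       # response depends on the current predictor
      }
      summary(lm(y[-1] ~ x[-1]))            # naive regression that ignores the cross-lag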

  4. Chemical product and function dataset

    • catalog.data.gov
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Chemical product and function dataset [Dataset]. https://catalog.data.gov/dataset/chemical-product-and-function-dataset
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Merged product weight fraction and chemical function data. This dataset is associated with the following publication: Isaacs, K., M. Goldsmith, P. Egeghy, K. Phillips, R. Brooks, T. Hong, and J. Wambaugh. Characterization and prediction of chemical functions and weight fractions in consumer products. Toxicology Reports. Elsevier B.V., Amsterdam, NETHERLANDS, 3: 723-732, (2016).

  5. Long Covid Risk

    • figshare.com
    txt
    Updated Apr 13, 2024
    Cite
    Ahmed Shaheen (2024). Long Covid Risk [Dataset]. http://doi.org/10.6084/m9.figshare.25599591.v1
    Dataset updated
    Apr 13, 2024
    Dataset provided by
    figshare
    Authors
    Ahmed Shaheen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Feature preparation

    Preprocessing was applied to the data, such as creating dummy variables and performing transformations (centering, scaling, Yeo-Johnson) using the preProcess() function from the "caret" package in R. The correlation among the variables was examined and no serious multicollinearity problems were found. A stepwise variable selection was performed using a logistic regression model. The final set of variables included:

    Demographic: age, body mass index, sex, ethnicity, smoking.
    History of disease: heart disease, migraine, insomnia, gastrointestinal disease.
    COVID-19 history: COVID vaccination, rashes, conjunctivitis, shortness of breath, chest pain, cough, runny nose, dysgeusia, muscle and joint pain, fatigue, fever, COVID-19 reinfection, and ICU admission.

    These variables were used to train and test various machine-learning models.

    Model selection and training

    The data was randomly split into 80% training and 20% testing subsets. The "h2o" package in R version 4.3.1 was employed to implement different algorithms. AutoML was first used, which automatically explored a range of models with different configurations. Gradient Boosting Machines (GBM), Random Forest (RF), and Regularized Generalized Linear Model (GLM) were identified as the best-performing models on our data and their parameters were fine-tuned. An ensemble method that stacked different models together was also used, as it could sometimes improve the accuracy. The models were evaluated using the area under the curve (AUC) and C-statistics as diagnostic measures. The model with the highest AUC was selected for further analysis using the confusion matrix, accuracy, sensitivity, specificity, and F1 and F2 scores. The optimal prediction threshold was determined by plotting the sensitivity, specificity, and accuracy and choosing the point of intersection, as it balanced the trade-off between the three metrics. The model's predictions were also plotted, and the quantile ranges were used to classify the model's prediction as follows: > 1st quantile, > 2nd quantile, > 3rd quartile and < 3rd quartile (very low, low, moderate, high), respectively.

    Metric: Formula
    C-statistics: (TPR + TNR - 1) / 2
    Sensitivity/Recall: TP / (TP + FN)
    Specificity: TN / (TN + FP)
    Accuracy: (TP + TN) / (TP + TN + FP + FN)
    F1 score: 2 * (precision * recall) / (precision + recall)

    Model interpretation

    We used the variable importance plot, which is a measure of how much each variable contributes to the prediction power of a machine learning model. In the h2o package, variable importance for GBM and RF is calculated by measuring the decrease in the model's error when a variable is split on. The more a variable's split decreases the error, the more important that variable is considered to be. The error is calculated as SE = MSE * N = VAR * N, and it is then scaled between 0 and 1 and plotted. We also used the SHAP summary plot, a graphical tool to visualize the impact of input features on the prediction of a machine learning model. SHAP stands for SHapley Additive exPlanations, a method to calculate the contribution of each feature to the prediction by averaging over all possible subsets of features [28]. The SHAP summary plot shows the distribution of the SHAP values for each feature across the data instances. We use the h2o.shap_summary_plot() function in R to generate the SHAP summary plot for our GBM model. We pass the model object and the test data as arguments, and optionally specify the columns (features) we want to include in the plot. The plot shows the SHAP values for each feature on the x-axis and the features on the y-axis. The color indicates whether the feature value is low (blue) or high (red). The plot also shows the distribution of the feature values as a density plot on the right.
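    A hedged sketch of this preprocessing and AutoML workflow (object names and the outcome column "long_covid" are placeholders; preProcess(), h2o.automl(), and h2o.shap_summary_plot() are the functions named in the description):

      # Sketch only (not the authors' code): 'train_raw', 'test_raw', and "long_covid" are placeholders.
      library(caret)
      library(h2o)

      pp    <- preProcess(train_raw, method = c("center", "scale", "YeoJohnson"))
      train <- predict(pp, train_raw)
      test  <- predict(pp, test_raw)

      h2o.init()
      train_h2o <- as.h2o(train)
      test_h2o  <- as.h2o(test)

      aml  <- h2o.automl(y = "long_covid", training_frame = train_h2o, max_models = 20)
      best <- aml@leader
      h2o.auc(h2o.performance(best, newdata = test_h2o))

      # SHAP summary plot (supported for tree-based leaders such as GBM)
      h2o.shap_summary_plot(best, newdata = test_h2o)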

  6. R script that creates a wrapper function to automate the generation of boxplots of change factors for all Florida HUC-8 basins (basin_boxplot.R)

    • catalog.data.gov
    Updated Jul 20, 2024
    Cite
    U.S. Geological Survey (2024). R script that creates a wrapper function to automate the generation of boxplots of change factors for all Florida HUC-8 basins (basin_boxplot.R) [Dataset]. https://catalog.data.gov/dataset/r-script-that-creates-a-wrapper-function-to-automate-the-generation-of-boxplots-of-change--f7fc2
    Dataset updated
    Jul 20, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    The Florida Flood Hub for Applied Research and Innovation and the U.S. Geological Survey have developed projected future change factors for precipitation depth-duration-frequency (DDF) curves at 242 National Oceanic and Atmospheric Administration (NOAA) Atlas 14 stations in Florida. The change factors were computed as the ratio of projected future to historical extreme-precipitation depths fitted to extreme-precipitation data from downscaled climate datasets using a constrained maximum likelihood (CML) approach as described in https://doi.org/10.3133/sir20225093. The change factors correspond to the periods 2020-59 (centered in the year 2040) and 2050-89 (centered in the year 2070) as compared to the 1966-2005 historical period. An R script (basin_boxplot.R) is provided as an example on how to create a wrapper function that will automate the generation of boxplots of change factors for all Florida HUC-8 basins. The wrapper script sources the file create_boxplot.R and calls the function create_boxplot() one Florida basin at a time to create a figure with boxplots of change factors for all durations (1, 3, and 7 days) and return periods (5, 10, 25, 50, 100, 200, and 500 years) evaluated as part of this project. An example is also provided in the code that shows how to generate a figure showing boxplots of change factors for a single duration and return period. A Microsoft Word file documenting code usage is also provided within this data release (Documentation_R_script_create_boxplot.docx). As described in the documentation, the R script relies on some of the Microsoft Excel spreadsheets published as part of this data release. The script uses HUC-8 basins defined in the "Florida Hydrologic Unit Code (HUC) Basins (areas)" from the Florida Department of Environmental Protection (FDEP; https://geodata.dep.state.fl.us/datasets/FDEP::florida-hydrologic-unit-code-huc-basins-areas/explore) and their names are listed in the file basins_list.txt provided with the script. County names are listed in the file counties_list.txt provided with the script. NOAA Atlas 14 stations located in each Florida basin or county are defined in the Microsoft Excel spreadsheet Datasets_station_information.xlsx which is part of this data release. Instructions are provided in code documentation (see highlighted text on page 7 of Documentation_R_script_create_boxplot.docx) so that users can modify the script to generate boxplots for basins different from the FDEP "Florida Hydrologic Unit Code (HUC) Basins (areas)."
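    As with the AHED release above, a call for a single duration and return period might look like the following (only create_boxplot() and create_boxplot.R are named in the release; the basin name and the argument names duration and return_period are hypothetical):

      # Hypothetical single-figure call for one basin, one duration, and one return period.
      source("create_boxplot.R")
      create_boxplot(basin = "Some HUC-8 basin",  # placeholder name from basins_list.txt
                     duration = 3,                # days (1, 3, or 7)
                     return_period = 100)         # years (5, 10, 25, 50, 100, 200, or 500)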

  7. Example data simulated using the R package survtd

    • figshare.unimelb.edu.au
    txt
    Updated May 31, 2023
    Cite
    Margarita Moreno-Betancur (2023). Example data simulated using the R package survtd [Dataset]. http://doi.org/10.4225/49/58e58a8dc39a6
    Dataset updated
    May 31, 2023
    Dataset provided by
    The University of Melbourne
    Authors
    Margarita Moreno-Betancur
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This example dataset is used to illustrate the usage of the R package survtd in the Supplementary Materials of the paper: Moreno-Betancur M, Carlin JB, Brilleman SL, Tanamas S, Peeters A, Wolfe R (2017). Survival analysis with time-dependent covariates subject to measurement error and missing data: Two-stage joint model using multiple imputation (submitted). The data was generated using the simjm function of the package, using the following code: dat

  8. Simulation Data & R scripts for: "Introducing recurrent events analyses to assess species interactions based on camera trap data: a comparison with time-to-first-event approaches"

    • data.niaid.nih.gov
    Updated Apr 29, 2024
    Cite
    Ferry, Nicolas (2024). Simulation Data & R scripts for: "Introducing recurrent events analyses to assess species interactions based on camera trap data: a comparison with time-to-first-event approaches" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11085005
    Dataset updated
    Apr 29, 2024
    Dataset authored and provided by
    Ferry, Nicolas
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    File descriptions:

    All csv files refer to results from the different models (PAMM, AARs, linear models, MRPPs) on each iteration of the simulation, one row being one iteration. "results_perfect_detection.csv" refers to the results from the first simulation part with all the observations. "results_imperfect_detection.csv" refers to the results from the first simulation part with randomly thinned observations to mimic imperfect detection.

    ID_run: identifier of the iteration (N: number of sites, D_AB: duration of the effect of A on B, D_BA: duration of the effect of B on A, AB: effect of A on B, BA: effect of B on A, Se: seed number of the iteration).
    PAMM30: p-value of the PAMM run on the 30-day survey.
    PAMM7: p-value of the PAMM run on the 7-day survey.
    AAR1: ratio value for the Avoidance-Attraction Ratio calculating AB/BA.
    AAR2: ratio value for the Avoidance-Attraction Ratio calculating BAB/BB.
    Harmsen_P: p-value from the linear model with interaction Species1*Species2 from Harmsen et al. (2009).
    Niedballa_P: p-value from the linear model comparing AB to BA (Niedballa et al. 2021).
    Karanth_permA: rank of the observed interval duration median (AB and BA undifferentiated) compared to the randomized median distribution, when permuting on species A (Karanth et al. 2017).
    MurphyAB_permA: rank of the observed AB interval duration median compared to the randomized median distribution, when permuting on species A (Murphy et al. 2021).
    MurphyBA_permA: rank of the observed BA interval duration median compared to the randomized median distribution, when permuting on species A (Murphy et al. 2021).
    Karanth_permB: rank of the observed interval duration median (AB and BA undifferentiated) compared to the randomized median distribution, when permuting on species B (Karanth et al. 2017).
    MurphyAB_permB: rank of the observed AB interval duration median compared to the randomized median distribution, when permuting on species B (Murphy et al. 2021).
    MurphyBA_permB: rank of the observed BA interval duration median compared to the randomized median distribution, when permuting on species B (Murphy et al. 2021).

    "results_int_dir_perf_det.csv" refers to the results from the second simulation part, with all the observations."results_int_dir_imperf_det.csv" refers to the results from the second simulation part, with randomly thinned observations to mimick imperfect detection.ID_run: identified of the iteration (N: number of sites, D_AB: duration of the effect of A on B, D_BA: duration of the effect of B on A, AB: effect of A on B, BA: effect of B on A, Se: seed number of the iteration).p_pamm7_AB: p-value of the PAMM running on the 7-days survey testing for the effect of A on B.p_pamm7_AB: p-value of the PAMM running on the 7-days survey testing for the effect of B on A.AAR1: ratio value for the Avoidance-Attraction-Ratio calculating AB/BA.AAR2_BAB: ratio value for the Avoidance-Attraction-Ratio calculating BAB/BB.AAR2_ABA: ratio value for the Avoidance-Attraction-Ratio calculating ABA/AA.Harmsen_P: p-value from the linear model with interaction Species1*Species2 from Harmsen et al. (2009).Niedballa_P: p-value from the linear model comparing AB to BA (Niedballa et al. 2021).Karanth_permA: rank of the observed interval duration median (AB and BA undifferenciated) compared to the randomized median distribution, when permuting on species A (Karanth et al. 2017).MurphyAB_permA: rank of the observed AB interval duration median compared to the randomized median distribution, when permuting on species A (Murphy et al. 2021). MurphyBA_permA: rank of the observed BA interval duration median compared to the randomized median distribution, when permuting on species A (Murphy et al. 2021). Karanth_permB: rank of the observed interval duration median (AB and BA undifferenciated) compared to the randomized median distribution, when permuting on species B (Karanth et al. 2017).MurphyAB_permB: rank of the observed AB interval duration median compared to the randomized median distribution, when permuting on species B (Murphy et al. 2021). MurphyBA_permB: rank of the observed BA interval duration median compared to the randomized median distribution, when permuting on species B (Murphy et al. 2021).

    Script file descriptions:
    1_Functions: R script containing the functions: the MRPP from Karanth et al. (2017), adapted here for time efficiency; the MRPP from Murphy et al. (2021), adapted here for time efficiency; a version of the ct_to_recurrent() function from the recurrent package adapted to process the simulation datasets in parallel; and the simulation() function used to simulate observations of two species with reciprocal effects on each other.
    2_Simulations: R script containing the parameter definitions for all iterations (for the two parts of the simulations), the simulation parallelization, and the random thinning mimicking imperfect detection.
    3_Approaches comparison: R script containing the fit of the different models tested on the simulated data.
    3_1_Real data comparison: R script containing the fit of the different models tested on the real data example from Murphy et al. 2021.
    4_Graphs: R script containing the code for plotting results from the simulation part and appendices.
    5_1_Appendix - Check for similarity between codes for Karanth et al 2017 method: R script containing the Karanth et al. (2017) and Murphy et al. (2021) code, the version adapted for time efficiency, and a comparison to verify similarity of results.
    5_2_Appendix - Multi-response procedure permutation difference: R script containing R code to test for differences between the MRPP approaches according to the species on which permutations are done.

  9. Data from: Functional Additive Mixed Models

    • tandf.figshare.com
    txt
    Updated Jun 2, 2023
    Cite
    Fabian Scheipl; Ana-Maria Staicu; Sonja Greven (2023). Functional Additive Mixed Models [Dataset]. http://doi.org/10.6084/m9.figshare.987098.v2
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Fabian Scheipl; Ana-Maria Staicu; Sonja Greven
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We propose an extensive framework for additive regression models for correlated functional responses, allowing for multiple partially nested or crossed functional random effects with flexible correlation structures for, for example, spatial, temporal, or longitudinal functional data. Additionally, our framework includes linear and nonlinear effects of functional and scalar covariates that may vary smoothly over the index of the functional response. It accommodates densely or sparsely observed functional responses and predictors which may be observed with additional error and includes both spline-based and functional principal component-based terms. Estimation and inference in this framework is based on standard additive mixed models, allowing us to take advantage of established methods and robust, flexible algorithms. We provide easy-to-use open source software in the pffr() function for the R package refund. Simulations show that the proposed method recovers relevant effects reliably, handles small sample sizes well, and also scales to larger datasets. Applications with spatially and longitudinally observed functional data demonstrate the flexibility in modeling and interpretability of results of our approach.
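    A minimal usage sketch of the pffr() interface mentioned above, based on the refund package documentation (the pffrSim() simulation helper, the "ff" scenario, and the default index arguments are assumptions that may vary by package version):

      # Sketch based on the refund documentation; arguments may differ across versions.
      library(refund)

      sim <- pffrSim(scenario = "ff", n = 40)   # simulated functional response and covariate
      fit <- pffr(Y ~ ff(X1), data = sim)       # function-on-function additive model
      summary(fit)
      plot(fit, pages = 1)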

  10. POPMAPS: An R package to estimate ancestry probability surfaces

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). POPMAPS: An R package to estimate ancestry probability surfaces [Dataset]. https://catalog.data.gov/dataset/popmaps-an-r-package-to-estimate-ancestry-probability-surfaces
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    This software code was developed to estimate the probability that individuals found at a geographic location will belong to the same genetic cluster as individuals at the nearest empirical sampling location for which ancestry is known. POPMAPS includes 5 main functions to calculate and visualize these results (see Table 1 for functions and arguments). Population assignment coefficients and a raster surface must be estimated prior to using POPMAPS functions (see Fig. 1a and b). With these data in hand, users can run a jackknife function to choose an optimal parameter combination that reconstructs empirical data best (Figs. 2 and S2). Pertinent parameters include 1) how many empirical sampling localities should be used to estimate ancestry coefficients and 2) what is the influence of empirical sites on ancestry coefficient estimation as distance increases (Fig. 2). After choosing these parameters, a user can estimate the entire ancestry probability surface (Fig. 1c and d, Fig. 3). This package can be used to estimate ancestry coefficients from empirical genetic data across a user-defined geospatial layer. Estimated ancestry coefficients are used to calculate ancestry probabilities, which together with 'hard population boundaries,' compose an ancestry probability surface. Within a hard boundary, the ancestry probability informs a user of the confidence that they can have of genetic identity matching the principal population if they were to find individuals of the focal organism at a location. Confidence can be modified across the ancestry probability surface by changing parameters influencing the contribution of empirical data to the estimation of ancestry coefficients. This information may be valuable to inform decision-making for organisms having management needs. See 'Related External Resources, Type: Source Code' below for direct access to the POPMAPS R software package.

  11. Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem

    • zenodo.org
    application/gzip, bin +2
    Updated Aug 2, 2024
    Cite
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788
    Available download formats: bin, application/gzip, zip, text/x-python
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** link to data artifacts is already included in the paper. 
    Link to the code will be included in the Camera Ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on the level of detail (see Step 2 for more details):
    - up to 2TB of disk space (see Step 2 for the levels of detail)
    - at least 16GB of RAM (64GB preferable)
    - a few hours to a few months of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       git checkout 0.1.0
     
     `cd` into the extracted folder. 
     All commands below assume it is the current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file:
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt` . 
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/init.py`, comment out everything except GitHub support
     in `PROVIDERS`.
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speedup
    the process:
    
    #### Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
    
    #### Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15..30 minutes.
    
    - create a folder `
  12. Leamington AURN Air Quality Data

    • kaggle.com
    Updated Oct 20, 2019
    Cite
    Anthony Walker (2019). Leamington AURN Air Quality Data [Dataset]. https://www.kaggle.com/datasets/airqualityanthony/leamington-aurn-air-quality-data
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Oct 20, 2019
    Dataset provided by
    Kaggle
    Authors
    Anthony Walker
    Description

    Air Quality data pulled from AURN monitoring network: https://uk-air.defra.gov.uk/

    Site LEAR: Leamington Spa Rugby Road AURN Station.

    2009 - 2019 AURN data pull for all pollutants and metadata

    Pulled from the AURN network using R (openair package, importAURN function): http://www.openair-project.org/
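    The pull can be reproduced with openair's importAURN() (the site code "lear" and the year range come from the description above; further arguments are left at their defaults):

      # Reproduce the data pull described above using the openair package.
      library(openair)

      lear <- importAURN(site = "lear", year = 2009:2019)  # Leamington Spa Rugby Road (LEAR), all pollutants
      head(lear)
      summary(lear)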

    Data licensed under the Open Government Licence: https://uk-air.defra.gov.uk/about-these-pages#licence http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/

  13. Codes in R for spatial statistics analysis, ecological response models and spatial distribution models

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Apr 24, 2025
    Cite
    D. W. Rössel-Ramírez; D. W. Rössel-Ramírez; J. Palacio-Núñez; J. Palacio-Núñez; S. Espinosa; S. Espinosa; J. F. Martínez-Montoya; J. F. Martínez-Montoya (2025). Codes in R for spatial statistics analysis, ecological response models and spatial distribution models [Dataset]. http://doi.org/10.5281/zenodo.7603557
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    D. W. Rössel-Ramírez; D. W. Rössel-Ramírez; J. Palacio-Núñez; J. Palacio-Núñez; S. Espinosa; S. Espinosa; J. F. Martínez-Montoya; J. F. Martínez-Montoya
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In the last decade, a plethora of algorithms have been developed for spatial ecology studies. In our case, we use some of these algorithms for underwater research in applied ecology, analyzing threatened endemic fishes and their natural habitat. For this, we developed code in the RStudio® script environment to run spatial and statistical analyses for ecological response and spatial distribution models (e.g., Hijmans & Elith, 2017; Den Burg et al., 2020). The R packages employed are as follows: caret (Kuhn et al., 2020), corrplot (Wei & Simko, 2017), devtools (Wickham, 2015), dismo (Hijmans & Elith, 2017), gbm (Freund & Schapire, 1997; Friedman, 2002), ggplot2 (Wickham et al., 2019), lattice (Sarkar, 2008), lattice (Musa & Mansor, 2021), maptools (Hijmans & Elith, 2017), modelmetrics (Hvitfeldt & Silge, 2021), pander (Wickham, 2015), plyr (Wickham & Wickham, 2015), pROC (Robin et al., 2011), raster (Hijmans & Elith, 2017), RColorBrewer (Neuwirth, 2014), Rcpp (Eddelbeuttel & Balamura, 2018), rgdal (Verzani, 2011), sdm (Naimi & Araujo, 2016), sf (e.g., Zainuddin, 2023), sp (Pebesma, 2020) and usethis (Gladstone, 2022).

    It is important to follow all the codes in order to obtain results from the ecological response and spatial distribution models. In particular, for the ecological scenario we selected the Generalized Linear Model (GLM), and for the geographic scenario we selected DOMAIN, also known as Gower's metric (Carpenter et al., 1993). We selected this regression method and this distance-similarity metric because of their adequacy and robustness for studies with endemic or threatened species (e.g., Naoki et al., 2006). Next, we explain the statistical parameterization of the code used to run the GLM and DOMAIN models:

    In the first instance, we generated the background points and extracted the values of the variables (Code2_Extract_values_DWp_SC.R). Barbet-Massin et al. (2012) recommend the use of 10,000 background points when using regression methods (e.g., Generalized Linear Model) or distance-based models (e.g., DOMAIN). However, we considered factors such as the extent of the area and the type of study species to be important for the correct selection of the number of points (pers. obs.). Then, we extracted the values of the predictor variables (e.g., bioclimatic, topographic, demographic, habitat) as a function of the presence and background points (e.g., Hijmans and Elith, 2017).

    Subsequently, we subdivided both the presence and background point groups into 75% training data and 25% test data each, following the method of Soberón & Nakamura (2009) and Hijmans & Elith (2017). For training control, the 10-fold (cross-validation) method is selected, where the response variable presence is assigned as a factor. If some other variable is important for the study species, it should also be assigned as a factor (Kim, 2009).
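    A generic sketch of this split and cross-validation setup using the caret package listed above (not the release's own code; 'dat' with a 'presence' column is a placeholder object):

      # Sketch only: 'dat' with a 'presence' column is a placeholder data frame.
      library(caret)

      set.seed(123)
      dat$presence <- factor(dat$presence)                      # response assigned as a factor
      idx   <- createDataPartition(dat$presence, p = 0.75, list = FALSE)
      train <- dat[idx, ]                                       # 75% training data
      test  <- dat[-idx, ]                                      # 25% test data

      ctrl <- trainControl(method = "cv", number = 10)          # 10-fold cross-validation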

    After that, we ran the code for the GBM method (Gradient Boosting Machine; Code3_GBM_Relative_contribution.R and Code4_Relative_contribution.R), where we obtained the relative contribution of the variables used in the model. We parameterized the code with a Gaussian distribution and a cross-iteration of 5,000 repetitions (e.g., Friedman, 2002; Kim, 2009; Hijmans and Elith, 2017). In addition, we selected a validation interval of 4 random training points (personal test). The plots obtained were the partial dependence blocks, as a function of each predictor variable.

    Subsequently, the correlation of the variables is computed using Pearson's method (Code5_Pearson_Correlation.R) to evaluate multicollinearity between variables (Guisan & Hofer, 2003). It is recommended to use a bivariate correlation threshold of ±0.70 to discard highly correlated variables (e.g., Awan et al., 2021).
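    A generic sketch of this collinearity screen with the corrplot package named above ('predictors' is a placeholder data frame of candidate variables):

      # Sketch only: 'predictors' is a placeholder data frame of predictor variables.
      library(corrplot)

      r <- cor(predictors, method = "pearson", use = "pairwise.complete.obs")
      corrplot(r, method = "color", type = "upper")

      # flag pairs above the |0.70| threshold recommended in the description
      which(abs(r) > 0.70 & abs(r) < 1, arr.ind = TRUE)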

    Once the above codes were run, we loaded the same subgroups (i.e., presence and background groups with 75% training and 25% testing) (Code6_Presence&backgrounds.R) for the GLM method code (Code7_GLM_model.R). Here, we first ran the GLM models per variable to obtain the significance (p) value of each variable (alpha ≤ 0.05); we selected the value one (i.e., presence) as the likelihood factor. The generated models are of polynomial degree, to obtain linear and quadratic responses (e.g., Fielding and Bell, 1997; Allouche et al., 2006). From these results, we ran ecological response curve models, where the resulting plots included the probability of occurrence and values for continuous variables or categories for discrete variables. The points of the presence and background training group are also included.

    On the other hand, a global GLM was also run, from which the generalized model is evaluated by means of a 2 x 2 contingency matrix, including both observed and predicted records. A representation of this is shown in Table 1 (adapted from Allouche et al., 2006). In this process we selected an arbitrary boundary of 0.5 to obtain better modeling performance and avoid a high percentage of bias in type I (omission) or type II (commission) errors (e.g., Carpenter et al., 1993; Fielding and Bell, 1997; Allouche et al., 2006; Kim, 2009; Hijmans and Elith, 2017).

    Table 1. Example of 2 x 2 contingency matrix for calculating performance metrics for GLM models. A represents true presence records (true positives), B represents false presence records (false positives - error of commission), C represents true background points (true negatives) and D represents false backgrounds (false negatives - errors of omission).

                    Validation set
    Model           True            False
    Presence        A               B
    Background      C               D

    We then calculated the Overall accuracy and True Skill Statistic (TSS) metrics. The first is used to assess the proportion of correctly predicted cases, while the second assesses the prevalence of correctly predicted cases (Olden and Jackson, 2002). This metric also gives equal importance to the prevalence of presence prediction and to the random performance correction (Fielding and Bell, 1997; Allouche et al., 2006).
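    In terms of the counts A, B, C, and D defined in Table 1, these metrics reduce to simple arithmetic (illustrative counts only):

      # Illustrative counts for the 2 x 2 matrix of Table 1 (A, B, C, D as in the caption).
      A <- 80; B <- 20; C <- 70; D <- 30

      sensitivity <- A / (A + D)                 # true presences correctly predicted
      specificity <- C / (C + B)                 # true backgrounds correctly predicted
      overall     <- (A + C) / (A + B + C + D)   # overall accuracy
      TSS         <- sensitivity + specificity - 1
      c(sensitivity = sensitivity, specificity = specificity, overall = overall, TSS = TSS)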

    The last code (i.e., Code8_DOMAIN_SuitHab_model.R) is for species distribution modelling using the DOMAIN algorithm (Carpenter et al., 1993). Here, we loaded the variable stack and the presence and background group subdivided into 75% training and 25% test, each. We only included the presence training subset and the predictor variables stack in the calculation of the DOMAIN metric, as well as in the evaluation and validation of the model.
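    A generic sketch of fitting and evaluating a DOMAIN model with the dismo package listed above (the raster stack and the presence/background objects are placeholders, not the release's own code):

      # Sketch only: 'env_stack' (RasterStack of predictors), 'pres_train', 'pres_test',
      # and 'bg_test' (coordinate matrices) are placeholder objects.
      library(dismo)

      dm   <- domain(env_stack, pres_train)        # DOMAIN / Gower similarity model
      suit <- predict(dm, env_stack)               # suitability surface
      ev   <- evaluate(p = pres_test, a = bg_test, model = dm, x = env_stack)
      ev@auc                                       # AUC on the 25% test data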

    Regarding the model evaluation and estimation, we selected the following estimators:

    1) partial ROC, which evaluates the separation between the curves of positive (i.e., correctly predicted presence) and negative (i.e., correctly predicted absence) cases. The farther apart these curves are, the better the model's prediction performance for the correct spatial distribution of the species (Manzanilla-Quiñones, 2020).

    2) ROC/AUC curve for model validation, where an optimal performance threshold is estimated to have an expected confidence of 75% to 99% probability (De Long et al., 1988).

  14. Data from: WiBB: An integrated method for quantifying the relative importance of predictive variables

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Aug 20, 2021
    Cite
    Qin Li; Xiaojun Kou (2021). WiBB: An integrated method for quantifying the relative importance of predictive variables [Dataset]. http://doi.org/10.5061/dryad.xsj3tx9g1
    Dataset updated
    Aug 20, 2021
    Dataset provided by
    Beijing Normal University
    Field Museum of Natural History
    Authors
    Qin Li; Xiaojun Kou
    License

    CC0 1.0 Universal: https://spdx.org/licenses/CC0-1.0.html

    Description

    This dataset contains simulated datasets, empirical data, and R scripts described in the paper: “Li, Q. and Kou, X. (2021) WiBB: An integrated method for quantifying the relative importance of predictive variables. Ecography (DOI: 10.1111/ecog.05651)”.

    A fundamental goal of scientific research is to identify the underlying variables that govern crucial processes of a system. Here we proposed a new index, WiBB, which integrates the merits of several existing methods: a model-weighting method from information theory (Wi), a standardized regression coefficient method measured by ß* (B), and the bootstrap resampling technique (B). We applied WiBB to simulated datasets with known correlation structures, for both linear models (LM) and generalized linear models (GLM), to evaluate its performance. We also applied two other methods, relative sum of weights (SWi) and standardized beta (ß*), to evaluate their performance in comparison with the WiBB method at ranking predictor importance under various scenarios. We also applied it to an empirical dataset in the plant genus Mimulus to select bioclimatic predictors of species' presence across the landscape. Results on the simulated datasets showed that the WiBB method outperformed the ß* and SWi methods in scenarios with small and large sample sizes, respectively, and that the bootstrap resampling technique significantly improved the discriminant ability. When testing WiBB on the empirical dataset with GLM, it sensibly identified four important predictors with high credibility out of six candidates in modeling the geographical distributions of 71 Mimulus species. This integrated index has great advantages in evaluating predictor importance and hence reducing the dimensionality of data, without losing interpretive power. The simplicity of calculation of the new metric, compared with more sophisticated statistical procedures, makes it a handy method in the statistical toolbox.

    Methods

    To simulate independent datasets (size = 1000), we adopted Galipaud et al.'s approach (2014) with custom modifications of the data.simulation function, which used the multivariate normal distribution function rmvnorm in the R package mvtnorm (v1.0-5, Genz et al. 2016). Each dataset was simulated with a preset correlation structure between a response variable (y) and four predictors (x1, x2, x3, x4). The first three (genuine) predictors were set to be strongly, moderately, and weakly correlated with the response variable, respectively (denoted by large, medium, and small Pearson correlation coefficients, r), while the correlation between the response and the last (spurious) predictor was set to zero. We simulated datasets with three levels of differences between the correlation coefficients of consecutive predictors, where ∆r = 0.1, 0.2, 0.3, respectively. These three levels of ∆r resulted in three correlation structures between the response and the four predictors: (0.3, 0.2, 0.1, 0.0), (0.6, 0.4, 0.2, 0.0), and (0.8, 0.6, 0.3, 0.0), respectively. We repeated the simulation procedure 200 times for each of the three preset correlation structures (600 datasets in total) for later LM fitting. For GLM fitting, we modified the simulation procedures with additional steps, in which we converted the continuous response into binary data O (e.g., occurrence data having 0 for absence and 1 for presence). We tested the WiBB method, along with two other methods, relative sum of weights (SWi) and standardized beta (ß*), to evaluate the ability to correctly rank predictor importance under various scenarios. The empirical dataset of 71 Mimulus species was assembled from their occurrence coordinates and corresponding values extracted from climatic layers of the WorldClim dataset (www.worldclim.org), and we applied the WiBB method to infer important predictors for their geographical distributions.
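    A small sketch of the kind of simulation described, using the ∆r = 0.2 correlation structure given above (the covariance construction is illustrative, not the authors' data.simulation code):

      # Illustrative simulation: response y plus predictors x1-x4 with preset correlations
      # (0.6, 0.4, 0.2, 0.0) to y, as in the Delta r = 0.2 scenario described above.
      library(mvtnorm)

      r_y   <- c(0.6, 0.4, 0.2, 0.0)
      Sigma <- diag(5)
      Sigma[1, 2:5] <- r_y
      Sigma[2:5, 1] <- r_y

      set.seed(42)
      sim <- rmvnorm(n = 1000, mean = rep(0, 5), sigma = Sigma)
      colnames(sim) <- c("y", "x1", "x2", "x3", "x4")
      round(cor(sim)[1, ], 2)   # check realized correlations with the response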

  15. Time-Series Matrix (TSMx): A visualization tool for plotting multiscale temporal trends

    • dataverse.harvard.edu
    Updated Jul 8, 2024
    Cite
    Georgios Boumis; Brad Peter (2024). Time-Series Matrix (TSMx): A visualization tool for plotting multiscale temporal trends [Dataset]. http://doi.org/10.7910/DVN/ZZDYM9
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 8, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Georgios Boumis; Brad Peter
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Time-Series Matrix (TSMx): A visualization tool for plotting multiscale temporal trends

    TSMx is an R script that was developed to facilitate multi-temporal-scale visualizations of time-series data. The script requires only a two-column CSV of years and values to plot the slope of the linear regression line for all possible year combinations from the supplied temporal range. The outputs include a time-series matrix showing slope direction based on the linear regression, slope values plotted with colors indicating magnitude, and results of a Mann-Kendall test. The start year is indicated on the y-axis and the end year is indicated on the x-axis. In the example below, the cell in the top-right corner is the direction of the slope for the temporal range 2001–2019. The red line corresponds with the temporal range 2010–2019 and an arrow is drawn from the cell that represents that range. One cell is highlighted with a black border to demonstrate how to read the chart: that cell represents the slope for the temporal range 2004–2014.

    This publication entry also includes an Excel template that produces the same visualizations without a need to interact with any code, though minor modifications will need to be made to accommodate year ranges other than what is provided. TSMx for R was developed by Georgios Boumis; TSMx was originally conceptualized and created by Brad G. Peter in Microsoft Excel. Please refer to the associated publication: Peter, B.G., Messina, J.P., Breeze, V., Fung, C.Y., Kapoor, A. and Fan, P., 2024. Perspectives on modifiable spatiotemporal unit problems in remote sensing of agriculture: evaluating rice production in Vietnam and tools for analysis. Frontiers in Remote Sensing, 5, p.1042624. https://www.frontiersin.org/journals/remote-sensing/articles/10.3389/frsen.2024.1042624

    [Figure: TSMx sample chart from the supplied Excel template. Data represent the productivity of rice agriculture in Vietnam as measured via EVI (enhanced vegetation index) from the NASA MODIS data product (MOD13Q1.V006).]
    TSMx R script:

      # import packages
      library(dplyr)
      library(readr)
      library(ggplot2)
      library(tibble)
      library(tidyr)
      library(forcats)
      library(Kendall)
      options(warn = -1) # disable warnings

      # read data (.csv file with "Year" and "Value" columns)
      data <- read_csv("EVI.csv")

      # prepare row/column names for output matrices
      years <- data %>% pull("Year")
      r.names <- years[-length(years)]
      c.names <- years[-1]
      years <- years[-length(years)]

      # initialize output matrices
      sign.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years))
      pval.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years))
      slope.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years))

      # function to return remaining years given a start year
      getRemain <- function(start.year) {
        years <- data %>% pull("Year")
        start.ind <- which(data[["Year"]] == start.year) + 1
        remain <- years[start.ind:length(years)]
        return(remain)
      }

      # function to subset data for a start/end year combination
      splitData <- function(end.year, start.year) {
        keep <- which(data[['Year']] >= start.year & data[['Year']] <= end.year)
        batch <- data[keep,]
        return(batch)
      }

      # function to fit linear regression and return slope direction
      fitReg <- function(batch) {
        trend <- lm(Value ~ Year, data = batch)
        slope <- coefficients(trend)[[2]]
        return(sign(slope))
      }

      # function to fit linear regression and return slope magnitude
      fitRegv2 <- function(batch) {
        trend <- lm(Value ~ Year, data = batch)
        slope <- coefficients(trend)[[2]]
        return(slope)
      }

      # function to implement Mann-Kendall (MK) trend test and return significance
      # the test is implemented only for n>=8
      getMann <- function(batch) {
        if (nrow(batch) >= 8) {
          mk <- MannKendall(batch[['Value']])
          pval <- mk[['sl']]
        } else {
          pval <- NA
        }
        return(pval)
      }

      # function to return slope direction for all combinations given a start year
      getSign <- function(start.year) {
        remaining <- getRemain(start.year)
        combs <- lapply(remaining, splitData, start.year = start.year)
        signs <- lapply(combs, fitReg)
        return(signs)
      }

      # function to return MK significance for all combinations given a start year
      getPval <- function(start.year) {
        remaining <- getRemain(start.year)
        combs <- lapply(remaining, splitData, start.year = start.year)
        pvals <- lapply(combs, getMann)
        return(pvals)
      }

      # function to return slope magnitude for all combinations given a start year
      getMagn <- function(start.year) {
        remaining <- getRemain(start.year)
        combs <- lapply(remaining, splitData, start.year = start.year)
        magns <- lapply(combs, fitRegv2)
        return(magns)
      }

      # retrieve slope direction, MK significance, and slope magnitude
      signs <- lapply(years, getSign)
      pvals <- lapply(years, getPval)
      magns <- lapply(years, getMagn)

      # fill-in output matrices
      dimension <- nrow(sign.matrix)
      for (i in 1:dimension) {
        sign.matrix[i, i:dimension] <- unlist(signs[i])
        pval.matrix[i, i:dimension] <- unlist(pvals[i])
        slope.matrix[i, i:dimension] <- unlist(magns[i])
      }
      sign.matrix <-...

  16. Data and Code for "A Ray-Based Input Distance Function to Model Zero-Valued Output Quantities: Derivation and an Empirical Application"

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 17, 2023
    Cite
    Henningsen, Arne (2023). Data and Code for "A Ray-Based Input Distance Function to Model Zero-Valued Output Quantities: Derivation and an Empirical Application" [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_7882078
    Dataset updated
    Jun 17, 2023
    Dataset provided by
    Price, Juan José
    Henningsen, Arne
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data and code archive provides all the data and code for replicating the empirical analysis that is presented in the journal article "A Ray-Based Input Distance Function to Model Zero-Valued Output Quantities: Derivation and an Empirical Application" authored by Juan José Price and Arne Henningsen and published in the Journal of Productivity Analysis (DOI: 10.1007/s11123-023-00684-1).

    We conducted the empirical analysis with the "R" statistical software (version 4.3.0) using the add-on packages "combinat" (version 0.0.8), "miscTools" (version 0.6.28), "quadprog" (version 1.5.8), "sfaR" (version 1.0.0), "stargazer" (version 5.2.3), and "xtable" (version 1.8.4) that are available on CRAN. We created the R package "micEconDistRay", which provides the functions for empirical analyses with ray-based input distance functions that we developed for the above-mentioned paper. This R package is also available on CRAN (https://cran.r-project.org/package=micEconDistRay).

    This replication package contains the following files and folders:

    README This file

    MuseumsDk.csv The original data obtained from the Danish Ministry of Culture and from Statistics Denmark. It includes the following variables:

    museum: Name of the museum.

    type: Type of museum (Kulturhistorisk museum = cultural history museum; Kunstmuseer = arts museum; Naturhistorisk museum = natural history museum; Blandet museum = mixed museum).

    munic: Municipality, in which the museum is located.

    yr: Year of the observation.

    units: Number of visit sites.

    resp: Whether or not the museum has special responsibilities (0 = no special responsibilities; 1 = at least one special responsibility).

    vis: Number of (physical) visitors.

    aarc: Number of articles published (archeology).

    ach: Number of articles published (cultural history).

    aah: Number of articles published (art history).

    anh: Number of articles published (natural history).

    exh: Number of temporary exhibitions.

    edu: Number of primary school classes on educational visits to the museum.

    ev: Number of events other than exhibitions.

    ftesc: Scientific labor (full-time equivalents).

    ftensc: Non-scientific labor (full-time equivalents).

    expProperty: Running and maintenance costs [1,000 DKK].

    expCons: Conservation expenditure [1,000 DKK].

    ipc: Consumer Price Index in Denmark (the value for year 2014 is set to 1).

    prepare_data.R This R script imports the data set MuseumsDk.csv, prepares it for the empirical analysis (e.g., removing unsuitable observations, preparing variables), and saves the resulting data set as DataPrepared.csv.
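
    As a rough, hedged sketch of the general read-prepare-write pattern only; the actual filtering and variable-construction rules are defined in prepare_data.R itself:

    museums <- read.csv("MuseumsDk.csv")
    # illustrative preparation step only; the authors' script applies its own
    # rules for removing unsuitable observations and preparing variables
    museums <- museums[complete.cases(museums), ]
    write.csv(museums, "DataPrepared.csv", row.names = FALSE)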

    DataPrepared.csv This data set is prepared and saved by the R script prepare_data.R. It is used for the empirical analysis.

    make_table_descriptive.R This R script imports the data set DataPrepared.csv and creates the LaTeX table /tables/table_descriptive.tex, which provides summary statistics of the variables that are used in the empirical analysis.
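
    One plausible way to produce such a summary-statistics table is with the "stargazer" package listed above; this is a hedged sketch, and the authors' make_table_descriptive.R may use a different approach or layout:

    library(stargazer)
    dat <- read.csv("DataPrepared.csv")
    # write a LaTeX table of summary statistics for the numeric variables
    stargazer(dat, summary = TRUE, type = "latex",
              title = "Summary statistics",
              out = "tables/table_descriptive.tex")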

    IO_Ray.R This R script imports the data set DataPrepared.csv, estimates a ray-based Translog input distance function with the 'optimal' ordering of outputs, imposes monotonicity on this distance function, creates the LaTeX table /tables/idfRes.tex that presents the estimated parameters of this function, and creates several figures in the folder /figures/ that illustrate the results.

    IO_Ray_ordering_outputs.R This R script imports the data set DataPrepared.csv, estimates a ray-based Translog input distance function and imposes monotonicity on it for each of the 720 possible orderings of the outputs, and saves all the estimation results as (a huge) R object allOrderings.rds.

    allOrderings.rds (not included in the ZIP file, uploaded separately) This is a saved R object created by the R script IO_Ray_ordering_outputs.R that contains the estimated ray-based Translog input distance functions (with and without monotonicity imposed) for each of the 720 possible orderings.

    IO_Ray_model_averaging.R This R script loads the R object allOrderings.rds that contains the estimated ray-based Translog input distance functions for each of the 720 possible orderings, does model averaging, and creates several figures in the folder /figures/ that illustrate the results.

    /tables/ This folder contains the two LaTeX tables table_descriptive.tex and idfRes.tex (created by R scripts make_table_descriptive.R and IO_Ray.R, respectively) that provide summary statistics of the data set and the estimated parameters (without and with monotonicity imposed) for the 'optimal' ordering of outputs.

    /figures/ This folder contains 48 figures (created by the R scripts IO_Ray.R and IO_Ray_model_averaging.R) that illustrate the results obtained with the 'optimal' ordering of outputs and the model-averaged results and that compare these two sets of results.

  17. f

    Supplement 1. R and WinBUGS code for fitting the model of species occurrence...

    • figshare.com
    • wiley.figshare.com
    html
    Updated Aug 5, 2016
    Cite
    Robert M. Dorazio; J. Andrew Royle; Bo Söderström; Anders Glimskär (2016). Supplement 1. R and WinBUGS code for fitting the model of species occurrence and detection and example data sets. [Dataset]. http://doi.org/10.6084/m9.figshare.3526013.v1
    Explore at:
    Available download formats: html
    Dataset updated
    Aug 5, 2016
    Dataset provided by
    Wiley
    Authors
    Robert M. Dorazio; J. Andrew Royle; Bo Söderström; Anders Glimskär
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    File List: breedingBirdData.txt, butterflyData.txt, ExampleSession.txt, MultiSpeciesSiteOcc.R, MultiSpeciesSiteOccModel.txt, CumNumSpeciesPresent.R

    Description

    “breedingBirdData.txt” is an example data set in ASCII comma-delimited format. Each row corresponds to data for a single species observed in the avian survey. The 50 columns correspond to 50 sample locations.

    “butterflyData.txt” is an example data set in ASCII comma-delimited format. Each row corresponds to data for a single species observed in the butterfly survey. The 20 columns correspond to 20 sample locations.

    “ExampleSession.txt” illustrates an example session in R where the butterfly data are read into memory and then analyzed using the R and WinBUGS code.

    “MultiSpeciesSiteOcc.R” defines an R function for fitting the model of species occurrence and detection to data. This function specifies a Gibbs sampler wherein 55,000 random draws are computed for each of 4 different Markov chains. These computations may require nontrivial execution times: for example, analysis of the avian data required about 4 hours using a computer equipped with a 3.20 GHz Pentium 4 processor, and analysis of the butterfly data required about 1.5 hours.

    “MultiSpeciesSiteOccModel.txt” contains WinBUGS code for specifying the model of species occurrence and detection.

    “CumNumSpeciesPresent.R” defines an R function for computing a sample of the posterior-predictive distribution of a species-accumulation curve whose abscissa ranges from 1 to nsites sites.
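
    A hedged sketch of a session along the lines of ExampleSession.txt follows. The model-fitting call is shown only as a commented placeholder, because the authoritative function arguments are defined in MultiSpeciesSiteOcc.R and illustrated in ExampleSession.txt:

    # read the butterfly data (one row per species, 20 site columns); whether the
    # file includes a header row is documented in ExampleSession.txt
    X <- as.matrix(read.csv("butterflyData.txt"))
    source("MultiSpeciesSiteOcc.R")      # defines the model-fitting function
    source("CumNumSpeciesPresent.R")     # defines the species-accumulation function
    # fit <- MultiSpeciesSiteOcc(...)    # hypothetical call; see ExampleSession.txt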

  18. RECAP dataset: Subject, exposure, and health endpoint (blood, lipids,...

    • catalog.data.gov
    Updated Oct 4, 2021
    Cite
    U.S. EPA Office of Research and Development (ORD) (2021). RECAP dataset: Subject, exposure, and health endpoint (blood, lipids, cardiac, and lung) data [Dataset]. https://catalog.data.gov/dataset/recap-dataset-subject-exposure-and-health-endpoint-blood-lipids-cardiac-and-lung-data
    Explore at:
    Dataset updated
    Oct 4, 2021
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This dataset contains deidentified subject level data from the study titled: Responses to Exposure to Low Levels of Concentrated Ambient Particles in Healthy Young Adults (RECAP). Subject, exposure, and health endpoint data are included in the dataset. Health endpoint data includes inflammatory, heart rate variability and cardiac repolarization, lung function, blood chemistry, and lipids measures. This dataset is associated with the following publication: Wyatt, L., R. Devlin, A. Rappold, and M. Case. Low levels of fine particulate matter increase vascular damage and reduce pulmonary function in young healthy adults. Particle and Fibre Toxicology. BioMed Central Ltd, London, UK, 17(1): 58, (2020).

  19. Importance of predictor variables for models of chemical function

    • catalog.data.gov
    • data.wu.ac.at
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Importance of predictor variables for models of chemical function [Dataset]. https://catalog.data.gov/dataset/importance-of-predictor-variables-for-models-of-chemical-function
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Importance of random forest predictors for all classification models of chemical function. This dataset is associated with the following publication: Isaacs , K., M. Goldsmith, P. Egeghy , K. Phillips, R. Brooks, T. Hong, and J. Wambaugh. Characterization and prediction of chemical functions and weight fractions in consumer products. Toxicology Reports. Elsevier B.V., Amsterdam, NETHERLANDS, 3: 723-732, (2016).

  20. f

    ISAM PPIG Global Survey on COVID-19 and Substance Use — R Project

    • figshare.com
    zip
    Updated May 28, 2021
    Cite
    Mohsen Ebrahimi (2021). ISAM PPIG Global Survey on COVID-19 and Substance Use — R Project [Dataset]. http://doi.org/10.6084/m9.figshare.14604504.v5
    Explore at:
    Available download formats: zip
    Dataset updated
    May 28, 2021
    Dataset provided by
    figshare
    Authors
    Mohsen Ebrahimi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Includes data and R code.
