20 datasets found
  1. Synthetic Data for an Imaginary Country, Sample, 2023 - World

    • microdata.worldbank.org
    • nada-demo.ihsn.org
    Updated Jul 7, 2023
    Cite
    Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906
    Dataset updated
    Jul 7, 2023
    Dataset authored and provided by
    Development Data Group, Data Analytics Unit
    Time period covered
    2023
    Area covered
    World
    Description

    Abstract

    The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data include only ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for training and simulation purposes and is not intended to be representative of any specific country.

    The full-population dataset (with about 10 million individuals) is also distributed as open data.

    Geographic coverage

    The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

    Analysis unit

    Household, Individual

    Universe

    The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
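
    A minimal R sketch of the two-stage draw described above (the actual script is distributed as an external resource with the dataset and is not reproduced here). The frame objects and column names (ea_frame, hh_frame, ea_id, stratum, n_households, hh_id) are illustrative assumptions.

      # Two-stage sample: allocate EAs to strata proportionally, then draw 25 households per EA.
      # `ea_frame` (one row per enumeration area) and `hh_frame` (one row per household)
      # are hypothetical sampling frames.
      library(dplyr)
      set.seed(123)

      n_ea_total <- 8000 / 25   # 320 enumeration areas needed for 8,000 households

      # Stage 1: number of EAs per stratum, proportional to stratum size (in households)
      alloc <- ea_frame %>%
        group_by(stratum) %>%
        summarise(stratum_hh = sum(n_households)) %>%
        mutate(n_ea = round(n_ea_total * stratum_hh / sum(stratum_hh)))

      ea_sample <- ea_frame %>%
        inner_join(alloc, by = "stratum") %>%
        group_by(stratum) %>%
        group_modify(~ slice_sample(.x, n = .x$n_ea[1])) %>%
        ungroup()

      # Stage 2: 25 households at random within each selected EA
      hh_sample <- hh_frame %>%
        semi_join(ea_sample, by = "ea_id") %>%
        group_by(ea_id) %>%
        slice_sample(n = 25) %>%
        ungroup()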

    Mode of data collection

    other

    Research instrument

    The dataset is a synthetic dataset. Although it contains variables typically collected in sample surveys or population censuses, no questionnaire is available for this dataset. However, a "fake" questionnaire was created for the sample dataset extracted from this dataset, to be used as training material.

    Cleaning operations

    The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observations were assessed and rejected or replaced when needed). Some post-processing was also applied to the data to produce the distributed data files.

    Response rate

    This is a synthetic dataset; the "response rate" is 100%.

  2. Los Angeles Family and Neighborhood Survey (L.A.FANS), Wave 2, Restricted...

    • icpsr.umich.edu
    Updated Apr 8, 2019
    Cite
    Pebley, Anne R.; Sastry, Narayan (2019). Los Angeles Family and Neighborhood Survey (L.A.FANS), Wave 2, Restricted Data Version 1, 2006-2008 [Dataset]. http://doi.org/10.3886/ICPSR37259.v1
    Dataset updated
    Apr 8, 2019
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    Authors
    Pebley, Anne R.; Sastry, Narayan
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/37259/terms

    Time period covered
    2006 - 2008
    Area covered
    Los Angeles, California, United States
    Description

    This study includes a restricted data file for Wave 2 of the L.A.FANS data. To compare L.A.FANS restricted data, version 1, with other restricted data versions, see the table on the series page for the L.A.FANS data. Data in this study are designed for use with the public use data files for L.A.FANS, Wave 2 (study 2).

    This file adds only a few variables to the L.A.FANS, Wave 2 public use files. Specifically, it adds a "pseudo-tract ID", a number from 1 to 65 randomly assigned to each census tract (neighborhood) in the study. It is not possible to link pseudo-tract IDs in any way to real tract IDs or other neighborhood characteristics. However, pseudo-tract IDs permit users to conduct analyses that take into account the clustered sample design, in which neighborhoods (tracts) were selected first and then individuals were sampled within neighborhoods; pseudo-tract IDs do so because they identify which respondents live in the same neighborhood. The file also includes certain variables, thought to be sensitive, which are not available in the public use data; these variables are identified in the L.A.FANS Wave 2 Users' Guide and Codebook. Finally, some distance variables and individual characteristics that are treated in the public use data to make it harder to identify individuals are provided in an untreated form in the Version 1 restricted data file.

    Please note that L.A.FANS restricted data may only be accessed within the ICPSR Virtual Data Enclave (VDE) and must be merged with the L.A.FANS public data prior to beginning any analysis. A Users' Guide, which explains the design and how to use the samples, is available for Wave 2 at the RAND website. Additional information on the project, survey design, sample, and variables is available from: Sastry, Narayan, Bonnie Ghosh-Dastidar, John Adams, and Anne R. Pebley (2006). The Design of a Multilevel Survey of Children, Families, and Communities: The Los Angeles Family and Neighborhood Survey. Social Science Research, 35(4), 1000-1024; the Users' Guides (Wave 1 and Wave 2); and the RAND Documentation Reports page.

  3. Clustering of samples and variables with mixed-type data

    • plos.figshare.com
    tiff
    Updated Jun 1, 2023
    Cite
    Manuela Hummel; Dominic Edelmann; Annette Kopp-Schneider (2023). Clustering of samples and variables with mixed-type data [Dataset]. http://doi.org/10.1371/journal.pone.0188274
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Manuela Hummel; Dominic Edelmann; Annette Kopp-Schneider
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of data measured on different scales is a relevant challenge. Biomedical studies often focus on high-throughput datasets of, e.g., quantitative measurements. However, the need for integration of other features possibly measured on different scales, e.g. clinical or cytogenetic factors, becomes increasingly important. The analysis results (e.g. a selection of relevant genes) are then visualized, while adding further information, like clinical factors, on top. However, a more integrative approach is desirable, where all available data are analyzed jointly, and where also in the visualization different data sources are combined in a more natural way. Here we specifically target integrative visualization and present a heatmap-style graphic display. To this end, we develop and explore methods for clustering mixed-type data, with special focus on clustering variables. Clustering of variables does not receive as much attention in the literature as does clustering of samples. We extend the variables clustering methodology by two new approaches, one based on the combination of different association measures and the other on distance correlation. With simulation studies we evaluate and compare different clustering strategies. Applying specific methods for mixed-type data proves to be comparable and in many cases beneficial as compared to standard approaches applied to corresponding quantitative or binarized data. Our two novel approaches for mixed-type variables show similar or better performance than the existing methods ClustOfVar and bias-corrected mutual information. Further, in contrast to ClustOfVar, our methods provide dissimilarity matrices, which is an advantage, especially for the purpose of visualization. Real data examples aim to give an impression of various kinds of potential applications for the integrative heatmap and other graphical displays based on dissimilarity matrices. We demonstrate that the presented integrative heatmap provides more information than common data displays about the relationship among variables and samples. The described clustering and visualization methods are implemented in our R package CluMix available from https://cran.r-project.org/web/packages/CluMix.
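
    For readers who want to experiment with the general idea, the sketch below clusters a toy mixed-type data frame with a standard approach (Gower dissimilarity from the cluster package plus hierarchical clustering). It is not the CluMix methodology from the paper, only the kind of baseline the authors compare against, and the variable-clustering step uses a crude rank-correlation stand-in.

      # Standard-approach baseline for mixed-type data (not the paper's CluMix methods).
      library(cluster)

      # toy mixed-type data frame (hypothetical values)
      df <- data.frame(
        age    = c(34, 58, 41, 67, 29, 50),
        marker = c(1.2, 3.4, 2.1, 5.0, 0.8, 2.9),
        sex    = factor(c("f", "m", "f", "m", "f", "m")),
        stage  = factor(c("I", "III", "II", "III", "I", "II"), ordered = TRUE)
      )

      # samples: Gower dissimilarity handles numeric, ordinal, and nominal columns
      d_samples  <- daisy(df, metric = "gower")
      hc_samples <- hclust(d_samples, method = "average")

      # variables: 1 - squared Spearman correlation on integer-coded columns
      # (a rough substitute for the mixed association measures discussed above)
      x       <- data.matrix(df)
      d_vars  <- as.dist(1 - cor(x, method = "spearman")^2)
      hc_vars <- hclust(d_vars, method = "average")

      plot(hc_vars, main = "Variable clustering (illustrative)")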

  4. Example of how to manually extract incubation bouts from interactive plots...

    • figshare.com
    txt
    Updated Jan 22, 2016
    Cite
    Martin Bulla (2016). Example of how to manually extract incubation bouts from interactive plots of raw data - R-CODE and DATA [Dataset]. http://doi.org/10.6084/m9.figshare.2066784.v1
    Dataset updated
    Jan 22, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Martin Bulla
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General information

    The script runs with R (version 3.1.1; 2014-07-10) and the packages plyr (1.8.1), XLConnect (0.2-9), utilsMPIO (0.0.25), sp (1.0-15), rgdal (0.8-16), tools (3.1.1), and lattice (0.20-29). Questions can be directed to Martin Bulla (bulla.mar@gmail.com). Data collection and how the individual variables were derived are described in: Steiger, S.S., et al. (2013). When the sun never sets: diverse activity rhythms under continuous daylight in free-living arctic-breeding birds. Proceedings of the Royal Society B: Biological Sciences, 280(1764), 20131016; and Dale, J., et al. (2015). The effects of life history and sexual selection on male and female plumage colouration. Nature. Data are available as an Rdata file; missing values are NA. For better readability, the subsections of the script can be collapsed.

    Description of the method

    1. Data are visualized in an interactive actogram with time of day on the x-axis and one panel for each day of data.
    2. A red rectangle indicates the active field; clicking with the mouse in that field on the depicted light signal generates a data point that is automatically saved (via a custom-made function) in the csv file. For this data extraction I recommend always clicking on the bottom line of the red rectangle, as data are always available there due to a dummy variable ("lin") that creates continuous data at the bottom of the active panel. The data are captured only if a greenish vertical bar appears and a new line of data appears in the R console.
    3. To extract incubation bouts, the first click in the new plot has to be the start of incubation, the next click marks the end of incubation, and a click on the same spot marks the start of incubation for the other sex. If the end and start of incubation are at different times, the data will still be extracted, but the sex, logger, and bird_ID will be wrong; these need to be changed manually in the csv file. Similarly, the first bout for a given plot will always be assigned to the male (if no data are present in the csv file) or based on previous data. Hence, whenever data from a new plot are extracted, it is worth checking at the first mouse click whether the sex, logger, and bird_ID information is correct and, if not, adjusting it manually.
    4. If all information from one day (panel) is extracted, right-click on the plot and choose "stop". This will activate the following day (panel) for extraction.
    5. If you wish to end extraction before going through all the rectangles, just press "escape".

    Annotations of the data file turnstone_2009_Barrow_nest-t401_transmitter.RData

    dfr - raw data on signal strength from radio tags attached to the rumps of the female and male, and information about when the birds were captured and the incubation stage of the nest:
    1. who: identifies whether the recording refers to female, male, capture, or start of hatching
    2. datetime_: date and time of each recording
    3. logger: unique identity of the radio tag
    4. signal_: signal strength of the radio tag
    5. sex: sex of the bird (f = female, m = male)
    6. nest: unique identity of the nest
    7. day: datetime_ variable truncated to year-month-day format
    8. time: time of day in hours
    9. datetime_utc: date and time of each recording, but in UTC time
    10. cols: colors assigned to "who"

    m - metadata for a given nest:
    1. sp: identifies species (RUTU = Ruddy Turnstone)
    2. nest: unique identity of the nest
    3. year_: year of observation
    4. IDfemale: unique identity of the female
    5. IDmale: unique identity of the male
    6. lat: latitude coordinate of the nest
    7. lon: longitude coordinate of the nest
    8. hatch_start: date and time when the hatching of the eggs started
    9. scinam: scientific name of the species
    10. breeding_site: unique identity of the breeding site (barr = Barrow, Alaska)
    11. logger: type of device used to record incubation (IT - radio tag)
    12. sampling: mean incubation sampling interval in seconds

    s - metadata for the incubating parents:
    1. year_: year of capture
    2. species: identifies species (RUTU = Ruddy Turnstone)
    3. author: identifies the author who measured the bird
    4. nest: unique identity of the nest
    5. caught_date_time: date and time when the bird was captured
    6. recapture: was the bird captured before? (0 - no, 1 - yes)
    7. sex: sex of the bird (f = female, m = male)
    8. bird_ID: unique identity of the bird
    9. logger: unique identity of the radio tag

  5. Time Series Forecasting Using Prophet in R

    • kaggle.com
    zip
    Updated Jul 25, 2023
    Cite
    vikram amin (2023). Time Series Forecasting Using Prophet in R [Dataset]. https://www.kaggle.com/datasets/vikramamin/time-series-forecasting-using-prophet-in-r
    Dataset updated
    Jul 25, 2023
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description
    • Main objective: to forecast the page visits of a website.
    • Tool: time series forecasting using Prophet in R (an illustrative code sketch follows this list).
    • Steps:
    • Read the data
    • Data cleaning: check data types, date formats, and missing data.
    • Load the libraries (dplyr, ggplot2, tidyverse, lubridate, prophet, forecast).
    • Convert the Date column from character to Date format using the lubridate package.
    • Rename the column "Date" to "ds" and "Visits" to "y".
    • Treat "Christmas" and "Black.Friday" as holiday events. As the data ranges from 2016 to 2020, there will be 5 Christmas and 5 Black Friday days.
    • We look at the impact on "Visits" from 3 days before to 3 days after Christmas, and from 3 days before to 1 day after Black Friday.
    • We create two data frames called Christmas and Black.Friday and merge them into a data frame called "holidays".
    • We create train and test data, keeping only three variables: ds, y, and Easter. The train data contains dates before 2020-12-01; the test data contains dates on or after 2020-12-01 (31 days).
    • We fit a Prophet model with the default parameters and then add the external regressor "Easter".
    • We create a future data frame named "future" from the model "m", covering the 31 days of the test period, and predict on it to create a new data frame called "forecast".
    • The forecast data frame consists of 1,827 rows and 34 variables. The external regressor (Easter) is 0 through the entire time period, which shows that "Easter" has no impact on "Visits".
    • yhat stands for the predicted value (predicted visits).
    • We examine the impact of the holiday events "Christmas" and "Black.Friday".
    • We plot the forecast.
    • plot(m, forecast)
    • Blue is the predicted value (yhat), black is the actual value (y), and the blue shaded region spans the yhat_lower and yhat_upper values.
    • prophet_plot_components(m, forecast)
    • The trend component indicates that page visits remained constant from Jan 2016 to mid-2017, and thereafter there was an upswing from mid-2019 to the end of 2020.
    • The holidays component shows that Christmas had a negative effect on page visits, whereas Black Friday had a positive effect.
    • Weekly seasonality indicates that page visits are highest from Monday to Thursday and decline thereafter.
    • Yearly seasonality indicates that page visits peak in April and then decline, reaching their lowest point in October.
    • External regressor "Easter" has no impact on page visits
    • plot(m,forecast) + add_changepoints_to_plot(m)
    • The trend, indicated by the red line, moves upward from mid-2019 onward.
    • We check for acc...
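
    An illustrative R sketch of the workflow above, using the prophet package. The input file name, the date format, the Black Friday dates, and the assumption that the source data already contain an Easter indicator column are mine, not the author's script.

      library(prophet)
      library(dplyr)
      library(lubridate)

      # assumed input: a csv with Date, Visits, and an Easter indicator column
      df <- read.csv("webpage_visits.csv") %>%
        mutate(ds = dmy(Date), y = Visits)          # adjust dmy() to the actual date format

      christmas <- data.frame(holiday = "Christmas",
                              ds = as.Date(paste0(2016:2020, "-12-25")),
                              lower_window = -3, upper_window = 3)
      black_friday <- data.frame(holiday = "Black.Friday",
                                 ds = as.Date(c("2016-11-25", "2017-11-24", "2018-11-23",
                                                "2019-11-29", "2020-11-27")),
                                 lower_window = -3, upper_window = 1)
      holidays <- bind_rows(christmas, black_friday)

      train <- df %>% filter(ds <  as.Date("2020-12-01")) %>% select(ds, y, Easter)
      test  <- df %>% filter(ds >= as.Date("2020-12-01")) %>% select(ds, y, Easter)

      m <- prophet(holidays = holidays, fit = FALSE)  # default parameters
      m <- add_regressor(m, "Easter")
      m <- fit.prophet(m, train)

      future <- make_future_dataframe(m, periods = 31)
      future$Easter <- 0                              # regressor values must be supplied; 0 throughout, as in the data
      forecast <- predict(m, future)

      plot(m, forecast)
      prophet_plot_components(m, forecast)
      plot(m, forecast) + add_changepoints_to_plot(m)
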
  6. Data from: Cold pool driven convective initiation: using causal graph...

    • resodate.org
    • data.ub.uni-muenchen.de
    • +1more
    Updated Jan 1, 2020
    Cite
    Mirjam Hirt; George C. Craig; Sophia Schäfer; Julien Savre; Rieke Heinze (2020). Cold pool driven convective initiation: using causal graph analysis to determine what convection permitting models are missing [Dataset]. http://doi.org/10.5282/UBM/DATA.178
    Dataset updated
    Jan 1, 2020
    Dataset provided by
    Universitätsbibliothek der Ludwig-Maximilians-Universität München
    Authors
    Mirjam Hirt; George C. Craig; Sophia Schäfer; Julien Savre; Rieke Heinze
    Description

    The data in this folder comprise all data necessary to produce the figures presented in our paper (Hirt et al., 2020, in review, Quarterly Journal of the Royal Meteorological Society). Corresponding Jupyter notebooks, which were used to analyse and plot the data, are available at https://github.com/HirtM/cold_pool_driven_convection_initiation. The datasets are netcdf files and should contain all relevant metadata.

    cp_aggregates2*: These datasets contain different variables of cold pool objects. For each variable, several different statistics are available, e.g. the average/median/some percentile over the area of each cold pool object. Note that the data do not contain tracked cold pools; any sequence of cold pool indices is hence meaningless. Each cold pool index carries information not only about its cold pool but also about its edges (see the mask dimension).

    P_ci_*: These datasets contain information on convection initiation within cold pool areas, cold pool edge areas, or no-cold-pool areas. No single cold pool objects are identified here.

    prec_*: As P_ci_*, but for precipitation.

    synoptic_conditions_variables.nc: This dataset contains domain-averaged (total domain, not cold pool objects) timeseries of selected variables. The selected variables were chosen in order to describe the synoptic and diurnal conditions of the days of interest. This dataset is used for the causal regression analysis.

    All the data here are derived from the ICON-LEM simulation conducted within HDCP2 (http://hdcp2.eu/index.php?id=5013): Heinze, R., Dipankar, A., Carbajal Henken, C., Moseley, C., Sourdeval, O., Trömel, S., Xie, X., Adamidis, P., Ament, F., Baars, H., Barthlott, C., Behrendt, A., Blahak, U., Bley, S., Brdar, S., Brueck, M., Crewell, S., Deneke, H., Di Girolamo, P., Evaristo, R., Fischer, J., Frank, C., Friederichs, P., Göcke, T., Gorges, K., Hande, L., Hanke, M., Hansen, A., Hege, H.-C., Hoose, C., Jahns, T., Kalthoff, N., Klocke, D., Kneifel, S., Knippertz, P., Kuhn, A., van Laar, T., Macke, A., Maurer, V., Mayer, B., Meyer, C. I., Muppa, S. K., Neggers, R. A. J., Orlandi, E., Pantillon, F., Pospichal, B., Röber, N., Scheck, L., Seifert, A., Seifert, P., Senf, F., Siligam, P., Simmer, C., Steinke, S., Stevens, B., Wapler, K., Weniger, M., Wulfmeyer, V., Zängl, G., Zhang, D. and Quaas, J. (2016): Large-eddy simulations over Germany using ICON: A comprehensive evaluation. Q.J.R. Meteorol. Soc., doi:10.1002/qj.2947

    M. Hirt, 9 Jan 2020

  7. Data from: Datasets for Comparison of Surrogate Models to Estimate Pesticide...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 27, 2025
    Cite
    U.S. Geological Survey (2025). Datasets for Comparison of Surrogate Models to Estimate Pesticide Concentrations at Six U.S. Geological Survey National Water Quality Network Sites During Water Years 2013–2018 [Dataset]. https://catalog.data.gov/dataset/datasets-for-comparison-of-surrogate-models-to-estimate-pesticide-concentrations-at-six-u-
    Dataset updated
    Nov 27, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    This data release is comprised of data tables of input variables for seawaveQ and surrogate models used to predict concentrations of select pesticides at six U.S. Geological Survey National Water Quality Network (NWQN) river sites (Fanno Creek at Durham, Oregon; White River at Hazleton, Indiana; Kansas River at DeSoto, Kansas; Little Arkansas River near Sedgwick, Kansas; Missouri River at Hermann, Missouri; Red River of the North at Grand Forks, North Dakota). Each data table includes discrete concentrations of one select pesticide (Atrazine, Azoxystrobin, Bentazon, Bromacil, Imidacloprid, Simazine, or Triclopyr) at one of the NWQN sites; daily mean streamflow; 30-day and 1-day flow anomalies; daily median values of pH and turbidity; daily mean values of dissolved oxygen, specific conductance, and water temperature; and 30-day and 1-day anomalies for pH, turbidity, dissolved oxygen, specific conductance, and water temperature. Two pesticides were modeled at each site with three types of regression models. Also included is a zip file with outputs from seawaveQ model summary. The processes for retrieving and preparing data for regression models followed those outlined in the SEAWAVE-Q R package documentation (Ryberg and Vecchia, 2013; Ryberg and York, 2020). The R package waterData (Ryberg and Vecchia, 2012) was used to import daily mean values for discharge and either daily mean or daily median values for continuous water-quality constituents directly into R depending on what data were available at each site. Pesticide concentration, streamflow, and surrogate data (continuously measured field parameters) were imported from and are available online from the USGS National Water Information System database (USGS, 2020). The waterData package was used to screen for missing daily mean discharge values (no missing values were found for the sites) and to calculate short-term (1 day) and mid-term (30 day) anomalies for flow and short-term anomalies (1 day) for each water-quality variable. A mid-term streamflow anomaly, for instance, is the deviation of concurrent daily streamflow from average conditions for the previous 30 days (Vecchia and others, 2008). Anomalies were calculated as additional potential model variables. Pesticide concentrations for select constituents from each site were pulled into R using the dataRetrieval package (De Cicco and others, 2018). Three of the six sites (Kansas River at DeSoto, Kansas; Missouri River at Hermann, Missouri; and White River at Hazleton, Indiana) pulled pesticide data for WY 2013–17 whereas the other three sites (Fanno Creek at Durham, Oregon; Little Arkansas River near Sedgwick, Kansas; and Red River of the North at Grand Forks, North Dakota) pulled pesticide data for WY 2013–18. Discrete pesticide data were matched with daily mean discharge and daily mean or median water-quality constituents and the associated calculated short-term (1-day) and mid-term (30-day) anomalies from the date of sampling. Pesticide concentrations were estimated using the SEAWAVE-Q (with surrogates) model using 19 combinations of surrogate variables (table 2 in the associated SIR, "Comparison of Surrogate Models to Estimate Pesticide Concentrations at Six U.S. Geological Survey National Water Quality Network Sites During Water Years 2013–18.") at each of 12 site-pesticide combinations (table 3 in the associated SIR). 
Three measures of model performance—the generalized coefficient of determination (R2), Akaike’s Information Criteria (AIC), and scale—were included in the output and used to select best-fit models (Table 4 of the associated SIR). The three to four best-fit SEAWAVE-Q (with surrogates) models with sample sizes at least five times the number of variables were selected for each site-pesticide combination based on generalized R2 values—the higher, the better. If generalized R2 values were the same, the model with the lower AIC value was used. The standard surrogate regression and base SEAWAVE-Q models were then applied using the same samples that were used for each of the best-fit SEAWAVE-Q (with surrogates) models so that direct comparisons could be made for each site-pesticide-surrogate instance. The input data used to estimate daily pesticide concentrations for each of the best fit models have been included in this data release. An example of one output file for each model type is included in a .zip file named "output_examples.zip". Each of the output files shows the three measures of model performance. (1) The output file for the standard regression model named "HAZ8_Atrazine_Standard_Regression_Output.txt" includes: Pseudo R-square (Allison) of 0.631, Model AIC of 174.0232, and a Scale of 0.961. (2) The output file for the base SEAWAVE-Q model named "HAZ8_Atrazine_Base_Seawave-Q_Output.txt" includes: Generalized r-squared of 0.82, AIC (Akaike's An Information Criterion) of 36.38, and a Scale of 0.288. (3) The output file for the SEAWAVE-Q w/Surrogates model named "HAZ8_Atrazine_Seawave-Q_w_Surrogates_Output.txt" includes: Generalized r-squared of 0.85, AIC (Akaike's An Information Criterion) of 33.76, and a Scale of 0.268. These values match those for Site ID = HAZ, Pesticide = Atrazine, and Surrogate variable group 8 for each model type in Table 4 of the associated SIR.
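
    A hedged R sketch of the data-retrieval and anomaly steps described above (not the USGS scripts). The site number, the parameter code, and the `which` setting for compAnom are illustrative assumptions; consult the package documentation before reusing them.

      library(dataRetrieval)
      library(waterData)

      site <- "03374100"                                   # hypothetical NWQN site number
      q <- importDVs(site, code = "00060", stat = "00003", # daily mean discharge
                     sdate = "2012-10-01", edate = "2018-09-30")
      q <- fillMiss(q)                                     # screen/fill missing daily values

      # flow anomalies; `which = 3` is assumed here to request the 30-day and 1-day
      # anomalies -- check ?compAnom for the exact option meanings
      flow_anoms <- compAnom(q, which = 3)

      # discrete pesticide concentrations (parameter code illustrative)
      atrazine <- readNWISqw(site, parameterCd = "39632",
                             startDate = "2012-10-01", endDate = "2018-09-30")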

  8. Percentiles of exogenous demographic variables.

    • figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Rajesh P. Narayanan; James Nordlund; R. Kelley Pace; Dimuthu Ratnadiwakara (2023). Percentiles of exogenous demographic variables. [Dataset]. http://doi.org/10.1371/journal.pone.0239572.t001
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Rajesh P. Narayanan; James Nordlund; R. Kelley Pace; Dimuthu Ratnadiwakara
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Percentiles of exogenous demographic variables.

  9. Data from: A dataset to model Levantine landcover and land-use change...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Dec 16, 2023
    Cite
    Michael Kempf; Michael Kempf (2023). A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19 [Dataset]. http://doi.org/10.5281/zenodo.10396148
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Michael Kempf; Michael Kempf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 16, 2023
    Area covered
    Levant
    Description

    Overview

    This dataset is the repository for the following paper submitted to Data in Brief:

    Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).

    The Data in Brief article contains the supplement information and is the related data paper to:

    Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).

    Description/abstract

    The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and currently the escalation of the so-called Israeli-Palestinian Conflict, which strained neighbouring countries like Jordan due to the influx of Syrian refugees and increases population vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.

    Folder structure

    The main folder after download contains all data; the following subfolders are stored as zipped files:

    “code” stores the 9 code chunks described below, which read, extract, process, analyse, and visualize the data.

    “MODIS_merged” contains the 16-days, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.

    “mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).

    “yield_productivity” contains .csv files of yield information for all countries listed above.

    “population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).

    “GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.

    “built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolders, which contain the raw data and the already processed data. “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5-year intervals, e.g., “Levant_built_up_1975.tif”.

    Code structure

    1_MODIS_NDVI_hdf_file_extraction.R


    This is the first code chunk and refers to the extraction of MODIS data from the .hdf file format. The following packages must be installed, and the raw data must be downloaded using a simple mass downloader, e.g., from Google Chrome. Packages: terra. Download MODIS data after registration from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed 9 October 2023). The code reads a list of files, extracts the NDVI, and saves each file as a single .tif file with the indication “NDVI”. Because the study area is quite large, we have to load three spatially different time series and merge them later. Note that the time series are temporally consistent.


    2_MERGE_MODIS_tiles.R


    In this code, we load and merge the three different stacks to produce large and consistent time series of NDVI imagery across the study area. We further use the package gtools to load the files in (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks from which we merge the first two (stack 1, stack 2) and store them. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").


    3_CROP_MODIS_merged_tiles.R


    Now we want to crop the derived MODIS tiles to our study area. We are using a mask, which is provided as .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif. We now produced single cropped NDVI time series data from MODIS.
    The repository provides the already clipped and merged NDVI datasets.
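
    A condensed R sketch of chunks 2-3 (loading the merged NDVI files, then cropping and masking to the study area); it assumes the merged single-date .tif files from chunk 2 already exist in the working directory.

      library(terra)
      library(gtools)

      files <- mixedsort(list.files(pattern = "^NDVI_final_.*\\.tif$"))  # load in 1, 2, 3, ... order
      ndvi  <- rast(files)                                               # one layer per 16-day date

      mask_shp  <- vect("MERGED_LEVANT.shp")                             # study-area mask from the repository
      ndvi_clip <- mask(crop(ndvi, mask_shp), mask_shp)

      # one cropped file per date, as described above
      writeRaster(ndvi_clip,
                  paste0("NDVI_merged_clip_", seq_len(nlyr(ndvi_clip)), ".tif"),
                  overwrite = TRUE)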


    4_TREND_analysis_NDVI.R


    Now we want to perform trend analysis on the derived data. The data we load are tricky, as they contain 16-day return periods across a year over a period of 22 years. Growing-season sums cover MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing-season sums are generated and the slope is calculated. We can then extract the p-values of the trend and characterize all values with a high confidence level (0.05). Using the ggplot2 package and the melt function from the reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) with a value of 0.3.
    To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.
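
    A minimal R sketch of the per-pixel trend and z-score steps in this chunk, assuming a stack with one annual growing-season NDVI sum per layer (the object and file names are illustrative).

      library(terra)

      gs_sum <- rast("annual_growing_season_sums.tif")   # hypothetical: one layer per year, 2001-2022

      trend_fun <- function(v) {
        if (sum(!is.na(v)) < 3) return(c(NA, NA))
        fit <- lm(v ~ seq_along(v))
        c(coef(fit)[2], summary(fit)$coefficients[2, 4]) # slope and p-value
      }
      trend <- app(gs_sum, trend_fun)
      names(trend) <- c("slope", "p_value")
      high_conf <- trend$slope * (trend$p_value <= 0.05) # keep slopes with p <= 0.05

      # z-scores of the annual sums relative to the period mean (the normalization described above)
      z <- (gs_sum - mean(gs_sum)) / stdev(gs_sum)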


    5_BUILT_UP_change_raster.R


    Let us look at the landcover changes now. We are working with the terra package and get raster data from here: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 3 March 2023, 100 m resolution, global coverage). Here, one can download the temporal coverage that is aimed for and reclassify it using the code after cropping to the individual study area. Here, I summed up different rasters to characterize the built-up change in continuous values between 1975 and 2022.


    6_POPULATION_numbers_plot.R


    For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.


    7_YIELD_plot.R


    In this section, we are using the country productivity data from the supplement in the repository “yield_productivity” (e.g., “Jordan_yield.csv”). Each of the single-country yield datasets is plotted with ggplot and the plots are combined using the patchwork package in R.


    8_GLDAS_read_extract_trend


    The last code provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9th of October 2023). The raw data comes in .nc file format and various variables can be extracted using the [“^a variable name”] command from the spatraster collection. Each time you run the code, this variable name must be adjusted to meet the requirements for the variables (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 09th of October 2023; or the respective code chunk when reading a .nc file with the ncdf4 package in R) or run print(nc) from the code or use names(the spatraster collection).
    Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
    From the processed data, trend analyses are conducted and z-scores are calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. Regarding, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subsets can be calculated as described in the code. Here, 3-month subsets were chosen for the growing seasons, e.g. March-May (MAM), June-August (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year).
    From the data, mean values of 48 consecutive years are calculated and trend analyses are performed as described above. In the same way, p-values are extracted and 95 % confidence level values are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, and different spatial extents across the globe due to the availability of the GLDAS variables.
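
    A short R sketch of the GLDAS reading step in this chunk: open one .nc file, pull a variable by name, and crop/mask it to the study area. The file name and variable pattern are illustrative.

      library(terra)

      gldas <- rast("GLDAS_CLSM025_D.A20220101.020.nc4")  # hypothetical file name
      print(names(gldas))                                 # inspect the available variable names

      tair <- gldas[[grep("^Tair", names(gldas))]]        # e.g., near-surface air temperature

      mask_shp    <- vect("MERGED_LEVANT.shp")
      tair_levant <- mask(crop(tair, mask_shp), mask_shp)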

  10. Supplementary materials for: "Comparing Internet experiences and...

    • dataverse.harvard.edu
    • dataone.org
    Updated Mar 10, 2020
    Cite
    Eszter Hargittai; Aaron Shaw (2020). Supplementary materials for: "Comparing Internet experiences and prosociality in Amazon Mechanical Turk and population-based survey samples" [Dataset]. http://doi.org/10.7910/DVN/UFL6MI
    Dataset updated
    Mar 10, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    Eszter Hargittai; Aaron Shaw
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/UFL6MI

    Description

    Overview

    Supplementary materials for the paper "Comparing Internet experiences and prosociality in Amazon Mechanical Turk and population-based survey samples" by Eszter Hargittai and Aaron Shaw, published in Socius in 2020 (https://doi.org/10.1177/2378023119889834).

    License

    The materials provided here are issued under the same (Creative Commons Attribution Non-Commercial 4.0) license as the paper. Details and a copy of the license are available at: http://creativecommons.org/licenses/by-nc/4.0/.

    Manifest

    The files included are:
    • Hargittai-Shaw-AMT-NORC-2019.rds and Hargittai-Shaw-AMT-NORC-2019.tsv: two (identical) versions of the dataset used for the analysis. The tsv file is provided to facilitate import into software other than R.
    • R analysis code files: 01-import.R imports the dataset and creates a mapping of dependent variables and variable names used elsewhere in the figure and analysis. 02-gen_figure.R generates Figure 1 in PDF and PNG formats and saves them in the "figures" directory. 03-gendescriptivestats.R generates the results reported in Table 1. 04-gen_models.R fits the models reported in Tables 2-4. 05-alternative_specifications.R fits models using a log-transformed version of the income variable.
    • Makefile: executes all of the R files in sequence and produces corresponding .log files in the "log" directory that contain the full R session from each file, as well as separate error log files (also in the "log" directory) that capture any error messages and warnings generated by R along the way.
    • HargittaiShaw2019Socius-Instrument.pdf: the questions distributed to both the NORC and AMT survey participants used in the analysis reported in this paper.

    How to reproduce the analysis presented in the paper

    Depending on your computing environment, reproducing the analysis presented in the paper may be as easy as invoking "make all" or "make" in the directory containing this file on a system that has the appropriate software installed. Once compilation is complete, you can review the log files in a text editor. See below for more on software and dependencies. If calling the makefile fails, the individual R scripts can also be run interactively or in batch mode.

    Software and dependencies

    The R and compilation materials provided here were created and tested on a 64-bit laptop PC running Ubuntu 18.04.3 LTS, R version 3.6.1, ggplot2 version 3.2.1, reshape2 version 1.4.3, forcats version 0.4.0, pscl version 1.5.2, and stargazer version 5.2.2 (these last five are R packages called in specific .R files). As with all software, your mileage may vary and the authors provide no warranties.

    Codebook

    The dataset consists of 36 variables (columns) and 2,716 participants (rows). The variable names and brief descriptions follow below. Additional details of measurement are provided in the paper and survey instrument. All dichotomous indicators are coded 0/1 where 1 is the affirmative response implied by the variable name:
    • id: Index to identify individual units (participants).
    • svy_raked_wgt: Raked survey weights provided by NORC.
    • amtsample: Data source coded 0 (NORC) or 1 (AMT).
    • age: Participant age in years.
    • female: Participant selected "female" gender.
    • incomecont: Income in USD (continuous), coded from the center-points of the categories reported in the instruments.
    • incomediv: Income in $1,000s USD (= incomecont/1000).
    • incomesqrt: Square root of incomecont.
    • lincome: Natural logarithm of incomecont.
    • rural: Participant resides in a rural area.
    • employed: Participant is fully or partially employed.
    • eduhsorless: Highest education level is high school or less.
    • edusc: Highest education level is completed some college.
    • edubaormore: Highest education level is BA or more.
    • white: Race = white.
    • black: Race = black.
    • nativeam: Race = native american.
    • hispanic: Ethnicity = hispanic.
    • asian: Race = asian.
    • raceother: Race = other.
    • skillsmean: Internet use skills index (described in paper).
    • accesssum: Internet use autonomy (described in paper).
    • webweekhrs: Internet use frequency (described in paper).
    • do_sum: Participatory online activities (described in paper).
    • snssumcompare: Social network site activities (described in paper).
    • altru_scale: Generous behaviors (described in paper).
    • trust_scale: Trust scale score (described in paper).
    • pts_give: Points donated in unilateral dictator game (described in paper).
    • std_accesssum, std_webweekhrs, std_skillsmean, std_do_sum, std_snssumcompare, std_trust_scale, std_altru_scale, std_pts_give: standardized (z-score) versions of the corresponding variables above.
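
    A minimal R sketch for loading the replication dataset listed in the manifest; either format should yield the same 2,716 x 36 data frame.

      d <- readRDS("Hargittai-Shaw-AMT-NORC-2019.rds")
      # or, for workflows outside R:
      d_tsv <- read.delim("Hargittai-Shaw-AMT-NORC-2019.tsv", stringsAsFactors = FALSE)

      str(d)               # 2,716 rows, 36 variables
      table(d$amtsample)   # 0 = NORC, 1 = AMT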

  11. Data and code for: A delayed response in the area-concentrated search can...

    • search.dataone.org
    • datadryad.org
    Updated Mar 27, 2025
    Cite
    Thotsapol Chaianunporn (2025). Data and code for: A delayed response in the area-concentrated search can improve foraging success [Dataset]. http://doi.org/10.5061/dryad.vx0k6dk2r
    Dataset updated
    Mar 27, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Thotsapol Chaianunporn
    Description

    Area-concentrated search (ACS) is a simple movement rule implying that an animal searches for resources using a 'state-dependent correlated random walk'. Accordingly, a forager increases its searching intensity by reducing the directionality of movement ('intensive search mode' or ISM) when it detects a resource item, but if it searches unsuccessfully for a while, it returns to a more straight-line movement to search for new resource locations elsewhere ('extensive search mode' or ESM). We propose a modified ACS, called delayed-response ACS (dACS), which would be more efficient in resource collection than standard ACS. Instead of immediately switching from ESM to ISM when encountering a resource, as is done in standard ACS, an individual foraging in the dACS mode delays this switch by 'x' steps so it continues moving in a straight line for a while before switching to ISM. Our results show that an individual with a suitable delay parameter 'x' for the dACS achieves substantially higher f...

    For the simulations we created infinite landscapes with a single resource cluster. The cluster size and resource density varied among different scenarios. We utilized the Matérn Cluster Point Process, using R version 3.5.3 and the 'spatstat' library version 1.58-2, to create resource clusters as a continuous spatial point pattern (Baddeley and Turner 2005). To understand the effects of the two landscape parameters, patch radius r and resource density u, on foraging success (see below), we created landscapes with either the cluster radius fixed at r = 40 and resource density u set to 0.04, 0.16, or 0.64 resource items per unit area, or with resource density fixed at u = 0.16 and cluster radius varied from 20 to 40 and 80 (measured in step length). The expected number of active resource items per cluster (R-bar) is consequently calculated as R-bar = g_i^2 × π × u. As movement rule we implemented an area-concentrated search where...

    Data and code from: A delayed response in the area-concentrated search can improve foraging success

    https://doi.org/10.5061/dryad.vx0k6dk2r
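
    A hedged R sketch of the landscape-generation idea described above, using spatstat's Matérn cluster process on a finite window; the window size and the reading of g_i as the cluster radius are assumptions, and the parameter values are illustrative rather than the study's settings.

      library(spatstat)
      set.seed(1)

      win <- owin(c(0, 200), c(0, 200))   # finite window standing in for the infinite landscape
      u <- 0.16                           # resource density (items per unit area)
      r <- 40                             # cluster radius (in step lengths)

      # expected items per cluster ~ u * pi * r^2, assuming g_i in the formula above is the cluster radius
      pp <- rMatClust(kappa = 1 / (200 * 200), scale = r, mu = u * pi * r^2, win = win)
      plot(pp, main = "Single Matern resource cluster (illustrative)")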

    Files and variables

    File: Experiment_1_Fixed_Patch_Size_-_Varied_Density.txt

    Description: Scenarios with fixed cluster radius at 40 units and varied resource density (0.04, 0.16, 0.64 resources per unit area)

    Variables
    • Resource density (0.04, 0.16, 0.64 resources per unit area)
    • Half-saturation constant (10, 20, 40, 80)
    • Delay parameter (0, 2, 4, 6, . . . , 100 steps)
    • Description for the header of data file
      • radius --> Size of cluster radius (spatial units)
      • halfsat --> Half-saturation constant
      • N_active --> The number of active resource point in each scenario
      • res_density --> Resource density (resources per unit area)
      • angle --> Initial movement angle
      • N_immi --> Number of immigrants into a patch (the number of patch border crossings)
      • N_encount...,
  12. Current Population Survey (CPS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the current population survey (cps) annual social and economic supplement (asec) with r

    the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no.

    despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population.

    the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

    this new github repository contains three scripts:

    2005-2012 asec - download all microdata.R
    • download the fixed-width file containing household, family, and person records
    • import by separating this file into three tables, then merge 'em together at the person-level
    • download the fixed-width file containing the person-level replicate weights
    • merge the rectangular person-level file with the replicate weights, then store it in a sql database
    • create a new variable - one - in the data table

    2012 asec - analysis examples.R
    • connect to the sql database created by the 'download all microdata' program
    • create the complex sample survey object, using the replicate weights
    • perform a boatload of analysis examples

    replicate census estimates - 2011.R
    • connect to the sql database created by the 'download all microdata' program
    • create the complex sample survey object, using the replicate weights
    • match the sas output shown in the png file below

    2011 asec replicate weight sas output.png: statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document. click here to view these three scripts.

    for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:
    • the census bureau's current population survey page
    • the bureau of labor statistics' current population survey page
    • the current population survey's wikipedia article

    notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.

    confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
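
    a minimal sketch of the import idea described above (not the github scripts themselves): read the cps-asec fixed-width file with the nber sas input statement via the SAScii package, then stash it in a sqlite database. the urls are illustrative stand-ins, not verified download links.

      library(SAScii)
      library(DBI)
      library(RSQLite)

      dat_url <- "https://www.nber.org/cps/cpsmar2012.zip"          # fixed-width microdata (illustrative url)
      sas_url <- "https://www.nber.org/data/progs/cps/cpsmar12.sas" # nber sas importation script (illustrative url)

      # read.SAScii parses the sas INPUT block to learn the column positions
      asec <- read.SAScii(dat_url, sas_url, zipped = TRUE)

      con <- dbConnect(SQLite(), "asec.sqlite")
      dbWriteTable(con, "asec2012", asec)
      dbDisconnect(con)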

  13. Occurrences and R code for: Dynamic distribution modeling of the Swamp...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Oct 24, 2022
    Cite
    Aaron Goodman (2022). Occurrences and R code for: Dynamic distribution modeling of the Swamp Tigertail dragonfly Synthemis eustalacta (Odonata: Anisoptera: Synthemistidae) over a 20-year bushfire regime [Dataset]. http://doi.org/10.5061/dryad.tx95x6b23
    Dataset updated
    Oct 24, 2022
    Dataset provided by
    American Museum of Natural History
    Authors
    Aaron Goodman
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Intensity and severity of bushfires in Australia have increased over the past few decades due to climate change, threatening habitat loss for numerous species. Although the impact of bushfires on vertebrates is well-documented, the corresponding effects on insect taxa are rarely examined, although they are responsible for key ecosystem functions and services. Understanding the effects of bushfire seasons on insect distributions could elucidate long-term impacts and patterns of ecosystem recovery. Here, we investigated the effects of recent bushfires, land-cover change, and climatic variables on the distribution of a common and endemic dragonfly, the swamp tigertail (Synthemis eustalacta (Burmeister, 1839)), which inhabits forests that have recently undergone severe burning. We used a temporally dynamic species distribution modeling approach that incorporated 20 years of community-science data on dragonfly occurrence and predictors based on fire, land cover, and climate to make yearly predictions of suitability. We also compared this to an approach that combines multiple temporally static models that use annual data. We found that for both approaches, fire-specific variables had negligible importance for the models, while percent of tree and non-vegetative cover were the most important. We also found that the dynamic model outperformed the static ones when evaluated with cross-validation. Model predictions indicated temporal variation in area and spatial arrangement of suitable habitat but no patterns of habitat expansion, contraction, or shifting. These results highlight not only the efficacy of dynamic modeling to capture spatiotemporal variables, such as vegetation cover for an endemic insect species, but also provide a novel approach to mapping species distributions with sparse locality records. Methods Occurrence Records We acquired occurrence records of adult S. eustalacta from the Global Biodiversity Information Facility (GBIF, DOI: https://doi.org/10.15468/dl.c256kv). We first subsetted the raw data by selecting occurrence records identified as museum samples, curated research-grade community science observations, and published sightings from scientific surveys. We then filtered out records without coordinate information and those recorded outside the years 2001–2020. Finally, we restricted our analysis dataset to records from scientific institutions (Australian Museum, Queensland Museum, Naturalis Biodiversity Center, Murray Darling Basin Authority), community science websites (iNaturalist, Atlas of Living Australia), and governmental organizations (New South Wales Department of Planning, Industry, and Environment, South Australia Department for Environment and Water). In total, we acquired 483 occurrences (Table 1). Further occurrence filtering consisted of removing sightings with erroneous localities (specimens located at known institutions). Environmental Data We generated yearly sets of environmental predictor variables for modeling that included bioclimatic variables, as well as vegetation cover and seasonal burned area for the years 2001-2020. All analyses were conducted using the statistical programming language R v. 4.1.2 (Team, 2021), and all layers have a geographic coordinate system (i.e., degrees) with a WGS84 datum. We acquired monthly minimum and maximum temperature and precipitation rasters for Australia at 2.5 arcminutes resolution (approx. 5 km at the equator), produced by the Australia Bureau of Meteorology (Jones et al., 2009). 
    From these rasters, we created a set of 19 bioclimatic variables representing means, variabilities, and extremes for temperature and precipitation using the dismo package (Hijmans et al., 2017). Due to known spatial artifacts, we omitted four of these variables that include temperature-precipitation interactions (bio08, bio09, bio18, bio19) from the analysis (Moo-Llanes et al., 2021). We also acquired remotely sensed variables from NASA’s Moderate Resolution Imaging Spectroradiometer (MODIS) using the MODIStsp package (Busetto & Ranghetti, 2016). Landscape variables describing fire and land cover have been shown to be important predictors of dragonfly range (Jolly et al., 2022). Increases in regional fire frequency lead to higher rates of river contamination via burned carbon and metal leaching (Kelly et al., 2020; Nasirian & Irvine, 2017; Nunes et al., 2018). Tree cover heavily affects other odonate species in terms of landscape patchiness (Dolný et al., 2014; Rith-Najarian, 1998; Suhonen et al., 2010; Suhonen et al., 2013). Loss of plant cover due to fires increases ambient temperature, which can drastically affect odonate survival (Castillo-Pérez et al., 2021). Based on this information and on knowledge of the species, we selected variables that we hypothesized were drivers of S. eustalacta distribution (Collins & Mcintyre, 2015; Theischinger & Hawking, 2006): vegetation continuous fields (VCF; defined as percent of pixel covered by each field) for percent tree cover, non-tree cover, and non-vegetated cover (percent tree cover subtracted from percent non-tree cover) (MOD44B, 250 m yearly resolution), annual evapotranspiration (MOD16A3, 500 m yearly resolution), Normalized Difference Vegetation Index (NDVI; MOD13A2, 1 km monthly resolution), and the Burned Area data product (MCD64A1, 500 m monthly resolution). We calculated yearly averages of each MODIS variable to capture annual variability and resampled all variables to the coarsest resolution (2.5 arcminutes). Pixel values for the Burned Area product range from 0 (unburned) to 365 (366 for leap years), corresponding to the days of the year. From these data, we generated annual Burned Area layers by converting these values to binary (pixel values ≥ 1 = burned, 0 = unburned). Finally, we acquired categorical rasters representing Australia’s major vegetation groups from The Biodiversity and Climate Change Virtual Laboratory (BCCVL, https://bccvl.org.au/), and global terrestrial ecoregions from the World Wildlife Fund (https://www.worldwildlife.org). All raster analyses were conducted with the R package raster v3.5-29 (Hijmans et al., 2021).

    Species Distribution Modeling

    Before modeling, we processed our occurrences to account for sampling bias, delineated a study extent to sample background points, and omitted highly correlated environmental variables. We spatially thinned occurrences by 10 km to reduce the effects of sampling bias and artificial clustering (Veloz, 2009) using the spThin package (Aiello-Lammens et al., 2015), which resulted in more even sample sizes across years (n = 133 total; Table 1). Synthemis eustalacta is endemic to southern Australia and disperses roughly 500 m from streams in early adulthood, but upon reaching maturity, returns to its site of emergence (Theischinger & Hawking, 2006).
    We thus chose a study extent to include potentially unsampled areas yet exclude large areas outside the species’ dispersal limitations (Peterson & Soberón, 2012), defined as a minimum convex polygon around all localities (2001-2020) buffered by 1 degree (approx. 111 km at the equator). Within this extent, we randomly sampled 50,000 background points for modeling and extracted their yearly environmental values. We used these values to calculate correlations between variables using the ‘vifcor’ and ‘vifstep’ functions in the usdm package v 1.1-18 (Naimi, 2017) and filtered out variables by year with correlation coefficients higher than 0.9 and a VIF threshold of 10. Finally, we retained for analysis only those variables that were kept across all yearly environmental backgrounds. To model the distribution of S. eustalacta, we used the presence-background algorithm Maxent v3.4.4 (Phillips et al., 2017), which remains one of the top-performing models for fitting SDMs with background data (Valavi et al., 2021). To automate model building and evaluation with different complexity settings and reporting of results, we used the R package ENMeval 2.0.0 (Kass et al., 2021). We constructed both dynamic models that incorporated data across years and static models that used year-specific data (Fig. 1). For the dynamic models, we extracted the environmental predictor values for each year from the occurrence and background points for that year and assembled them into a single training dataset. To construct a single background point dataset for the dynamic models, we extracted yearly environmental values for the same set of background points, then averaged these values across years per background point (we used the mode for categorical variables; Fig. 1). We evaluated models using random k-fold cross-validation, in which occurrences are randomly partitioned into a specified number of groups (i.e., “folds”), then models are sequentially trained on all groups but one (training data) and evaluated on the withheld group (validation data) (Hastie et al., 2009); we used four folds (k = 4) for our evaluations. As random partitioning can result in spatial autocorrelation due to clustering within folds, spatial block cross-validation techniques are often prescribed to address this (Roberts et al., 2017). However, as our occurrences varied not only in space but also in time, and as we lacked enough records per time bin to additionally separate by temporal block, we chose to use simpler random partitioning for evaluation. In contrast to the single dynamic model, we also constructed one static model per year that used only occurrence and background environmental values for that year. For this approach, we did not make models for years with fewer than five associated occurrences (Phillips et al., 2017; Phillips & Dudík, 2008). For static models, we partitioned our data using the ‘leave-one-out’ strategy (referred to as “jackknife” in ENMeval), whereby one occurrence record is withheld from each model during cross-validation (k = n, or the number of occurrences). This cross-validation technique is most appropriate for small sample sizes.
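
    The thinning, collinearity screening, and model-tuning steps described above can be approximated in R roughly as follows. This is a minimal sketch under assumed object names (occs_raw, bg_env, occ_swd, bg_swd), not the authors' deposited scripts.

      # Sketch of the workflow described above; all input objects are hypothetical.
      library(spThin)    # spatial thinning
      library(usdm)      # collinearity screening
      library(ENMeval)   # maxnet tuning and cross-validation

      # 1. Thin occurrences by 10 km to reduce sampling bias
      thinned <- thin(loc.data = occs_raw,
                      lat.col = "lat", long.col = "lon", spec.col = "species",
                      thin.par = 10, reps = 10,
                      locs.thinned.list.return = TRUE,
                      write.files = FALSE, write.log.file = FALSE)

      # 2. Drop predictors with high collinearity (VIF > 10), using values
      #    extracted at the background points
      v <- vifstep(bg_env, th = 10)
      bg_env_keep <- exclude(bg_env, v)

      # 3. Tune maxnet models with random 4-fold cross-validation;
      #    occ_swd and bg_swd are coordinate + predictor tables (SWD format)
      ev <- ENMevaluate(occs = occ_swd, bg = bg_swd,
                        algorithm = "maxnet",
                        partitions = "randomkfold",
                        partition.settings = list(kfolds = 4),
                        tune.args = list(fc = c("L", "LQ", "LQH"), rm = 1:4))
      ev@results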

  14. All R and Stata codes to replicate study results.

    • plos.figshare.com
    application/x-rar
    Updated Jun 2, 2023
    + more versions
    Cite
    Parisa Janjani; Nahid Salehi; Mohammad Rouzbahani; Soraya Siabani; Meysam Olfatifar (2023). All R and Stata codes to replicate study results. [Dataset]. http://doi.org/10.1371/journal.pone.0284668.s003
    Explore at:
    application/x-rar (available download format)
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Parisa Janjani; Nahid Salehi; Mohammad Rouzbahani; Soraya Siabani; Meysam Olfatifar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background

    The precise impact of sex differences on in-hospital mortality in ST-elevation myocardial infarction (STEMI) patients is unclear, and published studies are inconsistent. We therefore sought to evaluate the impact of sex differences in a cohort of STEMI patients.

    Methods

    We analyzed data from 2647 STEMI patients enrolled in the Kermanshah STEMI Cohort from July 2017 to May 2020. To clarify the relationship between sex and in-hospital mortality, propensity score matching (PSM) and causal mediation analysis were applied to the selected confounders and the identified intermediate variables, respectively.

    Results

    Before matching, the two groups differed on almost every baseline variable and on in-hospital death. After matching on 30 selected variables, the 574 male and female matched pairs differed significantly on only five baseline variables, and women were no longer at greater risk of in-hospital mortality (10.63% vs. 9.76%, p = 0.626). Among the suspected mediating variables, creatinine clearance (CLCR) alone accounted for 74% (0.665/0.895) of the total effect of 0.895 (95% CI: 0.464–1.332). With CLCR in the model, the relationship between sex and in-hospital death was no longer significant and reversed direction, -0.233 (95% CI: -0.623–0.068), indicating a full mediating role of CLCR.

    Conclusion

    Our findings could help address sex disparities in STEMI mortality. CLCR alone can fully explain this relationship, highlighting its importance in predicting short-term outcomes of STEMI patients and providing a useful indicator for clinicians.
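
    As a rough illustration of the analysis described above, propensity score matching and causal mediation analysis can be combined in R as sketched below. The data frame name (stemi) and the covariates shown are hypothetical stand-ins; the deposited R and Stata scripts remain the authoritative specification.

      # Illustrative sketch only; 'stemi' and its columns are hypothetical.
      library(MatchIt)     # propensity score matching
      library(mediation)   # causal mediation analysis

      # 1. Match women to men on baseline confounders via a propensity score
      m_out   <- matchit(sex ~ age + diabetes, data = stemi,
                         method = "nearest", distance = "glm", ratio = 1)
      matched <- match.data(m_out)

      # 2. Does creatinine clearance (clcr) mediate the sex effect on death?
      med_fit <- lm(clcr ~ sex + age, data = matched)              # mediator model
      out_fit <- glm(death ~ sex + clcr + age, data = matched,
                     family = binomial())                          # outcome model
      med_res <- mediate(med_fit, out_fit, treat = "sex", mediator = "clcr",
                         sims = 1000)
      summary(med_res)   # includes the proportion of the total effect mediated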

  15. Data from: Hindcast-validated species distribution models reveal future vulnerabilities of mangroves and salt marsh species

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Aug 30, 2022
    Cite
    Richard Hodel; Douglas Soltis; Pamela Soltis (2022). Hindcast-validated species distribution models reveal future vulnerabilities of mangroves and salt marsh species [Dataset]. http://doi.org/10.5061/dryad.08kprr55b
    Explore at:
    zip (available download format)
    Dataset updated
    Aug 30, 2022
    Dataset provided by
    Muséum national d'Histoire naturelle
    University of Florida
    Authors
    Richard Hodel; Douglas Soltis; Pamela Soltis
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Rapid climate change threatens biodiversity via habitat loss, range shifts, increases in invasive species, novel species interactions, and other unforeseen changes. Coastal and estuarine species are especially vulnerable to the impacts of climate change due to sea level rise and may be severely impacted in the next several decades. Species distribution modeling can project the potential future distributions of species under scenarios of climate change using bioclimatic data and georeferenced occurrence data. However, models projecting suitable habitat into the future are impossible to ground truth. One solution is to develop species distribution models for the present and project them to periods in the recent past where distributions are known, to test model performance before making projections into the future. Here, we develop models using abiotic environmental variables to quantify the current suitable habitat available to eight Neotropical coastal species: four mangrove species and four salt marsh species. Using a novel model validation approach that leverages newly available monthly climatic data from 1960-2018, we project these niche models into two time periods in the recent past (i.e., within the past half-century) when either mangrove or salt marsh dominance was documented via other data sources. Models were hindcast-validated and then used to project the suitable habitat of all species at four time periods in the future under a model of climate change. For all future time periods, the projected suitable habitat of mangrove species decreased, and suitable habitat declined more severely in salt marsh species.

    Methods

    Data acquisition

    We obtained specimen-based occurrence data for each species from iDigBio (Integrated Digitized Biocollections; idigbio.org) and GBIF (Global Biodiversity Information Facility; gbif.org) and supplemented these data with locality data from personal collections for three mangrove species (Avicennia germinans, Laguncularia racemosa, Rhizophora mangle). Four of the species included in the analysis are mangroves (Avicennia germinans, black mangrove; Laguncularia racemosa, white mangrove; and Rhizophora mangle, red mangrove) or mangrove-associated species (Conocarpus erectus, buttonwood). For simplicity, these four species will hereafter be collectively referred to as ‘mangroves,’ even though Conocarpus erectus is not considered a true mangrove (Tomlinson, 2016). We also selected four salt marsh species (Batis maritima, turtleweed; Sesuvium portulacastrum, sea purslane; Spartina alterniflora, smooth cordgrass; and Sporobolus virginicus, seashore dropseed) for analyses. These four species were selected because they occur in close proximity to one another (indicating the presence of salt marsh habitat) and because of their broad and exclusively coastal distributions in the Neotropics. We used SDM to investigate changes in suitable habitat for all eight species. The raw data were cleaned using standard approaches and R scripts (e.g., Marchant et al., 2017); duplicates and incorrect data (e.g., latitude and longitude of 0) were removed from the data set. All scripts used in this paper were deposited in GitHub (github.com/richiehodel/coastal_ENM), and all cleaned occurrence data, layers, and models were deposited in Dryad. We included species that had exclusively coastal or estuarine distributions, and only species with at least 50 occurrence points (after cleaning) were used in the analyses.
Given the complexities of the modeling approach, we focused on the Neotropics as opposed to a global analysis; only mangrove and salt marsh species with native ranges in the Americas were used (i.e., cosmopolitan species were excluded). Certain species that inhabit salt marshes, but that have extensive inland distributions, including freshwater wetlands, were excluded (e.g., Distichlis spicata). We acquired bioclimatic environmental layers from Worldclim 2.1 (worldclim.org; Fick & Hijmans, 2017) for multiple time periods. The bioclimatic layers, which contain temperature and precipitation data for every continent except Antarctica, have been used extensively and successfully in SDM studies (Booth, 2018). In Worldclim 2.1, annual precipitation, maximum temperature, and minimum temperature data are available for every month from 1960-2018 at 2.5 arc minute resolution; these three variables can be used to calculate values of all 19 bioclimatic variables (Harris et al., 2014; Fick & Hijmans, 2017; Hijmans et al., 2017). We considered the present to be 2013-2018, the 1980s salt marsh dominance period to be 1984-1989, and the early 2000s mangrove dominance to be 2001-2006. These time periods were selected to capture the optimal amount of either mangrove or salt marsh dominance during each documented oscillation (Cavanaugh et al., 2019), and we selected these windows of time so that the present and past time periods were all six years. Although many of the study species may be longer-lived than each of the time periods (i.e., six years), we prioritized using time periods that captured either mangrove or salt marsh dominance. Due to the oscillations of mangrove versus salt marsh dominance, many individual plants were likely exterminated on short time scales. We used all occurrence data to construct an SDM for each species for our defined present time (2013-2018) regardless of when the specimens were collected. It would be ideal to use separate occurrence specimens from each time period to assess SDM performance, but this was not possible with the temporal distribution of georeferenced data points. For each six-year time period, we averaged the annual precipitation, maximum temperature, and minimum temperature for each month (e.g., average values of these three variables were calculated across the six January months, six February months, etc. in each time period) and used the resulting 12 monthly averages to calculate the standard 19 bioclimatic variable values using the ‘biovars’ function in the ‘dismo’ R package for each six-year time period (Hijmans et al., 2017). The standard 19 bioclimatic variables are not available on a monthly basis because some of them incorporate seasonality and require data for at least one year. By using monthly data for annual precipitation, maximum temperature, and minimum temperature variables, all of the 19 bioclimatic variables can be calculated (Hijmans et al., 2017). All layers were then trimmed so that the extent of the study area was between -120 and -32 degrees longitude, and -36 and 36 degrees latitude using custom scripts and the R package ‘raster’ (Hijmans et al., 2015) and exported in ASCII format (Fig. 1). This study area was selected because it included subtropical and tropical regions of both the Northern and Southern Hemispheres, captured the ecotone between mangrove and salt marsh species in both Hemispheres, and allowed for an expansion zone as some species may expand their ranges in the future as the climate changes. 
Regions such as Hawaii, where some Neotropical mangrove species have been introduced, were not included in the study. We used an R script and the R package ‘raster’ (Hijmans et al., 2015) to measure the pairwise correlation of the 19 bioclimatic variables. When variables were correlated with one another (r > 0.7), only one of the layers was retained for subsequent analyses (Dormann et al., 2013). After removing correlated layers, we had a data set of six bioclimatic variables (BIO2, mean diurnal temperature range; BIO5, maximum temperature of the warmest month; BIO6, minimum temperature of the coldest month; BIO12, annual precipitation; BIO15, precipitation seasonality; BIO18, precipitation of warmest quarter). BIO6 and BIO1 were highly correlated (r = 0.956), and BIO1 (mean annual temperature) was excluded even though it is frequently included in SDM analyses because BIO6 has been identified as an important variable shaping range limits of coastal species (Tomlinson, 2016). All layers were clipped using the ‘mask’ function in the ‘raster’ R package (Hijmans et al., 2015) such that all cells with an elevation greater than 10m were considered ‘no data’ cells. This was done to ensure that the SDM analyses were not trained on inland regions representing areas where these coastal species do not occur.
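
    The layer preparation described above (deriving the 19 bioclimatic variables from six-year monthly averages, trimming to the study extent, checking pairwise correlations, and masking cells above 10 m elevation) could look roughly like the following sketch; the raster object names are hypothetical.

      library(raster)
      library(dismo)

      # 12-layer stacks of monthly averages for one six-year window (hypothetical)
      bio <- biovars(prec_6yr, tmin_6yr, tmax_6yr)        # standard 19 bioclim layers

      # Trim to the study extent (-120 to -32 longitude, -36 to 36 latitude)
      bio <- crop(bio, extent(-120, -32, -36, 36))

      # Pairwise Pearson correlations; one layer of each pair with r > 0.7 is dropped
      cors <- layerStats(bio, stat = "pearson", na.rm = TRUE)$`pearson correlation coefficient`

      # Mask out cells above 10 m elevation ('elev' is a hypothetical DEM on the same grid)
      bio_coastal <- mask(bio, elev > 10, maskvalue = 1)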

    Species Distribution Modeling

    The occurrence data obtained from digitized herbaria records and the six environmental layers were used as input for the SDM analyses. SDM uses the occurrence data for each species in the present to identify pixels that have suitable habitat for the species of interest based on environmental data. We used the maximum entropy algorithm implemented in MAXENT v3.4.1 (Phillips et al., 2006; Phillips et al., 2017) to conduct SDM analyses. The maximum entropy algorithm uses presence data and random background sampling to develop the model, and it has been shown to perform well with presence-only data (Elith et al., 2006; Wisz et al., 2008). Optimal settings for MAXENT model fit were determined using the ‘ENMevaluate’ function in the ENMeval R package (Muscarella et al., 2014). We investigated regularization multipliers from 0.5 to 4 at intervals of 0.5 and the following features/combinations of features: linear, linear/quadratic, linear/quadratic/hinge, linear/quadratic/hinge/product, linear/quadratic/product/threshold, and linear/quadratic/hinge/product/threshold. The ‘ENMevaluate’ function was run for each species, using the same 10,000 background points, occurrence data for the species of interest, and the ‘maxnet’ algorithm with the ‘checkerboard2’ method. The ΔAICc scores for all models tested for each species were compared to determine the optimal model to be input into MAXENT. Other non-default settings used include five-fold cross-validation, a minimum training presence threshold, and fading by clamping. Cloglog output was used because it can be interpreted as an estimate of the probability of presence.
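
    A compact sketch of this tuning step is given below, written against the current ENMeval 2.x interface (the entry cites an earlier ENMeval version, so argument names may differ); the occurrence/background tables and the raster stack are hypothetical.

      library(ENMeval)

      ev <- ENMevaluate(occs = occs_xy, envs = env_stack, bg = bg_xy,
                        algorithm = "maxnet",
                        partitions = "checkerboard2",
                        tune.args = list(fc = c("L", "LQ", "LQH", "LQHP", "LQHPT"),
                                         rm = seq(0.5, 4, by = 0.5)))

      res <- eval.results(ev)
      opt <- res[which.min(res$delta.AICc), ]   # model with the lowest delta AICc
      opt[, c("fc", "rm", "delta.AICc")]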

  16. Codes in R for spatial statistics analysis, ecological response models and spatial distribution models

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Feb 6, 2023
    Cite
    Rössel-Ramírez, D. W.; Palacio-Núñez, J.; Espinosa, S.; Martínez-Montoya, J. F. (2023). Codes in R for spatial statistics analysis, ecological response models and spatial distribution models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7603556
    Explore at:
    Dataset updated
    Feb 6, 2023
    Dataset provided by
    Facultad de Ciencias, Universidad Autónoma de San Luis Potosí. San Luis Potosí, S.L.P. México.
    Campus San Luis, Colegio de Postgraduados. Salinas de Hidalgo, S.L.P. México.
    Authors
    Rössel-Ramírez, D. W.; Palacio-Núñez, J.; Espinosa, S.; Martínez-Montoya, J. F.
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In the last decade, a plethora of algorithms has been developed for spatial ecology studies. In our case, we use some of these codes in underwater research for applied-ecology analyses of threatened endemic fishes and their natural habitat. For this, we developed scripts in the RStudio® environment to run spatial and statistical analyses for ecological response and spatial distribution models (e.g., Hijmans & Elith, 2017; Den Burg et al., 2020). The employed R packages are as follows: caret (Kuhn et al., 2020), corrplot (Wei & Simko, 2017), devtools (Wickham, 2015), dismo (Hijmans & Elith, 2017), gbm (Freund & Schapire, 1997; Friedman, 2002), ggplot2 (Wickham et al., 2019), lattice (Sarkar, 2008), lattice (Musa & Mansor, 2021), maptools (Hijmans & Elith, 2017), modelmetrics (Hvitfeldt & Silge, 2021), pander (Wickham, 2015), plyr (Wickham & Wickham, 2015), pROC (Robin et al., 2011), raster (Hijmans & Elith, 2017), RColorBrewer (Neuwirth, 2014), Rcpp (Eddelbeuttel & Balamura, 2018), rgdal (Verzani, 2011), sdm (Naimi & Araujo, 2016), sf (e.g., Zainuddin, 2023), sp (Pebesma, 2020) and usethis (Gladstone, 2022).

    It is important to run all the codes in order to obtain results from the ecological response and spatial distribution models. In particular, for the ecological scenario we selected the Generalized Linear Model (GLM), and for the geographic scenario we selected DOMAIN, also known as Gower's metric (Carpenter et al., 1993). We selected this regression method and this distance-based similarity metric because of their adequacy and robustness for studies of endemic or threatened species (e.g., Naoki et al., 2006). Next, we explain the statistical parameterization used in the GLM and DOMAIN codes:

    In the first instance, we generated the background points and extracted the values of the variables (Code2_Extract_values_DWp_SC.R). Barbet-Massin et al. (2012) recommend using 10,000 background points with regression methods (e.g., Generalized Linear Model) or distance-based models (e.g., DOMAIN). However, we consider factors such as the extent of the study area and the type of study species to be important for correctly choosing the number of points (pers. obs.). We then extracted the values of the predictor variables (e.g., bioclimatic, topographic, demographic, habitat) at the presence and background points (e.g., Hijmans and Elith, 2017).
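
    A minimal sketch of this step is shown below, using dismo::randomPoints() and raster::extract(); the raster stack and presence table are hypothetical placeholders for the objects handled in Code2_Extract_values_DWp_SC.R.

      library(dismo)
      library(raster)

      set.seed(42)
      bg_pts <- randomPoints(env_stack, n = 10000)    # background points

      pres_env <- extract(env_stack, presences)       # predictor values at presences
      bg_env   <- extract(env_stack, bg_pts)          # predictor values at backgrounds

      # One table with a presence (1) / background (0) indicator
      swd <- data.frame(pa = c(rep(1, nrow(pres_env)), rep(0, nrow(bg_env))),
                        rbind(pres_env, bg_env))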

    Subsequently, we subdivided both the presence and background point groups into 75% training data and 25% test data each, following the method of Soberón & Nakamura (2009) and Hijmans & Elith (2017). For the training control, the 10-fold cross-validation method was selected, with the response variable presence assigned as a factor. If some other variable is important for the study species, it should also be assigned as a factor (Kim, 2009).
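
    The 75/25 split and the 10-fold training control can be sketched with the caret package as follows; the table 'swd' comes from the previous sketch and is hypothetical.

      library(caret)

      set.seed(42)
      idx   <- createDataPartition(swd$pa, p = 0.75, list = FALSE)
      train <- swd[idx, ]
      test  <- swd[-idx, ]

      # Response treated as a factor, as described above
      train$pa <- factor(train$pa, levels = c(0, 1),
                         labels = c("background", "presence"))

      ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE)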

    After that, we ran the code for the GBM method (Gradient Boosting Machine; Code3_GBM_Relative_contribution.R and Code4_Relative_contribution.R), from which we obtained the relative contribution of the variables used in the model. We parameterized the code with a Gaussian distribution and 5,000 iterations with cross-validation (e.g., Friedman, 2002; Kim, 2009; Hijmans and Elith, 2017). In addition, we selected a validation interval of 4 random training points (personal test). The resulting plots are partial dependence plots as a function of each predictor variable.
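
    A gbm sketch of this step follows; the Gaussian distribution and 5,000 iterations match the description above, while the remaining tuning values and the numeric 0/1 response table (train01) are assumptions for illustration.

      library(gbm)

      gbm_fit <- gbm(pa ~ ., data = train01, distribution = "gaussian",
                     n.trees = 5000, interaction.depth = 4, cv.folds = 10)

      summary(gbm_fit)                               # relative contribution of predictors
      best_iter <- gbm.perf(gbm_fit, method = "cv")  # iterations chosen by cross-validation
      plot(gbm_fit, i.var = 1, n.trees = best_iter)  # partial dependence, first predictor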

    Subsequently, correlations among variables were computed with Pearson's method (Code5_Pearson_Correlation.R) to evaluate multicollinearity between variables (Guisan & Hofer, 2003). A bivariate correlation threshold of ±0.70 is recommended for discarding highly correlated variables (e.g., Awan et al., 2021).
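
    A correlation check along these lines can be sketched as follows (hypothetical predictor table):

      library(corrplot)

      pred_cols <- setdiff(names(train01), "pa")
      cor_mat   <- cor(train01[, pred_cols], method = "pearson", use = "complete.obs")
      corrplot(cor_mat, method = "circle", type = "lower")

      # Pairs exceeding the +/-0.70 threshold are candidates for removal
      which(abs(cor_mat) > 0.70 & upper.tri(cor_mat), arr.ind = TRUE)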

    Once the above codes were run, we loaded the same subgroups (i.e., presence and background groups with 75% training and 25% testing; Code6_Presence&backgrounds.R) for the GLM method code (Code7_GLM_model.R). Here, we first ran a GLM per variable to obtain its significance (p-value, alpha ≤ 0.05); we selected the value one (i.e., presence) as the modeled outcome in the likelihood. The generated models are of polynomial degree, to obtain linear and quadratic responses (e.g., Fielding and Bell, 1997; Allouche et al., 2006). From these results, we produced ecological response curves, where the resulting plots show the probability of occurrence against the values of continuous variables or the categories of discrete variables. The points of the presence and background training groups are also included.
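
    A per-variable polynomial GLM and its response curve can be sketched as below; the predictor name 'depth' and the 0/1 response table are hypothetical.

      # Linear + quadratic terms for one predictor
      glm_depth <- glm(pa ~ poly(depth, 2), data = train01, family = binomial())
      summary(glm_depth)   # p-values for the linear and quadratic terms

      # Ecological response curve: probability of occurrence across the variable's range
      new_depth <- data.frame(depth = seq(min(train01$depth), max(train01$depth),
                                          length.out = 200))
      resp <- predict(glm_depth, newdata = new_depth, type = "response")
      plot(new_depth$depth, resp, type = "l",
           xlab = "Depth", ylab = "Probability of occurrence")
      points(train01$depth, train01$pa, col = "grey40")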

    On the other hand, a global GLM was also run, and the resulting model was evaluated by means of a 2 x 2 contingency matrix of observed versus predicted records. A representation of this is shown in Table 1 (adapted from Allouche et al., 2006). In this process we selected an arbitrary decision threshold of 0.5 to obtain better modeling performance and to avoid a high percentage of bias toward type I (commission) or type II (omission) errors (e.g., Carpenter et al., 1993; Fielding and Bell, 1997; Allouche et al., 2006; Kim, 2009; Hijmans and Elith, 2017).

    Table 1. Example of 2 x 2 contingency matrix for calculating performance metrics for GLM models. A represents true presence records (true positives), B represents false presence records (false positives - error of commission), C represents true background points (true negatives) and D represents false backgrounds (false negatives - errors of omission).

                        Validation set
    Model               True            False
    Presence            A               B
    Background          C               D

    We then calculated the overall accuracy and True Skill Statistic (TSS) metrics. The first assesses the proportion of correctly predicted cases, while the TSS corrects that proportion for random performance, giving equal weight to sensitivity (correctly predicted presences) and specificity (correctly predicted backgrounds) (Fielding and Bell, 1997; Olden and Jackson, 2002; Allouche et al., 2006).
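
    Following Allouche et al. (2006), overall accuracy and TSS can be computed directly from the cells of Table 1 (TSS = sensitivity + specificity - 1); the counts used in the example call below are made up.

      confusion_metrics <- function(A, B, C, D) {
        n           <- A + B + C + D
        overall     <- (A + C) / n        # proportion of correctly predicted cases
        sensitivity <- A / (A + D)        # true presences / all validation presences
        specificity <- C / (B + C)        # true backgrounds / all validation backgrounds
        tss         <- sensitivity + specificity - 1
        c(overall = overall, sensitivity = sensitivity,
          specificity = specificity, TSS = tss)
      }

      confusion_metrics(A = 40, B = 10, C = 35, D = 15)   # illustrative counts only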

    The last code (i.e., Code8_DOMAIN_SuitHab_model.R) is for species distribution modelling using the DOMAIN algorithm (Carpenter et al., 1993). Here, we loaded the variable stack and the presence and background group subdivided into 75% training and 25% test, each. We only included the presence training subset and the predictor variables stack in the calculation of the DOMAIN metric, as well as in the evaluation and validation of the model.
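
    The DOMAIN step can be sketched with the dismo package as follows; the raster stack and point objects are hypothetical, and the deposited Code8_DOMAIN_SuitHab_model.R remains the authoritative version.

      library(dismo)
      library(raster)

      dm       <- domain(env_stack, p = pres_train_xy)   # fit on presence training points
      suit_map <- predict(env_stack, dm)                 # continuous suitability surface

      # Evaluate on the withheld 25% test data
      ev <- evaluate(p = pres_test_xy, a = bg_test_xy, model = dm, x = env_stack)
      ev@auc                                 # ROC/AUC on the test data
      threshold(ev, stat = "spec_sens")      # threshold maximizing sensitivity + specificity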

    Regarding the model evaluation and estimation, we selected the following estimators:

    1) partial ROC, which evaluates the separation between the curves of positive (i.e., correctly predicted presence) and negative (i.e., correctly predicted absence) cases. The farther apart these curves are, the better the model's predictive performance for the correct spatial distribution of the species (Manzanilla-Quiñones, 2020).

    2) ROC/AUC curve for model validation, where an optimal performance threshold is estimated to have an expected confidence of 75% to 99% probability (De Long et al., 1988).

  17. soilmap_simple: a simplified and standardized derivative of the digital soil map of the Flemish Region

    • data.niaid.nih.gov
    Updated Mar 24, 2025
    Cite
    Vanderhaeghe, Floris; De Vos, Bruno; Cools, Nathalie (2025). soilmap_simple: a simplified and standardized derivative of the digital soil map of the Flemish Region [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3732903
    Explore at:
    Dataset updated
    Mar 24, 2025
    Dataset provided by
    Research Institute for Nature and Forest (INBO)
    Authors
    Vanderhaeghe, Floris; De Vos, Bruno; Cools, Nathalie
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Flanders, Flemish Region
    Description

    The data source soilmap_simple is a simplified and standardized derived form of the 'digital soil map of the Flemish Region' (the shapefile of which we named soilmap, for analytical workflows in R) published by 'Databank Ondergrond Vlaanderen’ (DOV). It is a GeoPackage that contains a spatial polygon layer ‘soilmap_simple’ in the Belgian Lambert 72 coordinate reference system (EPSG-code 31370), plus a non-spatial table ‘explanations’ with the meaning of category codes that occur in the spatial layer. Further documentation about the digital soil map of the Flemish Region is available in Van Ranst & Sys (2000) and Dudal et al. (2005).

    This version of soilmap_simple was derived from version 'soilmap_2017-06-20' (Zenodo DOI) as follows:

    all attribute variables received English names (purpose of standardization), starting with prefix bsm_ (referring to the 'Belgian soil map');

    attribute variables were reordered;

    the values of the morphogenetic substrate, texture and drainage variables (bsm_mo_substr, bsm_mo_tex and bsm_mo_drain + their _explan counterparts) were filled for most features in the 'coastal plain' area.

    To derive morphogenetic texture and drainage levels from the geomorphological soil types, a conversion table by Bruno De Vos & Carole Ampe was applied (for earlier work on this, see Ampe 2013).

    Substrate classes were copied over from bsm_ge_substr into bsm_mo_substr (bsm_ge_substr already followed the categories of bsm_mo_substr).

    These steps coincide with the approach that had been taken to construct the Unitype variable in the soilmap data source;
    

    only a minimal number of variables were selected: those that are most useful for analytical work.

    See R-code in the GitHub repository 'n2khab-preprocessing' at commit b3c6696 for the creation from the soilmap data source.

    A reading function to return soilmap_simple (this data source) or soilmap in a standardized way into the R environment is provided by the R-package n2khab.
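
    In practice the GeoPackage can be read into R with sf, or with the reading function mentioned above; the file name and the n2khab function name shown here are assumptions based on this description.

      library(sf)

      soilmap_simple <- st_read("soilmap_simple.gpkg", layer = "soilmap_simple")
      explanations   <- st_read("soilmap_simple.gpkg", layer = "explanations")
      st_crs(soilmap_simple)   # should report Belgian Lambert 72 (EPSG:31370)

      # Alternative described above: the n2khab package provides a standardized reader
      # (function name assumed)
      # soilmap_simple <- n2khab::read_soilmap()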

    The attributes of the spatial polygon layer soilmap_simple can have mo_ in their name to refer to the Belgian Morphogenetic System:

    bsm_poly_id: unique polygon ID (numeric)

    bsm_region: name of the region

    bsm_converted: boolean. Were morphogenetic texture and drainage variables (bsm_mo_tex and bsm_mo_drain) derived from a conversion table (see above)? Value TRUE is largely confined to the 'coastal plain' areas.

    bsm_mo_soilunitype: code of the soil type (applying morphogenetic codes within the coastal plain areas when possible, just as for the following three variables)

    bsm_mo_substr: code of the soil substrate

    bsm_mo_tex: code of the soil texture category

    bsm_mo_drain: code of the soil drainage category

    bsm_mo_prof: code of the soil profile category

    bsm_mo_parentmat: code of a variant regarding the parent material

    bsm_mo_profvar: code of a variant regarding the soil profile

    The non-spatial table explanations has following variables:

    subject: attribute name of the spatial layer: either bsm_mo_substr, bsm_mo_tex, bsm_mo_drain, bsm_mo_prof, bsm_mo_parentmat or bsm_mo_profvar

    code: category code that occurs as value for the corresponding attribute in the spatial layer

    name: explanation of the value of code

  18. Methods for clustering or defining distances between samples with mixed data.

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Manuela Hummel; Dominic Edelmann; Annette Kopp-Schneider (2023). Methods for clustering or defining distances between samples with mixed data. [Dataset]. http://doi.org/10.1371/journal.pone.0188274.t002
    Explore at:
    xls (available download format)
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Manuela Hummel; Dominic Edelmann; Annette Kopp-Schneider
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The columns indicate whether hierarchical clustering is suitable (in contrast to partitioning), whether distance matrices can be retrieved, whether the functionality is available in R (to the authors’ knowledge) and whether ordinal variables are treated in a special way. Only clustering based on Gower’s similarity coefficient is applied throughout the manuscript.
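
    As a brief illustration of the Gower-based approach applied in the manuscript, mixed numeric/ordinal/nominal data can be clustered in R as follows; the small example data frame is made up.

      library(cluster)

      mixed_df <- data.frame(
        age    = c(34, 58, 41, 25),
        stage  = factor(c("I", "III", "II", "I"), ordered = TRUE),
        smoker = factor(c("yes", "no", "no", "yes"))
      )

      d  <- daisy(mixed_df, metric = "gower")          # dissimilarities for mixed data types
      hc <- hclust(as.dist(d), method = "average")     # hierarchical clustering
      plot(hc)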

  19. Information criteria for the selected classification variables.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated May 31, 2023
    Cite
    Guangming Li; Mengying Li; Shuzhen Peng; Ying Wang; Li Ran; Xuyu Chen; Ling Zhang; Sirong Zhu; Qi Chen; Wenjing Wang; Yang Xu; Yubin Zhang; Xiaodong Tan (2023). Information criteria for the selected classification variables. [Dataset]. http://doi.org/10.1371/journal.pone.0265406.t002
    Explore at:
    xls (available download format)
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Guangming Li; Mengying Li; Shuzhen Peng; Ying Wang; Li Ran; Xuyu Chen; Ling Zhang; Sirong Zhu; Qi Chen; Wenjing Wang; Yang Xu; Yubin Zhang; Xiaodong Tan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Information criteria for the selected classification variables.

  20. Formulas for Learning Direct Learning, 1st-Order Occasion Setting, and 2nd-Order Occasion Setting.

    • plos.figshare.com
    bin
    Updated Jun 5, 2023
    Cite
    Tomislav D. Zbozinek; Omar D. Perez; Toby Wise; Michael Fanselow; Dean Mobbs (2023). Formulas for Learning Direct Learning, 1st-Order Occasion Setting, and 2nd-Order Occasion Setting. [Dataset]. http://doi.org/10.1371/journal.pcbi.1010410.t003
    Explore at:
    bin (available download format)
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Tomislav D. Zbozinek; Omar D. Perez; Toby Wise; Michael Fanselow; Dean Mobbs
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In the table, subscripts are stimulus names (e.g., A, B, C; "sum" for all stimuli present on a trial); superscripts are trial numbers. For "R" formula, superscript "n" for all variables, and subscript "sum" for all variables except "R.” Formulas are arranged in column format for readability. Responding (R) formula ultimately predicts behavioral responding and learning. R operates by adding excitation and subtracting inhibition (formula in dark gray). Light gray columns highlight the similar variables used in the "recipe" across our learning formulas. Hierarchical control of a) 2nd-order occasion setting (2nd OS) on 1st-order occasion setting (1st OS) and b) 1st OS on direct associative learning (i.e., CSs) is accomplished with modulation, in which the higher-order stimulus affects the lower-order stimuli’s signal of US (non)occurrence. The gateway to higher-order learning (from direct learning to 1st OS, and from 1st OS to 2nd OS) is lower-order stimulus ambiguity. Mechanism through which 1st OS is learned is γ1, which is the degree to which the CS is ambiguous (i.e., that the CS has both direct excitation and direct inhibition). If the present CS is unambiguous (e.g., only excitatory or only inhibitory), γ1 remains at 0, and no 1st OS is learned. Once a given CS is ambiguous (i.e., the CS is both excitatory and inhibitory), γ1 becomes positive, allowing P and N to increase from zero and for 1st OS to be learned. CS ambiguity is necessary but not sufficient for 1st-order occasion setting to be learned. In order for 1st OS to be learned, the individual must also learn that the CS can be modulated by a 1st-order positive occasion setter () or 1st-order negative occasion setter (). This will occur if a stimulus/context (i.e., the 1st-order occasion setter) provides information about the CS’s (non)reinforcement and if the stimulus/context is less salient than the CS e.g., 43. and are values contained by the CS (rather than the 1st-order occasion setter), indicating the CS can be modulated by a 1st-order positive or negative occasion setter, respectively. Thus, a CS must be both ambiguous and trained with a 1st-order occasion setter in order to be modulated by other 1st-order occasion setters (e.g., a simple partially reinforced CS will not be affected by a 1st-order occasion setter because its and will equal 0, causing the 1st OS terms in the R formula to equal 0).
