We provide instructions, code, and datasets for replicating the article by Kim, Lee and McCulloch (2024), "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews." This repository provides a user-friendly R package that lets researchers and practitioners apply the topic-based segmentation model with unstructured texts (latent class regression with group variable selection) to their own datasets.

First, we provide R code to replicate the illustrative simulation study: see file 1. Second, we provide the user-friendly R package with a very simple example code to help apply the model to real-world datasets: see file 2, Package_MixtureRegression_GroupVariableSelection.R and Dendrogram.R. Third, we provide a set of codes and instructions to replicate the empirical studies of customer-level and restaurant-level segmentation with Yelp reviews data: see files 3-a, 3-b, 4-a, 4-b. Note that, because of Yelp's dataset terms of use and data size restrictions, we instead provide a link to download the same Yelp datasets (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/versions/6). Fourth, we provide a set of codes and datasets to replicate the empirical study with professor ratings reviews data: see file 5. Please see further details in the description text and comments of each file.

[A guide on how to use the code to reproduce each study in the paper]

1. Full codes for replicating Illustrative simulation study.txt -- [see Table 2 and Figure 2 in main text]: R source code to replicate the illustrative simulation study. Please run it from beginning to end in R. In addition to estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships, you will get the dendrograms of selected groups of variables in Figure 2. Computing time is approximately 20 to 30 minutes.

3-a. Preprocessing raw Yelp Reviews for Customer-level Segmentation.txt: Code for preprocessing the downloaded unstructured Yelp review data and preparing the DV and IV matrices for the customer-level segmentation study.

3-b. Instruction for replicating Customer-level Segmentation analysis.txt -- [see Table 10 in main text; Tables F-1, F-2, and F-3 and Figure F-1 in Web Appendix]: Code for replicating the customer-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships. Computing time is approximately 3 to 4 hours.

4-a. Preprocessing raw Yelp reviews_Restaruant Segmentation (1).txt: R code for preprocessing the downloaded unstructured Yelp data and preparing the DV and IV matrices for the restaurant-level segmentation study.

4-b. Instructions for replicating restaurant-level segmentation analysis.txt -- [see Tables 5, 6 and 7 in main text; Tables E-4 and E-5 and Figure H-1 in Web Appendix]: Code for replicating the restaurant-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships. Computing time is approximately 10 to 12 hours.

[Guidelines for running benchmark models in Table 6] (a short R sketch illustrating two of these benchmarks is given at the end of this guide)

Unsupervised topic model: 'topicmodels' package in R -- after determining the number of topics (e.g., with the 'ldatuning' R package), run the 'LDA' function in the 'topicmodels' package. Then compute topic probabilities per restaurant (with the 'posterior' function in the package), which can be used as predictors, and conduct prediction with regression.

Hierarchical topic model (HDP): 'gensimr' R package -- 'model_hdp' function for identifying topics (see https://radimrehurek.com/gensim/models/hdpmodel.html or https://gensimr.news-r.org/).

Supervised topic model: 'lda' R package -- 'slda.em' function for training and 'slda.predict' for prediction.

Aggregate regression: 'lm' default function in R.

Latent class regression without variable selection: 'flexmix' function in the 'flexmix' R package. Run flexmix with a certain number of segments (e.g., 3 segments in this study). Then, with the estimated coefficients and memberships, conduct prediction of the dependent variable for each segment.

Latent class regression with variable selection: 'Unconstraind_Bayes_Mixture' function in Kim, Fong and DeSarbo (2012)'s package. Run the Kim et al. (2012) model with a certain number of segments (e.g., 3 segments in this study). Then, with the estimated coefficients and memberships, conduct prediction of the dependent variable for each segment. The same R package ('KimFongDeSarbo2012.zip') can be downloaded at: https://sites.google.com/scarletmail.rutgers.edu/r-code-packages/home

5. Instructions for replicating Professor ratings review study.txt -- [see Tables G-1, G-2, G-4 and G-5, and Figures G-1 and H-2 in Web Appendix]: Code to replicate the professor ratings reviews study. Computing time is approximately 10 hours.

[A list of the versions of R, packages, and computer...
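As referenced above, here is a minimal, hedged R sketch of two of the Table 6 benchmarks. The objects dtm (a document-term matrix), review_df, and star_rating are hypothetical names standing in for the prepared DV/IV data described in files 3-a and 4-a; the number of topics and segments must be chosen as described in the guidelines.

```r
# Minimal sketch of two Table 6 benchmarks (dtm, review_df, star_rating are assumed objects).
library(topicmodels)
library(flexmix)

# Unsupervised topic model: fit LDA, then use per-restaurant topic probabilities as predictors.
k <- 10                               # number of topics, e.g., chosen with 'ldatuning'
lda_fit <- LDA(dtm, k = k)
theta <- posterior(lda_fit)$topics    # per-document topic probabilities
agg_reg <- lm(star_rating ~ theta)    # prediction with regression on topic shares

# Latent class regression without variable selection: flexmix with, e.g., 3 segments.
mix_fit <- flexmix(star_rating ~ ., data = review_df, k = 3)
parameters(mix_fit)                   # estimated segment-level coefficients
clusters(mix_fit)                     # segment memberships
```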
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset includes six files. Two of them (EcoProvinceCentroids_WGS84. and WorldOceans_Canals_fixed.) are datasets produced by external organizations and available online in public portals (https://data.unep-wcmc.org/datasets/38 and https://maps.princeton.edu/catalog/stanford-ds959rc3867, respectively). These two shapefiles are the input files used in the R script VoyRisk_trade_connections_script.R, which uses functions located in VoyRisk_localFunctions.R, to create simulated maritime paths between each pair of the 62 Marine Ecoregions of the World (MEOW) ecoprovinces (Spalding et al. 2007). The other four files (ecoprovDistanceTable.csv, tr1_.RData, tr1C.RData, WorldOceans_fixed_raster_005res.tif) are intermediate files produced by the R script.

References:
Schattschneider, Jessica; Floerl, Lisa; Casanovas, Paula (2022): VoyRisk distance analysis code. The University of Auckland. Software. DOI: https://doi.org/10.17608/k6.auckland.21368874
Spalding, M.D., Fox, H.E., Allen, G.R., Davidson, N., Ferdaña, Z.A., Finlayson, M., Halpern, B.S., Jorge, M.A., Lombana, A., Lourie, S.A., Martin, K.D., McManus, E., Molnar, J., Recchia, C.A., and Robertson, J. (2007). Marine Ecoregions of the World: a bioregionalization of coast and shelf areas. BioScience 57: 573-583. DOI: http://dx.doi.org/10.1641/B570707
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Overview
This dataset is the repository for the following paper submitted to Data in Brief:
Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).
The Data in Brief article contains the supplementary information and is the related data paper to:
Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).
Description/abstract
The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and, currently, the escalation of the so-called Israeli-Palestinian conflict, which has strained neighbouring countries like Jordan through the influx of Syrian refugees and has increased the population's vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.
Folder structure
The main folder after download contains all data; the following subfolders are stored as zipped files:
“code” stores the 9 code chunks described below, used to read, extract, process, analyse, and visualize the data.
“MODIS_merged” contains the 16-day, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.
“mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).
“yield_productivity” contains .csv files of yield information for all countries listed above.
“population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).
“GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.
“built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolders, which contain the raw data and the already processed data. “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5-year intervals, e.g., “Levant_built_up_1975.tif”.
Code structure
1_MODIS_NDVI_hdf_file_extraction.R
This is the first code chunk and refers to the extraction of MODIS data from the .hdf file format. The following packages must be installed and the raw data must be downloaded using a simple mass downloader, e.g., from Google Chrome. Packages: terra. Download MODIS data after registration from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed 9th of October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif file with the indication “NDVI”. Because the study area is quite large, we have to load three spatially distinct time series and merge them later. Note that the time series are temporally consistent.
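A minimal sketch of this extraction step, assuming terra can read the .hdf subdatasets via GDAL (the folder path and file pattern are assumptions):

```r
library(terra)

# List the downloaded .hdf files (path is an assumption)
files <- list.files("your_directory_MODIS", pattern = "\\.hdf$", full.names = TRUE)

for (f in files) {
  r <- rast(f)                          # read all subdatasets of the .hdf file
  ndvi <- r[[grep("NDVI", names(r))]]   # keep only the NDVI subdataset
  writeRaster(ndvi,
              sub("\\.hdf$", "_NDVI.tif", basename(f)),
              overwrite = TRUE)
}
```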
2_MERGE_MODIS_tiles.R
In this code, we load and merge the three different stacks to produce a large and consistent time series of NDVI imagery across the study area. We further use the package gtools to load the files in numerical order (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks, from which we merge the first two (stack 1, stack 2) and store them. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").
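A hedged sketch of the merge step (the per-tile folder names are assumptions; terra's merge() combines adjacent tiles):

```r
library(terra)
library(gtools)

# Load the per-tile NDVI files in numerical order (folder names are assumptions)
f1 <- mixedsort(list.files("h20v05", pattern = "NDVI.*\\.tif$", full.names = TRUE))
f2 <- mixedsort(list.files("h21v05", pattern = "NDVI.*\\.tif$", full.names = TRUE))
f3 <- mixedsort(list.files("h21v06", pattern = "NDVI.*\\.tif$", full.names = TRUE))

setwd("your directory_MODIS/merged")  # create the "merged" folder first
for (i in seq_along(f1)) {
  m12  <- merge(rast(f1[i]), rast(f2[i]))  # merge stack 1 and stack 2
  m123 <- merge(m12, rast(f3[i]))          # then merge with stack 3
  writeRaster(m123, sprintf("NDVI_final_%d.tif", i), overwrite = TRUE)
}
```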
3_CROP_MODIS_merged_tiles.R
Now we want to crop the derived MODIS tiles to our study area. We are using a mask, which is provided as a .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif”. We have now produced single cropped NDVI time series data from MODIS.
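A hedged sketch of the crop-and-mask step (paths are assumptions):

```r
library(terra)
library(gtools)

v <- vect("MERGED_LEVANT.shp")  # study-area mask from the repository
files <- mixedsort(list.files("your directory_MODIS/merged",
                              pattern = "^NDVI_final_.*\\.tif$", full.names = TRUE))
for (i in seq_along(files)) {
  r <- mask(crop(rast(files[i]), v), v)  # crop to extent, then mask to outline
  writeRaster(r, sprintf("NDVI_merged_clip_%d.tif", i), overwrite = TRUE)
}
```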
The repository provides the already clipped and merged NDVI datasets.
4_TREND_analysis_NDVI.R
Now we want to perform trend analysis on the derived data. The data we load are tricky, as they come at a 16-day return interval across each year for a period of 22 years. Growing season sums contain MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and characterize all values with a high confidence level (0.05). Using the ggplot2 package and the melt function from the reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) with a value of 0.3.
To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.
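A minimal sketch of the per-pixel trend test and the z-score normalization described above (season_sums and annual_values are assumed objects, not names from the repository code):

```r
library(terra)

# season_sums: one layer per year of growing-season NDVI sums (assumed object)
years <- 2001:2022
slope_fun <- function(v) {
  if (all(is.na(v))) return(c(NA, NA))
  s <- summary(lm(v ~ years))$coefficients
  c(s[2, "Estimate"], s[2, "Pr(>|t|)"])   # slope and p-value per pixel
}
trend <- app(season_sums, slope_fun)      # layer 1: slope, layer 2: p-value
high_conf <- trend[[2]] < 0.05            # keep trends at the 0.05 level

# z-scores: deviation of annual values from the long-term mean
z <- (annual_values - mean(annual_values, na.rm = TRUE)) /
     sd(annual_values, na.rm = TRUE)
```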
5_BUILT_UP_change_raster.R
Let us look at the landcover changes now. We are working with the terra package and get raster data from: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 3 March 2023, 100 m resolution, global coverage). Here, one can download the temporal coverage that is aimed for and reclassify it using the code after cropping to the individual study area. Here, I summed up the different rasters to characterize the built-up change in continuous values between 1975 and 2022.
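A hedged sketch of the summing step (the epoch list is an assumption; the derived files follow the “Levant_built_up_<year>.tif” pattern described above):

```r
library(terra)

# Epoch list is an assumption; adjust to the epochs actually downloaded
epochs <- c(seq(1975, 2020, by = 5), 2022)
layers <- rast(sprintf("Levant_built_up_%d.tif", epochs))
built_up_change <- sum(layers)  # continuous built-up change signal, 1975-2022
writeRaster(built_up_change, "Levant_built_up_change_1975_2022.tif", overwrite = TRUE)
```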
6_POPULATION_numbers_plot.R
For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.
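A minimal ggplot sketch (the column names are assumptions to be adjusted to the actual .csv header):

```r
library(ggplot2)

pop <- read.csv("Socio_cultural_political_development_database_FAO2023.csv")
# Column names are assumptions; adjust to the actual header
ggplot(pop, aes(x = Year, y = Population, colour = Country)) +
  geom_line()
```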
7_YIELD_plot.R
In this section, we are using the country productivity data from the supplement in the repository “yield_productivity” (e.g., "Jordan_yield.csv"). Each of the single-country yield datasets is plotted with ggplot and combined using the patchwork package in R.
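A hedged sketch of the per-country plotting and patchwork composition (column names are assumptions):

```r
library(ggplot2)
library(patchwork)

# Column names are assumptions; adjust to the actual headers
jordan <- read.csv("yield_productivity/Jordan_yield.csv")
p1 <- ggplot(jordan, aes(Year, Yield)) + geom_line() + ggtitle("Jordan")
# ...build p2, p3, ... for the other countries in the same way, then combine:
# (p1 | p2) / p3
```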
8_GLDAS_read_extract_trend
The last code provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed at https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9th of October 2023). The raw data come in .nc file format, and various variables can be extracted using the [“^a variable name”] command from the SpatRaster collection. Each time you run the code, this variable name must be adjusted to meet the requirements for the variables (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 9th of October 2023; or the respective code chunk when reading a .nc file with the ncdf4 package in R), or run print(nc) from the code, or use names() on the SpatRaster collection.
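A minimal sketch of that variable extraction (the file name and variable regex are assumptions):

```r
library(terra)

nc <- sds("GLDAS_NOAH025_M.nc")   # SpatRaster collection; file name is an assumption
names(nc)                         # list the variable names in the collection
rain <- nc["^Rainf"]              # extract one variable by regex, as described above
```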
Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
From the processed data, trend analyses are conducted and z-scores are calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. For, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subsets can be calculated as described in the code. Here, 3-month subsets were chosen for the growing seasons, e.g., March-May (MAM), June-August (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year).
From the data, mean values of 48 consecutive years are calculated and trend analyses are performed as described above. In the same way, p-values are extracted and values at the 95% confidence level are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, and different spatial extents across the globe due to the availability of the GLDAS variables.
This data release contains the input-data files and R scripts associated with the analysis presented in [citation of manuscript]. The spatial extent of the data is the contiguous U.S. The input-data files include one comma separated value (csv) file of county-level data and one csv file of city-level data.

The county-level csv (“county_data.csv”) contains data for 3,109 counties. This data includes two measures of water use, descriptive information about each county, three grouping variables (climate region, urban class, and economic dependency), and 18 explanatory variables: proportion of population growth from 2000-2010, fraction of withdrawals from surface water, average daily water yield, mean annual maximum temperature from 1970-2010, 2005-2010 maximum temperature departure from the 40-year maximum, mean annual precipitation from 1970-2010, 2005-2010 mean precipitation departure from the 40-year mean, Gini income disparity index, percent of county population with at least some college education, Cook Partisan Voting Index, housing density, median household income, average number of people per household, median age of structures, percent of renters, percent of single family homes, percent apartments, and a numeric version of urban class.

The city-level csv (city_data.csv) contains data for 83 cities. This data includes descriptive information for each city, water-use measures, one grouping variable (climate region), and 6 explanatory variables: type of water bill (increasing block rate, decreasing block rate, or uniform), average price of water bill, number of requirement-oriented water conservation policies, number of rebate-oriented water conservation policies, aridity index, and regional price parity.

The R scripts construct fixed-effects and Bayesian hierarchical regression models. The primary difference between these models relates to how they handle possible clustering in the observations that define unique water-use settings. Fixed-effects models address possible clustering in one of two ways. In a "fully pooled" fixed-effects model, any clustering by group is ignored, and a single, fixed estimate of the coefficient for each covariate is developed using all of the observations. Conversely, in an unpooled fixed-effects model, separate coefficient estimates are developed using only the observations in each group. A hierarchical model provides a compromise between these two extremes. Hierarchical models extend single-level regression to data with a nested structure, whereby the model parameters vary at different levels in the model, including a lower level that describes the actual data and an upper level that influences the values taken by parameters in the lower level. The county-level models were compared using the Watanabe-Akaike information criterion (WAIC), which is derived from the log pointwise predictive density of the models and can be shown to approximate out-of-sample predictive performance.

All script files are intended to be used with R statistical software (R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org) and Stan probabilistic modeling software (Stan Development Team. 2017. RStan: the R interface to Stan. R package version 2.16.2. http://mc-stan.org).
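As a rough illustration of the pooling spectrum described above, one could write the following sketch; the column names (water_use, housing_density, climate_region) are hypothetical, and the actual scripts use Stan/RStan rather than these convenience wrappers:

```r
# Hedged sketch of fully pooled, unpooled, and hierarchical fits
# (column names are assumptions, not those of the released csv files).
library(rstanarm)

d <- read.csv("county_data.csv")

# Fully pooled: one fixed coefficient, clustering ignored
pooled <- lm(water_use ~ housing_density, data = d)

# Unpooled: separate coefficients estimated from each group's observations only
unpooled <- lapply(split(d, d$climate_region),
                   function(g) lm(water_use ~ housing_density, data = g))

# Hierarchical: group-level coefficients drawn from an upper-level distribution
hier <- stan_lmer(water_use ~ housing_density + (1 + housing_density | climate_region),
                  data = d)
waic(hier)  # compare models via WAIC, as in the analysis described above
```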
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
The Phantom of Bern: repeated scans of two volunteers with eight different combinations of MR sequence parameters
The Phantom of Bern consists of eight same-session re-scans of T1-weighted MRI with different combinations of sequence parameters, acquired on two healthy subjects. The subjects have agreed in writing to the publication of these data, including the original anonymized DICOM files, waiving the requirement of defacing. Usage is permitted under the terms of the data usage agreement stated below.
The BIDS directory is organized as follows:
└── PhantomOfBern/
├─ code/
│
├─ derivatives/
│ ├─ dldirect_v1-0-0/
│ │ ├─ results/ # Folder with flattened subject/session inputs and outputs of DL+DiReCT
│ │ └─ stats2table/ # Folder with tables summarizing all DL+DiReCT outputs
│ ├─ freesurfer_v6-0-0/
│ │ ├─ results/ # Folder with flattened subject/session inputs and outputs of freesurfer
│ │ └─ stats2table/ # Folder with tables summarizing all freesurfer outputs
│ └─ siena_v2-6/
│ ├─ SIENA_results.csv # Siena's main output
│ └─ ... # Flattened subject/session inputs and outputs of SIENA
│
├─ sourcedata/
│ ├─ POBHC0001/
│ │ └─ 17473A/
│ │ └─ ... # Anonymized DICOM folders
│ └─ POBHC0002/
│ └─ 14610A/
│ └─ ... # Anonymized DICOM folders
│
├─ sub-<label>/
│ └─ ses-<label>/
│ └─ anat/ # Folder with scan's json and nifti files
├─ ...
The dataset can be cited as:
M. Rebsamen, D. Romascano, M. Capiglioni, R. Wiest, P. Radojewski, C. Rummel. The Phantom of Bern:
repeated scans of two volunteers with eight different combinations of MR sequence parameters.
OpenNeuro, 2023.
If you use these data, please also cite the original paper:
M. Rebsamen, M. Capiglioni, R. Hoepner, A. Salmen, R. Wiest, P. Radojewski, C. Rummel. Growing importance
of brain morphometry analysis in the clinical routine: The hidden impact of MR sequence parameters.
Journal of Neuroradiology, 2023.
The Phantom of Bern is distributed under the following terms, to which you agree by downloading and/or using the dataset:
To use these datasets solely for research and development or statistical purposes and not for investigation of specific subjects
To make no use of the identity of any subject discovered inadvertently, and to advise the providers of any such discovery (crummel@web.de)
When publicly presenting any results or algorithms that benefited from the use of the Phantom of Bern, you should acknowledge it, see above. Papers, book chapters, books, posters, oral presentations, and all other printed and digital presentations of results derived from the Phantom of Bern data should cite the publications listed above.
Redistribution of data (complete or in parts) in any manner without explicit inclusion of this data use agreement is prohibited.
Usage of the data for testing commercial tools is explicitly allowed. Usage for military purposes is prohibited.
The original collector and provider of the data (see acknowledgement) and the relevant funding agency bear no responsibility for use of the data or for interpretations or inferences based upon such uses.
This work was supported by the Swiss National Science Foundation under grant numbers 204593 (ScanOMetrics) and CRSII5_180365 (The Swiss-First Study).
https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34GB unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- **LICENSE** - text of GPL v3, under which this dataset is published
- **INSTALL.md** - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on the level of detail (see Step 2 for more details):

- up to 2TB of disk space
- at least 16GB of RAM (64GB preferable)
- a few hours to a few months of processing time

Step 1 - software
-----------------

- unpack **ghd-0.1.0.zip**, or clone from GitLab:

  ```
  git clone https://gitlab.com/user2589/ghd.git
  git checkout 0.1.0
  ```

  `cd` into the extracted folder. All commands below assume it as the current directory.
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`. Without this dependency, you might get an error on the next step, but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`
- disable all APIs except GitHub (Bitbucket and GitLab support were not yet implemented when this study was in progress): edit `scraper/__init__.py` and comment out everything except GitHub support in `PROVIDERS`.

Step 2 - obtaining the dataset
------------------------------

The ultimate goal of this step is to get the output of the Python function `common.utils.survival_data()` and save it into a CSV file:

```python
# copy and paste into a Python console
from common import utils
survival_data = utils.survival_data('pypi', '2008', smoothing=6)
survival_data.to_csv('survival_data.csv')
```

Since full replication will take several months, here are some ways to speed up the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15..30 minutes.

- create a folder `
analyze the current population survey (cps) annual social and economic supplement (asec) with r

the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no.

despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population.

the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database (a hedged import sketch is appended at the end of this post). you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

this new github repository contains three scripts:

2005-2012 asec - download all microdata.R
- download the fixed-width file containing household, family, and person records
- import by separating this file into three tables, then merge 'em together at the person-level
- download the fixed-width file containing the person-level replicate weights
- merge the rectangular person-level file with the replicate weights, then store it in a sql database
- create a new variable - one - in the data table

2012 asec - analysis examples.R
- connect to the sql database created by the 'download all microdata' program
- create the complex sample survey object, using the replicate weights
- perform a boatload of analysis examples

replicate census estimates - 2011.R
- connect to the sql database created by the 'download all microdata' program
- create the complex sample survey object, using the replicate weights
- match the sas output shown in the png file below

2011 asec replicate weight sas output.png - statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document. click here to view these three scripts

for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:
- the census bureau's current population survey page
- the bureau of labor statistics' current population survey page
- the current population survey's wikipedia article

notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.

confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
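as promised above, a hedged sketch of that import step - the sas script and data file names below are placeholders, not the exact ones the repository scripts use:

```r
# parse the nber sas importation script to recover the fixed-width layout,
# then read the march 2012 file into r (file names are placeholders)
library(SAScii)
layout <- parse.SAScii("cpsmar2012.sas")                      # columns, widths, decimals
asec <- read.SAScii("asec2012_pubuse.dat", "cpsmar2012.sas")  # person-level data frame
```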
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset includes a series of R scripts required to carry out some of the practical exercises in the book “Land Use Cover Datasets and Validation Tools”, available in open access.
The scripts have been designed within the context of the R Processing Provider, a plugin that integrates the R processing environment into QGIS. For all the information about how to use these scripts in QGIS, please refer to Chapter 1 of the book referred to above.
The dataset includes 15 different scripts, which implement the calculation of different metrics in QGIS.
Descriptions of all these methods can be found in different chapters of the aforementioned book.
The dataset also includes a readme file listing all the scripts provided, detailing their authors and the references on which their methods are based.
These data support poscrptR (Wright et al. 2021). poscrptR is a shiny app that predicts the probability of post-fire conifer regeneration for fire data supplied by the user. The predictive model was fit using presence/absence data collected in 4.4 m radius plots (60 square meters). Please refer to Stewart et al. (2020) for more details concerning field data collection, the model fitting process, and limitations. Learn more about shiny apps at https://shiny.rstudio.com.

The app is designed to simplify the process of predicting post-fire conifer regeneration under different precipitation and seed production scenarios. The app requires the user to upload two input data sets: 1. a raster of Relativized differenced Normalized Burn Ratio (RdNBR), and 2. a .zip folder containing a fire perimeter shapefile. The app was designed to use Rapid Assessment of Vegetation Condition (RAVG) data inputs. The RAVG website (https://fsapps.nwcg.gov/ravg) has both RdNBR and fire perimeter data sets available for all fires with at least 1,000 acres of National Forest land from 2007 to the present. The fire perimeter must be a zipped shapefile (.zip file; include all shapefile components: .cpg, .dbf, .prj, .sbn, .sbx, .shp, and .shx). The RdNBR raster must be 30 m resolution, and both the RdNBR raster and the fire perimeter must use the USA Contiguous Albers Equal Area Conic coordinate reference system (USGS version). The RdNBR raster must be aligned (same origin) with the RAVG raster data.

References: Stewart, J., van Mantgem, P., Young, D., Shive, K., Preisler, H., Das, A., Stephenson, N., Keeley, J., Safford, H., Welch, K., Thorne, J., 2020. Effects of postfire climate and seed availability on postfire conifer regeneration. Ecological Applications. Wright, M.C., Stewart, J.E., van Mantgem, P.J., Young, D.J., Shive, K.L., Preisler, H.K., Das, A.J., Stephenson, N.L., Keeley, J.E., Safford, H.D., Welch, K.R., and Thorne, J.H. 2021. poscrptR. R package version 0.1.3.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Identifying causal relations from time series is the first step to understanding the behavior of complex systems. Although many methods have been proposed, few papers have applied multiple methods together to detect causal relations based on time series generated from coupled nonlinear systems with some unobserved parts. Here we propose the combined use of three methods and a majority vote to infer causality under such circumstances. Two of these methods are proposed here for the first time, and all three can be applied even if the underlying dynamics is nonlinear and there are hidden common causes. We test our methods with coupled logistic maps, coupled Rössler models, and coupled Lorenz models. In addition, we show from ice core data how the causal relations among the temperature, the CH4 level, and the CO2 level in the atmosphere changed over the last 800,000 years, a conclusion also supported by irregularly sampled data analysis. Moreover, these methods show how three regions of the brain interact with each other during a visually cued, two-choice arm reaching task. In particular, we demonstrate that there are bottom-up influences at the beginning of the task, while there exist mutual influences between the posterior medial prefrontal cortex and the presupplementary motor area. Based on our results, we conclude that identifying causality with an appropriate ensemble of multiple methods ensures the validity of the obtained results more firmly.
This data provides results from field analyses, from the California Environmental Data Exchange Network (CEDEN). The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.
Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data. Example R code using the API to access data across all years can be found here.
Users who want to manually download more specific subsets of the data can also use the CEDEN query tool, at: https://ceden.waterboards.ca.gov/AdvancedQueryTool
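A minimal sketch of API access, assuming a CKAN-style datastore endpoint on the hosting portal (the URL and resource_id are placeholders, not verified values):

```r
library(httr)
library(jsonlite)

# Endpoint and resource_id are placeholders; substitute the values for one year's dataset
res <- GET("https://data.ca.gov/api/3/action/datastore_search",
           query = list(resource_id = "<resource-id-for-one-year>", limit = 5000))
records <- fromJSON(content(res, as = "text"))$result$records
```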
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Our study organizes iEEG and EEG data from 5 centers, with a total of 100 subjects. We publish 4 centers' datasets here due to data sharing issues.
Acquisitions include ECoG and SEEG. Each run specifies a different snapshot of EEG data from that specific subject's session. For seizure sessions, this means that each run is an EEG snapshot around a different seizure event.
For additional clinical metadata about each subject, refer to the clinical Excel table in the publication.
NIH, JHH, UMMC, and UMF agreed to share their data. Cleveland Clinic did not, so access to its data requires an additional DUA.

All data except Cleveland Clinic's were approved by the respective centers to be de-identified and shared. All data in this dataset have no PHI or other identifiers associated with the patients. In order to access Cleveland Clinic data, please forward all requests to Amber Sours, SOURSA@ccf.org:
Amber Sours, MPH Research Supervisor | Epilepsy Center Cleveland Clinic | 9500 Euclid Ave. S3-399 | Cleveland, OH 44195 (216) 444-8638
You will need to sign a data use agreement (DUA).
For each subject, there was a raw EDF file, which was converted into the BrainVision format with mne_bids.

Each subject with SEEG implantation also has an Excel table, called electrode_layout.xlsx, which outlines where the clinicians marked each electrode anatomically. Note that there is no rigorous atlas applied, so the main points of interest are: WM, GM, VENTRICLE, CSF, and OUT, which represent white matter, gray matter, ventricle, cerebrospinal fluid, and outside the brain. The WM, VENTRICLE, CSF, and OUT channels were removed from further analysis and were labeled in the corresponding BIDS channels.tsv sidecar file as status=bad.

The dataset uploaded to openneuro.org does not contain the sourcedata, since there was an extra anonymization step that occurred when fully converting to BIDS.
Derivatives include:
* fragility analysis
* frequency analysis
* graph metrics analysis
* figures
These can be computed by following this paper: Neural Fragility as an EEG Marker for the Seizure Onset Zone
Within each EDF file, there are event markers annotated by clinicians, which may inform you of specific clinical events occurring in time, or of when they saw seizure onset and offset (clinical and electrographic).
During a seizure event, specifically event markers may follow this time course:
* eeg onset, or clinical onset - the onset of a seizure that is either marked electrographically, or by clinical behavior. Note that the clinical onset may not always be present, since some seizures manifest without clinical behavioral changes.
* Marker/Mark On - these are annotations present in some cases, where a health practitioner injects a chemical marker for use in ICTAL SPECT imaging after a seizure occurs. This is commonly done to see which portions of the brain are metabolically active.
* Marker/Mark Off - This is when the ICTAL SPECT stops imaging.
* eeg offset, or clinical offset - this is the offset of the seizure, as determined either electrographically, or by clinical symptoms.
Other events included may help you understand the time course of each seizure. Note that ICTAL SPECT occurs in all Cleveland Clinic data. Note also that seizure markers are not consistent in their description naming, so one might encode some specific regular-expression rules to consistently capture seizure onset/offset markers across all datasets (see the sketch below). In the case of UMMC data, all onset and offset markers were provided by the clinicians in an Excel sheet instead of via the EDF file, so we added the annotations manually to each EDF file.
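A hedged sketch of such rules in R (the patterns and the annotations vector are assumptions, not the actual clinical labels):

```r
# Regular expressions to normalize inconsistently named seizure markers
# (patterns are assumptions; `annotations` is an assumed character vector).
onset_pat  <- "(?i)(eeg|clinical)[ _-]*(sz|seizure)?[ _-]*onset"
offset_pat <- "(?i)(eeg|clinical)[ _-]*(sz|seizure)?[ _-]*(offset|end)"
onsets  <- annotations[grepl(onset_pat,  annotations, perl = TRUE)]
offsets <- annotations[grepl(offset_pat, annotations, perl = TRUE)]
```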
For various datasets, there are seizures present within the dataset. Generally there is only one seizure per EDF file. When seizures are present, they are marked electrographically (and clinically if present) via standard approaches in the epilepsy clinical workflow.
Clinical onset is simply the manifestation of the seizure as clinical symptoms; sometimes the marker may not be present.

What is actually important in the evaluation of datasets are the clinical annotations of the localization hypotheses of the seizure onset zone.
These generally include:
* early onset: the earliest onset electrodes participating in the seizure that clinicians saw
* early/late spread (optional): the electrodes that showed epileptic spread activity after seizure onset. Not all seizures have spread contacts annotated.
For patients with a post-surgical MRI available, the segmentation process outlined above tells us which electrodes were within the surgically removed brain region. Otherwise, clinicians give us their best estimate of which electrodes were resected/ablated based on their surgical notes.
For surgical patients whose postoperative medical records did not explicitly indicate specific resected or ablated contacts, manual visual inspection was performed to determine the approximate contacts that were located in later resected/ablated tissue. Postoperative T1 MRI scans were compared against post-SEEG implantation CT scans or CURRY coregistrations of preoperative MRI/post SEEG CT scans. Contacts of interest in and around the area of the reported resection were selected individually and the corresponding slice was navigated to on the CT scan or CURRY coregistration. After identifying landmarks of that slice (e.g. skull shape, skull features, shape of prominent brain structures like the ventricles, central sulcus, superior temporal gyrus, etc.), the location of a given contact in relation to these landmarks, and the location of the slice along the axial plane, the corresponding slice in the postoperative MRI scan was navigated to. The resected tissue within the slice was then visually inspected and compared against the distinct landmarks identified in the CT scans, if brain tissue was not present in the corresponding location of the contact, then the contact was marked as resected/ablated. This process was repeated for each contact of interest.
Adam Li, Chester Huynh, Zachary Fitzgerald, Iahn Cajigas, Damian Brusko, Jonathan Jagid, Angel Claudio, Andres Kanner, Jennifer Hopp, Stephanie Chen, Jennifer Haagensen, Emily Johnson, William Anderson, Nathan Crone, Sara Inati, Kareem Zaghloul, Juan Bulacio, Jorge Gonzalez-Martinez, Sridevi V. Sarma. Neural Fragility as an EEG Marker of the Seizure Onset Zone. bioRxiv 862797; doi: https://doi.org/10.1101/862797
Appelhoff, S., Sanderson, M., Brooks, T., Vliet, M., Quentin, R., Holdgraf, C., Chaumon, M., Mikulan, E., Tavabi, K., Höchenberger, R., Welke, D., Brunner, C., Rockhill, A., Larson, E., Gramfort, A. and Jas, M. (2019). MNE-BIDS: Organizing electrophysiological data into the BIDS format and facilitating their analysis. Journal of Open Source Software 4: (1896). https://doi.org/10.21105/joss.01896
Holdgraf, C., Appelhoff, S., Bickel, S., Bouchard, K., D'Ambrosio, S., David, O., … Hermes, D. (2019). iEEG-BIDS, extending the Brain Imaging Data Structure specification to human intracranial electrophysiology. Scientific Data, 6, 102. https://doi.org/10.1038/s41597-019-0105-7
Pernet, C. R., Appelhoff, S., Gorgolewski, K. J., Flandin, G., Phillips, C., Delorme, A., Oostenveld, R. (2019). EEG-BIDS, an extension to the brain imaging data structure for electroencephalography. Scientific Data, 6, 103. https://doi.org/10.1038/s41597-019-0104-8
U.S. Government Works: https://www.usa.gov/government-works
The WIC Infant and Toddler Feeding Practices Study–2 (WIC ITFPS-2) (also known as the “Feeding My Baby Study”) is a national, longitudinal study that captures data on caregivers and their children who participated in the Special Supplemental Nutrition Program for Women, Infants, and Children (WIC) around the time of the child’s birth. The study addresses a series of research questions regarding feeding practices, the effect of WIC services on those practices, and the health and nutrition outcomes of children on WIC. Additionally, the study assesses changes in behaviors and trends that may have occurred over the past 20 years by comparing findings to the WIC Infant Feeding Practices Study–1 (WIC IFPS-1), the last major study of the diets of infants on WIC. This longitudinal cohort study has generated a series of reports. These datasets include data from caregivers and their children during the prenatal period and during the children’s first five years of life (child ages 1 to 60 months). A full description of the study design and data collection methods can be found in Chapter 1 of the Second Year Report (https://www.fns.usda.gov/wic/wic-infant-and-toddler-feeding-practices-st...). A full description of the sampling and weighting procedures can be found in Appendix B-1 of the Fourth Year Report (https://fns-prod.azureedge.net/sites/default/files/resource-files/WIC-IT...).

Processing methods and equipment used: Data in this dataset were primarily collected via telephone interview with caregivers. Children’s length/height and weight data were objectively collected while at the WIC clinic or during visits with healthcare providers. The study team cleaned the raw data to ensure the data were as correct, complete, and consistent as possible.

Study date(s) and duration: Data collection occurred between 2013 and 2019.

Study spatial scale (size of replicates and spatial scale of study area): Respondents were primarily the caregivers of children who received WIC services around the time of the child’s birth. Data were collected from 80 WIC sites across 27 State agencies.

Level of true replication: Unknown.

Sampling precision (within-replicate sampling or pseudoreplication): This dataset includes sampling weights that can be applied to produce national estimates. A full description of the sampling and weighting procedures can be found in Appendix B-1 of the Fourth Year Report (https://fns-prod.azureedge.net/sites/default/files/resource-files/WIC-IT...).

Level of subsampling (number and repeat or within-replicate sampling): A full description of the sampling and weighting procedures can be found in Appendix B-1 of the Fourth Year Report (https://fns-prod.azureedge.net/sites/default/files/resource-files/WIC-IT...).

Study design (before–after, control–impacts, time series, before–after-control–impacts): Longitudinal cohort study.

Description of any data manipulation, modeling, or statistical analysis undertaken: Each entry in the dataset contains caregiver-level responses to telephone interviews. Also available in the dataset are children’s length/height and weight data, which were objectively collected while at the WIC clinic or during visits with healthcare providers. In addition, the file contains derived variables used for analytic purposes. The file also includes weights created to produce national estimates. The dataset does not include any personally-identifiable information for the study children and/or for individuals who completed the telephone interviews.

Description of any gaps in the data or other limiting factors: Please refer to the series of annual WIC ITFPS-2 reports (https://www.fns.usda.gov/wic/infant-and-toddler-feeding-practices-study-2-fourth-year-report) for detailed explanations of the study’s limitations.

Outcome measurement methods and equipment used: The majority of outcomes were measured via telephone interviews with children’s caregivers. Dietary intake was assessed using the USDA Automated Multiple Pass Method (https://www.ars.usda.gov/northeast-area/beltsville-md-bhnrc/beltsville-h...). Children’s length/height and weight data were objectively collected while at the WIC clinic or during visits with healthcare providers.

Resources in this dataset:
- ITFP2 Year 5 Enroll to 60 Months Public Use Data CSV. File Name: itfps2_enrollto60m_publicuse.csv
- ITFP2 Year 5 Enroll to 60 Months Public Use Data Codebook. File Name: ITFPS2_EnrollTo60m_PUF_Codebook.pdf
- ITFP2 Year 5 Enroll to 60 Months Public Use Data SAS SPSS STATA R Data. File Name: ITFP@_Year5_Enroll60_SAS_SPSS_STATA_R.zip
- ITFP2 Year 5 Ana to 60 Months Public Use Data CSV. File Name: ampm_1to60_ana_publicuse.csv
- ITFP2 Year 5 Tot to 60 Months Public Use Data Codebook. File Name: AMPM_1to60_Tot Codebook.pdf
- ITFP2 Year 5 Ana to 60 Months Public Use Data Codebook. File Name: AMPM_1to60_Ana Codebook.pdf
- ITFP2 Year 5 Ana to 60 Months Public Use Data SAS SPSS STATA R Data. File Name: ITFP@_Year5_Ana_60_SAS_SPSS_STATA_R.zip
- ITFP2 Year 5 Tot to 60 Months Public Use Data CSV. File Name: ampm_1to60_tot_publicuse.csv
- ITFP2 Year 5 Tot to 60 Months Public Use SAS SPSS STATA R Data. File Name: ITFP@_Year5_Tot_60_SAS_SPSS_STATA_R.zip
- ITFP2 Year 5 Food Group to 60 Months Public Use Data CSV. File Name: ampm_foodgroup_1to60m_publicuse.csv
- ITFP2 Year 5 Food Group to 60 Months Public Use Data Codebook. File Name: AMPM_FoodGroup_1to60m_Codebook.pdf
- ITFP2 Year 5 Food Group to 60 Months Public Use SAS SPSS STATA R Data. File Name: ITFP@_Year5_Foodgroup_60_SAS_SPSS_STATA_R.zip
- WIC Infant and Toddler Feeding Practices Study-2 Data File Training Manual. File Name: WIC_ITFPS-2_DataFileTrainingManual.pdf
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Generated from raw data by MNE-BIDS (Appelhoff et al., 2019) and custom code to join to behavioural data, stimulus information, and metadata.
For full details on this dataset, see our preprint: (url here once out)
An issue during recording meant that sub-05 completed the first block without data being saved. The experiment was restarted from the beginning for this participant. This participant was not included in our analyses, but the data are included in this dataset. They are also identified with the recording_restarted field in participants.tsv.

A separate issue during recording meant that EEG data for some trials were lost for sub-01, though enough trials were recorded in total to meet our criteria for inclusion in the analysis. The raw data comprised two separate recordings. In this dataset, the two recordings are concatenated end-to-end into one file. The point at which the files are joined is marked with a boundary event. This participant is identified with the recording_interrupted field in participants.tsv.
During the course of the experiment, we identified an issue with the wiring in one splitter box, which meant that voltages from channels FT7 and FC3 were swapped in the raw recorded data. We elected to keep the wiring as it was for the duration of the experiment, and then swapped the data from the two channels in the code that generated this BIDS dataset. This means that this issue has been corrected in this BIDS version of the data.
"BAD" periods (MNE term) for key presses and break periods are included in the events files.
Recording dates/times have been anonymised by shifting all recordings backwards in time by a constant number of days (same constant for all participants). This obscures information that may be used to identify participants, but preserves time-of-day information, and the relative times elapsed between different recordings.
Appelhoff, S., Sanderson, M., Brooks, T., Vliet, M., Quentin, R., Holdgraf, C., Chaumon, M., Mikulan, E., Tavabi, K., Höchenberger, R., Welke, D., Brunner, C., Rockhill, A., Larson, E., Gramfort, A. and Jas, M. (2019). MNE-BIDS: Organizing electrophysiological data into the BIDS format and facilitating their analysis. Journal of Open Source Software 4: (1896). https://doi.org/10.21105/joss.01896
Pernet, C. R., Appelhoff, S., Gorgolewski, K. J., Flandin, G., Phillips, C., Delorme, A., Oostenveld, R. (2019). EEG-BIDS, an extension to the brain imaging data structure for electroencephalography. Scientific Data, 6, 103. https://doi.org/10.1038/s41597-019-0104-8
MIT License: https://opensource.org/licenses/MIT
Our datasets, both the released set and the evaluation set, are derived from the YFCC100M Dataset. Each dataset comprises vectors encoded from images using the CLIP model, which are then reduced to 100 dimensions using Principal Component Analysis (PCA). Additionally, categorical and timestamp attributes are selected from the metadata of the images. The categorical attribute is discretized into integers starting from 0, and the timestamp attribute is normalized into floats between 0 and 1.
For each query, a query type is randomly selected from four possible types, denoted by the numbers 0 to 3. Then, we randomly choose two data points from dataset D, utilizing their categorical attribute (C), timestamp attribute (T), and vectors to determine the values of the query. Specifically:
We ensure that at least 100 data points in D satisfy the query constraints.
Dataset D is in a binary format, beginning with a 4-byte integer num_vectors (uint32_t) indicating the number of vectors. This is followed by data for each vector, stored consecutively, with each vector occupying 102 × sizeof(float32) bytes (102 = 2 + vector_num_dimension), summing up to num_vectors × 102 × sizeof(float32) bytes in total. Specifically, for the 102 dimensions of each vector: the first dimension denotes the discretized categorical attribute C, the second dimension denotes the normalized timestamp attribute T, and the remaining 100 dimensions are the vector.
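A hedged R sketch of reading dataset D in this layout (the file name is an assumption):

```r
# Read dataset D in the binary layout described above (file name is a placeholder).
con <- file("dataset_D.bin", "rb")
n <- readBin(con, "integer", n = 1, size = 4, endian = "little")  # num_vectors (uint32)
dims <- 102                        # 2 attribute dimensions + 100 vector dimensions
m <- matrix(readBin(con, "numeric", n = n * dims, size = 4, endian = "little"),
            ncol = dims, byrow = TRUE)
close(con)
C  <- as.integer(m[, 1])   # discretized categorical attribute
Tt <- m[, 2]               # normalized timestamp attribute
V  <- m[, -(1:2)]          # the 100-dimensional vectors
```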
Query set Q is in a binary format, beginning with a 4-byte integer num_queries (uint32_t) indicating the number of queries. This is followed by data for each query, stored consecutively, with each query occupying 104 × sizeof(float32) bytes (104 = 4 + vector_num_dimension), summing up to num_queries × 104 × sizeof(float32) bytes in total.
The 104-dimensional representation for a query is organized as follows:
There are four types of queries, i.e., the query_type takes values from 0, 1, 2 and 3. The 4 types of queries correspond to:
The predicate for the categorical attribute is an equality predicate, i.e., C=v. And the predicate for the timestamp attribute is a range predicate, i.e., l≤T≤r.
Originally provided at https://dbgroup.cs.tsinghua.edu.cn/sigmod2024/task.shtml?content=datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Additional file 2. Input files needed to recreate the plots in this paper: Tracer output files for three species.
The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
This dataset contains all the scripts used to conduct the uncertainty analysis for the maximum drawdown and time to maximum drawdown at the groundwater receptors in the Clarence-Moreton bioregion and all the resulting posterior predictions. This is described in product 2.6.2 Groundwater numerical modelling (Cui et al. 2016). See History for a detailed explanation of the dataset contents.
This dataset uses the results of the design of experiment runs of the MODFLOW groundwater model of the Clarence-Moreton subregion to train emulators to (a) constrain the prior parameter ensembles into the posterior parameter ensembles and to (b) generate the predictive posterior ensembles of maximum drawdown and time to maximum drawdown. This is described in product 2.6.2 Groundwater numerical modelling (Cui et al. 2016).
A flow chart of the way the various files and scripts interact is provided in CLM_MF_dmax_v02_Flowchart.png (editable version in CLM_MF_dmax_v02_Flowchart.gliffy).
R-script CLM_DoE_Parameters.R creates the set of parameters for the design of experiment in CLM_DoE_Parameters.csv. Each of these parameter combinations is evaluated with the groundwater model (dataset CLM groundwater model V1). Associated with this spreadsheet is file CLM_MF_Parameters.csv, which contains, for each parameter, whether it is included in the sensitivity analysis or tied to another parameter, the initial value and range, the transformation, and the type of prior distribution with its mean and covariance structure.
The results of the design of experiment model runs are summarised in files CLM_MF_dmax_DoE_Predictions.csv, CLM_MF_tmax_DoE_Predictions.csv and CLM_MF_DoE_Observations.csv, which contain the maximum additional drawdown, the time to maximum additional drawdown for each receptor, and the simulated equivalents to observations, respectively. The first two are generated with post-processing scripts in dataset CLM groundwater model V1, while for the last file the additional script CLM_MF_postprocess_riverflux.py is used to summarise the simulated equivalents to the surface water - groundwater exchange flux.
Spreadsheets CLM_MF_dmax_Predictions.csv and CLM_MF_tmax_Predictions.csv capture additional information on each prediction: the name of the prediction, the transformation, the min, max and median of the design of experiment runs, a boolean indicating whether the prediction is to be included in the uncertainty analysis, the layer it is assigned to, and which objective function to use to constrain the prediction.
Spreadsheet CLM_MF_dmax_Observations.csv has additional information on each observation: the name of the observation, a boolean indicating whether to use the observation, the min and max of the design of experiment runs, a metadata statement describing whether the observation is steady state (SS) or transient (TR), and the source of the spatial coordinates (from dataset CLM - Bore water level NSW). It also gives the distance of each bore to the nearest blue line network and the distance to each prediction (both in km).
These files are used in script CLM_MF_SI.py to generate sensitivity indices (based on the Plischke et al. (2013) method) for each group of observations and predictions. These indices are saved in spreadsheets CLM_MF_SI_dmaxL1.csv, CLM_MF_SI_dmaxL2.csv, CLM_MF_SI_dmaxL3.csv, CLM_MF_SI_dmaxL4.csv, CLM_MF_SI_dmaxL6.csv, CLM_MF_SI_hobs.csv, CLM_MF_SI_Qcsg.csv, CLM_MF_SI_objfun.csv.
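For orientation, a minimal R sketch of the density-based (delta) sensitivity measure of Plischke et al. (2013), computed from given data; the bin count, grid size and kernel density settings are illustrative assumptions, not the settings used in CLM_MF_SI.py:

# Given-data delta estimator: compare the conditional distribution of the
# output y within bins of one parameter x against its unconditional
# distribution; delta = 0.5 * E[ integral |f_Y - f_{Y|X}| dy ].
delta_index <- function(x, y, n_bins = 10, n_grid = 512) {
  grid <- seq(min(y), max(y), length.out = n_grid)
  f_y  <- approx(density(y), xout = grid, yleft = 0, yright = 0)$y
  bins <- cut(x, breaks = quantile(x, probs = seq(0, 1, length.out = n_bins + 1)),
              include.lowest = TRUE)
  dlt <- 0
  for (b in levels(bins)) {
    yb <- y[bins == b]
    if (length(yb) < 2) next
    f_b <- approx(density(yb), xout = grid, yleft = 0, yright = 0)$y
    gap <- abs(f_y - f_b)
    # trapezoidal integral of |f_Y - f_{Y|bin}|, weighted by the bin probability
    dlt <- dlt + (length(yb) / length(y)) * sum(diff(grid) * (gap[-1] + gap[-n_grid]) / 2)
  }
  dlt / 2
}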
Script CLM_MF_dmax_ObjFun.py calculates the objective function values for the design of experiment runs. Each prediction in layer 1 has a tailored objective function, which is a weighted sum of the residuals between observations and predictions, with weights based on the distance between observation and prediction (see the sketch below). In addition, there is an objective function for the baseflow and CSG water production rates. The results are stored in CLM_MF_DoE_ObjFun.csv and CLM_MF_ObjFun.csv.
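The distance weighting can be pictured with a small sketch; the inverse-distance form below is an assumption for illustration, not the exact weighting of CLM_MF_dmax_ObjFun.py:

# Distance-weighted objective function for one prediction: a weighted sum
# of squared residuals, with weights decaying with the observation-to-
# prediction distance (in km). The 1/(1 + d) decay is illustrative only.
obj_fun <- function(residuals, dist_km) {
  w <- 1 / (1 + dist_km)
  sum(w * residuals^2) / sum(w)
}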
The latter files are used in scripts CLM_MF_dmax_CreatePosteriorParameters_oo.R and CLM_MF_dmax_CreatePosteriorParameters_gen.R to carry out the Markov chain Monte Carlo sampling of the prior parameter distributions with the Approximate Bayesian Computation methodology described in Cui et al. (2016), by generating and applying emulators for each objective function (see the sketch below). The scripts rely on the scripts in dataset R-scripts for uncertainty analysis v01 and are run on the high-performance computing cluster with batch file CLM_MF_dmax_CreatePosterior.slurm. They result in posterior parameter combinations for each objective function, stored in directory PosteriorParameters with filename convention CLM_MF_dmax_Posterior_Parameters_OO_%i_batch.csv % 1-982. The general posterior parameter distribution (i.e. without the distance-weighted groundwater level observations) is stored in CLM_MF_dmax_Posterior_Parameters_gen_batch1.csv.
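The role of the emulators in the Approximate Bayesian Computation step can be sketched as follows; emulator and threshold stand in for the fitted objective-function emulators and the per-prediction acceptance thresholds of the actual workflow:

# ABC rejection sketch: evaluate an emulator of the objective function on
# prior parameter draws (one draw per row) and keep the draws whose
# emulated objective value falls below the acceptance threshold.
abc_filter <- function(prior_draws, emulator, threshold) {
  of <- apply(prior_draws, 1, emulator)
  prior_draws[of <= threshold, , drop = FALSE]
}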
The same set of spreadsheets is used to test convergence of the emulator performance with script CLM_MF_emulator_convergence.R and batch file CLM_MF_emulator_convergence.slurm to produce spreadsheet CLM_MF_convergence_objfun_qriv.csv.
The posterior parameter distributions are sampled with scripts CLM_MF_dmax_MCsampler_OO_i.R, CLM_MF_dmax_MCsampler_gen_i.R, CLM_MF_tmax_MCsampler_OO_i.R, CLM_MF_tmax_MCsampler_gen_i.R and associated .slurm batch files. Files ending in OO_i.R sample predictions that have a groundwater-level-observation-constrained objective function; files ending in gen_i.R sample predictions that have the general objective function. The scripts create and apply an emulator for each prediction. The emulators and results are stored in directory Emulators. This directory is not part of this dataset but can be regenerated by running the scripts on the high-performance computing cluster.
Script CLM_MF_collate_predictions.csv collates all posterior predictive distributions in spreadsheets CLM_MF_dmax_PosteriorPredictions.csv and CLM_MF_tmax_PosteriorPredictions.csv. These files are further summarised in spreadsheet CLM_MF_dmax_tmax_excprob.csv with script CLM_MF_exc_prob. This spreadsheet contains, for all predictions, the coordinates, layer, number of samples in the posterior parameter distribution, the 5th, 50th and 95th percentiles of dmax and tmax, the probability of exceeding 1 cm and 20 cm of drawdown, the maximum dmax value from the design of experiment and, for the predictions in layer 1, the threshold of the objective function and the acceptance rate.
Bioregional Assessment Programme (2016) CLM MODFLOW Uncertainty Analysis. Bioregional Assessment Derived Dataset. Viewed 10 July 2017, http://data.bioregionalassessments.gov.au/dataset/25e01e3c-7b87-4200-9ef2-5c5405627130.
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements 20131204
Derived From Qld 100K mapsheets - Mount Lindsay
Derived From Qld 100K mapsheets - Helidon
Derived From Qld 100K mapsheets - Ipswich
Derived From CLM - Woogaroo Subgroup extent
Derived From CLM - Interpolated surfaces of Alluvium depth
Derived From CLM - Extent of Logan and Albert river alluvial systems
Derived From CLM - Bore allocations NSW v02
Derived From CLM - Bore allocations NSW
Derived From CLM - Bore assignments NSW and QLD summary tables
Derived From CLM - Geology NSW & Qld combined v02
Derived From CLM - Orara-Bungawalbin bedrock
Derived From CLM16gwl NSW Office of Water_GW licence extract linked to spatial locations_CLM_v3_13032014
Derived From CLM groundwater model hydraulic property data
Derived From CLM - Koukandowie FM bedrock
Derived From GEODATA TOPO 250K Series 3, File Geodatabase format (.gdb)
Derived From NSW Office of Water - National Groundwater Information System 20140701
Derived From CLM - Gatton Sandstone extent
Derived From CLM16gwl NSW Office of Water, GW licence extract linked to spatial locations in CLM v2 28022014
Derived From Bioregional Assessment areas v03
Derived From NSW Geological Survey - geological units DRAFT line work.
Derived From Mean Annual Climate Data of Australia 1981 to 2012
Abstract: The dataset was derived by the Bioregional Assessment Programme from multiple datasets. Links to the parent datasets are given in the Lineage field in this metadata statement, and the History field describes how this dataset was derived. This dataset includes the following parameters, clipped to the BA_SYD extent: mean annual BAWAP (Bureau of Meteorology Australian Water Availability Project) rainfall for 1981 to 2013; mean annual Penman PET (potential evapotranspiration) for 1981 to 2013; and mean annual runoff using the 'Budyko-framework' implementation of Choudhury.
Dataset History: Lineage is as per the BA All mean climate data for Australia, except that the national data has been clipped to the BA_SYD extent. The mean annual rainfall data is created from monthly BAWAP grids, which are created from daily BILO rainfall (Jones et al., 2009). The mean annual Penman PET is created as per Donohue et al. (2010) using the fully physically based Penman formulation of potential evapotranspiration, except that the daily wind speed grids used here were generated with a spline (i.e., ANUSPLIN) as per McVicar et al. (2008), not the TIN as per Donohue et al. (2010). For comprehensive details regarding the generation of some of these datasets (e.g., net radiation, Rn), see Donohue et al. (2009). The mean annual runoff was created as per Donohue et al. (2010); the data represent the runoff expected from the steady-state 'Budyko curve' long-term mean annual water-energy limit approach of Choudhury (1999), using BAWAP precipitation and the Penman potential ET described above (see the sketch after this entry).
References: Choudhury, B.J. (1999) Evaluation of an empirical equation for annual evaporation using field observations and results from a biophysical model. Journal of Hydrology 216, 99-110. Donohue, R.J., McVicar, T.R. and Roderick, M.L. (2009) Generating Australian potential evaporation data suitable for assessing the dynamics in evaporative demand within a changing climate. CSIRO: Water for a Healthy Country Flagship, pp 43. http://www.clw.csiro.au/publications/waterforahealthycountry/2009/wfhc-evaporative-demand-dynamics.pdf Donohue, R.J., McVicar, T.R. and Roderick, M.L. (2010) Assessing the ability of potential evaporation formulations to capture the dynamics in evaporative demand within a changing climate. Journal of Hydrology 386(1-4), 186-197. doi:10.1016/j.jhydrol.2010.03.020 Jones, D.A., Wang, W. and Fawcett, R. (2009) High-quality spatial climate data-sets for Australia. Australian Meteorological and Oceanographic Journal 58(4), 233-248. McVicar, T.R., Van Niel, T.G., Li, L.T., Roderick, M.L., Rayner, D.P., Ricciardulli, L. and Donohue, R.J. (2008) Wind speed climatology and trends for Australia, 1975-2006: Capturing the stilling phenomenon and comparison with near-surface reanalysis output. Geophysical Research Letters 35, L20403. doi:10.1029/2008GL035627
Dataset Citation: Bioregional Assessment Programme (2014) Mean annual climate data clipped to BA_SYD extent. Bioregional Assessment Derived Dataset. Viewed 18 June 2018, http://data.bioregionalassessments.gov.au/dataset/a8393a45-5e86-431b-b504-c0b2953296f4.
Dataset Ancestors: Derived From BILO Gridded Climate Data: Daily Climate Data for each year from 1900 to 2012; Derived From Mean Annual Climate Data of Australia 1981 to 2012
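For reference, the Choudhury (1999) form of the Budyko curve underlying the runoff grids is E = P * PET / (P^n + PET^n)^(1/n), with runoff the long-term water balance residual P - E; a minimal R sketch, where the exponent n = 1.8 is an illustrative calibration value, not necessarily the one used here:

# Budyko-curve (Choudhury 1999) estimate of mean annual actual
# evapotranspiration E, and runoff as the water balance residual P - E.
budyko_runoff <- function(P, PET, n = 1.8) {
  E <- P * PET / (P^n + PET^n)^(1 / n)
  P - E
}

budyko_runoff(P = 900, PET = 1400)   # mean annual values in mm/year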
Overview: ERA5-Land is a reanalysis dataset providing a consistent view of the evolution of land variables over several decades at an enhanced resolution compared to ERA5. ERA5-Land has been produced by replaying the land component of the ECMWF ERA5 climate reanalysis. Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics, producing data that go several decades back in time and provide an accurate description of the climate of the past.
Air temperature (2 m): the temperature of air at 2 m above the surface of land, sea or inland waters, calculated by interpolating between the lowest model level and the Earth's surface, taking account of the atmospheric conditions.
The original ERA5-Land dataset (period: 2000 - 2020) has been reprocessed to aggregate the ERA5-Land hourly data to daily data (minimum, mean, maximum) while increasing the spatial resolution from the native ERA5-Land resolution of 0.1 degree (~9 km) to 30 arc seconds (~1 km) by image fusion with CHELSA data (V1.2) (https://chelsa-climate.org/). For each day the corresponding monthly long-term average of CHELSA was used. The aim was to use the fine spatial detail of CHELSA and at the same time preserve the general regional pattern and fine temporal detail of ERA5-Land. The steps included aggregation and enhancement, specifically: 1. spatially aggregate CHELSA to the resolution of ERA5-Land; 2. calculate the difference of ERA5-Land minus aggregated CHELSA; 3. interpolate the differences with a Gaussian filter to 30 arc seconds; 4. add the interpolated differences to CHELSA (see the sketch after this entry).
Data available are the daily average, minimum and maximum of air temperature (2 m). Spatial resolution: 30 arc seconds (approx. 1000 m). Temporal resolution: daily. Pixel values: °C * 10 (scaled to integer; example: value 238 = 23.8 °C). Software used: GDAL 3.2.2 and GRASS GIS 8.0.0 (r.resamp.stats -w; r.relief). Original ERA5-Land dataset license: https://cds.climate.copernicus.eu/api/v2/terms/static/licence-to-use-copernicus-products.pdf
CHELSA climatologies (V1.2), data used: Karger, D.N., Conrad, O., Böhner, J., Kawohl, T., Kreft, H., Soria-Auza, R.W., Zimmermann, N.E., Linder, H.P., Kessler, M. (2018): Data from: Climatologies at high resolution for the earth's land surface areas. Dryad Digital Repository. http://dx.doi.org/doi:10.5061/dryad.kd1d4. Original peer-reviewed publication: Karger, D.N., Conrad, O., Böhner, J., Kawohl, T., Kreft, H., Soria-Auza, R.W., Zimmermann, N.E., Linder, P., Kessler, M. (2017): Climatologies at high resolution for the Earth's land surface areas. Scientific Data 4, 170122. https://doi.org/10.1038/sdata.2017.122
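The four enhancement steps can be sketched with the terra R package; this is an approximation under stated assumptions (placeholder file names, bilinear resampling standing in for the Gaussian-filter interpolation), as the dataset itself was produced with GDAL and GRASS GIS:

# Delta downscaling of ERA5-Land with CHELSA, following steps 1-4 above.
library(terra)

era5   <- rast("era5land_tmean_day.tif")    # 0.1 degree (~9 km), placeholder name
chelsa <- rast("chelsa_tmean_month.tif")    # 30 arc seconds (~1 km), placeholder name

chelsa_coarse <- resample(chelsa, era5, method = "average")  # 1. aggregate CHELSA
delta      <- era5 - chelsa_coarse                           # 2. ERA5-Land minus aggregate
delta_fine <- resample(delta, chelsa, method = "bilinear")   # 3. interpolate differences
tmean_fine <- chelsa + delta_fine                            # 4. add differences to CHELSA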
This file includes data from the 2002 through 2011 National Survey on Drug Use and Health (NSDUH) survey. The only variables included in the data file are ones that were collected in a comparable manner across one or more of the pair years, i.e., 2002-2003, 2004-2005, 2006-2007, 2008-2009, 2010-2011, or 2012-2013. The National Survey on Drug Use and Health (NSDUH) series (formerly titled National Household Survey on Drug Abuse) primarily measures the prevalence and correlates of drug use in the United States. The surveys are designed to provide quarterly, as well as annual, estimates. Information is provided on the use of illicit drugs, alcohol, and tobacco among members of United States households aged 12 and older. Questions included age at first use as well as lifetime, annual, and past-month usage for the following drug classes: marijuana, cocaine (and crack), hallucinogens, heroin, inhalants, alcohol, tobacco, and nonmedical use of prescription drugs, including pain relievers, tranquilizers, stimulants, and sedatives. The survey covered substance abuse treatment history and perceived need for treatment. The survey included questions concerning treatment for both substance abuse and mental health-related disorders. Respondents were also asked about personal and family income sources and amounts, health care access and coverage, illegal activities and arrest record, problems resulting from the use of drugs, and needle-sharing. Certain questions are asked only of respondents aged 12 to 17. These "youth experiences" items covered a variety of topics, such as neighborhood environment, illegal activities, drug use by friends, social support, extracurricular activities, exposure to substance abuse prevention and education programs, and perceived adult attitudes toward drug use and activities such as school work. Also included are questions on mental health and access to care, perceived risk of using drugs, perceived availability of drugs, driving and personal behavior, and cigar smoking. Demographic information includes gender, race, age, ethnicity, marital status, educational level, job status, veteran status, and current household composition. In the income section, which was interviewer-administered, a split-sample study had been embedded within the 2006 and 2007 surveys to compare a shorter version of the income questions with a longer set of questions that had been used in previous surveys. This shorter version was adopted for the 2008 NSDUH and will be used for future NSDUHs.This study has 1 Data Set.