This dataset was created by Chloe Shearer
These are the variable codes for the datasets released as part of the 2020 decennial census redistricting data.
No description was included in this dataset collected from the OSF.
The data in this archive are in a zipped R binary data format (https://cran.r-project.org/doc/manuals/r-release/R-data.html). These data can be read with the free, open-source statistical software package R (https://www.r-project.org/). The data are organized following the figure numbering in the manuscript, e.g. Figure 1a is fig1a, and carry the same labeling as the figures, including units and variable names. For a full explanation of each figure, please see the captions in the manuscript.
To open this data file, use the following command in R: load('JKelly_NH4NO3_JGR_2018.rdata'). To list the contents of the file, use: ls(). The data for each figure are contained in the data object with the figure's name; to list the data, simply type the name of the figure returned by ls().
The original model output and emissions used for this study are located on the ASM archived storage at /asm/ROMO/finescale/sjv2013. These data are in NetCDF format with self-contained metadata and descriptive headers containing variable names, units, and simulation times.
This dataset is associated with the following publication: Kelly, J., C. Parworth, Q. Zhang, D. Miller, K. Sun, M. Zondlo, K. Baker, A. Wisthaler, J. Nowak, S. Pusede, R. Cohen, A. Weinheimer, A. Beyersdorf, G. Tonnesen, J. Bash, L. Valin, J. Crawford, A. Fried, and J. Walega. Modeling NH4NO3 Over the San Joaquin Valley During the 2013 DISCOVER-AQ Campaign. JOURNAL OF GEOPHYSICAL RESEARCH-ATMOSPHERES. American Geophysical Union, Washington, DC, USA, 123(9): 4727-4745, (2018).
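The load/ls() workflow described above can be sketched end-to-end in R; the toy fig1a object below is an assumption standing in for the real figure data in the archive:

```r
# Sketch of the workflow described above. A toy object stands in for
# the figure data; the real archive already contains the .rdata file.
fig1a <- data.frame(time = 1:3, value = c(0.1, 0.2, 0.3))
save(fig1a, file = "JKelly_NH4NO3_JGR_2018.rdata")
rm(fig1a)

load("JKelly_NH4NO3_JGR_2018.rdata")  # restores the saved objects
ls()                                  # lists them, e.g. "fig1a"
fig1a                                 # print the data for a figure
```

Typing the figure name after load() prints the labeled data object, as described above.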
This dataset includes variable names, variable labels, variable values, and corresponding variable value labels for the IPUMS 1900 datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org.
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook, and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to one crew member of a given film. The file also contains a binary gender attribution assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the scripts for web scraping. They were written in R 3.6.3 for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, via a fuzzy matching approach with two methods: “cosine” and “osa”, where cosine similarity is used to match titles with a high degree of similarity, and the OSA (optimal string alignment) algorithm is used to match titles that may have typos or minor variations.
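The matching idea can be sketched in base R. The scripts themselves use the “cosine” and “osa” string-distance methods; to keep this sketch dependency-free it substitutes adist(), R's built-in generalized edit distance, and the titles are invented:

```r
# Dependency-free sketch of fuzzy title matching. The actual scripts
# use "cosine" and "osa" distances (e.g. via a string-distance package);
# adist() is base R's generalized Levenshtein distance.
core_title  <- "The Farewell"                      # film from the core dataset
imdb_titles <- c("The Farewell",                   # suggested IMDb records
                 "Farewell, My Queen",
                 "The Firewall")

d    <- adist(core_title, imdb_titles, ignore.case = TRUE)
best <- imdb_titles[which.min(d)]
best  # the candidate with the smallest edit distance
```

The real loop additionally scores director and production-year agreement before accepting a match.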
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible duplicates in the dataset and flags them for a manual check.
The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. It does so for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables (such as location, festival name, and festival categories), units of measurement, data sources, coding, and information on missing data.
The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format, i.e. all information for each festival is listed in one row.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was created in order to document self-reported life evaluations among small-scale societies that exist on the fringes of mainstream industrialized societies. The data were produced as part of the LICCI project, through fieldwork carried out by LICCI partners. The data include individual responses to a life satisfaction question, and household asset values. Data from the Gallup World Poll and the World Values Survey are also included, as used for comparison.
TABULAR DATA-SPECIFIC INFORMATION
---------------------------------
1. File name: LICCI_individual.csv
Number of rows and columns: 2814, 7
Variable list:
- User, Site, village: identification of investigator and location
- Well.being.general: numerical score for the life satisfaction question
- HH_Assets_US, HH_Assets_USD_capita: estimated value of representative assets in the household of the respondent, total and per capita (accounting for number of household inhabitants)
2. File name: LICCI_bySite.csv
Number of rows and columns: 19, 8
Variable list:
- Site, N: site name and number of respondents at the site
- SWB_mean, SWB_SD: mean and standard deviation of the life satisfaction score
- HHAssets_USD_mean, HHAssets_USD_sd: site mean and standard deviation of household asset value
- PerCapAssets_USD_mean, PerCapAssets_USD_sd: site mean and standard deviation of per capita asset value
3. File name: gallup_WVS_GDP_pk.csv
Number of rows and columns: 146, 8
Variable list:
- Happiness Score, Whisker-high, Whisker-low: from the Gallup World Poll as documented in the World Happiness Report 2022
- GDP-PPP2017: Gross Domestic Product per capita for year 2020 at PPP (constant 2017 international $). Accessed May 2022.
- pk: produced capital per capita for year 2018 (in 2018 US$) for available countries, as estimated by the World Bank (accessed February 2022)
- WVS7_mean, WVS7_std: results of Question 49 in the World Values Survey, Wave 7
Archived RTC Seabird data - unedited variable names
Date Submitted: 2015-12-24
2019-12-20: A new data file on X-Ray data 'Rontgen_opT10_20191118' was added to the dataset.
2017-10-04: A data file on X-Ray ratings has been added and the variable guide is replaced by a new version (6) with information on this data file. Please note the variable names start with 'RontgT10_' in the data file.
2017-07-12: Due to an error a data file has been replaced. CHECK_T0_DANS_ENG_20151211.sav is now replaced by CHECK_T0_DANS_ENG_20161128.sav.
---
The informed consent statements of the participants are stored at the participating hospitals. The .dta (STATA) and .por (SPSS) files are conversions of the original .sav (SPSS) files.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The AgrImOnIA dataset is a comprehensive dataset relating air quality and livestock (expressed as the density of bovines and swine bred) along with weather and other variables. The AgrImOnIA Dataset represents the first step of the AgrImOnIA project. The purpose of this dataset is to give the opportunity to assess the impact of agriculture on air quality in Lombardy through statistical techniques capable of highlighting the relationship between the livestock sector and air pollutants concentrations.
The building process of the dataset is detailed in the companion paper:
A. Fassò, J. Rodeschini, A. Fusta Moro, Q. Shaboviq, P. Maranzano, M. Cameletti, F. Finazzi, N. Golini, R. Ignaccolo, and P. Otto (2023). Agrimonia: a dataset on livestock, meteorology and air quality in the Lombardy region, Italy. SCIENTIFIC DATA, 1-19.
This dataset is a collection of estimated daily values for a range of measurements across different dimensions: air quality, meteorology, emissions, livestock animals, and land use. Data relate to Lombardy and the surrounding area for 2016-2021, inclusive. The surrounding area is obtained by applying a 0.3° buffer to the Lombardy borders.
The data uses several aggregation and interpolation methods to estimate the measurement for all days.
The files in the record, renamed according to their version (e.g., ..._v_3_0_0), are:
Agrimonia_Dataset.csv (.mat and .Rdata), which is built by joining the daily time series related to the AQ, WE, EM, LI and LA variables. To simplify access to variables in the Agrimonia dataset, each variable name starts with the dimension of the variable, i.e., the names of variables related to the AQ dimension start with 'AQ_'. This file is also archived in formats for the MATLAB and R software.
Metadata_Agrimonia.csv which provides further information about the Agrimonia variables: e.g. sources used, original names of the variables imported, transformations applied.
Metadata_AQ_imputation_uncertainty.csv, which contains the daily uncertainty estimates of the imputed observations for the AQ variables, used to mitigate missing data in the hourly time series.
Metadata_LA_CORINE_labels.csv which contains the label and the description associated with the CLC class.
Metadata_monitoring_network_registry.csv, which contains all details about the AQ monitoring stations used to build the dataset. Information about air quality monitoring stations includes: station type, municipality code, environment type, altitude, pollutants sampled, and more. Each row represents a single sensor.
Metadata_LA_SIARL_labels.csv which contains the label and the description associated with the SIARL class.
AGC_Dataset.csv(.mat and .Rdata) that includes daily data of almost all variables available in the Agrimonia Dataset (excluding AQ variables) on an equidistant grid covering the Lombardy region and its surrounding area.
The Agrimonia dataset can be reproduced using the code available at the GitHub page: https://github.com/AgrImOnIA-project/AgrImOnIA_Data
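The dimension-prefix convention described above (variable names beginning with 'AQ_', 'WE_', etc.) makes it easy to select one group of columns at a time; a minimal sketch with invented column names and values:

```r
# Selecting all air-quality columns via the 'AQ_' prefix convention.
# Column names and values here are invented for illustration only.
df <- data.frame(AQ_nox = 1:2, AQ_pm10 = 3:4, WE_temp = 5:6)

aq_cols <- names(df)[startsWith(names(df), "AQ_")]
df[aq_cols]  # just the AQ dimension
```

The same pattern works for the WE, EM, LI, and LA prefixes.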
UPDATE 31/05/2023 - NEW RELEASE - V 3.0.0
A new version of the dataset is released: Agrimonia_Dataset_v_3_0_0.csv (.Rdata and .mat), where the variables WE_rh_min, WE_rh_mean and WE_rh_max have been recomputed due to some bugs.
In addition, two new columns, LI_pigs_v2 and LI_bovine_v2, are added; they represent the density of pigs and bovines (expressed as animals per square kilometer) within a square of ~10 x 10 km centered at the station location.
A new dataset is released: the Agrimonia Grid Covariates (AGC), which includes daily information for the period from 2016 to 2020 for almost all variables within the Agrimonia Dataset on an equidistant grid covering the Lombardy region and its surrounding area. The AGC does not include AQ variables, as they come from monitoring stations that are irregularly spread over the area considered.
UPDATE 11/03/2023 - NEW RELEASE - V 2.0.2
A new version of the dataset is released: Agrimonia_Dataset_v_2_0_2.csv (.Rdata), where the variable WE_tot_precipitation has been recomputed due to some bugs.
A new version of the metadata is available: Metadata_Agrimonia_v_2_0_2.csv where the spatial resolution of the variable WE_precipitation_t is corrected.
UPDATE 24/01/2023 - NEW RELEASE - V 2.0.1
Minor bug fixed.
UPDATE 16/01/2023 - NEW RELEASE - V 2.0.0
A new version of the dataset is released, Agrimonia_Dataset_v_2_0_0.csv (.Rdata) and Metadata_monitoring_network_registry_v_2_0_0.csv. Some minor points have been addressed:
Added values for LA_land_use variable for Switzerland stations (in Agrimonia Dataset_v_2_0_0.csv)
Deleted incorrect values for LA_soil_use variable for stations outside Lombardy region during 2018 (in Agrimonia Dataset_v_2_0_0.csv)
Fixed duplicate sensors corresponding to the same pollutant within the same station (in Metadata_monitoring_network_registry_v_2_0_0.csv)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset is the repository for the following paper submitted to Data in Brief:
Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).
The Data in Brief article contains the supplement information and is the related data paper to:
Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).
Description/abstract
The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and currently the escalation of the so-called Israeli-Palestinian Conflict, which has strained neighbouring countries like Jordan due to the influx of Syrian refugees and increased population vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.
Folder structure
The main folder after download contains all data; the following subfolders are stored as zipped files:
“code” stores the nine code chunks used to read, extract, process, analyse, and visualize the data.
“MODIS_merged” contains the 16-day, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area (n=510), covering January 2001 to December 2022 plus January and February 2023.
“mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).
“yield_productivity” contains .csv files of yield information for all countries listed above.
“population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).
“GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.
“built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolders: “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built-up datasets at 5-year intervals, e.g., “Levant_built_up_1975.tif”.
Code structure
1_MODIS_NDVI_hdf_file_extraction.R
This is the first code chunk; it extracts MODIS data from the .hdf file format. The following packages must be installed and the raw data must be downloaded using a simple mass downloader, e.g., from Google Chrome. Packages: terra. Download MODIS data after registration from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed 9 October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif file with the indication “NDVI”. Because the study area is quite large, we have to load three spatially different time series and merge them later. Note that the time series are temporally consistent.
2_MERGE_MODIS_tiles.R
In this code, we load and merge the three different stacks to produce a large and consistent time series of NDVI imagery across the study area. We use the gtools package to load the files in natural numeric order (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks; we merge the first two (stack 1, stack 2) and store the result, then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").
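The ordering problem that gtools solves can be reproduced in base R: a plain lexicographic sort puts file 10 before file 2, so the numeric index must be extracted (gtools::mixedsort achieves the same without manual parsing). The file names below are illustrative:

```r
# Lexicographic vs. natural ordering of numbered output files.
files <- c("NDVI_final_10.tif", "NDVI_final_2.tif", "NDVI_final_1.tif")
sort(files)  # lexicographic order - "10" sorts before "2"

# Ordering by the extracted number restores the temporal sequence,
# equivalent to gtools::mixedsort(files):
idx <- as.numeric(gsub("\\D", "", files))
files[order(idx)]  # 1, 2, 10
```

Loading the layers in the wrong order would silently scramble the time series, which is why the natural sort matters here.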
3_CROP_MODIS_merged_tiles.R
Now we want to crop the derived MODIS tiles to our study area. We are using a mask, which is provided as a .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif”. We have now produced single cropped NDVI time series data from MODIS.
The repository provides the already clipped and merged NDVI datasets.
4_TREND_analysis_NDVI.R
Now, we want to perform trend analysis on the derived data. The data we load are tricky, as they contain a 16-day return period across each year for a period of 22 years. Growing season sums cover MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and flag all values significant at the 0.05 level. Using the ggplot2 package and the melt function from the reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS, span 0.3).
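The slope and p-value extraction can be sketched for a single synthetic time series; the real code applies this cell-wise to raster stacks, and the numbers below are invented:

```r
# Trend test for one pixel's annual growing-season NDVI sums
# (synthetic values; the scripts do this per raster cell).
set.seed(1)
years    <- 2001:2022
ndvi_sum <- 5 + 0.02 * (years - 2000) + rnorm(length(years), sd = 0.05)

fit   <- lm(ndvi_sum ~ years)
slope <- coef(fit)["years"]
p_val <- summary(fit)$coefficients["years", "Pr(>|t|)"]
c(slope = unname(slope), p = unname(p_val))
# cells with p < 0.05 are kept as high-confidence trends
```

Per-cell application is typically done with a raster `app()`-style function returning the slope and p-value for every pixel.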
To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.
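The z-score normalization itself is a one-liner; the values below are invented stand-ins for an NDVI or climate series:

```r
# Z-scores: deviation of each value from the series mean,
# in standard-deviation units (invented example values).
x <- c(4.8, 5.1, 5.6, 6.2, 6.9)
z <- (x - mean(x)) / sd(x)
round(z, 2)  # standardized series: mean 0, standard deviation 1
```

Applying the same transformation to NDVI and to each GLDAS variable puts their trends on a common, comparable scale.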
5_BUILT_UP_change_raster.R
Let us look at the landcover changes now. We are working with the terra package and get raster data from: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 3 March 2023; 100 m resolution, global coverage). One can download the temporal coverage that is aimed for and reclassify it using the code after cropping to the individual study area. Here, I summed up different rasters to characterize the built-up change as continuous values between 1975 and 2022.
6_POPULATION_numbers_plot.R
For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.
7_YIELD_plot.R
In this section, we use the country productivity data from the supplement in the repository folder “yield_productivity” (e.g., "Jordan_yield.csv"). Each single-country yield dataset is plotted with ggplot and the plots are combined using the patchwork package in R.
8_GLDAS_read_extract_trend
The last code chunk provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed at https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9 October 2023). The raw data come in .nc file format, and the various variables can be extracted using the [“^a variable name”] command on the spatraster collection. Each time you run the code, this variable name must be adjusted to the variable required (for abbreviations, see https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 9 October 2023; alternatively, see the respective code chunk when reading a .nc file with the ncdf4 package in R, run print(nc), or use names() on the spatraster collection).
Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
From the processed data, trend analyses are conducted and z-scores are calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. Regarding, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subsets can be calculated as described in the code. Here, 3-month subsets were chosen for growing seasons, e.g., March-May (MAM), June-August (JJA), September-November (SON), and December-February (DJF, including Jan/Feb of the consecutive year).
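A minimal sketch of the seasonal subsetting and the sums-versus-means adjustment, using a synthetic monthly series in place of the SpatRaster collection:

```r
# Two years of monthly layers; 'vals' stands in for per-layer means
# extracted from the raster stack (synthetic values).
months <- rep(1:12, times = 2)
vals   <- seq_along(months)

# 3-month growing-season subset, e.g. JJA (June-August):
jja <- vals[months %in% 6:8]
length(jja)  # 3 months x 2 years

# For variables reported as annual sums (e.g. rainfall), skip the /12:
annual_mean <- sum(vals[1:12]) / 12  # appropriate for mean-type variables
annual_sum  <- sum(vals[1:12])       # appropriate for sum-type variables
```

This is the distinction behind removing (or setting to /1) the r.sum=r.sum/12 chunk for rainfall.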
From the data, mean values over 48 consecutive years are calculated and trend analyses are performed as described above. In the same way, p-values are extracted and values at the 95% confidence level are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, and different spatial extents across the globe, thanks to the availability of the GLDAS variables.
The Cohort Hip & Cohort Knee (CHECK) study is a population-based observational multicenter cohort study of 1002 individuals with early symptomatic osteoarthritis (OA) of the knee and/or hip in the Netherlands. The participants were followed for 10 years. The study evaluated clinical, radiographic and biochemical variables in order to establish the course, prognosis and underlying mechanisms of early symptomatic osteoarthritis. The Dutch Arthritis Foundation initiated and funded this inception cohort.
This dataset covers the data collection of baseline (T0) without the variable 'Subject identification number'. Included is a Kellgren-Lawrence radiographic classification covering T0, T2, T5, T8 and T10. Also X-rays of hips and knees at baseline are available. More information on the variables can be found in the documentation. In the description file you can find an overview of the data belonging to this dataset and more information about the format and kind of view of the X-rays.
The complete data are available via three separate datasets, each again containing the baseline T0 data of this current dataset. All SPSS data files of these three datasets include the variable 'Subject identification number'. The X-ray data are not included in the dataset; they are stored outside of EASY. If you wish to use these data, please contact DANS via info@dans.knaw.nl, or consult the X-ray_data_request.pdf document for more information. If you wish to make use of the complete CHECK data, please see the relations for the other CHECK datasets and for the overview 'Thematic collection: CHECK (Cohort Hip & Cohort Knee)'.
Date Submitted: 2015-12-09
2019-12-20: A new data file on X-Ray data 'Rontgen_opT10_20191118' was added to the dataset.
2017-09-19: A data file on X-Ray ratings has been added and the variable guide is replaced by a new version (6) with information on this data file. Please note the variable names start with 'RontgT10_' in the data file.
2017-07-12: Due to an error a data file has been replaced. CHECK_T0_DANS_nsinENG_20151211.sav is now replaced by CHECK_T0_DANS_nsin_ENG_20161128.sav.
---
The informed consent statements of the participants are stored at the participating hospitals. The .dta (STATA) and .por (SPSS) files are conversions of the original .sav (SPSS) files.
A MOCK dataset used to show how to import Qualtrics metadata into the codebook R package.
This table contains variable names, labels, and number of missing values. See the complete codebook for more.
| name | label | n_missing |
|---|---|---|
| ResponseSet | NA | 0 |
| Q7 | NA | 0 |
| Q10 | NA | 0 |
This dataset was automatically described using the codebook R package (version 0.9.5).
This is the raw H2S data: concentration of H2S in parts per million in the biogas. Each sheet (tab) is formatted to be exported as a .csv for use with the R code (AQ-June20.R). For this code to work properly, it is important that this file remain intact: do not change the column names or the codes for data, for example, and to be safe, do not even sort. A single change in the Excel file could break the code.
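The fragility described above (downstream code expecting exact column names) can be guarded against with a quick header check before analysis. A minimal sketch in Python, with hypothetical column names, since the actual sheet layout and the expectations of AQ-June20.R are not documented here:

```python
import csv
import io

# Hypothetical expected header for one exported sheet; the real
# columns expected by AQ-June20.R may differ.
EXPECTED_COLUMNS = ["date", "reactor", "h2s_ppm"]

def check_header(csv_text: str) -> bool:
    """Return True if the exported .csv still has the exact header
    (names and order) that the downstream analysis code expects."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return header == EXPECTED_COLUMNS

print(check_header("date,reactor,h2s_ppm\n2020-06-01,A,312\n"))  # True
print(check_header("Date,Reactor,H2S\n2020-06-01,A,312\n"))      # False
```

Running such a check at the top of the analysis script fails fast on a renamed or reordered column instead of silently producing wrong results.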
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The current version of the CAMELSH dataset contains data for 9,008 basins. Because the total data volume in the repository is approximately 57 GB, which exceeds Zenodo's size limit, we split it across two links. This link contains attributes, shapefiles, and time series data for basins with observed streamflow. The second link (xxx) contains time series data for the 'non-obs' basins. All data are compressed in 7zip format. After extraction, the dataset is organized into the following subfolders:
• The attributes folder contains 28 CSV (comma-separated values) files that store basin attributes (all file names begin with "attributes_") and one Excel file. Of these, the 'attributes_nldas2_climate.csv' file contains nine climate attributes (Table 2) derived from NLDAS-2 data. The 'attributes_hydroATLAS.csv' file includes 195 basin attributes derived from the HydroATLAS dataset. 26 files with names starting with 'attributes_gageii_' contain a total of 439 basin attributes extracted from the GAGES-II dataset; the name of each file represents a distinct group of attributes, as described in Table S.1. The remaining file, named 'Var_description_gageii.xlsx', provides explanatory details regarding the variable names included in the 26 CSV files, with information similar to that presented in Table S.1. The first column in all CSV files, labeled 'STAID', contains the identification (ID) names of the stream gauges. These IDs are assigned by the USGS and are sourced from the original GAGES-II dataset.
• The shapefiles folder contains two sets of shapefiles for the catchment boundary. The first set, CAMELSH_shapefile.shp, is derived from the original GAGES-II dataset and is used to obtain the corresponding climate forcing data for each catchment. The second set, CAMELSH_shapefile_hydroATLAS.shp, includes catchment boundaries derived from the HydroATLAS dataset. Each polygon in both shapefiles contains a field named GAGE_ID, which represents the ID of the stream gauges.
• The timeseries archive (7zip) contains time series data for 3,166 basins with observed streamflow data. Within this 7zip file there are 3,166 NetCDF files, each corresponding to a specific basin; the name of each NetCDF file matches the stream gauge ID. Each file contains an hourly time series from 1980-01-01 00:00:00 to 2024-12-31 23:00:00 for streamflow (denoted as "Streamflow" in the NetCDF file) and 11 climate variables (see Table 1). The streamflow data series includes missing values, which are represented as "NaN". All meteorological forcing data and streamflow records have been standardized to UTC (UTC+0).
• The timeseries_nonobs (7zip) file contains time series data for 5,842 basins without observed streamflow data. The structure of each NetCDF file is similar to the one described above.
• The info.csv file, located in the main directory of the dataset, contains basic information for 9,008 stream stations. This includes the stream gauge ID, the total number of observed hourly data points over 45 years (from 1980 to 2024), and the number of observed hourly data points for each year from 1980 to 2024. Stations with and without observed data are distinguished by the value in the second column, where stations without observed streamflow data have a corresponding value of 0.
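As a quick consistency check on the description above, the expected length of each basin's hourly series (1980-01-01 00:00 to 2024-12-31 23:00, inclusive) can be computed directly. This is only a sketch for users validating an extracted file, not code shipped with the dataset:

```python
from datetime import datetime

# Span of each basin's hourly series as stated in the dataset description.
start = datetime(1980, 1, 1, 0)
end = datetime(2024, 12, 31, 23)

# Number of hourly timesteps, inclusive of both endpoints.
n_hours = int((end - start).total_seconds() // 3600) + 1
print(n_hours)  # 394488
```

A fully populated series should therefore hold 394,488 hourly records; the per-year counts in info.csv can be compared against this total to gauge data completeness for each station.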
a small mock Big Five Inventory dataset
This table contains variable names, labels, and number of missing values. See the complete codebook for more.
name | label | n_missing |
---|---|---|
session | NA | 0 |
created | user first opened survey | 0 |
modified | user last edited survey | 0 |
ended | user finished survey | 0 |
expired | NA | 28 |
BFIK_agree_4R | I can be brusque and dismissive toward others. | 0 |
BFIK_agree_1R | I tend to criticize others. | 0 |
BFIK_neuro_2R | I am relaxed and do not let stress throw me off balance. | 0 |
BFIK_agree_3R | I can be cold and distant. | 0 |
BFIK_neuro_3 | I worry a lot. | 0 |
BFIK_neuro_4 | I easily become nervous and insecure. | 0 |
BFIK_agree_2 | I readily trust others and believe in the good in people. | 0 |
BFIK_agree | 4 BFIK_agree items aggregated by aggregation_function | 0 |
BFIK_neuro | 3 BFIK_neuro items aggregated by aggregation_function | 0 |
age | Age | 0 |
This dataset was automatically described using the codebook R package (version 0.9.2).
UNI-CEN Standardized Census Data Tables contain Census data that have been reformatted into a common table format with standardized variable names and codes. The data are provided in two tabular formats for different use cases. "Long" tables are suitable for use in statistical environments, while "wide" tables are commonly used in GIS environments. The long tables are provided in Stata Binary (dta) format, which is readable by all statistics software. The wide tables are provided in comma-separated values (csv) and dBase 3 (dbf) formats with codebooks. The wide tables are easily joined to the UNI-CEN Digital Boundary Files. For the csv files, a .csvt file is provided to ensure that column data formats are correctly formatted when importing into QGIS. A schema.ini file does the same when importing into ArcGIS environments. As the DBF file format supports a maximum of 250 columns, tables with a larger number of variables are divided into multiple DBF files. For more information about file sources, the methods used to create them, and how to use them, consult the documentation at https://borealisdata.ca/dataverse/unicen_docs. For more information about the project, visit https://observatory.uwo.ca/unicen.
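For illustration of the sidecar mechanism described above: a .csvt file is a single line listing one type per CSV column, which QGIS reads to avoid guessing column formats. A hypothetical example (not taken from the actual UNI-CEN files) for a table with a geography ID column followed by two count variables:

```
"String","Integer","Integer"
```

The file must have the same base name as the CSV (e.g. table.csv and table.csvt) and sit in the same directory for QGIS to pick it up.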
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Pokemon’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mlomuscio/pokemon on 14 February 2022.
--- Dataset description provided by original source is as follows ---
I acquired the data from Alberto Barradas at https://www.kaggle.com/abcsds/pokemon. I needed to edit some of the variable names and remove the Total variable in order for my students to use this data for class. Otherwise, I would have just had them use his version of the data.
This dataset is for my Introduction to Data Science and Machine Learning Course. Using a modified Pokémon dataset acquired from Kaggle.com, I created example code for students demonstrating how to explore data with R.
Barradas provides the following description of each variable. I have modified the variable names to make them easier to deal with.
--- Original source retains full ownership of the source dataset ---
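A minimal sketch of the kind of classroom exploration described above, here using Python's standard library on an invented extract (the column names are hypothetical; the actual modified dataset may use different names):

```python
import csv
import io
import statistics

# A few invented rows in the general shape of the Kaggle Pokemon data;
# real column names and values in the course dataset may differ.
raw = """Name,Type,HP,Attack
Bulbasaur,Grass,45,49
Charmander,Fire,39,52
Squirtle,Water,44,48
"""

rows = list(csv.DictReader(io.StringIO(raw)))
mean_hp = statistics.mean(int(r["HP"]) for r in rows)
print(round(mean_hp, 2))  # 42.67
```

The same summary in R would be a one-liner over a data frame column; the point is simply computing a descriptive statistic after loading the CSV.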
On the continental scale, climate is an important determinant of the distributions of plant taxa and ecoregions. To quantify and depict the relations between specific climate variables and these distributions, we placed modern climate and plant taxa distribution data on an approximately 25-kilometer (km) equal-area grid with 27,984 points that cover Canada and the continental United States (Thompson and others, 2015). The gridded climatic data include annual and monthly temperature and precipitation, as well as bioclimatic variables (growing degree days, mean temperatures of the coldest and warmest months, and a moisture index) based on 1961-1990 30-year mean values from the University of East Anglia (UK) Climatic Research Unit (CRU) CL 2.0 dataset (New and others, 2002), and absolute minimum and maximum temperatures for 1951-1980 interpolated from climate-station data (WeatherDisc Associates, 1989). As described below, these data were used to produce portions of the "Atlas of relations between climatic parameters and distributions of important trees and shrubs in North America" (hereafter referred to as "the Atlas"; Thompson and others, 1999a, 1999b, 2000, 2006, 2007, 2012a, 2015). Evolution of the Atlas Over the 16 Years Between Volumes A & B and G: The Atlas evolved through time as technology improved and our knowledge expanded. The climate data employed in the first five Atlas volumes were replaced by more standard and better documented data in the last two volumes (Volumes F and G; Thompson and others, 2012a, 2015). Similarly, the plant distribution data used in Volumes A through D (Thompson and others, 1999a, 1999b, 2000, 2006) were improved for the latter volumes. However, the digitized ecoregion boundaries used in Volume E (Thompson and others, 2007) remain unchanged. 
Also, as we and others used the data in Atlas Volumes A through E, we came to realize that the plant distribution and climate data for areas south of the US-Mexico border were not of sufficient quality or resolution for our needs and these data are not included in this data release. The data in this data release are provided in comma-separated values (.csv) files. We also provide netCDF (.nc) files containing the climate and bioclimatic data, grouped taxa and species presence-absence data, and ecoregion assignment data for each grid point (but not the country, state, province, and county assignment data for each grid point, which are available in the .csv files). The netCDF files contain updated Albers conical equal-area projection details and more precise grid-point locations. When the original approximately 25-km equal-area grid was created (ca. 1990), it was designed to be registered with existing data sets, and only 3 decimal places were recorded for the grid-point latitude and longitude values (these original 3-decimal place latitude and longitude values are in the .csv files). In addition, the Albers conical equal-area projection used for the grid was modified to match projection irregularities of the U.S. Forest Service atlases (e.g., Little, 1971, 1976, 1977) from which plant taxa distribution data were digitized. For the netCDF files, we have updated the Albers conical equal-area projection parameters and recalculated the grid-point latitudes and longitudes to 6 decimal places. The additional precision in the location data produces maximum differences between the 6-decimal place and the original 3-decimal place values of up to 0.00266 degrees longitude (approximately 143.8 m along the projection x-axis of the grid) and up to 0.00123 degrees latitude (approximately 84.2 m along the projection y-axis of the grid). The maximum straight-line distance between a three-decimal-point and six-decimal-point grid-point location is 144.2 m. 
Note that we have not regridded the elevation, climate, grouped taxa and species presence-absence data, or ecoregion data to the locations defined by the new 6-decimal place latitude and longitude data. For example, the climate data described in the Atlas publications were interpolated to the grid-point locations defined by the original 3-decimal place latitude and longitude values. Interpolating the data to the 6-decimal place latitude and longitude values would in many cases not result in changes to the reported values and for other grid points the changes would be small and insignificant. Similarly, if the digitized Little (1971, 1976, 1977) taxa distribution maps were regridded using the 6-decimal place latitude and longitude values, the changes to the gridded distributions would be minor, with a small number of grid points along the edge of a taxa's digitized distribution potentially changing value from taxa "present" to taxa "absent" (or vice versa). These changes should be considered within the spatial margin of error for the taxa distributions, which are based on hand-drawn maps with the distributions evidently generalized, or represented by a small, filled circle, and these distributions were subsequently hand digitized. Users wanting to use data that exactly match the data in the Atlas volumes should use the 3-decimal place latitude and longitude data provided in the .csv files in this data release to represent the center point of each grid cell. Users for whom an offset of up to 144.2 m from the original grid-point location is acceptable (e.g., users investigating continental-scale questions) or who want to easily visualize the data may want to use the data associated with the 6-decimal place latitude and longitude values in the netCDF files. The variable names in the netCDF files generally match those in the data release .csv files, except where the .csv file variable name contains a forward slash, colon, period, or comma (i.e., "/", ":", ".", or ","). 
In the netCDF file variable short names, the forward slashes are replaced with an underscore symbol (i.e., "_") and the colons, periods, and commas are deleted. In the netCDF file variable long names, the punctuation in the name matches that in the .csv file variable names. The "country", "state, province, or territory", and "county" data in the .csv files are not included in the netCDF files. Data included in this release: - Geographic scope. The gridded data cover an area that we labelled as "CANUSA", which includes Canada and the USA (excluding Hawaii, Puerto Rico, and other oceanic islands). Note that the maps displayed in the Atlas volumes are cropped at their northern edge and do not display the full northern extent of the data included in this data release. - Elevation. The elevation data were regridded from the ETOPO5 data set (National Geophysical Data Center, 1993). There were 35 coastal grid points in our CANUSA study area grid for which the regridded elevations were below sea level and these grid points were assigned missing elevation values (i.e., elevation = 9999). The grid points with missing elevation values occur in five coastal areas: (1) near San Diego (California, USA; 1 grid point), (2) Vancouver Island (British Columbia, Canada) and the Olympic Peninsula (Washington, USA; 2 grid points), (3) the Haida Gwaii (formerly Queen Charlotte Islands, British Columbia, Canada) and southeast Alaska (USA, 9 grid points), (4) the Canadian Arctic Archipelago (22 grid points), and (5) Newfoundland (Canada; 1 grid point). - Climate. The gridded climatic data provided here are based on the 1961-1990 30-year mean values from the University of East Anglia (UK) Climatic Research Unit (CRU) CL 2.0 dataset (New and others, 2002), and include annual and monthly temperature and precipitation. The CRU CL 2.0 data were interpolated onto the approximately 25-km grid using geographically-weighted regression, incorporating local lapse-rate estimation and correction. 
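The short-name convention described above (slashes become underscores; colons, periods, and commas are deleted) amounts to a simple string transformation. A sketch in Python, with the example input name invented for illustration:

```python
def netcdf_short_name(csv_name: str) -> str:
    """Map a .csv variable name to the netCDF short name following the
    convention stated in this data release: "/" becomes "_", while
    ":", ".", and "," are deleted."""
    name = csv_name.replace("/", "_")
    for ch in ":.,":
        name = name.replace(ch, "")
    return name

# Hypothetical variable name, for illustration only:
print(netcdf_short_name("Moisture index: AE/PE"))  # Moisture index AE_PE
```

The long names in the netCDF files keep the original punctuation, so users matching variables across the two file formats should compare long names (or apply this mapping to the .csv names) rather than comparing short names directly.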
Additional bioclimatic variables (growing degree days on a 5 degrees Celsius base, mean temperatures of the coldest and warmest months, and a moisture index calculated as actual evapotranspiration divided by potential evapotranspiration) were calculated using the interpolated CRU CL 2.0 data. Also included are absolute minimum and maximum temperatures for 1951-1980 interpolated in a similar fashion from climate-station data (WeatherDisc Associates, 1989). These climate and bioclimate data were used in Atlas volumes F and G (see Thompson and others, 2015, for a description of the methods used to create the gridded climate data). Note that for grid points with missing elevation values (i.e., elevation values equal to 9999), climate data were created using an elevation value of -120 meters. Users may want to exclude these climate data from their analyses (see the Usage Notes section in the data release readme file). - Plant distributions. The gridded plant distribution data align with Atlas volume G (Thompson and others, 2015). Plant distribution data on the grid include 690 species, as well as 67 groups of related species and genera, and are based on U.S. Forest Service atlases (e.g., Little, 1971, 1976, 1977), regional atlases (e.g., Benson and Darrow, 1981), and new maps based on information available from herbaria and other online and published sources (for a list of sources, see Tables 3 and 4 in Thompson and others, 2015). See the "Notes" column in Table 1 (https://pubs.usgs.gov/pp/p1650-g/table1.html) and Table 2 (https://pubs.usgs.gov/pp/p1650-g/table2.html) in Thompson and others (2015) for important details regarding the species and grouped taxa distributions. - Ecoregions. 
The ecoregion gridded data are the same as in Atlas volumes D and E (Thompson and others, 2006, 2007), and include three different systems, Bailey's ecoregions (Bailey, 1997, 1998), WWF's ecoregions (Ricketts and others, 1999), and Kuchler's potential natural vegetation regions (Kuchler, 1985), that are each based on distinctive approaches to categorizing ecoregions. For the Bailey and WWF ecoregions for North America and the Kuchler potential natural vegetation regions for the contiguous United States (i.e.,
The integration of traits into 'syndromes' has been suggested as a useful framework to advance insights in trait responses to environmental stressors. Yet, how stressors shape the consistency (‘repeatability’) of traits and their covariation at the individual level remains debated. We studied how seasonal time constraints shape trait repeatability and integration of life-history, behavioural, and physiological traits along a fast-slow continuum, using the ‘pace-of-life syndrome’ as a framework. We manipulated the photoperiod during the larval development of the damselfly Ischnura elegans, generating a time-relaxed early, a control, and a time-constrained late group. The photoperiod treatment did not seem to affect the voltinism of the larvae. As predicted, late-period larvae accelerated development and growth, yet this acceleration was no longer detectable for growth and metabolic rate during the final instar, possibly due to costs of the initial life-history acceleration. This warrants...
This dataset was created by Chloe Shearer