100+ datasets found
  1. Data from: A dataset to model Levantine landcover and land-use change...

    • zenodo.org
    • data.niaid.nih.gov
• +1 more
    zip
    Updated Dec 16, 2023
    Cite
Michael Kempf (2023). A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19 [Dataset]. http://doi.org/10.5281/zenodo.10396148
    Dataset updated
    Dec 16, 2023
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
Michael Kempf
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 16, 2023
    Area covered
    Levant
    Description

    Overview

    This dataset is the repository for the following paper submitted to Data in Brief:

    Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).

The Data in Brief article contains the supplementary information and is the related data paper to:

    Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).

    Description/abstract

The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and, currently, the escalation of the Israeli-Palestinian conflict, which has strained neighbouring countries such as Jordan through the influx of Syrian refugees and increased the population's vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset combines climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and to highlight the intricate relationship between climate change predictions and current socio-political developments in the Levant.

    Folder structure

The main folder after download contains all data; the following subfolders are stored as zipped files:

“code” stores the 9 code chunks described below, used to read, extract, process, analyse, and visualize the data.

“MODIS_merged” contains the 16-day, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area (n=510), covering January 2001 to December 2022 plus January and February 2023.

    “mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).

    “yield_productivity” contains .csv files of yield information for all countries listed above.

“population” contains two files with the same name but different formats. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).

“GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets, which can be read, extracted (by variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second folder contains the additional January and February 2023 data.

“built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolders containing the raw and the already processed data: “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5-year intervals, e.g., “Levant_built_up_1975.tif”.

    Code structure

    1_MODIS_NDVI_hdf_file_extraction.R


This is the first code chunk; it extracts MODIS data from the .hdf file format. The following package must be installed and the raw data must be downloaded using a simple mass downloader, e.g., from Google Chrome. Packages: terra. Download the MODIS data, after registration, from https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed 9 October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif file with the indication “NDVI”. Because the study area is quite large, we have to load three spatially different time series and merge them later. Note that the time series are temporally consistent.
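A minimal sketch of what this chunk does (not the repository code; directory paths are placeholders and the NDVI layer is assumed to be the first subdataset of each MOD13Q1 file):

library(terra)

# list all downloaded .hdf files
hdf_files <- list.files("your_directory_MODIS/raw", pattern = "\\.hdf$", full.names = TRUE)

for (f in hdf_files) {
  subds <- sds(f)                # all subdatasets stored in the HDF container
  ndvi  <- subds[[1]]            # MOD13Q1 typically stores the 16-day NDVI first
  out   <- sub("\\.hdf$", "_NDVI.tif", basename(f))
  writeRaster(ndvi, file.path("your_directory_MODIS/tif", out), overwrite = TRUE)
}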


    2_MERGE_MODIS_tiles.R


In this code, we load and merge the three different stacks to produce a large and consistent time series of NDVI imagery across the study area. We use the package gtools to load the files in numerical order (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks: we merge the first two (stack 1, stack 2) and store the result, and then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").
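A compact sketch of the merge step (tile folder names are placeholders; gtools::mixedsort ensures the files are read in natural numeric order):

library(terra)
library(gtools)

tile1 <- mixedsort(list.files("h20v05_tifs", pattern = "NDVI.*\\.tif$", full.names = TRUE))
tile2 <- mixedsort(list.files("h21v05_tifs", pattern = "NDVI.*\\.tif$", full.names = TRUE))
tile3 <- mixedsort(list.files("h21v06_tifs", pattern = "NDVI.*\\.tif$", full.names = TRUE))

setwd("your_directory_MODIS/merged")
for (i in seq_along(tile1)) {
  m12 <- merge(rast(tile1[i]), rast(tile2[i]))   # merge stack 1 and stack 2
  m   <- merge(m12, rast(tile3[i]))              # then add stack 3
  writeRaster(m, paste0("NDVI_final_", i, ".tif"), overwrite = TRUE)
}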


    3_CROP_MODIS_merged_tiles.R


Now we want to crop the merged MODIS tiles to our study area. We use a mask, provided as a .shp file in the repository and named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif”. This produces single cropped NDVI time series files from MODIS.
The repository provides the already clipped and merged NDVI datasets.
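A short sketch of the cropping step (file locations are assumptions):

library(terra)
library(gtools)

mask_vec <- vect("mask/MERGED_LEVANT.shp")
merged   <- mixedsort(list.files("your_directory_MODIS/merged", pattern = "\\.tif$", full.names = TRUE))

for (i in seq_along(merged)) {
  r <- mask(crop(rast(merged[i]), mask_vec), mask_vec)   # crop to extent, then mask to outline
  writeRaster(r, paste0("NDVI_merged_clip_", i, ".tif"), overwrite = TRUE)
}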


    4_TREND_analysis_NDVI.R


Now, we want to perform trend analysis on the derived data. The data we load are tricky, as they contain 16-day composites across each year for a 22-year period. Growing-season sums cover MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing-season sums are generated and the slope is calculated. We can then extract the p-values of the trend and flag all values significant at the 0.05 level. Using the ggplot2 package and the melt function from the reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS, span 0.3).
To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted; they show the deviation of the values from the mean. This was done for the NDVI values as well as the GLDAS climate variables as a normalization technique.
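A simplified sketch of the pixel-wise trend and z-score computation for one season (the file name and the pre-computed annual JJA sums are assumptions; the repository code handles the seasonal selection itself):

library(terra)

jja_sum <- rast("JJA_annual_sums.tif")   # hypothetical stack: one growing-season sum per year
years   <- 2001:2022

# pixel-wise linear trend: slope and p-value of season sum ~ year
vals    <- values(jja_sum)               # matrix: cells x years
slope_p <- t(apply(vals, 1, function(v) {
  if (sum(!is.na(v)) < 3) return(c(NA, NA))
  fit <- lm(v ~ years)
  c(coef(fit)[2], summary(fit)$coefficients[2, 4])
}))

trend <- rast(jja_sum, nlyrs = 2)        # empty raster with the same geometry
values(trend) <- slope_p
names(trend) <- c("slope", "p_value")
significant <- trend$p_value < 0.05      # keep trends at the 0.05 significance level

# z-scores of the spatially averaged seasonal sums (same normalization as for GLDAS)
season_mean <- global(jja_sum, "mean", na.rm = TRUE)$mean
z_scores    <- (season_mean - mean(season_mean)) / sd(season_mean)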


    5_BUILT_UP_change_raster.R


Let us look at the landcover changes now. We work with the terra package and get raster data from https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 3 March 2023; 100 m resolution, global coverage). One can download the desired temporal coverage and reclassify it using the code after cropping to the individual study area. Here, I summed up the different rasters to characterize the built-up change in continuous values between 1975 and 2022.
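A hedged sketch of this step (the GHSL file names and the output path are placeholders):

library(terra)

mask_vec <- vect("mask/MERGED_LEVANT.shp")
bu_files <- list.files("built_up/raw_data", pattern = "\\.tif$", full.names = TRUE)

bu_stack  <- mask(crop(rast(bu_files), mask_vec), mask_vec)
bu_change <- sum(bu_stack, na.rm = TRUE)   # continuous built-up change, 1975-2022

writeRaster(bu_change, "built_up/derived_data/Levant_built_up_change.tif", overwrite = TRUE)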


    6_POPULATION_numbers_plot.R


    For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.
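A minimal sketch of such a plot (the column names Year, Country, and Population are assumptions about the .csv structure):

library(ggplot2)

pop <- read.csv("population/Socio_cultural_political_development_database_FAO2023.csv")

ggplot(pop, aes(x = Year, y = Population, colour = Country)) +
  geom_line() +
  labs(title = "Population dynamics in the Levant")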


    7_YIELD_plot.R


In this section, we use the country productivity data from the repository folder “yield_productivity” (e.g., "Jordan_yield.csv"). Each of the single-country yield datasets is plotted with ggplot and the plots are combined using the patchwork package in R.
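A small sketch of the combination step (column names and the second file name are assumptions):

library(ggplot2)
library(patchwork)

jordan <- read.csv("yield_productivity/Jordan_yield.csv")   # assumed columns: Year, Yield
syria  <- read.csv("yield_productivity/Syria_yield.csv")

p1 <- ggplot(jordan, aes(Year, Yield)) + geom_line() + ggtitle("Jordan")
p2 <- ggplot(syria,  aes(Year, Yield)) + geom_line() + ggtitle("Syria")

p1 + p2   # patchwork arranges the country plots side by side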


    8_GLDAS_read_extract_trend


The last code chunk provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed at https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9 October 2023). The raw data come in .nc file format, and the various variables can be extracted using the [“^a variable name”] command on the SpatRaster collection. Each time you run the code, this variable name must be adjusted to the required variable (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 9 October 2023; or the respective code chunk when reading a .nc file with the ncdf4 package in R), or run print(nc) from the code, or use names() on the SpatRaster collection.
    Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
From the processed data, trend analyses are conducted and z-scores are calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. For variables such as rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subsets can be calculated as described in the code. Here, 3-month subsets were chosen for the growing seasons, e.g., March-May (MAM), June-August (JJA), September-November (SON), and DJF (December-February, including January/February of the consecutive year).
From the data, mean values over 48 consecutive years are calculated and trend analyses are performed as described above. In the same way, p-values are extracted and values at the 95% confidence level are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, and different spatial extents across the globe, thanks to the availability of the GLDAS variables.
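A condensed sketch of the GLDAS workflow (the variable name Tair_f_inst and the folder layout are assumptions; see the abbreviation list above for the correct names):

library(terra)

nc_files <- list.files("GLDAS/1975_2022", pattern = "\\.nc4?$", full.names = TRUE)

# one GLDAS file per month; extract a single variable from each and stack them
airT <- rast(lapply(nc_files, function(f) rast(f, subds = "Tair_f_inst")))

mask_vec <- vect("mask/MERGED_LEVANT.shp")
airT     <- mask(crop(airT, mask_vec), mask_vec)

# area-averaged monthly series; frequency = 12 for the annual trend analysis
r_mean <- global(airT, "mean", na.rm = TRUE)$mean
r_ts   <- ts(r_mean, start = c(1975, 1), frequency = 12)
plot(stl(r_ts, s.window = "periodic"))   # quick seasonal-trend decomposition check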

2. Data from: WiBB: An integrated method for quantifying the relative...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
• +1 more
    zip
    Updated Aug 20, 2021
    Cite
    Qin Li; Xiaojun Kou (2021). WiBB: An integrated method for quantifying the relative importance of predictive variables [Dataset]. http://doi.org/10.5061/dryad.xsj3tx9g1
    Dataset updated
    Aug 20, 2021
    Dataset provided by
    Beijing Normal University
    Field Museum of Natural History
    Authors
    Qin Li; Xiaojun Kou
    License

CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    This dataset contains simulated datasets, empirical data, and R scripts described in the paper: “Li, Q. and Kou, X. (2021) WiBB: An integrated method for quantifying the relative importance of predictive variables. Ecography (DOI: 10.1111/ecog.05651)”.

A fundamental goal of scientific research is to identify the underlying variables that govern crucial processes of a system. Here we proposed a new index, WiBB, which integrates the merits of several existing methods: a model-weighting method from information theory (Wi), a standardized regression coefficient method measured by ß* (B), and the bootstrap resampling technique (B). We applied WiBB to simulated datasets with known correlation structures, for both linear models (LM) and generalized linear models (GLM), to evaluate its performance. We also applied two other methods, the relative sum of weights (SWi) and the standardized beta (ß*), to compare their performance with the WiBB method in ranking predictor importance under various scenarios. We also applied it to an empirical dataset of the plant genus Mimulus to select bioclimatic predictors of species' presence across the landscape. Results on the simulated datasets showed that the WiBB method outperformed the ß* and SWi methods in scenarios with small and large sample sizes, respectively, and that the bootstrap resampling technique significantly improved the discriminant ability. When testing WiBB on the empirical dataset with GLM, it sensibly identified four important predictors with high credibility out of six candidates in modeling the geographical distributions of 71 Mimulus species. This integrated index has great advantages in evaluating predictor importance and hence reducing the dimensionality of data, without losing interpretive power. The simplicity of calculating the new metric compared with more sophisticated statistical procedures makes it a handy method in the statistical toolbox.

Methods

To simulate independent datasets (size = 1000), we adopted Galipaud et al.'s approach (2014) with custom modifications of the data.simulation function, which used the multivariate normal distribution function rmvnorm in the R package mvtnorm (v1.0-5, Genz et al. 2016). Each dataset was simulated with a preset correlation structure between a response variable (y) and four predictors (x1, x2, x3, x4). The first three (genuine) predictors were set to be strongly, moderately, and weakly correlated with the response variable, respectively (denoted by large, medium, and small Pearson correlation coefficients, r), while the correlation between the response and the last (spurious) predictor was set to zero. We simulated datasets with three levels of difference between the correlation coefficients of consecutive predictors, where ∆r = 0.1, 0.2, 0.3, respectively. These three levels of ∆r resulted in three correlation structures between the response and the four predictors: (0.3, 0.2, 0.1, 0.0), (0.6, 0.4, 0.2, 0.0), and (0.8, 0.6, 0.3, 0.0), respectively. We repeated the simulation procedure 200 times for each of the three preset correlation structures (600 datasets in total) for later LM fitting. For GLM fitting, we modified the simulation procedures with additional steps, in which we converted the continuous response into binary data O (e.g., occurrence data with 0 for absence and 1 for presence). We tested the WiBB method, along with two other methods, the relative sum of weights (SWi) and the standardized beta (ß*), to evaluate the ability to correctly rank predictor importance under various scenarios. The empirical dataset of 71 Mimulus species was assembled from their occurrence coordinates and corresponding values extracted from climatic layers of the WorldClim dataset (www.worldclim.org), and we applied the WiBB method to infer important predictors for their geographical distributions.
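An illustrative draw of one such simulated dataset (this is not the authors' data.simulation code; the ∆r = 0.2 structure and mutually uncorrelated predictors are assumptions made for the sketch):

library(mvtnorm)

r     <- c(0.6, 0.4, 0.2, 0.0)                 # correlations of y with x1..x4
sigma <- diag(5)
sigma[1, 2:5] <- sigma[2:5, 1] <- r

set.seed(1)
dat <- as.data.frame(rmvnorm(1000, mean = rep(0, 5), sigma = sigma))
names(dat) <- c("y", paste0("x", 1:4))

fit <- lm(y ~ x1 + x2 + x3 + x4, data = dat)   # LM fitting as in the simulation study
summary(fit)$coefficients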

3. Film Circulation dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 12, 2024
    Cite
    Loist, Skadi; Samoilova, Evgenia (Zhenya) (2024). Film Circulation dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7887671
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Film University Babelsberg KONRAD WOLF
    Authors
    Loist, Skadi; Samoilova, Evgenia (Zhenya)
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.

    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

The Film Dataset consists of a data scheme image file, a codebook, and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.

    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.

The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with the suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
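An illustrative example of the two string-distance methods mentioned above, using the stringdist package (the titles are made up; the scripts' actual matching logic is more elaborate):

library(stringdist)

core_title <- "The Hours"
candidates <- c("The Hours", "The Hour", "Hours, The", "The House")

cosine_sim <- 1 - stringdist(tolower(core_title), tolower(candidates), method = "cosine", q = 2)
osa_dist   <- stringdist(tolower(core_title), tolower(candidates), method = "osa")

data.frame(candidates, cosine_sim, osa_dist)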

The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match, i.e. a perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does so for the first 100 films, to check if everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.

    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories, units of measurement, data sources and coding and missing data.

The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format, i.e. all information for each festival is listed in one row.

4. University SET data, with faculty and courses characteristics

    • openicpsr.org
    Updated Sep 12, 2021
    + more versions
    Cite
    Under blind review in refereed journal (2021). University SET data, with faculty and courses characteristics [Dataset]. http://doi.org/10.3886/E149801V1
    Dataset updated
    Sep 12, 2021
    Authors
    Under blind review in refereed journal
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

This paper explores a unique dataset of all the SET ratings provided by students of one university in Poland at the end of the winter semester of the 2020/2021 academic year. The SET questionnaire used by this university is presented in Appendix 1. The dataset is unique for several reasons. It covers all SET surveys filled in by students in all fields and levels of study offered by the university. In the period analysed, the university operated entirely in the online regime amid the Covid-19 pandemic. While the expected learning outcomes formally have not been changed, the online mode of study could have affected the grading policy and could have implications for some of the studied SET biases. This Covid-19 effect is captured by the econometric models and discussed in the paper.

The average SET scores were matched with the characteristics of the teacher (degree, seniority, gender, and SET scores in the past six semesters); the course characteristics (time of day, day of the week, course type, course breadth, class duration, and class size); the attributes of the SET survey responses (the percentage of students providing SET feedback); and the grades of the course (mean, standard deviation, and percentage failed). Data on course grades are also available for the previous six semesters. This rich dataset allows many of the biases reported in the literature to be tested for and new hypotheses to be formulated, as presented in the introduction section.

The unit of observation, i.e. a single row in the data set, is identified by three parameters: teacher unique id (j), course unique id (k), and the question number in the SET questionnaire (n ϵ {1, 2, 3, 4, 5, 6, 7, 8, 9}). This means that for each pair (j,k) we have nine rows, one for each SET survey question, or sometimes fewer when students did not answer one of the SET questions at all. For example, the dependent variable SET_score_avg(j,k,n) for the triplet (j=John Smith, k=Calculus, n=2) is calculated as the average of all Likert-scale answers to question no. 2 in the SET survey distributed to all students who took the Calculus course taught by John Smith. The data set has 8,015 such observations or rows.

The full list of variables or columns in the data set included in the analysis is presented in the attached file section. Their description refers to the triplet (teacher id = j, course id = k, question number = n). When the last value of the triplet (n) is dropped, the variable takes the same values for all n ϵ {1, 2, 3, 4, 5, 6, 7, 8, 9}.

Two attachments:
- a Word file with the variable descriptions
- an Rdata file with the data set (for the R language)

Appendix 1. The SET questionnaire used for this paper.

Evaluation survey of the teaching staff of [university name]. Please complete the following evaluation form, which aims to assess the lecturer's performance. Only one answer should be indicated for each question. The answers are coded in the following way: 5 - I strongly agree; 4 - I agree; 3 - Neutral; 2 - I don't agree; 1 - I strongly don't agree.

Questions (each answered on the 1-5 scale above):
1. I learnt a lot during the course.
2. I think that the knowledge acquired during the course is very useful.
3. The professor used activities to make the class more engaging.
4. If it was possible, I would enroll for the course conducted by this lecturer again.
5. The classes started on time.
6. The lecturer always used time efficiently.
7. The lecturer delivered the class content in an understandable and efficient way.
8. The lecturer was available when we had doubts.
9. The lecturer treated all students equally regardless of their race, background and ethnicity.
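A hypothetical sketch of how such a row can be derived from raw survey responses (the file and column names are assumptions, not part of the dataset):

library(dplyr)

responses <- read.csv("raw_SET_responses.csv")   # assumed columns: teacher_id, course_id, question_n, answer

set_avg <- responses %>%
  group_by(teacher_id, course_id, question_n) %>%
  summarise(SET_score_avg = mean(answer, na.rm = TRUE), .groups = "drop")

nrow(set_avg)   # in the paper's data this aggregation yields 8,015 rows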

5. Data from: Remotely sensed variables analyzed and reported in the paper...

    • catalog.data.gov
    • data.usgs.gov
• +1 more
    Updated Oct 22, 2025
    Cite
    U.S. Geological Survey (2025). Remotely sensed variables analyzed and reported in the paper titled "Multi-year data from satellite- and ground-based sensors show details and scale matter in assessing climate’s effects on wetland surface water, amphibians, and landscape conditions" [Dataset]. https://catalog.data.gov/dataset/remotely-sensed-variables-analyzed-and-reported-in-the-paper-titled-multi-year-data-from-s
    Dataset updated
    Oct 22, 2025
    Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
    Description

The comma-delimited fields in this dataset provide values for the remotely sensed variables analyzed for landscape blocks described in the paper, "Multi-year data from satellite- and ground-based sensors show details and scale matter in assessing climate’s effects on wetland surface water, amphibians, and landscape conditions," by Sadinski et al. (submitted). The field labeled “BlockSite” links the records in this file with a set of boundaries in a shapefile called “Study_Block_Boundaries.shp”. The records represent weekly measurements of normalized difference vegetation index (BlockNDVI) values and total evapotranspiration (BlockETmm), as well as the annual snow-off date (BlockDOYsnowfree), for the study blocks from January through August, 2008 to 2012.

6. Data from: AgrImOnIA: Open Access dataset correlating livestock and air...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 6, 2024
    Cite
    Fassò, Alessandro; Rodeschini, Jacopo; Fusta Moro, Alessandro; Shaboviq, Qendrim; Vinciguerra, Marco; Maranzano, Paolo; Cameletti, Michela; Finazzi, Francesco; Golini, Natalia; Ignaccolo, Rosaria; Otto, Philipp (2024). AgrImOnIA: Open Access dataset correlating livestock and air quality in the Lombardy region, Italy [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6620529
    Dataset updated
    Feb 6, 2024
    Dataset provided by
    University of Milano-Bicocca
    University of Turin
    University of Bergamo
    Leibniz University Hannover
    Authors
    Fassò, Alessandro; Rodeschini, Jacopo; Fusta Moro, Alessandro; Shaboviq, Qendrim; Vinciguerra, Marco; Maranzano, Paolo; Cameletti, Michela; Finazzi, Francesco; Golini, Natalia; Ignaccolo, Rosaria; Otto, Philipp
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Italy, Lombardy
    Description

    The AgrImOnIA dataset is a comprehensive dataset relating air quality and livestock (expressed as the density of bovines and swine bred) along with weather and other variables. The AgrImOnIA Dataset represents the first step of the AgrImOnIA project. The purpose of this dataset is to give the opportunity to assess the impact of agriculture on air quality in Lombardy through statistical techniques capable of highlighting the relationship between the livestock sector and air pollutants concentrations.

    The building process of the dataset is detailed in the companion paper:

    A. Fassò, J. Rodeschini, A. Fusta Moro, Q. Shaboviq, P. Maranzano, M. Cameletti, F. Finazzi, N. Golini, R. Ignaccolo, and P. Otto (2023). Agrimonia: a dataset on livestock, meteorology and air quality in the Lombardy region, Italy. SCIENTIFIC DATA, 1-19.

    available here.

This dataset is a collection of estimated daily values for a range of measurements across different dimensions: air quality, meteorology, emissions, livestock animals, and land use. Data relate to Lombardy and the surrounding area for 2016-2021, inclusive. The surrounding area is obtained by applying a 0.3° buffer to the Lombardy borders.

    The data uses several aggregation and interpolation methods to estimate the measurement for all days.

The files in the record, renamed according to their version (e.g., .._v_3_0_0), are:

Agrimonia_Dataset.csv (.mat and .Rdata), which is built by joining the daily time series related to the AQ, WE, EM, LI and LA variables. To simplify access to variables in the Agrimonia dataset, each variable name starts with the dimension of the variable, i.e., the names of variables related to the AQ dimension start with 'AQ_'. This file is also archived in formats for the MATLAB and R software.

    Metadata_Agrimonia.csv which provides further information about the Agrimonia variables: e.g. sources used, original names of the variables imported, transformations applied.

    Metadata_AQ_imputation_uncertainty.csv which contains the daily uncertainty estimate of the imputed observation for the AQ to mitigate missing data in the hourly time series.

    Metadata_LA_CORINE_labels.csv which contains the label and the description associated with the CLC class.

    Metadata_monitoring_network_registry.csv which contains all details about the AQ monitoring station used to build the dataset. Information about air quality monitoring stations include: station type, municipality code, environment type, altitude, pollutants sampled and other. Each row represents a single sensor.

    Metadata_LA_SIARL_labels.csv which contains the label and the description associated with the SIARL class.

    AGC_Dataset.csv(.mat and .Rdata) that includes daily data of almost all variables available in the Agrimonia Dataset (excluding AQ variables) on an equidistant grid covering the Lombardy region and its surrounding area.

    The Agrimonia dataset can be reproduced using the code available at the GitHub page: https://github.com/AgrImOnIA-project/AgrImOnIA_Data

    UPDATE 31/05/2023 - NEW RELEASE - V 3.0.0

A new version of the dataset is released: Agrimonia_Dataset_v_3_0_0.csv (.Rdata and .mat), where the variables WE_rh_min, WE_rh_mean and WE_rh_max have been recomputed due to some bugs.

In addition, two new columns are added, LI_pigs_v2 and LI_bovine_v2; they represent the density of pigs and bovines (expressed as animals per square kilometre) within a square of about 10 x 10 km centred at the station location.

A new dataset is released: the Agrimonia Grid Covariates (AGC), which includes daily information for the period from 2016 to 2020 for almost all variables within the Agrimonia Dataset on an equidistant grid covering the Lombardy region and its surrounding area. The AGC does not include AQ variables, as they come from monitoring stations that are irregularly spread over the area considered.

    UPDATE 11/03/2023 - NEW RELEASE - V 2.0.2

A new version of the dataset is released: Agrimonia_Dataset_v_2_0_2.csv (.Rdata), where the variable WE_tot_precipitation has been recomputed due to some bugs.

    A new version of the metadata is available: Metadata_Agrimonia_v_2_0_2.csv where the spatial resolution of the variable WE_precipitation_t is corrected.

    UPDATE 24/01/2023 - NEW RELEASE - V 2.0.1

    minor bug fixed

    UPDATE 16/01/2023 - NEW RELEASE - V 2.0.0

    A new version of the dataset is released, Agrimonia_Dataset_v_2_0_0.csv (.Rdata) and Metadata_monitoring_network_registry_v_2_0_0.csv. Some minor points have been addressed:

    Added values for LA_land_use variable for Switzerland stations (in Agrimonia Dataset_v_2_0_0.csv)

    Deleted incorrect values for LA_soil_use variable for stations outside Lombardy region during 2018 (in Agrimonia Dataset_v_2_0_0.csv)

    Fixed duplicate sensors corresponding to the same pollutant within the same station (in Metadata_monitoring_network_registry_v_2_0_0.csv)

  7. Stonybrook_AMS578_Multiple_Regression_Dataset

    • kaggle.com
    zip
    Updated Dec 20, 2020
    Cite
    Joseph Chan (2020). Stonybrook_AMS578_Multiple_Regression_Dataset [Dataset]. https://www.kaggle.com/josephchan524/stonybrook-ams578-multiple-regression-dataset
    Dataset updated
    Dec 20, 2020
    Authors
    Joseph Chan
    Description

    Context

This is a dataset from a Multiple Regression Project in a graduate-level Applied Math Science course at Stony Brook (AMS578, Spring 2020).

    The class blackboard has a pdf file of a paper by Caspi et al. that reports a finding of a gene-environment interaction. This paper used multiple regression techniques as the methodology for its findings. You should read it for background, as it is the genesis of the models that you will be given. The data that you are analyzing is synthetic. That is, the TA used a model to generate the data. Your task is to find the model that the TA used for your data. For example, one possible model is

    The class blackboard also contains a paper by Risch et al. that uses a larger collection of data to assess the findings in Caspi et al. These researchers confirmed that Caspi et al. calculated their results correctly but that no other dataset had the relation reported in Caspi et al. That is, Caspi et al. seem to have reported a false positive (Type I error). The class blackboard contains a recent paper about the genetics of mental illness and a technical appendix giving the specifics. Together these papers are an example of the response of the research community to studying the genetics of mental illness, which is a notoriously difficult research area.

    Content

One file contains the patient identifier and the dependent variable value. The second file contains the patient identifier and the values of six environment variables called E1 to E6. The third file contains the patient identifier and twenty independent indicator variables called G1 to G20. The records may not be in the correct order in each file, and cases may be missing from one or more of the files. You can process the data with VLOOKUP or other data-merging software.
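A minimal R alternative to spreadsheet lookups for merging the three files on the patient identifier (file and column names are assumptions):

y_df <- read.csv("dependent_variable.csv")   # patient ID + dependent variable
e_df <- read.csv("environment_E1_E6.csv")    # patient ID + E1..E6
g_df <- read.csv("genes_G1_G20.csv")         # patient ID + G1..G20

merged <- Reduce(function(a, b) merge(a, b, by = "ID", all = TRUE), list(y_df, e_df, g_df))
complete_cases <- na.omit(merged)            # drop patients missing from any file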

  8. Water-quality data imputation with a high percentage of missing values: a...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jun 8, 2021
    + more versions
    Cite
Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione (2021). Water-quality data imputation with a high percentage of missing values: a machine learning approach [Dataset]. http://doi.org/10.5281/zenodo.4731169
    Dataset updated
    Jun 8, 2021
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.

This work focuses on surface-water-quality data from the Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raise significant challenges.

    To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)).

    IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.
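A minimal sketch of station-based inverse distance weighting, the best-performing method above (coordinates, values, and the power parameter are illustrative assumptions, not taken from the dataset):

idw_impute <- function(target_xy, station_xy, station_values, p = 2) {
  ok <- !is.na(station_values)
  d  <- sqrt(rowSums((station_xy[ok, , drop = FALSE] -
                      matrix(target_xy, sum(ok), 2, byrow = TRUE))^2))
  w  <- 1 / d^p
  sum(w * station_values[ok]) / sum(w)     # distance-weighted average of observed stations
}

stations <- matrix(c(0, 0, 1, 0, 0, 1, 2, 2, 1, 2), ncol = 2, byrow = TRUE)
values   <- c(7.2, 6.9, 7.5, NA, 7.0)      # e.g., dissolved oxygen on one sampling day
idw_impute(target_xy = c(1, 1), station_xy = stations, station_values = values)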

    In this dataset, we include the original and imputed values for the following variables:

    • Water temperature (Tw)

    • Dissolved oxygen (DO)

    • Electrical conductivity (EC)

    • pH

    • Turbidity (Turb)

    • Nitrite (NO2-)

    • Nitrate (NO3-)

    • Total Nitrogen (TN)

    Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].

    More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.

    If you use this dataset in your work, please cite our paper:
    Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318

  9. Behavioral responses of common dolphins to naval sonar

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Oct 4, 2024
    Cite
    Brandon Southall; John Durban (2024). Behavioral responses of common dolphins to naval sonar [Dataset]. http://doi.org/10.5061/dryad.ncjsxkt40
    Dataset updated
    Oct 4, 2024
    Dataset provided by
    Southall Environmental Associates (United States)
    University of California, Santa Cruz
    Authors
    Brandon Southall; John Durban
    License

CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

Despite strong interest in how noise affects marine mammals, little is known about the most abundant and commonly exposed taxa. Social delphinids occur in groups of hundreds of individuals that travel quickly, change behavior ephemerally, and are not amenable to conventional tagging methods, posing challenges in quantifying noise impacts. We integrated drone-based photogrammetry, strategically placed acoustic recorders, and broad-scale visual observations to provide complementary measurements of different aspects of behavior for short- and long-beaked common dolphins. We measured behavioral responses during controlled exposure experiments (CEEs) of military mid-frequency (3-4 kHz) active sonar (MFAS) using simulated and actual Navy sonar sources. We used latent-state Bayesian models to evaluate response probability and persistence in exposure and post-exposure phases. Changes in sub-group movement and aggregation parameters were commonly detected during different phases of MFAS CEEs but not control CEEs. Responses were more evident in short-beaked common dolphins (n=14 CEEs), and a direct relationship between response probability and received level was observed. Long-beaked common dolphins (n=20) showed less consistent responses, although contextual differences may have limited which movement responses could be detected. These are the first experimental behavioral response data for these abundant dolphins to directly inform impact assessments for military sonars.

Methods

We used complementary visual and acoustic sampling methods at variable spatial scales to measure different aspects of common dolphin behavior in known and controlled MFAS exposure and non-exposure contexts. Three fundamentally different data collection systems were used to sample group behavior. Broad-scale visual sampling of subgroup movement was conducted using theodolite tracking from shore-based stations. Assessments of whole-group and sub-group sizes, movement, and behavior were conducted at 2-minute intervals from shore-based and vessel platforms using high-powered binoculars and standardized sampling regimes. Aerial UAS-based photogrammetry quantified the movement of a single focal subgroup. The UAS consisted of a large (1.07 m diameter) custom-built octocopter drone launched and retrieved by hand from vessel platforms. The drone carried a vertically gimballed camera (at least 16 MP) and sensors that allowed precise spatial positioning, allowing spatially explicit photogrammetry to infer movement speed and directionality. Remote-deployed (drifting) passive acoustic monitoring (PAM) sensors were strategically deployed around focal groups to examine both basic aspects of subspecies-specific common dolphin acoustic (whistling) behavior and potential group responses in whistling to MFAS on variable temporal scales (Casey et al., in press). This integration allowed us to evaluate potential changes in movement, social cohesion, and acoustic behavior, and their covariance, associated with the absence or occurrence of exposure to MFAS. The collective raw data set consists of several GB of continuous broadband acoustic data and hundreds of thousands of photogrammetry images.

Three sets of quantitative response variables were analyzed from the different data streams: directional persistence and variation in speed of the focal subgroup from UAS photogrammetry; group vocal activity (whistle counts) from passive acoustic records; and the number of sub-groups within a larger group being tracked from the shore station overlook.
We fit separate Bayesian hidden Markov models (HMMs) to each set of response data, with the HMM assumed to have two states: a baseline state and an enhanced state that was estimated in sequential 5-s blocks throughout each CEE. The number of subgroups was recorded during periodic observations every 2 minutes and assumed constant across time blocks between observations. The number of subgroups was treated as missing data 30 seconds before each change was noted, to introduce prior uncertainty about the precise timing of the change. For movement, two parameters relating to directional persistence and variation in speed were estimated by fitting a continuous-time correlated random walk model to spatially explicit photogrammetry data in the form of location tracks for focal individuals that were sequentially tracked throughout each CEE as a proxy for subgroup movement. Movement parameters were assumed to be normally distributed. Whistle counts were treated as normally distributed but truncated to be positive because negative count data are not possible. Subgroup counts were assumed to be Poisson distributed as they were distinct, small values. In all cases, the response variable mean was modeled as a function of the HMM with a log link:

log(Response_t) = λ0 + λ1 * Z_t

where at each 5-s time block t the hidden state took the value Z_t = 0 to identify one state with a baseline response level λ0, or Z_t = 1 to identify an "enhanced" state, with λ1 representing the enhancement of the quantitative value of the response variable. A flat uniform(-30, 30) prior distribution was used for λ0 in each response model, and a uniform(0, 30) prior distribution was adopted for each λ1 to constrain enhancements to be positive. For whistle and subgroup counts, the enhanced state indicated increased vocal activity and more subgroups. A common indicator variable was estimated for the latent state for both movement parameters, such that switching to the enhanced state described less directional persistence and more variation in velocity. Speed was derived as a function of these two parameters and was used here as a proxy for their joint responses, representing directional displacement over time.
To assess differences in the behavior states between experimental phases, the block-specific latent states were modeled as a function of phase-specific probabilities,

Z_t ~ Bernoulli(p_phase(t)),

to learn about the probability p_phase of being in an enhanced state during each phase. For each of the pre-exposure, exposure, and post-exposure phases, this probability was assigned a flat uniform(0, 1) prior probability. The model was programmed in R (R version 3.6.1; The R Foundation for Statistical Computing) with the nimble package (de Valpine et al. 2020) to estimate posterior distributions of model parameters using Markov chain Monte Carlo (MCMC) sampling. Inference was based on 100,000 MCMC samples following a burn-in of 100,000, with chain convergence determined by visual inspection of three MCMC chains and corroborated by convergence diagnostics (Brooks and Gelman, 1998). To compare behavior across phases, we compared the posterior distributions of the p_phase parameters for each response variable, specifically by monitoring the MCMC output to assess the "probability of response" as the proportion of iterations for which p_exposure was greater or less than p_pre-exposure, and the "probability of persistence" as the proportion of iterations for which p_post-exposure was greater or less than p_pre-exposure. These probabilities of response and persistence thus estimated the extent of separation (non-overlap) between the distributions of pairs of p_phase parameters: if the two distributions of interest were identical, then p = 0.5, and if the two were non-overlapping, then p = 1. Similarly, we estimated the average values of the response variables in each phase by predicting phase-specific functions of the parameters:

Mean.response_phase = exp(λ0 + λ1 * p_phase)

and simply derived average speed as the mean of the speed estimates for 5-second blocks in each phase.
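A minimal nimble sketch of the two-state latent model for one response variable (priors follow the description above; the data objects, the truncation of whistle counts, and the Poisson variant for subgroup counts are simplified assumptions):

library(nimble)

hmmCode <- nimbleCode({
  l0 ~ dunif(-30, 30)                     # baseline level on the log scale
  l1 ~ dunif(0, 30)                       # positive enhancement
  sigma ~ dunif(0, 100)
  for (ph in 1:3) {
    p_phase[ph] ~ dunif(0, 1)             # pre-exposure, exposure, post-exposure
  }
  for (t in 1:T) {
    Z[t] ~ dbern(p_phase[phase[t]])       # latent baseline/enhanced state per 5-s block
    log(mu[t]) <- l0 + l1 * Z[t]
    y[t] ~ dnorm(mu[t], sd = sigma)
  }
})

# After MCMC (T, phase, and y supplied as constants/data to nimbleModel), the
# probability of response can be read off the samples, e.g.:
# mean(samples[, "p_phase[2]"] > samples[, "p_phase[1]"])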

10. UC_vs_US Statistic Analysis.xlsx

    • figshare.com
    xlsx
    Updated Jul 9, 2020
    Cite
    F. (Fabiano) Dalpiaz (2020). UC_vs_US Statistic Analysis.xlsx [Dataset]. http://doi.org/10.23644/uu.12631628.v1
    Dataset updated
    Jul 9, 2020
    Dataset provided by
    Utrecht University
    Authors
    F. (Fabiano) Dalpiaz
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Sheet 1 (Raw-Data): The raw data of the study is provided, presenting the tagging results for the measures described in the paper. For each subject, it includes multiple columns:
A. a sequential student ID
B. an ID that defines a random group label and the notation
C. the used notation: User Stories or Use Cases
D. the case they were assigned to: IFA, Sim, or Hos
E. the subject's exam grade (total points out of 100); empty cells mean that the subject did not take the first exam
F. a categorical representation of the grade (L/M/H), where H is greater than or equal to 80, M is between 65 (included) and 80 (excluded), and L otherwise
G. the total number of classes in the student's conceptual model
H. the total number of relationships in the student's conceptual model
I. the total number of classes in the expert's conceptual model
J. the total number of relationships in the expert's conceptual model
K-O. the total number of encountered situations of alignment, wrong representation, system-oriented, omitted, missing (see tagging scheme below)
P. the researchers' judgement on how well the derivation process was explained by the student: well explained (a systematic mapping that can be easily reproduced), partially explained (vague indication of the mapping), or not present.

Tagging scheme:
Aligned (AL) - a concept is represented as a class in both models, either with the same name or using synonyms or clearly linkable names;
Wrongly represented (WR) - a class in the domain expert model is incorrectly represented in the student model, either (i) via an attribute, method, or relationship rather than a class, or (ii) using a generic term (e.g., "user" instead of "urban planner");
System-oriented (SO) - a class in CM-Stud that denotes a technical implementation aspect, e.g., access control. Classes that represent a legacy system or the system under design (portal, simulator) are legitimate;
Omitted (OM) - a class in CM-Expert that does not appear in any way in CM-Stud;
Missing (MI) - a class in CM-Stud that does not appear in any way in CM-Expert.

All the calculations and information provided in the following sheets originate from that raw data.

Sheet 2 (Descriptive-Stats): Shows a summary of statistics from the data collection, including the number of subjects per case, per notation, per process derivation rigor category, and per exam grade category.

Sheet 3 (Size-Ratio): The number of classes within the student model divided by the number of classes within the expert model is calculated (describing the size ratio). We provide box plots to allow a visual comparison of the shape of the distribution, its central value, and its variability for each group (by case, notation, process, and exam grade). The primary focus in this study is on the number of classes. However, we also provided the size ratio for the number of relationships between student and expert model.

Sheet 4 (Overall): Provides an overview of all subjects regarding the encountered situations, completeness, and correctness, respectively. Correctness is defined as the ratio of classes in a student model that are fully aligned with the classes in the corresponding expert model. It is calculated by dividing the number of aligned concepts (AL) by the sum of the number of aligned concepts (AL), omitted concepts (OM), system-oriented concepts (SO), and wrong representations (WR). Completeness, on the other hand, is defined as the ratio of classes in a student model that are correctly or incorrectly represented over the number of classes in the expert model. Completeness is calculated by dividing the sum of aligned concepts (AL) and wrong representations (WR) by the sum of the number of aligned concepts (AL), wrong representations (WR) and omitted concepts (OM). The overview is complemented with general diverging stacked bar charts that illustrate correctness and completeness.
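A small R sketch of the two formulas applied to one student's tag counts (the counts are illustrative):

counts <- c(AL = 12, WR = 3, SO = 2, OM = 4, MI = 1)

correctness  <- counts["AL"] / (counts["AL"] + counts["OM"] + counts["SO"] + counts["WR"])
completeness <- (counts["AL"] + counts["WR"]) / (counts["AL"] + counts["WR"] + counts["OM"])

round(c(correctness = unname(correctness), completeness = unname(completeness)), 2)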

    For Sheet 4, as well as for the following four sheets, diverging stacked bar charts are provided to visualize the effect of each of the independent and mediated variables. The charts are based on the relative numbers of encountered situations for each student. In addition, a "Buffer" is calculated which solely serves the purpose of constructing the diverging stacked bar charts in Excel. Finally, at the bottom of each sheet, the significance (t-test) and effect size (Hedges' g) for both completeness and correctness are provided. Hedges' g was calculated with an online tool: https://www.psychometrica.de/effect_size.html. The independent and moderating variables can be found as follows:

    Sheet 5 (By-Notation): Model correctness and model completeness are compared by notation - UC, US.

    Sheet 6 (By-Case): Model correctness and model completeness are compared by case - SIM, HOS, IFA.

    Sheet 7 (By-Process): Model correctness and model completeness are compared by how well the derivation process was explained - well explained, partially explained, not present.

    Sheet 8 (By-Grade): Model correctness and model completeness are compared by exam grade, converted to the categorical values High, Medium, and Low.
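
    For reference, a minimal R sketch of the t-test and Hedges' g computation used in Sheets 5-8 (the workbook itself used the linked online tool). The input vectors are made-up per-subject correctness values; t.test() defaults to Welch's variant, so set var.equal = TRUE if a Student's t-test is required.

        hedges_g <- function(x, y) {
          nx <- length(x); ny <- length(y)
          s_pooled <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
          d <- (mean(x) - mean(y)) / s_pooled
          d * (1 - 3 / (4 * (nx + ny) - 9))   # small-sample correction factor
        }

        us <- c(0.62, 0.71, 0.55, 0.80, 0.66)  # correctness, notation US (made-up values)
        uc <- c(0.48, 0.59, 0.51, 0.63, 0.44)  # correctness, notation UC (made-up values)
        t.test(us, uc)      # significance
        hedges_g(us, uc)    # effect size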

  11. A dataset of for cross-course learning path planning with 7 types of learner...

    • scidb.cn
    Updated May 14, 2024
    Cite
    Yong-Wei Zhang (2024). A dataset of for cross-course learning path planning with 7 types of learner and 7 types of course materials [Dataset]. http://doi.org/10.57760/sciencedb.18420
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 14, 2024
    Dataset provided by
    Science Data Bank
    Authors
    Yong-Wei Zhang
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset accompanies the research paper titled "Enhancing Personalized Learning in Online Education through Integrated Cross-Course Learning Path Planning." The dataset consists of MATLAB data files (.mat format).

    The dataset includes data on seven types of learner attributes, named LearnerA.mat to LearnerG.mat. Each learner dataset contains two variables: L and LP. L is a 10x16 matrix that stores learner attributes, where each row represents a learner. The first column indicates the learner's ability level, the second column indicates the expected learning time, columns 3 to 6 represent normalized learning styles, and columns 7 to 16 represent learning objectives. LP is a structure that stores statistical information about this matrix.

    The dataset also includes data on seven types of learning resource attributes, named DatasetA.mat, DatasetB.mat, DatasetC.mat, DatasetAB.mat, DatasetAC.mat, DatasetBC.mat, and DatasetABC.mat. Each resource dataset contains two variables: M and MP. M is a matrix that stores the attributes of learning materials, where each row represents a material. The first column indicates the material's difficulty level, the second column represents the learning time required for the material, columns 3 to 6 describe the type of material, columns 7 to 16 cover the knowledge points addressed by the material, and columns 17 to 26 list the prerequisite knowledge points required for the material. MP is a structure that stores statistical information about this matrix.

    The dataset encompasses results from learning path planning involving seven types of learners across seven resource datasets, totaling 49 result files, named in the format PathCost4_LSHADE_cnEpSin_D_X_L_Y.mat, where X denotes the type of learning resource dataset (A, B, C, AB, AC, BC, ABC) and Y denotes the type of learner (A to G). Each data file contains three variables: Gbest, Gtime, and S. Gbest is a 30x10 matrix, where each column stores the best cost function obtained from 30 runs of path planning for a learner on the corresponding dataset. Gtime is a 30x10 matrix, where each column stores the time spent on each run for a learner on the corresponding dataset. S is a 30x10 cell array storing the status information from each run.

    Finally, the dataset includes a compilation of the best cost functions for all runs, all learners, and all learning material datasets, named learnerBest.mat. The file contains a variable, learnerBest, which is a 7x7x10x30 four-dimensional array. The first dimension represents the type of learner, the second dimension represents the type of learning material, the third dimension represents the learner index, and the fourth dimension represents the run index.
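
    A possible way to inspect these files from R rather than MATLAB, assuming the R.matlab package is installed; the file and variable names used below are those given in the description above.

        library(R.matlab)

        learner <- readMat("LearnerA.mat")
        L <- learner$L                 # 10 x 16 matrix of learner attributes
        ability  <- L[, 1]             # column 1: ability level
        exp_time <- L[, 2]             # column 2: expected learning time
        styles   <- L[, 3:6]           # columns 3-6: normalized learning styles
        goals    <- L[, 7:16]          # columns 7-16: learning objectives

        materials <- readMat("DatasetA.mat")
        M <- materials$M               # one row per learning material
        difficulty <- M[, 1]           # column 1: difficulty level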

  12. Historic US census - 1930

    • redivis.com
    application/jsonl +7
    Updated Jan 10, 2020
    + more versions
    Cite
    Stanford Center for Population Health Sciences (2020). Historic US census - 1930 [Dataset]. http://doi.org/10.57761/6e5q-rh85
    Explore at:
    Available download formats: application/jsonl, parquet, spss, csv, arrow, stata, avro, sas
    Dataset updated
    Jan 10, 2020
    Dataset provided by
    Redivis Inc.
    Authors
    Stanford Center for Population Health Sciences
    Time period covered
    Jan 1, 1930 - Dec 31, 1930
    Area covered
    United States
    Description

    Abstract

    The Integrated Public Use Microdata Series (IPUMS) Complete Count Data include more than 650 million individual-level and 7.5 million household-level records. The microdata are the result of collaboration between IPUMS and the nation's two largest genealogical organizations (Ancestry.com and FamilySearch) and provide the largest and richest source of individual-level and household-level data.

    Before Manuscript Submission

    All manuscripts (and other items you'd like to publish) must be submitted to phsdatacore@stanford.edu for approval prior to journal submission. We will check your cell sizes and citations.

    For more information about how to cite PHS and PHS datasets, please visit: https://phsdocs.developerhub.io/need-help/citing-phs-data-core

    Documentation

    This dataset was created on 2020-01-10 22:52:11.461 by merging multiple datasets together. The source datasets for this version were:

    IPUMS 1930 households: This dataset includes all households from the 1930 US census.

    IPUMS 1930 persons: This dataset includes all individuals from the 1930 US census.

    IPUMS 1930 Lookup: This dataset includes variable names, variable labels, variable values, and corresponding variable value labels for the IPUMS 1930 datasets.

    Section 2

    Historic data are scarce and often only exist in aggregate tables. The key advantage of historic US census data is the availability of individual- and household-level characteristics that researchers can tabulate in ways that benefit their specific research questions. The data contain demographic variables, economic variables, migration variables and family variables. Within households, it is possible to create relational data as all relations between household members are known. For example, having data on the mother and her children in a household enables researchers to calculate the mother's age at birth. Another advantage of the Complete Count data is the possibility to follow individuals over time using a historical identifier.

    In sum: the historic US census data are a unique source for research on social and economic change and can provide population health researchers with information about social and economic determinants.

    The historic US 1930 census data was collected in April 1930. Enumerators collected data by traveling to households and counting the residents who regularly slept at the household. Individuals lacking permanent housing were counted as residents of the place where they were when the data was collected. Household members absent on the day of data collection were either listed with the household with the help of other household members or were scheduled for the last census subdivision.

    Notes

    • We provide IPUMS household and person data separately so that it is convenient to explore the descriptive statistics on each level. In order to obtain a full dataset, merge the household and person files on the variables SERIAL and SERIALP. In order to create a longitudinal dataset, merge datasets on the variable HISTID. (A minimal merge sketch in R follows these notes.)

    • Households with more than 60 people in the original data were broken up for processing purposes. Every person in these large households is considered to be in their own household. The original large households can be identified using the variable SPLIT, reconstructed using the variable SPLITHID, and the original count is found in the variable SPLITNUM.

    • Coded variables derived from string variables are still in progress. These variables include: occupation and industry.

    • Missing observations have been allocated and some inconsistencies have been edited for the following variables: SPEAKENG, YRIMMIG, CITIZEN, AGEMARR, AGE, BPL, MBPL, FBPL, LIT, SCHOOL, OWNERSHP, FARM, EMPSTAT, OCC1950, IND1950, MTONGUE, MARST, RACE, SEX, RELATE, CLASSWKR. The flag variables indicating an allocated observation for the associated variables can be included in your extract by clicking the ‘Select data quality flags’ box on the extract summary page.

    • Most inconsistent information was not edite
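
    A minimal R sketch of the merges described in the notes above. The variable names (SERIAL, SERIALP, HISTID) are the provider's; the file names are placeholders, and the full-count files are large enough that a database-backed workflow may be preferable to read.csv.

        households <- read.csv("ipums_1930_households.csv")   # placeholder file name
        persons    <- read.csv("ipums_1930_persons.csv")      # placeholder file name

        # Full dataset: attach the household record to each person.
        full <- merge(persons, households, by.x = "SERIALP", by.y = "SERIAL")

        # Longitudinal linkage across census years: join person-level files on HISTID.
        # persons_1940 is a placeholder for another year's person file.
        # linked <- merge(persons, persons_1940, by = "HISTID", suffixes = c("_1930", "_1940"))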

  13. Treatment Episode Data Set -- Admissions (TEDS-A), 2012

    • icpsr.umich.edu
    ascii, delimited, r +3
    Updated May 7, 2014
    + more versions
    Cite
    United States Department of Health and Human Services. Substance Abuse and Mental Health Services Administration. Center for Behavioral Health Statistics and Quality (2014). Treatment Episode Data Set -- Admissions (TEDS-A), 2012 [Dataset]. http://doi.org/10.3886/ICPSR35037.v1
    Explore at:
    Available download formats: ascii, sas, delimited, spss, stata, r
    Dataset updated
    May 7, 2014
    Dataset provided by
    Inter-university Consortium for Political and Social Researchhttps://www.icpsr.umich.edu/web/pages/
    Authors
    United States Department of Health and Human Services. Substance Abuse and Mental Health Services Administration. Center for Behavioral Health Statistics and Quality
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/35037/terms

    Time period covered
    2012
    Area covered
    United States
    Description

    The Treatment Episode Data Set -- Admissions (TEDS-A) is a national census data system of annual admissions to substance abuse treatment facilities. TEDS-A provides annual data on the number and characteristics of persons admitted to public and private substance abuse treatment programs that receive public funding. The unit of analysis is a treatment admission. TEDS consists of data reported to state substance abuse agencies by the treatment programs, which in turn report it to SAMHSA. A sister data system, called the Treatment Episode Data Set -- Discharges (TEDS-D), collects data on discharges from substance abuse treatment facilities. The first year of TEDS-A data is 1992, while the first year of TEDS-D is 2006. TEDS variables that are required to be reported are called the "Minimum Data Set (MDS)", while those that are optional are called the "Supplemental Data Set (SuDS)". Variables in the MDS include: information on service setting, number of prior treatments, primary source of referral, gender, race, ethnicity, education, employment status, substance(s) abused, route of administration, frequency of use, age at first use, and whether methadone was prescribed in treatment. Supplemental variables include: diagnosis codes, presence of psychiatric problems, living arrangements, source of income, health insurance, expected source of payment, pregnancy and veteran status, marital status, detailed not in labor force codes, detailed criminal justice referral codes, days waiting to enter treatment, and the number of arrests in the 30 days prior to admissions (starting in 2008). Substances abused include alcohol, cocaine and crack, marijuana and hashish, heroin, nonprescription methadone, other opiates and synthetics, PCP, other hallucinogens, methamphetamine, other amphetamines, other stimulants, benzodiazepines, other non-benzodiazepine tranquilizers, barbiturates, other non-barbiturate sedatives or hypnotics, inhalants, over-the-counter medications, and other substances. Created variables include total number of substances reported, intravenous drug use (IDU), and flags for any mention of specific substances.

  14. Current Population Survey (CPS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the current population survey (cps) annual social and economic supplement (asec) with r. the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no. despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population. the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

    this new github repository contains three scripts:

    2005-2012 asec - download all microdata.R
    • download the fixed-width file containing household, family, and person records
    • import by separating this file into three tables, then merge 'em together at the person-level
    • download the fixed-width file containing the person-level replicate weights
    • merge the rectangular person-level file with the replicate weights, then store it in a sql database
    • create a new variable - one - in the data table

    2012 asec - analysis examples.R
    • connect to the sql database created by the 'download all microdata' program
    • create the complex sample survey object, using the replicate weights
    • perform a boatload of analysis examples

    replicate census estimates - 2011.R
    • connect to the sql database created by the 'download all microdata' program
    • create the complex sample survey object, using the replicate weights
    • match the sas output shown in the png file below

    2011 asec replicate weight sas output.png: statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document.

    click here to view these three scripts

    for more detail about the current population survey - annual social and economic supplement (cps-asec), visit: the census bureau's current population survey page, the bureau of labor statistics' current population survey page, and the current population survey's wikipedia article.

    notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research. confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
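
    a hedged sketch of the import approach described above, assuming the SAScii package's parse.SAScii() and read.SAScii() interfaces; the file names are placeholders, and the repository's own scripts additionally push the result into a SQLite database via RSQLite before building the replicate-weighted survey design.

        library(SAScii)

        sas_script <- "cpsmar2012.sas"        # NBER SAS importation code (placeholder name)
        dat_file   <- "asec2012_pubuse.dat"   # fixed-width microdata file (placeholder name)

        layout <- parse.SAScii(sas_script)            # column names, widths, and types
        asec   <- read.SAScii(dat_file, sas_script)   # read the fixed-width file into a data frame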

  15. Data from: Dataset of the paper “Variable selection for linear regression in...

    • investigacion.ubu.es
    Updated 2020
    Cite
    Pacheco Bonrostro, Joaquín; Casado Yusta, Silvia; Pacheco Bonrostro, Joaquín; Casado Yusta, Silvia (2020). Dataset of the paper “Variable selection for linear regression in large databases: exact methods” Applied Intelligence, 51(6), 3736-3756 [Dataset]. https://investigacion.ubu.es/documentos/682afba74c44bf76b28811e1
    Explore at:
    Dataset updated
    2020
    Authors
    Pacheco Bonrostro, Joaquín; Casado Yusta, Silvia; Pacheco Bonrostro, Joaquín; Casado Yusta, Silvia
    Description

    The variable selection problem in the context of Linear Regression for large databases is analysed. The problem consists in selecting a small subset of independent variables that can perform the prediction task optimally. This problem has a wide range of applications. One important type of application is the design of composite indicators in various areas (sociology and economics, for example). Other important applications of variable selection in linear regression can be found in fields such as chemometrics, genetics, and climate prediction, among many others. For this problem, we propose a Branch & Bound method. This is an exact method and therefore guarantees optimal solutions. We also provide strategies that enable this method to be applied in very large databases (with hundreds of thousands of cases) in a moderate computation time. A series of computational experiments shows that our method performs well compared with well-known methods in the literature and with commercial software.
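
    The authors' exact Branch & Bound implementation is not part of this record; as a point of comparison, the R package leaps implements the Furnival-Wilson branch-and-bound ("leaps and bounds") search for best-subset linear regression, sketched here on simulated data.

        library(leaps)

        set.seed(1)
        n <- 200; p <- 20
        X <- matrix(rnorm(n * p), n, p)
        y <- drop(X[, 1:3] %*% c(2, -1, 0.5)) + rnorm(n)   # only 3 informative predictors

        fit <- regsubsets(x = X, y = y, nvmax = 5, method = "exhaustive")
        summary(fit)$which[3, ]   # variables entering the best 3-variable model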

  16. Dataset of books called Variable capital

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books called Variable capital [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Variable+capital
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 1 row and is filtered where the book is Variable capital. It features 7 columns including author, publication date, language, and book publisher.

  17. Data from: Automatic Spectroscopic Data Categorization by Clustering...

    • acs.figshare.com
    zip
    Updated May 30, 2023
    Cite
    Xin Zou; Elaine Holmes; Jeremy K Nicholson; Ruey Leng Loo (2023). Automatic Spectroscopic Data Categorization by Clustering Analysis (ASCLAN): A Data-Driven Approach for Distinguishing Discriminatory Metabolites for Phenotypic Subclasses [Dataset]. http://doi.org/10.1021/acs.analchem.5b04020.s001
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    ACS Publications
    Authors
    Xin Zou; Elaine Holmes; Jeremy K Nicholson; Ruey Leng Loo
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    We propose a novel data-driven approach aiming to reliably distinguish discriminatory metabolites from nondiscriminatory metabolites for a given spectroscopic data set containing two biological phenotypic subclasses. The automatic spectroscopic data categorization by clustering analysis (ASCLAN) algorithm aims to categorize spectral variables within a data set into three clusters corresponding to noise, nondiscriminatory and discriminatory metabolite regions. This is achieved by clustering each spectral variable based on the r2 value representing the loading weight of each spectral variable as extracted from an orthogonal partial least-squares discriminant analysis (OPLS-DA) model of the data set. The variables are ranked according to r2 values and a series of principal component analysis (PCA) models are then built for subsets of these spectral data corresponding to ranges of r2 values. The Q2X value for each PCA model is extracted. K-means clustering is then applied to the Q2X values to generate two clusters based on a minimum Euclidean distance criterion. The cluster consisting of lower Q2X values is deemed devoid of metabolic information (noise), while the cluster consisting of higher Q2X values is further subclustered into two groups based on the r2 values. We considered the cluster with high Q2X but low r2 values as nondiscriminatory, and the cluster with high Q2X and high r2 values as discriminatory variables. The boundaries between these three clusters of spectral variables, on the basis of the r2 values, were considered as the cut-off values for defining the noise, nondiscriminatory and discriminatory variables.

    We evaluated the ASCLAN algorithm using six simulated 1H NMR spectroscopic data sets representing small, medium and large data sets (N = 50, 500, and 1000 samples per group, respectively), each with a reduced and full resolution set of variables (0.005 and 0.0005 ppm, respectively). ASCLAN correctly identified all discriminatory metabolites and showed zero false positives (100% specificity and positive predictive value) irrespective of the spectral resolution or the sample size in all six simulated data sets. This error rate was found to be superior to existing methods for ascertaining feature significance: univariate t test with Bonferroni correction (up to 10% false positive rate), Benjamini–Hochberg correction (up to 35% false positive rate) and metabolome wide significance level (MWSL, up to 0.4% false positive rate), as well as various OPLS-DA parameters: variable importance to projection (up to 15% false positive rate), loading coefficients (up to 35% false positive rate), and regression coefficients (up to 39% false positive rate).

    The application of ASCLAN was further exemplified using a widely investigated renal toxin, mercury II chloride (HgCl2), in a rat model. ASCLAN successfully identified many of the known metabolites related to renal toxicity, such as increased excretion of urinary creatinine and different amino acids. The ASCLAN algorithm provides a framework for reliably differentiating discriminatory metabolites from nondiscriminatory metabolites in a biological data set without the need to set an arbitrary cut-off value as applied in some of the conventional methods. This offers significant advantages over existing methods and the possibility for automation of high-throughput screening in “omics” data.
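
    An illustrative R sketch of the clustering step only (not the full ASCLAN pipeline): k-means with two centres separates low-Q2X (noise-like) from high-Q2X (information-bearing) variable subsets. The Q2X values below are simulated for illustration.

        set.seed(42)
        q2x <- c(rnorm(40, mean = 0.05, sd = 0.02),   # noise-like subsets (made-up values)
                 rnorm(20, mean = 0.60, sd = 0.10))   # informative subsets (made-up values)

        km <- kmeans(q2x, centers = 2)
        informative <- q2x[km$cluster == which.max(km$centers)]
        range(informative)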

  18. Data from: Generalized Factor Model for Ultra-High Dimensional Correlated...

    • tandf.figshare.com
    zip
    Updated Feb 9, 2024
    Cite
    Wei Liu; Huazhen Lin; Shurong Zheng; Jin Liu (2024). Generalized Factor Model for Ultra-High Dimensional Correlated Variables with Mixed Types [Dataset]. http://doi.org/10.6084/m9.figshare.16899998.v2
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 9, 2024
    Dataset provided by
    Taylor & Francishttps://taylorandfrancis.com/
    Authors
    Wei Liu; Huazhen Lin; Shurong Zheng; Jin Liu
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As high-dimensional data measured with mixed-type variables gradually become prevalent, it is particularly appealing to represent those mixed-type high-dimensional data using a much smaller set of so-called factors. Due to the limitation of the existing methods for factor analysis that deal with only continuous variables, in this article, we develop a generalized factor model, a corresponding algorithm and theory for ultra-high dimensional mixed types of variables where both the sample size n and variable dimension p could diverge to infinity. Specifically, to solve the computational problem arising from the non-linearity and mixed types, we develop a two-step algorithm so that each update can be carried out in parallel across variables and samples by using an existing package. Theoretically, we establish the rate of convergence for the estimators of factors and loadings in the presence of nonlinear structure accompanied with mixed-type variables when both n and p diverge to infinity. Moreover, since the correct specification of the number of factors is crucial to both the theoretical and the empirical validity of factor models, we also develop a criterion based on a penalized loss to consistently estimate the number of factors under the framework of a generalized factor model. To demonstrate the advantages of the proposed method over the existing ones, we conducted extensive simulation studies and also applied it to the analysis of the NFBC1966 dataset and a cardiac arrhythmia dataset, resulting in more predictive and interpretable estimators for loadings and factors than the existing factor model.

  19. Dataset of books called Reactions with variable-charge soils

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books called Reactions with variable-charge soils [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Reactions+with+variable-charge+soils
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 1 row and is filtered where the book is Reactions with variable-charge soils. It features 7 columns including author, publication date, language, and book publisher.

  20. South German Credit (UPDATE) Data Set

    • kaggle.com
    zip
    Updated Aug 29, 2020
    Cite
    Tushar Mishra (2020). South German Credit (UPDATE) Data Set [Dataset]. https://www.kaggle.com/datasets/tmchls/south-german-credit-update-data-set/code
    Explore at:
    Available download formats: zip (16463 bytes)
    Dataset updated
    Aug 29, 2020
    Authors
    Tushar Mishra
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    700 good and 300 bad credits with 20 predictor variables. Data from 1973 to 1975. Stratified sample from actual credits with bad credits heavily oversampled. A cost matrix can be used.
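
    A hedged R sketch of evaluating predictions under an asymmetric cost matrix. The 5:1 costs are the ones conventionally quoted for the Statlog version of this data (misclassifying a bad credit as good costs 5, the reverse costs 1); verify them against the dataset's own documentation, and note that the labels and predictions below are toy values.

        costs <- matrix(c(0, 1,
                          5, 0),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(actual = c("good", "bad"),
                                        predicted = c("good", "bad")))

        actual    <- factor(c("good", "good", "bad", "bad", "good"), levels = c("good", "bad"))
        predicted <- factor(c("good", "bad",  "good", "bad", "good"), levels = c("good", "bad"))

        tab <- table(actual, predicted)
        sum(tab * costs)   # total misclassification cost (here 2*0 + 1*1 + 1*5 + 1*0 = 6)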

    The widely used Statlog German credit data ([Web Link]), as of November 2019, suffers from severe errors in the coding information and does not come with any background information. The 'South German Credit' data provide a correction and some background information, based on the Open Data LMU (2010) representation of the same data and several other German language resources.

    Attribute Information:

    Column name: laufkont Variable name: status Content: status of the debtor's checking account with the bank (categorical)

    Column name: laufzeit Variable name: duration Content: credit duration in months (quantitative)

    Column name: moral Variable name: credit_history Content: history of compliance with previous or concurrent credit contracts (categorical)

    Column name: verw Variable name: purpose Content: purpose for which the credit is needed (categorical)

    Column name: hoehe Variable name: amount Content: credit amount in DM (quantitative; result of monotonic transformation; actual data and type of transformation unknown)

    Column name: sparkont Variable name: savings Content: debtor's savings (categorical)

    Column name: beszeit Variable name: employment_duration Content: duration of debtor's employment with current employer (ordinal; discretized quantitative)

    Column name: rate Variable name: installment_rate Content: credit installments as a percentage of debtor's disposable income (ordinal; discretized quantitative)

    Column name: famges Variable name: personal_status_sex Content: combined information on sex and marital status; categorical; sex cannot be recovered from the variable, because male singles and female non-singles are coded with the same code (2); female widows cannot be easily classified, because the code table does not list them in any of the female categories

    Column name: buerge Variable name: other_debtors Content: Is there another debtor or a guarantor for the credit? (categorical)

    Column name: wohnzeit Variable name: present_residence Content: length of time (in years) the debtor lives in the present residence (ordinal; discretized quantitative)

    Column name: verm Variable name: property Content: the debtor's most valuable property, i.e. the highest possible code is used. Code 2 is used, if codes 3 or 4 are not applicable and there is a car or any other relevant property that does not fall under variable sparkont. (ordinal)

    Column name: alter Variable name: age Content: age in years (quantitative)

    Column name: weitkred Variable name: other_installment_plans Content: installment plans from providers other than the credit-giving bank (categorical)

    Column name: wohn Variable name: housing Content: type of housing the debtor lives in (categorical)

    Column name: bishkred Variable name: number_credits Content: number of credits including the current one the debtor has (or had) at this bank (ordinal, discretized quantitative); contrary to Fahrmeir and Hamerle’s (1984) statement, the original data values are not available.

    Column name: beruf Variable name: job Content: quality of debtor's job (ordinal)

    Column name: pers Variable name: people_liable Content: number of persons who financially depend on the debtor (i.e., are entitled to maintenance) (binary, discretized quantitative)

    Column name: telef Variable name: telephone Content: Is there a telephone landline registered on the debtor's name? (binary; remember that the data are from the 1970s)

    Column name: gastarb Variable name: foreign_worker Content: Is the debtor a foreign worker? (binary)

    Column name: kredit Variable name: credit_risk Content: Has the credit contract been complied with (good) or not (bad) ? (binary)

    Acknowledgements:

    Grömping, U. (2019). South German Credit Data: Correcting a Widely Used Data Set. Report 4/2019, Reports in Mathematics, Physics and Chemistry, Department II, Beuth University of Applied Sciences Berlin.

Cite
Michael Kempf; Michael Kempf (2023). A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19 [Dataset]. http://doi.org/10.5281/zenodo.10396148

Data from: A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19

Related Article
Explore at:
2 scholarly articles cite this dataset (View in Google Scholar)
zipAvailable download formats
Dataset updated
Dec 16, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Michael Kempf; Michael Kempf
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered
Dec 16, 2023
Area covered
Levant
Description

Overview

This dataset is the repository for the following paper submitted to Data in Brief:

Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).

The Data in Brief article contains the supplement information and is the related data paper to:

Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).

Description/abstract

The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and currently the escalation of the so-called Israeli-Palestinian Conflict, which strained neighbouring countries like Jordan due to the influx of Syrian refugees and increases population vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.

Folder structure

The main folder after download contains all data, in which the following subfolders are stored are stored as zipped files:

“code” stores the above described 9 code chunks to read, extract, process, analyse, and visualize the data.

“MODIS_merged” contains the 16-days, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.

“mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).

“yield_productivity” contains .csv files of yield information for all countries listed above.

“population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).

“GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.

“built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolders, which contain the raw data and the already processed data. “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5-year intervals, e.g., “Levant_built_up_1975.tif”.

Code structure

1_MODIS_NDVI_hdf_file_extraction.R


This is the first code chunk and refers to the extraction of MODIS data from the .hdf file format. The following packages must be installed, and the raw data must be downloaded using a simple mass downloader, e.g., from Google Chrome. Packages: terra. Download MODIS data after registration from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed 9 October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif file with the indication “NDVI”. Because the study area is quite large, we have to load three spatially different time series and merge them later. Note that the time series are temporally consistent.
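
A minimal sketch of this step, assuming terra's GDAL build can read HDF4 subdatasets; check the NDVI subdataset index or name with terra::sds() on one file first, and replace the placeholder directory with your own.

    library(terra)

    hdf_files <- list.files("your_directory_MODIS", pattern = "\\.hdf$", full.names = TRUE)

    for (f in hdf_files) {
      ndvi <- rast(f, subds = 1)                        # MOD13Q1: NDVI is typically the first subdataset
      out  <- sub("\\.hdf$", "_NDVI.tif", basename(f))  # keep the "NDVI" indication in the file name
      writeRaster(ndvi, file.path("your_directory_MODIS", out), overwrite = TRUE)
    }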


2_MERGE_MODIS_tiles.R


In this code, we load and merge the three different stacks to produce a large and consistent time series of NDVI imagery across the study area. We further use the gtools package to load the files in natural numerical order (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks, from which we merge the first two (stack 1, stack 2) and store the result. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").
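
A sketch of the merge step under the same assumptions as above; gtools::mixedsort keeps the files in natural numeric order, the tile identifiers in the file names are used to split the three stacks, and equal layer counts per tile stack are assumed.

    library(terra)
    library(gtools)

    files  <- mixedsort(list.files("your_directory_MODIS", pattern = "_NDVI\\.tif$", full.names = TRUE))
    stack1 <- rast(files[grep("h20v05", files)])
    stack2 <- rast(files[grep("h21v05", files)])
    stack3 <- rast(files[grep("h21v06", files)])

    merged12   <- merge(stack1, stack2)      # layer-wise merge of the first two tiles
    merged_all <- merge(merged12, stack3)    # then add the third tile

    # create the "merged" folder first, as described above
    for (i in 1:nlyr(merged_all)) {
      writeRaster(merged_all[[i]],
                  file.path("your_directory_MODIS/merged", paste0("NDVI_final_", i, ".tif")),
                  overwrite = TRUE)
    }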


3_CROP_MODIS_merged_tiles.R


Now we want to crop the derived MODIS tiles to our study area. We are using a mask, which is provided as a .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif”. We have now produced single cropped NDVI time series data from MODIS.
The repository provides the already clipped and merged NDVI datasets.


4_TREND_analysis_NDVI.R


Now, we want to perform trend analysis on the derived data. The data we load are tricky, as they contain 16-day return periods across each year for a period of 22 years. Growing season sums contain MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and flag all values with a high confidence level (p < 0.05). Using the ggplot2 package and the melt function from the reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) of value 0.3.
To increase comparability and to understand the amplitude of the trends, z-scores were calculated and plotted, showing the deviation of the values from the mean. This was done for the NDVI values as well as for the GLDAS climate variables as a normalization technique.
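
A compact sketch of the per-pixel trend test on annual growing-season sums; "season_sums_MAM.tif" is a placeholder for a stack with one layer per year (2001-2022), and the repository's chunk additionally handles the 16-day compositing, reclassification, and plotting.

    library(terra)

    season_sums <- rast("season_sums_MAM.tif")   # placeholder: one layer per year
    years <- 2001:2022

    # cell values into memory (fine for a cropped study area), then per-pixel regression
    v <- values(season_sums)                     # matrix: cells in rows, years in columns
    trend <- t(apply(v, 1, function(y) {
      if (sum(!is.na(y)) < 3) return(c(NA, NA))
      s <- summary(lm(y ~ years))
      c(s$coefficients[2, 1], s$coefficients[2, 4])   # slope and p-value
    }))

    slope <- setValues(season_sums[[1]], trend[, 1])
    pval  <- setValues(season_sums[[1]], trend[, 2])
    signif_slope <- ifel(pval < 0.05, slope, NA)        # keep slopes significant at 0.05

    # z-scores of the area-wide seasonal means, as used for comparison with the GLDAS variables
    ndvi_mean <- global(season_sums, "mean", na.rm = TRUE)$mean
    z <- (ndvi_mean - mean(ndvi_mean)) / sd(ndvi_mean)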


5_BUILT_UP_change_raster.R


Let us look at the landcover changes now. We are working with the terra package and get raster data from here: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 3 March 2023, 100 m resolution, global coverage). Here, one can download the temporal coverage that is aimed for and reclassify it using the code after cropping to the individual study area. Here, I summed up the different rasters to characterize the built-up change in continuous values between 1975 and 2022.


6_POPULATION_numbers_plot.R


For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.


7_YIELD_plot.R


In this section, we are using the country productivity data from the supplement in the repository “yield_productivity” (e.g., "Jordan_yield.csv"). Each of the single-country yield datasets is plotted in a ggplot and combined using the patchwork package in R.


8_GLDAS_read_extract_trend


The last code chunk provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed at https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9 October 2023). The raw data come in .nc file format, and individual variables can be extracted by matching [“^variable name”] against the SpatRaster collection. Each time you run the code, this variable name must be adjusted to the variable of interest (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 9 October 2023; or the respective code chunk when reading a .nc file with the ncdf4 package in R), or run print(nc) from the code, or use names() on the SpatRaster collection.
Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
From the processed data, trend analyses are conducted and z-scores are calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. Regarding, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subsets can be calculated as described in the code. Here, 3-month subsets were chosen for the growing seasons, e.g., March-May (MAM), June-August (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year).
From the data, mean values of 48 consecutive years are calculated and trend analyses are performed as described above. In the same way, p-values are extracted and 95 % confidence level values are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, and different spatial extents across the globe due to the availability of the GLDAS variables.
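
A sketch of the GLDAS step: read the monthly .nc files, pull one variable by a name pattern, and crop/mask to the study area. "^Tair" is a hypothetical pattern (near-surface air temperature); check names() or print(nc) as described above, and adjust paths and years to your download.

    library(terra)

    nc_files <- gtools::mixedsort(list.files("GLDAS", pattern = "\\.nc4?$", full.names = TRUE))

    read_var <- function(f, pattern) {
      r <- rast(f)                      # all variables stored in one monthly file
      r[[grep(pattern, names(r))]]      # keep only the requested variable
    }
    tair <- do.call(c, lapply(nc_files, read_var, pattern = "^Tair"))

    levant <- vect("MERGED_LEVANT.shp")
    tair_levant <- mask(crop(tair, levant), levant)

    # monthly layers -> annual means (use sums for rainfall, as noted above)
    yrs <- rep(1975:2022, each = 12)[seq_len(nlyr(tair_levant))]
    tair_annual <- tapp(tair_levant, index = yrs, fun = mean)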
