Overview: 142: Areas used for sports, leisure and recreation purposes.

Traceability (lineage): This dataset was produced with a machine learning framework using several input datasets, specified in detail in Witjes et al., 2022 (in review; preprint available at https://doi.org/10.21203/rs.3.rs-561383/v3).

Scientific methodology: The single-class probability layers were generated with a spatiotemporal ensemble machine learning framework detailed in Witjes et al., 2022 (in review; preprint available at https://doi.org/10.21203/rs.3.rs-561383/v3). The single-class uncertainty layers were calculated by taking the standard deviation of the three single-class probabilities predicted by the three components of the ensemble. The HCL (hard class) layer represents the class with the highest probability as predicted by the ensemble.

Usability: The HCL layers have a decreasing average accuracy (weighted F1-score) at each subsequent level of the CLC hierarchy: 0.83 at level 1 (5 classes), 0.63 at level 2 (14 classes), and 0.49 at level 3 (43 classes). This means the hard-class maps are more reliable when classes are aggregated to a higher level of the hierarchy (e.g. 'Discontinuous Urban Fabric' and 'Continuous Urban Fabric' to 'Urban Fabric'). For some classes that were overshadowed by unequal sample point distributions, the single-class probabilities may represent actual patterns more closely than the hard-class map. Users are encouraged to set their own probability thresholds when postprocessing these datasets to optimize accuracy for their specific use case.

Uncertainty quantification: Uncertainty is quantified by taking the standard deviation of the probabilities predicted by the three components of the spatiotemporal ensemble model.

Data validation approaches: The LULC classification was validated through spatial 5-fold cross-validation, as detailed in the accompanying publication.

Completeness: The dataset has chunks of empty predictions in regions with complex coastlines (e.g. the Zeeland province in the Netherlands and the Mar da Palha bay area in Portugal). These are artifacts that will be avoided in subsequent versions of the LULC product.

Consistency: The accuracy of the predictions was compared per year and per 30 km x 30 km tile across Europe to derive temporal and spatial consistency by calculating the standard deviation. The standard deviation of the annual weighted F1-score was 0.135, while the standard deviation of the weighted F1-score per tile was 0.150. This means the dataset is more consistent through time than through space: predictions are notably less accurate along the Mediterranean coast. The accompanying publication contains additional information and visualisations.

Positional accuracy: The raster layers have a resolution of 30m, identical to that of the Landsat data cube used as input features for the machine learning framework that predicted them.

Temporal accuracy: The dataset contains prediction and uncertainty layers for each year between 2000 and 2019.

Thematic accuracy: The maps reproduce the Corine Land Cover (CLC) classification system, a hierarchical legend that consists of 5 classes at the highest level, 14 classes at the second level, and 44 classes at the third level. Class 523 (Oceans) was omitted due to computational constraints.
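CLC class codes encode the hierarchy in their digits (level-3 code 142 falls under level-2 class 14 and level-1 class 1), so aggregating hard-class codes to a coarser level is plain digit truncation. A minimal sketch; the function name is ours, not part of the dataset tooling:

```python
def clc_aggregate(code: int, level: int) -> int:
    """Truncate a CLC level-3 class code (e.g. 142) to a coarser level.

    Level 1 keeps the first digit, level 2 the first two digits,
    level 3 all three, so aggregation is digit truncation.
    """
    if not 1 <= level <= 3:
        raise ValueError("CLC levels are 1, 2 or 3")
    return int(str(code)[:level])

# 'Continuous Urban Fabric' (111) and 'Discontinuous Urban Fabric' (112)
# both collapse to 'Urban fabric' (11) at level 2.
```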
We provide instructions, code, and datasets for replicating the article by Kim, Lee and McCulloch (2024), "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews." This repository provides a user-friendly R package for researchers or practitioners to apply the topic-based segmentation model with unstructured texts (latent class regression with group variable selection) to their own datasets.

First, we provide R code to replicate the illustrative simulation study: see file 1. Second, we provide the user-friendly R package with a very simple example to help apply the model to real-world datasets: see file 2, Package_MixtureRegression_GroupVariableSelection.R and Dendrogram.R. Third, we provide a set of codes and instructions to replicate the empirical studies of customer-level and restaurant-level segmentation with Yelp reviews data: see files 3-a, 3-b, 4-a, 4-b. Note that, due to Yelp's dataset terms of use and data size restrictions, we instead provide the link to download the same Yelp datasets (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/versions/6). Fourth, we provide a set of codes and datasets to replicate the empirical study with professor ratings reviews data: see file 5. More details are given in the description text and comments of each file.

[A guide on how to use the code to reproduce each study in the paper]

1. Full codes for replicating Illustrative simulation study.txt -- [see Table 2 and Figure 2 in main text]: R source code to replicate the illustrative simulation study. Please run it from beginning to end in R. In addition to estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships, you will get the dendrograms of selected groups of variables shown in Figure 2. Computing time is approximately 20 to 30 minutes.

3-a. Preprocessing raw Yelp Reviews for Customer-level Segmentation.txt: Code for preprocessing the downloaded unstructured Yelp review data and preparing the DV and IV matrices for the customer-level segmentation study.

3-b. Instruction for replicating Customer-level Segmentation analysis.txt -- [see Table 10 in main text; Tables F-1, F-2, and F-3 and Figure F-1 in Web Appendix]: Code for replicating the customer-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships. Computing time is approximately 3 to 4 hours.

4-a. Preprocessing raw Yelp reviews_Restaruant Segmentation (1).txt: R code for preprocessing the downloaded unstructured Yelp data and preparing the DV and IV matrices for the restaurant-level segmentation study.

4-b. Instructions for replicating restaurant-level segmentation analysis.txt -- [see Tables 5, 6 and 7 in main text; Tables E-4 and E-5 and Figure H-1 in Web Appendix]: Code for replicating the restaurant-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships. Computing time is approximately 10 to 12 hours.

[Guidelines for running benchmark models in Table 6]

Unsupervised topic model: 'topicmodels' package in R -- after determining the number of topics (e.g., with the 'ldatuning' R package), run the 'LDA' function in the 'topicmodels' package. Then compute topic probabilities per restaurant (with the 'posterior' function in the package), which can be used as predictors. Then conduct prediction with regression.

Hierarchical topic model (HDP): 'gensimr' R package -- 'model_hdp' function for identifying topics (see https://radimrehurek.com/gensim/models/hdpmodel.html or https://gensimr.news-r.org/).

Supervised topic model: 'lda' R package -- 'slda.em' function for training and 'slda.predict' for prediction.

Aggregate regression: 'lm' default function in R.

Latent class regression without variable selection: 'flexmix' function in the 'flexmix' R package. Run flexmix with a chosen number of segments (e.g., 3 segments in this study). Then, with the estimated coefficients and memberships, predict the dependent variable for each segment.

Latent class regression with variable selection: 'Unconstraind_Bayes_Mixture' function in Kim, Fong and DeSarbo (2012)'s package. Run the Kim et al. (2012) model with a chosen number of segments (e.g., 3 segments in this study). Then, with the estimated coefficients and memberships, predict the dependent variables for each segment. The same R package ('KimFongDeSarbo2012.zip') can be downloaded at: https://sites.google.com/scarletmail.rutgers.edu/r-code-packages/home

5. Instructions for replicating Professor ratings review study.txt -- [see Tables G-1, G-2, G-4 and G-5, and Figures G-1 and H-2 in Web Appendix]: Code to replicate the professor ratings reviews study. Computing time is approximately 10 hours.

[A list of the versions of R, packages, and computer...
Three datasets are available, each consisting of 15 csv files. Each file contains the voxelised shower information obtained from single particles produced at the front of the calorimeter in the |η| range 0.2-0.25, simulated in the ATLAS detector. Two datasets contain photon events with different statistics; the larger sample has about 10 times the number of events of the other. The third dataset contains pions. The pion dataset and the lower-statistics photon dataset were used to train the corresponding two GANs presented in the AtlFast3 paper SIMU-2018-04.
The information in each file is a table; the rows correspond to the events and the columns to the voxels. The voxelisation procedure is described in the AtlFast3 paper linked above and in the dedicated PUB note ATL-SOFT-PUB-2020-006. In summary, the detailed energy deposits produced by ATLAS were converted from x,y,z coordinates to local cylindrical coordinates defined around the particle 3-momentum at the entrance of the calorimeter. The energy deposits in each layer were then grouped in voxels and for each voxel the energy was stored in the csv file. For each particle, there are 15 files corresponding to the 15 energy points used to train the GAN. The name of the csv file defines both the particle and the energy of the sample used to create the file.
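Reading one of these files into an events-by-voxels table can be sketched as follows, assuming plain comma-separated numeric rows with no header (the exact layout should be checked against the files themselves; the function names are ours):

```python
import csv

def load_showers(path):
    """Read a voxelised-shower CSV: one row per event, one column per voxel energy."""
    with open(path, newline="") as f:
        return [[float(v) for v in row] for row in csv.reader(f) if row]

def total_energy(event):
    """Total deposited energy of one event: the sum over its voxel energies."""
    return sum(event)
```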
The size of the voxels is described in the binning.xml file. Software tools to read the XML file and manipulate the spatial information of voxels are provided in the FastCaloGAN repository.
Updated on February 10th 2022. A new dataset photons_samples_highStat.tgz was added to this record and the binning.xml file was updated accordingly.
Updated on April 18th 2023. A new dataset pions_samples_highStat.tgz was added to this record.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Human Activity Recognition (HAR) refers to the capacity of machines to perceive human actions. This dataset contains information on 18 different activities collected from 90 participants (75 male and 15 female) using smartphone sensors (Accelerometer and Gyroscope). It has 1945 raw activity samples collected directly from the participants, and 20750 subsamples extracted from them.
1. Raw_time_domian_data.zip ➞ The originally collected 1945 time-domain samples in separate .csv files. The arrangement of information in each .csv file is:
Col. 1, 5 ➞ exact time (elapsed since the start) when the Accelerometer (col. 1) and Gyroscope (col. 5) outputs were recorded (in ms)
Col. 2, 3, 4 ➞ acceleration along the X, Y, Z axes (in m/s^2)
Col. 6, 7, 8 ➞ rate of rotation around the X, Y, Z axes (in rad/s)
2. Trimmed_interpolated_raw_data.zip ➞ Unnecessary parts of the samples (only at the beginning and the end) were trimmed, and the samples were interpolated to keep a constant sampling rate of 100 Hz. The arrangement of information is the same as above.
3. Time_domain_subsamples.zip ➞ 20750 subsamples extracted from the 1945 collected samples, provided in a single .csv file. Each subsample contains 3 seconds of non-overlapping data of the corresponding activity. Arrangement of information:
Col. 1-300, 301-600, 601-900 ➞ Accelerometer X, Y, Z axes readings
Col. 901-1200, 1201-1500, 1501-1800 ➞ Gyroscope X, Y, Z axes readings
Col. 1801 ➞ class ID (0 to 17, in the order mentioned above)
Col. 1802 ➞ length of each channel's data in the subsample
Col. 1803 ➞ serial number of the subsample
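The fixed column layout above can be unpacked per row with simple slicing; a minimal sketch (the channel names are ours, not from the dataset):

```python
CHANNELS = ["acc_x", "acc_y", "acc_z", "gyro_x", "gyro_y", "gyro_z"]

def split_subsample(row):
    """Split one 1803-column row into six 300-sample channels plus metadata.

    Cols 1-1800 hold the six channels (300 samples each); col 1801 is the
    class ID, col 1802 the channel length, col 1803 the subsample serial no.
    """
    signals = {name: row[i * 300:(i + 1) * 300] for i, name in enumerate(CHANNELS)}
    class_id, length, serial = row[1800], row[1801], row[1802]
    return signals, int(class_id), int(length), int(serial)
```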
Gravity acceleration was omitted from the Accelerometer data, and no filter was applied to remove noise. The dataset is free to download, modify, and use provided that the source and the associated article are properly referenced.
Use the .csv file in Time_domain_subsamples.zip for instant HAR classification tasks. See this notebook for details. Use the other files if you want to work with raw activity data.
More information is provided in the following data paper. Please cite it if you use this dataset in your research/work: [1] N. Sikder and A.-A. Nahid, "KU-HAR: An open dataset for heterogeneous human activity recognition," Pattern Recognition Letters, vol. 146, pp. 46-54, Jun. 2021, doi: 10.1016/j.patrec.2021.02.024.
[2] N. Sikder, M. A. R. Ahad, and A.-A. Nahid, “Human Action Recognition Based on a Sequential Deep Learning Model,” 2021 Joint 10th International Conference on Informatics, Electronics & Vision (ICIEV) and 2021 5th International Conference on Imaging, Vision & Pattern Recognition (icIVPR). IEEE, Aug. 16, 2021. doi: 10.1109/icievicivpr52578.2021.9564234.
Cite the dataset as: A.-A. Nahid, N. Sikder, and I. Rafi, “KU-HAR: An Open Dataset for Human Activity Recognition.” Mendeley, Feb. 16, 2021, doi: 10.17632/45F952Y38R.5
Supplementary files: https://drive.google.com/drive/folders/1yrG8pwq3XMlyEGYMnM-8xnrd6js0oXA7
The dataset is originally hosted on Mendeley Data
The image used in the banner is collected from here and attributed as: Fit, athletic man getting ready for a run by Jacob Lund from Noun Projects
Overview: Actual Natural Vegetation (ANV): probability of occurrence for the Pedunculate oak in its realized environment for the period 2000 - 2020.

Traceability (lineage): This is an original dataset produced with a machine learning framework which used a combination of point datasets and raster datasets as inputs. The point dataset is a harmonized collection of tree occurrence data, comprising observations from National Forest Inventories (EU-Forest), GBIF and LUCAS. The complete dataset is available on Zenodo. The raster datasets used as input are: harmonized and gapfilled time series of seasonal aggregates of the Landsat GLAD ARD dataset (bands and spectral indices); monthly time series of air and surface temperature and precipitation from a reprocessed version of the Copernicus ERA5 dataset; long-term averages of bioclimatic variables from CHELSA; tree species distribution maps from the European Atlas of Forest Tree Species; elevation, slope and other elevation-derived metrics; and long-term monthly averages of snow probability and of cloud fraction from MODIS. For a more comprehensive list refer to Bonannella et al. (2022) (in review; preprint available at: https://doi.org/10.21203/rs.3.rs-1252972/v1).

Scientific methodology: Probability and uncertainty maps are the output of a spatiotemporal ensemble machine learning framework based on stacked generalization. Three base models (random forest, gradient-boosted trees and generalized linear models) were first trained on the input dataset, and their predictions were used to train an additional model (logistic regression) which provided the final predictions. More details on the whole workflow are available in the listed publication.

Usability: Probability maps can be used to detect potential forest degradation and compositional change across the time period analyzed. Some possible applications for these topics are explained in the listed publication.

Uncertainty quantification: Uncertainty is quantified by taking the standard deviation of the probabilities predicted by the three components of the spatiotemporal ensemble model.

Data validation approaches: Distribution maps were validated using spatial 5-fold cross-validation following the workflow detailed in the listed publication.

Completeness: The raster files fully cover the Geo-harmonizer region as defined by the landmask raster dataset available here.

Consistency: Areas outside the calibration area of the point dataset (Iceland, Norway) usually have high uncertainty values. This is not only a problem of extrapolation but also of poor representation, in the feature space available to the model, of the conditions present in these countries.

Positional accuracy: The rasters have a spatial resolution of 30m.

Temporal accuracy: The maps cover the period 2000 - 2020; each map covers a certain number of years according to the following scheme: (1) 2000--2002, (2) 2002--2006, (3) 2006--2010, (4) 2010--2014, (5) 2014--2018 and (6) 2018--2020.

Thematic accuracy: Both probability and uncertainty maps contain values from 0 to 100: probability maps indicate the probability of occurrence of a single individual of the target species, while uncertainty maps indicate the standard deviation of the ensemble model.
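The per-pixel uncertainty described above (standard deviation across the three base-model probabilities) can be sketched as below; using the population standard deviation is our assumption, and the publication should be consulted for the exact estimator:

```python
import statistics

def ensemble_uncertainty(p_rf, p_gbt, p_glm):
    """Per-pixel uncertainty: std. dev. of the three base-model probabilities.

    Each argument is a sequence of occurrence probabilities (one per pixel)
    from random forest, gradient-boosted trees and the GLM, respectively.
    """
    return [statistics.pstdev(triple) for triple in zip(p_rf, p_gbt, p_glm)]
```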
This dataset was collected to train a human activity recognition (HAR) model using data from inertial sensors. The goal is to classify daily activities based on linear acceleration and angular velocity data.
This dataset was gathered using a self-developed wearable device for academic purposes (PBL project at DUT). Due to the specific setup and limited participant diversity, the data might not generalize well to broader populations.
Activities (with corresponding label)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Urban Sound & Sight (Urbansas):
Version 1.0, May 2022
Created by
Magdalena Fuentes (1, 2), Bea Steers (1, 2), Pablo Zinemanas (3), Martín Rocamora (4), Luca Bondi (5), Julia Wilkins (1, 2), Qianyi Shi (2), Yao Hou (2), Samarjit Das (5), Xavier Serra (3), Juan Pablo Bello (1, 2)
1. Music and Audio Research Lab, New York University
2. Center for Urban Science and Progress, New York University
3. Universitat Pompeu Fabra, Barcelona, Spain
4. Universidad de la República, Montevideo, Uruguay
5. Bosch Research, Pittsburgh, PA, USA
Publication
If using this data in academic work, please cite the following paper, which presented this dataset:
M. Fuentes, B. Steers, P. Zinemanas, M. Rocamora, L. Bondi, J. Wilkins, Q. Shi, Y. Hou, S. Das, X. Serra, J. Bello. “Urban Sound & Sight: Dataset and Benchmark for Audio-Visual Urban Scene Understanding”. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.
Description
Urbansas is a dataset for the development and evaluation of machine listening systems for audiovisual spatial urban understanding. One of the main challenges in this field of study is the lack of realistic, labeled data for training and evaluating models on their ability to localize sound sources using a combination of audio and video.
We set four main goals for creating this dataset:
1. To compile a set of real-field audio-visual recordings;
2. The recordings should be stereo to allow exploring sound localization in the wild;
3. The compilation should be varied in terms of scenes and recording conditions to be meaningful for training and evaluation of machine learning models;
4. The labeled collection should be accompanied by a bigger unlabeled collection with similar characteristics to allow exploring self-supervised learning in urban contexts.
Audiovisual data
We have compiled and manually annotated Urbansas from two publicly available datasets, plus the addition of unreleased material. The public datasets are the TAU Urban Audio-Visual Scenes 2021 Development dataset (street-traffic subset) and the Montevideo Audio-Visual Dataset (MAVD):
Wang, Shanshan, et al. "A curated dataset of urban scenes for audio-visual scene analysis." ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.
Zinemanas, Pablo, Pablo Cancela, and Martín Rocamora. "MAVD: A dataset for sound event detection in urban environments." Detection and Classification of Acoustic Scenes and Events, DCASE 2019, New York, NY, USA, 25–26 oct, page 263--267 (2019).
The TAU dataset consists of 10-second segments of audio and video from different scenes across European cities, traffic being one of them. Only the scenes labeled as traffic were included in Urbansas. MAVD is an audio-visual traffic dataset curated in different locations of Montevideo, Uruguay, with annotations of vehicles and vehicle-component sounds (e.g. engine, brakes) for sound event detection. Besides the published datasets, we include a total of 9.5 hours of unpublished material recorded in Montevideo with the same recording devices as MAVD, but covering new locations and scenes.
Recordings for TAU were acquired using a GoPro Hero 5 (30fps, 1280x720) and a Soundman OKM II Klassik/studio A3 electret binaural in-ear microphone with a Zoom F8 audio recorder (48kHz, 24 bits, stereo). Recordings for MAVD were collected using a GoPro Hero 3 (24fps, 1920x1080) and a SONY PCM-D50 recorder (48kHz, 24 bits, stereo).
Compiled into Urbansas, this amounts to 15 hours of stereo audio and video, stored in separate 10-second MPEG4 (1280x720, 24fps) and WAV (48kHz, 24-bit, 2-channel) files. Both released video datasets are already anonymized to obscure people and license plates; the unpublished MAVD data was anonymized similarly using this anonymizer. We also distribute the 2fps video used for producing the annotations.
The audio and video files both share the same filename stem, meaning that they can be associated after removing the parent directory and extension.
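Pairing the audio and video files by shared filename stem can be done with pathlib; a minimal sketch (the example paths in the test are invented for illustration):

```python
from pathlib import Path

def pair_by_stem(audio_files, video_files):
    """Match WAV and MP4 paths whose filenames share the same stem."""
    videos = {Path(v).stem: v for v in video_files}
    return {
        Path(a).stem: (a, videos[Path(a).stem])
        for a in audio_files
        if Path(a).stem in videos
    }
```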
MAVD:
video/
TAU:
video/
where location_id in both cases includes the city and an ID number.
city         places  clips  mins  frames   labeled mins
Montevideo        8   4085   681  980400             92
Stockholm         3     91    15   21840              2
Barcelona         4    144    24   34560             24
Helsinki          4    144    24   34560             16
Lisbon            4    144    24   34560             19
Lyon              4    144    24   34560              6
Paris             4    144    24   34560              2
Prague            4    144    24   34560              2
Vienna            4    144    24   34560              6
London            5    144    24   34560              4
Milan             6    144    24   34560              6
Total            50   5472   912    1.3M            180
Annotations
Of the 15 hours of audio and video, 3 hours (1.5 hours TAU, 1.5 hours MAVD) were manually annotated by our team in both audio and image, while the remaining 12 hours (2.5 hours TAU, 9.5 hours of unpublished material) are provided unlabeled for the benefit of unsupervised models. The distribution of clips across locations was selected to maximize variance across different scenes. The annotations were collected at 2 frames per second (fps), which provided a balance between temporal granularity and clip coverage.
The annotation data is contained in video_annotations.csv and audio_annotations.csv.
Video Annotations
Each row in the video annotations represents a single object in a single frame of the video. The annotation schema is as follows:
Audio Annotations
Each row represents a single object instance, along with the time range that it exists within the clip. The annotation schema is as follows:
Conditions of use
Dataset created by Magdalena Fuentes, Bea Steers, Pablo Zinemanas, Martín Rocamora, Luca Bondi, Julia Wilkins, Qianyi Shi, Yao Hou, Samarjit Das, Xavier Serra, and Juan Pablo Bello.
The Urbansas dataset is offered free of charge under the following terms:
Feedback
Please help us improve Urbansas by sending your feedback to:
In case of a problem, please include as many details as possible.
Acknowledgments
This work was partially supported by the National Science
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Site Description:
In this dataset there are seventeen production crop fields in Bulgaria, where winter rapeseed and wheat were grown, and two research fields in France, where winter wheat – rapeseed – barley – sunflower and winter wheat – irrigated maize crop rotations are used. The full description of these fields is in the database "In-situ crop phenology dataset from sites in Bulgaria and France" (doi.org/10.5281/zenodo.7875440).
Methodology and Data Description:
Remote sensing data is extracted from Sentinel-2 tiles 35TNJ for Bulgarian sites and 31TCJ for French sites on the day of the overpass since September 2015 for Sentinel-2 derived vegetation indices and since October 2016 for HR-VPP products. To suppress spectral mixing effects at the parcel boundaries, as highlighted by Meier et al., 2020, the values from all datasets were subgrouped per field and then aggregated to a single median value for further analysis.
Sentinel-2 data was downloaded for all test sites from CREODIAS (https://creodias.eu/) at L2A processing level, using a maximum scene-wide cloud cover threshold of 75%. Scenes before 2017 were available at L1C processing level only; these were corrected for atmospheric effects after download using Sen2Cor (v2.9) with default settings, the same version used for the L2A scenes obtained directly from CREODIAS.
Next, the data was extracted from the Sentinel-2 scenes for each field parcel where only SCL classes 4 (vegetation) and 5 (bare soil) pixels were kept. We resampled the 20m band B8A to match the spatial resolution of the green and red band (10m) using nearest neighbor interpolation. The entire image processing chain was carried out using the open-source Python Earth Observation Data Analysis Library (EOdal) (Graf et al., 2022).
Apart from the widely used Normalized Difference Vegetation Index (NDVI) and Enhanced Vegetation Index (EVI), we included two recently proposed indices that were reported to have a higher correlation with photosynthesis and drought response of vegetation: These were the Near-Infrared Reflection of Vegetation (NIRv) (Badgley et al., 2017) and Kernel NDVI (kNDVI) (Camps-Valls et al., 2021). We calculated the vegetation indices in two different ways:
First, we used B08 as near-infrared (NIR) band which comes in a native spatial resolution of 10 m. B08 (central wavelength 833 nm) has a relatively coarse spectral resolution with a bandwidth of 106 nm.
Second, we used B8A which is available at 20 m spatial resolution. B8A differs from B08 in its central wavelength (864 nm) and has a narrower bandwidth (21 nm or 22 nm in the case of Sentinel-2A and 2B, respectively) compared to B08.
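The four indices can be computed per pixel from the red, NIR and blue reflectances. The formulas below are the standard definitions from the cited papers (kNDVI in its simplified tanh(NDVI²) form per Camps-Valls et al., 2021), not something prescribed by this dataset, and the function is a sketch of ours:

```python
from math import tanh

def vegetation_indices(red, nir, blue):
    """Standard per-pixel index formulas from surface reflectance in [0, 1]."""
    ndvi = (nir - red) / (nir + red)
    evi = 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)
    nirv = ndvi * nir              # Badgley et al. (2017)
    kndvi = tanh(ndvi ** 2)        # Camps-Valls et al. (2021), simplified kernel
    return {"NDVI": ndvi, "EVI": evi, "NIRv": nirv, "kNDVI": kndvi}
```

Either B08 or B8A can be passed as the NIR reflectance, matching the two variants described above.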
The High Resolution Vegetation Phenology and Productivity (HR-VPP) dataset from the Copernicus Land Monitoring Service (CLMS) comprises three sets of 10 m Sentinel-2 products: vegetation indices, vegetation phenology and productivity parameters, and seasonal trajectories (Tian et al., 2021). The vegetation indices, Normalized Difference Vegetation Index (NDVI) and Plant Phenology Index (PPI), and the plant parameters, Fraction of Absorbed Photosynthetically Active Radiation (FAPAR) and Leaf Area Index (LAI), were computed for the time of Sentinel-2 overpass by the data provider.
NDVI is computed directly from B04 and B08, and PPI is computed using the Difference Vegetation Index (DVI = B08 - B04) and its seasonal maximum value per pixel. FAPAR and LAI are retrieved from B03, B04 and B08 with a neural network trained on PROSAIL model simulations. The dataset has a quality flag product (QFLAG2), a 16-bit flag that extends the scene classification band (SCL) of the Sentinel-2 Level-2 products. A "medium" filter was used to mask out QFLAG2 values from 2 to 1022, leaving land pixels (bit 1) within or outside cloud proximity (bits 11 and 13) or cloud shadow proximity (bits 12 and 14).
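Under one reading of the "medium" filter (1-based bit numbering, so "bit 1" is value 1 and bits 2 to 10 sum to 1022; this interpretation should be verified against the HR-VPP user manual), the per-pixel keep/mask decision looks like:

```python
LAND_BIT = 0b1                # bit 1: land pixel
REJECT_MASK = 0b1111111110    # bits 2-10, i.e. values 2..1022, masked out
# Bits 11-14 (cloud / cloud-shadow proximity) are tolerated by the filter.

def keep_pixel(qflag2: int) -> bool:
    """Keep a pixel only if it is land and none of bits 2-10 are set."""
    return bool(qflag2 & LAND_BIT) and not (qflag2 & REJECT_MASK)
```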
The HR-VPP daily raw vegetation indices products are described in detail in the user manual (Smets et al., 2022) and the computations details of PPI are given by Jin and Eklundh (2014). Seasonal trajectories refer to the 10-daily smoothed time-series of PPI used for vegetation phenology and productivity parameters retrieval with TIMESAT (Jönsson and Eklundh 2002, 2004).
HR-VPP data was downloaded through the WEkEO Copernicus Data and Information Access Services (DIAS) system with a Python 3.8.10 harmonized data access (HDA) API 0.2.1. Zonal statistics [’min’, ’max’, ’mean’, ’median’, ’count’, ’std’, ’majority’] were computed on non-masked pixel values within field boundaries with rasterstats Python package 0.17.00.
The Start of season date (SOSD), end of season date (EOSD) and length of seasons (LENGTH) were extracted from the annual Vegetation Phenology and Productivity Parameters (VPP) dataset as an additional source for comparison. These data are a product of the Vegetation Phenology and Productivity Parameters, see (https://land.copernicus.eu/pan-european/biophysical-parameters/high-resolution-vegetation-phenology-and-productivity/vegetation-phenology-and-productivity) for detailed information.
File Description:
4 datasets:
1_senseco_data_S2_B08_Bulgaria_France; 1_senseco_data_S2_B8A_Bulgaria_France; 1_senseco_data_HR_VPP_Bulgaria_France; 1_senseco_data_phenology_VPP_Bulgaria_France
3 metadata:
2_senseco_metadata_S2_B08_B8A_Bulgaria_France; 2_senseco_metadata_HR_VPP_Bulgaria_France; 2_senseco_metadata_phenology_VPP_Bulgaria_France
The dataset files "1_senseco_data_S2_B08_Bulgaria_France" and "1_senseco_data_S2_B8A_Bulgaria_France" concern all vegetation index (EVI, NDVI, kNDVI, NIRv) data values and related information; the metadata file "2_senseco_metadata_S2_B08_B8A_Bulgaria_France" describes all the existing variables. Both data files have the same column variable names, and for that reason they share the same metadata file.

The dataset file "1_senseco_data_HR_VPP_Bulgaria_France" concerns the vegetation indices (NDVI, PPI) and plant parameters (LAI, FAPAR) data values and related information; the metadata file "2_senseco_metadata_HR_VPP_Bulgaria_France" describes all the existing variables.

The dataset file "1_senseco_data_phenology_VPP_Bulgaria_France" concerns the vegetation phenology and productivity parameter (LENGTH, SOSD, EOSD) values and related information; the metadata file "2_senseco_metadata_phenology_VPP_Bulgaria_France" describes all the existing variables.
Bibliography
G. Badgley, C.B. Field, J.A. Berry, Canopy near-infrared reflectance and terrestrial photosynthesis, Sci. Adv. 3 (2017) e1602244. https://doi.org/10.1126/sciadv.1602244.
G. Camps-Valls, M. Campos-Taberner, Á. Moreno-Martínez, S. Walther, G. Duveiller, A. Cescatti, M.D. Mahecha, J. Muñoz-Marí, F.J. García-Haro, L. Guanter, M. Jung, J.A. Gamon, M. Reichstein, S.W. Running, A unified vegetation index for quantifying the terrestrial biosphere, Sci. Adv. 7 (2021) eabc7447. https://doi.org/10.1126/sciadv.abc7447.
L.V. Graf, G. Perich, H. Aasen, EOdal: An open-source Python package for large-scale agroecological research using Earth Observation and gridded environmental data, Comput. Electron. Agric. 203 (2022) 107487. https://doi.org/10.1016/j.compag.2022.107487.
H. Jin, L. Eklundh, A physically based vegetation index for improved monitoring of plant phenology, Remote Sens. Environ. 152 (2014) 512–525. https://doi.org/10.1016/j.rse.2014.07.010.
P. Jonsson, L. Eklundh, Seasonality extraction by function fitting to time-series of satellite sensor data, IEEE Trans. Geosci. Remote Sens. 40 (2002) 1824–1832. https://doi.org/10.1109/TGRS.2002.802519.
P. Jönsson, L. Eklundh, TIMESAT—a program for analyzing time-series of satellite sensor data, Comput. Geosci. 30 (2004) 833–845. https://doi.org/10.1016/j.cageo.2004.05.006.
J. Meier, W. Mauser, T. Hank, H. Bach, Assessments on the impact of high-resolution-sensor pixel sizes for common agricultural policy and smart farming services in European regions, Comput. Electron. Agric. 169 (2020) 105205. https://doi.org/10.1016/j.compag.2019.105205.
B. Smets, Z. Cai, L. Eklund, F. Tian, K. Bonte, R. Van Hoost, R. Van De Kerchove, S. Adriaensen, B. De Roo, T. Jacobs, F. Camacho, J. Sánchez-Zapero, S. Else, H. Scheifinger, K. Hufkens, P. Jönsson, HR-VPP Product User Manual Vegetation Indices, 2022.
F. Tian, Z. Cai, H. Jin, K. Hufkens, H. Scheifinger, T. Tagesson, B. Smets, R. Van Hoolst, K. Bonte, E. Ivits, X. Tong, J. Ardö, L. Eklundh, Calibrating vegetation phenology from Sentinel-2 using eddy covariance, PhenoCam, and PEP725 networks across Europe, Remote Sens. Environ. 260 (2021) 112456. https://doi.org/10.1016/j.rse.2021.112456.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset provides a collection of behavioural biometrics data (commonly known as Keyboard, Mouse and Touchscreen (KMT) dynamics). The data were collected for use in a FinTech research project undertaken by academics and researchers at the Computer Science Department, Edge Hill University, United Kingdom. The project, called CyberSIgnature, uses KMT dynamics data to distinguish between legitimate card owners and fraudsters. An application was developed with a graphical user interface (GUI) similar to a standard online card payment form, including fields for card type, name, card number, card verification code (cvc) and expiry date. User KMT dynamics were then captured while they entered fictitious card information into the GUI application.
The dataset consists of 1,760 KMT dynamics instances collected over 88 user sessions on the GUI application. Each user session involves 20 iterations of data entry: the user is assigned a fictitious card (drawn at random from a pool) to enter 10 times, and is subsequently presented with 10 additional cards, each to be entered once. The 10 additional cards are drawn from a pool that has been assigned, or is to be assigned, to other users. One KMT data instance is collected during each data-entry iteration; thus, a total of 20 KMT data instances (10 legitimate and 10 illegitimate) was collected during each user session on the GUI application.
The raw dataset is stored in .json format within 88 separate files. The root folder, named `behaviour_biometrics_dataset`, consists of two sub-folders, `raw_kmt_dataset` and `feature_kmt_dataset`, and a Jupyter notebook file (`kmt_feature_classification.ipynb`). Their contents are described below:
-- `raw_kmt_dataset`: this folder contains 88 files, each named `raw_kmt_user_n.json`, where n is a number from 0001 to 0088. Each file contains 20 instances of KMT dynamics data corresponding to a given fictitious card; the data instances are equally split between legitimate (n = 10) and illegitimate (n = 10) classes. The legitimate class corresponds to KMT dynamics captured from the user assigned to the card detail, while the illegitimate class corresponds to KMT dynamics collected from other users entering the same card detail.
-- `feature_kmt_dataset`: this folder contains two sub-folders, `feature_kmt_json` and `feature_kmt_xlsx`. Each contains 88 files (of the relevant format: .json or .xlsx), each named `feature_kmt_user_n`, where n is a number from 0001 to 0088. Each file contains 20 instances of features extracted from the corresponding `raw_kmt_user_n` file, including the class labels (legitimate = 1 or illegitimate = 0).
-- `kmt_feature_classification.ipynb`: this file contains the Python code necessary to generate features from the raw KMT files and apply a simple machine learning classification task to generate results. The code is designed to run with minimal effort from the user.
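The kinds of timing features typically derived from keyboard dynamics can be sketched as follows. This is a minimal illustration only: the event format, field names, and feature choices are assumptions for the sketch, not the dataset's actual schema (which is defined in the notebook and JSON files above).

```python
# Illustrative keystroke-timing feature extraction.
# Hypothetical event format: each event is (key, press_time_s, release_time_s).
def extract_features(events):
    """Compute dwell times (hold duration per key) and flight times
    (gap between releasing one key and pressing the next)."""
    dwell = [release - press for _, press, release in events]
    flight = [events[i + 1][1] - events[i][2] for i in range(len(events) - 1)]
    return {
        "mean_dwell": sum(dwell) / len(dwell),
        "mean_flight": sum(flight) / len(flight) if flight else 0.0,
    }

# Example: four keystrokes of a card number with a distinct rhythm
events = [("4", 0.00, 0.08), ("1", 0.20, 0.27),
          ("1", 0.40, 0.48), ("1", 0.65, 0.71)]
feats = extract_features(events)
```

Per-user statistics of such dwell/flight times are what allow a classifier to separate the legitimate card owner's typing rhythm from other users entering the same card details.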
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The GLC_FCS10 is a novel global 10 m land-cover product with a fine classification system containing 30 land-cover types. It is generated by a hierarchical land-cover mapping framework from Sentinel-1 and Sentinel-2 time-series imagery; the methodology is described in detail in the corresponding algorithm article.
The GLC_FCS10 can be viewed interactively at https://zhangxiao-glcproj.users.earthengine.app/view/glcfcs102023maps.
Due to the huge data volume of this product, the GLC_FCS10 has been compressed into 36 zip files; the land-cover class IDs used in the GLC_FCS10 can be found in the algorithm article.
We present an optical and infrared (IR) study of IC 10 X-2, a high-mass X-ray binary in the galaxy IC 10. Previous optical and X-ray studies suggest that X-2 is a Supergiant Fast X-ray Transient, having exhibited a large-amplitude (factor of ~100), short-duration (hours to weeks) X-ray outburst on 2010 May 21. We analyze R- and g-band light curves of X-2 from the intermediate Palomar Transient Factory taken between 2013 July 15 and 2017 February 14 that show high-amplitude (>~1 mag), short-duration flares. Near-IR spectroscopy of X-2 from Palomar/TripleSpec shows He I, Paschen-gamma, and Paschen-beta emission lines with similar shapes and amplitudes to those of luminous blue variables (LBVs) and LBV candidates (LBVc). Mid-IR colors and magnitudes from Spitzer/Infrared Array Camera photometry of X-2 resemble those of known LBV/LBVcs. We suggest that the stellar companion in X-2 is an LBV/LBVc and discuss possible origins of the optical flares. Dips in the optical light curve are indicative of eclipses by optically thick clumps formed in the winds of the stellar counterpart. Given the constraints on the flare duration (0.02-0.8 days) and the time between flares (15.1+/-7.8 days), we estimate the clump volume filling factor in the stellar winds, f_V, to be 0.01 < f_V < 0.71, which overlaps with values measured from massive star winds. In X-2, we interpret the origin of the optical flares as the accretion of clumps formed in the winds of an LBV/LBVc onto the compact object.
Database Contents License (DbCL) 1.0 http://opendatacommons.org/licenses/dbcl/1.0/
The Flights Booking Dataset of various airlines was scraped date-wise from a well-known travel website in a structured format. The dataset contains records of flights between cities in India, with multiple features such as source and destination city, arrival and departure time, duration, and price of the flight.
The data is available as a CSV file, which we analyze using a pandas DataFrame.
This analysis will be helpful for those working in the airline and travel domains.
Using this dataset, we answered multiple questions with Python in our project.
Q.1. What are the airlines in the dataset, accompanied by their frequencies?
Q.2. Show bar graphs representing the departure time and arrival time.
Q.3. Show bar graphs representing the source city and destination city.
Q.4. Does price vary with airline?
Q.5. Does the ticket price change based on the departure time and arrival time?
Q.6. How does the price change with the source and destination?
Q.7. How is the price affected when tickets are bought only 1 or 2 days before departure?
Q.8. How does the ticket price vary between Economy and Business class?
Q.9. What is the average price of a Vistara flight from Delhi to Hyderabad in Business class?
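Questions of this kind reduce to simple pandas operations. The sketch below uses a tiny stand-in DataFrame rather than the real CSV; the column names follow the feature list in this description, and the values are made up for illustration.

```python
import pandas as pd

# Small stand-in for the flights CSV (columns per the feature list below;
# values are invented for illustration only).
df = pd.DataFrame({
    "airline": ["Vistara", "Indigo", "Vistara", "AirAsia", "Indigo"],
    "class":   ["Business", "Economy", "Economy", "Economy", "Economy"],
    "price":   [25000, 6000, 9000, 4500, 5500],
})

# Q.1: airlines and their frequencies
freq = df["airline"].value_counts()

# Q.4: does price vary with airline? Compare the mean price per airline.
mean_price = df.groupby("airline")["price"].mean()
```

On the real dataset the same `value_counts` / `groupby` pattern answers most of the questions above, swapping in the relevant grouping columns (class, source city, days left, etc.).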
These are the main features/columns available in the dataset:
1) Airline: The name of the airline company is stored in the airline column. It is a categorical feature having 6 different airlines.
2) Flight: Flight stores information regarding the plane's flight code. It is a categorical feature.
3) Source City: City from which the flight takes off. It is a categorical feature having 6 unique cities.
4) Departure Time: This is a derived categorical feature created by grouping time periods into bins. It stores information about the departure time and has 6 unique time labels.
5) Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.
6) Arrival Time: This is a derived categorical feature created by grouping time intervals into bins. It has six distinct time labels and keeps information about the arrival time.
7) Destination City: City where the flight will land. It is a categorical feature having 6 unique cities.
8) Class: A categorical feature that contains information on seat class; it has two distinct values: Business and Economy.
9) Duration: A continuous feature that displays the overall amount of time it takes to travel between cities in hours.
10) Days Left: This is a derived feature calculated by subtracting the booking date from the trip date.
11) Price: Target variable stores information of the ticket price.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains an extension for 2018 of the Dutch Offshore Wind Atlas (DOWA). The DOWA is a validated wind climatology (with additional information on temperature, pressure and relative humidity) based on 10 years (2008-2017) of model data: the HARMONIE-AROME weather model nested in the ERA5 reanalysis. In this dataset, time series for individual grid locations are available for 2018. Note: the proj4 string in the NetCDF file is incorrect. It should be: +proj=lcc +lat_1=52.500000 +lat_2=52.500000 +lat_0=52.500000 +lon_0=.000000 +k_0=1.0 +x_0=-92963.487426 +y_0=230383.739533 +a=6371220.000000 +b=6371220.000000
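As a convenience, the corrected proj4 string above can be parsed into a parameter dictionary before being handed to GIS tooling. This is a plain-Python sketch; in practice a library such as pyproj would normally consume the string directly.

```python
# Parse the corrected DOWA proj4 string into a {parameter: value} dict.
PROJ4 = ("+proj=lcc +lat_1=52.500000 +lat_2=52.500000 +lat_0=52.500000 "
         "+lon_0=.000000 +k_0=1.0 +x_0=-92963.487426 +y_0=230383.739533 "
         "+a=6371220.000000 +b=6371220.000000")

def parse_proj4(s):
    """Split a proj4 string on whitespace and return its +key=value pairs."""
    params = {}
    for token in s.split():
        key, _, value = token.lstrip("+").partition("=")
        params[key] = value
    return params

crs = parse_proj4(PROJ4)
```

The parsed dictionary makes it easy to overwrite the incorrect CRS attribute when post-processing the NetCDF files.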
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This Zenodo record contains the following files:
- Available variables in KNMI-LENTIS: request-overview-CMIP-historical-including-EC-EARTH-AOGCM-preferences.txt
- Where the data is deposited on the ECMWF's tape storage (section 4): LENTIS_on_ECFS.zip
- Data of all variables for 1 year for 1 ensemble member (section 5): tree_of_files_one_member_all_data.txt and {AERmon,Amon,Emon,LImon,Lmon,Ofx,Omon,SImon,fx,Eday,Oday,day,CFday,3hr,6hrPlev,6hrPlevPt}.zip
This Zenodo dataset pertains to the full KNMI-LENTIS dataset: a large ensemble of simulations with the Global Climate Model EC-Earth3. The periods are the present-day period (2000-2009) and a future +2K period (2075-2084, following SSP2-4.5). KNMI-LENTIS comprises 1600 simulated years for each of the two climates. This level of sampled climate variability allows for robust and in-depth research into extreme events. The available variables are listed in the file request-overview-CMIP-historical-including-EC-EARTH-AOGCM-preferences.txt. All variables are cmorised following the CMIP6 data format convention. Further details on the variables and their output dimensions are available via the following search tool. The total size of KNMI-LENTIS is 128 TB. KNMI-LENTIS is stored at the high performance storage system of the ECMWF (ECFS).
The Global Climate Model that is used for generating this Large Ensemble is EC-Earth3 - VAREX project branch https://svn.ec-earth.org/ecearth3/branches/projects/varex (access restricted to ECMWF members).
The goals of this Zenodo dataset are:
to provide an accurate description and example of how the KNMI-LENTIS dataset is organised.
to describe on which servers the data are deposited and how future users can gain access to the data
to provide links to related git repositories and other content relating to the KNMI-LENTIS production
KNMI-LENTIS consists of 2 times 160 runs of 10 years. All simulations have a unique ensemble member label that reflects the forcing, and how the initial conditions are generated. The initial conditions have two aspects: the parent simulation from which the run is branched (macro perturbation, there are 16), and the seed relating to a particular micro-perturbation in the initial three-dimensional atmosphere temperature field (there are 10). The ensemble member label thus is a combination of:
forcing (h for present-day/historical and s for +2K/SSP2-4.5)
parent ID (number between 1 and 16)
micro perturbation ID (number between 0 and 9)
In this Zenodo dataset we publish 1 year of data from 1 member to give insight into the type of data and metadata that is representative of the full KNMI-LENTIS dataset. The published data is year 2000 from member h010; see Section 4.
Further, all KNMI-LENTIS simulations are labeled per the CMIP6 convention of variant labelling. A variant label is made from four components: the realization index r, the initialization index i, the physics index p and the forcing index f. Further details on CMIP6 variant labelling can be found in The CMIP6 Participation Guidance for Modelers. In the KNMI-LENTIS dataset, the forcing is reflected in the first digit of the realization index r of the variant label. For the historical simulations, the one thousands (r1000-r1999) have been reserved. For the SSP2-4.5 simulations, the five thousands (r5000-r5999) have been reserved. The parent is reflected in the second and third digits of the realization index r of the variant label (r?01?-r?16?). The seed is reflected in the fourth digit of the realization index r (r???0-r???9). The seed is also reflected in the initialization index i of the variant label (i0-i9), so this is duplicate information. The physics index p5 has been reserved for the ECE3p5 version: all KNMI-LENTIS simulations have the p5 label. The forcing index f of the variant label is kept at 1 for all KNMI-LENTIS simulations. As an example, variant label r5119i9p5f1 refers to the 2K time slice with parent 11 and randomizing seed number 9. The physics index is 5, meaning the run is done with the ECE3p5 version of EC-Earth3.
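The labelling scheme described above can be decoded programmatically. A minimal sketch follows; the function name and return format are illustrative, but the digit encoding is exactly as described in the text.

```python
import re

def decode_variant_label(label):
    """Decode a KNMI-LENTIS variant label, e.g. 'r5119i9p5f1'.

    Per the scheme above: the first digit of r encodes the forcing
    (1 = historical, 5 = SSP2-4.5), digits 2-3 encode the parent (01-16),
    and digit 4 encodes the micro-perturbation seed (duplicated in i).
    """
    m = re.fullmatch(r"r(\d)(\d{2})(\d)i(\d+)p(\d+)f(\d+)", label)
    if not m:
        raise ValueError(f"not a KNMI-LENTIS variant label: {label}")
    forcing = {"1": "present-day/historical", "5": "+2K/SSP2-4.5"}[m.group(1)]
    return {
        "forcing": forcing,
        "parent": int(m.group(2)),
        "seed": int(m.group(3)),
        "physics": int(m.group(5)),
        "forcing_index": int(m.group(6)),
    }

info = decode_variant_label("r5119i9p5f1")
```

Running this on the worked example from the text, r5119i9p5f1, yields the 2K forcing, parent 11, seed 9, and physics index 5.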
In this Zenodo folder, there are several text files and several netcdf files. The text files provide overviews of the dataset's contents and organisation; the netcdf files contain the published sample data.
Data from KNMI-LENTIS is deposited in the ECMWF ECFS tape storage system. Data can be freely downloaded by those who have access to the ECMWF ECFS. Otherwise, the data can be made available by the authors upon request.
The way the dataset is organised is detailed in LENTIS_on_ECFS.zip. This contains details on all available KNMI-LENTIS files, in particular details for how these are filed in ECFS. The files on ECFS are tar zipped per ensemble member & variable: these contain 10 years of ensemble member data (10 separate netcdf files). The location on ECFS of the tar-zipped files that are listed in the various text files in this Zenodo dataset is
ec:/nklm/LENTIS/ec-earth/cmorised_by_var/
for freq in AERmon Amon Emon LImon Lmon Ofx Omon SImon fx Eday Oday day CFday 3hr 6hrPlev 6hrPlevPt; do for scen in hxxx sxxx; do els -l ec:/nklm/LENTIS/ec-earth/cmorised_by_var/${scen}/${freq}/* >> LENTIS_on_ECFS_${scen}_${freq}.txt; done; done
Further, part of the data will be made publicly available from the Earth System Grid Federation (ESGF) data portal. We aim to upload most of the monthly variables for the full ensemble. As search terms, use EC-Earth for the model and p5 for the physics index to locate the KNMI-LENTIS data.
The netcdf files of the data of 1 year from member h010 are published here to give insight into the type of data and metadata that is representative of the full KNMI-LENTIS dataset. The data are in zipped folders per output frequency: AERmon, Amon, Emon, LImon, Lmon, Ofx, Omon, SImon, fx, Eday, Oday, day, CFday, 3hr, 6hrPlev, 6hrPlevPt. The text file request-overview-CMIP-historical-including-EC-EARTH-AOGCM-preferences.txt gives an overview of the variables available per output frequency. The text file tree_of_files_one_member_all_data.txt gives an overview of the files in the zipped folders.
The production of the KNMI-LENTIS ensemble was funded by the KNMI (Royal Dutch Meteorological Institute) multi-year strategic research fund KNMI MSO Climate Variability and Extremes (VAREX).
GitHub repository corresponding to this Zenodo dataset: https://github.com/lmuntjewerf/KNMI-LENTIS_dataset_description.git
Github repository for KNMI-LENTIS production code: https://github.com/lmuntjewerf/KNMI-LENTIS_production_script_train.git
The Chandra data archive is a treasure trove for various studies, and in this study the author exploits this valuable resource to study the X-ray point source populations in nearby galaxies. By 2007 December 14, 383 galaxies within 40 Mpc with isophotal major axes above 1 arcminute had been observed by 626 public ACIS observations, most of which were for the first time analyzed by this survey to study the X-ray point sources. Uniform data analysis procedures were applied to the 626 ACIS observations and led to the detection of 28,099 point sources, which belong to 17,559 independent sources. These include 8700 sources observed twice or more and 1000 sources observed 10 times or more, providing a wealth of data to study the long-term variability of these X-ray sources. Cross-correlation of these sources with galaxy isophotes led to 8,519 sources within the D25 isophotes of 351 galaxies, 3,305 sources between the D25 and 2 * D25 isophotes of 309 galaxies, and an additional 5,735 sources outside the 2 * D25 isophotes of galaxies. This survey has produced a uniform catalog, by far the largest, of 11,824 X-ray point sources within 2 * D25 isophotes of 380 galaxies. Contamination analysis using the log N-log S relation shows that 74% of the sources within the 2 * D25 isophotes above 10^39 erg s^-1, 71% of the sources above 10^38 erg s^-1, 63% of the sources above 10^37 erg s^-1, and 56% of all sources are truly associated with the galaxies. Meticulous efforts have identified 234 X-ray sources with galactic nuclei of nearby galaxies. This archival survey leads to 300 ultraluminous X-ray sources (ULXs) with L_X in the 0.3-8 keV band >= 2 x 10^39 erg s^-1 within the D25 isophotes, 179 ULXs between the D25 and the 2 * D25 isophotes, and a total of 479 ULXs within 188 host galaxies, with about 324 ULXs truly associated with the host galaxies based on the contamination analysis.
About 4% of the sources exhibited at least one supersoft phase, and 70 sources are classified as ultraluminous supersoft sources with L_X (0.3-8 keV) >= 2 x 10^38 erg s^-1. With a uniform data set and good statistics, this survey enables future works on various topics, such as X-ray luminosity functions for the ordinary X-ray binary populations in different types of galaxies, and X-ray properties of galactic nuclei. This table contains the list of 17,559 'independent' X-ray point sources that was contained in table 4 of the reference paper. As the author notes in Section 5 of this paper, there are 341 sources projected within 2 galaxies with overlapping domains which are listed for both galaxies. The 5,735 sources lying outside the 2 * D25 isophotes of the galaxies are also included in this table. For these sources, the X-ray luminosities are computed as if they were in a galaxy of that group, which may or may not be the case; thus, they may not be their 'true' luminosities, but are listed for the purposes of comparison. This table was created by the HEASARC in March 2011 based on the electronic version of Table 4 of the reference paper which was obtained from the Astrophysical Journal web site. Some of the values for the name parameter in the HEASARC's implementation of this table were corrected in April 2018. This is a service provided by NASA HEASARC.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Student Performance Data Set’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/impapan/student-performance-data-set on 13 February 2022.
--- Dataset description provided by original source is as follows ---
These data describe student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social and school-related features, and were collected using school reports and questionnaires. Two datasets are provided regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades.
# Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
1 school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
2 sex - student's sex (binary: 'F' - female or 'M' - male)
3 age - student's age (numeric: from 15 to 22)
4 address - student's home address type (binary: 'U' - urban or 'R' - rural)
5 famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
6 Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
7 Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8 Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
9 Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
10 Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
11 reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
12 guardian - student's guardian (nominal: 'mother', 'father' or 'other')
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16 schoolsup - extra educational support (binary: yes or no)
17 famsup - family educational support (binary: yes or no)
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19 activities - extra-curricular activities (binary: yes or no)
20 nursery - attended nursery school (binary: yes or no)
21 higher - wants to take higher education (binary: yes or no)
22 internet - Internet access at home (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)
# these grades are related with the course subject, Math or Portuguese:
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)
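Loading the files with pandas can be sketched as below. The sketch writes a tiny two-row sample inline instead of reading the real file, and assumes the semicolon-separated, quoted layout commonly used for these CSVs; verify the separator against your own copy.

```python
import io
import pandas as pd

# Two-row stand-in for student-mat.csv (subset of columns; the real files
# are commonly semicolon-separated with quoted string fields).
sample = io.StringIO(
    'school;sex;age;G1;G2;G3\n'
    '"GP";"F";18;5;6;6\n'
    '"MS";"M";17;12;12;13\n'
)
df = pd.read_csv(sample, sep=";")

# The description notes G3 correlates strongly with G1 and G2, since all
# three are period grades for the same course subject.
grades = df[["G1", "G2", "G3"]]
```

With the full file loaded the same way, `grades.corr()` makes the G1/G2/G3 correlation noted above directly visible.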
If you use this dataset in your research, please credit the authors
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.
--- Original source retains full ownership of the source dataset ---
Open Government Licence - Canada 2.0 https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
This table contains 1410 series, with data for years 1990-1998 (not all combinations necessarily have data for all years), and was last released on 2007-01-29. This table contains data described by the following dimensions (not all combinations are available): Geography (30 items: Austria; Belgium (Flemish speaking); Belgium (French speaking); ...), Sex (2 items: Males; Females), Age group (3 items: 11 years; 13 years; 15 years), Activity (2 items: Tasted an alcoholic beverage; Been really drunk), Frequency (8 items: Yes; 2 to 3 times; Once; 4 to 10 times; ...).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This layer displays a global map of land use/land cover (LULC) derived from ESA Sentinel-2 imagery at 10 m resolution. Each year is generated with Impact Observatory's deep learning AI land classification model, trained using billions of human-labeled image pixels from the National Geographic Society. The global maps are produced by applying this model to the Sentinel-2 Level-2A image collection on Microsoft's Planetary Computer, processing over 400,000 Earth observations per year. The algorithm generates LULC predictions for nine classes, described in detail below. The year 2017 has a land cover class assigned for every pixel, but its class is based upon fewer images than the other years, which are based upon a more complete set of imagery; for this reason, 2017 may have less accurate land cover class assignments than the years 2018-2024.

Key Properties
- Variable mapped: Land use/land cover in 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024
- Source Data Coordinate System: Universal Transverse Mercator (UTM) WGS84
- Service Coordinate System: Web Mercator Auxiliary Sphere WGS84 (EPSG:3857)
- Extent: Global
- Source imagery: Sentinel-2 L2A
- Cell Size: 10 meters
- Type: Thematic
- Attribution: Esri, Impact Observatory
- Analysis: Optimized for analysis

Class Definitions (Value, Name, Description)
- 1 Water: Areas where water was predominantly present throughout the year; may not cover areas with sporadic or ephemeral water; contains little to no sparse vegetation, no rock outcrop nor built up features like docks. Examples: rivers, ponds, lakes, oceans, flooded salt plains.
- 2 Trees: Any significant clustering of tall (~15 feet or higher) dense vegetation, typically with a closed or dense canopy. Examples: wooded vegetation, clusters of dense tall vegetation within savannas, plantations, swamp or mangroves (dense/tall vegetation with ephemeral water or canopy too thick to detect water underneath).
- 4 Flooded vegetation: Areas of any type of vegetation with obvious intermixing of water throughout a majority of the year; seasonally flooded area that is a mix of grass/shrub/trees/bare ground. Examples: flooded mangroves, emergent vegetation, rice paddies and other heavily irrigated and inundated agriculture.
- 5 Crops: Human planted/plotted cereals, grasses, and crops not at tree height. Examples: corn, wheat, soy, fallow plots of structured land.
- 7 Built Area: Human-made structures; major road and rail networks; large homogenous impervious surfaces including parking structures, office buildings and residential housing. Examples: houses, dense villages/towns/cities, paved roads, asphalt.
- 8 Bare ground: Areas of rock or soil with very sparse to no vegetation for the entire year; large areas of sand and deserts with no to little vegetation. Examples: exposed rock or soil, desert and sand dunes, dry salt flats/pans, dried lake beds, mines.
- 9 Snow/Ice: Large homogenous areas of permanent snow or ice, typically only in mountain areas or highest latitudes. Examples: glaciers, permanent snowpack, snow fields.
- 10 Clouds: No land cover information due to persistent cloud cover.
- 11 Rangeland: Open areas covered in homogenous grasses with little to no taller vegetation; wild cereals and grasses with no obvious human plotting (i.e., not a plotted field); mix of small clusters of plants or single plants dispersed on a landscape that shows exposed soil or rock; scrub-filled clearings within dense forests that are clearly not taller than trees. Examples: natural meadows and fields with sparse to no tree cover, open savanna with few to no trees, parks/golf courses/lawns, pastures, moderate to sparse cover of bushes, shrubs and tufts of grass, savannas with very sparse grasses, trees or other plants.

NOTE: The land use focus does not provide the spatial detail of a land cover map. As such, for the built area classification, yards, parks, and groves will appear as built area rather than trees or rangeland classes.

Usage Information and Best Practices

Processing Templates: This layer includes a number of preconfigured processing templates (raster function templates) to provide on-the-fly data rendering and class isolation for visualization and analysis. Each processing template includes labels and descriptions to characterize the intended usage: for visualization, for analysis, or for both.

Visualization: The default rendering on this layer displays all classes. There are a number of on-the-fly renderings/processing templates designed specifically for data visualization. By default, the most recent year is displayed. To discover and isolate specific years for visualization in Map Viewer, try using the Image Collection Explorer.

Analysis: In order to leverage the optimization for analysis, the capability must be enabled by your ArcGIS organization administrator; more information on enabling this feature can be found in the 'Regional data hosting' section of this help doc. Optimized for analysis means this layer does not have size constraints for analysis and is recommended for multisource analysis with other layers optimized for analysis. See this group for a complete list of imagery layers optimized for analysis. Prior to running analysis, users should always provide some form of data selection with either a layer filter (e.g. for a specific date range, cloud cover percent, mission, etc.) or by selecting specific images. To discover and isolate specific images for analysis in Map Viewer, try using the Image Collection Explorer. Zonal Statistics is a common tool used for understanding the composition of a specified area by reporting the total estimates for each of the classes.

General: If you are new to Sentinel-2 LULC, the Sentinel-2 Land Cover Explorer provides a good introductory user experience for working with this imagery layer. For more information, see this Quick Start Guide. Global land use/land cover maps provide information for conservation planning, food security, and hydrologic modeling, among other things. This dataset can be used to visualize land use/land cover anywhere on Earth.

Classification Process: These maps include Version 003 of the global Sentinel-2 land use/land cover data product. It is produced by a deep learning model trained using over five billion hand-labeled Sentinel-2 pixels, sampled from over 20,000 sites distributed across all major biomes of the world. The underlying deep learning model uses 6 bands of Sentinel-2 L2A surface reflectance data: visible blue, green, red, near infrared, and two shortwave infrared bands. To create the final map, the model is run on multiple dates of imagery throughout the year, and the outputs are composited into a final representative map for each year. The input Sentinel-2 L2A data was accessed via Microsoft's Planetary Computer and scaled using Microsoft Azure Batch.

Citation: Karra, Kontgis, et al. "Global land use/land cover with Sentinel-2 and deep learning." IGARSS 2021-2021 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2021.

Acknowledgements: Training data for this project makes use of the National Geographic Society Dynamic World training dataset, produced for the Dynamic World Project by National Geographic Society in partnership with Google and the World Resources Institute.
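The zonal-statistics idea mentioned above can be sketched with numpy on a small class raster. The class values follow the class definition table in this description; the arrays themselves are made up for illustration.

```python
import numpy as np

# Tiny stand-in LULC raster using class values from the table above
# (1 = Water, 2 = Trees, 5 = Crops, 7 = Built Area) and a boolean zone mask.
lulc = np.array([[1, 1, 2],
                 [5, 7, 2],
                 [5, 5, 7]])
zone = np.array([[True,  True,  False],
                 [True,  True,  False],
                 [False, False, False]])

# Count pixels per class inside the zone; each pixel covers 10 m x 10 m.
values, counts = np.unique(lulc[zone], return_counts=True)
area_m2 = {int(v): int(c) * 10 * 10 for v, c in zip(values, counts)}
```

This mirrors what the Zonal Statistics tool reports: per-class totals for the area of interest, here expressed as square metres at the layer's 10 m cell size.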
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Weather was recorded every 10 minutes throughout the entire year of 2020, comprising 20 meteorological indicators measured at a Max Planck Institute weather station. The dataset provides comprehensive atmospheric measurements including air temperature, humidity, wind patterns, radiation, and precipitation. With over 52,560 data points per variable (365 days × 24 hours × 6 measurements per hour), this high-frequency sampling offers detailed insights into weather patterns and atmospheric conditions. The measurements include both basic weather parameters and derived quantities such as vapor pressure deficit and potential temperature, making it suitable for both meteorological research and practical applications. You can find some initial analysis using this dataset here: "Weather Long-term Time Series Forecasting Analysis".
The dataset is provided in a CSV format with the following columns:
Column Name | Description |
---|---|
date | Date and time of the observation. |
p | Atmospheric pressure in millibars (mbar). |
T | Air temperature in degrees Celsius (°C). |
Tpot | Potential temperature in Kelvin (K), representing the temperature an air parcel would have if moved to a standard pressure level. |
Tdew | Dew point temperature in degrees Celsius (°C), indicating the temperature at which air becomes saturated with moisture. |
rh | Relative humidity as a percentage (%), showing the amount of moisture in the air relative to the maximum it can hold at that temperature. |
VPmax | Maximum vapor pressure in millibars (mbar), representing the maximum pressure exerted by water vapor at the given temperature. |
VPact | Actual vapor pressure in millibars (mbar), indicating the current water vapor pressure in the air. |
VPdef | Vapor pressure deficit in millibars (mbar), measuring the difference between maximum and actual vapor pressure, used to gauge drying potential. |
sh | Specific humidity in grams per kilogram (g/kg), showing the mass of water vapor per kilogram of air. |
H2OC | Concentration of water vapor in millimoles per mole (mmol/mol) of dry air. |
rho | Air density in grams per cubic meter (g/m³), reflecting the mass of air per unit volume. |
wv | Wind speed in meters per second (m/s), measuring the horizontal motion of air. |
max. wv | Maximum wind speed in meters per second (m/s), indicating the highest recorded wind speed over the period. |
wd | Wind direction in degrees (°), representing the direction from which the wind is blowing. |
rain | Total rainfall in millimeters (mm), showing the amount of precipitation over the observation period. |
raining | Duration of rainfall in seconds (s), recording the time for which rain occurred during the observation period. |
SWDR | Short-wave downward radiation in watts per square meter (W/m²), measuring incoming solar radiation. |
PAR | Photosynthetically active radiation in micromoles per square meter per second (µmol/m²/s), indicating the amount of light available for photosynthesis. |
max. PAR | Maximum photosynthetically active radiation recorded in the observation period in µmol/m²/s. |
Tlog | Temperature logged in degrees Celsius (°C), potentially from a secondary sensor or logger. |
OT | Likely refers to an "operational timestamp" or an offset in time, but may need clarification depending on the dataset's context. |
This high-resolution meteorological dataset enables applications across multiple domains. For weather forecasting, the frequent measurements support development of prediction models, while climate researchers can study microclimate variations and seasonal patterns. In agriculture, temperature and vapor pressure deficit data aids crop modeling and irrigation planning. The wind and radiation measurements benefit renewable energy planning, while the comprehensive atmospheric data supports environmental monitoring. The dataset's detailed nature makes it particularly suitable for machine learning applications and educational purposes in meteorology and data science.
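The 10-minute cadence described above is often downsampled before modeling. The sketch below builds a synthetic miniature of the series (column names `T` and `rh` follow the table above; the values are invented) and aggregates it to hourly means with pandas:

```python
import pandas as pd

# Hypothetical miniature of the 10-minute records: one hour of data.
idx = pd.date_range("2020-01-01 00:00", periods=6, freq="10min")
df = pd.DataFrame(
    {"T": [5.0, 5.2, 5.1, 5.3, 5.4, 5.6],   # air temperature, deg C
     "rh": [80, 81, 79, 78, 77, 76]},        # relative humidity, %
    index=idx,
)
df.index.name = "date"

# Aggregate the six 10-minute samples per hour to hourly means,
# as one would when downsampling the full 2020 series.
hourly = df.resample("1h").mean()
```

For the real file, the same `resample` call applies after `pd.read_csv(..., parse_dates=["date"], index_col="date")`; sum-type columns such as `rain` would use `.sum()` rather than `.mean()`.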
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Satellite images can be used to derive time series of vegetation indices, such as the normalized difference vegetation index (NDVI) or the enhanced vegetation index (EVI), at global scale. Unfortunately, recording artifacts, clouds, and other atmospheric contaminants impact a significant portion of the produced images, requiring the use of ad-hoc techniques to reconstruct the time series in the affected regions. In the literature, several methods have been proposed to fill the gaps present in the images, and some works have also presented performance comparisons between them (Roerink et al., 2000; Moreno-Martínez et al., 2020; Siabi et al., 2022). Because of the lack of a ground truth for the reconstructed images, performance evaluation requires the creation of datasets where artificial gaps are introduced into a reference image, such that metrics like the root mean square error (RMSE) can be computed by comparing the reconstructed images with the reference one. Different approaches have been used to create the reference images and the artificial gaps, but in most cases the artificial gaps are introduced using arbitrary patterns and/or the reference image is produced artificially rather than from real satellite images (e.g. Kandasamy et al., 2013; Liu et al., 2017; Julien & Sobrino, 2018). In addition, to the best of our knowledge, few of these datasets are openly available and directly accessible, allowing for fully reproducible research.
We provide here a benchmark dataset for time series reconstruction methods based on the harmonized Landsat Sentinel-2 (HLS) collection, where the artificial gaps are introduced with a realistic spatio-temporal distribution. In particular, we selected six tiles that we consider representative of most of the main climate classes (e.g. equatorial, arid, warm temperate, boreal and polar), as depicted in the preview.
Specifically, following the tiling system shown above, we downloaded the Red, NIR and F-mask bands from both the HLSL30 and HLSS30 collections for the tiles 19FCV, 22LEH, 32QPK, 31UFS, 45WFV and 49MWM. From the Red and NIR bands we derived the NDVI as:
NDVI = (NIR - Red) / (NIR + Red)
only for clear-sky, on-land pixels (F-mask bits 1, 3, 4 and 5 equal to zero), setting the remaining pixels to not-a-number. The images are then aggregated on a 16-day basis, averaging the available values for each pixel in each temporal range. We consider the data obtained in this way as the reference data for the benchmark, stored following the file naming convention
HLS.T&lt;TILE_NAME&gt;.&lt;YYYYDDD&gt;.v2.0.NDVI.tif
where TILE_NAME is one of the tile names specified above, YYYY is the corresponding year (spanning from 2015 to 2022) and DDD is the day of the year on which the corresponding 16-day range starts. Finally, for each tile, we have a time series composed of 184 images (23 images per year for 8 years) that can be easily manipulated, for example using the Scikit-Map library in Python.
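The clear-sky masking step described above can be sketched with NumPy. The bit positions follow the F-mask condition quoted in the text (bits 1, 3, 4 and 5 must all be zero); the function name and toy arrays are illustrative:

```python
import numpy as np

def clear_sky_ndvi(red, nir, fmask):
    """NDVI only where F-mask bits 1, 3, 4 and 5 are all zero
    (clear-sky, on-land pixels); everything else becomes NaN."""
    flags = (1 << 1) | (1 << 3) | (1 << 4) | (1 << 5)   # 0b111010
    clear = (fmask & flags) == 0
    ndvi = np.full(red.shape, np.nan, dtype=np.float64)
    valid = clear & ((nir + red) != 0)                   # avoid division by zero
    ndvi[valid] = (nir[valid] - red[valid]) / (nir[valid] + red[valid])
    return ndvi

red = np.array([[0.1, 0.2]])
nir = np.array([[0.5, 0.6]])
fmask = np.array([[0b000000, 0b000010]])   # second pixel flagged on bit 1
out = clear_sky_ndvi(red, nir, fmask)
```

On real HLS rasters the same function applies band-by-band after reading the GeoTIFFs; the subsequent 16-day aggregation is then a NaN-aware mean (`np.nanmean`) over the images falling in each window.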
Starting from these data, for each image we took the mask of the gaps already present, randomly rotated it by 90, 180 or 270 degrees, and added artificial gaps at the pixels of the rotated mask. In this way, the spatio-temporal distribution of the artificial gaps remains realistic, providing a solid benchmark for gap-filling methods that work on time series, on spatial patterns, or on a combination of both.
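The gap-injection procedure just described reduces to a mask rotation and an assignment in NumPy. A minimal sketch (function name and toy array are illustrative):

```python
import numpy as np

def add_artificial_gaps(ndvi, rng):
    """Rotate the image's own gap mask by a random multiple of 90 degrees
    and blank out the pixels under the rotated mask, preserving a realistic
    spatial gap distribution."""
    gap_mask = np.isnan(ndvi)
    k = int(rng.integers(1, 4))          # 1, 2 or 3 quarter-turns
    rotated = np.rot90(gap_mask, k)
    with_gaps = ndvi.copy()
    with_gaps[rotated] = np.nan
    return with_gaps, rotated

rng = np.random.default_rng(0)
ndvi = np.array([[0.2, np.nan],
                 [0.5, 0.7]])            # one pre-existing gap at (0, 1)
art, rotated_mask = add_artificial_gaps(ndvi, rng)
```

For this 2x2 example the rotated gap always lands on a previously valid pixel, so the result carries the original gap plus exactly one artificial gap.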
The data including the artificial gaps are stored with the naming structure
HLS.T&lt;TILE_NAME&gt;.&lt;YYYYDDD&gt;.v2.0.NDVI_art_gaps.tif
following the previously mentioned convention. Performance metrics, such as the RMSE or the normalized RMSE (NRMSE), can be computed by applying a reconstruction method to the images with artificial gaps and then comparing the reconstructed time series with the reference one, only at the locations of the artificially created gaps.
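The evaluation protocol above can be sketched as follows. Note the normalization choice for NRMSE (dividing by the reference value range at the evaluated pixels) is an assumption; other normalizations are in use:

```python
import numpy as np

def gap_metrics(reference, with_gaps, reconstructed):
    """RMSE and NRMSE computed only at the artificial-gap locations:
    pixels valid in the reference but NaN in the gapped image."""
    eval_mask = ~np.isnan(reference) & np.isnan(with_gaps)
    err = reconstructed[eval_mask] - reference[eval_mask]
    rmse = float(np.sqrt(np.mean(err ** 2)))
    # Normalize by the reference value range (one common convention).
    nrmse = rmse / float(reference[eval_mask].max() - reference[eval_mask].min())
    return rmse, nrmse

reference = np.array([0.2, 0.4, 0.6, 0.8])
with_gaps = np.array([0.2, np.nan, np.nan, 0.8])
reconstructed = np.array([0.2, 0.5, 0.6, 0.8])   # output of some gap-filler
rmse, nrmse = gap_metrics(reference, with_gaps, reconstructed)
```

Pixels that were already missing in the reference are excluded automatically, since the mask requires a valid reference value.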
This dataset was used to compare the performance of some gap-filling methods and we provide a Jupyter notebook that shows how to access and use the data. The files are provided in GeoTIFF format and projected in the coordinate reference system WGS 84 / UTM zone 19N (EPSG:32619).
If you manage to achieve higher accuracy or develop a new gap-filling algorithm, please contact the authors or post on our GitHub repository. May the force be with you!
References:
Julien, Y., & Sobrino, J. A. (2018). TISSBERT: A benchmark for the validation and comparison of NDVI time series reconstruction methods. Revista de Teledetección, (51), 19-31. https://doi.org/10.4995/raet.2018.9749
Kandasamy, S., Baret, F., Verger, A., Neveux, P., & Weiss, M. (2013). A comparison of methods for smoothing and gap filling time series of remote sensing observations–application to MODIS LAI products. Biogeosciences, 10(6), 4055-4071. https://doi.org/10.5194/bg-10-4055-2013
Liu, R., Shang, R., Liu, Y., & Lu, X. (2017). Global evaluation of gap-filling approaches for seasonal NDVI with considering vegetation growth trajectory, protection of key point, noise resistance and curve stability. Remote Sensing of Environment, 189, 164-179. https://doi.org/10.1016/j.rse.2016.11.023
Moreno-Martínez, Á., Izquierdo-Verdiguier, E., Maneta, M. P., Camps-Valls, G., Robinson, N., Muñoz-Marí, J., ... & Running, S. W. (2020). Multispectral high resolution sensor fusion for smoothing and gap-filling in the cloud. Remote Sensing of Environment, 247, 111901. https://doi.org/10.1016/j.rse.2020.111901
Roerink, G. J., Menenti, M., & Verhoef, W. (2000). Reconstructing cloudfree NDVI composites using Fourier analysis of time series. International Journal of Remote Sensing, 21(9), 1911-1917. https://doi.org/10.1080/014311600209814
Siabi, N., Sanaeinejad, S. H., & Ghahraman, B. (2022). Effective method for filling gaps in time series of environmental remote sensing data: An example on evapotranspiration and land surface temperature images. Computers and Electronics in Agriculture, 193, 106619. https://doi.org/10.1016/j.compag.2021.106619
Overview: 142: Areas used for sports, leisure and recreation purposes.
Traceability (lineage): This dataset was produced with a machine learning framework using several input datasets, specified in detail in Witjes et al., 2022 (in review, preprint available at https://doi.org/10.21203/rs.3.rs-561383/v3).
Scientific methodology: The single-class probability layers were generated with a spatiotemporal ensemble machine learning framework detailed in Witjes et al., 2022 (in review, preprint available at https://doi.org/10.21203/rs.3.rs-561383/v3). The single-class uncertainty layers were calculated by taking the standard deviation of the three single-class probabilities predicted by the three components of the ensemble. The HCL (hard class) layers represent the class with the highest probability as predicted by the ensemble.
Usability: The HCL layers have a decreasing average accuracy (weighted F1-score) at each subsequent level of the CLC hierarchy: 0.83 at level 1 (5 classes), 0.63 at level 2 (14 classes), and 0.49 at level 3 (43 classes). This means the hard-class maps are more reliable when classes are aggregated to a higher level of the hierarchy (e.g. 'Discontinuous Urban Fabric' and 'Continuous Urban Fabric' to 'Urban Fabric'). For some classes that were overshadowed by unequal sample point distributions, the single-class probabilities may represent actual patterns more closely than the hard-class layer. Users are encouraged to set their own thresholds when postprocessing these datasets to optimize accuracy for their specific use case.
Uncertainty quantification: Uncertainty is quantified by taking the standard deviation of the probabilities predicted by the three components of the spatiotemporal ensemble model.
Data validation approaches: The LULC classification was validated through spatial 5-fold cross-validation, as detailed in the accompanying publication.
Completeness: The dataset has chunks of empty predictions in regions with complex coastlines (e.g. the Zeeland province in the Netherlands and the Mar da Palha bay area in Portugal). These are artifacts that will be avoided in subsequent versions of the LULC product.
Consistency: To assess temporal and spatial consistency, the accuracy of the predictions was compared per year and per 30 km × 30 km tile across Europe by calculating the standard deviation of the weighted F1-score. The standard deviation of the annual weighted F1-score was 0.135, while the standard deviation of the weighted F1-score per tile was 0.150. This means the dataset is more consistent through time than through space: predictions are notably less accurate along the Mediterranean coast. The accompanying publication contains additional information and visualisations.
Positional accuracy: The raster layers have a resolution of 30 m, identical to that of the Landsat data cube used as input features for the machine learning framework that produced them.
Temporal accuracy: The dataset contains prediction and uncertainty layers for each year between 2000 and 2019.
Thematic accuracy: The maps reproduce the Corine Land Cover (CLC) classification system, a hierarchical legend that consists of 5 classes at the first level, 14 classes at the second level, and 44 classes at the third level. Class 523 (Oceans) was omitted due to computational constraints, leaving 43 level-3 classes in the maps.
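The relationship between the ensemble components, the uncertainty layers and the HCL layer described above can be illustrated in a few lines of NumPy. The probability values below are invented for the sketch:

```python
import numpy as np

# Hypothetical single-class probabilities from the three ensemble components,
# shaped (component, height, width), for a 1x2 scene.
p = np.array([[[0.60, 0.10]],
              [[0.70, 0.20]],
              [[0.50, 0.30]]])

# Ensemble probability = mean over components; uncertainty = their
# standard deviation, as the dataset description specifies.
prob = p.mean(axis=0)
uncertainty = p.std(axis=0)

# With per-class probability layers stacked as (class, height, width),
# the hard-class (HCL) layer is the argmax across classes.
probs_per_class = np.stack([prob, 1 - prob])   # toy two-class case
hcl = probs_per_class.argmax(axis=0)
```

Thresholding `prob` per class, as the Usability note suggests, would replace the plain `argmax` with a user-chosen cutoff for the classes of interest.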