License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
"We believe that by accounting for the inherent uncertainty in the system during each measurement, the relationship between cause and effect can be assessed more accurately, potentially reducing the duration of research."
Short description
This dataset was created as part of a research project investigating the efficiency and learning mechanisms of a Bayesian adaptive search algorithm supported by the Imprecision Entropy Indicator (IEI) as a novel method. It includes detailed statistical results, posterior probability values, and the weighted averages of IEI across multiple simulations aimed at target localization within a defined spatial environment. Control experiments, including random search, random walk, and genetic algorithm-based approaches, were also performed to benchmark the system's performance and validate its reliability.
The task involved locating a target area centered at (100; 100) within a radius of 10 units (Research_area.png), inside a circular search space with a radius of 100 units. The search process continued until 1,000 successful target hits were achieved.
To benchmark the algorithm's performance and validate its reliability, control experiments were conducted using alternative search strategies, including random search, random walk, and genetic algorithm-based approaches. These control datasets serve as baselines, enabling comprehensive comparisons of efficiency, randomness, and convergence behavior across search methods, thereby demonstrating the effectiveness of our novel approach.
Uploaded files
The first dataset contains the average IEI values, generated by randomly simulating 300 x 1 hits for 10 bins per quadrant (4 quadrants in total) using the Python programming language, and calculating the corresponding IEI values. This resulted in a total of 4 x 10 x 300 x 1 = 12,000 data points. The summary of the IEI values by quadrant and bin is provided in the file results_1_300.csv. The calculation of IEI values for averages is based on likelihood, using an absolute difference-based approach for the likelihood probability computation. IEI_Likelihood_Based_Data.zip
The weighted IEI average values for likelihood calculation (Bayes formula) are provided in the file Weighted_IEI_Average_08_01_2025.xlsx
This dataset contains the results of a simulated target search experiment using Bayesian posterior updates and Imprecision Entropy Indicators (IEI). Each row represents a hit during the search process, including metrics such as Shannon entropy (H), Gini index (G), average distance, angular deviation, and calculated IEI values. The dataset also includes bin-specific posterior probability updates and likelihood calculations for each iteration. The simulation explores adaptive learning and posterior penalization strategies to optimize search efficiency. Our Bayesian adaptive searching system source code (search algorithm, 1,000 target searches) is provided in IEI_Self_Learning_08_01_2025.py. This dataset contains the results of 1,000 iterations of a successful target search simulation; the simulation runs until the target is successfully located in each iteration. The dataset includes three main outputs: a) Results files (results{iteration_number}.csv): details of each hit during the search process, including entropy measures, Gini index, average distance and angle, Imprecision Entropy Indicators (IEI), coordinates, and the bin number of the hit. b) Posterior updates (Pbin_all_steps_{iter_number}.csv): tracks the posterior probability updates for all bins during the search process across multiple steps. c) Likelihood analysis (likelihood_analysis_{iteration_number}.csv): contains the calculated likelihood values for each bin at every step, based on the difference between the measured IEI and pre-defined IEI bin averages.
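As an illustration of the posterior update scheme described above, the sketch below applies one Bayes update over the 40 bins, using a likelihood that decreases with the absolute difference between a measured IEI value and each bin's pre-computed IEI average. The exact likelihood formula and all variable names are assumptions for illustration, not the code in IEI_Self_Learning_08_01_2025.py.

```python
import numpy as np

def update_posterior(prior, measured_iei, iei_bin_averages, eps=1e-9):
    """One Bayesian update step over the bins.

    The likelihood of each bin is assumed to be inversely proportional to the
    absolute difference between the measured IEI and that bin's pre-defined
    IEI average (an illustrative choice mirroring the description above).
    """
    diff = np.abs(iei_bin_averages - measured_iei)
    likelihood = 1.0 / (diff + eps)          # closer IEI average -> higher likelihood
    posterior = prior * likelihood           # Bayes rule, unnormalised
    return posterior / posterior.sum()       # normalise over all bins

# Illustrative usage with a flat prior over 4 quadrants x 10 bins = 40 bins
prior = np.full(40, 1.0 / 40)
iei_bin_averages = np.random.default_rng(0).uniform(0.1, 1.0, size=40)  # stand-in values
posterior = update_posterior(prior, measured_iei=0.42, iei_bin_averages=iei_bin_averages)
print(posterior.argmax(), posterior.max())
```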
Based on the mentioned Python source code (see point 3, Bayesian adaptive searching method with IEI values), we performed 1,000 successful target searches, and the outputs were saved in the Self_learning_model_test_output.zip file.
Bayesian Search (IEI) from different quadrants. This dataset contains the results of Bayesian adaptive target search simulations, including various outputs that represent the performance and analysis of the search algorithm. The dataset includes: a) Heatmaps (Heatmap_I_Quadrant, Heatmap_II_Quadrant, Heatmap_III_Quadrant, Heatmap_IV_Quadrant): These heatmaps represent the search results and the paths taken from each quadrant during the simulations. They indicate how frequently the system selected each bin during the search process. b) Posterior Distributions (All_posteriors, Probability_distribution_posteriors_values, CDF_posteriors_values): Generated based on posterior values, these files track the posterior probability updates, including cumulative distribution functions (CDF) and probability distributions. c) Macro Summary (summary_csv_macro): This file aggregates metrics and key statistics from the simulation. It summarizes the results from the individual results.csv files. d) Heatmap Searching Method Documentation (Bayesian_Heatmap_Searching_Method_05_12_2024): This document visualizes the search algorithm's path, showing how frequently each bin was selected during the 1,000 successful target searches. e) One-Way ANOVA Analysis (Anova_analyze_dataset, One_way_Anova_analysis_results): This includes the database and SPSS calculations used to examine whether the starting quadrant influences the number of search steps required. The analysis was conducted at a 5% significance level, followed by a Games-Howell post hoc test [43] to identify which target-surrounding quadrants differed significantly in terms of the number of search steps. Results were saved in the Self_learning_model_test_results.zip file.
This dataset contains randomly generated sequences of bin selections (1-40) from a control search algorithm (random search) used to benchmark the performance of Bayesian-based methods. The process iteratively generates random numbers until a stopping condition is met (reaching target bins 1, 11, 21, or 31). This dataset serves as a baseline for analyzing the efficiency, randomness, and convergence of non-adaptive search strategies. The dataset includes the following: a) The Python source code of the random search algorithm. b) A file (summary_random_search.csv) containing the results of 1000 successful target hits. c) A heatmap visualizing the frequency of search steps for each bin, providing insight into the distribution of steps across the bins. Random_search.zip
This dataset contains the results of a random walk search algorithm, designed as a control mechanism to benchmark adaptive search strategies (Bayesian-based methods). The random walk operates within a defined space of 40 bins, where each bin has a set of neighboring bins. The search begins from a randomly chosen starting bin and proceeds iteratively, moving to a randomly selected neighboring bin, until one of the stopping conditions is met (bins 1, 11, 21, or 31). The dataset provides detailed records of 1,000 random walk iterations, with the following key components: a) Individual Iteration Results: Each iteration's search path is saved in a separate CSV file (random_walk_results_.csv), listing the sequence of steps taken and the corresponding bin at each step. b) Summary File: A combined summary of all iterations is available in random_walk_results_summary.csv, which aggregates the step-by-step data for all 1,000 random walks. c) Heatmap Visualization: A heatmap file is included to illustrate the frequency distribution of steps across bins, highlighting the relative visit frequencies of each bin during the random walks. d) Python Source Code: The Python script used to generate the random walk dataset is provided, allowing reproducibility and customization for further experiments. Random_walk.zip
This dataset contains the results of a genetic search algorithm implemented as a control method to benchmark adaptive Bayesian-based search strategies. The algorithm operates in a 40-bin search space with predefined target bins (1, 11, 21, 31) and evolves solutions through random initialization, selection, crossover, and mutation over 1,000 successful runs. Dataset Components: a) Run Results: Individual run data is stored in separate files (genetic_algorithm_run_.csv), detailing: Generation: The generation number. Fitness: The fitness score of the solution. Steps: The path length in bins. Solution: The sequence of bins visited. b) Summary File: summary.csv consolidates the best solutions from all runs, including their fitness scores, path lengths, and sequences. c) All Steps File: summary_all_steps.csv records all bins visited during the runs for distribution analysis. d) A heatmap was also generated for the genetic search algorithm, illustrating the frequency of bins chosen during the search process as a representation of the search pathways. Genetic_search_algorithm.zip
Technical Information
The dataset files have been compressed into a standard ZIP archive using Total Commander (version 9.50). The ZIP format ensures compatibility across various operating systems and tools.
The XLSX files were created using Microsoft Excel Standard 2019 (Version 1808, Build 10416.20027)
The Python program was developed using Visual Studio Code (Version 1.96.2, user setup), with the following environment details: Commit fabd6a6b30b49f79a7aba0f2ad9df9b399473380f, built on 2024-12-19. The Electron version is 32.6, and the runtime environment includes Chromium 128.0.6263.186, Node.js 20.18.1, and V8 12.8.374.38-electron.0. The operating system is Windows NT x64 10.0.19045.
The statistical analysis included in this dataset was partially conducted using IBM SPSS Statistics, Version 29.0.1.0
The CSV files in this dataset were created following European standards, using a semicolon (;) as the delimiter instead of a comma, encoded in UTF-8 to ensure compatibility with a wide range of software tools.
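Because the CSV files use a semicolon delimiter and UTF-8 encoding, they can be loaded as in the minimal sketch below (pandas is assumed to be available; results_1_300.csv is one of the files described above).

```python
import pandas as pd

# European-style CSV: semicolon delimiter, UTF-8 encoding
df = pd.read_csv("results_1_300.csv", sep=";", encoding="utf-8")
print(df.head())
```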
Our target was to predict gender, age, and emotion from audio. We found labeled audio datasets on Mozilla and RAVDESS. Using the R programming language, 20 statistical features were extracted and, after adding the labels, these datasets were formed. Audio files were collected from "Mozilla Common Voice" and the "Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)".
Datasets contain 20 feature columns and 1 column denoting the label. The 20 statistical features were extracted through frequency spectrum analysis using the R programming language. They are:
1) meanfreq - The mean frequency (in kHz) is a pitch measure that assesses the center of the distribution of power across frequencies.
2) sd - The standard deviation of frequency is a statistical measure that describes a dataset's dispersion relative to its mean and is calculated as the square root of the variance.
3) median - The median frequency (in kHz) is the middle number in the sorted (ascending or descending) list of frequencies.
4) Q25 - The first quartile (in kHz), referred to as Q1, is the median of the lower half of the data set. About 25 percent of the values lie below Q1 and about 75 percent above it.
5) Q75 - The third quartile (in kHz), referred to as Q3, is the central point between the median and the highest value of the distribution.
6) IQR - The interquartile range (in kHz) is a measure of statistical dispersion, equal to the difference between the 75th and 25th percentiles, or between the upper and lower quartiles.
7) skew - The skewness is the degree of distortion from the normal distribution. It measures the lack of symmetry in the data distribution.
8) kurt - The kurtosis is a statistical measure that determines how much the tails of a distribution differ from the tails of a normal distribution; it reflects the presence of outliers in the data distribution.
9) sp.ent - The spectral entropy is a measure of signal irregularity that sums up the normalized spectral power of the signal.
10) sfm - The spectral flatness or tonality coefficient, also known as Wiener entropy, is a measure used in digital signal processing to characterize an audio spectrum. Spectral flatness is usually measured in decibels and provides a way to quantify how tone-like a sound is, as opposed to being noise-like.
11) mode - The mode frequency is the most frequently observed value in the data set.
12) centroid - The spectral centroid is a metric used to describe a spectrum in digital signal processing. It indicates where the spectrum's center of mass is located.
13) meanfun - The meanfun is the average of the fundamental frequency measured across the acoustic signal.
14) minfun - The minfun is the minimum fundamental frequency measured across the acoustic signal.
15) maxfun - The maxfun is the maximum fundamental frequency measured across the acoustic signal.
16) meandom - The meandom is the average of the dominant frequency measured across the acoustic signal.
17) mindom - The mindom is the minimum of the dominant frequency measured across the acoustic signal.
18) maxdom - The maxdom is the maximum of the dominant frequency measured across the acoustic signal.
19) dfrange - The dfrange is the range of the dominant frequency measured across the acoustic signal.
20) modindx - The modindx is the modulation index, which expresses the degree of frequency modulation as the ratio of the frequency deviation to the frequency of the modulating signal for a pure tone modulation.
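The original features were extracted with R through frequency spectrum analysis; purely as an illustration, the Python sketch below shows how a few of the listed spectral statistics (meanfreq, sd, median, Q25, Q75, IQR) could be computed from a power spectrum. Function and variable names are hypothetical and this is not the original extraction pipeline.

```python
import numpy as np

def spectral_stats(signal, sr):
    """Frequency-domain summary statistics analogous to meanfreq, sd, median,
    Q25, Q75 and IQR described above (frequencies reported in kHz)."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2            # power spectrum
    freqs_khz = np.fft.rfftfreq(len(signal), d=1.0 / sr) / 1000.0
    p = spectrum / spectrum.sum()                          # normalised power as weights

    meanfreq = np.sum(freqs_khz * p)                       # power-weighted mean frequency
    sd = np.sqrt(np.sum(p * (freqs_khz - meanfreq) ** 2))  # power-weighted std deviation

    cdf = np.cumsum(p)                                     # cumulative power distribution
    q25, median, q75 = (freqs_khz[np.searchsorted(cdf, q)] for q in (0.25, 0.5, 0.75))
    return {"meanfreq": meanfreq, "sd": sd, "median": median,
            "Q25": q25, "Q75": q75, "IQR": q75 - q25}

# Illustrative usage on one second of synthetic audio at 16 kHz
sr = 16000
t = np.arange(sr) / sr
print(spectral_stats(np.sin(2 * np.pi * 220 * t), sr))
```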
Gender and Age Audio Data Source: https://commonvoice.mozilla.org/en Emotion Audio Data Source: https://smartlaboratory.org/ravdess/
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
Despite strong interest in how noise affects marine mammals, little is known about the most abundant and commonly exposed taxa. Social delphinids occur in groups of hundreds of individuals that travel quickly, change behavior ephemerally, and are not amenable to conventional tagging methods, posing challenges in quantifying noise impacts. We integrated drone-based photogrammetry, strategically-placed acoustic recorders, and broad-scale visual observations to provide complementary measurements of different aspects of behavior for short- and long-beaked common dolphins. We measured behavioral responses during controlled exposure experiments (CEEs) of military mid-frequency (3-4 kHz) active sonar (MFAS) using simulated and actual Navy sonar sources. We used latent-state Bayesian models to evaluate response probability and persistence in exposure and post-exposure phases. Changes in sub-group movement and aggregation parameters were commonly detected during different phases of MFAS CEEs but not control CEEs. Responses were more evident in short-beaked common dolphins (n=14 CEEs), and a direct relationship between response probability and received level was observed. Long-beaked common dolphins (n=20) showed less consistent responses, although contextual differences may have limited which movement responses could be detected. These are the first experimental behavioral response data for these abundant dolphins to directly inform impact assessments for military sonars.
Methods
We used complementary visual and acoustic sampling methods at variable spatial scales to measure different aspects of common dolphin behavior in known and controlled MFAS exposure and non-exposure contexts. Three fundamentally different data collection systems were used to sample group behavior. A broad-scale visual sampling of subgroup movement was conducted using theodolite tracking from shore-based stations. Assessments of whole-group and sub-group sizes, movement, and behavior were conducted at 2-minute intervals from shore-based and vessel platforms using high-powered binoculars and standardized sampling regimes. Aerial UAS-based photogrammetry quantified the movement of a single focal subgroup. The UAS consisted of a large (1.07 m diameter) custom-built octocopter drone launched and retrieved by hand from vessel platforms. The drone carried a vertically gimballed camera (at least 16MP) and sensors that allowed precise spatial positioning, allowing spatially explicit photogrammetry to infer movement speed and directionality. Remote-deployed (drifting) passive acoustic monitoring (PAM) sensors were strategically deployed around focal groups to examine both basic aspects of subspecies-specific common dolphin acoustic (whistling) behavior and potential group responses in whistling to MFAS on variable temporal scales (Casey et al., in press). This integration allowed us to evaluate potential changes in movement, social cohesion, and acoustic behavior and their covariance associated with the absence or occurrence of exposure to MFAS. The collective raw data set consists of several GB of continuous broadband acoustic data and hundreds of thousands of photogrammetry images.
Three sets of quantitative response variables were analyzed from the different data streams: directional persistence and variation in speed of the focal subgroup from UAS photogrammetry; group vocal activity (whistle counts) from passive acoustic records; and number of sub-groups within a larger group being tracked by the shore station overlook. We fit separate Bayesian hidden Markov models (HMMs) to each set of response data, with the HMM assumed to have two states: a baseline state and an enhanced state that was estimated in sequential 5-s blocks throughout each CEE. The number of subgroups was recorded during periodic observations every 2 minutes and assumed constant across time blocks between observations. The number of subgroups was treated as missing data 30 seconds before each change was noted to introduce prior uncertainty about the precise timing of the change. For movement, two parameters relating to directional persistence and variation in speed were estimated by fitting a continuous time-correlated random walk model to spatially explicit photogrammetry data in the form of location tracks for focal individuals that were sequentially tracked throughout each CEE as a proxy for subgroup movement.
Movement parameters were assumed to be normally distributed. Whistle counts were treated as normally distributed but truncated as positive because negative count data is not possible. Subgroup counts were assumed to be Poisson distributed as they were distinct, small values. In all cases, the response variable mean was modeled as a function of the HMM with a log link:
log(Response_t) = l0 + l1 * Z_t
where at each 5-s time block t, the hidden state took values of Z_t = 0 to identify one state with a baseline response level l0, or Z_t = 1 to identify an “enhanced” state, with l1 representing the enhancement of the quantitative value of the response variable. A flat uniform (-30,30) prior distribution was used for l0 in each response model, and a uniform (0,30) prior distribution was adopted for each l1 to constrain enhancements to be positive. For whistle and subgroup counts, the enhanced state indicated increased vocal activity and more subgroups. A common indicator variable was estimated for the latent state for both the movement parameters, such that switching to the enhanced state described less directional persistence and more variation in velocity. Speed was derived as a function of these two parameters and was used here as a proxy for their joint responses, representing directional displacement over time.
To assess differences in the behavior states between experimental phases, the block-specific latent states were modeled as a function of phase-specific probabilities, Z_t ~ Bernoulli(p_phase(t)), to learn about the probability p_phase of being in an enhanced state during each phase. For each pre-exposure, exposure, and post-exposure phase, this probability was assigned a flat uniform (0,1) prior probability. The model was programmed in R (R version 3.6.1; The R Foundation for Statistical Computing) with the nimble package (de Valpine et al. 2020) to estimate posterior distributions of model parameters using Markov chain Monte Carlo (MCMC) sampling. Inference was based on 100,000 MCMC samples following a burn-in of 100,000, with chain convergence determined by visual inspection of three MCMC chains and corroborated by convergence diagnostics (Brooks and Gelman, 1998). To compare behavior across phases, we compared the posterior distributions of the p_phase parameters for each response variable, specifically by monitoring the MCMC output to assess the “probability of response” as the proportion of iterations for which p_exposure was greater or less than p_pre-exposure, and the “probability of persistence” as the proportion of iterations for which p_post-exposure was greater or less than p_pre-exposure. These probabilities of response and persistence thus estimated the extent of separation (non-overlap) between the distributions of pairs of p_phase parameters: if the two distributions of interest were identical, then p=0.5, and if the two were non-overlapping, then p=1. Similarly, we estimated the average values of the response variables in each phase by predicting phase-specific functions of the parameters:
Mean.response_phase = exp(l0 + l1 * p_phase)
and simply derived average speed as the mean of the speed estimates for 5-second blocks in each phase.
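As a small illustration of how the “probability of response” and “probability of persistence” are derived from the MCMC output, the sketch below computes the proportion-based separation between two sets of posterior draws. The original analysis was run in R with nimble; this Python sketch only mirrors the comparison step, and the beta draws are stand-ins for real MCMC samples.

```python
import numpy as np

def prob_separation(p_a, p_b):
    """Extent of separation between two posterior distributions, estimated as
    the proportion of MCMC iterations in which one p_phase exceeds the other
    (0.5 = identical distributions, 1.0 = non-overlapping)."""
    p_a, p_b = np.asarray(p_a), np.asarray(p_b)
    frac = np.mean(p_a > p_b)
    return max(frac, 1.0 - frac)

# Illustrative draws standing in for MCMC output of p_pre, p_exposure, p_post
rng = np.random.default_rng(1)
p_pre, p_exp, p_post = rng.beta(2, 8, 10000), rng.beta(6, 4, 10000), rng.beta(3, 7, 10000)
print("probability of response:  ", prob_separation(p_exp, p_pre))
print("probability of persistence:", prob_separation(p_post, p_pre))
```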
License: CC0 1.0 (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains vibration data collected from a geoscope sensor to analyze human activities (walking, running, and waiting). The data is segmented into 3-second time windows, with 120 rows of data per person for each activity. The dataset consists of 1,800 rows of data from five individuals: Furkan, Enes, Yusuf, Alihan, and Emir.
Each person’s activity is classified into one of the three categories: walking, running, or standing still. The data includes both statistical and frequency-domain features extracted from the raw vibration signals, detailed below:
Statistical Features: - Mean: The average value of the signal over the time window. - Median: The middle value of the signal, dividing the data into two equal halves. - Standard Deviation: A measure of how much the signal deviates from its mean, indicating the signal's variability. - Minimum: The smallest value in the signal during the time window. - Maximum: The largest value in the signal during the time window. - First Quartile (Q1): The median of the lower half of the data, representing the 25th percentile. - Third Quartile (Q3): The median of the upper half of the data, representing the 75th percentile. - Skewness: A measure of the asymmetry of the signal distribution, showing whether the data is skewed to the left or right.
Frequency-Domain Features: - Dominant Frequency: The frequency with the highest power, providing insights into the primary periodicity of the signal. - Signal Energy: The total energy of the signal, representing the sum of the squared signal values over the time window.
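As an illustration of the two frequency-domain features above, the sketch below derives the dominant frequency and signal energy from one window of raw vibration samples. The sampling rate (100 Hz here, giving 300 samples per 3-second window) and the synthetic input are assumptions for illustration; the released dataset contains the already-extracted features.

```python
import numpy as np

def frequency_features(window, sampling_rate):
    """Dominant frequency and signal energy for one time window, following
    the definitions above."""
    spectrum = np.abs(np.fft.rfft(window)) ** 2
    freqs = np.fft.rfftfreq(len(window), d=1.0 / sampling_rate)
    dominant_frequency = freqs[np.argmax(spectrum[1:]) + 1]  # skip the DC component
    signal_energy = float(np.sum(np.square(window)))         # sum of squared samples
    return dominant_frequency, signal_energy

# Illustrative usage: a 3-second window sampled at an assumed 100 Hz (300 samples)
rng = np.random.default_rng(0)
window = rng.normal(size=300)
print(frequency_features(window, sampling_rate=100))
```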
Dataset Overview: - Total Rows: 1800 - Number of Individuals: 5 (Furkan, Enes, Yusuf, Alihan, Emir) - Activity Types: Walking, Running, Waiting (Standing Still) - Time Frame: 3-second time windows (120 rows per individual for each activity) - Features: Statistical and frequency-domain features (as described above)
This dataset is suitable for training models on activity recognition, user identification, and other related tasks. It provides rich, detailed features that can be used for various classification and analysis applications.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mean and standard deviation of the difference in each group at each measurement site (units: μm).
This is the Baltic and North Sea Climatology (BNSC) for the Baltic Sea and the North Sea in the range 47° N to 66° N and 15° W to 30° E. It is the follow-up project to the KNSC climatology. The climatology was first made available to the public in March 2018 by ICDC and is published here in a slightly revised version 2. It contains the monthly averages of mean air pressure at sea level, air temperature, and dew point temperature at 2 meter height. It is available on a 1° x 1° grid for the period from 1950 to 2015. For the calculation of the mean values, all available quality-controlled data of the DWD (German Meteorological Service) from ship observations and buoy measurements were taken into account during this period. Additional dew point values were calculated from relative humidity and air temperature if available. Climatologies were calculated for the WMO standard periods 1951-1980, 1961-1990, 1971-2000 and 1981-2010 (monthly mean values). As a prerequisite for the calculation of the 30-year climatology, at least 25 out of 30 (five-sixths) valid monthly means had to be present in the respective grid box. For the long-term climatology from 1950 to 2015, at least four-fifths of the monthly means had to be valid. Two methods were used (in combination) to calculate the monthly averages, to account for the small number of measurements per grid box and their uneven spatial and temporal distribution: 1. For parameters with a detectable annual cycle in the data (air temperature, dew point temperature), a 2nd order polynomial was fitted to the data to reduce the variation within a month and reduce the uncertainty of the calculated averages. In addition, for the mean value of air temperature, the daily temperature cycle was removed from the data. In the case of air pressure, which has no annual cycle, in version 2 no data gaps longer than 14 days per month and grid box were allowed for the calculation of a monthly mean and standard deviation. This method differs from KNSC and BNSC version 1, where mean and standard deviation were calculated from 6-day window means. 2. If the number of observations fell below a certain threshold (20 observations per grid box and month for air temperature and dew point temperature, and 500 per box and month for air pressure), data from the adjacent boxes were used for the calculation. The neighbouring boxes were used in two steps (the nearest 8 boxes, and, if the number was still below the threshold, the next surrounding 16 boxes) to calculate the mean value of the center box. Thus, the spatial resolution of the parameters is reduced at certain points: instead of 1° x 1°, data from an area of up to 5° x 5° can be considered when neighboring values are taken into account, which are then averaged into a grid box value. This was especially the case for air pressure, where the 24 values of the neighboring boxes were included in the averaging for most grid boxes. The mean value, the number of measurements, the standard deviation and the number of grid boxes used to calculate the mean values are available as parameters in the products. The calculated monthly and annual means were allocated to the centers of the grid boxes: latitudes 47.5, 48.5, ...; longitudes -14.5, -13.5, ... In order to remove any existing values over land, a land-sea mask was used, which is also provided at 1° x 1° resolution.
In this version 2 of the BNSC, a slightly different database was used than for the KNSC, which resulted in small changes (less than 1 K) in the means and standard deviations of the 2-meter air temperature and dew point temperature. The changes in mean sea level pressure values and the associated standard deviations are in the range of a few hPa compared to the KNSC. The parameter names and units have been adjusted to meet the CF 1.6 standard.
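The neighbour-box fallback in step 2 can be sketched as follows: when a grid box holds fewer observations than the parameter-specific threshold, observations from the surrounding 8 boxes (and, if still insufficient, the next 16) are pooled before averaging. The helper below is a simplified illustration of that rule under assumed array inputs, not the operational BNSC processing code.

```python
import numpy as np

def box_mean(values, counts, i, j, threshold):
    """Mean for grid box (i, j); widens to 3x3 and then 5x5 neighbourhoods
    when the number of observations is below the threshold."""
    for half_width in (0, 1, 2):                       # centre box, +8 neighbours, +16 more
        sl_i = slice(max(i - half_width, 0), i + half_width + 1)
        sl_j = slice(max(j - half_width, 0), j + half_width + 1)
        n = counts[sl_i, sl_j].sum()
        if n >= threshold or half_width == 2:
            return values[sl_i, sl_j].sum() / max(n, 1), n

# Illustrative usage: per-box sums of observations and observation counts on a small grid
rng = np.random.default_rng(0)
counts = rng.integers(0, 30, size=(10, 10))
values = rng.normal(10, 2, size=(10, 10)) * counts     # per-box sums of observed values
print(box_mean(values, counts, i=5, j=5, threshold=20))
```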
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For more than a decade, open access book platforms have been distributing titles in order to maximise their impact. Each platform offers some form of usage data, showcasing the success of their offering. However, the numbers alone are not sufficient to convey how well a book is actually performing.
Our data set consists of 18,014 books and chapters. The selected titles had been added to the OAPEN Library collection before 1 January 2022, and the usage data of twelve months (January to December 2022) was captured. During that period, this collection of books and chapters was downloaded more than 10 million times. Each title has been linked to one broad subject, and the title’s language has been coded as either English, German or other languages.
The titles are rated using the TOANI score.
The acronym stands for Transparent Open Access Normalised Index. The transparency is based on the application of clear regulations, and by making all data used visible. The data is normalised, by using a common scale for the complete collection of an open access book platform. Additionally, there are only three possible values to score the titles: average, less than average and more than average. This index is set up to provide a clear and simple answer to the question whether an open access book has made an impact. It is not meant to give a sense of false accuracy; the complexities surrounding this issue cannot be measured in several decimal places.
The TOANI score is based on the following principles:
Select only titles that have been available for at least 12 months;
Use the usage data of the same 12 months period for the whole collection;
Each title is assigned one – high level – subject;
Each title is assigned one language;
All titles are grouped based on subject and language;
The groups should consist of at least 100 titles;
The following data must be made available for each title:
Platform
Total number of titles in the group
Subject
Language
Period used for the measurement
Minimum value, maximum value, median, first and third quartile of the platform’s usage data
Based on the previous, titles are classified as:
“Less than average” – First quartile; 25 % of the titles
“Average” – Second and third quartile; 50% of the titles
“More than average” – Fourth quartile; 25 % of the titles
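A minimal sketch of the scoring rule above, assuming a pandas DataFrame with hypothetical columns 'subject', 'language' and 'downloads': quartiles of the usage data are computed within each subject/language group, and each title is labelled accordingly.

```python
import pandas as pd

def toani_score(df):
    """Assign a TOANI-style label to each title within its subject/language group."""
    grouped = df.groupby(["subject", "language"])["downloads"]
    q1 = grouped.transform(lambda s: s.quantile(0.25))   # first quartile per group
    q3 = grouped.transform(lambda s: s.quantile(0.75))   # third quartile per group
    df = df.copy()
    df["toani"] = "average"                              # second and third quartiles
    df.loc[df["downloads"] <= q1, "toani"] = "less than average"
    df.loc[df["downloads"] > q3, "toani"] = "more than average"
    return df

# Illustrative usage with made-up usage counts
df = pd.DataFrame({
    "subject": ["History"] * 6, "language": ["English"] * 6,
    "downloads": [12, 40, 55, 70, 300, 900]})
print(toani_score(df)[["downloads", "toani"]])
```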
Data files containing detailed information about vehicles in the UK are also available, including make and model data.
Some tables have been withdrawn and replaced. The table index for this statistical series has been updated to provide a full map between the old and new numbering systems used in this page.
The Department for Transport is committed to continuously improving the quality and transparency of our outputs, in line with the Code of Practice for Statistics. In line with this, we have recently concluded a planned review of the processes and methodologies used in the production of Vehicle licensing statistics data. The review sought to identify and introduce further improvements and efficiencies in the coding technologies we use to produce our data, and as part of that we have identified several historical errors across the published data tables affecting different historical periods. These errors are the result of mistakes in past production processes that we have now identified, corrected, and taken steps to eliminate going forward.
Most of the revisions to our published figures are small, typically changing values by less than 1% to 3%. The key revisions are:
Licensed Vehicles (2014 Q3 to 2016 Q3)
We found that some unlicensed vehicles during this period were mistakenly counted as licensed. This caused a slight overstatement, about 0.54% on average, in the number of licensed vehicles during this period.
3.5 - 4.25 tonnes Zero Emission Vehicles (ZEVs) Classification
Since 2023, ZEVs weighing between 3.5 and 4.25 tonnes have been classified as light goods vehicles (LGVs) instead of heavy goods vehicles (HGVs). We have now applied this change to earlier data and corrected an error in table VEH0150. As a result, the number of newly registered HGVs has been reduced by:
3.1% in 2024
2.3% in 2023
1.4% in 2022
Table VEH0156 (2018 to 2023)
Table VEH0156, which reports average CO₂ emissions for newly registered vehicles, has been updated for the years 2018 to 2023. Most changes are minor (under 3%), but the e-NEDC measure saw a larger correction, up to 15.8%, due to a calculation error. Changes to the other measures (WLTP and Reported) were less notable, except for April 2020, when COVID-19 led to very few new registrations, which resulted in greater volatility in the percentages.
Neither these specific revisions, nor any of the others introduced, have had a material impact on the statistics overall, the direction of trends nor the key messages that they previously conveyed.
Specific details of each revision made have been included in the relevant data table notes to ensure transparency and clarity. Users are advised to review these notes as part of their regular use of the data to ensure their analysis accounts for these changes accordingly.
If you have questions regarding any of these changes, please contact the Vehicle statistics team.
Overview
VEH0101: Vehicles at the end of the quarter by licence status and body type: Great Britain and United Kingdom (ODS, 99.7 KB) - https://assets.publishing.service.gov.uk/media/68ecf5acf159f887526bbd7c/veh0101.ods
Detailed breakdowns
VEH0103: Licensed vehicles at the end of the year by tax class: Great Britain and United Kingdom (ODS, 23.8 KB) - https://assets.publishing.service.gov.uk/media/68ecf5abf159f887526bbd7b/veh0103.ods
VEH0105: Licensed vehicles at - https://assets.publishing.service.gov.uk/media/68ecf5ac2adc28a81b4acfc8/veh0105.ods
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Network traffic datasets with novel extended IP flow called NetTiSA flow
Datasets were created for the paper: NetTiSA: Extended IP Flow with Time-series Features for Universal Bandwidth-constrained High-speed Network Traffic Classification -- Josef Koumar, Karel Hynek, Jaroslav Pešek, Tomáš Čejka -- which is published in The International Journal of Computer and Telecommunications Networking, https://doi.org/10.1016/j.comnet.2023.110147. Please cite the usage of our datasets as:
Josef Koumar, Karel Hynek, Jaroslav Pešek, Tomáš Čejka, "NetTiSA: Extended IP flow with time-series features for universal bandwidth-constrained high-speed network traffic classification", Computer Networks, Volume 240, 2024, 110147, ISSN 1389-1286
@article{KOUMAR2024110147, title = {NetTiSA: Extended IP flow with time-series features for universal bandwidth-constrained high-speed network traffic classification}, journal = {Computer Networks}, volume = {240}, pages = {110147}, year = {2024}, issn = {1389-1286}, doi = {https://doi.org/10.1016/j.comnet.2023.110147}, url = {https://www.sciencedirect.com/science/article/pii/S1389128623005923}, author = {Josef Koumar and Karel Hynek and Jaroslav Pešek and Tomáš Čejka} }
This Zenodo repository contains 23 datasets created from 15 well-known published datasets, which are cited in the table below. Each dataset contains the NetTiSA flow feature vector.
NetTiSA flow feature vector
The novel extended IP flow called NetTiSA (Network Time Series Analysed) flow contains a universal bandwidth-constrained feature vector consisting of 20 features. We divide the NetTiSA flow classification features into three groups by computation. The first group of features is based on classical bidirectional flow information: the numbers of transferred bytes and packets. The second group contains statistical and time-based features calculated using time-series analysis of the packet sequences. The third type of features can be computed from the previous groups (i.e., on the flow collector) and improves the classification performance without any impact on the telemetry bandwidth.
Flow features
The flow features are:
Packets is the number of packets in the direction from the source to the destination IP address.
Packets in reverse order is the number of packets in the direction from the destination to the source IP address.
Bytes is the size of the payload in bytes transferred in the direction from the source to the destination IP address.
Bytes in reverse order is the size of the payload in bytes transferred in the direction from the destination to the source IP address.
Statistical and Time-based features
These features are exported in the extended part of the flow. All of them can be computed (exactly or approximately) in a stream-wise fashion, which is necessary for keeping memory requirements low. This second feature set contains the following features:
Mean represents mean of the payload lengths of packets
Min is the minimal value from payload lengths of all packets in a flow
Max is the maximum value from payload lengths of all packets in a flow
Standard deviation is a measure of the variation of payload lengths from the mean payload length
Root mean square is the measure of the magnitude of payload lengths of packets
Average dispersion is the average absolute difference between each payload length of the packet and the mean value
Kurtosis is the measure describing the extent to which the tails of a distribution differ from the tails of a normal distribution
Mean of relative times is the mean of the relative times, a sequence defined as \( st = \{ t_1 - t_1, t_2 - t_1, \dots, t_n - t_1 \} \).
Mean of time differences is the mean of the time differences, a sequence defined as \( dt = \{\, t_j - t_i \mid j = i + 1,\ i \in \{1, 2, \dots, n - 1\} \,\} \).
Min from time differences is the minimal value from all time differences, i.e., min space between packets.
Max from time differences is the maximum value from all time differences, i.e., max space between packets.
Time distribution describes the deviation of time differences between individual packets within the time series. The feature is computed by the following equation: \( tdist = \frac{ \frac{1}{n-1} \sum_{i=1}^{n-1} \left| \mu_{dt} - dt_i \right| }{ \frac{1}{2} \left( \max(dt) - \min(dt) \right) } \)
Switching ratio represents a value change ratio (switching) between payload lengths. The switching ratio is computed by the equation \( sr = \frac{s_n}{\frac{1}{2} (n - 1)} \), where \( s_n \) is the number of switches.
Features computed at the collector
The third set contains features that are computed from the previous two groups prior to classification. Therefore, they do not influence the network telemetry size, and their computation does not put additional load on resource-constrained flow monitoring probes. The NetTiSA flow combined with this feature set is called the Enhanced NetTiSA flow and contains the following features:
Max minus min is the difference between minimum and maximum payload lengths
Percent deviation is the dispersion of the average absolute difference to the mean value
Variance is the spread measure of the data from its mean
Burstiness is the degree of peakedness in the central part of the distribution
Coefficient of variation is a dimensionless quantity that compares the dispersion of a time series to its mean value and is often used to compare the variability of different time series that have different units of measurement
Directions describe a percentage ratio of packet directions, computed as \( \frac{d_1}{d_1 + d_0} \), where \( d_1 \) is the number of packets in the direction from the source to the destination IP address and \( d_0 \) is the number in the opposite direction. Both \( d_1 \) and \( d_0 \) are part of the classical bidirectional flow.
Duration is the duration of the flow
The NetTiSA flow is implemented into IP flow exporter ipfixprobe.
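To make the time-series features concrete, the sketch below computes a few of them (payload-length statistics, mean time difference, time distribution and switching ratio) from the per-packet payload lengths and timestamps of a single flow. It follows the equations given above, with the "number of switches" read as the number of value changes between consecutive payload lengths; it is an illustration, not the ipfixprobe implementation.

```python
import numpy as np

def nettisa_features(payload_lengths, timestamps):
    """A subset of NetTiSA statistical and time-based features for one flow."""
    pl = np.asarray(payload_lengths, dtype=float)
    ts = np.asarray(timestamps, dtype=float)
    n = len(pl)

    dt = np.diff(ts)                                   # time differences between packets
    mu_dt = dt.mean()
    # Time distribution: deviation of time differences, as in the tdist equation above
    tdist = np.abs(mu_dt - dt).mean() / (0.5 * (dt.max() - dt.min()))
    # Switching ratio: value changes between consecutive payload lengths over (n - 1) / 2
    switches = np.count_nonzero(np.diff(pl))
    sr = switches / (0.5 * (n - 1))

    return {"mean": pl.mean(), "std": pl.std(), "min": pl.min(), "max": pl.max(),
            "mean_time_diff": mu_dt, "time_distribution": tdist, "switching_ratio": sr}

# Illustrative flow: 6 packets with payload lengths (bytes) and timestamps (seconds)
print(nettisa_features([60, 1500, 1500, 60, 1500, 60],
                       [0.00, 0.01, 0.03, 0.10, 0.12, 0.50]))
```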
Description of dataset files
The following table describes each dataset file:
File name | Detection problem | Citation of the original raw dataset
botnet_binary.csv | Binary detection of botnet | S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014.
botnet_multiclass.csv | Multi-class classification of botnet | S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014.
cryptomining_design.csv | Binary detection of cryptomining; the design part | Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022.
cryptomining_evaluation.csv | Binary detection of cryptomining; the evaluation part | Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022.
dns_malware.csv | Binary detection of malware DNS | Samaneh Mahdavifar et al. Classifying Malicious Domains using DNS Traffic Analysis. In DASC/PiCom/CBDCom/CyberSciTech 2021, pages 60–67. IEEE, 2021.
doh_cic.csv | Binary detection of DoH | Mohammadreza MontazeriShatoori et al. Detection of DoH tunnels using time-series classification of encrypted traffic. In DASC/PiCom/CBDCom/CyberSciTech 2020, pages 63–70. IEEE, 2020.
doh_real_world.csv | Binary detection of DoH | Kamil Jeřábek et al. Collection of datasets with DNS over HTTPS traffic. Data in Brief, 42:108310, 2022.
dos.csv | Binary detection of DoS | Nickolaos Koroniotis et al. Towards the development of realistic botnet dataset in the Internet of Things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst., 100:779–796, 2019.
edge_iiot_binary.csv | Binary detection of IoT malware | Mohamed Amine Ferrag et al. Edge-IIoTset: A new comprehensive realistic cyber security dataset of IoT and IIoT applications: Centralized and federated learning, 2022.
edge_iiot_multiclass.csv | Multi-class classification of IoT malware | Mohamed Amine Ferrag et al. Edge-IIoTset: A new comprehensive realistic cyber security dataset of IoT and IIoT applications: Centralized and federated learning, 2022.
https_brute_force.csv | Binary detection of HTTPS brute force | Jan Luxemburk et al. HTTPS Brute-force dataset with extended network flows, November 2020.
ids_cic_binary.csv | Binary detection of intrusion in IDS | Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp, 1:108–116, 2018.
ids_cic_multiclass.csv | Multi-class classification of intrusion in IDS | Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp, 1:108–116, 2018.
unsw_binary.csv | Binary detection of intrusion in IDS | Nour Moustafa and Jill Slay. UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In 2015 Military Communications and Information Systems Conference (MilCIS), pages 1–6. IEEE, 2015.
unsw_multiclass.csv | Multi-class classification of intrusion in IDS | Nour Moustafa and Jill Slay. UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In 2015 Military Communications and Information Systems Conference (MilCIS), pages 1–6. IEEE, 2015.
iot_23.csv | Binary detection of IoT malware | Sebastian Garcia et al. IoT-23: A labeled dataset with malicious and benign IoT network traffic, January 2020. More details at https://www.stratosphereips.org/datasets-iot23
ton_iot_binary.csv | Binary detection of IoT malware | Nour Moustafa. A new distributed architecture for evaluating AI-based security systems at the edge: Network TON_IoT datasets. Sustainable Cities and Society, 72:102994, 2021.
ton_iot_multiclass.csv | Multi-class classification of IoT malware | Nour Moustafa. A new distributed architecture for evaluating AI-based security systems at the edge: Network TON_IoT datasets. Sustainable Cities and Society, 72:102994, 2021.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
To assess environmental fate, transport, and exposure for PFAS (per- and polyfluoroalkyl substances), predictive models are needed to fill experimental data gaps for physicochemical properties. In this work, quantitative structure–property relationship (QSPR) models for octanol–water partition coefficient, water solubility, vapor pressure, boiling point, melting point, and Henry’s law constant are presented. Over 200,000 experimental property value records were extracted from publicly available data sources. Global models generated from data for diverse chemical classes resulted in more accurate property value predictions for PFAS than local models generated from a PFAS-only data set, with an average 11% reduction in mean absolute error (MAE). The global models across all property endpoints achieved strong performance on test data (R2 = 0.76–0.89 for all chemical classes). The test set mean absolute error for PFAS was about 33% higher than the value for all chemicals in the test set (when averaged over the six data sets). The new global models yielded superior PFAS prediction statistics relative to those for existing Toxicity Estimation Software Tool (T.E.S.T) models, with an average 13% reduction in MAE. A nearest neighbor-based measure of model applicability domain (AD) was shown to exclude poor predictions while maintaining a relatively high fraction (∼95%) of chemicals inside the AD. In addition, most test set PFAS are outside the AD when the model was generated without PFAS in the training set.
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
Introduction: Equations can be used to calculate pulse wave velocity (ePWV) from blood pressure (BP) values and age. The ePWV predicts cardiovascular events beyond carotid-femoral PWV. We aimed to evaluate the correlation between four different equations to calculate ePWV.
Methods: The ePWV was estimated utilizing mean BP (MBP) from office BP (MBPOBP) or 24-hour ambulatory BP (MBP24-hBP). We separated the whole sample into two groups: individuals with risk factors and healthy individuals. The e-PWV was calculated as follows:
We calculated the concordance correlation coefficient (Pc) between e1-PWVOBP vs e2-PWVOBP, e1-PWV24-hBP vs e2-PWV24-hBP, and mean values of e1-PWVOBP, e2-PWVOBP, e1-PWV24-hBP, and e2-PWV24-hBP . The multilevel regression model determined how much the ePWVs are influenced by age and MBP values.
Results: We analyzed data from 1,541 individuals: 1,374 with risk factors and 167 healthy individuals. The values are presented for the entire sample, for risk-factor patients, and for healthy individuals, respectively. The correlation between e1-PWVOBP and e2-PWVOBP and between e1-PWV24-hBP and e2-PWV24-hBP was almost perfect. The Pc for e1-PWVOBP vs e2-PWVOBP was 0.996 (0.995-0.996), 0.996 (0.995-0.996), and 0.994 (0.992-0.995); furthermore, it was 0.994 (0.993-0.995), 0.994 (0.994-0.995), and 0.987 (0.983-0.990) for e1-PWV24-hBP vs e2-PWV24-hBP. There were no significant differences between mean values (m/s) for e1-PWVOBP vs e2-PWVOBP: 8.98±1.9 vs 8.97±1.8, p=0.88; 9.14±1.8 vs 9.13±1.8, p=0.88; and 7.57±1.3 vs 7.65±1.3, p=0.5. Mean values were also similar for e1-PWV24-hBP vs e2-PWV24-hBP: 8.36±1.7 vs 8.46±1.6, p=0.09; 8.50±1.7 vs 8.58±1.7, p=0.21; and 7.26±1.3 vs 7.39±1.2, p=0.34. The multiple linear regression showed that age, MBP, and age² predicted more than 99.5% of all four ePWV values.
Conclusion: Our data presents a nearly perfect correlation between the values of two equations to calculate the estimated PWV, whether utilizing office or ambulatory blood pressure.
Methods
This study is a secondary analysis of data obtained from two cross-sectional studies conducted at a specialized center in Brazil to diagnose and treat non-communicable diseases. In both studies, the inclusion criteria were adults aged 18 years and above, referred to undergo ambulatory blood pressure monitoring (ABPM) due to suspected non-treated or uncontrolled hypertension following initial blood pressure measurements by a physician. The combined databases included 1541 people. For the first database, we recruited participants between 28 January and 13 December 2013, and for the second database, between 23 January 2016 and 28 June 2019.
Prior to being fitted with an ABPM device and assisted by a trained nurse, all participants signed a written consent form to partake in the research. Later, the nurse collected demographic and clinical data, including any previous reports of clinical cardiovascular disease (CVD), acute myocardial infarction, acute coronary syndrome, coronary or other arterial revascularization, stroke, transient ischemic attack, aortic aneurysm, peripheral artery disease, and severe chronic kidney disease (CKD). All subjects had their BP, weight, height, and waist circumference measured and their body mass index (BMI) calculated.
Although the ePWV data from the Reference Values for Arterial Stiffness Collaboration originated from cohorts lacking established cardiovascular disease, cerebrovascular disease, or diabetes, we included diabetes, CVD, CKD, smokers, and obese individuals. This choice reflects a sample that more closely resembles what can be seen in everyday Brazilian physician appointments.
The study population was divided into two groups: healthy individuals and those with risk factors. Healthy individuals did not present any risk factors and had non-elevated BP (<140/90 mmHg). Conversely, the group with risk factors consisted of individuals with elevated BP (≥140 and/or ≥90 mmHg) or at least one risk factor, such as previous hypertension, dyslipidemia, diabetes, smoking, obesity (BMI ≥ 30 kg/m²), or an at-risk waist circumference (> 102 cm in males and > 88 cm in females).
Blood pressure measurement and ambulatory blood pressure monitoring
During the data collection for both studies, office BP (OBP) measurements were conducted following recommended guidelines to ensure accurate pressure values. In the first database, a nurse performed seven consecutive BP measurements utilizing a Microlife device BP3BTOA (Onbo Electronic Co, Shenzhen, China). In the second database, a nurse assistant operated a Microlife device model BP3AC1-1PC (Onbo Electronic Co, Shenzhen, China). This device operated on Microlife Average Mode which takes three measurements in succession and calculates the average BP value. The assistant took two sets of three BP measurements sequentially.
All individuals registered twenty-four hours of ABPM using a Dyna-Mapa / Mobil-O-Graph-NG monitor (Cardios, São Paulo, Brazil), equipped with an appropriately-sized cuff on their non-dominant arm. The readings were taken every 20 minutes during the day and every 30 minutes during the night, here understood as the period between going to bed and waking up. We respected all recommended protocols strictly to ensure quality recordings.
Calculation of estimated pulse wave velocity
The ePWV was calculated using the equations derived from the Reference Values for Arterial Stiffness Collaboration, incorporating age and MBP as follows:
MBP was also calculated as diastolic BP + 0.4*(systolic BP - diastolic BP). Thus, the values of e1-PWV and e2-PWV were obtained for the total sample, as well as separately for the groups comprising healthy individuals and those with risk factors. We used MBP from OBP (MBPOBP) to calculate e1-PWVOBP and e2-PWVOBP, and MBP from the twenty-four-hour BP average (MBP24-hBP) to calculate e1-PWV24-hBP and e2-PWV24-hBP.
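For clarity, the mean blood pressure term used for both equations follows the form below; this is a small illustration only, and the ePWV regression coefficients themselves (from the Reference Values for Arterial Stiffness Collaboration) are not reproduced here.

```python
def mean_blood_pressure(systolic, diastolic):
    """MBP = diastolic BP + 0.4 * (systolic BP - diastolic BP), in mmHg."""
    return diastolic + 0.4 * (systolic - diastolic)

print(mean_blood_pressure(systolic=130, diastolic=80))  # 100.0 mmHg
```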
The Human Research Ethics Committee of Sirio Libanes Hospital and Federal University of the Triângulo Mineiro provided ethical approval for data collection under protocol numbers 08930813.0.0000.5461 (first database) and 61985316.9.0000.5154 (second database), respectively.
License: U.S. Government Works, https://www.usa.gov/government-works
License information was derived automatically
The Tropical Rainfall Measuring Mission (TRMM) is a joint U.S.-Japan satellite mission to monitor tropical and subtropical precipitation and to estimate its associated latent heating.
The primary objective of the 2A21 algorithm is to compute the path-integrated attenuation (PIA), using the surface reference technique (SRT). The SRT relies on the assumption that the difference between the measurements of the normalized surface cross section within and outside the rain provides a measure of the PIA.
Two types of non-rain surface cross section (sigma-zero) reference estimates are used: spatial and temporal. In the spatial surface reference data set, the mean and standard deviation of the surface cross sections are calculated over a running window of Ns fields of view before rain is encountered. These operations are performed separately for each of the 49+2 incidence angles of TRMM (corresponding to the cross-track scan from -17 degrees to + 17 degrees with respect to nadir). The two additional angle bins (making the total 51 rather than 49) are to account for non-zero pitch/roll angles that can shift the incidence angle with respect to nadir outside the normal range.
For the temporal surface reference data set, the running mean and standard deviation are computed over a 1 degree x 1 degree (latitude, longitude) grid. Within each 1 degree x 1 degree grid cell, the data are further categorized into incidence angle categories (26). The number of observations in each category, Nt, are also recorded. Note that, for the temporal reference data set, no distinction is made between the port and starboard incidence angles. So, instead of 49 incidence angles, there are only 25 + 1, where the additional bin corresponds to angles greater than the normal range.
When rain is encountered, the mean and standard deviations of the reference sigma-zero values are retrieved from the spatial and temporal surface reference data sets. To determine which reference measurement is to be used, the algorithm checks whether Nt >= Ntmin and Ns >= Nsmin, where Ntmin and Nsmin are the minimum number of samples that are needed to be considered a valid reference estimate for the temporal and spatial reference data sets, respectively. (Presently, Ntmin = 50 and Nsmin = 8). If neither condition is satisfied, no estimate of the PIA is made and the flags are set accordingly. If only one condition is met, then the surface reference data which corresponds to this is used. If both conditions are satisfied, the surface reference data is taken from that set which has the smaller standard deviation.
If a valid surface reference data set exists (i.e., either Nt >= Ntmin or Ns >= Nsmin or both) then the 2-way path attenuation (PIA) is estimated from the equation:
PIA = sigma-zero(no rain) - sigma-zero(in rain)
where sigma-zero(in rain) is the value of the surface cross section over the rain volume of interest and sigma-zero(no rain) is the mean reference value retrieved from the selected (spatial or temporal) surface reference data set.
To assess the reliability of this PIA estimate, we consider the difference between the PIA, as derived in the above equation, and the standard deviation calculated from the no-rain sigma-zero values and stored in the reference data set. Labeling the latter std dev(reference value), the reliability factor of the PIA estimate is obtained from:
reliabFactor = PIA - std dev(reference value)
When this quantity is large, the reliability is considered high; when it is small, the reliability is low. This is the basic...
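A schematic Python sketch of the reference-selection, PIA and reliability logic described above is given below; the variable names and the dictionary-based data structures are illustrative assumptions and this is not the operational 2A21 code.

```python
NT_MIN, NS_MIN = 50, 8  # minimum sample counts for the temporal / spatial references (as stated above)

def estimate_pia(sigma0_rain, spatial_ref, temporal_ref):
    """Pick the valid reference with the smaller standard deviation and return
    (PIA, reliability factor), or (None, None) if no valid reference exists.

    spatial_ref / temporal_ref are assumed to be dicts with keys 'mean', 'std'
    and 'n' (number of no-rain observations)."""
    candidates = []
    if temporal_ref["n"] >= NT_MIN:
        candidates.append(temporal_ref)
    if spatial_ref["n"] >= NS_MIN:
        candidates.append(spatial_ref)
    if not candidates:
        return None, None  # no PIA estimate; flags would be set accordingly

    ref = min(candidates, key=lambda r: r["std"])   # smaller standard deviation wins
    pia = ref["mean"] - sigma0_rain                 # 2-way path attenuation
    reliab_factor = pia - ref["std"]                # large value -> high reliability
    return pia, reliab_factor

# Illustrative values only
print(estimate_pia(sigma0_rain=4.0,
                   spatial_ref={"mean": 9.5, "std": 1.0, "n": 12},
                   temporal_ref={"mean": 9.8, "std": 1.6, "n": 60}))
```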
[Updated 28/01/25 to fix an issue in the 'Lower' values, which were not fully representing the range of uncertainty. 'Median' and 'Higher' values remain unchanged. The size of the change varies by grid cell and fixed period/global warming level, but the average difference between the 'lower' values before and after this update is 1.2.]

What does the data show?
The Annual Count of Frost Days is the number of days per year where the minimum daily temperature is below 0°C. It measures how many times the threshold is exceeded (not by how much) in a year. The results should be interpreted as an approximation of the projected number of days when the threshold is exceeded, as there will be many factors, such as natural variability and local-scale processes, that the climate model is unable to represent. The Annual Count of Frost Days is calculated for two baseline (historical) periods, 1981-2000 (corresponding to 0.51°C warming) and 2001-2020 (corresponding to 0.87°C warming), and for global warming levels of 1.5°C, 2.0°C, 2.5°C, 3.0°C and 4.0°C above the pre-industrial (1850-1900) period. This enables users to compare the future number of frost days to previous values.

What are the possible societal impacts?
The Annual Count of Frost Days indicates increased cold weather disruption due to a higher than normal chance of ice and snow. It is based on the minimum daily temperature being below 0°C. Impacts include damage to crops, transport disruption, and increased energy demand. The Annual Count of Icing Days is a similar metric measuring impacts from cold temperatures; it indicates more severe cold weather impacts.

What is a global warming level?
The Annual Count of Frost Days is calculated from the UKCP18 regional climate projections using the high emissions scenario (RCP 8.5), where greenhouse gas emissions continue to grow. Instead of considering future climate change during specific time periods (e.g. decades) for this scenario, the dataset is calculated at various levels of global warming relative to the pre-industrial (1850-1900) period. The world has already warmed by around 1.1°C (between 1850-1900 and 2011-2020), whilst this dataset allows for the exploration of greater levels of warming. The global warming levels available in this dataset are 1.5°C, 2°C, 2.5°C, 3°C and 4°C. The data at each warming level were calculated using a 21-year period, obtained by taking 10 years either side of the first year at which the global warming level is reached; this year differs between model ensemble members. To calculate the value for the Annual Count of Frost Days, an average is taken across the 21-year period. Therefore, the Annual Count of Frost Days shows the number of frost days that could occur each year at each given level of warming. We cannot provide a precise likelihood for particular emission scenarios being followed in the real world future. However, we note that RCP8.5 corresponds to emissions considerably above those expected with current international policy agreements. The results are also expressed for several global warming levels because we do not yet know which level will be reached in the real climate, as it will depend on future greenhouse gas emission choices and the sensitivity of the climate system, which is uncertain.
Estimates based on the assumption of current international agreements on greenhouse gas emissions suggest a median warming level in the region of 2.4-2.8°C, but it could be either higher or lower than this level.

What are the naming conventions and how do I explore the data?
This data contains a field for each global warming level and the two baselines. Fields are named 'Frost Days', then the warming level or baseline, then 'upper', 'median' or 'lower', as described below. E.g. 'Frost Days 2.5 median' is the median value for the 2.5°C warming level. Decimal points are included in field aliases but not in field names, e.g. 'Frost Days 2.5 median' is 'FrostDays_25_median'. To understand how to explore the data, see this page: https://storymaps.arcgis.com/stories/457e7a2bc73e40b089fac0e47c63a578
Please note, if viewing in ArcGIS Map Viewer, the map will default to the 'Frost Days 2.0°C median' values.

What do the 'median', 'upper', and 'lower' values mean?
Climate models are numerical representations of the climate system. To capture uncertainty in projections for the future, an ensemble, or group, of climate models is run, each ensemble member having slightly different starting conditions or model set-ups. Considering all of the model outcomes gives users a range of plausible conditions which could occur in the future. For this dataset, the model projections consist of 12 separate ensemble members. To select which ensemble members to use, the Annual Count of Frost Days was calculated for each ensemble member and the members were then ranked from lowest to highest for each location. The 'lower' fields are the second lowest ranked ensemble member, the 'upper' fields are the second highest ranked ensemble member, and the 'median' field is the central value of the ensemble. This gives a median value and a spread of the ensemble members indicating the range of possible outcomes in the projections. This spread of outputs can be used to infer the uncertainty in the projections: the larger the difference between the lower and upper fields, the greater the uncertainty. 'Lower', 'median' and 'upper' are also given for the baseline periods, as these values also come from the model that was used to produce the projections. This allows a fair comparison between the model projections and the recent past.

Useful links
This dataset was calculated following the methodology in the 'Future Changes to high impact weather in the UK' report and uses the same temperature thresholds as the 'State of the UK Climate' report.
Further information on the UK Climate Projections (UKCP).
Further information on understanding climate data within the Met Office Climate Data Portal.
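A small, hypothetical Python helper illustrating the naming convention stated above (alias with the decimal point, field name without); the actual attribute table layout is defined by the published service, not by this sketch.

```python
def frost_days_field(level, stat):
    """Return (alias, field name) for a warming level or baseline,
    e.g. ('Frost Days 2.5 median', 'FrostDays_25_median')."""
    alias = f"Frost Days {level} {stat}"
    name = f"FrostDays_{level.replace('.', '')}_{stat}"
    return alias, name

print(frost_days_field("2.5", "median"))  # ('Frost Days 2.5 median', 'FrostDays_25_median')
print(frost_days_field("1.5", "upper"))   # ('Frost Days 1.5 upper', 'FrostDays_15_upper')
```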
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset is for the study of task decomposition effects in time estimation (the role of future boundaries and thought focus) and its supplementary materials. Previous research on the impact of task decomposition on time estimation often overlooked the role of time factors; for example, under the same decomposition, people subjectively set different time boundaries when facing difficult versus easy tasks. Taking the time factor into account should therefore refine and integrate the research conclusions on decomposition effects. On this basis, we studied the impact of task decomposition and future boundaries on time estimation. Experiment 1 used a 2 (task decomposition: decomposition/no decomposition) × 2 (future boundary: present/absent) between-subjects design, with the prospective paradigm used to measure participants' time estimates. Experiment 2 further manipulated the time range of the future boundary, using a 2 (task decomposition: decomposition/no decomposition) × 3 (future boundary range: longer/medium/shorter) between-subjects design, again measuring time estimates with the prospective paradigm. Building on Experiment 2, Experiment 3 further verified the mechanism by which the time range of the future boundary influences time estimation under decomposition conditions: in a single-factor between-subjects design, a thought-focus scale was used to measure participants' thought focus under longer and shorter boundary conditions. Through the above experiments and measurements, we obtained the following dataset.
Experiment 1 data column labels: Task decomposition is a grouping variable (0 = decomposition; 1 = no decomposition). Future boundary is a grouping variable (0 = present; 1 = absent). Zsco01: standard score of the estimated total task time. A logarithm: the logarithmic value of the estimated time for all tasks.
Experiment 2 data column labels: Future boundary is a grouping variable (7 = shorter, 8 = medium, 9 = longer). The remaining data labels are the same as in Experiment 1.
Experiment 3 data column labels: Zplan: standard score of the plan-focus score. Zbar: standard score of attention to barriers. Future boundary is a grouping variable (0 = shorter, 1 = longer).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
An updated and improved version of a global, vertically resolved, monthly mean zonal mean ozone database has been calculated, hereafter referred to as the BSVertOzone database. Like its predecessor, it combines measurements from several satellite-based instruments and ozone profile measurements from the global ozonesonde network. Monthly mean zonal mean ozone concentrations in mixing ratio and number density are provided in 5° latitude zones, spanning 70 altitude levels (1 to 70 km), or 70 pressure levels that are approximately 1 km apart (878.4 hPa to 0.046 hPa). Different data sets or "Tiers" are provided: "Tier 0" is based only on the available measurements and therefore does not completely cover the whole globe or the full vertical range uniformly; the "Tier 0.5" monthly mean zonal means are calculated from a filled version of the Tier 0 database, where missing monthly mean zonal mean values are estimated from correlations at level 20 against a total column ozone database, and then at levels above and below from correlations with the lower and upper levels, respectively. The Tier 0.5 database includes the full range of measurement variability and is created as an intermediate step for the calculation of the "Tier 1" data, where a least squares regression model is used to attribute variability to various known forcing factors for ozone. Regression model fit coefficients are expanded in Fourier series and Legendre polynomials (to account for seasonality and latitudinal structure, respectively). Four different combinations of contributions from selected regression model basis functions result in four different "Tier 1" data sets that can be used for comparisons with chemistry-climate model simulations that do not exhibit the same unforced variability as reality (unless they are nudged towards reanalyses). Compared to previous versions of the database, this update includes additional satellite data sources and ozonesonde measurements to extend the database period to 2016. Additional improvements over the previous version of the database include: (i) adjustments of measurements to account for biases and drifts between different data sources (using a chemistry-transport model simulation as a transfer standard), (ii) a more objective way to determine the optimum number of Fourier and Legendre expansions for the basis function fit coefficients, and (iii) the derivation of methodological and measurement uncertainties on each database value, traced through all data modification steps. Comparisons with the ozone database from SWOOSH (Stratospheric Water and OzOne Satellite Homogenized data set) show excellent agreement in many regions of the globe, with minor differences caused by different bias adjustment procedures for the two databases. However, compared to SWOOSH, BSVertOzone additionally covers the troposphere.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Q: Was the month drier or wetter than usual?
A: Colors show where and by how much monthly precipitation totals differed from average precipitation for the same month from 1991-2020. Green areas were wetter than the 30-year average for the month and brown areas were drier. White and very light areas had monthly precipitation totals close to the long-term average.

Q: Where do these measurements come from?
A: Daily measurements of rain and snow come from weather stations in the Global Historical Climatology Network (GHCN-D). Volunteer observers or automated instruments gather the data and submit them to the National Centers for Environmental Information (NCEI). After scientists check the quality of the data to omit any systematic errors, they calculate each station's monthly total and plot it on a 5x5 km gridded map. To fill in the grid at locations without stations, a computer program interpolates (or estimates) values, accounting for the distribution of stations and various physical relationships, such as the way temperature changes with elevation. The resulting product is the NOAA Monthly U.S. Climate Gridded Dataset (NClimGrid). To calculate the percent of average precipitation values shown on these maps—also called precipitation anomalies—NCEI scientists take the total precipitation in each 5x5 km grid box for a single month and year, and divide it by its 1991-2020 average for the same month. Multiplying that number by 100 yields a percent of average precipitation. If the result is greater than 100%, the region was wetter than average. Less than 100% means the region was drier than usual.

Q: What do the colors mean?
A: Shades of brown show places where total precipitation was below the long-term average for the month. Areas shown in shades of green had more liquid water from rain and/or snow than they averaged from 1991 to 2020. The darker the shade of brown or green, the larger the difference from the average precipitation. White and very light areas show where precipitation totals were the same as or very close to the long-term average. Note that snowfall totals are reported as the amount of liquid water they produce upon melting. Thus, a 10-inch snowfall that melts to produce one inch of liquid water would be counted as one inch of precipitation.

Q: Why do these data matter?
A: Comparing an area's recent precipitation to its long-term average can tell how wet or how dry the area is compared to usual. Knowing if an area is much drier or much wetter than usual can encourage people to pay close attention to on-the-ground conditions that affect daily life and decisions. People check maps like this to judge crop progress; monitor reservoir levels; consider if lawns and landscaping need water; and to understand the possibilities of flooding.

Q: How did you produce these snapshots?
A: Data Snapshots are derivatives of existing data products; to meet the needs of a broad audience, we present the source data in a simplified visual style. This set of snapshots is based on climate data (NClimGrid) produced by and available from the National Centers for Environmental Information (NCEI). To produce our images, we invoke a set of scripts that access the source data and represent them according to our selected color ramps on our base maps.

Additional information
The data used in these snapshots can be downloaded from different places and in different formats. We used these specific data sources: NClimGrid Total Precipitation, NClimGrid Precipitation Normals.

References
NOAA Monthly U.S. Climate Gridded Dataset (NClimGrid); NOAA Monthly U.S. Climate Divisional Database (NClimDiv); Improved Historical Temperature and Precipitation Time Series for U.S. Climate Divisions; NCEI Monthly National Analysis; Climate at a Glance - Data Information; NCEI Climate Monitoring - All Products.
Source: https://www.climate.gov/maps-data/
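The percent-of-average calculation described above reduces to a simple ratio. A minimal sketch is shown below with illustrative array values, not actual NClimGrid data.

```python
import numpy as np

def percent_of_average(monthly_total_mm, normal_1991_2020_mm):
    """Percent of average precipitation: >100% means wetter than average, <100% drier."""
    return 100.0 * monthly_total_mm / normal_1991_2020_mm

# One grid cell: 75 mm fell this month against a 1991-2020 average of 60 mm
print(percent_of_average(75.0, 60.0))        # 125.0 -> wetter than average

# Works element-wise on gridded arrays as well (illustrative 2x2 grid)
grid_total = np.array([[10.0, 55.0], [60.0, 12.0]])
grid_normal = np.array([[20.0, 50.0], [60.0, 30.0]])
print(percent_of_average(grid_total, grid_normal))
```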
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
You will find three datasets containing heights of high school students.
All heights are in inches.
The data is simulated. The heights are generated from a normal distribution with different sets of mean and standard deviation for boys and girls.
| Height Statistics (inches) | Boys | Girls |
|---|---|---|
| Mean | 67 | 62 |
| Standard Deviation | 2.9 | 2.2 |
There are 500 measurements for each gender.
Here are the datasets:
hs_heights.csv: contains a single column with heights for all boys and girls. There's no way to tell which of the values are for boys and which ones are for girls.
hs_heights_pair.csv: has two columns. The first column has boys' heights. The second column contains girls' heights.
hs_heights_flag.csv: has two columns. The first column has the flag is_girl. The second column contains a girl's height if the flag is 1. Otherwise, it contains a boy's height.
To see how I generated this dataset, check this out: https://github.com/ysk125103/datascience101/tree/main/datasets/high_school_heights
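The linked repository contains the author's actual generation script. The sketch below is an independent, minimal reconstruction from the stated parameters (normal distributions, 500 samples per gender, heights in inches); the seed, column names and lack of rounding are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)                  # seed chosen arbitrarily
boys = rng.normal(loc=67, scale=2.9, size=500)   # inches
girls = rng.normal(loc=62, scale=2.2, size=500)  # inches

# hs_heights.csv: a single unlabeled pool of all heights
pd.DataFrame({"height": np.concatenate([boys, girls])}).to_csv("hs_heights.csv", index=False)

# hs_heights_pair.csv: boys in the first column, girls in the second
pd.DataFrame({"boy_height": boys, "girl_height": girls}).to_csv("hs_heights_pair.csv", index=False)

# hs_heights_flag.csv: is_girl flag plus the corresponding height
flags = np.concatenate([np.zeros(500, dtype=int), np.ones(500, dtype=int)])
heights = np.concatenate([boys, girls])
pd.DataFrame({"is_girl": flags, "height": heights}).to_csv("hs_heights_flag.csv", index=False)
```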
Image by Gillian Callison from Pixabay
[Updated 28/01/25 to fix an issue in the 'Lower' values, which were not fully representing the range of uncertainty. 'Median' and 'Higher' values remain unchanged. The size of the change varies by grid cell and fixed period/global warming level, but the average difference between the 'lower' values before and after this update is 0.2.]

What does the data show?
The Annual Count of Hot Summer Days is the number of days per year where the maximum daily temperature is above 30°C. It measures how many times the threshold is exceeded (not by how much) in a year. Note, the term 'hot summer days' is used to refer to the threshold; temperatures above 30°C outside the summer months also contribute to the annual count. The results should be interpreted as an approximation of the projected number of days when the threshold is exceeded, as there will be many factors, such as natural variability and local-scale processes, that the climate model is unable to represent. The Annual Count of Hot Summer Days is calculated for two baseline (historical) periods, 1981-2000 (corresponding to 0.51°C warming) and 2001-2020 (corresponding to 0.87°C warming), and for global warming levels of 1.5°C, 2.0°C, 2.5°C, 3.0°C and 4.0°C above the pre-industrial (1850-1900) period. This enables users to compare the future number of hot summer days to previous values.

What are the possible societal impacts?
The Annual Count of Hot Summer Days indicates increased health risks, transport disruption and damage to infrastructure from high temperatures. It is based on exceeding a maximum daily temperature of 30°C. Impacts include increased heat-related illnesses, hospital admissions or death, and transport disruption due to overheating of railway infrastructure; overhead power lines also become less efficient. Other metrics, such as the Annual Count of Summer Days (days above 25°C), the Annual Count of Extreme Summer Days (days above 35°C) and the Annual Count of Tropical Nights (where the minimum temperature does not fall below 20°C), also indicate impacts from high temperatures, but they use different temperature thresholds.

What is a global warming level?
The Annual Count of Hot Summer Days is calculated from the UKCP18 regional climate projections using the high emissions scenario (RCP 8.5), where greenhouse gas emissions continue to grow. Instead of considering future climate change during specific time periods (e.g. decades) for this scenario, the dataset is calculated at various levels of global warming relative to the pre-industrial (1850-1900) period. The world has already warmed by around 1.1°C (between 1850-1900 and 2011-2020), whilst this dataset allows for the exploration of greater levels of warming. The global warming levels available in this dataset are 1.5°C, 2°C, 2.5°C, 3°C and 4°C. The data at each warming level were calculated using a 21-year period, obtained by taking 10 years either side of the first year at which the global warming level is reached; this year differs between model ensemble members. To calculate the value for the Annual Count of Hot Summer Days, an average is taken across the 21-year period. Therefore, the Annual Count of Hot Summer Days shows the number of hot summer days that could occur each year at each given level of warming. We cannot provide a precise likelihood for particular emission scenarios being followed in the real world future.
However, we note that RCP8.5 corresponds to emissions considerably above those expected with current international policy agreements. The results are also expressed for several global warming levels because we do not yet know which level will be reached in the real climate, as it will depend on future greenhouse gas emission choices and the sensitivity of the climate system, which is uncertain. Estimates based on the assumption of current international agreements on greenhouse gas emissions suggest a median warming level in the region of 2.4-2.8°C, but it could be either higher or lower than this level.

What are the naming conventions and how do I explore the data?
This data contains a field for each global warming level and the two baselines. Fields are named 'HSD' (where HSD means Hot Summer Days), then the warming level or baseline, then 'upper', 'median' or 'lower', as described below. E.g. 'Hot Summer Days 2.5 median' is the median value for the 2.5°C warming level. Decimal points are included in field aliases but not in field names, e.g. 'Hot Summer Days 2.5 median' is 'HotSummerDays_25_median'. To understand how to explore the data, see this page: https://storymaps.arcgis.com/stories/457e7a2bc73e40b089fac0e47c63a578
Please note, if viewing in ArcGIS Map Viewer, the map will default to the 'HSD 2.0°C median' values.

What do the 'median', 'upper', and 'lower' values mean?
Climate models are numerical representations of the climate system. To capture uncertainty in projections for the future, an ensemble, or group, of climate models is run, each ensemble member having slightly different starting conditions or model set-ups. Considering all of the model outcomes gives users a range of plausible conditions which could occur in the future. For this dataset, the model projections consist of 12 separate ensemble members. To select which ensemble members to use, the Annual Count of Hot Summer Days was calculated for each ensemble member and the members were then ranked from lowest to highest for each location. The 'lower' fields are the second lowest ranked ensemble member, the 'upper' fields are the second highest ranked ensemble member, and the 'median' field is the central value of the ensemble. This gives a median value and a spread of the ensemble members indicating the range of possible outcomes in the projections. This spread of outputs can be used to infer the uncertainty in the projections: the larger the difference between the lower and upper fields, the greater the uncertainty. 'Lower', 'median' and 'upper' are also given for the baseline periods, as these values also come from the model that was used to produce the projections. This allows a fair comparison between the model projections and the recent past.

Useful links
This dataset was calculated following the methodology in the 'Future Changes to high impact weather in the UK' report and uses the same temperature thresholds as the 'State of the UK Climate' report.
Further information on the UK Climate Projections (UKCP).
Further information on understanding climate data within the Met Office Climate Data Portal.
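A minimal sketch of the 'lower'/'median'/'upper' selection described above (second lowest, central value, second highest of the 12 ranked ensemble members); the member values are illustrative, not UKCP18 output, and the even-ensemble median convention is an assumption.

```python
def summarise_ensemble(values):
    """Given one location's values from the 12 ensemble members, return
    (lower, median, upper) = (2nd lowest, central value, 2nd highest)."""
    ranked = sorted(values)
    n = len(ranked)                 # 12 for this dataset
    lower = ranked[1]               # second lowest ranked member
    upper = ranked[-2]              # second highest ranked member
    median = 0.5 * (ranked[n // 2 - 1] + ranked[n // 2]) if n % 2 == 0 else ranked[n // 2]
    return lower, median, upper

# Illustrative annual counts of hot summer days from 12 members at one grid cell
members = [3, 5, 6, 6, 7, 8, 8, 9, 10, 11, 12, 15]
print(summarise_ensemble(members))  # (5, 8.0, 12)
```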
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Associations between developmental stability, canalization and phenotypic plasticity have been predicted, but rarely supported by direct evidence. Architectural analysis may provide a more powerful approach to finding correlations among these mechanisms in plants. To investigate the relationships among the three mechanisms from an architectural perspective, we subjected plants of Abutilon theophrasti to three densities, and measured and calculated fluctuating asymmetry (FA), coefficients of variation (CV) and plasticity (PI) of three leaf traits, to analyze the correlations among these variables. As density increased, mean leaf size, petiole length and angle of most layers and mean leaf FA of some layers decreased (at both stages), CV of petiole angle increased (at day 50), and PI of petiole length and angle across all layers decreased (at day 70); leaf FA and CV of traits generally increased with higher layers at all densities. At both stages, there were more positive correlations between FA and CV at lower vs. high densities; at day 50, little correlation of plasticity with FA or CV was found; at day 70, more positive correlations between FA and PI occurred for response to high vs. low density than for response to medium vs. low density, and more positive correlations between CV and PI occurred at lower vs. high densities. Results suggested that developmental instability, decreased canalization and plasticity can be cooperative, and that the relationships between decreased canalization and plasticity are more likely to be positive if decreased canalization is due to vigorous growth rather than stressful effects. The relationship of plasticity with developmental instability differed from its relationship with decreased canalization in the way it varied. Decreased canalization should be more beneficial for possible plasticity in the future, while canalization may result from already-expressed plasticity.
Methods
The experiment was conducted in a dedicated research field at the Pasture Ecological Research Station of Northeast Normal University, Changling, Jilin province, China (123°44′ E, 44°40′ N) in 2007. The soil (aeolian sandy soil, pH = 8.3) of the experimental field was somewhat infertile because of frequent annual use, so we covered the field with a 5-10 cm layer of virgin soil (meadow soil, pH = 8.2) transported from the nearby meadow (with no cultivation history) to the north of the research station, to improve soil quality (Wang & Zhou 2021b). A completely randomized design was implemented with three density treatments and three replicate plots, randomly arranged as nine plots of 2 × 3 m. Low, medium and high densities were set up by sowing seeds at three inter-planting distances of 30, 20 and 10 cm, to reach target plant densities of 12.8, 27.5 and 108.5 plants·m-2, respectively. Seeds were sown on June 7, 2007, and most seedlings emerged four days after sowing. Seedlings were then thinned to the target densities when they reached the four-leaf stage. Plots were hand weeded and watered regularly before experiencing drought.
Data collection and calculations of CV and PI
Plants were sampled at day 50 and 70 of growth after seedling emergence. At each sampling, five to six individuals were randomly chosen from each plot, giving a maximum total of 6 individuals × 3 plots × 3 densities × 2 stages = 108 samples. Each individual was divided into different architectural layers every 10 cm (day 50) or 20 cm (day 70) vertically from bottom to top.
Samples from different layers, density treatments and plots were mixed together and measured in a random sequence. For each layer per individual, we measured all the leaves on the main stem immediately after sampling, while they were fresh. For each leaf, we measured the width of the right and left halves (from the midrib to the margin) at the widest point of the leaf (perpendicular to the midrib) with a digital caliper (Wilsey et al. 1998). The width of each side was measured twice in immediate succession. For each leaf, we also calculated the leaf size (LS) as the average width of the right and left sides (Palmer & Strobeck 1986; Wilsey et al. 1998), and measured the length and angle (the angle between the petiole and the main stem) of each petiole. Canalization was evaluated by the coefficient of variation (CV, the standard deviation divided by the mean value of the trait) among individuals per layer for leaf size, petiole length and angle. For each trait, there were at most 18 individual values for each layer, in a total of 5-7 layers. Plasticity was calculated by the simplified Relative Distance Plasticity Index (Valladares et al. 2006) for each of leaf size, petiole length and angle, abbreviated PI, with the formula:
PI = (X – Y)/(X + Y) (1-1)
where X was the adjusted mean trait value at high or medium density and Y was the adjusted mean value at low density. We calculated both the plasticity in response to high vs. low density (PIHL) and the plasticity in response to medium vs. low density (PIML). Adjusted mean trait values were produced from one-way ANCOVA on original mean values, with density as the effect and plant size (total mass) as a covariate.
Calculations and analyses of FA
We compared various conventional indexes (FA1-FA8 and FA10) for calculating the fluctuating asymmetry (FA) in leaf width, to identify the ones with the highest explanatory power for our study design (Table A2). Different indexes showed little response or similar trends in response to density or architectural layer (Table A3-A5), thus we adopted FA1 and FA2 (with and without the effects of leaf size, respectively) and FA10 (the only index with measurement error variance partitioned out of the total between-sides variance) in analyses, with the formulas (Palmer 1994; Palmer & Strobeck 2003):
FA1 = ∑ |R – L| / n (2-1)
FA2 = ∑ (|R – L| / LS) / n (2-2)
FA10 = 0.798 × √((MSsi - MSm) / M) (2-3)
where R and L were the widths of the right and left sides of a leaf, respectively, n was the total number of leaves, LS (leaf size) was calculated as (R+L)/2, MSsi was the mean squares of the side × individual interaction, MSm was the mean squares of measurement error, and M was the number of replicate measurements per side, from a side × individual ANOVA on untransformed replicate measurements of R and L. We measured skew (γ1) and kurtosis (γ2) to evaluate whether the leaf asymmetry deviated from normality. To detect the presence of antisymmetry, kurtosis (γ2) was tested with a t-test of the null hypothesis H0: γ2 = 0, where a significant negative γ2 indicates possible antisymmetry (Cowart & Graham 1999; Palmer 1994). To test the presence of directional asymmetry, we used two methods: 1) testing (R - L) against 0 with a one-sample t-test (the hypothesis H0: γ1 = 0); and 2) testing whether the difference between sides (mean squares for the side effect [MSs]) is greater than nondirectional asymmetry (mean squares for the side × individual interaction [MSsi]) with factorial ANOVA (Palmer 1994; Wilsey et al. 1998).
For each layer, density and stage combination, two sets of samples (at day 50) showed leptokurtosis, indicating antisymmetry, but only one set of samples showed left-dominant directional asymmetry (Table A6, A7). In addition, seven sets of samples at day 50 and ten sets at day 70 also showed a greater mean difference between sides (MSs) than between-sides variation (MSsi; Table A8, A9), indicating directional asymmetry. We regressed |R - L| on LS for all the leaves of individuals per layer at each density and stage to determine the size-dependence of leaf asymmetry, and found that several cases of leaf asymmetry were size-dependent (Table A6, A7). We also evaluated whether the between-sides variation is significantly larger than the measurement error (MSm) in factorial ANOVA (Palmer 1994). The MSm values for all cases were lower than the MSsi values (Table A8, A9).
Statistical analysis
Mean value (MV), CV and PI of leaf size, petiole length and angle, and leaf FA were used in the statistical analyses. The original data were log-transformed, and petiole angles were square-root-transformed, before analysis to minimize variance heterogeneity. All analyses were conducted using SAS statistical software (SAS Institute Inc. 2002, version 9.0). Three-way ANOVA was performed for the effects of growth stage, population density, plant layer and their interactions on all variables. We then used one-way ANOVA for differences among layers for all variables at each density, and one-way ANCOVA for the effects of density on all variables for each layer or across all layers, with plant total biomass as a covariate. Multiple comparisons used the LSD (Least Significant Difference) method in the General Linear Model (GLM) procedure. For each of the three leaf traits at each density and stage, correlations among MV, FA (only results with FA2 are presented, as different FA indexes gave similar results), CV and PI across all layers were analyzed with PROC CORR, producing Pearson Correlation Coefficients (PCC) for all correlations and Partial Pearson Correlation Coefficients (PPCC) for correlations among FA, CV and PI, controlling for mean trait value in the partial correlation analyses.
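A minimal Python sketch of the FA1, FA2, CV and PI formulas quoted above, using illustrative leaf-width values; FA10 is omitted because it requires the ANOVA mean squares described in the text, and the sample-standard-deviation choice for CV is an assumption.

```python
import numpy as np

def fa1(right, left):
    """FA1 = sum(|R - L|) / n"""
    right, left = np.asarray(right, float), np.asarray(left, float)
    return np.mean(np.abs(right - left))

def fa2(right, left):
    """FA2 = sum(|R - L| / LS) / n, with leaf size LS = (R + L) / 2"""
    right, left = np.asarray(right, float), np.asarray(left, float)
    leaf_size = (right + left) / 2.0
    return np.mean(np.abs(right - left) / leaf_size)

def cv(trait_values):
    """Coefficient of variation: standard deviation divided by the mean."""
    trait_values = np.asarray(trait_values, float)
    return trait_values.std(ddof=1) / trait_values.mean()

def pi(x_high_or_medium, y_low):
    """Simplified relative distance plasticity index: (X - Y) / (X + Y)."""
    return (x_high_or_medium - y_low) / (x_high_or_medium + y_low)

# Illustrative right/left half-widths (mm) for the leaves of one layer
right_w = [21.3, 18.9, 25.1, 22.0]
left_w = [20.8, 19.4, 24.2, 22.5]
print(fa1(right_w, left_w), fa2(right_w, left_w), cv(right_w), pi(24.0, 30.0))
```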
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
"We believe that by accounting for the inherent uncertainty in the system during each measurement, the relationship between cause and effect can be assessed more accurately, potentially reducing the duration of research."
Short description
This dataset was created as part of a research project investigating the efficiency and learning mechanisms of a Bayesian adaptive search algorithm supported by the Imprecision Entropy Indicator (IEI) as a novel method. It includes detailed statistical results, posterior probability values, and the weighted averages of IEI across multiple simulations aimed at target localization within a defined spatial environment. Control experiments, including random search, random walk, and genetic algorithm-based approaches, were also performed to benchmark the system's performance and validate its reliability.
The task involved locating a target area centered at (100; 100) within a radius of 10 units (Research_area.png), inside a circular search space with a radius of 100 units. The search process continued until 1,000 successful target hits were achieved.
To benchmark the algorithm's performance and validate its reliability, control experiments were conducted using alternative search strategies, including random search, random walk, and genetic algorithm-based approaches. These control datasets serve as baselines, enabling comprehensive comparisons of efficiency, randomness, and convergence behavior across search methods, thereby demonstrating the effectiveness of our novel approach.
Uploaded files
The first dataset contains the average IEI values, generated by randomly simulating 300 x 1 hits for 10 bins per quadrant (4 quadrants in total) using the Python programming language, and calculating the corresponding IEI values. This resulted in a total of 4 x 10 x 300 x 1 = 12,000 data points. The summary of the IEI values by quadrant and bin is provided in the file results_1_300.csv. The calculation of IEI values for averages is based on likelihood, using an absolute difference-based approach for the likelihood probability computation. IEI_Likelihood_Based_Data.zip
The weighted IEI average values for likelihood calculation (Bayes formula) are provided in the file Weighted_IEI_Average_08_01_2025.xlsx
This dataset contains the results of a simulated target search experiment using Bayesian posterior updates and Imprecision Entropy Indicators (IEI). Each row represents a hit during the search process, including metrics such as Shannon entropy (H), Gini index (G), average distance, angular deviation, and calculated IEI values. The dataset also includes bin-specific posterior probability updates and likelihood calculations for each iteration. The simulation explores adaptive learning and posterior penalization strategies to optimize search efficiency. Our Bayesian adaptive searching system source code (search algorithm, 1,000 target searches): IEI_Self_Learning_08_01_2025.py
This dataset contains the results of 1,000 iterations of a successful target search simulation. The simulation runs until the target is successfully located in each iteration. The dataset further includes three main outputs: a) Results files (results{iteration_number}.csv): Details of each hit during the search process, including entropy measures, Gini index, average distance and angle, Imprecision Entropy Indicators (IEI), coordinates, and the bin number of the hit. b) Posterior updates (Pbin_all_steps_{iter_number}.csv): Tracks the posterior probability updates for all bins during the search process across multiple steps. c) Likelihood analysis (likelihood_analysis_{iteration_number}.csv): Contains the calculated likelihood values for each bin at every step, based on the difference between the measured IEI and pre-defined IEI bin averages. IEI_Self_Learning_08_01_2025.py
Based on the mentioned Python source code (see point 3, Bayesian adaptive searching method with IEI values), we performed 1,000 successful target searches, and the outputs were saved in the Self_learning_model_test_output.zip file.
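As a generic illustration of the kind of bin-wise Bayesian update described here (a likelihood derived from the absolute difference between a measured IEI and the pre-defined bin averages), the following sketch shows one plausible formulation. It is not the authors' IEI_Self_Learning_08_01_2025.py code; in particular, the inverse-absolute-difference likelihood and the absence of penalization rules are assumptions made for illustration.

```python
import numpy as np

def update_posterior(prior, measured_iei, bin_iei_averages, eps=1e-9):
    """One Bayes step over the search bins.

    The likelihood is a generic inverse absolute-difference weighting
    (an illustrative assumption); the published script defines its own
    likelihood and posterior penalization strategy."""
    prior = np.asarray(prior, float)
    diffs = np.abs(np.asarray(bin_iei_averages, float) - measured_iei)
    likelihood = 1.0 / (diffs + eps)
    posterior = prior * likelihood
    return posterior / posterior.sum()

# Illustrative example with 4 bins and a uniform prior
prior = np.full(4, 0.25)
print(update_posterior(prior, measured_iei=0.42, bin_iei_averages=[0.10, 0.35, 0.45, 0.80]))
```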
Bayesian Search (IEI) from different quadrants. This dataset contains the results of Bayesian adaptive target search simulations, including various outputs that represent the performance and analysis of the search algorithm. The dataset includes: a) Heatmaps (Heatmap_I_Quadrant, Heatmap_II_Quadrant, Heatmap_III_Quadrant, Heatmap_IV_Quadrant): These heatmaps represent the search results and the paths taken from each quadrant during the simulations. They indicate how frequently the system selected each bin during the search process. b) Posterior Distributions (All_posteriors, Probability_distribution_posteriors_values, CDF_posteriors_values): Generated based on posterior values, these files track the posterior probability updates, including cumulative distribution functions (CDF) and probability distributions. c) Macro Summary (summary_csv_macro): This file aggregates metrics and key statistics from the simulation. It summarizes the results from the individual results.csv files. d) Heatmap Searching Method Documentation (Bayesian_Heatmap_Searching_Method_05_12_2024): This document visualizes the search algorithm's path, showing how frequently each bin was selected during the 1,000 successful target searches. e) One-Way ANOVA Analysis (Anova_analyze_dataset, One_way_Anova_analysis_results): This includes the database and SPSS calculations used to examine whether the starting quadrant influences the number of search steps required. The analysis was conducted at a 5% significance level, followed by a Games-Howell post hoc test [43] to identify which target-surrounding quadrants differed significantly in terms of the number of search steps. Results were saved in Self_learning_model_test_results.zip.
This dataset contains randomly generated sequences of bin selections (1-40) from a control search algorithm (random search) used to benchmark the performance of Bayesian-based methods. The process iteratively generates random numbers until a stopping condition is met (reaching target bins 1, 11, 21, or 31). This dataset serves as a baseline for analyzing the efficiency, randomness, and convergence of non-adaptive search strategies. The dataset includes the following: a) The Python source code of the random search algorithm. b) A file (summary_random_search.csv) containing the results of 1000 successful target hits. c) A heatmap visualizing the frequency of search steps for each bin, providing insight into the distribution of steps across the bins. Random_search.zip
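A minimal sketch of a random-search control consistent with this description (uniform draws from bins 1-40 until a target bin is reached); the authors' actual script is the one provided in Random_search.zip.

```python
import random

TARGET_BINS = {1, 11, 21, 31}  # stopping condition, as described above

def random_search(rng=None):
    """Draw bins uniformly at random until a target bin is hit; return the visited bins."""
    rng = rng or random.Random()
    path = []
    while True:
        bin_id = rng.randint(1, 40)
        path.append(bin_id)
        if bin_id in TARGET_BINS:
            return path

print(len(random_search(random.Random(0))))  # number of steps in one run
```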
This dataset contains the results of a random walk search algorithm, designed as a control mechanism to benchmark adaptive search strategies (Bayesian-based methods). The random walk operates within a defined space of 40 bins, where each bin has a set of neighboring bins. The search begins from a randomly chosen starting bin and proceeds iteratively, moving to a randomly selected neighboring bin, until one of the stopping conditions is met (bins 1, 11, 21, or 31). The dataset provides detailed records of 1,000 random walk iterations, with the following key components: a) Individual Iteration Results: Each iteration's search path is saved in a separate CSV file (random_walk_results_.csv), listing the sequence of steps taken and the corresponding bin at each step. b) Summary File: A combined summary of all iterations is available in random_walk_results_summary.csv, which aggregates the step-by-step data for all 1,000 random walks. c) Heatmap Visualization: A heatmap file is included to illustrate the frequency distribution of steps across bins, highlighting the relative visit frequencies of each bin during the random walks. d) Python Source Code: The Python script used to generate the random walk dataset is provided, allowing reproducibility and customization for further experiments. Random_walk.zip
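A schematic random-walk control consistent with the description above. The actual neighbour structure of the 40 bins is defined in the authors' Random_walk.zip script and is not specified here, so the ring-shaped neighbourhood below is only an assumption.

```python
import random

TARGET_BINS = {1, 11, 21, 31}

# Assumed neighbourhood: bins arranged on a ring, each with two neighbours.
# The real bin adjacency is defined in the published script.
NEIGHBOURS = {b: [((b - 2) % 40) + 1, (b % 40) + 1] for b in range(1, 41)}

def random_walk(rng=None):
    """Start at a random bin and hop to random neighbours until a target bin is reached."""
    rng = rng or random.Random()
    current = rng.randint(1, 40)
    path = [current]
    while current not in TARGET_BINS:
        current = rng.choice(NEIGHBOURS[current])
        path.append(current)
    return path

print(random_walk(random.Random(1)))  # sequence of visited bins for one run
```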
This dataset contains the results of a genetic search algorithm implemented as a control method to benchmark adaptive Bayesian-based search strategies. The algorithm operates in a 40-bin search space with predefined target bins (1, 11, 21, 31) and evolves solutions through random initialization, selection, crossover, and mutation over 1000 successful runs. Dataset Components: a) Run Results: Individual run data is stored in separate files (genetic_algorithm_run_.csv), detailing: Generation: The generation number. Fitness: The fitness score of the solution. Steps: The path length in bins. Solution: The sequence of bins visited. b) Summary File: summary.csv consolidates the best solutions from all runs, including their fitness scores, path lengths, and sequences. c) All Steps File: summary_all_steps.csv records all bins visited during the runs for distribution analysis. d) A heatmap was also generated for the genetic search algorithm, illustrating the frequency of bins chosen during the search process as a representation of the search pathways. Genetic_search_algorithm.zip
Technical Information
The dataset files have been compressed into a standard ZIP archive using Total Commander (version 9.50). The ZIP format ensures compatibility across various operating systems and tools.
The XLSX files were created using Microsoft Excel Standard 2019 (Version 1808, Build 10416.20027)
The Python program was developed using Visual Studio Code (Version 1.96.2, user setup), with the following environment details: Commit fabd6a6b30b49f79a7aba0f2ad9df9b399473380f, built on 2024-12-19. The Electron version is 32.6, and the runtime environment includes Chromium 128.0.6263.186, Node.js 20.18.1, and V8 12.8.374.38-electron.0. The operating system is Windows NT x64 10.0.19045.
The statistical analysis included in this dataset was partially conducted using IBM SPSS Statistics, Version 29.0.1.0
The CSV files in this dataset were created following European conventions, using a semicolon (;) as the delimiter instead of a comma, and are encoded in UTF-8 to ensure compatibility with a wide range of applications.
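Given the semicolon delimiter and UTF-8 encoding described above, the CSV files can be loaded, for example, with pandas as follows (the file name shown is one of the result files listed earlier):

```python
import pandas as pd

# Semicolon-delimited, UTF-8 encoded CSV, as described above
df = pd.read_csv("results_1_300.csv", sep=";", encoding="utf-8")
print(df.head())
```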