https://spdx.org/licenses/CC0-1.0.html
Normative learning theories dictate that we should preferentially attend to informative sources, but only up to the point that our limited learning systems can process their content. Humans, including infants, show this predicted strategic deployment of attention. Here we demonstrate that rhesus monkeys, much like humans, attend to events of moderate surprisingness over both more and less surprising events. They do this in the absence of any specific goal or contingent reward, indicating that the behavioral pattern is spontaneous. We suggest this U-shaped attentional preference represents an evolutionarily preserved strategy for guiding intelligent organisms toward material that is maximally useful for learning.

Methods

How the data were collected: In this project, we collected gaze data from 5 macaques while they watched sequential visual displays designed to elicit probabilistic expectations. Gaze was recorded with the Eyelink Toolbox and sampled at 1000 Hz by an infrared eye-monitoring camera system.

Dataset:
"csv-combined.csv" is an aggregated dataset that includes one pop-up event per row for all original datasets for each trial. Here are descriptions of each column in the dataset:
- subj: subject ID = {"B": 104, "C": 102, "H": 101, "J": 103, "K": 203}
- trialtime: start time of the current trial, in seconds
- trial: current trial number (each trial featured one of 80 possible visual-event sequences) (in order)
- seq current: sequence number (one of 80 sequences)
- seq_item: current item number within a sequence (in order)
- active_item: pop-up item (active box)
- pre_active: prior pop-up item (active box) {-1: "the first active object in the sequence / no active object before the currently active object in the sequence"}
- next_active: next pop-up item (active box) {-1: "the last active object in the sequence / no active object after the currently active object in the sequence"}
- firstappear: {0: "not first", 1: "first appearance in the sequence"}
- looks_blank: csv: total time spent looking at blank space during the current event (ms); csv_timestamp: {1: "looking at blank space at this timestamp", 0: "not looking at blank space at this timestamp"}
- looks_offscreen: csv: total time spent looking offscreen during the current event (ms); csv_timestamp: {1: "looking offscreen at this timestamp", 0: "not looking offscreen at this timestamp"}
- time till target: time until the subject first looked at the target object (ms) {-1: "never looked at the target"}
- looks target: csv: time spent looking at the target object (ms); csv_timestamp: whether the subject is looking at the target at the current timestamp (1 or 0)
- look1,2,3: time spent looking at each object (ms)
- location 123X, 123Y: location of each box (the locations of the three boxes for a given sequence were chosen randomly but remained static throughout the sequence)
- item123id: pop-up item ID (remained static throughout a sequence)
- event time: total time for the whole event (pop-up and go back) (ms)
- eyeposX,Y: eye position at the current timestamp
"csv-surprisal-prob.csv" is an output file from Monkilock_Data_Processing.ipynb. Surprisal values for each event were calculated and added to the "csv-combined.csv". Here are descriptions of each additional column:
- rt: time till target {-1: "never looked at the target"}. In data analysis, we included only data with rt > 0.
- already_there: {NA: "never looked at the target object"}. In data analysis, we included events that are not the first event in a sequence, are not repeats of the previous event, and have a non-NA already_there value.
- looks_away: {TRUE: "the subject was looking away from the currently active object at this time point", FALSE: "the subject was not looking away from the currently active object at this time point"}
- prob: probability of the occurrence of the object
- surprisal: unigram surprisal value
- bisurprisal: transitional surprisal value
- std_surprisal: standardized unigram surprisal value
- std_bisurprisal: standardized transitional surprisal value
- binned_surprisal_means: means of unigram surprisal values binned into three groups of evenly spaced intervals according to surprisal value
- binned_bisurprisal_means: means of transitional surprisal values binned into three groups of evenly spaced intervals according to surprisal value
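For concreteness, the sketch below shows one way columns like these could be derived from per-event probabilities with pandas. It is not the notebook's actual implementation: the log base (2), the placeholder probabilities, and the binning details are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical per-event probabilities for illustration; in the real pipeline
# these come from the sequence statistics computed in Monkilock_Data_Processing.ipynb.
events = pd.DataFrame({
    "prob":       [0.6, 0.3, 0.1, 0.6],    # unigram probability of the pop-up item
    "trans_prob": [np.nan, 0.5, 0.2, 0.7], # transitional probability (NaN for a sequence's first event)
})

# Surprisal = -log2(probability); the log base is an assumption here.
events["surprisal"] = -np.log2(events["prob"])
events["bisurprisal"] = -np.log2(events["trans_prob"])

# Standardized (z-scored) versions.
for col in ["surprisal", "bisurprisal"]:
    events["std_" + col] = (events[col] - events[col].mean()) / events[col].std()

# Bin unigram surprisal into three evenly spaced intervals and record the bin means.
bins = pd.cut(events["surprisal"], bins=3)
events["binned_surprisal_means"] = events.groupby(bins, observed=False)["surprisal"].transform("mean")
```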
"csv-surprisal-prob_updated.csv" is a ready-for-analysis dataset generated by Analysis_Code_final.Rmd after standardizing controlled variables, changing data types for categorical variables for analysts, etc. "AllSeq.csv" includes event information of all 80 sequences
Empty Values in Datasets:
There are no missing values in the original dataset "csv-combined.csv". Missing values (marked as NA) occur in the columns "prev_active", "next_active", "already_there", "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" of "csv-surprisal-prob.csv" and "csv-surprisal-prob_updated.csv". NAs in "prev_active" and "next_active" indicate that the current event is the first or last in the sequence, i.e., there is no active object before or after the currently active object. When we analyzed the variable "already_there", we excluded rows whose "prev_active" value is NA. NAs in "already_there" mean that the subject never looked at the target object during the current event; when analyzing "already_there", we also excluded rows whose "already_there" value is NA. Missing values occur in "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" when an event is the first in its sequence, because no preceding event exists and the transitional probability cannot be computed. When fitting models for transitional statistics, we excluded rows whose "bisurprisal", "std_bisurprisal", or "sq_std_bisurprisal" values are NA.
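The exclusion rules above can be expressed as simple pandas filters. The sketch below is illustrative only; column names follow the data dictionary above, and treating a repeat as active_item == pre_active is an assumption.

```python
import pandas as pd

df = pd.read_csv("csv-surprisal-prob.csv")

# Looking-time analyses: keep events where the target was eventually fixated
# (rt > 0 drops the -1 "never looked at the target" code).
rt_data = df[df["rt"] > 0]

# already_there analyses: drop first events in a sequence, repeats of the
# previous event (assumed here to mean active_item == pre_active), and rows
# where already_there is NA.
at_data = df[
    (df["firstappear"] == 0)
    & (df["active_item"] != df["pre_active"])
    & df["already_there"].notna()
]

# Models of transitional statistics: drop rows whose transitional surprisal
# columns are NA (i.e., the first event of each sequence).
trans_data = df.dropna(subset=["bisurprisal", "std_bisurprisal"])
```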
Codes:
In "Monkilock_Data_Processing.ipynb", we processed raw fixation data of 5 macaques and explored the relationship between their fixation patterns and the "surprisal" of events in each sequence. We computed the following variables which are necessary for further analysis, modeling, and visualizations in this notebook (see above for details): active_item, pre_active, next_active, firstappear ,looks_blank, looks_offscreen, time till target, looks target, look1,2,3, prob, surprisal, bisurprisal, std_surprisal, std_bisurprisal, binned_surprisal_means, binned_bisurprisal_means. "Analysis_Code_final.Rmd" is the main scripts that we further processed the data, built models, and created visualizations for data. We evaluated the statistical significance of variables using mixed effect linear and logistic regressions with random intercepts. The raw regression models include standardized linear and quadratic surprisal terms as predictors. The controlled regression models include covariate factors, such as whether an object is a repeat, the distance between the current and previous pop up object, trial number. A generalized additive model (GAM) was used to visualize the relationship between the surprisal estimate from the computational model and the behavioral data. "helper-lib.R" includes helper functions used in Analysis_Code_final.Rmd
Survey-based Harmonized Indicators (SHIP) files are harmonized data files from household surveys conducted by countries in Africa. To ensure the quality and transparency of the data, it is critical to document the procedures for compiling consumption aggregates and other indicators so that the results can be replicated with ease. This process ensures consistency and continuity, making temporal and cross-country comparisons more reliable.
Four harmonized data files are prepared for each survey to generate a set of harmonized variables that have the same variable names. Invariably, in each survey, questions are asked in slightly different ways, which poses challenges for defining harmonized variables consistently. The harmonized household survey data therefore present the best available variables with harmonized definitions, but not identical variables. The four harmonized data files are:
a) Individual level file (labor force indicators in a separate file): This file has information on basic characteristics of individuals such as age and sex, literacy, education, health, anthropometry and child survival.
b) Labor force file: This file has information on the labor force, including employment/unemployment, earnings, sectors of employment, etc.
c) Household level file: This file has information on household expenditure, household head characteristics (age and sex, level of education, employment), housing amenities, assets, and access to infrastructure and services.
d) Household Expenditure file: This file has consumption/expenditure aggregates by consumption group according to the UN Classification of Individual Consumption According to Purpose (COICOP).
National
The survey covered all de jure household members (usual residents).
Sample survey data [ssd]
A multi-stage sampling technique was used in selecting the GLSS sample. Initially, 4565 households were selected for GLSS3, spread around the country in 407 small clusters; in general, 15 households were taken in an urban cluster and 10 households in a rural cluster. The actual achieved sample was 4552 households. Because of the sample design used, and the very high response rate achieved, the sample can be considered self-weighting, though in the case of expenditure data, weighting of the expenditure values is required.
Face-to-face [f2f]
Overview: The Essential Climate Variables for assessment of climate variability from 1979 to present dataset contains a selection of climatologies, monthly anomalies and monthly mean fields of Essential Climate Variables (ECVs) suitable for monitoring and assessment of climate variability and change. Selection criteria are based on accuracy and temporal consistency on monthly to decadal time scales. The ECV data products in this set have been estimated from the climate reanalyses ERA-Interim and ERA5, and, depending on the source, may have been adjusted to account for biases and other known deficiencies. Data sources and adjustment methods used are described in the Product User Guide, as are various particulars such as the baseline periods used to calculate monthly climatologies and the corresponding anomalies.

Sum of monthly precipitation: This variable is the accumulated liquid and frozen water, including rain and snow, that falls to the Earth's surface. It is the sum of large-scale precipitation (that precipitation which is generated by large-scale weather patterns, such as troughs and cold fronts) and convective precipitation (generated by convection, which occurs when air at lower levels in the atmosphere is warmer and less dense than the air above, so it rises). Precipitation variables do not include fog, dew or the precipitation that evaporates in the atmosphere before it lands at the surface of the Earth.

Spatial resolution: 0:15:00 (0.25°)
Temporal resolution: monthly
Temporal extent: 1979 - present
Data unit: mm * 10
Data type: UInt32
CRS as EPSG: EPSG:4326
Processing time delay: one month
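Because the values are stored as UInt32 in units of mm * 10, a consumer needs to rescale them before use. A minimal sketch (the array contents are placeholders):

```python
import numpy as np

# Values are stored as UInt32 in units of mm * 10; convert to floating point
# millimetres of monthly precipitation before analysis.
raw = np.array([1234, 0, 57], dtype=np.uint32)   # placeholder values read from the product
precip_mm = raw.astype(np.float64) / 10.0        # -> [123.4, 0.0, 5.7] mm
```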
Occupation data for 2021 and 2022 data files
The ONS has identified an issue with the collection of some occupational data in 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. Further information can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022 (https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022).
Latest edition information
For the fourth edition (September 2023), the variables NSECM20, NSECMJ20, SC2010M, SC20SMJ, SC20SMN and SOC20M have been replaced with new versions. Further information on the SOC revisions can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022 (https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
COVID-19 prediction has been essential in aiding the prevention and control of the disease. The motivation of this case study is to develop predictive models for COVID-19 cases and deaths based on a cross-sectional data set with a total of 28,955 observations and 18 variables, compiled from 5 data sources from Kaggle. A two-part modeling framework, in which the first part is a logistic classifier and the second part includes machine learning or statistical smoothing methods, is introduced to model the highly skewed distribution of COVID-19 cases and deaths. We also aim to understand what factors are most relevant to COVID-19's occurrence and fatality. Evaluation criteria such as root mean squared error (RMSE) and mean absolute error (MAE) are used. We find that the two-part XGBoost model performs best at predicting the entire distribution of COVID-19 cases and deaths. The most important factors relevant to either COVID-19 cases or deaths include population and the rate of primary care physicians.
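As a rough illustration of the two-part idea only (not the authors' exact pipeline or data), the sketch below classifies zero versus non-zero outcomes first and then models the magnitude of the positive part, using scikit-learn gradient boosting as a stand-in for XGBoost and fully synthetic data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Generic two-part (hurdle-style) sketch: part 1 classifies whether any cases
# occur, part 2 models the (log) count for units predicted to be positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 18))                     # 18 predictors (synthetic)
latent = X[:, 0] + 0.5 * X[:, 1]
y = np.where(latent > 0, np.expm1(2 + latent + rng.normal(0, 0.3, 2000)), 0.0)

clf = GradientBoostingClassifier().fit(X, (y > 0).astype(int))        # part 1: any cases?
reg = GradientBoostingRegressor().fit(X[y > 0], np.log1p(y[y > 0]))   # part 2: how many?

y_hat = np.where(clf.predict(X) == 1, np.expm1(reg.predict(X)), 0.0)
rmse = np.sqrt(np.mean((y_hat - y) ** 2))
mae = np.mean(np.abs(y_hat - y))
```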
https://object-store.os-api.cci2.ecmwf.int:443/cci2-prod-catalogue/licences/licence-to-use-copernicus-products/licence-to-use-copernicus-products_b4b9451f54cffa16ecef5c912c9cebd6979925a956e3fa677976e0cf198c2c18.pdf
ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 8 decades. Data is available from 1940 onwards. ERA5 replaces the ERA-Interim reanalysis.

Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations, and when going further back in time, to allow for the ingestion of improved versions of the original observations, which all benefit the quality of the reanalysis product.

ERA5 provides hourly estimates for a large number of atmospheric, ocean-wave and land-surface quantities. An uncertainty estimate is sampled by an underlying 10-member ensemble at three-hourly intervals. Ensemble mean and spread have been pre-computed for convenience. Such uncertainty estimates are closely related to the information content of the available observing system, which has evolved considerably over time. They also indicate flow-dependent sensitive areas. To facilitate many climate applications, monthly-mean averages have been pre-calculated too, though monthly means are not available for the ensemble mean and spread. ERA5 is updated daily with a latency of about 5 days. If serious flaws are detected in this early release (called ERA5T), the data could differ from the final release 2 to 3 months later. If this occurs, users are notified.

The data set presented here is a regridded subset of the full ERA5 data set on native resolution. It is online on spinning disk, which should ensure fast and easy access. It should satisfy the requirements for most common applications. An overview of all ERA5 datasets can be found in this article. Information on access to ERA5 data on native resolution is provided in these guidelines. Data has been regridded to a regular lat-lon grid of 0.25 degrees for the reanalysis and 0.5 degrees for the uncertainty estimate (0.5 and 1 degree respectively for ocean waves). There are four main subsets: hourly and monthly products, both on pressure levels (upper air fields) and single levels (atmospheric, ocean-wave and land surface quantities). The present entry is "ERA5 hourly data on single levels from 1940 to present".
The main objective of the HEIS survey is to obtain detailed data on household expenditure and income, linked to various demographic and socio-economic variables, to enable computation of poverty indices and determine the characteristics of the poor and prepare poverty maps. Therefore, to achieve these goals, the sample had to be representative on the sub-district level. The raw survey data provided by the Statistical Office was cleaned and harmonized by the Economic Research Forum, in the context of a major research project to develop and expand knowledge on equity and inequality in the Arab region. The main focus of the project is to measure the magnitude and direction of change in inequality and to understand the complex contributing social, political and economic forces influencing its levels. However, the measurement and analysis of the magnitude and direction of change in this inequality cannot be consistently carried out without harmonized and comparable micro-level data on income and expenditures. Therefore, one important component of this research project is securing and harmonizing household surveys from as many countries in the region as possible, adhering to international statistics on household living standards distribution. Once the dataset has been compiled, the Economic Research Forum makes it available, subject to confidentiality agreements, to all researchers and institutions concerned with data collection and issues of inequality.
Data collected through the survey helped in achieving the following objectives:
1. Provide data weights that reflect the relative importance of consumer expenditure items used in the preparation of the consumer price index
2. Study the consumer expenditure pattern prevailing in the society and the impact of demographic and socio-economic variables on those patterns
3. Calculate the average annual income of the household and the individual, and assess the relationship between income and different economic and social factors, such as profession and educational level of the head of the household and other indicators
4. Study the distribution of individuals and households by income and expenditure categories and analyze the factors associated with it
5. Provide the necessary data for the national accounts related to overall consumption and income of the household sector
6. Provide the necessary income data to serve in calculating poverty indices and identifying the poor characteristics as well as drawing poverty maps
7. Provide the data necessary for the formulation, follow-up and evaluation of economic and social development programs, including those addressed to eradicate poverty
National
Sample survey data [ssd]
The Household Expenditure and Income Survey sample for 2010 was designed to serve the basic objectives of the survey by providing a relatively large sample in each sub-district to enable drawing a poverty map of Jordan. The General Census of Population and Housing in 2004 provided a detailed framework for housing and households at different administrative levels in the country. Jordan is administratively divided into 12 governorates; each governorate is composed of a number of districts, and each district (Liwa) includes one or more sub-districts (Qada). In each sub-district there are a number of communities (cities and villages), and each community was divided into a number of blocks, where the number of houses in each block ranged between 60 and 100. Nomads and persons living in collective dwellings such as hotels, hospitals and prisons were excluded from the survey framework.
A two-stage stratified cluster sampling technique was used. In the first stage, clusters were selected with probability proportional to size, where the number of households in each cluster was considered the weight of the cluster. At the second stage, a sample of 8 households was selected from each cluster, together with another 4 households selected as a backup for the basic sample, using a systematic sampling technique. Those 4 backup households were to be used during the first visit to the block in case a visit to an originally selected household was not possible for any reason. For the purposes of this survey, each sub-district was treated as a separate stratum to ensure that results could be produced at the sub-district level. In this respect, the survey adopted the framework provided by the General Census of Population and Housing in dividing the sample strata. To estimate the sample size, the coefficient of variation and the design effect of the expenditure variable from the Household Expenditure and Income Survey for the year 2008 were calculated for each sub-district. These results were used to estimate the sample size at the sub-district level so that the coefficient of variation of the expenditure variable in each sub-district is less than 10%, with a minimum number of clusters per sub-district (6 clusters). This is to ensure adequate representation of clusters in different administrative areas to enable drawing an indicative poverty map.
It should be noted that, in addition to the standard non-response rate assumed, higher rates were expected in areas where poor households are concentrated in major cities. These were therefore taken into consideration during the sampling design phase, and a higher number of households was selected from those areas, aiming at good coverage of all regions where poverty is widespread.
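As a purely illustrative sketch of the second-stage systematic selection described above (8 basic plus 4 backup households per cluster), consider the following; the sampling interval, the ordering of the household list, and the rule that the last 4 picks serve as backups are all assumptions, not the Statistical Office's procedure:

```python
import random

def systematic_sample(household_ids, n_basic=8, n_backup=4, seed=None):
    """Illustrative second-stage draw for one cluster: a systematic sample of
    n_basic + n_backup households with a fixed interval and a random start."""
    assert len(household_ids) >= n_basic + n_backup
    rng = random.Random(seed)
    n_total = n_basic + n_backup
    interval = len(household_ids) / n_total      # fractional sampling interval
    start = rng.uniform(0, interval)             # random start in [0, interval)
    picks = [household_ids[int(start + i * interval)] for i in range(n_total)]
    return picks[:n_basic], picks[n_basic:]      # basic sample, backup households

basic, backup = systematic_sample(list(range(1, 81)), seed=42)  # e.g. a cluster of 80 households
```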
Face-to-face [f2f]
Raw Data:
- Organizing forms/questionnaires: A compatible archive system was used to classify the forms according to the different rounds throughout the year. A registry was prepared to track the different stages of the process of data checking, coding and entry until forms were returned to the archive system.
- Data office checking: This phase was carried out concurrently with the data collection phase in the field, where questionnaires completed in the field were immediately sent to the data office checking phase.
- Data coding: A team was trained to work on the data coding phase, which in this survey was limited to education specialization, profession and economic activity. For these, international classifications were used, while for the rest of the questions, coding was predefined during the design phase.
- Data entry/validation: A team consisting of system analysts, programmers and data entry personnel worked on the data at this stage. System analysts and programmers started by identifying the survey framework and questionnaire fields to help build computerized data entry forms. A set of validation rules was added to the entry forms to ensure the accuracy of the data entered. A team was then trained to complete the data entry process. Forms prepared for data entry were provided by the archive department to ensure forms were correctly extracted and put back in the archive system. A data validation process was run on the data to ensure the data entered was free of errors.
- Results tabulation and dissemination: After the completion of all data processing operations, ORACLE was used to tabulate the survey's final results. Those results were further checked against similar outputs from SPSS to ensure that the tabulations produced were correct. A check was also run on each table to guarantee consistency of the figures presented, together with the required editing of table titles and report formatting.
Harmonized Data:
- The Statistical Package for Social Science (SPSS) was used to clean and harmonize the datasets.
- The harmonization process started with cleaning all raw data files received from the Statistical Office.
- Cleaned data files were then merged to produce one data file on the individual level containing all variables subject to harmonization.
- A country-specific program was generated for each dataset to generate/compute/recode/rename/format/label harmonized variables.
- A post-harmonization cleaning process was run on the data.
- Harmonized data was saved on the household as well as the individual level, in SPSS and converted to STATA format.
DS126.0 represents a dataset implemented and computed by NCAR's Data Support Section, and forms an essential part of efforts undertaken in late 2004 and early 2005 to produce an archive of selected segments of ERA-40 on a standard transformation grid.
In this case, forty seven ERA-40 monthly mean surface and single level analysis variables were transformed from a reduced N80 Gaussian grid to a 256 by 128 regular Gaussian grid. All fields were transformed using routines from the ECMWF EMOS library, including 10 meter winds which were treated as scalars because of a lack of 10 meter spectral vorticity and divergence. A missing value occurs in the sea surface temperature and sea ice fields to mask grid points occurring over land. Fields formerly archived as whole integers, such as vegetation indices and cloud cover, occur as integers plus a fractional part in the T85 version due to interpolation.
Twenty-seven ERA-40 monthly mean surface and single level 6-hour forecast variables were transformed from a reduced N80 Gaussian grid to a 256 by 128 regular Gaussian grid. Four of the variables are "instantaneous" variables, and the remaining twenty-three variables are "accumulated" over the 6-hour forecast time. Divide the accumulated variables by 21600 seconds to obtain instantaneous values. (Multiplication by minus one may also be necessary to match the sign convention one is accustomed to.) All fields were transformed using routines from the ECMWF EMOS library, including three pairs of stresses which were treated as scalars because of a lack of spectral precursors.
In addition, all corresponding 00Z, 06Z, 12Z, and 18Z monthly mean surface and single level analysis variables and 6-hour forecast variables were also transformed to a T85 Gaussian grid.
All forecast variables are valid 6 hours after the forecast was initiated. Thus, the 00Z 6-hour forecast evaporation is valid at 06Z. Divide the accumulated variables by 21600 seconds to obtain instantaneous values. (Multiplication by minus one may also be necessary to match the sign convention one is accustomed to.)
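The conversion described above can be wrapped in a small helper; the function name and the example value (an accumulated flux in J m-2 over 6 hours) are illustrative:

```python
import numpy as np

SECONDS_PER_6H = 6 * 3600  # 21600 s accumulation period

def accumulated_to_mean_rate(field, flip_sign=False):
    """Convert a 6-hour accumulated field to a mean rate over the forecast period.
    flip_sign applies the optional factor of minus one mentioned above when the
    archived sign convention is opposite to the desired one."""
    rate = np.asarray(field, dtype=np.float64) / SECONDS_PER_6H
    return -rate if flip_sign else rate

# e.g. 4.32e6 J m-2 accumulated over 6 hours -> 200.0 W m-2 mean flux
mean_flux = accumulated_to_mean_rate([4.32e6])
```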
The choice of a T85 Gaussian grid was based on considerations of limiting the volume of new data generated to a moderate level and of matching the horizontal resolution of the Community Atmosphere Model (CAM) [https://www.cesm.ucar.edu/models/atm-cam/] component of NCAR's Community Climate System Model (CCSM).
The ERA-Interim data from ECMWF is an update to the ERA-40 project. The ERA-Interim data starts in 1989 and has a higher horizontal resolution (T255, N128 nominally 0.703125 degrees) than the ERA-40 data (T159, N80 nominally 1.125 degrees). ERA-Interim is based on a more current model than ERA-40 and uses 4DVAR (as opposed to 3DVAR in ERA-40). ECMWF will continue to run the ERA-Interim model in near real time through at least 2010, and possibly longer. This data is available in ds627.0 [https://rda.ucar.edu/datasets/ds627.0/].
https://object-store.os-api.cci2.ecmwf.int:443/cci2-prod-catalogue/licences/licence-to-use-copernicus-products/licence-to-use-copernicus-products_b4b9451f54cffa16ecef5c912c9cebd6979925a956e3fa677976e0cf198c2c18.pdf
ERA5-Land is a reanalysis dataset providing a consistent view of the evolution of land variables over several decades at an enhanced resolution compared to ERA5. ERA5-Land has been produced by replaying the land component of the ECMWF ERA5 climate reanalysis. Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. Reanalysis produces data that goes several decades back in time, providing an accurate description of the climate of the past. ERA5-Land provides a consistent view of the water and energy cycles at surface level during several decades. It contains a detailed record from 1950 onwards, with a temporal resolution of 1 hour. The native spatial resolution of the ERA5-Land reanalysis dataset is 9km on a reduced Gaussian grid (TCo1279). The data in the CDS has been regridded to a regular lat-lon grid of 0.1x0.1 degrees. The data presented here is a post-processed subset of the full ERA5-Land dataset. Monthly-mean averages have been pre-calculated to facilitate many applications requiring easy and fast access to the data, when sub-monthly fields are not required.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The BS Filled Total Column Ozone (TCO) database version 3.5.1 provides a gap-free extension of the NIWA-BS Total Column Ozone database version 3.5.1 (doi:10.5281/zenodo.4535293), which combines TCO data from multiple satellite-based instruments to create a single near-global daily time series of ozone fields at 1.25° longitude by 1° latitude spanning the period 31 October 1978 to 31 December 2019. The NIWA-BS TCO database was initially maintained by the National Institute of Water and Atmospheric Research (NIWA) and is now maintained by Bodeker Scientific (BS). The latter also developed the BS Filled TCO database published here.
While the BS Filled TCO database has the same resolution and covers the same period as NIWA-BS TCO, any missing data have been filled using a machine learning-based method that regresses the NIWA-BS TCO database against NCEP CFSR reanalysis tropopause height fields and potential vorticity (PV) fields on the 550 K surface.
Uncertainties in the filled TCO fields are provided for every datum, and these uncertainties reflect the data availability, i.e. the uncertainty is smaller where measurements are available than in regions where filling of the data was required.
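As a toy stand-in for the filling approach described above (not the operational BS method, its feature set, or its uncertainty model), the sketch below regresses synthetic TCO values on tropopause height and 550 K PV and applies the fit only where the "satellite" values are missing:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: predict TCO from tropopause height and 550 K PV, and use
# the regression only where the "satellite" TCO values are missing.
rng = np.random.default_rng(0)
tropopause_height = rng.uniform(8e3, 18e3, size=5000)   # m (synthetic)
pv_550k = rng.normal(0.0, 40.0, size=5000)              # PV on the 550 K surface (synthetic)
tco = 300 - 0.004 * (tropopause_height - 12e3) + 0.3 * np.abs(pv_550k) + rng.normal(0, 5, 5000)  # DU

X = np.column_stack([tropopause_height, pv_550k])
observed = rng.random(5000) > 0.2                       # pretend 20% of points are missing

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[observed], tco[observed])

tco_filled = tco.copy()
tco_filled[~observed] = model.predict(X[~observed])     # fill only the gaps
```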
Please note: For the reasons detailed in this document, versions 3.5.x of the BS Filled TCO database and of the NIWA-BS TCO database should not be used henceforth for trend analysis. As such, we have updated version 3.4 of the NIWA-BS Combined TCO database to the end of 2019 (now referred to as version 3.4.1 of the database) as a replacement. You will find a link to version 3.4.1 under the link provided above.
The data are available in daily, monthly or annual resolution. Please note the following for the data in annual resolution: There have to be at least 12 valid monthly values within a year to calculate a valid annual mean. Therefore, no annual mean is available for the year 1978 as the record only starts on 31 October 1978.
Please email greg@bodekerscientific.com and let us know which data set you downloaded and what your intended purpose for the use of the data is. You will then receive updates if an improved version becomes available.
In the dataset 'GLES Cross-Section 2013-2021, Sensitive Regional Data', the recoded or deleted variables of the GLES Cross-Section Scientific Use Files, which refer to the respondents' place of residence, are made available for research purposes. The basis for the assignment of the small-scale regional units is the addresses of the respondents. After geocoding, i.e. the calculation of geocoordinates based on the addresses, the point coordinates were linked to regional units (e.g. INSPIRE grid cells, municipality and district IDs, postal codes). The regional variables of this dataset can be linked to the survey data of the pre- and post-election cross-sections of the GLES.
This data set contains the following sensitive regional variables (both IDs and, where applicable, names):
- 3-digit key for the administrative governmental district (Regierungsbezirk) (since 2013)
- 3-digit key for the spatial planning region (since 2013)
- 5-digit key for (city-) districts (since 2013)
- 9-digit key for municipalities (since 2021)
- 8-digit general municipality key (AGS) (since 2013)
- 12-digit regional key (Regionalschlüssel) (since 2021)
- Zip code (since 2013)
- Constituencies (since 2013)
- NUTS-3 code (since 2013)
- INSPIRE ID (1km) (since 2013)
- Municipality size (since 2013)
- BIK type of municipality (since 2013)
This sensitive data is subject to special access restrictions and can only be used on-site in the Secure Data Center in Cologne. Further information and contact persons can be found on our website: https://www.gesis.org/en/secdc
In order to take into account changes in the territorial status of the regional units (e.g. district reforms, municipality incorporations), the regional variables are offered as time-harmonized variables as of December 31, 2015, in addition to the status as of January 1 of the survey year.
If you want to use the regional variables to add additional context characteristics (regional attributes such as the unemployment rate or election turnout, for example), you have to send us these data before your visit. In addition, we require a reference and documentation (description of variables) for the data. Note that this context data may be as sensitive as the regional variables if direct assignment is possible. For data protection reasons, it is problematic if individual characteristics can be assigned to specific regional units – and therefore ultimately to the individual respondents – even without the ALLBUS dataset by means of a table of correspondence. Accordingly, the publication of (descriptive) analysis results based on such contextual data is only possible in a coarsened form.
Please contact the GLES User Service first and send us the completed GLES regional data form (see 'Data & Documents'), specifying exactly which GLES datasets and regional variables you need. Contact: gles@gesis.org
As soon as you have clarified with the GLES User Service exactly which regional features are to be made available for on-site use, the data use agreement for the use of the data at a guest workstation in our Secure Data Center (Safe Room) in Cologne will be sent to you. Please specify all data sets you need, i.e. both the 'GLES Sensitive Regional Data (ZA6828)' and the Scientific Use Files to which the regional variables are to be assigned. Furthermore, under 'Specific variables', please name all the regional variables you need (see GLES regional data form).
Occupation data for 2021 and 2022 data files
The ONS has identified an issue with the collection of some occupational data in 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. Further information can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022 (https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022).
Latest edition information
For the third edition (September 2023), the variables NSECM20, NSECMJ20, SC2010M, SC20SMJ, SC20SMN and SOC20M have been replaced with new versions. Further information on the SOC revisions can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022 (https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022).
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Total column ozone (TCO) data from multiple satellite-based instruments have been combined to create a single near-global daily time series of ozone fields at 1.25° longitude by 1° latitude spanning the period 31 October 1978 to 31 December 2019. Comparisons against TCO measurements from the ground-based Dobson and Brewer spectrophotometer networks are used to remove offsets and drifts against the ground-based measurements in a subset of the satellite-based instruments. The corrected subset is then used as a basis for homogenizing the remaining data sets. The intention is that this data set serves as a climate data record for TCO and, to this end, the requirements for constructing climate data records, as detailed by GCOS (Global Climate Observing System), have been followed as closely as possible. The construction of this database improves on earlier versions of the database maintained first by the National Institute of Water and Atmospheric Research (NIWA) and now by Bodeker Scientific (BS).
Please note: For the reasons detailed in this document, version 3.5.1 of the NIWA-BS TCO database should not be used henceforth for trend analysis. As such, we have updated the version 3.4 database to the end of 2019 (now referred to as version 3.4.1 of the database, available in this record) as a replacement.
This version 3.4.1 is produced using the same method as version 3.4, but the dataset has been extended in time to the end of 2019. This means that all fits were recalculated and, thus, this version is slightly different from version 3.4 in all years. A filled (gap-free) version of version 3.4.1 of this dataset is available under doi:10.5281/zenodo.7447757.
The data are available in daily, monthly or annual resolution. Please note the following for the data in monthly and annual resolution:
Monthly: There have to be at least 25 valid values in a gridbox within a month to calculate a monthly mean.
Annual: There have to be at least 12 valid monthly values within a year to calculate an annual mean.
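These validity thresholds can be applied as in the following sketch (function names are illustrative, not part of the released data or tools):

```python
import numpy as np

def monthly_mean(daily_values, min_valid=25):
    """Mean of the daily values in one gridbox for one month, or NaN when fewer
    than min_valid valid days are available (threshold as stated above)."""
    vals = np.asarray(daily_values, dtype=float)
    valid = vals[~np.isnan(vals)]
    return valid.mean() if valid.size >= min_valid else np.nan

def annual_mean(monthly_values):
    """Annual mean only when all 12 monthly means are valid."""
    vals = np.asarray(monthly_values, dtype=float)
    return vals.mean() if np.isfinite(vals).sum() >= 12 else np.nan
```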
Please email greg@bodekerscientific.com and let us know which data set you downloaded and what your intended purpose for the use of the data is. You will then receive updates if an improved version becomes available.
DS126.2 represents a dataset implemented and computed by NCAR's Data Support Section, and forms an essential part of efforts undertaken in late 2004 and early 2005 to produce an archive of selected segments of ERA-40 on a standard transformation grid. In this case, ERA-40 monthly mean upper air variables on 60 model levels were transformed from either spherical harmonics (surface geopotential, temperature, vertical pressure velocity, vorticity, logarithm of surface pressure, divergence), or a reduced N80 Gaussian grid (specific humidity, ozone mass mixing ratio, cloud liquid water content, cloud ice water content, cloud cover), to a 256 by 128 regular Gaussian grid at T85 spectral truncation. In addition, horizontal wind components were derived from spectral vorticity and divergence and also archived on a T85 Gaussian grid. All scalar fields were transformed using routines from the ECMWF EMOS library, whereas the horizontal winds were obtained using NCAR's SPHEREPACK library. All corresponding 00Z, 06Z, 12Z, and 18Z monthly mean upper air variables on 60 model levels were also transformed to a T85 Gaussian grid. The choice of a T85 Gaussian grid was based on considerations of limiting the volume of new data generated to a moderate level and of matching the horizontal resolution of the Community Atmosphere Model (CAM) [https://www.cesm.ucar.edu/models/atm-cam/] component of NCAR's Community Climate System Model (CCSM).
The ERA-Interim data from ECMWF is an update to the ERA-40 project. The ERA-Interim data starts in 1989 and has a higher horizontal resolution (T255, N128 nominally 0.703125 degrees) than the ERA-40 data (T159, N80 nominally 1.125 degrees). ERA-Interim is based on a more current model than ERA-40 and uses 4DVAR (as opposed to 3DVAR in ERA-40). ECMWF will continue to run the ERA-Interim model in near real time through at least 2010, and possibly longer. This data is available in ds627.0 [https://rda.ucar.edu/datasets/ds627.0/].
https://vocab.nerc.ac.uk/collection/L08/current/LI/
The data sources of the dataset are outputs from CMIP6 simulations. The effect of the Southern Ocean on global climate change is assessed using Earth system model projections following an idealised 1% annual rise in atmospheric CO2. The model simulations run over 150 years and were obtained from the Earth System Grid Federation at the CMIP6 archive (https://esgf-node.llnl.gov/search/cmip6, World Climate Research Programme, 2021). The reported derived data sets are based on the output of a subset of CMIP6 models providing all necessary variables for the diagnostics and analysis published in: Williams, R.G., P. Ceppi, V. Roussenov, A. Katavouta and A. Meijers, 2022. The role of the Southern Ocean in the global climate response to carbon emissions. Philosophical Transactions A, Royal Society, in press. The dataset contains 3 types of variables: (1) time-averaged 2D fields: model mean and standard deviation (STD) of the surface warming, ocean heat uptake and storage, radiative response, climate feedback parameter, ocean carbon uptake and storage, and cumulative top-of-the-atmosphere heat uptake, with examples for 2 models - GFDL-ESM4 and UKESM1-0-LL; (2) time series of Southern Ocean or globally averaged (or globally integrated) variables for each model together with the model mean and STD: surface warming, ocean heat uptake and storage, radiative forcing and radiative response, ocean carbon uptake and storage; (3) single values for the Southern Ocean and planetary physical climate feedback parameter and the Transient Climate Response to Emissions (TCRE), together with their components. This dataset was created under the project Southern Ocean carbon indices and metrics (SARDINE), NERC Grant reference NE/T010657/1, by scientists from the University of Liverpool, National Oceanography Centre (Liverpool), Imperial College London and the British Antarctic Survey.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
subject to appropriate attribution.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Existing missing value imputation methods focus on imputing data with respect to the actual values, toward completing datasets as input for machine learning tasks. This work proposes imputing missing values with the goal of improving classification accuracy. The proposed method is based on a bee algorithm and uses k-nearest neighbours with linear regression to guide the search toward an appropriate solution and prevent randomness. GINI importance scores are utilized in selecting values for imputation. The imputed values are thus intended to improve discriminative power in classification tasks rather than to replicate the actual values from the original dataset. In this study, we evaluated the proposed method against frequently used imputation methods such as k-nearest neighbours, principal component analysis, and nonlinear principal component analysis, comparing root mean square error (RMSE) and the accuracy of using the imputed datasets in a classification task. The experimental results indicated that our proposed method obtained the best accuracy on all datasets compared to the other methods. In comparison to the original datasets, the classification models built from imputed datasets yielded 15-25% higher accuracy in class prediction. The analysis showed that the feature ranking used in the classification process was affected, leading to a noticeable change in informativeness, as the imputed data from the proposed method helped boost discriminative power.
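The evaluation protocol can be illustrated with the baseline k-nearest-neighbour imputation mentioned above: mask entries, impute them, and report RMSE on the masked entries plus downstream classification accuracy. The sketch below does not reproduce the proposed bee-algorithm method, and the dataset and classifier are stand-ins.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Baseline protocol only: hide 10% of entries, impute with k-nearest neighbours,
# then measure (a) RMSE against the held-out true values and (b) downstream accuracy.
X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
mask = rng.random(X.shape) < 0.1
X_missing = X.copy()
X_missing[mask] = np.nan

X_imputed = KNNImputer(n_neighbors=5).fit_transform(X_missing)
rmse = np.sqrt(np.mean((X_imputed[mask] - X[mask]) ** 2))

X_tr, X_te, y_tr, y_te = train_test_split(X_imputed, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(f"RMSE on masked entries: {rmse:.3f}, downstream accuracy: {acc:.3f}")
```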
The new version of the Hamburg Ocean Atmosphere Parameters and Fluxes from Satellite Data set - HOAPS II - contains improved global fields of precipitation and evaporation over the oceans and all basic state variables needed for the derivation of the turbulent fluxes. Except for the NOAA Pathfinder SST data set, all variables are derived from SSM/I satellite data over the ice-free oceans between 1987 and 2002. The earlier HOAPS version was improved and now includes the use of multi-satellite averages with proper inter-satellite calibration, improved algorithms and a new ice detection procedure, resulting in more homogeneous and reliable spatial and temporal fields than before. The spatial resolution of 0.5 degrees makes the data ideally suited for studies of climate variability over the global oceans. Pentad and climatological means are also publicly available via the CERA database system. Further information under: https://www.cmsaf.eu/EN/Overview/OurProducts/Hoaps/Hoaps_node.html
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This simulation data belongs to the article:
Retrieving pulsatility in ultrasound localization microscopy
DOI: 10.1109/OJUFFC.2022.3221354
This information is also available in README.txt, included in this repository.
The scripts that should be used to process this data can be found at: https://github.com/qnano/ulm-pulsatility
The simulation data in this repository is contained in several .zip files:
The .zip folders contain the following:
Download the desired .zip file and see the documentation at https://github.com/qnano/ulm-pulsatility for instructions on processing the data.
This dataset contains ERA5 initial release (ERA5t) surface level analysis parameter data ensemble means (see linked dataset for spreads). ERA5t is the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA5 reanalysis project initial release, available up to 5 days behind the present date. CEDA will maintain a 6 month rolling archive of these data with overlap to the verified ERA5 data - see linked datasets on this record. The ensemble means and spreads are calculated from the ERA5t 10 member ensemble, run at a reduced resolution compared with the single high resolution (hourly output at 31 km grid spacing) 'HRES' realisation, for which these data have been produced to provide an uncertainty estimate. This dataset contains a limited selection of all available variables, and the data have been converted to netCDF from the original GRIB files held on the ECMWF system. They have also been translated onto a regular latitude-longitude grid during the extraction process from the ECMWF holdings. For a fuller set of variables please see the Copernicus Data Store (CDS) data tool linked to from this record. See linked datasets for ensemble member and spread data. Note, ensemble standard deviation is often referred to as ensemble spread and is calculated as the standard deviation of the 10 members in the ensemble (i.e., including the control). It is not the sample standard deviation, and thus is calculated by dividing by 10 rather than 9 (N-1). The ERA5 global atmospheric reanalysis covers 1979 to 2 months behind the present month. This follows on from the ERA-15, ERA-40 and ERA-Interim reanalysis projects. An initial release of ERA5 data (ERA5t) is made roughly 5 days behind the present date. These data will subsequently be reviewed and, if required, amended before the full ERA5 release. CEDA holds a 6 month rolling copy of the latest ERA5t data. See related datasets linked to from this record.
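The spread definition above (standard deviation over the 10 members, dividing by 10 rather than 9) corresponds to a population standard deviation, i.e. ddof=0 in NumPy. A minimal sketch on a small synthetic ensemble:

```python
import numpy as np

# Synthetic 10-member ensemble with dimensions (member, lat, lon) for illustration.
members = np.random.default_rng(0).normal(280.0, 0.5, size=(10, 4, 5))

ens_mean = members.mean(axis=0)
ens_spread = members.std(axis=0, ddof=0)   # divide by N=10 (including the control), not N-1
```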