100+ datasets found

n
Data from: Using multiple imputation to estimate missing data in...
data.niaid.nih.gov
datadryad.org
+1more
zip
Updated Nov 25, 2015
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
E. Hance Ellington; Guillaume Bastille-Rousseau; Cayla Austin; Kristen N. Landolt; Bruce A. Pond; Erin E. Rees; Nicholas Robar; Dennis L. Murray (2015). Using multiple imputation to estimate missing data in meta-regression [Dataset]. http://doi.org/10.5061/dryad.m2v4m
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.m2v4m
Dataset updated
Nov 25, 2015
Dataset provided by
Trent University
University of Prince Edward Island
Authors
E. Hance Ellington; Guillaume Bastille-Rousseau; Cayla Austin; Kristen N. Landolt; Bruce A. Pond; Erin E. Rees; Nicholas Robar; Dennis L. Murray
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
There is a growing need for scientific synthesis in ecology and evolution. In many cases, meta-analytic techniques can be used to complement such synthesis. However, missing data is a serious problem for any synthetic efforts and can compromise the integrity of meta-analyses in these and other disciplines. Currently, the prevalence of missing data in meta-analytic datasets in ecology and the efficacy of different remedies for this problem have not been adequately quantified. 2. We generated meta-analytic datasets based on literature reviews of experimental and observational data and found that missing data were prevalent in meta-analytic ecological datasets. We then tested the performance of complete case removal (a widely used method when data are missing) and multiple imputation (an alternative method for data recovery) and assessed model bias, precision, and multi-model rankings under a variety of simulated conditions using published meta-regression datasets. 3. We found that complete case removal led to biased and imprecise coefficient estimates and yielded poorly specified models. In contrast, multiple imputation provided unbiased parameter estimates with only a small loss in precision. The performance of multiple imputation, however, was dependent on the type of data missing. It performed best when missing values were weighting variables, but performance was mixed when missing values were predictor variables. Multiple imputation performed poorly when imputing raw data which was then used to calculate effect size and the weighting variable. 4. We conclude that complete case removal should not be used in meta-regression, and that multiple imputation has the potential to be an indispensable tool for meta-regression in ecology and evolution. However, we recommend that users assess the performance of multiple imputation by simulating missing data on a subset of their data before implementing it to recover actual missing data.
Cyclist Dataset
kaggle.com
zip
Updated Sep 8, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Samir Tak (2022). Cyclist Dataset [Dataset]. https://www.kaggle.com/samirtak/cyclist-dataset
Explore at:
zip(851993390 bytes)Available download formats
Dataset updated
Sep 8, 2022
Authors
Samir Tak
Description
An explanation of the analysis has been done in my Portfolio. Please check it out by clicking here There are three folders below - Cleaned Dataset - Uncleaned Dataset - last_year_trip

The Uncleaned dataset contains the last 12 month's datasets, It has many null values. The Cleaned dataset contains the same last 12 months dataset, but it is cleaned and all the missing values are filled using Machine Learning. last_year_trip is the merged cleaned datasets.
Z
Film Circulation dataset
data.niaid.nih.gov
zenodo.org
Updated Jul 12, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Loist, Skadi; Samoilova, Evgenia (Zhenya) (2024). Film Circulation dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7887671
Explore at:
Dataset updated
Jul 12, 2024
Dataset provided by
Film University Babelsberg KONRAD WOLF
Authors
Loist, Skadi; Samoilova, Evgenia (Zhenya)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

Please cite this when using the dataset.

Detailed description of the dataset:

1 Film Dataset: Festival Programs

The Film Dataset consists a data scheme image file, a codebook and two dataset tables in csv format.

The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.

2 Survey Dataset

The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

3 IMDb & Scripts

The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.

The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in the “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, a fuzzy matching approach with two methods: “cosine” and “osa.” where the cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.

The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (original film from the core dataset and the suggested match from the IMDb website was categorized in the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match). The script also checks for possible doubles in the dataset and identifies them for a manual check.

The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

The script “r_5a_extracting_info_sample” uses the function defined in the “r_4_scraping_functions”, in order to scrape the IMDb data for the identified matches. This script does that for the first 100 films, to check, if everything works. Scraping for the entire dataset took a few hours. Therefore, a test with a subsample of 100 films is advisable.

The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tried to extract data one more time to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.

4 Festival Library Dataset

The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories, units of measurement, data sources and coding and missing data.

The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format, all information for each festival is listed in one row. This
t
ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture...
researchdata.tuwien.ac.at
researchdata.tuwien.at
zip
Updated Sep 5, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wolfgang Preimesberger; Wolfgang Preimesberger; Pietro Stradiotti; Pietro Stradiotti; Wouter Arnoud Dorigo; Wouter Arnoud Dorigo (2025). ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture from merged multi-satellite observations [Dataset]. http://doi.org/10.48436/3fcxr-cde10
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.48436/3fcxr-cde10
Dataset updated
Sep 5, 2025
Dataset provided by
TU Wien
Authors
Wolfgang Preimesberger; Wolfgang Preimesberger; Pietro Stradiotti; Pietro Stradiotti; Wouter Arnoud Dorigo; Wouter Arnoud Dorigo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset was produced with funding from the European Space Agency (ESA) Climate Change Initiative (CCI) Plus Soil Moisture Project (CCN 3 to ESRIN Contract No: 4000126684/19/I-NB "ESA CCI+ Phase 1 New R&D on CCI ECVS Soil Moisture"). Project website: https://climate.esa.int/en/projects/soil-moisture/

This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.

Dataset Paper (Open Access)

A description of this dataset, including the methodology and validation results, is available at:

Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: an independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data, 17, 4305–4329, https://doi.org/10.5194/essd-17-4305-2025, 2025.

Abstract

ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
Since the requirement of a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such gap-filling method is to rely only on the original observational record, without need for ancillary variable or model-based information. Due to the intrinsic challenge, there was until present no global, long-term univariate gap-filled product available. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments to fill satellite-like gaps introduced to GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and consider the gap size and local vegetation conditions as parameters that affect the gapfilling performance.

Summary

Gap-filled global estimates of volumetric surface soil moisture from 1991-2023 at 0.25° sampling

Fields of application (partial): climate variability and change, land-atmosphere interactions, global biogeochemical cycles and ecology, hydrological and land surface modelling, drought applications, and meteorology

Method: Modified version of DCT-PLS (Garcia, 2010) interpolation/smoothing algorithm, linear interpolation over periods of frozen soils. Uncertainty estimates are provided for all data points.

More information: See Preimesberger et al. (2025) and https://doi.org/10.5281/zenodo.8320869" target="_blank" rel="noopener">ESA CCI SM Algorithm Theoretical Baseline Document [Chapter 7.2.9] (Dorigo et al., 2023)

Programmatic Download

You can use command line tools such as wget or curl to download (and extract) data for multiple years. The following command will download and extract the complete data set to the local directory ~/Download on Linux or macOS systems.

#!/bin/bash

# Set download directory
DOWNLOAD_DIR=~/Downloads

base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"

# Loop through years 1991 to 2023 and download & extract data
for year in {1991..2023}; do
echo "Downloading $year.zip..."
wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
unzip -o "$DOWNLOAD_DIR/$year.zip" -d $DOWNLOAD_DIR
rm "$DOWNLOAD_DIR/$year.zip"
done

Data details

The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), each subdirectory containing one netCDF image file for a specific day (DD), month (MM) in a 2-dimensional (longitude, latitude) grid system (CRS: WGS84). The file name has the following convention:

ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc

Data Variables

Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:

sm: (float) The Soil Moisture variable reflects estimates of daily average volumetric soil moisture content (m3/m3) in the soil surface layer (~0-5 cm) over a whole grid cell (0.25 degree).

sm_uncertainty: (float) The Soil Moisture Uncertainty variable reflects the uncertainty (random error) of the original satellite observations and of the predictions used to fill observation data gaps.

sm_anomaly: Soil moisture anomalies (reference period 1991-2020) derived from the gap-filled values (`sm`)

sm_smoothed: Contains DCT-PLS predictions used to fill data gaps in the original soil moisture field. These values are also provided for cases where an observation was initially available (compare `gapmask`). In this case, they provided a smoothed version of the original data.

gapmask: (0 | 1) Indicates grid cells where a satellite observation is available (1), and where the interpolated (smoothed) values are used instead (0) in the 'sm' field.

frozenmask: (0 | 1) Indicates grid cells where ERA5 soil temperature is <0 °C. In this case, a linear interpolation over time is applied.

Additional information for each variable is given in the netCDF attributes.

Version Changelog

Changes in v9.1r1 (previous version was v09.1):

This version uses a novel uncertainty estimation scheme as described in Preimesberger et al. (2025).

Software to open netCDF files

These data can be read by any software that supports Climate and Forecast (CF) conform metadata standards for netCDF files, such as:

https://github.com/pydata/xarray" target="_blank" rel="noopener">Xarray (python)

https://unidata.github.io/netcdf4-python/" target="_blank" rel="noopener">netCDF4 (python)

https://github.com/TUW-GEO/esa_cci_sm">esa_cci_sm (python)

Similar tools exists for other programming languages (Matlab, R, etc.)

Software packages and GIS tools can open netCDF files, e.g. CDO, NCO, QGIS, ArCGIS

You can also use the GUI software Panoply to view the contents of each file

References

Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: an independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data, 17, 4305–4329, https://doi.org/10.5194/essd-17-4305-2025, 2025.

Dorigo, W., Preimesberger, W., Stradiotti, P., Kidd, R., van der Schalie, R., van der Vliet, M., Rodriguez-Fernandez, N., Madelon, R., & Baghdadi, N. (2023). ESA Climate Change Initiative Plus - Soil Moisture Algorithm Theoretical Baseline Document (ATBD) Supporting Product Version 08.1 (version 1.1). Zenodo. https://doi.org/10.5281/zenodo.8320869

Garcia, D., 2010. Robust smoothing of gridded data in one and higher dimensions with missing values. Computational Statistics & Data Analysis, 54(4), pp.1167-1178. Available at: https://doi.org/10.1016/j.csda.2009.09.020

Rodell, M., Houser, P. R., Jambor, U., Gottschalck, J., Mitchell, K., Meng, C.-J., Arsenault, K., Cosgrove, B., Radakovich, J., Bosilovich, M., Entin, J. K., Walker, J. P., Lohmann, D., and Toll, D.: The Global Land Data Assimilation System, Bulletin of the American Meteorological Society, 85, 381 – 394, https://doi.org/10.1175/BAMS-85-3-381, 2004.

Related Records

The following records are all part of the ESA CCI Soil Moisture science data records community

1
ESA CCI SM MODELFREE Surface Soil Moisture Record
<a href="https://doi.org/10.48436/svr1r-27j77" target="_blank"
d
Data from: Dataset from the Upper Mississippi River Restoration Program...
catalog.data.gov
datasets.ai
Updated Nov 20, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). Dataset from the Upper Mississippi River Restoration Program (1993-2019) to reconstruct missing data by comparing interpolation techniques [Dataset]. https://catalog.data.gov/dataset/dataset-from-the-upper-mississippi-river-restoration-program-1993-2019-to-reconstruct-miss
Explore at:
Dataset updated
Nov 20, 2025
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Area covered
Mississippi River, Upper Mississippi River
Description
The dataset accompanies the scientific article,"Reconstructing missing data by comparing interpolation techniques: applications for long-term water quality data." Missingness is typical in large datasets, but intercomparisons of interpolation methods can alleviate data gaps and common problems associated with missing data. We compared seven popular interpolation methods for predicting missing values in a long-term water quality data set from the upper Mississippi River, USA.
Walmart complete updated stocks dataset
kaggle.com
zip
Updated Mar 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
M Atif Latif (2025). Walmart complete updated stocks dataset [Dataset]. https://www.kaggle.com/datasets/matiflatif/walmart-complete-stocks-dataweekly-updated
Explore at:
zip(1909332 bytes)Available download formats
Dataset updated
Mar 15, 2025
Authors
M Atif Latif
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Walmart (WMT) Stock Price Data (1970 - 2025)

Dataset Overview:

This dataset contains historical stock price data for Walmart Inc. (WMT) from October 1, 1970, to January 31, 2025. The data includes key stock market indicators such as opening price, closing price, adjusted closing price, highest and lowest prices of the day, and trading volume. This dataset can be valuable for financial analysis, stock market trend prediction, and machine learning applications in quantitative finance.

Data Source

The data has been collected from publicly available financial sources and covers over 13,000 trading days, providing a comprehensive view of Walmart’s stock performance over several decades.

Columns Description

Date: The trading date (1970-10-01).

Open: The opening price of Walmart stock for the day.

High: The highest price reached during the trading session.

Low: The lowest price recorded during the trading session.

Close: The closing price at the end of the trading day.

Adj Close: The adjusted closing price, which accounts for stock splits and dividends.

Volume: The total number of shares traded on that particular day.

Potential Use Cases

This dataset can be used for a variety of financial and data science applications, including:

✔ Stock Market Analysis – Study historical trends and price movements.

✔ Time Series Forecasting – Develop predictive models using machine learning.

✔ Technical Analysis – Apply moving averages, RSI, and other trading indicators.

✔ Market Volatility Analysis – Assess market fluctuations over different periods.

✔ Algorithmic Trading – Backtest trading strategies based on historical data.

Data Integrity

No missing values.

Data spans over 50 years, ensuring long-term trend analysis.

Preprocessed and structured for easy use in Python, R, and other data science tools.

How to Use the Data?

You can load the dataset using Pandas in Python: ``` import pandas as pd

Load the dataset

df = pd.read_csv("WMT_1970-10-01_2025-01-31.csv")

Display the first few rows

df.head() ```

Acknowledgments

This dataset is provided for educational and research purposes. Please ensure proper attribution if used in projects or research.

More Dataset

This data set is scrape by Muhammad Atif Latif.

For more Datasets justCLICK HERE
Z
Temperature Rain Dataset without Missing Values
data-staging.niaid.nih.gov
data.niaid.nih.gov
+1more
Updated Jul 24, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Godahewa, Rakshitha; Bergmeir, Christoph; Webb, Geoff; Hyndman, Rob; Montero-Manso, Pablo (2021). Temperature Rain Dataset without Missing Values [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_5129090
Explore at:
Dataset updated
Jul 24, 2021
Dataset provided by
Professor at Monash University
Lecturer at University of Sydney
PhD Student at Monash University
Lecturer at Monash University
Authors
Godahewa, Rakshitha; Bergmeir, Christoph; Webb, Geoff; Hyndman, Rob; Montero-Manso, Pablo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains 32072 daily time series showing the temperature observations and rain forecasts, gathered by the Australian Bureau of Meteorology for 422 weather stations across Australia, between 02/05/2015 and 26/04/2017.

The original dataset contains missing values and they have been simply replaced by zeros.
n
Data from: Macaques preferentially attend to intermediately surprising...
data.niaid.nih.gov
datadryad.org
zip
Updated Apr 26, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shengyi Wu; Tommy Blanchard; Emily Meschke; Richard Aslin; Ben Hayden; Celeste Kidd (2022). Macaques preferentially attend to intermediately surprising information [Dataset]. http://doi.org/10.6078/D15Q7Q
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6078/D15Q7Q
Dataset updated
Apr 26, 2022
Dataset provided by
Klaviyo
University of California, Berkeley
Yale University
University of Minnesota
Authors
Shengyi Wu; Tommy Blanchard; Emily Meschke; Richard Aslin; Ben Hayden; Celeste Kidd
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Normative learning theories dictate that we should preferentially attend to informative sources, but only up to the point that our limited learning systems can process their content. Humans, including infants, show this predicted strategic deployment of attention. Here we demonstrate that rhesus monkeys, much like humans, attend to events of moderate surprisingness over both more and less surprising events. They do this in the absence of any specific goal or contingent reward, indicating that the behavioral pattern is spontaneous. We suggest this U-shaped attentional preference represents an evolutionarily preserved strategy for guiding intelligent organisms toward material that is maximally useful for learning. Methods How the data were collected: In this project, we collected gaze data of 5 macaques when they watched sequential visual displays designed to elicit probabilistic expectations using the Eyelink Toolbox and were sampled at 1000 Hz by an infrared eye-monitoring camera system. Dataset:

"csv-combined.csv" is an aggregated dataset that includes one pop-up event per row for all original datasets for each trial. Here are descriptions of each column in the dataset:

subj: subject_ID = {"B":104, "C":102,"H":101,"J":103,"K":203} trialtime: start time of current trial in second trial: current trial number (each trial featured one of 80 possible visual-event sequences)(in order) seq current: sequence number (one of 80 sequences) seq_item: current item number in a seq (in order) active_item: pop-up item (active box) pre_active: prior pop-up item (actve box) {-1: "the first active object in the sequence/ no active object before the currently active object in the sequence"} next_active: next pop-up item (active box) {-1: "the last active object in the sequence/ no active object after the currently active object in the sequence"} firstappear: {0: "not first", 1: "first appear in the seq"} looks_blank: csv: total amount of time look at blank space for current event (ms); csv_timestamp: {1: "look blank at timestamp", 0: "not look blank at timestamp"} looks_offscreen: csv: total amount of time look offscreen for current event (ms); csv_timestamp: {1: "look offscreen at timestamp", 0: "not look offscreen at timestamp"} time till target: time spent to first start looking at the target object (ms) {-1: "never look at the target"} looks target: csv: time spent to look at the target object (ms);csv_timestamp: look at the target or not at current timestamp (1 or 0) look1,2,3: time spent look at each object (ms) location 123X, 123Y: location of each box (location of the three boxes for a given sequence were chosen randomly, but remained static throughout the sequence) item123id: pop-up item ID (remained static throughout a sequence) event time: total time spent for the whole event (pop-up and go back) (ms) eyeposX,Y: eye position at current timestamp

"csv-surprisal-prob.csv" is an output file from Monkilock_Data_Processing.ipynb. Surprisal values for each event were calculated and added to the "csv-combined.csv". Here are descriptions of each additional column:

rt: time till target {-1: "never look at the target"}. In data analysis, we included data that have rt > 0. already_there: {NA: "never look at the target object"}. In data analysis, we included events that are not the first event in a sequence, are not repeats of the previous event, and already_there is not NA. looks_away: {TRUE: "the subject was looking away from the currently active object at this time point", FALSE: "the subject was not looking away from the currently active object at this time point"} prob: the probability of the occurrence of object surprisal: unigram surprisal value bisurprisal: transitional surprisal value std_surprisal: standardized unigram surprisal value std_bisurprisal: standardized transitional surprisal value binned_surprisal_means: the means of unigram surprisal values binned to three groups of evenly spaced intervals according to surprisal values. binned_bisurprisal_means: the means of transitional surprisal values binned to three groups of evenly spaced intervals according to surprisal values.

"csv-surprisal-prob_updated.csv" is a ready-for-analysis dataset generated by Analysis_Code_final.Rmd after standardizing controlled variables, changing data types for categorical variables for analysts, etc. "AllSeq.csv" includes event information of all 80 sequences

Empty Values in Datasets:

There is no missing value in the original dataset "csv-combined.csv". Missing values (marked as NA in datasets) happen in columns "prev_active", "next_active", "already_there", "bisurprisal", "std_bisurprisal", "sq_std_bisurprisal" in "csv-surprisal-prob.csv" and "csv-surprisal-prob_updated.csv". NAs in columns "prev_active" and "next_active" mean that the first or the last active object in the sequence/no active object before or after the currently active object in the sequence. When we analyzed the variable "already_there", we eliminated data that their "prev_active" variable is NA. NAs in column "already there" mean that the subject never looks at the target object in the current event. When we analyzed the variable "already there", we eliminated data that their "already_there" variable is NA. Missing values happen in columns "bisurprisal", "std_bisurprisal", "sq_std_bisurprisal" when it is the first event in the sequence and the transitional probability of the event cannot be computed because there's no event happening before in this sequence. When we fitted models for transitional statistics, we eliminated data that their "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" are NAs.

Codes:

In "Monkilock_Data_Processing.ipynb", we processed raw fixation data of 5 macaques and explored the relationship between their fixation patterns and the "surprisal" of events in each sequence. We computed the following variables which are necessary for further analysis, modeling, and visualizations in this notebook (see above for details): active_item, pre_active, next_active, firstappear ,looks_blank, looks_offscreen, time till target, looks target, look1,2,3, prob, surprisal, bisurprisal, std_surprisal, std_bisurprisal, binned_surprisal_means, binned_bisurprisal_means. "Analysis_Code_final.Rmd" is the main scripts that we further processed the data, built models, and created visualizations for data. We evaluated the statistical significance of variables using mixed effect linear and logistic regressions with random intercepts. The raw regression models include standardized linear and quadratic surprisal terms as predictors. The controlled regression models include covariate factors, such as whether an object is a repeat, the distance between the current and previous pop up object, trial number. A generalized additive model (GAM) was used to visualize the relationship between the surprisal estimate from the computational model and the behavioral data. "helper-lib.R" includes helper functions used in Analysis_Code_final.Rmd
z
Data from: Incomplete specimens in geometric morphometric analyses
zenodo.org
search.dataone.org
+2more
Updated Oct 11, 2014
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Arbour, Jessica H.; Brown, Caleb M. (2014). Data from: Incomplete specimens in geometric morphometric analyses [Dataset]. http://doi.org/10.5061/dryad.mp713
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.mp713
Dataset updated
Oct 11, 2014
Dataset provided by
University of Toronto
Authors
Arbour, Jessica H.; Brown, Caleb M.
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
1.The analysis of morphological diversity frequently relies on the use of multivariate methods for characterizing biological shape. However, many of these methods are intolerant of missing data, which can limit the use of rare taxa and hinder the study of broad patterns of ecological diversity and morphological evolution. This study applied a mutli-dataset approach to compare variation in missing data estimation and its effect on geometric morphometric analysis across taxonomically-variable groups, landmark position and sample sizes. 2.Missing morphometric landmark data was simulated from five real, complete datasets, including modern fish, primates and extinct theropod dinosaurs. Missing landmarks were then estimated using several standard approaches and a geometric-morphometric-specific method. The accuracy of missing data estimation was determined for each estimation method, landmark position, and morphological dataset. Procrustes superimposition was used to compare the eigenvectors and principal component scores of a geometric morphometric analysis of the original landmark data, to datasets with A) missing values estimated, or B) simulated incomplete specimens excluded, for varying levels of specimens incompleteness and sample sizes. 3.Standard estimation techniques were more reliable estimators and had lower impacts on morphometric analysis compared to a geometric-morphometric-specific estimator. For most datasets and estimation techniques, estimating missing data produced a better fit to the structure of the original data than exclusion of incomplete specimens, and this was maintained even at considerably reduced sample sizes. The impact of missing data on geometric morphometric analysis was disproportionately affected by the most fragmentary specimens. 4.Missing data estimation was influenced by variability of specific anatomical features, and may be improved by a better understanding of shape variation present in a dataset. Our results suggest that the inclusion of incomplete specimens through the use of effective missing data estimators better reflects the patterns of shape variation within a dataset than using only complete specimens, however the effectiveness of missing data estimation can be maximized by excluding only the most incomplete specimens. It is advised that missing data estimators be evaluated for each dataset and landmark independently, as the effectiveness of estimators can vary strongly and unpredictably between different taxa and structures.
Combinations of variable inclusion and stratification approaches where X is...
plos.figshare.com
xls
Updated Nov 20, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lucy Grigoroff; Reika Masuda; John Lindon; Janonna Kadyrov; Jeremy K. Nicholson; Elaine Holmes; Julien Wist (2025). Combinations of variable inclusion and stratification approaches where X is the clinical chemistry dataset that is missing values. C is the outcome variable, with CStrata representing separate imputation per group defined in the chosen variable and CVariable is including the outcome as a variable. YAll is the remaining metadata not used for stratification. YAll + Toxin is the same as YAll but with Toxin metadata now included. [Dataset]. http://doi.org/10.1371/journal.pone.0335852.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0335852.t001
Dataset updated
Nov 20, 2025
Dataset provided by
PLOShttp://plos.org/
Authors
Lucy Grigoroff; Reika Masuda; John Lindon; Janonna Kadyrov; Jeremy K. Nicholson; Elaine Holmes; Julien Wist
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Combinations of variable inclusion and stratification approaches where X is the clinical chemistry dataset that is missing values. C is the outcome variable, with CStrata representing separate imputation per group defined in the chosen variable and CVariable is including the outcome as a variable. YAll is the remaining metadata not used for stratification. YAll + Toxin is the same as YAll but with Toxin metadata now included.

synthetic but realistic salary prediction dataset

kaggle.com

zip

Updated Oct 29, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Arif Miah (2025). synthetic but realistic salary prediction dataset [Dataset]. https://www.kaggle.com/datasets/miadul/synthetic-but-realistic-salary-prediction-dataset

Explore at:

zip(38665 bytes)Available download formats

Dataset updated

Oct 29, 2025

Authors

Arif Miah

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

📊 Synthetic Salary Prediction Dataset (with Missing Values & Outliers)

🧠 Overview

This dataset is a synthetic but realistic salary prediction dataset designed to simulate real-world employee compensation data. It is ideal for practicing data preprocessing, EDA, machine learning model building, and deployment (e.g., Flask or Streamlit apps).

The dataset captures a range of demographic, educational, and professional attributes that typically influence salary outcomes, along with intentional missing values and outliers to provide a challenging and practical experience for learners and researchers.

🧩 Key Features

Column	Description
`age`	Employee’s age (20–60 years)
`gender`	Gender of the employee (Male, Female, Other)
`education`	Highest educational qualification
`experience_years`	Total years of work experience
`role_seniority`	Current job level (Junior, Mid, Senior, Lead)
`company_size`	Size of the organization (Startup, SME, Enterprise)
`location_tier`	Job location category (Tier-1, Tier-2, Tier-3, Remote)
`skills_count`	Number of professional/technical skills
`certifications`	Count of relevant certifications
`worked_remote`	Whether the employee works remotely (0 = No, 1 = Yes)
`last_promotion_years_ago`	Years since last promotion
`recent_project_description_length`	Word count of recent project summary
`recent_note`	Short note describing work experience or project type
`survey_date`	Synthetic date when data was recorded
`salary_bdt`	Target variable: Monthly salary in Bangladeshi Taka (BDT)

🧮 Dataset Summary

Total Rows: 2000
Total Columns: 15
Missing Values: Yes (intentionally introduced)
Outliers: Yes (~1% high-salary records to mimic real-world noise)
Use Case: Regression (Salary Prediction), EDA, Feature Engineering, Data Cleaning Practice

💡 Possible Use Cases

Predict employee salary based on experience and education
Handle missing values and perform imputation
Detect and treat outliers
Explore correlation between experience and salary
Build ML models using scikit-learn, TensorFlow, or PyTorch
Deploy salary prediction apps with Streamlit or Flask

🧰 Tech Stack for Analysis (Recommended)

Python, Pandas, NumPy, Matplotlib, Seaborn, Plotly
Scikit-learn, TensorFlow, PyTorch
Streamlit / Flask for app deployment

🧑‍💻 Author

Name: Arif Miah Background: Final Year B.Sc. Student (Computer Science and Engineering) at Port City International University Focus Areas: Machine Learning, Deep Learning, NLP, Streamlit Apps, Data Science Projects Contact: arifmiahcse@gmail.com GitHub: github.com/your-github-username

⚠️ Disclaimer

This dataset is synthetic and generated for educational and research purposes only. It does not represent any real individuals or organizations.

Z
Kaggle Wikipedia Web Traffic Daily Dataset (without Missing Values)
data.niaid.nih.gov
zenodo.org
Updated Apr 1, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Godahewa, Rakshitha; Bergmeir, Christoph; Webb, Geoff; Hyndman, Rob; Montero-Manso, Pablo (2021). Kaggle Wikipedia Web Traffic Daily Dataset (without Missing Values) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3892918
Explore at:
Dataset updated
Apr 1, 2021
Dataset provided by
Professor at Monash University
Lecturer at University of Sydney
PhD Student at Monash University
Lecturer at Monash University
Authors
Godahewa, Rakshitha; Bergmeir, Christoph; Webb, Geoff; Hyndman, Rob; Montero-Manso, Pablo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset was used in the Kaggle Wikipedia Web Traffic forecasting competition. It contains 145063 daily time series representing the number of hits or web traffic for a set of Wikipedia pages from 2015-07-01 to 2017-09-10.

The original dataset contains missing values. They have been simply replaced by zeros.
u
UKHLS
beta.ukdataservice.ac.uk
Updated Oct 21, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
UK Data Service (2022). UKHLS [Dataset]. http://doi.org/10.5255/UKDA-SN-9019-1
Explore at:
Unique identifier
https://doi.org/10.5255/UKDA-SN-9019-1
Dataset updated
Oct 21, 2022
Dataset provided by
UK Data Servicehttps://ukdataservice.ac.uk/
Area covered
United Kingdom
Description
As the UK went into the first lockdown of the COVID-19 pandemic, the team behind the biggest social survey in the UK, Understanding Society (UKHLS), developed a way to capture these experiences. From April 2020, participants from this Study were asked to take part in the Understanding Society COVID-19 survey, henceforth referred to as the COVID-19 survey or the COVID-19 study.
The COVID-19 survey regularly asked people about their situation and experiences. The resulting data gives a unique insight into the impact of the pandemic on individuals, families, and communities. The COVID-19 Teaching Dataset contains data from the main COVID-19 survey in a simplified form. It covers topics such as

Socio-demographics

Whether working at home and home-schooling

COVID symptoms

Health and well-being

Social contact and neighbourhood cohesion

Volunteering

The resource contains two data files:

Cross-sectional: contains data collected in Wave 4 in July 2020 (with some additional variables from other waves);

Longitudinal: Contains mainly data from Waves 1, 4 and 9 with key variables measured at three time points.

Key features of the dataset

Missing values: in the web survey, participants clicking "Next" but not answering a question were given further options such as "Don't know" and "Prefer not to say". Missing observations like these are recorded using negative values such as -1 for "Don't know". In many instances, users of the data will need to set these values as missing. The User Guide includes Stata and SPSS code for setting negative missing values to system missing.

The Longitudinal file is a balanced panel and is in wide format. A balanced panel means it only includes participants that took part in every wave. In wide format, each participant has one row of information, and each measurement of the same variable is a different variable.

Weights: both the cross-sectional and longitudinal files include survey weights that adjust the sample to represent the UK adult population. The cross-sectional weight (betaindin_xw) adjusts for unequal selection probabilities in the sample design and for non-response. The longitudinal weight (ci_betaindin_lw) adjusts for the sample design and also for the fact that not all those invited to participate in the survey, do participate in all waves.

Both the cross-sectional and longitudinal datasets include the survey design variables (psu and strata).

A full list of variables in both files can be found in the User Guide appendix.
Who is in the sample?
All adults (16 years old and over as of April 2020), in households who had participated in at least one of the last two waves of the main study Understanding Society, were invited to participate in this survey. From the September 2020 (Wave 5) survey onwards, only sample members who had completed at least one partial interview in any of the first four web surveys were invited to participate. From the November 2020 (Wave 6) survey onwards, those who had only completed the initial survey in April 2020 and none since, were no longer invited to participate

The User guide accompanying the data adds to the information here and includes a full variable list with details of measurement levels and links to the relevant questionnaire.
Z
KDD Cup Dataset (without Missing Values)
data.niaid.nih.gov
zenodo.org
Updated Apr 1, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Godahewa, Rakshitha; Bergmeir, Christoph; Webb, Geoff (2021). KDD Cup Dataset (without Missing Values) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3893512
Explore at:
Dataset updated
Apr 1, 2021
Dataset provided by
Professor at Monash University
PhD Student at Monash University
Lecturer at Monash University
Authors
Godahewa, Rakshitha; Bergmeir, Christoph; Webb, Geoff
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset was used in the KDD Cup 2018 forecasting competition. It contains long hourly time series representing the air quality levels in 59 stations in 2 cities: Beijing (35 stations) and London (24 stations) from 01/01/2017 to 31/03/2018. The air quality level is represented in multiple measurements such as PM2.5, PM10, NO2, CO, O3 and SO2.

The dataset uploaded here contains 282 hourly time series which have been categorized using city, station name and air quality measurement. The original dataset contains missing values and they have been simply replaced by zeros.
Rideshare Dataset without Missing Values
zenodo.org
data.niaid.nih.gov
zip
Updated Jul 23, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rakshitha Godahewa; Rakshitha Godahewa; Christoph Bergmeir; Christoph Bergmeir; Geoff Webb; Geoff Webb; Rob Hyndman; Rob Hyndman; Pablo Montero-Manso; Pablo Montero-Manso (2021). Rideshare Dataset without Missing Values [Dataset]. http://doi.org/10.5281/zenodo.5122232
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5122232
Dataset updated
Jul 23, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Rakshitha Godahewa; Rakshitha Godahewa; Christoph Bergmeir; Christoph Bergmeir; Geoff Webb; Geoff Webb; Rob Hyndman; Rob Hyndman; Pablo Montero-Manso; Pablo Montero-Manso
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains various hourly time series representations of attributes related to Uber and Lyft rideshare services for various locations in New York between 26/11/2018 and 18/12/2018.

For a given starting location, provider and service, the following types are represented: 'price_min', 'price_mean', 'price_max', 'distance_min', 'distance_mean', 'distance_max', 'surge_min', 'surge_mean', 'surge_max', 'api_calls', 'temp', 'rain', 'humidity', 'clouds' and 'wind'.

The original dataset contains missing values and they have been simply replaced by zeros.
2
QLFS
datacatalogue.ukdataservice.ac.uk
Updated Sep 5, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Office for National Statistics (2024). QLFS [Dataset]. http://doi.org/10.5255/UKDA-SN-9303-1
Explore at:
Unique identifier
https://doi.org/10.5255/UKDA-SN-9303-1
Dataset updated
Sep 5, 2024
Dataset provided by
UK Data Servicehttps://ukdataservice.ac.uk/
Authors
Office for National Statistics
Area covered
United Kingdom
Description
Background
The Labour Force Survey (LFS) is a unique source of information using international definitions of employment and unemployment and economic inactivity, together with a wide range of related topics such as occupation, training, hours of work and personal characteristics of household members aged 16 years and over. It is used to inform social, economic and employment policy. The LFS was first conducted biennially from 1973-1983. Between 1984 and 1991 the survey was carried out annually and consisted of a quarterly survey conducted throughout the year and a 'boost' survey in the spring quarter (data were then collected seasonally). From 1992 quarterly data were made available, with a quarterly sample size approximately equivalent to that of the previous annual data. The survey then became known as the Quarterly Labour Force Survey (QLFS). From December 1994, data gathering for Northern Ireland moved to a full quarterly cycle to match the rest of the country, so the QLFS then covered the whole of the UK (though some additional annual Northern Ireland LFS datasets are also held at the UK Data Archive). Further information on the background to the QLFS may be found in the documentation.

Household datasets
Up to 2015, the LFS household datasets were produced twice a year (April-June and October-December) from the corresponding quarter's individual-level data. From January 2015 onwards, they are now produced each quarter alongside the main QLFS. The household datasets include all the usual variables found in the individual-level datasets, with the exception of those relating to income, and are intended to facilitate the analysis of the economic activity patterns of whole households. It is recommended that the existing individual-level LFS datasets continue to be used for any analysis at individual level, and that the LFS household datasets be used for analysis involving household or family-level data. From January 2011, a pseudonymised household identifier variable (HSERIALP) is also included in the main quarterly LFS dataset instead.

Change to coding of missing values for household series
From 1996-2013, all missing values in the household datasets were set to one '-10' category instead of the separate '-8' and '-9' categories. For that period, the ONS introduced a new imputation process for the LFS household datasets and it was necessary to code the missing values into one new combined category ('-10'), to avoid over-complication. This was also in line with the Annual Population Survey household series of the time. The change was applied to the back series during 2010 to ensure continuity for analytical purposes. From 2013 onwards, the -8 and -9 categories have been reinstated.

LFS Documentation
The documentation available from the Archive to accompany LFS datasets largely consists of the latest version of each volume alongside the appropriate questionnaire for the year concerned. However, LFS volumes are updated periodically by ONS, so users are advised to check the ONS LFS User Guidance page before commencing analysis.

Additional data derived from the QLFS
The Archive also holds further QLFS series: End User Licence (EUL) quarterly datasets; Secure Access datasets (see below); two-quarter and five-quarter longitudinal datasets; quarterly, annual and ad hoc module datasets compiled for Eurostat; and some additional annual Northern Ireland datasets.

End User Licence and Secure Access QLFS Household datasets
Users should note that there are two discrete versions of the QLFS household datasets. One is available under the standard End User Licence (EUL) agreement, and the other is a Secure Access version. Secure Access household datasets for the QLFS are available from 2009 onwards, and include additional, detailed variables not included in the standard EUL versions. Extra variables that typically can be found in the Secure Access versions but not in the EUL versions relate to: geography; date of birth, including day; education and training; household and family characteristics; employment; unemployment and job hunting; accidents at work and work-related health problems; nationality, national identity and country of birth; occurrence of learning difficulty or disability; and benefits. For full details of variables included, see data dictionary documentation. The Secure Access version (see SN 7674) has more restrictive access conditions than those made available under the standard EUL. Prospective users will need to gain ONS Accredited Researcher status, complete an extra application form and demonstrate to the data owners exactly why they need access to the additional variables. Users are strongly advised to first obtain the standard EUL version of the data to see if they are sufficient for their research requirements.

Changes to variables in QLFS Household EUL datasets
In order to further protect respondent confidentiality, ONS have made some changes to variables available in the EUL datasets. From July-September 2015 onwards, 4-digit industry class is available for main job only, meaning that 3-digit industry group is the most detailed level available for second and last job.

Review of imputation methods for LFS Household data - changes to missing values
A review of the imputation methods used in LFS Household and Family analysis resulted in a change from the January-March 2015 quarter onwards. It was no longer considered appropriate to impute any personal characteristic variables (e.g. religion, ethnicity, country of birth, nationality, national identity, etc.) using the LFS donor imputation method. This method is primarily focused to ensure the 'economic status' of all individuals within a household is known, allowing analysis of the combined economic status of households. This means that from 2015 larger amounts of missing values ('-8'/-9') will be present in the data for these personal characteristic variables than before. Therefore if users need to carry out any time series analysis of households/families which also includes personal characteristic variables covering this time period, then it is advised to filter off 'ioutcome=3' cases from all periods to remove this inconsistent treatment of non-responders.

Occupation data for 2021 and 2022 data files
The ONS has identified an issue with the collection of some occupational data in 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. Further information can be found in the ONS article published on 11 July 2023: https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022" style="background-color: rgb(255, 255, 255);">Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022.
d
Skeletal traits for thousands of bird species v1.0
dataone.org
search.dataone.org
Updated Nov 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Brian Weeks; Zhizhuo Zhou; Charlotte Probst; Jacob Berv; Bruce O'Brien; Brett Benz; Heather Skeen; Mark Ziebell; Louise Bodt; David Fouhey (2025). Skeletal traits for thousands of bird species v1.0 [Dataset]. http://doi.org/10.5061/dryad.v41ns1s4c
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.v41ns1s4c
Dataset updated
Nov 6, 2025
Dataset provided by
Dryad Digital Repository
Authors
Brian Weeks; Zhizhuo Zhou; Charlotte Probst; Jacob Berv; Bruce O'Brien; Brett Benz; Heather Skeen; Mark Ziebell; Louise Bodt; David Fouhey
Description
The dataset spans 2,057 species of birds (Aves: Passeriformes) and includes linear measurements of 12 skeletal elements from 14,419 individuals. In addition to the trait values directly measured from photographs, we leverage the multi-dimensional nature of our dataset and known phylogenetic relationships of the species to impute missing data under an evolutionary model. The traits included in the dataset are: the lengths of the tibiotarsus, humerus, tarsometatarsus, ulna, radius, keel, carpometacarpus, 2nd digit 1st phalanx, furcula, and femur; the maximum outer diameter of the sclerotic ring, and the length from the back of the skull to the tip of the bill (treating the rhamphotheca as part of the bill when it remains present on the specimen). These data are presented in three ways: 1) a dataset that only includes trait estimates for elements that were confidently identified and measured, 2) a complete specimen-level dataset that includes imputed trait values for all missing data, and ..., These data were collected from museum skeletal specimens. To measure traits, images were taken of skeletal specimens and then Skelevision, a computer vision method, was used to segment out the bones in the images, identify them, and measure them; this method is described in detail in Weeks et al. (2023). In addition to presenting the data that were generated using Skelevision, we generated a 100% complete dataset by imputing all missing values in the dataset using Rphylopars (Goolsby et al. 2017), which is a method for fitting multivariate phylogenetic models and estimating missing values in comparative data. We also present species-level means along with associated estimates of uncertainty derived from the Rphylopars model. We validated the Skelevision estimates by comparing them to handmade measurements, and we assessed the trait imputation accuracy by withholding data and imputing the withheld values. The validation procedure and results are outlined in detail in Weeks et al. (..., , # Skeletal Traits for Thousands of Bird Species v1.0

https://doi.org/10.5061/dryad.v41ns1s4c

Description of the data and file structure

The data presented here were generated using photographs of museum skeletal specimens. These data were used to generate three versions of the dataset:

1) Skelevision Only Dataset v1. This version of the dataset only includes traits that were confidently measured using the Skelevision computer vision pipeline, described in detail in Weeks et al. (2023), and implemented as described in Weeks et al. (2024).

2) Complete Trait Dataset v1. This version of the dataset includes a complete specimen-level dataset. It was generated by imputing all missing trait values using evolutionary models as described and validated in Weeks et al. (2024).

3) Skelevision species complete v1. This version of the dataset presents species mean trait values generated using evolutionary models, as outlined in Weeks et al. (2024...,
f
Application of Multiple Imputation for Missing Values in Three-Way...
plos.figshare.com
zip
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ting Tian; Geoffrey J. McLachlan; Mark J. Dieters; Kaye E. Basford (2023). Application of Multiple Imputation for Missing Values in Three-Way Three-Mode Multi-Environment Trial Data [Dataset]. http://doi.org/10.1371/journal.pone.0144370
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0144370
Dataset updated
May 30, 2023
Dataset provided by
PLOS ONE
Authors
Ting Tian; Geoffrey J. McLachlan; Mark J. Dieters; Kaye E. Basford
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
It is a common occurrence in plant breeding programs to observe missing values in three-way three-mode multi-environment trial (MET) data. We proposed modifications of models for estimating missing observations for these data arrays, and developed a novel approach in terms of hierarchical clustering. Multiple imputation (MI) was used in four ways, multiple agglomerative hierarchical clustering, normal distribution model, normal regression model, and predictive mean match. The later three models used both Bayesian analysis and non-Bayesian analysis, while the first approach used a clustering procedure with randomly selected attributes and assigned real values from the nearest neighbour to the one with missing observations. Different proportions of data entries in six complete datasets were randomly selected to be missing and the MI methods were compared based on the efficiency and accuracy of estimating those values. The results indicated that the models using Bayesian analysis had slightly higher accuracy of estimation performance than those using non-Bayesian analysis but they were more time-consuming. However, the novel approach of multiple agglomerative hierarchical clustering demonstrated the overall best performances.
t
Orange Dataset - Dataset - LDM
service.tib.eu
Updated Dec 2, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Orange Dataset - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/orange-dataset
Explore at:
Dataset updated
Dec 2, 2024
Description
The Orange dataset is a standard telecom dataset used for churn prediction. It has 18 features with missing values and 5 features have just a single value.
i07 WellReportStatsBySection
data.ca.gov
data.cnra.ca.gov
+1more
Updated Apr 14, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
California Department of Water Resources (2022). i07 WellReportStatsBySection [Dataset]. https://data.ca.gov/dataset/i07-wellreportstatsbysection
Explore at:
html, arcgis geoservices rest api, zip, csv, geojson, kmlAvailable download formats
Dataset updated
Apr 14, 2022
Dataset provided by
California Department of Water Resourceshttp://www.water.ca.gov/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This feature class represents an index of records from the California Department of Water Resources' (DWR) Online System for Well Completion Reports (OSWCR). This feature class is for informational purposes only. All attribute values should be verified by reviewing the original Well Completion Report. Known issues include: - Missing and duplicate records - Missing values (either missing on original Well Completion Report, or not key entered into database) - Incorrect values (e.g. incorrect Latitude, Longitude, Record Type, Planned Use, Total Completed Depth) - Limited spatial resolution: The majority of well completion reports have been spatially registered to the center of the 1x1 mile Public Land Survey System section that the well is located in.

This Well Completion Report dataset represents an index of records from the California Department of Water Resources' (DWR) Online System for Well Completion Reports (OSWCR). This dataset is for informational purposes only. All attribute values should be verified by reviewing the original Well Completion Report. Known issues include: - Missing and duplicate records - Missing values (either missing on original Well Completion Report, or not key entered into database) - Incorrect values (e.g. incorrect Latitude, Longitude, Record Type, Planned Use, Total Completed Depth) - Limited spatial resolution: The majority of well completion reports have been spatially registered to the center of the 1x1 mile Public Land Survey System section that the well is located in.

Facebook

Twitter

Click to copy link

Link copied

Cite

E. Hance Ellington; Guillaume Bastille-Rousseau; Cayla Austin; Kristen N. Landolt; Bruce A. Pond; Erin E. Rees; Nicholas Robar; Dennis L. Murray (2015). Using multiple imputation to estimate missing data in meta-regression [Dataset]. http://doi.org/10.5061/dryad.m2v4m

Data from: Using multiple imputation to estimate missing data in meta-regression

Explore at:

zipAvailable download formats

Unique identifier

https://doi.org/10.5061/dryad.m2v4m

Dataset updated

Nov 25, 2015

Dataset provided by

Trent University
University of Prince Edward Island

Authors

E. Hance Ellington; Guillaume Bastille-Rousseau; Cayla Austin; Kristen N. Landolt; Bruce A. Pond; Erin E. Rees; Nicholas Robar; Dennis L. Murray

License

https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

Description

There is a growing need for scientific synthesis in ecology and evolution. In many cases, meta-analytic techniques can be used to complement such synthesis. However, missing data is a serious problem for any synthetic efforts and can compromise the integrity of meta-analyses in these and other disciplines. Currently, the prevalence of missing data in meta-analytic datasets in ecology and the efficacy of different remedies for this problem have not been adequately quantified. 2. We generated meta-analytic datasets based on literature reviews of experimental and observational data and found that missing data were prevalent in meta-analytic ecological datasets. We then tested the performance of complete case removal (a widely used method when data are missing) and multiple imputation (an alternative method for data recovery) and assessed model bias, precision, and multi-model rankings under a variety of simulated conditions using published meta-regression datasets. 3. We found that complete case removal led to biased and imprecise coefficient estimates and yielded poorly specified models. In contrast, multiple imputation provided unbiased parameter estimates with only a small loss in precision. The performance of multiple imputation, however, was dependent on the type of data missing. It performed best when missing values were weighting variables, but performance was mixed when missing values were predictor variables. Multiple imputation performed poorly when imputing raw data which was then used to calculate effect size and the weighting variable. 4. We conclude that complete case removal should not be used in meta-regression, and that multiple imputation has the potential to be an indispensable tool for meta-regression in ecology and evolution. However, we recommend that users assess the performance of multiple imputation by simulating missing data on a subset of their data before implementing it to recover actual missing data.

Clear search

Close search

Google apps

Main menu

Data from: Using multiple imputation to estimate missing data in...

Cyclist Dataset

Film Circulation dataset

ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture...

Dataset Paper (Open Access)

Abstract

Summary

Programmatic Download

Data details

Data Variables

Version Changelog

Software to open netCDF files

References

Related Records

Data from: Dataset from the Upper Mississippi River Restoration Program...

Walmart complete updated stocks dataset

Walmart (WMT) Stock Price Data (1970 - 2025)

Dataset Overview:

Data Source

Columns Description

Potential Use Cases

Data Integrity

How to Use the Data?

Load the dataset

Display the first few rows

Acknowledgments

More Dataset

Temperature Rain Dataset without Missing Values

Data from: Macaques preferentially attend to intermediately surprising...

Data from: Incomplete specimens in geometric morphometric analyses

Combinations of variable inclusion and stratification approaches where X is...

synthetic but realistic salary prediction dataset

📊 Synthetic Salary Prediction Dataset (with Missing Values & Outliers)

🧠 Overview

🧩 Key Features

🧮 Dataset Summary

💡 Possible Use Cases

🧰 Tech Stack for Analysis (Recommended)

🧑‍💻 Author

⚠️ Disclaimer

Kaggle Wikipedia Web Traffic Daily Dataset (without Missing Values)

UKHLS

KDD Cup Dataset (without Missing Values)

Rideshare Dataset without Missing Values

QLFS

Skeletal traits for thousands of bird species v1.0

Description of the data and file structure

Application of Multiple Imputation for Missing Values in Three-Way...

Orange Dataset - Dataset - LDM

i07 WellReportStatsBySection

Data from: Using multiple imputation to estimate missing data in meta-regression