100+ datasets found
  1. Retail Product Dataset with Missing Values

    • kaggle.com
    zip
    Updated Feb 17, 2025
    Cite
    Himel Sarder (2025). Retail Product Dataset with Missing Values [Dataset]. https://www.kaggle.com/datasets/himelsarder/retail-product-dataset-with-missing-values
    Explore at:
    zip (47826 bytes)
    Dataset updated
    Feb 17, 2025
    Authors
    Himel Sarder
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This synthetic dataset contains 4,362 rows and five columns, including both numerical and categorical data. It is designed for data cleaning, imputation, and analysis tasks, featuring structured missing values at varying percentages (63%, 4%, 47%, 31%, and 9%).

    The dataset includes:
    - Category (Categorical): Product category (A, B, C, D)
    - Price (Numerical): Randomized product prices
    - Rating (Numerical): Ratings between 1 to 5
    - Stock (Categorical): Availability status (In Stock, Out of Stock)
    - Discount (Numerical): Discount percentage

    This dataset is ideal for practicing missing data handling, exploratory data analysis (EDA), and machine learning preprocessing.
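
    As a quick illustration of the intended use, the following sketch (ours, not the dataset author's) loads the CSV with pandas and applies a simple median/mode imputation baseline; the file name is a placeholder for whatever the Kaggle download contains.

    import pandas as pd

    # Placeholder file name; substitute the CSV from the Kaggle download.
    df = pd.read_csv("retail_product_dataset.csv")

    # Per-column share of missing values (the listing reports 63%, 4%,
    # 47%, 31% and 9% across the five columns).
    print(df.isna().mean().round(2))

    # Baseline imputation: median for numeric columns, mode for categorical.
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])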

  2. Vehicle Trips Dataset with Missing Values

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jul 23, 2021
    + more versions
    Cite
    Rakshitha Godahewa; Christoph Bergmeir; Geoff Webb; Rob Hyndman; Pablo Montero-Manso (2021). Vehicle Trips Dataset with Missing Values [Dataset]. http://doi.org/10.5281/zenodo.5122535
    Explore at:
    zip
    Dataset updated
    Jul 23, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rakshitha Godahewa; Christoph Bergmeir; Geoff Webb; Rob Hyndman; Pablo Montero-Manso
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains 329 daily time series representing the number of trips and vehicles belonging to a set of for-hire vehicle (FHV) companies. The dataset was originally extracted from https://github.com/fivethirtyeight/uber-tlc-foil-response.

  3. Missing Values (Pima Indians Diabetes data)

    • kaggle.com
    zip
    Updated Apr 10, 2021
    Cite
    Vishvaa Chhatrara (2021). Missing Values (Pima Indians Diabetes data) [Dataset]. https://www.kaggle.com/datasets/vishvaachhatrara/missing-values-pima-indians-diabetes-data
    Explore at:
    zip (8977 bytes)
    Dataset updated
    Apr 10, 2021
    Authors
    Vishvaa Chhatrara
    Description

    This dataset was created by Vishvaa Chhatrara.

  4. Superstore Sales Dataset Cleaned

    • kaggle.com
    zip
    Updated Jan 13, 2023
    Cite
    Shubham Jagtap (2023). Superstore Sales Dataset Cleaned [Dataset]. https://www.kaggle.com/datasets/shubhamcjagtap4/superstore-sales-dataset-cleaned
    Explore at:
    zip (1389086 bytes)
    Dataset updated
    Jan 13, 2023
    Authors
    Shubham Jagtap
    Description

    This dataset was created by Shubham Jagtap.

  5. Orange dataset table

    • figshare.com
    xlsx
    Updated Mar 4, 2022
    Cite
    Rui Simões (2022). Orange dataset table [Dataset]. http://doi.org/10.6084/m9.figshare.19146410.v1
    Explore at:
    xlsx
    Dataset updated
    Mar 4, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Rui Simões
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The complete dataset used in the analysis comprises 36 samples, each described by 11 numeric features and 1 target. The attributes considered were caspase 3/7 activity, Mitotracker red CMXRos area and intensity (3 h and 24 h incubations with both compounds), Mitosox oxidation (3 h incubation with the referred compounds) and oxidation rate, DCFDA fluorescence (3 h and 24 h incubations with either compound) and oxidation rate, and DQ BSA hydrolysis. The target of each instance corresponds to one of the 9 possible classes (4 samples per class): Control, 6.25, 12.5, 25 and 50 µM for 6-OHDA and 0.03, 0.06, 0.125 and 0.25 µM for rotenone. The dataset is balanced, contains no missing values, and was standardized across features. The small number of samples prevented a full and robust statistical analysis of the results; nevertheless, it allowed the identification of relevant hidden patterns and trends.

    Exploratory data analysis, information gain, hierarchical clustering, and supervised predictive modeling were performed using Orange Data Mining version 3.25.1 [41]. Hierarchical clustering was performed using the Euclidean distance metric and weighted linkage. Cluster maps were plotted to relate the features with higher mutual information (in rows) with instances (in columns), with the color of each cell representing the normalized level of a particular feature in a specific instance. The information is grouped both in rows and in columns by a two-way hierarchical clustering method using Euclidean distances and average linkage. Stratified cross-validation was used to train the supervised decision tree. A set of preliminary empirical experiments was performed to choose the best parameters for each algorithm, and we verified that, within moderate variations, there were no significant changes in the outcome. The following settings were adopted for the decision tree algorithm: minimum number of samples in leaves: 2; minimum number of samples required to split an internal node: 5; stop splitting when majority reaches: 95%; criterion: gain ratio. The performance of the supervised model was assessed using accuracy, precision, recall, F-measure, and area under the ROC curve (AUC).
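
    For readers who want to approximate this setup outside Orange, here is a rough scikit-learn sketch of the decision tree and stratified cross-validation described above. Caveats: scikit-learn offers information gain ("entropy") but not Orange's gain ratio, and has no "stop splitting when majority reaches 95%" option, so this is an analogue rather than a reproduction; the random feature matrix is a stand-in for the 36 x 11 dataset.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(36, 11))   # stand-in for the 36 samples x 11 features
    y = np.repeat(np.arange(9), 4)  # 9 classes, 4 samples per class

    # Analogue of the Orange settings: min 2 samples in leaves,
    # min 5 samples to split; entropy in place of gain ratio.
    tree = DecisionTreeClassifier(criterion="entropy",
                                  min_samples_leaf=2,
                                  min_samples_split=5)

    scores = cross_val_score(tree, X, y, cv=StratifiedKFold(n_splits=4))
    print(scores.mean())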

  6. ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture from merged multi-satellite observations

    • researchdata.tuwien.ac.at
    • researchdata.tuwien.at
    zip
    Updated Sep 5, 2025
    + more versions
    Cite
    Wolfgang Preimesberger; Pietro Stradiotti; Wouter Arnoud Dorigo (2025). ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture from merged multi-satellite observations [Dataset]. http://doi.org/10.48436/3fcxr-cde10
    Explore at:
    zip
    Dataset updated
    Sep 5, 2025
    Dataset provided by
    TU Wien
    Authors
    Wolfgang Preimesberger; Pietro Stradiotti; Wouter Arnoud Dorigo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    This dataset was produced with funding from the European Space Agency (ESA) Climate Change Initiative (CCI) Plus Soil Moisture Project (CCN 3 to ESRIN Contract No: 4000126684/19/I-NB "ESA CCI+ Phase 1 New R&D on CCI ECVS Soil Moisture"). Project website: https://climate.esa.int/en/projects/soil-moisture/

    This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.

    Dataset Paper (Open Access)

    A description of this dataset, including the methodology and validation results, is available at:

    Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: an independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data, 17, 4305–4329, https://doi.org/10.5194/essd-17-4305-2025, 2025.

    Abstract

    ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
    However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
    Since the requirement for a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such a gap-filling method is that it relies only on the original observational record, without the need for ancillary variables or model-based information. Due to this intrinsic challenge, no global, long-term, univariate gap-filled product has been available until now. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments that fill satellite-like gaps introduced into GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and consider the gap size and local vegetation conditions as parameters that affect the gap-filling performance.

    Summary

    • Gap-filled global estimates of volumetric surface soil moisture from 1991-2023 at 0.25° sampling
    • Fields of application (partial): climate variability and change, land-atmosphere interactions, global biogeochemical cycles and ecology, hydrological and land surface modelling, drought applications, and meteorology
    • Method: Modified version of DCT-PLS (Garcia, 2010) interpolation/smoothing algorithm, linear interpolation over periods of frozen soils. Uncertainty estimates are provided for all data points.
    • More information: See Preimesberger et al. (2025) and the ESA CCI SM Algorithm Theoretical Baseline Document [Chapter 7.2.9] (Dorigo et al., 2023): https://doi.org/10.5281/zenodo.8320869

    Programmatic Download

    You can use command line tools such as wget or curl to download (and extract) data for multiple years. The following script will download and extract the complete data set to the local directory ~/Downloads on Linux or macOS systems.

    #!/bin/bash

    # Set download directory
    DOWNLOAD_DIR=~/Downloads

    base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"

    # Loop through years 1991 to 2023 and download & extract data
    for year in {1991..2023}; do
        echo "Downloading $year.zip..."
        wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
        # Quote the target directory so paths with spaces also work
        unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
        rm "$DOWNLOAD_DIR/$year.zip"
    done

    Data details

    The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped into one subdirectory per year (YYYY); each subdirectory contains one netCDF image file per day (DD) and month (MM), on a 2-dimensional (longitude, latitude) grid (CRS: WGS84). The file names follow this convention:

    ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc

    Data Variables

    Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:

    • sm: (float) The Soil Moisture variable reflects estimates of daily average volumetric soil moisture content (m3/m3) in the soil surface layer (~0-5 cm) over a whole grid cell (0.25 degree).
    • sm_uncertainty: (float) The Soil Moisture Uncertainty variable reflects the uncertainty (random error) of the original satellite observations and of the predictions used to fill observation data gaps.
    • sm_anomaly: Soil moisture anomalies (reference period 1991-2020) derived from the gap-filled values (`sm`)
    • sm_smoothed: Contains DCT-PLS predictions used to fill data gaps in the original soil moisture field. These values are also provided where an observation was initially available (compare `gapmask`); in that case, they provide a smoothed version of the original data.
    • gapmask: (0 | 1) Indicates grid cells where a satellite observation is available (1), and where the interpolated (smoothed) values are used instead (0) in the 'sm' field.
    • frozenmask: (0 | 1) Indicates grid cells where ERA5 soil temperature is <0 °C. In this case, a linear interpolation over time is applied.

    Additional information for each variable is given in the netCDF attributes.
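
    As a minimal reading sketch (assuming a local copy of one daily file and the variable names listed above), the data can be opened with Python's xarray:

    import xarray as xr

    # One daily image, named following the convention above.
    fname = "ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-20200101000000-fv09.1r1.nc"
    ds = xr.open_dataset(fname)

    sm = ds["sm"]                            # volumetric soil moisture (m3/m3)
    observed = sm.where(ds["gapmask"] == 1)  # keep only true satellite observations
    print(float(sm.mean()), float(observed.mean()))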

    Version Changelog

    Changes in v09.1r1 (previous version was v09.1):

    • This version uses a novel uncertainty estimation scheme as described in Preimesberger et al. (2025).

    Software to open netCDF files

    These data can be read by any software that supports Climate and Forecast (CF) conformant metadata standards for netCDF files, for example Python's xarray or netCDF4 libraries, or NASA's Panoply viewer.

    References

    • Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: an independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data, 17, 4305–4329, https://doi.org/10.5194/essd-17-4305-2025, 2025.
    • Dorigo, W., Preimesberger, W., Stradiotti, P., Kidd, R., van der Schalie, R., van der Vliet, M., Rodriguez-Fernandez, N., Madelon, R., & Baghdadi, N. (2023). ESA Climate Change Initiative Plus - Soil Moisture Algorithm Theoretical Baseline Document (ATBD) Supporting Product Version 08.1 (version 1.1). Zenodo. https://doi.org/10.5281/zenodo.8320869
    • Garcia, D., 2010. Robust smoothing of gridded data in one and higher dimensions with missing values. Computational Statistics & Data Analysis, 54(4), pp.1167-1178. Available at: https://doi.org/10.1016/j.csda.2009.09.020
    • Rodell, M., Houser, P. R., Jambor, U., Gottschalck, J., Mitchell, K., Meng, C.-J., Arsenault, K., Cosgrove, B., Radakovich, J., Bosilovich, M., Entin, J. K., Walker, J. P., Lohmann, D., and Toll, D.: The Global Land Data Assimilation System, Bulletin of the American Meteorological Society, 85, 381 – 394, https://doi.org/10.1175/BAMS-85-3-381, 2004.

    Related Records

    The following records are all part of the ESA CCI Soil Moisture science data record community:

    • ESA CCI SM MODELFREE Surface Soil Moisture Record: https://doi.org/10.48436/svr1r-27j77

  7. Retail Store Sales: Dirty for Data Cleaning

    • kaggle.com
    zip
    Updated Jan 18, 2025
    Cite
    Ahmed Mohamed (2025). Retail Store Sales: Dirty for Data Cleaning [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/retail-store-sales-dirty-for-data-cleaning
    Explore at:
    zip (226740 bytes)
    Dataset updated
    Jan 18, 2025
    Authors
    Ahmed Mohamed
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dirty Retail Store Sales Dataset

    Overview

    The Dirty Retail Store Sales dataset contains 12,575 rows of synthetic data representing sales transactions from a retail store. The dataset includes eight product categories with 25 items per category, each having static prices. It is designed to simulate real-world sales data, including intentional "dirtiness" such as missing or inconsistent values. This dataset is suitable for practicing data cleaning, exploratory data analysis (EDA), and feature engineering.

    File Information

    • File Name: retail_store_sales.csv
    • Number of Rows: 12,575
    • Number of Columns: 11

    Columns Description

    Column Name | Description | Example Values
    Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567
    Customer ID | A unique identifier for each customer. 25 unique customers. | CUST_01
    Category | The category of the purchased item. | Food, Furniture
    Item | The name of the purchased item. May contain missing values or None. | Item_1_FOOD, None
    Price Per Unit | The static price of a single unit of the item. May contain missing or None values. | 4.00, None
    Quantity | The quantity of the item purchased. May contain missing or None values. | 1, None
    Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, None
    Payment Method | The method of payment used. May contain missing or invalid values. | Cash, Credit Card
    Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Online
    Transaction Date | The date of the transaction. Always present and valid. | 2023-01-15
    Discount Applied | Indicates if a discount was applied to the transaction. May contain missing values. | True, False, None
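
    A small consistency check, sketched below under the assumption that the CSV parses into the documented columns (the to_numeric coercion guards against literal "None" strings), recomputes Total Spent from Quantity and Price Per Unit:

    import pandas as pd

    df = pd.read_csv("retail_store_sales.csv")

    # Coerce the numeric columns; literal "None" strings become NaN.
    qty = pd.to_numeric(df["Quantity"], errors="coerce")
    price = pd.to_numeric(df["Price Per Unit"], errors="coerce")
    total = pd.to_numeric(df["Total Spent"], errors="coerce")

    # Flag rows where the stated total is missing or disagrees with
    # Quantity * Price Per Unit (allowing for float rounding).
    bad = total.isna() | (total - qty * price).abs().gt(0.01)
    print(f"{int(bad.sum())} of {len(df)} rows have a missing or inconsistent Total Spent")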

    Categories and Items

    The dataset includes the following categories, each containing 25 items with corresponding codes, names, and static prices:

    Electric Household Essentials

    Item Code | Item Name | Price
    Item_1_EHE | Blender | 5.0
    Item_2_EHE | Microwave | 6.5
    Item_3_EHE | Toaster | 8.0
    Item_4_EHE | Vacuum Cleaner | 9.5
    Item_5_EHE | Air Purifier | 11.0
    Item_6_EHE | Electric Kettle | 12.5
    Item_7_EHE | Rice Cooker | 14.0
    Item_8_EHE | Iron | 15.5
    Item_9_EHE | Ceiling Fan | 17.0
    Item_10_EHE | Table Fan | 18.5
    Item_11_EHE | Hair Dryer | 20.0
    Item_12_EHE | Heater | 21.5
    Item_13_EHE | Humidifier | 23.0
    Item_14_EHE | Dehumidifier | 24.5
    Item_15_EHE | Coffee Maker | 26.0
    Item_16_EHE | Portable AC | 27.5
    Item_17_EHE | Electric Stove | 29.0
    Item_18_EHE | Pressure Cooker | 30.5
    Item_19_EHE | Induction Cooktop | 32.0
    Item_20_EHE | Water Dispenser | 33.5
    Item_21_EHE | Hand Blender | 35.0
    Item_22_EHE | Mixer Grinder | 36.5
    Item_23_EHE | Sandwich Maker | 38.0
    Item_24_EHE | Air Fryer | 39.5
    Item_25_EHE | Juicer | 41.0

    Furniture

    Item Code | Item Name | Price
    Item_1_FUR | Office Chair | 5.0
    Item_2_FUR | Sofa | 6.5
    Item_3_FUR | Coffee Table | 8.0
    Item_4_FUR | Dining Table | 9.5
    Item_5_FUR | Bookshelf | 11.0
    Item_6_FUR | Bed F...
  8. Kaggle Wikipedia Web Traffic Daily Dataset (without Missing Values)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 1, 2021
    + more versions
    Cite
    Godahewa, Rakshitha; Bergmeir, Christoph; Webb, Geoff; Hyndman, Rob; Montero-Manso, Pablo (2021). Kaggle Wikipedia Web Traffic Daily Dataset (without Missing Values) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3892918
    Explore at:
    Dataset updated
    Apr 1, 2021
    Dataset provided by
    Lecturer at Monash University
    Lecturer at University of Sydney
    Professor at Monash University
    PhD Student at Monash University
    Authors
    Godahewa, Rakshitha; Bergmeir, Christoph; Webb, Geoff; Hyndman, Rob; Montero-Manso, Pablo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was used in the Kaggle Wikipedia Web Traffic forecasting competition. It contains 145,063 daily time series representing the number of hits (web traffic) for a set of Wikipedia pages from 2015-07-01 to 2017-09-10.

    The original dataset contains missing values; in this version they have simply been replaced by zeros.
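
    If your workflow needs to distinguish "no data" from "no traffic", the zero-filling can be undone only approximately; a small sketch, with a placeholder series standing in for one of the 145,063 series:

    import numpy as np
    import pandas as pd

    # Placeholder; in practice, load one series from the downloaded archive.
    hits = pd.Series([5.0, 0.0, 7.0, 0.0, 9.0])

    # Re-mark zeros as missing. Caveat: genuine zero-traffic days are
    # indistinguishable from filled gaps and will also be masked.
    hits_with_gaps = hits.replace(0.0, np.nan)
    print(hits_with_gaps)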

  9. Dataset for numerical analysis

    • figshare.com
    • data.mendeley.com
    zip
    Updated Nov 28, 2023
    Cite
    Shi Chen; Dong Chen; Jyh-Horng Lin (2023). Dataset for numerical analysis [Dataset]. http://doi.org/10.6084/m9.figshare.24648945.v1
    Explore at:
    zip
    Dataset updated
    Nov 28, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Shi Chen; Dong Chen; Jyh-Horng Lin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains one Excel sheet and five Word documents. Simulation.xlsx describes the parameter values used for the numerical analysis, based on empirical data; in this Excel sheet, we calculated the value of each capped call-option model parameter. Computation of Table 2.docx and the other Word documents show the results of the comparative statics.

  10. Data from: Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 - 7—Data

    • catalog.data.gov
    • data.usgs.gov
    • +2more
    Updated Nov 27, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 - 7—Data [Dataset]. https://catalog.data.gov/dataset/variable-terrestrial-gps-telemetry-detection-rates-parts-1-7data
    Explore at:
    Dataset updated
    Nov 27, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    Studies utilizing Global Positioning System (GPS) telemetry rarely result in 100% fix success rates (FSR). Many assessments of wildlife resource use do not account for missing data, either assuming data loss is random or because of a lack of practical treatments for systematic data loss. Several studies have explored how the environment, technological features, and animal behavior influence rates of missing data in GPS telemetry, but previous spatially explicit models developed to correct for sampling bias have been specified for small study areas, for a small range of data loss, or for single species, limiting their general utility. Here we explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use. We also evaluate patterns in missing data that relate to potential animal activities that change the orientation of the antenna, and characterize home-range probability of GPS detection for four focal species: cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni), and mule deer (Odocoileus hemionus).

    Part 1, Positive Openness Raster (raster dataset): Openness is an angular measure of the relationship between surface relief and horizontal distance. For angles less than 90 degrees it is equivalent to the internal angle of a cone with its apex at a DEM location, constrained by neighboring elevations within a specified radial distance; a 480 meter search radius was used for this calculation of positive openness. Openness incorporates the terrain line-of-sight, or viewshed, concept and is calculated from multiple zenith and nadir angles, here along eight azimuths. Positive openness measures openness above the surface, with high values for convex forms and low values for concave forms (Yokoyama et al. 2002). We calculated positive openness using a custom Python script, following the methods of Yokoyama et al. (2002), with a USGS National Elevation Dataset as input.

    Part 2, Northern Arizona GPS Test Collar (csv): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall overstory vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and little bias in predictions. The model's predictive ability was evaluated using two independent datasets from stationary test collars of different make/model and fix interval programming, placed at different study sites. The absence of statistically significant differences (95% CI) between predicted and observed FSRs suggests that changes in technological factors have minor influence on the model's ability to predict FSR in new study areas in the southwestern US. The model training data are provided here as fix attempts by hour. This table can be linked with the site location shapefile using the site field (a join sketch follows at the end of this description).

    Part 3, Probability Raster (raster dataset): Using the model above, we evaluated GPS telemetry datasets by comparing the mean probability of a successful GPS fix across study animals' home-ranges to the observed FSR of GPS-downloaded collars deployed on cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni), and mule deer (Odocoileus hemionus). Comparing the mean probability of acquisition within study animals' home-ranges with the observed FSRs of GPS-downloaded collars resulted in an approximately 1:1 linear relationship with an r-squared of 0.68.

    Part 4, GPS Test Collar Sites (shapefile): The locations of the stationary test collars used to derive and evaluate the predictive model of fix acquisition described in Part 2.

    Part 5, Cougar Home Ranges (shapefile): Cougar home-ranges were calculated to compare the mean probability of GPS fix acquisition across the home-range to the actual fix success rate (FSR) of the collar, as a means of evaluating whether characteristics of an animal's home-range affect observed FSR. We estimated home-ranges using the Local Convex Hull (LoCoH) method with the 90th isopleth. Only data obtained by direct GPS download of retrieved units were used; satellite-delivered data were omitted for animals whose collars were lost or damaged, because satellite delivery tends to lose an additional 10% of data. Comparisons with home-range mean probability of fix were also used as a reference for assessing whether the frequency with which animals use areas of low GPS acquisition rates plays a role in observed FSRs.

    Part 6, Cougar Fix Success Rate by Hour (csv): Cougar GPS collar fix success varied by hour of day, suggesting that circadian rhythms, with bouts of rest during daylight hours, may change the orientation of the GPS receiver and affect the ability to acquire fixes. Raw data on overall fix success rates (FSR) and FSR by hour were used to predict relative reductions in FSR. The data include only direct GPS download datasets; satellite-delivered data were omitted for animals whose collars were lost or damaged, because satellite delivery tends to lose approximately an additional 10% of data.

    Part 7, Openness Python Script version 2.0: This Python script was used to calculate positive openness using a 30 meter digital elevation model for a large geographic area in Arizona, California, Nevada, and Utah. A scientific research project used the script to explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use.
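
    As noted in Part 2, the training table links to the site shapefile via the site field; a minimal join sketch with pandas/geopandas follows, with hypothetical local file names for the downloaded parts:

    import geopandas as gpd
    import pandas as pd

    # Hypothetical file names for Part 2 (fix attempts by hour) and
    # Part 4 (test collar site locations); adjust to your download.
    fixes = pd.read_csv("part2_na_gps_test_collar.csv")
    sites = gpd.read_file("part4_gps_test_collar_sites.shp")

    # Join hourly fix attempts to site geometry on the shared "site" field.
    joined = sites.merge(fixes, on="site")
    print(joined.head())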

  11. Facebook Deactivation Participants

    • socialmediaarchive.org
    pdf, xlsx
    Updated May 21, 2024
    Cite
    (2024). Facebook Deactivation Participants [Dataset]. https://socialmediaarchive.org/record/61?v=pdf
    Explore at:
    xlsx (16172 bytes), xlsx (33969 bytes), pdf (813810 bytes)
    Dataset updated
    May 21, 2024
    Description

    This table includes platform data for Facebook participants in the Deactivation experiment. Each row of the dataset corresponds to data from a participant’s Facebook user account. Each column contains a value, or set of values, that aggregates log data for this specific participant over a certain period of time.

  12. Data from: Missing Tooth Dataset

    • universe.roboflow.com
    zip
    Updated Jan 13, 2025
    Cite
    Tooth Problem (2025). Missing Tooth Dataset [Dataset]. https://universe.roboflow.com/tooth-problem/missing-tooth-hod1j/model/3
    Explore at:
    zip
    Dataset updated
    Jan 13, 2025
    Dataset authored and provided by
    Tooth Problem
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Missing Tooth Bounding Boxes
    Description

    Missing Tooth

    ## Overview
    
    Missing Tooth is a dataset for object detection tasks - it contains Missing Tooth annotations for 1,153 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License

    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
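
    Beyond the manual download, the dataset can also be pulled with the roboflow Python package; a sketch, with the workspace/project/version slugs taken from the dataset URL above and a placeholder API key (the "coco" export format is one choice among several):

    from roboflow import Roboflow

    rf = Roboflow(api_key="YOUR_API_KEY")  # your Roboflow API key
    project = rf.workspace("tooth-problem").project("missing-tooth-hod1j")
    dataset = project.version(3).download("coco")
    print(dataset.location)  # local folder containing images and annotations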
    
  13. Cafe Sales - Dirty Data for Cleaning Training

    • kaggle.com
    zip
    Updated Jan 17, 2025
    Cite
    Ahmed Mohamed (2025). Cafe Sales - Dirty Data for Cleaning Training [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training
    Explore at:
    zip (113510 bytes)
    Dataset updated
    Jan 17, 2025
    Authors
    Ahmed Mohamed
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dirty Cafe Sales Dataset

    Overview

    The Dirty Cafe Sales dataset contains 10,000 rows of synthetic data representing sales transactions in a cafe. This dataset is intentionally "dirty," with missing values, inconsistent data, and errors introduced to provide a realistic scenario for data cleaning and exploratory data analysis (EDA). It can be used to practice cleaning techniques, data wrangling, and feature engineering.

    File Information

    • File Name: dirty_cafe_sales.csv
    • Number of Rows: 10,000
    • Number of Columns: 8

    Columns Description

    Column Name | Description | Example Values
    Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567
    Item | The name of the item purchased. May contain missing or invalid values (e.g., "ERROR"). | Coffee, Sandwich
    Quantity | The quantity of the item purchased. May contain missing or invalid values. | 1, 3, UNKNOWN
    Price Per Unit | The price of a single unit of the item. May contain missing or invalid values. | 2.00, 4.00
    Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, 12.00
    Payment Method | The method of payment used. May contain missing or invalid values (e.g., None, "UNKNOWN"). | Cash, Credit Card
    Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Takeaway
    Transaction Date | The date of the transaction. May contain missing or incorrect values. | 2023-01-01

    Data Characteristics

    1. Missing Values:

      • Some columns (e.g., Item, Payment Method, Location) may contain missing values represented as None or empty cells.
    2. Invalid Values:

      • Some rows contain invalid entries like "ERROR" or "UNKNOWN" to simulate real-world data issues.
    3. Price Consistency:

      • Prices for menu items are consistent but may have missing or incorrect values introduced.

    Menu Items

    The dataset includes the following menu items with their respective prices:

    Item | Price ($)
    Coffee | 2
    Tea | 1.5
    Sandwich | 4
    Salad | 5
    Cake | 3
    Cookie | 1
    Smoothie | 4
    Juice | 3

    Use Cases

    This dataset is suitable for:

    • Practicing data cleaning techniques such as handling missing values, removing duplicates, and correcting invalid entries.
    • Exploring EDA techniques like visualizations and summary statistics.
    • Performing feature engineering for machine learning workflows.

    Cleaning Steps Suggestions

    To clean this dataset, consider the following steps:

    1. Handle Missing Values:

      • Fill missing numeric values with the median or mean.
      • Replace missing categorical values with the mode or "Unknown".
    2. Handle Invalid Values:

      • Replace invalid entries like "ERROR" and "UNKNOWN" with NaN or appropriate values.
    3. Date Consistency:

      • Ensure all dates are in a consistent format.
      • Fill missing dates with plausible values based on nearby records.
    4. Feature Engineering:

      • Create new columns, such as Day of the Week or Transaction Month, for further analysis.
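
    A compact pandas sketch of steps 1-4 (ours; the file and column names are as documented above):

    import numpy as np
    import pandas as pd

    df = pd.read_csv("dirty_cafe_sales.csv")

    # Steps 1-2: treat "ERROR"/"UNKNOWN" as missing, then impute.
    df = df.replace(["ERROR", "UNKNOWN"], np.nan)
    for col in ["Quantity", "Price Per Unit", "Total Spent"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")
        df[col] = df[col].fillna(df[col].median())
    for col in ["Item", "Payment Method", "Location"]:
        df[col] = df[col].fillna("Unknown")

    # Step 3: consistent dates (unparseable entries become NaT).
    df["Transaction Date"] = pd.to_datetime(df["Transaction Date"], errors="coerce")

    # Step 4: simple derived features.
    df["Day of the Week"] = df["Transaction Date"].dt.day_name()
    df["Transaction Month"] = df["Transaction Date"].dt.month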

    License

    This dataset is released under the CC BY-SA 4.0 License. You are free to use, share, and adapt it, provided you give appropriate credit.

    Feedback

    If you have any questions or feedback, feel free to reach out through the dataset's discussion board on Kaggle.

  14. CoAID dataset with multiple extracted features (both sparse and dense)

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    Updated Jun 10, 2022
    Cite
    Guillaume Bernard (2022). CoAID dataset with multiple extracted features (both sparse and dense) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6630404
    Explore at:
    Dataset updated
    Jun 10, 2022
    Dataset provided by
    Laboratoire L3i, Université de La Rochelle
    Authors
    Guillaume Bernard
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a publication of the CoAID dataset, originally dedicated to fake news detection. We repurpose it here for use in the context of event tracking in press documents.

    Cui, Limeng, and Dongwon Lee. 2020. "CoAID: COVID-19 Healthcare Misinformation Dataset". arXiv:2006.00885 [cs], November. http://arxiv.org/abs/2006.00885.

    In this dataset, we provide multiple features extracted from the text itself. Please note that, for copyright reasons, the text is missing from the dataset published in CSV format. You can download the original datasets and manually add the missing texts from the original publications.

    Features are extracted using:

    • A corpus of reference articles in multiple languages for TF-IDF weighting. (features_news) [1]

    • A corpus of tweets reporting news for TF-IDF weighting. (features_tweets) [1]

    • A S-BERT model [2] that uses distiluse-base-multilingual-cased-v1 (called features_use)

    • A S-BERT model [2] that uses paraphrase-multilingual-mpnet-base-v2 (called features_mpnet)
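
    The dense features can be reproduced with the sentence-transformers package and the exact model names given above; a minimal sketch (the example texts are placeholders for the news texts, which must be re-added manually as described):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("distiluse-base-multilingual-cased-v1")
    texts = ["placeholder news text one", "placeholder news text two"]
    embeddings = model.encode(texts)  # one dense vector per input text
    print(embeddings.shape)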

    References:

    [1]: Guillaume Bernard. (2022). Resources to compute TF-IDF weightings on press articles and tweets (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6610406

    [2]: Reimers, Nils, and Iryna Gurevych. 2019. "Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks". In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982-92. Hong Kong, China: Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1410.

  15. Data from: NeuroSense: A Novel EEG Dataset Utilizing Low-Cost, Sparse Electrode Devices for Emotion Exploration

    • zenodo.org
    • data-staging.niaid.nih.gov
    Updated Oct 30, 2024
    Cite
    Tommaso Colafiglio; Angela Lombardi; Paolo Sorino; Elvira Brattico; Domenico Lofù; Danilo Danese; Eugenio Di Sciascio; Tommaso Di Noia; Fedelucio Narducci (2024). NeuroSense: A Novel EEG Dataset Utilizing Low-Cost, Sparse Electrode Devices for Emotion Exploration [Dataset]. http://doi.org/10.5281/zenodo.14003181
    Explore at:
    Dataset updated
    Oct 30, 2024
    Dataset provided by
    Zenodo
    Authors
    Tommaso Colafiglio; Angela Lombardi; Paolo Sorino; Elvira Brattico; Domenico Lofù; Danilo Danese; Eugenio Di Sciascio; Tommaso Di Noia; Fedelucio Narducci
    Time period covered
    Oct 28, 2024
    Description

    README


    Details related to access to the data

    Data user agreement

    The terms and conditions for using this dataset are specified in the [LICENCE](LICENCE) file included in this repository. Please review these terms carefully before accessing or using the data.

    Contact person

    For additional information about the dataset, please contact:
    - Name: Angela Lombardi
    - Affiliation: Department of Electrical and Information Engineering, Politecnico di Bari
    - Email: angela.lombardi@poliba.it

    Practical information to access the data

    The dataset can be accessed through our dedicated web platform. To request access:

    1. Visit the main dataset page at: https://sisinflab.poliba.it/neurosense-dataset-request/
    2. Follow the instructions on the website to submit your access request
    3. Upon approval, you will receive further instructions for downloading the data

    Please ensure you have read and agreed to the terms in the data user agreement before requesting access.

    Overview

    EEG Emotion Recognition - Muse Headset
    2023-2024

    The experiment consists of 40 sessions per user. During each session, users are asked to watch a
    music video, with the aim of understanding their emotions.
    Recordings are performed with a Muse EEG headset at a 256 Hz sampling rate.
    Channels are recorded as follows:
    - Channel 0: AF7
    - Channel 1: TP9
    - Channel 2: TP10
    - Channel 3: AF8

    The chosen songs have various Last.fm tags in order to elicit different feelings. The title of every track
    can be found in the "TaskName" field of sub-ID***_ses-S***_task-Default_run-001_eeg.json, while the author,
    the Last.fm tag, and additional information appear in "TaskDescription".
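
    When loading raw recordings that lack embedded metadata, the documented channel layout and sampling rate can be declared with MNE-Python; a minimal sketch (ours, not part of the dataset tooling):

    import mne

    # Channel order and sampling rate as documented above.
    info = mne.create_info(ch_names=["AF7", "TP9", "TP10", "AF8"],
                           sfreq=256.0, ch_types="eeg")
    print(info)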

    Methods

    Subjects

    The subject pool consists of 30 college students aged between 18 and 35; 16 are male and 14 are female.

    Apparatus

    The experiment was performed using the same procedures as those used to create the
    [DEAP dataset](https://www.eecs.qmul.ac.uk/mmv/datasets/deap/), a dataset for recognizing emotions via a Brain
    Computer Interface (BCI).


    Task organization

    Firstly, music videos were selected. Once 40 songs were picked, the protocol was chosen and the self-assessment
    questionnaire was created.

    Task details

    In order to evaluate the stimulus, Russell's VAD (Valence-Arousal-Dominance) scale was used.
    In this scale, the valence-arousal space can be divided into four quadrants:
    - Low Arousal/Low Valence (LALV);
    - Low Arousal/High Valence (LAHV);
    - High Arousal/Low Valence (HALV);
    - High Arousal/High Valence (HAHV).

    Experimental location

    The experiment was performed in a laboratory located at DEI Department of
    [Politecnico di Bari](https://www.poliba.it/).

    Missing data


    Data recorded for user ID019 (Session 2) and user ID021 (Session 23) was corrupted and is therefore missing.
    Sessions S033 and S038 of user ID015 show a calculated effective sampling rate lower than 256 Hz:
    - ID015_ses-S033 has 226.1320 Hz
    - ID015_ses-S038 has 216.9549 Hz

  16. Missing Teeth Dataset

    • universe.roboflow.com
    zip
    Updated Nov 5, 2025
    Cite
    Data Annotation (2025). Missing Teeth Dataset [Dataset]. https://universe.roboflow.com/data-annotation-sjsn9/missing-teeth-g8n2u
    Explore at:
    zip
    Dataset updated
    Nov 5, 2025
    Dataset authored and provided by
    Data Annotation
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Missing_teeth Bounding Boxes
    Description

    Missing Teeth

    ## Overview
    
    Missing Teeth is a dataset for object detection tasks - it contains Missing_teeth annotations for 800 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License

    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  17. GLDAS Noah Land Surface Model L4 3 hourly 0.25 x 0.25 degree V2.0 (GLDAS_NOAH025_3H) at GES DISC

    • data.nasa.gov
    Updated Apr 1, 2025
    + more versions
    Cite
    nasa.gov (2025). GLDAS Noah Land Surface Model L4 3 hourly 0.25 x 0.25 degree V2.0 (GLDAS_NOAH025_3H) at GES DISC - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/gldas-noah-land-surface-model-l4-3-hourly-0-25-x-0-25-degree-v2-0-gldas-noah025-3h-at-ges--ddbc8
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    NASA Global Land Data Assimilation System Version 2 (GLDAS-2) has three components: GLDAS-2.0, GLDAS-2.1, and GLDAS-2.2. GLDAS-2.0 is forced entirely with the Princeton meteorological forcing input data and provides a temporally consistent series from 1948 through 2014. GLDAS-2.1 is forced with a combination of model and observation data from 2000 to present. GLDAS-2.2 product suites use data assimilation (DA), whereas the GLDAS-2.0 and GLDAS-2.1 products are "open-loop" (i.e., no data assimilation). The choice of forcing data, as well as DA observation source, variable, and scheme, varies among the GLDAS-2.2 products.

    This data product, GLDAS-2.0 0.25 degree 3-hourly, was reprocessed and replaced its previous version on November 27, 2019. It contains a series of land surface parameters simulated from the Noah Model 3.6 and currently covers January 1948 to December 2014; it will be extended as new data become available. The GLDAS-2.0 data are archived and distributed in netCDF format.

    The GLDAS-2.0 model simulations were initialized on January 1, 1948, using soil moisture and other state fields from the LSM climatology for that day of the year. The simulations were forced by the global meteorological forcing data set from Princeton University (Sheffield et al., 2006). Each simulation uses the common GLDAS data sets for the land water mask (MOD44W: Carroll et al., 2009) and elevation (GTOPO30), along with the model's default land cover and soils datasets. The Noah model uses the Modified IGBP MODIS 20-category vegetation classification and soil texture based on the Hybrid STATSGO/FAO datasets. The MODIS-based land surface parameters are used in the current GLDAS-2.0 and GLDAS-2.1 products, while the AVHRR-based parameters were used in GLDAS-1 and previous GLDAS-2 products (prior to October 2012).

    In October 2020, all 3-hourly and monthly GLDAS-2 data were post-processed with the MOD44W MODIS land mask. Previously, some grid boxes over inland water were considered as over land and thus had non-missing values. The post-processing corrected this issue and masked out all model output data over inland water; it did not affect the meteorological forcing variables. More information can be found in the GLDAS-2 README. The MOD44W MODIS land mask is available on the GLDAS Project site. If you downloaded GLDAS data prior to November 2020, please download the data again to receive the post-processed data.

  18. Complete Rxivist dataset of scraped biology preprint data

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1more
    Updated Mar 2, 2023
    Cite
    Abdill, Richard J.; Blekhman, Ran (2023). Complete Rxivist dataset of scraped biology preprint data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2529922
    Explore at:
    Dataset updated
    Mar 2, 2023
    Dataset provided by
    University of Minnesota
    Authors
    Abdill, Richard J.; Blekhman, Ran
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    rxivist.org allowed readers to sort and filter the tens of thousands of preprints posted to bioRxiv and medRxiv. Rxivist used a custom web crawler to index all papers posted to those two websites; this is a snapshot of the Rxivist production database. The version number indicates the date on which the snapshot was taken. See the included "README.md" file for instructions on how to use the "rxivist.backup" file to import the data into a PostgreSQL database server.

    Please note this is a different repository than the one used for the Rxivist manuscript—that is in a separate Zenodo repository. You're welcome (and encouraged!) to use this data in your research, but please cite our paper, now published in eLife.

    Previous versions are also available pre-loaded into Docker images, available at blekhmanlab/rxivist_data.
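
    Once the backup has been restored into a local PostgreSQL server (per the README), the snapshot can be queried like any other database. A minimal sketch with psycopg2, assuming a local database named rxivist with default credentials; the "articles" table and its "repo" field are described in the version notes below:

    import psycopg2

    conn = psycopg2.connect(dbname="rxivist", user="postgres", host="localhost")
    with conn, conn.cursor() as cur:
        # Count indexed preprints per source site (bioRxiv vs. medRxiv).
        cur.execute("SELECT repo, COUNT(*) FROM articles GROUP BY repo")
        for repo, n in cur.fetchall():
            print(repo, n)
    conn.close()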

    Version notes:

    2023-03-01

    The final Rxivist data upload, more than four years after the first and encompassing 223,541 preprints posted to bioRxiv and medRxiv through the end of February 2023.

    2020-12-07

    In addition to bioRxiv preprints, the database now includes all medRxiv preprints as well.

    The website where a preprint was posted is now recorded in a new field in the "articles" table, called "repo".

    We've significantly refactored the web crawler to take advantage of developments with the bioRxiv API.

    The main difference is that preprints flagged as "published" by bioRxiv are no longer recorded on the same schedule that download metrics are updated: The Rxivist database should now record published DOI entries the same day bioRxiv detects them.

    Twitter metrics have returned, for the most part. Improvements with the Crossref Event Data API mean we can once again tally daily Twitter counts for all bioRxiv DOIs.

    The "crossref_daily" table remains where these are recorded, and daily numbers are now up to date.

    Historical daily counts have also been re-crawled to fill in the empty space that started in October 2019.

    There are still several gaps that are more than a week long due to missing data from Crossref.

    We have recorded available Crossref Twitter data for all papers with DOI numbers starting with "10.1101," which includes all medRxiv preprints. However, there appears to be almost no Twitter data available for medRxiv preprints.

    The download metrics for article id 72514 (DOI 10.1101/2020.01.30.927871) were found to be out of date for February 2020 and are now correct. This is notable because article 72514 is the most downloaded preprint of all time; we're still looking into why this wasn't updated after the month ended.

    2020-11-18

    Publication checks should be back on schedule.

    2020-10-26

    This snapshot fixes most of the data issues found in the previous version. Indexed papers are now up to date, and download metrics are back on schedule. The check for publication status remains behind schedule, however, and the database may not include published DOIs for papers that have been flagged on bioRxiv as "published" over the last two months. Another snapshot will be posted in the next few weeks with updated publication information.

    2020-09-15

    A crawler error caused this snapshot to exclude all papers posted after about August 29, with some papers having download metrics that were more out of date than usual. The "last_crawled" field is accurate.

    2020-09-08

    This snapshot is misconfigured and will not work without modification; it has been replaced with version 2020-09-15.

    2019-12-27

    Several dozen papers did not have dates associated with them; that has been fixed.

    Some authors have had two entries in the "authors" table for portions of 2019, one profile that was linked to their ORCID and one that was not, occasionally with almost identical "name" strings. This happened after bioRxiv began changing author names to reflect the names in the PDFs, rather than the ones manually entered into their system. These database records are mostly consolidated now, but some may remain.

    2019-11-29

    The Crossref Event Data API remains down; Twitter data is unavailable for dates after early October.

    2019-10-31

    The Crossref Event Data API is still experiencing problems; the Twitter data for October is incomplete in this snapshot.

    The README file has been modified to reflect changes in the process for creating your own DB snapshots if using the newly released PostgreSQL 12.

    2019-10-01

    The Crossref API is back online, and the "crossref_daily" table should now include up-to-date tweet information for July through September.

    About 40,000 authors were removed from the author table because the name had been removed from all preprints they had previously been associated with, likely because their name changed slightly on the bioRxiv website ("John Smith" to "J Smith" or "John M Smith"). The "author_emails" table was also modified to remove entries referring to the deleted authors. The web crawler is being updated to clean these orphaned entries more frequently.

    2019-08-30

    The Crossref Event Data API, which provides the data used to populate the table of tweet counts, has not been fully functional since early July. While we are optimistic that accurate tweet counts will be available at some point, the sparse values currently in the "crossref_daily" table for July and August should not be considered reliable.

    2019-07-01

    A new "institution" field has been added to the "article_authors" table that stores each author's institutional affiliation as listed on that paper. The "authors" table still has each author's most recently observed institution.

    We began collecting this data in the middle of May, but it has not been applied to older papers yet.

    2019-05-11

    The README was updated to correct a link to the Docker repository used for the pre-built images.

    2019-03-21

    The license for this dataset has been changed to CC-BY, which allows use for any purpose and requires only attribution.

    A new table, "publication_dates," has been added and will be continually updated. This table will include an entry for each preprint that has been published externally for which we can determine a date of publication, based on data from Crossref. (This table was previously included in the "paper" schema but was not updated after early December 2018.)

    Foreign key constraints have been added to almost every table in the database. This should not impact any read behavior, but anyone writing to these tables will encounter constraints on existing fields that refer to other tables. Most frequently, this means the "article" field in a table will need to refer to an ID that actually exists in the "articles" table.

    The "author_translations" table has been removed. This was used to redirect incoming requests for outdated author profile pages and was likely not of any functional use to others.

    The "README.md" file has been renamed "1README.md" because Zenodo only displays a preview for the file that appears first in the list alphabetically.

    The "article_ranks" and "article_ranks_working" tables have been removed as well; they were unused.

    2019-02-13.1

    After consultation with bioRxiv, the "fulltext" table will not be included in further snapshots until (and if) concerns about licensing and copyright can be resolved.

    The "docker-compose.yml" file was added, with corresponding instructions in the README to streamline deployment of a local copy of this database.

    2019-02-13

    The redundant "paper" schema has been removed.

    BioRxiv has begun making the full text of preprints available online. Beginning with this version, a new table ("fulltext") is available that contains the text of preprints that have been processed already. The format in which this information is stored may change in the future; any digression will be noted here.

    This is the first version that has a corresponding Docker image.

  19. INSPIRE Download Service (predefined ATOM) for dataset NA 41 crossing Pirmasenser Strasse - L471/Western part

    • data.europa.eu
    atom feed
    Cite
    LVermGeo im Auftrag von Zweibrücken, INSPIRE Download Service (predefined ATOM) for dataset NA 41 crossing Pirmasenser Strasse - L471/Western part [Dataset]. https://data.europa.eu/88u/dataset/ceeda47d-e529-0002-069f-8ef120746413
    Explore at:
    atom feed
    Dataset authored and provided by
    LVermGeo im Auftrag von Zweibrücken
    Description

    Description of the INSPIRE Download Service (predefined Atom): advertising, entertainment venues, company residences. The links for downloading the datasets are dynamically generated from GetMap calls to a WMS interface.

  20. FLDAS Noah Land Surface Model L4 Global Monthly 0.1 x 0.1 degree (MERRA-2 and CHIRPS) V001 (FLDAS_NOAH01_C_GL_M) at GES DISC

    • catalog.data.gov
    • s.cnmilf.com
    • +3more
    Updated Aug 22, 2025
    + more versions
    Cite
    NASA/GSFC/SED/ESD/TISL/GESDISC (2025). FLDAS Noah Land Surface Model L4 Global Monthly 0.1 x 0.1 degree (MERRA-2 and CHIRPS) V001 (FLDAS_NOAH01_C_GL_M) at GES DISC [Dataset]. https://catalog.data.gov/dataset/fldas-noah-land-surface-model-l4-global-monthly-0-1-x-0-1-degree-merra-2-and-chirps-v001-f-a740b
    Explore at:
    Dataset updated
    Aug 22, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    This dataset contains a series of land surface parameters simulated from the Noah 3.6.1 model in the Famine Early Warning Systems Network (FEWS NET) Land Data Assimilation System (FLDAS). The data are at 0.10 degree resolution and range from January 1982 to present; the temporal resolution is monthly and the spatial coverage is global (60S, 180W, 90N, 180E). The FLDAS regional monthly datasets are no longer available and have been superseded by this global monthly dataset.

    The simulation was forced by a combination of the Modern-Era Retrospective analysis for Research and Applications version 2 (MERRA-2) data and Climate Hazards Group InfraRed Precipitation with Station (CHIRPS) 6-hourly rainfall data that has been downscaled using the NASA Land Data Toolkit. The simulation was initialized on January 1, 1982 using soil moisture and other state fields from a FLDAS/Noah model climatology for that day of the year.

    In November 2020, all FLDAS data were post-processed with the MOD44W MODIS land mask. Previously, some grid boxes over inland water were considered as over land and thus had non-missing values. The post-processing corrected this issue and masked out all model output data over inland water; it did not affect the meteorological forcing variables. More information can be found in the FLDAS README document, and the MOD44W MODIS land mask is available on the FLDAS Project site. If you downloaded any FLDAS data prior to November 2020, please download the data again to receive the post-processed data.
