98 datasets found

Handling of missing values in python
kaggle.com
zip
Updated Jul 3, 2022
Cite
xodeum (2022). Handling of missing values in python [Dataset]. https://www.kaggle.com/datasets/xodeum/handling-of-missing-values-in-python
Explore at:
Available download formats: zip (2634 bytes)
Dataset updated
Jul 3, 2022
Authors
xodeum
Description
In this dataset I show how to handle missing values in your data with the help of Python libraries such as NumPy and pandas. You can also see the use of NaN and None values, and the detecting, dropping, and filling of null values.
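As a minimal sketch of the three operations named in this description (detecting, dropping, and filling null values), the pandas calls below can be used. The toy columns `age` and `city` are hypothetical and are not taken from the dataset itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; np.nan and None both count as missing in pandas.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Lahore", None, "Karachi", "Lahore"],
})

# Detect: count missing values per column.
missing_counts = df.isna().sum()

# Drop: keep only the rows that have no missing values.
dropped = df.dropna()

# Fill: column mean for the numeric column, a placeholder for the text column.
filled = df.fillna({"age": df["age"].mean(), "city": "unknown"})

print(missing_counts)
print(filled)
```

Whether dropping or filling is the better choice depends on how much data is missing and why; the sketch only shows the mechanics.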
Finding_And_Visualizing_Missing_Data_Python
kaggle.com
zip
Updated Nov 29, 2025
Cite
Dr. Nagendra (2025). Finding_And_Visualizing_Missing_Data_Python [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/finding-and-visualizing-missing-data-python
Explore at:
Available download formats: zip (371581 bytes)
Dataset updated
Nov 29, 2025
Authors
Dr. Nagendra
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
• This dataset is designed for learning how to identify missing data in Python.
• It focuses on techniques to detect null, NaN, and incomplete values.
• It includes examples of visualizing missing data patterns using Python libraries.
• Useful for beginners practicing data preprocessing and data cleaning.
• Helps users understand missing data handling methods for machine learning workflows.
• Supports practical exploration of datasets before model training.
Python Script for Cleaning Alum Dataset
hydroshare.org
search.dataone.org
zip
Updated Nov 12, 2025
Cite
saikumar payyavula; Jeff Sadler (2025). Python Script for Cleaning Alum Dataset [Dataset]. https://www.hydroshare.org/resource/6e58232cbf3346619ec37bbc51ba513d
Explore at:
Available download formats: zip (1.7 KB)
Dataset updated
Nov 12, 2025
Dataset provided by
HydroShare
Authors
saikumar payyavula; Jeff Sadler
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This resource contains a Python script used to clean and preprocess the alum dosage dataset from a small Oklahoma water treatment plant. The script handles missing values, removes outliers, merges historical water quality and weather data, and prepares the dataset for AI model training.
Cleaning Practice with Errors & Missing Values
kaggle.com
Updated Jun 5, 2025
Cite
Zuhair khan (2025). Cleaning Practice with Errors & Missing Values [Dataset]. https://www.kaggle.com/datasets/zuhairkhan13/cleaning-practice-with-errors-and-missing-values
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jun 5, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Zuhair khan
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset is designed specifically for beginners and intermediate learners to practice data cleaning techniques using Python and Pandas.

It includes 500 rows of simulated employee data with intentional errors such as:

Missing values in Age and Salary

Typos in email addresses (@gamil.com)

Inconsistent city name casing (e.g., lahore, Karachi)

Extra spaces in department names (e.g., " HR ")

✅ Skills You Can Practice:

Detecting and handling missing data

String cleaning and formatting

Removing duplicates

Validating email formats

Standardizing categorical data

You can use this dataset to build your own data cleaning notebook, or use it in interviews, assessments, and tutorials.
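A short pandas sketch of the practice skills listed above is given here. The rows and the column names (`Age`, `Email`, `City`, `Department`) are hypothetical stand-ins that imitate the described errors; the real file may use different names.

```python
import pandas as pd

# Hypothetical rows imitating the errors described above.
df = pd.DataFrame({
    "Age": [30.0, None, 45.0, 30.0],
    "Email": ["a@gamil.com", "b@gmail.com", "c@gmail.com", "a@gamil.com"],
    "City": ["lahore", "Karachi", "LAHORE", "lahore"],
    "Department": [" HR ", "IT", "hr", " HR "],
})

clean = df.copy()
# Fix the known email typo.
clean["Email"] = clean["Email"].str.replace("@gamil.com", "@gmail.com", regex=False)
# Standardize casing and strip stray spaces.
clean["City"] = clean["City"].str.strip().str.title()
clean["Department"] = clean["Department"].str.strip().str.upper()
# Fill missing ages with the median age.
clean["Age"] = clean["Age"].fillna(clean["Age"].median())
# Remove exact duplicate rows.
clean = clean.drop_duplicates().reset_index(drop=True)

# Simple format check only; this is not a full email validator.
clean["EmailValid"] = clean["Email"].str.match(r"^[\w.+-]+@[\w-]+\.[\w.]+$")
print(clean)
```

Median imputation is only one option for the missing `Age` and `Salary` values; it is used here because it is robust to outliers.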
ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture...
researchdata.tuwien.ac.at
researchdata.tuwien.at
zip
Updated Sep 5, 2025
+ more versions
Cite
Wolfgang Preimesberger; Pietro Stradiotti; Wouter Arnoud Dorigo (2025). ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture from merged multi-satellite observations [Dataset]. http://doi.org/10.48436/3fcxr-cde10
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.48436/3fcxr-cde10
Dataset updated
Sep 5, 2025
Dataset provided by
TU Wien
Authors
Wolfgang Preimesberger; Pietro Stradiotti; Wouter Arnoud Dorigo
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset was produced with funding from the European Space Agency (ESA) Climate Change Initiative (CCI) Plus Soil Moisture Project (CCN 3 to ESRIN Contract No: 4000126684/19/I-NB "ESA CCI+ Phase 1 New R&D on CCI ECVS Soil Moisture"). Project website: https://climate.esa.int/en/projects/soil-moisture/

This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.

Dataset Paper (Open Access)

A description of this dataset, including the methodology and validation results, is available at:

Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: an independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data, 17, 4305–4329, https://doi.org/10.5194/essd-17-4305-2025, 2025.

Abstract

ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
Since the requirement of a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such a gap-filling method is to rely only on the original observational record, without the need for ancillary variables or model-based information. Due to the intrinsic challenge, there has until now been no global, long-term univariate gap-filled product available. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments to fill satellite-like gaps introduced to GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and consider the gap size and local vegetation conditions as parameters that affect the gapfilling performance.

Summary

Gap-filled global estimates of volumetric surface soil moisture from 1991-2023 at 0.25° sampling

Fields of application (partial): climate variability and change, land-atmosphere interactions, global biogeochemical cycles and ecology, hydrological and land surface modelling, drought applications, and meteorology

Method: Modified version of DCT-PLS (Garcia, 2010) interpolation/smoothing algorithm, linear interpolation over periods of frozen soils. Uncertainty estimates are provided for all data points.

More information: See Preimesberger et al. (2025) and the ESA CCI SM Algorithm Theoretical Baseline Document [Chapter 7.2.9] (Dorigo et al., 2023), https://doi.org/10.5281/zenodo.8320869

Programmatic Download

You can use command line tools such as wget or curl to download (and extract) data for multiple years. The following command will download and extract the complete data set to the local directory ~/Downloads on Linux or macOS systems.

#!/bin/bash

# Set download directory
DOWNLOAD_DIR=~/Downloads

base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"

# Loop through years 1991 to 2023 and download & extract data
for year in {1991..2023}; do
    echo "Downloading $year.zip..."
    wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
    unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
    rm "$DOWNLOAD_DIR/$year.zip"
done

Data details

The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), each subdirectory containing one netCDF image file for a specific day (DD), month (MM) in a 2-dimensional (longitude, latitude) grid system (CRS: WGS84). The file name has the following convention:

ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc
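The naming convention above can be turned into a small helper for locating a given day's file. This is only a sketch: the function names are made up for illustration, and the relative path assumes the per-year subdirectories (YYYY) described in the data details.

```python
from datetime import date


def gapfilled_filename(day: date) -> str:
    """Build the daily file name following the convention quoted above."""
    return (
        "ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-"
        f"{day:%Y%m%d}000000-fv09.1r1.nc"
    )


def gapfilled_relpath(day: date) -> str:
    """Assumed layout: one subdirectory per year holding the daily files."""
    return f"{day.year}/{gapfilled_filename(day)}"


print(gapfilled_relpath(date(2001, 7, 4)))
```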

Data Variables

Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:

sm: (float) The Soil Moisture variable reflects estimates of daily average volumetric soil moisture content (m3/m3) in the soil surface layer (~0-5 cm) over a whole grid cell (0.25 degree).

sm_uncertainty: (float) The Soil Moisture Uncertainty variable reflects the uncertainty (random error) of the original satellite observations and of the predictions used to fill observation data gaps.

sm_anomaly: Soil moisture anomalies (reference period 1991-2020) derived from the gap-filled values (`sm`)

sm_smoothed: Contains DCT-PLS predictions used to fill data gaps in the original soil moisture field. These values are also provided for cases where an observation was initially available (compare `gapmask`). In this case, they provide a smoothed version of the original data.

gapmask: (0 | 1) Indicates grid cells where a satellite observation is available (1), and where the interpolated (smoothed) values are used instead (0) in the 'sm' field.

frozenmask: (0 | 1) Indicates grid cells where ERA5 soil temperature is <0 °C. In this case, a linear interpolation over time is applied.

Additional information for each variable is given in the netCDF attributes.

Version Changelog

Changes in v09.1r1 (previous version was v09.1):

This version uses a novel uncertainty estimation scheme as described in Preimesberger et al. (2025).

Software to open netCDF files

These data can be read by any software that supports Climate and Forecast (CF) conform metadata standards for netCDF files, such as:

Xarray (python): https://github.com/pydata/xarray

netCDF4 (python): https://unidata.github.io/netcdf4-python/

esa_cci_sm (python): https://github.com/TUW-GEO/esa_cci_sm

Similar tools exist for other programming languages (Matlab, R, etc.)

Software packages and GIS tools can open netCDF files, e.g. CDO, NCO, QGIS, ArcGIS

You can also use the GUI software Panoply to view the contents of each file

References

Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: an independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data, 17, 4305–4329, https://doi.org/10.5194/essd-17-4305-2025, 2025.

Dorigo, W., Preimesberger, W., Stradiotti, P., Kidd, R., van der Schalie, R., van der Vliet, M., Rodriguez-Fernandez, N., Madelon, R., & Baghdadi, N. (2023). ESA Climate Change Initiative Plus - Soil Moisture Algorithm Theoretical Baseline Document (ATBD) Supporting Product Version 08.1 (version 1.1). Zenodo. https://doi.org/10.5281/zenodo.8320869

Garcia, D., 2010. Robust smoothing of gridded data in one and higher dimensions with missing values. Computational Statistics & Data Analysis, 54(4), pp.1167-1178. Available at: https://doi.org/10.1016/j.csda.2009.09.020

Rodell, M., Houser, P. R., Jambor, U., Gottschalck, J., Mitchell, K., Meng, C.-J., Arsenault, K., Cosgrove, B., Radakovich, J., Bosilovich, M., Entin, J. K., Walker, J. P., Lohmann, D., and Toll, D.: The Global Land Data Assimilation System, Bulletin of the American Meteorological Society, 85, 381 – 394, https://doi.org/10.1175/BAMS-85-3-381, 2004.

Related Records

The following records are all part of the ESA CCI Soil Moisture science data records community

1. ESA CCI SM MODELFREE Surface Soil Moisture Record: https://doi.org/10.48436/svr1r-27j77
Details on imputation methods in R or python.
plos.figshare.com
xls
Updated Nov 7, 2025
Cite
Elena Albu; Shan Gao; Laure Wynants; Ben Van Calster (2025). Details on imputation methods in R or python. [Dataset]. http://doi.org/10.1371/journal.pone.0334125.t005
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0334125.t005
Dataset updated
Nov 7, 2025
Dataset provided by
PLOS (http://plos.org/)
Authors
Elena Albu; Shan Gao; Laure Wynants; Ben Van Calster
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Prediction models are used to predict an outcome based on input variables. Missing data in input variables often occur at model development and at prediction time. The missForestPredict R package proposes an adaptation of the missForest imputation algorithm that is fast, user-friendly and tailored for prediction settings. The algorithm iteratively imputes variables using random forests until a convergence criterion, unified for continuous and categorical variables, is met. The imputation models are saved for each variable and iteration and can be applied later to new observations at prediction time. The missForestPredict package offers extended error monitoring, control over variables used in the imputation and custom initialization. This allows users to tailor the imputation to their specific needs. The missForestPredict algorithm is compared to mean/mode imputation, linear regression imputation, mice, k-nearest neighbours, bagging, miceRanger and IterativeImputer on eight simulated datasets with simulated missingness (48 scenarios) and eight large public datasets using different prediction models. missForestPredict provides competitive results in prediction settings within short computation times.
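The idea of saving imputation models at development time and reusing them on new observations at prediction time can be illustrated with the simplest baseline named in this description, mean/mode imputation. The sketch below is written in pandas and is not the missForestPredict algorithm; the function names and the columns `x` and `g` are hypothetical.

```python
import pandas as pd


def fit_simple_imputer(train: pd.DataFrame) -> dict:
    """Learn one fill value per column: mean for numeric, mode otherwise."""
    fill = {}
    for col in train.columns:
        if pd.api.types.is_numeric_dtype(train[col]):
            fill[col] = train[col].mean()
        else:
            fill[col] = train[col].mode().iloc[0]
    return fill


def apply_simple_imputer(new: pd.DataFrame, fill: dict) -> pd.DataFrame:
    """Apply the stored fill values without looking at the new data's statistics."""
    return new.fillna(fill)


# Development time: learn and store the fill values.
train = pd.DataFrame({"x": [1.0, 3.0, None], "g": ["a", "a", "b"]})
fill_values = fit_simple_imputer(train)

# Prediction time: reuse the stored values on new observations.
new_obs = pd.DataFrame({"x": [None, 10.0], "g": [None, "b"]})
imputed = apply_simple_imputer(new_obs, fill_values)
print(imputed)
```

The point of the split into a fit step and an apply step is that a single new observation can be imputed even though no statistics can be computed from it alone.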
Adult dataset preprocessed
data.niaid.nih.gov
Updated Jul 1, 2024
Cite
Pustozerova, Anastasia; Schuster, Verena (2024). Adult dataset preprocessed [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12533513
Explore at:
Dataset updated
Jul 1, 2024
Dataset provided by
SBA Research
Authors
Pustozerova, Anastasia; Schuster, Verena
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The files "adult_train.csv" and "adult_test.csv" contain preprocessed versions of the Adult dataset from the UCI repository.

The file "adult_preprocessing.ipynb" contains a python notebook file with all the preprocessing steps used to generate "adult_train.csv" and "adult_test.csv" from the original Adult dataset.

The preprocessing steps include:

One-hot-encoding of categorical values

Imputation of missing values using knn-imputer with k=1

Standard scaling of ordinal attributes

Note: we assume the scenario when the test set is available before training (every attribute besides the target - "income"), therefore we combine train and test sets before the preprocessing.
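A hedged sketch of the imputation step follows, using scikit-learn's `KNNImputer` with `n_neighbors=1` on combined train and test features as the note describes. The columns `age` and `hours` are hypothetical stand-ins rather than the actual Adult attributes, and the one-hot encoding and scaling steps are left out.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical feature frames; the target column is assumed to be excluded already.
train = pd.DataFrame({"age": [25.0, 26.0, 60.0], "hours": [40.0, np.nan, 20.0]})
test = pd.DataFrame({"age": [59.0], "hours": [np.nan]})

# Combine train and test before preprocessing, as in the scenario described above.
combined = pd.concat([train, test], ignore_index=True)

# k=1: each missing value is copied from the single nearest row that has it.
imputer = KNNImputer(n_neighbors=1)
imputed = pd.DataFrame(imputer.fit_transform(combined), columns=combined.columns)
print(imputed)
```

Fitting on the combined data only makes sense under the stated assumption that the test features are available before training; otherwise the imputer would normally be fitted on the training set alone.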
Data from: Data and scripts from: “Denoising autoencoder for reconstructing...
osti.gov
Updated Jan 1, 2025
+ more versions
Cite
Bi, Xiangyu; Chou, Chunwei; Johnsen, Timothy; Ramakrishnan, Lavanya; Skone, Jonathan; Varadharajan, Charuleka; Wu, Yuxin (2025). Data and scripts from: “Denoising autoencoder for reconstructing sensor observation data and predicting evapotranspiration: noisy and missing values repair and uncertainty quantification” [Dataset]. https://www.osti.gov/dataexplorer/biblio/dataset/2561511
Explore at:
Dataset updated
Jan 1, 2025
Dataset provided by
United States Department of Energy (http://energy.gov/)
Watershed Function SFA
Authors
Bi, Xiangyu; Chou, Chunwei; Johnsen, Timothy; Ramakrishnan, Lavanya; Skone, Jonathan; Varadharajan, Charuleka; Wu, Yuxin
Description
This data package includes data and scripts from the manuscript “Denoising autoencoder for reconstructing sensor observation data and predicting evapotranspiration: noisy and missing values repair and uncertainty quantification”. The study addressed common challenges faced in environmental sensing and modeling, including uncertain input data, missing sensor observations, and high-dimensional datasets with interrelated but redundant variables. Point-scaled meteorological and soil sensor observations were perturbed with noise and missing values, and denoising autoencoder (DAE) neural networks were developed to reconstruct the perturbed data and further predict evapotranspiration. This study concluded that (1) the reconstruction quality of each variable depends on its cross-correlation and alignment to the underlying data structure, (2) uncertainties from the models were overall stronger than those from the data corruption, and (3) there was a tradeoff between reducing bias and reducing variance when evaluating the uncertainty of the machine learning models.

This package includes:

(1) Four ipython scripts (.ipynb): “DAE_train.ipynb” trains and evaluates DAE neural networks, “DAE_predict.ipynb” makes predictions from the trained DAE models, “ET_train.ipynb” trains and evaluates ET prediction neural networks, and “ET_predict.ipynb” makes predictions from trained ET models.

(2) One python file (.py): “methods.py” includes all user-defined functions and python codes used in the ipython scripts.

(3) A “sub_models” folder that includes five trained DAE neural networks (in pytorch format, .pt), which could be used to ingest input data before being fed to the downstream ET models in “ET_train.ipynb” or “ET_predict.ipynb”.

(4) Two data files (.csv). Daily meteorological, vegetation, and soil data is in “df_data.csv”, while “df_meta.csv” contains the location and time information of “df_data.csv”. Each row (index) in “df_meta.csv” corresponds to each row in “df_data.csv”. These data files are formatted to follow the data structure requirements and be directly used in the ipython scripts, and they have been shuffled chronologically to train machine learning models. The meteorological and soil data was collected using point sensors between 2019-2023 at

(4.a) Three shrub-dominated field sites in East River, Colorado (named “ph1”, “ph2” and “sg5” in “df_meta.csv”, where “ph1” and “ph2” were located at PumpHouse Hillslopes, and “sg5” was at Snodgrass Mountain meadow) and

(4.b) One outdoor, mesoscale, and herbaceous-dominated experiment in Berkeley, California (named “tb” in “df_meta.csv”, short for Smartsoils Testbed at Lawrence Berkeley National Lab).

- See "df_data_dd.csv" and "df_meta_dd.csv" for variable descriptions and the Methods section for additional data processing steps. See "flmd.csv" and "README.txt" for brief file descriptions.
- All ipython scripts and python files are written in and require Python.
Science Education Research Topic Modeling Dataset
zenodo.org
data.niaid.nih.gov
bin, html +2
Updated Oct 9, 2024
Cite
Tor Ole B. Odden; Alessandro Marin; John L. Rudolph (2024). Science Education Research Topic Modeling Dataset [Dataset]. http://doi.org/10.5281/zenodo.4094974
Explore at:
Available download formats: bin, txt, html, text/x-python
Unique identifier
https://doi.org/10.5281/zenodo.4094974
Dataset updated
Oct 9, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Tor Ole B. Odden; Alessandro Marin; John L. Rudolph
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This dataset contains scraped and processed text from roughly 100 years of articles published in the Wiley journal Science Education (formerly General Science Quarterly). This text has been cleaned and filtered in preparation for analysis using natural language processing techniques, particularly topic modeling with latent Dirichlet allocation (LDA). We also include a Jupyter Notebook illustrating how one can use LDA to analyze this dataset and extract latent topics from it, as well as analyze the rise and fall of those topics over the history of the journal.

The articles were downloaded and scraped in December of 2019. Only non-duplicate articles with a listed author (according to the CrossRef metadata database) were included, and due to missing data and text recognition issues we excluded all articles published prior to 1922. This resulted in 5577 articles in total being included in the dataset. The text of these articles was then cleaned in the following way:

We removed duplicated text from each article: prior to 1969, articles in the journal were published in a magazine format in which the end of one article and the beginning of the next would share the same page, so we developed an automated detection of article beginnings and endings that was able to remove any duplicate text.

We removed the reference sections of the articles, as well as headings (in all caps) such as “ABSTRACT”.

We reunited any partial words that were separated due to line breaks, text recognition issues, or British vs. American spellings (for example converting “per cent” to “percent”)

We removed all numbers, symbols, special characters, and punctuation, and lowercased all words.

We removed all stop words, which are words without any semantic meaning on their own—“the”, “in,” “if”, “and”, “but”, etc.—and all single-letter words.

We lemmatized all words, with the added step of including a part-of-speech tagger so our algorithm would only aggregate and lemmatize words from the same part of speech (e.g., nouns vs. verbs).

We detected and created bi-grams, sets of words that frequently co-occur and carry additional meaning together. These words were combined with an underscore: for example, “problem_solving” and “high_school”.

After filtering, each document was then turned into a list of individual words (or tokens) which were then collected and saved (using the python pickle format) into the file scied_words_bigrams_V5.pkl.
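Several of the cleaning steps listed above (lowercasing, removing numbers and punctuation, dropping stop words and single-letter words, joining bi-grams with an underscore) can be sketched with the standard library alone. This is an illustration under simplifying assumptions, not the authors' pipeline: the stop-word list is a tiny sample, the bi-grams are assumed to be already detected, and lemmatization is omitted.

```python
import re

STOP_WORDS = {"the", "in", "if", "and", "but", "of", "a"}  # tiny illustrative list
BIGRAMS = {("problem", "solving"), ("high", "school")}     # assumed already detected


def clean_tokens(text: str) -> list[str]:
    # Lowercase, then replace every run of non-letters (ASCII only) with a space.
    words = re.sub(r"[^a-z]+", " ", text.lower()).split()
    # Drop stop words and single-letter words.
    words = [w for w in words if w not in STOP_WORDS and len(w) > 1]
    # Join known bi-grams with an underscore.
    tokens, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in BIGRAMS:
            tokens.append(f"{words[i]}_{words[i + 1]}")
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens


print(clean_tokens("Problem solving in the high school of 1950!"))
```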

In addition to this file, we have also included the following files:

SciEd_paper_names_weights.pkl: A file containing limited metadata (title, author, year published, and DOI) for each of the papers, in the same order as they appear within the main datafile. This file also includes the weights assigned by an LDA model used to analyze the data

Science Education LDA Notebook.ipynb: A notebook file that replicates our LDA analysis, with a written explanation of all of the steps and suggestions on how to explore the results.

Supporting files for the notebook. These include the requirements, the README, a helper script with functions for plotting that were too long to include in the notebook, and two HTML graphs that are embedded into the notebook.

This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to share text and data mining output for non-commercial research purposes. Any questions or comments can be directed to Tor Ole Odden, t.o.odden@fys.uio.no.
NoCORA - Northern Cameroon Observed Rainfall Archive
data.niaid.nih.gov
zenodo.org
Updated Jul 10, 2024
Cite
Lavarenne, Jérémy; Nenwala, Victor Hugo; Foulna Tcheobe, Carmel (2024). NoCORA - Northern Cameroon Observed Rainfall Archive [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10156437
Explore at:
Dataset updated
Jul 10, 2024
Dataset provided by
Center for International Forestry Research
Centre de Coopération Internationale en Recherche Agronomique pour le Développement
Authors
Lavarenne, Jérémy; Nenwala, Victor Hugo; Foulna Tcheobe, Carmel
Area covered
North Region, Cameroon
Description
Description: The NoCORA dataset represents a significant effort to compile and clean a comprehensive set of daily rainfall data for Northern Cameroon (North and Extreme North regions). This dataset, covering more than 1 million observations across 418 rainfall stations over a temporal range from 1927 to 2022, is instrumental for researchers, meteorologists, and policymakers working in climate research, agricultural planning, and water resource management in the region. It integrates data from diverse sources, including Sodecoton rain funnels, the archive of Robert Morel (IRD), Centrale de Lagdo, the GHCN daily service, and the TAHMO network. The construction of NoCORA involved meticulous processes, including manual assembly of data, extensive data cleaning, and standardization of station names and coordinates, with the aim of making it a robust and reliable resource for understanding climatic dynamics in Northern Cameroon.

Data Sources: The dataset comprises eight primary rainfall data sources and a comprehensive coordinates dataset. The rainfall data sources include extensive historical and contemporary measurements, while the coordinates dataset was developed using reference data and an inference strategy for variant station names or missing coordinates.

Dataset Preparation Methods: The preparation involved manual compilation, integration of machine-readable files, data cleaning with OpenRefine, and finalization using Python/Jupyter Notebook. This process should ensure the accuracy and consistency of the dataset.

Discussion: NoCORA, with its extensive data compilation, presents an invaluable resource for climate-related studies in Northern Cameroon. However, users must navigate its complexities, including missing data interpretations, potential biases, and data inconsistencies. The dataset's comprehensive nature and historical span require careful handling and validation in research applications.

Access to Dataset: The NoCORA dataset, while a comprehensive resource for climatological and meteorological research in Northern Cameroon, is subject to specific access conditions due to its compilation from various partner sources. The original data sources vary in their openness and accessibility, and not all partners have confirmed the open-access status of their data. As such, to ensure compliance with these varying conditions, access to the NoCORA dataset is granted on a request basis. Interested researchers and users are encouraged to contact us for permission to access the dataset. This process allows us to uphold the data sharing agreements with our partners while facilitating research and analysis within the scientific community.

Authors Contributions: Data treatment: Victor Hugo Nenwala, Carmel Foulna Tcheobe, Jérémy Lavarenne. Documentation: Jérémy Lavarenne.

Funding: This project was funded by the DESIRA INNOVACC project.

Changelog:
v1.0.2: corrected swapped column names in the coordinates dataset
v1.0.1: dataset specification file has been updated with complementary information regarding station locations
v1.0.0: initial submission
Ecommerce Dataset for Data Analysis
kaggle.com
zip
Updated Sep 19, 2024
Cite
Shrishti Manja (2024). Ecommerce Dataset for Data Analysis [Dataset]. https://www.kaggle.com/datasets/shrishtimanja/ecommerce-dataset-for-data-analysis/code
Explore at:
Available download formats: zip (2028853 bytes)
Dataset updated
Sep 19, 2024
Authors
Shrishti Manja
Description
This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.

About the Dataset:
- CID (Customer ID): A unique identifier for each customer.
- TID (Transaction ID): A unique identifier for each transaction.
- Gender: The gender of the customer, categorized as Male or Female.
- Age Group: Age group of the customer, divided into several ranges.
- Purchase Date: The timestamp of when the transaction took place.
- Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
- Discount Availed: Indicates whether the customer availed any discount (Yes/No).
- Discount Name: Name of the discount applied (e.g., FESTIVE50).
- Discount Amount (INR): The amount of discount availed by the customer.
- Gross Amount: The total amount before applying any discount.
- Net Amount: The final amount after applying the discount.
- Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
- Location: The city where the purchase took place.

Use Cases:
1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data.
2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis.
3. Data Visualization: Use tools like Python’s Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.

This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.

This is not a real dataset. This dataset was generated using Python's Faker library for the sole purpose of learning

The Device Activity Report with Complete Knowledge (DARCK) for NILM

zenodo.org

bin, xz

Updated Sep 19, 2025

Cite

Anonymous Anonymous (2025). The Device Activity Report with Complete Knowledge (DARCK) for NILM [Dataset]. http://doi.org/10.5281/zenodo.17159850

Explore at:

Available download formats: bin, xz

Unique identifier

https://doi.org/10.5281/zenodo.17159850

Dataset updated

Sep 19, 2025

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Anonymous Anonymous

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

1. Abstract

This dataset contains aggregated and sub-metered power consumption data from a two-person apartment in Germany. Data was collected from March 5 to September 4, 2025, spanning 6 months. It includes an aggregate reading from a main smart meter and individual readings from 40 smart plugs, smart relays, and smart power meters monitoring various appliances.

2. Dataset Overview

Apartment: Two-person apartment, approx. 58m², located in Aachen, Germany.
Aggregate Meter: eBZ DD3
Sub-meters: 31 Shelly Plus Plug S, 6 Shelly Plus 1PM, 3 Shelly Plus PM Mini Gen3
Sampling Rate: 1 Hz
Measured Quantity: Active Power
Unit of Measurement: Watt
Duration: 6 months
Format: Single CSV file (`DARCK.csv`)
Structure: Timestamped rows with columns for the aggregate meter and each sub-metered appliance.
Completeness: The main power meter has a completeness of 99.3%. Missing values were linearly interpolated.

3. Download and Usage

The dataset can be downloaded here: https://doi.org/10.5281/zenodo.17159850

As it contains longer off periods with zeros, the CSV file is nicely compressible.

To extract it use: xz -d DARCK.csv.xz.
The compression leads to a 97% smaller file size (From 4GB to 90.9MB).

To use the dataset in Python, you can, for example, load the CSV file into a pandas DataFrame.

```python
import pandas as pd

# Parse the `time` column as datetimes while loading.
df = pd.read_csv("DARCK.csv", parse_dates=["time"])
```

4. Measurement Setup

The main meter was monitored using an infrared reading head magnetically attached to the infrared interface of the meter. An ESP8266 flashed with Tasmota decodes the binary datagrams and forwards the Watt readings to the MQTT broker. Individual appliances were monitored using a combination of Shelly Plugs (for outlets), Shelly 1PM (for wired-in devices like ceiling lights), and Shelly PM Mini (for each of the three phases of the oven). All devices reported to a central InfluxDB database via Home Assistant running in docker on a Dell OptiPlex 3020M.

5. File Format (`DARCK.csv`)

The dataset is provided as a single comma-separated value (CSV) file.

The first row is a header containing the column names.
All power values are rounded to the first decimal place.
There are no missing values in the final dataset.
Each row represents 1 second, from start of measuring in March until the end in September.

Column Descriptions

Column Name	Data Type	Unit	Description
`time`	datetime	-	Timestamp for the reading in `YYYY-MM-DD HH:MM:SS`
`main`	float	Watt	Total aggregate power consumption for the apartment, measured at the main electrical panel.
`[appliance_name]`	float	Watt	Power consumption of an individual appliance (e.g., `lightbathroom`, `fridge`, `sherlockpc`). See Section 8 for a full list.
Aggregate Columns
`aggr_chargers`	float	Watt	The sum of `sherlockcharger`, `sherlocklaptop`, `watsoncharger`, `watsonlaptop`, `watsonipadcharger`, `kitchencharger`.
`aggr_stoveplates`	float	Watt	The sum of `stoveplatel1` and `stoveplatel2`.
`aggr_lights`	float	Watt	The sum of `lightbathroom`, `lighthallway`, `lightsherlock`, `lightkitchen`, `lightlivingroom`, `lightwatson`, `lightstoreroom`, `fcob`, `sherlockalarmclocklight`, `sherlockfloorlamphue`, `sherlockledstrip`, `livingfloorlamphue`, `sherlockglobe`, `watsonfloorlamp`, `watsondesklamp` and `watsonledmap`.
Analysis Columns
`inaccuracy`	float	Watt	As no electrical device bypasses a power meter, the true inaccuracy can be assessed. It is the absolute error between the sum of individual measurements and the mains reading. A 30W offset is applied to the sum since the measurement devices themselves draw power which is otherwise unaccounted for.
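
The aggregate and analysis columns described in the table can be illustrated with a minimal sketch. The values below are made up, only a handful of appliance columns are used, and the sign convention (adding the 30 W offset to the sub-meter sum before comparing with `main`) is an assumption based on the description rather than the authors' actual code.

```python
import pandas as pd

# Tiny synthetic frame standing in for DARCK.csv (values are invented).
df = pd.DataFrame(
    {
        "main": [150.0, 400.0],
        "stoveplatel1": [0.0, 200.0],
        "stoveplatel2": [0.0, 100.0],
        "fridge": [110.0, 60.0],
    }
)

# Aggregate column: row-wise sum of its member columns.
df["aggr_stoveplates"] = df[["stoveplatel1", "stoveplatel2"]].sum(axis=1)

# Assumption: the 30 W offset (self-consumption of the metering devices)
# is added to the sub-meter sum before taking the absolute error.
METER_OFFSET_W = 30.0
appliance_cols = ["stoveplatel1", "stoveplatel2", "fridge"]
submeter_sum = df[appliance_cols].sum(axis=1) + METER_OFFSET_W
df["inaccuracy"] = (submeter_sum - df["main"]).abs()
```

In the real file, `appliance_cols` would have to list every individual appliance column exactly once, excluding the `aggr_*` columns so nothing is counted twice.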

6. Data Postprocessing Pipeline

The final dataset was generated from two raw data sources (meter.csv and shellies.csv) using a comprehensive postprocessing pipeline.

6.1. Main Meter (`main`) Postprocessing

The aggregate power data required several cleaning steps to ensure accuracy.

Outlier Removal: Readings below 10W or above 10,000W were removed (merely 3 occurrences).
Timestamp Burst Correction: The source data contained bursts of delayed readings. A custom algorithm was used to identify these bursts (large time gap followed by rapid readings) and back-fill the timestamps to create an evenly spaced time series.
Alignment & Interpolation: The smart meter pushes a new value via infrared every second. To align these readings to whole seconds, the series was resampled to a 1-second frequency by taking the mean of all readings within each second (99.5% of seconds contained only 1 value). Any resulting gaps (0.7% outage ratio) were filled using linear interpolation.
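
The outlier removal and the alignment/interpolation steps can be sketched with pandas as follows. This is an illustration on invented readings, not the authors' pipeline, and it leaves out the custom timestamp burst correction, which is not specified in enough detail to reproduce.

```python
import pandas as pd

# Synthetic raw readings with irregular sub-second timestamps (invented values).
raw = pd.Series(
    [200.0, 204.0, 15000.0, 210.0, 230.0],
    index=pd.to_datetime(
        [
            "2025-03-05 00:00:00.2",
            "2025-03-05 00:00:00.9",
            "2025-03-05 00:00:01.5",
            "2025-03-05 00:00:02.4",
            "2025-03-05 00:00:04.1",
        ]
    ),
)

# 1. Outlier removal: keep only readings between 10 W and 10,000 W.
clean = raw[(raw >= 10) & (raw <= 10_000)]

# 2. Alignment: mean of all readings that fall within each whole second.
#    Seconds without any reading become NaN.
per_second = clean.resample("1s").mean()

# 3. Gap filling: linear interpolation between neighbouring seconds.
main = per_second.interpolate(method="linear")
```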

6.2. Sub-metered Devices (`shellies`) Postprocessing

The Shelly devices are not prone to the same burst issue as the ESP8266 is. They push a new reading at every change in power drawn. If no power change is observed or the one observed is too small (less than a few Watt), the reading is pushed once a minute, together with a heartbeat. When a device turns on or off, intermediate power values are published, which leads to sub-second values that need to be handled.

Grouping: Data was grouped by the unique device identifier.
Resampling & Filling: The data for each device was resampled to a 1-second frequency using `.resample('1s').last().ffill()`.
This method was chosen for two reasons. First, it captures the last known state of the device within each second, which handles rapid on/off events. Second, it forward-fills the last state across periods with no new data, modeling the device's consumption as constant until a new reading was sent.
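
A minimal sketch of the grouping and resampling steps on invented event data is shown below. The column names `device` and `power` are hypothetical, since the layout of `shellies.csv` is not documented here.

```python
import pandas as pd

# Event-style readings: one row per pushed value (invented data).
events = pd.DataFrame(
    {
        "device": ["fridge", "fridge", "fridge", "kettle", "kettle"],
        "power": [60.0, 95.0, 110.0, 0.0, 1800.0],
    },
    index=pd.to_datetime(
        [
            "2025-03-05 00:00:00.1",
            "2025-03-05 00:00:00.7",
            "2025-03-05 00:00:03.0",
            "2025-03-05 00:00:00.0",
            "2025-03-05 00:00:02.5",
        ]
    ),
)

# Group by device, keep the last reading in each second, then carry the
# last known state forward over seconds without a new reading.
per_device = {
    name: group["power"].resample("1s").last().ffill()
    for name, group in events.groupby("device")
}
```

Here the fridge's two sub-second readings in the first second collapse to the later one, and that value is held until the next reading arrives.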

6.3. Merging and Finalization

Merge: The cleaned main meter and all sub-metered device dataframes were merged into a single dataframe on the time index.
Final Fill: Any remaining NaN values (e.g., from before a device was installed) were filled with 0.0, assuming zero consumption.
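
The merge and final fill can be sketched as an outer join on the time index followed by `fillna(0.0)`. The series below are invented, and the use of `pd.concat` is one possible way to perform the merge, not necessarily the one the authors used.

```python
import pandas as pd

idx = pd.date_range("2025-03-05 00:00:00", periods=4, freq="1s")
main = pd.Series([202.0, 206.0, 210.0, 2020.0], index=idx, name="main")
fridge = pd.Series([95.0, 95.0, 95.0, 110.0], index=idx, name="fridge")
# Hypothetical device that only starts reporting in the third second.
kettle = pd.Series([0.0, 1800.0], index=idx[2:], name="kettle")

# Outer join on the shared time index: missing seconds become NaN.
merged = pd.concat([main, fridge, kettle], axis=1)
merged.index.name = "time"

# Final fill: treat any remaining gaps as zero consumption.
merged = merged.fillna(0.0)
```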

7. Manual Corrections and Known Data Issues

During analysis, two significant unmetered load events were identified and manually corrected to improve the accuracy of the aggregate reading. The error column (inaccuracy) was recalculated after these corrections.

March 10th - Unmetered Bulb: An unmetered 107W bulb was active. It was subtracted from the main reading as if it never happened.
May 31st - Unmetered Air Pump: An unmetered 101W pump for an air mattress was used directly in an outlet with no intermediary plug and hence manually added to the respective plug.

8. Appliance Details and Multipurpose Plugs

The following table lists the column names with an explanation where needed. As Watson moved at the beginning of June, some metering plugs changed their appliance.

Geographical distribution and climate data of Cycas taiwaniana
scidb.cn
Updated Jan 3, 2025
Cite
CHUNPING XIE (2025). Geographical distribution and climate data of Cycas taiwaniana [Dataset]. http://doi.org/10.57760/sciencedb.19432
Unique identifier
https://doi.org/10.57760/sciencedb.19432
Dataset updated
Jan 3, 2025
Dataset provided by
Science Data Bank
Authors
CHUNPING XIE
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Dataset Description: Geographical Distribution and Climate Data of Cycas taiwaniana (Taiwanese Cycad)

This dataset contains the geographical distribution and climate data for Cycas taiwaniana, focusing on its presence across regions in Fujian, Guangdong, and Hainan provinces of China. The dataset includes geographical coordinates (longitude and latitude), monthly climate data (minimum and maximum temperature, and precipitation) across different months, as well as bioclimatic variables based on the WorldClim dataset.

**Temporal and Spatial Information**
The data covers long-term climate information, with monthly data for each location recorded over a 12-month period (January to December). The dataset includes spatial data in terms of longitude and latitude, corresponding to various locations where Cycas taiwaniana populations are present. The spatial resolution is specific to each point location, and the temporal resolution reflects the monthly climate data for each year.

**Data Structure and Units**
The dataset consists of 36 records, each representing a unique location with corresponding climate and geographical data. The table includes the following columns:
1. No.: Unique identifier for each data record
2. Longitude: Geographic longitude in decimal degrees
3. Latitude: Geographic latitude in decimal degrees
4. tmin1 to tmin12: Minimum temperature (°C) for each month (January to December)
5. tmax1 to tmax12: Maximum temperature (°C) for each month (January to December)
6. prec1 to prec12: Precipitation (mm) for each month (January to December)
7. bio1 to bio19: Bioclimatic variables (e.g., annual mean temperature, temperature seasonality, precipitation, etc.) derived from WorldClim data (unit varies depending on the variable)

The units for each measurement are as follows:
- Temperature: Degrees Celsius (°C)
- Precipitation: Millimeters (mm)
- Bioclimatic variables: Varies depending on the specific variable (e.g., °C, mm)

**Data Gaps and Missing Values**
The dataset contains some missing values, particularly in the "precipitation" columns for certain months and locations. These missing values may result from gaps in climate station data or limitations in data collection for specific regions. Missing values are indicated as "NA" (Not Available) in the dataset. In cases where data gaps exist, estimations were not made, and the absence of the data is acknowledged in the record.

**File Format and Software Compatibility**
The dataset is provided in CSV format for ease of use and compatibility with various data analysis tools. It can be opened and processed using software such as Microsoft Excel, R, or Python (with Pandas). Users can download the dataset and work with it in software such as R (https://cran.r-project.org/) or Python (https://www.python.org/). The dataset is compatible with any software that supports CSV files.

This dataset provides valuable information for research related to the geographical distribution and climate preferences of Cycas taiwaniana and can be used to inform conservation strategies, ecological studies, and climate change modeling.
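
Because missing values are marked as "NA", the gaps in the precipitation columns can be inspected directly after loading. The sketch below uses a small inline CSV with invented coordinates and values in place of the real file. It relies on pandas treating the string `NA` as missing by default.

```python
import io

import pandas as pd

# Inline stand-in for the real CSV (invented rows, only two prec columns).
csv_text = """No.,Longitude,Latitude,prec1,prec2
1,117.5,24.1,35.0,NA
2,113.2,23.4,NA,60.5
3,109.8,19.2,22.0,41.0
"""

df = pd.read_csv(io.StringIO(csv_text))

# Count missing values per precipitation column.
prec_cols = [c for c in df.columns if c.startswith("prec")]
missing_per_column = df[prec_cols].isna().sum()

# Keep only the locations with complete precipitation records.
complete_rows = df.dropna(subset=prec_cols)
```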
Medical Clean Dataset
kaggle.com
zip
Updated Jul 6, 2025
Cite
Aamir Shahzad (2025). Medical Clean Dataset [Dataset]. https://www.kaggle.com/datasets/aamir5659/medical-clean-dataset
Explore at:
zip (1262 bytes) (available download formats)
Dataset updated
Jul 6, 2025
Authors
Aamir Shahzad
License
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Description
This is the cleaned version of a real-world medical dataset that was originally noisy, incomplete, and contained various inconsistencies. The dataset was cleaned through a structured and well-documented data preprocessing pipeline using Python and Pandas. Key steps in the cleaning process included:

Handling missing values using statistical techniques such as median imputation and mode replacement

Converting categorical values to consistent formats (e.g., gender formatting, yes/no standardization)

Removing duplicate entries to ensure data accuracy

Parsing and standardizing date fields

Creating new derived features such as age groups

Detecting and reviewing outliers based on IQR

Removing irrelevant or redundant columns
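
Three of the listed steps (median imputation, mode replacement, and IQR-based outlier detection) can be sketched on a toy table as follows. The column names and values are hypothetical, and the 1.5 × IQR fence is the conventional choice rather than one confirmed by the dataset description.

```python
import pandas as pd

# Toy records with gaps and one implausible age (invented data).
df = pd.DataFrame(
    {
        "age": [34.0, None, 51.0, 29.0, 120.0],
        "gender": ["F", "M", None, "F", "F"],
    }
)

# Median imputation for the numeric column.
df["age"] = df["age"].fillna(df["age"].median())

# Mode replacement for the categorical column.
df["gender"] = df["gender"].fillna(df["gender"].mode().iloc[0])

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for review.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
```

Flagged rows are only collected here, not dropped, which matches the "detecting and reviewing" wording above.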

The purpose of cleaning this dataset was to prepare it for further exploratory data analysis (EDA), data visualization, and machine learning modeling.

This cleaned dataset is now ready for training predictive models, generating visual insights, or conducting healthcare-related research. It provides a high-quality foundation for anyone interested in medical analytics or data science practice.
Student Performance and Learning Behavior Dataset for Educational Analytics
zenodo.org
bin, csv
Updated Aug 13, 2025
Cite
Kamal NAJEM (2025). Student Performance and Learning Behavior Dataset for Educational Analytics [Dataset]. http://doi.org/10.5281/zenodo.16459132
Explore at:
bin, csv (available download formats)
Unique identifier
https://doi.org/10.5281/zenodo.16459132
Dataset updated
Aug 13, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Kamal NAJEM
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jul 26, 2025
Description
The dataset used in this study integrates quantitative data on student learning behaviors, engagement patterns, demographics, and academic performance. It was compiled by merging two publicly available Kaggle datasets, resulting in a combined file (“merged_dataset.csv”) containing 14,003 student records with 16 attributes. All records are anonymized and contain no personally identifiable information.

The dataset covers the following categories of variables:

Study behaviors and engagement: StudyHours, Attendance, Extracurricular, AssignmentCompletion, OnlineCourses, Discussions

Resource access and learning environment: Resources, Internet, EduTech

Motivation and psychological factors: Motivation, StressLevel

Demographic information: Gender, Age (ranging from 18 to 30 years)

Learning preference classification: LearningStyle

Academic performance indicators: ExamScore, FinalGrade

In this study, “ExamScore” and “FinalGrade” served as the primary performance indicators. The remaining variables were used to derive behavioral and contextual profiles, which were clustered using unsupervised machine learning techniques.

The analysis and modeling were implemented in Python through a structured Jupyter Notebook (“Project.ipynb”), which included the following main steps:

Environment Setup – Import of essential libraries (NumPy, pandas, Matplotlib, Seaborn, SciPy, StatsModels, scikit-learn, imbalanced-learn) and visualization configuration.

Data Import and Integration – Loading the two source CSV files, harmonizing columns, removing irrelevant attributes, aligning formats, handling missing values, and merging them into a unified dataset (merged_dataset.csv).

Data Preprocessing –

Encoding categorical variables using LabelEncoder.

Scaling features using both z-score standardization (for statistical tests and PCA) and Min–Max normalization (for clustering).

Detecting and removing duplicates.

Clustering Analysis –

Applying K-Means clustering to segment learners into distinct profiles.

Determining the optimal number of clusters using the Elbow Method and Silhouette Score.

Evaluating cluster quality with internal metrics (Silhouette Score, Davies–Bouldin Index).

Dimensionality Reduction & Visualization – Using PCA for 2D/3D cluster visualization and feature importance exploration.

Mapping Clusters to Learning Styles – Associating each identified cluster with the most relevant learning style model based on feature patterns and alignment scores.

Statistical Analysis – Conducting ANOVA and regression to test for significant differences in performance between clusters.

Interpretation & Practical Recommendations – Analyzing cluster-specific characteristics and providing implications for adaptive and mobile learning integration.
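
The clustering step (Min–Max scaling, K-Means, and choosing the number of clusters by Silhouette Score) can be sketched as below. The two features and all values are synthetic stand-ins for columns such as StudyHours and Attendance. The candidate range of k and the K-Means settings are assumptions, not those of the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Two synthetic, well-separated groups of learners (invented data).
low = rng.normal(loc=[5.0, 60.0], scale=[1.0, 3.0], size=(50, 2))
high = rng.normal(loc=[20.0, 90.0], scale=[1.0, 3.0], size=(50, 2))

# Min-Max normalization before clustering, as described above.
X = MinMaxScaler().fit_transform(np.vstack([low, high]))

# Fit K-Means for several k and record the silhouette score of each.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the k with the highest silhouette score.
best_k = max(scores, key=scores.get)
```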
Data from: CADDI: An in-Class Activity Detection Dataset using IMU data from...
observatorio-cientifico.ua.es
scidb.cn
Updated 2025
Cite
Marquez-Carpintero, Luis; Suescun-Ferrandiz, Sergio; Pina-Navarro, Monica; Gomez-Donoso, Francisco; Cazorla, Miguel (2025). CADDI: An in-Class Activity Detection Dataset using IMU data from low-cost sensors [Dataset]. https://observatorio-cientifico.ua.es/documentos/668fc49bb9e7c03b01be251c
Dataset updated
2025
Authors
Marquez-Carpintero, Luis; Suescun-Ferrandiz, Sergio; Pina-Navarro, Monica; Gomez-Donoso, Francisco; Cazorla, Miguel
Description
Data Description
The CADDI dataset is designed to support research in in-class activity recognition using IMU data from low-cost sensors. It provides multimodal data capturing 19 different activities performed by 12 participants in a classroom environment, utilizing both IMU sensors from a Samsung Galaxy Watch 5 and synchronized stereo camera images. This dataset enables the development and validation of activity recognition models using sensor fusion techniques.

Data Generation Procedures
The data collection process involved recording both continuous and instantaneous activities that typically occur in a classroom setting. The activities were captured using a custom setup, which included:
A Samsung Galaxy Watch 5 to collect accelerometer, gyroscope, and rotation vector data at 100Hz.
A ZED stereo camera capturing 1080p images at 25-30 fps.
A synchronized computer acting as a data hub, receiving IMU data and storing images in real-time.
A D-Link DSR-1000AC router for wireless communication between the smartwatch and the computer.
Participants were instructed to arrange their workspace as they would in a real classroom, including a laptop, notebook, pens, and a backpack. Data collection was performed under realistic conditions, ensuring that activities were captured naturally.

Temporal and Spatial Scope
The dataset contains a total of 472.03 minutes of recorded data.
The IMU sensors operate at 100Hz, while the stereo camera captures images at 25-30Hz.
Data was collected from 12 participants, each performing all 19 activities multiple times.
The geographical scope of data collection was Alicante, Spain, under controlled indoor conditions.

Dataset Components
The dataset is organized into JSON and PNG files, structured hierarchically:
IMU Data: Stored in JSON files, containing:
Samsung Linear Acceleration Sensor (X, Y, Z values, 100Hz)
LSM6DSO Gyroscope (X, Y, Z values, 100Hz)
Samsung Rotation Vector (X, Y, Z, W quaternion values, 100Hz)
Samsung HR Sensor (heart rate, 1Hz)
OPT3007 Light Sensor (ambient light levels, 5Hz)
Stereo Camera Images: High-resolution 1920×1080 PNG files from left and right cameras.
Synchronization: Each IMU data record and image is timestamped for precise alignment.

Data Structure
The dataset is divided into continuous and instantaneous activities:
Continuous Activities (e.g., typing, writing, drawing) were recorded for 210 seconds, with the central 200 seconds retained.
Instantaneous Activities (e.g., raising a hand, drinking) were repeated 20 times per participant, with data captured only during execution.
The dataset is structured as:
/continuous/subject_id/activity_name/
    /camera_a/ → Left camera images
    /camera_b/ → Right camera images
    /sensors/ → JSON files with IMU data

/instantaneous/subject_id/activity_name/repetition_id/
    /camera_a/
    /camera_b/
    /sensors/

Data Quality & Missing Data
The smartwatch buffers 100 readings per second before sending them, ensuring minimal data loss.
Synchronization latency between the smartwatch and the computer is negligible.
Not all IMU samples have corresponding images due to different recording rates.
Outliers and anomalies were handled by discarding incomplete sequences at the start and end of continuous activities.

Error Ranges & Limitations
Sensor data may contain noise due to minor hand movements.
The heart rate sensor operates at 1Hz, limiting its temporal resolution.
Camera exposure settings were automatically adjusted, which may introduce slight variations in lighting.

File Formats & Software Compatibility
IMU data is stored in JSON format, readable with Python’s json library.
Images are in PNG format, compatible with all standard image processing tools.
Recommended libraries for data analysis:
Python: numpy, pandas, scikit-learn, tensorflow, pytorch
Visualization: matplotlib, seaborn
Deep Learning: Keras, PyTorch

Potential Applications
Development of activity recognition models in educational settings.
Study of student engagement based on movement patterns.
Investigation of sensor fusion techniques combining visual and IMU data.
This dataset represents a unique contribution to activity recognition research, providing rich multimodal data for developing robust models in real-world educational environments.

Citation
If you find this project helpful for your research, please cite our work using the following bibtex entry:
@misc{marquezcarpintero2025caddiinclassactivitydetection,
  title={CADDI: An in-Class Activity Detection Dataset using IMU data from low-cost sensors},
  author={Luis Marquez-Carpintero and Sergio Suescun-Ferrandiz and Monica Pina-Navarro and Miguel Cazorla and Francisco Gomez-Donoso},
  year={2025},
  eprint={2503.02853},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.02853},
}
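
Since the IMU stream (100Hz) is sampled faster than the camera (25-30Hz), not every IMU sample has a matching image. One way to pair the two by timestamp is a nearest-neighbour join with a tolerance, sketched below with `pandas.merge_asof`. The column names, file names, timestamps, and the 5 ms tolerance are all hypothetical, because the JSON schema is not described here.

```python
import pandas as pd

# Hypothetical IMU samples every 10 ms (100 Hz).
imu = pd.DataFrame(
    {
        "t": pd.to_datetime([0, 10, 20, 30, 40], unit="ms"),
        "acc_x": [0.01, 0.03, 0.02, -0.01, 0.00],
    }
)

# Hypothetical camera frames every 40 ms (25 fps).
frames = pd.DataFrame(
    {
        "t": pd.to_datetime([0, 40], unit="ms"),
        "frame": ["left_000.png", "left_001.png"],
    }
)

# Attach the nearest frame to each IMU sample, but only if it lies
# within 5 ms. Other samples keep NaN in the `frame` column.
aligned = pd.merge_asof(
    imu,
    frames,
    on="t",
    direction="nearest",
    tolerance=pd.Timedelta("5ms"),
)
```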
Data from: A comprehensive dataset for the accelerated development and...
data.niaid.nih.gov
zenodo.org
Updated Jan 24, 2020
Cite
Carreira Pedro, Hugo; Larson, David; Coimbra, Carlos (2020). A comprehensive dataset for the accelerated development and benchmarking of solar forecasting methods [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2826938
Dataset updated
Jan 24, 2020
Dataset provided by
University of California San Diego
Authors
Carreira Pedro, Hugo; Larson, David; Coimbra, Carlos
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Description This repository contains a comprehensive solar irradiance, imaging, and forecasting dataset. The goal with this release is to provide standardized solar and meteorological datasets to the research community for the accelerated development and benchmarking of forecasting methods. The data consist of three years (2014–2016) of quality-controlled, 1-min resolution global horizontal irradiance and direct normal irradiance ground measurements in California. In addition, we provide overlapping data from commonly used exogenous variables, including sky images, satellite imagery, Numerical Weather Prediction forecasts, and weather data. We also include sample codes of baseline models for benchmarking of more elaborated models.

Data usage The usage of the datasets and sample codes presented here is intended for research and development purposes only and implies explicit reference to the paper: Pedro, H.T.C., Larson, D.P., Coimbra, C.F.M., 2019. A comprehensive dataset for the accelerated development and benchmarking of solar forecasting methods. Journal of Renewable and Sustainable Energy 11, 036102. https://doi.org/10.1063/1.5094494

Although every effort was made to ensure the quality of the data, no guarantees or liabilities are implied by the authors or publishers of the data.

Sample code As part of the data release, we are also including the sample code written in Python 3. The preprocessed data used in the scripts are also provided. The code can be used to reproduce the results presented in this work and as a starting point for future studies. Besides the standard scientific Python packages (numpy, scipy, and matplotlib), the code depends on pandas for time-series operations, pvlib for common solar-related tasks, and scikit-learn for Machine Learning models. All required Python packages are readily available on Mac, Linux, and Windows and can be installed via, e.g., pip.

Units All time stamps are in UTC (YYYY-MM-DD HH:MM:SS). All irradiance and weather data are in SI units. Sky image features are derived from 8-bit RGB (256 color levels) data. Satellite images are derived from 8-bit gray-scale (256 color levels) data.

Missing data The string "NAN" indicates missing data
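
When loading these CSV files with pandas, the "NAN" marker can be declared explicitly so that affected columns are parsed as floats with proper missing values. The sketch below uses an inline CSV with invented rows, and the column names are hypothetical placeholders for those in the real files.

```python
import io

import pandas as pd

# Inline stand-in for one of the CSV files (invented rows and column names).
csv_text = """timestamp,ghi,dni
2014-01-01 16:00:00,NAN,0.0
2014-01-01 16:01:00,12.5,NAN
2014-01-01 16:02:00,14.0,3.2
"""

# Declare "NAN" as a missing-value marker instead of relying on the
# parser's default list, and index the frame by the UTC timestamps.
df = pd.read_csv(
    io.StringIO(csv_text),
    na_values=["NAN"],
    parse_dates=["timestamp"],
    index_col="timestamp",
)

missing_counts = df.isna().sum()
```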

File formats All time series data files are in CSV (comma separated values) format. Images are given in tar.bz2 files.

Files

Folsom_irradiance.csv Primary One-minute GHI, DNI, and DHI data.

Folsom_weather.csv Primary One-minute weather data.

Folsom_sky_images_{YEAR}.tar.bz2 Primary Tar archives with daytime sky images captured at 1-min intervals for the years 2014, 2015, and 2016, compressed with bz2.

Folsom_NAM_lat{LAT}_lon{LON}.csv Primary NAM forecasts for the four nodes nearest the target location. {LAT} and {LON} are replaced by the node’s coordinates listed in Table I in the paper.

Folsom_sky_image_features.csv Secondary Features derived from the sky images.

Folsom_satellite.csv Secondary 10 pixel by 10 pixel GOES-15 images centered in the target location.

Irradiance_features_{horizon}.csv Secondary Irradiance features for the different forecasting horizons ({horizon} = {intra-hour, intra-day, day-ahead}).

Sky_image_features_intra-hour.csv Secondary Sky image features for the intra-hour forecasting issuing times.

Sat_image_features_intra-day.csv Secondary Satellite image features for the intra-day forecasting issuing times.

NAM_nearest_node_day-ahead.csv Secondary NAM forecasts (GHI, DNI computed with the DISC algorithm, and total cloud cover) for the nearest node to the target location prepared for day-ahead forecasting.

Target_{horizon}.csv Secondary Target data for the different forecasting horizons.

Forecast_{horizon}.py Code Python script used to create the forecasts for the different horizons.

Postprocess.py Code Python script used to compute the error metric for all the forecasts.
Data from: Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 -...
catalog.data.gov
data.usgs.gov
Updated Nov 27, 2025
Cite
U.S. Geological Survey (2025). Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 - 7—Data [Dataset]. https://catalog.data.gov/dataset/variable-terrestrial-gps-telemetry-detection-rates-parts-1-7data
Dataset updated
Nov 27, 2025
Dataset provided by
U.S. Geological Survey
Description
Studies utilizing Global Positioning System (GPS) telemetry rarely result in 100% fix success rates (FSR). Many assessments of wildlife resource use do not account for missing data, either assuming data loss is random or because of a lack of practical treatments for systematic data loss. Several studies have explored how the environment, technological features, and animal behavior influence rates of missing data in GPS telemetry, but previous spatially explicit models developed to correct for sampling bias have been specific to small study areas, a small range of data loss, or a single species, limiting their general utility. Here we explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use. We also evaluate patterns in missing data that relate to potential animal activities that change the orientation of the antennae and characterize home-range probability of GPS detection for 4 focal species: cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni) and mule deer (Odocoileus hemionus).

Part 1, Positive Openness Raster (raster dataset): Openness is an angular measure of the relationship between surface relief and horizontal distance. For angles less than 90 degrees it is equivalent to the internal angle of a cone with its apex at a DEM location, and is constrained by neighboring elevations within a specified radial distance. A 480 meter search radius was used for this calculation of positive openness. Openness incorporates the terrain line-of-sight or viewshed concept and is calculated from multiple zenith and nadir angles, here along eight azimuths. Positive openness measures openness above the surface, with high values for convex forms and low values for concave forms (Yokoyama et al. 2002). We calculated positive openness using a custom Python script, following the methods of Yokoyama et al. (2002), using a USGS National Elevation Dataset as input.

Part 2, Northern Arizona GPS Test Collar (csv): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall overstory vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent datasets from stationary test collars of different make/model, fix interval programming, and placed at different study sites. The absence of statistically significant differences (95% CI) between predicted and observed FSRs suggests that changes in technological factors have minor influence on the model's ability to predict FSR in new study areas in the southwestern US. The model training data are provided here for fix attempts by hour. This table can be linked with the site location shapefile using the site field.

Part 3, Probability Raster (raster dataset): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall overstory vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent datasets from stationary test collars of different make/model, fix interval programming, and placed at different study sites. The absence of statistically significant differences (95% CI) between predicted and observed FSRs suggests that changes in technological factors have minor influence on the model's ability to predict FSR in new study areas in the southwestern US. We evaluated GPS telemetry datasets by comparing the mean probability of a successful GPS fix across study animals' home-ranges to the observed FSR of GPS-downloaded collars deployed on cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni) and mule deer (Odocoileus hemionus). Comparing the mean probability of acquisition within study animals' home-ranges and observed FSRs of GPS-downloaded collars resulted in an approximately 1:1 linear relationship with r² = 0.68.

Part 4, GPS Test Collar Sites (shapefile): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall overstory vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent datasets from stationary test collars of different make/model, fix interval programming, and placed at different study sites. The absence of statistically significant differences (95% CI) between predicted and observed FSRs suggests that changes in technological factors have minor influence on the model's ability to predict FSR in new study areas in the southwestern US.

Part 5, Cougar Home Ranges (shapefile): Cougar home-ranges were calculated to compare the mean probability of a GPS fix acquisition across the home-range to the actual fix success rate (FSR) of the collar as a means for evaluating if characteristics of an animal’s home-range have an effect on observed FSR. We estimated home-ranges using the Local Convex Hull (LoCoH) method using the 90th isopleth. Only data obtained from GPS download of retrieved units were used. Satellite-delivered data was omitted from the analysis for animals where the collar was lost or damaged because satellite delivery tends to lose an additional 10% of data. Comparisons with home-range mean probability of fix were also used as a reference for assessing whether the frequency with which animals use areas of low GPS acquisition rates may play a role in observed FSRs.

Part 6, Cougar Fix Success Rate by Hour (csv): Cougar GPS collar fix success varied by hour-of-day, suggesting circadian rhythms with bouts of rest during daylight hours may change the orientation of the GPS receiver, affecting the ability to acquire fixes. Raw data of overall fix success rates (FSR) and FSR by hour were used to predict relative reductions in FSR. Data only includes direct GPS download datasets. Satellite-delivered data was omitted from the analysis for animals where the collar was lost or damaged because satellite delivery tends to lose approximately an additional 10% of data.

Part 7, Openness Python Script version 2.0: This Python script was used to calculate positive openness using a 30 meter digital elevation model for a large geographic area in Arizona, California, Nevada and Utah. A scientific research project used the script to explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use.
Micro-scale potentiodynamic polarisation (log(j)) curves of 316L stainless...
data.mendeley.com
Updated Oct 3, 2023
Cite
Leonardo Bertolucci Coelho (2023). Micro-scale potentiodynamic polarisation (log(j)) curves of 316L stainless steel [Dataset]. http://doi.org/10.17632/7j6b6y48jw.1
Explore at:
Unique identifier
https://doi.org/10.17632/7j6b6y48jw.1
Dataset updated
Oct 3, 2023
Authors
Leonardo Bertolucci Coelho
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This database comprises 5 Potentiodynamic Polarisation (PP) datasets. Each dataset consists of a pair of CSVs: one file containing the values of the applied potential E (vs. Ag/AgCl), and one containing the corresponding values of the log of the current density, log(j) (µA/cm²).

This database was used as the source dataset in the following scientific article, accepted for publication in npj Materials Degradation on 25 September 2023: "Estimating pitting descriptors of 316L stainless steel by machine learning and statistical analysis", by Leonardo Bertolucci Coelho (1,2,*), Daniel Torres (1), Vincent Vangrunderbeek (2), Miguel Bernal (1), Gian Marco Paldino (3), Gianluca Bontempi (3) and Jon Ustarroz (1,2).
(1) ChemSIN – Chemistry of Surfaces, Interfaces and Nanomaterials, Université libre de Bruxelles (ULB), Brussels, Belgium
(2) Research Group Electrochemical and Surface Engineering (SURF), Vrije Universiteit Brussel, Brussels, Belgium
(3) Machine Learning Group (MLG), Université libre de Bruxelles (ULB), Brussels, Belgium
(*) leonardo.bertolucci.coelho@ulb.be

These datasets are almost identical to the ones available at https://data.mendeley.com/datasets/78rz8vw46x/2. The only difference is that any missing j values were filled with an iterative imputer (Python 3.7). The IterativeImputer class (from sklearn.impute) models each feature with missing values as a function of the other features and uses that estimate for imputation.
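As a rough illustration of the imputation step, the sketch below applies scikit-learn's IterativeImputer to a small synthetic log(j) matrix. The data, the layout (rows as potential steps, columns as curves), and the parameter choices are assumptions for illustration only; the description does not state the settings used for the published files.

```python
import numpy as np
# IterativeImputer is still experimental in scikit-learn and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic stand-in: 50 potential steps (rows) for 3 correlated curves (columns).
potential = np.linspace(-0.2, 0.6, 50)
log_j = np.column_stack(
    [2.0 * potential + offset + rng.normal(0.0, 0.01, potential.size)
     for offset in (0.0, 0.1, 0.2)]
)

# Knock out a few values to mimic missing current-density readings.
log_j_missing = log_j.copy()
log_j_missing[5, 0] = np.nan
log_j_missing[20, 1] = np.nan
log_j_missing[33, 2] = np.nan

# Each column with gaps is regressed on the other columns, round-robin.
imputer = IterativeImputer(max_iter=10, random_state=0)
log_j_filled = imputer.fit_transform(log_j_missing)

observed = ~np.isnan(log_j_missing)
```

Only the missing entries are estimated; values that were observed pass through unchanged.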
Modeled groundwater levels across Central Valley, CA, from March 2015 to...
dataverse.harvard.edu
Updated Sep 29, 2024
Cite
Anshuman Pradhan; Kyra H. Adams; Venkat Chandrasekaran; Zhen Liu; John T. Reager; Andrew M. Stuart; Michael J. Turmon (2024). Modeled groundwater levels across Central Valley, CA, from March 2015 to August 2020, using GP-DNN regression [Dataset]. http://doi.org/10.7910/DVN/23TNJO
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/23TNJO
Dataset updated
Sep 29, 2024
Dataset provided by
Harvard Dataverse
Authors
Anshuman Pradhan; Kyra H. Adams; Venkat Chandrasekaran; Zhen Liu; John T. Reager; Andrew M. Stuart; Michael J. Turmon
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Central Valley, California
Description
This dataset contains groundwater level trends and time series data across a discretized grid of California's Central Valley, modeled from well data using a hierarchical Gaussian process and neural network regression methodology. The spatial grid consists of 400 cells spanning latitudes 34.91 to 40.895 degrees and 220 cells spanning longitudes -122.6 to -118.658 degrees. The temporal axis spans March 2015 to August 2020, discretized at biweekly intervals, for a total of 132 cells. The spatiotemporal grid details are included in the relevant files.

The first dataset is contained in the following Python pickle file.
1. 'CV_water_level_trends_Mar2015_Aug2020.pkl': This file contains a nested Python dictionary with the following pairs:
1.1. 'longitude': Numpy array of shape 400 x 220
1.2. 'longitude': Numpy array of shape 400 x 220
1.3. 'mean': Python dictionary with mean long-term and seasonal water level trends
1.4. 'P10': Python dictionary with P10 long-term and seasonal water level trends
1.5. 'P90': Python dictionary with P90 long-term and seasonal water level trends
Each of the dictionaries in 1.3, 1.4 and 1.5 contains the following keys and values:
'initial_water_level_ft': Mean/P10/P90 of March 2015 water levels in feet, stored as a Numpy array of shape 400 x 220
'water_level_decline_rate_ft/biweek': Mean/P10/P90 of March 2015 - Aug 2020 water level decline rates in ft/biweek, stored as a Numpy array of shape 400 x 220
'water_level_amplitude_ft': Mean/P10/P90 of seasonal water level oscillation amplitude, stored as a Numpy array of shape 400 x 220
'water_level_phase_deg': Mean/P10/P90 of time to peak seasonal signal in degrees, stored as a Numpy array of shape 400 x 220

The second dataset is contained in the following Python pickle file.
2. 'CV_water_level_time_series_Mar2015_Aug2020.pkl': This file contains a Python dictionary with the following pairs:
2.1. 'longitude': Numpy array of shape 400 x 220
2.2. 'longitude': Numpy array of shape 400 x 220
2.3. 'time_axis': Python list of length 132 containing strings for biweekly periods from March 2015 - August 2020
2.4. 'water_level_well_ft': Processed water level observations in feet from 1744 wells, irregularly sampled across time. The data are stored as a Numpy array of shape 400 x 220 x 132, with missing values as NaNs.
2.5. 'water_level_modeled_mean_ft': Modeled mean water level time series in feet, stored as a Numpy array of shape 400 x 220 x 132
2.6. 'water_level_modeled_P10_ft': Modeled P10 water level time series in feet, stored as a Numpy array of shape 400 x 220 x 132
2.7. 'water_level_modeled_P90_ft': Modeled P90 water level time series in feet, stored as a Numpy array of shape 400 x 220 x 132
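The following sketch shows one way to work with the NaN-coded well observations after loading such a pickle. It builds a tiny synthetic dictionary that reuses two of the key names listed above and round-trips it through pickle in memory, so the 4 x 3 x 5 shapes and all values are illustrative stand-ins, not the real 400 x 220 x 132 arrays. Pickle files should only be loaded from trusted sources.

```python
import io
import pickle

import numpy as np

# Synthetic stand-in for the time-series dictionary (tiny grid, made-up values).
n_rows, n_cols, n_time = 4, 3, 5
rng = np.random.default_rng(0)
modeled = rng.normal(100.0, 5.0, size=(n_rows, n_cols, n_time))
wells = np.full((n_rows, n_cols, n_time), np.nan)  # NaN marks "no observation"
wells[0, 0, :] = modeled[0, 0, :] + 1.0            # one cell observed at every step
wells[2, 1, 1:4] = modeled[2, 1, 1:4] - 2.0        # one cell observed at three steps
data = {"water_level_well_ft": wells, "water_level_modeled_mean_ft": modeled}

# With the real file this would be: with open(path, "rb") as f: loaded = pickle.load(f)
loaded = pickle.load(io.BytesIO(pickle.dumps(data)))

obs = loaded["water_level_well_ft"]
mod = loaded["water_level_modeled_mean_ft"]

observed = ~np.isnan(obs)                # boolean mask of available observations
coverage = observed.mean()               # fraction of space-time cells with data
residual = obs - mod                     # NaN wherever no observation exists
mean_residual = np.nanmean(residual)     # NaN-aware average, in feet
cells_with_data = observed.any(axis=2)   # grid cells observed at least once
```

Using NaN-aware reductions such as np.nanmean keeps unobserved cells from contaminating summary statistics without having to drop or fill them first.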
