58 datasets found

Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction...
frontiersin.figshare.com
pdf
Updated Jun 1, 2023
Cite
Yi-Hui Zhou; Ehsan Saghapour (2023). Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction of Biomedical Data.PDF [Dataset]. http://doi.org/10.3389/fgene.2021.691274.s001
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/fgene.2021.691274.s001
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers
Authors
Yi-Hui Zhou; Ehsan Saghapour
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Completed datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, use of the most popular imputation methods mainly requires scripting skills, and the methods are implemented using various packages and syntax. Thus, the implementation of a full suite of methods is generally out of reach to all except experienced data scientists. Moreover, imputation is often considered as a separate exercise from exploratory data analysis, but should be considered as part of the data exploration process. We have created a new graphical tool, ImputEHR, that is based on Python and allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as implementing machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.
Global Freelancers (Raw) Dataset
kaggle.com
Updated Jul 5, 2025
Cite
Urvish Ahir (2025). Global Freelancers (Raw) Dataset [Dataset]. https://www.kaggle.com/datasets/urvishahir/global-freelancers-raw-dataset
Dataset updated
Jul 5, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Urvish Ahir
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset contains 1,000 fictional freelancer profiles from around the world, designed to reflect realistic variability and messiness often encountered in real-world data collection.

Each entry includes demographic, professional, and platform-related information such as:

Name, gender, age, and country

Primary skill and years of experience

Hourly rate (with mixed formatting), client rating, and satisfaction score

Language spoken (based on country)

Inconsistent and unclean values across several fields (e.g., gender, is_active, satisfaction)

Key Features :

Gender-based names using Faker’s male/female name generators

Realistic age and experience distribution (with missing and noisy values)

Country-language pairs mapped using actual linguistic data

Messy formatting: mixed data types, missing values, inconsistent casing

Generated entirely in Python using the Faker library; no real data used

Use Cases :

Practicing data cleaning and preprocessing

Performing EDA (Exploratory Data Analysis)

Developing data pipelines: raw → clean → model-ready

Teaching feature engineering and handling real-world dirty data

Exercises in data validation, outlier detection, and format standardization

File : global_freelancers_raw.csv

| Column Name | Description |
| --------------------- | ------------------------------------------------------------------------ |
| `freelancer_ID` | Unique ID starting with `FL` (e.g., FL250001) |
| `name` | Full name of freelancer (based on gender) |
| `gender` | Gender (messy values and case inconsistency) |
| `age` | Age of the freelancer (20–60, with occasional nulls/outliers) |
| `country` | Country name (with random formatting/casing) |
| `language` | Language spoken (mapped from country) |
| `primary_skill` | Key freelance domain (e.g., Web Dev, AI, Cybersecurity) |
| `years_of_experience` | Work experience in years (some missing values or odd values included) |
| `hourly_rate (USD)` | Hourly rate with currency symbols or missing data |
| `rating` | Rating between 1.0–5.0 (some zeros and nulls included) |
| `is_active` | Active status (inconsistently represented as strings, numbers, booleans) |
| `client_satisfaction` | Satisfaction percentage (e.g., "85%" or 85, may include NaNs) |
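The mixed formats listed in the table lend themselves to small normalising helpers. The following is a minimal sketch using only the standard library; the function names and the accepted spellings of true/false are assumptions made for illustration, not something the dataset author specifies.

```python
import math
import re

def parse_hourly_rate(value):
    """Return the rate as a float, or None when no number can be recovered."""
    if value is None:
        return None
    # Drop thousands separators, then take the first number in the text.
    match = re.search(r"\d+(?:\.\d+)?", str(value).replace(",", ""))
    return float(match.group()) if match else None

def parse_satisfaction(value):
    """Accept "85%", "85" or 85 and return 85.0; NaN and blanks become None."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    text = str(value).strip().rstrip("%").strip()
    try:
        number = float(text)
    except ValueError:
        return None
    return None if math.isnan(number) else number

# Assumed spellings; the real file may contain others.
TRUE_TOKENS = {"1", "true", "yes", "y", "active"}
FALSE_TOKENS = {"0", "false", "no", "n", "inactive"}

def parse_is_active(value):
    """Map strings, numbers and booleans to True/False, or None if unrecognised."""
    token = str(value).strip().lower()
    if token in TRUE_TOKENS:
        return True
    if token in FALSE_TOKENS:
        return False
    return None
```

Returning None for anything unrecognised keeps the "missing" decision explicit, so a later step can choose between dropping and imputing.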
Learn Data Science Series Part 1
kaggle.com
Updated Dec 30, 2022
Cite
Rupesh Kumar (2022). Learn Data Science Series Part 1 [Dataset]. https://www.kaggle.com/datasets/hunter0007/learn-data-science-part-1
Dataset updated
Dec 30, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Rupesh Kumar
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Please feel free to share it with others and consider supporting me if you find it helpful ⭐️.

Overview:

Chapter 1: Getting started with pandas

Chapter 2: Analysis: Bringing it all together and making decisions

Chapter 3: Appending to DataFrame

Chapter 4: Boolean indexing of dataframes

Chapter 5: Categorical data

Chapter 6: Computational Tools

Chapter 7: Creating DataFrames

Chapter 8: Cross sections of different axes with MultiIndex

Chapter 9: Data Types

Chapter 10: Dealing with categorical variables

Chapter 11: Duplicated data

Chapter 12: Getting information about DataFrames

Chapter 13: Gotchas of pandas

Chapter 14: Graphs and Visualizations

Chapter 15: Grouping Data

Chapter 16: Grouping Time Series Data

Chapter 17: Holiday Calendars

Chapter 18: Indexing and selecting data

Chapter 19: IO for Google BigQuery

Chapter 20: JSON

Chapter 21: Making Pandas Play Nice With Native Python Datatypes

Chapter 22: Map Values

Chapter 23: Merge, join, and concatenate

Chapter 24: Meta: Documentation Guidelines

Chapter 25: Missing Data

Chapter 26: MultiIndex

Chapter 27: Pandas Datareader

Chapter 28: Pandas IO tools (reading and saving data sets)

Chapter 29: pd.DataFrame.apply

Chapter 30: Read MySQL to DataFrame

Chapter 31: Read SQL Server to Dataframe

Chapter 32: Reading files into pandas DataFrame

Chapter 33: Resampling

Chapter 34: Reshaping and pivoting

Chapter 35: Save pandas dataframe to a csv file

Chapter 36: Series

Chapter 37: Shifting and Lagging Data

Chapter 38: Simple manipulation of DataFrames

Chapter 39: String manipulation

Chapter 40: Using .ix, .iloc, .loc, .at and .iat to access a DataFrame

Chapter 41: Working with Time Series
titanic
tensorflow.org
Updated Feb 12, 2023
Cite
(2023). titanic [Dataset]. https://www.tensorflow.org/datasets/catalog/titanic
Dataset updated
Feb 12, 2023
Description
Dataset describing the survival status of individual passengers on the Titanic. Missing values in the original dataset are represented using ?. Float and int missing values are replaced with -1, string missing values are replaced with 'Unknown'.

To use this dataset:

import tensorflow_datasets as tfds

ds = tfds.load('titanic', split='train')
for ex in ds.take(4):
    print(ex)

See the guide for more information on tensorflow_datasets.
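Because missing values arrive as sentinels (-1 for numbers, 'Unknown' for strings), a downstream step may want to turn them back into explicit missing markers. The sketch below assumes each example has already been converted to a plain Python dict; the feature names in the usage line are illustrative. Note that a genuine value of -1 cannot be told apart from the sentinel.

```python
def restore_missing(record):
    """Map the documented sentinels back to None.

    -1 marks a missing int/float, 'Unknown' marks a missing string.
    """
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, bytes):
            # String features are often delivered as bytes.
            value = value.decode("utf-8")
        if isinstance(value, str) and value == "Unknown":
            cleaned[key] = None
        elif (isinstance(value, (int, float))
              and not isinstance(value, bool)
              and value == -1):
            cleaned[key] = None
        else:
            cleaned[key] = value
    return cleaned

# Hypothetical example record.
example = {"age": -1.0, "fare": 7.25, "cabin": b"Unknown", "sex": "male"}
restored = restore_missing(example)
```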
Adult dataset preprocessed
data.niaid.nih.gov
zenodo.org
Updated Jul 1, 2024
Cite
Schuster, Verena (2024). Adult dataset preprocessed [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12533513
Dataset updated
Jul 1, 2024
Dataset provided by
Pustozerova, Anastasia
Schuster, Verena
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The files "adult_train.csv" and "adult_test.csv" contain preprocessed versions of the Adult dataset from the UCI repository.

The file "adult_preprocessing.ipynb" is a Python notebook with all the preprocessing steps used to generate "adult_train.csv" and "adult_test.csv" from the original Adult dataset.

The preprocessing steps include:

One-hot-encoding of categorical values

Imputation of missing values using knn-imputer with k=1

Standard scaling of ordinal attributes

Note: we assume a scenario in which the test set (every attribute except the target, "income") is available before training; therefore we combine the train and test sets before preprocessing.
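To make the "k=1" imputation step concrete, here is a simplified nearest-neighbour imputer in plain Python. It is a sketch under stated assumptions, not the notebook's code: distance is plain Euclidean over the columns observed in both rows, whereas library implementations may weight or scale distances differently.

```python
import math

def impute_1nn(rows):
    """Fill None entries with the value from the single nearest donor row.

    A donor is any other row in which the missing column is observed.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf  # nothing to compare on
        return math.sqrt(sum((x - y) ** 2 for x, y in shared))

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = [r for k, r in enumerate(rows)
                      if k != i and r[j] is not None]
            if donors:
                nearest = min(donors, key=lambda r: distance(row, r))
                imputed[i][j] = nearest[j]
    return imputed

# Toy data: the middle row is closest to the first one.
rows = [[1.0, 2.0, 3.0], [1.1, None, 3.1], [9.0, 8.0, 7.0]]
filled = impute_1nn(rows)
```

With k=1 the imputed value is always a value that already occurs in the data, which keeps one-hot encoded columns valid after imputation.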
Geographical distribution and climate data of Cycas taiwaniana
scidb.cn
Updated Jan 3, 2025
Cite
CHUNPING XIE (2025). Geographical distribution and climate data of Cycas taiwaniana [Dataset]. http://doi.org/10.57760/sciencedb.19432
Unique identifier
https://doi.org/10.57760/sciencedb.19432
Dataset updated
Jan 3, 2025
Dataset provided by
Science Data Bank
Authors
CHUNPING XIE
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Dataset Description: Geographical Distribution and Climate Data of Cycas taiwaniana (Taiwanese Cycad)

This dataset contains the geographical distribution and climate data for Cycas taiwaniana, focusing on its presence across regions in Fujian, Guangdong, and Hainan provinces of China. The dataset includes geographical coordinates (longitude and latitude), monthly climate data (minimum and maximum temperature, and precipitation) across different months, as well as bioclimatic variables based on the WorldClim dataset.

**Temporal and Spatial Information**

The data covers long-term climate information, with monthly data for each location recorded over a 12-month period (January to December). The dataset includes spatial data in terms of longitude and latitude, corresponding to various locations where Cycas taiwaniana populations are present. The spatial resolution is specific to each point location, and the temporal resolution reflects the monthly climate data for each year.

**Data Structure and Units**

The dataset consists of 36 records, each representing a unique location with corresponding climate and geographical data. The table includes the following columns:

1. No.: Unique identifier for each data record
2. Longitude: Geographic longitude in decimal degrees
3. Latitude: Geographic latitude in decimal degrees
4. tmin1 to tmin12: Minimum temperature (°C) for each month (January to December)
5. tmax1 to tmax12: Maximum temperature (°C) for each month (January to December)
6. prec1 to prec12: Precipitation (mm) for each month (January to December)
7. bio1 to bio19: Bioclimatic variables (e.g., annual mean temperature, temperature seasonality, precipitation, etc.) derived from WorldClim data (unit varies depending on the variable)

The units for each measurement are as follows:

- Temperature: Degrees Celsius (°C)
- Precipitation: Millimeters (mm)
- Bioclimatic variables: Varies depending on the specific variable (e.g., °C, mm)

**Data Gaps and Missing Values**

The dataset contains some missing values, particularly in the "precipitation" columns for certain months and locations. These missing values may result from gaps in climate station data or limitations in data collection for specific regions. Missing values are indicated as "NA" (Not Available) in the dataset. In cases where data gaps exist, estimations were not made, and the absence of the data is acknowledged in the record.

**File Format and Software Compatibility**

The dataset is provided in CSV format for ease of use and compatibility with various data analysis tools. It can be opened and processed using software such as Microsoft Excel, R (https://cran.r-project.org/), or Python (https://www.python.org/) with Pandas. The dataset is compatible with any software that supports CSV files.

This dataset provides valuable information for research related to the geographical distribution and climate preferences of Cycas taiwaniana and can be used to inform conservation strategies, ecological studies, and climate change modeling.
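Since gaps are written as the literal string "NA", a reader has to convert them before doing arithmetic. The following minimal sketch uses the standard library only; the column names follow the description above, while the two sample rows and their values are made up for illustration.

```python
import csv
import io

def read_records(text):
    """Parse CSV text, turning the literal "NA" into None and numbers into floats."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append({key: (None if value == "NA" else float(value))
                     for key, value in raw.items()})
    return rows

def column_mean(rows, column):
    """Mean of the available values in a column; None if all are missing."""
    values = [row[column] for row in rows if row[column] is not None]
    return sum(values) / len(values) if values else None

# Hypothetical sample with one missing precipitation value.
sample = """No.,Longitude,Latitude,prec1,prec2
1,110.5,19.2,20.0,NA
2,117.3,24.9,40.0,60.0
"""

rows = read_records(sample)
mean_prec1 = column_mean(rows, "prec1")
mean_prec2 = column_mean(rows, "prec2")
```

Skipping the missing entries, rather than treating them as zero, matches the statement that no estimations were made for the gaps.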
Li-ion Battery Aging Dataset
kaggle.com
Updated May 12, 2024
Cite
GIRITHARAN MANI (2024). Li-ion Battery Aging Dataset [Dataset]. https://www.kaggle.com/datasets/mystifoe77/nasa-battery-data-cleaned/suggestions
Dataset updated
May 12, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
GIRITHARAN MANI
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Overview

This dataset provides a comprehensive view of the aging process of lithium-ion batteries, facilitating the estimation of their Remaining Useful Life (RUL). Originally sourced from NASA's open repository, the dataset has undergone meticulous preprocessing to enhance its analytical utility. The data is presented in a user-friendly CSV format after extracting relevant features from the original .mat files.

Key Features of the Dataset

Battery Performance Metrics:

Capacity: Measured over time to assess degradation.

Internal Resistance (Re): Represents the electrical resistance of the battery.

Charge Transfer Resistance (Rct): Indicates charge movement efficiency.

Environmental Conditions:

Ambient Temperature: External temperature affecting battery performance.

Identification Attributes:

Battery ID: Unique identifier for each battery tested.

Test ID: Links specific test conditions to outcomes.

UID & Filename: Traceable dataset references.

Processed Data:

Missing values have been addressed.

Columns irrelevant to RUL estimation have been removed.

Skewness in the data has been corrected for statistical accuracy.

Labels:

Degradation States: Categorized into intervals for easier interpretation.

Ranges include operational and failure states.

Potential Applications

Battery Health Monitoring:

Predict battery failure timelines.

Enhance battery maintenance strategies.

Data Science and Machine Learning:

Model development for RUL prediction.

Feature engineering for predictive analysis.

Research and Development:

Improve battery design.

Study the impact of environmental and operational conditions on battery life.

Technical Details

File Format: CSV

Size: ~625.02 kB

Columns: 9

Data Points: Multiple observations across various tests.

Tags

Keywords: Lithium-ion batteries, RUL, Battery Aging, Machine Learning, Data Analysis, Predictive Maintenance.

License

Apache 2.0: Permits academic and commercial use.

Usage Instructions

Import the dataset into your data analysis tools (e.g., Python, R, MATLAB).

Explore features to understand correlations and dependencies.

Use machine learning models for RUL prediction.

Provenance

The dataset was retrieved from NASA's publicly available data repositories. It has been preprocessed to align with research and industrial standards for usability in analytical tasks.

Call to Action

Leverage this dataset to enhance your understanding of lithium-ion battery degradation and build models that could revolutionize energy storage solutions.
machine learning models on the WDBC dataset
scidb.cn
Updated Apr 15, 2025
Cite
Mahdi Aghaziarati (2025). machine learning models on the WDBC dataset [Dataset]. http://doi.org/10.57760/sciencedb.23537
Unique identifier
https://doi.org/10.57760/sciencedb.23537
Dataset updated
Apr 15, 2025
Dataset provided by
Science Data Bank
Authors
Mahdi Aghaziarati
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset used in this study is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, originally provided by the University of Wisconsin and obtained via Kaggle. It consists of 569 observations, each corresponding to a digitized image of a fine needle aspirate (FNA) of a breast mass. The dataset contains 32 attributes: one identifier column (discarded during preprocessing), one diagnosis label (malignant or benign), and 30 continuous real-valued features that describe the morphology of cell nuclei. These features are grouped into three statistical descriptors—mean, standard error (SE), and worst (mean of the three largest values)—for ten morphological properties including radius, perimeter, area, concavity, and fractal dimension. All feature values were normalized using z-score standardization to ensure uniform scale across models sensitive to input ranges. No missing values were present in the original dataset. Label encoding was applied to the diagnosis column, assigning 1 to malignant and 0 to benign cases. The dataset was split into training (80%) and testing (20%) sets while preserving class balance via stratified sampling. The accompanying Python source code (breast_cancer_classification_models.py) performs data loading, preprocessing, model training, evaluation, and result visualization. Four lightweight classifiers—Decision Tree, Naïve Bayes, Perceptron, and K-Nearest Neighbors (KNN)—were implemented using the scikit-learn library (version 1.2 or later). Performance metrics including Accuracy, Precision, Recall, F1-score, and ROC-AUC were calculated for each model. Confusion matrices and ROC curves were generated and saved as PNG files for interpretability. All results are saved in a structured CSV file (classification_results.csv) that contains the performance metrics for each model. 
Supplementary visualizations include all_feature_histograms.png (distribution plots for all standardized features), model_comparison.png (metric-wise bar plot), and feature_correlation_heatmap.png (Pearson correlation matrix of all 30 features). The data files are in standard CSV and PNG formats and can be opened using any spreadsheet or image viewer, respectively. No rare file types are used, and all scripts are compatible with any Python 3.x environment. This data package enables reproducibility and offers a transparent overview of how baseline machine learning models perform in the domain of breast cancer diagnosis using a clinically-relevant dataset.
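The 80/20 stratified split mentioned in the description can be illustrated without any third-party library. This is a sketch of the idea only, not the code in breast_cancer_classification_models.py; the function name, the seed, and the toy labels are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 1 = malignant, 0 = benign (same encoding as the description).
labels = [1] * 40 + [0] * 60
train_idx, test_idx = stratified_split(labels)
```

Splitting each class separately is what preserves the class balance: the test set here contains the same 40/60 ratio as the full label list.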
ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture...
researchdata.tuwien.ac.at
b2find.eudat.eu
zip
Updated Jun 6, 2025
Cite
Wolfgang Preimesberger; Pietro Stradiotti; Wouter Arnoud Dorigo (2025). ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture from merged multi-satellite observations [Dataset]. http://doi.org/10.48436/3fcxr-cde10
Available download formats: zip
Unique identifier
https://doi.org/10.48436/3fcxr-cde10
Dataset updated
Jun 6, 2025
Dataset provided by
TU Wien
Authors
Wolfgang Preimesberger; Pietro Stradiotti; Wouter Arnoud Dorigo
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset was produced with funding from the European Space Agency (ESA) Climate Change Initiative (CCI) Plus Soil Moisture Project (CCN 3 to ESRIN Contract No: 4000126684/19/I-NB "ESA CCI+ Phase 1 New R&D on CCI ECVS Soil Moisture"). Project website: https://climate.esa.int/en/projects/soil-moisture/

This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.

Dataset paper (public preprint)

A description of this dataset, including the methodology and validation results, is available at:

Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.

Abstract

ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
Since the requirement of a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such gap-filling methods is to rely only on the original observational record, without the need for ancillary variables or model-based information. Due to the intrinsic challenge, until now there was no global, long-term univariate gap-filled product available. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments to fill satellite-like gaps introduced to GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and consider the gap size and local vegetation conditions as parameters that affect the gapfilling performance.

Summary

Gap-filled global estimates of volumetric surface soil moisture from 1991-2023 at 0.25° sampling

Fields of application (partial): climate variability and change, land-atmosphere interactions, global biogeochemical cycles and ecology, hydrological and land surface modelling, drought applications, and meteorology

Method: Modified version of DCT-PLS (Garcia, 2010) interpolation/smoothing algorithm, linear interpolation over periods of frozen soils. Uncertainty estimates are provided for all data points.

More information: See Preimesberger et al. (2025) and the ESA CCI SM Algorithm Theoretical Baseline Document [Chapter 7.2.9] (Dorigo et al., 2023), https://doi.org/10.5281/zenodo.8320869

Programmatic Download

You can use command line tools such as wget or curl to download (and extract) data for multiple years. The following command will download and extract the complete data set to the local directory ~/Downloads on Linux or macOS systems.

#!/bin/bash

# Set download directory
DOWNLOAD_DIR=~/Downloads

base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"

# Make sure the target directory exists
mkdir -p "$DOWNLOAD_DIR"

# Loop through years 1991 to 2023 and download & extract data
for year in {1991..2023}; do
    echo "Downloading $year.zip..."
    wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
    # Quote the target so paths containing spaces still work
    unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
    rm "$DOWNLOAD_DIR/$year.zip"
done

Data details

The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), each subdirectory containing one netCDF image file for a specific day (DD), month (MM) in a 2-dimensional (longitude, latitude) grid system (CRS: WGS84). The file name has the following convention:

ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc
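The naming convention makes it easy to locate the file for a given day, or to recover the date from a file name. The sketch below is illustrative: the helper names are hypothetical, and it assumes the extracted files sit in a subdirectory named after the year, as the grouping described above suggests.

```python
from datetime import date
from pathlib import PurePosixPath

PREFIX = "ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED"
VERSION = "fv09.1r1"

def daily_file(day):
    """Relative path (YYYY/<file name>) of the netCDF image for one day."""
    name = f"{PREFIX}-{day:%Y%m%d}000000-{VERSION}.nc"
    return PurePosixPath(f"{day:%Y}") / name

def parse_day(filename):
    """Recover the date from a file name that follows the convention."""
    stamp = filename.split("-")[-2]  # YYYYMMDD000000
    return date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:8]))

path = daily_file(date(2005, 7, 14))
```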

Data Variables

Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:

sm: (float) The Soil Moisture variable reflects estimates of daily average volumetric soil moisture content (m3/m3) in the soil surface layer (~0-5 cm) over a whole grid cell (0.25 degree).

sm_uncertainty: (float) The Soil Moisture Uncertainty variable reflects the uncertainty (random error) of the original satellite observations and of the predictions used to fill observation data gaps.

sm_anomaly: Soil moisture anomalies (reference period 1991-2020) derived from the gap-filled values (`sm`)

sm_smoothed: Contains DCT-PLS predictions used to fill data gaps in the original soil moisture field. These values are also provided for cases where an observation was initially available (compare `gapmask`). In this case, they provide a smoothed version of the original data.

gapmask: (0 | 1) Indicates grid cells where a satellite observation is available (1), and where the interpolated (smoothed) values are used instead (0) in the 'sm' field.

frozenmask: (0 | 1) Indicates grid cells where ERA5 soil temperature is <0 °C. In this case, a linear interpolation over time is applied.

Additional information for each variable is given in the netCDF attributes.
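As a rough illustration of how `gapmask` relates to `sm`, the sketch below works on small nested lists standing in for one daily grid. In practice the arrays would be read from the netCDF file with one of the tools listed further down; the helper names, the toy values, and the use of None for cells without data are assumptions for this example.

```python
def gapfilled_fraction(gapmask):
    """Share of valid cells whose 'sm' value is interpolated (gapmask == 0)."""
    cells = [value for row in gapmask for value in row if value is not None]
    if not cells:
        return None
    return sum(1 for value in cells if value == 0) / len(cells)

def observed_only(sm, gapmask):
    """Keep 'sm' only where a satellite observation exists (gapmask == 1)."""
    return [[s if g == 1 else None for s, g in zip(sm_row, mask_row)]
            for sm_row, mask_row in zip(sm, gapmask)]

# Toy 2x2 grid; None marks a cell without data.
gapmask = [[1, 0], [0, None]]
sm = [[0.21, 0.25], [0.30, None]]

fraction = gapfilled_fraction(gapmask)
observations = observed_only(sm, gapmask)
```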

Version Changelog

Changes in v09.1r1 (previous version was v09.1):

This version uses a novel uncertainty estimation scheme as described in Preimesberger et al. (2025).

Software to open netCDF files

These data can be read by any software that supports Climate and Forecast (CF) conform metadata standards for netCDF files, such as:

Xarray (python): https://github.com/pydata/xarray

netCDF4 (python): https://unidata.github.io/netcdf4-python/

esa_cci_sm (python): https://github.com/TUW-GEO/esa_cci_sm

Similar tools exist for other programming languages (Matlab, R, etc.)

Software packages and GIS tools can open netCDF files, e.g. CDO, NCO, QGIS, ArcGIS

You can also use the GUI software Panoply to view the contents of each file

References

Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.

Dorigo, W., Preimesberger, W., Stradiotti, P., Kidd, R., van der Schalie, R., van der Vliet, M., Rodriguez-Fernandez, N., Madelon, R., & Baghdadi, N. (2023). ESA Climate Change Initiative Plus - Soil Moisture Algorithm Theoretical Baseline Document (ATBD) Supporting Product Version 08.1 (version 1.1). Zenodo. https://doi.org/10.5281/zenodo.8320869

Garcia, D., 2010. Robust smoothing of gridded data in one and higher dimensions with missing values. Computational Statistics & Data Analysis, 54(4), pp.1167-1178. Available at: https://doi.org/10.1016/j.csda.2009.09.020

Rodell, M., Houser, P. R., Jambor, U., Gottschalck, J., Mitchell, K., Meng, C.-J., Arsenault, K., Cosgrove, B., Radakovich, J., Bosilovich, M., Entin, J. K., Walker, J. P., Lohmann, D., and Toll, D.: The Global Land Data Assimilation System, Bulletin of the American Meteorological Society, 85, 381 – 394, https://doi.org/10.1175/BAMS-85-3-381, 2004.

Related Records

The following records are all part of the Soil Moisture Climate Data Records from satellites community

1. ESA CCI SM MODELFREE Surface Soil Moisture Record: https://doi.org/10.48436/svr1r-27j77
Overwatch 2 statistics
kaggle.com
Updated Jun 27, 2023
Cite
Mykhailo Kachan (2023). Overwatch 2 statistics [Dataset]. https://www.kaggle.com/datasets/mykhailokachan/overwatch-2-statistics
Dataset updated
Jun 27, 2023
Dataset provided by
Kaggle
Authors
Mykhailo Kachan
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset is built on data from Overbuff with the help of Python and Selenium. Development environment: Jupyter Notebook.

The tables contain the data for competitive seasons 1-4 and for quick play for each hero and rank along with the standard statistics (common to each hero as well as information belonging to a specific hero).

Note: data for some columns are missing on Overbuff site (there is '—' instead of a specific value), so they were dropped: Scoped Crits for Ashe and Widowmaker, Rip Tire Kills for Junkrat, Minefield Kills for Wrecking Ball. 'Self Healing' column for Bastion was dropped too as Bastion doesn't have this property anymore in OW2. Also, there are no values for "Javelin Spin Kills / 10min" for Orisa in season 1 (the column was dropped). Overall, all missing values were cleaned.

Attention: Overbuff doesn't contain info about OW 1 competitive seasons (when you change a skill tier, the data isn't changed). If you know a site where it's possible to get this data, please, leave a comment. Thank you!

The code is available on GitHub.

The whole procedure is done in 5 stages:

Stage 1:

Data is retrieved directly from HTML elements on the page with Selenium in Python.

Stage 2:

After scraping, the data were cleansed: 1) Removed the thousands separator (e.g. 1,009 => 1009). 2) Converted time representations (e.g. '01:23') to seconds (1*60 + 23 => 83). 3) Replaced accented hero names: Lúcio became Lucio, Torbjörn became Torbjorn.

Stage 3:

Data were arranged into a table and saved to CSV.

Stage 4:

Columns which are supposed to have only numeric values are checked. All non-numeric values are dropped. This stage helps to find missing values which contain '—' instead and delete them.

Stage 5:

Additional missing values are searched for and dealt with. The fix is either a column rename (as the program cannot infer the correct column name for missing values) or a column drop. This stage ensures all wrong data are truly fixed.
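The Stage 2 cleaning rules and the Stage 4 numeric check can be sketched in a few lines of standard-library Python. This is not the author's code: the function names are hypothetical, and time values are assumed to be in 'mm:ss' form as in the example given in Stage 2.

```python
import unicodedata

def parse_count(text):
    """Remove the thousands separator: '1,009' -> 1009."""
    return int(text.replace(",", ""))

def time_to_seconds(text):
    """Convert 'mm:ss' to seconds: '01:23' -> 83."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)

def strip_accents(name):
    """Drop diacritics: 'Lúcio' -> 'Lucio', 'Torbjörn' -> 'Torbjorn'."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def to_number(text):
    """Stage 4 style check: a float, or None for placeholders such as '—'."""
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return None
```

Returning None from `to_number` lets a later pass decide whether to drop the value or the whole column, which mirrors the choice described in Stage 5.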

The procedure to fetch the data takes 7 minutes on average.

This project and code were born from this GitHub code.
Python code for the estimation of missing prices in real-estate market with...
data.mendeley.com
Updated Dec 12, 2017
Cite
Iván García-Magariño (2017). Python code for the estimation of missing prices in real-estate market with a dataset of house prices from Teruel city [Dataset]. http://doi.org/10.17632/mxpgf54czz.2
Unique identifier
https://doi.org/10.17632/mxpgf54czz.2
Dataset updated
Dec 12, 2017
Authors
Iván García-Magariño
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Teruel
Description
This research data file contains the necessary software and the dataset for estimating the missing prices of house units. This approach combines several machine learning techniques (linear regression, support vector regression, the k-nearest neighbors and a multi-layer perceptron neural network) with several dimensionality reduction techniques (non-negative factorization, recursive feature elimination and feature selection with a variance threshold). It includes the input dataset formed with the available house prices in two neighborhoods of Teruel city (Spain) on November 13, 2017 from the Idealista website. These two neighborhoods are the center of the city and “Ensanche”.

This dataset supports the research of the authors in the improvement of the setup of agent-based simulations about real-estate market. The work about this dataset has been submitted for consideration for publication to a scientific journal.

The open source python code is composed of all the files with the “.py” extension. The main program can be executed from the “main.py” file. The “boxplotErrors.eps” is a chart generated from the execution of the code, and compares the results of the different combinations of machine learning techniques and dimensionality reduction methods.

The dataset is in the “data” folder. The raw input data of the house prices are in the “dataRaw.csv” file. These were shuffled into the “dataShuffled.csv” file. We used cross-validation to obtain the estimations of house prices. The output estimations, alongside the real values, are stored in different files of the “data” folder, in which each filename is composed of the machine learning technique abbreviation and the dimensionality reduction method abbreviation.
Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 - 7—Data
catalog.data.gov
data.usgs.gov
Updated Jul 6, 2024
Cite
U.S. Geological Survey (2024). Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 - 7—Data [Dataset]. https://catalog.data.gov/dataset/variable-terrestrial-gps-telemetry-detection-rates-parts-1-7data
Dataset updated
Jul 6, 2024
Dataset provided by
U.S. Geological Survey
Description
Studies utilizing Global Positioning System (GPS) telemetry rarely result in 100% fix success rates (FSR). Many assessments of wildlife resource use do not account for missing data, either assuming data loss is random or because of a lack of practical treatment for systematic data loss. Several studies have explored how the environment, technological features, and animal behavior influence rates of missing data in GPS telemetry, but previous spatially explicit models developed to correct for sampling bias have been specified to small study areas, on a small range of data loss, or to be species-specific, limiting their general utility. Here we explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use. We also evaluate patterns in missing data that relate to potential animal activities that change the orientation of the antennae and characterize home-range probability of GPS detection for 4 focal species: cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni) and mule deer (Odocoileus hemionus).

Part 1, Positive Openness Raster (raster dataset): Openness is an angular measure of the relationship between surface relief and horizontal distance. For angles less than 90 degrees it is equivalent to the internal angle of a cone with its apex at a DEM location, and is constrained by neighboring elevations within a specified radial distance. A 480 meter search radius was used for this calculation of positive openness. Openness incorporates the terrain line-of-sight or viewshed concept and is calculated from multiple zenith and nadir angles, here along eight azimuths. Positive openness measures openness above the surface, with high values for convex forms and low values for concave forms (Yokoyama et al. 2002).
We calculated positive openness using a custom Python script, following the methods of Yokoyama et al. (2002) using a USGS National Elevation Dataset as input.

Part 2, Northern Arizona GPS Test Collar (csv): Bias correction in GPS telemetry data-sets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall over-story vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent data-sets from stationary test collars of different make/model, fix interval programming, and placed at different study sites. The absence of statistically significant differences (95% CI) between predicted and observed FSRs suggests that changes in technological factors have minor influence on the model's ability to predict FSR in new study areas in the southwestern US. The model training data are provided here for fix attempts by hour. This table can be linked with the site location shapefile using the site field.

Part 3, Probability Raster (raster dataset): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall overstory vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions.
The model's predictive ability was evaluated using two independent datasets from stationary test collars of different make/model, fix interval programming, and placed at different study sites. The absence of statistically significant differences (95% CI) between predicted and observed FSRs suggests that changes in technological factors have minor influence on the model's ability to predict FSR in new study areas in the southwestern US. We evaluated GPS telemetry datasets by comparing the mean probability of a successful GPS fix across study animals' home-ranges to the actual observed FSR of GPS downloaded deployed collars on cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni) and mule deer (Odocoileus hemionus). Comparing the mean probability of acquisition within study animals' home-ranges and observed FSRs of GPS downloaded collars resulted in an approximately 1:1 linear relationship with an r-squared of 0.68.

Part 4, GPS Test Collar Sites (shapefile): Bias correction in GPS telemetry data-sets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall over-story vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent data-sets from stationary test collars of different make/model, fix interval programming, and placed at different study sites. The absence of statistically significant differences (95% CI) between predicted and observed FSRs suggests that changes in technological factors have minor influence on the model's ability to predict FSR in new study areas in the southwestern US.
Part 5, Cougar Home Ranges (shapefile): Cougar home-ranges were calculated to compare the mean probability of a GPS fix acquisition across the home-range to the actual fix success rate (FSR) of the collar as a means for evaluating if characteristics of an animal’s home-range have an effect on observed FSR. We estimated home-ranges using the Local Convex Hull (LoCoH) method using the 90th isopleth. Only data obtained from GPS download of retrieved units were used. Satellite delivered data was omitted from the analysis for animals where the collar was lost or damaged because satellite delivery tends to lose approximately an additional 10% of data. Comparisons with home-range mean probability of fix were also used as a reference for assessing if the frequency with which animals use areas of low GPS acquisition rates may play a role in observed FSRs.

Part 6, Cougar Fix Success Rate by Hour (csv): Cougar GPS collar fix success varied by hour-of-day, suggesting circadian rhythms with bouts of rest during daylight hours may change the orientation of the GPS receiver, affecting the ability to acquire fixes. Raw data of overall fix success rates (FSR) and FSR by hour were used to predict relative reductions in FSR. Data only includes direct GPS download datasets. Satellite delivered data was omitted from the analysis for animals where the collar was lost or damaged because satellite delivery tends to lose approximately an additional 10% of data.

Part 7, Openness Python Script version 2.0: This Python script was used to calculate positive openness using a 30 meter digital elevation model for a large geographic area in Arizona, California, Nevada and Utah. A scientific research project used the script to explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use.
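
To make the fix success rate by hour concrete, the sketch below computes it as successful fixes divided by attempted fixes for each hour of day. It is an illustration only: the input layout (pairs of hour and success flag) is an assumption and does not reflect the actual column layout of the published csv files.

```python
from collections import defaultdict


def fix_success_rate_by_hour(attempts):
    """Compute FSR per hour from (hour_of_day, fix_acquired) pairs."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for hour, acquired in attempts:
        totals[hour] += 1
        if acquired:
            successes[hour] += 1
    # FSR = successful fixes / attempted fixes, reported per hour
    return {hour: successes[hour] / totals[hour] for hour in sorted(totals)}


# Hypothetical fix attempts: two at midnight, four at noon
attempts = [(0, True), (0, True), (12, True), (12, False), (12, False), (12, True)]
fsr = fix_success_rate_by_hour(attempts)
```

With this made-up input, the midnight rate is 1.0 and the noon rate is 0.5.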
Multimodal Vision-Audio-Language Dataset
data.niaid.nih.gov
zenodo.org
Updated Jul 11, 2024
Cite
Choksi, Bhavin (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10060784
Dataset updated
Jul 11, 2024
Dataset provided by
Roig, Gemma
Choksi, Bhavin
Schaumlöffel, Timothy
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.

Annotation

The annotation files are provided as Parquet files. They can be read using Python and the pandas and pyarrow libraries. The split into train, validation and test set follows the split of the original datasets.

Installation

pip install pandas pyarrow

Example

import pandas as pd
df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
print(df.iloc[0])

dataset              AudioSet
filename             train/---2_BBVHAA.mp3
captions_visual      [a man in a black hat and glasses.]
captions_auditory    [a man speaks and dishes clank.]
tags                 [Speech]

Description

The annotation file consists of the following fields:

filename: Name of the corresponding file (video or audio file)
dataset: Source dataset associated with the data point
captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
captions_auditory: A list of captions related to the auditory content of the video
tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided

Data files

The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
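
Because captions_visual and tags can be NaN, code that iterates over them may want to normalise missing values first. The helper below is a hypothetical sketch, not part of the dataset's tooling; it treats both None and a float NaN as "no captions", and the example record is invented.

```python
import math


def as_caption_list(value):
    """Return the captions as a list, or an empty list when the field is missing."""
    if value is None:
        return []
    if isinstance(value, float) and math.isnan(value):
        return []
    return list(value)


# Hypothetical audio-only record with no visual captions
record = {
    "filename": "example.mp3",
    "captions_visual": float("nan"),
    "captions_auditory": ["a man speaks and dishes clank."],
}
visual = as_caption_list(record["captions_visual"])
auditory = as_caption_list(record["captions_auditory"])
```

After this normalisation, both fields can be looped over without a separate missing-value check.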
Data curation materials in "Daily life in the Open Biologist's second job,...
explore.openaire.eu
zenodo.org
Updated Jul 26, 2024
Cite
Livia C T Scorza; Tomasz Zieliński; Andrew J Millar (2024). Data curation materials in "Daily life in the Open Biologist's second job, as a Data Curator" [Dataset]. http://doi.org/10.5281/zenodo.12734112
Unique identifier
https://doi.org/10.5281/zenodo.12734112
Dataset updated
Jul 26, 2024
Authors
Livia C T Scorza; Tomasz Zieliński; Andrew J Millar
Description
This is the supplementary material accompanying the manuscript "Daily life in the Open Biologist’s second job, as a Data Curator", published in Wellcome Open Research. It contains:

- Python_scripts.zip: Python scripts used for data cleaning and organization:
  - add_headers.py: adds specified headers automatically to a list of csv files, creating new output files containing a "_with_headers" suffix.
  - count_NaN_values.py: counts the total number of rows containing null values in a csv file and prints the location of null values in the (row, column) format.
  - remove_rowsNaN_file.py: removes rows containing null values in a single csv file and saves the modified file with a "_dropNaN" suffix.
  - remove_rowsNaN_list.py: removes rows containing null values in a list of csv files and saves the modified files with a "_dropNaN" suffix.
- README_template.txt: a template for a README file to be used to describe and accompany a dataset.
- template_for_source_data_information.xlsx: a spreadsheet to help manuscript authors to keep track of data used for each figure (e.g., information about data location and links to dataset description).
- Supplementary_Figure_1.tif: Example of a dataset shared by us on Zenodo. The elements that make the dataset FAIR are indicated by the respective letters.
  - Findability (F) is achieved by the dataset unique and persistent identifier (DOI), as well as by the related identifiers for the publication and dataset on GitHub. Additionally, the dataset is described with rich metadata (e.g., keywords).
  - Accessibility (A) is achieved by the ease of visualization and downloading using a standardised communications protocol (https). Also, the metadata are publicly accessible and licensed under the public domain.
  - Interoperability (I) is achieved by the open formats used (CSV; R), and metadata are harvestable using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), a low-barrier mechanism for repository interoperability.
  - Reusability (R) is achieved by the complete description of the data with metadata in README files and links to the related publication (which contains more detailed information, as well as links to protocols on protocols.io). The dataset has a clear and accessible data usage license (CC-BY 4.0).
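
As an illustration of what a null-counting script like the one described for count_NaN_values.py does, here is a standard-library sketch. It is not the authors' script: the function name, the set of strings treated as null, and the zero-based (row, column) indexing are all assumptions.

```python
import csv
import io

# Assumption: these strings (case-insensitive) mark a null cell
NULL_MARKERS = {"", "nan", "na", "null"}


def find_null_cells(csv_text):
    """Return (number of rows with nulls, list of (row, column) locations)."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    locations = []
    for row_idx, row in enumerate(reader):
        for col_idx, cell in enumerate(row):
            if cell.strip().lower() in NULL_MARKERS:
                locations.append((row_idx, col_idx))
    rows_with_nulls = len({row for row, _ in locations})
    return rows_with_nulls, locations


# Hypothetical three-row file with two null cells
sample = "id,height,weight\n1,10.5,\n2,NaN,3.2\n3,7.1,4.0\n"
count, cells = find_null_cells(sample)
```

Row and column indices here count data rows from zero and ignore the header; a real script could just as well report one-based positions.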
NA-CORDEX Cloud-Optimized Dataset
data.ucar.edu
zarr
Updated Sep 10, 2023
Cite
Banihirwe, Anderson; Bonnlander, Brian; McGinnis, Seth; Nienhouse, Eric; de La Beaujardiere, Jeff (2023). NA-CORDEX Cloud-Optimized Dataset [Dataset]. http://doi.org/10.26024/9xkm-fp81
Available download formats: zarr
Unique identifier
https://doi.org/10.26024/9xkm-fp81
Dataset updated
Sep 10, 2023
Dataset provided by
University Corporation for Atmospheric Research
Authors
Banihirwe, Anderson; Bonnlander, Brian; McGinnis, Seth; Nienhouse, Eric; de La Beaujardiere, Jeff
Time period covered
Jan 1, 1951 - Dec 31, 2099
Area covered

Description
The NA-CORDEX dataset contains output from high-resolution regional climate models run over North America using boundary conditions from global simulations in the CMIP5 archive. The subset of the NA-CORDEX data on AWS (data volume ~15 TB) includes daily data from 1950-2100 for impacts-relevant variables on a 0.25 degree or 0.50 degree common lat-lon grid. This data is freely available on AWS S3 thanks to the AWS Open Data Sponsorship Program and the Amazon Sustainability Data Initiative, which provide free storage and egress. The data on AWS is stored in Zarr format. This format supports the same data model as netCDF and is well suited to object storage and distributed computing in the cloud using the Pangeo libraries in Python. An Intake-ESM Catalog listing all available data can be found at: [https://ncar-na-cordex.s3-us-west-2.amazonaws.com/catalogs/aws-na-cordex.json] The full dataset (data volume ~35 TB) can be accessed for download or via web services on the NCAR Climate Data Gateway. [https://www.earthsystemgrid.org/search/cordexsearch.html]
Data from: A comprehensive dataset for the accelerated development and...
data.niaid.nih.gov
explore.openaire.eu
Updated Jan 24, 2020
Cite
Larson, David (2020). A comprehensive dataset for the accelerated development and benchmarking of solar forecasting methods [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2826938
Dataset updated
Jan 24, 2020
Dataset provided by
Larson, David
Coimbra, Carlos
Carreira Pedro, Hugo
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains a comprehensive solar irradiance, imaging, and forecasting dataset. The goal with this release is to provide standardized solar and meteorological datasets to the research community for the accelerated development and benchmarking of forecasting methods. The data consist of three years (2014–2016) of quality-controlled, 1-min resolution global horizontal irradiance and direct normal irradiance ground measurements in California. In addition, we provide overlapping data from commonly used exogenous variables, including sky images, satellite imagery, Numerical Weather Prediction forecasts, and weather data. We also include sample codes of baseline models for benchmarking of more elaborated models.

Data usage

The usage of the datasets and sample codes presented here is intended for research and development purposes only and implies explicit reference to the paper: Pedro, H.T.C., Larson, D.P., Coimbra, C.F.M., 2019. A comprehensive dataset for the accelerated development and benchmarking of solar forecasting methods. Journal of Renewable and Sustainable Energy 11, 036102. https://doi.org/10.1063/1.5094494

Although every effort was made to ensure the quality of the data, no guarantees or liabilities are implied by the authors or publishers of the data.

Sample code

As part of the data release, we are also including the sample code written in Python 3. The preprocessed data used in the scripts are also provided. The code can be used to reproduce the results presented in this work and as a starting point for future studies. Besides the standard scientific Python packages (numpy, scipy, and matplotlib), the code depends on pandas for time-series operations, pvlib for common solar-related tasks, and scikit-learn for Machine Learning models. All required Python packages are readily available on Mac, Linux, and Windows and can be installed via, e.g., pip.

Units

All time stamps are in UTC (YYYY-MM-DD HH:MM:SS). All irradiance and weather data are in SI units. Sky image features are derived from 8-bit RGB (256 color levels) data. Satellite images are derived from 8-bit gray-scale (256 color levels) data.

Missing data

The string "NAN" indicates missing data.

File formats

All time series data files are in CSV (comma separated values) format. Images are given in tar.bz2 files.
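
The conventions above (UTC time stamps in YYYY-MM-DD HH:MM:SS form, the string "NAN" for missing data, CSV files) can be honoured with a small reader such as the sketch below. The column names `timestamp`, `ghi` and `dni` are assumptions made for illustration and may not match the real file headers.

```python
import csv
import io
from datetime import datetime, timezone


def parse_row(row):
    """Convert one CSV row: UTC timestamp, floats, and None for 'NAN'."""
    parsed = {}
    for key, value in row.items():
        if key == "timestamp":
            stamp = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
            parsed[key] = stamp.replace(tzinfo=timezone.utc)  # stamps are UTC
        elif value == "NAN":
            parsed[key] = None  # missing measurement
        else:
            parsed[key] = float(value)
    return parsed


# Hypothetical one-row excerpt with a missing DNI value
sample = "timestamp,ghi,dni\n2014-01-02 16:00:00,312.5,NAN\n"
rows = [parse_row(r) for r in csv.DictReader(io.StringIO(sample))]
```

The pandas route would typically pass the missing-value string through the `na_values` argument of `read_csv` instead.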

Files

Each entry lists the file name, its type (Primary, Secondary or Code) and a description.

Folsom_irradiance.csv (Primary): One-minute GHI, DNI, and DHI data.

Folsom_weather.csv (Primary): One-minute weather data.

Folsom_sky_images_{YEAR}.tar.bz2 (Primary): Tar archives with daytime sky images captured at 1-min intervals for the years 2014, 2015, and 2016, compressed with bz2.

Folsom_NAM_lat{LAT}_lon{LON}.csv (Primary): NAM forecasts for the four nodes nearest the target location. {LAT} and {LON} are replaced by the node’s coordinates listed in Table I in the paper.

Folsom_sky_image_features.csv (Secondary): Features derived from the sky images.

Folsom_satellite.csv (Secondary): 10 pixel by 10 pixel GOES-15 images centered in the target location.

Irradiance_features_{horizon}.csv (Secondary): Irradiance features for the different forecasting horizons ({horizon} = {intra-hour, intra-day, day-ahead}).

Sky_image_features_intra-hour.csv (Secondary): Sky image features for the intra-hour forecasting issuing times.

Sat_image_features_intra-day.csv (Secondary): Satellite image features for the intra-day forecasting issuing times.

NAM_nearest_node_day-ahead.csv (Secondary): NAM forecasts (GHI, DNI computed with the DISC algorithm, and total cloud cover) for the nearest node to the target location prepared for day-ahead forecasting.

Target_{horizon}.csv (Secondary): Target data for the different forecasting horizons.

Forecast_{horizon}.py (Code): Python script used to create the forecasts for the different horizons.

Postprocess.py (Code): Python script used to compute the error metric for all the forecasts.
Data from: United Nations General Assembly Resolutions, 1946-2014
dataverse.harvard.edu
Updated Jan 25, 2016
Cite
Andreas Warntjen (2016). United Nations General Assembly Resolutions, 1946-2014 [Dataset]. http://doi.org/10.7910/DVN/T8EIWO
Unique identifier
https://doi.org/10.7910/DVN/T8EIWO
Dataset updated
Jan 25, 2016
Dataset provided by
Harvard Dataverse
Authors
Andreas Warntjen
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The data set consists of all resolutions passed by the United Nations General Assembly in their regular sessions from 1946 to 2014 (some from the calendar year 2015 are also included). The data is based on the summary tables on the UN website (accessible via http://www.un.org/documents/resga.htm). The tables were retrieved on 16 September 2015 and the data was automatically extracted using a Python script. The data set consists of 17,359 observations. A number of typos were corrected and missing pieces of information were added (see below). Dates were converted into a uniform format. Several checks for consistency were performed (see Python script for details). One resolution number was not assigned (A/RES/50/125), and another block of numbers is also missing from the UN website (A/RES/56/500 – 300). Please consult the following for more information regarding the original data: http://research.un.org/en/docs/resolutions http://www.unitar.org/ny/sites/unitar.org.ny/files/UN_PGA_Handbook.pdf The data is available as a delimited text file (delimiter: semicolon) and as a Stata (14) data file (all variables are coded as strings). Missing values are denoted by “Missing”.
Data from: Nairobi Motorcycle Transit Comparison Dataset: Fuel vs. Electric...
scholardata.sun.ac.za
data.mendeley.com
Updated Mar 8, 2025
Cite
Martin Kitetu; Alois Mbutura; Halloran Stratford; MJ Booysen (2025). Nairobi Motorcycle Transit Comparison Dataset: Fuel vs. Electric Vehicle Performance Tracking (2023) [Dataset]. http://doi.org/10.25413/sun.28554200.v1
Unique identifier
https://doi.org/10.25413/sun.28554200.v1
Dataset updated
Mar 8, 2025
Dataset provided by
SUNScholarData
Authors
Martin Kitetu; Alois Mbutura; Halloran Stratford; MJ Booysen
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Nairobi
Description
This dataset contains GPS tracking data and performance metrics for motorcycle taxis (boda bodas) in Nairobi, Kenya, comparing traditional internal combustion engine (ICE) motorcycles with electric motorcycles. The study was conducted in two phases:

- Baseline Phase: 118 ICE motorcycles tracked over 14 days (2023-11-13 to 2023-11-26)
- Transition Phase: 108 ICE motorcycles (control) and 9 electric motorcycles (treatment) tracked over 12 days (2023-12-10 to 2023-12-21)

The dataset is organised into two main categories:

- Trip Data: Individual trip-level records containing timing, distance, duration, location, and speed metrics
- Daily Data: Daily aggregated summaries containing usage metrics, economic data, and energy consumption

This dataset enables comparative analysis of electric vs. ICE motorcycle performance, economic modelling of transportation costs, environmental impact assessment, urban mobility pattern analysis, and energy efficiency studies in emerging markets.

Institutions:

- EED Advisory
- Clean Air Taskforce
- Stellenbosch University

Steps to reproduce:

Raw Data Collection

- GPS tracking devices installed on motorcycles, collecting location data at 10-second intervals
- Rider-reported information on revenue, maintenance costs, and fuel/electricity usage

Processing Steps

- GPS data cleaning: Filtered invalid coordinates, removed duplicates, interpolated missing points
- Trip identification: Defined by >1 minute stationary periods or ignition cycles
- Trip metrics calculation: Distance, duration, idle time, average/max speeds
- Daily data aggregation: Summed by user_id and date with self-reported economic data
- Validation: Cross-checked with rider logs and known routes
- Anonymisation: Removed start and end coordinates for first and last trips of each day to protect rider privacy and home locations

Technical Information

- Geographic coverage: Nairobi, Kenya
- Time period: November-December 2023
- Time zone: UTC+3 (East Africa Time)
- Currency: Kenyan Shillings (KES)
- Data format: CSV files
- Software used: Python 3.8 (pandas, numpy, geopy)

Notes: Some location data points are intentionally missing to protect rider privacy. Self-reported economic and energy consumption data has some missing values where riders did not report.

Categories: Motorcycle, Transportation in Africa, Electric Vehicles
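
The trip identification rule (a new trip after more than 1 minute stationary) might be sketched as below. This is not the authors' pipeline: the input layout, the 1 km/h speed threshold for "moving", and the decision to leave stationary samples out of trips are assumptions, and ignition cycles are not modelled.

```python
def split_into_trips(points, max_stop_seconds=60, moving_speed_kmh=1.0):
    """Split time-ordered (timestamp_seconds, speed_kmh) samples into trips.

    A new trip starts when movement resumes after a stop longer than
    max_stop_seconds. Stationary samples are not included in any trip.
    """
    trips = []
    current = []
    stop_started = None
    for timestamp, speed in points:
        if speed >= moving_speed_kmh:
            long_stop = (
                stop_started is not None
                and timestamp - stop_started > max_stop_seconds
            )
            if long_stop and current:
                trips.append(current)
                current = []
            stop_started = None
            current.append((timestamp, speed))
        elif stop_started is None:
            stop_started = timestamp  # first stationary sample of this stop
    if current:
        trips.append(current)
    return trips


# Hypothetical 10-second samples: a 20 s pause, then an 80 s stop
points = (
    [(0, 20.0), (10, 25.0), (20, 0.0), (30, 0.0), (40, 22.0)]
    + [(t, 0.0) for t in range(50, 130, 10)]
    + [(130, 18.0), (140, 19.0)]
)
trips = split_into_trips(points)
```

In this made-up track the short pause stays inside the first trip, while the longer stop separates it from the second one.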
Klib library python
kaggle.com
Updated Jan 11, 2021
Cite
Sripaad Srinivasan (2021). Klib library python [Dataset]. https://www.kaggle.com/sripaadsrinivasan/klib-library-python/discussion
Dataset updated
Jan 11, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sripaad Srinivasan
Description
The klib library enables us to quickly visualize missing data, perform data cleaning, and plot data distributions, correlations and categorical column values. klib is a Python library for importing, cleaning, analyzing and preprocessing data. Explanations on key functionalities can be found on Medium / TowardsDataScience in the examples section or on YouTube (Data Professor).

Original Github repo


Usage

!pip install klib

import klib
import pandas as pd

df = pd.DataFrame(data)

# klib.describe functions for visualizing datasets
klib.cat_plot(df)  # returns a visualization of the number and frequency of categorical features
klib.corr_mat(df)  # returns a color-encoded correlation matrix
klib.corr_plot(df)  # returns a color-encoded heatmap, ideal for correlations
klib.dist_plot(df)  # returns a distribution plot for every numeric feature
klib.missingval_plot(df)  # returns a figure containing information about missing values

Examples

Take a look at this starter notebook.

Further examples, as well as applications of the functions can be found here.

Contributing

Pull requests and ideas, especially for further functions are welcome. For major changes or feedback, please open an issue first to discuss what you would like to change. Take a look at this Github repo.

License

MIT
WNBA Games Box Score Since 1997
kaggle.com
Updated Jan 13, 2021
Cite
Rafael Greca (2021). WNBA Games Box Score Since 1997 [Dataset]. https://www.kaggle.com/datasets/rafaelgreca/wnba-games-box-score-since-1997/discussion
Dataset updated
Jan 13, 2021
Dataset provided by
Kaggle
Authors
Rafael Greca
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Content

This data set contains the box score of every WNBA game from 1997 to the present. You can get the data individually for each season or decade, or as a compilation of all the data. In total, the data set has approximately 100 features (columns) that range from basic stats (like total points, rebounds, assists, blocks, and so on) to more advanced ones (like floor impact counter, assist rate, possessions, pace, play% and much more!).

Each game contains the same features for the home team and its opponent (away team), plus some other features related to the game itself (like game date, season, season type and match winner). If you like stats and NBA, this data set was made for you!

If you want to know more about the formulas used and their meaning, please check the reference section. You can also check the “features_description” file. There you will find a brief description of each feature and its respective formula (only for more advanced stats).

Acknowledgements

LAST TIME THE DATA SET WAS UPDATED:

January 13, 2021 (01/13/2021) – 1pm EDT

Questions about the dataset:

Q: How did you collect the data? A: I created a web scraper using Python to do the hard work.

Q: How did you fill the missing values? A: For the float columns I filled them with “0.0”. For the object columns I left a NaN value, but you don’t need to worry about it. The only columns where I needed to do that were teamWins, teamLosses, opptWins and opptLosses, and only 8 rows in the entire data set have NaN values. Great news, isn’t it?

Q: Where can I see the description/formula for each attribute/column/feature? A: You can check it out in the “features_informations” file inside the data set.

Q: Will you constantly update the data set? A: Yes!

Q: Does the data contain only regular season games? A: No! The data contains playoff games as well.
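
The missing-value policy from the answers above (float columns filled with 0.0, object columns left as missing) could be expressed as in this sketch. It is a plain-Python illustration rather than the author's scraper code; the float column name `pace` and the sample row are hypothetical.

```python
import math


def fill_missing(rows, float_columns):
    """Fill missing values in float columns with 0.0; leave other columns alone."""
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not modified
        for col in float_columns:
            value = new_row.get(col)
            if value is None or (isinstance(value, float) and math.isnan(value)):
                new_row[col] = 0.0
        filled.append(new_row)
    return filled


# Hypothetical row: a missing float stat and a missing object column
rows = [{"pace": float("nan"), "teamWins": None}]
filled = fill_missing(rows, ["pace"])
```

Here `pace` becomes 0.0 while `teamWins` stays missing, which matches the behaviour described for the object columns.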

References

About the stats and formulas used: https://www.basketball-reference.com/about/glossary.html https://basketball.realgm.com/info/glossary https://www.kaggle.com/rafaelgreca/nba-games-box-score-since-1949 (My other data set about the NBA)

Where the data was collected: https://www.basketball-reference.com/leagues/
