67 datasets found

Learn Data Science Series Part 1
kaggle.com
Updated Dec 30, 2022
Cite
Rupesh Kumar (2022). Learn Data Science Series Part 1 [Dataset]. https://www.kaggle.com/datasets/hunter0007/learn-data-science-part-1
Dataset updated
Dec 30, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Rupesh Kumar
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Please feel free to share it with others and consider supporting me if you find it helpful ⭐️.

Overview:

Chapter 1: Getting started with pandas

Chapter 2: Analysis: Bringing it all together and making decisions

Chapter 3: Appending to DataFrame

Chapter 4: Boolean indexing of dataframes

Chapter 5: Categorical data

Chapter 6: Computational Tools

Chapter 7: Creating DataFrames

Chapter 8: Cross sections of different axes with MultiIndex

Chapter 9: Data Types

Chapter 10: Dealing with categorical variables

Chapter 11: Duplicated data

Chapter 12: Getting information about DataFrames

Chapter 13: Gotchas of pandas

Chapter 14: Graphs and Visualizations

Chapter 15: Grouping Data

Chapter 16: Grouping Time Series Data

Chapter 17: Holiday Calendars

Chapter 18: Indexing and selecting data

Chapter 19: IO for Google BigQuery

Chapter 20: JSON

Chapter 21: Making Pandas Play Nice With Native Python Datatypes

Chapter 22: Map Values

Chapter 23: Merge, join, and concatenate

Chapter 24: Meta: Documentation Guidelines

Chapter 25: Missing Data

Chapter 26: MultiIndex

Chapter 27: Pandas Datareader

Chapter 28: Pandas IO tools (reading and saving data sets)

Chapter 29: pd.DataFrame.apply

Chapter 30: Read MySQL to DataFrame

Chapter 31: Read SQL Server to Dataframe

Chapter 32: Reading files into pandas DataFrame

Chapter 33: Resampling

Chapter 34: Reshaping and pivoting

Chapter 35: Save pandas dataframe to a csv file

Chapter 36: Series

Chapter 37: Shifting and Lagging Data

Chapter 38: Simple manipulation of DataFrames

Chapter 39: String manipulation

Chapter 40: Using .ix, .iloc, .loc, .at and .iat to access a DataFrame

Chapter 41: Working with Time Series
Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction...
frontiersin.figshare.com
pdf
Updated Jun 1, 2023
Cite
Yi-Hui Zhou; Ehsan Saghapour (2023). Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction of Biomedical Data.PDF [Dataset]. http://doi.org/10.3389/fgene.2021.691274.s001
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/fgene.2021.691274.s001
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers
Authors
Yi-Hui Zhou; Ehsan Saghapour
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Completed datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, the most popular imputation methods mainly require scripting skills and are implemented using various packages and syntax. Thus, the implementation of a full suite of methods is generally out of reach to all except experienced data scientists. Moreover, imputation is often considered a separate exercise from exploratory data analysis, but should be considered part of the data exploration process. We have created a new graphical tool, ImputEHR, that is based on Python and allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as implementing machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.
Data from: Missing Value Imputation in Relational Data Using Variational...
tandf.figshare.com
txt
Updated Jul 11, 2025
Cite
Simon Fontaine; Jian Kang; Ji Zhu (2025). Missing Value Imputation in Relational Data Using Variational Inference [Dataset]. http://doi.org/10.6084/m9.figshare.29184891.v2
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.29184891.v2
Dataset updated
Jul 11, 2025
Dataset provided by
Taylor & Francis
Authors
Simon Fontaine; Jian Kang; Ji Zhu
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In real-world networks, node attributes are often only partially observed, necessitating imputation to support analysis or enable downstream tasks. However, most existing imputation methods overlook the rich information contained within the connectivity among nodes. This research is inspired by the premise that leveraging all available information should yield improved imputation, provided a sufficient association between attributes and edges. Consequently, we introduce a joint latent space model that produces a low-dimensional representation of the data and simultaneously captures the edge and node attribute information. This model relies on the pooling of information induced by shared latent variables, thus improving the prediction of node attributes and providing a more effective attribute imputation method. Our approach uses variational inference to approximate posterior distributions for these latent variables, resulting in predictive distributions for missing values. Through numerical experiments, conducted on both simulated data and real-world networks, we demonstrate that our proposed method successfully harnesses the joint structure information and significantly improves the imputation of missing attributes, specifically when the observed information is weak. Additional results, implementation details, a Python implementation, and the code reproducing the results are available online. Supplementary materials for this article are available online.
Pre-Processed Power Grid Frequency Time Series
explore.openaire.eu
data.niaid.nih.gov
Updated Apr 22, 2020
Cite
Johannes Kruse; Benjamin Schäfer; Dirk Witthaut (2020). Pre-Processed Power Grid Frequency Time Series [Dataset]. http://doi.org/10.5281/zenodo.5105820
Unique identifier
https://doi.org/10.5281/zenodo.5105820
Dataset updated
Apr 22, 2020
Authors
Johannes Kruse; Benjamin Schäfer; Dirk Witthaut
Description
Overview: This repository contains ready-to-use frequency time series as well as the corresponding pre-processing scripts in Python. The data covers three synchronous areas of the European power grid: Continental Europe, Great Britain, and Nordic. This work is part of the paper "Predictability of Power Grid Frequency" [1]. Please cite this paper when using the data and the code. For detailed documentation of the pre-processing procedure, we refer to the supplementary material of the paper.

Data sources: We downloaded the frequency recordings from publicly available repositories of three different Transmission System Operators (TSOs).

Continental Europe [2]: We downloaded the data from the German TSO TransnetBW GmbH, which retains the copyright on the data but allows it to be re-published upon request [3].

Great Britain [4]: The download was supported by National Grid ESO Open Data, which belongs to the British TSO National Grid. They publish the frequency recordings under the NGESO Open License [5].

Nordic [6]: We obtained the data from the Finnish TSO Fingrid, which provides the data under the open license CC-BY 4.0 [7].

Content of the repository

A) Scripts

In the "Download_scripts" folder you will find three scripts to automatically download frequency data from the TSOs' websites.

In "convert_data_format.py" we save the data with corrected timestamp formats. Missing data is marked as NaN (processing step (1) in the supplementary material of [1]).

In "clean_corrupted_data.py" we load the converted data and identify corrupted recordings. We mark them as NaN and clean some of the resulting data holes (processing step (2) in the supplementary material of [1]).

The Python scripts run with Python 3.7 and with the packages found in "requirements.txt".

B) Yearly converted and cleansed data The folders "
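The two processing steps named under A) (marking missing data as NaN, then marking corrupted recordings as NaN) can be illustrated with a small sketch. This is not the repository's code: the function names, the one-second sampling step, and the 49 to 51 Hz plausibility window are all assumptions chosen for illustration, and the actual corruption criteria are documented in the supplementary material of [1].

```python
import math
from datetime import datetime, timedelta


def to_regular_grid(samples, start, end, step=timedelta(seconds=1)):
    """Place (timestamp, value) samples on a regular grid; gaps become NaN.

    The one-second default step is an assumption, not taken from the source.
    """
    lookup = dict(samples)
    grid = []
    t = start
    while t <= end:
        grid.append((t, lookup.get(t, math.nan)))
        t += step
    return grid


def mask_out_of_range(grid, low=49.0, high=51.0):
    """Replace values outside [low, high] Hz with NaN.

    The window is a hypothetical corruption rule; NaN inputs stay NaN because
    every comparison with NaN is False.
    """
    return [(t, v if low <= v <= high else math.nan) for t, v in grid]


# Tiny made-up series: one missing second and one implausible reading.
t0 = datetime(2019, 1, 1, 0, 0, 0)
samples = [
    (t0, 50.01),
    (t0 + timedelta(seconds=1), 49.99),
    (t0 + timedelta(seconds=3), 0.0),
]
grid = to_regular_grid(samples, t0, t0 + timedelta(seconds=3))
cleaned = mask_out_of_range(grid)
```

Keeping the two steps separate mirrors the split between the conversion script and the cleaning script described above, so either rule can be changed without touching the other.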
Data from: A comprehensive dataset for the accelerated development and...
data.niaid.nih.gov
explore.openaire.eu
Updated Jan 24, 2020
Cite
Larson, David (2020). A comprehensive dataset for the accelerated development and benchmarking of solar forecasting methods [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2826938
Dataset updated
Jan 24, 2020
Dataset provided by
Larson, David
Coimbra, Carlos
Carreira Pedro, Hugo
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains a comprehensive solar irradiance, imaging, and forecasting dataset. The goal with this release is to provide standardized solar and meteorological datasets to the research community for the accelerated development and benchmarking of forecasting methods. The data consist of three years (2014–2016) of quality-controlled, 1-min resolution global horizontal irradiance and direct normal irradiance ground measurements in California. In addition, we provide overlapping data from commonly used exogenous variables, including sky images, satellite imagery, Numerical Weather Prediction forecasts, and weather data. We also include sample code for baseline models, against which more elaborate models can be benchmarked.

Data usage: The usage of the datasets and sample code presented here is intended for research and development purposes only and implies explicit reference to the paper: Pedro, H.T.C., Larson, D.P., Coimbra, C.F.M., 2019. A comprehensive dataset for the accelerated development and benchmarking of solar forecasting methods. Journal of Renewable and Sustainable Energy 11, 036102. https://doi.org/10.1063/1.5094494

Although every effort was made to ensure the quality of the data, no guarantees or liabilities are implied by the authors or publishers of the data.

Sample code: As part of the data release, we are also including sample code written in Python 3. The preprocessed data used in the scripts are also provided. The code can be used to reproduce the results presented in this work and as a starting point for future studies. Besides the standard scientific Python packages (numpy, scipy, and matplotlib), the code depends on pandas for time-series operations, pvlib for common solar-related tasks, and scikit-learn for machine learning models. All required Python packages are readily available on Mac, Linux, and Windows and can be installed via, e.g., pip.

Units: All time stamps are in UTC (YYYY-MM-DD HH:MM:SS). All irradiance and weather data are in SI units. Sky image features are derived from 8-bit RGB (256 color levels) data. Satellite images are derived from 8-bit gray-scale (256 color levels) data.

Missing data: The string "NAN" indicates missing data.

File formats: All time series data files are in CSV (comma-separated values) format. Images are provided as tar.bz2 files.
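The conventions above (UTC time stamps in YYYY-MM-DD HH:MM:SS, the string "NAN" for missing values, CSV files) suggest how a reader might parse one of the time series files. The sketch below uses only the standard library and an inline two-row sample; the column names `timestamp`, `ghi`, and `dni` are assumptions for illustration and may not match the headers in the released files.

```python
import csv
import io
import math
from datetime import datetime, timezone


def parse_row(row):
    """Convert one CSV row: a UTC time stamp plus floats, with "NAN" as NaN."""
    parsed = {}
    for key, value in row.items():
        if key == "timestamp":  # hypothetical column name
            parsed[key] = datetime.strptime(
                value, "%Y-%m-%d %H:%M:%S"
            ).replace(tzinfo=timezone.utc)
        else:
            parsed[key] = math.nan if value.strip() == "NAN" else float(value)
    return parsed


# Inline sample standing in for a real file; the values are made up.
sample = io.StringIO(
    "timestamp,ghi,dni\n"
    "2014-01-02 08:00:00,12.5,NAN\n"
    "2014-01-02 08:01:00,13.0,40.2\n"
)
rows = [parse_row(r) for r in csv.DictReader(sample)]
```

With pandas, which the sample code already depends on, the same effect would typically be achieved by passing `na_values=["NAN"]` and `parse_dates` to `pandas.read_csv`.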

Files

Folsom_irradiance.csv Primary One-minute GHI, DNI, and DHI data.

Folsom_weather.csv Primary One-minute weather data.

Folsom_sky_images_{YEAR}.tar.bz2 Primary Tar archives with daytime sky images captured at 1-min intervals for the years 2014, 2015, and 2016, compressed with bz2.

Folsom_NAM_lat{LAT}_lon{LON}.csv Primary NAM forecasts for the four nodes nearest the target location. {LAT} and {LON} are replaced by the node’s coordinates listed in Table I in the paper.

Folsom_sky_image_features.csv Secondary Features derived from the sky images.

Folsom_satellite.csv Secondary 10 pixel by 10 pixel GOES-15 images centered in the target location.

Irradiance_features_{horizon}.csv Secondary Irradiance features for the different forecasting horizons ({horizon} = {intra-hour, intra-day, day-ahead}).

Sky_image_features_intra-hour.csv Secondary Sky image features for the intra-hour forecasting issuing times.

Sat_image_features_intra-day.csv Secondary Satellite image features for the intra-day forecasting issuing times.

NAM_nearest_node_day-ahead.csv Secondary NAM forecasts (GHI, DNI computed with the DISC algorithm, and total cloud cover) for the nearest node to the target location prepared for day-ahead forecasting.

Target_{horizon}.csv Secondary Target data for the different forecasting horizons.

Forecast_{horizon}.py Code Python script used to create the forecasts for the different horizons.

Postprocess.py Code Python script used to compute the error metric for all the forecasts.
Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 - 7—Data
res1catalogd-o-tdatad-o-tgov.vcapture.xyz
data.usgs.gov
Updated Jul 6, 2024
Cite
U.S. Geological Survey (2024). Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 - 7—Data [Dataset]. https://res1catalogd-o-tdatad-o-tgov.vcapture.xyz/dataset/variable-terrestrial-gps-telemetry-detection-rates-parts-1-7data
Dataset updated
Jul 6, 2024
Dataset provided by
U.S. Geological Survey
Description
Studies utilizing Global Positioning System (GPS) telemetry rarely result in 100% fix success rates (FSR). Many assessments of wildlife resource use do not account for missing data, either assuming data loss is random or because of a lack of practical treatment for systematic data loss. Several studies have explored how the environment, technological features, and animal behavior influence rates of missing data in GPS telemetry, but previous spatially explicit models developed to correct for sampling bias have been specified to small study areas, on a small range of data loss, or to be species-specific, limiting their general utility. Here we explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use. We also evaluate patterns in missing data that relate to potential animal activities that change the orientation of the antennae and characterize home-range probability of GPS detection for four focal species: cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni) and mule deer (Odocoileus hemionus).

Part 1, Positive Openness Raster (raster dataset): Openness is an angular measure of the relationship between surface relief and horizontal distance. For angles less than 90 degrees it is equivalent to the internal angle of a cone with its apex at a DEM location, and is constrained by neighboring elevations within a specified radial distance. A 480-meter search radius was used for this calculation of positive openness. Openness incorporates the terrain line-of-sight or viewshed concept and is calculated from multiple zenith and nadir angles, here along eight azimuths. Positive openness measures openness above the surface, with high values for convex forms and low values for concave forms (Yokoyama et al. 2002). We calculated positive openness using a custom Python script, following the methods of Yokoyama et al. (2002), using a USGS National Elevation Dataset as input.

Part 2, Northern Arizona GPS Test Collar (csv): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall overstory vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent datasets from stationary test collars of different make/model, fix interval programming, and placed at different study sites. The absence of statistically significant differences (95% CI) between predicted and observed FSRs suggests that changes in technological factors have minor influence on the model's ability to predict FSR in new study areas in the southwestern US. The model training data are provided here for fix attempts by hour. This table can be linked with the site location shapefile using the site field.

Part 3, Probability Raster (raster dataset): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall overstory vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent datasets from stationary test collars of different make/model, fix interval programming, and placed at different study sites. The absence of statistically significant differences (95% CI) between predicted and observed FSRs suggests that changes in technological factors have minor influence on the model's ability to predict FSR in new study areas in the southwestern US. We evaluated GPS telemetry datasets by comparing the mean probability of a successful GPS fix across study animals' home ranges to the actual observed FSR of GPS-downloaded collars deployed on cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni) and mule deer (Odocoileus hemionus). Comparing the mean probability of acquisition within study animals' home ranges and observed FSRs of GPS-downloaded collars resulted in an approximately 1:1 linear relationship with r-squared = 0.68.

Part 4, GPS Test Collar Sites (shapefile): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall overstory vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent datasets from stationary test collars of different make/model, fix interval programming, and placed at different study sites. The absence of statistically significant differences (95% CI) between predicted and observed FSRs suggests that changes in technological factors have minor influence on the model's ability to predict FSR in new study areas in the southwestern US.

Part 5, Cougar Home Ranges (shapefile): Cougar home ranges were calculated to compare the mean probability of a GPS fix acquisition across the home range to the actual fix success rate (FSR) of the collar, as a means of evaluating whether characteristics of an animal's home range have an effect on observed FSR. We estimated home ranges using the Local Convex Hull (LoCoH) method with the 90th isopleth. Only data obtained from GPS download of retrieved units were used. Satellite-delivered data were omitted from the analysis for animals where the collar was lost or damaged, because satellite delivery tends to lose approximately an additional 10% of data. Comparisons with home-range mean probability of fix were also used as a reference for assessing whether the frequency with which animals use areas of low GPS acquisition rates may play a role in observed FSRs.

Part 6, Cougar Fix Success Rate by Hour (csv): Cougar GPS collar fix success varied by hour of day, suggesting that circadian rhythms with bouts of rest during daylight hours may change the orientation of the GPS receiver, affecting the ability to acquire fixes. Raw data of overall fix success rates (FSR) and FSR by hour were used to predict relative reductions in FSR. Data only include direct GPS download datasets. Satellite-delivered data were omitted from the analysis for animals where the collar was lost or damaged, because satellite delivery tends to lose approximately an additional 10% of data.

Part 7, Openness Python Script version 2.0: This Python script was used to calculate positive openness using a 30-meter digital elevation model for a large geographic area in Arizona, California, Nevada and Utah. A scientific research project used the script to explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use.
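Since Parts 2 and 6 report fix attempts and fix success rates by hour, a short sketch may help make the quantity concrete: FSR is the number of successful fixes divided by the number of attempted fixes. The record layout `(hour, attempts, successes)` and the function name are assumptions for illustration only; the actual column names in the released csv files are not given in this description.

```python
from collections import defaultdict


def fix_success_rate_by_hour(records):
    """Aggregate (hour, attempts, successes) records into FSR per hour.

    FSR = successful fixes / attempted fixes. Hours with zero attempts are
    skipped to avoid dividing by zero.
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for hour, n_attempts, n_success in records:
        attempts[hour] += n_attempts
        successes[hour] += n_success
    return {h: successes[h] / attempts[h] for h in attempts if attempts[h] > 0}


# Made-up records: two collars reporting at hour 0, one at hour 12.
records = [(0, 10, 9), (0, 10, 7), (12, 20, 10)]
fsr = fix_success_rate_by_hour(records)
```

Summing attempts and successes before dividing weights each hour by the number of fix attempts, which is one reasonable choice; averaging per-collar rates instead would weight each collar equally.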
Global Freelancers (Raw) Dataset
kaggle.com
Updated Jul 5, 2025
Cite
Urvish Ahir (2025). Global Freelancers (Raw) Dataset [Dataset]. https://www.kaggle.com/datasets/urvishahir/global-freelancers-raw-dataset
Dataset updated
Jul 5, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Urvish Ahir
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset contains 1,000 fictional freelancer profiles from around the world, designed to reflect realistic variability and messiness often encountered in real-world data collection.

Each entry includes demographic, professional, and platform-related information such as:

Name, gender, age, and country

Primary skill and years of experience

Hourly rate (with mixed formatting), client rating, and satisfaction score

Language spoken (based on country)

Inconsistent and unclean values across several fields (e.g., gender, is_active, satisfaction)

Key Features:

Gender-based names using Faker’s male/female name generators

Realistic age and experience distribution (with missing and noisy values)

Country-language pairs mapped using actual linguistic data

Messy formatting: mixed data types, missing values, inconsistent casing

Generated entirely in Python using the faker library; no real data used

Use Cases:

Practicing data cleaning and preprocessing

Performing EDA (Exploratory Data Analysis)

Developing data pipelines: raw → clean → model-ready

Teaching feature engineering and handling real-world dirty data

Exercises in data validation, outlier detection, and format standardization

File: global_freelancers_raw.csv

| Column Name | Description |
| --------------------- | ------------------------------------------------------------------------ |
| `freelancer_ID` | Unique ID starting with `FL` (e.g., FL250001) |
| `name` | Full name of freelancer (based on gender) |
| `gender` | Gender (messy values and case inconsistency) |
| `age` | Age of the freelancer (20–60, with occasional nulls/outliers) |
| `country` | Country name (with random formatting/casing) |
| `language` | Language spoken (mapped from country) |
| `primary_skill` | Key freelance domain (e.g., Web Dev, AI, Cybersecurity) |
| `years_of_experience` | Work experience in years (some missing values or odd values included) |
| `hourly_rate (USD)` | Hourly rate with currency symbols or missing data |
| `rating` | Rating between 1.0–5.0 (some zeros and nulls included) |
| `is_active` | Active status (inconsistently represented as strings, numbers, booleans) |
| `client_satisfaction` | Satisfaction percentage (e.g., "85%" or 85, may include NaNs) |
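As a rough illustration of the cleaning exercises this dataset is meant for, the sketch below normalizes three of the messy columns listed in the table. The column names come from the table; everything else is an assumption, including the helper names and the specific raw spellings handled (for example "$40", "85%", "Yes"), since the exact messy values in the file are not documented here.

```python
import math
import re

# Hypothetical spellings of the active flag; the real file may use others.
TRUE_VALUES = {"1", "true", "yes", "y"}
FALSE_VALUES = {"0", "false", "no", "n"}


def parse_number(value):
    """Pull the first number out of values like "$40", "85%", or 85.

    Returns NaN when no digits are found (None, "", "nan", ...).
    """
    if value is None:
        return math.nan
    match = re.search(r"\d+(?:\.\d+)?", str(value).replace(",", ""))
    return float(match.group()) if match else math.nan


def parse_active(value):
    """Map mixed string/number/boolean flags to True, False, or None."""
    text = str(value).strip().lower()
    if text in TRUE_VALUES:
        return True
    if text in FALSE_VALUES:
        return False
    return None


# One made-up raw row using column names from the table above.
raw = {"hourly_rate (USD)": "$40", "client_satisfaction": "85%", "is_active": "Yes"}
clean = {
    "hourly_rate (USD)": parse_number(raw["hourly_rate (USD)"]),
    "client_satisfaction": parse_number(raw["client_satisfaction"]),
    "is_active": parse_active(raw["is_active"]),
}
```

Returning NaN or None for unparseable values, rather than raising, keeps the row count intact so that missingness can be inspected afterwards as part of the EDA.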
Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 - 7 Data.
datadiscoverystudio.org
Updated May 20, 2018
Cite
(2018). Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 - 7 Data. [Dataset]. http://datadiscoverystudio.org/geoportal/rest/metadata/item/4df55111949a4b20aaf19e0d501595a7/html
Dataset updated
May 20, 2018
Description
description: Studies utilizing Global Positioning System (GPS) telemetry rarely result in 100% fix success rates (FSR). Many assessments of wildlife resource use do not account for missing data, either assuming data loss is random or because a lack of practical treatment for systematic data loss. Several studies have explored how the environment, technological features, and animal behavior influence rates of missing data in GPS telemetry, but previous spatially explicit models developed to correct for sampling bias have been specified to small study areas, on a small range of data loss, or to be species-specific, limiting their general utility. Here we explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use. We also evaluate patterns in missing data that relate to potential animal activities that change the orientation of the antennae and characterize home-range probability of GPS detection for 4 focal species; cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni) and mule deer (Odocoileus hemionus). Part 1, Positive Openness Raster (raster dataset): Openness is an angular measure of the relationship between surface relief and horizontal distance. For angles less than 90 degrees it is equivalent to the internal angle of a cone with its apex at a DEM location, and is constrained by neighboring elevations within a specified radial distance. 480 meter search radius was used for this calculation of positive openness. Openness incorporates the terrain line-of-sight or viewshed concept and is calculated from multiple zenith and nadir angles-here along eight azimuths. Positive openness measures openness above the surface, with high values for convex forms and low values for concave forms (Yokoyama et al. 2002). 
We calculated positive openness using a custom python script, following the methods of Yokoyama et. al (2002) using a USGS National Elevation Dataset as input. Part 2, Northern Arizona GPS Test Collar (csv): Bias correction in GPS telemetry data-sets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall over-story vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent data-sets from stationary test collars of different make/model, fix interval programming, and placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs, suggest changes in technological factors have minor influence on the models ability to predict FSR in new study areas in the southwestern US. The model training data are provided here for fix attempts by hour. This table can be linked with the site location shapefile using the site field. Part 3, Probability Raster (raster dataset): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix aquistion. We found terrain exposure and tall overstory vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. 
The models predictive ability was evaluated using two independent datasets from stationary test collars of different make/model, fix interval programing, and placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs, suggest changes in technological factors have minor influence on the models ability to predict FSR in new study areas in the southwestern US. We evaluated GPS telemetry datasets by comparing the mean probability of a successful GPS fix across study animals home-ranges, to the actual observed FSR of GPS downloaded deployed collars on cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni) and mule deer (Odocoileus hemionus). Comparing the mean probability of acquisition within study animals home-ranges and observed FSRs of GPS downloaded collars resulted in a approximatly 1:1 linear relationship with an r-sq= 0.68. Part 4, GPS Test Collar Sites (shapefile): Bias correction in GPS telemetry data-sets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall over-story vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent data-sets from stationary test collars of different make/model, fix interval programming, and placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs, suggest changes in technological factors have minor influence on the models ability to predict FSR in new study areas in the southwestern US. 
Part 5, Cougar Home Ranges (shapefile): Cougar home-ranges were calculated to compare the mean probability of a GPS fix acquisition across the home-range to the actual fix success rate (FSR) of the collar as a means for evaluating if characteristics of an animals home-range have an effect on observed FSR. We estimated home-ranges using the Local Convex Hull (LoCoH) method using the 90th isopleth. Data obtained from GPS download of retrieved units were only used. Satellite delivered data was omitted from the analysis for animals where the collar was lost or damaged because satellite delivery tends to lose as additional 10% of data. Comparisons with home-range mean probability of fix were also used as a reference for assessing if the frequency animals use areas of low GPS acquisition rates may play a role in observed FSRs. Part 6, Cougar Fix Success Rate by Hour (csv): Cougar GPS collar fix success varied by hour-of-day suggesting circadian rhythms with bouts of rest during daylight hours may change the orientation of the GPS receiver affecting the ability to acquire fixes. Raw data of overall fix success rates (FSR) and FSR by hour were used to predict relative reductions in FSR. Data only includes direct GPS download datasets. Satellite delivered data was omitted from the analysis for animals where the collar was lost or damaged because satellite delivery tends to lose approximately an additional 10% of data. Part 7, Openness Python Script version 2.0: This python script was used to calculate positive openness using a 30 meter digital elevation model for a large geographic area in Arizona, California, Nevada and Utah. 
A scientific research project used the script to explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use.

Abstract: Studies utilizing Global Positioning System (GPS) telemetry rarely result in 100% fix success rates (FSR). Many assessments of wildlife resource use do not account for missing data, either assuming data loss is random or because of a lack of practical treatment for systematic data loss. Several studies have explored how the environment, technological features, and animal behavior influence rates of missing data in GPS telemetry, but previous spatially explicit models developed to correct for sampling bias have been specified to small study areas, on a small range of data loss, or to be species-specific, limiting their general utility. Here we explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use. We also evaluate patterns in missing data that relate to potential animal activities that change the orientation of the antennae, and characterize home-range probability of GPS detection for 4 focal species: cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni) and mule deer (Odocoileus hemionus).

Part 1, Positive Openness Raster (raster dataset): Openness is an angular measure of the relationship between surface relief and horizontal distance. For angles less than 90 degrees it is equivalent to the internal angle of a cone with its apex at a DEM location, and is constrained by neighboring elevations within a specified radial distance. A 480 meter search radius was used for this calculation of positive openness.
Openness incorporates the terrain line-of-sight or viewshed concept and is calculated from multiple zenith and nadir angles, here along eight azimuths. Positive openness measures openness above the surface, with high values for convex forms and low values for concave forms (Yokoyama et al. 2002). We calculated positive openness using a custom Python script, following the methods of Yokoyama et al. (2002), using a USGS National Elevation Dataset as input.

Part 2, Northern Arizona GPS Test Collar (csv): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a
ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture...
researchdata.tuwien.ac.at
b2find.eudat.eu
zip
Updated Jun 6, 2025
Cite
Wolfgang Preimesberger; Pietro Stradiotti; Wouter Arnoud Dorigo (2025). ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture from merged multi-satellite observations [Dataset]. http://doi.org/10.48436/3fcxr-cde10
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.48436/3fcxr-cde10
Dataset updated
Jun 6, 2025
Dataset provided by
TU Wien
Authors
Wolfgang Preimesberger; Pietro Stradiotti; Wouter Arnoud Dorigo
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset was produced with funding from the European Space Agency (ESA) Climate Change Initiative (CCI) Plus Soil Moisture Project (CCN 3 to ESRIN Contract No: 4000126684/19/I-NB "ESA CCI+ Phase 1 New R&D on CCI ECVS Soil Moisture"). Project website: https://climate.esa.int/en/projects/soil-moisture/

This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.

Dataset paper (public preprint)

A description of this dataset, including the methodology and validation results, is available at:

Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.

Abstract

ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
Since the requirement of a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such gap-filling methods is that they rely only on the original observational record, without the need for ancillary variables or model-based information. Due to the intrinsic challenge, no global, long-term univariate gap-filled product has been available until now. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments to fill satellite-like gaps introduced to GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and consider the gap size and local vegetation conditions as parameters that affect the gap-filling performance.

Summary

Gap-filled global estimates of volumetric surface soil moisture from 1991-2023 at 0.25° sampling

Fields of application (partial): climate variability and change, land-atmosphere interactions, global biogeochemical cycles and ecology, hydrological and land surface modelling, drought applications, and meteorology

Method: Modified version of DCT-PLS (Garcia, 2010) interpolation/smoothing algorithm, linear interpolation over periods of frozen soils. Uncertainty estimates are provided for all data points.

More information: See Preimesberger et al. (2025) and the ESA CCI SM Algorithm Theoretical Baseline Document [Chapter 7.2.9] (Dorigo et al., 2023), https://doi.org/10.5281/zenodo.8320869

Programmatic Download

You can use command line tools such as wget or curl to download (and extract) data for multiple years. The following script downloads and extracts the complete data set to the local directory ~/Downloads on Linux or macOS systems.

#!/bin/bash

# Set download directory
DOWNLOAD_DIR=~/Downloads

base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"

# Loop through years 1991 to 2023 and download & extract data
for year in {1991..2023}; do
    echo "Downloading $year.zip..."
    wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
    # Quote the target directory so paths containing spaces are handled correctly
    unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
    rm "$DOWNLOAD_DIR/$year.zip"
done

Data details

The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), each subdirectory containing one netCDF image file for a specific day (DD), month (MM) in a 2-dimensional (longitude, latitude) grid system (CRS: WGS84). The file name has the following convention:

ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc
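
The naming convention above can be turned into a small helper for locating the file of a given day. This is a minimal sketch: the function names are hypothetical, and the assumption that each file sits in a subdirectory named after its year follows the description above but is not verified against the archive.

```python
from datetime import date

# Prefix and suffix copied from the file naming convention above.
PREFIX = "ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-"
SUFFIX = "000000-fv09.1r1.nc"


def gapfilled_filename(day: date) -> str:
    """Return the netCDF file name for one daily image."""
    return f"{PREFIX}{day:%Y%m%d}{SUFFIX}"


def gapfilled_relpath(day: date) -> str:
    """Return the path relative to the extraction directory.

    Assumes the layout YYYY/<file name>, which is an assumption of this sketch.
    """
    return f"{day:%Y}/{gapfilled_filename(day)}"


print(gapfilled_relpath(date(2005, 7, 14)))
```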

Data Variables

Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:

sm: (float) The Soil Moisture variable reflects estimates of daily average volumetric soil moisture content (m3/m3) in the soil surface layer (~0-5 cm) over a whole grid cell (0.25 degree).

sm_uncertainty: (float) The Soil Moisture Uncertainty variable reflects the uncertainty (random error) of the original satellite observations and of the predictions used to fill observation data gaps.

sm_anomaly: Soil moisture anomalies (reference period 1991-2020) derived from the gap-filled values (`sm`)

sm_smoothed: Contains DCT-PLS predictions used to fill data gaps in the original soil moisture field. These values are also provided for cases where an observation was initially available (compare `gapmask`). In this case, they provide a smoothed version of the original data.

gapmask: (0 | 1) Indicates grid cells where a satellite observation is available (1), and where the interpolated (smoothed) values are used instead (0) in the 'sm' field.

frozenmask: (0 | 1) Indicates grid cells where ERA5 soil temperature is <0 °C. In this case, a linear interpolation over time is applied.

Additional information for each variable is given in the netCDF attributes.
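
To illustrate how `gapmask` relates to `sm`, the sketch below separates observed from gap-filled values and reports the share of cells backed by a satellite observation. The values are made up and held in plain Python lists; the helper name is hypothetical. With xarray, a comparable selection could likely be written as `ds["sm"].where(ds["gapmask"] == 1)`, assuming the file has been opened as `ds`.

```python
def split_by_gapmask(sm, gapmask):
    """Separate soil moisture values into observed and gap-filled ones.

    gapmask == 1 marks cells with a satellite observation,
    gapmask == 0 marks cells where the interpolated value is used.
    """
    if len(sm) != len(gapmask):
        raise ValueError("sm and gapmask must have the same length")
    observed = [value for value, flag in zip(sm, gapmask) if flag == 1]
    filled = [value for value, flag in zip(sm, gapmask) if flag == 0]
    return observed, filled


# Made-up values for four grid cells (m3/m3), for illustration only.
sm = [0.21, 0.23, 0.22, 0.30]
gapmask = [1, 0, 0, 1]

observed, filled = split_by_gapmask(sm, gapmask)
coverage = len(observed) / len(sm)  # fraction of cells with an observation
print(observed, filled, coverage)
```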

Version Changelog

Changes in v09.1r1 (previous version was v09.1):

This version uses a novel uncertainty estimation scheme as described in Preimesberger et al. (2025).

Software to open netCDF files

These data can be read by any software that supports Climate and Forecast (CF) conform metadata standards for netCDF files, such as:

Xarray (python): https://github.com/pydata/xarray

netCDF4 (python): https://unidata.github.io/netcdf4-python/

esa_cci_sm (python): https://github.com/TUW-GEO/esa_cci_sm

Similar tools exist for other programming languages (Matlab, R, etc.)

Software packages and GIS tools can open netCDF files, e.g. CDO, NCO, QGIS, ArcGIS

You can also use the GUI software Panoply to view the contents of each file

References

Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.

Dorigo, W., Preimesberger, W., Stradiotti, P., Kidd, R., van der Schalie, R., van der Vliet, M., Rodriguez-Fernandez, N., Madelon, R., & Baghdadi, N. (2023). ESA Climate Change Initiative Plus - Soil Moisture Algorithm Theoretical Baseline Document (ATBD) Supporting Product Version 08.1 (version 1.1). Zenodo. https://doi.org/10.5281/zenodo.8320869

Garcia, D., 2010. Robust smoothing of gridded data in one and higher dimensions with missing values. Computational Statistics & Data Analysis, 54(4), pp.1167-1178. Available at: https://doi.org/10.1016/j.csda.2009.09.020

Rodell, M., Houser, P. R., Jambor, U., Gottschalck, J., Mitchell, K., Meng, C.-J., Arsenault, K., Cosgrove, B., Radakovich, J., Bosilovich, M., Entin, J. K., Walker, J. P., Lohmann, D., and Toll, D.: The Global Land Data Assimilation System, Bulletin of the American Meteorological Society, 85, 381 – 394, https://doi.org/10.1175/BAMS-85-3-381, 2004.

Related Records

The following records are all part of the Soil Moisture Climate Data Records from satellites community

1. ESA CCI SM MODELFREE Surface Soil Moisture Record: https://doi.org/10.48436/svr1r-27j77
Additional file 1 of Tensor extrapolation: an adaptation to data sets with...
springernature.figshare.com
html
Updated May 31, 2023
Cite
Josef Schosser (2023). Additional file 1 of Tensor extrapolation: an adaptation to data sets with missing entries [Dataset]. http://doi.org/10.6084/m9.figshare.19242463.v1
Explore at:
Available download formats: html
Unique identifier
https://doi.org/10.6084/m9.figshare.19242463.v1
Dataset updated
May 31, 2023
Dataset provided by
Figshare http://figshare.com/
Authors
Josef Schosser
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Additional file 1. Python code. The notebook highlights core components of the code applied in the study.
Science Education Research Topic Modeling Dataset
zenodo.org
data.niaid.nih.gov
bin, html +2
Updated Oct 9, 2024
Cite
Tor Ole B. Odden; Alessandro Marin; John L. Rudolph (2024). Science Education Research Topic Modeling Dataset [Dataset]. http://doi.org/10.5281/zenodo.4094974
Explore at:
Available download formats: bin, txt, html, text/x-python
Unique identifier
https://doi.org/10.5281/zenodo.4094974
Dataset updated
Oct 9, 2024
Dataset provided by
Zenodo http://zenodo.org/
Authors
Tor Ole B. Odden; Alessandro Marin; John L. Rudolph
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This dataset contains scraped and processed text from roughly 100 years of articles published in the Wiley journal Science Education (formerly General Science Quarterly). This text has been cleaned and filtered in preparation for analysis using natural language processing techniques, particularly topic modeling with latent Dirichlet allocation (LDA). We also include a Jupyter Notebook illustrating how one can use LDA to analyze this dataset and extract latent topics from it, as well as analyze the rise and fall of those topics over the history of the journal.

The articles were downloaded and scraped in December of 2019. Only non-duplicate articles with a listed author (according to the CrossRef metadata database) were included, and due to missing data and text recognition issues we excluded all articles published prior to 1922. This resulted in 5577 articles in total being included in the dataset. The text of these articles was then cleaned in the following way:

We removed duplicated text from each article: prior to 1969, articles in the journal were published in a magazine format in which the end of one article and the beginning of the next would share the same page, so we developed an automated detection of article beginnings and endings that was able to remove any duplicate text.

We removed the reference sections of the articles, as well as headings (in all caps) such as “ABSTRACT”.

We reunited any partial words that were separated due to line breaks, text recognition issues, or British vs. American spellings (for example converting “per cent” to “percent”)

We removed all numbers, symbols, special characters, and punctuation, and lowercased all words.

We removed all stop words, which are words without any semantic meaning on their own—“the”, “in,” “if”, “and”, “but”, etc.—and all single-letter words.

We lemmatized all words, with the added step of including a part-of-speech tagger so our algorithm would only aggregate and lemmatize words from the same part of speech (e.g., nouns vs. verbs).

We detected and created bi-grams, sets of words that frequently co-occur and carry additional meaning together. These words were combined with an underscore: for example, “problem_solving” and “high_school”.

After filtering, each document was then turned into a list of individual words (or tokens) which were then collected and saved (using the python pickle format) into the file scied_words_bigrams_V5.pkl.
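
A few of the cleaning steps above can be sketched in plain Python. This is a simplified illustration and not the authors' pipeline: the stop-word list and bigram set are tiny hypothetical stand-ins, bigrams are looked up from a fixed set instead of being detected from co-occurrence statistics, and lemmatization with part-of-speech tagging is left out entirely.

```python
import re

# Tiny illustrative lists; the lists used in the study are not given here.
STOP_WORDS = {"the", "in", "if", "and", "but", "of", "a", "to", "is", "was"}
BIGRAMS = {("problem", "solving"), ("high", "school")}


def clean_tokens(text):
    """Lowercase, strip non-letters, drop stop words, and join known bigrams."""
    # Lowercase, then replace anything that is not a letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Drop stop words and single-letter words.
    words = [w for w in text.split() if w not in STOP_WORDS and len(w) > 1]
    # Join adjacent words that form a known bigram with an underscore.
    tokens = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in BIGRAMS:
            tokens.append(words[i] + "_" + words[i + 1])
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens


print(clean_tokens("In 1957, problem solving in the high school was studied."))
# → ['problem_solving', 'high_school', 'studied']
```

A list of such token lists, one per document, is the kind of structure the description says was saved with pickle.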

In addition to this file, we have also included the following files:

SciEd_paper_names_weights.pkl: A file containing limited metadata (title, author, year published, and DOI) for each of the papers, in the same order as they appear within the main datafile. This file also includes the weights assigned by an LDA model used to analyze the data

Science Education LDA Notebook.ipynb: A notebook file that replicates our LDA analysis, with a written explanation of all of the steps and suggestions on how to explore the results.

Supporting files for the notebook. These include the requirements, the README, a helper script with functions for plotting that were too long to include in the notebook, and two HTML graphs that are embedded into the notebook.

This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to share text and data mining output for non-commercial research purposes. Any questions or comments can be directed to Tor Ole Odden, t.o.odden@fys.uio.no.
Klib library python
kaggle.com
Updated Jan 11, 2021
Cite
Sripaad Srinivasan (2021). Klib library python [Dataset]. https://www.kaggle.com/sripaadsrinivasan/klib-library-python/discussion
Dataset updated
Jan 11, 2021
Dataset provided by
Kaggle http://kaggle.com/
Authors
Sripaad Srinivasan
Description
klib library enables us to quickly visualize missing data, perform data cleaning, visualize data distribution plot, visualize correlation plot and visualize categorical column values. klib is a Python library for importing, cleaning, analyzing and preprocessing data. Explanations on key functionalities can be found on Medium / TowardsDataScience in the examples section or on YouTube (Data Professor).

Original Github repo


Usage

!pip install klib

import klib
import pandas as pd

df = pd.DataFrame(data)

# klib.describe: functions for visualizing datasets
klib.cat_plot(df)  # returns a visualization of the number and frequency of categorical features
klib.corr_mat(df)  # returns a color-encoded correlation matrix
klib.corr_plot(df)  # returns a color-encoded heatmap, ideal for correlations
klib.dist_plot(df)  # returns a distribution plot for every numeric feature
klib.missingval_plot(df)  # returns a figure containing information about missing values

Examples

Take a look at this starter notebook.

Further examples, as well as applications of the functions can be found here.

Contributing

Pull requests and ideas, especially for further functions are welcome. For major changes or feedback, please open an issue first to discuss what you would like to change. Take a look at this Github repo.

License

MIT
Adult dataset preprocessed
data.niaid.nih.gov
zenodo.org
Updated Jul 1, 2024
Cite
Pustozerova, Anastasia (2024). Adult dataset preprocessed [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12533513
Explore at:
Dataset updated
Jul 1, 2024
Dataset provided by
Pustozerova, Anastasia
Schuster, Verena
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The files "adult_train.csv" and "adult_test.csv" contain preprocessed versions of the Adult dataset from the UCI repository.

The file "adult_preprocessing.ipynb" contains a python notebook file with all the preprocessing steps used to generate "adult_train.csv" and "adult_test.csv" from the original Adult dataset.

The preprocessing steps include:

One-hot-encoding of categorical values

Imputation of missing values using knn-imputer with k=1

Standard scaling of ordinal attributes

Note: we assume a scenario in which the test set (every attribute except the target, "income") is available before training; we therefore combine the train and test sets before preprocessing.
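
The imputation step with k=1 can be pictured as copying a missing value from the single most similar record. The sketch below is a simplified pure-Python illustration of that idea, not the notebook's code: the function name and sample rows are hypothetical, and distance is plain Euclidean distance over the columns both rows have observed, which may differ from the metric used by the imputer in the notebook.

```python
import math


def impute_1nn(rows):
    """Fill None entries with the value from the nearest row that has it.

    Distance is Euclidean over the columns that both rows have observed.
    """
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            best, best_dist = None, math.inf
            for k, other in enumerate(rows):
                if k == i or other[j] is None:
                    continue  # a donor must be another row that has column j
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue  # no common columns to compare on
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared))
                if dist < best_dist:
                    best, best_dist = other[j], dist
            if best is not None:
                filled[i][j] = best
    return filled


# Made-up numeric rows; None marks a missing value.
rows = [
    [25.0, 40.0],
    [26.0, None],
    [60.0, 20.0],
]
print(impute_1nn(rows))
# → [[25.0, 40.0], [26.0, 40.0], [60.0, 20.0]]
```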
Geographical distribution and climate data of Cycas taiwaniana
scidb.cn
Updated Jan 3, 2025
Cite
CHUNPING XIE (2025). Geographical distribution and climate data of Cycas taiwaniana [Dataset]. http://doi.org/10.57760/sciencedb.19432
Unique identifier
https://doi.org/10.57760/sciencedb.19432
Dataset updated
Jan 3, 2025
Dataset provided by
Science Data Bank
Authors
CHUNPING XIE
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Dataset Description: Geographical Distribution and Climate Data of Cycas taiwaniana (Taiwanese Cycad)

This dataset contains the geographical distribution and climate data for Cycas taiwaniana, focusing on its presence across regions in Fujian, Guangdong, and Hainan provinces of China. The dataset includes geographical coordinates (longitude and latitude), monthly climate data (minimum and maximum temperature, and precipitation) across different months, as well as bioclimatic variables based on the WorldClim dataset.

Temporal and Spatial Information

The data covers long-term climate information, with monthly data for each location recorded over a 12-month period (January to December). The dataset includes spatial data in terms of longitude and latitude, corresponding to various locations where Cycas taiwaniana populations are present. The spatial resolution is specific to each point location, and the temporal resolution reflects the monthly climate data for each year.

Data Structure and Units

The dataset consists of 36 records, each representing a unique location with corresponding climate and geographical data. The table includes the following columns:

1. No.: Unique identifier for each data record
2. Longitude: Geographic longitude in decimal degrees
3. Latitude: Geographic latitude in decimal degrees
4. tmin1 to tmin12: Minimum temperature (°C) for each month (January to December)
5. tmax1 to tmax12: Maximum temperature (°C) for each month (January to December)
6. prec1 to prec12: Precipitation (mm) for each month (January to December)
7. bio1 to bio19: Bioclimatic variables (e.g., annual mean temperature, temperature seasonality, precipitation, etc.) derived from WorldClim data (unit varies depending on the variable)

The units for each measurement are as follows:

- Temperature: Degrees Celsius (°C)
- Precipitation: Millimeters (mm)
- Bioclimatic variables: Varies depending on the specific variable (e.g., °C, mm)

Data Gaps and Missing Values

The dataset contains some missing values, particularly in the "precipitation" columns for certain months and locations. These missing values may result from gaps in climate station data or limitations in data collection for specific regions. Missing values are indicated as "NA" (Not Available) in the dataset. In cases where data gaps exist, estimations were not made, and the absence of the data is acknowledged in the record.

File Format and Software Compatibility

The dataset is provided in CSV format for ease of use and compatibility with various data analysis tools. It can be opened and processed using software such as Microsoft Excel, R, or Python (with Pandas). Users can download the dataset and work with it in software such as R (https://cran.r-project.org/) or Python (https://www.python.org/). The dataset is compatible with any software that supports CSV files.

This dataset provides valuable information for research related to the geographical distribution and climate preferences of Cycas taiwaniana and can be used to inform conservation strategies, ecological studies, and climate change modeling.
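
Because missing values are stored as the text "NA", they need to be handled before any arithmetic on the precipitation columns. The sketch below shows one way to do this with the standard library. The two records are made up, only three of the twelve `prec` columns are included for brevity, and the exact header spelling in the real file is an assumption. With pandas, `pd.read_csv` treats "NA" as missing by default.

```python
import csv
import io

# Two made-up records using the column names described above.
SAMPLE = """No.,Longitude,Latitude,prec1,prec2,prec3
1,110.10,19.20,25.0,NA,48.5
2,113.50,23.10,40.2,55.1,NA
"""


def to_float(cell):
    """Convert a CSV cell to float, mapping the "NA" marker to None."""
    return None if cell.strip() == "NA" else float(cell)


def precipitation_summary(handle):
    """Sum the available monthly precipitation per record and count gaps."""
    summary = {}
    for record in csv.DictReader(handle):
        values = [to_float(v) for k, v in record.items() if k.startswith("prec")]
        present = [v for v in values if v is not None]
        summary[record["No."]] = {
            "total_mm": sum(present),
            "missing_months": len(values) - len(present),
        }
    return summary


for number, stats in precipitation_summary(io.StringIO(SAMPLE)).items():
    print(number, stats)
```

Counting the gaps alongside the total keeps a partial sum from being mistaken for a complete one.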
Replication Data for: No Data? No worries! How image generation can...
borealisdata.ca
Updated Apr 8, 2025
Cite
Simon Durand (2025). Replication Data for: No Data? No worries! How image generation can alleviate image scarcity in wildlife surveys: a case study with muskox (Ovibos moschatus) [Dataset]. http://doi.org/10.5683/SP3/LCOG2G
Unique identifier
https://doi.org/10.5683/SP3/LCOG2G
Dataset updated
Apr 8, 2025
Dataset provided by
Borealis
Authors
Simon Durand
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This repository contains all images (that can be publicly shared), annotation data in the form of JSON files, and Python scripts used to produce the paper "No Data? No worries! How image generation can alleviate image scarcity in wildlife surveys: a case study with muskox (Ovibos moschatus)".
Data and code associated with "The Observed Availability of Data and Code in...
zenodo.org
csv, text/x-python +1
Updated Feb 20, 2025
Cite
Erin Jones; Brandon McClung; Hadi Fawad; Amy McGovern (2025). Data and code associated with "The Observed Availability of Data and Code in Earth Science and Artificial Intelligence" [Dataset]. http://doi.org/10.5281/zenodo.14902836
Explore at:
Available download formats: csv, txt, text/x-python
Unique identifier
https://doi.org/10.5281/zenodo.14902836
Dataset updated
Feb 20, 2025
Dataset provided by
Zenodo http://zenodo.org/
Authors
Erin Jones; Brandon McClung; Hadi Fawad; Amy McGovern
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Feb 2025
Area covered
Earth
Description
Data and code associated with "The Observed Availability of Data and Code in Earth Science
and Artificial Intelligence" by Erin A. Jones, Brandon McClung, Hadi Fawad, and Amy McGovern.

Instructions: To reproduce figures, download all associated Python and CSV files and place
in a single directory.
Run BAMS_plot.py as you would run Python code on your system.

Code:
BAMS_plot.py: Python code for categorizing data availability statements based on given data
documented below and creating figures 1-3.

Code was originally developed for Python 3.11.7 and run in the Spyder
(version 5.4.3) IDE.

Libraries utilized:
numpy (version 1.26.4)
pandas (version 2.1.4)
matplotlib (version 3.8.0)

For additional documentation, please see code file.

Data:
ASDC_AIES.csv: CSV file containing relevant availability statement data for Artificial
Intelligence for the Earth Systems (AIES)
ASDC_AI_in_Geo.csv: CSV file containing relevant availability statement data for Artificial
Intelligence in Geosciences (AI in Geo.)
ASDC_AIJ.csv: CSV file containing relevant availability statement data for Artificial
Intelligence (AIJ)
ASDC_MWR.csv: CSV file containing relevant availability statement data for Monthly
Weather Review (MWR)

Data documentation:
All CSV files contain the same format of information for each journal. The CSV files above are
needed for the BAMS_plot.py code attached.

Records were analyzed based on the criteria below.

Records:
1) Title of paper
The title of the examined journal article.
2) Article DOI (or URL)
A link to the examined journal article. For AIES, AI in Geo., MWR, the DOI is
generally given. For AIJ, the URL is given.
3) Journal name
The name of the journal where the examined article is published. Either a full
journal name (e.g., Monthly Weather Review), or the acronym used in the
associated paper (e.g., AIES) is used.
4) Year of publication
The year the article was posted online/in print.
5) Is there an ASDC?
If the article contains an availability statement in any form, "yes" is
recorded. Otherwise, "no" is recorded.
6) Justification for non-open data?
If an availability statement contains some justification for why data is not
openly available, the justification is summarized and recorded as one of the
following options: 1) Dataset too large, 2) Licensing/Proprietary, 3) Can be
obtained from other entities, 4) Sensitive information, 5) Available at later
date. If the statement indicates any data is not openly available and no
justification is provided, or if no statement is provided, "None"
is recorded. If the statement indicates openly available data or no data
produced, "N/A" is recorded.
7) All data available
If there is an availability statement and data is produced, "y" is recorded
if means to access data associated with the article are given and there is no
indication that any data is not openly available; "n" is recorded if no means
to access data are given or there is some indication that some or all data is
not openly available. If there is no availability statement or no data is
produced, the record is left blank.
8) At least some data available
If there is an availability statement and data is produced, "y" is recorded
if any means to access data associated with the article are given; "n" is
recorded if no means to access data are given. If there is no availability
statement or no data is produced, the record is left blank.
9) All code available
If there is an availability statement and data is produced, "y" is recorded
if means to access code associated with the article are given and there is no
indication that any code is not openly available; "n" is recorded if no means
to access code are given or there is some indication that some or all code is
not openly available. If there is no availability statement or no data is
produced, the record is left blank.
10) At least some code available
If there is an availability statement and data is produced, "y" is recorded
if any means to access code associated with the article are given; "n" is
recorded if no means to access code are given. If there is no availability
statement or no data is produced, the record is left blank.
11) All data available upon request
If there is an availability statement indicating data is produced and no data
is openly available, "y" is recorded if any data is available upon request to
the authors of the examined journal article (not a request to any other
entity); "n" is recorded if no data is available upon request to the authors
of the examined journal article. If there is no availability statement, any
data is openly available, or no data is produced, the record is left blank.
12) At least some data available upon request
If there is an availability statement indicating data is produced and not all
data is openly available, "y" is recorded if all data is available upon
request to the authors of the examined journal article (not a request to any
other entity); "n" is recorded if not all data is available upon request to
the authors of the examined journal article. If there is no availability
statement, all data is openly available, or no data is produced, the record
is left blank.
13) no data produced
If there is an availability statement that indicates that no data was
produced for the examined journal article, "y" is recorded. Otherwise, the
record is left blank.
14) links work
If the availability statement contains one or more links to a data or code
repository, "y" is recorded if all links work; "n" is recorded if one or more
links do not work. If there is no availability statement or the statement
does not contain any links to a data or code repository, the record is left
blank.
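
The "y" / "n" / blank coding described above lends itself to a simple tally per column. The following is a minimal sketch of such a tally, not an excerpt from BAMS_plot.py: the header names and rows are hypothetical, and the real CSV files may label their columns differently.

```python
import csv
import io
from collections import Counter

# Hypothetical header names and made-up rows, for illustration only.
SAMPLE = """Title,Journal,Year,ASDC,All data available
Paper A,AIES,2023,yes,y
Paper B,AIES,2023,yes,n
Paper C,MWR,2022,no,
Paper D,MWR,2022,yes,y
"""


def tally(handle, column):
    """Count the "y", "n", and blank entries in one column."""
    counts = Counter()
    for record in csv.DictReader(handle):
        value = record[column].strip().lower()
        counts[value if value else "blank"] += 1
    return counts


counts = tally(io.StringIO(SAMPLE), "All data available")
answered = counts["y"] + counts["n"]
# Share of "y" among records where the field applies (blank records excluded).
share_open = counts["y"] / answered if answered else float("nan")
print(dict(counts), share_open)
```

Blank records are kept as their own category because, per the documentation above, they mean the field does not apply and are not the same as "n".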
Temperature and Humidity Time Series of Cold Storage Room Monitoring
zenodo.org
bin, csv, png, zip
Updated Jun 30, 2025
Cite
Elia Henrichs; Florian Stoll; Christian Krupitzer (2025). Temperature and Humidity Time Series of Cold Storage Room Monitoring [Dataset]. http://doi.org/10.5281/zenodo.15130001
Available download formats: png, bin, zip, csv
Unique identifier
https://doi.org/10.5281/zenodo.15130001
Dataset updated
Jun 30, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Elia Henrichs; Florian Stoll; Christian Krupitzer
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The datasets contain the raw data and preprocessed data (following the steps in the Jupyter Notebook) of 9 DHT22 sensors in a cold storage room. Details on how the data was gathered can be found in the publication "Self-Adaptive Integration of Distributed Sensor Systems for Monitoring Cold Storage Environments" by Elia Henrichs, Florian Stoll, and Christian Krupitzer.

This dataset consists of the following files:

Raw.zip - The raw data CSV files of the nine Arduino-based data loggers, containing the semicolon-separated columns date (formatted as dd.mm.yyyy), time (formatted as HH:MM:SS), temperature, and humidity. These files can contain multiple headers.

Preprocessed.zip - The preprocessed data CSV files of the nine Arduino-based data loggers, containing the semicolon-separated columns date (formatted as dd.mm.yyyy), time (formatted as HH:MM:SS), temperature, and humidity. Repeated headers were removed, and the datasets were aligned to equal length by filling missing values with NaN.

DataPreprocessing.ipynb - Jupyter Notebook containing the code to preprocess the data and create the overview file, which summarizes key characteristics of the dataset.

DataPreliminaryAnalysis.ipynb - Jupyter Notebook containing the code to perform the preliminary data analysis (general statistics, peaks, and matrix profiles).

experiment_actions.csv - CSV file logging performed actions (door openings and sensor movements).

overview.csv - CSV file summarizing key characteristics of the dataset and preliminary data analysis.

temphum_logger.ino - Source code to run the Arduino-based data logger with a sampling rate of 5 sec.

Arduino_setup_sketch_v1.png - Circuit diagram of the Arduino-based data logger.
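As a rough illustration of how a raw logger file in this layout could be read, the sketch below parses a semicolon-separated file, drops repeated header rows, and aligns the readings to the 5-second sampling grid so that gaps show up as NaN. The header names, the sample values, and the function name are assumptions; the actual steps are in DataPreprocessing.ipynb and may differ.

```python
import io

import pandas as pd

# Hypothetical sample in the layout described above; real header names may differ.
sample = (
    "date;time;temperature;humidity\n"
    "01.03.2024;12:00:00;4.1;81.0\n"
    "01.03.2024;12:00:05;4.2;80.5\n"
    "date;time;temperature;humidity\n"  # repeated header inside the file
    "01.03.2024;12:00:15;4.0;80.9\n"
)


def read_logger_csv(source) -> pd.DataFrame:
    # Read everything as text so repeated header rows do not break type inference.
    df = pd.read_csv(source, sep=";", dtype=str)
    # Drop rows that merely repeat the header line.
    df = df[df["date"] != "date"].copy()
    # Combine date (dd.mm.yyyy) and time (HH:MM:SS) into one timestamp.
    df["timestamp"] = pd.to_datetime(
        df["date"] + " " + df["time"], format="%d.%m.%Y %H:%M:%S"
    )
    df["temperature"] = pd.to_numeric(df["temperature"], errors="coerce")
    df["humidity"] = pd.to_numeric(df["humidity"], errors="coerce")
    return df.set_index("timestamp")[["temperature", "humidity"]]


df = read_logger_csv(io.StringIO(sample))
# Align to a 5-second grid; intervals without a reading become NaN.
aligned = df.resample(pd.Timedelta(seconds=5)).mean()
```

To read a real file, pass its path instead of the `io.StringIO` object.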
Domestic Electrical Load Survey - Key Variables 1994-2014 - South Africa
datafirst.uct.ac.za
Updated Apr 29, 2020
Wiebke Toussaint (2020). Domestic Electrical Load Survey - Key Variables 1994-2014 - South Africa [Dataset]. http://www.datafirst.uct.ac.za/Dataportal/index.php/catalog/758
Dataset updated
Apr 29, 2020
Dataset authored and provided by
Wiebke Toussaint
Time period covered
1994 - 2014
Area covered
South Africa
Description
Abstract

This dataset is a harmonisation of the Domestic Electrical Load Survey (DELS) 1994-2014 dataset. The DELS 1994-2014 questionnaires were changed in 2000. Consequently, survey questions differ between 1994-1999 and 2000-2014. This makes data processing complex, as survey responses first need to be associated with their year of collection and corresponding questionnaire before they can be correctly interpreted. This Key Variables dataset is a user-friendly version of the original dataset. It contains household responses to the most important survey questions, as well as geographic and linking information that allows the households to be matched to their respective electricity metering data. This dataset and similar custom datasets can be produced from the DELS 1994-2014 dataset with the Python package delprocess. The data processing section includes a description of how this dataset was created. The development of the tools to create this dataset was funded by the South African National Energy Development Initiative (SANEDI).

Geographic coverage

The study had national coverage.

Analysis unit

Households

Universe

The dataset covers South African households in the DELS 1994-2014 dataset. These are electrified households that received electricity either directly from Eskom or from their local municipality.

Kind of data

Administrative records

Sampling procedure

The dataset includes all households for which survey responses have been captured in the DELS 1994-2014 dataset.

Mode of data collection

Face-to-face [f2f]

Cleaning operations

This dataset has been constructed from the DELS 1994-2014 dataset using the data processing functions in the delprocess Python package (www.github.com/wiebket/delprocess: release v1.0). The delprocess package takes the complexities of the original DELS 1994-2014 dataset into account and makes use of 'spec files' to specify the processing steps that must be performed. To retrieve data for all survey years, two separate spec files are required: one to process survey responses from 1994-1999 and one for 2000-2014. The spec files used to produce this dataset are included in the program files and can be used as templates for new custom datasets. Full instructions on how to use them to process the data are in the README file contained in the delprocess package.

SPEC FILES specify the following processing steps:

List of search terms for which survey questions will be searched, and variables returned

Transformations (addition, subtraction, multiplication) of variables retrieved from search output

Bin intervals for variables (requires numeric data)

Labels for bins (requires binned data)

Details of bin segments

Replacement (encoding) of coded variable values

Higher level geography detail

In particular, the DELSKV 1994-2014 dataset has been produced by specifying the following processing steps:

TRANSFORMATIONS

* monthly_income from 1994 - 1999 is the variable returned by the 'income' search term
* monthly_income from 2000 - 2014 is calculated as the sum of the variables returned by the 'earn per month', 'money from small business' and 'external' search terms
* Appliance numbers from 1994 - 1999 is the count of appliances (no data was collected on broken appliances)
* Appliance numbers from 2000-2014 is the count of appliances [minus] the count of broken appliances (except for TV which included no information on broken appliances)
* A new total_adults variable was created by summing the number of all occupants (male and female) over 16 years old
* A new total_children variable was created by summing the number of all occupants (male and female) under 16 years old
* A new total_pensioners variable was created by summing the number of pensioners (male and female) over 16 years old
* A new total_unemployed variable was created by summing the number of unemployed occupants (male and female) over 16 years old
* A new total_part_time variable was created by summing the number of part time employed occupants (male and female) over 16 years old
* roof_material and wall_material values for 1994-1999 were augmented by 1
* water_access was transformed for 1994-1999 to be 4 [minus] the 'watersource' value

REPLACEMENTS

* Appliance usage values have been replaced with: 0=never 1=monthly 2=weekly 3=daily
* water_access values have been replaced with: 1=nearby river/dam/borehole 2=block/street taps 3=tap in yard 4=tap inside house
* roof_material and wall_material values have been replaced with: 1=IBR/Corr.Iron/Zinc 2=Thatch/Grass 3=Wood/Masonite board 4=Brick 5=Block 6=Plaster 7=Concrete 8=Tiles 9=Plastic 10=Asbestos 11=Daub/Mud/Clay

OTHER NOTES

Appliance usage information was only collected after 2000. No binning was done to segment survey responses for this dataset.
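The 1994-1999 transformations for water_access and roof_material, followed by the value replacements, can be illustrated with plain pandas. This is a sketch only: it does not use the delprocess package or its spec files, the input values are made up, and the assumption that raw 'watersource' codes run from 0 to 3 is inferred from the "4 minus" rule rather than stated in the source.

```python
import pandas as pd

# Hypothetical 1994-1999 responses; values are illustrative only.
raw = pd.DataFrame({
    "watersource": [0, 1, 2, 3],
    "roof_material": [0, 1, 7, 9],
})

# Transformations as described: water_access = 4 - watersource,
# and roof_material augmented by 1.
out = pd.DataFrame({
    "water_access": 4 - raw["watersource"],
    "roof_material": raw["roof_material"] + 1,
})

# Replacement tables copied from the description above.
WATER_ACCESS = {
    1: "nearby river/dam/borehole",
    2: "block/street taps",
    3: "tap in yard",
    4: "tap inside house",
}
ROOF_WALL_MATERIAL = {
    1: "IBR/Corr.Iron/Zinc", 2: "Thatch/Grass", 3: "Wood/Masonite board",
    4: "Brick", 5: "Block", 6: "Plaster", 7: "Concrete", 8: "Tiles",
    9: "Plastic", 10: "Asbestos", 11: "Daub/Mud/Clay",
}

# Map the transformed codes to their labels.
out["water_access_label"] = out["water_access"].map(WATER_ACCESS)
out["roof_material_label"] = out["roof_material"].map(ROOF_WALL_MATERIAL)
```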

Data appraisal

The 2000-2014 survey questions contain no variable for 'number of females: 50+', which goes against the pattern of other occupant age categories.

Spacing in the original questions is irregular and can cause challenges when specifying transformations (eg. 'number of males: 16-24' and 'number of males: 25 - 34', 'part time' and 'parttime').

Spelling mistakes in the original questions can cause challenges when specifying transformations (eg. 'head emploed part time').
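One way to cope with the irregular spacing noted above is to normalise question text before matching it. The helper below is a hypothetical sketch and is not taken from delprocess; it does not address spelling mistakes such as 'emploed', which still need explicit handling.

```python
import re


def normalise_question(text: str) -> str:
    """Lower-case a survey question and remove all whitespace,
    so that spacing variants of the same question compare equal."""
    return re.sub(r"\s+", "", text.lower())


# Spacing variants from the examples above collapse to the same key.
same_range = normalise_question("number of males: 25 - 34") == normalise_question(
    "number of males: 25-34"
)
same_term = normalise_question("part time") == normalise_question("parttime")
```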

MISSING VALUES Missing values have not been replaced and are represented as blanks except for imputed columns (total_adults, total_children, ...) and appliances after 2000, where missing values have been replaced with a 0.
S1 Data -
plos.figshare.com
zip
Updated Mar 19, 2024
Xinfu Pang; Wei Sun; Haibo Li; Wei Liu; Changfeng Luan (2024). S1 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0300229.s001
Available download formats: zip
Unique identifier
https://doi.org/10.1371/journal.pone.0300229.s001
Dataset updated
Mar 19, 2024
Dataset provided by
PLOS ONE
Authors
Xinfu Pang; Wei Sun; Haibo Li; Wei Liu; Changfeng Luan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Accurate short-term load forecasting is of great significance in improving the dispatching efficiency of power grids, ensuring the safe and reliable operation of power grids, and guiding power systems to formulate reasonable production plans and reduce waste of resources. However, the traditional short-term load forecasting method has limited nonlinear mapping ability and weak generalization ability to unknown data, and it is prone to the loss of time series information, further suggesting that its forecasting accuracy can still be improved. This study presents a short-term power load forecasting method based on Bagging-stochastic configuration networks (SCNs). First, the missing values in the original data are filled with the average values. Second, the influencing factors, such as the weather- and week-type data, are coded. Then, combined with the data of influencing factors after coding, the Bagging-SCNs integration algorithm is used to predict the short-term load. Finally, by taking the daily load data of Quanzhou City, Zhejiang Province as an example, the program of the abovementioned method is compiled in Python language and then compared with the long short-term memory neural network algorithm and the single-SCNs algorithm. Simulation results show that the proposed method for medium- and short-term load forecasting has a high forecasting accuracy and a significant effect on improving the accuracy of load forecasting.
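The first two preprocessing steps in the abstract (mean filling and coding of influencing factors) might look like the sketch below. The column names and values are invented for illustration, one-hot coding is an assumption since the abstract does not name the coding scheme, and the Bagging-SCNs model itself is not shown.

```python
import pandas as pd

# Hypothetical daily records; not taken from the S1 data.
df = pd.DataFrame({
    "load": [100.0, None, 120.0, 110.0],
    "weather": ["sunny", "rain", "sunny", "cloudy"],
    "week_type": ["workday", "workday", "weekend", "workday"],
})

# Step 1: fill missing load values with the column average.
df["load"] = df["load"].fillna(df["load"].mean())

# Step 2: code the categorical influencing factors as indicator columns.
# One-hot coding is an assumption; the paper may use a different scheme.
coded = pd.get_dummies(df, columns=["weather", "week_type"], dtype=int)
```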
U.S. Historical Climate - Monthly Averages for GHCN-D Stations for 1981 -...
community-climatesolutions.hub.arcgis.com
Updated Apr 16, 2019
Esri (2019). U.S. Historical Climate - Monthly Averages for GHCN-D Stations for 1981 - 2010 [Dataset]. https://community-climatesolutions.hub.arcgis.com/items/b8df6517ceac42af9ab483089296ed04
Dataset updated
Apr 16, 2019
Dataset authored and provided by
Esri
Area covered

Description
This point layer contains monthly summaries of daily temperatures (means, minimums, and maximums) and precipitation levels (sum, lowest, and highest) for the period January 1981 through December 2010 for weather stations in the Global Historical Climatology Network Daily (GHCND). Data in this service were obtained from web services hosted by the Applied Climate Information System (ACIS). ACIS staff curate the values for the U.S., including correcting erroneous values, reconciling data from stations that have been moved over their history, etc. The data were compiled at Esri from publicly available sources hosted and administered by NOAA. The ACIS data is updated and corrected on an ongoing basis; the data for this layer was collected on Jan 23, 2019.

The following process was used to produce this dataset:

Download the most current list of stations from ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt. Import this into Microsoft Excel and save as CSV. In ArcGIS, import the CSV as a geodatabase table and use the XY Event layer tool to locate each point.

Using a detailed U.S. boundary, extract the points that fall within the 50 U.S. States, the District of Columbia, and Puerto Rico.

Using Python with DA.UpdateCursor and urllib2, access the ACIS Web Services API to determine whether each station had at least 50 monthly values of temperature data. Delete the other stations.

Using Python, add the necessary field names and acquire all monthly values for the remaining stations. Thus, there are stations that have some missing data.

Using Python, add fields and convert the standard values to metric values so both would be present.

Thus, there are four sets of monthly data in this dataset:

Monthly means, mins, and maxes of daily temperatures, in degrees Fahrenheit.

Monthly mean of monthly sums of precipitation and the level of precipitation that was the minimum and maximum during the period 1981 to 2010, in mm.

The temperatures from the first set, in degrees Celsius.

The precipitation levels from the second set, in inches.

After initially publishing these data in a different service, it was learned that more precise coordinates for station locations were available from the Enhanced Master Station History Report (EMSHR) published by NOAA NCDC. With the publication of this layer these most precise coordinates are used. A large subset of the EMSHR metadata is available via EMSHR Stations Locations and Metadata 1738 to Present. If your study area includes areas outside of the U.S., use the World Historical Climate - Monthly Averages for GHCN-D Stations 1981 - 2010 layer. The data in this layer come from the same source archive, however, they are not curated by the ACIS staff and may contain errors.

Revision History:

Initially Published: 23 Jan 2019

Updated 16 Apr 2019 - We learned more precise coordinates for station locations were available from the Enhanced Master Station History Report (EMSHR) published by NOAA NCDC. With the publication of this layer the geometry and attributes for 3,222 of 9,636 stations now have more precise coordinates. The schema was updated to include the NCDC station identifier and elevation fields for feet and meters are also included. A large subset of the EMSHR data is available via EMSHR Stations Locations and Metadata 1738 to Present.

Cite as: Esri, 2019: U.S. Historical Climate - Monthly Averages for GHCN-D Stations for 1981 - 2010. ArcGIS Online, Accessed
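The unit conversions and the 50-value station filter described in the process can be sketched in plain Python. The function names are hypothetical, missing months are represented as None by assumption, and no ACIS or ArcGIS calls are shown because their exact usage is not given in the description.

```python
def fahrenheit_to_celsius(f: float) -> float:
    # Standard conversion between the two temperature scales.
    return (f - 32.0) * 5.0 / 9.0


def mm_to_inches(mm: float) -> float:
    # One inch is exactly 25.4 mm.
    return mm / 25.4


def has_enough_months(monthly_values, minimum: int = 50) -> bool:
    # Count non-missing monthly values (None marks a missing month)
    # and keep the station only if it reaches the minimum.
    return sum(v is not None for v in monthly_values) >= minimum
```

For example, a station with 49 temperature values and one missing month would be dropped under the 50-value rule.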
