67 datasets found
  1. Learn Data Science Series Part 1

    • kaggle.com
    Updated Dec 30, 2022
    Cite
    Rupesh Kumar (2022). Learn Data Science Series Part 1 [Dataset]. https://www.kaggle.com/datasets/hunter0007/learn-data-science-part-1
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 30, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rupesh Kumar
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Please feel free to share it with others and consider supporting me if you find it helpful ⭐️.

    Overview:

    • Chapter 1: Getting started with pandas
    • Chapter 2: Analysis: Bringing it all together and making decisions
    • Chapter 3: Appending to DataFrame
    • Chapter 4: Boolean indexing of dataframes
    • Chapter 5: Categorical data
    • Chapter 6: Computational Tools
    • Chapter 7: Creating DataFrames
    • Chapter 8: Cross sections of different axes with MultiIndex
    • Chapter 9: Data Types
    • Chapter 10: Dealing with categorical variables
    • Chapter 11: Duplicated data
    • Chapter 12: Getting information about DataFrames
    • Chapter 13: Gotchas of pandas
    • Chapter 14: Graphs and Visualizations
    • Chapter 15: Grouping Data
    • Chapter 16: Grouping Time Series Data
    • Chapter 17: Holiday Calendars
    • Chapter 18: Indexing and selecting data
    • Chapter 19: IO for Google BigQuery
    • Chapter 20: JSON
    • Chapter 21: Making Pandas Play Nice With Native Python Datatypes
    • Chapter 22: Map Values
    • Chapter 23: Merge, join, and concatenate
    • Chapter 24: Meta: Documentation Guidelines
    • Chapter 25: Missing Data
    • Chapter 26: MultiIndex
    • Chapter 27: Pandas Datareader
    • Chapter 28: Pandas IO tools (reading and saving data sets)
    • Chapter 29: pd.DataFrame.apply
    • Chapter 30: Read MySQL to DataFrame
    • Chapter 31: Read SQL Server to Dataframe
    • Chapter 32: Reading files into pandas DataFrame
    • Chapter 33: Resampling
    • Chapter 34: Reshaping and pivoting
    • Chapter 35: Save pandas dataframe to a csv file
    • Chapter 36: Series
    • Chapter 37: Shifting and Lagging Data
    • Chapter 38: Simple manipulation of DataFrames
    • Chapter 39: String manipulation
    • Chapter 40: Using .ix, .iloc, .loc, .at and .iat to access a DataFrame
    • Chapter 41: Working with Time Series
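
    A minimal pandas sketch touching a few of the chapters above (boolean indexing, missing data, grouping); the toy data are invented for illustration:

    import numpy as np
    import pandas as pd

    # Toy DataFrame (hypothetical values, for illustration only)
    df = pd.DataFrame({
        "city": ["Delhi", "Mumbai", "Delhi", "Pune"],
        "sales": [250.0, np.nan, 310.0, 120.0],
    })

    # Chapter 4 - boolean indexing: rows where sales exceed 200
    high = df[df["sales"] > 200]

    # Chapter 25 - missing data: fill NaN with the column mean
    df["sales"] = df["sales"].fillna(df["sales"].mean())

    # Chapter 15 - grouping data: aggregate sales per city
    print(df.groupby("city")["sales"].sum())
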
  2. Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction...

    • frontiersin.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Yi-Hui Zhou; Ehsan Saghapour (2023). Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction of Biomedical Data.PDF [Dataset]. http://doi.org/10.3389/fgene.2021.691274.s001
    Explore at:
    pdf (available download formats)
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Yi-Hui Zhou; Ehsan Saghapour
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Completed datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, the most popular imputation methods generally require scripting skills and are implemented across various packages, each with its own syntax. Thus, the implementation of a full suite of methods is generally out of reach of all but experienced data scientists. Moreover, imputation is often treated as an exercise separate from exploratory data analysis, but it should be considered part of the data exploration process. We have created a new graphical tool, ImputEHR, which is implemented in Python and allows application of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as implementing machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.
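
    ImputEHR itself is a graphical tool, but the style of imputation it describes (e.g., gradient-boosted trees) can be sketched with scikit-learn; the snippet below is an illustration under that assumption, not the tool's own API:

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.ensemble import GradientBoostingRegressor

    # Synthetic EHR-like matrix with ~20% missing entries (illustration only)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    X[rng.random(X.shape) < 0.2] = np.nan

    # Iterative imputation, modeling each feature with a gradient-boosted regressor
    imputer = IterativeImputer(estimator=GradientBoostingRegressor(), random_state=0)
    X_complete = imputer.fit_transform(X)
    print(np.isnan(X_complete).sum())  # 0 -> no missing values remain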

  3. Data from: Missing Value Imputation in Relational Data Using Variational...

    • tandf.figshare.com
    txt
    Updated Jul 11, 2025
    Cite
    Simon Fontaine; Jian Kang; Ji Zhu (2025). Missing Value Imputation in Relational Data Using Variational Inference [Dataset]. http://doi.org/10.6084/m9.figshare.29184891.v2
    Explore at:
    txt (available download formats)
    Dataset updated
    Jul 11, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Simon Fontaine; Jian Kang; Ji Zhu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In real-world networks, node attributes are often only partially observed, necessitating imputation to support analysis or enable downstream tasks. However, most existing imputation methods overlook the rich information contained within the connectivity among nodes. This research is inspired by the premise that leveraging all available information should yield improved imputation, provided a sufficient association between attributes and edges. Consequently, we introduce a joint latent space model that produces a low-dimensional representation of the data and simultaneously captures the edge and node attribute information. This model relies on the pooling of information induced by shared latent variables, thus improving the prediction of node attributes and providing a more effective attribute imputation method. Our approach uses variational inference to approximate posterior distributions for these latent variables, resulting in predictive distributions for missing values. Through numerical experiments, conducted on both simulated data and real-world networks, we demonstrate that our proposed method successfully harnesses the joint structure information and significantly improves the imputation of missing attributes, specifically when the observed information is weak. Additional results, implementation details, a Python implementation, and the code reproducing the results are available online. Supplementary materials for this article are available online.

  4. Pre-Processed Power Grid Frequency Time Series

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1 more
    Updated Apr 22, 2020
    + more versions
    Cite
    Johannes Kruse; Benjamin Schäfer; Dirk Witthaut (2020). Pre-Processed Power Grid Frequency Time Series [Dataset]. http://doi.org/10.5281/zenodo.5105820
    Explore at:
    Dataset updated
    Apr 22, 2020
    Authors
    Johannes Kruse; Benjamin Schäfer; Dirk Witthaut
    Description

    Overview

    This repository contains ready-to-use frequency time series as well as the corresponding pre-processing scripts in Python. The data cover three synchronous areas of the European power grid: Continental Europe, Great Britain, and Nordic. This work is part of the paper "Predictability of Power Grid Frequency" [1]. Please cite this paper when using the data and the code. For a detailed documentation of the pre-processing procedure we refer to the supplementary material of the paper.

    Data sources

    We downloaded the frequency recordings from publicly available repositories of three different Transmission System Operators (TSOs).

    • Continental Europe [2]: We downloaded the data from the German TSO TransnetBW GmbH, which retains the copyright on the data but permits republication upon request [3].
    • Great Britain [4]: The download was supported by National Grid ESO Open Data, which belongs to the British TSO National Grid. They publish the frequency recordings under the NGESO Open License [5].
    • Nordic [6]: We obtained the data from the Finnish TSO Fingrid, which provides the data under the open license CC-BY 4.0 [7].

    Content of the repository

    A) Scripts. In the "Download_scripts" folder you will find three scripts to automatically download frequency data from the TSOs' websites. In "convert_data_format.py" we save the data with corrected timestamp formats; missing data is marked as NaN (processing step (1) in the supplementary material of [1]). In "clean_corrupted_data.py" we load the converted data, identify corrupted recordings, mark them as NaN, and clean some of the resulting data holes (processing step (2) in the supplementary material of [1]). The Python scripts run with Python 3.7 and with the packages found in "requirements.txt".

    B) Yearly converted and cleansed data. The folders "
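
    The conversion and cleaning steps described above can be sketched with pandas; the file and column names below are hypothetical, and the plausibility bounds are illustrative choices, not the authors' actual thresholds:

    import pandas as pd

    # Hypothetical raw file with a timestamp and a frequency column
    raw = pd.read_csv("frequency_raw.csv", names=["time", "frequency_hz"], header=0)

    # Step (1): parse timestamps and reindex to a regular 1-second grid,
    # so that missing recordings show up as NaN
    raw["time"] = pd.to_datetime(raw["time"], utc=True)
    raw = raw.set_index("time").sort_index()
    full_index = pd.date_range(raw.index[0], raw.index[-1], freq="1s")
    clean = raw.reindex(full_index)

    # Step (2): mark physically implausible recordings as corrupted (NaN)
    bad = (clean["frequency_hz"] < 45) | (clean["frequency_hz"] > 55)
    clean.loc[bad, "frequency_hz"] = float("nan")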

  5. Data from: A comprehensive dataset for the accelerated development and...

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1 more
    Updated Jan 24, 2020
    Cite
    Larson, David (2020). A comprehensive dataset for the accelerated development and benchmarking of solar forecasting methods [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2826938
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Larson, David
    Coimbra, Carlos
    Carreira Pedro, Hugo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains a comprehensive solar irradiance, imaging, and forecasting dataset. The goal of this release is to provide standardized solar and meteorological datasets to the research community for the accelerated development and benchmarking of forecasting methods. The data consist of three years (2014–2016) of quality-controlled, 1-min resolution global horizontal irradiance and direct normal irradiance ground measurements in California. In addition, we provide overlapping data from commonly used exogenous variables, including sky images, satellite imagery, Numerical Weather Prediction forecasts, and weather data. We also include sample codes of baseline models for benchmarking of more elaborate models.

    Data usage: The usage of the datasets and sample codes presented here is intended for research and development purposes only and implies explicit reference to the paper: Pedro, H.T.C., Larson, D.P., Coimbra, C.F.M., 2019. A comprehensive dataset for the accelerated development and benchmarking of solar forecasting methods. Journal of Renewable and Sustainable Energy 11, 036102. https://doi.org/10.1063/1.5094494

    Although every effort was made to ensure the quality of the data, no guarantees or liabilities are implied by the authors or publishers of the data.

    Sample code: As part of the data release, we also include sample code written in Python 3. The preprocessed data used in the scripts are also provided. The code can be used to reproduce the results presented in this work and as a starting point for future studies. Besides the standard scientific Python packages (numpy, scipy, and matplotlib), the code depends on pandas for time-series operations, pvlib for common solar-related tasks, and scikit-learn for machine learning models. All required Python packages are readily available on Mac, Linux, and Windows and can be installed via, e.g., pip.

    Units: All time stamps are in UTC (YYYY-MM-DD HH:MM:SS). All irradiance and weather data are in SI units. Sky image features are derived from 8-bit RGB (256 color levels) data. Satellite images are derived from 8-bit gray-scale (256 color levels) data.

    Missing data: The string "NAN" indicates missing data.

    File formats: All time series data files are in CSV (comma-separated values) format. Images are given in tar.bz2 archives.
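
    Given these conventions (UTC time stamps, CSV files, "NAN" for missing data), loading the primary irradiance file might look like the sketch below; the time stamp column name is an assumption:

    import pandas as pd

    # "NAN" strings are parsed as missing values; time stamps are UTC
    irr = pd.read_csv(
        "Folsom_irradiance.csv",
        na_values=["NAN"],
        parse_dates=["timeStamp"],  # assumed column name
        index_col="timeStamp",
    )
    print(irr.isna().mean())  # fraction of missing data per column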

    Files

    • Folsom_irradiance.csv (Primary): One-minute GHI, DNI, and DHI data.
    • Folsom_weather.csv (Primary): One-minute weather data.
    • Folsom_sky_images_{YEAR}.tar.bz2 (Primary): Tar archives with daytime sky images captured at 1-min intervals for the years 2014, 2015, and 2016, compressed with bz2.
    • Folsom_NAM_lat{LAT}_lon{LON}.csv (Primary): NAM forecasts for the four nodes nearest the target location. {LAT} and {LON} are replaced by the node’s coordinates listed in Table I in the paper.
    • Folsom_sky_image_features.csv (Secondary): Features derived from the sky images.
    • Folsom_satellite.csv (Secondary): 10 pixel by 10 pixel GOES-15 images centered in the target location.
    • Irradiance_features_{horizon}.csv (Secondary): Irradiance features for the different forecasting horizons ({horizon} = {intra-hour, intra-day, day-ahead}).
    • Sky_image_features_intra-hour.csv (Secondary): Sky image features for the intra-hour forecasting issuing times.
    • Sat_image_features_intra-day.csv (Secondary): Satellite image features for the intra-day forecasting issuing times.
    • NAM_nearest_node_day-ahead.csv (Secondary): NAM forecasts (GHI, DNI computed with the DISC algorithm, and total cloud cover) for the nearest node to the target location, prepared for day-ahead forecasting.
    • Target_{horizon}.csv (Secondary): Target data for the different forecasting horizons.
    • Forecast_{horizon}.py (Code): Python script used to create the forecasts for the different horizons.
    • Postprocess.py (Code): Python script used to compute the error metric for all the forecasts.

  6. Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 - 7—Data

    • catalog.data.gov
    • data.usgs.gov
    • +4 more
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 - 7—Data [Dataset]. https://catalog.data.gov/dataset/variable-terrestrial-gps-telemetry-detection-rates-parts-1-7data
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    Studies utilizing Global Positioning System (GPS) telemetry rarely result in 100% fix success rates (FSR). Many assessments of wildlife resource use do not account for missing data, either assuming data loss is random or because of a lack of practical treatment for systematic data loss. Several studies have explored how the environment, technological features, and animal behavior influence rates of missing data in GPS telemetry, but previous spatially explicit models developed to correct for sampling bias have been specified to small study areas, on a small range of data loss, or to be species-specific, limiting their general utility. Here we explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use. We also evaluate patterns in missing data that relate to potential animal activities that change the orientation of the antennae, and characterize home-range probability of GPS detection for 4 focal species: cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni), and mule deer (Odocoileus hemionus).

    Part 1, Positive Openness Raster (raster dataset): Openness is an angular measure of the relationship between surface relief and horizontal distance. For angles less than 90 degrees it is equivalent to the internal angle of a cone with its apex at a DEM location, and is constrained by neighboring elevations within a specified radial distance. A 480 meter search radius was used for this calculation of positive openness. Openness incorporates the terrain line-of-sight, or viewshed, concept and is calculated from multiple zenith and nadir angles, here along eight azimuths. Positive openness measures openness above the surface, with high values for convex forms and low values for concave forms (Yokoyama et al. 2002). We calculated positive openness using a custom Python script, following the methods of Yokoyama et al. (2002), using a USGS National Elevation Dataset as input.

    Part 2, Northern Arizona GPS Test Collar (csv): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall overstory vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent datasets from stationary test collars of different make/model and fix interval programming, placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs suggest that changes in technological factors have minor influence on the model's ability to predict FSR in new study areas in the southwestern US. The model training data are provided here for fix attempts by hour. This table can be linked with the site location shapefile using the site field.

    Part 3, Probability Raster (raster dataset): We evaluated GPS telemetry datasets by comparing the mean probability of a successful GPS fix across study animals' home-ranges to the actual observed FSR of GPS-downloaded deployed collars on cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni), and mule deer (Odocoileus hemionus). Comparing the mean probability of acquisition within study animals' home-ranges and observed FSRs of GPS-downloaded collars resulted in an approximately 1:1 linear relationship with an r-squared of 0.68.

    Part 4, GPS Test Collar Sites (shapefile): Locations of the stationary test collars used to evaluate the predictive model of fix acquisition described in Part 2; the sites can be linked to the model training data using the site field.

    Part 5, Cougar Home Ranges (shapefile): Cougar home-ranges were calculated to compare the mean probability of a GPS fix acquisition across the home-range to the actual fix success rate (FSR) of the collar, as a means of evaluating whether characteristics of an animal's home-range have an effect on observed FSR. We estimated home-ranges using the Local Convex Hull (LoCoH) method with the 90th isopleth. Only data obtained from GPS download of retrieved units were used. Satellite-delivered data were omitted from the analysis for animals where the collar was lost or damaged, because satellite delivery tends to lose an additional 10% of data. Comparisons with home-range mean probability of fix were also used as a reference for assessing whether the frequency with which animals use areas of low GPS acquisition rates may play a role in observed FSRs.

    Part 6, Cougar Fix Success Rate by Hour (csv): Cougar GPS collar fix success varied by hour-of-day, suggesting circadian rhythms with bouts of rest during daylight hours may change the orientation of the GPS receiver, affecting the ability to acquire fixes. Raw data of overall fix success rates (FSR) and FSR by hour were used to predict relative reductions in FSR. Data only include direct GPS download datasets; satellite-delivered data were omitted for animals where the collar was lost or damaged, because satellite delivery tends to lose approximately an additional 10% of data.

    Part 7, Openness Python Script version 2.0: This Python script was used to calculate positive openness using a 30 meter digital elevation model for a large geographic area in Arizona, California, Nevada, and Utah. A scientific research project used the script to explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use.
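
    As an illustration, the Part 6 hourly fix-success analysis could be summarized with pandas along these lines (the file and column names are assumptions, not the published schema):

    import pandas as pd

    # Hypothetical columns: hour of day and fix outcome (1 = acquired, 0 = missed)
    fixes = pd.read_csv("cougar_fix_attempts.csv")

    # Fix success rate (FSR) by hour of day
    fsr_by_hour = fixes.groupby("hour")["fix_acquired"].mean()

    # Relative reduction in FSR compared with the overall rate
    overall_fsr = fixes["fix_acquired"].mean()
    print((1 - fsr_by_hour / overall_fsr).round(3))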

  7. Global Freelancers (Raw) Dataset

    • kaggle.com
    Updated Jul 5, 2025
    Cite
    Urvish Ahir (2025). Global Freelancers (Raw) Dataset [Dataset]. https://www.kaggle.com/datasets/urvishahir/global-freelancers-raw-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 5, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Urvish Ahir
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description


    This dataset contains 1,000 fictional freelancer profiles from around the world, designed to reflect realistic variability and messiness often encountered in real-world data collection.

    Each entry includes demographic, professional, and platform-related information such as:

    • Name, gender, age, and country
    • Primary skill and years of experience
    • Hourly rate (with mixed formatting), client rating, and satisfaction score
    • Language spoken (based on country)
    • Inconsistent and unclean values across several fields (e.g., gender, is_active, satisfaction)

    Key Features:

    • Gender-based names using Faker’s male/female name generators
    • Realistic age and experience distribution (with missing and noisy values)
    • Country-language pairs mapped using actual linguistic data
    • Messy formatting: mixed data types, missing values, inconsistent casing
    • Generated entirely in Python using the Faker library; no real data is used

    Use Cases:

    • Practicing data cleaning and preprocessing
    • Performing EDA (Exploratory Data Analysis)
    • Developing data pipelines: raw → clean → model-ready
    • Teaching feature engineering and handling real-world dirty data
    • Exercises in data validation, outlier detection, and format standardization

    File: global_freelancers_raw.csv

    | Column Name           | Description                                                               |
    | --------------------- | ------------------------------------------------------------------------- |
    | `freelancer_ID`       | Unique ID starting with `FL` (e.g., FL250001)                             |
    | `name`                | Full name of freelancer (based on gender)                                 |
    | `gender`              | Gender (messy values and case inconsistency)                              |
    | `age`                 | Age of the freelancer (20–60, with occasional nulls/outliers)             |
    | `country`             | Country name (with random formatting/casing)                              |
    | `language`            | Language spoken (mapped from country)                                     |
    | `primary_skill`       | Key freelance domain (e.g., Web Dev, AI, Cybersecurity)                   |
    | `years_of_experience` | Work experience in years (some missing or odd values included)            |
    | `hourly_rate (USD)`   | Hourly rate with currency symbols or missing data                         |
    | `rating`              | Rating between 1.0–5.0 (some zeros and nulls included)                    |
    | `is_active`           | Active status (inconsistently represented as strings, numbers, booleans)  |
    | `client_satisfaction` | Satisfaction percentage (e.g., "85%" or 85, may include NaNs)             |
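
    A minimal cleaning sketch for a few of the messy fields above (the normalization rules are illustrative choices, not part of the dataset):

    import pandas as pd

    df = pd.read_csv("global_freelancers_raw.csv")

    # Normalize gender casing and map common variants to two labels
    df["gender"] = (
        df["gender"].astype(str).str.strip().str.lower()
        .map({"m": "male", "male": "male", "f": "female", "female": "female"})
    )

    # Strip currency symbols from the hourly rate and coerce to float
    rate = df["hourly_rate (USD)"].astype(str).str.replace(r"[^0-9.]", "", regex=True)
    df["hourly_rate (USD)"] = pd.to_numeric(rate, errors="coerce")

    # Turn client_satisfaction values like "85%" or 85 into floats
    df["client_satisfaction"] = pd.to_numeric(
        df["client_satisfaction"].astype(str).str.rstrip("%"), errors="coerce"
    )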
    
  8. Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 - 7 Data.

    • datadiscoverystudio.org
    Updated May 20, 2018
    + more versions
    Cite
    (2018). Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 - 7 Data. [Dataset]. http://datadiscoverystudio.org/geoportal/rest/metadata/item/4df55111949a4b20aaf19e0d501595a7/html
    Explore at:
    Available download formats
    Dataset updated
    May 20, 2018
    Description

    Studies utilizing Global Positioning System (GPS) telemetry rarely result in 100% fix success rates (FSR). Many assessments of wildlife resource use do not account for missing data, either assuming data loss is random or because of a lack of practical treatment for systematic data loss. Several studies have explored how the environment, technological features, and animal behavior influence rates of missing data in GPS telemetry, but previous spatially explicit models developed to correct for sampling bias have been specified to small study areas, on a small range of data loss, or to be species-specific, limiting their general utility. Here we explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use. We also evaluate patterns in missing data that relate to potential animal activities that change the orientation of the antennae, and characterize home-range probability of GPS detection for 4 focal species: cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni), and mule deer (Odocoileus hemionus).

    Part 1, Positive Openness Raster (raster dataset): Openness is an angular measure of the relationship between surface relief and horizontal distance. For angles less than 90 degrees it is equivalent to the internal angle of a cone with its apex at a DEM location, and is constrained by neighboring elevations within a specified radial distance. A 480 meter search radius was used for this calculation of positive openness. Openness incorporates the terrain line-of-sight, or viewshed, concept and is calculated from multiple zenith and nadir angles, here along eight azimuths. Positive openness measures openness above the surface, with high values for convex forms and low values for concave forms (Yokoyama et al. 2002). We calculated positive openness using a custom Python script, following the methods of Yokoyama et al. (2002), using a USGS National Elevation Dataset as input.

    Part 2, Northern Arizona GPS Test Collar (csv): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall overstory vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent datasets from stationary test collars of different make/model and fix interval programming, placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs suggest that changes in technological factors have minor influence on the model's ability to predict FSR in new study areas in the southwestern US. The model training data are provided here for fix attempts by hour. This table can be linked with the site location shapefile using the site field.

    Part 3, Probability Raster (raster dataset): We evaluated GPS telemetry datasets by comparing the mean probability of a successful GPS fix across study animals' home-ranges to the actual observed FSR of GPS-downloaded deployed collars on cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni), and mule deer (Odocoileus hemionus). Comparing the mean probability of acquisition within study animals' home-ranges and observed FSRs of GPS-downloaded collars resulted in an approximately 1:1 linear relationship with an r-squared of 0.68.

    Part 4, GPS Test Collar Sites (shapefile): Locations of the stationary test collars used to evaluate the predictive model of fix acquisition described in Part 2; the sites can be linked to the model training data using the site field.

    Part 5, Cougar Home Ranges (shapefile): Cougar home-ranges were calculated to compare the mean probability of a GPS fix acquisition across the home-range to the actual fix success rate (FSR) of the collar, as a means of evaluating whether characteristics of an animal's home-range have an effect on observed FSR. We estimated home-ranges using the Local Convex Hull (LoCoH) method with the 90th isopleth. Only data obtained from GPS download of retrieved units were used. Satellite-delivered data were omitted from the analysis for animals where the collar was lost or damaged, because satellite delivery tends to lose an additional 10% of data. Comparisons with home-range mean probability of fix were also used as a reference for assessing whether the frequency with which animals use areas of low GPS acquisition rates may play a role in observed FSRs.

    Part 6, Cougar Fix Success Rate by Hour (csv): Cougar GPS collar fix success varied by hour-of-day, suggesting circadian rhythms with bouts of rest during daylight hours may change the orientation of the GPS receiver, affecting the ability to acquire fixes. Raw data of overall fix success rates (FSR) and FSR by hour were used to predict relative reductions in FSR. Data only include direct GPS download datasets; satellite-delivered data were omitted for animals where the collar was lost or damaged, because satellite delivery tends to lose approximately an additional 10% of data.

    Part 7, Openness Python Script version 2.0: This Python script was used to calculate positive openness using a 30 meter digital elevation model for a large geographic area in Arizona, California, Nevada, and Utah. A scientific research project used the script to explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use.

  9. ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture...

    • researchdata.tuwien.ac.at
    • b2find.eudat.eu
    zip
    Updated Jun 6, 2025
    Cite
    Wolfgang Preimesberger; Pietro Stradiotti; Wouter Arnoud Dorigo (2025). ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture from merged multi-satellite observations [Dataset]. http://doi.org/10.48436/3fcxr-cde10
    Explore at:
    zip (available download formats)
    Dataset updated
    Jun 6, 2025
    Dataset provided by
    TU Wien
    Authors
    Wolfgang Preimesberger; Pietro Stradiotti; Wouter Arnoud Dorigo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    This dataset was produced with funding from the European Space Agency (ESA) Climate Change Initiative (CCI) Plus Soil Moisture Project (CCN 3 to ESRIN Contract No: 4000126684/19/I-NB "ESA CCI+ Phase 1 New R&D on CCI ECVS Soil Moisture"). Project website: https://climate.esa.int/en/projects/soil-moisture/

    This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.

    Dataset paper (public preprint)

    A description of this dataset, including the methodology and validation results, is available at:

    Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.

    Abstract

    ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
    However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
    Since the requirement of a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such a gap-filling method is that it relies only on the original observational record, without the need for ancillary variables or model-based information. Owing to this intrinsic challenge, no global, long-term, univariate gap-filled product was available until now. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments that fill satellite-like gaps introduced into GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and consider the gap size and local vegetation conditions as parameters that affect the gap-filling performance.

    Summary

    • Gap-filled global estimates of volumetric surface soil moisture from 1991-2023 at 0.25° sampling
    • Fields of application (partial): climate variability and change, land-atmosphere interactions, global biogeochemical cycles and ecology, hydrological and land surface modelling, drought applications, and meteorology
    • Method: Modified version of DCT-PLS (Garcia, 2010) interpolation/smoothing algorithm, linear interpolation over periods of frozen soils. Uncertainty estimates are provided for all data points.
    • More information: See Preimesberger et al. (2025) and the ESA CCI SM Algorithm Theoretical Baseline Document, Chapter 7.2.9 (Dorigo et al., 2023): https://doi.org/10.5281/zenodo.8320869

    Programmatic Download

    You can use command line tools such as wget or curl to download (and extract) data for multiple years. The following script will download and extract the complete data set to the local directory ~/Downloads on Linux or macOS systems.

    #!/bin/bash

    # Set download directory
    DOWNLOAD_DIR=~/Downloads

    base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"

    # Loop through years 1991 to 2023 and download & extract data
    for year in {1991..2023}; do
        echo "Downloading $year.zip..."
        wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
        unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
        rm "$DOWNLOAD_DIR/$year.zip"
    done

    Data details

    The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY); each subdirectory contains one netCDF image file for a specific day (DD) and month (MM), on a 2-dimensional (longitude, latitude) grid (CRS: WGS84). The file name follows this convention:

    ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc

    Data Variables

    Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:

    • sm: (float) The Soil Moisture variable reflects estimates of daily average volumetric soil moisture content (m3/m3) in the soil surface layer (~0-5 cm) over a whole grid cell (0.25 degree).
    • sm_uncertainty: (float) The Soil Moisture Uncertainty variable reflects the uncertainty (random error) of the original satellite observations and of the predictions used to fill observation data gaps.
    • sm_anomaly: Soil moisture anomalies (reference period 1991-2020) derived from the gap-filled values (`sm`)
    • sm_smoothed: Contains the DCT-PLS predictions used to fill data gaps in the original soil moisture field. These values are also provided where an observation was initially available (compare `gapmask`); in those cases they provide a smoothed version of the original data.
    • gapmask: (0 | 1) Indicates grid cells where a satellite observation is available (1), and where the interpolated (smoothed) values are used instead (0) in the 'sm' field.
    • frozenmask: (0 | 1) Indicates grid cells where ERA5 soil temperature is <0 °C. In this case, a linear interpolation over time is applied.

    Additional information for each variable is given in the netCDF attributes.
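
    A minimal reading sketch with xarray (any CF-aware netCDF reader works equally well); the file name follows the convention given above:

    import xarray as xr

    # One daily 0.25-degree global image, named per the convention above
    ds = xr.open_dataset(
        "ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-20200101000000-fv09.1r1.nc"
    )

    sm = ds["sm"]                  # gap-filled soil moisture (m3/m3)
    sm_unc = ds["sm_uncertainty"]  # random-error estimate

    # Keep only grid cells backed by an actual satellite observation
    observed_only = sm.where(ds["gapmask"] == 1)
    print(float(observed_only.mean()))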

    Version Changelog

    Changes in v9.1r1 (previous version was v09.1):

    • This version uses a novel uncertainty estimation scheme as described in Preimesberger et al. (2025).

    Software to open netCDF files

    These data can be read by any software that supports Climate and Forecast (CF) conformant metadata standards for netCDF files.

    References

    • Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.
    • Dorigo, W., Preimesberger, W., Stradiotti, P., Kidd, R., van der Schalie, R., van der Vliet, M., Rodriguez-Fernandez, N., Madelon, R., & Baghdadi, N. (2023). ESA Climate Change Initiative Plus - Soil Moisture Algorithm Theoretical Baseline Document (ATBD) Supporting Product Version 08.1 (version 1.1). Zenodo. https://doi.org/10.5281/zenodo.8320869
    • Garcia, D., 2010. Robust smoothing of gridded data in one and higher dimensions with missing values. Computational Statistics & Data Analysis, 54(4), pp.1167-1178. Available at: https://doi.org/10.1016/j.csda.2009.09.020
    • Rodell, M., Houser, P. R., Jambor, U., Gottschalck, J., Mitchell, K., Meng, C.-J., Arsenault, K., Cosgrove, B., Radakovich, J., Bosilovich, M., Entin, J. K., Walker, J. P., Lohmann, D., and Toll, D.: The Global Land Data Assimilation System, Bulletin of the American Meteorological Society, 85, 381 – 394, https://doi.org/10.1175/BAMS-85-3-381, 2004.

    Related Records

    The following records are all part of the Soil Moisture Climate Data Records from satellites community:

    1. ESA CCI SM MODELFREE Surface Soil Moisture Record: https://doi.org/10.48436/svr1r-27j77
  10. Additional file 1 of Tensor extrapolation: an adaptation to data sets with...

    • springernature.figshare.com
    html
    Updated May 31, 2023
    Cite
    Josef Schosser (2023). Additional file 1 of Tensor extrapolation: an adaptation to data sets with missing entries [Dataset]. http://doi.org/10.6084/m9.figshare.19242463.v1
    Explore at:
    html (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Josef Schosser
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 1. Python code. The notebook highlights core components of the code applied in the study.

  11. Science Education Research Topic Modeling Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, html +2
    Updated Oct 9, 2024
    Cite
    Tor Ole B. Odden; Alessandro Marin; John L. Rudolph (2024). Science Education Research Topic Modeling Dataset [Dataset]. http://doi.org/10.5281/zenodo.4094974
    Explore at:
    bin, txt, html, text/x-python (available download formats)
    Dataset updated
    Oct 9, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tor Ole B. Odden; Alessandro Marin; John L. Rudolph
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset contains scraped and processed text from roughly 100 years of articles published in the Wiley journal Science Education (formerly General Science Quarterly). This text has been cleaned and filtered in preparation for analysis using natural language processing techniques, particularly topic modeling with latent Dirichlet allocation (LDA). We also include a Jupyter Notebook illustrating how one can use LDA to analyze this dataset and extract latent topics from it, as well as analyze the rise and fall of those topics over the history of the journal.

    The articles were downloaded and scraped in December of 2019. Only non-duplicate articles with a listed author (according to the CrossRef metadata database) were included, and due to missing data and text recognition issues we excluded all articles published prior to 1922. This resulted in 5577 articles in total being included in the dataset. The text of these articles was then cleaned in the following way:

    • We removed duplicated text from each article: prior to 1969, articles in the journal were published in a magazine format in which the end of one article and the beginning of the next would share the same page, so we developed an automated detection of article beginnings and endings that was able to remove any duplicate text.
    • We removed the reference sections of the articles, as well as headings (in all caps) such as “ABSTRACT”.
    • We reunited any partial words that were separated due to line breaks, text recognition issues, or British vs. American spellings (for example, converting “per cent” to “percent”).
    • We removed all numbers, symbols, special characters, and punctuation, and lowercased all words.
    • We removed all stop words, which are words without any semantic meaning on their own (“the”, “in”, “if”, “and”, “but”, etc.), as well as all single-letter words.
    • We lemmatized all words, with the added step of including a part-of-speech tagger so our algorithm would only aggregate and lemmatize words from the same part of speech (e.g., nouns vs. verbs).
    • We detected and created bi-grams, sets of words that frequently co-occur and carry additional meaning together. These words were combined with an underscore: for example, “problem_solving” and “high_school”.

    After filtering, each document was then turned into a list of individual words (or tokens) which were then collected and saved (using the python pickle format) into the file scied_words_bigrams_V5.pkl.
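
    A minimal LDA sketch over the tokenized articles, assuming the gensim library (the included notebook's exact pipeline may differ):

    import pickle
    from gensim import corpora, models

    # Load the tokenized articles: a list of token lists
    with open("scied_words_bigrams_V5.pkl", "rb") as f:
        docs = pickle.load(f)

    # Map tokens to integer ids and build a bag-of-words corpus
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    # Fit an LDA model; the topic count is an arbitrary illustration
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=20, random_state=0)
    for topic_id, words in lda.print_topics(num_topics=5):
        print(topic_id, words)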

    In addition to this file, we have also included the following files:

    1. SciEd_paper_names_weights.pkl: A file containing limited metadata (title, author, year published, and DOI) for each of the papers, in the same order as they appear within the main datafile. This file also includes the weights assigned by an LDA model used to analyze the data.
    2. Science Education LDA Notebook.ipynb: A notebook file that replicates our LDA analysis, with a written explanation of all of the steps and suggestions on how to explore the results.
    3. Supporting files for the notebook. These include the requirements, the README, a helper script with functions for plotting that were too long to include in the notebook, and two HTML graphs that are embedded into the notebook.

    This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to share text and data mining output for non-commercial research purposes. Any questions or comments can be directed to Tor Ole Odden, t.o.odden@fys.uio.no.

  12. Klib library python

    • kaggle.com
    Updated Jan 11, 2021
    Cite
    Sripaad Srinivasan (2021). Klib library python [Dataset]. https://www.kaggle.com/sripaadsrinivasan/klib-library-python/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 11, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sripaad Srinivasan
    Description

    klib is a Python library for importing, cleaning, analyzing, and preprocessing data. It enables us to quickly visualize missing data, perform data cleaning, plot data distributions, plot correlations, and visualize categorical column values. Explanations of key functionalities can be found on Medium/TowardsDataScience in the examples section, or on YouTube (Data Professor).

    Original Github repo

    klib header image: https://raw.githubusercontent.com/akanz1/klib/main/examples/images/header.png

    Usage

    !pip install klib

    import klib
    import pandas as pd

    df = pd.read_csv("data.csv")  # load any tabular dataset into a DataFrame

    # klib describe-functions for visualizing datasets
    klib.cat_plot(df)         # visualization of the number and frequency of categorical features
    klib.corr_mat(df)         # color-encoded correlation matrix
    klib.corr_plot(df)        # color-encoded heatmap, ideal for correlations
    klib.dist_plot(df)        # distribution plot for every numeric feature
    klib.missingval_plot(df)  # figure containing information about missing values

    Examples

    Take a look at this starter notebook.

    Further examples, as well as applications of the functions can be found here.

    Contributing

    Pull requests and ideas, especially for further functions are welcome. For major changes or feedback, please open an issue first to discuss what you would like to change. Take a look at this Github repo.

    License

    MIT

  13. Adult dataset preprocessed

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 1, 2024
    Cite
    Pustozerova, Anastasia (2024). Adult dataset preprocessed [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12533513
    Explore at:
    Dataset updated
    Jul 1, 2024
    Dataset provided by
    Pustozerova, Anastasia
    Schuster, Verena
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The files "adult_train.csv" and "adult_test.csv" contain preprocessed versions of the Adult dataset from the USI repository.

    The file "adult_preprocessing.ipynb" contains a python notebook file with all the preprocessing steps used to generate "adult_train.csv" and "adult_test.csv" from the original Adult dataset.

    The preprocessing steps include:

    • One-hot encoding of categorical values
    • Imputation of missing values using a KNN imputer with k=1
    • Standard scaling of ordinal attributes

    Note: we assume the scenario in which the test set is available before training (every attribute besides the target, "income"), so we combine the train and test sets before preprocessing.
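
    A sketch of the steps above with pandas and scikit-learn (the file names are hypothetical, and for brevity scaling is applied to all encoded columns here, whereas the notebook scales the ordinal attributes):

    import pandas as pd
    from sklearn.impute import KNNImputer
    from sklearn.preprocessing import StandardScaler

    # Combine train and test before preprocessing, as described above
    train = pd.read_csv("adult_train_original.csv")  # hypothetical file name
    test = pd.read_csv("adult_test_original.csv")    # hypothetical file name
    full = pd.concat([train, test], ignore_index=True)

    # One-hot-encode categorical attributes (the target "income" is excluded)
    X = pd.get_dummies(full.drop(columns=["income"]))

    # Impute missing values with a 1-nearest-neighbour imputer (k=1)
    X_imputed = pd.DataFrame(KNNImputer(n_neighbors=1).fit_transform(X), columns=X.columns)

    # Standard-scale the encoded attributes
    X_scaled = StandardScaler().fit_transform(X_imputed)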

  14. Geographical distribution and climate data of Cycas taiwaniana

    • scidb.cn
    Updated Jan 3, 2025
    Cite
    CHUNPING XIE (2025). Geographical distribution and climate data of Cycas taiwaniana [Dataset]. http://doi.org/10.57760/sciencedb.19432
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 3, 2025
    Dataset provided by
    Science Data Bank
    Authors
    CHUNPING XIE
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Description: Geographical Distribution and Climate Data of Cycas taiwaniana (Taiwanese Cycad)

    This dataset contains the geographical distribution and climate data for Cycas taiwaniana, focusing on its presence across regions in Fujian, Guangdong, and Hainan provinces of China. The dataset includes geographical coordinates (longitude and latitude), monthly climate data (minimum and maximum temperature, and precipitation), as well as bioclimatic variables based on the WorldClim dataset.

    Temporal and Spatial Information. The data cover long-term climate information, with monthly data for each location recorded over a 12-month period (January to December). The dataset includes spatial data in terms of longitude and latitude, corresponding to the various locations where Cycas taiwaniana populations are present. The spatial resolution is specific to each point location, and the temporal resolution reflects the monthly climate data for each year.

    Data Structure and Units. The dataset consists of 36 records, each representing a unique location with corresponding climate and geographical data. The table includes the following columns:

    1. No.: Unique identifier for each data record
    2. Longitude: Geographic longitude in decimal degrees
    3. Latitude: Geographic latitude in decimal degrees
    4. tmin1 to tmin12: Minimum temperature (°C) for each month (January to December)
    5. tmax1 to tmax12: Maximum temperature (°C) for each month (January to December)
    6. prec1 to prec12: Precipitation (mm) for each month (January to December)
    7. bio1 to bio19: Bioclimatic variables (e.g., annual mean temperature, temperature seasonality, precipitation) derived from WorldClim data (units vary by variable)

    The units for each measurement are as follows: temperature in degrees Celsius (°C); precipitation in millimeters (mm); bioclimatic variables in units that vary with the specific variable (e.g., °C, mm).

    Data Gaps and Missing Values. The dataset contains some missing values, particularly in the precipitation columns for certain months and locations. These missing values may result from gaps in climate station data or limitations in data collection for specific regions. Missing values are indicated as "NA" (Not Available). Where data gaps exist, no estimations were made, and the absence of the data is acknowledged in the record.

    File Format and Software Compatibility. The dataset is provided in CSV format for ease of use and compatibility with various data analysis tools. It can be opened and processed with software such as Microsoft Excel, R (https://cran.r-project.org/), or Python with pandas (https://www.python.org/), and with any other software that supports CSV files.

    This dataset provides valuable information for research related to the geographical distribution and climate preferences of Cycas taiwaniana and can be used to inform conservation strategies, ecological studies, and climate change modeling.

15. Replication Data for: No Data? No worries! How image generation can alleviate image scarcity in wildlife surveys: a case study with muskox (Ovibos moschatus)

    • borealisdata.ca
    Updated Apr 8, 2025
    Cite
    Simon Durand (2025). Replication Data for: No Data? No worries! How image generation can alleviate image scarcity in wildlife surveys: a case study with muskox (Ovibos moschatus) [Dataset]. http://doi.org/10.5683/SP3/LCOG2G
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 8, 2025
    Dataset provided by
    Borealis
    Authors
    Simon Durand
    License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

This repository contains all images (that can be publicly shared), annotation data in the form of JSON files, and the Python scripts used to produce the paper "No Data? No worries! How image generation can alleviate image scarcity in wildlife surveys: a case study with muskox (Ovibos moschatus)".
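Since the annotation schema is not specified here, a minimal, hedged sketch for inspecting one of the JSON files might look like this (the file name and structure are assumptions, not the authors' actual schema):

```python
import json

# Hypothetical file name; the repository's actual annotation files may differ.
with open("annotations.json") as f:
    annotations = json.load(f)

# Inspect the top-level structure before assuming any particular schema.
if isinstance(annotations, dict):
    print(list(annotations.keys()))
else:
    print(type(annotations), len(annotations))
```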

16. Data and code associated with "The Observed Availability of Data and Code in Earth Science and Artificial Intelligence"

    • zenodo.org
    csv, text/x-python +1
    Updated Feb 20, 2025
    Cite
    Erin Jones; Brandon McClung; Hadi Fawad; Amy McGovern; Erin Jones; Brandon McClung; Hadi Fawad; Amy McGovern (2025). Data and code associated with "The Observed Availability of Data and Code in Earth Science and Artificial Intelligence" [Dataset]. http://doi.org/10.5281/zenodo.14902836
    Explore at:
csv, txt, text/x-python (available download formats)
    Dataset updated
    Feb 20, 2025
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
    Erin Jones; Brandon McClung; Hadi Fawad; Amy McGovern; Erin Jones; Brandon McClung; Hadi Fawad; Amy McGovern
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Feb 2025
    Area covered
    Earth
    Description

    Data and code associated with "The Observed Availability of Data and Code in Earth Science
    and Artificial Intelligence" by Erin A. Jones, Brandon McClung, Hadi Fawad, and Amy McGovern.

Instructions: To reproduce the figures, download all associated Python and CSV files and place
them in a single directory. Then run BAMS_plot.py as you would any other Python script on your
system.

    Code:
BAMS_plot.py: Python code that categorizes data availability statements according to the data
documented below and creates Figures 1-3.

    Code was originally developed for Python 3.11.7 and run in the Spyder
    (version 5.4.3) IDE.

    Libraries utilized:
    numpy (version 1.26.4)
    pandas (version 2.1.4)
    matplotlib (version 3.8.0)

    For additional documentation, please see code file.

    Data:
    ASDC_AIES.csv: CSV file containing relevant availability statement data for Artificial
    Intelligence for the Earth Systems (AIES)
    ASDC_AI_in_Geo.csv: CSV file containing relevant availability statement data for Artificial
    Intelligence in Geosciences (AI in Geo.)
    ASDC_AIJ.csv: CSV file containing relevant availability statement data for Artificial
    Intelligence (AIJ)
    ASDC_MWR.csv: CSV file containing relevant availability statement data for Monthly
    Weather Review (MWR)


    Data documentation:
    All CSV files contain the same format of information for each journal. The CSV files above are
    needed for the BAMS_plot.py code attached.

Records were analyzed based on the criteria below; a short pandas sketch for loading these files follows the list.

    Records:
    1) Title of paper
    The title of the examined journal article.
    2) Article DOI (or URL)
A link to the examined journal article. For AIES, AI in Geo., and MWR, the DOI is
generally given. For AIJ, the URL is given.
    3) Journal name
    The name of the journal where the examined article is published. Either a full
    journal name (e.g., Monthly Weather Review), or the acronym used in the
    associated paper (e.g., AIES) is used.
    4) Year of publication
    The year the article was posted online/in print.
    5) Is there an ASDC?
    If the article contains an availability statement in any form, "yes" is
    recorded. Otherwise, "no" is recorded.
    6) Justification for non-open data?
    If an availability statement contains some justification for why data is not
    openly available, the justification is summarized and recorded as one of the
    following options: 1) Dataset too large, 2) Licensing/Proprietary, 3) Can be
    obtained from other entities, 4) Sensitive information, 5) Available at later
    date. If the statement indicates any data is not openly available and no
justification is provided, or if no statement is provided, "None"
    is recorded. If the statement indicates openly available data or no data
    produced, "N/A" is recorded.
    7) All data available
    If there is an availability statement and data is produced, "y" is recorded
    if means to access data associated with the article are given and there is no
    indication that any data is not openly available; "n" is recorded if no means
    to access data are given or there is some indication that some or all data is
    not openly available. If there is no availability statement or no data is
    produced, the record is left blank.
    8) At least some data available
    If there is an availability statement and data is produced, "y" is recorded
    if any means to access data associated with the article are given; "n" is
    recorded if no means to access data are given. If there is no availability
    statement or no data is produced, the record is left blank.
    9) All code available
    If there is an availability statement and data is produced, "y" is recorded
    if means to access code associated with the article are given and there is no
    indication that any code is not openly available; "n" is recorded if no means
    to access code are given or there is some indication that some or all code is
    not openly available. If there is no availability statement or no data is
    produced, the record is left blank.
    10) At least some code available
    If there is an availability statement and data is produced, "y" is recorded
    if any means to access code associated with the article are given; "n" is
    recorded if no means to access code are given. If there is no availability
    statement or no data is produced, the record is left blank.
    11) All data available upon request
    If there is an availability statement indicating data is produced and no data
    is openly available, "y" is recorded if any data is available upon request to
    the authors of the examined journal article (not a request to any other
    entity); "n" is recorded if no data is available upon request to the authors
    of the examined journal article. If there is no availability statement, any
    data is openly available, or no data is produced, the record is left blank.
    12) At least some data available upon request
    If there is an availability statement indicating data is produced and not all
    data is openly available, "y" is recorded if all data is available upon
    request to the authors of the examined journal article (not a request to any
    other entity); "n" is recorded if not all data is available upon request to
    the authors of the examined journal article. If there is no availability
    statement, all data is openly available, or no data is produced, the record
    is left blank.
13) No data produced
    If there is an availability statement that indicates that no data was
    produced for the examined journal article, "y" is recorded. Otherwise, the
    record is left blank.
14) Links work
    If the availability statement contains one or more links to a data or code
    repository, "y" is recorded if all links work; "n" is recorded if one or more
    links do not work. If there is no availability statement or the statement
    does not contain any links to a data or code repository, the record is left
    blank.
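As an illustration of how these records might be consumed, a minimal pandas sketch is shown below; the column header string is an assumption based on record 5 above, not a verified header from the CSV files:

```python
import pandas as pd

files = {
    "AIES": "ASDC_AIES.csv",
    "AI in Geo.": "ASDC_AI_in_Geo.csv",
    "AIJ": "ASDC_AIJ.csv",
    "MWR": "ASDC_MWR.csv",
}

for journal, path in files.items():
    df = pd.read_csv(path)
    # Column name assumed from record 5 above; check the actual header first.
    has_asdc = df["Is there an ASDC?"].str.strip().str.lower().eq("yes")
    print(f"{journal}: {has_asdc.mean():.1%} of {len(df)} articles have a statement")
```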

  17. Temperature and Humidity Time Series of Cold Storage Room Monitoring

    • zenodo.org
    bin, csv, png, zip
    Updated Jun 30, 2025
    Cite
    Elia Henrichs; Elia Henrichs; Florian Stoll; Christian Krupitzer; Christian Krupitzer; Florian Stoll (2025). Temperature and Humidity Time Series of Cold Storage Room Monitoring [Dataset]. http://doi.org/10.5281/zenodo.15130001
    Explore at:
png, bin, zip, csv (available download formats)
    Dataset updated
    Jun 30, 2025
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
    Elia Henrichs; Elia Henrichs; Florian Stoll; Christian Krupitzer; Christian Krupitzer; Florian Stoll
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The datasets contain the raw data and preprocessed data (following the steps in the Jupyter Notebook) of 9 DHT22 sensors in a cold storage room. Details on how the data was gathered can be found in the publication "Self-Adaptive Integration of Distributed Sensor Systems for Monitoring Cold Storage Environments" by Elia Henrichs, Florian Stoll, and Christian Krupitzer.

    This dataset consists of the following files:

    • Raw.zip - The raw data CSV files of the nine Arduino-based data loggers, containing the semicolon-separated columns date (formatted as dd.mm.yyyy), time (formatted as HH:MM:SS), temperature, and humidity. These files can contain multiple headers.
• Preprocessed.zip - The preprocessed data CSV files of the nine Arduino-based data loggers, containing the semicolon-separated columns date (formatted as dd.mm.yyyy), time (formatted as HH:MM:SS), temperature, and humidity. Multiple headers were removed, and the length of the datasets was aligned to equal length by filling missing values with NaN. (A parsing sketch follows this list.)
    • DataPreprocessing.ipynb - Jupyter Notebook containing the code to preprocess the data and create the overview file, which summarizes key characteristics of the dataset.
    • DataPreliminaryAnalysis.ipynb - Jupyter Notebook containing the code to perform the preliminary data analysis (general statistics, peaks, and matrix profiles).
    • experiment_actions.csv - CSV file logging performed actions (door openings and sensor movements).
    • overview.csv - CSV file summarizing key characteristics of the dataset and preliminary data analysis.
    • temphum_logger.ino - Source code to run the Arduino-based data logger with a sampling rate of 5 sec.
    • Arduino_setup_sketch_v1.png - Circuit diagram of the Arduino-based data logger.
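As a rough sketch of the preprocessing described above, repeated header rows can be dropped and the semicolon-separated columns parsed as follows; the file name is a placeholder, and the literal header text is an assumption:

```python
import pandas as pd

# Placeholder file name; the raw files in Raw.zip are named per data logger.
df = pd.read_csv("logger1.csv", sep=";", header=None,
                 names=["date", "time", "temperature", "humidity"])

# The raw files can contain repeated header rows; drop any row whose 'date'
# field holds the (assumed) literal header text rather than a dd.mm.yyyy date.
df = df[df["date"] != "date"].copy()

# Combine date and time into a single timestamp column.
df["timestamp"] = pd.to_datetime(df["date"] + " " + df["time"],
                                 format="%d.%m.%Y %H:%M:%S")

# Coerce measurement columns to numeric, turning bad values into NaN.
df[["temperature", "humidity"]] = df[["temperature", "humidity"]].apply(
    pd.to_numeric, errors="coerce")
```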
18. Domestic Electrical Load Survey - Key Variables 1994-2014 - South Africa

    • datafirst.uct.ac.za
    Updated Apr 29, 2020
    Cite
    Wiebke Toussaint (2020). Domestic Electrical Load Survey - Key Variables 1994-2014 - South Africa [Dataset]. http://www.datafirst.uct.ac.za/Dataportal/index.php/catalog/758
    Explore at:
    Dataset updated
    Apr 29, 2020
    Dataset authored and provided by
    Wiebke Toussaint
    Time period covered
    1994 - 2014
    Area covered
    South Africa
    Description

    Abstract

This dataset is a harmonisation of the Domestic Electrical Load Survey (DELS) 1994-2014 dataset. The DELS questionnaires were changed in 2000, so survey questions vary between 1994-1999 and 2000-2014. This makes data processing complex, as survey responses first need to be associated with their year of collection and corresponding questionnaire before they can be correctly interpreted. This Key Variables dataset is a user-friendly version of the original dataset. It contains household responses to the most important survey questions, as well as geographic and linking information that allows the households to be matched to their respective electricity metering data. This dataset and similar custom datasets can be produced from the DELS 1994-2014 dataset with the Python package delprocess. The data processing section includes a description of how this dataset was created. The development of the tools to create this dataset was funded by the South African National Energy Development Initiative (SANEDI).

    Geographic coverage

    The study had national coverage.

    Analysis unit

    Households

    Universe

    The dataset covers South African households in the DELS 1994-2014 dataset. These are electrified households that received electricity either directly from Eskom or from their local municipality.

    Kind of data

    Administrative records

    Sampling procedure

    The dataset includes all households for which survey responses have been captured in the DELS1994-2014 dataset.

    Mode of data collection

    Face-to-face [f2f]

    Cleaning operations

    This dataset has been constructed from the DELS 1994-2014 dataset using the data processing functions in the delprocess python package (www.github.com/wiebket/delprocess: release v1.0). The delprocess python package takes the complexities of the original DELS 1994-2014 dataset into account and makes use of 'spec files' to specify the processing steps that must be performed. To retrieve data for all survey years, two separate spec files are required to process survey response from 1994-1999 and 2000-2014. The spec files used to produce this dataset are included in the program files and can be used as templates for new custom datasets. Full instructions on how to use them to process the data are in the README file contained in the delprocess package.

    SPEC FILES specify the following processing steps:

    1. List of search terms for which survey questions will be searched, and variables returned
    2. Transformations (addition, subtraction, multiplication) of variables retrieved from search output
    3. Bin intervals for variables (requires numeric data)
4. Labels for bins (requires binned data)
    5. Details of bin segments
    6. Replacement (encoding) of coded variable values
    7. Higher level geography detail

    In particular, the DELSKV 1994-2014 dataset has been produced by specifying the following processing steps:

TRANSFORMATIONS

• monthly_income from 1994-1999 is the variable returned by the 'income' search term
• monthly_income from 2000-2014 is calculated as the sum of the variables returned by the 'earn per month', 'money from small business' and 'external' search terms
• Appliance numbers from 1994-1999 are the count of appliances (no data was collected on broken appliances)
• Appliance numbers from 2000-2014 are the count of appliances minus the count of broken appliances (except for TV, which included no information on broken appliances)
• A new total_adults variable was created by summing the number of all occupants (male and female) over 16 years old
• A new total_children variable was created by summing the number of all occupants (male and female) under 16 years old
• A new total_pensioners variable was created by summing the number of pensioners (male and female) over 16 years old
• A new total_unemployed variable was created by summing the number of unemployed occupants (male and female) over 16 years old
• A new total_part_time variable was created by summing the number of part-time employed occupants (male and female) over 16 years old
• roof_material and wall_material values for 1994-1999 were augmented by 1
• water_access was transformed for 1994-1999 to be 4 minus the 'watersource' value

REPLACEMENTS

• Appliance usage values have been replaced with: 0=never 1=monthly 2=weekly 3=daily
• water_access values have been replaced with: 1=nearby river/dam/borehole 2=block/street taps 3=tap in yard 4=tap inside house
• roof_material and wall_material values have been replaced with: 1=IBR/Corr.Iron/Zinc 2=Thatch/Grass 3=Wood/Masonite board 4=Brick 5=Block 6=Plaster 7=Concrete 8=Tiles 9=Plastic 10=Asbestos 11=Daub/Mud/Clay

OTHER NOTES

Appliance usage information was only collected after 2000. No binning was done to segment survey responses for this dataset.
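A hedged sketch of how two of these processing steps might look in pandas; the raw column names here are hypothetical stand-ins, since the actual DELS variable names come from the delprocess search terms:

```python
import pandas as pd

# Hypothetical raw column names standing in for the actual DELS variables.
df = pd.DataFrame({
    "males_16_24": [1, 0], "females_16_24": [0, 2],
    "males_25_34": [1, 1], "females_25_34": [1, 0],
    "fridge_usage": [3, 0],
})

# Transformation: total_adults as the sum of all occupants over 16 years old.
adult_cols = ["males_16_24", "females_16_24", "males_25_34", "females_25_34"]
df["total_adults"] = df[adult_cols].sum(axis=1)

# Replacement: encode appliance usage codes as the labels listed above.
usage_labels = {0: "never", 1: "monthly", 2: "weekly", 3: "daily"}
df["fridge_usage_label"] = df["fridge_usage"].map(usage_labels)
```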

    Data appraisal

• The 2000-2014 survey questions contain no variable for 'number of females: 50+', which goes against the pattern of the other occupant age categories.
• Spacing in the original questions is irregular and can cause challenges when specifying transformations (e.g., 'number of males: 16-24' and 'number of males: 25 - 34', 'part time' and 'parttime').
• Spelling mistakes in the original questions can cause challenges when specifying transformations (e.g., 'head emploed part time').

MISSING VALUES

Missing values have not been replaced and are represented as blanks, except for the imputed columns (total_adults, total_children, ...) and appliances after 2000, where missing values have been replaced with a 0.

19. S1 Data -

    • plos.figshare.com
    zip
    Updated Mar 19, 2024
    Cite
    Xinfu Pang; Wei Sun; Haibo Li; Wei Liu; Changfeng Luan (2024). S1 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0300229.s001
    Explore at:
zip (available download formats)
    Dataset updated
    Mar 19, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Xinfu Pang; Wei Sun; Haibo Li; Wei Liu; Changfeng Luan
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Accurate short-term load forecasting is of great significance in improving the dispatching efficiency of power grids, ensuring their safe and reliable operation, and guiding power systems to formulate reasonable production plans and reduce waste of resources. However, the traditional short-term load forecasting method has limited nonlinear mapping ability and weak generalization to unknown data, and it is prone to losing time-series information, so its forecasting accuracy can still be improved. This study presents a short-term power load forecasting method based on Bagging-stochastic configuration networks (SCNs). First, the missing values in the original data are filled with the average values. Second, the influencing factors, such as the weather- and week-type data, are encoded. Then, combined with the encoded influencing factors, the Bagging-SCNs integration algorithm is used to predict the short-term load. Finally, taking the daily load data of Quanzhou City, Zhejiang Province as an example, the method is implemented in Python and compared with the long short-term memory neural network algorithm and the single-SCNs algorithm. Simulation results show that the proposed method achieves high accuracy for medium- and short-term load forecasting and significantly improves on the comparison methods.
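The preprocessing described above (mean imputation, then encoding of weather- and week-type factors) could be sketched along these lines; the column names are illustrative assumptions, not the authors' actual code:

```python
import pandas as pd

# Illustrative columns; the paper's actual data layout is not given here.
df = pd.DataFrame({
    "load_mw": [412.0, None, 398.5, 405.2],
    "weather": ["sunny", "rain", "sunny", "cloudy"],
    "day_type": ["weekday", "weekday", "weekend", "weekday"],
})

# Step 1: fill missing load values with the column average.
df["load_mw"] = df["load_mw"].fillna(df["load_mw"].mean())

# Step 2: encode categorical influencing factors as numeric codes.
for col in ["weather", "day_type"]:
    df[col] = df[col].astype("category").cat.codes
```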

20. U.S. Historical Climate - Monthly Averages for GHCN-D Stations for 1981 - 2010

    • community-climatesolutions.hub.arcgis.com
    Updated Apr 16, 2019
    Cite
    Esri (2019). U.S. Historical Climate - Monthly Averages for GHCN-D Stations for 1981 - 2010 [Dataset]. https://community-climatesolutions.hub.arcgis.com/items/b8df6517ceac42af9ab483089296ed04
    Explore at:
    Dataset updated
    Apr 16, 2019
    Dataset authored and provided by
    Esri
    Area covered
    Description

This point layer contains monthly summaries of daily temperatures (means, minimums, and maximums) and precipitation levels (sum, lowest, and highest) for the period January 1981 through December 2010 for weather stations in the Global Historical Climate Network Daily (GHCND). Data in this service were obtained from web services hosted by the Applied Climate Information System (ACIS). ACIS staff curate the values for the U.S., including correcting erroneous values, reconciling data from stations that have been moved over their history, etc. The data were compiled at Esri from publicly available sources hosted and administered by NOAA. Because the ACIS data is updated and corrected on an ongoing basis, the date of collection for this layer was Jan 23, 2019.

The following process was used to produce this dataset:

1. Download the most current list of stations from ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt, import it into Microsoft Excel, and save as CSV.
2. In ArcGIS, import the CSV as a geodatabase table and use the XY Event Layer tool to locate each point.
3. Using a detailed U.S. boundary, extract the points that fall within the 50 U.S. states, the District of Columbia, and Puerto Rico.
4. Using Python with DA.UpdateCursor and urllib2, access the ACIS Web Services API to determine whether each station had at least 50 monthly values of temperature data, and delete the other stations.
5. Using Python, add the necessary field names and acquire all monthly values for the remaining stations. (Thus, there are stations that have some missing data.)
6. Using Python, add fields and convert the standard values to metric values so both would be present.

Thus, there are four sets of monthly data in this dataset:

a. Monthly means, mins, and maxes of daily temperatures, in degrees Fahrenheit.
b. Monthly means of monthly sums of precipitation, and the lowest and highest precipitation levels during the period 1981 to 2010, in mm.
c. The temperatures in (a), in degrees Celsius.
d. The precipitation levels in (b), in inches.

After initially publishing these data in a different service, it was learned that more precise coordinates for station locations were available from the Enhanced Master Station History Report (EMSHR) published by NOAA NCDC. With the publication of this layer, these more precise coordinates are used. A large subset of the EMSHR metadata is available via EMSHR Stations Locations and Metadata 1738 to Present.

If your study area includes areas outside of the U.S., use the World Historical Climate - Monthly Averages for GHCN-D Stations 1981 - 2010 layer. The data in that layer come from the same source archive; however, they are not curated by the ACIS staff and may contain errors.

Revision History:
• Initially published: 23 Jan 2019.
• Updated 16 Apr 2019: We learned more precise coordinates for station locations were available from the Enhanced Master Station History Report (EMSHR) published by NOAA NCDC. With the publication of this layer, the geometry and attributes for 3,222 of 9,636 stations now have more precise coordinates. The schema was updated to include the NCDC station identifier, and elevation fields in feet and meters are also included.

Cite as: Esri, 2019: U.S. Historical Climate - Monthly Averages for GHCN-D Stations for 1981 - 2010. ArcGIS Online, Accessed
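The unit-conversion step in the process above amounts to two simple formulas; a minimal sketch (the function names are placeholders, not the layer's actual tooling):

```python
# Placeholder helpers; the layer's actual conversion fields were computed in ArcGIS.
def fahrenheit_to_celsius(temp_f: float) -> float:
    return (temp_f - 32.0) * 5.0 / 9.0

def mm_to_inches(precip_mm: float) -> float:
    return precip_mm / 25.4

print(fahrenheit_to_celsius(68.0))  # 20.0
print(mm_to_inches(25.4))           # 1.0
```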
