In this dataset I simply show how to handle missing values in your data with the help of Python libraries such as NumPy and pandas. You can also see the use of NaN and None values, and how to detect, drop, and fill null values.
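As a quick illustration of what the dataset is meant to practice, here is a minimal pandas sketch (the DataFrame and its column names are invented for illustration):

import numpy as np
import pandas as pd

# Small example frame; both None and np.nan show up as missing (NaN) values
df = pd.DataFrame({
    "age": [25, np.nan, 31, None],
    "city": ["Lahore", "Karachi", None, "Multan"],
})

# Detecting null values
print(df.isnull())          # boolean mask of missing cells
print(df.isnull().sum())    # number of missing values per column

# Dropping rows that contain any null value
dropped = df.dropna()

# Filling null values (numeric column with its mean, text column with a label)
filled = df.fillna({"age": df["age"].mean(), "city": "unknown"})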
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
• This dataset is designed for learning how to identify missing data in Python.
• It focuses on techniques to detect null, NaN, and incomplete values.
• It includes examples of visualizing missing data patterns using Python libraries (a short sketch follows this list).
• Useful for beginners practicing data preprocessing and data cleaning.
• Helps users understand missing data handling methods for machine learning workflows.
• Supports practical exploration of datasets before model training.
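A minimal sketch of one way to visualize a missing-data pattern with pandas and seaborn (the file name and the use of seaborn are assumptions for illustration):

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("data_with_missing_values.csv")  # hypothetical file name

# Each cell marks whether a value is missing; vertical stripes reveal columns
# that are systematically incomplete.
sns.heatmap(df.isna(), cbar=False)
plt.title("Missing-value pattern")
plt.tight_layout()
plt.show()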
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset is designed specifically for beginners and intermediate learners to practice data cleaning techniques using Python and Pandas.
It includes 500 rows of simulated employee data with intentional errors such as:
Missing values in Age and Salary
Typos in email addresses (e.g., @gamil.com instead of @gmail.com)
Inconsistent city name casing (e.g., lahore, Karachi)
Extra spaces in department names (e.g., " HR ")
✅ Skills You Can Practice:
Detecting and handling missing data
String cleaning and formatting
Removing duplicates
Validating email formats
Standardizing categorical data
You can use this dataset to build your own data cleaning notebook (a starter sketch is given below), or use it in interviews, assessments, and tutorials.
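A hedged starter sketch of that kind of cleaning with pandas (the file name and column names, such as "Age", "City", "Department", and "Email", are assumptions about the layout):

import pandas as pd

df = pd.read_csv("employee_data.csv")  # hypothetical file name

# Handle missing Age/Salary values (here: fill with the column median)
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Salary"] = df["Salary"].fillna(df["Salary"].median())

# Standardize city casing and strip extra spaces from department names
df["City"] = df["City"].str.strip().str.title()
df["Department"] = df["Department"].str.strip()

# Fix the known email typo, then flag rows that still fail a basic format check
df["Email"] = df["Email"].str.replace("@gamil.com", "@gmail.com", regex=False)
valid_email = df["Email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)

# Keep valid rows and drop exact duplicates
df = df[valid_email].drop_duplicates()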
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 1. Python code. The notebook highlights core components of the code applied in the study.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Prediction models are used to predict an outcome based on input variables. Missing data in input variables often occur at model development and at prediction time. The missForestPredict R package proposes an adaptation of the missForest imputation algorithm that is fast, user-friendly and tailored for prediction settings. The algorithm iteratively imputes variables using random forests until a convergence criterion, unified for continuous and categorical variables, is met. The imputation models are saved for each variable and iteration and can be applied later to new observations at prediction time. The missForestPredict package offers extended error monitoring, control over variables used in the imputation and custom initialization. This allows users to tailor the imputation to their specific needs. The missForestPredict algorithm is compared to mean/mode imputation, linear regression imputation, mice, k-nearest neighbours, bagging, miceRanger and IterativeImputer on eight simulated datasets with simulated missingness (48 scenarios) and eight large public datasets using different prediction models. missForestPredict provides competitive results in prediction settings within short computation times.
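missForestPredict itself is an R package, but the underlying idea of iteratively imputing each variable with random forests can be sketched in Python using scikit-learn's IterativeImputer (one of the methods it is compared against). This is an illustrative analogue under arbitrary settings, not the package's own implementation:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables IterativeImputer
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.2] = np.nan  # introduce roughly 20% missingness

# Iteratively model each variable from the others with a random forest,
# looping until the imputations stabilize or max_iter is reached.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X)

# The fitted imputer can be reused on new observations at prediction time.
X_new = rng.normal(size=(5, 4))
X_new[0, 2] = np.nan
X_new_imputed = imputer.transform(X_new)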
This resource contains a Python script used to clean and preprocess the alum dosage dataset from a small Oklahoma water treatment plant. The script handles missing values, removes outliers, merges historical water quality and weather data, and prepares the dataset for AI model training.
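A hedged sketch of the kind of preprocessing described; the file names, column names, and the IQR outlier rule are assumptions for illustration, not the actual script:

import pandas as pd

# Load the raw plant records and weather data (hypothetical file/column names)
dosage = pd.read_csv("alum_dosage.csv", parse_dates=["timestamp"]).sort_values("timestamp")
weather = pd.read_csv("weather.csv", parse_dates=["timestamp"]).sort_values("timestamp")

# Handle missing values: interpolate short gaps, then drop rows still incomplete
dosage["alum_dose"] = dosage["alum_dose"].interpolate(limit=3)
dosage = dosage.dropna(subset=["alum_dose", "turbidity"])

# Remove outliers with a simple IQR rule on the dose column
q1, q3 = dosage["alum_dose"].quantile([0.25, 0.75])
iqr = q3 - q1
dosage = dosage[dosage["alum_dose"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Merge historical water quality and weather data on the nearest timestamp
merged = pd.merge_asof(dosage, weather, on="timestamp", direction="nearest")
merged.to_csv("alum_dosage_clean.csv", index=False)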
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.
A description of this dataset, including the methodology and validation results, is available at:
Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: an independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data, 17, 4305–4329, https://doi.org/10.5194/essd-17-4305-2025, 2025.
ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
Since the requirement for a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such gap-filling methods is that they rely only on the original observational record, without the need for ancillary variables or model-based information. Due to this intrinsic challenge, no global, long-term, univariate gap-filled product had been available until now. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments to fill satellite-like gaps introduced to GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and consider the gap size and local vegetation conditions as parameters that affect the gap-filling performance.
You can use command line tools such as wget or curl to download (and extract) data for multiple years. The following script will download and extract the complete data set to the local directory ~/Downloads on Linux or macOS systems.
#!/bin/bash
# Set download directory
DOWNLOAD_DIR=~/Downloads
base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"
# Loop through years 1991 to 2023 and download & extract data
for year in {1991..2023}; do
echo "Downloading $year.zip..."
wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
rm "$DOWNLOAD_DIR/$year.zip"
done
The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), each subdirectory containing one netCDF image file for a specific day (DD), month (MM) in a 2-dimensional (longitude, latitude) grid system (CRS: WGS84). The file name has the following convention:
ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc
Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:
Additional information for each variable is given in the netCDF attributes.
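A minimal sketch of reading one of these files in Python with xarray (assuming xarray and netCDF4 are installed; the date in the file name and the variable/coordinate names are only examples):

import xarray as xr

# Open a single daily image following the naming convention above
path = "2020/ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-20200101000000-fv09.1r1.nc"
ds = xr.open_dataset(path)

print(ds.data_vars)   # list the data variables and their netCDF attributes

# Pick the soil moisture variable and read the value nearest a point of interest
# (variable and coordinate names assumed here as "sm", "lat", "lon")
sm = ds["sm"]
print(sm.sel(lat=48.2, lon=16.4, method="nearest").values)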
Changes in v9.1r1 (previous version was v09.1):
These data can be read by any software that supports Climate and Forecast (CF) conformant metadata standards for netCDF files, such as:
The following records are all part of the ESA CCI Soil Moisture science data records community:

ESA CCI SM MODELFREE Surface Soil Moisture Record: https://doi.org/10.48436/svr1r-27j77
This data package includes data and scripts from the manuscript "Denoising autoencoder for reconstructing sensor observation data and predicting evapotranspiration: noisy and missing values repair and uncertainty quantification".

The study addressed common challenges faced in environmental sensing and modeling, including uncertain input data, missing sensor observations, and high-dimensional datasets with interrelated but redundant variables. Point-scaled meteorological and soil sensor observations were perturbed with noise and missing values, and denoising autoencoder (DAE) neural networks were developed to reconstruct the perturbed data and further predict evapotranspiration. This study concluded that (1) the reconstruction quality of each variable depends on its cross-correlation and alignment to the underlying data structure, (2) uncertainties from the models were overall stronger than those from the data corruption, and (3) there was a tradeoff between reducing bias and reducing variance when evaluating the uncertainty of the machine learning models.

This package includes:
(1) Four IPython scripts (.ipynb): "DAE_train.ipynb" trains and evaluates DAE neural networks, "DAE_predict.ipynb" makes predictions from the trained DAE models, "ET_train.ipynb" trains and evaluates ET prediction neural networks, and "ET_predict.ipynb" makes predictions from trained ET models.
(2) One Python file (.py): "methods.py" includes all user-defined functions and Python code used in the IPython scripts.
(3) A "sub_models" folder that includes five trained DAE neural networks (in PyTorch format, .pt), which can be used to ingest input data before it is fed to the downstream ET models in "ET_train.ipynb" or "ET_predict.ipynb".
(4) Two data files (.csv). Daily meteorological, vegetation, and soil data is in "df_data.csv", and "df_meta.csv" contains the location and time information of "df_data.csv". Each row (index) in "df_meta.csv" corresponds to the same row in "df_data.csv". These data files are formatted to follow the data structure requirements and can be used directly in the IPython scripts; they have been shuffled chronologically to train machine learning models. The meteorological and soil data was collected using point sensors between 2019-2023 at
(4.a) three shrub-dominated field sites in East River, Colorado (named "ph1", "ph2" and "sg5" in "df_meta.csv", where "ph1" and "ph2" were located at PumpHouse Hillslopes, and "sg5" was at Snodgrass Mountain meadow) and
(4.b) one outdoor, mesoscale, herbaceous-dominated experiment in Berkeley, California (named "tb" in "df_meta.csv", short for the Smartsoils Testbed at Lawrence Berkeley National Lab).

See "df_data_dd.csv" and "df_meta_dd.csv" for variable descriptions and the Methods section for additional data processing steps. See "flmd.csv" and "README.txt" for brief file descriptions. All IPython scripts and Python files are written in and require the Python language.
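For readers unfamiliar with the approach, a minimal denoising-autoencoder sketch in PyTorch is given below; it is a generic illustration with arbitrary dimensions and corruption, not the trained models or training setup shipped in this package:

import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Reconstructs clean sensor vectors from corrupted (noisy/missing) inputs."""
    def __init__(self, n_features=16, n_hidden=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU())
        self.decoder = nn.Linear(n_hidden, n_features)

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Toy training loop: corrupt inputs with noise and random "missing" zeros,
# then train the network to recover the original values.
torch.manual_seed(0)
clean = torch.randn(256, 16)
noisy = clean + 0.1 * torch.randn_like(clean)
noisy[torch.rand_like(noisy) < 0.2] = 0.0  # crude stand-in for missing values

model = DenoisingAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(noisy), clean)
    loss.backward()
    optimizer.step()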
Overview
This repository contains ready-to-use frequency time series as well as the corresponding pre-processing scripts in python. The data covers three synchronous areas of the European power grid:
This work is part of the paper "Predictability of Power Grid Frequency" [1]. Please cite this paper when using the data and the code. For a detailed documentation of the pre-processing procedure, we refer to the supplementary material of the paper.
Data sources
We downloaded the frequency recordings from publicly available repositories of three different Transmission System Operators (TSOs).
Content of the repository
A) Scripts
The Python scripts run with Python 3.7 and the packages listed in "requirements.txt".
B) Data_converted and Data_cleansed
The folder "Data_converted" contains the output of "convert_data_format.py" and "Data_cleansed" contains the output of "clean_corrupted_data.py".
Use cases
We point out that this repository can be used in two different ways:
from helper_functions import *
import numpy as np
import pandas as pd

# Load the cleansed frequency time series; corrupted values are stored as NaN
cleansed_data = pd.read_csv('/Path_to_cleansed_data/data.zip',
                            index_col=0, header=None, squeeze=True,
                            parse_dates=[0])

# Find the contiguous NaN-free intervals and keep the longest one
valid_bounds, valid_sizes = true_intervals(~cleansed_data.isnull())
start, end = valid_bounds[np.argmax(valid_sizes)]
data_without_nan = cleansed_data.iloc[start:end]
License
We release the code in the folder "Scripts" under the MIT license [8]. In the case of Nationalgrid and Fingrid, we further release the pre-processed data in the folders "Data_converted" and "Data_cleansed" under the CC-BY 4.0 license [7]. TransnetBW originally did not publish their data under an open license. We have explicitly received permission from TransnetBW to publish the pre-processed version. However, we cannot publish our pre-processed version under an open license due to the missing license of the original TransnetBW data.
Studies utilizing Global Positioning System (GPS) telemetry rarely result in 100% fix success rates (FSR). Many assessments of wildlife resource use do not account for missing data, either assuming data loss is random or because of a lack of practical treatments for systematic data loss. Several studies have explored how the environment, technological features, and animal behavior influence rates of missing data in GPS telemetry, but previous spatially explicit models developed to correct for sampling bias have been specified to small study areas, to a small range of data loss, or to be species-specific, limiting their general utility. Here we explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use. We also evaluate patterns in missing data that relate to potential animal activities that change the orientation of the antennae, and characterize home-range probability of GPS detection for 4 focal species: cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni), and mule deer (Odocoileus hemionus).

Part 1, Positive Openness Raster (raster dataset): Openness is an angular measure of the relationship between surface relief and horizontal distance. For angles less than 90 degrees it is equivalent to the internal angle of a cone with its apex at a DEM location, and is constrained by neighboring elevations within a specified radial distance. A 480 meter search radius was used for this calculation of positive openness. Openness incorporates the terrain line-of-sight or viewshed concept and is calculated from multiple zenith and nadir angles, here along eight azimuths. Positive openness measures openness above the surface, with high values for convex forms and low values for concave forms (Yokoyama et al. 2002). We calculated positive openness using a custom Python script, following the methods of Yokoyama et al. (2002), using a USGS National Elevation Dataset as input.

Part 2, Northern Arizona GPS Test Collar (csv): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall overstory vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent datasets from stationary test collars of different make/model and fix interval programming, placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs suggest that changes in technological factors have a minor influence on the model's ability to predict FSR in new study areas in the southwestern US. The model training data are provided here as fix attempts by hour. This table can be linked with the site location shapefile using the site field.

Part 3, Probability Raster (raster dataset): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall overstory vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent datasets from stationary test collars of different make/model and fix interval programming, placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs suggest that changes in technological factors have a minor influence on the model's ability to predict FSR in new study areas in the southwestern US. We evaluated GPS telemetry datasets by comparing the mean probability of a successful GPS fix across study animals' home-ranges to the actual observed FSR of GPS-downloaded deployed collars on cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni), and mule deer (Odocoileus hemionus). Comparing the mean probability of acquisition within study animals' home-ranges and observed FSRs of GPS-downloaded collars resulted in an approximately 1:1 linear relationship with an r-squared of 0.68.

Part 4, GPS Test Collar Sites (shapefile): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall overstory vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent datasets from stationary test collars of different make/model and fix interval programming, placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs suggest that changes in technological factors have a minor influence on the model's ability to predict FSR in new study areas in the southwestern US.

Part 5, Cougar Home Ranges (shapefile): Cougar home-ranges were calculated to compare the mean probability of a GPS fix acquisition across the home-range to the actual fix success rate (FSR) of the collar, as a means of evaluating whether characteristics of an animal's home-range have an effect on observed FSR. We estimated home-ranges using the Local Convex Hull (LoCoH) method with the 90th isopleth. Only data obtained from GPS download of retrieved units were used. Satellite-delivered data were omitted from the analysis for animals where the collar was lost or damaged, because satellite delivery tends to lose an additional 10% of data. Comparisons with home-range mean probability of fix were also used as a reference for assessing whether the frequency with which animals use areas of low GPS acquisition rates may play a role in observed FSRs.

Part 6, Cougar Fix Success Rate by Hour (csv): Cougar GPS collar fix success varied by hour of day, suggesting circadian rhythms with bouts of rest during daylight hours may change the orientation of the GPS receiver, affecting the ability to acquire fixes. Raw data of overall fix success rates (FSR) and FSR by hour were used to predict relative reductions in FSR. Data only include direct GPS download datasets. Satellite-delivered data were omitted from the analysis for animals where the collar was lost or damaged, because satellite delivery tends to lose approximately an additional 10% of data.

Part 7, Openness Python Script version 2.0: This Python script was used to calculate positive openness using a 30 meter digital elevation model for a large geographic area in Arizona, California, Nevada, and Utah. A scientific research project used the script to explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains a comprehensive solar irradiance, imaging, and forecasting dataset. The goal with this release is to provide standardized solar and meteorological datasets to the research community for the accelerated development and benchmarking of forecasting methods. The data consist of three years (2014–2016) of quality-controlled, 1-min resolution global horizontal irradiance and direct normal irradiance ground measurements in California. In addition, we provide overlapping data from commonly used exogenous variables, including sky images, satellite imagery, Numerical Weather Prediction forecasts, and weather data. We also include sample codes of baseline models for benchmarking of more elaborated models.
Data usage
The usage of the datasets and sample codes presented here is intended for research and development purposes only and implies explicit reference to the paper: Pedro, H.T.C., Larson, D.P., Coimbra, C.F.M., 2019. A comprehensive dataset for the accelerated development and benchmarking of solar forecasting methods. Journal of Renewable and Sustainable Energy 11, 036102. https://doi.org/10.1063/1.5094494
Although every effort was made to ensure the quality of the data, no guarantees or liabilities are implied by the authors or publishers of the data.
Sample code
As part of the data release, we are also including the sample code written in Python 3. The preprocessed data used in the scripts are also provided. The code can be used to reproduce the results presented in this work and as a starting point for future studies. Besides the standard scientific Python packages (numpy, scipy, and matplotlib), the code depends on pandas for time-series operations, pvlib for common solar-related tasks, and scikit-learn for Machine Learning models. All required Python packages are readily available on Mac, Linux, and Windows and can be installed via, e.g., pip.
Units
All time stamps are in UTC (YYYY-MM-DD HH:MM:SS). All irradiance and weather data are in SI units. Sky image features are derived from 8-bit RGB (256 color levels) data. Satellite images are derived from 8-bit gray-scale (256 color levels) data.

Missing data
The string "NAN" indicates missing data.

File formats
All time series data files are in CSV (comma separated values) format. Images are provided in tar.bz2 archives.
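A minimal pandas sketch for loading one of the CSV files with "NAN" treated as missing data (the assumption that the first column is the UTC time stamp is illustrative):

import pandas as pd

# Parse the first column as the UTC time stamp and treat "NAN" as missing
weather = pd.read_csv("Folsom_weather.csv",
                      na_values=["NAN"],
                      parse_dates=[0],
                      index_col=0)
print(weather.isna().mean())  # fraction of missing values per column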
Files
Folsom_irradiance.csv (Primary): One-minute GHI, DNI, and DHI data.
Folsom_weather.csv (Primary): One-minute weather data.
Folsom_sky_images_{YEAR}.tar.bz2 (Primary): Tar archives with daytime sky images captured at 1-min intervals for the years 2014, 2015, and 2016, compressed with bz2.
Folsom_NAM_lat{LAT}_lon{LON}.csv (Primary): NAM forecasts for the four nodes nearest the target location. {LAT} and {LON} are replaced by the node's coordinates listed in Table I in the paper.
Folsom_sky_image_features.csv (Secondary): Features derived from the sky images.
Folsom_satellite.csv (Secondary): 10 pixel by 10 pixel GOES-15 images centered in the target location.
Irradiance_features_{horizon}.csv (Secondary): Irradiance features for the different forecasting horizons ({horizon} = {intra-hour, intra-day, day-ahead}).
Sky_image_features_intra-hour.csv (Secondary): Sky image features for the intra-hour forecasting issuing times.
Sat_image_features_intra-day.csv (Secondary): Satellite image features for the intra-day forecasting issuing times.
NAM_nearest_node_day-ahead.csv (Secondary): NAM forecasts (GHI, DNI computed with the DISC algorithm, and total cloud cover) for the nearest node to the target location, prepared for day-ahead forecasting.
Target_{horizon}.csv (Secondary): Target data for the different forecasting horizons.
Forecast_{horizon}.py (Code): Python script used to create the forecasts for the different horizons.
Postprocess.py (Code): Python script used to compute the error metric for all the forecasts.
This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.
About the Dataset:
- CID (Customer ID): A unique identifier for each customer.
- TID (Transaction ID): A unique identifier for each transaction.
- Gender: The gender of the customer, categorized as Male or Female.
- Age Group: Age group of the customer, divided into several ranges.
- Purchase Date: The timestamp of when the transaction took place.
- Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
- Discount Availed: Indicates whether the customer availed any discount (Yes/No).
- Discount Name: Name of the discount applied (e.g., FESTIVE50).
- Discount Amount (INR): The amount of discount availed by the customer.
- Gross Amount: The total amount before applying any discount.
- Net Amount: The final amount after applying the discount.
- Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
- Location: The city where the purchase took place.
Use Cases:
1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data.
2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis (a short sketch follows below).
3. Data Visualization: Use tools like Python's Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.
This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.
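For use case 2, a minimal preprocessing sketch with pandas could look like the following (column names follow the list above; the file name is an assumption):

import pandas as pd

df = pd.read_csv("customer_transactions.csv", parse_dates=["Purchase Date"])

# Handle missing values: assume no discount means a discount amount of 0
df["Discount Amount (INR)"] = df["Discount Amount (INR)"].fillna(0)

# Encode categorical variables for downstream machine learning models
df = pd.get_dummies(df, columns=["Gender", "Age Group", "Product Category",
                                 "Purchase Method", "Location"], drop_first=True)

# Normalize numerical values to the 0-1 range
for col in ["Discount Amount (INR)", "Gross Amount", "Net Amount"]:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())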
This is not a real dataset. It was generated using Python's Faker library for the sole purpose of learning.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset contains scraped and processed text from roughly 100 years of articles published in the Wiley journal Science Education (formerly General Science Quarterly). This text has been cleaned and filtered in preparation for analysis using natural language processing techniques, particularly topic modeling with latent Dirichlet allocation (LDA). We also include a Jupyter Notebook illustrating how one can use LDA to analyze this dataset and extract latent topics from it, as well as analyze the rise and fall of those topics over the history of the journal.
The articles were downloaded and scraped in December of 2019. Only non-duplicate articles with a listed author (according to the CrossRef metadata database) were included, and due to missing data and text recognition issues we excluded all articles published prior to 1922. This resulted in 5577 articles in total being included in the dataset. The text of these articles was then cleaned in the following way:
After filtering, each document was then turned into a list of individual words (or tokens) which were then collected and saved (using the python pickle format) into the file scied_words_bigrams_V5.pkl.
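As a rough illustration of the intended workflow, here is a hedged sketch of loading the token lists and fitting an LDA model with gensim; the number of topics and other settings are arbitrary assumptions, not those used in the included notebook:

import pickle

from gensim import corpora
from gensim.models import LdaModel

# Load the pre-tokenized documents (a list of token lists)
with open("scied_words_bigrams_V5.pkl", "rb") as f:
    documents = pickle.load(f)

# Build the dictionary and bag-of-words corpus, then fit LDA
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, passes=5, random_state=0)

# Inspect a few of the learned topics
for topic_id, words in lda.print_topics(num_topics=5, num_words=8):
    print(topic_id, words)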
In addition to this file, we have also included the following files:
This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to share text and data mining output for non-commercial research purposes. Any questions or comments can be directed to Tor Ole Odden, t.o.odden@fys.uio.no.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Default and optimally tuned hyperparameters of the Random Forest model.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reproductive health and Family planning service characteristics of respondents.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The files "adult_train.csv" and "adult_test.csv" contain preprocessed versions of the Adult dataset from the USI repository.
The file "adult_preprocessing.ipynb" contains a python notebook file with all the preprocessing steps used to generate "adult_train.csv" and "adult_test.csv" from the original Adult dataset.
The preprocessing steps include:
One-hot-encoding of categorical values
Imputation of missing values using knn-imputer with k=1
Standard scaling of ordinal attributes
Note: we assume the scenario where the test set is available before training (every attribute besides the target, "income"); therefore, we combine the train and test sets before preprocessing. A sketch of these steps follows.
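A hedged scikit-learn sketch of the listed steps (raw file names and which columns are scaled are assumptions; the actual notebook may differ, e.g., by scaling only the ordinal attributes):

import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("adult_train_raw.csv")   # hypothetical raw file names
test = pd.read_csv("adult_test_raw.csv")

# Combine train and test before preprocessing, setting the target aside
combined = pd.concat([train, test], keys=["train", "test"])
target = combined.pop("income")

# One-hot-encoding of categorical values
categorical = combined.select_dtypes(include="object").columns
combined = pd.get_dummies(combined, columns=list(categorical))

# Imputation of missing values using a KNN imputer with k=1, then scaling
combined = pd.DataFrame(KNNImputer(n_neighbors=1).fit_transform(combined),
                        columns=combined.columns, index=combined.index)
combined = pd.DataFrame(StandardScaler().fit_transform(combined),
                        columns=combined.columns, index=combined.index)

# Split back into the preprocessed train and test sets
adult_train = combined.loc["train"].assign(income=target.loc["train"].values)
adult_test = combined.loc["test"].assign(income=target.loc["test"].values)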
Description
The NoCORA dataset represents a significant effort to compile and clean a comprehensive set of daily rainfall data for Northern Cameroon (North and Extreme North regions). This dataset, covering more than 1 million observations across 418 rainfall stations over a temporal range from 1927 to 2022, is instrumental for researchers, meteorologists, and policymakers working in climate research, agricultural planning, and water resource management in the region. It integrates data from diverse sources, including Sodecoton rain funnels, the archive of Robert Morel (IRD), Centrale de Lagdo, the GHCN daily service, and the TAHMO network. The construction of NoCORA involved meticulous processes, including manual assembly of data, extensive data cleaning, and standardization of station names and coordinates, making it a hopefully robust and reliable resource for understanding climatic dynamics in Northern Cameroon.

Data Sources: The dataset comprises eight primary rainfall data sources and a comprehensive coordinates dataset. The rainfall data sources include extensive historical and contemporary measurements, while the coordinates dataset was developed using reference data and an inference strategy for variant station names or missing coordinates.

Dataset Preparation Methods: The preparation involved manual compilation, integration of machine-readable files, data cleaning with OpenRefine, and finalization using Python/Jupyter Notebook. This process should ensure the accuracy and consistency of the dataset.

Discussion: NoCORA, with its extensive data compilation, presents an invaluable resource for climate-related studies in Northern Cameroon. However, users must navigate its complexities, including missing data interpretations, potential biases, and data inconsistencies. The dataset's comprehensive nature and historical span require careful handling and validation in research applications.

Access to Dataset: The NoCORA dataset, while a comprehensive resource for climatological and meteorological research in Northern Cameroon, is subject to specific access conditions due to its compilation from various partner sources. The original data sources vary in their openness and accessibility, and not all partners have confirmed the open-access status of their data. As such, to ensure compliance with these varying conditions, access to the NoCORA dataset is granted on a request basis. Interested researchers and users are encouraged to contact us for permission to access the dataset. This process allows us to uphold the data sharing agreements with our partners while facilitating research and analysis within the scientific community.

Authors Contributions:
Data treatment: Victor Hugo Nenwala, Carmel Foulna Tcheobe, Jérémy Lavarenne.
Documentation: Jérémy Lavarenne.
Funding: This project was funded by the DESIRA INNOVACC project.

Changelog:
v1.0.2: corrected swapped column names in the coordinates dataset
v1.0.1: dataset specification file updated with complementary information regarding station locations
v1.0.0: initial submission
The BL_full_analysis_script.py file implements, in a sequential and reproducible manner, all the quantitative apparatus described in the article. Its logic can be summarized in the following linked modules:
Data loading and cleaning. Reads the Dataset Reservas_V.1.xlsx file, identifies the date column, cleans missing values, and calculates monthly logarithmic returns for each asset (gold, SDR, IMF position, currencies, and Bitcoin). The resulting set constitutes the return matrix on which subsequent calculations are based.

Construction of Black-Litterman parameters. Calculates the covariance matrix Σ and the market weight vector w_m; with τ = 0.05, derives the implied equilibrium returns π = τΣw_m. Automatically detects the BTC column to set the view restriction.

black_litterman function. Implements the standard Bayesian solution: for each pair (Q, Ω) it constructs the adjusted return vector μ_BL and solves for the optimal allocation w* under an aversion coefficient λ = 3, normalizing the portfolio to full investment (a generic sketch of this step appears after this list).

Sensitivity loop and export. Runs through a grid of 30 scenarios (Q from −5% to 15%, Ω from 1% to 5%), stores the optimal weights and their risk metrics in a DataFrame, and finally saves the result in BL_metrics_full_dataset_v1.xlsx.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We collected data from 36 published articles on PubMed [13–48] to train and validate our machine learning models. Some articles comprised more than one type of cartilage injury model or treatment condition. In total, 15 clinical trial conditions and 29 animal model conditions (1 goat, 6 pigs, 2 dogs, 9 rabbits, 9 rats, and 2 mice) on osteochondral injury or osteoarthritis were included, where MSCs were transplanted to repair the cartilage tissue. We documented each case with a specific treatment condition as an entry by considering the cell- and treatment-target-related factors as input properties, including species, body weight, tissue source, cell number, cell concentration, defect area, defect depth, and type of cartilage damage. The therapeutic outcomes were considered as output properties, which were evaluated using integrated clinical and histological cartilage repair scores, including the international cartilage repair society (ICRS) scoring system, the O'Driscoll score, the Pineda score, the Mankin score, the osteoarthritis research society international (OARSI) scoring system, the international knee documentation committee (IKDC) score, the visual analog score (VAS) for pain, the knee injury and osteoarthritis outcome score (KOOS), the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), and the Lysholm score. In this study, these scores were linearly normalized to a number between 0 and 1, with 0 representing the worst damage or pain, and 1 representing completely healthy tissue. The list of entries was combined to form a database.
We have provided the details of the imputation algorithm in the subsection Handling missing data under Methods, and a flowchart in Fig 2. An illustration of the data imputation algorithm for the vector x was added in the manuscript. The pseudo-code for the uncertainty calculation is shown in S1 Algorithm: An ensemble model to measure the ANN's prediction uncertainty. The original database gathered from the literature, and a 'complete' database with missing information filled in by our neural network, are also included, along with a sample neural network architecture file in Python.
Here we provide a Python notebook comprising a neural network that delivers the performance and results described in the manuscript. Documentation in the form of comments and installation guide is included in the Python notebook. This Python notebook along with the methods described in the manuscript provides sufficient details for other interested readers to either extend this script or write their own scripts and reproduce the results in the paper.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Description: Geographical Distribution and Climate Data of Cycas taiwaniana (Taiwanese Cycad)

This dataset contains the geographical distribution and climate data for Cycas taiwaniana, focusing on its presence across regions in Fujian, Guangdong, and Hainan provinces of China. The dataset includes geographical coordinates (longitude and latitude), monthly climate data (minimum and maximum temperature, and precipitation) across different months, as well as bioclimatic variables based on the WorldClim dataset.

**Temporal and Spatial Information** The data covers long-term climate information, with monthly data for each location recorded over a 12-month period (January to December). The dataset includes spatial data in terms of longitude and latitude, corresponding to various locations where Cycas taiwaniana populations are present. The spatial resolution is specific to each point location, and the temporal resolution reflects the monthly climate data for each year.

**Data Structure and Units** The dataset consists of 36 records, each representing a unique location with corresponding climate and geographical data. The table includes the following columns:
1. No.: Unique identifier for each data record
2. Longitude: Geographic longitude in decimal degrees
3. Latitude: Geographic latitude in decimal degrees
4. tmin1 to tmin12: Minimum temperature (°C) for each month (January to December)
5. tmax1 to tmax12: Maximum temperature (°C) for each month (January to December)
6. prec1 to prec12: Precipitation (mm) for each month (January to December)
7. bio1 to bio19: Bioclimatic variables (e.g., annual mean temperature, temperature seasonality, precipitation, etc.) derived from WorldClim data (unit varies depending on the variable)

The units for each measurement are as follows:
- Temperature: Degrees Celsius (°C)
- Precipitation: Millimeters (mm)
- Bioclimatic variables: Varies depending on the specific variable (e.g., °C, mm)

**Data Gaps and Missing Values** The dataset contains some missing values, particularly in the precipitation columns for certain months and locations. These missing values may result from gaps in climate station data or limitations in data collection for specific regions. Missing values are indicated as "NA" (Not Available) in the dataset. In cases where data gaps exist, estimations were not made, and the absence of the data is acknowledged in the record.

**File Format and Software Compatibility** The dataset is provided in CSV format for ease of use and compatibility with various data analysis tools. It can be opened and processed using software such as Microsoft Excel, R, or Python (with Pandas). Users can download the dataset and work with it in software such as R (https://cran.r-project.org/) or Python (https://www.python.org/). The dataset is compatible with any software that supports CSV files.

This dataset provides valuable information for research related to the geographical distribution and climate preferences of Cycas taiwaniana and can be used to inform conservation strategies, ecological studies, and climate change modeling.