29 datasets found

ENTSO-E Hydropower modelling data (PECD) in CSV format
zenodo.org
csv
Updated Aug 14, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Matteo De Felice; Matteo De Felice (2020). ENTSO-E Hydropower modelling data (PECD) in CSV format [Dataset]. http://doi.org/10.5281/zenodo.3950048
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3950048
Dataset updated
Aug 14, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Matteo De Felice; Matteo De Felice
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PECD Hydro modelling

This repository contains a more user-friendly version of the Hydro modelling data released by ENTSO-E with their latest Seasonal Outlook.

The original URLs:

The zipped file: https://eepublicdownloads.blob.core.windows.net/public-cdn-container/clean-documents/sdc-documents/seasonal/SOR2020/data/Hydro.zip

The documentation file (v 1.0): https://eepublicdownloads.blob.core.windows.net/public-cdn-container/clean-documents/sdc-documents/MAF/2019/Hydropower_Modelling_New_database_and_methodology.pdf

The original ENTSO-E hydropower dataset integrates the PECD (Pan-European Climate Database) released for the MAF 2019

As I did for the wind & solar data, the datasets released in this repository are only a more user- and machine-readable version of the original Excel files. As avid user of ENTSO-E data, with this repository I want to share my data wrangling efforts to make this dataset more accessible.

Data description

The zipped file contains 86 Excel files, two different files for each ENTSO-E zone.

In this repository you can find 5 CSV files:

PECD-hydro-capacities.csv: installed capacities

PECD-hydro-weekly-inflows.csv: weekly inflows for reservoir and open-loop pumping

PECD-hydro-daily-ror-generation.csv: daily run-of-river generation

PECD-hydro-weekly-reservoir-min-max-generation.csv: minimum and maximum weekly reservoir generation

PECD-hydro-weekly-reservoir-min-max-levels.csv: weekly minimum and maximum reservoir levels

Capacities

The file PECD-hydro-capacities.csv contains: run of river capacity (MW) and storage capacity (GWh), reservoir plants capacity (MW) and storage capacity (GWh), closed-loop pumping/turbining (MW) and storage capacity and open-loop pumping/turbining (MW) and storage capacity. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Run-of-River and pondage, rows from 5 to 7, columns from 2 to 5

sheet Reservoir, rows from 5 to 7, columns from 1 to 3

sheet Pump storage - Open Loop, rows from 5 to 7, columns from 1 to 3

sheet Pump storage - Closed Loop, rows from 5 to 7, columns from 1 to 3

Inflows

The file PECD-hydro-weekly-inflows.csv contains the weekly inflow (GWh) for the climatic years 1982-2017 for reservoir plants and open-loop pumping. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Reservoir, rows from 13 to 66, columns from 16 to 51

sheet Pump storage - Open Loop, rows from 13 to 66, columns from 16 to 51

Daily run-of-river

The file PECD-hydro-daily-ror-generation.csv contains the daily run-of-river generation (GWh). The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Run-of-River and pondage, rows from 13 to 378, columns from 15 to 51

Miminum and maximum reservoir generation

The file PECD-hydro-weekly-reservoir-min-max-generation.csv contains the minimum and maximum generation (MW, weekly) for reservoir-based plants for the climatic years 1982-2017. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Reservoir, rows from 13 to 66, columns from 196 to 231

sheet Reservoir, rows from 13 to 66, columns from 232 to 267

Minimum/Maximum reservoir levels

The file PECD-hydro-weekly-reservoir-min-max-levels.csv contains the minimum/maximum reservoir levels at beginning of each week (scaled coefficient from 0 to 1). The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Reservoir, rows from 14 to 66, column 12

sheet Reservoir, rows from 14 to 66, column 13

CHANGELOG

[2020/07/17] Added maximum generation for the reservoir
👨‍🦯 Parkinson's Disease Detection Dataset 👨‍⚕️
kaggle.com
Updated Jul 10, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kancharla Naveen Kumar (2023). 👨‍🦯 Parkinson's Disease Detection Dataset 👨‍⚕️ [Dataset]. https://www.kaggle.com/datasets/naveenkumar20bps1137/parkinsons-disease-detection
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 10, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Kancharla Naveen Kumar
Description
Parkinson's data set

This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds one of 195 voice recording from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to "status" column which is set to 0 for healthy and 1 for PD.

The data is in ASCII CSV format. The rows of the CSV file contain an instance corresponding to one voice recording. There are around six recordings per patient, the name of the patient is identified in the first column. For further information or to pass on comments, please contact Max Little (littlem '@' robots.ox.ac.uk).

Further details are contained in the following reference -- if you use this dataset, please cite: Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), 'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', IEEE Transactions on Biomedical Engineering (to appear).

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.

Citation:

Little,Max. (2008). Parkinsons. UCI Machine Learning Repository. https://doi.org/10.24432/C59C74.

Matrix column entries (attributes):

name - ASCII subject name and recording number MDVP:Fo(Hz) - Average vocal fundamental frequency MDVP:Fhi(Hz) - Maximum vocal fundamental frequency MDVP:Flo(Hz) - Minimum vocal fundamental frequency Five measures of variation in Frequency MDVP:Jitter(%) - Percentage of cycle-to-cycle variability of the period duration MDVP:Jitter(Abs) - Absolute value of cycle-to-cycle variability of the period duration MDVP:RAP - Relative measure of the pitch disturbance MDVP:PPQ - Pitch perturbation quotient Jitter:DDP - Average absolute difference of differences between jitter cycles Six measures of variation in amplitude MDVP:Shimmer - Variations in the voice amplitdue MDVP:Shimmer(dB) - Variations in the voice amplitdue in dB Shimmer:APQ3 - Three point amplitude perturbation quotient measured against the average of the three amplitude Shimmer:APQ5 - Five point amplitude perturbation quotient measured against the average of the three amplitude MDVP:APQ - Amplitude perturbation quotient from MDVP Shimmer:DDA - Average absolute difference between the amplitudes of consecutive periods Two measures of ratio of noise to tonal components in the voice NHR - Noise-to-harmonics Ratio and HNR - Harmonics-to-noise Ratio status - Health status of the subject (one) - Parkinson's, (zero) - healthy Two nonlinear dynamical complexity measures RPDE - Recurrence period density entropy D2 - correlation dimension DFA - Signal fractal scaling exponent Three nonlinear measures of fundamental frequency variation spread1 - discrete probability distribution of occurrence of relative semitone variations spread2 - Three nonlinear measures of fundamental frequency variation PPE - Entropy of the discrete probability distribution of occurrence of relative semitone variations
Z
Annual maps of cropland abandonment, land cover, and other derived data for...
data.niaid.nih.gov
repository.soilwise-he.eu
Updated Jul 19, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Radeloff, Volker C. (2022). Annual maps of cropland abandonment, land cover, and other derived data for time-series analysis of cropland abandonment [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5348286
Explore at:
Dataset updated
Jul 19, 2022
Dataset provided by
Yin, He
Wilcove, David S.
Crawford, Christopher L.
Radeloff, Volker C.
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This archive contains raw annual land cover maps, cropland abandonment maps, and accompanying derived data products to support:

Crawford C.L., Yin, H., Radeloff, V.C., and Wilcove, D.S. 2022. Rural land abandonment is too ephemeral to provide major benefits for biodiversity and climate. Science Advances doi.org/10.1126/sciadv.abm8999.

An archive of the analysis scripts developed for this project can be found at: https://github.com/chriscra/abandonment_trajectories (https://doi.org/10.5281/zenodo.6383127).

Note that the label "_2022_02_07" in many file names refers to the date of the primary analysis. "dts” or “dt” refer to “data.tables," large .csv files that were manipulated using the data.table package in R (Dowle and Srinivasan 2021, http://r-datatable.com/). “Rasters” refer to “.tif” files that were processed using the raster and terra packages in R (Hijmans, 2022; https://rspatial.org/terra/; https://rspatial.org/raster).

Data files fall into one of four categories of data derived during our analysis of abandonment: observed, potential, maximum, or recultivation. Derived datasets also follow the same naming convention, though are aggregated across sites. These four categories are as follows (using “age_dts” for our site in Shaanxi Province, China as an example):

observed abandonment identified through our primary analysis, with a threshold of five years. These files do not have a specific label beyond the description of the file and the date of analysis (e.g., shaanxi_age_2022_02_07.csv);

potential abandonment for a scenario without any recultivation, in which abandoned croplands are left abandoned from the year of initial abandonment through the end of the time series, with the label “_potential” (e.g., shaanxi_potential_age_2022_02_07.csv);

maximum age of abandonment over the course of the time series, with the label “_max” (e.g., shaanxi_max_age_2022_02_07.csv);

recultivation periods, corresponding to the lengths of recultivation periods following abandonment, given the label “_recult” (e.g., shaanxi_recult_age_2022_02_07.csv).

This archive includes multiple .zip files, the contents of which are described below:

age_dts.zip - Maps of abandonment age (i.e., how long each pixel has been abandoned for, as of that year, also referred to as length, duration, etc.), for each year between 1987-2017 for all 11 sites. These maps are stored as .csv files, where each row is a pixel, the first two columns refer to the x and y coordinates (in terms of longitude and latitude), and subsequent columns contain the abandonment age values for an individual year (where years are labeled with "y" followed by the year, e.g., "y1987"). Maps are given with a latitude and longitude coordinate reference system. Folder contains observed age, potential age (“_potential”), maximum age (“_max”), and recultivation lengths (“_recult”) for all sites. Maximum age .csv files include only three columns: x, y, and the maximum length (i.e., “max age”, in years) for each pixel throughout the entire time series (1987-2017). Files were produced using the custom functions "cc_filter_abn_dt()," “cc_calc_max_age()," “cc_calc_potential_age(),” and “cc_calc_recult_age();” see "_util/_util_functions.R."

age_rasters.zip - Maps of abandonment age (i.e., how long each pixel has been abandoned for), for each year between 1987-2017 for all 11 sites. Maps are stored as .tif files, where each band corresponds to one of the 31 years in our analysis (1987-2017), in ascending order (i.e., the first layer is 1987 and the 31st layer is 2017). Folder contains observed age, potential age (“_potential”), and maximum age (“_max”) rasters for all sites. Maximum age rasters include just one band (“layer”). These rasters match the corresponding .csv files contained in "age_dts.zip.”

derived_data.zip - summary datasets created throughout this analysis, listed below.

diff.zip - .csv files for each of our eleven sites containing the year-to-year lagged differences in abandonment age (i.e., length of time abandoned) for each pixel. The rows correspond to a single pixel of land, and the columns refer to the year the difference is in reference to. These rows do not have longitude or latitude values associated with them; however, rows correspond to the same rows in the .csv files in "input_data.tables.zip" and "age_dts.zip." These files were produced using the custom function "cc_diff_dt()" (much like the base R function "diff()"), contained within the custom function "cc_filter_abn_dt()" (see "_util/_util_functions.R"). Folder contains diff files for observed abandonment, potential abandonment (“_potential”), and recultivation lengths (“_recult”) for all sites.

input_dts.zip - annual land cover maps for eleven sites with four land cover classes (see below), adapted from Yin et al. 2020 Remote Sensing of Environment (https://doi.org/10.1016/j.rse.2020.111873). Like “age_dts,” these maps are stored as .csv files, where each row is a pixel and the first two columns refer to x and y coordinates (in terms of longitude and latitude). Subsequent columns contain the land cover class for an individual year (e.g., "y1987"). Note that these maps were recoded from Yin et al. 2020 so that land cover classification was consistent across sites (see below). This contains two files for each site: the raw land cover maps from Yin et al. 2020 (after recoding), and a “clean” version produced by applying 5- and 8-year temporal filters to the raw input (see custom function “cc_temporal_filter_lc(),” in “_util/_util_functions.R” and “1_prep_r_to_dt.R”). These files correspond to those in "input_rasters.zip," and serve as the primary inputs for the analysis.

input_rasters.zip - annual land cover maps for eleven sites with four land cover classes (see below), adapted from Yin et al. 2020 Remote Sensing of Environment. Maps are stored as ".tif" files, where each band corresponds one of the 31 years in our analysis (1987-2017), in ascending order (i.e., the first layer is 1987 and the 31st layer is 2017). Maps are given with a latitude and longitude coordinate reference system. Note that these maps were recoded so that land cover classes matched across sites (see below). Contains two files for each site: the raw land cover maps (after recoding), and a “clean” version that has been processed with 5- and 8-year temporal filters (see above). These files match those in "input_dts.zip."

length.zip - .csv files containing the length (i.e., age or duration, in years) of each distinct individual period of abandonment at each site. This folder contains length files for observed and potential abandonment, as well as recultivation lengths. Produced using the custom function "cc_filter_abn_dt()" and “cc_extract_length();” see "_util/_util_functions.R."

derived_data.zip contains the following files:

"site_df.csv" - a simple .csv containing descriptive information for each of our eleven sites, along with the original land cover codes used by Yin et al. 2020 (updated so that all eleven sites in how land cover classes were coded; see below).

Primary derived datasets for both observed abandonment (“area_dat”) and potential abandonment (“potential_area_dat”).

area_dat - Shows the area (in ha) in each land cover class at each site in each year (1987-2017), along with the area of cropland abandoned in each year following a five-year abandonment threshold (abandoned for >=5 years) or no threshold (abandoned for >=1 years). Produced using custom functions "cc_calc_area_per_lc_abn()" via "cc_summarize_abn_dts()". See scripts "cluster/2_analyze_abn.R" and "_util/_util_functions.R."

persistence_dat - A .csv containing the area of cropland abandoned (ha) for a given "cohort" of abandoned cropland (i.e., a group of cropland abandoned in the same year, also called "year_abn") in a specific year. This area is also given as a proportion of the initial area abandoned in each cohort, or the area of each cohort when it was first classified as abandoned at year 5 ("initial_area_abn"). The "age" is given as the number of years since a given cohort of abandoned cropland was last actively cultivated, and "time" is marked relative to the 5th year, when our five-year definition first classifies that land as abandoned (and where the proportion of abandoned land remaining abandoned is 1). Produced using custom functions "cc_calc_persistence()" via "cc_summarize_abn_dts()". See scripts "cluster/2_analyze_abn.R" and "_util/_util_functions.R." This serves as the main input for our linear models of recultivation (“decay”) trajectories.

turnover_dat - A .csv showing the annual gross gain, annual gross loss, and annual net change in the area (in ha) of abandoned cropland at each site in each year of the time series. Produced using custom functions "cc_calc_abn_diff()" via "cc_summarize_abn_dts()" (see "_util/_util_functions.R"), implemented in "cluster/2_analyze_abn.R." This file is only produced for observed abandonment.

Area summary files (for observed abandonment only)

area_summary_df - Contains a range of summary values relating to the area of cropland abandonment for each of our eleven sites. All area values are given in hectares (ha) unless stated otherwise. It contains 16 variables as columns, including 1) "site," 2) "total_site_area_ha_2017" - the total site area (ha) in 2017, 3) "cropland_area_1987" - the area in cropland in 1987 (ha), 4) "area_abn_ha_2017" - the area of cropland abandoned as of 2017 (ha), 5) "area_ever_abn_ha" - the total area of those pixels that were abandoned at least once during the time series (corresponding to the area of potential abandonment, as of 2017), 6) "total_crop_extent_ha" - the total area of those pixels that were classified as cropland at least once during the time series, 7)
Data from: A large synthetic dataset for machine learning applications in...
zenodo.org
csv, json, png, zip
Updated Mar 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Marc Gillioz; Marc Gillioz; Guillaume Dubuis; Philippe Jacquod; Philippe Jacquod; Guillaume Dubuis (2025). A large synthetic dataset for machine learning applications in power transmission grids [Dataset]. http://doi.org/10.5281/zenodo.13378476
Explore at:
zip, png, csv, jsonAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.13378476
Dataset updated
Mar 25, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Marc Gillioz; Marc Gillioz; Guillaume Dubuis; Philippe Jacquod; Philippe Jacquod; Guillaume Dubuis
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limit, under more and more volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge, however they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard if not impossible to access.

This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousands of loads and several hundreds of generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated agains real-world data.

Data generation algorithm

The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.

Network

The network information is encoded in the file europe_network.json. It is given in PowerModels format, which it itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.

Time series

The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.

There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amount to a total of 20 years of synthetic data. When using loads, generators, and lines profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent a same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).

Usage

The time series can be used without a reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, or how to aggregate hourly time steps into days or weeks. These examples use Python and the data analyis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.

Selecting a particular country

This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):

import pandas as pd CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)

The object created in this way is Dataframe with some null values (not all countries have the same number of generators). It can be turned into a list with:

CH_gens_list = CH_gens.dropna().squeeze().to_list()

Finally, we can import all the time series of Swiss generators from a given data table with

pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)

The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.

Averaging over time

This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:

hourly_loads = pd.read_csv('loads_2018_3.csv')

To get a daily average of the loads, we can use:

daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()

This results in series of length 364. To average further over entire weeks and get series of length 52, we use:

weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()

Source code

The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists in two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation in the form of Jupyter notebooks contains numerous examples on how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.

Funding

This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.
o
Dataset for "Strong dispersion property for the quantum walk on the...
explore.openaire.eu
zenodo.org
Updated Jan 27, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Martins Kokainis; Krišjānis Prūsis; Jevgenijs Vihrovs; Vyacheslavs Kashcheyevs; Andris Ambainis (2022). Dataset for "Strong dispersion property for the quantum walk on the hypercube" [Dataset]. http://doi.org/10.5281/zenodo.5907184
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.5907184
Dataset updated
Jan 27, 2022
Authors
Martins Kokainis; Krišjānis Prūsis; Jevgenijs Vihrovs; Vyacheslavs Kashcheyevs; Andris Ambainis
Description
Dataset for Figure 1 and Figure 2 presented in "Strong dispersion property for the quantum walk on the hypercube" (preprint available at arxiv.org/abs/2201.11735). The rows of data.csv file contain the calculated quantities related to the quantum walk on the hypercube: the first row is the maximum probability of a vertex during a walk on the 50-dimensional hypercube; the second row contains the number of steps to minimize the aforementioned probability for various n; the third row is the maximum probability of a vertex after approximately 0.849n steps, for various n; the fourth row is the probability of the walker to be at the 0n vertex (n=50) during a walk on the 50-dimensional hypercube. The rows of aux.csv contain auxiliary data needed to plot the figures: the first row contains the integers 0 to 199 and corresponds to the variable 't' in Figure 1; the second row contains the integers from 1 to 200 and corresponds to the variable 'n' in Figure 2; the third row is the value of the linear function -0.754 + 0.849*n, depicted in the upper panel of Figure 2; the fourth row is the value of the function 5*1.93^(-n), depicted in the lower panel of Figure 2. To generate the figures, the following Matlab commands may be used (after loading the CSV files into variables aux and data): figure; scatter(aux(1,:),data(1,:),15); set(gca,'YScale','log') % F1: upper figure; scatter(aux(1,1:2:end),data(1,1:2:end),15); hold on; scatter(aux(1,:),data(4,:),15,'s'); hold off; set(gca,'YScale','log') % F1: lower figure; scatter(aux(2,:),data(2,:),15); hold on; plot(aux(2,:),aux(3,:)); xlim([0,100]);hold off; %F2: upper figure; scatter(aux(2,:),data(3,:),15);set(gca,'YScale','log'); hold on; semilogy(aux(2,:), aux(4,:));hold off; xlim([0,100]); %F2: lower This work has been supported by Latvian Council of Science (project no. lzp-2018/1-0173) and the ERDF project 1.1.1.5/18/A/020.
d
US B2B Contact Data | 200M+ Verified Records | 95% Accuracy | API/CSV/JSON
datarade.ai
.json, .csv
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Forager.ai, US B2B Contact Data | 200M+ Verified Records | 95% Accuracy | API/CSV/JSON [Dataset]. https://datarade.ai/data-products/us-b2b-contact-data-180m-records-bi-weekly-updates-csv-forager-ai
Explore at:
.json, .csvAvailable download formats
Dataset provided by
Forager.ai
Area covered
United States of America
Description
US B2B Contact Database | 200M+ Verified Records | 95% Accuracy | API/CSV/JSON Elevate your sales and marketing efforts with America's most comprehensive B2B contact data, featuring over 200M+ verified records of decision-makers, from CEOs to managers, across all industries. Powered by AI and refreshed bi-weekly, this dataset ensures you have access to the freshest, most accurate contact details available for effective outreach and engagement.

Key Features & Stats:

200M+ Decision-Makers: Includes C-level executives, VPs, Directors, and Managers.

95% Accuracy: Email & Phone numbers verified for maximum deliverability.

Bi-Weekly Updates: Never waste time on outdated leads with our frequent data refreshes.

50+ Data Points: Comprehensive firmographic, technographic, and contact details.

Core Fields:

Direct Work Emails & Personal Emails for effective outreach.

Mobile Phone Numbers for cold calls and SMS campaigns.

Full Name, Job Title, Seniority for better personalization.

Company Insights: Size, Revenue, Funding data, Industry, and Tech Stack for a complete profile.

Location: HQ and regional offices to target local, national, or international markets.

Top Use Cases:

Cold Email & Calling Campaigns: Target the right people with accurate contact data.

CRM & Marketing Automation Enrichment: Enhance your CRM with enriched data for better lead management.

ABM & Sales Intelligence: Target the right decision-makers and personalize your approach.

Recruiting & Talent Mapping: Access CEO and senior leadership data for executive search.

Instant Delivery Options:

JSON – Bulk downloads via S3 for easy integration.

REST API – Real-time integration for seamless workflow automation.

CRM Sync – Direct integration with your CRM for streamlined lead management.

Enterprise-Grade Quality:

SOC 2 Compliant: Ensuring the highest standards of security and data privacy.

GDPR/CCPA Ready: Fully compliant with global data protection regulations.

Triple-Verification Process: Ensuring the accuracy and deliverability of every record.

Suppression List Management: Eliminate irrelevant or non-opt-in contacts from your outreach.

US Business Contacts | B2B Email Database | Sales Leads | CRM Enrichment | Verified Phone Numbers | ABM Data | CEO Contact Data | US B2B Leads | US prospects data
w
Randomized Hourly Load Data for use with Taxonomy Distribution Feeders
data.wu.ac.at
datadiscoverystudio.org
application/unknown
Updated Aug 29, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Department of Energy (2017). Randomized Hourly Load Data for use with Taxonomy Distribution Feeders [Dataset]. https://data.wu.ac.at/schema/data_gov/NWYwYmFmYTItOWRkMC00OWM0LTk3OGYtZDcyYzZiOWY5N2Ez
Explore at:
application/unknownAvailable download formats
Dataset updated
Aug 29, 2017
Dataset provided by
Department of Energy
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset was developed by NREL's distributed energy systems integration group as part of a study on high penetrations of distributed solar PV [1]. It consists of hourly load data in CSV format for use with the PNNL taxonomy of distribution feeders [2]. These feeders were developed in the open source GridLAB-D modelling language [3]. In this dataset each of the load points in the taxonomy feeders is populated with hourly averaged load data from a utility in the feeder’s geographical region, scaled and randomized to emulate real load profiles. For more information on the scaling and randomization process, see [1].

The taxonomy feeders are statistically representative of the various types of distribution feeders found in five geographical regions of the U.S. Efforts are underway (possibly complete) to translate these feeders into the OpenDSS modelling language.

This data set consists of one large CSV file for each feeder. Within each CSV, each column represents one load bus on the feeder. The header row lists the name of the load bus. The subsequent 8760 rows represent the loads for each hour of the year. The loads were scaled and randomized using a Python script, so each load series represents only one of many possible randomizations. In the header row, "rl" = residential load and "cl" = commercial load. Commercial loads are followed by a phase letter (A, B, or C). For regions 1-3, the data is from 2009. For regions 4-5, the data is from 2000.

For use in GridLAB-D, each column will need to be separated into its own CSV file without a header. The load value goes in the second column, and corresponding datetime values go in the first column, as shown in the sample file, sample_individual_load_file.csv. Only the first value in the time column needs to written as an absolute time; subsequent times may be written in relative format (i.e. "+1h", as in the sample). The load should be written in P+Qj format, as seen in the sample CSV, in units of Watts (W) and Volt-amps reactive (VAr). This dataset was derived from metered load data and hence includes only real power; reactive power can be generated by assuming an appropriate power factor. These loads were used with GridLAB-D version 2.2.

Browse files in this dataset, accessible as individual files and as a single ZIP file. This dataset is approximately 242MB compressed or 475MB uncompressed.

For questions about this dataset, contact andy.hoke@nrel.gov.

If you find this dataset useful, please mention NREL and cite [1] in your work.

References:

[1] A. Hoke, R. Butler, J. Hambrick, and B. Kroposki, “Steady-State Analysis of Maximum Photovoltaic Penetration Levels on Typical Distribution Feeders,” IEEE Transactions on Sustainable Energy, April 2013, available at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6357275 .

[2] K. Schneider, D. P. Chassin, R. Pratt, D. Engel, and S. Thompson, “Modern Grid Initiative Distribution Taxonomy Final Report”, PNNL, Nov. 2008. Accessed April 27, 2012: http://www.gridlabd.org/models/feeders/taxonomy of prototypical feeders.pdf

[3] K. Schneider, D. Chassin, Y. Pratt, and J. C. Fuller, “Distribution power flow for smart grid technologies”, IEEE/PES Power Systems Conference and Exposition, Seattle, WA, Mar. 2009, pp. 1-7, 15-18.
Z
Data from: Isotopes and related data associated with water tracing with...
data.niaid.nih.gov
Updated Feb 8, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Michelon, Anthony (2022). Isotopes and related data associated with water tracing with environmental DNA in a high-Alpine catchment [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3515061
Explore at:
Dataset updated
Feb 8, 2022
Dataset provided by
Vennemann, Torsten W.
Michelon, Anthony
Ba, Rokhaya
Altermatt, Florian
Schaefli, Bettina
Beria, Harsh
Ceperley, Natalie
Mächler, Elvira
Salyani, Anham
Larsen, Annegret
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Isotopes and related data associated with water tracing with environmental DNA in a high-Alpine catchment Prepared by Natalie Ceperley, February 2020.

All methods associated with this data are available in the manuscript: Elvira Mächler, Anham Salyani, Jean-Claude Walser, Annegret Larsen, Bettina Schaefli, Florian Altermatt, and Natalie Ceperley. 2019. Water tracing with environmental DNA in a high-Alpine catchment, Hydrology and Earth System Sciences. https://doi.org/10.5194/hess-2019-551. Related data sets are and will be published in the Vallon de Nant Community on Zenodo. Associated sequencing data are publicly available on European Nucleotide Archive (Mächler et al., 2020).

All isotope data analyzed in the laboratory of Torsten W. Vennemann at the University of Lausanne.

All Files: ▪ NaN - No measurement or sample ▪ Details regarding measurement are available in paper or supplement.

Files: 1) climate_hydro_2017_daily.csv ⁃ 16 columns: ⁃ 1. day of year with January 1, 2017 = 1 ⁃ 2-5. Q: daily mean, min, max, and baseflow discharge as measured at outlet (location ER/MR), in liters / day ⁃ 6. P: mean mm of rain across catchment per day ⁃ 7. SR: total solar radiation per day in W/hr/m2 as median of 4 meteorological stations ⁃ 8-10. SCA: mean, min, and max snow covered area on days with satellite imagery available for whole catchment area, in % ⁃ 11-13. water temperature, mean, min, and max, at outlet (location ER/MR), in degrees C ⁃ 14-16. air temperature, mean, min, and max at 4 meteorological stations, in degrees C

2) delta-18-O_permil.csv ⁃ stable isotopes of water (delta 18-O) in per mil ⁃ columns correspond to sampling locations (locations.csv), rows correspond to sampling days (sampledates.csv)

3) delta-2-H_permil.csv ⁃ stable isotopes of water (delta 2-H) in per mil ⁃ columns correspond to sampling locations (locations.csv), rows correspond to sampling days (sampledates.csv)

4) dqdt_outlet_prev48hrs.csv - dq/dt determined at the outlet for the previous 48 hours at sampling moment (TimeOfSamples_HR.csv) for each sampling site ⁃ columns correspond to sampling locations (locations.csv), rows correspond to sampling days (sampledates.csv)

5) ednasamplecount.csv - this is the tally of samples (1 sample includes 4 replicates) ⁃ columns correspond to sampling locations (locations.csv), rows correspond to sampling days (sampledates.csv)

6) electricalconductivity_instrument.csv ⁃ Code: 108 - post-analyzed using a glass bodied 6 mm probe in the laboratory (Jenway 4510, Staffordshire, UK). 102 - hand measurement with WTW (multi-3510 with a IDS-tetracon-925, Xylem Analytics, Germany) ⁃ columns correspond to sampling locations (locations.csv), rows correspond to sampling days (sampledates.csv)

7) electricalconductivity_uScm.csv - this is the electrical conductivity in micro siemens per cm, according to the instruments coded in electricalconductivity_instrument.csv ⁃ columns correspond to sampling locations (locations.csv), rows correspond to sampling days (sampledates.csv)

8) LC-excess.csv - this is the line control execss from the meteoric water line as determined by the samples in the file: precipitationistopemetadata.csv ⁃ columns correspond to sampling locations (locations.csv), rows correspond to sampling days (sampledates.csv)

9) locations.csv ⁃ Location codes used in other files. - Coordinates in CH1903 / LV03 and WGS 84 (lat/lon). Elevation in m. asl.

10) precipitationisotopemetadata.csv - This is the sampling information for the isotope data that was used to calculate the meteoric water line. - The full data set will become available in a subsequent publication on Zenodo linked to the same community. - 4 columns: - 1. code: rain (1) or snow (2) - 2. collection date and time - 3. elevation in m. asl. - 4. in the case of rain, this is the depth of collection in mm (area normalized volume), in the case of snow, this is the mean depth below the surface that the sample was taken from in cm. ⁃ columns correspond to sampling locations (locations.csv), rows correspond to sampling days (sampledates.csv)

11) sampledates.csv - These are the sample dates in day, month, year and day of year corresponding to the rows in other files

12) stationlocations.csv - These are the locations of four meteorological stations and discharge measurement station. - Coordinates in CH1903 / LV03 and WGS 84 (lat/lon). Elevation in m. asl.

13) TimeOfSamples_HR.csv - This is the time of the sample in hours and decimals correspond to minutes past hour ⁃ columns correspond to sampling locations (locations.csv), rows correspond to sampling days (sampledates.csv)

14) watertemperature_degC.csv ⁃ columns correspond to sampling locations (locations.csv), rows correspond to sampling days (sampledates.csv) - measure in degrees C - instrument in watertemperature_instrument.csv

15) watertemperature_instrument.csv ⁃ Code: 1 = hand measurement with WTW (multi-3510 with a IDS-tetracon-925, Xylem Analytics, Germany) 2 = HOBO Pendant Temperature/Light Data Logger 64K - UA-002-64", Onset (Bourne, MA, USA) 3 = Continually logging WTW (IDS-tetracon-325, Xylem Analytics, Germany) 4 = Continually logging (10min) HOBO U24-001 Conductivity, Onset (Bourne, MA, USA) ⁃ columns correspond to sampling locations (locations.csv), rows correspond to sampling days (sampledates.csv)
Z
KGCW 2024 Challenge @ ESWC 2024
data.niaid.nih.gov
investigacion.usc.gal
Updated Jun 11, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Chaves-Fraga, David (2024). KGCW 2024 Challenge @ ESWC 2024 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10721874
Explore at:
Dataset updated
Jun 11, 2024
Dataset provided by
Chaves-Fraga, David
Iglesias, Ana
Dimou, Anastasia
Van Assche, Dylan
Serles, Umutcan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Knowledge Graph Construction Workshop 2024: challenge

Knowledge graph construction of heterogeneous data has seen a lot of uptakein the last decade from compliance to performance optimizations with respectto execution time. Besides execution time as a metric for comparing knowledgegraph construction, other metrics e.g. CPU or memory usage are not considered.This challenge aims at benchmarking systems to find which RDF graphconstruction system optimizes for metrics e.g. execution time, CPU,memory usage, or a combination of these metrics.

Task description

The task is to reduce and report the execution time and computing resources(CPU and memory usage) for the parameters listed in this challenge, comparedto the state-of-the-art of the existing tools and the baseline results providedby this challenge. This challenge is not limited to execution times to createthe fastest pipeline, but also computing resources to achieve the most efficientpipeline.

We provide a tool which can execute such pipelines end-to-end. This tool alsocollects and aggregates the metrics such as execution time, CPU and memoryusage, necessary for this challenge as CSV files. Moreover, the informationabout the hardware used during the execution of the pipeline is available aswell to allow fairly comparing different pipelines. Your pipeline should consistof Docker images which can be executed on Linux to run the tool. The tool isalready tested with existing systems, relational databases e.g. MySQL andPostgreSQL, and triplestores e.g. Apache Jena Fuseki and OpenLink Virtuosowhich can be combined in any configuration. It is strongly encouraged to usethis tool for participating in this challenge. If you prefer to use a differenttool or our tool imposes technical requirements you cannot solve, please contactus directly.

Track 1: Conformance

The set of new specification for the RDF Mapping Language (RML) established by the W3C Community Group on Knowledge Graph Construction provide a set of test-cases for each module:

RML-Core

RML-IO

RML-CC

RML-FNML

RML-Star

These test-cases are evaluated in this Track of the Challenge to determine their feasibility, correctness, etc. by applying them in implementations. This Track is in Beta status because these new specifications have not seen any implementation yet, thus it may contain bugs and issues. If you find problems with the mappings, output, etc. please report them to the corresponding repository of each module.

Note: validating the output of the RML Star module automatically through the provided tooling is currently not possible, see https://github.com/kg-construct/challenge-tool/issues/1.

Through this Track we aim to spark development of implementations for the new specifications and improve the test-cases. Let us know your problems with the test-cases and we will try to find a solution.

Track 2: Performance

Part 1: Knowledge Graph Construction Parameters

These parameters are evaluated using synthetic generated data to have moreinsights of their influence on the pipeline.

Data

Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).

Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).

Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).

Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).

Number of input files: scaling the number of datasets (1, 5, 10, 15).

Mappings

Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).

Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).

Number of and type of joins: scaling the number of joins and type of joins (1-1, N-1, 1-N, N-M)

Part 2: GTFS-Madrid-Bench

The GTFS-Madrid-Bench provides insights in the pipeline with real data from thepublic transport domain in Madrid.

Scaling

GTFS-1 SQL

GTFS-10 SQL

GTFS-100 SQL

GTFS-1000 SQL

Heterogeneity

GTFS-100 XML + JSON

GTFS-100 CSV + XML

GTFS-100 CSV + JSON

GTFS-100 SQL + XML + JSON + CSV

Example pipeline

The ground truth dataset and baseline results are generated in different stepsfor each parameter:

The provided CSV files and SQL schema are loaded into a MySQL relational database.

Mappings are executed by accessing the MySQL relational database to construct a knowledge graph in N-Triples as RDF format

The pipeline is executed 5 times from which the median execution time of eachstep is calculated and reported. Each step with the median execution time isthen reported in the baseline results with all its measured metrics.Knowledge graph construction timeout is set to 24 hours. The execution is performed with the following tool: https://github.com/kg-construct/challenge-tool,you can adapt the execution plans for this example pipeline to your own needs.

Each parameter has its own directory in the ground truth dataset with thefollowing files:

Input dataset as CSV.

Mapping file as RML.

Execution plan for the pipeline in metadata.json.

Datasets

Knowledge Graph Construction Parameters

The dataset consists of:

Input dataset as CSV for each parameter.

Mapping file as RML for each parameter.

Baseline results for each parameter with the example pipeline.

Ground truth dataset for each parameter generated with the example pipeline.

Format

All input datasets are provided as CSV, depending on the parameter that is beingevaluated, the number of rows and columns may differ. The first row is alwaysthe header of the CSV.

GTFS-Madrid-Bench

The dataset consists of:

Input dataset as CSV with SQL schema for the scaling and a combination of XML,

CSV, and JSON is provided for the heterogeneity.

Mapping file as RML for both scaling and heterogeneity.

SPARQL queries to retrieve the results.

Baseline results with the example pipeline.

Ground truth dataset generated with the example pipeline.

Format

CSV datasets always have a header as their first row.JSON and XML datasets have their own schema.

Evaluation criteria

Submissions must evaluate the following metrics:

Execution time of all the steps in the pipeline. The execution time of a step is the difference between the begin and end time of a step.

CPU time as the time spent in the CPU for all steps of the pipeline. The CPU time of a step is the difference between the begin and end CPU time of a step.

Minimal and maximal memory consumption for each step of the pipeline. The minimal and maximal memory consumption of a step is the minimum and maximum calculated of the memory consumption during the execution of a step.

Expected output

Duplicate values

Scale Number of Triples

0 percent 2000000 triples

25 percent 1500020 triples

50 percent 1000020 triples

75 percent 500020 triples

100 percent 20 triples

Empty values

Scale Number of Triples

0 percent 2000000 triples

25 percent 1500000 triples

50 percent 1000000 triples

75 percent 500000 triples

100 percent 0 triples

Mappings

Scale Number of Triples

1TM + 15POM 1500000 triples

3TM + 5POM 1500000 triples

5TM + 3POM 1500000 triples

15TM + 1POM 1500000 triples

Properties

Scale Number of Triples

1M rows 1 column 1000000 triples

1M rows 10 columns 10000000 triples

1M rows 20 columns 20000000 triples

1M rows 30 columns 30000000 triples

Records

Scale Number of Triples

10K rows 20 columns 200000 triples

100K rows 20 columns 2000000 triples

1M rows 20 columns 20000000 triples

10M rows 20 columns 200000000 triples

Joins

1-1 joins

Scale Number of Triples

0 percent 0 triples

25 percent 125000 triples

50 percent 250000 triples

75 percent 375000 triples

100 percent 500000 triples

1-N joins

Scale Number of Triples

1-10 0 percent 0 triples

1-10 25 percent 125000 triples

1-10 50 percent 250000 triples

1-10 75 percent 375000 triples

1-10 100 percent 500000 triples

1-5 50 percent 250000 triples

1-10 50 percent 250000 triples

1-15 50 percent 250005 triples

1-20 50 percent 250000 triples

1-N joins

Scale Number of Triples

10-1 0 percent 0 triples

10-1 25 percent 125000 triples

10-1 50 percent 250000 triples

10-1 75 percent 375000 triples

10-1 100 percent 500000 triples

5-1 50 percent 250000 triples

10-1 50 percent 250000 triples

15-1 50 percent 250005 triples

20-1 50 percent 250000 triples

N-M joins

Scale Number of Triples

5-5 50 percent 1374085 triples

10-5 50 percent 1375185 triples

5-10 50 percent 1375290 triples

5-5 25 percent 718785 triples

5-5 50 percent 1374085 triples

5-5 75 percent 1968100 triples

5-5 100 percent 2500000 triples

5-10 25 percent 719310 triples

5-10 50 percent 1375290 triples

5-10 75 percent 1967660 triples

5-10 100 percent 2500000 triples

10-5 25 percent 719370 triples

10-5 50 percent 1375185 triples

10-5 75 percent 1968235 triples

10-5 100 percent 2500000 triples

GTFS Madrid Bench

Generated Knowledge Graph

Scale Number of Triples

1 395953 triples

10 3959530 triples

100 39595300 triples

1000 395953000 triples

Queries

Query Scale 1 Scale 10 Scale 100 Scale 1000

Q1 58540 results 585400 results No results available No results available

Q2 636 results 11998 results
125565 results 1261368 results

Q3 421 results 4207 results 42067 results 420667 results

Q4 13 results 130 results 1300 results 13000 results

Q5 35 results 350 results 3500 results 35000 results

Q6 1 result 1 result 1 result 1 result

Q7 68 results 67 results 67 results 53 results

Q8 35460 results 354600 results No results available No results available

Q9 130 results 1300
Earth Analytics Bootcamp | Teaching Dataset | Temperature Data for Sonoma...
figshare.com
txt
Updated Jun 5, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Earth Lab (2023). Earth Analytics Bootcamp | Teaching Dataset | Temperature Data for Sonoma County Airport and San Diego Area [Dataset]. http://doi.org/10.6084/m9.figshare.10314458.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.10314458.v1
Dataset updated
Jun 5, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Earth Lab
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Sonoma County, Earth, San Diego
Description
This teaching data subset contains data on average monthly maximum temperature (Fahrenheit) in 1999 to 2018 at Sonoma County Airport and San Diego Area. 1. san-diego-monthly-mean-max-temp-fahr-1999-2018.csv: contains the average monthly maximum temperature for each month and year for the San Diego area between 1999 and 2018. The data are organized with 20 rows (one for each year from 1999 to 2018) and 12 columns (one for each month from Jan to Dec). Note that this .csv file does not contain headers. 2. sonoma-monthly-mean-max-temp-fahr-1999-2018.csv: contains the average monthly maximum temperature for each month and year at the Sonoma County Airport between 1999 and 2018. The data are organized with 20 rows (one for each year from 1999 to 2018) and 12 columns (one for each month from Jan to Dec). Note that this .csv file does not contain headers.Source: National Weather Service
o
HarDWR - Harmonized Water Rights Records
osti.gov
Updated Apr 25, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
USDOE Office of Science (SC), Biological and Environmental Research (BER) (2024). HarDWR - Harmonized Water Rights Records [Dataset]. http://doi.org/10.57931/2341234
Explore at:
Unique identifier
https://doi.org/10.57931/2341234
Dataset updated
Apr 25, 2024
Dataset provided by
USDOE Office of Science (SC), Biological and Environmental Research (BER)
MultiSector Dynamics - Living, Intuitive, Value-adding, Environment
Description
For a detailed description of the database of which this record is only one part, please see the HarDWR meta-record. Here we present a new dataset of western U.S. water rights records. This dataset provides consistent unique identifiers for each spatial unit of water management across the domain, unique identifiers for each water right record, and a consistent categorization scheme that puts each water right record into one of 7 broad use categories. These data were instrumental in conducting a study of the multi-sector dynamics of intersectoral water allocation changes through water markets (Grogan et al., in review). Specifically, the data were formatted for use as input to a process-based hydrologic model, WBM, with a water rights module (Grogan et al., in review). While this specific study motivated the development of the database presented here, U.S. west water management is a rich area of study (e.g., Anderson and Woosly, 2005; Tidwell, 2014; Null and Prudencio, 2016; Carney et al, 2021) so releasing this database publicly with documentation and usage notes will enable other researchers to do further work on water management in the U.S. west. The raw downloaded data for each state is described in Lisk et al. (in review), as well as here. The dataset is a series of various files organized by state sub-directories. The first two characters of each file name is the abbreviation for the state the in which the file contains data for. After the abbreviation is the text which describes the contents of the file. Here is each file type described in detail: XXFullHarmonizedRights.csv: A file of the combined groundwater and surface water records for each state. Essentially, this file is the merging of XXGroundwaterHarmonizedRights.csv and XXSurfaceWaterHarmonizedRights.csv by state. The column headers for each of this type of file are: state - The name of the state the data comes from. FIPS - The two-digit numeric state ID code. waterRightID - The unique identifying ID of the water right, the same identifier as its state uses. priorityDate - The priority date associated with the right. origWaterUse - The original stated water use(s) from the state. waterUse - The water use category under the unified use categories established here. source - Whether the right is for surface water or groundwater. basinNum - The alpha-numeric identifier of the WMA the record belongs to. CFS - The maximum flow of the allocation in cubic feet per second (ft3s-1). Arizona is unique among the states, as its surface and groundwater resources are managed with two different sets of boundaries. So, for Arizona, the basinNum column is missing and instead there are two columns: surBasinNum - The alpha-numeric identifier of the surface water WMA the record belongs to. grdBasinNum - The alpha-numeric identifier of the groundwater WMA the record belongs to. XXStatePOD.shp: A shapefile which identifies the location of the Points of Diversion for the state's water rights. It should be noted that not all water right records in XXFullHarmonizedRights.csv have coordinates, and therefore may be missing from this file. XXStatePOU.shp: A shapefile which contains the area(s) in which each water right is claimed to be used. Currently, only Idaho and Washington provided valid data to include within this file. XXGroundwaterHarmonizedRights.csv: A file which contains only harmonized groundwater rights collected from each state. See XXFullHarmonizedRights.csv for more details on how the data is formatted. XXSurfaceWaterHarmonizedRights.csv: A file which contains only harmonized surface water rights collected from each state. See XXFullHarmonizedRights.csv for more details on how the data is formatted. Additionally, one file, stateWMALabels.csv, is not stored within a sub-directory. While we have referred to the spatial boundaries that each state uses to manage its water resources as WMAs, this term is not shared across all states. This file lists the proper name for each boundary set, by state. For those whom may be interested in exploring our code more in depth, we are also making available an internal data file for convenience. The file is in .RData format and contains everything described above as well as some minor additional objects used within the code calculating the cumulative curves. For completeness, here is a detailed description of the various objects which can be found within the .RData file: states: A character vector containing the state names for those states in which data was collected for. More importantly, the index of the state name is also the index in which that state's data can be found in the various following list objects. For example, if California is the third index in this object, the data for California will also be in the third index for each accompanying list. rightsByState_ground: A list of data frames with the cleaned ground water rights collected from each state. This object holds the the data that is exported to created the xxGroundwaterHarmonizedRights.csv files. rightsByState_surface: A list of data frames with the cleaned surface water rights collected from each state. This object holds the the data that is exported to created the xxSurfaceWaterHarmonizedRights.csv files. fullRightsRecs: A list of the combined groundwater and surface water records for each state. This object holds the the data that is exported to created the xxFullHarmonizedRights.csv files. projProj: The spatial projection used for map creation in the beginning of the project. Specifically, the World Geodetic System (WGS84) as a coordinate reference system (CRS) string in PROJ.4 format. wmaStateLabel: The name and/or abbreviation for what each state legally calls their WMAs. h2oUseByState: A list of spatial polygon data frames which contain the area(s) in which each water right is claimed to be used. It should be noted that not all water right records have a listed area(s) of use in this object. Currently, only Idaho and Washington provided valid data to be included in this object. h2oDivByState: A list of spatial points data frames which identifies the location of the Point of Diversion for the state's water rights. It should be noted that not all water right records have a listed Point of Diversion in this object. spatialWMAByState: A list of spatial polygon data frames which contain the spatial WMA boundaries for each state. The only data contained within the table are identifiers for each polygon. It is worth reiterating that Arizona is the only state in which the surface and groundwater WMA boundaries are not the same. wmaIDByState: A list which contains the unique ID values of the WMAs for each state. plottingDim: A character vector used to inform mapping functions for internal map making. Each state is classified as either "tall" or "wide", to maximize space on a typical 8x11 page. The code related to the creation of this dataset can be viewed within HarDWR GitHub Repository/dataHarmonization.
d
Geochemical data supporting analysis of geochemical conditions and nitrogen...
catalog.data.gov
data.usgs.gov
+2more
Updated Jul 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Geochemical data supporting analysis of geochemical conditions and nitrogen transport in nearshore groundwater and the subterranean estuary at a Cape Cod embayment, East Falmouth, Massachusetts, 2013 [Dataset]. https://catalog.data.gov/dataset/geochemical-data-supporting-analysis-of-geochemical-conditions-and-nitrogen-transport-in-n
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Area covered
Cape Cod, East Falmouth, Falmouth, Massachusetts
Description
This data release provides analytical and other data in support of an analysis of nitrogen transport and transformation in groundwater and in a subterranean estuary in the Eel River and onshore locations on the Seacoast Shores peninsula, Falmouth, Massachusetts. The analysis is described in U.S. Geological Survey Scientific Investigations Report 2018-5095 by Colman and others (2018). This data release is structured as a set of comma-separated values (CSV) files, each of which contains data columns for laboratory (if applicable), USGS Site Name, date sampled, time sampled, and columns of specific analytical and(or) other data. The .csv data files have the same number of rows and each row in each .csv file corresponds to the same sample. Blank cells in a .csv file indicate that the sample was not analyzed for that constituent. The data release also provides a Data Dictionary (Data_Dictionary.csv) that provides the following information for each constituent (analyte): laboratory or data source, data type, description of units, method, minimum reporting limit, limit of quantitation if appropriate, method reference citations, minimum, maximum, median, and average values for each analyte. The data release also contains a file called Abbreviations in Data_Dictionary.pdf that contains all of the abbreviations in the Data Dictionary and in the well characteristics file in the companion report, Colman and others (2018).
Australian Employee Salary/Wages DATAbase by detailed occupation, location...
figshare.com
txt
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Richard Ferrers; Australian Taxation Office (2023). Australian Employee Salary/Wages DATAbase by detailed occupation, location and year (2002-14); (plus Sole Traders) [Dataset]. http://doi.org/10.6084/m9.figshare.4522895.v5
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.4522895.v5
Dataset updated
May 31, 2023
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Richard Ferrers; Australian Taxation Office
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The ATO (Australian Tax Office) made a dataset openly available (see links) showing all the Australian Salary and Wages (2002, 2006, 2010, 2014) by detailed occupation (around 1,000) and over 100 SA4 regions. Sole Trader sales and earnings are also provided. This open data (csv) is now packaged into a database (*.sql) with 45 sample SQL queries (backupSQL[date]_public.txt).See more description at related Figshare #datavis record. Versions:V5: Following #datascience course, I have made main data (individual salary and wages) available as csv and Jupyter Notebook. Checksum matches #dataTotals. In 209,xxx rows.Also provided Jobs, and SA4(Locations) description files as csv. More details at: Where are jobs growing/shrinking? Figshare DOI: 4056282 (linked below). Noted 1% discrepancy ($6B) in 2010 wages total - to follow up.#dataTotals - Salary and WagesYearWorkers (M)Earnings ($B) 20028.528520069.4372201010.2481201410.3584#dataTotal - Sole TradersYearWorkers (M)Sales ($B)Earnings ($B)20020.9611320061.0881920101.11122620141.19630#links See ATO request for data at ideascale link below.See original csv open data set (CC-BY) at data.gov.au link below.This database was used to create maps of change in regional employment - see Figshare link below (m9.figshare.4056282).#packageThis file package contains a database (analysing the open data) in SQL package and sample SQL text, interrogating the DB. DB name: test. There are 20 queries relating to Salary and Wages.#analysisThe database was analysed and outputs provided on Nectar(.org.au) resources at: http://118.138.240.130.(offline)This is only resourced for max 1 year, from July 2016, so will expire in June 2017. Hence the filing here. The sample home page is provided here (and pdf), but not all the supporting files, which may be packaged and added later. Until then all files are available at the Nectar URL. Nectar URL now offline - server files attached as package (html_backup[date].zip), including php scripts, html, csv, jpegs.#installIMPORT: DB SQL dump e.g. test_2016-12-20.sql (14.8Mb)1.Started MAMP on OSX.1.1 Go to PhpMyAdmin2. New Database: 3. Import: Choose file: test_2016-12-20.sql -> Go (about 15-20 seconds on MacBookPro 16Gb, 2.3 Ghz i5)4. four tables appeared: jobTitles 3,208 rows | salaryWages 209,697 rows | soleTrader 97,209 rows | stateNames 9 rowsplus views e.g. deltahair, Industrycodes, states5. Run test query under **#; Sum of Salary by SA4 e.g. 101 $4.7B, 102 $6.9B#sampleSQLselect sa4,(select sum(count) from salaryWageswhere year = '2014' and sa4 = sw.sa4) as thisYr14,(select sum(count) from salaryWageswhere year = '2010' and sa4 = sw.sa4) as thisYr10,(select sum(count) from salaryWageswhere year = '2006' and sa4 = sw.sa4) as thisYr06,(select sum(count) from salaryWageswhere year = '2002' and sa4 = sw.sa4) as thisYr02from salaryWages swgroup by sa4order by sa4
d
Long-term monotonic trends in annual and monthly streamflow metrics at...
catalog.data.gov
data.usgs.gov
+1more
Updated Oct 5, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Long-term monotonic trends in annual and monthly streamflow metrics at streamgages in the United States (ver. 2.0, October 2024) [Dataset]. https://catalog.data.gov/dataset/long-term-monotonic-trends-in-annual-and-monthly-streamflow-metrics-at-streamgages-in-the-
Explore at:
Dataset updated
Oct 5, 2024
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Area covered
United States
Description
The U.S. Geological Survey (USGS) Water Resources Mission Area (WMA) is working to address a need to understand where the Nation is experiencing water shortages or surpluses relative to the demand for water need by delivering routine assessments of water supply and demand and an understanding of the natural and human factors affecting the balance between supply and demand. A key part of these national assessments is identifying long-term trends in water availability, including groundwater and surface water quantity, quality, and use. This data release contains Mann-Kendall monotonic trend analyses for 18 observed annual and monthly streamflow metrics at 6,347 U.S. Geological Survey streamgages located in the conterminous United States, Alaska, Hawaii, and Puerto Rico. Streamflow metrics include annual mean flow, maximum 1-day and 7-day flows, minimum 7-day and 30-day flows, and the date of the center of volume (the date on which 50% of the annual flow has passed by a gage), along with the mean flow for each month of the year. Annual streamflow metrics are computed from mean daily discharge records at U.S. Geological Survey streamgages that are publicly available from the National Water Information System (NWIS). Trend analyses are computed using annual streamflow metrics computed through climate year 2022 (April 2022- March 2023) for low-flow metrics and water year 2022 (October 2021 - September 2022) for all other metrics. Trends at each site are available for up to four different periods: (i) the longest possible period that meets completeness criteria at each site, (ii) 1980-2020, (iii) 1990-2020, (iv) 2000-2020. Annual metric time series analyzed for trends must have 80 percent complete records during fixed periods. In addition, each of these time series must have 80 percent complete records during their first and last decades. All longest possible period time series must be at least 10 years long and have annual metric values for at least 80% of the years running from 2013 to 2022. This data release provides the following five CSV output files along with a model archive: (1) streamflow_trend_results.csv - contains test results of all trend analyses with each row representing one unique combination of (i) NWIS streamgage identifiers, (ii) metric (computed using Oct 1 - Sep 30 water years except for low-flow metrics computed using climate years (Apr 1 - Mar 31), (iii) trend periods of interest (longest possible period through 2022, 1980-2020, 1990-2020, 2000-2020) and (iv) records containing either the full trend period or only a portion of the trend period following substantial increases in cumulative upstream reservoir storage capacity. This is an output from the final process step (#5) of the workflow. (2) streamflow_trend_trajectories_with_confidence_bands.csv - contains annual trend trajectories estimated using Theil-Sen regression, which estimates the median of the probability distribution of a metric for a given year, along with 90 percent confidence intervals (5th and 95h percentile values). This is an output from the final process step (#5) of the workflow. (3) streamflow_trend_screening_all_steps.csv - contains the screening results of all 7,873 streamgages initially considered as candidate sites for trend analysis and identifies the screens that prevented some sites from being included in the Mann-Kendall trend analysis. (4) all_site_year_metrics.csv - contains annual time series values of streamflow metrics computed from mean daily discharge data at 7,873 candidate sites. This is an output of Process Step 1 in the workflow. (5) all_site_year_filters.csv - contains information about the completeness and quality of daily mean discharge at each streamgage during each year (water year, climate year, and calendar year). This is also an output of Process Step 1 in the workflow and is combined with all_site_year_metrics.csv in Process Step 2. In addition, a .zip file contains a model archive for reproducing the trend results using R 4.4.1 statistical software. See the README file contained in the model archive for more information. Caution must be exercised when utilizing monotonic trend analyses conducted over periods of up to several decades (and in some places longer ones) due to the potential for confounding deterministic gradual trends with multi-decadal climatic fluctuations. In addition, trend results are available for post-reservoir construction periods within the four trend periods described above to avoid including abrupt changes arising from the construction of larger reservoirs in periods for which gradual monotonic trends are computed. Other abrupt changes, such as changes to water withdrawals and wastewater return flows, or episodic disturbances with multi-year recovery periods, such as wildfires, are not evaluated. Sites with pronounced abrupt changes or other non-monotonic trajectories of change may require more sophisticated trend analyses than those presented in this data release.
Data from: KGCW 2023 Challenge @ ESWC 2023
zenodo.org
investigacion.usc.gal
application/gzip
Updated Apr 15, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dylan Van Assche; Dylan Van Assche; David Chaves-Fraga; David Chaves-Fraga; Anastasia Dimou; Anastasia Dimou; Umutcan Şimşek; Umutcan Şimşek; Ana Iglesias; Ana Iglesias (2024). KGCW 2023 Challenge @ ESWC 2023 [Dataset]. http://doi.org/10.5281/zenodo.7837289
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7837289
Dataset updated
Apr 15, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Dylan Van Assche; Dylan Van Assche; David Chaves-Fraga; David Chaves-Fraga; Anastasia Dimou; Anastasia Dimou; Umutcan Şimşek; Umutcan Şimşek; Ana Iglesias; Ana Iglesias
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Knowledge Graph Construction Workshop 2023: challenge

Knowledge graph construction of heterogeneous data has seen a lot of uptake
in the last decade from compliance to performance optimizations with respect
to execution time. Besides execution time as a metric for comparing knowledge
graph construction, other metrics e.g. CPU or memory usage are not considered.
This challenge aims at benchmarking systems to find which RDF graph
construction system optimizes for metrics e.g. execution time, CPU,
memory usage, or a combination of these metrics.

Task description

The task is to reduce and report the execution time and computing resources
(CPU and memory usage) for the parameters listed in this challenge, compared
to the state-of-the-art of the existing tools and the baseline results provided
by this challenge. This challenge is not limited to execution times to create
the fastest pipeline, but also computing resources to achieve the most efficient
pipeline.

We provide a tool which can execute such pipelines end-to-end. This tool also
collects and aggregates the metrics such as execution time, CPU and memory
usage, necessary for this challenge as CSV files. Moreover, the information
about the hardware used during the execution of the pipeline is available as
well to allow fairly comparing different pipelines. Your pipeline should consist
of Docker images which can be executed on Linux to run the tool. The tool is
already tested with existing systems, relational databases e.g. MySQL and
PostgreSQL, and triplestores e.g. Apache Jena Fuseki and OpenLink Virtuoso
which can be combined in any configuration. It is strongly encouraged to use
this tool for participating in this challenge. If you prefer to use a different
tool or our tool imposes technical requirements you cannot solve, please contact
us directly.

Part 1: Knowledge Graph Construction Parameters

These parameters are evaluated using synthetic generated data to have more
insights of their influence on the pipeline.

Data

Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).

Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).

Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).

Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).

Number of input files: scaling the number of datasets (1, 5, 10, 15).

Mappings

Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).

Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).

Number of and type of joins: scaling the number of joins and type of joins (1-1, N-1, 1-N, N-M)

Part 2: GTFS-Madrid-Bench

The GTFS-Madrid-Bench provides insights in the pipeline with real data from the
public transport domain in Madrid.

Scaling

GTFS-1 SQL

GTFS-10 SQL

GTFS-100 SQL

GTFS-1000 SQL

Heterogeneity

GTFS-100 XML + JSON

GTFS-100 CSV + XML

GTFS-100 CSV + JSON

GTFS-100 SQL + XML + JSON + CSV

Example pipeline

The ground truth dataset and baseline results are generated in different steps
for each parameter:

The provided CSV files and SQL schema are loaded into a MySQL relational database.

Mappings are executed by accessing the MySQL relational database to construct a knowledge graph in N-Triples as RDF format.

The constructed knowledge graph is loaded into a Virtuoso triplestore, tuned according to the Virtuoso documentation.

The provided SPARQL queries are executed on the SPARQL endpoint exposed by Virtuoso.

The pipeline is executed 5 times from which the median execution time of each
step is calculated and reported. Each step with the median execution time is
then reported in the baseline results with all its measured metrics.
Query timeout is set to 1 hour and knowledge graph construction timeout
to 24 hours. The execution is performed with the following tool: https://github.com/kg-construct/challenge-tool,
you can adapt the execution plans for this example pipeline to your own needs.

Each parameter has its own directory in the ground truth dataset with the
following files:

Input dataset as CSV.

Mapping file as RML.

Queries as SPARQL.

Execution plan for the pipeline in metadata.json.

Datasets

Knowledge Graph Construction Parameters

The dataset consists of:

Input dataset as CSV for each parameter.

Mapping file as RML for each parameter.

SPARQL queries to retrieve the results for each parameter.

Baseline results for each parameter with the example pipeline.

Ground truth dataset for each parameter generated with the example pipeline.

Format

All input datasets are provided as CSV, depending on the parameter that is being
evaluated, the number of rows and columns may differ. The first row is always
the header of the CSV.

GTFS-Madrid-Bench

The dataset consists of:

Input dataset as CSV with SQL schema for the scaling and a combination of XML,

CSV, and JSON is provided for the heterogeneity.

Mapping file as RML for both scaling and heterogeneity.

SPARQL queries to retrieve the results.

Baseline results with the example pipeline.

Ground truth dataset generated with the example pipeline.

Format

CSV datasets always have a header as their first row.
JSON and XML datasets have their own schema.

Evaluation criteria

Submissions must evaluate the following metrics:

Execution time of all the steps in the pipeline. The execution time of a step is the difference between the begin and end time of a step.

CPU time as the time spent in the CPU for all steps of the pipeline. The CPU time of a step is the difference between the begin and end CPU time of a step.

Minimal and maximal memory consumption for each step of the pipeline. The minimal and maximal memory consumption of a step is the minimum and maximum calculated of the memory consumption during the execution of a step.

Expected output

Duplicate values

Scale Number of Triples
0 percent 2000000 triples
25 percent 1500020 triples
50 percent 1000020 triples
75 percent 500020 triples
100 percent 20 triples

Empty values

Scale Number of Triples
0 percent 2000000 triples
25 percent 1500000 triples
50 percent 1000000 triples
75 percent 500000 triples
100 percent 0 triples

Mappings

Scale Number of Triples
1TM + 15POM 1500000 triples
3TM + 5POM 1500000 triples
5TM + 3POM 1500000 triples
15TM + 1POM 1500000 triples

Properties

Scale Number of Triples
1M rows 1 column 1000000 triples
1M rows 10 columns 10000000 triples
1M rows 20 columns 20000000 triples
1M rows 30 columns 30000000 triples

Records

Scale Number of Triples
10K rows 20 columns 200000 triples
100K rows 20 columns 2000000 triples
1M rows 20 columns 20000000 triples
10M rows 20 columns 200000000 triples

Joins

1-1 joins

Scale Number of Triples
0 percent 0 triples
25 percent 125000 triples
50 percent 250000 triples
75 percent 375000 triples
100 percent 500000 triples

1-N joins

Scale Number of Triples
1-10 0 percent 0 triples
1-10 25 percent 125000 triples
1-10 50 percent 250000 triples
1-10 75 percent 375000

Scale	Number of Triples
0 percent	2000000 triples
25 percent	1500020 triples
50 percent	1000020 triples
75 percent	500020 triples
100 percent	20 triples

Scale	Number of Triples
0 percent	2000000 triples
25 percent	1500000 triples
50 percent	1000000 triples
75 percent	500000 triples
100 percent	0 triples

Scale	Number of Triples
1TM + 15POM	1500000 triples
3TM + 5POM	1500000 triples
5TM + 3POM	1500000 triples
15TM + 1POM	1500000 triples

Scale	Number of Triples
1M rows 1 column	1000000 triples
1M rows 10 columns	10000000 triples
1M rows 20 columns	20000000 triples
1M rows 30 columns	30000000 triples

Scale	Number of Triples
10K rows 20 columns	200000 triples
100K rows 20 columns	2000000 triples
1M rows 20 columns	20000000 triples
10M rows 20 columns	200000000 triples

Scale	Number of Triples
0 percent	0 triples
25 percent	125000 triples
50 percent	250000 triples
75 percent	375000 triples
100 percent	500000 triples

Scale	Number of Triples
1-10 0 percent	0 triples
1-10 25 percent	125000 triples
1-10 50 percent	250000 triples
1-10 75 percent	375000

Synthetic Indoor Climate and Occupancy Data from Office and Meeting Room...

zenodo.org
data.niaid.nih.gov

csv, zip

Updated Apr 25, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Manuel Weber; Manuel Weber; Farzan Banihashemi; Farzan Banihashemi (2024). Synthetic Indoor Climate and Occupancy Data from Office and Meeting Room Simulations [Dataset]. http://doi.org/10.5281/zenodo.10507614

Explore at:

csv, zipAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.10507614

Dataset updated

Apr 25, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Manuel Weber; Manuel Weber; Farzan Banihashemi; Farzan Banihashemi

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This is the dataset used for the publication "Coddora: CO2-based Occupancy Detection model
trained via DOmain RAndomization". The goal is to provide training data for occupancy detection.

The dataset contains one million days of data including 10 occupied days for each of 100,000 randomized room models (50,000 rooms considering office activity and 50,000 meeting room activity). Data were generated in EnergyPlus simulations according to the methodology described in the paper.

When using the dataset, please cite:

Manuel Weber, Farzan Banihashemi, Davor Stjelja, Peter Mandl, Ruben Mayer, and Hans-Arno Jacobsen. 2024. Coddora: CO2-Based Occupancy Detection Model Trained via Domain Randomization. In International Joint Conference on Neural Networks (IJCNN). June 30 - July 5, 2024, Yokohama, Japan.

Dataset Structure

The following files are provided:

1. dataset_office_rooms.h5 (provided as zip file)
2. dataset_meeting_rooms.h5 (provided as zip file)
3. simulated_occupancy_office_rooms.csv
4. simulated_occupancy_meeting_rooms.csv

Please use an archiving tool such as 7zip to unzip the hdf5 files.
Both hdf5 files contain two datasets with the following keys:

1. "data": contains the simulated indoor climate and occupancy data
2. "metadata": contains the metadata that were used for each simulation

The csv files contain the time series of occupancy that were used for the simulations.

Data

Data includes the following fields:

Datetime: day of the year (may be relevant due to seasonal differences) and time of the day
Zone Air CO2 Concentration: CO2 level in ppm
Zone Mean Air Temperature: temperature in °C
Zone Air Relative Humidity: relative humidity in %
Occupancy: level of occupancy relative to the maximum capacity of the room (in the range [0-1])
Ventilation: fraction of window opening in the range [0.01, 1]
SimID: foreign key to reference the room properties the simulation was based on
BinaryOccupancy: 0 or 1 denoting absence or presence (for binary classification)

Example row:

Datetime	Zone Air CO2 Concentration	Zone Mean Air Temperature	Zone Air Relative Humidity	Occupancy	Ventilation	simID	BinaryOccupancy
10/09 11:21:00	1084.5624647371608	24.545635909907148	41.18393114737054	0.7	0.0	99	1

Metadata

Metadata includes the following fields.
Underscores denote that the field was not selected during randomization but calculated from the other values.

width: room width in m
length: room length in m
height: hoom height in m
infiltration: infiltration per exterior area in m³/m²s
outdoor_co2: co2 concentration in the outdoor air in ppm (set to a random value between [300, 500])
orientation: angle between the room's facade orientation and the north direction in degrees
maxOccupants: room occupation limit, i.e. the maximum number of occupants
_floorArea: floor area in m² (calculated from room dimensions)
_volume: room volume in m³ (calculated from room dimensions)
_exteriorSurfaceArea: surface area of the facade wall (calculated from room dimensions)
_winToFloorRatio: ratio between total window area and floor area (calculated from room model)
firstDayUsedOfOccupancySequence: selected starting day in the sequence of occupancy data for rooms with the respective maxOccupants value
simID: unique identifier of the simulation to relate between simulation metadata and resulting simulated data

Example row:

width	length	height	infiltration	outdoor_co2	orientation	maxOccupants	_floorArea	_volume	_exteriorSurfaceArea	_winToFloorRatio	firstDayOfUsedOccupancySequence	simID
5.481	5.190	3.264	0.000214	438.0	316.0	4.0	28.446	92.849	16.940	0.216	192	0

Occupancy Data

The occupancy data provided through the separate csv files contain the data from the upfront occupancy simulations that the climate simulation was based on. For each level of considered room occupancy limit (maxOccupants), the datasets provide minute values of occupancy throughout 1000 days.

Datetime, Date, Timestamp: fictive time of simulated occupancy record (sequences are in 1-minute resolution)
Occupants: number of present occupants
Occupancy: binary occupancy state (0=unoccupied, 1=occupied)
WindowState: binary state of ventilation (0=windows closed, 1=room is ventilated)
maxOccupants: maximum number of occupants considered for the simulated sequence
WindowOpeningFraction: fractional extent to which windows are opened, within the interval [0.01, 1]

Example row:

Datetime	Date	Timestamp	Occupants	Occupancy	WindowState	maxOccupants	WindowOpeningFraction
2023-01-01 00:00:00	2023-01-01	1.672531e+09	0	0	0	1	0.0

MH Deloitte Machine Learning Challenge
kaggle.com
Updated Nov 30, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mohamed Ziauddin (2021). MH Deloitte Machine Learning Challenge [Dataset]. https://www.kaggle.com/mohamedziauddin/mh-deloitte-machine-learning-challenge/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 30, 2021
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Mohamed Ziauddin
Description
Dataset Description Train.csv - 67463 rows x 35 columns (Includes target column as Loan Status) Attributes ID Loan Amount Funded Amount Funded Amount Investor Term Batch Enrolled Interest Rate Grade Sub Grade Employment Duration Home Ownership Verification Status Payment Plan Loan Title Debit to Income Delinquency - two years Inquires - six months Open Account Public Record Revolving Balance Revolving Utilities Total Accounts Initial List Status Total Received Interest Total Received Late Fee Recoveries Collection Recovery Fee Collection 12 months Medical Application Type Last week Pay Accounts Delinquent Total Collection Amount Total Current Balance Total Revolving Credit Limit Loan Status

Test.csv - 28913 rows x 34 columns(Includes target column as Loan Status)

Sample Submission.csv - The challenge is to predict the Loan Status

Knowledge and Skills Big dataset, underfitting vs overfitting Optimising log_loss to generalise well on unseen data

Note: This dataset is from machinehack.com to help fellow Kagglers to use this dataset and compete in the competition Courtesy: https://machinehack.com/hackathons/deloitte_presents_machine_learning_challenge_predict_loan_defaulters/data

Data from: CY-Bench: A comprehensive benchmark dataset for subnational crop...

zenodo.org
explore.openaire.eu

zip

Updated Sep 25, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Dilli Paudel; Dilli Paudel; Hilmy Baja; Hilmy Baja; Ron van Bree; Michiel Kallenberg; Michiel Kallenberg; Stella Ofori-Ampofo; Aike Potze; Pratishtha Poudel; Pratishtha Poudel; Abdelrahman Saleh; Weston Anderson; Weston Anderson; Malte von Bloh; Andres Castellano; Oumnia Ennaji; Raed Hamed; Rahel Laudien; Donghoon Lee; Inti Luna; Dainius Masiliūnas; Dainius Masiliūnas; Michele Meroni; Janet Mumo Mutuku; Siyabusa Mkuhlani; Jonathan Richetti; Alex C. Ruane; Ritvik Sahajpal; Guanyuan Shuai; Vasileios Sitokonstantinou; Rogerio de Souza Noia Junior; Amit Kumar Srivastava; Robert Strong; Lily-belle Sweet; Lily-belle Sweet; Petar Vojnović; Allard de Wit; Allard de Wit; Maximilian Zachow; Ioannis N. Athanasiadis; Ron van Bree; Stella Ofori-Ampofo; Aike Potze; Abdelrahman Saleh; Malte von Bloh; Andres Castellano; Oumnia Ennaji; Raed Hamed; Rahel Laudien; Donghoon Lee; Inti Luna; Michele Meroni; Janet Mumo Mutuku; Siyabusa Mkuhlani; Jonathan Richetti; Alex C. Ruane; Ritvik Sahajpal; Guanyuan Shuai; Vasileios Sitokonstantinou; Rogerio de Souza Noia Junior; Amit Kumar Srivastava; Robert Strong; Petar Vojnović; Maximilian Zachow; Ioannis N. Athanasiadis (2024). CY-Bench: A comprehensive benchmark dataset for subnational crop yield forecasting [Dataset]. http://doi.org/10.5281/zenodo.13798797

Explore at:

zipAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.13798797

Dataset updated

Sep 25, 2024

Dataset provided by

AgML (https://www.agml.org/)

Authors

License

https://joinup.ec.europa.eu/page/eupl-text-11-12https://joinup.ec.europa.eu/page/eupl-text-11-12

Description

CY-Bench: A comprehensive benchmark dataset for sub-national crop yield forecasting

Overview

CY-Bench is a dataset and benchmark for subnational crop yield forecasting, with coverage of major crop growing countries of the world for maize and wheat. By subnational, we mean the administrative level where yield statistics are published. When statistics are available for multiple levels, we pick the highest resolution. The dataset combines sub-national yield statistics with relevant predictors, such as growing-season weather indicators, remote sensing indicators, evapotranspiration, soil moisture indicators, and static soil properties. CY-Bench has been designed and curated by agricultural experts, climate scientists, and machine learning researchers from the AgML Community, with the aim of facilitating model intercomparison across the diverse agricultural systems around the globe in conditions as close as possible to real-world operationalization. Ultimately, by lowering the barrier to entry for ML researchers in this crucial application area, CY-Bench will facilitate the development of improved crop forecasting tools that can be used to support decision-makers in food security planning worldwide.

* Crops : Wheat & Maize
* Spatial Coverage : Wheat (29 countries), Maize (38).
See CY-Bench paper appendix for the list of countries.
* Temporal Coverage : Varies. See country-specific data

Data

Data format

The benchmark data is organized as a collection of CSV files, with each file representing a specific category of variable for a particular country. Each CSV file is named according to the category and the country it pertains to, facilitating easy identification and retrieval. The data within each CSV file is structured in tabular format, where rows represent observations and columns represent different predictors related to a category of variable.

Data content

All data files are provided as .csv.

Data	Description	Variables (units)	Temporal Resolution	Data Source (Reference)
crop_calendar	Start and end of growing season	sos (day of the year), eos (day of the year)	Static	World Cereal (Franch et al, 2022)
fpar	fraction of absorbed photosynthetically active radiation	fpar (%)	Dekadal (3 times a month; 1-10, 11-20, 21-31)	European Commission's Joint Research Centre (EC-JRC, 2024)
ndvi	normalized difference vegetation index	-	approximately weekly	MOD09CMG (Vermote, 2015)
meteo	temperature, precipitation (prec), radiation, potential evapotranspiration (et0), climatic water balance (= prec - et0)	tmin (C), tmax (C), tavg (C), prec (mm0, et0 (mm), cwb (mm), rad (J m-2 day-1)	daily	AgERA5 (Boogaard et al, 2022), FAO-AQUASTAT for et0 (FAO-AQUASTAT, 2024)
soil_moisture	surface soil moisture, rootzone soil moisture	ssm (kg m-2), rsm (kg m-2)	daily	GLDAS (Rodell et al, 2004)
soil	available water capacity, bulk density, drainage class	awc (c m-1), bulk_density (kg dm-3), drainage class (category)	static	WISE Soil database (Batjes, 2016)
yield	end-of-season yield	yield (t ha-1)	yearly	Various country or region specific sources (see crop_statistics_... in https://github.com/BigDataWUR/AgML-CY-Bench/tree/main/data_preparation)

Folder structure

The CY-Bench dataset has been structure at first level by crop type and subsequently by country. For each country, the folder name follows the ISO 3166-1 alpha-2 two-character code. A separate .csv is available for each predictor data and crop calendar as shown below. The csv files are named to reflect the corresponding country and crop type e.g. **variable_croptype_country.csv**.
```
CY-Bench
│
└─── maize
│ │
│ └─── AO
│ │ -- crop_calendar_maize_AO.csv
│ │ -- fpar_maize_AO.csv
│ │ -- meteo_maize_AO.csv
│ │ -- ndvi_maize_AO.csv
│ │ -- soil_maize_AO.csv
│ │ -- soil_moisture_maize_AO.csv
│ │ -- yield_maize_AO.csv
│ │
│ └─── AR
│ -- crop_calendar_maize_AR.csv
│ -- fpar_maize_AR.csv
│ -- ...
│
└─── wheat
│ │
│ └─── AR
│ │ -- crop_calendar_wheat_AR.csv
│ │ -- fpar_wheat_AR.csv
│ │ ...
```

Example : CSV data content for maize in country X

```
X
└─── crop_calendar_maize_X.csv
│ -- crop_name (name of the crop)
│ -- adm_id (unique identifier for a subnational unit)
│ -- sos (start of crop season)
│ -- eos (end of crop season)
│
└─── fpar_maize_X.csv
│ -- crop_name
│ -- adm_id
│ -- date (in the format YYYYMMdd)
│ -- fpar
│
└─── meteo_maize_X.csv
│ -- crop_name
│ -- adm_id
│ -- date (in the format YYYYMMdd)

│ -- tmin (minimum temperature)
│ -- tmax (maximum temperature)
│ -- prec (precipitation)
│ -- rad (radiation)
│ -- tavg (average temperature)
│ -- et0 (evapotranspiration)
│ -- cwb (crop water balance)
│
└─── ndvi_maize_X.csv
│ -- crop_name
│ -- adm_id
│ -- date (in the format YYYYMMdd)
│ -- ndvi
│
└─── soil_maize_X.csv
│ -- crop_name
│ -- adm_id
│ -- awc (available water capacity)
│ -- bulk_density
│ -- drainage_class
│
└─── soil_moisture_maize_X.csv
│ -- crop_name
│ -- adm_id
│ -- date (in the format YYYYMMdd)
│ -- ssm (surface soil moisture)
│ -- rsm ()
│
└─── yield_maize_X.csv
│ -- crop_name
│ -- country_code
│ -- adm_id
│ -- harvest_year
│ -- yield
│ -- harvest_area
│ -- production

Data access

The full dataset can be downloaded directly from Zenodo or using the ```zenodo_get``` library

License and citation

We kindly ask all users of CY-Bench to properly respect licensing and citation conditions of the datasets included.

Ultra Sound Garbage Bin Sensor Reading Records
zenodo.org
zip
Updated Mar 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
André Amorim Ribeiro; André Amorim Ribeiro (2025). Ultra Sound Garbage Bin Sensor Reading Records [Dataset]. http://doi.org/10.5281/zenodo.14988663
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14988663
Dataset updated
Mar 7, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
André Amorim Ribeiro; André Amorim Ribeiro
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The directory is structured in the following way:

The file "ids.csv" is a one-column table that contains the IDs of all the containers present in the dataset. Use this file as a helper to iterate over the dictionary.

The file "INFO.csv" contains relevant data about the containers, namely "Capacity," "WasteType," "Latitude," and "Longitude."

Then, for each container, there are four files with the following naming convention:

Container_ID_recs_UnCorrected_with_metrics.csv

Container_ID_fill_UnCorrected_with_metrics.csv

Container_ID_recs_Corrected_with_metrics.csv

Container_ID_fill_Corrected_with_metrics.csv

The RAW values extracted from the sensors are in the UnCorrected files. The FILL levels are extracted by the sensors, and DATES (whether from collections or sensor readings) are provided by the management system. The sensor timestamps are automatic, and collection times are manually entered by service providers. The rest of the information consists of calculations made by us for reproducibility. An atempt to correct some data was preform, these are present in the corrected files.

The fill files contain information about the Fill level of the container. Additional columns include the Max and Min monotonic approximations between each collection, along with their mean (middle point) value. The "REC" column is a mask set to true for the first fill value measured after a collection is made. The "Cidx" column is a unique identifier for each fill value between collections, which allows us to use groupby on each event between collections.

The rec files contain information about the Collections made for that container. Additional columns include the End_Pointer, which is the row index for when that collection is made in the corresponding fill file, along with AVG_DIST, which is 100 minus the average difference between the Max and Min approximations until the next collection, and 100 times the Spearman coefficient for the fill values until the next collection.
c
Comprehensive eBay Products Dataset: Analyze Listings, Prices, and Trends |...
crawlfeeds.com
csv, zip
Updated Jul 9, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Crawl Feeds (2025). Comprehensive eBay Products Dataset: Analyze Listings, Prices, and Trends | Download Now! [Dataset]. https://crawlfeeds.com/datasets/ebay-products-dataset
Explore at:
csv, zipAvailable download formats
Dataset updated
Jul 9, 2025
Dataset authored and provided by
Crawl Feeds
License
https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy
Description
Massive eBay Marketplace Data Collection for E-commerce Intelligence

Unlock the power of online marketplace analytics with our comprehensive eBay products dataset. This premium collection contains 1.29 million products from eBay's global marketplace, providing extensive insights into one of the world's largest e-commerce platforms. Perfect for competitive analysis, pricing strategies, market research, and machine learning applications in e-commerce.

Dataset Overview

Total Products: 1,290,000+ marketplace listings

Source: eBay Global Marketplace

Format: CSV, ZIP compressed

File Size: Optimized compressed format

Coverage: Multi-category product listings across eBay

Complete Data Fields Included

Product Identification

id: Unique eBay product identifiers

name: Complete product titles and names

url: Direct eBay listing page links

epid: eBay Product ID for catalog matching

source: Data source identification

Product Details

raw_product_description: Original unprocessed product descriptions

product_description: Cleaned and formatted product descriptions

brand: Brand names and manufacturer information

mpn: Manufacturer Part Numbers

gtin13: Global Trade Item Numbers (barcodes)

Pricing and Availability

price: Current listing prices

currency: Currency information for international listings

in_stock: Stock availability status

breadcrumbs: Category navigation paths

Visual and Technical Data

images: Product image URLs and references

crawled_at: Data collection timestamps

Key Use Cases

E-commerce Market Research

Analyze eBay marketplace trends and patterns

Study product category performance

Monitor pricing strategies across sellers

Identify high-demand product categories

Competitive Intelligence

Benchmark pricing against eBay marketplace

Analyze product positioning strategies

Study seller competition and market share

Monitor inventory levels and availability

Price Optimization

Develop dynamic pricing algorithms

Analyze price elasticity across categories

Compare marketplace pricing trends

Optimize listing prices for maximum visibility

Machine Learning Applications

Train recommendation systems

Develop price prediction models

Create product categorization algorithms

Build inventory forecasting systems

Target Industries

E-commerce Retailers

Marketplace Strategy: Optimize eBay selling strategies

Pricing Intelligence: Competitive price monitoring

Product Research: Identify profitable product opportunities

Inventory Planning: Demand forecasting and stock optimization

Technology Companies

AI Training Data: Machine learning model development

Analytics Platforms: E-commerce intelligence tools

Price Comparison: Marketplace comparison services

Search Enhancement: Product discovery optimization

Market Research Firms

Industry Reports: E-commerce marketplace analysis

Consumer Behavior: Online shopping pattern studies

Brand Monitoring: Brand performance tracking

Trend Analysis: Market trend identification

Academic Research

E-commerce Studies: Online marketplace research

Business Intelligence: Retail analytics case studies

Data Science Projects: Large-scale dataset analysis

Economic Research: Digital marketplace economics

Data Quality Features

Comprehensive Coverage: 1.29 million unique products

Rich Metadata: Complete product information included

Validated Data: Quality-checked and processed

Structured Format: Ready-to-use CSV format

Global Scope: International marketplace coverage

Facebook

Twitter

Click to copy link

Link copied

Cite

Matteo De Felice; Matteo De Felice (2020). ENTSO-E Hydropower modelling data (PECD) in CSV format [Dataset]. http://doi.org/10.5281/zenodo.3950048

ENTSO-E Hydropower modelling data (PECD) in CSV format

Explore at:

6 scholarly articles cite this dataset (View in Google Scholar)

csvAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.3950048

Dataset updated

Aug 14, 2020

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Matteo De Felice; Matteo De Felice

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

PECD Hydro modelling

This repository contains a more user-friendly version of the Hydro modelling data released by ENTSO-E with their latest Seasonal Outlook.

The original URLs:

The original ENTSO-E hydropower dataset integrates the PECD (Pan-European Climate Database) released for the MAF 2019

As I did for the wind & solar data, the datasets released in this repository are only a more user- and machine-readable version of the original Excel files. As avid user of ENTSO-E data, with this repository I want to share my data wrangling efforts to make this dataset more accessible.

Data description

The zipped file contains 86 Excel files, two different files for each ENTSO-E zone.

In this repository you can find 5 CSV files:

PECD-hydro-capacities.csv: installed capacities
PECD-hydro-weekly-inflows.csv: weekly inflows for reservoir and open-loop pumping
PECD-hydro-daily-ror-generation.csv: daily run-of-river generation
PECD-hydro-weekly-reservoir-min-max-generation.csv: minimum and maximum weekly reservoir generation
PECD-hydro-weekly-reservoir-min-max-levels.csv: weekly minimum and maximum reservoir levels

Capacities

The file PECD-hydro-capacities.csv contains: run of river capacity (MW) and storage capacity (GWh), reservoir plants capacity (MW) and storage capacity (GWh), closed-loop pumping/turbining (MW) and storage capacity and open-loop pumping/turbining (MW) and storage capacity. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Run-of-River and pondage, rows from 5 to 7, columns from 2 to 5
sheet Reservoir, rows from 5 to 7, columns from 1 to 3
sheet Pump storage - Open Loop, rows from 5 to 7, columns from 1 to 3
sheet Pump storage - Closed Loop, rows from 5 to 7, columns from 1 to 3

Inflows

The file PECD-hydro-weekly-inflows.csv contains the weekly inflow (GWh) for the climatic years 1982-2017 for reservoir plants and open-loop pumping. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Reservoir, rows from 13 to 66, columns from 16 to 51
sheet Pump storage - Open Loop, rows from 13 to 66, columns from 16 to 51

Daily run-of-river

The file PECD-hydro-daily-ror-generation.csv contains the daily run-of-river generation (GWh). The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Run-of-River and pondage, rows from 13 to 378, columns from 15 to 51

Miminum and maximum reservoir generation

The file PECD-hydro-weekly-reservoir-min-max-generation.csv contains the minimum and maximum generation (MW, weekly) for reservoir-based plants for the climatic years 1982-2017. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Reservoir, rows from 13 to 66, columns from 196 to 231
sheet Reservoir, rows from 13 to 66, columns from 232 to 267

Minimum/Maximum reservoir levels

The file PECD-hydro-weekly-reservoir-min-max-levels.csv contains the minimum/maximum reservoir levels at beginning of each week (scaled coefficient from 0 to 1). The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Reservoir, rows from 14 to 66, column 12
sheet Reservoir, rows from 14 to 66, column 13

CHANGELOG

[2020/07/17] Added maximum generation for the reservoir

Clear search

Close search

Google apps

Main menu

ENTSO-E Hydropower modelling data (PECD) in CSV format

👨‍🦯 Parkinson's Disease Detection Dataset 👨‍⚕️

Parkinson's data set

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Citation:

Little,Max. (2008). Parkinsons. UCI Machine Learning Repository. https://doi.org/10.24432/C59C74.

Matrix column entries (attributes):

Annual maps of cropland abandonment, land cover, and other derived data for...

Data from: A large synthetic dataset for machine learning applications in...

Data generation algorithm

Network

Time series

Usage

Selecting a particular country

Averaging over time

Source code

Funding

Dataset for "Strong dispersion property for the quantum walk on the...

US B2B Contact Data | 200M+ Verified Records | 95% Accuracy | API/CSV/JSON

Randomized Hourly Load Data for use with Taxonomy Distribution Feeders

Data from: Isotopes and related data associated with water tracing with...

KGCW 2024 Challenge @ ESWC 2024

Earth Analytics Bootcamp | Teaching Dataset | Temperature Data for Sonoma...

HarDWR - Harmonized Water Rights Records

Geochemical data supporting analysis of geochemical conditions and nitrogen...

Australian Employee Salary/Wages DATAbase by detailed occupation, location...

Long-term monotonic trends in annual and monthly streamflow metrics at...

Data from: KGCW 2023 Challenge @ ESWC 2023

Knowledge Graph Construction Workshop 2023: challenge

Synthetic Indoor Climate and Occupancy Data from Office and Meeting Room...

Dataset Structure

Data

Metadata

Occupancy Data

MH Deloitte Machine Learning Challenge

Data from: CY-Bench: A comprehensive benchmark dataset for subnational crop...

CY-Bench: A comprehensive benchmark dataset for sub-national crop yield forecasting

Overview

Data

Data format

Data content

Folder structure

Example : CSV data content for maize in country X

Data access

License and citation

Ultra Sound Garbage Bin Sensor Reading Records

Comprehensive eBay Products Dataset: Analyze Listings, Prices, and Trends |...

Massive eBay Marketplace Data Collection for E-commerce Intelligence

Dataset Overview

Complete Data Fields Included

Product Identification

Product Details

Pricing and Availability

Visual and Technical Data

Key Use Cases

E-commerce Market Research

Competitive Intelligence

Price Optimization

Machine Learning Applications

Target Industries

E-commerce Retailers

Technology Companies

Market Research Firms

Academic Research

Data Quality Features

ENTSO-E Hydropower modelling data (PECD) in CSV format