17 datasets found

Storage and Transit Time Data and Code
zenodo.org
zip
Updated Oct 29, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Andrew Felton; Andrew Felton (2024). Storage and Transit Time Data and Code [Dataset]. http://doi.org/10.5281/zenodo.14009758
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14009758
Dataset updated
Oct 29, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Andrew Felton; Andrew Felton
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Author: Andrew J. Felton
Date: 10/29/2024

This R project contains the primary code and data (following pre-processing in python) used for data production, manipulation, visualization, and analysis, and figure production for the study entitled:

"Global estimates of the storage and transit time of water through vegetation"

Please note that 'turnover' and 'transit' are used interchangeably. Also please note that this R project has been updated multiple times as the analysis has updated.

Data information:

The data folder contains key data sets used for analysis. In particular:

"data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.

#Code information

Python scripts can be found in the "supporting_code" folder.

Each R script in this project has a role:

"01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).

"02_functions.R": This script contains custom functions. Load this using the
`source()` function in the 01_start.R script.

"03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
`source()` function in the 01_start.R script.

"04_figures_tables.R": This is the main workhouse for figure/table production and
supporting analyses. This script generates the key figures and summary statistics
used in the study that then get saved in the manuscript_figures folder. Note that all
maps were produced using Python code found in the "supporting_code"" folder.

"supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.

"supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.
Z
Data from: Russian Financial Statements Database: A firm-level collection of...
data.niaid.nih.gov
Updated Mar 14, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ledenev, Victor (2025). Russian Financial Statements Database: A firm-level collection of the universe of financial statements [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14622208
Explore at:
Dataset updated
Mar 14, 2025
Dataset provided by
Ledenev, Victor
Bondarkov, Sergey
Skougarevskiy, Dmitriy
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Area covered
Russia
Description
The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:

🔓 First open data set with information on every active firm in Russia.

🗂️ First open financial statements data set that includes non-filing firms.

🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.

📅 Covers 2011-2023 initially, will be continuously updated.

🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.

The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in a structured, column-oriented, compressed binary format Apache Parquet with yearly partitioning scheme, enabling end-users to query only variables of interest at scale.

The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.

Here we present the instructions for importing the data in R or Python environment. Please consult with the project repository for more information: http://github.com/irlcode/RFSD.

Importing The Data

You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo or rely on 🤗 Hugging Face Datasets library.

Python

🤗 Hugging Face Datasets

It is as easy as:

from datasets import load_dataset import polars as pl

This line will download 6.6GB+ of all RFSD data and store it in a 🤗 cache folder

RFSD = load_dataset('irlspbru/RFSD')

Alternatively, this will download ~540MB with all financial statements for 2023# to a Polars DataFrame (requires about 8GB of RAM)

RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')

Please note that the data is not shuffled within year, meaning that streaming first n rows will not yield a random sample.

Local File Import

Importing in Python requires pyarrow package installed.

import pyarrow.dataset as ds import polars as pl

Read RFSD metadata from local file

RFSD = ds.dataset("local/path/to/RFSD")

Use RFSD_dataset.schema to glimpse the data structure and columns' classes

print(RFSD.schema)

Load full dataset into memory

RFSD_full = pl.from_arrow(RFSD.to_table())

Load only 2019 data into memory

RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

Load only revenue for firms in 2019, identified by taxpayer id

RFSD_2019_revenue = pl.from_arrow( RFSD.to_table( filter=ds.field('year') == 2019, columns=['inn', 'line_2110'] ) )

Give suggested descriptive names to variables

renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv') RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})

R

Local File Import

Importing in R requires arrow package installed.

library(arrow) library(data.table)

Read RFSD metadata from local file

RFSD <- open_dataset("local/path/to/RFSD")

Use schema() to glimpse into the data structure and column classes

schema(RFSD)

Load full dataset into memory

scanner <- Scanner$create(RFSD) RFSD_full <- as.data.table(scanner$ToTable())

Load only 2019 data into memory

scan_builder <- RFSD$NewScan() scan_builder$Filter(Expression$field_ref("year") == 2019) scanner <- scan_builder$Finish() RFSD_2019 <- as.data.table(scanner$ToTable())

Load only revenue for firms in 2019, identified by taxpayer id

scan_builder <- RFSD$NewScan() scan_builder$Filter(Expression$field_ref("year") == 2019) scan_builder$Project(cols = c("inn", "line_2110")) scanner <- scan_builder$Finish() RFSD_2019_revenue <- as.data.table(scanner$ToTable())

Give suggested descriptive names to variables

renaming_dt <- fread("local/path/to/descriptive_names_dict.csv") setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)

Use Cases

🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md

🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md

🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md

FAQ

Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?hat is the data period?

To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.

What is the data period?

We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July, 2025 (see Version and Update Policy below).

Why are there no data for firm X in year Y?

Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:

We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).

Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, Novatek — in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and therefore are able source this information elsewhere.

Firm may have submitted its annual statement but, according to the Uniform State Register of Legal Entities (EGRUL), it was not active in this year. We remove those filings.

Why is the geolocation of firm X incorrect?

We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to a house level in 2014 and 2021-2023, but only at street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, address of incorporation may not correspond with plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.

Why is the data for firm X different from https://bo.nalog.ru/?

Many firms submit correcting statements after the initial filing. While we have downloaded the data way past the April, 2024 deadline for 2023 filings, firms may have kept submitting the correcting statements. We will capture them in the future releases.

Why is the data for firm X unrealistic?

We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.

Why is the data for groups of companies different from their IFRS statements?

We should stress that we provide unconsolidated financial statements filed according to the Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups with this data. Gazprom, for instance, had over 800 affiliated entities and to study this corporate group in its entirety it is not enough to consider financials of the parent company.

Why is the data not in CSV?

The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.

Version and Update Policy

Version (SemVer): 1.0.0.

We intend to update the RFSD annualy as the data becomes available, in other words when most of the firms have their statements filed with the Federal Tax Service. The official deadline for filing of previous year statements is April, 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year but after the end of April this stream quickly thins out. Nevertheless, there is obviously a trade-off between minimization of data completeness and version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% statements are already filed, including 86.4% of all the correcting filings. We plan to make a new version of RFSD available by July.

Licence

Creative Commons License Attribution 4.0 International (CC BY 4.0).

Copyright © the respective contributors.

Citation

Please cite as:

@unpublished{bondarkov2025rfsd, title={{R}ussian {F}inancial {S}tatements {D}atabase}, author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy}, note={arXiv preprint arXiv:2501.05841}, doi={https://doi.org/10.48550/arXiv.2501.05841}, year={2025}}

Acknowledgments and Contacts

Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru

Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,
c
ckanext-salford
catalog.civicdataecosystem.org
Updated Jun 4, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). ckanext-salford [Dataset]. https://catalog.civicdataecosystem.org/dataset/ckanext-salford
Explore at:
Dataset updated
Jun 4, 2025
Description
The Salford extension for CKAN is designed to enhance CKAN's functionality for specific use cases, particularly involving the management and import of datasets relevant to the Salford City Council. By incorporating custom configurations and an ETL script, this extension streamlines the process of integrating external data sources, especially from data.gov.uk, into a CKAN instance. It also provides a structured approach to configuring CKAN for specific data management needs. Key Features: Custom Plugin Integration: Enables the addition of 'salford' and 'esd' plugins to extend CKAN's core functionality, addressing specific data management requirements. Configurable Licenses Group URL: Allows administrators to specify a licenses group URL in the CKAN configuration, streamlining access to license information pertinent to the dataset. ETL Script for Data.gov.uk Import: Includes a Python script (etl.py) to import datasets specifically from the Salford City Council publisher on data.gov.uk. Non-UKLP Dataset Compatibility: The ETL script is designed to filter and import non-UKLP datasets, excluding INSPIRE datasets from the data.gov.uk import process at this time. Bower Component Installation: Simplifies asset management by providing instructions of installing bower components. Technical Integration: The Salford extension requires modifications to the CKAN configuration file (production.ini). Specifically, it involves adding salford and esd to the ckan.plugins setting, defining the licensesgroupurl, and potentially configuring other custom options. The ETL script leverages the CKAN API (ckanapi) for data import. Additionally, Bower components must be installed. Benefits & Impact: Using the Salford CKAN extension, organizations can establish a more streamlined data ingestion process tailored to Salford City Council datasets, enhance data accessibility, improve asset management and facilitate better data governance aligned with specific licensing requirements. By selectively importing datasets and offering custom plugin support, it caters to specialized data management needs.
a
SSURGO Portal User Guide
ngda-portfolio-community-geoplatform.hub.arcgis.com
Updated Jul 16, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The citation is currently not available for this dataset.
Explore at:
Dataset updated
Jul 16, 2025
Dataset authored and provided by
USDA NRCS ArcGIS Online
Area covered

Description
SSURGO PortalThe newest version of SSURGO Portal with Soil Data Viewer is available via the Quick Start Guide. Install Python to C:\Program Files. This is a different version than what ArcGIS Pro uses.If you need data for multiple states, we also offer a prebuilt large database with all SSURGO for the entire United States and all Islands. The prebuilt saves you time but it’s large and takes a while to download.You can also use the prebuilt gNATSGO GeoPackage database in SSURGO Portal – Soil Data Viewer. Read the ReadMe.txt in the folder. More about gNATSGO here. You can also import STATSGO2 data into SSURGO Portal and create a database to use in Soil Data Viewer – Available for download via the Soils Box folder. SSURGO Portal NotesThis 10 minute video covers it all, other than installation of SSURGO Portal and the GIS tool. Installation is typically smooth and easy.There is also a user guide on the SSURGO Portal website that can be very helpful. It has info about using the data in ArcGIS Pro or QGIS. SQLite SSURGO database be opened and queried with DB Browser. It’s essentially free Microsoft Access.Guidance about setting up DB Browser to easily open SQLite databases is available in section 4 of this Installation Guide.Workflow if you need to make your own databaseInstall SSURGO PortalInstall SSURGO Downloader GIS tool (Refer to the Installation and User Guide for assistance)There is one for QGIS and one for ArcGIS Pro. They both do the same thing. Quickly download California SSURGO data with toolEnter two digit state symbol followed by asterisk in “Search by Areasymbol” to download all data for state.For example, enter CA* to batch download all data for CaliforniaOpen SSURGO Portal and create a new SQLite SSURGO Template database (Refer to the User Guide for assistance)Import SSURGO data you downloaded into databaseYou can import SSURGO data from many states at once, building a database that spans many statesAfter SSURGO data is done importing, click on Soil Data Viewer tab and run ratingsThese are the exact same ratings as Web Soil SurveyA new table is added to your database for each ratingYou can search for ratings by keywordIf desired, open database in GIS and make a map (Refer to the User Guide for assistance)Workflow if you need use large prebuilt database (don’t make own database) Install SSURGO PortalIn SSURGO Portal, browse to unzipped prebuilt GeoPackage database with all SSURGOprebuilt large database with all SSURGOgNATSGO GeoPackage databaseIn SSURGO Portal, click on Soil Data Viewer tab and run ratingsThese are the exact same ratings as Web Soil SurveyA new table is added to your database for each ratingYou can search for ratings by keywordIf desired, open database in GIS and make a mapIf you have trouble installing SSURGO Portal. Its usually the connection with Python. Create Desktop short cut that tells SSURGO Portal which Python to useThese were created for Windows 11 Right click anywhere on your desktop and choose New > ShortcutIn the text bar enter your path to the python.exe and your path to the SSURGO Portal.pyz. Notes:Example of format:"C:\Program Files\Python310\python.exe" "C:\SSURGO Portal\SSURGO_Portal-0.3.0.8.pyz"Include quotation marks.Paths may be different on your machine. To avoid typing, you can browse to python.exe in windows explorer, right click and select "Copy as Path and paste results into box. Paste into short location and then do the same for SSURGO Portal.pyz file, but paste to the right of the python.exe path. Click NextName the shortcut anything you want.
Data from: A large synthetic dataset for machine learning applications in...
zenodo.org
csv, json, png, zip
Updated Mar 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Marc Gillioz; Marc Gillioz; Guillaume Dubuis; Philippe Jacquod; Philippe Jacquod; Guillaume Dubuis (2025). A large synthetic dataset for machine learning applications in power transmission grids [Dataset]. http://doi.org/10.5281/zenodo.13378476
Explore at:
zip, png, csv, jsonAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.13378476
Dataset updated
Mar 25, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Marc Gillioz; Marc Gillioz; Guillaume Dubuis; Philippe Jacquod; Philippe Jacquod; Guillaume Dubuis
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limit, under more and more volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge, however they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard if not impossible to access.

This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousands of loads and several hundreds of generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated agains real-world data.

Data generation algorithm

The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.

Network

The network information is encoded in the file europe_network.json. It is given in PowerModels format, which it itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.

Time series

The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.

There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amount to a total of 20 years of synthetic data. When using loads, generators, and lines profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent a same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).

Usage

The time series can be used without a reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, or how to aggregate hourly time steps into days or weeks. These examples use Python and the data analyis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.

Selecting a particular country

This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):

import pandas as pd CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)

The object created in this way is Dataframe with some null values (not all countries have the same number of generators). It can be turned into a list with:

CH_gens_list = CH_gens.dropna().squeeze().to_list()

Finally, we can import all the time series of Swiss generators from a given data table with

pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)

The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.

Averaging over time

This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:

hourly_loads = pd.read_csv('loads_2018_3.csv')

To get a daily average of the loads, we can use:

daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()

This results in series of length 364. To average further over entire weeks and get series of length 52, we use:

weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()

Source code

The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists in two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation in the form of Jupyter notebooks contains numerous examples on how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.

Funding

This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.
z
F-DATA: A Fugaku Workload Dataset for Job-centric Predictive Modelling in...
zenodo.org
data.niaid.nih.gov
bin, csv
Updated Aug 1, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Francesco Antici; Francesco Antici; Andrea Bartolini; Andrea Bartolini; Jens Domke; Jens Domke; Zeynep Kiziltan; Zeynep Kiziltan; Keiji Yamamoto; Keiji Yamamoto (2025). F-DATA: A Fugaku Workload Dataset for Job-centric Predictive Modelling in HPC Systems [Dataset]. http://doi.org/10.5281/zenodo.11467483
Explore at:
bin, csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.11467483
Dataset updated
Aug 1, 2025
Dataset provided by
Nature Publishing Group
Authors
Francesco Antici; Francesco Antici; Andrea Bartolini; Andrea Bartolini; Jens Domke; Jens Domke; Zeynep Kiziltan; Zeynep Kiziltan; Keiji Yamamoto; Keiji Yamamoto
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
F-DATA is a novel workload dataset containing the data of around 24 million jobs executed on Supercomputer Fugaku, over the three years of public system usage (March 2021-April 2024). Each job data contains an extensive set of features, such as exit code, duration, power consumption and performance metrics (e.g. #flops, memory bandwidth, operational intensity and memory/compute bound label), which allows for a multitude of job characteristics prediction. The full list of features can be found in the file feature_list.csv.

The sensitive data appears both in anonymized and encoded versions. The encoding is based on a Natural Language Processing model and retains sensitive but useful job information for prediction purposes, without violating data privacy. The scripts used to generate the dataset are available in the F-DATA GitHub repository, along with a series of plots and instruction on how to load the data.

F-DATA is composed of 38 files, with each YY_MM.parquet file containing the data of the jobs submitted in the month MM of the year YY.

The files of F-DATA are saved as .parquet files. It is possible to load such files as dataframes by leveraging the pandas APIs, after installing pyarrow (pip install pyarrow). A single file can be read with the following Python instrcutions:

# Importing pandas library

import pandas as pd

# Read the 21_01.parquet file in a dataframe format

df = pd.read_parquet("21_01.parquet")

df.head()

Please cite this work as:

@article{antici2025fdata,

title={F-DATA: A Fugaku Workload Dataset for Job-centric Predictive Modelling in HPC Systems},

author={Antici, Francesco and Bartolini, Andrea and Domke, Jens and Kiziltan, Zeynep and Yamamoto, Keiji},

journal = {Scientific Data},

volume={12},

pages={1321},

year={2025},

publisher={Nature Publishing Group},

doi={https://doi.org/10.1038/s41597-025-05633-1}

}
c
Research data supporting "The sun always shines somewhere. The energetic...
repository.cam.ac.uk
csv
Updated Aug 21, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ivens, Frederick (2024). Research data supporting "The sun always shines somewhere. The energetic feasibility of a global grid with 100% renewable electricity" [Dataset]. http://doi.org/10.17863/CAM.111494
Explore at:
csv(293239 bytes)Available download formats
Unique identifier
https://doi.org/10.17863/CAM.111494
Dataset updated
Aug 21, 2024
Dataset provided by
Apollo
University of Cambridge
Authors
Ivens, Frederick
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
EUROPEAN ELECTRICITY DEMAND TIME SERIES (2017)

DATASET DESCRIPTION This dataset contains the aggregated electricity demand time series for Europe in 2017, measured in megawatts (MW). The data is presented with a timestamp in Coordinated Universal Time (UTC) and the corresponding electricity demand in MW.

DATA SOURCE The original data was obtained from the European Network of Transmission System Operators for Electricity (ENTSO-E) and can be accessed at: https://www.entsoe.eu/data/power-stats/ Specifically, the data was extracted from the "MHLV_data-2015-2019.xlsx" file, which provides aggregated hourly electricity load data by country for the years 2015 to 2019.

DATA PROCESSING The dataset was created using the following steps: - Importing Data: The original Excel file was imported into a Python environment using the pandas library. The data was checked for completeness to ensure no missing or corrupted entries. - Time Conversion: All timestamps in the dataset were converted to Coordinated Universal Time (UTC) to standardize the time reference across the dataset. - Aggregation: The data for the year 2017 was extracted from the dataset. The hourly electricity load for all European countries (defined as per as per Yu et al (2019), see below) was summed to generate an aggregated time series representing the total electricity demand across Europe.

FILE FORMAT - CSV Format: The dataset is stored in a CSV file with two columns: - Timestamp (UTC): The time at which the electricity demand was recorded. - Electricity Demand (MW): The total aggregated electricity demand for Europe in megawatts.

USAGE NOTES - Temporal Coverage: The dataset covers the entire year of 2017 with hourly granularity. - Geographical Coverage: The dataset aggregates data from multiple European countries, following the definition of Europe as per Yu et al (2019).

REFERENCES J. Yu, K. Bakic, A. Kumar, A. Iliceto, L. Beleke Tabu, J. Ruaud, J. Fan, B. Cova, H. Li, D. Ernst, R. Fonteneau, M. Theku, G. Sanchis, M. Chamollet, M. Le Du, Y. Zhang, S. Chatzivasileiadis, D.-C. Radu, M. Berger, M. Stabile, F. Heymann, M. Dupré La Tour, M. Manuel de Villena Millan, and M. Ranjbar. Global electricity network - feasibility study. Technical report, CIGRE, 2019. URL https://hdl.handle.net/2268/239969. Accessed July 2024.
T
mnist
tensorflow.org
universe.roboflow.com
+3more
Updated Jun 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). mnist [Dataset]. https://www.tensorflow.org/datasets/catalog/mnist
Explore at:
Dataset updated
Jun 1, 2024
Description
The MNIST database of handwritten digits.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('mnist', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.

https://storage.googleapis.com/tfds-data/visualization/fig/mnist-3.0.1.png" alt="Visualization" width="500px">
h
audio-datasets
huggingface.co
Updated Jan 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gijs (2025). audio-datasets [Dataset]. https://huggingface.co/datasets/gijs/audio-datasets
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 7, 2025
Authors
Gijs
Description
This dataset repository contains all the text files of the datasets analysed in the Survey Paper on Audio Datasets of Scenes and Events. See here for the paper. The GitHub repository containing the scripts are shared here. Including a bash script to download the audio data for each of the datasets. In this repository, we also included a Python file dataset.py, for easy importing of each of the datasets. Please respect the original license of the dataset owner when downloading the data:… See the full description on the dataset page: https://huggingface.co/datasets/gijs/audio-datasets.
Z
Gaussian Process Model and Sensor Placement for Detroit Green...
data.niaid.nih.gov
zenodo.org
Updated Feb 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mason, Brooke (2023). Gaussian Process Model and Sensor Placement for Detroit Green Infrastructure: Datasets and Code [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7667043
Explore at:
Dataset updated
Feb 23, 2023
Dataset provided by
Mason, Brooke
Kerkez, Branko
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Detroit
Description
code.zip: Zip folder containing a folder titled "code" which holds:

csv file titled "MonitoredRainGardens.csv" containing the 14 monitored green infrastructure (GI) sites with their design and physiographic features;

csv file titled "storm_constants.csv" which contain the computed decay constants for every storm in every GI during the measurement period;

csv file titled "newGIsites_AllData.csv" which contain the other 130 GI sites in Detroit and their design and physiographic features;

csv file titled "Detroit_Data_MeanDesignFeatures.csv" which contain the design and physiographic features for all of Detroit;

Jupyter notebook titled "GI_GP_SensorPlacement.ipynb" which provides the code for training the GP models and displaying the sensor placement results;

a folder titled "MATLAB" which contains the following:

folder titled "SFO" which contains the SFO toolbox for the sensor placement work

file titled "sensor_placement.mlx" that contains the code for the sensor placement work

several .mat files created in Python for importing into Matlab for the sensor placement work: "constants_sigma.mat", "constants_coords.mat", "GInew_sigma.mat", "GInew_coords.mat", and "R1_sensor.mat" through "R6_sensor.mat"

several .mat files created in Matalb for importing into Python for visualizing the results: "MI_DETselectedGI.mat" and "DETselectedGI.mat"
ARTS Microwave Single Scattering Properties Database
zenodo.org
application/gzip, pdf +1
Updated Jul 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Robin Ekelund; Manfred Brath; Jana Mendrok; Patrick Eriksson; Robin Ekelund; Manfred Brath; Jana Mendrok; Patrick Eriksson (2024). ARTS Microwave Single Scattering Properties Database [Dataset]. http://doi.org/10.5281/zenodo.4646605
Explore at:
application/gzip, pdf, txtAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4646605
Dataset updated
Jul 19, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Robin Ekelund; Manfred Brath; Jana Mendrok; Patrick Eriksson; Robin Ekelund; Manfred Brath; Jana Mendrok; Patrick Eriksson
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The database contains microwave single scattering data, mainly of ice hydrometeors. Main applications lie in microwave remote sensing, both passive and active. Covered hydrometeors range from pristine ice crystals to large aggregates, graupel and hail, 34 types in total. Furthermore, 34 frequencies from 1 to 886 GHz, and 3 temperatures, 190, 230 and 270 K, are included. The main article orientation is currently totally random (i.e. each orientation is equally probable), but the database is designed to also handle azimuthally orientation, and data for this orientation case will be added in the future. Mie code was used for the two completely spherical habits, while the bulk of data were calculated using the discrete dipole approximation method (DDA).

Interfaces for easy user access is available under a separate upload, due to different licensing, under the title ARTS Microwave Single Scattering Properties Database Interfaces. Interfaces in MATLAB and Python are available, supporting browsing and importing of the data. There is also functionality for extracting data for usage in RTTOV.

A description of the database is also available in the following article. Please cite it if you use the database for a publication.

Eriksson, P., R. Ekelund, J. Mendrok, M. Brath, O. Lemke, and S. A. Buehler (2018), A general database of hydrometeor single scattering properties at microwave and sub-millimetre wavelengths, Earth Syst. Sci. Data, 10, 1301–1326, doi: 10.5194/essd-10-1301-2018.

New version 1.1.0 released: Database, technical report and readme document updated. It is highly recommended to download both the new interface and database versions.

-Added three new habits: two liquid habits with azimuthally random orientation and one new bullet rosette with totally random orientation (with IDs from 35 to 37).

-Updated Python and MATLAB interface to accommodate azimuthally random oriented data and with some other minor updates.

-DDA calculations at frequencies under 100 GHz have been re-calculated using higher EPS settings, in order to accommodate radar applications (see technical report, Sec. 4.1.1).

-The tolerance of the extinction cross-section post-calculation check is now 10 % instead of 30 % (see bullet 5 in technical report, Sec. 4.4.1.1). Those calculations that could not meet the stricter criteria were recalculated using higher EPS setting.

-Format of standard habits revised. The weighting applied to the large habit at each size is now available in the in the mat-files.

-Fixed wrong index for GEM hail in specifications table.
T
tiny_shakespeare
tensorflow.org
huggingface.co
Updated Feb 11, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). tiny_shakespeare [Dataset]. https://www.tensorflow.org/datasets/catalog/tiny_shakespeare
Explore at:
Dataset updated
Feb 11, 2023
Description
40,000 lines of Shakespeare from a variety of Shakespeare's plays. Featured in Andrej Karpathy's blog post 'The Unreasonable Effectiveness of Recurrent Neural Networks': http://karpathy.github.io/2015/05/21/rnn-effectiveness/.

To use for e.g. character modelling:

d = tfds.load(name='tiny_shakespeare')['train'] d = d.map(lambda x: tf.strings.unicode_split(x['text'], 'UTF-8')) # train split includes vocabulary for other splits vocabulary = sorted(set(next(iter(d)).numpy())) d = d.map(lambda x: {'cur_char': x[:-1], 'next_char': x[1:]}) d = d.unbatch() seq_len = 100 batch_size = 2 d = d.batch(seq_len) d = d.batch(batch_size)

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('tiny_shakespeare', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.

Data from: WILLOW - Norther: data set for the full-scale validation of...

zenodo.org
data.europa.eu

application/gzip, bin +1

Updated Nov 14, 2024

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

Dominik Fallais; Dominik Fallais; Wout Weijtjens; Wout Weijtjens; Christof Devriendt; Christof Devriendt; Carlos Sastre Jurado; Carlos Sastre Jurado; Francisco d N Santos; Francisco d N Santos (2024). WILLOW - Norther: data set for the full-scale validation of model-based virtual sensing methods for an operational offshore wind turbine [Dataset]. http://doi.org/10.5281/zenodo.11093262

Explore at:

application/gzip, png, binAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.11093262

Dataset updated

Nov 14, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Dominik Fallais; Dominik Fallais; Wout Weijtjens; Wout Weijtjens; Christof Devriendt; Christof Devriendt; Carlos Sastre Jurado; Carlos Sastre Jurado; Francisco d N Santos; Francisco d N Santos

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered

Apr 30, 2024

Description

1. General description

This data set contains as-build design information, as well as full-scale vibration response measurements from an operational offshore wind-turbine. The turbine is part of the Norther wind farm which is located in the Belgian North Sea and includes a total of 44 Vestas V164 (8.4MW) wind turbines on monopile foundations, see Fig1_Norther_locaction.png. This data set is intended to verify and validate model-based virtual sensing algorithms, using data as well as modeling information from a real turbine.

1.1 Summary of the shared structural information

The included information entails a detailed description of the geometric properties of the monopile and transition piece, distributed and lumped structural masses . All information shared in this record is conform the as-designed documentation. An example of the lumped masses considered in the model input files is presented in "Fig2_Sensor_Network.png"

1.2 Summary of the shared geotechnical information

Monopiles are distinguished by the significant role of soil-structure interaction. Ground reaction is most typically included in the structural model as non-linear p-y curves. Different p-y curves are available for a certain number of soils in the standards applicable to offshore structures (API RP 2GEO, 2011, and ISO 19901-4:2016(E), 2016).

The required soil properties to define p-y curves according to the API framework are given in the soil profile provided in a separate Excel. Rather than symbols, the name of the soil properties is generally used as column header (e.g., Undrained shear strength). Therefore, it is straightforward to identify each soil parameter. The only soil parameter that might lead to confusion is:

"epsilon50 [-]" represents the vertical strain at half the maximum principal stress difference in a static undrained triaxial compression test on an undisturbed soil sample.

It's worthy to note that estimates for the small shear strain stiffness, referred to as Gmax, are also included. Despite not being required as an input to define the API p-y curves, this parameter remains a key input for other soil reaction frameworks than the API (e.g., PISA).

1.3 Summary of the shared measurement data

Two sets of measurement data have been curated for validation purposes; the first interval has been collected during parked conditions, whereas the second interval has been collected during rated operational conditions. Both records have a length of 2 hours, and are subdivided into 10-minute data sets. Furthermore 1Hz SCADA data has been made available for the selected intervals. All different data sources are time synchronized and have been subjected to several internal quality checks.

The sensor network on NRT-WTG is illustrated in in Fig. 2, whereas a description of the sensor types is presented in Tab.1. The acceleration sensors are installed in the horizontal plane, and measure tangential (Y) and orthogonal (X) to the wall, where the positive Y direction is pointing clockwise and the positive X direction is pointing inwards. All strain sensors are installed vertically and are located on the inside of the wall.

Data type	Sensor type	Fs (Hz)	Level mLAT (m)	Description
Acceleration (g)	Piezo-electric acc. sensor (ACC)	30	15, 69, 97	3 Bi-directional accelerometers at different levels. LAT 15 installed at 240 degree heading; LAT 69 and 97 at 60 degree.
Strain (micro strain)	Resistive strain gauge (SG)	30	14	6 SGs: equally spaced around the inner circumference of the can. Headings: 50, 110, 170, 230, 290, 350 degree.
Strain (micro strain)	Fiber-Bragg Grating strain gauge (FBG)	100	-17, -19	2 FBGs per level at 165 and 255 degree respectively.

Table 1. Description of sensor types.

The FBG strain time series have been synchronized with the SG time series using using a cross-correlation based approach. Therefore the SG data has been used to genereate refrence strain time series at the headings of the FBG sensors; the FBG data is subsequently synchronized with regard to this reference time series. No synchronization of the acceleration data was needed, since these are collected using the same data aquisition system as the SG data.

The SG strain time series have been calibrated and temperature compensated, whereas this is not the case for the FBG strain time series. The latter have a yet to be determined calibration offset.

In conjunction to the sensor channels presented in Tab. 1, 1 Hz SCADA data is provided. A summary of the provided SCADA parameters, all sampled at 1Hz, is presented in Tab 2.

Parameter	Unit	Description
Wind speed	m/s	Wind speed as recorded in the turbine SCADA
Wind direction	°	Wind direction relative to North (0°) as recorded in the turbine SCADA
Yaw angle	°	Yaw orientation of the nacelle relative to North (0°) as recorded in the turbine SCADA
Pitch angle	°	Rotor blade pitch as recorded in the turbine SCADA
Rotor speed	rpm	Rotor speed in rotations per minute as recorded in the turbine SCADA
Power	kW	Active power of the turbine as recorded in the turbine SCADA

Table 2. List of provided SCADA parameters

A summary of the selected intervals and relevant corresponding scada parameters is given in Tab 3.

Scenario	T1 (UTC)	T2 (UTC)	Windspeed	RPM	Pitch
Parked	03/07 01:30	03/07 03:30	< 4.5 m/s	~1	~18 °
Rated	05/07 22:30	06/07 00:30	~15 m/s	10.5	8.1°

Table 3. Selected data intervals and relevant scada parameters

2. Included in this version

2.1 Version - 0.1.0

Relevant Design information can be found in:
- Geometry data for NRT-WTG: "WILLOW-Geometry_v4.xlsx"
- Best estimate soil profile: "WILLOW-BE_soil_profile.xlsx"
Acceleration, strain and scada data can be found in the following parquet files:
- Measurement data for the parked case: "NRT-WTG_Parked.parquet.gz"
- Measurement data for the rated case: "NRT-WTG_Rated.parquet.gz"

3. Importing parquet files

To import the measurement data into Python it is recommended to use pandas:

import pandas as pd
# Read Parquet file with Pandas:
relative_file_path = 'NRT-WTG_Parked.parquet.gz' 
data = pd.read_parquet(relative_file_path ) 

Once the dataframe has been imported, the users can process/re-arrange the raw data according the their needs; it should be noted that the imported dataframe contains NAN values - these are caused by the different sampling rates of the provided signals.

T
qm9
tensorflow.org
Updated Dec 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). qm9 [Dataset]. http://doi.org/10.6084/m9.figshare.c.978904.v5
Explore at:
Unique identifier
https://doi.org/10.6084/m9.figshare.c.978904.v5
Dataset updated
Dec 11, 2024
Description
QM9 consists of computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of C, H, O, N, and F. As usual, we remove the uncharacterized molecules and provide the remaining 130,831.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('qm9', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.
T
curated_breast_imaging_ddsm
tensorflow.org
Updated Jun 1, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). curated_breast_imaging_ddsm [Dataset]. https://www.tensorflow.org/datasets/catalog/curated_breast_imaging_ddsm
Explore at:
Dataset updated
Jun 1, 2024
Description
The CBIS-DDSM (Curated Breast Imaging Subset of DDSM) is an updated and standardized version of the Digital Database for Screening Mammography (DDSM). The DDSM is a database of 2,620 scanned film mammography studies. It contains normal, benign, and malignant cases with verified pathology information.

The default config is made of patches extracted from the original mammograms, following the description from (http://arxiv.org/abs/1708.09427), in order to frame the task to solve in a traditional image classification setting.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('curated_breast_imaging_ddsm', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.

https://storage.googleapis.com/tfds-data/visualization/fig/curated_breast_imaging_ddsm-patches-3.0.0.png" alt="Visualization" width="500px">
Supplementary material 3 from: Harrington LA, Green J, Muinde P, Macdonald...
zenodo.org
bin
Updated Aug 5, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lauren A. Harrington; Jennah Green; Patrick Muinde; David W. Macdonald; Mark Auliya; Neil D'Cruze; Lauren A. Harrington; Jennah Green; Patrick Muinde; David W. Macdonald; Mark Auliya; Neil D'Cruze (2020). Supplementary material 3 from: Harrington LA, Green J, Muinde P, Macdonald DW, Auliya M, D'Cruze N (2020) Snakes and ladders: A review of ball python production in West Africa for the global pet market. Nature Conservation 41: 1-24. https://doi.org/10.3897/natureconservation.41.51270 [Dataset]. http://doi.org/10.3897/natureconservation.41.51270.suppl3
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.3897/natureconservation.41.51270.suppl3
Dataset updated
Aug 5, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Lauren A. Harrington; Jennah Green; Patrick Muinde; David W. Macdonald; Mark Auliya; Neil D'Cruze; Lauren A. Harrington; Jennah Green; Patrick Muinde; David W. Macdonald; Mark Auliya; Neil D'Cruze
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Main importing countries: additional data
Data from: Field measurement data of hydrodynamic and morphological...
zenodo.org
zip
Updated Jan 2, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rik Gijsman; Rik Gijsman; Sabine Engel; Daphne van der Wal; Daphne van der Wal; Rob van Zee; Jessica Johnson; Matthijs van der Geest; Matthijs van der Geest; Kathelijne Wijnberg; Kathelijne Wijnberg; Erik Horstman; Erik Horstman; Sabine Engel; Rob van Zee; Jessica Johnson (2025). Field measurement data of hydrodynamic and morphological processes in the mangrove forest of Lac Bay, Bonaire, Caribbean Netherlands [Dataset]. http://doi.org/10.5281/zenodo.13904523
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.13904523
Dataset updated
Jan 2, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Rik Gijsman; Rik Gijsman; Sabine Engel; Daphne van der Wal; Daphne van der Wal; Rob van Zee; Jessica Johnson; Matthijs van der Geest; Matthijs van der Geest; Kathelijne Wijnberg; Kathelijne Wijnberg; Erik Horstman; Erik Horstman; Sabine Engel; Rob van Zee; Jessica Johnson
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Bonaire, Caribbean Netherlands, Lac Bay
Description
31-10-2024, Rik Gijsman

_
Dataset of field measurement of hydrodynamic and morphological processes in the mangrove forest of Lac Bay, Bonaire, Caribbean Netherlands.

_
For more information please see scientific publication:

Gijsman, R., Engel, S., van der Wal, D., van Zee, R., Johnson, J., van der Geest, M., Wijnberg, K.M. and Horstman, E.M. (2024). The Importance of Tidal Creeks for Mangrove Survival on Small Oceanic Islands. Unpublished Manuscript.

_

Dataset contains:

Timeseries data from field instruments:

Atmospheric pressure

Water depth

Water level

Flow velocity

Waves

Turbidity

Rainfall

Temperature

Wind

Tidal creek flows

Tidal creek transport

Lagoon suspension

Mapped data of field surveys:

Bathymetry and creek profiles

Sediment characteristics

Python script for importing and plotting data

Overview map with shapefiles of instrument location coordinates

_
Additional notes for your information:

Timeseries data stored in .txt files

Visualization of data in .txt files provided with accompanying .png files

Timeseries data of Hobo sensors, Rain gauge and KNMI station interpolated between 19-01-2022 00:00:00 and 18-05-2022 23:55:00 (120 days) on 5 minute intervals

Timeseries data of other sensors interpolated between 19-01-2022 00:00:00 and 08-03-2022 23:55:00 (49 days) on 5 minute intervals

Column headers show instrument station abbreviation

Datetime in local timezone of Bonaire (GMT-4)

When a property (e.g., water depth) was measured by different instruments, the instrument used is mentioned in the file name, e.g. "water_depth_hobo.txt"

Measurements from KNMI (The Royal Netherlands Meteorological Institute) station 990 were included in the dataset and indicated with '_knmi' in the file name. Data obtained from 'https://www.knmi.nl/nederland-nu/klimatologie/uurgegevens_Caribisch_gebied'
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Andrew Felton; Andrew Felton (2024). Storage and Transit Time Data and Code [Dataset]. http://doi.org/10.5281/zenodo.14009758

Storage and Transit Time Data and Code

Explore at:

zipAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.14009758

Dataset updated

Oct 29, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Andrew Felton; Andrew Felton

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Author: Andrew J. Felton
Date: 10/29/2024

This R project contains the primary code and data (following pre-processing in python) used for data production, manipulation, visualization, and analysis, and figure production for the study entitled:

"Global estimates of the storage and transit time of water through vegetation"

Please note that 'turnover' and 'transit' are used interchangeably. Also please note that this R project has been updated multiple times as the analysis has updated.

Data information:

The data folder contains key data sets used for analysis. In particular:

"data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.

#Code information

Python scripts can be found in the "supporting_code" folder.

Each R script in this project has a role:

"01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).

"02_functions.R": This script contains custom functions. Load this using the
`source()` function in the 01_start.R script.

"03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
`source()` function in the 01_start.R script.

"04_figures_tables.R": This is the main workhouse for figure/table production and
supporting analyses. This script generates the key figures and summary statistics
used in the study that then get saved in the manuscript_figures folder. Note that all
maps were produced using Python code found in the "supporting_code"" folder.

"supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.

"supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.

Clear search

Close search

Google apps

Main menu

Storage and Transit Time Data and Code

Data from: Russian Financial Statements Database: A firm-level collection of...

This line will download 6.6GB+ of all RFSD data and store it in a 🤗 cache folder

Alternatively, this will download ~540MB with all financial statements for 2023# to a Polars DataFrame (requires about 8GB of RAM)

Read RFSD metadata from local file

Use RFSD_dataset.schema to glimpse the data structure and columns' classes

Load full dataset into memory

Load only 2019 data into memory

Load only revenue for firms in 2019, identified by taxpayer id

Give suggested descriptive names to variables

Read RFSD metadata from local file

Use schema() to glimpse into the data structure and column classes

Load full dataset into memory

Load only 2019 data into memory

Load only revenue for firms in 2019, identified by taxpayer id

Give suggested descriptive names to variables

ckanext-salford

SSURGO Portal User Guide

Data from: A large synthetic dataset for machine learning applications in...

Data generation algorithm

Network

Time series

Usage

Selecting a particular country

Averaging over time

Source code

Funding

F-DATA: A Fugaku Workload Dataset for Job-centric Predictive Modelling in...

Research data supporting "The sun always shines somewhere. The energetic...

mnist

audio-datasets

Gaussian Process Model and Sensor Placement for Detroit Green...

ARTS Microwave Single Scattering Properties Database

tiny_shakespeare

Data from: WILLOW - Norther: data set for the full-scale validation of...

1. General description

1.1 Summary of the shared structural information

1.2 Summary of the shared geotechnical information

1.3 Summary of the shared measurement data

2. Included in this version

2.1 Version - 0.1.0

3. Importing parquet files

qm9

curated_breast_imaging_ddsm

Supplementary material 3 from: Harrington LA, Green J, Muinde P, Macdonald...

Data from: Field measurement data of hydrodynamic and morphological...

Storage and Transit Time Data and CodeSee More Versions

Storage and Transit Time Data and Code