Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Author: Andrew J. Felton
Date: 10/29/2024
This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:
"Global estimates of the storage and transit time of water through vegetation"
Please note that 'turnover' and 'transit' are used interchangeably. Also note that this R project has been updated multiple times as the analysis has evolved.
Data information:
The data folder contains key data sets used for analysis. In particular:
"data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.
Code information:
Python scripts can be found in the "supporting_code" folder.
Each R script in this project has a role:
"01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).
"02_functions.R": This script contains custom functions. Load this using the
`source()` function in the 01_start.R script.
"03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
`source()` function in the 01_start.R script.
"04_figures_tables.R": This is the main workhouse for figure/table production and
supporting analyses. This script generates the key figures and summary statistics
used in the study that then get saved in the manuscript_figures folder. Note that all
maps were produced using Python code found in the "supporting_code"" folder.
"supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.
"supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:
🔓 First open data set with information on every active firm in Russia.
🗂️ First open financial statements data set that includes non-filing firms.
🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.
📅 Covers 2011-2023 initially, will be continuously updated.
🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.
The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in Apache Parquet, a structured, column-oriented, compressed binary format, with a yearly partitioning scheme, enabling end-users to query only the variables of interest at scale.
The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.
Here we present the instructions for importing the data in R or Python environment. Please consult with the project repository for more information: http://github.com/irlcode/RFSD.
Importing The Data
You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo, or rely on the 🤗 Hugging Face Datasets library.
Python
🤗 Hugging Face Datasets
It is as easy as:
from datasets import load_dataset
import polars as pl

RFSD = load_dataset('irlspbru/RFSD')
RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')
Please note that the data is not shuffled within each year, meaning that streaming the first n rows will not yield a random sample.
Local File Import
Importing in Python requires the pyarrow package to be installed.
import pyarrow.dataset as ds
import polars as pl

# Open the local copy of the RFSD as an Arrow dataset and inspect its schema
RFSD = ds.dataset("local/path/to/RFSD")
print(RFSD.schema)
# Load the entire dataset, a single year, or selected columns only
RFSD_full = pl.from_arrow(RFSD.to_table())
RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))
RFSD_2019_revenue = pl.from_arrow(
    RFSD.to_table(filter=ds.field('year') == 2019, columns=['inn', 'line_2110'])
)
# Rename columns to descriptive names using the supplied dictionary
renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
RFSD_full = RFSD_full.rename(
    {item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])}
)
R
Local File Import
Importing in R requires the arrow package to be installed.
library(arrow)
library(data.table)

# Open the local copy of the RFSD as an Arrow dataset and inspect its schema
RFSD <- open_dataset("local/path/to/RFSD")
schema(RFSD)
# Load the entire dataset into memory
scanner <- Scanner$create(RFSD)
RFSD_full <- as.data.table(scanner$ToTable())
# Load only the 2019 partition
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scanner <- scan_builder$Finish()
RFSD_2019 <- as.data.table(scanner$ToTable())
# Load only firm identifiers and revenue (line 2110) for 2019
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scan_builder$Project(cols = c("inn", "line_2110"))
scanner <- scan_builder$Finish()
RFSD_2019_revenue <- as.data.table(scanner$ToTable())
# Rename columns to descriptive names using the supplied dictionary
renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)
Use Cases
🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md
🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md
🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md
FAQ
Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?
To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.
What is the data period?
We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July 2025 (see Version and Update Policy below).
Why are there no data for firm X in year Y?
Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:
We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).
Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022, and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, Novatek only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.
A firm may have submitted its annual statement even though, according to the Uniform State Register of Legal Entities (EGRUL), it was not active in that year. We remove those filings.
Why is the geolocation of firm X incorrect?
We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to a house level in 2014 and 2021-2023, but only at street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, address of incorporation may not correspond with plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.
Why is the data for firm X different from https://bo.nalog.ru/?
Many firms submit correcting statements after the initial filing. While we downloaded the data well past the April 2024 deadline for 2023 filings, firms may have kept submitting correcting statements. We will capture them in future releases.
Why is the data for firm X unrealistic?
We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.
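For instance, a minimal way to exclude the flagged firms when working with the Parquet files in Python could look like the following sketch (it assumes the outlier variable is stored as a boolean column; adjust the filter if it is 0/1-coded):

import polars as pl

# Read one yearly partition and drop firms flagged as outliers
rfsd_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')
rfsd_2023_clean = rfsd_2023.filter(~pl.col('outlier'))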
Why is the data for groups of companies different from their IFRS statements?
We should stress that we provide unconsolidated financial statements filed according to Russian accounting standards, meaning that it would be wrong to infer the financials of corporate groups from these data. Gazprom, for instance, had over 800 affiliated entities, and studying this corporate group in its entirety requires more than the financials of the parent company.
Why is the data not in CSV?
The data is provided in the Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query the financials of companies of interest, keeping only the variables of interest in memory and greatly reducing the data footprint.
Version and Update Policy
Version (SemVer): 1.0.0.
We intend to update the RFSD annually as the data becomes available, in other words when most firms have had their statements filed with the Federal Tax Service. The official deadline for filing the previous year's statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. Nevertheless, there is an obvious trade-off between data completeness and timely version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all correcting filings. We plan to make a new version of the RFSD available by July.
Licence
Creative Commons License Attribution 4.0 International (CC BY 4.0).
Copyright © the respective contributors.
Citation
Please cite as:
@unpublished{bondarkov2025rfsd,
  title={{R}ussian {F}inancial {S}tatements {D}atabase},
  author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
  note={arXiv preprint arXiv:2501.05841},
  doi={https://doi.org/10.48550/arXiv.2501.05841},
  year={2025}
}
Acknowledgments and Contacts
Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru
Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,
The Salford extension for CKAN is designed to enhance CKAN's functionality for specific use cases, particularly the management and import of datasets relevant to Salford City Council. By incorporating custom configurations and an ETL script, this extension streamlines the process of integrating external data sources, especially from data.gov.uk, into a CKAN instance. It also provides a structured approach to configuring CKAN for specific data management needs.
Key Features:
- Custom Plugin Integration: Enables the addition of the 'salford' and 'esd' plugins to extend CKAN's core functionality, addressing specific data management requirements.
- Configurable Licenses Group URL: Allows administrators to specify a licenses group URL in the CKAN configuration, streamlining access to license information pertinent to the dataset.
- ETL Script for Data.gov.uk Import: Includes a Python script (etl.py) to import datasets specifically from the Salford City Council publisher on data.gov.uk.
- Non-UKLP Dataset Compatibility: The ETL script is designed to filter and import non-UKLP datasets, excluding INSPIRE datasets from the data.gov.uk import process at this time.
- Bower Component Installation: Simplifies asset management by providing instructions for installing Bower components.
Technical Integration: The Salford extension requires modifications to the CKAN configuration file (production.ini). Specifically, it involves adding salford and esd to the ckan.plugins setting, defining the licensesgroupurl, and potentially configuring other custom options. The ETL script leverages the CKAN API (via ckanapi) for data import; a rough sketch is shown below. Additionally, Bower components must be installed.
Benefits & Impact: Using the Salford CKAN extension, organizations can establish a more streamlined data ingestion process tailored to Salford City Council datasets, enhance data accessibility, improve asset management, and facilitate better data governance aligned with specific licensing requirements. By selectively importing datasets and offering custom plugin support, it caters to specialized data management needs.
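The sketch below illustrates how such an import could be scripted with ckanapi. The publisher slug, the UKLP flag check, and the field mapping are illustrative assumptions, not the extension's actual etl.py:

from ckanapi import RemoteCKAN

# Source and target CKAN instances (target URL and API key are placeholders)
source = RemoteCKAN('https://data.gov.uk')
target = RemoteCKAN('https://ckan.example.org', apikey='REPLACE_WITH_API_KEY')

# Fetch datasets published by Salford City Council (organization slug is an assumption)
result = source.action.package_search(fq='organization:salford-city-council', rows=100)

for dataset in result['results']:
    # Skip records flagged as UKLP/INSPIRE (the exact extra key used is an assumption)
    extras = {e.get('key'): e.get('value') for e in dataset.get('extras', [])}
    if extras.get('UKLP') == 'True':
        continue
    # Create a minimal copy of the dataset on the target instance
    target.action.package_create(
        name=dataset['name'],
        title=dataset['title'],
        notes=dataset.get('notes', ''),
        owner_org='salford-city-council',  # assumed organisation name on the target
    )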
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
F-DATA is a novel workload dataset containing data on around 24 million jobs executed on Supercomputer Fugaku over three years of public system usage (March 2021-April 2024). Each job record contains an extensive set of features, such as exit code, duration, power consumption, and performance metrics (e.g., #flops, memory bandwidth, operational intensity, and a memory-/compute-bound label), which allow for prediction of a multitude of job characteristics. The full list of features can be found in the file feature_list.csv.
The sensitive data appears in both anonymized and encoded versions. The encoding is based on a Natural Language Processing model and retains sensitive but useful job information for prediction purposes without violating data privacy. The scripts used to generate the dataset are available in the F-DATA GitHub repository, along with a series of plots and instructions on how to load the data.
F-DATA is composed of 38 files, with each YY_MM.parquet file containing the data of the jobs submitted in the month MM of the year YY.
The files of F-DATA are saved as .parquet files. It is possible to load them as dataframes using the pandas API, after installing pyarrow (pip install pyarrow). A single file can be read with the following Python instructions:
import pandas as pd
df = pd.read_parquet("21_01.parquet")
df.head()
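To work with a longer period, one straightforward approach (assuming the monthly YY_MM.parquet files sit in the current directory) is to read and concatenate several partitions:

import glob
import pandas as pd

# Read all monthly files for 2021 and concatenate them into one dataframe
files_2021 = sorted(glob.glob("21_*.parquet"))
df_2021 = pd.concat((pd.read_parquet(f) for f in files_2021), ignore_index=True)
print(df_2021.shape)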
The provided Python code is a comprehensive analysis of sales data for a business, involving the merging of monthly sales data, cleaning and augmenting the dataset, and performing various analytical tasks. Here's a breakdown of the code:
Data Preparation and Merging:
The code begins by importing the necessary libraries and filtering out warnings. It merges sales data from 12 months into a single file named "all_data.csv."
Data Cleaning:
Rows with NaN values are dropped, and any entries starting with 'Or' in the 'Order Date' column are removed. Columns like 'Quantity Ordered' and 'Price Each' are converted to numeric types for further analysis.
Data Augmentation:
Additional columns such as 'Month,' 'Sales,' and 'City' are added to the dataset. The 'City' column is derived from the 'Purchase Address' column.
Analysis:
Several analyses are conducted, answering questions such as: the best month for sales and total earnings; the city with the highest number of sales; the ideal time for advertisements based on the number of orders per hour; products that are often sold together; and the best-selling products and their correlation with price.
Visualization:
Bar charts and line plots are used for visualizing the analysis results, making it easier to interpret trends and patterns. Matplotlib is employed for creating visualizations.
Summary:
The code concludes with a comprehensive visualization that combines the quantity ordered and average price for each product, shedding light on product performance. This code is structured to offer insights into sales patterns, customer behavior, and product performance, providing valuable information for strategic decision-making in the business.
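A condensed sketch of this pipeline is given below. The folder layout is an assumption, and the column names are taken from the description above rather than from the actual script:

import glob
import pandas as pd

# Merge the 12 monthly CSV files into a single dataset
monthly_files = glob.glob("Sales_Data/*.csv")  # assumed folder layout
all_data = pd.concat((pd.read_csv(f) for f in monthly_files), ignore_index=True)
all_data.to_csv("all_data.csv", index=False)

# Clean: drop NaN rows and the repeated header rows starting with 'Or'
all_data = all_data.dropna()
all_data = all_data[~all_data["Order Date"].str.startswith("Or")]
all_data["Quantity Ordered"] = pd.to_numeric(all_data["Quantity Ordered"])
all_data["Price Each"] = pd.to_numeric(all_data["Price Each"])

# Augment: month, sales value, and city parsed from the purchase address
all_data["Month"] = pd.to_datetime(all_data["Order Date"]).dt.month
all_data["Sales"] = all_data["Quantity Ordered"] * all_data["Price Each"]
all_data["City"] = all_data["Purchase Address"].str.split(",").str[1].str.strip()

# Example analysis: best month for sales
print(all_data.groupby("Month")["Sales"].sum().sort_values(ascending=False).head(1))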
The MNIST database of handwritten digits.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('mnist', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/mnist-3.0.1.png
CC0 1.0 Universal (CC0 1.0)https://spdx.org/licenses/CC0-1.0
The HURRECON model estimates wind speed, wind direction, enhanced Fujita scale wind damage, and duration of EF0 to EF5 winds as a function of hurricane location and maximum sustained wind speed. Results may be generated for a single site or an entire region. Hurricane track and intensity data may be imported directly from the US National Hurricane Center's HURDAT2 database. HURRECON is available in R and Python. The R version is available on CRAN as HurreconR. The model is an updated version of the original HURRECON model written in Borland Pascal for use with Idrisi (see HF025). New features include support for: (1) estimating wind damage on the enhanced Fujita scale, (2) importing hurricane track and intensity data directly from HURDAT2, (3) creating a land-water file with user-selected geographic coordinates and spatial resolution, and (4) creating plots of site and regional results. The model equations for estimating wind speed and direction, including parameter values for inflow angle, friction factor, and wind gust factor (over land and water), are unchanged from the original HURRECON model. For more details and sample datasets, see the project website on GitHub (https://github.com/hurrecon-model).
The LDAP Authentication extension for CKAN provides a method for authenticating users against an LDAP (Lightweight Directory Access Protocol) server, enhancing security and simplifying user management. This extension allows CKAN to leverage existing LDAP infrastructure for user authentication, account information, and group management, allowing administrators to configure CKAN to authenticate via LDAP and use the credentials of LDAP users.
Key Features:
- LDAP Authentication: Enables CKAN to authenticate users against an LDAP server, using usernames and passwords stored within the LDAP directory.
- User Data Import: Imports user attributes, such as username, full name, email address, and description, from the LDAP directory into the CKAN user profile.
- Flexible LDAP Search: Supports matching against multiple LDAP fields (e.g., username, full name) using configurable search filters. An alternative search filter can be specified for cases where the primary filter returns no results, allowing for flexible matching.
- Combined Authentication: Allows combining LDAP authentication with basic CKAN authentication, providing a fallback mechanism for users not present in the LDAP directory.
- Automatic Organization Assignment: Automatically adds LDAP users to a specified organization within CKAN upon their first login, simplifying organizational role management. The role of the user within the organization can also be specified.
- Active Directory Support: Compatible with Active Directory, allowing seamless integration with existing Windows-based directory services.
- Profile Edit Restriction: Provides an option to prevent LDAP users from editing their profiles within CKAN, centralizing user data management in the LDAP directory.
- Password Reset Functionality: Allows LDAP users to reset their CKAN passwords (not their LDAP passwords), providing a way to recover access to their CKAN accounts. This functionality can be disabled, preventing user-initiated password resets for these accounts.
- Existing User Migration: Facilitates migration from CKAN authentication to LDAP authentication by mapping any existing CKAN user with the same username to the LDAP login user.
- Referral Ignore: Ignores any referral results to avoid queries returning more than one result.
- Debug/Trace Level Option: Sets the debug level of python-ldap and the python-ldap trace level to allow for debugging.
Technical Integration: The extension integrates with CKAN by adding 'ldap' to the list of plugins in the CKAN configuration file. It overrides the default CKAN login form, redirecting login requests to the LDAP authentication handler. Configuration options can be specified in the CKAN .ini config file, including the LDAP server URI, base DN, search filters, and attribute mappings.
Benefits & Impact: Implementing the LDAP Authentication extension simplifies user management by centralizing authentication within an LDAP directory, reducing administrative overhead. It enhances security by leveraging existing LDAP security policies and credentials. The extension streamlines user onboarding by automatically importing user data and assigning organizational roles, improving user experience and data consistency.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the root directory containing data files, bash scripts, and Python scripts to generate the
data for the tables and figures in my PhD thesis, titled "Nuclear Wavefunctions of Dispersion
Bound Systems: Endohedral Eigenstates of Endofullerenes". This thesis was submitted in September
2024, with corrections (no additional calculations) approved in December 2024. The electronic
structure data is provided raw, as outputs from FermiONs++ and FHI-aims. The machine-learned PESs
are constructed from Python scripts. These are then used to calculate the nuclear eigenstates, which
is achieved using a self-written library, "EPEE", available on GitLab at
https://gitlab.developers.cam.ac.uk/ksp31/epee.
Author: Kripa Panchagnula
Date: January 2025
To run the machine learning, nuclear diagonalisation, and plotting scripts the "thesis_calcs" branch
(commit SHA: 100d79600aae7668d4ceaeafc6274a89f019283c) or "main" branch (commit SHA:
4e4d677f609028710fbc8e4f48dc4895543340db) of EPEE is required alongside NumPy, SciPy, scikit-learn,
matplotlib and the "development" branch of QSym2 from https://qsym2.dev/. Any Python script importing
from src is referring to the EPEE library. Each Python script must be run from within its containing
directory.
The data is separated into the following folders:
- background/
This folder contains a Python script to generate figures for Chapters 1-3.
- He@C60/
This folder contains electronic structure data from FermiONs++ with Python scripts
to generate data for Chapter 4.
- X@C70/
This folder contains Python scripts to generate data for Chapter 5.
- Ne@C70/
This folder contains electronic structure data from FermiONs++ and FHI-aims with
Python scripts to generate data for Chapter 6.
- H2@C70/
This folder contains Python scripts to generate data for Chapter 7.
- peapods/
This folder contains Python scripts to generate data for Chapter 8.
Each folder contains its own README, with more details about its structure. File types include text files (.txt, .dat, .cube), scripts (.bash, .py) and NumPy compressed data files (.npz).
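For reference, the .npz files can be opened directly with NumPy; the file name below is purely a placeholder:

import numpy as np

# Load a compressed NumPy archive and list the arrays it contains
with np.load("example.npz") as data:
    print(data.files)              # names of the stored arrays
    first = data[data.files[0]]    # access one array by name
    print(first.shape)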
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
code.zip: Zip folder containing a folder titled "code" which holds:
csv file titled "MonitoredRainGardens.csv" containing the 14 monitored green infrastructure (GI) sites with their design and physiographic features;
csv file titled "storm_constants.csv" which contain the computed decay constants for every storm in every GI during the measurement period;
csv file titled "newGIsites_AllData.csv" which contain the other 130 GI sites in Detroit and their design and physiographic features;
csv file titled "Detroit_Data_MeanDesignFeatures.csv" which contain the design and physiographic features for all of Detroit;
Jupyter notebook titled "GI_GP_SensorPlacement.ipynb" which provides the code for training the GP models and displaying the sensor placement results;
a folder titled "MATLAB" which contains the following:
folder titled "SFO" which contains the SFO toolbox for the sensor placement work
file titled "sensor_placement.mlx" that contains the code for the sensor placement work
several .mat files created in Python for importing into Matlab for the sensor placement work: "constants_sigma.mat", "constants_coords.mat", "GInew_sigma.mat", "GInew_coords.mat", and "R1_sensor.mat" through "R6_sensor.mat"
several .mat files created in Matlab for importing into Python for visualizing the results: "MI_DETselectedGI.mat" and "DETselectedGI.mat"
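The .mat round trip between Python and MATLAB described above can be reproduced with SciPy. The array contents below are placeholders; only the file names follow the listing above:

import numpy as np
from scipy.io import loadmat, savemat

# Save arrays from Python for use in the MATLAB sensor placement scripts
constants_sigma = np.random.rand(14, 14)   # placeholder kernel matrix
constants_coords = np.random.rand(14, 2)   # placeholder site coordinates
savemat("constants_sigma.mat", {"constants_sigma": constants_sigma})
savemat("constants_coords.mat", {"constants_coords": constants_coords})

# Load the MATLAB results back into Python for visualizing the results
selected = loadmat("DETselectedGI.mat")
print(selected.keys())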
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
EUROPEAN ELECTRICITY DEMAND TIME SERIES (2017)
DATASET DESCRIPTION
This dataset contains the aggregated electricity demand time series for Europe in 2017, measured in megawatts (MW). The data is presented with a timestamp in Coordinated Universal Time (UTC) and the corresponding electricity demand in MW.
DATA SOURCE
The original data was obtained from the European Network of Transmission System Operators for Electricity (ENTSO-E) and can be accessed at: https://www.entsoe.eu/data/power-stats/ Specifically, the data was extracted from the "MHLV_data-2015-2019.xlsx" file, which provides aggregated hourly electricity load data by country for the years 2015 to 2019.
DATA PROCESSING
The dataset was created using the following steps:
- Importing Data: The original Excel file was imported into a Python environment using the pandas library. The data was checked for completeness to ensure no missing or corrupted entries.
- Time Conversion: All timestamps in the dataset were converted to Coordinated Universal Time (UTC) to standardize the time reference across the dataset.
- Aggregation: The data for the year 2017 was extracted from the dataset. The hourly electricity load for all European countries (defined as per Yu et al. (2019), see below) was summed to generate an aggregated time series representing the total electricity demand across Europe.
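A minimal sketch of these steps with pandas is shown below; the column names and the abbreviated country list are assumptions, since the original processing script is not part of this record:

import pandas as pd

# Hypothetical subset of European countries (the full list follows Yu et al. (2019))
EUROPEAN_COUNTRIES = ["DE", "FR", "IT", "ES", "PL"]

# Import the ENTSO-E hourly load workbook (column names are assumptions)
load = pd.read_excel("MHLV_data-2015-2019.xlsx")

# Keep 2017, convert timestamps to UTC, and sum across the selected countries
load["Timestamp (UTC)"] = pd.to_datetime(load["DateUTC"], utc=True)
load_2017 = load[(load["Timestamp (UTC)"].dt.year == 2017)
                 & (load["country"].isin(EUROPEAN_COUNTRIES))]
demand = (load_2017.groupby("Timestamp (UTC)")["load_MW"].sum()
          .rename("Electricity Demand (MW)"))
demand.to_csv("european_demand_2017.csv")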
FILE FORMAT
- CSV Format: The dataset is stored in a CSV file with two columns:
  - Timestamp (UTC): The time at which the electricity demand was recorded.
  - Electricity Demand (MW): The total aggregated electricity demand for Europe in megawatts.
USAGE NOTES
- Temporal Coverage: The dataset covers the entire year of 2017 with hourly granularity.
- Geographical Coverage: The dataset aggregates data from multiple European countries, following the definition of Europe as per Yu et al. (2019).
REFERENCES
J. Yu, K. Bakic, A. Kumar, A. Iliceto, L. Beleke Tabu, J. Ruaud, J. Fan, B. Cova, H. Li, D. Ernst, R. Fonteneau, M. Theku, G. Sanchis, M. Chamollet, M. Le Du, Y. Zhang, S. Chatzivasileiadis, D.-C. Radu, M. Berger, M. Stabile, F. Heymann, M. Dupré La Tour, M. Manuel de Villena Millan, and M. Ranjbar. Global electricity network - feasibility study. Technical report, CIGRE, 2019. URL https://hdl.handle.net/2268/239969. Accessed July 2024.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset containing measurements of Linux kernel binary size after compilation. The reported size, in the column "perf", is the size in bytes of the vmlinux file. It also contains a column "active_options" reporting the number of activated options (set to "y"). All other columns, listed in the file "Linux_options.json", are Linux kernel options. The sampling was done using randconfig. The version of Linux used is 4.13.3.
Not all available options are present. First, the dataset only contains options for the x86, 64-bit version. Then, all non-tristate options have been ignored. Finally, options that do not take multiple values across the whole dataset, due to insufficient variability in the sampling, are ignored. All options are encoded as 0 for the "n" and "m" option values, and 1 for "y".
In Python, importing the dataset with pandas assigns all columns the int64 dtype, which leads to very high memory consumption (~50 GB). The snippet below imports it using less than 1 GB of memory by setting the option columns to int8.
import json
import numpy
import pandas as pd

# Load the list of Linux kernel option column names
with open("Linux_options.json", "r") as f:
    linux_options = json.load(f)

# Read the dataset, forcing the option columns to int8 to keep memory usage low
df = pd.read_csv("Linux.csv", dtype={option: numpy.int8 for option in linux_options})
This dataset consists of basic statistics and career statistics provided by the NFL on their official website (http://www.nfl.com) for all players, active and retired.
All of the data was web scraped using Python code, which can be found and downloaded here: https://github.com/ytrevor81/NFL-Stats-Web-Scrape
Before we go into the specifics, it's important to note that in the basic statistics and career statistics CSV files, all players are assigned a 'Player_Id'. This is the same ID used by the official NFL website to identify each player. This is useful, for example, when importing these CSV files into a SQL database for an app.
The data pulled for each player in Active_Player_Basic_Stats.csv is as follows:
a. Player ID
b. Full Name
c. Position
d. Number
e. Current Team
f. Height
g. Height
h. Weight
i. Experience
j. Age
k. College
The data pulled for each player in Retired_Player_Basic_Stats.csv differs slightly from the previous data set. The data is as follows:
a. Player ID
b. Full Name
c. Position
f. Height
g. Height
h. Weight
j. College
k. Hall of Fame Status
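As an example, a hedged sketch of loading these CSV files into a SQLite database keyed on Player_Id might look like this (table names are illustrative):

import sqlite3
import pandas as pd

# Read the scraped CSV files (file names follow the description above)
active = pd.read_csv("Active_Player_Basic_Stats.csv")
retired = pd.read_csv("Retired_Player_Basic_Stats.csv")

# Write them to a SQLite database; Player_Id is the NFL-assigned identifier
with sqlite3.connect("nfl_stats.db") as conn:
    active.to_sql("active_players", conn, if_exists="replace", index=False)
    retired.to_sql("retired_players", conn, if_exists="replace", index=False)
    sample = pd.read_sql("SELECT * FROM active_players LIMIT 5", conn)
print(sample)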
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was originally curated by Software Carpentry, a branch of The Carpentries non-profit organization, and is based on data from the Gapminder Foundation. It consists of six tabular CSV files containing GDP data for various countries across different years. The dataset was initially prepared for the Software Carpentry tutorial "Plotting and Programming in Python" and is also reused in the Galaxy Training Network (GTN) tutorial "Use Jupyter Notebooks in Galaxy."
This GTN tutorial provides an introduction to launching a Jupyter Notebook in Galaxy, installing dependencies, and importing and exporting data. It serves as a setup guide for a Jupyter Notebook environment that can be used to follow the Software Carpentry tutorial "Plotting and Programming in Python."
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This layer group describes multibeam echosounder data collected on RV Investigator voyage IN2020_E01 titled "Trials and Calibration". The voyage took place between July 29 and August 6, 2020, departing from Hobart (TAS) and arriving in Hobart (TAS).
The purpose of this voyage was to undertake post port-period equipment calibrations and commissioning, sea trials as well as personnel training.
This dataset is published with the permission of CSIRO. Not to be used for navigational purposes.
The dataset contains bathymetry grids of 10m to 210m resolution of the Tasmanian Coast produced from the processed EM122 and EM710 bathymetry data. Lineage: Multibeam data was logged from the EMs in Kongsberg's proprietary *.all format and was converted for processing within CARIS HIPS and SIPS version 10.4. Initial data conversion and processing was performed using the GSM Python batch utility, with the manual method used for importing patch test data. Once the raw files were converted into the HIPS and SIPS format, the data was analysed for noise. With the exception of EM710 reference surface lines (files 0083, 0085, 0087, 0090, 0092, 0096, 0098, 0100 & 0103) and EM710 patch test lines (files 0031, 0033, 0035, 0037, 0040, 0042, 0044), which had GPS tide applied, no tide was applied to the remaining lines. All lines were merged using the vessel file appropriate for either the EM122 or EM710. Because the angular offsets were zeroed in SIS prior to the EM710 and EM122 patch tests, the vessel files for each were edited to apply the calibration values to those lines.
The data was then gridded at the highest resolution possible and further inspected for outliers.
The data was then gridded at multiple resolutions using a Python CARIS batch script, applying a depth-versus-resolution guideline derived from the AusSeabed Multibeam Guidelines v2, and further inspected for outliers. Final raster products are available in the L3 folder of this collection. Final processed data were also exported per line in GSF and ASCII formats and are available in the L2 folder of this collection.
40,000 lines of Shakespeare from a variety of Shakespeare's plays. Featured in Andrej Karpathy's blog post 'The Unreasonable Effectiveness of Recurrent Neural Networks': http://karpathy.github.io/2015/05/21/rnn-effectiveness/.
To use for e.g. character modelling:
d = tfds.load(name='tiny_shakespeare')['train']
d = d.map(lambda x: tf.strings.unicode_split(x['text'], 'UTF-8'))
# train split includes vocabulary for other splits
vocabulary = sorted(set(next(iter(d)).numpy()))
d = d.map(lambda x: {'cur_char': x[:-1], 'next_char': x[1:]})
d = d.unbatch()
seq_len = 100
batch_size = 2
d = d.batch(seq_len)
d = d.batch(batch_size)
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('tiny_shakespeare', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Source code, documentation, and examples of use of the source code for the Dioptra Test Platform.
Dioptra is a software test platform for assessing the trustworthy characteristics of artificial intelligence (AI). Trustworthy AI is: valid and reliable, safe, secure and resilient, accountable and transparent, explainable and interpretable, privacy-enhanced, and fair, with harmful bias managed. Dioptra supports the Measure function of the NIST AI Risk Management Framework by providing functionality to assess, analyze, and track identified AI risks.
Dioptra provides a REST API, which can be controlled via an intuitive web interface, a Python client, or any REST client library of the user's choice for designing, managing, executing, and tracking experiments. Details are available in the project documentation at https://pages.nist.gov/dioptra/.
Use Cases
We envision the following primary use cases for Dioptra:
- Model Testing:
  -- 1st party: Assess AI models throughout the development lifecycle
  -- 2nd party: Assess AI models during acquisition or in an evaluation lab environment
  -- 3rd party: Assess AI models during auditing or compliance activities
- Research: Aid trustworthy AI researchers in tracking experiments
- Evaluations and Challenges: Provide a common platform and resources for participants
- Red-Teaming: Expose models and resources to a red team in a controlled environment
Key Properties
Dioptra strives for the following key properties:
- Reproducible: Dioptra automatically creates snapshots of resources so experiments can be reproduced and validated
- Traceable: The full history of experiments and their inputs are tracked
- Extensible: Support for expanding functionality and importing existing Python packages via a plugin system
- Interoperable: A type system promotes interoperability between plugins
- Modular: New experiments can be composed from modular components in a simple yaml file
- Secure: Dioptra provides user authentication, with access controls coming soon
- Interactive: Users can interact with Dioptra via an intuitive web interface
- Shareable and Reusable: Dioptra can be deployed in a multi-tenant environment so users can share and reuse components
The Institute for the Design of Advanced Energy Systems (IDAES) Integrated Platform is a versatile computational environment offering extensive process systems engineering (PSE) capabilities for optimizing the design and operation of complex, interacting technologies and systems. IDAES enables users to efficiently search vast, complex design spaces to discover the lowest cost solutions while supporting the full process modeling lifecycle, from conceptual design to dynamic optimization and control. The extensible, open platform empowers users to create models of novel processes and rapidly develop custom analyses, workflows, and end-user applications.
IDAES-PSE 2.6.0 Release Highlights
Upcoming Changes
IDAES will be switching to the new Pyomo solver interface in the next release. Whilst this will hopefully be a smooth transition for most users, there are a few important changes to be aware of. The new solver interface uses a different version of the IPOPT writer ("ipopt_v2"), and thus any custom configuration options you might have set for IPOPT will not carry over and will need to be reset. By default, the new Pyomo linear presolver will be activated with ipopt_v2. Whilst we are working to identify any bugs in the presolver, it is possible that some edge cases will remain. IDAES will begin deploying a new set of scaling tools and APIs over the next few releases that make use of the new solver writers. The old scaling tools and APIs will remain for backward compatibility but will begin to be deprecated.
New Models, Tools and Features
- New Intersphinx extension automatically linking Jupyter notebook examples to the project documentation
- New end-to-end diagnostics example demonstrated on a real problem
- New complementarity formulation for VLE with cubic equations of state, with backward compatibility for the old formulation
- New solver interface with presolve (ipopt_v2) in support of upcoming changes to the initialization methods and APIs, with the default set to ipopt to maintain backwards compatibility; this will be deprecated once all examples have been updated
- New forecaster and parameterized bidder methods within the grid integration library
- Updated surrogates API and examples to support Keras 3, with backwards compatibility for older formats such as TensorFlow SavedModel (TFSM)
- Updated costing base dictionary to include the 2023 cost year index value
- Updated ProcessBlock to include information on the constructing block class
- Updated Flowsheet Visualizer to allow the visualize() method to return values and functions
Bug Fixes
- Fixed a bug in the Modular Property Framework that would cause errors when trying to use phase-based material balances with phase equilibria
- Fixed a bug in the Modular Properties Framework that caused errors when initializing models with non-vapor-liquid phase equilibria
- Fixed typos flagged by the June update to crate-ci/typos and removed DMF-related exceptions
- Minor corrections to units-of-measurement handling in power plant waste/transport costing expressions, control volume material holdup expressions, and BTX property package parameters
- Fixed the emission of >7500 numpy deprecation warnings by replacing scalar value assignment with element extraction and item iteration calls
Testing and Robustness
- Migrated slow tests (>10s) to integration, impacting test coverage but also yielding a nearly 30% decrease in local test runtime
- Pinned pint to avoid issues with older supported Python versions
- Pinned codecov versions to avoid tokenless upload behavior with the latest version
- Bumped extensions to version 3.4.2 to allow pointing to a non-standard install location
Deprecations and Removals
- Python 3.8 is no longer supported. The supported Python versions are 3.9 through 3.12
- The Data Management Framework (DMF) is no longer supported. Importing idaes.core.dmf will cause a deprecation warning to be displayed until the next release
- The SOFC Keras surrogates have been removed. The current version of the SOFC surrogate model in the examples repository is a PySMO Kriging model.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ExioML is the first ML-ready benchmark dataset in eco-economic research, designed for global sectoral sustainability analysis. It addresses significant research gaps by leveraging the high-quality, open-source EE-MRIO dataset ExioBase 3.8.2. ExioML covers 163 sectors across 49 regions from 1995 to 2022, overcoming data inaccessibility issues. The dataset includes both factor accounting in tabular format and footprint networks in graph structure.
We demonstrate a GHG emission regression task using a factor accounting table, comparing the performance of shallow and deep models. The results show a low Mean Squared Error (MSE), quantifying sectoral GHG emissions in terms of value-added, employment, and energy consumption, validating the dataset's usability. The footprint network in ExioML, inherent in the multi-dimensional MRIO framework, enables tracking resource flow between international sectors.
ExioML offers promising research opportunities, such as predicting embodied emissions through international trade, estimating regional sustainability transitions, and analyzing the topological changes in global trading networks over time. It reduces barriers and intensive data pre-processing for ML researchers, facilitates the integration of ML and eco-economic research, and provides new perspectives for sound climate policy and global sustainable development.
ExioML supports graph and tabular structure learning algorithms through the Footprint Network and the Factor Accounting table. The dataset includes factors in both product-by-product (PxP) and industry-by-industry (IxI) classifications.
The Factor Accounting table shares common features with the Footprint Network and summarizes the total heterogeneous characteristics of various sectors.
The Footprint Network models the high-dimensional global trading network, capturing its economic, social, and environmental impacts. This network is structured as a directed graph, where directionality represents sectoral input-output relationships, delineating sectors by their roles as sources (exporting) and targets (importing). The basic element in the ExioML Footprint Network is international trade across different sectors with features such as value-added, emission amount, and energy input. The Footprint Network helps identify critical sectors and paths for sustainability management and optimization. The Footprint Network is hosted on Zenodo.
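As an illustration, once downloaded the Footprint Network could be loaded into a graph library such as NetworkX. The file name and column names below are assumptions about the released files, not their documented schema:

import networkx as nx
import pandas as pd

# Load one year of the footprint network (file and column names are assumptions)
edges = pd.read_csv("footprint_network_2021.csv")

# Build a directed graph: source = exporting sector, target = importing sector
G = nx.from_pandas_edgelist(
    edges,
    source="source_sector",
    target="target_sector",
    edge_attr=["value_added", "emission", "energy_input"],
    create_using=nx.DiGraph,
)

# Example: rank sectors by total emissions embodied in their exports
out_emissions = {
    node: sum(data["emission"] for _, _, data in G.out_edges(node, data=True))
    for node in G.nodes
}
print(sorted(out_emissions.items(), key=lambda kv: kv[1], reverse=True)[:5])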
The ExioML development toolkit in Python and the regression model used for validation are available on the GitHub repository: https://github.com/YVNMINC/ExioML. The complete ExioML dataset is hosted by Zenodo: https://zenodo.org/records/10604610.
More details about the dataset are available in our paper: ExioML: Eco-economic dataset for Machine Learning in Global Sectoral Sustainability, accepted by the ICLR 2024 Climate Change AI workshop: https://arxiv.org/abs/2406.09046.
@inproceedings{guo2024exioml,
title={ExioML: Eco-economic dataset for Machine Learning in Global Sectoral Sustainability},
author={Yanming, Guo and Jin, Ma},
booktitle={ICLR 2024 Workshop on Tackling Climate Change with Machine Learning},
year={2024}
}
Stadler, Konstantin, et al. "EXIOBASE 3." Zenodo, 2021.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains a Python script for classifying apple leaf diseases using a Vision Transformer (ViT) model. The dataset used is the Plant Village dataset, which contains images of apple leaves with four classes: Healthy, Apple Scab, Black Rot, and Cedar Apple Rust. The script includes data preprocessing, model training, and evaluation steps.
The script imports matplotlib, seaborn, numpy, pandas, tensorflow, and sklearn. These libraries are used for data visualization, data manipulation, and building/training the deep learning model.
A walk_through_dir function is used to explore the dataset directory structure and count the number of images in each class. The dataset is organized into Train, Val, and Test directories, each containing subdirectories for the four classes.
The script uses ImageDataGenerator from Keras to apply data augmentation techniques such as rotation, horizontal flipping, and rescaling to the training data. This helps in improving the model's generalization ability.
The model includes a Patches layer that extracts patches from the images. This is a crucial step in Vision Transformers, where images are divided into smaller patches that are then processed by the transformer. Evaluation results are plotted with seaborn to provide a clear understanding of the model's predictions.
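The augmentation step can be sketched roughly as follows; the parameter values and image size are assumptions, not those used in the actual script:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augment the training images: rescaling, rotation, and horizontal flips
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=20,       # illustrative value
    horizontal_flip=True,
)
val_datagen = ImageDataGenerator(rescale=1.0 / 255)

train_generator = train_datagen.flow_from_directory(
    "Train",                 # directory layout described above
    target_size=(224, 224),  # assumed input size
    batch_size=32,
    class_mode="categorical",
)
val_generator = val_datagen.flow_from_directory(
    "Val",
    target_size=(224, 224),
    batch_size=32,
    class_mode="categorical",
)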
To use the script:
Dataset Preparation: Organize the dataset into Train, Val, and Test directories, with each directory containing subdirectories for each class (Healthy, Apple Scab, Black Rot, Cedar Apple Rust).
Install Required Libraries: pip install tensorflow matplotlib seaborn numpy pandas scikit-learn
Run the Script
Analyze Results
Fine-Tuning