Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Author: Andrew J. Felton
Date: 10/29/2024
This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:
"Global estimates of the storage and transit time of water through vegetation"
Please note that 'turnover' and 'transit' are used interchangeably. Also note that this R project has been updated multiple times as the analysis has evolved.
Data information:
The data folder contains key data sets used for analysis. In particular:
"data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.
Code information:
Python scripts can be found in the "supporting_code" folder.
Each R script in this project has a role:
"01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).
"02_functions.R": This script contains custom functions. Load this using the
`source()` function in the 01_start.R script.
"03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
`source()` function in the 01_start.R script.
"04_figures_tables.R": This is the main workhouse for figure/table production and
supporting analyses. This script generates the key figures and summary statistics
used in the study that then get saved in the manuscript_figures folder. Note that all
maps were produced using Python code found in the "supporting_code"" folder.
"supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.
"supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:
🔓 First open data set with information on every active firm in Russia.
🗂️ First open financial statements data set that includes non-filing firms.
🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.
📅 Covers 2011-2023 initially, will be continuously updated.
🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.
The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in Apache Parquet, a structured, column-oriented, compressed binary format, with a yearly partitioning scheme, enabling end-users to query only the variables of interest at scale.
The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.
Here we present the instructions for importing the data in R or Python environment. Please consult with the project repository for more information: http://github.com/irlcode/RFSD.
Importing The Data
You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo, or rely on the 🤗 Hugging Face Datasets library.
Python
🤗 Hugging Face Datasets
It is as easy as:
from datasets import load_dataset
import polars as pl

RFSD = load_dataset('irlspbru/RFSD')
RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')
Please note that the data is not shuffled within each year, meaning that streaming the first n rows will not yield a random sample.
Local File Import
Importing in Python requires the pyarrow package to be installed.
import pyarrow.dataset as ds
import polars as pl

# Open the local copy of the RFSD as an Arrow dataset and inspect its schema
RFSD = ds.dataset("local/path/to/RFSD")
print(RFSD.schema)
# Load the entire dataset, a single year, or selected columns only
RFSD_full = pl.from_arrow(RFSD.to_table())
RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))
RFSD_2019_revenue = pl.from_arrow(
    RFSD.to_table(filter=ds.field('year') == 2019, columns=['inn', 'line_2110'])
)
# Rename columns to descriptive names using the supplied dictionary
renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
RFSD_full = RFSD_full.rename(
    {item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])}
)
R
Local File Import
Importing in R requires the arrow package to be installed.
library(arrow)
library(data.table)

# Open the local copy of the RFSD as an Arrow dataset and inspect its schema
RFSD <- open_dataset("local/path/to/RFSD")
schema(RFSD)
# Load the entire dataset into memory
scanner <- Scanner$create(RFSD)
RFSD_full <- as.data.table(scanner$ToTable())
# Load only the 2019 partition
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scanner <- scan_builder$Finish()
RFSD_2019 <- as.data.table(scanner$ToTable())
# Load only firm identifiers and revenue (line 2110) for 2019
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scan_builder$Project(cols = c("inn", "line_2110"))
scanner <- scan_builder$Finish()
RFSD_2019_revenue <- as.data.table(scanner$ToTable())
# Rename columns to descriptive names using the supplied dictionary
renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)
Use Cases
🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md
🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md
🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md
FAQ
Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?
To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.
What is the data period?
We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July 2025 (see Version and Update Policy below).
Why are there no data for firm X in year Y?
Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:
We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).
Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022, and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, Novatek only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.
A firm may have submitted its annual statement even though, according to the Uniform State Register of Legal Entities (EGRUL), it was not active in that year. We remove those filings.
Why is the geolocation of firm X incorrect?
We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to a house level in 2014 and 2021-2023, but only at street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, address of incorporation may not correspond with plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.
Why is the data for firm X different from https://bo.nalog.ru/?
Many firms submit correcting statements after the initial filing. While we downloaded the data well past the April 2024 deadline for 2023 filings, firms may have kept submitting correcting statements. We will capture them in future releases.
Why is the data for firm X unrealistic?
We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.
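For instance, a minimal way to exclude the flagged firms when working with the Parquet files in Python could look like the following sketch (it assumes the outlier variable is stored as a boolean column; adjust the filter if it is 0/1-coded):

import polars as pl

# Read one yearly partition and drop firms flagged as outliers
rfsd_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')
rfsd_2023_clean = rfsd_2023.filter(~pl.col('outlier'))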
Why is the data for groups of companies different from their IFRS statements?
We should stress that we provide unconsolidated financial statements filed according to Russian accounting standards, meaning that it would be wrong to infer the financials of corporate groups from these data. Gazprom, for instance, had over 800 affiliated entities, and studying this corporate group in its entirety requires more than the financials of the parent company.
Why is the data not in CSV?
The data is provided in the Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query the financials of companies of interest, keeping only the variables of interest in memory and greatly reducing the data footprint.
Version and Update Policy
Version (SemVer): 1.0.0.
We intend to update the RFSD annually as the data becomes available, in other words when most firms have had their statements filed with the Federal Tax Service. The official deadline for filing the previous year's statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. Nevertheless, there is an obvious trade-off between data completeness and timely version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all correcting filings. We plan to make a new version of the RFSD available by July.
Licence
Creative Commons License Attribution 4.0 International (CC BY 4.0).
Copyright © the respective contributors.
Citation
Please cite as:
@unpublished{bondarkov2025rfsd,
  title={{R}ussian {F}inancial {S}tatements {D}atabase},
  author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
  note={arXiv preprint arXiv:2501.05841},
  doi={https://doi.org/10.48550/arXiv.2501.05841},
  year={2025}
}
Acknowledgments and Contacts
Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru
Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,
The Salford extension for CKAN is designed to enhance CKAN's functionality for specific use cases, particularly the management and import of datasets relevant to Salford City Council. By incorporating custom configurations and an ETL script, this extension streamlines the process of integrating external data sources, especially from data.gov.uk, into a CKAN instance. It also provides a structured approach to configuring CKAN for specific data management needs.
Key Features:
- Custom Plugin Integration: Enables the addition of the 'salford' and 'esd' plugins to extend CKAN's core functionality, addressing specific data management requirements.
- Configurable Licenses Group URL: Allows administrators to specify a licenses group URL in the CKAN configuration, streamlining access to license information pertinent to the dataset.
- ETL Script for Data.gov.uk Import: Includes a Python script (etl.py) to import datasets specifically from the Salford City Council publisher on data.gov.uk.
- Non-UKLP Dataset Compatibility: The ETL script is designed to filter and import non-UKLP datasets, excluding INSPIRE datasets from the data.gov.uk import process at this time.
- Bower Component Installation: Simplifies asset management by providing instructions for installing Bower components.
Technical Integration: The Salford extension requires modifications to the CKAN configuration file (production.ini). Specifically, it involves adding salford and esd to the ckan.plugins setting, defining the licensesgroupurl, and potentially configuring other custom options. The ETL script leverages the CKAN API (via ckanapi) for data import; a rough sketch is shown below. Additionally, Bower components must be installed.
Benefits & Impact: Using the Salford CKAN extension, organizations can establish a more streamlined data ingestion process tailored to Salford City Council datasets, enhance data accessibility, improve asset management, and facilitate better data governance aligned with specific licensing requirements. By selectively importing datasets and offering custom plugin support, it caters to specialized data management needs.
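The sketch below illustrates how such an import could be scripted with ckanapi. The publisher slug, the UKLP flag check, and the field mapping are illustrative assumptions, not the extension's actual etl.py:

from ckanapi import RemoteCKAN

# Source and target CKAN instances (target URL and API key are placeholders)
source = RemoteCKAN('https://data.gov.uk')
target = RemoteCKAN('https://ckan.example.org', apikey='REPLACE_WITH_API_KEY')

# Fetch datasets published by Salford City Council (organization slug is an assumption)
result = source.action.package_search(fq='organization:salford-city-council', rows=100)

for dataset in result['results']:
    # Skip records flagged as UKLP/INSPIRE (the exact extra key used is an assumption)
    extras = {e.get('key'): e.get('value') for e in dataset.get('extras', [])}
    if extras.get('UKLP') == 'True':
        continue
    # Create a minimal copy of the dataset on the target instance
    target.action.package_create(
        name=dataset['name'],
        title=dataset['title'],
        notes=dataset.get('notes', ''),
        owner_org='salford-city-council',  # assumed organisation name on the target
    )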
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
F-DATA is a novel workload dataset containing data on around 24 million jobs executed on Supercomputer Fugaku over three years of public system usage (March 2021-April 2024). Each job record contains an extensive set of features, such as exit code, duration, power consumption, and performance metrics (e.g., #flops, memory bandwidth, operational intensity, and a memory-/compute-bound label), which allow for prediction of a multitude of job characteristics. The full list of features can be found in the file feature_list.csv.
The sensitive data appears in both anonymized and encoded versions. The encoding is based on a Natural Language Processing model and retains sensitive but useful job information for prediction purposes without violating data privacy. The scripts used to generate the dataset are available in the F-DATA GitHub repository, along with a series of plots and instructions on how to load the data.
F-DATA is composed of 38 files, with each YY_MM.parquet file containing the data of the jobs submitted in the month MM of the year YY.
The files of F-DATA are saved as .parquet files. It is possible to load them as dataframes using the pandas API, after installing pyarrow (pip install pyarrow). A single file can be read with the following Python instructions:
import pandas as pd
df = pd.read_parquet("21_01.parquet")
df.head()
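To work with a longer period, one straightforward approach (assuming the monthly YY_MM.parquet files sit in the current directory) is to read and concatenate several partitions:

import glob
import pandas as pd

# Read all monthly files for 2021 and concatenate them into one dataframe
files_2021 = sorted(glob.glob("21_*.parquet"))
df_2021 = pd.concat((pd.read_parquet(f) for f in files_2021), ignore_index=True)
print(df_2021.shape)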
The provided Python code is a comprehensive analysis of sales data for a business, involving the merging of monthly sales data, cleaning and augmenting the dataset, and performing various analytical tasks. Here's a breakdown of the code:
Data Preparation and Merging:
The code begins by importing the necessary libraries and filtering out warnings. It merges sales data from 12 months into a single file named "all_data.csv."
Data Cleaning:
Rows with NaN values are dropped, and any entries starting with 'Or' in the 'Order Date' column are removed. Columns like 'Quantity Ordered' and 'Price Each' are converted to numeric types for further analysis.
Data Augmentation:
Additional columns such as 'Month,' 'Sales,' and 'City' are added to the dataset. The 'City' column is derived from the 'Purchase Address' column.
Analysis:
Several analyses are conducted, answering questions such as: the best month for sales and total earnings; the city with the highest number of sales; the ideal time for advertisements based on the number of orders per hour; products that are often sold together; and the best-selling products and their correlation with price.
Visualization:
Bar charts and line plots are used for visualizing the analysis results, making it easier to interpret trends and patterns. Matplotlib is employed for creating visualizations.
Summary:
The code concludes with a comprehensive visualization that combines the quantity ordered and average price for each product, shedding light on product performance. This code is structured to offer insights into sales patterns, customer behavior, and product performance, providing valuable information for strategic decision-making in the business.
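A condensed sketch of this pipeline is given below. The folder layout is an assumption, and the column names are taken from the description above rather than from the actual script:

import glob
import pandas as pd

# Merge the 12 monthly CSV files into a single dataset
monthly_files = glob.glob("Sales_Data/*.csv")  # assumed folder layout
all_data = pd.concat((pd.read_csv(f) for f in monthly_files), ignore_index=True)
all_data.to_csv("all_data.csv", index=False)

# Clean: drop NaN rows and the repeated header rows starting with 'Or'
all_data = all_data.dropna()
all_data = all_data[~all_data["Order Date"].str.startswith("Or")]
all_data["Quantity Ordered"] = pd.to_numeric(all_data["Quantity Ordered"])
all_data["Price Each"] = pd.to_numeric(all_data["Price Each"])

# Augment: month, sales value, and city parsed from the purchase address
all_data["Month"] = pd.to_datetime(all_data["Order Date"]).dt.month
all_data["Sales"] = all_data["Quantity Ordered"] * all_data["Price Each"]
all_data["City"] = all_data["Purchase Address"].str.split(",").str[1].str.strip()

# Example analysis: best month for sales
print(all_data.groupby("Month")["Sales"].sum().sort_values(ascending=False).head(1))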
The MNIST database of handwritten digits.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('mnist', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/mnist-3.0.1.png
CC0 1.0 Universal (CC0 1.0)https://spdx.org/licenses/CC0-1.0
The HURRECON model estimates wind speed, wind direction, enhanced Fujita scale wind damage, and duration of EF0 to EF5 winds as a function of hurricane location and maximum sustained wind speed. Results may be generated for a single site or an entire region. Hurricane track and intensity data may be imported directly from the US National Hurricane Center's HURDAT2 database. HURRECON is available in R and Python. The R version is available on CRAN as HurreconR. The model is an updated version of the original HURRECON model written in Borland Pascal for use with Idrisi (see HF025). New features include support for: (1) estimating wind damage on the enhanced Fujita scale, (2) importing hurricane track and intensity data directly from HURDAT2, (3) creating a land-water file with user-selected geographic coordinates and spatial resolution, and (4) creating plots of site and regional results. The model equations for estimating wind speed and direction, including parameter values for inflow angle, friction factor, and wind gust factor (over land and water), are unchanged from the original HURRECON model. For more details and sample datasets, see the project website on GitHub (https://github.com/hurrecon-model).
The LDAP Authentication extension for CKAN provides a method for authenticating users against an LDAP (Lightweight Directory Access Protocol) server, enhancing security and simplifying user management. This extension allows CKAN to leverage existing LDAP infrastructure for user authentication, account information, and group management, allowing administrators to configure CKAN to authenticate via LDAP and use the credentials of LDAP users.
Key Features:
- LDAP Authentication: Enables CKAN to authenticate users against an LDAP server, using usernames and passwords stored within the LDAP directory.
- User Data Import: Imports user attributes, such as username, full name, email address, and description, from the LDAP directory into the CKAN user profile.
- Flexible LDAP Search: Supports matching against multiple LDAP fields (e.g., username, full name) using configurable search filters. An alternative search filter can be specified for cases where the primary filter returns no results, allowing for flexible matching.
- Combined Authentication: Allows combining LDAP authentication with basic CKAN authentication, providing a fallback mechanism for users not present in the LDAP directory.
- Automatic Organization Assignment: Automatically adds LDAP users to a specified organization within CKAN upon their first login, simplifying organizational role management. The role of the user within the organization can also be specified.
- Active Directory Support: Compatible with Active Directory, allowing seamless integration with existing Windows-based directory services.
- Profile Edit Restriction: Provides an option to prevent LDAP users from editing their profiles within CKAN, centralizing user data management in the LDAP directory.
- Password Reset Functionality: Allows LDAP users to reset their CKAN passwords (not their LDAP passwords), providing a way to recover access to their CKAN accounts. This functionality can be disabled, preventing user-initiated password resets for these accounts.
- Existing User Migration: Facilitates migration from CKAN authentication to LDAP authentication by mapping any existing CKAN user with the same username to the LDAP login user.
- Referral Ignore: Ignores any referral results to avoid queries returning more than one result.
- Debug/Trace Level Option: Sets the debug level of python-ldap and the python-ldap trace level to allow for debugging.
Technical Integration: The extension integrates with CKAN by adding 'ldap' to the list of plugins in the CKAN configuration file. It overrides the default CKAN login form, redirecting login requests to the LDAP authentication handler. Configuration options can be specified in the CKAN .ini config file, including the LDAP server URI, base DN, search filters, and attribute mappings.
Benefits & Impact: Implementing the LDAP Authentication extension simplifies user management by centralizing authentication within an LDAP directory, reducing administrative overhead. It enhances security by leveraging existing LDAP security policies and credentials. The extension streamlines user onboarding by automatically importing user data and assigning organizational roles, improving user experience and data consistency.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the root directory containing data files, bash scripts, and Python scripts to generate the
data for the tables and figures in my PhD thesis, titled "Nuclear Wavefunctions of Dispersion
Bound Systems: Endohedral Eigenstates of Endofullerenes". This thesis was submitted in September
2024, with corrections (no additional calculations) approved in December 2024. The electronic
structure data is provided raw, as outputs from FermiONs++ and FHI-aims. The machine-learned PESs
are constructed from Python scripts. These are then used to calculate the nuclear eigenstates, which
is achieved using a self-written library, "EPEE", available on GitLab at
https://gitlab.developers.cam.ac.uk/ksp31/epee.
Author: Kripa Panchagnula
Date: January 2025
To run the machine learning, nuclear diagonalisation, and plotting scripts the "thesis_calcs" branch
(commit SHA: 100d79600aae7668d4ceaeafc6274a89f019283c) or "main" branch (commit SHA:
4e4d677f609028710fbc8e4f48dc4895543340db) of EPEE is required alongside NumPy, SciPy, scikit-learn,
matplotlib and the "development" branch of QSym2 from https://qsym2.dev/. Any Python script importing
from src is referring to the EPEE library. Each Python script must be run from within its containing
directory.
The data is separated into the following folders:
- background/
This folder contains a Python script to generate figures for Chapters 1-3.
- He@C60/
This folder contains electronic structure data from FermiONs++ with Python scripts
to generate data for Chapter 4.
- X@C70/
This folder contains Python scripts to generate data for Chapter 5.
- Ne@C70/
This folder contains electronic structure data from FermiONs++ and FHI-aims with
Python scripts to generate data for Chapter 6.
- H2@C70/
This folder contains Python scripts to generate data for Chapter 7.
- peapods/
This folder contains Python scripts to generate data for Chapter 8.
Each folder contains its own README, with more details about its structure. File types include text files (.txt, .dat, .cube), scripts (.bash, .py) and NumPy compressed data files (.npz).
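For reference, the .npz files can be opened directly with NumPy; the file name below is purely a placeholder:

import numpy as np

# Load a compressed NumPy archive and list the arrays it contains
with np.load("example.npz") as data:
    print(data.files)              # names of the stored arrays
    first = data[data.files[0]]    # access one array by name
    print(first.shape)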
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
code.zip: Zip folder containing a folder titled "code" which holds:
csv file titled "MonitoredRainGardens.csv" containing the 14 monitored green infrastructure (GI) sites with their design and physiographic features;
csv file titled "storm_constants.csv" which contain the computed decay constants for every storm in every GI during the measurement period;
csv file titled "newGIsites_AllData.csv" which contain the other 130 GI sites in Detroit and their design and physiographic features;
csv file titled "Detroit_Data_MeanDesignFeatures.csv" which contain the design and physiographic features for all of Detroit;
Jupyter notebook titled "GI_GP_SensorPlacement.ipynb" which provides the code for training the GP models and displaying the sensor placement results;
a folder titled "MATLAB" which contains the following:
folder titled "SFO" which contains the SFO toolbox for the sensor placement work
file titled "sensor_placement.mlx" that contains the code for the sensor placement work
several .mat files created in Python for importing into Matlab for the sensor placement work: "constants_sigma.mat", "constants_coords.mat", "GInew_sigma.mat", "GInew_coords.mat", and "R1_sensor.mat" through "R6_sensor.mat"
several .mat files created in Matlab for importing into Python for visualizing the results: "MI_DETselectedGI.mat" and "DETselectedGI.mat"
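The .mat round trip between Python and MATLAB described above can be reproduced with SciPy. The array contents below are placeholders; only the file names follow the listing above:

import numpy as np
from scipy.io import loadmat, savemat

# Save arrays from Python for use in the MATLAB sensor placement scripts
constants_sigma = np.random.rand(14, 14)   # placeholder kernel matrix
constants_coords = np.random.rand(14, 2)   # placeholder site coordinates
savemat("constants_sigma.mat", {"constants_sigma": constants_sigma})
savemat("constants_coords.mat", {"constants_coords": constants_coords})

# Load the MATLAB results back into Python for visualizing the results
selected = loadmat("DETselectedGI.mat")
print(selected.keys())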
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
EUROPEAN ELECTRICITY DEMAND TIME SERIES (2017)
DATASET DESCRIPTION
This dataset contains the aggregated electricity demand time series for Europe in 2017, measured in megawatts (MW). The data is presented with a timestamp in Coordinated Universal Time (UTC) and the corresponding electricity demand in MW.
DATA SOURCE
The original data was obtained from the European Network of Transmission System Operators for Electricity (ENTSO-E) and can be accessed at: https://www.entsoe.eu/data/power-stats/ Specifically, the data was extracted from the "MHLV_data-2015-2019.xlsx" file, which provides aggregated hourly electricity load data by country for the years 2015 to 2019.
DATA PROCESSING
The dataset was created using the following steps:
- Importing Data: The original Excel file was imported into a Python environment using the pandas library. The data was checked for completeness to ensure no missing or corrupted entries.
- Time Conversion: All timestamps in the dataset were converted to Coordinated Universal Time (UTC) to standardize the time reference across the dataset.
- Aggregation: The data for the year 2017 was extracted from the dataset. The hourly electricity load for all European countries (defined as per Yu et al. (2019), see below) was summed to generate an aggregated time series representing the total electricity demand across Europe.
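A minimal sketch of these steps with pandas is shown below; the column names and the abbreviated country list are assumptions, since the original processing script is not part of this record:

import pandas as pd

# Hypothetical subset of European countries (the full list follows Yu et al. (2019))
EUROPEAN_COUNTRIES = ["DE", "FR", "IT", "ES", "PL"]

# Import the ENTSO-E hourly load workbook (column names are assumptions)
load = pd.read_excel("MHLV_data-2015-2019.xlsx")

# Keep 2017, convert timestamps to UTC, and sum across the selected countries
load["Timestamp (UTC)"] = pd.to_datetime(load["DateUTC"], utc=True)
load_2017 = load[(load["Timestamp (UTC)"].dt.year == 2017)
                 & (load["country"].isin(EUROPEAN_COUNTRIES))]
demand = (load_2017.groupby("Timestamp (UTC)")["load_MW"].sum()
          .rename("Electricity Demand (MW)"))
demand.to_csv("european_demand_2017.csv")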
FILE FORMAT
- CSV Format: The dataset is stored in a CSV file with two columns:
  - Timestamp (UTC): The time at which the electricity demand was recorded.
  - Electricity Demand (MW): The total aggregated electricity demand for Europe in megawatts.
USAGE NOTES
- Temporal Coverage: The dataset covers the entire year of 2017 with hourly granularity.
- Geographical Coverage: The dataset aggregates data from multiple European countries, following the definition of Europe as per Yu et al. (2019).
REFERENCES
J. Yu, K. Bakic, A. Kumar, A. Iliceto, L. Beleke Tabu, J. Ruaud, J. Fan, B. Cova, H. Li, D. Ernst, R. Fonteneau, M. Theku, G. Sanchis, M. Chamollet, M. Le Du, Y. Zhang, S. Chatzivasileiadis, D.-C. Radu, M. Berger, M. Stabile, F. Heymann, M. Dupré La Tour, M. Manuel de Villena Millan, and M. Ranjbar. Global electricity network - feasibility study. Technical report, CIGRE, 2019. URL https://hdl.handle.net/2268/239969. Accessed July 2024.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset containing measurements of Linux kernel binary size after compilation. The reported size, in the column "perf", is the size in bytes of the vmlinux file. It also contains a column "active_options" reporting the number of activated options (set to "y"). All other columns, listed in the file "Linux_options.json", are Linux kernel options. The sampling was done using randconfig. The version of Linux used is 4.13.3.
Not all available options are present. First, the dataset only contains options for the x86, 64-bit version. Then, all non-tristate options have been ignored. Finally, options that do not take multiple values across the whole dataset, due to insufficient variability in the sampling, are ignored. All options are encoded as 0 for the "n" and "m" option values, and 1 for "y".
In Python, importing the dataset with pandas assigns all columns the int64 dtype, which leads to very high memory consumption (~50 GB). The snippet below imports it using less than 1 GB of memory by setting the option columns to int8.
import json
import numpy
import pandas as pd

# Load the list of Linux kernel option column names
with open("Linux_options.json", "r") as f:
    linux_options = json.load(f)

# Read the dataset, forcing the option columns to int8 to keep memory usage low
df = pd.read_csv("Linux.csv", dtype={option: numpy.int8 for option in linux_options})
This dataset consists of basic statistics and career statistics provided by the NFL on their official website (http://www.nfl.com) for all players, active and retired.
All of the data was web scraped using Python code, which can be found and downloaded here: https://github.com/ytrevor81/NFL-Stats-Web-Scrape
Before we go into the specifics, it's important to note that in the basic statistics and career statistics CSV files, all players are assigned a 'Player_Id'. This is the same ID used by the official NFL website to identify each player. This is useful, for example, when importing these CSV files into a SQL database for an app.
The data pulled for each player in Active_Player_Basic_Stats.csv is as follows:
a. Player ID
b. Full Name
c. Position
d. Number
e. Current Team
f. Height
g. Height
h. Weight
i. Experience
j. Age
k. College
The data pulled for each player in Retired_Player_Basic_Stats.csv differs slightly from the previous data set. The data is as follows:
a. Player ID
b. Full Name
c. Position
f. Height
g. Height
h. Weight
j. College
k. Hall of Fame Status
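As an example, a hedged sketch of loading these CSV files into a SQLite database keyed on Player_Id might look like this (table names are illustrative):

import sqlite3
import pandas as pd

# Read the scraped CSV files (file names follow the description above)
active = pd.read_csv("Active_Player_Basic_Stats.csv")
retired = pd.read_csv("Retired_Player_Basic_Stats.csv")

# Write them to a SQLite database; Player_Id is the NFL-assigned identifier
with sqlite3.connect("nfl_stats.db") as conn:
    active.to_sql("active_players", conn, if_exists="replace", index=False)
    retired.to_sql("retired_players", conn, if_exists="replace", index=False)
    sample = pd.read_sql("SELECT * FROM active_players LIMIT 5", conn)
print(sample)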
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was originally curated by Software Carpentry, a branch of The Carpentries non-profit organization, and is based on data from the Gapminder Foundation. It consists of six tabular CSV files containing GDP data for various countries across different years. The dataset was initially prepared for the Software Carpentry tutorial "Plotting and Programming in Python" and is also reused in the Galaxy Training Network (GTN) tutorial "Use Jupyter Notebooks in Galaxy."
This GTN tutorial provides an introduction to launching a Jupyter Notebook in Galaxy, installing dependencies, and importing and exporting data. It serves as a setup guide for a Jupyter Notebook environment that can be used to follow the Software Carpentry tutorial "Plotting and Programming in Python."
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This layer group describes multibeam echosounder data collected on RV Investigator voyage IN2020_E01 titled "Trials and Calibration". The voyage took place between July 29 and August 6, 2020, departing from Hobart (TAS) and arriving in Hobart (TAS).
The purpose of this voyage was to undertake post port-period equipment calibrations and commissioning, sea trials as well as personnel training.
This dataset is published with the permission of CSIRO. Not to be used for navigational purposes.
The dataset contains bathymetry grids of 10m to 210m resolution of the Tasmanian Coast produced from the processed EM122 and EM710 bathymetry data. Lineage: Multibeam data was logged from the EMs in Kongsberg's proprietary *.all format and was converted for processing within CARIS HIPS and SIPS version 10.4. Initial data conversion and processing was performed using the GSM Python batch utility, with the manual method used for importing patch test data. Once the raw files were converted into the HIPS and SIPS format, the data was analysed for noise. With the exception of EM710 reference surface lines (files 0083, 0085, 0087, 0090, 0092, 0096, 0098, 0100 & 0103) and EM710 patch test lines (files 0031, 0033, 0035, 0037, 0040, 0042, 0044), which had GPS tide applied, no tide was applied to the remaining lines. All lines were merged using the vessel file appropriate for either the EM122 or EM710. Because the angular offsets were zeroed in SIS prior to the EM710 and EM122 patch tests, the vessel files for each were edited to apply the calibration values to those lines.
The data was then gridded at the highest resolution possible and further inspected for outliers.
The data was then gridded at multiple resolutions using a Python CARIS batch script, applying a depth-versus-resolution guideline derived from the AusSeabed Multibeam Guidelines v2, and further inspected for outliers. Final raster products are available in the L3 folder of this collection. Final processed data were also exported per line in GSF and ASCII formats and are available in the L2 folder of this collection.
40,000 lines of Shakespeare from a variety of Shakespeare's plays. Featured in Andrej Karpathy's blog post 'The Unreasonable Effectiveness of Recurrent Neural Networks': http://karpathy.github.io/2015/05/21/rnn-effectiveness/.
To use for e.g. character modelling:
d = tfds.load(name='tiny_shakespeare')['train']
d = d.map(lambda x: tf.strings.unicode_split(x['text'], 'UTF-8'))
# train split includes vocabulary for other splits
vocabulary = sorted(set(next(iter(d)).numpy()))
d = d.map(lambda x: {'cur_char': x[:-1], 'next_char': x[1:]})
d = d.unbatch()
seq_len = 100
batch_size = 2
d = d.batch(seq_len)
d = d.batch(batch_size)
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('tiny_shakespeare', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Source code, documentation, and examples of use of the source code for the Dioptra Test Platform.
Dioptra is a software test platform for assessing the trustworthy characteristics of artificial intelligence (AI). Trustworthy AI is: valid and reliable, safe, secure and resilient, accountable and transparent, explainable and interpretable, privacy-enhanced, and fair, with harmful bias managed. Dioptra supports the Measure function of the NIST AI Risk Management Framework by providing functionality to assess, analyze, and track identified AI risks.
Dioptra provides a REST API, which can be controlled via an intuitive web interface, a Python client, or any REST client library of the user's choice for designing, managing, executing, and tracking experiments. Details are available in the project documentation at https://pages.nist.gov/dioptra/.
Use Cases
We envision the following primary use cases for Dioptra:
- Model Testing:
  -- 1st party: Assess AI models throughout the development lifecycle
  -- 2nd party: Assess AI models during acquisition or in an evaluation lab environment
  -- 3rd party: Assess AI models during auditing or compliance activities
- Research: Aid trustworthy AI researchers in tracking experiments
- Evaluations and Challenges: Provide a common platform and resources for participants
- Red-Teaming: Expose models and resources to a red team in a controlled environment
Key Properties
Dioptra strives for the following key properties:
- Reproducible: Dioptra automatically creates snapshots of resources so experiments can be reproduced and validated
- Traceable: The full history of experiments and their inputs are tracked
- Extensible: Support for expanding functionality and importing existing Python packages via a plugin system
- Interoperable: A type system promotes interoperability between plugins
- Modular: New experiments can be composed from modular components in a simple yaml file
- Secure: Dioptra provides user authentication, with access controls coming soon
- Interactive: Users can interact with Dioptra via an intuitive web interface
- Shareable and Reusable: Dioptra can be deployed in a multi-tenant environment so users can share and reuse components
The Institute for the Design of Advanced Energy Systems (IDAES) Integrated Platform is a versatile computational environment offering extensive process systems engineering (PSE) capabilities for optimizing the design and operation of complex, interacting technologies and systems. IDAES enables users to efficiently search vast, complex design spaces to discover the lowest cost solutions while supporting the full process modeling lifecycle, from conceptual design to dynamic optimization and control. The extensible, open platform empowers users to create models of novel processes and rapidly develop custom analyses, workflows, and end-user applications.
IDAES-PSE 2.6.0 Release Highlights
Upcoming Changes
IDAES will be switching to the new Pyomo solver interface in the next release. Whilst this will hopefully be a smooth transition for most users, there are a few important changes to be aware of. The new solver interface uses a different version of the IPOPT writer ("ipopt_v2"), and thus any custom configuration options you might have set for IPOPT will not carry over and will need to be reset. By default, the new Pyomo linear presolver will be activated with ipopt_v2. Whilst we are working to identify any bugs in the presolver, it is possible that some edge cases will remain. IDAES will begin deploying a new set of scaling tools and APIs over the next few releases that make use of the new solver writers. The old scaling tools and APIs will remain for backward compatibility but will begin to be deprecated.
New Models, Tools and Features
- New Intersphinx extension automatically linking Jupyter notebook examples to the project documentation
- New end-to-end diagnostics example demonstrated on a real problem
- New complementarity formulation for VLE with cubic equations of state, with backward compatibility for the old formulation
- New solver interface with presolve (ipopt_v2) in support of upcoming changes to the initialization methods and APIs, with the default set to ipopt to maintain backwards compatibility; this will be deprecated once all examples have been updated
- New forecaster and parameterized bidder methods within the grid integration library
- Updated surrogates API and examples to support Keras 3, with backwards compatibility for older formats such as TensorFlow SavedModel (TFSM)
- Updated costing base dictionary to include the 2023 cost year index value
- Updated ProcessBlock to include information on the constructing block class
- Updated Flowsheet Visualizer to allow the visualize() method to return values and functions
Bug Fixes
- Fixed a bug in the Modular Property Framework that would cause errors when trying to use phase-based material balances with phase equilibria
- Fixed a bug in the Modular Properties Framework that caused errors when initializing models with non-vapor-liquid phase equilibria
- Fixed typos flagged by the June update to crate-ci/typos and removed DMF-related exceptions
- Minor corrections to units-of-measurement handling in power plant waste/transport costing expressions, control volume material holdup expressions, and BTX property package parameters
- Fixed the emission of >7500 numpy deprecation warnings by replacing scalar value assignment with element extraction and item iteration calls
Testing and Robustness
- Migrated slow tests (>10s) to integration, impacting test coverage but also yielding a nearly 30% decrease in local test runtime
- Pinned pint to avoid issues with older supported Python versions
- Pinned codecov versions to avoid tokenless upload behavior with the latest version
- Bumped extensions to version 3.4.2 to allow pointing to a non-standard install location
Deprecations and Removals
- Python 3.8 is no longer supported. The supported Python versions are 3.9 through 3.12
- The Data Management Framework (DMF) is no longer supported. Importing idaes.core.dmf will cause a deprecation warning to be displayed until the next release
- The SOFC Keras surrogates have been removed. The current version of the SOFC surrogate model in the examples repository is a PySMO Kriging model.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ExioML is the first ML-ready benchmark dataset in eco-economic research, designed for global sectoral sustainability analysis. It addresses significant research gaps by leveraging the high-quality, open-source EE-MRIO dataset ExioBase 3.8.2. ExioML covers 163 sectors across 49 regions from 1995 to 2022, overcoming data inaccessibility issues. The dataset includes both factor accounting in tabular format and footprint networks in graph structure.
We demonstrate a GHG emission regression task using a factor accounting table, comparing the performance of shallow and deep models. The results show a low Mean Squared Error (MSE), quantifying sectoral GHG emissions in terms of value-added, employment, and energy consumption, validating the dataset's usability. The footprint network in ExioML, inherent in the multi-dimensional MRIO framework, enables tracking resource flow between international sectors.
ExioML offers promising research opportunities, such as predicting embodied emissions through international trade, estimating regional sustainability transitions, and analyzing the topological changes in global trading networks over time. It reduces barriers and intensive data pre-processing for ML researchers, facilitates the integration of ML and eco-economic research, and provides new perspectives for sound climate policy and global sustainable development.
ExioML supports graph and tabular structure learning algorithms through the Footprint Network and the Factor Accounting table. The dataset includes factors in both product-by-product (PxP) and industry-by-industry (IxI) classifications.
The Factor Accounting table shares common features with the Footprint Network and summarizes the total heterogeneous characteristics of various sectors.
The Footprint Network models the high-dimensional global trading network, capturing its economic, social, and environmental impacts. This network is structured as a directed graph, where directionality represents sectoral input-output relationships, delineating sectors by their roles as sources (exporting) and targets (importing). The basic element in the ExioML Footprint Network is international trade across different sectors with features such as value-added, emission amount, and energy input. The Footprint Network helps identify critical sectors and paths for sustainability management and optimization. The Footprint Network is hosted on Zenodo.
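As an illustration, once downloaded the Footprint Network could be loaded into a graph library such as NetworkX. The file name and column names below are assumptions about the released files, not their documented schema:

import networkx as nx
import pandas as pd

# Load one year of the footprint network (file and column names are assumptions)
edges = pd.read_csv("footprint_network_2021.csv")

# Build a directed graph: source = exporting sector, target = importing sector
G = nx.from_pandas_edgelist(
    edges,
    source="source_sector",
    target="target_sector",
    edge_attr=["value_added", "emission", "energy_input"],
    create_using=nx.DiGraph,
)

# Example: rank sectors by total emissions embodied in their exports
out_emissions = {
    node: sum(data["emission"] for _, _, data in G.out_edges(node, data=True))
    for node in G.nodes
}
print(sorted(out_emissions.items(), key=lambda kv: kv[1], reverse=True)[:5])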
The ExioML development toolkit in Python and the regression model used for validation are available on the GitHub repository: https://github.com/YVNMINC/ExioML. The complete ExioML dataset is hosted by Zenodo: https://zenodo.org/records/10604610.
More details about the dataset are available in our paper: ExioML: Eco-economic dataset for Machine Learning in Global Sectoral Sustainability, accepted by the ICLR 2024 Climate Change AI workshop: https://arxiv.org/abs/2406.09046.
@inproceedings{guo2024exioml,
title={ExioML: Eco-economic dataset for Machine Learning in Global Sectoral Sustainability},
author={Yanming, Guo and Jin, Ma},
booktitle={ICLR 2024 Workshop on Tackling Climate Change with Machine Learning},
year={2024}
}
Stadler, Konstantin, et al. "EXIOBASE 3." Zenodo, 2021.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains a Python script for classifying apple leaf diseases using a Vision Transformer (ViT) model. The dataset used is the Plant Village dataset, which contains images of apple leaves with four classes: Healthy, Apple Scab, Black Rot, and Cedar Apple Rust. The script includes data preprocessing, model training, and evaluation steps.
The script imports matplotlib, seaborn, numpy, pandas, tensorflow, and sklearn. These libraries are used for data visualization, data manipulation, and building/training the deep learning model.
A walk_through_dir function is used to explore the dataset directory structure and count the number of images in each class. The dataset is organized into Train, Val, and Test directories, each containing subdirectories for the four classes.
The script uses ImageDataGenerator from Keras to apply data augmentation techniques such as rotation, horizontal flipping, and rescaling to the training data. This helps in improving the model's generalization ability.
The model includes a Patches layer that extracts patches from the images. This is a crucial step in Vision Transformers, where images are divided into smaller patches that are then processed by the transformer. Evaluation results are plotted with seaborn to provide a clear understanding of the model's predictions.
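The augmentation step can be sketched roughly as follows; the parameter values and image size are assumptions, not those used in the actual script:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augment the training images: rescaling, rotation, and horizontal flips
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=20,       # illustrative value
    horizontal_flip=True,
)
val_datagen = ImageDataGenerator(rescale=1.0 / 255)

train_generator = train_datagen.flow_from_directory(
    "Train",                 # directory layout described above
    target_size=(224, 224),  # assumed input size
    batch_size=32,
    class_mode="categorical",
)
val_generator = val_datagen.flow_from_directory(
    "Val",
    target_size=(224, 224),
    batch_size=32,
    class_mode="categorical",
)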
To use the script:
Dataset Preparation: Organize the dataset into Train, Val, and Test directories, with each directory containing subdirectories for each class (Healthy, Apple Scab, Black Rot, Cedar Apple Rust).
Install Required Libraries: pip install tensorflow matplotlib seaborn numpy pandas scikit-learn
Run the Script
Analyze Results
Fine-Tuning