License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Author: Andrew J. Felton
Date: 5/5/2024
This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:
"Global estimates of the storage and transit time of water through vegetation"
Please note that 'turnover' and 'transit' are used interchangeably in this project.
Data information:
The data folder contains key data sets used for analysis. In particular:
"data/turnover_from_python/updated/annual/multi_year_average/average_annual_turnover.nc" contains a global array summarizing five-year (2016-2020) averages of annual transit, storage, canopy transpiration, and number of months of data. This is the core dataset for the analysis; however, each folder contains much more data, including a dataset for each year of the analysis. Data are also available as separate .csv files for each land cover type. Other data for the minimum, monthly, and seasonal transit times can be found in their respective folders. These data were produced using the Python code found in the "supporting_code" folder, given the ease of working with .nc files and the EASE grid in the xarray Python module. R was used primarily for data visualization purposes. The remaining files in the "data" and "data/supporting_data" folders primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but they have been extensively processed and filtered here.
Python scripts can be found in the "supporting_code" folder.
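For a quick look at the core NetCDF file outside of that workflow, a minimal xarray sketch is shown below; it assumes only the file path given above, and the commented variable name is a placeholder rather than a name confirmed by this project.

import xarray as xr

# Open the multi-year average turnover file described above.
ds = xr.open_dataset(
    "data/turnover_from_python/updated/annual/multi_year_average/average_annual_turnover.nc"
)

# List the variables actually stored in the file before selecting any of them.
print(ds.data_vars)

# Example: pull one variable into memory (hypothetical variable name).
# turnover = ds["annual_turnover"].values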
Each R script in this project has a particular function:
01_start.R: This script loads the R packages used in the analysis, sets the directory, and imports custom functions for the project. You can also load in the main transit time (turnover) datasets here using the source() function.
02_functions.R: This script contains the custom function for this analysis, primarily to work with importing the seasonal transit data. Load this using the source() function in the 01_start.R script.
03_generate_data.R: This script is not necessary to run and is primarily for documentation. The main role of this code was to import and wrangle the data needed to calculate ground-based estimates of aboveground water storage.
04_annual_turnover_storage_import.R: This script imports the annual turnover and storage data for each landcover type. You load in these data from the 01_start.R script using the source() function.
05_minimum_turnover_storage_import.R: This script imports the minimum turnover and storage data for each landcover type. Minimum is defined as the lowest monthly estimate. You load in these data from the 01_start.R script using the source() function.
06_figures_tables.R: This is the main workhorse for figure/table production and supporting analyses. This script generates the key figures and summary statistics used in the study, which are then saved in the manuscript_figures folder. Note that all maps were produced using Python code found in the "supporting_code" folder.
License: MIT License, https://opensource.org/licenses/MIT
This dataset provides the raw data associated with the NCBI GEO accession number GSE183947. The underlying data is an RNA-Sequencing (RNA-Seq) expression matrix, derived from matched normal and malignant breast cancer tissue samples. The primary goal of this resource is to teach the complete workflow of:
- Downloading and importing high-throughput genomics data from public repositories.
- Cleaning and normalizing the raw expression values (e.g., FPKM/TPM).
- Preparing the data structure for downstream Differential Gene Expression (DEG) analysis.
This resource is essential for anyone practicing translational bioinformatics and cancer research.
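A minimal pandas sketch of the importing and reshaping steps might look like the following; the file name GSE183947_fpkm.csv and the column layout are assumptions for illustration, not part of this record.

import numpy as np
import pandas as pd

# Hypothetical supplementary file exported from GEO; adjust the name to the actual download.
fpkm = pd.read_csv("GSE183947_fpkm.csv", index_col=0)

# Log-transform FPKM values to stabilize variance before exploratory analysis.
log_fpkm = np.log2(fpkm + 1)

# Reshape from a genes-by-samples matrix to long format for downstream DEG tooling.
long_df = (
    log_fpkm
    .rename_axis("gene")
    .reset_index()
    .melt(id_vars="gene", var_name="sample", value_name="log2_fpkm")
)
print(long_df.head())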
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:
🔓 First open data set with information on every active firm in Russia.
🗂️ First open financial statements data set that includes non-filing firms.
🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.
📅 Covers 2011-2023 initially, will be continuously updated.
🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.
The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in Apache Parquet, a structured, column-oriented, compressed binary format, with a yearly partitioning scheme, enabling end-users to query only variables of interest at scale.
The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.
Here we present the instructions for importing the data in R or Python environment. Please consult with the project repository for more information: http://github.com/irlcode/RFSD.
Importing The Data
You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo, or rely on the 🤗 Hugging Face Datasets library.
Python
🤗 Hugging Face Datasets
It is as easy as:
from datasets import load_dataset
import polars as pl

# Load the entire dataset through the Hugging Face Datasets library
RFSD = load_dataset('irlspbru/RFSD')

# Or read a single year directly from the Hugging Face Hub with polars
RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')
Please note that the data is not shuffled within year, meaning that streaming the first n rows will not yield a random sample.
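If a random sample is needed without downloading the full dataset, one possible approach (an assumption about usage, not part of the official instructions; the 'train' split name is also assumed) is a buffered shuffle over the streamed dataset:

from datasets import load_dataset

# Stream the dataset instead of materializing it on disk.
rfsd_stream = load_dataset('irlspbru/RFSD', split='train', streaming=True)

# A buffered shuffle, so that taking the first n rows no longer
# reproduces the on-disk order.
shuffled = rfsd_stream.shuffle(seed=42, buffer_size=10_000)
sample = list(shuffled.take(1000))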
Local File Import
Importing in Python requires the pyarrow package to be installed.
import pyarrow.dataset as ds
import polars as pl

# Point an Arrow dataset at the local copy of the RFSD
RFSD = ds.dataset("local/path/to/RFSD")

# Inspect the schema
print(RFSD.schema)

# Load the full dataset into memory
RFSD_full = pl.from_arrow(RFSD.to_table())

# Load only the 2019 data
RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

# Load only the firm identifier and line 2110 for 2019
RFSD_2019_revenue = pl.from_arrow(
    RFSD.to_table(
        filter=ds.field('year') == 2019,
        columns=['inn', 'line_2110']
    )
)

# Apply descriptive column names from the supplied dictionary
renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})
R
Local File Import
Importing in R requires the arrow package to be installed.
library(arrow)
library(data.table)

# Open the local copy of the RFSD as an Arrow dataset
RFSD <- open_dataset("local/path/to/RFSD")

# Inspect the schema
schema(RFSD)

# Load the full dataset into memory
scanner <- Scanner$create(RFSD)
RFSD_full <- as.data.table(scanner$ToTable())

# Load only the 2019 data
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scanner <- scan_builder$Finish()
RFSD_2019 <- as.data.table(scanner$ToTable())

# Load only the firm identifier and line 2110 for 2019
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scan_builder$Project(cols = c("inn", "line_2110"))
scanner <- scan_builder$Finish()
RFSD_2019_revenue <- as.data.table(scanner$ToTable())

# Apply descriptive column names from the supplied dictionary
renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)
Use Cases
🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md
🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md
🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md
FAQ
Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?
To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.
What is the data period?
We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July, 2025 (see Version and Update Policy below).
Why are there no data for firm X in year Y?
Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:
We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).
Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, Novatek — in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.
A firm may have submitted its annual statement even though, according to the Uniform State Register of Legal Entities (EGRUL), it was not active in that year. We remove those filings.
Why is the geolocation of firm X incorrect?
We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to a house level in 2014 and 2021-2023, but only at street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, address of incorporation may not correspond with plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.
Why is the data for firm X different from https://bo.nalog.ru/?
Many firms submit correcting statements after the initial filing. While we downloaded the data well past the April 2024 deadline for 2023 filings, firms may have kept submitting correcting statements. We will capture them in future releases.
Why is the data for firm X unrealistic?
We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.
Why is the data for groups of companies different from their IFRS statements?
We should stress that we provide unconsolidated financial statements filed according to the Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups with this data. Gazprom, for instance, had over 800 affiliated entities and to study this corporate group in its entirety it is not enough to consider financials of the parent company.
Why is the data not in CSV?
The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.
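As an illustration of that conditional subsetting, here is a sketch using pandas on a local copy of the data; the pandas route and the chosen columns are assumptions for illustration, not official instructions.

import pandas as pd

# Read only two columns and only the 2023 partition, instead of the whole database.
df = pd.read_parquet(
    "local/path/to/RFSD",
    columns=["inn", "line_2110"],
    filters=[("year", "=", 2023)],
)
print(df.head())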
Version and Update Policy
Version (SemVer): 1.0.0.
We intend to update the RFSD annually as the data becomes available, in other words when most of the firms have their statements filed with the Federal Tax Service. The official deadline for filing the previous year's statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. Nevertheless, there is obviously a trade-off between data completeness and version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all the correcting filings. We plan to make a new version of the RFSD available by July.
Licence
Creative Commons License Attribution 4.0 International (CC BY 4.0).
Copyright © the respective contributors.
Citation
Please cite as:
@unpublished{bondarkov2025rfsd,
  title={{R}ussian {F}inancial {S}tatements {D}atabase},
  author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
  note={arXiv preprint arXiv:2501.05841},
  doi={https://doi.org/10.48550/arXiv.2501.05841},
  year={2025}
}
Acknowledgments and Contacts
Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru
Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,
The Salford extension for CKAN is designed to enhance CKAN's functionality for specific use cases, particularly involving the management and import of datasets relevant to the Salford City Council. By incorporating custom configurations and an ETL script, this extension streamlines the process of integrating external data sources, especially from data.gov.uk, into a CKAN instance. It also provides a structured approach to configuring CKAN for specific data management needs.
Key Features:
- Custom Plugin Integration: Enables the addition of 'salford' and 'esd' plugins to extend CKAN's core functionality, addressing specific data management requirements.
- Configurable Licenses Group URL: Allows administrators to specify a licenses group URL in the CKAN configuration, streamlining access to license information pertinent to the dataset.
- ETL Script for Data.gov.uk Import: Includes a Python script (etl.py) to import datasets specifically from the Salford City Council publisher on data.gov.uk.
- Non-UKLP Dataset Compatibility: The ETL script is designed to filter and import non-UKLP datasets, excluding INSPIRE datasets from the data.gov.uk import process at this time.
- Bower Component Installation: Simplifies asset management by providing instructions for installing bower components.
Technical Integration: The Salford extension requires modifications to the CKAN configuration file (production.ini). Specifically, it involves adding salford and esd to the ckan.plugins setting, defining the licenses_group_url, and potentially configuring other custom options. The ETL script leverages the CKAN API (ckanapi) for data import. Additionally, Bower components must be installed.
Benefits & Impact: Using the Salford CKAN extension, organizations can establish a more streamlined data ingestion process tailored to Salford City Council datasets, enhance data accessibility, improve asset management, and facilitate better data governance aligned with specific licensing requirements. By selectively importing datasets and offering custom plugin support, it caters to specialized data management needs.
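As a rough sketch of what an ETL pull from data.gov.uk via ckanapi can look like (the publisher slug and the UKLP extras key below are illustrative assumptions, not a reproduction of etl.py):

from ckanapi import RemoteCKAN

source = RemoteCKAN("https://data.gov.uk")

# Search for datasets from the Salford City Council publisher (hypothetical slug).
results = source.action.package_search(
    fq='organization:"salford-city-council"',
    rows=100,
)

for pkg in results["results"]:
    extras = {e["key"]: e["value"] for e in pkg.get("extras", [])}
    # Skip UKLP/INSPIRE records (assumed flag name), keeping only non-UKLP datasets.
    if extras.get("UKLP") == "True":
        continue
    print(pkg["name"])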
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
By Vezora (From Huggingface) [source]
The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, specifically designed for training and analysis purposes. With 188,000 samples, this dataset offers an extensive range of examples that cater to the research needs of Python programming enthusiasts.
This valuable resource consists of various columns, including input, which represents the input or parameters required for executing the Python code sample. The instruction column describes the task or objective that the Python code sample aims to solve. Additionally, there is an output column that showcases the resulting output generated by running the respective Python code.
By utilizing this dataset, researchers can effectively study and analyze real-world scenarios and applications of Python programming. Whether for educational purposes or development projects, this dataset serves as a reliable reference for individuals seeking practical examples and solutions using Python
The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, containing 188,000 samples in total. This dataset can be a valuable resource for researchers and programmers interested in exploring various aspects of Python programming.
Contents of the Dataset
The dataset consists of several columns:
- output: This column represents the expected output or result that is obtained when executing the corresponding Python code sample.
- instruction: It provides information about the task or instruction that each Python code sample is intended to solve.
- input: The input parameters or values required to execute each Python code sample.
Exploring the Dataset
To make effective use of this dataset, it is essential to understand its structure and content properly. Here are some steps you can follow:
- Importing Data: Load the dataset into your preferred environment for data analysis using appropriate tools like pandas in Python.
import pandas as pd

# Load the dataset
df = pd.read_csv('train.csv')
- Understanding Column Names: Familiarize yourself with the column names and their meanings by referring to the provided description.
# Display column names
print(df.columns)
- Sample Exploration: Get an initial understanding of the data structure by examining a few random samples from different columns.
# Display random samples from the 'output' column
print(df['output'].sample(5))
- Analyzing Instructions: Analyze different instructions or tasks present in the 'instruction' column to identify specific areas you are interested in studying or learning about.
# Count unique instructions and display the ones with the highest occurrences
instruction_counts = df['instruction'].value_counts()
print(instruction_counts.head(10))

Potential Use Cases
The Vezora/Tested-188k-Python-Alpaca dataset can be utilized in various ways:
- Code Analysis: Analyze the code samples to understand common programming patterns and best practices.
- Code Debugging: Use code samples with known outputs to test and debug your own Python programs.
- Educational Purposes: Utilize the dataset as a teaching tool for Python programming classes or tutorials.
- Machine Learning Applications: Train machine learning models to predict outputs based on given inputs.
Remember that this dataset provides a plethora of diverse Python coding examples, allowing you to explore different aspects of Python programming, for example:
- Code analysis: Researchers and developers can use this dataset to analyze various Python code samples and identify patterns, best practices, and common mistakes. This can help in improving code quality and optimizing performance.
- Language understanding: Natural language processing techniques can be applied to the instruction column of this dataset to develop models that can understand and interpret natural language instructions for programming tasks.
- Code generation: The input column of this dataset contains the required inputs for executing each Python code sample. Researchers can build models that generate Python code based on specific inputs or task requirements using the examples provided in this dataset. This can be useful in automating repetitive programming tasks o...
SSURGO Portal

The newest version of SSURGO Portal with Soil Data Viewer is available via the Quick Start Guide. Install Python to C:\Program Files. This is a different version than what ArcGIS Pro uses. If you need data for multiple states, we also offer a prebuilt large database with all SSURGO for the entire United States and all Islands. The prebuilt database saves you time, but it's large and takes a while to download. You can also use the prebuilt gNATSGO GeoPackage database in SSURGO Portal – Soil Data Viewer. Read the ReadMe.txt in the folder. More about gNATSGO here. You can also import STATSGO2 data into SSURGO Portal and create a database to use in Soil Data Viewer, available for download via the Soils Box folder.

SSURGO Portal Notes

This 10 minute video covers it all, other than installation of SSURGO Portal and the GIS tool. Installation is typically smooth and easy. There is also a user guide on the SSURGO Portal website that can be very helpful; it has info about using the data in ArcGIS Pro or QGIS. The SQLite SSURGO database can be opened and queried with DB Browser, which is essentially a free Microsoft Access. Guidance about setting up DB Browser to easily open SQLite databases is available in section 4 of this Installation Guide.

Workflow if you need to make your own database:
- Install SSURGO Portal.
- Install the SSURGO Downloader GIS tool (refer to the Installation and User Guide for assistance). There is one for QGIS and one for ArcGIS Pro; they both do the same thing.
- Quickly download California SSURGO data with the tool: enter the two digit state symbol followed by an asterisk in "Search by Areasymbol" to download all data for a state. For example, enter CA* to batch download all data for California.
- Open SSURGO Portal and create a new SQLite SSURGO Template database (refer to the User Guide for assistance).
- Import the SSURGO data you downloaded into the database. You can import SSURGO data from many states at once, building a database that spans many states.
- After the SSURGO data is done importing, click on the Soil Data Viewer tab and run ratings. These are the exact same ratings as Web Soil Survey. A new table is added to your database for each rating, and you can search for ratings by keyword.
- If desired, open the database in GIS and make a map (refer to the User Guide for assistance).

Workflow if you need to use the large prebuilt database (don't make your own database):
- Install SSURGO Portal.
- In SSURGO Portal, browse to the unzipped prebuilt GeoPackage database with all SSURGO (either the prebuilt large database with all SSURGO or the gNATSGO GeoPackage database).
- In SSURGO Portal, click on the Soil Data Viewer tab and run ratings. These are the exact same ratings as Web Soil Survey. A new table is added to your database for each rating, and you can search for ratings by keyword.
- If desired, open the database in GIS and make a map.

If you have trouble installing SSURGO Portal, it's usually the connection with Python. Create a desktop shortcut that tells SSURGO Portal which Python to use (these instructions were created for Windows 11):
- Right click anywhere on your desktop and choose New > Shortcut.
- In the text bar, enter your path to python.exe followed by your path to the SSURGO Portal .pyz file. Example of the format: "C:\Program Files\Python310\python.exe" "C:\SSURGO Portal\SSURGO_Portal-0.3.0.8.pyz". Include the quotation marks; the paths may be different on your machine. To avoid typing, you can browse to python.exe in Windows Explorer, right click, select "Copy as Path", and paste the result into the box; then do the same for the SSURGO Portal .pyz file, pasting it to the right of the python.exe path.
- Click Next, then name the shortcut anything you want.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Dataset containing measurements of Linux kernel binary size after compilation. The reported size, in the column "perf", is the size in bytes of the vmlinux file. It also contains a column "active_options" reporting the number of activated options (set to "y"). All other columns, listed in the file "Linux_options.json", are Linux kernel options. The sampling was done using randconfig. The version of Linux used is 4.13.3.
Not all available options are present. First, the dataset only contains options for the x86, 64-bit version. Then, all non-tristate options have been ignored. Finally, options that do not take multiple values across the whole dataset, due to insufficient variability in the sampling, are ignored. All options are encoded as 0 for the "n" and "m" option values, and 1 for "y".
In Python, importing the dataset with pandas will assign all columns the int64 dtype, which leads to very high memory consumption (~50 GB). The snippet below imports it using less than 1 GB of memory by setting the option columns to int8.
import json

import numpy
import pandas as pd

# Read the list of option columns so their dtype can be forced to int8
with open("Linux_options.json", "r") as f:
    linux_options = json.load(f)

# Load the dataset with option columns stored as int8 instead of int64
df = pd.read_csv("Linux.csv", dtype={f: numpy.int8 for f in linux_options})
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
F-DATA is a novel workload dataset containing the data of around 24 million jobs executed on Supercomputer Fugaku over three years of public system usage (March 2021-April 2024). Each job record contains an extensive set of features, such as exit code, duration, power consumption and performance metrics (e.g. #flops, memory bandwidth, operational intensity and memory/compute bound label), which allows for the prediction of a multitude of job characteristics. The full list of features can be found in the file feature_list.csv.
The sensitive data appears in both anonymized and encoded versions. The encoding is based on a Natural Language Processing model and retains sensitive but useful job information for prediction purposes, without violating data privacy. The scripts used to generate the dataset are available in the F-DATA GitHub repository, along with a series of plots and instructions on how to load the data.
F-DATA is composed of 38 files, with each YY_MM.parquet file containing the data of the jobs submitted in the month MM of the year YY.
The files of F-DATA are saved as .parquet files. It is possible to load such files as dataframes by leveraging the pandas APIs, after installing pyarrow (pip install pyarrow). A single file can be read with the following Python instructions:
import pandas as pd
df = pd.read_parquet("21_01.parquet")
df.head()
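To combine several months into a single dataframe, one possible approach (not part of the official loading instructions) is to glob the monthly YY_MM.parquet files and concatenate them:

import glob

import pandas as pd

# Read every monthly file in the current directory and stack them into one dataframe.
files = sorted(glob.glob("*.parquet"))
df_all = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)
print(df_all.shape)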
The MNIST database of handwritten digits.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('mnist', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/mnist-3.0.1.png
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
code.zip: Zip folder containing a folder titled "code" which holds:
csv file titled "MonitoredRainGardens.csv" containing the 14 monitored green infrastructure (GI) sites with their design and physiographic features;
csv file titled "storm_constants.csv" which contain the computed decay constants for every storm in every GI during the measurement period;
csv file titled "newGIsites_AllData.csv" which contain the other 130 GI sites in Detroit and their design and physiographic features;
csv file titled "Detroit_Data_MeanDesignFeatures.csv" which contain the design and physiographic features for all of Detroit;
Jupyter notebook titled "GI_GP_SensorPlacement.ipynb" which provides the code for training the GP models and displaying the sensor placement results;
a folder titled "MATLAB" which contains the following:
folder titled "SFO" which contains the SFO toolbox for the sensor placement work
file titled "sensor_placement.mlx" that contains the code for the sensor placement work
several .mat files created in Python for importing into Matlab for the sensor placement work: "constants_sigma.mat", "constants_coords.mat", "GInew_sigma.mat", "GInew_coords.mat", and "R1_sensor.mat" through "R6_sensor.mat"
several .mat files created in Matlab for importing into Python for visualizing the results: "MI_DETselectedGI.mat" and "DETselectedGI.mat"
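A minimal sketch of how such .mat files can be written from Python and read back for visualization, using scipy.io; the array shape and the variable key "sigma" are placeholders, not the actual contents of these files.

import numpy as np
from scipy.io import loadmat, savemat

# Write an array to a .mat file so it can be imported into MATLAB
# (hypothetical variable name and data).
sigma = np.random.rand(14, 14)
savemat("constants_sigma.mat", {"sigma": sigma})

# Read a .mat file produced by MATLAB back into Python for plotting.
selected = loadmat("DETselectedGI.mat")
print(selected.keys())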
The klib library enables us to quickly visualize missing data, perform data cleaning, visualize data distributions, visualize correlations and visualize categorical column values. klib is a Python library for importing, cleaning, analyzing and preprocessing data. Explanations of key functionalities can be found on Medium / TowardsDataScience in the examples section or on YouTube (Data Professor).
Original Github repo
Header image: https://raw.githubusercontent.com/akanz1/klib/main/examples/images/header.png
!pip install klib
import klib
import pandas as pd
# Replace 'data' with your own data source, e.g. a dict of columns or a loaded file
df = pd.DataFrame(data)
# klib.describe functions for visualizing datasets
- klib.cat_plot(df) # returns a visualization of the number and frequency of categorical features
- klib.corr_mat(df) # returns a color-encoded correlation matrix
- klib.corr_plot(df) # returns a color-encoded heatmap, ideal for correlations
- klib.dist_plot(df) # returns a distribution plot for every numeric feature
- klib.missingval_plot(df) # returns a figure containing information about missing values
Take a look at this starter notebook.
Further examples, as well as applications of the functions can be found here.
Pull requests and ideas, especially for further functions are welcome. For major changes or feedback, please open an issue first to discuss what you would like to change. Take a look at this Github repo.
The HURRECON model estimates wind speed, wind direction, enhanced Fujita scale wind damage, and duration of EF0 to EF5 winds as a function of hurricane location and maximum sustained wind speed. Results may be generated for a single site or an entire region. Hurricane track and intensity data may be imported directly from the US National Hurricane Center's HURDAT2 database. HURRECON is available in R and Python. The R version is available on CRAN as HurreconR. The model is an updated version of the original HURRECON model written in Borland Pascal for use with Idrisi (see HF025). New features include support for: (1) estimating wind damage on the enhanced Fujita scale, (2) importing hurricane track and intensity data directly from HURDAT2, (3) creating a land-water file with user-selected geographic coordinates and spatial resolution, and (4) creating plots of site and regional results. The model equations for estimating wind speed and direction, including parameter values for inflow angle, friction factor, and wind gust factor (over land and water), are unchanged from the original HURRECON model. For more details and sample datasets, see the project website on GitHub (https://github.com/hurrecon-model).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This is the root directory containing data files, bash scripts, and Python scripts to generate the
data for the tables and figures in my PhD thesis, titled "Nuclear Wavefunctions of Dispersion
Bound Systems: Endohedral Eigenstates of Endofullerenes". This thesis was submitted in September
2024, with corrections (no additional calculations) approved in December 2024. The electronic
structure data is provided raw, as outputs from FermiONs++ and FHI-aims. The machine learned PESs
are constructed from Python scripts. These are then used to calculate the nuclear eigenstates, which
is achieved using a self-written library, "EPEE", available on GitLab at
https://gitlab.developers.cam.ac.uk/ksp31/epee.
Author: Kripa Panchagnula
Date: January 2025
To run the machine learning, nuclear diagonalisation, and plotting scripts the "thesis_calcs" branch
(commit SHA: 100d79600aae7668d4ceaeafc6274a89f019283c) or "main" branch (commit SHA:
4e4d677f609028710fbc8e4f48dc4895543340db) of EPEE is required alongside NumPy, SciPy, scikit-learn,
matplotlib and the "development" branch of QSym2 from https://qsym2.dev/. Any Python script importing
from src is referring to the EPEE library. Each Python script must be run from within its containing
directory.
The data is separated into the following folders:
- background/
This folder contains a Python script to generate figures for Chapters 1-3.
- He@C60/
This folder contains electronic structure data from FermiONs++ with Python scripts
to generate data for Chapter 4.
- X@C70/
This folder contains Python scripts to generate data for Chapter 5.
- Ne@C70/
This folder contains electronic structure data from FermiONs++ and FHI-aims with
Python scripts to generate data for Chapter 6.
- H2@C70/
This folder contains Python scripts to generate data for Chapter 7.
- peapods/
This folder contains Python scripts to generate data for Chapter 8.
Each folder contains its own README, with more details about its structure. File types include text files (.txt, .dat, .cube), scripts (.bash, .py) and NumPy compressed data files (.npz).
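The NumPy compressed data files can be inspected directly; a minimal sketch follows, with the file name being a placeholder rather than one of the actual files.

import numpy as np

# Open a compressed .npz archive and list the arrays stored inside it.
data = np.load("example.npz")
print(data.files)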
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The results shown in this article are meant to be a demonstration of the presented method. The data files included in this archive contain all raw data used for plotting all the figures shown in the article.
The figures shown are based on three different, but representative, experiments. The file numbers represent our internal identification numbers. Use the following key to resolve IDs to material names and a short description of each material:
ID: 546214 | Short name: EP-10SCF | Description: Bisphenol A epoxy (DER 331, DOW) filled with 10 vol.% short carbon fiber (A385, Tenax), in-house manufactured
ID: 548671 | Short name: PA6-10SCF-8Gr | Description: Polyamide 6 filled with 10 vol.% short carbon fibers (A385, Tenax) and 8 vol.% graphite (RGC39A, Superior Graphite), in-house manufactured
ID: 550187 | Short name: PPS-40GR | Description: Polyphenylene sulfide filled with 40 wt.% graphite, commercially available as TECACOMP PPS TC black 4084 (Ensinger, Germany)
At least two files belong to each ID, one with the file extension 'proc' and one with 'proc.header.yaml'. The proc file is a tab-separated ASCII table and contains data recorded by our tribometer as well as data processed from it. The yaml files contain all necessary header information for each respective proc file, of which the header name and (column) index are especially important for reproduction purposes. Additionally, ID 550187 contains files with the extensions 'xt' and 'xt.header.yaml'. The xt file is a tab-separated ASCII table and contains the temporally and laterally resolved luminance data used for plotting the xt plot. The yaml file, similar to that of the proc file, contains all necessary header information.
Information for importing the data files: All header files follow a standardized serialization format called YAML. The proc and xt files are strict tab-separated ASCII tables. Therefore, all data can be easily read by any programming language of your choice, e.g. Python, Ruby or MATLAB. With additional effort, an import into Excel or similar software is also possible.
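As a sketch of such an import in Python (the file names follow the ID scheme above but are assumptions, as are the exact header fields):

import pandas as pd
import yaml

# Read the YAML header describing the corresponding proc file.
with open("546214.proc.header.yaml", "r") as f:
    header = yaml.safe_load(f)
print(header)

# Read the tab-separated proc table itself. Depending on the exact file layout,
# column names may need to be taken from the YAML header instead (assumption).
proc = pd.read_csv("546214.proc", sep="\t")
print(proc.head())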
This dataset repository contains all the text files of the datasets analysed in the Survey Paper on Audio Datasets of Scenes and Events. See here for the paper. The GitHub repository containing the scripts is shared here, including a bash script to download the audio data for each of the datasets. In this repository, we also included a Python file, dataset.py, for easy importing of each of the datasets. Please respect the original license of the dataset owner when downloading the data:… See the full description on the dataset page: https://huggingface.co/datasets/gijs/audio-datasets.
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
By nlpai-lab (From Huggingface) [source]
This dataset provides a collection of translations from English to Korean for NLP models such as GPT4ALL, Dolly, and Vicuna Data. The translations were generated using the DeepL API. It contains three columns: instruction represents the instruction given to the model for the translation task, input is the input text that needs to be translated from English to Korean, and output is the corresponding translated text in Korean. The dataset aims to facilitate research and development in natural language processing tasks by providing a reliable source of translated data
This dataset contains Korean translations of instructions, inputs, and outputs for various NLP models including GPT4ALL, Dolly, and Vicuna Data. The translations were generated using the DeepL API.
Description of Columns
The dataset consists of the following columns:
instruction: This column contains the original instruction given to the model for the translation task.
input: This column contains the input text in English that needs to be translated to Korean.
output: This column contains the translated text in Korean.

How to Utilize this Dataset
You can use this dataset for various natural language processing (NLP) tasks such as machine translation or training language models specifically focused on English-Korean translation.
Here are a few steps on how you can utilize this dataset effectively:
- Importing Data: Load or import the provided train.csv file into your Python environment or preferred programming language (see the sketch after this list).
Data Preprocessing: Clean and preprocess both input and output texts if needed. You may consider tokenization, removing stopwords, or any other preprocessing techniques that align with your specific task requirements.
Model Training: Utilize deep learning frameworks like PyTorch or TensorFlow to develop your NLP model focused on English-Korean translation using this prepared dataset as training data.
Evaluation & Fine-tuning: Evaluate your trained model's performance using suitable metrics such as BLEU score or perplexity measurement techniques specific to machine translation tasks. Fine-tune your model by iterating over different architectures and hyperparameters based on evaluation results until desired performance is achieved.
Inference & Deployment: Once you are satisfied with your trained model's performance, use it for making predictions on unseen English texts which need translation into Korean within any application where it can provide meaningful value.
Remember that this dataset was translated using DeepL API; thus, you can leverage these translations as a starting point for your NLP projects. However, it is essential to validate and further refine the translations according to your specific use case or domain requirements.
Good luck with your NLP projects using this Korean Translation Dataset!
- Training and evaluating machine translation models: This dataset can be used to train and evaluate machine translation models for translating English text to Korean. The instruction column provides specific instructions given to the model, while the input column contains the English text that needs to be translated. The output column contains the corresponding translations in Korean.
- Language learning and practice: This dataset can be used by language learners who want to practice translating English text into Korean. Users can compare their own translations with the provided translations in the output column to improve their language skills.
- Benchmarking different translation APIs or models: The dataset includes translations generated using the DeepL API, but it can also be used as a benchmark for comparing other translation APIs or models. By comparing the performance of different systems on this dataset, researchers and developers can gain insights into the strengths and weaknesses of different translation approaches
If you use this dataset in your research, please credit the original authors.
Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. [See Other Information](https:/...
Source code, documentation, and examples of use of the source code for the Dioptra Test Platform.

Dioptra is a software test platform for assessing the trustworthy characteristics of artificial intelligence (AI). Trustworthy AI is: valid and reliable, safe, secure and resilient, accountable and transparent, explainable and interpretable, privacy-enhanced, and fair - with harmful bias managed. Dioptra supports the Measure function of the NIST AI Risk Management Framework by providing functionality to assess, analyze, and track identified AI risks.

Dioptra provides a REST API, which can be controlled via an intuitive web interface, a Python client, or any REST client library of the user's choice for designing, managing, executing, and tracking experiments. Details are available in the project documentation available at https://pages.nist.gov/dioptra/.

Use Cases
We envision the following primary use cases for Dioptra:
- Model Testing:
  -- 1st party: Assess AI models throughout the development lifecycle
  -- 2nd party: Assess AI models during acquisition or in an evaluation lab environment
  -- 3rd party: Assess AI models during auditing or compliance activities
- Research: Aid trustworthy AI researchers in tracking experiments
- Evaluations and Challenges: Provide a common platform and resources for participants
- Red-Teaming: Expose models and resources to a red team in a controlled environment

Key Properties
Dioptra strives for the following key properties:
- Reproducible: Dioptra automatically creates snapshots of resources so experiments can be reproduced and validated
- Traceable: The full history of experiments and their inputs are tracked
- Extensible: Support for expanding functionality and importing existing Python packages via a plugin system
- Interoperable: A type system promotes interoperability between plugins
- Modular: New experiments can be composed from modular components in a simple yaml file
- Secure: Dioptra provides user authentication, with access controls coming soon
- Interactive: Users can interact with Dioptra via an intuitive web interface
- Shareable and Reusable: Dioptra can be deployed in a multi-tenant environment so users can share and reuse components
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The database contains microwave single scattering data, mainly of ice hydrometeors. Main applications lie in microwave remote sensing, both passive and active. Covered hydrometeors range from pristine ice crystals to large aggregates, graupel and hail, 34 types in total. Furthermore, 34 frequencies from 1 to 886 GHz, and 3 temperatures, 190, 230 and 270 K, are included. The orientation currently covered is totally random (i.e. each orientation is equally probable), but the database is designed to also handle azimuthally random orientation, and data for this orientation case will be added in the future. Mie code was used for the two completely spherical habits, while the bulk of the data were calculated using the discrete dipole approximation (DDA) method.
Interfaces for easy user access are available under a separate upload, due to different licensing, under the title ARTS Microwave Single Scattering Properties Database Interfaces. Interfaces in MATLAB and Python are available, supporting browsing and importing of the data. There is also functionality for extracting data for usage in RTTOV.
A description of the database is also available in the following article. Please cite it if you use the database for a publication.
Eriksson, P., R. Ekelund, J. Mendrok, M. Brath, O. Lemke, and S. A. Buehler (2018), A general database of hydrometeor single scattering properties at microwave and sub-millimetre wavelengths, Earth Syst. Sci. Data, 10, 1301–1326, doi: 10.5194/essd-10-1301-2018.
New version 1.1.0 released: Database, technical report and readme document updated. It is highly recommended to download both the new interface and database versions.
-Added three new habits: two liquid habits with azimuthally random orientation and one new bullet rosette with totally random orientation (with IDs from 35 to 37).
-Updated Python and MATLAB interface to accommodate azimuthally random oriented data and with some other minor updates.
-DDA calculations at frequencies under 100 GHz have been re-calculated using higher EPS settings, in order to accommodate radar applications (see technical report, Sec. 4.1.1).
-The tolerance of the extinction cross-section post-calculation check is now 10 % instead of 30 % (see bullet 5 in technical report, Sec. 4.4.1.1). Those calculations that could not meet the stricter criteria were recalculated using higher EPS setting.
-Format of standard habits revised. The weighting applied to the large habit at each size is now available in the mat-files.
-Fixed wrong index for GEM hail in specifications table.
The Institute for the Design of Advanced Energy Systems (IDAES) Integrated Platform is a versatile computational environment offering extensive process systems engineering (PSE) capabilities for optimizing the design and operation of complex, interacting technologies and systems. IDAES enables users to efficiently search vast, complex design spaces to discover the lowest cost solutions while supporting the full process modeling lifecycle, from conceptual design to dynamic optimization and control. The extensible, open platform empowers users to create models of novel processes and rapidly develop custom analyses, workflows, and end-user applications.

IDAES-PSE 2.6.0 Release Highlights

Upcoming Changes
IDAES will be switching to the new Pyomo solver interface in the next release. Whilst this will hopefully be a smooth transition for most users, there are a few important changes to be aware of:
- The new solver interface uses a different version of the IPOPT writer ("ipopt_v2"), and thus any custom configuration options you might have set for IPOPT will not carry over and will need to be reset.
- By default, the new Pyomo linear presolver will be activated with ipopt_v2. Whilst we are working to identify any bugs in the presolver, it is possible that some edge cases will remain.
- IDAES will begin deploying a new set of scaling tools and APIs over the next few releases that make use of the new solver writers. The old scaling tools and APIs will remain for backward compatibility but will begin to be deprecated.

New Models, Tools and Features
- New Intersphinx extension automatically linking Jupyter notebook examples to project documentation
- New end-to-end diagnostics example demonstrated on a real problem
- New complementarity formulation for VLE with cubic equations of state, with backward compatibility for the old formulation
- New solver interface with presolve (ipopt_v2) in support of upcoming changes to the initialization methods and APIs, with the default set to ipopt to maintain backwards compatibility; this will be deprecated once all examples have been updated
- New forecaster and parameterized bidder methods within the grid integration library
- Updated surrogates API and examples to support Keras 3, with backwards compatibility for older formats such as TensorFlow SavedModel (TFSM)
- Updated costing base dictionary to include the 2023 cost year index value
- Updated ProcessBlock to include information on the constructing block class
- Updated Flowsheet Visualizer to allow the visualize() method to return values and functions

Bug Fixes
- Fixed bug in the Modular Property Framework that would cause errors when trying to use phase-based material balances with phase equilibria
- Fixed bug in the Modular Properties Framework that caused errors when initializing models with non-vapor-liquid phase equilibria
- Fixed typos flagged by the June update to crate-ci/typos and removed DMF-related exceptions
- Minor corrections of units of measurement handling in power plant waste/transport costing expressions, control volume material holdup expressions, and BTX property package parameters
- Fixed throwing >7500 numpy deprecation warnings by replacing scalar value assignment with element extraction and item iteration calls

Testing and Robustness
- Migrated slow tests (>10s) to integration, impacting test coverage but also yielding a nearly 30% decrease in local test runtime
- Pinned pint to avoid issues with older supported Python versions
- Pinned codecov versions to avoid tokenless upload behavior with the latest version
- Bumped extensions to version 3.4.2 to allow pointing to a non-standard install location

Deprecations and Removals
- Python 3.8 is no longer supported. The supported Python versions are 3.9 through 3.12
- The Data Management Framework (DMF) is no longer supported. Importing idaes.core.dmf will cause a deprecation warning to be displayed until the next release
- The SOFC Keras surrogates have been removed. The current version of the SOFC surrogate model in the examples repository is a PySMO Kriging model.