Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Author: Andrew J. Felton
Date: 5/5/2024
This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:
"Global estimates of the storage and transit time of water through vegetation"
Please note that 'turnover' and 'transit' are used interchangeably in this project.
Data information:
The data folder contains key data sets used for analysis. In particular:
"data/turnover_from_python/updated/annual/multi_year_average/average_annual_turnover.nc" contains a global array summarizing five year (2016-2020) averages of annual transit, storage, canopy transpiration, and number of months of data. This is the core dataset for the analysis; however, each folder has much more data, including a dataset for each year of the analysis. Data are also available is separate .csv files for each land cover type. Oterh data can be found for the minimum, monthly, and seasonal transit time found in their respective folders. These data were produced using the python code found in the "supporting_code" folder given the ease of working with .nc and EASE grid in the xarray python module. R was used primarily for data visualization purposes. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here.
Python scripts can be found in the "supporting_code" folder.
Each R script in this project has a particular function:
01_start.R: This script loads the R packages used in the analysis, sets the directory, and imports custom functions for the project. You can also load in the main transit time (turnover) datasets here using the source() function.
02_functions.R: This script contains the custom function for this analysis, primarily to work with importing the seasonal transit data. Load this using the source() function in the 01_start.R script.
03_generate_data.R: This script is not necessary to run and is primarily for documentation. The main role of this code was to import and wrangle the data needed to calculate ground-based estimates of aboveground water storage.
04_annual_turnover_storage_import.R: This script imports the annual turnover and storage data for each land cover type. You load in these data from the 01_start.R script using the source() function.
05_minimum_turnover_storage_import.R: This script imports the minimum turnover and storage data for each land cover type. Minimum is defined as the lowest monthly estimate. You load in these data from the 01_start.R script using the source() function.
06_figures_tables.R: This is the main workhorse for figure/table production and supporting analyses. This script generates the key figures and summary statistics used in the study, which are then saved in the manuscript_figures folder. Note that all maps were produced using Python code found in the "supporting_code" folder.
https://www.usa.gov/government-works/
The Livestock and Meat International Trade Data product includes monthly and annual data for imports of live cattle, hogs, sheep, goats, beef and veal, pork, lamb and mutton, chicken meat, turkey meat, eggs, and egg products. This product does not include any Dairy Data. Using official trade statistics reported by the U.S. Census, this data product provides data aggregated by commodity and converted to the same units used in the USDA’s World Agricultural Supply and Demand Estimates (WASDE). These units are carcass-weight-equivalent (CWE) pounds for meat products and dozen equivalents for eggs and egg products. Live animal numbers are not converted. With breakdowns by partner country and historical data back to 1989, these data can be used to analyze trends in livestock, meat, and poultry shipments alongside domestic production data and WASDE estimates. Timely analysis and discussion can be found in the monthly Livestock, Dairy, and Poultry Outlook report.
This includes all of the same monthly data as the Excel tables, as well as disaggregated, unconverted data. These files are machine-readable, providing a convenient format for Python users and programmers.
The Livestock and Meat Trade Data Set contains monthly and annual data for imports of live cattle, hogs, sheep, and goats, as well as beef and veal, pork, lamb and mutton, chicken meat, turkey meat, and eggs. The tables report physical quantities, not dollar values or unit prices. Data on beef and veal, pork, lamb, and mutton are on a carcass-weight-equivalent basis. Breakdowns by country are included.
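For Python users, a minimal sketch of reading one of the machine-readable files with pandas; the file name below is purely hypothetical, since the actual file names are not listed in this description:

import pandas as pd

# Hypothetical file name; substitute the CSV you downloaded from the data product
trade = pd.read_csv("livestock_meat_imports.csv")
print(trade.head())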
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset provides the raw data associated with the NCBI GEO accession number GSE183947. The underlying data is an RNA-Sequencing (RNA-Seq) expression matrix derived from matched normal and malignant breast cancer tissue samples. The primary goal of this resource is to teach the complete workflow of:
- Downloading and importing high-throughput genomics data from public repositories.
- Cleaning and normalizing the raw expression values (e.g., FPKM/TPM).
- Preparing the data structure for downstream Differential Gene Expression (DEG) analysis.
This resource is essential for anyone practicing translational bioinformatics and cancer research.
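As a rough illustration of the import-and-normalize steps (the file name below is an assumption, not taken from this description), one could load the expression matrix with pandas and apply a log transform before DEG preparation:

import numpy as np
import pandas as pd

# Hypothetical file name for the GSE183947 expression matrix (genes x samples)
expr = pd.read_csv("GSE183947_fpkm.csv", index_col=0)

# Log2(x + 1) transform to stabilize variance before downstream DEG analysis
log_expr = np.log2(expr + 1)
print(log_expr.shape)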
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:
🔓 First open data set with information on every active firm in Russia.
🗂️ First open financial statements data set that includes non-filing firms.
🏛️ Sourced from two official data providers: Rosstat and the Federal Tax Service.
📅 Covers 2011-2023 initially and will be continuously updated.
🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.
The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in a structured, column-oriented, compressed binary format (Apache Parquet) with a yearly partitioning scheme, enabling end-users to query only the variables of interest at scale.
The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.
Here we present the instructions for importing the data in R or Python environment. Please consult with the project repository for more information: http://github.com/irlcode/RFSD.
Importing The Data
You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo, or rely on the 🤗 Hugging Face Datasets library.
Python
🤗 Hugging Face Datasets
It is as easy as:
from datasets import load_dataset
import polars as pl
RFSD = load_dataset('irlspbru/RFSD')
RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')
Please note that the data is not shuffled within a year, meaning that streaming the first n rows will not yield a random sample.
Local File Import
Importing in Python requires the pyarrow package to be installed.
import pyarrow.dataset as ds
import polars as pl
RFSD = ds.dataset("local/path/to/RFSD")
print(RFSD.schema)
RFSD_full = pl.from_arrow(RFSD.to_table())
RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))
RFSD_2019_revenue = pl.from_arrow(
    RFSD.to_table(
        filter=ds.field('year') == 2019,
        columns=['inn', 'line_2110']
    )
)
renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})
R
Local File Import
Importing in R requires the arrow package to be installed.
library(arrow)
library(data.table)
RFSD <- open_dataset("local/path/to/RFSD")
schema(RFSD)
scanner <- Scanner$create(RFSD)
RFSD_full <- as.data.table(scanner$ToTable())
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scanner <- scan_builder$Finish()
RFSD_2019 <- as.data.table(scanner$ToTable())
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scan_builder$Project(cols = c("inn", "line_2110"))
scanner <- scan_builder$Finish()
RFSD_2019_revenue <- as.data.table(scanner$ToTable())
renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)
Use Cases
🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md
🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md
🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md
FAQ
Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?
To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.
What is the data period?
We provide financials for Russian firms for 2011-2023. We will add the data for 2024 by July 2025 (see Version and Update Policy below).
Why are there no data for firm X in year Y?
Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:
We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).
Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022, and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023; Novatek only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.
A firm may have submitted its annual statement but, according to the Uniform State Register of Legal Entities (EGRUL), was not active in that year. We remove those filings.
Why is the geolocation of firm X incorrect?
We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to a house level in 2014 and 2021-2023, but only at street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, address of incorporation may not correspond with plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.
Why is the data for firm X different from https://bo.nalog.ru/?
Many firms submit correcting statements after the initial filing. Although we downloaded the data well past the April 2024 deadline for 2023 filings, firms may have kept submitting correcting statements. We will capture them in future releases.
Why is the data for firm X unrealistic?
We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.
Why is the data for groups of companies different from their IFRS statements?
We should stress that we provide unconsolidated financial statements filed according to Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups from this data. Gazprom, for instance, has over 800 affiliated entities, and to study this corporate group in its entirety it is not enough to consider the financials of the parent company alone.
Why is the data not in CSV?
The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.
Version and Update Policy
Version (SemVer): 1.0.0.
We intend to update the RFSD annually as the data becomes available, in other words when most firms have filed their statements with the Federal Tax Service. The official deadline for filing previous-year statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. There is therefore a trade-off between data completeness and version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all correcting filings. We plan to make a new version of the RFSD available by July.
Licence
Creative Commons License Attribution 4.0 International (CC BY 4.0).
Copyright © the respective contributors.
Citation
Please cite as:
@unpublished{bondarkov2025rfsd,
  title={{R}ussian {F}inancial {S}tatements {D}atabase},
  author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
  note={arXiv preprint arXiv:2501.05841},
  doi={https://doi.org/10.48550/arXiv.2501.05841},
  year={2025}
}
Acknowledgments and Contacts
Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru
Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,
The Salford extension for CKAN is designed to enhance CKAN's functionality for specific use cases, particularly involving the management and import of datasets relevant to the Salford City Council. By incorporating custom configurations and an ETL script, this extension streamlines the process of integrating external data sources, especially from data.gov.uk, into a CKAN instance. It also provides a structured approach to configuring CKAN for specific data management needs.
Key Features:
- Custom Plugin Integration: Enables the addition of 'salford' and 'esd' plugins to extend CKAN's core functionality, addressing specific data management requirements.
- Configurable Licenses Group URL: Allows administrators to specify a licenses group URL in the CKAN configuration, streamlining access to license information pertinent to the dataset.
- ETL Script for Data.gov.uk Import: Includes a Python script (etl.py) to import datasets specifically from the Salford City Council publisher on data.gov.uk.
- Non-UKLP Dataset Compatibility: The ETL script is designed to filter and import non-UKLP datasets, excluding INSPIRE datasets from the data.gov.uk import process at this time.
- Bower Component Installation: Simplifies asset management by providing instructions for installing Bower components.
Technical Integration: The Salford extension requires modifications to the CKAN configuration file (production.ini). Specifically, it involves adding salford and esd to the ckan.plugins setting, defining the licenses group URL, and potentially configuring other custom options. The ETL script leverages the CKAN API (ckanapi) for data import. Additionally, Bower components must be installed.
Benefits & Impact: Using the Salford CKAN extension, organizations can establish a more streamlined data ingestion process tailored to Salford City Council datasets, enhance data accessibility, improve asset management, and facilitate better data governance aligned with specific licensing requirements. By selectively importing datasets and offering custom plugin support, it caters to specialized data management needs.
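This is not the extension's actual etl.py, but a minimal sketch of the ckanapi pattern such a script relies on; the CKAN URL, API key, and dataset fields below are placeholders:

from ckanapi import RemoteCKAN

# Placeholder URL and API key for the target CKAN instance
ckan = RemoteCKAN("https://ckan.example.org", apikey="YOUR-API-KEY")

# Create a dataset record pulled from an external source (placeholder fields)
ckan.action.package_create(
    name="example-salford-dataset",
    title="Example Salford dataset",
    owner_org="salford-city-council",
)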
https://creativecommons.org/publicdomain/zero/1.0/
By Vezora (From Huggingface) [source]
The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, specifically designed for training and analysis purposes. With 188,000 samples, this dataset offers an extensive range of examples that cater to the research needs of Python programming enthusiasts.
This valuable resource consists of various columns, including input, which represents the input or parameters required for executing the Python code sample. The instruction column describes the task or objective that the Python code sample aims to solve. Additionally, there is an output column that showcases the resulting output generated by running the respective Python code.
By utilizing this dataset, researchers can effectively study and analyze real-world scenarios and applications of Python programming. Whether for educational purposes or development projects, this dataset serves as a reliable reference for individuals seeking practical examples and solutions using Python.
The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, containing 188,000 samples in total. This dataset can be a valuable resource for researchers and programmers interested in exploring various aspects of Python programming.
Contents of the Dataset
The dataset consists of several columns:
- output: This column represents the expected output or result that is obtained when executing the corresponding Python code sample.
- instruction: It provides information about the task or instruction that each Python code sample is intended to solve.
- input: The input parameters or values required to execute each Python code sample.
Exploring the Dataset
To make effective use of this dataset, it is essential to understand its structure and content properly. Here are some steps you can follow:
- Importing Data: Load the dataset into your preferred environment for data analysis using appropriate tools like pandas in Python.
import pandas as pd

# Load the dataset
df = pd.read_csv('train.csv')
- Understanding Column Names: Familiarize yourself with the column names and their meanings by referring to the provided description.
# Display column names
print(df.columns)
- Sample Exploration: Get an initial understanding of the data structure by examining a few random samples from different columns.
# Display random samples from the 'output' column
print(df['output'].sample(5))
- Analyzing Instructions: Analyze different instructions or tasks present in the 'instruction' column to identify specific areas you are interested in studying or learning about.
# Count unique instructions and display the ones with the highest occurrences
instruction_counts = df['instruction'].value_counts()
print(instruction_counts.head(10))
Potential Use Cases
The Vezora/Tested-188k-Python-Alpaca dataset can be utilized in various ways:
- Code Analysis: Analyze the code samples to understand common programming patterns and best practices.
- Code Debugging: Use code samples with known outputs to test and debug your own Python programs.
- Educational Purposes: Utilize the dataset as a teaching tool for Python programming classes or tutorials.
- Machine Learning Applications: Train machine learning models to predict outputs based on given inputs.
Remember that this dataset provides a plethora of diverse Python coding examples, allowing you to explore different applications, such as:
- Code analysis: Researchers and developers can use this dataset to analyze various Python code samples and identify patterns, best practices, and common mistakes. This can help in improving code quality and optimizing performance.
- Language understanding: Natural language processing techniques can be applied to the instruction column of this dataset to develop models that can understand and interpret natural language instructions for programming tasks.
- Code generation: The input column of this dataset contains the required inputs for executing each Python code sample. Researchers can build models that generate Python code based on specific inputs or task requirements using the examples provided in this dataset. This can be useful in automating repetitive programming tasks o...
SSURGO Portal
The newest version of SSURGO Portal with Soil Data Viewer is available via the Quick Start Guide. Install Python to C:\Program Files. This is a different version than what ArcGIS Pro uses. If you need data for multiple states, we also offer a prebuilt large database with all SSURGO for the entire United States and all islands. The prebuilt database saves you time, but it is large and takes a while to download. You can also use the prebuilt gNATSGO GeoPackage database in SSURGO Portal - Soil Data Viewer; read the ReadMe.txt in the folder. More about gNATSGO here. You can also import STATSGO2 data into SSURGO Portal and create a database to use in Soil Data Viewer, available for download via the Soils Box folder.
SSURGO Portal Notes
This 10-minute video covers it all, other than installation of SSURGO Portal and the GIS tool. Installation is typically smooth and easy. There is also a user guide on the SSURGO Portal website that can be very helpful; it has info about using the data in ArcGIS Pro or QGIS. The SQLite SSURGO database can be opened and queried with DB Browser, which is essentially a free Microsoft Access. Guidance about setting up DB Browser to easily open SQLite databases is available in section 4 of this Installation Guide.
Workflow if you need to make your own database
- Install SSURGO Portal.
- Install the SSURGO Downloader GIS tool (refer to the Installation and User Guide for assistance). There is one for QGIS and one for ArcGIS Pro; they both do the same thing.
- Quickly download California SSURGO data with the tool. Enter the two-digit state symbol followed by an asterisk in "Search by Areasymbol" to download all data for a state. For example, enter CA* to batch download all data for California.
- Open SSURGO Portal and create a new SQLite SSURGO Template database (refer to the User Guide for assistance).
- Import the SSURGO data you downloaded into the database. You can import SSURGO data from many states at once, building a database that spans many states.
- After the SSURGO data is done importing, click on the Soil Data Viewer tab and run ratings. These are the exact same ratings as Web Soil Survey. A new table is added to your database for each rating. You can search for ratings by keyword.
- If desired, open the database in GIS and make a map (refer to the User Guide for assistance).
Workflow if you need to use a large prebuilt database (don't make your own database)
- Install SSURGO Portal.
- In SSURGO Portal, browse to the unzipped prebuilt database: either the prebuilt large database with all SSURGO or the gNATSGO GeoPackage database.
- In SSURGO Portal, click on the Soil Data Viewer tab and run ratings. These are the exact same ratings as Web Soil Survey. A new table is added to your database for each rating. You can search for ratings by keyword.
- If desired, open the database in GIS and make a map.
If you have trouble installing SSURGO Portal, it is usually the connection with Python. Create a desktop shortcut that tells SSURGO Portal which Python to use (these steps were created for Windows 11):
- Right click anywhere on your desktop and choose New > Shortcut.
- In the text bar, enter the path to python.exe followed by the path to SSURGO Portal.pyz. Example of the format: "C:\Program Files\Python310\python.exe" "C:\SSURGO Portal\SSURGO_Portal-0.3.0.8.pyz". Include the quotation marks. Paths may be different on your machine. To avoid typing, you can browse to python.exe in Windows Explorer, right click, select "Copy as path", and paste the result into the box. Then do the same for the SSURGO Portal.pyz file, pasting it to the right of the python.exe path.
- Click Next and name the shortcut anything you want.
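As a side note, a minimal Python sketch for peeking into a SSURGO SQLite database outside of SSURGO Portal; the file name is a placeholder, and the table listing works for any SQLite database:

import sqlite3

# Placeholder path to a SQLite SSURGO Template database created in SSURGO Portal
con = sqlite3.connect("ssurgo_template.sqlite")

# List the tables in the database (ratings appear as additional tables)
tables = con.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name").fetchall()
print([t[0] for t in tables])
con.close()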
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset containing measurements of Linux kernel binary size after compilation. The reported size, in the column "perf", is the size in bytes of the vmlinux file. It also contains a column "active_options" reporting the number of activated options (set to "y"). All other columns, listed in the file "Linux_options.json", are Linux kernel options. The sampling was done using randconfig. The version of Linux used is 4.13.3.
Not all available options are present. First, the dataset only contains options for the x86, 64-bit version. Then, all non-tristate options have been ignored. Finally, options that do not take multiple values across the whole dataset, due to insufficient variability in the sampling, are ignored. All options are encoded as 0 for the "n" and "m" option values, and 1 for "y".
In Python, importing the dataset with pandas will assign int64 to all columns, which leads to very high memory consumption (~50 GB). The following snippet imports the dataset using less than 1 GB of memory by setting the option columns to int8:
import json

import numpy as np
import pandas as pd

# Load the list of option column names
with open("Linux_options.json", "r") as f:
    linux_options = json.load(f)

# Read the CSV, forcing option columns to int8 to keep memory usage low
df = pd.read_csv("Linux.csv", dtype={option: np.int8 for option in linux_options})
The MNIST database of handwritten digits.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('mnist', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/mnist-3.0.1.png
The HURRECON model estimates wind speed, wind direction, enhanced Fujita scale wind damage, and duration of EF0 to EF5 winds as a function of hurricane location and maximum sustained wind speed. Results may be generated for a single site or an entire region. Hurricane track and intensity data may be imported directly from the US National Hurricane Center's HURDAT2 database. HURRECON is available in R and Python. The R version is available on CRAN as HurreconR. The model is an updated version of the original HURRECON model written in Borland Pascal for use with Idrisi (see HF025). New features include support for: (1) estimating wind damage on the enhanced Fujita scale, (2) importing hurricane track and intensity data directly from HURDAT2, (3) creating a land-water file with user-selected geographic coordinates and spatial resolution, and (4) creating plots of site and regional results. The model equations for estimating wind speed and direction, including parameter values for inflow angle, friction factor, and wind gust factor (over land and water), are unchanged from the original HURRECON model. For more details and sample datasets, see the project website on GitHub (https://github.com/hurrecon-model).
This dataset repository contains all the text files of the datasets analysed in the Survey Paper on Audio Datasets of Scenes and Events. See here for the paper. The GitHub repository containing the scripts is shared here, including a bash script to download the audio data for each of the datasets. In this repository, we also include a Python file, dataset.py, for easy importing of each of the datasets. Please respect the original license of the dataset owner when downloading the data:… See the full description on the dataset page: https://huggingface.co/datasets/gijs/audio-datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
F-DATA is a novel workload dataset containing the data of around 24 million jobs executed on Supercomputer Fugaku over the three years of public system usage (March 2021-April 2024). Each job record contains an extensive set of features, such as exit code, duration, power consumption and performance metrics (e.g. #flops, memory bandwidth, operational intensity and memory/compute bound label), which allows for the prediction of a multitude of job characteristics. The full list of features can be found in the file feature_list.csv.
The sensitive data appears in both anonymized and encoded versions. The encoding is based on a Natural Language Processing model and retains sensitive but useful job information for prediction purposes, without violating data privacy. The scripts used to generate the dataset are available in the F-DATA GitHub repository, along with a series of plots and instructions on how to load the data.
F-DATA is composed of 38 files, with each YY_MM.parquet file containing the data of the jobs submitted in the month MM of the year YY.
The files of F-DATA are saved as .parquet files. It is possible to load such files as dataframes by leveraging the pandas APIs, after installing pyarrow (pip install pyarrow). A single file can be read with the following Python instructions:
import pandas as pd
df = pd.read_parquet("21_01.parquet")
df.head()
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The results shown in this article are meant as a demonstration of the presented method. The data files included in this archive contain all raw data used for plotting the figures shown in the article.
The figures shown are based on three different but representative experiments. The file numbers represent our internal identification numbers. Use the following key to resolve IDs to material names and a short description of each material:
| ID | Short name | Description |
| 546214 | EP-10SCF | Bisphenol A epoxy (DER 331, DOW) filled with 10 vol.% short carbon fiber (A385, Tenax), in-house manufactured |
| 548671 | PA6-10SCF-8Gr | Polyamide 6 filled with 10 vol.% short carbon fibers (A385, Tenax) and 8 vol.% graphite (RGC39A, Superior Graphite), in-house manufactured |
| 550187 | PPS-40GR | Polyphenylene sulfide filled with 40 wt.% graphite, commercially available as TECACOMP PPS TC black 4084 (Ensinger, Germany) |
To each ID belong at least two files, one with the file extension 'proc' and one with 'proc.header.yaml'. The proc file is a tab-separated ASCII table and contains data recorded by our tribometer as well as data processed from them. The yaml files contain all necessary header information for each respective proc file, of which especially the header name and (column) index are important for reproduction purposes. Additionally, ID 550187 contains files with the extensions 'xt' and 'xt.header.yaml'. The xt file is a tab-separated ASCII table and contains the temporally and laterally resolved luminance data used for plotting the xt plot. The yaml file, as for the proc file, contains all necessary header information.
Information for importing the data files: All header files follow the standardized serialization format YAML. The proc and xt files are strict tab-separated ASCII tables. Therefore, all data can be easily read with any programming language of your choice, e.g. Python, Ruby or MATLAB. With additional effort, an import into Excel or similar software is also possible.
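A minimal Python sketch of this import, assuming the files are named after the IDs listed above (e.g. 546214.proc and 546214.proc.header.yaml; the exact file names and the structure of the header are not specified here):

import pandas as pd
import yaml  # PyYAML

# Load the header information for the proc file (assumed file name)
with open("546214.proc.header.yaml", "r") as f:
    header = yaml.safe_load(f)
print(header)

# Read the tab-separated proc table itself; column names/indices come from the header
proc = pd.read_csv("546214.proc", sep="\t")
print(proc.head())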
The klib library enables us to quickly visualize missing data, perform data cleaning, visualize data distributions, visualize correlations, and visualize categorical column values. klib is a Python library for importing, cleaning, analyzing and preprocessing data. Explanations of key functionalities can be found on Medium / TowardsDataScience in the examples section or on YouTube (Data Professor).
Original Github repo
klib header image: https://raw.githubusercontent.com/akanz1/klib/main/examples/images/header.png
!pip install klib
import klib
import pandas as pd
df = pd.DataFrame(data)  # 'data' here stands for your own dataset (e.g. a dict, array, or records)
# klib.describe functions for visualizing datasets
- klib.cat_plot(df) # returns a visualization of the number and frequency of categorical features
- klib.corr_mat(df) # returns a color-encoded correlation matrix
- klib.corr_plot(df) # returns a color-encoded heatmap, ideal for correlations
- klib.dist_plot(df) # returns a distribution plot for every numeric feature
- klib.missingval_plot(df) # returns a figure containing information about missing values
Take a look at this starter notebook.
Further examples, as well as applications of the functions can be found here.
Pull requests and ideas, especially for further functions are welcome. For major changes or feedback, please open an issue first to discuss what you would like to change. Take a look at this Github repo.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
code.zip: Zip folder containing a folder titled "code" which holds:
csv file titled "MonitoredRainGardens.csv" containing the 14 monitored green infrastructure (GI) sites with their design and physiographic features;
csv file titled "storm_constants.csv" which contain the computed decay constants for every storm in every GI during the measurement period;
csv file titled "newGIsites_AllData.csv" which contain the other 130 GI sites in Detroit and their design and physiographic features;
csv file titled "Detroit_Data_MeanDesignFeatures.csv" which contain the design and physiographic features for all of Detroit;
Jupyter notebook titled "GI_GP_SensorPlacement.ipynb" which provides the code for training the GP models and displaying the sensor placement results;
a folder titled "MATLAB" which contains the following:
folder titled "SFO" which contains the SFO toolbox for the sensor placement work
file titled "sensor_placement.mlx" that contains the code for the sensor placement work
several .mat files created in Python for importing into Matlab for the sensor placement work: "constants_sigma.mat", "constants_coords.mat", "GInew_sigma.mat", "GInew_coords.mat", and "R1_sensor.mat" through "R6_sensor.mat"
several .mat files created in Matlab for importing into Python for visualizing the results: "MI_DETselectedGI.mat" and "DETselectedGI.mat"
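A minimal sketch for loading these .mat result files back into Python with SciPy; the variable names stored inside them are not documented here, so the code simply lists the keys:

from scipy.io import loadmat

# Load one of the results files created in Matlab
results = loadmat("DETselectedGI.mat")

# List the variables stored in the file (ignoring Matlab's internal header keys)
print([key for key in results if not key.startswith("__")])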
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The database contains microwave single scattering data, mainly of ice hydrometeors. The main applications lie in microwave remote sensing, both passive and active. Covered hydrometeors range from pristine ice crystals to large aggregates, graupel and hail, 34 types in total. Furthermore, 34 frequencies from 1 to 886 GHz, and 3 temperatures, 190, 230 and 270 K, are included. The main orientation case is currently totally random (i.e. each orientation is equally probable), but the database is designed to also handle azimuthally random orientation, and data for this orientation case will be added in the future. Mie code was used for the two completely spherical habits, while the bulk of the data were calculated using the discrete dipole approximation (DDA) method.
Interfaces for easy user access are available under a separate upload, due to different licensing, under the title ARTS Microwave Single Scattering Properties Database Interfaces. Interfaces in MATLAB and Python are available, supporting browsing and importing of the data. There is also functionality for extracting data for usage in RTTOV.
A description of the database is also available in the following article. Please cite it if you use the database for a publication.
Eriksson, P., R. Ekelund, J. Mendrok, M. Brath, O. Lemke, and S. A. Buehler (2018), A general database of hydrometeor single scattering properties at microwave and sub-millimetre wavelengths, Earth Syst. Sci. Data, 10, 1301–1326, doi: 10.5194/essd-10-1301-2018.
New version 1.1.0 released: Database, technical report and readme document updated. It is highly recommended to download both the new interface and database versions.
-Added three new habits: two liquid habits with azimuthally random orientation and one new bullet rosette with totally random orientation (with IDs from 35 to 37).
-Updated Python and MATLAB interface to accommodate azimuthally random oriented data and with some other minor updates.
-DDA calculations at frequencies under 100 GHz have been re-calculated using higher EPS settings, in order to accommodate radar applications (see technical report, Sec. 4.1.1).
-The tolerance of the extinction cross-section post-calculation check is now 10 % instead of 30 % (see bullet 5 in technical report, Sec. 4.4.1.1). Those calculations that could not meet the stricter criteria were recalculated using higher EPS setting.
-Format of standard habits revised. The weighting applied to the large habit at each size is now available in the mat-files.
-Fixed wrong index for GEM hail in specifications table.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains as-built design information, as well as full-scale vibration response measurements from an operational offshore wind turbine. The turbine is part of the Norther wind farm, which is located in the Belgian North Sea and includes a total of 44 Vestas V164 (8.4 MW) wind turbines on monopile foundations, see Fig1_Norther_locaction.png. This data set is intended to verify and validate model-based virtual sensing algorithms, using data as well as modeling information from a real turbine.
The included information entails a detailed description of the geometric properties of the monopile and transition piece, together with distributed and lumped structural masses. All information shared in this record conforms to the as-designed documentation. An example of the lumped masses considered in the model input files is presented in "Fig2_Sensor_Network.png".
Monopiles are distinguished by the significant role of soil-structure interaction. Ground reaction is most typically included in the structural model as non-linear p-y curves. Different p-y curves are available for a certain number of soils in the standards applicable to offshore structures (API RP 2GEO, 2011, and ISO 19901-4:2016(E), 2016).
The required soil properties to define p-y curves according to the API framework are given in the soil profile provided in a separate Excel file. Rather than symbols, the names of the soil properties are generally used as column headers (e.g., Undrained shear strength). Therefore, it is straightforward to identify each soil parameter. The only soil parameter that might lead to confusion is the small shear strain stiffness, addressed below.
It is worth noting that estimates for the small shear strain stiffness, referred to as Gmax, are also included. Despite not being required as an input to define the API p-y curves, this parameter remains a key input for soil reaction frameworks other than the API (e.g., PISA).
Two sets of measurement data have been curated for validation purposes; the first interval was collected during parked conditions, whereas the second interval was collected during rated operational conditions. Both records have a length of 2 hours and are subdivided into 10-minute data sets. Furthermore, 1 Hz SCADA data has been made available for the selected intervals. All data sources are time synchronized and have been subjected to several internal quality checks.
The sensor network on the NRT-WTG is illustrated in Fig. 2, whereas a description of the sensor types is presented in Tab. 1. The acceleration sensors are installed in the horizontal plane and measure tangential (Y) and orthogonal (X) to the wall, where the positive Y direction is pointing clockwise and the positive X direction is pointing inwards. All strain sensors are installed vertically and are located on the inside of the wall.
| Data type | Sensor type | Fs (Hz) | Level mLAT (m) | Description |
| Acceleration (g) | Piezo-electric acc. sensor (ACC) | 30 | 15, 69, 97 | 3 bi-directional accelerometers at different levels. LAT 15 installed at 240 degree heading; LAT 69 and 97 at 60 degree. |
| Strain (micro strain) | Resistive strain gauge (SG) | 30 | 14 | 6 SGs: equally spaced around the inner circumference of the can. Headings: 50, 110, 170, 230, 290, 350 degree. |
| Strain (micro strain) | Fiber-Bragg Grating strain gauge (FBG) | 100 | -17, -19 | 2 FBGs per level at 165 and 255 degree respectively. |
Table 1. Description of sensor types.
The FBG strain time series have been synchronized with the SG time series using a cross-correlation based approach. The SG data has therefore been used to generate reference strain time series at the headings of the FBG sensors; the FBG data is subsequently synchronized with regard to this reference time series. No synchronization of the acceleration data was needed, since these are collected using the same data acquisition system as the SG data.
The SG strain time series have been calibrated and temperature compensated, whereas this is not the case for the FBG strain time series. The latter have a yet to be determined calibration offset.
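A minimal sketch of such a cross-correlation based lag estimate (an illustration of the idea only, not the actual processing pipeline; it assumes both series have already been resampled to a common rate fs):

import numpy as np
from scipy.signal import correlate, correlation_lags

def estimate_lag_seconds(reference, signal, fs):
    """Estimate the lag of `signal` relative to `reference` (in seconds)
    from the peak of their cross-correlation."""
    ref = reference - np.mean(reference)
    sig = signal - np.mean(signal)
    xcorr = correlate(sig, ref, mode="full")
    lags = correlation_lags(len(sig), len(ref), mode="full")
    return lags[np.argmax(xcorr)] / fs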
In conjunction with the sensor channels presented in Tab. 1, 1 Hz SCADA data is provided. A summary of the provided SCADA parameters, all sampled at 1 Hz, is presented in Tab. 2.
| Parameter | Unit | Description |
| Wind speed | m/s | Wind speed as recorded in the turbine SCADA |
| Wind direction | ° | Wind direction relative to North (0°) as recorded in the turbine SCADA |
| Yaw angle | ° | Yaw orientation of the nacelle relative to North (0°) as recorded in the turbine SCADA |
| Pitch angle | ° | Rotor blade pitch as recorded in the turbine SCADA |
| Rotor speed | rpm | Rotor speed in rotations per minute as recorded in the turbine SCADA |
| Power | kW | Active power of the turbine as recorded in the turbine SCADA |
Table 2. List of provided SCADA parameters
A summary of the selected intervals and the relevant corresponding SCADA parameters is given in Tab. 3.
| Scenario | T1 (UTC) | T2 (UTC) | Wind speed | RPM | Pitch |
| Parked | 03/07 01:30 | 03/07 03:30 | < 4.5 m/s | ~1 | ~18° |
| Rated | 05/07 22:30 | 06/07 00:30 | ~15 m/s | 10.5 | 8.1° |
Table 3. Selected data intervals and relevant SCADA parameters
To import the measurement data into Python it is recommended to use pandas:
import pandas as pd
# Read Parquet file with pandas
relative_file_path = 'NRT-WTG_Parked.parquet.gz'
data = pd.read_parquet(relative_file_path)
Once the dataframe has been imported, users can process and re-arrange the raw data according to their needs; it should be noted that the imported dataframe contains NaN values, which are caused by the different sampling rates of the provided signals.
40,000 lines of Shakespeare from a variety of Shakespeare's plays. Featured in Andrej Karpathy's blog post 'The Unreasonable Effectiveness of Recurrent Neural Networks': http://karpathy.github.io/2015/05/21/rnn-effectiveness/.
To use for e.g. character modelling:
d = tfds.load(name='tiny_shakespeare')['train']
d = d.map(lambda x: tf.strings.unicode_split(x['text'], 'UTF-8'))
# train split includes vocabulary for other splits
vocabulary = sorted(set(next(iter(d)).numpy()))
d = d.map(lambda x: {'cur_char': x[:-1], 'next_char': x[1:]})
d = d.unbatch()
seq_len = 100
batch_size = 2
d = d.batch(seq_len)
d = d.batch(batch_size)
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('tiny_shakespeare', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data has been anonymised prior to publication. The data has been standardized in the Fast Healthcare Interoperability Resources (FHIR) data standard.
This work has been conducted within the framework of the MOTU++ project (PR19-PAI-P2).
This research was co-funded by the Complementary National Plan PNC-I.1 "Research initiatives for innovative technologies and pathways in the health and welfare sector” D.D. 931 of 06/06/2022, DARE - DigitAl lifelong pRevEntion initiative, code PNC0000002, CUP: (B53C22006450001) and by the Italian National Institute for Insurance against Accidents at Work (INAIL) within the MOTU++ project (PR19-PAI-P2).
Authors express their gratitude to all the AlmaHealthDB Team.
The repository includes a Docker Compose setup for importing the MOTU dataset into a HAPI FHIR server, formatted as NDJSON following the HL7 FHIR R4 standards.
Before you begin, ensure you have the following installed:
- A dataset directory containing the NDJSON files.
- Run docker-compose up in the terminal to start the Docker containers.
- Run python main.py in the terminal to start the data import process.