Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Self-built:
PIConGPU: https://github.com/franzpoeschel/picongpu/tree/smc2021-paper
GAPD: closed-source software, Git tag smc2021-paper in private repository
openPMD-api: https://github.com/franzpoeschel/openPMD-api/tree/smc2021-paper
ADIOS2: https://github.com/ornladios/ADIOS2, Git hash bf25ad59b8b15b9f48ddabad65a41f2050d3bd7f
libfabric: 1.6.3a1
Summit modules:
1) gcc/8.1.1
2) spectrum-mpi/10.3.1.2-20200121
3) cmake/3.18.2
4) git/2.20.1
5) cuda/10.1.243
6) boost/1.66.0
7) zlib/1.2.11
8) libpng/1.6.34
9) freetype/2.9.1
10) python/3.7.0-anaconda3-5.3.0
A dataset containing the monitoring of several hardware performance counters (HPCs) associated with 7 cache side-channel attacks (Spectre V1, V2, V4; Meltdown, ZombieLoad, Fallout, and Crosstalk), along with data obtained for 7 benign/benchmark programs (matrix multiplier, stress -c, stress -m, MiBench, STREAM, bzip2, and ffmpeg). All programs are run on Intel x86 architectures.

The attacks used to collect the data were selected by analyzing the characteristics of the machine and the available mitigations to determine whether it was vulnerable to each of them. The benign programs were selected mainly from benchmark suites that offer reliable and reproducible execution behavior, allowing for effective comparison with the attack workloads; benchmark suites with varied approaches were chosen to ensure good coverage of the dataset. Finally, the activity counters were selected based on a detailed analysis of the exploited vulnerabilities, prior work, and subsequent data analysis to confirm their validity. From this study, the following hardware counters were selected: branch-misses, branch-instructions, LLC-load-misses, L1-dcache-load-misses, and instructions.

Each file corresponds to one of the 14 programs executed to generate the values of the analyzed hardware counters and is identified by the name of the program associated with its execution.

For the data collection, it was necessary to identify and acquire the binary codes of the selected programs (benign and attacks). The source from which each code was obtained is listed below.

Malicious codes:
1) Meltdown GitHub: Institute of Applied Information Processing and Communications (IAIK), Meltdown, https://github.com/IAIK/meltdown
2) Spectre V1 GitHub: R. C. (crozone), SpectrePoC, https://github.com/crozone/SpectrePoC
3) Spectre V2 GitHub: A. C. (Anton-Cao), Spectrev2-poc, https://github.com/Anton-Cao/spectrev2-poc
4) Spectre V4 GitHub: Y. S. (mmxsrup), CVE-2018-3639, https://github.com/mmxsrup/CVE-2018-3639
5) ZombieLoad GitHub: Institute of Applied Information Processing and Communications (IAIK), ZombieLoad, https://github.com/IAIK/ZombieLoad
6) Fallout GitHub: T. H. (tristan-hornetz), Fallout, https://github.com/tristan-hornetz/fallout
7) Crosstalk GitHub: T. H. (tristan-hornetz), Crosstalk, https://github.com/tristan-hornetz/crosstalk

Benign codes:
1) Matrix multiplier: own code
2) stress -c UNIX tool: R. O. S. Projects, Stress, https://github.com/resurrecting-open-source-projects/stress
3) stress -m UNIX tool: R. O. S. Projects, Stress, https://github.com/resurrecting-open-source-projects/stress
4) MiBench Bitcount GitHub: Embecosm, MiBench, https://github.com/embecosm/mibench
5) STREAM GitHub: J. H. (jeffhammond), STREAM, https://github.com/jeffhammond/STREAM
6) bzip2 UNIX tool: https://sourceware.org/bzip2/
7) ffmpeg UNIX package: https://ffmpeg.org/
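The counters listed above are standard Linux perf events, so a comparable trace can be collected with perf stat. The sketch below is a minimal illustration only; the sampling interval, output file, helper function, and target command are assumptions, not the authors' actual collection setup.

```python
# Hedged sketch: sample the five counters used in this dataset while a target
# program runs, via Linux `perf stat`. Interval, output path, and the target
# command are illustrative assumptions, not the original collection procedure.
import subprocess

EVENTS = ("branch-misses,branch-instructions,"
          "LLC-load-misses,L1-dcache-load-misses,instructions")

def monitor(target_cmd, csv_path="counters.csv", interval_ms=100):
    """Run target_cmd under perf stat, writing interval counter readings to csv_path."""
    cmd = [
        "perf", "stat",
        "-e", EVENTS,            # the five counters named in the description
        "-I", str(interval_ms),  # print counts every interval_ms milliseconds
        "-x", ",",               # machine-readable, comma-separated output
        "-o", csv_path,          # write readings to a file
        "--",
    ] + list(target_cmd)
    subprocess.run(cmd, check=True)

# Example: monitor one of the benign workloads from the dataset
monitor(["stress", "-c", "1", "--timeout", "10"])
```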
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Welcome to the Music Informatics for Radio Across the GlobE (MIRAGE) MetaCorpus. The current (v0.2) development release consists of metadata (e.g., artist name, track title) and musicological features (e.g., instrument list, voice type, tempo) for 1 million events streaming on 10,000 internet radio stations across the globe, with 100 events from each station.
Users who wish to access, interact with, and/or export metadata from the MIRAGE-MetaCorpus may also visit the MIRAGE online dashboard at the following url:
The current MIRAGE-MetaCorpus is available under a CC BY 4.0 license. Users may cite the dataset as follows:
Sears, David R.W. “Music Informatics for Radio Across the Globe (MIRAGE) Metacorpus -- 2024”. Zenodo, July 19, 2024. https://doi.org/10.5281/zenodo.12786202.
Users accessing the MIRAGE-MetaCorpus using the online dashboard should also cite the following ISMIR paper:
Ngan V.T. Nguyen, Elizabeth A.M. Acosta, Tommy Dang, and David R.W. Sears. "Exploring Internet Radio Across the Globe with the MIRAGE Online Dashboard," in Proceedings of the 25th International Society for Music Information Retrieval Conference (San Francisco, CA, 2024).
This repository of the MIRAGE-MetaCorpus contains 81 metadata variables from the following open-access sources:
Each event also includes attribution metadata from the following commercial sources:
The metadata reflect information about each event's location (e.g., city, country), station (name, format, url), event (id, local time at station, etc.), artist (name, voice type, etc.), and track (e.g., title, year of release, etc.). For that reason, the MIRAGE-MetaCorpus includes the following datasets:
A subset of the MIRAGE-MetaCorpus is also available for events with metadata from online music libraries that reliably matched the event's description in the radio station's stream encoder:
If you are a copyright owner for any of the metadata that appears in the MIRAGE-MetaCorpus and would like us to remove your metadata, please contact the developer team at the following email address: miragedashboard@gmail.com
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘COVID-19 Coronavirus Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/vignesh1694/covid19-coronavirus on 14 February 2022.
--- Dataset description provided by original source is as follows ---
A SARS-like virus outbreak originating in Wuhan, China, is spreading into neighboring Asian countries, and as far afield as Australia, the US, and Europe.
On 31 December 2019, the Chinese authorities reported a case of pneumonia with an unknown cause in Wuhan, Hubei province, to the World Health Organisation (WHO)’s China Office. As more and more cases emerged, totaling 44 by 3 January, the country’s National Health Commission isolated the virus causing fever and flu-like symptoms and identified it as a novel coronavirus, now known to the WHO as 2019-nCoV.
The following dataset shows the numbers of coronavirus cases spreading across the globe.
Sno - Serial number
Date - Date of the observation
Province / State - Province or state of the observation
Country - Country of observation
Last Update - Recent update (not accurate in terms of time)
Confirmed - Number of confirmed cases
Deaths - Number of death cases
Recovered - Number of recovered cases
Thanks to Johns Hopkins CSSE for the live updates on Coronavirus and data streaming. Source: https://github.com/CSSEGISandData/COVID-19 Dashboard: https://public.tableau.com/profile/vignesh.coumarane#!/vizhome/DashboardToupload/Dashboard12
Inspired by the following work: https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6
--- Original source retains full ownership of the source dataset ---
Concentration-discharge relationships are a key tool for understanding the sourcing and transport of material from watersheds to fluvial networks. Storm events in particular provide insight into variability in the sources of solutes and sediment within watersheds, and the hydrologic pathways that connect hillslope to stream channel. Here we examine high-frequency sensor-based specific conductance and turbidity data from multiple storm events across two watersheds (Quebrada Sonadora and Rio Icacos) with different lithology in the Luquillo Mountains of Puerto Rico, a forested tropical ecosystem. Our analyses include Hurricane Maria, a category 5 hurricane. To analyze hysteresis, we used a recently developed set of metrics to describe and quantify storm events including the hysteresis index (HI), which describes the directionality of hysteresis loops, and the flushing index (FI), which describes whether the mobilization of material is source or transport limited. We also examine the role of antecedent discharge to predict hysteretic behavior during storms. Overall, specific conductance and turbidity showed contrasting responses to storms. The hysteretic behavior of specific conductance was very similar across sites, displaying clockwise hysteresis and a negative flushing index indicating proximal sources of solutes and consistent source limitation. In contrast, the directionality of turbidity hysteresis was significantly different between watersheds, although both had strong flushing behavior indicative of transport limitation. Overall, models that included antecedent discharge did not perform any better than models with peak discharge alone, suggesting that the magnitude and trajectory of an individual event was the strongest driver of material flux and hysteretic behavior. Hurricane Maria produced unique hysteresis metrics within both watersheds, indicating a distinctive response to this major hydrological event. The similarity in response of specific conductance to storms suggests that solute sources and pathways are similar in the two watersheds. The divergence in behavior for turbidity suggests that sources and pathways of particulate matter vary between the two watersheds. The use of high-frequency sensor data allows the quantification of storm events while index-based metrics of hysteresis allow for the direct comparison of complex storm events across a heterogeneous landscape and variable flow conditions.
Additional scripts for hysteresis analysis are available here in the 'python scripts for analysis' folder and at https://github.com/miguelcleon/HysteresisAnalysis/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MPEG-DASH datasets for the SLF4Web research project. SLF4Web is a Web-based implementation of a static light field consumption system; it allows SLF datasets to be adaptively streamed over the network (via MPEG-DASH) and then to be visualized in a vanilla Web browser. The datasets are encoded using the H.264/AVC video codec. A subset of the datasets are available in multiple qualities to allow for adaptive network streaming.
The SLF4Web source code is available on GitHub (https://github.com/EDM-Research/SLF4Web) and as a bundle at https://zenodo.org/badge/latestdoi/432214902.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary data associated with 'National-scale biogeography and function of river and stream bacterial biofilm communities'. Preprint is available at: https://doi.org/10.1101/2025.03.05.641783.
R scripts for data analysis and visualisation of this dataset are available on GitHub at: https://github.com/amycthorpe/biofilm_MAG_analysis.
Snakemake workflows to generate the results are available on GitHub at: https://github.com/amycthorpe/metag_analysis_EA and https://github.com/amycthorpe/EA_metag_post_analysis.
Environmental metadata:
Metagenome assembled genomes (MAGs):
Metabolic and functional traits:
Environmental drivers:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data package contains discharge and water quality data and model results for the Coal Creek Watershed in the central Rocky Mountains of Colorado, USA. Files include high-frequency stream chemistry data collected during the period of Dec 2015 to Jun 2018, and model results of water storage and flux. The dataset also includes dissolved organic carbon and sodium stream chemistry data for 2016. The model also incorporates USGS datasets of discharge and stream chemistry, for which data and citations are provided in the dataset files and the related-reference field. The model used, BioRT-Flux-PIHM, is a biogeochemical reactive transport model in the PIHM family of watershed codes (MM-PIHM); it is detailed in the reference paper (doi.org/10.1029/2018WR024257) and on GitHub (https://github.com/PSUmodeling/BioRT-Flux-PIHM).
The Open Australian Legal Embeddings are the first open-source embeddings of Australian legislative and judicial documents.
Trained on the largest open database of Australian law, the Open Australian Legal Corpus, the Embeddings consist of roughly 5.2 million 384-dimensional vectors embedded with BAAI/bge-small-en-v1.5.
The Embeddings open the door to a wide range of possibilities in the field of Australian legal AI, including the development of document classifiers, search engines and chatbots.
To ensure their accessibility to as wide an audience as possible, the Embeddings are distributed under the same licence as the Open Australian Legal Corpus.
The below code snippet illustrates how the Embeddings may be loaded and queried via the Hugging Face Datasets Python library:

```python
import itertools

import sklearn.metrics.pairwise
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-small-en-v1.5')
instruction = 'Represent this sentence for searching relevant passages: '

# Set streaming to False if you wish to load the entire dataset into memory
# (unadvised unless you have at least 64 GB of RAM).
oale = load_dataset('umarbutler/open_australian_legal_embeddings', split='train', streaming=True)
sample = list(itertools.islice(oale, 100000))

query = model.encode(instruction + 'Who is the Governor-General of Australia?', normalize_embeddings=True)

similarities = sklearn.metrics.pairwise.cosine_similarity([query], [embedding['embedding'] for embedding in sample])
most_similar_index = similarities.argmax()
most_similar = sample[most_similar_index]

print(most_similar['text'])
```
To speed up the loading of the Embeddings, you may wish to install orjson.

The Embeddings are stored in data/embeddings.jsonl, a json lines file where each line is a list of 384 32-bit floating point numbers. Associated metadata is stored in data/metadatas.jsonl and the corresponding texts are located in data/texts.jsonl.
The metadata fields are the same as those used for the Open Australian Legal Corpus, barring the text field, which was removed, and with the addition of the is_last_chunk key, which is a boolean flag for whether a text is the last chunk of a document (used to detect and remove corrupted documents when creating and updating the Embeddings).
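As a rough illustration of that layout, the three json lines files can be read in parallel and zipped back together. This is only a sketch of one possible approach: the read_jsonl helper is hypothetical, and it assumes line i of each file describes the same chunk (which follows from the description above) and that the files have already been downloaded locally.

```python
# Minimal sketch: stream the three parallel json lines files described above.
# Assumes local copies of the files and aligned line ordering across them.
import orjson

def read_jsonl(path):  # hypothetical helper, not part of the official tooling
    with open(path, 'rb') as f:
        for line in f:
            yield orjson.loads(line)

embeddings = read_jsonl('data/embeddings.jsonl')  # each line: list of 384 floats
metadatas = read_jsonl('data/metadatas.jsonl')    # each line: dict of metadata fields
texts = read_jsonl('data/texts.jsonl')            # each line: the chunk's text

for embedding, metadata, text in zip(embeddings, metadatas, texts):
    print(len(embedding), metadata.get('is_last_chunk'), str(text)[:80])
    break
```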
All documents in the Open Australian Legal Corpus were split into semantically meaningful chunks up to 512 tokens long (as determined by bge-small-en-v1.5's tokeniser) with the semchunk Python library. These chunks included a header embedding documents' titles, jurisdictions and types in the following format:

```
Title: {title}
Jurisdiction: {jurisdiction}
Type: {type}
{text}
```
The chunks were then vectorised by bge-small-en-v1.5 on a single GeForce RTX 2080 Ti with a batch size of 32 via the SentenceTransformers library.

The resulting embeddings were serialised as json-encoded lists of floats by orjson and stored in data/embeddings.jsonl. The corresponding metadata and texts (with their headers removed) were saved to data/metadatas.jsonl and data/texts.jsonl, respectively.
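A hedged sketch of that vectorise-and-serialise step is shown below. The chunks list is a made-up placeholder and the exact arguments are assumptions for illustration; the actual build code is linked in the next paragraph.

```python
# Illustrative sketch of the vectorise-then-serialise step described above.
# `chunks` is a placeholder list of header-prefixed chunk texts; batching and
# file handling are assumptions rather than the authors' actual build script.
import orjson
from sentence_transformers import SentenceTransformer

chunks = [
    'Title: Example Act 2000\nJurisdiction: commonwealth\nType: primary_legislation\nExample text ...',
]

model = SentenceTransformer('BAAI/bge-small-en-v1.5')
embeddings = model.encode(chunks, batch_size=32, normalize_embeddings=True)

with open('data/embeddings.jsonl', 'wb') as f:
    for embedding in embeddings:
        f.write(orjson.dumps(embedding.tolist()) + b'\n')
```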
The code used to create and update the Embeddings may be found [here](https://github.com/umarbutler/open-australian-legal-embeddings-...
This dataset is one of a suite of products from the Nature’s Network project (naturesnetwork.org). Nature’s Network is a collaborative effort to identify shared priorities for conservation in the Northeast, considering the value of fish and wildlife species and the natural areas they inhabit. Brook Trout probability of occurrence is intended to provide predictions of occupancy (probability of presence) for catchments smaller than 200 km2 in the Northeast and Mid-Atlantic region from Virginia to Maine. The dataset provides predictions under current environmental conditions and for future increases in stream temperature. Brook Trout probability of occurrence (under current climate) is one input used in developing “Lotic Core Areas, Stratified by Watershed, Northeast U.S.” that is also part of Nature’s Network. Lotic core areas represent intact, well-connected rivers and stream reaches in the Northeast and Mid-Atlantic region that, if protected as part of stream networks and watersheds, will continue to support a broad diversity of aquatic species and the ecosystems on which they depend. The combination of lotic core areas, lentic (lake and pond) core areas, and aquatic buffers constitute the “aquatic core networks” of Nature’s Network. These and other datasets that augment or complement aquatic core networks are available in the Nature’s Network gallery: https://nalcc.databasin.org/galleries/8f4dfe780c444634a45ee4acc930a055.
Intended Uses
In the context of Nature’s Network, this dataset is primarily intended to be used in conjunction with the product “Lotic Core Areas, Stratified by Watershed, Northeast U.S.” to better understand the importance of core areas to Brook Trout. It also can be used on its own to identify priority watersheds for Brook Trout.
The dataset was originally developed for and is part of the Interactive Catchment Explorer (ICE). ICE (http://ice.ecosheds.org/) is a dynamic visualization interface for exploring catchment characteristics and environmental model predictions. ICE was created for resource managers and researchers to explore complex, multivariate environmental datasets and model results, to identify spatial patterns related to ecological conditions, and to prioritize locations for restoration or further study. ICE is part of the Spatial Hydro-Ecological Decision System (SHEDS).
Description and Derivation
The dataset provides predictions under current environmental conditions and for future increases in stream temperature of 2, 4, and 6 degrees Celsius. It employs a logistic mixed effects model to include the effects of landscape, land-use, and climate variables on the probability of Brook Trout occupancy in stream reaches (confluence to confluence). It includes random effects of HUC10 (watershed) to allow for the chance that the probability of occupancy and the effect of covariates were likely to be similar within a watershed. The fish data came primarily from state and federal agencies that sample streams for Brook Trout as part of regular monitoring. A stream is considered occupied if any Brook Trout were ever caught during an electrofishing survey between 1991 and 2010. The results are based on more than 15,000 samples from more than 13,000 catchments from all 13 Northeast states.
Factors that had a strong positive effect on Brook Trout occupancy included percent forest cover and summer precipitation. Factors that had a strong negative effect on occupancy included July stream temperature, percent agriculture, drainage area, and percent upstream impounded area.
Estimates of the probability of occupancy for each catchment with increases in stream temperature of 2, 4, or 6 degrees C are also provided. To produce these estimates, the input values for mean July stream temperature were simply increased by 2, 4, or 6 degrees C and the estimated occupancies recorded.
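Stated compactly, the occupancy model described above takes the standard logistic mixed-effects form. The notation below is only a paraphrase for orientation, not the authors' published specification (the original model also allowed covariate effects, not just the intercept, to vary by watershed):

$$\operatorname{logit}(p_{ij}) = \beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x}_{ij} + b_j, \qquad b_j \sim \mathcal{N}\!\left(0,\ \sigma^{2}_{\mathrm{HUC10}}\right)$$

where p_ij is the probability of Brook Trout occupancy in stream reach i of HUC10 watershed j, x_ij collects the covariates noted above (July stream temperature, percent forest, percent agriculture, drainage area, summer precipitation, percent upstream impounded area), and b_j is the watershed random effect. The warming scenarios re-evaluate this expression with mean July stream temperature increased by 2, 4, or 6 degrees C.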
More technical details about the Brook Trout probability of occurrence product are available at: http://conte-ecology.github.io/Northeast_Bkt_Occupancy/. Technical details about the regional stream temperature model, which is used in predicting Brook Trout occupancy, are available at: http://conte-ecology.github.io/conteStreamTemperature_northeast/.
Known Issues and Uncertainties
As with any project carried out across such a large area, this dataset is subject to limitations. The results by themselves are not a prescription for on-the-ground action; users are encouraged to verify, with field visits and site-specific knowledge, the value of any areas identified in the project. Known issues and uncertainties include the following:
Users are cautioned against using the data on too small an area (for example, a small segment of stream), as the data may not be sufficiently accurate at that level of resolution.
Uncertainties in predictions of stream temperature also result in uncertainties in Brook Trout occupancy estimates. Local effects of groundwater (which may provide cold-water refugia for Brook Trout) cannot be well accounted for in regional stream temperature models at this time. Catchments near waterbodies with water control structures such as dams may also have unreliable temperature predictions because the temperature model does not include information on release schedules or strategies.
Catchments with any Brook Trout occurrences reported in the past 30 years have been presumed to be occupied for purposes of the model. If local extirpations have occurred, this could lead to overprediction of the probability of Brook Trout occupancy.
Projections of effects of future temperature changes to Brook Trout occupancy are intended to convey a sense of the resilience of the species to changing temperatures. In reality, stream temperatures will not change at the same rate or uniformly, as some streams are more buffered against changing air temperatures than others.
Brook Trout occupancy predictions are not available in certain areas where surficial soil coarseness data were absent. These areas include the White Mountains of NH and mountainous areas in NY such as the Adirondacks.
As with any regional GIS data, errors in mapping and alignment of hydrography, development, agriculture, and a number of other data layers can affect the model results.
Attribute definitions
Source = data source
FEATUREID = unique identifier
NextDownID = unique identifier of catchment immediately downstream (-1 = none)
Shape_Leng = length of catchment in meters
Shape_Area = area of catchment in square meters
AreaSqKm = area of catchment in square kilometers
huc12 = 12 digit Hydrologic Unit Code for the watershed
stusps = state in which the catchment is located
agricultur = the percentage of the catchment that is covered by agricultural land (e.g. cultivated crops, orchards, and pasture) including fallow land.
elevation = mean elevation of catchment (m)
forest = the percentage of the catchment that is forested
summer_prc = mean precipitation per month in summer (mm)
UpAreaSqKM = drainage area upstream of catchment in square kilometers
occ_curren = probability of Brook Trout occupancy (current climate)
plus2 = probability of Brook Trout occupancy if stream temperature were to warm by 2 degrees C, relative to current climate
plus4 = probability of Brook Trout occupancy if stream temperature were to warm by 4 degrees C, relative to current climate
plus6 = probability of Brook Trout occupancy if stream temperature were to warm by 6 degrees C, relative to current climate
max_temp_0 = the maximum additional stream temperature (degrees C), on top of the current mean summer temperature for the catchment, that would be predicted to result in a 30% probability of occupancy for Brook Trout
max_temp_1 = the maximum additional stream temperature (degrees C), on top of the current mean summer temperature for the catchment, that would be predicted to result in a 50% probability of occupancy for Brook Trout
max_temp_2 = the maximum additional stream temperature (degrees C), on top of the current mean summer temperature for the catchment, that would be predicted to result in a 70% probability of occupancy for Brook Trout
meanSumme = mean summer stream temperature (C)
meanDays_1 = mean days per year that stream temperature exceeds 18 degrees C
meanDays_2 = mean days per year that stream temperature exceeds 22 degrees C
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset was supplied to the Bioregional Assessment Programme by a third party and is presented here as originally supplied. Metadata was not provided and has been compiled by the Bioregional Assessment Programme based on the known details at the time of acquisition.
The data includes level, salinity and temperature from gauges 203450 and 203470 in the Richmond catchment. This data is plotted against time for water quality analysis purposes.
This is a download from the open access NSW database at http://realtimedata.water.nsw.gov.au/water.stm
This data is a download from the open access NSW database
http://realtimedata.water.nsw.gov.au/water.stm
The data includes level, salinity and temperature from gauges 203450 and 203470 in the Richmond catchment.
Data was downloaded on 18/3/2015.
NSW Office of Water (2015) CLM - Richmond stream gauge data. Bioregional Assessment Source Dataset. Viewed 07 April 2016, http://data.bioregionalassessments.gov.au/dataset/03f59f6b-8d06-4513-b662-db7c4c2d2909.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Advances in fluorescence microscopy enable monitoring larger brain areas in-vivo with finer time resolution. The resulting data rates require reproducible analysis pipelines that are reliable, fully automated, and scalable to datasets generated over the course of months. We present CaImAn, an open-source library for calcium imaging data analysis. CaImAn provides automatic and scalable methods to address problems common to preprocessing, including motion correction, neural activity identification, and registration across different sessions of data collection. It does this while requiring minimal user intervention, with good scalability on computers ranging from laptops to high-performance computing clusters. CaImAn is suitable for two-photon and one-photon imaging, and also enables real-time analysis on streaming data.
To benchmark the performance of CaImAn, we collected and combined a corpus of manual annotations from multiple labelers on nine mouse two-photon datasets, which are contained in this open access repository. We demonstrate that CaImAn achieves near-human performance in detecting the locations of active neurons.
In order to reproduce the results of the paper or download the annotations and the raw movies, please refer to the readme.md at:
https://github.com/flatironinstitute/CaImAn/blob/master/use_cases/eLife_scripts/README.md
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:
🔓 First open data set with information on every active firm in Russia.
🗂️ First open financial statements data set that includes non-filing firms.
🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.
📅 Covers 2011-2023 initially, will be continuously updated.
🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.
The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in Apache Parquet, a structured, column-oriented, compressed binary format, with a yearly partitioning scheme, enabling end-users to query only the variables of interest at scale.
The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.
Here we present the instructions for importing the data in R or Python environment. Please consult with the project repository for more information: http://github.com/irlcode/RFSD.
Importing The Data
You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo or rely on 🤗 Hugging Face Datasets library.
Python
🤗 Hugging Face Datasets
It is as easy as:
```python
from datasets import load_dataset
import polars as pl

# Load the entire dataset via the Hugging Face Datasets library
RFSD = load_dataset('irlspbru/RFSD')

# Or read a single year of interest directly into polars
RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')
```
Please note that the data is not shuffled within year, meaning that streaming first n rows will not yield a random sample.
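If an approximately random sample is needed while streaming, one workaround (not part of the official instructions; the split name, seed, and buffer size below are assumptions) is the Datasets library's buffered shuffle:

```python
# Hedged sketch: approximate random sampling from the streamed dataset via a
# buffered shuffle. Split name, seed, and buffer size are illustrative choices.
import itertools
from datasets import load_dataset

rfsd_stream = load_dataset('irlspbru/RFSD', split='train', streaming=True)
shuffled = rfsd_stream.shuffle(seed=42, buffer_size=10_000)
sample = list(itertools.islice(shuffled, 1_000))
```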
Local File Import
Importing in Python requires the pyarrow package to be installed.

```python
import pyarrow.dataset as ds
import polars as pl

# Open the dataset from the local Parquet files
RFSD = ds.dataset("local/path/to/RFSD")

# Check the schema
print(RFSD.schema)

# Load the entire dataset into memory
RFSD_full = pl.from_arrow(RFSD.to_table())

# Load only the 2019 data
RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

# Load only the 2019 revenue information
RFSD_2019_revenue = pl.from_arrow(
    RFSD.to_table(
        filter=ds.field('year') == 2019,
        columns=['inn', 'line_2110']
    )
)

# Give the variables descriptive names using the supplied dictionary
renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})
```
R
Local File Import
Importing in R requires the arrow package to be installed.

```r
library(arrow)
library(data.table)

# Open the dataset from the local Parquet files
RFSD <- open_dataset("local/path/to/RFSD")

# Check the schema
schema(RFSD)

# Load the entire dataset into memory
scanner <- Scanner$create(RFSD)
RFSD_full <- as.data.table(scanner$ToTable())

# Load only the 2019 data
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scanner <- scan_builder$Finish()
RFSD_2019 <- as.data.table(scanner$ToTable())

# Load only the 2019 revenue information
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scan_builder$Project(cols = c("inn", "line_2110"))
scanner <- scan_builder$Finish()
RFSD_2019_revenue <- as.data.table(scanner$ToTable())

# Give the variables descriptive names using the supplied dictionary
renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)
```
Use Cases
🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md
🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md
🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md
FAQ
Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?
To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.
What is the data period?
We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July, 2025 (see Version and Update Policy below).
Why are there no data for firm X in year Y?
Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:
We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).
Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023; Novatek filed only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.
A firm may have submitted its annual statement even though, according to the Uniform State Register of Legal Entities (EGRUL), it was not active in that year. We remove those filings.
Why is the geolocation of firm X incorrect?
We use Nominatim to geocode the structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to the house level in 2014 and 2021-2023, but only at the street level for 2015-2020 due to improper handling of the house number by Nominatim; in that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, the address of incorporation may not correspond to plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.
Why is the data for firm X different from https://bo.nalog.ru/?
Many firms submit correcting statements after the initial filing. While we have downloaded the data way past the April, 2024 deadline for 2023 filings, firms may have kept submitting the correcting statements. We will capture them in the future releases.
Why is the data for firm X unrealistic?
We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.
Why is the data for groups of companies different from their IFRS statements?
We should stress that we provide unconsolidated financial statements filed according to the Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups with this data. Gazprom, for instance, had over 800 affiliated entities and to study this corporate group in its entirety it is not enough to consider financials of the parent company.
Why is the data not in CSV?
The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.
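As a hedged illustration of that conditional subsetting, the sketch below mirrors the pyarrow pattern from the import examples to pull a single firm's revenue line across years; the tax ID value is a made-up placeholder and the column's stored type (string vs. integer) depends on the actual schema.

```python
# Sketch: read only one firm's revenue line (line_2110) across all years
# without loading the full dataset. The INN below is a placeholder.
import pyarrow.dataset as ds
import polars as pl

RFSD = ds.dataset('local/path/to/RFSD')
one_firm = pl.from_arrow(
    RFSD.to_table(
        filter=ds.field('inn') == '0000000000',  # replace with the firm's INN
        columns=['inn', 'year', 'line_2110'],    # keep only variables of interest
    )
)
print(one_firm)
```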
Version and Update Policy
Version (SemVer): 1.0.0.
We intend to update the RFSD annually as the data becomes available, in other words when most of the firms have their statements filed with the Federal Tax Service. The official deadline for filing of previous-year statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. Nevertheless, there is obviously a trade-off between data completeness and timely version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all the correcting filings. We plan to make a new version of the RFSD available by July.
Licence
Creative Commons License Attribution 4.0 International (CC BY 4.0).
Copyright © the respective contributors.
Citation
Please cite as:
@unpublished{bondarkov2025rfsd,
  title={{R}ussian {F}inancial {S}tatements {D}atabase},
  author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
  note={arXiv preprint arXiv:2501.05841},
  doi={https://doi.org/10.48550/arXiv.2501.05841},
  year={2025}
}
Acknowledgments and Contacts
Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru
Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,