Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General overview
The following datasets are described by this metadata record, and are available for download from the provided URL.
- Raw log files, physical parameters raw log files
- Raw excel files, respiration/PAM chamber raw excel spreadsheets
- Processed and cleaned excel files, respiration chamber biomass data
- Raw rapid light curve excel files (this is duplicated from Raw log files), combined dataset pH, temperature, oxygen, salinity, velocity for experiment
- Associated R script file for pump cycles of respiration chambers
####
Physical parameters raw log files
Raw log files
1) DATE=
2) Time= UTC+11
3) PROG=Automated program to control sensors and collect data
4) BAT=Amount of battery remaining
5) STEP=check Aquation manual
6) SPIES=check Aquation manual
7) PAR=Photosynthetically active radiation
8) Levels=check Aquation manual
9) Pumps=program for pumps
10) WQM=check Aquation manual
####
Respiration/PAM chamber raw excel spreadsheets
Abbreviations in headers of datasets
Note: Two datasets are provided in different formats: raw and cleaned (adj). They contain the same data, except that in the cleaned version the PAR column has been moved to PAR.all for analysis. All headers are the same. The cleaned (adj) data frame will work with the R syntax below; alternatively, add code to do the cleaning in R.
Date: ISO 8601 (to be checked)
Time:UTC+11 unless otherwise stated
DATETIME: UTC+11 unless otherwise stated
ID (of instrument in respiration chambers)
ID43=Pulse amplitude fluorescence measurement of control
ID44=Pulse amplitude fluorescence measurement of acidified chamber
ID1=Dissolved oxygen
ID2=Dissolved oxygen
ID3=PAR
ID4=PAR
PAR=Photosynthetically active radiation (umols)
F0=Minimum fluorescence from PAM
Fm=Maximum fluorescence from PAM
Yield=(Fm - F0)/Fm
rChl=An estimate of chlorophyll (note: this is uncalibrated and is an estimate only)
Temp=Temperature (degrees C)
PAR=Photosynthetically active radiation
PAR2=Photosynthetically active radiation 2
DO=Dissolved oxygen
%Sat=Saturation of dissolved oxygen
Notes=This is the program of the underwater submersible logger, with the following abbreviations:
Notes-1) PAM=
Notes-2) PAM=Gain level set (see Aquation manual for more detail)
Notes-3) Acclimatisation=Program of slowly introducing treatment water into chamber
Notes-4) Shutter start up 2 sensors+sample…=Shutter PAM automatic set-up procedure (see Aquation manual)
Notes-5) Yield step 2=PAM yield measurement and calculation of control
Notes-6) Yield step 5=PAM yield measurement and calculation of acidified
Notes-7) Abatus respiration DO and PAR step 1=Program to measure dissolved oxygen and PAR (see Aquation manual). Steps 1-4 are different stages of this program, including pump cycles and DO and PAR measurements.
Notes-8) Rapid light curve data
Pre LC: A yield measurement prior to the following measurement
After 10.0 sec at 0.5% to 8%: Level of each of the 8 steps of the rapid light curve
Odessey PAR (only in some deployments): An extra measure of PAR (umols) using an Odyssey data logger
Dataflow PAR: An extra measure of PAR (umols) using a Dataflow sensor.
PAM PAR: This is copied from the PAR or PAR2 column
PAR all: This is the complete PAR file and should be used
Deployment: Identifying which deployment the data came from
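The relationship between the F0, Fm and Yield headers can be sketched as follows. This is a minimal illustration that assumes the standard PAM definition of yield, (Fm - F0)/Fm; the function name `pam_yield` and the sample readings are hypothetical and are not taken from the dataset.

```python
def pam_yield(f0: float, fm: float) -> float:
    """Return the PAM quantum yield, (Fm - F0) / Fm.

    Assumes the standard definition; Fm must be positive.
    """
    if fm <= 0:
        raise ValueError("Fm must be positive")
    return (fm - f0) / fm


# Hypothetical readings, used only to show the call.
example_yield = pam_yield(f0=200.0, fm=500.0)
print(round(example_yield, 3))
```

Recomputing the yield this way may be a useful consistency check against the Yield column reported by the logger.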
####
Respiration chamber biomass data
The data are chlorophyll a biomass measurements from cores taken from the respiration chambers. The headers are:
- Depth (mm)
- Treat (Acidified or control)
- Chl a (pigment and indicator of biomass)
- Core (5 cores were collected from each chamber; three were analysed for chl a)
The cores are pseudoreplicates/subsamples from the chambers and should not be treated as replicates.
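One way to respect the pseudoreplication warning is to average the cores within each treatment and depth before any further analysis. The sketch below assumes rows keyed by the headers described above (Treat, Depth, Core, Chl a); the function name `mean_chl_per_group` and the example values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_chl_per_group(rows):
    """Average Chl a over cores within each (Treat, Depth) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["Treat"], row["Depth"])].append(row["Chl a"])
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical rows, not taken from the dataset.
rows = [
    {"Treat": "Control", "Depth": 1, "Core": 1, "Chl a": 10.0},
    {"Treat": "Control", "Depth": 1, "Core": 2, "Chl a": 14.0},
    {"Treat": "Acidified", "Depth": 1, "Core": 1, "Chl a": 9.0},
]
print(mean_chl_per_group(rows))
```

The result has one value per treatment and depth, so the subsamples are not counted as independent observations.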
####
Associated R script file for pump cycles of respiration chambers
These data identify the times when the respiration chamber pumps delivered treatment water to the chambers, as determined from the Aquation log files (see associated files). Use the chamber cut times to determine net production rates. Note: Users need to avoid the times when the pumps are delivering water to the chambers, as including them will give incorrect results. The headers used in the associated R file are start regression and end regression. The remaining headers are not used unless called for in the associated R script. The last columns of these datasets (intercept, ElapsedTimeMincoef) are determined from the linear regressions described below.
To determine the rate of change of net production, the coefficients of linear regressions of oxygen consumption against elapsed time were determined for discrete 180-minute data blocks. R squared values for these fitted regressions were consistently high (greater than 0.9). We make two assumptions in calculating net production rates: the first is that heterotrophic community members do not change their metabolism under OA; the second is that the heterotrophic communities are similar between treatments.
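The block-wise regression described above can be sketched in plain Python. This is an illustration under stated assumptions, not the associated R script: it assumes the readings have already been restricted to the windows between start regression and end regression, that time is expressed as elapsed minutes, and that blocks are consecutive 180-minute windows. The names `linear_fit` and `fit_blocks` and the example readings are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def linear_fit(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need at least two paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # A perfectly flat series is fitted exactly, so report 1.0 in that case.
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return intercept, slope, r_squared


def fit_blocks(
    minutes: Sequence[float],
    oxygen: Sequence[float],
    block_minutes: float = 180.0,
) -> Dict[int, Tuple[float, float, float]]:
    """Fit one regression per consecutive block of block_minutes."""
    blocks: Dict[int, Tuple[List[float], List[float]]] = {}
    for t, o in zip(minutes, oxygen):
        times, values = blocks.setdefault(int(t // block_minutes), ([], []))
        times.append(t)
        values.append(o)
    return {
        index: linear_fit(times, values)
        for index, (times, values) in sorted(blocks.items())
        if len(times) >= 2
    }


# Hypothetical dissolved oxygen readings every 30 minutes over 6 hours.
minutes = [float(m) for m in range(0, 360, 30)]
oxygen = [8.0 - 0.01 * m if m < 180 else 6.2 + 0.005 * (m - 180) for m in minutes]
for index, (intercept, slope, r_squared) in fit_blocks(minutes, oxygen).items():
    print(index, round(intercept, 3), round(slope, 4), round(r_squared, 3))
```

The slope of each block plays the role of the ElapsedTimeMincoef column, and the R squared value can be compared against the 0.9 threshold mentioned above.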
####
Combined dataset pH, temperature, oxygen, salinity, velocity for experiment
This data is rapid light curve data generated from a Shutter PAM fluorimeter. There are eight steps in each rapid light curve. Note: The software component of the Shutter PAM fluorimeter for sensor 44 appeared to be damaged and would not cycle through the PAR cycles. Therefore the rapid light curves and recovery curves should only be used for the control chambers (sensor ID43).
The headers are
PAR: Photosynthetically active radiation
relETR: F0/Fm x PAR
Notes: Stage/step of light curve
Treatment: Acidified or control
The light levels associated with each stage are listed below. Each actinic light intensity is held for 10 seconds, then a saturating pulse is taken (see PAM methods).
After 10.0 sec at 0.5% = 1 umols PAR
After 10.0 sec at 0.7% = 1 umols PAR
After 10.0 sec at 1.1% = 0.96 umols PAR
After 10.0 sec at 1.6% = 4.32 umols PAR
After 10.0 sec at 2.4% = 4.32 umols PAR
After 10.0 sec at 3.6% = 8.31 umols PAR
After 10.0 sec at 5.3% = 15.78 umols PAR
After 10.0 sec at 8.0% = 25.75 umols PAR
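When working with the Notes column, the light levels listed above can be held in a small lookup table. The sketch below copies the eight values from that list; the names `STEP_PAR` and `par_for_note` are hypothetical, and the label format is assumed to match the "After 10.0 sec at 0.5%" pattern shown above.

```python
import re
from typing import Optional

# Step intensity (%) -> PAR (umols), copied from the list above.
STEP_PAR = {
    0.5: 1.0,
    0.7: 1.0,
    1.1: 0.96,
    1.6: 4.32,
    2.4: 4.32,
    3.6: 8.31,
    5.3: 15.78,
    8.0: 25.75,
}

# Captures the percentage in labels such as "After 10.0 sec at 0.5%".
_STEP_PATTERN = re.compile(r"at\s+(\d+(?:\.\d+)?)\s*%")


def par_for_note(note: str) -> Optional[float]:
    """Return the PAR level for a light curve step label, or None if unknown."""
    match = _STEP_PATTERN.search(note)
    if match is None:
        return None
    return STEP_PAR.get(float(match.group(1)))


print(par_for_note("After 10.0 sec at 3.6%"))
print(par_for_note("Pre LC"))
```

Labels that do not describe a light step, such as Pre LC, return None rather than raising an error.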
This dataset appears to be missing data; note that the D5 rows may not contain usable information.
See the word document in the download file for more information.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:
- First open data set with information on every active firm in Russia.
- First open financial statements data set that includes non-filing firms.
- Sourced from two official data providers: the Rosstat and the Federal Tax Service.
- Covers 2011-2023 initially, will be continuously updated.
- Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.
The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in Apache Parquet, a structured, column-oriented, compressed binary format, with a yearly partitioning scheme, enabling end-users to query only variables of interest at scale.
The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.
Here we present the instructions for importing the data into an R or Python environment. Please consult the project repository for more information: http://github.com/irlcode/RFSD.
Importing The Data
You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo, or rely on the 🤗 Hugging Face Datasets library.
Python
🤗 Hugging Face Datasets
It is as easy as:
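from datasets import load_dataset
import polars as pl

# Download the full RFSD into the Hugging Face cache folder
RFSD = load_dataset('irlspbru/RFSD')

# Alternatively, read only the 2023 partition into a Polars DataFrame
RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')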
Please note that the data is not shuffled within a year, meaning that streaming the first n rows will not yield a random sample.
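If a random sample is needed from a stream that cannot be shuffled, one standard option is reservoir sampling (Algorithm R), which keeps a uniform sample of k items while reading the stream once. The sketch below is generic Python and is not part of the RFSD tooling; the function name `reservoir_sample` is hypothetical, and it accepts any iterable of rows.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Draw k items uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical stream of row identifiers standing in for dataset rows.
sample = reservoir_sample(range(1_000_000), k=5, seed=42)
print(len(sample))
```

Note that the whole stream still has to be read once, so this trades download time for memory rather than avoiding the full pass.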
Local File Import
Importing in Python requires the pyarrow package to be installed.
import pyarrow.dataset as ds
import polars as pl

# Open the dataset without loading it into memory
RFSD = ds.dataset("local/path/to/RFSD")
print(RFSD.schema)

# Load the full dataset into memory
RFSD_full = pl.from_arrow(RFSD.to_table())

# Load only the 2019 data into memory
RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

# Load only the firm identifier (inn) and revenue (line_2110) for 2019
RFSD_2019_revenue = pl.from_arrow(
    RFSD.to_table(
        filter=ds.field('year') == 2019,
        columns=['inn', 'line_2110']
    )
)

# Apply the suggested descriptive variable names
renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})
R
Local File Import
Importing in R requires the arrow package to be installed.
library(arrow)
library(data.table)

# Open the dataset without loading it into memory
RFSD <- open_dataset("local/path/to/RFSD")
schema(RFSD)

# Load the full dataset into memory
scanner <- Scanner$create(RFSD)
RFSD_full <- as.data.table(scanner$ToTable())

# Load only the 2019 data into memory
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scanner <- scan_builder$Finish()
RFSD_2019 <- as.data.table(scanner$ToTable())

# Load only the firm identifier (inn) and revenue (line_2110) for 2019
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scan_builder$Project(cols = c("inn", "line_2110"))
scanner <- scan_builder$Finish()
RFSD_2019_revenue <- as.data.table(scanner$ToTable())

# Apply the suggested descriptive variable names
renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)
Use Cases
- For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024): interest_payments.md
- For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023): tfp.md
- For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses: spatialization.md
FAQ
Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?
To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.
What is the data period?
We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July 2025 (see Version and Update Policy below).
Why are there no data for firm X in year Y?
Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:
- We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).
- Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, and Novatek only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.
- A firm may have submitted its annual statement but, according to the Uniform State Register of Legal Entities (EGRUL), it was not active in that year. We remove those filings.
Why is the geolocation of firm X incorrect?
We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded to house level in 2014 and 2021-2023, but only to street level for 2015-2020 due to improper handling of the house number by Nominatim. In such cases we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, the address of incorporation may not correspond to plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.
Why is the data for firm X different from https://bo.nalog.ru/?
Many firms submit correcting statements after the initial filing. Although we downloaded the data well past the April 2024 deadline for 2023 filings, firms may have kept submitting correcting statements. We will capture them in future releases.
Why is the data for firm X unrealistic?
We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.
Why is the data for groups of companies different from their IFRS statements?
We should stress that we provide unconsolidated financial statements filed according to the Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups with this data. Gazprom, for instance, had over 800 affiliated entities and to study this corporate group in its entirety it is not enough to consider financials of the parent company.
Why is the data not in CSV?
The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.
Version and Update Policy
Version (SemVer): 1.0.0.
We intend to update the RFSD annually as the data becomes available, in other words when most of the firms have their statements filed with the Federal Tax Service. The official deadline for filing the previous year's statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. Nevertheless, there is obviously a trade-off between data completeness and version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all the correcting filings. We plan to make a new version of the RFSD available by July.
Licence
Creative Commons License Attribution 4.0 International (CC BY 4.0).
Copyright © the respective contributors.
Citation
Please cite as:
@unpublished{bondarkov2025rfsd,
  title={{R}ussian {F}inancial {S}tatements {D}atabase},
  author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
  note={arXiv preprint arXiv:2501.05841},
  doi={https://doi.org/10.48550/arXiv.2501.05841},
  year={2025}
}
Acknowledgments and Contacts
Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru
Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,