8 datasets found
  1. R script for summary statistics and structural equation modelling

    • figshare.com
    txt
    Updated Feb 15, 2024
    Cite
    Eleanor Durrant; Marion Pfeifer (2024). R script for summary statistics and structural equation modelling [Dataset]. http://doi.org/10.6084/m9.figshare.25226258.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Feb 15, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Eleanor Durrant; Marion Pfeifer
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    R script used with the accompanying data frame 'plot_character' (included in the project) to calculate summary statistics and fit structural equation models.
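
    A minimal sketch of the kind of workflow such a script implements, assuming the lavaan package and hypothetical variable names (the actual script and model in this dataset may differ):

    library(lavaan)

    # Summary statistics for the plot-level data frame
    summary(plot_character)

    # Fit a toy structural equation model; 'richness', 'canopy' and 'soil'
    # are placeholder variable names, not columns from the actual data
    model <- "
      richness ~ canopy + soil
      canopy ~ soil
    "
    fit <- sem(model, data = plot_character)
    summary(fit, standardized = TRUE)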

  2. Chromosome 19 LD data for simulating summary statistics

    • search.datacite.org
    Updated May 30, 2019
    Cite
    Jean Morrison (2019). Chromosome 19 LD data for simulating summary statistics [Dataset]. http://doi.org/10.5281/zenodo.3235779
    Explore at:
    Dataset updated
    May 30, 2019
    Dataset provided by
    DataCite (https://www.datacite.org/)
    Zenodo (http://zenodo.org/)
    Authors
    Jean Morrison
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set contains two files, both of which contain R objects.

    chr19_snpdata_hm3only.RDS: A data frame with SNP information.

    evd_list_chr19_hm3.RDS: A list of eigen decompositions of the SNP correlation matrix spanning chromosome 19.

    These data contain only SNPs in both 1k Genomes and HapMap3. Correlation matrices were estimated using LD Shrink. These data were built for use with the causeSims R package found here: https://github.com/jean997/causeSims
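
    A minimal sketch of reading these objects into R (file names as listed above; downstream use with the causeSims package is not shown):

    # Read the SNP information data frame and the list of eigen decompositions
    snpdata  <- readRDS("chr19_snpdata_hm3only.RDS")
    evd_list <- readRDS("evd_list_chr19_hm3.RDS")

    # Inspect the SNP table and the first block's decomposition
    str(snpdata)
    str(evd_list[[1]])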

  3. Replication Data for: The Wikipedia Adventure: Field Evaluation of an...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Narayan, Sneha; Orlowitz, Jake; Morgan, Jonathan T.; Shaw, Aaron D.; Hill, Benjamin Mako (2023). Replication Data for: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users [Dataset]. http://doi.org/10.7910/DVN/6HPRIG
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Narayan, Sneha; Orlowitz, Jake; Morgan, Jonathan T.; Shaw, Aaron D.; Hill, Benjamin Mako
    Description

    This dataset contains the data and code necessary to replicate work in the following paper: Narayan, Sneha, Jake Orlowitz, Jonathan Morgan, Benjamin Mako Hill, and Aaron Shaw. 2017. “The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users.” In Proceedings of the 20th ACM Conference on Computer-Supported Cooperative Work & Social Computing (CSCW '17). New York, New York: ACM Press. http://dx.doi.org/10.1145/2998181.2998307

    The published paper contains two studies. Study 1 is a descriptive analysis of a survey of Wikipedia editors who played a gamified tutorial. Study 2 is a field experiment that evaluated the same tutorial. These are the data used in the field experiment described in Study 2.

    Description of Files

    This dataset contains the following files beyond this README:

    • twa.RData — An RData file that includes all variables used in Study 2.
    • twa_analysis.R — A GNU R script that includes all the code used to generate the tables and plots related to Study 2 in the paper.

    The RData file contains one variable (d), an R dataframe (i.e., table) with the following columns:

    • userid (integer): The unique numerical ID representing each user in our sample. These are 8-digit integers and describe public accounts on Wikipedia.
    • sample.date (date string): The day the user was recruited to the study, formatted as “YYYY-MM-DD”. For invitees, it is the date their invitation was sent. For users in the control group, it is the date they would have been invited to the study.
    • edits.all (integer): The total number of edits made by the user on Wikipedia in the 180 days after they joined the study. Edits to the user's user page, user talk page and subpages are ignored.
    • edits.ns0 (integer): The total number of edits made by the user to article pages on Wikipedia in the 180 days after they joined the study.
    • edits.talk (integer): The total number of edits made by the user to talk pages on Wikipedia in the 180 days after they joined the study. Edits to the user's user page, user talk page and subpages are ignored.
    • treat (logical): TRUE if the user was invited, FALSE if the user was in the control group.
    • play (logical): TRUE if the user played the game, FALSE if the user did not. All users in control are listed as FALSE because any user who had not been invited to the game but played was removed.
    • twa.level (integer): Takes a value of 0 if the user did not play the game; ranges from 1 to 7 for those who did, indicating the highest level they reached in the game.
    • quality.score (float): The average word persistence (over a 6-revision window) over all edits made by this userid. Our measure of word persistence (persistent word revision per word) is a measure of edit quality developed by Halfaker et al. that tracks how long words in an edit persist after subsequent revisions are made to the wiki page. For more information on how word persistence is calculated, see: Halfaker, Aaron, Aniket Kittur, Robert Kraut, and John Riedl. 2009. “A Jury of Your Peers: Quality, Experience and Ownership in Wikipedia.” In Proceedings of the 5th International Symposium on Wikis and Open Collaboration (OpenSym '09), 1–10. New York, New York: ACM Press. doi:10.1145/1641309.1641332. Or this page: https://meta.wikimedia.org/wiki/Research:Content_persistence

    How we created twa.RData

    The file twa.RData combines datasets drawn from three places:

    A dataset created by Wikimedia Foundation staff that tracked the details of the experiment and how far people got in the game. The variables userid, sample.date, treat, play, and twa.level were all generated in a dataset created by WMF staff when The Wikipedia Adventure was deployed. All users in the sample created their accounts within 2 days before the date they were entered into the study. None of them had received a Teahouse invitation, a Level 4 user warning, or been blocked from editing at the time that they entered the study. Additionally, all users made at least one edit after the day they were invited. Users were sorted randomly into treatment and control groups, based on which they either received or did not receive an invite to play The Wikipedia Adventure.

    Edit and text persistence data drawn from public XML dumps created on May 21st, 2015. We used publicly available XML dumps to generate the outcome variables, namely edits.all, edits.ns0, edits.talk and quality.score. We first extracted all edits made by users in our sample during the six-month period after they joined the study, excluding edits made to user pages or user talk pages. We parsed the XML dumps using the Python-based wikiq and MediaWikiUtilities software available at: http://projects.mako.cc/source/?p=mediawiki_dump_tools and https://github.com/mediawiki-utilities/python-mediawiki-utilities We o...

    Visit https://dataone.org/datasets/sha256%3Ab1240bda398e8fa311ac15dbcc04880333d5f3fbe67a7a951786da2d44e33018 for complete metadata about this dataset.
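
    A minimal sketch of loading and summarizing these data in R (column names taken from the description above; the full analysis lives in twa_analysis.R):

    # Load the study data frame `d` described above
    load("twa.RData")

    # Compare 180-day edit counts between invited and control users
    aggregate(cbind(edits.all, edits.ns0, edits.talk) ~ treat, data = d, FUN = mean)

    # Cross-tabulate invitation status against whether the user played the game
    table(d$treat, d$play)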

  4. Market Basket Analysis

    • kaggle.com
    zip
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    zip(23875170 bytes)Available download formats
    Dataset updated
    Dec 9, 2021
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions on the itemsets they are most likely to purchase. I was given a dataset containing a retailer's transaction data, which records all the transactions that happened over a period of time. The retailer will use the results to grow its business and to provide customers with suggestions on itemsets, so that we can increase customer engagement, improve customer experience, and identify customer behaviour. I will solve this problem using association rules, an unsupervised learning technique that checks for the dependency of one data item on another.

    Introduction

    Association rule mining is most often used when you want to find associations between different objects in a set. It works well for finding frequent patterns in a transaction database: it can tell you which items customers frequently buy together and allows the retailer to identify relationships between items.

    An Example of Association Rules

    Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat":

    • support = P(mouse & mat) = 8/100 = 0.08
    • confidence = support / P(computer mouse) = 0.08/0.10 = 0.8
    • lift = confidence / P(mouse mat) = 0.8/0.09 ≈ 8.9

    This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
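
    The same arithmetic written out in R:

    n_customers <- 100
    n_mouse     <- 10   # bought a computer mouse
    n_mat       <- 9    # bought a mouse mat
    n_both      <- 8    # bought both

    support    <- n_both / n_customers                 # 0.08
    confidence <- support / (n_mouse / n_customers)    # 0.80
    lift       <- confidence / (n_mat / n_customers)   # ~8.9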

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that it is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of rules

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of rows: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.

    Image: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png

    Libraries in R

    First, we need to load the required libraries. Each library is briefly described below, followed by a sketch of the loading step.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and item-sets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
    • tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.

    Image: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png
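
    A sketch of the library-loading step shown in the screenshot above (the exact set of calls in the original notebook may differ slightly):

    library(arules)
    library(arulesViz)
    library(tidyverse)
    library(readxl)
    library(plyr)
    library(ggplot2)
    library(knitr)
    library(magrittr)
    library(dplyr)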

    Data Pre-processing

    Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.

    Image: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
    Image: https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png

    Next, we clean the data frame by removing missing values.

    Image: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png

    To apply association rule mining, we need to convert the data frame into transaction data so that all items that are bought together in one invoice will be in ...
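
    A sketch of this conversion and the subsequent mining step, assuming the arules and readxl packages and the column names listed in the dataset description (thresholds are illustrative, not the notebook's exact values):

    library(readxl)
    library(arules)

    # Read the retail data and drop rows with missing values
    retaildata <- read_excel("Assignment-1_Data.xlsx")
    retaildata <- retaildata[complete.cases(retaildata), ]

    # Group item names by invoice (BillNo) and coerce to a transactions object
    baskets <- lapply(split(retaildata$Itemname, retaildata$BillNo), unique)
    trans   <- as(baskets, "transactions")

    # Mine association rules with the Apriori algorithm and inspect the top rules by lift
    rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5))
    inspect(head(sort(rules, by = "lift"), 10))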

  5. Supplement 1. R code demonstrating how to fit a logistic regression model,...

    • figshare.com
    • wiley.figshare.com
    html
    Updated Aug 9, 2016
    Cite
    David I. Warton; Francis K. C. Hui (2016). Supplement 1. R code demonstrating how to fit a logistic regression model, with a random intercept term, and how to use resampling-based hypothesis testing for inference. [Dataset]. http://doi.org/10.6084/m9.figshare.3550407.v1
    Explore at:
    htmlAvailable download formats
    Dataset updated
    Aug 9, 2016
    Dataset provided by
    Wiley
    Authors
    David I. Warton; Francis K. C. Hui
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    File List

    • glmmeg.R: R code demonstrating how to fit a logistic regression model, with a random intercept term, to randomly generated overdispersed binomial data.
    • boot.glmm.R: R code for estimating P-values by applying the bootstrap to a GLMM likelihood ratio statistic.

    Description

    glmmeg.R is example R code which shows how to fit a logistic regression model (with or without a random effects term) and use diagnostic plots to check the fit. The code is run on some randomly generated data, which are generated in such a way that overdispersion is evident. This code could be directly applied to your own analyses if you read into R a data.frame called “dataset”, which has columns labelled “success” and “failure” (for the number of binomial successes and failures) and “species” (a label for the different rows in the dataset), and where we want to test for the effect of some predictor variable called “location”. In other cases, just change the labels and formula as appropriate.

    boot.glmm.R extends glmmeg.R by using bootstrapping to calculate P-values in a way that provides better control of Type I error in small samples. It accepts data in the same form as that generated in glmmeg.R.
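
    A minimal sketch of the kind of model described above, using the lme4 package and the column names from the description (this is not the supplement's exact code):

    library(lme4)

    # 'dataset' is assumed to have columns success, failure, species and location
    fit_glm  <- glm(cbind(success, failure) ~ location,
                    family = binomial, data = dataset)
    fit_glmm <- glmer(cbind(success, failure) ~ location + (1 | species),
                      family = binomial, data = dataset)

    # Likelihood ratio comparison for the effect of location in the mixed model
    fit_null <- glmer(cbind(success, failure) ~ 1 + (1 | species),
                      family = binomial, data = dataset)
    anova(fit_null, fit_glmm)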

  6. Data from: Edge-bundled spatial layer to visualize mobility flows in Europe...

    • zenodo.org
    • data-staging.niaid.nih.gov
    bin, png
    Updated Dec 19, 2024
    Cite
    Oula Inkeröinen; Oula Inkeröinen; Tuomas Väisänen; Tuomas Väisänen; Olle Järv; Olle Järv (2024). Edge-bundled spatial layer to visualize mobility flows in Europe on NUTS 2 level [Dataset]. http://doi.org/10.5281/zenodo.14380383
    Explore at:
    png, binAvailable download formats
    Dataset updated
    Dec 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Oula Inkeröinen; Oula Inkeröinen; Tuomas Väisänen; Tuomas Väisänen; Olle Järv; Olle Järv
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Europe
    Description

    Description of edge-bundled spatial layer

    This repository contains a GeoPackage of edge-bundled line geometries between the centroids of all NUTS 2 regions (https://ec.europa.eu/eurostat/web/gisco/geodata/statistical-units/territorial-units-statistics) in continental Europe. The centroids of the NUTS 2 regions are derived from the 2021 version of the regions. The spatial layer contains just the edge-bundled lines and no values for the flows. The coordinate reference system used is ETRS89-extended / LAEA Europe (EPSG:3035, https://epsg.io/3035), commonly used by the European Union.

    This data is made to support the visualization of complex origin-destination matrix mobility data on the NUTS 2 level in Europe. Straight line geometries between origin and destination points can lose their legibility when the number of flows gets high.

    Usage

    To use the spatial layer, combine the provided GeoPackage with your origin-destination matrix data, such as migration, student exchange, or other flow data. The edge-bundled lines have a directionality-preserving column (OD_ID) for joining the flows. This can be done in QGIS/ArcGIS with a table join or in R/Python with a data frame merge, as sketched below.
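
    A sketch of that merge in R, assuming the sf package; the GeoPackage file name, the flow table, and its n_trips column are illustrative placeholders:

    library(sf)

    # Read the edge-bundled line layer from the GeoPackage (hypothetical file name)
    edges <- st_read("edge_bundled_nuts2.gpkg")

    # Your origin-destination data with columns OD_ID and n_trips (illustrative)
    flows <- read.csv("my_od_flows.csv")

    # Join the flow values onto the line geometries via the directionality-preserving OD_ID
    edges_with_flows <- merge(edges, flows, by = "OD_ID", all.x = TRUE)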

    Data structure

    • fid: Unique identifier for a row in the data. Integer (64 bit).
    • orig_nuts: The NUTS 2 code of the origin. String.
    • dest_nuts: The NUTS 2 code of the destination. String.
    • OD_ID: Unique identifier for the mobility flow, built from the NUTS 2 codes of origin and destination (e.g., FI1B_DK03). String.

    Production code

    The spatial layer was produced by the Edge-bundling tool for regional mobility flow data (https://doi.org/10.5281/zenodo.14532547), which is a fork of a similar tool by Ondrej Peterka (2024), which in turn is based on the work of Wallinger et al. (2022).

    References

    Peterka, O. (2024). Xpeterk1/edge-path-bundling [Python, C++]. https://github.com/xpeterk1/edge-path-bundling (Original work published 2023)
    Wallinger, M., Archambault, D., Auber, D., Nöllenburg, M., & Peltonen, J. (2022). Edge-Path Bundling: A Less Ambiguous Edge Bundling Approach. IEEE Transactions on Visualization and Computer Graphics, 28(1), 313–323. https://doi.org/10.1109/TVCG.2021.3114795
  7. Data from: Russian Financial Statements Database: A firm-level collection of...

    • data.niaid.nih.gov
    Updated Mar 14, 2025
    + more versions
    Cite
    Bondarkov, Sergey; Ledenev, Victor; Skougarevskiy, Dmitriy (2025). Russian Financial Statements Database: A firm-level collection of the universe of financial statements [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14622208
    Explore at:
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    European University at St. Petersburg
    Authors
    Bondarkov, Sergey; Ledenev, Victor; Skougarevskiy, Dmitriy
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:

    • 🔓 First open data set with information on every active firm in Russia.

    • 🗂️ First open financial statements data set that includes non-filing firms.

    • 🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.

    • 📅 Covers 2011-2023 initially, will be continuously updated.

    • 🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.

    The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in Apache Parquet, a structured, column-oriented, compressed binary format, with a yearly partitioning scheme, enabling end-users to query only the variables of interest at scale.

    The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.

    Here we present the instructions for importing the data in an R or Python environment. Please consult the project repository for more information: http://github.com/irlcode/RFSD.

    Importing The Data

    You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo, or rely on the 🤗 Hugging Face Datasets library.

    Python

    🤗 Hugging Face Datasets

    It is as easy as:

    from datasets import load_dataset
    import polars as pl

    This line will download 6.6GB+ of all RFSD data and store it in a 🤗 cache folder

    RFSD = load_dataset('irlspbru/RFSD')

    Alternatively, this will download ~540MB with all financial statements for 2023 to a Polars DataFrame (requires about 8GB of RAM)

    RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')

    Please note that the data is not shuffled within year, meaning that streaming first n rows will not yield a random sample.

    Local File Import

    Importing in Python requires pyarrow package installed.

    import pyarrow.dataset as ds
    import polars as pl

    Read RFSD metadata from local file

    RFSD = ds.dataset("local/path/to/RFSD")

    Use RFSD.schema to glimpse the data structure and columns' classes

    print(RFSD.schema)

    Load full dataset into memory

    RFSD_full = pl.from_arrow(RFSD.to_table())

    Load only 2019 data into memory

    RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

    Load only revenue for firms in 2019, identified by taxpayer id

    RFSD_2019_revenue = pl.from_arrow(
        RFSD.to_table(
            filter=ds.field('year') == 2019,
            columns=['inn', 'line_2110']
        )
    )

    Give suggested descriptive names to variables

    renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
    RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})

    R

    Local File Import

    Importing in R requires arrow package installed.

    library(arrow)
    library(data.table)

    Read RFSD metadata from local file

    RFSD <- open_dataset("local/path/to/RFSD")

    Use schema() to glimpse into the data structure and column classes

    schema(RFSD)

    Load full dataset into memory

    scanner <- Scanner$create(RFSD)
    RFSD_full <- as.data.table(scanner$ToTable())

    Load only 2019 data into memory

    scan_builder <- RFSD$NewScan()
    scan_builder$Filter(Expression$field_ref("year") == 2019)
    scanner <- scan_builder$Finish()
    RFSD_2019 <- as.data.table(scanner$ToTable())

    Load only revenue for firms in 2019, identified by taxpayer id

    scan_builder <- RFSD$NewScan()
    scan_builder$Filter(Expression$field_ref("year") == 2019)
    scan_builder$Project(cols = c("inn", "line_2110"))
    scanner <- scan_builder$Finish()
    RFSD_2019_revenue <- as.data.table(scanner$ToTable())

    Give suggested descriptive names to variables

    renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
    setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)

    Use Cases

    🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md

    🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md

    🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md

    FAQ

    Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?

    To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.

    What is the data period?

    We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July, 2025 (see Version and Update Policy below).

    Why are there no data for firm X in year Y?

    Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:

    We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).

    Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, Novatek only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.

    A firm may have submitted its annual statement but, according to the Uniform State Register of Legal Entities (EGRUL), was not active in that year. We remove such filings.

    Why is the geolocation of firm X incorrect?

    We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to a house level in 2014 and 2021-2023, but only at street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, address of incorporation may not correspond with plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.

    Why is the data for firm X different from https://bo.nalog.ru/?

    Many firms submit correcting statements after the initial filing. While we have downloaded the data way past the April, 2024 deadline for 2023 filings, firms may have kept submitting the correcting statements. We will capture them in the future releases.

    Why is the data for firm X unrealistic?

    We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.

    Why is the data for groups of companies different from their IFRS statements?

    We should stress that we provide unconsolidated financial statements filed according to the Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups with this data. Gazprom, for instance, had over 800 affiliated entities and to study this corporate group in its entirety it is not enough to consider financials of the parent company.

    Why is the data not in CSV?

    The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.

    Version and Update Policy

    Version (SemVer): 1.0.0.

    We intend to update the RFSD annually as the data becomes available, in other words when most of the firms have their statements filed with the Federal Tax Service. The official deadline for filing of previous-year statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. Nevertheless, there is obviously a trade-off between data completeness and version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all the correcting filings. We plan to make a new version of the RFSD available by July.

    Licence

    Creative Commons License Attribution 4.0 International (CC BY 4.0).

    Copyright © the respective contributors.

    Citation

    Please cite as:

    @unpublished{bondarkov2025rfsd,
      title={{R}ussian {F}inancial {S}tatements {D}atabase},
      author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
      note={arXiv preprint arXiv:2501.05841},
      doi={https://doi.org/10.48550/arXiv.2501.05841},
      year={2025}
    }

    Acknowledgments and Contacts

    Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru

    Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,

  8. Hitters Baseball Data

    • kaggle.com
    zip
    Updated Jul 11, 2020
    + more versions
    Cite
    Mehmet Akturk (2020). Hitters Baseball Data [Dataset]. https://www.kaggle.com/mathchi/hitters-baseball-data
    Explore at:
    zip(9173 bytes)Available download formats
    Dataset updated
    Jul 11, 2020
    Authors
    Mehmet Akturk
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Baseball Data

    Description

    Major League Baseball Data from the 1986 and 1987 seasons.

    Usage

    Hitters

    Format

    A data frame with 322 observations of major league players on the following 20 variables.

    • AtBat: Number of times at bat in 1986

    • Hits: Number of hits in 1986

    • HmRun: Number of home runs in 1986

    • Runs: Number of runs in 1986

    • RBI: Number of runs batted in in 1986

    • Walks: Number of walks in 1986

    • Years: Number of years in the major leagues

    • CAtBat: Number of times at bat during his career

    • CHits: Number of hits during his career

    • CHmRun: Number of home runs during his career

    • CRuns: Number of runs during his career

    • CRBI: Number of runs batted in during his career

    • CWalks: Number of walks during his career

    • League: A factor with levels A and N indicating player's league at the end of 1986

    • Division: A factor with levels E and W indicating player's division at the end of 1986

    • PutOuts: Number of put outs in 1986

    • Assists: Number of assists in 1986

    • Errors: Number of errors in 1986

    • Salary: 1987 annual salary on opening day in thousands of dollars

    • NewLeague: A factor with levels A and N indicating player's league at the beginning of 1987

    Source

    This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. This is part of the data that was used in the 1988 ASA Graphics Section Poster Session. The salary data were originally from Sports Illustrated, April 20, 1987. The 1986 and career statistics were obtained from The 1987 Baseball Encyclopedia Update published by Collier Books, Macmillan Publishing Company, New York.

    References

    James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with Applications in R, www.StatLearning.com, Springer-Verlag, New York

    Examples

    summary(Hitters)

    lm(Salary ~ AtBat + Hits, data = Hitters)

    Dataset imported from https://www.r-project.org.

