Market basket analysis with Apriori algorithm
A retailer wants to target customers with suggestions for itemsets they are most likely to purchase. I was given a retailer's dataset in which the transaction data covers all transactions that occurred over a period of time. The retailer will use the results to grow its business: by suggesting relevant itemsets to customers, it can increase customer engagement, improve the customer experience, and better understand customer behavior. I will solve this problem using association rules, an unsupervised learning technique that checks for the dependency of one data item on another.
Association rule mining is most useful when you want to discover associations between different objects in a set, i.e. to find frequent patterns in a transaction database. It can tell you which items customers frequently buy together, allowing the retailer to identify relationships between items.
Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat":
- support = P(mouse & mat) = 8/100 = 0.08
- confidence = support / P(computer mouse) = 0.08/0.10 = 0.80
- lift = confidence / P(mouse mat) = 0.80/0.09 ≈ 8.9
This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
Number of Attributes: 7
First, we need to load the required libraries; each is described briefly below.
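The library calls appear only as a screenshot in the original post, so the following is a minimal sketch of a plausible setup; the exact package list is an assumption.

library(readxl)    # read the .xlsx source file
library(dplyr)     # data cleaning and manipulation
library(arules)    # transaction objects and the apriori() function
library(arulesViz) # visualizing the resulting rules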
Next, we need to load Assignment-1_Data.xlsx into R and read the dataset. Then we can inspect the data in R.
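The loading step is likewise shown only as a screenshot; a sketch with the readxl package would look like this (the sheet layout is an assumption):

retail <- read_excel("Assignment-1_Data.xlsx")
str(retail)  # inspect the column types
head(retail) # preview the first rows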
Next we will clean our data frame by removing missing values.
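A sketch of this cleaning step follows; the column names BillNo and Itemname are assumptions based on typical versions of this dataset.

retail <- retail %>%
  filter(!is.na(BillNo), !is.na(Itemname)) %>% # drop rows with missing invoice or item
  distinct()                                   # drop duplicate rows
sum(is.na(retail)) # confirm that no missing values remain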
To apply association rule mining, we need to convert the data frame into transaction data, so that all items bought together on one invoice will be in ...
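The text is truncated here, but the usual next step with arules is to split the items by invoice and run apriori(); in this sketch the column names are the assumptions above and the thresholds are illustrative only.

trans <- as(split(retail$Itemname, retail$BillNo), "transactions")
rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.8))
inspect(head(sort(rules, by = "lift"), 10)) # strongest rules by lift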
This data release provides comprehensive results of monotonic trend assessment for long-term U.S. Geological Survey (USGS) streamgages in or proximal to the watersheds of Mobile and Perdido Bays, south-central United States (Tatum and others, 2024). Long-term is defined as streamgages having at least five complete decades of daily streamflow data since January 1, 1950, exclusive to those streamgages also having the entire 2010s decade represented. Input data for the trend assessment are daily streamflow data retrieved on March 8, 2024 (U.S. Geological Survey, 2024) and formatted using the fill_dvenv() function in akqdecay (Crowley-Ornelas and others, 2024).

Monotonic trends were assessed for each of 69 streamgages using 26 Mann-Kendall hypothesis tests for 20 hydrologic metrics understood as particularly useful in ecological studies (Henriksen and others, 2006), with another 6 metrics measuring well-known streamflow properties, such as annual harmonic mean streamflow (Asquith and Heitmuller, 2008) and annual mean streamflow with decadal flow-duration curve quantiles (10th, 50th, and 90th percentiles) (Crowley-Ornelas and others, 2023). Helsel and others (2020) provide background and description of the Mann-Kendall hypothesis test. Some of the trend analyses are based on the annual values of a hydrologic metric (calendar year is the time interval for the test) whereas others are decadal (decade is the time interval for the test). The principal result output for this data release (monotrnd_1hyp.txt) clearly distinguishes the time interval for the respective tests.

This data release includes the computational workflow to conduct the hypothesis testing and the requisite data manipulations. The workflow comprises the core computation script monotrnd_script.R and an auxiliary script containing functions for 20 ecological flow metrics: monotrnd_script.R requires additional functions to be loaded into the R workspace and therefore sources the file monotrnd_ecomets_include.R. This design usefully isolates the 20 ecological-oriented hydrologic metrics (subroutines) (logic and nomenclature therein is informed by Henriksen and others, 2006) from the streamgage-looping workflow and other data-manipulation features in monotrnd_script.R. The script monotrnd_script.R is designed to use time series of daily mean streamflow stored in an R environment data object, using the streamgage identification number as the key and a data frame (table) of the daily streamflows in the format defined by the dvget() function and filled by the fill_dvenv() function of the akqdecay R package (see supplemental information section; Crowley-Ornelas and others, 2024). Additionally, monotrnd_script.R tags a specific subset of streamgages within the workflow, identified by the authors as "major nodes," with a binary indicator (1 or 0) to support targeted analyses on these selected locations.

The data in file monotrnd_1hyp.txt are comma-delimited results of Kendall tau or other test statistics and p-values of the Mann-Kendall hypothesis tests as part of monotonic trend assessment for 69 USGS streamgages using 26 Mann-Kendall hypothesis tests on a variety of streamflow metrics.
The data include USGS streamgage identification numbers with a prepended "S" character, decimal latitudes and longitudes for the streamgage locations, the range of calendar years and decades of streamflow processed along with integer counts of the number of calendar years and decades, and the Kendall tau (or other test statistic) and associated p-value for the 26 streamflow metrics considered. Broadly, the "left side of the table" presents the results for the tests on metrics using calendar-year time steps, and the "right side of the table" presents the results for the tests on metrics using decade time steps. The content of the file does not assign or draw conclusions on statistical significance because the p-values are provided. The file monotrnd_dictionary_1hyp.txt is a simple plain-text, pipe-delimited file of directly human-readable short definitions for the columns in monotrnd_1hyp.txt. (This dictionary and two others accompany this data release to facilitate potential reuse of information by some users.) The source of monotrnd_1hyp.txt stems from ending computational steps in script monotrnd_script.R. Short summaries synthesizing information in file monotrnd_1hyp.txt are available in files monotrnd_3cnt.txt and monotrnd_2stn.txt, also accompanying this data release.

The data in file monotrnd_2stn.txt are comma-delimited summaries, by streamgage identification number, of the monotonic trend assessments for 26 Mann-Kendall hypothesis tests on streamflow metrics as described elsewhere in this data release. The summary data are composed of records (rows) by streamgage that include columns of (1) streamgage identification numbers with a prepended "S" character, (2) decimal latitudes and longitudes for the streamgage locations, (3) the integer count of the number of hypothesis tests, (4) the integer count of the number of tests for which the computed hypothesis test p-values are less than the 0.05 level of statistical significance (so-called alpha = 0.05), and (5) colon-delimited strings of alphanumeric characters identifying each of the statistically significant tests for the respective streamgage. The file monotrnd_dictionary_2stn.txt is a simple plain-text, pipe-delimited file of directly human-readable short definitions for the columns in monotrnd_2stn.txt. The source of monotrnd_2stn.txt stems from ending computational steps in script monotrnd_script.R described elsewhere in this data release, from its production of monotrnd_1hyp.txt; the latter data file provides the values used to assemble monotrnd_2stn.txt.

The information in file monotrnd_3cnt.txt consists of comma-delimited summaries of the arithmetic means of Kendall tau (or other test statistics) as well as integer counts of statistically significant trends, as part of monotonic trend assessment using 26 Mann-Kendall hypothesis tests on a variety of streamflow metrics for 69 USGS streamgages as described elsewhere in this data release. The two-column summary data are composed of a first row indicating, as a character string, the integer number of streamgages (69), and then subsequent rows in pairs: a three-decimal character-string representation of the mean Kendall tau (or the test statistic of a seasonal Mann-Kendall test) followed by a character string of the integer count of statistically significant tests for the respective test as it was applied to the 69 streamgages. Statistical significance is defined as p-values less than the 0.05 level of statistical significance (so-called alpha = 0.05).
The file monotrnd_dictionary_3cnt.txt is a simple plain-text, pipe-delimited file of directly human-readable short definitions for the columns in monotrnd_3cnt.txt. The source of monotrnd_3cnt.txt stems from ending computational steps in script monotrnd_script.R described elsewhere in this data release, from its production of monotrnd_1hyp.txt; the latter data file provides the values used to assemble monotrnd_3cnt.txt.
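To illustrate the kind of test reported in monotrnd_1hyp.txt, here is a minimal, hypothetical sketch of a single Mann-Kendall trend test in R using cor.test(); the placeholder data and column names are assumptions, and this is not the actual monotrnd_script.R workflow.

annual <- data.frame(year = 1950:2019,
                     Qmean = runif(70, 10, 100)) # placeholder annual mean streamflow
mk <- cor.test(annual$year, annual$Qmean, method = "kendall")
mk$estimate # Kendall tau
mk$p.value  # p-value of the trend test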
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
#https://www.kaggle.com/c/facial-keypoints-detection/details/getting-started-with-r
#################################
###Variables for downloaded files
data.dir <- ' '
train.file <- paste0(data.dir, 'training.csv')
test.file <- paste0(data.dir, 'test.csv')
#################################
###Load csv -- creates a data.frame where each column can have a different type.
d.train <- read.csv(train.file, stringsAsFactors = F)
d.test <- read.csv(test.file, stringsAsFactors = F)
###In training.csv, we have 7049 rows, each one with 31 columns. ###The first 30 columns are keypoint locations, which R correctly identified as numbers. ###The last one is a string representation of the image, identified as a string.
###To look at samples of the data, uncomment this line:
# head(d.train)
###Let's save the Image column as another variable, and remove it from d.train:
###d.train is our dataframe, and we want the column called Image.
###Assigning NULL to a column removes it from the dataframe.
im.train <- d.train$Image
d.train$Image <- NULL # removes 'Image' from the dataframe
im.test <- d.test$Image
d.test$Image <- NULL # removes 'Image' from the dataframe
#################################
#The image is represented as a series of numbers, stored as a string.
#Convert these strings to integers by splitting them and converting the result to integer:
#strsplit splits the string
#unlist simplifies its output to a vector of strings
#as.integer converts it to a vector of integers.
as.integer(unlist(strsplit(im.train[1], " ")))
as.integer(unlist(strsplit(im.test[1], " ")))
###Install and activate the appropriate libraries.
###The tutorial is meant for Linux and OS X, where a parallel backend is available, so:
###Replace all instances of %dopar% with %do%.
library("foreach", lib.loc="~/R/win-library/3.3")
###Implement the conversion with a foreach loop
im.train <- foreach(im = im.train, .combine = rbind) %do% {
  as.integer(unlist(strsplit(im, " ")))
}
im.test <- foreach(im = im.test, .combine = rbind) %do% {
  as.integer(unlist(strsplit(im, " ")))
}
#The foreach loop evaluates the inner command for each element of im.train, and combines the results with rbind (combine by rows).
#%do% evaluates sequentially; %dopar% (with a registered parallel backend) would run the evaluations in parallel.
#im.train is now a matrix with 7049 rows (one for each image) and 9216 columns (one for each pixel):
###Save all four variables in a data.Rd file; reload them at any time with load('data.Rd')
save(d.train, d.test, im.train, im.test, file = 'data.Rd')
#Each image is a vector of 96*96 pixels (96*96 = 9216).
#Convert these 9216 integers into a 96x96 matrix:
im <- matrix(data = rev(im.train[1, ]), nrow = 96, ncol = 96)
#im.train[1, ] returns the first row of im.train, which corresponds to the first training image.
#rev reverses the resulting vector to match the interpretation of R's image function
#(which expects the origin to be in the lower left corner).
#To visualize the image we use R's image function:
image(1:96, 1:96, im, col = gray((0:255)/255))
#Let's color the coordinates of the eyes and nose:
points(96 - d.train$nose_tip_x[1], 96 - d.train$nose_tip_y[1], col = "red")
points(96 - d.train$left_eye_center_x[1], 96 - d.train$left_eye_center_y[1], col = "blue")
points(96 - d.train$right_eye_center_x[1], 96 - d.train$right_eye_center_y[1], col = "green")
#Another good check is to see how variable our data is.
#For example, where are the centers of the noses in the 7049 images? (this takes a while to run):
for (i in 1:nrow(d.train)) {
  points(96 - d.train$nose_tip_x[i], 96 - d.train$nose_tip_y[i], col = "red")
}
#There are quite a few outliers -- they could be labeling errors. Looking at one extreme example:
idx <- which.max(d.train$nose_tip_x)
im <- matrix(data = rev(im.train[idx, ]), nrow = 96, ncol = 96)
image(1:96, 1:96, im, col = gray((0:255)/255))
points(96 - d.train$nose_tip_x[idx], 96 - d.train$nose_tip_y[idx], col = "red")
#In this case there's no labeling error, but it shows that not all faces are centered.
#One of the simplest things to try is to compute the mean of the coordinates of each keypoint
#in the training set and use that as the prediction for all images:
colMeans(d.train, na.rm = T)
#To build a submission file we need to apply these computed coordinates to the test instances:
p <- matrix(data = colMeans(d.train, na.rm = T), nrow = nrow(d.test), ncol = ncol(d.train), byrow = T)
colnames(p) <- names(d.train)
predictions <- data.frame(ImageId = 1:nrow(d.test), p)
head(predictions)
#The expected submission format has one keypoint per row, but we can easily get that with the help of the reshape2 library:
library(reshape2)
Differential Coexpression Script
This script uses previously normalized data to execute the DiffCoEx computational pipeline on an experiment with four treatment groups. (differentialCoexpression.r)
Normalized Transformed Expression Count Data
Normalized, transformed expression count data of Medicago truncatula and mycorrhizal fungi, given as an R data frame where the columns denote different genes and rows denote different samples. This data is used for downstream differential coexpression analyses. (Expression_Data.zip)
Normalization and Transformation of Raw Count Data Script
Raw count data is transformed and normalized with available R packages and RNA-Seq best practices. (dataPrep.r)
Raw_Count_Data_Mycorrhizal_Fungi
Raw count data from HTSeq for mycorrhizal fungi reads, later transformed and normalized for use in differential coexpression analysis. 'R+' indicates that the sample was obtained from a plant grown in the presence of both mycorrhizal fungi and rhizobia. 'R-' indicate...
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
File List
glmmeg.R: R code demonstrating how to fit a logistic regression model, with a random intercept term, to randomly generated overdispersed binomial data.
boot.glmm.R: R code for estimating P-values by applying the bootstrap to a GLMM likelihood ratio statistic.
Description
glmmeg.R is example R code which shows how to fit a logistic regression model (with or without a random effects term) and use diagnostic plots to check the fit. The code is run on some randomly generated data, which are generated in such a way that overdispersion is evident. This code could be directly applied for your own analyses if you read into R a data.frame called "dataset", which has columns labelled "success" and "failure" (for the number of binomial successes and failures), "species" (a label for the different rows in the dataset), and where we want to test for the effect of some predictor variable called "location". In other cases, just change the labels and formula as appropriate. boot.glmm.R extends glmmeg.R by using bootstrapping to calculate P-values in a way that provides better control of Type I error in small samples. It accepts data in the same form as that generated in glmmeg.R.
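As a companion to that description, here is a minimal sketch of such a random-intercept logistic regression with lme4 (the choice of lme4 is an assumption; the original scripts may fit the model differently), using the column names given above.

library(lme4)
# Binomial GLMM: successes/failures explained by location,
# with a per-species random intercept to absorb overdispersion.
fit <- glmer(cbind(success, failure) ~ location + (1 | species),
             data = dataset, family = binomial)
summary(fit)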
See attached publication.
ABSTRACT:
The World Soil Information Service (WoSIS) provides quality-assessed and standardized soil profile data to support digital soil mapping and environmental applications at broad scale levels. Since the release of the ‘WoSIS snapshot 2019’ many new soil data were shared with us, registered in the ISRIC data repository, and subsequently standardized in accordance with the licenses specified by the data providers. The source data were contributed by a wide range of data providers, therefore special attention was paid to the standardization of soil property definitions, soil analytical procedures and soil property values (and units of measurement).
We presently consider the following soil chemical properties (organic carbon, total carbon, total carbonate equivalent, total nitrogen, phosphorus (extractable P, total P, and P-retention), soil pH, cation exchange capacity, and electrical conductivity) and physical properties (soil texture (sand, silt, and clay), bulk density, coarse fragments, and water retention), grouped according to analytical procedures (aggregates) that are operationally comparable.
For each profile we provide the original soil classification (FAO, WRB, USDA, and version) and horizon designations as far as these have been specified in the source databases.
Three measures for 'fitness-for-intended-use' are provided: positional uncertainty (for site locations), time of sampling/description, and a first approximation for the uncertainty associated with the operationally defined analytical methods. These measures should be considered during digital soil mapping and subsequent earth system modelling that use the present set of soil data.
DATA SET DESCRIPTION:
The 'WoSIS 2023 snapshot' comprises data for 228k profiles from 217k geo-referenced sites that originate from 174 countries. The profiles represent over 900k soil layers (or horizons) and over 6 million records. The actual number of measurements for each property varies (greatly) between profiles and with depth, generally depending on the objectives of the initial soil sampling programmes.
The data are provided in TSV (tab separated values) format and as GeoPackage. The zip-file (446 Mb) contains the following files:
Readme_WoSIS_202312_v2.pdf: Provides a short description of the dataset, file structure, column names, units and category values (this file is also available directly under 'online resources'). The pdf includes links to tutorials for reading the TSV files into R and Excel, respectively. See also 'HOW TO READ TSV FILES INTO R AND PYTHON' in the next section.
wosis_202312_observations.tsv: This file lists the four to six letter codes for each observation, whether the observation is for a site/profile or layer (horizon), the unit of measurement and the number of profiles respectively layers represented in the snapshot. It also provides an estimate for the inferred accuracy for the laboratory measurements.
wosis_202312_sites.tsv: This file characterizes the site location where profiles were sampled.
wosis_202312_profiles.tsv: Presents the unique profile ID (i.e. primary key), site_id, source of the data, country ISO code and name, positional uncertainty, latitude and longitude (WGS 1984), maximum depth of soil described and sampled, as well as information on the soil classification system and edition. Depending on the soil classification system used, the number of fields will vary.
wosis_202312_layers.tsv: This file characterises the layers (or horizons) per profile, and lists their upper and lower depths (cm).
wosis_202312_xxxx.tsv: This type of file presents results for each observation (e.g. "xxxx" = "BDFIOD", giving wosis_202312_bdfiod.tsv), as defined under "code" in file wosis_202312_observations.tsv.
wosis_202312.gpkg: Contains the above datafiles in GeoPackage format (which stores the files within an SQLite database).
HOW TO READ TSV FILES INTO R AND PYTHON:
A) To read the data in R, please uncompress the ZIP file and specify the uncompressed folder.
setwd("/YourFolder/WoSIS_2023_December/") ## For example: setwd('D:/WoSIS_2023_December/')
Then use read_tsv to read the TSV files, specifying the data types for each column (c = character, i = integer, n = number, d = double, l = logical, f = factor, D = date, T = date time, t = time).
observations = readr::read_tsv('wosis_202312_observations.tsv', col_types='cccciid')
observations ## show columns and first 10 rows
sites = readr::read_tsv('wosis_202312_sites.tsv', col_types='iddcccc')
sites
profiles = readr::read_tsv('wosis_202312_profiles.tsv', col_types='icciccddcccccciccccicccci')
profiles
layers = readr::read_tsv('wosis_202312_layers.tsv', col_types='iiciciiilcc')
layers
orgc = readr::read_tsv('wosis_202312_orgc.tsv', col_types='iicciilccdccddccccc')
orgc
Note: One may also use the following R code (example is for file 'observations.tsv'):
observations <- read.table("wosis_202312_observations.tsv",
  sep = "\t", header = TRUE, quote = "",
  comment.char = "", stringsAsFactors = FALSE)
B) To read the files into Python, first decompress the files to your selected folder. Then in Python:
import pandas as pd
observations = pd.read_csv("wosis_202312_observations.tsv", sep="\t")
# print the data frame header and some rows
observations.head()
sites = pd.read_csv("wosis_202312_sites.tsv", sep="\t")
profiles = pd.read_csv("wosis_202312_profiles.tsv", sep="\t")
layers = pd.read_csv("wosis_202312_layers.tsv", sep="\t")
cfvo = pd.read_csv("wosis_202312_cfvo.tsv", sep="\t")
CITATION: Calisto, L., de Sousa, L.M., Batjes, N.H., 2023. Standardised soil profile data for the world (WoSIS snapshot – December 2023), https://doi.org/10.17027/isric-wdcsoils-20231130
Supplement to: Batjes N.H., Calisto, L. and de Sousa L.M., 2023. Providing quality-assessed and standardised soil data to support global mapping and modelling (WoSIS snapshot 2023). Earth System Science Data, https://doi.org/10.5194/essd-16-4735-2024.
Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
How do annual members and casual riders use Cyclistic bikes differently?
What is the problem you are trying to solve?
How do annual members and casual riders use Cyclistic bikes differently?
How can your insights drive business decisions?
These insights will help the marketing team design a strategy to convert casual riders.
Where is your data located?
The data is located in Cyclistic's organizational data repository.
How is data organized?
The datasets are in CSV format, one file per month, for fiscal year 2022.
Are there issues with bias or credibility in this data? Does your data ROCCC?
Yes, the data is ROCCC (reliable, original, comprehensive, current, and cited) because it was collected by the Cyclistic organization itself.
How are you addressing licensing, privacy, security, and accessibility?
The company has its own license over the dataset, and the dataset does not contain any personal information about the riders.
How did you verify the data’s integrity?
All the files have consistent columns and each column has the correct type of data.
How does it help you answer your questions?
Insights are always hidden in the data; we have to interpret the data to find them.
Are there any problems with the data?
Yes, the starting station name and ending station name columns have null values.
What tools are you choosing and why?
I used RStudio to clean and transform the data for the analysis phase, both because of the large dataset and to gain experience with the language.
Have you ensured the data’s integrity?
Yes, the data is consistent throughout the columns.
What steps have you taken to ensure that your data is clean?
First, duplicates and null values were removed; then new columns were added for the analysis.
How can you verify that your data is clean and ready to analyze?
Make sure the column names are consistent throughout all datasets by using the bind_rows() function.
Make sure column data types are consistent throughout all datasets by using compare_df_cols() from the janitor package.
Combine all the datasets into a single data frame for consistency throughout the analysis.
Remove the columns start_lat, start_lng, end_lat, and end_lng from the data frame, because they are not required for the analysis.
Create new columns day, date, month, and year from the started_at column; these provide additional opportunities to aggregate the data.
Create a ride_length column from the started_at and ended_at columns to find the average ride duration.
Remove the null rows from the dataset by using the na.omit() function.
A sketch of these cleaning steps in R is shown below.
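A minimal sketch of the cleaning steps above, assuming the dplyr, readr, and lubridate packages; the monthly file-name pattern is hypothetical.

library(dplyr)
library(readr)
library(lubridate)

files <- list.files(pattern = "-divvy-tripdata\\.csv$") # hypothetical file names
trips <- bind_rows(lapply(files, read_csv))             # combine into one data frame

trips <- trips %>%
  select(-start_lat, -start_lng, -end_lat, -end_lng) %>% # drop unused columns
  mutate(date  = as.Date(started_at),
         day   = wday(started_at, label = TRUE),
         month = month(started_at, label = TRUE),
         year  = year(started_at),
         ride_length = as.numeric(difftime(ended_at, started_at, units = "mins"))) %>%
  distinct() %>% # remove duplicate rows
  na.omit()      # remove rows with null values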
Have you documented your cleaning process so you can review and share those results?
Yes, the cleaning process is documented clearly.
How should you organize your data to perform analysis on it?
The data has been organized into one single data frame by using the read_csv function in R.
Has your data been properly formatted?
Yes, all the columns have their correct data type.
What surprises did you discover in the data?
Casual riders' ride durations are longer than annual members'.
Casual riders use docked bikes far more widely than annual members.
What trends or relationships did you find in the data?
Annual members mainly use the bikes for commuting.
Casual riders prefer docked bikes.
Annual members prefer electric or classic bikes.
How will these insights help answer your business questions?
These insights help build a profile for each type of member.
Were you able to answer the question of how ...
https://creativecommons.org/publicdomain/zero/1.0/
This is a historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016. I scraped this data from www.sports-reference.com in May 2018. The R code I used to scrape and wrangle the data is on GitHub. I recommend checking my kernel before starting your own analysis.
Note that the Winter and Summer Games were held in the same year up until 1992. After that, they staggered them such that Winter Games occur on a four year cycle starting with 1994, then Summer in 1996, then Winter in 1998, and so on. A common mistake people make when analyzing this data is to assume that the Summer and Winter Games have always been staggered.
The file athlete_events.csv contains 271116 rows and 15 columns. Each row corresponds to an individual athlete competing in an individual Olympic event (athlete-events). The columns are: ID, Name, Sex, Age, Height, Weight, Team, NOC, Games, Year, Season, City, Sport, Event, and Medal.
The Olympic data on www.sports-reference.com is the result of an incredible amount of research by a group of Olympic history enthusiasts and self-proclaimed 'statistorians'. Check out their blog for more information. All I did was consolidate their decades of work into a convenient format for data analysis.
This dataset provides an opportunity to ask questions about how the Olympics have evolved over time, including questions about the participation and performance of women, different nations, and different sports and events.
regression.dat_nurseries_ADW2.Rdat
This R data frame contains 1353 rows corresponding to the international trials in the CIMMYT database used in this study. The column names should be self-descriptive; they contain all the predictors used for this regression analysis.
DONALD’s raw texts are stored in the R data frame “DONALD.txt.rdata” (size = 12.5 Gb, 4.87 Gb compressed) consisting of 2,173,172 rows (one per document) × 2 columns, namely document ID and text.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
An R .Rda file containing the four hypothetical datasets used in the analysis (flat, increasing, decreasing, unimodal). These are stored in a single data frame, where the 400 rows correspond to observations. For each dataset there are three variables: an event indicator (= 1 if death occurred during follow-up, else = 0; suffix "_dead"), the true time of death (suffix "_time"), and the observed follow-up time, which will be 1 if the true time of death is > 1 (suffix "_obs"). Hence there are twelve columns. A script is provided separately within this project (Analysis.R) which includes the code used to analyse this dataset in order to obtain the results reported in the manuscript "How uncertain is the survival extrapolation? A study of the impact of different parametric survival models on extrapolated uncertainty about hazard functions, lifetime mean survival and cost-effectiveness."
Analyses are reproducible using R version 3.3.2 or above (R Core Team 2016).
Files needed for reproducing the analyses are:
chond-data.csv: Data frame with 63 rows (species) and 11 variables. Some of these variables are based on the same life history trait but are transformed for ease of interpretation and analysis.
stein-et-al-single.tree: Phylogenetic tree with scaled branch lengths from Stein et al. (2018) used in analyses. These are freely downloadable from http://vertlife.org/sharktree/.
rmax-scaling-analysis.R: R code with a minimum working example of how to load the data files, fit phylogenetic linear models using the pgls function in the caper package, run information-theoretic comparisons, and check diagnostics.
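For readers without the script at hand, a minimal sketch of such a pgls fit is shown below; the column and variable names (species, log_rmax, log_mass) are hypothetical.

library(ape)
library(caper)

dat  <- read.csv("chond-data.csv")
tree <- read.tree("stein-et-al-single.tree")

# Match species in the data to tips of the tree, then fit a phylogenetic
# linear model with the phylogenetic signal (lambda) estimated by ML.
cdat <- comparative.data(phy = tree, data = dat, names.col = "species")
fit  <- pgls(log_rmax ~ log_mass, data = cdat, lambda = "ML")
summary(fit)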
This dataverse contains the data referenced in Rieth et al. (2017). Issues and Advances in Anomaly Detection Evaluation for Joint Human-Automated Systems. To be presented at Applied Human Factors and Ergonomics 2017.
Each .RData file is an external representation of an R dataframe that can be read into an R environment with the load() function. The variables loaded are named 'fault_free_training', 'fault_free_testing', 'faulty_testing', and 'faulty_training', corresponding to the respective RData files.
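A minimal sketch of reading one of these files; the file name here is hypothetical, since the description names only the loaded variables.

load("TEP_FaultFree_Training.RData") # hypothetical file name
str(fault_free_training[, 1:6])      # faultNumber, simulationRun, sample, then process variables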
Each dataframe contains 55 columns:
Column 1 ('faultNumber') ranges from 1 to 20 in the “Faulty” datasets and represents the fault type in the TEP. The “FaultFree” datasets only contain fault 0 (i.e. normal operating conditions).
Column 2 ('simulationRun') ranges from 1 to 500 and represents a different random number generator state from which a full TEP dataset was generated (Note: the actual seeds used to generate training and testing datasets were non-overlapping).
Column 3 ('sample') ranges either from 1 to 500 (“Training” datasets) or 1 to 960 (“Testing” datasets). The TEP variables (columns 4 to 55) were sampled every 3 minutes for a total duration of 25 hours and 48 hours respectively. Note that the faults were introduced 1 and 8 hours into the Faulty Training and Faulty Testing datasets, respectively.
Columns 4 to 55 contain the process variables; the column names retain the original variable names.
This work was sponsored by the Office of Naval Research, Human & Bioengineered Systems (ONR 341), program officer Dr. Jeffrey G. Morrison under contract N00014-15-C-5003. The views expressed are those of the authors and do not reflect the official policy or position of the Office of Naval Research, Department of Defense, or US Government.
By accessing or downloading the data or work provided here, you, the User, agree that you have read this agreement in full and agree to its terms.
The person who owns, created, or contributed a work to the data or work provided here dedicated the work to the public domain and has waived his or her rights to the work worldwide under copyright law. You can copy, modify, distribute, and perform the work, for any lawful purpose, without asking permission.
In no way are the patent or trademark rights of any person affected by this agreement, nor are the rights that any other person may have in the work or in how the work is used, such as publicity or privacy rights.
Pacific Science & Engineering Group, Inc., its agents and assigns, make no warranties about the work and disclaim all liability for all uses of the work, to the fullest extent permitted by law.
When you use or cite the work, you shall not imply endorsement by Pacific Science & Engineering Group, Inc., its agents or assigns, or by another author or affirmer of the work.
This Agreement may be amended, and the use of the data or work shall be governed by the terms of the Agreement at the time that you access or download the data or work from this Website.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The first two rows of a pandas DataFrame ready to be used with GLAM.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Here you will find the raw data (RawData.RData) and code (CHIVAS_SEROLOGY_Code.Rmd, an R Markdown file) for generating the analysis and figures for the following publication:
Streptococcus pyogenes pharyngitis elicits diverse antibody responses to key vaccine antigens influenced by the imprint of past infections.
Joshua Osowicki1,2,3 #, Hannah R Frost1 #, Kristy I Azzopardi1, Alana L Whitcombe4, Reuben McGregor4, Lauren H. Carlton4, Ciara Baker1, Loraine Fabri1,5,6, Manisha Pandey7, Michael F Good7, Jonathan R. Carapetis8,9,10, Mark J Walker11,12,13, Pierre R Smeesters1,2,5,6, Paul V Licciardi2,14, Nicole J Moreland4 *, Danika L Hill15 *, Andrew C Steer1,2,3 *
Provided in the RData file are the following items:
Dataframes:
"outcome" : clinical variables associated with human challenge for each participant
"data" : ELISA and functional antibody responses for human challenge participants. Each timepoint and isotype for each antigen as seperate column)
"data_long": Data equivalent to "data" file but in long format, i.e. One column for each antigen, timepoint and isotype as factors.
"data.melt" : Data equivalent to "data" file but in longer format , i.e. timepoint, isotype and antigen as factors, 'value' as ELISA AU.
"luminex" : IgG responses to 6 antigens analysed by luminex bead-based assay in human challenge participants.
"luminex.children" : IgG responses to 6 antigen analysed by luminex bead-based assay in children
Vectors:
"pharyngitis" : participant "id" for the 19 individuals that developed pharyngitis.
"Antigen.Order" : relates to "Main" antigen classification used in Figure 2
'additional" : relates to "Additional
Function:
"custom_theme" : used as a theme when using ggplot to graph.
Adobe Illustrator or Inkscape were used to generate the final image files for publication, with some editing of axis labels and font sizes, adding p-values, etc.
Additional files:
Three .csv files have been included for download:
"ELISA_data_wide_format.csv", a wide-format data table of 25 human challenge individuals and 219 variables. Equivalent to the 'data' dataframe in the RData file.
"CHIVAS_luminex.csv", a long-format data table of 25 human challenge participants at 1 week, 1 month, and 3 months. Equivalent to the 'luminex' dataframe in the RData file.
"Luminex.children.csv", a data table of 6 Luminex variables for 39 children (healthy and post pharyngitis). Equivalent to the 'luminex.children' dataframe in the RData file.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:
🔓 First open data set with information on every active firm in Russia.
🗂️ First open financial statements data set that includes non-filing firms.
🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.
📅 Covers 2011-2023 initially, will be continuously updated.
🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.
The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in a structured, column-oriented, compressed binary format Apache Parquet with yearly partitioning scheme, enabling end-users to query only variables of interest at scale.
The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.
Here we present the instructions for importing the data in R or Python environment. Please consult with the project repository for more information: http://github.com/irlcode/RFSD.
Importing The Data
You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo or rely on 🤗 Hugging Face Datasets library.
Python
🤗 Hugging Face Datasets
It is as easy as:
from datasets import load_dataset
import polars as pl
RFSD = load_dataset('irlspbru/RFSD')
RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')
Please note that the data is not shuffled within year, meaning that streaming the first n rows will not yield a random sample.
Local File Import
Importing in Python requires pyarrow package installed.
import pyarrow.dataset as ds
import polars as pl
RFSD = ds.dataset("local/path/to/RFSD")
print(RFSD.schema)
RFSD_full = pl.from_arrow(RFSD.to_table())
RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))
RFSD_2019_revenue = pl.from_arrow(
    RFSD.to_table(
        filter=ds.field('year') == 2019,
        columns=['inn', 'line_2110']
    )
)

renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})
R
Local File Import
Importing in R requires arrow package installed.
library(arrow)
library(data.table)
RFSD <- open_dataset("local/path/to/RFSD")
schema(RFSD)
scanner <- Scanner$create(RFSD)
RFSD_full <- as.data.table(scanner$ToTable())

scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scanner <- scan_builder$Finish()
RFSD_2019 <- as.data.table(scanner$ToTable())

scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scan_builder$Project(cols = c("inn", "line_2110"))
scanner <- scan_builder$Finish()
RFSD_2019_revenue <- as.data.table(scanner$ToTable())

renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)
Use Cases
🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md
🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md
🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md
FAQ
Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?
To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.
What is the data period?
We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July, 2025 (see Version and Update Policy below).
Why are there no data for firm X in year Y?
Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:
We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).
Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, Novatek only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.
A firm may have submitted its annual statement even though, according to the Uniform State Register of Legal Entities (EGRUL), it was not active in that year. We remove those filings.
Why is the geolocation of firm X incorrect?
We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to a house level in 2014 and 2021-2023, but only at street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, address of incorporation may not correspond with plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.
Why is the data for firm X different from https://bo.nalog.ru/?
Many firms submit correcting statements after the initial filing. While we have downloaded the data way past the April, 2024 deadline for 2023 filings, firms may have kept submitting the correcting statements. We will capture them in the future releases.
Why is the data for firm X unrealistic?
We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.
Why is the data for groups of companies different from their IFRS statements?
We should stress that we provide unconsolidated financial statements filed according to the Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups with this data. Gazprom, for instance, had over 800 affiliated entities and to study this corporate group in its entirety it is not enough to consider financials of the parent company.
Why is the data not in CSV?
The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.
Version and Update Policy
Version (SemVer): 1.0.0.
We intend to update the RFSD annually as the data becomes available, in other words when most of the firms have their statements filed with the Federal Tax Service. The official deadline for filing of previous year statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. Nevertheless, there is obviously a trade-off between data completeness and timely version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all the correcting filings. We plan to make a new version of the RFSD available by July.
Licence
Creative Commons License Attribution 4.0 International (CC BY 4.0).
Copyright © the respective contributors.
Citation
Please cite as:
@unpublished{bondarkov2025rfsd,
  title={{R}ussian {F}inancial {S}tatements {D}atabase},
  author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
  note={arXiv preprint arXiv:2501.05841},
  doi={https://doi.org/10.48550/arXiv.2501.05841},
  year={2025}
}
Acknowledgments and Contacts
Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru
Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
File List
Wolf code.r – Source code to run the wolf analysis.
Description
This is provided for illustration only; the wolf data are not offered online. The code operates on a data frame in which rows correspond to points in space. The data frame contains a column for use (1 for a telemetry observation, 0 for a control point selected from the wolf's home range). It also contains columns for the x and y coordinates of the point, environmental covariates at that location, wolf ID, and wolf pack membership.
1. Data frame preparation: The data set is first thinned for computational expediency, the covariates are standardized to improve convergence, and the data frame is augmented with columns for wolf-pack-level covariate expectations (required by the GFR approach).
2. Leave-one-out validation: The code allows the removal of a single wolf from the data set. Two models (one with just random effects, the second with GFR interactions) are fit to the data and predictions are made for the missing wolf. The function gof() generates goodness-of-fit diagnostics.