Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
Publication
The repository contains two raw data files (will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of the top-200 infinitival collocates of will and be going to, respectively, across the twenty decades of the Corpus of Historical American English (COHA, from the 1810s to the 2000s).
1-script-create-input-data-raw.r preprocesses and combines the two files into a long-format data frame with the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (frequency of the collocates with be going to), and (iv) will (frequency of the collocates with will); the result is available in input_data_raw.txt.
2-script-create-motion-chart-input-data.R processes input_data_raw.txt to normalise the co-occurrence frequency of the collocates per million words (the COHA sizes used as the normalising base frequencies are available in coha_size.txt). The output of this second script is input_data_futurate.txt.
input_data_futurate.txt contains the input data for generating (i) the static motion chart used as an image plot in the publication (via 3-script-create-motion-chart-plot.R) and (ii) the dynamic motion chart (via 4-script-motion-chart-dynamic.R).
Open the Future Constructions.Rproj file to start an RStudio session whose working directory is associated with the contents of this repository.
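A minimal sketch of the per-million-words normalisation step (the exact column names and the layout of coha_size.txt are assumptions for illustration):
library(dplyr)
library(tidyr)

raw  <- read.delim("input_data_raw.txt", check.names = FALSE)  # decade, coll, BE going to, will
coha <- read.delim("coha_size.txt")                            # assumed columns: decade, size_words

normalised <- raw %>%
  pivot_longer(c(`BE going to`, will), names_to = "construction", values_to = "freq") %>%
  left_join(coha, by = "decade") %>%
  mutate(freq_pmw = freq / size_words * 1e6)  # frequency per million words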
Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables, including guiding questions and key tasks, will help you stay on the right path.
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
How do annual members and casual riders use Cyclistic bikes differently?
What is the problem you are trying to solve?
How do annual members and casual riders use Cyclistic bikes differently?
How can your insights drive business decisions?
These insights will help the marketing team design a strategy to convert casual riders into annual members.
Where is your data located?
The data is located in Cyclistic's own organizational data repository.
How is data organized?
The dataset consists of one CSV file per month, covering fiscal year 2022.
Are there issues with bias or credibility in this data? Does your data ROCCC?
Yes, the data is ROCCC (reliable, original, comprehensive, current, and cited) because it was collected first-hand by the Cyclistic organization.
How are you addressing licensing, privacy, security, and accessibility?
The company holds its own license over the dataset, and the dataset does not contain any personal information about the riders.
How did you verify the data’s integrity?
All the files have consistent columns and each column has the correct type of data.
How does it help you answer your questions?
Insights are hidden in the data; we have to interpret the data to find them.
Are there any problems with the data?
Yes: the starting station name and ending station name columns contain null values.
What tools are you choosing and why?
I used RStudio to clean and transform the data for the analysis phase, both because of the size of the dataset and to gain experience with the language.
Have you ensured the data’s integrity?
Yes, the data is consistent throughout the columns.
What steps have you taken to ensure that your data is clean?
First, duplicates and null values were removed; then new columns were added for analysis.
How can you verify that your data is clean and ready to analyze?
Made sure the column names are consistent throughout all datasets so they can be combined with the bind_rows() function.
Made sure the column data types are consistent throughout all datasets using compare_df_cols() from the janitor package.
Combined all datasets into a single data frame for consistency throughout the analysis.
Removed the start_lat, start_lng, end_lat, and end_lng columns from the data frame because they are not required for the analysis.
Created new day, date, month, and year columns from the started_at column; these provide additional ways to aggregate the data.
Created a ride_length column from the started_at and ended_at columns to find the average ride duration by rider type.
Removed rows with null values using the na.omit() function. (A consolidated R sketch of these steps follows this list.)
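A minimal R sketch of the cleaning steps above (the folder path and file pattern are assumptions; the column names follow the steps listed):
library(dplyr)
library(janitor)
library(lubridate)

files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)  # monthly CSV exports
trips <- lapply(files, read.csv)

compare_df_cols(trips)  # verify column names and types agree across months

all_trips <- bind_rows(trips) %>%
  distinct() %>%  # drop duplicate rows
  select(-start_lat, -start_lng, -end_lat, -end_lng) %>%
  mutate(
    date        = as.Date(started_at),
    day         = format(date, "%d"),
    month       = format(date, "%m"),
    year        = format(date, "%Y"),
    ride_length = difftime(ymd_hms(ended_at), ymd_hms(started_at), units = "mins")
  ) %>%
  na.omit()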
Have you documented your cleaning process so you can review and share those results?
Yes, the cleaning process is documented clearly.
How should you organize your data to perform analysis on it?
The data has been organized into one single data frame using the read.csv() and bind_rows() functions in R.
Has your data been properly formatted?
Yes, all the columns have their correct data types.
What surprises did you discover in the data?
Casual riders' average ride duration is higher than that of annual members.
Casual riders use docked bikes far more than annual members do.
What trends or relationships did you find in the data?
Annual members mainly use the bikes for commuting.
Casual riders prefer docked bikes.
Annual members prefer electric or classic bikes.
How will these insights help answer your business questions?
These insights help build a profile for each rider type.
Were you able to answer the question of how ...
CC0 1.0 Universal: https://spdx.org/licenses/CC0-1.0.html
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.
Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.
Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.
Conclusion: We demonstrate the feasibility of the facile eLAB workflow. EHR data are successfully transformed and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.
Methods eLAB Development and Source Code (R statistical software):
eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).
eLAB reformats EHR data abstracted for an identified population of patients (e.g., a medical record number (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names; eLAB converts these to MCCPR-assigned record identification numbers (record_id) before import, for de-identification.
Functions were written to remap EHR bulk lab data pulls/queries from several sources, including Clarity/Crystal reports or an institutional enterprise data warehouse (EDW) such as the Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.
The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).
Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
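A minimal sketch of the key-value remapping idea (the lookup table and lab data here are toy examples; eLAB ships its own ~300-row lookup table):
library(dplyr)

# toy key-value lookup: raw EHR lab name -> Data Dictionary code
lookup <- data.frame(
  ehr_name = c("Potassium", "Potassium-External", "Potassium(POC)"),
  dd_code  = "potassium"
)

# toy lab pull
labs <- data.frame(lab_name = c("Potassium(POC)", "Sodium"), value = c(4.1, 140))

# remap to DD codes; labs not defined by the registry DD are dropped
labs %>% inner_join(lookup, by = c("lab_name" = "ehr_name"))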
Data Dictionary (DD)
EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as string or numeric. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.
Study Cohort
This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.
Statistical Analysis
OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
Database Contents License (DbCL) 1.0: http://opendatacommons.org/licenses/dbcl/1.0/
# https://www.kaggle.com/c/facial-keypoints-detection/details/getting-started-with-r
#################################

### Variables for downloaded files
data.dir   <- ' '
train.file <- paste0(data.dir, 'training.csv')
test.file  <- paste0(data.dir, 'test.csv')
#################################

### Load csv -- creates a data.frame, where each column can have a different type.
d.train <- read.csv(train.file, stringsAsFactors = FALSE)
d.test  <- read.csv(test.file,  stringsAsFactors = FALSE)

### In training.csv we have 7049 rows, each with 31 columns.
### The first 30 columns are keypoint locations, which R correctly identified as numbers.
### The last one is a string representation of the image.
###To look at samples of the data, uncomment this line:
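# e.g. (an assumed inspection line; the original snippet omitted it):
# head(d.train)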
### Let's save the Image column as another variable, and remove it from d.train:
### d.train is our data frame, and we want the column called Image.
### Assigning NULL to a column removes it from the data frame.
im.train <- d.train$Image
d.train$Image <- NULL  # removes 'Image' from the data frame

im.test <- d.test$Image
d.test$Image <- NULL  # removes 'Image' from the data frame

#################################
# The image is represented as a series of numbers, stored as a string.
# Convert these strings to integers by splitting them and converting the result:
# strsplit splits the string,
# unlist simplifies its output to a vector of strings,
# as.integer converts it to a vector of integers.
as.integer(unlist(strsplit(im.train[1], " ")))
as.integer(unlist(strsplit(im.test[1], " ")))
### Install and load the appropriate libraries.
### The tutorial targets Linux and OS X, where a parallel backend enables %dopar% loops;
### on Windows, replace all instances of %dopar% with the sequential %do%.
library("foreach", lib.loc="~/R/win-library/3.3")
### Convert every image string to a vector of integers
im.train <- foreach(im = im.train, .combine = rbind) %do% {
  as.integer(unlist(strsplit(im, " ")))
}
im.test <- foreach(im = im.test, .combine = rbind) %do% {
  as.integer(unlist(strsplit(im, " ")))
}
# The foreach loop evaluates the inner expression for each image and combines the results with rbind (by rows).
# %do% runs the iterations sequentially; %dopar% (with a registered parallel backend) would run them in parallel.
# im.train is now a matrix with 7049 rows (one for each image) and 9216 columns (one for each pixel).
### Save all four variables in a data.Rd file.
### They can be reloaded at any time with load('data.Rd').
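# A sketch of the save step (object and file names as in the comments above):
save(d.train, im.train, d.test, im.test, file = 'data.Rd')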
# Each image is a vector of 96*96 = 9216 pixels.
# Convert these 9216 integers into a 96x96 matrix:
im <- matrix(data = rev(im.train[1,]), nrow = 96, ncol = 96)
# im.train[1,] returns the first row of im.train, which corresponds to the first training image.
# rev reverses the resulting vector to match the interpretation of R's image function
# (which expects the origin to be in the lower left corner).

# To visualize the image we use R's image function:
image(1:96, 1:96, im, col = gray((0:255)/255))

# Let's color the coordinates of the eyes and nose:
points(96 - d.train$nose_tip_x[1],         96 - d.train$nose_tip_y[1],         col = "red")
points(96 - d.train$left_eye_center_x[1],  96 - d.train$left_eye_center_y[1],  col = "blue")
points(96 - d.train$right_eye_center_x[1], 96 - d.train$right_eye_center_y[1], col = "green")

# Another good check is to see how variable our data is.
# For example, where are the centers of the noses in the 7049 images? (this takes a while to run):
for (i in 1:nrow(d.train)) {
  points(96 - d.train$nose_tip_x[i], 96 - d.train$nose_tip_y[i], col = "red")
}

# There are quite a few outliers -- they could be labeling errors. Looking at one extreme example:
# In this case there's no labeling error, but it shows that not all faces are centered.
idx <- which.max(d.train$nose_tip_x)
im <- matrix(data = rev(im.train[idx,]), nrow = 96, ncol = 96)
image(1:96, 1:96, im, col = gray((0:255)/255))
points(96 - d.train$nose_tip_x[idx], 96 - d.train$nose_tip_y[idx], col = "red")
# One of the simplest things to try is to compute the mean of the coordinates of each keypoint
# in the training set and use that as the prediction for all images:
colMeans(d.train, na.rm = TRUE)

# To build a submission file we need to apply these computed coordinates to the test instances:
p <- matrix(data = colMeans(d.train, na.rm = TRUE), nrow = nrow(d.test), ncol = ncol(d.train), byrow = TRUE)
colnames(p) <- names(d.train)
predictions <- data.frame(ImageId = 1:nrow(d.test), p)
head(predictions)

# The expected submission format has one keypoint per row, but we can easily get that with the help of the reshape2 library:
library(...
Market basket analysis with the Apriori algorithm
The retailer wants to target customers with suggestions for itemsets they are most likely to purchase. I was given a retailer's dataset; the transaction data covers all transactions that occurred over a period of time. The retailer will use the results to grow the business: by suggesting relevant itemsets to customers, we can increase customer engagement, improve customer experience, and identify customer behavior. I will solve this problem using Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another.
Association Rule mining is most often used when you want to find associations between different objects in a set. It works well for finding frequent patterns in a transaction database: it can tell you which items customers frequently buy together, and it allows the retailer to identify relationships between items.
Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat":
- support = P(mouse & mat) = 8/100 = 0.08
- confidence = support / P(computer mouse) = 0.08/0.10 = 0.80
- lift = confidence / P(mouse mat) = 0.80/0.09 ≈ 8.9
This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
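A minimal sketch with the arules package (the baskets here are toy data, not the retailer's dataset):
library(arules)

# toy transactions: each element is one customer's basket
baskets <- list(c("mouse", "mat"), c("mouse", "keyboard"), c("mouse", "mat", "monitor"))
trans <- as(baskets, "transactions")

# mine rules above minimum support/confidence thresholds
rules <- apriori(trans, parameter = list(supp = 0.3, conf = 0.6))
inspect(rules)  # shows support, confidence and lift for each rule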
Number of Attributes: 7
https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png
First, we need to load the required libraries; each is described briefly below.
https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png
Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.
https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png
Next we will clean our data frame by removing missing values.
https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png
To apply Association Rule mining, we need to convert the data frame into transaction data, so that all items bought together on one invoice are in ...
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
File List
MCEM_analysis_of_seed_fate.R (MD5: f22e6f0fe39d55b7e7f0b67c45eede55)
Description
MCEM_analysis_of_seed_fate.R – This file contains R source code for functions to fit simultaneous estimates of survival, dispersal, and detection parameters for a single cohort of marked seeds using a Monte Carlo Expectation-Maximization algorithm. Seeds do not need to be individually identifiable, and the model assumes a single, area-constrained recovery effort.
Potential users should note that the model encoded here was developed specifically for metal-tagged acorns. Survivors occur mainly in rodent caches underneath the soil surface, whereas tags from acorns that have been eaten are left on the soil surface. Tags from non-survivors are therefore assumed to be easier to locate than tags from survivors.
There are 3 primary functions for users: prepare.data formats raw data for fitting, em.seedfate provides the main user interface, and continue.seedfate extends a previously completed run to obtain a more precise fit. A number of additional functions and utilities not intended to be called directly by an end user are also included.
prepare.data takes the following arguments:
-- TABLE: Please see in attached file. --
Data should be formatted into three data frames as follows: -- TABLE: Please see in attached file. --
-- TABLE: Please see in attached file. --
-- TABLE: Please see in attached file. --
The em.seedfate function takes the following arguments:
-- TABLE: Please see in attached file. --
The continue.seedfate function takes the following arguments:
-- TABLE: Please see in attached file. --
Specifying search areas:
The simplest and most common topology is a circle, centered on the plot, with an inner radius r1 = 0 m and an outer radius of r2 m. This region can be encoded with a single line in search.data as shown in the first line of Table 1.
Table 1. Examples of circular search region encoding
-- TABLE: Please see in attached file. --
A slightly more complex circular region is a ring, coded in Table 1, sample 2. In this case, the search covers a complete circle, but the area immediately next to the plot is not searched.
Another alternative is illustrated by the 4 lines in Table 1 for sample 3. Here, the search covers 4 pie-slice shaped sectors radiating from the plot at 45, 135, 225, and 315 degrees, each with an angular width of 30 degrees. The radian values for phi1 and phi2 for sample 3 correspond to 30 and 60 degrees in each of the Cartesian quadrants, and designate the angular boundaries of each subregion.
Sample 4 demonstrates an alternative method to code the same search area as sample 3. Because dispersal is assumed to be isotropic and the 4 subregions are arranged radially, the integral of the dispersal kernel over sample area 3 is identical to 4 times the integral over the subregion in the first quadrant. Note that this coding shortcut may only be used if the subregions are perfectly symmetrical and dispersal is isotropic.
This multiplication trick is especially useful for search areas that consist of arrays of belt transects radiating from the plot, since it allows the entire search area to be encoded using only 2 lines (Table 2, sample 5). Additional lines are required if, instead of a transect, lines of quadrats radiate from the plot (Table 2, sample 6). In both sample 5 and sample 6, transects would begin to overlap near the plot. Both designs therefore include a central circular "hub," which covers the region where overlaps are possible. Sample 7 (Table 2) does not include this hub, but the quadrats fall far enough from the plot that overlap does not occur. This design would be typical of an array of mast traps around a tree. Notice that in each of the samples listed in Table 2, the spokes and quadrats are all vertically centered on the x-axis.
Table 2. Examples of hub-and-spoke search region encoding
-- TABLE: Please see in attached file. --
The multiplication trick can also be used to shorten the coding for a matrix-like array of quadrats, but in this case the symmetry is based on the Cartesian quadrants. Thus, all of the quadrats in the first quadrant must be encoded. In addition, none of the encoded quadrats may cross the x- or y-axis (Table 3). Alternatively, each quadrat can be individually encoded (not shown).
Table 3. Shortened encoding for quadrat arrays
-- TABLE: Please see in attached file. --
Understanding the evolution of traits subject to trade-offs is challenging because phenotypes can (co)vary at both the among- and within-individual levels. Among-individual covariation indicates consistent, possibly genetic, differences in how individuals resolve the trade-off, while within-individual covariation indicates trait plasticity. There is also the potential for consistent among-individual differences in behavioral plasticity, although this has rarely been investigated. We studied the sources of (co)variance in two characteristics of an acoustic advertisement signal that trade off with one another and are under sexual selection in the gray treefrog, Hyla chrysoscelis: call duration and call rate. We recorded males on multiple nights calling spontaneously and in response to playbacks simulating different competition levels. Call duration, call rate, and their product, call effort, were all repeatable both within and across social contexts. Call duration and call rate covaried n...

# Data and code from: Partitioning variance in a signaling trade-off under sexual selection reveals among-individual covariance in trait allocation
Michael S. Reichert, Ivan de la Hera, Maria Moiron
Evolution 2024
Summary: Data are measurements of the characteristics of individual calls from a study of individual variation in calling in Cope's gray treefrog, Hyla chrysoscelis.
Note: There are some NA entries in the data files because these are outputs of R data frames. NA corresponds to an empty cell (i.e. no data are available for that variable for that row).
List of files: TreefrogVariance.csv - This is the main raw data file. Each row contains the data from a single call. Variables are as follows: CD - call duration, in seconds; CR - call rate, in calls/second. *Note that the intercall interval (ICI), which is analyzed in the supplement as an alternative to call rate, is not directly included in this data file but can be calculated a...
Differential Coexpression Script: This script uses previously normalized data to execute the DiffCoEx computational pipeline on an experiment with four treatment groups. (differentialCoexpression.r)
Normalized Transformed Expression Count Data: Normalized, transformed expression count data of Medicago truncatula and mycorrhizal fungi, given as an R data frame where the columns denote different genes and the rows denote different samples. This data is used for downstream differential coexpression analyses. (Expression_Data.zip)
Normalization and Transformation of Raw Count Data Script: Raw count data is transformed and normalized with available R packages and RNA-Seq best practices. (dataPrep.r)
Raw_Count_Data_Mycorrhizal_Fungi: Raw count data from HTSeq for mycorrhizal fungi reads, later transformed and normalized for use in differential coexpression analysis. 'R+' indicates that the sample was obtained from a plant grown in the presence of both mycorrhizal fungi and rhizobia. 'R-' indicate...
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
Brief Description: The Chief Marketing Officer (CMO) of Healthy Foods Inc. wants to understand customer sentiment about the specialty foods the company offers. This information has been collected through customer reviews on the company website; the dataset consists of about 5,000 reviews. They want answers to the following questions: 1. What are the most frequently used words in the customer reviews? 2. How can the data be prepared for text analysis? 3. What are the overall sentiments towards the products?
Steps:
- Set the working directory and read the data.
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fd7ec6c7460b58ae39c96d5431cca2d37%2FPicture1.png?generation=1691146783504075&alt=media
- Data cleaning. Check for missing values and data types of variables
- Load the required libraries ("tm", "SnowballC", "dplyr", "sentimentr", "wordcloud2", "RColorBrewer")
- TEXT ACQUISITION and AGGREGATION. Create corpus.
- TEXT PRE-PROCESSING. Cleaning the text (a consolidated R sketch of these steps appears after the word cloud step below)
- Replace special characters with " ". We use the tm_map function for this purpose
- convert all text to lower case
- remove punctuation
- remove whitespace
- remove stopwords
- remove numbers
- stem the document
- create term document matrix
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F0508dfd5df9b1ed2885e1eea35b84f30%2FPicture2.png?generation=1691147153582115&alt=media
- convert into a matrix and find the frequency of words
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Febc729e81068856dec368667c5758995%2FPicture3.png?generation=1691147243385812&alt=media
- convert into a data frame
- TEXT EXPLORATION. Find the words that appear most and least frequently
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F33cf5decc039baf96dbe86dd6964792a%2FTop%205%20frequent%20words.jpeg?generation=1691147382783191&alt=media
- Create Wordcloud
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F99f1147bd9e9a4e6bb35686b015fc714%2FWordCloud.png?generation=1691147502824379&alt=media
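A consolidated R sketch of the pre-processing and exploration steps above (the reviews vector is placeholder data; the real corpus comes from the ~5,000 website reviews):
library(tm)
library(SnowballC)
library(wordcloud2)

reviews <- c("Great taste, will buy again!", "Too salty for me.")  # placeholder data

corpus <- VCorpus(VectorSource(reviews))
corpus <- tm_map(corpus, content_transformer(function(x) gsub("[^[:alnum:][:space:]]", " ", x)))  # special chars -> " "
corpus <- tm_map(corpus, content_transformer(tolower))   # lower case
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stemDocument)

tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)  # word frequencies
head(freq, 5)                                             # most frequent words
wordcloud2(data.frame(word = names(freq), freq = freq))   # word cloud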
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
General description
SAPFLUXNET contains a global database of sap flow and environmental data, together with metadata at different levels.
SAPFLUXNET is a harmonised database, compiled from contributions from researchers worldwide. This version (0.1.4) contains more than 200 datasets from all over the world, covering a broad range of bioclimatic conditions.
More information on the coverage can be found here: http://sapfluxnet.creaf.cat/shiny/sfn_progress_dashboard/.
The SAPFLUXNET project has been developed by researchers at CREAF and other institutions (http://sapfluxnet.creaf.cat/#team), coordinated by Rafael Poyatos (CREAF, http://www.creaf.cat/staff/rafael-poyatos-lopez), and funded by two Spanish Young Researcher's Grants (SAPFLUXNET, CGL2014-55883-JIN; DATAFORUSE, RTI2018-095297-J-I00) and an Alexander von Humboldt Research Fellowship for Experienced Researchers.
Variables and units
SAPFLUXNET contains whole-plant sap flow and environmental variables at sub-daily temporal resolution. Both sap flow and environmental time series have accompanying flags in a data frame, one for sap flow and another for environmental variables. These flags store quality issues detected during the quality control process and can be used to add further quality flags.
Metadata contain relevant variables informing about site conditions, stand characteristics, tree and species attributes, sap flow methodology and details on environmental measurements. To learn more about variables, units and data flags please use the functionalities implemented in the sapfluxnetr package (https://github.com/sapfluxnet/sapfluxnetr). In particular, have a look at the package vignettes using R:
# remotes::install_github(
# 'sapfluxnet/sapfluxnetr',
# build_opts = c("--no-resave-data", "--no-manual", "--build-vignettes")
# )
library(sapfluxnetr)
# to list all vignettes
vignette(package='sapfluxnetr')
# variables and units
vignette('metadata-and-data-units', package='sapfluxnetr')
# data flags
vignette('data-flags', package='sapfluxnetr')
Data formats
SAPFLUXNET data can be found in two formats: 1) RData files belonging to the custom-built 'sfn_data' class and 2) Text files in .csv format. We recommend using the sfn_data objects together with the sapfluxnetr package, although we also provide the text files for convenience. For each dataset, text files are structured in the same way as the slots of sfn_data objects; if working with text files, we recommend that you check the data structure of 'sfn_data' objects in the corresponding vignette.
Working with sfn_data files
To work with SAPFLUXNET data, first they have to be downloaded from Zenodo, maintaining the folder structure. A first level in the folder hierarchy corresponds to file format, either RData files or csv's. A second level corresponds to how sap flow is expressed: per plant, per sapwood area or per leaf area. Please note that interconversions among the magnitudes have been performed whenever possible. Below this level, data have been organised per dataset. In the case of RData files, each dataset is contained in a sfn_data object, which stores all data and metadata in different slots (see the vignette 'sfn-data-classes'). In the case of csv files, each dataset has 9 individual files, corresponding to metadata (5), sap flow and environmental data (2) and their corresponding data flags (2).
After downloading the entire database, the sapfluxnetr package can be used to:
- Work with data from a single site: data access, plotting and time aggregation.
- Select a subset of datasets to work with.
- Work with data from multiple sites: data access, plotting and time aggregation.
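For example, a single dataset could be loaded and inspected like this (a sketch assuming the folder layout described above and the site code ARG_TRE):
library(sapfluxnetr)

# read one site's sfn_data object from the downloaded RData/plant folder
arg_tre <- read_sfn_data('ARG_TRE', folder = 'RData/plant')

sapf <- get_sapf_data(arg_tre)    # sap flow time series
env  <- get_env_data(arg_tre)     # environmental time series
sfn_plot(arg_tre, type = 'sapf')  # quick plot of the sap flow data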
Please check the package vignettes to learn more about how to work with sfn_data files.
Working with text files
We recommend working with the sfn_data objects using R and the sapfluxnetr package; we do not currently provide code for working with the text files.
Data issues and reporting
Please report any issue you may find in the database by sending us an email: sapfluxnet@creaf.uab.cat.
Temporary data fixes, detected but not yet included in a released version, will be published on the SAPFLUXNET main web page ('Known data errors').
Data access, use and citation
This version of the SAPFLUXNET database is open access. We are working on a data paper describing the database, but, before its publication, please cite this Zenodo entry if SAPFLUXNET is used in any publication.
This is Census 2020 block data specifically formatted for use by the Environmental Protection Agency (EPA) in-development Environmental Justice Analysis Multisite (EJAM) tool, which uses R code to find which block centroids are within X miles of each specified point (e.g., regulated facility), and to find those distances. The datasets have latitude and longitude of each block's internal point, as provided by Census Bureau, and the FIPS code of the block and its parent block group. The datasets also include a weight for each block, representing this block's Census 2020 population count as a fraction of the count for the parent block group overall, for use in estimating how much of a given block group is within X miles of a specified point or inside a polygon of interest. The datasets also have an effective radius of each block, which is what the radius would be in miles if the block covered the same area in square miles but were circular. The datasets also have coordinates in units that facilitate building a quadtree index of locations. They are in R data.table format, saved as .rda or .arrow files to be read by R code.
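A minimal sketch of the intended use (column names such as lat, lon, blockwt, and bgfips are assumptions based on the description above; EJAM's actual code and names may differ):
library(data.table)

# toy block table: one row per Census block, as described above
blocks <- data.table(
  bgfips  = c("170310101001", "170310101001"),
  lat     = c(41.88, 41.89),
  lon     = c(-87.63, -87.62),
  blockwt = c(0.6, 0.4)  # block population / parent block group population
)

site <- list(lat = 41.885, lon = -87.625)  # a specified point, e.g. a facility
radius_miles <- 3

# great-circle distance in miles (haversine formula)
haversine_miles <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  3959 * 2 * asin(sqrt(a))
}

blocks[, dist := haversine_miles(lat, lon, site$lat, site$lon)]
# estimated share of each block group's population within the radius
blocks[dist <= radius_miles, .(bg_share = sum(blockwt)), by = bgfips]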
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:
🔓 First open data set with information on every active firm in Russia.
🗂️ First open financial statements data set that includes non-filing firms.
🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.
📅 Covers 2011-2023 initially, will be continuously updated.
🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.
The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in Apache Parquet, a structured, column-oriented, compressed binary format, with a yearly partitioning scheme, enabling end-users to query only the variables of interest at scale.
The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.
Here we present the instructions for importing the data in an R or Python environment. Please consult the project repository for more information: http://github.com/irlcode/RFSD.
Importing The Data
You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo or rely on 🤗 Hugging Face Datasets library.
Python
🤗 Hugging Face Datasets
It is as easy as:
from datasets import load_dataset
import polars as pl

# load the entire dataset via the Hugging Face Datasets library
RFSD = load_dataset('irlspbru/RFSD')

# or read a single yearly partition directly with polars
RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')

Please note that the data is not shuffled within a year, meaning that streaming the first n rows will not yield a random sample.
Local File Import
Importing in Python requires the pyarrow package to be installed.

import pyarrow.dataset as ds
import polars as pl

# point to the local copy of the data (Hive-style year=YYYY partitions)
RFSD = ds.dataset("local/path/to/RFSD")

# inspect the schema
print(RFSD.schema)

# load the entire dataset
RFSD_full = pl.from_arrow(RFSD.to_table())

# load a single year
RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

# load a single year, keeping only firm identifiers and revenue (line 2110)
RFSD_2019_revenue = pl.from_arrow(
    RFSD.to_table(
        filter=ds.field('year') == 2019,
        columns=['inn', 'line_2110']
    )
)

# apply human-readable column names from the provided dictionary
renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})
R
Local File Import
Importing in R requires the arrow package to be installed.

library(arrow)
library(data.table)

# point to the local copy of the data
RFSD <- open_dataset("local/path/to/RFSD")

# inspect the schema
schema(RFSD)

# load the entire dataset
scanner <- Scanner$create(RFSD)
RFSD_full <- as.data.table(scanner$ToTable())

# load a single year
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scanner <- scan_builder$Finish()
RFSD_2019 <- as.data.table(scanner$ToTable())

# load a single year, keeping only firm identifiers and revenue (line 2110)
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scan_builder$Project(cols = c("inn", "line_2110"))
scanner <- scan_builder$Finish()
RFSD_2019_revenue <- as.data.table(scanner$ToTable())

# apply human-readable column names from the provided dictionary
renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)
Use Cases
🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md
🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md
🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md
FAQ
Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?
To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.
What is the data period?
We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July, 2025 (see Version and Update Policy below).
Why are there no data for firm X in year Y?
Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:
We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).
Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022, and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023; Novatek only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.
A firm may also have submitted its annual statement even though, according to the Uniform State Register of Legal Entities (EGRUL), it was not active in that year. We remove those filings.
Why is the geolocation of firm X incorrect?
We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to a house level in 2014 and 2021-2023, but only at street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, address of incorporation may not correspond with plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.
Why is the data for firm X different from https://bo.nalog.ru/?
Many firms submit correcting statements after the initial filing. While we have downloaded the data way past the April, 2024 deadline for 2023 filings, firms may have kept submitting the correcting statements. We will capture them in the future releases.
Why is the data for firm X unrealistic?
We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.
Why is the data for groups of companies different from their IFRS statements?
We should stress that we provide unconsolidated financial statements filed according to the Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups with this data. Gazprom, for instance, had over 800 affiliated entities and to study this corporate group in its entirety it is not enough to consider financials of the parent company.
Why is the data not in CSV?
The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.
Version and Update Policy
Version (SemVer): 1.0.0.
We intend to update the RFSD annually as the data becomes available, in other words when most of the firms have had their statements filed with the Federal Tax Service. The official deadline for filing the previous year's statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. There is therefore an obvious trade-off between data completeness and timely version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all the correcting filings. We plan to make a new version of the RFSD available by July.
Licence
Creative Commons License Attribution 4.0 International (CC BY 4.0).
Copyright © the respective contributors.
Citation
Please cite as:
@unpublished{bondarkov2025rfsd,
  title={{R}ussian {F}inancial {S}tatements {D}atabase},
  author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
  note={arXiv preprint arXiv:2501.05841},
  doi={https://doi.org/10.48550/arXiv.2501.05841},
  year={2025}
}
Acknowledgments and Contacts
Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru
Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Author: Andrew J. Felton
Date: 10/29/2024
This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, and analysis, as well as figure production, for the study entitled:
"Global estimates of the storage and transit time of water through vegetation"
Please note that 'turnover' and 'transit' are used interchangeably. Also please note that this R project has been updated multiple times as the analysis has been updated.
Data information:
The data folder contains key data sets used for analysis. In particular:
"data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.
# Code information
Python scripts can be found in the "supporting_code" folder.
Each R script in this project has a role:
"01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).
"02_functions.R": This script contains custom functions. Load this using the
`source()` function in the 01_start.R script.
"03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
`source()` function in the 01_start.R script.
"04_figures_tables.R": This is the main workhouse for figure/table production and
supporting analyses. This script generates the key figures and summary statistics
used in the study that then get saved in the manuscript_figures folder. Note that all
maps were produced using Python code found in the "supporting_code"" folder.
"supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.
"supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Description from the SaRNet: A Dataset for Deep Learning Assisted Search and Rescue with Satellite Imagery GitHub Repository * The "Note" was added by the Roboflow team.
This is a single-class dataset consisting of tiles of satellite imagery labeled with potential 'targets'. Labelers were instructed to draw boxes around anything they suspected may be a paraglider wing, missing in a remote area of Nevada. Volunteers were shown examples of similar objects already in the environment for comparison. The missing wing, as it was found after 3 weeks, is shown below.
https://michaeltpublic.s3.amazonaws.com/images/anomaly_small.jpg
The dataset contains the following:
| Set | Images | Annotations |
|---|---|---|
| Train | 1808 | 3048 |
| Validate | 490 | 747 |
| Test | 254 | 411 |
| Total | 2552 | 4206 |
The data is in the COCO format, and is directly compatible with Faster R-CNN as implemented in Facebook's Detectron2.
Download the data here: sarnet.zip
Or follow these steps
# download the dataset
wget https://michaeltpublic.s3.amazonaws.com/sarnet.zip
# extract the files
unzip sarnet.zip
*Note: with Roboflow, you can download the data here (original, raw images, with annotations): https://universe.roboflow.com/roboflow-public/sarnet-search-and-rescue/ (download v1, original_raw-images). Download the dataset in COCO JSON format, or another format of choice, and import it to Roboflow after unzipping the folder to get started on your project.
Get started with a Faster R-CNN model pretrained on SaRNet: SaRNet_Demo.ipynb
Source code for the paper is located here: SaRNet_train_test.ipynb
@misc{thoreau2021sarnet,
title={SaRNet: A Dataset for Deep Learning Assisted Search and Rescue with Satellite Imagery},
author={Michael Thoreau and Frazer Wilson},
year={2021},
eprint={2107.12469},
archivePrefix={arXiv},
primaryClass={eess.IV}
}
The source data was generously provided by Planet Labs, Airbus Defence and Space, and Maxar Technologies.
Raw Fecundity Data: This is an R data file containing a data.frame named "fecundity.data". It feeds into the R code file estimate.parameters.R provided with this paper, or can be examined without the associated regression code. Please read the "Read Me" file for metadata. (data.Rdata)
estimate.parameters.R: This file contains the code for the regression analyses run in this paper. data.Rdata feeds right into it, and the two files can be used together as-is. Please read the "Read Me" file provided with data.Rdata.
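A sketch of loading the two files together (assuming both are in the working directory):
load("data.Rdata")               # creates the data.frame 'fecundity.data'
str(fecundity.data)              # inspect the loaded data
source("estimate.parameters.R")  # run the regression analyses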
This contains the derivative data and code to create the figures and tables used in the paper. The data files contain derivative data based on raw data provided by DonorsChoose.org and available via data.donorschoose.org. The code includes an R file that combines the disparate data sets into a single data frame and produces the final analysis, and another R file containing functions that were not used to make figures or tables in the final paper but are used in some analyses mentioned in the paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Here you can find the model results of the report:
De Felice, M., Busch, S., Kanellopoulos, K., Kavvadias, K. and Hidalgo Gonzalez, I., Power system flexibility in a variable climate, EUR 30184 EN, Publications Office of the European Union, Luxembourg, 2020, ISBN 978-92-76-18183-5 (online), doi:10.2760/75312 (online), JRC120338.
This dataset contains the raw GDX files generated by the GAMS optimiser for the Dispa-SET model. Details on the output format and the names of the variables can be found in the Dispa-SET documentation. A markdown notebook in R (and the rendered PDF) contains an example of how to read the GDX files in R.
We also include in this dataset a data frame saved in the Apache Parquet format that can be read both in R and Python.
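For example, the Parquet data frame can be read in R with the arrow package (the file name below is a placeholder, not the actual file in this record):
library(arrow)

results <- read_parquet("dispaset_results.parquet")  # hypothetical file name
head(results)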
A description of the methodology and the data sources, with references, can be found in the report.
Linked resources
Input files: https://zenodo.org/record/3775569#.XqqY3JpS-fc
Source code for the figures: https://github.com/energy-modelling-toolkit/figures-JRC-report-power-system-and-climate-variability
Update
[29/06/2020] Uploaded a new version of the Parquet file with the correct data in the climate_year column.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This data set contains two files, both of which contain R objects:
chr19_snpdata_hm3only.RDS: a data frame with SNP information
evd_list_chr19_hm3.RDS: a list of eigen decompositions of the SNP correlation matrix spanning chromosome 19
These data contain only SNPs in both 1k Genomes and HapMap3. Correlation matrices were estimated using LD Shrink. These data were built for use with the causeSims R package found here: https://github.com/jean997/causeSims
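The .RDS files can be loaded directly in R, for example:
snpdata  <- readRDS("chr19_snpdata_hm3only.RDS")  # data frame of SNP information
evd_list <- readRDS("evd_list_chr19_hm3.RDS")     # list of eigen decompositions
str(snpdata)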
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
Data (bold boxes) and background estimates (colour fill) for m_beta vs. m_betagamma for the gluino R-hadron search (1000 GeV). The...