10 datasets found

FacialRecognition
kaggle.com
zip
Updated Dec 1, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
TheNicelander (2016). FacialRecognition [Dataset]. https://www.kaggle.com/petein/facialrecognition
Explore at:
zip(121674455 bytes)Available download formats
Dataset updated
Dec 1, 2016
Authors
TheNicelander
License
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Description

#https://www.kaggle.com/c/facial-keypoints-detection/details/getting-started-with-r #################################

###Variables for downloaded files data.dir <- ' ' train.file <- paste0(data.dir, 'training.csv') test.file <- paste0(data.dir, 'test.csv') #################################

###Load csv -- creates a data.frame matrix where each column can have a different type. d.train <- read.csv(train.file, stringsAsFactors = F) d.test <- read.csv(test.file, stringsAsFactors = F)

###In training.csv, we have 7049 rows, each one with 31 columns. ###The first 30 columns are keypoint locations, which R correctly identified as numbers. ###The last one is a string representation of the image, identified as a string.

###To look at samples of the data, uncomment this line:

head(d.train)

###Let's save the first column as another variable, and remove it from d.train: ###d.train is our dataframe, and we want the column called Image. ###Assigning NULL to a column removes it from the dataframe

im.train <- d.train$Image d.train$Image <- NULL #removes 'image' from the dataframe

im.test <- d.test$Image d.test$Image <- NULL #removes 'image' from the dataframe

################################# #The image is represented as a series of numbers, stored as a string #Convert these strings to integers by splitting them and converting the result to integer

#strsplit splits the string #unlist simplifies its output to a vector of strings #as.integer converts it to a vector of integers. as.integer(unlist(strsplit(im.train[1], " "))) as.integer(unlist(strsplit(im.test[1], " ")))

###Install and activate appropriate libraries ###The tutorial is meant for Linux and OSx, where they use a different library, so: ###Replace all instances of %dopar% with %do%.

install.packages('foreach')

library("foreach", lib.loc="~/R/win-library/3.3")

###implement parallelization im.train <- foreach(im = im.train, .combine=rbind) %do% { as.integer(unlist(strsplit(im, " "))) } im.test <- foreach(im = im.test, .combine=rbind) %do% { as.integer(unlist(strsplit(im, " "))) } #The foreach loop will evaluate the inner command for each row in im.train, and combine the results with rbind (combine by rows). #%do% instructs R to do all evaluations in parallel. #im.train is now a matrix with 7049 rows (one for each image) and 9216 columns (one for each pixel):

###Save all four variables in data.Rd file ###Can reload them at anytime with load('data.Rd')

save(d.train, im.train, d.test, im.test, file='data.Rd')

load('data.Rd')

#each image is a vector of 96*96 pixels (96*96 = 9216). #convert these 9216 integers into a 96x96 matrix: im <- matrix(data=rev(im.train[1,]), nrow=96, ncol=96)

#im.train[1,] returns the first row of im.train, which corresponds to the first training image. #rev reverse the resulting vector to match the interpretation of R's image function #(which expects the origin to be in the lower left corner).

#To visualize the image we use R's image function: image(1:96, 1:96, im, col=gray((0:255)/255))

#Let’s color the coordinates for the eyes and nose points(96-d.train$nose_tip_x[1], 96-d.train$nose_tip_y[1], col="red") points(96-d.train$left_eye_center_x[1], 96-d.train$left_eye_center_y[1], col="blue") points(96-d.train$right_eye_center_x[1], 96-d.train$right_eye_center_y[1], col="green")

#Another good check is to see how variable is our data. #For example, where are the centers of each nose in the 7049 images? (this takes a while to run): for(i in 1:nrow(d.train)) { points(96-d.train$nose_tip_x[i], 96-d.train$nose_tip_y[i], col="red") }

#there are quite a few outliers -- they could be labeling errors. Looking at one extreme example we get this: #In this case there's no labeling error, but this shows that not all faces are centralized idx <- which.max(d.train$nose_tip_x) im <- matrix(data=rev(im.train[idx,]), nrow=96, ncol=96) image(1:96, 1:96, im, col=gray((0:255)/255)) points(96-d.train$nose_tip_x[idx], 96-d.train$nose_tip_y[idx], col="red")

#One of the simplest things to try is to compute the mean of the coordinates of each keypoint in the training set and use that as a prediction for all images colMeans(d.train, na.rm=T)

#To build a submission file we need to apply these computed coordinates to the test instances: p <- matrix(data=colMeans(d.train, na.rm=T), nrow=nrow(d.test), ncol=ncol(d.train), byrow=T) colnames(p) <- names(d.train) predictions <- data.frame(ImageId = 1:nrow(d.test), p) head(predictions)

#The expected submission format has one one keypoint per row, but we can easily get that with the help of the reshape2 library:

install.packages('reshape2')

library(...
n
Effect of data source on estimates of regional bird richness in northeastern...
data.niaid.nih.gov
datadryad.org
zip
Updated May 4, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Roi Ankori-Karlinsky; Ronen Kadmon; Michael Kalyuzhny; Katherine F. Barnes; Andrew M. Wilson; Curtis Flather; Rosalind Renfrew; Joan Walsh; Edna Guk (2021). Effect of data source on estimates of regional bird richness in northeastern United States [Dataset]. http://doi.org/10.5061/dryad.m905qfv0h
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.m905qfv0h
Dataset updated
May 4, 2021
Dataset provided by
Massachusetts Audubon Society
University of Vermont
Hebrew University of Jerusalem
New York State Department of Environmental Conservation
Agricultural Research Service
University of Michigan
Gettysburg College
Columbia University
Authors
Roi Ankori-Karlinsky; Ronen Kadmon; Michael Kalyuzhny; Katherine F. Barnes; Andrew M. Wilson; Curtis Flather; Rosalind Renfrew; Joan Walsh; Edna Guk
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Area covered
Northeastern United States, United States
Description
Standardized data on large-scale and long-term patterns of species richness are critical for understanding the consequences of natural and anthropogenic changes in the environment. The North American Breeding Bird Survey (BBS) is one of the largest and most widely used sources of such data, but so far, little is known about the degree to which BBS data provide accurate estimates of regional richness. Here we test this question by comparing estimates of regional richness based on BBS data with spatially and temporally matched estimates based on state Breeding Bird Atlases (BBA). We expected that estimates based on BBA data would provide a more complete (and therefore, more accurate) representation of regional richness due to their larger number of observation units and higher sampling effort within the observation units. Our results were only partially consistent with these predictions: while estimates of regional richness based on BBA data were higher than those based on BBS data, estimates of local richness (number of species per observation unit) were higher in BBS data. The latter result is attributed to higher land-cover heterogeneity in BBS units and higher effectiveness of bird detection (more species are detected per unit time). Interestingly, estimates of regional richness based on BBA blocks were higher than those based on BBS data even when differences in the number of observation units were controlled for. Our analysis indicates that this difference was due to higher compositional turnover between BBA units, probably due to larger differences in habitat conditions between BBA units and a larger number of geographically restricted species. Our overall results indicate that estimates of regional richness based on BBS data suffer from incomplete detection of a large number of rare species, and that corrections of these estimates based on standard extrapolation techniques are not sufficient to remove this bias. Future applications of BBS data in ecology and conservation, and in particular, applications in which the representation of rare species is important (e.g., those focusing on biodiversity conservation), should be aware of this bias, and should integrate BBA data whenever possible.

Methods Overview

This is a compilation of second-generation breeding bird atlas data and corresponding breeding bird survey data. This contains presence-absence breeding bird observations in 5 U.S. states: MA, MI, NY, PA, VT, sampling effort per sampling unit, geographic location of sampling units, and environmental variables per sampling unit: elevation and elevation range from (from SRTM), mean annual precipitation & mean summer temperature (from PRISM), and NLCD 2006 land-use data.

Each row contains all observations per sampling unit, with additional tables containing information on sampling effort impact on richness, a rareness table of species per dataset, and two summary tables for both bird diversity and environmental variables.

The methods for compilation are contained in the supplementary information of the manuscript but also here:

Bird data

For BBA data, shapefiles for blocks and the data on species presences and sampling effort in blocks were received from the atlas coordinators. For BBS data, shapefiles for routes and raw species data were obtained from the Patuxent Wildlife Research Center (https://databasin.org/datasets/02fe0ebbb1b04111b0ba1579b89b7420 and https://www.pwrc.usgs.gov/BBS/RawData).

Using ArcGIS Pro© 10.0, species observations were joined to respective BBS and BBA observation units shapefiles using the Join Table tool. For both BBA and BBS, a species was coded as either present (1) or absent (0). Presence in a sampling unit was based on codes 2, 3, or 4 in the original volunteer birding checklist codes (possible breeder, probable breeder, and confirmed breeder, respectively), and absence was based on codes 0 or 1 (not observed and observed but not likely breeding). Spelling inconsistencies of species names between BBA and BBS datasets were fixed. Species that needed spelling fixes included Brewer’s Blackbird, Cooper’s Hawk, Henslow’s Sparrow, Kirtland’s Warbler, LeConte’s Sparrow, Lincoln’s Sparrow, Swainson’s Thrush, Wilson’s Snipe, and Wilson’s Warbler. In addition, naming conventions were matched between BBS and BBA data. The Alder and Willow Flycatchers were lumped into Traill’s Flycatcher and regional races were lumped into a single species column: Dark-eyed Junco regional types were lumped together into one Dark-eyed Junco, Yellow-shafted Flicker was lumped into Northern Flicker, Saltmarsh Sparrow and the Saltmarsh Sharp-tailed Sparrow were lumped into Saltmarsh Sparrow, and the Yellow-rumped Myrtle Warbler was lumped into Myrtle Warbler (currently named Yellow-rumped Warbler). Three hybrid species were removed: Brewster's and Lawrence's Warblers and the Mallard x Black Duck hybrid. Established “exotic” species were included in the analysis since we were concerned only with detection of richness and not of specific species.

The resultant species tables with sampling effort were pivoted horizontally so that every row was a sampling unit and each species observation was a column. This was done for each state using R version 3.6.2 (R© 2019, The R Foundation for Statistical Computing Platform) and all state tables were merged to yield one BBA and one BBS dataset. Following the joining of environmental variables to these datasets (see below), BBS and BBA data were joined using rbind.data.frame in R© to yield a final dataset with all species observations and environmental variables for each observation unit.

Environmental data

Using ArcGIS Pro© 10.0, all environmental raster layers, BBA and BBS shapefiles, and the species observations were integrated in a common coordinate system (North_America Equidistant_Conic) using the Project tool. For BBS routes, 400m buffers were drawn around each route using the Buffer tool. The observation unit shapefiles for all states were merged (separately for BBA blocks and BBS routes and 400m buffers) using the Merge tool to create a study-wide shapefile for each data source. Whether or not a BBA block was adjacent to a BBS route was determined using the Intersect tool based on a radius of 30m around the route buffer (to fit the NLCD map resolution). Area and length of the BBS route inside the proximate BBA block were also calculated. Mean values for annual precipitation and summer temperature, and mean and range for elevation, were extracted for every BBA block and 400m buffer BBS route using Zonal Statistics as Table tool. The area of each land-cover type in each observation unit (BBA block and BBS buffer) was calculated from the NLCD layer using the Zonal Histogram tool.
Food Reviews - Text Mining & Sentiment Analysis
kaggle.com
zip
Updated Aug 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
vikram amin (2023). Food Reviews - Text Mining & Sentiment Analysis [Dataset]. https://www.kaggle.com/datasets/vikramamin/food-reviews-text-mining-and-sentiment-analysis
Explore at:
zip(1075643 bytes)Available download formats
Dataset updated
Aug 4, 2023
Authors
vikram amin
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Brief Description: - The Chief Marketing Officer (CMO) of Healthy Foods Inc. wants to understand customer sentiments about the specialty foods that the company offers. This information has been collected through customer reviews on their website. Dataset consists of about 5000 reviews. They want the answers to the following questions: 1. What are the most frequently used words in the customer reviews? 2. How can the data be prepared for text analysis? 3. What are the overall sentiments towards the products?

We will be using text mining and sentiment analysis (R programming) to offer insights to the CMO with regards to the food reviews

Steps: - Set the working directory and read the data. https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fd7ec6c7460b58ae39c96d5431cca2d37%2FPicture1.png?generation=1691146783504075&alt=media" alt=""> - Data cleaning. Check for missing values and data types of variables - Run the required libraries ("tm", "SnowballC", "dplyr", "sentimentr", "wordcloud2", "RColorBrewer") - TEXT ACQUISITION and AGGREGATION. Create corpus. - TEXT PRE-PROCESSING. Cleaning the text - Replace special characters with " ". We use the tm_map function for this purpose - make all the alphabets lower case - remove punctuations - remove whitespace - remove stopwords - remove numbers - stem the document - create term document matrix https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F0508dfd5df9b1ed2885e1eea35b84f30%2FPicture2.png?generation=1691147153582115&alt=media" alt=""> - convert into matrix and find out frequency of words https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Febc729e81068856dec368667c5758995%2FPicture3.png?generation=1691147243385812&alt=media" alt=""> - convert into a data frame - TEXT EXPLORATION find out the words which appear most frequently and least frequently https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F33cf5decc039baf96dbe86dd6964792a%2FTop%205%20frequent%20words.jpeg?generation=1691147382783191&alt=media" alt=""> - Create Wordcloud

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F99f1147bd9e9a4e6bb35686b015fc714%2FWordCloud.png?generation=1691147502824379&alt=media" alt="">

TEXT MODELLING

Word association between two words which tend to appear more number of times. Here we try to find the association for the top three occurring words "like", "tast", "flavor" by setting a correlation limit of 0.2 https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fbfdbfbe28a30012f0e7ab54d6185c223%2FPicture4.png?generation=1691147754149529&alt=media" alt="">

"like" has an association with "realli" (they appear about 25% of the time together), dont (24%), one(21%)

"tast" does not have an association with any word with the set correlation limit

"flavor" has an association with the word "chip"(they appear about 27% of the time together)

Sentiment analysis https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fa5da1dd46a60494ec9b26fa1a08b2087%2FPicture5.png?generation=1691147897889137&alt=media" alt="">

element_id refers to the Review No and sentence_id refers to the Sentence No in the review , word_count refers to the number of words part of that sentence in that review. Sentiment would be either positive or negative.

Let us find out the overall sentiment score of all the reviews https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F6fce0e810d47ea8864ebac58eca1be99%2FPicture6.png?generation=1691148149575056&alt=media" alt="">

This indicates that the entire food review document has a marginally positive score

Let us find out the sentiment score for each of the 5000 reviews. https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F5b7861d5ebc3881483dd65a8385a539c%2FPicture7.png?generation=1691148278877972&alt=media" alt="">

(-1) indicates the most extreme negative sentiment and (+1) indicates the most extreme positive sentiment

Let us create a separate data frame for all the negative sentiments. In total there are 726 negative sentiments out of the total 5000 reviews (approx 15%).
Additional file 1 of Holomics - a user-friendly R shiny application for...
springernature.figshare.com
zip
Updated Aug 15, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Katharina Munk; Daria Ilina; Lisa Ziemba; Günter Brader; Eva M. Molin (2024). Additional file 1 of Holomics - a user-friendly R shiny application for multi-omics data integration and analysis [Dataset]. http://doi.org/10.6084/m9.figshare.26688587.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.26688587.v1
Dataset updated
Aug 15, 2024
Dataset provided by
Figsharehttp://figshare.com/
Authors
Katharina Munk; Daria Ilina; Lisa Ziemba; Günter Brader; Eva M. Molin
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Additional file 1. The additional file (Additional file 1.zip) is a compressed folder containing four .csv files. Table S1: Targeted metabolite data, Table S2: Microbiomics ASV count table resulting from 16S amplicon sequencing, Table S3: Microbiomics ASV count table resulting from ITS amplicon sequencing, Table S4: Transcriptomics read count table (transposed format), and Table S5: Labels and class information including color code of the analyzed samples. Besides of being the data source for the present case study, these data tables can be used as test dataset after removal of the table header (first line). We highly recommend opening the files in a text editor of your choice and remove the headers there. When doing this step in Excel an error may occur.

Surface soil moisture for Europe 2014-2024 at 1 km annual and quarterly...

repository.soilwise-he.eu

Updated Feb 7, 2025

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

(2025). Surface soil moisture for Europe 2014-2024 at 1 km annual and quarterly aggregates [Dataset]. http://doi.org/10.5281/zenodo.14833053

Explore at:

Unique identifier

https://doi.org/10.5281/zenodo.14833053

Dataset updated

Feb 7, 2025

Description

Copernicus Land Monitoring Services provides Surface Soil Moisture 2014-present (raster 1 km), Europe, daily – version 1. Each day covers only 5 to 10% of European land mask and shows lines of scenes (obvious artifacts). This is the long-term aggregates of daily images of soil moisture (0–100%) based on two types of aggregation:

Long-term quarterly (qr.1 - winter, qr.2 - spring, qr.3 - summer and qr.4 - autumn),
Annual quantiles P.05, P.50 and P.95,

The soil moisture rasters are based on Sentinel 1 and described in detail in:

Bauer-Marschallinger, B. ; Freeman, V. ; Cao, S. ; Paulik, C. ; Schaufler, S. ; Stachl, T. ; Modanesi, S. ; Massari, C. ; Ciabatta, L. ; Brocca, L. ; Wagner, W. Toward Global Soil Moisture Monitoring With Sentinel-1: Harnessing Assets and Overcoming Obstacles. IEEE Transactions on Geoscience and Remote Sensing 2019, 1 - 20. DOI 10.1109/TGRS.2018.2858004

You can access and download the original data as .nc files from: https://globalland.vito.be/download/manifest/ssm_1km_v1_daily_netcdf/.

Aggregation has been generated using the terra package in R in combination with the matrixStats::rowQuantiles function. Tiling system and land mask for pan-EU is also available.

library(terra)library(matrixStats)g1 = terra::vect('/mnt/inca/EU_landmask/tilling_filter/eu_ard2_final_status.gpkg')## 1254 tilestile = g1[534]nc.lst = list.files('/mnt/landmark/SM1km/ssm_1km_v1_daily_netcdf/', pattern = glob2rx('*.nc$'), full.names=TRUE)## 3726## test it#r = terra::rast(nc.lst[100:210])agg_tile = function(r, tile, pv=c(0.05,0.5,0.95), out.year='2015.annual'){ bb = paste(as.vector(ext(tile)), collapse = '.') out.tif = paste0('./eu_tmp/', out.year, '/sm1km_', pv, '_', out.year, '_', bb, '.tif') if(any(!file.exists(out.tif))){  r.t = terra::crop(r, ext(tile))  ## each tile is 100x100 pixels 365 days  r.t = as.data.frame(r.t, xy=TRUE, na.rm=FALSE)  sel.c = grep(glob2rx('ssm$'), colnames(r.t))  ## remove everything outside the range  t1s = cbind(data.frame(matrixStats::rowQuantiles(as.matrix(r.t[,sel.c]), probs = pv, na.rm=TRUE)), data.frame(x=r.t$x, y=r.t$y))  #str(t1s)  ## write to GeoTIFFs  r.o = terra::rast(t1s[,c('x','y','X5.','X50.','X95.')], type='xyz', crs='+proj=longlat +datum=WGS84 +no_defs')  for(k in 1:length(pv)){    terra::writeRaster(r.o[[k]], filename=out.tif[k], gdal=c('COMPRESS=DEFLATE'), datatype='INT2U', NAflag=32768, overwrite=FALSE)  }  rm(r.t); gc()  tmpFiles(remove=TRUE) }}## quarterly values:lA = data.frame(filename=nc.lst)library(lubridate)lA$Date = ymd(sapply(lA$filename, function(i){substr(strsplit(basename(i), '_')[[1]][4], 1, 8)}))#summary(is.na(lA$Date))#hist(lA$Date, breaks=60)lA$quarter = quarter(lA$Date, fiscal_start = 11)summary(as.factor(lA$quarter))for(qr in 1:4){ #qr=1 pth = paste0('A.q', qr) rs = terra::rast(lA$filename[lA$quarter==qr]) #agg_tile(rs, tile, out.year=pth) x = parallel::mclapply(sample(1:length(g1)), function(i){try( agg_tile(rs, tile=g1[i], out.year=pth) )}, mc.cores=20) for(type in c(0.05,0.5,0.95)){  x <- list.files(path=paste0('./eu_tmp/', pth), pattern=glob2rx(paste0('sm1km_', type, '_*.tif$')), full.names=TRUE)  out.tmp <- paste0(pth, '.', type, '.sm1km_eu.txt')  vrt.tmp <- paste0(pth, '.', type, '.sm1km_eu.vrt')  cat(x, sep=' n', file=out.tmp)  system(paste0('gdalbuildvrt -input_file_list ', out.tmp, ' ', vrt.tmp))  system(paste0('gdal_translate ', vrt.tmp, ' ./cogs/soil.moisture_s1.clms.qr.', qr, '.p', type, '_m_1km_20140101_20241231_eu_epsg4326_v20250206.tif -ot 'Byte' -r 'near' --config GDAL_CACHEMAX 9216 -co BIGTIFF=YES -co NUM_THREADS=80 -co COMPRESS=DEFLATE -of COG -projwin -32 72 45 27')) }}## per year ----for(year in 2015:2023){ l.lst = nc.lst[grep(year, basename(nc.lst))] r = terra::rast(l.lst) ## test it: pth = paste0(year, '.annual') x = parallel::mclapply(sample(1:length(g1)), function(i){try( agg_tile(r, tile=g1[i], out.year=pth) )}, mc.cores=40) ## Mosaics: for(type in c(0.05,0.5,0.95)){  x <- list.files(path=paste0('./eu_tmp/', pth), pattern=glob2rx(paste0('sm1km_', type, '_*.tif$')), full.names=TRUE)  out.tmp <- paste0(pth, '.', type, '.sm1km_eu.txt')  vrt.tmp <- paste0(pth, '.', type, '.sm1km_eu.vrt')  cat(x, sep=' n', file=out.tmp)  system(paste0('gdalbuildvrt -input_file_list ', out.tmp, ' ', vrt.tmp))  system(paste0('gdal_translate ', vrt.tmp, ' ./cogs/soil.moisture_s1.clms.annual.', type, '_m_1km_', year, '0101_', year, '1231_eu_epsg4326_v20250206.tif -ot 'Byte' -r 'near' --config GDAL_CACHEMAX 9216 -co BIGTIFF=YES -co NUM_THREADS=80 -co COMPRESS=DEFLATE -of COG -projwin -32 72 45 27')) }}

f
Additional file 1 of Correction: Holomics ‑ a user‑friendly R shiny...
datasetcatalog.nlm.nih.gov
springernature.figshare.com
Updated Aug 15, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ziemba, Lisa; Brader, Günter; Molin, Eva M.; Munk, Katharina; Ilina, Daria (2024). Additional file 1 of Correction: Holomics ‑ a user‑friendly R shiny application for multi‑omics data integration and analysis [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001449772
Explore at:
Dataset updated
Aug 15, 2024
Authors
Ziemba, Lisa; Brader, Günter; Molin, Eva M.; Munk, Katharina; Ilina, Daria
Description
Additional file 1. The additional file (Additional file 1.zip) is a compressed folder containing four.csv files. Table S1: Targeted metabolite data, Table S2: Microbiomics ASV count table resulting from 16S amplicon sequencing, Table S3: Microbiomics ASV count table resulting from ITS amplicon sequencing, Table S4: Transcriptomics read count table (transposed format), and Table S5: Labels and class information including color code of the analyzed samples. Besides of being the data source for the present case study, these data tables can be used as test datasets after removal of the table header (first line). We highly recommend opening the files in a text editor of your choice and remove the headers there. When doing this step in Excel an error may occur.
Market Basket Analysis
kaggle.com
zip
Updated Dec 9, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
Explore at:
zip(23875170 bytes)Available download formats
Dataset updated
Dec 9, 2021
Authors
Aslan Ahmedov
Description
Market Basket Analysis

Market basket analysis with Apriori algorithm

The retailer wants to target customers with suggestions on itemset that a customer is most likely to purchase .I was given dataset contains data of a retailer; the transaction data provides data around all the transactions that have happened over a period of time. Retailer will use result to grove in his industry and provide for customer suggestions on itemset, we be able increase customer engagement and improve customer experience and identify customer behavior. I will solve this problem with use Association Rules type of unsupervised learning technique that checks for the dependency of one data item on another data item.

Introduction

Association Rule is most used when you are planning to build association in different objects in a set. It works when you are planning to find frequent patterns in a transaction database. It can tell you what items do customers frequently buy together and it allows retailer to identify relationships between the items.

An Example of Association Rules

Assume there are 100 customers, 10 of them bought Computer Mouth, 9 bought Mat for Mouse and 8 bought both of them. - bought Computer Mouth => bought Mat for Mouse - support = P(Mouth & Mat) = 8/100 = 0.08 - confidence = support/P(Mat for Mouse) = 0.08/0.09 = 0.89 - lift = confidence/P(Computer Mouth) = 0.89/0.10 = 8.9 This just simple example. In practice, a rule needs the support of several hundred transactions, before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.

Strategy

Data Import

Data Understanding and Exploration

Transformation of the data – so that is ready to be consumed by the association rules algorithm

Running association rules

Exploring the rules generated

Filtering the generated rules

Visualization of Rule

Dataset Description

File name: Assignment-1_Data

List name: retaildata

File format: . xlsx

Number of Row: 522065

Number of Attributes: 7

BillNo: 6-digit number assigned to each transaction. Nominal.

Itemname: Product name. Nominal.

Quantity: The quantities of each product per transaction. Numeric.

Date: The day and time when each transaction was generated. Numeric.

Price: Product price. Numeric.

CustomerID: 5-digit number assigned to each customer. Nominal.

Country: Name of the country where each customer resides. Nominal.

https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png">

Libraries in R

First, we need to load required libraries. Shortly I describe all libraries.

arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).

arulesViz - Extends package 'arules' with various visualization. techniques for association rules and item-sets. The package also includes several interactive visualizations for rule exploration.

tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.

readxl - Read Excel Files in R.

plyr - Tools for Splitting, Applying and Combining Data.

ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

knitr - Dynamic Report generation in R.

magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.

dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.

tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.

https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png">

Data Pre-processing

Next, we need to upload Assignment-1_Data. xlsx to R to read the dataset.Now we can see our data in R.

https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png"> https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png">

After we will clear our data frame, will remove missing values.

https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png">

To apply Association Rule mining, we need to convert dataframe into transaction data to make all items that are bought together in one invoice will be in ...
Z
Data from: Russian Financial Statements Database: A firm-level collection of...
data.niaid.nih.gov
Updated Mar 14, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bondarkov, Sergey; Ledenev, Victor; Skougarevskiy, Dmitriy (2025). Russian Financial Statements Database: A firm-level collection of the universe of financial statements [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14622208
Explore at:
Dataset updated
Mar 14, 2025
Dataset provided by
European University at St. Petersburg
European University at St Petersburg
Authors
Bondarkov, Sergey; Ledenev, Victor; Skougarevskiy, Dmitriy
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:

🔓 First open data set with information on every active firm in Russia.

🗂️ First open financial statements data set that includes non-filing firms.

🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.

📅 Covers 2011-2023 initially, will be continuously updated.

🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.

The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in a structured, column-oriented, compressed binary format Apache Parquet with yearly partitioning scheme, enabling end-users to query only variables of interest at scale.

The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.

Here we present the instructions for importing the data in R or Python environment. Please consult with the project repository for more information: http://github.com/irlcode/RFSD.

Importing The Data

You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo or rely on 🤗 Hugging Face Datasets library.

Python

🤗 Hugging Face Datasets

It is as easy as:

from datasets import load_dataset import polars as pl

This line will download 6.6GB+ of all RFSD data and store it in a 🤗 cache folder

RFSD = load_dataset('irlspbru/RFSD')

Alternatively, this will download ~540MB with all financial statements for 2023# to a Polars DataFrame (requires about 8GB of RAM)

RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')

Please note that the data is not shuffled within year, meaning that streaming first n rows will not yield a random sample.

Local File Import

Importing in Python requires pyarrow package installed.

import pyarrow.dataset as ds import polars as pl

Read RFSD metadata from local file

RFSD = ds.dataset("local/path/to/RFSD")

Use RFSD_dataset.schema to glimpse the data structure and columns' classes

print(RFSD.schema)

Load full dataset into memory

RFSD_full = pl.from_arrow(RFSD.to_table())

Load only 2019 data into memory

RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

Load only revenue for firms in 2019, identified by taxpayer id

RFSD_2019_revenue = pl.from_arrow( RFSD.to_table( filter=ds.field('year') == 2019, columns=['inn', 'line_2110'] ) )

Give suggested descriptive names to variables

renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv') RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})

R

Local File Import

Importing in R requires arrow package installed.

library(arrow) library(data.table)

Read RFSD metadata from local file

RFSD <- open_dataset("local/path/to/RFSD")

Use schema() to glimpse into the data structure and column classes

schema(RFSD)

Load full dataset into memory

scanner <- Scanner$create(RFSD) RFSD_full <- as.data.table(scanner$ToTable())

Load only 2019 data into memory

scan_builder <- RFSD$NewScan() scan_builder$Filter(Expression$field_ref("year") == 2019) scanner <- scan_builder$Finish() RFSD_2019 <- as.data.table(scanner$ToTable())

Load only revenue for firms in 2019, identified by taxpayer id

scan_builder <- RFSD$NewScan() scan_builder$Filter(Expression$field_ref("year") == 2019) scan_builder$Project(cols = c("inn", "line_2110")) scanner <- scan_builder$Finish() RFSD_2019_revenue <- as.data.table(scanner$ToTable())

Give suggested descriptive names to variables

renaming_dt <- fread("local/path/to/descriptive_names_dict.csv") setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)

Use Cases

🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md

🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md

🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md

FAQ

Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?hat is the data period?

To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.

What is the data period?

We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July, 2025 (see Version and Update Policy below).

Why are there no data for firm X in year Y?

Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:

We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).

Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, Novatek — in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and therefore are able source this information elsewhere.

Firm may have submitted its annual statement but, according to the Uniform State Register of Legal Entities (EGRUL), it was not active in this year. We remove those filings.

Why is the geolocation of firm X incorrect?

We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to a house level in 2014 and 2021-2023, but only at street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, address of incorporation may not correspond with plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.

Why is the data for firm X different from https://bo.nalog.ru/?

Many firms submit correcting statements after the initial filing. While we have downloaded the data way past the April, 2024 deadline for 2023 filings, firms may have kept submitting the correcting statements. We will capture them in the future releases.

Why is the data for firm X unrealistic?

We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.

Why is the data for groups of companies different from their IFRS statements?

We should stress that we provide unconsolidated financial statements filed according to the Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups with this data. Gazprom, for instance, had over 800 affiliated entities and to study this corporate group in its entirety it is not enough to consider financials of the parent company.

Why is the data not in CSV?

The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.

Version and Update Policy

Version (SemVer): 1.0.0.

We intend to update the RFSD annualy as the data becomes available, in other words when most of the firms have their statements filed with the Federal Tax Service. The official deadline for filing of previous year statements is April, 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year but after the end of April this stream quickly thins out. Nevertheless, there is obviously a trade-off between minimization of data completeness and version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% statements are already filed, including 86.4% of all the correcting filings. We plan to make a new version of RFSD available by July.

Licence

Creative Commons License Attribution 4.0 International (CC BY 4.0).

Copyright © the respective contributors.

Citation

Please cite as:

@unpublished{bondarkov2025rfsd, title={{R}ussian {F}inancial {S}tatements {D}atabase}, author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy}, note={arXiv preprint arXiv:2501.05841}, doi={https://doi.org/10.48550/arXiv.2501.05841}, year={2025}}

Acknowledgments and Contacts

Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru

Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,
Bank Loan Approval - LR, DT, RF and AUC
kaggle.com
zip
Updated Nov 7, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
vikram amin (2023). Bank Loan Approval - LR, DT, RF and AUC [Dataset]. https://www.kaggle.com/datasets/vikramamin/bank-loan-approval-lr-dt-rf-and-auc
Explore at:
zip(61437 bytes)Available download formats
Dataset updated
Nov 7, 2023
Authors
vikram amin
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
DATASET: Dependent variable is 'Personal.Loan'. 0 indicates loan not approved and 1 indicates loan approved.

OBJECTIVE : We will do Exploratory Data Analysis and use Logistic Regression, Decision Tree, Random Forest and AUC to find out which is the best model. Steps:

Set the working directory and read the data

Check the data types of all the variables https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F020afd07cf0c5ba058d88add9bcd467a%2FPicture1.png?generation=1699357564112927&alt=media" alt="">

DATA CLEANING

We need to change the data types of certain variables to factor vector

Check for missing data, duplicate records and remove insignificant variables https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fa286a5225207d4419b34bcf800e3cb67%2FPicture2.png?generation=1699357685993423&alt=media" alt="">

New data frame created called 'bank1' after dropping the 'ID' column.

EXPLORATORY DATA ANALYSIS

We will try to get some insights by digging into the data through bar charts and box plots which can help the bank management in decision making

Run the required libraries https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F7363f4b9ca8245b6e998bf07005fa099%2FPicture3.png?generation=1699357871368520&alt=media" alt=""> https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F8dba10f16fc6c2d7fd51a4c82a692136%2FCount%20of%20Loans%20Approved%20%20Not%20Approved.jpeg?generation=1699357967347355&alt=media" alt="">

Out of the total 5000 customers, 4520 have not been approved for a loan while 480 have been https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fe5eec968e7b264d9ec540bd1f24379fd%2FPicture4.png?generation=1699358066228901&alt=media" alt=""> https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fb64eba6f373d5c043c9f504cfa348a75%2FPicture5.png?generation=1699358103026827&alt=media" alt=""> https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F94608993dc12cdc31cfeca92932e0cb5%2FBoxPlot%20Income%20and%20Family.jpeg?generation=1699358148840198&alt=media" alt="">

THIS INDICATES THAT INCOME IS HIGHER WHEN THERE ARE LESS FAMILY MEMBERS https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F8e44daf4ed42094f71c3000737f07a32%2FPicture6.png?generation=1699360599956530&alt=media" alt=""> https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F0fd9010b95acf9ad20f7b9d0e171f305%2FBoxplot%20between%20Income%20%20Personal%20Loan.jpeg?generation=1699359231020725&alt=media" alt="">

THIS INDICATES PERSONAL LOAN HAS BEEN APPROVED FOR CUSTOMERS HAVING HIGHER INCOME https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Ff817481849aba7f176b7c4d0147308de%2FPicture7.png?generation=1699360768102069&alt=media" alt=""> https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F8e0bad8c76aaa11fe3b9909721d587f5%2FBoxPlot%20between%20Income%20%20Credit%20Cards.jpeg?generation=1699360798538907&alt=media" alt="">

THIS INDICATES THAT THE INCOME IS PRETTY SIMILAR FOR CUSTOMERS OWNING AND NOT OWNING A CREDIT CARD https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fab4b2fd2fde2a009bceb05a5a1161040%2FPicture8.png?generation=1699360882879480&alt=media" alt=""> https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fe747dfa315609c4907ea83a9ac7f482c%2FBoxPlot%20between%20Income%20Class%20%20Mortgage.jpeg?generation=1699359265603058&alt=media" alt="">

CUSTOMERS BELONGING TO THE RICH CLASS (INCOME GROUP : 150-200) HAVE THE HIGHEST MORTGAGE https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F6552d3fb9564b3ab3239ef67ed17a098%2FPicture9.png?generation=1699360938106437&alt=media" alt=""> https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F4c7c7077e26229f455c1d9ef6e83195f%2FBoxPlot%20between%20CC%20Avg%20and%20Online%20Banking.jpeg?generation=1699359306645100&alt=media" alt="">

CC AVG IS PRETTY SIMILAR FOR THOSE WHO OPTED FOR ONLINE SERVICES AND THOSE WHO DID NOT
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Feddee2ca08a8138bb54eed0c25750280%2FPicture10.png?generation=1699360994581181&alt=media" alt=""> https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F6127e25258b25ccfbae66a5463a72773%2FBoxplot%20between%20CC%20Avg%20and%20Education.jpeg?generation=1699359333295827&alt=media" alt="">

MORE EDUCATED CUSTOMERS HAVE A HIGHER CREDIT AVERAGE ![](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F...
Alone TV Series
kaggle.com
zip
Updated Feb 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Joakim Arvidsson (2024). Alone TV Series [Dataset]. https://www.kaggle.com/datasets/joebeachcapital/alone-tv-series
Explore at:
zip(24709 bytes)Available download formats
Dataset updated
Feb 6, 2024
Authors
Joakim Arvidsson
License
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Description
Updated to include all season up to and including season 10.

This dataset contains data from the TV series Alone collected and shared by Dan Oehm. As described in Oehm's blog post](https://gradientdescending.com/alone-r-package-datasets-from-the-survival-tv-series/), in the survival TV series ‘Alone,' 10 survivalists are dropped in an extremely remote area and must fend for themselves. They aim to last 100 days in the Artic winter, living off the land through their survival skills, endurance, and mental fortitude.

This package contains four datasets:

survivalists.csv: A data frame of survivalists across all 9 seasons detailing name and demographics, location and profession, result, days lasted, reasons for tapping out (detailed and categorised), and page URL.

loadouts.csv: The rules allow each survivalist to take 10 items with them. This dataset includes information on each survivalist's loadout. It has detailed item descriptions and a simplified version for easier aggregation and analysis

episodes.csv: This dataset contains details of each episode including the title, number of viewers, beginning quote, and IMDb rating. New episodes are added at the end of future seasons.

seasons.csv: The season summary dataset includes location, latitude and longitude, and other season-level information. It includes the date of drop-off where the information exists.

Acknowledging the Alone dataset

Dan Oehm:

Alone data package: https://github.com/doehm/alone Alone data package blog post: https://gradientdescending.com/alone-r-package-datasets-from-the-survival-tv-series/ Examples of analyses are included in Dan Oehm's blog post.

References

History: https://www.history.com/shows/alone/cast

Wikipedia: https://en.wikipedia.org/wiki/Alone_(TV_series)

Wikipedia (episodes): https://en.wikipedia.org/wiki/List_of_Alone_episodes#Season_1_(2015)_-_Vancouver_Island
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

TheNicelander (2016). FacialRecognition [Dataset]. https://www.kaggle.com/petein/facialrecognition

FacialRecognition

Test environment for FacialRecognition competition

Explore at:

zip(121674455 bytes)Available download formats

Dataset updated

Dec 1, 2016

Authors

TheNicelander

License

http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/

Description

#https://www.kaggle.com/c/facial-keypoints-detection/details/getting-started-with-r #################################

###Variables for downloaded files data.dir <- ' ' train.file <- paste0(data.dir, 'training.csv') test.file <- paste0(data.dir, 'test.csv') #################################

###Load csv -- creates a data.frame matrix where each column can have a different type. d.train <- read.csv(train.file, stringsAsFactors = F) d.test <- read.csv(test.file, stringsAsFactors = F)

###In training.csv, we have 7049 rows, each one with 31 columns. ###The first 30 columns are keypoint locations, which R correctly identified as numbers. ###The last one is a string representation of the image, identified as a string.

###To look at samples of the data, uncomment this line:

head(d.train)

###Let's save the first column as another variable, and remove it from d.train: ###d.train is our dataframe, and we want the column called Image. ###Assigning NULL to a column removes it from the dataframe

im.train <- d.train$Image d.train$Image <- NULL #removes 'image' from the dataframe

im.test <- d.test$Image d.test$Image <- NULL #removes 'image' from the dataframe

################################# #The image is represented as a series of numbers, stored as a string #Convert these strings to integers by splitting them and converting the result to integer

#strsplit splits the string #unlist simplifies its output to a vector of strings #as.integer converts it to a vector of integers. as.integer(unlist(strsplit(im.train[1], " "))) as.integer(unlist(strsplit(im.test[1], " ")))

###Install and activate appropriate libraries ###The tutorial is meant for Linux and OSx, where they use a different library, so: ###Replace all instances of %dopar% with %do%.

install.packages('foreach')

library("foreach", lib.loc="~/R/win-library/3.3")

###implement parallelization im.train <- foreach(im = im.train, .combine=rbind) %do% { as.integer(unlist(strsplit(im, " "))) } im.test <- foreach(im = im.test, .combine=rbind) %do% { as.integer(unlist(strsplit(im, " "))) } #The foreach loop will evaluate the inner command for each row in im.train, and combine the results with rbind (combine by rows). #%do% instructs R to do all evaluations in parallel. #im.train is now a matrix with 7049 rows (one for each image) and 9216 columns (one for each pixel):

###Save all four variables in data.Rd file ###Can reload them at anytime with load('data.Rd')

save(d.train, im.train, d.test, im.test, file='data.Rd')

load('data.Rd')

#each image is a vector of 96*96 pixels (96*96 = 9216). #convert these 9216 integers into a 96x96 matrix: im <- matrix(data=rev(im.train[1,]), nrow=96, ncol=96)

#im.train[1,] returns the first row of im.train, which corresponds to the first training image. #rev reverse the resulting vector to match the interpretation of R's image function #(which expects the origin to be in the lower left corner).

#To visualize the image we use R's image function: image(1:96, 1:96, im, col=gray((0:255)/255))

#Let’s color the coordinates for the eyes and nose points(96-d.train$nose_tip_x[1], 96-d.train$nose_tip_y[1], col="red") points(96-d.train$left_eye_center_x[1], 96-d.train$left_eye_center_y[1], col="blue") points(96-d.train$right_eye_center_x[1], 96-d.train$right_eye_center_y[1], col="green")

#Another good check is to see how variable is our data. #For example, where are the centers of each nose in the 7049 images? (this takes a while to run): for(i in 1:nrow(d.train)) { points(96-d.train$nose_tip_x[i], 96-d.train$nose_tip_y[i], col="red") }

#there are quite a few outliers -- they could be labeling errors. Looking at one extreme example we get this: #In this case there's no labeling error, but this shows that not all faces are centralized idx <- which.max(d.train$nose_tip_x) im <- matrix(data=rev(im.train[idx,]), nrow=96, ncol=96) image(1:96, 1:96, im, col=gray((0:255)/255)) points(96-d.train$nose_tip_x[idx], 96-d.train$nose_tip_y[idx], col="red")

#One of the simplest things to try is to compute the mean of the coordinates of each keypoint in the training set and use that as a prediction for all images colMeans(d.train, na.rm=T)

#To build a submission file we need to apply these computed coordinates to the test instances: p <- matrix(data=colMeans(d.train, na.rm=T), nrow=nrow(d.test), ncol=ncol(d.train), byrow=T) colnames(p) <- names(d.train) predictions <- data.frame(ImageId = 1:nrow(d.test), p) head(predictions)

#The expected submission format has one one keypoint per row, but we can easily get that with the help of the reshape2 library:

install.packages('reshape2')

library(...

Clear search

Close search

Google apps

Main menu

FacialRecognition

head(d.train)

install.packages('foreach')

save(d.train, im.train, d.test, im.test, file='data.Rd')

load('data.Rd')

install.packages('reshape2')

Effect of data source on estimates of regional bird richness in northeastern...

Food Reviews - Text Mining & Sentiment Analysis

Additional file 1 of Holomics - a user-friendly R shiny application for...

Surface soil moisture for Europe 2014-2024 at 1 km annual and quarterly...

Additional file 1 of Correction: Holomics ‑ a user‑friendly R shiny...

Market Basket Analysis

Market Basket Analysis

Introduction

An Example of Association Rules

Strategy

Dataset Description

Libraries in R

Data Pre-processing

Data from: Russian Financial Statements Database: A firm-level collection of...

This line will download 6.6GB+ of all RFSD data and store it in a 🤗 cache folder

Alternatively, this will download ~540MB with all financial statements for 2023# to a Polars DataFrame (requires about 8GB of RAM)

Read RFSD metadata from local file

Use RFSD_dataset.schema to glimpse the data structure and columns' classes

Load full dataset into memory

Load only 2019 data into memory

Load only revenue for firms in 2019, identified by taxpayer id

Give suggested descriptive names to variables

Read RFSD metadata from local file

Use schema() to glimpse into the data structure and column classes

Load full dataset into memory

Load only 2019 data into memory

Load only revenue for firms in 2019, identified by taxpayer id

Give suggested descriptive names to variables

Bank Loan Approval - LR, DT, RF and AUC

Alone TV Series

FacialRecognition

Test environment for FacialRecognition competition

head(d.train)

install.packages('foreach')

save(d.train, im.train, d.test, im.test, file='data.Rd')

load('data.Rd')

install.packages('reshape2')