8 datasets found

Google Data Analytics Case Study Cyclistic

kaggle.com

zip

Updated Sep 27, 2022

Udayakumar19 (2022). Google Data Analytics Case Study Cyclistic [Dataset]. https://www.kaggle.com/datasets/udayakumar19/google-data-analytics-case-study-cyclistic/suggestions

Available download formats: zip (1299 bytes)

Dataset updated

Sep 27, 2022

Authors

Udayakumar19

Description

Introduction

Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.

Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Ask

How do annual members and casual riders use Cyclistic bikes differently?

Guiding Question:

What is the problem you are trying to solve?
  How do annual members and casual riders use Cyclistic bikes differently?
How can your insights drive business decisions?
  The insights will help the marketing team build a strategy aimed at casual riders.

Prepare

Guiding Question:

Where is your data located?
  The data is located in Cyclistic's organizational data.

How is data organized?
  The datasets are in CSV format, one file per month, for financial year 22.

Are there issues with bias or credibility in this data? Does your data ROCCC? 
  The data is good: it is ROCCC because it was collected by the Cyclistic organization.

How are you addressing licensing, privacy, security, and accessibility?
  The company has its own license over the dataset. The dataset does not contain any personal information about the riders.

How did you verify the data’s integrity?
  All the files have consistent columns and each column has the correct type of data.

How does it help you answer your questions?
  Insights are always hidden in the data. We have to interpret the data to find them.

Are there any problems with the data?
  Yes, the starting and ending station names have null values.

Process

Guiding Question:

What tools are you choosing and why?
  I used RStudio to clean and transform the data for the analysis phase, because the dataset is large and to gain experience in the language.

Have you ensured the data’s integrity?
  Yes, the data is consistent throughout the columns.

What steps have you taken to ensure that your data is clean?
  First, duplicates and null values were removed; then new columns were added for analysis.

How can you verify that your data is clean and ready to analyze? 
  Make sure the column names are consistent throughout all datasets by using the “bind_rows” function.
  Make sure column data types are consistent throughout all the datasets by using “compare_df_cols” from the “janitor” package.
  Combine all the datasets into a single data frame to keep the analysis consistent.
  Remove the columns start_lat, start_lng, end_lat, end_lng from the data frame because they are not required for analysis.
  Create new columns day, date, month, year from the started_at column; this provides additional opportunities to aggregate the data.
  Create the “ride_length” column from the started_at and ended_at columns to find the average ride duration.
  Remove rows with null values from the dataset by using the “na.omit” function.
Have you documented your cleaning process so you can review and share those results? 
  Yes, the cleaning process is documented clearly.
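
The derived columns described in the cleaning steps can be sketched in a few lines. The write-up used R; the sketch below is a Python standard-library stand-in, and the timestamp format, the helper names and the two sample rows are assumptions made for illustration, not taken from the dataset.

```python
from datetime import datetime

# Assumed timestamp layout; the real CSV files may use a different format.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def drop_incomplete(rows):
    """Drop rows that have any missing value (similar in spirit to na.omit in R)."""
    return [r for r in rows if all(v not in (None, "") for v in r.values())]


def add_derived_columns(row):
    """Return a copy of a ride record with date parts and ride_length (minutes) added."""
    started = datetime.strptime(row["started_at"], TS_FORMAT)
    ended = datetime.strptime(row["ended_at"], TS_FORMAT)
    out = dict(row)
    out["date"] = started.date().isoformat()
    out["day"] = started.weekday()  # 0 = Monday ... 6 = Sunday
    out["month"] = started.month
    out["year"] = started.year
    out["ride_length"] = (ended - started).total_seconds() / 60.0
    return out


# Hypothetical sample rows; the second one has a missing station name.
rides = [
    {"ride_id": "A1", "started_at": "2022-03-01 08:00:00",
     "ended_at": "2022-03-01 08:12:30", "member_casual": "member",
     "start_station_name": "Example Station"},
    {"ride_id": "B2", "started_at": "2022-03-02 17:05:00",
     "ended_at": "2022-03-02 17:45:00", "member_casual": "casual",
     "start_station_name": ""},
]

clean = [add_derived_columns(r) for r in drop_incomplete(rides)]
```

Grouping `clean` by `member_casual` and averaging `ride_length` would then give the per-group ride duration that the analysis compares.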

Analyze Phase:

Guiding Questions:

How should you organize your data to perform analysis on it?
  The data has been organized in one single data frame by using the read csv function in R.
Has your data been properly formatted?
  Yes, all the columns have their correct data type.

What surprises did you discover in the data?
  Casual riders' ride duration is higher than that of annual members.
  Casual riders use docked bikes far more widely than annual members.
What trends or relationships did you find in the data?
  Annual members mainly use the bikes for commuting.
  Casual riders prefer docked bikes.
  Annual members prefer electric or classic bikes.
How will these insights help answer your business questions?
  These insights help to build a profile for members.

Guiding Questions:

Were you able to answer the question of how ...

Food Reviews - Text Mining & Sentiment Analysis
kaggle.com
zip
Updated Aug 4, 2023
vikram amin (2023). Food Reviews - Text Mining & Sentiment Analysis [Dataset]. https://www.kaggle.com/datasets/vikramamin/food-reviews-text-mining-and-sentiment-analysis
Available download formats: zip (1075643 bytes)
Dataset updated
Aug 4, 2023
Authors
vikram amin
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Brief Description: The Chief Marketing Officer (CMO) of Healthy Foods Inc. wants to understand customer sentiments about the specialty foods that the company offers. This information has been collected through customer reviews on their website. The dataset consists of about 5000 reviews. They want the answers to the following questions:
1. What are the most frequently used words in the customer reviews?
2. How can the data be prepared for text analysis?
3. What are the overall sentiments towards the products?

We will be using text mining and sentiment analysis (R programming) to offer insights to the CMO with regard to the food reviews.

Steps:
- Set the working directory and read the data.
- Data cleaning. Check for missing values and data types of variables.
- Run the required libraries ("tm", "SnowballC", "dplyr", "sentimentr", "wordcloud2", "RColorBrewer").
- TEXT ACQUISITION and AGGREGATION. Create corpus.
- TEXT PRE-PROCESSING. Cleaning the text:
  - Replace special characters with " ". We use the tm_map function for this purpose.
  - Make all the letters lower case.
  - Remove punctuation.
  - Remove whitespace.
  - Remove stopwords.
  - Remove numbers.
  - Stem the document.
  - Create the term document matrix.
- Convert into a matrix and find out the frequency of words.
- Convert into a data frame.
- TEXT EXPLORATION. Find out the words which appear most frequently and least frequently.
  [Figure: Top 5 frequent words]
- Create Wordcloud.
  [Figure: WordCloud]
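
The pre-processing and word-frequency steps can be illustrated with a small sketch. The original work used the R packages listed in the steps; the version below is a Python standard-library stand-in that skips stemming, uses a tiny made-up stopword list, and runs on two invented reviews, so it only shows the shape of the pipeline.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); real lists are much longer.
STOPWORDS = {"the", "a", "an", "and", "is", "it", "this", "i", "of", "to"}


def preprocess(text):
    """Lower-case, strip punctuation/digits/special characters, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    tokens = text.split()                  # split() also collapses extra whitespace
    return [t for t in tokens if t not in STOPWORDS]


def term_frequencies(reviews):
    """Count how often each remaining word occurs across all reviews."""
    counts = Counter()
    for review in reviews:
        counts.update(preprocess(review))
    return counts


# Hypothetical reviews, not taken from the dataset.
reviews = [
    "I like the flavor of this tea!",
    "The flavor is great, and I like it 100%.",
]
freq = term_frequencies(reviews)
top_words = freq.most_common(2)
```

`freq.most_common(n)` plays the role of the "most frequent words" table, and the same counts are what a word cloud would be drawn from.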

TEXT MODELLING

Word association: find the words that tend to appear together with a given word. Here we try to find the associations for the top three occurring words "like", "tast", "flavor" by setting a correlation limit of 0.2.

"like" has an association with "realli" (correlation of about 0.25), "dont" (0.24) and "one" (0.21)

"tast" does not have an association with any word at the set correlation limit

"flavor" has an association with the word "chip" (correlation of about 0.27)

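To make the correlation limit concrete, here is a minimal sketch that assumes the association score is the Pearson correlation between the per-review counts of two terms. It is written in Python rather than R, the function names are hypothetical, and the counts are invented for five imaginary reviews.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long count vectors (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def associations(term, doc_term_counts, limit):
    """Return the terms whose correlation with `term` is at least `limit`."""
    base = doc_term_counts[term]
    result = {}
    for other, counts in doc_term_counts.items():
        if other == term:
            continue
        r = pearson(base, counts)
        if r >= limit:
            result[other] = round(r, 2)
    return result


# Hypothetical counts of three stemmed terms across five reviews.
counts = {
    "like":   [1, 0, 2, 0, 1],
    "realli": [1, 0, 1, 0, 0],
    "chip":   [0, 1, 0, 1, 0],
}
assoc = associations("like", counts, 0.2)
```

With these made-up counts only "realli" clears the 0.2 limit for "like"; "chip" is negatively correlated and is filtered out.
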
Sentiment analysis

element_id refers to the Review No, sentence_id refers to the Sentence No in the review, and word_count refers to the number of words in that sentence of that review. Sentiment would be either positive or negative.

Let us find out the overall sentiment score of all the reviews.

This indicates that the entire food review document has a marginally positive score.

Let us find out the sentiment score for each of the 5000 reviews.

(-1) indicates the most extreme negative sentiment and (+1) indicates the most extreme positive sentiment

Let us create a separate data frame for all the negative sentiments. In total there are 726 negative sentiments out of the total 5000 reviews (approx 15%).
Market Basket Analysis
kaggle.com
zip
Updated Dec 9, 2021
Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
Available download formats: zip (23875170 bytes)
Dataset updated
Dec 9, 2021
Authors
Aslan Ahmedov
Description
Market Basket Analysis

Market basket analysis with Apriori algorithm

The retailer wants to target customers with suggestions on the itemsets that a customer is most likely to purchase. I was given a dataset containing a retailer's transaction data, covering all the transactions that happened over a period of time. The retailer will use the results to grow its business and to offer customers itemset suggestions, which should increase customer engagement, improve customer experience and help identify customer behavior. I will solve this problem using association rules, a type of unsupervised learning technique that checks for the dependency of one data item on another data item.

Introduction

Association rules are most used when you want to find associations between different objects in a set, or frequent patterns in a transaction database. They can tell you which items customers frequently buy together, and they allow the retailer to identify relationships between the items.

An Example of Association Rules

Assume there are 100 customers: 10 of them bought a Computer Mouse, 9 bought a Mat for Mouse and 8 bought both.
- rule: bought Computer Mouse => bought Mat for Mouse
- support = P(Mouse & Mat) = 8/100 = 0.08
- confidence = support/P(Computer Mouse) = 0.08/0.10 = 0.80
- lift = confidence/P(Mat for Mouse) = 0.80/0.09 ≈ 8.9

This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
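
The arithmetic of the example can be checked with a short sketch. It uses the usual definitions for a rule A => B (support = P(A and B), confidence = P(A and B)/P(A), lift = confidence/P(B)); the write-up itself works in R, so this Python function and its name are only an illustration.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Support, confidence and lift for the rule antecedent => consequent."""
    support = n_both / n_total                    # P(A and B)
    confidence = n_both / n_antecedent            # P(B | A)
    lift = confidence / (n_consequent / n_total)  # confidence / P(B)
    return support, confidence, lift


# Counts from the example: 100 customers, 10 bought the mouse,
# 9 bought the mat, 8 bought both.
support, confidence, lift = rule_metrics(
    n_total=100, n_antecedent=10, n_consequent=9, n_both=8
)
```

A lift well above 1, as here, means the two items are bought together far more often than they would be if the purchases were independent.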

Strategy

Data Import

Data Understanding and Exploration

Transformation of the data – so that is ready to be consumed by the association rules algorithm

Running association rules

Exploring the rules generated

Filtering the generated rules

Visualization of Rule

Dataset Description

File name: Assignment-1_Data

List name: retaildata

File format: .xlsx

Number of Rows: 522065

Number of Attributes: 7

BillNo: 6-digit number assigned to each transaction. Nominal.

Itemname: Product name. Nominal.

Quantity: The quantities of each product per transaction. Numeric.

Date: The day and time when each transaction was generated. Numeric.

Price: Product price. Numeric.

CustomerID: 5-digit number assigned to each customer. Nominal.

Country: Name of the country where each customer resides. Nominal.

Libraries in R

First, we need to load the required libraries. Each is described briefly below.

arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).

arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.

tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.

readxl - Read Excel Files in R.

plyr - Tools for Splitting, Applying and Combining Data.

ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

knitr - Dynamic Report generation in R.

magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.

dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.

tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.

Data Pre-processing

Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.

Next, we clean the data frame by removing missing values.

To apply association rule mining, we need to convert the data frame into transaction data, so that all items that are bought together in one invoice will be in ...
Particle Image Velocimetry Results
catalog.data.gov
s.cnmilf.com
Updated Oct 29, 2025
U.S. Geological Survey (2025). Particle Image Velocimetry Results [Dataset]. https://catalog.data.gov/dataset/particle-image-velocimetry-results
Dataset updated
Oct 29, 2025
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Description
This child item contains the Mathworks Matlab mat-file outputs from the scripts described in the Ancillary Scripts child item. Each file contains the results for a particular field site. See the FGDC metadata Process Steps section for more information about opening these files. The mat-files included here have a standard set of output variables and include a variable named "zzVariableDescriptions" in each mat-file which describes the contents of the file. The following variables and descriptions are included in each mat-file (extracted from the "zzVariableDescriptions" variable):

calibration_distance: The distance between calibration points in meters.

calibration_points: Pixel coordinates of the calibration points. Format (array): [X1,Y1; X2,Y2]

calibration_time: Time increment between image frames in milliseconds.

caluv: Correction factor used to convert pixel/second into meters/second.

calxy: Pixel ground resolution in meters/pixel.

directory: Path to folder containing images used in PIV analysis.

filenames: Cell array of strings containing image frame filenames. Format (cellarray): 1×m (m: number of frames)

imagesLocation: Path to folder containing images used in PIV analysis.

i: Dimensions of PIV results. Format: i = number of rows along y-axis

j: Dimensions of PIV results. Format: j = number of columns along x-axis

k: Dimensions of PIV results. Format: k = number of images or frames in time (number of images processed)

p: PIVLab image pre-processing settings. See PIVLab documentation for information.

pixel_resolution: Pixel ground resolution in meters. Assumes square pixels.

r: PIVLab post-processing settings. See PIVLab documentation for information.

resultsFileFullPath: Path to folder containing PIV results in mat-file format.

s: PIVLab standard processing settings. See PIVLab documentation for information.

typevector: Array (m×n×p) containing raw vector result type of frame (m×n) for each frame (p). Format: type 1-valid PIV vector; type 0-masked vector; type 2-invalid PIV vector

typevector_filt: Array (m×n×p) containing filtered vector result type of frame (m×n) for each frame (p). Format: type 1-valid PIV vector; type 0-masked vector; type 2-invalid PIV vector

u_mean: Array (m×n) containing the temporal average u component of velocity in meters/second. Values are averaged for every vector for each frame (along p dimension).

u_stack: Array (m×n×p) containing filtered u component velocities for each vector (m×n) for each frame (p).

v_mean: Array (m×n) containing the temporal average v component of velocity in meters/second. Values are averaged for every vector for each frame (along p dimension).

v_stack: Array (m×n×p) containing filtered v component velocities for each vector (m×n) for each frame (p).

x_ground: Array (m×n) containing the x (horizontal) ground coordinate in meters for each PIV result vector. Origin of coordinates is the lower left corner.

x_pixel: Array (m×n) containing the x (horizontal) pixel coordinate for each PIV result vector.

y_ground: Array (m×n) containing the y (horizontal) ground coordinate in meters for each PIV result vector. Origin of coordinates is the lower left corner.

y_pixel: Array (m×n) containing the y (horizontal) pixel coordinate for each PIV result vector.

zzVariableDescriptions: A structured array containing elements named after each variable in this dataset.
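
The relationship between the stacked and mean velocity variables can be sketched as follows. The files themselves are Matlab mat-files; this is a Python stand-in that uses nested lists in place of an m-by-n-by-p array, with invented values. How masked or invalid vectors are treated in the real averaging is not stated here, so the sketch simply averages every value along the frame dimension.

```python
def temporal_mean(stack):
    """Average each grid cell over its frames.

    stack[i][j] is the list of p per-frame values at grid cell (i, j),
    so the result is an m x n grid of temporal means.
    """
    return [[sum(cell) / len(cell) for cell in row] for row in stack]


# Hypothetical 2 x 2 grid of u velocities (m/s) over 3 frames.
u_stack = [
    [[0.5, 0.7, 0.6], [1.0, 1.0, 1.0]],
    [[0.0, 0.2, 0.1], [0.3, 0.3, 0.6]],
]
u_mean = temporal_mean(u_stack)
```

The same function applied to a v_stack-like structure would give the v_mean counterpart.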

Each Field Site is abbreviated in various files in this data release. File and folder names are used to quickly identify which site a particular file or dataset represents. The following abbreviations are used:

ACR: Androscoggin River, Auburn, Maine, USA

AFR: Agua Fria River, near Rock Springs, Arizona, USA

CCC: Coachella Canal above All-American Canal Diversion, California, USA

CMC: Cochiti East Side Main Channel, near Cochiti, New Mexico, USA

GLR: Gila River near Dome, Arizona, USA

RMC: Reservation Main Canal near Yuma, Arizona, USA

SMC: Sile Main Canal (at head) at Cochiti, New Mexico, USA

WMD: Wellton-Mohawk Main Outlet Drain near Yuma, Arizona, USA
TarokPlayingCardsExtractionIsolation
kaggle.com
zip
Updated May 14, 2024
joej970 (2024). TarokPlayingCardsExtractionIsolation [Dataset]. https://www.kaggle.com/datasets/joej970/tarokplayingcardsextractionisolation/data
Available download formats: zip (201740700 bytes)
Dataset updated
May 14, 2024
Authors
joej970
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
The images are extracted from a video of playing cards being thrown onto two piles. In the end, an AI should recognise and classify all cards. To make this easier for the AI, first generate a mask that shows where the added card is, by finding the content change between two sequential frames. This dataset is provided to test algorithms that find which part of an image has changed. Refer to the Reddit post for more info: https://www.reddit.com/r/computervision/comments/1crmg83/ideas_how_to_improve_extraction_and_isolation_of/
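One simple way to find the content change between two sequential frames is absolute frame differencing with a threshold. The sketch below is a minimal baseline under that assumption, not the dataset author's method; it works on small grayscale frames stored as nested lists, and the threshold value and pixel data are made up.

```python
def change_mask(prev_frame, next_frame, threshold=30):
    """Binary mask (1 = changed) for two equally sized grayscale frames."""
    return [
        [1 if abs(b - a) > threshold else 0 for a, b in zip(row_prev, row_next)]
        for row_prev, row_next in zip(prev_frame, next_frame)
    ]


def bounding_box(mask):
    """Smallest (top, left, bottom, right) box around changed pixels, or None."""
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return min(rows), min(cols), max(rows), max(cols)


# Hypothetical 3 x 4 frames: a bright object appears in the second frame.
before = [
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]
after = [
    [10, 10, 10, 10],
    [10, 200, 210, 10],
    [10, 205, 12, 10],
]
mask = change_mask(before, after)
box = bounding_box(mask)
```

On real video frames, lighting changes and shadows would usually call for smoothing or morphological clean-up of the mask before the bounding box is taken.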
Bank Loan Approval - LR, DT, RF and AUC
kaggle.com
zip
Updated Nov 7, 2023
vikram amin (2023). Bank Loan Approval - LR, DT, RF and AUC [Dataset]. https://www.kaggle.com/datasets/vikramamin/bank-loan-approval-lr-dt-rf-and-auc
Available download formats: zip (61437 bytes)
Dataset updated
Nov 7, 2023
Authors
vikram amin
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
DATASET: Dependent variable is 'Personal.Loan'. 0 indicates loan not approved and 1 indicates loan approved.

OBJECTIVE: We will do Exploratory Data Analysis and fit Logistic Regression, Decision Tree and Random Forest models, using AUC to find out which is the best model. Steps:

Set the working directory and read the data

Check the data types of all the variables

DATA CLEANING

We need to change the data types of certain variables to factor

Check for missing data and duplicate records, and remove insignificant variables

New data frame created called 'bank1' after dropping the 'ID' column.

EXPLORATORY DATA ANALYSIS

We will try to get some insights by digging into the data through bar charts and box plots, which can help the bank management in decision making

Run the required libraries

[Figure: Count of Loans Approved / Not Approved]

Out of the total 5000 customers, 4520 have not been approved for a loan while 480 have been

[Figure: Box plot of Income and Family]

THIS INDICATES THAT INCOME IS HIGHER WHEN THERE ARE FEWER FAMILY MEMBERS

[Figure: Box plot between Income and Personal Loan]

THIS INDICATES PERSONAL LOAN HAS BEEN APPROVED FOR CUSTOMERS HAVING HIGHER INCOME

[Figure: Box plot between Income and Credit Cards]

THIS INDICATES THAT THE INCOME IS PRETTY SIMILAR FOR CUSTOMERS OWNING AND NOT OWNING A CREDIT CARD

[Figure: Box plot between Income Class and Mortgage]

CUSTOMERS BELONGING TO THE RICH CLASS (INCOME GROUP : 150-200) HAVE THE HIGHEST MORTGAGE

[Figure: Box plot between CC Avg and Online Banking]

CC AVG IS PRETTY SIMILAR FOR THOSE WHO OPTED FOR ONLINE SERVICES AND THOSE WHO DID NOT

[Figure: Box plot between CC Avg and Education]

MORE EDUCATED CUSTOMERS HAVE A HIGHER CREDIT AVERAGE ![](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F...
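On the model comparison named in the objective of this entry: AUC can be read as the probability that a randomly chosen approved customer receives a higher predicted score than a randomly chosen non-approved one. The sketch below computes it by direct pairwise comparison in Python (the original analysis is in R); the labels and the two score vectors are invented and do not come from the dataset.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = loan approved) and scores from two imaginary models.
labels = [0, 0, 1, 1, 0, 1]
model_a = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
model_b = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]

auc_a = auc(labels, model_a)
auc_b = auc(labels, model_b)
```

Because AUC depends only on how scores rank the two classes, it remains informative when the classes are imbalanced, as with 480 approvals among 5000 customers. The pairwise loop is quadratic, which is fine for a sketch; a rank-based formula is the usual choice for larger data.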
Scene Classification: Images and Audio
kaggle.com
zip
Updated Feb 1, 2020
Jordan J. Bird (2020). Scene Classification: Images and Audio [Dataset]. https://www.kaggle.com/datasets/birdy654/scene-classification-images-and-audio
Available download formats: zip (1730810662 bytes)
Dataset updated
Feb 1, 2020
Authors
Jordan J. Bird
Description
Do images and audio complement one another in scene classification?

This dataset is made up of images from 8 different environments. 37 video sources have been processed: every 1 second an image is extracted (frame at 0.5s, 1.5s, 2.5s ... and so on) and, to accompany that image, the MFCC audio statistics are also extracted from the relevant second of video.

In this dataset, you will notice some common errors from single classifiers. For example, in the video of London, the image classifier confuses the environment with "FOREST" when a lady walks past with flowing hair. Likewise, the audio classifier gets confused by "RIVER" when we walk past a large fountain in Las Vegas due to the sounds of flowing water. Both of these errors can be fixed by a multi-modal approach, where fusion allows for the correction of errors. In our study, both of these issues were classified as "CITY" since multimodality can provide a solution for single-modal errors due to anomalous data occurring.

Please cite this study if you use the dataset

Look and Listen: A Multi-Modal Late Fusion Approach to Scene Classification for Autonomous Machines Jordan J. Bird, Diego R. Faria, Cristiano Premebida, Aniko Ekart, and George Vogiatzis

Context

In this challenge, we can learn environments ("Where am I?") from either images, audio, or take a multimodal approach to fuse the data.

Multi-modal fusion often requires far fewer computing resources than temporal models, but sometimes at the cost of classification ability. Can a method of fusion overcome this? Let's find out!

Content

Class data are given as strings in dataset.csv

Each row of the dataset contains a path to the image, as well as the MFCC data extracted from the second of video that accompany the frame.

MFCC Extraction

(copied and pasted from the paper) we extract the Mel-Frequency Cepstral Coefficients (MFCC) of the audio clips through a set of sliding windows 0.25s in length (i.e. frame size of 4K sampling points) and an additional set of overlapping windows, thus producing 8 sliding windows, 8 frames/sec. From each audio-frame, we extract 13 MFCC attributes, producing 104 attributes per 1 second clip.

These are numbered in sequence from MFCC_1
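
The attribute count follows directly from the windowing: 8 windows per second times 13 coefficients per window. The short Python sketch below reproduces that arithmetic and generates the column names, assuming they run consecutively from MFCC_1 to MFCC_104 (only the starting name is stated above, so the end of the range is an inference).

```python
WINDOWS_PER_SECOND = 8   # from the extraction description
MFCC_PER_WINDOW = 13     # coefficients extracted per audio frame


def mfcc_column_names(windows=WINDOWS_PER_SECOND, coeffs=MFCC_PER_WINDOW):
    """Column names MFCC_1 .. MFCC_(windows * coeffs), numbered in sequence."""
    total = windows * coeffs
    return [f"MFCC_{i}" for i in range(1, total + 1)]


names = mfcc_column_names()
attributes_per_clip = len(names)
```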

Two Classes?

The original study deals with Class 2 (the actual environment, 8 classes) but we have included Class 1 also. Class 1 is a much easier binary classification problem of "Outdoors" and "Indoors".
Insurance Claims Data
kaggle.com
zip
Updated Jan 30, 2022
Satish Varma (2022). Insurance Claims Data [Dataset]. https://www.kaggle.com/datasets/saisatish09/insuranceclaimsdata
Available download formats: zip (1959661 bytes)
Dataset updated
Jan 30, 2022
Authors
Satish Varma
Description
Autobi (Automobile Bodily Injury Claims) -

The data contains demographic information about the claimant, attorney involvement and the economic loss (LOSS, in thousands), among other variables. The full data contains over 70,000 closed claims based on data from thirty-two insurers.

A data frame with 1340 observations on the following 8 variables.

CASENUM - Case number to identify the claim, a numeric vector
ATTORNEY - Whether the claimant is represented by an attorney (=1 if yes and =2 if no), a numeric vector
CLMSEX - Claimant's gender (=1 if male and =2 if female), a numeric vector
MARITAL - Claimant's marital status (=1 if married, =2 if single, =3 if widowed, and =4 if divorced/separated), a numeric vector
CLMINSUR - Whether or not the driver of the claimant's vehicle was uninsured (=1 if yes, =2 if no, and =3 if not applicable), a numeric vector
SEATBELT - Whether or not the claimant was wearing a seatbelt/child restraint (=1 if yes, =2 if no, and =3 if not applicable), a numeric vector
CLMAGE - Claimant's age, a numeric vector
LOSS - The claimant's total economic loss (in thousands), a numeric vector
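
Since the categorical variables are stored as numeric codes, a lookup table makes records easier to read. The Python sketch below encodes the code lists given above; the `decode` helper and the sample record are hypothetical and are not rows from the data.

```python
# Code tables taken from the variable descriptions above.
AUTOBI_CODES = {
    "ATTORNEY": {1: "yes", 2: "no"},
    "CLMSEX": {1: "male", 2: "female"},
    "MARITAL": {1: "married", 2: "single", 3: "widowed", 4: "divorced/separated"},
    "CLMINSUR": {1: "yes", 2: "no", 3: "not applicable"},
    "SEATBELT": {1: "yes", 2: "no", 3: "not applicable"},
}


def decode(record):
    """Replace coded values with labels; leave other fields and unknown codes as-is."""
    out = {}
    for key, value in record.items():
        mapping = AUTOBI_CODES.get(key)
        out[key] = mapping.get(value, value) if mapping else value
    return out


# Hypothetical record for illustration only.
sample = {
    "CASENUM": 1, "ATTORNEY": 1, "CLMSEX": 2, "MARITAL": 4,
    "CLMINSUR": 2, "SEATBELT": 1, "CLMAGE": 40, "LOSS": 1.5,
}
decoded = decode(sample)
```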

AutoClaims (Automobile Insurance Claims) -

A data frame with 6773 observations on the following 5 variables.

STATE
CLASS - Rating class of operator, based on age, gender, marital status, use of vehicle
GENDER
AGE - Age of operator
PAID - Amount paid to settle and close a claim

AutoCollision (Automobile UK Collision Claims) -

8,942 collision losses from private passenger United Kingdom (UK) automobile insurance policies. The average severity is in pounds sterling adjusted for inflation.

A data frame with 32 observations on the following 4 variables.

Age - Age of driver
Vehicle_Use - Purpose of the vehicle use
Severity - Average amount of claims
Claim_Count - Number of claims

Additional information can be found in the document: https://cran.r-project.org/web/packages/insuranceData/index.html
Google Data Analytics Case Study Cyclistic

Difference between Casual vs Member in Cyclistic Riders

Explore at:

zip(1299 bytes)Available download formats

Dataset updated

Sep 27, 2022

Authors

Udayakumar19

Description

Introduction

Scenario

Ask

How do annual members and casual riders use Cyclistic bikes differently?

Guiding Question:

What is the problem you are trying to solve?
  How do annual members and casual riders use Cyclistic bikes differently?
How can your insights drive business decisions?
  The insight will help the marketing team to make a strategy for casual riders

Prepare

Guiding Question:

Where is your data located?
  The data is located in the Cyclistic organization's data.

How is data organized?
  The datasets are in CSV format, one file per month, for financial year 22.

Are there issues with bias or credibility in this data? Does your data ROCCC? 
  Yes, the data meets ROCCC (reliable, original, comprehensive, current, and cited) because it was collected by the Cyclistic organization.

How are you addressing licensing, privacy, security, and accessibility?
  The company holds its own license over the dataset. The dataset does not contain any personal information about the riders.

How did you verify the data’s integrity?
  All the files have consistent columns and each column has the correct type of data.

How does it help you answer your questions?
  Insights are always hidden in the data; we have to interpret the data to find them.

Are there any problems with the data?
  Yes, the starting and ending station names have null values.

Process

Guiding Question:

What tools are you choosing and why?
  I used RStudio to clean and transform the data for the analysis phase, because the dataset is large and to gain experience with the R language.

Have you ensured the data’s integrity?
  Yes, the data is consistent throughout the columns.

What steps have you taken to ensure that your data is clean?
  First, duplicates and null values were removed; then new columns were added for analysis.

How can you verify that your data is clean and ready to analyze?
  Make sure the column names are consistent throughout all datasets by using the “bind_rows” function.
  Make sure the column data types are consistent across all datasets by using “compare_df_cols” from the “janitor” package.
  Combine all the datasets into a single data frame to keep the analysis consistent.
  Remove the start_lat, start_lng, end_lat, and end_lng columns from the data frame, because they are not required for the analysis.
  Create new day, date, month, and year columns from the started_at column; this provides additional opportunities to aggregate the data.
  Create the “ride_length” column from the started_at and ended_at columns to find the average ride duration.
  Remove rows with null values from the dataset by using the “na.omit” function.

Have you documented your cleaning process so you can review and share those results?
  Yes, the cleaning process is documented clearly.
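
The cleaning steps listed above were carried out in R. As a rough illustration of the same sequence, here is a minimal sketch using only Python's standard library; it is not the author's script. The `clean` and `load` helpers are hypothetical, the timestamp format and any column names not mentioned above (such as ride_id and member_casual) are assumptions, and the sample rows are made up.

```python
import csv
import io
from datetime import datetime

DROP = {"start_lat", "start_lng", "end_lat", "end_lng"}
FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format

# Tiny in-memory stand-ins for two monthly CSV files (invented rows).
HEADER = ("ride_id,rideable_type,started_at,ended_at,start_station_name,"
          "end_station_name,start_lat,start_lng,end_lat,end_lng,member_casual\n")
R1 = ("R1,classic_bike,2022-01-03 08:00:00,2022-01-03 08:15:00,"
      "Station A,Station B,41.9,-87.6,41.8,-87.7,member\n")
R2 = ("R2,docked_bike,2022-01-08 14:00:00,2022-01-08 14:50:00,"
      ",Station C,41.9,-87.6,41.8,-87.7,casual\n")  # missing start station
R3 = ("R3,electric_bike,2022-02-05 10:00:00,2022-02-05 10:40:00,"
      "Station C,Station A,41.9,-87.6,41.8,-87.7,casual\n")
JAN = HEADER + R1 + R1 + R2  # R1 appears twice on purpose
FEB = HEADER + R3


def load(text):
    """Parse one CSV text into (column names, list of row dicts)."""
    reader = csv.DictReader(io.StringIO(text))
    return reader.fieldnames, list(reader)


def clean(monthly_csv_texts):
    tables = [load(text) for text in monthly_csv_texts]
    # Check that every monthly file has the same columns.
    first_header = tables[0][0]
    if any(header != first_header for header, _ in tables):
        raise ValueError("column names differ between files")
    # Combine all months into one list of rows.
    rows = [row for _, month_rows in tables for row in month_rows]
    cleaned, seen = [], set()
    for row in rows:
        # Drop the coordinate columns that the analysis does not need.
        row = {k: v for k, v in row.items() if k not in DROP}
        # Skip rows with a missing value and exact duplicates.
        key = tuple(row.values())
        if any(v in ("", None) for v in row.values()) or key in seen:
            continue
        seen.add(key)
        # Derive the date parts and the ride length from the timestamps.
        start = datetime.strptime(row["started_at"], FMT)
        end = datetime.strptime(row["ended_at"], FMT)
        row["date"] = start.date().isoformat()
        row["day"] = start.strftime("%A")
        row["month"] = start.month
        row["year"] = start.year
        row["ride_length"] = (end - start).total_seconds() / 60  # minutes
        cleaned.append(row)
    return cleaned


rides = clean([JAN, FEB])
```

With the invented rows above, the duplicate of R1 and the row with a missing start station are dropped, leaving two rides with the derived columns attached.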

Analyze Phase:

Guiding Questions:

What surprises did you discover in the data?
  Casual riders have longer ride durations than annual members.
  Casual riders use docked bikes more widely than annual members.

What trends or relationships did you find in the data?
  Annual members ride mainly for commuting.
  Casual riders prefer docked bikes.
  Annual members prefer electric or classic bikes.

How will these insights help answer your business questions?
  These insights help build a profile of members.
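
Comparisons like the ones above come down to grouping rides by rider type and summarizing each group. The following is a minimal Python sketch of that grouping, not the author's analysis: the `summarize` helper is hypothetical, and the rides are invented values chosen only to show the mechanics, not results from the dataset.

```python
from collections import Counter, defaultdict
from statistics import mean

# Invented cleaned rides: (rider type, bike type, ride length in minutes).
rides = [
    ("member", "classic_bike", 12.0),
    ("member", "electric_bike", 10.0),
    ("casual", "docked_bike", 45.0),
    ("casual", "classic_bike", 25.0),
]


def summarize(rides):
    """Return (average ride length, bike-type counts), both keyed by rider type."""
    lengths = defaultdict(list)
    bike_counts = defaultdict(Counter)
    for rider, bike, minutes in rides:
        lengths[rider].append(minutes)
        bike_counts[rider][bike] += 1
    avg = {rider: mean(vals) for rider, vals in lengths.items()}
    return avg, bike_counts


avg_length, bikes = summarize(rides)
```

Here `avg_length` holds one average per rider type and `bikes` holds how often each rider type used each bike type, which are the two quantities the findings above compare.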

Share

Guiding Questions:

Were you able to answer the question of how ...

