Cleaning this data took some time due to the many NULL values, typos, and inconsistent collection. My first step was to load the dataset into R and work my magic there. After analyzing and cleaning the data, I moved it into Tableau to create clear, easily understandable graphs. This step was a learning curve because there are so many options inside Tableau; finding the right graph to share my findings while keeping the stakeholders' tasks in mind was my biggest obstacle.
First, I needed to combine the four datasets into one, which I did using the rbind() function, as sketched below.
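A minimal sketch of that step (the quarterly data frame names here are placeholders, not the actual file names I loaded):
Cyclistic_Data_2019 <- rbind(Q1_2019, Q2_2019, Q3_2019, Q4_2019)  # stack the four quarters row-wise; columns must match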
Step two was to fix typos and poorly named columns by renaming them.
colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "tripduration"] <- "trip_duration"
colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "bikeid"] <- "bike_id"
colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "usertype"] <- "user_type"
colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "birthyear"] <- "birth_year"
The next step was to remove all NULL values and implausibly large numbers, such as trip durations of more than 10 hours.
library(dplyr)
Cyclistic_Clean_v2 <- Cyclistic_Data_2019 %>%
  filter(across(where(is.character), ~ . != "NULL")) %>%  # drop rows containing the literal string "NULL" in any character column
  type.convert(as.is = TRUE)                              # re-infer column types after the text cleanup
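One note for anyone reproducing this: in newer versions of dplyr (1.0.4+), filter(across(...)) is deprecated in favor of if_all(), which does the same thing more explicitly:
Cyclistic_Clean_v2 <- Cyclistic_Data_2019 %>%
  filter(if_all(where(is.character), ~ . != "NULL")) %>%  # keep rows where every character column is not "NULL"
  type.convert(as.is = TRUE)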
Once the NULL data was removed, it was time to remove potential typos and poorly collected data. The only exaggerated data I could identify was in the trip_duration column, where there were multiple cases of trips over 2,000,000 seconds. To find these large values, I used the count() function.
Cyclistic_Clean_v2 %>% count(trip_duration > "30000")  # note the quotes -- trip_duration was still a character column here
After finding multiple instances of this, I ran into a rough spot: the trip_duration column was classed as character when it needed to be numeric to be cleaned further. It took me quite a while to figure out that this was the issue, and then I remembered the class() function. With it, I was easily able to confirm that the classification was wrong.
class(Cyclistic_Clean_v2$trip_duration)
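A quick illustration of why the wrong class matters here: character values compare lexicographically (character by character), not numerically, so comparisons like the count() above can be misleading.
"9" > "30000"  # TRUE  -- character comparison: '9' sorts after '3'
9 > 30000      # FALSE -- numeric comparison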
After identifying the classification, I still had some work to do before converting the column, as the values contained quotation marks, periods, and a trailing 0. To remove these, I used the gsub() function.
Cyclistic_Clean_v2$trip_duration <- gsub(".0", "", Cyclistic_Clean_v2$trip_duration)
Cyclistic_Clean_v2$trip_duration <- gsub('"', '', Cyclistic_Clean_v2$trip_duration)
Now that the unwanted characters are gone, the column can be converted to numeric.
Cyclistic_Clean_v2$trip_duration <- as.numeric(Cyclistic_Clean_v2$trip_duration)
Doing this allows Tableau and R to read the data properly and build graphs without errors.
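It also made it possible to actually drop the exaggerated trips found earlier. A minimal sketch, assuming the 10-hour (36,000 second) cutoff mentioned above:
Cyclistic_Clean_v2 <- Cyclistic_Clean_v2 %>%
  filter(trip_duration <= 36000)  # keep only trips of 10 hours or less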
Next, I created a backup dataset in case there were any issues while exporting.
Cyclistic_Clean_v3 <- Cyclistic_Clean_v2
write.csv(Cyclistic_Clean_v2, "Folder.Path/Cyclistic_Data_Cleaned_2019.csv", row.names = FALSE)
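A side note on the path: R treats the backslash as an escape character, so Windows paths need either doubled backslashes or forward slashes (the directory below is a placeholder, not my actual export folder):
write.csv(Cyclistic_Clean_v2, "C:\\Exports\\Cyclistic_Data_Cleaned_2019.csv", row.names = FALSE)  # doubled backslashes
write.csv(Cyclistic_Clean_v2, "C:/Exports/Cyclistic_Data_Cleaned_2019.csv", row.names = FALSE)    # forward slashes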
After exporting, I concluded that I should have kept a more thorough change log rather than brief notes. That is one major lesson I will take away from this project.
All around, I had a lot of fun using R to transform and analyze the data, and I learned many different ways to clean data efficiently.
Now onto the fun part! Tableau is a very good tool to learn. There are so many ways to bring your data to life and show your creativity in your work. After a few guides and some trial and error, I could finally start building graphs to bring the stakeholders' tasks to fruition.
Please note these were all made in Tableau and are meant to be interactive.
Here you can find the relationship between male and female riders.
Male vs. female trip duration by user type.
Busiest stations, filtered by month.
Most popular starting stations.
Most popular ending stations.
My main goal was to help figure out how Cyclistic can convert casual riders into subscribers. Here are my findings.
# https://www.kaggle.com/c/facial-keypoints-detection/details/getting-started-with-r
#################################
### Variables for downloaded files
data.dir   <- ' '
train.file <- paste0(data.dir, 'training.csv')
test.file  <- paste0(data.dir, 'test.csv')
#################################
### Load csv -- creates a data.frame, where each column can have a different type.
d.train <- read.csv(train.file, stringsAsFactors = F)
d.test  <- read.csv(test.file, stringsAsFactors = F)
### In training.csv, we have 7049 rows, each one with 31 columns.
### The first 30 columns are keypoint locations, which R correctly identified as numbers.
### The last one is a string representation of the image, identified as a string.
### To look at samples of the data, uncomment this line:
# head(d.train)
### Let's save the first column as another variable, and remove it from d.train:
### d.train is our dataframe, and we want the column called Image.
### Assigning NULL to a column removes it from the dataframe.
im.train <- d.train$Image
d.train$Image <- NULL  # removes the Image column from the dataframe
im.test <- d.test$Image
d.test$Image <- NULL   # removes the Image column from the dataframe
#################################
# The image is represented as a series of numbers, stored as a string.
# Convert these strings to integers by splitting them and converting the result to integer.
# strsplit splits the string, unlist simplifies its output to a vector of strings,
# and as.integer converts it to a vector of integers.
as.integer(unlist(strsplit(im.train[1], " ")))
as.integer(unlist(strsplit(im.test[1], " ")))
### Install and activate appropriate libraries.
### The tutorial is meant for Linux and OS X, where they use a different library, so:
### replace all instances of %dopar% with %do%.
library("foreach", lib.loc="~/R/win-library/3.3")
### Implement the conversion loop (sequential here, because of the %do% swap above)
im.train <- foreach(im = im.train, .combine = rbind) %do% {
  as.integer(unlist(strsplit(im, " ")))
}
im.test <- foreach(im = im.test, .combine = rbind) %do% {
  as.integer(unlist(strsplit(im, " ")))
}
# The foreach loop evaluates the inner command for each row in im.train
# and combines the results with rbind (combine by rows).
# %do% evaluates sequentially; %dopar% is what would run the evaluations in parallel.
# im.train is now a matrix with 7049 rows (one for each image) and 9216 columns (one for each pixel).
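# Assumption, not part of the original write-up: to genuinely parallelize on
# Windows, register a cluster via the doParallel package and keep %dopar%.
# This sketch replaces the %do% loop above, it does not follow it:
library(doParallel)
cl <- makeCluster(detectCores() - 1)  # leave one core free
registerDoParallel(cl)
im.train <- foreach(im = im.train, .combine = rbind) %dopar% {
  as.integer(unlist(strsplit(im, " ")))
}
stopCluster(cl)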
### Save all four variables in a data.Rd file.
### They can be reloaded at any time with load('data.Rd').
save(d.train, d.test, im.train, im.test, file = 'data.Rd')
# Each image is a vector of 96*96 pixels (96*96 = 9216).
# Convert these 9216 integers into a 96x96 matrix:
im <- matrix(data = rev(im.train[1,]), nrow = 96, ncol = 96)
# im.train[1,] returns the first row of im.train, which corresponds to the first training image.
# rev reverses the resulting vector to match the interpretation of R's image function
# (which expects the origin to be in the lower left corner).
# To visualize the image we use R's image function:
image(1:96, 1:96, im, col = gray((0:255)/255))
# Let's color the coordinates for the eyes and nose:
points(96 - d.train$nose_tip_x[1],         96 - d.train$nose_tip_y[1],         col = "red")
points(96 - d.train$left_eye_center_x[1],  96 - d.train$left_eye_center_y[1],  col = "blue")
points(96 - d.train$right_eye_center_x[1], 96 - d.train$right_eye_center_y[1], col = "green")
# Another good check is to see how variable our data is.
# For example, where are the centers of the noses in the 7049 images? (This takes a while to run.)
for (i in 1:nrow(d.train)) {
  points(96 - d.train$nose_tip_x[i], 96 - d.train$nose_tip_y[i], col = "red")
}
# There are quite a few outliers -- they could be labeling errors. Looking at one extreme example:
# (In this case there's no labeling error, but it shows that not all faces are centered.)
idx <- which.max(d.train$nose_tip_x)
im <- matrix(data = rev(im.train[idx,]), nrow = 96, ncol = 96)
image(1:96, 1:96, im, col = gray((0:255)/255))
points(96 - d.train$nose_tip_x[idx], 96 - d.train$nose_tip_y[idx], col = "red")
# One of the simplest things to try is to compute the mean of the coordinates of each keypoint
# in the training set and use that as a prediction for all images:
colMeans(d.train, na.rm = T)
# To build a submission file we need to apply these computed coordinates to the test instances:
p <- matrix(data = colMeans(d.train, na.rm = T), nrow = nrow(d.test), ncol = ncol(d.train), byrow = T)
colnames(p) <- names(d.train)
predictions <- data.frame(ImageId = 1:nrow(d.test), p)
head(predictions)
# The expected submission format has one keypoint per row, but we can easily get that
# with the help of the reshape2 library:
library(reshape2)
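# The write-up cuts off here. A sketch of the reshape it describes, assuming the
# competition's FeatureName/Location submission columns (an assumption, not from the text):
submission <- melt(predictions, id.vars = "ImageId", variable.name = "FeatureName", value.name = "Location")
head(submission)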