10 datasets found

Housing Price Prediction using DT and RF in R
kaggle.com
zip
Updated Aug 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
vikram amin (2023). Housing Price Prediction using DT and RF in R [Dataset]. https://www.kaggle.com/datasets/vikramamin/housing-price-prediction-using-dt-and-rf-in-r
Explore at:
zip(629100 bytes)Available download formats
Dataset updated
Aug 31, 2023
Authors
vikram amin
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Objective: To predict the prices of houses in the City of Melbourne

Approach: Using Decision Tree and Random Forest https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Ffc6fb7d0bd8e854daf7a6f033937a397%2FPicture1.png?generation=1693489996707941&alt=media" alt="">

Data Cleaning:

Date column is shown as a character vector which is converted into a date vector using the library ‘lubridate’

We create a new column called age to understand the age of the house as it can be a factor in the pricing of the house. We extract the year from column ‘Date’ and subtract it from the column ‘Year Built’

We remove 11566 records which have missing values

We drop columns which are not significant such as ‘X’, ‘suburb’, ‘address’, (we have kept zipcode as it serves the purpose in place of suburb and address), ‘type’, ‘method’, ‘SellerG’, ‘date’, ‘Car’, ‘year built’, ‘Council Area’, ‘Region Name’

We split the data into ‘train’ and ‘test’ in 80/20 ratio using the sample function

Run libraries ‘rpart’, ‘rpart.plot’, ‘rattle’, ‘RcolorBrewer’

Run decision tree using the rpart function. ‘Price’ is the dependent variable https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F6065322d19b1376c4a341a4f22933a51%2FPicture2.png?generation=1693490067579017&alt=media" alt="">

Average price for 5464 houses is $1084349

Where building area is less than 200.5, the average price for 4582 houses is $931445. Where building area is less than 200.5 & age of the building is less than 67.5 years, the avg price for 3385 houses is $799299.6.

$4801538 is the Highest average prices of 13 houses where distance is lower than 5.35 & building are is >280.5
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F136542b7afb6f03c1890bae9b07dc464%2FDecision%20Tree%20Plot.jpeg?generation=1693490124083168&alt=media" alt="">

We use the caret package for tuning the parameter and the optimal complexity parameter found is 0.01 with RMSE 445197.9 https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Feb1633df9dd61ba3a51574873b055fd0%2FPicture3.png?generation=1693490163033658&alt=media" alt="">

We use library (Metrics) to find out the RMSE ($392107), MAPE (0.297) which means an accuracy of 99.70% and MAE ($272015.4)

Variables ‘postcode’, longitude and building are the most important variables

Test$Price indicates the actual price and test$predicted indicates the predicted price for particular 6 houses. https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F620b1aad968c9aee169d0e7371bf3818%2FPicture4.png?generation=1693490211728176&alt=media" alt="">

We use the default parameters of random forest on the train data https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fe9a3c3f8776ee055e4a1bb92d782e19c%2FPicture5.png?generation=1693490244695668&alt=media" alt="">

The below image indicates that ‘Building Area’, ‘Age of the house’ and ‘Distance’ are the most important variables that affect the price of the house. https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fc14d6266184db8f30290c528d72b9f6b%2FRandom%20Forest%20Variables%20Importance.jpeg?generation=1693490284920037&alt=media" alt="">

Based on the default parameters, RMSE is $250426.2, MAPE is 0.147 (accuracy is 99.853%) and MAE is $151657.7

Error starts to remain constant between 100 to 200 trees and thereafter there is almost minimal reduction. We can choose N tree=200. https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F365f9e8587d3a65805330889d22f9e60%2FNtree%20Plot.jpeg?generation=1693490308734539&alt=media" alt="">

We tune the model and find mtry = 3 has the lowest out of bag error

We use the caret package and use 5 fold cross validation technique

RMSE is $252216.10 , MAPE is 0.146 (accuracy is 99.854%) , MAE is $151669.4

We can conclude that Random Forest give us more accurate results as compared to Decision Tree

In Random Forest , the default parameters (N tree = 500) give us lower RMSE and MAPE as compared to N tree = 200. So we can proceed with those parameters.
Data from: Superconductor-ferromagnet hybrids for non-reciprocal electronics...
data.europa.eu
ekoizpen-zientifikoa.ehu.eus
+1more
unknown
Updated Jul 3, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zenodo (2025). Superconductor-ferromagnet hybrids for non-reciprocal electronics and detectors [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-7798143?locale=hr
Explore at:
unknown(64)Available download formats
Dataset updated
Jul 3, 2025
Dataset authored and provided by
Zenodohttp://zenodo.org/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data for the manuscript "Superconductor-ferromagnet hybrids for non-reciprocal electronics and detectors", submitted to Superconductor Science and Technology, arXiv:2302.12732. This archive contains the data for all plots of numerical data in the manuscript. ## Fig. 4 Data of Fig. 4 in the WDX (Wolfram Data Exchange) format (unzip to extract the files). Contains critical exchange fields and critical thicknesses as functions of the temperature. Can be opened with Wolfram Mathematica with the command: Import[FileNameJoin[{NotebookDirectory[],"filename.wdx"}]] ## Fig. 5 Data of Fig. 5 in the WDX (Wolfram Data Exchange) format (unzip to extract the files). Contains theoretically calculated I(V) curves and the rectification coefficient R of N/FI/S junctions. Can be opened with Wolfram Mathematica with the command Import[FileNameJoin[{NotebookDirectory[],"filename.wdx"}]]. ## Fig. 7a Data of Fig. 7a in the ascii format. Contains G in uS as a function of B in mT and V in mV. ## Fig. 7c Data of Fig. 7c in the ascii format. Contains G in uS as a function of B in mT and V in mV. ## Fig. 7e Data of Fig. 7e in the ascii format. Contains G in uS as a function of B in mT and V in mV. The plots 7b, d, and f are taken from the plots a, c and e as indicated in the caption of the figure. ## Fig. 8 Data of Fig. 8 in the ascii format. Contains G in uS as a function V in mV for several values of B in mT. ## Fig. 8 inset Data of Fig. 8 inset in the ascii format. Contains G_0/G_N as a function of B in mT. ## Fig9a_b First raw Magnetic field values in T, first column voltage drop in V, rest of the columns differential conductance in S ## Fig9b_FIT First raw Magnetic field values in T, first column voltage drop in V, rest of the columns differential conductance in S ## Fig9c First raw Magnetic field values in T, first column voltage drop in V, rest of the columns R (real number) ## Fig9c inset First raw Magnetic field values in T, odd columns voltage drop in V, even columns injected current in A ## Fog9d Foist column magnetic field in T, second column conductance ration (real number), sample name in the file name. ## Fig. 12 Data of Fig. 12 in the ascii format. Contains energy resolution as functions of temperature and tunnel resistance with current and voltage readout. ## Fig. 13 Data of Fig. 13 in the ascii format. Contains energy resolution as functions of (a) exchange field, (b) polarization, (c) dynes, and (d) absorber volume with different amplifier noises. ## Fig. 14 Data of Fig. 14 in the ascii format. Contains detector pulse current as functions of (a) temperature change (b) time with different detector parameters. ## Fig. 17 Data of Fig. 17 in the ascii format. Contains dIdV curves as function of the voltage for different THz illumination frequency and polarization. ## Fig. 18 Data of Fig. 18 in the ascii format. Contains the current flowing throughout the junction as function time (arbitrary units) for ON and OFF illumination at 150 GHz for InPol and CrossPol polarization. ## Fig. 21 Data of Fig. 21c in the ascii format. Contains the magnitude of readout line S43 as frequency. Data of Fig. 21d in the ascii format. Contains the magnitude of iKID line S21 as frequency.
Data from: Projections of Definitive Screening Designs by Dropping Columns:...
tandf.figshare.com
txt
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alan R. Vazquez; Peter Goos; Eric D. Schoen (2023). Projections of Definitive Screening Designs by Dropping Columns: Selection and Evaluation [Dataset]. http://doi.org/10.6084/m9.figshare.7624412.v2
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.7624412.v2
Dataset updated
Jun 1, 2023
Dataset provided by
Taylor & Francishttps://taylorandfrancis.com/
Authors
Alan R. Vazquez; Peter Goos; Eric D. Schoen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Abstract–Definitive screening designs permit the study of many quantitative factors in a few runs more than twice the number of factors. In practical applications, researchers often require a design for m quantitative factors, construct a definitive screening design for more than m factors and drop the superfluous columns. This is done when the number of runs in the standard m-factor definitive screening design is considered too limited or when no standard definitive screening design (sDSD) exists for m factors. In these cases, it is common practice to arbitrarily drop the last columns of the larger design. In this article, we show that certain statistical properties of the resulting experimental design depend on the exact columns dropped and that other properties are insensitive to these columns. We perform a complete search for the best sets of 1–8 columns to drop from sDSDs with up to 24 factors. We observed the largest differences in statistical properties when dropping four columns from 8- and 10-factor definitive screening designs. In other cases, the differences are small, or even nonexistent.
FacialRecognition
kaggle.com
zip
Updated Dec 1, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
TheNicelander (2016). FacialRecognition [Dataset]. https://www.kaggle.com/petein/facialrecognition
Explore at:
zip(121674455 bytes)Available download formats
Dataset updated
Dec 1, 2016
Authors
TheNicelander
License
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Description

#https://www.kaggle.com/c/facial-keypoints-detection/details/getting-started-with-r #################################

###Variables for downloaded files data.dir <- ' ' train.file <- paste0(data.dir, 'training.csv') test.file <- paste0(data.dir, 'test.csv') #################################

###Load csv -- creates a data.frame matrix where each column can have a different type. d.train <- read.csv(train.file, stringsAsFactors = F) d.test <- read.csv(test.file, stringsAsFactors = F)

###In training.csv, we have 7049 rows, each one with 31 columns. ###The first 30 columns are keypoint locations, which R correctly identified as numbers. ###The last one is a string representation of the image, identified as a string.

###To look at samples of the data, uncomment this line:

head(d.train)

###Let's save the first column as another variable, and remove it from d.train: ###d.train is our dataframe, and we want the column called Image. ###Assigning NULL to a column removes it from the dataframe

im.train <- d.train$Image d.train$Image <- NULL #removes 'image' from the dataframe

im.test <- d.test$Image d.test$Image <- NULL #removes 'image' from the dataframe

################################# #The image is represented as a series of numbers, stored as a string #Convert these strings to integers by splitting them and converting the result to integer

#strsplit splits the string #unlist simplifies its output to a vector of strings #as.integer converts it to a vector of integers. as.integer(unlist(strsplit(im.train[1], " "))) as.integer(unlist(strsplit(im.test[1], " ")))

###Install and activate appropriate libraries ###The tutorial is meant for Linux and OSx, where they use a different library, so: ###Replace all instances of %dopar% with %do%.

install.packages('foreach')

library("foreach", lib.loc="~/R/win-library/3.3")

###implement parallelization im.train <- foreach(im = im.train, .combine=rbind) %do% { as.integer(unlist(strsplit(im, " "))) } im.test <- foreach(im = im.test, .combine=rbind) %do% { as.integer(unlist(strsplit(im, " "))) } #The foreach loop will evaluate the inner command for each row in im.train, and combine the results with rbind (combine by rows). #%do% instructs R to do all evaluations in parallel. #im.train is now a matrix with 7049 rows (one for each image) and 9216 columns (one for each pixel):

###Save all four variables in data.Rd file ###Can reload them at anytime with load('data.Rd')

save(d.train, im.train, d.test, im.test, file='data.Rd')

load('data.Rd')

#each image is a vector of 96*96 pixels (96*96 = 9216). #convert these 9216 integers into a 96x96 matrix: im <- matrix(data=rev(im.train[1,]), nrow=96, ncol=96)

#im.train[1,] returns the first row of im.train, which corresponds to the first training image. #rev reverse the resulting vector to match the interpretation of R's image function #(which expects the origin to be in the lower left corner).

#To visualize the image we use R's image function: image(1:96, 1:96, im, col=gray((0:255)/255))

#Let’s color the coordinates for the eyes and nose points(96-d.train$nose_tip_x[1], 96-d.train$nose_tip_y[1], col="red") points(96-d.train$left_eye_center_x[1], 96-d.train$left_eye_center_y[1], col="blue") points(96-d.train$right_eye_center_x[1], 96-d.train$right_eye_center_y[1], col="green")

#Another good check is to see how variable is our data. #For example, where are the centers of each nose in the 7049 images? (this takes a while to run): for(i in 1:nrow(d.train)) { points(96-d.train$nose_tip_x[i], 96-d.train$nose_tip_y[i], col="red") }

#there are quite a few outliers -- they could be labeling errors. Looking at one extreme example we get this: #In this case there's no labeling error, but this shows that not all faces are centralized idx <- which.max(d.train$nose_tip_x) im <- matrix(data=rev(im.train[idx,]), nrow=96, ncol=96) image(1:96, 1:96, im, col=gray((0:255)/255)) points(96-d.train$nose_tip_x[idx], 96-d.train$nose_tip_y[idx], col="red")

#One of the simplest things to try is to compute the mean of the coordinates of each keypoint in the training set and use that as a prediction for all images colMeans(d.train, na.rm=T)

#To build a submission file we need to apply these computed coordinates to the test instances: p <- matrix(data=colMeans(d.train, na.rm=T), nrow=nrow(d.test), ncol=ncol(d.train), byrow=T) colnames(p) <- names(d.train) predictions <- data.frame(ImageId = 1:nrow(d.test), p) head(predictions)

#The expected submission format has one one keypoint per row, but we can easily get that with the help of the reshape2 library:

install.packages('reshape2')

library(...
KC_House Dataset -Linear Regression of Home Prices
kaggle.com
zip
Updated May 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
vikram amin (2023). KC_House Dataset -Linear Regression of Home Prices [Dataset]. https://www.kaggle.com/datasets/vikramamin/kc-house-dataset-home-prices
Explore at:
zip(776807 bytes)Available download formats
Dataset updated
May 15, 2023
Authors
vikram amin
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset: House pricing dataset containing 21 columns and 21613 rows.

Programming Language : R

Objective : To predict house prices by creating a model

Steps : A) Import the dataset B) Install and run libraries C) Data Cleaning - Remove Null Values , Change Data Types , Dropping of Columns which are not important D) Data Analysis - (i)Linear Regression Model was used to establish the relationship between the dependent variable (price) and other independent variable (ii) Outliers were identified and removed (iii) Regression model was run once again after removing the outliers (iv) Multiple R- squared was calculated which indicated the independent variables can explain 73% change/ variation in the dependent variable (v) P value was less than that of alpha 0.05 which shows it is statistically significant. (vi) Interpreting the meaning of the results of the coefficients (vii) Checked the assumption of multicollinearity (viii) VIF(Variance inflation factor) was calculated for all the independent variables and their absolute value was found to be less than 5. Hence, there is not threat of multicollinearity and that we can proceed with the independent variables specified.
d
Young and older adult vowel categorization responses
datadryad.org
zip
Updated Mar 14, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mishaela DiNino (2024). Young and older adult vowel categorization responses [Dataset]. http://doi.org/10.5061/dryad.brv15dvh0
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.brv15dvh0
Dataset updated
Mar 14, 2024
Dataset provided by
Dryad
Authors
Mishaela DiNino
Time period covered
Feb 20, 2024
Description
Young and older adult vowel categorization responses

https://doi.org/10.5061/dryad.brv15dvh0

On each trial, participants heard a stimulus and clicked a box on the computer screen to indicate whether they heard "SET" or "SAT." Responses of "SET" are coded as 0 and responses of "SAT" are coded as 1. The continuum steps, from 1-7, for duration and spectral quality cues of the stimulus on each trial are named "DurationStep" and "SpectralStep," respectively. Group (young or older adult) and listening condition (quiet or noise) information are provided for each row of the dataset.
f
DataSheet1_Evaluation of Version 3 Total and Tropospheric Ozone Columns From...
frontiersin.figshare.com
docx
Updated Jun 6, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Natalya A. Kramarova; Jerald R. Ziemke; Liang-Kang Huang; Jay R. Herman; Krzysztof Wargan; Colin J. Seftor; Gordon J. Labow; Luke D. Oman (2023). DataSheet1_Evaluation of Version 3 Total and Tropospheric Ozone Columns From Earth Polychromatic Imaging Camera on Deep Space Climate Observatory for Studying Regional Scale Ozone Variations.docx [Dataset]. http://doi.org/10.3389/frsen.2021.734071.s001
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.3389/frsen.2021.734071.s001
Dataset updated
Jun 6, 2023
Dataset provided by
Frontiers
Authors
Natalya A. Kramarova; Jerald R. Ziemke; Liang-Kang Huang; Jay R. Herman; Krzysztof Wargan; Colin J. Seftor; Gordon J. Labow; Luke D. Oman
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Earth
Description
Discrete wavelength radiance measurements from the Deep Space Climate Observatory (DSCOVR) Earth Polychromatic Imaging Camera (EPIC) allows derivation of global synoptic maps of total and tropospheric ozone columns every hour during Northern Hemisphere (NH) Summer or 2 hours during Northern Hemisphere winter. In this study, we present version 3 retrieval of Earth Polychromatic Imaging Camera ozone that covers the period from June 2015 to the present with improved geolocation, calibration, and algorithmic updates. The accuracy of total and tropospheric ozone measurements from EPIC have been evaluated using correlative satellite and ground-based total and tropospheric ozone measurements at time scales from daily averages to monthly means. The comparisons show good agreement with increased differences at high latitudes. The agreement improves if we only accept retrievals derived from the EPIC 317 nm triplet and limit solar zenith and satellite looking angles to 70°. With such filtering in place, the comparisons of EPIC total column ozone retrievals with correlative satellite and ground-based data show mean differences within ±5-7 Dobson Units (or 1.5–2.5%). The biases with other satellite instruments tend to be mostly negative in the Southern Hemisphere while there are no clear latitudinal patterns in ground-based comparisons. Evaluation of the EPIC ozone time series at different ground-based stations with the correlative ground-based and satellite instruments and ozonesondes demonstrated good consistency in capturing ozone variations at daily, weekly and monthly scales with a persistently high correlation (r2 > 0.9) for total and tropospheric columns. We examined EPIC tropospheric ozone columns by comparing with ozonesondes at 12 stations and found that differences in tropospheric column ozone are within ±2.5 DU (or ∼±10%) after removing a constant 3 DU offset at all stations between EPIC and sondes. The analysis of the time series of zonally averaged EPIC tropospheric ozone revealed a statistically significant drop of ∼2–4 DU (∼5–10%) over the entire NH in spring and summer of 2020. This drop in tropospheric ozone is partially related to the unprecedented Arctic stratospheric ozone losses in winter-spring 2019/2020 and reductions in ozone precursor pollutants due to the COVID-19 pandemic.
South African Award Winning Wines
kaggle.com
zip
Updated Jan 9, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rubi Leandro (2024). South African Award Winning Wines [Dataset]. https://www.kaggle.com/datasets/rubileandro/south-african-award-winning-wines
Explore at:
zip(17527 bytes)Available download formats
Dataset updated
Jan 9, 2024
Authors
Rubi Leandro
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Area covered
South Africa
Description
The original source for the data is the online publisher Top Wine SA. The are well-established experts and collect data from various expert sources within the South African Wine market and keep track of the best wines based on local as well as international judges. This data was published in 2023.

Top Wine SA – Top SA Wine Ratings https://topwinesa.com/top-sa-wines-and-cellars/top-sa-wine-ratings/

The original dataset provides the following description regarding categories:

“Prices are given in South African rand (R) per standard bottle size (750ml), unless stated otherwise, and are inclusive of VAT – prices as supplied by the producers, or approximate retail prices in South Africa if the producers are closed to the public – top wines may well be ‘sold out’ or increase in price after a stellar rating.

Note: AW = Auction Wine, EXP = Export Only, MC = Museum Class (typically ‘sold out’ at the winery, assessed to gauge development or longevity), and OL = Own Label (exclusive to particular retailer(s), not available from the producer).”

I removed these categories because my focus is on what the average South African consumer is paying for.

Wine that is export only is automatically excluded from this this case. As is Museum Class and Auction Wine. Although some Own Label wines were not too difficult to find, I was not able to find all of them so for this reason I have excluded these.

Below I have listed how many in each of these particular categories there were, before removing them.

• AW: 8 • EX: 21 • MC: 4 • OL: 16

The Process I manually copied the data from the TOP SA wines website at this address: https://topwinesa.com/top-sa-wines-and-cellars/top-sa-wine-ratings/

I added in an extra column for Category and Second Category. And assigned the relevant category to each one.

I cleaned up the data which involved sorting and deleting unnecessary rows. There were blank cells in some columns as some wines had more than one award. I removed the any additional awards as the focus of this data set is that there are the wines, vintages, and prices that consumers are paying for, for award-winning South Africa wine (as opposed to the particular award that won).

Then I deleted all entries of the categories mentioned above (AW, EX, MC, OL).

The prices were entered as text with the letter “R”. I used a formula in excel to remove the letter “R” from the price column so that the price could be stored as a number.

=IF(ISNUMBER(VALUE(SUBSTITUTE(B2,"R ",""))), VALUE(SUBSTITUTE(B2,"R ","")), "")

I wanted to have the vintage as its own column. So I used a formula to extract the numbers from the wine names provided. For most of the entries, the year was the only number in the name. For those entries that had additional numbers in the name, I manually fixed those cases.

=IFERROR(VALUE(TEXTJOIN("", TRUE, IF(ISNUMBER(--MID(A2, ROW(INDIRECT("1:" & LEN(A2))), 1)), MID(A2, ROW(INDIRECT("1:" & LEN(A2))), 1), ""))), "")
Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race,...
search.datacite.org
doi.org
+1more
Updated 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jacob Kaplan (2018). Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race, 1980-2016 [Dataset]. http://doi.org/10.3886/e102263v5-10021
Explore at:
Unique identifier
https://doi.org/10.3886/e102263v5-10021
Dataset updated
2018
Dataset provided by
DataCitehttps://www.datacite.org/
Inter-university Consortium for Political and Social Researchhttps://www.icpsr.umich.edu/web/pages/
Authors
Jacob Kaplan
Description
Version 5 release notes:
Removes support for SPSS and Excel data.Changes the crimes that are stored in each file. There are more files now with fewer crimes per file. The files and their included crimes have been updated below.
Adds in agencies that report 0 months of the year.Adds a column that indicates the number of months reported. This is generated summing up the number of unique months an agency reports data for. Note that this indicates the number of months an agency reported arrests for ANY crime. They may not necessarily report every crime every month. Agencies that did not report a crime with have a value of NA for every arrest column for that crime.Removes data on runaways.
Version 4 release notes:
Changes column names from "poss_coke" and "sale_coke" to "poss_heroin_coke" and "sale_heroin_coke" to clearly indicate that these column includes the sale of heroin as well as similar opiates such as morphine, codeine, and opium. Also changes column names for the narcotic columns to indicate that they are only for synthetic narcotics.
Version 3 release notes:
Add data for 2016.Order rows by year (descending) and ORI.Version 2 release notes:
Fix bug where Philadelphia Police Department had incorrect FIPS county code.
The Arrests by Age, Sex, and Race data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains highly granular data on the number of people arrested for a variety of crimes (see below for a full list of included crimes). The data sets here combine data from the years 1980-2015 into a single file. These files are quite large and may take some time to load.
All the data was downloaded from NACJD as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here. https://github.com/jacobkap/crime_data. If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.

I did not make any changes to the data other than the following. When an arrest column has a value of "None/not reported", I change that value to zero. This makes the (possible incorrect) assumption that these values represent zero crimes reported. The original data does not have a value when the agency reports zero arrests other than "None/not reported." In other words, this data does not differentiate between real zeros and missing values. Some agencies also incorrectly report the following numbers of arrests which I change to NA: 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99999, 99998.

To reduce file size and make the data more manageable, all of the data is aggregated yearly. All of the data is in agency-year units such that every row indicates an agency in a given year. Columns are crime-arrest category units. For example, If you choose the data set that includes murder, you would have rows for each agency-year and columns with the number of people arrests for murder. The ASR data breaks down arrests by age and gender (e.g. Male aged 15, Male aged 18). They also provide the number of adults or juveniles arrested by race. Because most agencies and years do not report the arrestee's ethnicity (Hispanic or not Hispanic) or juvenile outcomes (e.g. referred to adult court, referred to welfare agency), I do not include these columns.

To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. Please note that some of the FIPS codes have leading zeros and if you open it in Excel it will automatically delete those leading zeros.

I created 9 arrest categories myself. The categories are:
Total Male JuvenileTotal Female JuvenileTotal Male AdultTotal Female AdultTotal MaleTotal FemaleTotal JuvenileTotal AdultTotal ArrestsAll of these categories are based on the sums of the sex-age categories (e.g. Male under 10, Female aged 22) rather than using the provided age-race categories (e.g. adult Black, juvenile Asian). As not all agencies report the race data, my method is more accurate. These categories also make up the data in the "simple" version of the data. The "simple" file only includes the above 9 columns as the arrest data (all other columns in the data are just agency identifier columns). Because this "simple" data set need fewer columns, I include all offenses.

As the arrest data is very granular, and each category of arrest is its own column, there are dozens of columns per crime. To keep the data somewhat manageable, there are nine different files, eight which contain different crimes and the "simple" file. Each file contains the data for all years. The eight categories each have crimes belonging to a major crime category and do not overlap in crimes other than with the index offenses. Please note that the crime names provided below are not the same as the column names in the data. Due to Stata limiting column names to 32 characters maximum, I have abbreviated the crime names in the data. The files and their included crimes are:

Index Crimes
MurderRapeRobberyAggravated AssaultBurglaryTheftMotor Vehicle TheftArsonAlcohol CrimesDUIDrunkenness
LiquorDrug CrimesTotal DrugTotal Drug SalesTotal Drug PossessionCannabis PossessionCannabis SalesHeroin or Cocaine PossessionHeroin or Cocaine SalesOther Drug PossessionOther Drug SalesSynthetic Narcotic PossessionSynthetic Narcotic SalesGrey Collar and Property CrimesForgeryFraudStolen PropertyFinancial CrimesEmbezzlementTotal GamblingOther GamblingBookmakingNumbers LotterySex or Family CrimesOffenses Against the Family and Children
Other Sex Offenses
ProstitutionRapeViolent CrimesAggravated AssaultMurderNegligent ManslaughterRobberyWeapon Offenses
Other CrimesCurfewDisorderly ConductOther Non-trafficSuspicion
VandalismVagrancy
Simple
This data set has every crime and only the arrest categories that I created (see above).
If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.
Data from: Bike Sharing Dataset
kaggle.com
zip
Updated Sep 10, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ram Vishnu R (2024). Bike Sharing Dataset [Dataset]. https://www.kaggle.com/datasets/ramvishnur/bike-sharing-dataset/data
Explore at:
zip(22674 bytes)Available download formats
Dataset updated
Sep 10, 2024
Authors
Ram Vishnu R
Description
Problem Statement:

A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short term basis for a price or free. Many bike share systems allow people to borrow a bike from a "dock" which is usually computer-controlled wherein the user enters the payment information, and the system unlocks it. This bike can then be returned to another dock belonging to the same system.

A US bike-sharing provider BoomBikes has recently suffered considerable dip in their revenue due to the Corona pandemic. The company is finding it very difficult to sustain in the current market scenario. So, it has decided to come up with a mindful business plan to be able to accelerate its revenue.

In such an attempt, BoomBikes aspires to understand the demand for shared bikes among the people. They have planned this to prepare themselves to cater to the people's needs once the situation gets better all around and stand out from other service providers and make huge profits.

They have contracted a consulting company to understand the factors on which the demand for these shared bikes depends. Specifically, they want to understand the factors affecting the demand for these shared bikes in the American market. The company wants to know:

Which variables are significant in predicting the demand for shared bikes.

How well those variables describe the bike demands

Based on various meteorological surveys and people's styles, the service provider firm has gathered a large dataset on daily bike demands across the American market based on some factors.

Business Goal:

You are required to model the demand for shared bikes with the available independent variables. It will be used by the management to understand how exactly the demands vary with different features. They can accordingly manipulate the business strategy to meet the demand levels and meet the customer's expectations. Further, the model will be a good way for management to understand the demand dynamics of a new market.

Data Preparation:

You can observe in the dataset that some of the variables like 'weathersit' and 'season' have values as 1, 2, 3, 4 which have specific labels associated with them (as can be seen in the data dictionary). These numeric values associated with the labels may indicate that there is some order to them - which is actually not the case (Check the data dictionary and think why). So, it is advisable to convert such feature values into categorical string values before proceeding with model building. Please refer the data dictionary to get a better understanding of all the independent variables.

You might notice the column 'yr' with two values 0 and 1 indicating the years 2018 and 2019 respectively. At the first instinct, you might think it is a good idea to drop this column as it only has two values so it might not be a value-add to the model. But in reality, since these bike-sharing systems are slowly gaining popularity, the demand for these bikes is increasing every year proving that the column 'yr' might be a good variable for prediction. So think twice before dropping it.

Model Building:

In the dataset provided, you will notice that there are three columns named 'casual', 'registered', and 'cnt'. The variable 'casual' indicates the number casual users who have made a rental. The variable 'registered' on the other hand shows the total number of registered users who have made a booking on a given day. Finally, the 'cnt' variable indicates the total number of bike rentals, including both casual and registered. The model should be built taking this 'cnt' as the target variable.

Model Evaluation:

When you're done with model building and residual analysis and have made predictions on the test set, just make sure you use the following two lines of code to calculate the R-squared score on the test set. python from sklearn.metrics import r2_score r2_score(y_test, y_pred) - where y_test is the test data set for the target variable, and y_pred is the variable containing the predicted values of the target variable on the test set. - Please perform this step as the R-squared score on the test set holds as a benchmark for your model.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

vikram amin (2023). Housing Price Prediction using DT and RF in R [Dataset]. https://www.kaggle.com/datasets/vikramamin/housing-price-prediction-using-dt-and-rf-in-r

Housing Price Prediction using DT and RF in R

Decision Tree and Random Forest in R for house price prediction

Explore at:

zip(629100 bytes)Available download formats

Dataset updated

Aug 31, 2023

Authors

vikram amin

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

Objective: To predict the prices of houses in the City of Melbourne
Approach: Using Decision Tree and Random Forest https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Ffc6fb7d0bd8e854daf7a6f033937a397%2FPicture1.png?generation=1693489996707941&alt=media" alt="">
Data Cleaning:
Date column is shown as a character vector which is converted into a date vector using the library ‘lubridate’
We create a new column called age to understand the age of the house as it can be a factor in the pricing of the house. We extract the year from column ‘Date’ and subtract it from the column ‘Year Built’
We remove 11566 records which have missing values
We drop columns which are not significant such as ‘X’, ‘suburb’, ‘address’, (we have kept zipcode as it serves the purpose in place of suburb and address), ‘type’, ‘method’, ‘SellerG’, ‘date’, ‘Car’, ‘year built’, ‘Council Area’, ‘Region Name’
We split the data into ‘train’ and ‘test’ in 80/20 ratio using the sample function
Run libraries ‘rpart’, ‘rpart.plot’, ‘rattle’, ‘RcolorBrewer’
Run decision tree using the rpart function. ‘Price’ is the dependent variable https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F6065322d19b1376c4a341a4f22933a51%2FPicture2.png?generation=1693490067579017&alt=media" alt="">
Average price for 5464 houses is $1084349
Where building area is less than 200.5, the average price for 4582 houses is $931445. Where building area is less than 200.5 & age of the building is less than 67.5 years, the avg price for 3385 houses is $799299.6.
$4801538 is the Highest average prices of 13 houses where distance is lower than 5.35 & building are is >280.5
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F136542b7afb6f03c1890bae9b07dc464%2FDecision%20Tree%20Plot.jpeg?generation=1693490124083168&alt=media" alt="">
We use the caret package for tuning the parameter and the optimal complexity parameter found is 0.01 with RMSE 445197.9 https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Feb1633df9dd61ba3a51574873b055fd0%2FPicture3.png?generation=1693490163033658&alt=media" alt="">
We use library (Metrics) to find out the RMSE ($392107), MAPE (0.297) which means an accuracy of 99.70% and MAE ($272015.4)
Variables ‘postcode’, longitude and building are the most important variables
Test$Price indicates the actual price and test$predicted indicates the predicted price for particular 6 houses. https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F620b1aad968c9aee169d0e7371bf3818%2FPicture4.png?generation=1693490211728176&alt=media" alt="">
We use the default parameters of random forest on the train data https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fe9a3c3f8776ee055e4a1bb92d782e19c%2FPicture5.png?generation=1693490244695668&alt=media" alt="">
The below image indicates that ‘Building Area’, ‘Age of the house’ and ‘Distance’ are the most important variables that affect the price of the house. https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fc14d6266184db8f30290c528d72b9f6b%2FRandom%20Forest%20Variables%20Importance.jpeg?generation=1693490284920037&alt=media" alt="">
Based on the default parameters, RMSE is $250426.2, MAPE is 0.147 (accuracy is 99.853%) and MAE is $151657.7
Error starts to remain constant between 100 to 200 trees and thereafter there is almost minimal reduction. We can choose N tree=200. https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F365f9e8587d3a65805330889d22f9e60%2FNtree%20Plot.jpeg?generation=1693490308734539&alt=media" alt="">
We tune the model and find mtry = 3 has the lowest out of bag error
We use the caret package and use 5 fold cross validation technique
RMSE is $252216.10 , MAPE is 0.146 (accuracy is 99.854%) , MAE is $151669.4
We can conclude that Random Forest give us more accurate results as compared to Decision Tree
In Random Forest , the default parameters (N tree = 500) give us lower RMSE and MAPE as compared to N tree = 200. So we can proceed with those parameters.

Clear search

Close search

Google apps

Main menu

Housing Price Prediction using DT and RF in R

Data from: Superconductor-ferromagnet hybrids for non-reciprocal electronics...

Data from: Projections of Definitive Screening Designs by Dropping Columns:...

FacialRecognition

head(d.train)

install.packages('foreach')

save(d.train, im.train, d.test, im.test, file='data.Rd')

load('data.Rd')

install.packages('reshape2')

KC_House Dataset -Linear Regression of Home Prices

Young and older adult vowel categorization responses

Young and older adult vowel categorization responses

DataSheet1_Evaluation of Version 3 Total and Tropospheric Ozone Columns From...

South African Award Winning Wines

Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race,...

Data from: Bike Sharing Dataset

Problem Statement:

Business Goal:

Data Preparation:

Model Building:

Model Evaluation:

Housing Price Prediction using DT and RF in R

Decision Tree and Random Forest in R for house price prediction