4 datasets found
  1. Data from: Bike Sharing Dataset

    • kaggle.com
    Updated Sep 10, 2024
    Cite
    Ram Vishnu R (2024). Bike Sharing Dataset [Dataset]. https://www.kaggle.com/datasets/ramvishnur/bike-sharing-dataset
    Explore at:
    Croissant, a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Sep 10, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ram Vishnu R
    Description

    Problem Statement:

    A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short-term basis, for a price or for free. Many bike-share systems let people borrow a bike from a computer-controlled "dock": the user enters payment information, the system unlocks a bike, and the bike can then be returned to another dock belonging to the same system.

    The US bike-sharing provider BoomBikes has recently suffered a considerable dip in revenue due to the COVID-19 pandemic. The company is finding it very difficult to sustain itself in the current market, so it has decided to come up with a deliberate business plan to accelerate its revenue.

    To that end, BoomBikes wants to understand the demand for shared bikes so that it is prepared to meet people's needs once conditions improve, stand out from other service providers, and increase profits.

    They have contracted a consulting company to identify the factors on which demand for these shared bikes depends, specifically in the American market. The company wants to know:

    • Which variables are significant in predicting demand for shared bikes.
    • How well those variables describe the variation in demand.

    Based on meteorological surveys and people's lifestyles, the service-provider firm has gathered a large dataset of daily bike demand across the American market.

    Business Goal:

    You are required to model the demand for shared bikes using the available independent variables. Management will use the model to understand exactly how demand varies with different features, adjust business strategy to meet demand levels and customer expectations, and understand the demand dynamics of a new market.

    Data Preparation:

    1. You can observe in the dataset that some variables, such as 'weathersit' and 'season', take the values 1, 2, 3, 4, and each value has a specific label given in the data dictionary. The numeric codes may suggest an ordering, which is not actually the case (check the data dictionary and think about why). It is therefore advisable to convert such features into categorical string values before model building; a sketch of this conversion follows this list. Please refer to the data dictionary for a better understanding of all the independent variables.
    2. You might notice that the column 'yr' has two values, 0 and 1, indicating the years 2018 and 2019 respectively. Your first instinct might be to drop this column, reasoning that a two-valued column adds little to the model. In reality, since these bike-sharing systems are steadily gaining popularity, demand is increasing every year, which suggests 'yr' could be a good predictor. So think twice before dropping it.
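
    For instance, here is a minimal pandas sketch of that conversion. The label mappings below are assumptions based on the usual UCI bike-sharing data dictionary, and the file name is hypothetical; verify both against the data dictionary shipped with this dataset.

    import pandas as pd

    day = pd.read_csv("day.csv") #hypothetical file name; use this dataset's actual CSV

    #Map the numeric codes to their data-dictionary labels, then mark the columns categorical
    #(assumed labels; check your data dictionary)
    day["season"] = day["season"].map(
        {1: "spring", 2: "summer", 3: "fall", 4: "winter"}).astype("category")
    day["weathersit"] = day["weathersit"].map(
        {1: "clear", 2: "mist", 3: "light_precip", 4: "heavy_precip"}).astype("category")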

    Model Building:

    In the dataset provided, you will notice three columns named 'casual', 'registered', and 'cnt'. The variable 'casual' indicates the number of casual users who made a rental on a given day, 'registered' indicates the number of registered users who made a booking that day, and 'cnt' indicates the total number of bike rentals, including both casual and registered. The model should be built with 'cnt' as the target variable.
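
    A minimal sketch of that setup, continuing from the pandas sketch above (the multiple linear regression model and the 70/30 split are assumptions to illustrate the flow, not prescribed by the dataset):

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    #Drop the target and its two components: 'casual' and 'registered' sum to 'cnt'
    #and would leak the target. Identifier/date columns (e.g. 'instant', 'dteday'
    #in the UCI layout) should also be dropped if present.
    X = pd.get_dummies(day.drop(columns=["casual", "registered", "cnt"]), drop_first=True)
    y = day["cnt"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)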

    Model Evaluation:

    When you're done with model building and residual analysis and have made predictions on the test set, make sure you use the following two lines of code to calculate the R-squared score on the test set:

    from sklearn.metrics import r2_score
    r2_score(y_test, y_pred)

    Here y_test is the test-set values of the target variable and y_pred is the variable containing the predicted values of the target variable on the test set. Please perform this step, as the R-squared score on the test set serves as the benchmark for your model.

  2. fMRI dataset: Violations of psychological and physical expectations in human adult brains

    • openneuro.org
    Updated Jan 17, 2024
    Cite
    Shari Liu; Kirsten Lydic; Lingjie Mei; Rebecca Saxe (2024). fMRI dataset: Violations of psychological and physical expectations in human adult brains [Dataset]. http://doi.org/10.18112/openneuro.ds004934.v1.0.0
    Explore at:
    Dataset updated
    Jan 17, 2024
    Dataset provided by
    OpenNeuro (https://openneuro.org/)
    Authors
    Shari Liu; Kirsten Lydic; Lingjie Mei; Rebecca Saxe
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Dataset description

    This dataset contains fMRI data from adults, reported in one paper comprising two experiments:

    Liu, S., Lydic, K., Mei, L., & Saxe, R. (in press, Imaging Neuroscience). Violations of physical and psychological expectations in the human adult brain. Preprint: https://doi.org/10.31234/osf.io/54x6b

    All subjects who contributed data to this repository consented explicitly to share their de-faced brain images publicly on OpenNeuro. Experiment 1 has 16 subjects who gave consent to share (17 total), and Experiment 2 has 29 subjects who gave consent to share (32 total). Experiment 1 subjects have subject IDs starting with "SAXNES*", and Experiment 2 subjects have subject IDs starting with "SAXNES2*".

    • code/ contains contrast files used in published work
    • sub-SAXNES*/ contains anatomical and functional images, and event files for each functional image. Event files contain the onset, duration, and condition labels
    • CHANGES will be logged in this file

    Tasks

    • VOE (Experiment 1 version): Novel task using hand-crafted stimuli from developmental psychology, showing violations of object solidity and support, and violations of goal-directed and efficient action. There were only 4 sets of stimuli in this experiment, which repeated across runs. Shown in mini-blocks of familiarization + two test events.
    • VOE (Experiment 2 version): Novel task including all stimuli from Experiment 1 except for support, showing violations of object permanence and continuity (from ADEPT dataset; Smith et al. 2019) and violations of goal-directed and efficient action (from AGENT dataset; Shu et al. 2021). Shown in pairs of familiarization + one test event (either expected or unexpected). All subjects saw one set of stimuli in runs 1-2, and a second set of stimuli in runs 3-4. If someone saw an expected outcome from a scenario in one run, they saw the unexpected outcome from the same scenario in the other run.
    • DOTS (2 runs, both Exp 1-2): Task contrasting social and physical interaction (Fischer et al. 2016, PNAS). Designed to localize regions like the STS and SMG.
    • Motion: Task contrasting coherent and incoherent motion (Robertson et al. 2014, Brain). Designed to localize area MT.
    • spWM: Task contrasting a hard vs easy spatial working memory task (Fedorenko et al., 2013, PNAS). Designed to localize multiple demand regions.

    There are (anonymized) event files associated with each run, subject, and task, as well as contrast files.

    Event files

    All event files, for all tasks, have the following columns: onset_time, duration, trial_type, and response_time. Below are notes about subject-specific event files.

    • sub-SAXNES2s001: The original MotionLoc outputs list the first block, 10 s into the experiment, as the first event; this was preceded by a 10 s fixation. For s001, prior to updating the script to reflect this 10 s lag, we had to estimate: on average each block lasted 11.8 s with a ~0.05 s delay, so each block started ~11.85 s after the previous one. We therefore calculated start times as 11.85 s after the previous block (see the sketch after these notes). For the rest of the subjects, the outputs were not manipulated; we just added an event to the start of the run.
    • sub-SAXNES2s013: no event files were saved for DOTS run 2; the provided event files instead use approximate timings based on inferred block order
    • sub-SAXNES2s018 (excluded from sample): no event files, because this subject stopped participating before contributing a complete, low-motion run in which they were clearly following the task instructions
    • sub-SAXNES2s019: there was no time for run 2 of DOTS or Motion, so there is only one run for those two tasks
    • sub-SAXNES2s023: the event files from spWM run 1 did not save during scanning. We use timings from the default settings of condition 1, but we do not have trial-level data for this subject.
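
    For concreteness, the onset reconstruction described in the sub-SAXNES2s001 note amounts to the following (a sketch only; the block count here is hypothetical):

    #First block begins 10 s in; each later block starts ~11.85 s after the previous
    #(11.8 s average block length + ~0.05 s delay), per the note above
    n_blocks = 21 #hypothetical count, for illustration
    onsets = [10.0 + i * 11.85 for i in range(n_blocks)]
    print(onsets[:3]) #approximately [10.0, 21.85, 33.7]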

    For the DOTS and VOE event files from Experiment 1, we have the additional columns:

    • experimentName ('DotsSocPhys' or 'VOESocPhys')
    • correct: at the end of the trial, subjects made a response. In DOTS, they indicated whether the dot that disappeared reappeared at a plausible location. In VOE, they pressed a button when the fixation appeared as a cross rather than a plus sign. This column indicates whether the subject responded correctly (1/0)
    • stim_path: path to the stimuli, relative to the root BIDS directory, i.e. BIDS/stimuli/DOTS/xxxx

    For the DOTS event files from Experiment 2, we have the additional columns:

    • participant: redundant with the file name
    • experiment_name: name of the task, redundant with file name
    • block_order: which order the dots trials happened in (1 or 2)
    • prop_correct: the proportion of correct responses over the entire run

    For the Motion event files from Experiment 2, we have the additional columns:

    • experiment_name: name of the task, redundant with file name
    • block_order: which order the blocks happened in (1 or 2)
    • event: the index of the current event (0-22)

    For the spWM event files from Experiment 2, we have the additional columns:

    • experiment_name: name of the task, redundant with file name
    • participant: redundant with the file name
    • block_order: which order the blocks happened in (1 or 2)
    • run_accuracy_hard: the proportion of correct responses for the hard trials in this run
    • run_accuracy_easy: the proportion of correct responses for the easy trials in this run

    For the VOE event files from Experiment 2, we have the additional columns:

    • trial_type_specific: identifies trials at one more level of granularity, with respect to domain, task, and event (e.g. psychology_efficiency_unexp)
    • trial_type_morespecific: similar to trial_type_specific but includes information about domain, task, scenario, and event (e.g. psychology_efficiency_trial-15-over_unexp)
    • experiment_name: name of the task, redundant with file name
    • participant: redundant with the file name
    • correct: whether the response for this trial was correct (1, or 0)
    • time_elapsed: how much time has elapsed by the end of this trial, in ms
    • trial_n: the index of the current event
    • correct_answer: what the correct answer was for the attention check (yes or no)
    • subject_correct: whether the subject in fact was correct in their response
    • event: fam, expected, or unexpected
    • identical_tests: were the test events identical, for this trial?
    • stim_ID: numerical string picking out each unique stimulus
    • scenario_string: string identifying each scenario within each task
    • domain: physics, psychology (psychology-action), both (psychology-environment)
    • task: solidity, permanence, goal, efficiency, infer-constraints, or agent-solidity
    • prop_correct: the proportion of correct responses over the entire run
    • stim_path: path to the stimuli, relative to the root BIDS directory, i.e. BIDS/stimuli/VOE/xxxx
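
    A minimal sketch for loading one of these event files with pandas (the subject/task/run names in the path are placeholders; the layout is assumed to follow standard BIDS conventions):

    import pandas as pd

    #BIDS event files are tab-separated; the path below is illustrative
    events = pd.read_csv(
        "sub-SAXNES2s001/func/sub-SAXNES2s001_task-Motion_run-01_events.tsv",
        sep="\t")

    #Columns shared by all tasks, per the description above
    print(events[["onset_time", "duration", "trial_type", "response_time"]].head())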


  3. Case Study: Cyclist

    • kaggle.com
    Updated Jul 27, 2021
    Cite
    PatrickRCampbell (2021). Case Study: Cyclist [Dataset]. https://www.kaggle.com/patrickrcampbell/case-study-cyclist/discussion
    Explore at:
    Croissant, a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Jul 27, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    PatrickRCampbell
    Description

    Phase 1: ASK

    Key Objectives:

    1. Business Task: Cyclist is looking to increase its earnings and wants to know whether a social media campaign can influence "Casual" users to become "Annual" members.

    2. Key Stakeholders: The main stakeholder from Cyclist is Lily Moreno, who is the Director of Marketing and responsible for developing campaigns and initiatives to promote the bike-share program. The other teams involved with this project are Marketing & Analytics and the Executive Team.

    3. Business Task: Compare the two kinds of users and define how they use the platform, which variables they have in common, which variables differ, and how Casual users can be converted into Annual members.

    Phase 2: PREPARE:

    Key Objectives:

    1. Determine Data Credibility: Cyclist provided data from 2013 through March 2021, all of which is first-hand data collected by the company.

    2. Sort & Filter Data: The stakeholders want to know how current users are using the service, so I am focusing on the data from 2020-2021, the most relevant period for answering the business task.

    #Installing packages
    install.packages("tidyverse", repos = "http://cran.us.r-project.org")
    install.packages("readr", repos = "http://cran.us.r-project.org")
    install.packages("janitor", repos = "http://cran.us.r-project.org")
    install.packages("geosphere", repos = "http://cran.us.r-project.org")
    install.packages("gridExtra", repos = "http://cran.us.r-project.org")
    
    library(tidyverse)
    library(readr)
    library(janitor)
    library(geosphere)
    library(gridExtra)
    library(lubridate) #for wday() used below
    
    #Importing data & verifying the information within the dataset
    all_tripdata_clean <- read.csv("/Data Projects/cyclist/cyclist_data_cleaned.csv")
    
    glimpse(all_tripdata_clean)
    
    summary(all_tripdata_clean)
    
    

    Phase 3: PROCESS

    Key Objectives:

    1. Cleaning Data & Preparing for Analysis: Once the data has been combined into one dataset and checked for errors, we begin cleaning it: eliminating data that corresponds to the company servicing the bikes, and any ride with a traveled distance of zero. New columns are added to assist the analysis and to provide an accurate picture of who is using the bikes.

    #Eliminating any data that represents the company performing maintenance, and trips with negative ride length
    #(note: ride_length is computed in the next code block, so that step must be run first)
    all_tripdata_clean <- all_tripdata_clean[!(all_tripdata_clean$start_station_name == "HQ QR" | all_tripdata_clean$ride_length < 0),]
    
    #Creating columns for the individual date components ('date' must be created first; the rest derive from it)
    all_tripdata_clean$date <- as.Date(all_tripdata_clean$started_at)
    all_tripdata_clean$day_of_week <- format(all_tripdata_clean$date, "%A")
    all_tripdata_clean$day <- format(all_tripdata_clean$date, "%d")
    all_tripdata_clean$month <- format(all_tripdata_clean$date, "%m")
    all_tripdata_clean$year <- format(all_tripdata_clean$date, "%Y")
    
    

    Now I will calculate the length of rides taken, the distance traveled, and the mean time & distance.

    #Calculating the ride length in minutes and the ride distance in miles
    all_tripdata_clean$ride_length <- difftime(all_tripdata_clean$ended_at, all_tripdata_clean$started_at, units = "mins")
    
    all_tripdata_clean$ride_distance <- distGeo(matrix(c(all_tripdata_clean$start_lng, all_tripdata_clean$start_lat), ncol = 2), matrix(c(all_tripdata_clean$end_lng, all_tripdata_clean$end_lat), ncol = 2))
    all_tripdata_clean$ride_distance <- all_tripdata_clean$ride_distance / 1609.34 #distGeo returns meters; converting to miles
    
    #Calculating the mean time and distance for each user group
    userType_means <- all_tripdata_clean %>%
     group_by(member_casual) %>%
     summarise(mean_time = mean(ride_length), mean_distance = mean(ride_distance))
    

    Adding calculations that differentiate between bike types and show which type of user rides each specific bike type.

    #Calculations
    
    with_bike_type <- all_tripdata_clean %>% filter(rideable_type == "classic_bike" | rideable_type == "electric_bike")
    
    #Totals by user type, bike type, and weekday (wday() comes from lubridate)
    with_bike_type %>%
     mutate(weekday = wday(started_at, label = TRUE)) %>%
     group_by(member_casual, rideable_type, weekday) %>%
     summarise(totals = n(), .groups = "drop")
    
    #Totals by user type and bike type
    with_bike_type %>%
     group_by(member_casual, rideable_type) %>%
     summarise(totals = n(), .groups = "drop")
    
    #Calculating the ride differential
    all_tripdata_clean %>%
     mutate(weekday = wday(started_at, label = TRUE)) %>%
     group_by(member_casual, weekday) %>%
     summarise(number_of_rides = n(),
          average_duration = mean(ride_length), .groups = 'drop') %>%
     arrange(me...
    
  4. CLM16gwl NSW Office of Water_GW licence extract linked to spatial locations_CLM_v3_13032014

    • cloud.csiss.gmu.edu
    • researchdata.edu.au
    • +2 more
    Updated Dec 13, 2019
    + more versions
    Cite
    Australia (2019). CLM16gwl NSW Office of Water_GW licence extract linked to spatial locations_CLM_v3_13032014 [Dataset]. https://cloud.csiss.gmu.edu/uddi/dataset/4b0e74ed-2fad-4608-a743-92163e13c30d
    Explore at:
    Dataset updated
    Dec 13, 2019
    Dataset provided by
    Australia
    Area covered
    New South Wales
    Description

    Abstract

    The dataset was derived by the Bioregional Assessment Programme. This dataset was derived from multiple datasets. You can find a link to the parent datasets in the Lineage Field in this metadata statement. The History Field in this metadata statement describes how this dataset was derived.

    The difference between NSW Office of Water GW licences - CLM v2 and v3 is that an additional column, 'Asset Class', has been added, which aggregates the purpose of the licence into the set classes for the Asset Database. The 'Completed_Depth' column, the total depth of the groundwater bore, has also been added. These columns were added for the purpose of the Asset Register.

    The aim of this dataset was to map each groundwater works to its volumetric entitlement without double-counting the volume, and to aggregate/disaggregate the data depending on the final use.

    This has not been clipped to the CLM PAE; the number of economic assets/relevant licences will therefore drop drastically once that clipping occurs.

    The Clarence Moreton groundwater licences include an extract of all licences that fell within the data management acquisition area, as provided by BA to the NSW Office of Water.

    Aim: To get a one-to-one ratio of licence numbers to bore IDs.

    Important notes about data:

    Data has not been clipped to the PAE.

    No decisions have been made regarding which groundwater purposes should be protected. The purpose field therefore currently includes groundwater bores drilled for non-extractive purposes, including experimental research, testing, monitoring, teaching, mineral exploration, and groundwater exploration.

    No volume has been included for domestic & stock, as it is a basic right; an arbitrary volume could be applied to account for D&S use.

    Licence Number - each sheet in the Original Data has a licence number, which is assumed to be the actual licence number. Some are old because they have not been updated to the new WA; some are new (From_Spreadsheet_WALs). This is the reason for the different codes.

    WA/CA - this is the 'works' number, assumed to indicate the bore permit or works approval. This is why there can be multiple works per licence and multiple licences per works number (for the complete glossary see http://registers.water.nsw.gov.au/wma/Glossary.jsp). Originally, the aim was to make sure that where there was more than one licence per works number, or multiple works per licence, the multiple instances were complete.

    The Clarence Moreton worksheet links each individual licence to a works and a volumetric entitlement. For most sites, this can be linked to a bore, which can be found in the NGIS through the HydroID (\wron\Project\BA\BA_all\Hydrogeology_National_Groundwater_Information_System_v1.1_Sept2013). This allows analysis of depths, lithology, and hydrostratigraphy where the data exist.

    We can aggregate the data based on water source and water management zone as can be seen in the other worksheets.

    Data available:

    Original Data: any data that was brought in from the NSW Office of Water, including:

    Spatial locations provided by NoW - data exported from the submitted shapefiles. Includes the licence numbers (LICENCE) and the bore IDs (WORK_NUO). (Refer to lineage: NSW Office of Water Groundwater Entitlements Spatial Locations.)

    Spreadsheet_WAL - the spreadsheet from the submitted data, WLS-EXTRACT_WALs_volume. (Refer to lineage: NSW Office of Water Groundwater Licence Extract CLM - Oct 2013.)

    WLS_extracts - the combined spreadsheets from the submitted data, WLS-EXTRACT. (Refer to lineage: NSW Office of Water Groundwater Licence Extract CLM - Oct 2013.)

    Aggregated share component by water sharing plan, water source, and water management zone.

    Dataset History

    The difference between NSW Office of Water GW licences - CLM v2 and v3 is that an additional column, 'Asset Class', has been added, which aggregates the purpose of the licence into the set classes for the Asset Database.

    Where purpose = domestic, or domestic & stock, or stock, the licence was classed as a 'basic water right'. Where it is listed as both a domestic/stock and a licensed use such as irrigation, it was classed as a 'water access right'. All other take-and-use purposes were classed as a 'water access right'. Where purpose = drainage, waste disposal, groundwater remediation, experimental research, null, conveyancing, or test bore, no asset class was given. Monitoring bores were classed as 'water supply and monitoring infrastructure'.

    Depth has also been included, which is the completed depth of the bore.

    Instructions

    Procedure: refer to Bioregional assessment data conversion script.docx

    1) The original spreadsheets have multiple licence instances when there is more than one WA/CA number, i.e. more than one works or permit per licence. The aim is to have only one instance per licence.

    2) The individual licence numbers were combined into one column

    3) Using the new column of licence numbers, several vlookups were created to bring in other data. Where columns are identical across the original spreadsheets, they were combined. The only ones that were not are Share/Entitlement/Allocation, as these mean different things.

    4) A HydroID column was created; this code links the NSW data to the NGIS and is basically the bore code with ".1.1" appended.

    5) All 'cancelled' licences were removed

    6) A count of the number of works per licence and number of bores were included in the spreadsheet.

    7) Where ShareComponent = NA, Entitlement = 0, Allocation = 0, and there was more than one instance of the same bore, the original licence assigned to the bore had been replaced by a new licence with a share component. Where these criteria were met, the redundant instances were removed.

    8) A volume-per-works value ensures that the volume of the licence is not repeated for each works, but is divided by the number of works.

    Bioregional assessment data conversion script

    Aim: The following document is the RStudio script for the conversion and merging of the bioregional assessment data.

    Requirements: The user will need RStudio. Some basic knowledge of R is recommended; without it, the only things that really need to be changed are the file locations and names. Note that R file paths are written differently from Windows paths, and the locations RStudio reads from depend on its working directory, so this must be set up properly before the script can be run.

    Procedure: The information below the dashed line is the script, which can be copied and pasted directly into RStudio. Any text after '#' is treated as a comment rather than code, so instructions can be added that way.

    ###########
    # 18/2/2014
    # Code by Brendan Dimech
    #
    # Script to merge extract files from submitted NSW bioregional
    # assessment and convert data into required format. Also use a 'vlookup'
    # process to get Bore and Location information from NGIS.
    #
    # There are 3 scripts, one for each of the individual regions.
    #
    ############
    
    # CLARENCE MORTON
    
    # Opening of files. Location can be changed if needed.
    # arc.file is the exported *.csv from the NGIS data which has bore data and Lat/long information.
    # Lat/long weren't in the file natively so were added to the table using Arc Toolbox tools.
    
    arc.folder = '/data/cdc_cwd_wra/awra/wra_share_01/GW_licencing_and_use_data/Rstudio/Data/Vlookup/Data'
    arc.file = "Moreton.csv"
    
    # Files from NSW came through in two types. WALS files, this included 'newer' licences that had a share component.
    # The 'OTH' files were older licences that had just an allocation. Some data was similar and this was combined,
    # and other information that wasn't similar from the datasets was removed.
    # This section is locating and importing the WALS and OTH files.
    
    WALS.folder = '/data/cdc_cwd_wra/awra/wra_share_01/GW_licencing_and_use_data/Rstudio/Data/Vlookup/Data'
    WALS.file = "GW_Clarence_Moreton_WLS-EXTRACT_4_WALs_volume.xls"
    OTH.file.1 = "GW_Clarence_Moreton_WLS-EXTRACT_1.xls"
    OTH.file.2 = "GW_Clarence_Moreton_WLS-EXTRACT_2.xls"
    OTH.file.3 = "GW_Clarence_Moreton_WLS-EXTRACT_3.xls"
    OTH.file.4 = "GW_Clarence_Moreton_WLS-EXTRACT_4.xls"
    newWALS.folder = '/data/cdc_cwd_wra/awra/wra_share_01/GW_licencing_and_use_data/Rstudio/Data/Vlookup/Products'
    newWALS.file = "Clarence_Moreton.csv"
    
    arc <- read.csv(paste(arc.folder, arc.file, sep="/"), header = TRUE, sep = ",")
    WALS <- read.table(paste(WALS.folder, WALS.file, sep="/"), header = TRUE, sep = "\t")
    
    # Merge any individual WALS and OTH files into a single WALS or OTH file if there were more than one.
    
    OTH1 <- read.table(paste(WALS.folder, OTH.file.1, sep="/"), header = TRUE, sep = "\t")
    OTH2 <- read.table(paste(WALS.folder, OTH.file.2, sep="/"), header = TRUE, sep = "\t")
    OTH3 <- read.table(paste(WALS.folder, OTH.file.3, sep="/"), header = TRUE, sep = "\t")
    OTH4 <- read.table(paste(WALS.folder, OTH.file.4, sep="/"), header = TRUE, sep = "\t")
    
    OTH <- merge(OTH1, OTH2, all.y = TRUE, all.x = TRUE)
    OTH <- merge(OTH, OTH3, all.y = TRUE, all.x = TRUE)
    OTH <- merge(OTH, OTH4, all.y = TRUE, all.x = TRUE)
    
    # Add new columns to OTH for the BORE, LAT and LONG. Then use 'merge' as a vlookup to add the corresponding
    # bore and location from the arc file. The WALS and OTH files are slightly different because the arc file has
    # a different licence number added in.
    
    OTH <- data.frame(OTH, BORE = "", LAT = "", LONG = "")
    OTH$BORE <- arc$WORK_NO[match(OTH$LICENSE.APPROVAL, arc$LICENSE)]
    OTH$LAT <- arc$POINT_X[match(OTH$LICENSE.APPROVAL, arc$LICENSE)]
    OTH$LONG <-

