A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short-term basis, for a price or for free. Many bike-share systems allow people to borrow a bike from a computer-controlled "dock": the user enters payment information, the system unlocks a bike, and the bike can then be returned to another dock belonging to the same system.
BoomBikes, a US bike-sharing provider, has recently suffered a considerable dip in revenue due to the COVID-19 pandemic. The company is finding it very difficult to sustain itself in the current market scenario, so it has decided to come up with a mindful business plan to accelerate its revenue.
In such an attempt, BoomBikes aspires to understand the demand for shared bikes among people, so that it can prepare to cater to people's needs once the situation improves, stand out from other service providers, and make huge profits.
They have contracted a consulting company to understand the factors on which the demand for these shared bikes depends, specifically the factors affecting the demand for shared bikes in the American market.
Based on various meteorological surveys and people's lifestyles, the service provider firm has gathered a large dataset of daily bike demand across the American market.
You are required to model the demand for shared bikes with the available independent variables. Management will use the model to understand exactly how demand varies with different features, adjust the business strategy accordingly to meet demand levels and customer expectations, and understand the demand dynamics of a new market.
In the dataset provided, you will notice three columns named 'casual', 'registered', and 'cnt'. The variable 'casual' indicates the number of casual users who have made a rental. The variable 'registered', on the other hand, shows the total number of registered users who have made a booking on a given day. Finally, the 'cnt' variable indicates the total number of bike rentals, including both casual and registered. The model should be built with 'cnt' as the target variable.
When you're done with model building and residual analysis and have made predictions on the test set, make sure you use the following two lines of code to calculate the R-squared score on the test set:
```python
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
```
- where y_test is the test-set values of the target variable, and y_pred is the variable containing the predicted values of the target variable on the test set.
- Please perform this step, as the R-squared score on the test set serves as the benchmark for your model.
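For concreteness, here is a minimal sketch of the workflow described above: split the data, fit a plain linear regression, and compute the benchmark score. The file name bike_demand.csv is a placeholder, and keeping only numeric columns is a simplifying assumption; the real assignment will likely involve encoding categorical features and selecting variables.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

df = pd.read_csv("bike_demand.csv")  # hypothetical file name

# 'casual' and 'registered' sum to 'cnt', so they must be excluded
# from the features to avoid leaking the target.
X = df.drop(columns=["casual", "registered", "cnt"]).select_dtypes("number")
y = df["cnt"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
print(r2_score(y_test, y_pred))  # the benchmark score required above
```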
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains fMRI data from human adults, from one paper comprising two experiments:
Liu, S., Lydic, K., Mei, L., & Saxe, R. (in press, Imaging Neuroscience). Violations of physical and psychological expectations in the human adult brain. Preprint: https://doi.org/10.31234/osf.io/54x6b
All subjects who contributed data to this repository consented explicitly to share their de-faced brain images publicly on OpenNeuro. Experiment 1 has 16 subjects who gave consent to share (17 total), and Experiment 2 has 29 subjects who gave consent to share (32 total). Experiment 1 subjects have subject IDs starting with "SAXNES*", and Experiment 2 subjects have subject IDs starting with "SAXNES2*".
There are (anonymized) event files associated with each run, subject, and task, as well as contrast files.
All event files, for all tasks, have the following columns: onset_time, duration, trial_type, and response_time. Below are notes about subject-specific event files.

For the DOTS and VOE event files from Experiment 1, we have the additional columns:
- experimentName: 'DotsSocPhys' or 'VOESocPhys'
- correct: at the end of the trial, subjects made a response. In DOTS, they indicated whether the dot that disappeared reappeared at a plausible location. In VOE, they pressed a button when the fixation appeared as a cross rather than a plus sign. This column indicates whether the subject responded correctly (1/0)
- stim_path: path to the stimuli, relative to the root BIDS directory, i.e. BIDS/stimuli/DOTS/xxxx

For the DOTS event files from Experiment 2, we have the additional columns:
- participant: redundant with the file name
- experiment_name: name of the task, redundant with the file name
- block_order: which order the dots trials happened in (1 or 2)
- prop_correct: the proportion of correct responses over the entire run

For the Motion event files from Experiment 2, we have the additional columns:
- experiment_name: name of the task, redundant with the file name
- block_order: which order the dots trials happened in (1 or 2)
- event: the index of the current event (0-22)

For the spWM event files from Experiment 2, we have the additional columns:
- experiment_name: name of the task, redundant with the file name
- participant: redundant with the file name
- block_order: which order the dots trials happened in (1 or 2)
- run_accuracy_hard: the proportion of correct responses for the hard trials in this run
- run_accuracy_easy: the proportion of correct responses for the easy trials in this run

For the VOE event files from Experiment 2, we have the additional columns:
- trial_type_specific: identifies trials at one more level of granularity, with respect to domain, task, and event (e.g. psychology_efficiency_unexp)
- trial_type_morespecific: similar to trial_type_specific, but includes information about domain, task, scenario, and event (e.g. psychology_efficiency_trial-15-over_unexp)
- experiment_name: name of the task, redundant with the file name
- participant: redundant with the file name
- correct: whether the response for this trial was correct (1 or 0)
- time_elapsed: how much time has elapsed by the end of this trial, in ms
- trial_n: the index of the current event
- correct_answer: what the correct answer was for the attention check (yes or no)
- subject_correct: whether the subject was in fact correct in their response
- event: fam, expected, or unexpected
- identical_tests: whether the test events were identical for this trial
- stim_ID: numerical string picking out each unique stimulus
- scenario_string: string identifying each scenario within each task
- domain: physics, psychology (psychology-action), or both (psychology-environment)
- task: solidity, permanence, goal, efficiency, infer-constraints, or agent-solidity
- prop_correct: the proportion of correct responses over the entire run
- stim_path: path to the stimuli, relative to the root BIDS directory, i.e. BIDS/stimuli/VOE/xxxx
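As a quick sanity check of this layout, one can load a single event file and confirm the shared and task-specific columns. A minimal sketch, with a hypothetical subject ID and run number in the path:

```python
import pandas as pd

# Hypothetical path; substitute a real subject/task/run from the repository.
events = pd.read_csv(
    "BIDS/sub-SAXNES2s01/func/sub-SAXNES2s01_task-VOE_run-01_events.tsv",
    sep="\t",  # BIDS event files are tab-separated
)

# Columns shared by all tasks, per the notes above.
print(events[["onset_time", "duration", "trial_type", "response_time"]].head())

# Any remaining columns are the task-specific extras.
shared = {"onset_time", "duration", "trial_type", "response_time"}
print(sorted(set(events.columns) - shared))
```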
Phase 1: ASK
1. Business Task: * Cyclist is looking to increase its earnings, and wants to know whether a social media campaign can influence "Casual" users to become "Annual" members.
2. Key Stakeholders: * The main stakeholder from Cyclist is Lily Moreno, who is the Director of Marketing and responsible for the development of campaigns and initiatives to promote the bike-share program. The other teams involved with this project are Marketing & Analytics and the Executive Team.
3. Business Task: * Compare the two kinds of users and define how they use the platform: which variables they have in common, which variables differ, and how Casual users can be converted into Annual members.
Phase 2: PREPARE:
1. Determine Data Credibility: * Cyclist provided data from 2013 through March 2021, all of which is first-hand data collected by the company.
2. Sort & Filter Data: * The stakeholders want to know how current users are using the service, so I am focusing on the data from 2020-2021, since this is the most relevant period for answering the business task.
#Installing packages
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
install.packages("readr", repos = "http://cran.us.r-project.org")
install.packages("janitor", repos = "http://cran.us.r-project.org")
install.packages("geosphere", repos = "http://cran.us.r-project.org")
install.packages("gridExtra", repos = "http://cran.us.r-project.org")
library(tidyverse)
library(readr)
library(janitor)
library(geosphere)
library(gridExtra)
library(lubridate) #for wday(); loaded explicitly since it is core tidyverse only from v2.0.0
#Importing data & verifying the information within the dataset
all_tripdata_clean <- read.csv("/Data Projects/cyclist/cyclist_data_cleaned.csv")
glimpse(all_tripdata_clean)
summary(all_tripdata_clean)
Phase 3: PROCESS
1. Cleaning Data & Preparing for Analysis: * Once the data had been placed into one dataset and checked for errors, we began cleaning it: eliminating data that corresponds to the company servicing the bikes, and any ride with a negative ride length. New columns will be added to assist the analysis and to provide an accurate assessment of who is using the bikes.
#Calculating the ride length in minutes first, since the maintenance filter below uses it
all_tripdata_clean$ride_length <- difftime(all_tripdata_clean$ended_at, all_tripdata_clean$started_at, units = "mins")
#Eliminating any data that represents the company performing maintenance (HQ QR station), and trips with a negative ride length
all_tripdata_clean <- all_tripdata_clean[!(all_tripdata_clean$start_station_name == "HQ QR" | all_tripdata_clean$ride_length < 0),]
#Creating columns for the individual date components (date must be created first)
all_tripdata_clean$date <- as.Date(all_tripdata_clean$started_at)
all_tripdata_clean$day <- format(all_tripdata_clean$date, "%d")
all_tripdata_clean$month <- format(all_tripdata_clean$date, "%m")
all_tripdata_clean$year <- format(all_tripdata_clean$date, "%Y")
all_tripdata_clean$day_of_week <- format(all_tripdata_clean$date, "%A")
**Now I will calculate the distance traveled, and the mean ride time & distance.**
#Calculating the ride distance in miles (ride length in minutes was computed above)
all_tripdata_clean$ride_distance <- distGeo(matrix(c(all_tripdata_clean$start_lng, all_tripdata_clean$start_lat), ncol = 2), matrix(c(all_tripdata_clean$end_lng, all_tripdata_clean$end_lat), ncol = 2))
all_tripdata_clean$ride_distance <- all_tripdata_clean$ride_distance / 1609.34 #distGeo returns metres; converting to miles
#Calculating the mean time and distance based on the user groups
userType_means <- all_tripdata_clean %>%
  group_by(member_casual) %>%
  summarise(mean_time = mean(ride_length), mean_distance = mean(ride_distance))
Adding in calculations that will differentiate between bike types and show which type of user is using each specific bike type.
#Calculations
with_bike_type <- all_tripdata_clean %>% filter(rideable_type == "classic_bike" | rideable_type == "electric_bike")
#Totals by user type, bike type and weekday
with_bike_type %>%
  mutate(weekday = wday(started_at, label = TRUE)) %>%
  group_by(member_casual, rideable_type, weekday) %>%
  summarise(totals = n(), .groups = "drop")
#Totals by user type and bike type
with_bike_type %>%
  group_by(member_casual, rideable_type) %>%
  summarise(totals = n(), .groups = "drop")
#Calculating the ride differential
all_tripdata_clean %>%
  mutate(weekday = wday(started_at, label = TRUE)) %>%
  group_by(member_casual, weekday) %>%
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length), .groups = "drop") %>%
  arrange(member_casual, weekday)
The dataset was derived by the Bioregional Assessment Programme. This dataset was derived from multiple datasets. You can find a link to the parent datasets in the Lineage Field in this metadata statement. The History Field in this metadata statement describes how this dataset was derived.
The difference between NSW Office of Water GW licences - CLM v2 and v3 is that an additional column, 'Asset Class', has been added, which aggregates the purpose of the licence into the set classes for the Asset Database. 'Completed_Depth', the total depth of the groundwater bore, has also been added. These columns were added for the purpose of the Asset Register.
The aim of this dataset was to map each groundwater works to its volumetric entitlement without double counting the volume, and to aggregate/disaggregate the data depending on the final use.
This has not been clipped to the CLM PAE; therefore, the number of economic assets/relevant licences will drastically reduce once this occurs.
The Clarence Moreton groundwater licences include an extract of all licences that fell within the data management acquisition area, as provided by BA to the NSW Office of Water.
Aim: To get a one-to-one ratio of licence numbers to bore IDs.
Important notes about data:
Data has not been clipped to the PAE.
No decisions have been made regarding which purposes of groundwater use should be protected. Therefore the purpose currently includes groundwater bores drilled for non-extractive purposes, including experimental research, testing, monitoring, teaching, mineral exploration and groundwater exploration.
No volume has been included for domestic & stock, as it is a basic right; an arbitrary volume could, however, be applied to account for D&S use.
Licence Number - Each sheet in the Original Data has a licence number, which is assumed to be the actual licence number. Some are old, because they have not been updated to the new WA; some are new (From_Spreadsheet_WALs). This is the reason for the different codes.
WA/CA - This number is the 'works' number. It is assumed that the number indicates the bore permit or works approval, which is why there can be multiple works per licence and multiple licences per works number. (For a complete glossary see http://registers.water.nsw.gov.au/wma/Glossary.jsp.) Originally, the aim was to make sure that when there was more than one licence per works number, or multiple works per licence, the multiple instances were complete.
The Clarence Moreton worksheet links each individual licence to a works and a volumetric entitlement. For most sites, this can be linked to a bore, which can be found in the NGIS through the HydroID (\wron\Project\BA\BA_all\Hydrogeology_National_Groundwater_Information_System_v1.1_Sept2013). This will allow analysis of depths, lithology and hydrostratigraphy where the data exist.
We can aggregate the data based on water source and water management zone, as can be seen in the other worksheets.
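As an illustration of that aggregation, here is a minimal sketch in Python; the file and column names are assumptions for illustration (the accompanying conversion script further below works in R):

```python
import pandas as pd

# Hypothetical export of the licence worksheet, one row per licence.
licences = pd.read_csv("clarence_moreton_licences.csv")

# Sum the volumetric entitlement within each water source and
# water management zone, mirroring the aggregation worksheets.
by_zone = (
    licences
    .groupby(["water_source", "water_management_zone"], as_index=False)
    ["share_component"]
    .sum()
)
print(by_zone)
```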
Data available:
Original Data: any data that was brought in from the NSW Office of Water, including:
Spatial locations provided by NoW - exported data from the submitted shape files. Includes the licence (LICENCE) numbers and the bore ID (WORK_NUO). (Refer to lineage: NSW Office of Water Groundwater Entitlements Spatial Locations.)
Spreadsheet_WAL - the spreadsheet from the submitted data, WLS-EXTRACT_WALs_volume. (Refer to lineage: NSW Office of Water Groundwater Licence Extract CLM - Oct 2013.)
WLS_extracts - the combined spreadsheets from the submitted data, WLS-EXTRACT. (Refer to lineage: NSW Office of Water Groundwater Licence Extract CLM - Oct 2013.)
Aggregated share component to water sharing plan, water source and water management zone
As noted above, the difference between v2 and v3 is the additional 'Asset Class' column, which aggregates the purpose of the licence into the set classes for the Asset Database, as follows:
Where purpose = domestic; or domestic & stock; or stock, it was classed as a 'basic water right'. Where it is listed as both a domestic/stock and a licensed use such as irrigation, it was classed as a 'water access right'. All other take and use were classed as a 'water access right'. Where purpose = drainage, waste disposal, groundwater remediation, experimental research, null, conveyancing, or test bore, no asset class was given. Monitoring bores were classed as 'water supply and monitoring infrastructure'.
Depth has also been included, which is the completed depth of the bore.
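These classification rules amount to a simple mapping, sketched below as a function. The lowercase purpose strings and the has_licensed_use flag are illustrative assumptions, not the exact source values:

```python
def asset_class(purpose: str, has_licensed_use: bool = False):
    """Map a licence purpose to its Asset Class, per the rules above."""
    p = purpose.strip().lower()
    if p in {"domestic", "domestic & stock", "stock"}:
        # Domestic/stock combined with a licensed use (e.g. irrigation)
        # counts as a water access right rather than a basic right.
        return "water access right" if has_licensed_use else "basic water right"
    if p == "monitoring bore":
        return "water supply and monitoring infrastructure"
    if not p or p in {"drainage", "waste disposal", "groundwater remediation",
                      "experimental research", "null", "conveyancing", "test bore"}:
        return None  # no asset class assigned
    return "water access right"  # all other take and use
```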
Instructions
Procedure: refer to Bioregional assessment data conversion script.docx
1) The original spreadsheets have multiple licence instances when there is more than one WA/CA number, i.e. more than one works approval or permit per licence. The aim is to have only one instance of each licence.
2) The individual licence numbers were combined into one column.
3) Using the new column of licence numbers, several vlookups were created to bring in other data. Where columns are identical in the original spreadsheets, they are combined. The only columns not combined are Share, Entitlement and Allocation, as these mean different things.
4) A HydroID column was created; this is a code that links this NSW data to the NGIS, and is basically a ".1.1" appended to the end of the bore code.
5) All 'cancelled' licences were removed.
6) A count of the number of works per licence and the number of bores was included in the spreadsheet.
7) Where ShareComponent = NA, Entitlement = 0, Allocation = 0 and there was more than one instance of the same bore, the original licence assigned to the bore had been replaced by a new licence with a share component. Where these criteria were met, the instances were removed.
8) A volume-per-works value ensures that the volume of the licence is not repeated for each works: the licence volume is divided by the number of works.
Bioregional assessment data conversion script
Aim: The following document is the RStudio script for the conversion and merging of the bioregional assessment data.
Requirements: The user will need RStudio, and some basic knowledge of R is recommended. Without it, the only things that would really need to be changed are the file locations and names. R resolves file paths differently from Windows, and the locations RStudio reads from depend on where it is configured to point; this needs to be set up properly before the script can be run.
Procedure: The information below the dashed line is the script. It can be copied and pasted directly into RStudio. Any text prefixed with '#' is not executed, so instructions can be added as comments.
###########
# 18/2/2014
# Code by Brendan Dimech
#
# Script to merge extract files from submitted NSW bioregional
# assessment and convert data into required format. Also use a 'vlookup'
# process to get Bore and Location information from NGIS.
#
# There are 3 scripts, one for each of the individual regions.
#
############
# CLARENCE MORETON
# Opening of files. Location can be changed if needed.
# arc.file is the exported *.csv from the NGIS data which has bore data and Lat/long information.
# Lat/long weren't in the file natively so were added to the table using Arc Toolbox tools.
arc.folder = '/data/cdc_cwd_wra/awra/wra_share_01/GW_licencing_and_use_data/Rstudio/Data/Vlookup/Data'
arc.file = "Moreton.csv"
# Files from NSW came through in two types. WALS files, this included 'newer' licences that had a share component.
# The 'OTH' files were older licences that had just an allocation. Some data was similar and this was combined,
# and other information that wasn't similar from the datasets was removed.
# This section is locating and importing the WALS and OTH files.
WALS.folder = '/data/cdc_cwd_wra/awra/wra_share_01/GW_licencing_and_use_data/Rstudio/Data/Vlookup/Data'
WALS.file = "GW_Clarence_Moreton_WLS-EXTRACT_4_WALs_volume.xls"
OTH.file.1 = "GW_Clarence_Moreton_WLS-EXTRACT_1.xls"
OTH.file.2 = "GW_Clarence_Moreton_WLS-EXTRACT_2.xls"
OTH.file.3 = "GW_Clarence_Moreton_WLS-EXTRACT_3.xls"
OTH.file.4 = "GW_Clarence_Moreton_WLS-EXTRACT_4.xls"
newWALS.folder = '/data/cdc_cwd_wra/awra/wra_share_01/GW_licencing_and_use_data/Rstudio/Data/Vlookup/Products'
newWALS.file = "Clarence_Moreton.csv"
arc <- read.csv(paste(arc.folder, arc.file, sep="/" ), header =TRUE, sep = ",")
WALS <- read.table(paste(WALS.folder, WALS.file, sep="/" ), header =TRUE, sep = "\t")
# Merge any individual WALS and OTH files into a single WALS or OTH file if there were more than one.
OTH1 <- read.table(paste(WALS.folder, OTH.file.1, sep="/" ), header =TRUE, sep = "\t")
OTH2 <- read.table(paste(WALS.folder, OTH.file.2, sep="/" ), header =TRUE, sep = "\t")
OTH3 <- read.table(paste(WALS.folder, OTH.file.3, sep="/" ), header =TRUE, sep = "\t")
OTH4 <- read.table(paste(WALS.folder, OTH.file.4, sep="/" ), header =TRUE, sep = "\t")
OTH <- merge(OTH1,OTH2, all.y = TRUE, all.x = TRUE)
OTH <- merge(OTH,OTH3, all.y = TRUE, all.x = TRUE)
OTH <- merge(OTH,OTH4, all.y = TRUE, all.x = TRUE)
# Add new columns to OTH for the BORE, LAT and LONG. Then use 'merge' as a vlookup to add the corresponding
# bore and location from the arc file. The WALS and OTH files are slightly different because the arc file has
# a different licence number added in.
OTH <- data.frame(OTH, BORE = "", LAT = "", LONG = "")
OTH$BORE <- arc$WORK_NO[match(OTH$LICENSE.APPROVAL, arc$LICENSE)]
OTH$LAT <- arc$POINT_X[match(OTH$LICENSE.APPROVAL, arc$LICENSE)]
OTH$LONG <- arc$POINT_Y[match(OTH$LICENSE.APPROVAL, arc$LICENSE)]
# The source script is truncated here; presumably the merged table is then
# exported to the product file defined above (an assumption), e.g.:
write.csv(OTH, paste(newWALS.folder, newWALS.file, sep="/"), row.names = FALSE)