The sharing economy is booming in many cities, thanks largely to advances in telecommunication technology. Data analysis plays a key role in this emerging business: a good demand-prediction model is the driving force for survival in the industry. This dataset provides the kind of information we need to build a solid regression model for predicting hourly bike demand in Seoul, South Korea, between 2017 and 2018. Based on this model, we hope that with updated data the organization can provide the city with a stable, timely supply of rental bikes going forward.
The dataset's holiday information was originally published by SOUTH KOREA PUBLIC HOLIDAYS. The dataset contains the following variables:
Variable | Description |
---|---|
Date | year-month-day |
Rented Bike Count | count of bikes rented at each hour |
Hour | hour of the day |
Temperature | Celsius |
Humidity | % |
Wind speed | m/s |
Visibility | 10 m |
Dew point temperature | Celsius |
Solar radiation | MJ/m2 |
Rainfall | mm |
Snowfall | cm |
Seasons | Winter, Spring, Summer, Autumn |
Holiday | Holiday / No Holiday |
Functioning Day | NoFunc (non-functional hours), Fun (functional hours) |
We plan to apply multiple linear regression and find the best model using several evaluation methods. We are particularly interested in which factors contribute most to the prediction, and we would like to compare them with the factors we expected before modeling.
#read the data, converting strings to factors
origin_set = read.csv("SeoulBikeData.csv", stringsAsFactors = TRUE)
#parse Date and order the Seasons factor chronologically
origin_set$Date = as.Date(origin_set$Date, format = "%d/%m/%Y")
origin_set$Seasons = factor(origin_set$Seasons, levels = c('Winter','Spring','Summer','Autumn'))
str(origin_set)
## 'data.frame': 8760 obs. of 14 variables:
## $ Date : Date, format: "2017-12-01" "2017-12-01" ...
## $ Rented.Bike.Count : int 254 204 173 107 78 100 181 460 930 490 ...
## $ Hour : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Temperature : num -5.2 -5.5 -6 -6.2 -6 -6.4 -6.6 -7.4 -7.6 -6.5 ...
## $ Humidity : int 37 38 39 40 36 37 35 38 37 27 ...
## $ Wind.speed : num 2.2 0.8 1 0.9 2.3 1.5 1.3 0.9 1.1 0.5 ...
## $ Visibility : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 1928 ...
## $ Dew.point.temperature: num -17.6 -17.6 -17.7 -17.6 -18.6 -18.7 -19.5 -19.3 -19.8 -22.4 ...
## $ Solar.Radiation : num 0 0 0 0 0 0 0 0 0.01 0.23 ...
## $ Rainfall : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Snowfall : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Seasons : Factor w/ 4 levels "Winter","Spring",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Holiday : Factor w/ 2 levels "Holiday","No Holiday": 2 2 2 2 2 2 2 2 2 2 ...
## $ Functioning.Day : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
We first read the dataset from the CSV file, converting the string variables to factors and parsing Date into the proper Date type.
Scanning the data, we notice that we need to trim and reshape it into a more meaningful dataframe.
First, we keep only the rows where Functioning.Day is Yes.
#keep functioning days only, then drop the now-constant column
origin_set = origin_set[origin_set$Functioning.Day == "Yes",]
origin_set$Functioning.Day = NULL
Second, the time-related variables deserve a closer look.
As the scatter plots show, Date and Seasons are highly correlated; the correlation is 0.97.
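For reference, a quick numeric check of that correlation (coercing both variables to numeric, as the correlation panel of the pairs plot does) might look like this:
#correlation between Date and Seasons, both coerced to numeric
cor(as.numeric(origin_set$Date), as.numeric(origin_set$Seasons))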
Comparing the Seasons and Date plots above, the shapes are almost identical, so we keep only one of the two; for simplicity, we keep Seasons. We also want to transform Date into something more informative, so from it we derive a factor variable Day (day of the week) and a factor variable Weekend. We then examine how the day of the week and the hour of the day affect bike demand.
Looking at the average bike count through the week in the bar chart above, Sundays have the lowest demand, while Fridays and Wednesdays are the top two days in terms of bike demand.
We notice that Hour should be treated as a categorical variable, so we look at how the hours affect bike demand. The second bar chart above shows higher demand around 8 a.m. during the morning rush and around 6 p.m. in the evening, so we create a new variable with 7 time slots that separate out the rush hours.
The bar charts above also show that weekdays have higher demand than weekends, so we include these factors in the modeling.
Based on what we discovered, we extracted the key variables and transformed them into a new dataframe, giving them shorter names that are easier to refer to in the model definitions.
Finally, after data cleaning and transformation, we have the final version of our dataframe.
## 'data.frame': 8465 obs. of 15 variables:
## $ count : int 254 204 173 107 78 100 181 460 930 490 ...
## $ hour : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ temp : num -5.2 -5.5 -6 -6.2 -6 -6.4 -6.6 -7.4 -7.6 -6.5 ...
## $ humid : int 37 38 39 40 36 37 35 38 37 27 ...
## $ windspeed : num 2.2 0.8 1 0.9 2.3 1.5 1.3 0.9 1.1 0.5 ...
## $ visibility: int 2000 2000 2000 2000 2000 2000 2000 2000 2000 1928 ...
## $ dewtemp : num -17.6 -17.6 -17.7 -17.6 -18.6 -18.7 -19.5 -19.3 -19.8 -22.4 ...
## $ solar : num 0 0 0 0 0 0 0 0 0.01 0.23 ...
## $ rainfall : num 0 0 0 0 0 0 0 0 0 0 ...
## $ snowfall : num 0 0 0 0 0 0 0 0 0 0 ...
## $ season : Factor w/ 4 levels "Winter","Spring",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : Factor w/ 2 levels "Holiday","No Holiday": 2 2 2 2 2 2 2 2 2 2 ...
## $ dayofweek : Factor w/ 7 levels "Monday","Tuesday",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ weekend : Factor w/ 2 levels "weekend","weekday": 2 2 2 2 2 2 2 2 2 2 ...
## $ timeslot : Factor w/ 7 levels "midnight","sunrise",..: 1 1 1 1 2 2 2 3 3 3 ...
Train-Test Split
The dataset is split into a train set (70%) and a test set (30%). The train set is used to build the models, while the test set is used to evaluate model accuracy.
# train-test split 70%/30%
set.seed(42)
train_size = floor(nrow(bikeDemand) * 0.7) #70% of rows, as an integer
bikeDemand_tr_idx = sample(nrow(bikeDemand), train_size)
bikeDemand_trn = bikeDemand[bikeDemand_tr_idx,]
bikeDemand_tst = bikeDemand[-bikeDemand_tr_idx,]
Variable Selection
Given that the dataset contains many weather-related variables, we want to check for potentially redundant variables that could cause collinearity in the models.
Looking at the correlation graph above, the dewtemp and temp variables have a correlation of 0.92.
We also see that 99% of the variability in dewtemp is explained by the other variables in the dataset, so we can exclude dewtemp from the modeling process.
#dewtemp as a linear combination of other variables
m = lm(dewtemp ~ . - count, data = bikeDemand_trn)
summary(m)$adj.r.squared
## [1] 0.9916
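An equivalent screen uses variance inflation factors. A minimal sketch, assuming the car package is available (it is not otherwise used in this report), restricted to the numeric weather variables to avoid aliased factor terms:
#VIF screen for collinearity among the weather variables;
#dewtemp's VIF should be very large, consistent with the 0.99 R^2 above
car::vif(lm(count ~ temp + humid + windspeed + visibility + dewtemp +
              solar + rainfall + snowfall, data = bikeDemand_trn))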
In this section we explore multiple modeling approaches with the aim of finding a model with good accuracy. We start with a simple additive model, and later introduce interactions and higher-order terms to test further model types.
We begin the search by building a simple additive model and a simple interaction model.
For the additive case, we fit the full model and use a BIC stepwise search to find the key predictors.
#Simple Additive Model
fullModel = lm( count ~., data = bikeDemand_trn)
selectedBICStepModel = step(fullModel,
direction = 'both',
k = log(nrow(bikeDemand_trn)),
trace = 0)
#Simple Interaction model
full_int_model = lm(formula = count ~ (hour +
temp +
humid +
solar +
rainfall +
season +
holiday +
dayofweek)^2,
data = bikeDemand_trn)
selectedIntModelBIC = step(full_int_model,
direction = "backward",
k = log(nrow(bikeDemand_trn)),
trace = 0)
summary(selectedBICStepModel)
##
## Call:
## lm(formula = count ~ hour + temp + humid + solar + rainfall +
## season + holiday + dayofweek, data = bikeDemand_trn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1376.8 -223.5 -12.1 203.0 1847.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 448.860 39.680 11.31 < 2e-16 ***
## hour1 -107.106 33.993 -3.15 0.0016 **
## hour2 -227.722 34.355 -6.63 3.7e-11 ***
## hour3 -316.504 34.136 -9.27 < 2e-16 ***
## hour4 -371.285 34.782 -10.67 < 2e-16 ***
## hour5 -366.556 34.861 -10.51 < 2e-16 ***
## hour6 -197.658 34.256 -5.77 8.3e-09 ***
## hour7 123.773 34.959 3.54 0.0004 ***
## hour8 501.048 34.414 14.56 < 2e-16 ***
## hour9 25.913 35.119 0.74 0.4606
## hour10 -201.637 36.764 -5.48 4.3e-08 ***
## hour11 -216.254 37.860 -5.71 1.2e-08 ***
## hour12 -174.463 39.557 -4.41 1.0e-05 ***
## hour13 -186.217 39.755 -4.68 2.9e-06 ***
## hour14 -190.902 39.324 -4.85 1.2e-06 ***
## hour15 -69.299 37.852 -1.83 0.0672 .
## hour16 30.290 36.588 0.83 0.4078
## hour17 311.678 35.396 8.81 < 2e-16 ***
## hour18 793.636 34.509 23.00 < 2e-16 ***
## hour19 526.864 34.449 15.29 < 2e-16 ***
## hour20 475.465 34.318 13.85 < 2e-16 ***
## hour21 433.596 34.580 12.54 < 2e-16 ***
## hour22 335.589 34.486 9.73 < 2e-16 ***
## hour23 89.888 34.197 2.63 0.0086 **
## temp 24.230 0.944 25.65 < 2e-16 ***
## humid -6.877 0.324 -21.24 < 2e-16 ***
## solar 69.366 11.769 5.89 4.0e-09 ***
## rainfall -62.552 4.726 -13.24 < 2e-16 ***
## seasonSpring 189.989 20.042 9.48 < 2e-16 ***
## seasonSummer 191.420 29.933 6.39 1.7e-10 ***
## seasonAutumn 360.776 20.700 17.43 < 2e-16 ***
## holidayNo Holiday 126.879 23.493 5.40 6.9e-08 ***
## dayofweekTuesday 24.005 18.497 1.30 0.1944
## dayofweekWednesday 47.978 18.248 2.63 0.0086 **
## dayofweekThursday 12.598 18.282 0.69 0.4908
## dayofweekFriday 32.229 18.272 1.76 0.0778 .
## dayofweekSaturday -25.934 18.296 -1.42 0.1564
## dayofweekSunday -85.465 18.301 -4.67 3.1e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 377 on 5887 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.664
## F-statistic: 317 on 37 and 5887 DF, p-value: <2e-16
The simple additive search resulted in a model with an Adjusted R2 of 0.664, and the simple interaction search resulted in a model with an Adjusted R2 of 0.8589.
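For completeness, the interaction model's fit can be read off the same way (value as reported above):
summary(selectedIntModelBIC)$adj.r.squared
## [1] 0.8589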
Aiming for a higher Adjusted R2, we reduce the set of variables and build interaction models with transformations and a higher-order temperature term.
Let's start with a model that takes the log() of the response variable and uses only a select list of variables.
For interaction terms, let's consider two-way interactions between the Temperature, Hour, Day of the Week, and Holiday variables.
log_interaction_model_full_2way = lm( log(count) ~ (poly(temp,3) +
hour +
dayofweek +
holiday)^2 +
log(rainfall+0.0001) +
log(snowfall + 0.0001),
data = bikeDemand_trn)
selected_log_interaction_model_2way = step(log_interaction_model_full_2way, direction = "backward", k =2, trace = 0)
summary(selected_log_interaction_model_2way)$adj.r.squared
## [1] 0.8126
Let's also use the same variables, but this time try three-way interactions instead of two-way.
log_interaction_model_full_3way = lm( log(count) ~ (poly(temp,3) +
hour +
dayofweek +
holiday)^3 +
log(rainfall+0.0001) +
log(snowfall + 0.0001),
data = bikeDemand_trn)
selected_log_interaction_model_3way = step(log_interaction_model_full_3way, direction = "backward", k =2, trace = 0)
summary(selected_log_interaction_model_3way)$adj.r.squared
## [1] 0.8161
Looking at the Adjusted R2 values, the three-way interaction does not suggest a significant lift. However, we can validate this properly by running an ANOVA test.
#anova test
res =anova(selected_log_interaction_model_2way, selected_log_interaction_model_3way)
res[2,"Pr(>F)"]
## [1] 1.058e-17
The ANOVA test results in an extremely small p-value, suggesting that the bigger model is better.
Now we refit the previous log model without the influential points (Cook's distance above 4/n) to see whether it makes a difference.
#calculating cooks distance
cooks = cooks.distance(selected_log_interaction_model_3way)
log_interaction_model_3way_cooks_full = lm( log(count) ~ (poly(temp,3) +
hour +
dayofweek +
holiday)^3+
log(rainfall+0.0001) +
log(snowfall + 0.0001),
data = bikeDemand_trn[cooks< 4/ nrow(bikeDemand_trn),])
selected_log_interaction_model_3way_cooks = step(log_interaction_model_3way_cooks_full, direction = "backward", k =2, trace = 0)
summary(selected_log_interaction_model_3way_cooks)$adj.r.squared
## [1] 0.9272
That gives a good lift to our Adjusted R2 value.
Following a similar approach to the log transformation, let's apply a Box-Cox transformation to the response.
interaction_full = lm( count ~ (poly(temp,3) +
hour +
dayofweek +
holiday)^3+
log(rainfall+0.0001) +
log(snowfall + 0.0001),
data = bikeDemand_trn)
selected_interaction_full = step(interaction_full, direction = "backward", k =2, trace = 0)
#box cox tranformation
bc = MASS::boxcox(selected_interaction_full)
lambda = bc$x[which.max(bc$y)]
We will use the resulting lambda = 0.3434 for the Box-Cox transformation.
#applying box cox transformation to the previously selected model
selected_interaction_boxcox = lm( paste("(count^lambda - 1)/lambda ~",
as.character(formula(selected_interaction_full))[3],
sep = ""),
data = bikeDemand_trn)
summary(selected_interaction_boxcox)$adj.r.squared
## [1] 0.8467
We also test whether removing outliers (influential points) from the model achieves an even higher Adjusted R2.
#identifying outliers
cooks = cooks.distance(selected_interaction_boxcox)
selected_interaction_boxcox_no_outliers = lm( paste("((count^lambda) - 1)/lambda ~",
as.character(formula(selected_interaction_boxcox))[3],
sep = ""),
data = bikeDemand_trn[cooks< 4/ nrow(bikeDemand_trn),])
summary(selected_interaction_boxcox_no_outliers)$adj.r.squared
## [1] 0.9258
Here we take a look at a summary of all the models built above.
The table below shows the results of the normality test (Shapiro-Wilk) and the constant variance test (Breusch-Pagan) for each model.
Model | Shapiro Test p-value | BP Test p-value |
---|---|---|
Simple Additive Model | 1.4097 × 10^{-89} | < 10^{-324} |
Simple Interaction Model | 4.0477 × 10^{-85} | 6.364 × 10^{-170} |
Log Transformed Model | 7.5458 × 10^{-83} | 6.3873 × 10^{-49} |
Log Transformed Model (Excludes Influential Points) | 8.7446 × 10^{-86} | 1.9685 × 10^{-98} |
Box-Cox Transformed Model | 7.5458 × 10^{-83} | 5.1157 × 10^{-60} |
Box-Cox Transformed Model (Excludes Influential Points) | 4.4046 × 10^{-85} | 1.6789 × 10^{-101} |
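These p-values come from the helper functions listed in the appendix; for example, for the simple additive model:
#assumption checks using the appendix helpers
check_normality(selectedBICStepModel)         #mean Shapiro-Wilk p-value over residual samples
check_constant_variance(selectedBICStepModel) #Breusch-Pagan p-value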
The test sample was used to evaluate model performance; the observed results are included in the table below.
Model | Adjusted R2 | Test RMSE | Test MAE |
---|---|---|---|
Simple Additive Model | 0.664 | 361.5688 | 274.6186 |
Simple Interaction Model | 0.8589 | 288.8499 | 176.5654 |
Log Transformed Model | 0.8161 | 259.2108 | 163.7149 |
Log Transformed Model (Excludes Influential Points) | 0.9272 | 245.4718 | 147.7707 |
Box-Cox Transformed Model | 0.8467 | 244.7013 | 155.3177 |
Box-Cox Transformed Model (Excludes Influential Points) | 0.9258 | 245.6812 | 148.3663 |
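These metrics were computed on bikeDemand_tst with the appendix helpers; for example, for the log model without influential points (values as reported in the table):
#test-set metrics; is_dependent_log = TRUE back-transforms predictions with exp()
test_rmse(selected_log_interaction_model_3way_cooks, bikeDemand_tst, is_dependent_log = TRUE)
## [1] 245.4718
test_mae(selected_log_interaction_model_3way_cooks, bikeDemand_tst, is_dependent_log = TRUE)
## [1] 147.7707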
We select the final models based on their performance on the test set, using RMSE and MAE (mean absolute error).
log(count)~poly(temp, 3) + hour + dayofweek + holiday + log(rainfall + 0.0001) + log(snowfall + 0.0001) + poly(temp, 3):hour + poly(temp, 3):dayofweek + poly(temp, 3):holiday + hour:dayofweek + hour:holiday + dayofweek:holiday + poly(temp, 3):dayofweek:holiday
((count^lambda) - 1)/lambda~poly(temp, 3) + hour + dayofweek + holiday + log(rainfall + 0.0001) + log(snowfall + 0.0001) + poly(temp, 3):hour + poly(temp, 3):dayofweek + poly(temp, 3):holiday + hour:dayofweek + hour:holiday + dayofweek:holiday + poly(temp, 3):dayofweek:holiday
where lambda is 0.3434.
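Note that predictions from the Box-Cox model are on the transformed scale; to evaluate them against actual counts, they must be inverted. A minimal sketch (our own helper, not part of the appendix code):
#invert the Box-Cox transform: z = (y^lambda - 1)/lambda  =>  y = (lambda*z + 1)^(1/lambda)
inv_boxcox = function(z, lambda) (lambda * z + 1)^(1 / lambda)
pred_count = inv_boxcox(predict(selected_interaction_boxcox_no_outliers,
                                newdata = bikeDemand_tst), lambda)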
Here we show the diagnostic plots for the two best models according to test RMSE.
Normal QQ Plot
Normal QQ plots of the model residuals suggest a violation of the normality assumption.
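These plots can be reproduced with base graphics, e.g. for the log model without influential points:
#normal QQ plot of the residuals
qqnorm(resid(selected_log_interaction_model_3way_cooks), col = 'cadetblue')
qqline(resid(selected_log_interaction_model_3way_cooks), lwd = 2)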
Residual vs. Fitted Value Plots
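Similarly, a residuals-vs-fitted plot for the same model:
#residuals vs. fitted values
plot(fitted(selected_log_interaction_model_3way_cooks),
     resid(selected_log_interaction_model_3way_cooks),
     col = 'cadetblue', xlab = 'Fitted values', ylab = 'Residuals',
     main = 'Residuals vs. Fitted')
abline(h = 0, lwd = 2)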
During our model exploration process, we built three types of models.
The log- and Box-Cox-transformed models yielded higher Adjusted R2 values (over 90% once influential points were excluded) and performed better on the test sample.
For both transformation types, excluding the influential points improved performance on the test sample.
Our best models are given below.
Both of our final models contain the following variables and interactions. The models clearly capture the key variables one would expect to influence a person's decision to ride a bicycle.
All of the models tested failed the normality and constant variance tests. However, the residual plots seem to show sufficiently constant variance given the real-life nature of the dataset used in this project.
Alternative approaches
However, those types of models were not in scope for our exploration.
Usefulness of the Model
#function to conduct the constant variance (Breusch-Pagan) test; requires lmtest
check_constant_variance = function(model){
x = lmtest::bptest(model)
x$p.value[["BP"]]
}
#function to conduct the normality test on the model residuals
check_normality = function(model, random_samples = 5){
probs = rep(0, random_samples)
for(i in 1:random_samples){
#shapiro.test() accepts at most 5000 points, so test random samples of the residuals
probs[i] = shapiro.test(sample(resid(model), 5000))$p.value
}
mean(probs) #average p-value over the random samples
}
#function calculate test rmse
test_rmse = function(model, testdataset, is_dependent_log = FALSE){
if(is_dependent_log){
pred = exp(predict(model, newdata = testdataset))
}else{
pred = predict(model, newdata = testdataset)
}
rmse = sqrt(mean((pred - testdataset$count)^2))
rmse
}
#function to calculate loocv_rmse
calc_loocv_rmse = function(model) {
sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
#function to calculate test MAE (uses MLmetrics)
test_mae = function(model, testdataset, is_dependent_log = FALSE){
if(is_dependent_log){
pred = exp(predict(model, newdata = testdataset))
}else{
pred = predict(model, newdata = testdataset)
}
mae = MLmetrics::MAE(pred, testdataset$count)
mae
}
#code for plots and data cleaning
panel.cor = function(x, y){
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- round(cor(x, y), digits=2)
txt <- paste0("R = ", r)
#cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = 1)
}
upper.panel<-function(x, y){
points(x,y, pch = 19, col = 'cadetblue')
}
datetime_variables = cbind(origin_set$Rented.Bike.Count, origin_set$Date,origin_set$Hour,origin_set$Seasons)
colnames(datetime_variables) = c("Rented.Bike.Count","Date", "Hour", "Seasons")
pairs(datetime_variables, lower.panel = panel.cor, upper.panel = upper.panel)
par(mfrow=c(1,2))
a = origin_set %>%
group_by(Seasons) %>%
summarise(sum_count = mean(Rented.Bike.Count))
b = origin_set %>%
group_by(Date) %>%
summarise(sum_count = mean(Rented.Bike.Count))
barplot(a$sum_count,names.arg=a$Seasons,xlab="Seasons",ylab="Count",col="cadetblue",
main="Seasons vs # of Rented Bikes",space=5)
barplot(b$sum_count,names.arg=b$Date,xlab="Date",ylab="Count",col="cadetblue",
main="Dates vs # of Rented Bikes")
#Extracting Day of the week from date and categorizing into Weekend and Weekday
origin_set$Day = format(origin_set$Date, '%A')
weekdays <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday')
origin_set$Weekend = factor((origin_set$Day %in% weekdays),levels = c(FALSE, TRUE),labels = c('weekend', 'weekday'))
origin_set$Day = factor(origin_set$Day, levels = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))
#removing date
origin_set$Date = NULL
dayofweek_summary = aggregate(list(origin_set$Rented.Bike.Count), by = list(origin_set$Day), FUN = 'mean')
names(dayofweek_summary) = c("DayofWeek","AvgRentedBikeCount")
g0 = ggplot(data=dayofweek_summary, aes(x=DayofWeek, y=round(AvgRentedBikeCount, digits=0))) +
geom_bar(stat="identity", fill="cadetblue",width=0.4)+ xlab("Day of Week") + ylab("Avg. Rented Bike Count")+labs(title = "Avg. Rented Bike Count by Day of Week")+
geom_text(aes(label=round(AvgRentedBikeCount,digits=0)), vjust=-0.3, size=3.5)+ylim(0,900)+
theme_minimal()
# 1 . Binning hours to time period
# 2. Creating a factor variable for hour
origin_set$Hour <- as.numeric(origin_set$Hour)
breaks = c(0,4,7,10,13,17,20,24)
hours_label = c('midnight', 'sunrise','rushmorning','forenoon','afternoon','rusheven','evening')
group_tags <- cut(origin_set$Hour,
breaks=breaks,
include.lowest=TRUE,
right=FALSE,
labels=hours_label)
origin_set$timeslot = group_tags
origin_set$Hour = factor(origin_set$Hour)
hourly_summary = aggregate(list(origin_set$Rented.Bike.Count), by = list(origin_set$Hour), FUN = 'mean')
names(hourly_summary) = c("Hour","AvgRentedBikeCount")
g1 = ggplot(data=hourly_summary, aes(x=Hour, y=round(AvgRentedBikeCount, digits=0))) +
geom_bar(stat="identity", fill="cadetblue")+ xlab("Hour") + ylab("Avg. Rented Bike Count")+labs(title = "Avg. Rented Bike Count by Hour")+
geom_text(aes(label=round(AvgRentedBikeCount,digits=0)), vjust=-0.3, size=3.5)+
theme(axis.text.x = element_text(size = 8))+
theme_minimal()
g0 + g1 #side-by-side layout via the patchwork package
c = origin_set %>%
group_by(timeslot) %>%
summarise(sum_count = mean(Rented.Bike.Count))
g2 = ggplot(data=c, aes(x=timeslot, y=round(sum_count, digits=0))) +
geom_bar(stat="identity", fill="cadetblue",width=0.4)+ xlab("Time Slot") + ylab("Avg. Rented Bike Count")+labs(title = "Avg. Rented Bike Count by Timeslot")+
geom_text(aes(label=round(sum_count,digits=0)), vjust=-0.3, size=3.5)+ ylim(0,1500)+
theme(axis.text.x = element_text(size = 8))+
theme_minimal()
d = origin_set %>%
group_by(Weekend) %>%
summarise(sum_count = mean(Rented.Bike.Count))
g3 =ggplot(data=d, aes(x=Weekend, y=round(sum_count, digits=0))) +
geom_bar(stat="identity", fill="cadetblue",width=0.4)+ xlab("Time Slot") + ylab("Avg. Rented Bike Count")+labs(title = "Avg. Rented Bike Count by Weekday vs. Weekend")+
geom_text(aes(label=round(sum_count,digits=0)), vjust=-0.3, size=3.5)+ ylim(0,800)+
theme(axis.text.x = element_text(size = 8))+
theme_minimal()
g2+g3
# Check Missing Data
bikeDemand = origin_set
sum(is.na(bikeDemand))==0 # pass
#rename variables
colnames(bikeDemand) = c("count" ,"hour","temp", "humid", "windspeed",
"visibility", "dewtemp", "solar","rainfall","snowfall",
"season","holiday","dayofweek","weekend","timeslot")
panel.cor = function(x, y){
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- round(cor(x, y), digits=2)
txt <- paste0("R = ", r)
#cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = 1)
}
upper.panel<-function(x, y){
points(x,y, pch = 19, col = 'cadetblue',cex = .5)
}
weather_variables = bikeDemand_trn[,c("count","temp","humid","windspeed","visibility","dewtemp","solar","rainfall","snowfall")]
#colnames(datetime_variables) = c("Rented.Bike.Count","Date", "Hour", "Seasons")
pairs(weather_variables, lower.panel = panel.cor, upper.panel = upper.panel,cex.labels=2)