The sharing economy is booming in many cities, thanks largely to advances in telecommunication technology. Data analysis plays a key role in this emerging business: a good demand-prediction model is the driving force for survival in the industry. This dataset provides the kind of information we need to build a solid regression model for predicting hourly bike demand in Seoul, South Korea, between 2017 and 2018. Based on this model, we hope that with updated data the organization can provide the city with a stable, timely supply of rental bikes going forward.
The dataset's holiday information was originally published by SOUTH KOREA PUBLIC HOLIDAYS. The dataset contains the following variables:
Variable | Description |
---|---|
Date | year-month-day |
Rented Bike Count | count of bikes rented at each hour |
Hour | hour of the day |
Temperature | Celsius |
Humidity | % |
Wind speed | m/s |
Visibility | 10 m |
Dew point temperature | Celsius |
Solar radiation | MJ/m2 |
Rainfall | mm |
Snowfall | cm |
Seasons | Winter, Spring, Summer, Autumn |
Holiday | Holiday / No Holiday |
Functioning Day | NoFunc (non-functional hours), Fun (functional hours) |
We plan to apply multiple linear regression and find the best model using several evaluation methods. We are particularly interested in which factors contribute most to the prediction, and we would like to compare them with the factors we expected before modeling.
#read the data, converting strings to factors
origin_set = read.csv("SeoulBikeData.csv", stringsAsFactors = TRUE)
#parse Date and order the Seasons factor chronologically
origin_set$Date = as.Date(origin_set$Date, format = "%d/%m/%Y")
origin_set$Seasons = factor(origin_set$Seasons, levels = c('Winter','Spring','Summer','Autumn'))
str(origin_set)
## 'data.frame': 8760 obs. of 14 variables:
## $ Date : Date, format: "2017-12-01" "2017-12-01" ...
## $ Rented.Bike.Count : int 254 204 173 107 78 100 181 460 930 490 ...
## $ Hour : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Temperature : num -5.2 -5.5 -6 -6.2 -6 -6.4 -6.6 -7.4 -7.6 -6.5 ...
## $ Humidity : int 37 38 39 40 36 37 35 38 37 27 ...
## $ Wind.speed : num 2.2 0.8 1 0.9 2.3 1.5 1.3 0.9 1.1 0.5 ...
## $ Visibility : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 1928 ...
## $ Dew.point.temperature: num -17.6 -17.6 -17.7 -17.6 -18.6 -18.7 -19.5 -19.3 -19.8 -22.4 ...
## $ Solar.Radiation : num 0 0 0 0 0 0 0 0 0.01 0.23 ...
## $ Rainfall : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Snowfall : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Seasons : Factor w/ 4 levels "Winter","Spring",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Holiday : Factor w/ 2 levels "Holiday","No Holiday": 2 2 2 2 2 2 2 2 2 2 ...
## $ Functioning.Day : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
We first read the dataset from the CSV file, converting the string variables to factors and parsing Date into the proper Date type.
Scanning the data, we notice that we need to trim and reshape it into a more meaningful dataframe.
First, we keep only the rows where Functioning.Day is Yes.
#keep functioning days only, then drop the now-constant column
origin_set = origin_set[origin_set$Functioning.Day == "Yes",]
origin_set$Functioning.Day = NULL
Second, the time-related variables deserve a closer look.
As the scatter plots show, Date and Seasons are highly correlated; the correlation is 0.97.
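For reference, a quick numeric check of that correlation (coercing both variables to numeric, as the correlation panel of the pairs plot does) might look like this:
#correlation between Date and Seasons, both coerced to numeric
cor(as.numeric(origin_set$Date), as.numeric(origin_set$Seasons))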
Comparing the Seasons and Date plots above, the shapes are almost identical, so we keep only one of the two; for simplicity, we keep Seasons. We also want to transform Date into something more informative, so from it we derive a factor variable Day (day of the week) and a factor variable Weekend. We then examine how the day of the week and the hour of the day affect bike demand.
Looking at the average bike count through the week in the bar chart above, Sundays have the lowest demand, while Fridays and Wednesdays are the top two days in terms of bike demand.
We notice that Hour should be treated as a categorical variable, so we look at how the hours affect bike demand. The second bar chart above shows higher demand around 8 a.m. during the morning rush and around 6 p.m. in the evening, so we create a new variable with 7 time slots that separate out the rush hours.
The bar charts above also show that weekdays have higher demand than weekends, so we include these factors in the modeling.
Based on what we discovered, we extracted the key variables and transformed them into a new dataframe, giving them shorter names that are easier to refer to in the model definitions.
Finally, after data cleaning and transformation, we have the final version of our dataframe.
## 'data.frame': 8465 obs. of 15 variables:
## $ count : int 254 204 173 107 78 100 181 460 930 490 ...
## $ hour : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ temp : num -5.2 -5.5 -6 -6.2 -6 -6.4 -6.6 -7.4 -7.6 -6.5 ...
## $ humid : int 37 38 39 40 36 37 35 38 37 27 ...
## $ windspeed : num 2.2 0.8 1 0.9 2.3 1.5 1.3 0.9 1.1 0.5 ...
## $ visibility: int 2000 2000 2000 2000 2000 2000 2000 2000 2000 1928 ...
## $ dewtemp : num -17.6 -17.6 -17.7 -17.6 -18.6 -18.7 -19.5 -19.3 -19.8 -22.4 ...
## $ solar : num 0 0 0 0 0 0 0 0 0.01 0.23 ...
## $ rainfall : num 0 0 0 0 0 0 0 0 0 0 ...
## $ snowfall : num 0 0 0 0 0 0 0 0 0 0 ...
## $ season : Factor w/ 4 levels "Winter","Spring",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : Factor w/ 2 levels "Holiday","No Holiday": 2 2 2 2 2 2 2 2 2 2 ...
## $ dayofweek : Factor w/ 7 levels "Monday","Tuesday",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ weekend : Factor w/ 2 levels "weekend","weekday": 2 2 2 2 2 2 2 2 2 2 ...
## $ timeslot : Factor w/ 7 levels "midnight","sunrise",..: 1 1 1 1 2 2 2 3 3 3 ...
Train-Test Split
The dataset is split into a train set (70%) and a test set (30%). The train set is used to build the models, while the test set is used to evaluate model accuracy.
# train-test split 70%/30%
set.seed(42)
train_size = floor(nrow(bikeDemand) * 0.7) #70% of rows, as an integer
bikeDemand_tr_idx = sample(nrow(bikeDemand), train_size)
bikeDemand_trn = bikeDemand[bikeDemand_tr_idx,]
bikeDemand_tst = bikeDemand[-bikeDemand_tr_idx,]
Variable Selection
Given that the dataset contains many weather-related variables, we want to check for potentially redundant variables that could cause collinearity in the models.
Looking at the correlation graph above, the dewtemp and temp variables have a correlation of 0.92.
We also see that 99% of the variability in dewtemp is explained by the other variables in the dataset, so we can exclude dewtemp from the modeling process.
#dewtemp as a linear combination of other variables
m = lm(dewtemp ~ . - count, data = bikeDemand_trn)
summary(m)$adj.r.squared
## [1] 0.9916
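An equivalent screen uses variance inflation factors. A minimal sketch, assuming the car package is available (it is not otherwise used in this report), restricted to the numeric weather variables to avoid aliased factor terms:
#VIF screen for collinearity among the weather variables;
#dewtemp's VIF should be very large, consistent with the 0.99 R^2 above
car::vif(lm(count ~ temp + humid + windspeed + visibility + dewtemp +
              solar + rainfall + snowfall, data = bikeDemand_trn))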
In this section we explore multiple modeling approaches with the aim of finding a model with good accuracy. We start with a simple additive model, and later introduce interactions and higher-order terms to test further model types.
We begin the search by building a simple additive model and a simple interaction model.
For the additive case, we fit the full model and use a BIC stepwise search to find the key predictors.
#Simple Additive Model
fullModel = lm( count ~., data = bikeDemand_trn)
selectedBICStepModel = step(fullModel,
direction = 'both',
k = log(nrow(bikeDemand_trn)),
trace = 0)
#Simple Interaction model
full_int_model = lm(formula = count ~ (hour +
temp +
humid +
solar +
rainfall +
season +
holiday +
dayofweek)^2,
data = bikeDemand_trn)
selectedIntModelBIC = step(full_int_model,
direction = "backward",
k = log(nrow(bikeDemand_trn)),
trace = 0)
summary(selectedBICStepModel)
##
## Call:
## lm(formula = count ~ hour + temp + humid + solar + rainfall +
## season + holiday + dayofweek, data = bikeDemand_trn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1376.8 -223.5 -12.1 203.0 1847.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 448.860 39.680 11.31 < 2e-16 ***
## hour1 -107.106 33.993 -3.15 0.0016 **
## hour2 -227.722 34.355 -6.63 3.7e-11 ***
## hour3 -316.504 34.136 -9.27 < 2e-16 ***
## hour4 -371.285 34.782 -10.67 < 2e-16 ***
## hour5 -366.556 34.861 -10.51 < 2e-16 ***
## hour6 -197.658 34.256 -5.77 8.3e-09 ***
## hour7 123.773 34.959 3.54 0.0004 ***
## hour8 501.048 34.414 14.56 < 2e-16 ***
## hour9 25.913 35.119 0.74 0.4606
## hour10 -201.637 36.764 -5.48 4.3e-08 ***
## hour11 -216.254 37.860 -5.71 1.2e-08 ***
## hour12 -174.463 39.557 -4.41 1.0e-05 ***
## hour13 -186.217 39.755 -4.68 2.9e-06 ***
## hour14 -190.902 39.324 -4.85 1.2e-06 ***
## hour15 -69.299 37.852 -1.83 0.0672 .
## hour16 30.290 36.588 0.83 0.4078
## hour17 311.678 35.396 8.81 < 2e-16 ***
## hour18 793.636 34.509 23.00 < 2e-16 ***
## hour19 526.864 34.449 15.29 < 2e-16 ***
## hour20 475.465 34.318 13.85 < 2e-16 ***
## hour21 433.596 34.580 12.54 < 2e-16 ***
## hour22 335.589 34.486 9.73 < 2e-16 ***
## hour23 89.888 34.197 2.63 0.0086 **
## temp 24.230 0.944 25.65 < 2e-16 ***
## humid -6.877 0.324 -21.24 < 2e-16 ***
## solar 69.366 11.769 5.89 4.0e-09 ***
## rainfall -62.552 4.726 -13.24 < 2e-16 ***
## seasonSpring 189.989 20.042 9.48 < 2e-16 ***
## seasonSummer 191.420 29.933 6.39 1.7e-10 ***
## seasonAutumn 360.776 20.700 17.43 < 2e-16 ***
## holidayNo Holiday 126.879 23.493 5.40 6.9e-08 ***
## dayofweekTuesday 24.005 18.497 1.30 0.1944
## dayofweekWednesday 47.978 18.248 2.63 0.0086 **
## dayofweekThursday 12.598 18.282 0.69 0.4908
## dayofweekFriday 32.229 18.272 1.76 0.0778 .
## dayofweekSaturday -25.934 18.296 -1.42 0.1564
## dayofweekSunday -85.465 18.301 -4.67 3.1e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 377 on 5887 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.664
## F-statistic: 317 on 37 and 5887 DF, p-value: <2e-16
The simple additive search resulted in a model with an Adjusted R2 of 0.664, and the simple interaction search resulted in a model with an Adjusted R2 of 0.8589.
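For completeness, the interaction model's fit can be read off the same way (value as reported above):
summary(selectedIntModelBIC)$adj.r.squared
## [1] 0.8589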
Aiming for a higher Adjusted R2, we reduce the set of variables and build interaction models with transformations and a higher-order temperature term.
Let's start with a model that takes the log() of the response variable and uses only a select list of variables.
For interaction terms, let's consider two-way interactions between the Temperature, Hour, Day of the Week, and Holiday variables.
log_interaction_model_full_2way = lm( log(count) ~ (poly(temp,3) +
hour +
dayofweek +
holiday)^2 +
log(rainfall+0.0001) +
log(snowfall + 0.0001),
data = bikeDemand_trn)
selected_log_interaction_model_2way = step(log_interaction_model_full_2way, direction = "backward", k =2, trace = 0)
summary(selected_log_interaction_model_2way)$adj.r.squared
## [1] 0.8126
Let's also use the same variables, but this time try three-way interactions instead of two-way.
log_interaction_model_full_3way = lm( log(count) ~ (poly(temp,3) +
hour +
dayofweek +
holiday)^3 +
log(rainfall+0.0001) +
log(snowfall + 0.0001),
data = bikeDemand_trn)
selected_log_interaction_model_3way = step(log_interaction_model_full_3way, direction = "backward", k =2, trace = 0)
summary(selected_log_interaction_model_3way)$adj.r.squared
## [1] 0.8161
Looking at the Adjusted R2 values, the three-way interaction does not suggest a significant lift. However, we can validate this properly by running an ANOVA test.
#anova test
res =anova(selected_log_interaction_model_2way, selected_log_interaction_model_3way)
res[2,"Pr(>F)"]
## [1] 1.058e-17
The ANOVA test results in an extremely small p-value, suggesting that the bigger model is better.
Now we refit the previous log model without the influential points (Cook's distance above 4/n) to see whether it makes a difference.
#calculating cooks distance
cooks = cooks.distance(selected_log_interaction_model_3way)
log_interaction_model_3way_cooks_full = lm( log(count) ~ (poly(temp,3) +
hour +
dayofweek +
holiday)^3+
log(rainfall+0.0001) +
log(snowfall + 0.0001),
data = bikeDemand_trn[cooks< 4/ nrow(bikeDemand_trn),])
selected_log_interaction_model_3way_cooks = step(log_interaction_model_3way_cooks_full, direction = "backward", k =2, trace = 0)
summary(selected_log_interaction_model_3way_cooks)$adj.r.squared
## [1] 0.9272
That gives a good lift to our Adjusted R2 value.
Following a similar approach to the log transformation, let's apply a Box-Cox transformation to the response.
interaction_full = lm( count ~ (poly(temp,3) +
hour +
dayofweek +
holiday)^3+
log(rainfall+0.0001) +
log(snowfall + 0.0001),
data = bikeDemand_trn)
selected_interaction_full = step(interaction_full, direction = "backward", k =2, trace = 0)
#box cox tranformation
bc = MASS::boxcox(selected_interaction_full)
lambda = bc$x[which.max(bc$y)]
We will use the resulting lambda = 0.3434 for the Box-Cox transformation.
#applying box cox transformation to the previously selected model
selected_interaction_boxcox = lm( paste("(count^lambda - 1)/lambda ~",
as.character(formula(selected_interaction_full))[3],
sep = ""),
data = bikeDemand_trn)
summary(selected_interaction_boxcox)$adj.r.squared
## [1] 0.8467
We also test whether removing outliers (influential points) from the model achieves an even higher Adjusted R2.
#identifying outliers
cooks = cooks.distance(selected_interaction_boxcox)
selected_interaction_boxcox_no_outliers = lm( paste("((count^lambda) - 1)/lambda ~",
as.character(formula(selected_interaction_boxcox))[3],
sep = ""),
data = bikeDemand_trn[cooks< 4/ nrow(bikeDemand_trn),])
summary(selected_interaction_boxcox_no_outliers)$adj.r.squared
## [1] 0.9258
Here we take a look at a summary of all the models built above.
The table below shows the results of the normality test (Shapiro-Wilk) and the constant variance test (Breusch-Pagan) for each model.
Model | Shapiro Test p-value | BP Test p-value |
---|---|---|
Simple Additive Model | 1.4097 × 10^{-89} | < 10^{-324} |
Simple Interaction Model | 4.0477 × 10^{-85} | 6.364 × 10^{-170} |
Log Transformed Model | 7.5458 × 10^{-83} | 6.3873 × 10^{-49} |
Log Transformed Model (Excludes Influential Points) | 8.7446 × 10^{-86} | 1.9685 × 10^{-98} |
Box-Cox Transformed Model | 7.5458 × 10^{-83} | 5.1157 × 10^{-60} |
Box-Cox Transformed Model (Excludes Influential Points) | 4.4046 × 10^{-85} | 1.6789 × 10^{-101} |
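These p-values come from the helper functions listed in the appendix; for example, for the simple additive model:
#assumption checks using the appendix helpers
check_normality(selectedBICStepModel)         #mean Shapiro-Wilk p-value over residual samples
check_constant_variance(selectedBICStepModel) #Breusch-Pagan p-value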
The test sample was used to evaluate model performance; the observed results are included in the table below.
Model | Adjusted R2 | Test RMSE | Test MAE |
---|---|---|---|
Simple Additive Model | 0.664 | 361.5688 | 274.6186 |
Simple Interaction Model | 0.8589 | 288.8499 | 176.5654 |
Log Transformed Model | 0.8161 | 259.2108 | 163.7149 |
Log Transformed Model (Excludes Influential Points) | 0.9272 | 245.4718 | 147.7707 |
Box-Cox Transformed Model | 0.8467 | 244.7013 | 155.3177 |
Box-Cox Transformed Model (Excludes Influential Points) | 0.9258 | 245.6812 | 148.3663 |
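These metrics were computed on bikeDemand_tst with the appendix helpers; for example, for the log model without influential points (values as reported in the table):
#test-set metrics; is_dependent_log = TRUE back-transforms predictions with exp()
test_rmse(selected_log_interaction_model_3way_cooks, bikeDemand_tst, is_dependent_log = TRUE)
## [1] 245.4718
test_mae(selected_log_interaction_model_3way_cooks, bikeDemand_tst, is_dependent_log = TRUE)
## [1] 147.7707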
We select the final models based on their performance on the test set, using RMSE and MAE (mean absolute error).
log(count)~poly(temp, 3) + hour + dayofweek + holiday + log(rainfall + 0.0001) + log(snowfall + 0.0001) + poly(temp, 3):hour + poly(temp, 3):dayofweek + poly(temp, 3):holiday + hour:dayofweek + hour:holiday + dayofweek:holiday + poly(temp, 3):dayofweek:holiday
((count^lambda) - 1)/lambda~poly(temp, 3) + hour + dayofweek + holiday + log(rainfall + 0.0001) + log(snowfall + 0.0001) + poly(temp, 3):hour + poly(temp, 3):dayofweek + poly(temp, 3):holiday + hour:dayofweek + hour:holiday + dayofweek:holiday + poly(temp, 3):dayofweek:holiday
where lambda is 0.3434.
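Note that predictions from the Box-Cox model are on the transformed scale; to evaluate them against actual counts, they must be inverted. A minimal sketch (our own helper, not part of the appendix code):
#invert the Box-Cox transform: z = (y^lambda - 1)/lambda  =>  y = (lambda*z + 1)^(1/lambda)
inv_boxcox = function(z, lambda) (lambda * z + 1)^(1 / lambda)
pred_count = inv_boxcox(predict(selected_interaction_boxcox_no_outliers,
                                newdata = bikeDemand_tst), lambda)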
Here we show the diagnostic plots for the two best models according to test RMSE.
Normal QQ Plot
Normal QQ plots of the model residuals suggest a violation of the normality assumption.
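These plots can be reproduced with base graphics, e.g. for the log model without influential points:
#normal QQ plot of the residuals
qqnorm(resid(selected_log_interaction_model_3way_cooks), col = 'cadetblue')
qqline(resid(selected_log_interaction_model_3way_cooks), lwd = 2)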
Residual vs. Fitted Value Plots
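Similarly, a residuals-vs-fitted plot for the same model:
#residuals vs. fitted values
plot(fitted(selected_log_interaction_model_3way_cooks),
     resid(selected_log_interaction_model_3way_cooks),
     col = 'cadetblue', xlab = 'Fitted values', ylab = 'Residuals',
     main = 'Residuals vs. Fitted')
abline(h = 0, lwd = 2)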
During our model exploration process, we built three types of models.
The log- and Box-Cox-transformed models yielded higher Adjusted R2 values (over 90% once influential points were excluded) and performed better on the test sample.
For both transformation types, excluding the influential points improved performance on the test sample.
Our best models are given below.
Both of our final models contain the following variables and interactions. The models clearly capture the key variables one would expect to influence a person's decision to ride a bicycle.
All of the models tested failed the normality and constant variance tests. However, the residual plots seem to show sufficiently constant variance given the real-life nature of the dataset used in this project.
Alternative approaches
However, those types of models were not in scope for our exploration.
Usefulness of the Model
#function to conduct the constant variance (Breusch-Pagan) test; requires lmtest
check_constant_variance = function(model){
x = lmtest::bptest(model)
x$p.value[["BP"]]
}
#function to conduct the normality test on the model residuals
check_normality = function(model, random_samples = 5){
probs = rep(0, random_samples)
for(i in 1:random_samples){
#shapiro.test() accepts at most 5000 points, so test random samples of the residuals
probs[i] = shapiro.test(sample(resid(model), 5000))$p.value
}
mean(probs) #average p-value over the random samples
}
#function calculate test rmse
test_rmse = function(model, testdataset, is_dependent_log = FALSE){
if(is_dependent_log){
pred = exp(predict(model, newdata = testdataset))
}else{
pred = predict(model, newdata = testdataset)
}
rmse = sqrt(mean((pred - testdataset$count)^2))
rmse
}
#function to calculate loocv_rmse
calc_loocv_rmse = function(model) {
sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
#function to calculate test MAE (uses MLmetrics)
test_mae = function(model, testdataset, is_dependent_log = FALSE){
if(is_dependent_log){
pred = exp(predict(model, newdata = testdataset))
}else{
pred = predict(model, newdata = testdataset)
}
mae = MLmetrics::MAE(pred, testdataset$count)
mae
}
#code for plots and data cleaning
panel.cor = function(x, y){
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- round(cor(x, y), digits=2)
txt <- paste0("R = ", r)
#cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = 1)
}
upper.panel<-function(x, y){
points(x,y, pch = 19, col = 'cadetblue')
}
datetime_variables = cbind(origin_set$Rented.Bike.Count, origin_set$Date,origin_set$Hour,origin_set$Seasons)
colnames(datetime_variables) = c("Rented.Bike.Count","Date", "Hour", "Seasons")
pairs(datetime_variables, lower.panel = panel.cor, upper.panel = upper.panel)
par(mfrow=c(1,2))
a = origin_set %>%
group_by(Seasons) %>%
summarise(sum_count = mean(Rented.Bike.Count))
b = origin_set %>%
group_by(Date) %>%
summarise(sum_count = mean(Rented.Bike.Count))
barplot(a$sum_count,names.arg=a$Seasons,xlab="Seasons",ylab="Count",col="cadetblue",
main="Seasons vs # of Rented Bikes",space=5)
barplot(b$sum_count,names.arg=b$Date,xlab="Date",ylab="Count",col="cadetblue",
main="Dates vs # of Rented Bikes")
#Extracting Day of the week from date and categorizing into Weekend and Weekday
origin_set$Day = format(origin_set$Date, '%A')
weekdays <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday')
origin_set$Weekend = factor((origin_set$Day %in% weekdays),levels = c(FALSE, TRUE),labels = c('weekend', 'weekday'))
origin_set$Day = factor(origin_set$Day, levels = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))
#removing date
origin_set$Date = NULL
dayofweek_summary = aggregate(list(origin_set$Rented.Bike.Count), by = list(origin_set$Day), FUN = 'mean')
names(dayofweek_summary) = c("DayofWeek","AvgRentedBikeCount")
g0 = ggplot(data=dayofweek_summary, aes(x=DayofWeek, y=round(AvgRentedBikeCount, digits=0))) +
geom_bar(stat="identity", fill="cadetblue",width=0.4)+ xlab("Day of Week") + ylab("Avg. Rented Bike Count")+labs(title = "Avg. Rented Bike Count by Day of Week")+
geom_text(aes(label=round(AvgRentedBikeCount,digits=0)), vjust=-0.3, size=3.5)+ylim(0,900)+
theme_minimal()
# 1 . Binning hours to time period
# 2. Creating a factor variable for hour
origin_set$Hour <- as.numeric(origin_set$Hour)
breaks = c(0,4,7,10,13,17,20,24)
hours_label = c('midnight', 'sunrise','rushmorning','forenoon','afternoon','rusheven','evening')
group_tags <- cut(origin_set$Hour,
breaks=breaks,
include.lowest=TRUE,
right=FALSE,
labels=hours_label)
origin_set$timeslot = group_tags
origin_set$Hour = factor(origin_set$Hour)
hourly_summary = aggregate(list(origin_set$Rented.Bike.Count), by = list(origin_set$Hour), FUN = 'mean')
names(hourly_summary) = c("Hour","AvgRentedBikeCount")
g1 = ggplot(data=hourly_summary, aes(x=Hour, y=round(AvgRentedBikeCount, digits=0))) +
geom_bar(stat="identity", fill="cadetblue")+ xlab("Hour") + ylab("Avg. Rented Bike Count")+labs(title = "Avg. Rented Bike Count by Hour")+
geom_text(aes(label=round(AvgRentedBikeCount,digits=0)), vjust=-0.3, size=3.5)+
theme(axis.text.x = element_text(size = 8))+
theme_minimal()
g0 + g1 #side-by-side layout via the patchwork package
c = origin_set %>%
group_by(timeslot) %>%
summarise(sum_count = mean(Rented.Bike.Count))
g2 = ggplot(data=c, aes(x=timeslot, y=round(sum_count, digits=0))) +
geom_bar(stat="identity", fill="cadetblue",width=0.4)+ xlab("Time Slot") + ylab("Avg. Rented Bike Count")+labs(title = "Avg. Rented Bike Count by Timeslot")+
geom_text(aes(label=round(sum_count,digits=0)), vjust=-0.3, size=3.5)+ ylim(0,1500)+
theme(axis.text.x = element_text(size = 8))+
theme_minimal()
d = origin_set %>%
group_by(Weekend) %>%
summarise(sum_count = mean(Rented.Bike.Count))
g3 =ggplot(data=d, aes(x=Weekend, y=round(sum_count, digits=0))) +
geom_bar(stat="identity", fill="cadetblue",width=0.4)+ xlab("Time Slot") + ylab("Avg. Rented Bike Count")+labs(title = "Avg. Rented Bike Count by Weekday vs. Weekend")+
geom_text(aes(label=round(sum_count,digits=0)), vjust=-0.3, size=3.5)+ ylim(0,800)+
theme(axis.text.x = element_text(size = 8))+
theme_minimal()
g2+g3
# Check Missing Data
bikeDemand = origin_set
sum(is.na(bikeDemand))==0 # pass
#rename variables
colnames(bikeDemand) = c("count" ,"hour","temp", "humid", "windspeed",
"visibility", "dewtemp", "solar","rainfall","snowfall",
"season","holiday","dayofweek","weekend","timeslot")
panel.cor = function(x, y){
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- round(cor(x, y), digits=2)
txt <- paste0("R = ", r)
#cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = 1)
}
upper.panel<-function(x, y){
points(x,y, pch = 19, col = 'cadetblue',cex = .5)
}
weather_variables = bikeDemand_trn[,c("count","temp","humid","windspeed","visibility","dewtemp","solar","rainfall","snowfall")]
#colnames(datetime_variables) = c("Rented.Bike.Count","Date", "Hour", "Seasons")
pairs(weather_variables, lower.panel = panel.cor, upper.panel = upper.panel,cex.labels=2)