Project Background

The majority of our users never generate any revenue (i.e., ‘convert’, in marketing parlance). We want to offer a sale to some of these ‘non-payers’, in hopes of generating incremental revenue. Unfortunately, it’s not easy to identify true non-payers because many payers don’t convert until several weeks after installing the game. If you simply set a cutoff date and offer sales to everyone who hasn’t yet converted, you might find that revenue actually falls. This can occur if large numbers of not-yet-converted future payers take advantage of the sale instead of paying full price. If you’re not lucky, you might fail to convert a sufficient number of true non-payers, while ‘cannibalizing’ future payers by providing more value than they actually require.

The goal of this assignment is to design a sale that maximizes incremental revenue. We don’t expect you to get it right on the first try - we don’t hold ourselves to such a high standard! Instead, we’ll ask you to design one or more offers, select a target group, and prescribe a test protocol that can be used to evaluate the results.

Questions

  1. Specify one or more offers. Justify your offer(s) based on the goal of maximizing incremental revenue. Please include details about how the offer should be presented, and on how many occasions.

  2. Specify a test and evaluation protocol. How long should the test run? How will you know if you have successfully increased revenue?
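A minimal sketch of how the evaluation in question 2 could look, assuming users are randomly assigned to an offer group and a control group (all variable names and the simulated revenue values below are our own, purely for illustration):

```r
# Sketch: estimate incremental revenue from a randomized holdout.
# Per-user revenues here are simulated; in practice they would come
# from summing iaps$rev per user over the test window.
set.seed(42)
control <- rpois(5000, lambda = 0.05) * 140   # hypothetical per-user revenue, cents
treated <- rpois(5000, lambda = 0.07) * 140

# Incremental revenue per user = difference in average revenue (ARPU)
incremental_arpu <- mean(treated) - mean(control)

# Welch t-test on per-user revenue; revenue is heavy-tailed, so a
# bootstrap or Mann-Whitney test may be preferable in practice.
tt <- t.test(treated, control)
c(incremental_arpu = incremental_arpu, p_value = tt$p.value)
```

The test succeeds if the offer group's ARPU exceeds the control's by a statistically and practically significant margin.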

  3. Build a model in R to help us identify users who are likely to convert on their own. This part can be done independently – it’s not necessary to fold the output of this model into your other analyses. Please assume, however, that we would eventually do this.

Dataset

· users: udid (user ID), install_date, lang, country, hw_ver (hardware version), os_ver (operating system version)

· sessions: udid (user ID), ts (timestamp), date, session_num, last_session_termination_type

· iaps: udid (user ID), ts (timestamp), date, prod_name (purchased product name), prod_type, rev (revenue)

· spendevents: udid, ts, date, story, chapter, spendtype, currency, amount

Note: Revenue is recorded in the ‘rev’ field and is measured in cents.
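Since rev is in cents, it is worth converting before reporting dollar figures; a one-line sketch (the sample values and the rev_usd name are our own):

```r
# rev is recorded in cents; divide by 100 to report dollars.
rev_cents <- c(70, 140, 7000)   # sample values seen in iaps$rev
rev_usd   <- rev_cents / 100
rev_usd                          # 0.7 1.4 70.0
```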

There are four tables: user information, in-app purchase records, spend events, and sessions, where the sessions table records each time a user logged into the game.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
#install.packages('lubridate') 
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
#rm(list = ls())

Load Data

1 Users Table

users = read.csv("dataset/users.csv")
str(users)
## 'data.frame':    22576 obs. of  6 variables:
##  $ udid        : chr  "88258a36ad1447c4b407bb782c9f7f39" "8f3d900d5d624e95afc95a1e5ffe49fb" "90167c9cf1b9406fb81401da39b800b9" "479dbe15a9b5421a903361931d5d579c" ...
##  $ install_date: chr  "3/1/2016" "3/1/2016" "3/1/2016" "3/1/2016" ...
##  $ lang        : chr  "th" "id" "en" "en" ...
##  $ country     : chr  "TH" "ID" "US" "GB" ...
##  $ hw_ver      : chr  "iPhone6,2" "iPhone6,1" "iPad5,1" "iPhone8,2" ...
##  $ os_ver      : chr  "9.2.1" "8" "9.2.1" "9.2.1" ...
names(users)
## [1] "udid"         "install_date" "lang"         "country"      "hw_ver"      
## [6] "os_ver"
head(users)
summary(users)
##      udid           install_date           lang             country         
##  Length:22576       Length:22576       Length:22576       Length:22576      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##     hw_ver             os_ver         
##  Length:22576       Length:22576      
##  Class :character   Class :character  
##  Mode  :character   Mode  :character
colSums(is.na(users)) # 5 NAs from countries
##         udid install_date         lang      country       hw_ver       os_ver 
##            0            0            0            5            0            0
prop.table(table(users$install_date))
## 
##  3/1/2016  3/2/2016  3/3/2016  3/4/2016  3/5/2016  3/6/2016  3/7/2016 
## 0.1136605 0.1152551 0.1087438 0.1337261 0.2029146 0.2022059 0.1234940
Users table summary 1

The data show that installs peaked on March 5th and March 6th, which fall on a weekend. This suggests users are more likely to install the game on weekends.
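The weekend claim can be checked directly from the calendar; a base-R sketch with the two peak dates hard-coded for illustration:

```r
# Verify that 2016-03-05 and 2016-03-06 fall on a weekend.
peak_dates <- as.Date(c("2016-03-05", "2016-03-06"))
# POSIXlt wday: 0 = Sunday, 6 = Saturday
as.POSIXlt(peak_dates)$wday   # 6 0 -> Saturday, Sunday
```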

2 Sessions Table

sessions = read.csv("dataset/sessions.csv")
str(sessions)
## 'data.frame':    722955 obs. of  5 variables:
##  $ udid                         : chr  "f19efdcec2f0414e9507c19da467af20" "f19efdcec2f0414e9507c19da467af20" "f19efdcec2f0414e9507c19da467af20" "f19efdcec2f0414e9507c19da467af20" ...
##  $ ts                           : chr  "3/1/16 00:13" "3/1/16 00:27" "3/1/16 00:29" "3/1/16 00:34" ...
##  $ date                         : chr  "3/1/2016" "3/1/2016" "3/1/2016" "3/1/2016" ...
##  $ session_num                  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ last_session_termination_type: logi  NA NA NA NA NA NA ...
names(sessions)
## [1] "udid"                          "ts"                           
## [3] "date"                          "session_num"                  
## [5] "last_session_termination_type"
#head(sessions)
summary(sessions)
##      udid                ts                date            session_num    
##  Length:722955      Length:722955      Length:722955      Min.   :   1.0  
##  Class :character   Class :character   Class :character   1st Qu.:  14.0  
##  Mode  :character   Mode  :character   Mode  :character   Median :  45.0  
##                                                           Mean   : 106.5  
##                                                           3rd Qu.: 124.0  
##                                                           Max.   :1919.0  
##  last_session_termination_type
##  Mode:logical                 
##  NA's:722955                  
##                               
##                               
##                               
## 
colSums(is.na(sessions)) # last_session_termination_type all NAs
##                          udid                            ts 
##                             0                             0 
##                          date                   session_num 
##                             0                             0 
## last_session_termination_type 
##                        722955
Session table conclusion 1

Every row in the last_session_termination_type column is NA, meaning the termination type was simply not recorded (or no session terminated abnormally). The mean of session_num is 106.5, but session_num is a per-user running counter, so this row-level average is pulled up by heavy players and should not be read as a typical login count. Even so, the overall session volume suggests the game saw heavy engagement during this week.
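Because session_num is a running counter, per-user summaries give a fairer picture than the raw column mean; a toy sketch (data invented to mirror the table's shape):

```r
# Toy data shaped like the sessions table: session_num is a per-user counter.
sessions_toy <- data.frame(
  udid        = c("a", "a", "a", "b"),
  session_num = c(1, 2, 3, 1)
)
# Sessions per user = max of the counter (equivalently, rows per udid)
per_user <- tapply(sessions_toy$session_num, sessions_toy$udid, max)
per_user           # a: 3, b: 1
median(per_user)   # 2 -- a fairer "typical sessions" figure than mean(session_num)
```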

3 Iaps Table

iaps <- read.csv("dataset/iaps.csv")
names(iaps)
## [1] "udid"      "ts"        "date"      "prod_name" "prod_type" "rev"
#head(iaps)
summary(iaps)
##      udid                ts                date            prod_name        
##  Length:6685        Length:6685        Length:6685        Length:6685       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   prod_type              rev        
##  Length:6685        Min.   :  70.0  
##  Class :character   1st Qu.: 140.0  
##  Mode  :character   Median : 140.0  
##                     Mean   : 216.2  
##                     3rd Qu.: 350.0  
##                     Max.   :7000.0
colSums(is.na(iaps))
##      udid        ts      date prod_name prod_type       rev 
##         0         0         0         0         0         0

Iaps table conclusion 1

The iaps table is very important, as it records all purchase information. Data quality in this table is impressive: there are no NAs.

4 Spendevent Table

spendevents <- read.csv("dataset/spendevents.csv")
names(spendevents)
## [1] "udid"      "ts"        "date"      "story"     "chapter"   "spendtype"
## [7] "currency"  "amount"
# "udid","ts","date","story","chapter","spendtype","currency","amount"
#head(spendevents) # each chapter has many stories, when all stories are done, the chapter is over
str(spendevents)
## 'data.frame':    107764 obs. of  8 variables:
##  $ udid     : chr  "319d57bc3d9445c78fbeda4772ecbc9b" "988291d1f38d4b0e97c2c86fb4d0122f" "1e7a9cbe6fb44ecd93c3b47e64f4c377" "309d0946c01a46e88a9038da40145a9e" ...
##  $ ts       : chr  "3/1/16 0:46" "3/1/16 3:43" "3/1/16 4:19" "3/1/16 5:29" ...
##  $ date     : chr  "3/1/2016" "3/1/2016" "3/1/2016" "3/1/2016" ...
##  $ story    : chr  "Mean_Girls_Version_D" "Mean_Girls_Version_D" "Mean_Girls_Version_D" "Demi_Master" ...
##  $ chapter  : int  0 0 0 0 0 0 0 2 0 0 ...
##  $ spendtype: chr  "earnGemsCounter" "earnGemsCounter" "earnGemsCounter" "earnGemsCounter" ...
##  $ currency : chr  "gems" "gems" "gems" "gems" ...
##  $ amount   : int  -5 -5 -5 -5 -5 -5 -5 -1 -5 -5 ...
summary(spendevents)
##      udid                ts                date              story          
##  Length:107764      Length:107764      Length:107764      Length:107764     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     chapter        spendtype           currency             amount         
##  Min.   : 0.000   Length:107764      Length:107764      Min.   :-999999.0  
##  1st Qu.: 0.000   Class :character   Class :character   1st Qu.:     -5.0  
##  Median : 0.000   Mode  :character   Mode  :character   Median :      0.0  
##  Mean   : 3.018                                         Mean   :   -140.5  
##  3rd Qu.: 4.000                                         3rd Qu.:      0.0  
##  Max.   :89.000                                         Max.   :    165.0
# a negative amount means currency spent; a positive amount means currency earned (daily activities, special events, etc.)
# amount ranges from -999999 (likely a recording error) to 165; earning appears harder than spending
# all spend events use gems as the currency
colSums(is.na(spendevents)) 
##      udid        ts      date     story   chapter spendtype  currency    amount 
##         0         0         0         0         0         0         0         0
Spendevent conclusion 1

1. A negative amount means the user spent currency; a positive amount means currency earned from daily activities, special events, etc.

2. All spend events use gems as the currency.

3. -999999 is likely a data-recording error.

4. Earning gems appears to be harder than spending them.
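Since -999999 looks like a recording error, excluding it before aggregating avoids distorting spend statistics; a toy sketch with invented values:

```r
# -999999 appears to be a sentinel/recording error in spendevents$amount.
amount <- c(-5, -5, -1, 165, -999999)
clean  <- amount[amount != -999999]
mean(amount)   # -199969 -- dominated by the sentinel
mean(clean)    # 38.5   -- (-5 - 5 - 1 + 165) / 4
```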

Data Processing

1 Remove NAs

users = users[complete.cases(users),]
colSums(is.na(users))
##         udid install_date         lang      country       hw_ver       os_ver 
##            0            0            0            0            0            0

2 Rename the ts and date columns

iaps = rename(iaps,iaps_ts = ts, iaps_date = date)
sessions = rename(sessions,sessions_ts = ts, sessions_date = date)
spendevents = rename(spendevents, spend_ts = ts, spend_date = date)
#head(iaps)
#head(sessions)
#head(spendevents)

3 Format Date and Time

#str(users)
users$install_date = as.Date(users$install_date,format = '%m/%d/%Y')
#head(users)
str(iaps)
## 'data.frame':    6685 obs. of  6 variables:
##  $ udid     : chr  "a737cacac9b9432180f7106c248dc9dc" "a737cacac9b9432180f7106c248dc9dc" "a737cacac9b9432180f7106c248dc9dc" "a737cacac9b9432180f7106c248dc9dc" ...
##  $ iaps_ts  : chr  "3/1/16 9:35" "3/1/16 9:51" "3/1/16 9:56" "3/1/16 11:11" ...
##  $ iaps_date: chr  "3/1/2016" "3/1/2016" "3/1/2016" "3/1/2016" ...
##  $ prod_name: chr  "iap_1_gems_2" "iap_1_gems_2" "iap_1_gems_2" "iap_1_gems_2" ...
##  $ prod_type: chr  "gems" "gems" "gems" "gems" ...
##  $ rev      : int  140 140 140 140 70 140 1400 70 350 70 ...
iaps$iaps_ts = as.POSIXct(iaps$iaps_ts, format = '%m/%d/%y %H:%M') # %y
iaps$iaps_date = as.Date(iaps$iaps_date, format = '%m/%d/%Y')
#head(iaps)
str(sessions)
## 'data.frame':    722955 obs. of  5 variables:
##  $ udid                         : chr  "f19efdcec2f0414e9507c19da467af20" "f19efdcec2f0414e9507c19da467af20" "f19efdcec2f0414e9507c19da467af20" "f19efdcec2f0414e9507c19da467af20" ...
##  $ sessions_ts                  : chr  "3/1/16 00:13" "3/1/16 00:27" "3/1/16 00:29" "3/1/16 00:34" ...
##  $ sessions_date                : chr  "3/1/2016" "3/1/2016" "3/1/2016" "3/1/2016" ...
##  $ session_num                  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ last_session_termination_type: logi  NA NA NA NA NA NA ...
sessions$sessions_ts = as.POSIXct(sessions$sessions_ts, format = '%m/%d/%y %H:%M')
sessions$sessions_date = as.Date(sessions$sessions_date, format = '%m/%d/%Y')
#head(sessions)
str(spendevents)
## 'data.frame':    107764 obs. of  8 variables:
##  $ udid      : chr  "319d57bc3d9445c78fbeda4772ecbc9b" "988291d1f38d4b0e97c2c86fb4d0122f" "1e7a9cbe6fb44ecd93c3b47e64f4c377" "309d0946c01a46e88a9038da40145a9e" ...
##  $ spend_ts  : chr  "3/1/16 0:46" "3/1/16 3:43" "3/1/16 4:19" "3/1/16 5:29" ...
##  $ spend_date: chr  "3/1/2016" "3/1/2016" "3/1/2016" "3/1/2016" ...
##  $ story     : chr  "Mean_Girls_Version_D" "Mean_Girls_Version_D" "Mean_Girls_Version_D" "Demi_Master" ...
##  $ chapter   : int  0 0 0 0 0 0 0 2 0 0 ...
##  $ spendtype : chr  "earnGemsCounter" "earnGemsCounter" "earnGemsCounter" "earnGemsCounter" ...
##  $ currency  : chr  "gems" "gems" "gems" "gems" ...
##  $ amount    : int  -5 -5 -5 -5 -5 -5 -5 -1 -5 -5 ...
spendevents$spend_ts = as.POSIXct(spendevents$spend_ts, format = '%m/%d/%y %H:%M')
spendevents$spend_date = as.Date(spendevents$spend_date, format = '%m/%d/%Y')
#head(spendevents)
Add a story_chapter column to track player progress more precisely
spendevents$story_chapter <- paste(spendevents$story,spendevents$chapter,sep = "_")
#head(spendevents)
Converted Users

Users who appear in iaps are the converted users. We create an indicator column in users: if a udid appears in iaps, the user is converted; otherwise the user is a non-payer.

purchased_user <- unique(iaps$udid)
users$converted <- as.numeric(users$udid %in% purchased_user)
converted_users <- users %>% filter(converted == 1)
non_payers <- users %>% filter(converted == 0)%>% select(udid)
#write.csv(users, file="dataset/final/final_user.csv")

User Exploration

Q1. Who are the users?
1. Overall conversion rate
mean(users$converted)
## [1] 0.06760888
count(distinct(users))
count(distinct(users))*mean(users$converted)

To sum up: there are 22,571 users in total, of whom 6.76% (about 1,526) are paying players.

2. Country wise
df2 <- users%>%select(country,converted) %>% group_by(country)%>%summarise(num_of_users = n(),conversion = mean(converted))%>% arrange(desc(num_of_users)) %>% head(14)
df2
p21 <- ggplot(df2,aes(country,num_of_users,fill= country))+geom_bar(stat='identity') + theme(legend.position = "none",plot.title = element_text(hjust = 0.5))+ggtitle('Distribution of Users by Country')
p22 <- ggplot(df2) + geom_line(aes(country, conversion), stat = 'identity', group=1) +
  geom_hline(yintercept=0.0676, linetype="dashed", color = "red")
grid.arrange(p21,p22,nrow=2)

To sum up: 1. The US has the most players and, unsurprisingly, one of the highest conversion rates. 2. The plots show above-average conversion in the US, GB, and Australia. 3. Canada, Norway, and Germany fall somewhat below the mean. 4. The remaining countries consist mostly of free players.

3. Language wise
df3 <- users %>% select(lang, converted) %>% group_by(lang) %>%
  summarise(num_of_users = n(), conversion = mean(converted)) %>% 
  arrange(desc(num_of_users)) %>% head(14)
p31 <- ggplot(df3, aes(lang, num_of_users, fill=lang)) + 
  geom_bar(stat = 'identity') + theme(legend.position = "none") +
  ggtitle('Distribution of Users by Language') +
  theme(plot.title = element_text(hjust = 0.5))
p32 <- ggplot(df3) + geom_line(aes(lang, conversion), stat = 'identity', group=1) +
  geom_hline(yintercept=0.0676, linetype="dashed", color = "red")
grid.arrange(p31,p32,nrow=2)

To sum up:

The majority of users are English speakers, and they also contribute the most payers. Non-English speakers are more likely than average to be non-payers.

4. Device type
df4 <- users %>% select(hw_ver, converted) %>% group_by(hw_ver) %>%
  summarise(num_of_users = n(), conversion = mean(converted)) %>% 
  arrange(desc(num_of_users)) %>% head(14)
p41 <- ggplot(df4, aes(hw_ver, num_of_users, fill=hw_ver)) + 
  geom_bar(stat = 'identity') + theme(legend.position = "none") +
  ggtitle('Distribution of Users by Device Type') +
  theme(plot.title = element_text(hjust = 0.5),axis.text.x = element_text(angle = 90, hjust = 1))
p42 <- ggplot(df4) + geom_line(aes(hw_ver, conversion), stat = 'identity', group=1) +
  geom_hline(yintercept=0.0676, linetype="dashed", color = "red") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
grid.arrange(p41,p42,nrow=2)

To sum up: most users play the game on an iPhone, and users with newer device models are more likely to pay than those holding older models.

5. OS version
df5 <- users %>% select(os_ver,converted) %>% group_by(os_ver) %>%
  summarise(num_of_users = n(), conversion = mean(converted))%>% 
  arrange(desc(num_of_users)) %>% head(14)
p51<- ggplot(df5, aes(os_ver,num_of_users, fill=os_ver))+
  geom_bar(stat = 'identity') + theme(legend.position = "none",plot.title = element_text(hjust = 0.5),axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle('Distribution of Users by Operating system')

p52<-ggplot(df5) + geom_line(aes(os_ver,conversion),stat = 'identity',group =1) +
  geom_hline(yintercept=0.0676, linetype="dashed", color = "red") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
grid.arrange(p51,p52, nrow =2)

To sum up: 1. Most players use OS version 9.2.1. 2. Conversion peaks at OS versions 8.1.2, 9.2.1, and 9.3. 3. OS version 9.2.1 therefore looks like the natural primary target.

6. Install date
df6<- users%>%select(install_date, converted)%>% group_by(install_date) %>%
  summarise(num_of_users = n(), conversion = mean(converted))%>% 
  arrange(desc(num_of_users)) 
p61<-ggplot(df6, aes(as.factor(install_date),num_of_users, fill = install_date)) +
  geom_bar(stat = 'identity') + theme(legend.position = "none",plot.title = element_text(hjust = 0.5),axis.text.x = element_text(angle = 90,hjust=1))


p62<-ggplot(df6)+geom_line(aes(as.factor(install_date),conversion),stat = 'identity',group=1)+
  geom_hline(yintercept=0.0676, linetype="dashed", color = "red") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
grid.arrange(p61,p62,nrow=2)

To sum up: 1. Installs peak on the weekend, with over 4,500 players installing on each of those days. 2. Interestingly, conversion was highest among players who installed on March 3rd, 2016 and lowest among March 4th installs; this may reflect a special event or sale. 3. Overall, players have more time on weekends, which explains the jump in installs on March 5th and 6th; weekend conversion is also above average.

Q2. What do the users purchase?

We look at the in-app purchase (iaps) table to find out more.

Product type
df7 <- iaps%>%group_by(prod_type)%>%summarise(number_of_orders = n(),revenue = sum(rev))

df7
p71 <- ggplot(df7, aes(prod_type, number_of_orders)) + 
  geom_bar(stat = 'identity') + theme(legend.position = "none") +
  ggtitle('Number of Orders by Product Type') +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + coord_flip() 
p72 <- ggplot(df7, aes(prod_type, revenue)) + 
  geom_bar(stat = 'identity') + theme(legend.position = "none") +
  ggtitle('Revenue by Product Type') +
  theme(plot.title = element_text(hjust = 0.5),axis.text.x = element_text(angle = 90, hjust = 1)) + coord_flip() 
grid.arrange(p71,p72,nrow=2)

To sum up: gems are the most purchased product, followed by chapter passes, while value-pack orders and revenue are negligible.

Q3. When do the users purchase?
iaps date
p81 <- ggplot(iaps) + geom_bar(aes(iaps_date, fill = prod_type))+
  ggtitle('Number of Orders over time') + theme(plot.title = element_text(hjust = 0.5)) +
  ylab('Number of orders')
p82 <- ggplot(iaps) + geom_bar(aes(iaps_date, rev,fill=prod_type), stat = 'identity') + 
  ggtitle('Revenue over Time') + theme(plot.title = element_text(hjust = 0.5))  
grid.arrange(p81,p82,nrow=2)

To sum up: orders and revenue peak around March 7th, within the first week of the observed purchase dates. This is common in the gaming industry: the growth period is short, and games capture the most attention in the first few days. Gems are the best-selling product, and their revenue diminishes drastically afterward.

First time of purchase
df9 <- iaps %>% select(udid, iaps_date) %>% group_by(udid) %>% summarise(converted_date = min(iaps_date))

converted_users <- inner_join(converted_users, df9, by = 'udid')

converted_users$days_n <- difftime(converted_users$converted_date, converted_users$install_date, units = 'days')

#converted_users

summary(as.numeric(converted_users$days_n))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   4.607   3.000  60.000
#write.csv(converted_users,file="dataset/final/final_converted_user.csv")

To sum up: on average, players take 4.6 days to convert to paying players, though some take as long as 60 days to get there.

## the median player took 1 day to pay
ggplot(converted_users) + geom_bar(aes(as.numeric(days_n))) + ylab('Number of users') + 
  ggtitle('Days taken to Convert') + theme(plot.title = element_text(hjust = 0.5))  

df91 <- converted_users %>% group_by(days_n) %>% summarise(count = n(), 
                                                   perc = count/nrow(converted_users))
df91$perc_sum<-cumsum(df91$perc)
df91
#pareto.chart is better

To sum up: nearly half of payers made their first purchase on the day they installed; within 2 days, 73% of payers had paid; and 80% paid within four days.
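The cumulative percentages in df91 amount to a Pareto-style curve; a toy base-R version (the days_n values below are invented stand-ins for converted_users$days_n):

```r
# Hypothetical days-to-convert values standing in for converted_users$days_n
days_n <- c(0, 0, 0, 1, 1, 2, 4, 60)
cum_share <- cumsum(table(days_n)) / length(days_n)
cum_share   # share of payers who converted within 0, 1, 2, 4, 60 days
# e.g. cum_share["1"] = 0.625: 62.5% of these toy payers converted within 1 day
```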

Q4.User behavior
converted users

Session rows record logins, so the total number of sessions reflects how often players logged into the game.

p101 <- sessions%>% filter(udid %in% purchased_user)%>%group_by(sessions_date)%>%count() %>% 
  ggplot() + geom_line(aes(sessions_date, n), stat = 'identity') + 
  ggtitle('Number of Sessions over Time of Converted Users') + theme(plot.title = element_text(hjust = 0.5))+ylab('Number of sessions')
p102 <- spendevents %>% filter(udid %in% purchased_user) %>% 
  select(spend_date, story_chapter) %>% group_by(spend_date) %>% 
  summarise(n = n_distinct(story_chapter)) %>%
  ggplot() + geom_line(aes(spend_date, n), stat = 'identity') + 
  ggtitle('Number of Story_Chapter over Time of Converted Users') + theme(plot.title = element_text(hjust = 0.5)) +
  ylab('Number of story_chapter')
spendevents_payers <- spendevents %>% filter(udid %in% purchased_user)
summary(spendevents$amount) 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -999999.0      -5.0       0.0    -140.5       0.0     165.0

A positive amount is currency earned and a negative amount is currency spent; on average, events net about -140 gems. The -999999 value looks like a data-recording mistake.

p103 <- spendevents %>% filter(udid %in% purchased_user) %>% 
  filter(amount != -999999) %>% select(spend_date, amount) %>% group_by(spend_date) %>% 
  summarise(total_amount = sum(amount)) %>%
  ggplot() + geom_line(aes(spend_date, total_amount)) + 
  ggtitle('Spending Amount over Time of Converted Users') + theme(plot.title = element_text(hjust = 0.5)) 
p104 <- iaps %>% group_by(iaps_date) %>% summarise(revenue = sum(rev)) %>%
  ggplot() + geom_line(aes(iaps_date, revenue)) + 
  ggtitle('Revenue over Time of Converted Users') + theme(plot.title = element_text(hjust = 0.5)) 
grid.arrange(p101,p102,p103,p104,nrow=4)

To sum up: daily revenue, the number of sessions, and the amount spent all peaked on March 6, dropped quickly until March 16, and then declined much more slowly.

non-payers
p111 <- sessions %>% filter(udid %in% non_payers$udid) %>% 
  group_by(sessions_date) %>% count() %>% 
  ggplot() + geom_line(aes(sessions_date, n), stat = 'identity') + 
  ggtitle('Number of Sessions over Time of Non Payers') + theme(plot.title = element_text(hjust = 0.5)) +
  ylab('Number of sessions')
p112 <- spendevents %>% filter(udid %in% non_payers$udid) %>% 
  select(spend_date, story_chapter) %>% group_by(spend_date) %>% 
  summarise(n = n_distinct(story_chapter)) %>%
  ggplot() + geom_line(aes(spend_date, n), stat = 'identity') + 
  ggtitle('Number of Story_Chapter over Time of Non Payers') + theme(plot.title = element_text(hjust = 0.5)) +
  ylab('Number of story_chapter')
spendevents_non_payers = spendevents %>% filter(udid %in% non_payers$udid)
summary(spendevents_non_payers$amount) # -999999 probably an outlier
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -999999.0      -5.0      -1.0    -188.8       0.0     165.0
p113 <- spendevents %>% filter(udid %in% non_payers$udid) %>% 
  filter(amount != -999999) %>% select(spend_date, amount) %>% group_by(spend_date) %>% 
  summarise(total_amount = sum(amount)) %>%
  ggplot() + geom_line(aes(spend_date, total_amount)) + 
  ggtitle('Spending Amount over Time of Non Payers') + theme(plot.title = element_text(hjust = 0.5)) 
grid.arrange(p111,p112,p113,nrow=3)

Converted Users vs Non Payers

The spend amount per day fluctuated during the first week, then stabilized after it. Although converted users account for only 6.76% of players, their total sessions per day exceed those of non-payers. In March, converted users touched more than 40 distinct story-chapters per day, while non-payers stayed under 40.

Use a one-week test window
users$test_date = users$install_date + 7
#head(users)
Add session info
sessions_num = count(sessions, udid, sessions_ts, sessions_date)
sessions_num %>% select(n) %>% arrange(desc(n))
test_data <- left_join(users, sessions_num, by = "udid")
colSums(is.na(test_data))
##          udid  install_date          lang       country        hw_ver 
##             0             0             0             0             0 
##        os_ver     converted     test_date   sessions_ts sessions_date 
##             0             0             0           304            32 
##             n 
##            32
#head(test_data)
Check whether each session date falls before the test date; this yields a logical indicating whether the user had session activity during the first week.
test_sessions <- test_data$sessions_date < test_data$test_date 
test_data$test_sessions = test_sessions

Extract the rows of test_data where test_sessions is TRUE. ti: test data for users who logged in during the first week.

ti <- test_data[test_data$test_sessions %in% TRUE, ] # %in% TRUE also drops NA rows, which == 'TRUE' would keep as all-NA rows

Add the count of session activities for each udid

ts = mutate(group_by(ti,udid), t_sessions_n = sum(n))
ts = distinct(ts, udid, t_sessions_n)

ts: the number of session activities for each distinct udid

Add revenue info from iaps, keeping only revenue recorded within the first week

test_data1 <- right_join(users, iaps, by = 'udid')

#head(test_data1)
test_rev <- test_data1$iaps_date < test_data1$test_date 
test_data1$test_rev = test_rev
#head(test_data1)

Extract the rows of test_data1 where test_rev is TRUE

test_data1 <- test_data1[test_data1$test_rev %in% TRUE, ] # %in% TRUE also drops NA rows
iaps_test <- test_data1[-(2:8)]

tr <- left_join(ti, iaps_test, by = 'udid')
tr = mutate(group_by(tr,udid), rev_sum = sum(rev))
tr = distinct(tr, udid, rev_sum)
#head(tr,10)

tr: players who logged into the game during the first week, with their purchase records

tr %>% select(udid, rev_sum) %>% arrange(desc(rev_sum)) %>% head()
test_data2 <- left_join(ti, spendevents, by = 'udid')
#head(test_data2)
test_spend <- test_data2$spend_date < test_data2$test_date
test_data2$test_spend = test_spend
ti <- test_data2[test_data2$test_spend %in% TRUE, ] # %in% TRUE also drops NA rows
# Count rows per (udid, story_chapter) within the test window
t_sc_num = count(ti, udid, story_chapter)
tsc = count(t_sc_num, udid) # number of distinct story_chapters per user
tsc = na.omit(rename(tsc, t_story_chapter_n = n)) # drop NAs and give the count a clear name
# Total amount spent per user in the test window
ta = mutate(group_by(ti, udid), t_amount_n = sum(amount))
ta = na.omit(distinct(ta, udid, t_amount_n)) # drop NAs and keep one row per udid
Data Manipulation

To build one table of distinct udids with revenue, sessions, spend amount, and story-chapter counts, join the per-user summaries: ta (total amount spent in the test window), tsc (story-chapter count), ts (session-activity count), and tr (revenue).

ti1 <- full_join(ta, tsc, by = 'udid')
ti2 <- full_join(ts, ti1, by = 'udid')
ti <- full_join(tr, ti2, by = 'udid')
ti = ti[!is.na(ti$t_sessions_n) & !is.na(ti$t_story_chapter_n) & !is.na(ti$t_amount_n), ]
ti[is.na(ti)] = 0 # remaining NAs are rev_sum for non-payers; set to 0
colSums(is.na(ti))
##              udid           rev_sum      t_sessions_n        t_amount_n 
##                 0                 0                 0                 0 
## t_story_chapter_n 
##                 0
#head(ti)

users1 <- users %>% select(c(1:6)) 
# country_dis
country_dis <- function(country) {
  if (country == 'US') {
    return('US')
  } else if (country == 'GB') {
    return('GB')
  } else if (country == 'CA') {
    return('CA')
  } else if (country == 'AU') {
    return('AU')
  } else if (country == 'PH') {
    return('PH')
  } else if (country == 'ID') {
    return('ID')
  } else if (country == 'NL') {
    return('NL')
  } else if (country == 'RU') {
    return('RU')
  } else {
    return('other')
  }
}
users1$country <- factor(sapply(users1$country, country_dis))

# lang_dis
lang_dis <- function(lang) {
  if (lang == 'en') {
    return('en')
  } else if (lang == 'es') {
    return('es')
  } else if (lang == 'fr') {
    return('fr')
  } else if (lang == 'de') {
    return('de')
  } else if (lang == 'nl') {
    return('nl')
  } else if (lang == 'ru') {
    return('ru')
  } else if (lang == 'pt') {
    return('pt')
  } else if (lang == 'nb') {
    return('nb')
  } else {
    return('other')
  }
}
users1$lang <- factor(sapply(users1$lang, lang_dis))

# Collapse hw_ver to its most common levels
main_hw <- c('iPhone7,2', 'iPhone8,1', 'iPhone6,1', 'iPhone6,2',
             'iPhone5,2', 'iPhone5,3', 'iPad2,5')
users1$hw_ver <- factor(ifelse(users1$hw_ver %in% main_hw,
                               as.character(users1$hw_ver), 'other'))

# Collapse os_ver to its most common levels
main_os <- c('9.2.1', '9.2', '9.1', '7.1.2', '8.3')
users1$os_ver <- factor(ifelse(users1$os_ver %in% main_os,
                               as.character(users1$os_ver), 'other'))
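The four collapsing steps above all follow the same "keep the common levels, lump the rest" pattern. If the forcats package is available, the same effect can be achieved by frequency rather than a hand-picked list (a sketch; `fct_lump_n` keeps the n most frequent levels, which may not match the hand-picked lists exactly):

```r
library(forcats)

# Keep the 8 most frequent countries; lump the remainder into 'other'
users1$country <- fct_lump_n(factor(users1$country), n = 8, other_level = 'other')
```

This trades explicit control for brevity; the hand-written lists remain preferable if the kept levels must be fixed in advance.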

ti_full <- inner_join(users1, ti, by = 'udid') 
colSums(is.na(ti_full))
##              udid      install_date              lang           country 
##                 0                 0                 0                 0 
##            hw_ver            os_ver           rev_sum      t_sessions_n 
##                 0                 0                 0                 0 
##        t_amount_n t_story_chapter_n 
##                 0                 0
ti_full$level = as.factor(ifelse(ti_full$rev_sum > 0, 'Converted_User', 'Non_Payer'))
prop.table(table(ti_full$level))
## 
## Converted_User      Non_Payer 
##     0.07020334     0.92979666
print(table(ti_full$level))
## 
## Converted_User      Non_Payer 
##           1236          16370
Data modeling
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.1.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(caTools)
## Warning: package 'caTools' was built under R version 4.1.3
set.seed(123)
split = sample.split(ti_full$level, SplitRatio = 0.75)

training_set = subset(ti_full, split == TRUE)
test_set = subset(ti_full, split == FALSE)
print(table(training_set$level))
## 
## Converted_User      Non_Payer 
##            927          12278

Unbalanced data

It can be seen that the training set contains only 927 converted users against 12,278 non-payers. The data is heavily unbalanced, so we resample it to bring the ratio to roughly 1:1.
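Before reaching for a package, the rebalancing idea can be illustrated in base R: sample the minority class with replacement until the two classes match (a toy sketch with a stand-in data frame, not the project data):

```r
set.seed(42)
toy <- data.frame(level = c(rep('Converted_User', 30), rep('Non_Payer', 970)))

# Duplicate minority rows at random until both classes have 970 rows
minority <- toy[toy$level == 'Converted_User', , drop = FALSE]
extra    <- minority[sample(nrow(minority), 970 - 30, replace = TRUE), , drop = FALSE]
balanced <- rbind(toy, extra)
table(balanced$level)
```

The ovun.sample call below combines this kind of oversampling with undersampling of the majority class.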

# Remove udid (a pure identifier) and convert install_date to a factor, so every predictor is either continuous or categorical

training_set_v2 = subset(training_set,select = -c(udid))
training_set_v2$install_date <- as.factor(training_set_v2$install_date)
test_set_v2 <- subset(test_set,select = -c(udid))
test_set_v2$install_date <- as.factor(test_set_v2$install_date)


library(ROSE)
## Warning: package 'ROSE' was built under R version 4.1.3
## Loaded ROSE 0.0-4
# Alternative considered: ROSE synthetic sampling
# rose_sample_train_data <- ROSE(level ~ ., data = training_set_v2, seed = 111)$data
# print(table(rose_sample_train_data$level))

To ensure the training set remains large enough, we use a mixed resampling method: the minority class is oversampled and the majority class undersampled at the same time.

both_sample_train_data <- ovun.sample(level ~ ., data = training_set_v2,
                                      method = "both", p = 0.5, seed = 222, N = 26410)$data
print(table(both_sample_train_data$level))
## 
##      Non_Payer Converted_User 
##          13276          13134
set.seed(123)
model <- randomForest(level ~ install_date + lang + country + hw_ver + os_ver + 
                        t_sessions_n + t_amount_n + t_story_chapter_n, 
                      data = both_sample_train_data,
                      ntree = 500,
                      importance = TRUE)
plot(model)

SUM UP

The plot shows the out-of-bag (OOB) error falling as more trees are added to the forest; by 500 trees the OOB error is below 0.01.
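The OOB curve plotted above can also be read off numerically: in randomForest, `model$err.rate` stores, for each number of trees, the overall OOB error plus the per-class error (a sketch reusing the fitted `model` from above):

```r
# Final row = error rates after all 500 trees
final_errors <- model$err.rate[nrow(model$err.rate), ]
print(final_errors['OOB'])                             # overall OOB error
print(final_errors[c('Non_Payer', 'Converted_User')])  # per-class OOB error
```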

y_pred = predict(model, newdata = test_set_v2, type = 'class')
head(y_pred)
##         2         4         5         8        11        16 
## Non_Payer Non_Payer Non_Payer Non_Payer Non_Payer Non_Payer 
## Levels: Non_Payer Converted_User
table_random_forest = table(Predicted = y_pred, Actual = test_set_v2$level)[2:1, 1:2]  # reorder rows so Converted_User comes first

table_random_forest
##                 Actual
## Predicted        Converted_User Non_Payer
##   Converted_User            139       214
##   Non_Payer                 170      3878

Sum up

From the confusion matrix we can tell that the overall prediction accuracy is high (91.3%). However, the model has a high false-negative rate: 55% of converted users are predicted wrongly, so the TPR (sensitivity) is only 139/(139+170) = 45.0%, while the TNR is 3878/(3878+214) = 94.8%. We need to address this issue.

print(paste('Random Forest Accuracy', sum(diag(table_random_forest)/ sum(table_random_forest))))
## [1] "Random Forest Accuracy 0.912747102931152"
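The accuracy, sensitivity, and specificity quoted in this report can all be derived from a 2x2 confusion table with one small helper (a sketch; it assumes the positive class, Converted_User, occupies row 1 and column 1, as in `table_random_forest`):

```r
confusion_metrics <- function(tab) {
  # tab: 2x2 table, positive class in row 1 / column 1
  tp <- tab[1, 1]; fp <- tab[1, 2]
  fn <- tab[2, 1]; tn <- tab[2, 2]
  c(accuracy    = (tp + tn) / sum(tab),
    sensitivity = tp / (tp + fn),   # TPR
    specificity = tn / (tn + fp))   # TNR
}

round(confusion_metrics(table_random_forest), 3)
```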
varImpPlot(model)

Conclusion Random Forest

Based on the data provided, we identified the KPIs that contribute to predicting a user's likelihood of converting into a paying player.

We chose the random forest modelling technique to train the model, then evaluated it on the test set.

The random forest model produces an accuracy of 91.3%. With that said, the company can use this model to predict whether a user will pay for the game.

Using the Mean Decrease Accuracy and Mean Decrease Gini scores, we can tell that story chapter number, amount spent, and user session count are the most important KPIs. This makes sense, because they represent how much time and effort players put into the game: the more time and effort a player invests, the more likely they are to pay.

One drawback of the model is that the false-negative rate is considerable. Since the company needs to target paying gamers for revenue, the model should be improved to reduce false negatives. However, given that the company does not want to waste money, a high false-negative rate is preferable to a high false-positive rate, so the model is acceptable.

Logistic Classifier
both_sample_classifier = glm(formula = level ~ install_date + lang + country + hw_ver + os_ver +
                               t_sessions_n + t_amount_n + t_story_chapter_n,
                             family = binomial, data = both_sample_train_data)
# column 10 of test_set_v2 is 'level'; drop it before predicting
both_sample_probability_predict = predict(both_sample_classifier, type = 'response', newdata = test_set_v2[-10])

y_predict_both_sample = ifelse(both_sample_probability_predict > 0.47, 'Converted_User', 'Non_Payer')

table_logistic = table(Predicted = y_predict_both_sample, Actual = test_set_v2$level)

table_logistic
##                 Actual
## Predicted        Converted_User Non_Payer
##   Converted_User            246       725
##   Non_Payer                  63      3367
print(paste('Logistic Accuracy', sum(diag(table_logistic)/ sum(table_logistic))))
## [1] "Logistic Accuracy 0.820949784139968"
Finding a good cutoff
cutoff = seq(0.2, 0.8, by = 0.05)
acc_list = double()
tpr = double()
tnr = double()

for (val in cutoff) {

  y_predict_both_sample = ifelse(both_sample_probability_predict > val, 'Converted_User', 'Non_Payer')

  table_logistic = table(Predicted = y_predict_both_sample, Actual = test_set_v2$level)
  accuracy = sum(diag(table_logistic) / sum(table_logistic))
  acc_list <- append(acc_list, accuracy)
  tpr <- append(tpr, table_logistic[1, 1] / (table_logistic[1, 1] + table_logistic[2, 1]))
  tnr <- append(tnr, table_logistic[2, 2] / (table_logistic[1, 2] + table_logistic[2, 2]))
}

ggplot() +
  geom_line(aes(x = cutoff, y = acc_list), color = "red") +
  geom_line(aes(x = cutoff, y = tpr), color = "blue") +
  geom_line(aes(x = cutoff, y = tnr), color = "black") +
  xlab("cutoff") + ylab("rate")

As the plot above shows, the best cutoff we can choose is 0.47, where the accuracy is 82%.
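Picking the cutoff by eye works, but the choice can also be made programmatically. One common criterion is Youden's J statistic, which balances sensitivity against specificity (a sketch reusing the `cutoff`, `tpr`, and `tnr` vectors built in the loop above):

```r
# Youden's J = sensitivity + specificity - 1; larger is better
j <- tpr + tnr - 1
best_cutoff <- cutoff[which.max(j)]
print(best_cutoff)
```

Maximizing raw accuracy would instead favor cutoffs that predict Non_Payer almost everywhere, because the test set itself is unbalanced.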


ROC curve

roc_over <- roc.curve(test_set_v2$level, both_sample_probability_predict, plotit = T)

print(paste('mix sampling method: AUC', roc_over$auc))
## [1] "mix sampling method: AUC 0.86247022369008"
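The AUC reported by `roc.curve` can be cross-checked without ROSE, using the rank (Mann-Whitney) formulation of AUC (a sketch; `scores` and `is_pos` stand in for the predicted probabilities and a logical "is Converted_User" vector):

```r
auc_rank <- function(scores, is_pos) {
  # AUC = P(a random positive is scored above a random negative);
  # average ranks handle tied scores
  r <- rank(scores)
  n_pos <- sum(is_pos); n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
```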
summary(both_sample_classifier)
## 
## Call:
## glm(formula = level ~ install_date + lang + country + hw_ver + 
##     os_ver + t_sessions_n + t_amount_n + t_story_chapter_n, family = binomial, 
##     data = both_sample_train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4859  -0.8527  -0.1720   0.8975   2.8608  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -1.802e+00  2.065e-01  -8.725  < 2e-16 ***
## install_date2016-03-02  1.366e-01  6.433e-02   2.124 0.033701 *  
## install_date2016-03-03  3.557e-01  6.447e-02   5.518 3.43e-08 ***
## install_date2016-03-04  1.090e-01  6.264e-02   1.740 0.081794 .  
## install_date2016-03-05  2.257e-01  5.715e-02   3.949 7.86e-05 ***
## install_date2016-03-06  2.506e-01  5.689e-02   4.405 1.06e-05 ***
## install_date2016-03-07  1.951e-01  6.336e-02   3.079 0.002074 ** 
## langen                 -4.696e-01  1.264e-01  -3.716 0.000202 ***
## langes                 -1.299e+00  1.979e-01  -6.562 5.29e-11 ***
## langfr                 -1.727e+00  2.193e-01  -7.874 3.44e-15 ***
## langnb                 -8.454e-01  2.389e-01  -3.538 0.000403 ***
## langnl                 -3.301e+00  5.559e-01  -5.939 2.88e-09 ***
## langother              -8.020e-01  1.351e-01  -5.934 2.95e-09 ***
## langpt                 -1.813e+00  3.056e-01  -5.934 2.95e-09 ***
## langru                 -2.235e+00  5.659e-01  -3.949 7.86e-05 ***
## countryCA              -1.432e-01  1.021e-01  -1.402 0.160800    
## countryGB              -2.090e-01  8.269e-02  -2.527 0.011493 *  
## countryID              -7.541e-01  1.612e-01  -4.678 2.90e-06 ***
## countryNL              -5.381e-01  5.012e-01  -1.074 0.282946    
## countryother           -9.352e-01  9.165e-02 -10.205  < 2e-16 ***
## countryPH              -1.889e+00  1.930e-01  -9.785  < 2e-16 ***
## countryRU              -2.898e-01  5.263e-01  -0.551 0.581819    
## countryUS               1.143e-01  7.439e-02   1.536 0.124590    
## hw_veriPhone5,2        -5.979e-02  1.046e-01  -0.571 0.567769    
## hw_veriPhone5,3         5.700e-02  1.007e-01   0.566 0.571359    
## hw_veriPhone6,1        -1.783e-01  8.928e-02  -1.997 0.045825 *  
## hw_veriPhone6,2         2.293e-01  9.912e-02   2.313 0.020712 *  
## hw_veriPhone7,2         4.538e-01  8.105e-02   5.599 2.15e-08 ***
## hw_veriPhone8,1         2.937e-01  8.384e-02   3.503 0.000460 ***
## hw_verother            -4.176e-02  7.742e-02  -0.539 0.589619    
## os_ver8.3               6.492e-01  1.611e-01   4.031 5.56e-05 ***
## os_ver9.1               6.158e-01  1.334e-01   4.617 3.89e-06 ***
## os_ver9.2               3.448e-01  1.298e-01   2.656 0.007917 ** 
## os_ver9.2.1             8.627e-01  1.220e-01   7.071 1.54e-12 ***
## os_verother             3.822e-01  1.306e-01   2.926 0.003438 ** 
## t_sessions_n            9.721e-03  6.569e-04  14.798  < 2e-16 ***
## t_amount_n              4.901e-07  6.689e-07   0.733 0.463710    
## t_story_chapter_n       3.335e-01  6.342e-03  52.582  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 36611  on 26409  degrees of freedom
## Residual deviance: 26596  on 26372  degrees of freedom
## AIC: 26672
## 
## Number of Fisher Scoring iterations: 6

Conclusion Logistic Regression

Using logistic regression with an adjustable cutoff, we can tune the model to the company's needs. Here we chose a cutoff of 0.47, which yields an accuracy of 82.1%, a sensitivity of 79.6% (246/(246+63) = 0.796), and a decent specificity of 82.3% (3367/(3367+725) = 0.823).

Random Forest vs logistic regression

Random forest: accuracy 91.3%, sensitivity 45.0%, specificity 94.8%. Logistic regression: accuracy 82.1%, sensitivity 79.6%, specificity 82.3%.

We can see that the logistic regression has lower accuracy but much higher sensitivity. Moreover, its cutoff is adjustable, so it can be tuned toward the same operating point as the random forest if needed. Therefore, logistic regression is the more favorable choice.