The majority of our users never generate any revenue (i.e., ‘convert’, in marketing parlance). We want to offer a sale to some of these ‘non-payers’, in hopes of generating incremental revenue. Unfortunately, it’s not easy to identify true non-payers because many payers don’t convert until several weeks after installing the game. If you simply set a cutoff date and offer sales to everyone who hasn’t yet converted, you might find that revenue actually falls. This can occur if large numbers of not-yet-converted future payers take advantage of the sale instead of paying full price. If you’re not lucky, you might fail to convert a sufficient number of true non-payers, while ‘cannibalizing’ future payers by providing more value than they actually require.
The goal of this assignment is to design a sale that maximizes incremental revenue. We don’t expect you to get it right on the first try - we don’t hold ourselves to such a high standard! Instead, we’ll ask you to design one or more offers, select a target group, and prescribe a test protocol that can be used to evaluate the results.
Specify one or more offers. Justify your offer(s) based on the goal of maximizing incremental revenue. Please include details about how the offer should be presented, and on how many occasions.
Specify a test and evaluation protocol. How long should the test run? How will you know if you have successfully increased revenue?
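As a sketch of one possible evaluation, assuming users are randomized into an offer group and a control group and per-user revenue is tracked over the test window, incremental revenue can be compared with a two-sample test. All data below is simulated for illustration only:

```r
# Hypothetical evaluation sketch: compare per-user revenue (in cents) between
# a randomized offer group and a held-out control group of not-yet-converted users.
set.seed(1)
rev_offer   <- rexp(5000, rate = 1/25)  # simulated revenue for users shown the sale
rev_control <- rexp(5000, rate = 1/22)  # simulated revenue for held-out users

# Welch two-sample t-test: a significantly positive lift means the sale
# generated incremental revenue rather than cannibalizing future payers.
tt   <- t.test(rev_offer, rev_control, alternative = "greater")
lift <- mean(rev_offer) - mean(rev_control)
cat("lift per user (cents):", round(lift, 2), " p-value:", round(tt$p.value, 4), "\n")
```

In practice per-user revenue is zero-inflated and heavily skewed, so a rank-based test or a bootstrap of the mean difference may be more appropriate than a plain t-test.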
Build a model in R to help us identify users who are likely to convert on their own. This part can be done independently – it’s not necessary to fold the output of this model into your other analyses. Please assume, however, that we would eventually do this.
· users: udid (user ID), install_date, lang, country, hw_ver (hardware version), os_ver (operating system version)
· sessions: udid (user ID), ts (timestamp), date, session_num, last_session_termination_type
· iaps: udid (user ID), ts (timestamp), date, prod_name (purchased product name), prod_type, rev (revenue)
· spendevents: udid, ts, date, story, chapter, spendtype, currency, amount
Note: Revenue is recorded in the ‘rev’ field and is measured in cents.
There are four tables: user information, in-app purchase records, spend events, and sessions, where the sessions table records each time a user logged into the game.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
#install.packages('lubridate')
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
#rm(list = ls())
users = read.csv("dataset/users.csv")
str(users)
## 'data.frame': 22576 obs. of 6 variables:
## $ udid : chr "88258a36ad1447c4b407bb782c9f7f39" "8f3d900d5d624e95afc95a1e5ffe49fb" "90167c9cf1b9406fb81401da39b800b9" "479dbe15a9b5421a903361931d5d579c" ...
## $ install_date: chr "3/1/2016" "3/1/2016" "3/1/2016" "3/1/2016" ...
## $ lang : chr "th" "id" "en" "en" ...
## $ country : chr "TH" "ID" "US" "GB" ...
## $ hw_ver : chr "iPhone6,2" "iPhone6,1" "iPad5,1" "iPhone8,2" ...
## $ os_ver : chr "9.2.1" "8" "9.2.1" "9.2.1" ...
names(users)
## [1] "udid" "install_date" "lang" "country" "hw_ver"
## [6] "os_ver"
head(users)
summary(users)
## udid install_date lang country
## Length:22576 Length:22576 Length:22576 Length:22576
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## hw_ver os_ver
## Length:22576 Length:22576
## Class :character Class :character
## Mode :character Mode :character
colSums(is.na(users)) # 5 NAs in country
## udid install_date lang country hw_ver os_ver
## 0 0 0 5 0 0
prop.table(table(users$install_date))
##
## 3/1/2016 3/2/2016 3/3/2016 3/4/2016 3/5/2016 3/6/2016 3/7/2016
## 0.1136605 0.1152551 0.1087438 0.1337261 0.2029146 0.2022059 0.1234940
Installs peak on March 5th and 6th, which fall on a weekend, suggesting that users are more likely to install the game on weekends.
sessions = read.csv("dataset/sessions.csv")
str(sessions)
## 'data.frame': 722955 obs. of 5 variables:
## $ udid : chr "f19efdcec2f0414e9507c19da467af20" "f19efdcec2f0414e9507c19da467af20" "f19efdcec2f0414e9507c19da467af20" "f19efdcec2f0414e9507c19da467af20" ...
## $ ts : chr "3/1/16 00:13" "3/1/16 00:27" "3/1/16 00:29" "3/1/16 00:34" ...
## $ date : chr "3/1/2016" "3/1/2016" "3/1/2016" "3/1/2016" ...
## $ session_num : int 1 2 3 4 5 6 7 8 9 10 ...
## $ last_session_termination_type: logi NA NA NA NA NA NA ...
names(sessions)
## [1] "udid" "ts"
## [3] "date" "session_num"
## [5] "last_session_termination_type"
#head(sessions)
summary(sessions)
## udid ts date session_num
## Length:722955 Length:722955 Length:722955 Min. : 1.0
## Class :character Class :character Class :character 1st Qu.: 14.0
## Mode :character Mode :character Mode :character Median : 45.0
## Mean : 106.5
## 3rd Qu.: 124.0
## Max. :1919.0
## last_session_termination_type
## Mode:logical
## NA's:722955
##
##
##
##
colSums(is.na(sessions)) # last_session_termination_type all NAs
## udid ts
## 0 0
## date session_num
## 0 0
## last_session_termination_type
## 722955
Every row of last_session_termination_type is NA: either all users are still active or the field is simply not recorded. The mean of session_num is 106.5 while the median is only 45, so a minority of heavy players drives the average; overall, engagement during this week was high.
iaps <- read.csv("dataset/iaps.csv")
names(iaps)
## [1] "udid" "ts" "date" "prod_name" "prod_type" "rev"
#head(iaps)
summary(iaps)
## udid ts date prod_name
## Length:6685 Length:6685 Length:6685 Length:6685
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## prod_type rev
## Length:6685 Min. : 70.0
## Class :character 1st Qu.: 140.0
## Mode :character Median : 140.0
## Mean : 216.2
## 3rd Qu.: 350.0
## Max. :7000.0
colSums(is.na(iaps))
## udid ts date prod_name prod_type rev
## 0 0 0 0 0 0
The iaps table is key: it records all purchase information. Data quality here is good, as the table contains no NAs.
spendevents <- read.csv("dataset/spendevents.csv")
names(spendevents)
## [1] "udid" "ts" "date" "story" "chapter" "spendtype"
## [7] "currency" "amount"
# "udid","ts","date","story","chapter","spendtype","currency","amount"
#head(spendevents) # each chapter has many stories, when all stories are done, the chapter is over
str(spendevents)
## 'data.frame': 107764 obs. of 8 variables:
## $ udid : chr "319d57bc3d9445c78fbeda4772ecbc9b" "988291d1f38d4b0e97c2c86fb4d0122f" "1e7a9cbe6fb44ecd93c3b47e64f4c377" "309d0946c01a46e88a9038da40145a9e" ...
## $ ts : chr "3/1/16 0:46" "3/1/16 3:43" "3/1/16 4:19" "3/1/16 5:29" ...
## $ date : chr "3/1/2016" "3/1/2016" "3/1/2016" "3/1/2016" ...
## $ story : chr "Mean_Girls_Version_D" "Mean_Girls_Version_D" "Mean_Girls_Version_D" "Demi_Master" ...
## $ chapter : int 0 0 0 0 0 0 0 2 0 0 ...
## $ spendtype: chr "earnGemsCounter" "earnGemsCounter" "earnGemsCounter" "earnGemsCounter" ...
## $ currency : chr "gems" "gems" "gems" "gems" ...
## $ amount : int -5 -5 -5 -5 -5 -5 -5 -1 -5 -5 ...
summary(spendevents)
## udid ts date story
## Length:107764 Length:107764 Length:107764 Length:107764
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## chapter spendtype currency amount
## Min. : 0.000 Length:107764 Length:107764 Min. :-999999.0
## 1st Qu.: 0.000 Class :character Class :character 1st Qu.: -5.0
## Median : 0.000 Mode :character Mode :character Median : 0.0
## Mean : 3.018 Mean : -140.5
## 3rd Qu.: 4.000 3rd Qu.: 0.0
## Max. :89.000 Max. : 165.0
# negative sign in amount means spend currency, positive means earn currency from daily activities or special event,etc.
# amount: min -999999, max 165 earn is more difficult, spend is easy
# all the currency spends in event are gems
colSums(is.na(spendevents))
## udid ts date story chapter spendtype currency amount
## 0 0 0 0 0 0 0 0
1. A negative amount means currency spent; a positive amount means currency earned (from daily activities, special events, etc.).
2. All currency in spend events is gems.
3. -999999 is likely a data-recording error.
4. Earning gems is harder than spending them.
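A minimal sketch of handling the suspected sentinel value, using a made-up toy vector rather than the real spendevents data:

```r
# Toy sketch: the -999999 values distort any aggregate, so flag and drop them
# before summarizing spend amounts. (Vector below is made up for illustration.)
amount <- c(-5, -5, 165, -999999, 0, -1)
clean  <- amount[amount != -999999]
mean(amount)  # -166640.8, distorted by the sentinel
mean(clean)   # 30.8, a sensible average once the sentinel is removed
```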
users = users[complete.cases(users),]
colSums(is.na(users))
## udid install_date lang country hw_ver os_ver
## 0 0 0 0 0 0
iaps = rename(iaps,iaps_ts = ts, iaps_date = date)
sessions = rename(sessions,sessions_ts = ts, sessions_date = date)
spendevents = rename(spendevents, spend_ts = ts, spend_date = date)
#head(iaps)
#head(sessions)
#head(spendevents)
#str(users)
users$install_date = as.Date(users$install_date,format = '%m/%d/%Y')
#head(users)
str(iaps)
## 'data.frame': 6685 obs. of 6 variables:
## $ udid : chr "a737cacac9b9432180f7106c248dc9dc" "a737cacac9b9432180f7106c248dc9dc" "a737cacac9b9432180f7106c248dc9dc" "a737cacac9b9432180f7106c248dc9dc" ...
## $ iaps_ts : chr "3/1/16 9:35" "3/1/16 9:51" "3/1/16 9:56" "3/1/16 11:11" ...
## $ iaps_date: chr "3/1/2016" "3/1/2016" "3/1/2016" "3/1/2016" ...
## $ prod_name: chr "iap_1_gems_2" "iap_1_gems_2" "iap_1_gems_2" "iap_1_gems_2" ...
## $ prod_type: chr "gems" "gems" "gems" "gems" ...
## $ rev : int 140 140 140 140 70 140 1400 70 350 70 ...
iaps$iaps_ts = as.POSIXct(iaps$iaps_ts, format = '%m/%d/%y %H:%M') # %y
iaps$iaps_date = as.Date(iaps$iaps_date, format = '%m/%d/%Y')
#head(iaps)
str(sessions)
## 'data.frame': 722955 obs. of 5 variables:
## $ udid : chr "f19efdcec2f0414e9507c19da467af20" "f19efdcec2f0414e9507c19da467af20" "f19efdcec2f0414e9507c19da467af20" "f19efdcec2f0414e9507c19da467af20" ...
## $ sessions_ts : chr "3/1/16 00:13" "3/1/16 00:27" "3/1/16 00:29" "3/1/16 00:34" ...
## $ sessions_date : chr "3/1/2016" "3/1/2016" "3/1/2016" "3/1/2016" ...
## $ session_num : int 1 2 3 4 5 6 7 8 9 10 ...
## $ last_session_termination_type: logi NA NA NA NA NA NA ...
sessions$sessions_ts = as.POSIXct(sessions$sessions_ts, format = '%m/%d/%y %H:%M')
sessions$sessions_date = as.Date(sessions$sessions_date, format = '%m/%d/%Y')
#head(sessions)
str(spendevents)
## 'data.frame': 107764 obs. of 8 variables:
## $ udid : chr "319d57bc3d9445c78fbeda4772ecbc9b" "988291d1f38d4b0e97c2c86fb4d0122f" "1e7a9cbe6fb44ecd93c3b47e64f4c377" "309d0946c01a46e88a9038da40145a9e" ...
## $ spend_ts : chr "3/1/16 0:46" "3/1/16 3:43" "3/1/16 4:19" "3/1/16 5:29" ...
## $ spend_date: chr "3/1/2016" "3/1/2016" "3/1/2016" "3/1/2016" ...
## $ story : chr "Mean_Girls_Version_D" "Mean_Girls_Version_D" "Mean_Girls_Version_D" "Demi_Master" ...
## $ chapter : int 0 0 0 0 0 0 0 2 0 0 ...
## $ spendtype : chr "earnGemsCounter" "earnGemsCounter" "earnGemsCounter" "earnGemsCounter" ...
## $ currency : chr "gems" "gems" "gems" "gems" ...
## $ amount : int -5 -5 -5 -5 -5 -5 -5 -1 -5 -5 ...
spendevents$spend_ts = as.POSIXct(spendevents$spend_ts, format = '%m/%d/%y %H:%M')
spendevents$spend_date = as.Date(spendevents$spend_date, format = '%m/%d/%Y')
#head(spendevents)
spendevents$story_chapter <- paste(spendevents$story,spendevents$chapter,sep = "_")
#head(spendevents)
Users appearing in iaps are the converted users, so we create a converted column in users: 1 if the udid appears in iaps, 0 otherwise.
purchased <- iaps %>% distinct(udid) %>%select(udid)
purchased_user <-purchased$udid
users$converted <- as.numeric(ifelse(users$udid %in% purchased_user,'1','0'))
converted_users <- users %>% filter(converted == 1)
non_payers <- users %>% filter(converted == 0)%>% select(udid)
#write.csv(users, file="dataset/final/final_user.csv")
mean(users$converted)
## [1] 0.06760888
count(distinct(users))
count(distinct(users))*mean(users$converted)
**To sum up:** of 22,571 users, 6.76% are paying players, i.e., about 1,526 payers.
df2 <- users%>%select(country,converted) %>% group_by(country)%>%summarise(num_of_users = n(),conversion = mean(converted))%>% arrange(desc(num_of_users)) %>% head(14)
df2
p21 <- ggplot(df2,aes(country,num_of_users,fill= country))+geom_bar(stat='identity') + theme(legend.position = "none",plot.title = element_text(hjust = 0.5))+ggtitle('Distribution of Users by Country')
p22 <- ggplot(df2) + geom_line(aes(country, conversion), stat = 'identity', group=1) +
geom_hline(yintercept=0.0676, linetype="dashed", color = "red")
grid.arrange(p21,p22,nrow=2)
To sum up:
1. The US has the most players and, unsurprisingly, one of the highest conversion rates.
2. The plots show above-average conversion in the US, GB, and Australia.
3. Canada, Norway, and Germany fall short of the mean.
4. The remaining countries are mostly free players.
df3 <- users %>% select(lang, converted) %>% group_by(lang) %>%
summarise(num_of_users = n(), conversion = mean(converted)) %>%
arrange(desc(num_of_users)) %>% head(14)
p31 <- ggplot(df3, aes(lang, num_of_users, fill=lang)) +
geom_bar(stat = 'identity') + theme(legend.position = "none") +
ggtitle('Distribution of Users by Language') +
theme(plot.title = element_text(hjust = 0.5))
p32 <- ggplot(df3) + geom_line(aes(lang, conversion), stat = 'identity', group=1) +
geom_hline(yintercept=0.0676, linetype="dashed", color = "red")
grid.arrange(p31,p32,nrow=2)
To sum up:
The majority of users are English speakers, and they also contribute the most payers. Non-English speakers are more likely than average to be non-payers.
df4 <- users %>% select(hw_ver, converted) %>% group_by(hw_ver) %>%
summarise(num_of_users = n(), conversion = mean(converted)) %>%
arrange(desc(num_of_users)) %>% head(14)
p41 <- ggplot(df4, aes(hw_ver, num_of_users, fill=hw_ver)) +
geom_bar(stat = 'identity') + theme(legend.position = "none") +
ggtitle('Distribution of Users by Device Type') +
theme(plot.title = element_text(hjust = 0.5),axis.text.x = element_text(angle = 90, hjust = 1))
p42 <- ggplot(df4) + geom_line(aes(hw_ver, conversion), stat = 'identity', group=1) +
geom_hline(yintercept=0.0676, linetype="dashed", color = "red") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
grid.arrange(p41,p42,nrow=2)
To sum up: 1. Most users play the game on iPhone. 2. Users on newer device models are more likely to pay than those on older models.
df5 <- users %>% select(os_ver,converted) %>% group_by(os_ver) %>%
summarise(num_of_users = n(), conversion = mean(converted))%>%
arrange(desc(num_of_users)) %>% head(14)
p51<- ggplot(df5, aes(os_ver,num_of_users, fill=os_ver))+
geom_bar(stat = 'identity') + theme(legend.position = "none",plot.title = element_text(hjust = 0.5),axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle('Distribution of Users by Operating system')
p52<-ggplot(df5) + geom_line(aes(os_ver,conversion),stat = 'identity',group =1) +
geom_hline(yintercept=0.0676, linetype="dashed", color = "red") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
grid.arrange(p51,p52, nrow =2)
To sum up: 1. Most players use OS version 9.2.1. 2. Conversion rate peaks at OS versions 8.1.2, 9.2.1, and 9.3. 3. OS version 9.2.1 should therefore be our major target.
df6<- users%>%select(install_date, converted)%>% group_by(install_date) %>%
summarise(num_of_users = n(), conversion = mean(converted))%>%
arrange(desc(num_of_users))
p61<-ggplot(df6, aes(as.factor(install_date),num_of_users, fill = install_date)) +
geom_bar(stat = 'identity') + theme(legend.position = "none",plot.title = element_text(hjust = 0.5),axis.text.x = element_text(angle = 90,hjust=1))
p62<-ggplot(df6)+geom_line(aes(as.factor(install_date),conversion),stat = 'identity',group=1)+
geom_hline(yintercept=0.0676, linetype="dashed", color = "red") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
grid.arrange(p61,p62,nrow=2)
To sum up:
1. Installs peak on the weekend, with over 4,500 players installing each day.
2. Interestingly, conversion is highest for the March 3rd cohort and lowest for March 4th, possibly due to a special event or sales opportunity.
3. Players have more time on weekends, which explains why installs jump on March 5th and 6th; weekend cohorts also convert above the average rate.
Next we examine the in-app purchase (iaps) table in more detail.
df7 <- iaps%>%group_by(prod_type)%>%summarise(number_of_orders = n(),revenue = sum(rev))
df7
p71 <- ggplot(df7, aes(prod_type, number_of_orders)) +
geom_bar(stat = 'identity') + theme(legend.position = "none") +
ggtitle('Number of Orders by Product Type') +
theme(plot.title = element_text(hjust = 0.5)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) + coord_flip()
p72 <- ggplot(df7, aes(prod_type, revenue)) +
geom_bar(stat = 'identity') + theme(legend.position = "none") +
ggtitle('Revenue by Product Type') +
theme(plot.title = element_text(hjust = 0.5),axis.text.x = element_text(angle = 90, hjust = 1)) + coord_flip()
grid.arrange(p71,p72,nrow=2)
To sum up: gems are the most purchased product, followed by chapter passes; value-pack orders and revenue are negligible.
p81 <- ggplot(iaps) + geom_bar(aes(iaps_date, fill = prod_type))+
ggtitle('Number of Orders over time') + theme(plot.title = element_text(hjust = 0.5)) +
ylab('Number of orders')
p82 <- ggplot(iaps) + geom_bar(aes(iaps_date, rev,fill=prod_type), stat = 'identity') +
ggtitle('Revenue over Time') + theme(plot.title = element_text(hjust = 0.5))
grid.arrange(p81,p82,nrow=2)
To sum up: orders and revenue peak around March 7th, roughly the end of the first week. This is common in the gaming industry: the growth window is short, and attention is highest in the first few days. Gems are the best-selling product, and their revenue diminishes drastically afterward.
df9 <- iaps %>% select(udid, iaps_date) %>% group_by(udid) %>% summarise(converted_date = min(iaps_date))
converted_users <- inner_join(converted_users, df9, by = 'udid')
converted_users$days_n <- difftime(converted_users$converted_date, converted_users$install_date, units = 'days')
#converted_users
summary(as.numeric(converted_users$days_n))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 4.607 3.000 60.000
#write.csv(converted_users,file="dataset/final/final_converted_user.csv")
To sum up: on average, players take 4.6 days to convert to paying players (the median is 1 day), though the slowest converter took 60 days.
ggplot(converted_users) + geom_bar(aes(as.numeric(days_n))) + ylab('Number of users') +
ggtitle('Days taken to Convert') + theme(plot.title = element_text(hjust = 0.5))
df91 <- converted_users %>% group_by(days_n) %>% summarise(count = n(),
perc = count/nrow(converted_users))
df91$perc_sum<-cumsum(df91$perc)
df91
#pareto.chart is better
To sum up: nearly half of payers made their first payment on the day they installed; within 2 days, 73% had paid; within four days, 80%.
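The cumulative share above (the perc_sum column of df91) can be illustrated with a toy version of the same computation; the days_n values below are hypothetical:

```r
# Toy version of the df91 computation: share of payers converting within n days.
days_n <- c(0, 0, 0, 1, 1, 2, 4, 10, 30, 60)   # hypothetical conversion delays
tab <- as.data.frame(table(days_n))
tab$perc     <- tab$Freq / length(days_n)       # share converting on exactly day n
tab$perc_sum <- cumsum(tab$perc)                # cumulative share by day n
tab  # here 30% convert on day 0 and 50% within 1 day
```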
The sessions table records each login, so per-user session counts measure how many times a player logged into the game.
p101 <- sessions%>% filter(udid %in% purchased_user)%>%group_by(sessions_date)%>%count() %>%
ggplot() + geom_line(aes(sessions_date, n), stat = 'identity') +
ggtitle('Number of Sessions over Time of Converted Users') + theme(plot.title = element_text(hjust = 0.5))+ylab('Number of sessions')
p102 <- spendevents %>% filter(udid %in% purchased_user) %>%
select(spend_date, story_chapter) %>% group_by(spend_date) %>%
summarise(n = n_distinct(story_chapter)) %>%
ggplot() + geom_line(aes(spend_date, n), stat = 'identity') +
ggtitle('Number of Story_Chapter over Time of Converted Users') + theme(plot.title = element_text(hjust = 0.5)) +
ylab('Number of story_chapter')
spendevents_payers <- spendevents %>% filter(udid %in% purchased_user)
summary(spendevents$amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -999999.0 -5.0 0.0 -140.5 0.0 165.0
Positive amounts are gems earned and negative amounts are gems spent; on average each event nets -140 gems. The -999999 values look like a recording mistake.
p103 <- spendevents %>% filter(udid %in% purchased_user) %>%
filter(amount != -999999) %>% select(spend_date, amount) %>% group_by(spend_date) %>%
summarise(total_amount = sum(amount)) %>%
ggplot() + geom_line(aes(spend_date, total_amount)) +
ggtitle('Spending Amount over Time of Converted Users') + theme(plot.title = element_text(hjust = 0.5))
p104 <- iaps %>% group_by(iaps_date) %>% summarise(revenue = sum(rev)) %>%
ggplot() + geom_line(aes(iaps_date, revenue)) +
ggtitle('Revenue over Time of Converted Users') + theme(plot.title = element_text(hjust = 0.5))
grid.arrange(p101,p102,p103,p104,nrow=4)
To sum up: daily revenue, the number of sessions, and the amount spent all peaked on March 6th, dropped quickly until March 16th, and then declined much more slowly.
p111 <- sessions %>% filter(udid %in% non_payers$udid) %>%
group_by(sessions_date) %>% count() %>%
ggplot() + geom_line(aes(sessions_date, n), stat = 'identity') +
ggtitle('Number of Sessions over Time of Non Payers') + theme(plot.title = element_text(hjust = 0.5)) +
ylab('Number of sessions')
p112 <- spendevents %>% filter(udid %in% non_payers$udid) %>%
select(spend_date, story_chapter) %>% group_by(spend_date) %>%
summarise(n = n_distinct(story_chapter)) %>%
ggplot() + geom_line(aes(spend_date, n), stat = 'identity') +
ggtitle('Number of Story_Chapter over Time of Non Payers') + theme(plot.title = element_text(hjust = 0.5)) +
ylab('Number of story_chapter')
spendevents_non_payers = spendevents %>% filter(udid %in% non_payers$udid)
summary(spendevents_non_payers$amount) # -999999 probably an outlier
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -999999.0 -5.0 -1.0 -188.8 0.0 165.0
p113 <- spendevents %>% filter(udid %in% non_payers$udid) %>%
filter(amount != -999999) %>% select(spend_date, amount) %>% group_by(spend_date) %>%
summarise(total_amount = sum(amount)) %>%
ggplot() + geom_line(aes(spend_date, total_amount)) +
ggtitle('Spending Amount over Time of Non Payers') + theme(plot.title = element_text(hjust = 0.5))
grid.arrange(p111,p112,p113,nrow=3)
Converted Users vs Non Payers
The daily spend amount fluctuated during the first week, then stabilized afterward. Although converted users account for only 6.76% of players, their total sessions per day exceed those of non-payers. In March, the number of story-chapters per day is above 40 for converted users, while for non-payers it stays under 40.
users$test_date = users$install_date + 7
#head(users)
sessions_num = count(sessions, udid, sessions_ts, sessions_date)
sessions_num %>% select(n) %>% arrange(desc(n))
test_data <- left_join(users, sessions_num, by = "udid")
colSums(is.na(test_data))
## udid install_date lang country hw_ver
## 0 0 0 0 0
## os_ver converted test_date sessions_ts sessions_date
## 0 0 0 304 32
## n
## 32
#head(test_data)
test_sessions <- test_data$sessions_date < test_data$test_date
test_data$test_sessions = test_sessions
Extract the rows where test_sessions is TRUE from test_data. ti: session rows that fall within the first week after install.
ti<-test_data[test_data$test_sessions == 'TRUE',]
Add the count of session activities for each udid.
ts = mutate(group_by(ti,udid), t_sessions_n = sum(n))
ts = distinct(ts, udid, t_sessions_n)
ts: the number of first-week session activities for each distinct udid.
Add revenue info from iaps, keeping only revenue recorded within the first week after install.
test_data1 <- right_join(users, iaps, by = 'udid')
#head(test_data1)
test_rev <- test_data1$iaps_date < test_data1$test_date
test_data1$test_rev = test_rev
#head(test_data1)
Extract the rows where test_rev is TRUE from test_data1.
test_data1 <- test_data1[test_data1$test_rev == 'TRUE',]
iaps_test <- test_data1[-(2:8)]
tr <- left_join(ti, iaps_test, by = 'udid')
tr = mutate(group_by(tr,udid), rev_sum = sum(rev))
tr = distinct(tr, udid, rev_sum)
#head(tr,10)
tr: players who logged in during the first week, together with their first-week purchase revenue.
tr %>% select(udid, rev_sum) %>% arrange(desc(rev_sum)) %>% head()
test_data2 <- left_join(ti, spendevents, by = 'udid')
#head(test_data2)
test_spend <- test_data2$spend_date < test_data2$test_date
test_data2$test_spend = test_spend
ti <- test_data2[test_data2$test_spend == 'TRUE',]
# Count (udid, story_chapter) pairs in ti ('t_sc_num')
t_sc_num = count(ti, udid, story_chapter)
tsc = count(t_sc_num, udid) # number of distinct story_chapters per udid
tsc = na.omit(rename(tsc, t_story_chapter_n = n)) # drop NAs and rename the count (story_chapters played)
# Sum the spend amount per udid ('ta' = test amount spent)
ta = mutate(group_by(ti, udid), t_amount_n = sum(amount))
ta = na.omit(distinct(ta, udid, t_amount_n)) # drop NAs and keep one row per udid
Join the per-user features into a single table keyed by distinct udid:
ta: first-week net gem amount per player
tsc: first-week story_chapter count per player
ts: first-week session count per player
tr: first-week revenue per player
ti1 <- full_join(ta, tsc, by = 'udid')
ti2 <- full_join(ts, ti1, by = 'udid')
ti <- full_join(tr, ti2, by = 'udid')
ti = ti[is.na(ti$t_sessions_n) == FALSE & is.na(ti$t_story_chapter_n) == FALSE & is.na(ti$t_amount_n) == FALSE,]
ti[is.na(ti)] = 0 # rev_sum = 0
colSums(is.na(ti))
## udid rev_sum t_sessions_n t_amount_n
## 0 0 0 0
## t_story_chapter_n
## 0
#head(ti)
users1 <- users %>% select(c(1:6))
# Collapse rare factor levels into 'other', keeping only the listed top levels
collapse_levels <- function(x, keep) {
  factor(ifelse(x %in% keep, as.character(x), 'other'))
}
users1$country <- collapse_levels(users1$country, c('US','GB','CA','AU','PH','ID','NL','RU'))
users1$lang    <- collapse_levels(users1$lang,    c('en','es','fr','de','nl','ru','pt','nb'))
users1$hw_ver  <- collapse_levels(users1$hw_ver,
                                  c('iPhone7,2','iPhone8,1','iPhone6,1','iPhone6,2',
                                    'iPhone5,2','iPhone5,3','iPad2,5'))
users1$os_ver  <- collapse_levels(users1$os_ver,  c('9.2.1','9.2','9.1','7.1.2','8.3'))
ti_full <- inner_join(users1, ti, by = 'udid')
colSums(is.na(ti_full))
## udid install_date lang country
## 0 0 0 0
## hw_ver os_ver rev_sum t_sessions_n
## 0 0 0 0
## t_amount_n t_story_chapter_n
## 0 0
ti_full$level = as.factor(ifelse(ti_full$rev_sum > 0, 'Converted_User', 'Non_Payer'))
prop.table(table(ti_full$level))
##
## Converted_User Non_Payer
## 0.07020334 0.92979666
print(table(ti_full$level))
##
## Converted_User Non_Payer
## 1236 16370
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.1.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(caTools)
## Warning: package 'caTools' was built under R version 4.1.3
set.seed(123)
split = sample.split(ti_full$level, SplitRatio = 0.75)
training_set = subset(ti_full, split == TRUE)
test_set = subset(ti_full, split == FALSE)
print(table(training_set$level))
##
## Converted_User Non_Payer
## 927 12278
Unbalanced data: there are only 927 converted users versus 12,278 non-payers. The classes are heavily imbalanced, so we resample to bring the ratio to roughly 1:1.
# All predictors must be categorical or continuous: remove udid and convert install_date to a factor
training_set_v2 = subset(training_set,select = -c(udid))
training_set_v2$install_date <- as.factor(training_set_v2$install_date)
test_set_v2 <- subset(test_set,select = -c(udid))
test_set_v2$install_date <- as.factor(test_set_v2$install_date)
library(ROSE)
## Warning: package 'ROSE' was built under R version 4.1.3
## Loaded ROSE 0.0-4
#rose_sample_train_data <- ROSE(level~., data = training_set_v2,seed = 111)$data
#print(table(rose_sample_train_data$level))
To keep the training set large enough, we use a mixed (over- and under-) sampling method.
both_sample_train_data <- ovun.sample(level ~ ., data = training_set_v2, method="both", p=0.5,seed =222,N=26410)$data
print(table(both_sample_train_data$level))
##
## Non_Payer Converted_User
## 13276 13134
set.seed(123)
model <- randomForest(level ~ install_date + lang + country + hw_ver + os_ver +
t_sessions_n + t_amount_n + t_story_chapter_n,
data = both_sample_train_data,
ntree = 500,
importance = TRUE)
plot(model)
To sum up: the plot shows the out-of-bag (OOB) error falling as more trees are added; at 500 trees the OOB error is below 0.01.
y_pred = predict(model, newdata = test_set_v2, type = 'class')
head(y_pred)
## 2 4 5 8 11 16
## Non_Payer Non_Payer Non_Payer Non_Payer Non_Payer Non_Payer
## Levels: Non_Payer Converted_User
table_random_forest = table(Predicted = y_pred, Actual = test_set_v2$level)[2:1,1:2]
table_random_forest
## Actual
## Predicted Converted_User Non_Payer
## Converted_User 139 214
## Non_Payer 170 3878
To sum up: from the confusion matrix, overall accuracy is high (91.3%), but the false-negative error is also high: 55% of converted players are misclassified. Sensitivity (TPR) is only 139/(139+170) = 45.0%, which needs to be addressed; specificity (TNR) is 3878/(3878+214) = 94.8%.
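The metrics quoted above can be recomputed directly from the printed confusion matrix; a small verification sketch with the counts copied from the table:

```r
# Rebuild the confusion matrix (rows = predicted, cols = actual) and derive
# accuracy, sensitivity (TPR), and specificity (TNR) from it.
cm <- matrix(c(139, 170, 214, 3878), nrow = 2,
             dimnames = list(Predicted = c('Converted_User', 'Non_Payer'),
                             Actual    = c('Converted_User', 'Non_Payer')))
accuracy    <- sum(diag(cm)) / sum(cm)   # (139 + 3878) / 4401
sensitivity <- cm[1, 1] / sum(cm[, 1])   # 139 / 309
specificity <- cm[2, 2] / sum(cm[, 2])   # 3878 / 4092
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
# -> accuracy 0.913, sensitivity 0.450, specificity 0.948
```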
print(paste('Random Forest Accuracy', sum(diag(table_random_forest)/ sum(table_random_forest))))
## [1] "Random Forest Accuracy 0.912747102931152"
varImpPlot(model)
Based on the data provided, we identified the KPIs that contribute to predicting a user's likelihood of converting to a paying player.
We chose the random forest technique to train the model, and then evaluated it on the test set.
The random forest model achieves an accuracy of 91.3%, so it can be used by the company to predict whether a user will pay for the game.
The Mean Decrease Accuracy and Mean Decrease Gini scores show that story chapter number, spend amount, and user session count are the most important KPIs. This makes sense, because they reflect how much time and effort players invest in the game; the more time and effort a player puts in, the more likely they are to pay.
One drawback of the model is its considerable false-negative error. Since the company needs to target paying gamers for revenue, the model should be improved to reduce false negatives. That said, because the company does not want to waste money, a high false-negative rate is preferable to a high false-positive rate, so the model is acceptable as a starting point.
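One standard way to trade false negatives for false positives in a random forest is balanced down-sampling: each tree draws the same number of cases from each class, so the forest no longer favors the majority Non_Payer class. A sketch using `randomForest`'s `strata`/`sampsize` arguments, assuming the training data and formula from the model above (not run standalone; results would need re-checking on the test set):

```r
# Balanced per-tree sampling: draw minority-class-many cases from each class
n_minority <- min(table(both_sample_train_data$level))
set.seed(123)
model_bal <- randomForest(level ~ install_date + lang + country + hw_ver + os_ver +
                            t_sessions_n + t_amount_n + t_story_chapter_n,
                          data = both_sample_train_data,
                          ntree = 500,
                          strata = both_sample_train_data$level,
                          sampsize = rep(n_minority, 2))
```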
both_sample_classifier = glm(formula = level ~ install_date + lang + country + hw_ver + os_ver +
                               t_sessions_n + t_amount_n + t_story_chapter_n,
                             family = binomial, data = both_sample_train_data)
both_sample_probability_predict = predict(both_sample_classifier, type = 'response', newdata = test_set_v2[-10])
y_predict_both_sample = ifelse(both_sample_probability_predict > 0.47, 'Converted_User', 'Non_Payer')
table_logistic = table(Predicted = y_predict_both_sample, Actual = test_set_v2$level)
table_logistic
## Actual
## Predicted Converted_User Non_Payer
## Converted_User 246 725
## Non_Payer 63 3367
print(paste('Logistic Accuracy', sum(diag(table_logistic)) / sum(table_logistic)))
## [1] "Logistic Accuracy 0.820949784139968"
cutoff = seq(0.2, 0.8, by = 0.05)
acc_list = double()
tpr = double()
tnr = double()
for (val in cutoff) {
  y_predict_both_sample = ifelse(both_sample_probability_predict > val, 'Converted_User', 'Non_Payer')
  table_logistic = table(Predicted = y_predict_both_sample, Actual = test_set_v2$level)
  acc_list <- append(acc_list, sum(diag(table_logistic)) / sum(table_logistic))
  tpr <- append(tpr, table_logistic[1, 1] / (table_logistic[1, 1] + table_logistic[2, 1]))
  tnr <- append(tnr, table_logistic[2, 2] / (table_logistic[1, 2] + table_logistic[2, 2]))
}
ggplot() +
  geom_line(aes(x = cutoff, y = acc_list), color = "red") +
  geom_line(aes(x = cutoff, y = tpr), color = "blue") +
  geom_line(aes(x = cutoff, y = tnr), color = "black") +
  xlab("cutoff") + ylab("rate")
As the plot above shows, the best cutoff we can choose is 0.47, where accuracy remains at 82% while sensitivity and specificity are roughly balanced.
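Rather than reading the cutoff off the plot, it can be chosen programmatically, for example by maximizing Youden's J (sensitivity + specificity - 1). A self-contained sketch on toy data; in the real analysis the call would be passed `both_sample_probability_predict` and `test_set_v2$level`:

```r
# Pick the cutoff that maximizes Youden's J = sensitivity + specificity - 1
best_cutoff <- function(prob, actual, positive, cutoffs = seq(0.2, 0.8, by = 0.05)) {
  j <- sapply(cutoffs, function(cut) {
    pred <- prob > cut
    sens <- sum(pred & actual == positive) / sum(actual == positive)
    spec <- sum(!pred & actual != positive) / sum(actual != positive)
    sens + spec - 1
  })
  cutoffs[which.max(j)]
}

# Toy example: positives score high, negatives low
prob   <- c(0.9, 0.8, 0.7, 0.3, 0.2, 0.1)
actual <- c("Converted_User", "Converted_User", "Converted_User",
            "Non_Payer", "Non_Payer", "Non_Payer")
best_cutoff(prob, actual, positive = "Converted_User")
# 0.3 (first cutoff that separates the two groups perfectly)
```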
ROC curve
# roc.curve() from the ROSE package: plot once and keep the returned object for the AUC
roc_over <- roc.curve(test_set_v2$level, both_sample_probability_predict, plotit = TRUE)
print(paste('mixed sampling method AUC:', roc_over$auc))
## [1] "mixed sampling method AUC: 0.86247022369008"
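As a sanity check, the AUC can also be computed without any ROC package via the Wilcoxon rank-sum identity: it equals the probability that a randomly chosen positive scores above a randomly chosen negative. A self-contained sketch on toy data; the real call would use the test labels and `both_sample_probability_predict`:

```r
# AUC via the Wilcoxon rank-sum identity
auc_rank <- function(prob, is_positive) {
  r <- rank(prob)                    # midranks handle tied scores
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: one mis-ranked pair out of 3 * 3 = 9 -> AUC = 8/9
prob   <- c(0.9, 0.8, 0.25, 0.3, 0.2, 0.1)
is_pos <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
auc_rank(prob, is_pos)
# 8/9, i.e. about 0.889
```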
summary(both_sample_classifier)
##
## Call:
## glm(formula = level ~ install_date + lang + country + hw_ver +
## os_ver + t_sessions_n + t_amount_n + t_story_chapter_n, family = binomial,
## data = both_sample_train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4859 -0.8527 -0.1720 0.8975 2.8608
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.802e+00 2.065e-01 -8.725 < 2e-16 ***
## install_date2016-03-02 1.366e-01 6.433e-02 2.124 0.033701 *
## install_date2016-03-03 3.557e-01 6.447e-02 5.518 3.43e-08 ***
## install_date2016-03-04 1.090e-01 6.264e-02 1.740 0.081794 .
## install_date2016-03-05 2.257e-01 5.715e-02 3.949 7.86e-05 ***
## install_date2016-03-06 2.506e-01 5.689e-02 4.405 1.06e-05 ***
## install_date2016-03-07 1.951e-01 6.336e-02 3.079 0.002074 **
## langen -4.696e-01 1.264e-01 -3.716 0.000202 ***
## langes -1.299e+00 1.979e-01 -6.562 5.29e-11 ***
## langfr -1.727e+00 2.193e-01 -7.874 3.44e-15 ***
## langnb -8.454e-01 2.389e-01 -3.538 0.000403 ***
## langnl -3.301e+00 5.559e-01 -5.939 2.88e-09 ***
## langother -8.020e-01 1.351e-01 -5.934 2.95e-09 ***
## langpt -1.813e+00 3.056e-01 -5.934 2.95e-09 ***
## langru -2.235e+00 5.659e-01 -3.949 7.86e-05 ***
## countryCA -1.432e-01 1.021e-01 -1.402 0.160800
## countryGB -2.090e-01 8.269e-02 -2.527 0.011493 *
## countryID -7.541e-01 1.612e-01 -4.678 2.90e-06 ***
## countryNL -5.381e-01 5.012e-01 -1.074 0.282946
## countryother -9.352e-01 9.165e-02 -10.205 < 2e-16 ***
## countryPH -1.889e+00 1.930e-01 -9.785 < 2e-16 ***
## countryRU -2.898e-01 5.263e-01 -0.551 0.581819
## countryUS 1.143e-01 7.439e-02 1.536 0.124590
## hw_veriPhone5,2 -5.979e-02 1.046e-01 -0.571 0.567769
## hw_veriPhone5,3 5.700e-02 1.007e-01 0.566 0.571359
## hw_veriPhone6,1 -1.783e-01 8.928e-02 -1.997 0.045825 *
## hw_veriPhone6,2 2.293e-01 9.912e-02 2.313 0.020712 *
## hw_veriPhone7,2 4.538e-01 8.105e-02 5.599 2.15e-08 ***
## hw_veriPhone8,1 2.937e-01 8.384e-02 3.503 0.000460 ***
## hw_verother -4.176e-02 7.742e-02 -0.539 0.589619
## os_ver8.3 6.492e-01 1.611e-01 4.031 5.56e-05 ***
## os_ver9.1 6.158e-01 1.334e-01 4.617 3.89e-06 ***
## os_ver9.2 3.448e-01 1.298e-01 2.656 0.007917 **
## os_ver9.2.1 8.627e-01 1.220e-01 7.071 1.54e-12 ***
## os_verother 3.822e-01 1.306e-01 2.926 0.003438 **
## t_sessions_n 9.721e-03 6.569e-04 14.798 < 2e-16 ***
## t_amount_n 4.901e-07 6.689e-07 0.733 0.463710
## t_story_chapter_n 3.335e-01 6.342e-03 52.582 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 36611 on 26409 degrees of freedom
## Residual deviance: 26596 on 26372 degrees of freedom
## AIC: 26672
##
## Number of Fisher Scoring iterations: 6
Using logistic regression with an adjustable cutoff, we can tune the model to the company's needs. For this case, we chose a cutoff of 0.47, which yields an accuracy of 82.1%, a sensitivity of 79.6% (246/(246+63) = 0.796), and a decent specificity of 82.3% (3367/(3367+725) = 0.823).
Random Forest vs. Logistic Regression
Random forest: accuracy 91.3%, sensitivity 45.0%, specificity 94.8%. Logistic regression: accuracy 82.1%, sensitivity 79.6%, specificity 82.3%.
The logistic regression has lower accuracy but much higher sensitivity. Because its cutoff is adjustable, it can also be tuned toward whichever trade-off the business needs, including an operating point close to the random forest's. Therefore, logistic regression is the more favorable choice.