data = data[which(data$FY == '2014'),]
data$FQ <- NULL
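I think the 2014 data alone is enough here, and the fiscal quarter (FQ) does not help in building the regression, so we drop it.
We first build a model m0; its summary shows the coefficient of each factor.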
m0 = glm(OnlineOrderFlag~Bikes+Clothing+Components+Accessories, data=data,
family=binomial(link='logit'))
summary(m0)
##
## Call:
## glm(formula = OnlineOrderFlag ~ Bikes + Clothing + Components +
## Accessories, family = binomial(link = "logit"), data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.5203 0.0639 0.0736 0.1408 0.5093
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.5617 0.1605 22.191 <2e-16 ***
## Bikes -1.3016 0.1373 -9.477 <2e-16 ***
## Clothing -0.2829 0.1122 -2.521 0.0117 *
## Components -23.4698 327.5423 -0.072 0.9429
## Accessories 2.6323 0.1366 19.266 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10109 on 21461 degrees of freedom
## Residual deviance: 3121 on 21457 degrees of freedom
## AIC: 3131
##
## Number of Fisher Scoring iterations: 18
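The estimates in this table are on the log-odds scale, which is hard to read directly. A minimal sketch of how to turn them into odds ratios is below. The numbers are copied by hand from the summary output, and the name m0_coefs is just an illustrative variable, not something from the assignment.

```r
# Coefficients copied from the summary(m0) output (log-odds scale).
m0_coefs <- c(
  "(Intercept)" = 3.5617,
  Bikes         = -1.3016,
  Clothing      = -0.2829,
  Components    = -23.4698,
  Accessories   = 2.6323
)

# Exponentiating a log-odds coefficient gives an odds ratio:
# above 1 raises the odds of an online order, below 1 lowers them.
odds_ratios <- exp(m0_coefs)
print(signif(odds_ratios, 3))
```

With the fitted model in the session, `exp(coef(m0))` should give the same numbers without any copying. For example, the odds ratio for Bikes comes out at about 0.27, while the one for Components is practically 0.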
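Note that the stars mark significance levels. Components has no star, and its standard error (327.5) is far larger than its estimate (-23.5), so we can treat this coefficient as extremely unreliable.
So next we build model m1, which leaves out Components.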
m1 = glm(OnlineOrderFlag~Bikes+Clothing+Accessories, data=data,
family=binomial(link='logit'))
summary(m1)
##
## Call:
## glm(formula = OnlineOrderFlag ~ Bikes + Clothing + Accessories,
## family = binomial(link = "logit"), data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0767 0.1330 0.1915 0.3999 0.8769
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.99548 0.08708 34.40 <2e-16 ***
## Bikes -1.50410 0.07391 -20.35 <2e-16 ***
## Clothing -0.73385 0.06301 -11.65 <2e-16 ***
## Accessories 1.72858 0.06326 27.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10108.9 on 21461 degrees of freedom
## Residual deviance: 8160.8 on 21458 degrees of freedom
## AIC: 8168.8
##
## Number of Fisher Scoring iterations: 6
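The new model's AIC is higher (8168.8 versus 3131 for m0). Lower AIC is better, so by this measure m1 actually fits worse than m0. What we gain in exchange is coefficients with small standard errors and convergence in 6 Fisher scoring iterations instead of 18.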
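The two AIC values can be reproduced by hand from the summaries, which helps when comparing them. The sketch below assumes the usual relation for a 0/1 response, AIC = residual deviance + 2k, where k is the number of estimated coefficients; all inputs are copied from the outputs and the variable names are only illustrative.

```r
# Values copied from the two summary() outputs.
m0_resid_dev <- 3121     # residual deviance of m0
m1_resid_dev <- 8160.8   # residual deviance of m1
m0_k <- 5                # intercept + 4 predictors
m1_k <- 4                # intercept + 3 predictors

# For a 0/1 response the residual deviance equals -2 * log-likelihood,
# so AIC = deviance + 2 * (number of coefficients).
m0_aic <- m0_resid_dev + 2 * m0_k
m1_aic <- m1_resid_dev + 2 * m1_k

print(c(m0 = m0_aic, m1 = m1_aic))
```

Both results agree with the AIC lines printed by summary(). With the fitted models available, `AIC(m0, m1)` is the direct way to get the same comparison.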
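We now have a relatively reliable logistic regression model for judging how buying bikes, accessories, and clothing relates to whether a customer ends up ordering online or buying in a store.
That covers everything the Assignment asks for.
Next comes the "What else can we do" part.
- We want to see what the existing factors predict (online purchase or not), so we add a column class, set to 1 when the log-odds from m0 are above 0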
m0.logodd = predict(m0, type="link")
m0.class = as.numeric(m0.logodd>0)
data$class = m0.class
- We want to see the probability of [ordering online] given these factors, so we add a column prob, in percent, rounded to 3 decimal places
m0.prob = predict(m0, type="response")
data$prob = round(m0.prob*100,3)
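To see what type="link" and type="response" mean, here is a small worked example for a single order that contains only Accessories. It is a sketch that copies two coefficients from summary(m0) by hand; the variable names are made up for illustration.

```r
# Coefficients copied from summary(m0).
b0    <- 3.5617   # (Intercept)
b_acc <- 2.6323   # Accessories

# Order with only Accessories == 1: the log-odds ("link" scale)
# is the intercept plus that one coefficient.
logodd <- b0 + b_acc

# plogis() is the inverse logit, 1 / (1 + exp(-x)); this is the
# "response" scale, i.e. the probability of an online order.
prob <- plogis(logodd)

# A log-odds above 0 is the same condition as a probability above 0.5.
pred_class <- as.numeric(logodd > 0)

print(round(prob * 100, 3))
print(pred_class)
```

The probability should land very close to the prob value shown for Accessories-only rows in the data; tiny differences in the last digit are possible because the coefficients here are rounded.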
Let's peek at what the data frame looks like now.
data[1:10,]
## FY Bikes Clothing Components Accessories OnlineOrderFlag class prob
## 9589 2014 0 0 0 1 1 1 99.796
## 9590 2014 1 0 0 0 1 1 90.552
## 9591 2014 1 0 1 0 0 0 0.000
## 9592 2014 1 0 0 1 1 1 99.255
## 9593 2014 0 1 0 1 1 1 99.730
## 9594 2014 1 1 0 1 1 1 99.014
## 9595 2014 1 1 0 0 1 1 87.839
## 9596 2014 0 1 0 1 1 1 99.730
## 9597 2014 1 0 0 1 1 1 99.255
## 9598 2014 0 0 0 1 1 1 99.796
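Note the comma in [1:10,]: it selects rows 1 to 10. Without it, data[1:10] asks for columns 1 to 10, and since data has only 8 columns, R throws an error.
Next we may want to see what class and prob come out to in each case.
We load the tidyverse package, use its filter function (from dplyr), and peek at the first ten rows as before.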
library(tidyverse)
q = data %>% filter(Bikes == 0, Clothing == 0, Components == 0, Accessories == 0)
q[1:10,]
## FY Bikes Clothing Components Accessories OnlineOrderFlag class prob
## NA NA NA NA NA NA NA NA NA
## NA.1 NA NA NA NA NA NA NA NA
## NA.2 NA NA NA NA NA NA NA NA
## NA.3 NA NA NA NA NA NA NA NA
## NA.4 NA NA NA NA NA NA NA NA
## NA.5 NA NA NA NA NA NA NA NA
## NA.6 NA NA NA NA NA NA NA NA
## NA.7 NA NA NA NA NA NA NA NA
## NA.8 NA NA NA NA NA NA NA NA
## NA.9 NA NA NA NA NA NA NA NA
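The NA rows here are not missing values in the data. They show up because the filter matched nothing (no 2014 order has all four flags equal to 0), and indexing rows 1:10 of an empty data frame pads it with NA. The sketch below reproduces this with a tiny made-up data frame (toy is hypothetical, not the assignment data) and shows that head() avoids the padding.

```r
# Toy data frame (made up for illustration); no row has both flags equal to 0.
toy <- data.frame(
  Bikes       = c(1, 0, 1),
  Accessories = c(0, 1, 1)
)

# A filter that matches nothing gives a data frame with zero rows.
empty <- toy[toy$Bikes == 0 & toy$Accessories == 0, ]
print(nrow(empty))

# Asking for rows 1:10 of it pads the result with NA rows...
padded <- empty[1:10, ]
print(nrow(padded))

# ...whereas head() returns at most the rows that actually exist.
print(nrow(head(empty, 10)))
```

So checking `nrow(q)` first, or using `head(q, 10)` instead of `q[1:10,]`, makes an empty result obvious.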
e = data %>% filter(Bikes == 0, Clothing == 0, Components == 0, Accessories == 1)
e[1:10,]
## FY Bikes Clothing Components Accessories OnlineOrderFlag class prob
## 1 2014 0 0 0 1 1 1 99.796
## 2 2014 0 0 0 1 1 1 99.796
## 3 2014 0 0 0 1 1 1 99.796
## 4 2014 0 0 0 1 1 1 99.796
## 5 2014 0 0 0 1 1 1 99.796
## 6 2014 0 0 0 1 1 1 99.796
## 7 2014 0 0 0 1 1 1 99.796
## 8 2014 0 0 0 1 1 1 99.796
## 9 2014 0 0 0 1 1 1 99.796
## 10 2014 0 0 0 1 1 1 99.796
r = data %>% filter(Bikes == 1, Clothing == 1, Accessories == 1, OnlineOrderFlag == 0)
r[1:10,]
## FY Bikes Clothing Components Accessories OnlineOrderFlag class prob
## 1 2014 1 1 1 1 0 0 0
## 2 2014 1 1 1 1 0 0 0
## 3 2014 1 1 1 1 0 0 0
## 4 2014 1 1 1 1 0 0 0
## 5 2014 1 1 1 1 0 0 0
## 6 2014 1 1 1 1 0 0 0
## 7 2014 1 1 1 1 0 0 0
## 8 2014 1 1 1 1 0 0 0
## 9 2014 1 1 1 1 0 0 0
## 10 2014 1 1 1 1 0 0 0
t = data %>% filter(Bikes == 1, Clothing == 1, Accessories == 1, OnlineOrderFlag == 1)
t[1:10,]
## FY Bikes Clothing Components Accessories OnlineOrderFlag class prob
## 1 2014 1 1 0 1 1 1 99.014
## 2 2014 1 1 0 1 1 1 99.014
## 3 2014 1 1 0 1 1 1 99.014
## 4 2014 1 1 0 1 1 1 99.014
## 5 2014 1 1 0 1 1 1 99.014
## 6 2014 1 1 0 1 1 1 99.014
## 7 2014 1 1 0 1 1 1 99.014
## 8 2014 1 1 0 1 1 1 99.014
## 9 2014 1 1 0 1 1 1 99.014
## 10 2014 1 1 0 1 1 1 99.014
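Filtering one combination at a time gets repetitive, since every row with the same flags has the same class and prob. As a rough alternative, the sketch below lists all eight combinations of Bikes, Clothing, and Accessories at once, holding Components at 0. It rebuilds the predictions from coefficients copied by hand from summary(m0); grid and b are illustrative names only.

```r
# Coefficients copied from summary(m0); Components is held at 0.
b <- c(Intercept = 3.5617, Bikes = -1.3016,
       Clothing = -0.2829, Accessories = 2.6323)

# Every combination of the three remaining 0/1 flags.
grid <- expand.grid(Bikes = 0:1, Clothing = 0:1, Accessories = 0:1)

# Linear predictor (log-odds) for each combination.
grid$logodd <- b[["Intercept"]] +
  b[["Bikes"]]       * grid$Bikes +
  b[["Clothing"]]    * grid$Clothing +
  b[["Accessories"]] * grid$Accessories

# Same rules as the class and prob columns in the data.
grid$class <- as.numeric(grid$logodd > 0)
grid$prob  <- round(plogis(grid$logodd) * 100, 3)

print(grid[, c("Bikes", "Clothing", "Accessories", "class", "prob")])
```

With the fitted model in the session, adding a Components column to grid and calling `predict(m0, newdata = grid, type = "response")` should give the same probabilities without copying coefficients.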
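You may be curious what is actually going on with Components, the factor we dropped. Let's pull it out and look at the rows with Components equal to 1 and equal to 0 separately.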
check1 = data %>% filter(Components == 1)
summary(check1)
## FY Bikes Clothing Components Accessories
## Min. :2014 Min. :0.0000 Min. :0.0000 Min. :1 Min. :0.0000
## 1st Qu.:2014 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:1 1st Qu.:0.0000
## Median :2014 Median :1.0000 Median :1.0000 Median :1 Median :0.0000
## Mean :2014 Mean :0.8107 Mean :0.6341 Mean :1 Mean :0.4038
## 3rd Qu.:2014 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1 3rd Qu.:1.0000
## Max. :2014 Max. :1.0000 Max. :1.0000 Max. :1 Max. :1.0000
## OnlineOrderFlag class prob
## Min. :0 Min. :0 Min. :0
## 1st Qu.:0 1st Qu.:0 1st Qu.:0
## Median :0 Median :0 Median :0
## Mean :0 Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0 Max. :0
check1[1:10,]
## FY Bikes Clothing Components Accessories OnlineOrderFlag class prob
## 1 2014 1 0 1 0 0 0 0
## 2 2014 1 1 1 1 0 0 0
## 3 2014 1 1 1 1 0 0 0
## 4 2014 1 1 1 1 0 0 0
## 5 2014 1 0 1 0 0 0 0
## 6 2014 1 1 1 1 0 0 0
## 7 2014 1 1 1 0 0 0 0
## 8 2014 1 1 1 1 0 0 0
## 9 2014 0 1 1 1 0 0 0
## 10 2014 0 0 1 0 0 0 0
check0 = data %>% filter(Components == 0)
summary(check0)
## FY Bikes Clothing Components Accessories
## Min. :2014 Min. :0.000 Min. :0.0000 Min. :0 Min. :0.0000
## 1st Qu.:2014 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0 1st Qu.:1.0000
## Median :2014 Median :0.000 Median :0.0000 Median :0 Median :1.0000
## Mean :2014 Mean :0.443 Mean :0.3419 Mean :0 Mean :0.8066
## 3rd Qu.:2014 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:0 3rd Qu.:1.0000
## Max. :2014 Max. :1.000 Max. :1.0000 Max. :0 Max. :1.0000
## OnlineOrderFlag class prob
## Min. :0.0000 Min. :1 Min. :87.84
## 1st Qu.:1.0000 1st Qu.:1 1st Qu.:99.01
## Median :1.0000 Median :1 Median :99.25
## Mean :0.9803 Mean :1 Mean :98.03
## 3rd Qu.:1.0000 3rd Qu.:1 3rd Qu.:99.80
## Max. :1.0000 Max. :1 Max. :99.80
check0[1:10,]
## FY Bikes Clothing Components Accessories OnlineOrderFlag class prob
## 1 2014 0 0 0 1 1 1 99.796
## 2 2014 1 0 0 0 1 1 90.552
## 3 2014 1 0 0 1 1 1 99.255
## 4 2014 0 1 0 1 1 1 99.730
## 5 2014 1 1 0 1 1 1 99.014
## 6 2014 1 1 0 0 1 1 87.839
## 7 2014 0 1 0 1 1 1 99.730
## 8 2014 1 0 0 1 1 1 99.255
## 9 2014 0 0 0 1 1 1 99.796
## 10 2014 0 0 0 1 1 1 99.796
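I won't go through every number, but the gist is this: every order containing Components was placed offline (OnlineOrderFlag is 0 throughout check1), so Components predicts the outcome perfectly, and that is why m0 could not pin down its coefficient.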
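A cross-tabulation is a more compact way to see how Components lines up with OnlineOrderFlag than reading two summary() outputs. The sketch below uses a small made-up data frame (toy is hypothetical and only mimics the pattern seen in check1), so the counts mean nothing; only the shape of the table matters.

```r
# Toy data (made up) mimicking check1: whenever Components == 1,
# OnlineOrderFlag is 0.
toy <- data.frame(
  Components      = c(1, 1, 1, 0, 0, 0, 0, 0),
  OnlineOrderFlag = c(0, 0, 0, 1, 1, 1, 0, 1)
)

# Cross-tabulate the predictor against the outcome.
tab <- table(Components = toy$Components,
             OnlineOrderFlag = toy$OnlineOrderFlag)
print(tab)

# A zero cell means one predictor level never occurs with one outcome.
# That is the pattern that lets a logistic coefficient grow without
# bound, with a huge standard error.
has_zero_cell <- any(tab == 0)
print(has_zero_cell)
```

On the real data the equivalent call would be `table(data$Components, data$OnlineOrderFlag)`; a zero in the Components = 1, OnlineOrderFlag = 1 cell would match what check1 shows.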