1153人阅读 评论(0)

# 期中作业参考

## 1 采用最小描述长度有指导地进行分箱处理

bank<-read.table("data/bank/bank.csv",header=TRUE,sep=";")
m<-ncol(bank)
factors<-lapply(bank,is.factor)
iffactor<-as.logical(factors)
m<-ncol(bank)
bank.factors<-(1:m)[iffactor]
bank.int<-(1:m)[!iffactor]
bankint<-bank[,c(bank.int,m)] # 产生数值型数据，并保存为bank.int
bankfac<-bank[,iffactor]
summary(bankfac)

         job          marital         education    default    housing
management :969   divorced: 528   primary  : 678   no :4445   no :1962
blue-collar:946   married :2797   secondary:2306   yes:  76   yes:2559
technician :768   single  :1196   tertiary :1350
services   :417
retired    :230
(Other)    :713
loan           contact         month         poutcome      y
no :3830   cellular :2896   may    :1398   failure: 490   no :4000
yes: 691   telephone: 301   jul    : 706   other  : 197   yes: 521
unknown  :1324   aug    : 633   success: 129
jun    : 531   unknown:3705
nov    : 389
apr    : 293
(Other): 571


library(discretization)
truey<-bank.test$y table(predicty, truey)  载入需要的程辑包：MASS truey predicty no yes no 1389 71 yes 215 134  ## 4 采用主成分因子分析等产生新的变量，之后进行判别分析， 这里以主成分为例进行 bankinttrain<-bankint[id,] p<-ncol(bankint) trainpr<-princomp(bankinttrain[,-p],cor=TRUE,loadings=TRUE) summary(trainpr)  Importance of components: Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Standard deviation 1.2904231 1.0802812 1.0501263 0.9857049 0.9408610 Proportion of Variance 0.2378845 0.1667153 0.1575379 0.1388020 0.1264599 Cumulative Proportion 0.2378845 0.4045999 0.5621378 0.7009398 0.8273997 Comp.6 Comp.7 Standard deviation 0.8991575 0.63223248 Proportion of Variance 0.1154978 0.05710256 Cumulative Proportion 0.9428974 1.00000000  1. 挑选其中前五个主成分构成新的变量 loading5<-as.matrix(trainpr$loadings)[,1:5]
bankprtrain<-data.frame(components,y=bankint[id,8])
#相应的测试集合需要进行类似的处理
bankprtest<-data.frame(components,y=bankint[-id,8])

2. 对训练集进行新的 Fisher线性判别分析，再运用到新的测试集合上
z<-lda(y~.,data=bankprtrain)
prpredic<-predict(z,newdata=bankprtest)
predicy<-prpredic$class truey<-bankprtest$y
table(predicy,truey)

       truey
predicy   no  yes
no  1565  158
yes   39   47


效果不如直接进行判别分析的好，不再进行下去

## 5 进行广义线性回归，对测试集进行预测

bankinttest<-bankint[-id,]
m<-glm(y ~.,family="binomial",subset=id,data=bankint)
step(m) # 逐步回归方法选择最优变量

Start:  AIC=1611.09
y ~ age + balance + day + duration + campaign + pdays + previous

Df Deviance    AIC
- day       1   1595.5 1609.5
- balance   1   1596.6 1610.6
- age       1   1596.7 1610.7
<none>          1595.1 1611.1
- pdays     1   1602.3 1616.3
- campaign  1   1602.9 1616.9
- previous  1   1604.3 1618.3
- duration  1   1905.0 1919.0

Step:  AIC=1609.46
y ~ age + balance + duration + campaign + pdays + previous

Df Deviance    AIC
- balance   1   1596.9 1608.9
- age       1   1597.0 1609.0
<none>          1595.5 1609.5
- pdays     1   1602.5 1614.5
- campaign  1   1602.9 1614.9
- previous  1   1604.6 1616.6
- duration  1   1905.1 1917.1

Step:  AIC=1608.89
y ~ age + duration + campaign + pdays + previous

Df Deviance    AIC
- age       1   1598.8 1608.8
<none>          1596.9 1608.9
- pdays     1   1603.9 1613.9
- campaign  1   1604.4 1614.4
- previous  1   1606.2 1616.2
- duration  1   1905.4 1915.4

Step:  AIC=1608.8
y ~ duration + campaign + pdays + previous

Df Deviance    AIC
<none>          1598.8 1608.8
- pdays     1   1605.6 1613.6
- campaign  1   1606.7 1614.7
- previous  1   1608.3 1616.3
- duration  1   1908.1 1916.1

Call:  glm(formula = y ~ duration + campaign + pdays + previous, family = "binomial",
data = bankint, subset = id)

Coefficients:
(Intercept)     duration     campaign        pdays     previous
-3.191878     0.003464    -0.084405     0.001790     0.110634

Degrees of Freedom: 2711 Total (i.e. Null);  2707 Residual
Null Deviance:	    1952
Residual Deviance: 1599 	AIC: 1609

m<-glm(y ~ age + duration + campaign + pdays + previous,family="binomial",subset=id,data=bankint)
summary(m)

Call:
glm(formula = y ~ age + duration + campaign + pdays + previous,
family = "binomial", data = bankint, subset = id)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-3.8052  -0.4398  -0.3412  -0.2859   2.5410

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.5381351  0.2880890 -12.281  < 2e-16 ***
age          0.0083428  0.0060098   1.388  0.16507
duration     0.0034554  0.0002189  15.785  < 2e-16 ***
campaign    -0.0829148  0.0328637  -2.523  0.01164 *
pdays        0.0018123  0.0006601   2.746  0.00604 **
previous     0.1094893  0.0347442   3.151  0.00163 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1952.3  on 2711  degrees of freedom
Residual deviance: 1596.9  on 2706  degrees of freedom
AIC: 1608.9

Number of Fisher Scoring iterations: 6

mpredict<-predict(m, newdata=bankinttest,type = "response")
predicty<-ifelse(mpredict<0.5,"no","yes") # 对预测值进行重新编码
# 列表输出结果
truey<-bankint\$y[-id]
table(predicty,truey)

        truey
predicty   no  yes
no  1582  175
yes   22   30


0
0

【直播】机器学习&数据挖掘7周实训--韦玮
【套餐】系统集成项目管理工程师顺利通关--徐朋
【直播】3小时掌握Docker最佳实战-徐西宁
【套餐】机器学习系列套餐（算法+实战）--唐宇迪
【直播】计算机视觉原理及实战--屈教授
【套餐】微信订阅号+服务号Java版 v2.0--翟东平
【直播】机器学习之矩阵--黄博士
【套餐】微信订阅号+服务号Java版 v2.0--翟东平
【直播】机器学习之凸优化--马博士
【套餐】Javascript 设计模式实战--曾亮

* 以上用户言论只代表其个人观点，不代表CSDN网站的观点或立场
个人资料
• 访问：22284次
• 积分：359
• 等级：
• 排名：千里之外
• 原创：14篇
• 转载：0篇
• 译文：0篇
• 评论：2条
文章分类
文章存档
最新评论