93、R语言教程详解

加载数据> w<-read.table("test.prn",header = T)> w  X.. X...11   A     22   B     33   C     54   D     5> library(readxl)> dat<-read_excel("test.xlsx")> dat# A tibble: 4 x 2  `商品` `价格`   <chr>  <dbl>1      A      22      B      33      C      54      D      5> bank=read.table("bank-full.csv",header = TRUE,sep=",")查看数据结构> str(bank)'data.frame':    41188 obs. of  21 variables: $ age           : int  56 57 37 40 56 45 59 41 24 25 ... $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ... $ marital       : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ... $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ... $ default       : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ... $ housing       : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ... $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ... $ contact       : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ... $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ... $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ... $ duration      : int  261 149 226 151 307 198 139 217 380 50 ... $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ... $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ... $ previous      : int  0 0 0 0 0 0 0 0 0 0 ... $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ... $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ... $ cons.price.idx: num  94 94 94 94 94 ... $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ... $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ... $ nr.employed   : num  5191 5191 5191 5191 5191 ... $ y             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...查看数据的最小值,最大值,中位数,平均数,分位数> summary(bank)      age                 job            marital      Min.   :17.00   admin.     :10422   divorced: 4612   1st Qu.:32.00   blue-collar: 9254   married :24928   Median :38.00   technician : 6743   single  :11568   Mean   :40.02   services   : 3969   unknown :   80   3rd Qu.:47.00   management : 2924                    Max.   :98.00   retired    : 1720                                    (Other)    : 6156                                  education        default         housing      university.degree  :12168   no     :32588   no     :18622   high.school        : 9515   unknown: 8597   unknown:  990   basic.9y           : 6045   yes    :    3   yes    :21576   professional.course: 5243                                   basic.4y           : 4176                                   basic.6y           : 2292                                   (Other)            : 1749                                        loan            contact          month       day_of_week no     :33950   cellular :26144   may    :13769   fri:7827    unknown:  990   telephone:15044   jul    : 7174   mon:8514    yes    : 6248                     aug    : 6178   thu:8623                                      jun    : 5318   tue:8090                                      nov    : 4101   wed:8134                                      apr    : 2632                                                 (Other): 2016                  duration         campaign          pdays       Min.   :   0.0   Min.   : 1.000   Min.   :  0.0   1st Qu.: 102.0   1st Qu.: 1.000   1st Qu.:999.0   Median : 180.0   Median : 2.000   Median :999.0   Mean   : 258.3   Mean   : 2.568   Mean   :962.5   3rd Qu.: 319.0   3rd Qu.: 3.000   3rd Qu.:999.0   Max.   :4918.0   Max.   :56.000   Max.   :999.0                                                        previous            poutcome      emp.var.rate      Min.   :0.000   failure    : 4252   Min.   :-3.40000   1st Qu.:0.000   nonexistent:35563   1st Qu.:-1.80000   Median :0.000   success    : 1373   Median : 1.10000   Mean   :0.173                       Mean   : 0.08189   3rd Qu.:0.000                       3rd Qu.: 1.40000   Max.   :7.000                       Max.   : 1.40000                                                          cons.price.idx  cons.conf.idx     euribor3m     Min.   :92.20   Min.   :-50.8   Min.   :0.634   1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.344   Median :93.75   Median :-41.8   Median :4.857   Mean   :93.58   Mean   :-40.5   Mean   :3.621   3rd Qu.:93.99   3rd Qu.:-36.4   3rd Qu.:4.961   Max.   :94.77   Max.   :-26.9   Max.   :5.045                                                    nr.employed     y         Min.   :4964   no :36548   1st Qu.:5099   yes: 4640   Median :5191               Mean   :5167               3rd Qu.:5228               Max.   :5228                                         > psych::describe(bank)        方差  个数    平均值  标准差  均值    去掉最大   中位数   最小值  最大值  极差    偏差        峰度                                  绝对偏差                             最小值                            之后                            的平均数               vars     n    mean     sd  median trimmed   mad     min     max   range  skew    kurtosisage               1 41188   40.02  10.42   38.00   39.30  10.38   17.00   98.00   81.00  0.78     0.79job*              2 41188    4.72   3.59    3.00    4.48   2.97    1.00   12.00   11.00  0.45    -1.39marital*          3 41188    2.17   0.61    2.00    2.21   0.00    1.00    4.00    3.00 -0.06    -0.34education*        4 41188    4.75   2.14    4.00    4.88   2.97    1.00    8.00    7.00 -0.24    -1.21default*          5 41188    1.21   0.41    1.00    1.14   0.00    1.00    3.00    2.00  1.44     0.07housing*          6 41188    2.07   0.99    3.00    2.09   0.00    1.00    3.00    2.00 -0.14    -1.95loan*             7 41188    1.33   0.72    1.00    1.16   0.00    1.00    3.00    2.00  1.82     1.38contact*          8 41188    1.37   0.48    1.00    1.33   0.00    1.00    2.00    1.00  0.56    -1.69month*            9 41188    5.23   2.32    5.00    5.31   2.97    1.00   10.00    9.00 -0.31    -1.03day_of_week*     10 41188    3.00   1.40    3.00    3.01   1.48    1.00    5.00    4.00  0.01    -1.27duration         11 41188  258.29 259.28  180.00  210.61 139.36    0.00 4918.00 4918.00  3.26    20.24campaign         12 41188    2.57   2.77    2.00    1.99   1.48    1.00   56.00   55.00  4.76    36.97pdays            13 41188  962.48 186.91  999.00  999.00   0.00    0.00  999.00  999.00 -4.92    22.23previous         14 41188    0.17   0.49    0.00    0.05   0.00    0.00    7.00    7.00  3.83    20.11poutcome*        15 41188    1.93   0.36    2.00    2.00   0.00    1.00    3.00    2.00 -0.88     3.98emp.var.rate     16 41188    0.08   1.57    1.10    0.27   0.44   -3.40    1.40    4.80 -0.72    -1.06cons.price.idx   17 41188   93.58   0.58   93.75   93.58   0.56   92.20   94.77    2.57 -0.23    -0.83cons.conf.idx    18 41188  -40.50   4.63  -41.80  -40.60   6.52  -50.80  -26.90   23.90  0.30    -0.36euribor3m        19 41188    3.62   1.73    4.86    3.81   0.16    0.63    5.04    4.41 -0.71    -1.41nr.employed      20 41188 5167.04  72.25 5191.00 5178.43  55.00 4963.60 5228.10  264.50 -1.04     0.00y*               21 41188    1.11   0.32    1.00    1.02   0.00    1.00    2.00    1.00  2.45     4.00               seage            0.05job*           0.02marital*       0.00education*     0.01default*       0.00housing*       0.00loan*          0.00contact*       0.00month*         0.01day_of_week*   0.01duration       1.28campaign       0.01pdays          0.92previous       0.00poutcome*      0.00emp.var.rate   0.01cons.price.idx 0.00cons.conf.idx  0.02euribor3m      0.01nr.employed    0.36y*             0.00查看数据是否有缺失值> sapply(bank,anyNA)           age            job        marital      education          FALSE          FALSE          FALSE          FALSE        default        housing           loan        contact          FALSE          FALSE          FALSE          FALSE          month    day_of_week       duration       campaign          FALSE          FALSE          FALSE          FALSE          pdays       previous       poutcome   emp.var.rate          FALSE          FALSE          FALSE          FALSE cons.price.idx  cons.conf.idx      euribor3m    nr.employed          FALSE          FALSE          FALSE          FALSE              y          FALSE 成功与不成功的个数> table(bank$y)   no   yes 36548  4640 在是否结婚这个属性的取值与是否成功的数量比较> table(bank$y,bank$marital)           divorced married single unknown  no      4136   22396   9948      68  yes      476    2532   1620      12> xtabs(~y+marital,data=bank)     maritaly     divorced married single unknown  no      4136   22396   9948      68  yes      476    2532   1620      12> tab=table(bank$y,bank$marital)> tab           divorced married single unknown  no      4136   22396   9948      68  yes      476    2532   1620      12在是否结婚这个属性上的取值> margin.table(tab,2)divorced  married   single  unknown     4612    24928    11568       80 > margin.table(tab,1)   no   yes 36548  4640 在是否结婚这个属性上横向看概率> prop.table(tab,1)              divorced     married      single     unknown  no  0.113166247 0.612783189 0.272189997 0.001860567  yes 0.102586207 0.545689655 0.349137931 0.002586207在是否结婚这个属性上纵向看概率> prop.table(tab,2)            divorced   married    single   unknown  no  0.8967910 0.8984275 0.8599585 0.8500000  yes 0.1032090 0.1015725 0.1400415 0.1500000平的列联表以第一列和第二列,展开分类group by 1,2以col.vars 的取值 进行次数统计> ftable(bank[,c(3,4,21)],row.vars = 1:2,col.vars = "y")                             y   no  yesmarital  education                      divorced basic.4y               406   83         basic.6y               169   13         basic.9y               534   31         high.school           1086  107         illiterate               1    1         professional.course    596   61         university.degree     1177  160         unknown                167   20married  basic.4y              2915  313         basic.6y              1628  139         basic.9y              3858  298         high.school           4683  475         illiterate              12    3         professional.course   2799  357         university.degree     5573  821         unknown                928  126single   basic.4y               422   31         basic.6y               301   36         basic.9y              1174  142         high.school           2702  448         illiterate               1    0         professional.course   1247  177         university.degree     3723  683         unknown                378  103unknown  basic.4y                 5    1         basic.6y                 6    0         basic.9y                 6    2         high.school             13    1         illiterate               0    0         professional.course      6    0         university.degree       25    6         unknown                  7    2卡方检验,在p值小于2.2e-16时,拒绝原假设,认为数据不服从卡方分布> chisq.test(tab)    Pearson's Chi-squared testdata:  tabX-squared = 122.66, df = 3, p-value < 2.2e-16画直方图> hist(bank$age)> library(lattice)画连续变量的分布,就是把直方图的中位数连接起来以年龄为横轴,y为纵轴,数据是bank,画图,auto.key是否有图例> densityplot(~age,groups = y,data=bank,plot.point=FALSE,auto.key = TRUE)画Box图> boxplot(age~y,data=bank)双样本t分布检验,p值小于0.05时拒绝原假设这里的原假设是两个样本没有相关性得到的结果是p值为1.805e-06,拒绝两个样本没有相关性的假设这里认为两个样本有相关性> t.test(age~y,data=bank,alternative="two.sided",var.equal=FALSE)    Welch Two Sample t-testdata:  age by yt = -4.7795, df = 5258.5, p-value = 1.805e-06alternative hypothesis: true difference in means is not equal to 095 percent confidence interval: -1.4129336 -0.5909889sample estimates: mean in group no mean in group yes          39.91119          40.91315 数据可视化画饼图> tab=table(bank$marital)> pie(tab)画直方图> tab=table(bank$marital)> barplot(tab)画下面这个图> tab=table(bank$marital,bank$y)> plot(tab) 画层叠直方图> tab=table(bank$marital,bank$y)> lattice::barchart(tab,auto.key=TRUE) 加载这个包,准备画图> library(dplyr)> data=group_by(bank,marital,y)> data=tally(data)!!!!!!!!!!!!!> ggplot2::ggplot(data=data,mapping=aes(marital,n))+geom_bar(mapping=aes(fill=y),position="dodge",stat="identity")数据预处理分组之后再画图> labels=c('青年','中年','老年')> bank$age_group=cut(bank$age,breaks = c(0,35,55,100),right = FALSE,labels = labels)> library(ggplot2)> ggplot(data=bank,mapping = aes(age_group))+geom_bar(mapping = aes(fill=y),position="dodge",stat="count") 衍生变量直接使用$符向原数据框添加新的变量> bank$log.cons.price.idx=log(bank$cons.price.idx)使用transform函数向原数据框添加变量> bank<-transform(bank,log.cons.price.idx=log(cons.price.idx),log.nr.employed=log(nr.employed))使用dplyr包里的mutate函数增加变量> bank<-dplyr::mutate(bank,log.cons.price.idx=log(cons.price.idx))使用dplyr包里的transmute函数只保留新生成的变量> bank2<-dplyr::transmute(bank,log.cons.price.idx=log(cons.price.idx),log.nr.employed=log(nr.employed))中心化> v=1:10> v1=v-mean(v)> v2=scale(v,center=TRUE,scale = FALSE)无量纲化> V1=v/sqrt(sum(v^2)/(length(v)-1))> v2=scale(v,center=FALSE,scale=TRUE)根据最大最小值进行归一化> v3=(v-min(v))/(max(v)-min(v))进行标准正态化> v1=(v-mean(v))/sd(v)> v2=scale(v,center = TRUE,scale=TRUE)Box-Cox变换使用car包里的boxCox函数> install.packages("car")> library(car)> boxCox(age~.,data=bank)  使用caret包,做Box-Cox变换> install.packages("caret")> library(caret)> dat<-subset(bank,select="age")> trans<-preProcess(dat,method=C("BoxCox"))数据预处理下违反常识的异常值基于数据分布的异常值(离群点)识别bank.dirty=read.csv("bank-dirty.csv")summary(bank.dirty)     age                  job            marital                    education     Min.   : 17.00   admin.     :10422   divorced: 4612   university.degree  :12165   1st Qu.: 32.00   blue-collar: 9254   married :24928   high.school        : 9515   Median : 38.00   technician : 6743   single  :11568   basic.9y           : 6043   Mean   : 40.03   services   : 3969   NA's    :   80   professional.course: 5242   3rd Qu.: 47.00   management : 2924                    basic.4y           : 4175   Max.   :123.00   (Other)    : 7546                    (Other)            : 2310   NA's   :2        NA's       :  330                    NA's               : 1738   default      housing        loan            contact          month       no  :32588   no  :18622   no  :33950   cellular :26144   may    :13769   yes :    3   yes :21576   yes : 6248   telephone:15044   jul    : 7174   NA's: 8597   NA's:  990   NA's:  990                     aug    : 6178                                                            jun    : 5318                                                            nov    : 4101                                                            apr    : 2632                                                            (Other): 2016   day_of_week    duration         campaign          pdays          previous     fri:7827    Min.   :   0.0   Min.   : 1.000   Min.   :  0.0   Min.   :0.000   mon:8514    1st Qu.: 102.0   1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.000   thu:8623    Median : 180.0   Median : 2.000   Median :999.0   Median :0.000   tue:8090    Mean   : 258.3   Mean   : 2.568   Mean   :962.5   Mean   :0.173   wed:8134    3rd Qu.: 319.0   3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.000               Max.   :4918.0   Max.   :56.000   Max.   :999.0   Max.   :7.000                                                                                        poutcome      emp.var.rate      cons.price.idx  cons.conf.idx   failure    : 4252   Min.   :-3.40000   Min.   :92.20   Min.   :-50.8   nonexistent:35563   1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7   success    : 1373   Median : 1.10000   Median :93.75   Median :-41.8                       Mean   : 0.08189   Mean   :93.58   Mean   :-40.5                       3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4                       Max.   : 1.40000   Max.   :94.77   Max.   :-26.9                                                                            euribor3m      nr.employed     y         Min.   :0.634   Min.   :4964   no :36548   1st Qu.:1.344   1st Qu.:5099   yes: 4640   Median :4.857   Median :5191               Mean   :3.621   Mean   :5167               3rd Qu.:4.961   3rd Qu.:5228               Max.   :5.045   Max.   :5228              常识告诉我们,虽然123岁的老人存在,但概率也极低,也不太可能是银行的客户找出在年龄这一列的上离群值和下离群值> head(bank.dirty[order(bank.dirty$age,decreasing = TRUE),'age',drop=FALSE],n=5)      age39494 12338453  9838456  9827827  9538922  94> tail(bank.dirty[order(bank.dirty$age,decreasing = TRUE),'age',drop=FALSE],n=5)      age37559  1737580  1738275  17120    NA156    NA异常值的处理当作缺失值处理> bank.dirty$age[which(bank.dirty$age>98)]<-NA删除或者插补重编码职业类型有12个分类,不利于后续分析,把除了unknown以外的分类进行重新编码,简化成4类Month有12个分类,把它转化成季度Education的分类,除了unknow之外有7类进行重编码levels(bank.dirty$job) <- c( "management","services","entrepreneur","entrepreneur",                        "management","unemployed",  "entrepreneur","services",                        "unemployed","services","unemployed","unknown" )> levels(bank.dirty$month) <- c("Q2","Q3","Q4","Q3","Q2",                         "Q1","Q2","Q4","Q4","Q3")> > levels(bank.dirty$education) <- c( "primary","primary","primary","secondary",                              "primary","tertiary","tertiary","unknown")缺失值分类较多,分类是unknown,不能给我们提供信息有些模型不能处理缺失值,比如Logistic回归缺失值插补的方法1、    用中位数或众数插补> library(imputeMissings)> bank.clean<-impute(bank.dirty,object = compute(bank.dirty,method = "median/mode"))2、    最邻近(knn)插补library(DMwR)bank.clean=knnImputation(bank.dirty,k=5)3、    随机森林插补library(missForest) Imp = missForest(bank.dirty) bank.clean = Imp$ximp缺失值插补的R包1、    imputeMissings包2、    DMwR包用Logistic回归建立客户响应模型1、    广义线性模型广义线性模型擅长于处理因变量不是连续变量的问题1)    Y是分类变量2)    Y是定序变量3)    Y是离散取值2、    当Y取值是0-1二分类变量是,就是Logistic回归Logistic回归在R中的实现数据重编码bank$y=ifelse(bank$y=='yes',1,0)改成以Q1为参考因子bank$month<-relevel(bank$month,ref="Q1")构建Logistic回归模型> model<-glm(y~.,data=bank,family = 'binomial')> summary(model)Call:glm(formula = y ~ ., family = "binomial", data = bank)Deviance Residuals:     Min       1Q   Median       3Q      Max  -5.9958  -0.3082  -0.1887  -0.1333   3.4283  Coefficients: (1 not defined because of singularities)                               Estimate Std. Error z value Pr(>|z|)    (Intercept)                  -1.957e+02  1.935e+01 -10.116  < 2e-16 ***age                           1.851e-03  2.415e-03   0.767 0.443289    jobblue-collar               -2.659e-01  7.942e-02  -3.348 0.000814 ***jobentrepreneur              -2.029e-01  1.248e-01  -1.626 0.103924    jobhousemaid                 -3.628e-02  1.475e-01  -0.246 0.805705    jobmanagement                -8.054e-02  8.501e-02  -0.947 0.343423    jobretired                    2.928e-01  1.067e-01   2.743 0.006092 ** jobself-employed             -1.680e-01  1.176e-01  -1.428 0.153332    jobservices                  -1.497e-01  8.552e-02  -1.751 0.079969 .  jobstudent                    2.674e-01  1.106e-01   2.416 0.015680 *  jobtechnician                 3.462e-03  7.096e-02   0.049 0.961086    jobunemployed                 8.514e-03  1.273e-01   0.067 0.946686    jobunknown                   -8.046e-02  2.390e-01  -0.337 0.736420    maritalmarried                1.567e-02  6.824e-02   0.230 0.818420    maritalsingle                 6.620e-02  7.791e-02   0.850 0.395473    maritalunknown                6.303e-02  4.113e-01   0.153 0.878211    educationbasic.6y             9.647e-02  1.202e-01   0.803 0.422195    educationbasic.9y            -2.154e-02  9.494e-02  -0.227 0.820557    educationhigh.school          3.381e-02  9.188e-02   0.368 0.712895    educationilliterate           1.132e+00  7.395e-01   1.531 0.125887    educationprofessional.course  1.136e-01  1.013e-01   1.121 0.262175    educationuniversity.degree    2.134e-01  9.188e-02   2.322 0.020211 *  educationunknown              1.361e-01  1.196e-01   1.138 0.255314    defaultunknown               -3.055e-01  6.712e-02  -4.552 5.32e-06 ***defaultyes                   -7.150e+00  1.135e+02  -0.063 0.949784    housingunknown               -7.385e-02  1.390e-01  -0.531 0.595260    housingyes                   -3.740e-03  4.121e-02  -0.091 0.927695    loanunknown                          NA         NA      NA       NA    loanyes                      -6.362e-02  5.725e-02  -1.111 0.266454    contacttelephone             -6.068e-01  7.124e-02  -8.518  < 2e-16 ***monthQ2                      -2.192e+00  1.125e-01 -19.479  < 2e-16 ***monthQ3                      -1.463e+00  1.148e-01 -12.747  < 2e-16 ***monthQ4                      -1.995e+00  1.240e-01 -16.088  < 2e-16 ***day_of_weekmon               -1.216e-01  6.588e-02  -1.846 0.064887 .  day_of_weekthu                6.375e-02  6.382e-02   0.999 0.317842    day_of_weektue                6.867e-02  6.545e-02   1.049 0.294118    day_of_weekwed                1.436e-01  6.530e-02   2.199 0.027911 *  duration                      4.667e-03  7.397e-05  63.092  < 2e-16 ***campaign                     -4.543e-02  1.158e-02  -3.922 8.77e-05 ***pdays                        -9.627e-04  2.162e-04  -4.452 8.50e-06 ***previous                     -5.806e-02  5.879e-02  -0.988 0.323369    poutcomenonexistent           4.507e-01  9.372e-02   4.809 1.51e-06 ***poutcomesuccess               9.371e-01  2.106e-01   4.451 8.56e-06 ***emp.var.rate                 -1.389e+00  7.693e-02 -18.057  < 2e-16 ***cons.price.idx                1.815e+00  1.193e-01  15.218  < 2e-16 ***cons.conf.idx                 3.353e-02  6.664e-03   5.033 4.84e-07 ***euribor3m                     6.054e-02  1.126e-01   0.537 0.590987    nr.employed                   4.937e-03  1.873e-03   2.635 0.008413 ** ---Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1(Dispersion parameter for binomial family taken to be 1)    Null deviance: 28999  on 41187  degrees of freedomResidual deviance: 17199  on 41141  degrees of freedomAIC: 17293Number of Fisher Scoring iterations: 10> exp(coef(model))                 (Intercept)                          age               jobblue-collar                 9.856544e-86                 1.001853e+00                 7.665077e-01              jobentrepreneur                 jobhousemaid                jobmanagement                 8.163314e-01                 9.643733e-01                 9.226187e-01                   jobretired             jobself-employed                  jobservices                 1.340142e+00                 8.453874e-01                 8.609387e-01                   jobstudent                jobtechnician                jobunemployed                 1.306514e+00                 1.003468e+00                 1.008550e+00                   jobunknown               maritalmarried                maritalsingle                 9.226922e-01                 1.015789e+00                 1.068445e+00               maritalunknown            educationbasic.6y            educationbasic.9y                 1.065061e+00                 1.101276e+00                 9.786948e-01         educationhigh.school          educationilliterate educationprofessional.course                 1.034388e+00                 3.101297e+00                 1.120248e+00   educationuniversity.degree             educationunknown               defaultunknown                 1.237856e+00                 1.145744e+00                 7.367445e-01                   defaultyes               housingunknown                   housingyes                 7.851906e-04                 9.288126e-01                 9.962671e-01                  loanunknown                      loanyes             contacttelephone                           NA                 9.383587e-01                 5.450980e-01                      monthQ2                      monthQ3                      monthQ4                 1.116739e-01                 2.314802e-01                 1.360620e-01               day_of_weekmon               day_of_weekthu               day_of_weektue                 8.854888e-01                 1.065828e+00                 1.071082e+00               day_of_weekwed                     duration                     campaign                 1.154380e+00                 1.004678e+00                 9.555850e-01                        pdays                     previous          poutcomenonexistent                 9.990378e-01                 9.435960e-01                 1.569466e+00              poutcomesuccess                 emp.var.rate               cons.price.idx                 2.552531e+00                 2.493091e-01                 6.140533e+00                cons.conf.idx                    euribor3m                  nr.employed                 1.034103e+00                 1.062408e+00                 1.004949e+00 Job变量的基准水平是management,从上面的结果看,服务业和自主劳动者购买银行产品的几率(odds)是管理岗从业人员的0.88倍,未就业人员购买银行产品的几率是管理岗人员的1.25倍> summary(model.step)向前逐步回归> model.step=step(model,direction = "backward")向后逐步回归> model.step = step(model, direction = "forward")双向逐步回归> model.step = step(model, direction = "both")> summary(model.step)Call:glm(formula = y ~ job + education + default + contact + month +     day_of_week + duration + campaign + pdays + poutcome + emp.var.rate +     cons.price.idx + cons.conf.idx + nr.employed, family = "binomial",     data = bank)Deviance Residuals:     Min       1Q   Median       3Q      Max  -5.9884  -0.3088  -0.1887  -0.1332   3.4026  Coefficients:                               Estimate Std. Error z value Pr(>|z|)    (Intercept)                  -2.031e+02  1.426e+01 -14.246  < 2e-16 ***jobblue-collar               -2.700e-01  7.917e-02  -3.411 0.000648 ***jobentrepreneur              -2.043e-01  1.242e-01  -1.645 0.100003    jobhousemaid                 -2.832e-02  1.464e-01  -0.193 0.846590    jobmanagement                -8.368e-02  8.409e-02  -0.995 0.319670    jobretired                    3.234e-01  9.130e-02   3.542 0.000397 ***jobself-employed             -1.670e-01  1.176e-01  -1.421 0.155435    jobservices                  -1.528e-01  8.545e-02  -1.789 0.073666 .  jobstudent                    2.682e-01  1.046e-01   2.565 0.010316 *  jobtechnician                 4.389e-03  7.093e-02   0.062 0.950665    jobunemployed                 8.975e-03  1.271e-01   0.071 0.943715    jobunknown                   -6.363e-02  2.378e-01  -0.268 0.789057    educationbasic.6y             8.993e-02  1.196e-01   0.752 0.452024    educationbasic.9y            -2.716e-02  9.416e-02  -0.288 0.772992    educationhigh.school          2.890e-02  9.053e-02   0.319 0.749573    educationilliterate           1.118e+00  7.398e-01   1.511 0.130744    educationprofessional.course  1.084e-01  1.004e-01   1.079 0.280686    educationuniversity.degree    2.103e-01  9.017e-02   2.332 0.019678 *  educationunknown              1.363e-01  1.195e-01   1.140 0.254110    defaultunknown               -3.017e-01  6.666e-02  -4.526 6.02e-06 ***defaultyes                   -7.141e+00  1.135e+02  -0.063 0.949831    contacttelephone             -6.011e-01  7.069e-02  -8.504  < 2e-16 ***monthQ2                      -2.210e+00  1.108e-01 -19.939  < 2e-16 ***monthQ3                      -1.475e+00  1.146e-01 -12.869  < 2e-16 ***monthQ4                      -1.982e+00  1.183e-01 -16.755  < 2e-16 ***day_of_weekmon               -1.210e-01  6.584e-02  -1.837 0.066174 .  day_of_weekthu                6.208e-02  6.374e-02   0.974 0.330066    day_of_weektue                6.851e-02  6.538e-02   1.048 0.294651    day_of_weekwed                1.420e-01  6.525e-02   2.176 0.029592 *  duration                      4.667e-03  7.396e-05  63.099  < 2e-16 ***campaign                     -4.587e-02  1.158e-02  -3.960 7.49e-05 ***pdays                        -8.822e-04  2.024e-04  -4.358 1.31e-05 ***poutcomenonexistent           5.219e-01  6.356e-02   8.211  < 2e-16 ***poutcomesuccess               9.996e-01  2.028e-01   4.928 8.31e-07 ***emp.var.rate                 -1.376e+00  6.885e-02 -19.980  < 2e-16 ***cons.price.idx                1.845e+00  1.041e-01  17.725  < 2e-16 ***cons.conf.idx                 3.622e-02  4.853e-03   7.464 8.42e-14 ***nr.employed                   5.883e-03  9.765e-04   6.024 1.70e-09 ***---Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1(Dispersion parameter for binomial family taken to be 1)    Null deviance: 28999  on 41187  degrees of freedomResidual deviance: 17203  on 41150  degrees of freedomAIC: 17279Number of Fisher Scoring iterations: 10模型预测用predict函数,参数type=’response’Newdata参数是要预测的数据集> prob<-predict(model.step,type = 'response')> head(prob)          1           2           3           4           5           6 0.015029328 0.006044212 0.011640349 0.010173952 0.016897254 0.007174804 假设以0.5为临界值> pre<-ifelse(prob>0.5,1,0)> table(pre,bank$y)   pre     0     1  0 35596  2667  1   952  1973> 预测的准确率> (35592+1964)/(35592+2676+956+1964)[1] 0.911819实际有响应的客户被识别出了多少> 1964/(1964+2676)[1] 0.4232759模型评估> confusionMatrix(bank$y,pre,pos='1')Confusion Matrix and Statistics          ReferencePrediction     0     1         0 35596   952         1  2667  1973                                                         Accuracy : 0.9121                           95% CI : (0.9094, 0.9149)    No Information Rate : 0.929               P-Value [Acc > NIR] : 1                                                                           Kappa : 0.476            Mcnemar's Test P-Value : <2e-16                                                                Sensitivity : 0.67453                     Specificity : 0.93030                  Pos Pred Value : 0.42522                  Neg Pred Value : 0.97395                      Prevalence : 0.07102                  Detection Rate : 0.04790            Detection Prevalence : 0.11265               Balanced Accuracy : 0.80241                                                          'Positive' Class : 1                                                   Kappa 统计量(kappa statistic)用于评判分类器的分类结果与随机分类的差异度用Kappa统计量评价:    较差:小于0.20    一般:0.20至0.40    稳健:0.40至0.60    好的:0.60至0.80很好的:0.80至1.00ROC曲线pred<-prediction(prob,bank$y)perf<-performance(pred,measure = "tpr",x="fpr")plot(perf) RandomForest加载数据列> data=read.table("input.txt",header = TRUE)> str(data)'data.frame':    222 obs. of  23 variables: $ Acti_Profile             : num  0 0 0 0 0 0 0 0 0 0 ... $ Activity                 : num  1.25 0 0.938 6.562 0 ... $ Diastolic_PTT            : num  256 240 253 0 241 ... $ Diastolic                : num  73.2 78.6 74 0 78.4 ... $ Heart_Rate_Curve         : num  81.2 69.7 77.6 95 83.6 ... $ Heart_Rate_Variability_HF: num  131 250 135 144 141 ... $ Heart_Rate_Variability_LF: num  311 218 203 301 244 ... $ MAP                      : num  86 93.5 86.9 0 91.7 ... $ Position                 : num  0 0 0 1 0 0 0 0 0 0 ... $ PTT_Raw                  : num  308 288 308 0 295 ... $ RR_Interval              : num  734 878 773 632 714 ... $ Sleep_Wake               : num  1 1 1 1 1 0 1 1 0 0 ... $ SpO2                     : num  0 0 99 0 98.4 ... $ Sympatho_Vagal_Balance   : num  23 8.17 14.5 20.4 16.88 ... $ Systolic_PTT             : num  308 288 307 0 295 ... $ Systolic                 : num  113 124 113 0 119 ... $ Autonomic_arousals       : num  0 0 0 0 0 0 0 0 0 0 ... $ Cardio_complex           : num  0 0 0 1 0 0 0 0 0 0 ... $ Cardio_rhythm            : num  0 0 2 0 0 0 0 0 0 0 ... $ Classification_Arousal   : num  0 0 0 0 0 0 0 0 0 0 ... $ PTT_Events               : num  1 0 2 0 0 0 0 0 0 0 ... $ Systolic_Events          : num  1 0 1 0 0 0 0 0 0 0 ... $ y                        : num  1 0 1 0 0 0 0 0 0 0 ...加载随机森林包> library(randomForest)进行训练  以y作为因变量,其余数据作为自变量> rf <- randomForest(y ~ ., data=data, ntree=100, proximity=TRUE,importance=TRUE)> plot(rf) 重要性检测衡量把一个变量的取值变为随机数,随机森林预测准确性的降低程度> importance(rf,type=1)                              %IncMSEActi_Profile               0.00000000Activity                   0.99353251Diastolic_PTT              0.32193611Diastolic                  1.99891809Heart_Rate_Curve           0.92001352Heart_Rate_Variability_HF  2.07870722Heart_Rate_Variability_LF -0.24957163MAP                        0.48142975Position                   1.86876751PTT_Raw                    1.94648914RR_Interval                0.60557964Sleep_Wake                 1.00503782SpO2                       0.25396165Sympatho_Vagal_Balance     1.42906765Systolic_PTT               1.27965813Systolic                   0.77382673Autonomic_arousals         0.00000000Cardio_complex             1.00503782Cardio_rhythm              1.14283152Classification_Arousal    -0.04383997PTT_Events                 4.63980680Systolic_Events           33.29461169输出随机森林的模型> print(rf)Call: randomForest(formula = y ~ ., data = data, ntree = 100, proximity = TRUE,      importance = TRUE)                Type of random forest: regression                     Number of trees: 100No. of variables tried at each split: 7          Mean of squared residuals: 0.003226897     残差平方和SSE                    % Var explained: 98.7> 总平方和(SST):(样本数据-样本均值)的平方和回归平方和(SSR):(预测数据-样本均值)的平方和残差平方和(SSE):(样本数据-预测数据均值)的平方和SST = SSR + SSE   基尼指数:> importance(rf,type=2)                          IncNodePurityActi_Profile                0.000000000Activity                    0.445181480Diastolic_PTT               0.452221870Diastolic                   0.449372186Heart_Rate_Curve            0.473113852Heart_Rate_Variability_HF   0.226815300Heart_Rate_Variability_LF   0.205457353MAP                         0.536977574Position                    0.307333210PTT_Raw                     0.656726800RR_Interval                 0.452738011Sleep_Wake                  0.014423077SpO2                        1.793361279Sympatho_Vagal_Balance      0.352759689Systolic_PTT                0.851951505Systolic                    0.823955781Autonomic_arousals          0.000000000Cardio_complex              0.008047619Cardio_rhythm               0.141907084Classification_Arousal      0.085739429PTT_Events                  7.468690820Systolic_Events            39.000163018> 进行预测prediction <- predict(rf, data[,],type="response")输出预测结果table(observed =data$y,predicted=prediction) plot(prediction) 支持向量机library(e1071)svmfit<-svm(y~.,data=data,kernel="linear",cost=10,scale=FALSE)> print(svmfit)Call:svm(formula = y ~ ., data = data, kernel = "linear", cost = 10, scale = FALSE)Parameters:   SVM-Type:  eps-regression  SVM-Kernel:  linear        cost:  10       gamma:  0.04545455     epsilon:  0.1 Number of Support Vectors:  20> plot(svmfit,data) 神经网络> concrete<-read_excel("Concrete_Data.xls")> str(concrete)Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    1030 obs. of  9 variables: $ Cement      : num  540 540 332 332 199 ... $ Slag        : num  0 0 142 142 132 ... $ Ash         : num  0 0 0 0 0 0 0 0 0 0 ... $ water       : num  162 162 228 228 192 228 228 228 228 228 ... $ superplastic: num  2.5 2.5 0 0 0 0 0 0 0 0 ... $ coarseagg   : num  1040 1055 932 932 978 ... $ fineagg     : num  676 676 594 594 826 ... $ age         : num  28 28 270 365 360 90 365 28 28 28 ... $ strength    : num  80 61.9 40.3 41.1 44.3 ...> normalize <- function(x){ return ((x-min(x))/(max(x)-min(x)))}> concrete_norm <- as.data.frame(lapply(concrete,normalize))> concrete_train <- concrete_norm[1:773,]> concrete_test <- concrete_norm[774:1030,]> library(neuralnet)> concrete_model <- neuralnet(strength ~ Cement+Slag+Ash+water+superplastic+coarseagg+fineagg+age,data=concrete_train)> plot(concrete_model) model_results <- compute(concrete_model,concrete_test[1:8])predicted_strength <- model_results$net.result> cor(predicted_strength,concrete_test$strength)             [,1][1,] 0.7205120076> concrete_model2 <- neuralnet(strength ~ Cement+Slag+Ash+water+superplastic+coarseagg+fineagg+age,data=concrete_train,hidden=5)> plot(concrete_model2) 计算误差> model_results2 <- compute(concrete_model2,concrete_test[1:8])> predicted_strength2 <- model_results2$net.result> cor(predicted_strength2,concrete_test$strength)             [,1][1,] 0.6727155609> 主成分分析身高、体重、胸围、坐高> test<-data.frame(+     X1=c(148, 139, 160, 149, 159, 142, 153, 150, 151, 139,+          140, 161, 158, 140, 137, 152, 149, 145, 160, 156,+          151, 147, 157, 147, 157, 151, 144, 141, 139, 148),+     X2=c(41, 34, 49, 36, 45, 31, 43, 43, 42, 31,+          29, 47, 49, 33, 31, 35, 47, 35, 47, 44,+          42, 38, 39, 30, 48, 36, 36, 30, 32, 38),+     X3=c(72, 71, 77, 67, 80, 66, 76, 77, 77, 68,+          64, 78, 78, 67, 66, 73, 82, 70, 74, 78,+          73, 73, 68, 65, 80, 74, 68, 67, 68, 70),+     X4=c(78, 76, 86, 79, 86, 76, 83, 79, 80, 74,+          74, 84, 83, 77, 73, 79, 79, 77, 87, 85,+          82, 78, 80, 75, 88, 80, 76, 76, 73, 78)+ )> test.pr<-princomp(test,cor=TRUE)> summary(test.pr,loadings=TRUE)Importance of components:                             Comp.1        Comp.2        Comp.3        Comp.4Standard deviation     1.8817805390 0.55980635717 0.28179594325 0.25711843909Proportion of Variance 0.8852744993 0.07834578938 0.01985223841 0.01652747293Cumulative Proportion  0.8852744993 0.96362028866 0.98347252707 1.00000000000Loadings:   Comp.1 Comp.2 Comp.3 Comp.4X1  0.497  0.543 -0.450  0.506X2  0.515 -0.210 -0.462 -0.691X3  0.481 -0.725  0.175  0.461X4  0.507  0.368  0.744 -0.232前两个主成分的累计贡献率已经达到96% 可以舍去另外两个主成分 达到降维的目的因此可以得到函数表达式 Z1=-0.497X'1-0.515X'2-0.481X'3-0.507X'4                                       Z2=  0.543X'1-0.210X'2-0.725X'3-0.368X'44.画主成分的碎石图并预测 > screeplot(test.pr,type="lines")> p<-predict(test.pr)> p              Comp.1         Comp.2         Comp.3          Comp.4 [1,] -0.06990949737 -0.23813701272 -0.35509247634 -0.266120139417 [2,] -1.59526339772 -0.71847399061  0.32813232022 -0.118056645885 [3,]  2.84793151061  0.38956678680 -0.09731731272 -0.279482487139 [4,] -0.75996988424  0.80604334819 -0.04945721875 -0.162949297761 [5,]  2.73966776853  0.01718087263  0.36012614873  0.358653043787 [6,] -2.10583167924  0.32284393414  0.18600422367 -0.036456083707 [7,]  1.42105591247 -0.06053164925  0.21093320662 -0.044223092351 [8,]  0.82583976981 -0.78102575640 -0.27557797533  0.057288571933 [9,]  0.93464401954 -0.58469241699 -0.08814135786  0.181037745585[10,] -2.36463819933 -0.36532199291  0.08840476284  0.045520127461[11,] -2.83741916086  0.34875841111  0.03310422938 -0.031146930047[12,]  2.60851223537  0.21278727930 -0.33398036623  0.210157574387[13,]  2.44253342081 -0.16769495893 -0.46918095412 -0.162987829937[14,] -1.86630668724  0.05021383642  0.37720280364 -0.358821916178[15,] -2.81347420580 -0.31790107093 -0.03291329149 -0.222035112399[16,] -0.06392982655  0.20718447599  0.04334339948  0.703533623798[17,]  1.55561022242 -1.70439673831 -0.33126406220  0.007551878960[18,] -1.07392250663 -0.06763418320  0.02283648409  0.048606680158[19,]  2.52174211878  0.97274300950  0.12164633439 -0.390667990681[20,]  2.14072377494  0.02217881219  0.37410972458  0.129548959692[21,]  0.79624421805  0.16307887263  0.12781269571 -0.294140762463[22,] -0.28708320594 -0.35744666106 -0.03962115883  0.080991988802[23,]  0.25151075072  1.25555187663 -0.55617324819  0.109068938725[24,] -2.05706031616  0.78894493512 -0.26552109297  0.388088642937[25,]  3.08596854773 -0.05775318018  0.62110421208 -0.218939612456[26,]  0.16367554630  0.04317931667  0.24481850312  0.560248997030[27,] -1.37265052598  0.02220972121 -0.23378320040 -0.257399715466[28,] -2.16097778154  0.13733232981  0.35589738735  0.093123683044[29,] -2.40434826507 -0.48613137190 -0.16154440788 -0.007914021222[30,] -0.50287467640  0.14734316507 -0.20590831261 -0.122078819188> 

 

 

加载数据

>w<-read.table("test.prn",header = T)

> w

  X.. X...1

1   A    2

2   B    3

3   C    5

4   D    5

> library(readxl)

>dat<-read_excel("test.xlsx")

> dat

# A tibble: 4 x 2

  `商品` `价格`

   <chr> <dbl>

1      A     2

2      B     3

3      C     5

4      D     5

>bank=read.table("bank-full.csv",header = TRUE,sep=",")

查看数据结构

> str(bank)

'data.frame':  41188 obs. of 21 variables:

 $ age          : int  56 57 37 40 56 45 59 41 2425 ...

 $ job          : Factor w/ 12 levels "admin.","blue-collar",..: 4 88 1 8 8 1 2 10 8 ...

 $ marital      : Factor w/ 4 levels "divorced","married",..: 2 2 22 2 2 2 2 3 3 ...

 $ education    : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 42 4 3 6 8 6 4 ...

 $ default      : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 21 2 1 1 ...

 $ housing      : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 11 1 3 3 ...

 $ loan         : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 11 1 1 1 ...

 $ contact      : Factor w/ 2 levels "cellular","telephone": 2 2 2 22 2 2 2 2 2 ...

 $ month        : Factor w/ 10 levels"apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...

 $ day_of_week  : Factor w/ 5 levels "fri","mon","thu",..:2 2 2 2 2 2 2 2 2 2 ...

 $ duration     : int  261 149 226 151 307 198 139217 380 50 ...

 $ campaign     : int  1 1 1 1 1 1 1 1 1 1 ...

 $ pdays        : int  999 999 999 999 999 999 999999 999 999 ...

 $ previous     : int  0 0 0 0 0 0 0 0 0 0 ...

 $ poutcome     : Factor w/ 3 levels "failure","nonexistent",..: 2 22 2 2 2 2 2 2 2 ...

 $ emp.var.rate : num  1.1 1.1 1.1 1.1 1.1 1.1 1.11.1 1.1 1.1 ...

 $ cons.price.idx: num  94 94 94 94 94 ...

 $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4-36.4 -36.4 -36.4 ...

 $ euribor3m    : num  4.86 4.86 4.86 4.86 4.86...

 $ nr.employed  : num  5191 5191 5191 5191 5191...

 $ y            : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1...

查看数据的最小值,最大值,中位数,平均数,分位数

> summary(bank)

      age                 job            marital    

 Min.  :17.00   admin.     :10422  divorced: 4612 

 1st Qu.:32.00  blue-collar: 9254   married:24928 

 Median :38.00  technician : 6743   single  :11568 

 Mean  :40.02   services   : 3969  unknown :   80 

 3rd Qu.:47.00  management : 2924                  

 Max.  :98.00   retired    : 1720                  

                 (Other)    : 6156                  

               education        default         housing    

 university.degree  :12168  no     :32588   no    :18622 

 high.school        : 9515  unknown: 8597   unknown:  990 

 basic.9y           : 6045   yes   :    3   yes   :21576 

 professional.course: 5243                                 

 basic.4y           : 4176                                 

 basic.6y           : 2292                                  

 (Other)            : 1749                                 

      loan            contact          month       day_of_week

 no    :33950   cellular :26144   may   :13769   fri:7827  

 unknown: 990   telephone:15044   jul   : 7174   mon:8514   

 yes    :6248                     aug    : 6178  thu:8623  

                                   jun    : 5318  tue:8090  

                                   nov    : 4101  wed:8134  

                                   apr    : 2632             

                                   (Other):2016             

    duration         campaign          pdays     

 Min.  :   0.0   Min.  : 1.000   Min.   : 0.0 

 1st Qu.: 102.0   1st Qu.: 1.000   1st Qu.:999.0 

 Median : 180.0   Median : 2.000   Median :999.0 

 Mean   :258.3   Mean   : 2.568  Mean   :962.5 

 3rd Qu.: 319.0   3rd Qu.: 3.000   3rd Qu.:999.0 

 Max.  :4918.0   Max.   :56.000  Max.   :999.0 

                                                 

    previous            poutcome      emp.var.rate    

 Min.  :0.000   failure    : 4252  Min.   :-3.40000 

 1st Qu.:0.000  nonexistent:35563   1stQu.:-1.80000 

 Median :0.000  success    : 1373   Median : 1.10000 

 Mean  :0.173                       Mean   : 0.08189 

 3rd Qu.:0.000                       3rd Qu.: 1.40000 

 Max.  :7.000                      Max.   : 1.40000 

                                                      

 cons.price.idx cons.conf.idx     euribor3m   

 Min.  :92.20   Min.   :-50.8  Min.   :0.634 

 1st Qu.:93.08  1st Qu.:-42.7   1st Qu.:1.344 

 Median :93.75  Median :-41.8   Median :4.857 

 Mean  :93.58   Mean   :-40.5  Mean   :3.621 

 3rd Qu.:93.99  3rd Qu.:-36.4   3rd Qu.:4.961 

 Max.  :94.77   Max.   :-26.9  Max.   :5.045 

                                               

  nr.employed     y       

 Min.  :4964   no :36548 

 1st Qu.:5099  yes: 4640 

 Median :5191             

 Mean  :5167             

 3rd Qu.:5228             

 Max.  :5228             

                          

> psych::describe(bank)

               方差  个数    平均值  标准差  均值    去掉最大   中位数   最小值  最大值  极差    偏差        峰度

                                                           绝对偏差

                                                  最小值

                                                  之后

                                                  的平均数

 

               vars     n   mean     sd  median trimmed   mad     min    max   range  skew    kurtosis

age               1 41188   40.02 10.42   38.00   39.30 10.38   17.00   98.00  81.00  0.78     0.79

job*              2 41188    4.72  3.59    3.00    4.48  2.97    1.00   12.00  11.00  0.45    -1.39

marital*          3 41188    2.17  0.61    2.00    2.21  0.00    1.00    4.00   3.00 -0.06    -0.34

education*        4 41188    4.75  2.14    4.00    4.88  2.97    1.00    8.00   7.00 -0.24    -1.21

default*          5 41188    1.21  0.41    1.00    1.14   0.00   1.00    3.00    2.00 1.44     0.07

housing*          6 41188    2.07  0.99    3.00    2.09  0.00    1.00    3.00   2.00 -0.14    -1.95

loan*             7 41188    1.33  0.72    1.00    1.16  0.00    1.00    3.00   2.00  1.82     1.38

contact*          8 41188    1.37  0.48    1.00    1.33  0.00    1.00    2.00   1.00  0.56    -1.69

month*            9 41188    5.23  2.32    5.00    5.31  2.97    1.00   10.00   9.00 -0.31    -1.03

day_of_week*     10 41188   3.00   1.40    3.00   3.01   1.48    1.00   5.00    4.00  0.01   -1.27

duration         11 41188  258.29 259.28 180.00  210.61 139.36    0.00 4918.00 4918.00  3.26   20.24

campaign         12 41188    2.57  2.77    2.00    1.99  1.48    1.00   56.00  55.00  4.76    36.97

pdays            13 41188  962.48 186.91 999.00  999.00   0.00   0.00  999.00  999.00 -4.92    22.23

previous         14 41188    0.17  0.49    0.00    0.05  0.00    0.00    7.00   7.00  3.83    20.11

poutcome*        15 41188    1.93  0.36    2.00    2.00  0.00    1.00    3.00   2.00 -0.88     3.98

emp.var.rate     16 41188   0.08   1.57    1.10   0.27   0.44   -3.40   1.40    4.80 -0.72    -1.06

cons.price.idx   17 41188  93.58   0.58   93.75  93.58   0.56   92.20  94.77    2.57 -0.23    -0.83

cons.conf.idx    18 41188 -40.50   4.63  -41.80 -40.60   6.52  -50.80 -26.90   23.90  0.30   -0.36

euribor3m        19 41188    3.62  1.73    4.86    3.81  0.16    0.63    5.04   4.41 -0.71    -1.41

nr.employed      20 41188 5167.04  72.25 5191.00 5178.43  55.00 4963.60 5228.10  264.50 -1.04     0.00

y*               21 41188    1.11  0.32    1.00    1.02  0.00    1.00    2.00   1.00  2.45     4.00

 

               se

age            0.05

job*           0.02

marital*       0.00

education*     0.01

default*       0.00

housing*       0.00

loan*          0.00

contact*       0.00

month*         0.01

day_of_week*   0.01

duration       1.28

campaign       0.01

pdays          0.92

previous       0.00

poutcome*      0.00

emp.var.rate   0.01

cons.price.idx 0.00

cons.conf.idx  0.02

euribor3m      0.01

nr.employed    0.36

y*             0.00

 

查看数据是否有缺失值

> sapply(bank,anyNA)

           age            job        marital      education

         FALSE          FALSE          FALSE          FALSE

       default        housing           loan        contact

         FALSE          FALSE          FALSE          FALSE

         month    day_of_week       duration       campaign

         FALSE          FALSE          FALSE          FALSE

         pdays       previous      poutcome   emp.var.rate

         FALSE          FALSE          FALSE          FALSE

cons.price.idx  cons.conf.idx      euribor3m    nr.employed

         FALSE          FALSE          FALSE          FALSE

             y

         FALSE

 

成功与不成功的个数

> table(bank$y)

 

   no  yes

36548  4640

 

在是否结婚这个属性的取值与

是否成功的数量比较

> table(bank$y,bank$marital)

    

      divorced married single unknown

  no     4136   22396   9948     68

  yes     476    2532   1620     12

 

> xtabs(~y+marital,data=bank)

     marital

y     divorced married single unknown

  no     4136   22396   9948     68

  yes     476    2532   1620     12

>tab=table(bank$y,bank$marital)

> tab

    

      divorced married single unknown

  no     4136   22396   9948     68

  yes     476    2532   1620     12

 

在是否结婚这个属性上的取值

> margin.table(tab,2)

 

divorced  married  single  unknown

    4612   24928    11568       80

> margin.table(tab,1)

 

   no  yes

36548  4640

 

在是否结婚这个属性上横向看概率

> prop.table(tab,1)

    

         divorced     married      single    unknown

  no 0.113166247 0.612783189 0.272189997 0.001860567

  yes 0.102586207 0.545689655 0.3491379310.002586207

在是否结婚这个属性上纵向看概率

 

> prop.table(tab,2)

    

       divorced   married   single   unknown

  no 0.8967910 0.8984275 0.8599585 0.8500000

  yes 0.1032090 0.1015725 0.1400415 0.1500000

 

 

平的列联表

以第一列和第二列,展开分类group by 1,2

以col.vars 的取值进行次数统计

>ftable(bank[,c(3,4,21)],row.vars = 1:2,col.vars = "y")

                             y   no yes

marital  education                     

divorced basic.4y               406   83

         basic.6y               169   13

         basic.9y               534   31

         high.school           1086 107

         illiterate               1    1

         professional.course    596  61

         university.degree     1177 160

         unknown                167   20

married  basic.4y              2915  313

         basic.6y              1628  139

         basic.9y              3858  298

         high.school           4683 475

         illiterate              12    3

         professional.course   2799 357

         university.degree     5573 821

         unknown                928  126

single   basic.4y               422   31

         basic.6y               301   36

         basic.9y              1174  142

         high.school           2702 448

         illiterate               1    0

         professional.course   1247 177

         university.degree     3723 683

         unknown                378  103

unknown  basic.4y                 5    1

         basic.6y                 6    0

         basic.9y                 6    2

         high.school             13    1

         illiterate               0    0

         professional.course      6   0

         university.degree       25   6

         unknown                  7    2

 

卡方检验,在p值小于2.2e-16时,拒绝原假设,认为数据不服从卡方分布

> chisq.test(tab)

 

        Pearson's Chi-squared test

 

data:  tab

X-squared = 122.66, df = 3,p-value < 2.2e-16

 

画直方图

> hist(bank$age)

> library(lattice)

 

画连续变量的分布,就是把直方图的中位数连接起来

以年龄为横轴,y为纵轴,数据是bank,画图,auto.key是否有图例

> densityplot(~age,groups =y,data=bank,plot.point=FALSE,auto.key = TRUE)

 

画Box图

> boxplot(age~y,data=bank)

 

双样本t分布检验,p值小于0.05时拒绝原假设

这里的原假设是两个样本没有相关性

得到的结果是p值为1.805e-06,拒绝两个样本没有相关性的假设

这里认为两个样本有相关性

>t.test(age~y,data=bank,alternative="two.sided",var.equal=FALSE)

 

        Welch Two Sample t-test

 

data:  age by y

t = -4.7795, df = 5258.5,p-value = 1.805e-06

alternative hypothesis: truedifference in means is not equal to 0

95 percent confidence interval:

 -1.4129336 -0.5909889

sample estimates:

 mean in group no mean in group yes

         39.91119          40.91315

 

 

数据可视化

画饼图

> tab=table(bank$marital)

> pie(tab)

 

画直方图

> tab=table(bank$marital)

> barplot(tab)

 

画下面这个图

> tab=table(bank$marital,bank$y)

> plot(tab)

 

 

画层叠直方图

>tab=table(bank$marital,bank$y)

>lattice::barchart(tab,auto.key=TRUE)

 

 

加载这个包,准备画图

> library(dplyr)

>data=group_by(bank,marital,y)

> data=tally(data)

!!!!!!!!!!!!!

> ggplot2::ggplot(data=data,mapping=aes(marital,n))+geom_bar(mapping=aes(fill=y),position="dodge",stat="identity")
 
 
 
数据预处理
分组之后再画图

> labels=c('青年','中年','老年')

> bank$age_group=cut(bank$age,breaks = c(0,35,55,100),right = FALSE,labels = labels)

> library(ggplot2)

> ggplot(data=bank,mapping = aes(age_group))+geom_bar(mapping = aes(fill=y),position="dodge",stat="count")

 

 

 

 

 

 

衍生变量
直接使用$符向原数据框添加新的变量

> bank$log.cons.price.idx=log(bank$cons.price.idx)

使用transform函数向原数据框添加变量

> bank<-transform(bank,log.cons.price.idx=log(cons.price.idx),log.nr.employed=log(nr.employed))

使用dplyr包里的mutate函数增加变量

> bank<-dplyr::mutate(bank,log.cons.price.idx=log(cons.price.idx))

使用dplyr包里的transmute函数只保留新生成的变量

> bank2<-dplyr::transmute(bank,log.cons.price.idx=log(cons.price.idx),log.nr.employed=log(nr.employed))

 

中心化

 

> v=1:10

> v1=v-mean(v)

> v2=scale(v,center=TRUE,scale = FALSE)

 

无量纲化

 

> V1=v/sqrt(sum(v^2)/(length(v)-1))

> v2=scale(v,center=FALSE,scale=TRUE)

 

根据最大最小值进行归一化

 

> v3=(v-min(v))/(max(v)-min(v))

 

 

进行标准正态化

 

 

> v1=(v-mean(v))/sd(v)

> v2=scale(v,center = TRUE,scale=TRUE)

 

 

 

 

Box-Cox变换

使用car包里的boxCox函数

> install.packages("car")

> library(car)

> boxCox(age~.,data=bank)

 

 

 

 

 

 

使用caret包,做Box-Cox变换

> install.packages("caret")

> library(caret)

> dat<-subset(bank,select="age")

> trans<-preProcess(dat,method=C("BoxCox"))

 

 

数据预处理下

违反常识的异常值

基于数据分布的异常值(离群点)识别

bank.dirty=read.csv("bank-dirty.csv")
summary(bank.dirty)

 

     age                  job            marital                    education    
 Min.   : 17.00   admin.     :10422   divorced: 4612   university.degree  :12165  
 1st Qu.: 32.00   blue-collar: 9254   married :24928   high.school        : 9515  
 Median : 38.00   technician : 6743   single  :11568   basic.9y           : 6043  
 Mean   : 40.03   services   : 3969   NA's    :   80   professional.course: 5242  
 3rd Qu.: 47.00   management : 2924                    basic.4y           : 4175  
 Max.   :123.00   (Other)    : 7546                    (Other)            : 2310  
 NA's   :2        NA's       :  330                    NA's               : 1738  
 default      housing        loan            contact          month      
 no  :32588   no  :18622   no  :33950   cellular :26144   may    :13769  
 yes :    3   yes :21576   yes : 6248   telephone:15044   jul    : 7174  
 NA's: 8597   NA's:  990   NA's:  990                     aug    : 6178  
                                                          jun    : 5318  
                                                          nov    : 4101  
                                                          apr    : 2632  
                                                          (Other): 2016  
 day_of_week    duration         campaign          pdays          previous    
 fri:7827    Min.   :   0.0   Min.   : 1.000   Min.   :  0.0   Min.   :0.000  
 mon:8514    1st Qu.: 102.0   1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.000  
 thu:8623    Median : 180.0   Median : 2.000   Median :999.0   Median :0.000  
 tue:8090    Mean   : 258.3   Mean   : 2.568   Mean   :962.5   Mean   :0.173  
 wed:8134    3rd Qu.: 319.0   3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.000  
             Max.   :4918.0   Max.   :56.000   Max.   :999.0   Max.   :7.000  
                                                                              
        poutcome      emp.var.rate      cons.price.idx  cons.conf.idx  
 failure    : 4252   Min.   :-3.40000   Min.   :92.20   Min.   :-50.8  
 nonexistent:35563   1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7  
 success    : 1373   Median : 1.10000   Median :93.75   Median :-41.8  
                     Mean   : 0.08189   Mean   :93.58   Mean   :-40.5  
                     3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4  
                     Max.   : 1.40000   Max.   :94.77   Max.   :-26.9  
                                                                       
   euribor3m      nr.employed     y        
 Min.   :0.634   Min.   :4964   no :36548  
 1st Qu.:1.344   1st Qu.:5099   yes: 4640  
 Median :4.857   Median :5191              
 Mean   :3.621   Mean   :5167              
 3rd Qu.:4.961   3rd Qu.:5228              
 Max.   :5.045   Max.   :5228              
 
 

常识告诉我们,虽然123岁的老人存在,但概率也极低,也不太可能是银行的客户

找出在年龄这一列的上离群值和下离群值

 

> head(bank.dirty[order(bank.dirty$age,decreasing = TRUE),'age',drop=FALSE],n=5)

      age

39494 123

38453  98

38456  98

27827  95

38922  94

> tail(bank.dirty[order(bank.dirty$age,decreasing = TRUE),'age',drop=FALSE],n=5)

      age

37559  17

37580  17

38275  17

120    NA

156    NA

 

异常值的处理

当作缺失值处理
> bank.dirty$age[which(bank.dirty$age>98)]<-NA

删除或者插补

 

 

重编码

职业类型有12个分类,不利于后续分析,把除了unknown以外的分类进行重新编码,简化成4类

Month有12个分类,把它转化成季度

Education的分类,除了unknow之外有7类

 

进行重编码

levels(bank.dirty$job) <- c( "management","services","entrepreneur","entrepreneur",
                       "management","unemployed",  "entrepreneur","services",
                       "unemployed","services","unemployed","unknown" )
> levels(bank.dirty$month) <- c("Q2","Q3","Q4","Q3","Q2",
                        "Q1","Q2","Q4","Q4","Q3")
> 
> levels(bank.dirty$education) <- c( "primary","primary","primary","secondary",
                             "primary","tertiary","tertiary","unknown")
 
 

缺失值

分类较多,分类是unknown,不能给我们提供信息

有些模型不能处理缺失值,比如Logistic回归

缺失值插补的方法

1、  用中位数或众数插补

> library(imputeMissings)
> bank.clean<-impute(bank.dirty,object = compute(bank.dirty,method = "median/mode"))

2、  最邻近(knn)插补

library(DMwR)
bank.clean=knnImputation(bank.dirty,k=5)

 

3、  随机森林插补

library(missForest)

 Imp = missForest(bank.dirty)

 bank.clean = Imp$ximp

 

缺失值插补的R包

1、  imputeMissings包

2、  DMwR包

 

 

 

 

 

 

用Logistic回归建立客户响应模型

1、广义线性模型

广义线性模型擅长于处理因变量不是连续变量的问题

1)  Y是分类变量

2)  Y是定序变量

3)  Y是离散取值

2、当Y取值是0-1二分类变量是,就是Logistic回归

 

Logistic回归在R中的实现

数据重编码

bank$y=ifelse(bank$y=='yes',1,0)

改成以Q1为参考因子

bank$month<-relevel(bank$month,ref="Q1")

构建Logistic回归模型

> model<-glm(y~.,data=bank,family = 'binomial')
> summary(model)
 
Call:
glm(formula = y ~ ., family = "binomial", data = bank)
 
Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-5.9958  -0.3082  -0.1887  -0.1333   3.4283  
 
Coefficients: (1 not defined because of singularities)
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                  -1.957e+02  1.935e+01 -10.116  < 2e-16 ***
age                           1.851e-03  2.415e-03   0.767 0.443289    
jobblue-collar               -2.659e-01  7.942e-02  -3.348 0.000814 ***
jobentrepreneur              -2.029e-01  1.248e-01  -1.626 0.103924    
jobhousemaid                 -3.628e-02  1.475e-01  -0.246 0.805705    
jobmanagement                -8.054e-02  8.501e-02  -0.947 0.343423    
jobretired                    2.928e-01  1.067e-01   2.743 0.006092 ** 
jobself-employed             -1.680e-01  1.176e-01  -1.428 0.153332    
jobservices                  -1.497e-01  8.552e-02  -1.751 0.079969 .  
jobstudent                    2.674e-01  1.106e-01   2.416 0.015680 *  
jobtechnician                 3.462e-03  7.096e-02   0.049 0.961086    
jobunemployed                 8.514e-03  1.273e-01   0.067 0.946686    
jobunknown                   -8.046e-02  2.390e-01  -0.337 0.736420    
maritalmarried                1.567e-02  6.824e-02   0.230 0.818420    
maritalsingle                 6.620e-02  7.791e-02   0.850 0.395473    
maritalunknown                6.303e-02  4.113e-01   0.153 0.878211    
educationbasic.6y             9.647e-02  1.202e-01   0.803 0.422195    
educationbasic.9y            -2.154e-02  9.494e-02  -0.227 0.820557    
educationhigh.school          3.381e-02  9.188e-02   0.368 0.712895    
educationilliterate           1.132e+00  7.395e-01   1.531 0.125887    
educationprofessional.course  1.136e-01  1.013e-01   1.121 0.262175    
educationuniversity.degree    2.134e-01  9.188e-02   2.322 0.020211 *  
educationunknown              1.361e-01  1.196e-01   1.138 0.255314    
defaultunknown               -3.055e-01  6.712e-02  -4.552 5.32e-06 ***
defaultyes                   -7.150e+00  1.135e+02  -0.063 0.949784    
housingunknown               -7.385e-02  1.390e-01  -0.531 0.595260    
housingyes                   -3.740e-03  4.121e-02  -0.091 0.927695    
loanunknown                          NA         NA      NA       NA    
loanyes                      -6.362e-02  5.725e-02  -1.111 0.266454    
contacttelephone             -6.068e-01  7.124e-02  -8.518  < 2e-16 ***
monthQ2                      -2.192e+00  1.125e-01 -19.479  < 2e-16 ***
monthQ3                      -1.463e+00  1.148e-01 -12.747  < 2e-16 ***
monthQ4                      -1.995e+00  1.240e-01 -16.088  < 2e-16 ***
day_of_weekmon               -1.216e-01  6.588e-02  -1.846 0.064887 .  
day_of_weekthu                6.375e-02  6.382e-02   0.999 0.317842    
day_of_weektue                6.867e-02  6.545e-02   1.049 0.294118    
day_of_weekwed                1.436e-01  6.530e-02   2.199 0.027911 *  
duration                      4.667e-03  7.397e-05  63.092  < 2e-16 ***
campaign                     -4.543e-02  1.158e-02  -3.922 8.77e-05 ***
pdays                        -9.627e-04  2.162e-04  -4.452 8.50e-06 ***
previous                     -5.806e-02  5.879e-02  -0.988 0.323369    
poutcomenonexistent           4.507e-01  9.372e-02   4.809 1.51e-06 ***
poutcomesuccess               9.371e-01  2.106e-01   4.451 8.56e-06 ***
emp.var.rate                 -1.389e+00  7.693e-02 -18.057  < 2e-16 ***
cons.price.idx                1.815e+00  1.193e-01  15.218  < 2e-16 ***
cons.conf.idx                 3.353e-02  6.664e-03   5.033 4.84e-07 ***
euribor3m                     6.054e-02  1.126e-01   0.537 0.590987    
nr.employed                   4.937e-03  1.873e-03   2.635 0.008413 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
(Dispersion parameter for binomial family taken to be 1)
 
    Null deviance: 28999  on 41187  degrees of freedom
Residual deviance: 17199  on 41141  degrees of freedom
AIC: 17293
 
Number of Fisher Scoring iterations: 10

 

 

> exp(coef(model))
                 (Intercept)                          age               jobblue-collar 
                9.856544e-86                 1.001853e+00                 7.665077e-01 
             jobentrepreneur                 jobhousemaid                jobmanagement 
                8.163314e-01                 9.643733e-01                 9.226187e-01 
                  jobretired             jobself-employed                  jobservices 
                1.340142e+00                 8.453874e-01                 8.609387e-01 
                  jobstudent                jobtechnician                jobunemployed 
                1.306514e+00                 1.003468e+00                 1.008550e+00 
                  jobunknown               maritalmarried                maritalsingle 
                9.226922e-01                 1.015789e+00                 1.068445e+00 
              maritalunknown            educationbasic.6y            educationbasic.9y 
                1.065061e+00                 1.101276e+00                 9.786948e-01 
        educationhigh.school          educationilliterate educationprofessional.course 
                1.034388e+00                 3.101297e+00                 1.120248e+00 
  educationuniversity.degree             educationunknown               defaultunknown 
                1.237856e+00                 1.145744e+00                 7.367445e-01 
                  defaultyes               housingunknown                   housingyes 
                7.851906e-04                 9.288126e-01                 9.962671e-01 
                 loanunknown                      loanyes             contacttelephone 
                          NA                 9.383587e-01                 5.450980e-01 
                     monthQ2                      monthQ3                      monthQ4 
                1.116739e-01                 2.314802e-01                 1.360620e-01 
              day_of_weekmon               day_of_weekthu               day_of_weektue 
                8.854888e-01                 1.065828e+00                 1.071082e+00 
              day_of_weekwed                     duration                     campaign 
                1.154380e+00                 1.004678e+00                 9.555850e-01 
                       pdays                     previous          poutcomenonexistent 
                9.990378e-01                 9.435960e-01                 1.569466e+00 
             poutcomesuccess                 emp.var.rate               cons.price.idx 
                2.552531e+00                 2.493091e-01                 6.140533e+00 
               cons.conf.idx                    euribor3m                  nr.employed 
                1.034103e+00                 1.062408e+00                 1.004949e+00 

 

 

Job变量的基准水平是management,从上面的结果看,服务业和自主劳动者购买银行产品的几率(odds)是管理岗从业人员的0.88倍,未就业人员购买银行产品的几率是管理岗人员的1.25倍

 

 

> summary(model.step)
向前逐步回归
> model.step=step(model,direction = "backward")
向后逐步回归
> model.step = step(model, direction = "forward")
双向逐步回归
> model.step = step(model, direction = "both")

> summary(model.step)

 

Call:

glm(formula = y ~ job + education + default + contact + month +

    day_of_week + duration + campaign + pdays + poutcome + emp.var.rate +

    cons.price.idx + cons.conf.idx + nr.employed, family = "binomial",

    data = bank)

 

Deviance Residuals:

    Min       1Q   Median       3Q      Max 

-5.9884  -0.3088  -0.1887  -0.1332   3.4026 

 

Coefficients:

                               Estimate Std. Error z value Pr(>|z|)   

(Intercept)                  -2.031e+02  1.426e+01 -14.246  < 2e-16 ***

jobblue-collar               -2.700e-01  7.917e-02  -3.411 0.000648 ***

jobentrepreneur              -2.043e-01  1.242e-01  -1.645 0.100003   

jobhousemaid                 -2.832e-02  1.464e-01  -0.193 0.846590   

jobmanagement                -8.368e-02  8.409e-02  -0.995 0.319670   

jobretired                    3.234e-01  9.130e-02   3.542 0.000397 ***

jobself-employed             -1.670e-01  1.176e-01  -1.421 0.155435   

jobservices                  -1.528e-01  8.545e-02  -1.789 0.073666 . 

jobstudent                    2.682e-01  1.046e-01   2.565 0.010316 * 

jobtechnician                 4.389e-03  7.093e-02   0.062 0.950665   

jobunemployed                 8.975e-03  1.271e-01   0.071 0.943715   

jobunknown                   -6.363e-02  2.378e-01  -0.268 0.789057   

educationbasic.6y             8.993e-02  1.196e-01   0.752 0.452024   

educationbasic.9y            -2.716e-02  9.416e-02  -0.288 0.772992   

educationhigh.school          2.890e-02  9.053e-02   0.319 0.749573   

educationilliterate           1.118e+00  7.398e-01   1.511 0.130744   

educationprofessional.course  1.084e-01  1.004e-01   1.079 0.280686   

educationuniversity.degree    2.103e-01  9.017e-02   2.332 0.019678 * 

educationunknown              1.363e-01  1.195e-01   1.140 0.254110   

defaultunknown               -3.017e-01  6.666e-02  -4.526 6.02e-06 ***

defaultyes                   -7.141e+00  1.135e+02  -0.063 0.949831   

contacttelephone             -6.011e-01  7.069e-02  -8.504  < 2e-16 ***

monthQ2                      -2.210e+00  1.108e-01 -19.939  < 2e-16 ***

monthQ3                      -1.475e+00  1.146e-01 -12.869  < 2e-16 ***

monthQ4                      -1.982e+00  1.183e-01 -16.755  < 2e-16 ***

day_of_weekmon               -1.210e-01  6.584e-02  -1.837 0.066174 . 

day_of_weekthu                6.208e-02  6.374e-02   0.974 0.330066   

day_of_weektue                6.851e-02  6.538e-02   1.048 0.294651    

day_of_weekwed                1.420e-01  6.525e-02   2.176 0.029592 * 

duration                      4.667e-03  7.396e-05  63.099  < 2e-16 ***

campaign                     -4.587e-02  1.158e-02  -3.960 7.49e-05 ***

pdays                        -8.822e-04  2.024e-04  -4.358 1.31e-05 ***

poutcomenonexistent           5.219e-01  6.356e-02   8.211  < 2e-16 ***

poutcomesuccess               9.996e-01  2.028e-01   4.928 8.31e-07 ***

emp.var.rate                 -1.376e+00  6.885e-02 -19.980  < 2e-16 ***

cons.price.idx                1.845e+00  1.041e-01  17.725  < 2e-16 ***

cons.conf.idx                 3.622e-02  4.853e-03   7.464 8.42e-14 ***

nr.employed                   5.883e-03  9.765e-04   6.024 1.70e-09 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

(Dispersion parameter for binomial family taken to be 1)

 

    Null deviance: 28999  on 41187  degrees of freedom

Residual deviance: 17203  on 41150  degrees of freedom

AIC: 17279

 

Number of Fisher Scoring iterations: 10

 

 

 

模型预测

用predict函数,参数type=’response’

Newdata参数是要预测的数据集

 

> prob<-predict(model.step,type = 'response')
> head(prob)
          1           2           3           4           5           6 
0.015029328 0.006044212 0.011640349 0.010173952 0.016897254 0.007174804 

 

假设以0.5为临界值

> pre<-ifelse(prob>0.5,1,0)

> table(pre,bank$y)

  

pre     0     1

  0 35596  2667

  1   952  1973

 

>

预测的准确率

> (35592+1964)/(35592+2676+956+1964)

[1] 0.911819

 

实际有响应的客户被识别出了多少

> 1964/(1964+2676)
[1] 0.4232759

 

 

模型评估

 

> confusionMatrix(bank$y,pre,pos='1')
Confusion Matrix and Statistics
 
          Reference
Prediction     0     1
         0 35596   952
         1  2667  1973
                                          
               Accuracy : 0.9121          
                 95% CI : (0.9094, 0.9149)
    No Information Rate : 0.929           
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.476           
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.67453         
            Specificity : 0.93030         
         Pos Pred Value : 0.42522         
         Neg Pred Value : 0.97395         
             Prevalence : 0.07102         
         Detection Rate : 0.04790         
   Detection Prevalence : 0.11265         
      Balanced Accuracy : 0.80241         
                                          
       'Positive' Class : 1               
                                    

Kappa 统计量(kappa statistic)

用于评判分类器的分类结果与随机分类的差异度

用Kappa统计量评价:

    较差:小于0.20

    一般:0.20至0.40

    稳健:0.40至0.60

    好的:0.60至0.80

很好的:0.80至1.00

 

 

ROC曲线

pred<-prediction(prob,bank$y)
perf<-performance(pred,measure = "tpr",x="fpr")
plot(perf)
 
 
 
 
 
 
 
 
 
 
 
 
RandomForest
加载数据列
 

> data=read.table("input.txt",header = TRUE)

> str(data)

'data.frame':  222 obs. of  23 variables:

 $ Acti_Profile             : num  0 0 0 0 0 0 0 0 0 0 ...

 $ Activity                 : num  1.25 0 0.938 6.562 0 ...

 $ Diastolic_PTT            : num  256 240 253 0 241 ...

 $ Diastolic                : num  73.2 78.6 74 0 78.4 ...

 $ Heart_Rate_Curve         : num  81.2 69.7 77.6 95 83.6 ...

 $ Heart_Rate_Variability_HF: num  131 250 135 144 141 ...

 $ Heart_Rate_Variability_LF: num  311 218 203 301 244 ...

 $ MAP                      : num  86 93.5 86.9 0 91.7 ...

 $ Position                 : num  0 0 0 1 0 0 0 0 0 0 ...

 $ PTT_Raw                  : num  308 288 308 0 295 ...

 $ RR_Interval              : num  734 878 773 632 714 ...

 $ Sleep_Wake               : num  1 1 1 1 1 0 1 1 0 0 ...

 $ SpO2                     : num  0 0 99 0 98.4 ...

 $ Sympatho_Vagal_Balance   : num  23 8.17 14.5 20.4 16.88 ...

 $ Systolic_PTT             : num  308 288 307 0 295 ...

 $ Systolic                 : num  113 124 113 0 119 ...

 $ Autonomic_arousals       : num  0 0 0 0 0 0 0 0 0 0 ...

 $ Cardio_complex           : num  0 0 0 1 0 0 0 0 0 0 ...

 $ Cardio_rhythm            : num  0 0 2 0 0 0 0 0 0 0 ...

 $ Classification_Arousal   : num  0 0 0 0 0 0 0 0 0 0 ...

 $ PTT_Events               : num  1 0 2 0 0 0 0 0 0 0 ...

 $ Systolic_Events          : num  1 0 1 0 0 0 0 0 0 0 ...

 $ y                        : num  1 0 1 0 0 0 0 0 0 0 ...

加载随机森林包

> library(randomForest)

进行训练  以y作为因变量,其余数据作为自变量

> rf <- randomForest(y ~ ., data=data, ntree=100, proximity=TRUE,importance=TRUE)

> plot(rf)

重要性检测

衡量把一个变量的取值变为随机数,随机森林预测准确性的降低程度

> importance(rf,type=1)

                              %IncMSE

Acti_Profile               0.00000000

Activity                   0.99353251

Diastolic_PTT              0.32193611

Diastolic                  1.99891809

Heart_Rate_Curve           0.92001352

Heart_Rate_Variability_HF  2.07870722

Heart_Rate_Variability_LF -0.24957163

MAP                        0.48142975

Position                   1.86876751

PTT_Raw                    1.94648914

RR_Interval                0.60557964

Sleep_Wake                 1.00503782

SpO2                       0.25396165

Sympatho_Vagal_Balance     1.42906765

Systolic_PTT               1.27965813

Systolic                   0.77382673

Autonomic_arousals         0.00000000

Cardio_complex             1.00503782

Cardio_rhythm              1.14283152

Classification_Arousal    -0.04383997

PTT_Events                 4.63980680

Systolic_Events           33.29461169

 

输出随机森林的模型

> print(rf)

 

Call:

 randomForest(formula = y ~ ., data = data, ntree = 100, proximity = TRUE,      importance = TRUE)

               Type of random forest: regression

                     Number of trees: 100

No. of variables tried at each split: 7

 

          Mean of squared residuals: 0.003226897     残差平方和SSE

                    % Var explained: 98.7

 

>

总平方和(SST):(样本数据-样本均值)的平方和

回归平方和(SSR):(预测数据-样本均值)的平方和

残差平方和(SSE):(样本数据-预测数据均值)的平方和

 

SST = SSR + SSE   

 

 

 

 

 

基尼指数:

 

> importance(rf,type=2)

                          IncNodePurity

Acti_Profile                0.000000000

Activity                    0.445181480

Diastolic_PTT               0.452221870

Diastolic                   0.449372186

Heart_Rate_Curve            0.473113852

Heart_Rate_Variability_HF   0.226815300

Heart_Rate_Variability_LF   0.205457353

MAP                         0.536977574

Position                    0.307333210

PTT_Raw                     0.656726800

RR_Interval                 0.452738011

Sleep_Wake                  0.014423077

SpO2                        1.793361279

Sympatho_Vagal_Balance      0.352759689

Systolic_PTT                0.851951505

Systolic                    0.823955781

Autonomic_arousals          0.000000000

Cardio_complex              0.008047619

Cardio_rhythm               0.141907084

Classification_Arousal      0.085739429

PTT_Events                  7.468690820

Systolic_Events            39.000163018

 

>

进行预测

prediction <- predict(rf, data[,],type="response")

输出预测结果

table(observed =data$y,predicted=prediction)

plot(prediction)

 

 

 

支持向量机

library(e1071)

svmfit<-svm(y~.,data=data,kernel="linear",cost=10,scale=FALSE)

> print(svmfit)

 

Call:

svm(formula = y ~ ., data = data, kernel = "linear", cost = 10, scale = FALSE)

 

 

Parameters:

   SVM-Type:  eps-regression

 SVM-Kernel:  linear

       cost:  10

      gamma:  0.04545455

    epsilon:  0.1

 

 

Number of Support Vectors:  20

> plot(svmfit,data)

 

 

神经网络

 

> concrete<-read_excel("Concrete_Data.xls")

> str(concrete)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    1030 obs. of  9 variables:

 $ Cement      : num  540 540 332 332 199 ...

 $ Slag        : num  0 0 142 142 132 ...

 $ Ash         : num  0 0 0 0 0 0 0 0 0 0 ...

 $ water       : num  162 162 228 228 192 228 228 228 228 228 ...

 $ superplastic: num  2.5 2.5 0 0 0 0 0 0 0 0 ...

 $ coarseagg   : num  1040 1055 932 932 978 ...

 $ fineagg     : num  676 676 594 594 826 ...

 $ age         : num  28 28 270 365 360 90 365 28 28 28 ...

 $ strength    : num  80 61.9 40.3 41.1 44.3 ...

 

 

> normalize <- function(x){ return ((x-min(x))/(max(x)-min(x)))}

> concrete_norm <- as.data.frame(lapply(concrete,normalize))

 

 

> concrete_train <- concrete_norm[1:773,]

> concrete_test <- concrete_norm[774:1030,]

 

 

> library(neuralnet)

> concrete_model <- neuralnet(strength ~ Cement+Slag+Ash+water+superplastic+coarseagg+fineagg+age,data=concrete_train)

> plot(concrete_model)

 

 

 

 

 

 

model_results <- compute(concrete_model,concrete_test[1:8])

predicted_strength <- model_results$net.result

> cor(predicted_strength,concrete_test$strength)

             [,1]

[1,] 0.7205120076

> concrete_model2 <- neuralnet(strength ~ Cement+Slag+Ash+water+superplastic+coarseagg+fineagg+age,data=concrete_train,hidden=5)

> plot(concrete_model2)

计算误差

> model_results2 <- compute(concrete_model2,concrete_test[1:8])

> predicted_strength2 <- model_results2$net.result

> cor(predicted_strength2,concrete_test$strength)

             [,1]

[1,] 0.6727155609

 

>

 

 

 

 

主成分分析

身高、体重、胸围、坐高

> test<-data.frame(

+     X1=c(148, 139, 160, 149, 159, 142, 153, 150, 151, 139,

+          140, 161, 158, 140, 137, 152, 149, 145, 160, 156,

+          151, 147, 157, 147, 157, 151, 144, 141, 139, 148),

+     X2=c(41, 34, 49, 36, 45, 31, 43, 43, 42, 31,

+          29, 47, 49, 33, 31, 35, 47, 35, 47, 44,

+          42, 38, 39, 30, 48, 36, 36, 30, 32, 38),

+     X3=c(72, 71, 77, 67, 80, 66, 76, 77, 77, 68,

+          64, 78, 78, 67, 66, 73, 82, 70, 74, 78,

+          73, 73, 68, 65, 80, 74, 68, 67, 68, 70),

+     X4=c(78, 76, 86, 79, 86, 76, 83, 79, 80, 74,

+          74, 84, 83, 77, 73, 79, 79, 77, 87, 85,

+          82, 78, 80, 75, 88, 80, 76, 76, 73, 78)

+ )

> test.pr<-princomp(test,cor=TRUE)

> summary(test.pr,loadings=TRUE)

Importance of components:

                             Comp.1        Comp.2        Comp.3        Comp.4

Standard deviation     1.8817805390 0.55980635717 0.28179594325 0.25711843909

Proportion of Variance 0.8852744993 0.07834578938 0.01985223841 0.01652747293

Cumulative Proportion  0.8852744993 0.96362028866 0.98347252707 1.00000000000

 

Loadings:

   Comp.1 Comp.2 Comp.3 Comp.4

X1  0.497  0.543 -0.450  0.506

X2  0.515 -0.210 -0.462 -0.691

X3  0.481 -0.725  0.175  0.461

X4  0.507  0.368  0.744 -0.232

 

 

前两个主成分的累计贡献率已经达到96% 可以舍去另外两个主成分达到降维的目的

因此可以得到函数表达式 Z1=-0.497X'1-0.515X'2-0.481X'3-0.507X'4

                                       Z2=  0.543X'1-0.210X'2-0.725X'3-0.368X'4

4.画主成分的碎石图并预测

> screeplot(test.pr,type="lines")

> p<-predict(test.pr)

> p

              Comp.1         Comp.2         Comp.3          Comp.4

 [1,] -0.06990949737 -0.23813701272 -0.35509247634 -0.266120139417

 [2,] -1.59526339772 -0.71847399061  0.32813232022 -0.118056645885

 [3,]  2.84793151061  0.38956678680 -0.09731731272 -0.279482487139

 [4,] -0.75996988424  0.80604334819 -0.04945721875 -0.162949297761

 [5,]  2.73966776853  0.01718087263  0.36012614873  0.358653043787

 [6,] -2.10583167924  0.32284393414  0.18600422367 -0.036456083707

 [7,]  1.42105591247 -0.06053164925  0.21093320662 -0.044223092351

 [8,]  0.82583976981 -0.78102575640 -0.27557797533  0.057288571933

 [9,]  0.93464401954 -0.58469241699 -0.08814135786  0.181037745585

[10,] -2.36463819933 -0.36532199291  0.08840476284  0.045520127461

[11,] -2.83741916086  0.34875841111  0.03310422938 -0.031146930047

[12,]  2.60851223537  0.21278727930 -0.33398036623  0.210157574387

[13,]  2.44253342081 -0.16769495893 -0.46918095412 -0.162987829937

[14,] -1.86630668724  0.05021383642  0.37720280364 -0.358821916178

[15,] -2.81347420580 -0.31790107093 -0.03291329149 -0.222035112399

[16,] -0.06392982655  0.20718447599  0.04334339948  0.703533623798

[17,]  1.55561022242 -1.70439673831 -0.33126406220  0.007551878960

[18,] -1.07392250663 -0.06763418320  0.02283648409  0.048606680158

[19,]  2.52174211878  0.97274300950  0.12164633439 -0.390667990681

[20,]  2.14072377494  0.02217881219  0.37410972458  0.129548959692

[21,]  0.79624421805  0.16307887263  0.12781269571 -0.294140762463

[22,] -0.28708320594 -0.35744666106 -0.03962115883  0.080991988802

[23,]  0.25151075072  1.25555187663 -0.55617324819  0.109068938725

[24,] -2.05706031616  0.78894493512 -0.26552109297  0.388088642937

[25,]  3.08596854773 -0.05775318018  0.62110421208 -0.218939612456

[26,]  0.16367554630  0.04317931667  0.24481850312  0.560248997030

[27,] -1.37265052598  0.02220972121 -0.23378320040 -0.257399715466

[28,] -2.16097778154  0.13733232981  0.35589738735  0.093123683044

[29,] -2.40434826507 -0.48613137190 -0.16154440788 -0.007914021222

[30,] -0.50287467640  0.14734316507 -0.20590831261 -0.122078819188

 

>

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

加载数据> w<-read.table("test.prn",header = T)> w  X.. X...11   A     22   B     33   C     54   D     5> library(readxl)> dat<-read_excel("test.xlsx")> dat# A tibble: 4 x 2  `商品` `价格`   <chr>  <dbl>1      A      22      B      33      C      54      D      5> bank=read.table("bank-full.csv",header = TRUE,sep=",")查看数据结构> str(bank)'data.frame':    41188 obs. of  21 variables: $ age           : int  56 57 37 40 56 45 59 41 24 25 ... $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ... $ marital       : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ... $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ... $ default       : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ... $ housing       : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ... $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ... $ contact       : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ... $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ... $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ... $ duration      : int  261 149 226 151 307 198 139 217 380 50 ... $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ... $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ... $ previous      : int  0 0 0 0 0 0 0 0 0 0 ... $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ... $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ... $ cons.price.idx: num  94 94 94 94 94 ... $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ... $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ... $ nr.employed   : num  5191 5191 5191 5191 5191 ... $ y             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...查看数据的最小值,最大值,中位数,平均数,分位数> summary(bank)      age                 job            marital      Min.   :17.00   admin.     :10422   divorced: 4612   1st Qu.:32.00   blue-collar: 9254   married :24928   Median :38.00   technician : 6743   single  :11568   Mean   :40.02   services   : 3969   unknown :   80   3rd Qu.:47.00   management : 2924                    Max.   :98.00   retired    : 1720                                    (Other)    : 6156                                  education        default         housing      university.degree  :12168   no     :32588   no     :18622   high.school        : 9515   unknown: 8597   unknown:  990   basic.9y           : 6045   yes    :    3   yes    :21576   professional.course: 5243                                   basic.4y           : 4176                                   basic.6y           : 2292                                   (Other)            : 1749                                        loan            contact          month       day_of_week no     :33950   cellular :26144   may    :13769   fri:7827    unknown:  990   telephone:15044   jul    : 7174   mon:8514    yes    : 6248                     aug    : 6178   thu:8623                                      jun    : 5318   tue:8090                                      nov    : 4101   wed:8134                                      apr    : 2632                                                 (Other): 2016                  duration         campaign          pdays       Min.   :   0.0   Min.   : 1.000   Min.   :  0.0   1st Qu.: 102.0   1st Qu.: 1.000   1st Qu.:999.0   Median : 180.0   Median : 2.000   Median :999.0   Mean   : 258.3   Mean   : 2.568   Mean   :962.5   3rd Qu.: 319.0   3rd Qu.: 3.000   3rd Qu.:999.0   Max.   :4918.0   Max.   :56.000   Max.   :999.0                                                        previous            poutcome      emp.var.rate      Min.   :0.000   failure    : 4252   Min.   :-3.40000   1st Qu.:0.000   nonexistent:35563   1st Qu.:-1.80000   Median :0.000   success    : 1373   Median : 1.10000   Mean   :0.173                       Mean   : 0.08189   3rd Qu.:0.000                       3rd Qu.: 1.40000   Max.   :7.000                       Max.   : 1.40000                                                          cons.price.idx  cons.conf.idx     euribor3m     Min.   :92.20   Min.   :-50.8   Min.   :0.634   1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.344   Median :93.75   Median :-41.8   Median :4.857   Mean   :93.58   Mean   :-40.5   Mean   :3.621   3rd Qu.:93.99   3rd Qu.:-36.4   3rd Qu.:4.961   Max.   :94.77   Max.   :-26.9   Max.   :5.045                                                    nr.employed     y         Min.   :4964   no :36548   1st Qu.:5099   yes: 4640   Median :5191               Mean   :5167               3rd Qu.:5228               Max.   :5228                                         > psych::describe(bank)        方差  个数    平均值  标准差  均值    去掉最大   中位数   最小值  最大值  极差    偏差        峰度                                  绝对偏差                             最小值                            之后                            的平均数               vars     n    mean     sd  median trimmed   mad     min     max   range  skew    kurtosisage               1 41188   40.02  10.42   38.00   39.30  10.38   17.00   98.00   81.00  0.78     0.79job*              2 41188    4.72   3.59    3.00    4.48   2.97    1.00   12.00   11.00  0.45    -1.39marital*          3 41188    2.17   0.61    2.00    2.21   0.00    1.00    4.00    3.00 -0.06    -0.34education*        4 41188    4.75   2.14    4.00    4.88   2.97    1.00    8.00    7.00 -0.24    -1.21default*          5 41188    1.21   0.41    1.00    1.14   0.00    1.00    3.00    2.00  1.44     0.07housing*          6 41188    2.07   0.99    3.00    2.09   0.00    1.00    3.00    2.00 -0.14    -1.95loan*             7 41188    1.33   0.72    1.00    1.16   0.00    1.00    3.00    2.00  1.82     1.38contact*          8 41188    1.37   0.48    1.00    1.33   0.00    1.00    2.00    1.00  0.56    -1.69month*            9 41188    5.23   2.32    5.00    5.31   2.97    1.00   10.00    9.00 -0.31    -1.03day_of_week*     10 41188    3.00   1.40    3.00    3.01   1.48    1.00    5.00    4.00  0.01    -1.27duration         11 41188  258.29 259.28  180.00  210.61 139.36    0.00 4918.00 4918.00  3.26    20.24campaign         12 41188    2.57   2.77    2.00    1.99   1.48    1.00   56.00   55.00  4.76    36.97pdays            13 41188  962.48 186.91  999.00  999.00   0.00    0.00  999.00  999.00 -4.92    22.23previous         14 41188    0.17   0.49    0.00    0.05   0.00    0.00    7.00    7.00  3.83    20.11poutcome*        15 41188    1.93   0.36    2.00    2.00   0.00    1.00    3.00    2.00 -0.88     3.98emp.var.rate     16 41188    0.08   1.57    1.10    0.27   0.44   -3.40    1.40    4.80 -0.72    -1.06cons.price.idx   17 41188   93.58   0.58   93.75   93.58   0.56   92.20   94.77    2.57 -0.23    -0.83cons.conf.idx    18 41188  -40.50   4.63  -41.80  -40.60   6.52  -50.80  -26.90   23.90  0.30    -0.36euribor3m        19 41188    3.62   1.73    4.86    3.81   0.16    0.63    5.04    4.41 -0.71    -1.41nr.employed      20 41188 5167.04  72.25 5191.00 5178.43  55.00 4963.60 5228.10  264.50 -1.04     0.00y*               21 41188    1.11   0.32    1.00    1.02   0.00    1.00    2.00    1.00  2.45     4.00               seage            0.05job*           0.02marital*       0.00education*     0.01default*       0.00housing*       0.00loan*          0.00contact*       0.00month*         0.01day_of_week*   0.01duration       1.28campaign       0.01pdays          0.92previous       0.00poutcome*      0.00emp.var.rate   0.01cons.price.idx 0.00cons.conf.idx  0.02euribor3m      0.01nr.employed    0.36y*             0.00查看数据是否有缺失值> sapply(bank,anyNA)           age            job        marital      education          FALSE          FALSE          FALSE          FALSE        default        housing           loan        contact          FALSE          FALSE          FALSE          FALSE          month    day_of_week       duration       campaign          FALSE          FALSE          FALSE          FALSE          pdays       previous       poutcome   emp.var.rate          FALSE          FALSE          FALSE          FALSE cons.price.idx  cons.conf.idx      euribor3m    nr.employed          FALSE          FALSE          FALSE          FALSE              y          FALSE 成功与不成功的个数> table(bank$y)   no   yes 36548  4640 在是否结婚这个属性的取值与是否成功的数量比较> table(bank$y,bank$marital)           divorced married single unknown  no      4136   22396   9948      68  yes      476    2532   1620      12> xtabs(~y+marital,data=bank)     maritaly     divorced married single unknown  no      4136   22396   9948      68  yes      476    2532   1620      12> tab=table(bank$y,bank$marital)> tab           divorced married single unknown  no      4136   22396   9948      68  yes      476    2532   1620      12在是否结婚这个属性上的取值> margin.table(tab,2)divorced  married   single  unknown     4612    24928    11568       80 > margin.table(tab,1)   no   yes 36548  4640 在是否结婚这个属性上横向看概率> prop.table(tab,1)              divorced     married      single     unknown  no  0.113166247 0.612783189 0.272189997 0.001860567  yes 0.102586207 0.545689655 0.349137931 0.002586207在是否结婚这个属性上纵向看概率> prop.table(tab,2)            divorced   married    single   unknown  no  0.8967910 0.8984275 0.8599585 0.8500000  yes 0.1032090 0.1015725 0.1400415 0.1500000平的列联表以第一列和第二列,展开分类group by 1,2以col.vars 的取值 进行次数统计> ftable(bank[,c(3,4,21)],row.vars = 1:2,col.vars = "y")                             y   no  yesmarital  education                      divorced basic.4y               406   83         basic.6y               169   13         basic.9y               534   31         high.school           1086  107         illiterate               1    1         professional.course    596   61         university.degree     1177  160         unknown                167   20married  basic.4y              2915  313         basic.6y              1628  139         basic.9y              3858  298         high.school           4683  475         illiterate              12    3         professional.course   2799  357         university.degree     5573  821         unknown                928  126single   basic.4y               422   31         basic.6y               301   36         basic.9y              1174  142         high.school           2702  448         illiterate               1    0         professional.course   1247  177         university.degree     3723  683         unknown                378  103unknown  basic.4y                 5    1         basic.6y                 6    0         basic.9y                 6    2         high.school             13    1         illiterate               0    0         professional.course      6    0         university.degree       25    6         unknown                  7    2卡方检验,在p值小于2.2e-16时,拒绝原假设,认为数据不服从卡方分布> chisq.test(tab)    Pearson's Chi-squared testdata:  tabX-squared = 122.66, df = 3, p-value < 2.2e-16画直方图> hist(bank$age)> library(lattice)画连续变量的分布,就是把直方图的中位数连接起来以年龄为横轴,y为纵轴,数据是bank,画图,auto.key是否有图例> densityplot(~age,groups = y,data=bank,plot.point=FALSE,auto.key = TRUE)画Box图> boxplot(age~y,data=bank)双样本t分布检验,p值小于0.05时拒绝原假设这里的原假设是两个样本没有相关性得到的结果是p值为1.805e-06,拒绝两个样本没有相关性的假设这里认为两个样本有相关性> t.test(age~y,data=bank,alternative="two.sided",var.equal=FALSE)    Welch Two Sample t-testdata:  age by yt = -4.7795, df = 5258.5, p-value = 1.805e-06alternative hypothesis: true difference in means is not equal to 095 percent confidence interval: -1.4129336 -0.5909889sample estimates: mean in group no mean in group yes          39.91119          40.91315 数据可视化画饼图> tab=table(bank$marital)> pie(tab)画直方图> tab=table(bank$marital)> barplot(tab)画下面这个图> tab=table(bank$marital,bank$y)> plot(tab) 画层叠直方图> tab=table(bank$marital,bank$y)> lattice::barchart(tab,auto.key=TRUE) 加载这个包,准备画图> library(dplyr)> data=group_by(bank,marital,y)> data=tally(data)!!!!!!!!!!!!!> ggplot2::ggplot(data=data,mapping=aes(marital,n))+geom_bar(mapping=aes(fill=y),position="dodge",stat="identity")数据预处理分组之后再画图> labels=c('青年','中年','老年')> bank$age_group=cut(bank$age,breaks = c(0,35,55,100),right = FALSE,labels = labels)> library(ggplot2)> ggplot(data=bank,mapping = aes(age_group))+geom_bar(mapping = aes(fill=y),position="dodge",stat="count") 衍生变量直接使用$符向原数据框添加新的变量> bank$log.cons.price.idx=log(bank$cons.price.idx)使用transform函数向原数据框添加变量> bank<-transform(bank,log.cons.price.idx=log(cons.price.idx),log.nr.employed=log(nr.employed))使用dplyr包里的mutate函数增加变量> bank<-dplyr::mutate(bank,log.cons.price.idx=log(cons.price.idx))使用dplyr包里的transmute函数只保留新生成的变量> bank2<-dplyr::transmute(bank,log.cons.price.idx=log(cons.price.idx),log.nr.employed=log(nr.employed))中心化> v=1:10> v1=v-mean(v)> v2=scale(v,center=TRUE,scale = FALSE)无量纲化> V1=v/sqrt(sum(v^2)/(length(v)-1))> v2=scale(v,center=FALSE,scale=TRUE)根据最大最小值进行归一化> v3=(v-min(v))/(max(v)-min(v))进行标准正态化> v1=(v-mean(v))/sd(v)> v2=scale(v,center = TRUE,scale=TRUE)Box-Cox变换使用car包里的boxCox函数> install.packages("car")> library(car)> boxCox(age~.,data=bank)  使用caret包,做Box-Cox变换> install.packages("caret")> library(caret)> dat<-subset(bank,select="age")> trans<-preProcess(dat,method=C("BoxCox"))数据预处理下违反常识的异常值基于数据分布的异常值(离群点)识别bank.dirty=read.csv("bank-dirty.csv")summary(bank.dirty)     age                  job            marital                    education     Min.   : 17.00   admin.     :10422   divorced: 4612   university.degree  :12165   1st Qu.: 32.00   blue-collar: 9254   married :24928   high.school        : 9515   Median : 38.00   technician : 6743   single  :11568   basic.9y           : 6043   Mean   : 40.03   services   : 3969   NA's    :   80   professional.course: 5242   3rd Qu.: 47.00   management : 2924                    basic.4y           : 4175   Max.   :123.00   (Other)    : 7546                    (Other)            : 2310   NA's   :2        NA's       :  330                    NA's               : 1738   default      housing        loan            contact          month       no  :32588   no  :18622   no  :33950   cellular :26144   may    :13769   yes :    3   yes :21576   yes : 6248   telephone:15044   jul    : 7174   NA's: 8597   NA's:  990   NA's:  990                     aug    : 6178                                                            jun    : 5318                                                            nov    : 4101                                                            apr    : 2632                                                            (Other): 2016   day_of_week    duration         campaign          pdays          previous     fri:7827    Min.   :   0.0   Min.   : 1.000   Min.   :  0.0   Min.   :0.000   mon:8514    1st Qu.: 102.0   1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.000   thu:8623    Median : 180.0   Median : 2.000   Median :999.0   Median :0.000   tue:8090    Mean   : 258.3   Mean   : 2.568   Mean   :962.5   Mean   :0.173   wed:8134    3rd Qu.: 319.0   3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.000               Max.   :4918.0   Max.   :56.000   Max.   :999.0   Max.   :7.000                                                                                        poutcome      emp.var.rate      cons.price.idx  cons.conf.idx   failure    : 4252   Min.   :-3.40000   Min.   :92.20   Min.   :-50.8   nonexistent:35563   1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7   success    : 1373   Median : 1.10000   Median :93.75   Median :-41.8                       Mean   : 0.08189   Mean   :93.58   Mean   :-40.5                       3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4                       Max.   : 1.40000   Max.   :94.77   Max.   :-26.9                                                                            euribor3m      nr.employed     y         Min.   :0.634   Min.   :4964   no :36548   1st Qu.:1.344   1st Qu.:5099   yes: 4640   Median :4.857   Median :5191               Mean   :3.621   Mean   :5167               3rd Qu.:4.961   3rd Qu.:5228               Max.   :5.045   Max.   :5228              常识告诉我们,虽然123岁的老人存在,但概率也极低,也不太可能是银行的客户找出在年龄这一列的上离群值和下离群值> head(bank.dirty[order(bank.dirty$age,decreasing = TRUE),'age',drop=FALSE],n=5)      age39494 12338453  9838456  9827827  9538922  94> tail(bank.dirty[order(bank.dirty$age,decreasing = TRUE),'age',drop=FALSE],n=5)      age37559  1737580  1738275  17120    NA156    NA异常值的处理当作缺失值处理> bank.dirty$age[which(bank.dirty$age>98)]<-NA删除或者插补重编码职业类型有12个分类,不利于后续分析,把除了unknown以外的分类进行重新编码,简化成4类Month有12个分类,把它转化成季度Education的分类,除了unknow之外有7类进行重编码levels(bank.dirty$job) <- c( "management","services","entrepreneur","entrepreneur",                        "management","unemployed",  "entrepreneur","services",                        "unemployed","services","unemployed","unknown" )> levels(bank.dirty$month) <- c("Q2","Q3","Q4","Q3","Q2",                         "Q1","Q2","Q4","Q4","Q3")> > levels(bank.dirty$education) <- c( "primary","primary","primary","secondary",                              "primary","tertiary","tertiary","unknown")缺失值分类较多,分类是unknown,不能给我们提供信息有些模型不能处理缺失值,比如Logistic回归缺失值插补的方法1、    用中位数或众数插补> library(imputeMissings)> bank.clean<-impute(bank.dirty,object = compute(bank.dirty,method = "median/mode"))2、    最邻近(knn)插补library(DMwR)bank.clean=knnImputation(bank.dirty,k=5)3、    随机森林插补library(missForest) Imp = missForest(bank.dirty) bank.clean = Imp$ximp缺失值插补的R包1、    imputeMissings包2、    DMwR包用Logistic回归建立客户响应模型1、    广义线性模型广义线性模型擅长于处理因变量不是连续变量的问题1)    Y是分类变量2)    Y是定序变量3)    Y是离散取值2、    当Y取值是0-1二分类变量是,就是Logistic回归Logistic回归在R中的实现数据重编码bank$y=ifelse(bank$y=='yes',1,0)改成以Q1为参考因子bank$month<-relevel(bank$month,ref="Q1")构建Logistic回归模型> model<-glm(y~.,data=bank,family = 'binomial')> summary(model)Call:glm(formula = y ~ ., family = "binomial", data = bank)Deviance Residuals:     Min       1Q   Median       3Q      Max  -5.9958  -0.3082  -0.1887  -0.1333   3.4283  Coefficients: (1 not defined because of singularities)                               Estimate Std. Error z value Pr(>|z|)    (Intercept)                  -1.957e+02  1.935e+01 -10.116  < 2e-16 ***age                           1.851e-03  2.415e-03   0.767 0.443289    jobblue-collar               -2.659e-01  7.942e-02  -3.348 0.000814 ***jobentrepreneur              -2.029e-01  1.248e-01  -1.626 0.103924    jobhousemaid                 -3.628e-02  1.475e-01  -0.246 0.805705    jobmanagement                -8.054e-02  8.501e-02  -0.947 0.343423    jobretired                    2.928e-01  1.067e-01   2.743 0.006092 ** jobself-employed             -1.680e-01  1.176e-01  -1.428 0.153332    jobservices                  -1.497e-01  8.552e-02  -1.751 0.079969 .  jobstudent                    2.674e-01  1.106e-01   2.416 0.015680 *  jobtechnician                 3.462e-03  7.096e-02   0.049 0.961086    jobunemployed                 8.514e-03  1.273e-01   0.067 0.946686    jobunknown                   -8.046e-02  2.390e-01  -0.337 0.736420    maritalmarried                1.567e-02  6.824e-02   0.230 0.818420    maritalsingle                 6.620e-02  7.791e-02   0.850 0.395473    maritalunknown                6.303e-02  4.113e-01   0.153 0.878211    educationbasic.6y             9.647e-02  1.202e-01   0.803 0.422195    educationbasic.9y            -2.154e-02  9.494e-02  -0.227 0.820557    educationhigh.school          3.381e-02  9.188e-02   0.368 0.712895    educationilliterate           1.132e+00  7.395e-01   1.531 0.125887    educationprofessional.course  1.136e-01  1.013e-01   1.121 0.262175    educationuniversity.degree    2.134e-01  9.188e-02   2.322 0.020211 *  educationunknown              1.361e-01  1.196e-01   1.138 0.255314    defaultunknown               -3.055e-01  6.712e-02  -4.552 5.32e-06 ***defaultyes                   -7.150e+00  1.135e+02  -0.063 0.949784    housingunknown               -7.385e-02  1.390e-01  -0.531 0.595260    housingyes                   -3.740e-03  4.121e-02  -0.091 0.927695    loanunknown                          NA         NA      NA       NA    loanyes                      -6.362e-02  5.725e-02  -1.111 0.266454    contacttelephone             -6.068e-01  7.124e-02  -8.518  < 2e-16 ***monthQ2                      -2.192e+00  1.125e-01 -19.479  < 2e-16 ***monthQ3                      -1.463e+00  1.148e-01 -12.747  < 2e-16 ***monthQ4                      -1.995e+00  1.240e-01 -16.088  < 2e-16 ***day_of_weekmon               -1.216e-01  6.588e-02  -1.846 0.064887 .  day_of_weekthu                6.375e-02  6.382e-02   0.999 0.317842    day_of_weektue                6.867e-02  6.545e-02   1.049 0.294118    day_of_weekwed                1.436e-01  6.530e-02   2.199 0.027911 *  duration                      4.667e-03  7.397e-05  63.092  < 2e-16 ***campaign                     -4.543e-02  1.158e-02  -3.922 8.77e-05 ***pdays                        -9.627e-04  2.162e-04  -4.452 8.50e-06 ***previous                     -5.806e-02  5.879e-02  -0.988 0.323369    poutcomenonexistent           4.507e-01  9.372e-02   4.809 1.51e-06 ***poutcomesuccess               9.371e-01  2.106e-01   4.451 8.56e-06 ***emp.var.rate                 -1.389e+00  7.693e-02 -18.057  < 2e-16 ***cons.price.idx                1.815e+00  1.193e-01  15.218  < 2e-16 ***cons.conf.idx                 3.353e-02  6.664e-03   5.033 4.84e-07 ***euribor3m                     6.054e-02  1.126e-01   0.537 0.590987    nr.employed                   4.937e-03  1.873e-03   2.635 0.008413 ** ---Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1(Dispersion parameter for binomial family taken to be 1)    Null deviance: 28999  on 41187  degrees of freedomResidual deviance: 17199  on 41141  degrees of freedomAIC: 17293Number of Fisher Scoring iterations: 10> exp(coef(model))                 (Intercept)                          age               jobblue-collar                 9.856544e-86                 1.001853e+00                 7.665077e-01              jobentrepreneur                 jobhousemaid                jobmanagement                 8.163314e-01                 9.643733e-01                 9.226187e-01                   jobretired             jobself-employed                  jobservices                 1.340142e+00                 8.453874e-01                 8.609387e-01                   jobstudent                jobtechnician                jobunemployed                 1.306514e+00                 1.003468e+00                 1.008550e+00                   jobunknown               maritalmarried                maritalsingle                 9.226922e-01                 1.015789e+00                 1.068445e+00               maritalunknown            educationbasic.6y            educationbasic.9y                 1.065061e+00                 1.101276e+00                 9.786948e-01         educationhigh.school          educationilliterate educationprofessional.course                 1.034388e+00                 3.101297e+00                 1.120248e+00   educationuniversity.degree             educationunknown               defaultunknown                 1.237856e+00                 1.145744e+00                 7.367445e-01                   defaultyes               housingunknown                   housingyes                 7.851906e-04                 9.288126e-01                 9.962671e-01                  loanunknown                      loanyes             contacttelephone                           NA                 9.383587e-01                 5.450980e-01                      monthQ2                      monthQ3                      monthQ4                 1.116739e-01                 2.314802e-01                 1.360620e-01               day_of_weekmon               day_of_weekthu               day_of_weektue                 8.854888e-01                 1.065828e+00                 1.071082e+00               day_of_weekwed                     duration                     campaign                 1.154380e+00                 1.004678e+00                 9.555850e-01                        pdays                     previous          poutcomenonexistent                 9.990378e-01                 9.435960e-01                 1.569466e+00              poutcomesuccess                 emp.var.rate               cons.price.idx                 2.552531e+00                 2.493091e-01                 6.140533e+00                cons.conf.idx                    euribor3m                  nr.employed                 1.034103e+00                 1.062408e+00                 1.004949e+00 Job变量的基准水平是management,从上面的结果看,服务业和自主劳动者购买银行产品的几率(odds)是管理岗从业人员的0.88倍,未就业人员购买银行产品的几率是管理岗人员的1.25倍> summary(model.step)向前逐步回归> model.step=step(model,direction = "backward")向后逐步回归> model.step = step(model, direction = "forward")双向逐步回归> model.step = step(model, direction = "both")> summary(model.step)Call:glm(formula = y ~ job + education + default + contact + month +     day_of_week + duration + campaign + pdays + poutcome + emp.var.rate +     cons.price.idx + cons.conf.idx + nr.employed, family = "binomial",     data = bank)Deviance Residuals:     Min       1Q   Median       3Q      Max  -5.9884  -0.3088  -0.1887  -0.1332   3.4026  Coefficients:                               Estimate Std. Error z value Pr(>|z|)    (Intercept)                  -2.031e+02  1.426e+01 -14.246  < 2e-16 ***jobblue-collar               -2.700e-01  7.917e-02  -3.411 0.000648 ***jobentrepreneur              -2.043e-01  1.242e-01  -1.645 0.100003    jobhousemaid                 -2.832e-02  1.464e-01  -0.193 0.846590    jobmanagement                -8.368e-02  8.409e-02  -0.995 0.319670    jobretired                    3.234e-01  9.130e-02   3.542 0.000397 ***jobself-employed             -1.670e-01  1.176e-01  -1.421 0.155435    jobservices                  -1.528e-01  8.545e-02  -1.789 0.073666 .  jobstudent                    2.682e-01  1.046e-01   2.565 0.010316 *  jobtechnician                 4.389e-03  7.093e-02   0.062 0.950665    jobunemployed                 8.975e-03  1.271e-01   0.071 0.943715    jobunknown                   -6.363e-02  2.378e-01  -0.268 0.789057    educationbasic.6y             8.993e-02  1.196e-01   0.752 0.452024    educationbasic.9y            -2.716e-02  9.416e-02  -0.288 0.772992    educationhigh.school          2.890e-02  9.053e-02   0.319 0.749573    educationilliterate           1.118e+00  7.398e-01   1.511 0.130744    educationprofessional.course  1.084e-01  1.004e-01   1.079 0.280686    educationuniversity.degree    2.103e-01  9.017e-02   2.332 0.019678 *  educationunknown              1.363e-01  1.195e-01   1.140 0.254110    defaultunknown               -3.017e-01  6.666e-02  -4.526 6.02e-06 ***defaultyes                   -7.141e+00  1.135e+02  -0.063 0.949831    contacttelephone             -6.011e-01  7.069e-02  -8.504  < 2e-16 ***monthQ2                      -2.210e+00  1.108e-01 -19.939  < 2e-16 ***monthQ3                      -1.475e+00  1.146e-01 -12.869  < 2e-16 ***monthQ4                      -1.982e+00  1.183e-01 -16.755  < 2e-16 ***day_of_weekmon               -1.210e-01  6.584e-02  -1.837 0.066174 .  day_of_weekthu                6.208e-02  6.374e-02   0.974 0.330066    day_of_weektue                6.851e-02  6.538e-02   1.048 0.294651    day_of_weekwed                1.420e-01  6.525e-02   2.176 0.029592 *  duration                      4.667e-03  7.396e-05  63.099  < 2e-16 ***campaign                     -4.587e-02  1.158e-02  -3.960 7.49e-05 ***pdays                        -8.822e-04  2.024e-04  -4.358 1.31e-05 ***poutcomenonexistent           5.219e-01  6.356e-02   8.211  < 2e-16 ***poutcomesuccess               9.996e-01  2.028e-01   4.928 8.31e-07 ***emp.var.rate                 -1.376e+00  6.885e-02 -19.980  < 2e-16 ***cons.price.idx                1.845e+00  1.041e-01  17.725  < 2e-16 ***cons.conf.idx                 3.622e-02  4.853e-03   7.464 8.42e-14 ***nr.employed                   5.883e-03  9.765e-04   6.024 1.70e-09 ***---Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1(Dispersion parameter for binomial family taken to be 1)    Null deviance: 28999  on 41187  degrees of freedomResidual deviance: 17203  on 41150  degrees of freedomAIC: 17279Number of Fisher Scoring iterations: 10模型预测用predict函数,参数type=’response’Newdata参数是要预测的数据集> prob<-predict(model.step,type = 'response')> head(prob)          1           2           3           4           5           6 0.015029328 0.006044212 0.011640349 0.010173952 0.016897254 0.007174804 假设以0.5为临界值> pre<-ifelse(prob>0.5,1,0)> table(pre,bank$y)   pre     0     1  0 35596  2667  1   952  1973> 预测的准确率> (35592+1964)/(35592+2676+956+1964)[1] 0.911819实际有响应的客户被识别出了多少> 1964/(1964+2676)[1] 0.4232759模型评估> confusionMatrix(bank$y,pre,pos='1')Confusion Matrix and Statistics          ReferencePrediction     0     1         0 35596   952         1  2667  1973                                                         Accuracy : 0.9121                           95% CI : (0.9094, 0.9149)    No Information Rate : 0.929               P-Value [Acc > NIR] : 1                                                                           Kappa : 0.476            Mcnemar's Test P-Value : <2e-16                                                                Sensitivity : 0.67453                     Specificity : 0.93030                  Pos Pred Value : 0.42522                  Neg Pred Value : 0.97395                      Prevalence : 0.07102                  Detection Rate : 0.04790            Detection Prevalence : 0.11265               Balanced Accuracy : 0.80241                                                          'Positive' Class : 1                                                   Kappa 统计量(kappa statistic)用于评判分类器的分类结果与随机分类的差异度用Kappa统计量评价:    较差:小于0.20    一般:0.20至0.40    稳健:0.40至0.60    好的:0.60至0.80很好的:0.80至1.00ROC曲线pred<-prediction(prob,bank$y)perf<-performance(pred,measure = "tpr",x="fpr")plot(perf) RandomForest加载数据列> data=read.table("input.txt",header = TRUE)> str(data)'data.frame':    222 obs. of  23 variables: $ Acti_Profile             : num  0 0 0 0 0 0 0 0 0 0 ... $ Activity                 : num  1.25 0 0.938 6.562 0 ... $ Diastolic_PTT            : num  256 240 253 0 241 ... $ Diastolic                : num  73.2 78.6 74 0 78.4 ... $ Heart_Rate_Curve         : num  81.2 69.7 77.6 95 83.6 ... $ Heart_Rate_Variability_HF: num  131 250 135 144 141 ... $ Heart_Rate_Variability_LF: num  311 218 203 301 244 ... $ MAP                      : num  86 93.5 86.9 0 91.7 ... $ Position                 : num  0 0 0 1 0 0 0 0 0 0 ... $ PTT_Raw                  : num  308 288 308 0 295 ... $ RR_Interval              : num  734 878 773 632 714 ... $ Sleep_Wake               : num  1 1 1 1 1 0 1 1 0 0 ... $ SpO2                     : num  0 0 99 0 98.4 ... $ Sympatho_Vagal_Balance   : num  23 8.17 14.5 20.4 16.88 ... $ Systolic_PTT             : num  308 288 307 0 295 ... $ Systolic                 : num  113 124 113 0 119 ... $ Autonomic_arousals       : num  0 0 0 0 0 0 0 0 0 0 ... $ Cardio_complex           : num  0 0 0 1 0 0 0 0 0 0 ... $ Cardio_rhythm            : num  0 0 2 0 0 0 0 0 0 0 ... $ Classification_Arousal   : num  0 0 0 0 0 0 0 0 0 0 ... $ PTT_Events               : num  1 0 2 0 0 0 0 0 0 0 ... $ Systolic_Events          : num  1 0 1 0 0 0 0 0 0 0 ... $ y                        : num  1 0 1 0 0 0 0 0 0 0 ...加载随机森林包> library(randomForest)进行训练  以y作为因变量,其余数据作为自变量> rf <- randomForest(y ~ ., data=data, ntree=100, proximity=TRUE,importance=TRUE)> plot(rf) 重要性检测衡量把一个变量的取值变为随机数,随机森林预测准确性的降低程度> importance(rf,type=1)                              %IncMSEActi_Profile               0.00000000Activity                   0.99353251Diastolic_PTT              0.32193611Diastolic                  1.99891809Heart_Rate_Curve           0.92001352Heart_Rate_Variability_HF  2.07870722Heart_Rate_Variability_LF -0.24957163MAP                        0.48142975Position                   1.86876751PTT_Raw                    1.94648914RR_Interval                0.60557964Sleep_Wake                 1.00503782SpO2                       0.25396165Sympatho_Vagal_Balance     1.42906765Systolic_PTT               1.27965813Systolic                   0.77382673Autonomic_arousals         0.00000000Cardio_complex             1.00503782Cardio_rhythm              1.14283152Classification_Arousal    -0.04383997PTT_Events                 4.63980680Systolic_Events           33.29461169输出随机森林的模型> print(rf)Call: randomForest(formula = y ~ ., data = data, ntree = 100, proximity = TRUE,      importance = TRUE)                Type of random forest: regression                     Number of trees: 100No. of variables tried at each split: 7          Mean of squared residuals: 0.003226897     残差平方和SSE                    % Var explained: 98.7> 总平方和(SST):(样本数据-样本均值)的平方和回归平方和(SSR):(预测数据-样本均值)的平方和残差平方和(SSE):(样本数据-预测数据均值)的平方和SST = SSR + SSE   基尼指数:> importance(rf,type=2)                          IncNodePurityActi_Profile                0.000000000Activity                    0.445181480Diastolic_PTT               0.452221870Diastolic                   0.449372186Heart_Rate_Curve            0.473113852Heart_Rate_Variability_HF   0.226815300Heart_Rate_Variability_LF   0.205457353MAP                         0.536977574Position                    0.307333210PTT_Raw                     0.656726800RR_Interval                 0.452738011Sleep_Wake                  0.014423077SpO2                        1.793361279Sympatho_Vagal_Balance      0.352759689Systolic_PTT                0.851951505Systolic                    0.823955781Autonomic_arousals          0.000000000Cardio_complex              0.008047619Cardio_rhythm               0.141907084Classification_Arousal      0.085739429PTT_Events                  7.468690820Systolic_Events            39.000163018> 进行预测prediction <- predict(rf, data[,],type="response")输出预测结果table(observed =data$y,predicted=prediction) plot(prediction) 支持向量机library(e1071)svmfit<-svm(y~.,data=data,kernel="linear",cost=10,scale=FALSE)> print(svmfit)Call:svm(formula = y ~ ., data = data, kernel = "linear", cost = 10, scale = FALSE)Parameters:   SVM-Type:  eps-regression  SVM-Kernel:  linear        cost:  10       gamma:  0.04545455     epsilon:  0.1 Number of Support Vectors:  20> plot(svmfit,data) 神经网络> concrete<-read_excel("Concrete_Data.xls")> str(concrete)Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    1030 obs. of  9 variables: $ Cement      : num  540 540 332 332 199 ... $ Slag        : num  0 0 142 142 132 ... $ Ash         : num  0 0 0 0 0 0 0 0 0 0 ... $ water       : num  162 162 228 228 192 228 228 228 228 228 ... $ superplastic: num  2.5 2.5 0 0 0 0 0 0 0 0 ... $ coarseagg   : num  1040 1055 932 932 978 ... $ fineagg     : num  676 676 594 594 826 ... $ age         : num  28 28 270 365 360 90 365 28 28 28 ... $ strength    : num  80 61.9 40.3 41.1 44.3 ...> normalize <- function(x){ return ((x-min(x))/(max(x)-min(x)))}> concrete_norm <- as.data.frame(lapply(concrete,normalize))> concrete_train <- concrete_norm[1:773,]> concrete_test <- concrete_norm[774:1030,]> library(neuralnet)> concrete_model <- neuralnet(strength ~ Cement+Slag+Ash+water+superplastic+coarseagg+fineagg+age,data=concrete_train)> plot(concrete_model) model_results <- compute(concrete_model,concrete_test[1:8])predicted_strength <- model_results$net.result> cor(predicted_strength,concrete_test$strength)             [,1][1,] 0.7205120076> concrete_model2 <- neuralnet(strength ~ Cement+Slag+Ash+water+superplastic+coarseagg+fineagg+age,data=concrete_train,hidden=5)> plot(concrete_model2) 计算误差> model_results2 <- compute(concrete_model2,concrete_test[1:8])> predicted_strength2 <- model_results2$net.result> cor(predicted_strength2,concrete_test$strength)             [,1][1,] 0.6727155609> 主成分分析身高、体重、胸围、坐高> test<-data.frame(+     X1=c(148, 139, 160, 149, 159, 142, 153, 150, 151, 139,+          140, 161, 158, 140, 137, 152, 149, 145, 160, 156,+          151, 147, 157, 147, 157, 151, 144, 141, 139, 148),+     X2=c(41, 34, 49, 36, 45, 31, 43, 43, 42, 31,+          29, 47, 49, 33, 31, 35, 47, 35, 47, 44,+          42, 38, 39, 30, 48, 36, 36, 30, 32, 38),+     X3=c(72, 71, 77, 67, 80, 66, 76, 77, 77, 68,+          64, 78, 78, 67, 66, 73, 82, 70, 74, 78,+          73, 73, 68, 65, 80, 74, 68, 67, 68, 70),+     X4=c(78, 76, 86, 79, 86, 76, 83, 79, 80, 74,+          74, 84, 83, 77, 73, 79, 79, 77, 87, 85,+          82, 78, 80, 75, 88, 80, 76, 76, 73, 78)+ )> test.pr<-princomp(test,cor=TRUE)> summary(test.pr,loadings=TRUE)Importance of components:                             Comp.1        Comp.2        Comp.3        Comp.4Standard deviation     1.8817805390 0.55980635717 0.28179594325 0.25711843909Proportion of Variance 0.8852744993 0.07834578938 0.01985223841 0.01652747293Cumulative Proportion  0.8852744993 0.96362028866 0.98347252707 1.00000000000Loadings:   Comp.1 Comp.2 Comp.3 Comp.4X1  0.497  0.543 -0.450  0.506X2  0.515 -0.210 -0.462 -0.691X3  0.481 -0.725  0.175  0.461X4  0.507  0.368  0.744 -0.232前两个主成分的累计贡献率已经达到96% 可以舍去另外两个主成分 达到降维的目的因此可以得到函数表达式 Z1=-0.497X'1-0.515X'2-0.481X'3-0.507X'4                                       Z2=  0.543X'1-0.210X'2-0.725X'3-0.368X'44.画主成分的碎石图并预测 > screeplot(test.pr,type="lines")> p<-predict(test.pr)> p              Comp.1         Comp.2         Comp.3          Comp.4 [1,] -0.06990949737 -0.23813701272 -0.35509247634 -0.266120139417 [2,] -1.59526339772 -0.71847399061  0.32813232022 -0.118056645885 [3,]  2.84793151061  0.38956678680 -0.09731731272 -0.279482487139 [4,] -0.75996988424  0.80604334819 -0.04945721875 -0.162949297761 [5,]  2.73966776853  0.01718087263  0.36012614873  0.358653043787 [6,] -2.10583167924  0.32284393414  0.18600422367 -0.036456083707 [7,]  1.42105591247 -0.06053164925  0.21093320662 -0.044223092351 [8,]  0.82583976981 -0.78102575640 -0.27557797533  0.057288571933 [9,]  0.93464401954 -0.58469241699 -0.08814135786  0.181037745585[10,] -2.36463819933 -0.36532199291  0.08840476284  0.045520127461[11,] -2.83741916086  0.34875841111  0.03310422938 -0.031146930047[12,]  2.60851223537  0.21278727930 -0.33398036623  0.210157574387[13,]  2.44253342081 -0.16769495893 -0.46918095412 -0.162987829937[14,] -1.86630668724  0.05021383642  0.37720280364 -0.358821916178[15,] -2.81347420580 -0.31790107093 -0.03291329149 -0.222035112399[16,] -0.06392982655  0.20718447599  0.04334339948  0.703533623798[17,]  1.55561022242 -1.70439673831 -0.33126406220  0.007551878960[18,] -1.07392250663 -0.06763418320  0.02283648409  0.048606680158[19,]  2.52174211878  0.97274300950  0.12164633439 -0.390667990681[20,]  2.14072377494  0.02217881219  0.37410972458  0.129548959692[21,]  0.79624421805  0.16307887263  0.12781269571 -0.294140762463[22,] -0.28708320594 -0.35744666106 -0.03962115883  0.080991988802[23,]  0.25151075072  1.25555187663 -0.55617324819  0.109068938725[24,] -2.05706031616  0.78894493512 -0.26552109297  0.388088642937[25,]  3.08596854773 -0.05775318018  0.62110421208 -0.218939612456[26,]  0.16367554630  0.04317931667  0.24481850312  0.560248997030[27,] -1.37265052598  0.02220972121 -0.23378320040 -0.257399715466[28,] -2.16097778154  0.13733232981  0.35589738735  0.093123683044[29,] -2.40434826507 -0.48613137190 -0.16154440788 -0.007914021222[30,] -0.50287467640  0.14734316507 -0.20590831261 -0.122078819188> 

 

 

加载数据

>w<-read.table("test.prn",header = T)

> w

  X.. X...1

1   A    2

2   B    3

3   C    5

4   D    5

> library(readxl)

>dat<-read_excel("test.xlsx")

> dat

# A tibble: 4 x 2

  `商品` `价格`

   <chr> <dbl>

1      A     2

2      B     3

3      C     5

4      D     5

>bank=read.table("bank-full.csv",header = TRUE,sep=",")

查看数据结构

> str(bank)

'data.frame':  41188 obs. of 21 variables:

 $ age          : int  56 57 37 40 56 45 59 41 2425 ...

 $ job          : Factor w/ 12 levels "admin.","blue-collar",..: 4 88 1 8 8 1 2 10 8 ...

 $ marital      : Factor w/ 4 levels "divorced","married",..: 2 2 22 2 2 2 2 3 3 ...

 $ education    : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 42 4 3 6 8 6 4 ...

 $ default      : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 21 2 1 1 ...

 $ housing      : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 11 1 3 3 ...

 $ loan         : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 11 1 1 1 ...

 $ contact      : Factor w/ 2 levels "cellular","telephone": 2 2 2 22 2 2 2 2 2 ...

 $ month        : Factor w/ 10 levels"apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...

 $ day_of_week  : Factor w/ 5 levels "fri","mon","thu",..:2 2 2 2 2 2 2 2 2 2 ...

 $ duration     : int  261 149 226 151 307 198 139217 380 50 ...

 $ campaign     : int  1 1 1 1 1 1 1 1 1 1 ...

 $ pdays        : int  999 999 999 999 999 999 999999 999 999 ...

 $ previous     : int  0 0 0 0 0 0 0 0 0 0 ...

 $ poutcome     : Factor w/ 3 levels "failure","nonexistent",..: 2 22 2 2 2 2 2 2 2 ...

 $ emp.var.rate : num  1.1 1.1 1.1 1.1 1.1 1.1 1.11.1 1.1 1.1 ...

 $ cons.price.idx: num  94 94 94 94 94 ...

 $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4-36.4 -36.4 -36.4 ...

 $ euribor3m    : num  4.86 4.86 4.86 4.86 4.86...

 $ nr.employed  : num  5191 5191 5191 5191 5191...

 $ y            : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1...

查看数据的最小值,最大值,中位数,平均数,分位数

> summary(bank)

      age                 job            marital    

 Min.  :17.00   admin.     :10422  divorced: 4612 

 1st Qu.:32.00  blue-collar: 9254   married:24928 

 Median :38.00  technician : 6743   single  :11568 

 Mean  :40.02   services   : 3969  unknown :   80 

 3rd Qu.:47.00  management : 2924                  

 Max.  :98.00   retired    : 1720                  

                 (Other)    : 6156                  

               education        default         housing    

 university.degree  :12168  no     :32588   no    :18622 

 high.school        : 9515  unknown: 8597   unknown:  990 

 basic.9y           : 6045   yes   :    3   yes   :21576 

 professional.course: 5243                                 

 basic.4y           : 4176                                 

 basic.6y           : 2292                                  

 (Other)            : 1749                                 

      loan            contact          month       day_of_week

 no    :33950   cellular :26144   may   :13769   fri:7827  

 unknown: 990   telephone:15044   jul   : 7174   mon:8514   

 yes    :6248                     aug    : 6178  thu:8623  

                                   jun    : 5318  tue:8090  

                                   nov    : 4101  wed:8134  

                                   apr    : 2632             

                                   (Other):2016             

    duration         campaign          pdays     

 Min.  :   0.0   Min.  : 1.000   Min.   : 0.0 

 1st Qu.: 102.0   1st Qu.: 1.000   1st Qu.:999.0 

 Median : 180.0   Median : 2.000   Median :999.0 

 Mean   :258.3   Mean   : 2.568  Mean   :962.5 

 3rd Qu.: 319.0   3rd Qu.: 3.000   3rd Qu.:999.0 

 Max.  :4918.0   Max.   :56.000  Max.   :999.0 

                                                 

    previous            poutcome      emp.var.rate    

 Min.  :0.000   failure    : 4252  Min.   :-3.40000 

 1st Qu.:0.000  nonexistent:35563   1stQu.:-1.80000 

 Median :0.000  success    : 1373   Median : 1.10000 

 Mean  :0.173                       Mean   : 0.08189 

 3rd Qu.:0.000                       3rd Qu.: 1.40000 

 Max.  :7.000                      Max.   : 1.40000 

                                                      

 cons.price.idx cons.conf.idx     euribor3m   

 Min.  :92.20   Min.   :-50.8  Min.   :0.634 

 1st Qu.:93.08  1st Qu.:-42.7   1st Qu.:1.344 

 Median :93.75  Median :-41.8   Median :4.857 

 Mean  :93.58   Mean   :-40.5  Mean   :3.621 

 3rd Qu.:93.99  3rd Qu.:-36.4   3rd Qu.:4.961 

 Max.  :94.77   Max.   :-26.9  Max.   :5.045 

                                               

  nr.employed     y       

 Min.  :4964   no :36548 

 1st Qu.:5099  yes: 4640 

 Median :5191             

 Mean  :5167             

 3rd Qu.:5228             

 Max.  :5228             

                          

> psych::describe(bank)

               方差  个数    平均值  标准差  均值    去掉最大   中位数   最小值  最大值  极差    偏差        峰度

                                                           绝对偏差

                                                  最小值

                                                  之后

                                                  的平均数

 

               vars     n   mean     sd  median trimmed   mad     min    max   range  skew    kurtosis

age               1 41188   40.02 10.42   38.00   39.30 10.38   17.00   98.00  81.00  0.78     0.79

job*              2 41188    4.72  3.59    3.00    4.48  2.97    1.00   12.00  11.00  0.45    -1.39

marital*          3 41188    2.17  0.61    2.00    2.21  0.00    1.00    4.00   3.00 -0.06    -0.34

education*        4 41188    4.75  2.14    4.00    4.88  2.97    1.00    8.00   7.00 -0.24    -1.21

default*          5 41188    1.21  0.41    1.00    1.14   0.00   1.00    3.00    2.00 1.44     0.07

housing*          6 41188    2.07  0.99    3.00    2.09  0.00    1.00    3.00   2.00 -0.14    -1.95

loan*             7 41188    1.33  0.72    1.00    1.16  0.00    1.00    3.00   2.00  1.82     1.38

contact*          8 41188    1.37  0.48    1.00    1.33  0.00    1.00    2.00   1.00  0.56    -1.69

month*            9 41188    5.23  2.32    5.00    5.31  2.97    1.00   10.00   9.00 -0.31    -1.03

day_of_week*     10 41188   3.00   1.40    3.00   3.01   1.48    1.00   5.00    4.00  0.01   -1.27

duration         11 41188  258.29 259.28 180.00  210.61 139.36    0.00 4918.00 4918.00  3.26   20.24

campaign         12 41188    2.57  2.77    2.00    1.99  1.48    1.00   56.00  55.00  4.76    36.97

pdays            13 41188  962.48 186.91 999.00  999.00   0.00   0.00  999.00  999.00 -4.92    22.23

previous         14 41188    0.17  0.49    0.00    0.05  0.00    0.00    7.00   7.00  3.83    20.11

poutcome*        15 41188    1.93  0.36    2.00    2.00  0.00    1.00    3.00   2.00 -0.88     3.98

emp.var.rate     16 41188   0.08   1.57    1.10   0.27   0.44   -3.40   1.40    4.80 -0.72    -1.06

cons.price.idx   17 41188  93.58   0.58   93.75  93.58   0.56   92.20  94.77    2.57 -0.23    -0.83

cons.conf.idx    18 41188 -40.50   4.63  -41.80 -40.60   6.52  -50.80 -26.90   23.90  0.30   -0.36

euribor3m        19 41188    3.62  1.73    4.86    3.81  0.16    0.63    5.04   4.41 -0.71    -1.41

nr.employed      20 41188 5167.04  72.25 5191.00 5178.43  55.00 4963.60 5228.10  264.50 -1.04     0.00

y*               21 41188    1.11  0.32    1.00    1.02  0.00    1.00    2.00   1.00  2.45     4.00

 

               se

age            0.05

job*           0.02

marital*       0.00

education*     0.01

default*       0.00

housing*       0.00

loan*          0.00

contact*       0.00

month*         0.01

day_of_week*   0.01

duration       1.28

campaign       0.01

pdays          0.92

previous       0.00

poutcome*      0.00

emp.var.rate   0.01

cons.price.idx 0.00

cons.conf.idx  0.02

euribor3m      0.01

nr.employed    0.36

y*             0.00

 

查看数据是否有缺失值

> sapply(bank,anyNA)

           age            job        marital      education

         FALSE          FALSE          FALSE          FALSE

       default        housing           loan        contact

         FALSE          FALSE          FALSE          FALSE

         month    day_of_week       duration       campaign

         FALSE          FALSE          FALSE          FALSE

         pdays       previous      poutcome   emp.var.rate

         FALSE          FALSE          FALSE          FALSE

cons.price.idx  cons.conf.idx      euribor3m    nr.employed

         FALSE          FALSE          FALSE          FALSE

             y

         FALSE

 

成功与不成功的个数

> table(bank$y)

 

   no  yes

36548  4640

 

在是否结婚这个属性的取值与

是否成功的数量比较

> table(bank$y,bank$marital)

    

      divorced married single unknown

  no     4136   22396   9948     68

  yes     476    2532   1620     12

 

> xtabs(~y+marital,data=bank)

     marital

y     divorced married single unknown

  no     4136   22396   9948     68

  yes     476    2532   1620     12

>tab=table(bank$y,bank$marital)

> tab

    

      divorced married single unknown

  no     4136   22396   9948     68

  yes     476    2532   1620     12

 

在是否结婚这个属性上的取值

> margin.table(tab,2)

 

divorced  married  single  unknown

    4612   24928    11568       80

> margin.table(tab,1)

 

   no  yes

36548  4640

 

在是否结婚这个属性上横向看概率

> prop.table(tab,1)

    

         divorced     married      single    unknown

  no 0.113166247 0.612783189 0.272189997 0.001860567

  yes 0.102586207 0.545689655 0.3491379310.002586207

在是否结婚这个属性上纵向看概率

 

> prop.table(tab,2)

    

       divorced   married   single   unknown

  no 0.8967910 0.8984275 0.8599585 0.8500000

  yes 0.1032090 0.1015725 0.1400415 0.1500000

 

 

平的列联表

以第一列和第二列,展开分类group by 1,2

以col.vars 的取值进行次数统计

>ftable(bank[,c(3,4,21)],row.vars = 1:2,col.vars = "y")

                             y   no yes

marital  education                     

divorced basic.4y               406   83

         basic.6y               169   13

         basic.9y               534   31

         high.school           1086 107

         illiterate               1    1

         professional.course    596  61

         university.degree     1177 160

         unknown                167   20

married  basic.4y              2915  313

         basic.6y              1628  139

         basic.9y              3858  298

         high.school           4683 475

         illiterate              12    3

         professional.course   2799 357

         university.degree     5573 821

         unknown                928  126

single   basic.4y               422   31

         basic.6y               301   36

         basic.9y              1174  142

         high.school           2702 448

         illiterate               1    0

         professional.course   1247 177

         university.degree     3723 683

         unknown                378  103

unknown  basic.4y                 5    1

         basic.6y                 6    0

         basic.9y                 6    2

         high.school             13    1

         illiterate               0    0

         professional.course      6   0

         university.degree       25   6

         unknown                  7    2

 

卡方检验,在p值小于2.2e-16时,拒绝原假设,认为数据不服从卡方分布

> chisq.test(tab)

 

        Pearson's Chi-squared test

 

data:  tab

X-squared = 122.66, df = 3,p-value < 2.2e-16

 

画直方图

> hist(bank$age)

> library(lattice)

 

画连续变量的分布,就是把直方图的中位数连接起来

以年龄为横轴,y为纵轴,数据是bank,画图,auto.key是否有图例

> densityplot(~age,groups =y,data=bank,plot.point=FALSE,auto.key = TRUE)

 

画Box图

> boxplot(age~y,data=bank)

 

双样本t分布检验,p值小于0.05时拒绝原假设

这里的原假设是两个样本没有相关性

得到的结果是p值为1.805e-06,拒绝两个样本没有相关性的假设

这里认为两个样本有相关性

>t.test(age~y,data=bank,alternative="two.sided",var.equal=FALSE)

 

        Welch Two Sample t-test

 

data:  age by y

t = -4.7795, df = 5258.5,p-value = 1.805e-06

alternative hypothesis: truedifference in means is not equal to 0

95 percent confidence interval:

 -1.4129336 -0.5909889

sample estimates:

 mean in group no mean in group yes

         39.91119          40.91315

 

 

数据可视化

画饼图

> tab=table(bank$marital)

> pie(tab)

 

画直方图

> tab=table(bank$marital)

> barplot(tab)

 

画下面这个图

> tab=table(bank$marital,bank$y)

> plot(tab)

 

 

画层叠直方图

>tab=table(bank$marital,bank$y)

>lattice::barchart(tab,auto.key=TRUE)

 

 

加载这个包,准备画图

> library(dplyr)

>data=group_by(bank,marital,y)

> data=tally(data)

!!!!!!!!!!!!!

> ggplot2::ggplot(data=data,mapping=aes(marital,n))+geom_bar(mapping=aes(fill=y),position="dodge",stat="identity")
 
 
 
数据预处理
分组之后再画图

> labels=c('青年','中年','老年')

> bank$age_group=cut(bank$age,breaks = c(0,35,55,100),right = FALSE,labels = labels)

> library(ggplot2)

> ggplot(data=bank,mapping = aes(age_group))+geom_bar(mapping = aes(fill=y),position="dodge",stat="count")

 

 

 

 

 

 

衍生变量
直接使用$符向原数据框添加新的变量

> bank$log.cons.price.idx=log(bank$cons.price.idx)

使用transform函数向原数据框添加变量

> bank<-transform(bank,log.cons.price.idx=log(cons.price.idx),log.nr.employed=log(nr.employed))

使用dplyr包里的mutate函数增加变量

> bank<-dplyr::mutate(bank,log.cons.price.idx=log(cons.price.idx))

使用dplyr包里的transmute函数只保留新生成的变量

> bank2<-dplyr::transmute(bank,log.cons.price.idx=log(cons.price.idx),log.nr.employed=log(nr.employed))

 

中心化

 

> v=1:10

> v1=v-mean(v)

> v2=scale(v,center=TRUE,scale = FALSE)

 

无量纲化

 

> V1=v/sqrt(sum(v^2)/(length(v)-1))

> v2=scale(v,center=FALSE,scale=TRUE)

 

根据最大最小值进行归一化

 

> v3=(v-min(v))/(max(v)-min(v))

 

 

进行标准正态化

 

 

> v1=(v-mean(v))/sd(v)

> v2=scale(v,center = TRUE,scale=TRUE)

 

 

 

 

Box-Cox变换

使用car包里的boxCox函数

> install.packages("car")

> library(car)

> boxCox(age~.,data=bank)

 

 

 

 

 

 

使用caret包,做Box-Cox变换

> install.packages("caret")

> library(caret)

> dat<-subset(bank,select="age")

> trans<-preProcess(dat,method=C("BoxCox"))

 

 

数据预处理下

违反常识的异常值

基于数据分布的异常值(离群点)识别

bank.dirty=read.csv("bank-dirty.csv")
summary(bank.dirty)

 

     age                  job            marital                    education    
 Min.   : 17.00   admin.     :10422   divorced: 4612   university.degree  :12165  
 1st Qu.: 32.00   blue-collar: 9254   married :24928   high.school        : 9515  
 Median : 38.00   technician : 6743   single  :11568   basic.9y           : 6043  
 Mean   : 40.03   services   : 3969   NA's    :   80   professional.course: 5242  
 3rd Qu.: 47.00   management : 2924                    basic.4y           : 4175  
 Max.   :123.00   (Other)    : 7546                    (Other)            : 2310  
 NA's   :2        NA's       :  330                    NA's               : 1738  
 default      housing        loan            contact          month      
 no  :32588   no  :18622   no  :33950   cellular :26144   may    :13769  
 yes :    3   yes :21576   yes : 6248   telephone:15044   jul    : 7174  
 NA's: 8597   NA's:  990   NA's:  990                     aug    : 6178  
                                                          jun    : 5318  
                                                          nov    : 4101  
                                                          apr    : 2632  
                                                          (Other): 2016  
 day_of_week    duration         campaign          pdays          previous    
 fri:7827    Min.   :   0.0   Min.   : 1.000   Min.   :  0.0   Min.   :0.000  
 mon:8514    1st Qu.: 102.0   1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.000  
 thu:8623    Median : 180.0   Median : 2.000   Median :999.0   Median :0.000  
 tue:8090    Mean   : 258.3   Mean   : 2.568   Mean   :962.5   Mean   :0.173  
 wed:8134    3rd Qu.: 319.0   3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.000  
             Max.   :4918.0   Max.   :56.000   Max.   :999.0   Max.   :7.000  
                                                                              
        poutcome      emp.var.rate      cons.price.idx  cons.conf.idx  
 failure    : 4252   Min.   :-3.40000   Min.   :92.20   Min.   :-50.8  
 nonexistent:35563   1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7  
 success    : 1373   Median : 1.10000   Median :93.75   Median :-41.8  
                     Mean   : 0.08189   Mean   :93.58   Mean   :-40.5  
                     3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4  
                     Max.   : 1.40000   Max.   :94.77   Max.   :-26.9  
                                                                       
   euribor3m      nr.employed     y        
 Min.   :0.634   Min.   :4964   no :36548  
 1st Qu.:1.344   1st Qu.:5099   yes: 4640  
 Median :4.857   Median :5191              
 Mean   :3.621   Mean   :5167              
 3rd Qu.:4.961   3rd Qu.:5228              
 Max.   :5.045   Max.   :5228              
 
 

常识告诉我们,虽然123岁的老人存在,但概率也极低,也不太可能是银行的客户

找出在年龄这一列的上离群值和下离群值

 

> head(bank.dirty[order(bank.dirty$age,decreasing = TRUE),'age',drop=FALSE],n=5)

      age

39494 123

38453  98

38456  98

27827  95

38922  94

> tail(bank.dirty[order(bank.dirty$age,decreasing = TRUE),'age',drop=FALSE],n=5)

      age

37559  17

37580  17

38275  17

120    NA

156    NA

 

异常值的处理

当作缺失值处理
> bank.dirty$age[which(bank.dirty$age>98)]<-NA

删除或者插补

 

 

重编码

职业类型有12个分类,不利于后续分析,把除了unknown以外的分类进行重新编码,简化成4类

Month有12个分类,把它转化成季度

Education的分类,除了unknow之外有7类

 

进行重编码

levels(bank.dirty$job) <- c( "management","services","entrepreneur","entrepreneur",
                       "management","unemployed",  "entrepreneur","services",
                       "unemployed","services","unemployed","unknown" )
> levels(bank.dirty$month) <- c("Q2","Q3","Q4","Q3","Q2",
                        "Q1","Q2","Q4","Q4","Q3")
> 
> levels(bank.dirty$education) <- c( "primary","primary","primary","secondary",
                             "primary","tertiary","tertiary","unknown")
 
 

缺失值

分类较多,分类是unknown,不能给我们提供信息

有些模型不能处理缺失值,比如Logistic回归

缺失值插补的方法

1、  用中位数或众数插补

> library(imputeMissings)
> bank.clean<-impute(bank.dirty,object = compute(bank.dirty,method = "median/mode"))

2、  最邻近(knn)插补

library(DMwR)
bank.clean=knnImputation(bank.dirty,k=5)

 

3、  随机森林插补

library(missForest)

 Imp = missForest(bank.dirty)

 bank.clean = Imp$ximp

 

缺失值插补的R包

1、  imputeMissings包

2、  DMwR包

 

 

 

 

 

 

用Logistic回归建立客户响应模型

1、广义线性模型

广义线性模型擅长于处理因变量不是连续变量的问题

1)  Y是分类变量

2)  Y是定序变量

3)  Y是离散取值

2、当Y取值是0-1二分类变量是,就是Logistic回归

 

Logistic回归在R中的实现

数据重编码

bank$y=ifelse(bank$y=='yes',1,0)

改成以Q1为参考因子

bank$month<-relevel(bank$month,ref="Q1")

构建Logistic回归模型

> model<-glm(y~.,data=bank,family = 'binomial')
> summary(model)
 
Call:
glm(formula = y ~ ., family = "binomial", data = bank)
 
Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-5.9958  -0.3082  -0.1887  -0.1333   3.4283  
 
Coefficients: (1 not defined because of singularities)
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                  -1.957e+02  1.935e+01 -10.116  < 2e-16 ***
age                           1.851e-03  2.415e-03   0.767 0.443289    
jobblue-collar               -2.659e-01  7.942e-02  -3.348 0.000814 ***
jobentrepreneur              -2.029e-01  1.248e-01  -1.626 0.103924    
jobhousemaid                 -3.628e-02  1.475e-01  -0.246 0.805705    
jobmanagement                -8.054e-02  8.501e-02  -0.947 0.343423    
jobretired                    2.928e-01  1.067e-01   2.743 0.006092 ** 
jobself-employed             -1.680e-01  1.176e-01  -1.428 0.153332    
jobservices                  -1.497e-01  8.552e-02  -1.751 0.079969 .  
jobstudent                    2.674e-01  1.106e-01   2.416 0.015680 *  
jobtechnician                 3.462e-03  7.096e-02   0.049 0.961086    
jobunemployed                 8.514e-03  1.273e-01   0.067 0.946686    
jobunknown                   -8.046e-02  2.390e-01  -0.337 0.736420    
maritalmarried                1.567e-02  6.824e-02   0.230 0.818420    
maritalsingle                 6.620e-02  7.791e-02   0.850 0.395473    
maritalunknown                6.303e-02  4.113e-01   0.153 0.878211    
educationbasic.6y             9.647e-02  1.202e-01   0.803 0.422195    
educationbasic.9y            -2.154e-02  9.494e-02  -0.227 0.820557    
educationhigh.school          3.381e-02  9.188e-02   0.368 0.712895    
educationilliterate           1.132e+00  7.395e-01   1.531 0.125887    
educationprofessional.course  1.136e-01  1.013e-01   1.121 0.262175    
educationuniversity.degree    2.134e-01  9.188e-02   2.322 0.020211 *  
educationunknown              1.361e-01  1.196e-01   1.138 0.255314    
defaultunknown               -3.055e-01  6.712e-02  -4.552 5.32e-06 ***
defaultyes                   -7.150e+00  1.135e+02  -0.063 0.949784    
housingunknown               -7.385e-02  1.390e-01  -0.531 0.595260    
housingyes                   -3.740e-03  4.121e-02  -0.091 0.927695    
loanunknown                          NA         NA      NA       NA    
loanyes                      -6.362e-02  5.725e-02  -1.111 0.266454    
contacttelephone             -6.068e-01  7.124e-02  -8.518  < 2e-16 ***
monthQ2                      -2.192e+00  1.125e-01 -19.479  < 2e-16 ***
monthQ3                      -1.463e+00  1.148e-01 -12.747  < 2e-16 ***
monthQ4                      -1.995e+00  1.240e-01 -16.088  < 2e-16 ***
day_of_weekmon               -1.216e-01  6.588e-02  -1.846 0.064887 .  
day_of_weekthu                6.375e-02  6.382e-02   0.999 0.317842    
day_of_weektue                6.867e-02  6.545e-02   1.049 0.294118    
day_of_weekwed                1.436e-01  6.530e-02   2.199 0.027911 *  
duration                      4.667e-03  7.397e-05  63.092  < 2e-16 ***
campaign                     -4.543e-02  1.158e-02  -3.922 8.77e-05 ***
pdays                        -9.627e-04  2.162e-04  -4.452 8.50e-06 ***
previous                     -5.806e-02  5.879e-02  -0.988 0.323369    
poutcomenonexistent           4.507e-01  9.372e-02   4.809 1.51e-06 ***
poutcomesuccess               9.371e-01  2.106e-01   4.451 8.56e-06 ***
emp.var.rate                 -1.389e+00  7.693e-02 -18.057  < 2e-16 ***
cons.price.idx                1.815e+00  1.193e-01  15.218  < 2e-16 ***
cons.conf.idx                 3.353e-02  6.664e-03   5.033 4.84e-07 ***
euribor3m                     6.054e-02  1.126e-01   0.537 0.590987    
nr.employed                   4.937e-03  1.873e-03   2.635 0.008413 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
(Dispersion parameter for binomial family taken to be 1)
 
    Null deviance: 28999  on 41187  degrees of freedom
Residual deviance: 17199  on 41141  degrees of freedom
AIC: 17293
 
Number of Fisher Scoring iterations: 10

 

 

> exp(coef(model))
                 (Intercept)                          age               jobblue-collar 
                9.856544e-86                 1.001853e+00                 7.665077e-01 
             jobentrepreneur                 jobhousemaid                jobmanagement 
                8.163314e-01                 9.643733e-01                 9.226187e-01 
                  jobretired             jobself-employed                  jobservices 
                1.340142e+00                 8.453874e-01                 8.609387e-01 
                  jobstudent                jobtechnician                jobunemployed 
                1.306514e+00                 1.003468e+00                 1.008550e+00 
                  jobunknown               maritalmarried                maritalsingle 
                9.226922e-01                 1.015789e+00                 1.068445e+00 
              maritalunknown            educationbasic.6y            educationbasic.9y 
                1.065061e+00                 1.101276e+00                 9.786948e-01 
        educationhigh.school          educationilliterate educationprofessional.course 
                1.034388e+00                 3.101297e+00                 1.120248e+00 
  educationuniversity.degree             educationunknown               defaultunknown 
                1.237856e+00                 1.145744e+00                 7.367445e-01 
                  defaultyes               housingunknown                   housingyes 
                7.851906e-04                 9.288126e-01                 9.962671e-01 
                 loanunknown                      loanyes             contacttelephone 
                          NA                 9.383587e-01                 5.450980e-01 
                     monthQ2                      monthQ3                      monthQ4 
                1.116739e-01                 2.314802e-01                 1.360620e-01 
              day_of_weekmon               day_of_weekthu               day_of_weektue 
                8.854888e-01                 1.065828e+00                 1.071082e+00 
              day_of_weekwed                     duration                     campaign 
                1.154380e+00                 1.004678e+00                 9.555850e-01 
                       pdays                     previous          poutcomenonexistent 
                9.990378e-01                 9.435960e-01                 1.569466e+00 
             poutcomesuccess                 emp.var.rate               cons.price.idx 
                2.552531e+00                 2.493091e-01                 6.140533e+00 
               cons.conf.idx                    euribor3m                  nr.employed 
                1.034103e+00                 1.062408e+00                 1.004949e+00 

 

 

Job变量的基准水平是management,从上面的结果看,服务业和自主劳动者购买银行产品的几率(odds)是管理岗从业人员的0.88倍,未就业人员购买银行产品的几率是管理岗人员的1.25倍

 

 

> summary(model.step)
向前逐步回归
> model.step=step(model,direction = "backward")
向后逐步回归
> model.step = step(model, direction = "forward")
双向逐步回归
> model.step = step(model, direction = "both")

> summary(model.step)

 

Call:

glm(formula = y ~ job + education + default + contact + month +

    day_of_week + duration + campaign + pdays + poutcome + emp.var.rate +

    cons.price.idx + cons.conf.idx + nr.employed, family = "binomial",

    data = bank)

 

Deviance Residuals:

    Min       1Q   Median       3Q      Max 

-5.9884  -0.3088  -0.1887  -0.1332   3.4026 

 

Coefficients:

                               Estimate Std. Error z value Pr(>|z|)   

(Intercept)                  -2.031e+02  1.426e+01 -14.246  < 2e-16 ***

jobblue-collar               -2.700e-01  7.917e-02  -3.411 0.000648 ***

jobentrepreneur              -2.043e-01  1.242e-01  -1.645 0.100003   

jobhousemaid                 -2.832e-02  1.464e-01  -0.193 0.846590   

jobmanagement                -8.368e-02  8.409e-02  -0.995 0.319670   

jobretired                    3.234e-01  9.130e-02   3.542 0.000397 ***

jobself-employed             -1.670e-01  1.176e-01  -1.421 0.155435   

jobservices                  -1.528e-01  8.545e-02  -1.789 0.073666 . 

jobstudent                    2.682e-01  1.046e-01   2.565 0.010316 * 

jobtechnician                 4.389e-03  7.093e-02   0.062 0.950665   

jobunemployed                 8.975e-03  1.271e-01   0.071 0.943715   

jobunknown                   -6.363e-02  2.378e-01  -0.268 0.789057   

educationbasic.6y             8.993e-02  1.196e-01   0.752 0.452024   

educationbasic.9y            -2.716e-02  9.416e-02  -0.288 0.772992   

educationhigh.school          2.890e-02  9.053e-02   0.319 0.749573   

educationilliterate           1.118e+00  7.398e-01   1.511 0.130744   

educationprofessional.course  1.084e-01  1.004e-01   1.079 0.280686   

educationuniversity.degree    2.103e-01  9.017e-02   2.332 0.019678 * 

educationunknown              1.363e-01  1.195e-01   1.140 0.254110   

defaultunknown               -3.017e-01  6.666e-02  -4.526 6.02e-06 ***

defaultyes                   -7.141e+00  1.135e+02  -0.063 0.949831   

contacttelephone             -6.011e-01  7.069e-02  -8.504  < 2e-16 ***

monthQ2                      -2.210e+00  1.108e-01 -19.939  < 2e-16 ***

monthQ3                      -1.475e+00  1.146e-01 -12.869  < 2e-16 ***

monthQ4                      -1.982e+00  1.183e-01 -16.755  < 2e-16 ***

day_of_weekmon               -1.210e-01  6.584e-02  -1.837 0.066174 . 

day_of_weekthu                6.208e-02  6.374e-02   0.974 0.330066   

day_of_weektue                6.851e-02  6.538e-02   1.048 0.294651    

day_of_weekwed                1.420e-01  6.525e-02   2.176 0.029592 * 

duration                      4.667e-03  7.396e-05  63.099  < 2e-16 ***

campaign                     -4.587e-02  1.158e-02  -3.960 7.49e-05 ***

pdays                        -8.822e-04  2.024e-04  -4.358 1.31e-05 ***

poutcomenonexistent           5.219e-01  6.356e-02   8.211  < 2e-16 ***

poutcomesuccess               9.996e-01  2.028e-01   4.928 8.31e-07 ***

emp.var.rate                 -1.376e+00  6.885e-02 -19.980  < 2e-16 ***

cons.price.idx                1.845e+00  1.041e-01  17.725  < 2e-16 ***

cons.conf.idx                 3.622e-02  4.853e-03   7.464 8.42e-14 ***

nr.employed                   5.883e-03  9.765e-04   6.024 1.70e-09 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

(Dispersion parameter for binomial family taken to be 1)

 

    Null deviance: 28999  on 41187  degrees of freedom

Residual deviance: 17203  on 41150  degrees of freedom

AIC: 17279

 

Number of Fisher Scoring iterations: 10

 

 

 

模型预测

用predict函数,参数type=’response’

Newdata参数是要预测的数据集

 

> prob<-predict(model.step,type = 'response')
> head(prob)
          1           2           3           4           5           6 
0.015029328 0.006044212 0.011640349 0.010173952 0.016897254 0.007174804 

 

假设以0.5为临界值

> pre<-ifelse(prob>0.5,1,0)

> table(pre,bank$y)

  

pre     0     1

  0 35596  2667

  1   952  1973

 

>

预测的准确率

> (35592+1964)/(35592+2676+956+1964)

[1] 0.911819

 

实际有响应的客户被识别出了多少

> 1964/(1964+2676)
[1] 0.4232759

 

 

模型评估

 

> confusionMatrix(bank$y,pre,pos='1')
Confusion Matrix and Statistics
 
          Reference
Prediction     0     1
         0 35596   952
         1  2667  1973
                                          
               Accuracy : 0.9121          
                 95% CI : (0.9094, 0.9149)
    No Information Rate : 0.929           
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.476           
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.67453         
            Specificity : 0.93030         
         Pos Pred Value : 0.42522         
         Neg Pred Value : 0.97395         
             Prevalence : 0.07102         
         Detection Rate : 0.04790         
   Detection Prevalence : 0.11265         
      Balanced Accuracy : 0.80241         
                                          
       'Positive' Class : 1               
                                    

Kappa 统计量(kappa statistic)

用于评判分类器的分类结果与随机分类的差异度

用Kappa统计量评价:

    较差:小于0.20

    一般:0.20至0.40

    稳健:0.40至0.60

    好的:0.60至0.80

很好的:0.80至1.00

 

 

ROC曲线

pred<-prediction(prob,bank$y)
perf<-performance(pred,measure = "tpr",x="fpr")
plot(perf)
 
 
 
 
 
 
 
 
 
 
 
 
RandomForest
加载数据列
 

> data=read.table("input.txt",header = TRUE)

> str(data)

'data.frame':  222 obs. of  23 variables:

 $ Acti_Profile             : num  0 0 0 0 0 0 0 0 0 0 ...

 $ Activity                 : num  1.25 0 0.938 6.562 0 ...

 $ Diastolic_PTT            : num  256 240 253 0 241 ...

 $ Diastolic                : num  73.2 78.6 74 0 78.4 ...

 $ Heart_Rate_Curve         : num  81.2 69.7 77.6 95 83.6 ...

 $ Heart_Rate_Variability_HF: num  131 250 135 144 141 ...

 $ Heart_Rate_Variability_LF: num  311 218 203 301 244 ...

 $ MAP                      : num  86 93.5 86.9 0 91.7 ...

 $ Position                 : num  0 0 0 1 0 0 0 0 0 0 ...

 $ PTT_Raw                  : num  308 288 308 0 295 ...

 $ RR_Interval              : num  734 878 773 632 714 ...

 $ Sleep_Wake               : num  1 1 1 1 1 0 1 1 0 0 ...

 $ SpO2                     : num  0 0 99 0 98.4 ...

 $ Sympatho_Vagal_Balance   : num  23 8.17 14.5 20.4 16.88 ...

 $ Systolic_PTT             : num  308 288 307 0 295 ...

 $ Systolic                 : num  113 124 113 0 119 ...

 $ Autonomic_arousals       : num  0 0 0 0 0 0 0 0 0 0 ...

 $ Cardio_complex           : num  0 0 0 1 0 0 0 0 0 0 ...

 $ Cardio_rhythm            : num  0 0 2 0 0 0 0 0 0 0 ...

 $ Classification_Arousal   : num  0 0 0 0 0 0 0 0 0 0 ...

 $ PTT_Events               : num  1 0 2 0 0 0 0 0 0 0 ...

 $ Systolic_Events          : num  1 0 1 0 0 0 0 0 0 0 ...

 $ y                        : num  1 0 1 0 0 0 0 0 0 0 ...

加载随机森林包

> library(randomForest)

进行训练  以y作为因变量,其余数据作为自变量

> rf <- randomForest(y ~ ., data=data, ntree=100, proximity=TRUE,importance=TRUE)

> plot(rf)

重要性检测

衡量把一个变量的取值变为随机数,随机森林预测准确性的降低程度

> importance(rf,type=1)

                              %IncMSE

Acti_Profile               0.00000000

Activity                   0.99353251

Diastolic_PTT              0.32193611

Diastolic                  1.99891809

Heart_Rate_Curve           0.92001352

Heart_Rate_Variability_HF  2.07870722

Heart_Rate_Variability_LF -0.24957163

MAP                        0.48142975

Position                   1.86876751

PTT_Raw                    1.94648914

RR_Interval                0.60557964

Sleep_Wake                 1.00503782

SpO2                       0.25396165

Sympatho_Vagal_Balance     1.42906765

Systolic_PTT               1.27965813

Systolic                   0.77382673

Autonomic_arousals         0.00000000

Cardio_complex             1.00503782

Cardio_rhythm              1.14283152

Classification_Arousal    -0.04383997

PTT_Events                 4.63980680

Systolic_Events           33.29461169

 

输出随机森林的模型

> print(rf)

 

Call:

 randomForest(formula = y ~ ., data = data, ntree = 100, proximity = TRUE,      importance = TRUE)

               Type of random forest: regression

                     Number of trees: 100

No. of variables tried at each split: 7

 

          Mean of squared residuals: 0.003226897     残差平方和SSE

                    % Var explained: 98.7

 

>

总平方和(SST):(样本数据-样本均值)的平方和

回归平方和(SSR):(预测数据-样本均值)的平方和

残差平方和(SSE):(样本数据-预测数据均值)的平方和

 

SST = SSR + SSE   

 

 

 

 

 

基尼指数:

 

> importance(rf,type=2)

                          IncNodePurity

Acti_Profile                0.000000000

Activity                    0.445181480

Diastolic_PTT               0.452221870

Diastolic                   0.449372186

Heart_Rate_Curve            0.473113852

Heart_Rate_Variability_HF   0.226815300

Heart_Rate_Variability_LF   0.205457353

MAP                         0.536977574

Position                    0.307333210

PTT_Raw                     0.656726800

RR_Interval                 0.452738011

Sleep_Wake                  0.014423077

SpO2                        1.793361279

Sympatho_Vagal_Balance      0.352759689

Systolic_PTT                0.851951505

Systolic                    0.823955781

Autonomic_arousals          0.000000000

Cardio_complex              0.008047619

Cardio_rhythm               0.141907084

Classification_Arousal      0.085739429

PTT_Events                  7.468690820

Systolic_Events            39.000163018

 

>

进行预测

prediction <- predict(rf, data[,],type="response")

输出预测结果

table(observed =data$y,predicted=prediction)

plot(prediction)

 

 

 

支持向量机

library(e1071)

svmfit<-svm(y~.,data=data,kernel="linear",cost=10,scale=FALSE)

> print(svmfit)

 

Call:

svm(formula = y ~ ., data = data, kernel = "linear", cost = 10, scale = FALSE)

 

 

Parameters:

   SVM-Type:  eps-regression

 SVM-Kernel:  linear

       cost:  10

      gamma:  0.04545455

    epsilon:  0.1

 

 

Number of Support Vectors:  20

> plot(svmfit,data)

 

 

神经网络

 

> concrete<-read_excel("Concrete_Data.xls")

> str(concrete)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    1030 obs. of  9 variables:

 $ Cement      : num  540 540 332 332 199 ...

 $ Slag        : num  0 0 142 142 132 ...

 $ Ash         : num  0 0 0 0 0 0 0 0 0 0 ...

 $ water       : num  162 162 228 228 192 228 228 228 228 228 ...

 $ superplastic: num  2.5 2.5 0 0 0 0 0 0 0 0 ...

 $ coarseagg   : num  1040 1055 932 932 978 ...

 $ fineagg     : num  676 676 594 594 826 ...

 $ age         : num  28 28 270 365 360 90 365 28 28 28 ...

 $ strength    : num  80 61.9 40.3 41.1 44.3 ...

 

 

> normalize <- function(x){ return ((x-min(x))/(max(x)-min(x)))}

> concrete_norm <- as.data.frame(lapply(concrete,normalize))

 

 

> concrete_train <- concrete_norm[1:773,]

> concrete_test <- concrete_norm[774:1030,]

 

 

> library(neuralnet)

> concrete_model <- neuralnet(strength ~ Cement+Slag+Ash+water+superplastic+coarseagg+fineagg+age,data=concrete_train)

> plot(concrete_model)

 

 

 

 

 

 

model_results <- compute(concrete_model,concrete_test[1:8])

predicted_strength <- model_results$net.result

> cor(predicted_strength,concrete_test$strength)

             [,1]

[1,] 0.7205120076

> concrete_model2 <- neuralnet(strength ~ Cement+Slag+Ash+water+superplastic+coarseagg+fineagg+age,data=concrete_train,hidden=5)

> plot(concrete_model2)

计算误差

> model_results2 <- compute(concrete_model2,concrete_test[1:8])

> predicted_strength2 <- model_results2$net.result

> cor(predicted_strength2,concrete_test$strength)

             [,1]

[1,] 0.6727155609

 

>

 

 

 

 

主成分分析

身高、体重、胸围、坐高

> test<-data.frame(

+     X1=c(148, 139, 160, 149, 159, 142, 153, 150, 151, 139,

+          140, 161, 158, 140, 137, 152, 149, 145, 160, 156,

+          151, 147, 157, 147, 157, 151, 144, 141, 139, 148),

+     X2=c(41, 34, 49, 36, 45, 31, 43, 43, 42, 31,

+          29, 47, 49, 33, 31, 35, 47, 35, 47, 44,

+          42, 38, 39, 30, 48, 36, 36, 30, 32, 38),

+     X3=c(72, 71, 77, 67, 80, 66, 76, 77, 77, 68,

+          64, 78, 78, 67, 66, 73, 82, 70, 74, 78,

+          73, 73, 68, 65, 80, 74, 68, 67, 68, 70),

+     X4=c(78, 76, 86, 79, 86, 76, 83, 79, 80, 74,

+          74, 84, 83, 77, 73, 79, 79, 77, 87, 85,

+          82, 78, 80, 75, 88, 80, 76, 76, 73, 78)

+ )

> test.pr<-princomp(test,cor=TRUE)

> summary(test.pr,loadings=TRUE)

Importance of components:

                             Comp.1        Comp.2        Comp.3        Comp.4

Standard deviation     1.8817805390 0.55980635717 0.28179594325 0.25711843909

Proportion of Variance 0.8852744993 0.07834578938 0.01985223841 0.01652747293

Cumulative Proportion  0.8852744993 0.96362028866 0.98347252707 1.00000000000

 

Loadings:

   Comp.1 Comp.2 Comp.3 Comp.4

X1  0.497  0.543 -0.450  0.506

X2  0.515 -0.210 -0.462 -0.691

X3  0.481 -0.725  0.175  0.461

X4  0.507  0.368  0.744 -0.232

 

 

前两个主成分的累计贡献率已经达到96% 可以舍去另外两个主成分达到降维的目的

因此可以得到函数表达式 Z1=-0.497X'1-0.515X'2-0.481X'3-0.507X'4

                                       Z2=  0.543X'1-0.210X'2-0.725X'3-0.368X'4

4.画主成分的碎石图并预测

> screeplot(test.pr,type="lines")

> p<-predict(test.pr)

> p

              Comp.1         Comp.2         Comp.3          Comp.4

 [1,] -0.06990949737 -0.23813701272 -0.35509247634 -0.266120139417

 [2,] -1.59526339772 -0.71847399061  0.32813232022 -0.118056645885

 [3,]  2.84793151061  0.38956678680 -0.09731731272 -0.279482487139

 [4,] -0.75996988424  0.80604334819 -0.04945721875 -0.162949297761

 [5,]  2.73966776853  0.01718087263  0.36012614873  0.358653043787

 [6,] -2.10583167924  0.32284393414  0.18600422367 -0.036456083707

 [7,]  1.42105591247 -0.06053164925  0.21093320662 -0.044223092351

 [8,]  0.82583976981 -0.78102575640 -0.27557797533  0.057288571933

 [9,]  0.93464401954 -0.58469241699 -0.08814135786  0.181037745585

[10,] -2.36463819933 -0.36532199291  0.08840476284  0.045520127461

[11,] -2.83741916086  0.34875841111  0.03310422938 -0.031146930047

[12,]  2.60851223537  0.21278727930 -0.33398036623  0.210157574387

[13,]  2.44253342081 -0.16769495893 -0.46918095412 -0.162987829937

[14,] -1.86630668724  0.05021383642  0.37720280364 -0.358821916178

[15,] -2.81347420580 -0.31790107093 -0.03291329149 -0.222035112399

[16,] -0.06392982655  0.20718447599  0.04334339948  0.703533623798

[17,]  1.55561022242 -1.70439673831 -0.33126406220  0.007551878960

[18,] -1.07392250663 -0.06763418320  0.02283648409  0.048606680158

[19,]  2.52174211878  0.97274300950  0.12164633439 -0.390667990681

[20,]  2.14072377494  0.02217881219  0.37410972458  0.129548959692

[21,]  0.79624421805  0.16307887263  0.12781269571 -0.294140762463

[22,] -0.28708320594 -0.35744666106 -0.03962115883  0.080991988802

[23,]  0.25151075072  1.25555187663 -0.55617324819  0.109068938725

[24,] -2.05706031616  0.78894493512 -0.26552109297  0.388088642937

[25,]  3.08596854773 -0.05775318018  0.62110421208 -0.218939612456

[26,]  0.16367554630  0.04317931667  0.24481850312  0.560248997030

[27,] -1.37265052598  0.02220972121 -0.23378320040 -0.257399715466

[28,] -2.16097778154  0.13733232981  0.35589738735  0.093123683044

[29,] -2.40434826507 -0.48613137190 -0.16154440788 -0.007914021222

[30,] -0.50287467640  0.14734316507 -0.20590831261 -0.122078819188

 

>

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
R语言函数是指在R语言中已经定义好的一系列可用来执行特定任务的操作。R语言作为一种功能强大的统计分析和数据可视化工具,拥有丰富的函数库,使得用户能够快速地实现各种统计分析和数据处理操作。 R语言函数大全及详解可以包括以下几大类: 1. 基本函数:包括算术运算(如加减乘除)、逻辑运算(如与或非)、数学函数(如指数、对数、三角函数等)等常用操作函数。 2. 数据处理函数:用于处理数据集,包括数据读取、数据清洗、数据变换等操作。例如,read.table()函数用于将外部文件读入R环境,subset()函数用于按照特定条件筛选数据等。 3. 统计分析函数:用于实现各种统计分析方法,包括描述性统计、假设检验、回归分析、因子分析等。例如,mean()函数用于计算平均值,t.test()函数用于进行t检验等。 4. 形函数:用于数据可视化,包括绘制散点、柱状、折线、饼等。例如,plot()函数用于绘制散点,hist()函数用于绘制直方等。 5. 数据结构函数:用于创建、操作和管理各种数据结构,包括向量、矩阵、数据框、列表等。例如,c()函数用于创建向量,matrix()函数用于创建矩阵,data.frame()函数用于创建数据框等。 这只是R函数的一小部分,R语言函数非常丰富多样,用户可以根据具体需求查阅相关文档或使用帮助功能来了解更多函数的详细用法和参数设置。同时,R语言开源社区也提供了丰富的函数包(packages),用户可以下载并安装这些函数包来扩展R的功能和使用范围。

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值