二、因变量多分类 logistic
回归
1、概述:多元Logistic回归模型被用来建立有多个输出变量的模型,且这些预测变量通过一个线性组合变成为一个最终的预测变量。Multinomial
Logistic 回归模型中因变量可以取多个值.
所需的应用包:
library(foreign)
library(nnet)
library(ggplot2)
library(reshape2)
2、建模例子:
Example . Entering high school students make program
choices among general program, vocational program and academic
program. Their choice might be modeled using their writing score
and their social economic
status.
例子:进入高中的学生在做计划选择时,面临着一般计划、职业计划及学术计划三种选择。他们的选择往往是通过对他们的写作成绩及社会经济地位进行模型化决定。
1)读取数据:
ml
read.dta("http://www.ats.ucla.edu/stat/data/hsbdemo.dta")
这个数据集包含了有200个学生的记录,结果变量为prog(program
type)三种项目类型,预测变量为ses(social economic
status)社会经济地位,3个分类变量(id,female,schtyp),及其他连续变量。
2)数据初步探索:
with(ml,
table(ses, prog))
prog
ses
general
academic vocation
low
16
19
12
middle
20
44
31
high
9
42
7
按照prog分组统计写作得分的均值及标准方差
with(ml, do.call(rbind, tapply(write, prog, function(x) c(M =
mean(x), SD = sd(x)))))
M
SD
general 51.33333 9.397775
academic 56.25714 7.943343
vocation 46.76000 9.318754
3)建立模型:
利用nnet包中的函数multinom,建立多元logistic回归模型:
Before
running our model. We then choose the level of our outcome that we
wish to use as our baseline and specify this in
therelevelfunction.
Then, we run our model usingmultinom.
Themultinompackage
does not include p-value calculation for the regression
coefficients, so we calculate p-values using Wald tests (here
z-tests).
在建立模型前,我们选择输出变量的水平,并利用函数relevel设置prog为哑变量(虚拟变量),以academic作为参考水平:
ml$prog2
"academic")
test
~ ses + write, data = ml)
summary(test)
结果显示:
Call:
multinom(formula = prog2 ~ ses + write, data =
ml)
Coefficients:
(Intercept) sesmiddle
seshigh
write
general 2.852198
-0.5332810 -1.1628226 -0.0579287
vocation 5.218260
0.2913859 -0.9826649 -0.1136037
Std. Errors:
(Intercept) sesmiddle seshigh
write
general 1.166441
0.4437323 0.5142196 0.02141097
vocation 1.163552
0.4763739 0.5955665 0.02221996
Residual Deviance: 359.9635
AIC: 375.9635
z
summary(test)$coefficients/summary(test)$standard.errors
(Intercept) sesmiddle
seshigh
write
general 2.445214
-1.2018081 -2.261334 -2.705562
vocation 4.484769
0.6116747 -1.649967 -5.112689
2-tailed z
test
p
-
pnorm(abs(z),0, 1)) * 2
(Intercept)
sesmiddle
seshigh
write
general 0.0144766100 0.2294379
0.02373856 6.818902e-03
vocation 0.0000072993 0.5407530 0.09894976
3.176045e-07
The model summary output
has a block of coefficients and a block of standard errors.
模型的结果输出为一系列的系数及相应的标准差,每一行参数都反应了一个模型的等式。Each of
these blocks has one row of values corresponding to a model
equation. Focusing on the block of coefficients, we can
look at the first row
comparing prog =
"general"to our
baselineprog =
"academic"and the second row
comparingprog =
"vocation"to our
baselineprog =
"academic". If we consider our coefficients from the
first row to be b_1 and our coefficients from the second row to be
b_2, we can write our model equations:
\[ln\left(\frac{P(prog=general)}{P(prog=academic)}\right)
= b_{10} + b_{11}(ses=2) + b_{12}(ses=3) +
b_{13}write\]
模型公式一
\[ln\left(\frac{P(prog=vocation)}{P(prog=academic)}\right)
= b_{20} + b_{21}(ses=2) + b_{22}(ses=3) + b_{23}write\]
模型公式二
## extract the
coefficients from the model and exponentiate
exp(coef(test))
(Intercept) sesmiddle
seshigh
write
general 17.32582
0.5866769 0.3126026 0.9437172
vocation 184.61262 1.3382809 0.3743123
0.8926116
解释:
The relative risk ratio for a one-unit
increase in the
variable write is
.9437 for being in general program vs. academic
program.
The relative risk ratio switching
from ses= 1 to 3 is .3126 for being
in general program vs. academic program.
You can also use predicted
probabilities to help you understand the model.
You can calculate predicted probabilities for
each of our outcome levels using
thefittedfunction.
We can start by generating the predicted probabilities for the
observations in our dataset and viewing the first few
rows
head(pp
academic
general vocation
1 0.1482764 0.3382454 0.5134781
2 0.1202017 0.1806283 0.6991700
3 0.4186747 0.2368082 0.3445171
4 0.1726885 0.3508384 0.4764731
5 0.1001231 0.1689374 0.7309395
6 0.3533566 0.2377976 0.4088458
Next, if we want to examine the changes in predicted
probability associated with one of our two variables, we can create
small datasets varying one variable while holding the other
constant. We will first do this
holdingwriteat
its mean and examining the predicted probabilities for each level
ofses.
dses
data.frame(ses
=
c("low",
"middle",
"high"), write
=
mean(ml$write))
ses
write
1
low
52.775
2 middle 52.775
3
high
52.775
predict(test, newdata = dses,
"probs")
academic
general vocation
1 0.4396845 0.3581917 0.2021238
2 0.4777488 0.2283353 0.2939159
3 0.7009007 0.1784939 0.1206054
Another way to understand
the model using the predicted probabilities is to look at the
averaged predicted probabilities for different values of the
continuous predictor
variablewritewithin
each level
ofses.
dwrite
data.frame(ses = rep(c("low", "middle", "high"), each = 41), write
= rep(c(30:70),3))
##
store the predicted probabilities for each value of ses and
write
pp.write
cbind(dwrite, predict(test, newdata = dwrite, type = "probs", se =
TRUE))
## calculate the mean probabilities within each level of
ses
by(pp.write[,
3:5],
pp.write$ses,
colMeans)
pp.write$ses: high
academic general
vocation
0.6164315 0.1808037 0.2027648
------------------------------------------------------------------------
pp.write$ses: low
academic general
vocation
0.3972977 0.3278174 0.2748849
------------------------------------------------------------------------
pp.write$ses: middle
academic general
vocation
0.4256198 0.2010864 0.3732938
Sometimes, a couple of
plots can convey a good deal amount of information. Using the
predictions we generated for the pp.write object above, we can plot
the predicted probabilities against the writing score by the level
ofsesfor
different levels of the outcome variable.
## melt data set to long
for ggplot2
lpp
melt(pp.write, id.vars
=
c("ses",
"write"), value.name
=
"probability")
head(lpp)
# view first few
rows
ses write
variable
probability
1 low 30
academic 0.09843588
2 low 31
academic 0.10716868
3 low 32
academic 0.11650390
4 low 33
academic 0.12645834
5 low 34
academic 0.13704576
6 low 35
academic 0.14827643
## plot predicted
probabilities across write values for each level of ses
## facetted by program
type
ggplot(lpp, aes(x = write, y = probability, colour =
ses)) + geom_line() + facet_grid(variable ~ ., scales =
"free")
4、其他应用包及函数
1)程序包mlogit提供了多项logit的模型拟合函数:
mlogit(formula, data, subset, weights,na.action, start =
NULL,
alt.subset = NULL, reflevel = NULL,
nests = NULL, un.nest.el = FALSE, unscaled = FALSE,
heterosc = FALSE, rpar = NULL, probit = FALSE,
R = 40, correlation = FALSE, halton = NULL,
random.nb = NULL, panel = FALSE, estimate = TRUE,
seed = 10, ...)
mlogit.data(data, choice, shape = c("wide","long"), varying =
NULL,
sep=".",alt.var = NULL, chid.var = NULL, alt.levels =
NULL,id.var = NULL, opposite = NULL, drop.index = FALSE,
ranked = FALSE, ...)
参数说明:
formula:mlogit提供了条件logit,多项logit,混合logit多种模型,对于多项logit的估计模型应写为:因变量~0|自变量,如:mode
~ 0|income
data:使用mlogit.data函数使得数据结构符合mlogit函数要求。
Choice:确定分类变量是什么
Shape:如果每一行是一个观测,我们选择wide,如果每一行是表示一个选择,那么就应该选择long。
alt.var:对于shape为long的数据,需要标明所有的选择名称
例子:
install.packages("mlogit")
library(Formula)
library(maxLik)
library(miscTools)
library(mlogit)
data("Fishing", package = "mlogit")
Fish
= "mode")
summary(mlogit(mode ~ 0|income, data =
Fish))
结果显示:
Call:
mlogit(formula = mode ~ 0|income, data = Fish, method =
"nr",
print.level = 0)
Frequencies of alternatives:
beach
boat
charter
pier
0.11337
0.35364
0.38240
0.15059
nr method
4 iterations, 0h:0m:0s
g'(-H)^-1g = 8.32E-07
gradient close to zero
Coefficients :
Estimate Std. Error t-value Pr(>t)
boat:(intercept) 7.3892e-01 1.9673e-01 3.7560 0.0001727 ***
charter:(intercept) 1.3413e+00 1.9452e-01 6.8955 5.367e-12
***
pier:(intercept) 8.1415e-01 2.2863e-01 3.5610 0.0003695 ***
boat:income 9.1906e-05 4.0664e-05 2.2602 0.0238116 *
charter:income -3.1640e-05 4.1846e-05 -0.7561 0.4495908
pier:income -1.4340e-04 5.3288e-05 -2.6911 0.0071223 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’
1
Log-Likelihood: -1477.2
McFadden R^2: 0.013736
Likelihood ratio test : chisq = 41.145 (p.value =
6.0931e-09)
2)程序包MASS提供polr()函数可以进行ordered
logit或probit回归
polr(formula, data, weights, start, ..., subset, na.action,
contrasts = NULL, Hess = FALSE, model = TRUE,
method = c("logistic", "probit", "cloglog",
"cauchit"))
参数说明:
Formula:回归形式,与lm()函数的formula参数用法一致
Data:数据集
Method:默认为order logit,选择probit时变为order
probit模型。
例子:
library(MASS)
house.plr
weights = Freq, data = housing)
summary(house.plr, digits = 3)
结果显示:
Re-fitting to get Hessian
Call:
polr(formula = Sat ~ Infl + Type + Cont, data = housing, weights
= Freq)
Coefficients:
Value Std. Error t value
InflMedium 0.566 0.1047 5.41
InflHigh
1.289 0.1272 10.14
TypeApartment -0.572 0.1192 -4.80
TypeAtrium -0.366 0.1552 -2.36
TypeTerrace -1.091 0.1515 -7.20
ContHigh 0.360 0.0955 3.77
Intercepts:
Value Std. Error t value
Low|Medium -0.496 0.125 -3.974
Medium|High 0.691 0.125 5.505
Residual Deviance: 3479.149
AIC: 3495.149