Logistic Regression Model
Logistic distribution:
x: a continuous variable
Cumulative distribution function:
F(x)=\frac{1}{1+\exp(-(x-\mu)/\gamma)}
Density function:
f(x)=\frac{\exp(-(x-\mu)/\gamma)}{\gamma(1+\exp(-(x-\mu)/\gamma))^2}
f(x) is symmetric about \mu.
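As a quick numerical check, here is a small Python sketch of the CDF and density above (the function names are my own), confirming that the density is symmetric about \mu:

```python
import math

# Logistic distribution with location mu and scale gamma (illustrative sketch).
def logistic_cdf(x, mu=0.0, gamma=1.0):
    return 1.0 / (1.0 + math.exp(-(x - mu) / gamma))

def logistic_pdf(x, mu=0.0, gamma=1.0):
    e = math.exp(-(x - mu) / gamma)
    return e / (gamma * (1.0 + e) ** 2)

# The density is symmetric about mu: f(mu + d) == f(mu - d).
mu, gamma = 2.0, 1.5
for d in (0.5, 1.0, 3.0):
    assert abs(logistic_pdf(mu + d, mu, gamma) - logistic_pdf(mu - d, mu, gamma)) < 1e-12
# The CDF at the center mu equals 1/2.
assert abs(logistic_cdf(mu, mu, gamma) - 0.5) < 1e-12
```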
The model
Input: x
Output: label Y (classification)
For a binary classification problem:
p(Y=1|x)=\frac{\exp(x^T\beta)}{1+\exp(x^T\beta)}
p(Y=0|x)=\frac{1}{1+\exp(x^T\beta)}
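The two conditional probabilities can be sketched in Python (the helper names `p1`/`p0` are my own); they sum to one, and the extremes illustrate the limiting behavior of the model:

```python
import math

# p(Y=1|x) and p(Y=0|x) of the binary logistic model, as functions of z = x^T beta.
def p1(z):
    return math.exp(z) / (1.0 + math.exp(z))

def p0(z):
    return 1.0 / (1.0 + math.exp(z))

z = 0.7  # an example value of x^T beta
assert abs(p1(z) + p0(z) - 1.0) < 1e-12
# As z grows large, p(Y=1|x) approaches 1; as z grows very negative, p(Y=0|x) approaches 1.
assert p1(20.0) > 0.999999
assert p0(-20.0) > 0.999999
```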
Model analysis:
If x^T\beta\rightarrow\infty, then p(Y=1|x)\rightarrow 1.
If x^T\beta\rightarrow-\infty, then p(Y=0|x)\rightarrow 1.
As a generalized linear model:
Odds:
\frac{p(Y=1|x)}{p(Y=0|x)}=\exp(x^T\beta)
Log odds:
\log\frac{p(Y=1|x)}{p(Y=0|x)}=x^T\beta
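A tiny check (the helper is my own) that the log odds of this model is exactly the linear term x^T\beta:

```python
import math

# log odds = log(p(Y=1|x) / p(Y=0|x)) as a function of z = x^T beta.
def log_odds(z):
    p_one = math.exp(z) / (1.0 + math.exp(z))
    p_zero = 1.0 - p_one
    return math.log(p_one / p_zero)

# The log odds recovers the linear predictor for any z.
for z in (-2.0, 0.0, 1.3):
    assert abs(log_odds(z) - z) < 1e-9
```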
Model estimation
Observation: for the i-th subject, (x_i,y_i)
Notation:
p(x_i,\beta)=p(Y=1|X=x_i)
Maximum likelihood estimation:
The observations are independent Bernoulli random variables, so
L(\beta)=\prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i}
l(\beta)=\sum_{i=1}^n y_i\log p_i+(1-y_i)\log(1-p_i)=\sum_{i=1}^n y_i\log p(x_i,\beta)+(1-y_i)\log(1-p(x_i,\beta))=\sum_{i=1}^n y_i x_i^T\beta-\log(1+\exp(x_i^T\beta))
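The algebraic simplification above can be verified numerically; the toy values of x_i^T\beta and y_i below are made up purely for illustration:

```python
import math

# Check that the two forms of l(beta) agree:
#   sum y_i*log(p_i) + (1-y_i)*log(1-p_i)  ==  sum y_i*z_i - log(1+exp(z_i)),
# where z_i = x_i^T beta and p_i = exp(z_i)/(1+exp(z_i)).
zs = [0.3, -1.2, 2.0, 0.0]   # hypothetical values of x_i^T beta
ys = [1, 0, 1, 0]            # hypothetical 0/1 labels

form1 = sum(y * math.log(math.exp(z) / (1 + math.exp(z)))
            + (1 - y) * math.log(1 / (1 + math.exp(z)))
            for y, z in zip(ys, zs))
form2 = sum(y * z - math.log(1 + math.exp(z)) for y, z in zip(ys, zs))
assert abs(form1 - form2) < 1e-12
```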
Differentiating the log-likelihood:
\frac{\partial l(\beta)}{\partial\beta}=\sum_{i=1}^n x_i(y_i-p(x_i,\beta))
Algorithm (Newton's method):
\beta^{new}=\beta^{old}-\left(\frac{\partial^2 l(\beta)}{\partial\beta\partial\beta^T}\right)^{-1}\frac{\partial l(\beta)}{\partial\beta}
\frac{\partial^2 l(\beta)}{\partial\beta\partial\beta^T}=-\sum_{i=1}^n x_i x_i^T p(x_i,\beta)(1-p(x_i,\beta))
Rewriting the relevant quantities in matrix form:
P=(p(x_1,\beta),\cdots,p(x_n,\beta))^T
W=diag(p(x_1,\beta)(1-p(x_1,\beta)),\cdots,p(x_n,\beta)(1-p(x_n,\beta)))
\frac{\partial l(\beta)}{\partial\beta}=\sum_{i=1}^n x_i(y_i-p(x_i,\beta))=X^T(Y-P)
\frac{\partial^2 l(\beta)}{\partial\beta\partial\beta^T}=-X^TWX
Here each x_i is written as a column vector, so X^T=(x_1,\cdots,x_n).
\beta^{new}=\beta^{old}+(X^TWX)^{-1}X^T(Y-P)=(X^TWX)^{-1}X^TW(X\beta^{old}+W^{-1}(Y-P))=(X^TWX)^{-1}X^TWZ
Z=X\beta^{old}+W^{-1}(Y-P)
This algorithm is known as iteratively reweighted least squares (IRLS).
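A minimal IRLS sketch in NumPy following the matrix forms above; `irls_logistic` and the synthetic data are my own illustration, not a reference implementation:

```python
import numpy as np

def irls_logistic(X, y, n_iter=25):
    """X: (n, d) design matrix; y: (n,) 0/1 labels. Returns the fitted beta."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # p(x_i, beta)
        w = p * (1.0 - p)                     # diagonal of W
        z = X @ beta + (y - p) / w            # working response Z = X beta + W^{-1}(Y - P)
        XtW = X.T * w                         # X^T W via broadcasting (W is diagonal)
        beta = np.linalg.solve(XtW @ X, XtW @ z)  # (X^T W X)^{-1} X^T W Z
    return beta

# Toy data drawn from a logistic model with a known coefficient vector
# (synthetic, for illustration only).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
true_beta = np.array([-0.5, 1.5])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
beta_hat = irls_logistic(X, y)
```

Storing only the diagonal of W (the vector `w`) avoids materializing an n-by-n matrix; the weighted least-squares solve is otherwise exactly the update \beta^{new}=(X^TWX)^{-1}X^TWZ.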
Comments (stated here without proof):
1. \hat{\beta} is asymptotically distributed as N(\beta,(X^TWX)^{-1}).
2. Likelihood ratio test:
LR=-2\max_{\beta_0}l(\beta_0,\beta_1=0)+2\max_{\beta_0,\beta_1}l(\beta_0,\beta_1)=DEV_0-DEV_1
i.e., twice the difference between the complex model's and the simple model's maximized log-likelihoods.
Under the null hypothesis, LR follows \chi^2(\text{number of parameters in }\beta_1).
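A hedged numerical illustration of the test: the two maximized log-likelihoods below are made-up numbers, and the \chi^2 tail probability for one degree of freedom is computed via the identity P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2}):

```python
import math

# LR = DEV_0 - DEV_1 = -2 * l(null) + 2 * l(full).
loglik_null = -120.4   # max over beta_0 with beta_1 fixed at 0 (hypothetical value)
loglik_full = -114.1   # max over (beta_0, beta_1)               (hypothetical value)
LR = -2.0 * loglik_null + 2.0 * loglik_full
df = 1                 # number of parameters in beta_1

# Tail probability of chi^2 with df = 1 (square of a standard normal).
p_value = math.erfc(math.sqrt(LR / 2.0))
```

For df > 1 the tail probability has no such elementary closed form; in practice one would use a library routine such as `scipy.stats.chi2.sf`.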
Multinomial logistic regression
Multi-class problem: Y\in\{1,\cdots,K\}
p(Y=k|x)=\frac{\exp(x^T\beta_k)}{1+\sum_{l=1}^{K-1}\exp(x^T\beta_l)},\quad k=1,2,\cdots,K-1
The probability of the last class is
p(Y=K|x)=1-\sum_{k=1}^{K-1}p(Y=k|x)
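A short sketch of the K-class probabilities under the standard multinomial-logit parameterization (class K as baseline); the linear scores x^T\beta_k below are hypothetical numbers:

```python
import math

# Scores z_k = x^T beta_k for k = 1..K-1 (here K = 4); class K is the baseline.
z = [0.4, -0.3, 1.1]
denom = 1.0 + sum(math.exp(zk) for zk in z)
probs = [math.exp(zk) / denom for zk in z]
probs.append(1.0 - sum(probs))   # probability of the last class K
```

The K probabilities sum to one, and the last class receives 1/(1+\sum_l\exp(x^T\beta_l)).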