Generalized Linear Model
线性回归模型的基本假设为:
E
(
Y
∣
X
)
=
μ
(
X
)
μ
(
X
)
=
X
T
β
E(Y|X)\ =\ \mu(X)\\ \mu(X)\ =\ X^T\beta
E(Y∣X) = μ(X)μ(X) = XTβ
由此看来
X
X
X只影响到
Y
Y
Y的均值,不会影响到
Y
Y
Y的方差,也就是说
Y
Y
Y的高斯分布为
N
(
μ
(
X
)
,
σ
2
)
N(\mu(X),\sigma^2)
N(μ(X),σ2),因此
Y
Y
Y的取值为整个实数域。如果将线性模型扩展到广义的模型,需要确定一个问题,什么样的分布可以使用GLM。
Exponential Distribution Family
指数分布族就是这样一个分布族,可以通过一定的方法,使
X
X
X和
Y
Y
Y的关系转化为线性模型的关系。这个分布族包含了许多分布:比如Gaussion分布、Bernulli分布、Poisson分布、Gamma分布等。指数分布族的表达形式如下:
P
θ
(
x
)
=
P
(
θ
,
x
)
=
exp
(
θ
⋅
x
)
f
(
x
)
g
(
θ
)
P_\theta (x) = P(\theta, x) = \exp(\theta\cdot x)f(x)g(\theta)
Pθ(x)=P(θ,x)=exp(θ⋅x)f(x)g(θ)
其中
θ
⋅
x
\theta\cdot x
θ⋅x为两个参数的内积
θ
\theta
θ是这个表征分布的参数。通过变形,可以得到下面的形式:
P
θ
(
x
)
=
exp
[
∑
i
=
1
k
η
i
(
θ
)
T
i
(
θ
)
−
B
(
θ
)
]
h
(
x
)
P_\theta (x)\ =\ \exp\left[\sum_{i=1}^k \eta_i(\theta)T_i(\theta) - B(\theta) \right]h(x)
Pθ(x) = exp[i=1∑kηi(θ)Ti(θ)−B(θ)]h(x)
我们可以认为
B
(
θ
)
B(\theta)
B(θ)用来归一化,
h
(
x
)
h(x)
h(x)用来处理离散值或者连续值。
以高斯分布为例:
P
θ
(
x
)
=
1
σ
2
π
exp
(
−
(
x
−
μ
)
2
2
σ
2
)
=
exp
(
−
(
x
2
2
σ
2
+
μ
2
2
σ
2
−
2
x
μ
2
σ
2
)
−
log
(
σ
2
π
)
)
=
exp
(
[
−
x
2
⋅
1
2
σ
2
+
x
⋅
μ
σ
2
]
−
(
μ
2
2
σ
2
+
log
(
σ
2
π
)
)
)
\begin{aligned} P_\theta(x)\ =\ & \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\\ \ =\ & \exp\left(-\left(\frac{x^2}{2\sigma^2}+\frac{\mu^2}{2\sigma^2}-\frac{2x\mu}{2\sigma^2}\right) - \log(\sigma\sqrt{2\pi})\right)\\ \ =\ & \exp\left(\left[-x^2\cdot \frac{1}{2\sigma^2}+x\cdot \frac{\mu}{\sigma^2}\right] - \left(\frac{\mu^2}{2\sigma^2} + \log(\sigma\sqrt{2\pi})\right)\right) \end{aligned}
Pθ(x) = = = σ2π1exp(−2σ2(x−μ)2)exp(−(2σ2x2+2σ2μ2−2σ22xμ)−log(σ2π))exp([−x2⋅2σ21+x⋅σ2μ]−(2σ2μ2+log(σ2π)))
其中前面一部分就是
η
(
θ
)
⋅
T
(
x
)
\eta (\theta)\cdot T(x)
η(θ)⋅T(x),后面一部分可以认为
B
(
x
)
B(x)
B(x)(可以将
2
π
\sqrt{2\pi}
2π放到指数的系数上),
h
(
x
)
h(x)
h(x)此时可以认为是1。
exp
(
x
⋅
μ
σ
2
−
μ
2
2
σ
2
)
⋅
(
exp
(
−
x
2
2
σ
2
)
σ
2
π
)
\begin{aligned} \exp\left(x\cdot\frac{\mu}{\sigma^2}-\frac{\mu^2}{2\sigma^2}\right)\cdot \left(\frac{\exp(-\frac{x^2}{2\sigma^2})}{\sigma\sqrt{2\pi}}\right) \end{aligned}
exp(x⋅σ2μ−2σ2μ2)⋅(σ2πexp(−2σ2x2))
这个形式就一定程度上可以看出高斯分布是由一个中心伴随它的偏离。
接下来是Bernulli分布:
P
θ
(
x
)
=
p
x
(
1
−
p
)
(
1
−
x
)
=
exp
(
x
log
(
p
1
−
p
)
+
log
(
1
−
p
)
)
\begin{aligned} P_\theta(x)\ =\ & p^x(1-p)^(1-x)\\ \ =\ & \exp\left(x\log(\frac{p}{1-p})+\log(1-p)\right) \end{aligned}
Pθ(x) = = px(1−p)(1−x)exp(xlog(1−pp)+log(1−p))
接下来介绍的是Canonical Exponential Family。这种分布族在仅有
θ
\theta
θ一个未知参数的时候,可以得到很多很好的形状,概率密度形式如下:
f
θ
(
y
)
=
exp
(
y
⋅
θ
−
b
(
θ
)
ϕ
+
c
(
y
,
θ
)
)
\begin{aligned} f_\theta(y)\ =\ \exp\left(\frac{y\cdot \theta - b(\theta)}{\phi}+c(y,\theta)\right) \end{aligned}
fθ(y) = exp(ϕy⋅θ−b(θ)+c(y,θ))
这里我们假设
ϕ
\phi
ϕ是已知的。以高斯分布作为例子:
f
θ
(
y
)
=
exp
(
−
(
x
2
2
σ
2
+
μ
2
2
σ
2
−
2
x
μ
2
σ
2
)
−
log
(
σ
2
π
)
)
=
exp
(
−
y
2
2
−
μ
2
2
+
y
μ
ϕ
−
log
(
σ
2
π
)
=
exp
(
y
θ
−
μ
2
2
ϕ
−
(
log
(
σ
2
π
−
y
2
2
ϕ
)
)
\begin{aligned} f_\theta(y)\ =\ &\exp\left(-\left(\frac{x^2}{2\sigma^2}+\frac{\mu^2}{2\sigma^2}-\frac{2x\mu}{2\sigma^2}\right) - \log(\sigma\sqrt{2\pi})\right)\\ \ =\ & \exp\left(\frac{-\frac{y^2}{2}-\frac{\mu^2}{2}+y\mu}{\phi}-\log(\sigma\sqrt{2\pi}\right)\\ \ =\ & \exp\left(\frac{y\theta-\frac{\mu^2}{2}}{\phi}-\left(\log(\sigma\sqrt{2\pi-\frac{y^2}{2\phi}}\right)\right) \end{aligned}
fθ(y) = = = exp(−(2σ2x2+2σ2μ2−2σ22xμ)−log(σ2π))exp(ϕ−2y2−2μ2+yμ−log(σ2π)exp(ϕyθ−2μ2−(log(σ2π−2ϕy2))
这里的
b
(
θ
)
=
y
2
2
b(\theta)=\frac{y^2}{2}
b(θ)=2y2。
因为我们假设了
ϕ
\phi
ϕ是已知的,因此在转化的过程中,主需要拼凑出
y
θ
y\theta
yθ这一项。
Likelihood
需要进行对
θ
\theta
θ的参数估计时,似然函数是我们首先想到的,假设概率密度的对数似然函数为为
l
(
θ
)
=
log
(
f
θ
(
y
)
)
l(\theta)=\log(f_\theta(y))
l(θ)=log(fθ(y)),
E
[
∂
l
∂
θ
]
=
0
E
[
∂
2
l
∂
θ
2
]
+
E
[
∂
l
∂
θ
]
2
=
0
\begin{aligned} E\left[\frac{\partial l}{\partial \theta}\right]\ =\ 0\\ E\left[\frac{\partial^2 l}{\partial \theta^2}\right]+E\left[\frac{\partial l}{\partial \theta}\right]^2\ =\ 0 \end{aligned}
E[∂θ∂l] = 0E[∂θ2∂2l]+E[∂θ∂l]2 = 0
由此可以得到:
m
e
a
n
=
b
′
(
θ
)
=
μ
V
a
r
=
ϕ
b
′
′
(
θ
)
=
σ
2
\begin{aligned} mean \ =\ & b'(\theta)\ =\ \mu\\ Var \ =\ & \phi b''(\theta)\ =\ \sigma^2 \end{aligned}
mean = Var = b′(θ) = μϕb′′(θ) = σ2
这样就产生了指数分布族的参数与其矩的对应关系。
简要证明如下:
∂
∂
θ
log
(
f
(
θ
)
)
=
∂
f
(
θ
)
∂
θ
f
(
θ
)
E
=
∫
−
∞
∞
∂
f
(
θ
)
∂
θ
f
(
θ
)
f
(
θ
)
d
θ
=
∂
∂
θ
∫
−
∞
∞
f
(
θ
)
d
θ
=
0
l
(
θ
)
=
y
⋅
θ
−
b
(
θ
)
ϕ
+
c
(
y
,
θ
)
∂
l
∂
θ
=
y
−
b
′
(
θ
)
ϕ
E
[
∂
l
∂
θ
]
=
E
(
y
)
−
b
′
(
θ
)
ϕ
∂
2
∂
θ
2
log
f
(
θ
)
=
(
∂
2
f
(
θ
)
)
∂
θ
2
f
(
θ
)
−
∂
f
(
θ
)
∂
θ
∂
f
(
θ
)
∂
θ
f
θ
2
E
=
∫
∂
2
∂
θ
2
f
(
θ
)
−
(
∂
f
(
θ
)
∂
θ
)
2
f
(
θ
)
d
θ
∂
2
l
∂
θ
2
+
∂
l
∂
θ
=
−
b
′
′
(
θ
)
ϕ
+
(
y
−
b
′
(
θ
)
ϕ
)
2
=
0
b
′
′
(
θ
)
=
v
a
r
(
y
)
ϕ
\begin{aligned} \frac{\partial}{\partial \theta}\log(f(\theta))\ =\ &\frac{\frac{\partial f(\theta)}{\partial \theta}}{f(\theta)}\\ E \ =\ & \int_{-\infty}^{\infty} \frac{\frac{\partial f(\theta)}{\partial \theta}}{f(\theta)} f(\theta) d\theta\\ \ =\ & \frac{\partial }{\partial \theta}\int_{-\infty}^{\infty} f(\theta) d\theta\\ \ =\ & 0\\ l(\theta)\ =\ & \frac{y\cdot \theta-b(\theta)}{\phi}+c(y,\theta)\\ \frac{\partial l}{\partial \theta}\ =\ & \frac{y - b'(\theta)}{\phi}\\ E\left[\frac{\partial l}{\partial \theta}\right]\ =\ &\frac{E(y)-b'(\theta)}{\phi}\\ \frac{\partial^2 }{\partial \theta^2}\log f(\theta)\ =\ & \frac{(\frac{\partial^2 f(\theta))}{\partial\theta^2}f(\theta)-\frac{\partial f(\theta)}{\partial\theta}\frac{\partial f(\theta)}{\partial\theta}}{f^2_\theta}\\ E \ =\ &\int\frac{\partial^2}{\partial \theta^2}f(\theta)-\frac{(\frac{\partial f(\theta)}{\partial\theta})^2}{f(\theta)}d\theta\\ \frac{\partial^2 l}{\partial \theta^2} + \frac{\partial l}{\partial \theta}\ =\ & -\frac{b''(\theta)}{\phi} + \left(\frac{y-b'(\theta)}{\phi}\right)^2\\ \ =\ & 0\\ b''(\theta)\ =\ &\frac{var(y)}{\phi} \end{aligned}
∂θ∂log(f(θ)) = E = = = l(θ) = ∂θ∂l = E[∂θ∂l] = ∂θ2∂2logf(θ) = E = ∂θ2∂2l+∂θ∂l = = b′′(θ) = f(θ)∂θ∂f(θ)∫−∞∞f(θ)∂θ∂f(θ)f(θ)dθ∂θ∂∫−∞∞f(θ)dθ0ϕy⋅θ−b(θ)+c(y,θ)ϕy−b′(θ)ϕE(y)−b′(θ)fθ2(∂θ2∂2f(θ))f(θ)−∂θ∂f(θ)∂θ∂f(θ)∫∂θ2∂2f(θ)−f(θ)(∂θ∂f(θ))2dθ−ϕb′′(θ)+(ϕy−b′(θ))20ϕvar(y)
我们接下来讨论的分布都属于Canonical Exponential Distribution Family。
Link Function
这些分布由于性质,
Y
Y
Y的取值有所限定,比如Bernulli分布的取值为
{
0
,
1
}
\{0,1\}
{0,1},Poisson分布的取值为非负整数,这样导致了两者均值为
[
0
,
1
]
,
[
0
,
+
∞
)
[0,1],[0,+\infty)
[0,1],[0,+∞)。因此需要一个函数将取值转化到整个实数域
R
\mathbf{R}
R上,这个函数就是Link Function
g
(
μ
(
X
)
)
=
X
T
β
g(\mu(X))=X^T\beta
g(μ(X))=XTβ。比如指数分布:
μ
=
r
exp
δ
t
log
(
μ
)
=
log
(
r
)
+
δ
t
\begin{aligned} \mu\ =\ &r\exp{\delta t}\\ \log{(\mu)}\ =\ &\log(r)+\delta t \end{aligned}
μ = log(μ) = rexpδtlog(r)+δt
其中
r
,
δ
r,\delta
r,δ是参数,
t
,
μ
t,\mu
t,μ是对应的
X
,
Y
X,Y
X,Y。这样的情况下,我们就可以把它抽象为
g
(
μ
)
=
β
0
+
X
T
β
1
\begin{aligned} g(\mu)\ =\ \beta_0 + X^T\beta_1 \end{aligned}
g(μ) = β0+XTβ1
也就是说Link Function的目标就是将取值范围拓展到整个实数轴上就行了,对于一个特定的分布,可以说这样的函数有很多种。以Poisson分布为例
Y
∣
X
P
o
i
s
s
o
n
(
μ
(
X
)
)
Y|X~Poisson(\mu(X))
Y∣X Poisson(μ(X)),Link Function需要把非负实数映射到整个实数域,选择对数函数是一个好的方法:
g
(
μ
(
x
)
)
=
X
T
β
g(\mu(x))=X^T\beta
g(μ(x))=XTβ。至于Bernulli分布,使用的函数为
g
(
μ
)
=
ln
(
μ
1
−
μ
)
g(\mu)=\ln(\frac{\mu}{1-\mu})
g(μ)=ln(1−μμ)(logit),或者是
Φ
−
1
(
μ
)
\Phi^{-1}(\mu)
Φ−1(μ)(probit)。\par
从函数性质来看,link function
g
g
g需要有如下性质:
- 连续可导
- 单调严格增长
- I m ( g ) = R Im(g)=\mathcal{R} Im(g)=R
- g − 1 g^{-1} g−1存在且单调递增
那么我们如何选择这样的Link Function。之前谈到指数分布族的参数
θ
,
b
′
(
θ
)
=
μ
\theta,b'(\theta)=\mu
θ,b′(θ)=μ,而
g
(
μ
)
=
X
T
β
g(\mu)=X^T\beta
g(μ)=XTβ。既然
g
g
g可以有很多种选择,
θ
=
g
(
μ
)
=
X
T
β
\theta=g(\mu)=X^T\beta
θ=g(μ)=XTβ又何妨。这样就直接建立起Canonical EDF和Link Function的关系:
(
b
′
)
=
g
−
1
b
′
−
1
=
g
\begin{aligned} (b') \ =\ &g^{-1}\\ b'^{-1} \ =\ &g \end{aligned}
(b′) = b′−1 = g−1g
θ
,
μ
,
β
\theta, \mu, \beta
θ,μ,β三者建立起了直接关系:
θ
(
X
)
⟷
b
′
μ
(
X
)
⟷
g
(
μ
)
=
X
T
β
β
θ
(
x
i
)
=
b
′
−
1
(
g
−
1
(
x
i
)
)
=
h
(
x
i
T
β
)
=
x
i
T
β
\begin{aligned} \theta(X)\stackrel{b'}{\longleftrightarrow}\mu(X)\stackrel{g(\mu)=X^T\beta}{\longleftrightarrow}\beta\\ \theta(x_i)\ =\ b'^{-1}(g^{-1}(x_i))\ =\ h(x_i^T\beta)\ = \ x_i^T\beta \end{aligned}
θ(X)⟷b′μ(X)⟷g(μ)=XTββθ(xi) = b′−1(g−1(xi)) = h(xiTβ) = xiTβ
以Bernulli分布为例:
p
y
(
1
−
p
)
1
−
y
=
exp
(
y
log
p
1
−
p
+
log
(
1
−
p
)
)
θ
=
log
p
1
−
p
p
=
e
θ
1
+
e
θ
f
θ
(
y
)
=
exp
(
y
θ
−
log
(
1
+
e
θ
)
)
b
′
(
θ
)
=
e
θ
1
+
e
θ
=
p
b
′
′
(
θ
)
=
e
θ
(
1
+
e
θ
)
2
=
p
(
1
−
p
)
g
(
μ
)
=
b
′
−
1
(
θ
)
=
log
p
1
−
p
\begin{aligned} p^y(1-p)^{1-y}\ =\ & \exp\left(y\log\frac{p}{1-p}+\log(1-p)\right)\\ \theta \ =\ & \log\frac{p}{1-p}\\ p \ =\ & \frac{e^\theta}{1+e^\theta}\\ f_\theta(y)\ =\ & \exp(y\theta-\log(1+e^\theta))\\ b'(\theta)\ =\ & \frac{e^\theta}{1+e^\theta}\ =\ p\\ b''(\theta)\ =\ & \frac{e^\theta}{(1+e^\theta)^2}\ =\ p(1-p)\\ g(\mu)\ =\ & b'^{-1}(\theta)\ =\ \log\frac{p}{1-p} \end{aligned}
py(1−p)1−y = θ = p = fθ(y) = b′(θ) = b′′(θ) = g(μ) = exp(ylog1−pp+log(1−p))log1−pp1+eθeθexp(yθ−log(1+eθ))1+eθeθ = p(1+eθ)2eθ = p(1−p)b′−1(θ) = log1−pp
因此只要得到Canonical EDF中的
b
(
θ
)
b(\theta)
b(θ),就可以得到该分布关于GLM的Link Function。
Optimization
在这里要求的参数和线性模型一样,是
β
\beta
β。这里写出对数似然的公式:
l
l
n
(
β
)
=
∑
i
=
1
N
(
y
i
h
(
X
i
T
β
)
−
b
(
h
(
X
i
T
β
)
)
)
ϕ
+
c
(
y
i
,
ϕ
)
=
∑
i
=
1
N
(
y
i
X
i
T
β
−
b
(
X
i
T
β
)
)
ϕ
=
∑
i
=
1
N
(
Y
i
X
i
T
β
ϕ
−
b
(
X
i
T
β
)
ϕ
)
\begin{aligned} l_ln(\beta)\ =\ &\sum_{i=1}^N\frac{ \left(y_ih(X_i^T\beta)-b(h(X^T_i\beta))\right)}{\phi}+c(y_i,\phi)\\ \ =\ & \frac{\sum_{i=1}^N\left(y_iX_i^T\beta-b(X^T_i\beta)\right)}{\phi} \\ \ =\ & \sum_{i=1}^N\left(\frac{Y_iX_i^T\beta}{\phi} - \frac{b(X_i^T\beta)}{\phi}\right) \end{aligned}
lln(β) = = = i=1∑Nϕ(yih(XiTβ)−b(h(XiTβ)))+c(yi,ϕ)ϕ∑i=1N(yiXiTβ−b(XiTβ))i=1∑N(ϕYiXiTβ−ϕb(XiTβ))
这个函数中第一项是线形项,二阶导数为0,不改变函数的凹凸性。从Canonical EDF可以知道
b
′
′
(
θ
)
=
V
a
r
(
y
)
ϕ
>
0
b''(\theta)=\frac{Var(y)}{\phi}>0
b′′(θ)=ϕVar(y)>0,因此这个函数严格凸,使用凸优化的方法就可以很好地求解到唯一最佳参数。通过解析的方式得到闭式解也是可以的。