Generalized linear models (GLMs) are a model framework in machine learning; the familiar linear, logistic, and softmax models all belong to it. Below we derive each of these three models from the GLM perspective.
First, the definition: a model is a generalized linear model if it satisfies the following three conditions:
- The response $y$ follows an exponential-family distribution: $P(y;\eta) = b(y)\exp(\eta^T T(y) - a(\eta))$
- Given $x$, the model's target is $E[T(y)\mid x]$
- $\eta = \zeta^T x$
Deriving the linear regression model from the GLM
For the linear model, the response $y$ follows a Gaussian distribution $\mathcal{N}(\mu,\sigma^2)$. We first rewrite the Gaussian as an exponential-family distribution:
$$
\begin{aligned}
P(y;\mu,\sigma^2) &= \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)\\
&= \exp\left(-\frac{1}{2\sigma^2} y^2 + \frac{\mu}{\sigma^2} y - \frac{\mu^2}{2\sigma^2} + \log\frac{1}{\sqrt{2\pi}\sigma}\right)\\
&= \exp\left(\left[ \begin{matrix} -\frac{1}{2\sigma^2} & \frac{\mu}{\sigma^2} \end{matrix} \right] \left[ \begin{matrix} y^2 \\ y \end{matrix} \right] - \frac{\mu^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)\right)
\end{aligned}
$$
Therefore:
- $b(y) = 1$
- $T(y) = \left[ \begin{matrix} y^2 \\ y \end{matrix} \right]$
- $\eta = \left[ \begin{matrix} \eta_1 & \eta_2 \end{matrix} \right] = \left[ \begin{matrix} -\frac{1}{2\sigma^2} & \frac{\mu}{\sigma^2} \end{matrix} \right]$
- $a(\eta) = \frac{\mu^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)$
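As a quick numerical sanity check (a standalone sketch; the function names here are my own, not from any library), we can verify that the exponential-family form with these choices of $b$, $T$, $\eta$, $a$ reproduces the Gaussian density:

```python
import math

def gaussian_pdf(y, mu, sigma2):
    # Standard N(mu, sigma^2) density
    return math.exp(-(y - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def expfam_pdf(y, mu, sigma2):
    # b(y) * exp(eta^T T(y) - a(eta)) with
    # T(y) = [y^2, y], eta = [-1/(2 sigma^2), mu/sigma^2],
    # a(eta) = mu^2/(2 sigma^2) + 0.5 * log(2 pi sigma^2), b(y) = 1
    eta = (-1.0 / (2 * sigma2), mu / sigma2)
    T = (y ** 2, y)
    a = mu ** 2 / (2 * sigma2) + 0.5 * math.log(2 * math.pi * sigma2)
    return math.exp(eta[0] * T[0] + eta[1] * T[1] - a)

for y in (-1.0, 0.0, 2.5):
    assert abs(gaussian_pdf(y, 1.0, 0.5) - expfam_pdf(y, 1.0, 0.5)) < 1e-12
```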
The model's target is then

$$
f(x;\theta) = E[T(y)\mid x] = E[y\mid x] \overset{\text{Gaussian}}{=} \mu = \sigma^2 \eta_2 \overset{\text{condition 3}}{=} \sigma^2 \zeta^T x
$$
Letting $\theta = \sigma^2 \zeta$ yields the linear model: $f(x;\theta) = \theta^T x$.
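Since the GLM reduces to $f(x;\theta)=\theta^T x$, fitting it is ordinary least squares. A minimal one-dimensional sketch (pure Python, with toy data of my own invention):

```python
# Fit theta for f(x; theta) = theta * x (1-D, no intercept) by least squares:
# theta = sum(x_i * y_i) / sum(x_i^2) is the closed-form minimizer of squared error.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # roughly y = 2x plus noise

theta = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(theta)  # close to 2
```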
Deriving the logistic model from the GLM
As a binary classifier, logistic regression actually models a Bernoulli distribution: given a sample $x$, the label $y$ follows
$$
P(y;\phi) = \phi^y(1-\phi)^{1-y}, \quad y \in \{0,1\}
$$
where $\phi = P(y=1\mid x;\phi)$.
Next we rewrite the Bernoulli distribution in exponential-family form:
$$
\begin{aligned}
P(y;\phi) &= \phi^y(1-\phi)^{1-y}, \quad y \in \{0,1\}\\
&= \exp(y\log\phi + (1-y)\log(1-\phi))\\
&= \exp\left(y\log\frac{\phi}{1-\phi} + \log(1-\phi)\right)
\end{aligned}
$$
Therefore:
- $b(y) = 1$
- $T(y) = y$
- $\eta = \log\frac{\phi}{1-\phi} \Rightarrow \phi = \frac{1}{1+e^{-\eta}}$
- $a(\eta) = -\log(1-\phi) = \log(1+e^{\eta})$
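Analogously to the Gaussian case, we can check numerically that $\exp(y\eta - a(\eta))$ with $\eta = \log\frac{\phi}{1-\phi}$ reproduces the Bernoulli pmf (a standalone sketch with names of my own choosing):

```python
import math

def bernoulli_pmf(y, phi):
    # Direct Bernoulli probability mass: phi^y * (1-phi)^(1-y)
    return phi ** y * (1 - phi) ** (1 - y)

def expfam_pmf(y, phi):
    # b(y) = 1, T(y) = y, eta = log(phi/(1-phi)), a(eta) = log(1 + e^eta)
    eta = math.log(phi / (1 - phi))
    a = math.log(1 + math.exp(eta))
    return math.exp(y * eta - a)

for phi in (0.2, 0.5, 0.9):
    for y in (0, 1):
        assert abs(bernoulli_pmf(y, phi) - expfam_pmf(y, phi)) < 1e-12
```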
The model's target is then

$$
f(x;\theta) = E[T(y)\mid x] = E[y\mid x] \overset{\text{Bernoulli}}{=} 0\cdot P(y=0\mid x;\phi) + 1\cdot P(y=1\mid x;\phi) = \phi = \frac{1}{1+e^{-\eta}} \overset{\text{condition 3}}{=} \frac{1}{1+e^{-\zeta^T x}}
$$
Letting $\theta = -\zeta$ yields the logistic model: $f(x;\theta) = \frac{1}{1+e^{\theta^T x}}$ (equivalently, taking $\theta = \zeta$ gives the conventional form $\frac{1}{1+e^{-\theta^T x}}$).
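The substitution $\theta = -\zeta$ only flips the sign inside the exponent; a quick sketch (with arbitrary illustrative values) confirming that $\frac{1}{1+e^{\theta^T x}}$ with $\theta = -\zeta$ equals the usual sigmoid $\frac{1}{1+e^{-\zeta^T x}}$:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

zeta = [0.5, -1.2, 2.0]
x = [1.0, 3.0, -0.5]

sigmoid = 1 / (1 + math.exp(-dot(zeta, x)))   # conventional sigmoid of zeta^T x
theta = [-z for z in zeta]                    # theta = -zeta
glm_form = 1 / (1 + math.exp(dot(theta, x)))  # the form derived above
assert abs(sigmoid - glm_form) < 1e-15
```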
Deriving the multinomial logistic (softmax) model from the GLM
The multinomial logistic model handles multi-class classification; it actually models a Multinoulli (categorical) distribution: given a sample $x$, the label $y$ follows
$$
P(y;\Phi) = \prod_{i=1}^C \phi_i^{y_i}
$$
where $\phi_i = P(y=y_i\mid x;\Phi)$. Since $\phi_C = 1 - \sum_{i=1}^{C-1} \phi_i$, only $\phi_1,\cdots,\phi_{C-1}$ actually need to be specified.
Next we rewrite the Multinoulli distribution in exponential-family form. For notational convenience, we encode the label $y$ as a one-hot vector:
$$
y_1 = \left[ \begin{matrix} 1 \\ 0 \\ \vdots \\ 0 \end{matrix} \right], y_2 = \left[ \begin{matrix} 0 \\ 1 \\ \vdots \\ 0 \end{matrix} \right], \cdots, y_{C-1} = \left[ \begin{matrix} 0 \\ 0 \\ \vdots \\ 1 \end{matrix} \right], y_C = \left[ \begin{matrix} 0 \\ 0 \\ \vdots \\ 0 \end{matrix} \right]
$$
Then:
$$
\begin{aligned}
P(y;\phi) &= \phi_1^{y_1} \cdots \phi_C^{y_C}\\
&= \phi_1^{y_1} \cdots \phi_{C-1}^{y_{C-1}} \cdot \phi_C^{1-\sum_{i=1}^{C-1} y_i}\\
&= \exp\left(y_1\log\phi_1 + \cdots + y_{C-1}\log\phi_{C-1} + \Big(1-\sum_{i=1}^{C-1} y_i\Big)\log\phi_C\right)\\
&= \exp\left(y_1\log\frac{\phi_1}{\phi_C} + \cdots + y_{C-1}\log\frac{\phi_{C-1}}{\phi_C} + \log\phi_C\right)
\end{aligned}
$$
Therefore:
- $b(y) = 1$
- $T(y) = y$
- $\eta = \left[ \begin{matrix} \log\frac{\phi_1}{\phi_C} \\ \log\frac{\phi_2}{\phi_C} \\ \vdots \\ \log\frac{\phi_{C-1}}{\phi_C} \end{matrix} \right]$
- $a(\eta) = -\log\phi_C$
Since $\eta_i = \log\frac{\phi_i}{\phi_C}$ (and, taking $\eta_C = \log\frac{\phi_C}{\phi_C} = 0$, we have $e^{\eta_C} = 1$), it follows that

$$
\phi_i = \phi_C e^{\eta_i} \Rightarrow 1 = \sum_{i=1}^C \phi_i = \phi_C \sum_{i=1}^C e^{\eta_i} \Rightarrow \phi_C = \frac{1}{\sum_{i=1}^C e^{\eta_i}} \Rightarrow \phi_i = \phi_C e^{\eta_i} = \frac{e^{\eta_i}}{\sum_{i=1}^C e^{\eta_i}}
$$
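This chain can be verified numerically: starting from probabilities $\phi$, form $\eta_i = \log\frac{\phi_i}{\phi_C}$ (so $\eta_C = 0$) and check that normalizing $e^{\eta_i}$ recovers $\phi$ (a standalone sketch with made-up probabilities):

```python
import math

phi = [0.1, 0.2, 0.3, 0.4]                   # phi_1..phi_C, summing to 1
eta = [math.log(p / phi[-1]) for p in phi]   # eta_C = log(phi_C/phi_C) = 0

Z = sum(math.exp(e) for e in eta)            # sum_{i=1}^C e^{eta_i}; e^{eta_C} = 1
recovered = [math.exp(e) / Z for e in eta]
for p, r in zip(phi, recovered):
    assert abs(p - r) < 1e-12
```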
The model's target is then:
$$
f(x;\theta) = E[T(y)\mid x] = E[y\mid x] \overset{\text{Multinoulli}}{=} \left[ \begin{matrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_{C-1} \end{matrix} \right] = \left[ \begin{matrix} \frac{e^{\eta_1}}{\sum_{i=1}^C e^{\eta_i}} \\ \frac{e^{\eta_2}}{\sum_{i=1}^C e^{\eta_i}} \\ \vdots \\ \frac{e^{\eta_{C-1}}}{\sum_{i=1}^C e^{\eta_i}} \end{matrix} \right] \overset{\text{condition 3}}{=} \left[ \begin{matrix} \frac{e^{\zeta_1^T x}}{\sum_{i=1}^C e^{\zeta_i^T x}} \\ \frac{e^{\zeta_2^T x}}{\sum_{i=1}^C e^{\zeta_i^T x}} \\ \vdots \\ \frac{e^{\zeta_{C-1}^T x}}{\sum_{i=1}^C e^{\zeta_i^T x}} \end{matrix} \right]
$$
Thus the functional form of the multinomial logistic model is exactly the softmax function.
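A minimal softmax sketch in the notation above (adding the standard max-subtraction trick for numerical stability, which leaves the result unchanged because softmax is invariant to shifting all inputs by a constant):

```python
import math

def softmax(etas):
    # Subtract the max before exponentiating to avoid overflow;
    # softmax(eta) == softmax(eta - c) for any constant c.
    m = max(etas)
    exps = [math.exp(e - m) for e in etas]
    Z = sum(exps)
    return [e / Z for e in exps]

probs = softmax([2.0, 1.0, 0.1])
assert abs(sum(probs) - 1.0) < 1e-12
assert probs[0] > probs[1] > probs[2]  # larger eta_i -> larger phi_i
```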