Generalized Linear Models
The exponential family
The exponential family of distributions has the following form:
$$p(y;\eta) = b(y)\exp\left(\eta^T T(y) - a(\eta)\right)$$
Most of the time, $T(y) = y$. Here $\eta$ is the natural parameter, $a(\eta)$ is the log partition function, and $b(y)$ is the base measure. Once $T$, $a$, and $b$ are fixed, varying $\eta$ traces out a family of distributions, and many common distributions can be written in this exponential-family form.
Bernoulli distribution
We now derive the exponential-family form of the Bernoulli distribution. Suppose its mean is $\phi$ and $y \in \{0,\ 1\}$. Then:
$$\begin{aligned} p(y = 1;\phi) &= \phi \\ p(y = 0;\phi) &= 1 - \phi \end{aligned}$$
These two cases can be combined into:
$$p(y;\phi) = \phi^y (1 - \phi)^{1 - y}$$
Therefore:
$$\begin{aligned} p(y;\phi) &= \exp\left(\ln\left(\phi^y (1 - \phi)^{1 - y}\right)\right) \\ &= \exp\left(y \ln \phi + (1-y)\ln(1-\phi)\right) \\ &= \exp\left(y \ln \frac{\phi}{1-\phi} + \ln(1-\phi)\right) \end{aligned}$$
Since $\ln \frac{\phi}{1-\phi}$ does not depend on $y$, we set $\eta = \ln \frac{\phi}{1-\phi}$, which gives:
$$\begin{aligned} e^\eta &= \frac{\phi}{1 - \phi} \\ e^\eta &= (1 + e^\eta)\phi \\ \phi &= \frac{e^\eta}{1 + e^\eta} \\ &= \frac{1}{1 + e^{-\eta}} \end{aligned}$$
Then $p(y;\phi)$ can be rewritten as:
$$p(y;\phi) = \exp\left(\eta^T y + \ln \frac{1}{1 + e^\eta}\right)$$
This is exactly the exponential-family form, with:
$$\begin{aligned} b(y) &= 1 \\ T(y) &= y \\ a(\eta) &= -\ln \frac{1}{1 + e^\eta} \\ &= \ln(1 + e^\eta) \end{aligned}$$
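As a sanity check, the identity $p(y;\phi) = b(y)\exp(\eta T(y) - a(\eta))$ can be verified numerically. Below is a minimal sketch, not from the original notes; the function names are placeholders chosen for illustration.

```python
import numpy as np

def bernoulli_pmf(y, phi):
    """Standard Bernoulli pmf: phi^y * (1 - phi)^(1 - y)."""
    return phi**y * (1.0 - phi)**(1.0 - y)

def bernoulli_exp_family(y, phi):
    """Bernoulli pmf written as b(y) * exp(eta * T(y) - a(eta))."""
    eta = np.log(phi / (1.0 - phi))   # natural parameter (log-odds)
    b = 1.0                           # base measure
    T = y                             # sufficient statistic
    a = np.log(1.0 + np.exp(eta))     # log partition function
    return b * np.exp(eta * T - a)

for phi in (0.2, 0.5, 0.9):
    for y in (0, 1):
        assert np.isclose(bernoulli_pmf(y, phi), bernoulli_exp_family(y, phi))
print("Bernoulli exponential-family form matches the standard pmf.")
```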
Gaussian distribution
Recall linear regression: when we derived it from a Gaussian likelihood, the variance $\sigma^2$ did not affect the resulting $\boldsymbol\theta$, so here we may as well set it to $1$. The Gaussian density is then:
$$\begin{aligned} p(y;\mu) &= \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(y - \mu)^2}{2}\right) \\ &= \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{y^2}{2}\right) \exp\left(\mu y - \frac{1}{2}\mu^2\right) \end{aligned}$$
This is the exponential-family form of the Gaussian, with:
$$\begin{aligned} b(y) &= \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{y^2}{2}\right) \\ T(y) &= y \\ \eta &= \mu \\ a(\eta) &= \frac{1}{2}\mu^2 \\ &= \frac{1}{2}\eta^2 \end{aligned}$$
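The same kind of numeric check works for the Gaussian with $\sigma^2 = 1$. The sketch below is illustrative only: it compares the standard density with $b(y)\exp(\eta T(y) - a(\eta))$ where $\eta = \mu$.

```python
import numpy as np

def gaussian_pdf(y, mu):
    """Standard N(mu, 1) density."""
    return np.exp(-0.5 * (y - mu)**2) / np.sqrt(2.0 * np.pi)

def gaussian_exp_family(y, mu):
    """N(mu, 1) density written as b(y) * exp(eta * T(y) - a(eta))."""
    eta = mu                                        # natural parameter
    b = np.exp(-0.5 * y**2) / np.sqrt(2.0 * np.pi)  # base measure
    a = 0.5 * eta**2                                # log partition function
    return b * np.exp(eta * y - a)

ys = np.linspace(-3.0, 3.0, 7)
assert all(np.allclose(gaussian_pdf(ys, m), gaussian_exp_family(ys, m))
           for m in (-1.0, 0.0, 2.5))
print("Gaussian exponential-family form matches the standard density.")
```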
Many common distributions can be written in this exponential-family form, and generalized linear models are built on top of it.
Constructing GLMs
We first make three assumptions:
1. $y \mid \boldsymbol x; \boldsymbol\theta \sim \text{ExponentialFamily}(\eta)$.
2. Given $\boldsymbol x$, our goal is to predict the expected value of $T(y)$. This means we would like the learned hypothesis $h$ to satisfy $h(\boldsymbol x) = E[y \mid \boldsymbol x]$.
3. $\eta = \boldsymbol\theta^T \boldsymbol x$. (If $\eta$ is vector-valued, then $\eta_i = \boldsymbol\theta_i^T \boldsymbol x$.)
Starting from these three assumptions, we now derive the model forms of linear regression and logistic regression.
Ordinary Least Squares
In linear regression we assumed that $y$ is a strictly linear function of $\boldsymbol x$ plus Gaussian random noise, so $y \mid \boldsymbol x; \boldsymbol\theta \sim N(\mu, \sigma^2)$. We showed above that the Gaussian can be written in exponential-family form, so by assumptions 2 and 3 we have:
$$\begin{aligned} h_{\boldsymbol\theta}(\boldsymbol x) &= E[y \mid \boldsymbol x] \\ &= \mu \\ &= \eta \\ &= \boldsymbol\theta^T \boldsymbol x \end{aligned}$$
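In code, the Gaussian GLM hypothesis is simply the linear predictor $\boldsymbol\theta^T \boldsymbol x$, and maximum likelihood reduces to ordinary least squares. A minimal sketch with synthetic data (the variable names are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])  # intercept column + 2 features
theta_true = np.array([1.0, 2.0, -3.0])
y = X @ theta_true + rng.normal(size=100)   # y = theta^T x + Gaussian noise

# Maximum likelihood under the Gaussian GLM is ordinary least squares.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def h(x, theta):
    """Gaussian GLM hypothesis: h(x) = E[y | x] = eta = theta^T x."""
    return x @ theta

print(theta_hat)            # close to theta_true
print(h(X[:3], theta_hat))  # predictions for the first three samples
```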
Logistic Regression
Using the results derived above together with assumptions 2 and 3, we have:
$$\begin{aligned} h_{\boldsymbol\theta}(\boldsymbol x) &= E[y \mid \boldsymbol x] \\ &= \phi \\ &= \frac{1}{1 + e^{-\eta}} \\ &= \frac{1}{1 + e^{-\boldsymbol\theta^T \boldsymbol x}} \end{aligned}$$
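So the sigmoid hypothesis of logistic regression falls out of the Bernoulli GLM. A minimal sketch (the names are placeholders, not from the original notes):

```python
import numpy as np

def sigmoid(z):
    """Canonical response function for the Bernoulli GLM: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def h(x, theta):
    """Logistic regression hypothesis: h(x) = E[y | x] = phi = sigmoid(theta^T x)."""
    return sigmoid(x @ theta)

theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, 0.8])   # first component is the intercept term
print(h(x, theta))              # predicted probability that y = 1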
Softmax Regression
In classification problems where there are $k$ possible outcomes rather than two, the Bernoulli distribution no longer suffices. In this case we can assume that $y \mid \boldsymbol x; \boldsymbol\theta$ follows a multinomial distribution, and use $\phi_1, \phi_2, \cdots, \phi_k$ to denote the probabilities of the individual outcomes. Since $\sum_{i=1}^k \phi_i = 1$, we can set:
$$\phi_k = 1 - \sum_{i=1}^{k-1} \phi_i$$
Define the following indicator notation, where the braces contain a logical expression:
$$1\{\text{True}\} = 1, \qquad 1\{\text{False}\} = 0$$
Then:
$$p(y;\phi) = \prod_{i=1}^k \phi_i^{1\{y = i\}}$$
Let $T$ be a mapping that transforms $y$ into a column vector with $k-1$ rows: the $y$-th entry is $1$ and all other entries are $0$; if $y = k$, every entry is $0$.
$$T(1) = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix},\ T(2) = \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix},\ \cdots,\ T(k-1) = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix},\ T(k) = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \in \mathbb{R}^{(k-1)\times 1}$$
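A minimal sketch of this encoding (an assumed helper, not from the original notes):

```python
import numpy as np

def T(y, k):
    """Encode label y in {1, ..., k} as a (k-1)-dimensional vector.

    The y-th entry is 1 and all others are 0; y = k maps to the all-zero vector.
    """
    t = np.zeros(k - 1)
    if y < k:
        t[y - 1] = 1.0
    return t

k = 4
for y in range(1, k + 1):
    print(y, T(y, k))   # T(4) with k = 4 is the all-zero vector
```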
Then $p(y;\phi)$ can be further rewritten as:
$$\begin{aligned} p(y;\phi) &= \phi_1^{1\{y = 1\}} \phi_2^{1\{y = 2\}} \cdots \phi_{k-1}^{1\{y = k-1\}} \phi_k^{1 - \sum_{i=1}^{k-1} 1\{y=i\}} \\ &= \phi_1^{(T(y))_1} \phi_2^{(T(y))_2} \cdots \phi_{k-1}^{(T(y))_{k-1}} \phi_k^{1 - \sum_{i=1}^{k-1} (T(y))_i} \\ &= \exp\left((T(y))_1 \ln \phi_1 + (T(y))_2 \ln \phi_2 + \cdots + (T(y))_{k-1} \ln \phi_{k-1} + \left(1 - \sum_{i=1}^{k-1} (T(y))_i\right) \ln \phi_k\right) \\ &= \exp\left((T(y))_1 \ln \frac{\phi_1}{\phi_k} + \cdots + (T(y))_{k-1} \ln \frac{\phi_{k-1}}{\phi_k} + \ln \phi_k\right) \\ &= b(y) \exp\left(\eta^T T(y) - a(\eta)\right) \end{aligned}$$
Therefore:
$$\begin{aligned} b(y) &= 1 \\ \eta &= \begin{bmatrix} \ln \frac{\phi_1}{\phi_k} \\ \vdots \\ \ln \frac{\phi_{k-1}}{\phi_k} \end{bmatrix} \\ a(\eta) &= -\ln(\phi_k) \end{aligned}$$
From this we can solve for the $\phi_i$ in terms of the $\eta_i$ (for convenience, define $\eta_k = \ln \frac{\phi_k}{\phi_k} = 0$, so that $e^{\eta_k} = 1$):
$$\begin{aligned} \eta_i &= \ln \frac{\phi_i}{\phi_k} \\ \phi_k e^{\eta_i} &= \phi_i \\ \phi_k \sum_{i=1}^{k} e^{\eta_i} &= \sum_{i=1}^{k} \phi_i \\ &= 1 \\ \phi_k &= \frac{1}{\sum_{i=1}^{k} e^{\eta_i}} \end{aligned}$$
Substituting $\phi_k$ back into $\phi_k e^{\eta_i} = \phi_i$ gives:
$$\phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^{k} e^{\eta_j}}$$
By assumptions 2 and 3 we can then derive:
$$\begin{aligned} p(y = i \mid \boldsymbol x; \boldsymbol\theta) &= \phi_i \\ &= \frac{e^{\eta_i}}{\sum_{j=1}^{k} e^{\eta_j}} \\ &= \frac{e^{\boldsymbol\theta_i^T \boldsymbol x}}{\sum_{j=1}^{k} e^{\boldsymbol\theta_j^T \boldsymbol x}} \end{aligned}$$
The hypothesis therefore outputs the first $k-1$ class probabilities:

$$\begin{aligned} h_{\boldsymbol\theta}(\boldsymbol x) &= E[T(y) \mid \boldsymbol x; \boldsymbol\theta] \\ &= \begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_{k-1} \end{bmatrix} \\ &= \begin{bmatrix} \frac{\exp(\boldsymbol\theta_1^T \boldsymbol x)}{\sum_{j=1}^{k} \exp(\boldsymbol\theta_j^T \boldsymbol x)} \\ \frac{\exp(\boldsymbol\theta_2^T \boldsymbol x)}{\sum_{j=1}^{k} \exp(\boldsymbol\theta_j^T \boldsymbol x)} \\ \vdots \\ \frac{\exp(\boldsymbol\theta_{k-1}^T \boldsymbol x)}{\sum_{j=1}^{k} \exp(\boldsymbol\theta_j^T \boldsymbol x)} \end{bmatrix} \end{aligned}$$
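A minimal softmax-regression sketch, numerically stabilized by subtracting the maximum logit. Unlike the hypothesis above, which lists only the first $k-1$ probabilities, this helper returns all $k$ (the last one being one minus the sum of the others); the parameter shapes are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    """Softmax with the usual max-subtraction trick for numerical stability."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def h(x, Theta):
    """Softmax regression hypothesis.

    Theta has shape (k, d); row i is theta_i. Fixing the last row to zeros
    corresponds to eta_k = 0. Returns the k class probabilities phi_1..phi_k.
    """
    return softmax(Theta @ x)

k, d = 4, 3
rng = np.random.default_rng(0)
Theta = rng.normal(size=(k, d))
Theta[-1] = 0.0                   # fix theta_k = 0, i.e. eta_k = 0
x = np.array([1.0, 0.5, -0.2])    # first component is the intercept term
phi = h(x, Theta)
print(phi, phi.sum())             # probabilities summing to 1
```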
Deriving the Loss Function
The likelihood of the observed samples, $L(\boldsymbol\theta)$, is:
$$L(\boldsymbol\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid \boldsymbol x^{(i)}; \boldsymbol\theta)$$
Taking the logarithm of both sides:
$$\begin{aligned} \ln L(\boldsymbol\theta) &= \sum_{i=1}^m \ln p(y^{(i)} \mid \boldsymbol x^{(i)}; \boldsymbol\theta) \\ &= \sum_{i=1}^m \sum_{k=1}^{K} 1\{y^{(i)} = k\} \ln \phi_k^{(i)} \\ &= \sum_{i=1}^m \sum_{k=1}^{K} y_k^{(i)} \ln \phi_k^{(i)} \end{aligned}$$

Here $K$ is the number of classes (written $k$ above), $y_k^{(i)} = 1\{y^{(i)} = k\}$ is the one-hot encoding of the label, and $\phi_k^{(i)}$ is the model's predicted probability of class $k$ for sample $i$.
By maximum likelihood estimation, our goal is to maximize $L$, so:
$$\boldsymbol\theta = \arg\max L(\boldsymbol\theta)$$
So the loss function $J(\boldsymbol\theta)$ is:
$$J(\boldsymbol\theta) = -\sum_{i=1}^m \sum_{k=1}^{K} y_k^{(i)} \ln \phi_k^{(i)}$$
This is also known as the cross-entropy loss.
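A minimal cross-entropy sketch under these conventions, with one-hot labels and predicted probabilities from the softmax hypothesis; the names and the `eps` guard are assumptions for illustration.

```python
import numpy as np

def cross_entropy(Y_onehot, Phi, eps=1e-12):
    """J(theta) = -sum_i sum_k y_k^(i) * ln(phi_k^(i)).

    Y_onehot: (m, K) one-hot labels; Phi: (m, K) predicted class probabilities.
    eps guards against log(0).
    """
    return -np.sum(Y_onehot * np.log(Phi + eps))

# Tiny example: m = 3 samples, K = 3 classes.
Y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)
Phi = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.8, 0.1],
                [0.2, 0.3, 0.5]])
print(cross_entropy(Y, Phi))   # sum of -ln(0.7), -ln(0.8), -ln(0.5)
```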