As discussed before, for a binary classification problem we can use logistic regression. Logistic regression maps $x^i$ into the interval $(0, 1)$ through the logistic (sigmoid) function, and the output can be interpreted as the probability that example $x^i$ belongs to the positive class:

$$p(x^i; \theta) = \frac {1}{1+e^{-\theta^T x^i}}$$
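A minimal NumPy sketch of this probability (the function name `sigmoid` and the array conventions are my own assumptions, not from the original):

```python
import numpy as np

def sigmoid(theta, x):
    """Logistic probability p(x; theta) = 1 / (1 + exp(-theta^T x))."""
    return 1.0 / (1.0 + np.exp(-theta @ x))  # theta, x: 1-D arrays of equal length
```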
For a multiclass problem, where $y^i \in \{1, 2, \dots, K\}$, Softmax Regression follows the same idea: it also maps $x^i$ into the interval $(0, 1)$, yielding a probability for each class. However, $p(x^i; \theta)$ is defined slightly differently:
$$p(x^i; \theta) = Pr(y^i = k \vert x^i; \theta) = \frac {e^{{\theta^{(k)}}^T x^i}}{\sum\nolimits_{j=1}^K e^{{\theta^{(j)}}^T x^i}}$$
Here the numerator is the (unnormalized) score for example $x^i$ belonging to class $k$, and the denominator sums the scores over all $K$ classes, normalizing the result so that $p(x^i; \theta) \in (0,1)$. To speed up the program, the expression is usually vectorized in code:
$$p(x^i; \theta) = \begin{bmatrix} Pr(y^i = 1 \vert x^i; \theta) \\ Pr(y^i = 2 \vert x^i; \theta) \\ \vdots \\ Pr(y^i = K \vert x^i; \theta) \end{bmatrix} = \frac {1}{\sum\nolimits_{j=1}^K e^{{\theta^{(j)}}^T x^i}} \begin{bmatrix} e^{{\theta^{(1)}}^T x^i} \\ e^{{\theta^{(2)}}^T x^i} \\ \vdots \\ e^{{\theta^{(K)}}^T x^i} \end{bmatrix}$$
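A minimal NumPy sketch of this vectorized computation, assuming `Theta` is a $K \times n$ matrix whose rows are the $\theta^{(k)}$ and `x` is a length-$n$ feature vector; subtracting the maximum logit before exponentiating is a common numerical-stability trick, not part of the formula itself:

```python
import numpy as np

def softmax_probs(Theta, x):
    """Return the K-vector of class probabilities Pr(y = k | x; theta)."""
    logits = Theta @ x        # K scores, entry k = theta^(k)^T x
    logits -= logits.max()    # softmax is shift-invariant; this prevents overflow in exp
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()
```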
The likelihood function of Softmax Regression is:

$$L(\theta) = \prod\limits_{i=1}^m \prod\limits_{k=1}^K Pr(y^i = k \vert x^i; \theta)^{1\{y^i=k\}}$$
where $1\{ \cdot \}$ is the indicator function: $1\{ \text{True} \} = 1$ and $1\{ \text{False} \} = 0$. The log-likelihood is:
$$l(\theta) = \log L(\theta) = \sum\limits_{i=1}^m \sum\limits_{k=1}^K 1\{ y^i=k \} \log \frac {e^{{\theta^{(k)}}^T x^i}}{\sum\nolimits_{j=1}^K e^{{\theta^{(j)}}^T x^i}}$$
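In code, the indicator terms $1\{ y^i = k \}$ for all $i$ and $k$ are conveniently materialized as a one-hot matrix; a sketch, assuming labels run from $1$ to $K$ as in the text:

```python
import numpy as np

def one_hot(y, K):
    """Row i is the indicator vector (1{y^i = 1}, ..., 1{y^i = K})."""
    Y = np.zeros((len(y), K))
    Y[np.arange(len(y)), np.asarray(y) - 1] = 1.0  # labels 1..K -> 0-based columns
    return Y
```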
The cost function $J(\theta)$ is defined as the negative log-likelihood:
$$J(\theta) = - l(\theta) = - \sum\limits_{i=1}^m \sum\limits_{k=1}^K 1\{ y^i=k \} \log \frac {e^{{\theta^{(k)}}^T x^i}}{\sum\nolimits_{j=1}^K e^{{\theta^{(j)}}^T x^i}}$$
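Here is a hedged sketch of $J(\theta)$ over a dataset `X` ($m \times n$, one example per row) with labels `y` in $\{1, \dots, K\}$; it exploits $\sum_k 1\{y^i = k\} \log P_{ik} = \log P_{i,\,y^i}$ to avoid building the full indicator matrix:

```python
import numpy as np

def cost(Theta, X, y):
    """J(theta): negative log-likelihood of the data under the softmax model."""
    m = X.shape[0]
    logits = X @ Theta.T                          # m x K scores; entry (i, k) = theta^(k)^T x^i
    logits -= logits.max(axis=1, keepdims=True)   # per-example stability shift
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)             # m x K matrix of Pr(y^i = k | x^i; theta)
    return -np.log(P[np.arange(m), np.asarray(y) - 1]).sum()  # sum of true-class log-probs
```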
Taking the partial derivative of $J(\theta)$ with respect to $\theta^{(k)}$:
$$\begin{aligned} \frac {\partial J(\theta)}{\partial \theta^{(k)}} &= \frac {\partial}{\partial \theta^{(k)}} \left[ - \sum\limits_{i=1}^m \sum\limits_{l=1}^K 1\{ y^i=l \} \log \frac {e^{{\theta^{(l)}}^T x^i}}{\sum\nolimits_{j=1}^K e^{{\theta^{(j)}}^T x^i}} \right] \\ &= \frac {\partial}{\partial \theta^{(k)}} \left[ - \sum\limits_{i=1}^m \sum\limits_{l=1}^K 1\{ y^i=l \} \left(\log e^{{\theta^{(l)}}^T x^i} - \log {\sum\nolimits_{j=1}^K e^{{\theta^{(j)}}^T x^i}}\right) \right] \\ &= - \sum\limits_{i=1}^m \left(1\{ y^i=k \} x^i - \frac {e^{{\theta^{(k)}}^T x^i}}{\sum\nolimits_{j=1}^K e^{{\theta^{(j)}}^T x^i}} x^i\right) \\ &= - \sum\limits_{i=1}^m x^i\left(1\{ y^i=k \} - \frac {e^{{\theta^{(k)}}^T x^i}}{\sum\nolimits_{j=1}^K e^{{\theta^{(j)}}^T x^i}}\right) \\ &= - \sum\limits_{i=1}^m x^i\left(1\{ y^i=k \} - Pr(y^i = k \vert x^i; \theta)\right) \end{aligned}$$

The third step uses $\sum\nolimits_{l=1}^K 1\{ y^i = l \} = 1$ (each example carries exactly one label), which collapses the inner sum multiplying the second term.
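The gradients for all $K$ classes can be computed at once; in this sketch (the names and the row layout are my own assumptions) row $k$ of the result is $\partial J / \partial \theta^{(k)}$:

```python
import numpy as np

def gradient(Theta, X, y):
    """Row k of the result is dJ/d theta^(k) = -sum_i x^i (1{y^i=k} - Pr(y^i=k | x^i))."""
    m, K = X.shape[0], Theta.shape[0]
    logits = X @ Theta.T
    logits -= logits.max(axis=1, keepdims=True)   # stability shift
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)             # m x K probabilities
    Y = np.zeros((m, K))
    Y[np.arange(m), np.asarray(y) - 1] = 1.0      # one-hot indicators 1{y^i = k}
    return -(Y - P).T @ X                         # K x n matrix of gradients
```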
Finally, gradient descent gives an approximate value of $\theta^{(k)}$ that minimizes $J(\theta)$, via the update:
$$\theta^{(k)} := \theta^{(k)} - \alpha \frac {\partial J(\theta)}{\partial \theta^{(k)}}$$
Stacking all classes, the update can be written in vector form as $\theta := \theta - \alpha \nabla_{\theta} J(\theta)$.
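To close the loop, a sketch of full batch gradient descent, reusing the hypothetical `gradient` function above; `alpha` and `n_iters` are arbitrary illustrative values, not recommendations from the original:

```python
import numpy as np

def train(X, y, K, alpha=0.1, n_iters=1000):
    """Batch gradient descent: repeat theta := theta - alpha * dJ/d theta."""
    Theta = np.zeros((K, X.shape[1]))             # one row of parameters per class
    for _ in range(n_iters):
        Theta -= alpha * gradient(Theta, X, y)    # gradient() is the sketch above
    return Theta
```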