Derivation of the Logistic Regression Loss Function
Logistic regression solves binary classification, i.e., there are exactly two class labels. Suppose we have $m$ training samples $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\cdots,(x^{(m)},y^{(m)})\}$, where the input features are $x^{(i)} \in R^{n+1}$ (the extra dimension accounts for the bias term) and the class labels are $y^{(i)} \in \{0,1\}$.
$$\theta_{0} + \theta_{1}x_{1} + \cdots + \theta_{n}x_{n} = \sum_{i=0}^{n}\theta_{i}x_{i} = \theta^{T}x, \quad x_{0}=1$$
Activation function: we use the sigmoid. Linear regression produces a raw value; passing that value through the sigmoid maps it to a probability, which gives us the classification.
$$g(x) = \frac{1}{1+e^{-x}}$$

$$h_{\theta}(x) = g(\theta^{T}x) = \frac{1}{1+e^{-\theta^{T}x}}$$
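A minimal NumPy sketch of these two formulas (the function names and the convention that `x[0] = 1` carries the bias are my own illustrative choices, not from the original):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = g(theta^T x); theta and x both include the bias term (x[0] = 1)
    return sigmoid(theta @ x)
```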
Classification task:
$$\begin{aligned}P(y=1|x;\theta)&=h_{\theta}(x)\\ P(y=0|x;\theta)&=1-h_{\theta}(x)\end{aligned}$$
Combined into a single expression:
$$P(y|x;\theta)=(h_{\theta}(x))^{y}(1-h_{\theta}(x))^{1-y}$$
Explanation: for the binary (0, 1) task, when $y=0$ the combined expression reduces to $(1-h_{\theta}(x))^{1-y} = 1-h_{\theta}(x)$, and when $y=1$ it reduces to $(h_{\theta}(x))^{y} = h_{\theta}(x)$.
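A quick numeric check of this case analysis (the value of `h` is arbitrary, chosen only for illustration):

```python
h = 0.8  # an illustrative value of h_theta(x)
for y in (0, 1):
    p = h ** y * (1 - h) ** (1 - y)  # the combined expression
    print(y, p)  # y=0 -> 0.2 (= 1 - h), y=1 -> 0.8 (= h)
```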
Note: logistic regression is optimized by maximum likelihood estimation.
$$L(\theta)=\prod_{i=1}^{m}P(y_{i}|x_{i};\theta)=\prod_{i=1}^{m}(h_{\theta}(x_{i}))^{y_{i}}(1-h_{\theta}(x_{i}))^{1-y_{i}}$$
Log-likelihood:
$$l(\theta)=\log L(\theta)=\sum_{i=1}^{m}\left(y_{i}\log h_{\theta}(x_{i})+(1-y_{i})\log(1-h_{\theta}(x_{i}))\right)$$
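In code the log-likelihood is a straightforward vectorized sum; a sketch assuming `sigmoid` from the earlier snippet, a design matrix `X` of shape (m, n+1), and a label vector `y` of shape (m,):

```python
def log_likelihood(theta, X, y):
    # l(theta) = sum_i [ y_i * log h(x_i) + (1 - y_i) * log(1 - h(x_i)) ]
    h = sigmoid(X @ theta)            # h_theta(x_i) for all m samples at once
    h = np.clip(h, 1e-12, 1 - 1e-12)  # guard against log(0) in practice
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```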
Note: maximizing $l(\theta)$ would call for gradient ascent; introducing $J(\theta)=-\frac{1}{m}l(\theta)$ turns this into a gradient descent problem.
The loss function is the cross-entropy. Its advantage for gradient descent is that the update is large when the error is large and small when the error is small; with a least-squares loss, the update is tiny (or vanishes entirely) whether the prediction is wrong or right, because the sigmoid's derivative damps the gradient. The numeric sketch below illustrates this.
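A small numeric illustration (the values are arbitrary; the gradients are taken with respect to the pre-activation $z=\theta^{T}x$ for a single sample, where cross-entropy gives $h-y$ while squared error gives $(h-y)\,h(1-h)$):

```python
z, y = 6.0, 0                      # confidently wrong: h = g(6) ~ 0.998, true label 0
h = sigmoid(z)
grad_ce  = h - y                   # cross-entropy gradient w.r.t. z: ~0.998 (big correction)
grad_mse = (h - y) * h * (1 - h)   # squared-error gradient w.r.t. z: ~0.0025 (almost none)
print(grad_ce, grad_mse)
```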
Gradient Descent
Derivation: take the partial derivative of $J(\theta)$ with respect to each parameter $\theta_{j}$, where $j$ indexes the features and $i$ indexes the samples; the third step uses the sigmoid derivative $g'(z)=g(z)(1-g(z))$.
$$\begin{aligned} \frac{\partial J(\theta)}{\partial \theta_{j}} &= -\frac{1}{m}\sum_{i=1}^{m}\left[ y_{i}\frac{1}{h_{\theta}(x_{i})}\frac{\partial h_{\theta}(x_{i})}{\partial \theta_{j}} - (1-y_{i})\frac{1}{1-h_{\theta}(x_{i})}\frac{\partial h_{\theta}(x_{i})}{\partial \theta_{j}} \right]\\ &= -\frac{1}{m}\sum_{i=1}^{m}\left[ y_{i}\frac{1}{g(\theta^{T}x_{i})} - (1-y_{i})\frac{1}{1-g(\theta^{T}x_{i})} \right]\frac{\partial g(\theta^{T}x_{i})}{\partial \theta_{j}}\\ &= -\frac{1}{m}\sum_{i=1}^{m}\left[ y_{i}\frac{1}{g(\theta^{T}x_{i})} - (1-y_{i})\frac{1}{1-g(\theta^{T}x_{i})} \right] g(\theta^{T}x_{i})(1-g(\theta^{T}x_{i}))\frac{\partial \theta^{T}x_{i}}{\partial \theta_{j}}\\ &= -\frac{1}{m}\sum_{i=1}^{m}\left[ y_{i}(1-g(\theta^{T}x_{i})) - (1-y_{i})g(\theta^{T}x_{i}) \right]x_{i}^{j}\\ &= -\frac{1}{m}\sum_{i=1}^{m}\left(y_{i}-g(\theta^{T}x_{i})\right)x_{i}^{j}\\ &= \frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x_{i})-y_{i}\right)x_{i}^{j} \end{aligned}$$
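The last line vectorizes directly over all parameters; a sketch assuming `sigmoid` from above, `X` of shape (m, n+1) with a bias column, and `y` of shape (m,):

```python
def gradient(theta, X, y):
    # dJ/dtheta_j = (1/m) * sum_i (h_theta(x_i) - y_i) * x_i^j, computed for all j at once
    m = X.shape[0]
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / m
```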
Parameter update:
$$\theta_{j} = \theta_{j}-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x_{i})-y_{i})x_{i}^{j}$$
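Putting the pieces together, a minimal batch gradient descent loop (the learning rate `alpha` and iteration count are illustrative defaults, not values from the original):

```python
def fit(X, y, alpha=0.1, n_iters=1000):
    # repeat theta_j <- theta_j - alpha * dJ/dtheta_j for a fixed iteration budget
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta -= alpha * gradient(theta, X, y)
    return theta
```

After training, `hypothesis(theta, x) >= 0.5` can serve as the decision rule for class 1.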
How Softmax Works
Softmax is the application of logistic regression to multi-class problems, i.e., the number of label classes is greater than or equal to 2. Suppose we have $m$ training samples $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\cdots,(x^{(m)},y^{(m)})\}$. For softmax regression the input features are $x^{(i)} \in R^{n+1}$ and the class labels are $y^{(i)} \in \{1,2,\cdots,n\}$. We estimate the probability $p(y=j|x)$ that each sample belongs to each class $j$:
$$h_{\theta}(x^{(i)}) = \begin{bmatrix} p(y^{(i)}=1|x^{(i)};\theta)\\ p(y^{(i)}=2|x^{(i)};\theta)\\ \vdots\\ p(y^{(i)}=n|x^{(i)};\theta) \end{bmatrix} = \frac{1}{\sum_{j=1}^{n}e^{\theta^{T}_{j}x^{(i)}}} \begin{bmatrix} e^{\theta^{T}_{1}x^{(i)}}\\ e^{\theta^{T}_{2}x^{(i)}}\\ \vdots\\ e^{\theta^{T}_{n}x^{(i)}} \end{bmatrix}$$
For each sample we take the class with the largest probability as the prediction. Labels are usually one-hot encoded: the vector is as long as the number of classes, with a 1 at the true class and 0 everywhere else. The probability assigned to class $j$ is:
$$p(y^{(i)}=j|x^{(i)};\theta) = \frac{e^{\theta^{T}_{j}x^{(i)}}}{\sum_{k=1}^{n}e^{\theta^{T}_{k}x^{(i)}}}$$
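A sketch of these class probabilities in code, with the standard max-subtraction trick for numerical stability (the trick is common practice, not part of the derivation above; `Theta` stacks one parameter vector $\theta_{j}$ per row):

```python
def softmax_probs(Theta, x):
    # p(y = j | x; theta) = exp(theta_j^T x) / sum_k exp(theta_k^T x)
    logits = Theta @ x
    logits = logits - np.max(logits)  # shift by the max so exp() cannot overflow
    exp = np.exp(logits)
    return exp / np.sum(exp)
```

The predicted class is then `np.argmax(softmax_probs(Theta, x))`.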