1 Logistic Regression
Despite the word "regression" in its name, logistic regression is in fact a binary classification model (Bernoulli).
1.1 The Logistic Regression Model
Model assumption:
$$\begin{aligned} \hat{y} = &P_{\theta}(Y=1\mid x) = \frac{1}{1+e^{-\theta^{T}x}} = g(\theta^{T}x)\\ &P_{\theta}(Y=0\mid x) = \frac{e^{-\theta^{T}x}}{1+e^{-\theta^{T}x}} = 1-g(\theta^{T}x) \end{aligned}$$
where
$$\theta=\left[\begin{matrix} \theta_{0} \\ \theta_{1} \\ \vdots \\ \theta_{n} \end{matrix}\right],\quad x=\left[\begin{matrix} 1 \\ x^{(1)} \\ \vdots \\ x^{(n)} \end{matrix}\right],\quad g(z) = \frac{1}{1+e^{-z}}.$$
It is easy to see that
$$\log\frac{P_{\theta}(Y=1\mid x)}{P_{\theta}(Y=0\mid x)} = \theta^{T}x$$
Hence, logistic regression is also known as "log-odds regression".
```python
# Define the sigmoid (logistic) function
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))
```
Decision function:
$$prediction = \begin{cases} 1, & P_{\theta}(Y=1\mid x)\geq0.5 \\ 0, & P_{\theta}(Y=1\mid x)<0.5 \end{cases} = \begin{cases} 1, & \theta^{T}x\geq0 \\ 0, & \theta^{T}x < 0 \end{cases}$$
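The decision rule can be implemented directly on the linear score, since $P_{\theta}(Y=1\mid x)\geq0.5$ exactly when $\theta^{T}x\geq0$. A minimal sketch (the `predict` name is illustrative, not from the original):

```python
import numpy as np

def predict(theta, X):
    # P(Y=1|x) >= 0.5 exactly when theta^T x >= 0,
    # so thresholding the linear score avoids evaluating the sigmoid.
    return (X @ theta >= 0).astype(int)

theta = np.array([0.0, 1.0])                # intercept 0, weight 1
X = np.array([[1.0, 2.0], [1.0, -3.0]])     # first column is the constant 1
print(predict(theta, X))                    # [1 0]
```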
1.2 The Training Set
$$X = \left[\begin{matrix}1 & x_{1}^{(1)} & \cdots & x_{1}^{(n)}\\ \vdots & \vdots & & \vdots\\ 1 & x_{m}^{(1)} & \cdots & x_{m}^{(n)} \end{matrix}\right]=\left[\begin{matrix}x_{1}^T\\ \vdots \\ x_{m}^T \end{matrix}\right],\quad y = \left[\begin{matrix}y_{1}\\ \vdots \\ y_{m} \end{matrix}\right]$$
Training objective:
$$\hat{y} = g(X\theta) \approx y$$
1.3 The Log-Likelihood Function
Under the i.i.d. assumption,
$$\begin{aligned} l(\theta) &= \log P(Y_{1}=y_{1},\cdots,Y_{m}=y_{m}\mid x_{1},\cdots,x_{m};\theta) \\ &= \log\prod_{i=1}^{m}P_{\theta}(Y_{i}=y_{i}\mid x_{i}) = \sum_{i=1}^{m}\log{P_{\theta}(Y_{i}=y_i\mid x_{i})} \\ &= \sum_{i=1}^{m}[y_{i}\log P_{\theta}(Y_{i}=1\mid x_{i}) + (1-y_{i})\log P_{\theta}(Y_{i}=0\mid x_{i})] \\ &= \sum_{i=1}^{m}[y_{i}\log\hat{y}_{i} + (1-y_{i})\log(1-\hat{y}_{i})] \end{aligned}$$
Matrix form:
$$l(\theta) = y^{T}\log\hat{y} + (1-y^{T})\log(1-\hat{y})$$
1.4 The Cost Function
Loss function:
$$\begin{aligned} L(\hat{y},y) &= \begin{cases} -\log P(Y=1\mid x), & y=1 \\ -\log P(Y=0\mid x), & y=0 \end{cases} \\\\ &= -\{y\log{\hat{y}} + (1-y)\log(1-\hat{y})\} \end{aligned}$$
Cost function:
$$J(\theta)= -\frac{1}{m}\sum_{i=1}^{m}[y_{i}\log\hat{y}_{i} + (1-y_{i})\log(1-\hat{y}_{i})] + \boxed{\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^{2}}$$
The boxed term is the regularization term; the expression can be read as $J(\theta):=-\frac{1}{m}l(\theta)+reg(\theta)$.
Matrix form:
$$J(\theta) =-\frac{1}{m}[y^{T}\log \hat{y} + (1-y^{T})\log(1-\hat{y})] + \boxed{\frac{\lambda}{2m}(\mathring{E}\theta)^{T}(\mathring{E}\theta)}$$
where $\mathring{E}=\mathrm{diag}(0,1,\cdots,1)$ zeroes out the intercept $\theta_{0}$, which is conventionally left unregularized.
```python
# Compute the regularized cost function
def computeCost(theta, X, y, penalty):
    m = len(y)
    first = -1/m * y.T @ np.log(sigmoid(X @ theta))
    second = -1/m * (1 - y.T) @ np.log(1 - sigmoid(X @ theta))
    reg = penalty/(2*m) * theta[1:].T @ theta[1:]   # theta_0 is not regularized
    return first + second + reg
```
1.5 Gradient of the Cost Function
Note that the logistic function $g(z)$ satisfies
$$g'(z)=g(z)(1-g(z))$$
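This identity is easy to verify numerically with a central difference (a quick sketch, not part of the derivation):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Compare the closed-form derivative with a central difference.
z = np.linspace(-5, 5, 11)
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = sigmoid(z) * (1 - sigmoid(z))
print(np.max(np.abs(numeric - analytic)))   # negligible: the identity holds
```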
It follows that
$$\begin{aligned} \frac{\partial}{\partial \theta_{0}}J(\theta) &= \frac{1}{m}\sum\limits_{i=1}^{m}(\hat{y}_{i}-y_{i})x_{i}^{(0)} \\ \frac{\partial}{\partial \theta_{j}}J(\theta) &= \frac{1}{m}\sum\limits_{i=1}^{m}(\hat{y}_{i}-y_{i})x_{i}^{(j)} + \boxed{\frac{\lambda}{m}\theta_{j}},\quad j=1,\cdots,n \end{aligned}$$
Matrix form:
$$\frac{\partial}{\partial \theta}J(\theta)=\frac{1}{m}X^{T}(\hat{y}-y) + \boxed{\frac{\lambda}{m}\mathring{E}\theta}$$
```python
# Compute the gradient of the cost function
def gradient(theta, X, y, penalty):
    m = len(y)
    grad = 1/m * X.T @ (sigmoid(X @ theta) - y)
    grad[1:] += penalty/m * theta[1:]   # theta_0 is not regularized
    return grad
```
1.6 The Objective
$$\theta^{*}=\mathop{\arg\min}_{\theta}J(\theta)$$
Algorithm: Batch Gradient Descent
$$\begin{aligned} &\text{Repeat until convergence}\ \{\\ &\qquad \theta := \theta - \alpha\frac{\partial}{\partial\theta}J(\theta)\\ &\} \end{aligned}$$
where $\alpha$ is the learning rate.
```python
# Batch gradient descent
def gradientDescent(theta, X, y, l_rate, penalty, n_iter):
    cost = np.zeros(n_iter)
    for k in range(n_iter):
        theta = theta - l_rate * gradient(theta, X, y, penalty)
        cost[k] = computeCost(theta, X, y, penalty)
    return theta, cost
```
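Putting the pieces together, here is a self-contained end-to-end sketch on synthetic data (the data and hyperparameters are illustrative; the helpers mirror the snippets above, with a small `eps` guard added against `log(0)`):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_cost(theta, X, y, penalty):
    m = len(y)
    y_hat = sigmoid(X @ theta)
    eps = 1e-12                          # guard against log(0)
    nll = -1/m * (y @ np.log(y_hat + eps) + (1 - y) @ np.log(1 - y_hat + eps))
    return nll + penalty/(2*m) * theta[1:] @ theta[1:]

def gradient(theta, X, y, penalty):
    m = len(y)
    grad = 1/m * X.T @ (sigmoid(X @ theta) - y)
    grad[1:] += penalty/m * theta[1:]    # theta_0 is not regularized
    return grad

# Synthetic separable data: label 1 iff the feature is positive.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([np.ones_like(x), x])
y = (x > 0).astype(float)

theta = np.zeros(2)
cost_before = compute_cost(theta, X, y, penalty=0.1)
for _ in range(2000):
    theta -= 0.5 * gradient(theta, X, y, penalty=0.1)
cost_after = compute_cost(theta, X, y, penalty=0.1)

accuracy = np.mean((X @ theta >= 0) == (y == 1))
```

On this toy set the cost drops from $\log 2$ (the all-zero model) and the learned boundary sits near $x=0$, so training accuracy is close to 1.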
2 Softmax Regression
Softmax regression generalizes logistic regression to multi-class classification (Multinoulli).
2.1 The Softmax Regression Model
Model assumption:
$$\hat{y} \equiv \left[\begin{matrix} P_{\theta}(Y=1\mid \boldsymbol{x}) \\ \vdots \\ P_{\theta}(Y=K\mid \boldsymbol{x})\end{matrix}\right] = \left[\begin{matrix} softmax(W^{T}\boldsymbol{x}+\boldsymbol{b}^{T})_{1} \\ \vdots \\ softmax(W^{T}\boldsymbol{x}+\boldsymbol{b}^{T})_{K} \end{matrix}\right]$$
where
$$\begin{aligned} W^{T} = \left[\begin{matrix} w^{(1)T} \\ \vdots \\ w^{(K)T}\end{matrix}\right] &= \left[\begin{matrix} w_{1}^{(1)} & \cdots &w_{n}^{(1)} \\ \vdots &\ddots & \vdots \\ w_{1}^{(K)} & \cdots &w_{n}^{(K)} \\ \end{matrix}\right], \ \boldsymbol{x} = \left[\begin{matrix} x^{(1)} \\ \vdots \\ x^{(n)} \end{matrix}\right], \ \boldsymbol{b}^{T} = \left[\begin{matrix} b^{(1)} \\ \vdots \\ b^{(K)}\end{matrix}\right], \\\\ &softmax(\boldsymbol{z})_{k} = \frac{e^{z^{(k)}}}{\sum_{j}e^{z^{(j)}}},\ k=1,\cdots,K \end{aligned}$$
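In code, softmax is usually computed after subtracting the row-wise maximum; the shift cancels in the ratio but prevents overflow in `exp`. A minimal sketch:

```python
import numpy as np

def softmax(Z):
    # Subtracting the row-wise max leaves the result unchanged
    # (the common factor cancels) but avoids overflow in exp.
    Z = Z - np.max(Z, axis=-1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=-1, keepdims=True)

p = softmax(np.array([1.0, 2.0, 3.0]))
print(p.sum())   # ≈ 1.0: a valid probability distribution
```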
Decision function:
$$prediction = \arg\max_{k}\hat{y}^{(k)}$$
2.2 The Training Set
$$X =\left[\begin{matrix} \boldsymbol{x}_{1}^T\\ \vdots \\ \boldsymbol{x}_{m}^T \end{matrix}\right] =\left[\begin{matrix}x_{1}^{(1)} & \cdots & x_{1}^{(n)}\\ \vdots & & \vdots\\ x_{m}^{(1)} & \cdots & x_{m}^{(n)} \end{matrix}\right],\quad Y = \left[\begin{matrix} \boldsymbol{y}_{1}^{T}\\ \vdots \\ \boldsymbol{y}_{m}^{T} \end{matrix}\right] = \left[\begin{matrix} y_{1}^{(1)} & \cdots & y_{1}^{(K)}\\ \vdots & & \vdots\\ y_{m}^{(1)} & \cdots & y_{m}^{(K)} \end{matrix}\right]$$
Let
$$Z = XW+b$$
Training objective:
$$\hat{Y} = g(Z) \approx Y$$
2.3 The Log-Likelihood Function
Under the i.i.d. assumption,
$$\begin{aligned} l(\Theta) &= \log{P(Y=\boldsymbol{y}\mid \boldsymbol{x},\Theta)}\\ &= \log\prod_{i=1}^{m}P_{\Theta}(Y_{i}=\boldsymbol{y}_{i}\mid \boldsymbol{x}_{i}) = \sum_{i=1}^{m}\log{P_{\Theta}(Y_{i}=\boldsymbol{y}_i\mid \boldsymbol{x}_{i})} \\ &= \sum_{i=1}^{m}\sum_{k=1}^{K}y_{i}^{(k)}\log\hat{y}_{i}^{(k)} = \boldsymbol{y}^{T}\log{\hat{\boldsymbol{y}}} \end{aligned}$$
2.4 The Cost Function
Per-sample loss function:
$$L(\hat{\boldsymbol{y}}, \boldsymbol{y}) = -\log{P(Y=\boldsymbol{y}\mid \boldsymbol{x})} =-\boldsymbol{y}^{T}\log{\hat{\boldsymbol{y}}}$$
Cost function over the training set:
$$\begin{aligned} J(\Theta) &= \frac{1}{m}\sum_{i=1}^{m}L(\hat{\boldsymbol{y}}_{i},\boldsymbol{y}_{i}) + \frac{\lambda}{2m}||W||^{2} \\ &= - \frac{1}{m}\sum_{i=1}^{m}\boldsymbol{y}_{i}^{T}\log\hat{\boldsymbol{y}}_{i} + \frac{\lambda}{2m}||W||^{2} \\ &= - \frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}y_{i}^{(k)}\log\hat{y}_{i}^{(k)} + \frac{\lambda}{2m}\sum_{j=1}^{n}\sum_{k=1}^{K}(w_{j}^{(k)})^{2} \end{aligned}$$
Matrix form:
$$J(\Theta) = -\frac{1}{m}{\bf1}_{1\times m}(Y\otimes\log{\hat{Y}}){\bf1}_{K\times1} + \frac{\lambda}{2m}||W||^{2}$$
where $\otimes$ denotes the element-wise (Hadamard) product.
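The matrix form translates almost line for line into NumPy; a sketch under the same notation (`penalty` plays the role of $\lambda$, the function names are illustrative, and a small `eps` guards against `log(0)`):

```python
import numpy as np

def softmax(Z):
    Z = Z - np.max(Z, axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def compute_cost(W, b, X, Y, penalty):
    # J = -1/m * 1^T (Y ⊗ log Y_hat) 1 + λ/(2m)||W||²,
    # with the 1-vector sandwich realized by np.sum over all entries.
    m = X.shape[0]
    Y_hat = softmax(X @ W + b)
    cross_entropy = -np.sum(Y * np.log(Y_hat + 1e-12)) / m
    return cross_entropy + penalty / (2 * m) * np.sum(W ** 2)

# With W = 0 every class gets probability 1/K, so the cost is log K.
X = np.ones((4, 3))
Y = np.eye(2)[[0, 1, 0, 1]]          # one-hot labels, K = 2
cost = compute_cost(np.zeros((3, 2)), np.zeros(2), X, Y, penalty=0.0)
print(cost)   # ≈ log 2 ≈ 0.6931
```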
2.5 Gradient of the Cost Function
Step 1
$$\frac{\partial J(\Theta)}{\partial \hat{y}_{i}^{(j)}} = -\frac{1}{m}\frac{y_{i}^{(j)}}{\hat{y}_{i}^{(j)}},\quad i=1,\cdots,m,\ j=1,\cdots, K$$
and, by the properties of the softmax function,
$$\frac{\partial \hat{y}_{i}^{(j)}}{\partial z_{i}^{(k)}} = \begin{cases} \hat{y}_{i}^{(j)}(1-\hat{y}_{i}^{(j)}), & j=k \\ -\hat{y}_{i}^{(j)}\hat{y}_{i}^{(k)}, & j\neq k\end{cases}$$
Therefore:
$$\begin{aligned} \frac{\partial}{\partial z_{i}^{(k)}}J(\Theta) &= \sum_{j=1}^{K}\frac{\partial J(\Theta)}{\partial \hat{y}_{i}^{(j)}}\frac{\partial \hat{y}_{i}^{(j)}}{\partial z_{i}^{(k)}} \\ &= -\frac{1}{m}y_{i}^{(k)}(1-\hat{y}_{i}^{(k)}) + \frac{1}{m}\sum_{j\neq k}y_{i}^{(j)}\hat{y}_{i}^{(k)} \\ &= \frac{1}{m}\left(\hat{y}_{i}^{(k)}-y_{i}^{(k)}\right),\quad i=1,\cdots,m,\ k=1,\cdots, K \end{aligned}$$
where the last step uses $\sum_{j=1}^{K}y_{i}^{(j)}=1$.
Matrix form:
$$\checkmark\quad \frac{\partial J(\Theta)}{\partial Z} = \frac{1}{m}(\hat{Y}-Y)$$
Step 2
It follows that
$$\begin{aligned} \frac{\partial}{\partial w_{j}^{(k)}}J(\Theta) &= \sum_{i=1}^{m}\frac{\partial J(\Theta)}{\partial z_{i}^{(k)}}\frac{\partial z_{i}^{(k)}}{\partial w_{j}^{(k)}} + \frac{\lambda}{m}w_{j}^{(k)} \\ &=\frac{1}{m}\sum_{i=1}^{m} (\hat{y}_{i}^{(k)}-y_{i}^{(k)})x_{i}^{(j)} + \frac{\lambda}{m}w_{j}^{(k)} \\\\ \frac{\partial}{\partial b^{(k)}}J(\Theta) &= \sum_{i=1}^{m}\frac{\partial J(\Theta)}{\partial z_{i}^{(k)}}\frac{\partial z_{i}^{(k)}}{\partial b^{(k)}} =\frac{1}{m}\sum_{i=1}^{m} (\hat{y}_{i}^{(k)}-y_{i}^{(k)}) \end{aligned}$$
Matrix form:
$$\frac{\partial}{\partial W}J(\Theta) = \frac{1}{m}X^{T}(\hat{Y}-Y) + \frac{\lambda}{m}W,\qquad \frac{\partial}{\partial b}J(\Theta) = \frac{1}{m}{\bf1}_{1\times m}(\hat{Y}-Y)$$
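These matrix-form gradients can be sketched as follows (following the convention above that the bias is not regularized; the `gradients` name is illustrative):

```python
import numpy as np

def softmax(Z):
    Z = Z - np.max(Z, axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def gradients(W, b, X, Y, penalty):
    m = X.shape[0]
    Y_hat = softmax(X @ W + b)
    dW = X.T @ (Y_hat - Y) / m + penalty / m * W   # n×K, matches W
    db = (Y_hat - Y).sum(axis=0) / m               # the 1_{1×m}(Y_hat − Y) row-sum
    return dW, db

# At W = 0, b = 0 each predicted row is uniform (1/K), so Y_hat − Y is explicit.
X = np.eye(2)
Y = np.eye(2)                                      # one-hot targets, K = 2
dW, db = gradients(np.zeros((2, 2)), np.zeros(2), X, Y, penalty=0.0)
print(dW)   # [[-0.25  0.25] [ 0.25 -0.25]]
print(db)   # [0. 0.]
```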
2.6 The Objective
$$\Theta^{*}=\mathop{\arg\min}_{\Theta}J(\Theta)$$
Algorithm: Batch Gradient Descent
$$\begin{aligned} &\text{Repeat until convergence}\ \{\\ &\qquad W := W - \alpha\frac{\partial}{\partial W}J(\Theta)\\ &\qquad b := b - \alpha\frac{\partial}{\partial b}J(\Theta)\\ &\} \end{aligned}$$
where $\alpha$ is the learning rate.
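A self-contained sketch of the full training loop on synthetic three-class data (data, hyperparameters, and the fixed iteration cap are all illustrative):

```python
import numpy as np

def softmax(Z):
    Z = Z - np.max(Z, axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

# Synthetic 3-class data: the class is the index of the largest feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
labels = X.argmax(axis=1)
Y = np.eye(3)[labels]                       # one-hot targets, K = 3

m = X.shape[0]
W, b = np.zeros((3, 3)), np.zeros(3)
l_rate, penalty = 0.5, 0.01

# Repeat-until-convergence loop, here capped at a fixed iteration count.
for _ in range(1000):
    Y_hat = softmax(X @ W + b)
    dW = X.T @ (Y_hat - Y) / m + penalty / m * W
    db = (Y_hat - Y).sum(axis=0) / m
    W -= l_rate * dW
    b -= l_rate * db

accuracy = np.mean((X @ W + b).argmax(axis=1) == labels)
```

Since the target rule (argmax of the raw features) is itself linear, the learned $W$ approaches a multiple of the identity and training accuracy is high.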