Logistic Regression and Softmax Regression

1 Logistic Regression

Despite the word "regression" in its name, logistic regression is in fact a binary classification model (a Bernoulli model).

1.1 The Logistic Regression Model

Model hypothesis:
$$\begin{aligned} \hat{y} = P_{\theta}(Y=1\mid x) &= \frac{1}{1+e^{-\theta^{T}x}} = g(\theta^{T}x)\\ P_{\theta}(Y=0\mid x) &= \frac{e^{-\theta^{T}x}}{1+e^{-\theta^{T}x}} = 1-g(\theta^{T}x) \end{aligned}$$

where
$$\theta=\begin{bmatrix} \theta_{0} \\ \theta_{1} \\ \vdots \\ \theta_{n} \end{bmatrix},\quad x=\begin{bmatrix} 1 \\ x^{(1)} \\ \vdots \\ x^{(n)} \end{bmatrix},\quad g(z) = \frac{1}{1+e^{-z}}.$$

It is easy to see that
$$\log\frac{P_{\theta}(Y=1\mid x)}{P_{\theta}(Y=0\mid x)} = \theta^{T}x$$

so logistic regression is also known as "log-odds regression".

import numpy as np

# Define the sigmoid function
def sigmoid(z):
  return 1 / (1 + np.exp(-z))

Decision function:
$$prediction = \begin{cases} 1, & P_{\theta}(Y=1\mid x)\geq 0.5 \\ 0, & P_{\theta}(Y=1\mid x)<0.5 \end{cases} = \begin{cases} 1, & \theta^{T}x\geq 0 \\ 0, & \theta^{T}x < 0 \end{cases}$$
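Since $\hat{y}\geq 0.5$ exactly when $\theta^{T}x\geq 0$, the rule can be vectorized over all samples by thresholding $X\theta$ at zero. A minimal sketch (the name `predict` is our own choice):

# Decision function: threshold theta^T x at 0, i.e. P(Y=1|x) at 0.5
def predict(theta, X):
  return (X @ theta >= 0).astype(int)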

1.2 Training Set

$$X = \begin{bmatrix}1 & x_{1}^{(1)} & \cdots & x_{1}^{(n)}\\ \vdots & \vdots & & \vdots\\ 1 & x_{m}^{(1)} & \cdots & x_{m}^{(n)} \end{bmatrix}=\begin{bmatrix}x_{1}^{T}\\ \vdots \\ x_{m}^{T} \end{bmatrix},\quad y = \begin{bmatrix}y_{1}\\ \vdots \\ y_{m} \end{bmatrix}$$

Training objective:
$$\hat{y} = g(X\theta) \approx y$$

1.3 Log-Likelihood

Under the i.i.d. assumption,
$$\begin{aligned} l(\theta) &= \log P(Y_{1}=y_{1},\cdots,Y_{m}=y_{m}\mid x_{1},\cdots,x_{m};\theta) \\ &= \log\prod_{i=1}^{m}P_{\theta}(Y_{i}=y_{i}\mid x_{i}) = \sum_{i=1}^{m}\log{P_{\theta}(Y_{i}=y_i\mid x_{i})} \\ &= \sum_{i=1}^{m}\left[y_{i}\log P_{\theta}(Y_{i}=1\mid x_{i}) + (1-y_{i})\log P_{\theta}(Y_{i}=0\mid x_{i})\right] \\ &= \sum_{i=1}^{m}\left[y_{i}\log\hat{y}_{i} + (1-y_{i})\log(1-\hat{y}_{i})\right] \end{aligned}$$

Matrix form:
$$l(\theta) = y^{T}\log\hat{y} + (1-y^{T})\log(1-\hat{y})$$

1.4 Cost Function

Loss function:
$$\begin{aligned} L(\hat{y},y) &= \begin{cases} -\log P(Y=1\mid x), & y=1 \\ -\log P(Y=0\mid x), & y=0 \end{cases} \\ &= -\left[y\log{\hat{y}} + (1-y)\log(1-\hat{y})\right] \end{aligned}$$
Cost function:
$$J(\theta)= -\frac{1}{m}\sum_{i=1}^{m}\left[y_{i}\log\hat{y}_{i} + (1-y_{i})\log(1-\hat{y}_{i})\right] + \boxed{\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^{2}}$$

The boxed part is the regularization term, so the expression can be read as $J(\theta):=-\frac{1}{m}l(\theta)+\mathrm{reg}(\theta)$. Note that the penalty sum starts at $j=1$: the intercept $\theta_{0}$ is not regularized.

Matrix form:
$$J(\theta) =-\frac{1}{m}\left[y^{T}\log \hat{y} + (1-y^{T})\log(1-\hat{y})\right] + \boxed{\frac{\lambda}{2m}(\mathring{E}\theta)^{T}(\mathring{E}\theta)}$$
where $\mathring{E}$ is the identity matrix with its first diagonal entry zeroed out, so that $\theta_{0}$ drops out of the penalty.

# Compute the regularized cost
def computeCost(theta, X, y, penalty):
  m = len(y)
  first = -1/m * y.T @ np.log(sigmoid(X @ theta))
  second = -1/m * (1-y).T @ np.log(1-sigmoid(X @ theta))
  reg = penalty/(2*m) * theta[1:].T @ theta[1:]  # theta[0] is not penalized
  return first + second + reg

1.5 Gradient of the Cost Function

Note that the logistic function $g(z)$ satisfies
$$g'(z)=g(z)(1-g(z))$$
from which it is not hard to obtain
$$\begin{aligned} \frac{\partial}{\partial \theta_{0}}J(\theta) &= \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_{i}-y_{i})x_{i}^{(0)} \\ \frac{\partial}{\partial \theta_{j}}J(\theta) &= \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_{i}-y_{i})x_{i}^{(j)} + \boxed{\frac{\lambda}{m}\theta_{j}},\quad j=1,\cdots,n \end{aligned}$$

Matrix form:
$$\frac{\partial}{\partial \theta}J(\theta)=\frac{1}{m}X^{T}(\hat{y}-y) + \boxed{\frac{\lambda}{m}\mathring{E}\theta}$$

# Compute the regularized gradient
def gradient(theta, X, y, penalty):
  m = len(y)
  grad = 1/m * X.T @ (sigmoid(X @ theta) - y)
  grad[1:] += penalty/m * theta[1:]  # the intercept theta[0] is not regularized
  return grad

1.6 Objective

$$\theta^{*}=\mathop{\arg\min}_{\theta}J(\theta)$$

Algorithm: Batch Gradient Descent

$$\begin{aligned} &\text{Repeat until convergence}\ \{\\ &\qquad \theta := \theta - \alpha\frac{\partial}{\partial\theta}J(\theta)\\ &\} \end{aligned}$$

where $\alpha$ is the learning rate.

# Batch gradient descent
def gradientDescent(theta, X, y, l_rate, penalty, n_iter):
  cost = np.zeros(n_iter)
  for k in range(n_iter):
    theta = theta - l_rate * gradient(theta, X, y, penalty)
    cost[k] = computeCost(theta, X, y, penalty)
  return theta, cost
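To see the pieces working together, here is an illustrative run on synthetic data; the data-generating setup and hyperparameter values are assumptions of ours, not from the original:

# Illustrative usage on synthetic data (assumed setup)
np.random.seed(0)
m, n = 100, 2
X = np.hstack([np.ones((m, 1)), np.random.randn(m, n)])  # prepend the bias column
y = (X @ np.array([0.5, 2.0, -1.0]) > 0).astype(float)   # labels from a known theta
theta, cost = gradientDescent(np.zeros(n+1), X, y, l_rate=0.1, penalty=1.0, n_iter=1000)
print(theta, cost[-1])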

2 Softmax Regression

Softmax regression is the generalization of Logistic Regression to multi-class classification (Multinoulli).

2.1 The Softmax Regression Model

Model hypothesis:
$$\hat{\boldsymbol{y}} \equiv \begin{bmatrix} P_{\theta}(Y=1\mid \boldsymbol{x}) \\ \vdots \\ P_{\theta}(Y=K\mid \boldsymbol{x})\end{bmatrix} = \begin{bmatrix} \mathrm{softmax}(W^{T}\boldsymbol{x}+\boldsymbol{b})_{1} \\ \vdots \\ \mathrm{softmax}(W^{T}\boldsymbol{x}+\boldsymbol{b})_{K} \end{bmatrix}$$
where
$$\begin{aligned} W^{T} = \begin{bmatrix} \boldsymbol{w}^{(1)T} \\ \vdots \\ \boldsymbol{w}^{(K)T}\end{bmatrix} &= \begin{bmatrix} w_{1}^{(1)} & \cdots & w_{n}^{(1)} \\ \vdots & \ddots & \vdots \\ w_{1}^{(K)} & \cdots & w_{n}^{(K)} \end{bmatrix}, \quad \boldsymbol{x} = \begin{bmatrix} x^{(1)} \\ \vdots \\ x^{(n)} \end{bmatrix}, \quad \boldsymbol{b} = \begin{bmatrix} b^{(1)} \\ \vdots \\ b^{(K)}\end{bmatrix}, \\ &\mathrm{softmax}(\boldsymbol{z})_{k} = \frac{e^{z^{(k)}}}{\sum_{j}e^{z^{(j)}}},\quad k=1,\cdots,K \end{aligned}$$
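A sketch of the softmax function in the same NumPy style as `sigmoid` above; subtracting the row-wise maximum before exponentiating is a standard numerical-stability trick we add here, and it does not change the result:

# Softmax along the last axis; shifting by the max avoids overflow in exp
def softmax(Z):
  Z = Z - Z.max(axis=-1, keepdims=True)
  expZ = np.exp(Z)
  return expZ / expZ.sum(axis=-1, keepdims=True)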
Decision function:
$$prediction = \arg\max_{k}\hat{y}^{(k)}$$
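In code this is a row-wise argmax over the predicted probabilities; a minimal sketch, with the name `predictSoftmax` our own (here $X$ stacks samples row-wise, as in §2.2 below):

# Decision function: pick the class with the highest predicted probability
def predictSoftmax(W, b, X):
  return np.argmax(softmax(X @ W + b), axis=1)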

2.2 Training Set

$$X =\begin{bmatrix} \boldsymbol{x}_{1}^{T}\\ \vdots \\ \boldsymbol{x}_{m}^{T} \end{bmatrix} =\begin{bmatrix}x_{1}^{(1)} & \cdots & x_{1}^{(n)}\\ \vdots & & \vdots\\ x_{m}^{(1)} & \cdots & x_{m}^{(n)} \end{bmatrix},\quad Y = \begin{bmatrix} \boldsymbol{y}_{1}^{T}\\ \vdots \\ \boldsymbol{y}_{m}^{T} \end{bmatrix} = \begin{bmatrix} y_{1}^{(1)} & \cdots & y_{1}^{(K)}\\ \vdots & & \vdots\\ y_{m}^{(1)} & \cdots & y_{m}^{(K)} \end{bmatrix}$$
where each row $\boldsymbol{y}_{i}^{T}$ is the one-hot encoding of sample $i$'s class.


Let
$$Z = XW + b,$$
where $b$ is broadcast, i.e. added to every row of $XW$.
Training objective:
$$\hat{Y} = \mathrm{softmax}(Z) \approx Y$$ where the softmax is applied to each row of $Z$.
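Labels usually arrive as integers rather than one-hot rows; a small helper sketch (assumed, not from the original) to build $Y$:

# One-hot encode integer class labels 0..K-1 into an m-by-K matrix Y
def oneHot(labels, K):
  Y = np.zeros((len(labels), K))
  Y[np.arange(len(labels)), labels] = 1
  return Y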

2.3 Log-Likelihood

Under the i.i.d. assumption,
$$\begin{aligned} l(\Theta) &= \log{P(Y=\boldsymbol{y}\mid \boldsymbol{x},\Theta)} = \log\prod_{i=1}^{m}P_{\Theta}(Y_{i}=\boldsymbol{y}_{i}\mid \boldsymbol{x}_{i}) = \sum_{i=1}^{m}\log{P_{\Theta}(Y_{i}=\boldsymbol{y}_i\mid \boldsymbol{x}_{i})} \\ &= \sum_{i=1}^{m}\sum_{k=1}^{K}y_{i}^{(k)}\log\hat{y}_{i}^{(k)} = \sum_{i=1}^{m}\boldsymbol{y}_{i}^{T}\log{\hat{\boldsymbol{y}}_{i}} \end{aligned}$$

2.4 Cost Function

Loss function for a single sample:
$$L(\hat{\boldsymbol{y}}, \boldsymbol{y}) = -\log{P(Y=\boldsymbol{y}\mid \boldsymbol{x})} =-\boldsymbol{y}^{T}\log{\hat{\boldsymbol{y}}}$$
Cost function over the training set:
$$\begin{aligned} J(\Theta) &= \frac{1}{m}\sum_{i=1}^{m}L(\hat{\boldsymbol{y}}_{i},\boldsymbol{y}_{i}) + \frac{\lambda}{2m}\|W\|^{2} \\ &= -\frac{1}{m}\sum_{i=1}^{m}\boldsymbol{y}_{i}^{T}\log\hat{\boldsymbol{y}}_{i} + \frac{\lambda}{2m}\|W\|^{2} \\ &= -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}y_{i}^{(k)}\log\hat{y}_{i}^{(k)} + \frac{\lambda}{2m}\sum_{j=1}^{n}\sum_{k=1}^{K}\left(w_{j}^{(k)}\right)^{2} \end{aligned}$$
Matrix form:
$$J(\Theta) = -\frac{1}{m}{\bf 1}_{1\times m}\left(Y\otimes\log{\hat{Y}}\right){\bf 1}_{K\times 1} + \frac{\lambda}{2m}\|W\|^{2}$$
where $\otimes$ denotes the element-wise (Hadamard) product and $\|W\|$ the Frobenius norm.
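A sketch of this cost in NumPy, mirroring `computeCost` above (the name `computeCostSoftmax` is our own):

# Regularized cross-entropy cost; * is the element-wise product
def computeCostSoftmax(W, b, X, Y, penalty):
  m = X.shape[0]
  Y_hat = softmax(X @ W + b)
  return -1/m * np.sum(Y * np.log(Y_hat)) + penalty/(2*m) * np.sum(W**2)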

2.5 Gradient of the Cost Function

Step 1
$$\frac{\partial J(\Theta)}{\partial \hat{y}_{i}^{(j)}} = -\frac{1}{m}\frac{y_{i}^{(j)}}{\hat{y}_{i}^{(j)}},\quad i=1,\cdots,m,\ j=1,\cdots,K$$

and, by the properties of the softmax function,
$$\frac{\partial \hat{y}_{i}^{(j)}}{\partial z_{i}^{(k)}} = \begin{cases} \hat{y}_{i}^{(j)}\left(1-\hat{y}_{i}^{(j)}\right), & j=k \\ -\hat{y}_{i}^{(j)}\hat{y}_{i}^{(k)}, & j\neq k\end{cases}$$

Therefore:
$$\begin{aligned} \frac{\partial}{\partial z_{i}^{(k)}}J(\Theta) &= \sum_{j=1}^{K}\frac{\partial J(\Theta)}{\partial \hat{y}_{i}^{(j)}}\frac{\partial \hat{y}_{i}^{(j)}}{\partial z_{i}^{(k)}} = -\frac{1}{m}y_{i}^{(k)}\left(1-\hat{y}_{i}^{(k)}\right) + \frac{1}{m}\sum_{j\neq k}y_{i}^{(j)}\hat{y}_{i}^{(k)} \\ &= \frac{1}{m}\left(\hat{y}_{i}^{(k)}-y_{i}^{(k)}\right),\quad i=1,\cdots,m,\ k=1,\cdots,K \end{aligned}$$
where the last equality uses $\sum_{j=1}^{K}y_{i}^{(j)}=1$ (the labels are one-hot).

Matrix form:
$$\checkmark\quad \frac{\partial J(\Theta)}{\partial Z} = \frac{1}{m}(\hat{Y}-Y)$$

Step 2

It is not hard to obtain
$$\begin{aligned} \frac{\partial}{\partial w_{j}^{(k)}}J(\Theta) &= \sum_{i=1}^{m}\frac{\partial J(\Theta)}{\partial z_{i}^{(k)}}\frac{\partial z_{i}^{(k)}}{\partial w_{j}^{(k)}} + \frac{\lambda}{m}w_{j}^{(k)} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_{i}^{(k)}-y_{i}^{(k)}\right)x_{i}^{(j)} + \frac{\lambda}{m}w_{j}^{(k)} \\ \frac{\partial}{\partial b^{(k)}}J(\Theta) &= \sum_{i=1}^{m}\frac{\partial J(\Theta)}{\partial z_{i}^{(k)}}\frac{\partial z_{i}^{(k)}}{\partial b^{(k)}} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_{i}^{(k)}-y_{i}^{(k)}\right) \end{aligned}$$

Matrix form:
$$\frac{\partial}{\partial W}J(\Theta) = \frac{1}{m}X^{T}(\hat{Y}-Y) + \frac{\lambda}{m}W,\qquad \frac{\partial}{\partial b}J(\Theta) = \frac{1}{m}{\bf 1}_{1\times m}(\hat{Y}-Y)$$
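These two gradients translate directly to NumPy; a sketch (the name `gradientSoftmax` is our own):

# Regularized gradients with respect to W and b
def gradientSoftmax(W, b, X, Y, penalty):
  m = X.shape[0]
  E = softmax(X @ W + b) - Y       # Y_hat - Y, shape (m, K)
  grad_W = 1/m * X.T @ E + penalty/m * W
  grad_b = 1/m * E.sum(axis=0)     # column sums, shape (K,)
  return grad_W, grad_b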

2.6 Objective

$$\Theta^{*}=\mathop{\arg\min}_{\Theta}J(\Theta)$$

Algorithm: Batch Gradient Descent

$$\begin{aligned} &\text{Repeat until convergence}\ \{\\ &\qquad W := W - \alpha\frac{\partial}{\partial W}J(\Theta)\\ &\qquad b := b - \alpha\frac{\partial}{\partial b}J(\Theta)\\ &\} \end{aligned}$$

where $\alpha$ is the learning rate.
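Mirroring `gradientDescent` from Section 1, a sketch of the softmax training loop built from the helpers sketched above:

# Batch gradient descent for softmax regression
def gradientDescentSoftmax(W, b, X, Y, l_rate, penalty, n_iter):
  cost = np.zeros(n_iter)
  for k in range(n_iter):
    grad_W, grad_b = gradientSoftmax(W, b, X, Y, penalty)
    W = W - l_rate * grad_W
    b = b - l_rate * grad_b
    cost[k] = computeCostSoftmax(W, b, X, Y, penalty)
  return W, b, cost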
