Study Notes - GreedyAI - DeepLearningCV - Lesson 1 Introduction

Chapter 3: Logistic Regression

Learning Task 12: The Binary Classification Problem

Learning Task 13: The Logistic Function

$$f(x) = \frac{1}{1 + e^{-x}}$$

As $x \to -\infty$:

$$\frac{1}{1 + e^{\infty}} = \frac{1}{1 + \infty} = 0$$

As $x \to +\infty$:

$$\frac{1}{1 + e^{-\infty}} = \frac{1}{1 + 0} = 1$$

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-6, 6, 100)
y = sigmoid(x)
mark = 0.5 * np.ones(x.shape)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(x, y)
ax.plot(x, mark, ":")
ax.set_xlabel("$x$")
ax.set_ylabel("$f(x)$")
ax.grid()
plt.show()

[Figure: the sigmoid function $f(x)$ over $[-6, 6]$, with a dotted reference line at $f(x) = 0.5$]

Learning Task 14: Exponentials and Logarithms; Logistic Regression

  • Exponentials and logarithms
def exp(x):
    return np.exp(x)

def ln(x):
    return np.log(x)

def lin(x):
    return x

x = np.linspace(-4, 4, 100)
y_exp = exp(x)
y_ln = ln(x[np.nonzero(x > 0)])
y_lin = lin(x)

fig = plt.figure(figsize=(5, 5))
ax = fig.add_subplot(111)
ax.plot(x, y_exp, label="$y = e^{x}$")
ax.plot(x[np.nonzero(x > 0)], y_ln, label="$y = ln(x)$")
ax.plot(x, y_lin, label="$y = x$")
ax.set_xlabel("$x$")
ax.set_ylabel("$f(x)$")
ax.set_ylim(-4, 4)
ax.grid()
ax.legend()
plt.show()

[Figure: $y = e^x$, $y = \ln(x)$, and $y = x$ plotted on the same axes]

  • Logistic regression

Solves the binary (0/1) classification problem.

$$P(y = 1 \mid \mathbf{x}; \mathbf{\theta}) = f(\mathbf{x}; \mathbf{\theta}) = \frac{1}{1 + e^{-\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}}$$

$$\mathbf{\theta}^{\mathrm{T}} \mathbf{x} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots$$

$$\mathbf{\theta} = \left[ \theta_0, \theta_1, \theta_2, \cdots \right]$$

$$\mathbf{x} = \left[ 1, x_1, x_2, \cdots \right]$$

If $P(y = 1 \mid \mathbf{x}) > 0.5$, predict 1; otherwise predict 0.
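A minimal sketch of this decision rule in NumPy (assuming `x` already carries the leading 1 for the bias term; the parameter values below are arbitrary illustrations, not fitted coefficients):

import numpy as np

def predict(theta, x):
    # P(y = 1 | x; theta) from the logistic function, then threshold at 0.5
    p = 1 / (1 + np.exp(-np.dot(theta, x)))
    return 1 if p > 0.5 else 0

theta = np.array([-1.0, 0.5, 0.25])   # [theta_0, theta_1, theta_2], arbitrary
x = np.array([1.0, 2.0, 3.0])         # [1, x_1, x_2]
print(predict(theta, x))              # 1, since theta^T x = 0.75 > 0, so P > 0.5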

  • Logistic regression key identities

Probability of class 1: $P = \frac{1}{1 + e^{-\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}}$

Probability of class 0: $1 - P = \frac{e^{-\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}}{1 + e^{-\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}} = \frac{1}{1 + e^{\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}}$

Ratio of class-1 to class-0 probability (odds): $\frac{P}{1 - P} = e^{\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}$

Natural log of the odds: $\ln \frac{P}{1 - P} = \mathbf{\theta}^{\mathrm{T}} \mathbf{x}$
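These identities are easy to check numerically; a small sketch (theta and x here are arbitrary example values):

import numpy as np

theta = np.array([0.5, -1.2, 2.0])    # arbitrary parameters
x = np.array([1.0, 0.3, 0.7])         # leading 1 is the bias feature
z = theta @ x                         # theta^T x
P = 1 / (1 + np.exp(-z))              # class-1 probability

print(np.isclose(1 - P, 1 / (1 + np.exp(z))))   # class-0 probability identity
print(np.isclose(P / (1 - P), np.exp(z)))       # odds identity
print(np.isclose(np.log(P / (1 - P)), z))       # log-odds identity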

Learning Task 15: A Logistic Regression Example

Age ($x_1$) | Annual income ($x_2$, 10k CNY) | Buys a car (1: yes; 0: no)
--- | --- | ---
20 | 3 | 0
23 | 7 | 1
31 | 10 | 1
42 | 13 | 1
50 | 7 | 0
60 | 5 | 0
28 | 8 | ? (test sample below)
from sklearn import linear_model

X = [[20, 3],
     [23, 7],
     [31, 10],
     [42, 13],
     [50, 7],
     [60, 5]]

y = [0,
     1,
     1,
     1,
     0,
     0]

lr = linear_model.LogisticRegression()
lr.fit(X, y)

testX = [[28, 8]]

label = lr.predict(testX)
print("predicted label = {}".format(label))

prob = lr.predict_proba(testX)
print("probability = {}".format(prob))

print("theta_0 = {0[0]}, theta_1 = {1[0][0]}, theta_0 = {1[0][1]}".format(lr.intercept_, lr.coef_))
predicted label = [1]
probability = [[0.14694811 0.85305189]]
theta_0 = -0.04131837596993478, theta_1 = -0.1973000136829152, theta_2 = 0.915557452347983

Learning Task 16: The Loss Function

Probability of class 1:

$$P(y = 1 \mid \mathbf{x}; \mathbf{\theta}) = f(\mathbf{x}; \mathbf{\theta}) = \frac{1}{1 + e^{-\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}}$$

Loss function:

$$J(\mathbf{\theta}) = - \sum_{i=1}^{N} \left[ y^{(i)} \ln P(Y = 1 \mid \mathbf{X} = \mathbf{x}^{(i)}; \mathbf{\theta}) + \left( 1 - y^{(i)} \right) \ln \left( 1 - P(Y = 1 \mid \mathbf{X} = \mathbf{x}^{(i)}; \mathbf{\theta}) \right) \right]$$

Gradient of the loss function:

$$\nabla_{\mathbf{\theta}} J(\mathbf{\theta}) = \sum_{i=1}^{N} \left( P(Y = 1 \mid \mathbf{X} = \mathbf{x}^{(i)}; \mathbf{\theta}) - y^{(i)} \right) \mathbf{x}^{(i)} = \sum_{i=1}^{N} \mathbf{x}^{(i)} \left( f(\mathbf{x}^{(i)}; \mathbf{\theta}) - y^{(i)} \right)$$
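A vectorized NumPy sketch of this loss and gradient, assuming `X` is an $N \times (n+1)$ matrix whose rows are the samples $\mathbf{x}^{(i)}$ (each with the leading 1) and `y` is a length-$N$ vector of 0/1 labels:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(theta, X, y):
    # J(theta) = -sum_i [ y_i ln P_i + (1 - y_i) ln(1 - P_i) ]
    P = sigmoid(X @ theta)
    return -np.sum(y * np.log(P) + (1 - y) * np.log(1 - P))

def gradient(theta, X, y):
    # grad J(theta) = sum_i x_i (f(x_i; theta) - y_i), i.e. X^T (P - y)
    P = sigmoid(X @ theta)
    return X.T @ (P - y)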

Learning Task 17: Deriving the Loss Function

  1. Differentiation (product rule)

$$\left( f(x) g(x) \right)^{\prime} = f^{\prime}(x) g(x) + f(x) g^{\prime}(x)$$

  2. Logarithms

$$\log(xy) = \log(x) + \log(y)$$

$$\log^{\prime}(x) = \frac{1}{x}$$

  3. Chain rule

$$z = f(y), \quad y = g(x) \;\Rightarrow\; \frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$$

  4. Sigmoid

$$\begin{aligned} f(x) &= \frac{1}{1 + e^{-x}} \\ f^{\prime}(x) &= (-1)\,\frac{(-1)\, e^{-x}}{\left( 1 + e^{-x} \right)^2} = \frac{e^{-x}}{1 + e^{-x}} \cdot \frac{1}{1 + e^{-x}} = f(x)\left( 1 - f(x) \right) \end{aligned}$$

With $f(z) = \frac{1}{1 + e^{-z}}$ and $z = \theta x$, the chain rule gives:

$$\frac{df}{dx} = f(z)\left( 1 - f(z) \right) \theta$$
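The identity $f^{\prime}(x) = f(x)(1 - f(x))$ can be sanity-checked with a central finite difference; a minimal sketch (the step size `h` is an arbitrary small value):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-5, 5, 11)
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
analytic = sigmoid(x) * (1 - sigmoid(x))               # f(x) * (1 - f(x))
print(np.allclose(numeric, analytic))                  # True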

  5. Loss function

Training set $\{ \left( \mathbf{x}_i, y_i \right) \}$, $i \in \{1, 2, \cdots, N \}$, $\mathbf{x}_i \in \mathbb{R}^m$, $y_i \in \{ 0, 1 \}$.

The logistic function gives the probability that the classifier assigns $y_i = 1$ to a given sample $\mathbf{x}_i$:

$$P_i = P\left( y_i = 1 \mid \mathbf{x}_i; \mathbf{\theta} \right) = f(\mathbf{\theta}^{\mathrm{T}} \mathbf{x}_i)$$

Likelihood function:

$$L(\mathbf{\theta}) = \prod_{i \mid y_i = 1} P_i \cdot \prod_{i \mid y_i = 0} \left( 1 - P_i \right)$$

The goal is to find the $\mathbf{\theta}$ that maximizes $L(\mathbf{\theta})$:

$$\mathbf{\theta} = \arg \max_{\mathbf{\theta}} L(\mathbf{\theta})$$

Log-likelihood function:

$$\begin{aligned} l(\mathbf{\theta}) = \log L(\mathbf{\theta}) &= \log \left[ \prod_{i \mid y_i = 1} P_i \cdot \prod_{i \mid y_i = 0} \left( 1 - P_i \right) \right] \\ &= \sum_{i \mid y_i = 1} \log P_i + \sum_{i \mid y_i = 0} \log \left( 1 - P_i \right) \\ &= \sum_{i = 1}^{N} \left[ y_i \log P_i + \left( 1 - y_i \right) \log \left( 1 - P_i \right) \right] \end{aligned}$$

$$\begin{aligned} \frac{d l(\mathbf{\theta})}{d \mathbf{\theta}} &= \sum_{i = 1}^{N} \left[ y_i \frac{d \log P_i}{d \mathbf{\theta}} + \left( 1 - y_i \right) \frac{d \log \left( 1 - P_i \right)}{d \mathbf{\theta}} \right] \\ &= \sum_{i = 1}^{N} \left[ y_i \frac{P_i \left( 1 - P_i \right)}{P_i} \mathbf{x}_i + \left( 1 - y_i \right) \frac{(-1) P_i \left( 1 - P_i \right)}{1 - P_i} \mathbf{x}_i \right] \\ &= \sum_{i = 1}^{N} \left[ y_i \left( 1 - P_i \right) \mathbf{x}_i - \left( 1 - y_i \right) P_i \mathbf{x}_i \right] \\ &= \sum_{i = 1}^{N} \left( y_i - P_i \right) \mathbf{x}_i \end{aligned}$$

Since the objective is to maximize $l(\mathbf{\theta}) = \log L(\mathbf{\theta})$, define the loss function as:

$$loss(\mathbf{\theta}) = -l(\mathbf{\theta})$$

Then:

$$\frac{d\, loss(\mathbf{\theta})}{d \mathbf{\theta}} = \sum_{i = 1}^{N} \left( P_i - y_i \right) \mathbf{x}_i$$

Learning Task 18: Gradient Descent

$$f(\mathbf{x}; \mathbf{\theta}) = \frac{1}{1 + e^{-\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}}$$

$$\mathbf{\theta} = \mathbf{\theta} - \alpha \nabla_{\mathbf{\theta}} J(\mathbf{\theta}) = \mathbf{\theta} - \alpha \sum_{i=1}^{N} \mathbf{x}^{(i)} \left( f(\mathbf{x}^{(i)}; \mathbf{\theta}) - y^{(i)} \right)$$
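A minimal batch-gradient-descent sketch on the car-purchase data from Task 15. The learning rate and iteration count are arbitrary choices, and the two features are standardized so a single learning rate suits both columns; since this sketch has no regularization (unlike sklearn's LogisticRegression), the fitted coefficients will not match the earlier output exactly:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Car-purchase data from Task 15
feats = np.array([[20, 3], [23, 7], [31, 10], [42, 13], [50, 7], [60, 5]], dtype=float)
y = np.array([0, 1, 1, 1, 0, 0], dtype=float)

# Standardize the features, then prepend the constant-1 bias column
mu, sigma = feats.mean(axis=0), feats.std(axis=0)
X = np.column_stack([np.ones(len(y)), (feats - mu) / sigma])

theta = np.zeros(3)
alpha = 0.1                                  # learning rate (arbitrary)
for _ in range(5000):                        # fixed iteration budget (arbitrary)
    grad = X.T @ (sigmoid(X @ theta) - y)    # sum_i x_i (f(x_i; theta) - y_i)
    theta -= alpha * grad                    # theta <- theta - alpha * grad

x_test = np.concatenate([[1.0], (np.array([28.0, 8.0]) - mu) / sigma])
print("P(y = 1 | age 28, income 8) =", sigmoid(x_test @ theta))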

  • The meaning of the coefficients

Odds: $odds = \frac{P}{1 - P} = e^{\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}$

The coefficient $\theta_j$ means: if the original odds are $\lambda_1$, and the corresponding feature $x_j$ increases by 1 so that the new odds are $\lambda_2$, then $\frac{\lambda_2}{\lambda_1} = e^{\theta_j}$.

theta_0 = lr.intercept_
theta_1 = lr.coef_[0][0]
theta_2 = lr.coef_[0][1]

print("theta_0 = {0[0]}, theta_1 = {1}, theta_2 = {2}".format(theta_0, theta_1, theta_2))

testX = [[28, 8]]
prob = lr.predict_proba(testX)   # same test point as above
ratio = prob[0][1] / prob[0][0]

testX = [[28, 9]]
prob_new = lr.predict_proba(testX)
ratio_new = prob_new[0][1] / prob_new[0][0]

ratio_of_ratio = ratio_new / ratio
print("ratio of ratio = {0}".format(ratio_of_ratio))

import math
theta2_e = math.exp(theta_2)
print("theta2 e = {}".format(theta2_e))
theta_0 = -0.04131837596993478, theta_1 = -0.1973000136829152, theta_2 = 0.915557452347983
ratio of ratio = 2.4981674731438943
theta2 e = 2.4981674731438948

$\theta_2 = 0.92$ means that if annual income increases by 10,000 CNY, the odds of buying a car versus not buying become $e^{0.92} \approx 2.5$ times the previous odds.

$\theta_1 = -0.20$ means that if age increases by 1 year, the odds of buying a car versus not buying become $e^{-0.20} \approx 0.82$ times the previous odds.

Learning Task 19: Application

import pandas as pd
from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("./data/SMSSpamCollection.csv", delimiter=',', header=None)
y, X_train = df[0], df[1]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X_train)

lr = linear_model.LogisticRegression()
lr.fit(X, y)

testX = vectorizer.transform(["URGENT! Your mobile No. 1234 was awarded a Prize.",
                              "Hey honey, what's up?"])

predictions = lr.predict(testX)
print(predictions)

['spam' 'ham']

PS: the Hessian matrix of the loss function $J(\mathbf{\theta})$ with respect to $\mathbf{\theta}$:

  • Loss function:

$$J(\mathbf{\theta}) = - \sum_{i=1}^{N} \left[ y^{(i)} \ln f(\mathbf{x}^{(i)}; \mathbf{\theta}) + \left( 1 - y^{(i)} \right) \ln \left( 1 - f(\mathbf{x}^{(i)}; \mathbf{\theta}) \right) \right]$$

where

$$f(\mathbf{x}; \mathbf{\theta}) = \frac{1}{1 + e^{-\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}}, \quad \mathbf{x} = \left[ 1, x_1, x_2, \cdots, x_n \right]^{\mathrm{T}}, \quad \mathbf{\theta} = \left[ \theta_0, \theta_1, \theta_2, \cdots, \theta_n \right]^{\mathrm{T}}$$

and $\mathbf{x}^{(i)}$ is the column vector representing the $i$-th sample.

  • Gradient of the loss function $J(\mathbf{\theta})$ with respect to $\mathbf{\theta}$:

$$\nabla_{\mathbf{\theta}} J(\mathbf{\theta}) = \sum_{i=1}^{N} \mathbf{x}^{(i)} \left( f(\mathbf{x}^{(i)}; \mathbf{\theta}) - y^{(i)} \right)$$

  • Hessian matrix of the loss function $J(\mathbf{\theta})$ with respect to $\mathbf{\theta}$:

The first-order partial derivative of $J(\mathbf{\theta})$ with respect to $\theta_p$ is:

$$\frac{\partial J(\mathbf{\theta})}{\partial \theta_p} = \sum_{i=1}^{N} x^{(i)}_p \left( f(\mathbf{x}^{(i)}; \mathbf{\theta}) - y^{(i)} \right)$$

The second-order partial derivative of $J(\mathbf{\theta})$ with respect to $\theta_p$ and $\theta_q$ is:

$$\begin{aligned} \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_p \partial \theta_q} &= \sum_{i=1}^{N} x^{(i)}_p \frac{\partial f(\mathbf{x}^{(i)}; \mathbf{\theta})}{\partial \theta_q} \\ &= \sum_{i=1}^{N} x^{(i)}_p f(\mathbf{x}^{(i)}; \mathbf{\theta}) \left( 1 - f(\mathbf{x}^{(i)}; \mathbf{\theta}) \right) x^{(i)}_q \\ &= \sum_{i=1}^{N} f(\mathbf{x}^{(i)}; \mathbf{\theta}) \left( 1 - f(\mathbf{x}^{(i)}; \mathbf{\theta}) \right) x^{(i)}_p x^{(i)}_q \end{aligned}$$

Note that $f(\mathbf{x}^{(i)}; \mathbf{\theta}) \left( 1 - f(\mathbf{x}^{(i)}; \mathbf{\theta}) \right)$ is a scalar, and it is strictly greater than zero.

$$\begin{aligned} H \left( J(\mathbf{\theta}) \right) &= \begin{bmatrix} \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_1 \partial \theta_1} & \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_1 \partial \theta_2} & \cdots & \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_1 \partial \theta_n} \\ \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_2 \partial \theta_1} & \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_2 \partial \theta_2} & \cdots & \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_2 \partial \theta_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_n \partial \theta_1} & \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_n \partial \theta_2} & \cdots & \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_n \partial \theta_n} \end{bmatrix} \\ &= \sum_{i=1}^{N} \left( f(\mathbf{x}^{(i)}; \mathbf{\theta}) \left( 1 - f(\mathbf{x}^{(i)}; \mathbf{\theta}) \right) \begin{bmatrix} x^{(i)}_1 x^{(i)}_1 & x^{(i)}_1 x^{(i)}_2 & \cdots & x^{(i)}_1 x^{(i)}_n \\ x^{(i)}_2 x^{(i)}_1 & x^{(i)}_2 x^{(i)}_2 & \cdots & x^{(i)}_2 x^{(i)}_n \\ \vdots & \vdots & \ddots & \vdots \\ x^{(i)}_n x^{(i)}_1 & x^{(i)}_n x^{(i)}_2 & \cdots & x^{(i)}_n x^{(i)}_n \end{bmatrix} \right) \\ &= \sum_{i=1}^{N} f(\mathbf{x}^{(i)}; \mathbf{\theta}) \left( 1 - f(\mathbf{x}^{(i)}; \mathbf{\theta}) \right) \mathbf{x}^{(i)} (\mathbf{x}^{(i)})^{\mathrm{T}} \end{aligned}$$

  • Positive-definiteness analysis of the Hessian

$$H \left( J(\mathbf{\theta}) \right) = \sum_{i=1}^{m} f(\mathbf{x}^{(i)}; \mathbf{\theta}) \left( 1 - f(\mathbf{x}^{(i)}; \mathbf{\theta}) \right) \mathbf{x}^{(i)} (\mathbf{x}^{(i)})^{\mathrm{T}}$$

(writing $m$ for the number of samples $N$)

(1) $f(\mathbf{x}^{(i)}; \mathbf{\theta}) \left( 1 - f(\mathbf{x}^{(i)}; \mathbf{\theta}) \right) > 0$

(2) $H \left( J(\mathbf{\theta}) \right)$ has the same form as the autocorrelation matrix of a random vector.

m ≫ 0 m \gg 0 m0时,可得:

E [ x j x k ] ≈ 1 m ∑ i = 1 m x j ( i ) x k ( i ) \mathrm{E}\left[ x_j x_k \right] \approx \frac{1}{m} \sum_{i=1}^{m} x^{(i)}_j x^{(i)}_k E[xjxk]m1i=1mxj(i)xk(i)

x ( i ) \mathbf{x}^{(i)} x(i)的各分量 x j x_j xj相互独立时,可知:

E [ x j x k ] { = 0 , if  j ̸ = k > 0 , if  j = k \mathrm{E}\left[ x_j x_k \right] \begin{cases} = 0, & \quad \text{if} \ j \not= k \\ \gt 0, & \quad \text{if} \ j = k \\ \end{cases} E[xjxk]{=0,>0,if j̸=kif j=k

m ≫ n m \gg n mn时, E [ x x T ] \mathrm{E}\left[ \mathbf{x} \mathbf{x}^\text{T} \right] E[xxT]为满秩对角矩阵,且对角元素均大于零, H ( J ( θ ) ) H \left(J(\mathbf{\theta}) \right) H(J(θ))是正定的(positive definite);否则 H ( J ( θ ) ) H \left(J(\mathbf{\theta}) \right) H(J(θ))是半正定的(semi-positive definite)。

H ( J ( θ ) ) H \left(J(\mathbf{\theta}) \right) H(J(θ))满足正定条件时( m ≫ n m \gg n mn), J ( θ ) J(\mathbf{\theta}) J(θ)为凸优函数,有全局最优解,即批量梯度下降(batch gradient descent)能够保证 J ( θ ) J(\mathbf{\theta}) J(θ)收敛到全局最小值;当 H ( J ( θ ) ) H \left(J(\mathbf{\theta}) \right) H(J(θ))满足半正定条件时( m &lt; n m \lt n m<n),即小批量梯度下降(batch gradient descent)或随机梯度下降(stochastic gradient descent)可能使 J ( θ ) J(\mathbf{\theta}) J(θ)陷入局部最小值。
