Chapter 3: Logistic Regression
Task 12: The Binary Classification Problem
Task 13: The Logistic Function
$$f(x) = \frac{1}{1 + e^{-x}}$$
$$\frac{1}{1 + e^{\infty}} = \frac{1}{1 + \infty} = 0$$
$$\frac{1}{1 + e^{-\infty}} = \frac{1}{1 + 0} = 1$$
```python
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-6, 6, 100)
y = sigmoid(x)
mark = 0.5 * np.ones(x.shape)  # horizontal reference line at f(x) = 0.5

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(x, y)
ax.plot(x, mark, ":")
ax.set_xlabel("$x$")
ax.set_ylabel("$f(x)$")
ax.grid()
plt.show()
```
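The shape seen in the plot follows from three properties of the logistic function, which can be checked numerically:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# f(0) = 0.5: the curve crosses the dashed midline at x = 0
assert np.isclose(sigmoid(0), 0.5)

# Symmetry about (0, 0.5): f(-x) = 1 - f(x)
x = np.linspace(-6, 6, 100)
assert np.allclose(sigmoid(-x), 1 - sigmoid(x))

# Saturation toward 0 and 1 at the extremes
print(sigmoid(-50), sigmoid(50))
```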
Task 14: Exponentials, Logarithms, and Logistic Regression
- Exponentials and logarithms
```python
def exp(x):
    return np.exp(x)

def ln(x):
    return np.log(x)

def lin(x):
    return x

x = np.linspace(-4, 4, 100)
y_exp = exp(x)
y_ln = ln(x[np.nonzero(x > 0)])  # ln is only defined for x > 0
y_lin = lin(x)

fig = plt.figure(figsize=(5, 5))
ax = fig.add_subplot(111)
ax.plot(x, y_exp, label="$y = e^{x}$")
ax.plot(x[np.nonzero(x > 0)], y_ln, label="$y = ln(x)$")
ax.plot(x, y_lin, label="$y = x$")
ax.set_xlabel("$x$")
ax.set_ylabel("$f(x)$")
ax.set_ylim(-4, 4)
ax.grid()
ax.legend()
plt.show()
```
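The plot shows $y = e^x$ and $y = \ln(x)$ as mirror images across the line $y = x$; that is exactly the statement that the two functions are inverses, which a quick check confirms:

```python
import numpy as np

# exp and ln undo each other wherever both are defined (x > 0 for ln)
x = np.linspace(0.1, 4, 50)
assert np.allclose(np.log(np.exp(x)), x)
assert np.allclose(np.exp(np.log(x)), x)
```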
- Logistic regression
Logistic regression solves binary (0/1) classification problems.
$$P(y = 1 \mid \mathbf{x}; \mathbf{\theta}) = f(\mathbf{x}; \mathbf{\theta}) = \frac{1}{1 + e^{-\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}}$$
$$\mathbf{\theta}^{\mathrm{T}} \mathbf{x} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots$$
$$\mathbf{\theta} = \left[ \theta_0, \theta_1, \theta_2, \cdots \right]$$
$$\mathbf{x} = \left[ 1, x_1, x_2, \cdots \right]$$
If $P(y = 1 \mid \mathbf{x}) > 0.5$, predict class 1; otherwise predict class 0.
- Key logistic regression identities
Probability of class 1: $P = \frac{1}{1 + e^{-\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}}$
Probability of class 0: $1 - P = \frac{e^{-\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}}{1 + e^{-\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}} = \frac{1}{1 + e^{\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}}$
Odds (ratio of the class-1 to the class-0 probability): $\frac{P}{1 - P} = e^{\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}$
Log-odds (natural logarithm of that ratio): $\ln \frac{P}{1 - P} = \mathbf{\theta}^{\mathrm{T}} \mathbf{x}$
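The four identities above can be verified numerically for an arbitrary parameter vector and sample (the random values below are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=3)
x = np.array([1.0, *rng.normal(size=2)])  # leading 1 for the intercept term
z = theta @ x                             # theta^T x

P = 1 / (1 + np.exp(-z))                        # class-1 probability
assert np.isclose(1 - P, 1 / (1 + np.exp(z)))   # class-0 probability
assert np.isclose(P / (1 - P), np.exp(z))       # odds
assert np.isclose(np.log(P / (1 - P)), z)       # log-odds equals theta^T x
```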
Task 15: A Logistic Regression Example
| Age ($x_1$) | Annual income ($x_2$, 10k CNY) | Buys a car (1: yes; 0: no) |
|---|---|---|
| 20 | 3 | 0 |
| 23 | 7 | 1 |
| 31 | 10 | 1 |
| 42 | 13 | 1 |
| 50 | 7 | 0 |
| 60 | 5 | 0 |
| 28 | 8 | ? |

The first six rows are training samples; the last row, (28, 8), is the sample to predict.
```python
from sklearn import linear_model

X = [[20, 3],
     [23, 7],
     [31, 10],
     [42, 13],
     [50, 7],
     [60, 5]]
y = [0, 1, 1, 1, 0, 0]

lr = linear_model.LogisticRegression()
lr.fit(X, y)

testX = [[28, 8]]
label = lr.predict(testX)
print("predicted label = {}".format(label))

prob = lr.predict_proba(testX)
print("probability = {}".format(prob))

print("theta_0 = {0[0]}, theta_1 = {1[0][0]}, theta_2 = {1[0][1]}".format(lr.intercept_, lr.coef_))
```
```
predicted label = [1]
probability = [[0.14694811 0.85305189]]
theta_0 = -0.04131837596993478, theta_1 = -0.1973000136829152, theta_2 = 0.915557452347983
```
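The probability printed above can be reproduced by hand: `predict_proba` is just the logistic function applied to $\mathbf{\theta}^{\mathrm{T}}\mathbf{x}$. The snippet below refits the same model (exact coefficient values may differ slightly across scikit-learn versions, but the identity always holds):

```python
import numpy as np
from sklearn import linear_model

X = [[20, 3], [23, 7], [31, 10], [42, 13], [50, 7], [60, 5]]
y = [0, 1, 1, 1, 0, 0]
lr = linear_model.LogisticRegression()
lr.fit(X, y)

# P(y=1|x) = sigmoid(theta_0 + theta_1 * x_1 + theta_2 * x_2)
z = lr.intercept_[0] + lr.coef_[0] @ np.array([28, 8])
p1 = 1 / (1 + np.exp(-z))
assert np.isclose(p1, lr.predict_proba([[28, 8]])[0][1])
print(p1)
```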
Task 16: The Loss Function
Probability of class 1:
$$P(y = 1 \mid \mathbf{x}; \mathbf{\theta}) = f(\mathbf{x}; \mathbf{\theta}) = \frac{1}{1 + e^{-\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}}$$
Loss function:
$$J(\mathbf{\theta}) = - \sum_{i=1}^{N} \left[ y^{(i)} \ln P(Y = 1 \mid \mathbf{X} = \mathbf{x}^{(i)}; \mathbf{\theta}) + \left( 1 - y^{(i)} \right) \ln \left( 1 - P(Y = 1 \mid \mathbf{X} = \mathbf{x}^{(i)}; \mathbf{\theta}) \right) \right]$$
Gradient of the loss function:
$$\begin{aligned} \nabla_{\mathbf{\theta}} J(\mathbf{\theta}) = & \sum_{i=1}^{N} \left( P(Y = 1 \mid \mathbf{X} = \mathbf{x}^{(i)}; \mathbf{\theta}) - y^{(i)} \right) \mathbf{x}^{(i)} \\ = & \sum_{i=1}^{N} \mathbf{x}^{(i)} \left( f(\mathbf{x}^{(i)}; \mathbf{\theta}) - y^{(i)} \right) \end{aligned}$$
Task 17: Deriving the Loss Function
- Product rule
$$\left( f(x)g(x) \right)^{\prime} = f^{\prime}(x)g(x) + f(x)g^{\prime}(x)$$
- Logarithms
$$\log(xy) = \log(x) + \log(y)$$
$$\log^{\prime}(x) = \frac{1}{x}$$
- Chain rule
$$z = f(y), \quad y = g(x) \;\Rightarrow\; \frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$$
- Sigmoid
$$\begin{aligned} f(x) = & \frac{1}{1 + e^{-x}} \\ f^{\prime}(x) = & (-1)\frac{e^{-x}(-1)}{\left(1 + e^{-x}\right)^2} = \frac{e^{-x}}{1 + e^{-x}} \cdot \frac{1}{1 + e^{-x}} = f(x)\left(1 - f(x)\right) \end{aligned}$$
Composing with $z = \theta x$ via the chain rule:
$$f(z) = \frac{1}{1 + e^{-z}}, \quad z = \theta x \;\Rightarrow\; \frac{df}{dx} = f(z)\left(1 - f(z)\right)\theta$$
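The identity $f^{\prime}(x) = f(x)(1 - f(x))$ can be cross-checked against a central finite difference:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Compare the analytic derivative f(x)(1 - f(x)) with (f(x+h) - f(x-h)) / 2h
x = np.linspace(-5, 5, 11)
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
analytic = sigmoid(x) * (1 - sigmoid(x))
assert np.allclose(numeric, analytic, atol=1e-8)
```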
- Loss function
Training set $\{(\mathbf{x}_i, y_i)\}$, $i \in \{1, 2, \cdots, N\}$, $\mathbf{x}_i \in \mathbb{R}^m$, $y_i \in \{0, 1\}$.
The logistic function gives the probability that the classifier predicts $y_i = 1$ for a given sample $\mathbf{x}_i$:
$$P_i = P\left( y_i = 1 \mid \mathbf{x}_i; \mathbf{\theta} \right) = f(\mathbf{\theta}^{\mathrm{T}} \mathbf{x}_i)$$
Likelihood function:
$$L(\mathbf{\theta}) = \prod_{i \mid y_i = 1} P_i \cdot \prod_{i \mid y_i = 0} \left( 1 - P_i \right)$$
The goal is to find the $\mathbf{\theta}$ that maximizes $L(\mathbf{\theta})$:
$$\mathbf{\theta} = \arg \max_{\mathbf{\theta}} L(\mathbf{\theta})$$
Log-likelihood function:
$$\begin{aligned} l(\mathbf{\theta}) = \log L(\mathbf{\theta}) = & \log \left[ \prod_{i \mid y_i = 1} P_i \cdot \prod_{i \mid y_i = 0} \left( 1 - P_i \right) \right] \\ = & \sum_{i \mid y_i = 1} \log P_i + \sum_{i \mid y_i = 0} \log \left( 1 - P_i \right) \\ = & \sum_{i = 1}^{N} \left[ y_i \log P_i + \left( 1 - y_i \right) \log \left( 1 - P_i \right) \right] \end{aligned}$$
$$\begin{aligned} \frac{d l(\mathbf{\theta})}{d \mathbf{\theta}} = & \sum_{i = 1}^{N} \left[ y_i \frac{d \log P_i}{d \mathbf{\theta}} + \left( 1 - y_i \right) \frac{d \log \left( 1 - P_i \right)}{d \mathbf{\theta}} \right] \\ = & \sum_{i = 1}^{N} \left[ y_i \frac{P_i \left( 1 - P_i \right)}{P_i} \mathbf{x}_i + \left( 1 - y_i \right) \frac{(-1) P_i \left( 1 - P_i \right)}{1 - P_i} \mathbf{x}_i \right] \\ = & \sum_{i = 1}^{N} \left[ y_i \left( 1 - P_i \right) \mathbf{x}_i - \left( 1 - y_i \right) P_i \mathbf{x}_i \right] \\ = & \sum_{i = 1}^{N} \left( y_i - P_i \right) \mathbf{x}_i \end{aligned}$$
Since $l(\mathbf{\theta}) = \log L(\mathbf{\theta})$ is to be maximized, define the loss function as its negative:
$$loss(\mathbf{\theta}) = - l(\mathbf{\theta})$$
Then:
$$\frac{d\, loss(\mathbf{\theta})}{d \mathbf{\theta}} = \sum_{i = 1}^{N} \left( P_i - y_i \right) \mathbf{x}_i$$
Task 18: Gradient Descent
$$f(\mathbf{x}; \mathbf{\theta}) = \frac{1}{1 + e^{-\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}}$$
$$\mathbf{\theta} = \mathbf{\theta} - \alpha \nabla_{\mathbf{\theta}} J(\mathbf{\theta}) = \mathbf{\theta} - \alpha \sum_{i=1}^{N} \mathbf{x}^{(i)} \left( f(\mathbf{x}^{(i)}; \mathbf{\theta}) - y^{(i)} \right)$$
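The update rule above can be sketched from scratch on the Task 15 car data. The features are standardized here so a single learning rate works for both (this is an added step, and the fit is unregularized, so the resulting probability will not match the scikit-learn output exactly):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Car-purchase data from Task 15, with a leading 1 for the intercept
X = np.array([[1, 20, 3], [1, 23, 7], [1, 31, 10],
              [1, 42, 13], [1, 50, 7], [1, 60, 5]], dtype=float)
y = np.array([0, 1, 1, 1, 0, 0], dtype=float)

# Standardize the two feature columns (keep the intercept column as 1)
mu, sd = X[:, 1:].mean(axis=0), X[:, 1:].std(axis=0)
X[:, 1:] = (X[:, 1:] - mu) / sd

theta = np.zeros(3)
alpha = 0.1
for _ in range(5000):                       # batch gradient descent
    grad = X.T @ (sigmoid(X @ theta) - y)   # sum_i x_i (f(x_i) - y_i)
    theta -= alpha * grad

# Predicted labels for the six training points
pred = sigmoid(X @ theta) > 0.5
print(pred.astype(int))

# Predict the (28, 8) query, standardized with the training statistics
x_new = np.concatenate([[1], ([28, 8] - mu) / sd])
print(sigmoid(x_new @ theta))
```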
- Interpreting the coefficients
Odds: $odds = \frac{P}{1 - P} = e^{\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}$
The coefficient $\theta_j$ means: if the original odds are $\lambda_1$ and the corresponding feature $x_j$ increases by 1, giving new odds $\lambda_2$, then $\frac{\lambda_2}{\lambda_1} \equiv e^{\theta_j}$.
```python
import math

theta_0 = lr.intercept_
theta_1 = lr.coef_[0][0]
theta_2 = lr.coef_[0][1]
print("theta_0 = {0[0]}, theta_1 = {1}, theta_2 = {2}".format(theta_0, theta_1, theta_2))

# Odds at the original test point (28, 8), using prob from Task 15
ratio = prob[0][1] / prob[0][0]

# Increase annual income by 1 and recompute the odds
testX = [[28, 9]]
prob_new = lr.predict_proba(testX)
ratio_new = prob_new[0][1] / prob_new[0][0]

# The ratio of the new odds to the old odds equals e^{theta_2}
ratio_of_ratio = ratio_new / ratio
print("ratio of ratio = {0}".format(ratio_of_ratio))

theta2_e = math.exp(theta_2)
print("theta2 e = {}".format(theta2_e))
```
```
theta_0 = -0.04131837596993478, theta_1 = -0.1973000136829152, theta_2 = 0.915557452347983
ratio of ratio = 2.4981674731438943
theta2 e = 2.4981674731438948
```
$\theta_2 = 0.92$ means: if annual income increases by 10k CNY, the odds of buying a car versus not buying are multiplied by $e^{0.92} = 2.5$ relative to before.
$\theta_1 = -0.20$ means: if age increases by one year, those odds are multiplied by $e^{-0.20} = 0.82$, i.e. they decrease.
Task 19: Application
```python
import pandas as pd
from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("./data/SMSSpamCollection.csv", delimiter=',', header=None)
y, X_train = df[0], df[1]  # column 0: label (spam/ham); column 1: message text

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X_train)

lr = linear_model.LogisticRegression()
lr.fit(X, y)

testX = vectorizer.transform(["URGENT! Your mobile No. 1234 was awarded a Prize.",
                              "Hey honey, what's up?"])
predictions = lr.predict(testX)
print(predictions)
```
```
['spam' 'ham']
```
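If the CSV file is not at hand, the same pipeline can be exercised on a tiny inline corpus. The messages and labels below are invented for illustration; only the TfidfVectorizer + LogisticRegression pipeline mirrors the code above:

```python
from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny hypothetical corpus standing in for SMSSpamCollection
texts = [
    "WINNER!! Claim your free prize now",
    "URGENT! You have won a 1 week FREE membership",
    "Free entry in a weekly comp to win cup tickets",
    "Ok lar, joking with you only",
    "I'll call you later when I get home",
    "Are we still meeting for lunch today?",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
lr = linear_model.LogisticRegression()
lr.fit(X, labels)

testX = vectorizer.transform(["URGENT! Claim your free prize",
                              "See you at home later"])
print(lr.predict(testX))
```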
PS: The Hessian matrix of the loss function $J(\mathbf{\theta})$ with respect to $\mathbf{\theta}$.
- Loss function:
$$J(\mathbf{\theta}) = - \sum_{i=1}^{N} \left[ y^{(i)} \ln f(\mathbf{x}^{(i)}; \mathbf{\theta}) + \left( 1 - y^{(i)} \right) \ln \left( 1 - f(\mathbf{x}^{(i)}; \mathbf{\theta}) \right) \right]$$
where $f(\mathbf{x}; \mathbf{\theta}) = \frac{1}{1 + e^{-\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}}$, $\mathbf{x} = \left[ 1, x_1, x_2, \cdots, x_n \right]^{\mathrm{T}}$, and $\mathbf{\theta} = \left[ \theta_0, \theta_1, \theta_2, \cdots, \theta_n \right]^{\mathrm{T}}$.
Here $\mathbf{x}^{(i)}$ is the column vector representing the $i$-th sample.
- Gradient of $J(\mathbf{\theta})$ with respect to $\mathbf{\theta}$:
$$\nabla_{\mathbf{\theta}} J(\mathbf{\theta}) = \sum_{i=1}^{N} \mathbf{x}^{(i)} \left( f(\mathbf{x}^{(i)}; \mathbf{\theta}) - y^{(i)} \right)$$
- Hessian matrix of $J(\mathbf{\theta})$ with respect to $\mathbf{\theta}$:
The first-order partial derivative of $J(\mathbf{\theta})$ with respect to $\theta_p$ is:
$$\frac{\partial J(\mathbf{\theta})}{\partial \theta_p} = \sum_{i=1}^{N} x^{(i)}_p \left( f(\mathbf{x}^{(i)}; \mathbf{\theta}) - y^{(i)} \right)$$
The second-order partial derivative with respect to $\theta_p$ and $\theta_q$ is:
$$\begin{aligned} \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_p \partial \theta_q} = & \sum_{i=1}^{N} x^{(i)}_p \frac{\partial f(\mathbf{x}^{(i)}; \mathbf{\theta})}{\partial \theta_q} \\ = & \sum_{i=1}^{N} x^{(i)}_p f(\mathbf{x}^{(i)}; \mathbf{\theta}) \left( 1 - f(\mathbf{x}^{(i)}; \mathbf{\theta}) \right) x^{(i)}_q \\ = & \sum_{i=1}^{N} f(\mathbf{x}^{(i)}; \mathbf{\theta}) \left( 1 - f(\mathbf{x}^{(i)}; \mathbf{\theta}) \right) x^{(i)}_p x^{(i)}_q \end{aligned}$$
Note that $f(\mathbf{x}^{(i)}; \mathbf{\theta}) \left( 1 - f(\mathbf{x}^{(i)}; \mathbf{\theta}) \right)$ is a scalar and is strictly positive.
$$\begin{aligned} H\left(J(\mathbf{\theta})\right) = & \begin{bmatrix} \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_1 \partial \theta_1} & \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_1 \partial \theta_2} & \cdots & \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_1 \partial \theta_n} \\ \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_2 \partial \theta_1} & \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_2 \partial \theta_2} & \cdots & \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_2 \partial \theta_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_n \partial \theta_1} & \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_n \partial \theta_2} & \cdots & \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_n \partial \theta_n} \end{bmatrix} \\ = & \sum_{i=1}^{N} f(\mathbf{x}^{(i)}; \mathbf{\theta}) \left( 1 - f(\mathbf{x}^{(i)}; \mathbf{\theta}) \right) \begin{bmatrix} x^{(i)}_1 x^{(i)}_1 & x^{(i)}_1 x^{(i)}_2 & \cdots & x^{(i)}_1 x^{(i)}_n \\ x^{(i)}_2 x^{(i)}_1 & x^{(i)}_2 x^{(i)}_2 & \cdots & x^{(i)}_2 x^{(i)}_n \\ \vdots & \vdots & \ddots & \vdots \\ x^{(i)}_n x^{(i)}_1 & x^{(i)}_n x^{(i)}_2 & \cdots & x^{(i)}_n x^{(i)}_n \end{bmatrix} \\ = & \sum_{i=1}^{N} f(\mathbf{x}^{(i)}; \mathbf{\theta}) \left( 1 - f(\mathbf{x}^{(i)}; \mathbf{\theta}) \right) \mathbf{x}^{(i)} \left( \mathbf{x}^{(i)} \right)^{\mathrm{T}} \end{aligned}$$
- Positive definiteness of the Hessian
$$H\left(J(\mathbf{\theta})\right) = \sum_{i=1}^{m} f(\mathbf{x}^{(i)}; \mathbf{\theta}) \left( 1 - f(\mathbf{x}^{(i)}; \mathbf{\theta}) \right) \mathbf{x}^{(i)} \left( \mathbf{x}^{(i)} \right)^{\mathrm{T}}$$
(1) $f(\mathbf{x}^{(i)}; \mathbf{\theta}) \left( 1 - f(\mathbf{x}^{(i)}; \mathbf{\theta}) \right) > 0$
(2) $H\left(J(\mathbf{\theta})\right)$ has the same form as the autocorrelation matrix of a random vector.
When $m \gg 0$:
$$\mathrm{E}\left[ x_j x_k \right] \approx \frac{1}{m} \sum_{i=1}^{m} x^{(i)}_j x^{(i)}_k$$
When the components $x_j$ of $\mathbf{x}^{(i)}$ are mutually independent (and zero-mean):
$$\mathrm{E}\left[ x_j x_k \right] \begin{cases} = 0, & \text{if } j \neq k \\ > 0, & \text{if } j = k \end{cases}$$
When $m \gg n$, $\mathrm{E}\left[ \mathbf{x} \mathbf{x}^{\mathrm{T}} \right]$ is then a full-rank diagonal matrix with strictly positive diagonal entries, and $H\left(J(\mathbf{\theta})\right)$ is positive definite; otherwise $H\left(J(\mathbf{\theta})\right)$ is positive semi-definite.
When $H\left(J(\mathbf{\theta})\right)$ is positive definite ($m \gg n$), $J(\mathbf{\theta})$ is strictly convex with a unique global minimum, so batch gradient descent is guaranteed to converge to it. When $H\left(J(\mathbf{\theta})\right)$ is only positive semi-definite ($m < n$), $J(\mathbf{\theta})$ is convex but not strictly so, and mini-batch gradient descent or stochastic gradient descent may fail to converge to a unique minimizer.
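The positive-definiteness claim can be checked numerically: build the Hessian from the formula above for random data with $m \gg n$ and inspect its eigenvalues (the data below are synthetic, for illustration only):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
N, n = 50, 3                         # m >> n: expect a positive definite Hessian
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, n - 1))])
theta = rng.normal(size=n)

# H = sum_i f_i (1 - f_i) x_i x_i^T, computed as X^T diag(w) X
f = sigmoid(X @ theta)
H = (X * (f * (1 - f))[:, None]).T @ X

eigvals = np.linalg.eigvalsh(H)
print(eigvals)
assert np.all(eigvals > 0)           # all eigenvalues strictly positive
```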