二项逻辑斯蒂回归
1 定义
二项逻辑斯蒂回归模型是如下的条件概率分布:
P ( Y = 1 ∣ x ) = exp ( ω ⋅ x + b ) 1 + exp ( ω ⋅ x + b ) P ( Y = 0 ∣ x ) = 1 1 + exp ( ω ⋅ x + b ) P(Y=1 | x)=\frac{\exp (\omega \cdot x+b)}{1+\exp (\omega \cdot x+b)} \\[4ex] P(Y=0 | x)=\frac{1}{1+\exp (\omega \cdot x+b)} P(Y=1∣x)=1+exp(ω⋅x+b)exp(ω⋅x+b)P(Y=0∣x)=1+exp(ω⋅x+b)1
这里, x ∈ R n x\in R^n x∈Rn 是输入, Υ ∈ { 0 , 1 } \Upsilon\in\left\{0,1\right\} Υ∈{0,1} 是输出, ω ∈ R n \omega\in R^n ω∈Rn 和 b ∈ R b\in R b∈R 是参数, ω \omega ω 称为权值向量, b b b 称为偏置, ω ⋅ x \omega\cdot x ω⋅x 为 ω \omega ω 和 x x x 的内积。可以将权值向量和输入向量加以扩充,仍记作 ω , x \omega, x ω,x ,即
ω = ( ω ( 1 ) ω ( 2 ) ω ( 3 ) ⋯ ω ( n ) b ) x = ( x ( 1 ) x ( 2 ) x ( 3 ) ⋯ x ( n ) 1 ) \begin{array}{l} \omega = \begin{pmatrix} \omega^{(1)}&\omega^{(2)}&\omega^{(3)}&\cdots&\omega^{(n)}&b \end{pmatrix} \\[2ex] x = \begin{pmatrix} x^{(1)}&x^{(2)}&x^{(3)}&\cdots&x^{(n)}&1 \end{pmatrix} \end{array} ω=(ω(1)ω(2)ω(3)⋯ω(n)b)x=(x(1)x(2)x(3)⋯x(n)1)
这时,逻辑斯蒂回归模型如下:
P ( Y = 1 ∣ x ) = exp ( ω ⋅ x ) 1 + exp ( ω ⋅ x ) P ( Y = 0 ∣ x ) = 1 1 + exp ( ω ⋅ x ) P(Y=1 | x)=\frac{\exp (\omega \cdot x)}{1+\exp (\omega \cdot x)} \\[4ex] P(Y=0 | x)=\frac{1}{1+\exp (\omega \cdot x)} P(Y=1∣x)=1+exp(ω⋅x)exp(ω⋅x)P(Y=0∣x)=1+exp(ω⋅x)1
2 模型参数估计
应用极大似然估计法估计模型参数,从而得到逻辑斯蒂回归模型。
对数似然函数为
L ( ω ) = ∑ i = 1 N [ y i ( ω ⋅ x i ) − log ( 1 + exp ( ω ⋅ x i ) ) ] L(\omega)=\sum_{i=1}^N \left[y_i (\omega \cdot x_i)-\log \left(1+\exp (\omega \cdot x_i)\right)\right] L(ω)=i=1∑N[yi(ω⋅xi)−log(1+exp(ω⋅xi))]
从理论上来讲,直接求 L ( ω ) L(\omega) L(ω) 的极大值,就能得到 ω \omega ω 的估计值。但在实际的数据场景中,我们常采用梯度下降法进行求解。L ( ω ) L(\omega) L(ω) 对 ω \omega ω 的每一个元素求偏导,
∂ L ( ω ) ∂ ω ( j ) = ∑ i = 1 N [ x i ( j ) y i − exp ( ω ⋅ x i ) x i ( j ) 1 + exp ( ω ⋅ x i ) ] \frac{\partial L(\omega)}{\partial \omega^{(j)}} = \sum_{i=1}^N\left[x_i^{(j)} y_i-\frac{\exp (\omega \cdot x_i) x_i^{(j)}}{1+\exp (\omega \cdot x_i)}\right] ∂ω(j)∂L(ω)=i=1∑N[xi(j)yi−1+exp(ω⋅xi)exp(ω⋅xi)xi(j)]
再由所有偏导组成向量,得到的就是梯度。
3 学习算法1
输入:训练数据集 T = { ( x 1 , y 1 ) , ( x 2 , y 2 ) , ⋯ , ( x N , y N ) } T=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \cdots,\left(x_{N}, y_{N}\right)\right\} T={(x1,y1),(x2,y2),⋯,(xN,yN)},其中 x i ∈ χ = R n x_{i}\in\chi=R^n xi∈χ=Rn, y i ∈ Υ = { − 1 , + 1 } y_{i}\in\Upsilon=\left\{-1,+1\right\} yi∈Υ={−1,+1}, i = 1 , 2 , ⋯ , N i=1,2,\cdots,N i=1,2,⋯,N;学习率 η ( 0 < η ≤ 1 ) \eta\left(0<\eta\leq1\right) η(0<η≤1),又称为步长;梯度阈值 δ \delta δ ;
输出: ω \omega ω;
选取初始值 ω 0 \omega_{0} ω0;
在训练集中选取数据 ( x i , y i ) \left(x_{i},y_{i}\right) (xi,yi);
如果 ∃ j ∈ [ 1 , n ] \exists j \in [1,n] ∃j∈[1,n] ,使得:
[ x i ( j ) y i − exp ( ω ⋅ x i ) x i ( j ) 1 + exp ( ω ⋅ x i ) ] > δ \left[ x_i^{(j)} y_i-\frac{\exp (\omega \cdot x_i) x_i^{(j)}}{1+\exp (\omega \cdot x_i)} \right] \gt \delta [xi(j)yi−1+exp(ω⋅xi)exp(ω⋅xi)xi(j)]>δ
则:
ω ← ω + η [ x i ( j ) y i − exp ( ω ⋅ x i ) x i ( j ) 1 + exp ( ω ⋅ x i ) ] \omega\leftarrow \omega+\eta \left[ x_i^{(j)} y_i-\frac{\exp (\omega \cdot x_i) x_i^{(j)}}{1+\exp (\omega \cdot x_i)} \right] ω←ω+η[xi(j)yi−1+exp(ω⋅xi)exp(ω⋅xi)xi(j)]转至2,直到对于 ∀ j ∈ [ 1 , n ] \forall j \in [1,n] ∀j∈[1,n] ,都有
[ x i ( j ) y i − exp ( ω ⋅ x i ) x i ( j ) 1 + exp ( ω ⋅ x i ) ] ≤ δ \left[ x_i^{(j)} y_i-\frac{\exp (\omega \cdot x_i) x_i^{(j)}}{1+\exp (\omega \cdot x_i)} \right] \leq \delta [xi(j)yi−1+exp(ω⋅xi)exp(ω⋅xi)xi(j)]≤δ
4 代码实现
-
了解数据
- 输入向量
χ
\chi
χ 和
Υ
\Upsilon
Υ
χ = ( x 1 ( 1 ) x 1 ( 2 ) x 1 ( 3 ) ⋯ x 1 ( n ) x 2 ( 1 ) x 2 ( 2 ) x 2 ( 3 ) ⋯ x 2 ( n ) ⋮ ⋮ ⋮ ⋱ ⋮ x N ( 1 ) x N ( 2 ) x N ( 3 ) ⋯ x N ( n ) ) Υ = ( y 1 y 2 ⋮ y N ) \begin{array}{l} \chi = \begin{pmatrix} x_1^{(1)}&x_1^{(2)}&x_1^{(3)}&\cdots&x_1^{(n)}\\[2ex] x_2^{(1)}&x_2^{(2)}&x_2^{(3)}&\cdots&x_2^{(n)}\\[2ex] \vdots&\vdots&\vdots&\ddots&\vdots\\[2ex] x_N^{(1)}&x_N^{(2)}&x_N^{(3)}&\cdots&x_N^{(n)}\\[2ex] \end{pmatrix} \Upsilon = \begin{pmatrix} y_1\\[2ex] y_2\\[2ex] \vdots\\[2ex] y_N\\[2ex] \end{pmatrix} \end{array} χ=⎝⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎛x1(1)x2(1)⋮xN(1)x1(2)x2(2)⋮xN(2)x1(3)x2(3)⋮xN(3)⋯⋯⋱⋯x1(n)x2(n)⋮xN(n)⎠⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎞Υ=⎝⎜⎜⎜⎜⎜⎜⎜⎜⎜⎛y1y2⋮yN⎠⎟⎟⎟⎟⎟⎟⎟⎟⎟⎞ - 扩充后的输入向量
χ = ( x 1 ( 1 ) x 1 ( 2 ) x 1 ( 3 ) ⋯ x 1 ( n ) 1 x 2 ( 1 ) x 2 ( 2 ) x 2 ( 3 ) ⋯ x 2 ( n ) 1 ⋮ ⋮ ⋮ ⋱ ⋮ ⋮ x N ( 1 ) x N ( 2 ) x N ( 3 ) ⋯ x N ( n ) 1 ) \chi = \begin{pmatrix} x_1^{(1)}&x_1^{(2)}&x_1^{(3)}&\cdots&x_1^{(n)}&1\\[2ex] x_2^{(1)}&x_2^{(2)}&x_2^{(3)}&\cdots&x_2^{(n)}&1\\[2ex] \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\[2ex] x_N^{(1)}&x_N^{(2)}&x_N^{(3)}&\cdots&x_N^{(n)}&1\\[2ex] \end{pmatrix} χ=⎝⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎛x1(1)x2(1)⋮xN(1)x1(2)x2(2)⋮xN(2)x1(3)x2(3)⋮xN(3)⋯⋯⋱⋯x1(n)x2(n)⋮xN(n)11⋮1⎠⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎞
对应的代码为:
def preprocessing(X): X_plus = np.ones(X.shape[0]).reshape(-1, 1) X_new = np.hstack([X, X_plus]) return X_new X = preprocessing(X)
- 输入向量
χ
\chi
χ 和
Υ
\Upsilon
Υ
-
初始化参数。
-
权值向量 ω \omega ω 是用来和样本 x i x_{i} xi 求内积的一个向量,对应于扩充后的输入向量:
ω = ( ω ( 1 ) ω ( 2 ) ω ( 3 ) ⋯ ω ( n ) b ) = ( 0 0 0 ⋯ 0 ) \omega = \begin{pmatrix} \omega^{(1)}&\omega^{(2)}&\omega^{(3)}&\cdots&\omega^{(n)}&b \end{pmatrix} = \begin{pmatrix} 0&0&0&\cdots&0\\ \end{pmatrix} ω=(ω(1)ω(2)ω(3)⋯ω(n)b)=(000⋯0)
对应的代码为:w = np.zeros(X.shape[1]).reshape(1, -1)
-
超参数 η \eta η ,赋予默认值 1
η = 1 \eta = 1 η=1eta = 1
-
梯度更新的阈值 δ \delta δ ,赋予默认值 0.01
delta = 1e-2
-
-
计算梯度
-
定义
∇ L ( ω ) = ( ∂ L ( ω ) ∂ ω ( 1 ) ⋯ ∂ L ( ω ) ∂ ω ( n + 1 ) ) \displaystyle \nabla L(\omega) = \left( \frac{\partial L(\omega)}{\partial \omega^{(1)}} \quad \cdots \quad \frac{\partial L(\omega)}{\partial \omega^{(n+1)}} \right) ∇L(ω)=(∂ω(1)∂L(ω)⋯∂ω(n+1)∂L(ω)) -
公式解析
∂ L ( ω ) ∂ ω ( j ) = ∑ i = 1 N [ x i ( j ) y i − exp ( ω ⋅ x i ) x i ( j ) 1 + exp ( ω ⋅ x i ) ] = ∑ i = 1 N [ x i ( j ) ( y i − exp ( ω ⋅ x i ) 1 + exp ( ω ⋅ x i ) ) ] \frac{\partial L(\omega)}{\partial \omega^{(j)}} = \sum_{i=1}^N\left[x_i^{(j)} y_i-\frac{\exp (\omega \cdot x_i) x_i^{(j)}}{1+\exp (\omega \cdot x_i)}\right] = \sum_{i=1}^N\left[x_i^{(j)} \left(y_i-\frac{\exp (\omega \cdot x_i)}{1+\exp (\omega \cdot x_i)}\right)\right] ∂ω(j)∂L(ω)=i=1∑N[xi(j)yi−1+exp(ω⋅xi)exp(ω⋅xi)xi(j)]=i=1∑N[xi(j)(yi−1+exp(ω⋅xi)exp(ω⋅xi))]
等式左边,是对 ω \omega ω 的第 j j j 个元素求偏导,得到的就是梯度第 j j j 个元素的值。等式右边是一个求和公式。先看求和符号, i ∈ [ 1 , N ] i\in[1,N] i∈[1,N] 表示所有的样本;再来看求和的具体内容, x i x_{i} xi 是第 i i i 个样本的特征向量, x i ( j ) x_{i}^{(j)} xi(j) 第 i i i 个样本的第 j j j 个特征,即特征向量的第 j j j 个元素,二者有如下关系:
x i = ( x i ( 1 ) x i ( 2 ) ⋯ x i ( j ) ⋯ x i ( n ) 1 ) x_{i} = \begin{pmatrix} x_{i}^{(1)}&x_i^{(2)}&\cdots&x_i^{(j)}&\cdots&x_i^{(n)}&1 \\ \end{pmatrix} xi=(xi(1)xi(2)⋯xi(j)⋯xi(n)1)
所以等式右边表示的是:所有样本的、第 j j j 个特征的、计算值的求和。 -
公式计算过程
由内向外看,对于样本 x i x_{i} xi ,最内部的内积,在这里是一个数值:
z i = ω ⋅ x i = ( ω ( 1 ) ω ( 2 ) ω ( 3 ) ⋯ ω ( n ) b ) ( x i ( 1 ) x i ( 2 ) ⋯ x i ( j ) ⋯ x i ( n ) 1 ) = ω ( 1 ) x i ( 1 ) + ω ( 2 ) x i ( 2 ) + ⋯ + ω ( n ) x i ( n ) + b z_{i} = \omega \cdot x_{i} = \begin{pmatrix} \omega^{(1)}&\omega^{(2)}&\omega^{(3)}&\cdots&\omega^{(n)}&b \end{pmatrix} \begin{pmatrix} x_{i}^{(1)} \\ x_i^{(2)} \\ \cdots \\ x_i^{(j)}\\ \cdots \\ x_i^{(n)} \\ 1 \end{pmatrix} = \omega^{(1)}x_{i}^{(1)} + \omega^{(2)}x_{i}^{(2)} + \cdots + \omega^{(n)}x_{i}^{(n)} + b zi=ω⋅xi=(ω(1)ω(2)ω(3)⋯ω(n)b)⎝⎜⎜⎜⎜⎜⎜⎜⎜⎜⎛xi(1)xi(2)⋯xi(j)⋯xi(n)1⎠⎟⎟⎟⎟⎟⎟⎟⎟⎟⎞=ω(1)xi(1)+ω(2)xi(2)+⋯+ω(n)xi(n)+b
紧接着点乘的,是一个Sigmoid函数,变换后,仍旧是一个数值:
z ^ i = exp ( z i ) 1 + exp ( z i ) \hat z_{i} = \frac{\exp (z_{i})}{1+\exp (z_{i})} z^i=1+exp(zi)exp(zi)紧接着,是与 y i y_{i} yi 做差,再与 x i ( j ) x_{i}^{(j)} xi(j) 相差,这两个变量都是数值,所以第 i i i 个样本、第 j j j 个特征最终的计算值为:
x i ( j ) ( y i − z ^ i ) x_{i}^{(j)} \left(y_{i} - \hat z_{i}\right) xi(j)(yi−z^i)
最后,把所有样本的上述计算值求和,就是对 ω \omega ω 的第 j j j 个元素求偏导的结果。 -
公式变形
教材中给出的上述公式,每次只能计算梯度中的一个值,如果想一次性计算出整个梯度的值,要如何做呢?
将上述计算过程中的 x i x_{i} xi 变成 χ \chi χ ,则 z z z 的值为:
z = χ ⋅ ω = ( x 1 ( 1 ) x 1 ( 2 ) x 1 ( 3 ) ⋯ x 1 ( n ) 1 x 2 ( 1 ) x 2 ( 2 ) x 2 ( 3 ) ⋯ x 2 ( n ) 1 ⋮ ⋮ ⋮ ⋱ ⋮ ⋮ x N ( 1 ) x N ( 2 ) x N ( 3 ) ⋯ x N ( n ) 1 ) ( ω ( 1 ) ω ( 2 ) ⋮ ω ( n ) b ) = ( ω ( 1 ) x 1 ( 1 ) + ω ( 2 ) x 1 ( 2 ) + ⋯ + ω ( n ) x 1 ( n ) + b ω ( 1 ) x 2 ( 1 ) + ω ( 2 ) x 2 ( 2 ) + ⋯ + ω ( n ) x 2 ( n ) + b ⋮ ω ( 1 ) x N ( 1 ) + ω ( 2 ) x N ( 2 ) + ⋯ + ω ( n ) x N ( n ) + b ) = ( x 1 ⋅ ω x 2 ⋅ ω ⋮ x i ⋅ ω ⋮ x N ⋅ ω ) = ( z 1 z 2 ⋮ z i ⋮ z N ) z = \chi \cdot \omega = \begin{pmatrix} x_1^{(1)}&x_1^{(2)}&x_1^{(3)}&\cdots&x_1^{(n)}&1\\[2ex] x_2^{(1)}&x_2^{(2)}&x_2^{(3)}&\cdots&x_2^{(n)}&1\\[2ex] \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\[2ex] x_N^{(1)}&x_N^{(2)}&x_N^{(3)}&\cdots&x_N^{(n)}&1\\[2ex] \end{pmatrix} \begin{pmatrix} \omega^{(1)} \\[2ex] \omega^{(2)} \\[2ex] \vdots \\[2ex] \omega^{(n)} \\[2ex] b \end{pmatrix} = \begin{pmatrix} \omega^{(1)}x_{1}^{(1)} + \omega^{(2)}x_{1}^{(2)} + \cdots + \omega^{(n)}x_{1}^{(n)} + b \\[2ex] \omega^{(1)}x_{2}^{(1)} + \omega^{(2)}x_{2}^{(2)} + \cdots + \omega^{(n)}x_{2}^{(n)} + b \\[2ex] \vdots \\[2ex] \omega^{(1)}x_{N}^{(1)} + \omega^{(2)}x_{N}^{(2)} + \cdots + \omega^{(n)}x_{N}^{(n)} + b \end{pmatrix} = \begin{pmatrix} x_1 \cdot \omega\\[2ex] x_2 \cdot \omega\\[2ex] \vdots\\[2ex] x_i \cdot \omega\\[2ex] \vdots\\[2ex] x_N \cdot \omega\\[2ex] \end{pmatrix} = \begin{pmatrix} z_1\\[2ex] z_2\\[2ex] \vdots\\[2ex] z_i\\[2ex] \vdots\\[2ex] z_N\\[2ex] \end{pmatrix} z=χ⋅ω=⎝⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎛x1(1)x2(1)⋮xN(1)x1(2)x2(2)⋮xN(2)x1(3)x2(3)⋮xN(3)⋯⋯⋱⋯x1(n)x2(n)⋮xN(n)11⋮1⎠⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎞⎝⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎛ω(1)ω(2)⋮ω(n)b⎠⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎞=⎝⎜⎜⎜⎜⎜⎜⎜⎜⎛ω(1)x1(1)+ω(2)x1(2)+⋯+ω(n)x1(n)+bω(1)x2(1)+ω(2)x2(2)+⋯+ω(n)x2(n)+b⋮ω(1)xN(1)+ω(2)xN(2)+⋯+ω(n)xN(n)+b⎠⎟⎟⎟⎟⎟⎟⎟⎟⎞=⎝⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎛x1⋅ωx2⋅ω⋮xi⋅ω⋮xN⋅ω⎠⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎞=⎝⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎛z1z2⋮zi⋮zN⎠⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎞z = np.dot(X, w.T) # 将点乘转换为矩阵乘法
使用Sigmoid函数对 z z z 进行变换:
z ^ = exp ( z ) 1 + exp ( z ) \hat z = \frac{\exp (z)}{1+\exp (z)} z^=1+exp(z)exp(z)
def sigmoid(x): return 1/(1 + np.exp(-x)) z = sigmoid(z)
和 Υ \Upsilon Υ 做差:
Υ − z ^ = ( y 1 − z ^ 1 y 2 − z ^ 2 ⋮ y i − z ^ i ⋮ y N − z ^ N ) \Upsilon - \hat z = \begin{pmatrix} y_{1} - \hat z_1\\[2ex] y_{2} - \hat z_2\\[2ex] \vdots\\[2ex] y_{i} - \hat z_i\\[2ex] \vdots\\[2ex] y_{N} - \hat z_N\\[2ex] \end{pmatrix} Υ−z^=⎝⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎛y1−z^1y2−z^2⋮yi−z^i⋮yN−z^N⎠⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎞
y - z
再和 χ \chi χ 相乘:
( x 1 ( 1 ) ⋯ x 1 ( j ) ⋯ x 1 ( n ) 1 x 2 ( 1 ) ⋯ x 2 ( j ) ⋯ x 2 ( n ) 1 ⋮ ⋮ ⋱ ⋮ ⋮ ⋮ x i ( 1 ) ⋯ x i ( j ) ⋯ x i ( n ) 1 ⋮ ⋮ ⋱ ⋮ ⋮ ⋮ x N ( 1 ) ⋯ x N ( j ) ⋯ x N ( n ) 1 ) × ( y 1 − z ^ 1 y 2 − z ^ 2 ⋮ y i − z ^ i ⋮ y N − z ^ N ) = ( x 1 ( 1 ) ( y 1 − z ^ 1 ) ⋯ x 1 ( j ) ( y 1 − z ^ 1 ) ⋯ x 1 ( n ) ( y 1 − z ^ 1 ) ( y 1 − z ^ 1 ) x 2 ( 1 ) ( y 2 − z ^ 2 ) ⋯ x 2 ( j ) ( y 2 − z ^ 2 ) ⋯ x 2 ( n ) ( y 2 − z ^ 2 ) ( y 2 − z ^ 2 ) ⋮ ⋯ ⋮ ⋱ ⋮ ⋮ x i ( 1 ) ( y i − z ^ i ) ⋯ x i ( j ) ( y i − z ^ i ) ⋯ x i ( n ) ( y i − z ^ i ) ( y i − z ^ i ) ⋮ ⋯ ⋮ ⋱ ⋮ ⋮ x N ( 1 ) ( y N − z ^ N ) ⋯ x N ( j ) ( y N − z ^ N ) ⋯ x N ( n ) ( y N − z ^ N ) ( y N − z ^ N ) ) \begin{pmatrix} x_1^{(1)}&\cdots&x_1^{(j)}&\cdots&x_1^{(n)}&1\\[2ex] x_2^{(1)}&\cdots&x_2^{(j)}&\cdots&x_2^{(n)}&1\\[2ex] \vdots&\vdots&\ddots&\vdots&\vdots&\vdots\\[2ex] x_i^{(1)}&\cdots&x_i^{(j)}&\cdots&x_i^{(n)}&1\\[2ex] \vdots&\vdots&\ddots&\vdots&\vdots&\vdots\\[2ex] x_N^{(1)}&\cdots&x_N^{(j)}&\cdots&x_N^{(n)}&1\\[2ex] \end{pmatrix} \times \begin{pmatrix} y_{1} - \hat z_1\\[2ex] y_{2} - \hat z_2\\[2ex] \vdots\\[2ex] y_{i} - \hat z_i\\[2ex] \vdots\\[2ex] y_{N} - \hat z_N\\[2ex] \end{pmatrix} = \begin{pmatrix} x_1^{(1)}(y_{1}-\hat z_1)&\cdots&x_1^{(j)}(y_{1}-\hat z_1)&\cdots&x_1^{(n)}(y_{1}-\hat z_1)&(y_{1}-\hat z_1)\\[2ex] x_2^{(1)}(y_{2}-\hat z_2)&\cdots&x_2^{(j)}(y_{2}-\hat z_2)&\cdots&x_2^{(n)}(y_{2}-\hat z_2)&(y_{2}-\hat z_2)\\[2ex] \vdots&\cdots&\vdots&\ddots&\vdots&\vdots\\[2ex] x_i^{(1)}(y_{i}-\hat z_i)&\cdots&x_i^{(j)}(y_{i}-\hat z_i)&\cdots&x_i^{(n)}(y_{i}-\hat z_i)&(y_{i}-\hat z_i)\\[2ex] \vdots&\cdots&\vdots&\ddots&\vdots&\vdots\\[2ex] x_N^{(1)}(y_{N}-\hat z_N)&\cdots&x_N^{(j)}(y_{N}-\hat z_N)&\cdots&x_N^{(n)}(y_{N}-\hat z_N)&(y_{N}-\hat z_N)\\[2ex] \end{pmatrix} ⎝⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎛x1(1)x2(1)⋮xi(1)⋮xN(1)⋯⋯⋮⋯⋮⋯x1(j)x2(j)⋱xi(j)⋱xN(j)⋯⋯⋮⋯⋮⋯x1(n)x2(n)⋮xi(n)⋮xN(n)11⋮1⋮1⎠⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎞×⎝⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎛y1−z^1y2−z^2⋮yi−z^i⋮yN−z^N⎠⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎞=⎝⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎛x1(1)(y1−z^1)x2(1)(y2−z^2)⋮xi(1)(yi−z^i)⋮xN(1)(yN−z^N)⋯⋯⋯⋯⋯⋯x1(j)(y1−z^1)x2(j)(y2−z^2)⋮xi(j)(yi−z^i)⋮xN(j)(yN−z^N)⋯⋯⋱⋯⋱⋯x1(n)(y1−z^1)x2(n)(y2−z^2)⋮xi(n)(yi−z^i)⋮xN(n)(yN−z^N)(y1−z^1)(y2−z^2)⋮(yi−z^i)⋮(yN−z^N)⎠⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎞
需要注意的是,这里既不是点乘,也不是矩阵乘法,而是将 $ \Upsilon - \hat z$ 中的元素,与 χ \chi χ 每一列元素对应相乘。之所以要这样做,是因为我们最终想得到的就是:
x i ( j ) ( y i − z ^ i ) x_{i}^{(j)} \left(y_{i} - \hat z_{i}\right) xi(j)(yi−z^i)
得益于Python,我们可以直接写作:X * (y - z)
最后,我们需要对所有样本的、第 j j j 个特征的计算值进行求和,得到的就是一个由偏导组成的向量,即梯度:
( ∂ L ( ω ) ∂ ω ( 1 ) ⋯ ∂ L ( ω ) ∂ ω ( n + 1 ) ) = ( ∑ i = 1 N [ x i ( 1 ) ( y i − z ^ i ) ] ∑ i = 1 N [ x i ( 2 ) ( y i − z ^ i ) ] ⋯ ∑ i = 1 N [ x i ( n ) ( y i − z ^ i ) ] ∑ i = 1 N [ ( y i − z ^ i ) ] ) \left( \frac{\partial L(\omega)}{\partial \omega^{(1)}} \quad \cdots \quad \frac{\partial L(\omega)}{\partial \omega^{(n+1)}} \right) = \begin{pmatrix} \sum_{i=1}^N\left[x_i^{(1)}(y_{i}-\hat z_i)\right]&\sum_{i=1}^N\left[x_i^{(2)}(y_{i}-\hat z_i)\right]&\cdots&\sum_{i=1}^N\left[x_i^{(n)}(y_{i}-\hat z_i)\right]&\sum_{i=1}^N\left[(y_{i}-\hat z_i)\right]\\[2ex] \end{pmatrix} (∂ω(1)∂L(ω)⋯∂ω(n+1)∂L(ω))=(∑i=1N[xi(1)(yi−z^i)]∑i=1N[xi(2)(yi−z^i)]⋯∑i=1N[xi(n)(yi−z^i)]∑i=1N[(yi−z^i)])
这里的代码表示为:(X * (y - z)).sum(axis=0)
最核心的梯度部分梳理完了,将代码整理如下:
def sigmoid(x): return 1/(1 + np.exp(-x)) z = np.dot(X, w.T) # 求梯度(方向导数) grad = (X * (y - sigmoid(z))).sum(axis=0)
-
-
参数更新
根据已经计算出来的梯度,更新参数 ω \omega ω :
ω ^ = ( ω ( 1 ) ω ( 2 ) ⋯ ω ( n ) b ) + η ( ∂ L ( ω ) ∂ ω ( 1 ) ⋯ ∂ L ( ω ) ∂ ω ( n + 1 ) ) = ( ω ( 1 ) + η ∂ L ( ω ) ∂ ω ( 1 ) ⋯ ω ( n ) + η ∂ L ( ω ) ∂ ω ( n ) b + η ∂ L ( ω ) ∂ ω ( n + 1 ) ) \hat \omega = \begin{pmatrix} \omega^{(1)}&\omega^{(2)}&\cdots&\omega^{(n)}&b \end{pmatrix} + \eta \left( \frac{\partial L(\omega)}{\partial \omega^{(1)}} \quad \cdots \quad \frac{\partial L(\omega)}{\partial \omega^{(n+1)}} \right) = \left( \omega^{(1)}+\eta \frac{\partial L(\omega)}{\partial \omega^{(1)}} \quad \cdots \quad \omega^{(n)}+\eta \frac{\partial L(\omega)}{\partial \omega^{(n)}} \quad b+\eta \frac{\partial L(\omega)}{\partial \omega^{(n+1)}} \right) ω^=(ω(1)ω(2)⋯ω(n)b)+η(∂ω(1)∂L(ω)⋯∂ω(n+1)∂L(ω))=(ω(1)+η∂ω(1)∂L(ω)⋯ω(n)+η∂ω(n)∂L(ω)b+η∂ω(n+1)∂L(ω))w += eta * grad
-
停止条件
模型中设置了两个停止条件,只要满足任一个,模型都会停止迭代。
-
当梯度小于给定的阈值时,意味着接下来的参数更新只能带来很少的收益,也就意味着参数达到了我们认可的一种最优状态。在这里,只有当梯度中的每一个元素都小于给定的阈值时,我们才停止更新:
if (np.abs(grad) <= delta).all(): print('停止迭代')
-
设定最大迭代次数,是为了防止模型的过拟合。所以当迭代次数达到提前设定的最大值时,也要停止更新:
loop = 1 while loop <= max_iter: pass # 模型训练过程 loop += 1
-
-
模型预测
求解出参数 ω \omega ω 的值,根据模型的定义,我们就能预测新样本所属的标签:
P ( Y = 1 ∣ x ) = exp ( ω ⋅ x ) 1 + exp ( ω ⋅ x ) P ( Y = 0 ∣ x ) = 1 1 + exp ( ω ⋅ x ) P(Y=1 | x)=\frac{\exp (\omega \cdot x)}{1+\exp (\omega \cdot x)} \\[4ex] P(Y=0 | x)=\frac{1}{1+\exp (\omega \cdot x)} P(Y=1∣x)=1+exp(ω⋅x)exp(ω⋅x)P(Y=0∣x)=1+exp(ω⋅x)1X_test = preprocessing(X_test) p = sigmoid(np.dot(X_test, w.T)) p[np.where(p>=0.5)] = 1 p[np.where(p<0.5)] = 0
5 代码封装
class LogisticRegression(object):
"""
Binomial Logistic Regression classifier.
Parameters
----------
eta : float, default=0.1
Step size for each iteration.
max_iter : int, default=10000
Maximum number of iterations taken for the solvers to converge.
delta : float, default=1e-2
deltaerance for stopping criteria.
method : {'BGD', 'SGD'}, default='BGD'
The way of Gradient Descent.
The 'BGD' is Batch Gradient Descent. The 'SGD' is Stochastic Gradient Descent.
Attributes
----------
w : ndarray of shape (1, n_features)
Coefficient of the features in the model.
"""
def __init__(self,
eta=0.1,
max_iter=10000,
delta=1e-2,
method='BGD'):
self.eta = eta
self.max_iter = max_iter
self.delta = delta
self.method = method
self.w = None
# Y = 1 时的模型
def sigmoid(self, x):
"""
Sigmoid function.
Parameters
----------
x : float or ndarray of shape (n_samples, )
The independent variable of the Sigmoid function.
Returns
-------
y : float or ndarray of shape (n_samples, )
Function value
"""
return 1/(1 + np.exp(-x))
def preprocessing(self, X):
"""
Extend input vector.
Parameters
----------
X : ndarray of shape (n_samples, n_features)
Input vector.
Returns
-------
X_new : ndarray of shape (n_samples, n_features + 1)
Extend input vector.
"""
X_plus = np.ones(X.shape[0]).reshape(-1, 1)
X_new = np.hstack([X, X_plus])
return X_new
def fit(self, X, y):
"""
Fit the model according to the given training data.
Parameters
----------
X : ndarray of shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and
n_features is the number of features.
y : ndarray of shape (n_samples, )
Target vector relative to X.
Returns
-------
self
Fitted estimator.
"""
# 初始化 w
X = self.preprocessing(X)
self.w = np.zeros(X.shape[1]).reshape(1, -1)
if self.method == 'BGD':
loop = 1
while loop <= self.max_iter:
# w * x
z = np.dot(X, self.w.T)
# 求梯度(方向导数)
grad = (X * (y - self.sigmoid(z))).sum(axis=0)
# 下降(上升)的幅度
if (np.abs(grad) <= self.delta).all():
break
else:
self.w += self.eta * grad
loop += 1
elif self.method == 'SGD':
# 随机获取样本点
index = random.randint(0, X.shape[0] - 1)
xi = X[index]
yi = y[index]
loop = 1
while loop <= self.max_iter:
z = np.dot(xi, self.w.T)
grad = xi * (yi - self.sigmoid(z))
if (np.abs(grad) <= self.delta).all():
break
else:
self.w += self.eta * grad
loop += 1
else:
pass
def predict(self, X):
"""
Predict class labels for samples in X.
Parameters
----------
X : ndarray of shape (n_samples, n_features)
Samples.
Returns
-------
p : ndarray of shape (n_samples,)
Predicted class label per sample.
"""
X = self.preprocessing(X)
p = self.sigmoid(np.dot(X, self.w.T))
p[np.where(p>=0.5)] = 1
p[np.where(p<0.5)] = 0
return p
文章作者整理的内容,原书中并没有这一部分的表述 ↩︎