Notes on the Watermelon Book (西瓜书): Linear Models



Linear Models

Lei_ZM
2019-09-10



1. Univariate Linear Regression

Outline for deriving the bias $b$ and the weight $w$:

  1. Derive the loss function $E(w, b)$ from the least-squares method

  2. Prove that the loss function $E(w, b)$ is convex

  3. Take the first-order partial derivatives of $E(w, b)$ with respect to $b$ and $w$

  4. Set each first-order partial derivative to zero and solve for $b$ and $w$


1.1. Deriving the loss function $E(w, b)$ from the least-squares method

$$
\begin{aligned} E(w, b) &=\sum_{i=1}^{m}\left(y_{i}-f\left(x_{i}\right)\right)^{2} \\ &=\sum_{i=1}^{m}\left(y_{i}-\left(w x_{i}+b\right)\right)^{2} \\ &=\sum_{i=1}^{m}\left(y_{i}-w x_{i}-b\right)^{2} \end{aligned} \tag{3.4}
$$
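As a quick numerical sketch of Eq. (3.4) (assuming NumPy; the function name `squared_error` is ours, not the book's):

```python
import numpy as np

def squared_error(w, b, x, y):
    """Sum-of-squares loss E(w, b) from Eq. (3.4)."""
    return float(np.sum((y - (w * x + b)) ** 2))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # lies exactly on y = 2x + 0

print(squared_error(2.0, 0.0, x, y))  # perfect fit -> loss 0
print(squared_error(1.0, 0.0, x, y))  # worse fit -> positive loss
```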



1.2. Proving the loss function is convex

1.2.1. Convexity test for a function of two variables:

Suppose $f(x, y)$ has continuous second-order partial derivatives on a region $D$, and write $A=f_{xx}^{\prime \prime}(x, y)$, $B=f_{xy}^{\prime \prime}(x, y)$, $C=f_{yy}^{\prime \prime}(x, y)$. Then:

  1. If $A>0$ and $AC-B^{2}\geq 0$ everywhere on $D$, then $f(x, y)$ is convex on $D$;
  2. If $A<0$ and $AC-B^{2}\geq 0$ everywhere on $D$, then $f(x, y)$ is concave on $D$.

1.2.2. Extrema of a convex (concave) function of two variables:

Suppose $f(x, y)$ is a convex (or concave) function with continuous partial derivatives on an open region $D$, $(x_{0}, y_{0})\in D$, and $f_{x}^{\prime}(x_{0}, y_{0})=0$, $f_{y}^{\prime}(x_{0}, y_{0})=0$. Then $f(x_{0}, y_{0})$ must be the minimum (or maximum) of $f(x, y)$ on $D$.


1.2.3. Proof

To show that the loss function $E(w, b)$ is convex in $w$ and $b$, first compute $A=f_{xx}^{\prime \prime}$:

$$
\begin{aligned} \frac{\partial E(w, b)}{\partial w} &=\frac{\partial}{\partial w}\left[\sum_{i=1}^{m}\left(y_{i}-\left(w x_{i}+b\right)\right)^{2}\right] \\ &=\sum_{i=1}^{m} \frac{\partial}{\partial w}\left(y_{i}-w x_{i}-b\right)^{2} \\ &=\sum_{i=1}^{m} 2\left(y_{i}-w x_{i}-b\right)\left(-x_{i}\right) \\ &=2\left(w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right) \end{aligned} \tag{3.5}
$$

Hence:

$$
\begin{aligned} \frac{\partial^{2} E(w, b)}{\partial w^{2}} &=\frac{\partial}{\partial w}\left(\frac{\partial E(w, b)}{\partial w}\right) \\ &=\frac{\partial}{\partial w}\left[2\left(w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right)\right] \\ &=\frac{\partial}{\partial w}\left(2 w \sum_{i=1}^{m} x_{i}^{2}\right) \\ &=2 \sum_{i=1}^{m} x_{i}^{2} \end{aligned}
$$

This is exactly $A=f_{xx}^{\prime \prime}$.

Next, compute $B=f_{xy}^{\prime \prime}$:

$$
\begin{aligned} \frac{\partial^{2} E(w, b)}{\partial w \partial b} &=\frac{\partial}{\partial b}\left(\frac{\partial E(w, b)}{\partial w}\right) \\ &=\frac{\partial}{\partial b}\left[2\left(w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right)\right] \\ &=\frac{\partial}{\partial b}\left(-2 \sum_{i=1}^{m} y_{i} x_{i}+2 b \sum_{i=1}^{m} x_{i}\right) \\ &=2 \sum_{i=1}^{m} x_{i} \end{aligned}
$$

This is exactly $B=f_{xy}^{\prime \prime}$.

Finally, compute $C=f_{yy}^{\prime \prime}$:

$$
\begin{aligned} \frac{\partial E(w, b)}{\partial b} &=\frac{\partial}{\partial b}\left[\sum_{i=1}^{m}\left(y_{i}-\left(w x_{i}+b\right)\right)^{2}\right] \\ &=\sum_{i=1}^{m} \frac{\partial}{\partial b}\left(y_{i}-w x_{i}-b\right)^{2} \\ &=\sum_{i=1}^{m} 2\left(y_{i}-w x_{i}-b\right)(-1) \\ &=2\left(m b-\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)\right) \end{aligned} \tag{3.6}
$$

Hence:

$$
\begin{aligned} \frac{\partial^{2} E(w, b)}{\partial b^{2}} &=\frac{\partial}{\partial b}\left(\frac{\partial E(w, b)}{\partial b}\right) \\ &=\frac{\partial}{\partial b}\left[2\left(m b-\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)\right)\right] \\ &=\frac{\partial}{\partial b}(2 m b) \\ &=2 m \end{aligned}
$$

This is exactly $C=f_{yy}^{\prime \prime}$.

Putting these together:

$$
\left\{ \begin{aligned} &A=f_{xx}^{\prime \prime}=2 \sum_{i=1}^{m} x_{i}^{2} \\ &B=f_{xy}^{\prime \prime}=2 \sum_{i=1}^{m} x_{i} \\ &C=f_{yy}^{\prime \prime}=2 m \end{aligned} \right.
$$

Therefore:

$$
\begin{aligned} A C-B^{2} &=2 m \cdot 2 \sum_{i=1}^{m} x_{i}^{2}-\left(2 \sum_{i=1}^{m} x_{i}\right)^{2} \\ &=4 m \sum_{i=1}^{m} x_{i}^{2}-4\left(\sum_{i=1}^{m} x_{i}\right)^{2} \\ &=4 m \sum_{i=1}^{m} x_{i}^{2}-4 m \cdot \frac{1}{m}\left(\sum_{i=1}^{m} x_{i}\right)^{2} \\ &=4 m \sum_{i=1}^{m} x_{i}^{2}-4 m \bar{x} \sum_{i=1}^{m} x_{i} \\ &=4 m \sum_{i=1}^{m}\left(x_{i}^{2}-x_{i} \bar{x}\right) \\ &=4 m \sum_{i=1}^{m}\left(x_{i}^{2}-x_{i} \bar{x}-x_{i} \bar{x}+x_{i} \bar{x}\right) \\ &\qquad \text{using } \sum_{i=1}^{m} x_{i} \bar{x}=\bar{x} \sum_{i=1}^{m} x_{i}=m \bar{x}^{2}=\sum_{i=1}^{m} \bar{x}^{2} \text{ for the last term} \\ &=4 m \sum_{i=1}^{m}\left(x_{i}^{2}-2 x_{i} \bar{x}+\bar{x}^{2}\right) \\ &=4 m \sum_{i=1}^{m}\left(x_{i}-\bar{x}\right)^{2} \end{aligned}
$$

Hence:

$$
AC-B^{2} = 4 m \sum_{i=1}^{m}\left(x_{i}-\bar{x}\right)^{2} \geq 0
$$

That is, the loss function $E(w, b)$ is convex in $w$ and $b$, which completes the proof.
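The identity $AC-B^{2}=4m\sum_{i=1}^{m}(x_{i}-\bar{x})^{2}$ can be spot-checked numerically (a sketch assuming NumPy, not part of the book):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
m = len(x)

A = 2 * np.sum(x ** 2)   # d^2 E / dw^2
B = 2 * np.sum(x)        # d^2 E / dw db
C = 2 * m                # d^2 E / db^2

lhs = A * C - B ** 2
rhs = 4 * m * np.sum((x - x.mean()) ** 2)

# AC - B^2 collapses to 4m * sum((x_i - x_bar)^2), which is nonnegative
assert np.isclose(lhs, rhs)
assert lhs >= 0
```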



1.3. First-order partial derivatives of $E(w, b)$ with respect to $b$ and $w$

Partial derivative of $E(w, b)$ with respect to $b$:

$$
\begin{aligned} \frac{\partial E(w, b)}{\partial b} &=\frac{\partial}{\partial b}\left[\sum_{i=1}^{m}\left(y_{i}-\left(w x_{i}+b\right)\right)^{2}\right] \\ &=\sum_{i=1}^{m} \frac{\partial}{\partial b}\left(y_{i}-w x_{i}-b\right)^{2} \\ &=\sum_{i=1}^{m} 2\left(y_{i}-w x_{i}-b\right)(-1) \\ &=2\left(m b-\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)\right) \end{aligned} \tag{3.6}
$$

Partial derivative of $E(w, b)$ with respect to $w$:

$$
\begin{aligned} \frac{\partial E(w, b)}{\partial w} &=\frac{\partial}{\partial w}\left[\sum_{i=1}^{m}\left(y_{i}-\left(w x_{i}+b\right)\right)^{2}\right] \\ &=\sum_{i=1}^{m} \frac{\partial}{\partial w}\left(y_{i}-w x_{i}-b\right)^{2} \\ &=\sum_{i=1}^{m} 2\left(y_{i}-w x_{i}-b\right)\left(-x_{i}\right) \\ &=2\left(w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right) \end{aligned} \tag{3.5}
$$



1.4. Setting the partial derivatives to zero and solving for $b$ and $w$

Set the partial derivative of $E(w, b)$ with respect to $b$ to zero and solve for $b$:

$$
\begin{aligned} \frac{\partial E(w, b)}{\partial b} =2\left(m b-\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)\right) &=0 \\ \Rightarrow b&=\frac{1}{m}\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right) \\ &=\frac{1}{m}\sum_{i=1}^{m} y_{i} - w \cdot \frac{1}{m}\sum_{i=1}^{m} x_{i} \\ &=\bar{y}-w\bar{x} \end{aligned} \tag{3.8}
$$

Set the partial derivative of $E(w, b)$ with respect to $w$ to zero and solve for $w$:

$$
\begin{aligned} \frac{\partial E(w, b)}{\partial w} =2\left(w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right) &=0 \\ \Rightarrow w \sum_{i=1}^{m} x_{i}^{2} &= \sum_{i=1}^{m}y_{i} x_{i} - \sum_{i=1}^{m} b x_{i} \\ &\qquad \text{substitute } b=\bar{y}-w\bar{x} \\ \Rightarrow w \sum_{i=1}^{m} x_{i}^{2} &=\sum_{i=1}^{m} y_{i} x_{i}-\bar{y} \sum_{i=1}^{m} x_{i}+w \bar{x} \sum_{i=1}^{m} x_{i} \\ \Rightarrow w\left(\sum_{i=1}^{m} x_{i}^{2}-\bar{x} \sum_{i=1}^{m} x_{i}\right)&=\sum_{i=1}^{m} y_{i} x_{i}-\bar{y} \sum_{i=1}^{m} x_{i} \\ \Rightarrow w &= \frac{\sum_{i=1}^{m} y_{i} x_{i}-\bar{y} \sum_{i=1}^{m} x_{i}}{\sum_{i=1}^{m} x_{i}^{2}-\bar{x} \sum_{i=1}^{m} x_{i}} \\ &\qquad \text{using } \bar{y} \sum_{i=1}^{m} x_{i} = \bar{x} \sum_{i=1}^{m} y_{i} \text{ and } \bar{x}\sum_{i=1}^{m} x_{i} = \frac{1}{m} \left(\sum_{i=1}^{m} x_{i}\right)^{2} \\ &=\frac{\sum_{i=1}^{m} y_{i}\left(x_{i}-\bar{x}\right)}{\sum_{i=1}^{m} x_{i}^{2}-\frac{1}{m}\left(\sum_{i=1}^{m} x_{i}\right)^{2}} \end{aligned} \tag{3.7}
$$

w w w向量化,有:

$$
\begin{aligned} w &=\frac{\sum_{i=1}^{m} y_{i}\left(x_{i}-\bar{x}\right)}{\sum_{i=1}^{m} x_{i}^{2}-\frac{1}{m}\left(\sum_{i=1}^{m} x_{i}\right)^{2}} \\ &=\frac{\sum_{i=1}^{m} \left(y_{i} x_{i}-y_{i} \bar{x}\right)}{\sum_{i=1}^{m} \left(x_{i}^{2}-x_{i} \bar{x}\right)} \\ &\qquad \text{since } \frac{1}{m}\left(\sum_{i=1}^{m} x_{i}\right)^{2} = \left(\frac{1}{m} \sum_{i=1}^{m} x_{i}\right) \sum_{i=1}^{m} x_{i} = \bar{x} \sum_{i=1}^{m} x_{i} = \sum_{i=1}^{m} x_{i} \bar{x} \\ &=\frac{\sum_{i=1}^{m} \left(y_{i} x_{i}-y_{i} \bar{x}-x_{i} \bar{y}+\bar{x}\bar{y}\right)}{\sum_{i=1}^{m} \left(x_{i}^{2}-2 x_{i} \bar{x}+\bar{x}^{2}\right)} \\ &\qquad \text{using } \sum_{i=1}^{m} y_{i} \bar{x}=\sum_{i=1}^{m} x_{i} \bar{y}=m \bar{x} \bar{y}=\sum_{i=1}^{m} \bar{x} \bar{y} \text{ and } \sum_{i=1}^{m} x_{i} \bar{x}=m \bar{x}^{2}=\sum_{i=1}^{m} \bar{x}^{2} \\ &=\frac{\sum_{i=1}^{m} \left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum_{i=1}^{m} \left(x_{i}-\bar{x}\right)^{2}} \\ &=\frac{\boldsymbol{x}_{d}^{T} \boldsymbol{y}_{d}}{\boldsymbol{x}_{d}^{T} \boldsymbol{x}_{d}} \end{aligned}
$$

where $\boldsymbol{x}=\left(x_{1},x_{2},\cdots,x_{m}\right)^{T}$, $\boldsymbol{y}=\left(y_{1},y_{2},\cdots,y_{m}\right)^{T}$, $\boldsymbol{x}_{d}=\left(x_{1}-\bar{x},x_{2}-\bar{x},\cdots,x_{m}-\bar{x}\right)^{T}$, and $\boldsymbol{y}_{d}=\left(y_{1}-\bar{y},y_{2}-\bar{y},\cdots,y_{m}-\bar{y}\right)^{T}$.
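The closed-form solution above can be sanity-checked against a library fit (a sketch assuming NumPy; `np.polyfit` is used only as an independent least-squares reference):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 3.0 * x + 1.5 + 0.1 * rng.normal(size=100)  # noisy line, true w=3, b=1.5

# Eq. (3.7), vectorized form: w = x_d^T y_d / (x_d^T x_d)
xd = x - x.mean()
yd = y - y.mean()
w = (xd @ yd) / (xd @ xd)
# Eq. (3.8): b = y_bar - w * x_bar
b = y.mean() - w * x.mean()

# independent reference: degree-1 least-squares polynomial fit
w_ref, b_ref = np.polyfit(x, y, 1)
assert np.allclose([w, b], [w_ref, b_ref])
```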




2. Multivariate Linear Regression

Outline for deriving the weight vector $\hat{\boldsymbol{w}}$:

  1. Derive the loss function $E_{\hat{\boldsymbol{w}}}$ from the least-squares method

  2. Prove that the loss function $E_{\hat{\boldsymbol{w}}}$ is convex in $\hat{\boldsymbol{w}}$

  3. Take the first-order derivative of $E_{\hat{\boldsymbol{w}}}$ with respect to $\hat{\boldsymbol{w}}$

  4. Set the derivative to zero and solve for $\hat{\boldsymbol{w}}^{*}$


2.1. Combining $\boldsymbol{w}$ and $b$ into $\hat{\boldsymbol{w}}$

$$
\begin{aligned} f\left(\boldsymbol{x}_{i}\right) &=\boldsymbol{w}^{T} \boldsymbol{x}_{i}+b \\ &=\left(\begin{array}{cccc} w_{1} & w_{2} & \dots & w_{d}\end{array}\right) \left(\begin{array}{c}x_{i 1} \\ x_{i 2} \\ \vdots \\ x_{i d}\end{array}\right)+b \\ &=w_{1} x_{i 1}+w_{2} x_{i 2}+\ldots+w_{d} x_{i d}+b \\ &=w_{1} x_{i 1}+w_{2} x_{i 2}+\ldots+w_{d} x_{i d}+w_{d+1} \cdot 1 \qquad \left(w_{d+1}:=b\right) \\ &=\left(\begin{array}{ccccc} w_{1} & w_{2} & \dots & w_{d} & w_{d+1}\end{array}\right) \left(\begin{array}{c}x_{i 1} \\ x_{i 2} \\ \vdots \\ x_{i d} \\ 1\end{array}\right) \\ &=\hat{\boldsymbol{w}}^{T}\hat{\boldsymbol{x}}_{i} \end{aligned}
$$
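In code, this trick is just appending a constant-1 column to the design matrix and folding $b$ into the weight vector (a sketch assuming NumPy; the variable names are ours):

```python
import numpy as np

X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])   # m = 3 samples, d = 2 features
w = np.array([0.5, -1.0])
b = 2.0

# append a constant-1 column and absorb b as the last weight
X = np.hstack([X_raw, np.ones((X_raw.shape[0], 1))])
w_hat = np.append(w, b)

# X @ w_hat reproduces w^T x_i + b row by row
assert np.allclose(X @ w_hat, X_raw @ w + b)
```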



2.2. Deriving the loss function $E_{\hat{\boldsymbol{w}}}$ from the least-squares method

$$
\begin{aligned} E_{\hat{\boldsymbol{w}}} &=\sum_{i=1}^{m}\left(y_{i}-f\left(\hat{\boldsymbol{x}}_{i}\right)\right)^{2} =\sum_{i=1}^{m}\left(y_{i}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i}\right)^{2} \\ &=\left(y_{1}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{1}\right)^{2} + \left(y_{2}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{2}\right)^{2} + \cdots + \left(y_{m}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{m}\right)^{2} \\ &=\left(\begin{array}{cccc} y_{1}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{1} & y_{2}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{2} & \cdots & y_{m}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{m} \end{array}\right) \left(\begin{array}{c} y_{1}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{1} \\ y_{2}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{2} \\ \vdots \\ y_{m}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{m} \end{array}\right) \\ &=\left(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}}\right)^{T}\left(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}}\right) \end{aligned}
$$

where

$$
\mathbf{X} =\left(\begin{array}{ccccc} x_{11} & x_{12} & \dots & x_{1 d} & 1 \\ x_{21} & x_{22} & \dots & x_{2 d} & 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ x_{m 1} & x_{m 2} & \dots & x_{m d} & 1 \end{array}\right) =\left(\begin{array}{cc} \boldsymbol{x}_{1}^{T} & 1 \\ \boldsymbol{x}_{2}^{T} & 1 \\ \vdots & \vdots \\ \boldsymbol{x}_{m}^{T} & 1 \end{array}\right) =\left(\begin{array}{c} \hat{\boldsymbol{x}}_{1}^{T} \\ \hat{\boldsymbol{x}}_{2}^{T} \\ \vdots \\ \hat{\boldsymbol{x}}_{m}^{T} \end{array}\right), \qquad \boldsymbol{y}=\left(y_{1},y_{2},\cdots,y_{m}\right)^{T},
$$

and the last step uses the fact that the column vector of residuals $\left(y_{i}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i}\right)_{i=1}^{m}$ equals $\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}}$, with the corresponding row vector being its transpose $\left(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}}\right)^{T}$.



2.3. Proving $E_{\hat{\boldsymbol{w}}}$ is convex in $\hat{\boldsymbol{w}}$

Definition of a convex set:

Let $D\subseteq \mathbb{R}^{n}$ be a set. If for any $\boldsymbol{x}, \boldsymbol{y}\in D$ and any $a\in [0,1]$ we have $a\boldsymbol{x}+(1-a)\boldsymbol{y}\in D$, then $D$ is called a convex set.

Geometric meaning of a convex set:

If two points belong to the set, then every point on the line segment joining them also belongs to the set.


Definition of the gradient:

Suppose the partial derivatives $\frac{\partial f(\boldsymbol{x})}{\partial x_{i}} \;(i=1,2,\cdots,n)$ of an $n$-variable function $f(\boldsymbol{x})$ with respect to each component $x_{i}$ of $\boldsymbol{x}=\left(x_{1}, x_{2}, \cdots, x_{n}\right)^{T}$ all exist. Then $f(\boldsymbol{x})$ is said to be first-order differentiable at $\boldsymbol{x}$, and the vector

$$
\nabla f(\boldsymbol{x}) =\left(\begin{array}{c} \frac{\partial f(\boldsymbol{x})}{\partial x_{1}} \\ \frac{\partial f(\boldsymbol{x})}{\partial x_{2}} \\ \vdots \\ \frac{\partial f(\boldsymbol{x})}{\partial x_{n}}\end{array}\right)
$$

is called the first derivative, or gradient, of $f(\boldsymbol{x})$ at $\boldsymbol{x}$, written $\nabla f(\boldsymbol{x})$ (a column vector).

Definition of the Hessian matrix: suppose the second-order partial derivatives $\frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{i} \partial x_{j}} \;(i=1,2,\cdots,n;\ j=1,2,\cdots,n)$ of an $n$-variable function $f(\boldsymbol{x})$ with respect to the components of $\boldsymbol{x}=\left(x_{1}, x_{2}, \cdots, x_{n}\right)^{T}$ all exist. Then $f(\boldsymbol{x})$ is said to be twice differentiable at $\boldsymbol{x}$, and the matrix

$$
\nabla^{2} f(\boldsymbol{x}) =\left[\begin{array}{cccc} \frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{1}^{2}} & \frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{1} \partial x_{2}} & \cdots & \frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{1} \partial x_{n}} \\ \frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{2} \partial x_{1}} & \frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{2}^{2}} & \cdots & \frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{2} \partial x_{n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{n} \partial x_{1}} & \frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{n} \partial x_{2}} & \cdots & \frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{n}^{2}} \end{array}\right]
$$

is called the second derivative, or Hessian matrix, of $f(\boldsymbol{x})$ at $\boldsymbol{x}$, written $\nabla^{2} f(\boldsymbol{x})$. If all second-order partial derivatives of $f(\boldsymbol{x})$ are continuous, then $\frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{i} \partial x_{j}}=\frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{j} \partial x_{i}}$, and $\nabla^{2} f(\boldsymbol{x})$ is a symmetric matrix.

Convexity criterion for multivariate real-valued functions:

Let $D\subset \mathbb{R}^{n}$ be a nonempty open convex set and $f:D \to \mathbb{R}$ twice continuously differentiable on $D$. If the Hessian matrix $\nabla^{2} f(\boldsymbol{x})$ of $f(\boldsymbol{x})$ is positive definite on $D$, then $f(\boldsymbol{x})$ is strictly convex on $D$.

Sufficiency theorem for convex functions:

Let $f:\mathbb{R}^{n} \to \mathbb{R}$ be a convex, continuously differentiable function. Then $\boldsymbol{x}^{*}$ is a global minimizer if and only if $\nabla f(\boldsymbol{x}^{*})=0$, where $\nabla f(\boldsymbol{x})$ is the first derivative (gradient) of $f(\boldsymbol{x})$ with respect to $\boldsymbol{x}$.



2.4. First-order derivative of $E_{\hat{\boldsymbol{w}}}$ with respect to $\hat{\boldsymbol{w}}$

$$
\begin{aligned} \frac{\partial E_{\hat{\boldsymbol{w}}}}{\partial \hat{\boldsymbol{w}}} &=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left[(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}})^{T}(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}})\right] \\ &=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left[\left(\boldsymbol{y}^{T}-\hat{\boldsymbol{w}}^{T} \mathbf{X}^{T}\right)(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}})\right] \\ &=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left[\boldsymbol{y}^{T} \boldsymbol{y}-\boldsymbol{y}^{T} \mathbf{X} \hat{\boldsymbol{w}}-\hat{\boldsymbol{w}}^{T} \mathbf{X}^{T} \boldsymbol{y}+\hat{\boldsymbol{w}}^{T} \mathbf{X}^{T} \mathbf{X} \hat{\boldsymbol{w}}\right] \\ &=-\frac{\partial \boldsymbol{y}^{T} \mathbf{X} \hat{\boldsymbol{w}}}{\partial \hat{\boldsymbol{w}}}-\frac{\partial \hat{\boldsymbol{w}}^{T} \mathbf{X}^{T} \boldsymbol{y}}{\partial \hat{\boldsymbol{w}}}+\frac{\partial \hat{\boldsymbol{w}}^{T} \mathbf{X}^{T} \mathbf{X} \hat{\boldsymbol{w}}}{\partial \hat{\boldsymbol{w}}} \\ &\qquad \text{using } \frac{\partial \boldsymbol{x}^{T} \boldsymbol{a}}{\partial \boldsymbol{x}}=\frac{\partial \boldsymbol{a}^{T} \boldsymbol{x}}{\partial \boldsymbol{x}}=\boldsymbol{a} \text{ and } \frac{\partial \boldsymbol{x}^{T} \mathbf{B} \boldsymbol{x}}{\partial \boldsymbol{x}}=\left(\mathbf{B}+\mathbf{B}^{T}\right) \boldsymbol{x} \\ &=-\mathbf{X}^{T} \boldsymbol{y}-\mathbf{X}^{T} \boldsymbol{y}+\left(\mathbf{X}^{T} \mathbf{X}+\mathbf{X}^{T} \mathbf{X}\right) \hat{\boldsymbol{w}} \\ &=2\mathbf{X}^{T}\left(\mathbf{X} \hat{\boldsymbol{w}}-\boldsymbol{y}\right) \end{aligned} \tag{3.10}
$$

Hence:

$$
\begin{aligned} \frac{\partial^{2} E_{\hat{\boldsymbol{w}}}}{\partial \hat{\boldsymbol{w}} \partial \hat{\boldsymbol{w}}^{T}} &=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left(\frac{\partial E_{\hat{\boldsymbol{w}}}}{\partial \hat{\boldsymbol{w}}}\right) \\ &=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left[2 \mathbf{X}^{T}(\mathbf{X} \hat{\boldsymbol{w}}-\boldsymbol{y})\right] \\ &=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left(2 \mathbf{X}^{T} \mathbf{X} \hat{\boldsymbol{w}}-2 \mathbf{X}^{T} \boldsymbol{y}\right) \\ &=2 \mathbf{X}^{T} \mathbf{X} \end{aligned} \tag{Hessian matrix}
$$

Since $2\mathbf{X}^{T} \mathbf{X}$ is positive semidefinite, the loss function $E_{\hat{\boldsymbol{w}}}$ is convex in $\hat{\boldsymbol{w}}$.
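A numerical spot-check that the Hessian $2\mathbf{X}^{T}\mathbf{X}$ is positive semidefinite (a sketch assuming NumPy, not part of the book):

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 20, 2
X = np.hstack([rng.normal(size=(m, d)), np.ones((m, 1))])  # augmented design matrix

H = 2 * X.T @ X                    # Hessian of E_w_hat from the derivation above
eigvals = np.linalg.eigvalsh(H)    # H is symmetric, so eigvalsh applies
assert np.all(eigvals >= -1e-10)   # all eigenvalues nonnegative -> positive semidefinite
```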



2.5. Setting the derivative to zero and solving for $\hat{\boldsymbol{w}}^{*}$

$$
\begin{aligned} &\quad \frac{\partial E_{\hat{\boldsymbol{w}}}}{\partial \hat{\boldsymbol{w}}} =2 \mathbf{X}^{T}(\mathbf{X} \hat{\boldsymbol{w}}-\boldsymbol{y})=0 \\ &\Rightarrow 2 \mathbf{X}^{T} \mathbf{X} \hat{\boldsymbol{w}}-2 \mathbf{X}^{T} \boldsymbol{y}=0 \\ &\Rightarrow \mathbf{X}^{T} \mathbf{X} \hat{\boldsymbol{w}}=\mathbf{X}^{T} \boldsymbol{y} \\ &\Rightarrow \hat{\boldsymbol{w}}^{*} = \left(\mathbf{X}^{T} \mathbf{X} \right)^{-1} \mathbf{X}^{T} \boldsymbol{y} \end{aligned} \tag{3.11}
$$

The last step assumes $\mathbf{X}^{T} \mathbf{X}$ is invertible, i.e. $\mathbf{X}$ has full column rank.
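Eq. (3.11) can be sanity-checked against a library solver (a sketch assuming NumPy; `np.linalg.lstsq` serves as an independent reference, and solving the normal equations is preferred over forming the explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 50, 3
X_raw = rng.normal(size=(m, d))
y = X_raw @ np.array([1.0, -2.0, 0.5]) + 0.3 + 0.01 * rng.normal(size=m)

# augment with the constant-1 column so b is absorbed into w_hat
X = np.hstack([X_raw, np.ones((m, 1))])

# w_hat* = (X^T X)^{-1} X^T y, computed via the normal equations
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

w_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_hat, w_ref)
```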




3. Generalized Linear Models

3.1. The exponential family

The exponential family is a class of distributions whose probability mass functions (or probability density functions) share the general form:

$$
p(y ; \eta)=b(y) \exp \left(\eta^{T} T(y)-a(\eta)\right)
$$

Here $\eta$ is called the natural parameter of the distribution; $T(y)$ is the sufficient statistic, which depends on the specific distribution and is usually just the random variable $y$ itself; $a(\eta)$ is the log-partition function; and $b(y)$ is a function of the random variable $y$. The familiar Bernoulli and normal distributions both belong to the exponential family.

Proof that the Bernoulli distribution belongs to the exponential family:

The probability mass function of the Bernoulli distribution is:

$$
p(y)=\phi^{y}(1-\phi)^{1-y}
$$

where $y\in\{0,1\}$ and $\phi$ is the probability that $y=1$, i.e. $p(y=1)=\phi$. Rewriting the expression:

$$
\begin{aligned} p(y) &=\phi^{y}(1-\phi)^{1-y} \\ &=\exp \left(\ln \left(\phi^{y}(1-\phi)^{1-y}\right)\right) \\ &=\exp \left(\ln \phi^{y}+\ln(1-\phi)^{1-y}\right) \\ &=\exp (y \ln \phi+(1-y) \ln (1-\phi)) \\ &=\exp (y \ln \phi+\ln (1-\phi)-y \ln (1-\phi)) \\ &=\exp (y(\ln \phi-\ln (1-\phi))+\ln (1-\phi)) \\ &=\exp \left(y \ln \left(\frac{\phi}{1-\phi}\right)+\ln (1-\phi)\right) \end{aligned}
$$

Comparing this with the general exponential-family form $p(y;\eta)=b(y)\exp\left(\eta^{T}T(y)-a(\eta)\right)$ shows that the two expressions match term by term.

The exponential-family parameters of the Bernoulli distribution are therefore:

$$
\begin{aligned}
b(y)&=1 \\
\eta&=\ln\left(\frac{\phi}{1-\phi}\right) \\
T(y)&=y \\
a(\eta)&=-\ln(1-\phi)=\ln\left(1+e^{\eta}\right)
\end{aligned}
$$
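These parameters can be verified numerically. The following sketch (standard library only; $\phi=0.3$ is an arbitrary illustrative choice) checks that the exponential-family form reproduces the Bernoulli pmf for both values of $y$:

```python
import math

# Bernoulli in exponential-family form: p(y; eta) = b(y) * exp(eta * T(y) - a(eta))
# with b(y) = 1, T(y) = y, eta = ln(phi / (1 - phi)), a(eta) = ln(1 + e^eta)
phi = 0.3  # arbitrary success probability, chosen for illustration
eta = math.log(phi / (1 - phi))
a = math.log(1 + math.exp(eta))

for y in (0, 1):
    bernoulli = phi**y * (1 - phi)**(1 - y)       # direct pmf
    exp_family = math.exp(eta * y - a)            # exponential-family form
    print(y, abs(bernoulli - exp_family) < 1e-12)  # True for both
```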



3.2. The Three Assumptions of Generalized Linear Models

  1. Given $\boldsymbol{x}$, assume the random variable $y$ follows some exponential-family distribution.

  2. Given $\boldsymbol{x}$, the goal is a model $h(\boldsymbol{x})$ that predicts the expected value of $T(y)$.

  3. Assume the natural parameter $\eta$ of that exponential-family distribution is linear in $\boldsymbol{x}$, i.e. $\eta=\boldsymbol{w}^{T}\boldsymbol{x}$.




4. Logistic Regression

Logistic regression models a binary classification problem. The modeled random variable $y$ takes the value 0 or 1, so it is natural to assume that $y$ follows a Bernoulli distribution. To build a linear model that predicts $y$ given $\boldsymbol{x}$, we can use the generalized linear model framework.

4.1. Deriving Logistic Regression as a Generalized Linear Model

Since $y$ follows a Bernoulli distribution, which belongs to the exponential family, the first assumption of the generalized linear model is satisfied. By the second assumption, the model $h(\boldsymbol{x})$ takes the form:

$$
h(\boldsymbol{x})=E[T(y|\boldsymbol{x})]
$$

For the Bernoulli distribution $T(y|\boldsymbol{x})=y|\boldsymbol{x}$, so:

$$
h(\boldsymbol{x})=E[y|\boldsymbol{x}]
$$

Since $E[y|\boldsymbol{x}]=1\times p(y=1|\boldsymbol{x})+0\times p(y=0|\boldsymbol{x})=p(y=1|\boldsymbol{x})=\phi$, it follows that:

$$
h(\boldsymbol{x})=\phi
$$

From the earlier proof that the Bernoulli distribution belongs to the exponential family, we know:

$$
\begin{aligned}
\eta&=\ln\left(\frac{\phi}{1-\phi}\right) \\
e^{\eta}&=\frac{\phi}{1-\phi} \\
e^{-\eta}&=\frac{1-\phi}{\phi} \\
e^{-\eta}&=\frac{1}{\phi}-1 \\
1+e^{-\eta}&=\frac{1}{\phi} \\
\frac{1}{1+e^{-\eta}}&=\phi
\end{aligned}
$$

ϕ = 1 1 + e − η \phi=\frac{1}{1+e^{-\eta}} ϕ=1+eη1代入 h ( x ) h(\boldsymbol{x}) h(x)的表达式可得:

$$
h(\boldsymbol{x})=\phi=\frac{1}{1+e^{-\eta}}
$$

By the third assumption of the generalized linear model, $\eta=\boldsymbol{w}^{T}\boldsymbol{x}$, so $h(\boldsymbol{x})$ finally becomes:

$$
h(\boldsymbol{x})=\phi=\frac{1}{1+e^{-\boldsymbol{w}^{T}\boldsymbol{x}}}=p(y=1|\boldsymbol{x}) \tag{3.23}
$$
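Eq. (3.23) is just the logistic (sigmoid) function applied to a linear score. A minimal sketch (illustrative $\boldsymbol{w}$ and $\boldsymbol{x}$ values; NumPy assumed):

```python
import numpy as np

def sigmoid(z):
    """The logistic function 1 / (1 + e^{-z}), mapping any real score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# h(x) = p(y=1|x) as in Eq. (3.23); w and x are illustrative values
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, 0.1, 1.0])
print(sigmoid(w @ x))  # a probability strictly between 0 and 1
```

A score of 0 maps to probability 0.5, the natural decision boundary for the two classes.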

This is the logistic regression model.



4.2. Maximum Likelihood Estimation

Let the probability density function (or probability mass function) of the population be $f(y, w_{1}, w_{2}, \cdots, w_{k})$, and let $y_{1}, y_{2}, \ldots, y_{m}$ be a sample drawn from this population. Because $y_{1}, y_{2}, \ldots, y_{m}$ are independent and identically distributed, their joint density (or joint probability) is:

$$
L\left(y_{1}, y_{2}, \ldots, y_{m}; w_{1}, w_{2}, \ldots, w_{k}\right)=\prod_{i=1}^{m} f\left(y_{i}, w_{1}, w_{2}, \ldots, w_{k}\right)
$$

Here $w_{1}, w_{2}, \ldots, w_{k}$ are treated as fixed but unknown parameters. Given a set of observed sample values $y_{1}, y_{2}, \ldots, y_{m}$, an intuitive way to estimate the unknown parameters is to ask which parameter values make the observed sample most probable: those values are the most plausible candidates for the true parameters, so we take them as the estimate. This is maximum likelihood estimation.

The procedure of maximum likelihood estimation:

Write $L\left(y_{1}, y_{2}, \ldots, y_{m}; w_{1}, w_{2}, \ldots, w_{k}\right)=L(\boldsymbol{w})$ and call it the likelihood function. Finding the maximum likelihood estimate of $\boldsymbol{w}$ then reduces to finding the maximizer of $L(\boldsymbol{w})$. Since the logarithm is monotonically increasing:

$$
\begin{aligned}
\ln L(\boldsymbol{w}) &=\ln\left(\prod_{i=1}^{m} f\left(y_{i}, w_{1}, w_{2}, \ldots, w_{k}\right)\right) \\
&=\sum_{i=1}^{m} \ln f\left(y_{i}, w_{1}, w_{2}, \ldots, w_{k}\right)
\end{aligned}
$$

L ( w ) L(w) L(w)有相同的最大值点,而在许多情况下,求 ln ⁡ L ( w ) \ln L(w) lnL(w)的最大值点比较简单,于是,我们就将求 L ( w ) L(w) L(w)的最大值点转化为了求 ln ⁡ L ( w ) \ln L(w) lnL(w)的最大值点,通常称 ln ⁡ L ( w ) \ln L(w) lnL(w)为对数似然函数。

Maximum likelihood estimation for logistic regression:

The probabilities that the random variable $y$ takes the values 1 and 0 are:

$$
\begin{aligned}
p(y=1|\boldsymbol{x})&=\frac{e^{\boldsymbol{w}^{T}\boldsymbol{x}+b}}{1+e^{\boldsymbol{w}^{T}\boldsymbol{x}+b}} \\
p(y=0|\boldsymbol{x})&=\frac{1}{1+e^{\boldsymbol{w}^{T}\boldsymbol{x}+b}}
\end{aligned}
$$

β = ( w ; b ) \boldsymbol{\beta}=(w;b) β=(w;b) x ^ = ( x ; 1 ) \hat{\boldsymbol{x}}=(\boldsymbol{x}; 1) x^=(x;1),则 w T x + b w^{T}\boldsymbol{x}+b wTx+b可简化为 β T x ^ \boldsymbol{\beta}^{T}\hat{x} βTx^,于是上式可化简为:

$$
\begin{aligned}
p(y=1|\boldsymbol{x})&=\frac{e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}}}{1+e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}}} \\
p(y=0|\boldsymbol{x})&=\frac{1}{1+e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}}}
\end{aligned}
$$
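The augmentation trick above can be checked directly: folding the bias $b$ into the weight vector and appending a 1 to the input leaves the linear score unchanged (illustrative values; NumPy assumed):

```python
import numpy as np

# The augmentation trick: w^T x + b == beta^T x_hat
# with beta = (w; b) and x_hat = (x; 1)
w = np.array([1.5, -0.7])
b = 0.3
x = np.array([2.0, 1.0])

beta = np.append(w, b)      # beta = (w; b)
x_hat = np.append(x, 1.0)   # x_hat = (x; 1)
print(np.isclose(w @ x + b, beta @ x_hat))  # True
```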

Denote:

$$
\begin{aligned}
p(y=1|\boldsymbol{x})&=\frac{e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}}}{1+e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}}}=p_{1}(\hat{\boldsymbol{x}};\boldsymbol{\beta}) \\
p(y=0|\boldsymbol{x})&=\frac{1}{1+e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}}}=p_{0}(\hat{\boldsymbol{x}};\boldsymbol{\beta})
\end{aligned}
$$

A small trick then yields a single expression for the probability mass function of $y$:

$$
p(y|\boldsymbol{x};\boldsymbol{w},b)=y\cdot p_{1}(\hat{\boldsymbol{x}};\boldsymbol{\beta})+(1-y)\cdot p_{0}(\hat{\boldsymbol{x}};\boldsymbol{\beta}) \tag{3.26}
$$

or equivalently:

$$
p(y|\boldsymbol{x};\boldsymbol{w},b)=\left[p_{1}(\hat{\boldsymbol{x}};\boldsymbol{\beta})\right]^{y}\left[p_{0}(\hat{\boldsymbol{x}};\boldsymbol{\beta})\right]^{1-y}
$$
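Because $y$ only takes the values 0 and 1, the convex-combination form of Eq. (3.26) and the power form above select the same probability. A quick check with an arbitrary illustrative $\phi$:

```python
# For y in {0, 1}, the convex-combination form (Eq. 3.26) and the
# power form give the same probability.
phi = 0.8            # arbitrary illustrative value of p(y=1|x)
p1, p0 = phi, 1 - phi

for y in (0, 1):
    combo = y * p1 + (1 - y) * p0   # Eq. (3.26)
    power = p1**y * p0**(1 - y)     # power form
    print(y, combo == power)        # True for both
```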



4.3. Parameter Estimation for Logistic Regression

From the definition of the log-likelihood function:

$$
\ln L(\boldsymbol{w})=\sum_{i=1}^{m}\ln f\left(y_{i}, w_{1}, w_{2}, \ldots, w_{k}\right)
$$

Since $y$ is discrete here, we simply replace the probability density in the log-likelihood with the probability mass function, giving:

$$
\ell(\boldsymbol{w},b):=\ln L(\boldsymbol{w},b)=\sum_{i=1}^{m}\ln f\left(y_{i}|\boldsymbol{x}_{i};\boldsymbol{w},b\right) \tag{3.25}
$$

p ( y ∣ x ; w , b ) = y ⋅ p 1 ( x ^ ; β ) + ( 1 − y ) ⋅ p 0 ( x ^ ; β ) p(y | \boldsymbol{x} ; \boldsymbol{w}, b)=y \cdot p_{1}(\hat{\boldsymbol{x}} ; \boldsymbol{\beta})+(1-y) \cdot p_{0}(\hat{\boldsymbol{x}} ; \boldsymbol{\beta}) p(yx;w,b)=yp1(x^;β)+(1y)p0(x^;β)代入对数似然函数可得:

$$
\begin{aligned}
\ell(\boldsymbol{\beta}) &=\sum_{i=1}^{m}\ln\left(y_{i}\,p_{1}\left(\hat{\boldsymbol{x}}_{i};\boldsymbol{\beta}\right)+\left(1-y_{i}\right)p_{0}\left(\hat{\boldsymbol{x}}_{i};\boldsymbol{\beta}\right)\right) \\
&=\sum_{i=1}^{m}\ln\left(\frac{y_{i}\,e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}}}{1+e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}}}+\frac{1-y_{i}}{1+e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}}}\right) \\
&=\sum_{i=1}^{m}\ln\left(\frac{y_{i}\,e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}}+1-y_{i}}{1+e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}}}\right) \\
&=\sum_{i=1}^{m}\left(\ln\left(y_{i}\,e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}}+1-y_{i}\right)-\ln\left(1+e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}}\right)\right) \\
&=\sum_{i=1}^{m}\left(y_{i}\,\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}-\ln\left(1+e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}}\right)\right)
\end{aligned} \tag{3.27}
$$

Here the second line substitutes the expressions for $p_{1}$ and $p_{0}$, and the last step uses $y_{i}\in\{0,1\}$: when $y_{i}=0$ the first logarithm is $\ln 1=0$, and when $y_{i}=1$ it is $\ln e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}}=\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}$; both cases equal $y_{i}\,\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}$.
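Eq. (3.27) can be checked numerically: the simplified form must equal the sum of log Bernoulli probabilities computed directly from the sigmoid. A sketch with illustrative random data (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
m = 20
# Augmented inputs x_hat = (x; 1) and an illustrative parameter vector beta = (w; b)
x_hat = np.hstack([rng.normal(size=(m, 2)), np.ones((m, 1))])
beta = np.array([0.5, -1.0, 0.2])
y = rng.integers(0, 2, size=m)

z = x_hat @ beta                                   # beta^T x_hat_i for each sample
ll_closed = np.sum(y * z - np.log1p(np.exp(z)))    # Eq. (3.27)

# Direct computation: sum of log p(y_i | x_i)
p1 = 1.0 / (1.0 + np.exp(-z))                      # p(y=1|x)
ll_direct = np.sum(y * np.log(p1) + (1 - y) * np.log(1 - p1))
print(np.isclose(ll_closed, ll_direct))  # True
```

Maximizing $\ell(\boldsymbol{\beta})$ (equivalently, minimizing its negation) is what logistic regression solvers do in practice, typically by gradient-based or Newton-type methods since no closed form exists.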

p ( y ∣ x ; w , b ) = [ p 1 ( x ^ ; β ) ] y [ p 0 ( x ^ ; β ) ] 1 − y p(y | \boldsymbol{x} ; \boldsymbol{w}, b)=\left[p_{1}(\hat{\boldsymbol{x}} ; \boldsymbol{\beta})\right]^{y}\left[p_{0}(\hat{\boldsymbol{x}} ; \boldsymbol{\beta})\right]^{1-y} p(yx;w,b)=[p1(x^;β)]y[p0(x^;β)]1y,将其代入对数似然函数可得:

$$
\begin{aligned}
\ell(\boldsymbol{\beta}) &=\sum_{i=1}^{m}\ln\left(\left[p_{1}\left(\hat{\boldsymbol{x}}_{i};\boldsymbol{\beta}\right)\right]^{y_{i}}\left[p_{0}\left(\hat{\boldsymbol{x}}_{i};\boldsymbol{\beta}\right)\right]^{1-y_{i}}\right) \\
&=\sum_{i=1}^{m}\left[y_{i}\ln\left(p_{1}\left(\hat{\boldsymbol{x}}_{i};\boldsymbol{\beta}\right)\right)+\left(1-y_{i}\right)\ln\left(p_{0}\left(\hat{\boldsymbol{x}}_{i};\boldsymbol{\beta}\right)\right)\right] \\
&=\sum_{i=1}^{m}\left[y_{i}\ln\frac{p_{1}\left(\hat{\boldsymbol{x}}_{i};\boldsymbol{\beta}\right)}{p_{0}\left(\hat{\boldsymbol{x}}_{i};\boldsymbol{\beta}\right)}+\ln\left(p_{0}\left(\hat{\boldsymbol{x}}_{i};\boldsymbol{\beta}\right)\right)\right] \\
&=\sum_{i=1}^{m}\left[y_{i}\ln\left(e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}}\right)+\ln\left(\frac{1}{1+e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}}}\right)\right] \\
&=\sum_{i=1}^{m}\left(y_{i}\,\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}-\ln\left(1+e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}}\right)\right)
\end{aligned}
$$

which agrees with Eq. (3.27).
