Linear Models
Lei_ZM
2019-09-10
1. Univariate Linear Regression
Derivation outline for solving the bias $b$ and the weight $w$:

- Derive the loss function $E_{(w, b)}$ from the least squares method
- Prove that the loss function $E_{(w, b)}$ is convex in $w$ and $b$
- Take the first-order partial derivatives of $E_{(w, b)}$ with respect to $b$ and $w$
- Set each partial derivative to zero and solve for $b$ and $w$
1.1. Deriving the Loss Function $E_{(w, b)}$ from the Least Squares Method
$$
\begin{aligned}
E_{(w, b)} &=\sum_{i=1}^{m}\left(y_{i}-f\left(x_{i}\right)\right)^{2} \\
&=\sum_{i=1}^{m}\left(y_{i}-\left(w x_{i}+b\right)\right)^{2} \\
&=\sum_{i=1}^{m}\left(y_{i}-w x_{i}-b\right)^{2}
\end{aligned}
\tag{西瓜书式3.4}
$$
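As a quick numeric illustration of 西瓜书 Eq. 3.4, the sum-of-squared-errors loss can be evaluated directly; the data below are made-up toy values, not from the text:

```python
# Sum-of-squared-errors loss E(w, b) from 西瓜书 Eq. 3.4,
# evaluated on a small toy dataset (data assumed for illustration).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]  # roughly y = 2x

def loss(w, b):
    """E(w, b) = sum_i (y_i - w*x_i - b)^2."""
    return sum((yi - w * xi - b) ** 2 for xi, yi in zip(x, y))

# A slope near the data-generating one should give a much smaller loss.
good, bad = loss(2.0, 0.0), loss(0.0, 0.0)
```

Minimizing this function over $(w, b)$ is exactly the least squares problem the following sections solve in closed form.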
1.2. Proving That the Loss Function $E_{(w, b)}$ Is Convex
1.2.1. Convexity test for a bivariate function:
Suppose $f(x, y)$ has continuous second-order partial derivatives on a region $D$, and write $A=f_{xx}^{\prime\prime}(x, y)$, $B=f_{xy}^{\prime\prime}(x, y)$, $C=f_{yy}^{\prime\prime}(x, y)$. Then:

- if $A>0$ and $AC-B^{2}\geq 0$ hold everywhere on $D$, then $f(x, y)$ is convex on $D$;
- if $A<0$ and $AC-B^{2}\geq 0$ hold everywhere on $D$, then $f(x, y)$ is concave on $D$.
1.2.2. Extrema of a convex (or concave) bivariate function:
Suppose $f(x, y)$ is a convex (or concave) function with continuous partial derivatives on an open region $D$, that $(x_{0}, y_{0})\in D$, and that $f_{x}^{\prime}(x_{0}, y_{0})=0$ and $f_{y}^{\prime}(x_{0}, y_{0})=0$. Then $f(x_{0}, y_{0})$ is necessarily the minimum (respectively, maximum) of $f(x, y)$ on $D$.
1.2.3. Proof
To prove that the loss function $E_{(w, b)}$ is convex in $w$ and $b$, first compute $A=f_{xx}^{\prime\prime}(x, y)$:
$$
\begin{aligned}
\frac{\partial E_{(w, b)}}{\partial w} &=\frac{\partial}{\partial w}\left[\sum_{i=1}^{m}\left(y_{i}-\left(w x_{i}+b\right)\right)^{2}\right] \\
&=\sum_{i=1}^{m} \frac{\partial}{\partial w}\left(y_{i}-w x_{i}-b\right)^{2} \\
&=\sum_{i=1}^{m} 2 \cdot\left(y_{i}-w x_{i}-b\right) \cdot\left(-x_{i}\right) \\
&=2\left(w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right)
\end{aligned}
\tag{西瓜书式3.5}
$$
Hence:
$$
\begin{aligned}
\frac{\partial^{2} E_{(w, b)}}{\partial w^{2}} &=\frac{\partial}{\partial w}\left(\frac{\partial E_{(w, b)}}{\partial w}\right) \\
&=\frac{\partial}{\partial w}\left[2\left(w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right)\right] \\
&=\frac{\partial}{\partial w}\left[2 w \sum_{i=1}^{m} x_{i}^{2}\right] \\
&=2 \sum_{i=1}^{m} x_{i}^{2}
\end{aligned}
$$
This is exactly $A=f_{xx}^{\prime\prime}(x, y)$.
Next, compute $B=f_{xy}^{\prime\prime}(x, y)$:
$$
\begin{aligned}
\frac{\partial^{2} E_{(w, b)}}{\partial w \partial b} &=\frac{\partial}{\partial b}\left(\frac{\partial E_{(w, b)}}{\partial w}\right) \\
&=\frac{\partial}{\partial b}\left[2\left(w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right)\right] \\
&=\frac{\partial}{\partial b}\left[-2 \sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right] \\
&=\frac{\partial}{\partial b}\left(-2 \sum_{i=1}^{m} y_{i} x_{i}+2 \sum_{i=1}^{m} b x_{i}\right) \\
&=\frac{\partial}{\partial b}\left(2 \sum_{i=1}^{m} b x_{i}\right) \\
&=2 \sum_{i=1}^{m} x_{i}
\end{aligned}
$$
This is exactly $B=f_{xy}^{\prime\prime}(x, y)$.
Finally, compute $C=f_{yy}^{\prime\prime}(x, y)$:
$$
\begin{aligned}
\frac{\partial E_{(w, b)}}{\partial b} &=\frac{\partial}{\partial b}\left[\sum_{i=1}^{m}\left(y_{i}-\left(w x_{i}+b\right)\right)^{2}\right] \\
&=\sum_{i=1}^{m} \frac{\partial}{\partial b}\left(y_{i}-w x_{i}-b\right)^{2} \\
&=\sum_{i=1}^{m} 2 \cdot\left(y_{i}-w x_{i}-b\right) \cdot(-1) \\
&=2\left(m b-\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)\right)
\end{aligned}
\tag{西瓜书式3.6}
$$
Hence:
$$
\begin{aligned}
\frac{\partial^{2} E_{(w, b)}}{\partial b^{2}} &=\frac{\partial}{\partial b}\left(\frac{\partial E_{(w, b)}}{\partial b}\right) \\
&=\frac{\partial}{\partial b}\left[2\left(m b-\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)\right)\right] \\
&=\frac{\partial}{\partial b}(2 m b) \\
&=2 m
\end{aligned}
$$
This is exactly $C=f_{yy}^{\prime\prime}(x, y)$.
In summary:
$$
\left\{
\begin{aligned}
&A=f_{xx}^{\prime\prime}(x, y)=2 \sum_{i=1}^{m} x_{i}^{2} \\
&B=f_{xy}^{\prime\prime}(x, y)=2 \sum_{i=1}^{m} x_{i} \\
&C=f_{yy}^{\prime\prime}(x, y)=2 m
\end{aligned}
\right.
$$
Therefore:
$$
\begin{aligned}
A C-B^{2} &=2 m \cdot 2 \sum_{i=1}^{m} x_{i}^{2}-\left(2 \sum_{i=1}^{m} x_{i}\right)^{2} \\
&=4 m \sum_{i=1}^{m} x_{i}^{2}-4\left(\sum_{i=1}^{m} x_{i}\right)^{2} \\
&=4 m \sum_{i=1}^{m} x_{i}^{2}-4 \cdot m \cdot \frac{1}{m} \cdot\left(\sum_{i=1}^{m} x_{i}\right)^{2} \\
&=4 m \sum_{i=1}^{m} x_{i}^{2}-4 m \cdot \bar{x} \cdot \sum_{i=1}^{m} x_{i} \\
&=4 m\left(\sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m} x_{i} \bar{x}\right) \\
&=4 m \sum_{i=1}^{m}\left(x_{i}^{2}-x_{i} \bar{x}\right) \\
&=4 m \sum_{i=1}^{m}\left(x_{i}^{2}-x_{i} \bar{x}-x_{i} \bar{x}+x_{i} \bar{x}\right) \\
&\qquad \text{using } \sum_{i=1}^{m} x_{i} \bar{x}=\bar{x} \sum_{i=1}^{m} x_{i}=\bar{x} \cdot m \cdot \frac{1}{m} \cdot \sum_{i=1}^{m} x_{i}=m \bar{x}^{2}=\sum_{i=1}^{m} \bar{x}^{2}: \\
&=4 m \sum_{i=1}^{m}\left(x_{i}^{2}-x_{i} \bar{x}-x_{i} \bar{x}+\bar{x}^{2}\right) \\
&=4 m \sum_{i=1}^{m}\left(x_{i}-\bar{x}\right)^{2}
\end{aligned}
$$
It follows that:
$$
A C-B^{2}=4 m \sum_{i=1}^{m}\left(x_{i}-\bar{x}\right)^{2} \geq 0
$$
That is, the loss function $E_{(w, b)}$ is convex in $w$ and $b$. QED.
1.3. Taking the First-Order Partial Derivatives of $E_{(w, b)}$ with Respect to $b$ and $w$
Partial derivative of the loss function $E_{(w, b)}$ with respect to $b$:
$$
\begin{aligned}
\frac{\partial E_{(w, b)}}{\partial b} &=\frac{\partial}{\partial b}\left[\sum_{i=1}^{m}\left(y_{i}-\left(w x_{i}+b\right)\right)^{2}\right] \\
&=\sum_{i=1}^{m} \frac{\partial}{\partial b}\left(y_{i}-w x_{i}-b\right)^{2} \\
&=\sum_{i=1}^{m} 2 \cdot\left(y_{i}-w x_{i}-b\right) \cdot(-1) \\
&=2\left(m b-\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)\right)
\end{aligned}
\tag{西瓜书式3.6}
$$
Partial derivative of the loss function $E_{(w, b)}$ with respect to $w$:
$$
\begin{aligned}
\frac{\partial E_{(w, b)}}{\partial w} &=\frac{\partial}{\partial w}\left[\sum_{i=1}^{m}\left(y_{i}-\left(w x_{i}+b\right)\right)^{2}\right] \\
&=\sum_{i=1}^{m} \frac{\partial}{\partial w}\left(y_{i}-w x_{i}-b\right)^{2} \\
&=\sum_{i=1}^{m} 2 \cdot\left(y_{i}-w x_{i}-b\right) \cdot\left(-x_{i}\right) \\
&=2\left(w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right)
\end{aligned}
\tag{西瓜书式3.5}
$$
1.4. Setting Each Partial Derivative to Zero and Solving for $b$ and $w$
Setting the partial derivative of $E_{(w, b)}$ with respect to $b$ to zero and solving for $b$:
$$
\begin{aligned}
\frac{\partial E_{(w, b)}}{\partial b} &=2\left(m b-\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)\right)=0 \\
&\Rightarrow m b-\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)=0 \\
&\Rightarrow b=\frac{1}{m} \sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)
=\frac{1}{m} \sum_{i=1}^{m} y_{i}-w \cdot \frac{1}{m} \sum_{i=1}^{m} x_{i}
=\bar{y}-w \bar{x}
\end{aligned}
\tag{西瓜书式3.8}
$$
Setting the partial derivative of $E_{(w, b)}$ with respect to $w$ to zero and solving for $w$:
$$
\begin{aligned}
\frac{\partial E_{(w, b)}}{\partial w} &=2\left(w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right)=0 \\
&\Rightarrow w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}=0 \\
&\Rightarrow w \sum_{i=1}^{m} x_{i}^{2}=\sum_{i=1}^{m} y_{i} x_{i}-\sum_{i=1}^{m} b x_{i} \\
&\qquad \text{substituting } b=\bar{y}-w \bar{x}: \\
&\Rightarrow w \sum_{i=1}^{m} x_{i}^{2}=\sum_{i=1}^{m} y_{i} x_{i}-\sum_{i=1}^{m}(\bar{y}-w \bar{x}) x_{i} \\
&\Rightarrow w \sum_{i=1}^{m} x_{i}^{2}=\sum_{i=1}^{m} y_{i} x_{i}-\bar{y} \sum_{i=1}^{m} x_{i}+w \bar{x} \sum_{i=1}^{m} x_{i} \\
&\Rightarrow w\left(\sum_{i=1}^{m} x_{i}^{2}-\bar{x} \sum_{i=1}^{m} x_{i}\right)=\sum_{i=1}^{m} y_{i} x_{i}-\bar{y} \sum_{i=1}^{m} x_{i} \\
&\Rightarrow w=\frac{\sum_{i=1}^{m} y_{i} x_{i}-\bar{y} \sum_{i=1}^{m} x_{i}}{\sum_{i=1}^{m} x_{i}^{2}-\bar{x} \sum_{i=1}^{m} x_{i}} \\
&\qquad \text{using } \bar{y} \sum_{i=1}^{m} x_{i}=\frac{1}{m} \sum_{i=1}^{m} y_{i} \sum_{i=1}^{m} x_{i}=\bar{x} \sum_{i=1}^{m} y_{i}
\quad\text{and}\quad \bar{x} \sum_{i=1}^{m} x_{i}=\frac{1}{m}\left(\sum_{i=1}^{m} x_{i}\right)^{2}: \\
&\Rightarrow w=\frac{\sum_{i=1}^{m} y_{i} x_{i}-\bar{x} \sum_{i=1}^{m} y_{i}}{\sum_{i=1}^{m} x_{i}^{2}-\frac{1}{m}\left(\sum_{i=1}^{m} x_{i}\right)^{2}}
=\frac{\sum_{i=1}^{m} y_{i}\left(x_{i}-\bar{x}\right)}{\sum_{i=1}^{m} x_{i}^{2}-\frac{1}{m}\left(\sum_{i=1}^{m} x_{i}\right)^{2}}
\end{aligned}
\tag{西瓜书式3.7}
$$
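The two closed-form formulas can be transcribed into code directly. The sketch below (toy data assumed, roughly following $y = 2x + 1$) evaluates 西瓜书 Eq. 3.7 for $w$ and Eq. 3.8 for $b$ literally from the sums:

```python
# Closed-form univariate least squares: w via 西瓜书 Eq. 3.7, b via Eq. 3.8.
# The data are toy values assumed for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.1, 6.9, 9.2, 10.8]  # roughly y = 2x + 1
m = len(x)
xbar = sum(x) / m
ybar = sum(y) / m

# Eq. 3.7: w = sum(y_i * (x_i - xbar)) / (sum(x_i^2) - (sum(x_i))^2 / m)
w = sum(yi * (xi - xbar) for xi, yi in zip(x, y)) / (
    sum(xi ** 2 for xi in x) - sum(x) ** 2 / m
)
# Eq. 3.8: b = ybar - w * xbar
b = ybar - w * xbar
```

For this data the formulas give $w = 1.97$ and $b = 1.09$, close to the generating slope and intercept.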
Vectorizing $w$ gives:
$$
\begin{aligned}
w &=\frac{\sum_{i=1}^{m} y_{i}\left(x_{i}-\bar{x}\right)}{\sum_{i=1}^{m} x_{i}^{2}-\frac{1}{m}\left(\sum_{i=1}^{m} x_{i}\right)^{2}} \\
&\qquad \text{using } \frac{1}{m}\left(\sum_{i=1}^{m} x_{i}\right)^{2}=\left(\frac{1}{m} \sum_{i=1}^{m} x_{i}\right) \sum_{i=1}^{m} x_{i}=\bar{x} \sum_{i=1}^{m} x_{i}=\sum_{i=1}^{m} x_{i} \bar{x}: \\
&=\frac{\sum_{i=1}^{m}\left(y_{i} x_{i}-y_{i} \bar{x}\right)}{\sum_{i=1}^{m}\left(x_{i}^{2}-x_{i} \bar{x}\right)} \\
&=\frac{\sum_{i=1}^{m}\left(y_{i} x_{i}-y_{i} \bar{x}-y_{i} \bar{x}+y_{i} \bar{x}\right)}{\sum_{i=1}^{m}\left(x_{i}^{2}-x_{i} \bar{x}-x_{i} \bar{x}+x_{i} \bar{x}\right)} \\
&\qquad \text{using } \sum_{i=1}^{m} y_{i} \bar{x}=\bar{x} \sum_{i=1}^{m} y_{i}=\frac{1}{m} \sum_{i=1}^{m} x_{i} \sum_{i=1}^{m} y_{i}=\sum_{i=1}^{m} x_{i} \cdot \frac{1}{m} \sum_{i=1}^{m} y_{i}=\sum_{i=1}^{m} x_{i} \bar{y}, \\
&\qquad \phantom{\text{using }} \sum_{i=1}^{m} y_{i} \bar{x}=\bar{x} \sum_{i=1}^{m} y_{i}=\bar{x} \cdot m \cdot \frac{1}{m} \sum_{i=1}^{m} y_{i}=m \bar{x} \bar{y}=\sum_{i=1}^{m} \bar{x} \bar{y}, \\
&\qquad \phantom{\text{using }} \sum_{i=1}^{m} x_{i} \bar{x}=\bar{x} \sum_{i=1}^{m} x_{i}=\bar{x} \cdot m \cdot \frac{1}{m} \sum_{i=1}^{m} x_{i}=m \bar{x}^{2}=\sum_{i=1}^{m} \bar{x}^{2}: \\
&=\frac{\sum_{i=1}^{m}\left(y_{i} x_{i}-y_{i} \bar{x}-x_{i} \bar{y}+\bar{x} \bar{y}\right)}{\sum_{i=1}^{m}\left(x_{i}^{2}-x_{i} \bar{x}-x_{i} \bar{x}+\bar{x}^{2}\right)} \\
&=\frac{\sum_{i=1}^{m}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum_{i=1}^{m}\left(x_{i}-\bar{x}\right)^{2}} \\
&\qquad \text{with }
\boldsymbol{x}=\left(x_{1}, x_{2}, \cdots, x_{m}\right)^{T},\quad
\boldsymbol{y}=\left(y_{1}, y_{2}, \cdots, y_{m}\right)^{T}, \\
&\qquad \phantom{\text{with }}
\boldsymbol{x}_{d}=\left(x_{1}-\bar{x}, x_{2}-\bar{x}, \cdots, x_{m}-\bar{x}\right)^{T},\quad
\boldsymbol{y}_{d}=\left(y_{1}-\bar{y}, y_{2}-\bar{y}, \cdots, y_{m}-\bar{y}\right)^{T}: \\
&=\frac{\boldsymbol{x}_{d}^{T} \boldsymbol{y}_{d}}{\boldsymbol{x}_{d}^{T} \boldsymbol{x}_{d}}
\end{aligned}
$$
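The vectorized form $w = \boldsymbol{x}_d^T \boldsymbol{y}_d / \boldsymbol{x}_d^T \boldsymbol{x}_d$ maps directly onto array operations; a sketch with NumPy (toy data assumed), checking that it agrees with the summation formula:

```python
# Vectorized w: with deviation vectors x_d = x - xbar and y_d = y - ybar,
# w = (x_d^T y_d) / (x_d^T x_d).  Equivalent to the summation form of Eq. 3.7.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # toy data assumed
y = np.array([3.0, 5.1, 6.9, 9.2, 10.8])

xd = x - x.mean()
yd = y - y.mean()
w_vec = (xd @ yd) / (xd @ xd)

# The same quantity written with explicit sums, for comparison.
w_sum = np.sum(y * (x - x.mean())) / (np.sum(x ** 2) - np.sum(x) ** 2 / len(x))
```

The dot-product form avoids the Python-level loop and is how one would actually compute $w$ in practice.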
2. Multivariate Linear Regression
Derivation outline for solving the weight vector $\hat{\boldsymbol{w}}$:

- Derive the loss function $E_{\hat{\boldsymbol{w}}}$ from the least squares method
- Prove that the loss function $E_{\hat{\boldsymbol{w}}}$ is convex in $\hat{\boldsymbol{w}}$
- Take the first-order partial derivative of $E_{\hat{\boldsymbol{w}}}$ with respect to $\hat{\boldsymbol{w}}$
- Set the derivative to zero and solve for $\hat{\boldsymbol{w}}^{*}$
2.1. Combining $w$ and $b$ into $\hat{\boldsymbol{w}}$
$$
\begin{aligned}
f\left(\boldsymbol{x}_{i}\right) &=\boldsymbol{w}^{T} \boldsymbol{x}_{i}+b \\
&=\left(\begin{array}{cccc}w_{1} & w_{2} & \cdots & w_{d}\end{array}\right)
\left(\begin{array}{c}x_{i 1} \\ x_{i 2} \\ \vdots \\ x_{i d}\end{array}\right)+b \\
&=w_{1} x_{i 1}+w_{2} x_{i 2}+\cdots+w_{d} x_{i d}+b \\
&\qquad \text{letting } w_{d+1}=b: \\
&=w_{1} x_{i 1}+w_{2} x_{i 2}+\cdots+w_{d} x_{i d}+w_{d+1} \cdot 1 \\
&=\left(\begin{array}{ccccc}w_{1} & w_{2} & \cdots & w_{d} & w_{d+1}\end{array}\right)
\left(\begin{array}{c}x_{i 1} \\ x_{i 2} \\ \vdots \\ x_{i d} \\ 1\end{array}\right) \\
&=\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i}
\end{aligned}
$$
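This absorption of the bias is just appending a 1 to each input and appending $b$ to the weight vector; a minimal sketch with assumed numbers:

```python
# Folding b into the weight vector: append 1 to each input x_i so that
# w^T x_i + b == what^T xhat_i.  All numbers here are assumed toy values.
import numpy as np

w = np.array([0.5, -1.0])    # original weights (d = 2)
b = 2.0                      # bias
x_i = np.array([3.0, 4.0])   # one sample

w_hat = np.append(w, b)      # (w_1, ..., w_d, w_{d+1}) with w_{d+1} = b
x_hat = np.append(x_i, 1.0)  # (x_{i1}, ..., x_{id}, 1)

lhs = w @ x_i + b            # w^T x_i + b
rhs = w_hat @ x_hat          # what^T xhat_i
```

The two dot products are identical, which is why the rest of the derivation can work with $\hat{\boldsymbol{w}}$ alone.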
2.2. Deriving the Loss Function $E_{\hat{\boldsymbol{w}}}$ from the Least Squares Method
$$
\begin{aligned}
E_{\hat{\boldsymbol{w}}} &=\sum_{i=1}^{m}\left(y_{i}-f\left(\hat{\boldsymbol{x}}_{i}\right)\right)^{2} \\
&=\sum_{i=1}^{m}\left(y_{i}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i}\right)^{2} \\
&\qquad \text{where }
\mathbf{X}=\left(\begin{array}{ccccc}
x_{11} & x_{12} & \cdots & x_{1 d} & 1 \\
x_{21} & x_{22} & \cdots & x_{2 d} & 1 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
x_{m 1} & x_{m 2} & \cdots & x_{m d} & 1
\end{array}\right)
=\left(\begin{array}{cc}
\boldsymbol{x}_{1}^{T} & 1 \\
\boldsymbol{x}_{2}^{T} & 1 \\
\vdots & \vdots \\
\boldsymbol{x}_{m}^{T} & 1
\end{array}\right)
=\left(\begin{array}{c}
\hat{\boldsymbol{x}}_{1}^{T} \\
\hat{\boldsymbol{x}}_{2}^{T} \\
\vdots \\
\hat{\boldsymbol{x}}_{m}^{T}
\end{array}\right),
\quad
\boldsymbol{y}=\left(y_{1}, y_{2}, \cdots, y_{m}\right)^{T} \\
&=\left(y_{1}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{1}\right)^{2}+\left(y_{2}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{2}\right)^{2}+\cdots+\left(y_{m}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{m}\right)^{2} \\
&=\left(\begin{array}{cccc}
y_{1}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{1} & y_{2}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{2} & \cdots & y_{m}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{m}
\end{array}\right)
\left(\begin{array}{c}
y_{1}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{1} \\
y_{2}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{2} \\
\vdots \\
y_{m}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{m}
\end{array}\right) \\
&\qquad \text{where the column vector is }
\left(\begin{array}{c}
y_{1}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{1} \\
\vdots \\
y_{m}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{m}
\end{array}\right)
=\boldsymbol{y}-\left(\begin{array}{c}
\hat{\boldsymbol{x}}_{1}^{T} \hat{\boldsymbol{w}} \\
\vdots \\
\hat{\boldsymbol{x}}_{m}^{T} \hat{\boldsymbol{w}}
\end{array}\right)
=\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}},
\ \text{and the row vector is its transpose } \left(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}}\right)^{T} \\
&=\left(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}}\right)^{T}\left(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}}\right)
\end{aligned}
$$
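The matrix form $(\boldsymbol{y}-\mathbf{X}\hat{\boldsymbol{w}})^{T}(\boldsymbol{y}-\mathbf{X}\hat{\boldsymbol{w}})$ is just the per-sample sum of squared errors written as one inner product; a numeric sanity check on random toy data (all values assumed):

```python
# The vectorized loss E_what = (y - X what)^T (y - X what) equals the
# sum of per-sample squared errors.  Toy data generated with a fixed seed.
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(6, 2)), np.ones((6, 1))])  # rows are xhat_i^T
y = rng.normal(size=6)
w_hat = np.array([0.3, -0.7, 1.2])  # arbitrary assumed weights

residual = y - X @ w_hat
E_matrix = residual @ residual  # (y - X what)^T (y - X what)
E_sums = sum((yi - w_hat @ xi) ** 2 for xi, yi in zip(X, y))
```

Note the trailing column of ones in `X`, matching the augmented design matrix defined above.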
2.3. Proving That the Loss Function $E_{\hat{\boldsymbol{w}}}$ Is Convex in $\hat{\boldsymbol{w}}$
Definition of a convex set:
Let $D\subseteq \mathbb{R}^{n}$. If for any $\boldsymbol{x}, \boldsymbol{y}\in D$ and any $a\in[0,1]$ we have $a\boldsymbol{x}+(1-a)\boldsymbol{y}\in D$, then $D$ is called a convex set.
Geometric meaning of a convex set:
If two points belong to the set, then every point on the line segment joining them also belongs to the set.
Definition of the gradient:
Suppose the $n$-ary function $f(\boldsymbol{x})$ has partial derivatives $\frac{\partial f(\boldsymbol{x})}{\partial x_{i}}\ (i=1,2,\cdots,n)$ with respect to each component $x_{i}$ of $\boldsymbol{x}=\left(x_{1}, x_{2}, \cdots, x_{n}\right)^{T}$. Then $f(\boldsymbol{x})$ is said to be first-order differentiable at $\boldsymbol{x}$, and the vector
$$
\nabla f(\boldsymbol{x})=\left(\begin{array}{c}
\frac{\partial f(\boldsymbol{x})}{\partial x_{1}} \\
\frac{\partial f(\boldsymbol{x})}{\partial x_{2}} \\
\vdots \\
\frac{\partial f(\boldsymbol{x})}{\partial x_{n}}
\end{array}\right)
$$
is called the first derivative, or gradient, of $f(\boldsymbol{x})$ at $\boldsymbol{x}$, written $\nabla f(\boldsymbol{x})$ (a column vector).
Definition of the Hessian matrix: suppose the $n$-ary function $f(\boldsymbol{x})$ has second-order partial derivatives $\frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{i} \partial x_{j}}\ (i, j=1,2,\cdots,n)$ with respect to the components of $\boldsymbol{x}=\left(x_{1}, x_{2}, \cdots, x_{n}\right)^{T}$. Then $f(\boldsymbol{x})$ is said to be second-order differentiable at $\boldsymbol{x}$, and the matrix
$$
\nabla^{2} f(\boldsymbol{x})=\left[\begin{array}{cccc}
\frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{1}^{2}} & \frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{1} \partial x_{2}} & \cdots & \frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{1} \partial x_{n}} \\
\frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{2} \partial x_{1}} & \frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{2}^{2}} & \cdots & \frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{2} \partial x_{n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{n} \partial x_{1}} & \frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{n} \partial x_{2}} & \cdots & \frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{n}^{2}}
\end{array}\right]
$$
is called the second derivative, or Hessian matrix, of $f(\boldsymbol{x})$ at $\boldsymbol{x}$, written $\nabla^{2} f(\boldsymbol{x})$. If all second-order partial derivatives of $f(\boldsymbol{x})$ are continuous at $\boldsymbol{x}$, then $\frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{i} \partial x_{j}}=\frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{j} \partial x_{i}}$, so $\nabla^{2} f(\boldsymbol{x})$ is a symmetric matrix.
Convexity criterion for multivariate real-valued functions:
Let $D\subset \mathbb{R}^{n}$ be a nonempty open convex set and let $f: D \to \mathbb{R}$ be twice continuously differentiable on $D$. If the Hessian matrix $\nabla^{2} f(\boldsymbol{x})$ is positive definite on $D$, then $f(\boldsymbol{x})$ is strictly convex on $D$.
Sufficiency theorem for convex functions:
If $f: \mathbb{R}^{n} \to \mathbb{R}$ is convex and first-order continuously differentiable, then $\boldsymbol{x}^{*}$ is a global minimizer if and only if $\nabla f(\boldsymbol{x}^{*})=\mathbf{0}$, where $\nabla f(\boldsymbol{x})$ is the first derivative (gradient) of $f(\boldsymbol{x})$ with respect to $\boldsymbol{x}$.
2.4. Taking the First-Order Partial Derivative of $E_{\hat{\boldsymbol{w}}}$ with Respect to $\hat{\boldsymbol{w}}$
$$
\begin{aligned}
\frac{\partial E_{\hat{\boldsymbol{w}}}}{\partial \hat{\boldsymbol{w}}} &=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left[(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}})^{T}(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}})\right] \\
&=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left[\left(\boldsymbol{y}^{T}-\hat{\boldsymbol{w}}^{T} \mathbf{X}^{T}\right)(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}})\right] \\
&=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left[\boldsymbol{y}^{T} \boldsymbol{y}-\boldsymbol{y}^{T} \mathbf{X} \hat{\boldsymbol{w}}-\hat{\boldsymbol{w}}^{T} \mathbf{X}^{T} \boldsymbol{y}+\hat{\boldsymbol{w}}^{T} \mathbf{X}^{T} \mathbf{X} \hat{\boldsymbol{w}}\right] \\
&=-\frac{\partial \boldsymbol{y}^{T} \mathbf{X} \hat{\boldsymbol{w}}}{\partial \hat{\boldsymbol{w}}}-\frac{\partial \hat{\boldsymbol{w}}^{T} \mathbf{X}^{T} \boldsymbol{y}}{\partial \hat{\boldsymbol{w}}}+\frac{\partial \hat{\boldsymbol{w}}^{T} \mathbf{X}^{T} \mathbf{X} \hat{\boldsymbol{w}}}{\partial \hat{\boldsymbol{w}}} \\
&\qquad \text{using the matrix-calculus identities }
\frac{\partial \boldsymbol{x}^{T} \boldsymbol{a}}{\partial \boldsymbol{x}}=\frac{\partial \boldsymbol{a}^{T} \boldsymbol{x}}{\partial \boldsymbol{x}}=\boldsymbol{a}
\ \text{ and }\ \frac{\partial \boldsymbol{x}^{T} \mathbf{B} \boldsymbol{x}}{\partial \boldsymbol{x}}=\left(\mathbf{B}+\mathbf{B}^{T}\right) \boldsymbol{x}: \\
&=-\mathbf{X}^{T} \boldsymbol{y}-\mathbf{X}^{T} \boldsymbol{y}+\left(\mathbf{X}^{T} \mathbf{X}+\mathbf{X}^{T} \mathbf{X}\right) \hat{\boldsymbol{w}} \\
&=2 \mathbf{X}^{T}(\mathbf{X} \hat{\boldsymbol{w}}-\boldsymbol{y})
\end{aligned}
\tag{西瓜书式3.10}
$$
Hence:
$$
\begin{aligned}
\frac{\partial^{2} E_{\hat{\boldsymbol{w}}}}{\partial \hat{\boldsymbol{w}} \partial \hat{\boldsymbol{w}}^{T}} &=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left(\frac{\partial E_{\hat{\boldsymbol{w}}}}{\partial \hat{\boldsymbol{w}}}\right) \\
&=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left[2 \mathbf{X}^{T}(\mathbf{X} \hat{\boldsymbol{w}}-\boldsymbol{y})\right] \\
&=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left(2 \mathbf{X}^{T} \mathbf{X} \hat{\boldsymbol{w}}-2 \mathbf{X}^{T} \boldsymbol{y}\right) \\
&=2 \mathbf{X}^{T} \mathbf{X}
\end{aligned}
\tag{Hessian matrix}
$$

When $\mathbf{X}$ has full column rank, $2\mathbf{X}^{T}\mathbf{X}$ is positive definite, so by the criterion in Section 2.3 the loss function $E_{\hat{\boldsymbol{w}}}$ is convex in $\hat{\boldsymbol{w}}$.
2.5. Setting the First-Order Partial Derivative to Zero and Solving for $\hat{\boldsymbol{w}}^{*}$
$$
\begin{aligned}
&\quad \frac{\partial E_{\hat{\boldsymbol{w}}}}{\partial \hat{\boldsymbol{w}}}=2 \mathbf{X}^{T}(\mathbf{X} \hat{\boldsymbol{w}}-\boldsymbol{y})=\mathbf{0} \\
&\Rightarrow 2 \mathbf{X}^{T} \mathbf{X} \hat{\boldsymbol{w}}-2 \mathbf{X}^{T} \boldsymbol{y}=\mathbf{0} \\
&\Rightarrow \mathbf{X}^{T} \mathbf{X} \hat{\boldsymbol{w}}=\mathbf{X}^{T} \boldsymbol{y} \\
&\Rightarrow \hat{\boldsymbol{w}}^{*}=\left(\mathbf{X}^{T} \mathbf{X}\right)^{-1} \mathbf{X}^{T} \boldsymbol{y}
\end{aligned}
\tag{西瓜书式3.11}
$$

The last step assumes $\mathbf{X}^{T} \mathbf{X}$ is invertible, i.e., $\mathbf{X}$ has full column rank.
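In code one solves the normal equations $\mathbf{X}^{T}\mathbf{X}\hat{\boldsymbol{w}}=\mathbf{X}^{T}\boldsymbol{y}$ as a linear system rather than forming the inverse explicitly; a sketch on random toy data (values assumed), cross-checked against NumPy's least-squares solver:

```python
# Normal-equation solution what* = (X^T X)^{-1} X^T y (西瓜书 Eq. 3.11).
# Solving the linear system is numerically preferable to inverting X^T X.
import numpy as np

rng = np.random.default_rng(1)
X = np.hstack([rng.normal(size=(20, 3)), np.ones((20, 1))])  # augmented design
y = rng.normal(size=20)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both routes give the same $\hat{\boldsymbol{w}}^{*}$ here because this random $\mathbf{X}$ has full column rank; when it does not, `lstsq` still returns a minimum-norm solution while the normal equations become singular.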
3. Generalized Linear Models
3.1. Exponential Family Distributions
The exponential family is a class of distributions whose probability mass function (or probability density function) takes the general form:
$$
p(y ; \eta)=b(y) \exp \left(\eta^{T} T(y)-a(\eta)\right)
$$
Here $\eta$ is called the natural parameter of the distribution; $T(y)$ is the sufficient statistic, which depends on the particular distribution and is usually just the random variable $y$ itself; $a(\eta)$ is the log-partition function; and $b(y)$ is a function of the random variable $y$. Common distributions such as the Bernoulli distribution and the normal distribution belong to the exponential family.
Proof that the Bernoulli distribution belongs to the exponential family:
The probability mass function of the Bernoulli distribution is:
$$
p(y)=\phi^{y}(1-\phi)^{1-y}
$$
where $y\in\{0,1\}$ and $\phi$ is the probability that $y=1$, i.e., $p(y=1)=\phi$. Rewriting this expression:
$$
\begin{aligned}
p(y) &=\phi^{y}(1-\phi)^{1-y} \\
&=\exp \left(\ln \left(\phi^{y}(1-\phi)^{1-y}\right)\right) \\
&=\exp \left(\ln \phi^{y}+\ln (1-\phi)^{1-y}\right) \\
&=\exp (y \ln \phi+(1-y) \ln (1-\phi)) \\
&=\exp (y \ln \phi+\ln (1-\phi)-y \ln (1-\phi)) \\
&=\exp (y(\ln \phi-\ln (1-\phi))+\ln (1-\phi)) \\
&=\exp \left(y \ln \left(\frac{\phi}{1-\phi}\right)+\ln (1-\phi)\right)
\end{aligned}
$$
Comparing this with the general exponential-family form $p(y;\eta)=b(y)\exp\left(\eta^{T}T(y)-a(\eta)\right)$, the Bernoulli distribution corresponds to the parameters:
$$\begin{aligned} b(y)&=1 \\ \eta&=\ln\left(\frac{\phi}{1-\phi}\right) \\ T(y)&=y \\ a(\eta)&=-\ln(1-\phi)=\ln\left(1+e^{\eta}\right) \end{aligned}$$
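This correspondence can be verified numerically. A small sketch, where the example value of $\phi$ is an assumption:

```python
import numpy as np

phi = 0.3  # hypothetical example value of the Bernoulli parameter
eta = np.log(phi / (1 - phi))   # natural parameter eta = ln(phi / (1 - phi))
a = np.log(1 + np.exp(eta))     # log-partition a(eta) = -ln(1 - phi) = ln(1 + e^eta)

for y in (0, 1):
    direct = phi**y * (1 - phi)**(1 - y)   # p(y) = phi^y (1 - phi)^{1-y}
    exp_family = 1 * np.exp(eta * y - a)   # b(y) exp(eta T(y) - a(eta)), b(y)=1, T(y)=y
    assert np.isclose(direct, exp_family)  # the two forms agree for y in {0, 1}
```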
3.2. The Three Assumptions of Generalized Linear Models
- Given $\boldsymbol{x}$, the random variable $y$ is assumed to follow some exponential-family distribution.
- Given $\boldsymbol{x}$, the goal is a model $h(\boldsymbol{x})$ that predicts the expected value of $T(y)$.
- The natural parameter $\eta$ of that exponential-family distribution is assumed to be linear in $\boldsymbol{x}$, i.e. $\eta=\boldsymbol{w}^{T}\boldsymbol{x}$.
4. Logistic Regression
Logistic regression models a binary classification problem, where the modeled random variable $y$ takes the value 0 or 1, so it is natural to assume that $y$ follows a Bernoulli distribution. If we then want a linear model that predicts $y$ given $\boldsymbol{x}$, we can build it within the generalized linear model framework.
4.1. GLM Derivation of Logistic Regression
Since $y$ follows a Bernoulli distribution, which belongs to the exponential family, the first GLM assumption is satisfied. By the second GLM assumption, the model $h(\boldsymbol{x})$ is:
$$h(\boldsymbol{x})=E[T(y|\boldsymbol{x})]$$
For the Bernoulli distribution $T(y|\boldsymbol{x})=y|\boldsymbol{x}$, so:
$$h(\boldsymbol{x})=E[y|\boldsymbol{x}]$$
Since $E[y|\boldsymbol{x}]=1\times p(y=1|\boldsymbol{x})+0\times p(y=0|\boldsymbol{x})=p(y=1|\boldsymbol{x})=\phi$, we have:
$$h(\boldsymbol{x})=\phi$$
From the earlier proof that the Bernoulli distribution belongs to the exponential family, we know:
$$\begin{aligned} \eta&=\ln \left(\frac{\phi}{1-\phi}\right) \\ e^{\eta}&=\frac{\phi}{1-\phi} \\ e^{-\eta}&=\frac{1-\phi}{\phi} \\ e^{-\eta}&=\frac{1}{\phi}-1 \\ 1+e^{-\eta}&=\frac{1}{\phi} \\ \frac{1}{1+e^{-\eta}}&=\phi \end{aligned}$$
Substituting $\phi=\frac{1}{1+e^{-\eta}}$ into the expression for $h(\boldsymbol{x})$ gives:
$$h(\boldsymbol{x})=\phi=\frac{1}{1+e^{-\eta}}$$
By the third GLM assumption, $\eta=\boldsymbol{w}^{T}\boldsymbol{x}$, so $h(\boldsymbol{x})$ finally becomes:
$$h(\boldsymbol{x})=\phi=\frac{1}{1+e^{-\boldsymbol{w}^{T}\boldsymbol{x}}}=p(y=1|\boldsymbol{x}) \tag{西瓜书式3.23}$$
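The model in Eq. (3.23) is just the sigmoid of a linear score. A minimal sketch, where the weights and input are hypothetical (the bias is absorbed as the last component of $\boldsymbol{w}$, with a matching constant 1 appended to $\boldsymbol{x}$):

```python
import numpy as np

def h(x, w):
    """Logistic model p(y=1 | x) from Eq. (3.23): sigmoid of w^T x."""
    return 1.0 / (1.0 + np.exp(-(w @ x)))

w = np.array([0.8, -0.5, 0.1])  # hypothetical weights (last entry acts as bias)
x = np.array([1.5, 2.0, 1.0])   # hypothetical input with a trailing 1

p1 = h(x, w)
eta = w @ x
# Equivalent form e^eta / (1 + e^eta), the same function written as p_1 below.
assert np.isclose(p1, np.exp(eta) / (1 + np.exp(eta)))
assert 0.0 < p1 < 1.0  # output is always a valid probability
```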
This is the logistic regression model.
4.2. Maximum Likelihood Estimation
Let the probability density function (or probability mass function) of the population be $f(y, w_{1}, w_{2}, \cdots, w_{k})$, and let $y_{1}, y_{2}, \ldots, y_{m}$ be a sample drawn from it. Because $y_{1}, y_{2}, \ldots, y_{m}$ are independent and identically distributed, their joint density (or joint probability) is:
$$L\left(y_{1}, y_{2}, \ldots, y_{m} ; w_{1}, w_{2}, \ldots, w_{k}\right)=\prod_{i=1}^{m} f\left(y_{i}, w_{1}, w_{2}, \ldots, w_{k}\right)$$
Here $w_{1}, w_{2}, \ldots, w_{k}$ are regarded as fixed but unknown parameters. Given a set of observed sample values $y_{1}, y_{2}, \ldots, y_{m}$, a natural idea for estimating the unknown parameters is this: whichever set of parameters makes the observed sample most probable is likely to be the true one, so we take it as the estimate. This is maximum likelihood estimation.
The concrete procedure of maximum likelihood estimation:
We usually write $L\left(y_{1}, y_{2}, \ldots, y_{m} ; w_{1}, w_{2}, \ldots, w_{k}\right)=L(\boldsymbol{w})$ and call it the likelihood function. Finding the maximum likelihood estimate of $\boldsymbol{w}$ then reduces to finding the maximum point of $L(\boldsymbol{w})$. Since the logarithm is monotonically increasing,
$$\begin{aligned} \ln L(\boldsymbol{w}) &=\ln \left(\prod_{i=1}^{m} f\left(y_{i}, w_{1}, w_{2}, \ldots, w_{k}\right)\right) \\ &=\sum_{i=1}^{m} \ln f\left(y_{i}, w_{1}, w_{2}, \ldots, w_{k}\right) \end{aligned}$$
has the same maximum point as $L(\boldsymbol{w})$. In many cases maximizing $\ln L(\boldsymbol{w})$ is simpler, so we convert the problem of maximizing $L(\boldsymbol{w})$ into maximizing $\ln L(\boldsymbol{w})$, which is called the log-likelihood function.
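The idea can be illustrated on a Bernoulli population, where the maximum likelihood estimate of $\phi$ is known to be the sample mean. A sketch under assumed toy data, locating the maximizer of $\ln L(\phi)$ on a grid:

```python
import numpy as np

# Hypothetical sample from a Bernoulli population: 7 ones out of 10.
y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

phis = np.linspace(0.01, 0.99, 981)  # grid of candidate parameters
# Log-likelihood ln L(phi) = sum_i ln f(y_i; phi) for the Bernoulli pmf.
log_L = np.array([np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) for p in phis])

phi_hat = phis[np.argmax(log_L)]
print(phi_hat)  # close to the sample mean y.mean() = 0.7
```

Because $\ln$ is monotone, maximizing `log_L` and maximizing the raw product of probabilities pick the same `phi_hat`; the log form simply avoids the numerical underflow of multiplying many small probabilities.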
Maximum likelihood estimation for logistic regression:
The probabilities that the random variable $y$ takes the values 1 and 0 are:
$$\begin{aligned} &p(y=1 | \boldsymbol{x})=\frac{e^{\boldsymbol{w}^{T} \boldsymbol{x}+b}}{1+e^{\boldsymbol{w}^{T} \boldsymbol{x}+b}} \\ &p(y=0 | \boldsymbol{x})=\frac{1}{1+e^{\boldsymbol{w}^{T} \boldsymbol{x}+b}} \end{aligned}$$
Let $\boldsymbol{\beta}=(\boldsymbol{w};b)$ and $\hat{\boldsymbol{x}}=(\boldsymbol{x};1)$; then $\boldsymbol{w}^{T}\boldsymbol{x}+b$ can be written compactly as $\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}$, and the above simplifies to:
$$\begin{aligned} &p(y=1 | \boldsymbol{x})=\frac{e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}}}{1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}}} \\ &p(y=0 | \boldsymbol{x})=\frac{1}{1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}}} \end{aligned}$$
Write:
$$\begin{aligned} &p(y=1 | \boldsymbol{x})=\frac{e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}}}{1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}}}=p_{1}(\hat{\boldsymbol{x}};\boldsymbol{\beta}) \\ &p(y=0 | \boldsymbol{x})=\frac{1}{1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}}}=p_{0}(\hat{\boldsymbol{x}};\boldsymbol{\beta}) \end{aligned}$$
A small trick then yields a single expression for the probability mass function of $y$:
$$p(y | \boldsymbol{x} ; \boldsymbol{w}, b) =y \cdot p_{1}(\hat{\boldsymbol{x}} ; \boldsymbol{\beta})+(1-y) \cdot p_{0}(\hat{\boldsymbol{x}} ; \boldsymbol{\beta}) \tag{西瓜书式3.26}$$
or, equivalently:
$$p(y | \boldsymbol{x} ; \boldsymbol{w}, b) =\left[p_{1}(\hat{\boldsymbol{x}} ; \boldsymbol{\beta})\right]^{y} \left[p_{0}(\hat{\boldsymbol{x}} ; \boldsymbol{\beta})\right]^{1-y}$$
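Both expressions agree for $y\in\{0,1\}$: each selects $p_{1}$ when $y=1$ and $p_{0}$ when $y=0$. A small numerical check, with hypothetical $\boldsymbol{\beta}$ and $\hat{\boldsymbol{x}}$:

```python
import numpy as np

def p1(x_hat, beta):
    """p(y=1 | x) = e^{beta^T x_hat} / (1 + e^{beta^T x_hat})."""
    z = beta @ x_hat
    return np.exp(z) / (1 + np.exp(z))

def p0(x_hat, beta):
    """p(y=0 | x) = 1 / (1 + e^{beta^T x_hat})."""
    return 1 - p1(x_hat, beta)

beta = np.array([0.5, -1.0, 0.2])  # hypothetical parameters (w; b)
x_hat = np.array([1.0, 2.0, 1.0])  # (x; 1)

for y in (0, 1):
    additive = y * p1(x_hat, beta) + (1 - y) * p0(x_hat, beta)        # Eq. (3.26)
    multiplicative = p1(x_hat, beta)**y * p0(x_hat, beta)**(1 - y)    # product form
    assert np.isclose(additive, multiplicative)
```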
4.3. Parameter Estimation for Logistic Regression
By the definition of the log-likelihood function,
$$\ln L(\boldsymbol{w}) =\sum_{i=1}^{m} \ln f\left(y_{i}, w_{1}, w_{2}, \ldots, w_{k}\right)$$
Since $y$ is discrete here, we simply replace the probability density in the log-likelihood with the probability mass function, giving:
$$\ell(\boldsymbol{w},b) :=\ln L(\boldsymbol{w},b) =\sum_{i=1}^{m} \ln p\left(y_{i} | \boldsymbol{x}_{i}; \boldsymbol{w},b\right) \tag{西瓜书式3.25}$$
Substituting $p(y | \boldsymbol{x} ; \boldsymbol{w}, b)=y \cdot p_{1}(\hat{\boldsymbol{x}} ; \boldsymbol{\beta})+(1-y) \cdot p_{0}(\hat{\boldsymbol{x}} ; \boldsymbol{\beta})$ into the log-likelihood gives:
$$\begin{aligned} \ell(\boldsymbol{\beta}) &=\sum_{i=1}^{m} \ln \left(y_{i} p_{1}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)+\left(1-y_{i}\right) p_{0}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)\right) \\ &=\sum_{i=1}^{m} \ln \left(\frac{y_{i} e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}}{1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}}+\frac{1-y_{i}}{1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}}\right) \\ &=\sum_{i=1}^{m} \ln \left(\frac{y_{i} e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}+1-y_{i}}{1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}}\right) \\ &=\sum_{i=1}^{m}\left(\ln \left(y_{i} e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}+1-y_{i}\right)-\ln \left(1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}\right)\right) \\ &=\sum_{i=1}^{m}\left(y_{i} \boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}-\ln \left(1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}\right)\right) \end{aligned} \tag{西瓜书式3.27}$$
where the second step substitutes $p_{1}(\hat{\boldsymbol{x}};\boldsymbol{\beta}) = \frac{e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}}}{1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}}}$ and $p_{0}(\hat{\boldsymbol{x}};\boldsymbol{\beta}) = \frac{1}{1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}}}$. The last step uses $y_{i}\in\{0,1\}$: when $y_{i}=0$ the first logarithm is $\ln 1 = 0 = y_{i}\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}$, and when $y_{i}=1$ it is $\ln e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}}=\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}=y_{i}\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}$, so in both cases $\ln\left(y_{i} e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}}+1-y_{i}\right)=y_{i}\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}$.
If instead $p(y | \boldsymbol{x} ; \boldsymbol{w}, b)=\left[p_{1}(\hat{\boldsymbol{x}} ; \boldsymbol{\beta})\right]^{y}\left[p_{0}(\hat{\boldsymbol{x}} ; \boldsymbol{\beta})\right]^{1-y}$ is substituted into the log-likelihood, we get:
$$\begin{aligned} \ell(\boldsymbol{\beta}) &=\sum_{i=1}^{m} \ln \left(\left[p_{1}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)\right]^{y_{i}}\left[p_{0}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)\right]^{1-y_{i}}\right) \\ &=\sum_{i=1}^{m}\left[y_{i} \ln \left(p_{1}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)\right)+\left(1-y_{i}\right) \ln \left(p_{0}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)\right)\right] \\ &=\sum_{i=1}^{m}\left[y_{i} \ln \frac{p_{1}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)}{p_{0}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)}+\ln \left(p_{0}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)\right)\right] \\ &=\sum_{i=1}^{m}\left[y_{i} \ln \left(e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}\right)+\ln \left(p_{0}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)\right)\right] \\ &=\sum_{i=1}^{m}\left(y_{i} \boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}-\ln \left(1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}\right)\right) \end{aligned}$$
where $p_{1}(\hat{\boldsymbol{x}};\boldsymbol{\beta}) = \frac{e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}}}{1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}}}$ and $p_{0}(\hat{\boldsymbol{x}};\boldsymbol{\beta}) = \frac{1}{1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}}}$, so their ratio is $e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}}$. Both routes arrive at the same expression as式3.27.
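The simplified log-likelihood can be checked against the direct sum of log-probabilities. A sketch with hypothetical random data (seed and shapes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 8, 3
# Rows are x_hat_i = (x_i; 1); beta = (w; b) has d + 1 components.
X_hat = np.hstack([rng.normal(size=(m, d)), np.ones((m, 1))])
y = rng.integers(0, 2, size=m)
beta = rng.normal(size=d + 1)

z = X_hat @ beta  # beta^T x_hat_i for every sample i

# Direct form: sum_i ln p(y_i | x_i) with Bernoulli probabilities p_1, p_0.
p1 = np.exp(z) / (1 + np.exp(z))
direct = np.sum(y * np.log(p1) + (1 - y) * np.log(1 - p1))

# Simplified closed form of the log-likelihood, Eq. (3.27).
simplified = np.sum(y * z - np.log(1 + np.exp(z)))

assert np.isclose(direct, simplified)  # identical for y in {0, 1}
```

Maximizing this $\ell(\boldsymbol{\beta})$ (equivalently, minimizing its negative) has no closed-form solution, which is why式3.27 is optimized numerically, e.g. by gradient descent or Newton's method.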