[cs229] Andrew Ng Machine Learning - Part 1/2

0. Course Plan

Coursera course page
Related posts in this series:
[cs229] Andrew Ng Machine Learning - Part 1/2

The course runs 10 weeks in total. The goal is to master the fundamentals in depth; these notes are not a re-translation of the lectures but a record of my own questions and my attempts to answer them promptly.
What do you want to get out of this course? If it is anything on the newer side, such as CV or deep learning, then no, it is not here. This is Machine Learning, the classical curriculum. It also leaves out other classical algorithms covered in Li Hang's book, such as random forest.
If I had to give it a position, it is the bridge to deep learning and its theoretical foundation.

1. Introduction

2. Linear Regression

2.1 hypothesis

For historical reasons, the function that maps the input space to the output space is called the hypothesis.

2.2 objective function

The objective function is a concept distinct from the loss function: the loss measures the error on a single sample, while the objective aggregates it over the whole training set.
The goal is to find the parameters that minimize the cost function $J(\theta_0,\theta_1)$.

2.3 contour plot

Points on the same contour line share the same function value.

2.4 Gradient Descent

Gradient descent updates all parameters simultaneously.
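In other words, every new $\theta_j$ must be computed from the same old parameter values before any of them is overwritten. A minimal sketch of the pattern with two parameters (the callables d_J0 and d_J1 are placeholders of my own for the partial derivatives):

```python
# Simultaneous update: compute every new theta_j from the *old* theta values,
# then assign them together via temporaries.
def step(theta0, theta1, alpha, d_J0, d_J1):
    temp0 = theta0 - alpha * d_J0(theta0, theta1)
    temp1 = theta1 - alpha * d_J1(theta0, theta1)  # still sees the old theta0
    return temp0, temp1
```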

2.4.1 convex function

  • bowl-shaped function
  • With a non-convex cost, perturbing the initial value can make gradient descent settle into different local minima; a convex, bowl-shaped function has a single global minimum.

2.4.2 learning rate

In plain gradient descent, $\theta_i := \theta_i - \alpha\frac{\partial}{\partial\theta_i}J(\theta_0,\dots,\theta_n)$, so the correction is large where the partial derivative / slope is large (steep regions) and small where it is small (flat regions). Is that reasonable?
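One way to see that it is: with a fixed learning rate, the step $\alpha\,\partial J/\partial\theta$ shrinks on its own as the slope flattens near a minimum. A tiny 1-D sketch on the illustrative cost $J(\theta)=\theta^2$ (my own example, not from the lecture):

```python
# Gradient descent on J(theta) = theta^2 with a fixed learning rate:
# the step alpha * dJ/dtheta shrinks automatically as the slope flattens.
alpha = 0.1
theta = 5.0
for it in range(10):
    grad = 2 * theta            # dJ/dtheta for J(theta) = theta^2
    step = alpha * grad
    theta -= step
    print(f"iter {it}: step = {step:+.4f}, theta = {theta:.4f}")
```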

2.4.3 Batch Gradient Descent

One parameter update is computed from a batch of data; in this course, "batch" gradient descent means the batch is the entire training set.

2.5 Linear Algebra

In this course, the subscripts of matrices (usually denoted by uppercase letters) and vectors (lowercase letters) start from 1 (1-indexed).

Vector: an $n\times1$ matrix.

Note that the column of 1s padded into the data (the four 1s in the lecture's example) makes the dimensions match; this is also why $\theta_0$ is called the bias.
The bias in a neural network is the same idea.

Modern hardware such as SIMD units, GPUs, and TPUs can compute matrix multiplication very efficiently.

Matrix multiplication is associative but not commutative.

The identity matrix is the unit matrix $I$ (ones on the diagonal, zeros elsewhere), satisfying $AI = IA = A$.

A matrix that has no inverse is called singular or degenerate (i.e., not of full rank).
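A quick NumPy check of these facts (the matrices are arbitrary illustrative choices):

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[0., 1.], [1., 0.]])
C = np.array([[2., 0.], [0., 2.]])

print(np.allclose((A @ B) @ C, A @ (B @ C)))   # True:  associative
print(np.allclose(A @ B, B @ A))               # False: not commutative

I = np.eye(2)                                  # identity matrix
print(np.allclose(A @ I, A) and np.allclose(I @ A, A))  # True

S = np.array([[1., 2.], [2., 4.]])             # rows are linearly dependent
print(np.linalg.det(S))                        # 0.0 -> singular, inv(S) would raise LinAlgError
```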

2.6 Multivariate Linear Regression

$$
\begin{aligned}
h_\theta(x)&=\theta^Tx,\qquad x\in\mathbb{R}^{n+1}\ \ (x_0=1)\\
\theta_j&:=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)\,x_j^{(i)}
\end{aligned}
$$
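A minimal vectorized sketch of this batch update (function and variable names are my own; X is assumed to already include the column of ones):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iters=1000):
    """Batch gradient descent for linear regression.
    X: (m, n+1) design matrix with a leading column of ones; y: (m,) targets."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m   # (1/m) * X^T (X theta - y)
        theta -= alpha * grad              # simultaneous update of all theta_j
    return theta
```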

2.6.1 feature scaling

When feature scales differ by orders of magnitude, gradient descent can converge very slowly, so it is worth normalizing the data.
Subtract the mean and divide by the range, $x_i \gets \frac{x_i-\mu}{x_{max}-x_{min}}$, which rescales the data roughly into $[-1,1]$; the denominator can also be replaced by the standard deviation, which gives z-score standardization (zero mean, unit variance).
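A minimal sketch of this mean normalization (the function name, and returning mu and rng so the same scaling can be applied to new inputs, are my own choices):

```python
import numpy as np

def scale_features(X):
    """Mean normalization: subtract the mean, divide by the range, per feature.
    X: (m, n) raw feature matrix without the bias column."""
    mu = X.mean(axis=0)
    rng = X.max(axis=0) - X.min(axis=0)   # use X.std(axis=0) for z-score standardization
    return (X - mu) / rng, mu, rng
```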

2.6.2 Polynomial Regression

By substituting features such as $x_1=z$, $x_2=z^2$, $x_3=z^3$, $x_4=\sqrt z$, and so on, a polynomial regression problem reduces to a linear regression problem.
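A minimal sketch of building such polynomial features from a single raw input (the function name and degree parameter are my own choices; note that feature scaling matters even more here, since $z^3$ spans a far larger range than $z$):

```python
import numpy as np

def polynomial_features(z, degree=3):
    """Turn a single input z into the feature columns [z, z^2, ..., z^degree].
    z: (m,) raw input values."""
    return np.column_stack([z ** d for d in range(1, degree + 1)])
```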

2.6.3 Computing Parameters Analytically

Each row of the design matrix $X$ is a sample $(x^{(i)})^T$, so $X$ has dimension $m\times(n+1)$. Treating the system as if it could be solved exactly and multiplying both sides by $X^T$:
$$
\begin{aligned}
X\theta&=y\\
X^T(X\theta)&=X^Ty\\
(X^TX)^{-1}(X^TX)\theta&=(X^TX)^{-1}X^Ty\\
\theta&=(X^TX)^{-1}X^Ty
\end{aligned}
$$
There is another way to arrive at it, by differentiating the cost directly:
$$
\begin{aligned}
J(\theta)&=\frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)^2\\
\frac{\partial}{\partial\theta_j}J(\theta)&=\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x_j^{(i)}\\
\frac{\partial}{\partial\theta}J(\theta)&=\frac{1}{m}X^T(X\theta-y)\\
\text{setting this to }0:\qquad X^TX\theta&=X^Ty\\
\theta&=(X^TX)^{-1}X^Ty
\end{aligned}
$$
In most cases $X^TX$ is invertible; even when it is not, software such as Octave will still compute an approximate inverse if you use pinv (pseudo-inverse) instead of inv.
Non-invertible cases arise when, for example (see the sketch after this list):

  • there are too few samples ($m < n+1$)
  • there are redundant (linearly dependent) features among the $n$ features
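A minimal NumPy sketch of the normal equation (the function name is my own; using pinv also keeps it usable in the singular cases above):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form linear regression: theta = pinv(X^T X) X^T y.
    X: (m, n+1) design matrix with a leading column of ones; y: (m,) targets."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```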

3. Logistic Regression

  • Logistic "regression" really ought to be called logistic classification; the name is a historical accident.
  • Linear regression has drawbacks for classification: adding a single uninformative new sample can shift the fitted line and change the model.
  • Logistic regression guarantees the output range $0 \leqslant h_{\theta}(x) \leqslant 1$.
  • Feature scaling applies to logistic regression just as well.

3.1 sigmoid function

sigmoid function = logistic function
$$
\begin{aligned}
h_{\theta}(x)&=g(\theta^Tx)=P(y=1\mid x;\theta)\\
g(z)&=\frac{1}{1+e^{-z}}\\
g'(z)&=(-1)\,(1+e^{-z})^{-2}\cdot(-1)\,e^{-z}\\
&=\frac{e^{-z}}{(1+e^{-z})^{2}}=\frac{1+e^{-z}-1}{(1+e^{-z})^{2}}=\frac{1}{1+e^{-z}}-\frac{1}{(1+e^{-z})^2}\\
&=\frac{1}{1+e^{-z}}\Bigl(1-\frac{1}{1+e^{-z}}\Bigr)\\
&=g(z)\bigl(1-g(z)\bigr)
\end{aligned}
$$
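A quick numerical sanity check of $g'(z)=g(z)\bigl(1-g(z)\bigr)$ using a central difference (an illustrative sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, eps = 0.7, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # finite-difference slope
analytic = sigmoid(z) * (1 - sigmoid(z))                     # g(z) * (1 - g(z))
print(numeric, analytic)                                     # the two values agree closely
```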

3.2 Decision Boundary

The decision boundary is where $h_\theta(x)=0.5$, which by the shape of the sigmoid happens exactly at $\theta^Tx=0$;
accordingly, $P(y=1\mid x;\theta)\geq0.5$ is equivalent to $\theta^Tx\geq0$.

3.3 Cost function

The cost function used in linear regression is:
$$
\begin{aligned}
\mathrm{cost}\bigl(h_\theta(x),y\bigr)&=\frac{1}{2}\bigl(h_\theta(x)-y\bigr)^2\\
J(\theta)&=\frac{1}{m}\sum_{i=1}^{m}\mathrm{cost}\bigl(h_\theta(x^{(i)}),y^{(i)}\bigr)=\frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)^2
\end{aligned}
$$
If we kept the same form, the logistic regression cost would be:
$$
\begin{aligned}
\mathrm{cost}\bigl(h_\theta(x),y\bigr)&=\frac{1}{2}\bigl(h_\theta(x)-y\bigr)^2=\frac{1}{2}\Bigl(\frac{1}{1+e^{-\theta^Tx}}-y\Bigr)^2\\
J(\theta)&=\frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\bigl(g(\theta^Tx^{(i)})-y^{(i)}\bigr)^2
\end{aligned}
$$

Because of the sigmoid, this $J(\theta)$ is not convex (non-convex) and has many local optima.

Minimizing the loss is equivalent to maximizing the likelihood, where the probability of a label is:
$$P(y\mid x;\theta)=h_\theta(x)^{\,y}\,\bigl(1-h_\theta(x)\bigr)^{1-y}$$
The corresponding likelihood function is:
$$L(\theta)=\prod_{i=1}^{m}P\bigl(y^{(i)}\mid x^{(i)};\theta\bigr)$$
Since probabilities are non-negative, maximizing $L(\theta)$ is equivalent to maximizing its logarithm:
$$
\begin{aligned}
\log L(\theta)&=\log\prod_{i=1}^{m}P\bigl(y^{(i)}\mid x^{(i)};\theta\bigr)\\
&=\log\prod_{i=1}^{m}h_\theta(x^{(i)})^{\,y^{(i)}}\bigl(1-h_\theta(x^{(i)})\bigr)^{1-y^{(i)}}\\
&=\sum_{i=1}^{m}\Bigl[y^{(i)}\log h_\theta(x^{(i)})+\bigl(1-y^{(i)}\bigr)\log\bigl(1-h_\theta(x^{(i)})\bigr)\Bigr]\\
&=-m\,J(\theta)
\end{aligned}
$$
So maximizing the likelihood is equivalent to minimizing the cost:
$$
\begin{aligned}
J(\theta)&=\frac{1}{m}\sum_{i=1}^{m}\Bigl[-y^{(i)}\log h_\theta(x^{(i)})-\bigl(1-y^{(i)}\bigr)\log\bigl(1-h_\theta(x^{(i)})\bigr)\Bigr]\\
\mathrm{cost}\bigl(h_\theta(x),y\bigr)&=-y\log h_\theta(x)-(1-y)\log\bigl(1-h_\theta(x)\bigr)\\
&=\begin{cases}-\log h_\theta(x)&\text{if } y=1\\-\log\bigl(1-h_\theta(x)\bigr)&\text{if } y=0\end{cases}
\end{aligned}
$$
Even better, this cost function is convex.
This is exactly the cross-entropy loss. Concretely, for two distributions $p(x)$ and $q(x)$, the cross-entropy is:
$$H(p,q)=-\sum_x p(x)\log q(x)$$
The smaller the cross-entropy, the closer the two distributions; here $p(x)$ is the true distribution (the labels) and $q(x)$ is the distribution estimated by $\theta$.

$$
\begin{aligned}
g(z)&=\frac{1}{1+e^{-z}}\\
\frac{\partial}{\partial z}g(z)&=g(z)\bigl(1-g(z)\bigr)\\
h_\theta(x)&=g(\theta^Tx)\\
\frac{\partial}{\partial \theta}h_\theta(x)&=h_\theta(x)\bigl(1-h_\theta(x)\bigr)x\\
J(\theta)&=\frac{1}{m}\sum_{i=1}^{m}\Bigl[-y\log h_\theta(x)-(1-y)\log\bigl(1-h_\theta(x)\bigr)\Bigr]\\
\frac{\partial}{\partial \theta}J(\theta)&=-\frac{1}{m}\sum_{i=1}^{m}\Bigl[y\frac{1}{h_\theta(x)}\frac{\partial}{\partial \theta}h_\theta(x)+(1-y)\frac{1}{1-h_\theta(x)}(-1)\frac{\partial}{\partial \theta}h_\theta(x)\Bigr]\\
&=-\frac{1}{m}\sum_{i=1}^{m}\Bigl[y\frac{1}{h_\theta(x)}h_\theta(x)\bigl(1-h_\theta(x)\bigr)x+(y-1)\frac{1}{1-h_\theta(x)}h_\theta(x)\bigl(1-h_\theta(x)\bigr)x\Bigr]\\
&=\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x)-y\bigr)x\qquad\text{(sample superscripts omitted for brevity)}\\
\theta&:=\theta-\alpha\frac{\partial}{\partial \theta}J(\theta)=\theta-\alpha\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x^{(i)}
\end{aligned}
$$
Remarkably, the final update rule for $\theta$ in logistic regression looks exactly the same as the one for linear regression; the only difference is hidden in $h_\theta(x)$ itself.
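A minimal vectorized sketch of this update (names are my own; X is assumed to include the bias column of ones):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent for logistic regression.
    X: (m, n+1) design matrix; y: (m,) labels in {0, 1}."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)                 # only h differs from linear regression
        theta -= alpha * (X.T @ (h - y)) / m   # same update shape as before
    return theta
```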

3.4 Optimization algorithm

  • batch/mini-batch/stochastic gradient descent
  • conjugate gradient
  • BFGS
  • L-BFGS
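These can be treated as black-box optimizers: supply $J(\theta)$ and its gradient and let the library choose the step size (the lecture uses Octave's fminunc for this). A minimal sketch with SciPy's minimize, which is my own illustration and not from the course:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    """Return J(theta) and its gradient for (unregularized) logistic regression."""
    m = len(y)
    h = sigmoid(X @ theta)
    J = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    grad = X.T @ (h - y) / m
    return J, grad

# Toy data, just to make the sketch runnable.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = (X[:, 1] + X[:, 2] + rng.normal(size=100) > 0).astype(float)

res = minimize(cost_and_grad, x0=np.zeros(X.shape[1]), args=(X, y),
               jac=True, method="L-BFGS-B")   # jac=True: fun returns (cost, gradient)
print(res.x)                                  # fitted theta
```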

3.5 Multiclass Classification

Split a K-class problem into K one-vs-all binary problems; the K trained classifiers produce K scores, and the class with the highest score is the prediction.
Is it really this cumbersome?
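In code it amounts to little more than a loop over the classes. A minimal one-vs-all sketch that reuses any binary trainer, such as the logistic regression sketch above (function names are my own):

```python
import numpy as np

def one_vs_all_train(X, y, num_classes, train_binary):
    """Train one binary classifier per class; train_binary(X, labels) returns a theta vector."""
    return np.array([train_binary(X, (y == k).astype(float))
                     for k in range(num_classes)])          # shape (K, n+1)

def one_vs_all_predict(X, all_theta):
    """Predict the class whose classifier gives the highest score theta^T x."""
    return np.argmax(X @ all_theta.T, axis=1)
```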

3.6 Overfitting & Regularization

Solutions:

  • reduce the number of features
  • regularization (shrink the magnitudes of the parameters $\theta_j$)
    Does the regularization term break convexity?
    It does not:
    https://blog.csdn.net/yyxyuxueYang/article/details/81534965

The theory behind regularization is that small $\theta$ values bring:

  • a simpler hypothesis
    (for example, adding a $\lambda\theta_3^2$ term to the cost function forces $\theta_3$ to be very small at convergence)
    If $\lambda$ is too large, the converged $\theta$ values are too small and the model underfits.
  • less risk of overfitting
    With the regularization term added, the normal-equation solution for linear regression becomes (a NumPy sketch follows the equation):
$$
\begin{aligned}
J(\theta)&=\frac{1}{2m}\Bigl[\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)^2+\lambda\sum_{j=1}^{n}\theta_j^2\Bigr]\qquad(\theta_0\text{ is not penalized})\\
\text{for }j\geq1:\quad\frac{\partial}{\partial\theta_j}J(\theta)&=\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x_j^{(i)}+\frac{\lambda}{m}\theta_j\\
\frac{\partial}{\partial\theta}J(\theta)&=\frac{1}{m}X^T(X\theta-y)+\frac{\lambda}{m}\theta\\
\text{setting this to }0:\qquad\bigl(X^TX+\lambda I\bigr)\theta&=X^Ty\\
\theta&=\bigl(X^TX+\lambda I\bigr)^{-1}X^Ty\\
\text{and leaving }\theta_0\text{ unpenalized}:\qquad\theta&=\Bigl(X^TX+\lambda\begin{bmatrix}0&0\\0&I_{n\times n}\end{bmatrix}\Bigr)^{-1}X^Ty
\end{aligned}
$$
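A minimal NumPy sketch of this regularized normal equation, assuming X already includes the bias column of ones (the function name is my own):

```python
import numpy as np

def regularized_normal_equation(X, y, lam):
    """theta = (X^T X + lambda * L)^(-1) X^T y, where L is the identity
    matrix with its (0, 0) entry zeroed so that theta_0 is not penalized."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0                                      # do not penalize the bias theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```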

With the regularization term added, logistic regression (which has no closed-form normal equation) uses the gradient descent update:
$$
\begin{aligned}
J(\theta)&=\frac{1}{m}\sum_{i=1}^{m}\Bigl[-y^{(i)}\log h_\theta(x^{(i)})-\bigl(1-y^{(i)}\bigr)\log\bigl(1-h_\theta(x^{(i)})\bigr)\Bigr]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2\\
\text{for }j\geq1:\quad\frac{\partial}{\partial \theta_j}J(\theta)&=\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x_j^{(i)}+\frac{\lambda}{m}\theta_j\\
\theta_j&:=\theta_j-\alpha\frac{\partial}{\partial \theta_j}J(\theta)\\
&=\theta_j-\alpha\Bigl[\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x_j^{(i)}+\frac{\lambda}{m}\theta_j\Bigr]\\
&=\Bigl(1-\alpha\frac{\lambda}{m}\Bigr)\theta_j-\alpha\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x_j^{(i)}
\end{aligned}
$$
