Brief Notes
- Course-related links
- Since the course is taught in English, these notes record the content in English, with brief explanations added alongside.
- These notes exist purely to deepen my own understanding of the material; if anything is wrong, I ask for your understanding and correction.
- Many thanks to Professor Andrew Ng for his selfless dedication!
Terminology
| hypothesis | 假设函数 | Linear regression | 线性回归 |
|---|---|---|---|
| Parameter | 模型参数 | cost function | 代价函数 |
| Gradient descent | 梯度下降 | convex function | 凸函数 |
Model representation
Training set of housing prices
- Notation
- m = number of training examples
- x's = "input" variables / features
- y's = "output" variables / "target" variables
- (x, y) = a single training example
- (x^(i), y^(i)) = the i-th training example
How a supervised learning algorithm works
- training set → learning algorithm → hypothesis h
- h is a function that maps from x to y
How do we represent h ?
$$h_\theta(x)=\theta_0+\theta_1 x$$
- Linear regression with one variable = univariate linear regression
- θ_i's: the parameters of the model
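As a quick runnable sketch (my own code, not from the course), the hypothesis is just a straight line in the parameters θ_0 and θ_1:

```python
# Minimal sketch of the hypothesis h_theta(x) = theta_0 + theta_1 * x
# for univariate linear regression.
def h(theta0, theta1, x):
    """Predict y for input x using a straight-line hypothesis."""
    return theta0 + theta1 * x

# With theta_0 = 1.0 and theta_1 = 2.0, input x = 3 predicts 7.0.
print(h(1.0, 2.0, 3))  # 7.0
```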
Cost function
How to choose the θ_i's?
- Choose θ_0, θ_1 so that h(x), the value predicted for input x, is as close as possible to the corresponding y of each training example
$$h_\theta(x)=\theta_0+\theta_1 x$$
$$\mathop{\text{minimize}}\limits_{\theta_0,\,\theta_1}\ \frac{1}{2m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$
- Find parameters θ_0, θ_1 that minimize the sum of squared differences between the predicted values h(x) and the actual values y over all training examples
- m is the number of training examples
- Dividing by m averages the error; the extra factor of 1/2 just makes the later math (the derivative) a bit cleaner
- The cost function J(θ_0, θ_1) is also called the squared error function
$$J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$
$$\mathop{\text{minimize}}\limits_{\theta_0,\,\theta_1}\ J(\theta_0,\theta_1)$$
Formulation
- Hypothesis:
$$h_\theta(x)=\theta_0+\theta_1 x$$
- Parameters:
$$\theta_0,\ \theta_1$$
- Cost function:
$$J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$
- Goal:
$$\mathop{\text{minimize}}\limits_{\theta_0,\,\theta_1}\ J(\theta_0,\theta_1)$$
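The cost function above translates directly into code. A small sketch (the function name and data are my own illustration, not from the course):

```python
# Squared error cost J(theta_0, theta_1) = (1/2m) * sum_i (h(x_i) - y_i)^2.
def compute_cost(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2
               for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1, 2, 3], [2, 4, 6]          # data lying exactly on y = 2x
print(compute_cost(0.0, 2.0, xs, ys))  # 0.0: a perfect fit has zero cost
print(compute_cost(0.0, 0.0, xs, ys))  # positive cost for a bad fit
```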
Simplified cost function
$$h_\theta(x)=\theta_1 x\quad(\theta_0=0)$$
$$J(\theta_1)=\frac{1}{2m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$
$$\mathop{\text{minimize}}\limits_{\theta_1}\ J(\theta_1)$$
- Setting θ_0 = 0 amounts to choosing hypothesis functions that pass through the origin (through the point (0, 0))
- Computing J(θ_1) for each value of θ_1 lets us plot the curve of J(θ_1)
- The optimization objective of the learning algorithm is to choose the value of θ_1 that minimizes J(θ_1)
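The "plot J(θ_1) and pick its lowest point" idea can be sketched by brute force over a grid (the grid and data below are my own illustration, not the course's method):

```python
# Evaluate J(theta_1) over a grid of candidate slopes (theta_0 fixed at 0)
# and pick the slope with the smallest cost.
def cost_theta1(theta1, xs, ys):
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1, 2, 3], [1, 2, 3]               # data on the line y = x
grid = [i / 100 for i in range(-100, 301)]  # theta_1 values in [-1, 3]
best = min(grid, key=lambda t: cost_theta1(t, xs, ys))
print(best)  # 1.0: the slope that fits y = x exactly
```

Gradient descent, introduced next, finds this minimum without exhaustively scanning a grid.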
Cost function with two parameters
- As θ_0 and θ_1 approach the center (minimum) of the cost function's contours, the hypothesis fits the data better and better
Gradient descent
Outline
- Start with some θ_0, θ_1 (starting from some initial values, chosen arbitrarily)
- Keep changing θ_0, θ_1 to reduce J(θ_0, θ_1), stepping downhill until we hopefully converge to a minimum
- A property of gradient descent: different starting points can lead to different local optima
Gradient descent algorithm
repeat until convergence {
$$\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)\quad(\text{for } j=0 \text{ and } j=1)$$
}
- ":=" is the assignment operator: it assigns the value on the right to the variable on the left
- α is called the learning rate; it controls the step size of gradient descent: the larger α is, the larger each step
- The derivative term is the partial derivative of J with respect to θ_j
Simultaneous update
- Correct:
$$\text{temp0}:=\theta_0-\alpha\frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1)$$
$$\text{temp1}:=\theta_1-\alpha\frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1)$$
$$\theta_0:=\text{temp0}$$
$$\theta_1:=\text{temp1}$$
- ❌ Incorrect:
$$\text{temp0}:=\theta_0-\alpha\frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1)$$
$$\theta_0:=\text{temp0}$$
$$\text{temp1}:=\theta_1-\alpha\frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1)$$
$$\theta_1:=\text{temp1}$$
- When applying the update equations, θ_0 and θ_1 must be updated simultaneously
- Correct method: first evaluate both right-hand sides, then update θ_0 and θ_1 together
- ❌ Incorrect method: compute temp0 and update θ_0 first, then compute temp1 and update θ_1 (the second derivative is then evaluated at an already-updated θ_0)
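To see why the order matters, consider a toy cost (my own example, not from the course) whose partial derivatives each depend on both parameters, J(θ_0, θ_1) = θ_0² + θ_1² + θ_0·θ_1:

```python
# Partial derivatives of the toy cost J = t0^2 + t1^2 + t0*t1.
def grad0(t0, t1): return 2 * t0 + t1
def grad1(t0, t1): return 2 * t1 + t0

def simultaneous_step(t0, t1, alpha):
    temp0 = t0 - alpha * grad0(t0, t1)  # both gradients evaluated at
    temp1 = t1 - alpha * grad1(t0, t1)  # the OLD (t0, t1)
    return temp0, temp1                 # assign only after both are computed

def sequential_step(t0, t1, alpha):     # the INCORRECT variant
    t0 = t0 - alpha * grad0(t0, t1)     # t0 changes first...
    t1 = t1 - alpha * grad1(t0, t1)     # ...so grad1 sees the new t0
    return t0, t1

print(simultaneous_step(1.0, 1.0, 0.1))  # second component: 0.7
print(sequential_step(1.0, 1.0, 0.1))    # second component differs: 0.73
```

With a separable cost the two variants coincide, but as soon as a gradient depends on both parameters, the sequential version computes a different (wrong) step.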
Gradient descent intuition
- The derivative is the slope of the tangent line; since the learning rate α > 0, a positive derivative decreases θ and a negative derivative increases θ, so J(θ) moves toward its minimum
- If α is too small, many steps are needed to reach the minimum, and gradient descent is slow
- If α is too large, gradient descent may overshoot the minimum, and may even fail to converge or diverge
- If θ is already at a local optimum, it will no longer change (the derivative there is zero)
- As gradient descent proceeds, the slope shrinks, so the magnitude of each update to θ also shrinks
- Gradient descent automatically takes smaller steps: as it approaches a local minimum the derivative automatically becomes smaller and smaller, so there is no need to decrease α separately
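The effect of the learning rate can be demonstrated on the simple cost J(θ) = θ², whose derivative is 2θ (a toy illustration of my own, not from the course):

```python
# Run gradient descent on J(theta) = theta^2 with a given learning rate.
def run(alpha, steps=50, theta=1.0):
    for _ in range(steps):
        theta = theta - alpha * 2 * theta  # theta := theta - alpha * dJ/dtheta
    return theta

print(abs(run(0.01)))  # small alpha: slow but steady progress toward 0
print(abs(run(0.1)))   # moderate alpha: much closer to the minimum
print(abs(run(1.5)))   # too-large alpha: the iterates overshoot and diverge
```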
Gradient descent for linear regression
$$\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)=\frac{\partial}{\partial\theta_j}\,\frac{1}{2m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2=\frac{\partial}{\partial\theta_j}\,\frac{1}{2m}\sum_{i=1}^m\left(\theta_0+\theta_1 x^{(i)}-y^{(i)}\right)^2$$
$$\frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1)=\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)\quad(j=0)$$
$$\frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1)=\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)\cdot x^{(i)}\quad(j=1)$$
repeat until convergence {
$$\theta_0:=\theta_0-\alpha\,\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)$$
$$\theta_1:=\theta_1-\alpha\,\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)\cdot x^{(i)}$$
} update θ_0 and θ_1 simultaneously
- The cost function of linear regression is always a convex function, a bowl-shaped function
- A convex function has no local optima; it has only a single global optimum
Batch Gradient Descent
"Batch": Each step of gradient descent uses all the training examples (every step sums over the entire training set)
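Putting the update rules together, batch gradient descent for univariate linear regression can be sketched as follows (a runnable illustration under my own names and toy data, not course-provided code):

```python
# Batch gradient descent for univariate linear regression: each iteration
# sums over ALL m examples and updates theta_0 and theta_1 simultaneously.
def gradient_descent(xs, ys, alpha=0.1, iters=1000):
    m = len(xs)
    theta0 = theta1 = 0.0
    for _ in range(iters):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                             # dJ/dtheta_0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # dJ/dtheta_1
        # simultaneous update: both gradients use the old parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

xs, ys = [0, 1, 2, 3], [1, 3, 5, 7]  # generated from y = 1 + 2x
t0, t1 = gradient_descent(xs, ys)
print(round(t0, 3), round(t1, 3))    # recovers values close to 1.0 and 2.0
```

Because the cost is convex, this converges to the single global optimum regardless of the starting point (given a suitable α).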
Quote from Professor Andrew Ng
- “Congrats on learning about your first Machine Learning algorithm.”