Brief Notes
- Related course link
- Since the course is taught in English, the content is recorded in English, with brief explanations added alongside.
- These notes exist purely to deepen my own understanding of the material; if anything is wrong, corrections are warmly welcomed.
- Many thanks to Professor Andrew Ng for his selfless dedication!
Table of Contents
Terminology
Term | 中文 | Term | 中文
---|---|---|---
feature | 特征 | Multivariate linear regression | 多元线性回归
Feature Scaling | 特征缩放 | Example automatic convergence test | 自动收敛测试
Polynomial regression | 多项式回归 | Normal equation | 正规方程
Multiple features
Training set of housing prices
- Notation
- n = number of features → the number of features (m, by contrast, is the number of training examples)
- x^(i) = input features of the i-th training example → the feature vector of the i-th training example
- x_j^(i) = value of feature j in the i-th training example → the value of the j-th feature in the i-th training example
Hypothesis
$$h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+\dots+\theta_nx_n$$
- For convenience of notation, define x_0 = 1 → i.e. x_0^{(i)} = 1 for every training example
$$h_\theta(x)=\theta_0x_0+\theta_1x_1+\theta_2x_2+\dots+\theta_nx_n=\theta^Tx$$
- Multivariate linear regression → linear regression with multiple input features
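A minimal sketch of the vectorized hypothesis in NumPy (the data and θ values here are made up for illustration): prepend a column of ones as x_0 = 1 so that h_θ(x) = θ^T x can be computed for all m examples at once.

```python
import numpy as np

# Hypothetical toy data: m = 3 examples, n = 2 features (e.g. size, number of bedrooms)
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0]])
theta = np.array([80.0, 0.1, 50.0])   # theta_0, theta_1, theta_2 (made-up values)

# Prepend x_0 = 1 to every example so the intercept folds into theta^T x
X_b = np.c_[np.ones(X.shape[0]), X]   # shape (m, n+1)

h = X_b @ theta                       # h_theta(x^(i)) for all m examples at once
print(h)
```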
Gradient descent for multiple variables
Formulation
- Hypothesis:
$$h_\theta(x)=\theta^Tx=\theta_0x_0+\theta_1x_1+\theta_2x_2+\dots+\theta_nx_n \quad (x_0=1)$$
- Parameters:
$$\theta_0,\theta_1,\dots,\theta_n \quad \text{or} \quad \theta,\ \text{an }(n+1)\text{-dimensional vector}$$
- Cost function:
$$J(\theta_0,\theta_1,\dots,\theta_n)=\frac{1}{2m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$
- Gradient descent:
Repeat {
$$\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1,\dots,\theta_n)$$
} (simultaneously update for every j = 0, … , n)
Gradient Descent ( n ≥ 1 )
Repeat {
$$\theta_j:=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$$
} (simultaneously update θ_j for every j = 0, … , n)
$$\theta_0:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_0^{(i)}$$
$$\theta_1:=\theta_1-\alpha\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_1^{(i)}$$
$$\theta_2:=\theta_2-\alpha\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_2^{(i)}$$
- Since x_0^{(i)} = 1, the θ_0 update has exactly the same form as the others, so the j = 0 case does not need to be treated separately.
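A minimal sketch of the cost function and batch gradient descent in NumPy, assuming an X_b matrix with the x_0 = 1 column already prepended (the function and variable names are illustrative, not from the course):

```python
import numpy as np

def compute_cost(X_b, y, theta):
    """J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    m = len(y)
    errors = X_b @ theta - y
    return (errors @ errors) / (2 * m)

def gradient_descent(X_b, y, theta, alpha=0.01, num_iters=1000):
    """Batch gradient descent with a simultaneous update of every theta_j."""
    m = len(y)
    J_history = []
    for _ in range(num_iters):
        errors = X_b @ theta - y              # shape (m,)
        gradient = (X_b.T @ errors) / m       # all partial derivatives at once
        theta = theta - alpha * gradient      # simultaneous update for j = 0..n
        J_history.append(compute_cost(X_b, y, theta))
    return theta, J_history
```

Recording J(θ) at every iteration makes the learning-rate check described below straightforward.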
Feature Scaling
- Make sure features are on a similar scale → bring all features into roughly the same range of values
- Feature scaling helps gradient descent converge faster
- Get every feature into approximately a -1 ≤ x_i ≤ 1 range → normalization
- The ranges do not need to match exactly; it is enough that each feature is roughly on the order of -1 to 1
- Mean normalization: replace x with x − μ so that the feature has an average value of 0
$$x=\frac{x-\mu}{s}$$
- where μ is the mean of feature x, and s is either the range of x (max − min) or its standard deviation
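A minimal sketch of mean normalization applied per feature column with NumPy (here s is taken as the standard deviation; max − min would also work):

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature column to roughly [-1, 1] via x := (x - mu) / s."""
    mu = X.mean(axis=0)   # per-feature mean
    s = X.std(axis=0)     # per-feature standard deviation (could also use max - min)
    return (X - mu) / s, mu, s
```

The same μ and s must be reused to scale any new input before making a prediction.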
Learning rate
- “Debugging”: how to make sure gradient descent is working correctly
- Plot the cost against iterations: the x-axis is the number of gradient-descent iterations, the y-axis is the value of the cost function being minimized
- If gradient descent is working properly, J(θ) should decrease after every iteration; the plot also shows whether the algorithm has converged
- Example automatic convergence test: declare convergence if J(θ) decreases by less than some small value ε in one iteration
- For sufficiently small α, J(θ) should decrease on every iteration
- But if α is too small, gradient descent can be slow to converge
- If α is too small: slow convergence
- If α is too large: J(θ) may not decrease on every iteration and may not converge at all
- To choose α, try …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, … → increase each candidate by roughly a factor of 3
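A minimal sketch of the automatic convergence test, assuming the J_history list produced by the hypothetical gradient_descent above (the threshold ε = 10⁻³ is only illustrative):

```python
def has_converged(J_history, epsilon=1e-3):
    """Declare convergence once J(theta) drops by less than epsilon in one iteration."""
    return len(J_history) >= 2 and (J_history[-2] - J_history[-1]) < epsilon

# Candidate learning rates spaced by roughly a factor of 3
alphas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
```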
Features and polynomial regression
Polynomial regression
$$h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_3=\theta_0+\theta_1x+\theta_2x^2+\theta_3x^3$$
- The three derived features (x, x², x³) span very different ranges, so if gradient descent is used, feature scaling is essential to make their values comparable
$$h_\theta(x)=\theta_0+\theta_1x+\theta_2x^2$$
$$h_\theta(x)=\theta_0+\theta_1x+\theta_2\sqrt{x}$$
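A minimal sketch of constructing polynomial (and square-root) features from a single hypothetical input x, after which the same linear-regression machinery applies:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])          # hypothetical single feature, e.g. house size

# Cubic model: x_1 = x, x_2 = x^2, x_3 = x^3 (feature scaling matters a lot here)
X_cubic = np.column_stack([x, x**2, x**3])

# Alternative model with a square-root term: x_1 = x, x_2 = sqrt(x)
X_sqrt = np.column_stack([x, np.sqrt(x)])
```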
Normal equation
Normal equation
Normal equation: method to solve for θ analytically → a closed-form solution for θ
- To minimize a function, take its derivative, set it to zero, and solve for the θ at the minimum
- To find the θ minimizing the cost function J, take the partial derivative of J with respect to each θ_j and set them all to zero
$$\theta=(X^TX)^{-1}X^Ty$$
- Design matrix X: each row is the transpose of one training example, (x^{(i)})^T
- (X^TX)^{-1} is the inverse of the matrix X^TX
- With the normal equation method, feature scaling is not needed
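A minimal sketch of the normal equation in NumPy; np.linalg.solve is used here instead of forming the inverse explicitly, a common numerical choice rather than anything prescribed by the course:

```python
import numpy as np

def normal_equation(X_b, y):
    """theta = (X^T X)^{-1} X^T y, computed by solving the linear system directly."""
    return np.linalg.solve(X_b.T @ X_b, X_b.T @ y)
```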
Gradient Descent vs. Normal Equation
- A drawback of gradient descent is that the learning rate α must be chosen, which usually means running it several times
- Gradient descent also needs many iterations, which can make it slow
- The normal equation needs neither a learning rate α nor any iteration
- Gradient descent works well even when the number of features n is very large
- The normal equation has to compute the inverse of X^TX, an n × n matrix, which becomes slow when n is large
- If n is large (n ≥ 10,000), prefer gradient descent; if n is fairly small (n = 100 or 1,000), the normal equation works well
- The normal equation method does not carry over to more complex learning algorithms
Normal equation and non-invertibility
$$\theta=(X^TX)^{-1}X^Ty$$
- What if X^TX is non-invertible? (singular/degenerate)
- Redundant features (linearly dependent) → some features are linear combinations of others
- Too many features (m ≤ n) → too many parameters for too few training examples
- Delete some features
- Use regularization
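In practice a pseudoinverse handles the non-invertible case; a minimal sketch with np.linalg.pinv (analogous to Octave's pinv mentioned in the course, though this particular function and its names are my own illustration):

```python
import numpy as np

def normal_equation_pinv(X_b, y):
    """Moore-Penrose pseudoinverse: still returns a theta even when X^T X is singular."""
    return np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y
```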
Quotes from Professor Andrew Ng
- “So, gradient descent is a very useful algorithm to know.”