[Andrew Ng Machine Learning] Week 2 Condensed Notes: Multivariate Linear Regression and Computing Parameters Analytically

1. Multivariate Linear Regression

(1)Multiple Features

We now introduce notation for equations where we can have any number of input variables.

$x_j^{(i)}$: value of feature $j$ in the $i$-th training example
$x^{(i)}$: the input (features) of the $i$-th training example
$m$: the number of training examples
$n$: the number of features

$$
\begin{aligned}
h_\theta(x) &= \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots + \theta_n x_n \\
&= \begin{bmatrix} \theta_0 & \theta_1 & \dots & \theta_n \end{bmatrix}
   \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} \\
&= \theta^T x
\end{aligned}
$$

Remark: for convenience, in this course we assume $x_0^{(i)} = 1$ for $i \in \{1, \dots, m\}$. This allows us to do matrix operations with $\theta$ and $x$, making the two vectors $\theta$ and $x^{(i)}$ match each other element-wise (that is, have the same number of elements: $n+1$).
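
As a small illustration of this vectorized form (a minimal NumPy sketch of my own, not from the course; the toy values of `X` and `theta` are made up), the hypothesis for all $m$ training examples can be computed at once by prepending the $x_0 = 1$ column and taking one matrix product:

```python
import numpy as np

# Hypothetical toy data: m = 3 examples, n = 2 features (e.g. size, #bedrooms).
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0]])
theta = np.array([10.0, 0.1, 5.0])   # theta_0, theta_1, theta_2

# Prepend x_0 = 1 so theta and each x^{(i)} both have n + 1 elements.
X_b = np.c_[np.ones(X.shape[0]), X]

# h_theta(x) = theta^T x for every example, computed as one matrix-vector product.
h = X_b @ theta
print(h)                             # one prediction per training example
```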



(2)Gradient Descent For Multiple Variables

Cost Function:
$$
J(\theta_0, \theta_1, \dots, \theta_n) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
$$

Gradient descent:
$$
\begin{aligned}
&\text{repeat until convergence: } \{ \\
&\qquad \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
  \qquad \text{for } j := 0, \dots, n \\
&\}
\end{aligned}
$$
(update all $\theta_j$ simultaneously)
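
A minimal NumPy sketch of this simultaneous update (my own illustration, not the course's Octave code; it assumes `X` already contains the $x_0 = 1$ column and `y` holds the targets):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for multivariate linear regression.

    X : (m, n+1) design matrix including the x_0 = 1 column
    y : (m,)    target values
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        error = X @ theta - y              # h_theta(x^{(i)}) - y^{(i)} for all i
        gradient = (X.T @ error) / m       # (1/m) * sum_i error_i * x_j^{(i)} for every j
        theta = theta - alpha * gradient   # simultaneous update of all theta_j
    return theta
```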

(3)Feature Scaling

In order to get gradient descent to run faster and converge in far fewer iterations, we can speed it up by keeping each of our input values in roughly the same range. This is because θ descends quickly on small ranges and slowly on large ranges, so it oscillates inefficiently down to the optimum when the variables are very uneven.

The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same.

Ideally:
$$
-1 \le x_i \le 1 \quad \text{or} \quad -0.5 \le x_i \le 0.5
$$

Two techniques to help with this are feature scaling and mean normalization. Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value of an input variable from the values of that input variable, resulting in a new average value of just zero. To implement both of these techniques, adjust your input values as shown in this formula:

$$
x_i := \frac{x_i - \mu_i}{s_i}
$$

where $\mu_i$ is the average of all the values for feature $i$, and $s_i$ is the range of values (max - min) or the standard deviation.
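
A short sketch of this adjustment (my own NumPy illustration; here $s_i$ is taken to be the standard deviation, but the max - min range works the same way):

```python
import numpy as np

def mean_normalize(X):
    """Apply x_i := (x_i - mu_i) / s_i to every feature column.

    X : (m, n) feature matrix WITHOUT the x_0 = 1 column.
    Returns the scaled matrix plus mu and s, which must be reused to
    scale any new example before making a prediction.
    """
    mu = X.mean(axis=0)          # mu_i: average value of feature i
    s = X.std(axis=0)            # s_i: standard deviation (max - min also works)
    return (X - mu) / s, mu, s
```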



(4)Learning Rate

Cost Function:
$$
J(\theta_0, \theta_1, \dots, \theta_n) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
$$

Gradient descent:
$$
\begin{aligned}
\theta_j &:= \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j} \\
&:= \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
  \qquad \text{for } j := 0, 1, \dots, n
\end{aligned}
$$

Problems:
(1) "Debugging": how to make sure gradient descent is working correctly.
(2) How to choose the learning rate $\alpha$.

Methods:
(1) Debugging gradient descent. Make a plot with the number of iterations on the x-axis and plot the cost function J(θ) against the number of iterations of gradient descent. If J(θ) ever increases, you probably need to decrease α.

(2) Automatic convergence test. Declare convergence if J(θ) decreases by less than E in one iteration, where E is some small value such as $10^{-3}$. However, in practice it is difficult to choose this threshold value.

It has been proven that if the learning rate α is sufficiently small, then J(θ) will decrease on every iteration.

To summarize:
(1) If $\alpha$ is too small: slow convergence.
(2) If $\alpha$ is too large: J(θ) may not decrease on every iteration and thus may not converge.
(3) To choose $\alpha$, try values roughly 3× apart, for example:
…, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
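
As a hedged illustration of both ideas above, the debugging plot and the roughly-3× ladder of learning rates (a self-contained sketch of my own; the toy data and names are made up), J(θ) can be recorded every iteration and inspected for each candidate α:

```python
import numpy as np

def cost(X, y, theta):
    """J(theta) = (1/2m) * sum of squared errors."""
    error = X @ theta - y
    return (error @ error) / (2 * len(y))

def gradient_descent_with_history(X, y, alpha, num_iters=400):
    """Run gradient descent and record J(theta) after every iteration."""
    theta = np.zeros(X.shape[1])
    J_history = []
    for _ in range(num_iters):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / len(y)
        J_history.append(cost(X, y, theta))
    return theta, J_history

# Hypothetical, already-scaled toy data (x_0 = 1 column included).
X = np.c_[np.ones(4), np.array([-1.0, -0.3, 0.4, 0.9])]
y = np.array([1.0, 2.0, 3.5, 4.0])

# Try learning rates roughly 3x apart; plot (or print) J_history and keep the
# largest alpha for which J(theta) still decreases on every iteration.
for alpha in (0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0):
    _, J_history = gradient_descent_with_history(X, y, alpha)
    monotone = all(b <= a for a, b in zip(J_history, J_history[1:]))
    print(f"alpha={alpha}: final J = {J_history[-1]:.4f}, always decreasing = {monotone}")
```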



(5)Features and Polynomial Regression

We can improve our features and the form of our hypothesis function in a couple different ways.

We can combine multiple features into one. For example, we can combine $x_1$ and $x_2$ into a new feature $x_3$ by taking $x_1 \cdot x_2$.

We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
So, by having insight into the shape of a square-root function (in this case) and into the shape of the data, and by choosing different features, you can sometimes get better models.
In the cubic version, $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3$, we have created new features $x_2 = x_1^2$ and $x_3 = x_1^3$.

To make it a square root function, we could do: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 \sqrt{x_1}$

If you create features this way, it is important to apply feature scaling when using gradient descent, so that the new features stay in comparable ranges of values.
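
A minimal sketch of building such derived features before running gradient descent (my own NumPy illustration; the single input `x1` and the particular powers chosen are assumptions):

```python
import numpy as np

# Hypothetical single input feature, e.g. a size measurement.
x1 = np.array([50.0, 80.0, 120.0, 200.0])

# Derived features for a cubic hypothesis: x2 = x1^2, x3 = x1^3.
# For the square-root form, use np.sqrt(x1) instead.
X_poly = np.c_[x1, x1 ** 2, x1 ** 3]

# Feature scaling matters a lot here: the three columns now span
# very different ranges because the powers grow so quickly.
mu = X_poly.mean(axis=0)
s = X_poly.std(axis=0)
X_scaled = (X_poly - mu) / s

# Prepend x_0 = 1 and run gradient descent on X_b as usual.
X_b = np.c_[np.ones(len(x1)), X_scaled]
print(X_b.round(3))
```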



2. Computing Parameters Analytically

(1)Normal Equation

Gradient descent gives one way of minimizing J. Let's discuss a second way of doing so, this time performing the minimization explicitly and without resorting to an iterative algorithm.

Normal equation: a method to solve for $\theta$ analytically, so that rather than needing to run an iterative algorithm, we can instead solve for the optimal value of $\theta$ all at one go. The normal equation formula is given below:
$$
\theta = (X^T X)^{-1} X^T y
$$
There is no need to do feature scaling with the normal equation.

The following is a comparison of gradient descent and the normal equation:

Gradient descent: need to choose $\alpha$; needs many iterations; works well even when $n$ is large.
Normal equation: no need to choose $\alpha$; no need to iterate; must compute $(X^TX)^{-1}$, which is slow if $n$ is very large.
With the normal equation, computing the inverse has complexity $\mathcal{O}(n^3)$. So if we have a very large number of features, the normal equation will be slow. In practice, when $n$ exceeds 10,000 it might be a good time to go from an analytical solution to an iterative process.
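
A small sketch of the closed-form solve (my own NumPy illustration, not the course's Octave code; `np.linalg.pinv` is used so the computation also behaves sensibly when $X^TX$ is not invertible, see the next subsection):

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y, computed via the pseudo-inverse.

    X : (m, n+1) design matrix including the x_0 = 1 column
        (no feature scaling is needed here)
    y : (m,)    target values
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Hypothetical toy data: y = 1 + 2 * x fits exactly.
X = np.c_[np.ones(3), np.array([1.0, 2.0, 3.0])]
y = np.array([3.0, 5.0, 7.0])
print(normal_equation(X, y))   # approximately [1. 2.]
```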



(2)Normal Equation Noninvertibility

If $X^TX$ is noninvertible, the common causes might be:

(1) Redundant features, where two features are very closely related (i.e. they are linearly dependent).

(2) Too many features (e.g. $m \le n$). In this case, delete some features or use "regularization" (to be explained in a later lesson).

Solutions to the above problems include deleting a feature that is linearly dependent with another or deleting one or more features when there are too many features.

Note that the issue of $X^TX$ being non-invertible should happen pretty rarely.
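
To make the redundant-feature case concrete (a sketch of my own; the data are made up), the following shows $X^TX$ losing rank when one column is an exact multiple of another, and the pseudo-inverse still producing a usable θ, which is the same point the course makes about Octave's `pinv` versus `inv`:

```python
import numpy as np

# Redundant features: the third column is exactly 3 * the second, so the
# columns of X are linearly dependent and X^T X is singular.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.c_[np.ones(4), x1, 3.0 * x1]
y = np.array([2.0, 4.0, 6.0, 8.0])

A = X.T @ X
print(np.linalg.matrix_rank(A))        # 2 instead of 3: A is non-invertible

# np.linalg.inv(A) is not safe here; the pseudo-inverse still returns a theta
# that minimizes the cost (one of infinitely many in this degenerate case).
theta = np.linalg.pinv(A) @ X.T @ y
print(theta)
```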



Exercise 1: Linear Regression, Feature Scaling, and the Normal Equation

[Andrew Ng Machine Learning] Week 2 Programming Assignment: Linear Regression, Feature Scaling, and the Normal Equation
