[Andrew Ng Machine Learning] Week 2 Condensed Notes: Multivariate Linear Regression and Computing Parameters Analytically

1. Multivariate Linear Regression

(1)Multiple Features

We now introduce notation for equations where we can have any number of input variables.

$x_j^{(i)}$: value of feature $j$ in the $i$-th training example
$x^{(i)}$: the input (features) of the $i$-th training example
$m$: the number of training examples
$n$: the number of features

$$
\begin{aligned}
h_\theta(x) &= \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots + \theta_n x_n \\
&= \begin{bmatrix} \theta_0 & \theta_1 & \dots & \theta_n \end{bmatrix}
   \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} \\
&= \theta^T x
\end{aligned}
$$

Remark: for convenience, in this course we assume $x_0^{(i)} = 1$ for $i \in \{1, \dots, m\}$. This allows us to do matrix operations with $\theta$ and $x$, making the two vectors $\theta$ and $x^{(i)}$ match each other element-wise (that is, have the same number of elements: $n+1$).
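
As a small illustration of this vectorized form (a minimal NumPy sketch of my own, not from the course; the toy values of `X` and `theta` are made up), the hypothesis for all $m$ training examples can be computed at once by prepending the $x_0 = 1$ column and taking one matrix product:

```python
import numpy as np

# Hypothetical toy data: m = 3 examples, n = 2 features (e.g. size, #bedrooms).
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0]])
theta = np.array([10.0, 0.1, 5.0])   # theta_0, theta_1, theta_2

# Prepend x_0 = 1 so theta and each x^{(i)} both have n + 1 elements.
X_b = np.c_[np.ones(X.shape[0]), X]

# h_theta(x) = theta^T x for every example, computed as one matrix-vector product.
h = X_b @ theta
print(h)                             # one prediction per training example
```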



(2)Gradient Descent For Multiple Variables

Cost Function:
$$
J(\theta_0, \theta_1, \dots, \theta_n) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
$$

Gradient descent:
$$
\begin{aligned}
&\text{repeat until convergence: } \{ \\
&\qquad \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
  \qquad \text{for } j := 0, \dots, n \\
&\}
\end{aligned}
$$
(update all $\theta_j$ simultaneously)
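
A minimal NumPy sketch of this simultaneous update (my own illustration, not the course's Octave code; it assumes `X` already contains the $x_0 = 1$ column and `y` holds the targets):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for multivariate linear regression.

    X : (m, n+1) design matrix including the x_0 = 1 column
    y : (m,)    target values
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        error = X @ theta - y              # h_theta(x^{(i)}) - y^{(i)} for all i
        gradient = (X.T @ error) / m       # (1/m) * sum_i error_i * x_j^{(i)} for every j
        theta = theta - alpha * gradient   # simultaneous update of all theta_j
    return theta
```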

(3)Feature Scaling

In order to get gradient descent to run faster and converge in far fewer iterations, we can speed it up by keeping each of our input values in roughly the same range. This is because θ descends quickly on small ranges and slowly on large ranges, so it oscillates inefficiently down to the optimum when the variables are very uneven.

The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same.

Ideally:
$$
-1 \le x_i \le 1 \quad \text{or} \quad -0.5 \le x_i \le 0.5
$$

Two techniques to help with this are feature scaling and mean normalization. Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value of an input variable from the values of that input variable, resulting in a new average value of just zero. To implement both of these techniques, adjust your input values as shown in this formula:

$$
x_i := \frac{x_i - \mu_i}{s_i}
$$

where $\mu_i$ is the average of all the values for feature $i$, and $s_i$ is the range of values (max - min) or the standard deviation.
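
A short sketch of this adjustment (my own NumPy illustration; here $s_i$ is taken to be the standard deviation, but the max - min range works the same way):

```python
import numpy as np

def mean_normalize(X):
    """Apply x_i := (x_i - mu_i) / s_i to every feature column.

    X : (m, n) feature matrix WITHOUT the x_0 = 1 column.
    Returns the scaled matrix plus mu and s, which must be reused to
    scale any new example before making a prediction.
    """
    mu = X.mean(axis=0)          # mu_i: average value of feature i
    s = X.std(axis=0)            # s_i: standard deviation (max - min also works)
    return (X - mu) / s, mu, s
```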



(4)Learning Rate

Cost Function:
$$
J(\theta_0, \theta_1, \dots, \theta_n) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
$$

Gradient descent:
$$
\begin{aligned}
\theta_j &:= \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j} \\
&:= \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
  \qquad \text{for } j := 0, 1, \dots, n
\end{aligned}
$$

Problems:
(1) "Debugging": how to make sure gradient descent is working correctly.
(2) How to choose the learning rate $\alpha$.

Methods:
(1) Debugging gradient descent. Make a plot with the number of iterations on the x-axis and plot the cost function J(θ) against the number of iterations of gradient descent. If J(θ) ever increases, you probably need to decrease α.

(2) Automatic convergence test. Declare convergence if J(θ) decreases by less than E in one iteration, where E is some small value such as $10^{-3}$. However, in practice it is difficult to choose this threshold value.

It has been proven that if the learning rate α is sufficiently small, then J(θ) will decrease on every iteration.

To summarize:
(1) If $\alpha$ is too small: slow convergence.
(2) If $\alpha$ is too large: J(θ) may not decrease on every iteration and thus may not converge.
(3) To choose $\alpha$, try values roughly 3× apart, for example:
…, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
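
As a hedged illustration of both ideas above, the debugging plot and the roughly-3× ladder of learning rates (a self-contained sketch of my own; the toy data and names are made up), J(θ) can be recorded every iteration and inspected for each candidate α:

```python
import numpy as np

def cost(X, y, theta):
    """J(theta) = (1/2m) * sum of squared errors."""
    error = X @ theta - y
    return (error @ error) / (2 * len(y))

def gradient_descent_with_history(X, y, alpha, num_iters=400):
    """Run gradient descent and record J(theta) after every iteration."""
    theta = np.zeros(X.shape[1])
    J_history = []
    for _ in range(num_iters):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / len(y)
        J_history.append(cost(X, y, theta))
    return theta, J_history

# Hypothetical, already-scaled toy data (x_0 = 1 column included).
X = np.c_[np.ones(4), np.array([-1.0, -0.3, 0.4, 0.9])]
y = np.array([1.0, 2.0, 3.5, 4.0])

# Try learning rates roughly 3x apart; plot (or print) J_history and keep the
# largest alpha for which J(theta) still decreases on every iteration.
for alpha in (0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0):
    _, J_history = gradient_descent_with_history(X, y, alpha)
    monotone = all(b <= a for a, b in zip(J_history, J_history[1:]))
    print(f"alpha={alpha}: final J = {J_history[-1]:.4f}, always decreasing = {monotone}")
```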



(5)Features and Polynomial Regression

We can improve our features and the form of our hypothesis function in a couple different ways.

We can combine multiple features into one. For example, we can combine $x_1$ and $x_2$ into a new feature $x_3$ by taking $x_1 \cdot x_2$.

We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
So, by having insight into the shape of a square-root function (in this case) and into the shape of the data, and by choosing different features, you can sometimes get better models.
In the cubic version, $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3$, we have created new features $x_2 = x_1^2$ and $x_3 = x_1^3$.

To make it a square root function, we could do: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 \sqrt{x_1}$

If you create features this way, it is important to apply feature scaling when using gradient descent, so that the new features stay in comparable ranges of values.
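
A minimal sketch of building such derived features before running gradient descent (my own NumPy illustration; the single input `x1` and the particular powers chosen are assumptions):

```python
import numpy as np

# Hypothetical single input feature, e.g. a size measurement.
x1 = np.array([50.0, 80.0, 120.0, 200.0])

# Derived features for a cubic hypothesis: x2 = x1^2, x3 = x1^3.
# For the square-root form, use np.sqrt(x1) instead.
X_poly = np.c_[x1, x1 ** 2, x1 ** 3]

# Feature scaling matters a lot here: the three columns now span
# very different ranges because the powers grow so quickly.
mu = X_poly.mean(axis=0)
s = X_poly.std(axis=0)
X_scaled = (X_poly - mu) / s

# Prepend x_0 = 1 and run gradient descent on X_b as usual.
X_b = np.c_[np.ones(len(x1)), X_scaled]
print(X_b.round(3))
```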



2. Computing Parameters Analytically

(1)Normal Equation

Gradient descent gives one way of minimizing J. Let's discuss a second way of doing so, this time performing the minimization explicitly and without resorting to an iterative algorithm.

Normal equation: a method to solve for $\theta$ analytically, so that rather than needing to run an iterative algorithm, we can instead solve for the optimal value of $\theta$ all at one go. The normal equation formula is given below:
$$
\theta = (X^T X)^{-1} X^T y
$$
There is no need to do feature scaling with the normal equation.

The following is a comparison of gradient descent and the normal equation:

Gradient descent: need to choose $\alpha$; needs many iterations; works well even when $n$ is large.
Normal equation: no need to choose $\alpha$; no need to iterate; must compute $(X^TX)^{-1}$, which is slow if $n$ is very large.
With the normal equation, computing the inverse has complexity $\mathcal{O}(n^3)$. So if we have a very large number of features, the normal equation will be slow. In practice, when $n$ exceeds 10,000 it might be a good time to go from an analytical solution to an iterative process.
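
A small sketch of the closed-form solve (my own NumPy illustration, not the course's Octave code; `np.linalg.pinv` is used so the computation also behaves sensibly when $X^TX$ is not invertible, see the next subsection):

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y, computed via the pseudo-inverse.

    X : (m, n+1) design matrix including the x_0 = 1 column
        (no feature scaling is needed here)
    y : (m,)    target values
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Hypothetical toy data: y = 1 + 2 * x fits exactly.
X = np.c_[np.ones(3), np.array([1.0, 2.0, 3.0])]
y = np.array([3.0, 5.0, 7.0])
print(normal_equation(X, y))   # approximately [1. 2.]
```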



(2)Normal Equation Noninvertibility

If $X^TX$ is noninvertible, the common causes might be:

(1) Redundant features, where two features are very closely related (i.e. they are linearly dependent).

(2) Too many features (e.g. $m \le n$). In this case, delete some features or use "regularization" (to be explained in a later lesson).

Solutions to the above problems include deleting a feature that is linearly dependent with another or deleting one or more features when there are too many features.

Note that the issue of $X^TX$ being non-invertible should happen pretty rarely.
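
To make the redundant-feature case concrete (a sketch of my own; the data are made up), the following shows $X^TX$ losing rank when one column is an exact multiple of another, and the pseudo-inverse still producing a usable θ, which is the same point the course makes about Octave's `pinv` versus `inv`:

```python
import numpy as np

# Redundant features: the third column is exactly 3 * the second, so the
# columns of X are linearly dependent and X^T X is singular.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.c_[np.ones(4), x1, 3.0 * x1]
y = np.array([2.0, 4.0, 6.0, 8.0])

A = X.T @ X
print(np.linalg.matrix_rank(A))        # 2 instead of 3: A is non-invertible

# np.linalg.inv(A) is not safe here; the pseudo-inverse still returns a theta
# that minimizes the cost (one of infinitely many in this degenerate case).
theta = np.linalg.pinv(A) @ X.T @ y
print(theta)
```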



Exercise 1: Linear Regression, Feature Scaling, and the Normal Equation

[Andrew Ng Machine Learning] Week 2 Programming Assignment: Linear Regression, Feature Scaling, and the Normal Equation
