Exercise 3: Multivariate Linear Regression

ytlcainiao

In this exercise, you will investigate multivariate linear regression using gradient descent and the normal equations. You will also examine the relationship between the cost function $J(\theta)$, the convergence of gradient descent, and the learning rate $\alpha$.

Data

This is a training set of housing prices in Portland, Oregon, where the outputs $y^{(i)}$ are the prices and the inputs $x^{(i)}$ are the living area and the number of bedrooms. There are $m$ training examples.

Preprocessing your data

Load the data for the training examples into your program and add the intercept term into your x matrix. Recall that the command in Matlab/Octave for adding a column of ones is

x = [ones(m, 1), x];

Take a look at the values of the inputs $x^{(i)}$ and note that the living areas are about 1000 times the number of bedrooms. This difference means that preprocessing the inputs will significantly increase gradient descent's efficiency.

In your program, scale both types of inputs by their standard deviations and set their means to zero. In Matlab/Octave, this can be executed with

sigma = std(x);
mu = mean(x);
x(:,2) = (x(:,2) - mu(2))./ sigma(2);
x(:,3) = (x(:,3) - mu(3))./ sigma(3);
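
As a minimal sketch of the scaling step, the snippet below applies the same commands to a small made-up matrix and checks that the scaled columns end up with zero mean and unit standard deviation. The values in `x_raw` are illustrative assumptions, not the exercise data.

```octave
% Toy inputs (made up for illustration): living area and number of bedrooms
x_raw = [2100 3; 1600 3; 2400 4; 1400 2; 3000 4];
m = size(x_raw, 1);

x = [ones(m, 1), x_raw];   % add the intercept column

sigma = std(x);            % sigma(1) is 0 for the column of ones; it is never used
mu = mean(x);
x(:,2) = (x(:,2) - mu(2)) ./ sigma(2);
x(:,3) = (x(:,3) - mu(3)) ./ sigma(3);

% After scaling, each feature column should have mean 0 and standard deviation 1
disp(mean(x(:,2:3)));
disp(std(x(:,2:3)));
```

Note that the intercept column is left untouched: dividing it by its standard deviation of zero would produce NaN or Inf values.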

Gradient descent

Previously, you implemented gradient descent on a univariate regression problem. The only difference now is that there is one more feature in the matrix x.

The hypothesis function is still

$$h_{\theta}(x) = \theta^T x = \sum_{i=0}^n \theta_i x_i,$$

and the batch gradient descent update rule is

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_{\theta}(x^{(i)}) - y^{(i)}\right) x_j^{(i)} \qquad \text{(for all } j\text{)}$$

Once again, initialize your parameters to $\theta = \vec{0}$.
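
Because the update has the same form for every $j$, all parameters can be updated in one step. The block below restates the rule in matrix form, writing $X$ for the matrix whose $i$-th row is $(x^{(i)})^T$ and $\vec{y}$ for the column vector of outputs. In Matlab/Octave this would correspond to a single line such as `theta = theta - alpha / m * x' * (x * theta - y);`, assuming `y` is stored as a column vector.

```latex
\theta := \theta - \frac{\alpha}{m}\, X^{T}\left(X\theta - \vec{y}\right),
\qquad\text{since}\qquad
\left[X^{T}\left(X\theta - \vec{y}\right)\right]_j
  = \sum_{i=1}^{m} \left(h_{\theta}(x^{(i)}) - y^{(i)}\right) x_j^{(i)}.
```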

Selecting a learning rate using $J(\theta)$

Now it's time to select a learning rate $\alpha$. The goal of this part is to pick a good learning rate in the range of

$$0.001 \leq \alpha \leq 10$$

You will do this by making an initial selection, running gradient descent and observing the cost function, and adjusting the learning rate accordingly. Recall that the cost function is defined as

$$J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^{2}.$$

The cost function can also be written in the following vectorized form,

$$J(\theta) = \frac{1}{2m}\left(X\theta-\vec{y}\right)^{T}\left(X\theta-\vec{y}\right)$$

where

$$\vec{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}, \qquad X = \begin{bmatrix} \text{-}\,(x^{(1)})^T\,\text{-} \\ \text{-}\,(x^{(2)})^T\,\text{-} \\ \vdots \\ \text{-}\,(x^{(m)})^T\,\text{-} \end{bmatrix}$$

The vectorized version is useful and efficient when you're working with numerical computing tools like Matlab/Octave. If you are familiar with matrices, you can prove to yourself that the two forms are equivalent.
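
A quick numerical check can stand in for the proof. The sketch below evaluates the cost both ways on a small made-up data set and prints the two results; the matrix `X`, the vector `y`, and the parameter values in `theta` are all arbitrary assumptions chosen only for illustration.

```octave
% Toy data (made up): intercept column plus two features
X = [1 0.5 -1.0; 1 -1.2 0.3; 1 0.7 0.7; 1 0.0 -0.4];
y = [3; 1; 4; 2];
theta = [0.5; -1; 2];
m = length(y);

% Summation form: loop over the training examples
J_sum = 0;
for i = 1:m
    h_i = theta' * X(i,:)';          % hypothesis for example i
    J_sum = J_sum + (h_i - y(i))^2;
end
J_sum = J_sum / (2 * m);

% Vectorized form
r = X * theta - y;                   % residual vector
J_vec = (r' * r) / (2 * m);

fprintf('J_sum = %.6f, J_vec = %.6f\n', J_sum, J_vec);
```

The two numbers should agree up to floating-point rounding, because `r' * r` is the sum of the squared residuals.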

While in the previous exercise you calculated $J(\theta)$ over a grid of $\theta_0$ and $\theta_1$ values, you will now calculate $J(\theta)$ using the $\theta$ of the current stage of gradient descent. After stepping through many stages, you will see how $J(\theta)$ changes as the iterations advance.

Now, run gradient descent for about 50 iterations at your initial learning rate. In each iteration, calculate $J(\theta)$ and store the result in a vector J. After the last iteration, plot the J values against the number of the iteration. In Matlab/Octave, the steps would look something like this:

theta = zeros(size(x(1,:)))'; % initialize fitting parameters
alpha = %% Your initial learning rate %%
J = zeros(50, 1); 

for num_iterations = 1:50
    J(num_iterations) = %% Calculate your cost function here %%
    theta = %% Result of gradient descent update %%
end

% now plot J
% technically, the first J starts at the zero-eth iteration
% but Matlab/Octave doesn't have a zero index
figure;
plot(0:49, J(1:50), '-')
xlabel('Number of iterations')
ylabel('Cost J')
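
For reference, one possible way to fill in the two placeholders is sketched below. It is not the official solution: the data set is synthetic and assumed to be already scaled, and `alpha = 0.1` is an arbitrary starting value rather than a recommendation.

```octave
% Synthetic, already-scaled data (made up for illustration)
x = [ones(6, 1), [-1.2 0.4; -0.6 -1.1; 0.0 0.9; 0.3 -0.5; 0.6 1.2; 0.9 -0.9]];
y = [1.0; 2.5; 2.0; 3.5; 3.0; 4.5];
m = length(y);

theta = zeros(size(x(1,:)))';   % initialize fitting parameters
alpha = 0.1;                    % assumed starting value
J = zeros(50, 1);

for num_iterations = 1:50
    r = x * theta - y;                       % residuals at the current theta
    J(num_iterations) = (r' * r) / (2 * m);  % cost before this update
    theta = theta - alpha / m * (x' * r);    % batch gradient descent update
end
```

The cost is recorded before each update, which is why the first entry of `J` corresponds to iteration zero in the plot commands above.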

If you picked a learning rate within a good range, your plot should appear like the figure below.

If your graph looks very different, especially if your value of $J(\theta)$ increases or even blows up, adjust your learning rate and try again. We recommend testing values of alpha that are each about 3 times the next smallest value (e.g. 0.01, 0.03, 0.1, 0.3 and so on). You may also want to adjust the number of iterations you are running if that will help you see the overall trend in the curve.

To compare how different learning rates affect convergence, it's helpful to plot J for several learning rates on the same graph. In Matlab/Octave, this can be done by performing gradient descent multiple times with a 'hold on' command between plots. Concretely, if you've tried three different values of alpha (you should probably try more values than this) and stored the costs in J1, J2 and J3, you can use the following commands to plot them on the same figure:

plot(0:49, J1(1:50), 'b-');
hold on;
plot(0:49, J2(1:50), 'r-');
plot(0:49, J3(1:50), 'k-');

The final arguments 'b-', 'r-', and 'k-' specify different plot styles for the plots. Type

help plot

at the Matlab/Octave command line for more information on plot styles.
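
When trying more than a handful of learning rates, it can be tidier to store the costs as columns of one matrix instead of separate variables. The sketch below does this on synthetic, already-scaled data; the data, the list of learning rates in `alphas`, and the names `J_all` and `styles` are assumptions made for illustration.

```octave
% Synthetic, already-scaled data (made up for illustration)
x = [ones(6, 1), [-1.2 0.4; -0.6 -1.1; 0.0 0.9; 0.3 -0.5; 0.6 1.2; 0.9 -0.9]];
y = [1.0; 2.5; 2.0; 3.5; 3.0; 4.5];
m = length(y);

alphas = [0.01, 0.03, 0.1, 0.3];     % candidate learning rates
styles = {'b-', 'r-', 'k-', 'g-'};   % one plot style per learning rate
J_all = zeros(50, numel(alphas));    % column k holds the costs for alphas(k)

for k = 1:numel(alphas)
    theta = zeros(size(x, 2), 1);    % restart from zero for each learning rate
    for num_iterations = 1:50
        r = x * theta - y;
        J_all(num_iterations, k) = (r' * r) / (2 * m);
        theta = theta - alphas(k) / m * (x' * r);
    end
end

figure;
hold on;
for k = 1:numel(alphas)
    plot(0:49, J_all(:, k), styles{k});
end
hold off;
legend('alpha = 0.01', 'alpha = 0.03', 'alpha = 0.1', 'alpha = 0.3');
xlabel('Number of iterations');
ylabel('Cost J');
```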

Observe how the cost function changes as the learning rate changes. What happens when the learning rate is too small? Too large?

Using the best learning rate that you found, run gradient descent until convergence to find

1. The final values of $\theta$

2. The predicted price of a house with 1650 square feet and 3 bedrooms. Don't forget to scale your features when you make this prediction!
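
The scaling of the query point is the step that is easiest to get wrong, so here is a hedged sketch of it on a made-up training set. The prices are generated from an invented linear rule (50000 + 100 per square foot + 10000 per bedroom), and `alpha = 1` with 1000 iterations is assumed to be enough for this toy data; none of these numbers come from the exercise.

```octave
% Toy training set (made up): price is an exact linear function of the inputs
x_raw = [2100 3; 1600 3; 2400 4; 1400 2; 3000 4];
y = 50000 + 100 * x_raw(:,1) + 10000 * x_raw(:,2);
m = length(y);

% Scale the features and keep mu and sigma for later use
mu = mean(x_raw);
sigma = std(x_raw);
x = [ones(m, 1), (x_raw(:,1) - mu(1)) ./ sigma(1), (x_raw(:,2) - mu(2)) ./ sigma(2)];

theta = zeros(3, 1);
alpha = 1;                       % assumed to be stable for this toy data
for num_iterations = 1:1000
    theta = theta - alpha / m * (x' * (x * theta - y));
end

% Scale the query with the TRAINING mu and sigma before predicting
query = [1650 3];
query_scaled = [1, (query(1) - mu(1)) / sigma(1), (query(2) - mu(2)) / sigma(2)];
price = query_scaled * theta;

fprintf('Predicted price: %.2f\n', price);
```

Since the toy prices follow the invented rule exactly, the prediction should come out very close to 50000 + 100 * 1650 + 10000 * 3 = 245000. Multiplying `theta` by the unscaled row `[1, 1650, 3]` instead would give a wildly different number.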

Normal Equations

In the Normal Equations video, you learned that the closed-form solution to a least squares fit is

$$\theta=\left(X^{T}X\right)^{-1}X^{T}\vec{y}.$$

Using this formula does not require any feature scaling, and you will get an exact solution in one calculation: there is no 'loop until convergence' like in gradient descent.

1. In your program, use the formula above to calculate $\theta$. Remember that while you don't need to scale your features, you still need to add an intercept term.

2. Once you have found $\theta$ from this method, use it to make a price prediction for a 1650-square-foot house with 3 bedrooms. Did you get the same price that you found through gradient descent?
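
The two steps above can be sketched as follows, again on a made-up training set whose prices follow an invented linear rule. The sketch solves the linear system with the backslash operator instead of forming the inverse explicitly; this is a substitution made here for numerical robustness, and `inv(X' * X) * X' * y` would be the literal transcription of the formula.

```octave
% Toy training set (made up): price is an exact linear function of the inputs
x_raw = [2100 3; 1600 3; 2400 4; 1400 2; 3000 4];
y = 50000 + 100 * x_raw(:,1) + 10000 * x_raw(:,2);
m = length(y);

X = [ones(m, 1), x_raw];         % intercept term, no feature scaling

% Solve (X'X) theta = X'y without computing the inverse
theta = (X' * X) \ (X' * y);

% The query is NOT scaled here, because the training features were not scaled
price = [1, 1650, 3] * theta;

fprintf('theta = [%.2f, %.2f, %.2f], price = %.2f\n', theta, price);
```

On this toy data the recovered parameters should be close to the invented rule (50000, 100, 10000), and the prediction close to 245000.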

Solutions

After you have completed the exercises above, please refer to the solutions below and check that your implementation and your answers are correct. In a case where your implementation does not result in the same parameters/phenomena as described below, debug your solution until you manage to replicate the same effect as our implementation.

Selecting a learning rate

Your plot of the cost function for different learning rates should look something like this:

Notice that for a small alpha like 0.01, the cost function decreases slowly, which means slow convergence during gradient descent. Also, notice that while alpha=1.3 is the largest learning rate, alpha=1.0 has a faster convergence. This shows that after a certain point, increasing the learning rate will no longer increase the speed of convergence.

In fact, if your learning rate is too large, gradient descent will not converge at all, and $J(\theta)$ might blow up like in the following graph for alpha = 1.4.

Worse yet, $J(\theta)$ might not plot at all because the numbers are too large for computer calculations. Matlab/Octave will give you NaN's. NaN stands for 'not a number' and is often caused by undefined operations like (+Infinity)+(-Infinity).
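
A short sketch of how such values arise and how to detect them is shown below. The variable names are arbitrary; `realmax`, `isinf`, and `isnan` are standard Matlab/Octave functions.

```octave
a = Inf + (-Inf);         % undefined operation, result is NaN
disp(isnan(a));           % displays 1

big = realmax * 10;       % exceeds the largest double, overflows to Inf
disp(isinf(big));         % displays 1
disp(isnan(big - big));   % Inf - Inf is NaN, displays 1
```

If a cost vector `J` has been recorded during gradient descent, `any(isnan(J))` or `any(isinf(J))` is a quick way to check whether the run diverged.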

1. For your final values of theta, you should have

$$\begin{aligned}\theta_0 &= 340{,}413 \\ \theta_1 &= 110{,}631 \\ \theta_2 &= -6{,}649\end{aligned}$$

These are the values given by Matlab/Octave after 100 iterations with $\alpha = 1$. If any of your parameters differs by more than 10 from the given answers, try to see if you can get your answers closer. If you have not already run gradient descent using exactly 100 iterations at $\alpha = 1$, try that. If your answers then match the solutions, your original problem was in your learning rate or in the number of iterations you were running. But if things still do not work, you may have made an error in your algorithm or in your feature scaling.

2. The predicted price of the house should be \$293,081. If you get the correct answer for $\theta$ but not for the price prediction, you probably forgot to scale your features when making the prediction.

Normal equations

1. Your values of theta should be

$$\begin{aligned}\theta_0 &= 89{,}598 \\ \theta_1 &= 139.21 \\ \theta_2 &= -8738.0\end{aligned}$$

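Notice that these values are different from the ones you got from gradient descent. This is because gradient descent was run on the scaled features, while the normal equations were applied to the original, unscaled features, so the two sets of parameters are expressed in different units.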

2. The predicted price of the house should be \$293,081, as before.
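
To see why two different parameter vectors can give the same prediction, it may help to rewrite the scaled hypothesis in terms of the raw inputs. The derivation below is a sketch that assumes gradient descent has converged to the least-squares solution; the superscripts "gd" and "ne" are labels introduced here for the gradient descent and normal equations parameters, and $\mu_j$, $\sigma_j$ are the mean and standard deviation used to scale feature $j$.

```latex
\begin{aligned}
h(x) &= \theta_0^{\mathrm{gd}}
        + \sum_{j=1}^{2} \theta_j^{\mathrm{gd}} \, \frac{x_j - \mu_j}{\sigma_j} \\
     &= \underbrace{\left(\theta_0^{\mathrm{gd}}
        - \sum_{j=1}^{2} \frac{\theta_j^{\mathrm{gd}} \mu_j}{\sigma_j}\right)}_{\theta_0^{\mathrm{ne}}}
        + \sum_{j=1}^{2} \underbrace{\frac{\theta_j^{\mathrm{gd}}}{\sigma_j}}_{\theta_j^{\mathrm{ne}}} \, x_j
\end{aligned}
```

Under that assumption, the ratios of the reported values, $110{,}631 / 139.21 \approx 794.7$ and $6{,}649 / 8738.0 \approx 0.761$, would correspond to the standard deviations of the living area and the number of bedrooms in the training set.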