Exercise 5: Regularization

最新推荐文章于 2021-12-27 17:57:14 发布

ytlcainiao

最新推荐文章于 2021-12-27 17:57:14 发布

阅读量1.3k

点赞数

分类专栏： ML Exercise

ML Exercise 专栏收录该内容

9 篇文章 0 订阅

订阅专栏

Exercise 5: Regularization

In this exercise, you will implement regularized linear regression and regularized logistic regression.

Data

This data bundle contains two sets of data, one for linear regression and the other for logistic regression. It also includes a helper function named 'map_feature.m' which will be used for logistic regression. Make sure that this function's m-file is placed in the same working directory where you plan to write your code.

Regularized linear regression

The first part of this exercise focuses on regularized linear regression and the normal equations.

Plot the data

Load the data files "ex5Linx.dat" and "ex5Liny.dat" into your program. These correspond to the "x" and "y" variables that you will start out with.

Notice that in this data, the input "x" is a single feature, so you can plot y as a function of x on a 2-dimensional graph (try it yourself):

From looking at this plot, it seems that fitting a straight line might be too simple of an approximation. Instead, we will try fitting a higher-order polynomial to the data to capture more of the variations in the points.

Let's try a fifth-order polynomial. Our hypothesis will be

$\begin{displaymath}h_{\theta}(x)=\theta_{0}+\theta_{1}x+\theta_{2}x^{2}+\theta_{3}x^{3}+\theta_{4}x^{4}+\theta_{5}x^{5}\end{displaymath}$

This means that we have a hypothesis of six features, because $x^0, x^1, \ldots x^5$ are now all features of our regression. Notice that even though we are producing a polynomial fit, we still have a linear regression problem because the hypothesis is linear in each feature.

Since we are fitting a 5th-order polynomial to a data set of only 7 points, over-fitting is likely to occur. To guard against this, we will use regularization in our model.

Recall that in regularization problems, the goal is to minimize the following cost function with respect to $\theta$ :

$\begin{eqnarray*}J(\theta) & = & \frac{{1}}{2m}\left[\sum_{i=1}^{m}(h_{\theta}(......& & \mbox{where }\lambda\mbox{ is the regularization parameter}\end{eqnarray*}$

The regularization parameter $\lambda$ is a control on your fitting parameters. As the magnitues of the fitting parameters increase, there will be an increasing penalty on the cost function. This penalty is dependent on the squares of the parameters as well as the magnitude of $\lambda$ . Also, notice that the summation after $\lambda$ does not include $\theta_{0}^{2}.$

Normal equations

Now we will find the best parameters of our model using the normal equations. Recall that the normal equations solution to regularized linear regression is

$\begin{displaymath}\theta=(X^{T}X+\lambda\left[\begin{array}{cccc}0\\& 1\\& & \ddots\\& & & 1\end{array}\right])^{-1}X^{T}\vec{y}\end{displaymath}$

The matrix following $\lambda$ is an $(n+1)\times(n+1)$ diagonal matrix with a zero in the upper left and ones down the other diagonal entries. (Remember that

is the number of features, not counting the intecept term). The vector $\vec{y}$ and the matrix

have the same definition they had for unregularized regression:

$\begin{displaymath}\begin{array}{cc}\par\vec{y} = \left[\begin{array}{c}y......m)})^T\mbox{-}\end{array}\right] \nonumber\par\end{array}\end{displaymath}$

Using this equation, find values for $\theta$ using the three regularization parameters below:

a. $\lambda = 0$ (this is the same case as non-regularized linear regression)

b. $\lambda = 1$

c. $\lambda = 10$

As you are implementing your program, keep in mind that is an $m \times (n+1)$ matrix, because there are training examples and features, plus an intercept term. In the data provided for this exercise, you were only give the first power of . You will need to include the other powers of in your feature vector , which means that the first column will contain all ones, the next column will contain the first powers, the next column will contain the second powers, and so on. You can do this in Matlab/Octave with the command

x = [ones(m, 1), x, x.^2, x.^3, x.^4, x.^5];

When you have found the answers for $\theta$ , verify them with the values in the solutions. In addition to listing the values for each element $\theta_j$ of the $\theta$ vector, we will also provide the L2-norm of $\theta$ so you can quickly check if your answer is correct. In Matlab/Octave, you can calculate the L2-norm of a vector x using the command norm(x).

Also, plot the polyomial fit for each value of $\lambda$ . You will get plots similar to these:

From looking at these graphs, what conclusions can you make about how the regularization parameter $\lambda$ affects your model?

Regularized logistic regression

In this 2nd part of the exercise, you will implement regularized logistic regression using Newton's Method. To begin, load the files 'ex5Logx.dat' and ex5Logy.dat' into your program. This dataset represents the training set of a logistic regression problem with two features. To avoid confusion later, we will refer to the two input features contained in 'ex5Logx.dat' as and . So in the 'ex5Logx.dat' file, the first column of numbers represents the feature , which you will plot on the horizontal axis, and the second feature represents , which you will plot on the vertical axis.

After loading the data, plot the points using different markers to distinguish between the two classifications. The commands in Matlab/Octave will be:

x = load('ex5Logx.dat'); 
y = load('ex5Logy.dat');

figure

% Find the indices for the 2 classes
pos = find(y); neg = find(y == 0);

plot(x(pos, 1), x(pos, 2), '+')
hold on
plot(x(neg, 1), x(neg, 2), 'o')

After plotting your image, it should look something like this:

We will now fit a regularized regression model to this data.

Recall that in logistic regression, the hypothesis function is

$\displaystyle h_{\theta}(x)$	$\textstyle =$	$\displaystyle g(\theta^{T}x)=\frac{{1}}{1+e^{-\theta^{T}x}}$
	$\textstyle =$	$\displaystyle P(y=1\vert x;\theta)$

Let's look at the $\theta^Tx$ parameter in the sigmoid function $g(\theta^Tx)$ .

In this exercise, we will assign to be all monomials (meaning polynomial terms) of and up to the sixth power:

$\begin{displaymath}x=\left[\begin{array}{c}1\\u\\v\\u^2\\uv\\v^2\\u^3\\\vdots\\uv^5\\v^6\end{array}\right]\end{displaymath}$

To clarify this notation: we have made a 28-feature vector where $x_0 = 1, x_1=u, x_2= v,\ldots x_{28} =v^6$ . Remember that was the first column of numbers in your 'ex5Logx.dat' file and was the second column. From now on, we will just refer to the entries of as , , and so on instead of their values in terms of and .

To save you the trouble of enumerating all the terms of , we've included a Matlab/Octave helper function named'map_feature' that maps the original inputs to the feature vector. This function works for a single training example as well as for an entire training. To use this function, place 'map_feature.m' in your working directory and call

x = map_feature(u, v)

This assumes that the two original features were stored in column vectors named 'u' and 'v.' (If you had only one training example, each column vector would be a scalar.) The function will output a new feature array stored in the variable 'x.' Of course, you can use any names you'd like for the arguments and the output. Just make sure your two arguments are column vectors of the same size.

Before building this model, recall that our objective is to minimize the cost function in regularized logistic regression:

$\begin{displaymath}J(\theta)=-\frac{{1}}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_......\right] +\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^{2}\par\end{displaymath}$

Notice that this looks like the cost function for unregularized logistic regression, except that there is a regularization term at the end. We will now minimize this function using Newton's method.

Newton's method

Recall that the Newton's Method update rule is

$\begin{displaymath}\theta^{(t+1)}=\theta^{(t)}-H^{-1}\nabla_{\theta}J \nonumber\end{displaymath}$

This is the same rule that you used for unregularized logistic regression in Exercise 4. But because you are now implementing regularization, the gradient $\nabla_{\theta}(J)$ and the Hessian have different forms:

$\begin{displaymath}\nabla_{\theta}J = \left[\begin{array}{c}\frac{1}{m}\sum_{i......ight)x_{n}^{(i)}+\frac{\lambda}{m}\theta_{n}\end{array}\right]\end{displaymath}$

$\begin{displaymath}H & = & \frac{1}{m}\left[\sum_{i=1}^{m}h_{\theta}(x^{(i)})\l......}{cccc}0\\& 1\\& & \ddots\\& & & 1\end{array}\right]\end{displaymath}$

Notice that if you substitute $\lambda = 0$ into these expressions, you will see the same formulas as unregularized logistic regression. Also, remember that in these formulas,

1. $x^{(i)}$ is your feature vector, which is a 28x1 vector in this exercise.

2. $\nabla_{\theta}J$ is a 28x1 vector.

3. $x^{(i)}(x^{(i)})^T$ and are 28x28 matrices.

4. $y^{(i)}$ and $h_\theta(x^{(i)})$ are scalars.

5. The matrix following $\frac{\lambda}{m}$ in the Hessian formula is a 28x28 diagonal matrix with a zero in the upper left and ones on every other diagonal entry.

Run Newton's Method

Now run Newton's Method on this dataset using the three values of lambda below:

a. $\lambda = 0$ (this is the same case as non-regularized linear regression)

b. $\lambda = 1$

c. $\lambda = 10$

To determine whether Newton's Method has converged, it may help to print out the value of $J(\theta)$ during each iteration. $J(\theta)$ should not be decreasing at any point during Newton's Method. If it is, check that you have defined $J(\theta)$ correctly. Also check your definitions of the gradient and Hessian to make sure there are no mistakes in the regularization parts.

After convergence, use your values of theta to find the decision boundary in the classification problem. The decision boundary is defined as the line where

$\begin{displaymath}P(y=1\vert x; \theta) = 0.5 \;\;\;\;\; \Longrightarrow\;\;\;\;\; \theta^T x = 0 \nonumber\end{displaymath}$

Plotting the decision boundary here will be trickier than plotting the best-fit curve in linear regression. You will need to plot the $\theta^T x = 0$ line implicity, by plotting a contour. This can be done by evaluating $\theta^Tx$ over a grid of points representing the original and inputs, and then plotting the line where $\theta^Tx$ evaluates to zero.

The plot implementation for Matlab/Octave is given below. To get the best viewing results, use the same plotting ranges that we use here.

% Define the ranges of the grid
u = linspace(-1, 1.5, 200);
v = linspace(-1, 1.5, 200);

% Initialize space for the values to be plotted
z = zeros(length(u), length(v));

% Evaluate z = theta*x over the grid
for i = 1:length(u)
    for j = 1:length(v)
        % Notice the order of j, i here!
        z(j,i) = map_feature(u(i), v(j))*theta;
    end
end

% Because of the way that contour plotting works
% in Matlab, we need to transpose z, or
% else the axis orientation will be flipped!
z = z'
% Plot z = 0 by specifying the range [0, 0]
contour(u,v,z, [0, 0], 'LineWidth', 2)

When you are finished, your plots for the three values of $\lambda$ should look similar to the ones below.

Finally, because there are 28 elements $\theta$ , we will not provide an element-by-element comparison in the solutions. Instead, use norm(theta) to calculate the L2-norm of $\theta$ , and check it against the norm in the solutions.

Solutions

After you have completed the exercises above, please refer to the solutions below and check that your implementation and your answers are correct. In a case where your implementation does not result in the same parameters/phenomena as described below, debug your solution until you manage to replicate the same effect as our implementation.

Normal equations

The values for $\theta$ from the normal equations appear below. If you are using Matlab/Ocatve, you should see these exact values.

$\begin{tabular}{\vert c\vert c\vert c\vert c\vert}\cline{2-4}\multicolumn{1}{......alcolor \theta}$ & 8.1690 & 0.8098 & 0.5931\tabularnewline\hline\end{tabular}$

Notice that as $\lambda$ increases, the norm of $\theta$ decreases. This is because a higher $\lambda$ penalizes large fitting parameters. By adjusting $\lambda$ , you can have more control over your data fitting.

In the first plot, $\lambda = 0$ , which means this fit is the same as unregularized linear regression. Since the optimization objective seeks only to minimize squared error, this curve is very specific to the data, but it's probably not very good for showing a general trend. This is a case of overfitting.

The second plot shows that overfitting is reduced after the regularization parameter increases to $\lambda = 1$ . Though the fitted function is still a 5th-order polynomial, the curve appears a lot simpler than the first plot.

The third plot shows what happens when $\lambda$ is too large. Underfitting occurs, and the curve does not follow the direction of the points as well is it did before.

Newton's Method

Here are the norms of $\theta$ you should see after Newton's Method has converged. Convergence takes about 15 iterations for $\lambda = 0$ , and 5 or fewer iterations for $\lambda = 1$ and $\lambda = 10$ .

$\begin{tabular}{\vert c\vert c\vert c\vert c\vert}\cline{2-4}\multicolumn{1}{......a}$ & $7.173\times10^{3}$ & 4.240 & 0.9384\tabularnewline\hline\end{tabular}$

Notice that the norms of $\theta$ decrease as $\lambda$ increases. There are also visible fitting changes in the corresponding plots.

In the $\lambda = 0$ plot, the algorithm has tried to fit a very precise boundary around the positives and negatives. There is even an island region inside a larger region. This is a bit too precise for the general classification trend that we're looking for.

The $\lambda = 1$ plot shows an simpler decision boundary which still separates the positives and negatives fairly well. On the other hand, $\lambda = 10$ is probably too high of a regularizaton parameter, as the boundary decision does not follow the data so well on in the lower-left region.