UFLDL Tutorial: Softmax Regression

Softmax Regression

Introduction

In these notes, we describe the Softmax regression model. This model generalizes logistic regression to classification problems where the class label y can take on more than two possible values. This will be useful for such problems as MNIST digit classification, where the goal is to distinguish between 10 different numerical digits. Softmax regression is a supervised learning algorithm, but we will later be using it in conjunction with our deep learning/unsupervised feature learning methods.

Recall that in logistic regression, we had a training set \{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \} of m labeled examples, where the input features are x^{(i)} \in \Re^{n+1}. (In this set of notes, we will use the notational convention of letting the feature vectors x be (n + 1)-dimensional, with x_0 = 1 corresponding to the intercept term.) With logistic regression, we were in the binary classification setting, so the labels were y^{(i)} \in \{0,1\}. Our hypothesis took the form:

\begin{align}h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)},\end{align}

and the model parameters θ were trained to minimize the cost function

\begin{align}J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right]\end{align}

In the softmax regression setting, we are interested in multi-class classification (as opposed to only binary classification), and so the label y can take on k different values, rather than only two. Thus, in our training set \{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}, we now have that y^{(i)} \in \{1, 2, \ldots, k\}. (Note that our convention will be to index the classes starting from 1, rather than from 0.) For example, in the MNIST digit recognition task, we would have k = 10 different classes.

Given a test input x, we want our hypothesis to estimate the probability p(y = j | x) for each value of j = 1, \ldots, k. I.e., we want to estimate the probability of the class label taking on each of the k different possible values. Thus, our hypothesis will output a k dimensional vector (whose elements sum to 1) giving us our k estimated probabilities. Concretely, our hypothesis h_\theta(x) takes the form:

\begin{align}h_\theta(x^{(i)}) =\begin{bmatrix}p(y^{(i)} = 1 | x^{(i)}; \theta) \\p(y^{(i)} = 2 | x^{(i)}; \theta) \\\vdots \\p(y^{(i)} = k | x^{(i)}; \theta)\end{bmatrix}=\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }\begin{bmatrix}e^{ \theta_1^T x^{(i)} } \\e^{ \theta_2^T x^{(i)} } \\\vdots \\e^{ \theta_k^T x^{(i)} } \\\end{bmatrix}\end{align}

Here \theta_1, \theta_2, \ldots, \theta_k \in \Re^{n+1} are the parameters of our model. Notice that the term \frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }  normalizes the distribution, so that it sums to one.

For convenience, we will also write θ to denote all the parameters of our model. When you implement softmax regression, it is usually convenient to represent θ as a k-by-(n + 1) matrix obtained by stacking up \theta_1, \theta_2, \ldots, \theta_k in rows, so that

\theta = \begin{bmatrix}\mbox{---} \theta_1^T \mbox{---} \\\mbox{---} \theta_2^T \mbox{---} \\\vdots \\\mbox{---} \theta_k^T \mbox{---} \\\end{bmatrix}
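For concreteness, here is a minimal MATLAB sketch (not part of any starter code) of evaluating the hypothesis on a single example, assuming theta is stored as the k-by-(n + 1) matrix above and x is an (n + 1)-dimensional column vector with x(1) = 1:

z = theta * x;             % z(j) = theta_j' * x
z = z - max(z);            % subtracting a constant does not change the result, but improves numerical stability
h = exp(z) / sum(exp(z));  % k-by-1 vector of estimated probabilities, summing to 1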


Cost Function

We now describe the cost function that we'll use for softmax regression. In the equation below, 1\{\cdot\} is the indicator function, so that 1{a true statement} = 1, and 1{a false statement} = 0. For example, 1{2 + 2 = 4} evaluates to 1; whereas 1{1 + 1 = 5} evaluates to 0. Our cost function will be:

\begin{align}J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k}  1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]\end{align}

Notice that this generalizes the logistic regression cost function, which could also have been written:

\begin{align}J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m   (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]\end{align}

The softmax cost function is similar, except that we now sum over the k different possible values of the class label. Note also that in softmax regression, we have that p(y^{(i)} = j | x^{(i)} ; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} }.
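As a purely illustrative vectorized sketch, assuming m examples are stored as the columns of an (n + 1)-by-m matrix X, y is a label vector with entries in \{1, \ldots, k\}, and theta is the k-by-(n + 1) matrix above, this cost can be computed as:

m = size(X, 2);
Z = theta * X;                          % Z(j,i) = theta_j' * x^(i)
Z = bsxfun(@minus, Z, max(Z, [], 1));   % shift each column; see the overflow tip in the exercise below
P = exp(Z);
P = bsxfun(@rdivide, P, sum(P, 1));     % P(j,i) = p(y^(i) = j | x^(i); theta)
I = full(sparse(y, 1:m, 1, k, m));      % I(j,i) = 1{y^(i) = j}
J = -(1/m) * sum(sum(I .* log(P)));     % the cost J(theta) above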

There is no known closed-form way to solve for the minimum of J(θ), and thus as usual we'll resort to an iterative optimization algorithm such as gradient descent or L-BFGS. Taking derivatives, one can show that the gradient is:

\begin{align}\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right]  }\end{align}


Recall the meaning of the "\nabla_{\theta_j}" notation. In particular, \nabla_{\theta_j} J(\theta) is itself a vector, so that its l-th element is \frac{\partial J(\theta)}{\partial \theta_{jl}}, the partial derivative of J(θ) with respect to the l-th element of θj.

Armed with this formula for the derivative, one can then plug it into an algorithm such as gradient descent, and have it minimize J(θ). For example, with the standard implementation of gradient descent, on each iteration we would perform the update \theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta) (for each j=1,\ldots,k).
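A bare-bones batch gradient descent loop, sketched under the same assumptions as the cost computation above (X, I, m, and theta as before, with a hand-chosen learning rate alpha and iteration count numIters), would look like:

for iter = 1:numIters
    Z = theta * X;
    Z = bsxfun(@minus, Z, max(Z, [], 1));
    P = exp(Z);
    P = bsxfun(@rdivide, P, sum(P, 1));   % P(j,i) = p(y^(i) = j | x^(i); theta)
    grad = -(1/m) * (I - P) * X';         % row j of grad is (nabla_{theta_j} J(theta))'
    theta = theta - alpha * grad;         % theta_j := theta_j - alpha * nabla_{theta_j} J(theta)
end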

When implementing softmax regression, we will typically use a modified version of the cost function described above; specifically, one that incorporates weight decay. We describe the motivation and details below.

Properties of softmax regression parameterization

Softmax regression has an unusual property: its set of parameters is "redundant." To explain what this means, suppose we take each of our parameter vectors θj, and subtract some fixed vector ψ from it, so that every θj is now replaced with θj − ψ (for every j=1, \ldots, k). Our hypothesis now estimates the class label probabilities as

\begin{align}p(y^{(i)} = j | x^{(i)} ; \theta)&= \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^k e^{ (\theta_l-\psi)^T x^{(i)}}}  \\&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^Tx^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^Tx^{(i)}}} \\&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}.\end{align}

In other words, subtracting ψ from every θj does not affect our hypothesis' predictions at all! This shows that softmax regression's parameters are "redundant." More formally, we say that our softmax model is overparameterized, meaning that for any hypothesis we might fit to the data, there are multiple parameter settings that give rise to exactly the same hypothesis function h_\theta mapping from inputs x to the predictions.
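This invariance is easy to verify numerically; the following throwaway check, with randomly chosen theta, psi, and x, prints a difference on the order of machine precision:

k = 5; n = 3;
theta = randn(k, n + 1);                   % arbitrary parameters
psi   = randn(1, n + 1);                   % arbitrary shift
x     = [1; randn(n, 1)];                  % arbitrary input (with intercept term)
p1 = exp(theta * x);                       p1 = p1 / sum(p1);
p2 = exp(bsxfun(@minus, theta, psi) * x);  p2 = p2 / sum(p2);
disp(norm(p1 - p2));                       % essentially zero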

Further, if the cost function J(θ) is minimized by some setting of the parameters (\theta_1, \theta_2,\ldots, \theta_k), then it is also minimized by (\theta_1 - \psi, \theta_2 - \psi,\ldots,\theta_k - \psi) for any value of ψ. Thus, the minimizer of J(θ) is not unique. (Interestingly, J(θ) is still convex, and thus gradient descent will not run into local optima problems. But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton's method to run into numerical problems.)

Notice also that by setting ψ = θ1, one can always replace θ1 with \theta_1 - \psi = \vec{0} (the vector of all 0's), without affecting the hypothesis. Thus, one could "eliminate" the vector of parameters θ1 (or any other θj, for any single value of j), without harming the representational power of our hypothesis. Indeed, rather than optimizing over the k(n + 1) parameters (\theta_1, \theta_2,\ldots, \theta_k) (where \theta_j \in \Re^{n+1}), one could instead set \theta_1 =\vec{0} and optimize only with respect to the (k − 1)(n + 1) remaining parameters, and this would work fine.

In practice, however, it is often cleaner and simpler to implement the version which keeps all the parameters (\theta_1, \theta_2,\ldots, \theta_k), without arbitrarily setting one of them to zero. But we will make one change to the cost function: adding weight decay. This will take care of the numerical problems associated with softmax regression's overparameterized representation.

Weight Decay

We will modify the cost function by adding a weight decay term \textstyle \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^{n} \theta_{ij}^2 which penalizes large values of the parameters. Our cost function is now

\begin{align}J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}  \right]              + \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^n \theta_{ij}^2\end{align}

With this weight decay term (for any λ > 0), the cost function J(θ) is now strictly convex, and is guaranteed to have a unique solution. The Hessian is now invertible, and because J(θ) is convex, algorithms such as gradient descent, L-BFGS, etc. are guaranteed to converge to the global minimum.

To apply an optimization algorithm, we also need the derivative of this new definition of J(θ). One can show that the derivative is: \begin{align}\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} ( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) ) \right]  } + \lambda \theta_j\end{align}
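In code, the weight decay terms amount to a one-line addition to each of the cost and gradient from the earlier sketch (and these are exactly the two quantities the exercise's softmaxCost below computes):

J    = -(1/m) * sum(sum(I .* log(P))) + (lambda/2) * sum(theta(:).^2);   % cost with weight decay
grad = -(1/m) * (I - P) * X' + lambda * theta;                           % gradient with weight decay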

By minimizing J(θ) with respect to θ, we will have a working implementation of softmax regression.


Relationship to Logistic Regression

In the special case where k = 2, one can show that softmax regression reduces to logistic regression. This shows that softmax regression is a generalization of logistic regression. Concretely, when k = 2, the softmax regression hypothesis outputs

\begin{align}h_\theta(x) &=\frac{1}{ e^{\theta_1^Tx}  + e^{ \theta_2^T x } }\begin{bmatrix}e^{ \theta_1^T x } \\e^{ \theta_2^T x }\end{bmatrix}\end{align}

Taking advantage of the fact that this hypothesis is overparameterized and setting ψ = θ1, we can subtract θ1 from each of the two parameters, giving us

\begin{align}h(x) &=\frac{1}{ e^{\vec{0}^Tx}  + e^{ (\theta_2-\theta_1)^T x } }\begin{bmatrix}e^{ \vec{0}^T x } \\e^{ (\theta_2-\theta_1)^T x }\end{bmatrix} \\&=\begin{bmatrix}\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\\frac{e^{ (\theta_2-\theta_1)^T x }}{ 1 + e^{ (\theta_2-\theta_1)^T x } }\end{bmatrix} \\&=\begin{bmatrix}\frac{1}{ 1  + e^{ (\theta_2-\theta_1)^T x } } \\1 - \frac{1}{ 1  + e^{ (\theta_2-\theta_1)^T x } } \\\end{bmatrix}\end{align}


Thus, replacing θ2 − θ1 with a single parameter vector θ', we find that softmax regression predicts the probability of one of the classes as \frac{1}{ 1  + e^{ (\theta')^T x } }, and that of the other class as 1 - \frac{1}{ 1 + e^{ (\theta')^T x } }, the same as logistic regression.
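This reduction is also easy to check numerically: for k = 2, the softmax probability of the first class equals the logistic sigmoid of (\theta_1 - \theta_2)^T x (a quick check with randomly chosen values):

n = 4;
theta1 = randn(n + 1, 1);  theta2 = randn(n + 1, 1);  x = [1; randn(n, 1)];
p = exp([theta1'; theta2'] * x);  p = p / sum(p);   % two-class softmax
sig = 1 / (1 + exp(-(theta1 - theta2)' * x));       % logistic sigmoid of (theta1 - theta2)' * x
disp(abs(p(1) - sig));                              % essentially zero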

Softmax Regression vs. k Binary Classifiers

Suppose you are working on a music classification application, and there are k types of music that you are trying to recognize. Should you use a softmax classifier, or should you build k separate binary classifiers using logistic regression?

This will depend on whether your classes are mutually exclusive. For example, if your four classes are classical, country, rock, and jazz, then assuming each of your training examples is labeled with exactly one of these four class labels, you should build a softmax classifier with k = 4. (If there are also some examples that belong to none of the above four classes, then you can set k = 5 in softmax regression, and also have a fifth, "none of the above," class.)

If however your categories are has_vocals, dance, soundtrack, pop, then the classes are not mutually exclusive; for example, there can be a piece of pop music that comes from a soundtrack and in addition has vocals. In this case, it would be more appropriate to build 4 binary logistic regression classifiers. This way, for each new musical piece, your algorithm can separately decide whether it falls into each of the four categories.

Now, consider a computer vision example, where you're trying to classify images into three different classes. (i) Suppose that your classes are indoor_scene, outdoor_urban_scene, and outdoor_wilderness_scene. Would you use softmax regression or three logistic regression classifiers? (ii) Now suppose your classes are indoor_scene, black_and_white_image, and image_has_people. Would you use softmax regression or multiple logistic regression classifiers?

In the first case, the classes are mutually exclusive, so a softmax regression classifier would be appropriate. In the second case, it would be more appropriate to build three separate logistic regression classifiers.

Exercise: Softmax Regression


Softmax regression

In this problem set, you will use softmax regression to classify MNIST images. The goal of this exercise is to build a softmax classifier that you will be able to reuse in future exercises and also on other classification problems that you might encounter.

In the file softmax_exercise.zip, we have provided some starter code. You should write your code in the places indicated by "YOUR CODE HERE" in the files.

In the starter code, you will need to modify softmaxCost.m and softmaxPredict.m for this exercise.

We have also provided softmaxExercise.m that will help walk you through the steps in this exercise.

Dependencies

The following additional files are required for this exercise:

You will also need:

If you have not completed the exercises listed above, we strongly suggest you complete them first.

Step 0: Initialize constants and parameters

We've provided the code for this step in softmaxExercise.m.

Two constants, inputSize and numClasses, corresponding to the size of each input vector and the number of class labels, have been defined in the starter code. This will allow you to reuse your code on a different data set in a later exercise. We also initialize lambda, the weight decay parameter, here.

Step 1: Load data

The starter code loads the MNIST images and labels into inputData and labels respectively. The images are pre-processed to scale the pixel values to the range [0,1], and the label 0 is remapped to 10 for convenience of implementation, so that the labels take values in \{1, 2, \ldots, 10\}. You will not need to change any code in this step for this exercise, but note that your code should be general enough to operate on data of arbitrary size belonging to any number of classes.

Step 2: Implement softmaxCost

In softmaxCost.m, implement code to compute the softmax cost function J(θ). Remember to include the weight decay term in the cost as well. Your code should also compute the appropriate gradients, as well as the predictions for the input data (which will be used in the cross-validation step later).

It is important to vectorize your code so that it runs quickly. We also provide several implementation tips below:

Note: In the provided starter code, theta is a matrix whose jth row is \theta_j^T.

Implementation Tip: Computing the ground truth matrix - In your code, you may need to compute the ground truth matrix M, such that M(r, c) is 1 if y(c) = r and 0 otherwise. This can be done quickly, without a loop, using the MATLAB functions sparse and full. Specifically, the command M = sparse(r, c, v) creates a sparse matrix such that M(r(i), c(i)) = v(i) for all i. That is, the vectors r and c give the positions of the elements whose values we wish to set, and v the corresponding values of the elements. Running full on a sparse matrix gives a "full" representation of the matrix (meaning that MATLAB will no longer try to represent it as a sparse matrix in memory). The code for using sparse and full to compute the ground truth matrix has already been included in softmaxCost.m.
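For example (a tiny illustration, not part of the starter code), three examples labeled 2, 1, and 3 give the 3-by-3 ground truth matrix:

labels = [2; 1; 3];                 % y(1) = 2, y(2) = 1, y(3) = 3
M = full(sparse(labels, 1:3, 1));   % M(r, c) = 1 if y(c) = r, and 0 otherwise
% M = [0 1 0
%      1 0 0
%      0 0 1]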


Implementation Tip: Preventing overflows - in softmax regression, you will have to compute the hypothesis

\begin{align} h(x^{(i)}) = \frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }\begin{bmatrix} e^{ \theta_1^T x^{(i)} } \\e^{ \theta_2^T x^{(i)} } \\\vdots \\e^{ \theta_k^T x^{(i)} } \\\end{bmatrix}\end{align}

When the products \theta_j^T x^{(i)} are large, the exponential function e^{\theta_j^T x^{(i)}} will become very large and possibly overflow. When this happens, you will not be able to compute your hypothesis. However, there is an easy solution - observe that we can multiply the top and bottom of the hypothesis by some constant without changing the output:

\begin{align} h(x^{(i)}) &= \frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }\begin{bmatrix} e^{ \theta_1^T x^{(i)} } \\e^{ \theta_2^T x^{(i)} } \\\vdots \\e^{ \theta_k^T x^{(i)} } \\\end{bmatrix} \\&=\frac{ e^{-\alpha} }{ e^{-\alpha} \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }\begin{bmatrix} e^{ \theta_1^T x^{(i)} } \\e^{ \theta_2^T x^{(i)} } \\\vdots \\e^{ \theta_k^T x^{(i)} } \\\end{bmatrix} \\&=\frac{ 1 }{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} - \alpha }} }\begin{bmatrix} e^{ \theta_1^T x^{(i)} - \alpha } \\e^{ \theta_2^T x^{(i)} - \alpha } \\\vdots \\e^{ \theta_k^T x^{(i)} - \alpha } \\\end{bmatrix} \\\end{align}

Hence, to prevent overflow, simply subtract some large constant value from each of the \theta_j^T x^{(i)} terms before computing the exponential. In practice, for each example, you can use the maximum of the \theta_j^T x^{(i)} terms as the constant. Assuming you have a matrix M containing these terms such that M(r, c) is \theta_r^T x^{(c)}, then you can use the following code to accomplish this:

% M is the matrix as described in the text
M = bsxfun(@minus, M, max(M, [], 1));

max(M, [], 1) yields a row vector with each element giving the maximum value of the corresponding column of M. bsxfun (short for binary singleton expansion function) expands this row vector along the rows of M and applies minus, hence subtracting the maximum of each column from every element in that column.

Implementation Tip: Computing the predictions - you may also find bsxfun useful in computing your predictions - if you have a matrix M containing the e^{\theta_j^T x^{(i)}} terms, such that M(r, c) contains the e^{\theta_r^T x^{(c)}} term, you can use the following code to compute the hypothesis (by dividing all elements in each column by their column sum):

% M is the matrix as described in the text
M = bsxfun(@rdivide, M, sum(M))

The operation of bsxfun in this case is analogous to the earlier example.

Step 3: Gradient checking

Once you have written the softmax cost function, you should check your gradients numerically. In general, whenever implementing any learning algorithm, you should always check your gradients numerically before proceeding to train the model. The norm of the difference between the numerical gradient and your analytical gradient should be small, on the order of 10^{-9}.

Implementation Tip: Faster gradient checking - when debugging, you can speed up gradient checking by reducing the number of parameters your model uses. In this case, we have included code for reducing the size of the input data, using the first 8 pixels of the images instead of the full 28x28 images. This code can be used by setting the variable DEBUG to true, as described in step 1 of the code.

Step 4: Learning parameters

Now that you've verified that your gradients are correct, you can train your softmax model using the function softmaxTrain in softmaxTrain.m, which uses the L-BFGS algorithm from the function minFunc. Training the model on the entire MNIST training set of 60000 28x28 images should be rather quick, and take less than 5 minutes for 100 iterations.

Factoring softmaxTrain out as a function means that you will be able to easily reuse it to train softmax models on other data sets in the future by invoking the function with different parameters.

Use the following parameter when training your softmax classifier:

lambda = 1e-4

Step 5: Testing

Now that you've trained your model, you will test it against the MNIST test set, comprising 10000 28x28 images. However, to do so, you will first need to complete the function softmaxPredict in softmaxPredict.m, a function which generates predictions for input data under a trained softmax model.

Once that is done, you will be able to compute the accuracy (the proportion of correctly classified images) of your model using the code provided. Our implementation achieved an accuracy of 92.6%. If your model's accuracy is significantly less (less than 91%), check your code, ensure that you are using the trained weights, and that you are training your model on the full 60000 training images. Conversely, if your accuracy is too high (99-100%), ensure that you have not accidentally trained your model on the test set as well.


CS294A/CS294W Softmax Exercise

%  Instructions
%  ------------
%
%  This file contains code that helps you get started on the
%  softmax exercise. You will need to write the softmax cost function
%  in softmaxCost.m and the softmax prediction function in softmaxPredict.m.
%  For this exercise, you will not need to change any code in this file,
%  or any other files other than those mentioned above.
%  (However, you may be required to do so in later exercises)

%%======================================================================

STEP 0: Initialise constants and parameters

Here we define and initialise some constants which allow your code
to be used more generally on any arbitrary input.
We also initialise some parameters used for tuning the model.
inputSize = 28 * 28; % Size of input vector (MNIST images are 28x28)
numClasses = 10;     % Number of classes (MNIST images fall into 10 classes)

lambda = 1e-4; % Weight decay parameter

%%======================================================================

STEP 1: Load data

In this section, we load the input and output data.
For softmax regression on MNIST pixels,
the input data is the images, and
the output data is the labels.
% Change the filenames if you've saved the files under different names
% On some platforms, the files might be saved as
% train-images.idx3-ubyte / train-labels.idx1-ubyte

images = loadMNISTImages('train-images.idx3-ubyte');
images = images(:, 1:1000);   % keep only the first 1000 training images (speeds things up; the exercise itself trains on all 60000)
labels = loadMNISTLabels('train-labels.idx1-ubyte');
labels = labels(1:1000);      % keep the matching 1000 labels
labels(labels==0) = 10; % Remap 0 to 10

inputData = images;

% For debugging purposes, you may wish to reduce the size of the input data
% in order to speed up gradient checking.
% Here, we create synthetic dataset using random data for testing

% DEBUG = true; % Set DEBUG to true when debugging.
DEBUG = false;
if DEBUG
    inputSize = 8;
    inputData = randn(8, 100);
    labels = randi(10, 100, 1);
end

% Randomly initialise theta
theta = 0.005 * randn(numClasses * inputSize, 1); % the parameters are passed in as a single unrolled column vector

%%======================================================================

STEP 2: Implement softmaxCost

Implement softmaxCost in softmaxCost.m.
[cost, grad] = softmaxCost(theta, numClasses, inputSize, lambda, inputData, labels);

%%======================================================================

STEP 3: Gradient checking

As with any learning algorithm, you should always check that your
gradients are correct before learning the parameters.
if DEBUG
    numGrad = computeNumericalGradient( @(x) softmaxCost(x, numClasses, ...
                                    inputSize, lambda, inputData, labels), theta);

    % Use this to visually compare the gradients side by side
    disp([numGrad grad]);

    % Compare numerically computed gradients with those computed analytically
    diff = norm(numGrad-grad)/norm(numGrad+grad);
    disp(diff);
    % The difference should be small.
    % In our implementation, these values are usually less than 1e-7.

    % When your gradients are correct, congratulations!
end

%%======================================================================

STEP 4: Learning parameters

Once you have verified that your gradients are correct,
you can start training your softmax regression code using softmaxTrain
(which uses minFunc).
options.maxIter = 100;
% softmaxModel is just a struct containing the learned optimal parameters,
% along with the input size and the number of classes
softmaxModel = softmaxTrain(inputSize, numClasses, lambda, ...
                            inputData, labels, options);

% Although we only use 100 iterations here to train a classifier for the
% MNIST data set, in practice, training for more iterations is usually
% beneficial.

%%======================================================================
 Iteration   FunEvals     Step Length    Function Val        Opt Cond
         1          4    1.59248e+000    1.19230e+000    4.33255e+001
         2          5    1.00000e+000    7.28040e-001    2.62810e+001
         3          6    1.00000e+000    5.96902e-001    1.36673e+001
         4          7    1.00000e+000    4.97060e-001    7.62977e+000
         5          8    1.00000e+000    4.36190e-001    8.28016e+000
         6          9    1.00000e+000    3.83119e-001    7.72440e+000
         7         10    1.00000e+000    3.48327e-001    6.34988e+000
         8         11    1.00000e+000    3.04777e-001    5.26709e+000
         9         12    1.00000e+000    2.62218e-001    6.92635e+000
        10         13    1.00000e+000    2.19711e-001    4.17829e+000
        11         14    1.00000e+000    1.86280e-001    2.60336e+000
        12         15    1.00000e+000    1.58599e-001    3.04655e+000
        13         16    1.00000e+000    1.32768e-001    2.97857e+000
        14         17    1.00000e+000    1.03318e-001    2.47431e+000
        15         18    1.00000e+000    8.39789e-002    1.97871e+000
        16         19    1.00000e+000    7.12138e-002    1.20812e+000
        17         20    1.00000e+000    6.10423e-002    8.64657e-001
        18         21    1.00000e+000    5.22495e-002    7.74831e-001
        19         22    1.00000e+000    4.78492e-002    6.27386e-001
        20         23    1.00000e+000    4.51253e-002    4.16814e-001
        21         24    1.00000e+000    4.31256e-002    3.32939e-001
        22         25    1.00000e+000    4.13947e-002    2.74519e-001
        23         26    1.00000e+000    3.92058e-002    2.45624e-001
        24         27    1.00000e+000    3.75433e-002    2.70281e-001
        25         28    1.00000e+000    3.66373e-002    1.99194e-001
        26         29    1.00000e+000    3.58240e-002    1.91338e-001
        27         30    1.00000e+000    3.48081e-002    2.05655e-001
        28         31    1.00000e+000    3.42492e-002    2.22351e-001
        29         32    1.00000e+000    3.39075e-002    1.20124e-001
        30         33    1.00000e+000    3.35928e-002    1.13429e-001
        31         34    1.00000e+000    3.33237e-002    1.21845e-001
        32         35    1.00000e+000    3.29107e-002    1.07538e-001
        33         36    1.00000e+000    3.27070e-002    1.76798e-001
        34         37    1.00000e+000    3.24330e-002    9.03546e-002
        35         38    1.00000e+000    3.23295e-002    7.23728e-002
        36         39    1.00000e+000    3.22197e-002    6.97756e-002
        37         40    1.00000e+000    3.20992e-002    5.95588e-002
        38         41    1.00000e+000    3.19603e-002    5.09530e-002
        39         42    1.00000e+000    3.18884e-002    6.00830e-002
        40         43    1.00000e+000    3.18445e-002    3.63585e-002
        41         44    1.00000e+000    3.18088e-002    3.94141e-002
        42         45    1.00000e+000    3.17800e-002    3.49093e-002
        43         46    1.00000e+000    3.17362e-002    3.53885e-002
        44         47    1.00000e+000    3.17287e-002    5.05020e-002
        45         48    1.00000e+000    3.16976e-002    2.08303e-002
        46         49    1.00000e+000    3.16890e-002    1.68020e-002
        47         50    1.00000e+000    3.16768e-002    1.80603e-002
        48         51    1.00000e+000    3.16639e-002    1.81174e-002
        49         52    1.00000e+000    3.16465e-002    1.75304e-002
        50         53    1.00000e+000    3.16425e-002    2.38297e-002
        51         54    1.00000e+000    3.16374e-002    1.10866e-002
        52         55    1.00000e+000    3.16358e-002    1.00352e-002
        53         56    1.00000e+000    3.16328e-002    1.03756e-002
        54         57    1.00000e+000    3.16290e-002    1.00415e-002
        55         58    1.00000e+000    3.16261e-002    1.38456e-002
        56         59    1.00000e+000    3.16240e-002    6.29343e-003
        57         60    1.00000e+000    3.16231e-002    5.60261e-003
        58         61    1.00000e+000    3.16223e-002    5.45850e-003
        59         62    1.00000e+000    3.16210e-002    5.09732e-003
        60         63    1.00000e+000    3.16202e-002    7.21061e-003
        61         64    1.00000e+000    3.16196e-002    3.37519e-003
        62         65    1.00000e+000    3.16193e-002    3.11828e-003
        63         66    1.00000e+000    3.16192e-002    3.09610e-003
        64         67    1.00000e+000    3.16188e-002    2.87817e-003
        65         68    1.00000e+000    3.16185e-002    4.14749e-003
        66         69    1.00000e+000    3.16183e-002    2.38061e-003
        67         70    1.00000e+000    3.16182e-002    1.54519e-003
        68         71    1.00000e+000    3.16182e-002    1.64320e-003
        69         72    1.00000e+000    3.16181e-002    1.72780e-003
        70         73    1.00000e+000    3.16180e-002    1.41962e-003
        71         74    1.00000e+000    3.16179e-002    1.16720e-003
        72         75    1.00000e+000    3.16178e-002    9.07254e-004
        73         76    1.00000e+000    3.16178e-002    7.43246e-004
        74         77    1.00000e+000    3.16178e-002    5.51991e-004
        75         78    1.00000e+000    3.16178e-002    6.07661e-004
        76         79    1.00000e+000    3.16178e-002    4.61223e-004
        77         80    1.00000e+000    3.16177e-002    3.86478e-004
        78         81    1.00000e+000    3.16177e-002    4.58116e-004
        79         82    1.00000e+000    3.16177e-002    2.64885e-004
        80         83    1.00000e+000    3.16177e-002    2.24526e-004
        81         84    1.00000e+000    3.16177e-002    1.83070e-004
        82         85    1.00000e+000    3.16177e-002    1.56014e-004
        83         86    1.00000e+000    3.16177e-002    1.49062e-004
        84         87    1.00000e+000    3.16177e-002    1.39336e-004
Function Value changing by less than TolX

STEP 5: Testing

You should now test your model against the test images.
To do this, you will first need to write softmaxPredict
(in softmaxPredict.m), which should return predictions
given a softmax model and the input data.
images = loadMNISTImages('t10k-images.idx3-ubyte');
labels = loadMNISTLabels('t10k-labels.idx1-ubyte');
labels(labels==0) = 10; % Remap 0 to 10

inputData = images;
size(softmaxModel.optTheta)
size(inputData)

% You will have to implement softmaxPredict in softmaxPredict.m
[pred] = softmaxPredict(softmaxModel, inputData);

acc = mean(labels(:) == pred(:));
fprintf('Accuracy: %0.3f%%\n', acc * 100);

% Accuracy is the proportion of correctly classified images
% After 100 iterations, the results for our implementation were:
%
% Accuracy: 92.200%
%
% If your values are too low (accuracy less than 0.91), you should check
% your code for errors, and make sure you are training on the
% entire data set of 60000 28x28 training images
% (unless you modified the loading code, this should be the case)
function images = loadMNISTImages(filename)
%loadMNISTImages returns a 28x28x[number of MNIST images] matrix containing
%the raw MNIST images

fp = fopen(filename, 'rb');
assert(fp ~= -1, ['Could not open ', filename, '']);

magic = fread(fp, 1, 'int32', 0, 'ieee-be');
assert(magic == 2051, ['Bad magic number in ', filename, '']);

numImages = fread(fp, 1, 'int32', 0, 'ieee-be');
numRows = fread(fp, 1, 'int32', 0, 'ieee-be');
numCols = fread(fp, 1, 'int32', 0, 'ieee-be');

images = fread(fp, inf, 'unsigned char');
images = reshape(images, numCols, numRows, numImages);
images = permute(images,[2 1 3]);

fclose(fp);

% Reshape to #pixels x #examples
images = reshape(images, size(images, 1) * size(images, 2), size(images, 3));
% Convert to double and rescale to [0,1]
images = double(images) / 255;

end
function labels = loadMNISTLabels(filename)
%loadMNISTLabels returns a [number of MNIST images]x1 matrix containing
%the labels for the MNIST images

fp = fopen(filename, 'rb');
assert(fp ~= -1, ['Could not open ', filename, '']);

magic = fread(fp, 1, 'int32', 0, 'ieee-be');
assert(magic == 2049, ['Bad magic number in ', filename, '']);

numLabels = fread(fp, 1, 'int32', 0, 'ieee-be');

labels = fread(fp, inf, 'unsigned char');

assert(size(labels,1) == numLabels, 'Mismatch in label count');

fclose(fp);

end

function [cost, grad] = softmaxCost(theta, numClasses, inputSize, lambda, data, labels)
% numClasses - the number of classes
% inputSize - the size N of the input vector
% lambda - weight decay parameter
% data - the N x M input matrix, where each column data(:, i) corresponds to
%        a single example
% labels - an M x 1 matrix containing the labels corresponding for the input data
%

% Unroll the parameters from theta
theta = reshape(theta, numClasses, inputSize); % reshape the unrolled parameter vector into a numClasses x inputSize matrix

numCases = size(data, 2); % number of training examples
groundTruth = full(sparse(labels, 1:numCases, 1)); % sparse builds a matrix whose (labels(i), i) entries are all 1,
                                                   % and full converts it to a dense ground truth matrix
cost = 0;

thetagrad = zeros(numClasses, inputSize);

% ---------- YOUR CODE HERE --------------------------------------
% Instructions: Compute the cost and gradient for softmax regression.
%               You need to compute thetagrad and cost.
%               The groundTruth matrix might come in handy.
M = bsxfun(@minus, theta*data, max(theta*data, [], 1)); % theta_j' * x^(i), shifted by each column's max for numerical stability
M = exp(M);
p = bsxfun(@rdivide, M, sum(M));                        % p(j,i) = p(y^(i) = j | x^(i); theta)
cost = -1/numCases * groundTruth(:)' * log(p(:)) + lambda/2 * sum(theta(:) .^ 2);
thetagrad = -1/numCases * (groundTruth - p) * data' + lambda * theta;



% ------------------------------------------------------------------
% Unroll the gradient matrices into a vector for minFunc
grad = [thetagrad(:)];
end
function [softmaxModel] = softmaxTrain(inputSize, numClasses, lambda, inputData, labels, options)
%softmaxTrain Train a softmax model with the given parameters on the given
% data. Returns softmaxOptTheta, a vector containing the trained parameters
% for the model.
%
% inputSize: the size of an input vector x^(i)
% numClasses: the number of classes
% lambda: weight decay parameter
% inputData: an N by M matrix containing the input data, such that
%            inputData(:, c) is the cth input
% labels: M by 1 matrix containing the class labels for the
%            corresponding inputs. labels(c) is the class label for
%            the cth input
% options (optional): options
%   options.maxIter: number of iterations to train for

if ~exist('options', 'var')
    options = struct;
end

if ~isfield(options, 'maxIter')
    options.maxIter = 400;
end

% initialize parameters
theta = 0.005 * randn(numClasses * inputSize, 1);

% Use minFunc to minimize the function
addpath minFunc/
options.Method = 'lbfgs'; % Here, we use L-BFGS to optimize our cost
                          % function. Generally, for minFunc to work, you
                          % need a function pointer with two outputs: the
                          % function value and the gradient. In our problem,
                          % softmaxCost.m satisfies this.
minFuncOptions.display = 'on';

[softmaxOptTheta, cost] = minFunc( @(p) softmaxCost(p, ...
                                   numClasses, inputSize, lambda, ...
                                   inputData, labels), ...
                              theta, options);

% Fold softmaxOptTheta into a nicer format
softmaxModel.optTheta = reshape(softmaxOptTheta, numClasses, inputSize);
softmaxModel.inputSize = inputSize;
softmaxModel.numClasses = numClasses;

end
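The script above also calls softmaxPredict, which the exercise leaves for you to implement and which is not reproduced in this post. A minimal sketch, assuming the softmaxModel struct returned by softmaxTrain and an N x M data matrix: since the softmax probabilities in each column are a monotone function of theta * data, the predicted label is simply the row index of the largest entry in each column.

function pred = softmaxPredict(softmaxModel, data)
%softmaxPredict - sketch of the prediction function the exercise asks for.
% softmaxModel - model struct returned by softmaxTrain; uses softmaxModel.optTheta,
%                a numClasses x inputSize matrix
% data - the N x M input matrix, where each column data(:, i) is a single example
%
% pred - 1 x M row vector of predicted class labels in {1, ..., numClasses}

theta = softmaxModel.optTheta;

% The class with the largest score theta_j' * x also has the largest softmax
% probability, so the normalization can be skipped.
[~, pred] = max(theta * data, [], 1);

end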

