1. The normal equations
Gradient descent gives one way of minimizing J. Let's discuss a second way of doing so, this time performing the minimization explicitly and without resorting to an iterative algorithm. In this method, we will minimize J by explicitly taking its derivatives with respect to the θj's and setting them to zero.
2. Matrix derivatives
For a function f : R^{m×n} → R mapping from m-by-n matrices to the real numbers, we define the derivative of f with respect to A to be:

$$\nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}$$
Thus, the gradient ∇Af(A) is itself an m-by-n matrix, whose (i, j)-element is ∂f/∂Aij .
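To make the definition concrete, here is a minimal NumPy sketch that approximates each entry of ∇Af(A) by a central finite difference; the particular function f and the test matrix are illustrative assumptions, not from the text:

```python
import numpy as np

def f(A):
    # Illustrative scalar-valued function of a matrix: f(A) = A11 + A12^2
    return A[0, 0] + A[0, 1] ** 2

def numerical_gradient(f, A, eps=1e-6):
    # Approximate the (i, j) entry of grad_A f(A), i.e. df/dA_ij,
    # by a central difference in the (i, j) coordinate direction.
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = eps
            G[i, j] = (f(A + E) - f(A - E)) / (2 * eps)
    return G

A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(numerical_gradient(f, A))  # approx [[1., 4.], [0., 0.]]: df/dA11 = 1, df/dA12 = 2*A12
```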
We also introduce the trace operator, written "tr". For an n-by-n (square) matrix A, the trace of A is defined to be the sum of its diagonal entries:

$$\operatorname{tr} A = \sum_{i=1}^{n} A_{ii}$$
If a is a real number (i.e., a 1-by-1 matrix), then tr a = a.
More properties of the trace operator (assuming the matrix dimensions are such that each product is well-defined):
a. tr AB = tr BA;
b. tr ABC = tr CAB = tr BCA;
c. tr ABCD = tr DABC = tr CDAB = tr BCDA;
d. tr A = tr A^T;
e. tr(A + B) = tr A + tr B;
f. tr(aA) = a(tr A).
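These identities are easy to sanity-check numerically; below is a minimal sketch using random matrices, with shapes chosen so that every product is defined (all names and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))
C = rng.standard_normal((3, 3))
D = rng.standard_normal((3, 3))

# a. tr AB = tr BA
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
# b. cyclic invariance: tr ABC = tr CAB = tr BCA
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))
# d. tr A = tr A^T (square matrix)
assert np.isclose(np.trace(C), np.trace(C.T))
# e., f. linearity of the trace
assert np.isclose(np.trace(C + D), np.trace(C) + np.trace(D))
assert np.isclose(np.trace(2.5 * C), 2.5 * np.trace(C))
print("all trace identities hold")
```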
We now state without proof some facts about matrix derivatives:
a. ∇A tr AB = B^T;
b. ∇A^T f(A) = (∇A f(A))^T, where ∇A^T denotes the gradient taken with respect to A^T;
c. ∇A tr ABA^T C = CAB + C^T AB^T;
d. ∇A |A| = |A|(A^{-1})^T (for a non-singular square matrix A).
Proof of (d): Define A′ to be the matrix whose (i, j) element is (−1)^{i+j} times the determinant of the square matrix resulting from deleting row i and column j from A (this is the cofactor matrix of A). It can be shown that A^{-1} = (A′)^T / |A|.
The determinant of A can be expanded along any row i as |A| = Σj Aij (A′)ij. Since (A′)ij does not depend on Aij (as can be seen from its definition), this implies that (∂/∂Aij)|A| = (A′)ij, and therefore ∇A|A| = A′ = |A|(A^{-1})^T.
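Facts (a) and (d) can likewise be verified with the finite-difference gradient from the earlier sketch (repeated here so the snippet stands alone; the random test matrices are illustrative assumptions):

```python
import numpy as np

def numerical_gradient(f, A, eps=1e-6):
    # Entrywise central-difference approximation of grad_A f(A)
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = eps
            G[i, j] = (f(A + E) - f(A - E)) / (2 * eps)
    return G

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

# a. grad_A tr(AB) = B^T
Ga = numerical_gradient(lambda M: np.trace(M @ B), A)
assert np.allclose(Ga, B.T, atol=1e-4)

# d. grad_A |A| = |A| (A^{-1})^T  (A non-singular)
Gd = numerical_gradient(lambda M: np.linalg.det(M), A)
assert np.allclose(Gd, np.linalg.det(A) * np.linalg.inv(A).T, atol=1e-4)
print("matrix-derivative facts verified")
```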
3. Least squares revisited
Define the design matrix X to be the m-by-(n+1) matrix that contains the training examples' input values in its rows:

$$X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}$$
Also, let $\vec{y}$ be the m-dimensional vector containing all the target values from the training set:

$$\vec{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$
Now, since h(x^{(i)}) = (x^{(i)})^T θ, we can easily verify that:

$$X\theta - \vec{y} = \begin{bmatrix} (x^{(1)})^T\theta \\ \vdots \\ (x^{(m)})^T\theta \end{bmatrix} - \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix} = \begin{bmatrix} h(x^{(1)}) - y^{(1)} \\ \vdots \\ h(x^{(m)}) - y^{(m)} \end{bmatrix}$$
Hence, using the fact that z^T z = Σi zi^2 for a vector z,

$$\frac{1}{2}(X\theta - \vec{y})^T (X\theta - \vec{y}) = \frac{1}{2}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)^2 = J(\theta)$$
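A quick numerical check of this identity on synthetic data (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 5, 2
X = np.hstack([np.ones((m, 1)), rng.standard_normal((m, n))])  # design matrix with intercept column
y = rng.standard_normal(m)
theta = rng.standard_normal(n + 1)

J_matrix = 0.5 * (X @ theta - y) @ (X @ theta - y)             # (1/2)(X theta - y)^T (X theta - y)
J_sum = 0.5 * sum((x @ theta - yi) ** 2 for x, yi in zip(X, y))  # (1/2) sum of squared residuals
print(np.isclose(J_matrix, J_sum))  # True
```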
To minimize J, we set its derivatives to zero. Using the matrix-derivative facts above, the gradient works out to ∇θ J(θ) = X^T Xθ − X^T $\vec{y}$; setting it to zero, we obtain the normal equations:

$$X^T X\theta = X^T \vec{y}$$
Thus, the value of θ that minimizes J(θ) is given in closed form by the equation:

$$\theta = (X^T X)^{-1} X^T \vec{y}$$
Here, if X^T X is singular (for example, when the input features are linearly dependent), the inverse should be replaced by the pseudoinverse.
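Putting the pieces together, here is a minimal sketch that solves the normal equations on synthetic data (the data and names are illustrative); np.linalg.pinv gives the pseudoinverse variant mentioned above:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 100, 3
X = np.hstack([np.ones((m, 1)), rng.standard_normal((m, n))])  # m-by-(n+1) design matrix
true_theta = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_theta + 0.1 * rng.standard_normal(m)              # noisy linear targets

# Closed form: theta = (X^T X)^{-1} X^T y, solved without forming the inverse explicitly
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Pseudoinverse variant, which also covers the case where X^T X is singular
theta_pinv = np.linalg.pinv(X) @ y

print(np.allclose(theta_hat, theta_pinv))  # True
print(theta_hat)                           # close to true_theta
```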