The linear regression model is simple. It assumes that the output h_θ(x) is linear in each component of the input x. In mathematical form,

h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n

where θ denotes the parameter vector and x the input vector. In general, we use n to denote the number of features, although θ and x then have n + 1 components, since by convention x_0 is fixed to 1 to absorb the intercept term θ_0.
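As a quick concrete check, here is a minimal sketch of evaluating the hypothesis h_θ(x) = θ^T x under the x_0 = 1 convention; NumPy is assumed, and the parameter values and input are made-up numbers for illustration only:

```python
import numpy as np

# Hypothetical parameters (theta_0 is the intercept) and a raw 2-feature input.
theta = np.array([1.0, 2.0, 3.0])
x_raw = np.array([0.5, -1.0])

# Prepend x_0 = 1 so the intercept theta_0 is absorbed into the dot product.
x = np.concatenate(([1.0], x_raw))

h = theta @ x  # h_theta(x) = theta^T x
print(h)       # 1.0*1 + 2.0*0.5 + 3.0*(-1.0) = -1.0
```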
To use the linear regression model, the value of θ is indispensable, so let us now look at how to estimate θ. Suppose we have a set of data at hand; based on this data we will estimate θ. We use m to denote the number of data points, and define the cost function

J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right)^2.

This cost function measures the accumulated squared error between the model's predictions θ^T x and the reference outputs y. Naturally, we want to pick a set of parameters θ that minimizes the cost function.
If we express the training set in matrix/vector form, we can drop the summation sign ∑ when describing the cost function. Let X be the m × (n + 1) design matrix whose i-th row is (x^{(i)})^T, and let y be the m-vector of reference outputs. The cost function can then be written as

J(\theta) = \frac{1}{2} (X\theta - y)^T (X\theta - y).
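To see that the matrix form (1/2)(Xθ − y)^T(Xθ − y) agrees with the summation form of the cost function, a small sketch; all numbers are toy data invented for illustration:

```python
import numpy as np

# Toy training set: m = 4 examples, n = 1 feature, plus the x_0 = 1 column.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.5, 4.0])
theta = np.array([0.5, 1.0])  # arbitrary parameter guess

# Summation form: J = (1/2) * sum_i (theta^T x_i - y_i)^2
J_sum = 0.5 * sum((theta @ x_i - y_i) ** 2 for x_i, y_i in zip(X, y))

# Matrix form: J = (1/2) * (X theta - y)^T (X theta - y)
r = X @ theta - y
J_mat = 0.5 * (r @ r)

print(J_sum, J_mat)  # identical up to floating-point rounding
```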
Below we present a method called the normal equation for determining the best parameter vector θ. The following passage is copied from Wikipedia, so bear with me. Its definition of the cost function differs slightly from ours, S = 2J, but this has no effect on the resulting θ.

For another common numerical approach, see my other post, Gradient Descent Algorithm.
Derivation of the normal equations
Common method
Define the ith residual to be

r_i = y_i - \sum_{j=1}^{n} X_{ij} \theta_j.

Then S can be rewritten

S = \sum_{i=1}^{m} r_i^2.
S is minimized when its gradient vector is zero. (This follows by definition: if the gradient vector is not zero, there is a direction in which we can move to decrease S further - see maxima and minima.) The elements of the gradient vector are the partial derivatives of S with respect to the parameters:

\frac{\partial S}{\partial \theta_j} = 2 \sum_{i=1}^{m} r_i \frac{\partial r_i}{\partial \theta_j} \quad (j = 1, \ldots, n).

The derivatives are

\frac{\partial r_i}{\partial \theta_j} = -X_{ij}.

Substitution of the expressions for the residuals and the derivatives into the gradient equations gives

\frac{\partial S}{\partial \theta_j} = 2 \sum_{i=1}^{m} \left( y_i - \sum_{k=1}^{n} X_{ik} \theta_k \right) (-X_{ij}) \quad (j = 1, \ldots, n).

Thus if \hat\theta minimizes S, we have

2 \sum_{i=1}^{m} \left( y_i - \sum_{k=1}^{n} X_{ik} \hat\theta_k \right) (-X_{ij}) = 0 \quad (j = 1, \ldots, n).

Upon rearrangement, we obtain the normal equations:

\sum_{i=1}^{m} \sum_{k=1}^{n} X_{ij} X_{ik} \hat\theta_k = \sum_{i=1}^{m} X_{ij} y_i \quad (j = 1, \ldots, n).

The normal equations are written in matrix notation as

(X^T X) \hat\theta = X^T y,

where X^T is the matrix transpose of X.
The solution of the normal equations yields the vector θ^ of the optimal parameter values.
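A minimal sketch of solving the normal equations numerically; the design matrix and outputs are toy data of my own invention:

```python
import numpy as np

# Toy design matrix (x_0 = 1 column plus one feature) and outputs.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.5, 4.0])

# Solve (X^T X) theta = X^T y directly, without forming the explicit inverse.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Sanity check: the gradient -2 X^T (y - X theta) vanishes at the minimizer.
grad = -2 * X.T @ (y - X @ theta_hat)
print(theta_hat, grad)
```

Using `np.linalg.solve` (or, safer still when X^T X is ill-conditioned, `np.linalg.lstsq` on X itself) is preferred over computing the inverse of X^T X explicitly.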
Derivation directly in terms of matrices
The normal equations can be derived directly from a matrix representation of the problem as follows. The objective is to minimize

S(\theta) = \| y - X\theta \|^2 = (y - X\theta)^T (y - X\theta) = y^T y - \theta^T X^T y - y^T X \theta + \theta^T X^T X \theta.

Note that (\theta^T X^T y)^T = y^T X \theta has dimension 1×1 (the number of columns of y), so it is a scalar and equal to its own transpose; hence \theta^T X^T y = y^T X \theta, and the quantity to minimize becomes

S(\theta) = y^T y - 2 \theta^T X^T y + \theta^T X^T X \theta.

Differentiating this with respect to θ and equating to zero to satisfy the first-order conditions gives

-2 X^T y + 2 X^T X \theta = 0,

which is equivalent to the above-given normal equations. A sufficient condition for satisfaction of the second-order conditions for a minimum is that X have full column rank, in which case X^T X is positive definite.
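The full-column-rank condition is easy to check numerically. A sketch, reusing a toy design matrix of my own as an assumption:

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])

# Full column rank <=> the rank equals the number of columns.
assert np.linalg.matrix_rank(X) == X.shape[1]

# Then X^T X is positive definite: all eigenvalues are strictly positive
# (X^T X is symmetric, so eigvalsh applies).
eigs = np.linalg.eigvalsh(X.T @ X)
print(eigs)
assert np.all(eigs > 0)
```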
Derivation without calculus
When X^T X is positive definite, the formula for the minimizing value of θ can be derived without the use of derivatives. The quantity

S(\theta) = y^T y - 2 \theta^T X^T y + \theta^T (X^T X) \theta

can be written as

S(\theta) = \langle \theta, \theta \rangle_A - 2 \left\langle \theta, (X^T X)^{-1} X^T y \right\rangle_A + \left\langle (X^T X)^{-1} X^T y, (X^T X)^{-1} X^T y \right\rangle_A + C,

where C depends only on y and X, and \langle u, v \rangle_A = u^T A v is the inner product induced by A = X^T X. It follows that S(θ) is equal to

S(\theta) = \left\| \theta - (X^T X)^{-1} X^T y \right\|_A^2 + C

and is therefore minimized exactly when

\theta = \hat\theta = (X^T X)^{-1} X^T y.
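The completing-the-square identity above can be verified numerically. A sketch with toy data of my own (the names theta_hat, A, and C mirror the derivation; the constant works out to C = y^T y − θ̂^T A θ̂):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.5, 4.0])

A = X.T @ X                              # positive definite for this X
theta_hat = np.linalg.solve(A, X.T @ y)  # (X^T X)^{-1} X^T y
C = y @ y - theta_hat @ A @ theta_hat    # the theta-independent constant

def S(theta):
    """Wikipedia's cost: squared norm of the residual vector."""
    r = y - X @ theta
    return r @ r

# For any theta, S(theta) should equal ||theta - theta_hat||_A^2 + C.
theta = rng.standard_normal(2)
d = theta - theta_hat
lhs = S(theta)
rhs = d @ A @ d + C
print(lhs, rhs)
```

At θ = θ̂ the quadratic term vanishes, so the minimum value of S is exactly C.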