The linear regression model is simple. It assumes that the output h_θ(x) is linear in each component of the input x. In mathematical form,

h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n

where θ denotes the parameter vector and x the input vector. In general, we use n to denote the number of features, although θ and x then have n + 1 components, since by convention x_0 is fixed to 1 to absorb the intercept term θ_0.
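As a quick concrete check, here is a minimal sketch of evaluating the hypothesis h_θ(x) = θ^T x under the x_0 = 1 convention; NumPy is assumed, and the parameter values and input are made-up numbers for illustration only:

```python
import numpy as np

# Hypothetical parameters (theta_0 is the intercept) and a raw 2-feature input.
theta = np.array([1.0, 2.0, 3.0])
x_raw = np.array([0.5, -1.0])

# Prepend x_0 = 1 so the intercept theta_0 is absorbed into the dot product.
x = np.concatenate(([1.0], x_raw))

h = theta @ x  # h_theta(x) = theta^T x
print(h)       # 1.0*1 + 2.0*0.5 + 3.0*(-1.0) = -1.0
```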
To use the linear regression model, the value of θ is indispensable, so let us now look at how to estimate θ. Suppose we have a set of data at hand; based on this data we will estimate θ. We use m to denote the number of data points, and define the cost function

J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right)^2.

This cost function measures the accumulated squared error between the model's predictions θ^T x and the reference outputs y. Naturally, we want to pick a set of parameters θ that minimizes the cost function.
If we express the training set in matrix/vector form, we can drop the summation sign ∑ when describing the cost function. Let X be the m × (n + 1) design matrix whose i-th row is (x^{(i)})^T, and let y be the m-vector of reference outputs. The cost function can then be written as

J(\theta) = \frac{1}{2} (X\theta - y)^T (X\theta - y).
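To see that the matrix form (1/2)(Xθ − y)^T(Xθ − y) agrees with the summation form of the cost function, a small sketch; all numbers are toy data invented for illustration:

```python
import numpy as np

# Toy training set: m = 4 examples, n = 1 feature, plus the x_0 = 1 column.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.5, 4.0])
theta = np.array([0.5, 1.0])  # arbitrary parameter guess

# Summation form: J = (1/2) * sum_i (theta^T x_i - y_i)^2
J_sum = 0.5 * sum((theta @ x_i - y_i) ** 2 for x_i, y_i in zip(X, y))

# Matrix form: J = (1/2) * (X theta - y)^T (X theta - y)
r = X @ theta - y
J_mat = 0.5 * (r @ r)

print(J_sum, J_mat)  # identical up to floating-point rounding
```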
Below we present a method called the normal equation for determining the best parameter vector θ. The following passage is copied from Wikipedia, so bear with me. Its definition of the cost function differs slightly from ours, S = 2J, but this has no effect on the resulting θ.

For another common numerical approach, see my other post, Gradient Descent Algorithm.
Derivation of the normal equations
Common method
Define the ith residual to be

r_i = y_i - \sum_{j=1}^{n} X_{ij} \theta_j.

Then S can be rewritten

S = \sum_{i=1}^{m} r_i^2.
S is minimized when its gradient vector is zero. (This follows by definition: if the gradient vector is not zero, there is a direction in which we can move to decrease S further - see maxima and minima.) The elements of the gradient vector are the partial derivatives of S with respect to the parameters:

\frac{\partial S}{\partial \theta_j} = 2 \sum_{i=1}^{m} r_i \frac{\partial r_i}{\partial \theta_j} \quad (j = 1, \ldots, n).

The derivatives are

\frac{\partial r_i}{\partial \theta_j} = -X_{ij}.

Substitution of the expressions for the residuals and the derivatives into the gradient equations gives

\frac{\partial S}{\partial \theta_j} = 2 \sum_{i=1}^{m} \left( y_i - \sum_{k=1}^{n} X_{ik} \theta_k \right) (-X_{ij}) \quad (j = 1, \ldots, n).

Thus if \hat\theta minimizes S, we have

2 \sum_{i=1}^{m} \left( y_i - \sum_{k=1}^{n} X_{ik} \hat\theta_k \right) (-X_{ij}) = 0 \quad (j = 1, \ldots, n).

Upon rearrangement, we obtain the normal equations:

\sum_{i=1}^{m} \sum_{k=1}^{n} X_{ij} X_{ik} \hat\theta_k = \sum_{i=1}^{m} X_{ij} y_i \quad (j = 1, \ldots, n).

The normal equations are written in matrix notation as

(X^T X) \hat\theta = X^T y,

where X^T is the matrix transpose of X.
The solution of the normal equations yields the vector θ^ of the optimal parameter values.
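A minimal sketch of solving the normal equations numerically; the design matrix and outputs are toy data of my own invention:

```python
import numpy as np

# Toy design matrix (x_0 = 1 column plus one feature) and outputs.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.5, 4.0])

# Solve (X^T X) theta = X^T y directly, without forming the explicit inverse.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Sanity check: the gradient -2 X^T (y - X theta) vanishes at the minimizer.
grad = -2 * X.T @ (y - X @ theta_hat)
print(theta_hat, grad)
```

Using `np.linalg.solve` (or, safer still when X^T X is ill-conditioned, `np.linalg.lstsq` on X itself) is preferred over computing the inverse of X^T X explicitly.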
Derivation directly in terms of matrices
The normal equations can be derived directly from a matrix representation of the problem as follows. The objective is to minimize

S(\theta) = \| y - X\theta \|^2 = (y - X\theta)^T (y - X\theta) = y^T y - \theta^T X^T y - y^T X \theta + \theta^T X^T X \theta.

Note that (\theta^T X^T y)^T = y^T X \theta has dimension 1×1 (the number of columns of y), so it is a scalar and equal to its own transpose; hence \theta^T X^T y = y^T X \theta, and the quantity to minimize becomes

S(\theta) = y^T y - 2 \theta^T X^T y + \theta^T X^T X \theta.

Differentiating this with respect to θ and equating to zero to satisfy the first-order conditions gives

-2 X^T y + 2 X^T X \theta = 0,

which is equivalent to the above-given normal equations. A sufficient condition for satisfaction of the second-order conditions for a minimum is that X have full column rank, in which case X^T X is positive definite.
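The full-column-rank condition is easy to check numerically. A sketch, reusing a toy design matrix of my own as an assumption:

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])

# Full column rank <=> the rank equals the number of columns.
assert np.linalg.matrix_rank(X) == X.shape[1]

# Then X^T X is positive definite: all eigenvalues are strictly positive
# (X^T X is symmetric, so eigvalsh applies).
eigs = np.linalg.eigvalsh(X.T @ X)
print(eigs)
assert np.all(eigs > 0)
```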
Derivation without calculus
When X^T X is positive definite, the formula for the minimizing value of θ can be derived without the use of derivatives. The quantity

S(\theta) = y^T y - 2 \theta^T X^T y + \theta^T (X^T X) \theta

can be written as

S(\theta) = \langle \theta, \theta \rangle_A - 2 \left\langle \theta, (X^T X)^{-1} X^T y \right\rangle_A + \left\langle (X^T X)^{-1} X^T y, (X^T X)^{-1} X^T y \right\rangle_A + C,

where C depends only on y and X, and \langle u, v \rangle_A = u^T A v is the inner product induced by A = X^T X. It follows that S(θ) is equal to

S(\theta) = \left\| \theta - (X^T X)^{-1} X^T y \right\|_A^2 + C

and is therefore minimized exactly when

\theta = \hat\theta = (X^T X)^{-1} X^T y.
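The completing-the-square identity above can be verified numerically. A sketch with toy data of my own (the names theta_hat, A, and C mirror the derivation; the constant works out to C = y^T y − θ̂^T A θ̂):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.5, 4.0])

A = X.T @ X                              # positive definite for this X
theta_hat = np.linalg.solve(A, X.T @ y)  # (X^T X)^{-1} X^T y
C = y @ y - theta_hat @ A @ theta_hat    # the theta-independent constant

def S(theta):
    """Wikipedia's cost: squared norm of the residual vector."""
    r = y - X @ theta
    return r @ r

# For any theta, S(theta) should equal ||theta - theta_hat||_A^2 + C.
theta = rng.standard_normal(2)
d = theta - theta_hat
lhs = S(theta)
rhs = d @ A @ d + C
print(lhs, rhs)
```

At θ = θ̂ the quadratic term vanishes, so the minimum value of S is exactly C.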