
Chapter 2: Linear Regression

1 Statistical Learning Theory

1.1 Supervised Learning

In supervised learning, the goal is to learn the mapping (the rules) between a set of inputs and outputs.

1.2 Problem Definition

Given a set of $n$ examples (data) $\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$

Question: find a function $f$ such that $f(\mathbf{x}) = \hat{y}$

is a good predictor of $y$ for a future input $\mathbf{x}$ (fitting the data is not enough!).

1.3 Statistical Learning Definition

There is an unknown probability distribution on the product space $Z = X \times Y$, written $\mu(x, y)$.

We assume that $X$ is a compact domain in Euclidean space and $Y$ a bounded subset of $\mathbb{R}$. The training set

$S = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$ consists of $n$ samples drawn i.i.d. (independent and identically distributed) from $\mu$.

$\mathcal{H}$ is the hypothesis space, a space of functions $f: X \rightarrow Y$.

A learning algorithm is a map $L: Z^{n} \rightarrow \mathcal{H}$ that looks at $S$ and selects from $\mathcal{H}$ a function $f_{S}: X \rightarrow Y$ such that

$f_{S}(\mathbf{x}) \approx y$ in a predictive way.

Given a function $f$ and a loss function $\ell: Y \times Y \rightarrow \mathbb{R}$, we define the expected (or true) error of $f$ as

$$\mathcal{L}(f) = \mathbb{E}_{X, Y}[\ell(y, f(x))] = \int_{X \times Y} \ell(y, f(x)) \, d\mu(x, y)$$

which is the expected loss on a new example drawn at random from $\mu$.

The empirical error of $f_{S}$ is:

$$\mathcal{L}_{S}(f_{S}) = \frac{1}{n} \sum_{i=1}^{n} \ell\left(y_{i}, f_{S}(\mathbf{x}_{i})\right)$$

A very natural requirement for $f_{S}$ is distribution-independent generalization:

$$\forall \mu, \quad \lim_{n \to \infty} \left|\mathcal{L}_{S}(f_{S}) - \mathcal{L}(f_{S})\right| = 0$$

in probability. In other words, the training error for the solution must converge to the expected error and thus be a proxy for it.
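
As an illustration only, the following minimal Python/NumPy sketch compares the empirical error of a fitted predictor with a Monte Carlo estimate of its expected error on fresh samples. The data-generating distribution, sample sizes, and the least-squares line used as $f_S$ are invented for this example; they are not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical data-generating distribution mu(x, y): y = 2x + 1 + Gaussian noise.
def sample(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=n)
    return x, y

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

# Learn f_S from a training set S of n = 20 samples (least-squares line).
x_train, y_train = sample(20)
a, b = np.polyfit(x_train, y_train, 1)

def f_S(x):
    return a * x + b

# Empirical error L_S(f_S): average loss over the training set S.
empirical_error = squared_loss(y_train, f_S(x_train)).mean()

# Expected error L(f_S): Monte Carlo estimate on many fresh samples from mu.
x_new, y_new = sample(100_000)
expected_error = squared_loss(y_new, f_S(x_new)).mean()

print(empirical_error, expected_error)  # typically close, illustrating generalization
```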

2 Linear Regression

2.1 Introduction

In the simplest one-dimensional case, linear regression fits a line $y = ax + b$ to the data.

2.2 Problem Setting

2.2.1 Elements

A set of training data $\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$.

2.2.2 Assumptions

The function $f$ has a linear structure between the input $X = \left[\begin{array}{c} X_0 \\ X_1 \\ \vdots \\ X_p \end{array}\right]$, a random vector in $\mathbb{R}^{p+1}$ in which $X_0 = 1$, and the output $Y$, a random variable in $\mathbb{R}$. It has the form:

$$Y = f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j = X_0 \beta_0 + \sum_{j=1}^{p} X_j \beta_j = X^T \beta$$

where the $\beta_j$ are unknown parameters or coefficients, $\beta = \left[\begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{array}\right]$.
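
For example (a small sketch with made-up numbers, not taken from the notes), a single prediction is just the inner product of the augmented input $[1, X_1, \ldots, X_p]^T$ with $\beta$:

```python
import numpy as np

# Hypothetical example with p = 2 features.
beta = np.array([0.5, 2.0, -1.0])      # [beta_0, beta_1, beta_2]
x_raw = np.array([3.0, 4.0])           # the p raw feature values
x = np.concatenate(([1.0], x_raw))     # prepend X_0 = 1 for the intercept

y = x @ beta                           # f(X) = X^T beta
print(y)                               # 0.5 + 2.0*3.0 - 1.0*4.0 = 2.5
```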

The loss function $\ell: Y \times Y \rightarrow \mathbb{R}$ has the following form:

$$\ell(y, \hat{y}) = (y - \hat{y})^2$$

The empirical error of $f$ is:

$$\mathcal{L}_S(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\left(y_i, f(\mathbf{x}_i)\right) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \mathbf{x}_i^T \beta\right)^2$$

2.2.3 Matrix Form

A set of training data $\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$ can be written in matrix and vector form. The matrix form of the input is:

$$\mathbf{X} = \left[\begin{array}{ccc} - & \mathbf{x}_1^T & - \\ - & \mathbf{x}_2^T & - \\ & \vdots & \\ - & \mathbf{x}_n^T & - \end{array}\right]$$

and the vector form of the output is

$$\mathbf{y} = \left[\begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array}\right]$$

The empirical error of $f$ can then be written in matrix form:

$$\begin{aligned} \mathcal{L}_S(f) &= \frac{1}{n} \sum_{i=1}^{n} \ell\left(y_i, f(\mathbf{x}_i)\right) \\ &= \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \mathbf{x}_i^T \beta\right)^2 \\ &= \frac{1}{n} (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta) \end{aligned}$$
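
The equivalence of the sum form and the matrix form is easy to check numerically. Here is a minimal sketch with random data (the shapes assume an $n \times (p+1)$ design matrix whose first column is ones; the numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3

# Design matrix with X_0 = 1 in the first column, plus p random features.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)
beta = rng.normal(size=p + 1)

# Sum form: (1/n) * sum_i (y_i - x_i^T beta)^2
sum_form = np.mean([(y[i] - X[i] @ beta) ** 2 for i in range(n)])

# Matrix form: (1/n) * (y - X beta)^T (y - X beta)
r = y - X @ beta
matrix_form = (r @ r) / n

assert np.isclose(sum_form, matrix_form)
```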

2.2.4 Conclusion

Assuming that $\mathbf{X}$ has full column rank, minimizing the empirical error leads to the estimator

$$\hat{\mathbf{y}} = \mathbf{X} \hat{\beta}$$

where

$$\hat{\beta} = \left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{y}$$
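
A minimal NumPy sketch of this closed-form estimator on synthetic data (the setup below is assumed, not from the notes; solving the normal equations with `np.linalg.solve` is preferred to forming the explicit inverse, which is numerically less stable):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 4

# Synthetic data: y = X beta_true + noise, with an intercept column of ones.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -3.0, 0.5, 0.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# beta_hat = (X^T X)^{-1} X^T y, computed by solving X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

print(beta_hat)  # close to beta_true when n >> p and the noise is small
```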

2.2.4.1 Proof:

Theorem: minimize $\mathcal{L}_S(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta)$ (the factor $\frac{1}{n}$ is ignored, since it does not change the minimizer).

Question: $\operatorname{argmin}_{f \in \mathcal{H}} \mathcal{L}_S(f)$, where $\mathcal{H}$ is the space of linear functions.

Conclusion: $\hat{\mathbf{y}} = \mathbf{X}\hat{\beta}$, with $\hat{\beta} = \left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{y}$.

Proof:

Write the gradient and the Hessian of $\mathcal{L}_S$ with respect to $\beta$ as $J_{\beta} = \frac{d \mathcal{L}_S}{d \beta}$ and $H_{\beta} = \frac{d J_{\beta}}{d \beta} = \frac{d^2 \mathcal{L}_S}{d \beta^2}$.

Let $\mathbf{a} = \mathbf{y} - \mathbf{X}\beta$ and $\mathbf{b} = \mathbf{y} - \mathbf{X}\beta$, so that $\mathcal{L}_S = \mathbf{a}^T \mathbf{b}$. By the product rule,

$$\frac{d \mathcal{L}_S}{d \beta} = \frac{\partial \mathcal{L}_S}{\partial \mathbf{a}} \frac{\partial \mathbf{a}}{\partial \beta} + \frac{\partial \mathcal{L}_S}{\partial \mathbf{b}} \frac{\partial \mathbf{b}}{\partial \beta} = \mathbf{b}^T(-\mathbf{X}) + \mathbf{a}^T(-\mathbf{X}) = -(\mathbf{y} - \mathbf{X}\beta)^T \mathbf{X} - (\mathbf{y} - \mathbf{X}\beta)^T \mathbf{X} = -2(\mathbf{y} - \mathbf{X}\beta)^T \mathbf{X}$$

$$H_{\beta} = \frac{d J_{\beta}}{d \beta} = \frac{d\left(2(\mathbf{X}\beta)^T \mathbf{X}\right)}{d \beta} = \frac{d\left(2 \beta^T \mathbf{X}^T \mathbf{X}\right)}{d \beta} = 2\left(\mathbf{X}^T \mathbf{X}\right)^T = 2\, \mathbf{X}^T \mathbf{X}$$

Since $H_{\beta} = 2\,\mathbf{X}^T\mathbf{X}$ is positive semi-definite, $\mathcal{L}_S$ is a convex function, and any local minimum of a convex function is also a global minimum. Setting the gradient to zero:

$$-2(\mathbf{y} - \mathbf{X}\beta)^T \mathbf{X} = 0 \;\Rightarrow\; \mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta) = 0 \;\Rightarrow\; \mathbf{X}^T \mathbf{X} \beta = \mathbf{X}^T \mathbf{y}$$

If $\mathbf{X}$ has full column rank, $\mathbf{X}^T \mathbf{X}$ is positive definite and therefore invertible:

$$\hat{\beta} = \left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{y}$$
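
As a sanity check (random data, purely illustrative, not part of the original proof), one can verify numerically that the gradient vanishes at $\hat{\beta}$ and that the Hessian $2\,\mathbf{X}^T\mathbf{X}$ is positive definite:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# At the minimizer the gradient -2 (y - X beta)^T X vanishes,
# i.e. X^T (y - X beta_hat) = 0 (the normal equations).
grad = X.T @ (y - X @ beta_hat)
assert np.allclose(grad, 0, atol=1e-8)

# The Hessian 2 X^T X is positive definite here, since X has full
# column rank: all of its eigenvalues are strictly positive.
eigvals = np.linalg.eigvalsh(2 * X.T @ X)
assert np.all(eigvals > 0)
```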

2.2.4.2 Extra notes

$n \gg p$, which means the number of samples far exceeds the number of features.

If $n \ll p$ (far fewer samples than features):

  • $\operatorname{rank}(\mathbf{X}) \leqslant \min(n, p+1) = n$

  • $\operatorname{rank}\left(\mathbf{X}^T \mathbf{X}\right) \leq \min\left(\operatorname{rank}(\mathbf{X}^T), \operatorname{rank}(\mathbf{X})\right) \leq n$

  • If $\mathbf{X}^T \mathbf{X}$ is invertible, then $\operatorname{rank}\left(\mathbf{X}^T \mathbf{X}\right) = p+1$

    This would require $p+1 \leq n$; since $n \ll p$ here, $\mathbf{X}^T \mathbf{X}$ cannot be invertible. In other words, invertibility of $\mathbf{X}^T \mathbf{X}$ requires $p+1 \leq n$ (see the sketch below).
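
This rank argument can be seen numerically: with fewer samples than parameters ($n < p+1$), $\mathbf{X}^T\mathbf{X}$ is rank-deficient and hence singular. A small sketch with random data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 5, 10                       # n << p: more features than samples
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # shape (n, p+1)

XtX = X.T @ X                      # shape (p+1, p+1)

# rank(X^T X) <= n = 5, far below p + 1 = 11, so X^T X cannot be invertible.
print(np.linalg.matrix_rank(XtX))        # 5
print(np.linalg.eigvalsh(XtX).min())     # smallest eigenvalue is numerically ~ 0: singular
```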
