
Chapter 2: Linear Regression

1 Statistical Learning Theory

1.1 Supervised Learning

In supervised learning, the goal is to learn the mapping (the rules) between a set of inputs and outputs.

1.2 Problem Definition

Given a set of $n$ examples (data) $\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$

Question: find a function $f$ such that $f(\mathbf{x}) = \hat{y}$

is a good predictor of $y$ for a future input $\mathbf{x}$ (fitting the data is not enough!).

1.3 Statistical Learning Definition

There is an unknown probability distribution on the product space $Z = X \times Y$, written $\mu(x, y)$.

We assume that $X$ is a compact domain in Euclidean space and $Y$ a bounded subset of $\mathbb{R}$. The training set

$S = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$ consists of $n$ samples drawn i.i.d. (independent and identically distributed) from $\mu$.

$\mathcal{H}$ is the hypothesis space, a space of functions $f: X \rightarrow Y$.

A learning algorithm is a map $L: Z^{n} \rightarrow \mathcal{H}$ that looks at $S$ and selects from $\mathcal{H}$ a function $f_{S}: X \rightarrow Y$ such that

$f_{S}(\mathbf{x}) \approx y$ in a predictive way.

Given a function $f$ and a loss function $\ell: Y \times Y \rightarrow \mathbb{R}$, we define the expected (or true) error of $f$ as

$$\mathcal{L}(f) = \mathbb{E}_{X, Y}[\ell(y, f(x))] = \int_{X \times Y} \ell(y, f(x)) \, d\mu(x, y)$$

which is the expected loss on a new example drawn at random from $\mu$.

The empirical error of $f_{S}$ is:

$$\mathcal{L}_{S}(f_{S}) = \frac{1}{n} \sum_{i=1}^{n} \ell\left(y_{i}, f_{S}(\mathbf{x}_{i})\right)$$

A very natural requirement for $f_{S}$ is distribution-independent generalization:

$$\forall \mu, \quad \lim_{n \to \infty} \left|\mathcal{L}_{S}(f_{S}) - \mathcal{L}(f_{S})\right| = 0$$

in probability. In other words, the training error for the solution must converge to the expected error and thus be a proxy for it.
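
As an illustration only, the following minimal Python/NumPy sketch compares the empirical error of a fitted predictor with a Monte Carlo estimate of its expected error on fresh samples. The data-generating distribution, sample sizes, and the least-squares line used as $f_S$ are invented for this example; they are not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical data-generating distribution mu(x, y): y = 2x + 1 + Gaussian noise.
def sample(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=n)
    return x, y

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

# Learn f_S from a training set S of n = 20 samples (least-squares line).
x_train, y_train = sample(20)
a, b = np.polyfit(x_train, y_train, 1)

def f_S(x):
    return a * x + b

# Empirical error L_S(f_S): average loss over the training set S.
empirical_error = squared_loss(y_train, f_S(x_train)).mean()

# Expected error L(f_S): Monte Carlo estimate on many fresh samples from mu.
x_new, y_new = sample(100_000)
expected_error = squared_loss(y_new, f_S(x_new)).mean()

print(empirical_error, expected_error)  # typically close, illustrating generalization
```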

2 Linear Regression

2.1 Introduction

In the simplest one-dimensional case, linear regression fits a line $y = ax + b$ to the data.

2.2 Problem Setting

2.2.1 Elements

A set of training data $\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$.

2.2.2 Assumptions

The function $f$ has a linear structure between the input $X = \left[\begin{array}{c} X_0 \\ X_1 \\ \vdots \\ X_p \end{array}\right]$, a random vector in $\mathbb{R}^{p+1}$ in which $X_0 = 1$, and the output $Y$, a random variable in $\mathbb{R}$. It has the form:

$$Y = f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j = X_0 \beta_0 + \sum_{j=1}^{p} X_j \beta_j = X^T \beta$$

where the $\beta_j$ are unknown parameters or coefficients, $\beta = \left[\begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{array}\right]$.
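
For example (a small sketch with made-up numbers, not taken from the notes), a single prediction is just the inner product of the augmented input $[1, X_1, \ldots, X_p]^T$ with $\beta$:

```python
import numpy as np

# Hypothetical example with p = 2 features.
beta = np.array([0.5, 2.0, -1.0])      # [beta_0, beta_1, beta_2]
x_raw = np.array([3.0, 4.0])           # the p raw feature values
x = np.concatenate(([1.0], x_raw))     # prepend X_0 = 1 for the intercept

y = x @ beta                           # f(X) = X^T beta
print(y)                               # 0.5 + 2.0*3.0 - 1.0*4.0 = 2.5
```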

The loss function $\ell: Y \times Y \rightarrow \mathbb{R}$ has the following form:

$$\ell(y, \hat{y}) = (y - \hat{y})^2$$

The empirical error of $f$ is:

$$\mathcal{L}_S(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\left(y_i, f(\mathbf{x}_i)\right) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \mathbf{x}_i^T \beta\right)^2$$

2.2.3 Matrix Form

A set of training data $\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$ can be written in matrix and vector form. The matrix form of the input is:

$$\mathbf{X} = \left[\begin{array}{ccc} - & \mathbf{x}_1^T & - \\ - & \mathbf{x}_2^T & - \\ & \vdots & \\ - & \mathbf{x}_n^T & - \end{array}\right]$$

and the vector form of the output is

$$\mathbf{y} = \left[\begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array}\right]$$

The empirical error of $f$ can then be written in matrix form:

$$\begin{aligned} \mathcal{L}_S(f) &= \frac{1}{n} \sum_{i=1}^{n} \ell\left(y_i, f(\mathbf{x}_i)\right) \\ &= \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \mathbf{x}_i^T \beta\right)^2 \\ &= \frac{1}{n} (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta) \end{aligned}$$
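
The equivalence of the sum form and the matrix form is easy to check numerically. Here is a minimal sketch with random data (the shapes assume an $n \times (p+1)$ design matrix whose first column is ones; the numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3

# Design matrix with X_0 = 1 in the first column, plus p random features.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)
beta = rng.normal(size=p + 1)

# Sum form: (1/n) * sum_i (y_i - x_i^T beta)^2
sum_form = np.mean([(y[i] - X[i] @ beta) ** 2 for i in range(n)])

# Matrix form: (1/n) * (y - X beta)^T (y - X beta)
r = y - X @ beta
matrix_form = (r @ r) / n

assert np.isclose(sum_form, matrix_form)
```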

2.2.4 Conclusion

Assuming that $\mathbf{X}$ has full column rank, minimizing the empirical error leads to the estimator

$$\hat{\mathbf{y}} = \mathbf{X} \hat{\beta}$$

where

$$\hat{\beta} = \left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{y}$$
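
A minimal NumPy sketch of this closed-form estimator on synthetic data (the setup below is assumed, not from the notes; solving the normal equations with `np.linalg.solve` is preferred to forming the explicit inverse, which is numerically less stable):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 4

# Synthetic data: y = X beta_true + noise, with an intercept column of ones.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -3.0, 0.5, 0.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# beta_hat = (X^T X)^{-1} X^T y, computed by solving X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

print(beta_hat)  # close to beta_true when n >> p and the noise is small
```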

2.2.4.1 Proof:

Theorem: minimize $\mathcal{L}_S(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta)$ (the factor $\frac{1}{n}$ is ignored, since it does not change the minimizer).

Question: $\operatorname{argmin}_{f \in \mathcal{H}} \mathcal{L}_S(f)$, where $\mathcal{H}$ is the space of linear functions.

Conclusion: $\hat{\mathbf{y}} = \mathbf{X}\hat{\beta}$, with $\hat{\beta} = \left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{y}$.

Proof:

Write the gradient and the Hessian of $\mathcal{L}_S$ with respect to $\beta$ as $J_{\beta} = \frac{d \mathcal{L}_S}{d \beta}$ and $H_{\beta} = \frac{d J_{\beta}}{d \beta} = \frac{d^2 \mathcal{L}_S}{d \beta^2}$.

Let $\mathbf{a} = \mathbf{y} - \mathbf{X}\beta$ and $\mathbf{b} = \mathbf{y} - \mathbf{X}\beta$, so that $\mathcal{L}_S = \mathbf{a}^T \mathbf{b}$. By the product rule,

$$\frac{d \mathcal{L}_S}{d \beta} = \frac{\partial \mathcal{L}_S}{\partial \mathbf{a}} \frac{\partial \mathbf{a}}{\partial \beta} + \frac{\partial \mathcal{L}_S}{\partial \mathbf{b}} \frac{\partial \mathbf{b}}{\partial \beta} = \mathbf{b}^T(-\mathbf{X}) + \mathbf{a}^T(-\mathbf{X}) = -(\mathbf{y} - \mathbf{X}\beta)^T \mathbf{X} - (\mathbf{y} - \mathbf{X}\beta)^T \mathbf{X} = -2(\mathbf{y} - \mathbf{X}\beta)^T \mathbf{X}$$

$$H_{\beta} = \frac{d J_{\beta}}{d \beta} = \frac{d\left(2(\mathbf{X}\beta)^T \mathbf{X}\right)}{d \beta} = \frac{d\left(2 \beta^T \mathbf{X}^T \mathbf{X}\right)}{d \beta} = 2\left(\mathbf{X}^T \mathbf{X}\right)^T = 2\, \mathbf{X}^T \mathbf{X}$$

Since $H_{\beta} = 2\,\mathbf{X}^T\mathbf{X}$ is positive semi-definite, $\mathcal{L}_S$ is a convex function, and any local minimum of a convex function is also a global minimum. Setting the gradient to zero:

$$-2(\mathbf{y} - \mathbf{X}\beta)^T \mathbf{X} = 0 \;\Rightarrow\; \mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta) = 0 \;\Rightarrow\; \mathbf{X}^T \mathbf{X} \beta = \mathbf{X}^T \mathbf{y}$$

If $\mathbf{X}$ has full column rank, $\mathbf{X}^T \mathbf{X}$ is positive definite and therefore invertible:

$$\hat{\beta} = \left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{y}$$
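
As a sanity check (random data, purely illustrative, not part of the original proof), one can verify numerically that the gradient vanishes at $\hat{\beta}$ and that the Hessian $2\,\mathbf{X}^T\mathbf{X}$ is positive definite:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# At the minimizer the gradient -2 (y - X beta)^T X vanishes,
# i.e. X^T (y - X beta_hat) = 0 (the normal equations).
grad = X.T @ (y - X @ beta_hat)
assert np.allclose(grad, 0, atol=1e-8)

# The Hessian 2 X^T X is positive definite here, since X has full
# column rank: all of its eigenvalues are strictly positive.
eigvals = np.linalg.eigvalsh(2 * X.T @ X)
assert np.all(eigvals > 0)
```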

2.2.4.2 Extra notes

$n \gg p$, which means the number of samples far exceeds the number of features.

If $n \ll p$ (far fewer samples than features):

  • $\operatorname{rank}(\mathbf{X}) \leqslant \min(n, p+1) = n$

  • $\operatorname{rank}\left(\mathbf{X}^T \mathbf{X}\right) \leq \min\left(\operatorname{rank}(\mathbf{X}^T), \operatorname{rank}(\mathbf{X})\right) \leq n$

  • If $\mathbf{X}^T \mathbf{X}$ is invertible, then $\operatorname{rank}\left(\mathbf{X}^T \mathbf{X}\right) = p+1$

    This would require $p+1 \leq n$; since $n \ll p$ here, $\mathbf{X}^T \mathbf{X}$ cannot be invertible. In other words, invertibility of $\mathbf{X}^T \mathbf{X}$ requires $p+1 \leq n$ (see the sketch below).
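
This rank argument can be seen numerically: with fewer samples than parameters ($n < p+1$), $\mathbf{X}^T\mathbf{X}$ is rank-deficient and hence singular. A small sketch with random data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 5, 10                       # n << p: more features than samples
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # shape (n, p+1)

XtX = X.T @ X                      # shape (p+1, p+1)

# rank(X^T X) <= n = 5, far below p + 1 = 11, so X^T X cannot be invertible.
print(np.linalg.matrix_rank(XtX))        # 5
print(np.linalg.eigvalsh(XtX).min())     # smallest eigenvalue is numerically ~ 0: singular
```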
