李宏毅 (Hung-yi Lee) Machine Learning Notes 2: Regression

Regression

Examples:

  • stock market forecast
output = tomorrow's Dow Jones Industrial Average
  • self-driving car
output = steering wheel angle
  • recommendation
    output = purchase possibility

Step 1: Model

define a set of functions
e.g. linear model
$y = b + \sum_i w_i x_i$

  • $b$: bias
  • $w_i$: weight
  • $x_i$: an attribute of the input $x$ (a feature)
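
A minimal sketch of the linear model in Python; the feature vector, weights, and bias below are made-up values, not from the lecture:

```python
import numpy as np

def linear_model(x, w, b):
    """Predict y = b + sum_i w_i * x_i for one feature vector x."""
    return b + np.dot(w, x)

x = np.array([1.0, 2.0, 3.0])   # features x_i (made-up values)
w = np.array([0.5, -0.2, 0.1])  # weights w_i
b = 0.3                         # bias
print(linear_model(x, w, b))    # 0.5*1 - 0.2*2 + 0.1*3 + 0.3 = 0.7
```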

Step 2: Goodness of Function

Loss function $L$:
input: a function
output: how bad it is
$L(f) = \sum_{i=1}^{n} (\hat{y}^i - f(x^i))^2$
$f(x^i)$: the estimated $y$ based on the input function
$(\hat{y}^i - f(x^i))^2$: the estimation error
$\sum_{i=1}^{n} (\hat{y}^i - f(x^i))^2$: sum over all examples

$\because f = f(w, b)$
$\therefore L(f) = L(f(w, b)) = L(w, b)$
$\therefore$
$L(w, b) = \sum_{m=1}^{n} \left(\hat{y}^m - \left(b + \sum_i w_i x_i^m\right)\right)^2$
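
A minimal sketch of this loss in Python for a single-feature linear model; the training pairs are made-up values:

```python
import numpy as np

def loss(w, b, x_data, y_hat):
    """L(w, b) = sum_i (y_hat^i - (b + w * x^i))^2 for a single feature x."""
    predictions = b + w * x_data
    return np.sum((y_hat - predictions) ** 2)

x_data = np.array([1.0, 2.0, 3.0, 4.0])  # inputs x^i (made-up values)
y_hat = np.array([2.1, 3.9, 6.2, 8.1])   # targets \hat{y}^i
print(loss(2.0, 0.0, x_data, y_hat))      # total squared error of the function y = 2x
```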

Step 3: Best Function

pick the “best” function
$f^* = \arg\min\limits_{f} L(f)$
$w^*, b^* = \arg\min\limits_{w,b} L(w, b) = \arg\min\limits_{w,b} \sum\limits_{i=1}^{n} \left(\hat{y}^i - (b + w \cdot x^i)\right)^2$

method: Gradient Descent
e.g. 1: consider a loss function $L$ with one parameter $w$:

  • (randomly) pick an initial value $w^0$.
  • compute $\frac{\mathrm{d}L}{\mathrm{d}w}\big\rvert_{w=w^0}$.
  • if $\frac{\mathrm{d}L}{\mathrm{d}w}\big\rvert_{w=w^0} < 0$, increase $w$; otherwise decrease it.
  • $w^1 \gets w^0 - \eta \frac{\mathrm{d}L}{\mathrm{d}w}\big\rvert_{w=w^0}$, where $\eta$ is called the "learning rate".
  • compute $\frac{\mathrm{d}L}{\mathrm{d}w}\big\rvert_{w=w^1}$.
  • $w^2 \gets w^1 - \eta \frac{\mathrm{d}L}{\mathrm{d}w}\big\rvert_{w=w^1}$

$\dots$ after many iterations (a runnable sketch follows below)

  • this method may not find the global minimum; it may get stuck in a local minimum
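
A minimal sketch of the one-parameter procedure above; the toy loss $L(w) = (w-3)^2$, the initial value, the learning rate, and the number of iterations are illustrative choices:

```python
def dL_dw(w):
    """Derivative of the toy loss L(w) = (w - 3)^2."""
    return 2 * (w - 3)

eta = 0.1                    # learning rate
w = 0.0                      # (randomly) picked initial value w^0
for step in range(100):      # many iterations
    w = w - eta * dL_dw(w)   # w^{t+1} <- w^t - eta * dL/dw |_{w = w^t}
print(w)                     # approaches the minimizer w = 3
```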

e.g. 2: consider a loss function $L$ with two parameters $w, b$:

  • (randomly) pick initial values $w^0, b^0$.
  • compute $\frac{\partial L}{\partial w}\big\rvert_{w=w^0, b=b^0}$ and $\frac{\partial L}{\partial b}\big\rvert_{w=w^0, b=b^0}$.
  • if $\frac{\partial L}{\partial w}\big\rvert_{w=w^0, b=b^0} < 0$, increase $w$; otherwise decrease it (and likewise for $b$).
  • $w^1 \gets w^0 - \eta \frac{\partial L}{\partial w}\big\rvert_{w=w^0, b=b^0}$
    $b^1 \gets b^0 - \eta \frac{\partial L}{\partial b}\big\rvert_{w=w^0, b=b^0}$
    $\cdots$ after many iterations (a runnable sketch for the linear model follows after this example)
  • matrix form:
    $\nabla L = \begin{bmatrix} \dfrac{\partial L}{\partial w} \\[6pt] \dfrac{\partial L}{\partial b} \end{bmatrix}$ (the gradient)
    P.S.
    When solving
    $\theta^* = \arg\min\limits_{\theta} L(\theta)$
    by gradient descent, each time we update the parameters we obtain a $\theta$ that makes $L(\theta)$ smaller:
    $L(\theta^0) > L(\theta^1) > L(\theta^2) > \cdots$
    Is this statement correct?
    NOT exactly: for example, if the learning rate $\eta$ is too large, an update can make $L(\theta)$ larger.
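
A minimal sketch of the two-parameter procedure for the linear model $y = b + wx$; the toy data (generated from $y = 1 + 2x$) and the hyperparameters are illustrative choices:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([3.0, 5.0, 7.0, 9.0])   # generated from y = 1 + 2x

w, b = 0.0, 0.0             # (randomly) picked initial values w^0, b^0
eta = 0.01                  # learning rate
for step in range(10000):   # many iterations
    error = y_hat - (b + w * x)
    grad_w = -2.0 * np.sum(error * x)   # dL/dw
    grad_b = -2.0 * np.sum(error)       # dL/db
    w -= eta * grad_w
    b -= eta * grad_b
print(w, b)                 # approaches w = 2, b = 1
```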

Improve

Suitable model

$y = b + w_1 x + w_2 x^2$
$y = b + w_1 x + w_2 x^2 + w_3 x^3$
$y = b + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$
$y = b + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5$
$\cdots$

  • these are still linear models, because they are linear in the parameters $w_1, w_2, \dots, b$
  • as the model gets more complex, the training error gets lower (see the sketch after this list)
  • a more complex model does NOT always lead to better performance on the testing data (overfitting)
  • conclusion: select a suitable model
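
A small sketch of the point about training error, assuming numpy: fitting polynomials of increasing degree to the same made-up data makes the training error keep dropping, which says nothing about testing performance:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = 1 + 2 * x + rng.normal(scale=0.1, size=x.shape)   # noisy linear toy data

for degree in range(1, 6):
    coeffs = np.polyfit(x, y, degree)                  # fit y = b + w1 x + ... + wd x^d
    train_err = np.sum((y - np.polyval(coeffs, x)) ** 2)
    print(degree, train_err)                           # training error drops as degree grows
```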

To address overfitting:

  • get more data
  • find the hidden factors
  • redesign the model

Regularization

$y = b + \sum_i w_i x_i$
$L(f) = \sum_{i=1}^{n} \left(\hat{y}^i - (b + w \cdot x^i)\right)^2 + \lambda \sum_i (w_i)^2$

  • $\lambda \sum_i (w_i)^2$ is called the regularization term
  • functions with smaller $w_i$ are considered better
  • smaller $w_i$ means a smoother function, i.e. the output is less sensitive to changes in the input
  • we believe a smoother function is more likely to be correct
  • larger $\lambda$ $\to$ smoother function
  • regularizing the bias $b$ is unnecessary (it does not affect smoothness)
  • we prefer smoother functions, but not too smooth (a sketch of the regularized loss follows after this list)
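
A minimal sketch of the regularized loss above; the weights, bias, data, and $\lambda$ are made-up values:

```python
import numpy as np

def regularized_loss(w, b, x_data, y_hat, lam):
    """sum_i (y_hat^i - (b + w . x^i))^2 + lambda * sum_i (w_i)^2"""
    predictions = b + x_data @ w
    return np.sum((y_hat - predictions) ** 2) + lam * np.sum(w ** 2)

x_data = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 0.0]])  # three examples, two features
y_hat = np.array([2.0, 3.0, 4.0])
w = np.array([1.0, 0.5])
b = 0.0
print(regularized_loss(w, b, x_data, y_hat, lam=0.1))    # squared error 1.25 + penalty 0.125
```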

Error sources

  • estimate the mean of a variable $x$
    assume the mean of $x$ is $\mu$
    assume the variance of $x$ is $\sigma^2$
  • estimator of the mean $\mu$
    sample $N$ points: $\{x^1, x^2, \dots, x^N\}$
    $m = \frac{1}{N}\sum\limits_{n} x^n \ne \mu$
    $E(m) = E\left(\frac{1}{N}\sum\limits_{n} x^n\right) = \frac{1}{N}\sum\limits_{n} E(x^n) = \mu$ (unbiased estimator)
    $\mathrm{Var}(m) = \frac{\sigma^2}{N}$
  • estimator of the variance $\sigma^2$
    sample $N$ points: $\{x^1, x^2, \dots, x^N\}$
    $m = \frac{1}{N}\sum\limits_{n} x^n$
    $s^2 = \frac{1}{N}\sum\limits_{n} (x^n - m)^2$
    $E(s^2) = \frac{N-1}{N}\sigma^2 \ne \sigma^2$ (biased estimator; see the simulation sketch after this list)
    bias & variance
  • a simpler model is less influenced by the sampled data (so it has smaller variance)
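
A quick simulation sketch of the two estimators above, assuming numpy; $\mu$, $\sigma$, $N$, and the number of trials are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, trials = 5.0, 2.0, 10, 100_000

samples = rng.normal(mu, sigma, size=(trials, N))
m = samples.mean(axis=1)                         # sample mean of each trial
s2 = ((samples - m[:, None]) ** 2).mean(axis=1)  # biased variance estimator

print(m.mean())    # ~ mu = 5.0              (E[m] = mu, unbiased)
print(m.var())     # ~ sigma^2 / N = 0.4     (Var[m] = sigma^2 / N)
print(s2.mean())   # ~ (N-1)/N * sigma^2 = 3.6 (biased)
```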


What to do with large bias and variance?

  • Diagnosis:
    If your model cannot even fit the training examples, you have large bias (underfitting).
    If you can fit the training data but have large error on the testing data, you probably have large variance (overfitting).
  • For large bias, redesign your model:
    add more features as input
    use a more complex model
  • For large variance:
    more data (very effective, but not always practical)
    regularization (makes the function smoother)

Model Selection

  • There is usually a trade-off between bias and variance
  • Select a model that balances two kinds of error to minimize total error
  • What you should NOT do: select your model according to the error on your own testing set.
    The error on the real testing set may be larger than the error on your own testing set (your own testing set may be biased).

Cross Validation

Divide the training set into two sets: a training set and a validation set. Train candidate models on the training set and use the validation set to compare them.

N-fold Cross validation

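A minimal sketch of N-fold cross validation (here N = 3), assuming numpy, used to choose the polynomial degree of the model; the toy data and the candidate degrees are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = 1 + 2 * x + rng.normal(scale=0.1, size=x.shape)   # noisy linear toy data

N = 3
folds = np.array_split(rng.permutation(len(x)), N)    # split indices into N folds

for degree in (1, 2, 3):
    errors = []
    for k in range(N):
        val = folds[k]                                # fold k is the validation set
        train = np.concatenate([folds[j] for j in range(N) if j != k])
        coeffs = np.polyfit(x[train], y[train], degree)
        errors.append(np.mean((y[val] - np.polyval(coeffs, x[val])) ** 2))
    print(degree, np.mean(errors))                    # average validation error per model
```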
