李宏毅 (Hung-yi Lee) Machine Learning Notes 2: Regression

Regression

Examples:

  • stock market forecast
output = tomorrow's Dow Jones Industrial Average
  • self-driving car
output = steering wheel angle
  • recommendation
    output = purchase possibility

Step 1: Model

define a set of functions
e.g. linear model
$y = b + \sum_i w_i x_i$

  • $b$: bias
  • $w_i$: weight
  • $x_i$: an attribute of the input $x$ (a feature)
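
A minimal sketch of the linear model in Python; the feature vector, weights, and bias below are made-up values, not from the lecture:

```python
import numpy as np

def linear_model(x, w, b):
    """Predict y = b + sum_i w_i * x_i for one feature vector x."""
    return b + np.dot(w, x)

x = np.array([1.0, 2.0, 3.0])   # features x_i (made-up values)
w = np.array([0.5, -0.2, 0.1])  # weights w_i
b = 0.3                         # bias
print(linear_model(x, w, b))    # 0.5*1 - 0.2*2 + 0.1*3 + 0.3 = 0.7
```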

Step 2: Goodness of Function

Loss function $L$:
input: a function
output: how bad it is
$L(f) = \sum_{i=1}^{n} (\hat{y}^i - f(x^i))^2$
$f(x^i)$: the estimated $y$ based on the input function
$(\hat{y}^i - f(x^i))^2$: the estimation error
$\sum_{i=1}^{n} (\hat{y}^i - f(x^i))^2$: sum over all examples

$\because f = f(w, b)$
$\therefore L(f) = L(f(w, b)) = L(w, b)$
$\therefore$
$L(w, b) = \sum_{m=1}^{n} \left(\hat{y}^m - \left(b + \sum_i w_i x_i^m\right)\right)^2$
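
A minimal sketch of this loss in Python for a single-feature linear model; the training pairs are made-up values:

```python
import numpy as np

def loss(w, b, x_data, y_hat):
    """L(w, b) = sum_i (y_hat^i - (b + w * x^i))^2 for a single feature x."""
    predictions = b + w * x_data
    return np.sum((y_hat - predictions) ** 2)

x_data = np.array([1.0, 2.0, 3.0, 4.0])  # inputs x^i (made-up values)
y_hat = np.array([2.1, 3.9, 6.2, 8.1])   # targets \hat{y}^i
print(loss(2.0, 0.0, x_data, y_hat))      # total squared error of the function y = 2x
```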

Step 3: Best Function

pick the “best” function
$f^* = \arg\min\limits_{f} L(f)$
$w^*, b^* = \arg\min\limits_{w,b} L(w, b) = \arg\min\limits_{w,b} \sum\limits_{i=1}^{n} \left(\hat{y}^i - (b + w \cdot x^i)\right)^2$

method: Gradient Descent
e.g. 1: consider a loss function $L$ with one parameter $w$:

  • (randomly) pick an initial value $w^0$.
  • compute $\frac{\mathrm{d}L}{\mathrm{d}w}\big\rvert_{w=w^0}$.
  • if $\frac{\mathrm{d}L}{\mathrm{d}w}\big\rvert_{w=w^0} < 0$, increase $w$; otherwise decrease it.
  • $w^1 \gets w^0 - \eta \frac{\mathrm{d}L}{\mathrm{d}w}\big\rvert_{w=w^0}$, where $\eta$ is called the "learning rate".
  • compute $\frac{\mathrm{d}L}{\mathrm{d}w}\big\rvert_{w=w^1}$.
  • $w^2 \gets w^1 - \eta \frac{\mathrm{d}L}{\mathrm{d}w}\big\rvert_{w=w^1}$

$\dots$ after many iterations (a runnable sketch follows below)

  • this method may not find the global minimum; it may get stuck in a local minimum
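
A minimal sketch of the one-parameter procedure above; the toy loss $L(w) = (w-3)^2$, the initial value, the learning rate, and the number of iterations are illustrative choices:

```python
def dL_dw(w):
    """Derivative of the toy loss L(w) = (w - 3)^2."""
    return 2 * (w - 3)

eta = 0.1                    # learning rate
w = 0.0                      # (randomly) picked initial value w^0
for step in range(100):      # many iterations
    w = w - eta * dL_dw(w)   # w^{t+1} <- w^t - eta * dL/dw |_{w = w^t}
print(w)                     # approaches the minimizer w = 3
```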

e.g. 2: consider a loss function $L$ with two parameters $w, b$:

  • (randomly) pick initial values $w^0, b^0$.
  • compute $\frac{\partial L}{\partial w}\big\rvert_{w=w^0, b=b^0}$ and $\frac{\partial L}{\partial b}\big\rvert_{w=w^0, b=b^0}$.
  • if $\frac{\partial L}{\partial w}\big\rvert_{w=w^0, b=b^0} < 0$, increase $w$; otherwise decrease it (and likewise for $b$).
  • $w^1 \gets w^0 - \eta \frac{\partial L}{\partial w}\big\rvert_{w=w^0, b=b^0}$
    $b^1 \gets b^0 - \eta \frac{\partial L}{\partial b}\big\rvert_{w=w^0, b=b^0}$
    $\cdots$ after many iterations (a runnable sketch for the linear model follows after this example)
  • matrix form:
    $\nabla L = \begin{bmatrix} \dfrac{\partial L}{\partial w} \\[6pt] \dfrac{\partial L}{\partial b} \end{bmatrix}$ (the gradient)
    P.S.
    When solving
    $\theta^* = \arg\min\limits_{\theta} L(\theta)$
    by gradient descent, each time we update the parameters we obtain a $\theta$ that makes $L(\theta)$ smaller:
    $L(\theta^0) > L(\theta^1) > L(\theta^2) > \cdots$
    Is this statement correct?
    NOT exactly: for example, if the learning rate $\eta$ is too large, an update can make $L(\theta)$ larger.
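
A minimal sketch of the two-parameter procedure for the linear model $y = b + wx$; the toy data (generated from $y = 1 + 2x$) and the hyperparameters are illustrative choices:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([3.0, 5.0, 7.0, 9.0])   # generated from y = 1 + 2x

w, b = 0.0, 0.0             # (randomly) picked initial values w^0, b^0
eta = 0.01                  # learning rate
for step in range(10000):   # many iterations
    error = y_hat - (b + w * x)
    grad_w = -2.0 * np.sum(error * x)   # dL/dw
    grad_b = -2.0 * np.sum(error)       # dL/db
    w -= eta * grad_w
    b -= eta * grad_b
print(w, b)                 # approaches w = 2, b = 1
```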

Improve

Suitable model

$y = b + w_1 x + w_2 x^2$
$y = b + w_1 x + w_2 x^2 + w_3 x^3$
$y = b + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$
$y = b + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5$
$\cdots$

  • these are still linear models, because they are linear in the parameters $w_1, w_2, \dots, b$
  • as the model gets more complex, the training error gets lower (see the sketch after this list)
  • a more complex model does NOT always lead to better performance on the testing data (overfitting)
  • conclusion: select a suitable model
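
A small sketch of the point about training error, assuming numpy: fitting polynomials of increasing degree to the same made-up data makes the training error keep dropping, which says nothing about testing performance:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = 1 + 2 * x + rng.normal(scale=0.1, size=x.shape)   # noisy linear toy data

for degree in range(1, 6):
    coeffs = np.polyfit(x, y, degree)                  # fit y = b + w1 x + ... + wd x^d
    train_err = np.sum((y - np.polyval(coeffs, x)) ** 2)
    print(degree, train_err)                           # training error drops as degree grows
```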

To address overfitting:

  • get more data
  • find the hidden factors
  • redesign the model

Regularization

$y = b + \sum_i w_i x_i$
$L(f) = \sum_{i=1}^{n} \left(\hat{y}^i - (b + w \cdot x^i)\right)^2 + \lambda \sum_i (w_i)^2$

  • $\lambda \sum_i (w_i)^2$ is called the regularization term
  • functions with smaller $w_i$ are considered better
  • smaller $w_i$ means a smoother function, i.e. the output is less sensitive to changes in the input
  • we believe a smoother function is more likely to be correct
  • larger $\lambda$ $\to$ smoother function
  • regularizing the bias $b$ is unnecessary (it does not affect smoothness)
  • we prefer smoother functions, but not too smooth (a sketch of the regularized loss follows after this list)
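
A minimal sketch of the regularized loss above; the weights, bias, data, and $\lambda$ are made-up values:

```python
import numpy as np

def regularized_loss(w, b, x_data, y_hat, lam):
    """sum_i (y_hat^i - (b + w . x^i))^2 + lambda * sum_i (w_i)^2"""
    predictions = b + x_data @ w
    return np.sum((y_hat - predictions) ** 2) + lam * np.sum(w ** 2)

x_data = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 0.0]])  # three examples, two features
y_hat = np.array([2.0, 3.0, 4.0])
w = np.array([1.0, 0.5])
b = 0.0
print(regularized_loss(w, b, x_data, y_hat, lam=0.1))    # squared error 1.25 + penalty 0.125
```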

Error sources

  • estimate the mean of a variable $x$
    assume the mean of $x$ is $\mu$
    assume the variance of $x$ is $\sigma^2$
  • estimator of the mean $\mu$
    sample $N$ points: $\{x^1, x^2, \dots, x^N\}$
    $m = \frac{1}{N}\sum\limits_{n} x^n \ne \mu$
    $E(m) = E\left(\frac{1}{N}\sum\limits_{n} x^n\right) = \frac{1}{N}\sum\limits_{n} E(x^n) = \mu$ (unbiased estimator)
    $\mathrm{Var}(m) = \frac{\sigma^2}{N}$
  • estimator of the variance $\sigma^2$
    sample $N$ points: $\{x^1, x^2, \dots, x^N\}$
    $m = \frac{1}{N}\sum\limits_{n} x^n$
    $s^2 = \frac{1}{N}\sum\limits_{n} (x^n - m)^2$
    $E(s^2) = \frac{N-1}{N}\sigma^2 \ne \sigma^2$ (biased estimator; see the simulation sketch after this list)
    bias & variance
  • a simpler model is less influenced by the sampled data (so it has smaller variance)
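
A quick simulation sketch of the two estimators above, assuming numpy; $\mu$, $\sigma$, $N$, and the number of trials are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, trials = 5.0, 2.0, 10, 100_000

samples = rng.normal(mu, sigma, size=(trials, N))
m = samples.mean(axis=1)                         # sample mean of each trial
s2 = ((samples - m[:, None]) ** 2).mean(axis=1)  # biased variance estimator

print(m.mean())    # ~ mu = 5.0              (E[m] = mu, unbiased)
print(m.var())     # ~ sigma^2 / N = 0.4     (Var[m] = sigma^2 / N)
print(s2.mean())   # ~ (N-1)/N * sigma^2 = 3.6 (biased)
```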


What to do with large bias and variance?

  • Diagnosis:
    If your model cannot even fit the training examples, you have large bias (underfitting).
    If you can fit the training data but have large error on the testing data, you probably have large variance (overfitting).
  • For large bias, redesign your model:
    add more features as input
    use a more complex model
  • For large variance:
    more data (very effective, but not always practical)
    regularization (makes the function smoother)

Model Selection

  • There is usually a trade-off between bias and variance
  • Select a model that balances two kinds of error to minimize total error
  • What you should NOT do: select your model according to the error on your own testing set.
    The error on the real testing set may be larger than the error on your own testing set (your own testing set may be biased).

Cross Validation

Divide the training set into two sets: a training set and a validation set. Train candidate models on the training set and use the validation set to compare them.

N-fold Cross validation

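A minimal sketch of N-fold cross validation (here N = 3), assuming numpy, used to choose the polynomial degree of the model; the toy data and the candidate degrees are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = 1 + 2 * x + rng.normal(scale=0.1, size=x.shape)   # noisy linear toy data

N = 3
folds = np.array_split(rng.permutation(len(x)), N)    # split indices into N folds

for degree in (1, 2, 3):
    errors = []
    for k in range(N):
        val = folds[k]                                # fold k is the validation set
        train = np.concatenate([folds[j] for j in range(N) if j != k])
        coeffs = np.polyfit(x[train], y[train], degree)
        errors.append(np.mean((y[val] - np.polyval(coeffs, x[val])) ** 2))
    print(degree, np.mean(errors))                    # average validation error per model
```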
