Deep Learning Beginner's Notes: Studying Hung-yi Lee's (李宏毅) Lecture Videos, 2023/07/08~2023/07/10

Deep learning 2023/07/08~2023/07/10

Backpropagation

  • $$\frac{\partial C}{\partial w} \rightarrow \frac{\partial z}{\partial w}\frac{\partial C}{\partial z}$$

    • Forward pass:

      $$\frac{\partial z}{\partial w} \tag{1}$$

      Compute (1) for all parameters:

      $$\frac{\partial z}{\partial w_1}=? \tag{1a}$$

      $$\frac{\partial z}{\partial w_2}=? \tag{1b}$$

      The answer is simply the value of the input connected to the weight.

    • Backward pass:

      $$\frac{\partial C}{\partial z} \tag{2}$$

      Compute (2) for all activation function inputs z, starting from the output layer and working backwards:

      $$\frac{\partial C}{\partial z}=\frac{\partial a}{\partial z}\frac{\partial C}{\partial a} \tag{2a}$$

      where a = σ(z) feeds the next layer's inputs z' and z'', so

      $$\frac{\partial C}{\partial a}=\frac{\partial z'}{\partial a}\frac{\partial C}{\partial z'} + \frac{\partial z''}{\partial a}\frac{\partial C}{\partial z''} \tag{2b}$$

  • Backpropagation Summary

    $$\text{Forward Pass} \Rightarrow \frac{\partial z}{\partial w} = a$$

    $$\text{Backward Pass} \Rightarrow \frac{\partial C}{\partial z}$$

    $$\text{Forward Pass} \times \text{Backward Pass} \Rightarrow \frac{\partial z}{\partial w}\frac{\partial C}{\partial z} = \frac{\partial C}{\partial w}$$
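A minimal numpy sketch of the two passes (my own toy example, not from the lecture): a single sigmoid neuron with squared-error cost C = (a − y)². The forward pass caches the inputs (so ∂z/∂w equals the input values), the backward pass computes ∂C/∂z via (2a), and their product gives ∂C/∂w.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

a_in = np.array([1.0, -2.0])   # inputs feeding the weights
w = np.array([0.5, 0.3])
b = 0.1
y = 1.0                        # target

# Forward pass: z = w . a_in + b, a = sigma(z); dz/dw_i is just a_in[i]
z = w @ a_in + b
a = sigmoid(z)

# Backward pass: dC/dz = (da/dz) * (dC/da), equation (2a)
dC_da = 2.0 * (a - y)
da_dz = a * (1.0 - a)          # sigma'(z)
dC_dz = da_dz * dC_da

# dC/dw = (dz/dw) * (dC/dz) = a_in * dC/dz
dC_dw = a_in * dC_dz
print(dC_dw)
```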

Regression

  • Stock Market Forecast
  • Self-driving Car
  • Recommendation
  • Estimating the Combat Power (CP) of a Pokémon after evolution

Estimating the Combat Power (CP) of a Pokémon after evolution

  • Step 1: Model

  • Step 2: Goodness of Function

    • Loss function L:

      • Input: a function; output: how bad it is.

        $$L(f)=L(w,b)$$

    • With the model $y=b+\sum_i w_i x_i$, sum the squared error over the training examples (a toy computation follows below):

      $$L(w,b)=\sum_n \left(\hat{y}^n-\left(b+\sum_i w_i x_i^n\right)\right)^2$$
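A minimal sketch of this loss in numpy, on made-up (x, ŷ) pairs (the numbers are illustrative only, not the lecture's Pokémon data):

```python
import numpy as np

# Made-up training pairs: x = CP before evolution, y_hat = CP after.
x = np.array([10.0, 20.0, 30.0])
y_hat = np.array([25.0, 45.0, 70.0])

def loss(w, b):
    """Squared-error loss L(w, b) for the linear model y = b + w*x."""
    pred = b + w * x
    return np.sum((y_hat - pred) ** 2)

print(loss(w=2.0, b=5.0))
```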

  • Step 3: Best Function & Gradient Descent

    • Best Function

      $$f^*=\arg\min_f L(f)$$

      $$w^*,b^*=\arg\min_{w,b}L(w,b)$$

    • Gradient Descent

      • Consider a loss function L(w) with a single parameter w:

        $$w^*=\arg\min_w L(w)$$

        • Pick an initial value $w^0$.

        • Compute (1):

          $$\left.\frac{dL}{dw}\right|_{w=w^0} \tag{1}$$

          $$w^1 \leftarrow w^0-\eta\left.\frac{dL}{dw}\right|_{w=w^0}$$

          • η is called the "learning rate".

        • Compute (2) and take the next step:

          $$\left.\frac{dL}{dw}\right|_{w=w^1} \tag{2}$$

          $$w^2 \leftarrow w^1-\eta\left.\frac{dL}{dw}\right|_{w=w^1}$$

      • How about two parameters? (A minimal sketch follows below.)

        $$w^*,b^*=\arg\min_{w,b}L(w,b)$$
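A minimal gradient-descent sketch for the two-parameter case, reusing the made-up data above. The gradients are the exact partial derivatives of the squared-error loss:

```python
import numpy as np

# Same made-up data as above.
x = np.array([10.0, 20.0, 30.0])
y_hat = np.array([25.0, 45.0, 70.0])

w, b = 0.0, 0.0      # initial values w^0, b^0
eta = 1e-4           # learning rate; small, so many steps are needed

for step in range(100_000):
    err = y_hat - (b + w * x)
    grad_w = -2.0 * np.sum(err * x)   # dL/dw
    grad_b = -2.0 * np.sum(err)       # dL/db
    w = w - eta * grad_w              # w^{t+1} <- w^t - eta * dL/dw
    b = b - eta * grad_b

print(w, b)
```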

    • Worry

      • In linear regression the loss function L is convex, so there is no local optimum.

    • A more complex model does not always lead to better performance on testing data.

      • This is overfitting.
  • How to do Classification

    • Training data for Classification

    • Classification as Regression?

      • Take binary classification as an example.
      • Training: Class 1 means the target is 1; Class 2 means the target is -1.
      • Testing: closer to 1 → class 1; closer to -1 → class 2. (A minimal sketch follows this list.)
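A minimal sketch of this idea (my own toy example): fit least squares to ±1 targets on made-up 1-D data, then classify a test point by the sign of the prediction.

```python
import numpy as np

# Made-up 1-D data: first three points are class 1 (+1), rest class 2 (-1).
x = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])
t = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

# Closed-form least squares for t ~ w*x + b.
A = np.stack([x, np.ones_like(x)], axis=1)
(w, b), *_ = np.linalg.lstsq(A, t, rcond=None)

x_test = 5.0
print("class 1" if w * x_test + b > 0 else "class 2")
```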
    • Ideal Alternatives

      • Function (Model):

        $$x \Rightarrow f(x)=\begin{cases} g(x)>0, & \text{output class 1} \\ \text{else}, & \text{output class 2} \end{cases}$$

  • Loss function:

    $$L(f)=\sum_n \delta\left(f(x^n) \neq \hat{y}^n\right)$$

    The number of times f gets incorrect results on the training data (counted in the snippet after this list).

  • Find the best function:

    • Example: Perceptron, SVM
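A minimal sketch of the 0/1 loss above: it is just a count of mistakes (predictions and labels are made up).

```python
import numpy as np

preds  = np.array([ 1, -1,  1,  1, -1])   # f(x^n), made-up predictions
labels = np.array([ 1,  1,  1, -1, -1])   # y_hat^n, made-up labels
L = np.sum(preds != labels)               # sum_n delta(f(x^n) != y_hat^n)
print(L)                                  # 2 mistakes here
```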
  • Gaussian Distribution

    $$f_{\mu,\Sigma}(x)=\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right\}$$

    • Input: a vector x; output: the probability (density) of sampling x.
      • The shape of the function is determined by the mean μ and the covariance matrix Σ.
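A minimal sketch evaluating this density in numpy (the parameters are made up):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density of a D-dimensional Gaussian, exactly as in the formula above."""
    D = x.shape[0]
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (D / 2) * np.linalg.det(Sigma) ** 0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

x = np.array([1.0, 2.0])
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
print(gaussian_pdf(x, mu, Sigma))
```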
  • Maximum Likelihood

    $$f_{\mu,\Sigma}(x)=\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right\}$$

    A Gaussian with any mean μ and covariance matrix Σ could generate these points, but with different likelihoods; pick the μ*, Σ* that maximize the likelihood. (A sketch of the closed-form solution follows below.)

    $$x \Rightarrow P(C_1|x)=\frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1)+P(x|C_2)P(C_2)}$$
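For a Gaussian, the maximum-likelihood solution has a closed form: the sample mean and the (biased) sample covariance. A minimal sketch on made-up points:

```python
import numpy as np

# Made-up 2-D training points.
X = np.array([[1.0, 2.0],
              [2.0, 1.5],
              [3.0, 3.0],
              [2.5, 2.0]])

mu_star = X.mean(axis=0)              # mu* = (1/N) sum_n x^n
diff = X - mu_star
Sigma_star = diff.T @ diff / len(X)   # Sigma* = (1/N) sum_n (x^n - mu*)(x^n - mu*)^T
print(mu_star)
print(Sigma_star)
```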

  • Three Steps

    • Function Set (Model):

      $$x \Rightarrow \begin{cases} \text{if } P(C_1|x) > 0.5, & \text{output class 1} \\ \text{otherwise}, & \text{output class 2} \end{cases}$$

    • Goodness of a function
      • The mean μ and covariance Σ that maximize the likelihood (the probability of generating the data).
      • Find the best function: easy.
  • Posterior Probability

$$P(C_1|x)=\sigma(z), \quad \text{where } \sigma \text{ is the sigmoid and} \quad z=\ln\frac{P(x|C_1)P(C_1)}{P(x|C_2)P(C_2)}$$

$$z=\ln\frac{P(x|C_1)}{P(x|C_2)} + \ln\frac{P(C_1)}{P(C_2)}, \qquad \frac{P(C_1)}{P(C_2)}=\frac{\frac{N_1}{N_1+N_2}}{\frac{N_2}{N_1+N_2}}=\frac{N_1}{N_2}$$

$$P(x|C_1)=\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma^1|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu^1)^T (\Sigma^{1})^{-1}(x-\mu^1)\right\} \tag{3a}$$

$$P(x|C_2)=\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma^2|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu^2)^T (\Sigma^{2})^{-1}(x-\mu^2)\right\} \tag{3b}$$

  • From (3a) and (3b), the first term of z becomes (the $(2\pi)^{D/2}$ factors cancel):

$$\ln\frac{\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma^1|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu^1)^T (\Sigma^{1})^{-1}(x-\mu^1)\right\}}{\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma^2|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu^2)^T (\Sigma^{2})^{-1}(x-\mu^2)\right\}}=\ln\frac{|\Sigma^2|^{1/2}}{|\Sigma^1|^{1/2}}-\frac{1}{2}\left[(x-\mu^1)^T(\Sigma^1)^{-1}(x-\mu^1)-(x-\mu^2)^T(\Sigma^2)^{-1}(x-\mu^2)\right]$$
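A minimal numerical sketch of the whole posterior (made-up Gaussian parameters and class counts; `gaussian_pdf` is the helper defined earlier):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    D = x.shape[0]
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (D / 2) * np.linalg.det(Sigma) ** 0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

# Made-up class-conditional Gaussians and training counts.
mu1, mu2 = np.array([1.0, 1.0]), np.array([3.0, 3.0])
Sigma = np.eye(2)        # shared covariance, for simplicity
N1, N2 = 60, 40          # class counts -> prior ratio N1/N2

x = np.array([1.5, 2.0])
z = np.log(gaussian_pdf(x, mu1, Sigma) / gaussian_pdf(x, mu2, Sigma)) + np.log(N1 / N2)
posterior = 1.0 / (1.0 + np.exp(-z))   # P(C1|x) = sigma(z)
print(posterior)
```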

