- Batch vs. mini-batch gradient descent
(1) Split the training set into 5000 mini-batches of 1000 examples each.
for t = 1, …, 5000:
Forward prop on $X^{\{t\}}$:
$Z^{[1]} = W^{[1]} X^{\{t\}} + b^{[1]}$
$A^{[1]} = g^{[1]}(Z^{[1]})$
…
$A^{[L]} = g^{[L]}(Z^{[L]})$
(2) Compute cost:
$J^{\{t\}} = \frac{1}{1000} \sum_{i=1}^{1000} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) + \frac{\lambda}{2 \cdot 1000} \sum_{l} \left\lVert W^{[l]} \right\rVert_F^2$
(3) Backprop to compute the gradients of $J^{\{t\}}$ (using $X^{\{t\}}, Y^{\{t\}}$), then update:
$W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}$
$b^{[l]} := b^{[l]} - \alpha \, db^{[l]}$
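As a concrete sketch of this loop, the code below runs mini-batch gradient descent for a single-layer (logistic-regression) model in NumPy; the L-layer forward and backward passes are collapsed into their one-layer versions so the example stays self-contained, and the function and variable names are illustrative rather than from the course.

```python
import numpy as np

def minibatch_gradient_descent(X, Y, batch_size=1000, alpha=0.01, epochs=10):
    """Mini-batch gradient descent for logistic regression (illustrative sketch).

    X: shape (n_features, m); Y: shape (1, m) with 0/1 labels.
    """
    n, m = X.shape
    w = np.zeros((n, 1))
    b = 0.0
    costs = []
    for epoch in range(epochs):
        # Shuffle once per epoch so the mini-batches differ across epochs.
        perm = np.random.permutation(m)
        X_shuf, Y_shuf = X[:, perm], Y[:, perm]
        for t in range(0, m, batch_size):
            X_t = X_shuf[:, t:t + batch_size]   # X^{t}
            Y_t = Y_shuf[:, t:t + batch_size]   # Y^{t}
            mb = X_t.shape[1]
            # Forward prop on the mini-batch (single sigmoid layer).
            Z = w.T @ X_t + b
            A = 1.0 / (1.0 + np.exp(-Z))
            # Cost J^{t}, averaged over the examples in this mini-batch.
            cost = -np.mean(Y_t * np.log(A + 1e-8) + (1 - Y_t) * np.log(1 - A + 1e-8))
            costs.append(cost)
            # Backprop: gradients of J^{t} with respect to w and b.
            dZ = A - Y_t
            dw = X_t @ dZ.T / mb
            db = np.sum(dZ) / mb
            # Gradient-descent update.
            w -= alpha * dw
            b -= alpha * db
    return w, b, costs

# Example usage (hypothetical data); with m = 5,000,000 and batch_size = 1000
# the inner loop runs over 5000 mini-batches per epoch:
# X = np.random.randn(10, 5_000_000)
# Y = (np.random.rand(1, 5_000_000) > 0.5).astype(float)
# w, b, costs = minibatch_gradient_descent(X, Y, epochs=1)
```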
- Choosing mini-batch size
(1) If mini-batch size = m (the size of the training set): batch gradient descent. (If the training set is large, each iteration takes a long time.)
(2) If mini-batch size = 1: stochastic gradient descent; every example is its own mini-batch. (The updates are noisy, and the parameters end up oscillating around the minimum rather than converging to it.)
(3) Choose something in between (a mini-batch size that is neither too big nor too small).
- Some guidelines for choosing your mini-batch size:
(1) If the training set is small (m ≤ 2000): use batch gradient descent.
(2) Typical mini-batch sizes: 64, 128, 256, 512. (Sizes that are powers of 2, i.e. $2^n$, reportedly make the code run faster.)
(3) Make sure each mini-batch $X^{\{t\}}, Y^{\{t\}}$ fits in CPU/GPU memory.
- Exponentially weighted moving averages
$V_t = \beta V_{t-1} + (1 - \beta)\,\theta_t$
$\beta = 0.9$: averages over roughly $\frac{1}{1-\beta} = 10$ days' temperature.
$\beta = 0.98$: averages over roughly $\frac{1}{1-\beta} = 50$ days' temperature.
- Bias correction in exponentially weighted averages
$V_t = \beta V_{t-1} + (1 - \beta)\,\theta_t$
Bias-corrected estimate: $\dfrac{V_t}{1 - \beta^t}$
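A small NumPy sketch of the exponentially weighted average and its bias-corrected estimate $V_t / (1 - \beta^t)$; the daily-temperature series at the bottom is made up purely for illustration.

```python
import numpy as np

def ewma(theta, beta=0.9, bias_correct=True):
    """Exponentially weighted moving average of a 1-D sequence theta.

    With beta = 0.9 this averages over roughly 1/(1-beta) = 10 values;
    with beta = 0.98, roughly 50 values.
    """
    v = 0.0
    out = []
    for t, x in enumerate(theta, start=1):
        v = beta * v + (1 - beta) * x                            # V_t = beta*V_{t-1} + (1-beta)*theta_t
        out.append(v / (1 - beta ** t) if bias_correct else v)   # bias correction: V_t / (1 - beta^t)
    return np.array(out)

# Example: smooth a noisy (synthetic) daily-temperature series.
temps = 20 + 5 * np.sin(np.linspace(0, 6, 100)) + np.random.randn(100)
smoothed = ewma(temps, beta=0.9)
```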
- Gradient descent with momentum:
(1) Compute $dW, db$ on the current mini-batch.
$V_{dW} = \beta V_{dW} + (1 - \beta)\,dW$
$V_{db} = \beta V_{db} + (1 - \beta)\,db$
(the same recurrence $V_t = \beta V_{t-1} + (1 - \beta)\,\theta_t$, applied to the gradients)
(2) Update $W, b$:
$W = W - \alpha V_{dW}$
$b = b - \alpha V_{db}$
This damps the oscillations of gradient descent in the vertical direction while letting it move faster in the horizontal direction, so it reaches the minimum more quickly.
(3) Implementation details:
$V_{dW} = 0,\; V_{db} = 0$
On iteration t:
Compute $dW, db$ on the current mini-batch.
$V_{dW} = \beta V_{dW} + (1 - \beta)\,dW$
$V_{db} = \beta V_{db} + (1 - \beta)\,db$
$W = W - \alpha V_{dW},\quad b = b - \alpha V_{db}$
Hyperparameters: $\alpha$, $\beta$; $\beta = 0.9$ is a common default.
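A minimal sketch of one momentum step as a NumPy function, assuming dW and db come from backprop on the current mini-batch and that vdW and vdb were initialized to zeros with the same shapes as W and b.

```python
import numpy as np

def momentum_step(W, b, dW, db, vdW, vdb, alpha=0.01, beta=0.9):
    """One gradient-descent-with-momentum update."""
    vdW = beta * vdW + (1 - beta) * dW   # V_dW = beta*V_dW + (1-beta)*dW
    vdb = beta * vdb + (1 - beta) * db   # V_db = beta*V_db + (1-beta)*db
    W = W - alpha * vdW                  # W = W - alpha*V_dW
    b = b - alpha * vdb                  # b = b - alpha*V_db
    return W, b, vdW, vdb
```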
(4) RMSprop (root mean square prop)
On iteration t:
Compute $dW, db$ on the current mini-batch.
$S_{dW} = \beta S_{dW} + (1 - \beta)\,(dW)^2$ (the square $(dW)^2$ is element-wise)
$S_{db} = \beta S_{db} + (1 - \beta)\,(db)^2$
Update:
$W = W - \alpha \dfrac{dW}{\sqrt{S_{dW}} + \varepsilon}$
$b = b - \alpha \dfrac{db}{\sqrt{S_{db}} + \varepsilon}$
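The same kind of sketch for one RMSprop step; sdW and sdb are assumed to start at zeros, and eps keeps the denominator from blowing up when the second-moment estimate is tiny.

```python
import numpy as np

def rmsprop_step(W, b, dW, db, sdW, sdb, alpha=0.001, beta=0.9, eps=1e-8):
    """One RMSprop update; the squares are element-wise."""
    sdW = beta * sdW + (1 - beta) * dW ** 2       # S_dW = beta*S_dW + (1-beta)*dW^2
    sdb = beta * sdb + (1 - beta) * db ** 2       # S_db = beta*S_db + (1-beta)*db^2
    W = W - alpha * dW / (np.sqrt(sdW) + eps)     # divide by sqrt(S_dW) + eps
    b = b - alpha * db / (np.sqrt(sdb) + eps)
    return W, b, sdW, sdb
```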
(5) Adam optimization algorithm
$V_{dW} = 0,\; S_{dW} = 0,\; V_{db} = 0,\; S_{db} = 0$
On iteration t:
Compute $dW, db$ using the current mini-batch (the mini-batch gradient).
“momentum”:
$V_{dW} = \beta_1 V_{dW} + (1 - \beta_1)\,dW,\quad V_{db} = \beta_1 V_{db} + (1 - \beta_1)\,db$
“RMSprop”:
$S_{dW} = \beta_2 S_{dW} + (1 - \beta_2)\,(dW)^2,\quad S_{db} = \beta_2 S_{db} + (1 - \beta_2)\,(db)^2$
Bias-corrected:
$V_{dW}^{\text{corrected}} = \dfrac{V_{dW}}{1 - \beta_1^t},\quad V_{db}^{\text{corrected}} = \dfrac{V_{db}}{1 - \beta_1^t}$
$S_{dW}^{\text{corrected}} = \dfrac{S_{dW}}{1 - \beta_2^t},\quad S_{db}^{\text{corrected}} = \dfrac{S_{db}}{1 - \beta_2^t}$
$W = W - \alpha \dfrac{V_{dW}^{\text{corrected}}}{\sqrt{S_{dW}^{\text{corrected}}} + \varepsilon}$
$b = b - \alpha \dfrac{V_{db}^{\text{corrected}}}{\sqrt{S_{db}^{\text{corrected}}} + \varepsilon}$
Hyperparameter choices:
$\alpha$: needs to be tuned
$\beta_1$: 0.9 (for the moving average of $dW$)
$\beta_2$: 0.999 (for the moving average of $(dW)^2$, element-wise)
$\varepsilon$: $10^{-8}$
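Putting the pieces together, a sketch of one Adam step combining the momentum and RMSprop moments with bias correction; t is the iteration counter starting at 1, and all moment buffers start at zeros.

```python
import numpy as np

def adam_step(W, b, dW, db, vdW, vdb, sdW, sdb, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on iteration t (t >= 1)."""
    # "Momentum": first-moment estimates.
    vdW = beta1 * vdW + (1 - beta1) * dW
    vdb = beta1 * vdb + (1 - beta1) * db
    # "RMSprop": second-moment estimates (element-wise squares).
    sdW = beta2 * sdW + (1 - beta2) * dW ** 2
    sdb = beta2 * sdb + (1 - beta2) * db ** 2
    # Bias-corrected estimates.
    vdW_c, vdb_c = vdW / (1 - beta1 ** t), vdb / (1 - beta1 ** t)
    sdW_c, sdb_c = sdW / (1 - beta2 ** t), sdb / (1 - beta2 ** t)
    # Parameter update.
    W = W - alpha * vdW_c / (np.sqrt(sdW_c) + eps)
    b = b - alpha * vdb_c / (np.sqrt(sdb_c) + eps)
    return W, b, vdW, vdb, sdW, sdb
```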
- Learning rate decay
epoch: one full pass over the training set
$\alpha = \dfrac{1}{1 + \text{decay\_rate} \cdot \text{epoch\_num}} \cdot \alpha_0$
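A one-line sketch of this decay schedule; the numbers in the comment are just an illustrative choice of $\alpha_0$ and decay rate.

```python
def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

# Example: alpha0 = 0.2, decay_rate = 1.0 -> 0.1, 0.067, 0.05, ... over epochs 1, 2, 3, ...
```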
- Local optima in neural networks