ML Basic Concepts
Note: if the formulas don't render properly, copy the text into Typora to view them; the main thing is to follow the text ~~~
-
ML = looking for a function (f)
-
Different types of functions
-
Regression: f outputs a scalar
-
Classification: given classes, f outputs the correct one
-
Structured Learning: create something with structure (image, document)
-
-
How to find such an f? => training
-
step1: f with unknown parameters
-
step2: define loss(L) from training data
-
loss is a function of the parameters
-
loss measures how good a set of parameter values is
eg:
L=\frac{1}{N}\sum_{n}e_n\\MAE:e=|y-\hat y|\\MSE:e=(y-\hat y)^2
MAE: L is mean absolute error
MSE: L is mean square error
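A quick sketch of computing these two losses, assuming numpy and made-up values for the predictions y and labels ŷ (names are only illustrative):

```python
import numpy as np

def mae(y, y_hat):
    # mean absolute error: average of |y - y_hat|
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    # mean squared error: average of (y - y_hat)^2
    return np.mean((y - y_hat) ** 2)

y = np.array([1.0, 2.0, 3.0])        # model outputs
y_hat = np.array([1.5, 2.0, 2.0])    # labels
print(mae(y, y_hat), mse(y, y_hat))  # 0.5  0.4166...
```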
-
optimization: w^*,b^*=\arg\min_{w,b}L
method: Gradient Descent
-
randomly pick an initial value w_0
-
compute \frac{\partial L}{\partial w}|_{w=w_0}
if negative => increase w
elif positive => decrease w
so w_0 \to w_1
how big should the increment be?
\textcolor{red}{\eta}\cdot\frac{\partial L}{\partial w}|_{w=w_0} (\eta is the learning rate)
\eta: a parameter that you have to set yourself => hyperparameter
in conclusion, w_1\leftarrow w_0-\eta\frac{\partial L}{\partial w}|_{w=w_0}
-
update w iteratively
Hence gradient descent has a potential issue with local minima (though in practice this doesn't actually cause a problem)
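A minimal sketch of the 1-D update rule above, assuming a toy loss L(w) = (w - 3)^2 whose derivative is known; the loss and the numbers are only for illustration:

```python
# gradient descent on a toy loss L(w) = (w - 3)^2
def dL_dw(w):
    return 2 * (w - 3)          # derivative of the toy loss

eta = 0.1                       # learning rate (hyperparameter)
w = 0.0                         # arbitrarily picked w_0
for step in range(100):         # update w iteratively
    w = w - eta * dL_dw(w)      # w_{t+1} <- w_t - eta * dL/dw
print(w)                        # converges towards 3
```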
-
With multiple parameters the procedure is the same as with a single parameter: each parameter is updated with its own partial derivative.
-
prediction (then adjust the model based on the prediction results, and repeat...)
The example above is based on a linear model.
But a linear model has inherent limitations (model bias).
solution: sum up a set of piecewise linear functions
You can modify the parameters (c, b, w) of each function to adjust its shape.
So the new model can also take more features as input.
$$
y=b+\sum_i c_i\,\mathrm{sigmoid}(b_i+w_i x_1)\\
y=b+\sum_i c_i\,\mathrm{sigmoid}(b_i+\textcolor{green}{\sum_j w_{ij}x_j})
$$
i: index over the sigmoid functions; j: index over the features; sigmoid() = \sigma()
so this time the loss is L(\theta), where \theta collects all the unknown parameters (w, b, c)
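A rough sketch of this model's forward pass, assuming numpy and arbitrary shapes (3 sigmoids, 2 features); all names here are illustrative, and θ would collect b, c, the b_i, and the w_ij:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(x, b, c, b_vec, W):
    # x: (num_features,), W: (num_sigmoids, num_features)
    # y = b + sum_i c_i * sigmoid(b_i + sum_j W_ij * x_j)
    return b + c @ sigmoid(b_vec + W @ x)

x = np.array([1.0, 2.0])      # 2 features
W = np.random.randn(3, 2)     # 3 sigmoids x 2 features
b_vec = np.random.randn(3)
c = np.random.randn(3)
print(model(x, 0.5, c, b_vec, W))
```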
-
-
step3: optimization
-
\vec\theta^*=\arg\min_{\vec\theta}L
-
randomly pick initial values \vec\theta_0
gradient: \vec g=\nabla L(\vec\theta_0)
-
update \vec\theta iteratively
-
$$
\vec\theta_1\leftarrow\vec\theta_0-\eta\vec g\\
\vec\theta_2\leftarrow\vec\theta_1-\eta\vec g\\
\cdots
$$
-
if N = 10000 and batch size = 10, how many updates in 1 epoch?
answer: 10000 / 10 = 1000 updates (one update per batch)
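A sketch of where that number comes from: every batch produces one parameter update, so one epoch (one pass over all batches) gives N / batch_size updates; the numbers are taken from the question above:

```python
N, B = 10000, 10                 # dataset size, batch size
print(N // B)                    # 1000 updates per epoch

# the training-loop shape that produces those updates
updates = 0
for epoch in range(1):           # 1 epoch = one pass over all batches
    for start in range(0, N, B): # each batch -> one parameter update
        updates += 1
print(updates)                   # 1000
```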
-
-
sigmoid \to ReLU (Rectified Linear Unit): c\max(0,\,b+wx_1)
$$
\mathrm{sigmoid}: y=b+\sum_i c_i\,\mathrm{sigmoid}(b_i+\sum_j w_{ij}x_j)\\
\mathrm{ReLU}: y=b+\sum_{\textcolor{red}{2i}} c_i\max(0,\,b_i+\sum_j w_{ij}x_j)
$$
which one is better? => ReLU
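The same forward-pass sketch with sigmoid swapped for ReLU (again assuming numpy and arbitrary shapes); the red 2i hints that it takes roughly two ReLUs to piece together one hard sigmoid, so twice as many terms are used:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)    # element-wise max(0, z)

def model_relu(x, b, c, b_vec, W):
    # y = b + sum_i c_i * max(0, b_i + sum_j W_ij * x_j)
    return b + c @ relu(b_vec + W @ x)

x = np.array([1.0, 2.0])
W = np.random.randn(6, 2)        # 6 ReLUs ~ 3 hard sigmoids
b_vec = np.random.randn(6)
c = np.random.randn(6)
print(model_relu(x, 0.5, c, b_vec, W))
```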
-
multiple hidden layers
Increasing this hyperparameter (the number of layers) can reduce the loss, but it also increases the complexity of the model.
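A rough sketch of stacking hidden layers: the output of one layer of ReLUs feeds the next (numpy, arbitrary layer sizes, all names illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def deep_forward(x, layers):
    # layers: list of (W, b) pairs; ReLU on hidden layers, linear output layer
    a = x
    for W, b in layers[:-1]:
        a = relu(W @ a + b)
    W_out, b_out = layers[-1]
    return W_out @ a + b_out

x = np.random.randn(2)                                  # 2 input features
layers = [(np.random.randn(4, 2), np.random.randn(4)),  # hidden layer 1
          (np.random.randn(4, 4), np.random.randn(4)),  # hidden layer 2
          (np.random.randn(1, 4), np.random.randn(1))]  # output layer
print(deep_forward(x, layers))
```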
-
-
deep means many hidden layers, but why do we want "deep" rather than "fat" (just putting all the neurons in one layer)??? --hhh