【ML】 Hung-yi Lee (李宏毅) Machine Learning Notes

My GitHub link - course-related code:

https://github.com/YidaoXianren/Machine-Learning-course-note

0. Introduction

  • Machine Learning: define a set of functions, evaluate the goodness of a function, pick the best function
  • Regression outputs a scalar; Classification outputs either (1) yes/no (Binary Classification) or (2) one of several classes (Multi-class Classification)
  • Choosing a different function set means choosing a different model. The simplest model is the linear model; there are also many nonlinear models, such as deep learning, SVM, decision tree, kNN, etc. All of the above are supervised learning - they require collecting a lot of training data
  • Semi-supervised learning - some of the data have labels and some do not
  • Transfer Learning - uses data not directly related to the task considered
  • Unsupervised Learning
  • Structured Learning - beyond classification (the output is a structured object)
  • Reinforcement Learning - no supervised guidance, only a good-or-bad scoring signal (learning from critics)
  • Blue: scenario; red: task - the problem to solve; green: method.

(figure: map of learning scenarios, tasks, and methods)

1. Regression

  • output a scalar
  • Step 1: Model: y = b + wx_{cp}, where w and b are parameters (w: weight, b: bias)
  • Linear Model: y = b + \sum_i w_ix_i
  • Step 2: Goodness of Function - Loss function L: its input is a function, its output is how bad that function is
  • First version of the loss function: L(f) = \sum_n (\hat y^n - f(x^n_{cp}))^2, or equivalently L(w,b) = \sum_n (\hat y^n - (b+wx^n_{cp}))^2
  • Step 3: Best Function - w^*, b^* = \arg \min_{w,b}L(w,b) = \arg \min_{w,b}\sum_n (\hat y^n-(b+wx^n_{cp}))^2
  • Gradient Descent - applicable as long as the loss function is differentiable with respect to its parameters; the model does not have to be a linear equation
  • Pick an initial value w^0; compute \frac{dL}{dw}\big|_{w=w^0}; update w^1 \leftarrow w^0 - \eta \frac{dL}{dw}\big|_{w=w^0}, where \eta is the learning rate. Repeat this step until the gradient is zero.
  • For two parameters w^*, b^*: pick initial values w^0, b^0; compute \frac{\partial L}{\partial w}\big|_{w=w^0, b=b^0} and \frac{\partial L}{\partial b}\big|_{w=w^0, b=b^0}; update w^1 \leftarrow w^0 - \eta \frac{\partial L}{\partial w}\big|_{w=w^0, b=b^0}, b^1 \leftarrow b^0 - \eta \frac{\partial L}{\partial b}\big|_{w=w^0, b=b^0}. Repeat this step until the gradient is zero.
  • The result \theta^* obtained this way satisfies \theta^* = \arg \min_\theta L(\theta)
  • Drawback of gradient descent: it may get stuck at a saddle point or a local minimum
  • For linear regression the loss function is convex, so this drawback does not arise.
  • Linear Regression - gradient descent formula summary (a runnable sketch follows this list):
    • L(w,b) = \sum^{10}_{n=1}(\hat y^n-(b+wx^n_{cp}))^2
    • \frac{\partial L}{\partial w} = \sum^{10}_{n=1}2(\hat y^n-(b+wx^n_{cp}))(-x^n_{cp})
    • \frac{\partial L}{\partial b} = \sum^{10}_{n=1}2(\hat y^n-(b+wx^n_{cp}))(-1)
  • A more complex model does not necessarily perform better on test data; it may be overfitting
  • Remedies for overfitting: 1. collect more input data; 2. regularization
  • Regularization
    • y = b + \sum_i w_ix_i, \quad L = \sum_n\left(\hat y^n-(b + \sum_i w_ix^n_i)\right)^2 + \lambda \sum_i (w_i)^2
    • We want not only a function with small loss but also a smooth function (regularization makes the function smoother because the weights w are small) - a smoother function is more likely to be correct
    • A larger \lambda gives a smoother function; a smaller \lambda gives a less smooth one. As \lambda grows, the objective trades off minimizing the loss against keeping the weights small, so (compared with no regularization) less emphasis is placed on minimizing the training error; therefore the training error increases as \lambda increases, while the test error first decreases and then increases.
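A minimal sketch of the update rules above on a hypothetical toy dataset (the x_cp and y_hat values are made up for illustration, not the course's data). Setting lam > 0 adds the L2 regularization term from the bullet above; lam = 0 recovers the plain loss.

```python
import numpy as np

np.random.seed(0)

# Hypothetical toy data standing in for (x_cp, y_hat) pairs; values are made up.
x_cp = np.linspace(0.0, 1.0, 10)
y_hat = 2.5 * x_cp + 0.8 + 0.05 * np.random.randn(10)

w, b = 0.0, 0.0      # initial parameters w^0, b^0
eta = 0.05           # learning rate
lam = 0.1            # regularization strength lambda (set to 0 for the unregularized loss)

for step in range(5000):
    err = y_hat - (b + w * x_cp)                      # \hat y^n - (b + w x^n_cp)
    grad_w = np.sum(2 * err * (-x_cp)) + 2 * lam * w  # dL/dw, plus the regularization term
    grad_b = np.sum(2 * err * (-1.0))                 # dL/db (the bias is not regularized)
    w -= eta * grad_w                                 # w^{t+1} = w^t - eta * dL/dw
    b -= eta * grad_b                                 # b^{t+1} = b^t - eta * dL/db

print(f"w = {w:.3f}, b = {b:.3f}")
```

Increasing lam pulls w toward zero and makes the fitted line flatter, which is the same trade-off described above: training error rises with \lambda while the function becomes smoother.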

2. Error

  • Example of estimator bias: the sample mean m = \frac{1}{N}\sum_nx^n is an unbiased estimator of the mean, while the sample variance s^2 = \frac{1}{N}\sum_n(x^n-m)^2 satisfies E[s^2] = \frac{N-1}{N}\sigma^2 \neq \sigma^2, so it is biased. We want estimators with low bias and low variance.
  • Low-degree (simple) models have small variance, while complex models lead to large variance. Simple models are less affected by the sampled data.
  • Bias: whether the average of all the f^* is close to \hat f. Here f^* is the best function (model) found in each training run (note: each run uses its own sample of data), and \hat f is the true function (model).
  • Simple models have larger bias and smaller variance, while complex models have smaller bias and larger variance.
  • If the error mainly comes from large variance, the model is overfitting; if it mainly comes from large bias, the model is underfitting.
  • If the model cannot fit the training data, the bias is large; if the model fits the training data well but fits the test data poorly, the variance is large.
  • For large bias: add more features or use a more complex model.
  • For large variance: get more data, or use regularization (all the fitted curves become smoother).
  • Cross Validation: Training Set, Validation Set, Testing Set (Public, Private)
  • N-fold Cross Validation: first split the data into a training set and a validation set; use the training set to train each candidate model and the validation set to choose among them. After a model is chosen, retrain its parameters on the whole dataset (training set + validation set). A sketch is given after this list.
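A minimal sketch of N-fold cross validation for model selection, assuming a hypothetical toy dataset and polynomial models fitted by least squares (the data, candidate degrees, and fold count are illustrative, not from the course).

```python
import numpy as np

np.random.seed(0)

# Hypothetical toy data: the true relation is quadratic plus noise.
x = np.random.uniform(-3, 3, 60)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.5 * np.random.randn(60)

def fit_poly(x_tr, y_tr, degree):
    return np.polyfit(x_tr, y_tr, degree)          # least-squares polynomial fit

def mse(coeffs, x_te, y_te):
    return np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)

N_FOLDS = 3
folds = np.array_split(np.random.permutation(len(x)), N_FOLDS)

avg_err = {}
for degree in [1, 2, 3, 5, 9]:                      # candidate models
    errs = []
    for k in range(N_FOLDS):
        val_idx = folds[k]
        tr_idx = np.concatenate([folds[j] for j in range(N_FOLDS) if j != k])
        coeffs = fit_poly(x[tr_idx], y[tr_idx], degree)
        errs.append(mse(coeffs, x[val_idx], y[val_idx]))
    avg_err[degree] = np.mean(errs)                 # average validation error over folds

best_degree = min(avg_err, key=avg_err.get)         # model with lowest average error
final_model = fit_poly(x, y, best_degree)           # retrain on the full dataset
print(avg_err, "best degree:", best_degree)
```

The average validation error across folds stands in for the unknown test error; the degree with the lowest average is kept and refitted on all the data, as described in the bullet above.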

3. Gradient Descent

  • \theta^* = \arg \min_\theta L(\theta), where L is the loss function and \theta are the parameters
  • Suppose \theta has two components \{\theta_1, \theta_2\}; then:
  • \theta^0 = \left[ \begin{matrix} \theta^0_1 \\ \theta^0_2 \end{matrix} \right], \quad \theta^1 = \theta^0 - \eta\nabla L(\theta^0)
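A minimal sketch of this vectorized update, assuming a hypothetical two-parameter quadratic loss chosen only so that \nabla L has a simple closed form.

```python
import numpy as np

# Hypothetical quadratic loss L(theta) = (theta_1 - 3)^2 + 2*(theta_2 + 1)^2,
# chosen only so the gradient can be written down directly.
def grad_L(theta):
    return np.array([2 * (theta[0] - 3), 4 * (theta[1] + 1)])

theta = np.array([0.0, 0.0])   # theta^0
eta = 0.1                      # learning rate

for step in range(200):
    theta = theta - eta * grad_L(theta)   # theta^{t+1} = theta^t - eta * grad L(theta^t)

print(theta)   # approaches the minimizer [3, -1]
```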