[Machine Learning] 1 Linear Regression

1 Linear Regression with One Variable / Univariate Linear Regression

1.1 model

  • Hypothesis: $h_\theta(x)=\theta_0+\theta_1x$
  • Parameters: $\theta_0,\theta_1$
  • Cost Function (squared error cost function): $J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$
  • Goal (Objective Function): $\mathop{\text{minimize}}\limits_{\theta_0,\theta_1} J(\theta_0,\theta_1)$
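As a concrete illustration of the cost function above, here is a minimal NumPy sketch that evaluates $J(\theta_0,\theta_1)$ on a training set; the names compute_cost, x and y are illustrative, not from the original notes.

import numpy as np

def compute_cost(theta0, theta1, x, y):
    # hypothesis h_theta(x) = theta0 + theta1 * x, evaluated on all m examples
    m = len(y)
    h = theta0 + theta1 * x
    # squared error cost J = 1/(2m) * sum((h - y)^2)
    return np.sum((h - y) ** 2) / (2 * m)

# tiny usage example with made-up data
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(compute_cost(0.0, 2.0, x, y))  # 0.0, since y = 2x fits exactly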

1.2 ‘Batch’ Gradient Descent Algorithm

to solve $\mathop{\text{minimize}}\limits_{\theta_0,\theta_1} J(\theta_0,\theta_1)$

1.2.1 algorithm

  • repeat until convergence {
    $\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)$ (for $j=0$ and $j=1$)
    $\alpha$: learning rate
    }
  • Correct: $\begin{aligned} {temp}_0&:=\theta_0-\alpha\frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1)\\ {temp}_1&:=\theta_1-\alpha\frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1)\\ \theta_0&:={temp}_0\\ \theta_1&:={temp}_1 \end{aligned}$
  • Notice: need to simultaneously update $\theta_0$ and $\theta_1$
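A minimal sketch of one simultaneous-update step, assuming the two partial derivatives have already been evaluated into grad0 and grad1 (all names are illustrative):

def gradient_step(theta0, theta1, grad0, grad1, alpha):
    # compute both updates from the *old* parameter values ...
    temp0 = theta0 - alpha * grad0
    temp1 = theta1 - alpha * grad1
    # ... and only then overwrite them: a simultaneous update
    return temp0, temp1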

1.2.2 use for univariate linear regression

  • repeat {
    $\begin{aligned}\theta_0&:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\\ \theta_1&:=\theta_1-\alpha\frac{1}{m}\sum_{i=1}^m((h_\theta(x^{(i)})-y^{(i)})\cdot x^{(i)}) \end{aligned}$
    }
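The update rule above translates directly into code; a minimal NumPy sketch of the batch gradient descent loop, using a fixed iteration count instead of a formal convergence test (names are illustrative):

import numpy as np

def univariate_gradient_descent(x, y, alpha=0.01, num_iters=1500):
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        error = (theta0 + theta1 * x) - y      # h_theta(x^(i)) - y^(i) for every i
        grad0 = np.sum(error) / m              # partial derivative w.r.t. theta0
        grad1 = np.sum(error * x) / m          # partial derivative w.r.t. theta1
        # simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1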

1.2.3 characteristics

  • Every step of gradient descent uses all of the training examples (hence the name 'batch')

2 Linear Regression with Multiple Variables / Multivariate Linear Regression

2.1 model

  • Hypothesis (with $x_0=1$): $h_\theta(x)=\theta^Tx=\theta_0x_0+\theta_1x_1+\cdots+\theta_nx_n$
  • Parameters: the $(n+1)$-dimensional vector $\theta=(\theta_0,\theta_1,\cdots,\theta_n)$
  • Cost Function (squared error cost function): $J(\theta)=J(\theta_0,\theta_1,\cdots,\theta_n)=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$
  • Goal (Objective Function): $\mathop{\text{minimize}}\limits_{\theta_0,\theta_1,\cdots,\theta_n} J(\theta_0,\theta_1,\cdots,\theta_n)$
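With $x_0=1$ folded into the data, the hypothesis and cost above vectorize naturally; a minimal sketch assuming X already contains the column of ones (names are illustrative):

import numpy as np

def hypothesis(theta, X):
    # h_theta(x) = theta^T x for every row of X, i.e. X @ theta
    return X @ theta

def cost(theta, X, y):
    m = len(y)
    residual = hypothesis(theta, X) - y
    return residual @ residual / (2 * m)   # 1/(2m) * sum of squared errors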

2.2 Gradient Descent Algorithm

2.2.1 algorithm

  • repeat until convergence {
    $\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1,\cdots,\theta_n)$ (simultaneously update $\theta_j$ for $j=0,1,\cdots,n$)
    }

2.2.2 use for multiple linear regression

  • repeat {
    (with $x_0^{(i)}=1$)
    $\begin{aligned}\theta_0&:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^m((h_\theta(x^{(i)})-y^{(i)})\cdot x_0^{(i)})\\ \theta_1&:=\theta_1-\alpha\frac{1}{m}\sum_{i=1}^m((h_\theta(x^{(i)})-y^{(i)})\cdot x_1^{(i)})\\ &\cdots\\ \theta_n&:=\theta_n-\alpha\frac{1}{m}\sum_{i=1}^m((h_\theta(x^{(i)})-y^{(i)})\cdot x_n^{(i)}) \end{aligned}$
    }
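The per-parameter updates above collapse into the single vectorized update $\theta:=\theta-\frac{\alpha}{m}X^T(X\theta-y)$; a minimal NumPy sketch (illustrative names, fixed iteration count):

import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    m, n_plus_1 = X.shape                       # X includes the x_0 = 1 column
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m    # all partial derivatives at once
        theta = theta - alpha * gradient        # simultaneous update of every theta_j
    return theta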

3 Gradient Descent in Practice

3.1 Feature Scaling

  • Goal: make sure features are on a similar scale
  • Advantages:
    (1) makes gradient descent run much faster
    (2) converges in far fewer iterations
  • Methods:
    (1) dividing by the maximum value
    (2) mean normalization

3.1.1 mean normalization

  • Theory: replace $x_i$ with $x_i-\mu_i$ so that features have approximately zero mean
    (do not apply this to $x_0=1$)
    $x_i=\frac{x_i-\mu_i}{s_i}$
  • $\mu_i$: average value of $x_i$ over the training set
  • $s_i$: the range of values of that feature, either
    (1) $\text{maximum value}-\text{minimum value}$
    (2) the standard deviation of the feature
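A minimal NumPy sketch of mean normalization over the columns of a feature matrix, using the standard deviation as $s_i$ (illustrative names; the $x_0=1$ column is assumed to be added afterwards):

import numpy as np

def mean_normalize(X):
    mu = X.mean(axis=0)         # mu_i: mean of each feature
    sigma = X.std(axis=0)       # s_i: here the standard deviation (max - min would also work)
    X_norm = (X - mu) / sigma
    return X_norm, mu, sigma    # keep mu and sigma to scale new examples the same way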

3.2 Learning Rate

  • Debugging: how to make sure gradient descent is working correctly

3.2.1 good method: plot

  • plot the value of the cost function $J(\theta)$ against the number of iterations
  • Advantages:
    (1) shows whether gradient descent is working correctly: $J(\theta)$ should decrease after every iteration
    (2) makes it possible to judge whether or not gradient descent has converged
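A minimal sketch of this diagnostic plot, assuming a list cost_history was collected during gradient descent (matplotlib is used for illustration; all names are hypothetical):

import matplotlib.pyplot as plt

def plot_cost_history(cost_history):
    # cost_history[k] holds J(theta) after iteration k
    plt.plot(range(len(cost_history)), cost_history)
    plt.xlabel('number of iterations')
    plt.ylabel('J(theta)')
    plt.title('J(theta) should decrease on every iteration')
    plt.show()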

3.2.2 other method

  • declare convergence if $J(\theta)$ decreases by less than some small threshold $\varepsilon$ in one iteration
  • Advantage: convergence is judged automatically
  • Disadvantage: choosing $\varepsilon$ can be difficult
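A minimal sketch of this automatic convergence test wrapped around a gradient descent step; epsilon and the helper names are illustrative assumptions:

def run_until_converged(step, compute_cost, theta, epsilon=1e-3, max_iters=10000):
    # step(theta) performs one gradient descent update; compute_cost(theta) returns J(theta)
    prev_cost = compute_cost(theta)
    for _ in range(max_iters):
        theta = step(theta)
        cost = compute_cost(theta)
        if prev_cost - cost < epsilon:   # J decreased by less than epsilon: declare convergence
            break
        prev_cost = cost
    return theta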

3.2.3 choose $\alpha$

  • consider: $\alpha=0.01, 0.03, 0.1, 0.3, 1, 3, 10, \cdots$

3.2.4 plot problem

  • Problem: $J(\theta)$ does not decrease after every iteration
  • Cause: $\alpha$ is too large, so $J(\theta)$ may not decrease on every iteration; gradient descent may fail to converge, and slow convergence is also possible
  • Solution: choose a sufficiently smaller $\alpha$
  • Problem caused by the solution above: if $\alpha$ is too small, gradient descent can be slow to converge

3.3 Features and Polynomial Regression

  • linear regression does not fit every dataset
  • polynomial regression can be turned into linear regression by defining new features (e.g. $x_1=x$, $x_2=x^2$, $x_3=x^3$)
  • we need to look at the training set in order to choose an appropriate model
  • Notice: feature scaling is necessary when using polynomial regression, since the powers of a feature take very different ranges of values
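A minimal sketch of this idea: build polynomial features from a single input and scale them, after which ordinary multivariate linear regression applies (names and data are illustrative):

import numpy as np

def polynomial_features(x, degree):
    # columns x, x^2, ..., x^degree; their ranges differ wildly,
    # which is why feature scaling matters here
    return np.column_stack([x ** d for d in range(1, degree + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0])
X_poly = polynomial_features(x, 3)
X_scaled = (X_poly - X_poly.mean(axis=0)) / X_poly.std(axis=0)   # mean normalization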

4 Normal Equation

A method to solve for $\theta$ analytically, obtaining the optimal value in a single step rather than by iterating.

4.1 model

  • $m$ examples $(x^{(1)},y^{(1)}),\cdots,(x^{(m)},y^{(m)})$; $n$ features
  • each $x^{(i)}$ is an $(n+1)$-dimensional vector: $x^{(i)}=\left[\begin{matrix} x_0^{(i)}\\ \vdots\\ x_n^{(i)} \end{matrix}\right]$
  • design matrix ($m\times(n+1)$): $X=\left[\begin{matrix} (x^{(1)})^T\\ \vdots\\ (x^{(m)})^T \end{matrix}\right]$
  • $y=\left[\begin{matrix} y^{(1)}\\ \vdots\\ y^{(m)} \end{matrix}\right]$
  • $\theta=(X^TX)^{-1}X^Ty$
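A brief sketch of where this closed form comes from (a standard derivation, not spelled out in the original notes): writing the cost in matrix form as $J(\theta)=\frac{1}{2m}\|X\theta-y\|^2$ and setting its gradient to zero gives
$\nabla_\theta J=\frac{1}{m}X^T(X\theta-y)=0 \;\Rightarrow\; X^TX\theta=X^Ty \;\Rightarrow\; \theta=(X^TX)^{-1}X^Ty$.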

4.2 implementation

  • Octave
 pinv(X'*X)*X'*y
  • Python
import numpy as np
def normalEqn(X, y):
    # X.T @ X is equivalent to X.T.dot(X)
    theta = np.linalg.inv(X.T @ X) @ X.T @ y
    return theta
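A small usage sketch, assuming X already contains the leading column of ones (data values are made up; np.linalg.pinv could replace inv to tolerate a singular $X^TX$, as discussed below):

import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])       # first column is x_0 = 1
y = np.array([2.0, 4.0, 6.0])
theta = normalEqn(X, y)
print(theta)                     # approximately [0, 2], i.e. y = 2x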

Problem: what if $X^TX$ is non-invertible (singular or degenerate)?

  • causes:
    (1) redundant features (linearly dependent)
    (2) too many features: delete some features or use regularization
  • A non-invertible $X^TX$ rarely occurs in practice
  • pseudo-inverse: pinv() (still produces a usable $\theta$ even when $X^TX$ is singular)
  • inverse: inv()

5 Gradient Descent vs. Normal Equation

| Gradient Descent | Normal Equation |
| --- | --- |
| needs to choose $\alpha$ | no need to choose $\alpha$ |
| needs many iterations | no iteration needed |
| works well even when $n$ is large | slow if $n$ is very large, because $(X^TX)^{-1}$ must be computed |
| applies to many types of models | only applies to linear regression, not to logistic regression |

6 References

Andrew Ng, Machine Learning, Coursera
Huang Haiguang (黄海广), Machine Learning notes
