Machine Learning 02 - Multivariate Linear Regression

I am taking Stanford's Machine Learning course taught by Andrew Ng and keep notes as I go, for review and consolidation.
My knowledge is limited, so if you spot errors or omissions, please bear with me and point them out. Fellow learners are very welcome to join the discussion!

Week 02

2.1 Multivariate Linear Regression

2.1.1 Multiple Features
  • The multivariable form of the hypothesis function :
    $h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n$

    $= \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \theta^T x$
  • Remark : For convenience, assume $x_0^{(i)} = 1$ for $i \in \{1, \cdots, m\}$.
  • The cost function $J(\theta)$ has the same form :
    $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
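
A minimal NumPy sketch of both $h_\theta(x) = \theta^T x$ and $J(\theta)$; the matrix, targets, and parameter values below are made-up illustrations, not from the course:

```python
import numpy as np

# Hypothetical data: m = 3 examples, n = 2 features, with the
# convention x0 = 1 as the first column of every example.
X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1600.0, 3.0],
              [1.0, 2400.0, 4.0]])
y = np.array([400.0, 330.0, 369.0])
theta = np.array([10.0, 0.15, 20.0])

# h_theta(x) = theta^T x, computed for all m examples at once.
predictions = X @ theta

# J(theta) = (1/2m) * sum_i (h_theta(x^(i)) - y^(i))^2
m = len(y)
cost = ((predictions - y) ** 2).sum() / (2 * m)
```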
2.1.2 Gradient Descent


  • Gradient descent for multivariate linear regression - Algorithm 1’

Repeat {

$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

(simultaneously update $\theta_j$ for $j = 0, \cdots, n$)
}
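
A vectorized sketch of Algorithm 1’, assuming NumPy and a design matrix whose first column follows the $x_0 = 1$ convention; alpha and the iteration count are illustrative defaults:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1500):
    """theta_j := theta_j - alpha * (1/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i).

    The single matrix expression below updates every theta_j at once,
    which is exactly the simultaneous update the algorithm requires.
    """
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        errors = X @ theta - y              # h_theta(x^(i)) - y^(i) for all i
        theta = theta - alpha / m * (X.T @ errors)
    return theta
```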

2.1.3 Practical Tricks in GD
  • Feature Scaling $s_i$
    • Idea : Make sure features are on a similar scale. Gradient descent converges quickly when all features lie in small, similar ranges, but oscillates inefficiently down to the optimum when their ranges differ widely.
    • Get every feature into approximately a $-1 \le x_i \le 1$ range (the bound 1 is not a strict requirement).
    • Remark : The quizzes in this course use the range; the programming exercises use the standard deviation.
  • Mean Normalization $\mu_i$
    • Replace $x_i$ with $x_i - \mu_i$ to give the features approximately zero mean (do not apply this to $x_0 = 1$).
    • In general, we have (implemented in the sketch after this list) :
      $x_i := \dfrac{x_i - \mu_i}{s_i}$

      where $\mu_i$ is the average of all the values for feature $i$ and $s_i$ is either the range of values (max − min) or the standard deviation.
  • Learning Rate Check
    • To debug gradient descent, plot $J(\theta)$ against the number of iterations and judge whether $J(\theta)$ is converging :
      • If $\alpha$ is too small : slow convergence.
      • If $\alpha$ is too large : $J(\theta)$ may not decrease on every iteration and may not converge.
    • Try values such as $\ldots, 0.001, 0.003, 0.01, 0.03, 0.1, \ldots$ (steps of roughly $1 \times 10^{k}$ and $3 \times 10^{k}$), judging from the plot; the sketch after this list records exactly this cost history.
    • It has been proven that if the learning rate $\alpha$ is sufficiently small, then $J(\theta)$ will decrease on every iteration.
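
A minimal sketch tying both tricks together, assuming NumPy; the data, alpha, and iteration count are illustrative, and the recorded cost history is what you would plot against the iteration number:

```python
import numpy as np

def feature_normalize(X):
    """x_i := (x_i - mu_i) / s_i with s_i = standard deviation.

    Apply only to the real features, never to the x0 = 1 column.
    Use X.max(0) - X.min(0) as s_i to scale by the range instead.
    """
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

def gradient_descent_with_history(X, y, alpha=0.1, num_iters=400):
    """Run gradient descent and record J(theta) at every iteration."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(num_iters):
        theta -= alpha / m * (X.T @ (X @ theta - y))
        history.append(((X @ theta - y) ** 2).sum() / (2 * m))
    return theta, history

# Hypothetical data: two features on very different scales.
X_raw = np.array([[2104.0, 3.0], [1600.0, 3.0], [2400.0, 4.0]])
y = np.array([400.0, 330.0, 369.0])
X_norm, mu, sigma = feature_normalize(X_raw)
X = np.column_stack([np.ones(len(y)), X_norm])   # prepend x0 = 1
theta, history = gradient_descent_with_history(X, y)
assert all(a >= b for a, b in zip(history, history[1:]))  # J never increases
```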
2.1.4 Improving Linear Regression
  • Feature Combination
    • Combine two or more features into one, e.g. multiply frontage and depth into a single area feature.
  • Polynomial Regression
    $h_\theta(x) = \theta_0 x_0 + \theta_1 x_1^{a_1} + \theta_2 x_2^{a_2} + \cdots + \theta_n x_n^{a_n}$
  • Remark : One important thing to keep in mind is that if you choose your features this way, feature scaling becomes very important (see the sketch below).
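
A brief sketch of building polynomial features from a single input, assuming NumPy; the exponents 1..3 are an illustrative choice of the $a_i$, and the scaling step shows why the Remark matters:

```python
import numpy as np

sizes = np.array([2104.0, 1600.0, 2400.0])   # hypothetical single feature

# Columns [x0, x, x^2, x^3]; the cubed column reaches ~1.4e10 while x
# stays ~2e3, so without scaling gradient descent would oscillate badly.
X_poly = np.column_stack([np.ones_like(sizes), sizes, sizes**2, sizes**3])
X_poly[:, 1:] = (X_poly[:, 1:] - X_poly[:, 1:].mean(axis=0)) / X_poly[:, 1:].std(axis=0)
```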

2.2 Another Method : the Normal Equation

2.2.1 Normal Equation

Define

$x^{(i)} = \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix} \in \mathbb{R}^{n+1} \quad (1 \le i \le m), \qquad X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ (x^{(3)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}, \qquad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ y^{(3)} \\ \vdots \\ y^{(m)} \end{bmatrix}$

and
$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}$

Then the normal equation formula is :
$\theta = (X^T X)^{-1} X^T y$
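
A direct NumPy sketch of the formula; `pinv` (the pseudo-inverse) is used rather than a plain inverse so the code also survives the non-invertible case discussed in 2.2.3:

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y, solved in closed form
    (no alpha, no iterations, no feature scaling needed)."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```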

2.2.2 Comparison of GD and NE
  • Gradient Descent
    • Needs to choose the learning rate $\alpha$
    • Needs many iterations
    • $O(kn^2)$ for $k$ iterations
    • Works well even when $n$ is large.
  • Normal Equation
    • No need to choose $\alpha$
    • No need to iterate
    • $O(n^3)$ to compute $(X^T X)^{-1}$
    • Slow if $n$ is very large.

2.2.3 The non-invertible case of $X^T X$
  • Reason 1 : Redundant features, i.e. two or more features are linearly dependent (e.g. size in feet² and size in m²); delete one of them. A small demonstration follows below.
  • Reason 2 : Too many features (e.g. $m \le n$); delete some features or use “regularization”.
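
A small demonstration of Reason 1, with made-up numbers: the third column is the second rescaled, so the columns of $X$ are linearly dependent and $X^T X$ loses full rank:

```python
import numpy as np

# Redundant feature: column 3 = 0.0929 * column 2, i.e. the same size
# expressed in m^2 instead of feet^2 -- linearly dependent columns.
X = np.array([[1.0, 2104.0, 2104.0 * 0.0929],
              [1.0, 1600.0, 1600.0 * 0.0929],
              [1.0, 2400.0, 2400.0 * 0.0929]])
print(np.linalg.matrix_rank(X.T @ X))   # 2 < 3: X^T X is singular
# inv(X.T @ X) is unreliable here; pinv still returns a usable theta.
```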
