Machine Learning 02 - Multivariate Linear Regression


Week 02

2.1 Multivariate Linear Regression

2.1.1 Multiple Features
  • The multivariable form of the hypothesis function :
    $h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n$

    $= \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \theta^T x$
  • Remark : For convenience, assume $x_0^{(i)} = 1$ for $i \in \{1, \cdots, m\}$.
  • The cost function $J(\theta)$ has the same form as in the single-feature case :
    $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
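
The two formulas above can be sketched in NumPy as follows. This is a minimal illustration, not course code: the names `hypothesis` and `compute_cost` and the toy data are assumptions made here. Each row of `X` holds one example, with a leading 1 for $x_0$.

```python
import numpy as np

def hypothesis(X, theta):
    """Return h_theta(x) = theta^T x for every row x of X."""
    return X @ theta

def compute_cost(X, y, theta):
    """Return J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    m = len(y)
    errors = hypothesis(X, theta) - y
    return (errors @ errors) / (2 * m)

# Toy data (made up for illustration): y = 1 + 2*x1 + 3*x2 exactly.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 0.0],
              [1.0, 3.0, 1.0]])
y = np.array([9.0, 5.0, 10.0])

cost_at_truth = compute_cost(X, y, np.array([1.0, 2.0, 3.0]))  # perfect fit
cost_at_zero = compute_cost(X, y, np.zeros(3))                 # all-zero theta
```

With the generating parameters the cost is zero, while $\theta = 0$ gives $(9^2 + 5^2 + 10^2) / (2 \cdot 3)$.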
2.1.2 Gradient Descent

  • Gradient descent for multivariate linear regression - Algorithm 1’

Repeat {

$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

(simultaneously update $\theta_j$ for $j = 0, \cdots, n$)

}
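
A vectorized sketch of this update rule, assuming NumPy. The function name `gradient_descent`, the toy data, and the chosen `alpha` and `num_iters` are illustrative assumptions. Computing the whole gradient vector before overwriting `theta` is what makes the update simultaneous for all $j$.

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch gradient descent for linear regression."""
    m = len(y)
    for _ in range(num_iters):
        errors = X @ theta - y           # h_theta(x^(i)) - y^(i) for all i
        gradient = (X.T @ errors) / m    # (1/m) * sum_i errors_i * x_j^(i), for all j
        theta = theta - alpha * gradient # every theta_j updated at once
    return theta

# Toy data (made up for illustration): y = 1 + 2*x1, with x0 = 1 in the first column.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = gradient_descent(X, y, np.zeros(2), alpha=0.1, num_iters=5000)
```

On this data `theta` should end up very close to `[1, 2]`, the parameters used to generate `y`.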

2.1.3 Practical Tricks in GD
  • Feature Scaling $s_i$
    • Idea : Make sure features are on a similar scale. $\theta$ descends quickly on small ranges and slowly on large ranges, so it oscillates inefficiently down to the optimum when the feature ranges are very uneven.
    • Get every feature into approximately a $-1 \le x_i \le 1$ range (the bound 1 is not a strict requirement; any similarly small range works).
    • Remark : The quizzes in this course use the range for $s_i$; the programming exercises use the standard deviation.
  • Mean Normalization $\mu_i$
    • Replace $x_i$ with $x_i - \mu_i$ to make features have approximately zero mean (do not apply this to $x_0 = 1$).
    • In general, we have :
      $x_i := \frac{x_i - \mu_i}{s_i}$

      where $\mu_i$ is the average of all the values for feature $i$ and $s_i$ is either the range of values (max - min) or the standard deviation.
  • Learning Rate Check
    • To debug gradient descent, plot $J(\theta)$ against the number of iterations (on the x-axis) and judge whether $J(\theta)$ converges, i.e. decreases and then levels off at its minimum :
      • If $\alpha$ is too small, convergence is slow.
      • If $\alpha$ is too large, $J(\theta)$ may not decrease on every iteration and may not converge.
    • Try values such as $1 \times 10^{k}$ or $3 \times 10^{k}$ (or other similar values) and choose among them by judging from the plot.
    • It has been proven that if the learning rate $\alpha$ is sufficiently small, then $J(\theta)$ will decrease on every iteration.
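
The tricks above can be tried together in a short sketch, assuming NumPy. The names `normalize_features` and `cost_history`, the toy housing data, and the three learning rates are assumptions chosen for illustration. The features are mean-normalized and scaled by their range, then $J(\theta)$ is recorded per iteration for a small, a moderate, and an overly large $\alpha$.

```python
import numpy as np

def normalize_features(X_raw, use_std=True):
    """Apply x_i := (x_i - mu_i) / s_i to each column (no x0 column in X_raw).

    s_i is the standard deviation if use_std is True, else the range (max - min).
    Note: np.std computes the population standard deviation by default.
    """
    mu = X_raw.mean(axis=0)
    s = X_raw.std(axis=0) if use_std else X_raw.max(axis=0) - X_raw.min(axis=0)
    return (X_raw - mu) / s, mu, s

def cost_history(X, y, alpha, num_iters):
    """Run gradient descent from theta = 0 and record J(theta) after every step."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(num_iters):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / m
        errors = X @ theta - y
        history.append((errors @ errors) / (2 * m))
    return np.array(history)

# Toy data (made up): house size, number of bedrooms, and price.
X_raw = np.array([[2104.0, 3.0],
                  [1600.0, 3.0],
                  [2400.0, 4.0],
                  [1416.0, 2.0]])
y = np.array([400.0, 330.0, 369.0, 232.0])

X_norm, mu, s = normalize_features(X_raw, use_std=False)
# Add the x0 = 1 column only after scaling, so it is left untouched.
X = np.column_stack([np.ones(len(y)), X_norm])

slow = cost_history(X, y, alpha=0.01, num_iters=50)      # decreases, but slowly
fast = cost_history(X, y, alpha=0.3, num_iters=50)       # decreases much faster
too_large = cost_history(X, y, alpha=3.0, num_iters=50)  # grows on every step
```

Plotting each history against the iteration index gives the curves described in the list: `slow` and `fast` fall steadily, while `too_large` climbs. A feature with a single constant value would make $s_i = 0$, so this sketch assumes every column varies.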
2.1.4 Improvement of Linear Regression
  • Feature Combination
    • Combine several features into one in a variety of ways, for example by taking their product $x_1 \cdot x_2$.
  • Polynomial Regression
    $h_\theta(x) = \theta_0 x_0 + \theta_1 x_1^{a_1} + \theta_2 x_2^{a_2} + \cdots + \theta_n x_n^{a_n}$
  • Remark : If you choose your features this way, feature scaling becomes very important. For example, if $x_1$ ranges from 1 to 1000, then $x_1^2$ ranges from 1 to $10^6$.
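
A small sketch of why scaling matters for polynomial features, assuming NumPy. The single `size` feature and the choice of powers 1, 2 and 3 are illustrative assumptions.

```python
import numpy as np

# Toy data (made up): one raw feature ranging from 1 to 1000.
size = np.array([1.0, 10.0, 100.0, 1000.0])

# Polynomial features: size, size^2, size^3.
X_poly = np.column_stack([size, size**2, size**3])
ranges_before = X_poly.max(axis=0) - X_poly.min(axis=0)

# Mean normalization and scaling by range, column by column.
mu = X_poly.mean(axis=0)
X_scaled = (X_poly - mu) / ranges_before
ranges_after = X_scaled.max(axis=0) - X_scaled.min(axis=0)

# Prepend x0 = 1 after scaling.
X = np.column_stack([np.ones(len(size)), X_scaled])
```

Before scaling, the three columns span ranges of roughly $10^3$, $10^6$ and $10^9$; after scaling, each column spans a range of 1.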

2.2 Another Method : Normal Equation

2.2.1 Normal Equation

For $1 \le i \le m$, let

$x^{(i)} = \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix} \in \mathbb{R}^{n+1}, \quad X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ (x^{(3)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}, \quad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ y^{(3)} \\ \vdots \\ y^{(m)} \end{bmatrix}$

$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}$

Then the normal equation formula is given below :
$\theta = (X^T X)^{-1} X^T y$
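
A minimal NumPy sketch of the normal equation. The function name `normal_equation` and the toy data are assumptions made for illustration. Rather than forming $(X^T X)^{-1}$ explicitly, the sketch solves the linear system $(X^T X)\theta = X^T y$, which gives the same $\theta$ when $X^T X$ is invertible.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for theta."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data (made up for illustration): y = 1 + 2*x1, with x0 = 1 in the first column.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = normal_equation(X, y)
```

No learning rate, iteration count or feature scaling is involved; on this data `theta` matches the generating parameters `[1, 2]` up to rounding.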

2.2.2 Comparison of GD and NE
  • Gradient Descent
    • Need to choose the learning rate $\alpha$
    • Needs many iterations
    • $O(kn^2)$
    • Works well when $n$ is large
  • Normal Equation
    • No need to choose $\alpha$
    • No need to iterate
    • $O(n^3)$, since $(X^T X)^{-1}$ must be computed
    • Slow if $n$ is large

2.2.3 The non-invertible case of $X^T X$
  • Reason 1 : Redundant features, that is, two or more features are linearly dependent. Delete one or more of them.
  • Reason 2 : Too many features (e.g. $m \le n$). Delete some features or use “regularization”.
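
Reason 1 can be reproduced with a small sketch, assuming NumPy. The toy data, in which the third column is exactly twice the second, is made up for illustration. Using the pseudo-inverse via `np.linalg.pinv` is a technique swapped in here as one way to still obtain a least-squares solution; deleting the redundant column is the fix named in the list.

```python
import numpy as np

# Toy data (made up): the third column is exactly 2 * the second column.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0],
              [1.0, 4.0, 8.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])  # y = 1 + 2*x1

gram = X.T @ X
rank_redundant = np.linalg.matrix_rank(gram)  # below 3, so X^T X is singular

# Option 1: the pseudo-inverse still returns a least-squares solution.
theta_pinv = np.linalg.pinv(gram) @ X.T @ y

# Option 2: delete the redundant feature and solve as usual.
X_reduced = X[:, :2]
gram_reduced = X_reduced.T @ X_reduced
rank_reduced = np.linalg.matrix_rank(gram_reduced)
theta_reduced = np.linalg.solve(gram_reduced, X_reduced.T @ y)
```

Both options reproduce `y` on this data. Only the reduced problem has a unique $\theta$, because the redundant column lets weight shift freely between the two dependent features.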
