model 1 — linear regression
1 Finding suitable parameters via the cost function
1. steps
- find a suitable model
- calculate the cost function
- minimize the cost function via gradient descent
2. gradient descent
take the model $f(x) = wx + b$ as an example
- cost function:

$$J(w, b) = \frac{1}{2m} \sum_{i=1}^{m}((wx^{(i)}+b) - y^{(i)})^2$$
- gradient descent
- $\alpha$ is the learning rate
- $J(w, b)$ is the cost function

$$w = w - \alpha \frac{\partial}{\partial w}J(w, b)$$
$$b = b - \alpha \frac{\partial}{\partial b}J(w, b)$$
3. notice
when implementing this in code, make sure $w$ and $b$ are updated simultaneously, i.e. both new values are computed from the original $w$ and $b$:

$$tmp\_w = w - \alpha \frac{\partial}{\partial w}J(w, b)$$
$$tmp\_b = b - \alpha \frac{\partial}{\partial b}J(w, b)$$
$$w = tmp\_w$$
$$b = tmp\_b$$
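A minimal NumPy sketch of this update rule (the toy data, the learning rate 0.05, and the iteration count are just illustrative assumptions):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """J(w, b) = 1/(2m) * sum over i of ((w*x_i + b) - y_i)^2."""
    m = x.shape[0]
    return np.sum((w * x + b - y) ** 2) / (2 * m)

def gradient_descent_step(x, y, w, b, alpha):
    """One simultaneous update of w and b for the model f(x) = w*x + b."""
    m = x.shape[0]
    err = (w * x + b) - y                      # prediction error for every example
    tmp_w = w - alpha * np.sum(err * x) / m    # dJ/dw = (1/m) * sum(err * x)
    tmp_b = b - alpha * np.sum(err) / m        # dJ/db = (1/m) * sum(err)
    return tmp_w, tmp_b                        # both computed from the original w, b

# toy data (assumption): roughly y = 2x + 1
x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([3.0, 5.0, 7.0, 9.0])

w, b = 0.0, 0.0
for _ in range(1000):
    w, b = gradient_descent_step(x_train, y_train, w, b, alpha=0.05)
print(w, b, compute_cost(x_train, y_train, w, b))   # w -> 2, b -> 1, cost -> 0
```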
4. learning rate
choosing a suitable learning rate is important:
- $\alpha$ too small: convergence is too slow
- $\alpha$ too large: the update overshoots the minimum and the cost may diverge
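A quick self-contained check of this effect; the toy data and the specific $\alpha$ values are assumptions:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])    # same toy data as above (assumption)
y = np.array([3.0, 5.0, 7.0, 9.0])
m = x.shape[0]

for alpha in (1e-4, 0.05, 0.3):        # too small / reasonable / too large
    w, b = 0.0, 0.0
    for _ in range(100):
        err = (w * x + b) - y
        # tuple assignment keeps the update simultaneous
        w, b = w - alpha * np.sum(err * x) / m, b - alpha * np.sum(err) / m
    cost = np.sum((w * x + b - y) ** 2) / (2 * m)
    print(f"alpha={alpha}: cost after 100 iterations = {cost:.4g}")
# a tiny alpha barely reduces the cost; a too-large alpha overshoots and the cost grows
```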
5. gradient descent for multiple variables
take $f_{\vec{w}, b}(\vec{x}) = \vec{w} \cdot \vec{x} + b$ as an example
- cost function
$$J(\vec{w}, b) = \frac{1}{2m} \sum_{i=1}^{m}(f_{\vec{w}, b}(\vec{x}^{(i)}) - y^{(i)})^2$$
- gradient descent
$$w_j = w_j - \alpha \frac{\partial}{\partial w_j}J(\vec{w}, b)$$
$$b = b - \alpha \frac{\partial}{\partial b}J(\vec{w}, b)$$
- formula for code
$$w_j = w_j - \alpha \frac{1}{m} \sum_{i=1}^{m}((\vec{w} \cdot \vec{x}^{(i)}+b) - y^{(i)})\,x_j^{(i)}$$
$$b = b - \alpha \frac{1}{m} \sum_{i=1}^{m}((\vec{w} \cdot \vec{x}^{(i)}+b) - y^{(i)})$$
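A vectorized NumPy sketch of these per-feature updates; the data and hyperparameters are made up for illustration:

```python
import numpy as np

def gradient_descent_multi(X, y, alpha=0.1, num_iters=5000):
    """Gradient descent for f(x) = w . x + b, with X of shape (m, n)."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(num_iters):
        err = X @ w + b - y            # shape (m,): predictions minus targets
        grad_w = X.T @ err / m         # dJ/dw_j = (1/m) * sum_i err_i * x_j^(i)
        grad_b = np.sum(err) / m       # dJ/db   = (1/m) * sum_i err_i
        w -= alpha * grad_w            # simultaneous update: both gradients were
        b -= alpha * grad_b            # computed before either parameter changed
    return w, b

# toy data (assumption): y = 1*x1 + 2*x2 + 3
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]])
y = np.array([4.0, 5.0, 6.0, 11.0])
w, b = gradient_descent_multi(X, y)
print(w, b)   # should approach [1, 2] and 3
```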
2 Accelerate convergence through feature scaling
take $300 \le x_1 \le 2000,\ 0 \le x_2 \le 5$ as an example (a code sketch applying all three scalings appears after the list below)
1. max normalization
$$x_{1, scaling} = \frac{x_1}{2000}$$
$$x_{2, scaling} = \frac{x_2}{5}$$
2. mean normalization
$\mu$ is the average

$$x_{1, scaling} = \frac{x_1 - \mu_1}{2000-300}$$
$$x_{2, scaling} = \frac{x_2 - \mu_2}{5-0}$$
3. Z-score normalization
each feature is rescaled to zero mean and unit variance (a standard normal distribution if the feature itself is normally distributed)
$\mu$ is the average, $\sigma$ is the standard deviation

$$x_{1, scaling} = \frac{x_1 - \mu_1}{\sigma_1}$$
$$x_{2, scaling} = \frac{x_2 - \mu_2}{\sigma_2}$$
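A small NumPy sketch applying all three scalings to a few made-up rows of $(x_1, x_2)$; the rows are assumptions, only the ranges match the example above:

```python
import numpy as np

# hypothetical raw features: x1 in [300, 2000], x2 in [0, 5]
X = np.array([[2000.0, 5.0],
              [ 300.0, 0.0],
              [1200.0, 2.0],
              [ 800.0, 4.0]])

# 1. max normalization: divide each feature by its maximum
X_max_scaled = X / X.max(axis=0)

# 2. mean normalization: (x - mean) / (max - min)
X_mean_scaled = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# 3. Z-score normalization: (x - mean) / standard deviation
X_z_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_max_scaled)
print(X_mean_scaled)
print(X_z_scaled)
```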
3 Tips for linear regression
- plot the learning curve (cost $J$ versus iteration) to check convergence, as in the sketch below
- set a suitable learning rate; try values such as … 0.001, 0.003, 0.01, 0.03 …
- do feature engineering to choose suitable features
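A sketch of the first two tips together, assuming matplotlib is available and reusing the toy multi-feature data from the earlier sketch:

```python
import numpy as np
import matplotlib.pyplot as plt

# toy data (assumption); in practice the scaled features would go here
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]])
y = np.array([4.0, 5.0, 6.0, 11.0])
m = X.shape[0]

for alpha in (0.001, 0.003, 0.01, 0.03, 0.1):      # candidate learning rates
    w, b = np.zeros(X.shape[1]), 0.0
    history = []
    for _ in range(500):
        err = X @ w + b - y
        w, b = w - alpha * (X.T @ err) / m, b - alpha * np.sum(err) / m
        history.append(np.sum((X @ w + b - y) ** 2) / (2 * m))   # cost J per iteration
    plt.plot(history, label=f"alpha={alpha}")

plt.xlabel("iteration")
plt.ylabel("cost J")
plt.legend()
plt.show()
# J should decrease on every iteration; if it oscillates or grows, lower the learning rate
```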