Machine Learning (1) — Linear Regression Model

model 1 — linear regression

1 How to find suitable parameters via the cost function

1. steps

  1. find a suitable model
  2. define the cost function
  3. minimize the cost function via gradient descent

2. gradient descent

take the model $f(x) = wx + b$ as an example

  1. cost function:

$$J(w, b) = \frac{1}{2m} \sum_{i=1}^{m}\left((wx^{(i)}+b) - y^{(i)}\right)^2$$

  2. gradient descent (a code sketch of the cost and gradient computation follows these formulas)
    • $\alpha$ is the learning rate
    • $J(w, b)$ is the cost function

$$
w = w - \alpha \frac{\partial}{\partial w}J(w, b) \\
b = b - \alpha \frac{\partial}{\partial b}J(w, b)
$$
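
a minimal sketch in Python of how this cost and its partial derivatives could be computed; NumPy and the array names `x`, `y` are assumptions for illustration:

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Cost J(w, b) = 1/(2m) * sum((w*x_i + b - y_i)^2)."""
    m = x.shape[0]
    errors = (w * x + b) - y
    return np.sum(errors ** 2) / (2 * m)

def compute_gradients(x, y, w, b):
    """Partial derivatives of J(w, b) with respect to w and b."""
    m = x.shape[0]
    errors = (w * x + b) - y
    dj_dw = np.sum(errors * x) / m
    dj_db = np.sum(errors) / m
    return dj_dw, dj_db

# tiny usage example with made-up data
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(compute_cost(x, y, w=0.0, b=0.0))       # cost at the starting point
print(compute_gradients(x, y, w=0.0, b=0.0))  # (dJ/dw, dJ/db)
```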

3. notice

when implementing this in code, make sure w and b are updated simultaneously, i.e. both new values are computed from the original w and b before either is overwritten

$$
tmp\_w = w - \alpha \frac{\partial}{\partial w}J(w, b) \\
tmp\_b = b - \alpha \frac{\partial}{\partial b}J(w, b) \\
w = tmp\_w \\
b = tmp\_b
$$
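
a minimal sketch of this simultaneous update in Python; the data, starting point, learning rate, and iteration count in the usage example are made up for illustration:

```python
import numpy as np

def gradient_descent(x, y, w, b, alpha, num_iters):
    """Run gradient descent for f(x) = w*x + b with simultaneous updates."""
    m = x.shape[0]
    for _ in range(num_iters):
        errors = (w * x + b) - y
        # compute both updates from the *current* w and b ...
        tmp_w = w - alpha * np.sum(errors * x) / m
        tmp_b = b - alpha * np.sum(errors) / m
        # ... then assign them together
        w, b = tmp_w, tmp_b
    return w, b

# example usage with made-up data
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
w, b = gradient_descent(x, y, w=0.0, b=0.0, alpha=0.01, num_iters=1000)
```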

4. learning rate

choosing a suitable learning rate is quite important; a small comparison sketch follows the list below

  1. $\alpha$ is too small: convergence is too slow
  2. $\alpha$ is too large: the updates overshoot the minimum, and the cost may even diverge
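
one way to see both failure modes is to run the same update loop with several values of $\alpha$ and compare the resulting cost; the data and the specific $\alpha$ values below are made up for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
m = x.shape[0]

for alpha in (0.0001, 0.1, 0.3):   # too small / reasonable / too large for this data
    w, b = 0.0, 0.0
    for _ in range(200):
        errors = (w * x + b) - y
        w, b = w - alpha * np.sum(errors * x) / m, b - alpha * np.sum(errors) / m
    cost = np.sum(((w * x + b) - y) ** 2) / (2 * m)
    print(f"alpha={alpha}: final cost {cost:.4g}")   # huge cost indicates divergence
```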

5. gradient descent for multiple variables

take $f_{\vec{w}, b}(\vec{x}) = \vec{w} \cdot \vec{x} + b$ as an example

  1. cost function

$$J(\vec{w}, b) = \frac{1}{2m} \sum_{i=1}^{m}\left(f_{\vec{w}, b}(\vec{x}^{(i)}) - y^{(i)}\right)^2$$

  2. gradient descent

$$
w_j = w_j - \alpha \frac{\partial}{\partial w_j}J(\vec{w}, b) \\
b = b - \alpha \frac{\partial}{\partial b}J(\vec{w}, b)
$$

  3. formula for code (see the sketch after this list)

$$
w_j = w_j - \alpha \frac{1}{m} \sum_{i=1}^{m}\left((\vec{w} \cdot \vec{x}^{(i)}+b) - y^{(i)}\right)x_j^{(i)} \\
b = b - \alpha \frac{1}{m} \sum_{i=1}^{m}\left((\vec{w} \cdot \vec{x}^{(i)}+b) - y^{(i)}\right)
$$
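
a sketch of these update formulas vectorized with NumPy, assuming `X` is an (m, n) matrix holding one training example per row; the data in the usage example is made up:

```python
import numpy as np

def gradient_descent_multi(X, y, w, b, alpha, num_iters):
    """Gradient descent for f(x) = w . x + b with an (m, n) feature matrix X."""
    m = X.shape[0]
    for _ in range(num_iters):
        errors = X @ w + b - y            # shape (m,)
        dj_dw = (X.T @ errors) / m        # shape (n,): one partial derivative per w_j
        dj_db = np.sum(errors) / m
        w = w - alpha * dj_dw             # simultaneous update: the gradients were
        b = b - alpha * dj_db             # computed before either parameter changed
    return w, b

# example usage with made-up data: 3 examples, 2 features
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])
y = np.array([5.0, 4.5, 8.0])
w, b = gradient_descent_multi(X, y, w=np.zeros(2), b=0.0, alpha=0.1, num_iters=2000)
```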

2 Accelerate convergence through feature scaling

take $300 \le x_1 \le 2000,\ 0 \le x_2 \le 5$ as an example; a code sketch of all three methods follows the list below

1. max normalization
$$
x_{1, scaling} = \frac{x_1}{2000} \\
x_{2, scaling} = \frac{x_2}{5}
$$
2. mean normalization

$\mu$ is the mean

$$
x_{1, scaling} = \frac{x_1 - \mu_1}{2000-300} \\
x_{2, scaling} = \frac{x_2 - \mu_2}{5-0}
$$

3. Z-score normalization

each feature is transformed to have zero mean and unit variance, i.e. approximately a standard normal distribution

$\mu$ is the mean

$\sigma$ is the standard deviation

$$
x_{1, scaling} = \frac{x_1 - \mu_1}{\sigma_1} \\
x_{2, scaling} = \frac{x_2 - \mu_2}{\sigma_2}
$$
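
a sketch applying the three scaling methods column-wise with NumPy; the feature matrix `X` below is made-up data in the ranges from the example:

```python
import numpy as np

X = np.array([[1500.0, 3.0],
              [ 800.0, 1.0],
              [1900.0, 4.0]])     # made-up data: x1 in [300, 2000], x2 in [0, 5]

# 1. max normalization: divide each feature by its maximum
X_max = X / X.max(axis=0)

# 2. mean normalization: subtract the mean, divide by the range (max - min)
mu = X.mean(axis=0)
X_mean = (X - mu) / (X.max(axis=0) - X.min(axis=0))

# 3. Z-score normalization: subtract the mean, divide by the standard deviation
sigma = X.std(axis=0)
X_zscore = (X - mu) / sigma
```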

3 Tips for linear regression

  1. plot the learning curve to check convergence (see the sketch below)
  2. set a suitable learning rate; try values such as 0.001, 0.003, 0.01, 0.03, ...
  3. conduct feature engineering and choose suitable features
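
for tip 1, a minimal sketch that records the cost at every iteration and plots it against the iteration number; matplotlib is assumed to be available, and the data is made up:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
m = x.shape[0]

w, b, alpha = 0.0, 0.0, 0.1
history = []
for _ in range(200):
    errors = (w * x + b) - y
    w, b = w - alpha * np.sum(errors * x) / m, b - alpha * np.sum(errors) / m
    history.append(np.sum(((w * x + b) - y) ** 2) / (2 * m))

plt.plot(history)                 # the curve should decrease and flatten out
plt.xlabel("iteration")
plt.ylabel("cost J(w, b)")
plt.show()
```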