Machine Learning: Regression

Applications of regression: stock market forecasting, self-driving cars, recommendation systems. Its role: select a function whose output value is the prediction.

Step 1: Model

A model is a set of candidate functions. A linear model looks like:

$$y = b + w \cdot x_i$$

($b$ and $w$ are parameters: $x_i$ is a feature, $w$ is the weight, and $b$ is the bias.)
Superscripts are used to index the examples (which object), and subscripts are used to index the features (which property of the object).
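
As a minimal sketch (the parameter values and the input are made up for illustration), a linear model is just one multiply and one add:

```python
def linear_model(x, w, b):
    """Predict y from a single feature x with weight w and bias b."""
    return b + w * x

# Example: with w = 2.0 and b = 1.0, an input of x = 3.0 predicts y = 7.0.
print(linear_model(3.0, w=2.0, b=1.0))  # 7.0
```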

Step 2: Goodness of Function

The training data are real, observed data. The loss function $L$ takes a function as input and outputs how bad that function is; since the function is determined by its parameters, $L(f) = L(b, w)$. We can define the loss function as

$$L(b, w) = \sum_{n=1}^{a}\left(\hat{y}^n - (b + w \cdot x_i^n)\right)^2$$

where $a$ is the number of training examples and $\hat{y}^n$ is the true value for the $n$-th example.
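
A small sketch of this squared-error loss, assuming the training data are given as plain Python lists `xs` (features $x_i^n$) and `ys` (true values $\hat{y}^n$); the numbers are made up:

```python
def loss(w, b, xs, ys):
    """Sum of squared errors between true values ys and predictions b + w * x."""
    return sum((y_hat - (b + w * x)) ** 2 for x, y_hat in zip(xs, ys))

# Hypothetical data roughly following y = 1 + 2x, plus a bit of noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 9.0]
print(loss(w=2.0, b=1.0, xs=xs, ys=ys))  # about 0.06: this (w, b) fits well
```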

Step 3: Best Function
$$f^* = \arg\min_f L(f)$$

$$w^*, b^* = \arg\min_{w,b} L(w, b) = \arg\min_{w,b} \sum_{n=1}^{a}\left(\hat{y}^n - (b + w \cdot x_i^n)\right)^2$$

What is gradient descent?

Assume there is only one parameter $w$. Pick a random initial value $w^0$ and differentiate the loss at that point:

$$\left.\frac{dL}{dw}\right|_{w = w^0}$$

But by how much should we increase or decrease $w$? The step is

$$-\eta \left.\frac{dL}{dw}\right|_{w = w^0}$$

so it depends on both the derivative and $\eta$, the learning rate:

$$w^1 = w^0 - \eta \left.\frac{dL}{dw}\right|_{w = w^0}$$

Then we keep repeating the update. In general this gives a local optimum, NOT the global optimum!
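
A tiny sketch of the update rule on a toy one-parameter loss $L(w) = (w-3)^2$ (chosen only for illustration; its derivative is $2(w-3)$):

```python
# Gradient descent on a toy loss L(w) = (w - 3)^2, whose derivative is 2*(w - 3).
eta = 0.1   # learning rate
w = 0.0     # initial value w^0
for step in range(100):
    grad = 2 * (w - 3)   # dL/dw evaluated at the current w
    w = w - eta * grad   # w^{t+1} = w^t - eta * dL/dw
print(w)  # close to 3.0, the minimum of this toy loss
```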

How about two parameters?

Essentially the same as with one parameter; we just differentiate twice, once per parameter. What is the gradient? It collects the two partial derivatives in a vector:

$$\nabla L = \begin{bmatrix} \frac{\partial L}{\partial w} \\[4pt] \frac{\partial L}{\partial b} \end{bmatrix}$$

You might worry that we are just trying our luck, but in linear regression the loss is convex, so there is no bad local optimum. The partial derivatives are:

$$\frac{\partial L}{\partial w} = \sum_{n=1}^{a} 2\left(\hat{y}^n - (b + w \cdot x_i^n)\right)(-x_i^n)$$

$$\frac{\partial L}{\partial b} = \sum_{n=1}^{a} 2\left(\hat{y}^n - (b + w \cdot x_i^n)\right)(-1)$$
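
Putting the two partial derivatives together, here is a sketch of gradient descent for $w$ and $b$ on the same hypothetical `xs`, `ys` data as above (learning rate and iteration count are arbitrary choices that happen to converge here):

```python
def gradients(w, b, xs, ys):
    """Partial derivatives of the squared-error loss with respect to w and b."""
    dw = sum(2 * (y_hat - (b + w * x)) * (-x) for x, y_hat in zip(xs, ys))
    db = sum(2 * (y_hat - (b + w * x)) * (-1) for x, y_hat in zip(xs, ys))
    return dw, db

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 9.0]
w, b, eta = 0.0, 0.0, 0.01
for step in range(10000):
    dw, db = gradients(w, b, xs, ys)
    w, b = w - eta * dw, b - eta * db
print(w, b)  # roughly w ≈ 2.0, b ≈ 1.05 for this toy data
```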

How’s the result?

First, gradient descent gives us (locally) optimal parameter values. Then we can calculate the error.

What is ‘error’?

The average of the distances between the data points and the curve; summing the per-point errors $e^n$ gives the total:

$$\sum_{n=1}^{a} e^n$$

In fact, we don’t really care about the error on the training data; we care about generalization.

What is ‘Generalization’?

Applying the current model to new data and seeing how far its outputs are from the real values, which will never match exactly.

What we really care about is the error on the testing data. If the average error on the testing data is much larger than the average error on the training data, we need to change the model.
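
A sketch of comparing the average error on training versus testing data; the train/test split and all numbers are made up for illustration:

```python
def average_error(w, b, xs, ys):
    """Average absolute distance between the data and the fitted line."""
    return sum(abs(y_hat - (b + w * x)) for x, y_hat in zip(xs, ys)) / len(xs)

train_x, train_y = [1.0, 2.0, 3.0, 4.0], [3.1, 4.9, 7.2, 9.0]
test_x,  test_y  = [5.0, 6.0],           [10.8, 13.3]

w, b = 2.0, 1.0  # parameters found on the training data
print("training error:", average_error(w, b, train_x, train_y))
print("testing error: ", average_error(w, b, test_x, test_y))
# If the testing error is much larger than the training error, rethink the model.
```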

How can we do better?

We can try a more complex model, like

$$y = b + w_1 \cdot x_i + w_2 \cdot (x_i)^2$$

or

$$y = b + w_1 \cdot x_i + w_2 \cdot (x_i)^2 + w_3 \cdot (x_i)^3$$

If using a more complex model makes the error on the test data larger, it may be necessary to reduce the complexity. In other words, if the curve does not match the actual situation, you need to modify the model.
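
As a sketch, such polynomial models can be fitted with `numpy.polyfit` (the toy data below are invented); higher degrees drive the training error down but may hurt the testing error:

```python
import numpy as np

train_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
train_y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])  # roughly linear toy data

for degree in (1, 2, 3):
    coeffs = np.polyfit(train_x, train_y, deg=degree)  # fit y = b + w1*x + ... + w_d*x^d
    preds = np.polyval(coeffs, train_x)
    train_err = np.mean((train_y - preds) ** 2)
    print(f"degree {degree}: training error {train_err:.4f}")
# Training error never goes up as the degree increases, but the testing error can.
```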

In fact, the higher the complexity of the model, the lower the error on the training data. But there is a phenomenon called ‘overfitting’.

What is ‘Overfitting’?

A more complex model does not always lead to better performance on testing data.

Let’s collect more data!

We may find that our previous model is inadequate: there are hidden factors it does not take into account.

Back to Step 1: Redesign the Model

If a feature is not a number (it is categorical), we cannot simply add it into the sum; instead we use indicator (delta) functions to split the model into different cases, like

$$y = b_1 \cdot \delta(x_j = \text{hello}) + w_1 \cdot \delta(x_j = \text{hello}) \cdot x_i + b_2 \cdot \delta(x_j = \text{world}) + w_2 \cdot \delta(x_j = \text{world}) \cdot x_i$$
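
A sketch of the same idea in code: the delta terms act like if/else branches that switch between two linear models depending on the categorical feature $x_j$ (the category names "hello"/"world" come from the formula above; the parameter values are placeholders):

```python
def delta(condition):
    """Indicator function: 1 if the condition holds, 0 otherwise."""
    return 1.0 if condition else 0.0

def piecewise_linear(x_i, x_j, b1, w1, b2, w2):
    """A different linear model depending on the categorical feature x_j."""
    return (b1 * delta(x_j == "hello") + w1 * delta(x_j == "hello") * x_i +
            b2 * delta(x_j == "world") + w2 * delta(x_j == "world") * x_i)

# Example with made-up parameters: each category gets its own slope and bias.
print(piecewise_linear(3.0, "hello", b1=1.0, w1=2.0, b2=0.5, w2=4.0))  # 7.0
print(piecewise_linear(3.0, "world", b1=1.0, w1=2.0, b2=0.5, w2=4.0))  # 12.5
```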

Linear Model

All of the models above are still linear models, because the output is linear in the parameters $w_i$:

$$y = b + \sum_i w_i x_i$$
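
A sketch of this general form with several features (the feature values and weights below are made up):

```python
# General linear model: y = b + sum_i w_i * x_i, for a feature vector x.
def linear_model_multi(x, w, b):
    """Prediction for a list of features x with matching weights w and bias b."""
    return b + sum(w_i * x_i for w_i, x_i in zip(w, x))

# Hypothetical example with three features.
print(linear_model_multi(x=[1.0, 0.5, 2.0], w=[2.0, -1.0, 0.3], b=1.0))  # 3.1
```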

Other hidden factors

In fact, you can add every possibly related factor to the model by simply adding more parameters, and this can give a better fit. Just as with one factor, we can get a low training error, but the complexity increases and the model may overfit.

Back to Step 2: Regularization

A better loss function is

$$L = \sum_n \left(\hat{y}^n - \left(b + \sum_i w_i x_i^n\right)\right)^2 + \lambda \sum_i (w_i)^2$$

This means that smaller $w_i$ are preferred. Smaller weights make the function smoother, i.e. less sensitive to changes in the input, so it can better resist noise at testing time. The larger $\lambda$ is, the smoother the selected model. The more we weight smoothness, the larger the training error may become, but the testing error may become smaller.
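
A sketch of the regularized loss, assuming a single feature for simplicity (so the penalty sum has only one term); the data are the same made-up `xs`, `ys` as before, and larger `lam` penalizes large weights more strongly:

```python
def regularized_loss(w, b, xs, ys, lam):
    """Squared-error loss plus an L2 penalty lam * w^2 on the weight (bias not penalized)."""
    squared_error = sum((y_hat - (b + w * x)) ** 2 for x, y_hat in zip(xs, ys))
    return squared_error + lam * w ** 2

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 9.0]
for lam in (0.0, 1.0, 10.0):
    print(lam, regularized_loss(w=2.0, b=1.0, xs=xs, ys=ys, lam=lam))
# As lam grows, the same w is penalized more, pushing the minimizer toward smaller w.
```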
