Applications of regression: stock market forecasting, self-driving cars, recommendation systems. Its role: select a function whose output value is the prediction.
step 1: model
A model is a set of functions. A linear model looks like:
$$y = b + w x_i$$
($b$ and $w$ are parameters, $x_i$ is a feature, $w$ is the weight, $b$ is the bias)
The superscript marks the example number, and the subscript marks the example's properties (features).
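The linear model above can be sketched in a few lines of Python (the feature value and parameter values here are made-up numbers for illustration):

```python
def predict(x_i, w, b):
    """Linear model: y = b + w * x_i."""
    return b + w * x_i

# Made-up numbers: weight w = 2.0, bias b = 1.0, feature x_i = 3.0
y = predict(3.0, w=2.0, b=1.0)  # 1.0 + 2.0 * 3.0 = 7.0
```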
step 2: goodness of function
The training data is the real data. The loss function $L$ takes a function as input and outputs how bad it is: $L(f) = L(b, w)$. So we can define the loss function as
$$L(b, w) = \sum_{n=1}^{a}\left(\hat{y}^n - (b + w \cdot x_i^n)\right)^2$$
($a$ is the number of training examples)
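The loss above can be sketched directly; the toy data below is invented so that the true parameters are $w = 2$, $b = 1$:

```python
def loss(w, b, xs, ys):
    """L(b, w): sum of squared errors over the training examples."""
    return sum((y_hat - (b + w * x)) ** 2 for x, y_hat in zip(xs, ys))

# Toy data generated by y = 1 + 2x, so the loss at (w=2, b=1) is exactly 0
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]
best = loss(2.0, 1.0, xs, ys)   # 0.0
worse = loss(0.0, 0.0, xs, ys)  # 9 + 25 + 49 = 83.0
```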
step 3: best function
$$f^* = \arg\min_f L(f)$$
$$w^*, b^* = \arg\min_{w,b} L(w,b) = \arg\min_{w,b} \sum_{n=1}^{a}\left(\hat{y}^n - (b + w \cdot x_i^n)\right)^2$$
What is gradient descent?
Assume there is only one parameter. Pick a random initial value $w^0$ and differentiate the loss at that point:
$$\left.\frac{dL}{dw}\right|_{w=w^0}$$
But by how much should we increase or decrease $w$?
$$-\eta \left.\frac{dL}{dw}\right|_{w=w^0}$$
The step size depends on both the differential value and $\eta$ (the learning rate).
$$w^1 = w^0 - \eta \left.\frac{dL}{dw}\right|_{w=w^0}$$
Then repeat the update. We will reach a local optimum, NOT necessarily the global optimum!
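The one-parameter update rule can be sketched as follows; the toy loss $L(w) = (w-3)^2$ and the step settings are assumptions for illustration:

```python
def gradient_descent_1d(dLdw, w0, eta, steps):
    """Repeat w^{t+1} = w^t - eta * dL/dw until we settle near a (local) optimum."""
    w = w0
    for _ in range(steps):
        w -= eta * dLdw(w)
    return w

# Toy convex loss L(w) = (w - 3)^2 with derivative dL/dw = 2 * (w - 3)
w_final = gradient_descent_1d(lambda w: 2 * (w - 3), w0=0.0, eta=0.1, steps=100)
# w_final ends up very close to the minimizer w = 3
```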
How about two parameters?
Actually it is the same as with one parameter; just do it for each. What is the gradient? We collect the derivatives with respect to both parameters into a vector:
$$\nabla L = \begin{bmatrix} \frac{dL}{dw} \\ \frac{dL}{db} \end{bmatrix}$$
(this vector is the gradient)
But should we worry that we are just trying our luck? In linear regression there is no bad local optimum (the loss is convex). We can still write down the partial derivatives:
$$\frac{dL}{dw} = \sum_{n=1}^{a} 2\left(\hat{y}^n - (b + w \cdot x_i^n)\right)(-x_i^n)$$
$$\frac{dL}{db} = \sum_{n=1}^{a} 2\left(\hat{y}^n - (b + w \cdot x_i^n)\right)(-1)$$
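Both partial derivatives can be plugged straight into the update rule. A minimal sketch, using invented toy data generated by $y = 1 + 2x$ and an assumed learning rate:

```python
def gradient_descent(xs, ys, eta, steps):
    """Update w and b together using the two partial derivatives above."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        dLdw = sum(2 * (y - (b + w * x)) * (-x) for x, y in zip(xs, ys))
        dLdb = sum(2 * (y - (b + w * x)) * (-1) for x, y in zip(xs, ys))
        w -= eta * dLdw
        b -= eta * dLdb
    return w, b

# Toy data from y = 1 + 2x; gradient descent recovers w ≈ 2 and b ≈ 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = gradient_descent(xs, ys, eta=0.01, steps=5000)
```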
How’s the result?
First, we will get locally optimal parameter values. Then we can calculate the error.
What is ‘error’?
The sum of the distances between each data point and the curve (often averaged over the $a$ examples):
$$\sum_{n=1}^{a} e^n$$
In fact, we don't care much about the error on the training data. We care about generalization.
What is ‘Generalization’?
Using the current model on new inputs gives output values that are not exactly the same as the real values.
What we really care about is the error on the testing data. If the average error on the test data is greater than the average error on the training data, we need to change the model.
How can we do better?
We need a more complex model, such as
$$y = b + w_1 \cdot x_i + w_2 \cdot (x_i)^2$$
or
$$y = b + w_1 \cdot x_i + w_2 \cdot (x_i)^2 + w_3 \cdot (x_i)^3$$
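These polynomial models can be sketched with one helper; the parameter values are made-up numbers:

```python
def predict_poly(x_i, b, ws):
    """y = b + w_1*x_i + w_2*(x_i)^2 + ... ; more terms mean a more complex model."""
    return b + sum(w * x_i ** (k + 1) for k, w in enumerate(ws))

# Quadratic and cubic variants of the same feature x_i = 2.0
y2 = predict_poly(2.0, b=1.0, ws=[3.0, 0.5])        # 1 + 3*2 + 0.5*4 = 9.0
y3 = predict_poly(2.0, b=1.0, ws=[3.0, 0.5, 0.25])  # 9.0 + 0.25*8   = 11.0
```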
If a more complex model instead makes the error on the test data larger, it may be necessary to reduce the complexity. In other words, when the curve does not match the actual situation, you need to modify the model.
In fact, the higher the complexity of the model, the lower the error on the training data. There is a word for this: 'overfitting'.
What is ‘Overfitting’?
A more complex model does not always lead to better performance on testing data.
Let’s collect more data!
We may find that our previous model is useless: there are hidden factors not considered in the previous model.
Back to step 1: Redesign the Model
If a feature is not a number, we cannot multiply it by a weight directly; instead we use an indicator (delta) function to handle the different conditions, like
$$y = b_1 \cdot \delta(x_j = \text{hello}) + w_1 \cdot \delta(x_j = \text{hello}) \cdot x_i + b_2 \cdot \delta(x_j = \text{world}) + w_2 \cdot \delta(x_j = \text{world}) \cdot x_i$$
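The delta terms act as switches that select a per-category line. A minimal sketch, with made-up parameter values:

```python
def delta(condition):
    """Indicator: 1 when the condition holds, otherwise 0."""
    return 1.0 if condition else 0.0

def predict_by_category(x_i, x_j, b1, w1, b2, w2):
    """Each category gets its own line; the delta terms pick the right one."""
    return (b1 * delta(x_j == "hello") + w1 * delta(x_j == "hello") * x_i
            + b2 * delta(x_j == "world") + w2 * delta(x_j == "world") * x_i)

# x_j = "hello" selects b1 + w1 * x_i; x_j = "world" selects b2 + w2 * x_i
y_hello = predict_by_category(2.0, "hello", b1=1.0, w1=3.0, b2=-1.0, w2=0.5)  # 1 + 3*2   = 7.0
y_world = predict_by_category(2.0, "world", b1=1.0, w1=3.0, b2=-1.0, w2=0.5)  # -1 + 0.5*2 = 0.0
```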
Linear Model
$$y = b + \sum w_i x_i$$
Other hidden factors
In fact, you can add all possibly related factors to the model; with more parameters it can fit better, the same as with one factor. We get a lower training error, but the complexity increases and the model may overfit.
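The multi-factor linear model is the same dot product extended over every factor we include; the three factors and their weights below are invented for illustration:

```python
def predict_multi(xs, ws, b):
    """y = b + sum of w_i * x_i over every factor we decide to include."""
    return b + sum(w * x for w, x in zip(ws, xs))

# Three made-up factors with made-up weights
y = predict_multi([1.0, 2.0, 3.0], ws=[0.5, -1.0, 2.0], b=0.1)
# 0.1 + 0.5 - 2.0 + 6.0 = 4.6
```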
Back to step 2: Regularization
A better loss function is
$$L = \sum_n\left(\hat{y}^n - (b + \sum w_i x_i)\right)^2 + \lambda \sum (w_i)^2$$
This means that smaller $w_i$ are better. It makes our function smooth: the function becomes less sensitive to its inputs and can defend against noise at test time. The bigger $\lambda$, the smoother the selected model. The more we weight smoothness, the larger the training error may be, but the test error may be smaller.
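The regularized loss can be sketched directly; note that, as in the formula, only the weights are penalized, not the bias. The toy data here is invented so the error term is exactly zero:

```python
def regularized_loss(ws, b, xs_list, ys, lam):
    """Squared error plus lambda * sum of w_i^2 (the bias b is not penalized)."""
    error = sum((y - (b + sum(w * x for w, x in zip(ws, xs)))) ** 2
                for xs, y in zip(xs_list, ys))
    return error + lam * sum(w * w for w in ws)

# One feature, perfect fit (y = 1 + 2x): only the penalty term remains
L_small = regularized_loss([2.0], 1.0, [[1.0], [2.0]], [3.0, 5.0], lam=0.5)  # 0.5 * 4 = 2.0
L_zero  = regularized_loss([2.0], 1.0, [[1.0], [2.0]], [3.0, 5.0], lam=0.0)  # 0.0
```

A larger `lam` pushes the optimizer toward smaller weights, trading a bit of training error for a smoother function.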