Applications of regression: stock market forecasting, self-driving cars, recommendation systems. Its role: select a function whose output value is the prediction.
step 1: model
A model is a set of functions. A linear model looks like:
$$y = b + w x_i$$
($b$ and $w$ are parameters, $x_i$ is a feature, $w$ is the weight, $b$ is the bias)
The superscript marks the example number, and the subscript marks the example's properties (features).
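The linear model above can be sketched in a few lines of Python (the feature value and parameter values here are made-up numbers for illustration):

```python
def predict(x_i, w, b):
    """Linear model: y = b + w * x_i."""
    return b + w * x_i

# Made-up numbers: weight w = 2.0, bias b = 1.0, feature x_i = 3.0
y = predict(3.0, w=2.0, b=1.0)  # 1.0 + 2.0 * 3.0 = 7.0
```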
step 2: goodness of function
The training data is the real data. The loss function $L$ takes a function as input and outputs how bad it is: $L(f) = L(b, w)$. So we can define the loss function as
$$L(b, w) = \sum_{n=1}^{a}\left(\hat{y}^n - (b + w \cdot x_i^n)\right)^2$$
($a$ is the number of training examples)
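The loss above can be sketched directly; the toy data below is invented so that the true parameters are $w = 2$, $b = 1$:

```python
def loss(w, b, xs, ys):
    """L(b, w): sum of squared errors over the training examples."""
    return sum((y_hat - (b + w * x)) ** 2 for x, y_hat in zip(xs, ys))

# Toy data generated by y = 1 + 2x, so the loss at (w=2, b=1) is exactly 0
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]
best = loss(2.0, 1.0, xs, ys)   # 0.0
worse = loss(0.0, 0.0, xs, ys)  # 9 + 25 + 49 = 83.0
```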
step 3: best function
$$f^* = \arg\min_f L(f)$$
$$w^*, b^* = \arg\min_{w,b} L(w,b) = \arg\min_{w,b} \sum_{n=1}^{a}\left(\hat{y}^n - (b + w \cdot x_i^n)\right)^2$$
What is gradient descent?
Assume there is only one parameter. Pick a random initial value $w^0$ and differentiate the loss at that point:
$$\left.\frac{dL}{dw}\right|_{w=w^0}$$
But by how much should we increase or decrease $w$?
$$-\eta \left.\frac{dL}{dw}\right|_{w=w^0}$$
The step size depends on both the differential value and $\eta$ (the learning rate).
$$w^1 = w^0 - \eta \left.\frac{dL}{dw}\right|_{w=w^0}$$
Then repeat the update. We will reach a local optimum, NOT necessarily the global optimum!
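The one-parameter update rule can be sketched as follows; the toy loss $L(w) = (w-3)^2$ and the step settings are assumptions for illustration:

```python
def gradient_descent_1d(dLdw, w0, eta, steps):
    """Repeat w^{t+1} = w^t - eta * dL/dw until we settle near a (local) optimum."""
    w = w0
    for _ in range(steps):
        w -= eta * dLdw(w)
    return w

# Toy convex loss L(w) = (w - 3)^2 with derivative dL/dw = 2 * (w - 3)
w_final = gradient_descent_1d(lambda w: 2 * (w - 3), w0=0.0, eta=0.1, steps=100)
# w_final ends up very close to the minimizer w = 3
```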
How about two parameters?
Actually it is the same as with one parameter; just do it for each. What is the gradient? We collect the derivatives with respect to both parameters into a vector:
$$\nabla L = \begin{bmatrix} \frac{dL}{dw} \\ \frac{dL}{db} \end{bmatrix}$$
(this vector is the gradient)
But should we worry that we are just trying our luck? In linear regression there is no bad local optimum (the loss is convex). We can still write down the partial derivatives:
$$\frac{dL}{dw} = \sum_{n=1}^{a} 2\left(\hat{y}^n - (b + w \cdot x_i^n)\right)(-x_i^n)$$
$$\frac{dL}{db} = \sum_{n=1}^{a} 2\left(\hat{y}^n - (b + w \cdot x_i^n)\right)(-1)$$
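Both partial derivatives can be plugged straight into the update rule. A minimal sketch, using invented toy data generated by $y = 1 + 2x$ and an assumed learning rate:

```python
def gradient_descent(xs, ys, eta, steps):
    """Update w and b together using the two partial derivatives above."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        dLdw = sum(2 * (y - (b + w * x)) * (-x) for x, y in zip(xs, ys))
        dLdb = sum(2 * (y - (b + w * x)) * (-1) for x, y in zip(xs, ys))
        w -= eta * dLdw
        b -= eta * dLdb
    return w, b

# Toy data from y = 1 + 2x; gradient descent recovers w ≈ 2 and b ≈ 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = gradient_descent(xs, ys, eta=0.01, steps=5000)
```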
How’s the result?
First, we will get locally optimal parameter values. Then we can calculate the error.
What is ‘error’?
The sum of the distances between each data point and the curve (often averaged over the $a$ examples):
$$\sum_{n=1}^{a} e^n$$
In fact, we don't care much about the error on the training data. We care about generalization.
What is ‘Generalization’?
Using the current model on new inputs gives output values that are not exactly the same as the real values.
What we really care about is the error on the testing data. If the average error on the test data is greater than the average error on the training data, we need to change the model.
How can we do better?
We need a more complex model, such as
$$y = b + w_1 \cdot x_i + w_2 \cdot (x_i)^2$$
or
$$y = b + w_1 \cdot x_i + w_2 \cdot (x_i)^2 + w_3 \cdot (x_i)^3$$
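These polynomial models can be sketched with one helper; the parameter values are made-up numbers:

```python
def predict_poly(x_i, b, ws):
    """y = b + w_1*x_i + w_2*(x_i)^2 + ... ; more terms mean a more complex model."""
    return b + sum(w * x_i ** (k + 1) for k, w in enumerate(ws))

# Quadratic and cubic variants of the same feature x_i = 2.0
y2 = predict_poly(2.0, b=1.0, ws=[3.0, 0.5])        # 1 + 3*2 + 0.5*4 = 9.0
y3 = predict_poly(2.0, b=1.0, ws=[3.0, 0.5, 0.25])  # 9.0 + 0.25*8   = 11.0
```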
If a more complex model instead makes the error on the test data larger, it may be necessary to reduce the complexity. In other words, when the curve does not match the actual situation, you need to modify the model.
In fact, the higher the complexity of the model, the lower the error on the training data. There is a word for this: 'overfitting'.
What is ‘Overfitting’?
A more complex model does not always lead to better performance on testing data.
Let’s collect more data!
We may find that our previous model is useless: there are hidden factors not considered in the previous model.
Back to step 1: Redesign the Model
If a feature is not a number, we cannot multiply it by a weight directly; instead we use an indicator (delta) function to handle the different conditions, like
$$y = b_1 \cdot \delta(x_j = \text{hello}) + w_1 \cdot \delta(x_j = \text{hello}) \cdot x_i + b_2 \cdot \delta(x_j = \text{world}) + w_2 \cdot \delta(x_j = \text{world}) \cdot x_i$$
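The delta terms act as switches that select a per-category line. A minimal sketch, with made-up parameter values:

```python
def delta(condition):
    """Indicator: 1 when the condition holds, otherwise 0."""
    return 1.0 if condition else 0.0

def predict_by_category(x_i, x_j, b1, w1, b2, w2):
    """Each category gets its own line; the delta terms pick the right one."""
    return (b1 * delta(x_j == "hello") + w1 * delta(x_j == "hello") * x_i
            + b2 * delta(x_j == "world") + w2 * delta(x_j == "world") * x_i)

# x_j = "hello" selects b1 + w1 * x_i; x_j = "world" selects b2 + w2 * x_i
y_hello = predict_by_category(2.0, "hello", b1=1.0, w1=3.0, b2=-1.0, w2=0.5)  # 1 + 3*2   = 7.0
y_world = predict_by_category(2.0, "world", b1=1.0, w1=3.0, b2=-1.0, w2=0.5)  # -1 + 0.5*2 = 0.0
```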
Linear Model
$$y = b + \sum w_i x_i$$
Other hidden factors
In fact, you can add all possibly related factors to the model; with more parameters it can fit better, the same as with one factor. We get a lower training error, but the complexity increases and the model may overfit.
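The multi-factor linear model is the same dot product extended over every factor we include; the three factors and their weights below are invented for illustration:

```python
def predict_multi(xs, ws, b):
    """y = b + sum of w_i * x_i over every factor we decide to include."""
    return b + sum(w * x for w, x in zip(ws, xs))

# Three made-up factors with made-up weights
y = predict_multi([1.0, 2.0, 3.0], ws=[0.5, -1.0, 2.0], b=0.1)
# 0.1 + 0.5 - 2.0 + 6.0 = 4.6
```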
Back to step 2: Regularization
A better loss function is
$$L = \sum_n\left(\hat{y}^n - (b + \sum w_i x_i)\right)^2 + \lambda \sum (w_i)^2$$
This means that smaller $w_i$ are better. It makes our function smooth: the function becomes less sensitive to its inputs and can defend against noise at test time. The bigger $\lambda$, the smoother the selected model. The more we weight smoothness, the larger the training error may be, but the test error may be smaller.
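The regularized loss can be sketched directly; note that, as in the formula, only the weights are penalized, not the bias. The toy data here is invented so the error term is exactly zero:

```python
def regularized_loss(ws, b, xs_list, ys, lam):
    """Squared error plus lambda * sum of w_i^2 (the bias b is not penalized)."""
    error = sum((y - (b + sum(w * x for w, x in zip(ws, xs)))) ** 2
                for xs, y in zip(xs_list, ys))
    return error + lam * sum(w * w for w in ws)

# One feature, perfect fit (y = 1 + 2x): only the penalty term remains
L_small = regularized_loss([2.0], 1.0, [[1.0], [2.0]], [3.0, 5.0], lam=0.5)  # 0.5 * 4 = 2.0
L_zero  = regularized_loss([2.0], 1.0, [[1.0], [2.0]], [3.0, 5.0], lam=0.0)  # 0.0
```

A larger `lam` pushes the optimizer toward smaller weights, trading a bit of training error for a smoother function.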