Regression Is All You Need
Author: Bobby (Zhuoran) Peng
People have created and applied a bewildering variety of algorithms to predict future data across industries. However, the most prevalent, and perhaps the easiest-to-interpret and best-proven, algorithm remains linear regression. The following figure, which shows linear regression to be the most widely used algorithm, also suggests that it is one of the foundations of supervised machine learning and beyond.
What is Linear Regression?
A linear regression involves two variables: the dependent variable and the independent variable. The dependent variable is what we want to predict, and its value depends on changes in the independent variable.
$$y = \alpha_0 + \alpha_1 x$$
In the above model, we use $x$ to predict the value of $y$.
The above scatter plot is an example of a linear regression model. Here we obtain a fitted line of

$$y = -15.69 + 9.72x$$

If we then have an $x$ value, we can use the equation to predict a $y$ value.
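As a quick illustration, here is a minimal NumPy sketch of this fit-and-predict workflow. The data are synthetic and only roughly follow the figure's line, so the recovered coefficients will be close to, but not exactly, $-15.69$ and $9.72$:

```python
import numpy as np

# Synthetic data roughly following the article's fitted line, plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = -15.69 + 9.72 * x + rng.normal(0, 3.0, size=50)

# Fit a degree-1 polynomial; np.polyfit returns [alpha_1, alpha_0].
alpha_1, alpha_0 = np.polyfit(x, y, deg=1)
print(f"fitted line: y = {alpha_0:.2f} + {alpha_1:.2f}x")

# Use the fitted equation to predict y for a new x value.
x_new = 5.0
print(f"prediction at x = {x_new}: {alpha_0 + alpha_1 * x_new:.2f}")
```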
We can also use matrices to express the linear relationship. Let $X$ be a $d \times n$ matrix ($n$ samples and $d$ features), let $w$ be a $d \times 1$ matrix containing the coefficients, and let $\bar{Y}$ be the output matrix of size $n \times 1$. Then we have:

$$X^T w = \bar{Y}$$
Data in the real world look just like this scatter plot: our linear regression model is only a prediction. We use $\bar{y}$ to denote the model's predicted values on the blue fitted line, while the real data are the black dots scattered across the figure. The difference between the predicted $\bar{y}$ and the real $y$ value is called the error, commonly denoted $e$, which gives the following equation:

$$y = \alpha_0 + \alpha_1 x + e \quad \text{or} \quad Y = X^T w + e$$
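To make the matrix form concrete, here is a small sketch that keeps the convention above ($X$ is $d \times n$, with samples as columns and a row of ones for the intercept); the observed values are made up for illustration:

```python
import numpy as np

# X is d x n: first row of ones for the intercept, second row the x_i values.
X = np.array([[1.0, 1.0, 1.0, 1.0],
              [2.0, 4.0, 6.0, 8.0]])
w = np.array([[-15.69],   # alpha_0
              [9.72]])    # alpha_1

Y_bar = X.T @ w                                # predictions, shape n x 1
Y = np.array([[4.0], [23.5], [42.0], [62.1]])  # observed values (made up)
e = Y - Y_bar                                  # the error term in Y = X^T w + e
print(Y_bar.ravel(), e.ravel())
```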
The Error Function
As mentioned above, the error is the distance between $y$ and $\bar{y}$, so we can derive a way to estimate the total error over the whole data set:
$$E = \sum_{i=1}^n (y_i - \bar{y_i})^2 = \sum_{i=1}^n (y_i - \alpha_0 - \alpha_1 x_i)^2$$

or, in matrix form,
$$E = \|Y - X^T w\|^2$$

This is clearly a quadratic function.
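A minimal sketch confirming that the two forms of $E$ compute the same quantity, assuming the first row of $X$ is all ones (the intercept) and the second row holds the $x_i$:

```python
import numpy as np

def sse_scalar(alpha_0, alpha_1, x, y):
    # E = sum_i (y_i - alpha_0 - alpha_1 * x_i)^2
    return np.sum((y - alpha_0 - alpha_1 * x) ** 2)

def sse_matrix(w, X, Y):
    # E = ||Y - X^T w||^2
    r = Y - X.T @ w
    return float(r.T @ r)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])
X = np.vstack([np.ones_like(x), x])  # d x n, with an intercept row
Y = y.reshape(-1, 1)
w = np.array([[0.0], [2.0]])         # candidate alpha_0, alpha_1

print(sse_scalar(0.0, 2.0, x, y))    # ~0.06
print(sse_matrix(w, X, Y))           # same value
```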
Thus, to minimize the error, we choose $\alpha_0$ and $\alpha_1$ (or the vector $w$) so that the derivatives equal zero. This gives:
$$\frac{\partial E}{\partial \alpha_0} = -2\sum_{i=1}^n (y_i - \alpha_0 - \alpha_1 x_i) = 0$$
$$\frac{\partial E}{\partial \alpha_1} = -2\sum_{i=1}^n (y_i - \alpha_0 - \alpha_1 x_i)\,x_i = 0$$
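Solving these two equations simultaneously yields the standard closed-form estimates, a textbook result stated here for completeness ($\mu_x$ and $\mu_y$ denote the sample means of $x$ and $y$; $\mu$ is used instead of a bar because $\bar{y}$ is reserved above for predictions):

$$\alpha_1 = \frac{\sum_{i=1}^n (x_i - \mu_x)(y_i - \mu_y)}{\sum_{i=1}^n (x_i - \mu_x)^2}, \qquad \alpha_0 = \mu_y - \alpha_1 \mu_x$$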
Alternatively, using the matrix expression:
$$E = \|Y - X^T w\|^2 = (Y - X^T w)^T (Y - X^T w) = w^T X X^T w - w^T X Y - Y^T X^T w + Y^T Y$$
Then
$$\frac{\partial E}{\partial w} = 2 X X^T w - 2 X Y = 0 \quad\Rightarrow\quad w = (X X^T)^{-1} X Y$$
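Here is a minimal NumPy sketch of this closed-form solution, cross-checked against np.linalg.lstsq. It assumes $XX^T$ is invertible; in practice, lstsq or a pseudo-inverse is numerically safer than forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(0, 10, size=n)
X = np.vstack([np.ones(n), x])   # d x n: intercept row plus one feature
Y = (-15.69 + 9.72 * x + rng.normal(0, 3.0, size=n)).reshape(-1, 1)

# Closed form: w = (X X^T)^{-1} X Y
w_closed = np.linalg.inv(X @ X.T) @ X @ Y

# Reference: solve min ||X^T w - Y||^2 directly, without forming the inverse.
w_lstsq, *_ = np.linalg.lstsq(X.T, Y, rcond=None)

print(w_closed.ravel())  # close to [-15.69, 9.72]
print(w_lstsq.ravel())   # matches the closed form
```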
Correlation is not Causality
We now know how to fit a linear regression to predict $y$ from $x$. In real-world practice, however, two variables may have a strong correlation in the linear model without $x$ being the cause of $y$. For example, in a search engine, items of higher ranking receive higher click rates, as the following figure shows.
You may find a linear relationship between ranking and click rate, but this does not lead to the conclusion that users like higher-ranked items more than lower-ranked ones. This is the typical position bias in recommender systems, and causality-inspired machine learning problems are receiving increasing attention these days, aiming to find the real causes behind linear or nonlinear correlations.
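As an entirely synthetic illustration of this point, the sketch below assumes every item is equally likable and that clicks are driven only by an assumed $1/k$ position-bias examination curve; a regression of click rate on rank still finds a strong slope:

```python
import numpy as np

rng = np.random.default_rng(2)
ranks = np.arange(1, 11)

# Assumption: users examine position k with probability ~ 1/k (position bias),
# and every item is equally likable once examined, so clicks carry no preference.
examine_prob = 1.0 / ranks
click_prob = examine_prob * 0.5   # identical likability (0.5) for every item
n_impressions = 10_000
ctr = rng.binomial(n_impressions, click_prob) / n_impressions

slope, intercept = np.polyfit(ranks, ctr, deg=1)
print(f"CTR ~= {intercept:.3f} + {slope:.4f} * rank")  # clearly negative slope
# The regression links rank to CTR, but the cause is exposure, not preference.
```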