Basic Mathematics for Linear Regression and the Least Squares Method
Background
In statistics, linear regression is a linear approach to modeling the relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. In linear regression, the relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data. Such models are called linear models. Most commonly, the conditional mean of the response given the values of the explanatory variables is assumed to be an affine function of those values. Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of the response given the values of the predictors.
Linear regression was the first type of regression analysis to be studied rigorously, and to be used extensively in practical applications. This is because models which depend linearly on their unknown parameters are easier to fit than models which are non-linearly related to their parameters and because the statistical properties of the resulting estimators are easier to determine.
Mathematical Approach
Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways, such as by minimizing the “lack of fit” in some other norm (as with least absolute deviations regression), or by minimizing a penalized version of the least squares cost function as in ridge regression (L2-norm penalty) and lasso (L1-norm penalty). Conversely, the least squares approach can be used to fit models that are not linear models. Thus, although the terms “least squares” and “linear model” are closely linked, they are not synonymous.
Least Squares
Background & Founding
The method of least squares grew out of the fields of astronomy and geodesy, as scientists and mathematicians sought to provide solutions to the challenges of navigating the Earth’s oceans during the Age of Exploration. The accurate description of the behavior of celestial bodies was the key to enabling ships to sail in open seas, where sailors could no longer rely on land sightings for navigation.
The first clear and concise exposition of the method of least squares was published by Legendre in 1805.[8] The technique is described as an algebraic procedure for fitting linear equations to data, and Legendre demonstrates the new method by analyzing the same data as Laplace for the shape of the Earth. The value of Legendre's method of least squares was immediately recognized by leading astronomers and geodesists of the time.
In 1809 Carl Friedrich Gauss published his method of calculating the orbits of celestial bodies. In that work he claimed to have been in possession of the method of least squares since 1795, which naturally led to a priority dispute with Legendre. However, to Gauss's credit, he went beyond Legendre and succeeded in connecting the method of least squares with the principles of probability and the normal distribution. He managed to complete Laplace's program of specifying a mathematical form of the probability density for the observations, depending on a finite number of unknown parameters, and to define a method of estimation that minimizes the error of estimation. Gauss showed that the arithmetic mean is indeed the best estimate of the location parameter by changing both the probability density and the method of estimation. He then turned the problem around by asking what form the density should have and what method of estimation should be used to obtain the arithmetic mean as an estimate of the location parameter. In this attempt, he invented the normal distribution.
An early demonstration of the strength of Gauss’s method came when it was used to predict the future location of the newly discovered asteroid Ceres. On 1 January 1801, the Italian astronomer Giuseppe Piazzi discovered Ceres and was able to track its path for 40 days before it was lost in the glare of the sun. Based on these data, astronomers desired to determine the location of Ceres after it emerged from behind the sun without solving Kepler’s complicated nonlinear equations of planetary motion. The only predictions that successfully allowed Hungarian astronomer Franz Xaver von Zach to relocate Ceres were those performed by the 24-year-old Gauss using least-squares analysis.
In 1810, after reading Gauss's work, Laplace used the central limit theorem, which he had recently proved, to give a large-sample justification for the method of least squares and the normal distribution. In 1822, Gauss was able to state that the least-squares approach to regression analysis is optimal in the sense that, in a linear model where the errors have a mean of zero, are uncorrelated, and have equal variances, the best linear unbiased estimator of the coefficients is the least-squares estimator. This result is known as the Gauss–Markov theorem.
The idea of least-squares analysis was also independently formulated by the American Robert Adrain in 1808. In the next two centuries workers in the theory of errors and in statistics found many different ways of implementing least squares.
Problem Statement
The objective consists of adjusting the parameters of a model function to best fit a data set. A simple data set consists of n points (data pairs) $(x_i, y_i)$, $i = 1, \ldots, n$, where $x_i$ is an independent variable and $y_i$ is a dependent variable whose value is found by observation. The model function has the form $f(x, \beta)$, where the $m$ adjustable parameters are held in the vector $\beta$. The goal is to find the parameter values for the model that "best" fit the data. The fit of a model to a data point is measured by its residual, defined as the difference between the actual value of the dependent variable and the value predicted by the model: $r_i = y_i - f(x_i, \beta)$. The least-squares method finds the optimal parameter values by minimizing the sum, $S$, of squared residuals:

$$S = \sum_{i=1}^{n} r_i^{2}$$
An example of a model in two dimensions is that of the straight line. Denoting the y-intercept as $\beta_0$ and the slope as $\beta_1$, the model function is given by $f(x, \beta) = \beta_0 + \beta_1 x$. A data point may consist of more than one independent variable. In the most general case there may be one or more independent variables and one or more dependent variables at each data point.
A residual plot showing random fluctuations about $r_i = 0$ indicates that a linear model ($Y_i = \alpha + \beta x_i + U_i$) is appropriate, where $U_i$ is an independent random error variable. If the residual plot had a parabolic shape, a parabolic model ($Y_i = \alpha + \beta x_i + \gamma x_i^{2} + U_i$) would be the appropriate model for the data.
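As an illustration of this diagnostic, the sketch below (with made-up data) fits a straight line using numpy.polyfit and plots the residuals against x; random scatter about zero supports the linear model:

import numpy as np
import matplotlib.pyplot as plt

# Made-up data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit y = alpha + beta * x with numpy's least-squares polynomial fit.
beta, alpha = np.polyfit(x, y, deg=1)

# Residuals r_i = y_i - (alpha + beta * x_i).
r = y - (alpha + beta * x)

# Random scatter of the residuals about 0 supports the linear model.
plt.scatter(x, r)
plt.axhline(0.0, linestyle="--")
plt.xlabel("x")
plt.ylabel("residual")
plt.show()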
In the straight-line model above, the least-squares approach to solving the problem is to make the sum of the squares of these residuals as small as possible; that is, to find the minimum of the function

$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} r_i^{2}$$
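As a concrete illustration, $S(\beta_0, \beta_1)$ can be evaluated for candidate parameter values; the data points and parameter values below are invented for the example, and sum_of_squares is a hypothetical helper:

import numpy as np

# Hypothetical data points (x_i, y_i), invented for the example.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.1, 4.9, 7.2])

def sum_of_squares(beta0, beta1):
    # S(beta0, beta1) = sum of squared residuals r_i = y_i - (beta0 + beta1 * x_i)
    r = y - (beta0 + beta1 * x)
    return np.sum(r ** 2)

print(sum_of_squares(1.0, 2.0))  # a good candidate: small S
print(sum_of_squares(0.0, 1.0))  # a poor candidate: much larger S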
The minimum is determined by calculating the partial derivatives of $S(\beta_0, \beta_1)$ with respect to $\beta_0$ and $\beta_1$ and setting them to zero:

$$\frac{\partial S}{\partial \beta_0} = 2n\beta_0 + 2\sum_{i=1}^{n}(\beta_1 x_i - y_i) = 0$$

$$\frac{\partial S}{\partial \beta_1} = 2\sum_{i=1}^{n}(\beta_0 x_i + \beta_1 x_i^{2} - y_i x_i) = 0$$
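As an optional check, these two derivatives can be verified symbolically; the sketch below uses sympy with a small symbolic data set (n = 3, chosen arbitrarily):

import sympy as sp

n = 3  # small arbitrary sample size for the symbolic check
beta0, beta1 = sp.symbols("beta0 beta1")
xs = sp.symbols("x1:4")  # x_1, x_2, x_3
ys = sp.symbols("y1:4")  # y_1, y_2, y_3

# S(beta0, beta1) as the sum of squared residuals.
S = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))

dS_dbeta0 = sp.expand(sp.diff(S, beta0))
dS_dbeta1 = sp.expand(sp.diff(S, beta1))

# Compare against the stated forms of the two partial derivatives.
expected0 = sp.expand(2 * n * beta0 + 2 * sum(beta1 * x - y for x, y in zip(xs, ys)))
expected1 = sp.expand(2 * sum(beta0 * x + beta1 * x**2 - y * x for x, y in zip(xs, ys)))
print(sp.simplify(dS_dbeta0 - expected0) == 0)  # expected: True
print(sp.simplify(dS_dbeta1 - expected1) == 0)  # expected: True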
Solving these two equations yields

$$\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^{2} - n\bar{x}^{2}}, \qquad \beta_0 = \bar{y} - \beta_1 \bar{x}$$
Some notation can be introduced to simplify the form. Writing

$$S_{xy} = \sum_{i=1}^{n} x_i y_i, \qquad S_{xx} = \sum_{i=1}^{n} x_i^{2},$$

the estimates become

$$\beta_1 = \frac{S_{xy} - n\bar{x}\bar{y}}{S_{xx} - n\bar{x}^{2}}, \qquad \beta_0 = \bar{y} - \beta_1 \bar{x}$$
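These closed-form estimates translate directly into code; the sketch below uses made-up data and cross-checks the result against numpy.polyfit:

import numpy as np

# Made-up data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

n = len(x)
x_bar, y_bar = x.mean(), y.mean()
S_xy = np.sum(x * y)
S_xx = np.sum(x ** 2)

beta1 = (S_xy - n * x_bar * y_bar) / (S_xx - n * x_bar ** 2)
beta0 = y_bar - beta1 * x_bar
print(beta0, beta1)

# Cross-check: numpy.polyfit returns [beta1, beta0] for a degree-1 fit.
print(np.polyfit(x, y, deg=1))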
For application to a higher-order polynomial

$$f(x) = \sum_{j=0}^{m-1} \beta_j x^{j},$$

we can consider the following matrix formulation:
$$\begin{bmatrix}
x_1^0 & x_1^1 & x_1^2 & \cdots & x_1^{m-1}\\
x_2^0 & x_2^1 & x_2^2 & \cdots & x_2^{m-1}\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
x_n^0 & x_n^1 & x_n^2 & \cdots & x_n^{m-1}
\end{bmatrix}
\begin{bmatrix}
\beta_0\\ \beta_1\\ \vdots\\ \beta_{m-1}
\end{bmatrix}
=
\begin{bmatrix}
f(x_1)\\ f(x_2)\\ \vdots\\ f(x_n)
\end{bmatrix}$$
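The matrix on the left is a Vandermonde-type matrix and can be constructed directly; a minimal sketch, where the sample points and the number of coefficients m are arbitrary choices for illustration:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])  # arbitrary sample points
m = 3                               # number of coefficients beta_0, ..., beta_{m-1}

# Columns are x^0, x^1, ..., x^(m-1), matching the matrix above.
X = np.vander(x, N=m, increasing=True)
print(X)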
Denote this system as $X\vec{\beta} = \vec{F}$, where $\vec{F}$ collects the model values $f(x_i)$. Fitting the model to the observed values $\vec{Y}$, the least-squares problem is to minimize
$$\begin{aligned}
\varepsilon &= \|X\vec{\beta} - \vec{Y}\|^{2} \\
&= (\vec{\beta}^{T} X^{T} - \vec{Y}^{T})(X\vec{\beta} - \vec{Y})\\
&= \vec{\beta}^{T} X^{T} X \vec{\beta} - \vec{\beta}^{T} X^{T} \vec{Y} - \vec{Y}^{T} X \vec{\beta} + \vec{Y}^{T}\vec{Y}\\
&= \vec{\beta}^{T} X^{T} X \vec{\beta} - 2\vec{\beta}^{T} X^{T} \vec{Y} + \vec{Y}^{T}\vec{Y}
\end{aligned}$$
$$\begin{aligned}
\frac{\partial \varepsilon}{\partial \vec{\beta}} &= \frac{\partial\left(\vec{\beta}^{T} X^{T} X \vec{\beta} - 2\vec{\beta}^{T} X^{T} \vec{Y} + \vec{Y}^{T}\vec{Y}\right)}{\partial \vec{\beta}} \\
&= \frac{\partial\left(\vec{\beta}^{T} X^{T} X \vec{\beta}\right)}{\partial \vec{\beta}} - 2X^{T}\vec{Y}\\
&= 2X^{T} X \vec{\beta} - 2X^{T}\vec{Y}.
\end{aligned}$$
Setting this derivative to zero yields the normal equations $X^{T}X\vec{\beta} = X^{T}\vec{Y}$, and therefore

$$\vec{\beta} = (X^{T}X)^{-1}X^{T}\vec{Y}$$
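Putting this together for a polynomial fit, the sketch below uses synthetic data (invented for illustration) and solves the normal equations; np.linalg.solve and np.linalg.lstsq are used rather than forming the explicit inverse, which is generally more numerically robust:

import numpy as np

# Synthetic data roughly following y = 1 + 2x + 0.5x^2, invented for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 3.4, 7.1, 11.4, 17.2])

m = 3                                   # quadratic: beta_0, beta_1, beta_2
X = np.vander(x, N=m, increasing=True)  # design matrix with columns x^0, x^1, x^2

# Normal equations (X^T X) beta = X^T y, solved without forming an explicit inverse.
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)

# Equivalent (and generally more robust) built-in least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_lstsq)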
Computer Programming
Matrix manipulation
import numpy as np
from numpy.linalg import inv

# Number of data points, followed by the x-values and y-values on separate lines.
dimension = int(input())
if dimension > 1:
    entries_x = list(map(int, input().split()))
    entries_y = list(map(int, input().split()))

    # Build the design matrix X = [1, x] and the observation vector Y.
    X = np.array(entries_x).reshape(dimension, 1)
    X_0 = np.ones((dimension, 1))
    X = np.hstack((X_0, X))
    Y = np.array(entries_y).reshape(dimension, 1)

    # Normal-equation solution B = (X^T X)^{-1} X^T Y; B[0][0] is beta_0, B[1][0] is beta_1.
    X_trans = X.transpose()
    B = np.matmul(np.matmul(inv(np.matmul(X_trans, X)), X_trans), Y)

    # Print the fitted line, omitting the intercept when it is numerically zero.
    if abs(B[0][0]) > 1e-6:
        print("y = %-.3f" % B[0][0], "%+.3f" % B[1][0], "x")
    else:
        print("y =", "%+.3f" % B[1][0], "x")
else:
    # Degenerate case: a single data point; the slope is taken as the ratio of the two inputs.
    a = int(input())
    b = int(input())
    print("y = %.3f" % (a / b), "x")
Package sklearn.linear_model.LinearRegression
import numpy as np
from sklearn.linear_model import LinearRegression

# x-values on the first input line, y-values on the second (whitespace-separated integers).
X = np.array(list(map(int, input().split()))).reshape(-1, 1)
y = np.array(list(map(int, input().split())))

reg = LinearRegression().fit(X, y)
r_squared = reg.score(X, y)       # coefficient of determination R^2 of the fit

print("%.3f" % reg.intercept_)    # beta_0
print("%.3f" % reg.coef_[0])      # beta_1
Application
- If the goal is prediction, forecasting, or error reduction, linear regression can be used to fit a predictive model to an observed data set of values of the response and explanatory variables. After developing such a model, if additional values of the explanatory variables are collected without an accompanying response value, the fitted model can be used to make a prediction of the response.
- If the goal is to explain variation in the response variable that can be attributed to variation in the explanatory variables, linear regression analysis can be applied to quantify the strength of the relationship between the response and the explanatory variables, and in particular to determine whether some explanatory variables may have no linear relationship with the response at all, or to identify which subsets of explanatory variables may contain redundant information about the response.