Basic Mathematics for Linear Regression and Least Square Method


Background

In statistics, linear regression is a linear approach to modeling the relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. In linear regression, the relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data. Such models are called linear models. Most commonly, the conditional mean of the response given the values of the explanatory variables is assumed to be an affine function of those values. Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of the response given the values of the predictors.
Linear regression was the first type of regression analysis to be studied rigorously, and to be used extensively in practical applications. This is because models which depend linearly on their unknown parameters are easier to fit than models which are non-linearly related to their parameters and because the statistical properties of the resulting estimators are easier to determine.

Mathematical Approach

Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways, such as by minimizing the “lack of fit” in some other norm (as with least absolute deviations regression), or by minimizing a penalized version of the least squares cost function as in ridge regression (L2-norm penalty) and lasso (L1-norm penalty). Conversely, the least squares approach can be used to fit models that are not linear models. Thus, although the terms “least squares” and “linear model” are closely linked, they are not synonymous.
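
As a rough illustration of these alternatives (the data and the penalty strengths below are made up for demonstration), ordinary least squares, ridge, and lasso can all be fitted with scikit-learn:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Toy data: y is roughly 1 + 2x plus a little noise.
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 1.0 + 2.0 * X.ravel() + rng.normal(scale=0.5, size=50)

ols = LinearRegression().fit(X, y)   # minimizes the sum of squared residuals
ridge = Ridge(alpha=1.0).fit(X, y)   # adds an L2 penalty on the coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # adds an L1 penalty, which can zero out coefficients

for name, model in [("OLS", ols), ("Ridge", ridge), ("Lasso", lasso)]:
    print(name, model.intercept_, model.coef_)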

Least Squares

Background & Founding

The method of least squares grew out of the fields of astronomy and geodesy, as scientists and mathematicians sought to provide solutions to the challenges of navigating the Earth’s oceans during the Age of Exploration. The accurate description of the behavior of celestial bodies was the key to enabling ships to sail in open seas, where sailors could no longer rely on land sightings for navigation.

The first clear and concise exposition of the method of least squares was published by Legendre in 1805.[8] The technique is described as an algebraic procedure for fitting linear equations to data, and Legendre demonstrates the new method by analyzing the same data as Laplace for the shape of the Earth. The value of Legendre’s method of least squares was immediately recognized by leading astronomers and geodesists of the time.

In 1809 Carl Friedrich Gauss published his method of calculating the orbits of celestial bodies. In that work he claimed to have been in possession of the method of least squares since 1795. This naturally led to a priority dispute with Legendre. However, to Gauss’s credit, he went beyond Legendre and succeeded in connecting the method of least squares with the principles of probability and to the normal distribution. He had managed to complete Laplace’s program of specifying a mathematical form of the probability density for the observations, depending on a finite number of unknown parameters, and define a method of estimation that minimizes the error of estimation. Gauss showed that the arithmetic mean is indeed the best estimate of the location parameter by changing both the probability density and the method of estimation. He then turned the problem around by asking what form the density should have and what method of estimation should be used to get the arithmetic mean as estimate of the location parameter. In this attempt, he invented the normal distribution.

An early demonstration of the strength of Gauss’s method came when it was used to predict the future location of the newly discovered asteroid Ceres. On 1 January 1801, the Italian astronomer Giuseppe Piazzi discovered Ceres and was able to track its path for 40 days before it was lost in the glare of the sun. Based on these data, astronomers desired to determine the location of Ceres after it emerged from behind the sun without solving Kepler’s complicated nonlinear equations of planetary motion. The only predictions that successfully allowed Hungarian astronomer Franz Xaver von Zach to relocate Ceres were those performed by the 24-year-old Gauss using least-squares analysis.

In 1810, after reading Gauss’s work, Laplace, after proving the central limit theorem, used it to give a large sample justification for the method of least squares and the normal distribution. In 1822, Gauss was able to state that the least-squares approach to regression analysis is optimal in the sense that in a linear model where the errors have a mean of zero, are uncorrelated, and have equal variances, the best linear unbiased estimator of the coefficients is the least-squares estimator. This result is known as the Gauss–Markov theorem.

The idea of least-squares analysis was also independently formulated by the American Robert Adrain in 1808. In the next two centuries workers in the theory of errors and in statistics found many different ways of implementing least squares.

Problem Statement

The objective consists of adjusting the parameters of a model function to best fit a data set. A simple data set consists of $n$ points (data pairs) $(x_i, y_i)$, $i = 1, \dots, n$, where $x_i$ is an independent variable and $y_i$ is a dependent variable whose value is found by observation. The model function has the form $f(x, \beta)$, where the $m$ adjustable parameters are held in the vector $\beta$. The goal is to find the parameter values for the model that “best” fit the data. The fit of a model to a data point is measured by its residual, defined as the difference between the actual value of the dependent variable and the value predicted by the model: $r_i = y_i - f(x_i, \beta)$. The least-squares method finds the optimal parameter values by minimizing the sum, $S$, of squared residuals:
$$S = \sum_{i=1}^{n} r_i^{2}$$
An example of a model in two dimensions is that of the straight line. Denoting the y-intercept as $\beta_0$ and the slope as $\beta_1$, the model function is given by $f(x, \beta) = \beta_0 + \beta_1 x$. A data point may consist of more than one independent variable. In the most general case there may be one or more independent variables and one or more dependent variables at each data point.
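
As a small illustration (the data values and the trial parameters below are made up), the residuals and the sum of squared residuals for a candidate straight line can be computed directly:

import numpy as np

# Hypothetical observations and a trial line f(x) = beta0 + beta1 * x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
beta0, beta1 = 0.0, 2.0   # trial parameter values, chosen only for illustration

residuals = y - (beta0 + beta1 * x)   # r_i = y_i - f(x_i, beta)
S = np.sum(residuals ** 2)            # sum of squared residuals
print(residuals, S)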

If a residual plot shows random fluctuations about $r_i = 0$, this indicates that a linear model ($Y_i = \alpha + \beta x_i + U_i$) is appropriate, where $U_i$ is an independent, random error term. If the residual plot instead had a parabolic shape, a parabolic model ($Y_i = \alpha + \beta x_i + \gamma x_i^{2} + U_i$) would be an appropriate model for the data.

In the straight-line model above, the least squares approach to solving the problem is to try to make the sum of the squares of these residuals as small as possible; that is, to find the minimum of the function:
$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} r_i^{2}$$
The minimum is determined by calculating the partial derivatives of $S(\beta_0, \beta_1)$ with respect to $\beta_0$ and $\beta_1$ and setting them to zero:
$$\frac{\partial S}{\partial \beta_0} = 2n\beta_0 + 2\sum_{i=1}^{n}(\beta_1 x_i - y_i) = 0, \qquad \frac{\partial S}{\partial \beta_1} = 2\sum_{i=1}^{n}(\beta_0 x_i + \beta_1 x_i^2 - y_i x_i) = 0$$
thus,
$$\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}, \qquad \beta_0 = \bar{y} - \beta_1 \bar{x}$$
Some notation can be introduced to simplify the form:
$$S_{xy} = \sum_{i=1}^{n} x_i y_i, \qquad S_{xx} = \sum_{i=1}^{n} x_i^2, \qquad \beta_1 = \frac{S_{xy} - n\bar{x}\bar{y}}{S_{xx} - n\bar{x}^2}, \qquad \beta_0 = \bar{y} - \beta_1 \bar{x}$$
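
A minimal sketch of these closed-form estimates in Python, assuming the made-up data below:

import numpy as np

# Hypothetical paired observations.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
n = len(x)

x_bar, y_bar = x.mean(), y.mean()
S_xy = np.sum(x * y)    # S_xy = sum of x_i * y_i
S_xx = np.sum(x ** 2)   # S_xx = sum of x_i^2

beta1 = (S_xy - n * x_bar * y_bar) / (S_xx - n * x_bar ** 2)   # slope
beta0 = y_bar - beta1 * x_bar                                  # intercept
print(beta0, beta1)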

For application to a higher-order polynomial $f(x) = \sum_{j=0}^{m-1} \beta_j x^j$, we can consider the following matrix form:
$$\begin{bmatrix} x_1^0 & x_1^1 & x_1^2 & \cdots & x_1^{m-1}\\ x_2^0 & x_2^1 & x_2^2 & \cdots & x_2^{m-1}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ x_n^0 & x_n^1 & x_n^2 & \cdots & x_n^{m-1} \end{bmatrix} \begin{bmatrix} \beta_0\\ \beta_1\\ \vdots\\ \beta_{m-1} \end{bmatrix} = \begin{bmatrix} f(x_1)\\ f(x_2)\\ \vdots\\ f(x_n) \end{bmatrix}$$
denoted as $X\vec{\beta} = \vec{F}$. Least squares then minimizes the squared distance between $X\vec{\beta}$ and the vector of observations $\vec{Y}$:

$$\begin{aligned} \min \varepsilon &= ||X\vec{\beta} - \vec{Y}||^2 \\ &= (X\vec{\beta} - \vec{Y})^T (X\vec{\beta} - \vec{Y})\\ &= \vec{\beta}^T X^T X \vec{\beta} - \vec{\beta}^T X^T \vec{Y} - \vec{Y}^T X \vec{\beta} + \vec{Y}^T \vec{Y}\\ &= \vec{\beta}^T X^T X \vec{\beta} - 2\vec{\beta}^T X^T \vec{Y} + \vec{Y}^T \vec{Y} \end{aligned}$$
$$\begin{aligned} \frac{\partial \varepsilon}{\partial \vec{\beta}} &= \frac{\partial\left(\vec{\beta}^T X^T X \vec{\beta} - 2\vec{\beta}^T X^T \vec{Y} + \vec{Y}^T \vec{Y}\right)}{\partial \vec{\beta}} \\ &= \frac{\partial\left(\vec{\beta}^T X^T X \vec{\beta}\right)}{\partial \vec{\beta}} - 2X^T \vec{Y}\\ &= 2X^T X \vec{\beta} - 2X^T \vec{Y} = 0. \end{aligned}$$
Therefore: $\vec{\beta} = (X^T X)^{-1} X^T \vec{Y}$
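
A minimal sketch of this normal-equation solution for a polynomial fit, assuming made-up data and an illustrative choice of $m = 3$ coefficients:

import numpy as np

# Hypothetical data; m is the number of polynomial coefficients (degree m-1).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.9, 7.1, 13.2, 20.8])
m = 3

# Design matrix with columns x^0, x^1, ..., x^(m-1).
X = np.vander(x, N=m, increasing=True)

# Normal equations: beta = (X^T X)^{-1} X^T y; np.linalg.solve is used
# instead of forming the explicit inverse, which is numerically safer.
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)

In practice np.polyfit or np.linalg.lstsq solve the same problem with better numerical behavior on ill-conditioned design matrices.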

Computer Programming

Matrix manipulation

import numpy as np
from numpy.linalg import inv

# Number of observations.
dimension = int(input())

if dimension > 1:
    # Read the x and y values, one whitespace-separated list per line.
    entries_x = list(map(int, input().split()))
    entries_y = list(map(int, input().split()))

    # Build the design matrix X = [1, x] by prepending a column of ones.
    X = np.array(entries_x).reshape(dimension, 1)
    X_0 = np.ones((dimension, 1))
    X = np.hstack((X_0, X))

    Y = np.array(entries_y).reshape(dimension, 1)

    X_trans = X.transpose()

    # Normal equations: B = (X^T X)^{-1} X^T Y, giving B = [beta_0, beta_1]^T.
    B = np.matmul(np.matmul(inv(np.matmul(X_trans, X)), X_trans), Y)

    # Print the fitted line, omitting the intercept when it is essentially zero.
    if abs(B[0][0]) > 1e-6:
        print("y = %-.3f" % B[0][0], "%+.3f" % B[1][0], "x")
    else:
        print("y =", "%+.3f" % B[1][0], "x")

else:
    # Degenerate case: read two numbers and report their ratio as the slope
    # of a line through the origin.
    a = int(input())
    b = int(input())
    print("y = %.3f" % (a / b), "x")

Package sklearn.linear_model.LinearRegression


import numpy as np
from sklearn.linear_model import LinearRegression

# Read the x values and y values, one whitespace-separated list per line.
X = np.array(list(map(int, input().split()))).reshape(-1, 1)
y = np.array(list(map(int, input().split())))

# Fit y = beta_0 + beta_1 * x by ordinary least squares.
reg = LinearRegression().fit(X, y)
r_squared = reg.score(X, y)  # coefficient of determination R^2 of the fit

print("%.3f" % reg.intercept_)   # beta_0
print("%.3f" % reg.coef_[0])     # beta_1

Application

  • If the goal is prediction, forecasting, or error reduction, linear regression can be used to fit a predictive model to an observed data set of values of the response and explanatory variables. After developing such a model, if additional values of the explanatory variables are collected without an accompanying response value, the fitted model can be used to make a prediction of the response (see the sketch after this list).

  • If the goal is to explain variation in the response variable that can be attributed to variation in the explanatory variables, linear regression analysis can be applied to quantify the strength of the relationship between the response and the explanatory variables, and in particular to determine whether some explanatory variables may have no linear relationship with the response at all, or to identify which subsets of explanatory variables may contain redundant information about the response.
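
As a rough sketch of the prediction use case (the observed data and the new explanatory values below are made up), a fitted scikit-learn model can be applied to explanatory values collected without an accompanying response:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical observed data used to fit the model.
X_observed = np.array([[1.0], [2.0], [3.0], [4.0]])
y_observed = np.array([2.1, 3.9, 6.2, 7.8])
reg = LinearRegression().fit(X_observed, y_observed)

# New explanatory values with no accompanying response; the fitted model
# supplies predicted responses for them.
X_new = np.array([[5.0], [6.0]])
print(reg.predict(X_new))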
