Introduction
Linear regression is perhaps the most fundamental algorithm in machine learning. In this setting, given a dataset $D=\{(x^i, y^i) \mid x^i \in \mathbb{R}^n,\ y^i \in \mathbb{R}\}_{i=1}^m$ (where $x$ is the feature vector and $y$ is the label), we fit a model of the form $h_\theta(x) = \theta^T\phi(x)$, where $\theta$ is the parameter vector and $\phi(x)$ is a transformed feature vector (for example, $\phi(x) = [1, x_1, x_2, \ldots, x_1x_2, \ldots, x_nx_{n-1}]$). That is, the model is linear IN TERMS OF the parameters rather than the input vector $x$, since feature transformation is allowed.
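To make the feature transformation concrete, here is a minimal sketch in NumPy (the helper name `phi` is my own choice) of a map that produces the bias term, the raw features, and all pairwise products:

```python
import numpy as np

def phi(x):
    """Map x in R^n to [1, x_1, ..., x_n, x_i * x_j for i < j]."""
    n = len(x)
    cross = [x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    return np.concatenate(([1.0], x, cross))

print(phi(np.array([2.0, 3.0])))  # [1. 2. 3. 6.] -> bias, x1, x2, x1*x2
```

The model $h_\theta(x) = \theta^T\phi(x)$ is then quadratic in $x$ but still linear in $\theta$.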
Our goal is to fit the model $h_\theta(x) = \theta^T\phi(x)$ as well as possible. That is, after tuning the parameters, given an unseen $x^*$, we should have $h_\theta(x^*) \to y^*$. In a nutshell: find the BEST $\theta$.
Sometimes, our model might fit the training dataset well, yet fail to generalize to unseen data. This is the problem of OVERFITTING. To address it, we can use robust linear regression, ridge regression, or lasso regression.
In what follows, I will derive the variants of linear regression (standard, robust, ridge, lasso) from two perspectives (deterministic and probabilistic). Generalized linear regression will also be discussed.
Deterministic perspective
Intuitively, we could let the cost function be $J(\theta)=\frac{1}{2}\sum_{i=1}^m (h_\theta(x^i)-y^i)^2$, also known as the residual sum of squares (RSS) or the sum of squared errors (SSE). Clearly, $J$ is a convex function of $\theta$: each term is the square of a function affine in $\theta$.
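As a direct translation (a sketch; here $X$ stacks the transformed vectors $\phi(x^i)$ as rows):

```python
import numpy as np

def rss(theta, X, y):
    """J(theta) = 0.5 * sum_i (theta^T x_i - y_i)^2, with x_i the rows of X."""
    r = X @ theta - y
    return 0.5 * (r @ r)
```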
Then, the (standard) linear regression is formulated as $\theta^* := \arg\min_\theta J(\theta)$. [How to solve it? 1. the gradient descent algorithm; 2. analytically setting $\partial J/\partial\theta = 0$. We have a particularly nice solution if $\bar{x} = [1, x]$ and $h_\theta(x)=\theta^T\bar{x}$: then $\partial J/\partial\theta = \sum_{i=1}^m(\bar{x}_i^T\theta - y_i)\bar{x}_i = X^TX\theta - X^Ty = 0 \Rightarrow \theta^* = (X^TX)^{-1}X^Ty$, where $X$ is the design matrix whose $i$-th row is $\bar{x}_i^T$ (assuming $X^TX$ is invertible).]
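Both routes are easy to sketch in NumPy (my own illustration on synthetic data, not code from the text; the learning rate and iteration count are hand-picked):

```python
import numpy as np

# Synthetic data: rows of X are bar{x}_i = [1, x_i] (bias term prepended).
rng = np.random.default_rng(0)
m, n = 100, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])
true_theta = rng.normal(size=n + 1)
y = X @ true_theta + 0.1 * rng.normal(size=m)

# 1. Analytic solution of the normal equations X^T X theta = X^T y.
#    (np.linalg.solve is preferred over forming the inverse explicitly.)
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# 2. Gradient descent on J; the gradient is X^T (X theta - y).
#    Scaling it by 1/m (same minimizer) keeps the step size stable.
theta_gd = np.zeros(n + 1)
lr, steps = 0.1, 10_000  # hand-picked, not from the text
for _ in range(steps):
    theta_gd -= lr * X.T @ (X @ theta_gd - y) / m

print(np.allclose(theta_closed, theta_gd, atol=1e-6))  # True: both find theta*
```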