1 Linear Model
$$y_i=\theta_1 x_{i1} + \theta_2 x_{i2} + \cdots + \theta_{n_p} x_{i n_p} + \varepsilon_i,\qquad i=1,2,\dots,n_d.$$
or, in matrix form,

$$y = X\theta + \varepsilon$$
where $\varepsilon$ is the noise vector and $\theta$ is the vector of unknown parameters.
- The linear model is parametric with $n_p$ parameters.
- If an intercept $\theta_0$ is included, it corresponds to a column of ones in the design matrix $X$.
- The design matrix has dimension $X:\ n_d \times n_p$, or $n_d \times (n_p + 1)$ when the intercept $\theta_0$ is added; we require $n_d > n_p$.
- The traditional assumption on the noise $\varepsilon$ is that it is i.i.d. Gaussian: $\varepsilon_i \sim \mathcal{N}(0,\sigma^2),\ i=1,2,\dots,n_d$.
1.1 Parameter Estimation
For parameter estimation in the linear model there are two routes: minimizing a cost function or applying the maximum likelihood principle. These two approaches lead to the same optimal solution for $\theta$; the difference is the viewpoint: the cost function reduces the discrepancy between predictions and data, while maximum likelihood maximizes the likelihood of observing the data given the parameters.
1.2 Cost Function
The goal: reduce the discrepancy between predictions and data $\Longrightarrow$ minimize the MSE of the predictions
$$MSE(\hat{y}) = \frac{1}{n_d}||y-\hat{y}||^2 = \frac{1}{n_d}\sum^{n_d}_{i=1}(y_i - \hat{y}_i)^2 = \frac{1}{n_d}||y-X\hat{\theta}||^2 = \frac{1}{n_d}\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\hat{\theta})^2$$
Let $J(\theta)$ be the cost function; find the estimate $\hat{\theta}$ of $\theta$ that minimizes $J(\theta)$, where

$$J(\theta) = ||y-X\theta||^2 = \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2$$
so

$$\hat\theta=\underset{\theta}{\mathrm{argmin}}\, J(\theta).$$

This is the same as the ordinary least squares (OLS) estimator, because OLS minimizes the squared norm of the noise,

$$||\varepsilon(\theta)||^2=\sum^{n_d}_{i=1}\varepsilon_i^2=\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2,$$

so the OLS estimator of $\theta$ is

$$\hat\theta = \underset{\theta}{\mathrm{argmin}}\,||\varepsilon(\theta)||^2 = \underset{\theta}{\mathrm{argmin}}\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2$$
Cost function:

$$J(\theta)=\sum^{n_d}_{i=1}(y_i - h_\theta(x_i))^2,$$

where $h_\theta(x)$ is called the hypothesis and $h_\theta(x_i)=x_i^\mathsf{T}\theta$ for linear models. So the estimator obtained by minimizing the cost function with $h_\theta(x_i)=x_i^\mathsf{T}\theta$ and the OLS estimator obtained by minimizing the squared noise norm $||\varepsilon(\theta)||^2$ coincide.
Get $\hat\theta$:

$$\left\{ \begin{array}{ll} \mathrm{gradient\ descent\ approximates\ } \hat\theta, & J(\theta)\ \mathrm{is\ convex} \\ \mathrm{other\ numerical\ optimization\ schemes}, & J(\theta)\ \mathrm{is\ not\ convex} \end{array} \right.$$
1.3 Normal equation to get $\hat\theta$
For invertible $X^\mathsf{T}X$,

$$\hat\theta = (X^\mathsf{T}X)^{-1}X^\mathsf{T}y$$
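A minimal NumPy sketch of computing $\hat\theta$ from the normal equation (the data, the seed, and the helper name `fit_normal_equation` below are illustrative assumptions, not part of the notes):

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve X^T X theta = X^T y (assumes X^T X is invertible)."""
    # Solving the linear system is numerically preferable to forming (X^T X)^{-1} explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Illustrative data: n_d = 100 samples, n_p = 3 covariates plus an intercept column of ones
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
theta_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ theta_true + rng.normal(scale=0.1, size=100)

theta_hat = fit_normal_equation(X, y)
print(theta_hat)  # should be close to theta_true
```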
Gauss-Markov assumptions
- $E(\varepsilon_i) = 0$
- $Var(\varepsilon_i)=\sigma^2$
- $Cov(\varepsilon_i,\varepsilon_j)=0$ for $i \neq j$
Under these assumptions:
- $X^\mathsf{T}X$ must be non-singular, i.e. the columns of $X$ must be linearly independent.
- $E(\hat\theta)=\theta$
- $Var(\hat\theta)=\sigma^2(X^\mathsf{T}X)^{-1}$
- $\hat\theta \sim \mathcal{N}\left(\theta,\ \sigma^2(X^\mathsf{T}X)^{-1}\right)$
(under the assumption $\varepsilon_i \sim \mathcal{N}(0,\sigma^2),\ i=1,2,\dots,n_d$)
- The likelihood is
$$\mathcal{L}(\theta,\sigma^2 \mid y, X)=(2\pi\sigma^2)^{-n_d/2} \exp\left(-\frac{1}{2\sigma^2}\sum^{n_d}_{i=1}(y_i - x_i^\mathsf{T}\theta)^2\right)$$
- The associated maximum likelihood estimate (MLE) is
$$\hat\theta = (X^\mathsf{T}X)^{-1}X^\mathsf{T}y,$$
which coincides with the OLS estimate for normal noise, i.e. it equals $\underset{\theta}{\mathrm{argmin}} \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2$.
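To see why the MLE coincides with OLS here, take the log of the likelihood above:

$$\log \mathcal{L}(\theta,\sigma^2 \mid y, X) = -\frac{n_d}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum^{n_d}_{i=1}(y_i - x_i^\mathsf{T}\theta)^2.$$

The first term does not depend on $\theta$, so maximizing the log-likelihood over $\theta$ is the same as minimizing $\sum^{n_d}_{i=1}(y_i - x_i^\mathsf{T}\theta)^2$, i.e. the OLS criterion; setting its gradient $-2X^\mathsf{T}(y - X\theta)$ to zero gives $X^\mathsf{T}X\theta = X^\mathsf{T}y$ and hence $\hat\theta = (X^\mathsf{T}X)^{-1}X^\mathsf{T}y$.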
Limitations of the normal equation for large $n_p$
- the computational cost of $(X^\mathsf{T}X)^{-1}$
- singularity: with large $n_p$, the columns of $X$ may become highly correlated
1.4 Gradient Descent Process to get $\hat\theta$
The logic: start with some $\theta = (\theta_0,\theta_1)$, then keep changing $(\theta_0,\theta_1)$ to reduce $J(\theta_0,\theta_1)$ until reaching $\underset{\theta_0,\theta_1}{\min} J(\theta_0,\theta_1)$.
1.4.1 Algorithm
1. Initialize $(\theta_1,\theta_2,\dots,\theta_n)$
2. Set $(\tilde\theta_1,\tilde\theta_2,\dots,\tilde\theta_n)=(\theta_1,\theta_2,\dots,\theta_n)$
3. Update each $\theta_i$ with $\theta_i=\tilde\theta_i - a \cdot \frac{\partial}{\partial\theta_i}J(\tilde\theta_1,\tilde\theta_2,\dots,\tilde\theta_n)$, where $a$ is the learning rate
4. Repeat steps 2 and 3 until the $\theta_i$ converge
Gradient descent algorithm for linear regression
Cost function for the multiple linear regression model:

$$J(\theta)=\frac{1}{2n_d}\sum_{i=1}^{n_d}(x_i^\mathsf{T} \theta - y_i)^2$$
The update step of gradient descent for linear regression with this cost function is

$$\theta_j^{(k+1)}=\theta_j^{(k)}-a\,\frac{1}{n_d}\sum_{i=1}^{n_d}(x_i^\mathsf{T}\theta^{(k)}-y_i)\,x_{ij}$$
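A minimal NumPy sketch of batch gradient descent with this update rule (the learning rate, iteration count, and synthetic data are illustrative assumptions):

```python
import numpy as np

def gradient_descent(X, y, a=0.1, n_iter=1000):
    """Batch gradient descent for linear regression with J = (1/2n_d) * sum of squared residuals."""
    n_d, n_p = X.shape
    theta = np.zeros(n_p)                      # initialize theta
    for _ in range(n_iter):
        residuals = X @ theta - y              # (x_i^T theta - y_i) for all i
        gradient = (X.T @ residuals) / n_d     # (1/n_d) * sum_i (x_i^T theta - y_i) x_ij
        theta = theta - a * gradient           # simultaneous update of all theta_j
    return theta

# Illustrative data with an intercept column of ones
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(scale=0.1, size=200)
print(gradient_descent(X, y))                  # close to [0.5, 1.5, -2.0]
```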
1.4.2 Learning Rate
The learning rate $a$ used in the algorithm should be a positive constant ($a > 0$).
- The learning rate affects both convergence and the speed of convergence: too small a learning rate may lead to slow convergence, while too large a learning rate may lead to non-convergence or divergence.
- The learning rate can be selected using a validation set or cross-validation.
1.4.3 Stopping Criterion
Stop when the change in the cost is close to 0 (see the sketch below):
- Absolute error tolerance: $\varepsilon_{abs}=\left|J(\theta^{(k+1)})-J(\theta^{(k)})\right|$
- Relative error tolerance: $\varepsilon_{rel}=\left|\frac{J(\theta^{(k+1)})-J(\theta^{(k)})}{J(\theta^{(k+1)})}\right|$
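A variant of the same gradient descent loop that stops on the relative error tolerance $\varepsilon_{rel}$ (the tolerance value and iteration cap are illustrative assumptions):

```python
import numpy as np

def cost(X, y, theta):
    """J(theta) = (1/2n_d) * sum of squared residuals."""
    return np.sum((X @ theta - y) ** 2) / (2 * X.shape[0])

def gradient_descent_tol(X, y, a=0.1, tol_rel=1e-8, max_iter=100_000):
    theta = np.zeros(X.shape[1])
    J_old = cost(X, y, theta)
    for _ in range(max_iter):
        theta = theta - a * (X.T @ (X @ theta - y)) / X.shape[0]
        J_new = cost(X, y, theta)
        if abs(J_new - J_old) / abs(J_new) < tol_rel:   # relative error tolerance eps_rel
            break
        J_old = J_new
    return theta
```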
2 Logistic Regression
Output: $\{0,1\}$
Decision boundary: $x_i^{\mathsf{T}}\theta = 0$
Hypothesis:

$$h_\theta(x_i)=g(x_i^\mathsf{T}\theta) =\frac{1}{1+\exp(-x_i^\mathsf{T}\theta)}$$
Cost function (for a single observation):

$$J(\theta_0,\theta_1)=\left\{ \begin{array}{ll} -\log\left(h_{(\theta_0,\theta_1)}(x)\right), & \text{if } y=1 \\ -\log\left(1-h_{(\theta_0,\theta_1)}(x)\right), & \text{if } y=0 \end{array} \right.$$
General form of the cost function for $n$ samples:

$$J(\theta)=-\frac{1}{n}\sum^n_{i=1}\Big(y_i \log(h_\theta(x_i))+(1-y_i)\log(1-h_\theta(x_i))\Big)$$
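A minimal NumPy sketch of the hypothesis, cost function, and classification rule above (the clipping constant `eps` is an illustrative numerical safeguard, not part of the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(X, theta):
    """h_theta(x_i) = g(x_i^T theta) for every row x_i of X."""
    return sigmoid(X @ theta)

def logistic_cost(X, y, theta, eps=1e-12):
    """Cross-entropy cost J(theta) = -(1/n) * sum_i [ y_i log h + (1 - y_i) log(1 - h) ]."""
    h = np.clip(hypothesis(X, theta), eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def predict(X, theta):
    """Classify: y_hat = 1 when x_i^T theta >= 0, i.e. h_theta(x_i) >= 0.5."""
    return (X @ theta >= 0).astype(int)
```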
Classification:
- $x_i^\mathsf{T}\theta \ge 0 \Longrightarrow h_\theta(x_i) \ge 0.5 \Longrightarrow \hat{y}=1$. Then if $y=1$, $J(\theta) \rightarrow 0$; if $y=0$, $J(\theta) \rightarrow \infty$.
- $x_i^\mathsf{T}\theta < 0 \Longrightarrow h_\theta(x_i) < 0.5 \Longrightarrow \hat{y}=0$. Then if $y=1$, $J(\theta) \rightarrow \infty$; if $y=0$, $J(\theta) \rightarrow 0$.
3 Avoid Overfitting
- more overfitting $\Longrightarrow$ higher variance of the predictions
- more underfitting $\Longrightarrow$ higher bias of the predictions
3.1 Penalizing parameters
$$\sum^{n_d}_{i=1}(y_i - h_\theta(x_i))^2+\lambda r(\theta),$$

where $\lambda r(\theta)$ is the penalty term, $r(\theta)$ is the parameter penalty function, and $\lambda$ is the regularization parameter.
Regularization parameter $\lambda$:
- $\lambda \rightarrow 0$: the regularized regression estimates tend to the ordinary linear regression estimates
- $\lambda \rightarrow \infty$: the parameters are penalized more, and the model over-fits the data less
- select $\lambda$ via cross-validation
Parameter penalty functions $r(\theta)$

Some typical parameter penalty functions (cf. distance measures; a small sketch follows the list):
- $d$-variation function: $r(\theta)=\left(\sum^{n_p}_{i=1}|\theta_i|^d\right)^{1/d}$
- $L_1$ norm: $r(\theta)=\sum^{n_p}_{i=1}|\theta_i|$
- squared $L_2$ norm: $r(\theta)=\sum^{n_p}_{i=1}|\theta_i|^2$
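A small NumPy sketch of these penalty functions (the function names are illustrative):

```python
import numpy as np

def d_variation(theta, d):
    """r(theta) = ( sum_i |theta_i|^d )^(1/d)"""
    return np.sum(np.abs(theta) ** d) ** (1.0 / d)

def l1_penalty(theta):
    """r(theta) = sum_i |theta_i|   (L1 norm)"""
    return np.sum(np.abs(theta))

def squared_l2_penalty(theta):
    """r(theta) = sum_i |theta_i|^2   (squared L2 norm)"""
    return np.sum(theta ** 2)
```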
3.2 Regression model with penalty function
3.2.1 Lasso regression
Lasso regression uses the $L_1$ norm as penalty function. Lasso regression "zeroes out" coefficients, so it performs variable selection and, to some extent, parameter shrinkage.
- Cost function
$$J_L(\theta)=||y-X\theta||^2+\lambda||\theta||_1= \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2+\lambda\sum^{n_p}_{j=1}|\theta_j|$$
- Gradient of the cost function: $\frac{\partial J_L(\theta)}{\partial \theta_1},\dots,\frac{\partial J_L(\theta)}{\partial \theta_{n_p}}$
Note:
$$\frac{\partial}{\partial \theta_1}\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 = -2\sum^{n_d}_{i=1}x_{i1}\,(y_i - x_i^{\mathsf{T}}\theta)$$
- Lasso estimate
$$\hat\theta_L=\underset{\theta}{\mathrm{argmin}}\left(\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 + \lambda\sum^{n_p}_{j=1}|\theta_j| \right)$$
There is no closed-form solution (the $|\theta_j|$ terms are not differentiable at zero), but $J_L(\theta)$ is convex, so it can be solved numerically, e.g. with the least angle regression (LARS) algorithm; see the sketch below.
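A hedged usage sketch, assuming scikit-learn is available: its `LassoLars` estimator implements the least angle regression approach mentioned above, and `alpha` plays the role of the regularization parameter $\lambda$ (up to scaling); all values below are illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoLars

# Illustrative data: only the first two covariates actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = LassoLars(alpha=0.1)      # alpha ~ regularization strength
lasso.fit(X, y)
print(lasso.coef_)                # many coefficients are exactly zero ("zeroed out")
```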
3.2.2 Ridge regression
Ridge regression uses the squared $L_2$ norm as penalty function. It does not "zero out" coefficients, i.e. it cannot perform variable selection, but rather shrinks the parameter values.
- Cost function
$$J_R(\theta)=||y-X\theta||^2+\lambda||\theta||_2^2= \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2+\lambda\sum^{n_p}_{j=1}\theta_j^2$$
- Ridge estimate
$$\hat\theta_R = (X^\mathsf{T} X + \lambda I)^{-1}X^\mathsf{T}y$$
This is the closed-form solution; a NumPy sketch follows below.
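A minimal NumPy sketch of the closed-form ridge estimate (the data, the value of $\lambda$, and the choice of whether an intercept column is penalized are illustrative assumptions):

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """theta_R = (X^T X + lambda * I)^{-1} X^T y, computed via a linear solve."""
    n_p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_p), X.T @ y)

# Illustrative data with two highly correlated covariates (near multicollinearity)
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 1e-3 * rng.normal(size=200), rng.normal(size=200)])
y = X @ np.array([1.0, 1.0, -0.5]) + rng.normal(scale=0.1, size=200)

print(ridge_estimate(X, y, lam=1.0))
```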
The ridge estimate is biased because:
$$\begin{aligned} E(\hat\theta_R) &= E\left( (X^\mathsf{T} X + \lambda I)^{-1}X^\mathsf{T}y \right) \\ &=(X^\mathsf{T} X + \lambda I)^{-1}X^\mathsf{T} E(y)\\ &=(X^\mathsf{T} X + \lambda I)^{-1}X^\mathsf{T}X \theta\\ &=(X^\mathsf{T} X + \lambda I)^{-1}\left((X^\mathsf{T}X)^{-1}\right)^{-1}\theta\\ &=\left[(X^\mathsf{T}X)^{-1}(X^\mathsf{T} X + \lambda I)\right]^{-1} \theta\\ &=\left[ I + \lambda(X^\mathsf{T}X)^{-1}\right]^{-1}\theta \end{aligned}$$
$X^\mathsf{T}X$ is positive definite and $\lambda > 0$ by definition, so $\left[I + \lambda(X^\mathsf{T}X)^{-1}\right]^{-1} \neq I$ and therefore $E(\hat\theta_R)\neq \theta$.
Tips:
Although ridge regression does not perform variable selection, it performs grouped selection: if one variable among a group of correlated ones is selected, ridge regression automatically includes the whole group. Ridge regression can also handle near multicollinearity.
3.2.3 Elastic net
The elastic net penalty is a compromise between Lasso and ridge.
- Cost function
$$J_E(\theta)=||y-X\theta||^2+\lambda_1||\theta||_1+\lambda_2||\theta||_2^2$$
If $X$ is an $n_d\times n_p$ design matrix,
$$J_E(\theta)=\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2+\lambda_1\sum^{n_p}_{j=1}|\theta_j|+\lambda_2\sum^{n_p}_{j=1}\theta_j^2$$
- Elastic net estimate
$$\hat\theta_E =\underset{\theta}{\mathrm{argmin}}\left(||y-X\theta||^2+\lambda_1||\theta||_1+\lambda_2||\theta||_2^2\right)$$
There is no closed-form solution for $\hat\theta_E$, but $J_E(\theta)$ is convex, so it has a unique minimum; a usage sketch follows below.
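A hedged usage sketch, assuming scikit-learn is available: its `ElasticNet` estimator is parameterized by `alpha` (overall penalty strength) and `l1_ratio` (mix between the $L_1$ and $L_2$ penalties) rather than by $\lambda_1,\lambda_2$ directly; all values below are illustrative.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# alpha controls the overall penalty strength, l1_ratio the L1/L2 mix
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, y)
print(enet.coef_)
```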
4 Learning curve
Ways to increase prediction accuracy:
- variable selection: increase or reduce the number of covariates
- add polynomial features, e.g. $\{x^2, x_1x_2\}$
- regularized regression (Lasso, ridge regression and elastic nets)
- collect more data
Learning curve:
- A learning curve plots a metric of prediction accuracy, such as the cost function $J(\theta)$ or another error metric, as a function of a parameter that affects that metric (for example the training-set size); a sketch follows below.
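A minimal sketch of a learning curve, assuming the accuracy metric is the MSE and the varied parameter is the training-set size (all data below are synthetic and illustrative):

```python
import numpy as np

def mse(X, y, theta):
    return np.mean((X @ theta - y) ** 2)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=500)

# Hold out a validation set, then fit OLS on growing training subsets
X_train, y_train, X_val, y_val = X[:400], y[:400], X[400:], y[400:]
for n in [20, 50, 100, 200, 400]:
    theta = np.linalg.solve(X_train[:n].T @ X_train[:n], X_train[:n].T @ y_train[:n])
    # training error vs. validation error for each training-set size n
    print(n, mse(X_train[:n], y_train[:n], theta), mse(X_val, y_val, theta))
```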