Approximation & Fitting
6.3.1 Bi-criterion formulation
In the basic form of regularized approximation, the goal is to find a vector $x$ that is small (if possible), and that also makes the residual $Ax - b$ small. This is naturally described as a (convex) vector optimization problem with two objectives, $\|Ax - b\|$ and $\|x\|$:
$$ \mathrm{P6.7}: \quad \min \ (\text{w.r.t. } \mathbf{R}_+^2) \quad (\|Ax - b\|,\ \|x\|). $$

The two norms can be different: the first, used to measure the size of the residual, is on $\mathbf{R}^m$; the second, used to measure the size of $x$, is on $\mathbf{R}^n$.
The optimal trade-off between the two objectives can be found using several methods. The optimal trade-off curve of $\|Ax - b\|$ versus $\|x\|$, which shows how large one of the objectives must be made to have the other one small, can then be plotted. One endpoint of the optimal trade-off curve between $\|Ax - b\|$ and $\|x\|$ is easy to describe. The minimum value of $\|x\|$ is zero, and is achieved only when $x = 0$. For this value of $x$, the residual norm has the value $\|b\|$.
The other endpoint of the trade-off curve is more complicated to describe. Let $C$ denote the set of minimizers of $\|Ax - b\|$ (with no constraint on $\|x\|$). Then any minimum norm point in $C$ is Pareto optimal, corresponding to the other endpoint of the trade-off curve. In other words, Pareto optimal points at this endpoint are given by minimum norm minimizers of $\|Ax - b\|$. If both norms are Euclidean, this Pareto optimal point is unique, and given by $x = A^\dagger b$, where $A^\dagger$ is the pseudo-inverse of $A$.
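This endpoint can be checked numerically. Below is a minimal NumPy sketch (with hypothetical data chosen so that $A$ is rank deficient, making the minimizer of $\|Ax - b\|$ non-unique): the pseudo-inverse picks out the minimum-norm minimizer, and shifting along the null space of $A$ leaves the residual unchanged while increasing $\|x\|$.

```python
import numpy as np

# Hypothetical rank-deficient data: A has a repeated column, so the
# least-squares minimizer of ||Ax - b|| is not unique.
A = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# The pseudo-inverse gives the minimum Euclidean norm minimizer of
# ||Ax - b||, i.e. the Pareto-optimal endpoint of the trade-off curve.
x_min_norm = np.linalg.pinv(A) @ b

# Any other minimizer, obtained by moving along the null space of A,
# has the same residual but a larger norm.
null_dir = np.array([1.0, -1.0])  # A @ null_dir == 0
x_other = x_min_norm + 0.5 * null_dir

assert np.allclose(A @ null_dir, 0)
assert np.isclose(np.linalg.norm(A @ x_min_norm - b),
                  np.linalg.norm(A @ x_other - b))
assert np.linalg.norm(x_min_norm) < np.linalg.norm(x_other)
```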
6.3.2 Regularization
Regularization is a common scalarization method used to solve the bi-criterion problem (P6.7). One form of regularization is to minimize the weighted sum of the objectives:

$$ \mathrm{P6.8}: \quad \min \ \|Ax - b\| + \gamma \|x\|, $$

where $\gamma > 0$ is a problem (regularization) parameter. As $\gamma$ varies over $(0, \infty)$, the solution of (P6.8) traces out the optimal trade-off curve.
Another common method of regularization, especially when the Euclidean norm is used, is to minimize the weighted sum of squared norms:

$$ \min \ \|Ax - b\|^2 + \delta \|x\|^2, $$

for a variety of values of $\delta > 0$. These regularized approximation problems each solve the bi-criterion problem of making both $\|Ax - b\|$ and $\|x\|$ small, by adding an extra term or penalty associated with the norm of $x$.
Interpretations
Regularization is used in several contexts. In an estimation setting, the extra term penalizing large $\|x\|$ can be interpreted as our prior knowledge that $\|x\|$ is not too large. In an optimal design setting, the extra term adds the cost of using large values of the design variables to the cost of missing the target specifications.
The constraint that $\|x\|$ be small can also reflect a modeling issue. It might be, for example, that $y = Ax$ is only a good approximation of the true relationship $y = f(x)$ between $x$ and $y$. In order to have $f(x) \approx b$, we want $Ax \approx b$, and we also need $x$ small in order to ensure that $f(x) \approx Ax$.
Tikhonov regularization
The most common form of regularization is based on the weighted sum of squared norms above, with Euclidean norms, which results in a (convex) quadratic optimization problem:

$$ \min \ \|Ax - b\|_2^2 + \delta \|x\|_2^2 = x^T (A^T A + \delta I) x - 2 b^T A x + b^T b. $$
This Tikhonov regularization problem has the analytical solution

$$ x = (A^T A + \delta I)^{-1} A^T b. $$

Since $A^T A + \delta I \succ 0$ for any $\delta > 0$, the Tikhonov regularized least-squares solution requires no rank (or dimension) assumptions on the matrix $A$.
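As a quick illustration of this rank-independence, the sketch below (with randomly generated data, purely hypothetical) applies the analytical formula to a "wide" matrix with $m < n$, where plain least squares would be underdetermined, and verifies optimality by checking that the gradient of the objective vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)

# A deliberately wide matrix (m < n): unregularized least squares would be
# underdetermined, but Tikhonov regularization is still well posed.
m, n, delta = 5, 8, 0.1
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Analytical solution x = (A^T A + delta*I)^{-1} A^T b; we solve the linear
# system rather than forming the inverse explicitly.
x = np.linalg.solve(A.T @ A + delta * np.eye(n), A.T @ b)

# The gradient of ||Ax - b||^2 + delta*||x||^2 vanishes at the minimizer.
grad = 2 * A.T @ (A @ x - b) + 2 * delta * x
assert np.allclose(grad, 0)
```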
Smoothing regularization
The idea of regularization, i.e., adding to the objective a term that penalizes large $x$, can be extended in several ways. In one useful extension we add a regularization term of the form $\|Dx\|$, in place of $\|x\|$. In many applications, the matrix $D$ represents an approximate differentiation or second-order differentiation operator, so $\|Dx\|$ represents a measure of the variation or smoothness of $x$.
For example, suppose that the vector $x \in \mathbf{R}^n$ represents the value of some continuous physical parameter, say, temperature, along the interval $[0, 1]$: $x_i$ is the temperature at the point $i/n$. A simple approximation of the gradient or first derivative of the parameter near $i/n$ is given by $n(x_{i+1} - x_i)$, and a simple approximation of its second derivative is given by the second difference

$$ n\bigl(n(x_{i+1} - x_i) - n(x_i - x_{i-1})\bigr) = n^2 (x_{i+1} - 2x_i + x_{i-1}). $$
If $\Delta$ is the (tridiagonal, Toeplitz) matrix

$$ \Delta = n^2 \left[\begin{array}{rrrrrrrrr} 1 & -2 & 1 & 0 & \cdots & 0 & 0 & 0 & 0 \\ 0 & 1 & -2 & 1 & \cdots & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & -2 & \cdots & 0 & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & -2 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & \cdots & 1 & -2 & 1 & 0 \\ 0 & 0 & 0 & 0 & \cdots & 0 & 1 & -2 & 1 \end{array}\right] \in \mathbf{R}^{(n-2) \times n}, $$
then $\Delta x$ represents an approximation of the second derivative of the parameter, so $\|\Delta x\|_2^2$ represents a measure of the mean-square curvature of the parameter over the interval $[0, 1]$.
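The matrix $\Delta$ is easy to build with shifted identity matrices. The sketch below constructs it in NumPy and checks it on the (hypothetical) profile $x_i = (i/n)^2$, whose true second derivative is the constant $2$; for a quadratic, the second difference is exact.

```python
import numpy as np

n = 100
# Second-difference matrix Delta in R^{(n-2) x n}, scaled by n^2 as in the
# text: each row computes n^2 * (x_{i-1} - 2*x_i + x_{i+1}).
Delta = n**2 * (np.eye(n - 2, n, k=0)
                - 2 * np.eye(n - 2, n, k=1)
                + np.eye(n - 2, n, k=2))

# Sample a smooth "temperature profile" x_i = f(i/n) with f(t) = t^2,
# whose second derivative is identically 2.
t = np.arange(1, n + 1) / n
x = t**2

# The second difference of a quadratic is exact, not just approximate.
assert Delta.shape == (n - 2, n)
assert np.allclose(Delta @ x, 2.0)
```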
The Tikhonov regularized problem

$$ \min \ \|Ax - b\|_2^2 + \delta \|\Delta x\|_2^2 $$

can be used to trade off the objective $\|Ax - b\|_2^2$, which might represent a measure of fit, or consistency with experimental data, and the objective $\|\Delta x\|_2^2$, which is (approximately) the mean-square curvature of the underlying physical parameter. The parameter $\delta$ is used to control the amount of regularization, or to plot the optimal trade-off curve of fit versus smoothness.
We can also add several regularization terms. For example, we can add terms associated with smoothness ($\|\Delta x\|$) and size ($\|x\|$), as in

$$ \mathrm{P6.9}: \quad \min \ \|Ax - b\|_2^2 + \delta \|\Delta x\|_2^2 + \eta \|x\|_2^2. $$

Here the parameter $\delta \geq 0$ is used to control the smoothness of the approximate solution, and the parameter $\eta \geq 0$ is used to control its size.
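With Euclidean norms, P6.9 is itself an ordinary least-squares problem: stacking $A$, $\sqrt{\delta}\,\Delta$, and $\sqrt{\eta}\,I$ into one tall matrix gives an equivalent single-norm problem. A sketch with randomly generated (hypothetical) data, using an unscaled second-difference matrix for simplicity:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 30, 50
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Second-difference matrix (unscaled here, for simplicity).
Delta = (np.eye(n - 2, n, k=0) - 2 * np.eye(n - 2, n, k=1)
         + np.eye(n - 2, n, k=2))

delta, eta = 2.0, 0.5

# Stack the three terms: minimizing ||M x - rhs||_2^2 is exactly P6.9.
M = np.vstack([A, np.sqrt(delta) * Delta, np.sqrt(eta) * np.eye(n)])
rhs = np.concatenate([b, np.zeros(n - 2), np.zeros(n)])
x, *_ = np.linalg.lstsq(M, rhs, rcond=None)

# Optimality: the gradient of the P6.9 objective vanishes at x.
grad = 2 * (A.T @ (A @ x - b) + delta * Delta.T @ (Delta @ x) + eta * x)
assert np.allclose(grad, 0)
```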
6.3.3 Reconstruction, smoothing, and de-noising
In reconstruction problems, we start with a signal represented by a vector $x \in \mathbf{R}^n$. It is usually assumed that the signal does not vary too rapidly, which means that usually we have $x_i \approx x_{i+1}$. (In this section we consider signals in one dimension, e.g., audio signals, but the same ideas can be applied to signals in two or more dimensions, e.g., images or video.)
The signal $x$ is corrupted by an additive noise $v$:

$$ x_{\mathrm{cor}} = x + v. $$

The goal is to form an estimate $\hat{x}$ of the original signal $x$, given the corrupted signal $x_{\mathrm{cor}}$. This process is called signal reconstruction (since we are trying to reconstruct the original signal from the corrupted version) or de-noising (since we are trying to remove the noise from the corrupted signal). Most reconstruction methods end up performing some sort of smoothing operation on $x_{\mathrm{cor}}$ to produce $\hat{x}$, so the process is also called smoothing.
One simple formulation of the reconstruction problem is the bi-criterion problem

$$ \min \ (\text{w.r.t. } \mathbf{R}_+^2) \quad (\|\hat{x} - x_{\mathrm{cor}}\|_2,\ \phi(\hat{x})), $$

where $\hat{x}$ is the variable and $x_{\mathrm{cor}}$ is a problem parameter. The function $\phi: \mathbf{R}^n \rightarrow \mathbf{R}$ is convex, and is called the regularization function or smoothing objective.
Quadratic smoothing
The simplest reconstruction method uses the quadratic smoothing function

$$ \phi_{\mathrm{quad}}(x) = \sum_{i=1}^{n-1} (x_{i+1} - x_i)^2 = \|Dx\|_2^2, $$

where $D \in \mathbf{R}^{(n-1) \times n}$ is the bidiagonal matrix

$$ D = \left[\begin{array}{rrrrrrr} -1 & 1 & 0 & \cdots & 0 & 0 & 0 \\ 0 & -1 & 1 & \cdots & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & -1 & 1 & 0 \\ 0 & 0 & 0 & \cdots & 0 & -1 & 1 \end{array}\right]. $$
We can obtain the optimal trade-off between $\|\hat{x} - x_{\mathrm{cor}}\|_2$ and $\|D\hat{x}\|_2$ by minimizing

$$ \|\hat{x} - x_{\mathrm{cor}}\|_2^2 + \delta \|D\hat{x}\|_2^2, $$

where $\delta > 0$ parametrizes the optimal trade-off curve.
The solution of this quadratic problem,

$$ \hat{x} = (I + \delta D^T D)^{-1} x_{\mathrm{cor}}, $$

can be computed very efficiently since $I + \delta D^T D$ is tridiagonal.
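The formula can be tried end to end on synthetic data. The sketch below (hypothetical signal and noise level, and $\delta = 10$ chosen by hand rather than from the trade-off curve) smooths a noisy sinusoid; for clarity it uses a dense solve, though in practice the tridiagonal structure of $I + \delta D^T D$ admits an $O(n)$ banded solver.

```python
import numpy as np

rng = np.random.default_rng(2)
n, delta = 200, 10.0

# A slowly varying signal plus noise (hypothetical data for illustration).
t = np.arange(n) / n
x_true = np.sin(2 * np.pi * t)
x_cor = x_true + 0.3 * rng.standard_normal(n)

# First-difference matrix D in R^{(n-1) x n}.
D = np.eye(n - 1, n, k=1) - np.eye(n - 1, n, k=0)

# Quadratic smoothing: x_hat = (I + delta * D^T D)^{-1} x_cor.
# (Dense solve for clarity; a banded solver would exploit tridiagonality.)
x_hat = np.linalg.solve(np.eye(n) + delta * D.T @ D, x_cor)

# The smoothed estimate is closer to the true signal than the corrupted one.
assert np.linalg.norm(x_hat - x_true) < np.linalg.norm(x_cor - x_true)
```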
Total variation reconstruction
Simple quadratic smoothing works well as a reconstruction method when the original signal is very smooth, and the noise is rapidly varying. But any rapid variations in the original signal will, obviously, be attenuated or removed by quadratic smoothing.
In this section we describe a reconstruction method that can remove much of the noise, while still preserving occasional rapid variations in the original signal. The method is based on the smoothing function

$$ \phi_{\mathrm{tv}}(\hat{x}) = \sum_{i=1}^{n-1} \left| \hat{x}_{i+1} - \hat{x}_i \right| = \|D\hat{x}\|_1, $$

which is called the total variation of $x \in \mathbf{R}^n$. Like the quadratic smoothness measure $\phi_{\mathrm{quad}}$, the total variation function assigns large values to rapidly varying $\hat{x}$. The total variation measure, however, assigns relatively less penalty to large values of $|\hat{x}_{i+1} - \hat{x}_i|$.
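This difference in how large jumps are penalized can be seen directly. In the sketch below (a hypothetical step signal versus a ramp covering the same range), total variation charges both the same, since only the sum of absolute increments matters, while the quadratic measure penalizes the single large jump far more:

```python
import numpy as np

n = 100
# A signal with one large jump: constant 0, then constant 1.
step = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])
# A ramp covering the same range in many small increments.
ramp = np.linspace(0.0, 1.0, n)

# First-difference matrix D in R^{(n-1) x n}.
D = np.eye(n - 1, n, k=1) - np.eye(n - 1, n, k=0)

phi_tv = lambda x: np.abs(D @ x).sum()      # total variation ||Dx||_1
phi_quad = lambda x: ((D @ x) ** 2).sum()   # quadratic measure ||Dx||_2^2

# Total variation charges the jump and the ramp the same amount...
assert np.isclose(phi_tv(step), phi_tv(ramp))
# ...while the quadratic measure penalizes the single large jump far more.
assert phi_quad(step) > 10 * phi_quad(ramp)
```

This is why total variation reconstruction can preserve occasional sharp transitions that quadratic smoothing would wash out.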