Application Examples of Gradient Descent
1. Solving the LASSO Problem with the Gradient Method
The original form of the LASSO problem is:
$$\min f(x)=\frac{1}{2}\|Ax-b\|^2+\mu\|x\|_1 \tag{1}$$
The objective function $f(x)$ of the LASSO problem is nonsmooth: at some points its gradient does not exist, so the gradient method cannot be applied directly to the original problem. Note that the nonsmooth term is $\|x\|_1$, which is simply the sum of the absolute values of the components of $x$. If we can find a smooth function that approximates the absolute value function, the gradient method becomes applicable to the LASSO problem. In practice, we may consider the following one-dimensional smooth function:
$$l_\delta(x)=\begin{cases}\frac{1}{2\delta}x^2, & |x|<\delta \\ |x|-\frac{\delta}{2}, & \text{otherwise}\end{cases}$$
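A quick computation (added here for context, immediate from the definition above) shows that the approximation error is uniformly bounded:

$$0\le |x|-l_\delta(x)\le\frac{\delta}{2}\quad\text{for all }x,$$

with the gap equal to exactly $\delta/2$ when $|x|\ge\delta$, and growing from $0$ to $\delta/2$ on $|x|<\delta$. So the choice of $\delta$ directly controls how far the smoothed objective can be from the original one.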
This function is in fact a variant of the Huber loss. As $\delta\to 0$, the smooth function $l_\delta(x)$ gets closer and closer to the absolute value function. The figure below shows the graph of $l_\delta(x)$ for several values of $\delta$:
The experiment code is given below. The Huber function:
function fx = huber_func(x, delta)
% smoothed absolute value; delta = 0 reduces to |x| (avoids 1/(2*delta) = Inf,
% which would otherwise produce NaN via 0 .* Inf in the masked expression)
if delta == 0
    fx = abs(x);
    return;
end
fx = (abs(x) < delta) .* 1 / (2 * delta) .* x.^2 + ...
(abs(x) >= delta) .* (abs(x) - delta / 2);
end
The plotting script:
x = -4:1e-3:4;
fx = abs(x);
delta0 = 0;
delta1 = 0.02;
delta2 = 0.2;
delta3 = 1;
fx0 = huber_func(x, delta0);
fx1 = huber_func(x, delta1);
fx2 = huber_func(x, delta2);
fx3 = huber_func(x, delta3);
figure, plot(x, fx);
hold on
plot(x, fx0);
plot(x, fx1);
plot(x, fx2);
plot(x, fx3);
legend('|x|', '\delta=0', '\delta=0.02', '\delta=0.2', '\delta=1')
We therefore construct the smoothed LASSO problem:
$$\min f_\delta(x)=\frac{1}{2}\|Ax-b\|^2+\mu L_\delta(x) \tag{2}$$
where $\delta$ is a given smoothing parameter and $L_\delta(x)=\sum_{i=1}^{n}l_\delta(x_i)$, i.e., the one-dimensional smooth function $l_\delta$ is applied to each component of $x$ and the results are summed. The gradient of $f_\delta(x)$ is easily computed:
$$\nabla f_\delta(x)=A^T(Ax-b)+\mu\nabla L_\delta(x)\tag{3}$$
where $\nabla L_\delta$ is defined component-wise:
$$(\nabla L_\delta(x))_i=\begin{cases}\mathrm{sign}(x_i), & |x_i|>\delta, \\ \dfrac{x_i}{\delta}, & |x_i|\le\delta\end{cases}\tag{4}$$
Clearly, the gradient of $f_\delta(x)$ is Lipschitz continuous with constant $L=\|A^TA\|_2+\frac{\mu}{\delta}$. By the convergence theorem for the gradient method, a fixed step size must not exceed $\frac{1}{L}$ to guarantee convergence; when $\delta$ is very small, we therefore have to choose a sufficiently small step size $\alpha_k$ for the gradient method to converge.
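This Lipschitz constant can be justified by a short calculation (a sketch added here, not part of the original text), bounding the almost-everywhere Hessian of the smoothed objective:

$$\nabla^2 f_\delta(x)=A^TA+\mu\,\nabla^2 L_\delta(x),\qquad 0\preceq\nabla^2 L_\delta(x)\preceq\frac{1}{\delta}I,$$

since $l_\delta''(x_i)=1/\delta$ for $|x_i|<\delta$ and $0$ for $|x_i|>\delta$; hence $\|\nabla^2 f_\delta(x)\|_2\le\|A^TA\|_2+\mu/\delta=L$ wherever the Hessian exists.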
Numerical experiment for the LASSO problem
The objective function:
function fx = func_lasso(x, A, b, mu, delta)
% quadratic branch is x^2 / (2*delta), matching the definition of l_delta
L_delta = 0.5 / delta * x.^2 .* (abs(x) < delta) + ...
(abs(x) - 0.5 * delta) .* (abs(x) >= delta);
L_delta = sum(L_delta);
fx = 0.5 * norm(A * x - b, 'fro')^2 + mu * L_delta;
end
The gradient function:
function gx = gfunc_lasso(x, A, b, mu, delta)
grad_L_delta = sign(x) .* (abs(x) > delta) + ...
x / delta .* (abs(x) <= delta);
gx = A' * (A * x - b) + mu * grad_L_delta;
end
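As a sanity check (not part of the original experiment), the analytic gradient can be compared against a central finite-difference approximation on a small random problem; all sizes and parameters below are arbitrary:

```matlab
% finite-difference check of gfunc_lasso (small, arbitrary problem sizes)
rng(0);
A = randn(5, 8); b = randn(5, 1);
mu = 1e-2; delta = 1e-1;
x = randn(8, 1); h = 1e-6;
g = gfunc_lasso(x, A, b, mu, delta);
g_fd = zeros(size(x));
for i = 1:numel(x)
    e = zeros(size(x)); e(i) = h;
    g_fd(i) = (func_lasso(x + e, A, b, mu, delta) - ...
               func_lasso(x - e, A, b, mu, delta)) / (2 * h);
end
fprintf('max |analytic - numeric| = %g\n', max(abs(g - g_fd)));
```

If the objective and gradient are consistent, the printed value should be near the finite-difference accuracy limit (accuracy degrades near the kink points $|x_i|=\delta$).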
Gradient descent with BB step size:
function [fmin, xmin] = gradient_descent(func, gfunc, x0, epsilon)
iIter = 1;
iterMax = 33000;
xOld = x0;
alphaMin = 1e-5;
alphaMax = 1e5;
M = 10;
%alpha = 0.5;
%[~, ~, alpha] = armijo_rule(func, gfunc, x0, 0.5, -feval(gfunc, x0));
alpha = 0.2;
QOld = 1;
COld = feval(func, xOld);
c1 = 0.5;
eta = 0.5;
% xk = zeros(size(x0, 1), 11);
% fk = zeros(11, 1);
% xk(:, 1) = x0;
% fk(1, :) = feval(func, x0);
while iIter < iterMax
grad = feval(gfunc, xOld);
dk = -grad;
% Zhang-Hager nonmonotone line search
for i = 1:M
xNew = xOld + alpha * dk;
fNew = feval(func, xNew);
% sufficient decrease: f(x + a*dk) <= C + c1*a*grad'*dk, and dk = -grad
if fNew <= COld - alpha * c1 * dot(dk, dk)
break;
end
alpha = alpha * eta;
end
%xNew = xOld - alpha * dk;
iIter = iIter + 1;
if norm(grad, 2) < epsilon
break;
end
% BB step-size calculation
sk = xNew - xOld; yk = feval(gfunc, xNew) - feval(gfunc, xOld);
if mod(iIter, 2) == 1
alpha = dot(sk, yk) / dot(yk, yk);
else
alpha = dot(sk, sk) / dot(sk, yk);
end
alpha = max(min(alpha, alphaMax), alphaMin);
QNew = eta * QOld + 1;
CNew = (eta * QOld * COld + fNew) / QNew;
COld = CNew;
QOld = QNew;  % keep Q in sync with C; otherwise C is averaged with a stale Q
xOld = xNew;
% xk(:, iIter) = xNew;
% fk(iIter, :) = fNew;
fprintf('iIter = %d, fx = %f\n', iIter, fNew);
end
if iIter == iterMax
fprintf('exceed max iteration, not found minimal point x.\n');
end
xmin = xNew;
fmin = feval(func, xmin);
fprintf('iIter = %d, fmin = %f\n', iIter, fmin);
end
The Armijo-rule line search:
function [fnew, xnew, alpha] = armijo_rule(func, gfunc, x0, alpha0, dk)
c1 = 1e-3;
alpha = alpha0;
gamma = 0.8;
iIter = 0;
iterMax = 200;
alphaMin = 1e-5;
f0 = feval(func, x0);
gd = feval(gfunc, x0)' * dk;  % directional derivative, computed once
while iIter < iterMax
xnew = x0 + alpha * dk;
fnew = feval(func, xnew);
% Armijo condition: f(x0 + alpha*dk) <= f(x0) + c1 * alpha * grad' * dk
if fnew <= f0 + c1 * alpha * gd
break;
end
if alpha < alphaMin
break;
end
alpha = gamma * alpha;
iIter = iIter + 1;
end
if iIter == iterMax
alpha = alphaMin;
fprintf('reach maximum iteration, and not found suitable alpha.\n');
end
xnew = x0 + alpha * dk;
fnew = feval(func, xnew);
end
The main script:
m = 512;
n = 1024;
r = 0.1;
mu = 1e-3;
delta = 1e-3;
epsilon = 1e-3;
A = randn(m, n);
u = full(sprandn(n, 1, r));  % densify the sparse signal for plotting
b = A * u;
u0 = zeros(n, 1);
func =@(x)(func_lasso(x, A, b, mu, delta));
gfunc = @(x)(gfunc_lasso(x, A, b, mu, delta));
[fmin, xmin] = gradient_descent(func, gfunc, u0, epsilon);
figure
subplot(211)
plot(u, 'ro');
hold on
plot(xmin, 'bx')
legend('original signal', 'lasso solution')
subplot(212)
plot(abs(xmin - u), '-')
title('absolute difference')
The computed results are shown in the figure below.
2. Solving the Tikhonov Regularization Model
Tikhonov regularization was proposed by Tikhonov et al. in 1977 for solving ill-posed problems. Here we discuss its application to image processing. Let $x$ denote the true image, an $m\times n$ array of pixels, and let $y$ denote the noisy image, i.e., $y=x+e$, where $e$ is Gaussian white noise. To recover the original image $x$ from the noisy image $y$, we can use the idea of Tikhonov regularization to build the following model:
$$\min_{x} f(x)=\frac{1}{2}\|x-y\|_F^2+\lambda(\|D_1 x\|_F^2+\|D_2x\|_F^2) \tag{5}$$
where $D_1x$ and $D_2x$ denote the forward differences of $x$ in the horizontal and vertical directions, respectively, i.e.
$$(D_1x)_{ij}=\frac{1}{h}(x_{i+1,j}-x_{ij}),\qquad (D_2x)_{ij}=\frac{1}{h}(x_{i,j+1}-x_{ij}) \tag{6}$$
where $h$ is a given discretization step. Model (5) consists of two terms. The first is the fidelity term, requiring the true image $x$ not to differ too much from the noisy image $y$; the Frobenius norm is used here because we assume the noise is Gaussian white noise. The second is the Tikhonov regularization term, which constrains the properties of $x$ itself: it expresses the wish that no part of the original image $x$ vary too sharply (i.e., the sum of squares of the horizontal and vertical differences should not be too large). This kind of regularization gives the recovered $x$ fairly good smoothness. The objective of model (5) is smooth, and its gradient is
$$\nabla f(x)=x-y-2\lambda\Delta x,\tag{7}$$
where $\Delta$ is the discrete Laplacian of the image $x$, i.e.
$$(\Delta x)_{ij}=\frac{x_{i+1,j}+x_{i-1,j}+x_{i,j+1}+x_{i,j-1}-4x_{ij}}{h^2}\tag{8}$$
Therefore, the iteration of the gradient method is:
$$\begin{aligned} x^{k+1} &=x^{k}-t\nabla f(x^{k}) \\ &=x^{k}-t(x^{k}-y-2\lambda\Delta x^{k}) \\ &=x^{k}-t(I-2\lambda\Delta)x^{k}+ty \end{aligned}$$
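As a side note (not in the original derivation), a rough fixed-step stability bound follows from the spectrum of the 5-point discrete Laplacian: the eigenvalues of $-\Delta$ lie in $[0,8/h^2]$, so

$$\nabla^2 f=I+2\lambda(-\Delta),\qquad \|\nabla^2 f\|_2\le 1+\frac{16\lambda}{h^2},$$

and a fixed step $t\le 1/(1+16\lambda/h^2)$ is safe by the same $1/L$ criterion as before; the BB step size used in the implementation below adapts the step automatically instead.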
Numerical experiment
The objective function (the discretization step is taken as $h=1$ in the code):
function fx = func(x, y, lambda)
[height, width] = size(x);
dx = zeros(height, width);
dy = zeros(height, width);
dx(1:end-1, :) = x(2:end, :) - x(1:end-1, :);
dy(:, 1:end-1) = x(:, 2:end) - x(:, 1:end-1);
% ||x-y||^2 = norm(x-y, 'fro');
fx = 0.5 * norm(x-y, 'fro')^2 + lambda * ( norm(dx, 'fro')^2 + norm(dy, 'fro')^2);
end
The gradient of the objective (the replicate padding implements Neumann boundary conditions):
function g = gfunc(x, y,lambda)
[height, width] = size(x);
x_pad = zeros(height + 2, width + 2);
x_pad(2:end-1, 2:end-1) = x;
x_pad(1, 2:end-1) = x(1, :);
x_pad(end, 2:end-1) = x(end, :);
x_pad(:, 1) = x_pad(:, 2);
x_pad(:, end) = x_pad(:, end-1);
div_x = x_pad(2+1:end-1+1, 2:end-1) + x_pad(2-1:end-1-1, 2:end-1) + ...
x_pad(2:end-1, 2+1:end-1+1) + x_pad(2:end-1, 2-1:end-1-1) - 4 * x;
g = x - y - 2 * lambda * div_x;
end
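On the interior of the image, the padded neighbor sum in gfunc should agree with a conv2 stencil; the following check (not in the original; sizes and parameters are arbitrary) compares the two:

```matlab
% compare gfunc's Laplacian term with a conv2 stencil on the interior
rng(1);
x = rand(6, 7); y = rand(6, 7); lambda = 0.3;
g = gfunc(x, y, lambda);
lap = conv2(x, [0 1 0; 1 -4 1; 0 1 0], 'same');  % zero-padded convolution
g_ref = x - y - 2 * lambda * lap;
err = max(max(abs(g(2:end-1, 2:end-1) - g_ref(2:end-1, 2:end-1))));
fprintf('interior mismatch = %g\n', err);
```

The two agree exactly on the interior; the boundary rows and columns differ because gfunc uses replicate padding while conv2 zero-pads.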
Gradient descent with BB step size (for images):
function [fmin, xmin] = gradient_descent(func, gfunc, x0, epsilon)
iIter = 1;
iterMax = 2000;
xOld = x0;
alphaMin = 1e-5;
alphaMax = 1e5;
M = 10;
[~, ~, alpha] = armijo_rule(func, gfunc, x0, 0.5, -feval(gfunc, x0));
QOld = 1;
COld = feval(func, xOld);
c1 = 0.5;
eta = 0.5;
% xk = zeros(size(x0, 1), 11);
% fk = zeros(201, 1);
% xk(:, 1) = x0;
% fk(1, :) = feval(func, x0);
fk = 0;
while iIter < iterMax
grad = feval(gfunc, xOld);
dk = -grad;
% Zhang-Hager nonmonotone line search
for i = 1:M
xNew = xOld + alpha * dk;
fNew = feval(func, xNew);
% sufficient decrease: f(x + a*dk) <= C + c1*a*<grad, dk>, and dk = -grad
if fNew <= COld - alpha * c1 * dot(dk(:), dk(:))
break;
end
alpha = alpha * eta;
end
%xNew = xOld - alpha * dk;
iIter = iIter + 1;
xDiff = xNew - xOld;
fDiff = fNew - feval(func, xOld);
xDiffNorm = norm(xDiff, 'fro');
fDiffNorm = abs(fDiff);
if norm(grad, 'fro')^2 < epsilon || (xDiffNorm < epsilon && fDiffNorm < epsilon)
break;
end
% BB step-size calculation
sk = xNew - xOld; yk = feval(gfunc, xNew) - feval(gfunc, xOld);
if mod(iIter, 2) == 1
alpha = dot(sk(:), yk(:)) / dot(yk(:), yk(:));
else
alpha = dot(sk(:), sk(:)) / dot(sk(:), yk(:));
end
alpha = max(min(alpha, alphaMax), alphaMin);
QNew = eta * QOld + 1;
CNew = (eta * QOld * COld + fNew) / QNew;
COld = CNew;
QOld = QNew;  % keep Q in sync with C; otherwise C is averaged with a stale Q
xOld = xNew;
% xk(:, iIter) = xNew;
fk = fNew;
fprintf('iIter = %d, fx = %f\n', iIter, fk);
end
if iIter == iterMax
fprintf('exceed max iteration, not found minimal point x.\n');
end
xmin = xNew;
fmin = feval(func, xmin);
fprintf('iIter = %d, fmin = %f\n', iIter, fmin);
end
The Armijo-rule line search:
function [fnew, xnew, alpha] = armijo_rule(func, gfunc, x0, alpha0, dk)
c1 = 1e-3;
alpha = alpha0;
gamma = 0.8;
iIter = 0;
iterMax = 200;
alphaMin = 1e-5;
f0 = feval(func, x0);
g0 = feval(gfunc, x0);
gd = dot(g0(:), dk(:));  % directional derivative; works for matrix iterates
while iIter < iterMax
xnew = x0 + alpha * dk;
fnew = feval(func, xnew);
% Armijo condition: f(x0 + alpha*dk) <= f(x0) + c1 * alpha * <grad, dk>
if fnew <= f0 + c1 * alpha * gd
break;
end
if alpha < alphaMin
break;
end
alpha = gamma * alpha;
iIter = iIter + 1;
end
if iIter == iterMax
alpha = alphaMin;
fprintf('reach maximum iteration, and not found suitable alpha.\n');
end
xnew = x0 + alpha * dk;
fnew = feval(func, xnew);
end
The Tikhonov regularization main script:
u = load('tower.mat');
u = u.B1;
x = double(u);
[height, width] = size(u);
x0 = x;
y = x + 20 * randn(height, width);
% y = mat2gray(y);
% x = mat2gray(x);
lambda0 = 0.5;
lambda1 = 2;
epsilon = 1e-4;
func0 = @(x)(func(x, y, lambda0));
gfunc0 = @(x)(gfunc(x, y, lambda0));
[fmin, xmin] = gradient_descent(func0, gfunc0, y, epsilon);
figure, imshow(uint8(y)), title('noisy image');
figure, imshow(uint8(xmin)), title('tikhonov regularization lambda = 0.5');
func1 = @(x)(func(x, y, lambda1));
gfunc1 = @(x)(gfunc(x, y, lambda1));
[fmin, xmin] = gradient_descent(func1, gfunc1, y, epsilon);
figure, imshow(uint8(xmin)), title('tikhonov regularization lambda = 2');
lambda2 = 20;
func2 = @(x)(func(x, y, lambda2));
gfunc2 = @(x)(gfunc(x, y, lambda2));
[fmin, xmin] = gradient_descent(func2, gfunc2, y, epsilon);
figure, imshow(uint8(xmin)), title('tikhonov regularization lambda = 20');
The simulation results are shown in the figures below.
The numerical experiments show that the Tikhonov regularization model can remove noise from the image effectively, but its drawback is also obvious: the processed image is overly smooth, and the boundaries between objects become blurred, especially when $\lambda$ is large. The reason for this behavior is that the regularization term in model (5) is not well chosen; in later experiments we will consider modifying the regularizer in the model.