Nonlinear Least Squares
Let's start with a simple problem:
$$\min_{x}\ \frac{1}{2}\left\| f(x) \right\|_2^2$$
where $x \in \mathbb{R}^n$ and $f$ is an arbitrary function.
When $f$ is simple, we can set the derivative to zero,
$$\frac{\mathrm{d}f}{\mathrm{d}x} = 0,$$
and solve this equation directly to obtain the extrema or saddle points.
When $f$ is complicated, $\dfrac{\partial f}{\partial x}$ is hard to obtain, or its zeros (the extrema or saddle points) are hard to solve for, so we have to compute the minimum iteratively.
"Complicated" here means the derivative has no convenient closed form, or, even if it can be written down, the resulting equation is hard to solve.
The iterative scheme
- Start from an initial value $x_{0}$.
- At the $k$-th iteration, find an increment $\Delta x_{k}$ such that $\left\| f(x_{k} + \Delta x_{k}) \right\|_2^2$ reaches a minimum.
- If $\Delta x_{k}$ is small enough, stop.
- Otherwise, set $x_{k+1} = x_{k} + \Delta x_{k}$ and return to step 2.
So the question becomes: how do we find this increment?
- Methods for determining the increment: gradient-descent strategies, first-order or second-order.
- Taylor expansion:
$$\left\| f(x_{k} + \Delta x_{k}) \right\|_2^2 \approx \left\| f(x_{k}) \right\|_2^2 + J(x)\Delta x + \frac{1}{2} \Delta x^T H \Delta x$$
where $J$ and $H$ are the Jacobian and Hessian matrices, respectively.
- Keeping only the first-order term:
$$\min_{\Delta x}\ \left\| f(x_{k}) \right\|_2^2 + J(x)\Delta x$$
The increment direction is $\Delta x^* = -J^T(x)$; in practice a step size usually has to be computed as well.
This method is called steepest descent.
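As a concrete illustration, here is a minimal steepest-descent sketch in Python. The residual `f`, the central-difference Jacobian, the fixed step size, and the ill-scaled toy problem are all my own assumptions for demonstration; the gradient of $\frac{1}{2}\left\| f(x) \right\|_2^2$ is taken as $J(x)^T f(x)$, with $J$ the Jacobian of $f$.

```python
import numpy as np

def steepest_descent(f, x0, step=0.05, tol=1e-8, max_iter=10000):
    """Minimize 0.5*||f(x)||^2 by stepping along the negative gradient J(x)^T f(x)."""
    x = x0.astype(float)
    eps = 1e-6
    for _ in range(max_iter):
        fx = f(x)
        # central-difference Jacobian of f at x, shape (m, n)
        J = np.zeros((fx.size, x.size))
        for i in range(x.size):
            d = np.zeros_like(x)
            d[i] = eps
            J[:, i] = (f(x + d) - f(x - d)) / (2 * eps)
        grad = J.T.dot(fx)              # gradient of 0.5*||f(x)||^2
        if np.linalg.norm(grad) < tol:  # stop when the gradient is tiny
            break
        x = x - step * grad             # fixed step size for simplicity
    return x

# toy usage: f(x) = A x - b, an ill-scaled linear problem that makes the slowness visible
A = np.array([[1.0, 0.0], [0.0, 5.0]])
b = np.array([1.0, 5.0])
print(steepest_descent(lambda x: A.dot(x) - b, np.array([0.0, 0.0])))  # approaches [1, 1]
```

On this ill-scaled problem the fixed-step iterates creep along the shallow direction for hundreds of steps, which is exactly the slowness complained about in the next section.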
If the second-order term is kept as well:
$$\Delta x^* = \arg\min\ \left\| f(x_{k}) \right\|_2^2 + J(x)\Delta x + \frac{1}{2} \Delta x^T H \Delta x$$
Setting the derivative of the right-hand side with respect to $\Delta x$ to zero gives $H \Delta x = -J^T$.
Keeping the second-order term in this way is Newton's method.
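For comparison, here is a hedged sketch of a single Newton step on the scalar cost $F(x) = \frac{1}{2}\left\| f(x) \right\|_2^2$, with the gradient and Hessian built by finite differences; the quadratic toy cost and the step width `eps` are my own assumptions, chosen so that one step lands exactly on the minimizer.

```python
import numpy as np

def newton_step(F, x, eps=1e-4):
    """One Newton step for a scalar cost F: solve H * dx = -g at the point x."""
    n = x.size
    g = np.zeros(n)
    H = np.zeros((n, n))
    for i in range(n):
        ei = np.zeros(n)
        ei[i] = eps
        g[i] = (F(x + ei) - F(x - ei)) / (2 * eps)  # central-difference gradient
        for j in range(n):
            ej = np.zeros(n)
            ej[j] = eps
            H[i, j] = (F(x + ei + ej) - F(x + ei - ej)
                       - F(x - ei + ej) + F(x - ei - ej)) / (4 * eps ** 2)  # central-difference Hessian
    return np.linalg.solve(H, -g)  # dx = -H^{-1} g

# toy usage: a quadratic cost, so one Newton step from (0, 0) lands on the minimum (1, -2)
F = lambda x: 0.5 * ((x[0] - 1.0) ** 2 + 5.0 * (x[1] + 2.0) ** 2)
x = np.array([0.0, 0.0])
print(x + newton_step(F, x))
```

The step itself is cheap to apply, but forming $H$ already costs $O(n^2)$ cost evaluations here, which hints at the trade-off discussed next.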
Problems
Both steepest descent and Newton's method have drawbacks:
- Steepest descent is too slow: every step follows the locally steepest direction, so consecutive steps are nearly orthogonal and the iterates zigzag toward the minimum. Even simple problems take many steps; the method is simply too greedy (the zigzag problem).
- Newton's method needs far fewer iterations and handles strong curvature well, but it requires the full Hessian matrix, a computation we generally want to avoid on a computer whenever possible.
Solution
Can we keep the second-order behaviour while simplifying the Hessian computation?
Gauss-Newton
- First-order approximation of $f(x)$: $f(x + \Delta x) \approx f(x) + J(x)\Delta x$
- The squared error then becomes
$$\begin{aligned} \frac{1}{2}\left\| f(x) + J(x)\Delta x \right\|_2^2 &= \frac{1}{2} \left(f(x) + J(x)\Delta x\right)^T \left(f(x) + J(x)\Delta x\right)\\ &= \frac{1}{2}\left(\left\| f(x) \right\|_2^2 + 2 f(x)^T J(x)\Delta x + \Delta x^T J(x)^T J(x)\Delta x\right)\end{aligned}$$
Setting the derivative with respect to $\Delta x$ to zero:
$$2J(x)^T f(x) + 2J(x)^T J(x)\Delta x = 0$$
$$J(x)^T J(x)\Delta x = -J(x)^T f(x)$$
which is the increment (normal) equation
$$H \Delta x = g$$
In G-N, the expression $J(x)^T J(x)$ built from $J$ is used to approximate $H$.
Iteration steps (a minimal code sketch of this loop follows the list):
- Start from an initial value $x_{0}$.
- At the $k$-th iteration, compute the current Jacobian $J(x_{k})$ and the error $f(x_{k})$.
- Solve the increment equation $H\Delta x = g$.
- If $\Delta x$ is small enough, stop; otherwise set $x_{k+1} = x_{k} + \Delta x_{k}$ and return to step 2.
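The sketch below maps these steps onto NumPy. The exponential model, its analytic Jacobian, the near-truth initial guess, and the stopping threshold are all my own assumptions for illustration; the full L-M implementation later in this post follows the same overall structure but uses numerical derivatives instead.

```python
import numpy as np

def gauss_newton(f, jac, x0, tol=1e-10, max_iter=50):
    """Gauss-Newton: at each iteration solve (J^T J) dx = -J^T f and update x."""
    x = x0.astype(float)
    for _ in range(max_iter):
        r = f(x)    # residual f(x_k)
        J = jac(x)  # Jacobian J(x_k)
        # increment equation H dx = g with H = J^T J and g = -J^T f
        dx = np.linalg.solve(J.T.dot(J), -J.T.dot(r))
        if np.linalg.norm(dx) < tol:  # stop when the increment is small enough
            break
        x = x + dx
    return x

# toy usage: fit y = a * exp(b * t) on noiseless data, starting near the true (a, b) = (2.0, 0.5)
t = np.linspace(0, 4, 30)
y = 2.0 * np.exp(0.5 * t)
f = lambda p: p[0] * np.exp(p[1] * t) - y                       # residual
jac = lambda p: np.column_stack((np.exp(p[1] * t),              # d residual / d a
                                 p[0] * t * np.exp(p[1] * t)))  # d residual / d b
print(gauss_newton(f, jac, np.array([1.5, 0.4])))               # approaches [2.0, 0.5]
```

Note that the normal equations are solved with `np.linalg.solve` instead of forming an explicit inverse, which is both cheaper and numerically safer.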
Levenberg-Marquardt
G-N is simple and practical, but in $\Delta x = H^{-1}g$ there is no guarantee that the approximate Hessian $J^T J$ is invertible (the quadratic approximation may be unreliable).
The Levenberg-Marquardt method improves on G-N in this respect.
- G-N is a line-search method: first pick a direction, then determine a step length along it.
- L-M is a trust-region method: the approximation is considered reliable only inside a region.
$$\rho = \frac{f(x + \Delta x) - f(x)}{J(x)\Delta x}$$
This is the ratio of the actual decrease to the approximate (predicted) decrease. If $\rho$ is close to 1, the approximation is reliable; otherwise it is not:
- If $\rho$ is too small, the actual decrease is far less than predicted, so shrink the trust region.
- If $\rho$ is large enough, the actual decrease exceeds the prediction, so enlarge the trust region (see the short sketch below).
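One common way to turn this rule into code uses the usual 1/4 and 3/4 thresholds; these exact constants are my assumption, and the L-M code later in this post achieves the same effect by adjusting the damping factor `u` directly instead of an explicit radius.

```python
def update_radius(rho, radius):
    """Adjust the trust-region radius from the gain ratio rho = actual / predicted decrease."""
    if rho > 3 / 4:
        return 2.0 * radius  # approximation is reliable: enlarge the region
    if rho < 1 / 4:
        return 0.5 * radius  # approximation is poor: shrink the region
    return radius            # otherwise keep the radius unchanged

print(update_radius(0.9, 1.0))  # -> 2.0
print(update_radius(0.1, 1.0))  # -> 0.5
```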
Iteration steps
Pseudocode
Code
```python
import numpy as np
import matplotlib.pyplot as plt
# input data, whose shape is (num_data,1)
# data_input=np.array([[0.25, 0.5, 1, 1.5, 2, 3, 4, 6, 8]]).T
# data_output=np.array([[19.21, 18.15, 15.36, 14.10, 12.89, 9.32, 7.45, 5.24, 3.01]]).T
tao = 10 ** -3
threshold_stop = 10 ** -15
threshold_step = 10 ** -15
threshold_residual = 10 ** -15
residual_memory = []
# construct a user function
def my_Func(params, input_data):
    a = params[0, 0]
    b = params[1, 0]
    # c = params[2,0]
    # d = params[3,0]
    return a * np.exp(b * input_data)
    # return a*np.sin(b*input_data[:,0])+c*np.cos(d*input_data[:,1])
# generate the input_data and output_data, whose shapes are both (num_data,1)
def generate_data(params, num_data):
    x = np.array(np.linspace(0, 10, num_data)).reshape(num_data, 1)  # generate noisy data
    mid, sigma = 0, 5
    y = my_Func(params, x) + np.random.normal(mid, sigma, num_data).reshape(num_data, 1)
    return x, y
# calculate the numerical derivative with respect to the selected parameter, whose shape is (num_data,1)
def cal_deriv(params, input_data, param_index):
    params1 = params.copy()
    params2 = params.copy()
    params1[param_index, 0] += 0.000001
    params2[param_index, 0] -= 0.000001
    data_est_output1 = my_Func(params1, input_data)
    data_est_output2 = my_Func(params2, input_data)
    return (data_est_output1 - data_est_output2) / 0.000002
# calculate the Jacobian matrix, whose shape is (num_data,num_params)
def cal_Jacobian(params, input_data):
    num_params = np.shape(params)[0]
    num_data = np.shape(input_data)[0]
    J = np.zeros((num_data, num_params))
    for i in range(0, num_params):
        J[:, i] = cal_deriv(params, input_data, i).flatten()  # flatten (num_data,1) into a column
    return J
# calculate the residual, whose shape is (num_data,1)
def cal_residual(params, input_data, output_data):
    data_est_output = my_Func(params, input_data)
    residual = output_data - data_est_output
    return residual
'''
# calculating Hessian matrix, whose shape is (num_params,num_params)
def cal_Hessian_LM(Jacobian, u, num_params):
    H = Jacobian.T.dot(Jacobian) + u * np.eye(num_params)
    return H
# calculating g, whose shape is (num_params,1)
def cal_g(Jacobian, residual):
    g = Jacobian.T.dot(residual)
    return g
# calculating s, whose shape is (num_params,1)
def cal_step(Hessian_LM, g):
    s = Hessian_LM.I.dot(g)
    return s
'''
# get the initial u, using the equation u = tao * max(Aii)
def get_init_u(A, tao):
    m = np.shape(A)[0]
    Aii = []
    for i in range(0, m):
        Aii.append(A[i, i])
    u = tao * max(Aii)
    return u
# LM algorithm
def LM(num_iter, params, input_data, output_data):
    num_params = np.shape(params)[0]  # the number of params
    k = 0  # set the init iteration count to 0
    # calculate the init residual
    residual = cal_residual(params, input_data, output_data)
    # calculate the init Jacobian matrix
    Jacobian = cal_Jacobian(params, input_data)
    A = Jacobian.T.dot(Jacobian)  # calculate the init A
    g = Jacobian.T.dot(residual)  # calculate the init gradient g
    stop = (np.linalg.norm(g, ord=np.inf) <= threshold_stop)  # set the init stop flag
    u = get_init_u(A, tao)  # set the init u
    v = 2  # set the init v = 2
    while (not stop) and (k < num_iter):
        k += 1
        while True:
            Hessian_LM = A + u * np.eye(num_params)  # Hessian used in LM: J^T J + u*I
            step = np.linalg.inv(Hessian_LM).dot(g)  # calculate the update step
            if np.linalg.norm(step) <= threshold_step:
                stop = True
            else:
                new_params = params + step  # update params using step
                new_residual = cal_residual(new_params, input_data, output_data)  # new residual with new params
                rou = (np.linalg.norm(residual) ** 2 - np.linalg.norm(new_residual) ** 2) / (step.T.dot(u * step + g))
                rou = rou.item()  # step.T.dot(...) is a (1,1) array, so reduce the gain ratio to a scalar
                if rou > 0:
                    params = new_params
                    residual = new_residual
                    residual_memory.append(np.linalg.norm(residual) ** 2)
                    # print(np.linalg.norm(new_residual) ** 2)
                    Jacobian = cal_Jacobian(params, input_data)  # recalculate the Jacobian with new params
                    A = Jacobian.T.dot(Jacobian)  # recalculate A
                    g = Jacobian.T.dot(residual)  # recalculate gradient g
                    stop = (np.linalg.norm(g, ord=np.inf) <= threshold_stop) or (
                        np.linalg.norm(residual) ** 2 <= threshold_residual)
                    u = u * max(1 / 3, 1 - (2 * rou - 1) ** 3)  # good step: reduce the damping
                    v = 2
                else:
                    u = u * v  # rejected step: increase the damping
                    v = 2 * v
            if stop or rou > 0:  # check stop first so rou need not be defined when the step was tiny
                break
    return params
def main():
    # set the true params for the generate_data() function
    params = np.zeros((2, 1))
    params[0, 0] = 10.0
    params[1, 0] = 0.8
    num_data = 100  # set the number of data points
    data_input, data_output = generate_data(params, num_data)  # generate data as requested
    # set the init params for the LM algorithm
    params[0, 0] = 6.0
    params[1, 0] = 0.3
    # estimate params using the LM algorithm
    num_iter = 100  # the number of iterations
    est_params = LM(num_iter, params, data_input, data_output)
    print(est_params)
    a_est = est_params[0, 0]
    b_est = est_params[1, 0]
    # plot the data and the fitted curve
    plt.scatter(data_input, data_output, color='b')
    x = np.arange(0, 100) * 0.1  # 100 points from 0 to 10 with spacing 0.1
    plt.plot(x, a_est * np.exp(b_est * x), 'r', lw=1.0)
    plt.xlabel("2018.06.13")
    plt.savefig("result_LM.png")
    plt.show()
    plt.plot(residual_memory)
    plt.xlabel("2018.06.13")
    plt.savefig("error-iter.png")
    plt.show()

if __name__ == '__main__':
    main()
```
A few notes
- The constrained optimization inside the trust region can be converted into an unconstrained problem with a Lagrange multiplier:
$$\min_{\Delta x}\ \frac{1}{2}\left\| f(x_{k}) + J(x)\Delta x \right\|_2^2 + \frac{\lambda}{2}\left\| D\Delta x \right\|_2^2$$
- Expanding as in G-N, the increment equation becomes
$$(H + \lambda D^T D)\Delta x = g$$
- In the L-M method, taking $D = I$ gives
$$(H + \lambda I) \Delta x = g$$
- Compared with G-N, L-M keeps the increment equation positive definite; it treats the approximation as valid only within a limited range and shrinks that range when the approximation is poor.
- Looking at the increment equation, L-M can be seen as a blend of first-order and second-order methods, with the parameter $\lambda$ controlling the weight between the two.
The way I see it, L-M combines steepest descent and Gauss-Newton very nicely. Steepest descent alone takes too long and performs poorly on strongly curved problems; Gauss-Newton alone suffers from a possibly non-invertible approximate Hessian, and computing a true Hessian is not computer-friendly anyway. Combining the two gives L-M: $(H + \lambda I) \Delta x = g$. When $H$ is zero it is a first-order method, and when $\lambda$ is zero it reduces to Gauss-Newton, so the initial value of $\lambda$ is also very important; see the code:
```python
def get_init_u(A, tao):
    m = np.shape(A)[0]
    Aii = []
    for i in range(0, m):
        Aii.append(A[i, i])
    u = tao * max(Aii)
    return u
```
For non-convex problems the result is very sensitive to the initial value. What is a non-convex problem? See the figure:
In that case the iteration can get trapped in a local minimum, whereas what we want is the global minimum (please bear with my handwriting!). At my previous employer I built the pharmacokinetics module myself, and in my view the whole selling point of that module was providing customers with a very good set of initial values: if the initial guess is good, L-M performs extremely well, but if the initial guess is poor, it can do worse than the simplex method (which I consider a thoroughly middling algorithm).
Code reference: https://its304.com/article/wolfcsharp/89674973 . I have tried this code and it works; one thing to note when using it is that since my update minimizes the cost, the direction of the gradient update needs to be flipped accordingly.
A highly recommended paper: http://www2.imm.dtu.dk/pubdb/edoc/imm3215.pdf
If you prefer video, my walkthrough is on Bilibili: https://www.bilibili.com/video/BV1134y1k7gv/
Note: I made slides on this when giving a training at my previous company but did not take them with me when I left. Most of the formulas in this post come from Prof. Gao's 14 Lectures on Visual SLAM, which I should acknowledge here.