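A Deeper Look
On the one hand, the belief-propagation derivation of the AMP algorithm does not make it easy to see intuitively what AMP is really doing. For example, why does the "Onsager" term remove the correlation between the estimation error and the sensing matrix during the iterations? On the other hand, the state evolution analysis of AMP is quite involved, yet an intuitive picture of how it evolves would deepen our understanding of AMP considerably. Below we work backwards from the AMP iteration and a Taylor expansion to analyze and interpret both points.
Understanding the "Onsager" Term
Recall the AMP iteration: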
Let the sensing matrix be $\boldsymbol A \in \mathbb R^{m \times n}$ with entries $a_{ij} \sim \mathcal N(0, 1/m)$. The AMP iteration is:

$$
\begin{aligned}
\text{Linear: } \boldsymbol \nu^t &= \boldsymbol y - \boldsymbol A \hat{\boldsymbol x}^t + \underbrace{\frac{n}{m} \boldsymbol \nu^{t-1} \, \text{div}\left( \eta(\boldsymbol r^{t-1}) \right)}_{\text{Onsager term}} \\
\text{Non-linear: } \hat{\boldsymbol x}^{t+1} &= \eta \left( \underbrace{\hat{\boldsymbol x}^t + \boldsymbol A^T \boldsymbol \nu^t}_{\boldsymbol r^t} \right)
\end{aligned}
$$

Here $\text{div}(\eta(\boldsymbol r)) = \frac{1}{n} \sum_j \eta'(r_j)$ is the average derivative of the componentwise function $\eta$ over the entries of $\boldsymbol r$.
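The iteration above can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions: the text does not specify the denoiser $\eta$, so soft thresholding is used here, with the threshold set to `alpha * ||nu|| / sqrt(m)`. The names `soft_threshold`, `amp`, `alpha`, and the test problem sizes are all illustrative choices, not part of the original derivation.

```python
import numpy as np

def soft_threshold(r, theta):
    """Componentwise soft thresholding (assumed choice for eta)."""
    return np.sign(r) * np.maximum(np.abs(r) - theta, 0.0)

def amp(y, A, n_iter=30, alpha=1.5):
    """AMP iteration with the Onsager correction in the residual update."""
    m, n = A.shape
    x_hat = np.zeros(n)
    nu = y.copy()  # nu^0 = y - A x_hat^0 with x_hat^0 = 0 (no Onsager term yet)
    for _ in range(n_iter):
        theta = alpha * np.linalg.norm(nu) / np.sqrt(m)  # assumed threshold rule
        r = x_hat + A.T @ nu                  # r^t = x_hat^t + A^T nu^t
        x_new = soft_threshold(r, theta)      # x_hat^{t+1} = eta(r^t)
        div = np.mean(np.abs(r) > theta)      # (1/n) sum_j eta'(r_j)
        nu = y - A @ x_new + (n / m) * nu * div  # linear step with Onsager term
        x_hat = x_new
    return x_hat

# Small synthetic problem (all sizes are illustrative assumptions).
rng = np.random.default_rng(0)
m, n, k, sigma = 500, 1000, 50, 0.01
A = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, n))  # a_ij ~ N(0, 1/m)
x0 = np.zeros(n)
x0[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
y = A @ x0 + sigma * rng.standard_normal(m)

x_rec = amp(y, A)
rel_err = np.linalg.norm(x_rec - x0) / np.linalg.norm(x0)
print("relative error:", rel_err)
```

For the soft threshold the derivative $\eta'$ is 1 where the input survives thresholding and 0 elsewhere, so the divergence reduces to the fraction of entries above the threshold.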
We first consider the term $\boldsymbol A \hat{\boldsymbol x}^t$. For its $i$-th entry, with $\boldsymbol a_i^T$ denoting the $i$-th row of $\boldsymbol A$:

$$
\begin{aligned}
[\boldsymbol A \hat{\boldsymbol x}^t]_i
&= \boldsymbol a_i^T \eta \left( \hat{\boldsymbol x}^{t-1} + \sum_{l} \boldsymbol a_l \nu_l^{t-1} \right) \\
&= \boldsymbol a_i^T \eta \left( \underbrace{\hat{\boldsymbol x}^{t-1} + \sum_{l \neq i} \boldsymbol a_l \nu_l^{t-1}}_{\boldsymbol r_i^{t-1}} + \boldsymbol a_i \nu_i^{t-1} \right) \\
&= \boldsymbol a_i^T \left( \eta(\boldsymbol r_i^{t-1}) + \frac{\partial \eta}{\partial \boldsymbol r}(\boldsymbol r_i^{t-1}) \, \boldsymbol a_i \nu_i^{t-1} + O(1/m) \right) \quad \text{(Taylor expansion)} \\
&= \boldsymbol a_i^T \eta(\boldsymbol r_i^{t-1}) + \nu_i^{t-1} \sum_{j} a_{ij}^2 \, \eta'(r_{ij}^{t-1}) + O(1/\sqrt m) \\
&= \boldsymbol a_i^T \eta(\boldsymbol r_i^{t-1}) + \frac{n}{m} \nu_i^{t-1} \underbrace{\frac{1}{n} \sum_{j} \eta'(r_{ij}^{t-1})}_{\text{div}\left( \eta(\boldsymbol r_i^{t-1}) \right)} + O(1/\sqrt m)
\end{aligned}
$$

Here $\boldsymbol r_i^{t-1}$ is $\boldsymbol r^{t-1}$ with the contribution of row $i$ left out, and $r_{ij}^{t-1}$ is its $j$-th component. Since $\eta$ acts componentwise, its Jacobian is diagonal with entries $\eta'(r_{ij}^{t-1})$. The last step replaces $a_{ij}^2$ by its mean $1/m$.
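The last step of the derivation, replacing $\sum_j a_{ij}^2 \eta'(r_{ij})$ by $\frac{n}{m} \cdot \frac{1}{n} \sum_j \eta'(r_{ij})$, can be checked numerically. The sketch below is a rough sanity check rather than a proof. It assumes the derivative values are 0/1 (as for a soft threshold) and drawn independently of $\boldsymbol A$, which stands in for the leave-one-out independence between $\boldsymbol r_i$ and row $\boldsymbol a_i$. The variable names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 2000
A = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, n))  # a_ij ~ N(0, 1/m)

# Stand-in for eta'(r_j): 0/1 values drawn independently of A (assumption).
eta_prime = (rng.random(n) < 0.3).astype(float)

lhs = (A ** 2) @ eta_prime          # sum_j a_ij^2 * eta'(r_j), one value per row i
rhs = (n / m) * eta_prime.mean()    # (n/m) * (1/n) * sum_j eta'(r_j)

rel_dev = np.abs(lhs - rhs) / rhs   # per-row relative deviation
print("row average of lhs:", lhs.mean(), " rhs:", rhs)
print("largest per-row relative deviation:", rel_dev.max())
```

Each row deviates from the averaged value only by a fluctuation that shrinks as the dimensions grow, which is why the Onsager term can use a single scalar divergence for all rows.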
Therefore, stacking the rows (row $i$ uses its own leave-one-out vector $\boldsymbol r_i^{t-1}$),

$$
\boldsymbol A \hat{\boldsymbol x}^t = \boldsymbol A \eta(\boldsymbol r_i^{t-1}) + \frac{n}{m} \boldsymbol \nu^{t-1} \, \text{div}\left( \eta(\boldsymbol r_i^{t-1}) \right) + O(1/\sqrt m)
$$
We can now examine the correlation between $\boldsymbol \nu^t$ and $\boldsymbol A$. Using $\boldsymbol y = \boldsymbol A \boldsymbol x_0 + \boldsymbol w$:

$$
\begin{aligned}
\boldsymbol \nu^t
&\overset{(a)}{=} \boldsymbol y - \boldsymbol A \hat{\boldsymbol x}^t + \underbrace{\frac{n}{m} \boldsymbol \nu^{t-1} \, \text{div}\left( \eta(\boldsymbol r^{t-1}) \right)}_{\text{Onsager term}} \\
&= \boldsymbol A \boldsymbol x_0 + \boldsymbol w - \left[ \boldsymbol A \eta(\boldsymbol r_i^{t-1}) + \frac{n}{m} \boldsymbol \nu^{t-1} \, \text{div}\left( \eta(\boldsymbol r_i^{t-1}) \right) \right] + \underbrace{\frac{n}{m} \boldsymbol \nu^{t-1} \, \text{div}\left( \eta(\boldsymbol r^{t-1}) \right)}_{\text{Onsager term}} + O(1/\sqrt m) \\
&\overset{(b)}{\rightarrow} \boldsymbol A \Big( \boldsymbol x_0 - \underbrace{(\boldsymbol x_0 + \boldsymbol \epsilon)}_{\eta(\boldsymbol r_i^{t-1})} \Big) + \boldsymbol w \\
&= -\boldsymbol A \boldsymbol \epsilon + \boldsymbol w
\end{aligned}
$$

In step (b), $\eta(\boldsymbol r_i^{t-1})$ is the leave-one-out counterpart of the current estimate $\hat{\boldsymbol x}^t = \eta(\boldsymbol r^{t-1})$, with the dependence on row $\boldsymbol a_i$ removed. We write it as $\boldsymbol x_0 + \boldsymbol \epsilon$, so that $\boldsymbol \epsilon$ plays the role of the estimation error $\hat{\boldsymbol x}^t - \boldsymbol x_0$. (The original text labels this quantity $\hat{\boldsymbol x}^{t-1}$, but the nonlinear step gives $\eta(\boldsymbol r^{t-1}) = \hat{\boldsymbol x}^t$.)
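Note that in (a), the correlation between $\boldsymbol \nu^t$ and the matrix $\boldsymbol A$ enters through two terms: $\boldsymbol A \hat{\boldsymbol x}^t$ and the Onsager term. The linear step of AMP is built so that the Onsager contribution cancels, which removes that part of the correlation. The correlation carried by the remaining term, as (b) shows, fades as the estimation error shrinks.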
On the other hand, we must also consider the correlation between $\hat{\boldsymbol x}^t$ and the matrix $\boldsymbol A$. In the nonlinear step of AMP, this correlation is introduced through $\boldsymbol A^T \boldsymbol \nu^t$, for which

$$
\boldsymbol A^T \boldsymbol \nu^t = -\boldsymbol A^T \boldsymbol A \boldsymbol \epsilon + \boldsymbol A^T \boldsymbol w
$$
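It is usually assumed that $\boldsymbol A$ and $\boldsymbol w$ are independent, and the term $\boldsymbol A^T \boldsymbol A \boldsymbol \epsilon$ can be treated along the same lines as above. In fact, the most strongly correlated contribution is the Onsager term (obtained directly from the Taylor expansion), but because the linear step cancels it, the dependence is greatly reduced.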
An Intuitive View of State Evolution
Recap of the AMP State Evolution Analysis
If the noise is $\boldsymbol w \sim \mathcal N(\boldsymbol 0, \sigma^2 \boldsymbol I)$, the state evolution of AMP is:

$$
\begin{aligned}
\text{for } t &= 0, 1, 2, \cdots \\
\tau_t^2 &= \sigma^2 + \frac{n}{m} \mathcal E^t \\
\mathcal E^{t+1} &= \mathbb E \left\{ \left[ \eta^t \left( X_0 + \mathcal N(0, \tau_t^2) \right) - X_0 \right]^2 \right\}
\end{aligned}
$$

Here $\mathcal E^t$ is the per-coordinate mean squared error of $\hat{\boldsymbol x}^t$. The original text writes $\mathcal E^t$ on the left of the last line, which makes $\tau_t^2$ and $\mathcal E^t$ define each other. The index is shifted to $\mathcal E^{t+1}$ so that it follows from $\tau_t^2$, in line with $\hat{\boldsymbol x}^{t+1} = \eta(\boldsymbol r^t)$.
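The state evolution can be run as a scalar recursion by estimating the expectation with Monte Carlo samples. The sketch below makes several assumptions that are not in the text. It reads the recursion as $\tau_t^2 = \sigma^2 + \frac{n}{m} \mathcal E^t$ followed by $\mathcal E^{t+1} = \mathbb E\{[\eta(X_0 + \tau_t Z) - X_0]^2\}$ and starts from $\hat{\boldsymbol x}^0 = \boldsymbol 0$. It takes $\eta$ to be a soft threshold at `alpha * tau` and uses a Bernoulli-Gaussian prior for $X_0$. The function names are hypothetical.

```python
import numpy as np

def soft_threshold(r, theta):
    """Componentwise soft thresholding (assumed choice for eta)."""
    return np.sign(r) * np.maximum(np.abs(r) - theta, 0.0)

def bernoulli_gaussian(rng, size, eps=0.05):
    """Samples of X_0: N(0, 1) with probability eps, else 0 (assumed prior)."""
    return rng.standard_normal(size) * (rng.random(size) < eps)

def state_evolution(sigma2, n_over_m, alpha=1.5, n_iter=30, n_mc=200_000, seed=0):
    """Return the sequence tau_t^2, t = 0..n_iter-1, via Monte Carlo."""
    rng = np.random.default_rng(seed)
    x0 = bernoulli_gaussian(rng, n_mc)
    mse = np.mean(x0 ** 2)          # E^0: error of the initial estimate x_hat^0 = 0
    tau2_path = []
    for _ in range(n_iter):
        tau2 = sigma2 + n_over_m * mse              # tau_t^2 = sigma^2 + (n/m) E^t
        tau2_path.append(tau2)
        tau = np.sqrt(tau2)
        z = rng.standard_normal(n_mc)
        x_est = soft_threshold(x0 + tau * z, alpha * tau)
        mse = np.mean((x_est - x0) ** 2)            # E^{t+1}
    return np.array(tau2_path)

sigma2 = 1e-4
tau2_path = state_evolution(sigma2, n_over_m=2.0)
print("first values of tau_t^2:", tau2_path[:3])
print("last value of tau_t^2:  ", tau2_path[-1])
```

In this setting the effective noise variance $\tau_t^2$ drops quickly from its initial value and settles slightly above $\sigma^2$, which is the fixed point of the recursion.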
Consider the error term $\boldsymbol e_t = \boldsymbol r^t - \boldsymbol x_0$. Substituting the limit $\boldsymbol \nu^t \rightarrow \boldsymbol A(\boldsymbol x_0 - \hat{\boldsymbol x}^t) + \boldsymbol w$ obtained above gives

$$
\begin{aligned}
\boldsymbol e_t &= \boldsymbol r^t - \boldsymbol x_0 \\
&= \hat{\boldsymbol x}^t + \boldsymbol A^T \boldsymbol \nu^t - \boldsymbol x_0 \\
&\rightarrow \hat{\boldsymbol x}^t + \boldsymbol A^T \left[ \boldsymbol A (\boldsymbol x_0 - \hat{\boldsymbol x}^t) + \boldsymbol w \right] - \boldsymbol x_0 \\
&= (\boldsymbol I - \boldsymbol A^T \boldsymbol A)(\hat{\boldsymbol x}^t - \boldsymbol x_0) + \boldsymbol A^T \boldsymbol w
\end{aligned}
$$
By the central limit theorem and the distribution of $\boldsymbol A$, each entry of $(\boldsymbol I - \boldsymbol A^T \boldsymbol A)$ is approximately Gaussian, $\mathcal N(0, 1/m)$. Treating this matrix as independent of $\hat{\boldsymbol x}^t - \boldsymbol x_0$, we obtain

$$
\left\Vert (\boldsymbol I - \boldsymbol A^T \boldsymbol A)(\hat{\boldsymbol x}^t - \boldsymbol x_0) \right\Vert_2^2 \rightarrow \frac{n}{m} \left\Vert \hat{\boldsymbol x}^t - \boldsymbol x_0 \right\Vert_2^2 \quad (m, n \rightarrow \infty)
$$
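The norm scaling above is easy to probe numerically. The following sketch is a sanity check under the same independence assumption. It draws a vector `v` independently of $\boldsymbol A$ as a stand-in for $\hat{\boldsymbol x}^t - \boldsymbol x_0$, then compares $\Vert (\boldsymbol I - \boldsymbol A^T \boldsymbol A) \boldsymbol v \Vert_2^2 / \Vert \boldsymbol v \Vert_2^2$ with $n/m$, averaged over several draws. The helper name `ratio_once` and the sizes are illustrative.

```python
import numpy as np

def ratio_once(m, n, rng):
    """One draw of ||(I - A^T A) v||^2 / ||v||^2 with v independent of A."""
    A = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, n))  # a_ij ~ N(0, 1/m)
    v = rng.standard_normal(n)        # stand-in for x_hat^t - x_0 (assumption)
    u = v - A.T @ (A @ v)             # (I - A^T A) v without forming A^T A
    return (u @ u) / (v @ v)

rng = np.random.default_rng(0)
m, n = 200, 400
ratios = np.array([ratio_once(m, n, rng) for _ in range(50)])
print("average ratio:", ratios.mean(), " n/m:", n / m)
```

In actual AMP iterations the error vector is not independent of $\boldsymbol A$; the argument relies on the Onsager term having removed the dominant part of that dependence.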
Therefore, since each entry of $\boldsymbol A^T \boldsymbol w$ has variance $\Vert \boldsymbol w \Vert_2^2 / m \rightarrow \sigma^2$, as $m, n \rightarrow \infty$ with $n/m$ fixed,

$$
\begin{aligned}
\frac{1}{n} \Vert \boldsymbol e_t \Vert_2^2
&\rightarrow \frac{n}{m} \cdot \frac{1}{n} \left\Vert \hat{\boldsymbol x}^t - \boldsymbol x_0 \right\Vert_2^2 + \sigma^2 \\
&= \frac{n}{m} \, \mathbb E\left[ (\hat x^t - x_0)^2 \right] + \sigma^2
\end{aligned}
$$

which has the same form as the update $\tau_t^2 = \sigma^2 + \frac{n}{m} \mathcal E^t$.
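This recovers the AMP state evolution equation at an intuitive level. The key assumption in the derivation, however, is that the correlation between $\hat{\boldsymbol x}^t$ and the matrix $\boldsymbol A$ has already been eliminated (the dominant part of that correlation is removed by the Onsager term).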