Reference:
- Kay, S. M., *Fundamentals of Statistical Signal Processing: Estimation Theory*. Prentice Hall PTR, 1993 (Ch. 10 - 10.6).
- Slides of ET4386, TUD
Content
We now depart from the classical approach to statistical estimation, in which the parameter $\theta$ of interest is assumed to be a deterministic but unknown constant. Instead, we assume that $\theta$ is a random variable whose particular realization we must estimate.
Motivation:
- The Bayesian approach can incorporate the available prior knowledge into our estimator.
- Bayesian estimation is useful in situations where an MVU estimator cannot be found, for example, when no unbiased estimator has a variance uniformly less than that of all other unbiased estimators. In this instance, it may still be true that for most values of the parameter an estimator can be found whose mean square error is less than that of all other estimators.
Prior Knowledge and Estimation
It is a fundamental rule of estimation theory that the use of prior knowledge will lead to a more accurate estimator.
Reconsider the DC in WGN problem:
$$x[n]=A+w[n],\quad n=0,\cdots,N-1,\qquad w[n]\sim \mathcal N(0,\sigma^2)$$
It can be shown that the MVU estimator of $A$ is the sample mean $\bar x$. However, this assumed that $A$ could take on any value in the interval $-\infty<A<\infty$. Due to physical constraints it may be more reasonable to assume that $A$ can take on only values in the finite interval $-A_0\le A\le A_0$. Obviously, we would expect to improve our estimation if we used the truncated sample mean estimator
$$\check A=\begin{cases}-A_0 & \bar x<-A_0\\ \bar x & -A_0\le \bar x\le A_0\\ A_0 & \bar x>A_0 \end{cases}$$
which would be consistent with the known constraints. Such an estimator would have the PDF
$$p_{\check A}(\xi;A)=\Pr\{\bar x \le -A_0 \}\,\delta(\xi+A_0)+p_{\hat A}(\xi;A)\left[u(\xi+A_0)-u(\xi-A_0)\right]+\Pr\{\bar x \ge A_0 \}\,\delta(\xi-A_0)$$
where
$u(x)$ is the unit step function.
It is seen that $\check A$ is a biased estimator. However, if we compare the MSE of the two estimators, we note that for any $A$ in the interval $-A_0\le A\le A_0$
$$\begin{aligned} \operatorname{mse}(\hat{A})&=\int_{-\infty}^{\infty}(\xi-A)^{2} p_{\hat{A}}(\xi;A)\, d\xi \\ &=\int_{-\infty}^{-A_{0}}(\xi-A)^{2} p_{\hat{A}}(\xi;A)\, d\xi+\int_{-A_{0}}^{A_{0}}(\xi-A)^{2} p_{\hat{A}}(\xi;A)\, d\xi +\int_{A_{0}}^{\infty}(\xi-A)^{2} p_{\hat{A}}(\xi;A)\, d\xi \\ &>\int_{-\infty}^{-A_{0}}(-A_0-A)^{2} p_{\hat{A}}(\xi;A)\, d\xi+\int_{-A_{0}}^{A_{0}}(\xi-A)^{2} p_{\hat{A}}(\xi;A)\, d\xi +\int_{A_{0}}^{\infty}(A_0-A)^{2} p_{\hat{A}}(\xi;A)\, d\xi \\ &= \left(-A_{0}-A\right)^{2} \Pr\{\bar x \le -A_0 \}+\int_{-A_{0}}^{A_{0}}(\xi-A)^{2} p_{\hat{A}}(\xi;A)\, d\xi +\left(A_{0}-A\right)^{2} \Pr\{\bar x \ge A_0 \} \\ &= \operatorname{mse}(\check{A}) \end{aligned}$$
Hence, $\check A$, the truncated sample mean estimator, is better than the sample mean estimator in terms of MSE. Although $\hat A$ is still the MVU estimator, we have been able to reduce the mean square error by allowing the estimator to be biased.
Inasmuch as we have been able to produce a better estimator, the question arises as to whether an optimal estimator exists for this problem. (In the classical case the MSE criterion of optimality usually led to unrealizable estimators. We shall see that this is not a problem in the Bayesian approach.)
With knowledge only of the interval and no inclination as to whether $A$ should be nearer any particular value, it makes sense to assign a $\mathcal U[-A_0,A_0]$ PDF to the random variable $A$. The overall data model is different from the classical approach.
Now we can incorporate our knowledge of how $A$ was chosen. Define the Bayesian MSE as
$$\mathrm{Bmse}(\hat A)=E\left[(A-\hat A)^2\right] \tag{BP.1}$$
To appreciate the difference, compare the classical MSE
$$\mathrm{mse}(\hat A)=\int(\hat A-A)^2 p(\mathbf x;A)\,d\mathbf x \tag{BP.2}$$
to the Bayesian MSE
$$\mathrm{Bmse}(\hat A)=\iint(A-\hat A)^2 p(\mathbf x,A)\,d\mathbf x\, dA\tag{BP.3}$$
where the error is now averaged over the joint PDF $p(\mathbf x,A)$.
Note that whereas the classical MSE will depend on $A$, and hence estimators that attempt to minimize the MSE will usually depend on $A$, the Bayesian MSE will not. In effect, we have integrated the parameter dependence away.
Now we can derive the estimator that minimizes the Bayesian MSE. First, we use Bayes’ theorem to write
$$p(\mathbf{x}, A)=p(A|\mathbf{x})\, p(\mathbf{x})$$
so that
$$\operatorname{Bmse}(\hat{A})=\int\left[\int(A-\hat{A})^{2} p(A|\mathbf{x})\, dA\right] p(\mathbf{x})\, d\mathbf{x}\tag{BP.4}$$
Now since $p(\mathbf x)\ge 0$ for all $\mathbf x$, if the integral in brackets can be minimized for each $\mathbf x$, then the Bayesian MSE will be minimized. Hence, fixing $\mathbf x$ so that $\hat A$ is a scalar variable, we have
$$\begin{aligned} \frac{\partial}{\partial \hat{A}} \int(A-\hat{A})^{2} p(A|\mathbf{x})\, dA &=\int \frac{\partial}{\partial \hat{A}}(A-\hat{A})^{2} p(A|\mathbf{x})\, dA \\ &=\int-2(A-\hat{A})\, p(A|\mathbf{x})\, dA \\ &=-2 \int A\, p(A|\mathbf{x})\, dA+2 \hat{A} \int p(A|\mathbf{x})\, dA \end{aligned}$$
which when set equal to zero results in
$$\hat{A}=\int A\, p(A|\mathbf{x})\, dA=E(A|\mathbf{x})\tag{BP.5}$$
since the conditional PDF must integrate to $1$. It is seen that the optimal estimator in terms of minimizing the Bayesian MSE is the mean of the posterior PDF $p(A|\mathbf{x})$. We will henceforth term the estimator that minimizes the Bayesian MSE the minimum mean square error (MMSE) estimator.
To determine the MMSE estimator, we first require the posterior PDF. We can use Bayes' rule to determine it as
$$p(A|\mathbf x)=\frac{p(\mathbf x|A)p(A)}{p(\mathbf x)}=\frac{p(\mathbf x|A)p(A)}{\int p(\mathbf x|A)p(A)\,dA}\tag{BP.6}$$
Note that the denominator is just a normalizing factor, independent of $A$, needed to ensure that $p(A|\mathbf x)$ integrates to $1$. Then the MMSE estimator is
$$\hat A=E(A|\mathbf x)=\frac{\int A\, p(\mathbf x|A)p(A)\,dA}{\int p(\mathbf x|A)p(A)\,dA}\tag{BP.7}$$
If we continue our example, we recall that the prior PDF $p(A)$ is $\mathcal U[-A_0,A_0]$. To specify the conditional PDF $p(\mathbf x|A)$ we need to further assume that the choice of $A$ via $p(A)$ does not affect the PDF of the noise samples, or that $w[n]$ is independent of $A$. Then for $n=0,1,\cdots,N-1$
$$\begin{aligned} p_x(x[n]|A)&=p_w(x[n]-A|A)\\ &= p_w(x[n]-A)\\ &=\frac{1}{\sqrt{2\pi\sigma^2}}\exp \left[-\frac{1}{2\sigma^2}(x[n]-A)^2 \right] \end{aligned}$$
and therefore
$$p(\mathbf x|A)=\frac{1}{(2\pi \sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n]-A)^2 \right] \tag{BP.8}$$
It is apparent that the PDF is identical in form to the usual classical PDF $p(\mathbf x;A)$. In the Bayesian case, however, the PDF is a conditional PDF, hence the "$|$" separator, while in the classical case it represents an unconditional PDF, albeit parameterized by $A$, hence the "$;$" separator.
Using (BP.7) and (BP.8), the posterior PDF becomes
$$p(A|\mathbf{x})=\begin{cases} \dfrac{\frac{1}{2 A_{0}\left(2 \pi \sigma^{2}\right)^{N/2}}\exp \left[-\frac{1}{2 \sigma^{2}} \sum_{n=0}^{N-1}(x[n]-A)^{2}\right]}{\int_{-A_0}^{A_0}\frac{1}{2 A_{0}\left(2 \pi \sigma^{2}\right)^{N/2}}\exp \left[-\frac{1}{2 \sigma^{2}} \sum_{n=0}^{N-1}(x[n]-A)^{2}\right]dA} & |A| \leq A_{0}\\ 0 & |A|>A_{0} \end{cases}$$
But
$$\sum_{n=0}^{N-1}(x[n]-A)^{2} =\sum_{n=0}^{N-1} x^{2}[n]-2 N A \bar{x}+N A^{2} =N(A-\bar{x})^{2}+\sum_{n=0}^{N-1} x^{2}[n]-N \bar{x}^{2}$$
Cancelling the factors that do not depend on $A$ (they appear in both numerator and denominator), we can write the posterior as a truncated Gaussian distribution
$$p(A|\mathbf{x})=\begin{cases} \dfrac{\frac{1}{\sqrt{2 \pi \sigma^{2}/N}} \exp \left[-\frac{1}{2 \sigma^{2}/N}(A-\bar{x})^{2}\right]}{\int_{-A_0}^{A_0}\frac{1}{\sqrt{2 \pi \sigma^{2}/N}} \exp \left[-\frac{1}{2 \sigma^{2}/N}(A-\bar{x})^{2}\right]dA} & |A| \leq A_{0} \\ 0 & |A|>A_{0} \end{cases}\tag{BP.9}$$
The final MMSE estimate is then given by
$$\hat{A}=\int_{-\infty}^{\infty} A\, p(A|\mathbf{x})\, dA=\frac{\int_{-A_{0}}^{A_{0}} A \frac{1}{\sqrt{2 \pi \sigma^{2}/N}} \exp \left[-\frac{1}{2 \sigma^{2}/N}(A-\bar{x})^{2}\right] dA}{\int_{-A_{0}}^{A_{0}} \frac{1}{\sqrt{2 \pi \sigma^{2}/N}} \exp \left[-\frac{1}{2 \sigma^{2}/N}(A-\bar{x})^{2}\right] dA} \tag{BP.10}$$
- In the absence of data (if $\mathbf x$ gives no information about $A$), $\hat A=E(A|\mathbf x)=E(A)=0$.
- If $A_0 \gg \sqrt{\sigma^2/N}$ (no truncation), e.g., as $N$ increases, $\hat A\to \bar x$:
$$\hat A=\lim_{A_0\to \infty}\frac{\int_{-A_{0}}^{A_{0}} A \frac{1}{\sqrt{2 \pi \sigma^{2}/N}} \exp \left[-\frac{1}{2 \sigma^{2}/N}(A-\bar{x})^{2}\right] dA}{\int_{-A_{0}}^{A_{0}} \frac{1}{\sqrt{2 \pi \sigma^{2}/N}} \exp \left[-\frac{1}{2 \sigma^{2}/N}(A-\bar{x})^{2}\right] dA}=\frac{\bar x}{1}=\bar x$$
The effect of the data is to position the posterior mean between $A=0$ and $A=\bar x$ in a compromise between the prior knowledge and that contributed by the data. To further appreciate this weighting, consider what happens as $N$ becomes large so that the data knowledge becomes more important. As shown in Figure 10.4, as $N$ increases, we have from (BP.9) that the posterior PDF becomes more concentrated about $\bar x$ (since $\sigma^2/N$ decreases). Hence, it becomes nearly Gaussian, and its mean becomes just $\bar x$. The MMSE estimator relies less and less on the prior knowledge and more on the data. It is said that the data "swamps out" the prior knowledge.
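This weighting behavior can be checked numerically. Below is a small sketch (not from the text; the values of $A_0$, $\sigma$, and the true $A$ are assumed for illustration) that evaluates the posterior mean (BP.10) for the uniform prior by grid integration:

```python
# Numerical sketch of the MMSE estimator (BP.10) under the U[-A0, A0] prior.
# Parameter values are illustrative assumptions, not from the text.
import numpy as np

A0, sigma, A_true = 1.0, 1.0, 0.4

def mmse_uniform_prior(x):
    """Posterior mean E(A|x) of the truncated Gaussian (BP.9) via grid integration."""
    N = len(x)
    xbar = x.mean()
    A = np.linspace(-A0, A0, 4001)
    # Unnormalized posterior on |A| <= A0; all constants cancel in the ratio (BP.10)
    w = np.exp(-(A - xbar) ** 2 / (2 * sigma**2 / N))
    return (A * w).sum() / w.sum()

rng = np.random.default_rng(0)
for N in (1, 10, 1000):
    x = A_true + sigma * rng.standard_normal(N)
    # The estimate always stays in [-A0, A0]; as N grows it approaches xbar
    print(N, x.mean(), mmse_uniform_prior(x))
```

As $N$ grows, the printed estimate approaches the sample mean, illustrating the data "swamping out" the prior.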
Choosing a Prior PDF
As shown in the previous section, once a prior PDF has been chosen, the MMSE estimator follows directly from (BP.10). The only practical stumbling block that remains, however, is whether or not $E(\theta|\mathbf x)$ can be determined in closed form. This motivates us to assume a Gaussian prior PDF instead of the uniform prior PDF:
$$p(A)=\frac{1}{\sqrt{2\pi\sigma_A^2}}\exp \left[-\frac{1}{2\sigma_A^2}(A-\mu_A)^2 \right]$$
With $\mu_A=0$ and $3\sigma_A=A_0$, the Gaussian prior PDF could be thought of as incorporating the knowledge that $|A|\le A_0$ (with high probability). From (BP.6) and (BP.8), we can formulate
$$\begin{aligned} p(A|\mathbf{x})&=\frac{p(\mathbf{x}|A) p(A)}{\int p(\mathbf{x}|A) p(A)\, dA} \\ &=\frac{\frac{1}{\left(2 \pi \sigma^{2}\right)^{N/2} \sqrt{2 \pi \sigma_{A}^{2}}} \exp \left[-\frac{1}{2 \sigma^{2}} \sum_{n=0}^{N-1} x^{2}[n]\right] \exp \left[-\frac{1}{2 \sigma^{2}}\left(N A^{2}-2 N A \bar{x}\right)\right]\exp \left[-\frac{1}{2 \sigma_{A}^{2}}\left(A-\mu_{A}\right)^{2}\right]} {\int_{-\infty}^{\infty} \frac{1}{\left(2 \pi \sigma^{2}\right)^{N/2} \sqrt{2 \pi \sigma_{A}^{2}}} \exp \left[-\frac{1}{2 \sigma^{2}} \sum_{n=0}^{N-1} x^{2}[n]\right] \exp \left[-\frac{1}{2 \sigma^{2}}\left(N A^{2}-2 N A \bar{x}\right)\right]\exp \left[-\frac{1}{2 \sigma_{A}^{2}}\left(A-\mu_{A}\right)^{2}\right] dA} \\ &=\frac{\exp \left[-\frac{1}{2}\left( \frac{1}{\sigma^{2}}\left(NA^2-2NA\bar x\right)+\frac{1}{ \sigma_{A}^{2}}\left(A-\mu_{A}\right)^{2} \right) \right]} {\int_{-\infty}^{\infty} \exp \left[-\frac{1}{2}\left( \frac{1}{\sigma^{2}}\left(NA^2-2NA\bar x\right)+\frac{1}{ \sigma_{A}^{2}}\left(A-\mu_{A}\right)^{2} \right) \right] dA} \\ &=\frac{\exp \left[-\frac{1}{2} Q(A)\right]}{\int_{-\infty}^{\infty} \exp \left[-\frac{1}{2} Q(A)\right] dA} \end{aligned}$$
Continuing, we have for $Q(A)$
$$\begin{aligned} Q(A) &=\frac{N}{\sigma^{2}} A^{2}-\frac{2 N A \bar{x}}{\sigma^{2}}+\frac{A^{2}}{\sigma_{A}^{2}}-\frac{2 \mu_{A} A}{\sigma_{A}^{2}}+\frac{\mu_{A}^{2}}{\sigma_{A}^{2}} \\ &=\left(\frac{N}{\sigma^{2}}+\frac{1}{\sigma_{A}^{2}}\right) A^{2}-2\left(\frac{N}{\sigma^{2}} \bar{x}+\frac{\mu_{A}}{\sigma_{A}^{2}}\right) A+\frac{\mu_{A}^{2}}{\sigma_{A}^{2}} \end{aligned}$$
Let
$$\begin{aligned} \sigma_{A \mid x}^{2} &=\frac{1}{\frac{N}{\sigma^{2}}+\frac{1}{\sigma_{A}^{2}}} \\ \mu_{A \mid x} &=\left(\frac{N}{\sigma^{2}} \bar{x}+\frac{\mu_{A}}{\sigma_{A}^{2}}\right) \sigma_{A \mid x}^{2} \end{aligned}\tag{BP.11}$$
Then, by completing the square we have
$$Q(A)=\frac{1}{\sigma_{A \mid x}^{2}}\left(A-\mu_{A \mid x} \right)^2-\frac{\mu_{A \mid x}^{2}}{\sigma_{A \mid x}^{2}}+\frac{\mu_{A}^{2}}{\sigma_{A}^{2}}$$
so that
$$\begin{aligned} p(A|\mathbf{x}) &=\frac{\exp \left[-\frac{1}{2 \sigma_{A \mid x}^{2}}\left(A-\mu_{A \mid x}\right)^{2}\right] \exp \left[-\frac{1}{2}\left(\frac{\mu_{A}^{2}}{\sigma_{A}^{2}}-\frac{\mu_{A \mid x}^{2}}{\sigma_{A \mid x}^{2}}\right)\right]}{\int_{-\infty}^{\infty} \exp \left[-\frac{1}{2 \sigma_{A \mid x}^{2}}\left(A-\mu_{A \mid x}\right)^{2}\right] \exp \left[-\frac{1}{2}\left(\frac{\mu_{A}^{2}}{\sigma_{A}^{2}}-\frac{\mu_{A \mid x}^{2}}{\sigma_{A \mid x}^{2}}\right)\right] dA} \\ &=\frac{1}{\sqrt{2 \pi \sigma_{A \mid x}^{2}}} \exp \left[-\frac{1}{2 \sigma_{A \mid x}^{2}}\left(A-\mu_{A \mid x}\right)^{2}\right] \end{aligned}$$
where the last step follows from the requirement that $p(A|\mathbf{x})$ integrate to $1$. The posterior PDF is thus also Gaussian: $A|\mathbf x\sim \mathcal N(\mu_{A|x},\sigma^2_{A|x})$.
In this form the MMSE estimator is readily found as
$$\begin{aligned} \hat A&=E(A|\mathbf x)=\mu_{A|x}=\frac{\frac{N}{\sigma^{2}} \bar{x}+\frac{\mu_{A}}{\sigma_{A}^{2}}}{\frac{N}{\sigma^{2}}+\frac{1}{\sigma_{A}^{2}}}\\ &=\frac{\sigma_A^2}{\sigma_A^2+\frac{\sigma^2}{N}}\bar x+\frac{\frac{\sigma^2}{N}}{\sigma_A^2+\frac{\sigma^2}{N}}\mu_A\\ &=\alpha \bar x+(1-\alpha)\mu_A \end{aligned}\tag{BP.12}$$
where
$$\alpha =\frac{\sigma_A^2}{\sigma_A^2+\frac{\sigma^2}{N}}.$$
Note that $\alpha$ is a weighting factor since $0<\alpha<1$. It is interesting to examine the interplay between the prior knowledge and the data.
- If there is little data, so that $\sigma_A^2\ll \sigma^2/N$, then $\alpha$ is small and $\hat A\approx \mu_A$.
- If more data are observed, so that $\sigma_A^2\gg \sigma^2/N$, then $\alpha \approx 1$ and $\hat A\approx \bar x$.
Alternatively, we may view this process by examining the posterior PDF. As $N$ increases,
$$\operatorname{var}(A|\mathbf x)=\sigma_{A \mid x}^{2} =\frac{1}{\frac{N}{\sigma^{2}}+\frac{1}{\sigma_{A}^{2}}}$$
will decrease. Also, the posterior mean or $\hat A$ of (BP.12) will approach $\bar x$.
Observe that if there is no prior knowledge, which can be modeled by letting $\sigma_A^2\to \infty$, then $\hat A\to \bar x$ for any data record length. The classical estimator is obtained.
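The closed form (BP.12) is simple enough to state directly in code. A minimal sketch, with parameter values assumed for illustration:

```python
# Sketch of the Gaussian-prior MMSE estimator (BP.12):
# A_hat = alpha*xbar + (1 - alpha)*mu_A, a weighted average of data and prior.
import numpy as np

def mmse_gaussian_prior(x, sigma2, mu_A, sigma2_A):
    """Posterior mean for x[n] = A + w[n], w[n] ~ N(0, sigma2), A ~ N(mu_A, sigma2_A)."""
    N = len(x)
    alpha = sigma2_A / (sigma2_A + sigma2 / N)  # weight on the data, 0 < alpha < 1
    return alpha * np.mean(x) + (1 - alpha) * mu_A

# With one noisy sample and equal prior/noise variances, the estimate
# splits the difference between the sample and the prior mean.
print(mmse_gaussian_prior(np.array([2.0]), sigma2=1.0, mu_A=0.0, sigma2_A=1.0))  # 1.0
```

As $\sigma_A^2$ grows (or as $N$ grows), the returned value moves from $\mu_A$ toward $\bar x$, matching the bullet points above.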
Finally, we are going to prove that by using prior knowledge we could improve the estimation accuracy. Recall that
$$\operatorname{Bmse}(\hat{A})=\int\left[\int(A-\hat{A})^{2} p(A|\mathbf{x})\, dA\right] p(\mathbf{x})\, d\mathbf{x}\tag{BP.4}$$
Since $\hat A=E(A|\mathbf x)$, we have
$$\begin{aligned} \operatorname{Bmse}(\hat{A})&=\int\left[\int\left(A-E(A|\mathbf x)\right)^{2} p(A|\mathbf{x})\, dA\right] p(\mathbf{x})\, d\mathbf{x}\\ &=\int\operatorname{var}(A|\mathbf x)\,p(\mathbf{x})\, d\mathbf{x}\\ &=\int\frac{1}{\frac{N}{\sigma^{2}}+\frac{1}{\sigma_{A}^{2}}}\,p(\mathbf{x})\, d\mathbf{x}\\ &=\frac{1}{\frac{N}{\sigma^{2}}+\frac{1}{\sigma_{A}^{2}}} \end{aligned} \tag{BP.13}$$
This can be rewritten as
$$\operatorname{Bmse}(\hat A)=\frac{\sigma^2}{N}\left(\frac{\sigma_A^2}{\sigma_A^2+\frac{\sigma^2}{N}}\right)<\frac{\sigma^2}{N}\tag{BP.14}$$
where $\sigma^2/N$ is the minimum MSE obtained when no prior knowledge is available (let $\sigma^2_A\to \infty$). Clearly, any prior knowledge, when modeled in the Bayesian sense, will improve our Bayesian estimator.
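A quick Monte Carlo sketch (my own check, with assumed parameter values) confirms (BP.13) and (BP.14): the empirical Bayesian MSE of the posterior mean matches $1/(N/\sigma^2+1/\sigma_A^2)$ and beats $\sigma^2/N$:

```python
# Monte Carlo check of (BP.13)/(BP.14). Parameter values are assumed.
import numpy as np

rng = np.random.default_rng(2)
mu_A, sigma2_A, sigma2, N, trials = 0.0, 0.5, 1.0, 5, 200_000

A = mu_A + np.sqrt(sigma2_A) * rng.standard_normal(trials)    # A ~ N(mu_A, sigma_A^2)
xbar = A + np.sqrt(sigma2 / N) * rng.standard_normal(trials)  # sample mean: N(A, sigma^2/N)
alpha = sigma2_A / (sigma2_A + sigma2 / N)
A_hat = alpha * xbar + (1 - alpha) * mu_A                     # MMSE estimator (BP.12)

bmse_mc = np.mean((A - A_hat) ** 2)                 # empirical Bayesian MSE
bmse_theory = 1.0 / (N / sigma2 + 1.0 / sigma2_A)   # (BP.13)
print(bmse_mc, bmse_theory, sigma2 / N)             # bmse_mc ~ bmse_theory < sigma^2/N
```

Note that only the sufficient statistic $\bar x \sim \mathcal N(A,\sigma^2/N)$ is simulated, which is enough since (BP.12) depends on the data only through $\bar x$.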
Properties of the Gaussian PDF
We now generalize the results of the previous section by examining the properties of the Gaussian PDF. The bivariate Gaussian PDF is first investigated to illustrate the important properties. Then, the corresponding results for the general multivariate Gaussian PDF are described.
Consider a jointly Gaussian random vector $[x~~y]^{T}$ whose PDF is
$$p(x, y)=\frac{1}{2 \pi \operatorname{det}^{\frac{1}{2}}(\mathbf{C})} \exp \left[-\frac{1}{2}\begin{bmatrix} x-E(x) \\ y-E(y) \end{bmatrix}^{T} \mathbf{C}^{-1}\begin{bmatrix} x-E(x) \\ y-E(y) \end{bmatrix}\right]\tag{BF.15}$$
This is termed the bivariate Gaussian PDF. The mean vector and covariance matrix are
$$\begin{aligned} E\left(\begin{bmatrix}x\\y\end{bmatrix}\right)&=\begin{bmatrix}E(x)\\E(y)\end{bmatrix}\\ \mathbf C&=\begin{bmatrix}\operatorname{var}(x) & \operatorname{cov}(x,y)\\\operatorname{cov}(x,y) & \operatorname{var}(y)\end{bmatrix} \end{aligned}$$
Note that the marginal PDFs $p(x)$ and $p(y)$ are also Gaussian, as can be verified by the integrations
$$\begin{aligned} p(x)&=\int_{-\infty}^{\infty} p(x, y)\, dy=\frac{1}{\sqrt{2 \pi \operatorname{var}(x)}} \exp \left[-\frac{1}{2 \operatorname{var}(x)}(x-E(x))^{2}\right] \\ p(y)&=\int_{-\infty}^{\infty} p(x, y)\, dx=\frac{1}{\sqrt{2 \pi \operatorname{var}(y)}} \exp \left[-\frac{1}{2 \operatorname{var}(y)}(y-E(y))^{2}\right] \end{aligned}\tag{BF.16}$$
The contours along which the PDF $p(x,y)$ is constant are those values of $x$ and $y$ for which
$$\begin{bmatrix} x-E(x) \\ y-E(y) \end{bmatrix}^{T} \mathbf{C}^{-1}\begin{bmatrix} x-E(x) \\ y-E(y) \end{bmatrix}$$
is a constant. They are shown in Figure 10.6 as elliptical contours.
Once $x$, say $x_{0}$, is observed, the conditional PDF of $y$ becomes
$$p\left(y|x_{0}\right)=\frac{p\left(x_{0}, y\right)}{p\left(x_{0}\right)}=\frac{p\left(x_{0}, y\right)}{\int_{-\infty}^{\infty} p\left(x_{0}, y\right) dy}\tag{BF.17}$$
Since from (BF.15) the exponential argument is quadratic in $y$, $p(x_0,y)$ has the Gaussian form in $y$, and thus the conditional PDF $p(y|x_0)$ must also be Gaussian. To summarize,
Theorem 1 (Conditional PDF of Bivariate Gaussian) If $x$ and $y$ are distributed according to a bivariate Gaussian PDF with mean vector $[E(x)~~E(y)]^{T}$ and covariance matrix
$$\mathbf{C}=\begin{bmatrix} \operatorname{var}(x) & \operatorname{cov}(x, y) \\ \operatorname{cov}(y, x) & \operatorname{var}(y) \end{bmatrix}$$
so that
$$p(x, y)=\frac{1}{2 \pi \operatorname{det}^{\frac{1}{2}}(\mathbf{C})} \exp \left[-\frac{1}{2}\begin{bmatrix} x-E(x) \\ y-E(y) \end{bmatrix}^{T} \mathbf{C}^{-1}\begin{bmatrix} x-E(x) \\ y-E(y) \end{bmatrix}\right]$$
then the conditional PDF $p(y|x)$ is also Gaussian and
$$E(y|x) =E(y)+\frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)}(x-E(x)) \tag{BF.18}$$
$$\operatorname{var}(y|x) =\operatorname{var}(y)-\frac{\operatorname{cov}^{2}(x, y)}{\operatorname{var}(x)}\tag{BF.19}$$
Assuming that $x$ and $y$ are not independent, and hence $\operatorname{cov}(x,y)\neq 0$, the posterior PDF becomes more concentrated since there is less uncertainty about $y$. To verify this, note from (BF.19) that
$$\begin{aligned} \operatorname{var}(y|x) &=\operatorname{var}(y)\left[1-\frac{\operatorname{cov}^{2}(x, y)}{\operatorname{var}(x) \operatorname{var}(y)}\right] \\ &=\operatorname{var}(y)\left(1-\rho^{2}\right) \end{aligned}\tag{BF.20}$$
where
$$\rho=\frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x) \operatorname{var}(y)}}\tag{BF.21}$$
is the correlation coefficient, satisfying $|\rho|\le 1$. From our previous discussions, we also realize that $E(y|x)$ is the MMSE estimator of $y$ after observing $x$, so that from (BF.18)
$$\hat y=E(y)+\frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)}(x-E(x)) \tag{BF.22}$$
In normalized form (a random variable with zero mean and unity variance) this becomes
$$\frac{\hat y-E(y)}{\sqrt{\operatorname{var}(y)}}=\frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x) \operatorname{var}(y)}}\cdot\frac{x-E(x)}{\sqrt{\operatorname{var}(x)}}$$
or
$$\hat y_n=\rho x_n\tag{BF.23}$$
If the random variables are already normalized ($E(x) = E(y) = 0$, $\operatorname{var}(x) = \operatorname{var}(y) = 1$), the constant PDF contours appear as in Figure 10.7.
The locations of the peaks of $p(x,y)$, when considered as a function of $y$ for each $x$, lie along the dashed line $y = \rho x$, and it is readily shown that $\hat y = E(y|x) = \rho x$.
The MMSE estimator therefore exploits the correlation between the random variables to estimate the realization of one based on the realization of the other. The minimum MSE is, from (BP.13) and (BF.20),
$$\begin{aligned} \operatorname{Bmse}(\hat y)&=\int \operatorname{var}(y|x)\,p(x)\,dx\\ &=\operatorname{var}(y|x)\\ &=\operatorname{var}(y)(1-\rho^2) \end{aligned}\tag{BF.24}$$
where the second equality holds because $\operatorname{var}(y|x)$ does not depend on $x$.
Hence, the quality of our estimator also depends on the correlation coefficient, which is a measure of the statistical dependence between $x$ and $y$.
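A simulation sketch (with an assumed $\rho$ and normalized variables) of (BF.23)/(BF.24): the estimator $\hat y=\rho x$ attains a Bayesian MSE of about $1-\rho^2$:

```python
# Sketch verifying y_hat = rho*x and Bmse = var(y)(1 - rho^2) for normalized
# jointly Gaussian (x, y). rho is an assumed example value.
import numpy as np

rng = np.random.default_rng(3)
rho, n = 0.8, 500_000

x = rng.standard_normal(n)
y = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)  # corr(x, y) = rho

y_hat = rho * x                     # MMSE estimator (BF.23)
bmse = np.mean((y - y_hat) ** 2)    # ~ 1 - rho^2 (BF.24)
print(bmse)
```

With $\rho=0.8$ the empirical error variance comes out near $1-0.8^2=0.36$, well below the prior variance $\operatorname{var}(y)=1$.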
To generalize these results, consider a jointly Gaussian vector $\left[\mathbf{x}^{T}~\mathbf{y}^{T}\right]^{T}$, where $\mathbf{x}$ is $k \times 1$ and $\mathbf{y}$ is $l \times 1$. In other words, $\left[\mathbf{x}^{T}~\mathbf{y}^{T}\right]^{T}$ is distributed according to a multivariate Gaussian PDF. Then the conditional PDF of $\mathbf{y}$ for a given $\mathbf{x}$ is also Gaussian, as summarized in the following theorem (see Appendix 10A for the proof).
Theorem 10.2 (Conditional PDF of Multivariate Gaussian) If x \mathbf{x} x and y \mathbf y y are jointly Gaussian, where x \mathbf{x} x is k × 1 k \times 1 k×1 and y \mathbf{y} y is l × 1 , l \times 1, l×1, with mean vector [ E ( x ) T E ( y ) T ] T \left[E(\mathbf{x})^{T} E(\mathbf{y})^{T}\right]^{T} [E(x)TE(y)T]T and
partitioned covariance matrix
C = [ C x x C x y C y x C y y ] = [ k × k k × l l × k l × l ] (BF.25) \mathbf{C}=\left[\begin{array}{ll} \mathbf{C}_{x x} & \mathbf{C}_{x y} \\ \mathbf{C}_{y x} & \mathbf{C}_{y y} \end{array}\right]=\left[\begin{array}{ll} k \times k & k \times l \\ l \times k & l \times l \end{array}\right]\tag{BF.25} C=[CxxCyxCxyCyy]=[k×kl×kk×ll×l](BF.25)
so that
p ( x , y ) = 1 ( 2 π ) k + 1 2 det 1 2 ( C ) exp [ − 1 2 ( [ x − E ( x ) y − E ( y ) ] ) T C − 1 ( [ x − E ( x ) y − E ( y ) ] ) ] p(\mathbf{x}, \mathbf{y})=\frac{1}{(2 \pi)^{\frac{k+1}{2}} \operatorname{det}^{\frac{1}{2}}(\mathbf{C})} \exp \left[-\frac{1}{2}\left(\left[\begin{array}{l} \mathbf{x}-E(\mathbf{x}) \\ \mathbf{y}-E(\mathbf{y}) \end{array}\right]\right)^{T} \mathbf{C}^{-1}\left(\left[\begin{array}{l} \mathbf{x}-E(\mathbf{x}) \\ \mathbf{y}-E(\mathbf{y}) \end{array}\right]\right)\right] p(x,y)=(2π)2k+1det21(C)1exp[−21([x−E(x)y−E(y)])TC−1([x−E(x)y−E(y)])]
Then the conditional PDF $p(\mathbf y|\mathbf x)$ is also Gaussian, with
$$E(\mathbf y|\mathbf x)=E(\mathbf y)+\mathbf C_{yx}\mathbf C_{xx}^{-1}(\mathbf x-E(\mathbf x))\tag{BF.26}$$
$$\mathbf C_{y|x}=\mathbf C_{yy}-\mathbf C_{yx}\mathbf C_{xx}^{-1}\mathbf C_{xy}\tag{BF.27}$$
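As a quick numerical sanity check (a numpy sketch with illustrative dimensions, not part of the text), note that (BF.27) is the Schur complement of $\mathbf C_{xx}$ in $\mathbf C$, so its inverse must equal the lower-right block of $\mathbf C^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)
k, l = 3, 2  # illustrative dimensions

# Random symmetric positive-definite joint covariance C for [x^T y^T]^T.
A = rng.standard_normal((k + l, k + l))
C = A @ A.T + (k + l) * np.eye(k + l)

Cxx, Cxy = C[:k, :k], C[:k, k:]
Cyx, Cyy = C[k:, :k], C[k:, k:]

# (BF.27): conditional covariance of y given x (Schur complement of Cxx).
C_y_given_x = Cyy - Cyx @ np.linalg.inv(Cxx) @ Cxy

# Block-inverse identity: the lower-right l x l block of C^{-1}
# is the inverse of the Schur complement.
lower_right = np.linalg.inv(C)[k:, k:]
assert np.allclose(np.linalg.inv(lower_right), C_y_given_x)
```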
Bayesian Linear Model
Let the data be modeled as
$$\mathbf x=\mathbf H\boldsymbol\theta+\mathbf w\tag{BF.28}$$
where $\mathbf x$ is an $N\times 1$ data vector, $\mathbf H$ is a known $N\times p$ matrix, $\boldsymbol\theta$ is a $p\times 1$ random vector with prior PDF $\mathcal N(\boldsymbol\mu_\theta,\mathbf C_\theta)$, and $\mathbf w$ is an $N\times 1$ noise vector with PDF $\mathcal N(\mathbf 0,\mathbf C_w)$, independent of $\boldsymbol\theta$. This data model is termed the Bayesian general linear model. It differs from the classical general linear model in that $\boldsymbol\theta$ is modeled as a random variable with a Gaussian prior PDF.
From Theorem 10.2 we know that if $\mathbf x$ and $\boldsymbol\theta$ are jointly Gaussian, then the posterior PDF is also Gaussian. Hence, it only remains to verify that this is indeed the case. Let $\mathbf z=[\mathbf x^T\ \boldsymbol\theta^T]^T$, so that from $(BF.28)$ we have
$$\mathbf z=\begin{bmatrix}\mathbf x\\ \boldsymbol\theta\end{bmatrix}=\begin{bmatrix}\mathbf H\boldsymbol\theta+\mathbf w\\ \boldsymbol\theta\end{bmatrix}=\begin{bmatrix}\mathbf H & \mathbf I\\ \mathbf I & \mathbf 0\end{bmatrix}\begin{bmatrix}\boldsymbol\theta\\ \mathbf w\end{bmatrix}$$
Since $\boldsymbol\theta$ and $\mathbf w$ are independent of each other and each one is Gaussian, they are jointly Gaussian. Furthermore, because $\mathbf z$ is a linear transformation of a Gaussian vector, it too is Gaussian. Hence, Theorem 10.2 applies directly, and we need only determine the mean and covariance of the posterior PDF.
We can obtain the means and covariances (identifying $\mathbf y$ in Theorem 10.2 with $\boldsymbol\theta$; the cross terms below vanish because $\boldsymbol\theta$ and $\mathbf w$ are independent and $\mathbf w$ is zero mean):
$$\begin{aligned} E(\mathbf x)&=E(\mathbf H\boldsymbol\theta+\mathbf w)=\mathbf H E(\boldsymbol\theta)=\mathbf H\boldsymbol\mu_\theta\\ E(\boldsymbol\theta)&=\boldsymbol\mu_\theta\\ \mathbf C_{xx}&=E[(\mathbf x-E(\mathbf x))(\mathbf x-E(\mathbf x))^T]\\ &=E[(\mathbf H(\boldsymbol\theta-\boldsymbol\mu_\theta)+\mathbf w)(\mathbf H(\boldsymbol\theta-\boldsymbol\mu_\theta)+\mathbf w)^T]\\ &=\mathbf H\mathbf C_\theta\mathbf H^T+\mathbf C_w\\ \mathbf C_{\theta x}&=E[(\boldsymbol\theta-E(\boldsymbol\theta))(\mathbf x-E(\mathbf x))^T]\\ &=E[(\boldsymbol\theta-\boldsymbol\mu_\theta)(\mathbf H(\boldsymbol\theta-\boldsymbol\mu_\theta)+\mathbf w)^T]\\ &=\mathbf C_\theta\mathbf H^T \end{aligned}$$
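These covariance blocks can also be checked deterministically (a numpy sketch with illustrative dimensions, not from the text): since $\mathbf z=[\mathbf x^T\ \boldsymbol\theta^T]^T$ is a linear transform of $[\boldsymbol\theta^T\ \mathbf w^T]^T$, its covariance is $\mathbf A\,\mathrm{diag}(\mathbf C_\theta,\mathbf C_w)\,\mathbf A^T$ with $\mathbf A=[\mathbf H\ \mathbf I;\ \mathbf I\ \mathbf 0]$, and the blocks must match the formulas above:

```python
import numpy as np

def spd(M, n):
    # Random symmetric positive-definite matrix built from M.
    return M @ M.T + n * np.eye(n)

rng = np.random.default_rng(1)
N, p = 5, 2  # illustrative dimensions

H = rng.standard_normal((N, p))
Ct = spd(rng.standard_normal((p, p)), p)   # prior covariance C_theta
Cw = spd(rng.standard_normal((N, N)), N)   # noise covariance C_w

# z = [x; theta] = A [theta; w] with A = [[H, I], [I, 0]].
A = np.block([[H, np.eye(N)], [np.eye(p), np.zeros((p, N))]])
Sigma = np.block([[Ct, np.zeros((p, N))], [np.zeros((N, p)), Cw]])
Cz = A @ Sigma @ A.T

Cxx = Cz[:N, :N]   # should equal H C_theta H^T + C_w
Ctx = Cz[N:, :N]   # should equal C_theta H^T
assert np.allclose(Cxx, H @ Ct @ H.T + Cw)
assert np.allclose(Ctx, Ct @ H.T)
```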
We can now summarize our results for the Bayesian general linear model.
Theorem 10.3 (Posterior PDF for the Bayesian General Linear Model) If the observed data $\mathbf x$ can be modeled as
$$\mathbf x=\mathbf H\boldsymbol\theta+\mathbf w$$
where $\mathbf x$ is an $N\times 1$ data vector, $\mathbf H$ is a known $N\times p$ matrix, $\boldsymbol\theta$ is a $p\times 1$ random vector with prior PDF $\mathcal N(\boldsymbol\mu_\theta,\mathbf C_\theta)$, and $\mathbf w$ is an $N\times 1$ noise vector with PDF $\mathcal N(\mathbf 0,\mathbf C_w)$ and independent of $\boldsymbol\theta$, then the posterior PDF $p(\boldsymbol\theta|\mathbf x)$ is Gaussian with mean
$$E(\boldsymbol\theta|\mathbf x)=\boldsymbol\mu_\theta+\mathbf C_\theta\mathbf H^T(\mathbf H\mathbf C_\theta\mathbf H^T+\mathbf C_w)^{-1}(\mathbf x-\mathbf H\boldsymbol\mu_\theta)\tag{BF.29}$$
and covariance
$$\mathbf C_{\theta|x}=\mathbf C_\theta-\mathbf C_\theta\mathbf H^T(\mathbf H\mathbf C_\theta\mathbf H^T+\mathbf C_w)^{-1}\mathbf H\mathbf C_\theta\tag{BF.30}$$
In contrast to the classical general linear model, $\mathbf H$ need not be full rank to ensure the invertibility of $\mathbf H\mathbf C_\theta\mathbf H^T+\mathbf C_w$.
Alternative formulation using the matrix inversion lemma:
$$E(\boldsymbol\theta|\mathbf x)=\boldsymbol\mu_\theta+(\mathbf C_\theta^{-1}+\mathbf H^T\mathbf C_w^{-1}\mathbf H)^{-1}\mathbf H^T\mathbf C_w^{-1}(\mathbf x-\mathbf H\boldsymbol\mu_\theta)\tag{BF.31}$$
$$\mathbf C_{\theta|x}=(\mathbf C_\theta^{-1}+\mathbf H^T\mathbf C_w^{-1}\mathbf H)^{-1}\tag{BF.32}$$
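The two formulations trade an $N\times N$ inversion for a $p\times p$ one; their equivalence is easy to confirm numerically (a numpy sketch with illustrative sizes, not from the text):

```python
import numpy as np

def spd(M, n):
    # Random symmetric positive-definite matrix built from M.
    return M @ M.T + n * np.eye(n)

rng = np.random.default_rng(2)
N, p = 6, 3  # illustrative sizes
inv = np.linalg.inv

H = rng.standard_normal((N, p))
mu = rng.standard_normal(p)
x = rng.standard_normal(N)
Ct = spd(rng.standard_normal((p, p)), p)   # C_theta
Cw = spd(rng.standard_normal((N, N)), N)   # C_w

# (BF.29)/(BF.30): invert an N x N matrix.
G = inv(H @ Ct @ H.T + Cw)
mean_29 = mu + Ct @ H.T @ G @ (x - H @ mu)
cov_30 = Ct - Ct @ H.T @ G @ H @ Ct

# (BF.31)/(BF.32): invert a p x p matrix via the matrix inversion lemma.
P = inv(inv(Ct) + H.T @ inv(Cw) @ H)
mean_31 = mu + P @ H.T @ inv(Cw) @ (x - H @ mu)

assert np.allclose(mean_29, mean_31)
assert np.allclose(cov_30, P)
```

When $p \ll N$ the (BF.31)/(BF.32) form is the cheaper one to evaluate.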
We illustrate the use of these formulas by applying them to the example of a DC level in WGN.
Let us assume now that the prior distribution of $A$ is Gaussian, $A\sim\mathcal N(\mu_A,\sigma_A^2)$, and $w[n]$ is white Gaussian noise, i.e., $w[n]\sim\mathcal N(0,\sigma^2)$ for $n=0,\ldots,N-1$, so that
$$\mathbf x=\mathbf 1 A+\mathbf w$$
Then $\mathbf x$ and $A$ are jointly Gaussian ($k=N$ and $l=1$), with mean
$$E(A|\mathbf x)=\mu_A+\sigma_A^2\mathbf 1^T(\sigma_A^2\mathbf 1\mathbf 1^T+\sigma^2\mathbf I)^{-1}(\mathbf x-\mathbf 1\mu_A)$$
Using Woodbury's identity,
$$\left(\mathbf I+\frac{\sigma_A^2}{\sigma^2}\mathbf 1\mathbf 1^T\right)^{-1}=\mathbf I-\frac{\frac{\sigma_A^2}{\sigma^2}\mathbf 1\mathbf 1^T}{1+N\frac{\sigma_A^2}{\sigma^2}}$$
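This rank-one inversion identity is easy to verify numerically (a short numpy sketch with illustrative variances, not from the text):

```python
import numpy as np

N = 8
sigma2, sigmaA2 = 1.5, 2.0   # illustrative noise and prior variances
c = sigmaA2 / sigma2
one = np.ones((N, 1))

# Woodbury (rank-one) identity: (I + c*11^T)^{-1} = I - c*11^T / (1 + N*c).
lhs = np.linalg.inv(np.eye(N) + c * (one @ one.T))
rhs = np.eye(N) - (c * (one @ one.T)) / (1 + N * c)
assert np.allclose(lhs, rhs)
```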
so that
$$\begin{aligned} \hat A=E(A|\mathbf x)&=\mu_A+\frac{\sigma_A^2}{\sigma^2}\mathbf 1^T\left(\mathbf I-\frac{\mathbf 1\mathbf 1^T}{N+\frac{\sigma^2}{\sigma_A^2}}\right)(\mathbf x-\mathbf 1\mu_A)\\ &=\mu_A+\frac{\sigma_A^2}{\sigma^2}\left(\mathbf 1^T-\frac{N}{N+\frac{\sigma^2}{\sigma_A^2}}\mathbf 1^T\right)(\mathbf x-\mathbf 1\mu_A)\\ &=\mu_A+\frac{\sigma_A^2}{\sigma^2}\left(1-\frac{N}{N+\frac{\sigma^2}{\sigma_A^2}}\right)(N\bar x-N\mu_A)\\ &=\mu_A+\frac{\sigma_A^2}{\sigma_A^2+\frac{\sigma^2}{N}}(\bar x-\mu_A)\\ &=\frac{\sigma_A^2}{\sigma_A^2+\frac{\sigma^2}{N}}\bar x+\frac{\frac{\sigma^2}{N}}{\sigma_A^2+\frac{\sigma^2}{N}}\mu_A\\ &=\alpha\bar x+(1-\alpha)\mu_A \end{aligned}$$
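The scalar shrinkage form $\alpha\bar x+(1-\alpha)\mu_A$ should agree exactly with the general formula (BF.29) evaluated with $\mathbf H=\mathbf 1$; a quick numpy check on simulated data (illustrative parameter values, not from the text):

```python
import numpy as np

rng = np.random.default_rng(3)
N, muA, sigmaA2, sigma2 = 10, 0.5, 2.0, 1.0   # illustrative parameters

# Draw A from its prior, then data x[n] = A + w[n].
A = rng.normal(muA, np.sqrt(sigmaA2))
x = A + rng.normal(0.0, np.sqrt(sigma2), N)

# Scalar shrinkage form: alpha*xbar + (1 - alpha)*muA.
alpha = sigmaA2 / (sigmaA2 + sigma2 / N)
A_hat = alpha * x.mean() + (1 - alpha) * muA

# General form (BF.29) with H = 1 (the all-ones vector).
one = np.ones((N, 1))
G = np.linalg.inv(sigmaA2 * (one @ one.T) + sigma2 * np.eye(N))
A_hat_general = muA + sigmaA2 * (one.T @ G @ (x[:, None] - one * muA)).item()

assert np.allclose(A_hat, A_hat_general)
```

As $N\to\infty$, $\alpha\to 1$ and the estimator approaches the sample mean; for small $N$ it is pulled toward the prior mean $\mu_A$.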
which is exactly the same as $(BP.12)$. And
$$\begin{aligned} \operatorname{Bmse}(\hat A)&=\int\left[\int(A-E(A|\mathbf x))^2 p(A|\mathbf x)\,dA\right]p(\mathbf x)\,d\mathbf x\\ &=\int\operatorname{var}(A|\mathbf x)p(\mathbf x)\,d\mathbf x=\operatorname{var}(A|\mathbf x)\\ &=\sigma_A^2-\frac{\sigma_A^2}{\sigma^2}\mathbf 1^T\left(\mathbf I-\frac{\mathbf 1\mathbf 1^T}{N+\frac{\sigma^2}{\sigma_A^2}}\right)\mathbf 1\sigma_A^2\\ &=\frac{1}{\frac{N}{\sigma^2}+\frac{1}{\sigma_A^2}} \end{aligned}$$
where the outer integral disappears because the posterior variance does not depend on $\mathbf x$,
which is just $(BP.13)$.
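Equivalently, this Bmse is what (BF.32) gives with $\mathbf H=\mathbf 1$, $\mathbf C_\theta=\sigma_A^2$, and $\mathbf C_w=\sigma^2\mathbf I$; a small numpy check (illustrative values, not from the text):

```python
import numpy as np

N, sigmaA2, sigma2 = 10, 2.0, 1.0   # illustrative values

# (BF.32) with H = 1 (N x 1), C_theta = sigmaA2 (scalar), C_w = sigma2 * I:
one = np.ones((N, 1))
post_var = np.linalg.inv(
    np.array([[1.0 / sigmaA2]]) + one.T @ (np.eye(N) / sigma2) @ one
).item()

# Closed form 1 / (N/sigma2 + 1/sigmaA2), i.e. (BP.13):
assert np.isclose(post_var, 1.0 / (N / sigma2 + 1.0 / sigmaA2))
```

Note that the posterior variance is always smaller than the prior variance $\sigma_A^2$: the data can only reduce the uncertainty about $A$.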