The Bayesian Philosophy

Reference:
Kay, S. M., Fundamentals of Statistical Signal Processing: Estimation Theory, Prentice Hall PTR, 1993 (Ch. 10 – 10.6).
Slides of ET4386, TUD.


We now depart from the classical approach to statistical estimation, in which the parameter $\theta$ of interest is assumed to be a deterministic but unknown constant. Instead, we assume that $\theta$ is a random variable whose particular realization we must estimate.

Motivation:

  • The Bayesian approach can incorporate the available prior knowledge into our estimator.
  • Bayesian estimation is useful in situations where an MVU estimator cannot be found, for example when no unbiased estimator has a variance that is uniformly less than that of all other unbiased estimators. In such cases it may still be possible to find an estimator whose mean square error is, for most values of the parameter, less than that of all other estimators.

Prior Knowledge and Estimation

It is a fundamental rule of estimation theory that the use of prior knowledge will lead to a more accurate estimator.

Reconsider the DC level in WGN problem: $x[n]=A+w[n]$, $n=0,\cdots,N-1$, with $w[n]\sim \mathcal N(0,\sigma^2)$. It can be shown that the MVU estimator of $A$ is the sample mean $\bar x$. However, this assumed that $A$ could take on any value in the interval $-\infty<A<\infty$. Due to physical constraints it may be more reasonable to assume that $A$ can take on only values in the finite interval $-A_0\le A\le A_0$. Obviously, we would expect to improve our estimation if we used the truncated sample mean estimator
$$\check A=\begin{cases}-A_0 & \bar x<-A_0\\ \bar x & -A_0\le \bar x\le A_0\\ A_0 & \bar x>A_0\end{cases}$$
which would be consistent with the known constraints. Such an estimator would have the PDF
$$p_{\check A}(\xi;A)=\Pr\{\bar x \le -A_0\}\,\delta(\xi+A_0)+p_{\hat A}(\xi;A)\left[u(\xi+A_0)-u(\xi-A_0)\right]+\Pr\{\bar x \ge A_0\}\,\delta(\xi-A_0)$$
where $u(x)$ is the unit step function and $p_{\hat A}(\xi;A)$ is the PDF of the sample mean estimator $\hat A=\bar x$.


From this PDF it is seen that $\check A$ is a biased estimator. However, if we compare the MSE of the two estimators, we note that for any $A$ in the interval $-A_0\le A\le A_0$
$$\begin{aligned} \operatorname{mse}(\hat{A})&= \int_{-\infty}^{\infty}(\xi-A)^{2} p_{\hat{A}}(\xi ; A)\, d \xi \\ &= \int_{-\infty}^{-A_{0}}(\xi-A)^{2} p_{\hat{A}}(\xi ; A)\, d \xi+\int_{-A_{0}}^{A_{0}}(\xi-A)^{2} p_{\hat{A}}(\xi ; A)\, d \xi +\int_{A_{0}}^{\infty}(\xi-A)^{2} p_{\hat{A}}(\xi ; A)\, d \xi \\ &>\int_{-\infty}^{-A_{0}}(-A_0-A)^{2} p_{\hat{A}}(\xi ; A)\, d \xi+\int_{-A_{0}}^{A_{0}}(\xi-A)^{2} p_{\hat{A}}(\xi ; A)\, d \xi +\int_{A_{0}}^{\infty}(A_0-A)^{2} p_{\hat{A}}(\xi ; A)\, d \xi \\ &= \left(-A_{0}-A\right)^{2} \Pr\{\bar x \le -A_0 \}+\int_{-A_{0}}^{A_{0}}(\xi-A)^{2} p_{\hat{A}}(\xi ; A)\, d \xi +\left(A_{0}-A\right)^{2} \Pr\{\bar x \ge A_0 \} \\ &= \operatorname{mse}(\check{A}) \end{aligned}$$
Hence, $\check A$, the truncated sample mean estimator, is better than the sample mean estimator in terms of MSE. Although $\hat A$ is still the MVU estimator, we have been able to reduce the mean square error by allowing the estimator to be biased.
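As a quick numerical check of this MSE reduction, the following NumPy sketch (my own illustration, not part of the reference; the values of $A$, $A_0$, $\sigma^2$, and $N$ are arbitrary) simulates the model and compares the empirical MSEs of the two estimators:

```python
import numpy as np

# Monte Carlo check: truncating the sample mean to [-A0, A0] lowers the MSE
# for a true A inside the interval, even though the truncated estimator is biased.
rng = np.random.default_rng(0)
A_true, A0, sigma, N, trials = 0.8, 1.0, 1.0, 10, 100_000

x = A_true + sigma * rng.standard_normal((trials, N))
A_hat = x.mean(axis=1)                       # sample mean (MVU estimator)
A_check = np.clip(A_hat, -A0, A0)            # truncated sample mean

print("mse(sample mean)    :", np.mean((A_hat - A_true) ** 2))    # ~ sigma^2 / N
print("mse(truncated mean) :", np.mean((A_check - A_true) ** 2))  # smaller
```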


Inasmuch as we have been able to produce a better estimator, the question arises as to whether an optimal estimator exists for this problem. (In the classical case the MSE criterion of optimality usually led to unrealizable estimators. We shall see that this is not a problem in the Bayesian approach.)

With knowledge only of the interval and no inclination as to whether $A$ should be nearer any particular value, it makes sense to assign a $\mathcal U[-A_0,A_0]$ PDF to the random variable $A$. The overall data model then differs from the classical approach: the realization of $A$ is first drawn from the prior PDF, and the data are then generated conditioned on that realization.


Now we can incorporate our knowledge of how $A$ was chosen. Define the Bayesian MSE as
$$\mathrm{Bmse}(\hat A)=E\left[(A-\hat A)^2\right] \tag{BP.1}$$
To appreciate the difference compare the classical MSE
$$\mathrm{mse}(\hat A)=\int(\hat A-A)^2p(\mathbf x;A)\,d\mathbf x \tag{BP.2}$$
to the Bayesian MSE
$$\mathrm{Bmse}(\hat A)=\int\!\!\int(A-\hat A)^2\,p(\mathbf x,A)\,d\mathbf x\, dA\tag{BP.3}$$
Note that whereas the classical MSE depends on $A$, and hence estimators that attempt to minimize it will usually depend on $A$, the Bayesian MSE does not: the expectation is taken over the joint PDF $p(\mathbf x,A)$, so the parameter dependence has been integrated away.

Now we can derive the estimator that minimizes the Bayesian MSE. First, we use Bayes’ theorem to write
$$p(\mathbf{x}, A)=p(A|\mathbf{x})\, p(\mathbf{x})$$
so that
$$\operatorname{Bmse}(\hat{A})=\int\left[\int(A-\hat{A})^{2} p(A |\mathbf{x})\, d A\right] p(\mathbf{x})\, d \mathbf{x}\tag{BP.4}$$
Now since $p(\mathbf{x}) \geq 0$ for all $\mathbf{x}$, if the integral in brackets can be minimized for each $\mathbf{x}$, then the Bayesian MSE will be minimized. Hence, fixing $\mathbf x$ so that $\hat{A}$ is a scalar variable, we have
$$\begin{aligned} \frac{\partial}{\partial \hat{A}} \int(A-\hat{A})^{2} p(A | \mathbf{x})\, d A &=\int \frac{\partial}{\partial \hat{A}}(A-\hat{A})^{2} p(A | \mathbf{x})\, d A \\ &=\int-2(A-\hat{A}) p(A | \mathbf{x})\, d A \\ &=-2 \int A p(A | \mathbf{x})\, d A+2 \hat{A} \int p(A | \mathbf{x})\, d A \end{aligned}$$
which when set equal to zero results in
$$\hat{A}=\int A p(A | \mathbf{x})\, d A=E(A | \mathbf{x})\tag{BP.5}$$
since the conditional PDF must integrate to $1$. It is seen that the optimal estimator in terms of minimizing the Bayesian MSE is the mean of the posterior PDF $p(A|\mathbf{x})$. We will henceforth term the estimator that minimizes the Bayesian MSE the minimum mean square error (MMSE) estimator.
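The fact that the posterior mean minimizes the bracketed cost in (BP.4) is easy to verify numerically. The sketch below (my own, with an arbitrarily chosen posterior on a grid) sweeps candidate values of $\hat A$ and confirms that the minimizer coincides with $E(A|\mathbf x)$:

```python
import numpy as np

# Any valid posterior will do; this one is deliberately skewed so the result is not obvious.
A = np.linspace(-3.0, 3.0, 2001)
dA = A[1] - A[0]
post = np.exp(-0.5 * (A - 0.7) ** 2) * (1.0 + 0.5 * np.tanh(A))   # unnormalized, nonnegative
post /= post.sum() * dA                                           # normalize to integrate to 1

post_mean = (A * post).sum() * dA                                 # E(A|x)

# Bracketed cost of (BP.4) as a function of the candidate estimate.
candidates = np.linspace(-2.0, 2.0, 4001)
cost = [((A - a_hat) ** 2 * post).sum() * dA for a_hat in candidates]
best = candidates[int(np.argmin(cost))]

print("posterior mean :", round(post_mean, 4))
print("cost minimizer :", round(best, 4))   # agrees to within the grid spacing
```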



To determine the MMSE estimator, we first require the posterior PDF. We can use Bayes' rule to determine it as
$$p(A|\mathbf x)=\frac{p(\mathbf x|A)p(A)}{p(\mathbf x)}=\frac{p(\mathbf x|A)p(A)}{\int p(\mathbf x|A)p(A)\,dA}\tag{BP.6}$$
Note that the denominator is just a normalizing factor, independent of $A$, needed to ensure that $p(A|\mathbf x)$ integrates to $1$. The MMSE estimator is then
$$\hat A=E(A|\mathbf x)=\frac{\int Ap(\mathbf x|A)p(A)\,dA}{\int p(\mathbf x|A)p(A)\,dA}\tag{BP.7}$$
If we continue our example, we recall that the prior PDF $p(A)$ is $\mathcal U[-A_0,A_0]$. To specify the conditional PDF $p(\mathbf x|A)$ we need to further assume that the choice of $A$ via $p(A)$ does not affect the PDF of the noise samples, or that $w[n]$ is independent of $A$. Then for $n=0,1,\cdots, N-1$
$$\begin{aligned} p_x(x[n]|A)&=p_w(x[n]-A|A)\\ &= p_w(x[n]-A)\\ &=\frac{1}{\sqrt{2\pi\sigma^2}}\exp \left[-\frac{1}{2\sigma^2}(x[n]-A)^2 \right] \end{aligned}$$
and therefore
$$p(\mathbf x|A)=\frac{1}{(2\pi \sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n]-A)^2 \right] \tag{BP.8}$$
It is apparent that this PDF is identical in form to the usual classical PDF $p(\mathbf x; A)$. In the Bayesian case, however, it is a conditional PDF, hence the "$|$" separator, while in the classical case it is an unconditional PDF, albeit parameterized by $A$, hence the "$;$" separator.

Using (BP.6) and (BP.8) together with the uniform prior, the posterior PDF becomes
$$p(A |\mathbf{x})=\begin{cases} \dfrac{\frac{1}{2 A_{0}\left(2 \pi \sigma^{2}\right)^{N/2}}\exp \left[-\frac{1}{2 \sigma^{2}} \sum_{n=0}^{N-1}(x[n]-A)^{2}\right]}{\int_{-A_0}^{A_0}\frac{1}{2 A_{0}\left(2 \pi \sigma^{2}\right)^{N/2}}\exp \left[-\frac{1}{2 \sigma^{2}} \sum_{n=0}^{N-1}(x[n]-A)^{2}\right]dA} & |A| \leq A_{0}\\ 0 & |A|>A_{0} \end{cases}$$
But
$$\sum_{n=0}^{N-1}(x[n]-A)^{2} =\sum_{n=0}^{N-1} x^{2}[n]-2 N A \bar{x}+N A^{2} =N(A-\bar{x})^{2}+\sum_{n=0}^{N-1} x^{2}[n]-N \bar{x}^{2}$$
Cancelling the factors that do not depend on $A$ in the numerator and denominator, the posterior can be written as a truncated Gaussian distribution
$$p(A |\mathbf{x})=\begin{cases} \dfrac{\frac{1}{\sqrt{2 \pi \sigma^{2} / N}} \exp \left[-\frac{1}{2 \sigma^{2} / N}(A-\bar{x})^{2}\right]}{\int_{-A_0}^{A_0}\frac{1}{\sqrt{2 \pi \sigma^{2} / N}} \exp \left[-\frac{1}{2 \sigma^{2} / N}(A-\bar{x})^{2}\right]dA} & |A| \leq A_{0} \\ 0 & |A|>A_{0} \end{cases}\tag{BP.9}$$
The final MMSE estimate is then given by
$$\hat{A}=\int_{-\infty}^{\infty} A p(A| \mathbf{x})\, d A=\frac{\int_{-A_{0}}^{A_{0}} A \frac{1}{\sqrt{2 \pi \sigma^{2} / N}} \exp \left[-\frac{1}{2 \sigma^{2} / N}(A-\bar{x})^{2}\right] d A}{\int_{-A_{0}}^{A_{0}} \frac{1}{\sqrt{2 \pi \sigma^{2} / N}} \exp \left[-\frac{1}{2 \sigma^{2} / N}(A-\bar{x})^{2}\right] d A} \tag{BP.10}$$

  • In the absence of data (i.e., if $\mathbf x$ carries no information about $A$), the estimate reverts to the prior mean:
    $$\hat A=E(A|\mathbf x)=E(A)=0$$

  • If $A_0 \gg \sqrt{\sigma^2/N}$ (effectively no truncation), for example as $N$ increases, then $\hat A\to \bar x$:
    $$\hat A=\lim_{A_0\to \infty}\frac{\int_{-A_{0}}^{A_{0}} A \frac{1}{\sqrt{2 \pi \sigma^{2} / N}} \exp \left[-\frac{1}{2 \sigma^{2} / N}(A-\bar{x})^{2}\right] d A}{\int_{-A_{0}}^{A_{0}} \frac{1}{\sqrt{2 \pi \sigma^{2} / N}} \exp \left[-\frac{1}{2 \sigma^{2} / N}(A-\bar{x})^{2}\right] d A}=\frac{\bar x}{1}=\bar x$$

The effect of the data is to position the posterior mean between $A = 0$ and $A =\bar x$, a compromise between the prior knowledge and that contributed by the data. To further appreciate this weighting, consider what happens as $N$ becomes large, so that the data knowledge becomes more important. As shown in Figure 10.4, as $N$ increases we have from (BP.9) that the posterior PDF becomes more concentrated about $\bar x$ (since $\sigma^2/N$ decreases). Hence, it becomes nearly Gaussian, and its mean becomes just $\bar x$. The MMSE estimator relies less and less on the prior knowledge and more on the data. It is said that the data "swamps out" the prior knowledge.
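Since (BP.10) has no convenient closed form, it is natural to evaluate it numerically. The sketch below (my own, with arbitrary values for $A$, $A_0$, and $\sigma^2$) does exactly that and shows the estimate being pulled toward $\bar x$ as $N$ grows:

```python
import numpy as np

def mmse_uniform_prior(x, sigma, A0):
    """Evaluate the MMSE estimator (BP.10) for a U[-A0, A0] prior by numerical integration."""
    xbar, N = x.mean(), len(x)
    A = np.linspace(-A0, A0, 20001)
    w = np.exp(-0.5 * (A - xbar) ** 2 / (sigma ** 2 / N))   # truncated-Gaussian shape from (BP.9)
    return (A * w).sum() / w.sum()                          # ratio of integrals; grid spacing cancels

rng = np.random.default_rng(1)
A_true, A0, sigma = 0.9, 1.0, 1.0
for N in (1, 5, 20, 100):
    x = A_true + sigma * rng.standard_normal(N)
    print(f"N = {N:3d}:  xbar = {x.mean():+.3f}   MMSE estimate = {mmse_uniform_prior(x, sigma, A0):+.3f}")
```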


Choosing a Prior PDF

As shown in the previous section, once a prior PDF has been chosen, the MMSE estimator follows directly from (BP.7). The only practical stumbling block that remains, however, is whether or not $E(\theta|\mathbf x)$ can be determined in closed form. This motivates us to use a Gaussian prior PDF instead of the uniform prior PDF:
$$p(A)=\frac{1}{\sqrt{2\pi\sigma_A^2}}\exp \left[-\frac{1}{2\sigma_A^2}(A-\mu_A)^2 \right]$$
With $\mu_A=0$ and $3\sigma_A=A_0$, the Gaussian prior PDF can be thought of as (approximately) incorporating the knowledge that $|A|\le A_0$. From (BP.6) and (BP.8), we can write
$$\begin{aligned} p(A | \mathbf{x})=& \frac{p(\mathbf{x} |A) p(A)}{\int p(\mathbf{x} | A) p(A)\, d A} \\ =& \frac{\frac{1}{\left(2 \pi \sigma^{2}\right)^{N/2} \sqrt{2 \pi \sigma_{A}^{2}}} \exp \left[-\frac{1}{2 \sigma^{2}} \sum_{n=0}^{N-1} x^{2}[n]\right] \exp \left[-\frac{1}{2 \sigma^{2}}\left(N A^{2}-2 N A \bar{x}\right)\right]\exp \left[-\frac{1}{2 \sigma_{A}^{2}}\left(A-\mu_{A}\right)^{2}\right]} {\int_{-\infty}^{\infty} \frac{1}{\left(2 \pi \sigma^{2}\right)^{N/2} \sqrt{2 \pi \sigma_{A}^{2}}} \exp \left[-\frac{1}{2 \sigma^{2}} \sum_{n=0}^{N-1} x^{2}[n]\right] \exp \left[-\frac{1}{2 \sigma^{2}}\left(N A^{2}-2 N A \bar{x}\right)\right]\exp \left[-\frac{1}{2 \sigma_{A}^{2}}\left(A-\mu_{A}\right)^{2}\right] d A} \\ =& \frac{\exp \left[-\frac{1}{2}\left( \frac{1}{\sigma^{2}}(NA^2-2NA\bar x)+\frac{1}{ \sigma_{A}^{2}}\left(A-\mu_{A}\right)^{2} \right) \right]} {\int_{-\infty}^{\infty} \exp \left[-\frac{1}{2}\left( \frac{1}{\sigma^{2}}(NA^2-2NA\bar x)+\frac{1}{ \sigma_{A}^{2}}\left(A-\mu_{A}\right)^{2} \right) \right] d A} \\ =& \frac{\exp \left[-\frac{1}{2} Q(A)\right]}{\int_{-\infty}^{\infty} \exp \left[-\frac{1}{2} Q(A)\right] d A} \end{aligned}$$
Continuing, we have for $Q(A)$
$$\begin{aligned} Q(A) &=\frac{N}{\sigma^{2}} A^{2}-\frac{2 N A \bar{x}}{\sigma^{2}}+\frac{A^{2}}{\sigma_{A}^{2}}-\frac{2 \mu_{A} A}{\sigma_{A}^{2}}+\frac{\mu_{A}^{2}}{\sigma_{A}^{2}} \\ &=\left(\frac{N}{\sigma^{2}}+\frac{1}{\sigma_{A}^{2}}\right) A^{2}-2\left(\frac{N}{\sigma^{2}} \bar{x}+\frac{\mu_{A}}{\sigma_{A}^{2}}\right) A+\frac{\mu_{A}^{2}}{\sigma_{A}^{2}} \end{aligned}$$
Let
$$\begin{aligned} \sigma_{A \mid x}^{2} &=\frac{1}{\frac{N}{\sigma^{2}}+\frac{1}{\sigma_{A}^{2}}} \\ \mu_{A \mid x} &=\left(\frac{N}{\sigma^{2}} \bar{x}+\frac{\mu_{A}}{\sigma_{A}^{2}}\right) \sigma_{A \mid x}^{2} \end{aligned}\tag{BP.11}$$
Then, by completing the square we have
$$Q(A)=\frac{1}{\sigma_{A \mid x}^{2}}\left(A-\mu_{A \mid x} \right)^2-\frac{\mu_{A \mid x}^{2}}{\sigma_{A \mid x}^{2}}+\frac{\mu_{A}^{2}}{\sigma_{A}^{2}}$$
so that
$$\begin{aligned} p(A | \mathbf{x}) &=\frac{\exp \left[-\frac{1}{2 \sigma_{A \mid x}^{2}}\left(A-\mu_{A \mid x}\right)^{2}\right] \exp \left[-\frac{1}{2}\left(\frac{\mu_{A}^{2}}{\sigma_{A}^{2}}-\frac{\mu_{A \mid x}^{2}}{\sigma_{A \mid x}^{2}}\right)\right]}{\int_{-\infty}^{\infty} \exp \left[-\frac{1}{2 \sigma_{A \mid x}^{2}}\left(A-\mu_{A \mid x}\right)^{2}\right] \exp \left[-\frac{1}{2}\left(\frac{\mu_{A}^{2}}{\sigma_{A}^{2}}-\frac{\mu_{A \mid x}^{2}}{\sigma_{A \mid x}^{2}}\right)\right] d A} \\ &=\frac{1}{\sqrt{2 \pi \sigma_{A \mid x}^{2}}} \exp \left[-\frac{1}{2 \sigma_{A \mid x}^{2}}\left(A-\mu_{A \mid x}\right)^{2}\right] \end{aligned}$$
where the last step follows from the requirement that $p(A |\mathbf{x})$ integrate to $1$. The posterior PDF is thus also Gaussian: $A\mid \mathbf x \sim \mathcal N(\mu_{A|x},\sigma^2_{A|x})$.

In this form the MMSE estimator is readily found as
$$\begin{aligned} \hat A&=E(A|\mathbf x)=\mu_{A|x}=\frac{\frac{N}{\sigma^{2}} \bar{x}+\frac{\mu_{A}}{\sigma_{A}^{2}}}{\frac{N}{\sigma^{2}}+\frac{1}{\sigma_{A}^{2}}}\\ &=\frac{\sigma_A^2}{\sigma_A^2+\frac{\sigma ^2}{N}}\bar x+\frac{\frac{\sigma ^2}{N}}{\sigma_A^2+\frac{\sigma ^2}{N}}\mu_A\\ &=\alpha \bar x+(1-\alpha)\mu_A \end{aligned}\tag{BP.12}$$
where
$$\alpha =\frac{\sigma_A^2}{\sigma_A^2+\frac{\sigma ^2}{N}}.$$
Note that $\alpha$ is a weighting factor satisfying $0<\alpha <1$. It is interesting to examine the interplay between the prior knowledge and the data (illustrated numerically in the sketch after the following list).

  • If the prior knowledge is strong relative to the data (few or noisy samples), so that $\sigma_A^2\ll \sigma^2/N$, then $\alpha$ is small and $\hat A\approx \mu_A$.
  • If more data are observed, so that $\sigma_A^2\gg \sigma^2/N$, then $\alpha \approx 1$ and $\hat A\approx \bar x$.
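The following sketch (again my own, with arbitrary parameter values) evaluates (BP.12) for several record lengths and makes the two regimes above visible:

```python
import numpy as np

def mmse_gaussian_prior(x, sigma, mu_A, var_A):
    """MMSE estimator (BP.12) under the Gaussian prior N(mu_A, var_A)."""
    N, xbar = len(x), x.mean()
    alpha = var_A / (var_A + sigma ** 2 / N)      # data weight, 0 < alpha < 1
    return alpha * xbar + (1 - alpha) * mu_A, alpha

rng = np.random.default_rng(2)
A_true, sigma, mu_A, var_A = 1.2, 1.0, 0.0, 0.25
for N in (1, 10, 1000):
    x = A_true + sigma * rng.standard_normal(N)
    A_hat, alpha = mmse_gaussian_prior(x, sigma, mu_A, var_A)
    print(f"N = {N:4d}: alpha = {alpha:.3f}  xbar = {x.mean():+.3f}  A_hat = {A_hat:+.3f}")
# Small N: alpha is small and the estimate stays near the prior mean mu_A.
# Large N: alpha -> 1 and the estimate follows xbar.
```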

Alternatively, we may view this process by examining the posterior PDF. As $N$ increases,
$$\operatorname{var}(A|\mathbf x)=\sigma_{A \mid x}^{2} =\frac{1}{\frac{N}{\sigma^{2}}+\frac{1}{\sigma_{A}^{2}}}$$
will decrease. Also, the posterior mean, i.e., $\hat A$ of (BP.12), will approach $\bar x$.


Observe that if there is no prior knowledge, which can be modeled by letting $\sigma_A^2\to \infty$, then $\hat A\to \bar x$ for any data record length: the classical estimator is obtained.

Finally, we show that using prior knowledge improves the estimation accuracy. Recall that
$$\operatorname{Bmse}(\hat{A})=\int\left[\int(A-\hat{A})^{2} p(A |\mathbf{x})\, d A\right] p(\mathbf{x})\, d \mathbf{x}\tag{BP.4}$$
Since $\hat A=E(A|\mathbf x)$, we have
$$\begin{aligned} \operatorname{Bmse}(\hat{A})&=\int\left[\int(A-E(A|\mathbf x))^{2} p(A |\mathbf{x})\, d A\right] p(\mathbf{x})\, d \mathbf{x}\\ &=\int\operatorname{var}(A|\mathbf x)\,p(\mathbf{x})\, d \mathbf{x}\\ &=\int\frac{1}{\frac{N}{\sigma^{2}}+\frac{1}{\sigma_{A}^{2}}}\,p(\mathbf{x})\, d \mathbf{x}\\ &=\frac{1}{\frac{N}{\sigma^{2}}+\frac{1}{\sigma_{A}^{2}}} \end{aligned} \tag{BP.13}$$
This can be rewritten as
$$\operatorname{Bmse}(\hat A)=\frac{\sigma^2}{N}\left(\frac{\sigma_A^2}{\sigma_A^2+\frac{\sigma^2}{N}}\right)<\frac{\sigma ^2}{N}\tag{BP.14}$$
where $\sigma^2/N$ is the minimum Bayesian MSE obtained when no prior knowledge is available (let $\sigma^2_A\to \infty$). Clearly, any prior knowledge, when modeled in the Bayesian sense, improves the estimator.
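A Monte Carlo check of (BP.13) and (BP.14) (my own sketch; parameter values are arbitrary) averages $(A-\hat A)^2$ over realizations of the joint PDF $p(\mathbf x, A)$ and compares the result with the analytic expression and with $\sigma^2/N$:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, mu_A, var_A, N, trials = 1.0, 0.0, 0.25, 10, 200_000

A = mu_A + np.sqrt(var_A) * rng.standard_normal(trials)       # A ~ N(mu_A, var_A)
x = A[:, None] + sigma * rng.standard_normal((trials, N))     # x[n] = A + w[n]
xbar = x.mean(axis=1)

alpha = var_A / (var_A + sigma ** 2 / N)
A_hat = alpha * xbar + (1 - alpha) * mu_A                     # MMSE estimator (BP.12)

print("Bmse (Monte Carlo):", np.mean((A - A_hat) ** 2))
print("Bmse (BP.13)      :", 1.0 / (N / sigma ** 2 + 1.0 / var_A))
print("sigma^2 / N       :", sigma ** 2 / N)                  # larger, per (BP.14)
```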

Properties of the Gaussian PDF

We now generalize the results of the previous section by examining the properties of the Gaussian PDF. The bivariate Gaussian PDF is first investigated to illustrate the important properties. Then, the corresponding results for the general multivariate Gaussian PDF are described.

Consider a jointly Gaussian random vector $[x~~y]^{T}$ whose PDF is
$$p(x, y)=\frac{1}{2 \pi \operatorname{det}^{\frac{1}{2}}(\mathbf{C})} \exp \left[-\frac{1}{2}\left[\begin{matrix} x-E(x) \\ y-E(y) \end{matrix}\right]^{T} \mathbf{C}^{-1}\left[\begin{matrix} x-E(x) \\ y-E(y) \end{matrix}\right]\right]\tag{BF.15}$$
This is termed the bivariate Gaussian PDF. The mean vector and covariance matrix are
$$\begin{aligned} E\left(\left[\begin{matrix}x\\y\end{matrix}\right]\right)&=\left[\begin{matrix}E(x)\\E(y)\end{matrix}\right]\\ \mathbf C&=\left[\begin{matrix}\operatorname{var}(x) & \operatorname{cov}(x,y)\\\operatorname{cov}(x,y) & \operatorname{var}(y)\end{matrix}\right] \end{aligned}$$
Note that the marginal PDFs $p(x)$ and $p(y)$ are also Gaussian, as can be verified by the integrations
$$\begin{aligned} p(x)=\int_{-\infty}^{\infty} p(x, y)\, d y=\frac{1}{\sqrt{2 \pi \operatorname{var}(x)}} \exp \left[-\frac{1}{2 \operatorname{var}(x)}(x-E(x))^{2}\right] \\ p(y)=\int_{-\infty}^{\infty} p(x, y)\, d x=\frac{1}{\sqrt{2 \pi \operatorname{var}(y)}} \exp \left[-\frac{1}{2 \operatorname{var}(y)}(y-E(y))^{2}\right] \end{aligned}\tag{BF.16}$$
The contours along which the PDF $p(x, y)$ is constant are those values of $x$ and $y$ for which
$$\left[\begin{matrix} x-E(x) \\ y-E(y) \end{matrix}\right]^{T} \mathbf{C}^{-1}\left[\begin{matrix} x-E(x) \\ y-E(y) \end{matrix}\right]$$
is a constant. They are shown in Figure 10.6 as elliptical contours.


Once $x$, say $x_{0}$, is observed, the conditional PDF of $y$ becomes
$$p\left(y|x_{0}\right)=\frac{p\left(x_{0}, y\right)}{p\left(x_{0}\right)}=\frac{p\left(x_{0}, y\right)}{\int_{-\infty}^{\infty} p\left(x_{0}, y\right)\, d y}\tag{BF.17}$$
Since from (BF.15) the exponential argument is quadratic in $y$, $p(x_0,y)$ has the Gaussian form in $y$, and thus the conditional PDF $p(y|x_0)$ must also be Gaussian. To summarize:

Theorem 10.1 (Conditional PDF of Bivariate Gaussian) If $x$ and $y$ are distributed according to a bivariate Gaussian PDF with mean vector $[E(x)~~E(y)]^{T}$ and covariance matrix
$$\mathbf{C}=\left[\begin{matrix} \operatorname{var}(x) & \operatorname{cov}(x, y) \\ \operatorname{cov}(y, x) & \operatorname{var}(y) \end{matrix}\right]$$
so that
$$p(x, y)=\frac{1}{2 \pi \operatorname{det}^{\frac{1}{2}}(\mathbf{C})} \exp \left[-\frac{1}{2}\left[\begin{matrix} x-E(x) \\ y-E(y) \end{matrix}\right]^{T} \mathbf{C}^{-1}\left[\begin{matrix} x-E(x) \\ y-E(y) \end{matrix}\right]\right]$$
then the conditional PDF $p(y | x)$ is also Gaussian and
$$E(y | x) =E(y)+\frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)}(x-E(x)) \tag{BF.18}$$

$$\operatorname{var}(y | x) =\operatorname{var}(y)-\frac{\operatorname{cov}^{2}(x, y)}{\operatorname{var}(x)}\tag{BF.19}$$

Assuming that $x$ and $y$ are not independent, and hence $\operatorname{cov}(x, y) \neq 0$, the posterior PDF becomes more concentrated since there is less uncertainty about $y$. To verify this, note from (BF.19) that
$$\begin{aligned} \operatorname{var}(y| x) &=\operatorname{var}(y)\left[1-\frac{\operatorname{cov}^{2}(x, y)}{\operatorname{var}(x) \operatorname{var}(y)}\right] \\ &=\operatorname{var}(y)\left(1-\rho^{2}\right) \end{aligned}\tag{BF.20}$$
where
$$\rho=\frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x) \operatorname{var}(y)}}\tag{BF.21}$$
is the correlation coefficient, satisfying $|\rho|\le 1$. From our previous discussions, we also realize that $E(y|x)$ is the MMSE estimator of $y$ after observing $x$, so that from (BF.18)
$$\hat y=E(y)+\frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)}(x-E(x)) \tag{BF.22}$$
In normalized form (each variable transformed to have zero mean and unit variance) this becomes
$$\frac{\hat y-E(y)}{\sqrt{\operatorname{var}(y)}}=\frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x) \operatorname{var}(y)}}\cdot\frac{x-E(x)}{\sqrt{\operatorname{var}(x)}}$$
or
$$\hat y_n=\rho x_n\tag{BF.23}$$
If the random variables are already normalized ($E(x) = E(y) = 0$, $\operatorname{var}(x) = \operatorname{var}(y) = 1$), the constant PDF contours appear as in Figure 10.7.


The locations of the peaks of $p(x,y)$, when considered as a function of $y$ for each $x$, lie along the dashed line $y = \rho x$, and it is readily shown that $\hat y = E(y|x) = \rho x$.

The MMSE estimator therefore exploits the correlation between the random variables to estimate the realization of one based on the realization of the other. The minimum MSE is, following the same steps as in (BP.13) and using (BF.20),
$$\begin{aligned} \operatorname{Bmse}(\hat y)=&\int \operatorname{var}(y|x)\,p(x)\,dx\\ =&\operatorname{var}(y|x)\\ =&\operatorname{var}(y)(1-\rho^2) \end{aligned}\tag{BF.24}$$
where the second equality holds because $\operatorname{var}(y|x)$ does not depend on $x$ for jointly Gaussian random variables. Hence, the quality of our estimator also depends on the correlation coefficient, which is a measure of the statistical dependence between $x$ and $y$.
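The bivariate result is also easy to check by simulation. The sketch below (my own; the mean vector and covariance matrix are arbitrary) draws correlated pairs, forms $\hat y$ from (BF.22), and compares the empirical Bayesian MSE with (BF.24):

```python
import numpy as np

rng = np.random.default_rng(4)
mean = np.array([1.0, -2.0])
C = np.array([[2.0, 1.2],       # [[var(x), cov(x,y)],
              [1.2, 1.5]])      #  [cov(x,y), var(y)]]
samples = rng.multivariate_normal(mean, C, size=200_000)
x, y = samples[:, 0], samples[:, 1]

y_hat = mean[1] + (C[0, 1] / C[0, 0]) * (x - mean[0])      # MMSE estimator (BF.22)
rho = C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])                 # correlation coefficient (BF.21)

print("empirical Bmse      :", np.mean((y - y_hat) ** 2))
print("var(y)*(1 - rho^2)  :", C[1, 1] * (1 - rho ** 2))   # (BF.24)
```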


To generalize these results, consider a jointly Gaussian vector $\left[\mathbf{x}^{T}\ \mathbf{y}^{T}\right]^{T}$, where $\mathbf{x}$ is $k \times 1$ and $\mathbf{y}$ is $l \times 1$. In other words, $\left[\mathbf{x}^{T}\ \mathbf{y}^{T}\right]^{T}$ is distributed according to a multivariate Gaussian PDF. Then the conditional PDF of $\mathbf{y}$ for a given $\mathbf{x}$ is also Gaussian, as summarized in the following theorem (see Appendix 10A of Kay for the proof).

Theorem 10.2 (Conditional PDF of Multivariate Gaussian) If $\mathbf{x}$ and $\mathbf y$ are jointly Gaussian, where $\mathbf{x}$ is $k \times 1$ and $\mathbf{y}$ is $l \times 1$, with mean vector $\left[E(\mathbf{x})^{T}\ E(\mathbf{y})^{T}\right]^{T}$ and
partitioned covariance matrix
$$\mathbf{C}=\left[\begin{matrix} \mathbf{C}_{x x} & \mathbf{C}_{x y} \\ \mathbf{C}_{y x} & \mathbf{C}_{y y} \end{matrix}\right]=\left[\begin{matrix} k \times k & k \times l \\ l \times k & l \times l \end{matrix}\right]\tag{BF.25}$$
so that
$$p(\mathbf{x}, \mathbf{y})=\frac{1}{(2 \pi)^{\frac{k+l}{2}} \operatorname{det}^{\frac{1}{2}}(\mathbf{C})} \exp \left[-\frac{1}{2}\left[\begin{matrix} \mathbf{x}-E(\mathbf{x}) \\ \mathbf{y}-E(\mathbf{y}) \end{matrix}\right]^{T} \mathbf{C}^{-1}\left[\begin{matrix} \mathbf{x}-E(\mathbf{x}) \\ \mathbf{y}-E(\mathbf{y}) \end{matrix}\right]\right]$$
then the conditional PDF $p(\mathbf y|\mathbf x)$ is also Gaussian and
$$E(\mathbf y|\mathbf x)=E(\mathbf y)+\mathbf C_{yx} \mathbf C_{xx}^{-1}(\mathbf x-E(\mathbf x))\tag{BF.26}$$

$$\mathbf C_{y|x}=\mathbf C_{yy}-\mathbf C_{yx}\mathbf C_{xx}^{-1}\mathbf C_{xy}\tag{BF.27}$$
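In code, Theorem 10.2 amounts to a few matrix operations. The helper below is a sketch (the function name and the example numbers are my own), returning the conditional mean map and conditional covariance of (BF.26) and (BF.27):

```python
import numpy as np

def gaussian_conditional(mu_x, mu_y, Cxx, Cyy, Cyx):
    """Conditional mean and covariance of y given x for jointly Gaussian (x, y), per (BF.26)/(BF.27)."""
    gain = Cyx @ np.linalg.inv(Cxx)                  # C_yx C_xx^{-1}
    C_y_given_x = Cyy - gain @ Cyx.T                 # (BF.27), using C_xy = C_yx^T
    cond_mean = lambda x: mu_y + gain @ (x - mu_x)   # (BF.26)
    return cond_mean, C_y_given_x

# Tiny example with k = 2, l = 1 (numbers chosen only for illustration).
mu_x, mu_y = np.array([0.0, 1.0]), np.array([2.0])
Cxx = np.array([[1.0, 0.3], [0.3, 2.0]])
Cyy = np.array([[1.5]])
Cyx = np.array([[0.5, -0.4]])                        # cov(y, x), shape l x k
cond_mean, C_post = gaussian_conditional(mu_x, mu_y, Cxx, Cyy, Cyx)
print("E(y | x=[1, 0]) =", cond_mean(np.array([1.0, 0.0])))
print("C_{y|x}         =", C_post)
```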

Bayesian Linear Model

Let the data be modeled as
$$\mathbf x=\mathbf H \boldsymbol \theta+\mathbf w\tag{BF.28}$$
where $\mathbf x$ is an $N\times 1$ data vector, $\mathbf H$ is a known $N\times p$ matrix, $\boldsymbol \theta$ is a $p\times 1$ random vector with prior PDF $\mathcal N(\boldsymbol \mu_\theta,\mathbf C_\theta)$, and $\mathbf w$ is an $N\times 1$ noise vector with PDF $\mathcal N (\mathbf 0,\mathbf C_w)$, independent of $\boldsymbol \theta$. This data model is termed the Bayesian general linear model. It differs from the classical general linear model in that $\boldsymbol \theta$ is modeled as a random variable with a Gaussian prior PDF.

From Theorem 10.2 we know that if $\mathbf x$ and $\boldsymbol \theta$ are jointly Gaussian, then the posterior PDF is also Gaussian. Hence, it only remains to verify that this is indeed the case. Let $\mathbf z=[\mathbf x^T~~\boldsymbol \theta^T]^T$, so that from (BF.28) we have
$$\mathbf z=\left[\begin{matrix}\mathbf x\\\boldsymbol \theta \end{matrix}\right]=\left[\begin{matrix}\mathbf H \boldsymbol \theta+\mathbf w\\\boldsymbol \theta \end{matrix}\right]=\left[\begin{matrix}\mathbf H & \mathbf I\\\mathbf I & \mathbf 0 \end{matrix}\right]\left[\begin{matrix}\boldsymbol \theta\\\mathbf w \end{matrix}\right]$$
Since $\boldsymbol \theta$ and $\mathbf w$ are independent of each other and each is Gaussian, they are jointly Gaussian. Furthermore, because $\mathbf z$ is a linear transformation of a Gaussian vector, it too is Gaussian. Hence, Theorem 10.2 applies directly, and we need only determine the mean and covariance of the posterior PDF.

Identifying $\mathbf y$ with $\boldsymbol \theta$ in Theorem 10.2, we can obtain the required means and covariances:
$$\begin{aligned} E(\mathbf x)=& E(\mathbf H \boldsymbol \theta+\mathbf w)=\mathbf HE(\boldsymbol \theta)=\mathbf H \boldsymbol \mu_\theta\\ E(\mathbf y)=&E(\boldsymbol \theta)=\boldsymbol \mu_\theta\\ \mathbf C_{xx}=& E[(\mathbf x-E(\mathbf x))(\mathbf x-E(\mathbf x))^T]\\ =&E[(\mathbf H(\boldsymbol \theta-\boldsymbol \mu_\theta)+\mathbf w)(\mathbf H(\boldsymbol \theta-\boldsymbol \mu_\theta)+\mathbf w)^T]\\ =& \mathbf H \mathbf C_\theta \mathbf H^T+\mathbf C_w\\ \mathbf C_{yx}=&E[(\mathbf y-E(\mathbf y))(\mathbf x-E(\mathbf x))^T]\\ =&E[(\boldsymbol \theta-\boldsymbol \mu_\theta)(\mathbf H(\boldsymbol \theta-\boldsymbol \mu_\theta)+\mathbf w)^T]\\ =& \mathbf C_\theta \mathbf H^T \end{aligned}$$
We can now summarize our results for the Bayesian general linear model.

Theorem 10.3 (Posterior PDF for the Bayesian General Linear Model) If the observed data $\mathbf x$ can be modeled as
$$\mathbf{x}=\mathbf{H} \boldsymbol{\theta}+\mathbf{w}$$
where $\mathbf{x}$ is an $N \times 1$ data vector, $\mathbf{H}$ is a known $N \times p$ matrix, $\boldsymbol{\theta}$ is a $p \times 1$ random vector with prior PDF $\mathcal{N}\left(\boldsymbol\mu_{\theta}, \mathbf{C}_{\theta}\right)$, and $\mathbf{w}$ is an $N \times 1$ noise vector with PDF $\mathcal{N}\left(\mathbf{0}, \mathbf{C}_{w}\right)$, independent of $\boldsymbol \theta$, then the posterior PDF $p(\boldsymbol\theta |\mathbf{x})$ is Gaussian with mean
$$E(\boldsymbol{\theta} |\mathbf{x})=\boldsymbol{\mu}_{\theta}+\mathbf{C}_{\theta} \mathbf{H}^{T}\left(\mathbf{H} \mathbf{C}_{\theta} \mathbf{H}^{T}+\mathbf{C}_{w}\right)^{-1}\left(\mathbf{x}-\mathbf{H} \boldsymbol{\mu}_{\theta}\right)\tag{BF.29}$$
and covariance
$$\mathbf{C}_{\theta \mid x}=\mathbf{C}_{\theta}-\mathbf{C}_{\theta} \mathbf{H}^{T}\left(\mathbf{H} \mathbf{C}_{\theta} \mathbf{H}^{T}+\mathbf{C}_{w}\right)^{-1} \mathbf{H} \mathbf{C}_{\theta}\tag{BF.30}$$
In contrast to the classical general linear model, $\mathbf{H}$ need not be full rank to ensure the invertibility of $\mathbf{H C}_{\theta} \mathbf{H}^{T}+\mathbf{C}_{w}$.

An alternative formulation, obtained via the matrix inversion lemma, is
$$E(\boldsymbol{\theta} |\mathbf{x})=\boldsymbol\mu_{\theta}+\left(\mathbf{C}_{\theta}^{-1}+\mathbf{H}^{T} \mathbf{C}_w^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}_w^{-1}\left(\mathbf{x}-\mathbf{H} \boldsymbol\mu_{\theta}\right) \tag{BF.31}$$

$$\mathbf{C}_{\theta \mid x} =\left(\mathbf{C}_{\theta}^{-1}+\mathbf{H}^{T} \mathbf{C}_w^{-1} \mathbf{H}\right)^{-1}\tag{BF.32}$$
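Both formulations are straightforward to implement and, as a sanity check, should yield identical posteriors. The sketch below (my own; function names and test dimensions are arbitrary) computes (BF.29)/(BF.30) and (BF.31)/(BF.32) and confirms they agree:

```python
import numpy as np

def blm_posterior(x, H, mu_theta, C_theta, C_w):
    """Posterior mean/covariance for the Bayesian general linear model, (BF.29)/(BF.30)."""
    K = C_theta @ H.T @ np.linalg.inv(H @ C_theta @ H.T + C_w)
    return mu_theta + K @ (x - H @ mu_theta), C_theta - K @ H @ C_theta

def blm_posterior_alt(x, H, mu_theta, C_theta, C_w):
    """Equivalent form via the matrix inversion lemma, (BF.31)/(BF.32)."""
    Cw_inv = np.linalg.inv(C_w)
    cov = np.linalg.inv(np.linalg.inv(C_theta) + H.T @ Cw_inv @ H)      # (BF.32)
    return mu_theta + cov @ H.T @ Cw_inv @ (x - H @ mu_theta), cov      # (BF.31)

# N = 6 observations of a p = 2 random parameter vector.
rng = np.random.default_rng(5)
N, p = 6, 2
H = rng.standard_normal((N, p))
mu_theta, C_theta, C_w = np.array([1.0, -0.5]), np.diag([0.5, 2.0]), 0.3 * np.eye(N)
theta = rng.multivariate_normal(mu_theta, C_theta)
x = H @ theta + rng.multivariate_normal(np.zeros(N), C_w)

m1, C1 = blm_posterior(x, H, mu_theta, C_theta, C_w)
m2, C2 = blm_posterior_alt(x, H, mu_theta, C_theta, C_w)
print("means agree:", np.allclose(m1, m2), " covariances agree:", np.allclose(C1, C2))
```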

We illustrate the use of these formulas by applying them to the DC level in WGN example:

Let us assume now that the prior distribution of $A$ is Gaussian, $A \sim \mathcal{N}\left(\mu _A, \sigma_{A}^{2}\right)$, and that $w[n]$ is white Gaussian noise, i.e., $w[n] \sim \mathcal{N}\left(0, \sigma^{2}\right)$ for $n=0, \ldots, N-1$, so that
$$\mathbf{x}=\mathbf 1 A+\mathbf{w}$$
Then $\mathbf{x}$ and $A$ are jointly Gaussian ($k=N$ and $l=1$), and from (BF.29) the posterior mean is
$$E(A|\mathbf x)=\mu_A+\sigma_A^2 \mathbf 1^T\left(\sigma_A^2\mathbf 1 \mathbf 1^T+\sigma^2 \mathbf I\right)^{-1}(\mathbf x-\mathbf 1\mu_A)$$
Using Woodbury’s identity
$$\left(\mathbf{I}+\frac{\sigma_{A}^{2}}{\sigma^{2}} \mathbf{1} \mathbf{1}^{T}\right)^{-1}=\mathbf{I}-\frac{\frac{\sigma_{A}^{2}}{\sigma^{2}} \mathbf{1} \mathbf{1}^{T}}{1+N \frac{\sigma_{A}^{2}}{\sigma^{2}}}$$
so that
$$\begin{aligned} \hat A=E(A |\mathbf{x}) &=\mu_{A}+\frac{\sigma_{A}^{2}}{\sigma^{2}} \mathbf{1}^{T}\left(\mathbf{I}-\frac{\mathbf{1} \mathbf{1}^{T}}{N+\frac{\sigma^{2}}{\sigma_{A}^{2}}}\right)\left(\mathbf{x}-\mathbf{1} \mu_{A}\right) \\ &=\mu_{A}+\frac{\sigma_{A}^{2}}{\sigma^{2}}\left(\mathbf{1}^{T}-\frac{N}{N+\frac{\sigma^{2}}{\sigma_{A}^{2}}} \mathbf{1}^{T}\right)\left(\mathbf{x}-\mathbf{1} \mu_{A}\right) \\ &=\mu_{A}+\frac{\sigma_{A}^{2}}{\sigma^{2}}\left(1-\frac{N}{N+\frac{\sigma^{2}}{\sigma_{A}^{2}}}\right)\left(N \bar{x}-N \mu_{A}\right) \\ &=\mu_{A}+\frac{\sigma_{A}^{2}}{\sigma_{A}^{2}+\frac{\sigma^{2}}{N}}\left(\bar{x}-\mu_{A}\right)\\ &=\frac{\sigma_A^2}{\sigma_A^2+\frac{\sigma ^2}{N}}\bar x+\frac{\frac{\sigma ^2}{N}}{\sigma_A^2+\frac{\sigma ^2}{N}}\mu_A\\ &=\alpha \bar x+(1-\alpha)\mu_A \end{aligned}$$
which is exactly the same as (BP.12). Similarly,
$$\begin{aligned} \operatorname{Bmse}(\hat{A})&=\int\left[\int(A-E(A|\mathbf x))^{2} p(A |\mathbf{x})\, d A\right] p(\mathbf{x})\, d \mathbf{x}\\ &=\int\operatorname{var}(A|\mathbf x)\,p(\mathbf{x})\, d \mathbf{x}=\operatorname{var}(A|\mathbf x)\\ &=\sigma_A^2-\frac{\sigma_A^2}{\sigma^2}\mathbf 1^T\left(\mathbf I-\frac{\mathbf 1 \mathbf 1^T}{N+\frac{\sigma^2}{\sigma_A^2}} \right)\mathbf 1 \sigma_A^2\\ &=\frac{1}{\frac{N}{\sigma^{2}}+\frac{1}{\sigma_{A}^{2}}} \end{aligned}$$

which is just (BP.13).
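As a final check (a sketch of my own, with arbitrary numbers), plugging $\mathbf H=\mathbf 1$, $\mathbf C_\theta=\sigma_A^2$, and $\mathbf C_w=\sigma^2\mathbf I$ into the general formulas reproduces the scalar results (BP.12) and (BP.13):

```python
import numpy as np

rng = np.random.default_rng(6)
N, sigma, mu_A, var_A = 8, 1.0, 0.3, 0.5
H = np.ones((N, 1))                                    # x = 1*A + w
A = mu_A + np.sqrt(var_A) * rng.standard_normal()
x = (A + sigma * rng.standard_normal(N)).reshape(N, 1)

S = var_A * H @ H.T + sigma ** 2 * np.eye(N)           # H C_theta H^T + C_w
K = var_A * H.T @ np.linalg.inv(S)                     # C_theta H^T S^{-1}
A_hat_general = (mu_A + K @ (x - H * mu_A)).item()     # (BF.29)
var_general = (var_A - K @ H * var_A).item()           # (BF.30)

alpha = var_A / (var_A + sigma ** 2 / N)
A_hat_scalar = alpha * x.mean() + (1 - alpha) * mu_A   # (BP.12)
var_scalar = 1.0 / (N / sigma ** 2 + 1.0 / var_A)      # (BP.13)

print(np.isclose(A_hat_general, A_hat_scalar), np.isclose(var_general, var_scalar))
```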
