Reference:
- Kay, S. M., *Fundamentals of Statistical Signal Processing: Estimation Theory*. Prentice Hall PTR, 1993 (Ch. 10 - 10.6).
- Slides of ET4386, TUD
Content
We now depart from the classical approach to statistical estimation, in which the parameter $\theta$ of interest is assumed to be a deterministic but unknown constant. Instead, we assume that $\theta$ is a random variable whose particular realization we must estimate.
Motivation:
- The Bayesian approach can incorporate the available prior knowledge into our estimator.
- Bayesian estimation is useful in situations where an MVU estimator cannot be found, for example, when no unbiased estimator has a variance uniformly less than that of all other unbiased estimators. In this instance, it may still be true that for most values of the parameter an estimator can be found whose mean square error is less than that of all other estimators.
Prior Knowledge and Estimation
It is a fundamental rule of estimation theory that the use of prior knowledge will lead to a more accurate estimator.
Reconsider the DC in WGN problem:
$$x[n]=A+w[n],\quad n=0,\cdots,N-1,\qquad w[n]\sim \mathcal N(0,\sigma^2)$$
It can be shown that the MVU estimator of $A$ is the sample mean $\bar x$. However, this assumed that $A$ could take on any value in the interval $-\infty<A<\infty$. Due to physical constraints it may be more reasonable to assume that $A$ can take on only values in the finite interval $-A_0\le A\le A_0$. Obviously, we would expect to improve our estimation if we used the truncated sample mean estimator
$$\check A=\begin{cases}-A_0 & \bar x<-A_0\\ \bar x & -A_0\le \bar x\le A_0\\ A_0 & \bar x>A_0 \end{cases}$$
which would be consistent with the known constraints. Such an estimator would have the PDF
$$p_{\check A}(\xi;A)=\Pr\{\bar x \le -A_0 \}\,\delta(\xi+A_0)+p_{\hat A}(\xi;A)\left[u(\xi+A_0)-u(\xi-A_0)\right]+\Pr\{\bar x \ge A_0 \}\,\delta(\xi-A_0)$$
where
$u(x)$ is the unit step function.
It is seen that $\check A$ is a biased estimator. However, if we compare the MSE of the two estimators, we note that for any $A$ in the interval $-A_0\le A\le A_0$
$$\begin{aligned} \operatorname{mse}(\hat{A})&=\int_{-\infty}^{\infty}(\xi-A)^{2} p_{\hat{A}}(\xi;A)\, d\xi \\ &=\int_{-\infty}^{-A_{0}}(\xi-A)^{2} p_{\hat{A}}(\xi;A)\, d\xi+\int_{-A_{0}}^{A_{0}}(\xi-A)^{2} p_{\hat{A}}(\xi;A)\, d\xi +\int_{A_{0}}^{\infty}(\xi-A)^{2} p_{\hat{A}}(\xi;A)\, d\xi \\ &>\int_{-\infty}^{-A_{0}}(-A_0-A)^{2} p_{\hat{A}}(\xi;A)\, d\xi+\int_{-A_{0}}^{A_{0}}(\xi-A)^{2} p_{\hat{A}}(\xi;A)\, d\xi +\int_{A_{0}}^{\infty}(A_0-A)^{2} p_{\hat{A}}(\xi;A)\, d\xi \\ &= \left(-A_{0}-A\right)^{2} \Pr\{\bar x \le -A_0 \}+\int_{-A_{0}}^{A_{0}}(\xi-A)^{2} p_{\hat{A}}(\xi;A)\, d\xi +\left(A_{0}-A\right)^{2} \Pr\{\bar x \ge A_0 \} \\ &= \operatorname{mse}(\check{A}) \end{aligned}$$
Hence, $\check A$, the truncated sample mean estimator, is better than the sample mean estimator in terms of MSE. Although $\hat A$ is still the MVU estimator, we have been able to reduce the mean square error by allowing the estimator to be biased.
Inasmuch as we have been able to produce a better estimator, the question arises as to whether an optimal estimator exists for this problem. (In the classical case the MSE criterion of optimality usually led to unrealizable estimators. We shall see that this is not a problem in the Bayesian approach.)
With knowledge only of the interval and no inclination as to whether $A$ should be nearer any particular value, it makes sense to assign a $\mathcal U[-A_0,A_0]$ PDF to the random variable $A$. The overall data model is different from the classical approach.
Now we can incorporate our knowledge of how $A$ was chosen. Define the Bayesian MSE as
$$\mathrm{Bmse}(\hat A)=E\left[(A-\hat A)^2\right] \tag{BP.1}$$
To appreciate the difference, compare the classical MSE
$$\mathrm{mse}(\hat A)=\int(\hat A-A)^2 p(\mathbf x;A)\,d\mathbf x \tag{BP.2}$$
to the Bayesian MSE
$$\mathrm{Bmse}(\hat A)=\iint(A-\hat A)^2 p(\mathbf x,A)\,d\mathbf x\, dA\tag{BP.3}$$
where the error is now averaged over the joint PDF $p(\mathbf x,A)$.
Note that whereas the classical MSE will depend on $A$, and hence estimators that attempt to minimize the MSE will usually depend on $A$, the Bayesian MSE will not. In effect, we have integrated the parameter dependence away.
Now we can derive the estimator that minimizes the Bayesian MSE. First, we use Bayes’ theorem to write
$$p(\mathbf{x}, A)=p(A|\mathbf{x})\, p(\mathbf{x})$$
so that
$$\operatorname{Bmse}(\hat{A})=\int\left[\int(A-\hat{A})^{2} p(A|\mathbf{x})\, dA\right] p(\mathbf{x})\, d\mathbf{x}\tag{BP.4}$$
Now since $p(\mathbf x)\ge 0$ for all $\mathbf x$, if the integral in brackets can be minimized for each $\mathbf x$, then the Bayesian MSE will be minimized. Hence, fixing $\mathbf x$ so that $\hat A$ is a scalar variable, we have
$$\begin{aligned} \frac{\partial}{\partial \hat{A}} \int(A-\hat{A})^{2} p(A|\mathbf{x})\, dA &=\int \frac{\partial}{\partial \hat{A}}(A-\hat{A})^{2} p(A|\mathbf{x})\, dA \\ &=\int-2(A-\hat{A})\, p(A|\mathbf{x})\, dA \\ &=-2 \int A\, p(A|\mathbf{x})\, dA+2 \hat{A} \int p(A|\mathbf{x})\, dA \end{aligned}$$
which when set equal to zero results in
$$\hat{A}=\int A\, p(A|\mathbf{x})\, dA=E(A|\mathbf{x})\tag{BP.5}$$
since the conditional PDF must integrate to $1$. It is seen that the optimal estimator in terms of minimizing the Bayesian MSE is the mean of the posterior PDF $p(A|\mathbf{x})$. We will henceforth term the estimator that minimizes the Bayesian MSE the minimum mean square error (MMSE) estimator.
To determine the MMSE estimator, we first require the posterior PDF. We can use Bayes' rule to determine it as
$$p(A|\mathbf x)=\frac{p(\mathbf x|A)p(A)}{p(\mathbf x)}=\frac{p(\mathbf x|A)p(A)}{\int p(\mathbf x|A)p(A)\,dA}\tag{BP.6}$$
Note that the denominator is just a normalizing factor, independent of $A$, needed to ensure that $p(A|\mathbf x)$ integrates to $1$. Then the MMSE estimator is
$$\hat A=E(A|\mathbf x)=\frac{\int A\, p(\mathbf x|A)p(A)\,dA}{\int p(\mathbf x|A)p(A)\,dA}\tag{BP.7}$$
If we continue our example, we recall that the prior PDF $p(A)$ is $\mathcal U[-A_0,A_0]$. To specify the conditional PDF $p(\mathbf x|A)$ we need to further assume that the choice of $A$ via $p(A)$ does not affect the PDF of the noise samples, or that $w[n]$ is independent of $A$. Then for $n=0,1,\cdots,N-1$
$$\begin{aligned} p_x(x[n]|A)&=p_w(x[n]-A|A)\\ &= p_w(x[n]-A)\\ &=\frac{1}{\sqrt{2\pi\sigma^2}}\exp \left[-\frac{1}{2\sigma^2}(x[n]-A)^2 \right] \end{aligned}$$
and therefore
$$p(\mathbf x|A)=\frac{1}{(2\pi \sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n]-A)^2 \right] \tag{BP.8}$$
It is apparent that the PDF is identical in form to the usual classical PDF $p(\mathbf x;A)$. In the Bayesian case, however, the PDF is a conditional PDF, hence the "$|$" separator, while in the classical case it represents an unconditional PDF, albeit parameterized by $A$, hence the "$;$" separator.
Using (BP.7) and (BP.8), the posterior PDF becomes
$$p(A|\mathbf{x})=\begin{cases} \dfrac{\frac{1}{2 A_{0}\left(2 \pi \sigma^{2}\right)^{N/2}}\exp \left[-\frac{1}{2 \sigma^{2}} \sum_{n=0}^{N-1}(x[n]-A)^{2}\right]}{\int_{-A_0}^{A_0}\frac{1}{2 A_{0}\left(2 \pi \sigma^{2}\right)^{N/2}}\exp \left[-\frac{1}{2 \sigma^{2}} \sum_{n=0}^{N-1}(x[n]-A)^{2}\right]dA} & |A| \leq A_{0}\\ 0 & |A|>A_{0} \end{cases}$$
But
$$\sum_{n=0}^{N-1}(x[n]-A)^{2} =\sum_{n=0}^{N-1} x^{2}[n]-2 N A \bar{x}+N A^{2} =N(A-\bar{x})^{2}+\sum_{n=0}^{N-1} x^{2}[n]-N \bar{x}^{2}$$
Cancelling the factors that do not depend on $A$ (they appear in both numerator and denominator), we can write the posterior as a truncated Gaussian distribution
$$p(A|\mathbf{x})=\begin{cases} \dfrac{\frac{1}{\sqrt{2 \pi \sigma^{2}/N}} \exp \left[-\frac{1}{2 \sigma^{2}/N}(A-\bar{x})^{2}\right]}{\int_{-A_0}^{A_0}\frac{1}{\sqrt{2 \pi \sigma^{2}/N}} \exp \left[-\frac{1}{2 \sigma^{2}/N}(A-\bar{x})^{2}\right]dA} & |A| \leq A_{0} \\ 0 & |A|>A_{0} \end{cases}\tag{BP.9}$$
The final MMSE estimate is then given by
$$\hat{A}=\int_{-\infty}^{\infty} A\, p(A|\mathbf{x})\, dA=\frac{\int_{-A_{0}}^{A_{0}} A \frac{1}{\sqrt{2 \pi \sigma^{2}/N}} \exp \left[-\frac{1}{2 \sigma^{2}/N}(A-\bar{x})^{2}\right] dA}{\int_{-A_{0}}^{A_{0}} \frac{1}{\sqrt{2 \pi \sigma^{2}/N}} \exp \left[-\frac{1}{2 \sigma^{2}/N}(A-\bar{x})^{2}\right] dA} \tag{BP.10}$$
- In the absence of data (if $\mathbf x$ gives no information about $A$), $\hat A=E(A|\mathbf x)=E(A)=0$.
- If $A_0 \gg \sqrt{\sigma^2/N}$ (no truncation), e.g., as $N$ increases, $\hat A\to \bar x$:
$$\hat A=\lim_{A_0\to \infty}\frac{\int_{-A_{0}}^{A_{0}} A \frac{1}{\sqrt{2 \pi \sigma^{2}/N}} \exp \left[-\frac{1}{2 \sigma^{2}/N}(A-\bar{x})^{2}\right] dA}{\int_{-A_{0}}^{A_{0}} \frac{1}{\sqrt{2 \pi \sigma^{2}/N}} \exp \left[-\frac{1}{2 \sigma^{2}/N}(A-\bar{x})^{2}\right] dA}=\frac{\bar x}{1}=\bar x$$
The effect of the data is to position the posterior mean between $A=0$ and $A=\bar x$ in a compromise between the prior knowledge and that contributed by the data. To further appreciate this weighting, consider what happens as $N$ becomes large so that the data knowledge becomes more important. As shown in Figure 10.4, as $N$ increases, we have from (BP.9) that the posterior PDF becomes more concentrated about $\bar x$ (since $\sigma^2/N$ decreases). Hence, it becomes nearly Gaussian, and its mean becomes just $\bar x$. The MMSE estimator relies less and less on the prior knowledge and more on the data. It is said that the data "swamps out" the prior knowledge.
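This weighting behavior can be checked numerically. Below is a small sketch (not from the text; the values of $A_0$, $\sigma$, and the true $A$ are assumed for illustration) that evaluates the posterior mean (BP.10) for the uniform prior by grid integration:

```python
# Numerical sketch of the MMSE estimator (BP.10) under the U[-A0, A0] prior.
# Parameter values are illustrative assumptions, not from the text.
import numpy as np

A0, sigma, A_true = 1.0, 1.0, 0.4

def mmse_uniform_prior(x):
    """Posterior mean E(A|x) of the truncated Gaussian (BP.9) via grid integration."""
    N = len(x)
    xbar = x.mean()
    A = np.linspace(-A0, A0, 4001)
    # Unnormalized posterior on |A| <= A0; all constants cancel in the ratio (BP.10)
    w = np.exp(-(A - xbar) ** 2 / (2 * sigma**2 / N))
    return (A * w).sum() / w.sum()

rng = np.random.default_rng(0)
for N in (1, 10, 1000):
    x = A_true + sigma * rng.standard_normal(N)
    # The estimate always stays in [-A0, A0]; as N grows it approaches xbar
    print(N, x.mean(), mmse_uniform_prior(x))
```

As $N$ grows, the printed estimate approaches the sample mean, illustrating the data "swamping out" the prior.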
Choosing a Prior PDF
As shown in the previous section, once a prior PDF has been chosen, the MMSE estimator follows directly from (BP.10). The only practical stumbling block that remains, however, is whether or not $E(\theta|\mathbf x)$ can be determined in closed form. This motivates us to assume a Gaussian prior PDF instead of the uniform prior PDF:
$$p(A)=\frac{1}{\sqrt{2\pi\sigma_A^2}}\exp \left[-\frac{1}{2\sigma_A^2}(A-\mu_A)^2 \right]$$
With $\mu_A=0$ and $3\sigma_A=A_0$, the Gaussian prior PDF could be thought of as incorporating the knowledge that $|A|\le A_0$ (with high probability). From (BP.6) and (BP.8), we can formulate
$$\begin{aligned} p(A|\mathbf{x})&=\frac{p(\mathbf{x}|A) p(A)}{\int p(\mathbf{x}|A) p(A)\, dA} \\ &=\frac{\frac{1}{\left(2 \pi \sigma^{2}\right)^{N/2} \sqrt{2 \pi \sigma_{A}^{2}}} \exp \left[-\frac{1}{2 \sigma^{2}} \sum_{n=0}^{N-1} x^{2}[n]\right] \exp \left[-\frac{1}{2 \sigma^{2}}\left(N A^{2}-2 N A \bar{x}\right)\right]\exp \left[-\frac{1}{2 \sigma_{A}^{2}}\left(A-\mu_{A}\right)^{2}\right]} {\int_{-\infty}^{\infty} \frac{1}{\left(2 \pi \sigma^{2}\right)^{N/2} \sqrt{2 \pi \sigma_{A}^{2}}} \exp \left[-\frac{1}{2 \sigma^{2}} \sum_{n=0}^{N-1} x^{2}[n]\right] \exp \left[-\frac{1}{2 \sigma^{2}}\left(N A^{2}-2 N A \bar{x}\right)\right]\exp \left[-\frac{1}{2 \sigma_{A}^{2}}\left(A-\mu_{A}\right)^{2}\right] dA} \\ &=\frac{\exp \left[-\frac{1}{2}\left( \frac{1}{\sigma^{2}}\left(NA^2-2NA\bar x\right)+\frac{1}{ \sigma_{A}^{2}}\left(A-\mu_{A}\right)^{2} \right) \right]} {\int_{-\infty}^{\infty} \exp \left[-\frac{1}{2}\left( \frac{1}{\sigma^{2}}\left(NA^2-2NA\bar x\right)+\frac{1}{ \sigma_{A}^{2}}\left(A-\mu_{A}\right)^{2} \right) \right] dA} \\ &=\frac{\exp \left[-\frac{1}{2} Q(A)\right]}{\int_{-\infty}^{\infty} \exp \left[-\frac{1}{2} Q(A)\right] dA} \end{aligned}$$
Continuing, we have for $Q(A)$
$$\begin{aligned} Q(A) &=\frac{N}{\sigma^{2}} A^{2}-\frac{2 N A \bar{x}}{\sigma^{2}}+\frac{A^{2}}{\sigma_{A}^{2}}-\frac{2 \mu_{A} A}{\sigma_{A}^{2}}+\frac{\mu_{A}^{2}}{\sigma_{A}^{2}} \\ &=\left(\frac{N}{\sigma^{2}}+\frac{1}{\sigma_{A}^{2}}\right) A^{2}-2\left(\frac{N}{\sigma^{2}} \bar{x}+\frac{\mu_{A}}{\sigma_{A}^{2}}\right) A+\frac{\mu_{A}^{2}}{\sigma_{A}^{2}} \end{aligned}$$
Let
$$\begin{aligned} \sigma_{A \mid x}^{2} &=\frac{1}{\frac{N}{\sigma^{2}}+\frac{1}{\sigma_{A}^{2}}} \\ \mu_{A \mid x} &=\left(\frac{N}{\sigma^{2}} \bar{x}+\frac{\mu_{A}}{\sigma_{A}^{2}}\right) \sigma_{A \mid x}^{2} \end{aligned}\tag{BP.11}$$
Then, by completing the square we have
$$Q(A)=\frac{1}{\sigma_{A \mid x}^{2}}\left(A-\mu_{A \mid x} \right)^2-\frac{\mu_{A \mid x}^{2}}{\sigma_{A \mid x}^{2}}+\frac{\mu_{A}^{2}}{\sigma_{A}^{2}}$$
so that
$$\begin{aligned} p(A|\mathbf{x}) &=\frac{\exp \left[-\frac{1}{2 \sigma_{A \mid x}^{2}}\left(A-\mu_{A \mid x}\right)^{2}\right] \exp \left[-\frac{1}{2}\left(\frac{\mu_{A}^{2}}{\sigma_{A}^{2}}-\frac{\mu_{A \mid x}^{2}}{\sigma_{A \mid x}^{2}}\right)\right]}{\int_{-\infty}^{\infty} \exp \left[-\frac{1}{2 \sigma_{A \mid x}^{2}}\left(A-\mu_{A \mid x}\right)^{2}\right] \exp \left[-\frac{1}{2}\left(\frac{\mu_{A}^{2}}{\sigma_{A}^{2}}-\frac{\mu_{A \mid x}^{2}}{\sigma_{A \mid x}^{2}}\right)\right] dA} \\ &=\frac{1}{\sqrt{2 \pi \sigma_{A \mid x}^{2}}} \exp \left[-\frac{1}{2 \sigma_{A \mid x}^{2}}\left(A-\mu_{A \mid x}\right)^{2}\right] \end{aligned}$$
where the last step follows from the requirement that $p(A|\mathbf{x})$ integrate to $1$. The posterior PDF is thus also Gaussian: $A|\mathbf x\sim \mathcal N(\mu_{A|x},\sigma^2_{A|x})$.
In this form the MMSE estimator is readily found as
$$\begin{aligned} \hat A&=E(A|\mathbf x)=\mu_{A|x}=\frac{\frac{N}{\sigma^{2}} \bar{x}+\frac{\mu_{A}}{\sigma_{A}^{2}}}{\frac{N}{\sigma^{2}}+\frac{1}{\sigma_{A}^{2}}}\\ &=\frac{\sigma_A^2}{\sigma_A^2+\frac{\sigma^2}{N}}\bar x+\frac{\frac{\sigma^2}{N}}{\sigma_A^2+\frac{\sigma^2}{N}}\mu_A\\ &=\alpha \bar x+(1-\alpha)\mu_A \end{aligned}\tag{BP.12}$$
where
$$\alpha =\frac{\sigma_A^2}{\sigma_A^2+\frac{\sigma^2}{N}}.$$
Note that $\alpha$ is a weighting factor since $0<\alpha<1$. It is interesting to examine the interplay between the prior knowledge and the data.
- If there is little data, so that $\sigma_A^2\ll \sigma^2/N$, then $\alpha$ is small and $\hat A\approx \mu_A$.
- If more data are observed, so that $\sigma_A^2\gg \sigma^2/N$, then $\alpha \approx 1$ and $\hat A\approx \bar x$.
Alternatively, we may view this process by examining the posterior PDF. As $N$ increases,
$$\operatorname{var}(A|\mathbf x)=\sigma_{A \mid x}^{2} =\frac{1}{\frac{N}{\sigma^{2}}+\frac{1}{\sigma_{A}^{2}}}$$
will decrease. Also, the posterior mean or $\hat A$ of (BP.12) will approach $\bar x$.
Observe that if there is no prior knowledge, which can be modeled by letting $\sigma_A^2\to \infty$, then $\hat A\to \bar x$ for any data record length. The classical estimator is obtained.
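The closed form (BP.12) is simple enough to state directly in code. A minimal sketch, with parameter values assumed for illustration:

```python
# Sketch of the Gaussian-prior MMSE estimator (BP.12):
# A_hat = alpha*xbar + (1 - alpha)*mu_A, a weighted average of data and prior.
import numpy as np

def mmse_gaussian_prior(x, sigma2, mu_A, sigma2_A):
    """Posterior mean for x[n] = A + w[n], w[n] ~ N(0, sigma2), A ~ N(mu_A, sigma2_A)."""
    N = len(x)
    alpha = sigma2_A / (sigma2_A + sigma2 / N)  # weight on the data, 0 < alpha < 1
    return alpha * np.mean(x) + (1 - alpha) * mu_A

# With one noisy sample and equal prior/noise variances, the estimate
# splits the difference between the sample and the prior mean.
print(mmse_gaussian_prior(np.array([2.0]), sigma2=1.0, mu_A=0.0, sigma2_A=1.0))  # 1.0
```

As $\sigma_A^2$ grows (or as $N$ grows), the returned value moves from $\mu_A$ toward $\bar x$, matching the bullet points above.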
Finally, we are going to prove that by using prior knowledge we could improve the estimation accuracy. Recall that
$$\operatorname{Bmse}(\hat{A})=\int\left[\int(A-\hat{A})^{2} p(A|\mathbf{x})\, dA\right] p(\mathbf{x})\, d\mathbf{x}\tag{BP.4}$$
Since $\hat A=E(A|\mathbf x)$, we have
$$\begin{aligned} \operatorname{Bmse}(\hat{A})&=\int\left[\int\left(A-E(A|\mathbf x)\right)^{2} p(A|\mathbf{x})\, dA\right] p(\mathbf{x})\, d\mathbf{x}\\ &=\int\operatorname{var}(A|\mathbf x)\,p(\mathbf{x})\, d\mathbf{x}\\ &=\int\frac{1}{\frac{N}{\sigma^{2}}+\frac{1}{\sigma_{A}^{2}}}\,p(\mathbf{x})\, d\mathbf{x}\\ &=\frac{1}{\frac{N}{\sigma^{2}}+\frac{1}{\sigma_{A}^{2}}} \end{aligned} \tag{BP.13}$$
This can be rewritten as
$$\operatorname{Bmse}(\hat A)=\frac{\sigma^2}{N}\left(\frac{\sigma_A^2}{\sigma_A^2+\frac{\sigma^2}{N}}\right)<\frac{\sigma^2}{N}\tag{BP.14}$$
where $\sigma^2/N$ is the minimum MSE obtained when no prior knowledge is available (let $\sigma^2_A\to \infty$). Clearly, any prior knowledge, when modeled in the Bayesian sense, will improve our Bayesian estimator.
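A quick Monte Carlo sketch (my own check, with assumed parameter values) confirms (BP.13) and (BP.14): the empirical Bayesian MSE of the posterior mean matches $1/(N/\sigma^2+1/\sigma_A^2)$ and beats $\sigma^2/N$:

```python
# Monte Carlo check of (BP.13)/(BP.14). Parameter values are assumed.
import numpy as np

rng = np.random.default_rng(2)
mu_A, sigma2_A, sigma2, N, trials = 0.0, 0.5, 1.0, 5, 200_000

A = mu_A + np.sqrt(sigma2_A) * rng.standard_normal(trials)    # A ~ N(mu_A, sigma_A^2)
xbar = A + np.sqrt(sigma2 / N) * rng.standard_normal(trials)  # sample mean: N(A, sigma^2/N)
alpha = sigma2_A / (sigma2_A + sigma2 / N)
A_hat = alpha * xbar + (1 - alpha) * mu_A                     # MMSE estimator (BP.12)

bmse_mc = np.mean((A - A_hat) ** 2)                 # empirical Bayesian MSE
bmse_theory = 1.0 / (N / sigma2 + 1.0 / sigma2_A)   # (BP.13)
print(bmse_mc, bmse_theory, sigma2 / N)             # bmse_mc ~ bmse_theory < sigma^2/N
```

Note that only the sufficient statistic $\bar x \sim \mathcal N(A,\sigma^2/N)$ is simulated, which is enough since (BP.12) depends on the data only through $\bar x$.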
Properties of the Gaussian PDF
We now generalize the results of the previous section by examining the properties of the Gaussian PDF. The bivariate Gaussian PDF is first investigated to illustrate the important properties. Then, the corresponding results for the general multivariate Gaussian PDF are described.
Consider a jointly Gaussian random vector $[x~~y]^{T}$ whose PDF is
$$p(x, y)=\frac{1}{2 \pi \operatorname{det}^{\frac{1}{2}}(\mathbf{C})} \exp \left[-\frac{1}{2}\begin{bmatrix} x-E(x) \\ y-E(y) \end{bmatrix}^{T} \mathbf{C}^{-1}\begin{bmatrix} x-E(x) \\ y-E(y) \end{bmatrix}\right]\tag{BF.15}$$
This is termed the bivariate Gaussian PDF. The mean vector and covariance matrix are
$$\begin{aligned} E\left(\begin{bmatrix}x\\y\end{bmatrix}\right)&=\begin{bmatrix}E(x)\\E(y)\end{bmatrix}\\ \mathbf C&=\begin{bmatrix}\operatorname{var}(x) & \operatorname{cov}(x,y)\\\operatorname{cov}(x,y) & \operatorname{var}(y)\end{bmatrix} \end{aligned}$$
Note that the marginal PDFs $p(x)$ and $p(y)$ are also Gaussian, as can be verified by the integrations
$$\begin{aligned} p(x)&=\int_{-\infty}^{\infty} p(x, y)\, dy=\frac{1}{\sqrt{2 \pi \operatorname{var}(x)}} \exp \left[-\frac{1}{2 \operatorname{var}(x)}(x-E(x))^{2}\right] \\ p(y)&=\int_{-\infty}^{\infty} p(x, y)\, dx=\frac{1}{\sqrt{2 \pi \operatorname{var}(y)}} \exp \left[-\frac{1}{2 \operatorname{var}(y)}(y-E(y))^{2}\right] \end{aligned}\tag{BF.16}$$
The contours along which the PDF $p(x,y)$ is constant are those values of $x$ and $y$ for which
$$\begin{bmatrix} x-E(x) \\ y-E(y) \end{bmatrix}^{T} \mathbf{C}^{-1}\begin{bmatrix} x-E(x) \\ y-E(y) \end{bmatrix}$$
is a constant. They are shown in Figure 10.6 as elliptical contours.
Once $x$, say $x_{0}$, is observed, the conditional PDF of $y$ becomes
$$p\left(y|x_{0}\right)=\frac{p\left(x_{0}, y\right)}{p\left(x_{0}\right)}=\frac{p\left(x_{0}, y\right)}{\int_{-\infty}^{\infty} p\left(x_{0}, y\right) dy}\tag{BF.17}$$
Since from (BF.15) the exponential argument is quadratic in $y$, $p(x_0,y)$ has the Gaussian form in $y$, and thus the conditional PDF $p(y|x_0)$ must also be Gaussian. To summarize,
Theorem 1 (Conditional PDF of Bivariate Gaussian) If $x$ and $y$ are distributed according to a bivariate Gaussian PDF with mean vector $[E(x)~~E(y)]^{T}$ and covariance matrix
$$\mathbf{C}=\begin{bmatrix} \operatorname{var}(x) & \operatorname{cov}(x, y) \\ \operatorname{cov}(y, x) & \operatorname{var}(y) \end{bmatrix}$$
so that
$$p(x, y)=\frac{1}{2 \pi \operatorname{det}^{\frac{1}{2}}(\mathbf{C})} \exp \left[-\frac{1}{2}\begin{bmatrix} x-E(x) \\ y-E(y) \end{bmatrix}^{T} \mathbf{C}^{-1}\begin{bmatrix} x-E(x) \\ y-E(y) \end{bmatrix}\right]$$
then the conditional PDF $p(y|x)$ is also Gaussian and
$$E(y|x) =E(y)+\frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)}(x-E(x)) \tag{BF.18}$$
$$\operatorname{var}(y|x) =\operatorname{var}(y)-\frac{\operatorname{cov}^{2}(x, y)}{\operatorname{var}(x)}\tag{BF.19}$$
Assuming that $x$ and $y$ are not independent, and hence $\operatorname{cov}(x,y)\neq 0$, the posterior PDF becomes more concentrated since there is less uncertainty about $y$. To verify this, note from (BF.19) that
$$\begin{aligned} \operatorname{var}(y|x) &=\operatorname{var}(y)\left[1-\frac{\operatorname{cov}^{2}(x, y)}{\operatorname{var}(x) \operatorname{var}(y)}\right] \\ &=\operatorname{var}(y)\left(1-\rho^{2}\right) \end{aligned}\tag{BF.20}$$
where
$$\rho=\frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x) \operatorname{var}(y)}}\tag{BF.21}$$
is the correlation coefficient, satisfying $|\rho|\le 1$. From our previous discussions, we also realize that $E(y|x)$ is the MMSE estimator of $y$ after observing $x$, so that from (BF.18)
$$\hat y=E(y)+\frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)}(x-E(x)) \tag{BF.22}$$
In normalized form (a random variable with zero mean and unity variance) this becomes
$$\frac{\hat y-E(y)}{\sqrt{\operatorname{var}(y)}}=\frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x) \operatorname{var}(y)}}\cdot\frac{x-E(x)}{\sqrt{\operatorname{var}(x)}}$$
or
$$\hat y_n=\rho x_n\tag{BF.23}$$
If the random variables are already normalized ($E(x) = E(y) = 0$, $\operatorname{var}(x) = \operatorname{var}(y) = 1$), the constant PDF contours appear as in Figure 10.7.
The locations of the peaks of $p(x,y)$, when considered as a function of $y$ for each $x$, lie along the dashed line $y = \rho x$, and it is readily shown that $\hat y = E(y|x) = \rho x$.
The MMSE estimator therefore exploits the correlation between the random variables to estimate the realization of one based on the realization of the other. The minimum MSE is, from (BP.13) and (BF.20),
$$\begin{aligned} \operatorname{Bmse}(\hat y)&=\int \operatorname{var}(y|x)\,p(x)\,dx\\ &=\operatorname{var}(y|x)\\ &=\operatorname{var}(y)(1-\rho^2) \end{aligned}\tag{BF.24}$$
where the second equality holds because $\operatorname{var}(y|x)$ does not depend on $x$.
Hence, the quality of our estimator also depends on the correlation coefficient, which is a measure of the statistical dependence between $x$ and $y$.
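A simulation sketch (with an assumed $\rho$ and normalized variables) of (BF.23)/(BF.24): the estimator $\hat y=\rho x$ attains a Bayesian MSE of about $1-\rho^2$:

```python
# Sketch verifying y_hat = rho*x and Bmse = var(y)(1 - rho^2) for normalized
# jointly Gaussian (x, y). rho is an assumed example value.
import numpy as np

rng = np.random.default_rng(3)
rho, n = 0.8, 500_000

x = rng.standard_normal(n)
y = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)  # corr(x, y) = rho

y_hat = rho * x                     # MMSE estimator (BF.23)
bmse = np.mean((y - y_hat) ** 2)    # ~ 1 - rho^2 (BF.24)
print(bmse)
```

With $\rho=0.8$ the empirical error variance comes out near $1-0.8^2=0.36$, well below the prior variance $\operatorname{var}(y)=1$.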
To generalize these results, consider a jointly Gaussian vector $\left[\mathbf{x}^{T}~\mathbf{y}^{T}\right]^{T}$, where $\mathbf{x}$ is $k \times 1$ and $\mathbf{y}$ is $l \times 1$. In other words, $\left[\mathbf{x}^{T}~\mathbf{y}^{T}\right]^{T}$ is distributed according to a multivariate Gaussian PDF. Then the conditional PDF of $\mathbf{y}$ for a given $\mathbf{x}$ is also Gaussian, as summarized in the following theorem (see Appendix 10A for the proof).
Theorem 10.2 (Conditional PDF of Multivariate Gaussian) If x \mathbf{x} x and y \mathbf y y are jointly Gaussian, where x \mathbf{x} x is k × 1 k \times 1 k×1 and y \mathbf{y} y is l × 1 , l \times 1, l×1, with mean vector [ E ( x ) T E ( y ) T ] T \left[E(\mathbf{x})^{T} E(\mathbf{y})^{T}\right]^{T} [E(x)TE(y)T]T and
partitioned covariance matrix
C = [ C x x C x y C y x C y y ] = [ k × k k × l l × k l × l ] (BF.25) \mathbf{C}=\left[\begin{array}{ll} \mathbf{C}_{x x} & \mathbf{C}_{x y} \\ \mathbf{C}_{y x} & \mathbf{C}_{y y} \end{array}\right]=\left[\begin{array}{ll} k \times k & k \times l \\ l \times k & l \times l \end{array}\right]\tag{BF.25} C=[CxxCyxCxyCyy]=[k×kl×kk×ll×l](BF.25)
so that
p ( x , y ) = 1 ( 2 π ) k + 1 2 det 1 2 ( C ) exp [ − 1 2 ( [ x − E ( x ) y − E ( y ) ] ) T C − 1 ( [ x − E ( x ) y − E ( y ) ] ) ] p(\mathbf{x}, \mathbf{y})=\frac{1}{(2 \pi)^{\frac{k+1}{2}} \operatorname{det}^{\frac{1}{2}}(\mathbf{C})} \exp \left[-\frac{1}{2}\left(\left[\begin{array}{l} \mathbf{x}-E(\mathbf{x}) \\ \mathbf{y}-E(\mathbf{y}) \end{array}\right]\right)^{T} \mathbf{C}^{-1}\left(\left[\begin{array}{l} \mathbf{x}-E(\mathbf{x}) \\ \mathbf{y}-E(\mathbf{y}) \end{array}\right]\right)\right] p(x,y)=(2π)2k+1det21(C)1exp[−21([x−E(x)y−E(y)])TC−1([x−E(x)y−E(y)])]
Then the conditional PDF $p(\mathbf y|\mathbf x)$ is also Gaussian, with
$$E(\mathbf y|\mathbf x)=E(\mathbf y)+\mathbf C_{yx}\mathbf C_{xx}^{-1}(\mathbf x-E(\mathbf x))\tag{BF.26}$$
$$\mathbf C_{y|x}=\mathbf C_{yy}-\mathbf C_{yx}\mathbf C_{xx}^{-1}\mathbf C_{xy}\tag{BF.27}$$
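As a quick numerical sanity check (a numpy sketch with illustrative dimensions, not part of the text), note that (BF.27) is the Schur complement of $\mathbf C_{xx}$ in $\mathbf C$, so its inverse must equal the lower-right block of $\mathbf C^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)
k, l = 3, 2  # illustrative dimensions

# Random symmetric positive-definite joint covariance C for [x^T y^T]^T.
A = rng.standard_normal((k + l, k + l))
C = A @ A.T + (k + l) * np.eye(k + l)

Cxx, Cxy = C[:k, :k], C[:k, k:]
Cyx, Cyy = C[k:, :k], C[k:, k:]

# (BF.27): conditional covariance of y given x (Schur complement of Cxx).
C_y_given_x = Cyy - Cyx @ np.linalg.inv(Cxx) @ Cxy

# Block-inverse identity: the lower-right l x l block of C^{-1}
# is the inverse of the Schur complement.
lower_right = np.linalg.inv(C)[k:, k:]
assert np.allclose(np.linalg.inv(lower_right), C_y_given_x)
```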
Bayesian Linear Model
Let the data be modeled as
$$\mathbf x=\mathbf H\boldsymbol\theta+\mathbf w\tag{BF.28}$$
where $\mathbf x$ is an $N\times 1$ data vector, $\mathbf H$ is a known $N\times p$ matrix, $\boldsymbol\theta$ is a $p\times 1$ random vector with prior PDF $\mathcal N(\boldsymbol\mu_\theta,\mathbf C_\theta)$, and $\mathbf w$ is an $N\times 1$ noise vector with PDF $\mathcal N(\mathbf 0,\mathbf C_w)$, independent of $\boldsymbol\theta$. This data model is termed the Bayesian general linear model. It differs from the classical general linear model in that $\boldsymbol\theta$ is modeled as a random variable with a Gaussian prior PDF.
From Theorem 10.2 we know that if $\mathbf x$ and $\boldsymbol\theta$ are jointly Gaussian, then the posterior PDF is also Gaussian. Hence, it only remains to verify that this is indeed the case. Let $\mathbf z=[\mathbf x^T\ \boldsymbol\theta^T]^T$, so that from $(BF.28)$ we have
$$\mathbf z=\begin{bmatrix}\mathbf x\\ \boldsymbol\theta\end{bmatrix}=\begin{bmatrix}\mathbf H\boldsymbol\theta+\mathbf w\\ \boldsymbol\theta\end{bmatrix}=\begin{bmatrix}\mathbf H & \mathbf I\\ \mathbf I & \mathbf 0\end{bmatrix}\begin{bmatrix}\boldsymbol\theta\\ \mathbf w\end{bmatrix}$$
Since $\boldsymbol\theta$ and $\mathbf w$ are independent of each other and each one is Gaussian, they are jointly Gaussian. Furthermore, because $\mathbf z$ is a linear transformation of a Gaussian vector, it too is Gaussian. Hence, Theorem 10.2 applies directly, and we need only determine the mean and covariance of the posterior PDF.
We can obtain the means and covariances (identifying $\mathbf y$ in Theorem 10.2 with $\boldsymbol\theta$; the cross terms below vanish because $\boldsymbol\theta$ and $\mathbf w$ are independent and $\mathbf w$ is zero mean):
$$\begin{aligned} E(\mathbf x)&=E(\mathbf H\boldsymbol\theta+\mathbf w)=\mathbf H E(\boldsymbol\theta)=\mathbf H\boldsymbol\mu_\theta\\ E(\boldsymbol\theta)&=\boldsymbol\mu_\theta\\ \mathbf C_{xx}&=E[(\mathbf x-E(\mathbf x))(\mathbf x-E(\mathbf x))^T]\\ &=E[(\mathbf H(\boldsymbol\theta-\boldsymbol\mu_\theta)+\mathbf w)(\mathbf H(\boldsymbol\theta-\boldsymbol\mu_\theta)+\mathbf w)^T]\\ &=\mathbf H\mathbf C_\theta\mathbf H^T+\mathbf C_w\\ \mathbf C_{\theta x}&=E[(\boldsymbol\theta-E(\boldsymbol\theta))(\mathbf x-E(\mathbf x))^T]\\ &=E[(\boldsymbol\theta-\boldsymbol\mu_\theta)(\mathbf H(\boldsymbol\theta-\boldsymbol\mu_\theta)+\mathbf w)^T]\\ &=\mathbf C_\theta\mathbf H^T \end{aligned}$$
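These covariance blocks can also be checked deterministically (a numpy sketch with illustrative dimensions, not from the text): since $\mathbf z=[\mathbf x^T\ \boldsymbol\theta^T]^T$ is a linear transform of $[\boldsymbol\theta^T\ \mathbf w^T]^T$, its covariance is $\mathbf A\,\mathrm{diag}(\mathbf C_\theta,\mathbf C_w)\,\mathbf A^T$ with $\mathbf A=[\mathbf H\ \mathbf I;\ \mathbf I\ \mathbf 0]$, and the blocks must match the formulas above:

```python
import numpy as np

def spd(M, n):
    # Random symmetric positive-definite matrix built from M.
    return M @ M.T + n * np.eye(n)

rng = np.random.default_rng(1)
N, p = 5, 2  # illustrative dimensions

H = rng.standard_normal((N, p))
Ct = spd(rng.standard_normal((p, p)), p)   # prior covariance C_theta
Cw = spd(rng.standard_normal((N, N)), N)   # noise covariance C_w

# z = [x; theta] = A [theta; w] with A = [[H, I], [I, 0]].
A = np.block([[H, np.eye(N)], [np.eye(p), np.zeros((p, N))]])
Sigma = np.block([[Ct, np.zeros((p, N))], [np.zeros((N, p)), Cw]])
Cz = A @ Sigma @ A.T

Cxx = Cz[:N, :N]   # should equal H C_theta H^T + C_w
Ctx = Cz[N:, :N]   # should equal C_theta H^T
assert np.allclose(Cxx, H @ Ct @ H.T + Cw)
assert np.allclose(Ctx, Ct @ H.T)
```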
We can now summarize our results for the Bayesian general linear model.
Theorem 10.3 (Posterior PDF for the Bayesian General Linear Model) If the observed data $\mathbf x$ can be modeled as
$$\mathbf x=\mathbf H\boldsymbol\theta+\mathbf w$$
where $\mathbf x$ is an $N\times 1$ data vector, $\mathbf H$ is a known $N\times p$ matrix, $\boldsymbol\theta$ is a $p\times 1$ random vector with prior PDF $\mathcal N(\boldsymbol\mu_\theta,\mathbf C_\theta)$, and $\mathbf w$ is an $N\times 1$ noise vector with PDF $\mathcal N(\mathbf 0,\mathbf C_w)$ and independent of $\boldsymbol\theta$, then the posterior PDF $p(\boldsymbol\theta|\mathbf x)$ is Gaussian with mean
$$E(\boldsymbol\theta|\mathbf x)=\boldsymbol\mu_\theta+\mathbf C_\theta\mathbf H^T(\mathbf H\mathbf C_\theta\mathbf H^T+\mathbf C_w)^{-1}(\mathbf x-\mathbf H\boldsymbol\mu_\theta)\tag{BF.29}$$
and covariance
$$\mathbf C_{\theta|x}=\mathbf C_\theta-\mathbf C_\theta\mathbf H^T(\mathbf H\mathbf C_\theta\mathbf H^T+\mathbf C_w)^{-1}\mathbf H\mathbf C_\theta\tag{BF.30}$$
In contrast to the classical general linear model, $\mathbf H$ need not be full rank to ensure the invertibility of $\mathbf H\mathbf C_\theta\mathbf H^T+\mathbf C_w$.
Alternative formulation using the matrix inversion lemma:
$$E(\boldsymbol\theta|\mathbf x)=\boldsymbol\mu_\theta+(\mathbf C_\theta^{-1}+\mathbf H^T\mathbf C_w^{-1}\mathbf H)^{-1}\mathbf H^T\mathbf C_w^{-1}(\mathbf x-\mathbf H\boldsymbol\mu_\theta)\tag{BF.31}$$
$$\mathbf C_{\theta|x}=(\mathbf C_\theta^{-1}+\mathbf H^T\mathbf C_w^{-1}\mathbf H)^{-1}\tag{BF.32}$$
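The two formulations trade an $N\times N$ inversion for a $p\times p$ one; their equivalence is easy to confirm numerically (a numpy sketch with illustrative sizes, not from the text):

```python
import numpy as np

def spd(M, n):
    # Random symmetric positive-definite matrix built from M.
    return M @ M.T + n * np.eye(n)

rng = np.random.default_rng(2)
N, p = 6, 3  # illustrative sizes
inv = np.linalg.inv

H = rng.standard_normal((N, p))
mu = rng.standard_normal(p)
x = rng.standard_normal(N)
Ct = spd(rng.standard_normal((p, p)), p)   # C_theta
Cw = spd(rng.standard_normal((N, N)), N)   # C_w

# (BF.29)/(BF.30): invert an N x N matrix.
G = inv(H @ Ct @ H.T + Cw)
mean_29 = mu + Ct @ H.T @ G @ (x - H @ mu)
cov_30 = Ct - Ct @ H.T @ G @ H @ Ct

# (BF.31)/(BF.32): invert a p x p matrix via the matrix inversion lemma.
P = inv(inv(Ct) + H.T @ inv(Cw) @ H)
mean_31 = mu + P @ H.T @ inv(Cw) @ (x - H @ mu)

assert np.allclose(mean_29, mean_31)
assert np.allclose(cov_30, P)
```

When $p \ll N$ the (BF.31)/(BF.32) form is the cheaper one to evaluate.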
We illustrate the use of these formulas by applying them to the example of a DC level in WGN.
Let us assume now that the prior distribution of $A$ is Gaussian, $A\sim\mathcal N(\mu_A,\sigma_A^2)$, and $w[n]$ is white Gaussian noise, i.e., $w[n]\sim\mathcal N(0,\sigma^2)$ for $n=0,\ldots,N-1$, so that
$$\mathbf x=\mathbf 1 A+\mathbf w$$
Then $\mathbf x$ and $A$ are jointly Gaussian ($k=N$ and $l=1$), with mean
$$E(A|\mathbf x)=\mu_A+\sigma_A^2\mathbf 1^T(\sigma_A^2\mathbf 1\mathbf 1^T+\sigma^2\mathbf I)^{-1}(\mathbf x-\mathbf 1\mu_A)$$
Using Woodbury's identity,
$$\left(\mathbf I+\frac{\sigma_A^2}{\sigma^2}\mathbf 1\mathbf 1^T\right)^{-1}=\mathbf I-\frac{\frac{\sigma_A^2}{\sigma^2}\mathbf 1\mathbf 1^T}{1+N\frac{\sigma_A^2}{\sigma^2}}$$
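This rank-one inversion identity is easy to verify numerically (a short numpy sketch with illustrative variances, not from the text):

```python
import numpy as np

N = 8
sigma2, sigmaA2 = 1.5, 2.0   # illustrative noise and prior variances
c = sigmaA2 / sigma2
one = np.ones((N, 1))

# Woodbury (rank-one) identity: (I + c*11^T)^{-1} = I - c*11^T / (1 + N*c).
lhs = np.linalg.inv(np.eye(N) + c * (one @ one.T))
rhs = np.eye(N) - (c * (one @ one.T)) / (1 + N * c)
assert np.allclose(lhs, rhs)
```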
so that
$$\begin{aligned} \hat A=E(A|\mathbf x)&=\mu_A+\frac{\sigma_A^2}{\sigma^2}\mathbf 1^T\left(\mathbf I-\frac{\mathbf 1\mathbf 1^T}{N+\frac{\sigma^2}{\sigma_A^2}}\right)(\mathbf x-\mathbf 1\mu_A)\\ &=\mu_A+\frac{\sigma_A^2}{\sigma^2}\left(\mathbf 1^T-\frac{N}{N+\frac{\sigma^2}{\sigma_A^2}}\mathbf 1^T\right)(\mathbf x-\mathbf 1\mu_A)\\ &=\mu_A+\frac{\sigma_A^2}{\sigma^2}\left(1-\frac{N}{N+\frac{\sigma^2}{\sigma_A^2}}\right)(N\bar x-N\mu_A)\\ &=\mu_A+\frac{\sigma_A^2}{\sigma_A^2+\frac{\sigma^2}{N}}(\bar x-\mu_A)\\ &=\frac{\sigma_A^2}{\sigma_A^2+\frac{\sigma^2}{N}}\bar x+\frac{\frac{\sigma^2}{N}}{\sigma_A^2+\frac{\sigma^2}{N}}\mu_A\\ &=\alpha\bar x+(1-\alpha)\mu_A \end{aligned}$$
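The scalar shrinkage form $\alpha\bar x+(1-\alpha)\mu_A$ should agree exactly with the general formula (BF.29) evaluated with $\mathbf H=\mathbf 1$; a quick numpy check on simulated data (illustrative parameter values, not from the text):

```python
import numpy as np

rng = np.random.default_rng(3)
N, muA, sigmaA2, sigma2 = 10, 0.5, 2.0, 1.0   # illustrative parameters

# Draw A from its prior, then data x[n] = A + w[n].
A = rng.normal(muA, np.sqrt(sigmaA2))
x = A + rng.normal(0.0, np.sqrt(sigma2), N)

# Scalar shrinkage form: alpha*xbar + (1 - alpha)*muA.
alpha = sigmaA2 / (sigmaA2 + sigma2 / N)
A_hat = alpha * x.mean() + (1 - alpha) * muA

# General form (BF.29) with H = 1 (the all-ones vector).
one = np.ones((N, 1))
G = np.linalg.inv(sigmaA2 * (one @ one.T) + sigma2 * np.eye(N))
A_hat_general = muA + sigmaA2 * (one.T @ G @ (x[:, None] - one * muA)).item()

assert np.allclose(A_hat, A_hat_general)
```

As $N\to\infty$, $\alpha\to 1$ and the estimator approaches the sample mean; for small $N$ it is pulled toward the prior mean $\mu_A$.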
which is exactly the same as $(BP.12)$. And
$$\begin{aligned} \operatorname{Bmse}(\hat A)&=\int\left[\int(A-E(A|\mathbf x))^2 p(A|\mathbf x)\,dA\right]p(\mathbf x)\,d\mathbf x\\ &=\int\operatorname{var}(A|\mathbf x)p(\mathbf x)\,d\mathbf x=\operatorname{var}(A|\mathbf x)\\ &=\sigma_A^2-\frac{\sigma_A^2}{\sigma^2}\mathbf 1^T\left(\mathbf I-\frac{\mathbf 1\mathbf 1^T}{N+\frac{\sigma^2}{\sigma_A^2}}\right)\mathbf 1\sigma_A^2\\ &=\frac{1}{\frac{N}{\sigma^2}+\frac{1}{\sigma_A^2}} \end{aligned}$$
where the outer integral disappears because the posterior variance does not depend on $\mathbf x$,
which is just $(BP.13)$.
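Equivalently, this Bmse is what (BF.32) gives with $\mathbf H=\mathbf 1$, $\mathbf C_\theta=\sigma_A^2$, and $\mathbf C_w=\sigma^2\mathbf I$; a small numpy check (illustrative values, not from the text):

```python
import numpy as np

N, sigmaA2, sigma2 = 10, 2.0, 1.0   # illustrative values

# (BF.32) with H = 1 (N x 1), C_theta = sigmaA2 (scalar), C_w = sigma2 * I:
one = np.ones((N, 1))
post_var = np.linalg.inv(
    np.array([[1.0 / sigmaA2]]) + one.T @ (np.eye(N) / sigma2) @ one
).item()

# Closed form 1 / (N/sigma2 + 1/sigmaA2), i.e. (BP.13):
assert np.isclose(post_var, 1.0 / (N / sigma2 + 1.0 / sigmaA2))
```

Note that the posterior variance is always smaller than the prior variance $\sigma_A^2$: the data can only reduce the uncertainty about $A$.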