This post contains reading notes for *Introduction to Probability*.
Bayesian Least Mean Squares Estimation
- In this section, we discuss in more detail the conditional expectation estimator. In particular, we show that it results in the least possible mean squared error (LMS).
- We start by considering the simpler problem of estimating $\Theta$ with a constant $\hat\theta$, in the absence of an observation $X$. The estimation error $\hat\theta-\Theta$ is random (because $\Theta$ is random), but the mean squared error $E[(\hat\theta-\Theta)^2]$ is a number that depends on $\hat\theta$, and can be minimized over $\hat\theta$.
$$E[(\hat\theta-\Theta)^2]=var(\hat\theta-\Theta)+(E[\hat\theta-\Theta])^2=var(\Theta)+(E[\Theta]-\hat\theta)^2$$
It turns out that the best possible estimate is to set $\hat\theta$ equal to $E[\Theta]$.
- Suppose now that we use an observation $X$ to estimate $\Theta$, so as to minimize the mean squared error. Once we know the value $x$ of $X$, the situation is identical to the one considered earlier, except that we are now in a new "universe," where everything is conditioned on $X=x$. We can therefore adapt our earlier conclusion and assert that the conditional expectation $E[\Theta|X=x]$ minimizes the conditional mean squared error $E[(\hat\theta-\Theta)^2|X=x]$ over all constants $\hat\theta$.
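The claim that the best constant estimate is $E[\Theta]$, with minimum error $var(\Theta)$, is easy to check numerically. Below is a minimal Monte Carlo sketch (not from the text; all names are illustrative), using $\Theta$ uniform on $[4,10]$ so that $E[\Theta]=7$ and $var(\Theta)=3$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(4, 10, size=200_000)  # samples of Theta ~ Uniform[4, 10]

def mse(theta_hat):
    """Empirical mean squared error E[(theta_hat - Theta)^2]."""
    return np.mean((theta_hat - theta) ** 2)

# Scan candidate constants; the minimizer should be near E[Theta] = 7,
# and the minimum value near var(Theta) = (10 - 4)^2 / 12 = 3.
candidates = np.linspace(4, 10, 601)
errors = [mse(c) for c in candidates]
best = candidates[int(np.argmin(errors))]
print(best, min(errors))
```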
- Generally, the (unconditional) mean squared estimation error associated with an estimator $g(X)$ is defined as
$$E[(\Theta-g(X))^2]$$
For any given value $x$ of $X$, $g(x)$ is a number, and therefore,
$$E[(\Theta-E[\Theta|X=x])^2|X=x]\leq E[(\Theta-g(x))^2|X=x]$$
Thus,
$$E[(\Theta-E[\Theta|X])^2|X]\leq E[(\Theta-g(X))^2|X]$$
which is now an inequality between random variables (functions of $X$). We take expectations of both sides, and use the law of iterated expectations, to conclude that
$$E[(\Theta-E[\Theta|X])^2]\leq E[(\Theta-g(X))^2]$$
If we view $E[\Theta|X]$ as an estimator/function of $X$, the preceding analysis shows that out of all possible estimators, the mean squared estimation error is minimized when $g(X)=E[\Theta|X]$.
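The optimality of $E[\Theta|X]$ over arbitrary estimators $g(X)$ can also be seen by simulation in a model where the conditional expectation is known in closed form. A hedged sketch (illustrative, not from the text), using the standard normal model $X=\Theta+W$ with $\Theta$, $W$ independent standard normals, for which $E[\Theta|X]=X/2$; any competitor such as $g(X)=0.8X$ should do at least as badly:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500_000
theta = rng.normal(size=n)        # Theta ~ N(0, 1)
x = theta + rng.normal(size=n)    # X = Theta + W with W ~ N(0, 1), independent

# For this model E[Theta | X] = X / 2, and its MSE is var(Theta | X) = 1/2.
# A competing estimator g(X) = 0.8 X must have a larger mean squared error.
mse_lms = np.mean((theta - x / 2) ** 2)
mse_competitor = np.mean((theta - 0.8 * x) ** 2)
print(mse_lms, mse_competitor)
```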
Example 8.11.
Let $\Theta$ be uniformly distributed over the interval $[4,10]$ and suppose that we observe $\Theta$ with some random error $W$. In particular, we observe the value of the random variable
$$X=\Theta+W$$
where we assume that $W$ is uniformly distributed over the interval $[-1,1]$ and independent of $\Theta$. What is the LMS estimate of $\Theta$?
SOLUTION
- To calculate $E[\Theta|X=x]$, we note that $f_\Theta(\theta)=1/6$ if $4\leq\theta\leq 10$, and $f_\Theta(\theta)=0$ otherwise. Conditioned on $\Theta$ being equal to some $\theta$, $X$ is uniformly distributed over the interval $[\theta-1,\theta+1]$. Thus, the joint PDF is given by
$$f_{\Theta,X}(\theta,x)=f_\Theta(\theta)f_{X|\Theta}(x|\theta)=\frac{1}{6}\cdot\frac{1}{2}=\frac{1}{12}$$
if $4\leq\theta\leq 10$ and $\theta-1\leq x\leq\theta+1$, and is zero for all other values of $(\theta,x)$. The parallelogram in the right-hand side of Fig. 8.8 is the set of pairs $(\theta,x)$ for which $f_{\Theta,X}(\theta,x)$ is nonzero.
- Given that $X=x$, the posterior PDF $f_{\Theta|X}$ is uniform on the corresponding vertical section of the parallelogram. Thus $E[\Theta|X=x]$ is the midpoint of that section, which in this example happens to be a piecewise linear function of $x$.
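This midpoint description can be sanity-checked by simulation: at $X=x$ the posterior is uniform on $[\max(4,x-1),\min(10,x+1)]$, so the LMS estimate is the midpoint of that interval. A minimal sketch (illustrative code, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
theta = rng.uniform(4, 10, n)   # Theta ~ Uniform[4, 10]
w = rng.uniform(-1, 1, n)       # W ~ Uniform[-1, 1], independent
x = theta + w

def lms_estimate(x_val):
    """Midpoint of the vertical section of the parallelogram at X = x:
    the posterior is uniform on [max(4, x-1), min(10, x+1)]."""
    lo = max(4.0, x_val - 1.0)
    hi = min(10.0, x_val + 1.0)
    return (lo + hi) / 2.0

# Compare against an empirical conditional mean E[Theta | X close to x0].
for x0 in (4.0, 7.0, 10.0):
    mask = np.abs(x - x0) < 0.05
    empirical = theta[mask].mean()
    print(x0, lms_estimate(x0), round(empirical, 2))
```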
Problem 13.
- (a) Let $Y_1,\ldots,Y_n$ be independent identically distributed random variables and let $Y=Y_1+\cdots+Y_n$. Show that
$$E[Y_1|Y]=\frac{Y}{n}$$
- (b) Let $\Theta$ and $W$ be independent zero-mean normal random variables, with positive integer variances $k$ and $m$, respectively. Use the result of part (a) to find $E[\Theta|\Theta+W]$.
- (c) Repeat part (b) for the case where $\Theta$ and $W$ are independent Poisson random variables with integer means $\lambda$ and $\mu$, respectively.
SOLUTION
- (a) By symmetry, we see that $E[Y_i|Y]$ is the same for all $i$. Furthermore,
$$E[Y_1+\cdots+Y_n|Y]=E[Y|Y]=Y$$
Therefore, $E[Y_1|Y]=\frac{Y}{n}$.
- (b) We can think of $\Theta$ and $W$ as sums of independent standard normal random variables:
$$\Theta=\Theta_1+\cdots+\Theta_k,\qquad W=W_1+\cdots+W_m$$
We identify $Y$ with $\Theta+W$ and use the result from part (a) to obtain
$$E[\Theta_i|\Theta+W]=\frac{\Theta+W}{k+m}$$
Thus,
$$E[\Theta|\Theta+W]=kE[\Theta_i|\Theta+W]=\frac{k}{k+m}(\Theta+W)$$
- (c) We recall that the sum of independent Poisson random variables is Poisson. Thus the argument in part (b) goes through, by thinking of $\Theta$ and $W$ as sums of $\lambda$ (respectively, $\mu$) independent Poisson random variables with mean one. We then obtain
$$E[\Theta|\Theta+W]=\frac{\lambda}{\lambda+\mu}(\Theta+W)$$
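The normal case in part (b) can be checked numerically: since $\Theta$ and $\Theta+W$ are jointly normal, $E[\Theta|\Theta+W]$ is linear in $\Theta+W$, so the least-squares slope $cov(\Theta,\Theta+W)/var(\Theta+W)$ should come out near $k/(k+m)$. A sketch with the illustrative choice $k=4$, $m=2$:

```python
import numpy as np

rng = np.random.default_rng(2)
k, m, n = 4.0, 2.0, 500_000
theta = rng.normal(0.0, np.sqrt(k), int(n))  # Theta ~ N(0, k)
w = rng.normal(0.0, np.sqrt(m), int(n))      # W ~ N(0, m), independent
y = theta + w

# For jointly normal variables E[Theta | Y] is linear in Y, so the
# least-squares slope cov(Theta, Y) / var(Y) recovers its coefficient,
# which should be close to k / (k + m) = 2/3.
slope = np.cov(theta, y)[0, 1] / np.var(y)
print(slope)
```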
Some Properties of the Estimation Error
- Let us use the notation
$$\hat\Theta=E[\Theta|X],\qquad \tilde\Theta=\hat\Theta-\Theta$$
for the LMS estimator and the associated estimation error, respectively. The random variables $\hat\Theta$ and $\tilde\Theta$ have a number of useful properties, which were derived in Section 4.3.
Example 8.14.
- Let us say that the observation $X$ is *uninformative* if the mean squared estimation error $E[\tilde\Theta^2]=var(\tilde\Theta)$ is the same as $var(\Theta)$, the unconditional variance of $\Theta$. When is this the case?
- Using the formula
$$var(\Theta)=var(\tilde\Theta)+var(\hat\Theta)$$
we see that $X$ is uninformative if and only if $var(\hat\Theta)=0$. The variance of a random variable is zero if and only if that random variable is a constant, equal to its mean. We conclude that $X$ is uninformative if and only if the estimate $\hat\Theta=E[\Theta]$, for every value of $X$.
- If $\Theta$ and $X$ are independent, we have $\hat\Theta=E[\Theta|X=x]=E[\Theta]$ for all $x$, and $X$ is indeed uninformative, which is quite intuitive. The converse, however, is not true: it is possible for $E[\Theta|X=x]$ to be always equal to the constant $E[\Theta]$, without $\Theta$ and $X$ being independent. (In fact, if $E[\Theta|X=x]=E[\Theta]$ for all $x$, it follows that $\Theta$ and $X$ are uncorrelated.)
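A concrete instance of an uninformative yet dependent observation (my own illustration, not from the text): take $\Theta$ standard normal and $X=|\Theta|$. By symmetry $E[\Theta|X=x]=0=E[\Theta]$ for every $x$, so $X$ is uninformative, even though $X$ completely determines the magnitude of $\Theta$. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
theta = rng.normal(size=400_000)  # Theta ~ N(0, 1), so E[Theta] = 0
x = np.abs(theta)                 # X = |Theta|: dependent, yet uninformative

# By symmetry E[Theta | X = x] = 0 = E[Theta] for every x, so the LMS
# estimate ignores X even though X pins down the magnitude of Theta.
cond_mean = theta[(x > 1.0) & (x < 1.5)].mean()  # conditional mean near 0
corr_linear = np.corrcoef(theta, x)[0, 1]        # near 0: uncorrelated
corr_square = np.corrcoef(theta**2, x)[0, 1]     # clearly nonzero: dependent
print(cond_mean, corr_linear, corr_square)
```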
The Case of Multiple Observations and Multiple Parameters
- The preceding argument and its conclusions apply even if $X$ is a vector of random variables, $X=(X_1,\ldots,X_n)$. Thus, the mean squared estimation error is minimized if we use $E[\Theta|X_1,\ldots,X_n]$ as our estimator:
$$E[(\Theta-E[\Theta|X_1,\ldots,X_n])^2]\leq E[(\Theta-g(X_1,\ldots,X_n))^2]$$
- This provides a complete solution to the general problem of LMS estimation, but is often difficult to implement, for the following reasons:
- (a) In order to compute the conditional expectation $E[\Theta|X_1,\ldots,X_n]$, we need a complete probabilistic model, that is, the joint PDF $f_{\Theta,X_1,\ldots,X_n}$.
- (b) Even if this joint PDF is available, $E[\Theta|X_1,\ldots,X_n]$ can be a very complicated function of $X_1,\ldots,X_n$.
- As a consequence, practitioners often resort to approximations of the conditional expectation or focus on estimators that are not optimal but are simple and easy to implement.
- The most common approach, discussed in the next section, involves a restriction to linear estimators.
- Finally, let us consider the case where we want to estimate multiple parameters $\Theta_1,\ldots,\Theta_m$. It is then natural to consider the criterion
$$E[(\Theta_1-\hat\Theta_1)^2]+\cdots+E[(\Theta_m-\hat\Theta_m)^2]$$
and minimize it over all estimators $\hat\Theta_1,\ldots,\hat\Theta_m$. But this is equivalent to finding, for each $i$, an estimator $\hat\Theta_i$ that minimizes $E[(\Theta_i-\hat\Theta_i)^2]$, so that we are essentially dealing with $m$ decoupled estimation problems, one for each unknown parameter $\Theta_i$, yielding $\hat\Theta_i=E[\Theta_i|X_1,\ldots,X_n]$ for all $i$.