Practical Estimators (MLE, BLUE, LS)

Reference:
Kay S M. Fundamentals of statistical signal processing[M]. Prentice Hall PTR, 1993. (Ch. 6.1 - 6.5, Ch. 7.1 - 7.6, Ch. 8.1 - 8.5 and 8.8 - 8.9)
Slides of ET4386, TUD


Motivation:

  • Determining the MVU requires to knowledge of the PDF.
  • Even when knowing the PDF, finding the MVU is not guaranteed.

Sub-optimal estimators:

  • MLE: Maximum Likelihood Estimator (needs PDF)
  • BLUE: Best Linear Unbiased Estimator (needs the first two moments but not the entire PDF)
  • LS: Least Squares (no need of probabilistic assumptions)

Under certain conditions, these sub-optimal estimators equal the MVU, or their variance converges to the variance of the MVU.

Maximum Likelihood Estimator (MLE)

As the name suggests, the MLE for a scalar parameter is defined to be the value of θ \theta θ that maximizes p ( x ; θ ) p(\mathbf x; \theta) p(x;θ) for x \mathbf x x fixed, i.e., the value that maximizes the likelihood function.

Asymptotic Properties of the MLE

If the PDF p ( x ; θ ) p(\mathbf x; \theta) p(x;θ) of the data x x x satisfies some “regularity” conditions, then the MLE of the unknown parameter θ \theta θ is asymptotically distributed (for large data records) according to
θ ^ ∼ a N ( θ , I − 1 ( θ ) ) \hat \theta \stackrel{a}{\sim} \mathcal {N}(\theta,I^{-1}(\theta)) θ^aN(θ,I1(θ))
where I ( θ ) I(\theta) I(θ) is the Fisher information evaluated at the true value of the unknown parameter.

The regularity conditions require the existence of the derivatives of the log-likelihood function, as well as the Fisher information being nonzero, as described more fully in Appendix 7B. An outline of the proof for IID observations is also given there.


Example: DC-level in White Gaussian Noise

Consider the observed data set
x [ n ] = A + w [ n ] n = 0 , 1 , ⋯   , N − 1  where  w [ n ] ∼ N ( 0 , A ) x[n]=A+w[n]\quad n=0,1,\cdots,N-1 \text{ where }w[n]\sim \mathcal{N}(0,A) x[n]=A+w[n]n=0,1,,N1 where w[n]N(0,A)

  • PDF:
    p ( x ; A ) = 1 ( 2 π A ) N / 2 exp ⁡ [ − 1 2 A ∑ n = 0 N − 1 ( x [ n ] − A ) 2 ] p(\mathbf{x} ; A)=\frac{1}{(2 \pi A)^{N / 2}} \exp \left[-\frac{1}{2 A} \sum_{n=0}^{N-1}(x[n]-A)^{2}\right] p(x;A)=(2πA)N/21exp[2A1n=0N1(x[n]A)2]

  • Score:
    ∂ ln ⁡ p ( x ; A ) ∂ A = − N 2 A + 1 A ∑ n = 0 N − 1 ( x [ n ] − A ) + 1 2 A 2 ∑ n = 0 N − 1 ( x [ n ] − A ) 2 \frac{\partial \ln p(\mathbf{x} ; A)}{\partial A}=-\frac{N}{2 A}+\frac{1}{A} \sum_{n=0}^{N-1}(x[n]-A)+\frac{1}{2 A^{2}} \sum_{n=0}^{N-1}(x[n]-A)^{2} Alnp(x;A)=2AN+A1n=0N1(x[n]A)+2A21n=0N1(x[n]A)2

  • CRLB:
    ∂ ln ⁡ 2 p ( x ; A ) ∂ A 2 = N 2 A 2 − 1 A 2 ∑ n = 0 N − 1 x [ n ] − 1 A 2 ∑ n = 0 N − 1 ( x [ n ] − A ) − 1 A 3 ∑ n = 0 N − 1 ( x [ n ] − A ) 2 I ( θ ) = − E ( ∂ ln ⁡ 2 p ( x ; A ) ∂ A 2 ) = − N 2 A 2 + N A A 2 + 0 + N A A 3 = N 2 A 2 + N A var ⁡ ( A ^ ) ≥ 1 I ( θ ) = A 2 N ( A + 1 / 2 ) \begin{aligned} \frac{\partial \ln^2 p(\mathbf{x} ; A)}{\partial A^2}&=\frac{N}{2A^2}-\frac{1}{A^2}\sum_{n=0}^{N-1}x[n]-\frac{1}{A^2}\sum_{n=0}^{N-1}(x[n]-A)-\frac{1}{A^3}\sum_{n=0}^{N-1}(x[n]-A)^2\\ I(\theta)&=-E(\frac{\partial \ln^2 p(\mathbf{x} ; A)}{\partial A^2})=-\frac{N}{2A^2}+\frac{NA}{A^2}+0+\frac{NA}{A^3}=\frac{N}{2A^2}+\frac{N}{A}\\ \operatorname{var}(\hat A)&\ge \frac{1}{I(\theta)}=\frac{A^2}{N(A+1/2)} \end{aligned} A2ln2p(x;A)I(θ)var(A^)=2A2NA21n=0N1x[n]A21n=0N1(x[n]A)A31n=0N1(x[n]A)2=E(A2ln2p(x;A))=2A2N+A2NA+0+A3NA=2A2N+ANI(θ)1=N(A+1/2)A2

However, it is not obvious whether the score function can be put in the form ∂ ln ⁡ p ( x ; A ) ∂ A = I ( A ) ( A ^ − A ) \frac{\partial \ln p(\mathbf{x} ; A)}{\partial A}=I(A)(\hat A-A) Alnp(x;A)=I(A)(A^A). Therefore, we try MLE instead by setting score to zero (maximizing loglikelihood function), i.e.,
∂ ln ⁡ p ( x ; A ) ∂ A = − N 2 A + 1 A ∑ n = 0 N − 1 ( x [ n ] − A ) + 1 2 A 2 ∑ n = 0 N − 1 ( x [ n ] − A ) 2 = 0 \frac{\partial \ln p(\mathbf{x} ; A)}{\partial A}=-\frac{N}{2 A}+\frac{1}{A} \sum_{n=0}^{N-1}(x[n]-A)+\frac{1}{2 A^{2}} \sum_{n=0}^{N-1}(x[n]-A)^{2}=0 Alnp(x;A)=2AN+A1n=0N1(x[n]A)+2A21n=0N1(x[n]A)2=0
We then obtain
A ^ 2 + A ^ − 1 N ∑ n = 0 N − 1 x 2 [ n ] = 0 \hat{A}^{2}+\hat{A}-\frac{1}{N} \sum_{n=0}^{N-1} x^{2}[n]=0 A^2+A^N1n=0N1x2[n]=0
Solving the above and choosing the positive A ^ \hat{A} A^ :
A ^ = − 1 2 + 1 N ∑ n = 0 N − 1 x 2 [ n ] + 1 4 \hat{A}=-\frac{1}{2}+\sqrt{\frac{1}{N} \sum_{n=0}^{N-1} x^{2}[n]+\frac{1}{4}} A^=21+N1n=0N1x2[n]+41
Note that since
E ( A ^ ) = E ( − 1 2 + 1 N ∑ n = 0 N − 1 x 2 [ n ] + 1 4 ) ≠ − 1 2 + E ( 1 N ∑ n = 0 N − 1 x 2 [ n ] ) + 1 4  for all  A = − 1 2 + A + A 2 + 1 4 = A \begin{aligned} E(\hat{A}) &=E\left(-\frac{1}{2}+\sqrt{\left.\frac{1}{N} \sum_{n=0}^{N-1} x^{2}[n]+\frac{1}{4}\right)}\right.\\ & \neq-\frac{1}{2}+\sqrt{E\left(\frac{1}{N} \sum_{n=0}^{N-1} x^{2}[n]\right)+\frac{1}{4}} \quad \text { for all } A \\ &=-\frac{1}{2}+\sqrt{A+A^{2}+\frac{1}{4}} \\ &=A \end{aligned} E(A^)=E21+N1n=0N1x2[n]+41) =21+E(N1n=0N1x2[n])+41  for all A=21+A+A2+41 =A
the estimator is biased. It is nonetheless a reasonable estimator in that as N → ∞ N\to \infty N, we have by the law of large numbers
1 N ∑ n = 0 N − 1 x 2 [ n ] → E ( x 2 [ n ] ) = A + A 2 \frac{1}{N}\sum_{n=0}^{N-1}x^2[n]\to E(x^2[n])=A+A^2 N1n=0N1x2[n]E(x2[n])=A+A2
and therefore
A ^ → A \hat A\to A A^A
Then we are going to show that A ^ \hat A A^ is a consistent estimator with
E ( A ^ ) → a A , var ⁡ ( A ^ ) → a A 2 N ( A + 1 / 2 ) E(\hat A)\stackrel{a}\to A,\quad \operatorname{var}(\hat A)\stackrel{a}\to \frac{A^2}{N(A+1/2)} E(A^)aA,var(A^)aN(A+1/2)A2
i.e., A ^ \hat A A^ is asymptotically unbiased and asymptotically achieves the CRLB. Hence, it is asymptotically efficient.

Denote
A ^ = − 1 2 + 1 N ∑ n = 0 N − 1 x 2 [ n ] + 1 4 ≜ g ( u ) \hat{A}=-\frac{1}{2}+\sqrt{\frac{1}{N} \sum_{n=0}^{N-1} x^{2}[n]+\frac{1}{4}}\triangleq g(u) A^=21+N1n=0N1x2[n]+41 g(u)
where u = 1 N ∑ n = 0 N − 1 x 2 [ n ] u=\frac{1}{N}\sum_{n=0}^{N-1}x^2[n] u=N1n=0N1x2[n], and therefore
g ( u ) = − 1 2 + u + 1 4 g(u)=-\frac{1}{2}+\sqrt{u+\frac{1}{4}} g(u)=21+u+41
Linearizing about u 0 = E ( u ) = A + A 2 u_0=E(u)=A+A^2 u0=E(u)=A+A2, we have
g ( u ) ≈ g ( u 0 ) + d g ( u ) d u ∣ u = u 0 ( u − u 0 ) g(u) \approx g\left(u_{0}\right)+\left.\frac{d g(u)}{d u}\right|_{u=u_{0}}\left(u-u_{0}\right) g(u)g(u0)+dudg(u)u=u0(uu0)
or
A ^ ≈ A + 1 2 A + 1 2 [ 1 N ∑ n = 0 N − 1 x 2 [ n ] − ( A + A 2 ) ] \hat{A} \approx A+\frac{\frac{1}{2}}{A+\frac{1}{2}}\left[\frac{1}{N} \sum_{n=0}^{N-1} x^{2}[n]-\left(A+A^{2}\right)\right] A^A+A+2121[N1n=0N1x2[n](A+A2)]
It now follows that the asymptotic mean is
E ( A ^ ) = A E(\hat A)=A E(A^)=A
so that A ^ \hat A A^ is asymptotically unbiased. Additionally, the asymptotic variance becomes
var ⁡ ( A ^ ) = ( 1 2 A + 1 2 ) 2 var ⁡ [ 1 N ∑ n = 0 N − 1 x 2 [ n ] ] = 1 4 N ( A + 1 2 ) 2 var ⁡ ( x 2 [ n ] ) = 1 4 N ( A + 1 2 ) 2 ( 4 A 3 + 2 A 2 ) = A 2 N ( A + 1 2 ) \begin{aligned} \operatorname{var}(\hat{A}) &=\left(\frac{\frac{1}{2}}{A+\frac{1}{2}}\right)^{2} \operatorname{var}\left[\frac{1}{N} \sum_{n=0}^{N-1} x^{2}[n]\right] =\frac{\frac{1}{4}}{N\left(A+\frac{1}{2}\right)^{2}} \operatorname{var}\left(x^{2}[n]\right)\\ &=\frac{\frac{1}{4}}{N\left(A+\frac{1}{2}\right)^{2}} (4A^3+2A^2)=\frac{A^2}{N(A+\frac{1}{2})} \end{aligned} var(A^)=(A+2121)2var[N1n=0N1x2[n]]=N(A+21)241var(x2[n])=N(A+21)241(4A3+2A2)=N(A+21)A2
which asymptotically achieves the CRLB.


Example: Linear Gaussian Model

For linear Gaussian model
x = h θ + w , w ∼ N ( 0 , C w ) \mathbf x=\mathbf h \theta+\mathbf w,\quad \mathbf w\sim\mathcal{N}(\mathbf 0,\mathbf C_w) x=hθ+w,wN(0,Cw)
The likelihood function is given by
p ( x ; θ ) = 1 ( 2 π ) N / 2 det ⁡ ( C w ) 1 / 2 exp ⁡ [ − 1 2 ( x − h θ ) T C w − 1 ( x − h θ ) ] p(\mathbf{x} ; \theta)=\frac{1}{(2 \pi)^{N / 2} \operatorname{det}(\mathbf{C}_w)^{1 / 2}} \exp \left[-\frac{1}{2}(\mathbf{x}-\mathbf{h} \theta)^{T} \mathbf{C}_w^{-1}(\mathbf{x}-\mathbf{h} \theta)\right] p(x;θ)=(2π)N/2det(Cw)1/21exp[21(xhθ)TCw1(xhθ)]
It is clear that this function is maximized by solving
θ ^ = argmin ⁡ θ [ ( x − h θ ) T C w − 1 ( x − h θ ) ] \hat{\theta}=\underset{\theta}{\operatorname{argmin}}\left[(\mathbf{x}-\mathbf{h} \theta)^{T} \mathbf{C}_w^{-1}(\mathbf{x}-\mathbf{h} \theta)\right] θ^=θargmin[(xhθ)TCw1(xhθ)]
Note that since x \mathbf{x} x is a stochastic variable that can take many values, so is θ ^ \hat{\theta} θ^.

Denote
J = ( x − h θ ) T C w − 1 ( x − h θ ) = x T C w − 1 x − 2 h T C w − 1 x θ + h T C w − 1 h θ 2 \begin{aligned} J&=(\mathbf{x}-\mathbf{h} \theta)^{T} \mathbf{C}_w^{-1}(\mathbf{x}-\mathbf{h} \theta)\\ &=\mathbf x^T \mathbf C_w^{-1}\mathbf x-2\mathbf h^T \mathbf C_w^{-1}\mathbf x\theta+\mathbf h^T\mathbf C_w^{-1}\mathbf h\theta^2 \end{aligned} J=(xhθ)TCw1(xhθ)=xTCw1x2hTCw1xθ+hTCw1hθ2
Setting the gradient w.r.t. θ \theta θ as zero, we have
∂ J ∂ θ = − 2 h T C w − 1 x + 2 h T C w − 1 h θ θ ^ = ( h T C w − 1 h ) − 1 h T C w − 1 x \begin{aligned} \frac{\partial J}{\partial \theta}&=-2\mathbf h^T \mathbf C_w^{-1}\mathbf x+2\mathbf h^T\mathbf C_w^{-1}\mathbf h\theta\\ \hat \theta&=(\mathbf h^T \mathbf C_w^{-1}\mathbf h)^{-1}\mathbf h^T \mathbf C_w^{-1}\mathbf x \end{aligned} θJθ^=2hTCw1x+2hTCw1hθ=(hTCw1h)1hTCw1x
Therefore, for the linear Gaussian model, the MLE is the MVU estimator.

Invariance Property of the MLE

In many instances we wish to estimate a function of θ \theta θ, the parameter characterizing the PDF. For example, we may not be interested in the value of a DC level A A A in WGN but only in the power A 2 A^2 A2. In this case, the MLE of the transformed parameter is found by substituting the MLE of the original parameter into the transformation. This property of the MLE is termed the invariance property.

The MLE of the parameter α = g ( θ ) , \alpha=g(\theta), α=g(θ), where the PDF ⁡ p ( x ; θ ) \operatorname{PDF} p(\mathbf{x} ; \theta) PDFp(x;θ) is parametrized by θ , \theta, θ, is given by
α ^ = g ( θ ^ ) \hat{\alpha}=g(\hat{\theta}) α^=g(θ^)
where θ ^ \hat{\theta} θ^ is the MLE if θ \theta θ, which is obtained by maximizing p ( x ; θ ) p(\mathbf{x} ; \theta) p(x;θ) over θ \theta θ.

If g ( ⋅ ) g(\cdot) g() is not a one-to-one function, then α ^ \hat{\alpha} α^ maximizes some modified likelihood function p ˉ T ( x ; α ) , \bar{p}_{T}(\mathbf{x} ; \alpha), pˉT(x;α), defined as
p ˉ T ( x ; α ) = max ⁡ { θ : α = g ( θ ) } p ( x ; θ ) \bar{p}_{T}(\mathbf{x} ; \alpha)=\max _{\{\theta: \alpha=g(\theta)\}} p(\mathbf{x} ; \theta) pˉT(x;α)={θ:α=g(θ)}maxp(x;θ)

Example: Transformed DC Level in WGN

Consider the data
x [ n ] = A + w [ n ] n = 0 , 1 , … , N − 1 x[n]=A+w[n] \quad n=0,1, \ldots, N-1 x[n]=A+w[n]n=0,1,,N1
where w [ n ] w[n] w[n] is WGN with variance σ 2 . \sigma^{2} . σ2. We wish to find the M L E \mathrm{MLE} MLE of α = exp ⁡ ( A ) . \alpha=\exp (A) . α=exp(A). The PDF is given as
p ( x ; A ) = 1 ( 2 π σ 2 ) N 2 exp ⁡ [ − 1 2 σ 2 ∑ n = 0 N − 1 ( x [ n ] − A ) 2 ] − ∞ < A < ∞ p(\mathbf{x} ; A)=\frac{1}{\left(2 \pi \sigma^{2}\right)^{\frac{N}{2}}} \exp \left[-\frac{1}{2 \sigma^{2}} \sum_{n=0}^{N-1}(x[n]-A)^{2}\right] \quad-\infty<A<\infty p(x;A)=(2πσ2)2N1exp[2σ21n=0N1(x[n]A)2]<A<
and is parameterized by the parameter θ = A \theta=A θ=A. However, since a is a one-to-one transformation of A A A, we can equivalently parameterize the PDF as
p T ( x ; α ) = 1 ( 2 π σ 2 ) N 2 exp ⁡ [ − 1 2 σ 2 ∑ n = 0 N − 1 ( x [ n ] − ln ⁡ α ) 2 ] α > 0 p_{T}(\mathbf{x} ; \alpha)=\frac{1}{\left(2 \pi \sigma^{2}\right)^{\frac{N}{2}}} \exp \left[-\frac{1}{2 \sigma^{2}} \sum_{n=0}^{N-1}(x[n]-\ln \alpha)^{2}\right] \quad \alpha>0 pT(x;α)=(2πσ2)2N1exp[2σ21n=0N1(x[n]lnα)2]α>0
where the subscript T T T indicates that the PDF is parameterized according to the transformed parameter. Clearly, p T ( x ; α ) p_{T}(\mathbf{x} ; \alpha) pT(x;α) is the PDF of the data set
x [ n ] = ln ⁡ α + w [ n ] n = 0 , 1 , … , N − 1 x[n]=\ln \alpha+w[n] \quad n=0,1, \ldots, N-1 x[n]=lnα+w[n]n=0,1,,N1
The MLE of α \alpha α is found by maximizing the equivalent PDF over α . \alpha . α. Setting the derivative of p T ( x ; α ) p_{T}(\mathbf{x} ; \alpha) pT(x;α) with respect to α \alpha α equal to zero yields
∑ n = 0 N − 1 ( x [ n ] − ln ⁡ α ^ ) 1 α ^ = 0 \sum_{n=0}^{N-1}(x[n]-\ln \hat{\alpha}) \frac{1}{\hat{\alpha}}=0 n=0N1(x[n]lnα^)α^1=0
or
α ^ = exp ⁡ ( x ˉ ) \hat{\alpha}=\exp (\bar{x}) α^=exp(xˉ)
But x ˉ \bar{x} xˉ is just the MLE of A , A, A, so that
α ^ = exp ⁡ ( A ^ ) = exp ⁡ ( θ ^ ) \hat{\alpha}=\exp (\hat{A})=\exp (\hat{\theta}) α^=exp(A^)=exp(θ^)


Example: The SNR of DC Level in WGN

For N N N IID observation from the PDF N ( A , σ 2 ) \mathcal N(A,\sigma^2) N(A,σ2), where A A A and σ 2 \sigma^2 σ2 are both unknown, find the MLE of the SNR α = A 2 / σ 2 \alpha=A^2/\sigma^2 α=A2/σ2.

PDF:
p ( x ; θ ) = 1 ( 2 π σ 2 ) N / 2 exp ⁡ [ − 1 2 σ 2 ∑ n = 0 N − 1 ( x [ n ] − A ) 2 ] p(\mathbf{x} ; \boldsymbol{\theta})=\frac{1}{(2 \pi \sigma^2)^{N / 2}} \exp \left[-\frac{1}{2 \sigma^2} \sum_{n=0}^{N-1}(x[n]-A)^{2}\right] p(x;θ)=(2πσ2)N/21exp[2σ21n=0N1(x[n]A)2]
where θ = [ A   σ 2 ] T \boldsymbol{\theta}=[A~\sigma^2]^T θ=[A σ2]T. Then
∂ ln ⁡ p ( x ; θ ) ∂ A = 1 σ 2 ∑ n = 0 N − 1 ( x [ n ] − A ) ∂ ln ⁡ p ( x ; θ ) ∂ σ 2 = − N 2 σ 2 + 1 2 σ 4 ∑ n = 0 N − 1 ( x [ n ] − A ) 2 \begin{aligned} \frac{\partial \ln p(\mathbf{x} ; \boldsymbol{\theta})}{\partial A} &=\frac{1}{\sigma^{2}} \sum_{n=0}^{N-1}(x[n]-A) \\ \frac{\partial \ln p(\mathbf{x} ; \boldsymbol{\theta})}{\partial \sigma^{2}} &=-\frac{N}{2 \sigma^{2}}+\frac{1}{2 \sigma^{4}} \sum_{n=0}^{N-1}(x[n]-A)^{2} \end{aligned} Alnp(x;θ)σ2lnp(x;θ)=σ21n=0N1(x[n]A)=2σ2N+2σ41n=0N1(x[n]A)2
Take the derivatives to zero, we have the MLE
θ ^ = [ x ‾ 1 N ∑ n = 0 N − 1 ( x [ n ] − x ‾ ) 2 ] \hat{\boldsymbol{\theta}}=\left[\begin{matrix} \overline x\\\frac{1}{N}\sum_{n=0}^{N-1}(x[n]-\overline x)^2 \end{matrix}\right] θ^=[xN1n=0N1(x[n]x)2]
According to the invariance property, we have
α ^ = A ^ 2 σ ^ 2 = x ‾ 1 N ∑ n = 0 N − 1 ( x [ n ] − x ‾ ) 2 \hat \alpha=\frac{\hat A^2}{\hat \sigma^2}=\frac{\overline x}{\frac{1}{N}\sum_{n=0}^{N-1}(x[n]-\overline x)^2} α^=σ^2A^2=N1n=0N1(x[n]x)2x

Best Linear Unbiased Estimator (BLUE)

Sometimes we do not know the PDF of the data, thus the MLE is not available. A common approach is to restrict the estimator to be linear in the data and find the linear estimator that is unbiased and has minimum variance. This estimator, which is termed the best linear unbiased estimator (BLUE), can be determined with knowledge of only the first and second moments of the PDF.

Scalar BLUE

Definition: We observe the data set { x [ 0 ] , x [ 1 ] , ⋯   , x [ N − 1 ] } \{x[0],x[1],\cdots,x[N-1]\} {x[0],x[1],,x[N1]} whose PDF p ( x ; θ ) p(\mathbf x;\theta) p(x;θ) depends on an unknown parameter θ \theta θ. The BLUE restricts the estimator to be linear in the data or
θ ^ = ∑ n = 0 N − 1 a n x [ n ] = a T x \hat \theta=\sum_{n=0}^{N-1}a_nx[n]=\mathbf a^T \mathbf x θ^=n=0N1anx[n]=aTx
where the a n a_n an's are constants yet to be determined.

Since we are restricting the class of estimators to be linear, the BLUE will be optimal (that is to say, the MVU estimator) only when the MVU estimator turns out to be linear.
在这里插入图片描述

Finding the BLUE:

  • Unbiased
    E ( θ ^ ) = ∑ n = 0 N − 1 a n E ( x [ n ] ) = a T E ( x ) = θ E(\hat \theta)=\sum_{n=0}^{N-1}a_n E(x[n])=\mathbf a^TE(\mathbf x)=\theta E(θ^)=n=0N1anE(x[n])=aTE(x)=θ
    In order to satisfy the unbiased constraint, E ( x [ n ] ) E(x[n]) E(x[n]) must be linear in θ \theta θ
    E ( x [ n ] ) = h [ n ] θ E(x[n])=h[n]\theta E(x[n])=h[n]θ
    where the h [ n ] h[n] h[n]'s are known. Or
    E ( x ) = h θ , a T h = 1. E(\mathbf x)=\mathbf h \theta,\quad \mathbf a^T\mathbf h=1. E(x)=hθ,aTh=1.
    Note that if we write x [ n ] x[n] x[n] as
    x [ n ] = E ( x [ n ] ) + [ x [ n ] − E ( x [ n ] ) ] x[n]=E(x[n])+[x[n]-E(x[n])] x[n]=E(x[n])+[x[n]E(x[n])]
    then by viewing x [ n ] − E ( x [ n ] ) x[n]-E(x[n]) x[n]E(x[n]) as noise or w [ n ] w[n] w[n] we have
    x [ n ] = θ h [ n ] + w [ n ] x[n]=\theta h[n]+w[n] x[n]=θh[n]+w[n]
    which means that the BLUE is applicable to amplitude estimation of known signals in noise.

  • Minimum variance
    var ⁡ ( θ ^ ) = E [ ( θ ^ − E ( θ ^ ) ) 2 ] = E { [ a T ( x − E ( x ) ) ] 2 } = a T E [ ( x − E ( x ) ) ( x − E ( x ) ) T ] a = a T C a \begin{aligned} \operatorname{var}(\hat \theta)=& E[(\hat \theta-E(\hat \theta))^2]=E\{[\mathbf a^T(\mathbf x-E(\mathbf x))]^2\}\\ =& \mathbf a^TE[(\mathbf x-E(\mathbf x))(\mathbf x-E(\mathbf x))^T]\mathbf a\\ =& \mathbf a^T \mathbf C \mathbf a \end{aligned} var(θ^)===E[(θ^E(θ^))2]=E{[aT(xE(x))]2}aTE[(xE(x))(xE(x))T]aaTCa

Therefore, we can formulate the optimization problem
minimize ⁡ a a T C a subject to a T h = 1 \begin{aligned} &\underset{\mathbf a} {\operatorname{minimize}} &&\mathbf a^T\mathbf C \mathbf a\\ &\text{subject to} && \mathbf a^T\mathbf h=1 \end{aligned} aminimizesubject toaTCaaTh=1
which can be solved using the method of Lagrange multipliers.
J = a T C a + λ ( a T h − 1 ) J=\mathbf a^T\mathbf C \mathbf a+\lambda (\mathbf a^T\mathbf h-1) J=aTCa+λ(aTh1)
Setting the gradient w.r.t. a \mathbf a a to zero,
∂ J ∂ a = 2 C a + λ h = 0 ⟹ a = − λ 2 C − 1 h \frac{\partial J}{\partial \mathbf a}=2\mathbf C \mathbf a+\lambda \mathbf h=0 \Longrightarrow \mathbf a=-\frac{\lambda}{2}\mathbf C^{-1}\mathbf h aJ=2Ca+λh=0a=2λC1h
The Lagrange multiplier λ \lambda λ is found using the constraint
a T h = − λ 2 h T C − 1 h = 1 ⟹ − λ 2 = 1 h T C − 1 h \mathbf a^T \mathbf h=-\frac{\lambda}{2}\mathbf h^T\mathbf C^{-1}\mathbf h=1 \Longrightarrow -\frac{\lambda}{2}=\frac{1}{\mathbf h^T \mathbf C^{-1}\mathbf h} aTh=2λhTC1h=12λ=hTC1h1
Thus the optimal a \mathbf a a is given by
a o p t = C − 1 h h T C − 1 h \mathbf a_{opt}=\frac{\mathbf C^{-1}\mathbf h}{\mathbf h ^T \mathbf C^{-1}\mathbf h} aopt=hTC1hC1h
and the variance for this value of a \mathbf a a
var ⁡ ( θ ^ ) = a o p t T C a o p t = 1 h T C − 1 h \operatorname{var}(\hat \theta)=\mathbf a^T_{opt}\mathbf C \mathbf a_{opt}=\frac{1}{\mathbf h^T\mathbf C^{-1}\mathbf h} var(θ^)=aoptTCaopt=hTC1h1
We can prove that the a o p t \mathbf a_{opt} aopt is indeed the global minimum and not just a local minimum by verifying
a T C a = ( a − a o p t ) T C ( a − a o p t ) + 1 h T C − 1 h \mathbf a^T \mathbf C \mathbf a=(\mathbf a-\mathbf a_{opt})^T\mathbf C(\mathbf a-\mathbf a_{opt})+\frac{1}{\mathbf h^T\mathbf C^{-1}\mathbf h} aTCa=(aaopt)TC(aaopt)+hTC1h1
Therefore,
θ ^ = a o p t T x = h T C − 1 x h T C − 1 h \hat \theta=\mathbf a_{opt}^T \mathbf x=\frac{\mathbf h^T\mathbf C^{-1} \mathbf x}{\mathbf h ^T \mathbf C^{-1}\mathbf h} θ^=aoptTx=hTC1hhTC1x
Since E ( x ) = h θ E(\mathbf x)=\mathbf h \theta E(x)=hθ, the BLUE is unbiased.

Also, as we asserted earlier, to determine the BLUE we only require knowledge of

  • E ( x ) E(\mathbf x) E(x), the mean or h \mathbf h h
  • C \mathbf C C, the covariance

Example: Scalar BLUE for Linear Model

For linear model
x = h θ + w ,  with  E ( w ) = 0  and  cov ⁡ ( w ) = C x \mathbf x=\mathbf h\theta+\mathbf w,\text{ with }E(\mathbf w)=\mathbf 0\text{ and }\operatorname{cov}(\mathbf w)=\mathbf C_x x=hθ+w, with E(w)=0 and cov(w)=Cx
Since E ( x ) = h θ E(\mathbf x)=\mathbf h \theta E(x)=hθ, C = C x \mathbf C=\mathbf C_x C=Cx, we can directly obtain that
θ ^ = ( h T C x − 1 h ) − 1 h T C x − 1 x \hat \theta=(\mathbf h^T \mathbf C_x^{-1}\mathbf h)^{-1}\mathbf h^T \mathbf C_x^{-1}\mathbf x θ^=(hTCx1h)1hTCx1x
which is exactly the form of MVU if the noise is Gaussian.


Vector BLUE

Definition: If the parameter to be estimated is a p × 1 p\times 1 p×1 vector parameter, then for the estimator to be linear in the data we require
θ ^ i = ∑ n = 0 N − 1 a i n x [ n ] i = 1 , 2 , ⋯   , p \hat \theta_i=\sum_{n=0}^{N-1}a_{in}x[n]\quad i=1,2,\cdots,p θ^i=n=0N1ainx[n]i=1,2,,p
where the a i n a_{in} ain's are weighting coefficients. In matrix form this is
θ ^ = A x \hat{\boldsymbol{\theta}}=\mathbf A \mathbf x θ^=Ax
where A \mathbf A A is a p × N p\times N p×N matrix.

Finding BLUE:

  • Unbiased
    E ( θ ^ ) = A E ( x ) = θ E(\hat{\boldsymbol{\theta}})=\mathbf AE(\mathbf x)=\boldsymbol{\theta} E(θ^)=AE(x)=θ
    To satisfy this condition, we must have
    E ( x ) = H θ , A H = I E(\mathbf x)=\mathbf H\boldsymbol{\theta},\quad \mathbf A \mathbf H=\mathbf I E(x)=Hθ,AH=I
    where
    H = [ h 1 h 2 ⋯ h p ] , A = [ a 1 T a 2 T ⋮ a p T ] \mathbf H=[\mathbf h_1\quad \mathbf h_2\cdots \mathbf h_p],\quad \mathbf A=\left[\begin{matrix}\mathbf a_1^T\\ \mathbf a_2^T\\ \vdots \\\mathbf a_p^T \end{matrix}\right] H=[h1h2hp],A=a1Ta2TapT
    With these definitions the unbiased constraint reduces to
    a i T h j = δ i j , i = 1 , 2 , ⋯   , p ; j = 1 , 2 , ⋯   , p \mathbf a_i^T \mathbf h_j=\delta_{ij},\quad i=1,2,\cdots,p;j=1,2,\cdots,p aiThj=δij,i=1,2,,p;j=1,2,,p

  • Minimum variance
    var ⁡ ( θ ^ i ) = a i T C a i \operatorname{var}(\hat \theta_i)= \mathbf a_i^T \mathbf C \mathbf a_i var(θ^i)=aiTCai

Therefore, we can formulate the optimization problem
minimize ⁡ { a i } i = 1 p a i T C a i subject to a i T h j = δ i j \begin{aligned} &\underset{\{\mathbf a_i\}_{i=1}^p} {\operatorname{minimize}} &&\mathbf a_i^T\mathbf C \mathbf a_i\\ &\text{subject to} && \mathbf a_i^T\mathbf h_j=\delta_{ij} \end{aligned} {ai}i=1pminimizesubject toaiTCaiaiThj=δij
Unlike the scalar BLUE, now we have p p p constraints on a i \mathbf a_i ai instead of just one. Since each a i \mathbf a_i ai is free to assume any value, independently of the others, we actually have p p p separate minimization problems linked only by the constraints. Consider the Lagrangian function for a i \mathbf a_i ai,
J i = a i T C a i + ∑ j = 1 p λ j ( i ) ( a i T h j − δ i j ) J_i=\mathbf a_i^T\mathbf C\mathbf a_i+\sum_{j=1}^p \lambda_j^{(i)}(\mathbf a_i^T\mathbf h_j-\delta_{ij}) Ji=aiTCai+j=1pλj(i)(aiThjδij)
Setting the gradient of the Lagrangian to zero
∂ J i ∂ a i = 2 C a i + ∑ j = 1 p λ j ( i ) h j = 2 C a i + H λ i = 0 ⟹ a i = − 1 2 C − 1 H λ i \frac{\partial J_i}{\partial \mathbf a_i}=2\mathbf C\mathbf a_i+\sum_{j=1}^p \lambda_{j}^{(i)}\mathbf h_j=2\mathbf C\mathbf a_i+\mathbf H\boldsymbol{\lambda}_i=\mathbf 0 \Longrightarrow \mathbf a_i=-\frac{1}{2}\mathbf C^{-1}\mathbf H\boldsymbol{\lambda}_i aiJi=2Cai+j=1pλj(i)hj=2Cai+Hλi=0ai=21C1Hλi
where λ i = [ λ 1 ( i )   λ 2 ( i )   ⋯   λ p ( i ) ] T \boldsymbol{\lambda}_i=[\lambda_1^{(i)}~\lambda_2^{(i)}~\cdots ~\lambda_p^{(i)}]^T λi=[λ1(i) λ2(i)  λp(i)]T.

To find the vector of Lagrangian multipliers we use the constraint equations
a i T h j = δ i j j = 1 , 2 , … , p \mathbf{a}_{i}^{T} \mathbf{h}_{j}=\delta_{i j} \quad j=1,2, \ldots, p aiThj=δijj=1,2,,p
or in combined form
[ h 1 T ⋮ h i − 1 T h i T h i + 1 T ⋮ h p T ] a i = [ 0 ⋮ 0 1 0 ⋮ 0 ] \left[\begin{array}{c} \mathbf{h}_{1}^{T} \\ \vdots \\ \mathbf{h}_{i-1}^{T} \\ \mathbf{h}_{i}^{T} \\ \mathbf{h}_{i+1}^{T} \\ \vdots \\ \mathbf{h}_{p}^{T} \end{array}\right] \mathbf{a}_{i}=\left[\begin{array}{c} 0 \\ \vdots \\ 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{array}\right] h1Thi1ThiThi+1ThpTai=00100
Letting e i \mathbf e_{i} ei denote the vector of all zeros except in the i i i th place, we have the constraint equation
H T a i = − 1 2 H T C − 1 H λ i = e i ⟹ − 1 2 λ i = ( H T C − 1 H ) − 1 e i \mathbf{H}^{T} \mathbf{a}_{i}=-\frac{1}{2} \mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H} \boldsymbol{\lambda}_{i}=\mathbf{e}_{i}\Longrightarrow -\frac{1}{2} \boldsymbol{\lambda}_{i}=\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{e}_{i} HTai=21HTC1Hλi=ei21λi=(HTC1H)1ei
(assuming invertibility of H T C − 1 H \mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H} HTC1H).

Therefore,
a i o p t = C − 1 H ( H T C − 1 H ) − 1 e i \mathbf{a}_{i_{ {opt }}}=\mathbf{C}^{-1} \mathbf{H}\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{e}_{i} aiopt=C1H(HTC1H)1ei
and a corresponding variance of
var ⁡ ( θ ^ i ) = a i opt  T C a i opt  = e i T ( H T C − 1 H ) − 1 H T C − 1 C C − 1 H ( H T C − 1 H ) − 1 e i = e i T ( H T C − 1 H ) − 1 e i = [ ( H T C − 1 H ) − 1 ] i i \begin{aligned} \operatorname{var}\left(\hat{\theta}_{i}\right) &=\mathbf{a}_{i_{\text {opt }}}^{T} \mathbf{C a}_{i_{\text {opt }}} \\ &=\mathbf{e}_{i}^{T}\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{C C}^{-1} \mathbf{H}\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{e}_{i} \\ &=\mathbf{e}_{i}^{T}\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{e}_{i}\\ &=[\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1}]_{ii} \end{aligned} var(θ^i)=aiopt TCaiopt =eiT(HTC1H)1HTC1CC1H(HTC1H)1ei=eiT(HTC1H)1ei=[(HTC1H)1]ii
We can express the vector BLUE in a more compact form by letting
θ ^ = [ a 1 opt  T x a 2  opt  T x ⋮ a p opt  T x ] = [ e 1 T ( H T C − 1 H ) − 1 H T C − 1 x e 2 T ( H T C − 1 H ) − 1 H T C − 1 x ⋮ e p T ( H T C − 1 H ) − 1 H T C − 1 x ] = [ e 1 T e 2 T ⋮ e p T ] ( H T C − 1 H ) − 1 H T C − 1 x = ( H T C − 1 H ) − 1 H T C − 1 x \begin{aligned} \hat{\boldsymbol{\theta}} &=\left[\begin{array}{c} \mathbf{a}_{1_{\text {opt }}}^{T} \mathbf{x} \\ \mathbf{a}_{2 \text { opt }}^{T} \mathbf{x} \\ \vdots \\ \mathbf{a}_{p_{\text {opt }}}^{T} \mathbf{x} \end{array}\right]=\left[\begin{array}{c} \mathbf{e}_{1}^{T}\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{x} \\ \mathbf{e}_{2}^{T}\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{x} \\ \vdots \\ \mathbf{e}_{p}^{T}\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{x} \end{array}\right] \\ &=\left[\begin{array}{c} \mathbf{e}_{1}^{T} \\ \mathbf{e}_{2}^{T} \\ \vdots \\ \mathbf{e}_{p}^{T} \end{array}\right]\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{x} \\ &=\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{x} \end{aligned} θ^=a1opt Txa2 opt Txapopt Tx=e1T(HTC1H)1HTC1xe2T(HTC1H)1HTC1xepT(HTC1H)1HTC1x=e1Te2TepT(HTC1H)1HTC1x=(HTC1H)1HTC1x

Example: Vector BLUE for General Linear Model (Gauss-Markov Theorem)

If the data are of the general linear model form
x = H θ + w \mathbf{x}=\mathbf{H} \boldsymbol{\theta}+\mathbf{w} x=Hθ+w
where H \mathbf{H} H is a known N × p N \times p N×p matrix, θ \boldsymbol{\theta} θ is a p × 1 p \times 1 p×1 vector of parameters to be estimated, and w \mathbf w w is a N × 1 N \times 1 N×1 noise vector with zero mean and covariance C \mathrm{C} C (the PDF of w \mathbf w w is otherwise arbitrary), then the BLUE of θ \theta θ is
θ ^ = ( H T C − 1 H ) − 1 H T C − 1 x \hat{\boldsymbol{\theta}}=\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{x} θ^=(HTC1H)1HTC1x
and the minimum variance of θ ^ i \hat{\theta}_{i} θ^i is
var ⁡ ( θ ^ i ) = [ ( H T C − 1 H ) − 1 ] i i \operatorname{var}\left(\hat{\theta}_{i}\right)=\left[\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1}\right]_{i i} var(θ^i)=[(HTC1H)1]ii
In addition, the covariance matrix of θ ^ \hat{\boldsymbol{\theta}} θ^ is
C θ ^ = E [ ( θ ^ − E ( θ ^ ) ) ( θ ^ − E ( θ ^ ) ) T ] \mathbf{C}_{\hat{\theta}}=E\left[(\hat{\boldsymbol{\theta}}-E(\hat{\boldsymbol{\theta}}))(\hat{\boldsymbol{\theta}}-E(\hat{\boldsymbol{\theta}}))^{T}\right] Cθ^=E[(θ^E(θ^))(θ^E(θ^))T]
where
θ ^ − E ( θ ^ ) = ( H T C − 1 H ) − 1 H T C − 1 ( H θ + w ) − E ( θ ^ ) = ( H T C − 1 H ) − 1 H T C − 1 w \begin{aligned} \hat{\boldsymbol{\theta}}-E(\hat{\boldsymbol{\theta}}) &=\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}^{-1}(\mathbf{H} \boldsymbol{\theta}+\mathbf{w})-E(\hat{\boldsymbol{\theta}}) \\ &=\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{w} \end{aligned} θ^E(θ^)=(HTC1H)1HTC1(Hθ+w)E(θ^)=(HTC1H)1HTC1w
Thus,
C θ ~ = E [ ( H T C − 1 H ) − 1 H T C − 1 w w T C − 1 H ( H T C − 1 H ) − 1 ] \mathbf{C}_{\tilde{\theta}}=E\left[\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{w} \mathbf{w}^{T} \mathbf{C}^{-1} \mathbf{H}\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1}\right] Cθ~=E[(HTC1H)1HTC1wwTC1H(HTC1H)1]
since the covariance matrix of w \mathbf{w} w is C , \mathbf{C}, C, we have
C θ ˙ = ( H T C − 1 H ) − 1 H T C − 1 C C − 1 H ( H T C − 1 H ) − 1 = ( H T C − 1 H ) − 1 \begin{aligned} \mathbf{C}_{\dot{\theta}} &=\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{C} \mathbf{C}^{-1} \mathbf{H}\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \\ &=\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \end{aligned} Cθ˙=(HTC1H)1HTC1CC1H(HTC1H)1=(HTC1H)1

Least Squares (LS)

Different from MVU, MLE and BLUE, the method of least squares (LS) needs no probabilistic assumptions about the data. However, estimator may not be statistically efficient.

在这里插入图片描述

 Measurement vector  : x = x [ 0 ] , x [ 1 ] , … , x [ N − 1 ]  Signal vector  : s = s [ 0 ] , s [ 1 ] , … , s [ N − 1 ]  Unknown parameter  : θ \begin{aligned} \text { Measurement vector } &: \quad \mathrm{x}=x[0], x[1], \ldots, x[N-1] \\ \text { Signal vector } &: \quad \mathrm{s}=s[0], s[1], \ldots, s[N-1]\\ \text { Unknown parameter }&:\quad \theta \end{aligned}  Measurement vector  Signal vector  Unknown parameter :x=x[0],x[1],,x[N1]:s=s[0],s[1],,s[N1]:θ
The least squares estimator (LSE) of θ \theta θ chooses the value that makes s [ n ] s[n] s[n] closest to the observed data x [ n ] x[n] x[n]. Closeness is measured by the LS error criterion
J ( θ ) = ∑ n = 0 N − 1 ( x [ n ] − s [ n ] ) 2 J(\theta)=\sum_{n=0}^{N-1}(x[n]-s[n])^2 J(θ)=n=0N1(x[n]s[n])2
The value of θ \theta θ that minimizes J ( θ ) J(\theta) J(θ) is the LSE. For a vector parameter θ \boldsymbol \theta θ of dimension p × 1 p\times 1 p×1, the LS error criterion is almost the same
J ( θ ) = ∑ n = 0 N − 1 ( x [ n ] − s [ n ] ) 2 J(\boldsymbol\theta)=\sum_{n=0}^{N-1}(x[n]-s[n])^2 J(θ)=n=0N1(x[n]s[n])2
Example: DC level signal

Assume that the signal model in Figure 8.1 is s [ n ] = A s[n]=A s[n]=A and we observe x [ n ] x[n] x[n] for n = n= n= 0 , 1 , … , N − 1. 0,1, \ldots, N-1 . 0,1,,N1. Then, according to the LS approach, we can estimate A A A by minimizing
J ( A ) = ∑ n = 0 N − 1 ( x [ n ] − A ) 2 J(A)=\sum_{n=0}^{N-1}(x[n]-A)^{2} J(A)=n=0N1(x[n]A)2
Differentiating with respect to A A A and setting the result equal to zero produces
A ^ = 1 N ∑ n = 0 N − 1 x [ n ] = x ˉ \hat{A} =\frac{1}{N} \sum_{n=0}^{N-1} x[n] =\bar{x} A^=N1n=0N1x[n]=xˉ
or the sample mean estimator. Our familiar estimator, however, cannot be claimed to be optimal in the MVU sense but only in that it minimizes the LS error. We know, however, from our previous discussions that if x [ n ] = A + w [ n ] , x[n]=A+w[n], x[n]=A+w[n], where w [ n ] w[n] w[n] is zero mean then the LSE will also be the MVU estimator, but otherwise not.

Linear Least Squares

In applying the linear LS approach for a scalar parameter we must assume that
s [ n ] = θ h [ n ] s[n]=\theta h[n] s[n]=θh[n]
where h [ n ] h[n] h[n] is a known sequence. (Compare it with [BLUE](#Scalar BLUE))

Or generalize it with vector parameter θ \boldsymbol \theta θ
s = H θ \mathbf s=\mathbf H\boldsymbol \theta s=Hθ
where the measurement vector s = [ s [ 0 ] , s [ 1 ] , ⋯   , s [ N − 1 ] ] T \mathbf s=[s[0],s[1],\cdots,s[N-1]]^T s=[s[0],s[1],,s[N1]]T, and H \mathbf H H is a known N × p N\times p N×p matrix ( N > p ) (N>p) (N>p) of full rank p p p. The matrix H \mathbf H H is referred to as the observation matrix.

The LSE is found by minimizing
J ( θ ) = ∑ n = 0 N − 1 ( x [ n ] − s [ n ] ) 2 = ( x − H θ ) T ( x − H θ ) J(\boldsymbol\theta)=\sum_{n=0}^{N-1}(x[n]-s[n])^2=(\mathbf x-\mathbf H\boldsymbol \theta)^T(\mathbf x-\mathbf H\boldsymbol \theta) J(θ)=n=0N1(x[n]s[n])2=(xHθ)T(xHθ)
Setting the derivative of J ( θ ) J(\boldsymbol \theta) J(θ) to zero
∂ J ( θ ) ∂ θ = − 2 H T x + 2 H T H θ = 0 ⟹ θ ^ = ( H T H ) − 1 H T x \frac{\partial J(\boldsymbol \theta)}{\partial \boldsymbol \theta}=-2\mathbf H^T\mathbf x+2\mathbf H^T\mathbf H\boldsymbol \theta=0 \Longrightarrow \hat{\boldsymbol \theta}=(\mathbf H^T\mathbf H)^{-1}\mathbf H^T\mathbf x θJ(θ)=2HTx+2HTHθ=0θ^=(HTH)1HTx
The equations H T H θ = H T x \mathbf H^T \mathbf H\boldsymbol \theta=\mathbf H^T\mathbf x HTHθ=HTx to be solved for θ ^ \hat{\boldsymbol \theta} θ^ are termed as the normal equations.

The minimum LS error is
J m i n = J ( θ ^ ) = ( x − H θ ^ ) T ( x − H θ ^ ) = x T ( x − H θ ^ ) − θ ^ T H T ( x − H θ ^ ) = a x T ( x − H θ ^ ) = x T x − x T H ( H T H ) − 1 H T x \begin{aligned} J_{min}&=J(\hat {\boldsymbol \theta})=(\mathbf x-\mathbf H\hat{\boldsymbol \theta})^T(\mathbf x-\mathbf H\hat {\boldsymbol \theta})\\ &=\mathbf x^T(\mathbf x-\mathbf H\hat {\boldsymbol \theta})-\hat {\boldsymbol \theta}^T\mathbf H^T(\mathbf x-\mathbf H\hat {\boldsymbol \theta})\\ &\stackrel{a}{=} \mathbf x^T(\mathbf x-\mathbf H\hat {\boldsymbol \theta})\\ &=\mathbf x^T\mathbf x-\mathbf x^T\mathbf H(\mathbf H^T\mathbf H)^{-1}\mathbf H^T\mathbf x \end{aligned} Jmin=J(θ^)=(xHθ^)T(xHθ^)=xT(xHθ^)θ^THT(xHθ^)=axT(xHθ^)=xTxxTH(HTH)1HTx
where = a \stackrel{a}{=} =a follows from the normal equations.

An extension of the linear LS problem is to weighted LS. We include an N × N N\times N N×N positive definite symmetric weighting matrix W \mathbf W W, so that
J ( θ ) = ( x − H θ ) T W ( x − H θ ) J(\boldsymbol \theta)=(\mathbf x-\mathbf H {\boldsymbol \theta})^T\mathbf W(\mathbf x-\mathbf H {\boldsymbol \theta}) J(θ)=(xHθ)TW(xHθ)
The rationale for introducing weighting factors into the error criterion is to emphasize the contributions of those data samples that are deemed to be more reliable.

The general form of the weighted LSE can be similarly derived by
∂ J ( θ ) ∂ θ = − 2 H T W x + 2 H T W H θ = 0 ⟹ θ ^ = ( H T W H ) − 1 H T W x \frac{\partial J(\boldsymbol \theta)}{\partial \boldsymbol \theta}=-2\mathbf H^T\mathbf W\mathbf x+2\mathbf H^T\mathbf W\mathbf H\boldsymbol \theta=0 \Longrightarrow \hat{\boldsymbol \theta}=(\mathbf H^T\mathbf W\mathbf H)^{-1}\mathbf H^T\mathbf W\mathbf x θJ(θ)=2HTWx+2HTWHθ=0θ^=(HTWH)1HTWx
and its minimum LS error is
J m i n = J ( θ ^ ) = ( x − H θ ^ ) T W ( x − H θ ^ ) = x T W ( x − H θ ^ ) − θ ^ T H T W ( x − H θ ^ ) = x T W ( x − H θ ^ ) = x T W x − x T W H ( H T W H ) − 1 H T W x \begin{aligned} J_{min}&=J(\hat {\boldsymbol \theta})=(\mathbf x-\mathbf H\hat{\boldsymbol \theta})^T\mathbf W(\mathbf x-\mathbf H\hat {\boldsymbol \theta})\\ &=\mathbf x^T\mathbf W(\mathbf x-\mathbf H\hat {\boldsymbol \theta})-\hat {\boldsymbol \theta}^T\mathbf H^T\mathbf W(\mathbf x-\mathbf H\hat {\boldsymbol \theta})\\ &= \mathbf x^T\mathbf W(\mathbf x-\mathbf H\hat {\boldsymbol \theta})\\ &=\mathbf x^T\mathbf W\mathbf x-\mathbf x^T\mathbf W\mathbf H(\mathbf H^T\mathbf W\mathbf H)^{-1}\mathbf H^T\mathbf W\mathbf x \end{aligned} Jmin=J(θ^)=(xHθ^)TW(xHθ^)=xTW(xHθ^)θ^THTW(xHθ^)=xTW(xHθ^)=xTWxxTWH(HTWH)1HTWx

Geometrical Interpretations

The LS error can be also written as
J ( θ ) = ∥ x − H θ ∥ 2 = ∥ x − ∑ i = 1 p θ i h i ∥ 2 J(\boldsymbol \theta)=\|\mathbf x-\mathbf H\boldsymbol \theta\|^2=\|\mathbf x-\sum_{i=1}^p\boldsymbol \theta_i\mathbf h_i\|^2 J(θ)=xHθ2=xi=1pθihi2
which can be interpreted as the problem of fitting or approximating a data vector x \mathbf x x in R N R^N RN by another vector s ^ \hat {\mathbf s} s^, which is a linear combination of vectors { h 1 , h 2 , . . . , h p } \{\mathbf h_1 , \mathbf h_2 , ... , \mathbf h_p\} {h1,h2,...,hp} that lie in a p p p-dimensional subspace of R N R^N RN.

在这里插入图片描述

Define the error vector as ϵ = x − H θ \boldsymbol \epsilon=\mathbf x-\mathbf H\boldsymbol \theta ϵ=xHθ. Then the normal equations becomes
H T ϵ = 0 \mathbf H^T \boldsymbol \epsilon=\mathbf 0 HTϵ=0
i.e., the error vector must be orthogonal to the columns of H \mathbf H H. This is the well-known orthogonality principle. Therefore, the problem is solved by choosing s ^ \hat{\mathbf s} s^ in the subspace to be the orthogonal projection of x \mathbf x x:
s ^ = P x = H ( H T H ) − 1 H T x \hat {\mathbf s}=\mathbf P \mathbf x=\mathbf H(\mathbf H^T\mathbf H)^{-1}\mathbf H^T\mathbf x s^=Px=H(HTH)1HTx
where the N × N N\times N N×N matrix P = H ( H T H ) − 1 H T \mathbf P=\mathbf H(\mathbf H^T\mathbf H)^{-1}\mathbf H^T P=H(HTH)1HT is known as the projection matrix. It has the properties

  • P T = P \mathbf P^T=\mathbf P PT=P, symmetric
  • P 2 = P \mathbf P^2=\mathbf P P2=P, idempotent

It can be easily verified from the properties given above that P ⊥ = I − P \mathbf P^\perp=\mathbf I-\mathbf P P=IP is also a projection matrix. As a result, the error vector ϵ = x − s ^ = ( I − P ) x \boldsymbol \epsilon=\mathbf x-\hat {\mathbf s}=(\mathbf I-\mathbf P)\mathbf x ϵ=xs^=(IP)x is the projection of x \mathbf x x onto the complement subspace or the subspace orthogonal to the signal subspace. The minimum LS error can then be presented as
J m i n = x T x − x T H ( H T H ) − 1 H T x = x T ( I − P ) x = x T P ⊥ x = x T ( P ⊥ ) T P ⊥ x = ∥ P ⊥ x ∥ 2 \begin{aligned} J_{min} &=\mathbf x^T\mathbf x-\mathbf x^T\mathbf H(\mathbf H^T\mathbf H)^{-1}\mathbf H^T\mathbf x\\ &=\mathbf x^T(\mathbf I-\mathbf P)\mathbf x\\ &=\mathbf x^T \mathbf P^\perp \mathbf x\\ &=\mathbf x^T (\mathbf P^\perp)^T \mathbf P^\perp \mathbf x\\ &=\|\mathbf P^\perp\mathbf x\|^2 \end{aligned} Jmin=xTxxTH(HTH)1HTx=xT(IP)x=xTPx=xT(P)TPx=Px2
which is just ∥ ϵ ∥ 2 \|\boldsymbol \epsilon\|^2 ϵ2.

Constraint Least Squares

At times we are confronted with LS problems whose unknown parameters must be constrained. Such would be the case if we wished to estimate the amplitudes of several signals but knew a priori that some of the amplitudes were equal. Then, the total number of parameters to be estimated should be reduced to take advantage of the a priori knowledge. For this example the parameters are linearly related, leading to a linear least squares problem with linear constraints. We summarize the constraints as
A θ = b \mathbf A \boldsymbol \theta =\mathbf b Aθ=b
where A \mathbf A A is a known r × p r\times p r×p matrix and b \mathbf b b is a known r × 1 r\times 1 r×1 vector.

To find the LSE subject to the linear constraints we use the technique of Lagrangian multipliers. We determine θ ^ c \hat {\boldsymbol \theta}_c θ^c ( c c c denotes the constrained LSE) by minimizing the Lagrangian
J c = ( x − H θ ) T ( x − H θ ) + λ T ( A θ − b ) J_c=(\mathbf x-\mathbf H {\boldsymbol \theta})^T(\mathbf x-\mathbf H {\boldsymbol \theta})+\boldsymbol \lambda^T(\mathbf A \boldsymbol \theta-\mathbf b) Jc=(xHθ)T(xHθ)+λT(Aθb)
where λ \boldsymbol \lambda λ is a r × 1 r\times 1 r×1 vector of Lagrangian multipliers. Expanding this expression, we have
J c = x T x − 2 θ T H T x + θ T H T H θ + λ T A θ − λ T b J_c=\mathbf x^T \mathbf x-2\boldsymbol \theta^T\mathbf H^T \mathbf x+\boldsymbol \theta^T\mathbf H^T\mathbf H\boldsymbol \theta+\boldsymbol \lambda^T\mathbf A\boldsymbol \theta-\boldsymbol \lambda ^T\mathbf b Jc=xTx2θTHTx+θTHTHθ+λTAθλTb
Taking the gradient with respect to θ \boldsymbol \theta θ and yields
∂ J c ∂ θ = − 2 H T x + 2 H T H θ + A T λ \frac{\partial J_{c}}{\partial \boldsymbol{\theta}}=-2 \mathbf{H}^{T} \mathbf{x}+2 \mathbf{H}^{T} \mathbf{H} \boldsymbol{\theta}+\mathbf{A}^{T} \boldsymbol{\lambda} θJc=2HTx+2HTHθ+ATλ
and setting it equal to zero produces
θ ^ c = ( H T H ) − 1 H T x − 1 2 ( H T H ) − 1 A T λ = θ ^ − ( H T H ) − 1 A T λ 2 \begin{aligned} \hat{\boldsymbol{\theta}}_{c} &=\left(\mathbf{H}^{T} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{x}-\frac{1}{2}\left(\mathbf{H}^{T} \mathbf{H}\right)^{-1} \mathbf{A}^{T} \boldsymbol{\lambda} \\ &=\hat{\boldsymbol{\theta}}-\left(\mathbf{H}^{T} \mathbf{H}\right)^{-1} \mathbf{A}^{T} \frac{\boldsymbol{\lambda}}{2} \end{aligned} θ^c=(HTH)1HTx21(HTH)1ATλ=θ^(HTH)1AT2λ
where θ ^ \hat{\boldsymbol \theta} θ^ is the unconstrained LSE and λ \boldsymbol \lambda λ is yet to be determined. To find λ \boldsymbol\lambda λ we impose the constraint so that
A θ c = A θ ^ − A ( H T H ) − 1 A T λ 2 = b \mathbf{A} \boldsymbol{\theta}_{c}=\mathbf{A} \hat{\boldsymbol{\theta}}-\mathbf{A}\left(\mathbf{H}^{T} \mathbf{H}\right)^{-1} \mathbf{A}^{T} \frac{\boldsymbol{\lambda}}{2}=\mathbf{b} Aθc=Aθ^A(HTH)1AT2λ=b
and hence
λ 2 = [ A ( H T H ) − 1 A T ] − 1 ( A θ ^ − b ) \frac{\boldsymbol{\lambda}}{2}=\left[\mathbf{A}\left(\mathbf{H}^{T} \mathbf{H}\right)^{-1} \mathbf{A}^{T}\right]^{-1}(\mathbf{A} \hat{\boldsymbol{\theta}}-\mathbf{b}) 2λ=[A(HTH)1AT]1(Aθ^b)
Substituting into (8.51) produces the solution
θ ^ c = θ ^ − ( H T H ) − 1 A T [ A ( H T H ) − 1 A T ] − 1 ( A θ ^ − b ) \hat{\boldsymbol{\theta}}_{c}=\hat{\boldsymbol{\theta}}-\left(\mathbf{H}^{T} \mathbf{H}\right)^{-1} \mathbf{A}^{T}\left[\mathbf{A}\left(\mathbf{H}^{T} \mathbf{H}\right)^{-1} \mathbf{A}^{T}\right]^{-1}(\mathbf{A} \hat{\boldsymbol{\theta}}-\mathbf{b}) θ^c=θ^(HTH)1AT[A(HTH)1AT]1(Aθ^b)
where θ ^ = ( H T H ) − 1 H T x . \hat{\boldsymbol{\theta}}=\left(\mathbf{H}^{T} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{x} . θ^=(HTH)1HTx. The constrained LSE is a corrected version of the unconstrained LSE.

Nonlinear Least Squares

Recall that the LS procedure estimates model parameters θ \boldsymbol \theta θ by minimizing the LS error criterion
J = ( x − s ( θ ) ) T ( x − s ( θ ) ) J=(\mathbf x-\mathbf s(\boldsymbol \theta))^T(\mathbf x-\mathbf s(\boldsymbol \theta)) J=(xs(θ))T(xs(θ))
where s ( θ ) \mathbf s(\boldsymbol \theta) s(θ) is the signal model for x \mathbf x x, with its dependence on θ \boldsymbol \theta θ explicitly shown. (Note that if x − s ( θ ) ∼ N ( 0 , σ 2 I ) \mathbf x-\mathbf s(\boldsymbol \theta)\sim \mathcal N(\mathbf 0,\sigma^2\mathbf I) xs(θ)N(0,σ2I), the LSE is also the MLE.) In the linear LS problem the signal takes on the special form s ( θ ) = H θ \mathbf s(\boldsymbol \theta) = \mathbf H \boldsymbol \theta s(θ)=Hθ, which leads to the simple linear LSE. In general, s ( θ ) \mathbf s(\boldsymbol \theta) s(θ) cannot be expressed in this manner but is an N N N-dimensional nonlinear function of θ \boldsymbol \theta θ.

Before discussing general methods for determining nonlinear LSEs we first describe two methods that can reduce the complexity of the problem. They are

  • transformation of parameters

    In this case we seek a one-to-one transformation of θ \boldsymbol \theta θ that produces a linear signal model in the new space. To do so we let
    α = g ( θ ) \boldsymbol \alpha=\mathbf g(\boldsymbol \theta) α=g(θ)
    where g \mathbf g g is a p p p-dimensional function of θ \boldsymbol \theta θ whose inverse exists. If a g \mathbf g g can be found so that
    s ( θ ( α ) ) = s ( g − 1 ( α ) ) = H α \mathbf s(\boldsymbol \theta(\boldsymbol \alpha))=\mathbf s(\mathbf g^{-1}(\boldsymbol \alpha))=\mathbf H\boldsymbol \alpha s(θ(α))=s(g1(α))=Hα
    then the signal model will be linear in α \boldsymbol \alpha α. We can then easily find the linear LSE of α \boldsymbol \alpha α and thus the nonlinear LSE of θ \boldsymbol \theta θ by
    θ ^ = g − 1 ( α ^ ) \hat{\boldsymbol \theta}=\mathbf g^{-1}(\hat {\boldsymbol \alpha}) θ^=g1(α^)
    where
    α ^ = ( H T H ) − 1 H T x \hat{\boldsymbol \alpha}=(\mathbf H^T\mathbf H)^{-1}\mathbf H^T\mathbf x α^=(HTH)1HTx
    This approach relies on the property that the minimization can be carried out in any transformed space that is obtained by a one-to-one mapping and then converted back to the original space.

    Example:

    For a sinusoidal signal model
    s [ n ] = A cos ⁡ ( 2 π f 0 n + ϕ ) n = 0 , 1 , … , N − 1 s[n]=A \cos \left(2 \pi f_{0} n+\phi\right) \quad n=0,1, \ldots, N-1 s[n]=Acos(2πf0n+ϕ)n=0,1,,N1
    it is desired to estimate the amplitude A , A, A, where A > 0 , A>0, A>0, and phase ϕ . \phi . ϕ. The frequency f 0 f_{0} f0 is assumed known. The LSE is obtained by minimizing
    J = ∑ n = 0 N − 1 ( x [ n ] − A cos ⁡ ( 2 π f 0 n + ϕ ) ) 2 J=\sum_{n=0}^{N-1}\left(x[n]-A \cos \left(2 \pi f_{0} n+\phi\right)\right)^{2} J=n=0N1(x[n]Acos(2πf0n+ϕ))2
    over A A A and ϕ \phi ϕ, a nonlinear LS problem. However, because
    A cos ⁡ ( 2 π f 0 n + ϕ ) = A cos ⁡ ϕ cos ⁡ 2 π f 0 n − A sin ⁡ ϕ sin ⁡ 2 π f 0 n A \cos \left(2 \pi f_{0} n+\phi\right)=A \cos \phi \cos 2 \pi f_{0} n-A \sin \phi \sin 2 \pi f_{0} n Acos(2πf0n+ϕ)=Acosϕcos2πf0nAsinϕsin2πf0n
    if we let
    α 1 = A cos ⁡ ϕ α 2 = − A sin ⁡ ϕ \begin{array}{l} {\alpha}_{1}=A \cos \phi\\ {\alpha}_{2}=-A\sin \phi \end{array} α1=Acosϕα2=Asinϕ
    then the signal model becomes
    s [ n ] = α 1 cos ⁡ 2 π f 0 n + α 2 sin ⁡ 2 π f 0 n s[n]=\alpha_{1} \cos 2 \pi f_{0} n+\alpha_{2} \sin 2 \pi f_{0} n s[n]=α1cos2πf0n+α2sin2πf0n
    In matrix form this is
    s = H α \mathbf{s}=\mathbf{H} \boldsymbol\alpha s=Hα
    where
    H = [ 1 0 cos ⁡ 2 π f 0 sin ⁡ 2 π f 0 ⋮ ⋮ cos ⁡ 2 π f 0 ( N − 1 ) sin ⁡ 2 π f 0 ( N − 1 ) ] \mathbf{H}=\left[\begin{array}{cc} 1 & 0 \\ \cos 2 \pi f_{0} & \sin 2 \pi f_{0} \\ \vdots & \vdots \\ \cos 2 \pi f_{0}(N-1) & \sin 2 \pi f_{0}(N-1) \end{array}\right] H=