# Practical Estimators (MLE, BLUE, LS)

Reference:
Kay S. M. Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall PTR, 1993. (Ch. 6.1–6.5, Ch. 7.1–7.6, Ch. 8.1–8.5 and 8.8–8.9)
Slides of ET4386, TUD

Motivation:

• Determining the MVU estimator requires knowledge of the PDF.
• Even when the PDF is known, the MVU estimator may not exist or may be impossible to find.

Sub-optimal estimators:

• MLE: Maximum Likelihood Estimator (needs PDF)
• BLUE: Best Linear Unbiased Estimator (needs the first two moments but not the entire PDF)
• LS: Least Squares (no need of probabilistic assumptions)

Under certain conditions, these sub-optimal estimators equal the MVU, or their variance converges to the variance of the MVU.

## Maximum Likelihood Estimator (MLE)

As the name suggests, the MLE for a scalar parameter is defined to be the value of θ \theta that maximizes p ( x ; θ ) p(\mathbf x; \theta) for x \mathbf x fixed, i.e., the value that maximizes the likelihood function.

### Asymptotic Properties of the MLE

If the PDF p ( x ; θ ) p(\mathbf x; \theta) of the data x \mathbf x satisfies some “regularity” conditions, then the MLE of the unknown parameter θ \theta is asymptotically distributed (for large data records) according to
θ ^ ∼ a N ( θ , I − 1 ( θ ) ) \hat \theta \stackrel{a}{\sim} \mathcal {N}(\theta,I^{-1}(\theta))
where I ( θ ) I(\theta) is the Fisher information evaluated at the true value of the unknown parameter.

The regularity conditions require the existence of the derivatives of the log-likelihood function, as well as the Fisher information being nonzero, as described more fully in Appendix 7B. An outline of the proof for IID observations is also given there.

#### Example: DC-level in White Gaussian Noise

Consider the observed data set
x [ n ] = A + w [ n ] n = 0 , 1 , ⋯   , N − 1  where  w [ n ] ∼ N ( 0 , A ) x[n]=A+w[n]\quad n=0,1,\cdots,N-1 \text{ where }w[n]\sim \mathcal{N}(0,A)

• PDF:
p ( x ; A ) = 1 ( 2 π A ) N / 2 exp ⁡ [ − 1 2 A ∑ n = 0 N − 1 ( x [ n ] − A ) 2 ] p(\mathbf{x} ; A)=\frac{1}{(2 \pi A)^{N / 2}} \exp \left[-\frac{1}{2 A} \sum_{n=0}^{N-1}(x[n]-A)^{2}\right]

• Score:
∂ ln ⁡ p ( x ; A ) ∂ A = − N 2 A + 1 A ∑ n = 0 N − 1 ( x [ n ] − A ) + 1 2 A 2 ∑ n = 0 N − 1 ( x [ n ] − A ) 2 \frac{\partial \ln p(\mathbf{x} ; A)}{\partial A}=-\frac{N}{2 A}+\frac{1}{A} \sum_{n=0}^{N-1}(x[n]-A)+\frac{1}{2 A^{2}} \sum_{n=0}^{N-1}(x[n]-A)^{2}

• CRLB:
\begin{aligned} \frac{\partial^2 \ln p(\mathbf{x} ; A)}{\partial A^2}&=\frac{N}{2A^2}-\frac{1}{A^2}\sum_{n=0}^{N-1}x[n]-\frac{1}{A^2}\sum_{n=0}^{N-1}(x[n]-A)-\frac{1}{A^3}\sum_{n=0}^{N-1}(x[n]-A)^2\\ I(A)&=-E\left(\frac{\partial^2 \ln p(\mathbf{x} ; A)}{\partial A^2}\right)=-\frac{N}{2A^2}+\frac{NA}{A^2}+0+\frac{NA}{A^3}=\frac{N}{2A^2}+\frac{N}{A}\\ \operatorname{var}(\hat A)&\ge \frac{1}{I(A)}=\frac{A^2}{N(A+1/2)} \end{aligned}

However, it is not obvious whether the score can be put in the form \frac{\partial \ln p(\mathbf{x} ; A)}{\partial A}=I(A)(\hat A-A) required for an efficient estimator. Therefore, we turn to the MLE instead, setting the score to zero (i.e., maximizing the log-likelihood function):
∂ ln ⁡ p ( x ; A ) ∂ A = − N 2 A + 1 A ∑ n = 0 N − 1 ( x [ n ] − A ) + 1 2 A 2 ∑ n = 0 N − 1 ( x [ n ] − A ) 2 = 0 \frac{\partial \ln p(\mathbf{x} ; A)}{\partial A}=-\frac{N}{2 A}+\frac{1}{A} \sum_{n=0}^{N-1}(x[n]-A)+\frac{1}{2 A^{2}} \sum_{n=0}^{N-1}(x[n]-A)^{2}=0
We then obtain
A ^ 2 + A ^ − 1 N ∑ n = 0 N − 1 x 2 [ n ] = 0 \hat{A}^{2}+\hat{A}-\frac{1}{N} \sum_{n=0}^{N-1} x^{2}[n]=0
Solving the above and choosing the positive A ^ \hat{A} :
A ^ = − 1 2 + 1 N ∑ n = 0 N − 1 x 2 [ n ] + 1 4 \hat{A}=-\frac{1}{2}+\sqrt{\frac{1}{N} \sum_{n=0}^{N-1} x^{2}[n]+\frac{1}{4}}
Note that since
\begin{aligned} E(\hat{A}) &=E\left(-\frac{1}{2}+\sqrt{\frac{1}{N} \sum_{n=0}^{N-1} x^{2}[n]+\frac{1}{4}}\right)\\ & \neq-\frac{1}{2}+\sqrt{E\left(\frac{1}{N} \sum_{n=0}^{N-1} x^{2}[n]\right)+\frac{1}{4}} \quad \text{for all } A \\ &=-\frac{1}{2}+\sqrt{A+A^{2}+\frac{1}{4}} \\ &=A \end{aligned}
the estimator is biased. It is nonetheless a reasonable estimator in that as N → ∞ N\to \infty , we have by the law of large numbers
1 N ∑ n = 0 N − 1 x 2 [ n ] → E ( x 2 [ n ] ) = A + A 2 \frac{1}{N}\sum_{n=0}^{N-1}x^2[n]\to E(x^2[n])=A+A^2
and therefore
A ^ → A \hat A\to A
Next, we show that A ^ \hat A is a consistent estimator with
E ( A ^ ) → a A , var ⁡ ( A ^ ) → a A 2 N ( A + 1 / 2 ) E(\hat A)\stackrel{a}\to A,\quad \operatorname{var}(\hat A)\stackrel{a}\to \frac{A^2}{N(A+1/2)}
i.e., A ^ \hat A is asymptotically unbiased and asymptotically achieves the CRLB. Hence, it is asymptotically efficient.

Denote
A ^ = − 1 2 + 1 N ∑ n = 0 N − 1 x 2 [ n ] + 1 4 ≜ g ( u ) \hat{A}=-\frac{1}{2}+\sqrt{\frac{1}{N} \sum_{n=0}^{N-1} x^{2}[n]+\frac{1}{4}}\triangleq g(u)
where u = 1 N ∑ n = 0 N − 1 x 2 [ n ] u=\frac{1}{N}\sum_{n=0}^{N-1}x^2[n] , and therefore
g ( u ) = − 1 2 + u + 1 4 g(u)=-\frac{1}{2}+\sqrt{u+\frac{1}{4}}
Linearizing about u 0 = E ( u ) = A + A 2 u_0=E(u)=A+A^2 , we have
g ( u ) ≈ g ( u 0 ) + d g ( u ) d u ∣ u = u 0 ( u − u 0 ) g(u) \approx g\left(u_{0}\right)+\left.\frac{d g(u)}{d u}\right|_{u=u_{0}}\left(u-u_{0}\right)
or
A ^ ≈ A + 1 2 A + 1 2 [ 1 N ∑ n = 0 N − 1 x 2 [ n ] − ( A + A 2 ) ] \hat{A} \approx A+\frac{\frac{1}{2}}{A+\frac{1}{2}}\left[\frac{1}{N} \sum_{n=0}^{N-1} x^{2}[n]-\left(A+A^{2}\right)\right]
It now follows that the asymptotic mean is
E ( A ^ ) = A E(\hat A)=A
so that A ^ \hat A is asymptotically unbiased. Additionally, the asymptotic variance becomes
var ⁡ ( A ^ ) = ( 1 2 A + 1 2 ) 2 var ⁡ [ 1 N ∑ n = 0 N − 1 x 2 [ n ] ] = 1 4 N ( A + 1 2 ) 2 var ⁡ ( x 2 [ n ] ) = 1 4 N ( A + 1 2 ) 2 ( 4 A 3 + 2 A 2 ) = A 2 N ( A + 1 2 ) \begin{aligned} \operatorname{var}(\hat{A}) &=\left(\frac{\frac{1}{2}}{A+\frac{1}{2}}\right)^{2} \operatorname{var}\left[\frac{1}{N} \sum_{n=0}^{N-1} x^{2}[n]\right] =\frac{\frac{1}{4}}{N\left(A+\frac{1}{2}\right)^{2}} \operatorname{var}\left(x^{2}[n]\right)\\ &=\frac{\frac{1}{4}}{N\left(A+\frac{1}{2}\right)^{2}} (4A^3+2A^2)=\frac{A^2}{N(A+\frac{1}{2})} \end{aligned}
which asymptotically achieves the CRLB.
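The closed-form MLE and its asymptotic behavior can be checked by simulation. Below is a minimal sketch; the seed, data-record length, and number of trials are arbitrary illustrative choices, not from the text.

```python
import numpy as np

# Monte Carlo check of the MLE A_hat = -1/2 + sqrt(mean(x^2) + 1/4)
# for x[n] = A + w[n], w[n] ~ N(0, A). All parameters are illustrative.
rng = np.random.default_rng(1)
A, N, trials = 2.0, 5_000, 2_000

# `trials` independent data records, each of length N
x = A + rng.normal(0.0, np.sqrt(A), size=(trials, N))
A_hat = -0.5 + np.sqrt(np.mean(x**2, axis=1) + 0.25)

crlb = A**2 / (N * (A + 0.5))   # asymptotic variance A^2 / (N(A + 1/2))
bias = A_hat.mean() - A         # should be small for large N
mc_var = A_hat.var()            # should be close to the CRLB
```

With these sizes, the empirical bias is negligible and the empirical variance agrees with the CRLB to within Monte Carlo error, illustrating the asymptotic efficiency of the MLE.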

#### Example: Linear Gaussian Model

For linear Gaussian model
x = h θ + w , w ∼ N ( 0 , C w ) \mathbf x=\mathbf h \theta+\mathbf w,\quad \mathbf w\sim\mathcal{N}(\mathbf 0,\mathbf C_w)
The likelihood function is given by
p ( x ; θ ) = 1 ( 2 π ) N / 2 det ⁡ ( C w ) 1 / 2 exp ⁡ [ − 1 2 ( x − h θ ) T C w − 1 ( x − h θ ) ] p(\mathbf{x} ; \theta)=\frac{1}{(2 \pi)^{N / 2} \operatorname{det}(\mathbf{C}_w)^{1 / 2}} \exp \left[-\frac{1}{2}(\mathbf{x}-\mathbf{h} \theta)^{T} \mathbf{C}_w^{-1}(\mathbf{x}-\mathbf{h} \theta)\right]
It is clear that this function is maximized by solving
θ ^ = argmin ⁡ θ [ ( x − h θ ) T C w − 1 ( x − h θ ) ] \hat{\theta}=\underset{\theta}{\operatorname{argmin}}\left[(\mathbf{x}-\mathbf{h} \theta)^{T} \mathbf{C}_w^{-1}(\mathbf{x}-\mathbf{h} \theta)\right]
Note that since x \mathbf{x} is a random vector that can take on many values, so is θ ^ \hat{\theta} .

Denote
J = ( x − h θ ) T C w − 1 ( x − h θ ) = x T C w − 1 x − 2 h T C w − 1 x θ + h T C w − 1 h θ 2 \begin{aligned} J&=(\mathbf{x}-\mathbf{h} \theta)^{T} \mathbf{C}_w^{-1}(\mathbf{x}-\mathbf{h} \theta)\\ &=\mathbf x^T \mathbf C_w^{-1}\mathbf x-2\mathbf h^T \mathbf C_w^{-1}\mathbf x\theta+\mathbf h^T\mathbf C_w^{-1}\mathbf h\theta^2 \end{aligned}
Setting the gradient w.r.t. θ \theta as zero, we have
∂ J ∂ θ = − 2 h T C w − 1 x + 2 h T C w − 1 h θ θ ^ = ( h T C w − 1 h ) − 1 h T C w − 1 x \begin{aligned} \frac{\partial J}{\partial \theta}&=-2\mathbf h^T \mathbf C_w^{-1}\mathbf x+2\mathbf h^T\mathbf C_w^{-1}\mathbf h\theta\\ \hat \theta&=(\mathbf h^T \mathbf C_w^{-1}\mathbf h)^{-1}\mathbf h^T \mathbf C_w^{-1}\mathbf x \end{aligned}
Therefore, for the linear Gaussian model, the MLE is the MVU estimator.
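A quick numerical illustration of this closed form; the diagonal (heteroscedastic) noise covariance and all parameter values below are illustrative assumptions.

```python
import numpy as np

# MLE (= MVU) for x = h*theta + w with known noise covariance C_w.
rng = np.random.default_rng(2)
N, theta_true = 200, 1.5

h = rng.normal(size=N)
sigma2 = np.linspace(0.5, 2.0, N)       # per-sample noise variances (diagonal C_w)
Cw_inv = np.diag(1.0 / sigma2)
w = rng.normal(0.0, np.sqrt(sigma2))
x = h * theta_true + w

# theta_hat = (h^T C_w^{-1} h)^{-1} h^T C_w^{-1} x
theta_hat = (h @ Cw_inv @ x) / (h @ Cw_inv @ h)
```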

### Invariance Property of the MLE

In many instances we wish to estimate a function of θ \theta , the parameter characterizing the PDF. For example, we may not be interested in the value of a DC level A A in WGN but only in the power A 2 A^2 . In this case, the MLE of the transformed parameter is found by substituting the MLE of the original parameter into the transformation. This property of the MLE is termed the invariance property.

The MLE of the parameter α = g ( θ ) \alpha=g(\theta) , where the PDF p ( x ; θ ) p(\mathbf{x} ; \theta) is parameterized by θ \theta , is given by
α ^ = g ( θ ^ ) \hat{\alpha}=g(\hat{\theta})
where θ ^ \hat{\theta} is the MLE of θ \theta , obtained by maximizing p ( x ; θ ) p(\mathbf{x} ; \theta) over θ \theta .

If g ( ⋅ ) g(\cdot) is not a one-to-one function, then α ^ \hat{\alpha} maximizes some modified likelihood function p ˉ T ( x ; α ) , \bar{p}_{T}(\mathbf{x} ; \alpha), defined as
p ˉ T ( x ; α ) = max ⁡ { θ : α = g ( θ ) } p ( x ; θ ) \bar{p}_{T}(\mathbf{x} ; \alpha)=\max _{\{\theta: \alpha=g(\theta)\}} p(\mathbf{x} ; \theta)

#### Example: Transformed DC Level in WGN

Consider the data
x [ n ] = A + w [ n ] n = 0 , 1 , … , N − 1 x[n]=A+w[n] \quad n=0,1, \ldots, N-1
where w [ n ] w[n] is WGN with variance σ 2 \sigma^{2} . We wish to find the MLE of α = exp ⁡ ( A ) \alpha=\exp (A) . The PDF is given as
p ( x ; A ) = 1 ( 2 π σ 2 ) N 2 exp ⁡ [ − 1 2 σ 2 ∑ n = 0 N − 1 ( x [ n ] − A ) 2 ] − ∞ < A < ∞ p(\mathbf{x} ; A)=\frac{1}{\left(2 \pi \sigma^{2}\right)^{\frac{N}{2}}} \exp \left[-\frac{1}{2 \sigma^{2}} \sum_{n=0}^{N-1}(x[n]-A)^{2}\right] \quad-\infty<A<\infty
and is parameterized by the parameter θ = A \theta=A . However, since α \alpha is a one-to-one transformation of A A , we can equivalently parameterize the PDF as
p T ( x ; α ) = 1 ( 2 π σ 2 ) N 2 exp ⁡ [ − 1 2 σ 2 ∑ n = 0 N − 1 ( x [ n ] − ln ⁡ α ) 2 ] α > 0 p_{T}(\mathbf{x} ; \alpha)=\frac{1}{\left(2 \pi \sigma^{2}\right)^{\frac{N}{2}}} \exp \left[-\frac{1}{2 \sigma^{2}} \sum_{n=0}^{N-1}(x[n]-\ln \alpha)^{2}\right] \quad \alpha>0
where the subscript T T indicates that the PDF is parameterized according to the transformed parameter. Clearly, p T ( x ; α ) p_{T}(\mathbf{x} ; \alpha) is the PDF of the data set
x [ n ] = ln ⁡ α + w [ n ] n = 0 , 1 , … , N − 1 x[n]=\ln \alpha+w[n] \quad n=0,1, \ldots, N-1
The MLE of α \alpha is found by maximizing the equivalent PDF over α . \alpha . Setting the derivative of p T ( x ; α ) p_{T}(\mathbf{x} ; \alpha) with respect to α \alpha equal to zero yields
∑ n = 0 N − 1 ( x [ n ] − ln ⁡ α ^ ) 1 α ^ = 0 \sum_{n=0}^{N-1}(x[n]-\ln \hat{\alpha}) \frac{1}{\hat{\alpha}}=0
or
α ^ = exp ⁡ ( x ˉ ) \hat{\alpha}=\exp (\bar{x})
But x ˉ \bar{x} is just the MLE of A , A, so that
α ^ = exp ⁡ ( A ^ ) = exp ⁡ ( θ ^ ) \hat{\alpha}=\exp (\hat{A})=\exp (\hat{\theta})
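Numerically, the invariance property just says that transforming the MLE of A A gives the MLE of α \alpha directly. A small sketch with illustrative values:

```python
import numpy as np

# Invariance: the MLE of alpha = exp(A) is exp(A_hat), where A_hat = sample mean.
rng = np.random.default_rng(3)
A, sigma, N = 0.7, 1.0, 50_000
x = A + rng.normal(0.0, sigma, N)

A_hat = x.mean()            # MLE of A
alpha_hat = np.exp(A_hat)   # MLE of alpha, by invariance
```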

#### Example: The SNR of DC Level in WGN

For N N IID observations from the PDF N ( A , σ 2 ) \mathcal N(A,\sigma^2) , where A A and σ 2 \sigma^2 are both unknown, find the MLE of the SNR α = A 2 / σ 2 \alpha=A^2/\sigma^2 .

PDF:
p ( x ; θ ) = 1 ( 2 π σ 2 ) N / 2 exp ⁡ [ − 1 2 σ 2 ∑ n = 0 N − 1 ( x [ n ] − A ) 2 ] p(\mathbf{x} ; \boldsymbol{\theta})=\frac{1}{(2 \pi \sigma^2)^{N / 2}} \exp \left[-\frac{1}{2 \sigma^2} \sum_{n=0}^{N-1}(x[n]-A)^{2}\right]
where θ = [ A   σ 2 ] T \boldsymbol{\theta}=[A~\sigma^2]^T . Then
∂ ln ⁡ p ( x ; θ ) ∂ A = 1 σ 2 ∑ n = 0 N − 1 ( x [ n ] − A ) ∂ ln ⁡ p ( x ; θ ) ∂ σ 2 = − N 2 σ 2 + 1 2 σ 4 ∑ n = 0 N − 1 ( x [ n ] − A ) 2 \begin{aligned} \frac{\partial \ln p(\mathbf{x} ; \boldsymbol{\theta})}{\partial A} &=\frac{1}{\sigma^{2}} \sum_{n=0}^{N-1}(x[n]-A) \\ \frac{\partial \ln p(\mathbf{x} ; \boldsymbol{\theta})}{\partial \sigma^{2}} &=-\frac{N}{2 \sigma^{2}}+\frac{1}{2 \sigma^{4}} \sum_{n=0}^{N-1}(x[n]-A)^{2} \end{aligned}
Setting the derivatives to zero, we obtain the MLE
θ ^ = [ x ‾ 1 N ∑ n = 0 N − 1 ( x [ n ] − x ‾ ) 2 ] \hat{\boldsymbol{\theta}}=\left[\begin{matrix} \overline x\\\frac{1}{N}\sum_{n=0}^{N-1}(x[n]-\overline x)^2 \end{matrix}\right]
According to the invariance property, we have
\hat \alpha=\frac{\hat A^2}{\hat \sigma^2}=\frac{\overline x^{\,2}}{\frac{1}{N}\sum_{n=0}^{N-1}(x[n]-\overline x)^2}
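A simulation of this SNR estimator (illustrative parameters; note the MLE uses the biased 1 / N 1/N variance estimate):

```python
import numpy as np

# MLE of the SNR alpha = A^2 / sigma^2 via the invariance property.
rng = np.random.default_rng(4)
A, sigma2, N = 2.0, 0.5, 100_000
x = A + rng.normal(0.0, np.sqrt(sigma2), N)

A_hat = x.mean()
sigma2_hat = np.mean((x - A_hat) ** 2)   # 1/N (biased) MLE of sigma^2
snr_hat = A_hat ** 2 / sigma2_hat        # alpha_hat = A_hat^2 / sigma2_hat
```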

## Best Linear Unbiased Estimator (BLUE)

Sometimes we do not know the PDF of the data, thus the MLE is not available. A common approach is to restrict the estimator to be linear in the data and find the linear estimator that is unbiased and has minimum variance. This estimator, which is termed the best linear unbiased estimator (BLUE), can be determined with knowledge of only the first and second moments of the PDF.

### Scalar BLUE

Definition: We observe the data set { x [ 0 ] , x [ 1 ] , ⋯   , x [ N − 1 ] } \{x[0],x[1],\cdots,x[N-1]\} whose PDF p ( x ; θ ) p(\mathbf x;\theta) depends on an unknown parameter θ \theta . The BLUE restricts the estimator to be linear in the data or
θ ^ = ∑ n = 0 N − 1 a n x [ n ] = a T x \hat \theta=\sum_{n=0}^{N-1}a_nx[n]=\mathbf a^T \mathbf x
where the a n a_n 's are constants yet to be determined.

Since we are restricting the class of estimators to be linear, the BLUE will be optimal (that is to say, the MVU estimator) only when the MVU estimator turns out to be linear.

Finding the BLUE:

• Unbiased
E ( θ ^ ) = ∑ n = 0 N − 1 a n E ( x [ n ] ) = a T E ( x ) = θ E(\hat \theta)=\sum_{n=0}^{N-1}a_n E(x[n])=\mathbf a^TE(\mathbf x)=\theta
In order to satisfy the unbiased constraint, E ( x [ n ] ) E(x[n]) must be linear in θ \theta
E ( x [ n ] ) = h [ n ] θ E(x[n])=h[n]\theta
where the h [ n ] h[n] 's are known. Or
E ( x ) = h θ , a T h = 1. E(\mathbf x)=\mathbf h \theta,\quad \mathbf a^T\mathbf h=1.
Note that if we write x [ n ] x[n] as
x [ n ] = E ( x [ n ] ) + [ x [ n ] − E ( x [ n ] ) ] x[n]=E(x[n])+[x[n]-E(x[n])]
then by viewing x [ n ] − E ( x [ n ] ) x[n]-E(x[n]) as noise or w [ n ] w[n] we have
x [ n ] = θ h [ n ] + w [ n ] x[n]=\theta h[n]+w[n]
which means that the BLUE is applicable to amplitude estimation of known signals in noise.

• Minimum variance
var ⁡ ( θ ^ ) = E [ ( θ ^ − E ( θ ^ ) ) 2 ] = E { [ a T ( x − E ( x ) ) ] 2 } = a T E [ ( x − E ( x ) ) ( x − E ( x ) ) T ] a = a T C a \begin{aligned} \operatorname{var}(\hat \theta)=& E[(\hat \theta-E(\hat \theta))^2]=E\{[\mathbf a^T(\mathbf x-E(\mathbf x))]^2\}\\ =& \mathbf a^TE[(\mathbf x-E(\mathbf x))(\mathbf x-E(\mathbf x))^T]\mathbf a\\ =& \mathbf a^T \mathbf C \mathbf a \end{aligned}

Therefore, we can formulate the optimization problem
minimize ⁡ a a T C a subject to a T h = 1 \begin{aligned} &\underset{\mathbf a} {\operatorname{minimize}} &&\mathbf a^T\mathbf C \mathbf a\\ &\text{subject to} && \mathbf a^T\mathbf h=1 \end{aligned}
which can be solved using the method of Lagrange multipliers.
J = a T C a + λ ( a T h − 1 ) J=\mathbf a^T\mathbf C \mathbf a+\lambda (\mathbf a^T\mathbf h-1)
Setting the gradient w.r.t. a \mathbf a to zero,
∂ J ∂ a = 2 C a + λ h = 0 ⟹ a = − λ 2 C − 1 h \frac{\partial J}{\partial \mathbf a}=2\mathbf C \mathbf a+\lambda \mathbf h=0 \Longrightarrow \mathbf a=-\frac{\lambda}{2}\mathbf C^{-1}\mathbf h
The Lagrange multiplier λ \lambda is found using the constraint
a T h = − λ 2 h T C − 1 h = 1 ⟹ − λ 2 = 1 h T C − 1 h \mathbf a^T \mathbf h=-\frac{\lambda}{2}\mathbf h^T\mathbf C^{-1}\mathbf h=1 \Longrightarrow -\frac{\lambda}{2}=\frac{1}{\mathbf h^T \mathbf C^{-1}\mathbf h}
Thus the optimal a \mathbf a is given by
a o p t = C − 1 h h T C − 1 h \mathbf a_{opt}=\frac{\mathbf C^{-1}\mathbf h}{\mathbf h ^T \mathbf C^{-1}\mathbf h}
and the variance for this value of a \mathbf a
var ⁡ ( θ ^ ) = a o p t T C a o p t = 1 h T C − 1 h \operatorname{var}(\hat \theta)=\mathbf a^T_{opt}\mathbf C \mathbf a_{opt}=\frac{1}{\mathbf h^T\mathbf C^{-1}\mathbf h}
We can prove that the a o p t \mathbf a_{opt} is indeed the global minimum and not just a local minimum by verifying
a T C a = ( a − a o p t ) T C ( a − a o p t ) + 1 h T C − 1 h \mathbf a^T \mathbf C \mathbf a=(\mathbf a-\mathbf a_{opt})^T\mathbf C(\mathbf a-\mathbf a_{opt})+\frac{1}{\mathbf h^T\mathbf C^{-1}\mathbf h}
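The verification is direct: for any a \mathbf a satisfying the constraint a T h = 1 \mathbf a^T\mathbf h=1 , expanding the quadratic form gives
(\mathbf a-\mathbf a_{opt})^T\mathbf C(\mathbf a-\mathbf a_{opt})=\mathbf a^T\mathbf C\mathbf a-2\mathbf a^T\mathbf C\mathbf a_{opt}+\mathbf a_{opt}^T\mathbf C\mathbf a_{opt}
where, using C a o p t = h / ( h T C − 1 h ) \mathbf C\mathbf a_{opt}=\mathbf h/(\mathbf h^T\mathbf C^{-1}\mathbf h) and the constraint,
\mathbf a^T\mathbf C\mathbf a_{opt}=\frac{\mathbf a^T\mathbf h}{\mathbf h^T\mathbf C^{-1}\mathbf h}=\frac{1}{\mathbf h^T\mathbf C^{-1}\mathbf h},\quad \mathbf a_{opt}^T\mathbf C\mathbf a_{opt}=\frac{1}{\mathbf h^T\mathbf C^{-1}\mathbf h}
so the cross terms collapse and the identity follows. Since C \mathbf C is positive definite, the quadratic term is nonnegative and vanishes only at a = a o p t \mathbf a=\mathbf a_{opt} , confirming the global minimum.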
Therefore,
θ ^ = a o p t T x = h T C − 1 x h T C − 1 h \hat \theta=\mathbf a_{opt}^T \mathbf x=\frac{\mathbf h^T\mathbf C^{-1} \mathbf x}{\mathbf h ^T \mathbf C^{-1}\mathbf h}
Since E ( x ) = h θ E(\mathbf x)=\mathbf h \theta , the BLUE is unbiased.

Also, as we asserted earlier, to determine the BLUE we only require knowledge of

• E ( x ) E(\mathbf x) , the mean or h \mathbf h
• C \mathbf C , the covariance

#### Example: Scalar BLUE for Linear Model

For linear model
x = h θ + w ,  with  E ( w ) = 0  and  cov ⁡ ( w ) = C x \mathbf x=\mathbf h\theta+\mathbf w,\text{ with }E(\mathbf w)=\mathbf 0\text{ and }\operatorname{cov}(\mathbf w)=\mathbf C_x
Since E ( x ) = h θ E(\mathbf x)=\mathbf h \theta , C = C x \mathbf C=\mathbf C_x , we can directly obtain that
θ ^ = ( h T C x − 1 h ) − 1 h T C x − 1 x \hat \theta=(\mathbf h^T \mathbf C_x^{-1}\mathbf h)^{-1}\mathbf h^T \mathbf C_x^{-1}\mathbf x
which is exactly the form of MVU if the noise is Gaussian.
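A small numerical sketch of the scalar BLUE for a DC level observed in noise of unequal, known variances; the variances and sample size below are illustrative, and no distributional assumption is made.

```python
import numpy as np

# Scalar BLUE for x[n] = theta + w[n], E(w) = 0, known per-sample variances.
# Here h is all ones (DC level), so the BLUE is an inverse-variance
# weighted average of the data.
rng = np.random.default_rng(5)
N, theta_true = 500, -0.8

h = np.ones(N)
var_n = rng.uniform(0.5, 3.0, N)         # known noise variances
C_inv = np.diag(1.0 / var_n)
x = h * theta_true + rng.normal(0.0, np.sqrt(var_n))

theta_blue = (h @ C_inv @ x) / (h @ C_inv @ h)
blue_var = 1.0 / (h @ C_inv @ h)         # var(theta_hat) = 1 / (h^T C^{-1} h)
```

With equal variances the weighting is uniform and the BLUE reduces to the sample mean; with unequal variances its variance is strictly smaller than that of the sample mean.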

### Vector BLUE

Definition: If the parameter to be estimated is a p × 1 p\times 1 vector parameter, then for the estimator to be linear in the data we require
θ ^ i = ∑ n = 0 N − 1 a i n x [ n ] i = 1 , 2 , ⋯   , p \hat \theta_i=\sum_{n=0}^{N-1}a_{in}x[n]\quad i=1,2,\cdots,p
where the a i n a_{in} 's are weighting coefficients. In matrix form this is
θ ^ = A x \hat{\boldsymbol{\theta}}=\mathbf A \mathbf x
where A \mathbf A is a p × N p\times N matrix.

Finding BLUE:

• Unbiased
E ( θ ^ ) = A E ( x ) = θ E(\hat{\boldsymbol{\theta}})=\mathbf AE(\mathbf x)=\boldsymbol{\theta}
To satisfy this condition, we must have
E ( x ) = H θ , A H = I E(\mathbf x)=\mathbf H\boldsymbol{\theta},\quad \mathbf A \mathbf H=\mathbf I
where
H = [ h 1 h 2 ⋯ h p ] , A = [ a 1 T a 2 T ⋮ a p T ] \mathbf H=[\mathbf h_1\quad \mathbf h_2\cdots \mathbf h_p],\quad \mathbf A=\left[\begin{matrix}\mathbf a_1^T\\ \mathbf a_2^T\\ \vdots \\\mathbf a_p^T \end{matrix}\right]
With these definitions the unbiased constraint reduces to
a i T h j = δ i j , i = 1 , 2 , ⋯   , p ; j = 1 , 2 , ⋯   , p \mathbf a_i^T \mathbf h_j=\delta_{ij},\quad i=1,2,\cdots,p;j=1,2,\cdots,p

• Minimum variance
var ⁡ ( θ ^ i ) = a i T C a i \operatorname{var}(\hat \theta_i)= \mathbf a_i^T \mathbf C \mathbf a_i

Therefore, we can formulate the optimization problem
minimize ⁡ { a i } i = 1 p a i T C a i subject to a i T h j = δ i j \begin{aligned} &\underset{\{\mathbf a_i\}_{i=1}^p} {\operatorname{minimize}} &&\mathbf a_i^T\mathbf C \mathbf a_i\\ &\text{subject to} && \mathbf a_i^T\mathbf h_j=\delta_{ij} \end{aligned}
Unlike the scalar BLUE, now we have p p constraints on a i \mathbf a_i instead of just one. Since each a i \mathbf a_i is free to assume any value, independently of the others, we actually have p p separate minimization problems linked only by the constraints. Consider the Lagrangian function for a i \mathbf a_i ,
J i = a i T C a i + ∑ j = 1 p λ j ( i ) ( a i T h j − δ i j ) J_i=\mathbf a_i^T\mathbf C\mathbf a_i+\sum_{j=1}^p \lambda_j^{(i)}(\mathbf a_i^T\mathbf h_j-\delta_{ij})
Setting the gradient of the Lagrangian to zero
∂ J i ∂ a i = 2 C a i + ∑ j = 1 p λ j ( i ) h j = 2 C a i + H λ i = 0 ⟹ a i = − 1 2 C − 1 H λ i \frac{\partial J_i}{\partial \mathbf a_i}=2\mathbf C\mathbf a_i+\sum_{j=1}^p \lambda_{j}^{(i)}\mathbf h_j=2\mathbf C\mathbf a_i+\mathbf H\boldsymbol{\lambda}_i=\mathbf 0 \Longrightarrow \mathbf a_i=-\frac{1}{2}\mathbf C^{-1}\mathbf H\boldsymbol{\lambda}_i
where λ i = [ λ 1 ( i )   λ 2 ( i )   ⋯   λ p ( i ) ] T \boldsymbol{\lambda}_i=[\lambda_1^{(i)}~\lambda_2^{(i)}~\cdots ~\lambda_p^{(i)}]^T .

To find the vector of Lagrangian multipliers we use the constraint equations
a i T h j = δ i j j = 1 , 2 , … , p \mathbf{a}_{i}^{T} \mathbf{h}_{j}=\delta_{i j} \quad j=1,2, \ldots, p
or in combined form
[ h 1 T ⋮ h i − 1 T h i T h i + 1 T ⋮ h p T ] a i = [ 0 ⋮ 0 1 0 ⋮ 0 ] \left[\begin{array}{c} \mathbf{h}_{1}^{T} \\ \vdots \\ \mathbf{h}_{i-1}^{T} \\ \mathbf{h}_{i}^{T} \\ \mathbf{h}_{i+1}^{T} \\ \vdots \\ \mathbf{h}_{p}^{T} \end{array}\right] \mathbf{a}_{i}=\left[\begin{array}{c} 0 \\ \vdots \\ 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{array}\right]
Letting e i \mathbf e_{i} denote the vector of all zeros except in the i i th place, we have the constraint equation
H T a i = − 1 2 H T C − 1 H λ i = e i ⟹ − 1 2 λ i = ( H T C − 1 H ) − 1 e i \mathbf{H}^{T} \mathbf{a}_{i}=-\frac{1}{2} \mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H} \boldsymbol{\lambda}_{i}=\mathbf{e}_{i}\Longrightarrow -\frac{1}{2} \boldsymbol{\lambda}_{i}=\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{e}_{i}
(assuming invertibility of H T C − 1 H \mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H} ).

Therefore,
a i o p t = C − 1 H ( H T C − 1 H ) − 1 e i \mathbf{a}_{i_{ {opt }}}=\mathbf{C}^{-1} \mathbf{H}\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{e}_{i}
and a corresponding variance of
var ⁡ ( θ ^ i ) = a i opt  T C a i opt  = e i T ( H T C − 1 H ) − 1 H T C − 1 C C − 1 H ( H T C − 1 H ) − 1 e i = e i T ( H T C − 1 H ) − 1 e i = [ ( H T C − 1 H ) − 1 ] i i \begin{aligned} \operatorname{var}\left(\hat{\theta}_{i}\right) &=\mathbf{a}_{i_{\text {opt }}}^{T} \mathbf{C a}_{i_{\text {opt }}} \\ &=\mathbf{e}_{i}^{T}\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{C C}^{-1} \mathbf{H}\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{e}_{i} \\ &=\mathbf{e}_{i}^{T}\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{e}_{i}\\ &=[\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1}]_{ii} \end{aligned}
We can express the vector BLUE in a more compact form by letting
θ ^ = [ a 1 opt  T x a 2  opt  T x ⋮ a p opt  T x ] = [ e 1 T ( H T C − 1 H ) − 1 H T C − 1 x e 2 T ( H T C − 1 H ) − 1 H T C − 1 x ⋮ e p T ( H T C − 1 H ) − 1 H T C − 1 x ] = [ e 1 T e 2 T ⋮ e p T ] ( H T C − 1 H ) − 1 H T C − 1 x = ( H T C − 1 H ) − 1 H T C − 1 x \begin{aligned} \hat{\boldsymbol{\theta}} &=\left[\begin{array}{c} \mathbf{a}_{1_{\text {opt }}}^{T} \mathbf{x} \\ \mathbf{a}_{2 \text { opt }}^{T} \mathbf{x} \\ \vdots \\ \mathbf{a}_{p_{\text {opt }}}^{T} \mathbf{x} \end{array}\right]=\left[\begin{array}{c} \mathbf{e}_{1}^{T}\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{x} \\ \mathbf{e}_{2}^{T}\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{x} \\ \vdots \\ \mathbf{e}_{p}^{T}\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{x} \end{array}\right] \\ &=\left[\begin{array}{c} \mathbf{e}_{1}^{T} \\ \mathbf{e}_{2}^{T} \\ \vdots \\ \mathbf{e}_{p}^{T} \end{array}\right]\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{x} \\ &=\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{x} \end{aligned}
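The compact form can be exercised numerically on, say, a two-parameter line fit; the observation matrix and noise variances below are illustrative choices.

```python
import numpy as np

# Vector BLUE theta_hat = (H^T C^{-1} H)^{-1} H^T C^{-1} x for a line fit.
rng = np.random.default_rng(6)
N = 300
H = np.column_stack([np.ones(N), np.linspace(0.0, 1.0, N)])  # offset + slope
theta_true = np.array([1.0, -2.0])

var_n = rng.uniform(0.2, 1.0, N)        # known, unequal noise variances
C_inv = np.diag(1.0 / var_n)
x = H @ theta_true + rng.normal(0.0, np.sqrt(var_n))

theta_blue = np.linalg.solve(H.T @ C_inv @ H, H.T @ C_inv @ x)
cov_blue = np.linalg.inv(H.T @ C_inv @ H)   # covariance of the estimate
```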

#### Example: Vector BLUE for General Linear Model (Gauss-Markov Theorem)

If the data are of the general linear model form
x = H θ + w \mathbf{x}=\mathbf{H} \boldsymbol{\theta}+\mathbf{w}
where H \mathbf{H} is a known N × p N \times p matrix, θ \boldsymbol{\theta} is a p × 1 p \times 1 vector of parameters to be estimated, and w \mathbf w is an N × 1 N \times 1 noise vector with zero mean and covariance C \mathbf{C} (the PDF of w \mathbf w is otherwise arbitrary), then the BLUE of θ \boldsymbol{\theta} is
θ ^ = ( H T C − 1 H ) − 1 H T C − 1 x \hat{\boldsymbol{\theta}}=\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{x}
and the minimum variance of θ ^ i \hat{\theta}_{i} is
var ⁡ ( θ ^ i ) = [ ( H T C − 1 H ) − 1 ] i i \operatorname{var}\left(\hat{\theta}_{i}\right)=\left[\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1}\right]_{i i}
In addition, the covariance matrix of θ ^ \hat{\boldsymbol{\theta}} is
C θ ^ = E [ ( θ ^ − E ( θ ^ ) ) ( θ ^ − E ( θ ^ ) ) T ] \mathbf{C}_{\hat{\theta}}=E\left[(\hat{\boldsymbol{\theta}}-E(\hat{\boldsymbol{\theta}}))(\hat{\boldsymbol{\theta}}-E(\hat{\boldsymbol{\theta}}))^{T}\right]
where
θ ^ − E ( θ ^ ) = ( H T C − 1 H ) − 1 H T C − 1 ( H θ + w ) − E ( θ ^ ) = ( H T C − 1 H ) − 1 H T C − 1 w \begin{aligned} \hat{\boldsymbol{\theta}}-E(\hat{\boldsymbol{\theta}}) &=\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}^{-1}(\mathbf{H} \boldsymbol{\theta}+\mathbf{w})-E(\hat{\boldsymbol{\theta}}) \\ &=\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{w} \end{aligned}
Thus,
\mathbf{C}_{\hat{\boldsymbol{\theta}}}=E\left[\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{w} \mathbf{w}^{T} \mathbf{C}^{-1} \mathbf{H}\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1}\right]
Since the covariance matrix of w \mathbf{w} is C \mathbf{C} , we have
\begin{aligned} \mathbf{C}_{\hat{\boldsymbol{\theta}}} &=\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{C} \mathbf{C}^{-1} \mathbf{H}\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \\ &=\left(\mathbf{H}^{T} \mathbf{C}^{-1} \mathbf{H}\right)^{-1} \end{aligned}
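The covariance formula can be verified by Monte Carlo; the sizes below are illustrative, and the noise is drawn Gaussian purely for convenience (the theorem itself does not require it).

```python
import numpy as np

# Empirical check that cov(theta_hat) = (H^T C^{-1} H)^{-1} for the BLUE.
rng = np.random.default_rng(7)
N, trials = 50, 20_000
H = np.column_stack([np.ones(N), np.linspace(0.0, 1.0, N)])
theta = np.array([0.5, 1.5])

noise_var = np.linspace(0.5, 1.5, N)    # known diagonal C
C_inv = np.diag(1.0 / noise_var)
A_blue = np.linalg.inv(H.T @ C_inv @ H) @ H.T @ C_inv   # the p x N matrix A

w = rng.normal(0.0, 1.0, size=(trials, N)) * np.sqrt(noise_var)
theta_hat = (H @ theta + w) @ A_blue.T  # one BLUE per trial (row)

emp_cov = np.cov(theta_hat.T)
pred_cov = np.linalg.inv(H.T @ C_inv @ H)
```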

## Least Squares (LS)

Unlike the MVU estimator, the MLE, and the BLUE, the method of least squares (LS) requires no probabilistic assumptions about the data. However, the resulting estimator may not be statistically efficient.

\begin{aligned} \text{Measurement vector} &: \quad \mathbf{x}=[x[0], x[1], \ldots, x[N-1]]^T \\ \text{Signal vector} &: \quad \mathbf{s}=[s[0], s[1], \ldots, s[N-1]]^T\\ \text{Unknown parameter}&: \quad \theta \end{aligned}
The least squares estimator (LSE) of θ \theta chooses the value that makes s [ n ] s[n] closest to the observed data x [ n ] x[n] . Closeness is measured by the LS error criterion
J ( θ ) = ∑ n = 0 N − 1 ( x [ n ] − s [ n ] ) 2 J(\theta)=\sum_{n=0}^{N-1}(x[n]-s[n])^2
The value of θ \theta that minimizes J ( θ ) J(\theta) is the LSE. For a vector parameter θ \boldsymbol \theta of dimension p × 1 p\times 1 , the LS error criterion is almost the same
J ( θ ) = ∑ n = 0 N − 1 ( x [ n ] − s [ n ] ) 2 J(\boldsymbol\theta)=\sum_{n=0}^{N-1}(x[n]-s[n])^2
Example: DC level signal

Assume that the signal model in Figure 8.1 is s [ n ] = A s[n]=A and we observe x [ n ] x[n] for n = n= 0 , 1 , … , N − 1. 0,1, \ldots, N-1 . Then, according to the LS approach, we can estimate A A by minimizing
J ( A ) = ∑ n = 0 N − 1 ( x [ n ] − A ) 2 J(A)=\sum_{n=0}^{N-1}(x[n]-A)^{2}
Differentiating with respect to A A and setting the result equal to zero produces
A ^ = 1 N ∑ n = 0 N − 1 x [ n ] = x ˉ \hat{A} =\frac{1}{N} \sum_{n=0}^{N-1} x[n] =\bar{x}
or the sample mean estimator. Our familiar estimator, however, cannot be claimed to be optimal in the MVU sense but only in that it minimizes the LS error. We know from our previous discussions, however, that if x [ n ] = A + w [ n ] x[n]=A+w[n] , where w [ n ] w[n] is zero-mean white Gaussian noise, then the LSE is also the MVU estimator, but otherwise not.

### Linear Least Squares

In applying the linear LS approach for a scalar parameter we must assume that
s [ n ] = θ h [ n ] s[n]=\theta h[n]
where h [ n ] h[n] is a known sequence. (Compare this with the [scalar BLUE](#scalar-blue).)

Or generalize it with vector parameter θ \boldsymbol \theta
s = H θ \mathbf s=\mathbf H\boldsymbol \theta
where the measurement vector s = [ s [ 0 ] , s [ 1 ] , ⋯   , s [ N − 1 ] ] T \mathbf s=[s[0],s[1],\cdots,s[N-1]]^T , and H \mathbf H is a known N × p N\times p matrix ( N > p ) (N>p) of full rank p p . The matrix H \mathbf H is referred to as the observation matrix.

The LSE is found by minimizing
J ( θ ) = ∑ n = 0 N − 1 ( x [ n ] − s [ n ] ) 2 = ( x − H θ ) T ( x − H θ ) J(\boldsymbol\theta)=\sum_{n=0}^{N-1}(x[n]-s[n])^2=(\mathbf x-\mathbf H\boldsymbol \theta)^T(\mathbf x-\mathbf H\boldsymbol \theta)
Setting the derivative of J ( θ ) J(\boldsymbol \theta) to zero
∂ J ( θ ) ∂ θ = − 2 H T x + 2 H T H θ = 0 ⟹ θ ^ = ( H T H ) − 1 H T x \frac{\partial J(\boldsymbol \theta)}{\partial \boldsymbol \theta}=-2\mathbf H^T\mathbf x+2\mathbf H^T\mathbf H\boldsymbol \theta=0 \Longrightarrow \hat{\boldsymbol \theta}=(\mathbf H^T\mathbf H)^{-1}\mathbf H^T\mathbf x
The equations H T H θ = H T x \mathbf H^T \mathbf H\boldsymbol \theta=\mathbf H^T\mathbf x to be solved for θ ^ \hat{\boldsymbol \theta} are termed the normal equations.

The minimum LS error is
J m i n = J ( θ ^ ) = ( x − H θ ^ ) T ( x − H θ ^ ) = x T ( x − H θ ^ ) − θ ^ T H T ( x − H θ ^ ) = a x T ( x − H θ ^ ) = x T x − x T H ( H T H ) − 1 H T x \begin{aligned} J_{min}&=J(\hat {\boldsymbol \theta})=(\mathbf x-\mathbf H\hat{\boldsymbol \theta})^T(\mathbf x-\mathbf H\hat {\boldsymbol \theta})\\ &=\mathbf x^T(\mathbf x-\mathbf H\hat {\boldsymbol \theta})-\hat {\boldsymbol \theta}^T\mathbf H^T(\mathbf x-\mathbf H\hat {\boldsymbol \theta})\\ &\stackrel{a}{=} \mathbf x^T(\mathbf x-\mathbf H\hat {\boldsymbol \theta})\\ &=\mathbf x^T\mathbf x-\mathbf x^T\mathbf H(\mathbf H^T\mathbf H)^{-1}\mathbf H^T\mathbf x \end{aligned}
where = a \stackrel{a}{=} follows from the normal equations.

An extension of the linear LS problem is to weighted LS. We include an N × N N\times N positive definite symmetric weighting matrix W \mathbf W , so that
J ( θ ) = ( x − H θ ) T W ( x − H θ ) J(\boldsymbol \theta)=(\mathbf x-\mathbf H {\boldsymbol \theta})^T\mathbf W(\mathbf x-\mathbf H {\boldsymbol \theta})
The rationale for introducing weighting factors into the error criterion is to emphasize the contributions of those data samples that are deemed to be more reliable.

The general form of the weighted LSE can be similarly derived by
∂ J ( θ ) ∂ θ = − 2 H T W x + 2 H T W H θ = 0 ⟹ θ ^ = ( H T W H ) − 1 H T W x \frac{\partial J(\boldsymbol \theta)}{\partial \boldsymbol \theta}=-2\mathbf H^T\mathbf W\mathbf x+2\mathbf H^T\mathbf W\mathbf H\boldsymbol \theta=0 \Longrightarrow \hat{\boldsymbol \theta}=(\mathbf H^T\mathbf W\mathbf H)^{-1}\mathbf H^T\mathbf W\mathbf x
and its minimum LS error is
J m i n = J ( θ ^ ) = ( x − H θ ^ ) T W ( x − H θ ^ ) = x T W ( x − H θ ^ ) − θ ^ T H T W ( x − H θ ^ ) = x T W ( x − H θ ^ ) = x T W x − x T W H ( H T W H ) − 1 H T W x \begin{aligned} J_{min}&=J(\hat {\boldsymbol \theta})=(\mathbf x-\mathbf H\hat{\boldsymbol \theta})^T\mathbf W(\mathbf x-\mathbf H\hat {\boldsymbol \theta})\\ &=\mathbf x^T\mathbf W(\mathbf x-\mathbf H\hat {\boldsymbol \theta})-\hat {\boldsymbol \theta}^T\mathbf H^T\mathbf W(\mathbf x-\mathbf H\hat {\boldsymbol \theta})\\ &= \mathbf x^T\mathbf W(\mathbf x-\mathbf H\hat {\boldsymbol \theta})\\ &=\mathbf x^T\mathbf W\mathbf x-\mathbf x^T\mathbf W\mathbf H(\mathbf H^T\mathbf W\mathbf H)^{-1}\mathbf H^T\mathbf W\mathbf x \end{aligned}
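Both estimators and the minimum-error expressions can be checked directly; the observation matrix and weights below are illustrative.

```python
import numpy as np

# Ordinary and weighted LS for x = H theta + noise, via the normal equations.
rng = np.random.default_rng(8)
N = 100
H = np.column_stack([np.ones(N), np.arange(N, dtype=float)])
theta_true = np.array([2.0, 0.05])
x = H @ theta_true + rng.normal(0.0, 1.0, N)

# Ordinary LS: solve H^T H theta = H^T x
theta_ls = np.linalg.solve(H.T @ H, H.T @ x)
J_min = x @ x - x @ H @ np.linalg.solve(H.T @ H, H.T @ x)

# Weighted LS with a positive definite diagonal weighting matrix W
W = np.diag(rng.uniform(0.5, 2.0, N))
theta_wls = np.linalg.solve(H.T @ W @ H, H.T @ W @ x)
J_min_w = x @ W @ x - x @ W @ H @ np.linalg.solve(H.T @ W @ H, H.T @ W @ x)
```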

### Geometrical Interpretations

The LS error can be also written as
J(\boldsymbol \theta)=\|\mathbf x-\mathbf H\boldsymbol \theta\|^2=\left\|\mathbf x-\sum_{i=1}^p\theta_i\mathbf h_i\right\|^2
which can be interpreted as the problem of fitting or approximating a data vector x \mathbf x in R N R^N by another vector s ^ \hat {\mathbf s} , which is a linear combination of vectors { h 1 , h 2 , . . . , h p } \{\mathbf h_1 , \mathbf h_2 , ... , \mathbf h_p\} that lie in a p p -dimensional subspace of R N R^N .

Define the error vector as ϵ = x − H θ ^ \boldsymbol \epsilon=\mathbf x-\mathbf H\hat{\boldsymbol \theta} . Then the normal equations become
H T ϵ = 0 \mathbf H^T \boldsymbol \epsilon=\mathbf 0
i.e., the error vector must be orthogonal to the columns of H \mathbf H . This is the well-known orthogonality principle. Therefore, the problem is solved by choosing s ^ \hat{\mathbf s} in the subspace to be the orthogonal projection of x \mathbf x :
s ^ = P x = H ( H T H ) − 1 H T x \hat {\mathbf s}=\mathbf P \mathbf x=\mathbf H(\mathbf H^T\mathbf H)^{-1}\mathbf H^T\mathbf x
where the N × N N\times N matrix P = H ( H T H ) − 1 H T \mathbf P=\mathbf H(\mathbf H^T\mathbf H)^{-1}\mathbf H^T is known as the projection matrix. It has the properties

• P T = P \mathbf P^T=\mathbf P , symmetric
• P 2 = P \mathbf P^2=\mathbf P , idempotent

It can be easily verified from the properties given above that P ⊥ = I − P \mathbf P^\perp=\mathbf I-\mathbf P is also a projection matrix. As a result, the error vector ϵ = x − s ^ = ( I − P ) x \boldsymbol \epsilon=\mathbf x-\hat {\mathbf s}=(\mathbf I-\mathbf P)\mathbf x is the projection of x \mathbf x onto the complement subspace, i.e., the subspace orthogonal to the signal subspace. The minimum LS error can then be written as
J m i n = x T x − x T H ( H T H ) − 1 H T x = x T ( I − P ) x = x T P ⊥ x = x T ( P ⊥ ) T P ⊥ x = ∥ P ⊥ x ∥ 2 \begin{aligned} J_{min} &=\mathbf x^T\mathbf x-\mathbf x^T\mathbf H(\mathbf H^T\mathbf H)^{-1}\mathbf H^T\mathbf x\\ &=\mathbf x^T(\mathbf I-\mathbf P)\mathbf x\\ &=\mathbf x^T \mathbf P^\perp \mathbf x\\ &=\mathbf x^T (\mathbf P^\perp)^T \mathbf P^\perp \mathbf x\\ &=\|\mathbf P^\perp\mathbf x\|^2 \end{aligned}
which is just ∥ ϵ ∥ 2 \|\boldsymbol \epsilon\|^2 .
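The projection-matrix properties and the orthogonality principle are easy to verify numerically. The following sketch uses an arbitrary random H and x (illustrative assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)

# Random observation matrix whose p = 2 columns span a subspace of R^N.
N, p = 6, 2
H = rng.normal(size=(N, p))
x = rng.normal(size=N)

# Projection onto the column space of H, and onto its orthogonal complement.
P = H @ np.linalg.inv(H.T @ H) @ H.T
P_perp = np.eye(N) - P

assert np.allclose(P, P.T)           # symmetric
assert np.allclose(P @ P, P)         # idempotent

s_hat = P @ x                        # LS signal estimate: projection of x
eps = P_perp @ x                     # error vector
assert np.allclose(H.T @ eps, 0)     # orthogonality principle: H^T eps = 0

J_min = x @ P_perp @ x
assert np.isclose(J_min, eps @ eps)  # J_min = ||P_perp x||^2
```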

### Constrained Least Squares

At times we are confronted with LS problems whose unknown parameters must be constrained. Such would be the case if we wished to estimate the amplitudes of several signals but knew a priori that some of the amplitudes were equal. Then, the total number of parameters to be estimated should be reduced to take advantage of the a priori knowledge. For this example the parameters are linearly related, leading to a linear least squares problem with linear constraints. We summarize the constraints as
A θ = b \mathbf A \boldsymbol \theta =\mathbf b
where A \mathbf A is a known r × p r\times p matrix and b \mathbf b is a known r × 1 r\times 1 vector.

To find the LSE subject to the linear constraints we use the technique of Lagrangian multipliers. We determine θ ^ c \hat {\boldsymbol \theta}_c ( c c denotes the constrained LSE) by minimizing the Lagrangian
J c = ( x − H θ ) T ( x − H θ ) + λ T ( A θ − b ) J_c=(\mathbf x-\mathbf H {\boldsymbol \theta})^T(\mathbf x-\mathbf H {\boldsymbol \theta})+\boldsymbol \lambda^T(\mathbf A \boldsymbol \theta-\mathbf b)
where λ \boldsymbol \lambda is an r × 1 r\times 1 vector of Lagrangian multipliers. Expanding this expression, we have
J c = x T x − 2 θ T H T x + θ T H T H θ + λ T A θ − λ T b J_c=\mathbf x^T \mathbf x-2\boldsymbol \theta^T\mathbf H^T \mathbf x+\boldsymbol \theta^T\mathbf H^T\mathbf H\boldsymbol \theta+\boldsymbol \lambda^T\mathbf A\boldsymbol \theta-\boldsymbol \lambda ^T\mathbf b
Taking the gradient with respect to θ \boldsymbol \theta yields
∂ J c ∂ θ = − 2 H T x + 2 H T H θ + A T λ \frac{\partial J_{c}}{\partial \boldsymbol{\theta}}=-2 \mathbf{H}^{T} \mathbf{x}+2 \mathbf{H}^{T} \mathbf{H} \boldsymbol{\theta}+\mathbf{A}^{T} \boldsymbol{\lambda}
and setting it equal to zero produces
θ ^ c = ( H T H ) − 1 H T x − 1 2 ( H T H ) − 1 A T λ = θ ^ − ( H T H ) − 1 A T λ 2 \begin{aligned} \hat{\boldsymbol{\theta}}_{c} &=\left(\mathbf{H}^{T} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{x}-\frac{1}{2}\left(\mathbf{H}^{T} \mathbf{H}\right)^{-1} \mathbf{A}^{T} \boldsymbol{\lambda} \\ &=\hat{\boldsymbol{\theta}}-\left(\mathbf{H}^{T} \mathbf{H}\right)^{-1} \mathbf{A}^{T} \frac{\boldsymbol{\lambda}}{2} \end{aligned}
where θ ^ \hat{\boldsymbol \theta} is the unconstrained LSE and λ \boldsymbol \lambda is yet to be determined. To find λ \boldsymbol\lambda we impose the constraint so that
A θ ^ c = A θ ^ − A ( H T H ) − 1 A T λ 2 = b \mathbf{A} \hat{\boldsymbol{\theta}}_{c}=\mathbf{A} \hat{\boldsymbol{\theta}}-\mathbf{A}\left(\mathbf{H}^{T} \mathbf{H}\right)^{-1} \mathbf{A}^{T} \frac{\boldsymbol{\lambda}}{2}=\mathbf{b}
and hence
λ 2 = [ A ( H T H ) − 1 A T ] − 1 ( A θ ^ − b ) \frac{\boldsymbol{\lambda}}{2}=\left[\mathbf{A}\left(\mathbf{H}^{T} \mathbf{H}\right)^{-1} \mathbf{A}^{T}\right]^{-1}(\mathbf{A} \hat{\boldsymbol{\theta}}-\mathbf{b})
Substituting this back into the expression for θ ^ c \hat{\boldsymbol \theta}_c produces the solution
θ ^ c = θ ^ − ( H T H ) − 1 A T [ A ( H T H ) − 1 A T ] − 1 ( A θ ^ − b ) \hat{\boldsymbol{\theta}}_{c}=\hat{\boldsymbol{\theta}}-\left(\mathbf{H}^{T} \mathbf{H}\right)^{-1} \mathbf{A}^{T}\left[\mathbf{A}\left(\mathbf{H}^{T} \mathbf{H}\right)^{-1} \mathbf{A}^{T}\right]^{-1}(\mathbf{A} \hat{\boldsymbol{\theta}}-\mathbf{b})
where θ ^ = ( H T H ) − 1 H T x . \hat{\boldsymbol{\theta}}=\left(\mathbf{H}^{T} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{x} . The constrained LSE is a corrected version of the unconstrained LSE.
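A short numerical sketch of the correction formula. The setup below (three amplitudes with the a priori knowledge that the first two are equal, encoded as Aθ = b) is a hypothetical instance of the example described above, not taken verbatim from the text:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical problem: p = 3 amplitudes, constrained so theta_1 = theta_2.
N, p = 20, 3
H = rng.normal(size=(N, p))
x = rng.normal(size=N)
A = np.array([[1.0, -1.0, 0.0]])     # encodes theta_1 - theta_2 = 0
b = np.array([0.0])

# Unconstrained LSE: theta_hat = (H^T H)^{-1} H^T x
theta_hat = np.linalg.solve(H.T @ H, H.T @ x)

# Constrained LSE: correct theta_hat back onto the constraint set
HtHinv = np.linalg.inv(H.T @ H)
K = HtHinv @ A.T @ np.linalg.inv(A @ HtHinv @ A.T)
theta_c = theta_hat - K @ (A @ theta_hat - b)

assert np.allclose(A @ theta_c, b)   # constraint satisfied exactly
```

Note that when the unconstrained estimate already satisfies Aθ̂ = b, the correction term vanishes and θ̂_c = θ̂.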

### Nonlinear Least Squares

Recall that the LS procedure estimates model parameters θ \boldsymbol \theta by minimizing the LS error criterion
J = ( x − s ( θ ) ) T ( x − s ( θ ) ) J=(\mathbf x-\mathbf s(\boldsymbol \theta))^T(\mathbf x-\mathbf s(\boldsymbol \theta))
where s ( θ ) \mathbf s(\boldsymbol \theta) is the signal model for x \mathbf x , with its dependence on θ \boldsymbol \theta explicitly shown. (Note that if x − s ( θ ) ∼ N ( 0 , σ 2 I ) \mathbf x-\mathbf s(\boldsymbol \theta)\sim \mathcal N(\mathbf 0,\sigma^2\mathbf I) , the LSE is also the MLE.) In the linear LS problem the signal takes on the special form s ( θ ) = H θ \mathbf s(\boldsymbol \theta) = \mathbf H \boldsymbol \theta , which leads to the simple linear LSE. In general, s ( θ ) \mathbf s(\boldsymbol \theta) cannot be expressed in this manner but is an N N -dimensional nonlinear function of θ \boldsymbol \theta .

Before discussing general methods for determining nonlinear LSEs, we first describe two methods that can reduce the complexity of the problem. They are

• transformation of parameters

In this case we seek a one-to-one transformation of θ \boldsymbol \theta that produces a linear signal model in the new space. To do so we let
α = g ( θ ) \boldsymbol \alpha=\mathbf g(\boldsymbol \theta)
where g \mathbf g is a p p -dimensional function of θ \boldsymbol \theta whose inverse exists. If a g \mathbf g can be found so that
s ( θ ( α ) ) = s ( g − 1 ( α ) ) = H α \mathbf s(\boldsymbol \theta(\boldsymbol \alpha))=\mathbf s(\mathbf g^{-1}(\boldsymbol \alpha))=\mathbf H\boldsymbol \alpha
then the signal model will be linear in α \boldsymbol \alpha . We can then easily find the linear LSE of α \boldsymbol \alpha and thus the nonlinear LSE of θ \boldsymbol \theta by
θ ^ = g − 1 ( α ^ ) \hat{\boldsymbol \theta}=\mathbf g^{-1}(\hat {\boldsymbol \alpha})
where
α ^ = ( H T H ) − 1 H T x \hat{\boldsymbol \alpha}=(\mathbf H^T\mathbf H)^{-1}\mathbf H^T\mathbf x
This approach relies on the property that the minimization can be carried out in any transformed space that is obtained by a one-to-one mapping and then converted back to the original space.

Example:

For a sinusoidal signal model
s [ n ] = A cos ⁡ ( 2 π f 0 n + ϕ ) n = 0 , 1 , … , N − 1 s[n]=A \cos \left(2 \pi f_{0} n+\phi\right) \quad n=0,1, \ldots, N-1
it is desired to estimate the amplitude A , A, where A > 0 , A>0, and phase ϕ . \phi . The frequency f 0 f_{0} is assumed known. The LSE is obtained by minimizing
J = ∑ n = 0 N − 1 ( x [ n ] − A cos ⁡ ( 2 π f 0 n + ϕ ) ) 2 J=\sum_{n=0}^{N-1}\left(x[n]-A \cos \left(2 \pi f_{0} n+\phi\right)\right)^{2}
over A A and ϕ \phi , a nonlinear LS problem. However, because
A cos ⁡ ( 2 π f 0 n + ϕ ) = A cos ⁡ ϕ cos ⁡ 2 π f 0 n − A sin ⁡ ϕ sin ⁡ 2 π f 0 n A \cos \left(2 \pi f_{0} n+\phi\right)=A \cos \phi \cos 2 \pi f_{0} n-A \sin \phi \sin 2 \pi f_{0} n
if we let
α 1 = A cos ⁡ ϕ α 2 = − A sin ⁡ ϕ \begin{array}{l} {\alpha}_{1}=A \cos \phi\\ {\alpha}_{2}=-A\sin \phi \end{array}
then the signal model becomes
s [ n ] = α 1 cos ⁡ 2 π f 0 n + α 2 sin ⁡ 2 π f 0 n s[n]=\alpha_{1} \cos 2 \pi f_{0} n+\alpha_{2} \sin 2 \pi f_{0} n
In matrix form this is
s = H α \mathbf{s}=\mathbf{H} \boldsymbol\alpha
where
H = [ 1 0 cos ⁡ 2 π f 0 sin ⁡ 2 π f 0 ⋮ ⋮ cos ⁡ 2 π f 0 ( N − 1 ) sin ⁡ 2 π f 0 ( N − 1 ) ] \mathbf{H}=\left[\begin{array}{cc} 1 & 0 \\ \cos 2 \pi f_{0} & \sin 2 \pi f_{0} \\ \vdots & \vdots \\ \cos 2 \pi f_{0}(N-1) & \sin 2 \pi f_{0}(N-1) \end{array}\right]
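The transformation approach for this sinusoidal example can be sketched numerically. This is a minimal illustration with assumed values of N, f₀, A, and φ; the inverse map Â = √(α̂₁² + α̂₂²), φ̂ = arctan(−α̂₂/α̂₁) follows directly from the definitions α₁ = A cos φ, α₂ = −A sin φ above:

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed setup: known frequency, unknown amplitude A > 0 and phase phi.
N, f0 = 100, 0.08
n = np.arange(N)
A_true, phi_true = 2.0, 0.7
x = A_true * np.cos(2 * np.pi * f0 * n + phi_true) + 0.1 * rng.normal(size=N)

# Linear signal model in alpha = (A cos phi, -A sin phi): s = H alpha
H = np.column_stack([np.cos(2 * np.pi * f0 * n), np.sin(2 * np.pi * f0 * n)])
alpha_hat = np.linalg.solve(H.T @ H, H.T @ x)

# Invert the one-to-one map g to recover (A, phi)
A_hat = np.hypot(alpha_hat[0], alpha_hat[1])
phi_hat = np.arctan2(-alpha_hat[1], alpha_hat[0])
```

Using `np.arctan2(-alpha_hat[1], alpha_hat[0])` rather than a plain arctangent keeps the phase estimate in the correct quadrant.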