On Wald's Consistency
Introduction
Let us revisit Wald's 1949 paper in Ann. Math. Statist. and explore under what conditions a general M-estimator is consistent (converges in probability).
Preliminaries
An equivalent condition to strong consistency
A stochastic sequence $\{X_n\}_{n=1}^{\infty}$ converges a.s. to a random vector $X$ (defined on the same probability space) iff $pr(|X_n-X|\geq\varepsilon \ i.o.)=0$ for all $\varepsilon>0$.
Several ways to establish strong consistency
SLLN
If the $X_n$ are i.i.d. with $X_1\in L_1(pr)$, then $\bar{X}_n\overset{a.s.}{\rightarrow}E[X_1]$ by the strong law of large numbers (SLLN).
Borel-Cantelli Lemma
Let $\{A_n\}_{n=1}^{\infty}$ be a sequence of events. If $\sum_{n=1}^{\infty}pr(A_n)<\infty$, then $pr(A_n \ i.o.)=0$.
Let $A^{\varepsilon}_n=\{|X_n-X|\geq\varepsilon\}$. If we can verify $\sum_{n}pr(A^{\varepsilon}_n)<\infty$ for every $\varepsilon$, then $X_n\overset{a.s.}{\rightarrow}X$.
Lyapunov’s CLT
Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of independent r.v.s with $E[X_i]=\mu_i$ and $var(X_i)=\sigma^2_i$. Further, if
- all the $\mu_i$ and $\sigma_i^2$ exist and are finite, and
- $\lim_{n\rightarrow\infty}\frac{\sum_{i=1}^nE[|X_i-\mu_i|^{2+\delta}]}{(\sum_{i=1}^n\sigma_i^2)^{1+\delta/2}}=0$ for some $\delta>0$,

then $\frac{\sum_{i=1}^n(X_i-\mu_i)}{\sqrt{\sum_{i=1}^n\sigma_i^2}}\overset{d}{\rightarrow}N(0,1)$.
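As a quick numerical sanity check of Lyapunov's CLT, here is a simulation sketch (the distributional choices are my own illustration, not from the original notes): independent but non-identically distributed uniforms are standardized, and the result should look approximately $N(0,1)$.

```python
# A simulation sketch of Lyapunov's CLT with non-identically distributed terms.
# X_i ~ Uniform(0, i^(1/4)) are independent; the means, variances and the
# (2+delta)-moments grow slowly enough that the Lyapunov ratio vanishes.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 2000, 5000
scales = np.arange(1, n + 1) ** 0.25      # upper endpoints b_i
mus = scales / 2                          # E[Uniform(0, b)] = b / 2
sig2 = scales ** 2 / 12                   # var[Uniform(0, b)] = b^2 / 12

x = rng.uniform(0, scales, size=(reps, n))        # reps independent rows
z = (x - mus).sum(axis=1) / np.sqrt(sig2.sum())   # standardized sums
print(z.mean(), z.var())                  # should be close to 0 and 1
```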
Wald’s Consistency of the Maximum Likelihood Estimator
Notations
- $\{X_n\}_{n=1}^{\infty}$: a sequence of i.i.d. r.v.s.
- $F(x,\theta_0)$: the law of $X_1$, parameterized by a finite-dimensional $\theta=(\theta^1,\cdots,\theta^p)\in\Theta$.
- $f(x,\theta_0)$: the probability density function of $X_1$; $\theta_0$ is the true parameter and an interior point of $\Theta$.
- $f(x,\theta,\rho)=\sup_{|\theta'-\theta|\leq\rho}f(x,\theta')$.
- $\varphi(x,r)=\sup_{|\theta'|>r}f(x,\theta')$.
- $x^{+}=\max(0,x)$.
- $\ell_n(\theta)=\sum_{i=1}^n\log f(X_i,\theta)$: the log-likelihood of $\theta$.
- $\hat{\theta}_n=\arg\max_{\theta\in\Theta}\ell_n(\theta)$.
- $\mathcal{X}=\{x:f(x,\theta_0)>0\}$: the support of $X_1$.
Main Theorems
Under suitable regularity conditions, $\hat{\theta}_n$ converges to $\theta_0$ a.s.
A sketch of the proof:
First, note that $\log(x)$ is a concave function, and thus by Jensen's inequality we have
$$E[\log f(X,\theta)-\log f(X,\theta_0)]=E\left[\log\frac{f(X,\theta)}{f(X,\theta_0)}\right]\leq\log E\left[\frac{f(X,\theta)}{f(X,\theta_0)}\right]=0.$$
Since $-\log(x)$ is strictly convex, equality holds iff $f(X,\theta)=f(X,\theta_0)$ a.s. As a consequence, under the assumption that $F(x,\theta)\neq F(x,\theta_0)$ for at least one point, Jensen's inequality holds strictly, i.e., $E[\log f(X,\theta)]<E[\log f(X,\theta_0)]$ for $\theta\neq\theta_0$.
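As a quick Monte Carlo illustration of this strict inequality (a sketch of my own, assuming the normal location family with $\theta_0=0$): the map $\theta\mapsto E[\log f(X,\theta)]$ is maximized at the true parameter, since $E[\log f(X,\theta_0)]-E[\log f(X,\theta)]$ is the Kullback-Leibler divergence.

```python
# Monte Carlo check that E[log f(X, theta)] < E[log f(X, theta_0)] for
# theta != theta_0, in the assumed model X ~ N(theta_0, 1) with theta_0 = 0.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, size=200_000)     # draws from f(., theta_0)
for theta in [-1.0, -0.5, 0.0, 0.5, 1.0]:
    print(theta, norm.logpdf(x, loc=theta).mean())  # largest at theta = 0
```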
The second step is to split the parameter space into a finite union $\Theta=\cup_{k=0}^KB_k$ such that $\theta_0\in B_0$ and $\theta_0\notin B_j$ for $j=1,\cdots,K$. Letting $f_s(x,B)=\sup_{\theta\in B}f(x,\theta)$, we have $E[\log f_s(X,B_j)]<E[\log f(X,\theta_0)]$ by the above statement (made precise in Lemmas 2 and 3 below).
Thirdly, by the strong law of large numbers (SLLN), $\frac{\ell_n(\theta)}{n}$ converges almost surely to $E[\log f(X,\theta)]$ for each $\theta\in\Theta$. Let
$$A_{\epsilon}=\left\{w:\frac{\sum_{i=1}^n\log f_s(X_i,B_j)}{n}\rightarrow E[\log f_s(X_1,B_j)]\text{ for }j=1,\cdots,K\text{ and }\frac{\sum_{i=1}^n\log f(X_i,\theta_0)}{n}\rightarrow E[\log f(X_1,\theta_0)]\right\}.$$
Over $A_{\epsilon}$ we have $\hat{\theta}_{n}\in B_0$ when $n$ is large enough; otherwise $\sup_{\theta\in B_j}\ell_n(\theta)/n\geq\ell_n(\theta_0)/n$ infinitely often for some $j\geq 1$, contradicting $E[\log f_s(X,B_j)]<E[\log f(X,\theta_0)]$. Hence $\hat{\theta}_n\overset{a.s.}{\rightarrow}\theta_0$ if we can verify that $pr(A_\epsilon)=1$.
Regularity Conditions
A1: $\{X_n\}_{n=1}^{\infty}$ are either discrete or continuous.
A2: $\exists\rho>0$ and $r>0$ such that $E[\{\log f(X_1,\theta,\rho)\}^+]<\infty$ for $\theta\in\Theta$ and $E[\{\log\varphi(X_1,r)\}^+]<\infty$.
A3: $f(x,\theta)$ is a continuous function of $\theta$ for each given $x$.
A4: $F(x,\theta_1)\neq F(x,\theta_0)$ for at least one point $x$ when $\theta_1\neq\theta_0$.
A5: $\lim_{||\theta||\rightarrow\infty}f(x,\theta)=0$ for any given $x$.
A6: $E[|\log f(X_1,\theta_0)|]<\infty$.
A7: $\Theta$ is a closed subset of the $p$-dimensional Cartesian space.
A8: $f(x,\theta,\rho)$ is a measurable function of $x$ for any $\theta$ and $\rho$.
A8 is unnecessary in the discrete case, and A3, A5 and A8 together ensure that $f$ is a "good" function of $\theta$.
Discussion of the regularity conditions
- A4 and A6 validate the strict Jensen's inequality, under which $E[\log f(X,\theta)]<E[\log f(X,\theta_0)]$ for $\theta\neq\theta_0$.
- A3 ensures that $\lim_{\rho\rightarrow 0}f(x,\theta,\rho)=f(x,\theta)$. Since $f(x,\theta,\rho)$ is a monotone function of $\rho$ for given $x$ and $\theta$, and A2 says that $\log f(x,\theta,\rho)$ is dominated by an integrable function, we have $\lim_{\rho\rightarrow 0}E[\log f(X_1,\theta,\rho)]=E[\log f(X_1,\theta)]$ by the dominated convergence theorem. It follows that for any $\theta\neq\theta_0$, $E[\log f(X_1,\theta,\rho)]<E[\log f(X,\theta_0)]$ for some $\rho>0$.
- Also, under A2, A3 and A5, $\varphi(x,r)$ decreases to $0$ as $r\rightarrow\infty$ for every $x$, so the DCT implies $\lim_{r\rightarrow\infty}E[\log\varphi(X_1,r)]=-\infty<E[\log f(X,\theta_0)]$. Hence $\exists r_0>0$ such that $E[\log\varphi(X_1,r_0)]<E[\log f(X,\theta_0)]$.
- A7 ensures that $\hat{\theta}_n\in\Theta$.
- A8 guarantees that $f(X,\theta,\rho)$ and $\varphi(X,r)$ are still measurable for any $\theta,\rho$ and $r$.
- A1-A2 and A7-A8 hold in most cases; A3-A6 are the important ones and need to be verified.
- In fact, for A5 we only need that $f(x,\theta)$ exists and is finite as $\theta$ approaches the boundary of $\Theta$. Hence A5 is unnecessary if we assume that $\Theta$ is compact.
Three important implications under regularity assumptions
Under A1-A8, we have
Lemma 1. $E[\log f(X,\theta)]<E[\log f(X,\theta_0)]$ for $\theta\neq\theta_0$.
Lemma 2. For any $\theta\neq\theta_0$, $E[\log f(X_1,\theta,\rho)]<E[\log f(X,\theta_0)]$ for some $\rho>0$.
Lemma 3. $\exists r_0>0$ such that $E[\log\varphi(X_1,r_0)]<E[\log f(X,\theta_0)]$.
A Rigorous Proof
Step 1
$\forall\varepsilon>0$, let $B_0=U(\theta_0,\varepsilon)\cap\Theta=\{\theta:|\theta-\theta_0|<\varepsilon\}\cap\Theta$.
Step 2
By Lemma 3, we can find an $r_0>0$ such that $E[\log\varphi(X_1,r_0)]<E[\log f(X,\theta_0)]$. Hence let $B_K=\Theta\setminus\bar{U}(0,r_0)$, the part of $\Theta$ outside the closed ball $\bar{U}(0,r_0)$. It is easy to verify that $A=\Theta-B_0-B_K$ is a compact set (it is closed and bounded in the Cartesian space).
Step 3
By Lemma 2, $\forall\theta\in A$, $\exists\rho_\theta>0$ such that $E[\log f(X_1,\theta,\rho_\theta)]<E[\log f(X,\theta_0)]$. Moreover, $A\subset\cup_{\theta\in A}U(\theta,\rho_\theta)$. The Heine-Borel theorem implies that there are finitely many $\theta_1,\cdots,\theta_{K-1}$ such that $A\subset\cup_{j=1}^{K-1}U(\theta_j,\rho_{\theta_j})$, and we take $B_j=U(\theta_j,\rho_{\theta_j})\cap A$.
In summary, for every $B_0$ as defined above we can find $B_1,\cdots,B_K$ such that $\Theta=\cup_{k=0}^{K}B_k$ and $E[\log f_s(X,B_j)]<E[\log f(X,\theta_0)]$ for $B_1,\cdots,B_K$.
Step 4
Let
$$A_{\varepsilon}=\left\{w:\frac{\sum_{i=1}^n\log f_s(X_i(w),B_j)}{n}\rightarrow E[\log f_s(X_1,B_j)]\text{ for }j=1,\cdots,K\text{ and }\frac{\sum_{i=1}^n\log f(X_i(w),\theta_0)}{n}\rightarrow E[\log f(X_1,\theta_0)]\right\}.$$
Note that by definition
$$f_s(x,B_j)=\left\{\begin{array}{ll}f(x,\theta_j,\rho_{\theta_j})&j=1,\cdots,K-1,\\\varphi(x,r_0)&j=K,\end{array}\right.$$
which is measurable by A8. By the SLLN we have $pr(A_{\varepsilon})=1$ for any $\varepsilon>0$. Note that over $A_{\varepsilon}$ (i.e., for every $w_1\in A_{\varepsilon}$), we have
$$\sup_{\theta\in B_j}\frac{\ell_n(\theta)}{n}\leq\frac{\sum_{i=1}^n\log f_s(X_i(w_1),B_j)}{n}<\frac{\sum_{i=1}^n\log f(X_i(w_1),\theta_0)}{n}$$
when $n$ is large enough. As a result, $\forall w\in A_{\varepsilon}$, $\exists N_{\varepsilon}>0$ such that $\hat{\theta}_n(w)\in B_0$ for $n>N_{\varepsilon}$. It follows that $A_{\varepsilon}\subset\cup_{n=1}^{\infty}\cap_{m=n}^\infty\{|\hat{\theta}_m-\theta_0|<\varepsilon\}$ and $pr(|\hat{\theta}_n-\theta_0|\geq\varepsilon\ i.o.)=0$ for any $\varepsilon>0$. This completes the proof.
Examples
Example 1
Suppose $X_1,\cdots,X_n$ are i.i.d. with density $f(x,\theta_0)=c\cdot\exp\{-|x-\theta_0|^3\}$ and $\Theta=(-\infty,\infty)$. Show the strong consistency of the MLE for $\theta_0$.
To prove the strong consistency of $\hat{\theta}_n$, we just need to check A1-A8. A2 obviously holds since $\sup_{\theta}f(x,\theta)\leq c$ for any $x$, so $E[\{\log\sup_{\theta}f(X,\theta)\}^+]\leq(\log c)^+<\infty$. A3-A8 are also easy to verify.
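A small simulation sketch for this example (the sampler and optimizer choices are mine, not part of the argument): since $\log f(x,\theta)=\log c-|x-\theta|^3$, the MLE minimizes $\sum_i|X_i-\theta|^3$, and the estimates should drift toward $\theta_0$ as $n$ grows.

```python
# Numerical illustration of Example 1: the MLE for
# f(x, theta_0) = c * exp(-|x - theta_0|^3) approaches theta_0 = 2 as n grows.
# Samples come from a grid-based inverse-CDF approximation (an assumption
# of this sketch, not of the theory).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
theta0 = 2.0

grid = np.linspace(-4, 4, 20001)          # the density is negligible outside
pdf = np.exp(-np.abs(grid) ** 3)          # unnormalized f(x, 0); c cancels
cdf = np.cumsum(pdf)
cdf /= cdf[-1]

def sample(n):
    """Approximate draws from f(., theta0) via inverse CDF on the grid."""
    return theta0 + np.interp(rng.uniform(size=n), cdf, grid)

def mle(x):
    """The MLE minimizes sum |x_i - theta|^3 (the constant c drops out)."""
    return minimize_scalar(lambda t: np.sum(np.abs(x - t) ** 3),
                           bounds=(-10, 10), method="bounded").x

for n in [100, 1000, 10000]:
    print(n, mle(sample(n)))              # estimates should approach 2.0
```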
Example 2
Suppose $X_1,\cdots,X_n$ are i.i.d. normal with $\theta=(\mu,\sigma^2)$ and $\Theta=(-\infty,\infty)\times(0,\infty)$.
The key is to verify A5, since the other assumptions are easy to check. Approaching the boundary of $\Theta$, we have $\lim_{\sigma^2\rightarrow 0^+}f(x,\theta)=\infty$ when $x=\mu$. Hence A5 is not satisfied. There are two ways to proceed. One is to restrict $\Theta=(-\infty,\infty)\times[\delta,\infty)$ for some $\delta>0$. The other is to prove that $\hat{\sigma}^2\geq\delta$ when $n$ is large enough.
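A minimal sketch of the second route (with assumed parameter values): for normal data the MLE has the closed form $(\hat{\mu},\hat{\sigma}^2)=(\bar{X}_n,\frac{1}{n}\sum_i(X_i-\bar{X}_n)^2)$, and $\hat{\sigma}^2$ stays bounded away from $0$ for large $n$ even though A5 fails at the boundary $\sigma^2\rightarrow 0^+$.

```python
# Illustration of Example 2: the normal MLE stays away from the troublesome
# boundary sigma^2 -> 0+ as n grows. mu0 and sigma2_0 are assumed values.
import numpy as np

rng = np.random.default_rng(3)
mu0, sigma2_0 = 1.0, 4.0
for n in [50, 500, 5000, 50000]:
    x = rng.normal(mu0, np.sqrt(sigma2_0), size=n)
    mu_hat = x.mean()
    sigma2_hat = ((x - mu_hat) ** 2).mean()   # the MLE of sigma^2
    print(n, round(mu_hat, 4), round(sigma2_hat, 4))
```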
Extensions to M-estimation
Definition of M-estimation
Let $X_1,\cdots,X_n,\cdots$ be i.i.d. r.v.s and let $\theta$ be a parameter attached to the law of $X_1$. Here $m_\theta(x)$ is a known function of $x$, and $\hat{\theta}_n$ is defined as the maximizer of $M_n(\theta)=\mathbb{P}_nm_\theta=\frac{1}{n}\sum_{i=1}^nm_\theta(X_i)$ over $\theta\in\Theta$. Let $M(\theta)$ be a fixed function such that $M_n(\theta)\overset{p}{\rightarrow}M(\theta)$ for every $\theta\in\Theta$; for example, one choice of $M(\theta)$ is $\mathbb{P}m_\theta=E[m_\theta(X_1)]$ by the WLLN. Define $\theta_0=\underset{\theta\in\Theta}{\arg\max}\,M(\theta)$. We would now like to extend Wald's consistency to $\hat{\theta}_n$.
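A minimal sketch of an M-estimator outside the likelihood framework (the Cauchy model below is my own example): taking $m_\theta(x)=-|x-\theta|$ gives $M(\theta)=-E|X_1-\theta|$, whose maximizer $\theta_0$ is the population median, and the maximizer of $M_n$ is the sample median.

```python
# The sample median as an M-estimator: maximize M_n(theta) = -mean|X_i - theta|.
# Standard Cauchy data (population median 0) is an assumed example; note that
# the sample mean would not even be consistent here.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
x = rng.standard_cauchy(size=10_000)

def M_n(theta):
    return -np.mean(np.abs(x - theta))    # P_n m_theta

theta_hat = minimize_scalar(lambda t: -M_n(t), bounds=(-10, 10),
                            method="bounded").x
print(theta_hat, np.median(x))            # the two should nearly coincide
```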
Wald’s Consistency of M-estimator
We follow Van der Vaart (1998) and the contents of the previous sections; the regularity conditions are given by:
C1: $\{X_n\}_{n=1}^{\infty}$ are either discrete or continuous.
C2: $\Theta$ is a compact subset of the $p$-dimensional space.
C3: For every sufficiently small ball $U\subset\Theta$, $\sup_{\theta\in U}m_\theta(x)$ is a measurable function of $x$ and $E[\sup_{\theta\in U}m_\theta(X_1)]<\infty$.
C4: $m_\theta(x)$ is upper semicontinuous in $\theta$ for almost every $x$.
C5: $\hat{\theta}_n$ and $\theta_0$ are identifiable, i.e., they exist and are unique.
Note that C1 corresponds to A1, C2 corresponds to A5 and A7, and C3 corresponds to A2, A6 and A8. C4 corresponds to A3. C5 is the identifiability condition, which corresponds to A4.
Under C1-C5, we have $\hat{\theta}_n\overset{a.s.}{\rightarrow}\theta_0$. If $\hat{\theta}_n$ is only defined such that $M_n(\hat{\theta}_n)\geq M_n(\theta_0)-o_p(1)$, then the convergence degenerates to convergence in probability. In addition, if we permit multiple maximizers of $M(\theta)$ and define $\Theta_0=\{\theta_0\in\Theta:M(\theta_0)=\sup_{\theta\in\Theta}M(\theta)\}$, then $pr(d(\hat{\theta}_n,\Theta_0)>\varepsilon)\rightarrow 0$ for any $\varepsilon>0$. The proof is similar to the one above, via finite coverings of the compact parameter space.
An alternative route to the consistency of M-estimation
Another simple, useful and commonly used route to the consistency of M-estimators is via the uniform law of large numbers, described in detail in Van der Vaart (1998).
Two crucial conditions are required:
B1: For any $\varepsilon>0$, $\sup_{|\theta-\theta_0|\geq\varepsilon}M(\theta)<M(\theta_0)$.
B2: $\sup_{\theta\in\Theta}|M_n(\theta)-M(\theta)|\overset{p}{\rightarrow}0$.
B1 is the identifiability condition: it states that $\theta_0$ is the unique maximizer of $M(\theta)$ and ensures that $\theta_0$ is well defined. B2 requires the uniform convergence of $M_n(\theta)$, which is very strong and is a substitute for C2-C4. The measurability in C3 is not necessary here because we can use the outer probability instead.
The uniform convergence condition is equivalent to the class $\{m_\theta,\theta\in\Theta\}$ (or $\{m'_{\theta,j},\theta\in\Theta,j=1,\cdots,K\}$) being Glivenko-Cantelli, which requires the complexity of the function class to be bounded. One simple sufficient condition is that $\Theta$ is compact and $m_\theta(x)$ (or $m'_{\theta,j}(x)$) is continuous in $\theta$ for every $x$ and dominated by an integrable function.
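A numerical sketch of B2 (an assumed toy model, not from the text): for $m_\theta(x)=-(x-\theta)^2$ with $X\sim N(0,1)$ and compact $\Theta=[-2,2]$, we have $M(\theta)=-(1+\theta^2)$ in closed form, and $\sup_\theta|M_n(\theta)-M(\theta)|$ shrinks as $n$ grows; the supremum is approximated on a grid.

```python
# Uniform convergence sup_theta |M_n(theta) - M(theta)| -> 0 over a compact
# parameter set, for the assumed model m_theta(x) = -(x - theta)^2, X ~ N(0,1).
import numpy as np

rng = np.random.default_rng(5)
thetas = np.linspace(-2, 2, 401)          # grid over compact Theta
M = -(1 + thetas ** 2)                    # M(theta) = -E(X - theta)^2

for n in [100, 1000, 10_000, 100_000]:
    x = rng.normal(size=n)
    M_n = np.array([-np.mean((x - t) ** 2) for t in thetas])
    print(n, np.max(np.abs(M_n - M)))     # should decrease toward 0
```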
Asymptotic distributions of the MLE and M-estimators
In this section, we study the weak convergence of the MLE and M-estimators given consistency. Here we only consider the case where the $m_\theta(x)$ are smooth functions of $\theta$ for given $x$.
Let $\Psi_n(\theta)$ be the derivative of $M_n(\theta)$ and $\psi_\theta$ be the derivative of $m_\theta$, which is referred to as the score function; equivalently, $\hat{\theta}_n$ is the unique solution of $0=\Psi_n(\theta)$. The Taylor series expansion gives
$$0=\frac{1}{n}\sum_{i=1}^{n}\psi_{\theta_0}(X_i)+\frac{1}{n}\sum_{i=1}^n\frac{\partial\psi_{\theta_0}(X_i)}{\partial\theta^\tau}(\hat{\theta}_n-\theta_0)+e_n.$$
The third term $e_n$ is a vector whose $j$-th component is the second-order remainder
$$e_{n,j}=\frac{1}{2n}\sum_{i=1}^n(\hat{\theta}_n-\theta_0)^\tau\frac{\partial^2\psi_{\tilde{\theta},j}(X_i)}{\partial\theta\partial\theta^\tau}(\hat{\theta}_n-\theta_0)$$
for some $\tilde{\theta}$ between $\hat{\theta}_n$ and $\theta_0$. Under the condition that $\frac{\partial^2\psi_{\tilde{\theta},j}(x)}{\partial\theta_i\partial\theta_k}$ exists for every $x$ and all $i,k$ and $j$, and is controlled by a measurable and integrable function $\ddot{\psi}(x)$, the term $e_{n,j}=o_p(1)$ is negligible.
Together with the assumption that $\frac{1}{n}\sum_{i=1}^n\frac{\partial\psi_{\theta_0}(X_i)}{\partial\theta^\tau}$ is nonsingular, we have
$$\sqrt{n}(\hat{\theta}_n-\theta_0)=-\frac{1}{\sqrt{n}}\sum_{i=1}^n\left(\mathbb{P}_n\dot{\psi}_{\theta_0}\right)^{-1}\psi_{\theta_0}(X_i)+o_p(1).$$
Since $\mathbb{P}\psi_{\theta_0}=0$, the SLLN together with the central limit theorem implies
$$\sqrt{n}(\hat{\theta}_n-\theta_0)\overset{d}{\rightarrow}N\left(0,\left(\mathbb{P}\dot{\psi}_{\theta_0}\right)^{-1}\operatorname{cov}(\psi_{\theta_0})\left(\mathbb{P}\dot{\psi}_{\theta_0}^\tau\right)^{-1}\right),$$
the familiar sandwich formula. The existence of $\frac{\partial^2\psi_{\tilde{\theta},j}(x)}{\partial\theta_i\partial\theta_k}$ can be weakened to a Lipschitz condition.
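A plug-in sketch of the sandwich formula (a toy scalar model of my own choosing): with $m_\theta(x)=-(x-\theta)^2$ we get $\psi_\theta(x)=2(x-\theta)$ and $\dot{\psi}\equiv-2$, so $\hat{\theta}_n=\bar{X}_n$ and the sandwich variance reduces to $var(X_1)$; the empirical variance of $\sqrt{n}(\hat{\theta}_n-\theta_0)$ should match the plug-in estimate.

```python
# Sandwich variance (P_n psi_dot)^{-1} cov_n(psi) (P_n psi_dot)^{-1} for the
# assumed model m_theta(x) = -(x - theta)^2 with skewed mean-zero noise.
import numpy as np

rng = np.random.default_rng(6)
n, reps, theta0 = 500, 2000, 0.0
x = theta0 + rng.exponential(size=(reps, n)) - 1.0    # noise: mean 0, var 1

theta_hats = x.mean(axis=1)               # the M-estimator here is the mean
print(np.var(np.sqrt(n) * (theta_hats - theta0)))     # empirical asymptotic var

psi = 2 * (x[0] - theta_hats[0])          # psi evaluated on one sample
psi_dot = -2.0                            # constant derivative in this model
print(np.var(psi) / psi_dot ** 2)         # plug-in sandwich estimate (~1)
```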
In terms of the MLE, it is the special case of M-estimation with $m_\theta(x)=\log f(x,\theta)=\ell_\theta(x)$ and $\psi_\theta(x)=\frac{\dot{f}(x,\theta)}{f(x,\theta)}=\dot{\ell}_\theta(x)$, where
$$\dot{f}(x,\theta)=\left(\frac{\partial f(x,\theta)}{\partial\theta_1},\cdots,\frac{\partial f(x,\theta)}{\partial\theta_p}\right)^\tau.$$
By the definition of the derivative,
$$\frac{\partial f(x,\theta)}{\partial\theta_i}=\lim_{\Delta\theta_i\rightarrow 0}\frac{f(x,\theta+\Delta\theta)-f(x,\theta)}{\Delta\theta_i}=\lim_{\Delta\theta_i\rightarrow 0}f'(x,\tilde{\theta})$$
by the mean value theorem. Expectation and differentiation are exchangeable when $\frac{\partial f(x,\theta)}{\partial\theta}$ is controlled by an integrable function for $\theta$ in a small neighborhood of $\theta_0$, by the DCT. Hence,
$$E[\dot{\ell}_{\theta_0}(X_1)]=E\left[\frac{\dot{f}(X_1,\theta_0)}{f(X_1,\theta_0)}\right]=\int\dot{f}(x,\theta_0)\,dx=\frac{\partial}{\partial\theta}\int f(x,\theta)\,dx\Big|_{\theta=\theta_0}=0.$$
Also, under suitable regularity conditions (a second exchange of expectation and differentiation, which gives $E[\ddot{f}(X_1,\theta_0)/f(X_1,\theta_0)]=\int\ddot{f}(x,\theta_0)\,dx=0$), we have
$$\mathbb{P}\ddot{\ell}_{\theta_0}=E\left\{\frac{\ddot{f}(X_1,\theta_0)}{f(X_1,\theta_0)}-\frac{\dot{f}(X_1,\theta_0)\dot{f}^\tau(X_1,\theta_0)}{f^2(X_1,\theta_0)}\right\}=-\operatorname{cov}(\dot{\ell}_{\theta_0}(X_1))=-\mathbb{P}\dot{\ell}_{\theta_0}\dot{\ell}_{\theta_0}^\tau=-I_{\theta_0},$$
which is exactly minus the Fisher information matrix of $X_1$ for $\theta$. In summary, the MLE of $\theta$ has the asymptotic distribution
$$\sqrt{n}(\hat{\theta}_n-\theta_0)\overset{d}{\rightarrow}N(0,I^{-1}_{\theta_0})$$
under the following sufficient regularity conditions:
R0: $\hat{\theta}_n$ is consistent for $\theta_0$.
R1: $\theta_0$ is an interior point of $\Theta$ and the support of $X_1$ does not depend on $\theta$.
R2: $f(x,\theta)$ is continuously differentiable up to order 3 with respect to $\theta$ for almost all $x$.
R3: The following functions are all controlled by a measurable and integrable function when $\theta$ is in a small neighborhood of $\theta_0$: $\frac{\partial f(x,\theta)}{\partial\theta_i}$, $\frac{\partial^2 f(x,\theta)}{\partial\theta_i\partial\theta_j}$, and $\frac{\partial^3\ell_{\theta}(x)}{\partial\theta_i\partial\theta_j\partial\theta_k}$ for $i,j,k=1,\cdots,p$.
R4: $I_{\theta_0}$ exists and is positive definite.
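To close, a simulation sketch of this asymptotic normality (an exponential model with assumed rate $\theta_0=2$): here $f(x,\theta)=\theta e^{-\theta x}$, the MLE is $\hat{\theta}_n=1/\bar{X}_n$, the Fisher information is $I_\theta=1/\theta^2$, and R0-R4 are easy to check, so $\sqrt{n}(\hat{\theta}_n-\theta_0)$ should be approximately $N(0,\theta_0^2)$.

```python
# Checking sqrt(n)(theta_hat - theta_0) ~ N(0, I^{-1}) for the assumed
# exponential model f(x, theta) = theta * exp(-theta x), where I = 1/theta^2.
import numpy as np

rng = np.random.default_rng(7)
n, reps, theta0 = 1000, 5000, 2.0
x = rng.exponential(scale=1 / theta0, size=(reps, n))
theta_hat = 1 / x.mean(axis=1)            # MLE in each replication
z = np.sqrt(n) * (theta_hat - theta0)
print(z.mean(), z.var(), theta0 ** 2)     # variance close to theta0^2 = 4
```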