Consistency of Maximum Likelihood Estimators and M-Estimators

Introduction

We revisit Wald's 1949 paper in Ann. Math. Statist. and discuss under what conditions a general M-estimator is consistent (converges in probability).

Preliminaries

An equivalent condition to strong consistency

A stochastic sequence $\{X_n\}_{n=1}^{\infty}$ converges a.s. to a random vector $X$ (defined on the same probability space) iff $pr(|X_n-X|\geq\varepsilon\ \text{i.o.})=0$ for all $\varepsilon>0$.

Several ways to strong consistency

SLLN

If $X_n$ are i.i.d. with $X_1\in L_1(pr)$, then $\bar{X}_n\overset{a.s.}{\rightarrow}E[X_1]$ by the strong law of large numbers (SLLN).

Borel-Cantelli Lemma

Let $\{A_n\}_{n=1}^{\infty}$ be a sequence of events. If $\sum_{n=1}^{\infty}pr(A_n)<\infty$, then $pr(A_n\ \text{i.o.})=0$.

Let $A_n^{\varepsilon}=\{|X_n-X|\geq\varepsilon\}$. If we can verify $\sum_n pr(A_n^{\varepsilon})<\infty$ for any $\varepsilon>0$, then $X_n\overset{a.s.}{\rightarrow}X$.
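As a quick illustration of this route (a standard textbook example, not from Wald's paper), take $X_n=\bar{Y}_n$ for i.i.d. $Y_i\in[0,1]$ with mean $\mu$. Hoeffding's inequality gives a summable tail:
$$\sum_{n=1}^{\infty}pr(|\bar{Y}_n-\mu|\geq\varepsilon)\leq\sum_{n=1}^{\infty}2e^{-2n\varepsilon^2}<\infty,$$
so the Borel-Cantelli lemma yields $\bar{Y}_n\overset{a.s.}{\rightarrow}\mu$ without invoking the SLLN.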

Lyapunov’s CLT

Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of independent r.v.s with $E[X_i]=\mu_i$ and $var(X_i)=\sigma_i^2$. Further, if

  1. $\mu_i$ and $\sigma_i^2$ all exist and are finite,
  2. $\lim_{n\rightarrow\infty}\frac{\sum_{i=1}^n E[|X_i-\mu_i|^{2+\delta}]}{\left(\sum_{i=1}^n\sigma_i^2\right)^{1+\delta/2}}=0$ for some $\delta>0$,

then $\frac{\sum_{i=1}^n(X_i-\mu_i)}{\sqrt{\sum_{i=1}^n\sigma_i^2}}\overset{d}{\rightarrow}N(0,1)$.
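The Lyapunov condition is easy to check numerically. Below is a minimal Monte Carlo sketch (my own illustration, not from the original text) with independent but non-identically distributed uniforms; with $s_i=i^{0.1}$ the condition holds for $\delta=1$, so the standardized sum should look approximately standard normal.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 2000, 5000
s = np.arange(1, n + 1) ** 0.1            # X_i ~ Uniform(-s_i, s_i)
sigma2 = s ** 2 / 3.0                     # var of Uniform(-s, s) is s^2 / 3
X = rng.uniform(-s, s, size=(reps, n))    # each row: one realization of X_1..X_n
Z = X.sum(axis=1) / np.sqrt(sigma2.sum()) # standardized sum (mu_i = 0 here)
print(Z.mean(), Z.var())                  # should be close to 0 and 1
```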

Wald’s Consistency of the Maximum Likelihood Estimator

Notations

  • $\{X_n\}_{n=1}^{\infty}$: a sequence of i.i.d. r.v.s.
  • $F(x,\theta_0)$: the law of $X_1$, parameterized by a finite-dimensional $\theta=(\theta^1,\cdots,\theta^p)\in\Theta$.
  • $f(x,\theta_0)$: the probability density function of $X_1$; $\theta_0$ is the true parameter and an interior point of $\Theta$.
  • $f(x,\theta,\rho)=\sup_{|\theta'-\theta|\leq\rho}f(x,\theta')$.
  • $\varphi(x,r)=\sup_{|\theta'|>r}f(x,\theta')$.
  • $x^+=\max(0,x)$.
  • $\ell_n(\theta)=\sum_{i=1}^n\log f(X_i,\theta)$: the log-likelihood of $\theta$.
  • $\hat{\theta}_n=\arg\max_{\theta\in\Theta}\ell_n(\theta)$.
  • $\mathcal{X}=\{x:f(x,\theta_0)>0\}$: the support of $X_1$.

Main Theorems

Under suitable regularity conditions, $\hat{\theta}_n$ converges to $\theta_0$ a.s.

A sketched proof:
First, note that $\log(x)$ is a concave function, and thus by Jensen's inequality we have
$$E[\log f(X,\theta)-\log f(X,\theta_0)]=E\left[\log\frac{f(X,\theta)}{f(X,\theta_0)}\right]\leq\log E\left[\frac{f(X,\theta)}{f(X,\theta_0)}\right]\leq\log 1=0.$$
Since $-\log(x)$ is strictly convex, equality holds iff $f(X,\theta)=f(X,\theta_0)$ a.s. As a consequence, under the assumption that $F(x,\theta)\neq F(x,\theta_0)$ for at least one point, Jensen's inequality holds strictly, i.e., $E[\log f(X,\theta)]<E[\log f(X,\theta_0)]$ for $\theta\neq\theta_0$.
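This strict inequality is just the positivity of the Kullback-Leibler divergence, and it is easy to check numerically. A minimal sketch, using a $N(\theta,1)$ family with $\theta_0=0$ (my own choice for illustration), where the gap equals $-\theta^2/2$ exactly:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=200_000)  # draws from f(., theta_0) with theta_0 = 0
for theta in [0.0, 0.5, 1.0]:
    # Monte Carlo estimate of E[log f(X, theta)] - E[log f(X, theta_0)] = -theta^2 / 2
    gap = np.mean(norm.logpdf(x, loc=theta) - norm.logpdf(x, loc=0.0))
    print(theta, gap)  # zero at theta_0, strictly negative otherwise
```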

The second step is to divide the parameter space into a finite union $\Theta=\cup_{k=0}^K B_k$ such that $\theta_0\in B_0$ and $\theta_0\notin B_j$ for $j=1,\cdots,K$. Letting $f_s(x,B)=\sup_{\theta\in B}f(x,\theta)$, we have $E[\log f_s(X,B_j)]<E[\log f(X,\theta_0)]$ for $j=1,\cdots,K$ by the above statement, provided each $B_j$ is a small enough ball (or is far enough out); this is exactly what Lemmas 2 and 3 below deliver.

Thirdly, by the strong law of large numbers (SLLN), $\frac{\ell_n(\theta)}{n}$ converges almost surely to $E[\log f(X,\theta)]$ for each $\theta\in\Theta$. Let
$$A_{\epsilon}=\left\{\omega:\frac{\sum_{i=1}^n\log f_s(X_i,B_j)}{n}\rightarrow E[\log f_s(X_1,B_j)]\text{ for }j=1,\cdots,K\text{ and }\frac{\sum_{i=1}^n\log f(X_i,\theta_0)}{n}\rightarrow E[\log f(X_1,\theta_0)]\right\}.$$

On $A_{\epsilon}$ we have $\hat{\theta}_n\in B_0$ when $n$ is large enough; otherwise we would obtain the contradiction $E[\log f_s(X,B_j)]\geq E[\log f(X,\theta_0)]$ for some $j$. Hence $\hat{\theta}_n\overset{a.s.}{\rightarrow}\theta_0$ once we verify that $pr(A_\epsilon)=1$.

Regularity Conditions

A1: $\{X_n\}_{n=1}^{\infty}$ are either discrete or continuous.

A2: $\exists\rho>0$ and $r>0$ such that $E[\{\log f(X_1,\theta,\rho)\}^+]<\infty$ for all $\theta\in\Theta$ and $E[\{\log\varphi(X_1,r)\}^+]<\infty$.

A3: $f(x,\theta)$ is a continuous function of $\theta$ for each given $x$.

A4: $F(x,\theta_1)\neq F(x,\theta_0)$ for at least one point $x$ whenever $\theta_1\neq\theta_0$.

A5: $\lim_{\|\theta\|\rightarrow\infty}f(x,\theta)=0$ for each given $x$.

A6: $E[|\log f(X_1,\theta_0)|]<\infty$.

A7: $\Theta$ is a closed subset of the $p$-dimensional Cartesian space.

A8: $f(x,\theta,\rho)$ is a measurable function of $x$ for any $\theta$ and $\rho$.

A8 is unnecessary in the discrete case, and A3, A5, and A8 together ensure that $f$ is a "good" function of $\theta$.

Discussion of the regularity conditions

  1. A4 and A6 validate the strict Jensen's inequality, under which $E[\log f(X,\theta)]<E[\log f(X,\theta_0)]$ for $\theta\neq\theta_0$.
  2. A3 ensures that $\lim_{\rho\rightarrow 0}f(x,\theta,\rho)=f(x,\theta)$. Since $f(x,\theta,\rho)$ is a monotone function of $\rho$ for given $x$ and $\theta$, and under A2 is dominated by an integrable function, we have $\lim_{\rho\rightarrow 0}E[\log f(X_1,\theta,\rho)]=E[\log f(X_1,\theta)]$ by the dominated convergence theorem. It follows that for any $\theta\neq\theta_0$, $E[\log f(X_1,\theta,\rho)]<E[\log f(X_1,\theta_0)]$ for some $\rho>0$.
  3. Also, A5 gives $\varphi(x,r)\rightarrow 0$ pointwise as $r\rightarrow\infty$; combined with the domination in A2, this yields $\lim_{r\rightarrow\infty}E[\log\varphi(X_1,r)]=-\infty<E[\log f(X_1,\theta_0)]$. Hence there exists $r_0>0$ such that $E[\log\varphi(X_1,r_0)]<E[\log f(X_1,\theta_0)]$.
  4. A7 ensures that $\hat{\theta}_n\in\Theta$.
  5. A8 guarantees that $f(X,\theta,\rho)$ and $\varphi(X,r)$ remain measurable for any $\theta$, $\rho$, and $r$.
  6. A1-A2 and A7-A8 hold in most common cases; A3-A6 are the substantive conditions and need to be verified.
  7. In fact, for A5 we only need that $f(x,\theta)$ exists and is finite as $\theta$ approaches the boundary of $\Theta$. Hence A5 is unnecessary if we assume that $\Theta$ is compact.

Three important implications under regularity assumptions

Under A1-A8, we have

Lemma 1. $E[\log f(X,\theta)]<E[\log f(X,\theta_0)]$ for $\theta\neq\theta_0$.

Lemma 2. For any $\theta\neq\theta_0$, $E[\log f(X_1,\theta,\rho)]<E[\log f(X_1,\theta_0)]$ for some $\rho>0$.

Lemma 3. $\exists r_0>0$ such that $E[\log\varphi(X_1,r_0)]<E[\log f(X_1,\theta_0)]$.

A Rigorous Proof

Step 1
$\forall\varepsilon>0$, let $B_0=U(\theta_0,\varepsilon)\cap\Theta=\{\theta:|\theta-\theta_0|<\varepsilon\}\cap\Theta$.

Step 2
By Lemma 3, we can find an $r_0>0$ such that $E[\log\varphi(X_1,r_0)]<E[\log f(X_1,\theta_0)]$. Let $B_K=\{\theta\in\Theta:|\theta|>r_0\}=\Theta-\bar{U}(0,r_0)$. It is easy to verify that $A=\Theta-B_0-B_K$ is a compact set (it is closed and bounded in the Cartesian space).

Step 3
By Lemma 2, $\forall\theta\in A$, $\exists\rho_\theta>0$ such that $E[\log f(X_1,\theta,\rho_\theta)]<E[\log f(X_1,\theta_0)]$. Moreover, $A\subset\cup_{\theta\in A}U(\theta,\rho_\theta)$. The Heine-Borel theorem implies that there are finitely many $\theta_1,\cdots,\theta_{K-1}$ such that $A\subset\cup_{j=1}^{K-1}U(\theta_j,\rho_{\theta_j})$.

In summary, for every $B_0$ as defined above we can find $B_1,\cdots,B_K$ (with $B_j=U(\theta_j,\rho_{\theta_j})\cap\Theta$ for $j=1,\cdots,K-1$) such that $\Theta=\cup_{k=0}^K B_k$ and $E[\log f_s(X,B_j)]<E[\log f(X,\theta_0)]$ for $j=1,\cdots,K$.

Step 4
Let
$$A_{\varepsilon}=\left\{\omega:\frac{\sum_{i=1}^n\log f_s(X_i(\omega),B_j)}{n}\rightarrow E[\log f_s(X_1,B_j)]\text{ for }j=1,\cdots,K\text{ and }\frac{\sum_{i=1}^n\log f(X_i(\omega),\theta_0)}{n}\rightarrow E[\log f(X_1,\theta_0)]\right\}.$$

Note that by definition
$$f_s(x,B_j)=\left\{\begin{array}{ll}f(x,\theta_j,\rho_{\theta_j})&j=1,\cdots,K-1,\\ \varphi(x,r_0)&j=K,\end{array}\right.$$
which is measurable. By the SLLN we have $pr(A_{\varepsilon})=1$ for any $\varepsilon>0$. Note that on $A_{\varepsilon}$ (i.e., for each $\omega\in A_{\varepsilon}$), we have
$$\sup_{\theta\in B_j}\frac{\ell_n(\theta)}{n}\leq\frac{\sum_{i=1}^n\log f_s(X_i(\omega),B_j)}{n}<\frac{\sum_{i=1}^n\log f(X_i(\omega),\theta_0)}{n}\leq\sup_{\theta\in B_0}\frac{\ell_n(\theta)}{n}$$
when $n$ is large enough. As a result, $\forall\omega\in A_{\varepsilon}$, $\exists N_{\varepsilon}(\omega)>0$ such that $\hat{\theta}_n(\omega)\in B_0$ for $n>N_{\varepsilon}$. It follows that $A_{\varepsilon}\subset\cup_{n=1}^{\infty}\cap_{m=n}^{\infty}\{|\hat{\theta}_m-\theta_0|<\varepsilon\}$, so $pr(|\hat{\theta}_n-\theta_0|\geq\varepsilon\ \text{i.o.})=0$ for any $\varepsilon>0$. This completes the proof.

Examples

Example 1

Suppose $X_1,\cdots,X_n$ are i.i.d. with density $f(x,\theta_0)=c\cdot\exp\{-|x-\theta_0|^3\}$ and $\Theta=(-\infty,\infty)$. Show the strong consistency of the MLE of $\theta_0$.

To prove the strong consistency of $\hat{\theta}_n$, we just need to check A1-A8. A2 obviously holds since $\sup_{\theta}f(x,\theta)\leq c$ for any $x$, so $E[\{\log\sup_{\theta}f(X,\theta)\}^+]\leq(\log c)^+<\infty$. A3-A8 are also easy to verify.
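A numerical sketch of this example (illustrative, with sampling details that are my own): if $V\sim\mathrm{Gamma}(1/3,1)$ and $S$ is an independent random sign, then $\theta_0+SV^{1/3}$ has exactly the density $c\cdot\exp\{-|x-\theta_0|^3\}$, and since $c$ does not involve $\theta$, the MLE minimizes $\sum_i|X_i-\theta|^3$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
theta0 = 1.5
for n in [100, 1_000, 10_000]:
    v = rng.gamma(1.0 / 3.0, 1.0, size=n)                    # V ~ Gamma(1/3, 1)
    x = theta0 + rng.choice([-1.0, 1.0], size=n) * v ** (1.0 / 3.0)
    # c does not depend on theta, so the MLE minimizes sum |x_i - theta|^3
    res = minimize_scalar(lambda t: np.sum(np.abs(x - t) ** 3),
                          bounds=(x.min(), x.max()), method="bounded")
    print(n, res.x)                                          # approaches theta0 = 1.5
```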

Example 2

Suppose $X_1,\cdots,X_n$ are i.i.d. normal with $\theta=(\mu,\sigma^2)$ and $\Theta=(-\infty,\infty)\times(0,\infty)$.

The key is to verify A5, since the other assumptions are easy to check. Approaching the boundary of $\Theta$, we have $\lim_{\sigma^2\rightarrow 0^+}f(x,\theta)=\infty$ when $x=\mu$. Hence A5 (in the boundary sense of point 7 above) is not satisfied. There are two ways to proceed. One is to restrict $\Theta=(-\infty,\infty)\times[\delta,\infty)$ for some $\delta>0$. Another is to prove that $\hat{\sigma}^2\geq\delta$ when $n$ is large enough.
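As a minimal numerical sketch of the second workaround (my own illustration): the normal MLE has the closed form $\hat{\mu}=\bar{X}$ and $\hat{\sigma}^2=\frac{1}{n}\sum_i(X_i-\bar{X})^2$, and $\hat{\sigma}^2$ indeed stays bounded away from $0$ for large $n$.

```python
import numpy as np

rng = np.random.default_rng(3)
mu0, sigma2_0 = 2.0, 4.0
for n in [50, 500, 5_000]:
    x = rng.normal(mu0, np.sqrt(sigma2_0), size=n)
    mu_hat = x.mean()
    sigma2_hat = np.mean((x - mu_hat) ** 2)  # the MLE (not the unbiased variant)
    print(n, mu_hat, sigma2_hat)             # converges to (2.0, 4.0)
```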

Extensions to M-estimation

Definition of M-estimation

Let $X_1,\cdots,X_n,\cdots$ be i.i.d. r.v.s and let $\theta$ be a parameter attached to the law of $X_1$. Here $m_\theta(x)$ is a known function of $x$, and $\hat{\theta}_n$ is defined as the maximizer of $M_n(\theta)=\mathbb{P}_n m_\theta=\frac{1}{n}\sum_{i=1}^n m_\theta(X_i)$ over $\theta\in\Theta$. Let $M(\theta)$ be a fixed function such that $M_n(\theta)\overset{p}{\rightarrow}M(\theta)$ for every $\theta\in\Theta$; for example, one choice of $M(\theta)$ is $\mathbb{P}m_\theta=E[m_\theta(X_1)]$ by the WLLN. Define $\theta_0=\underset{\theta\in\Theta}{\arg\max}\,M(\theta)$. Now we would like to extend Wald's consistency to $\hat{\theta}_n$.
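A minimal code sketch of a generic M-estimator (the Huber location estimator, my own illustrative choice; the tuning constant $k=1.345$ is a common convention, not dictated by the theory above). Here $m_\theta(x)=-\mathrm{huber}(x-\theta)$, so maximizing $M_n$ is minimizing the empirical loss:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def huber(u, k=1.345):
    """Huber loss: quadratic near zero, linear in the tails."""
    return np.where(np.abs(u) <= k, 0.5 * u ** 2, k * np.abs(u) - 0.5 * k ** 2)

def m_estimate(x):
    # maximizing M_n(theta) = -(1/n) sum huber(x_i - theta)
    # is the same as minimizing the mean Huber loss
    res = minimize_scalar(lambda t: np.mean(huber(x - t)),
                          bounds=(x.min(), x.max()), method="bounded")
    return res.x

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(0.0, 1.0, 950), rng.normal(20.0, 1.0, 50)])
print(x.mean(), m_estimate(x))  # the M-estimate resists the 5% outliers
```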

Wald's Consistency of the M-estimator

We follow Van der Vaart (1998) and the contents of the previous sections; the regularity conditions are given by:

C1: $\{X_n\}_{n=1}^{\infty}$ are either discrete or continuous.

C2: $\Theta$ is a compact subset of the $p$-dimensional space.

C3: For every sufficiently small ball $U\subset\Theta$, $\sup_{\theta\in U}m_\theta(x)$ is a measurable function of $x$ and $E[\sup_{\theta\in U}m_\theta(X_1)]<\infty$.

C4: $m_\theta(x)$ is upper semicontinuous in $\theta$ for almost every $x$.

C5: $\hat{\theta}_n$ and $\theta_0$ are identifiable, i.e., they exist and are unique.

Note that C1 corresponds to A1, C2 corresponds to A5 and A7, and C3 corresponds to A2, A6, and A8. C4 corresponds to A3. C5 is the identifiability condition, which corresponds to A4.

Under C1-C5, we have $\hat{\theta}_n\overset{a.s.}{\rightarrow}\theta_0$. If $\hat{\theta}_n$ is instead defined as a near-maximizer, i.e., $M_n(\hat{\theta}_n)\geq M_n(\theta_0)-o_p(1)$, the convergence degenerates to convergence in probability. In addition, if we permit multiple maximizers of $M(\theta)$ and define $\Theta_0=\{\theta_0\in\Theta:M(\theta_0)=\sup_{\theta\in\Theta}M(\theta)\}$, then $pr(d(\hat{\theta}_n,\Theta_0)>\varepsilon)\rightarrow 0$ for any $\varepsilon>0$. The proof is similar to the one above, via finite coverings of the compact parameter space.

An alternative route to the consistency of M-estimation

Another simple and commonly used route to the consistency of M-estimation is via the uniform law of large numbers, described in detail in Van der Vaart (1998).

Two crucial conditions are required:

B1: For any $\varepsilon>0$, $\sup_{|\theta-\theta_0|\geq\varepsilon}M(\theta)<M(\theta_0)$.

B2: $\sup_{\theta\in\Theta}|M_n(\theta)-M(\theta)|\overset{p}{\rightarrow}0$.

B1 is the identifiability (well-separation) condition: it states that $\theta_0$ is the unique maximizer of $M(\theta)$ and ensures that $\theta_0$ is well defined. B2 requires the uniform convergence of $M_n(\theta)$, which is very strong and substitutes for C2-C4. The measurability in C3 is not necessary here because we can use the outer probability instead. The argument is short, as the derivation below shows.
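For completeness, here is the standard three-line argument (as in Van der Vaart (1998)) that B1 and B2 imply $\hat{\theta}_n\overset{p}{\rightarrow}\theta_0$:
$$M(\theta_0)-M(\hat{\theta}_n)=\{M_n(\hat{\theta}_n)-M(\hat{\theta}_n)\}+\{M(\theta_0)-M_n(\theta_0)\}+\{M_n(\theta_0)-M_n(\hat{\theta}_n)\}\leq 2\sup_{\theta\in\Theta}|M_n(\theta)-M(\theta)|\overset{p}{\rightarrow}0,$$
since the last bracket is $\leq 0$ by the definition of $\hat{\theta}_n$. By B1, the event $\{|\hat{\theta}_n-\theta_0|\geq\varepsilon\}$ forces $M(\theta_0)-M(\hat{\theta}_n)\geq M(\theta_0)-\sup_{|\theta-\theta_0|\geq\varepsilon}M(\theta)>0$, so its probability tends to $0$.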

The uniform convergence condition is equivalent to the class $\{m_\theta:\theta\in\Theta\}$ (or $\{m'_{\theta,j}:\theta\in\Theta,\ j=1,\cdots,K\}$) being Glivenko-Cantelli, which requires the complexity of the function class to be bounded. One simple sufficient condition is that $\Theta$ is compact, $m_\theta(x)$ (or $m'_{\theta,j}(x)$) is continuous in $\theta$ for every $x$, and is dominated by an integrable function.

Asymptotic distributions of the MLE and M-estimators

In this section, we study the weak convergence of the MLE and M-estimators given consistency. Here we only consider the case where $m_\theta(x)$ is a smooth function of $\theta$ for given $x$.

Let $\Psi_n(\theta)$ be the derivative of $M_n(\theta)$ and $\psi_\theta$ the derivative of $m_\theta$ (for the MLE this is the score function), so that $\hat{\theta}_n$ is, equivalently, the unique solution of $\Psi_n(\theta)=0$. A Taylor expansion gives
$$0=\frac{1}{n}\sum_{i=1}^{n}\psi_{\theta_0}(X_i)+\frac{1}{n}\sum_{i=1}^n\frac{\partial\psi_{\theta_0}(X_i)}{\partial\theta^\tau}(\hat{\theta}_n-\theta_0)+e_n.$$
The third term $e_n$ is a vector whose $j$-th component is
$$e_{n,j}=\frac{1}{2n}\sum_{i=1}^n(\hat{\theta}_n-\theta_0)^\tau\frac{\partial^2\psi_{\tilde{\theta},j}(X_i)}{\partial\theta\partial\theta^\tau}(\hat{\theta}_n-\theta_0),$$
where $\tilde{\theta}$ lies between $\hat{\theta}_n$ and $\theta_0$. Under the condition that $\frac{\partial^2\psi_{\tilde{\theta},j}(x)}{\partial\theta_i\partial\theta_k}$ exists for every $x$ and all $i,k,j$ and is controlled by a measurable and integrable function $\ddot{\psi}(x)$, $e_{n,j}=o_p(|\hat{\theta}_n-\theta_0|)$ is negligible.

Together with the assumption that $\frac{1}{n}\sum_{i=1}^n\frac{\partial\psi_{\theta_0}(X_i)}{\partial\theta^\tau}=\mathbb{P}_n\dot{\psi}_{\theta_0}$ is nonsingular (with a nonsingular limit $\mathbb{P}\dot{\psi}_{\theta_0}$), we have
$$\sqrt{n}(\hat{\theta}_n-\theta_0)=-\left(\mathbb{P}_n\dot{\psi}_{\theta_0}\right)^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n\psi_{\theta_0}(X_i)+o_p(1).$$
Since $\mathbb{P}\psi_{\theta_0}=0$ when $\theta_0$ is an interior maximizer of $M$, the SLLN together with the central limit theorem implies the sandwich formula
$$\sqrt{n}(\hat{\theta}_n-\theta_0)\overset{d}{\rightarrow}N\left(0,\left(\mathbb{P}\dot{\psi}_{\theta_0}\right)^{-1}\operatorname{cov}(\psi_{\theta_0})\left(\mathbb{P}\dot{\psi}_{\theta_0}^\tau\right)^{-1}\right).$$
The existence of $\frac{\partial^2\psi_{\tilde{\theta},j}(x)}{\partial\theta_i\partial\theta_k}$ can be weakened to a Lipschitz condition.
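A numerical sketch of the sandwich formula under misspecification (my own illustration): fit an $\mathrm{Exponential}(\theta)$ log-likelihood, $m_\theta(x)=\log\theta-\theta x$, to $\mathrm{Gamma}(2,1)$ data. Then $\psi_\theta(x)=1/\theta-x$, $\theta_0=1/E[X_1]=0.5$, $\mathbb{P}\dot{\psi}_{\theta_0}=-1/\theta_0^2$, and the sandwich variance is $\theta_0^4\operatorname{var}(X_1)=0.125$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 2_000, 2_000
est = np.empty(reps)
for r in range(reps):
    x = rng.gamma(2.0, 1.0, size=n)  # misspecified: data are Gamma, model is Exponential
    est[r] = 1.0 / x.mean()          # maximizer of M_n(theta) = mean(log theta - theta * x)

mc_var = n * est.var()               # Monte Carlo variance of sqrt(n) * (theta_hat - theta_0)
theta0 = 0.5
sandwich = theta0 ** 4 * 2.0         # (P psi_dot)^(-2) * cov(psi); var of Gamma(2, 1) is 2
print(mc_var, sandwich)              # both close to 0.125
```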

In terms of the MLE, it is the special case of M-estimation with $m_\theta(x)=\log f(x,\theta)=\ell_\theta(x)$, $\psi_\theta(x)=\frac{\dot{f}(x,\theta)}{f(x,\theta)}=\dot{\ell}_\theta(x)$, and $\dot{f}(x,\theta)=\left(\frac{\partial f(x,\theta)}{\partial\theta_1},\cdots,\frac{\partial f(x,\theta)}{\partial\theta_p}\right)^\tau$. Expectation and differentiation can be exchanged when $\frac{\partial f(x,\theta)}{\partial\theta}$ is controlled by an integrable function for $\theta$ in a small neighborhood of $\theta_0$ (by the mean value theorem and the DCT). Hence
$$E[\dot{\ell}_{\theta_0}(X_1)]=E\left[\frac{\dot{f}(X_1,\theta_0)}{f(X_1,\theta_0)}\right]=\int\dot{f}(x,\theta_0)dx=\frac{\partial}{\partial\theta}\int f(x,\theta)dx\bigg|_{\theta=\theta_0}=0.$$
Also, under suitable regularity conditions,
$$\mathbb{P}\ddot{\ell}_{\theta_0}=E\left\{\frac{\ddot{f}(X_1,\theta_0)}{f(X_1,\theta_0)}-\frac{\dot{f}(X_1,\theta_0)\dot{f}^\tau(X_1,\theta_0)}{f^2(X_1,\theta_0)}\right\}=-\operatorname{cov}(\dot{\ell}_{\theta_0}(X_1))=-\mathbb{P}\dot{\ell}_{\theta_0}\dot{\ell}_{\theta_0}^\tau=-I_{\theta_0},$$
which is exactly minus the Fisher information matrix of $X_1$ for $\theta$. In summary, the MLE of $\theta$ has the asymptotic distribution
$$\sqrt{n}(\hat{\theta}_n-\theta_0)\overset{d}{\rightarrow}N(0,I^{-1}_{\theta_0})$$
under the following sufficient regularity conditions:

R0: $\hat{\theta}_n$ is consistent for $\theta_0$.

R1: $\theta_0$ is an interior point of $\Theta$ and the support of $X_1$ does not depend on $\theta$.

R2: $f(x,\theta)$ is continuously differentiable up to order 3 with respect to $\theta$ for almost all $x$.

R3: The following functions are all controlled by a measurable and integrable function when $\theta$ is in a small neighborhood of $\theta_0$: $\frac{\partial f(x,\theta)}{\partial\theta_i}$, $\frac{\partial^2 f(x,\theta)}{\partial\theta_i\partial\theta_j}$, and $\frac{\partial^3\ell_\theta(x)}{\partial\theta_i\partial\theta_j\partial\theta_k}$ for $i,j,k=1,\cdots,p$.

R4: $I_{\theta_0}$ exists and is positive definite.
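A quick numerical check of this result (an illustrative sketch with my own choice of model): for $X\sim\mathrm{Exponential}(\mathrm{rate}=\theta_0)$ the MLE is $\hat{\theta}_n=1/\bar{X}_n$ and $I_\theta=1/\theta^2$, so $\sqrt{n}(\hat{\theta}_n-\theta_0)$ should be approximately $N(0,\theta_0^2)$.

```python
import numpy as np

rng = np.random.default_rng(6)
theta0, n, reps = 2.0, 2_000, 2_000
# rng.exponential takes the scale 1/rate; each row is one sample of size n
est = 1.0 / rng.exponential(1.0 / theta0, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (est - theta0)
print(z.mean(), z.var())  # mean near 0, variance near theta0^2 = 4
```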
