On Wald's Consistency
Introduction
Let us revisit Wald's 1949 paper in Ann. Math. Statist. and explore under what conditions a general M-estimator is consistent (converges in probability).
Preliminaries
An equivalent condition to strong consistency
A stochastic sequence $\{X_n\}_{n=1}^{\infty}$ converges a.s. to a random vector $X$ (defined on the same probability space) iff $pr(|X_n-X|\geq\varepsilon \ i.o.)=0$ for all $\varepsilon>0$.
Several ways to establish strong consistency
SLLN
If the $X_n$ are i.i.d. with $X_1\in L_1(pr)$, then $\bar{X}_n\overset{a.s.}{\rightarrow}E[X_1]$ by the strong law of large numbers (SLLN).
Borel-Cantelli Lemma
Let $\{A_n\}_{n=1}^{\infty}$ be a sequence of events. If $\sum_{n=1}^{\infty}pr(A_n)<\infty$, then $pr(A_n \ i.o.)=0$.
Let $A^{\varepsilon}_n=\{|X_n-X|\geq\varepsilon\}$. If we can verify $\sum_{n}pr(A^{\varepsilon}_n)<\infty$ for every $\varepsilon$, then $X_n\overset{a.s.}{\rightarrow}X$.
Lyapunov’s CLT
Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of independent r.v.s with $E[X_i]=\mu_i$ and $var(X_i)=\sigma^2_i$. Further, if
- all the $\mu_i$ and $\sigma_i^2$ exist and are finite, and
- $\lim_{n\rightarrow\infty}\frac{\sum_{i=1}^nE[|X_i-\mu_i|^{2+\delta}]}{(\sum_{i=1}^n\sigma_i^2)^{1+\delta/2}}=0$ for some $\delta>0$,

then $\frac{\sum_{i=1}^n(X_i-\mu_i)}{\sqrt{\sum_{i=1}^n\sigma_i^2}}\overset{d}{\rightarrow}N(0,1)$.
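As a quick numerical sanity check of Lyapunov's CLT, here is a simulation sketch (the distributional choices are my own illustration, not from the original notes): independent but non-identically distributed uniforms are standardized, and the result should look approximately $N(0,1)$.

```python
# A simulation sketch of Lyapunov's CLT with non-identically distributed terms.
# X_i ~ Uniform(0, i^(1/4)) are independent; the means, variances and the
# (2+delta)-moments grow slowly enough that the Lyapunov ratio vanishes.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 2000, 5000
scales = np.arange(1, n + 1) ** 0.25      # upper endpoints b_i
mus = scales / 2                          # E[Uniform(0, b)] = b / 2
sig2 = scales ** 2 / 12                   # var[Uniform(0, b)] = b^2 / 12

x = rng.uniform(0, scales, size=(reps, n))        # reps independent rows
z = (x - mus).sum(axis=1) / np.sqrt(sig2.sum())   # standardized sums
print(z.mean(), z.var())                  # should be close to 0 and 1
```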
Wald’s Consistency of the Maximum Likelihood Estimator
Notations
- $\{X_n\}_{n=1}^{\infty}$: a sequence of i.i.d. r.v.s.
- $F(x,\theta_0)$: the law of $X_1$, parameterized by a finite-dimensional $\theta=(\theta^1,\cdots,\theta^p)\in\Theta$.
- $f(x,\theta_0)$: the probability density function of $X_1$; $\theta_0$ is the true parameter and an interior point of $\Theta$.
- $f(x,\theta,\rho)=\sup_{|\theta'-\theta|\leq\rho}f(x,\theta')$.
- $\varphi(x,r)=\sup_{|\theta'|>r}f(x,\theta')$.
- $x^{+}=\max(0,x)$.
- $\ell_n(\theta)=\sum_{i=1}^n\log f(X_i,\theta)$: the log-likelihood of $\theta$.
- $\hat{\theta}_n=\arg\max_{\theta\in\Theta}\ell_n(\theta)$.
- $\mathcal{X}=\{x:f(x,\theta_0)>0\}$: the support of $X_1$.
Main Theorems
Under suitable regularity conditions, $\hat{\theta}_n$ converges to $\theta_0$ a.s.
A sketch of the proof:
First, note that $\log(x)$ is a concave function, and thus by Jensen's inequality we have
$$E[\log f(X,\theta)-\log f(X,\theta_0)]=E\left[\log\frac{f(X,\theta)}{f(X,\theta_0)}\right]\leq\log E\left[\frac{f(X,\theta)}{f(X,\theta_0)}\right]=0.$$
Since $-\log(x)$ is strictly convex, equality holds iff $f(X,\theta)=f(X,\theta_0)$ a.s. As a consequence, under the assumption that $F(x,\theta)\neq F(x,\theta_0)$ for at least one point, Jensen's inequality holds strictly, i.e., $E[\log f(X,\theta)]<E[\log f(X,\theta_0)]$ for $\theta\neq\theta_0$.
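As a quick Monte Carlo illustration of this strict inequality (a sketch of my own, assuming the normal location family with $\theta_0=0$): the map $\theta\mapsto E[\log f(X,\theta)]$ is maximized at the true parameter, since $E[\log f(X,\theta_0)]-E[\log f(X,\theta)]$ is the Kullback-Leibler divergence.

```python
# Monte Carlo check that E[log f(X, theta)] < E[log f(X, theta_0)] for
# theta != theta_0, in the assumed model X ~ N(theta_0, 1) with theta_0 = 0.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, size=200_000)     # draws from f(., theta_0)
for theta in [-1.0, -0.5, 0.0, 0.5, 1.0]:
    print(theta, norm.logpdf(x, loc=theta).mean())  # largest at theta = 0
```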
The second step is to split the parameter space into a finite union $\Theta=\cup_{k=0}^KB_k$ such that $\theta_0\in B_0$ and $\theta_0\notin B_j$ for $j=1,\cdots,K$. Letting $f_s(x,B)=\sup_{\theta\in B}f(x,\theta)$, we have $E[\log f_s(X,B_j)]<E[\log f(X,\theta_0)]$ by the above statement (made precise in Lemmas 2 and 3 below).
Thirdly, by the strong law of large numbers (SLLN), $\frac{\ell_n(\theta)}{n}$ converges almost surely to $E[\log f(X,\theta)]$ for each $\theta\in\Theta$. Let
$$A_{\epsilon}=\left\{w:\frac{\sum_{i=1}^n\log f_s(X_i,B_j)}{n}\rightarrow E[\log f_s(X_1,B_j)]\text{ for }j=1,\cdots,K\text{ and }\frac{\sum_{i=1}^n\log f(X_i,\theta_0)}{n}\rightarrow E[\log f(X_1,\theta_0)]\right\}.$$
Over $A_{\epsilon}$ we have $\hat{\theta}_{n}\in B_0$ when $n$ is large enough; otherwise $\sup_{\theta\in B_j}\ell_n(\theta)/n\geq\ell_n(\theta_0)/n$ infinitely often for some $j\geq 1$, contradicting $E[\log f_s(X,B_j)]<E[\log f(X,\theta_0)]$. Hence $\hat{\theta}_n\overset{a.s.}{\rightarrow}\theta_0$ if we can verify that $pr(A_\epsilon)=1$.
Regularity Conditions
A1: $\{X_n\}_{n=1}^{\infty}$ are either discrete or continuous.
A2: $\exists\rho>0$ and $r>0$ such that $E[\{\log f(X_1,\theta,\rho)\}^+]<\infty$ for $\theta\in\Theta$ and $E[\{\log\varphi(X_1,r)\}^+]<\infty$.
A3: $f(x,\theta)$ is a continuous function of $\theta$ for each given $x$.
A4: $F(x,\theta_1)\neq F(x,\theta_0)$ for at least one point $x$ when $\theta_1\neq\theta_0$.
A5: $\lim_{||\theta||\rightarrow\infty}f(x,\theta)=0$ for any given $x$.
A6: $E[|\log f(X_1,\theta_0)|]<\infty$.
A7: $\Theta$ is a closed subset of the $p$-dimensional Cartesian space.
A8: $f(x,\theta,\rho)$ is a measurable function of $x$ for any $\theta$ and $\rho$.
A8 is unnecessary in the discrete case, and A3, A5 and A8 together ensure that $f$ is a "good" function of $\theta$.
Discussion of the regularity conditions
- A4 and A6 validate the strict Jensen's inequality, under which $E[\log f(X,\theta)]<E[\log f(X,\theta_0)]$ for $\theta\neq\theta_0$.
- A3 ensures that $\lim_{\rho\rightarrow 0}f(x,\theta,\rho)=f(x,\theta)$. Since $f(x,\theta,\rho)$ is a monotone function of $\rho$ for given $x$ and $\theta$, and A2 says that $\log f(x,\theta,\rho)$ is dominated by an integrable function, we have $\lim_{\rho\rightarrow 0}E[\log f(X_1,\theta,\rho)]=E[\log f(X_1,\theta)]$ by the dominated convergence theorem. It follows that for any $\theta\neq\theta_0$, $E[\log f(X_1,\theta,\rho)]<E[\log f(X,\theta_0)]$ for some $\rho>0$.
- Also, under A2, A3 and A5, $\varphi(x,r)$ decreases to $0$ as $r\rightarrow\infty$ for every $x$, so the DCT implies $\lim_{r\rightarrow\infty}E[\log\varphi(X_1,r)]=-\infty<E[\log f(X,\theta_0)]$. Hence $\exists r_0>0$ such that $E[\log\varphi(X_1,r_0)]<E[\log f(X,\theta_0)]$.
- A7 ensures that $\hat{\theta}_n\in\Theta$.
- A8 guarantees that $f(X,\theta,\rho)$ and $\varphi(X,r)$ are still measurable for any $\theta,\rho$ and $r$.
- A1-A2 and A7-A8 hold in most cases; A3-A6 are the important ones and need to be verified.
- In fact, for A5 we only need that $f(x,\theta)$ exists and is finite as $\theta$ approaches the boundary of $\Theta$. Hence A5 is unnecessary if we assume that $\Theta$ is compact.
Three important implications under regularity assumptions
Under A1-A8, we have
Lemma 1. $E[\log f(X,\theta)]<E[\log f(X,\theta_0)]$ for $\theta\neq\theta_0$.
Lemma 2. For any $\theta\neq\theta_0$, $E[\log f(X_1,\theta,\rho)]<E[\log f(X,\theta_0)]$ for some $\rho>0$.
Lemma 3. $\exists r_0>0$ such that $E[\log\varphi(X_1,r_0)]<E[\log f(X,\theta_0)]$.
A Rigorous Proof
Step 1
$\forall\varepsilon>0$, let $B_0=U(\theta_0,\varepsilon)\cap\Theta=\{\theta:|\theta-\theta_0|<\varepsilon\}\cap\Theta$.
Step 2
By Lemma 3, we can find an $r_0>0$ such that $E[\log\varphi(X_1,r_0)]<E[\log f(X,\theta_0)]$. Hence let $B_K=\Theta\setminus\bar{U}(0,r_0)$, the part of $\Theta$ outside the closed ball $\bar{U}(0,r_0)$. It is easy to verify that $A=\Theta-B_0-B_K$ is a compact set (it is closed and bounded in the Cartesian space).
Step 3
By Lemma 2, $\forall\theta\in A$, $\exists\rho_\theta>0$ such that $E[\log f(X_1,\theta,\rho_\theta)]<E[\log f(X,\theta_0)]$. Moreover, $A\subset\cup_{\theta\in A}U(\theta,\rho_\theta)$. The Heine-Borel theorem implies that there are finitely many $\theta_1,\cdots,\theta_{K-1}$ such that $A\subset\cup_{j=1}^{K-1}U(\theta_j,\rho_{\theta_j})$, and we take $B_j=U(\theta_j,\rho_{\theta_j})\cap A$.
In summary, for every $B_0$ as defined above we can find $B_1,\cdots,B_K$ such that $\Theta=\cup_{k=0}^{K}B_k$ and $E[\log f_s(X,B_j)]<E[\log f(X,\theta_0)]$ for $B_1,\cdots,B_K$.
Step 4
Let
$$A_{\varepsilon}=\left\{w:\frac{\sum_{i=1}^n\log f_s(X_i(w),B_j)}{n}\rightarrow E[\log f_s(X_1,B_j)]\text{ for }j=1,\cdots,K\text{ and }\frac{\sum_{i=1}^n\log f(X_i(w),\theta_0)}{n}\rightarrow E[\log f(X_1,\theta_0)]\right\}.$$
Note that by definition
$$f_s(x,B_j)=\left\{\begin{array}{ll}f(x,\theta_j,\rho_{\theta_j})&j=1,\cdots,K-1,\\\varphi(x,r_0)&j=K,\end{array}\right.$$
which is measurable by A8. By the SLLN we have $pr(A_{\varepsilon})=1$ for any $\varepsilon>0$. Note that over $A_{\varepsilon}$ (i.e., for every $w_1\in A_{\varepsilon}$), we have
$$\sup_{\theta\in B_j}\frac{\ell_n(\theta)}{n}\leq\frac{\sum_{i=1}^n\log f_s(X_i(w_1),B_j)}{n}<\frac{\sum_{i=1}^n\log f(X_i(w_1),\theta_0)}{n}$$
when $n$ is large enough. As a result, $\forall w\in A_{\varepsilon}$, $\exists N_{\varepsilon}>0$ such that $\hat{\theta}_n(w)\in B_0$ for $n>N_{\varepsilon}$. It follows that $A_{\varepsilon}\subset\cup_{n=1}^{\infty}\cap_{m=n}^\infty\{|\hat{\theta}_m-\theta_0|<\varepsilon\}$ and $pr(|\hat{\theta}_n-\theta_0|\geq\varepsilon\ i.o.)=0$ for any $\varepsilon>0$. This completes the proof.
Examples
Example 1
Suppose $X_1,\cdots,X_n$ are i.i.d. with density $f(x,\theta_0)=c\cdot\exp\{-|x-\theta_0|^3\}$ and $\Theta=(-\infty,\infty)$. Show the strong consistency of the MLE for $\theta_0$.
To prove the strong consistency of $\hat{\theta}_n$, we just need to check A1-A8. A2 obviously holds since $\sup_{\theta}f(x,\theta)\leq c$ for any $x$, so $E[\{\log\sup_{\theta}f(X,\theta)\}^+]\leq(\log c)^+<\infty$. A3-A8 are also easy to verify.
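A small simulation sketch for this example (the sampler and optimizer choices are mine, not part of the argument): since $\log f(x,\theta)=\log c-|x-\theta|^3$, the MLE minimizes $\sum_i|X_i-\theta|^3$, and the estimates should drift toward $\theta_0$ as $n$ grows.

```python
# Numerical illustration of Example 1: the MLE for
# f(x, theta_0) = c * exp(-|x - theta_0|^3) approaches theta_0 = 2 as n grows.
# Samples come from a grid-based inverse-CDF approximation (an assumption
# of this sketch, not of the theory).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
theta0 = 2.0

grid = np.linspace(-4, 4, 20001)          # the density is negligible outside
pdf = np.exp(-np.abs(grid) ** 3)          # unnormalized f(x, 0); c cancels
cdf = np.cumsum(pdf)
cdf /= cdf[-1]

def sample(n):
    """Approximate draws from f(., theta0) via inverse CDF on the grid."""
    return theta0 + np.interp(rng.uniform(size=n), cdf, grid)

def mle(x):
    """The MLE minimizes sum |x_i - theta|^3 (the constant c drops out)."""
    return minimize_scalar(lambda t: np.sum(np.abs(x - t) ** 3),
                           bounds=(-10, 10), method="bounded").x

for n in [100, 1000, 10000]:
    print(n, mle(sample(n)))              # estimates should approach 2.0
```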
Example 2
Suppose $X_1,\cdots,X_n$ are i.i.d. normal with $\theta=(\mu,\sigma^2)$ and $\Theta=(-\infty,\infty)\times(0,\infty)$.
The key is to verify A5, since the other assumptions are easy to check. Approaching the boundary of $\Theta$, we have $\lim_{\sigma^2\rightarrow 0^+}f(x,\theta)=\infty$ when $x=\mu$. Hence A5 is not satisfied. There are two ways to proceed. One is to restrict $\Theta=(-\infty,\infty)\times[\delta,\infty)$ for some $\delta>0$. The other is to prove that $\hat{\sigma}^2\geq\delta$ when $n$ is large enough.
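A minimal sketch of the second route (with assumed parameter values): for normal data the MLE has the closed form $(\hat{\mu},\hat{\sigma}^2)=(\bar{X}_n,\frac{1}{n}\sum_i(X_i-\bar{X}_n)^2)$, and $\hat{\sigma}^2$ stays bounded away from $0$ for large $n$ even though A5 fails at the boundary $\sigma^2\rightarrow 0^+$.

```python
# Illustration of Example 2: the normal MLE stays away from the troublesome
# boundary sigma^2 -> 0+ as n grows. mu0 and sigma2_0 are assumed values.
import numpy as np

rng = np.random.default_rng(3)
mu0, sigma2_0 = 1.0, 4.0
for n in [50, 500, 5000, 50000]:
    x = rng.normal(mu0, np.sqrt(sigma2_0), size=n)
    mu_hat = x.mean()
    sigma2_hat = ((x - mu_hat) ** 2).mean()   # the MLE of sigma^2
    print(n, round(mu_hat, 4), round(sigma2_hat, 4))
```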
Extensions to M-estimation
Definition of M-estimation
Let $X_1,\cdots,X_n,\cdots$ be i.i.d. r.v.s and let $\theta$ be a parameter attached to the law of $X_1$. Here $m_\theta(x)$ is a known function of $x$, and $\hat{\theta}_n$ is defined as the maximizer of $M_n(\theta)=\mathbb{P}_nm_\theta=\frac{1}{n}\sum_{i=1}^nm_\theta(X_i)$ over $\theta\in\Theta$. Let $M(\theta)$ be a fixed function such that $M_n(\theta)\overset{p}{\rightarrow}M(\theta)$ for every $\theta\in\Theta$; for example, one choice of $M(\theta)$ is $\mathbb{P}m_\theta=E[m_\theta(X_1)]$ by the WLLN. Define $\theta_0=\underset{\theta\in\Theta}{\arg\max}\,M(\theta)$. We would now like to extend Wald's consistency to $\hat{\theta}_n$.
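A minimal sketch of an M-estimator outside the likelihood framework (the Cauchy model below is my own example): taking $m_\theta(x)=-|x-\theta|$ gives $M(\theta)=-E|X_1-\theta|$, whose maximizer $\theta_0$ is the population median, and the maximizer of $M_n$ is the sample median.

```python
# The sample median as an M-estimator: maximize M_n(theta) = -mean|X_i - theta|.
# Standard Cauchy data (population median 0) is an assumed example; note that
# the sample mean would not even be consistent here.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
x = rng.standard_cauchy(size=10_000)

def M_n(theta):
    return -np.mean(np.abs(x - theta))    # P_n m_theta

theta_hat = minimize_scalar(lambda t: -M_n(t), bounds=(-10, 10),
                            method="bounded").x
print(theta_hat, np.median(x))            # the two should nearly coincide
```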
Wald’s Consistency of M-estimator
We follow Van der Vaart (1998) and the contents of the previous sections; the regularity conditions are given by:
C1: $\{X_n\}_{n=1}^{\infty}$ are either discrete or continuous.
C2: $\Theta$ is a compact subset of the $p$-dimensional space.
C3: For every sufficiently small ball $U\subset\Theta$, $\sup_{\theta\in U}m_\theta(x)$ is a measurable function of $x$ and $E[\sup_{\theta\in U}m_\theta(X_1)]<\infty$.
C4: $m_\theta(x)$ is upper semicontinuous in $\theta$ for almost every $x$.
C5: $\hat{\theta}_n$ and $\theta_0$ are identifiable, i.e., they exist and are unique.
Note that C1 corresponds to A1, C2 corresponds to A5 and A7, and C3 corresponds to A2, A6 and A8. C4 corresponds to A3. C5 is the identifiability condition, which corresponds to A4.
Under C1-C5, we have $\hat{\theta}_n\overset{a.s.}{\rightarrow}\theta_0$. If $\hat{\theta}_n$ is only defined such that $M_n(\hat{\theta}_n)\geq M_n(\theta_0)-o_p(1)$, then the convergence degenerates to convergence in probability. In addition, if we permit multiple maximizers of $M(\theta)$ and define $\Theta_0=\{\theta_0\in\Theta:M(\theta_0)=\sup_{\theta\in\Theta}M(\theta)\}$, then $pr(d(\hat{\theta}_n,\Theta_0)>\varepsilon)\rightarrow 0$ for any $\varepsilon>0$. The proof is similar to the one above, via finite coverings of the compact parameter space.
An alternative route to the consistency of M-estimation
Another simple, useful and commonly used route to the consistency of M-estimators is via the uniform law of large numbers, described in detail in Van der Vaart (1998).
Two crucial conditions are required:
B1: For any $\varepsilon>0$, $\sup_{|\theta-\theta_0|\geq\varepsilon}M(\theta)<M(\theta_0)$.
B2: $\sup_{\theta\in\Theta}|M_n(\theta)-M(\theta)|\overset{p}{\rightarrow}0$.
B1 is the identifiability condition: it states that $\theta_0$ is the unique maximizer of $M(\theta)$ and ensures that $\theta_0$ is well defined. B2 requires the uniform convergence of $M_n(\theta)$, which is very strong and is a substitute for C2-C4. The measurability in C3 is not necessary here because we can use the outer probability instead.
The uniform convergence condition is equivalent to the class $\{m_\theta,\theta\in\Theta\}$ (or $\{m'_{\theta,j},\theta\in\Theta,j=1,\cdots,K\}$) being Glivenko-Cantelli, which requires the complexity of the function class to be bounded. One simple sufficient condition is that $\Theta$ is compact and $m_\theta(x)$ (or $m'_{\theta,j}(x)$) is continuous in $\theta$ for every $x$ and dominated by an integrable function.
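A numerical sketch of B2 (an assumed toy model, not from the text): for $m_\theta(x)=-(x-\theta)^2$ with $X\sim N(0,1)$ and compact $\Theta=[-2,2]$, we have $M(\theta)=-(1+\theta^2)$ in closed form, and $\sup_\theta|M_n(\theta)-M(\theta)|$ shrinks as $n$ grows; the supremum is approximated on a grid.

```python
# Uniform convergence sup_theta |M_n(theta) - M(theta)| -> 0 over a compact
# parameter set, for the assumed model m_theta(x) = -(x - theta)^2, X ~ N(0,1).
import numpy as np

rng = np.random.default_rng(5)
thetas = np.linspace(-2, 2, 401)          # grid over compact Theta
M = -(1 + thetas ** 2)                    # M(theta) = -E(X - theta)^2

for n in [100, 1000, 10_000, 100_000]:
    x = rng.normal(size=n)
    M_n = np.array([-np.mean((x - t) ** 2) for t in thetas])
    print(n, np.max(np.abs(M_n - M)))     # should decrease toward 0
```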
Asymptotic distributions of the MLE and M-estimators
In this section, we study the weak convergence of the MLE and M-estimators given consistency. Here we only consider the case where the $m_\theta(x)$ are smooth functions of $\theta$ for given $x$.
Let $\Psi_n(\theta)$ be the derivative of $M_n(\theta)$ and $\psi_\theta$ be the derivative of $m_\theta$, which is referred to as the score function; equivalently, $\hat{\theta}_n$ is the unique solution of $0=\Psi_n(\theta)$. The Taylor series expansion gives
$$0=\frac{1}{n}\sum_{i=1}^{n}\psi_{\theta_0}(X_i)+\frac{1}{n}\sum_{i=1}^n\frac{\partial\psi_{\theta_0}(X_i)}{\partial\theta^\tau}(\hat{\theta}_n-\theta_0)+e_n.$$
The third term $e_n$ is a vector whose $j$-th component is the second-order remainder
$$e_{n,j}=\frac{1}{2n}\sum_{i=1}^n(\hat{\theta}_n-\theta_0)^\tau\frac{\partial^2\psi_{\tilde{\theta},j}(X_i)}{\partial\theta\partial\theta^\tau}(\hat{\theta}_n-\theta_0)$$
for some $\tilde{\theta}$ between $\hat{\theta}_n$ and $\theta_0$. Under the condition that $\frac{\partial^2\psi_{\tilde{\theta},j}(x)}{\partial\theta_i\partial\theta_k}$ exists for every $x$ and all $i,k$ and $j$, and is controlled by a measurable and integrable function $\ddot{\psi}(x)$, the term $e_{n,j}=o_p(1)$ is negligible.
Together with the assumption that $\frac{1}{n}\sum_{i=1}^n\frac{\partial\psi_{\theta_0}(X_i)}{\partial\theta^\tau}$ is nonsingular, we have
$$\sqrt{n}(\hat{\theta}_n-\theta_0)=-\frac{1}{\sqrt{n}}\sum_{i=1}^n\left(\mathbb{P}_n\dot{\psi}_{\theta_0}\right)^{-1}\psi_{\theta_0}(X_i)+o_p(1).$$
Since $\mathbb{P}\psi_{\theta_0}=0$, the SLLN together with the central limit theorem implies
$$\sqrt{n}(\hat{\theta}_n-\theta_0)\overset{d}{\rightarrow}N\left(0,\left(\mathbb{P}\dot{\psi}_{\theta_0}\right)^{-1}\operatorname{cov}(\psi_{\theta_0})\left(\mathbb{P}\dot{\psi}_{\theta_0}^\tau\right)^{-1}\right),$$
the familiar sandwich formula. The existence of $\frac{\partial^2\psi_{\tilde{\theta},j}(x)}{\partial\theta_i\partial\theta_k}$ can be weakened to a Lipschitz condition.
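A plug-in sketch of the sandwich formula (a toy scalar model of my own choosing): with $m_\theta(x)=-(x-\theta)^2$ we get $\psi_\theta(x)=2(x-\theta)$ and $\dot{\psi}\equiv-2$, so $\hat{\theta}_n=\bar{X}_n$ and the sandwich variance reduces to $var(X_1)$; the empirical variance of $\sqrt{n}(\hat{\theta}_n-\theta_0)$ should match the plug-in estimate.

```python
# Sandwich variance (P_n psi_dot)^{-1} cov_n(psi) (P_n psi_dot)^{-1} for the
# assumed model m_theta(x) = -(x - theta)^2 with skewed mean-zero noise.
import numpy as np

rng = np.random.default_rng(6)
n, reps, theta0 = 500, 2000, 0.0
x = theta0 + rng.exponential(size=(reps, n)) - 1.0    # noise: mean 0, var 1

theta_hats = x.mean(axis=1)               # the M-estimator here is the mean
print(np.var(np.sqrt(n) * (theta_hats - theta0)))     # empirical asymptotic var

psi = 2 * (x[0] - theta_hats[0])          # psi evaluated on one sample
psi_dot = -2.0                            # constant derivative in this model
print(np.var(psi) / psi_dot ** 2)         # plug-in sandwich estimate (~1)
```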
In terms of the MLE, it is the special case of M-estimation with $m_\theta(x)=\log f(x,\theta)=\ell_\theta(x)$ and $\psi_\theta(x)=\frac{\dot{f}(x,\theta)}{f(x,\theta)}=\dot{\ell}_\theta(x)$, where
$$\dot{f}(x,\theta)=\left(\frac{\partial f(x,\theta)}{\partial\theta_1},\cdots,\frac{\partial f(x,\theta)}{\partial\theta_p}\right)^\tau.$$
By the definition of the derivative,
$$\frac{\partial f(x,\theta)}{\partial\theta_i}=\lim_{\Delta\theta_i\rightarrow 0}\frac{f(x,\theta+\Delta\theta)-f(x,\theta)}{\Delta\theta_i}=\lim_{\Delta\theta_i\rightarrow 0}f'(x,\tilde{\theta})$$
by the mean value theorem. Expectation and differentiation are exchangeable when $\frac{\partial f(x,\theta)}{\partial\theta}$ is controlled by an integrable function for $\theta$ in a small neighborhood of $\theta_0$, by the DCT. Hence,
$$E[\dot{\ell}_{\theta_0}(X_1)]=E\left[\frac{\dot{f}(X_1,\theta_0)}{f(X_1,\theta_0)}\right]=\int\dot{f}(x,\theta_0)\,dx=\frac{\partial}{\partial\theta}\int f(x,\theta)\,dx\Big|_{\theta=\theta_0}=0.$$
Also, under suitable regularity conditions (a second exchange of expectation and differentiation, which gives $E[\ddot{f}(X_1,\theta_0)/f(X_1,\theta_0)]=\int\ddot{f}(x,\theta_0)\,dx=0$), we have
$$\mathbb{P}\ddot{\ell}_{\theta_0}=E\left\{\frac{\ddot{f}(X_1,\theta_0)}{f(X_1,\theta_0)}-\frac{\dot{f}(X_1,\theta_0)\dot{f}^\tau(X_1,\theta_0)}{f^2(X_1,\theta_0)}\right\}=-\operatorname{cov}(\dot{\ell}_{\theta_0}(X_1))=-\mathbb{P}\dot{\ell}_{\theta_0}\dot{\ell}_{\theta_0}^\tau=-I_{\theta_0},$$
which is exactly minus the Fisher information matrix of $X_1$ for $\theta$. In summary, the MLE of $\theta$ has the asymptotic distribution
$$\sqrt{n}(\hat{\theta}_n-\theta_0)\overset{d}{\rightarrow}N(0,I^{-1}_{\theta_0})$$
under the following sufficient regularity conditions:
R0: $\hat{\theta}_n$ is consistent for $\theta_0$.
R1: $\theta_0$ is an interior point of $\Theta$ and the support of $X_1$ does not depend on $\theta$.
R2: $f(x,\theta)$ is continuously differentiable up to order 3 with respect to $\theta$ for almost all $x$.
R3: The following functions are all controlled by a measurable and integrable function when $\theta$ is in a small neighborhood of $\theta_0$: $\frac{\partial f(x,\theta)}{\partial\theta_i}$, $\frac{\partial^2 f(x,\theta)}{\partial\theta_i\partial\theta_j}$, and $\frac{\partial^3\ell_{\theta}(x)}{\partial\theta_i\partial\theta_j\partial\theta_k}$ for $i,j,k=1,\cdots,p$.
R4: $I_{\theta_0}$ exists and is positive definite.
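To close, a simulation sketch of this asymptotic normality (an exponential model with assumed rate $\theta_0=2$): here $f(x,\theta)=\theta e^{-\theta x}$, the MLE is $\hat{\theta}_n=1/\bar{X}_n$, the Fisher information is $I_\theta=1/\theta^2$, and R0-R4 are easy to check, so $\sqrt{n}(\hat{\theta}_n-\theta_0)$ should be approximately $N(0,\theta_0^2)$.

```python
# Checking sqrt(n)(theta_hat - theta_0) ~ N(0, I^{-1}) for the assumed
# exponential model f(x, theta) = theta * exp(-theta x), where I = 1/theta^2.
import numpy as np

rng = np.random.default_rng(7)
n, reps, theta0 = 1000, 5000, 2.0
x = rng.exponential(scale=1 / theta0, size=(reps, n))
theta_hat = 1 / x.mean(axis=1)            # MLE in each replication
z = np.sqrt(n) * (theta_hat - theta0)
print(z.mean(), z.var(), theta0 ** 2)     # variance close to theta0^2 = 4
```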