Why are estimated parameters asymptotically Gaussian? Deriving the asymptotic normality of M-estimators

Jie Qiao

Tags: machine learning, artificial intelligence

Article link: https://blog.csdn.net/a358463121/article/details/133860659

M-estimators

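Here we study the asymptotic normality of a class of estimators called M-estimators. Specifically, suppose the parameter of interest can be written as the solution of a minimization (or maximization) problem:

$\theta _{0} =\arg\min_{\theta \in \Theta }\mathbb{E}[ q(w,\theta )]$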

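For example, the maximum likelihood estimate is the parameter that maximizes the likelihood (equivalently, minimizes the negative log-likelihood). Replacing the expectation with the sample average gives the M-estimator (maximum-likelihood-like estimator, Huber (1967)):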

$\hat{\theta } =\arg\min_{\theta } N^{-1}\sum _{i=1}^{N} q(w_{i} ,\theta )$

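This is the parameter estimated from the sample.
We can also characterize the estimator through its first-order condition: assuming $q$ is differentiable in $\theta$ and the minimum is attained in the interior of $\Theta$, $\displaystyle \hat{\theta }$ is a point where the derivative of the sample objective vanishes: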

$N^{-1}\sum _{i=1}^{N} \nabla _{\theta } q(w_{i} ,\hat{\theta } )=0$

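Solving this first-order condition is a method-of-moments approach, and it can be viewed as a just-identified special case of the generalized method of moments (GMM), with the score $\nabla _{\theta } q$ as the moment function. When $q$ is the negative log-likelihood, the resulting estimator coincides with the MLE. GMM has a wider scope: it only needs moment conditions rather than a full likelihood, so it still applies when the data distribution is unknown or cannot be written in closed form, which are cases where MLE is not available.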
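To make the sample objective and its first-order condition concrete, here is a minimal sketch. The choice of loss, $q(w,\theta )=\log\cosh (w-\theta )$ for a location parameter, is an assumption made for illustration and does not come from the derivation above; the function name `logcosh_location` is likewise hypothetical. The estimator is found by Newton's method on the sample score, and the code then checks that the score is numerically zero at $\hat{\theta }$.

```python
import numpy as np

def logcosh_location(w, n_iter=50, tol=1e-10):
    """Location M-estimate: argmin over theta of mean(log(cosh(w - theta))).

    Illustrative loss only; solved by Newton's method on the sample score.
    """
    theta = np.median(w)  # robust starting value
    for _ in range(n_iter):
        r = w - theta
        score = -np.mean(np.tanh(r))           # average gradient of q
        hess = np.mean(1.0 - np.tanh(r) ** 2)  # average second derivative, > 0
        step = score / hess
        theta -= step
        if abs(step) < tol:
            break
    return theta

rng = np.random.default_rng(0)
w = 2.0 + rng.standard_normal(5000)  # simulated data centred at 2.0

theta_hat = logcosh_location(w)
foc = -np.mean(np.tanh(w - theta_hat))  # first-order condition at theta_hat

print("theta_hat:", theta_hat)
print("sample score at theta_hat:", foc)
```

The printed score should be zero up to floating-point error, which is exactly the condition $N^{-1}\sum _{i} \nabla _{\theta } q(w_{i} ,\hat{\theta } )=0$.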

Consistency and normality of M-estimators

What properties does the estimated parameter $\displaystyle \hat{\theta }$ have? For an estimator we usually care about three questions: identifiability, consistency, and asymptotic normality.

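Identifiability is the basic requirement: the minimizer $\theta _{0}$ of the population objective must be unique, so that no other parameter value attains the same objective value. This property usually has to be checked problem by problem; here we simply assume it holds.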

Consistency

Consistency asks whether $\displaystyle \hat{\theta }$ converges in probability to the true value $\displaystyle \theta _{0}$, that is, whether $\displaystyle \hat{\theta }\xrightarrow{p} \theta _{0}$.


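To prove consistency, we use the objective functions as a bridge. We first show that the sample objective $\displaystyle N^{-1}\sum _{i=1}^{N} q(w_{i} ,\theta )$ converges uniformly in $\theta$ to $\displaystyle E[ q( w,\theta )]$ (this is a uniform law of large numbers). Then, using the identifiability of $\displaystyle \theta _{0}$ together with regularity conditions such as continuity and boundedness, we conclude that $\displaystyle \hat{\theta }\xrightarrow{p} \theta _{0}$.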

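Consistency can also be seen numerically. The sketch below is an illustration under assumed settings (a no-intercept linear model $y=ax+\epsilon$ with squared loss, standard normal $x$ and $\epsilon$, and an arbitrary true slope of 1.5); it is not part of the proof. It computes the estimation error $|\hat{a} -a|$ for increasing sample sizes.

```python
import numpy as np

def ols_slope(x, y):
    """Minimizer of mean((y - a*x)**2) over a, in closed form."""
    return np.sum(x * y) / np.sum(x * x)

rng = np.random.default_rng(1)
a_true = 1.5  # assumed true parameter for the simulation

errors = {}
for n in (100, 10_000, 1_000_000):
    x = rng.normal(0.0, 1.0, size=n)
    y = a_true * x + rng.normal(0.0, 1.0, size=n)
    errors[n] = abs(ols_slope(x, y) - a_true)
    print(f"N = {n:>9d}   |a_hat - a| = {errors[n]:.5f}")
```

The error typically shrinks at the rate $1/\sqrt{N}$, which is the scaling that the normality result in the next part makes precise.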

Normality

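Finally, asymptotic normality. Asymptotic normality means that

$\sqrt{N}(\hat{\theta } -\theta _{0})\xrightarrow{d}\mathcal{N}\left( 0,A_{0}^{-1} B_{0} A_{0}^{-1}\right) ,\quad A_{0} =E\left[ \nabla _{\theta \theta }^{2} q(w,\theta _{0} )\right] ,\quad B_{0} =Var( \nabla _{\theta } q(w,\theta _{0} ))$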

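This theorem tells us that the estimated parameter is asymptotically normally distributed, and that its variance is determined by the first and second derivatives of $q$. Where does this come from? The idea is to use a Taylor expansion as a bridge between the estimation error and the derivative of the objective function. The derivation follows; throughout, we assume that identifiability and consistency hold.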

First we fix notation. $\displaystyle s_{i} (\theta )$ is a $P\times 1$ column vector

$\begin{array}{ r c l } s_{i} (\theta ) & \equiv & \nabla _{\theta } q(w_{i} ,\theta )\\ & = & \left(\frac{\partial q(w_{i} ,\theta )}{\partial \theta _{1}} ,...,\frac{\partial q(w_{i} ,\theta )}{\partial \theta _{P}}\right)^{T} \end{array}$

and $\displaystyle H_{i} (\theta )$ is a $P\times P$ matrix

$H_{i} (\theta )\equiv \nabla _{\theta \theta }^{2} q(w_{i} ,\theta )=\frac{\partial ^{2} q(w_{i} ,\theta )}{\partial \theta \partial \theta ^{\prime }} .$

Next we take a first-order Taylor expansion of $\displaystyle s_{i}( \theta )$. Recall that for a differentiable function $\displaystyle f( x)$, the expansion around $\displaystyle x_{0}$ is:

$f(x)=f(x_{0} )+f'(x^{+} )(x-x_{0} )$

where $\displaystyle x^{+}$ is a point between $\displaystyle x_{0}$ and $\displaystyle x$; this is the mean value theorem. If $f$ is vector-valued, the expansion takes the vector form:

$\mathbf{f} (\mathbf{x} )=\mathbf{f} (\mathbf{x}_{0} )+\frac{\partial \mathbf{f} (\mathbf{x} )}{\partial \mathbf{x}}\Bigl|_{\mathbf{x} =\mathbf{x}^{+}} (\mathbf{x} -\mathbf{x}_{0} )$

Here $\displaystyle \frac{\partial \mathbf{f} (\mathbf{x} )}{\partial \mathbf{x}}$ is the Jacobian matrix, and the intermediate point $\displaystyle \mathbf{x}^{+}$ may be different for each row of this matrix.

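Next we link $\displaystyle \theta$ to $\displaystyle s( \theta )$. Concretely, we apply this expansion to $\displaystyle s_{i} (\theta )$ around $\displaystyle \theta _{0}$ and evaluate it at $\displaystyle \hat{\theta }$: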

$\sum _{i=1}^{N} s_{i} (\hat{\theta } )=\sum _{i=1}^{N} s_{i} (\theta _{0} )+\sum _{i=1}^{N}\frac{\partial s_{i} (\theta )}{\partial \theta }\Bigl|_{\theta ^{+}} (\hat{\theta } -\theta _{0} )$

Now divide both sides by $N$ and write $\displaystyle S( \theta ) =\frac{1}{N}\sum _{i=1}^{N} s_{i} (\theta )=\frac{1}{N}\sum _{i=1}^{N} \nabla _{\theta } q(w_{i} ,\theta )$ and $\displaystyle S'( \theta ) =\frac{1}{N}\sum _{i=1}^{N}\frac{\partial s_{i} (\theta )}{\partial \theta } =\frac{1}{N}\sum _{i=1}^{N} \nabla _{\theta \theta }^{2} q(w_{i} ,\theta ) =\frac{1}{N}\sum _{i=1}^{N} H_{i} (\theta )$, so that

$S(\hat{\theta }) =S( \theta _{0}) +S'\left( \theta ^{+}\right) (\hat{\theta } -\theta _{0} )$

By definition, $\displaystyle \hat{\theta }$ is obtained from the first-order condition, so $\displaystyle S(\hat{\theta }) =0$, and therefore

$\begin{aligned} 0 & =S( \theta _{0}) +S'\left( \theta ^{+}\right) (\hat{\theta } -\theta _{0} )\\ \hat{\theta } -\theta _{0} & =-S'\left( \theta ^{+}\right)^{-1} S( \theta _{0}) \end{aligned}$

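Next we want to replace $\displaystyle \theta ^{+}$ by $\displaystyle \theta _{0}$. Since $\displaystyle \theta ^{+}$ lies between $\displaystyle \hat{\theta }$ and $\displaystyle \theta _{0}$, consistency $\displaystyle \hat{\theta }\xrightarrow{p} \theta _{0}$ implies $\displaystyle \theta ^{+}\xrightarrow{p} \theta _{0}$. Assuming further that $\displaystyle S'$ is smooth (continuous in $\theta$) and that $\displaystyle S'( \theta _{0})$ is invertible in the limit, we get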

$\begin{aligned} \hat{\theta } -\theta _{0} & =-S'( \theta _{0})^{-1} S( \theta _{0}) +o_{p}\left( N^{-1/2}\right)\\ \sqrt{N}(\hat{\theta } -\theta _{0}) & =-S'( \theta _{0})^{-1}\underbrace{\sqrt{N} S( \theta _{0})}_{\xrightarrow{d}\mathcal{N}( 0,B_{0})} +o_{p}( 1) \end{aligned}$

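Here $\displaystyle o_{p}( 1)$ denotes a remainder term that converges to zero in probability as $\displaystyle N\rightarrow \infty$. Since $\displaystyle \theta _{0}$ is a constant and $\displaystyle S( \theta _{0}) =\frac{1}{N}\sum _{i=1}^{N} \nabla _{\theta } q(w_{i} ,\theta _{0} )$ is a sample mean (of i.i.d. terms with finite variance, which we assume), the central limit theorem says that $\displaystyle \sqrt{N} S( \theta _{0})$ is asymptotically normal. Its mean is zero because $\displaystyle E[ S( \theta _{0})] =0$ ($\displaystyle \theta _{0}$ is a stationary point of $\displaystyle \mathbb{E}[ q(w,\theta )]$, assuming differentiation and expectation can be interchanged), and its variance is $\displaystyle Var( \nabla _{\theta } q(w_{i} ,\theta _{0} )) =Var( s_{i} (\theta _{0} ))$, which we denote by $\displaystyle B_{0}$. By the law of large numbers, $\displaystyle S'( \theta _{0})$ converges in probability to $\displaystyle E[ H_{i}( \theta _{0})]$, which we denote by $\displaystyle A_{0}$. Slutsky's theorem then gives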

$\sqrt{N}(\hat{\theta } -\theta _{0})\xrightarrow{d}\mathcal{N}\left( 0,A_{0}^{-1} B_{0} A_{0}^{-1}\right)$

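$\displaystyle A_{0}^{-1}$ appears twice for the same reason that, in the scalar case, $\displaystyle -a\cdot \mathcal{N}( 0,1) \sim \mathcal{N}\left( 0,a^{2}\right)$: if $\displaystyle z\sim \mathcal{N}( 0,B_{0})$, then $\displaystyle -A_{0}^{-1} z\sim \mathcal{N}\left( 0,A_{0}^{-1} B_{0}\left( A_{0}^{-1}\right)^{T}\right)$, and $\displaystyle A_{0}$ is symmetric because it is a Hessian. This "sandwich" is the matrix version of squaring. This is exactly our theorem.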


We see that the variance of the parameter estimate depends on $\displaystyle Var( \nabla _{\theta } q(w_{i} ,\theta _{0} ))$ and on $\displaystyle E\left[ \nabla _{\theta \theta }^{2} q(w_{i} ,\theta _{0} )\right]$.
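In practice both quantities are replaced by sample averages evaluated at $\hat{\theta }$. The following is a minimal sketch of such a plug-in sandwich estimate; the helper name `sandwich_variance` and the test model (a regression with intercept, standard normal regressor, and noise standard deviation 0.5) are assumptions chosen for illustration. For this model with squared loss, the sandwich should be close to $\sigma _{\epsilon }^{2}\left( E[zz^{T}]\right)^{-1} =0.25\cdot I$.

```python
import numpy as np

def sandwich_variance(scores, hessians):
    """Plug-in estimate of A^{-1} B A^{-1}.

    scores:   (N, P) array, row i is the gradient of q(w_i, theta_hat)
    hessians: (N, P, P) array, slice i is the Hessian of q(w_i, theta_hat)
    """
    n = scores.shape[0]
    A_hat = hessians.mean(axis=0)            # estimates E[H_i(theta_0)]
    centered = scores - scores.mean(axis=0)
    B_hat = centered.T @ centered / n        # estimates Var(s_i(theta_0))
    A_inv = np.linalg.inv(A_hat)
    return A_inv @ B_hat @ A_inv.T

# Illustrative data: y = 1 + 2*x + eps, squared loss q = (y - z @ theta)**2
rng = np.random.default_rng(2)
n, sigma_eps = 100_000, 0.5
x = rng.standard_normal(n)
Z = np.column_stack([np.ones(n), x])         # regressors z_i = (1, x_i)
y = Z @ np.array([1.0, 2.0]) + sigma_eps * rng.standard_normal(n)

theta_hat = np.linalg.lstsq(Z, y, rcond=None)[0]
resid = y - Z @ theta_hat
scores = -2.0 * resid[:, None] * Z                # gradient of q per sample
hessians = 2.0 * Z[:, :, None] * Z[:, None, :]    # Hessian of q per sample

avar = sandwich_variance(scores, hessians)
print(avar)  # approximate asymptotic variance of sqrt(N) * (theta_hat - theta_0)
```

Dividing `avar` by $N$ gives an approximate covariance matrix for $\hat{\theta }$ itself.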

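The heart of the proof is the Taylor expansion. The gradient $\displaystyle S( \theta _{0})$ is a sample average, so by the central limit theorem it is asymptotically Gaussian; and since $\displaystyle \hat{\theta } -\theta _{0}$ can be expressed in terms of $\displaystyle S( \theta _{0})$, we can write down the asymptotic Gaussian distribution of the estimator.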

Example

Consider a simple linear model

$y=ax+\epsilon ,$

where $\displaystyle x\sim \mathcal{N}\left( 0,\sigma _{x}^{2}\right)$ and $\displaystyle \epsilon \sim \mathcal{N}\left( 0,\sigma _{\epsilon }^{2}\right)$, with $x$ and $\epsilon$ independent. Then

$\begin{aligned} \hat{a} & =\arg\max\frac{1}{N}\sum _{i=1}^{N}\log p( x_{i} ,y_{i} ;a)\\ & =\arg\min\frac{1}{N}\sum _{i=1}^{N}( y_{i} -ax_{i})^{2} \end{aligned}$

Therefore $\displaystyle q( x_{i} ,y_{i} ,a) =( y_{i} -ax_{i})^{2}$, and so

$\begin{aligned} B_{0} & =Var( \nabla _{a} q( x_{i} ,y_{i} ,a)) =Var( -2( y_{i} -ax_{i}) x_{i}) =Var( -2\epsilon _{i} x_{i}) =E\left[ 4\epsilon _{i}^{2} x_{i}^{2}\right] -4E[ \epsilon _{i} x_{i}]^{2} =4\sigma _{x}^{2} \sigma _{\epsilon }^{2}\\ A_{0} & =E\left[ \nabla _{aa}^{2} q( x_{i} ,y_{i} ,a)\right] =E\left[ 2x_{i}^{2}\right] =2\sigma _{x}^{2} \end{aligned}$
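Plugging these two values into the sandwich formula gives $A_{0}^{-1} B_{0} A_{0}^{-1} =4\sigma _{x}^{2} \sigma _{\epsilon }^{2} /\left( 2\sigma _{x}^{2}\right)^{2} =\sigma _{\epsilon }^{2} /\sigma _{x}^{2}$. The sketch below checks this by Monte Carlo; the parameter values, sample size, and number of replications are arbitrary choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
a, sigma_x, sigma_eps = 2.0, 1.5, 0.8   # assumed values for the simulation
N, reps = 500, 4000                     # sample size and Monte Carlo replications

# Each row is one simulated data set of size N
x = rng.normal(0.0, sigma_x, size=(reps, N))
y = a * x + rng.normal(0.0, sigma_eps, size=(reps, N))

# Closed-form least-squares estimate of a for every replication
a_hat = (x * y).sum(axis=1) / (x * x).sum(axis=1)

z = np.sqrt(N) * (a_hat - a)            # rescaled estimation error
empirical = z.var()

A0 = 2.0 * sigma_x**2
B0 = 4.0 * sigma_x**2 * sigma_eps**2
theory = B0 / A0**2                     # equals sigma_eps**2 / sigma_x**2

print("empirical variance of sqrt(N)*(a_hat - a):", empirical)
print("sandwich formula A0^-1 B0 A0^-1:          ", theory)
```

The two printed numbers should agree up to Monte Carlo noise, and a histogram of `z` would look approximately Gaussian.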