MLaPP Chapter 6: Frequentist statistics

张小彬的代码人生

Article link: https://blog.csdn.net/zhangxb35/article/details/54927835

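This article covers frequentist statistics in depth, including the sampling distribution of an estimator, the bootstrap, Bayes risk and minimax risk in frequentist decision theory, and the properties of consistent, unbiased, and minimum-variance estimators. It also discusses empirical risk minimization, cross validation and its use in model selection, and the limitations and pathological behavior of frequentist statistics.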

6.1 Introduction

Frequentist statistics, also called classical statistics or orthodox statistics, provides methods of statistical inference that do not treat parameters as random variables, and thereby avoids the use of Bayes' rule and priors.

Frequentist methods rely on the sampling distribution, whereas Bayesian methods rely on the posterior distribution.

6.2 Sampling distribution of an estimator

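In contrast to the Bayesian approach, the frequentist approach treats the parameter as fixed when estimating it (it is not an uncertain quantity or a random variable, so there is no prior), while the data is treated as random and can be sampled repeatedly. For example, draw $S$ datasets from the population to obtain $\{\mathcal{D}^{(s)}\}_{s=1}^{S}$, where each dataset contains $N$ data points, $\mathcal{D}^{(s)} = \{x_i^{(s)}\}_{i=1}^{N}$. Every data point follows the same fixed distribution: $x_i^{(s)} \sim p(\cdot | \theta^*)$ for all $i, s$.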

For each dataset $\mathcal{D}^{(s)}$, the estimator $\hat\theta(\cdot)$ produces an estimate (for example a mean or a variance). As $S \rightarrow \infty$, the distribution of the values $\{\hat\theta(\mathcal{D}^{(s)})\}$ is called the sampling distribution of the estimator $\hat\theta(\cdot)$.
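The definition can be made concrete with a small simulation. The sketch below is only an illustration under assumed settings that do not come from the book: the data distribution is taken to be a unit-variance Gaussian with mean `theta_star`, the estimator is the sample mean, and the helper name `sampling_distribution` is hypothetical.

```python
import numpy as np

def sampling_distribution(estimator, theta_star, N, S, rng):
    """Apply `estimator` to S datasets of size N drawn from N(theta_star, 1).

    The Gaussian data model is an assumption made for this illustration.
    """
    datasets = rng.normal(loc=theta_star, scale=1.0, size=(S, N))
    return np.array([estimator(d) for d in datasets])

rng = np.random.default_rng(0)
# Each entry is one estimate theta_hat(D^(s)); together they approximate
# the sampling distribution of the sample mean.
estimates = sampling_distribution(np.mean, theta_star=2.0, N=50, S=5000, rng=rng)
print(estimates.mean(), estimates.std())
```

For the sample mean of unit-variance data, the spread of `estimates` should be close to $1/\sqrt{N}$, which is a convenient sanity check on the simulation.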

6.2.1 Bootstrap

The sampling distribution is usually approximated with Monte Carlo. This technique is called the bootstrap, and it comes in a parametric and a nonparametric variant.

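Continuing with the notation of the previous section: if the true parameter $\theta^*$ were known, we could generate $S$ fake datasets from $p(\cdot | \theta^*)$, compute the estimate $\hat\theta^s = f(x_{1:N}^s)$ on each one, and use the empirical distribution of these values as an approximation to the sampling distribution. Since $\theta^*$ is unknown, the parametric bootstrap generates the datasets from $p(\cdot | \hat\theta(\mathcal{D}))$ instead, where $\hat\theta$ is an estimate such as the MLE computed from the observed data. The nonparametric bootstrap avoids the model altogether and draws each $x_i^s$ with replacement from the original data $\mathcal{D}$.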
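Both variants can be sketched in a few lines. The example below is a minimal illustration, assuming Bernoulli data whose MLE is the sample mean; the function names and the data-generating value `0.7` are hypothetical choices, not taken from the book.

```python
import numpy as np

def nonparametric_bootstrap(data, estimator, S, rng):
    """Resample the observed data with replacement S times."""
    N = len(data)
    idx = rng.integers(0, N, size=(S, N))  # S rows of N resampled indices
    return np.array([estimator(data[i]) for i in idx])

def parametric_bootstrap_bernoulli(data, estimator, S, rng):
    """Generate S fake datasets from Ber(theta_hat), theta_hat being the MLE."""
    theta_hat = data.mean()  # MLE of a Bernoulli parameter
    fake = rng.binomial(1, theta_hat, size=(S, len(data)))
    return np.array([estimator(d) for d in fake])

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=100)  # assumed observed dataset
boot_np = nonparametric_bootstrap(data, np.mean, S=2000, rng=rng)
boot_p = parametric_bootstrap_bernoulli(data, np.mean, S=2000, rng=rng)
print(boot_np.std(), boot_p.std())
```

For binary data the two variants sample from the same distribution, so their results should agree up to Monte Carlo noise. They differ for models where the fitted distribution is not simply the empirical one.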

6.2.2 Large sample theory for the MLE *

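As the sample size tends to infinity, the sampling distribution of the MLE tends (under certain conditions) to a Gaussian. The center of this Gaussian is the MLE $\hat\theta$, and its variance is inversely related to the curvature of the likelihood surface at its peak. Formally, the score function is defined as the gradient of the log-likelihood with respect to the parameter $\theta$, evaluated at $\hat\theta$,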

$\mathbf{s}(\hat{\boldsymbol\theta}) \triangleq \nabla \log p(\mathcal{D}|\boldsymbol\theta)|_{\hat{\boldsymbol\theta}}$

The observed information matrix is then defined as the negative gradient of the score function, that is, the negative Hessian of the log-likelihood,

$\mathbf{J}(\hat{\boldsymbol\theta}(\mathcal{D})) \triangleq -\nabla \mathbf{s}(\hat{\boldsymbol\theta}) = - \nabla_{\boldsymbol\theta}^2 \log p(\mathcal{D}|\boldsymbol\theta) |_{\hat{\boldsymbol\theta}}$

The Fisher information matrix is defined as the expected value of the observed information matrix,

$\mathbf{I}_N(\hat{\boldsymbol\theta}|\boldsymbol\theta^*) = \mathbb{E}_{\boldsymbol\theta^*}[\mathbf{J}(\hat{\boldsymbol\theta}|\mathcal{D})]$
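A one-dimensional case helps to check these definitions. The sketch below assumes a Bernoulli model with `n1` successes and `n0` failures (an illustrative choice), for which the log-likelihood is $N_1 \log\theta + N_0 \log(1-\theta)$ and the observed information at the MLE works out to $N / (\hat\theta(1-\hat\theta))$. It compares that closed form with a finite-difference approximation of the negative second derivative; the helper names and the counts are hypothetical.

```python
import numpy as np

def log_lik(theta, n1, n0):
    """Bernoulli log-likelihood for n1 successes and n0 failures."""
    return n1 * np.log(theta) + n0 * np.log(1.0 - theta)

def observed_information(theta, n1, n0, eps=1e-4):
    """Negative second derivative of the log-likelihood (central difference)."""
    second = (log_lik(theta + eps, n1, n0)
              - 2.0 * log_lik(theta, n1, n0)
              + log_lik(theta - eps, n1, n0)) / eps**2
    return -second

n1, n0 = 30, 70            # assumed counts, for illustration only
N = n1 + n0
theta_hat = n1 / N         # MLE of the Bernoulli parameter

J_numeric = observed_information(theta_hat, n1, n0)
J_analytic = N / (theta_hat * (1.0 - theta_hat))
print(J_numeric, J_analytic)
```

The two values should agree closely. A sharper peak in the log-likelihood (larger $N$) gives a larger observed information, which matches the curvature interpretation in the text.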

6.3 Frequentist decision theory

The previous chapter introduced the notion of an estimator or decision procedure $\delta: \mathcal{X} \rightarrow \mathcal{A}$. Building on it, we define the risk:

$R(\theta^*, \delta) \triangleq \mathbb{E}_{p(\tilde{\mathcal{D}}|\theta^*)}\left[L(\theta^*, \delta(\tilde{\mathcal{D}}))\right] = \int L(\theta^*, \delta(\tilde{\mathcal{D}})) \, p(\tilde{\mathcal{D}}|\theta^*) \, d\tilde{\mathcal{D}}$

However, this expression cannot be computed directly, because the true parameter $\theta^*$ is unknown, so the approaches below have been developed.
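The risk can only be evaluated when $\theta^*$ is fixed by hand, as in a simulation. The sketch below does exactly that: it assumes unit-variance Gaussian data with a chosen `theta_star`, uses squared-error loss, and approximates the expectation over datasets by Monte Carlo for two estimators (sample mean and sample median). The function name `monte_carlo_risk` and all numeric settings are illustrative assumptions.

```python
import numpy as np

def monte_carlo_risk(delta, loss, theta_star, N, S, rng):
    """Average loss of estimator `delta` over S datasets from N(theta_star, 1).

    theta_star is assumed known here, which is only possible in a simulation.
    """
    datasets = rng.normal(theta_star, 1.0, size=(S, N))
    losses = np.array([loss(theta_star, delta(d)) for d in datasets])
    return losses.mean()

def squared_loss(theta, action):
    return (theta - action) ** 2

rng = np.random.default_rng(0)
risk_mean = monte_carlo_risk(np.mean, squared_loss, theta_star=1.0, N=20, S=20000, rng=rng)
risk_median = monte_carlo_risk(np.median, squared_loss, theta_star=1.0, N=20, S=20000, rng=rng)
print(risk_mean, risk_median)
```

Under these assumptions the risk of the sample mean should be close to $1/N$, and the sample median should have a somewhat larger risk. With real data neither number is available, because the loop depends on `theta_star`.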