Chapter 9 (Classical Statistical Inference): Classical Parameter Estimation

These are reading notes for *Introduction to Probability*.

Distributions of Commonly Used Statistics

  • (1) Standard normal distribution $N(0,1)$:
    $$f(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}$$
  • (2) $\chi^2$ (chi-square) distribution: let $X_1,\dots,X_n$ be independent, each with the standard normal distribution $N(0,1)$. Then
    $$\chi^2=X_1^2+\dots+X_n^2$$
    follows the $\chi^2$ distribution with $n$ degrees of freedom, denoted $\chi^2(n)$, with PDF
    $$f(x)=\begin{cases}\dfrac{1}{2^{n/2}\,\Gamma(n/2)}\,x^{\frac{n}{2}-1}e^{-\frac{x}{2}}, & x>0,\\ 0, & \text{otherwise.}\end{cases}$$
    • Additivity of the $\chi^2$ distribution: if $X\sim\chi^2(n)$, $Y\sim\chi^2(m)$, and $X$ and $Y$ are independent, then
      $$X+Y\sim\chi^2(n+m)$$
    • If $X\sim\chi^2(n)$, then
      $$E[X]=n,\qquad \mathrm{var}(X)=2n$$
  • (3) $t$ distribution: let $X\sim N(0,1)$ and $Y\sim\chi^2(n)$, with $X$ and $Y$ independent. Then
    $$t=\frac{X}{\sqrt{Y/n}}$$
    follows the $t$ distribution with $n$ degrees of freedom, with PDF
    $$f(x)=\frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)}\left(1+\frac{x^2}{n}\right)^{-\frac{n+1}{2}}$$
    • It can be shown that $\lim_{n\to\infty}f(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}$, so for sufficiently large $n$ (say $n\geq 45$) the $t$ distribution with $n$ degrees of freedom can be approximated by the standard normal distribution.
  • (4) $F$ distribution: if $X\sim\chi^2(n)$, $Y\sim\chi^2(m)$, and $X$ and $Y$ are independent, then
    $$F=\frac{X/n}{Y/m}$$
    follows the $F$ distribution with degrees of freedom $(n,m)$, denoted $F(n,m)$.
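A quick simulation makes these constructions concrete. The sketch below (a minimal Python/NumPy/SciPy example; the sample size and degrees of freedom are arbitrary illustrative choices, not from the text) builds $\chi^2$, $t$, and $F$ samples directly from standard normal draws and compares them with the corresponding scipy distributions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 200_000          # number of simulated values (illustrative)
n, m = 5, 8          # degrees of freedom (illustrative)

# chi^2(n): sum of n squared independent N(0,1) variables
chi2 = (rng.standard_normal((N, n)) ** 2).sum(axis=1)
print(chi2.mean(), chi2.var())          # ~ n and ~ 2n

# t(n): N(0,1) divided by sqrt(chi^2(n)/n), with numerator and denominator independent
t = rng.standard_normal(N) / np.sqrt((rng.standard_normal((N, n)) ** 2).sum(axis=1) / n)

# F(n, m): ratio of two independent chi^2 variables, each divided by its degrees of freedom
f = ((rng.standard_normal((N, n)) ** 2).sum(axis=1) / n) / \
    ((rng.standard_normal((N, m)) ** 2).sum(axis=1) / m)

# compare empirical tail fractions with the theoretical 0.9 quantile of each distribution
for name, sample, dist in [("chi2", chi2, stats.chi2(n)),
                           ("t",    t,    stats.t(n)),
                           ("F",    f,    stats.f(n, m))]:
    print(name, (sample <= dist.ppf(0.9)).mean(), "~ 0.9")
```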

Sampling Distributions for a Normal Population

  • Let the population $X\sim N(\mu,\sigma^2)$, and let $X_1,\dots,X_n$ be a simple random sample from $X$, with sample mean $\bar X$ and sample variance $S^2$. Then
    $$\frac{\bar X-\mu}{\sigma/\sqrt n}\sim N(0,1),\qquad \frac{(n-1)S^2}{\sigma^2}\sim \chi^2(n-1)\ \text{(and $\bar X$ and $S^2$ are independent)},\qquad \frac{\bar X-\mu}{S/\sqrt n}\sim t(n-1)$$

The proof of the second result is beyond the scope of these notes.
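Although the proof is omitted, all three facts can be checked empirically. A minimal simulation sketch (illustrative parameter values, not from the text):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sigma, n, trials = 2.0, 3.0, 10, 100_000    # illustrative values

x = rng.normal(mu, sigma, size=(trials, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)                      # sample variance S^2 (divides by n-1)

z = (xbar - mu) / (sigma / np.sqrt(n))          # should be N(0, 1)
q = (n - 1) * s2 / sigma**2                     # should be chi^2(n-1)
t = (xbar - mu) / (np.sqrt(s2) / np.sqrt(n))    # should be t(n-1)

print(z.mean(), z.var())                                # ~ 0 and ~ 1
print(q.mean(), q.var())                                # ~ n-1 and ~ 2(n-1)
print((np.abs(t) <= stats.t(n - 1).ppf(0.975)).mean())  # ~ 0.95
print(np.corrcoef(xbar, s2)[0, 1])                      # ~ 0, consistent with independence of Xbar and S^2
```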

Classical Statistical Inference

  • In the preceding chapter, we developed the Bayesian approach to inference, where unknown parameters are modeled as random variables. In all cases we worked within a single, fully-specified probabilistic model, and we based most of our derivations and calculations on judicious application of Bayes’ rule.

  • By contrast, in the present chapter we adopt a fundamentally different philosophy: we view the unknown parameter $\theta$ as a deterministic (not random) but unknown quantity. The observation $X$ is random, and its distribution $p_X(x;\theta)$ [if $X$ is discrete] or $f_X(x;\theta)$ [if $X$ is continuous] depends on the value of $\theta$.
  • Thus, instead of working within a single probabilistic model, we will be dealing simultaneously with multiple candidate models, one model for each possible value of $\theta$.
  • In this context, a “good” hypothesis testing or estimation procedure will be one that possesses certain desirable properties under every candidate model, that is, for every possible value of $\theta$. In some cases, this may be considered a worst-case viewpoint: a procedure is not considered to fulfill our specifications unless it does so even for the worst possible value that $\theta$ can take (that is, only a procedure that still meets the requirements in the worst case is considered good).
    • For example, we may require that the expected value of the estimation error be zero, or that the estimation error be small with high probability, for all possible values of the unknown parameter.

  • Our notation will generally indicate the dependence of probabilities and expected values on $\theta$.
    • For example, we will denote by $E_\theta[h(X)]$ the expected value of a random variable $h(X)$ as a function of $\theta$. Similarly, we will use the notation $P_\theta(A)$ to denote the probability of an event $A$.
    • Note that this only indicates a functional dependence, not conditioning in the probabilistic sense.

Classical Parameter Estimation

Properties of Estimators

  • Given observations $X = (X_1,\dots,X_n)$, an estimator is a random variable of the form $\hat\Theta = g(X)$, for some function $g$.
  • Note that since the distribution of $X$ depends on $\theta$, the same is true for the distribution of $\hat\Theta$. We use the term estimate to refer to an actual realized value of $\hat\Theta$.

  • Sometimes, particularly when we are interested in the role of the number of observations $n$, we use the notation $\hat\Theta_n$ for an estimator. It is then also appropriate to view $\hat\Theta_n$ as a sequence of estimators (one for each value of $n$). The mean and variance of $\hat\Theta_n$ are denoted $E_\theta[\hat\Theta_n]$ and $\mathrm{var}_\theta(\hat\Theta_n)$, respectively. Both are numerical functions of $\theta$, but for simplicity, when the context is clear, we sometimes do not show this dependence.

Terminology Regarding Estimators

Let $\hat\Theta_n$ be an estimator of an unknown parameter $\theta$, that is, a function of $n$ observations $X_1,\dots,X_n$ whose distribution depends on $\theta$.

  • The estimation error, denoted by $\tilde\Theta_n$, is defined by $\tilde\Theta_n=\hat\Theta_n-\theta$.
  • The bias of the estimator, denoted by $b_\theta(\hat\Theta_n)$, is the expected value of the estimation error:
    $$b_\theta(\hat\Theta_n)=E_\theta[\hat\Theta_n]-\theta$$
  • The expected value, the variance, and the bias of $\hat\Theta_n$ depend on $\theta$, while the estimation error depends in addition on the observations $X_1,\dots,X_n$.
  • We call $\hat\Theta_n$ unbiased if $E_\theta[\hat\Theta_n]=\theta$ for every possible value of $\theta$.
  • We call $\hat\Theta_n$ asymptotically unbiased if $\lim_{n\to\infty}E_\theta[\hat\Theta_n]=\theta$ for every possible value of $\theta$.
  • We call $\hat\Theta_n$ consistent if the sequence $\hat\Theta_n$ converges to the true value of the parameter $\theta$, in probability, for every possible value of $\theta$.

  • Besides the bias $b_\theta(\hat\Theta_n)$, we are usually interested in the size of the estimation error. This is captured by the mean squared error $E_\theta[\tilde\Theta_n^2]$, which is related to the bias and the variance of $\hat\Theta_n$ according to the following formula:
    $$E_\theta[\tilde\Theta_n^2]=b^2_\theta(\hat\Theta_n)+\mathrm{var}_\theta(\hat\Theta_n)$$
  • This formula is important because in many statistical problems there is a tradeoff between the two terms on the right-hand side: often a reduction in the variance is accompanied by an increase in the bias. Of course, a good estimator is one that manages to keep both terms small.
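    • To see where this identity comes from, add and subtract $E_\theta[\hat\Theta_n]$ inside the square; the cross term has zero expectation, so
      $$E_\theta[\tilde\Theta_n^2]=E_\theta\Big[\big(\hat\Theta_n-E_\theta[\hat\Theta_n]+E_\theta[\hat\Theta_n]-\theta\big)^2\Big]=\mathrm{var}_\theta(\hat\Theta_n)+\big(E_\theta[\hat\Theta_n]-\theta\big)^2=\mathrm{var}_\theta(\hat\Theta_n)+b_\theta^2(\hat\Theta_n)$$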

Maximum Likelihood Estimation

This is a general method that bears similarity to MAP estimation.

  • Let the vector of observations $X = (X_1,\dots,X_n)$ be described by a joint PMF $p_X(x;\theta)$ whose form depends on an unknown (scalar or vector) parameter $\theta$. Suppose we observe a particular value $x = (x_1,\dots,x_n)$ of $X$. Then, a maximum likelihood (ML) estimate is a value of the parameter that maximizes the numerical function $p_X(x_1,\dots,x_n;\theta)$ over all $\theta$:
    $$\hat\theta_n=\arg\max_\theta p_X(x_1,\dots,x_n;\theta)$$
    For the case where $X$ is continuous,
    $$\hat\theta_n=\arg\max_\theta f_X(x_1,\dots,x_n;\theta)$$
  • We refer to $p_X(x;\theta)$ [or $f_X(x;\theta)$ if $X$ is continuous] as the likelihood function.

  • In many applications, the observations $X_i$ are assumed to be independent, in which case the likelihood function is of the form
    $$p_X(x_1,\dots,x_n;\theta)=\prod_{i=1}^n p_{X_i}(x_i;\theta)$$
    (for discrete $X_i$). In this case, it is often analytically or computationally convenient to maximize its logarithm, called the log-likelihood function,
    $$\log p_X(x_1,\dots,x_n;\theta)=\sum_{i=1}^n\log p_{X_i}(x_i;\theta),$$
    over $\theta$. When $X$ is continuous, there is a similar possibility, with PMFs replaced by PDFs: we maximize over $\theta$ the expression
    $$\log f_X(x_1,\dots,x_n;\theta)=\sum_{i=1}^n\log f_{X_i}(x_i;\theta)$$

  • Recall that in Bayesian MAP estimation, the estimate is chosen to maximize the expression $p_\Theta(\theta)p_{X|\Theta}(x\mid\theta)$ over all $\theta$, where $p_\Theta(\theta)$ is the prior PMF of an unknown discrete parameter $\theta$. Thus, if we view $p_X(x;\theta)$ as a conditional PMF, we may interpret ML estimation as MAP estimation with a flat prior, i.e., a prior which is the same for all $\theta$, indicating the absence of any useful prior knowledge.
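As a minimal numerical sketch of this recipe (using i.i.d. Bernoulli observations as an illustrative model, anticipating the Bernoulli example at the end of this section), the code below maximizes the log-likelihood over a grid of candidate $\theta$ values and confirms that the maximizer agrees with the closed-form ML estimate, which for Bernoulli data is the sample mean.

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true = 0.3                              # unknown in practice; used here only to generate data
x = rng.binomial(1, theta_true, size=200)     # i.i.d. Bernoulli observations

def log_likelihood(theta, x):
    # sum over i of log p_{X_i}(x_i; theta) for Bernoulli(theta) observations
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

grid = np.linspace(0.001, 0.999, 999)
theta_ml = grid[np.argmax([log_likelihood(t, x) for t in grid])]

print(theta_ml, x.mean())                     # grid maximizer vs. closed-form ML estimate (sample mean)
```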

Example 9.1.

  • Let us revisit Example 8.2, in which Juliet is always late by an amount $X$ that is uniformly distributed over the interval $[0,\theta]$, and $\theta$ is an unknown parameter. In that example, we used a random variable $\Theta$ with flat prior PDF $f_\Theta(\theta)$ (uniform over the interval $[0,1]$) to model the parameter, and we showed that the MAP estimate is the value $x$ of $X$.
  • In the classical context of this section, there is no prior and $\theta$ is treated as a constant, but the ML estimate is again $\hat\theta=x$. The resulting estimator is $\hat\Theta=X$.

Example 9.4. Estimating the Mean and Variance of a Normal.

  • Consider the problem of estimating the mean $\mu$ and variance $v$ of a normal distribution using $n$ independent observations $X_1,\dots,X_n$. The parameter vector here is $\theta = (\mu, v)$. The corresponding likelihood function is
    $$f_X(x;\mu,v)=\prod_{i=1}^nf_{X_i}(x_i;\mu,v)=\prod_{i=1}^n\frac{1}{\sqrt{2\pi v}}e^{-(x_i-\mu)^2/2v}=\frac{1}{(2\pi v)^{n/2}}\prod_{i=1}^ne^{-(x_i-\mu)^2/2v}$$
    After some calculation it can be written as
    $$f_X(x;\mu,v)=\frac{1}{(2\pi v)^{n/2}}\cdot\exp\Big\{-\frac{ns_n^2}{2v}\Big\}\cdot\exp\Big\{-\frac{n(m_n-\mu)^2}{2v}\Big\},$$
    where $m_n$ is the realized value of the random variable
    $$M_n=\frac{1}{n}\sum_{i=1}^nX_i$$
    and $s_n^2$ is the realized value of the random variable
    $$\overline S_n^2=\frac{1}{n}\sum_{i=1}^n(X_i-M_n)^2$$
    • To verify this, write, for $i = 1,\dots,n$,
      $$(x_i-\mu)^2=(x_i-m_n+m_n-\mu)^2=(x_i-m_n)^2+(m_n-\mu)^2+2(x_i-m_n)(m_n-\mu),$$
      sum over $i$, and note that
      $$\sum_{i=1}^n(x_i-m_n)(m_n-\mu) = (m_n-\mu)\sum_{i=1}^n(x_i-m_n)= 0$$
  • The log-likelihood function is
    $$\log f_X(x;\mu,v)=-\frac{n}{2}\log(2\pi)-\frac{n}{2}\log(v)-\frac{ns_n^2}{2v}-\frac{n(m_n-\mu)^2}{2v}$$
    Setting the derivatives of this function with respect to $\mu$ and $v$ to zero, we obtain the estimate and estimator, respectively,
    $$\hat\theta_n=(m_n,s_n^2),\qquad \hat\Theta_n=(M_n,\overline S_n^2)$$
    Note that $M_n$ is the sample mean, while $\overline S_n^2$ may be viewed as a “sample variance.” As will be shown shortly, $E_\theta[\overline S_n^2]$ converges to $v$ as $n$ increases, so that $\overline S_n^2$ is asymptotically unbiased. Using also the weak law of large numbers, it can be shown that $M_n$ and $\overline S_n^2$ are consistent estimators of $\mu$ and $v$, respectively.
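The closed-form estimates can also be checked numerically. A sketch with illustrative data follows: a general-purpose optimizer applied to the negative log-likelihood should recover $(m_n, s_n^2)$, i.e., the sample mean and the biased sample variance.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = rng.normal(5.0, 2.0, size=500)            # illustrative data; theta = (mu, v) treated as unknown

def neg_log_likelihood(params, x):
    mu, v = params
    n = len(x)
    # negative of the log-likelihood derived above (constant term kept for completeness)
    return 0.5 * n * np.log(2 * np.pi * v) + np.sum((x - mu) ** 2) / (2 * v)

res = minimize(neg_log_likelihood, x0=[0.0, 1.0], args=(x,),
               bounds=[(None, None), (1e-9, None)])
print(res.x)                                  # numerically estimated (mu, v)
print(x.mean(), x.var(ddof=0))                # closed-form m_n and biased sample variance s_n^2
```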

  • Maximum likelihood estimation has some appealing properties.
    • For example, it obeys the invariance principle: if $\hat\Theta_n$ is the ML estimate of $\theta$, then for any one-to-one function $h$ of $\theta$, the ML estimate of the parameter $\zeta=h(\theta)$ is $h(\hat\Theta_n)$.
    • Also, when the observations are i.i.d. (independent, identically distributed), and under some mild additional assumptions, it can be shown that the ML estimator is consistent.
    • Another interesting property is that when $\theta$ is a scalar parameter, then under some mild conditions, the ML estimator has an asymptotic normality property. In particular, it can be shown that the distribution of $(\hat\Theta_n-\theta)/\sigma(\hat\Theta_n)$, where $\sigma^2(\hat\Theta_n)$ is the variance of $\hat\Theta_n$, approaches the standard normal distribution. Thus, if we are also able to estimate $\sigma(\hat\Theta_n)$, we can use it to derive an error variance estimate based on a normal approximation. When $\theta$ is a vector parameter, a similar statement applies to each of its components.

Estimation of the Mean and Variance of a Random Variable

  • Suppose that the observations $X_1,\dots,X_n$ are i.i.d., with an unknown common mean $\theta$. The most natural estimator of $\theta$ is the sample mean:
    $$M_n=\frac{X_1+\dots+X_n}{n}$$
    This estimator is unbiased. Its mean squared error is equal to its variance, which is $v/n$, where $v$ is the common variance of the $X_i$. Furthermore, by the weak law of large numbers, this estimator converges to $\theta$ in probability, and is therefore consistent.
  • Suppose that we are interested in an estimator of the variance $v$. A natural one is
    $$\overline S_n^2=\frac{1}{n}\sum_{i=1}^n(X_i-M_n)^2,$$
    which coincides with the ML estimator derived in Example 9.4 under a normality assumption. We have
    $$\begin{aligned}E_{(\theta,v)}[\overline S_n^2]&=\frac{1}{n}E_{(\theta,v)}\Big[\sum_{i=1}^nX_i^2-2M_n\sum_{i=1}^nX_i+nM_n^2\Big]\\&=E_{(\theta,v)}\Big[\frac{1}{n}\sum_{i=1}^nX_i^2-2M_n^2+M_n^2\Big]\\&=E_{(\theta,v)}\Big[\frac{1}{n}\sum_{i=1}^nX_i^2-M_n^2\Big]\\&=E_{(\theta,v)}[X_1^2]-E_{(\theta,v)}[M_n^2]\\&=(\theta^2+v)-\Big(\theta^2+\frac{v}{n}\Big)\\&=\frac{n-1}{n}v\end{aligned}$$
    Thus, $\overline S_n^2$ is not an unbiased estimator of $v$, although it is asymptotically unbiased. We can obtain an unbiased variance estimator after suitable scaling. This is the estimator
    $$\hat S_n^2=\frac{n}{n-1}\overline S_n^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-M_n)^2$$
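A short simulation (with arbitrary illustrative parameters and a deliberately small $n$ to make the effect visible) shows the $(n-1)/n$ bias factor of $\overline S_n^2$ and the unbiasedness of $\hat S_n^2$:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, v, n, trials = 1.0, 4.0, 5, 200_000    # illustrative values

x = rng.normal(theta, np.sqrt(v), size=(trials, n))
s2_biased   = x.var(axis=1, ddof=0)           # (1/n)     * sum (X_i - M_n)^2
s2_unbiased = x.var(axis=1, ddof=1)           # (1/(n-1)) * sum (X_i - M_n)^2

print(s2_biased.mean(),   (n - 1) / n * v)    # ~ ((n-1)/n) * v
print(s2_unbiased.mean(), v)                  # ~ v
```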

Confidence Intervals

  • Consider an estimator $\hat\Theta$ of an unknown parameter $\theta$. Besides the numerical value provided by an estimate, we are often interested in constructing a so-called confidence interval. Roughly speaking, this is an interval that contains $\theta$ with a certain high probability, for every possible value of $\theta$.
  • For a precise definition, let us first fix a desired confidence level, $1-\alpha$, where $\alpha$ is typically a small number. We then replace the point estimator $\hat\Theta_n$ by a lower estimator $\hat\Theta_n^-$ and an upper estimator $\hat\Theta_n^+$, designed so that $\hat\Theta_n^-\leq\hat\Theta_n^+$ and
    $$P_\theta\big(\hat\Theta_n^-\leq\theta\leq\hat\Theta_n^+\big)\geq1-\alpha$$
    for every possible value of $\theta$. Note that, similar to estimators, $\hat\Theta_n^-$ and $\hat\Theta_n^+$ are functions of the observations, and hence random variables whose distributions depend on $\theta$. We call $[\hat\Theta_n^-,\hat\Theta_n^+]$ a $\boldsymbol{1-\alpha}$ confidence interval.

Example 9.6.

  • Suppose that the observations $X_i$ are i.i.d. normal, with unknown mean $\theta$ and known variance $v$. Then the sample mean estimator
    $$\hat\Theta_n=\frac{X_1+\dots+X_n}{n}$$
    is normal, with mean $\theta$ and variance $v/n$.
  • Let $\alpha = 0.05$. Using the CDF $\Phi(z)$ of the standard normal (available in the normal tables), we have $\Phi(1.96) = 0.975 = 1-\alpha/2$, and we obtain
    $$P_\theta\bigg(\frac{|\hat\Theta_n-\theta|}{\sqrt{v/n}}\leq1.96\bigg)=1-\alpha=0.95$$
    We can rewrite this statement in the form
    $$P_\theta\bigg(\hat\Theta_n-1.96\sqrt{\frac{v}{n}}\leq\theta\leq\hat\Theta_n+1.96\sqrt{\frac{v}{n}}\bigg)=0.95,$$
    which implies that
    $$\bigg[\hat\Theta_n-1.96\sqrt{\frac{v}{n}},\ \hat\Theta_n+1.96\sqrt{\frac{v}{n}}\bigg]$$
    is a 95% confidence interval.
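A minimal sketch of this construction (illustrative values of $\theta$, $v$, and $n$; the variance $v$ is assumed known, as in the example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
theta_true, v, n = 10.0, 4.0, 25               # illustrative; theta_true is unknown in practice
x = rng.normal(theta_true, np.sqrt(v), size=n)

theta_hat = x.mean()                           # sample mean estimator
alpha = 0.05
z = stats.norm.ppf(1 - alpha / 2)              # ~ 1.96, since Phi(1.96) ~ 0.975

print(theta_hat - z * np.sqrt(v / n), theta_hat + z * np.sqrt(v / n))   # 95% confidence interval
```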

Out of a variety of possible confidence intervals, one with the smallest possible width is usually desirable.


  • In the preceding example, we may be tempted to describe the concept of a 95% confidence interval by a statement such as “the true parameter lies in the confidence interval with probability 0.95.” Such statements, however, can be ambiguous. For example, suppose that after the observations are obtained, the confidence interval turns out to be $[-2.3, 4.1]$. We cannot then claim that $\theta$ lies in $[-2.3, 4.1]$ with probability 0.95, because the latter statement does not involve any random variables; after all, in the classical approach, $\theta$ is a constant.
  • For a concrete interpretation, suppose that $\theta$ is fixed. We construct a confidence interval many times, using the same statistical procedure, i.e., each time, we obtain an independent collection of $n$ observations and construct the corresponding 95% confidence interval. We then expect that about 95% of these confidence intervals will include $\theta$. This should be true regardless of the value of $\theta$.
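This repeated-sampling interpretation is easy to check by simulation. A sketch (fixed illustrative $\theta$ and known variance, as in Example 9.6) constructs many independent 95% intervals and reports the fraction containing $\theta$, which should come out close to 0.95:

```python
import numpy as np

rng = np.random.default_rng(7)
theta, v, n, trials = 3.0, 2.0, 20, 50_000     # illustrative values; v assumed known

x = rng.normal(theta, np.sqrt(v), size=(trials, n))
theta_hat = x.mean(axis=1)
half = 1.96 * np.sqrt(v / n)                   # half-width of each 95% interval

covered = (theta_hat - half <= theta) & (theta <= theta_hat + half)
print(covered.mean())                          # fraction of intervals that contain theta, ~ 0.95
```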

  • The construction of confidence intervals is sometimes hard. Fortunately, for many important models, $\hat\Theta_n-\theta$ is asymptotically normal and asymptotically unbiased. By this we mean that the CDF of the random variable
    $$\frac{\hat\Theta_n-\theta}{\sqrt{\mathrm{var}_\theta(\hat\Theta_n)}}$$
    approaches the standard normal CDF as $n$ increases, for every value of $\theta$. We may then proceed exactly as in Example 9.6, provided that $\mathrm{var}_\theta(\hat\Theta_n)$ is known or can be approximated.

Confidence Intervals Based on Estimator Variance Approximations

  • Suppose that the observations $X_i$ are i.i.d. with mean $\theta$ and variance $v$, both unknown. We may estimate $\theta$ with the sample mean
    $$\hat\Theta_n=\frac{X_1+\dots+X_n}{n}$$
    and estimate $v$ with the unbiased estimator
    $$\hat S_n^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-M_n)^2$$
  • In particular, we may estimate the variance $v/n$ of the sample mean by $\hat S_n^2/n$. Then, for a given $\alpha$, we may use these estimates and the central limit theorem to construct an (approximate) $1-\alpha$ confidence interval. This is the interval
    $$\bigg[\hat\Theta_n-z\frac{\hat S_n}{\sqrt n},\ \hat\Theta_n+z\frac{\hat S_n}{\sqrt n}\bigg],$$
    where $z$ is obtained from the relation
    $$\Phi(z)=1-\frac{\alpha}{2}$$
    and the normal tables.

  • Note that in this approach, there are two different approximations in effect. First, we are treating $\hat\Theta_n$ as if it were a normal random variable; second, we are replacing the true variance $v/n$ of $\hat\Theta_n$ by its estimate $\hat S_n^2/n$.
  • Even in the special case where the $X_i$ are normal random variables, the confidence interval produced by the preceding procedure is still approximate. The reason is that $\hat S_n^2$ is only an approximation to the true variance $v$, and the random variable
    $$T_n=\frac{\sqrt n(\hat\Theta_n-\theta)}{\hat S_n}$$
    is not normal. However, for normal $X_i$, it can be shown that the PDF of $T_n$ does not depend on $\theta$ and $v$, and can be computed explicitly. It is called the $t$-distribution with $n-1$ degrees of freedom. Like the standard normal PDF, it is symmetric and bell-shaped, but it is a little more spread out and has heavier tails. The probabilities of various intervals of interest are available in tables, similar to the normal tables.
  • Thus, when the $X_i$ are normal (or nearly normal) and $n$ is relatively small, a more accurate confidence interval is of the form
    $$\bigg[\hat\Theta_n-z\frac{\hat S_n}{\sqrt n},\ \hat\Theta_n+z\frac{\hat S_n}{\sqrt n}\bigg],$$
    where $z$ is now obtained from the relation
    $$\Psi_{n-1}(z)=1-\frac{\alpha}{2},$$
    and $\Psi_{n-1}$ denotes the CDF of the $t$-distribution with $n-1$ degrees of freedom.
  • On the other hand, when $n$ is moderately large (e.g., $n\geq50$), the $t$-distribution is very close to the normal distribution, and the normal tables can be used.
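Putting the pieces together, here is a sketch of the $t$-based interval for raw observations assumed to be i.i.d. and (nearly) normal, with both mean and variance unknown; the helper name and the sample data are illustrative, not from the text:

```python
import numpy as np
from scipy import stats

def t_confidence_interval(x, alpha=0.05):
    """Approximate 1 - alpha confidence interval for the mean of i.i.d. (near-)normal data."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta_hat = x.mean()
    s_hat = x.std(ddof=1)                      # square root of the unbiased variance estimate
    z = stats.t(df=n - 1).ppf(1 - alpha / 2)   # Psi_{n-1}(z) = 1 - alpha/2
    half = z * s_hat / np.sqrt(n)
    return theta_hat - half, theta_hat + half

# illustrative usage with made-up data
print(t_confidence_interval([1.2, 0.9, 1.1, 1.4, 1.0, 0.8, 1.3, 1.1]))
```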

Example 9.7.

  • The weight of an object is measured eight times using an electronic scale that reports the true weight plus a random error that is normally distributed with zero mean and unknown variance. Assume that the errors in the observations are independent. The following results are obtained:
    (The table of the eight measured values is not reproduced here.)
  • We compute a 95% confidence interval ($\alpha = 0.05$) using the $t$-distribution. The value of the sample mean $\hat\Theta_n$ is 0.5747, and $\hat S_n/\sqrt n$ is 0.0182. From the $t$-distribution tables, we obtain $1-\Psi_7(2.365) = 0.025 = \alpha/2$, so that
    $$\bigg[\hat\Theta_n-z\frac{\hat S_n}{\sqrt n},\ \hat\Theta_n+z\frac{\hat S_n}{\sqrt n}\bigg]=[0.531,\,0.618]$$
    is a 95% confidence interval.
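The interval can be reproduced from the summary statistics quoted above (the raw measurements themselves are not repeated here); small discrepancies are due to rounding:

```python
from scipy import stats

n = 8
theta_hat = 0.5747                       # sample mean reported in Example 9.7
se = 0.0182                              # reported value of S_hat_n / sqrt(n)

z = stats.t(df=n - 1).ppf(0.975)         # ~ 2.365
print(z)
print(theta_hat - z * se, theta_hat + z * se)   # roughly [0.531, 0.618], up to rounding
```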

  • The approximate confidence intervals constructed so far relied on the particular estimator $\hat S_n^2$ of the unknown variance $v$. However, different estimators or approximations of the variance are possible.
    • For example, suppose that the observations $X_1,\dots,X_n$ are i.i.d. Bernoulli with unknown mean $\theta$ and variance $v=\theta(1-\theta)$. Then, instead of $\hat S_n^2$, the variance could be approximated by $\hat\Theta(1-\hat\Theta)$. Another possibility is to just observe that $\theta(1-\theta)\leq1/4$ for all $\theta\in[0,1]$, and use $1/4$ as a conservative estimate of the variance.
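For the Bernoulli case, the two variance approximations mentioned above lead to intervals like the following sketch (illustrative data; the second interval uses the conservative bound $\theta(1-\theta)\leq 1/4$):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.binomial(1, 0.3, size=400)        # i.i.d. Bernoulli observations (illustrative)

n = len(x)
theta_hat = x.mean()
alpha = 0.05
z = stats.norm.ppf(1 - alpha / 2)

# variance approximated by theta_hat * (1 - theta_hat)
half_plugin = z * np.sqrt(theta_hat * (1 - theta_hat) / n)
# conservative approximation: theta * (1 - theta) <= 1/4 for all theta in [0, 1]
half_conservative = z * np.sqrt(0.25 / n)

print(theta_hat - half_plugin,       theta_hat + half_plugin)
print(theta_hat - half_conservative, theta_hat + half_conservative)
```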